Approach to BigData

(Step by step Approach to opening up the mysteries of a database:

Select All Tables & times of update
All Fields

Compile all existing / production SQL & StoredProcedures & Scripts
& dump in SQLanalyse showing:
Databases & External columns accessed
JOINS used
New Columns/fields created

Using SQLAnalysis to Identify key/essential tables & columns.
   Compile all SQL scripts, Views & StoredProcedures & insert in to this sheet to extract :
   key JOINs
group columns together, based on their values . group columns from the various tables in order of importance of the group
   using MapFields
   using DistinctValues to spot breaks in any assumption ( the distincts should be similar)
Using MapChk to discover hard JOINS to other datasets
Using Compare List for JOINS to spot likely joins
Using the Muppix toolkit to extract all valid database columns from existing code
Using Schema Entity Explorer to navigate across tables and report key fields , for any inconsistencies

Take all proven JOINS and use SQL Find other JOIN Fields in all Tables & find any other tables that could be JOINED using the same JOIN fields

Find all tables that have been updated in last month/week/day

Find the lowest-level records , ie a Loan, or a Flightbooking or Trades, then find all types of records that are more higher / grouped level.
Organise the most important/lowestlevel JOINS

Using BRCheck (Business Rule Check) generate SQL to identify the data ranges of each column, distinct values & order of date columns and in so doing generate a concentrated overview of the real values if each table

Using SchemaIRB EntityJoin, generate SQL code to easily navigate between tables
dump the output into SQLdata, and see in Mydata the key columns all grouped together, for comparison

Identify JOINS on lowest level using existing production SQL/Views/StoredProcedures in SQLAnalysis
list the production JOINS first

Some JOINS need non-standard code , ie select only the latest record

Identify more general JOINS using MAPChk . Using MapChk to discover hard joins between multiple tables. List these lower down.

After collating all the databases & JOINs , use SchemaIRB to Auto Join multiple tables Connect Joins accross Autogenerate SQL

Seeing hard-to-spot for differences in output:
Using Compare 2 spreadsheets :
cell by cell or
By using a common Index/JOIN column identical column names
( so the lines and the columns of each datadump can be in any /random order
Must specify IndexColumnName common to both datadumps)

Compare each line with line above for differences
Sometimes a line with many columns may have subtle differences in the line above , that are hard to spot

Exploring JOINS to other datasets
Use AnalysisSSIS to extract all dataabse fields used in a SSIS package
Use SQL Analysis
Find similar
Auto generate SQL code to JOIN multiple SQL tables

Muppix provides innovative solutions and Training to make sense of large scale data.
Backed by years of industry experience, the Muppix Team have developed a Free Data Science Toolkit to extract and analyse multi-structured information from diverse data sources

Company

Training

Professional Services

Get Started

Muppix provides innovative solutions and Training to make sense of large scale data.Backed by years of industry experience, the Muppix Team have developed a Free Data Science Toolkit to extract and analyse multi-structured information from diverse data sources

Company

Training

Professional Services

Get Started

Muppix provides innovative solutions and Training to make sense of large scale data.
Backed by years of industry experience, the Muppix Team have developed a Free Data Science Toolkit to extract and analyse multi-structured information from diverse data sources