The task of “data profiling”—assessing the overall content and quality of a data set—is a core aspect of the analytic experience. Traditionally, profiling was a fairly cut-and-dried task: load the raw numbers into a stat package, run some basic descriptive statistics, and report the output in a summary file or perhaps a simple data visualization. However, data volumes can be so large today that traditional tools and methods for computing descriptive statistics become intractable; even with scalable infrastructure like Hadoop, aggressive optimization and statistical approximation techniques must be used. In this talk Sean will cover technical challenges in keeping data profiling agile in the Big Data era. He will discuss both research results and real-world best practices used by analysts in the field, including methods for sampling, summarizing and sketching data, and the pros and cons of using these various approaches.
Sean is Trifacta’s Chief Technical Officer. He completed his Ph.D. at Stanford University, where his research focused on user interfaces for database systems. At Stanford, Sean led development of new tools for data transformation and discovery, such as Data Wrangler. He previously worked as a data analyst at Citadel Investment Group.
Solving DEBS Grand Challenge with WSO2 CEP - Srinath Perera
The DEBS Grand Challenge is an annual event in which different event-based systems compete to solve a real-world problem. The 2014 challenge is to demonstrate scalable real-time analytics using high-volume sensor data collected from smart plugs over a one-and-a-half-month period. This paper aims to show how a general-purpose, commercially available event-based system - the WSO2 Complex Event Processor (WSO2 CEP) - was used to solve this problem. We achieved 300K TPS with one node and nearly 1 million TPS with 4 nodes. In addition, we explore areas where we created extensions to the WSO2 CEP engine to better solve the challenge.
Presentation by Peter Fontaine, CBO's Assistant Director for Budget Analysis, to a Global Network of Parliamentary Budget Offices Community Meeting Sponsored by the World Bank Institute
Vendor-neutral presentation about the common functionality provided by data profiling tools, which can help automate some of the work needed to begin your preliminary data analysis.
Data visualization has enabled us to compress data and express them visually in many interesting new ways. It is often said that we are trying to tell stories through them. Is that really the case? How can we ensure that the audience is able to retain, recall and retell our data-driven stories?
Using examples and videos from different storytelling mediums, I walk through why stories are important and what we can learn from these mediums about how stories work. I then detail a framework built on the See the data | Show the Visual | Tell the Story | Engage the Audience paradigm to convert the data into a data-visual-story.
This slide deck was used at the Bangalore Meetup - Crafting Visual Stories with Data - in March 2014 at InMobi's Bangalore office.
Data profiling comprises a broad range of methods to efficiently analyze a given data set. In a typical scenario, which mirrors the capabilities of commercial data profiling tools, tables of a relational database are scanned to derive metadata, such as data types and value patterns, completeness and uniqueness of columns, keys and foreign keys, and occasionally functional dependencies and association rules. Individual research projects have proposed several additional profiling tasks, such as the discovery of inclusion dependencies or conditional functional dependencies.
Data profiling deserves a fresh look for several reasons: First, the area itself is neither established nor defined in any principled way, despite significant research activity on individual parts in the past. Second, current data profiling techniques hardly scale beyond what can only be called small data. Finally, more and more data beyond traditional relational databases are being created and beg to be profiled. The talk proposes new research directions and challenges, including interactive and incremental profiling and profiling heterogeneous and non-relational data.
Speaker: Felix Naumann studied mathematics, economics, and computer science at the Technical University of Berlin. After receiving his diploma (MA) in 1997 he joined the graduate school "Distributed Information Systems" at Humboldt University of Berlin. He completed his PhD thesis on "Quality-driven Query Answering" in 2000. In 2001 and 2002 he worked at the IBM Almaden Research Center on topics around data integration. From 2003 to 2006 he was assistant professor for information integration at Humboldt University of Berlin. Since then he has held the chair for information systems at the Hasso Plattner Institute at the University of Potsdam in Germany.
A Production Quality Sketching Library for the Analysis of Big Data - Databricks
In the analysis of big data there are often problem queries that don’t scale because they require huge compute resources to generate exact results, or don’t parallelize well.
Practical deep learning for computer vision - Eran Shlomo
This is the presentation given at TLV DLD 2017. In it, we walk through the planning and implementation of a deep learning solution for image recognition, with a focus on the data.
It is based on the work we do at dataloop.ai and with its customers.
With datasets becoming larger, new visualization techniques are needed. One problem is that the number of data points is so large that a scatter plot becomes cluttered. Another problem is that with over a billion objects, only a few CPU cycles are available per object if one wants to process them within a second, making traditional methods not viable.
I will show that it is possible to visualize a billion objects in about 1 second on a modern desktop computer using memory mapping of HDF5 files together with a simple binning algorithm in Python. Density fields in 1, 2, and 3D are computed, and statistics of any properties of the objects, or any mathematical operation on them, can be done on the fly. This enables efficient interactive exploration of large datasets, making science exploration feasible.
This idea is implemented in a Python library called vaex, which integrates well into the Jupyter/NumPy/Astropy stack. Built on top of this is the vaex application, which allows for interactive exploration and visualization.
The motivation for vaex is the upcoming astronomical catalogue of the Gaia satellite, which will contain properties of over a billion stars. However, vaex can also be used on N-body simulations, other catalogues or any tabular data.
https://www.astro.rug.nl/~breddels/vaex
https://github.com/maartenbreddels/vaex
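The core of this approach is simple enough to sketch in a few lines of NumPy (an illustrative sketch, not vaex's actual code; the file names and grid size are made up): memory-map the columns so the OS pages data in on demand, then bin the points into a density grid.

import numpy as np
import matplotlib.pyplot as plt

# Memory-map two columns of a large dataset (hypothetical file names);
# np.memmap reads pages on demand instead of loading everything into RAM.
x = np.memmap("x.float64", dtype=np.float64, mode="r")
y = np.memmap("y.float64", dtype=np.float64, mode="r")

# Bin the points onto a 256x256 grid: each cell counts the objects that
# fall into it, turning a billion scatter points into a density field.
density, xedges, yedges = np.histogram2d(x, y, bins=256)

# Plot log-density so dense regions don't wash out sparse ones.
plt.imshow(np.log1p(density.T), origin="lower", cmap="viridis")
plt.show()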
From USENIX LISA 2010, San Jose.
Visualizations that include heat maps can be an effective way to present performance data: I/O latency, resource utilization, and more. Patterns can emerge that would be difficult to notice from columns of numbers or line graphs, revealing previously unknown behavior. These visualizations are used in a product as a replacement for traditional metrics such as %CPU and allow end users to identify many more issues much more easily (some issues are nearly impossible to identify with tools such as vmstat(1)). This talk covers what has been learned, crazy heat map discoveries, and thoughts for future applications beyond performance analysis.
Make Life Suck Less (Building Scalable Systems) - guest0f8e278
This presentation was given at LinkedIn. It is a collection of guidelines and wisdom for re-thinking how we do engineering for massively scalable systems. Useful for anyone who cares about Big Data, Distributed Computing, Hadoop, and more.
Nearest neighbor models are conceptually just about the simplest kind of model possible. The problem is that they generally aren’t feasible to apply. Or at least, they weren’t feasible until the advent of Big Data techniques. These slides will describe some of the techniques used in the knn project to reduce thousand-year computations to a few hours. The knn project uses the Mahout math library and Hadoop to speed up these enormous computations to the point that they can be usefully applied to real problems. These same techniques can also be used to do real-time model scoring.
This slide deck explains how best to deal with state in scalable systems, i.e., pushing it to the system boundaries (client, data store) and avoiding state in between.
It then picks two scenarios - one in the frontend part and one in the backend part of a system - and shows concrete techniques for dealing with them.
The frontend part examines how to deal with the session state of servlet containers in scalable scenarios and introduces the concept of a shared session cache layer. An example implementation using Redis is also shown.
The backend part examines how to deal with potential data inconsistencies that can occur if maximum availability of the data store is required and eventual consistency is used. The usual way is to resolve inconsistencies manually by implementing business-specific logic or - even worse - asking the user to resolve them. A purely technical solution called CRDTs (Conflict-free Replicated Data Types) is then shown. CRDTs, based on sound mathematical concepts, are self-stabilizing data structures that offer a generic way to resolve inconsistencies in an eventually consistent data store. Besides some theory, a few examples are shown to give a feeling for how CRDTs behave in practice.
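To make the CRDT idea concrete, here is a minimal Python sketch of one of the simplest CRDTs, a grow-only counter (G-Counter); this is a generic illustration with made-up names, not code from the deck.

class GCounter:
    """Grow-only counter CRDT: each replica increments only its own slot,
    so concurrent updates never conflict."""
    def __init__(self, replica_id):
        self.replica_id = replica_id
        self.counts = {}  # replica_id -> count
    def increment(self, n=1):
        self.counts[self.replica_id] = self.counts.get(self.replica_id, 0) + n
    def value(self):
        return sum(self.counts.values())
    def merge(self, other):
        # Element-wise max is commutative, associative and idempotent,
        # so replicas converge regardless of the order messages arrive in.
        for rid, cnt in other.counts.items():
            self.counts[rid] = max(self.counts.get(rid, 0), cnt)

# Two replicas update independently, then reconcile:
a, b = GCounter("a"), GCounter("b")
a.increment(3); b.increment(2)
a.merge(b); b.merge(a)
assert a.value() == b.value() == 5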
Approximate "Now" is Better Than Accurate "Later"NUS-ISS
How does Twitter track the top trending topics?
How does Amazon keep track of the top-selling items for the day?
How many cabs have been booked this month using your App?
Is the password that a new user is choosing a common/compromised password?
Modern web-scale systems process billions of transactions and generate terabytes of data every single day. To find answers to questions against this data, one would initiate a multi-minute query against a NoSQL datastore or kick off a batch job written in a distributed processing framework such as Spark or Flink. However, these jobs are throughput-heavy and not suited for real-time low-latency queries. Yet you and your customers would like to have all this information "right now".
At the end of this talk, you'll realize that you can power these low-latency queries with an incredibly low memory footprint "IF" you are willing to accept answers that are, say, 96-99% accurate. This talk introduces some of the go-to probabilistic data structures used by organisations with large amounts of data - specifically the Bloom filter, Count-Min Sketch and HyperLogLog.
Speaker: Oscar Westra van Holthe - Kind
Genre & level: Backend, Medior
Data congestion slows down even the fastest database. Want to avoid getting your data jammed? I’ll show you how you can turn your database into an information highway. Get your recipes for query performance improvements here.
Data Wrangling on Hadoop - Olivier De Garrigues, Trifacta - huguk
As Hadoop became mainstream, the need to simplify and speed up analytics processes grew rapidly. Data wrangling emerged as a necessary step in any analytical pipeline, and is often considered to be its crux, taking as much as 80% of an analyst's time. In this presentation we will discuss how data wrangling solutions can be leveraged to streamline, strengthen and improve data analytics initiatives on Hadoop, including use cases from Trifacta customers.
Bio: Olivier is EMEA Solutions Lead at Trifacta. He has 7 years of experience in analytics, with prior roles as technical lead for business analytics at Splunk and quantitative analyst at Accenture and Aon.
Stephen Taylor is the community manager for Ether Camp. They provide an analysis tool for the Ethereum blockchain, 'Block Explorer', and also an 'Integrated Development Environment' (IDE) that empowers developers to build, test and deploy applications in a sandbox environment. This November they are launching their second annual hackathon, hack.ether.camp, which aims to deliver a more sustained approach to the hackathon ideology by utilising blockchain technology.
Similar to Sean Kandel - Data profiling: Assessing the overall content and quality of a data set
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop - huguk
At Google Cloud Platform, we're combining the Apache Spark and Hadoop ecosystem with our software and hardware innovations. We want to make these awesome tools easier, faster, and more cost-effective, from 3 to 30,000 cores. This presentation will showcase how Google Cloud Platform is innovating with the goal of bringing the Hadoop ecosystem to everyone.
Bio: "I love data because it surrounds us - everything is data. I also love open source software, because it shows what is possible when people come together to solve common problems with technology. While they are awesome on their own, I am passionate about combining the power of open source software with the potential unlimited uses of data. That's why I joined Google. I am a product manager for Google Cloud Platform and manage Cloud Dataproc and Apache Beam (incubating). I've previously spent time hanging out at Disney and Amazon. Beyond Google, love data, amateur radio, Disneyland, photography, running and Legos."
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox... - huguk
This talk will describe his research into using Hadoop to query and manage big geographic datasets, specifically OpenStreetMap (OSM). OSM is an "open-source" map of the world, growing at a rapid rate and currently around 5TB of data. The talk will introduce OSM, detail some aspects of the research, and also discuss his experiences with using the SpatialHadoop stack on Azure and Google Cloud.
Extracting maximum value from data while protecting consumer privacy. Jason ... - huguk
Big organisations have a wealth of rich customer data which opens up huge new opportunities. However, they have the challenge of how to extract value from this data while protecting the privacy of their individual customers. He will talk about the risks organisations face, and what they should do about it. He will survey the techniques which can be used to make data safe for analysis, and talk briefly about how they are solving this problem at Privitar.
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson - huguk
IBM is developing the Watson Ecosystem to leverage its Developer Cloud, APIs, Content Store and Talent Hub. This is part of IBM's recent announcement of the $1B investment in Watson as a new business unit including Silicon Alley NYC headquarters. For the first time, IBM will open up Watson as a development platform in the Cloud to spur innovation and fuel a new ecosystem of entrepreneurial software app providers who will bring forward a new generation of applications infused with Watson's cognitive computing intelligence.
In this talk about Apache Flink we will touch on three main things: an introductory look at Flink, a look under the hood, and a demo.
* In the introduction we will briefly look at the history of Flink and then go on to the API and different use cases. Here we will also see how it can be deployed in practice and what some of the pitfalls in a cluster setting can be.
* In the second section we will look at the streaming execution engine that lies at the heart of Flink. Here we will see what makes it tick and also what distinguishes it from other approaches, such as the mini-batch execution model.
* In the final section we will see a live demo of a fault-tolerant streaming job that performs analysis of the wikipedia edit-stream.
Ufuk Celebi - PMC member at Apache Flink and co-founder and software engineer at data Artisans
Lambda architecture on Spark, Kafka for real-time large scale ML - huguk
Sean Owen – Director of Data Science @Cloudera
Building machine learning models is all well and good, but how do they get productionized into a service? It's a long way from a Python script on a laptop to a fault-tolerant system that learns continuously, serves thousands of queries per second, and scales to terabytes. The confederation of open source technologies we know as Hadoop now offers data scientists the raw materials from which to assemble an answer: the means to build models but also to ingest data and serve queries, at scale.
This short talk will introduce Oryx 2, a blueprint for building this type of service on Hadoop technologies. It will survey the problem and the standard technologies and ideas that Oryx 2 combines: Apache Spark, Kafka, HDFS, the lambda architecture, PMML, REST APIs. The talk will touch on a key use case for this architecture -- recommendation engines.
Today’s reality Hadoop with Spark - How to select the best Data Science approa... - huguk
Martin Oberhuber and Eliano Marques, Senior Data Scientists @Think Big International
In this talk Think Big International Lead Data Scientists will discuss the options that exist today for engineering and data science teams aiming to use big data patterns to solve new business problems. With the enterprise adoption of the Hadoop ecosystem and the emerging momentum of open source projects like Spark it is becoming mandatory to have an approach that solves for business results but remains flexible to adapt and change with the open source market.
Signal Media: Real-Time Media & News Monitoring - huguk
Startup pitch presented by CTO Wesley Hall. Signal Media is a real-time media and news monitoring platform that tracks media outlets. News items are analysed for brand & media monitoring as well as market intelligence.
Startup pitch presented by Aeneas Wiener. Cytora is a real-time geopolitical risk analysis platform that extracts events from open-source intelligence and evaluates these events on their geopolitical impact.
Startup pitch presented by co-founder and CEO Jaco Els. Cubitic offers a predictive analytics platform that allows developers to build custom solutions for analytics and visualisation on top of a machine learning engine.
Startup pitch presented by co-founder and CEO Corentin Guillo. Bird.i is building a platform for up-to-date earth observation data that will bring satellite imagery to the mass market. Providing fresh imagery together with analytics around the forecast of localised demand opens up innovative opportunities in sectors like construction, tourism, real-estate and remote facility monitoring.
Startup pitch presented by co-founders Laure Andrieux and Nic Greenway. Aiseedo applies real-time machine learning, where the model of the world is constantly updated, to build adaptive systems which can be applied to robotics, the Internet of Things and healthcare.
Secrets of Spark's success - Deenar Toraskar, Think Reactive - huguk
This talk will cover the design and implementation decisions that have been key to the success of Apache Spark over other competing cluster computing frameworks. It will delve into the whitepaper behind Spark and cover the design of Spark RDDs, the abstraction that enables the Spark execution engine to be extended to support a wide variety of use cases: Spark SQL, Spark Streaming, MLlib and GraphX. RDDs allow Spark to outperform existing models by up to 100x in multi-pass analytics.
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal... - huguk
Technical developments in the area of data warehousing have allowed companies to push their analysis a step further and, therefore, allowed data scientists to deliver more value to business areas. In that session, we will focus on the case of performance marketing at King and demonstrate how we use Hadoop capabilities to exploit user-level data efficiently. That approach results in obtaining a more holistic view in a return-on-investment analysis of TV advertisement.
Hadoop - Looking to the Future By Arun Murthy - huguk
Hadoop - Looking to the Future
By Arun Murthy (Founder of Hortonworks, Creator of YARN)
The Apache Hadoop ecosystem began as just HDFS & MapReduce nearly 10 years ago in 2006.
Very much like the Ship of Theseus (http://en.wikipedia.org/wiki/Ship_of_Theseus), Hadoop has undergone an incredible amount of transformation, from multi-purpose YARN to interactive SQL with Hive/Tez to machine learning with Spark.
Much more lies ahead: whether you want sub-second SQL with Hive or use SSDs/Memory effectively in HDFS or manage Metadata-driven security policies in Ranger, the Hadoop ecosystem in the Apache Software Foundation continues to evolve to meet new challenges and use-cases.
Arun C Murthy has been involved with Apache Hadoop since the beginning of the project - nearly 10 years now. In the beginning he led MapReduce, went on to create YARN and then drove Tez & the Stinger effort to get to interactive & sub-second Hive. Recently he has been very involved in the Metadata and Governance efforts. In between he founded Hortonworks, the first public Hadoop distribution company.
Search and Society: Reimagining Information Access for Radical Futures - Bhaskar Mitra
The field of Information retrieval (IR) is currently undergoing a transformative shift, at least partly due to the emerging applications of generative AI to information access. In this talk, we will deliberate on the sociotechnical implications of generative AI for information access. We will argue that there is both a critical necessity and an exciting opportunity for the IR community to re-center our research agendas on societal needs while dismantling the artificial separation between the work on fairness, accountability, transparency, and ethics in IR and the rest of IR research. Instead of adopting a reactionary strategy of trying to mitigate potential social harms from emerging technologies, the community should aim to proactively set the research agenda for the kinds of systems we should build, inspired by diverse explicitly stated sociotechnical imaginaries. The sociotechnical imaginaries that underpin the design and development of information access technologies need to be explicitly articulated, and we need to develop theories of change in the context of these diverse perspectives. Our guiding future imaginaries must be informed by other academic fields, such as democratic theory and critical theory, and should be co-developed with social science scholars, legal scholars, civil rights and social justice activists, and artists, among others.
Connector Corner: Automate dynamic content and events by pushing a button - DianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
The Art of the Pitch: WordPress Relationships and Sales - Laura Byrne
Clients don’t know what they don’t know. What web solutions are right for them? How does WordPress come into the picture? How do you make sure you understand scope and timeline? What do you do if something changes?
All these questions and more will be explored as we talk about matching clients’ needs with what your agency offers without pulling teeth or pulling your hair out. Practical tips, and strategies for successful relationship building that leads to closing the deal.
State of ICS and IoT Cyber Threat Landscape Report 2024 preview - Prayukth K V
The IoT and OT threat landscape report has been prepared by the Threat Research Team at Sectrio using data from Sectrio's cyber threat intelligence farming facilities spread across over 85 cities around the world. In addition, Sectrio also runs AI-based advanced threat and payload engagement facilities that serve as sinks to attract and engage sophisticated threat actors and newer malware, including new variants and latent threats that are at an earlier stage of development.
The latest edition of the OT/ICS and IoT security Threat Landscape Report 2024 also covers:
State of global ICS asset and network exposure
Sectoral targets and attacks as well as the cost of ransom
Global APT activity, AI usage, actor and tactic profiles, and implications
Rise in volumes of AI-powered cyberattacks
Major cyber events in 2024
Malware and malicious payload trends
Cyberattack types and targets
Vulnerability exploit attempts on CVEs
Attacks on countries – USA
Expansion of bot farms – how, where, and why
In-depth analysis of the cyber threat landscape across North America, South America, Europe, APAC, and the Middle East
Why are attacks on smart factories rising?
Cyber risk predictions
Axis of attacks – Europe
Systemic attacks in the Middle East
Download the full report from here:
https://sectrio.com/resources/ot-threat-landscape-reports/sectrio-releases-ot-ics-and-iot-security-threat-landscape-report-2024/
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Tobias Schneck
As AI technology pushes into IT, I found myself wondering, as an "infrastructure container Kubernetes guy", how does this fancy AI technology get managed from an infrastructure operations view? Is it possible to apply our lovely cloud native principles as well? What benefits could both technologies bring to each other?
Let me take these questions and provide you a short journey through existing deployment models and use cases for AI software. Using practical examples, we discuss what cloud/on-premise strategy we may need to apply it to our own infrastructure and get it to work from an enterprise perspective. I want to give an overview of infrastructure requirements and technologies, and what could be beneficial or limiting for your AI use cases in an enterprise environment. An interactive demo will give you some insights into which approaches I have already gotten working for real.
Epistemic Interaction - tuning interfaces to provide information for AI support - Alan Dix
Paper presented at SYNERGY workshop at AVI 2024, Genoa, Italy. 3rd June 2024
https://alandix.com/academic/papers/synergy2024-epistemic/
As machine learning integrates deeper into human-computer interactions, the concept of epistemic interaction emerges, aiming to refine these interactions to enhance system adaptability. This approach encourages minor, intentional adjustments in user behaviour to enrich the data available for system learning. This paper introduces epistemic interaction within the context of human-system communication, illustrating how deliberate interaction design can improve system understanding and adaptation. Through concrete examples, we demonstrate the potential of epistemic interaction to significantly advance human-computer interaction by leveraging intuitive human communication strategies to inform system design and functionality, offering a novel pathway for enriching user-system engagements.
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality - Inflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -... - DanBrown980551
Do you want to learn how to model and simulate an electrical network from scratch in under an hour?
Then welcome to this PowSyBl workshop, hosted by Rte, the French Transmission System Operator (TSO)!
During the webinar, you will discover the PowSyBl ecosystem as well as handle and study an electrical network through an interactive Python notebook.
PowSyBl is an open source project hosted by LF Energy, which offers a comprehensive set of features for electrical grid modelling and simulation. Among other advanced features, PowSyBl provides:
- A fully editable and extendable library for grid component modelling;
- Visualization tools to display your network;
- Grid simulation tools, such as power flows, security analyses (with or without remedial actions) and sensitivity analyses;
The framework is mostly written in Java, with a Python binding so that Python developers can access PowSyBl functionalities as well.
What you will learn during the webinar:
- For beginners: discover PowSyBl's functionalities through a quick general presentation and the notebook, without needing any expert coding skills;
- For advanced developers: master the skills to efficiently apply PowSyBl functionalities to your real-world scenarios.
UiPath Test Automation using UiPath Test Suite series, part 4 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimizing testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
PHP Frameworks: I want to break free (IPC Berlin 2024) - Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
UiPath Test Automation using UiPath Test Suite series, part 3 - DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 3. In this session, we will cover desktop automation along with UI automation.
Topics covered:
UI automation Introduction,
UI automation Sample
Desktop automation flow
Pradeep Chinnala, Senior Consultant Automation Developer @WonderBotz and UiPath MVP
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo... - James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks, has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do... - UiPathCommunity
💥 Speed, accuracy, and scaling – discover the superpowers of GenAI in action with UiPath Document Understanding and Communications Mining™:
See how to accelerate model training and optimize model performance with active learning
Learn about the latest enhancements to out-of-the-box document processing – with little to no training required
Get an exclusive demo of the new family of UiPath LLMs – GenAI models specialized for processing different types of documents and messages
This is a hands-on session specifically designed for automation developers and AI enthusiasts seeking to enhance their knowledge in leveraging the latest intelligent document processing capabilities offered by UiPath.
Speakers:
👨🏫 Andras Palfi, Senior Product Manager, UiPath
👩🏫 Lenka Dulovicova, Product Program Manager, UiPath
4. … Become Persistent Questions in the Data Lifecycle
What’s in this data?
Can I make use of it?
Unboxing → Transformation → Analysis → Visualization → Productization
7. “It’s easy to just think you know what you are doing and not look at data at every intermediary step.
An analysis has 30 different steps. It’s tempting to just do this then that and then this. You have no idea in which ways you are wrong and what data is wrong.”
8. What’s in the data?
• The Expected: Models, Densities, Constraints
• The Unexpected: Residuals, Outliers, Anomalies
21. Mapping out the Design Space
How much data to examine?
How accurate are the results?
How fast can you get them?
22. Mapping out the Design Space
Decide how your requirements fall on these axes
Find a strategy (if one exists) that fits the requirements
Accuracy
Urgency
Data Volume
24. Strategy vs Cost
Random Sample
Accuracy
Urgency
Data Volume
Good Enough | Anomalies | Big Picture | Unbox
25. Strategy vs Cost
Scan, summarize, collect samples
Accuracy
Urgency
Data Volume
Good Enough | Anomalies | Big Picture | Unbox
26. “Far better an approximate answer to the right question, which is often vague, than the exact answer to the wrong question, which can always be made precise.”
(Data Analysis & Statistics, Tukey & Wilk, 1966)
28. Sanity Check: Is this really expensive?
• Computers are fast
• In-memory, column stores, OLAP, …
• Still, “Big Data” can be hard
• Big is sometimes really big
• Big data can be raw: no indexes or precomputed summaries
• Agility remains critical to harness the “informed human mind”
29. Two Useful Techniques
Sampling
• A variety of techniques available
Sketches
• One-pass memory-efficient structures for capturing distributions
Accuracy
Urgency
Data Volume
31. Approaches to Sampling
• Scan-based access
• Head-of-file
• Bernoulli
• Reservoir
• Random I/O Sampling
• Block-level sampling
32. Head-of-File
• Pros:
• Very fast: small data, no disk seeks
• Absolutely required when unboxing raw data
• Nested data (JSON/XML), Text (logs, database dumps, etc.)
• Cons:
• Correlation of position and value
33. Bernoulli
• Take a full pass, flip a (weighted) coin for each record
• Pros:
• trivial to implement
• trivial to parallelize
• almost no memory required
• Cons:
• requires a full scan of the data
• output size proportional to input size, and random
from random import random
sample = list(filter(lambda x: random() < 0.01, data))  # keep each record with probability 1%
34. Reservoir
• Fix a “reservoir” of k items. For each later item, with some probability, eject an old item for the new one
• Pros:
• trivial to implement
• easy to parallelize
• constant memory required
• fixed-size output — need not know input size in advance
• Cons:
• Requires a full scan of the data

from random import random, randint

res = data[0:k]  # initialize: first k items
counter = k      # number of items seen so far
for x in data[k:]:
    # accept the (counter+1)-th item with probability k/(counter+1)
    if random() < k / float(counter + 1):
        res[randint(0, len(res) - 1)] = x  # replace a uniformly random slot
    counter += 1
37. Meta-Strategy: Stratified Sampling
• Sometimes you need representative samples from each “group”
• Coverage: e.g., displaying examples for every state in a map
• Robustness: e.g., consider average income
• if you miss the rare top tax bracket, estimate is way off
38. Stratification: the GroupBy / Agg pattern
• Given:
• A group-partitioning key for stratification
• Sizes for each stratum
• Easy to implement: partition, and construct sample per partition
• your favorite sampling technique applies
SELECT D.group_key, reservoir(D.value)
FROM data D
GROUP BY D.group_key;
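Outside a database, the same pattern takes only a few lines of Python: partition by the stratification key and keep one reservoir per stratum (a generic sketch reusing the reservoir logic above; the names are illustrative).

from random import random, randint
from collections import defaultdict

def stratified_reservoir(records, key, k):
    """Keep a k-item uniform reservoir per stratum in a single pass."""
    reservoirs = defaultdict(list)
    counts = defaultdict(int)
    for rec in records:
        g = key(rec)
        counts[g] += 1
        if len(reservoirs[g]) < k:
            reservoirs[g].append(rec)       # first k items fill the reservoir
        elif random() < k / float(counts[g]):
            reservoirs[g][randint(0, k - 1)] = rec  # evict a random slot
    return reservoirs

# e.g. representative incomes per state:
# samples = stratified_reservoir(rows, key=lambda r: r["state"], k=100)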
40. Record Sampling
• Randomly sample records?
• Let r be the fraction of items sampled and p the number of rows per block
• 20x random I/O penalty => you need to read fewer than 5% of blocks to come out ahead!
• Pretty inefficient: touches a 1 - (1 - r)^p fraction of blocks
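Plugging illustrative numbers into the 1 - (1 - r)^p formula shows how quickly record-level sampling degenerates into a near-full scan:

# Probability that a block of p rows contains at least one sampled row,
# when each row is sampled independently with probability r:
def blocks_touched(r, p):
    return 1 - (1 - r) ** p

print(blocks_touched(0.01, 100))   # ~0.63: a 1% sample touches ~63% of blocks
print(blocks_touched(0.001, 100))  # ~0.10: even a 0.1% sample touches ~10%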
42. Block Sampling
• Randomly sample blocks of records from disk
• Concern: clustering bias.
• Techniques from database literature: assess bias and correct
• Beware: even block sampling needs to be well below 5%.
43. Sampling in Hadoop
• Larger unit of access: HDFS blocks (128MB vs. 64KB)
• HDFS buffering makes forward seeking within block cheaper
• But CPU costs may encourage sampling within the block.
• …and Hadoop makes it easy to sample across nodes
• Each worker only processes one block
• Must find record boundaries
• Tougher when dealing with quote escaping
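One way to sketch the record-boundary problem (illustrative only, not Hadoop's actual input-format code): a reader dropped at an arbitrary byte offset must scan forward to the next record start, and a naive newline scan breaks once newlines may appear inside quoted CSV fields.

def next_record_start(buf, offset):
    """Return the index just past the next newline that falls outside
    double quotes (simplified CSV: assumes the scan starts outside a
    quoted field and ignores escaped quotes)."""
    in_quotes = False
    for i in range(offset, len(buf)):
        c = buf[i:i+1]
        if c == b'"':
            in_quotes = not in_quotes
        elif c == b"\n" and not in_quotes:
            return i + 1
    return len(buf)  # no further record starts in this block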
45. Sketching
• Family of algorithms for estimating contents of a data stream
• Constant-sized memory footprint
• Computed in 1 pass over the data
• Classic Examples
• Bloom filter: existence testing
• HyperLogLog Sketches (FM): distinct values
• CountMin (CM): a surprisingly versatile sketch for frequencies
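As a concrete member of the family, here is a toy Bloom filter for existence testing (a generic sketch; production implementations size m and k from a target false-positive rate).

import hashlib

class BloomFilter:
    def __init__(self, m=1 << 20, k=5):
        self.m, self.k = m, k          # m bits, k hash functions
        self.bits = bytearray(m // 8)
    def _positions(self, item):
        for i in range(self.k):
            h = hashlib.sha256(f"{i}:{item}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.m
    def add(self, item):
        for pos in self._positions(item):
            self.bits[pos // 8] |= 1 << (pos % 8)
    def might_contain(self, item):
        # False positives are possible; false negatives are not.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(item))

bf = BloomFilter()
bf.add("alice@example.com")
assert bf.might_contain("alice@example.com")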
62. The Versatile CountMin Sketch
CountMin (and CountMeanMin) answer “point frequency queries”.
Surprisingly, we can use them to answer many more questions:
• densities
• even order statistics (median, quantiles, etc.)
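A minimal CountMin implementation makes the point-query mechanics visible (a generic sketch, not the deck's code): d hash rows of w counters; collisions can only inflate counts, so the row-wise minimum upper-bounds the true frequency.

import hashlib

class CountMin:
    def __init__(self, w=2048, d=5):
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]
    def _idx(self, row, item):
        h = hashlib.sha256(f"{row}:{item}".encode()).digest()
        return int.from_bytes(h[:8], "big") % self.w
    def add(self, item, count=1):
        for r in range(self.d):
            self.table[r][self._idx(r, item)] += count
    def estimate(self, item):
        # Each row overestimates due to collisions; take the minimum.
        return min(self.table[r][self._idx(r, item)] for r in range(self.d))

cm = CountMin()
for word in ["a", "b", "a", "c", "a"]:
    cm.add(word)
assert cm.estimate("a") >= 3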
75. More Statistics
• Count-Range Queries
• Median
• Quantiles: generalization of Median
• Histograms
[illustration: value domain 0-31]
76. More Statistics
• Count-Range Queries
• Median
• Quantiles
• Histograms:
• fixed-width bins: range queries
• fixed-height bins: quantiles
[histogram bins: 1-10, 11-20, 21-30, 31-40]
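A small illustration of the two binning styles on made-up data: fixed-width bins answer count-range queries by lookup, while fixed-height bin boundaries are exactly the quantiles.

import numpy as np

values = np.random.default_rng(0).integers(1, 41, size=10_000)

# Fixed-width bins: each bin covers an equal value range (1-10, 11-20, ...),
# so a count-range query becomes a lookup (plus partial bins at the edges).
counts, edges = np.histogram(values, bins=[1, 11, 21, 31, 41])

# Fixed-height bins: each bin holds an equal number of points, so the
# bin boundaries are exactly the quantiles (here, quartiles).
quartiles = np.quantile(values, [0.25, 0.5, 0.75])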