Agile Data Profiling
Sean Kandel
What’s in your data?
Opening Questions
in the Data Lifecycle…
Unboxing What’s in this data?
Can I make use of it?
… Become Persistent Questions
in the Data Lifecycle
What’s in this data?
Can I make use of it?
Unboxing Transformation Analysis Visualization Productization
STRUCTURING CLEANING
ENRICHMENT DISTILLATION
“It's easy to just think you know what you
are doing and not look at data at every
intermediary step.
An analysis has 30 different steps. It's
tempting to just do this then that and then
this. You have no idea in which ways you
are wrong and what data is wrong.”
What’s in the data?
• The Expected: Models, Densities, Constraints
• The Unexpected: Residuals, Outliers, Anomalies
Average Movie Ratings
Expected
Unexpected
Overview of all variables
Show relevant perspectives
What to compute?
• Densities and descriptive statistics
• Identify anomalies and outliers
How often to compute it?
Unboxing Transformation Analysis Visualization Productization
Challenge: Agility
• Profiling throughout the lifecycle
• Particularly important as you manipulate data
Design Space and Tradeoffs
Mapping out the Design Space
How much data to examine?
How accurate are the results?
How fast can you get them?
Mapping out the Design Space
Decide how your requirements fall on these axes
Find a strategy (if one exists) that fits the requirements
Axes: Accuracy, Urgency, Data Volume
Strategy vs Cost
Axes: Accuracy, Urgency, Data Volume
Goals: Good Enough, Anomalies, Big Picture, Unbox
• Head of file
• Random sample
• Scan, summarize, collect samples
Far better an approximate answer
to the right question, which is often
vague, than the exact answer to
the wrong question, which can
always be made precise.
Data Analysis & Statistics, Tukey & Wilk 1966
Technical Methods
Sanity Check: Is this really expensive?
• Computers are fast
• In-memory, column stores, OLAP, …
• Still, “Big Data” can be hard
• Big is sometimes really big
• Big data can be raw: no indexes or precomputed summaries
• Agility remains critical to harness the “informed human mind”
Two Useful Techniques
Sampling
• A variety of techniques available
Sketches
• One-pass memory-efficient structures for capturing distributions
Technique I: Sampling
Approaches to Sampling
• Scan-based access
  • Head-of-file
  • Bernoulli
  • Reservoir
• Random I/O sampling
  • Block-level sampling
Head-of-File
• Pros:
• Very fast: small data, no disk seeks
• Absolutely required when unboxing raw data
• Nested data (JSON/XML), Text (logs, database dumps, etc.)
• Cons:
• Correlation of position and value
Bernoulli
• Take a full pass, flip a (weighted) coin for each record
• Pros:
• trivial to implement
• trivial to parallelize
• almost no memory required
• Cons:
• requires a full scan of the data
• output size proportional to input size, and random
from random import random
sample = [x for x in data if random() < 0.01]  # keep each record with probability 1%
Reservoir
• Fix a “reservoir” of k items. For the i-th item (i > k), with probability k/i eject a random old item for the new one
• Pros:
• trivial to implement
• easy to parallelize
• constant memory required
• fixed-size output — need not know input size in advance
• Cons:
• Requires a full scan of the data
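from random import random, randrange

res = list(data[:k])               # initialize: first k items
counter = k                        # number of items seen so far
for x in data[k:]:
    counter += 1
    if random() < k / counter:     # keep item number `counter` with probability k/counter
        res[randrange(k)] = x      # eject a uniformly random old item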
Meta-Strategy: Stratified Sampling
• Sometimes you need representative samples from each “group”
• Coverage: e.g., displaying examples for every state in a map
• Robustness: e.g., consider average income
• if you miss the rare top tax bracket, estimate is way off
Stratification: the GroupBy / Agg pattern
• Given:
• A group-partitioning key for stratification
• Sizes for each stratum
• Easy to implement: partition, and construct sample per partition
• your favorite sampling technique applies
SELECT D.group_key, reservoir(D.value)
FROM data D
GROUP BY D.group_key;
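The GroupBy / Agg pattern above can be sketched outside SQL as well. The following is a minimal illustration, assuming records fit the shape `(group_key, value)` and that every stratum gets the same size `k`; the function name `stratified_reservoir` and the sample data are hypothetical, not from the talk.

```python
import random
from collections import defaultdict

def stratified_reservoir(records, key, k, seed=None):
    """Keep a uniform sample of up to k records per stratum, in one pass."""
    rng = random.Random(seed)
    reservoirs = defaultdict(list)   # stratum -> sampled records
    seen = defaultdict(int)          # stratum -> records seen so far
    for rec in records:
        g = key(rec)
        seen[g] += 1
        res = reservoirs[g]
        if len(res) < k:
            res.append(rec)                  # fill the reservoir first
        elif rng.random() < k / seen[g]:
            res[rng.randrange(k)] = rec      # replace a uniformly random slot
    return dict(reservoirs)

# Hypothetical data: a rare "top" bracket among many "low" records.
data = [("low", i) for i in range(10_000)] + [("top", i) for i in range(5)]
sample = stratified_reservoir(data, key=lambda r: r[0], k=3, seed=0)
```

A plain 1% Bernoulli sample of this data would usually contain no "top" records at all; the per-stratum reservoirs guarantee three of them.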
Record Sampling
• Randomly sample records?
• r: fraction of items sampled; p: rows per block
• With a 20x random I/O penalty, random reads beat a full scan only when fewer than 5% of blocks are read!
• Pretty inefficient: touches a 1-(1-r)^p fraction of blocks (in expectation)
Record Sampling
Plot: expected % of blocks touched vs. % of items sampled, 1-(1-r)^p with p = 100
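To make the curve concrete, here is a small sketch of the formula, assuming each row is sampled independently with probability r so that a block of p rows is untouched with probability (1-r)^p. The function name `blocks_touched` is illustrative.

```python
def blocks_touched(r, p):
    """Expected fraction of blocks containing at least one sampled row."""
    return 1.0 - (1.0 - r) ** p

for r in (0.001, 0.01, 0.05):
    print(f"sample {r:.1%} of rows -> touch {blocks_touched(r, 100):.1%} of blocks")
```

With p = 100, sampling just 1% of rows already touches about 63% of blocks, far above the 5% budget implied by a 20x random I/O penalty.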
Block Sampling
• Randomly sample blocks of records from disk
• Concern: clustering bias.
• Techniques from database literature: assess bias and correct
• Beware: even block sampling needs to be well below 5%.
Sampling in Hadoop
• Larger unit of access: HDFS blocks (128MB vs. 64KB)
• HDFS buffering makes forward seeking within block cheaper
• But CPU costs may encourage sampling within the block.
• …and Hadoop makes it easy to sample across nodes
• Each worker only processes one block
• Must find record boundaries
• Tougher when dealing with quote escaping
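The record-boundary problem can be sketched as follows. This is a simplified illustration, assuming newline-delimited records with no newlines inside quoted fields (exactly the case that quote escaping breaks); the function name `records_in_split` and the sample bytes are hypothetical, and this is not Hadoop's own code.

```python
import io

def records_in_split(f, start, end):
    """Yield newline-delimited records that begin in the byte range [start, end)."""
    if start > 0:
        f.seek(start - 1)
        f.readline()        # skip to the first record boundary at or after `start`
    else:
        f.seek(0)
    while f.tell() < end:
        line = f.readline()
        if not line:        # end of file
            break
        yield line.rstrip(b"\n")

data = b"a,1\nbb,2\nccc,3\ndddd,4\n"
first = list(records_in_split(io.BytesIO(data), 0, 10))
second = list(records_in_split(io.BytesIO(data), 10, len(data)))
```

A record that straddles the split point is read in full by the worker whose range contains its first byte, so each record is processed exactly once.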
Technique II: Sketching
Sketching
• Family of algorithms for estimating contents of a data stream
• Constant-sized memory footprint
• Computed in 1 pass over the data
• Classic Examples
• Bloom filter: existence testing
• HyperLogLog sketches (descended from Flajolet-Martin, FM): distinct values
• CountMin (CM): a surprisingly versatile sketch for frequencies
CountMin Sketch: Initialization
• A table of counters with d rows (one per hash function) and w hash buckets per row
• All counters start at 0
CountMin Sketch: Insertion
• d hash functions (h1, h2, …, hd), w hash buckets per row
• Insert(7): increment the counter at h1(7), h2(7), …, hd(7), one per row (each becomes 1)
• Insert(4): increment the counter at h1(4), h2(4), …, hd(4); in the row where 4 collides with 7, the shared counter becomes 2
CountMin Sketch: Query
• Count(7)?: read the counter at h1(7), h2(7), …, hd(7), one per row (here 1, 2, 1)
• Count(7) = min of those counters (here 1)
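The insert and query steps can be sketched in a few lines. This is a minimal illustration rather than a production sketch: it assumes that salting one BLAKE2 hash with the row index is an acceptable stand-in for d independent hash functions, and the class and method names are hypothetical.

```python
import hashlib

class CountMinSketch:
    def __init__(self, w, d):
        self.w, self.d = w, d
        self.table = [[0] * w for _ in range(d)]   # d rows of w counters, all 0

    def _bucket(self, row, item):
        # Assumption: one salted hash per row approximates d independent hash functions.
        digest = hashlib.blake2b(repr(item).encode(),
                                 salt=row.to_bytes(8, "little")).digest()
        return int.from_bytes(digest[:8], "little") % self.w

    def insert(self, item, count=1):
        for row in range(self.d):
            self.table[row][self._bucket(row, item)] += count

    def count(self, item):
        # Collisions only ever add to a counter, so the minimum is the tightest estimate.
        return min(self.table[row][self._bucket(row, item)]
                   for row in range(self.d))

cms = CountMinSketch(w=2000, d=10)
for x in [7, 4, 7, 7, 9]:
    cms.insert(x)
estimate = cms.count(7)   # never below the true count of 3
```

Memory is fixed at d × w counters no matter how many distinct items the stream contains.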
CountMin Sketch: Theorem & Tuning
Cormode & Muthukrishnan, J. Algorithms 55(1), 2005.
• The estimate is an over-estimate: collisions can only add to a counter
w controls expected error amount
d controls probability of error
Suppose we want:
0.1% error, 99.9% probability.
w = 2000
d = 10
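The theorem statement itself did not survive on this slide, so the following is a reconstruction under an assumption: the numbers w = 2000 and d = 10 are consistent with choosing w = 2/ε and d = log2(1/δ), where ε is the error as a fraction of the total count N and δ is the failure probability. (The cited paper states the bound with w = ⌈e/ε⌉ and d = ⌈ln(1/δ)⌉; the variant below is the same argument with different constants.) Here a_x is the true count, â_x the estimate, and the rows are assumed to use independent hash functions.

```latex
\begin{align*}
  a_x &\le \hat{a}_x
      && \text{(collisions only add)} \\
  \mathbb{E}[\text{overage in row } j] &\le \frac{N}{w} = \frac{\varepsilon N}{2}
      && \text{(with } w = 2/\varepsilon\text{)} \\
  \Pr[\text{overage in row } j > \varepsilon N] &\le \tfrac{1}{2}
      && \text{(Markov's inequality)} \\
  \Pr[\hat{a}_x > a_x + \varepsilon N] &\le 2^{-d} \le \delta
      && \text{(with } d = \lceil \log_2(1/\delta) \rceil\text{)} \\
  \varepsilon = \delta = 10^{-3} &\;\Rightarrow\;
      w = 2000,\quad d = \lceil \log_2 1000 \rceil = 10
\end{align*}
```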
CountMeanMin Sketch
• Idea: subtract out the expected overage, i.e. the mean of the other cells
• In each of the d rows: the item's counter minus the mean of the other cells in that row
• Count(7): the median of the d per-row results
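The per-row correction and the final median can be sketched as below. This is an illustrative reading of the slide, not a reference implementation: the helper names are hypothetical, the salted BLAKE2 hash is an assumed stand-in for d independent hash functions, and `n` is the total number of insertions (so the other cells of a row sum to n minus the item's counter).

```python
import hashlib
from statistics import median

def bucket(row, item, w):
    # Assumption: one salted hash per row approximates d independent hash functions.
    digest = hashlib.blake2b(repr(item).encode(),
                             salt=row.to_bytes(8, "little")).digest()
    return int.from_bytes(digest[:8], "little") % w

def build(items, w, d):
    """Fill a d x w CountMin table from a stream of items."""
    table = [[0] * w for _ in range(d)]
    for x in items:
        for row in range(d):
            table[row][bucket(row, x, w)] += 1
    return table

def count_mean_min(table, item, n):
    """Median over rows of (counter - mean of the other cells in that row)."""
    w = len(table[0])
    estimates = []
    for row, cells in enumerate(table):
        c = cells[bucket(row, item, w)]
        noise = (n - c) / (w - 1)      # mean of the other w - 1 cells
        estimates.append(c - noise)
    return median(estimates)

items = [7] * 10
table = build(items, w=16, d=3)
estimate = count_mean_min(table, 7, n=len(items))
```

Unlike plain CountMin, this estimate can fall below the true count, so it trades the one-sided guarantee for less bias on low-frequency items.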
CountMin (and CountMeanMin) answer “point frequency queries”.
Surprisingly, we can use them to answer many more questions
• densities
• even order statistics (median, quantiles, etc.)
The Versatile CountMin Sketch
More Statistics
• Count-Range Queries
• Median
• Quantiles
• Histograms
(Examples use the integer domain 00 to 31)
CountMin: Point Queries
• CountMin(x) answers Count(x = 13)
CountMin(⌊x/2⌋): Pair Queries
• answers Count(x ∊ [14-15])
CountMin(⌊x/4⌋): Quartet Queries
• answers Count(x ∊ [16-19])
Dyadic CountMin: log2 CountMins
• One CountMin per level: x, ⌊x/2⌋, ⌊x/4⌋, ⌊x/8⌋, ⌊x/16⌋
• Maintain all of these, and answer arbitrary range queries, e.g. Count(x ∊ [13-24])
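The range decomposition can be sketched as follows. To keep the example short, exact `Counter` objects stand in for the per-level CountMin sketches (an assumption made purely for illustration); the function names and the sample data are hypothetical.

```python
from collections import Counter

MAX_LEVEL = 4  # levels x, x/2, x/4, x/8, x/16 for the domain 0..31

def dyadic_cover(lo, hi, max_level=MAX_LEVEL):
    """Split the inclusive range [lo, hi] into aligned blocks (level, index);
    block (l, i) covers [i * 2**l, (i + 1) * 2**l - 1]."""
    blocks = []
    while lo <= hi:
        level = 0
        # Grow the block while it stays aligned at lo and inside [lo, hi].
        while (level < max_level
               and lo % (2 ** (level + 1)) == 0
               and lo + 2 ** (level + 1) - 1 <= hi):
            level += 1
        blocks.append((level, lo >> level))
        lo += 2 ** level
    return blocks

def range_count(levels, lo, hi):
    """Sum one point query per block, each against that block's level."""
    return sum(levels[l][i] for l, i in dyadic_cover(lo, hi))

data = list(range(32)) * 2                       # every value appears twice
levels = [Counter(x >> l for x in data) for l in range(MAX_LEVEL + 1)]
cover = dyadic_cover(13, 24)
total = range_count(levels, 13, 24)
```

The range [13-24] splits into 13, [14-15], [16-23] and 24, so twelve values are answered with four point queries instead of twelve.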
Median
Via binary search.
(Suppose we have N elements, and the real median is 14)
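The binary search can be sketched as below, and the same routine yields any quantile. It assumes a function `count_le(v)` that returns Count(x ≤ v); in the setting of the talk that would be a range query against the sketches, but here an exact count over a small sorted list stands in for it. All names and the sample data are hypothetical.

```python
from bisect import bisect_right

def quantile(count_le, n, q, lo=0, hi=31):
    """Smallest v in [lo, hi] with count_le(v) >= q * n, found by binary search."""
    target = q * n
    while lo < hi:
        mid = (lo + hi) // 2
        if count_le(mid) >= target:
            hi = mid          # enough mass at or below mid: answer is mid or lower
        else:
            lo = mid + 1      # too little mass: answer is above mid
    return lo

data = sorted([3, 9, 14, 14, 14, 20, 27])        # hypothetical data; true median 14

def count_le(v):
    # Stand-in for a sketch-based range query Count(x in [0, v]).
    return bisect_right(data, v)

med = quantile(count_le, len(data), 0.5)
```

Each search costs about log2(domain size) range queries, and with approximate counts the result is only as accurate as the sketch's error bound.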
More Statistics
• Count-Range Queries
• Median
• Quantiles: generalization of Median
• Histograms:
  • fixed-width bins: range queries (e.g. 1-10, 11-20, 21-30, 31-40)
  • fixed-height bins: quantiles
Putting It Together
Wrangling Revisited
Goals: Good Enough, Anomalies, Big Picture, Unbox
Techniques: Head-of-file, Bernoulli, Sketching, Stratified, Block, Reservoir
Summary
• ABP: Always Be Profiling
• Trade off latency and accuracy
• Approximation methods
• Heuristics and reasonable assumptions
Acknowledgments
Adam Silberstein, Joe Hellerstein

More Related Content

Similar to Sean Kandel - Data profiling: Assessing the overall content and quality of a data set

A Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big DataA Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big Data
Databricks
 
Practical deep learning for computer vision
Practical deep learning for computer visionPractical deep learning for computer vision
Practical deep learning for computer vision
Eran Shlomo
 
Vaex talk-pydata-paris
Vaex talk-pydata-parisVaex talk-pydata-paris
Vaex talk-pydata-paris
Maarten Breddels
 
How to interactively visualise and explore a billion objects (wit vaex)
How to interactively visualise and explore a billion objects (wit vaex)How to interactively visualise and explore a billion objects (wit vaex)
How to interactively visualise and explore a billion objects (wit vaex)
Ali-ziane Myriam
 
LISA2010 visualizations
LISA2010 visualizationsLISA2010 visualizations
LISA2010 visualizations
Brendan Gregg
 
Avoiding big data antipatterns
Avoiding big data antipatternsAvoiding big data antipatterns
Avoiding big data antipatterns
grepalex
 
Make Life Suck Less (Building Scalable Systems)
Make Life Suck Less (Building Scalable Systems)Make Life Suck Less (Building Scalable Systems)
Make Life Suck Less (Building Scalable Systems)
guest0f8e278
 
Capacity Planning for fun & profit
Capacity Planning for fun & profitCapacity Planning for fun & profit
Capacity Planning for fun & profit
Rodrigo Campos
 
Nearest Neighbor Customer Insight
Nearest Neighbor Customer InsightNearest Neighbor Customer Insight
Nearest Neighbor Customer Insight
MapR Technologies
 
No stress with state
No stress with stateNo stress with state
No stress with state
Uwe Friedrichsen
 
Data oriented design and c++
Data oriented design and c++Data oriented design and c++
Data oriented design and c++
Mike Acton
 
Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"
NUS-ISS
 
Blinkdb
BlinkdbBlinkdb
Blinkdb
Nitish Upreti
 
Agile Data Science: Hadoop Analytics Applications
Agile Data Science: Hadoop Analytics ApplicationsAgile Data Science: Hadoop Analytics Applications
Agile Data Science: Hadoop Analytics Applications
Russell Jurney
 
Intro_2.ppt
Intro_2.pptIntro_2.ppt
Intro_2.ppt
MumitAhmed1
 
Intro.ppt
Intro.pptIntro.ppt
Intro.ppt
SharabiNaif
 
Intro.ppt
Intro.pptIntro.ppt
Intro.ppt
Anonymous9etQKwW
 
Agile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics ApplicationsAgile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics Applications
Russell Jurney
 
CSE545 sp23 (2) Streaming Algorithms 2-4.pdf
CSE545 sp23 (2) Streaming Algorithms 2-4.pdfCSE545 sp23 (2) Streaming Algorithms 2-4.pdf
CSE545 sp23 (2) Streaming Algorithms 2-4.pdf
AlexanderKyalo3
 
Three steps to untangle data traffic jams
Three steps to untangle data traffic jamsThree steps to untangle data traffic jams
Three steps to untangle data traffic jams
Bol.com Techlab
 

Similar to Sean Kandel - Data profiling: Assessing the overall content and quality of a data set (20)

A Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big DataA Production Quality Sketching Library for the Analysis of Big Data
A Production Quality Sketching Library for the Analysis of Big Data
 
Practical deep learning for computer vision
Practical deep learning for computer visionPractical deep learning for computer vision
Practical deep learning for computer vision
 
Vaex talk-pydata-paris
Vaex talk-pydata-parisVaex talk-pydata-paris
Vaex talk-pydata-paris
 
How to interactively visualise and explore a billion objects (wit vaex)
How to interactively visualise and explore a billion objects (wit vaex)How to interactively visualise and explore a billion objects (wit vaex)
How to interactively visualise and explore a billion objects (wit vaex)
 
LISA2010 visualizations
LISA2010 visualizationsLISA2010 visualizations
LISA2010 visualizations
 
Avoiding big data antipatterns
Avoiding big data antipatternsAvoiding big data antipatterns
Avoiding big data antipatterns
 
Make Life Suck Less (Building Scalable Systems)
Make Life Suck Less (Building Scalable Systems)Make Life Suck Less (Building Scalable Systems)
Make Life Suck Less (Building Scalable Systems)
 
Capacity Planning for fun & profit
Capacity Planning for fun & profitCapacity Planning for fun & profit
Capacity Planning for fun & profit
 
Nearest Neighbor Customer Insight
Nearest Neighbor Customer InsightNearest Neighbor Customer Insight
Nearest Neighbor Customer Insight
 
No stress with state
No stress with stateNo stress with state
No stress with state
 
Data oriented design and c++
Data oriented design and c++Data oriented design and c++
Data oriented design and c++
 
Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"Approximate "Now" is Better Than Accurate "Later"
Approximate "Now" is Better Than Accurate "Later"
 
Blinkdb
BlinkdbBlinkdb
Blinkdb
 
Agile Data Science: Hadoop Analytics Applications
Agile Data Science: Hadoop Analytics ApplicationsAgile Data Science: Hadoop Analytics Applications
Agile Data Science: Hadoop Analytics Applications
 
Intro_2.ppt
Intro_2.pptIntro_2.ppt
Intro_2.ppt
 
Intro.ppt
Intro.pptIntro.ppt
Intro.ppt
 
Intro.ppt
Intro.pptIntro.ppt
Intro.ppt
 
Agile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics ApplicationsAgile Data Science: Building Hadoop Analytics Applications
Agile Data Science: Building Hadoop Analytics Applications
 
CSE545 sp23 (2) Streaming Algorithms 2-4.pdf
CSE545 sp23 (2) Streaming Algorithms 2-4.pdfCSE545 sp23 (2) Streaming Algorithms 2-4.pdf
CSE545 sp23 (2) Streaming Algorithms 2-4.pdf
 
Three steps to untangle data traffic jams
Three steps to untangle data traffic jamsThree steps to untangle data traffic jams
Three steps to untangle data traffic jams
 

More from huguk

Data Wrangling on Hadoop - Olivier De Garrigues, Trifacta
Data Wrangling on Hadoop - Olivier De Garrigues, TrifactaData Wrangling on Hadoop - Olivier De Garrigues, Trifacta
Data Wrangling on Hadoop - Olivier De Garrigues, Trifacta
huguk
 
ether.camp - Hackathon & ether.camp intro
ether.camp - Hackathon & ether.camp introether.camp - Hackathon & ether.camp intro
ether.camp - Hackathon & ether.camp intro
huguk
 
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and HadoopGoogle Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
huguk
 
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
huguk
 
Extracting maximum value from data while protecting consumer privacy. Jason ...
Extracting maximum value from data while protecting consumer privacy.  Jason ...Extracting maximum value from data while protecting consumer privacy.  Jason ...
Extracting maximum value from data while protecting consumer privacy. Jason ...
huguk
 
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM WatsonIntelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson
huguk
 
Streaming Dataflow with Apache Flink
Streaming Dataflow with Apache Flink Streaming Dataflow with Apache Flink
Streaming Dataflow with Apache Flink
huguk
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale ML
huguk
 
Today’s reality Hadoop with Spark- How to select the best Data Science approa...
Today’s reality Hadoop with Spark- How to select the best Data Science approa...Today’s reality Hadoop with Spark- How to select the best Data Science approa...
Today’s reality Hadoop with Spark- How to select the best Data Science approa...
huguk
 
Jonathon Southam: Venture Capital, Funding & Pitching
Jonathon Southam: Venture Capital, Funding & PitchingJonathon Southam: Venture Capital, Funding & Pitching
Jonathon Southam: Venture Capital, Funding & Pitching
huguk
 
Signal Media: Real-Time Media & News Monitoring
Signal Media: Real-Time Media & News MonitoringSignal Media: Real-Time Media & News Monitoring
Signal Media: Real-Time Media & News Monitoring
huguk
 
Dean Bryen: Scaling The Platform For Your Startup
Dean Bryen: Scaling The Platform For Your StartupDean Bryen: Scaling The Platform For Your Startup
Dean Bryen: Scaling The Platform For Your Startup
huguk
 
Peter Karney: Intro to the Digital catapult
Peter Karney: Intro to the Digital catapultPeter Karney: Intro to the Digital catapult
Peter Karney: Intro to the Digital catapult
huguk
 
Cytora: Real-Time Political Risk Analysis
Cytora:  Real-Time Political Risk AnalysisCytora:  Real-Time Political Risk Analysis
Cytora: Real-Time Political Risk Analysis
huguk
 
Cubitic: Predictive Analytics
Cubitic: Predictive AnalyticsCubitic: Predictive Analytics
Cubitic: Predictive Analytics
huguk
 
Bird.i: Earth Observation Data Made Social
Bird.i: Earth Observation Data Made SocialBird.i: Earth Observation Data Made Social
Bird.i: Earth Observation Data Made Social
huguk
 
Aiseedo: Real Time Machine Intelligence
Aiseedo: Real Time Machine IntelligenceAiseedo: Real Time Machine Intelligence
Aiseedo: Real Time Machine Intelligence
huguk
 
Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive
huguk
 
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...
huguk
 
Hadoop - Looking to the Future By Arun Murthy
Hadoop - Looking to the Future By Arun MurthyHadoop - Looking to the Future By Arun Murthy
Hadoop - Looking to the Future By Arun Murthy
huguk
 

More from huguk (20)

Data Wrangling on Hadoop - Olivier De Garrigues, Trifacta
Data Wrangling on Hadoop - Olivier De Garrigues, TrifactaData Wrangling on Hadoop - Olivier De Garrigues, Trifacta
Data Wrangling on Hadoop - Olivier De Garrigues, Trifacta
 
ether.camp - Hackathon & ether.camp intro
ether.camp - Hackathon & ether.camp introether.camp - Hackathon & ether.camp intro
ether.camp - Hackathon & ether.camp intro
 
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and HadoopGoogle Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
Google Cloud Dataproc - Easier, faster, more cost-effective Spark and Hadoop
 
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
Using Big Data techniques to query and store OpenStreetMap data. Stephen Knox...
 
Extracting maximum value from data while protecting consumer privacy. Jason ...
Extracting maximum value from data while protecting consumer privacy.  Jason ...Extracting maximum value from data while protecting consumer privacy.  Jason ...
Extracting maximum value from data while protecting consumer privacy. Jason ...
 
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM WatsonIntelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson
Intelligence Augmented vs Artificial Intelligence. Alex Flamant, IBM Watson
 
Streaming Dataflow with Apache Flink
Streaming Dataflow with Apache Flink Streaming Dataflow with Apache Flink
Streaming Dataflow with Apache Flink
 
Lambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale MLLambda architecture on Spark, Kafka for real-time large scale ML
Lambda architecture on Spark, Kafka for real-time large scale ML
 
Today’s reality Hadoop with Spark- How to select the best Data Science approa...
Today’s reality Hadoop with Spark- How to select the best Data Science approa...Today’s reality Hadoop with Spark- How to select the best Data Science approa...
Today’s reality Hadoop with Spark- How to select the best Data Science approa...
 
Jonathon Southam: Venture Capital, Funding & Pitching
Jonathon Southam: Venture Capital, Funding & PitchingJonathon Southam: Venture Capital, Funding & Pitching
Jonathon Southam: Venture Capital, Funding & Pitching
 
Signal Media: Real-Time Media & News Monitoring
Signal Media: Real-Time Media & News MonitoringSignal Media: Real-Time Media & News Monitoring
Signal Media: Real-Time Media & News Monitoring
 
Dean Bryen: Scaling The Platform For Your Startup
Dean Bryen: Scaling The Platform For Your StartupDean Bryen: Scaling The Platform For Your Startup
Dean Bryen: Scaling The Platform For Your Startup
 
Peter Karney: Intro to the Digital catapult
Peter Karney: Intro to the Digital catapultPeter Karney: Intro to the Digital catapult
Peter Karney: Intro to the Digital catapult
 
Cytora: Real-Time Political Risk Analysis
Cytora:  Real-Time Political Risk AnalysisCytora:  Real-Time Political Risk Analysis
Cytora: Real-Time Political Risk Analysis
 
Cubitic: Predictive Analytics
Cubitic: Predictive AnalyticsCubitic: Predictive Analytics
Cubitic: Predictive Analytics
 
Bird.i: Earth Observation Data Made Social
Bird.i: Earth Observation Data Made SocialBird.i: Earth Observation Data Made Social
Bird.i: Earth Observation Data Made Social
 
Aiseedo: Real Time Machine Intelligence
Aiseedo: Real Time Machine IntelligenceAiseedo: Real Time Machine Intelligence
Aiseedo: Real Time Machine Intelligence
 
Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive Secrets of Spark's success - Deenar Toraskar, Think Reactive
Secrets of Spark's success - Deenar Toraskar, Think Reactive
 
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...
TV Marketing and big data: cat and dog or thick as thieves? Krzysztof Osiewal...
 
Hadoop - Looking to the Future By Arun Murthy
Hadoop - Looking to the Future By Arun MurthyHadoop - Looking to the Future By Arun Murthy
Hadoop - Looking to the Future By Arun Murthy
 

Recently uploaded

Knowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and backKnowledge engineering: from people to machines and back
Knowledge engineering: from people to machines and back
Elena Simperl
 
Search and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical FuturesSearch and Society: Reimagining Information Access for Radical Futures
Search and Society: Reimagining Information Access for Radical Futures
Bhaskar Mitra
 
Connector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a buttonConnector Corner: Automate dynamic content and events by pushing a button
Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10
 
The Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and SalesThe Art of the Pitch: WordPress Relationships and Sales
The Art of the Pitch: WordPress Relationships and Sales
Laura Byrne
 
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 previewState of ICS and IoT Cyber Threat Landscape Report 2024 preview
State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V
 
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck
 
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdfFIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance Osaka Seminar: Passkeys and the Road Ahead.pdf
FIDO Alliance
 
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdfFIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance Osaka Seminar: FIDO Security Aspects.pdf
FIDO Alliance
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualitySoftware Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra
 
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551
 
UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4UiPath Test Automation using UiPath Test Suite series, part 4
UiPath Test Automation using UiPath Test Suite series, part 4
DianaGray10
 
PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)PHP Frameworks: I want to break free (IPC Berlin 2024)
PHP Frameworks: I want to break free (IPC Berlin 2024)
Ralf Eggert
 
Sean Kandel - Data profiling: Assessing the overall content and quality of a data set