Automated Data Exploration
Building efficient analysis
pipelines with Dask
Víctor Zabalza
victor.z@asidatascience.com
@zblz
@ASIDataScience
About me
• Data engineer at ASI Data Science.
• Former astrophysicist.
• Main developer of naima, a Python
package for radiative analysis of
non-thermal astronomical sources.
• matplotlib developer.
ASI Data Science
• Data science
consultancy
• Academia to Data
Industry fellowship
• SherlockML
Data Exploration
First steps in a Data Science project
• Does the data fit in a single computer?
• Data quality assessment
• Data exploration
• Data cleaning
80% → >30 h/week
Can we automate the
drudge work?
Developing a tool for
data exploration based
on Dask
Lens
Library and service for
automated data quality
assessment and exploration
Lens by example
Room occupancy dataset
• ML standard dataset
• Goal: predict occupancy based on ambient
measurements.
• What can we learn about it with Lens?
Python interface
>>> import lens
>>> import pandas as pd
>>> df = pd.read_csv('room_occupancy.csv')
>>> ls = lens.summarise(df)
>>> type(ls)
<class 'lens.summarise.Summary'>
>>> ls.to_json('room_occupancy_lens.json')
room_occupancy_lens.json now contains all
information needed for exploration!
Python interface — Columns
>>> ls = lens.Summary.from_json('room_occupancy_lens.json')
>>> ls.columns
['date',
'Temperature',
'Humidity',
'Light',
'CO2',
'HumidityRatio',
'Occupancy']
Python interface — Categorical summary
>>> ls.summary('Occupancy')
{'name': 'Occupancy',
'desc': 'categorical',
'dtype': 'int64',
'nulls': 0, 'notnulls': 8143,
'unique': 2}
>>> ls.details('Occupancy')
{'desc': 'categorical',
'frequencies': {0: 6414, 1: 1729},
'name': 'Occupancy'}
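As a rough illustration of what goes into a categorical summary, the fields above can be reproduced with the standard library. This is only a sketch: `categorical_details` is a hypothetical helper, not part of the Lens API, and the values are toy data rather than the room occupancy dataset.

```python
from collections import Counter


def categorical_details(name, values):
    """Summarise a categorical column (hypothetical helper, not the Lens API)."""
    # Treat None as a missing value.
    present = [v for v in values if v is not None]
    freqs = Counter(present)
    return {
        "name": name,
        "desc": "categorical",
        "nulls": len(values) - len(present),
        "notnulls": len(present),
        "unique": len(freqs),
        "frequencies": dict(freqs),
    }


# Toy data, not the room occupancy dataset.
occupancy = [0, 0, 1, 0, None, 1, 0]
details = categorical_details("Occupancy", occupancy)
print(details)
```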
Python interface — Numeric summary
>>> ls.details('Temperature')
{'name': 'Temperature',
'desc': 'numeric',
'iqr': 1.69,
'min': 19.0, 'max': 23.18,
'mean': 20.619, 'median': 20.39,
'std': 1.0169, 'sum': 167901.1980}
Python interface — KDE, PDF
>>> import numpy as np
>>> x, y = ls.kde('Temperature')
>>> x[np.argmax(y)]
19.417999999999999
>>> temperature_pdf = ls.pdf('Temperature')
>>> temperature_pdf([19, 20, 21, 22])
array([ 0.01754398, 0.76491742,
0.58947765, 0.28421244])
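To make the KDE and PDF calls more concrete, here is a minimal Gaussian kernel density estimate in plain Python. It is a sketch under assumptions: the fixed bandwidth, the `gaussian_kde` helper, and the toy temperatures are all illustrative, and nothing here claims to match how Lens computes its estimate.

```python
import math


def gaussian_kde(samples, bandwidth):
    """Return a function estimating the density of `samples` (illustrative only)."""
    n = len(samples)
    norm = 1.0 / (n * bandwidth * math.sqrt(2.0 * math.pi))

    def pdf(x):
        # Sum of Gaussian kernels centred on each sample.
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )

    return pdf


# Toy temperatures, not the real dataset.
temps = [19.0, 19.4, 19.5, 19.6, 20.4, 21.0, 22.5]
pdf = gaussian_kde(temps, bandwidth=0.3)

# Evaluate on a grid and take the location of the maximum,
# which mirrors the x[np.argmax(y)] step in the slide.
grid = [19.0 + 0.01 * i for i in range(401)]
density = [pdf(x) for x in grid]
mode = grid[density.index(max(density))]
print(round(mode, 2))
```

The returned `pdf` plays the role of `temperature_pdf`: a callable that can be evaluated at arbitrary points once the samples have been seen.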
The lens.Summary is a
good building block, but
clunky for exploration.
Can we do better?
Jupyter
lens.Explorer for Jupyter
Jupyter: Column distribution
Jupyter: Correlation matrix
Jupyter widgets
Jupyter widgets: Column distribution
Jupyter widgets: Correlation matrix
Jupyter widgets: Pair density
Building Lens
Requirements
• Versatile
• Reproducible
• Portable
• Scalable
• Reusable
Our solution: Analysis
• A Python library computes dataset metrics:
• Column-wise statistics
• Pairwise densities
• ...
• Computation cost is paid up front.
• The result is serialized to JSON.
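The split between up-front computation and later exploration can be sketched with the standard library alone. The report layout and the `build_report` helper below are assumptions made for illustration; they are not the Lens report format.

```python
import json
import statistics


def build_report(columns):
    """Compute per-column statistics once (hypothetical report layout)."""
    report = {}
    for name, values in columns.items():
        report[name] = {
            "min": min(values),
            "max": max(values),
            "mean": statistics.mean(values),
            "median": statistics.median(values),
        }
    return report


# Toy data standing in for a real dataset.
data = {"Temperature": [19.0, 20.5, 21.0, 23.0], "Humidity": [25.0, 27.0, 31.0]}

# The expensive step: run once, then serialise.
serialised = json.dumps(build_report(data))

# Later, exploration needs only the JSON, not the original data.
loaded = json.loads(serialised)
print(loaded["Temperature"]["median"])
```

Because the serialised report is plain JSON, any consumer (a notebook widget or a web frontend) can read it without access to the original data.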
Our Solution: Interactive exploration
• Using only the report, the user can explore
the dataset through either:
• Jupyter widgets
• Web UI
The lens Python library
Why Python? Portability
A Python library easily runs in:
• Jupyter notebooks for interactive analysis.
• One-off scripts.
• Scheduled or on-demand jobs in a cluster.
Why Python? Reusability
• Allows us to use the Python data
ecosystem.
• Becomes a building block in the Data
Science process.
Why Python? Scalability
• Python is great for single-core, in-memory,
numerical computations through numpy,
scipy, pandas.
• But the GIL limits its ability to parallelise
workloads.
Can Python scale?
Out-of-core options in Python
From difficult but flexible to easy but restrictive:
• Threads, processes, MPI, ZeroMQ
• concurrent.futures, joblib
• Luigi
• PySpark
• Hadoop, SQL
Dask
Dask
Dask interface
• Dask objects are delayed objects.
• The user operates on them as Python
structures.
• Dask builds a DAG of the computation.
• When the final result is requested, the DAG
is executed on its workers (threads,
processes, or nodes).
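The lazy behaviour described above can be seen in a small sketch. The `increment` and `add` functions and the `calls` list are illustrative names, and the single-threaded `"synchronous"` scheduler is chosen here only so that the side effect is easy to observe.

```python
from dask import delayed

calls = []


@delayed
def increment(x):
    calls.append(x)  # record that the function body actually ran
    return x + 1


@delayed
def add(a, b):
    return a + b


# Building the graph does not run anything yet.
total = add(increment(1), increment(2))
ran_before_compute = len(calls)

# Requesting the result executes the graph.
result = total.compute(scheduler="synchronous")
print(ran_before_compute, result)
```

Until `compute` is called, `total` is only a description of the work; swapping the scheduler changes where the tasks run without changing the code that builds the graph.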
Dask data structures
• numpy.ndarray → dask.array
• pandas.DataFrame → dask.dataframe
• list, set → dask.bag
dask.delayed — Build your own DAG
files = ['myfile.a.data',
'myfile.b.data',
'myfile.c.data']
loaded = [load(i) for i in files]
cleaned = [clean(i) for i in loaded]
analyzed = analyze(cleaned)
store(analyzed)
dask.delayed — Build your own DAG
from dask import delayed

@delayed
def load(filename):
    ...

@delayed
def clean(data):
    ...

@delayed
def analyze(sequence_of_data):
    ...

@delayed
def store(result):
    with open(..., 'w') as f:
        f.write(result)
dask.delayed — Build your own DAG
files = ['myfile.a.data', 'myfile.b.data', 'myfile.c.data']
loaded = [load(i) for i in files]
cleaned = [clean(i) for i in loaded]
analyzed = analyze(cleaned)
stored = store(analyzed)
[Figure: task graph. Three load → clean branches (1 to 3) feed a single analyze task, followed by store.]
stored.compute()
Dask DAG execution
[Figure legend: in memory; released from memory]
Comparison with PySpark
• Native Python — better interaction with
Python libraries
• Easy deployment
• Focused on arbitrary graphs
• Optimized for
• low latency
• low memory usage
How do we use
Dask in Lens?
Lens pipeline
[Figure: Lens pipeline. DataFrame → colA, colB → PropA, PropB → SummA, SummB → OutA, OutB; Corr and PairDensity; all combined into Report.]
Building the graph with dask.delayed
# Create a series for each column in the DataFrame.
columns = df.columns
df = delayed(df)
cols = {k: delayed(df.get)(k) for k in columns}
# Create the delayed reports using Dask.
cprops = {k: delayed(metrics.column_properties)(cols[k])
          for k in columns}
csumms = {k: delayed(metrics.column_summary)(cols[k], cprops[k])
          for k in columns}
corr = delayed(metrics.correlation)(df, cprops)
Building the graph with dask.delayed
pdens_results = []
if pairdensities:
    for col1, col2 in itertools.combinations(columns, 2):
        pdens_df = delayed(pd.concat)([cols[col1], cols[col2]])
        pdens_cp = {k: cprops[k] for k in [col1, col2]}
        pdens_cs = {k: csumms[k] for k in [col1, col2]}
        pdens_fr = {k: freqs[k] for k in [col1, col2]}
        pdens = delayed(metrics.pairdensity)(
            pdens_df, pdens_cp, pdens_cs, pdens_fr)
        pdens_results.append(pdens)
# Join the delayed per-metric reports into a dictionary.
report = delayed(dict)(column_properties=cprops,
                       column_summary=csumms,
                       pair_density=pdens_results,
                       ...)
return report
• Graph for a two-column dataset generated by lens.
• The same code can be used for much wider datasets.
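The graph-building pattern from the previous slides can be tried end to end with stand-in metrics. Everything below is a sketch under assumptions: `column_properties`, `column_summary`, and `pair_correlation` are simplified placeholders written for this example, not the functions in the Lens `metrics` module, and the report keys are illustrative.

```python
import itertools

import pandas as pd
from dask import delayed


def column_properties(series):
    # Stand-in metric: classify the column by dtype and count nulls.
    kind = "numeric" if pd.api.types.is_numeric_dtype(series) else "categorical"
    return {"name": series.name, "desc": kind, "nulls": int(series.isnull().sum())}


def column_summary(series, props):
    # Stand-in metric: only numeric columns get statistics.
    if props["desc"] != "numeric":
        return {"name": series.name}
    return {"name": series.name, "min": float(series.min()), "max": float(series.max())}


def pair_correlation(a, b):
    # Stand-in for a pairwise metric such as the pair density.
    return {"pair": [a.name, b.name], "pearson": float(a.corr(b))}


def build_report(df):
    """Build a delayed report; nothing is computed until .compute() is called."""
    columns = list(df.columns)
    ddf = delayed(df)
    cols = {k: ddf[k] for k in columns}
    cprops = {k: delayed(column_properties)(cols[k]) for k in columns}
    csumms = {k: delayed(column_summary)(cols[k], cprops[k]) for k in columns}
    pairs = [
        delayed(pair_correlation)(cols[a], cols[b])
        for a, b in itertools.combinations(columns, 2)
    ]
    # Join the delayed per-metric results into one dictionary.
    return delayed(dict)(
        column_properties=cprops, column_summary=csumms, pairs=pairs
    )


# Toy two-column dataset.
df = pd.DataFrame({"a": [1.0, 2.0, 3.0, 4.0], "b": [2.0, 4.0, 6.0, 8.0]})
report = build_report(df).compute(scheduler="synchronous")
print(report["column_summary"]["a"])
```

Adding columns to `df` grows the graph automatically: one properties task and one summary task per column, plus one pairwise task per column pair.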
Integration with
infrastructure
SherlockML integration
• Every dataset entering the platform is
analysed by Lens.
• We can use the same Python library!
• The web frontend is used to interact with
datasets.
Accessing a Lens report
SherlockML: Column information
SherlockML: Column distribution
SherlockML: Correlation matrix
SherlockML: Pair density
Lens
• Keeps interactive exploration snappy by
splitting computation and exploration.
• Upfront computation leverages dask to
scale.
• Easy exploration no matter the data size!
• Keep your data scientists happy and
productive.
Lens
• Lens will be open source.
• You can use the library and service right
now on SherlockML.
Copyright © ASI 2016 All rights reserved
✓ Secure Scalable Compute
✓ Rapid Exploration and Cleaning
✓ Easy Collaboration
✓ Clear Communication
Secure, Powerful, Simple
https://sherlockml.com
Invite code: Strata2017

 
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
꧁❤ Aerocity Call Girls Service Aerocity Delhi ❤꧂ 9999965857 ☎️ Hard And Sexy ...
 

Automated Data Exploration: Building efficient analysis pipelines with Dask

  • 1. Automated Data Exploration Building efficient analysis pipelines with Dask Víctor Zabalza victor.z@asidatascience.com @zblz @ASIDataScience
  • 2. About me • Data engineer at ASI Data Science. • Former astrophysicist. • Main developer of naima, a Python package for radiative analysis of non-thermal astronomical sources. • matplotlib developer.
  • 3. ASI Data Science • Data science consultancy • Academia to Data Industry fellowship • SherlockML
  • 5. First steps in a Data Science project • Does the data fit in a single computer? • Data quality assessment • Data exploration • Data cleaning
  • 6. 80% → >30 h/week
  • 8. Can we automate the drudge work?
  • 9. Developing a tool for data exploration based on Dask
  • 10. Lens Library and service for automated data quality assessment and exploration
  • 11. Lens by example Room occupancy dataset • ML standard dataset • Goal: predict occupancy based on ambient measurements. • What can we learn about it with Lens?
  • 12. Python interface
      >>> import lens
      >>> import pandas as pd
      >>> df = pd.read_csv('room_occupancy.csv')
      >>> ls = lens.summarise(df)
      >>> type(ls)
      <class 'lens.summarise.Summary'>
      >>> ls.to_json('room_occupancy_lens.json')
      room_occupancy_lens.json now contains all information needed for exploration!
  • 13. Python interface: Columns
      >>> ls = lens.Summary.from_json('room_occupancy_lens.json')
      >>> ls.columns
      ['date',
       'Temperature',
       'Humidity',
       'Light',
       'CO2',
       'HumidityRatio',
       'Occupancy']
  • 14. Python interface: Categorical summary
      >>> ls.summary('Occupancy')
      {'name': 'Occupancy',
       'desc': 'categorical',
       'dtype': 'int64',
       'nulls': 0, 'notnulls': 8143,
       'unique': 2}
      >>> ls.details('Occupancy')
      {'desc': 'categorical',
       'frequencies': {0: 6414, 1: 1729},
       'name': 'Occupancy'}
  • 15. Python interface: Numeric summary
      >>> ls.details('Temperature')
      {'name': 'Temperature',
       'desc': 'numeric',
       'iqr': 1.69,
       'min': 19.0, 'max': 23.18,
       'mean': 20.619, 'median': 20.39,
       'std': 1.0169, 'sum': 167901.1980}
  • 16. Python interface: KDE, PDF
      >>> import numpy as np
      >>> x, y = ls.kde('Temperature')
      >>> x[np.argmax(y)]
      19.417999999999999
      >>> temperature_pdf = ls.pdf('Temperature')
      >>> temperature_pdf([19, 20, 21, 22])
      array([ 0.01754398, 0.76491742, 0.58947765, 0.28421244])
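The kde and pdf calls in slide 16 return a density estimate for a column. As a rough illustration of the idea only, and not of how Lens computes it, a Gaussian kernel density estimate can be sketched with the standard library. The function name gaussian_kde_grid, the bandwidth, and the temperature readings are all hypothetical.

```python
import math


def gaussian_kde_grid(samples, bandwidth, grid):
    """Evaluate a Gaussian KDE of `samples` at each point of `grid`."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    density = []
    for x in grid:
        # Sum one Gaussian bump per sample, centred on that sample.
        total = sum(math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples)
        density.append(norm * total)
    return density


# Made-up temperature readings, for illustration only.
temperatures = [19.2, 19.4, 19.5, 19.5, 20.1, 20.8, 21.3, 22.0]
grid = [19.0 + 0.1 * i for i in range(41)]  # 19.0 to 23.0 in steps of 0.1
density = gaussian_kde_grid(temperatures, bandwidth=0.3, grid=grid)

# The mode of the estimate is the grid point with the highest density,
# mirroring x[np.argmax(y)] in the slide.
mode = grid[max(range(len(grid)), key=density.__getitem__)]
```

The bandwidth controls how smooth the estimate is; the value used here is arbitrary.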
  • 17. The lens.Summary is a good building block, but clunky for exploration. Can we do better?
  • 23. Jupyter widgets: Column distribution
  • 28. Requirements • Versatile • Reproducible • Portable • Scalable • Reusable
  • 29. Our solution: Analysis • A Python library computes dataset metrics: • Column-wise statistics • Pairwise densities • ... • Computation cost is paid up front. • The result is serialized to JSON.
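The "compute once, serialize to JSON" idea in slide 29 can be sketched with the standard library. This is a simplified stand-in, not the Lens implementation: summarise_columns, the dict-of-lists table format, and the sample values are assumptions made for the example, with field names loosely modelled on the summaries shown earlier in the talk.

```python
import json
import statistics
from collections import Counter


def summarise_columns(table):
    """Build a per-column report from a dict of column name -> list of values."""
    report = {}
    for name, values in table.items():
        notnull = [v for v in values if v is not None]
        entry = {
            "name": name,
            "nulls": len(values) - len(notnull),
            "notnulls": len(notnull),
            "unique": len(set(notnull)),
        }
        if notnull and all(isinstance(v, float) for v in notnull):
            entry["desc"] = "numeric"
            entry["min"] = min(notnull)
            entry["max"] = max(notnull)
            entry["mean"] = statistics.mean(notnull)
        else:
            entry["desc"] = "categorical"
            # JSON object keys are strings, so convert them explicitly.
            entry["frequencies"] = {str(k): n for k, n in Counter(notnull).items()}
        report[name] = entry
    return report


# Made-up sample data, for illustration only.
table = {
    "Temperature": [23.18, 23.15, None, 21.0],
    "Occupancy": [1, 1, 0, 0],
}

# The cost is paid once, here; afterwards only the JSON text is needed.
serialised = json.dumps(summarise_columns(table))
restored = json.loads(serialised)
occupancy_counts = restored["Occupancy"]["frequencies"]
```

Everything a later exploration step needs is read from restored; the original table is no longer required.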
  • 30. Our solution: Interactive exploration • Using only the report, the user can explore the dataset through either: • Jupyter widgets • Web UI
  • 31. The lens Python library
  • 32. Why Python? Portability A Python library easily runs in: • Jupyter notebooks for interactive analysis. • One-off scripts. • Scheduled or on-demand jobs in a cluster.
  • 33. Why Python? Reusability • Allows us to use the Python data ecosystem. • Becomes a building block in the Data Science process.
  • 34. Why Python? Scalability • Python is great for single-core, in-memory, numerical computations through numpy, scipy, pandas. • But the GIL limits its ability to parallelise workloads. Can Python scale?
  • 35. Out-of-core options in Python [Figure: tools placed along an axis running from difficult and flexible to easy and restrictive. Tools shown: threads, processes, MPI, ZeroMQ; concurrent.futures, joblib; Luigi; PySpark; Hadoop, SQL; Dask.]
  • 36. Dask
  • 37. Dask interface • Dask objects are delayed objects. • The user operates on them as Python structures. • Dask builds a DAG of the computation. • When the final result is requested, the DAG is executed on its workers (threads, processes, or nodes).
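The deferred-execution model in slide 37 can be illustrated with a toy class. This is a deliberately tiny sketch of the concept and not Dask's implementation: the Deferred class and the add and double functions are hypothetical, and the sketch has no scheduler, no parallelism, and no sharing of repeated nodes.

```python
class Deferred:
    """A toy deferred call: stores a function and its arguments, runs nothing."""

    def __init__(self, func, *args):
        self.func = func
        self.args = args

    def compute(self):
        # Walk the graph depth-first: resolve deferred arguments, then call.
        resolved = [a.compute() if isinstance(a, Deferred) else a for a in self.args]
        return self.func(*resolved)


calls = []  # records the order in which functions actually run


def add(x, y):
    calls.append("add")
    return x + y


def double(x):
    calls.append("double")
    return 2 * x


# Building the graph executes nothing yet.
total = Deferred(add, Deferred(double, 3), Deferred(double, 4))
calls_before = list(calls)

# Requesting the final result triggers execution of the whole graph.
result = total.compute()
```

Here calls_before is empty because nothing runs until compute() is called, which is the behaviour the slide describes for Dask objects.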
  • 38. Dask data structures • numpy.ndarray → dask.array • pandas.DataFrame → dask.dataframe • list, set → dask.bag
  • 39. dask.delayed: Build your own DAG
      files = ['myfile.a.data', 'myfile.b.data', 'myfile.c.data']
      loaded = [load(i) for i in files]
      cleaned = [clean(i) for i in loaded]
      analyzed = analyze(cleaned)
      store(analyzed)
  • 40. dask.delayed: Build your own DAG
      from dask import delayed
      @delayed
      def load(filename):
          ...
      @delayed
      def clean(data):
          ...
      @delayed
      def analyze(sequence_of_data):
          ...
      @delayed
      def store(result):
          with open(..., 'w') as f:
              f.write(result)
  • 41. dask.delayed: Build your own DAG
      files = ['myfile.a.data', 'myfile.b.data', 'myfile.c.data']
      loaded = [load(i) for i in files]
      cleaned = [clean(i) for i in loaded]
      analyzed = analyze(cleaned)
      stored = store(analyzed)
      stored.compute()
      [Figure: task graph in which three load → clean chains feed a single analyze step, followed by store.]
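The pipeline in slides 39 to 41 elides the function bodies. A runnable variant, assuming dask is installed, could look like the sketch below. The bodies of load, clean, and analyze are hypothetical stand-ins that work on in-memory values instead of files, and the store step is left out so that the example writes nothing to disk.

```python
from dask import delayed


@delayed
def load(filename):
    # Hypothetical stand-in for reading a file: derive numbers from the name.
    return [len(filename), len(filename) + 1]


@delayed
def clean(data):
    # Hypothetical cleaning step: keep even values only.
    return [v for v in data if v % 2 == 0]


@delayed
def analyze(sequence_of_data):
    # Combine the cleaned pieces into a single number.
    return sum(sum(part) for part in sequence_of_data)


files = ['myfile.a.data', 'myfile.b.data', 'myfile.c.data']
loaded = [load(i) for i in files]
cleaned = [clean(i) for i in loaded]
analyzed = analyze(cleaned)  # still a Delayed object; nothing has run yet

result = analyzed.compute()  # executes the whole graph
```

Each call to a decorated function only adds a node to the graph; the three load and clean chains are independent, so a scheduler is free to run them concurrently.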
  • 42. Dask DAG execution [Figure: task graph during execution. Legend: in memory; released from memory.]
  • 43. Comparison with PySpark • Native Python — better interaction with Python libraries • Easy deployment • Focused on arbitrary graphs • Optimized for • low latency • low memory usage
  • 44. How do we use Dask in Lens?
  • 46. Building the graph with dask.delayed
      # Create a series for each column in the DataFrame.
      columns = df.columns
      df = delayed(df)
      cols = {k: delayed(df.get)(k) for k in columns}
      # Create the delayed reports using Dask.
      cprops = {k: delayed(metrics.column_properties)(cols[k])
                for k in columns}
      csumms = {k: delayed(metrics.column_summary)(cols[k], cprops[k])
                for k in columns}
      corr = delayed(metrics.correlation)(df, cprops)
  • 47. Building the graph with dask.delayed
      pdens_results = []
      if pairdensities:
          for col1, col2 in itertools.combinations(columns, 2):
              pdens_df = delayed(pd.concat)([cols[col1], cols[col2]])
              pdens_cp = {k: cprops[k] for k in [col1, col2]}
              pdens_cs = {k: csumms[k] for k in [col1, col2]}
              pdens_fr = {k: freqs[k] for k in [col1, col2]}
              pdens = delayed(metrics.pairdensity)(
                  pdens_df, pdens_cp, pdens_cs, pdens_fr)
              pdens_results.append(pdens)
      # Join the delayed per-metric reports into a dictionary.
      report = delayed(dict)(column_properties=cprops,
                             column_summary=csumms,
                             pair_density=pdens_results,
                             ...)
      return report
  • 48. • Graph for two-column dataset generated by lens. • The same code can be used for much wider datasets.
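To give a feel for how the graph grows with dataset width: the pairwise density loop visits every unordered pair of columns, so the number of pair tasks for n columns is n(n-1)/2. A small sketch using the room occupancy column names from earlier in the talk; count_pair_tasks is a hypothetical helper, not part of Lens.

```python
import itertools


def count_pair_tasks(columns):
    """Number of column pairs that a pairwise metric would have to visit."""
    return sum(1 for _ in itertools.combinations(columns, 2))


columns = ['date', 'Temperature', 'Humidity', 'Light',
           'CO2', 'HumidityRatio', 'Occupancy']

# Every unordered pair of distinct columns, in the order the loop visits them.
pairs = list(itertools.combinations(columns, 2))

n_pairs_room = count_pair_tasks(columns)      # 7 columns
n_pairs_wide = count_pair_tasks(range(100))   # a 100-column dataset
```

The graph-building code stays the same for wider datasets; only the number of nodes changes.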
  • 51. SherlockML integration • Every dataset entering the platform is analysed by Lens. • We can use the same Python library! • The web frontend is used to interact with datasets.
  • 57. Lens • Keeps interactive exploration snappy by splitting computation and exploration. • Upfront computation leverages dask to scale. • Easy exploration no matter the data size! • Keep your data scientists happy and productive.
  • 58. Lens • Lens will be open source. • You can use the library and service right now on SherlockML.
  • 59. Copyright © ASI 2016. All rights reserved. ✓ Secure Scalable Compute ✓ Rapid Exploration and Cleaning ✓ Easy Collaboration ✓ Clear Communication. Secure, Powerful, Simple.