Automated Data Exploration
Building efficient analysis pipelines with Dask
Víctor Zabalza
victor.z@asidatascience.com
github.com/zblz
About me
• ASI Data Science on SherlockML.
• Former astrophysicist.
• Main developer of naima, a package to model non-thermal astrophysical sources.
• matplotlib developer.
Data Exploration
First steps in a Data Science project
• Does the data fit in a single computer?
• Data quality assessment
• Data exploration
• Data cleaning
80% of project time → >30 h/week
Can we automate
the drudge work?
Developing a tool
for data exploration
based on Dask
Lens
Open source library for automated data exploration
Lens by example
Room occupancy dataset
• Standard ML dataset
• Goal: predict whether the room is occupied from ambient measurements.
• What can we learn about it with Lens?
Python interface
>>> import lens
>>> df = pd.read_csv('room_occupancy.csv')
>>> ls = lens.summarise(df)
>>> type(ls)
<class 'lens.summarise.Summary'>
>>> ls.to_json('room_occupancy_lens.json')
room_occupancy_lens.json now contains all information needed for exploration!
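
The report can be reloaded later without touching the original data; a minimal sketch, assuming a from_json counterpart to the to_json method shown above:

>>> ls = lens.Summary.from_json('room_occupancy_lens.json')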
Python interface — Columns
>>> ls.columns
['date',
'Temperature',
'Humidity',
'Light',
'CO2',
'HumidityRatio',
'Occupancy']
Python interface — Categorical summary
>>> ls.summary('Occupancy')
{'name': 'Occupancy',
'desc': 'categorical',
'dtype': 'int64',
'nulls': 0, 'notnulls': 8143,
'unique': 2}
>>> ls.details('Occupancy')
{'desc': 'categorical',
'frequencies': {0: 6414, 1: 1729},
'name': 'Occupancy'}
Python interface — Numeric summary
>>> ls.details('Temperature')
{'name': 'Temperature',
'desc': 'numeric',
'iqr': 1.69,
'min': 19.0, 'max': 23.18,
'mean': 20.619, 'median': 20.39,
'std': 1.0169, 'sum': 167901.1980}
The lens.Summary is a good building block, but clunky for exploration.
Can we do better?
Jupyter widgets
Jupyter widgets: Column distribution
Jupyter widgets: Correlation matrix
Jupyter widgets: Pair density
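
In a notebook, these widgets are built directly from the summary object. A minimal sketch; the entry point name is an assumption based on the lens docs, so treat it as such:

>>> lens.interactive_explore(ls)  # assumed helper name; spawns the widget UI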
Building Lens
Our solution: Analysis
• A Python library computes dataset metrics:
  • Column-wise statistics
  • Pairwise densities
  • ...
• Computation cost is paid up front.
• The result is serialized to JSON.
Our solution: Interactive exploration
• Using only the report, the user can explore the dataset through either:
  • Jupyter widgets
  • Web UI
The lens Python
library
Why Python?
• Data Scientists
• Portability
• Reusability
• Scalability
Can Python scale?
Out-of-core options in Python

[Chart: tools placed along two axes, difficult ↔ easy and flexible ↔ restrictive: threads, processes, MPI, ZeroMQ; concurrent.futures, joblib; Luigi; PySpark; Hadoop, SQL; and finally Dask]
Dask
Dask interface
• Dask objects are delayed.
• The user operates on them as Python structures.
• Dask builds a DAG of the computation.
• The DAG is executed when a result is requested.
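
A minimal illustration of this lazy behaviour with dask.array (illustrative, not from the talk):

import dask.array as da

x = da.random.random((10000, 10000), chunks=(1000, 1000))  # delayed array
y = (x + x.T).mean(axis=0)  # each operation only extends the DAG
result = y.compute()        # the DAG is executed here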
Dask data structures
• numpy.ndarray → dask.array
• pandas.DataFrame → dask.dataframe
• list, set → dask.bag
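
For example, dask.dataframe mirrors the pandas API while partitioning the work; a sketch reusing the room occupancy file from earlier:

import dask.dataframe as dd

df = dd.read_csv('room_occupancy.csv')  # lazily partitioned DataFrame
mean_temp = df['Temperature'].mean()    # builds a task graph, no work yet
print(mean_temp.compute())              # executes the graph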
dask.delayed — Build your own DAG
files = ['myfile.a.data', 'myfile.b.data', 'myfile.c.data']

# Plain Python: each step runs eagerly, one after the other.
loaded = [load(i) for i in files]
cleaned = [clean(i) for i in loaded]
analyzed = analyze(cleaned)
store(analyzed)
dask.delayed — Build your own DAG

from dask import delayed

@delayed
def load(filename):
    ...

@delayed
def clean(data):
    ...

@delayed
def analyze(sequence_of_data):
    ...

@delayed
def store(result):
    with open(..., 'w') as f:
        f.write(result)
dask.delayed — Build your own DAG

files = ['myfile.a.data', 'myfile.b.data', 'myfile.c.data']

# With @delayed, the same code now only builds a task graph.
loaded = [load(i) for i in files]
cleaned = [clean(i) for i in loaded]
analyzed = analyze(cleaned)
stored = store(analyzed)  # stored is a Delayed object; nothing has run yet
dask.delayed — Build your own DAG

[Figure: task graph for the pipeline, with three load → clean branches feeding analyze and then store]
stored.compute()  # only now is the DAG executed
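
The task graph can also be rendered straight from the delayed object (this needs the optional graphviz dependency):

stored.visualize()  # writes the graph to mydask.png by default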
Dask DAG execution

[Animation: the task graph executing step by step; legend: in memory vs. released from memory]
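
Which scheduler executes the graph can be chosen at compute time; a sketch using the scheduler keyword from newer Dask releases:

stored.compute(scheduler='threads')      # thread pool, shared memory
stored.compute(scheduler='processes')    # process pool
stored.compute(scheduler='synchronous')  # single thread, easiest to debug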
How do we use
Dask in Lens?
Lens pipeline

[Diagram, built up in stages: DataFrame → columns (colA, colB) → properties (PropA, PropB) → summaries (SummA, SummB) → outputs (OutA, OutB), with Corr and PairDensity computed across columns, all feeding the final Report]
• Graph for a two-column dataset generated by lens.
• The same code can be used for much wider datasets, as sketched below.
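
A hypothetical sketch of how such a per-column graph can be assembled with dask.delayed; the helper names are illustrative, not Lens's actual internals:

from dask import delayed

# Hypothetical stand-ins for Lens's real metric functions.
@delayed
def properties(series):
    ...  # e.g. infer dtype, count nulls

@delayed
def summary(series, props):
    ...  # e.g. frequencies or numeric statistics

@delayed
def report(summaries, correlations):
    ...  # assemble the JSON-serialisable result

def build_graph(df):
    # One independent branch per column; Dask can run them in parallel,
    # so the same code scales to arbitrarily wide datasets.
    props = {col: properties(df[col]) for col in df.columns}
    summaries = [summary(df[col], props[col]) for col in df.columns]
    correlations = delayed(lambda d: d.corr())(df)
    return report(summaries, correlations)

# build_graph(df).compute() executes the whole pipeline.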
Integration with
SherlockML
SherlockML integration
• Every dataset entering the platform is analysed by Lens.
• We can use the same Python library!
• The web frontend is used to interact with datasets.
SherlockML: Column information
SherlockML: Column distribution
SherlockML: Correlation matrix
SherlockML: Pair density
Data Exploration with Lens
• Scalable compute with dask.
• Snappy interactive exploration.
• Lens is open source:
• GitHub: ASIDataScience/lens
• Docs: https://lens.rtfd.io
• PyPI: pip install lens
