Dask can be used to automate the initial data exploration process by building an analysis graph upfront. Lens is a library and service that uses Dask to compute metrics on datasets like column statistics, correlations, and pair densities. This pre-computes the information needed for exploration so users can interactively examine datasets of any size through Jupyter widgets or a web UI without waiting for new computations. Lens was developed to improve data scientist productivity and is integrated into the SherlockML platform for collaborative data science projects.
2. About me
• Data engineer at ASI Data Science.
• Former astrophysicist.
• Main developer of naima, a Python package for radiative analysis of non-thermal astronomical sources.
• matplotlib developer.
3. ASI Data Science
• Data science consultancy
• Academia to Data Industry fellowship
• SherlockML
11. Lens by example
Room occupancy dataset
• A standard ML benchmark dataset.
• Goal: predict occupancy based on ambient measurements.
• What can we learn about it with Lens?
12. Python interface
>>> import lens
>>> df = pd.read_csv('room_occupancy.csv')
>>> ls = lens.summarise(df)
>>> type(ls)
<class 'lens.summarise.Summary'>
>>> ls.to_json('room_occupancy_lens.json')
room_occupancy_lens.json now contains all information needed for exploration!
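The report is ordinary JSON, so later exploration only needs that file, not the raw data. A sketch with a made-up, heavily simplified schema (the real Lens report is much richer):

```python
import json
import tempfile

# Hypothetical, simplified report contents with illustrative numbers; the
# real Lens JSON holds column properties, summaries, correlations, and
# pair densities for the whole dataset.
report = {
    "column_summary": {
        "Temperature": {"mean": 20.6, "min": 19.0, "max": 23.2},
        "Humidity": {"mean": 25.7, "min": 16.7, "max": 39.1},
    }
}

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(report, f)
    path = f.name

# Exploration reads only this small file, never the raw dataset.
with open(path) as f:
    loaded = json.load(f)
print(loaded["column_summary"]["Temperature"]["max"])  # -> 23.2
```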
29. Our solution: Analysis
• A Python library computes dataset metrics:
  • Column-wise statistics
  • Pairwise densities
  • ...
• The computation cost is paid up front.
• The result is serialized to JSON.
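The pattern is easy to sketch with the standard library alone (column names, numbers, and metric choices here are illustrative):

```python
import json
import statistics

columns = {
    "Temperature": [21.0, 21.5, 22.0, 21.2],
    "CO2": [430.0, 445.0, 480.0, 455.0],
}

# Pay the computation cost once, up front...
report = {
    name: {
        "count": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }
    for name, values in columns.items()
}

# ...then serialize, so later exploration is just a dictionary lookup.
serialized = json.dumps(report)
assert json.loads(serialized)["Temperature"]["count"] == 4
```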
30. Our Solution: Interactive exploration
• Using only the report, the user can explore the dataset through either:
  • Jupyter widgets
  • Web UI
32. Why Python? Portability
A Python library easily runs in:
• Jupyter notebooks for interactive analysis.
• One-off scripts.
• Scheduled or on-demand jobs in a cluster.
33. Why Python? Reusability
• Allows us to use the Python data ecosystem.
• Becomes a building block in the data science process.
34. Why Python? Scalability
• Python is great for single-core, in-memory numerical computations through numpy, scipy, and pandas.
• But the GIL limits its ability to parallelise CPU-bound workloads.
Can Python scale?
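A minimal sketch of the constraint (stand-in function names; timings omitted for brevity): a pure-Python, CPU-bound task gains no parallelism from threads because they contend for the GIL, while a process pool gives each worker its own interpreter, and hence its own GIL.

```python
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def cpu_bound(n):
    # Pure-Python loop: it holds the GIL, so threads take turns running it.
    total = 0
    for i in range(n):
        total += i * i
    return total

def run_in_pool(executor_cls, workloads):
    # The same code drives either pool; only the executor class changes.
    with executor_cls(max_workers=4) as pool:
        return list(pool.map(cpu_bound, workloads))

if __name__ == "__main__":
    workloads = [200_000] * 4
    # Identical results either way, but only the process pool can use
    # several cores at once for this CPU-bound workload.
    threaded = run_in_pool(ThreadPoolExecutor, workloads)
    multiproc = run_in_pool(ProcessPoolExecutor, workloads)
    assert threaded == multiproc
```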
35. Out-of-core options in Python
[Chart placing the options on two axes, from difficult but flexible to easy but restrictive: threads, processes, MPI, ZeroMQ → concurrent.futures, joblib → Luigi → PySpark → Hadoop, SQL. Dask is shown apart, aiming to be both easy and flexible.]
37. Dask interface
• Dask objects are delayed objects.
• The user operates on them as ordinary Python structures.
• Dask builds a DAG of the computation.
• When the final result is requested, the DAG is executed on its workers (threads, processes, or nodes).
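The mechanism can be sketched in a few lines of plain Python (a toy stand-in, not Dask's actual implementation): wrapped calls return graph nodes instead of results, and `compute()` walks the graph.

```python
class Delayed:
    def __init__(self, func, args):
        self.func = func  # operation to run later
        self.args = args  # inputs: plain values or other Delayed nodes

    def compute(self, cache=None):
        # Evaluate the DAG bottom-up, memoising shared nodes so each
        # task runs at most once.
        if cache is None:
            cache = {}
        if id(self) in cache:
            return cache[id(self)]
        resolved = [a.compute(cache) if isinstance(a, Delayed) else a
                    for a in self.args]
        cache[id(self)] = self.func(*resolved)
        return cache[id(self)]

def delayed(func):
    # Wrap a function so that calling it builds a graph node
    # instead of running immediately.
    def wrapper(*args):
        return Delayed(func, args)
    return wrapper

@delayed
def double(x):
    return 2 * x

@delayed
def add(x, y):
    return x + y

graph = add(double(3), double(4))  # nothing computed yet
print(graph.compute())             # -> 14
```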
38. Dask data structures
• numpy.ndarray → dask.array
• pandas.DataFrame → dask.dataframe
• list, set → dask.bag
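For example, a `dask.array` accepts the same operations as the `numpy.ndarray` it mirrors, while chunking the work behind the scenes (assumes `dask` and `numpy` are installed):

```python
import numpy as np
import dask.array as da

x = np.arange(12).reshape(3, 4)
# The same array, split into 3x2 chunks that Dask can process independently.
dx = da.from_array(x, chunks=(3, 2))

# Familiar numpy-style operations build a graph; .compute() executes it.
assert dx.sum().compute() == x.sum()
assert (dx + 1).mean().compute() == (x + 1).mean()
```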
39. dask.delayed — Build your own DAG
files = ['myfile.a.data',
         'myfile.b.data',
         'myfile.c.data']
loaded = [load(i) for i in files]
cleaned = [clean(i) for i in loaded]
analyzed = analyze(cleaned)
store(analyzed)
40. dask.delayed — Build your own DAG
@delayed
def load(filename):
    ...

@delayed
def clean(data):
    ...

@delayed
def analyze(sequence_of_data):
    ...

@delayed
def store(result):
    with open(..., 'w') as f:
        f.write(result)
41. dask.delayed — Build your own DAG
files = ['myfile.a.data', 'myfile.b.data', 'myfile.c.data']
loaded = [load(i) for i in files]
cleaned = [clean(i) for i in loaded]
analyzed = analyze(cleaned)
stored = store(analyzed)
[Task graph diagram: a load → clean chain per file, all feeding a single analyze node, which feeds store]
stored.compute()
46. Building the graph with dask.delayed
# Create a series for each column in the DataFrame.
columns = df.columns
df = delayed(df)
cols = {k: delayed(df.get)(k) for k in columns}

# Create the delayed reports using Dask.
cprops = {k: delayed(metrics.column_properties)(cols[k])
          for k in columns}
csumms = {k: delayed(metrics.column_summary)(cols[k], cprops[k])
          for k in columns}
corr = delayed(metrics.correlation)(df, cprops)
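Stripping away the `delayed` wrappers, a minimal eager version of the per-column stage looks like this (hypothetical toy metric functions, not Lens's real `metrics` module):

```python
def column_properties(values):
    # Cheap type/nullity checks, computed once per column.
    return {
        "numeric": all(isinstance(v, (int, float))
                       for v in values if v is not None),
        "n_missing": sum(v is None for v in values),
    }

def column_summary(values, props):
    # The summary reuses the properties to decide what to compute.
    if not props["numeric"]:
        return {"distinct": len(set(values))}
    present = [v for v in values if v is not None]
    return {"min": min(present), "max": max(present)}

# A tiny stand-in "DataFrame": column name -> values.
df = {"Temperature": [21.0, 22.5, None], "Status": ["open", "closed", "open"]}
cprops = {k: column_properties(v) for k, v in df.items()}
csumms = {k: column_summary(v, cprops[k]) for k, v in df.items()}
```

In the real library each of these calls becomes a `delayed` graph node, so Dask can run independent columns in parallel.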
47. Building the graph with dask.delayed
pdens_results = []
if pairdensities:
    for col1, col2 in itertools.combinations(columns, 2):
        pdens_df = delayed(pd.concat)([cols[col1], cols[col2]])
        pdens_cp = {k: cprops[k] for k in [col1, col2]}
        pdens_cs = {k: csumms[k] for k in [col1, col2]}
        pdens_fr = {k: freqs[k] for k in [col1, col2]}
        pdens = delayed(metrics.pairdensity)(
            pdens_df, pdens_cp, pdens_cs, pdens_fr)
        pdens_results.append(pdens)

# Join the delayed per-metric reports into a dictionary.
report = delayed(dict)(column_properties=cprops,
                       column_summary=csumms,
                       pair_density=pdens_results,
                       ...)
return report
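Eagerly, and with a stand-in metric in place of `metrics.pairdensity`, the pairwise stage and the final report assembly reduce to the following (all names and numbers illustrative):

```python
import itertools
import statistics

columns = {
    "Temperature": [21.0, 21.5, 22.0, 22.5],
    "CO2": [430.0, 445.0, 480.0, 455.0],
    "Humidity": [25.0, 26.0, 24.0, 27.0],
}

def pair_metric(x, y):
    # Stand-in for metrics.pairdensity: a plain Pearson correlation.
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = (sum((a - mx) ** 2 for a in x)
           * sum((b - my) ** 2 for b in y)) ** 0.5
    return cov / var

# One result per unordered column pair, exactly as in the Dask version.
pairwise = {
    (c1, c2): pair_metric(columns[c1], columns[c2])
    for c1, c2 in itertools.combinations(columns, 2)
}
report = {"columns": list(columns), "pairwise": pairwise}
```

Wrapping the final `dict(...)` in `delayed` is what stitches every per-metric node into a single graph whose one output is the whole report.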
51. SherlockML integration
• Every dataset entering the platform is analysed by Lens.
• We can use the same Python library!
• The web frontend is used to interact with datasets.
57. Lens
• Keeps interactive exploration snappy by splitting computation from exploration.
• Upfront computation leverages Dask to scale.
• Easy exploration no matter the data size!
• Keeps your data scientists happy and productive.
58. Lens
• Lens will be open source.
• You can use the library and service right now on SherlockML.