Automated Data Exploration
Building efficient analysis pipelines with Dask
Víctor Zabalza
victor.z@asidatascience.com
github.com/zblz
About me
• ASI Data Science on SherlockML.
• Former astrophysicist.
• Main developer of naima, a package to model non-thermal astrophysical sources.
• matplotlib developer.
Data Exploration
First steps in a Data Science project
• Does the data fit in a single computer?
• Data quality assessment
• Data exploration
• Data cleaning
80% of project time → >30 h/week
Can we automate
the drudge work?
Developing a tool
for data exploration
based on Dask
Lens
Open source library for automated data exploration
Lens by example
Room occupancy dataset
• Standard ML dataset
• Goal: predict whether the room is occupied from ambient measurements.
• What can we learn about it with Lens?
Python interface
>>> import lens
>>> df = pd.read_csv('room_occupancy.csv')
>>> ls = lens.summarise(df)
>>> type(ls)
<class 'lens.summarise.Summary'>
>>> ls.to_json('room_occupancy_lens.json')
room_occupancy_lens.json now contains all information needed for exploration!
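
The report can be reloaded later without touching the original data; a minimal sketch, assuming a from_json counterpart to the to_json method shown above:

>>> ls = lens.Summary.from_json('room_occupancy_lens.json')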
Python interface — Columns
>>> ls.columns
['date',
'Temperature',
'Humidity',
'Light',
'CO2',
'HumidityRatio',
'Occupancy']
Python interface — Categorical summary
>>> ls.summary('Occupancy')
{'name': 'Occupancy',
'desc': 'categorical',
'dtype': 'int64',
'nulls': 0, 'notnulls': 8143,
'unique': 2}
>>> ls.details('Occupancy')
{'desc': 'categorical',
'frequencies': {0: 6414, 1: 1729},
'name': 'Occupancy'}
Python interface — Numeric summary
>>> ls.details('Temperature')
{'name': 'Temperature',
'desc': 'numeric',
'iqr': 1.69,
'min': 19.0, 'max': 23.18,
'mean': 20.619, 'median': 20.39,
'std': 1.0169, 'sum': 167901.1980}
The lens.Summary is a good building block, but clunky for exploration.
Can we do better?
Jupyter widgets
Jupyter widgets: Column distribution
Jupyter widgets: Correlation matrix
Jupyter widgets: Pair density
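
In a notebook, these widgets are built directly from the summary object. A minimal sketch; the entry point name is an assumption based on the lens docs, so treat it as such:

>>> lens.interactive_explore(ls)  # assumed helper name; spawns the widget UI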
Building Lens
Our solution: Analysis
• A Python library computes dataset metrics:
  • Column-wise statistics
  • Pairwise densities
  • ...
• Computation cost is paid up front.
• The result is serialized to JSON.
Our solution: Interactive exploration
• Using only the report, the user can explore the dataset through either:
  • Jupyter widgets
  • Web UI
The lens Python
library
Why Python?
• Data Scientists
• Portability
• Reusability
• Scalability
Can Python scale?
Out-of-core options in Python

[Chart: tools placed along two axes, difficult ↔ easy and flexible ↔ restrictive: threads, processes, MPI, ZeroMQ; concurrent.futures, joblib; Luigi; PySpark; Hadoop, SQL; and finally Dask]
Dask
Dask interface
• Dask objects are delayed.
• The user operates on them as Python structures.
• Dask builds a DAG of the computation.
• The DAG is executed when a result is requested.
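
A minimal illustration of this lazy behaviour with dask.array (illustrative, not from the talk):

import dask.array as da

x = da.random.random((10000, 10000), chunks=(1000, 1000))  # delayed array
y = (x + x.T).mean(axis=0)  # each operation only extends the DAG
result = y.compute()        # the DAG is executed here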
Dask data structures
• numpy.ndarray → dask.array
• pandas.DataFrame → dask.dataframe
• list, set → dask.bag
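
For example, dask.dataframe mirrors the pandas API while partitioning the work; a sketch reusing the room occupancy file from earlier:

import dask.dataframe as dd

df = dd.read_csv('room_occupancy.csv')  # lazily partitioned DataFrame
mean_temp = df['Temperature'].mean()    # builds a task graph, no work yet
print(mean_temp.compute())              # executes the graph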
dask.delayed — Build your own DAG
files = ['myfile.a.data', 'myfile.b.data', 'myfile.c.data']

# Plain Python: each step runs eagerly, one after the other.
loaded = [load(i) for i in files]
cleaned = [clean(i) for i in loaded]
analyzed = analyze(cleaned)
store(analyzed)
dask.delayed — Build your own DAG

from dask import delayed

@delayed
def load(filename):
    ...

@delayed
def clean(data):
    ...

@delayed
def analyze(sequence_of_data):
    ...

@delayed
def store(result):
    with open(..., 'w') as f:
        f.write(result)
dask.delayed — Build your own DAG

files = ['myfile.a.data', 'myfile.b.data', 'myfile.c.data']

# With @delayed, the same code now only builds a task graph.
loaded = [load(i) for i in files]
cleaned = [clean(i) for i in loaded]
analyzed = analyze(cleaned)
stored = store(analyzed)  # stored is a Delayed object; nothing has run yet
dask.delayed — Build your own DAG

[Figure: task graph for the pipeline, with three load → clean branches feeding analyze and then store]
stored.compute()  # only now is the DAG executed
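
The task graph can also be rendered straight from the delayed object (this needs the optional graphviz dependency):

stored.visualize()  # writes the graph to mydask.png by default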
Dask DAG execution

[Animation: the task graph executing step by step; legend: in memory vs. released from memory]
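
Which scheduler executes the graph can be chosen at compute time; a sketch using the scheduler keyword from newer Dask releases:

stored.compute(scheduler='threads')      # thread pool, shared memory
stored.compute(scheduler='processes')    # process pool
stored.compute(scheduler='synchronous')  # single thread, easiest to debug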
How do we use
Dask in Lens?
Lens pipeline

[Diagram, built up in stages: DataFrame → columns (colA, colB) → properties (PropA, PropB) → summaries (SummA, SummB) → outputs (OutA, OutB), with Corr and PairDensity computed across columns, all feeding the final Report]
• Graph for a two-column dataset generated by lens.
• The same code can be used for much wider datasets, as sketched below.
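
A hypothetical sketch of how such a per-column graph can be assembled with dask.delayed; the helper names are illustrative, not Lens's actual internals:

from dask import delayed

# Hypothetical stand-ins for Lens's real metric functions.
@delayed
def properties(series):
    ...  # e.g. infer dtype, count nulls

@delayed
def summary(series, props):
    ...  # e.g. frequencies or numeric statistics

@delayed
def report(summaries, correlations):
    ...  # assemble the JSON-serialisable result

def build_graph(df):
    # One independent branch per column; Dask can run them in parallel,
    # so the same code scales to arbitrarily wide datasets.
    props = {col: properties(df[col]) for col in df.columns}
    summaries = [summary(df[col], props[col]) for col in df.columns]
    correlations = delayed(lambda d: d.corr())(df)
    return report(summaries, correlations)

# build_graph(df).compute() executes the whole pipeline.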
Integration with
SherlockML
SherlockML integration
• Every dataset entering the platform is analysed by Lens.
• We can use the same Python library!
• The web frontend is used to interact with datasets.
SherlockML: Column information
SherlockML: Column distribution
SherlockML: Correlation matrix
SherlockML: Pair density
Data Exploration with Lens
• Scalable compute with dask.
• Snappy interactive exploration.
• Lens is open source:
• GitHub: ASIDataScience/lens
• Docs: https://lens.rtfd.io
• PyPI: pip install lens
