Dask can be used to automate the initial data exploration process by building an analysis graph upfront. Lens is a library and service that uses Dask to compute metrics on datasets like column statistics, correlations, and pair densities. This pre-computes the information needed for exploration so users can interactively examine datasets of any size through Jupyter widgets or a web UI without waiting for new computations. Lens was developed to improve data scientist productivity and is integrated into the SherlockML platform for collaborative data science projects.
2. About me
• Data engineer at ASI Data Science.
• Former astrophysicist.
• Main developer of naima, a Python package for radiative analysis of non-thermal astronomical sources.
• matplotlib developer.
3. ASI Data Science
• Data science consultancy
• Academia to Data Industry fellowship
• SherlockML
11. Lens by example
Room occupancy dataset
• A standard ML benchmark dataset.
• Goal: predict occupancy based on ambient measurements.
• What can we learn about it with Lens?
12. Python interface
>>> import lens
>>> df = pd.read_csv('room_occupancy.csv')
>>> ls = lens.summarise(df)
>>> type(ls)
<class 'lens.summarise.Summary'>
>>> ls.to_json('room_occupancy_lens.json')
room_occupancy_lens.json now contains all information needed for exploration!
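The report is ordinary JSON, so later exploration only needs that file, not the raw data. A sketch with a made-up, heavily simplified schema (the real Lens report is much richer):

```python
import json
import tempfile

# Hypothetical, simplified report contents with illustrative numbers; the
# real Lens JSON holds column properties, summaries, correlations, and
# pair densities for the whole dataset.
report = {
    "column_summary": {
        "Temperature": {"mean": 20.6, "min": 19.0, "max": 23.2},
        "Humidity": {"mean": 25.7, "min": 16.7, "max": 39.1},
    }
}

with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(report, f)
    path = f.name

# Exploration reads only this small file, never the raw dataset.
with open(path) as f:
    loaded = json.load(f)
print(loaded["column_summary"]["Temperature"]["max"])  # -> 23.2
```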
29. Our solution: Analysis
• A Python library computes dataset metrics:
  • Column-wise statistics
  • Pairwise densities
  • ...
• The computation cost is paid up front.
• The result is serialized to JSON.
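The pattern is easy to sketch with the standard library alone (column names, numbers, and metric choices here are illustrative):

```python
import json
import statistics

columns = {
    "Temperature": [21.0, 21.5, 22.0, 21.2],
    "CO2": [430.0, 445.0, 480.0, 455.0],
}

# Pay the computation cost once, up front...
report = {
    name: {
        "count": len(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
        "min": min(values),
        "max": max(values),
    }
    for name, values in columns.items()
}

# ...then serialize, so later exploration is just a dictionary lookup.
serialized = json.dumps(report)
assert json.loads(serialized)["Temperature"]["count"] == 4
```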
30. Our Solution: Interactive exploration
• Using only the report, the user can explore the dataset through either:
  • Jupyter widgets
  • Web UI
32. Why Python? Portability
A Python library easily runs in:
• Jupyter notebooks for interactive analysis.
• One-off scripts.
• Scheduled or on-demand jobs in a cluster.
33. Why Python? Reusability
• Allows us to use the Python data ecosystem.
• Becomes a building block in the data science process.
34. Why Python? Scalability
• Python is great for single-core, in-memory numerical computations through numpy, scipy, and pandas.
• But the GIL limits its ability to parallelise CPU-bound workloads.
Can Python scale?
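A minimal sketch of the constraint (stand-in function names; timings omitted for brevity): a pure-Python, CPU-bound task gains no parallelism from threads because they contend for the GIL, while a process pool gives each worker its own interpreter, and hence its own GIL.

```python
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor

def cpu_bound(n):
    # Pure-Python loop: it holds the GIL, so threads take turns running it.
    total = 0
    for i in range(n):
        total += i * i
    return total

def run_in_pool(executor_cls, workloads):
    # The same code drives either pool; only the executor class changes.
    with executor_cls(max_workers=4) as pool:
        return list(pool.map(cpu_bound, workloads))

if __name__ == "__main__":
    workloads = [200_000] * 4
    # Identical results either way, but only the process pool can use
    # several cores at once for this CPU-bound workload.
    threaded = run_in_pool(ThreadPoolExecutor, workloads)
    multiproc = run_in_pool(ProcessPoolExecutor, workloads)
    assert threaded == multiproc
```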
35. Out-of-core options in Python
[Chart placing the options on two axes, from difficult but flexible to easy but restrictive: threads, processes, MPI, ZeroMQ → concurrent.futures, joblib → Luigi → PySpark → Hadoop, SQL. Dask is shown apart, aiming to be both easy and flexible.]
37. Dask interface
• Dask objects are delayed objects.
• The user operates on them as ordinary Python structures.
• Dask builds a DAG of the computation.
• When the final result is requested, the DAG is executed on its workers (threads, processes, or nodes).
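The mechanism can be sketched in a few lines of plain Python (a toy stand-in, not Dask's actual implementation): wrapped calls return graph nodes instead of results, and `compute()` walks the graph.

```python
class Delayed:
    def __init__(self, func, args):
        self.func = func  # operation to run later
        self.args = args  # inputs: plain values or other Delayed nodes

    def compute(self, cache=None):
        # Evaluate the DAG bottom-up, memoising shared nodes so each
        # task runs at most once.
        if cache is None:
            cache = {}
        if id(self) in cache:
            return cache[id(self)]
        resolved = [a.compute(cache) if isinstance(a, Delayed) else a
                    for a in self.args]
        cache[id(self)] = self.func(*resolved)
        return cache[id(self)]

def delayed(func):
    # Wrap a function so that calling it builds a graph node
    # instead of running immediately.
    def wrapper(*args):
        return Delayed(func, args)
    return wrapper

@delayed
def double(x):
    return 2 * x

@delayed
def add(x, y):
    return x + y

graph = add(double(3), double(4))  # nothing computed yet
print(graph.compute())             # -> 14
```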
38. Dask data structures
• numpy.ndarray → dask.array
• pandas.DataFrame → dask.dataframe
• list, set → dask.bag
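For example, a `dask.array` accepts the same operations as the `numpy.ndarray` it mirrors, while chunking the work behind the scenes (assumes `dask` and `numpy` are installed):

```python
import numpy as np
import dask.array as da

x = np.arange(12).reshape(3, 4)
# The same array, split into 3x2 chunks that Dask can process independently.
dx = da.from_array(x, chunks=(3, 2))

# Familiar numpy-style operations build a graph; .compute() executes it.
assert dx.sum().compute() == x.sum()
assert (dx + 1).mean().compute() == (x + 1).mean()
```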
39. dask.delayed — Build your own DAG
files = ['myfile.a.data',
         'myfile.b.data',
         'myfile.c.data']
loaded = [load(i) for i in files]
cleaned = [clean(i) for i in loaded]
analyzed = analyze(cleaned)
store(analyzed)
40. dask.delayed — Build your own DAG
@delayed
def load(filename):
    ...

@delayed
def clean(data):
    ...

@delayed
def analyze(sequence_of_data):
    ...

@delayed
def store(result):
    with open(..., 'w') as f:
        f.write(result)
41. dask.delayed — Build your own DAG
files = ['myfile.a.data', 'myfile.b.data', 'myfile.c.data']
loaded = [load(i) for i in files]
cleaned = [clean(i) for i in loaded]
analyzed = analyze(cleaned)
stored = store(analyzed)
[Task graph diagram: a load → clean chain per file, all feeding a single analyze node, which feeds store]
stored.compute()
46. Building the graph with dask.delayed
# Create a series for each column in the DataFrame.
columns = df.columns
df = delayed(df)
cols = {k: delayed(df.get)(k) for k in columns}

# Create the delayed reports using Dask.
cprops = {k: delayed(metrics.column_properties)(cols[k])
          for k in columns}
csumms = {k: delayed(metrics.column_summary)(cols[k], cprops[k])
          for k in columns}
corr = delayed(metrics.correlation)(df, cprops)
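Stripping away the `delayed` wrappers, a minimal eager version of the per-column stage looks like this (hypothetical toy metric functions, not Lens's real `metrics` module):

```python
def column_properties(values):
    # Cheap type/nullity checks, computed once per column.
    return {
        "numeric": all(isinstance(v, (int, float))
                       for v in values if v is not None),
        "n_missing": sum(v is None for v in values),
    }

def column_summary(values, props):
    # The summary reuses the properties to decide what to compute.
    if not props["numeric"]:
        return {"distinct": len(set(values))}
    present = [v for v in values if v is not None]
    return {"min": min(present), "max": max(present)}

# A tiny stand-in "DataFrame": column name -> values.
df = {"Temperature": [21.0, 22.5, None], "Status": ["open", "closed", "open"]}
cprops = {k: column_properties(v) for k, v in df.items()}
csumms = {k: column_summary(v, cprops[k]) for k, v in df.items()}
```

In the real library each of these calls becomes a `delayed` graph node, so Dask can run independent columns in parallel.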
47. Building the graph with dask.delayed
pdens_results = []
if pairdensities:
    for col1, col2 in itertools.combinations(columns, 2):
        pdens_df = delayed(pd.concat)([cols[col1], cols[col2]])
        pdens_cp = {k: cprops[k] for k in [col1, col2]}
        pdens_cs = {k: csumms[k] for k in [col1, col2]}
        pdens_fr = {k: freqs[k] for k in [col1, col2]}
        pdens = delayed(metrics.pairdensity)(
            pdens_df, pdens_cp, pdens_cs, pdens_fr)
        pdens_results.append(pdens)

# Join the delayed per-metric reports into a dictionary.
report = delayed(dict)(column_properties=cprops,
                       column_summary=csumms,
                       pair_density=pdens_results,
                       ...)
return report
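Eagerly, and with a stand-in metric in place of `metrics.pairdensity`, the pairwise stage and the final report assembly reduce to the following (all names and numbers illustrative):

```python
import itertools
import statistics

columns = {
    "Temperature": [21.0, 21.5, 22.0, 22.5],
    "CO2": [430.0, 445.0, 480.0, 455.0],
    "Humidity": [25.0, 26.0, 24.0, 27.0],
}

def pair_metric(x, y):
    # Stand-in for metrics.pairdensity: a plain Pearson correlation.
    mx, my = statistics.mean(x), statistics.mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var = (sum((a - mx) ** 2 for a in x)
           * sum((b - my) ** 2 for b in y)) ** 0.5
    return cov / var

# One result per unordered column pair, exactly as in the Dask version.
pairwise = {
    (c1, c2): pair_metric(columns[c1], columns[c2])
    for c1, c2 in itertools.combinations(columns, 2)
}
report = {"columns": list(columns), "pairwise": pairwise}
```

Wrapping the final `dict(...)` in `delayed` is what stitches every per-metric node into a single graph whose one output is the whole report.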
51. SherlockML integration
• Every dataset entering the platform is analysed by Lens.
• We can use the same Python library!
• The web frontend is used to interact with datasets.
57. Lens
• Keeps interactive exploration snappy by splitting computation from exploration.
• Upfront computation leverages Dask to scale.
• Easy exploration no matter the data size!
• Keeps your data scientists happy and productive.
58. Lens
• Lens will be open source.
• You can use the library and service right now on SherlockML.