Python Data Ecosystem: Thoughts on Building for the Future

© Cloudera, Inc. All rights reserved.

Wes McKinney (@wesmckinn)
PyData Berlin, 2016-05-21

Me

•  Data Science Tools at Cloudera, formerly DataPad CEO/founder
•  Serial creator of structured data tools / user interfaces
•  Wrote bestseller Python for Data Analysis (2012)
•  Open source projects
    •  Python {pandas, Ibis, statsmodels}
    •  Apache {Arrow, Parquet, Kudu (incubating)}
•  Mostly work in Python and Cython/C/C++

In process: Python for Data Analysis, 2nd Edition

Coming early 2017

Building open source communities

"Social architecture is the conscious design of an environment that encourages a desired range of social behaviors leading towards some goal or set of goals." (Wikipedia)

Step 1: Be open and transparent

Step 2: Reach out to others

Step 3: Strive for consensus

Step 4: Value contributions extending beyond lines of code

Step 5: Make things harder for bad actors

Handling problems carefully

http://numfocus.org
http://apache.org

Python packaging

Packaging is hard

•  Reproducible infrastructure
•  Reproducible toolchains
•  Reproducible build scripts
•  Integration testing
•  Multiple library version builds
•  Multiple Python versions
•  Dependency resolution
•  Hosting and distribution
•  Multiple environment management

Reflecting on the past

conda-forge

•  Community-curated conda package channel (on anaconda.org)
•  Reproducible build infrastructure (Docker + Circle CI + Travis CI + Appveyor)
•  Automated GitHub helper tools

conda config --add channels conda-forge

What's important to me right now?

Important things

•  Building bridges with other data science communities (R, Julia, Scala, etc.)
•  Enabling Python to more efficiently talk to other systems (e.g. Hadoop things)
•  Building Python tools for new and changing varieties of data

RAM as the new disk?

•  SSD – DRAM performance convergence
•  NVM developments (3D Xpoint)

[Diagram: memory working set and three consumers]

Problems

•  Memory (data structure) representations
•  Metadata representations
•  Memory ownership, life-cycle

NumPy solved this problem for Python scientists

•  Common memory representation
    •  ndarray: strided, homogeneous buffer
•  Common metadata
    •  NumPy dtypes
•  No well-defined memory sharing / messaging model: case by case basis
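
To make the "strided, homogeneous buffer" point concrete, here is a minimal sketch, assuming NumPy is installed. The array shape and the variable names are illustrative and do not come from the talk.

```python
import numpy as np

# A 2-D array of 64-bit integers stored in one contiguous, homogeneous buffer.
a = np.arange(12, dtype=np.int64).reshape(3, 4)

# The dtype is the shared metadata: element type and element size in bytes.
print(a.dtype, a.dtype.itemsize)   # int64 8

# Strides are the byte steps per dimension: 4 * 8 bytes per row, 8 per column.
print(a.strides)                   # (32, 8)

# Transposing or slicing only changes the strides; no data is copied.
t = a.T
col = a[:, 1]
print(t.strides, col.strides)      # (8, 32) (32,)

# Both views share memory with the original buffer.
shares_t = np.shares_memory(a, t)
shares_col = np.shares_memory(a, col)

# Writing through a view is visible in the original array.
col[0] = 100
```

Views like these work inside one process and one library. Nothing here defines how a different system would learn who owns the buffer or when it may be freed, which is the "case by case" gap the last bullet refers to.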

Problems NumPy doesn't solve as well

•  Nested data types (think JSON)
•  Missing / NULL data
•  Strings and category types
•  Columnar memory representation for tables (think: analytic SQL databases)
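
A short sketch of three of these limitations, assuming NumPy is installed. The sample values are made up for illustration.

```python
import numpy as np

# Integers have no NULL value: one missing entry forces the array to float.
ints_with_gap = np.array([1, 2, np.nan])
print(ints_with_gap.dtype)         # float64

# Strings use a fixed-width dtype, so every element is padded to the longest.
names = np.array(["a", "bb", "a much longer string"])
chars_per_item = names.dtype.itemsize // np.dtype("U1").itemsize

# Nested, JSON-like records fall back to generic Python objects.
nested = np.array([{"tags": ["x", "y"]}, {"tags": []}], dtype=object)
print(nested.dtype)                # object
```

In each case the data still fits into an ndarray, but only by changing the type, wasting space, or giving up the homogeneous buffer altogether.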

Apache Arrow

http://arrow.apache.org
Some slides from Strata-HW talk w/ Jacques Nadeau

Arrow in a Slide

•  New Top-level Apache Software Foundation project
•  Focused on Columnar In-Memory Analytics
    1.  10-100x speedup on many workloads
    2.  Common data layer enables companies to choose best-of-breed systems
    3.  Designed to work with any programming language
    4.  Support for both relational and complex data as-is
•  Developers from 13+ major open source projects involved
•  A significant % of the world's data will be processed through Arrow!

Projects involved: Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R

Focus on CPU Efficiency

         session_id    timestamp          source_ip
Row 1    1331246660    3/8/2012 2:44PM    99.155.155.225
Row 2    1331246351    3/8/2012 2:38PM    65.87.165.114
Row 3    1331244570    3/8/2012 2:09PM    71.10.106.181
Row 4    1331261196    3/8/2012 6:46PM    76.102.156.138

Traditional memory buffer: Row 1, Row 2, Row 3, Row 4 stored one after another.
Arrow memory buffer: all session_id values, then all timestamp values, then all source_ip values.

•  Cache locality
•  Super-scalar & vectorized operation
•  Minimal structure overhead
•  Constant value access
    •  With minimal structure overhead
•  Operate directly on columnar compressed data
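
The two layouts above can be sketched with the standard library alone. This is an illustration of row-oriented versus column-oriented storage using the four records from the slide, not Arrow's actual memory format; the variable names are assumptions.

```python
from array import array

# Four records from the slide: (session_id, timestamp, source_ip).
rows = [
    (1331246660, "3/8/2012 2:44PM", "99.155.155.225"),
    (1331246351, "3/8/2012 2:38PM", "65.87.165.114"),
    (1331244570, "3/8/2012 2:09PM", "71.10.106.181"),
    (1331261196, "3/8/2012 6:46PM", "76.102.156.138"),
]

# Row-oriented scan: every record is visited even though one field is needed.
max_id_rows = max(row[0] for row in rows)

# Column-oriented layout: one sequence per field.
session_id = array("q", (row[0] for row in rows))   # signed 64-bit integers
timestamp = [row[1] for row in rows]
source_ip = [row[2] for row in rows]

# Column scan: reads only the contiguous session_id buffer.
max_id_cols = max(session_id)

# Fixed-width values in one block: value i sits at byte offset i * itemsize,
# which is what makes constant-time access and vectorized loops possible.
buffer_bytes = len(session_id) * session_id.itemsize
```

Both scans return the same answer. The difference is how much unrelated data the CPU has to pull through its caches to get there.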

High Performance Sharing & Interchange

Today
•  Each system has its own internal memory format
•  70-80% CPU wasted on serialization and deserialization
•  Similar functionality implemented in multiple projects

With Arrow
•  All systems utilize the same memory format
•  No overhead for cross-system communication
•  Projects can share functionality (e.g., Parquet-to-Arrow reader)
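
The contrast between the two columns can be imitated inside a single Python process. The sketch below is an analogy only: it uses `pickle` to stand in for a serialization hand-off and `memoryview` to stand in for a shared memory format, and it does not use Arrow or any cross-process transport.

```python
import pickle
from array import array

# Producer-side column of 64-bit integers.
column = array("q", range(100_000))

# "Today": each hand-off serializes and deserializes, producing a full copy.
payload = pickle.dumps(column)
copied = pickle.loads(payload)

# Shared format: the consumer reads the producer's buffer in place, no copy.
shared = memoryview(column)

# A write by the producer is visible through the shared view,
# but not in the deserialized copy.
column[0] = -1
```

The copy costs CPU time and memory proportional to the data size on every hand-off, while the shared view costs neither.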

Arrow in action: Feather File Format for Python and R

•  Problem: fast, language-agnostic binary data frame file format
•  By Wes McKinney (Python) and Hadley Wickham (R)
•  Read speeds close to disk IO performance

Real World Example: Feather File Format for Python and R

R:

    library(feather)
    path <- "my_data.feather"
    write_feather(df, path)
    df <- read_feather(path)

Python:

    import feather
    path = 'my_data.feather'
    feather.write_dataframe(df, path)
    df = feather.read_dataframe(path)

Feather: the good and not-so-good

•  Good
    •  Language-agnostic memory representation
    •  Extremely fast
    •  New storage features can be added without much difficulty
•  Not-so-good
    •  Data must be converted between the storage representation (Arrow) and in-memory “proprietary” data structures (R / Python data frames)

Shared needs for Python, R, Julia, ...

•  If PLs can establish a common data frame C/C++-level memory representation, we can share algorithms and libraries much more easily
    •  Example: dplyr's in-memory backend
•  Other requirements
    •  Permissive licensing (Python / Julia require MIT/Apache-like)
    •  Common build/test/packaging for shared C/C++ library components

Real World Example: Python With Spark, Drill, Impala

[Diagram: a SQL engine passes input partitions 0 … n - 1 to a Python function running user-supplied Python code; each call produces an output partition (0 … n - 1) that is returned to the SQL engine.]
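
The data flow in the diagram can be sketched in plain Python. The names `apply_udf` and `double_values` are hypothetical, and partitions are modeled as simple lists; a real engine would hand over columnar buffers and run the calls in parallel.

```python
from typing import Callable, Iterable, List

Partition = List[float]


def apply_udf(partitions: Iterable[Partition],
              func: Callable[[Partition], Partition]) -> List[Partition]:
    """Apply a user-supplied function to each input partition independently."""
    return [func(part) for part in partitions]


def double_values(part: Partition) -> Partition:
    # Hypothetical user-supplied Python code: operates on a whole partition.
    return [2 * x for x in part]


# Input partitions 0 .. n - 1 as the engine would hand them over.
in_partitions = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]

# One output partition per input partition, returned to the engine.
out_partitions = apply_udf(in_partitions, double_values)
```

Because each call sees a whole partition rather than one row, the cost of crossing between the engine and Python is paid once per partition.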

Get Involved in Arrow

•  Join the community
•  dev@arrow.apache.org
•  Slack: https://apachearrowslackin.herokuapp.com/
•  http://arrow.apache.org
•  @ApacheArrow
