Python Data Ecosystem: Thoughts on Building for the Future
- 1. 1
© Cloudera, Inc. All rights reserved.
Python Data Ecosystem: Thoughts on Building for the Future
Wes McKinney @wesmckinn
PyData Berlin 2016-05-21
- 2. 2
Me
• Data Science Tools at Cloudera, formerly DataPad CEO/founder
• Serial creator of structured data tools / user interfaces
• Wrote bestseller Python for Data Analysis (2012)
• Open source projects
  • Python {pandas, Ibis, statsmodels}
  • Apache {Arrow, Parquet, Kudu (incubating)}
• Mostly work in Python and Cython/C/C++
- 3. 3
In process: Python for Data Analysis: 2nd Edition
Coming early 2017
- 4. 4
Building open source communities
- 5. 5
Social architecture is the
conscious design of an
environment that
encourages a desired range
of social behaviors leading
towards some goal or set of
goals.
Wikipedia
- 6. 6
Step 1: Be open and transparent
- 7. 7
Step 2: Reach out to others
- 8. 8
Step 3: Strive for consensus
- 9. 9
Step 4: Value contributions extending beyond lines of code
- 10. 10
Step 5: Make things harder for bad actors
- 12. 12
Handling problems carefully
- 13. 13
http://numfocus.org
http://apache.org
- 15. 15
Packaging is hard
• Reproducible infrastructure
• Reproducible toolchains
• Reproducible build scripts
• Integration testing
• Multiple library version builds
• Multiple Python versions
• Dependency resolution
• Hosting and distribution
• Multiple environment management
- 18. 18
conda-forge
• Community-curated conda package channel (on anaconda.org)
• Reproducible build infrastructure (Docker + Circle CI + Travis CI + Appveyor)
• Automated GitHub helper tools
conda config --add channels conda-forge
- 19. 19
What’s important to me right now?
- 20. 20
Important things
• Building bridges with other data science communities (R, Julia, Scala, etc.)
• Enabling Python to more efficiently talk to other systems (e.g. Hadoop things)
• Building Python tools for new and changing varieties of data
- 21. 21
RAM as the new disk?
• SSD – DRAM performance convergence
• NVM developments (3D XPoint)
[Diagram: one memory working set shared by several consumers]
- 22. 22
Problems
• Memory (data structure) representations
• Metadata representations
• Memory ownership, life-cycle
- 23. 23
NumPy solved this problem for Python scientists
• Common memory representation
  • ndarray strided, homogeneous buffer
• Common metadata
  • NumPy dtypes
• No well-defined memory sharing / messaging model: case by case basis
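A minimal sketch of the points above, assuming NumPy is installed. The array contents and variable names are illustrative only. It shows the dtype metadata, the strides of a homogeneous buffer, and a sliced view that shares memory with its parent.

```python
import numpy as np

# One contiguous, homogeneous buffer of twelve 8-byte integers, viewed as 3 x 4.
a = np.arange(12, dtype=np.int64).reshape(3, 4)

# Common metadata: the dtype describes every element in the buffer.
element_type = a.dtype

# Strides: number of bytes to step to reach the next row / next column.
row_step, col_step = a.strides

# Slicing yields a strided view over the same memory, not a copy.
every_other_col = a[:, ::2]
every_other_col[0, 0] = 100  # writes through to a[0, 0]
is_shared = np.shares_memory(a, every_other_col)
```

Whether a given library hands back a view or a copy is decided API by API, which is the case-by-case sharing model the slide refers to.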
- 24. 24
Problems NumPy doesn’t solve as well
• Nested data types (think JSON)
• Missing / NULL data
• Strings and category types
• Columnar memory representation for tables (think: analytic SQL databases)
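A short illustration of several of these gaps, assuming NumPy is installed. The sample values are made up for the example. In each case NumPy either falls back to the generic object dtype or changes the element type.

```python
import numpy as np

# Missing values: None forces the generic object dtype (an array of pointers).
with_none = np.array([1, None, 3])

# NaN is a floating-point value, so the integers are upcast to float64.
with_nan = np.array([1, 2, np.nan])

# Strings: fixed width, every element is sized for the longest value.
strings = np.array(["a", "bcd"])

# Nested, JSON-like records also fall back to the object dtype.
nested = np.array([{"id": 1, "tags": ["x", "y"]}, {"id": 2, "tags": []}])
```

Object arrays hold references to Python objects rather than the values themselves, so they lose the compact buffer layout that makes ndarray fast.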
- 25. 25
Apache Arrow
http://arrow.apache.org
Some slides from Strata-HW talk w/ Jacques Nadeau
- 26. 26
Arrow in a Slide
• New Top-level Apache Software Foundation project
• Focused on Columnar In-Memory Analytics
  1. 10-100x speedup on many workloads
  2. Common data layer enables companies to choose best of breed systems
  3. Designed to work with any programming language
  4. Support for both relational and complex data as-is
• Developers from 13+ major open source projects involved
• A significant % of the world’s data will be processed through Arrow!
Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R
- 27. 27
Focus on CPU Efficiency
         session_id   timestamp         source_ip
Row 1    1331246660   3/8/2012 2:44PM   99.155.155.225
Row 2    1331246351   3/8/2012 2:38PM   65.87.165.114
Row 3    1331244570   3/8/2012 2:09PM   71.10.106.181
Row 4    1331261196   3/8/2012 6:46PM   76.102.156.138
Traditional Memory Buffer: values laid out row by row (Row 1, Row 2, Row 3, Row 4)
Arrow Memory Buffer: values laid out column by column (session_id, timestamp, source_ip)
• Cache Locality
• Super-scalar & vectorized operation
• Minimal Structure Overhead
• Constant value access
  • With minimal structure overhead
• Operate directly on columnar compressed data
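The two layouts can be sketched with the Python standard library alone. This is an illustration of row-wise versus column-wise storage, not Arrow's actual format. The record layout (`<q15s`) and all variable names are assumptions made for the example, and the timestamp column is left out for brevity.

```python
import struct
from array import array

rows = [
    (1331246660, b"99.155.155.225"),
    (1331246351, b"65.87.165.114"),
    (1331244570, b"71.10.106.181"),
    (1331261196, b"76.102.156.138"),
]

# Row-oriented buffer: each record stores all of its fields together.
record = struct.Struct("<q15s")  # 8-byte integer + 15-byte padded string
row_buffer = b"".join(record.pack(sid, ip) for sid, ip in rows)

# Column-oriented buffers: each field is stored contiguously on its own.
session_ids = array("q", (sid for sid, _ in rows))
source_ips = [ip for _, ip in rows]

# Distance in bytes between consecutive session_id values in each layout.
row_stride = record.size
column_stride = session_ids.itemsize

# Scanning one column: the row layout must skip over the unrelated fields,
# while the columnar layout reads one dense run of integers.
max_from_rows = max(
    record.unpack_from(row_buffer, i * record.size)[0] for i in range(len(rows))
)
max_from_column = max(session_ids)
```

Both scans return the same answer. The difference is that the columnar scan touches only the bytes it needs, which is where the cache locality and vectorization benefits come from.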
- 28. 28
High Performance Sharing & Interchange
Today
• Each system has its own internal memory format
• 70-80% CPU wasted on serialization and deserialization
• Similar functionality implemented in multiple projects
With Arrow
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g., Parquet-to-Arrow reader)
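A rough single-process analogy for the contrast above, using only the Python standard library. It does not use Arrow. `pickle` stands in for a serialization step between systems and `memoryview` stands in for a shared memory format, and both choices are assumptions made for illustration.

```python
import pickle
from array import array

values = array("q", range(1000))

# Path 1: serialize and deserialize. The receiver ends up with its own copy.
copied = pickle.loads(pickle.dumps(values))

# Path 2: share the buffer. The receiver reads the producer's memory directly.
shared = memoryview(values)

# A later write by the producer is visible only through the shared buffer.
values[0] = -1
copy_sees_change = copied[0] == -1
view_sees_change = shared[0] == -1
```

The copy path pays for encoding, decoding, and a second allocation. The shared path pays for none of them, which is the saving the slide attributes to a common memory format.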
- 29. 29
Arrow in action: Feather File Format for Python and R
• Problem: fast, language-agnostic binary data frame file format
• By Wes McKinney (Python) and Hadley Wickham (R)
• Read speeds close to disk IO performance
- 30. 30
Real World Example: Feather File Format for Python and R
R
library(feather)
path <- "my_data.feather"
write_feather(df, path)
df <- read_feather(path)
Python
import feather
path = 'my_data.feather'
feather.write_dataframe(df, path)
df = feather.read_dataframe(path)
- 31. 31
More on Feather
[Diagram: a Feather File holds array 0, array 1, array 2, ..., array n - 1 and METADATA; the libfeather C++ library connects it to R data.frame via Rcpp and to pandas DataFrame via Cython]
- 32. 32
Feather: the good and not-so-good
• Good
  • Language-agnostic memory representation
  • Extremely fast
  • New storage features can be added without much difficulty
• Not-so-good
  • Data must be converted between the storage representation (Arrow) and in-memory “proprietary” data structures (R / Python data frames)
- 33. 33
Apache Parquet: Python support is coming
• Collaborating with Uwe Korn from Blue Yonder
[Diagram: pandas, Arrow (C++ / Python), Parquet (C++)]
- 34. 34
Shared needs for Python, R, Julia, ...
• If PLs can establish a common data frame C/C++-level memory representation, we can share algorithms and libraries much more easily
  • Example: dplyr’s in-memory backend
• Other requirements
  • Permissive licensing (Python / Julia require MIT/Apache-like)
  • Common build/test/packaging for shared C/C++ library components
- 35. 35
Real World Example: Python With Spark, Drill, Impala
[Diagram: the SQL Engine passes input partitions 0 … n - 1 to user-supplied Python code; each Python function takes one partition as input and produces an output, and output partitions 0 … n - 1 return to the SQL Engine]
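The data flow in the diagram can be sketched in plain Python. This is a toy model and not how Spark, Drill, or Impala invoke user code. `apply_to_partitions` and `double_values` are hypothetical names, and list-of-lists partitions are an assumption made for the example.

```python
def apply_to_partitions(partitions, func):
    """Apply func independently to each input partition, preserving order."""
    return [func(part) for part in partitions]


def double_values(part):
    """Stand-in for user-supplied Python code that transforms one partition."""
    return [2 * v for v in part]


# Input partitions 0 .. n - 1, as an engine might hand them over.
in_partitions = [[1, 2, 3], [4, 5], [6]]

# Output partitions 0 .. n - 1, handed back to the engine.
out_partitions = apply_to_partitions(in_partitions, double_values)
```

In a real engine each partition crosses a process or language boundary, so the cost of moving it in and out of Python depends on how closely the two memory formats match.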
- 36. 36
Get Involved in Arrow
• Join the community
  • dev@arrow.apache.org
  • Slack: https://apachearrowslackin.herokuapp.com/
  • http://arrow.apache.org
  • @ApacheArrow
- 37. 37
Thank you
Wes McKinney @wesmckinn
Views are my own