PyData: The Next Generation

© Cloudera, Inc. All rights reserved.

Wes McKinney, @wesmckinn
Data Day Texas 2015, #ddtx15


PyData: Everything’s awesome… or is it?



Me

•  Data systems, tools, Python guru at Cloudera
•  Formerly Founder/CEO of DataPad (visual analytics startup)
•  Created pandas in 2008, lead developer until 2013
•  Python for Data Analysis, published 10/2012
    • O’Reilly’s best-selling data book of 2014
•  Pythonista since 2007


What’s this about?

•  Hopes and fears for the community and ecosystem
•  Why do I care?
    • Python is fun!
    • Leverage
    • Accessibility for newbies
    • Community: smart, nice, humble people


Python at Cloudera

•  Want Cloudera platform users to be successful with Python
•  Spark/PySpark part of the Enterprise Data Hub / CDH
•  Actively investing in Python tooling
    • (p.s. we’re hiring?)
    • (p.p.s. we have an Austin office now!)


Historical perspective and background

•  20 years of fast numerical computing in Python (Numeric 1995)
•  10 years of NumPy
•  PyData becomes a thing in 2012
•  Python as a data language goes mainstream
    • Job descriptions tell all
    • Shift in larger Python community from web towards data
•  PyCon 2015 committee reported substantial growth in data-related submissions!


How’d this happen?

•  Data, data everywhere
•  Science! scikit-learn, statsmodels, and friends
•  Comprehensive data wrangling tools and in-memory analytics/reporting (pandas)
•  IPython Notebook
•  Learning resources (books, conferences, blogs, etc.)
•  Python environment/library management that “just works”


Put a Python (interface) on it!

Something no one got fired for, ever.


Meanwhile…

•  Hadoop and Big Data go mainstream in 2009 onward
    • First Hadoop World: Fall 2009
    • First Strata conference: Spring 2011
•  Lots of smart engineers in fast-growing businesses with massive analytics / ETL problems
•  Solutions built, frameworks developed, companies founded
•  Python was generally not a central part of those solutions
    • A lot of our nice things weren’t much help for data munging and counting at scale (more on this later)


We’re lucky to have lots of nice things

•  What a language!
•  IPython: interactive computing and collaboration
•  Libraries to solve nearly any (non-big data) problem
•  Trustworthy (medium) data wrangling, statistics, machine learning
•  HPC / GPU / parallel computing frameworks
•  FFI tools
•  … and much more


“If this isn’t nice, what is?”
- Kurt Vonnegut


So, what kind of big data?

•  Big multidimensional arrays / linear algebra
•  Big tables (structured data)
•  Big text data (unstructured data)
•  Empirically I personally am mostly interested in big tables


What kind of big data problems?

•  ETL / Data Wrangling
    • Python has been used here for years with Hadoop Streaming
•  BI / Analytics (“things you can do in SQL”)
•  Advanced Analytics / Machine Learning
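
For readers who have not used Hadoop Streaming: it runs ordinary programs as the mapper and reducer, feeding them records as text lines and collecting tab-separated key/value lines from their output. Below is a minimal word-count sketch of that contract. The names `map_lines` and `reduce_lines` are hypothetical, and the `sorted` call stands in for the sort that Hadoop performs between the two phases.

```python
from itertools import groupby


def map_lines(lines):
    """Mapper: emit one 'word<TAB>1' line per word in the input."""
    for line in lines:
        for word in line.split():
            yield f"{word.lower()}\t1"


def reduce_lines(sorted_lines):
    """Reducer: sum counts per key; input must already be sorted by key."""
    pairs = (line.rstrip("\n").split("\t", 1) for line in sorted_lines)
    for word, group in groupby(pairs, key=lambda kv: kv[0]):
        yield f"{word}\t{sum(int(count) for _, count in group)}"


# Local simulation of map -> sort -> reduce on two input lines.
mapped = sorted(map_lines(["the quick fox", "The lazy dog"]))
counts = list(reduce_lines(mapped))
```

In a real streaming job each function would live in its own script, read from `sys.stdin`, and print its output lines; the sketch keeps everything in one process so it can be run directly.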


Some ways we are #winning

•  Python seen as a viable alternative to SAS/MATLAB/proprietary software without nearly as much arguing
•  Huge uptake in the financial sector
•  Many current and upcoming generations of data scientists learning Python as a first language
•  Python in HPC / scientific computing


Some ways we are not #winning

•  Python still doesn’t have a great “big data story”
•  Little venture capital trickling down to Python projects
•  Data structures and programming APIs lagging modern realities
•  Weak support for emerging data formats
•  Many companies with Python big data successes have not open-sourced their work


Python in big data workflows in practice

[Diagram. Axis labels: “Big Data, Many machines” to “Small/Medium Data, One Machine”; “More counting / ETL” to “More insights / reporting”. Components shown: HDFS, Hadoop-MR, Spark, SQL, DSLs, pandas, viz tools, ML / stats.]


Big data storage formats

•  JSON and CSV are not a good way to warehouse data
•  Apache Avro
    • Compact binary data serialization format
    • RPC framework
•  Apache Parquet
    • Efficient columnar data format optimized for HDFS
    • Supports nested and repeated fields, compression, encoding schemes
    • Co-developed by Twitter and Cloudera
    • Reference implementations in Impala (C++), and standalone Java/Scala (used in Spark)
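
As a rough illustration of why a columnar layout with encoding schemes beats row-oriented text, the sketch below pivots rows into columns and run-length-encodes a low-cardinality column. This is not Parquet's on-disk format; the function names are hypothetical, and `to_columns` assumes every row carries the same keys.

```python
from itertools import groupby


def to_columns(rows):
    """Pivot a list of row dicts into a dict of column lists."""
    columns = {}
    for row in rows:
        for name, value in row.items():
            columns.setdefault(name, []).append(value)
    return columns


def run_length_encode(values):
    """Collapse consecutive repeats into (value, run_length) pairs."""
    return [(value, len(list(run))) for value, run in groupby(values)]


def run_length_decode(pairs):
    """Expand (value, run_length) pairs back into the original list."""
    return [value for value, count in pairs for _ in range(count)]


rows = [
    {"country": "US", "clicks": 3},
    {"country": "US", "clicks": 5},
    {"country": "US", "clicks": 2},
    {"country": "DE", "clicks": 7},
]
columns = to_columns(rows)
encoded = run_length_encode(columns["country"])
```

Once values of one field sit next to each other, repeated values collapse into short runs, and a query that touches only `clicks` never has to read `country` at all.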


We’re living in a JVM world

•  Scala rapidly taking over big data analytics
    • Functional, concise, good for building high level DSLs
    • Build nice Scala APIs to clunkier Java frameworks
•  JVM legitimately good for concurrent, distributed systems
•  Binary interface with Python a major issue


Dremel, baby, Dremel…

•  VLDB 2010: Dremel: Interactive Analysis of Web-Scale Datasets
•  Inspiration for Parquet (cf. blog “Dremel made easy with Parquet”)
•  Peta-scale analytics directly on nested data
•  Google BigQuery said to be an IaaS-ification of Dremel
    • Supports SQL variant + new user-defined functions with JavaScript + V8

    SELECT COUNT(c1 > c2)
    FROM (SELECT SUM(a.b.c.d) WITHIN RECORD AS c1,
                 SUM(a.b.p.q.r) WITHIN RECORD AS c2
          FROM T3)
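
To make the query above concrete, here is a small Python sketch of the same computation over nested records held as dicts and lists. It is an interpretation, not Dremel's semantics verbatim: it reads `COUNT(c1 > c2)` as "count the records where c1 exceeds c2", treats lists as repeated fields, and lets an empty sum be 0 where SQL would give NULL. The helper names and the sample records are made up.

```python
def iter_path(node, path):
    """Yield every leaf value reachable along a field path.

    Lists are treated as repeated fields and missing keys as empty.
    """
    if isinstance(node, list):
        for item in node:
            yield from iter_path(item, path)
    elif not path:
        yield node
    elif isinstance(node, dict) and path[0] in node:
        yield from iter_path(node[path[0]], path[1:])


def count_c1_gt_c2(records):
    """Count records whose per-record SUM(a.b.c.d) exceeds SUM(a.b.p.q.r)."""
    total = 0
    for record in records:
        c1 = sum(iter_path(record, ["a", "b", "c", "d"]))
        c2 = sum(iter_path(record, ["a", "b", "p", "q", "r"]))
        total += c1 > c2
    return total


records = [
    # c1 = 5 + 4 = 9, c2 = 1 + 2 = 3
    {"a": {"b": [{"c": {"d": 5}, "p": {"q": [{"r": 1}, {"r": 2}]}},
                 {"c": {"d": 4}}]}},
    # c1 = 1, c2 = 10
    {"a": {"b": [{"c": {"d": 1}, "p": {"q": [{"r": 10}]}}]}},
]
matches = count_c1_gt_c2(records)
```

The point of `WITHIN RECORD` is visible in the loop: each sum is scoped to one record, so no flattening or join is needed before aggregating.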


Cloudera Impala

•  Open-source interactive SQL for Hadoop
•  Analytical query processor written in C++ with LLVM code generation
•  Optimized to scan tables (best as Parquet format) in HDFS
•  SQL front-end and query optimizer / planner
•  User-defined function API (C++)
    • impyla enables Python UDFs to be compiled with Numba to LLVM IR


Cloudera Impala (cont’d)

•  For high performance big data analytics, Impala could be Python’s best friend
•  C++/LLVM backend is lower-level than SQL
•  Nested data support is coming


Set point: Hadley Wickham

•  R has upped its game with dplyr, tidyr, and other new projects
•  New standard for a uniform interface to either in-memory or in-database data processing
•  Composable table primitive operations
•  Multiple major versions shipped, getting adopted

    80dc69b 2012-10-28 | Initial commit of dplyr [hadley]

    tbl %>%
      filter(c == 'bar') %>%
      group_by(a, b) %>%
      summarise(metric = mean(d - f)) %>%
      arrange(desc(metric))
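
To show what “composable table primitive operations” means outside R, the dplyr pipeline above can be mimicked with a few small Python functions that each take rows and return rows. This is a standard-library sketch, not pandas or any existing package; `filter_rows`, `group_summarise`, `arrange_desc`, `pipe`, and the sample table are all hypothetical.

```python
from collections import defaultdict
from functools import reduce
from statistics import mean


def filter_rows(predicate):
    """Verb: keep only the rows for which predicate(row) is true."""
    return lambda rows: [row for row in rows if predicate(row)]


def group_summarise(keys, name, func):
    """Verb: group rows by `keys`, reduce each group to one value `name`."""
    def verb(rows):
        groups = defaultdict(list)
        for row in rows:
            groups[tuple(row[k] for k in keys)].append(row)
        return [{**dict(zip(keys, key)), name: func(group)}
                for key, group in groups.items()]
    return verb


def arrange_desc(column):
    """Verb: sort rows by `column`, largest first."""
    return lambda rows: sorted(rows, key=lambda row: row[column], reverse=True)


def pipe(rows, *verbs):
    """Apply each verb in order, like chaining with %>%."""
    return reduce(lambda acc, verb: verb(acc), verbs, rows)


tbl = [
    {"a": 1, "b": "x", "c": "bar", "d": 10.0, "f": 4.0},
    {"a": 1, "b": "x", "c": "bar", "d": 8.0, "f": 4.0},
    {"a": 2, "b": "y", "c": "bar", "d": 20.0, "f": 5.0},
    {"a": 2, "b": "y", "c": "foo", "d": 99.0, "f": 0.0},
]
result = pipe(
    tbl,
    filter_rows(lambda row: row["c"] == "bar"),
    group_summarise(["a", "b"], "metric",
                    lambda group: mean(row["d"] - row["f"] for row in group)),
    arrange_desc("metric"),
)
```

Because every verb has the same rows-in, rows-out shape, the same chain could in principle be handed to a different engine, which is the uniform-interface idea the slide credits to dplyr.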


Blaze

•  Shares some semantics with dplyr
•  Uses a generalized datashape protocol
•  Fresh start in 2014 under Matthew Rocklin’s (Continuum) direction
    • Deferred expression API
    • Support for piping data between storage systems
    • Multiple backends (pandas, SQL, MongoDB, PySpark, …)
    • Growing support for out-of-core analytics
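
A deferred expression API with multiple backends can be sketched in a few lines: operators build a tree instead of computing anything, and each backend decides later how to run that tree. The classes and functions below are hypothetical and are not Blaze's actual API; they only illustrate the idea with one in-memory interpreter and one SQL renderer.

```python
import operator
from dataclasses import dataclass


class Expr:
    """Base class: operators build a tree instead of computing a result."""

    def __gt__(self, other):
        return BinOp(">", self, wrap(other))

    def __add__(self, other):
        return BinOp("+", self, wrap(other))


@dataclass(eq=False)
class Column(Expr):
    name: str


@dataclass(eq=False)
class Literal(Expr):
    value: object


@dataclass(eq=False)
class BinOp(Expr):
    op: str
    left: Expr
    right: Expr


def wrap(value):
    """Turn plain Python values into Literal nodes."""
    return value if isinstance(value, Expr) else Literal(value)


OPS = {">": operator.gt, "+": operator.add}


def evaluate(expr, row):
    """In-memory backend: interpret the tree against one row dict."""
    if isinstance(expr, Column):
        return row[expr.name]
    if isinstance(expr, Literal):
        return expr.value
    return OPS[expr.op](evaluate(expr.left, row), evaluate(expr.right, row))


def to_sql(expr):
    """SQL backend: render the same tree as a SQL fragment."""
    if isinstance(expr, Column):
        return expr.name
    if isinstance(expr, Literal):
        return repr(expr.value)
    return f"({to_sql(expr.left)} {expr.op} {to_sql(expr.right)})"


predicate = Column("d") + Column("f") > 10  # nothing is computed yet
in_memory = evaluate(predicate, {"d": 7, "f": 5})
as_sql = to_sql(predicate)
```

The user writes the expression once; whether it runs row by row in Python or is pushed down to a database as text is a backend decision.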


libdynd

•  Led by Mark Wiebe at Continuum Analytics
•  Pure C++11 modern reimagining of NumPy
•  Python bindings
•  Supports variadic data cells and nested types (datashape protocol)
•  Development has focused on the data container design over analytics


PySpark

•  Popularity may exceed official Scala API
•  Spark was not exactly designed to be an ideal companion to Python
•  General architecture
    • Users build Spark deferred expression graphs in Python
    • User-supplied functions are serialized and broadcast around the cluster
    • Spark plans job and breaks work into tasks executed by Python worker jobs
•  Data is managed / shuffled by the Spark Scala master process
•  Python used largely as a black box to transform input to output
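
The “black box” role of the Python worker can be pictured with a toy loop: batches of records arrive as serialized bytes, an opaque function is applied to each record, and serialized results go back. This is a simplified in-process sketch, not PySpark's actual worker protocol. The function names are hypothetical, `io.BytesIO` buffers stand in for the channel between the JVM and the worker, and shipping the function itself across the cluster is left out.

```python
import io
import pickle


def write_batches(stream, batches):
    """Driver side: pickle each batch of records onto the stream."""
    for batch in batches:
        pickle.dump(batch, stream, protocol=pickle.HIGHEST_PROTOCOL)


def read_batches(stream):
    """Yield unpickled batches until the stream is exhausted."""
    while True:
        try:
            yield pickle.load(stream)
        except EOFError:
            return


def run_worker(func, in_stream, out_stream):
    """Worker side: apply an opaque function to each record of each batch."""
    for batch in read_batches(in_stream):
        pickle.dump([func(record) for record in batch], out_stream,
                    protocol=pickle.HIGHEST_PROTOCOL)


to_worker, from_worker = io.BytesIO(), io.BytesIO()
write_batches(to_worker, [[1, 2, 3], [4, 5]])
to_worker.seek(0)
run_worker(lambda x: x * x, to_worker, from_worker)
from_worker.seek(0)
results = [record for batch in read_batches(from_worker) for record in batch]
```

Every record is serialized on the way in and again on the way out, which is where much of the overhead in this style of integration comes from.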


PySpark: Some more gory details

•  Spark master controlled using py4j
    • Py4J docs: “If performance is critical to your application, accessing Java objects from Python programs might not be the best idea”
•  Data is marshalled mostly with files with various serialization protocols (pickle + bespoke formats)
•  Does not natively interface with NumPy (yet)
•  But, the in-memory benefits of Spark over Hadoop Streaming alternatives massively outweigh the downsides

# pass large object by py4j is very slow and need much memory


Spartan

•  http://github.com/spartan-array/spartan
•  Python distributed array expression evaluator (“distributed NumPy”)
•  Developed by Russell Power & others at NYU
•  Uses ZeroMQ and custom RPC implementation


Things I think we should do

•  Create high fidelity data structures for Dremel-style data
•  Get serious about Avro, Parquet, and other new data format standards
•  Invest in the Python-Impala-LLVM relationship
•  Efficient binary protocols to receive and emit data from Python processes
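
As one possible reading of “efficient binary protocols”, the sketch below ships a numeric column as a tiny header plus the raw fixed-width values, so the receiver can rebuild the column in one bulk copy instead of deserializing object by object. The header layout and function names are invented for illustration and match no existing standard; the payload uses the machine's native byte order, which a real protocol would have to pin down.

```python
import struct
from array import array

# Hypothetical header: one-byte type code followed by a 64-bit element count.
HEADER = struct.Struct("<cQ")


def encode_column(values, typecode="d"):
    """Pack a numeric column as a small header plus raw fixed-width values."""
    payload = array(typecode, values)
    return HEADER.pack(typecode.encode("ascii"), len(payload)) + payload.tobytes()


def decode_column(buffer):
    """Rebuild the column from the header and the raw payload bytes."""
    code, count = HEADER.unpack_from(buffer)
    column = array(code.decode("ascii"))
    column.frombytes(buffer[HEADER.size:])
    assert len(column) == count
    return column


message = encode_column([1.5, 2.0, 3.25])
column = decode_column(message)
```

Three doubles cost 24 payload bytes plus a 9-byte header here, and the same bytes could be produced or consumed by a C++ or JVM process without any knowledge of Python objects.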


Conclusions

•  Python + PyData stack is as strong as ever, and still gaining momentum
•  The time for a “dark horse” Python-centric big data solution has probably passed us by. Maybe better to pursue alliances.
•  Focused work is needed to still be relevant in 2020. Some of our competitive advantages are eroding
