Data Science Languages and Industry Analytics

© Cloudera, Inc. All rights reserved.

Wes McKinney, BIDS 2015-09-19

Me

•  Serial creator of structured data tools / user interfaces
•  Mathematician, MIT '07
•  Professional SQL programmer 2007-2010 (@ AQR)
•  Created pandas, April 2008
•  Wrote Python for Data Analysis, 2012
•  Founder of DataPad -> Cloudera

A sample big data architecture

[Diagram labels: Kafka (x4), Application data, S3 or HDFS, JSON, Spark/MapReduce, Columnar storage, Analytic SQL Engine, User, SQL]

Big data architectures currently dominated by Java / JVM languages

Python/R/Julia don’t have much of a “seat at the table”

Some simplistic generalizations

Industry Analytics                       Scientific Computing
Heterogeneous data                       Homogeneous data
Flat tables and JSON                     Multidimensional arrays
Spark / MapReduce                        HPC tools
SQL                                      Linear algebra
DFS-friendly / streaming data formats    Scientific data formats
More physical machines                   Fewer physical machines

Many interactive-speed SQL engines

… and more

Ibis: not the direct subject of this talk

•  http://blog.ibis-project.org
•  Crafting a compelling Python-on-Hadoop user experience
    • Remove SQL-programming from user workflows
    • Develop high performance Python extension APIs
•  Pythonic composable DSL designed to target SQL semantics
•  Development roadmap targets Impala (C++ / LLVM) query engine
    • … but SQL compiler toolchain works well with other SQL dialects
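
To make the idea of a composable expression DSL that compiles to SQL more concrete, here is a toy sketch. It is not the Ibis API: `Table`, `Column`, `filter`, `select`, `compile`, and the `trips` table are all hypothetical names made up for illustration, and the compiler handles only integer comparisons.

```python
class Column:
    """A named column; comparisons return SQL predicate text."""

    def __init__(self, name):
        self.name = name

    def __gt__(self, value):
        # repr() is good enough for the integer literal used below;
        # a real compiler would quote and escape values properly.
        return f"{self.name} > {value!r}"


class Table:
    """An immutable query expression; each method returns a new Table."""

    def __init__(self, name, predicates=(), columns=()):
        self.name = name
        self.predicates = tuple(predicates)
        self.columns = tuple(columns)

    def __getitem__(self, column_name):
        return Column(column_name)

    def filter(self, predicate):
        return Table(self.name, self.predicates + (predicate,), self.columns)

    def select(self, *columns):
        return Table(self.name, self.predicates, [c.name for c in columns])

    def compile(self):
        projection = ", ".join(self.columns) if self.columns else "*"
        sql = f"SELECT {projection} FROM {self.name}"
        if self.predicates:
            sql += " WHERE " + " AND ".join(self.predicates)
        return sql


trips = Table("trips")  # hypothetical table name
expr = trips.filter(trips["distance"] > 10).select(trips["rider"], trips["distance"])
sql = expr.compile()
print(sql)  # SELECT rider, distance FROM trips WHERE distance > 10
```

The user composes Python expressions and never writes SQL text directly; the SQL string is produced only when `compile()` is called.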

Enabling interoperability with big data systems

•  Distributed / MPP query engines: implemented in a host language
    • Typically C++, Java, or Scala
•  User-defined functions (UDFs) through various means
    • Implement in host language
    • Implement in user language through some external language protocol
•  External UDFs are usually very slow (cf: PL/Python, PySpark, etc.)

What are UDFs good for?

•  Note: industry data scientists have libraries containing 100s of UDFs for Hive or other distributed query engines
•  Custom data transformations
•  Custom domain logic (date / time / data types)
•  Custom data types
•  Custom aggregations (incl. machine learning / statistics expressible as reductions)

Why are external UDFs slow?

•  Serialization / deserialization overhead
•  Scalar vs vectorized computations
•  RPC overhead
11
©
Cloudera,
Inc.
All
rights
reserved.

How to make them fast?

•  Common runtime memory representation for tabular data
•  Shared-memory (zero-copy or memcpy-only) external UDF protocol
•  Vectorized UDF interface (for interpreted languages)
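
The difference between a serializing protocol and a zero-copy one can be sketched with Python's buffer protocol. This is an in-process analogy only, not a real cross-process shared-memory UDF protocol; `engine_column`, `udf_scale_in_place`, and `roundtrip_copy` are hypothetical names.

```python
import struct
from array import array

# Column of float64 values owned by the "engine".
engine_column = array("d", [1.0, 2.0, 3.0, 4.0])


def udf_scale_in_place(view):
    """Vectorized UDF that works directly on the caller's memory."""
    for i in range(len(view)):
        view[i] = view[i] * 10.0


def roundtrip_copy(column):
    """Serialize to bytes and back, as a copying protocol would."""
    fmt = f"<{len(column)}d"
    payload = struct.pack(fmt, *column)        # serialize
    return list(struct.unpack(fmt, payload))   # deserialize into new objects


# Zero-copy path: the memoryview exposes the same buffer, no bytes are moved.
with memoryview(engine_column) as view:
    udf_scale_in_place(view)

# Copying path: the UDF side gets its own independent data.
copied = roundtrip_copy(engine_column)
copied[0] = -1.0

print(engine_column.tolist())  # [10.0, 20.0, 30.0, 40.0]
```

The in-place update is visible to the engine because both sides look at the same memory, while the change to `copied` is not. That shared view is only possible when both sides agree on the memory representation, which is the first bullet above.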

Memory representation

•  Many query engines are standardizing on in-memory columnar representation of materialized transient data
    • Apache Drill: https://drill.apache.org/faq/
    • Spark
    • Impala: http://blog.cloudera.com/blog/2015/07/whats-next-for-impala-more-reliability-usability-and-performance-at-even-greater-scale/
•  Industry-standard serialization format: Apache Parquet
    • https://parquet.apache.org/

Serialization vs In-memory

•  Serialization formats (e.g. Parquet)
    • Optimize for IO / DFS throughput at expense of CPU/memory bus throughput
    • Do not consider random access or in-memory analytics as a goal
•  No standardized in-memory containers for materialized data from file / RPC protocols (Parquet, Thrift, protobuf, Avro, etc.)

One possible proposal

•  Standardize on an augmented variant of the Apache Drill in-memory columnar memory layout
    • https://drill.apache.org/docs/value-vectors/
•  Common / shared C implementation for R/Python/Julia
    • Currently all languages have poor support for JSON-like data
    • Make your needs known!
    • Enumerate required data types and other requirements

More on the Drill layout

persons = [
  {
    name: 'wes',
    addresses: [
       {number: 2, street: 'a'},
       {number: 3, street: 'bb'},
    ]
  },
  {
    name: 'mark',
    addresses: [
       {number: 4, street: 'ccc'},
       {number: 5, street: 'dddd'},
       {number: 6, street: 'f'},
    ]
  },
]
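
One way to picture a columnar layout for the nested `persons` data is to split every field into its own flat buffer, with an offsets array wherever lengths vary. The sketch below is a simplification and not Drill's actual value-vector format: `encode_strings`, `to_columns`, and the dictionary keys are hypothetical, and storing n + 1 offsets per variable-length column is an assumption of this sketch.

```python
persons = [
    {"name": "wes", "addresses": [
        {"number": 2, "street": "a"},
        {"number": 3, "street": "bb"},
    ]},
    {"name": "mark", "addresses": [
        {"number": 4, "street": "ccc"},
        {"number": 5, "street": "dddd"},
        {"number": 6, "street": "f"},
    ]},
]


def encode_strings(strings):
    """Pack strings into one byte buffer plus end-exclusive offsets."""
    offsets, data = [0], bytearray()
    for s in strings:
        data.extend(s.encode("utf-8"))
        offsets.append(len(data))
    return offsets, bytes(data)


def to_columns(records):
    """Turn a list of person records into flat per-field buffers."""
    names, numbers, streets = [], [], []
    address_offsets = [0]
    for rec in records:
        names.append(rec["name"])
        for addr in rec["addresses"]:
            numbers.append(addr["number"])
            streets.append(addr["street"])
        # Where this person's run of addresses ends in the child columns.
        address_offsets.append(len(numbers))
    name_offsets, name_data = encode_strings(names)
    street_offsets, street_data = encode_strings(streets)
    return {
        "name/offsets": name_offsets,
        "name/data": name_data,
        "addresses/offsets": address_offsets,
        "addresses/number": numbers,
        "addresses/street/offsets": street_offsets,
        "addresses/street/data": street_data,
    }


columns = to_columns(persons)
for key, value in columns.items():
    print(key, value)
```

Each field ends up contiguous in memory, so a scan over `addresses/number` touches only integers, and the addresses of person `i` are the child rows from `addresses/offsets[i]` up to `addresses/offsets[i + 1]`.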

Array<Array<Int32>> example

persons = [
  {
    name: 'wes',
    fav_sequences: [
      [0, 1, 2],
      [2, 3]
    ]
  },
  {
    name: 'mark',
    fav_sequences: [
      [3],
      [4, 5],
      [6, 7]
    ]
  },

person.fav_sequences
  offset: 0, 2, 5
person.fav_sequences/values
  offset: 0, 3, 5, 6, 8
  values: 0, 1, 2, 2, 3, 3, 4, 5, 6, 7
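
The offsets shown for this example can be reproduced, and used for random access, with a short sketch. `flatten` and `get_sequence` are hypothetical helper names. One assumption differs from the slide: each offsets list here carries a final end offset, so the inner list ends in 10, whereas the slide's inner row lists only 0, 3, 5, 6, 8.

```python
fav_sequences = [
    [[0, 1, 2], [2, 3]],      # wes
    [[3], [4, 5], [6, 7]],    # mark
]


def flatten(nested):
    """Split a list of lists into (offsets, concatenated children)."""
    offsets, children = [0], []
    for item in nested:
        children.extend(item)
        offsets.append(len(children))
    return offsets, children


# Apply the same step once per nesting level.
outer_offsets, sequences = flatten(fav_sequences)   # persons -> sequences
inner_offsets, values = flatten(sequences)          # sequences -> int values


def get_sequence(person, k):
    """Return the k-th favourite sequence of a person from the flat buffers."""
    seq = outer_offsets[person] + k
    return values[inner_offsets[seq]:inner_offsets[seq + 1]]


print(outer_offsets)       # [0, 2, 5]
print(inner_offsets)       # [0, 3, 5, 6, 8, 10]
print(values)              # [0, 1, 2, 2, 3, 3, 4, 5, 6, 7]
print(get_sequence(1, 1))  # [4, 5]
```

Looking up a nested element takes two offset reads and one slice, with no per-record objects to walk.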
