Python Data Ecosystem: Thoughts on Building for the Future

Wes McKinney
Director of Ursa Labs, Open Source Developer at Ursa Labs
© Cloudera, Inc. All rights reserved.

Python Data Ecosystem: Thoughts on Building for the Future

Wes McKinney @wesmckinn

PyData Berlin 2016-05-21
  
  
Me

• Data Science Tools at Cloudera, formerly DataPad CEO/founder
• Serial creator of structured data tools / user interfaces
• Wrote bestseller Python for Data Analysis (2012)
• Open source projects
  • Python {pandas, Ibis, statsmodels}
  • Apache {Arrow, Parquet, Kudu (incubating)}
• Mostly work in Python and Cython/C/C++
  
	
  
  
In process:

Python for Data Analysis: 2nd Edition

Coming early 2017
  
  
Building open source communities
  
  
Social architecture is the
conscious design of an
environment that
encourages a desired range
of social behaviors leading
towards some goal or set of
goals.
Wikipedia
  
Step 1

Be open and transparent
  
  
Step 2

Reach out to others
  
  
Step 3

Strive for consensus
  
  
Step 4

Value contributions extending beyond lines of code
  
  
Step 5

Make things harder for bad actors
  
  
Handling problems carefully
  
http://numfocus.org
http://apache.org
  
Python packaging
  
  
Packaging is hard

• Reproducible infrastructure
• Reproducible toolchains
• Reproducible build scripts
• Integration testing
• Multiple library version builds
• Multiple Python versions
• Dependency resolution
• Hosting and distribution
• Multiple environment management
  
  
Reflecting on the past
  
  
conda-forge

• Community-curated conda package channel (on anaconda.org)
• Reproducible build infrastructure (Docker + Circle CI + Travis CI + Appveyor)
• Automated GitHub helper tools
  
conda config --add channels conda-forge
  
What’s important to me right now?
  
  
Important things

• Building bridges with other data science communities (R, Julia, Scala, etc.)
• Enabling Python to more efficiently talk to other systems (e.g. Hadoop things)
• Building Python tools for new and changing varieties of data
  
  
RAM as the new disk?

• SSD – DRAM performance convergence
• NVM developments (3D XPoint)

[Diagram labels: Memory working set; Consumer, Consumer, Consumer]
  
Problems

• Memory (data structure) representations
• Metadata representations
• Memory ownership, life-cycle
  
  
NumPy solved this problem for Python scientists

• Common memory representation
  • ndarray: strided, homogeneous buffer
• Common metadata
  • NumPy dtypes
• No well-defined memory sharing / messaging model: case by case basis
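
A minimal sketch of what a common memory representation and common metadata look like in practice, assuming NumPy is installed. The array shape and variable names are illustrative and do not come from the talk.

```python
import numpy as np

# A 2-D array of 64-bit integers: one contiguous, homogeneous buffer.
a = np.arange(12, dtype=np.int64).reshape(3, 4)

# The dtype is the shared metadata: every element is an 8-byte integer.
itemsize = a.dtype.itemsize          # 8

# Strides give the byte step along each axis of the buffer.
row_major_strides = a.strides        # (32, 8)

# A transposed view reuses the same memory; only the strides change.
t = a.T
shares_buffer = np.shares_memory(a, t)
transposed_strides = t.strides       # (8, 32)

# Slicing with a step is also just a strided view over the same buffer.
every_other = a[:, ::2]
step_strides = every_other.strides   # (32, 16)
```

Because the dtype and strides fully describe the buffer, any library that understands them can operate on the same memory without copying it.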
  
  
Problems NumPy doesn’t solve as well

• Nested data types (think JSON)
• Missing / NULL data
• Strings and category types
• Columnar memory representation for tables (think: analytic SQL databases)
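
Some of these limitations can be seen in a few lines, assuming NumPy is installed. The sample values are made up for illustration.

```python
import numpy as np

# Integers have no NULL value; representing a missing entry with NaN
# forces the whole array to floating point.
ints = np.array([1, 2, 3])
with_missing = np.array([1, np.nan, 3])
print(ints.dtype.kind, with_missing.dtype.kind)

# Fixed-width strings pad every element to the length of the longest value.
fixed = np.array(["a", "pandas", "arrow"])
print(fixed.dtype, fixed.dtype.itemsize)

# Nested or variable-length values fall back to object dtype: an array of
# pointers to Python objects rather than a buffer of raw values.
nested = np.array([{"a": 1}, {"b": [1, 2]}], dtype=object)
print(nested.dtype)
```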
  
  
Apache Arrow

http://arrow.apache.org
Some slides from Strata-HW talk w/ Jacques Nadeau
  
Arrow in a Slide

• New Top-level Apache Software Foundation project
• Focused on Columnar In-Memory Analytics
  1. 10-100x speedup on many workloads
  2. Common data layer enables companies to choose best of breed systems
  3. Designed to work with any programming language
  4. Support for both relational and complex data as-is
• Developers from 13+ major open source projects involved
• A significant % of the world’s data will be processed through Arrow!
  
	
  
Calcite
Cassandra
Deeplearning4j
Drill
Hadoop
HBase
Ibis
Impala
Kudu
Pandas
Parquet
Phoenix
Spark
Storm
R
  
Focus on CPU Efficiency

         session_id   timestamp         source_ip
Row 1    1331246660   3/8/2012 2:44PM   99.155.155.225
Row 2    1331246351   3/8/2012 2:38PM   65.87.165.114
Row 3    1331244570   3/8/2012 2:09PM   71.10.106.181
Row 4    1331261196   3/8/2012 6:46PM   76.102.156.138

Traditional Memory Buffer: values laid out row by row (Row 1, Row 2, Row 3, Row 4)
Arrow Memory Buffer: values laid out column by column (session_id, timestamp, source_ip)

• Cache Locality
• Super-scalar & vectorized operation
• Minimal Structure Overhead
• Constant value access
  • With minimal structure overhead
• Operate directly on columnar compressed data
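
As a rough illustration of the two layouts, the sketch below stores the slide's sample records once as a NumPy structured array (one record after another) and once as a standalone column. NumPy stands in for the idea only: it is not Arrow's actual memory format, and the 15-byte field widths are assumptions.

```python
import numpy as np

session_ids = [1331246660, 1331246351, 1331244570, 1331261196]
timestamps = [b"3/8/2012 2:44PM", b"3/8/2012 2:38PM",
              b"3/8/2012 2:09PM", b"3/8/2012 6:46PM"]
source_ips = [b"99.155.155.225", b"65.87.165.114",
              b"71.10.106.181", b"76.102.156.138"]

# Row-oriented: each record holds all three fields side by side.
record = np.dtype([("session_id", "i8"),
                   ("timestamp", "S15"),
                   ("source_ip", "S15")])
rows = np.array(list(zip(session_ids, timestamps, source_ips)), dtype=record)

# Reading one field means stepping over the other fields of every record.
session_id_in_rows = rows["session_id"]

# Column-oriented: the same values packed back to back.
session_id_column = np.array(session_ids, dtype="i8")

print(rows.dtype.itemsize)          # bytes per record
print(session_id_in_rows.strides)   # byte step between values, row layout
print(session_id_column.strides)    # byte step between values, column layout
```

In the row layout, a scan over session_id touches 38 bytes per value; in the column layout it touches 8, which is the cache-locality argument made in the bullets above.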
  
  
High Performance Sharing & Interchange

Today
• Each system has its own internal memory format
• 70-80% CPU wasted on serialization and deserialization
• Similar functionality implemented in multiple projects

With Arrow
• All systems utilize the same memory format
• No overhead for cross-system communication
• Projects can share functionality (e.g., Parquet-to-Arrow reader)
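
A small within-process analogy for this contrast, using only the Python standard library. Here pickle stands in for a serialization hop between systems and memoryview for a shared memory format; neither reflects how Arrow itself is implemented.

```python
import pickle
from array import array

values = array("q", range(1000))   # 8-byte signed integers

# "Today": hand the data over by serializing and deserializing it.
# The consumer ends up with a second, independent copy.
payload = pickle.dumps(values)
copied = pickle.loads(payload)

# "With a shared format": the consumer reads the producer's buffer in place.
shared = memoryview(values)

# A later write by the producer is visible through the shared view,
# but not in the deserialized copy.
values[0] = 42
print(copied[0], shared[0])
```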
  
Arrow in action: Feather File Format for Python and R

• Problem: fast, language-agnostic binary data frame file format
• By Wes McKinney (Python) and Hadley Wickham (R)
• Read speeds close to disk IO performance
  
  
Real World Example: Feather File Format for Python and R

R

library(feather)

# df is an existing data.frame
path <- "my_data.feather"
write_feather(df, path)

df <- read_feather(path)

Python

import feather

# df is an existing pandas DataFrame
path = 'my_data.feather'

feather.write_dataframe(df, path)
df = feather.read_dataframe(path)
  
  
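More on Feather

[Diagram: a Feather file holds array 0, array 1, array 2, ..., array n - 1, followed by METADATA. The libfeather C++ library sits between the file and the language bindings: Rcpp for the R data.frame, Cython for the pandas DataFrame.]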
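
To make the arrays-then-metadata layout concrete, here is a toy columnar file written with the standard library. The JSON footer, the 4-byte length suffix, the column name "score", and the functions write_columns and read_column are all assumptions for illustration; this is not the actual Feather format or the libfeather API.

```python
import io
import json
import struct
from array import array


def write_columns(stream, columns):
    """Write each column's raw bytes, then a JSON metadata footer."""
    metadata = []
    for name, values in columns.items():
        data = values.tobytes()
        metadata.append({"name": name, "typecode": values.typecode,
                         "offset": stream.tell(), "nbytes": len(data)})
        stream.write(data)
    footer = json.dumps(metadata).encode("utf-8")
    stream.write(footer)
    # The last 4 bytes record the footer length so a reader can find it.
    stream.write(struct.pack("<I", len(footer)))


def read_column(stream, name):
    """Locate one column through the footer and read only its bytes."""
    stream.seek(-4, io.SEEK_END)
    (footer_len,) = struct.unpack("<I", stream.read(4))
    stream.seek(-4 - footer_len, io.SEEK_END)
    metadata = json.loads(stream.read(footer_len).decode("utf-8"))
    entry = next(m for m in metadata if m["name"] == name)
    stream.seek(entry["offset"])
    values = array(entry["typecode"])
    values.frombytes(stream.read(entry["nbytes"]))
    return values


buffer = io.BytesIO()
write_columns(buffer, {"session_id": array("q", [1331246660, 1331246351]),
                       "score": array("d", [0.5, 1.5])})
session_ids = read_column(buffer, "session_id")
scores = read_column(buffer, "score")
```

Keeping the metadata at the end lets a writer stream the arrays out first and lets a reader fetch a single column without scanning the others.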
  
Feather: the good and not-so-good

• Good
  • Language-agnostic memory representation
  • Extremely fast
  • New storage features can be added without much difficulty

• Not-so-good
  • Data must be converted between the storage representation (Arrow) and in-memory “proprietary” data structures (R / Python data frames)
  
  
Apache Parquet: Python support is coming

• Collaborating with Uwe Korn from Blue Yonder

[Diagram layers: pandas; Arrow (C++ / Python); Parquet (C++)]
  
Shared needs for Python, R, Julia, ...

• If PLs can establish a common data frame C/C++-level memory representation, we can share algorithms and libraries much more easily
  • Example: dplyr’s in-memory backend

• Other requirements
  • Permissive licensing (Python / Julia require MIT/Apache-like)
  • Common build/test/packaging for shared C/C++ library components
  
  
Real World Example: Python With Spark, Drill, Impala

[Diagram: the SQL engine passes input partitions (in partition 0 ... in partition n - 1) to user-supplied Python code. Each Python function takes a partition as input and produces an output, and the outputs return to the SQL engine as out partition 0 ... out partition n - 1.]
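
The data flow in the diagram can be sketched in plain Python. The names run_partitions and keep_even are hypothetical, and a real engine would move partitions between processes rather than loop over in-memory lists.

```python
def run_partitions(partitions, func):
    """Apply a user-supplied function to each input partition independently."""
    return [func(partition) for partition in partitions]


def keep_even(partition):
    # Hypothetical user code: any function from one partition to another.
    return [value for value in partition if value % 2 == 0]


in_partitions = [[1, 2, 3], [4, 5, 6], [7, 8]]
out_partitions = run_partitions(in_partitions, keep_even)
print(out_partitions)   # [[2], [4, 6], [8]]
```

The cost the talk is concerned with sits at the two boundaries of run_partitions: how each partition gets from the engine into Python and back again.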
  
Get Involved in Arrow

• Join the community
  • dev@arrow.apache.org
  • Slack: https://apachearrowslackin.herokuapp.com/
  • http://arrow.apache.org
  • @ApacheArrow
  
  
Thank you

Wes McKinney @wesmckinn

Views are my own
  
1 of 37

Recommended

Next-generation Python Big Data Tools, powered by Apache Arrow by
Next-generation Python Big Data Tools, powered by Apache ArrowNext-generation Python Big Data Tools, powered by Apache Arrow
Next-generation Python Big Data Tools, powered by Apache ArrowWes McKinney
13K views22 slides
An Incomplete Data Tools Landscape for Hackers in 2015 by
An Incomplete Data Tools Landscape for Hackers in 2015An Incomplete Data Tools Landscape for Hackers in 2015
An Incomplete Data Tools Landscape for Hackers in 2015Wes McKinney
8.1K views22 slides
Ibis: Scaling the Python Data Experience by
Ibis: Scaling the Python Data ExperienceIbis: Scaling the Python Data Experience
Ibis: Scaling the Python Data ExperienceWes McKinney
3.8K views13 slides
My Data Journey with Python (SciPy 2015 Keynote) by
My Data Journey with Python (SciPy 2015 Keynote)My Data Journey with Python (SciPy 2015 Keynote)
My Data Journey with Python (SciPy 2015 Keynote)Wes McKinney
7.4K views37 slides
Memory Interoperability in Analytics and Machine Learning by
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningWes McKinney
5.6K views27 slides
PyData: The Next Generation by
PyData: The Next GenerationPyData: The Next Generation
PyData: The Next GenerationWes McKinney
22.2K views31 slides

More Related Content

What's hot

Python Data Wrangling: Preparing for the Future by
Python Data Wrangling: Preparing for the FuturePython Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the FutureWes McKinney
12.5K views27 slides
Improving Python and Spark (PySpark) Performance and Interoperability by
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and InteroperabilityWes McKinney
19.8K views37 slides
Apache Arrow (Strata-Hadoop World San Jose 2016) by
Apache Arrow (Strata-Hadoop World San Jose 2016)Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)Wes McKinney
17K views28 slides
Apache Arrow -- Cross-language development platform for in-memory data by
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataWes McKinney
2.9K views23 slides
Improving data interoperability in Python and R by
Improving data interoperability in Python and RImproving data interoperability in Python and R
Improving data interoperability in Python and RWes McKinney
2.6K views14 slides
High Performance Python on Apache Spark by
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache SparkWes McKinney
16.6K views35 slides

What's hot(20)

Python Data Wrangling: Preparing for the Future by Wes McKinney
Python Data Wrangling: Preparing for the FuturePython Data Wrangling: Preparing for the Future
Python Data Wrangling: Preparing for the Future
Wes McKinney12.5K views
Improving Python and Spark (PySpark) Performance and Interoperability by Wes McKinney
Improving Python and Spark (PySpark) Performance and InteroperabilityImproving Python and Spark (PySpark) Performance and Interoperability
Improving Python and Spark (PySpark) Performance and Interoperability
Wes McKinney19.8K views
Apache Arrow (Strata-Hadoop World San Jose 2016) by Wes McKinney
Apache Arrow (Strata-Hadoop World San Jose 2016)Apache Arrow (Strata-Hadoop World San Jose 2016)
Apache Arrow (Strata-Hadoop World San Jose 2016)
Wes McKinney17K views
Apache Arrow -- Cross-language development platform for in-memory data by Wes McKinney
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
Wes McKinney2.9K views
Improving data interoperability in Python and R by Wes McKinney
Improving data interoperability in Python and RImproving data interoperability in Python and R
Improving data interoperability in Python and R
Wes McKinney2.6K views
High Performance Python on Apache Spark by Wes McKinney
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache Spark
Wes McKinney16.6K views
Apache Arrow at DataEngConf Barcelona 2018 by Wes McKinney
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
Wes McKinney2K views
Apache Arrow and Python: The latest by Wes McKinney
Apache Arrow and Python: The latestApache Arrow and Python: The latest
Apache Arrow and Python: The latest
Wes McKinney5.8K views
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward" by Wes McKinney
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
Wes McKinney1.1K views
Apache Arrow: Cross-language Development Platform for In-memory Data by Wes McKinney
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
Wes McKinney6.6K views
Ibis: Scaling Python Analytics on Hadoop and Impala by Wes McKinney
Ibis: Scaling Python Analytics on Hadoop and ImpalaIbis: Scaling Python Analytics on Hadoop and Impala
Ibis: Scaling Python Analytics on Hadoop and Impala
Wes McKinney7.6K views
Improving Python and Spark Performance and Interoperability with Apache Arrow by Julien Le Dem
Improving Python and Spark Performance and Interoperability with Apache ArrowImproving Python and Spark Performance and Interoperability with Apache Arrow
Improving Python and Spark Performance and Interoperability with Apache Arrow
Julien Le Dem4.4K views
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P... by Wes McKinney
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Wes McKinney103.9K views
Large Scale Graph Analytics with JanusGraph by P. Taylor Goetz
Large Scale Graph Analytics with JanusGraphLarge Scale Graph Analytics with JanusGraph
Large Scale Graph Analytics with JanusGraph
P. Taylor Goetz19.1K views
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney by Hakka Labs
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinneyIbis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Ibis: operating the Python data ecosystem at Hadoop scale by Wes McKinney
Hakka Labs1.1K views
PyData: The Next Generation | Data Day Texas 2015 by Cloudera, Inc.
PyData: The Next Generation | Data Day Texas 2015PyData: The Next Generation | Data Day Texas 2015
PyData: The Next Generation | Data Day Texas 2015
Cloudera, Inc.1.8K views
HBase and Drill: How loosley typed SQL is ideal for NoSQL by DataWorks Summit
HBase and Drill: How loosley typed SQL is ideal for NoSQLHBase and Drill: How loosley typed SQL is ideal for NoSQL
HBase and Drill: How loosley typed SQL is ideal for NoSQL
DataWorks Summit641 views
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes by DataWorks Summit
Scaling HDFS to Manage Billions of Files with Distributed Storage SchemesScaling HDFS to Manage Billions of Files with Distributed Storage Schemes
Scaling HDFS to Manage Billions of Files with Distributed Storage Schemes
DataWorks Summit361 views

Viewers also liked

Raising the Tides: Open Source Analytics for Data Science by
Raising the Tides: Open Source Analytics for Data ScienceRaising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data ScienceWes McKinney
3.2K views28 slides
PyCon APAC 2016 Keynote by
PyCon APAC 2016 KeynotePyCon APAC 2016 Keynote
PyCon APAC 2016 KeynoteWes McKinney
3.6K views36 slides
Enabling Python to be a Better Big Data Citizen by
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data CitizenWes McKinney
6K views19 slides
pandas: Powerful data analysis tools for Python by
pandas: Powerful data analysis tools for Pythonpandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for PythonWes McKinney
9.8K views38 slides
Productive Data Tools for Quants by
Productive Data Tools for QuantsProductive Data Tools for Quants
Productive Data Tools for QuantsWes McKinney
1.7K views21 slides
What's new in pandas and the SciPy stack for financial users by
What's new in pandas and the SciPy stack for financial usersWhat's new in pandas and the SciPy stack for financial users
What's new in pandas and the SciPy stack for financial usersWes McKinney
11.8K views23 slides

Viewers also liked(14)

Raising the Tides: Open Source Analytics for Data Science by Wes McKinney
Raising the Tides: Open Source Analytics for Data ScienceRaising the Tides: Open Source Analytics for Data Science
Raising the Tides: Open Source Analytics for Data Science
Wes McKinney3.2K views
PyCon APAC 2016 Keynote by Wes McKinney
PyCon APAC 2016 KeynotePyCon APAC 2016 Keynote
PyCon APAC 2016 Keynote
Wes McKinney3.6K views
Enabling Python to be a Better Big Data Citizen by Wes McKinney
Enabling Python to be a Better Big Data CitizenEnabling Python to be a Better Big Data Citizen
Enabling Python to be a Better Big Data Citizen
Wes McKinney6K views
pandas: Powerful data analysis tools for Python by Wes McKinney
pandas: Powerful data analysis tools for Pythonpandas: Powerful data analysis tools for Python
pandas: Powerful data analysis tools for Python
Wes McKinney9.8K views
Productive Data Tools for Quants by Wes McKinney
Productive Data Tools for QuantsProductive Data Tools for Quants
Productive Data Tools for Quants
Wes McKinney1.7K views
What's new in pandas and the SciPy stack for financial users by Wes McKinney
What's new in pandas and the SciPy stack for financial usersWhat's new in pandas and the SciPy stack for financial users
What's new in pandas and the SciPy stack for financial users
Wes McKinney11.8K views
Data Tools and the Data Scientist Shortage by Wes McKinney
Data Tools and the Data Scientist ShortageData Tools and the Data Scientist Shortage
Data Tools and the Data Scientist Shortage
Wes McKinney3.7K views
DataFrames: The Good, Bad, and Ugly by Wes McKinney
DataFrames: The Good, Bad, and UglyDataFrames: The Good, Bad, and Ugly
DataFrames: The Good, Bad, and Ugly
Wes McKinney12.9K views
Python for Financial Data Analysis with pandas by Wes McKinney
Python for Financial Data Analysis with pandasPython for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandas
Wes McKinney61.8K views
Structured Data Challenges in Finance and Statistics by Wes McKinney
Structured Data Challenges in Finance and StatisticsStructured Data Challenges in Finance and Statistics
Structured Data Challenges in Finance and Statistics
Wes McKinney5.3K views
User Experience for Business Analysts by Carol Smith
User Experience for Business AnalystsUser Experience for Business Analysts
User Experience for Business Analysts
Carol Smith3.7K views
Python for Science and Engineering: a presentation to A*STAR and the Singapor... by pythoncharmers
Python for Science and Engineering: a presentation to A*STAR and the Singapor...Python for Science and Engineering: a presentation to A*STAR and the Singapor...
Python for Science and Engineering: a presentation to A*STAR and the Singapor...
pythoncharmers7K views
Falcon: Fault Localization in Concurrent Programs by Sangmin Park
Falcon: Fault Localization in Concurrent ProgramsFalcon: Fault Localization in Concurrent Programs
Falcon: Fault Localization in Concurrent Programs
Sangmin Park539 views
Griffin: Grouping Suspicious Memory-Access Patterns to Improve Understanding... by Sangmin Park
Griffin: Grouping Suspicious Memory-Access Patterns to Improve Understanding...Griffin: Grouping Suspicious Memory-Access Patterns to Improve Understanding...
Griffin: Grouping Suspicious Memory-Access Patterns to Improve Understanding...
Sangmin Park412 views

Similar to Python Data Ecosystem: Thoughts on Building for the Future

Improving Data Interoperability for Python and R by
Improving Data Interoperability for Python and RImproving Data Interoperability for Python and R
Improving Data Interoperability for Python and RWork-Bench
10.3K views14 slides
High-Performance Python On Spark by
High-Performance Python On SparkHigh-Performance Python On Spark
High-Performance Python On SparkJen Aman
1.7K views35 slides
Building a Hadoop Data Warehouse with Impala by
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impalahuguk
2K views37 slides
Part 2: A Visual Dive into Machine Learning and Deep Learning 
 by
Part 2: A Visual Dive into Machine Learning and Deep Learning 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 
Cloudera, Inc.
1.5K views32 slides
Building a Hadoop Data Warehouse with Impala by
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with ImpalaSwiss Big Data User Group
7.3K views40 slides
Data Science and CDSW by
Data Science and CDSWData Science and CDSW
Data Science and CDSWJason Hubbard
1.3K views19 slides

Similar to Python Data Ecosystem: Thoughts on Building for the Future(20)

Improving Data Interoperability for Python and R by Work-Bench
Improving Data Interoperability for Python and RImproving Data Interoperability for Python and R
Improving Data Interoperability for Python and R
Work-Bench10.3K views
High-Performance Python On Spark by Jen Aman
High-Performance Python On SparkHigh-Performance Python On Spark
High-Performance Python On Spark
Jen Aman1.7K views
Building a Hadoop Data Warehouse with Impala by huguk
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
huguk2K views
Part 2: A Visual Dive into Machine Learning and Deep Learning 
 by Cloudera, Inc.
Part 2: A Visual Dive into Machine Learning and Deep Learning 
Part 2: A Visual Dive into Machine Learning and Deep Learning 

Part 2: A Visual Dive into Machine Learning and Deep Learning 

Cloudera, Inc.1.5K views
Data Science and CDSW by Jason Hubbard
Data Science and CDSWData Science and CDSW
Data Science and CDSW
Jason Hubbard1.3K views
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,... by Data Con LA
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Big Data Day LA 2016/ NoSQL track - Apache Kudu: Fast Analytics on Fast Data,...
Data Con LA369 views
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud by Stefan Lipp
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
Cloudera Analytics and Machine Learning Platform - Optimized for Cloud
Stefan Lipp403 views
Building data pipelines with kite by Joey Echeverria
Building data pipelines with kiteBuilding data pipelines with kite
Building data pipelines with kite
Joey Echeverria5.7K views
Hadoop 3 (2017 hadoop taiwan workshop) by Wei-Chiu Chuang
Hadoop 3 (2017 hadoop taiwan workshop)Hadoop 3 (2017 hadoop taiwan workshop)
Hadoop 3 (2017 hadoop taiwan workshop)
Wei-Chiu Chuang551 views
A brave new world in mutable big data relational storage (Strata NYC 2017) by Todd Lipcon
A brave new world in mutable big data  relational storage (Strata NYC 2017)A brave new world in mutable big data  relational storage (Strata NYC 2017)
A brave new world in mutable big data relational storage (Strata NYC 2017)
Todd Lipcon7.3K views
Pandas & Cloudera: Scaling the Python Data Experience by Turi, Inc.
Pandas & Cloudera: Scaling the Python Data ExperiencePandas & Cloudera: Scaling the Python Data Experience
Pandas & Cloudera: Scaling the Python Data Experience
Turi, Inc.648 views
Analyzing Hadoop Data Using Sparklyr
 by Cloudera, Inc.
Analyzing Hadoop Data Using Sparklyr
Analyzing Hadoop Data Using Sparklyr

Analyzing Hadoop Data Using Sparklyr

Cloudera, Inc.2.4K views
Data Science and Machine Learning for the Enterprise by Cloudera, Inc.
Data Science and Machine Learning for the EnterpriseData Science and Machine Learning for the Enterprise
Data Science and Machine Learning for the Enterprise
Cloudera, Inc.1.3K views
GSJUG: Mastering Data Streaming Pipelines 09May2023 by Timothy Spann
GSJUG: Mastering Data Streaming Pipelines 09May2023GSJUG: Mastering Data Streaming Pipelines 09May2023
GSJUG: Mastering Data Streaming Pipelines 09May2023
Timothy Spann255 views
Machine Learning in the Enterprise 2019 by Timothy Spann
Machine Learning in the Enterprise 2019   Machine Learning in the Enterprise 2019
Machine Learning in the Enterprise 2019
Timothy Spann878 views
Large-Scale Data Science on Hadoop (Intel Big Data Day) by Uri Laserson
Large-Scale Data Science on Hadoop (Intel Big Data Day)Large-Scale Data Science on Hadoop (Intel Big Data Day)
Large-Scale Data Science on Hadoop (Intel Big Data Day)
Uri Laserson1.8K views
Hambug R Meetup - Intro to H2O by Sri Ambati
Hambug R Meetup - Intro to H2OHambug R Meetup - Intro to H2O
Hambug R Meetup - Intro to H2O
Sri Ambati272 views

More from Wes McKinney

Solving Enterprise Data Challenges with Apache Arrow by
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowWes McKinney
1.1K views31 slides
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity by
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityWes McKinney
1.1K views26 slides
Apache Arrow: High Performance Columnar Data Framework by
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkWes McKinney
1.5K views53 slides
New Directions for Apache Arrow by
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache ArrowWes McKinney
1.9K views27 slides
Apache Arrow Flight: A New Gold Standard for Data Transport by
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportWes McKinney
2.2K views31 slides
ACM TechTalks : Apache Arrow and the Future of Data Frames by
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesWes McKinney
2K views47 slides

More from Wes McKinney(14)

Python Data Ecosystem: Thoughts on Building for the Future

  • 1. Python Data Ecosystem: Thoughts on Building for the Future
      Wes McKinney @wesmckinn
      PyData Berlin 2016-05-21
  • 2. Me
      • Data Science Tools at Cloudera, formerly DataPad CEO/founder
      • Serial creator of structured data tools / user interfaces
      • Wrote the bestseller Python for Data Analysis (2012)
      • Open source projects
          • Python {pandas, Ibis, statsmodels}
          • Apache {Arrow, Parquet, Kudu (incubating)}
      • Mostly work in Python and Cython/C/C++
  • 3. In process: Python for Data Analysis: 2nd Edition
      Coming early 2017
  • 4. Building open source communities
  • 5. "Social architecture is the conscious design of an environment that encourages a desired range of social behaviors leading towards some goal or set of goals." (Wikipedia)
  • 6. Step 1: Be open and transparent
  • 7. Step 2: Reach out to others
  • 8. Step 3: Strive for consensus
  • 9. Step 4: Value contributions extending beyond lines of code
  • 10. Step 5: Make things harder for bad actors
  • 11. [No text on this slide]
  • 12. Handling problems carefully
  • 13. http://numfocus.org and http://apache.org
  • 14. Python packaging
  • 15. Packaging is hard
      • Reproducible infrastructure
      • Reproducible toolchains
      • Reproducible build scripts
      • Integration testing
      • Multiple library version builds
      • Multiple Python versions
      • Dependency resolution
      • Hosting and distribution
      • Multiple environment management
  • 16. Reflecting on the past
  • 17. [No text on this slide]
  • 18. conda-forge
      • Community-curated conda package channel (on anaconda.org)
      • Reproducible build infrastructure (Docker + Circle CI + Travis CI + Appveyor)
      • Automated GitHub helper tools
      To enable the channel:
          conda config --add channels conda-forge
  • 19. What's important to me right now?
  • 20. Important things
      • Building bridges with other data science communities (R, Julia, Scala, etc.)
      • Enabling Python to more efficiently talk to other systems (e.g. Hadoop things)
      • Building Python tools for new and changing varieties of data
  • 21. RAM as the new disk?
      • SSD-DRAM performance convergence
      • NVM developments (3D Xpoint)
      [Diagram labels: Memory working set; Consumer (x3)]
  • 22. Problems
      • Memory (data structure) representations
      • Metadata representations
      • Memory ownership, life-cycle
  • 23. NumPy solved this problem for Python scientists
      • Common memory representation
          • ndarray strided, homogeneous buffer
      • Common metadata
          • NumPy dtypes
      • No well-defined memory sharing / messaging model: case by case basis
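To make "strided, homogeneous buffer" concrete, here is a minimal sketch that uses only the Python standard library rather than NumPy itself. It relies on the buffer protocol that ndarray also implements; the variable names are illustrative and the 2 x 3 shape is an arbitrary assumption.

```python
from array import array

# Six float64 values stored back to back in one homogeneous buffer.
values = array("d", [0.0, 1.0, 2.0, 3.0, 4.0, 5.0])

# View the same bytes as a 2 x 3 row-major grid; no data is copied.
# (memoryview.cast needs a byte view as the intermediate step.)
flat = memoryview(values).cast("B")
grid = flat.cast("d", shape=[2, 3])

# strides holds the number of bytes to step along each dimension:
# 3 items * 8 bytes to reach the next row, 8 bytes to reach the next column.
print(grid.shape, grid.strides, grid.format)

# Element (1, 2) sits at byte offset 1 * 24 + 2 * 8 = 40, i.e. flat index 5.
print(grid[1, 2])

# Writing through the original buffer is visible through the view.
values[5] = 50.0
print(grid[1, 2])
```

The shape, strides, and element format together are the "common metadata" that lets different libraries interpret the same block of memory in the same way.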
  • 24. Problems NumPy doesn't solve as well
      • Nested data types (think JSON)
      • Missing / NULL data
      • Strings and category types
      • Columnar memory representation for tables (think: analytic SQL databases)
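One common way to represent missing values in a column is to keep a validity bitmap next to the value buffer; Arrow's columnar format uses a similar idea. The sketch below is a toy illustration in plain Python, not Arrow's implementation, and the helper names build_nullable_column and get are hypothetical.

```python
from array import array


def build_nullable_column(items):
    """Pack a list of int-or-None into (validity bitmap, value buffer)."""
    # Missing slots still occupy space in the value buffer; 0 is a placeholder.
    values = array("q", (0 if item is None else item for item in items))
    bitmap = bytearray((len(items) + 7) // 8)
    for i, item in enumerate(items):
        if item is not None:
            bitmap[i // 8] |= 1 << (i % 8)  # a set bit means "value present"
    return bitmap, values


def get(bitmap, values, i):
    """Return the value at position i, or None if it is marked missing."""
    is_valid = (bitmap[i // 8] >> (i % 8)) & 1
    return values[i] if is_valid else None


bitmap, values = build_nullable_column([7, None, 0, 42])
print([get(bitmap, values, i) for i in range(4)])
print(bin(bitmap[0]))
```

Because validity is tracked separately, a genuine 0 and a missing value stay distinguishable without reserving a sentinel value in the data itself.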
  • 25. Apache Arrow
      http://arrow.apache.org
      Some slides from Strata-HW talk w/ Jacques Nadeau
  • 26. Arrow in a Slide
      • New top-level Apache Software Foundation project
      • Focused on columnar in-memory analytics
          1. 10-100x speedup on many workloads
          2. Common data layer enables companies to choose best-of-breed systems
          3. Designed to work with any programming language
          4. Support for both relational and complex data as-is
      • Developers from 13+ major open source projects involved
      • A significant % of the world's data will be processed through Arrow!
      • Projects shown: Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R
  • 27. Focus on CPU Efficiency
      Example table, shown once row by row (Traditional Memory Buffer) and once column by column (Arrow Memory Buffer):
          session_id   timestamp         source_ip
          1331246660   3/8/2012 2:44PM   99.155.155.225
          1331246351   3/8/2012 2:38PM   65.87.165.114
          1331244570   3/8/2012 2:09PM   71.10.106.181
          1331261196   3/8/2012 6:46PM   76.102.156.138
      • Cache locality
      • Super-scalar & vectorized operation
      • Minimal structure overhead
      • Constant value access
          • With minimal structure overhead
      • Operate directly on columnar compressed data
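The difference between the two buffers can be sketched in a few lines of plain Python, using the four records from the slide. This is only an illustration of the layout idea, not how Arrow stores data; the variable names are illustrative.

```python
from array import array

# Row-oriented: one tuple per record, so fields of different types interleave.
rows = [
    (1331246660, "3/8/2012 2:44PM", "99.155.155.225"),
    (1331246351, "3/8/2012 2:38PM", "65.87.165.114"),
    (1331244570, "3/8/2012 2:09PM", "71.10.106.181"),
    (1331261196, "3/8/2012 6:46PM", "76.102.156.138"),
]

# Column-oriented: all values of one field sit next to each other.
# The integer column fits in a single contiguous typed buffer.
session_id = array("q", (row[0] for row in rows))
timestamp = [row[1] for row in rows]
source_ip = [row[2] for row in rows]

# A scan over one field touches every record in the row layout,
# but only one contiguous buffer in the columnar layout.
max_from_rows = max(row[0] for row in rows)
max_from_column = max(session_id)
print(max_from_rows == max_from_column)
```

Scanning a contiguous, fixed-width column is what makes cache locality and vectorized operations possible.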
  • 28. High Performance Sharing & Interchange
      • Today
          • Each system has its own internal memory format
          • 70-80% CPU wasted on serialization and deserialization
          • Similar functionality implemented in multiple projects
      • With Arrow
          • All systems utilize the same memory format
          • No overhead for cross-system communication
          • Projects can share functionality (e.g., Parquet-to-Arrow reader)
  • 29. Arrow in action: Feather File Format for Python and R
      • Problem: fast, language-agnostic binary data frame file format
      • By Wes McKinney (Python) and Hadley Wickham (R)
      • Read speeds close to disk IO performance
  • 30. Real World Example: Feather File Format for Python and R
      R:
          library(feather)
          path <- "my_data.feather"
          write_feather(df, path)
          df <- read_feather(path)
      Python:
          import feather
          path = 'my_data.feather'
          feather.write_dataframe(df, path)
          df = feather.read_dataframe(path)
  • 31. More on Feather
      [Diagram: a Feather file holds array 0, array 1, array 2, ..., array n - 1, followed by METADATA. The libfeather C++ library connects to R data.frame through Rcpp and to pandas DataFrame through Cython.]
  • 32. Feather: the good and not-so-good
      • Good
          • Language-agnostic memory representation
          • Extremely fast
          • New storage features can be added without much difficulty
      • Not-so-good
          • Data must be converted between the storage representation (Arrow) and in-memory "proprietary" data structures (R / Python data frames)
  • 33. Apache Parquet: Python support is coming
      • Collaborating with Uwe Korn from Blue Yonder
      [Diagram labels: pandas; Arrow (C++ / Python); Parquet (C++)]
  • 34. Shared needs for Python, R, Julia, ...
      • If programming languages can establish a common data frame C/C++-level memory representation, we can share algorithms and libraries much more easily
          • Example: dplyr's in-memory backend
      • Other requirements
          • Permissive licensing (Python / Julia require MIT/Apache-like)
          • Common build/test/packaging for shared C/C++ library components
  • 35. Real World Example: Python With Spark, Drill, Impala
      [Diagram: the SQL engine supplies input partitions (in partition 0 … in partition n - 1) as Python function input; user-supplied Python code produces output, which returns to the SQL engine as output partitions (out partition 0 … out partition n - 1).]
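The partition-in, partition-out flow in the diagram can be sketched as follows. This is a toy model in plain Python: run_udf and double_values are hypothetical names, and a real engine such as Spark, Drill, or Impala would hand over batches of columnar data rather than Python lists.

```python
def run_udf(partitions, func):
    """Apply a user-supplied function to each input partition independently."""
    return [func(partition) for partition in partitions]


def double_values(partition):
    """Example user-supplied function: operates on one whole partition."""
    return [2 * value for value in partition]


# Stand-ins for "in partition 0 ... in partition n - 1".
in_partitions = [[1, 2, 3], [4, 5], [6]]

# Stand-ins for "out partition 0 ... out partition n - 1".
out_partitions = run_udf(in_partitions, double_values)
print(out_partitions)
```

Each partition is processed on its own, so the engine is free to run the function on many partitions in parallel; the cost that a shared memory format aims to reduce is moving each partition between the engine and Python.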
  • 36. Get Involved in Arrow
      • Join the community
          • dev@arrow.apache.org
          • Slack: https://apachearrowslackin.herokuapp.com/
          • http://arrow.apache.org
          • @ApacheArrow
  • 37. Thank you
      Wes McKinney @wesmckinn
      Views are my own