DataFrames: The Good, Bad, and Ugly

Wes McKinney
Wes McKinneyDirector of Ursa Labs, Open Source Developer at Ursa Labs
1	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
DataFrames:	
  The	
  Good,	
  Bad,	
  
and	
  Ugly	
  
Wes	
  McKinney	
  @wesmckinn	
  
NY	
  R	
  Conference,	
  2015-­‐04-­‐25	
  
2	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Disclaimer:	
  the	
  views	
  presented	
  in	
  
this	
  talk	
  are	
  my	
  personal	
  opinions	
  
and	
  not	
  necessarily	
  those	
  of	
  
Cloudera	
  
3	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
This	
  talk	
  
•  Some	
  commentary	
  on	
  all	
  the	
  data	
  frame	
  interfaces	
  out	
  there	
  
•  Biased	
  observaTons	
  and	
  cursory	
  judgments	
  
•  Thoughts	
  on	
  craVing	
  high	
  quality	
  data	
  tools	
  
4	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Disclaimer	
  #2:	
  This	
  is	
  a	
  nuanced	
  
discussion	
  
5	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Who	
  am	
  I?	
  
•  Father	
  of	
  pandas	
  (2008	
  -­‐	
  )	
  	
  
•  Financial	
  analyTcs	
  in	
  R	
  /	
  Python	
  starTng	
  2007	
  
•  2010-­‐2012	
  
• Hiatus	
  from	
  gainful	
  employment	
  
• Make	
  pandas	
  ready	
  for	
  primeTme	
  
• Write	
  "Python	
  for	
  Data	
  Analysis”	
  
•  2013-­‐2014:	
  DataPad	
  with	
  Chang	
  She	
  &	
  co	
  
•  2014	
  -­‐	
  :	
  Cloudera	
  
6	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
What’s	
  in	
  a	
  DataFrame?	
  
	
  
	
  
	
  
7	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
What’s	
  in	
  a	
  DataFrame?	
  
A	
  table	
  with	
  some	
  rows	
  
By	
  any	
  other	
  name	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  	
  
would	
  analyze	
  as	
  sweet.	
  
8	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Got	
  a	
  table?	
  
	
  
9	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Got	
  a	
  table?	
  
Put	
  a	
  DataFrame	
  (interface)	
  on	
  it!	
  
10	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
What	
  is	
  this	
  “data	
  frame”	
  that	
  you	
  speak	
  of	
  
•  A	
  table-­‐like	
  data	
  structure	
  
•  An	
  API	
  /	
  user	
  interface	
  for	
  the	
  table	
  
• SelecTng	
  data	
  
• Math	
  and	
  relaTonal	
  algebra	
  (join,	
  filter,	
  etc.)	
  
• File	
  /	
  database	
  IO	
  
• ad	
  infinitum	
  
11	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Some	
  axes	
  of	
  comparison	
  
•  Data	
  structure	
  internals	
  (types,	
  in-­‐memory	
  representaTon,	
  etc.)	
  
•  Basic	
  table	
  API	
  
•  RelaTonal	
  algebra	
  support	
  
•  Group-­‐by	
  /	
  split-­‐apply-­‐combine	
  API	
  
•  Performance,	
  memory	
  use,	
  evaluaTon	
  semanTcs	
  
•  Missing	
  data	
  
•  Data	
  Tdying	
  /	
  ETL	
  tools	
  
•  IO	
  uTliTes	
  
•  Domain	
  specific	
  tools	
  (e.g.	
  Tme	
  series)	
  
•  …	
  
12	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
The	
  Great	
  Data	
  Tool	
  Decoupling™	
  
•  Thesis:	
  over	
  Tme,	
  user	
  interfaces,	
  data	
  storage,	
  and	
  execuTon	
  engines	
  will	
  
decouple	
  and	
  specialize	
  
•  In	
  fact,	
  you	
  should	
  really	
  want	
  this	
  to	
  happen	
  
• Share	
  systems	
  among	
  languages	
  
• Reduce	
  fragmentaTon	
  and	
  “lock-­‐in”	
  
• ShiV	
  developer	
  focus	
  to	
  usability	
  	
  
•  PredicTon:	
  we’ll	
  be	
  there	
  by	
  2025;	
  sooner	
  if	
  we	
  all	
  get	
  our	
  act	
  together	
  
13	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
CraVing	
  quality	
  data	
  tools	
  
•  Quality	
  /	
  usefulness	
  is	
  usually	
  forged	
  by	
  the	
  fire	
  of	
  basle	
  
•  Real	
  world	
  use	
  cases	
  and	
  social	
  proof	
  trump	
  theory	
  
•  Eat	
  that	
  dog	
  food	
  
•  When	
  in	
  doubt?	
  Look	
  at	
  the	
  test	
  suite.	
  
14	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
R	
  data	
  frames	
  
•  Thin	
  layer	
  on	
  top	
  of	
  R	
  list	
  type	
  
• Sequence	
  of	
  named	
  vectors	
  
• Can	
  have	
  row	
  names	
  (any	
  R	
  vector)	
  
•  Simple	
  column	
  and	
  row	
  selecTon	
  API	
  
•  AnalyTcs,	
  data	
  transformaTon,	
  etc.	
  leV	
  to	
  base	
  package	
  and	
  add-­‐on	
  libraries	
  
•  Richness	
  /	
  usability	
  comes	
  largely	
  from	
  libraries	
  
15	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Some	
  awesome	
  R	
  data	
  frame	
  stuff	
  
•  “Hadley	
  Stack”	
  
• dplyr,	
  Tdyr	
  
• legacy:	
  plyr,	
  reshape2	
  	
  
• ggplot2	
  
•  data.table	
  (data.frame	
  +	
  indices,	
  fast	
  algorithms)	
  
•  xts	
  :	
  Tme	
  series	
  
	
  
16	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
R	
  data	
  frames:	
  rough	
  edges	
  
•  Copy-­‐on-­‐write	
  semanTcs	
  
•  API	
  fragmentaTon	
  /	
  inconsistency	
  
• Use	
  the	
  “Hadley	
  stack”	
  for	
  improved	
  sanity	
  
•  Factor	
  /	
  String	
  dichotomy	
  
• stringsAsFactors=FALSE	
  a	
  blessing	
  and	
  curse	
  
•  Somewhat	
  limited	
  type	
  system	
  
17	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
dplyr	
  
•  Composable	
  table	
  API	
  
•  Good	
  example	
  of	
  what	
  the	
  “decoupled”	
  future	
  might	
  look	
  like	
  
• New	
  in-­‐memory	
  R/RCpp	
  execuTon	
  engine	
  
• SQL	
  backends	
  for	
  large	
  subset	
  of	
  API	
  
18	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Spark	
  DataFrames	
  
•  R/pandas-­‐inspired	
  API	
  for	
  tabular	
  data	
  manipulaTon	
  in	
  Scala,	
  Python,	
  etc.	
  
•  Logical	
  operaTon	
  graphs	
  rewrisen	
  internally	
  in	
  more	
  efficient	
  form	
  
•  Good	
  interop	
  with	
  Spark	
  SQL	
  
•  Some	
  interoperability	
  with	
  pandas	
  
•  ParTal	
  API	
  Decoupling!	
  (it	
  sTll	
  binds	
  you	
  to	
  Spark)	
  
19	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
pandas	
  
•  Several	
  key	
  data	
  structures,	
  data	
  frame	
  among	
  them	
  
•  Considerably	
  more	
  complex	
  internals	
  than	
  other	
  data	
  frame	
  libraries	
  
•  Some	
  good	
  things	
  
• Born	
  of	
  need	
  
• A	
  “baseries	
  included”	
  approach	
  
• Hierarchical	
  axis	
  labeling:	
  addresses	
  some	
  hard	
  use	
  cases	
  at	
  expense	
  of	
  
semanTc	
  complexity	
  
• Strong	
  Tme	
  series	
  support	
  
20	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
pandas:	
  rough	
  edges	
  
•  Axis	
  labelling	
  can	
  get	
  in	
  the	
  way	
  for	
  folks	
  needing	
  “just	
  a	
  table”	
  
•  Ceded	
  control	
  of	
  its	
  type	
  system	
  /	
  data	
  rep’n	
  from	
  day	
  1	
  to	
  NumPy	
  
•  Inefficient	
  string	
  handling	
  (uses	
  NumPy	
  object	
  arrays)	
  
•  Missing	
  data	
  handling	
  less	
  precise	
  than	
  other	
  tools	
  
21	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Julia:	
  DataFrames.jl	
  
•  Started	
  by	
  Harlan	
  Harris	
  &	
  co	
  
•  Part	
  of	
  broader	
  JuliaStats	
  iniTaTve	
  
• More	
  R-­‐like	
  than	
  pandas-­‐like	
  
• Very	
  acTve:	
  >	
  50	
  contributors!	
  	
  
•  STll	
  comparaTvely	
  early	
  
• Less	
  comprehensive	
  API	
  
• More	
  limited	
  IO	
  capabiliTes	
  
22	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Other	
  data	
  frames	
  
•  Saddle	
  (Scala)	
  
• Dev’d	
  by	
  Adam	
  Klein	
  (ex-­‐AQR)	
  at	
  Novus	
  Partners	
  (fintech	
  startup)	
  
• Designed	
  and	
  used	
  for	
  financial	
  use	
  cases	
  
•  Deedle	
  (F#	
  /	
  .NET)	
  
• Dev’d	
  by	
  AK’s	
  colleagues	
  at	
  BlueMountain	
  (hedge	
  fund)	
  
•  GraphLab	
  /	
  Dato	
  
• Really	
  good	
  C++	
  data	
  frame	
  with	
  Python	
  interface	
  
• Dual-­‐licensed:	
  AGPL	
  +	
  Commercial	
  
•  That’s	
  not	
  all!	
  Haskell,	
  Go,	
  usw…	
  
23	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
We’re	
  not	
  done	
  yet	
  
•  The	
  future	
  is	
  JSON-­‐like	
  
• Support	
  for	
  nested	
  types	
  /	
  semi-­‐structured	
  data	
  is	
  sTll	
  weak	
  
•  Wanted:	
  Apache-­‐licensed,	
  community	
  standard	
  C/C++	
  data	
  frame	
  that	
  we	
  all	
  use	
  
(R,	
  Python,	
  Julia)	
  
•  Bring	
  on	
  the	
  Great	
  Decoupling	
  
24	
  ©	
  Cloudera,	
  Inc.	
  All	
  rights	
  reserved.	
  
Thank	
  you	
  
@wesmckinn	
  
1 of 24

More Related Content

What's hot(20)

PyCon Singapore 2013 KeynotePyCon Singapore 2013 Keynote
PyCon Singapore 2013 Keynote
Wes McKinney94.6K views
Productive Data Tools for QuantsProductive Data Tools for Quants
Productive Data Tools for Quants
Wes McKinney1.7K views
Apache Arrow and Python: The latestApache Arrow and Python: The latest
Apache Arrow and Python: The latest
Wes McKinney5.8K views
Apache Arrow - An OverviewApache Arrow - An Overview
Apache Arrow - An Overview
Dremio Corporation2K views
Uber's data science workbenchUber's data science workbench
Uber's data science workbench
Ran Wei3.2K views

Viewers also liked(16)

Similar to DataFrames: The Good, Bad, and Ugly(20)

Kudu: Fast Analytics on Fast DataKudu: Fast Analytics on Fast Data
Kudu: Fast Analytics on Fast Data
michaelguia448 views
Building a Hadoop Data Warehouse with ImpalaBuilding a Hadoop Data Warehouse with Impala
Building a Hadoop Data Warehouse with Impala
Swiss Big Data User Group7.3K views
Impala tech-talk by Dimitris TsirogiannisImpala tech-talk by Dimitris Tsirogiannis
Impala tech-talk by Dimitris Tsirogiannis
Felicia Haggarty654 views
Introduction to Apache KuduIntroduction to Apache Kudu
Introduction to Apache Kudu
Jeff Holoman5.8K views

More from Wes McKinney(20)

New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
Wes McKinney1.9K views
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
Wes McKinney4.2K views
PyCon APAC 2016 KeynotePyCon APAC 2016 Keynote
PyCon APAC 2016 Keynote
Wes McKinney3.6K views
High Performance Python on Apache SparkHigh Performance Python on Apache Spark
High Performance Python on Apache Spark
Wes McKinney16.6K views

Recently uploaded(20)

Java Platform Approach 1.0 - Picnic MeetupJava Platform Approach 1.0 - Picnic Meetup
Java Platform Approach 1.0 - Picnic Meetup
Rick Ossendrijver23 views
Liqid: Composable CXL PreviewLiqid: Composable CXL Preview
Liqid: Composable CXL Preview
CXL Forum118 views
The Research Portal of Catalonia: Growing more (information) & more (services)The Research Portal of Catalonia: Growing more (information) & more (services)
The Research Portal of Catalonia: Growing more (information) & more (services)
CSUC - Consorci de Serveis Universitaris de Catalunya51 views
CXL at OCPCXL at OCP
CXL at OCP
CXL Forum183 views
Web Dev - 1 PPT.pdfWeb Dev - 1 PPT.pdf
Web Dev - 1 PPT.pdf
gdsczhcet48 views
METHOD AND SYSTEM FOR PREDICTING OPTIMAL LOAD FOR WHICH THE YIELD IS MAXIMUM ...METHOD AND SYSTEM FOR PREDICTING OPTIMAL LOAD FOR WHICH THE YIELD IS MAXIMUM ...
METHOD AND SYSTEM FOR PREDICTING OPTIMAL LOAD FOR WHICH THE YIELD IS MAXIMUM ...
Prity Khastgir IPR Strategic India Patent Attorney Amplify Innovation23 views

DataFrames: The Good, Bad, and Ugly

  • 1. 1  ©  Cloudera,  Inc.  All  rights  reserved.   DataFrames:  The  Good,  Bad,   and  Ugly   Wes  McKinney  @wesmckinn   NY  R  Conference,  2015-­‐04-­‐25  
  • 2. 2  ©  Cloudera,  Inc.  All  rights  reserved.   Disclaimer:  the  views  presented  in   this  talk  are  my  personal  opinions   and  not  necessarily  those  of   Cloudera  
  • 3. 3  ©  Cloudera,  Inc.  All  rights  reserved.   This  talk   •  Some  commentary  on  all  the  data  frame  interfaces  out  there   •  Biased  observaTons  and  cursory  judgments   •  Thoughts  on  craVing  high  quality  data  tools  
  • 4. 4  ©  Cloudera,  Inc.  All  rights  reserved.   Disclaimer  #2:  This  is  a  nuanced   discussion  
  • 5. 5  ©  Cloudera,  Inc.  All  rights  reserved.   Who  am  I?   •  Father  of  pandas  (2008  -­‐  )     •  Financial  analyTcs  in  R  /  Python  starTng  2007   •  2010-­‐2012   • Hiatus  from  gainful  employment   • Make  pandas  ready  for  primeTme   • Write  "Python  for  Data  Analysis”   •  2013-­‐2014:  DataPad  with  Chang  She  &  co   •  2014  -­‐  :  Cloudera  
  • 6. 6  ©  Cloudera,  Inc.  All  rights  reserved.   What’s  in  a  DataFrame?        
  • 7. 7  ©  Cloudera,  Inc.  All  rights  reserved.   What’s  in  a  DataFrame?   A  table  with  some  rows   By  any  other  name                       would  analyze  as  sweet.  
  • 8. 8  ©  Cloudera,  Inc.  All  rights  reserved.   Got  a  table?    
  • 9. 9  ©  Cloudera,  Inc.  All  rights  reserved.   Got  a  table?   Put  a  DataFrame  (interface)  on  it!  
  • 10. 10  ©  Cloudera,  Inc.  All  rights  reserved.   What  is  this  “data  frame”  that  you  speak  of   •  A  table-­‐like  data  structure   •  An  API  /  user  interface  for  the  table   • SelecTng  data   • Math  and  relaTonal  algebra  (join,  filter,  etc.)   • File  /  database  IO   • ad  infinitum  
  • 11. 11  ©  Cloudera,  Inc.  All  rights  reserved.   Some  axes  of  comparison   •  Data  structure  internals  (types,  in-­‐memory  representaTon,  etc.)   •  Basic  table  API   •  RelaTonal  algebra  support   •  Group-­‐by  /  split-­‐apply-­‐combine  API   •  Performance,  memory  use,  evaluaTon  semanTcs   •  Missing  data   •  Data  Tdying  /  ETL  tools   •  IO  uTliTes   •  Domain  specific  tools  (e.g.  Tme  series)   •  …  
  • 12. 12  ©  Cloudera,  Inc.  All  rights  reserved.   The  Great  Data  Tool  Decoupling™   •  Thesis:  over  Tme,  user  interfaces,  data  storage,  and  execuTon  engines  will   decouple  and  specialize   •  In  fact,  you  should  really  want  this  to  happen   • Share  systems  among  languages   • Reduce  fragmentaTon  and  “lock-­‐in”   • ShiV  developer  focus  to  usability     •  PredicTon:  we’ll  be  there  by  2025;  sooner  if  we  all  get  our  act  together  
  • 13. 13  ©  Cloudera,  Inc.  All  rights  reserved.   CraVing  quality  data  tools   •  Quality  /  usefulness  is  usually  forged  by  the  fire  of  basle   •  Real  world  use  cases  and  social  proof  trump  theory   •  Eat  that  dog  food   •  When  in  doubt?  Look  at  the  test  suite.  
  • 14. 14  ©  Cloudera,  Inc.  All  rights  reserved.   R  data  frames   •  Thin  layer  on  top  of  R  list  type   • Sequence  of  named  vectors   • Can  have  row  names  (any  R  vector)   •  Simple  column  and  row  selecTon  API   •  AnalyTcs,  data  transformaTon,  etc.  leV  to  base  package  and  add-­‐on  libraries   •  Richness  /  usability  comes  largely  from  libraries  
  • 15. 15  ©  Cloudera,  Inc.  All  rights  reserved.   Some  awesome  R  data  frame  stuff   •  “Hadley  Stack”   • dplyr,  Tdyr   • legacy:  plyr,  reshape2     • ggplot2   •  data.table  (data.frame  +  indices,  fast  algorithms)   •  xts  :  Tme  series    
  • 16. 16  ©  Cloudera,  Inc.  All  rights  reserved.   R  data  frames:  rough  edges   •  Copy-­‐on-­‐write  semanTcs   •  API  fragmentaTon  /  inconsistency   • Use  the  “Hadley  stack”  for  improved  sanity   •  Factor  /  String  dichotomy   • stringsAsFactors=FALSE  a  blessing  and  curse   •  Somewhat  limited  type  system  
  • 17. 17  ©  Cloudera,  Inc.  All  rights  reserved.   dplyr   •  Composable  table  API   •  Good  example  of  what  the  “decoupled”  future  might  look  like   • New  in-­‐memory  R/RCpp  execuTon  engine   • SQL  backends  for  large  subset  of  API  
  • 18. 18  ©  Cloudera,  Inc.  All  rights  reserved.   Spark  DataFrames   •  R/pandas-­‐inspired  API  for  tabular  data  manipulaTon  in  Scala,  Python,  etc.   •  Logical  operaTon  graphs  rewrisen  internally  in  more  efficient  form   •  Good  interop  with  Spark  SQL   •  Some  interoperability  with  pandas   •  ParTal  API  Decoupling!  (it  sTll  binds  you  to  Spark)  
  • 19. 19  ©  Cloudera,  Inc.  All  rights  reserved.   pandas   •  Several  key  data  structures,  data  frame  among  them   •  Considerably  more  complex  internals  than  other  data  frame  libraries   •  Some  good  things   • Born  of  need   • A  “baseries  included”  approach   • Hierarchical  axis  labeling:  addresses  some  hard  use  cases  at  expense  of   semanTc  complexity   • Strong  Tme  series  support  
  • 20. 20  ©  Cloudera,  Inc.  All  rights  reserved.   pandas:  rough  edges   •  Axis  labelling  can  get  in  the  way  for  folks  needing  “just  a  table”   •  Ceded  control  of  its  type  system  /  data  rep’n  from  day  1  to  NumPy   •  Inefficient  string  handling  (uses  NumPy  object  arrays)   •  Missing  data  handling  less  precise  than  other  tools  
  • 21. 21  ©  Cloudera,  Inc.  All  rights  reserved.   Julia:  DataFrames.jl   •  Started  by  Harlan  Harris  &  co   •  Part  of  broader  JuliaStats  iniTaTve   • More  R-­‐like  than  pandas-­‐like   • Very  acTve:  >  50  contributors!     •  STll  comparaTvely  early   • Less  comprehensive  API   • More  limited  IO  capabiliTes  
  • 22. 22  ©  Cloudera,  Inc.  All  rights  reserved.   Other  data  frames   •  Saddle  (Scala)   • Dev’d  by  Adam  Klein  (ex-­‐AQR)  at  Novus  Partners  (fintech  startup)   • Designed  and  used  for  financial  use  cases   •  Deedle  (F#  /  .NET)   • Dev’d  by  AK’s  colleagues  at  BlueMountain  (hedge  fund)   •  GraphLab  /  Dato   • Really  good  C++  data  frame  with  Python  interface   • Dual-­‐licensed:  AGPL  +  Commercial   •  That’s  not  all!  Haskell,  Go,  usw…  
  • 23. 23  ©  Cloudera,  Inc.  All  rights  reserved.   We’re  not  done  yet   •  The  future  is  JSON-­‐like   • Support  for  nested  types  /  semi-­‐structured  data  is  sTll  weak   •  Wanted:  Apache-­‐licensed,  community  standard  C/C++  data  frame  that  we  all  use   (R,  Python,  Julia)   •  Bring  on  the  Great  Decoupling  
  • 24. 24  ©  Cloudera,  Inc.  All  rights  reserved.   Thank  you   @wesmckinn