DataFrames: The Good, Bad, and Ugly

Thoughts on the status quo in "data frame" interfaces

Published in: Technology

  1. DataFrames: The Good, Bad, and Ugly
     Wes McKinney @wesmckinn
     NY R Conference, 2015-04-25
     © Cloudera, Inc. All rights reserved.
  2. Disclaimer: the views presented in this talk are my personal opinions and not necessarily those of Cloudera
  3. This talk
     • Some commentary on all the data frame interfaces out there
     • Biased observations and cursory judgments
     • Thoughts on crafting high quality data tools
  4. Disclaimer #2: This is a nuanced discussion
  5. Who am I?
     • Father of pandas (2008 - )
     • Financial analytics in R / Python starting 2007
     • 2010-2012
       • Hiatus from gainful employment
       • Make pandas ready for primetime
       • Write “Python for Data Analysis”
     • 2013-2014: DataPad with Chang She & co
     • 2014 - : Cloudera
  6. What’s in a DataFrame?
  7. What’s in a DataFrame?
     A table with some rows
     By any other name
     would analyze as sweet.
  8. Got a table?
  9. Got a table?
     Put a DataFrame (interface) on it!
  10. What is this “data frame” that you speak of
      • A table-like data structure
      • An API / user interface for the table
        • Selecting data
        • Math and relational algebra (join, filter, etc.)
        • File / database IO
        • ad infinitum
  11. Some axes of comparison
      • Data structure internals (types, in-memory representation, etc.)
      • Basic table API
      • Relational algebra support
      • Group-by / split-apply-combine API
      • Performance, memory use, evaluation semantics
      • Missing data
      • Data tidying / ETL tools
      • IO utilities
      • Domain specific tools (e.g. time series)
      • …
  12. The Great Data Tool Decoupling™
      • Thesis: over time, user interfaces, data storage, and execution engines will decouple and specialize
      • In fact, you should really want this to happen
        • Share systems among languages
        • Reduce fragmentation and “lock-in”
        • Shift developer focus to usability
      • Prediction: we’ll be there by 2025; sooner if we all get our act together
  13. Crafting quality data tools
      • Quality / usefulness is usually forged by the fire of battle
      • Real world use cases and social proof trump theory
      • Eat that dog food
      • When in doubt? Look at the test suite.
  14. R data frames
      • Thin layer on top of R list type
        • Sequence of named vectors
        • Can have row names (any R vector)
      • Simple column and row selection API
      • Analytics, data transformation, etc. left to base package and add-on libraries
      • Richness / usability comes largely from libraries
  15. Some awesome R data frame stuff
      • “Hadley Stack”
        • dplyr, tidyr
        • legacy: plyr, reshape2
        • ggplot2
      • data.table (data.frame + indices, fast algorithms)
      • xts: time series
  16. R data frames: rough edges
      • Copy-on-write semantics
      • API fragmentation / inconsistency
        • Use the “Hadley stack” for improved sanity
      • Factor / String dichotomy
        • stringsAsFactors=FALSE a blessing and curse
      • Somewhat limited type system
  17. dplyr
      • Composable table API
      • Good example of what the “decoupled” future might look like
        • New in-memory R/Rcpp execution engine
        • SQL backends for large subset of API
  18. Spark DataFrames
      • R/pandas-inspired API for tabular data manipulation in Scala, Python, etc.
      • Logical operation graphs rewritten internally in more efficient form
      • Good interop with Spark SQL
      • Some interoperability with pandas
      • Partial API Decoupling! (it still binds you to Spark)
  19. pandas
      • Several key data structures, data frame among them
      • Considerably more complex internals than other data frame libraries
      • Some good things
        • Born of need
        • A “batteries included” approach
        • Hierarchical axis labeling: addresses some hard use cases at expense of semantic complexity
        • Strong time series support
  20. pandas: rough edges
      • Axis labelling can get in the way for folks needing “just a table”
      • Ceded control of its type system / data rep’n from day 1 to NumPy
      • Inefficient string handling (uses NumPy object arrays)
      • Missing data handling less precise than other tools
  21. Julia: DataFrames.jl
      • Started by Harlan Harris & co
      • Part of broader JuliaStats initiative
        • More R-like than pandas-like
        • Very active: > 50 contributors!
      • Still comparatively early
        • Less comprehensive API
        • More limited IO capabilities
  22. Other data frames
      • Saddle (Scala)
        • Dev’d by Adam Klein (ex-AQR) at Novus Partners (fintech startup)
        • Designed and used for financial use cases
      • Deedle (F# / .NET)
        • Dev’d by AK’s colleagues at BlueMountain (hedge fund)
      • GraphLab / Dato
        • Really good C++ data frame with Python interface
        • Dual-licensed: AGPL + Commercial
      • That’s not all! Haskell, Go, usw…
  23. We’re not done yet
      • The future is JSON-like
        • Support for nested types / semi-structured data is still weak
      • Wanted: Apache-licensed, community standard C/C++ data frame that we all use (R, Python, Julia)
      • Bring on the Great Decoupling
  24. Thank you
      @wesmckinn
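
Slides 10 and 14 describe a data frame as a table-like structure (in R, a sequence of named vectors) with an API on top for selecting, filtering, and grouping. A minimal sketch of that idea in plain Python might look like the following. The `Frame` class and its method names are hypothetical, invented here for illustration; they are not the API of R, pandas, or any other library named in the talk.

```python
from collections import defaultdict


class Frame:
    """Toy column-oriented table: a mapping of column name -> list of values.

    Hypothetical illustration only; not the API of any real library.
    """

    def __init__(self, columns):
        lengths = {len(values) for values in columns.values()}
        if len(lengths) > 1:
            raise ValueError("all columns must have the same length")
        # Copy each column so the frame owns its data.
        self.columns = {name: list(values) for name, values in columns.items()}

    def __len__(self):
        # Number of rows; an empty frame has zero rows.
        return len(next(iter(self.columns.values()), []))

    def row(self, i):
        """Return row i as a dict of column name -> value."""
        return {name: values[i] for name, values in self.columns.items()}

    def select(self, *names):
        """Column selection: keep only the named columns."""
        return Frame({name: self.columns[name] for name in names})

    def filter(self, predicate):
        """Row selection: keep rows for which predicate(row_dict) is true."""
        keep = [i for i in range(len(self)) if predicate(self.row(i))]
        return Frame({n: [v[i] for i in keep] for n, v in self.columns.items()})

    def group_sum(self, key, value):
        """Split-apply-combine: sum `value` within each distinct `key`."""
        totals = defaultdict(float)
        for k, v in zip(self.columns[key], self.columns[value]):
            totals[k] += v
        return Frame({key: list(totals), value: list(totals.values())})


# Made-up example data.
trades = Frame({"symbol": ["AAPL", "MSFT", "AAPL"], "qty": [10, 5, 7]})
big = trades.filter(lambda r: r["qty"] > 6)
totals = trades.group_sum("symbol", "qty")
```

Storing one list per column mirrors the "sequence of named vectors" layout from slide 14. Everything else is API layered on top, which is the split between data structure and interface that the talk keeps returning to.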

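Slides 12, 17, and 18 argue that the table API can be separated from the engine that executes it, and that a recorded graph of logical operations can be rewritten before it runs. The sketch below is one way to picture that, under heavy simplification: operations are only recorded, a tiny rewrite step fuses adjacent filters, and nothing executes until `collect()` is called. `LazyTable`, `Filter`, `Select`, and `optimized_plan` are hypothetical names; this is not how Spark's optimizer or dplyr's backends are implemented.

```python
from dataclasses import dataclass
from typing import Callable, Tuple


@dataclass(frozen=True)
class Filter:
    predicate: Callable[[dict], bool]


@dataclass(frozen=True)
class Select:
    names: Tuple[str, ...]


class LazyTable:
    """Records a plan of logical operations; runs nothing until collect()."""

    def __init__(self, rows, plan=()):
        self.rows = rows
        self.plan = tuple(plan)

    def filter(self, predicate):
        return LazyTable(self.rows, self.plan + (Filter(predicate),))

    def select(self, *names):
        return LazyTable(self.rows, self.plan + (Select(names),))

    def optimized_plan(self):
        """Rewrite step: fuse runs of adjacent filters into a single filter."""
        fused = []
        for op in self.plan:
            if isinstance(op, Filter) and fused and isinstance(fused[-1], Filter):
                first, second = fused[-1].predicate, op.predicate
                # Bind both predicates as defaults so the lambda keeps them.
                fused[-1] = Filter(lambda r, f=first, g=second: f(r) and g(r))
            else:
                fused.append(op)
        return fused

    def collect(self):
        """The 'engine': a naive in-memory executor over a list of dicts."""
        rows = self.rows
        for op in self.optimized_plan():
            if isinstance(op, Filter):
                rows = [r for r in rows if op.predicate(r)]
            else:
                rows = [{n: r[n] for n in op.names} for r in rows]
        return rows


# Made-up example data.
rows = [
    {"symbol": "AAPL", "qty": 10, "px": 125.0},
    {"symbol": "MSFT", "qty": 5, "px": 47.0},
    {"symbol": "AAPL", "qty": 7, "px": 126.0},
]
query = (
    LazyTable(rows)
    .filter(lambda r: r["symbol"] == "AAPL")
    .filter(lambda r: r["qty"] > 8)
    .select("symbol", "qty")
)
result = query.collect()
```

Because the user-facing methods only build a plan, the `collect()` executor could in principle be replaced by a different engine, for example one that translates the plan to SQL, without changing the calling code.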