DataFrames: The Good, Bad, and Ugly

Thoughts on the status quo in "data frame" interfaces

Published in: Technology

  1. DataFrames: The Good, Bad, and Ugly
     Wes McKinney @wesmckinn
     NY R Conference, 2015-04-25
     © Cloudera, Inc. All rights reserved.
  2. Disclaimer: the views presented in this talk are my personal opinions and not necessarily those of Cloudera
  3. This talk
     • Some commentary on all the data frame interfaces out there
     • Biased observations and cursory judgments
     • Thoughts on crafting high quality data tools
  4. Disclaimer #2: This is a nuanced discussion
  5. Who am I?
     • Father of pandas (2008 - )
     • Financial analytics in R / Python starting 2007
     • 2010-2012
         • Hiatus from gainful employment
         • Make pandas ready for primetime
         • Write “Python for Data Analysis”
     • 2013-2014: DataPad with Chang She & co
     • 2014 - : Cloudera
  6. What’s in a DataFrame?
  7. What’s in a DataFrame?
     A table with some rows
     By any other name
     would analyze as sweet.
  8. Got a table?
  9. Got a table? Put a DataFrame (interface) on it!
  10. What is this “data frame” that you speak of
      • A table-like data structure
      • An API / user interface for the table
          • Selecting data
          • Math and relational algebra (join, filter, etc.)
          • File / database IO
          • ad infinitum
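The split in slide 10 between "a table-like data structure" and "an API for the table" can be sketched in plain Python. This is only an illustration under stated assumptions: the dict-of-columns layout and the helper names `n_rows`, `select`, `filter_rows`, and `inner_join` are hypothetical and do not come from the talk or from any data frame library.

```python
# A table as a dict of equal-length columns (hypothetical layout for illustration).
people = {
    "name": ["ann", "bob", "cy"],
    "dept_id": [1, 2, 1],
    "salary": [120, 95, 110],
}
depts = {"dept_id": [1, 2], "dept": ["research", "sales"]}


def n_rows(table):
    """Number of rows, read off the first column."""
    return len(next(iter(table.values()))) if table else 0


def select(table, columns):
    """Selecting data: keep only the named columns."""
    return {c: table[c] for c in columns}


def filter_rows(table, predicate):
    """Relational filter: keep rows for which predicate(row_dict) is true."""
    keep = [i for i in range(n_rows(table))
            if predicate({c: v[i] for c, v in table.items()})]
    return {c: [v[i] for i in keep] for c, v in table.items()}


def inner_join(left, right, key):
    """Inner join on one key column (simple hash join).

    Assumes the two tables share no column names other than the key.
    """
    index = {}
    for j, k in enumerate(right[key]):
        index.setdefault(k, []).append(j)
    out = {c: [] for c in left}
    out.update({c: [] for c in right if c != key})
    for i, k in enumerate(left[key]):
        for j in index.get(k, []):
            for c in left:
                out[c].append(left[c][i])
            for c in right:
                if c != key:
                    out[c].append(right[c][j])
    return out


well_paid = filter_rows(people, lambda row: row["salary"] > 100)
joined = inner_join(select(well_paid, ["name", "dept_id"]), depts, "dept_id")
print(joined)
```

The data structure here is trivial; nearly all of the code is API. That matches the slide's framing that a "data frame" is as much an interface as a container.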
  11. Some axes of comparison
      • Data structure internals (types, in-memory representation, etc.)
      • Basic table API
      • Relational algebra support
      • Group-by / split-apply-combine API
      • Performance, memory use, evaluation semantics
      • Missing data
      • Data tidying / ETL tools
      • IO utilities
      • Domain specific tools (e.g. time series)
      • …
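For readers unfamiliar with the "split-apply-combine" pattern named in slide 11, here is a minimal sketch in plain Python. The function name `group_apply` and the sample rows are assumptions made for illustration; real data frame libraries expose this pattern through their own group-by APIs.

```python
from collections import defaultdict

rows = [
    {"city": "NYC", "sales": 10},
    {"city": "SF", "sales": 7},
    {"city": "NYC", "sales": 5},
]


def group_apply(rows, key, column, func):
    """Split rows by `key`, apply `func` to each group's `column`, combine into a dict."""
    groups = defaultdict(list)
    for row in rows:                       # split
        groups[row[key]].append(row[column])
    # apply func per group, then combine the results into one mapping
    return {k: func(values) for k, values in groups.items()}


totals = group_apply(rows, "city", "sales", sum)
largest = group_apply(rows, "city", "sales", max)
print(totals, largest)
```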
  12. The Great Data Tool Decoupling™
      • Thesis: over time, user interfaces, data storage, and execution engines will decouple and specialize
      • In fact, you should really want this to happen
          • Share systems among languages
          • Reduce fragmentation and “lock-in”
          • Shift developer focus to usability
      • Prediction: we’ll be there by 2025; sooner if we all get our act together
  13. Crafting quality data tools
      • Quality / usefulness is usually forged by the fire of battle
      • Real world use cases and social proof trump theory
      • Eat that dog food
      • When in doubt? Look at the test suite.
  14. R data frames
      • Thin layer on top of R list type
          • Sequence of named vectors
          • Can have row names (any R vector)
      • Simple column and row selection API
      • Analytics, data transformation, etc. left to base package and add-on libraries
      • Richness / usability comes largely from libraries
  15. Some awesome R data frame stuff
      • “Hadley Stack”
          • dplyr, tidyr
          • legacy: plyr, reshape2
          • ggplot2
      • data.table (data.frame + indices, fast algorithms)
      • xts: time series
  16. R data frames: rough edges
      • Copy-on-write semantics
      • API fragmentation / inconsistency
          • Use the “Hadley stack” for improved sanity
      • Factor / String dichotomy
          • stringsAsFactors=FALSE a blessing and curse
      • Somewhat limited type system
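The "copy-on-write semantics" bullet in slide 16 refers to a general technique: two handles share one buffer until one of them writes, at which point the writer silently gets its own copy. The sketch below shows the idea in plain Python. The class `CowVector` and its `share` method are hypothetical, and this is not a description of how R implements it internally.

```python
class CowVector:
    """A vector whose buffer is shared between handles until one of them writes."""

    def __init__(self, values):
        self._buf = list(values)
        self._refs = [1]              # shared count of handles on this buffer

    def share(self):
        """Cheap 'copy': a new handle onto the same underlying buffer."""
        other = CowVector.__new__(CowVector)
        other._buf = self._buf
        other._refs = self._refs
        self._refs[0] += 1
        return other

    def __getitem__(self, i):
        return self._buf[i]

    def __setitem__(self, i, value):
        if self._refs[0] > 1:         # buffer is visible elsewhere: copy before writing
            self._refs[0] -= 1
            self._buf = list(self._buf)
            self._refs = [1]
        self._buf[i] = value


x = CowVector([1, 2, 3])
y = x.share()
shared_before = y._buf is x._buf      # True: no data copied yet
y[0] = 99                             # the write triggers a full copy of the buffer
shared_after = y._buf is x._buf       # False: y now owns a private buffer
print(x[0], y[0], shared_before, shared_after)
```

The convenience is that callers never see each other's modifications. The cost, and plausibly the reason it appears under "rough edges", is that a single-element assignment can trigger a copy of the whole vector without any visible signal in the code.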
  17. dplyr
      • Composable table API
      • Good example of what the “decoupled” future might look like
          • New in-memory R/Rcpp execution engine
          • SQL backends for large subset of API
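The "decoupled" design that slide 17 credits to dplyr, one table API with several execution backends, can be sketched as follows. This is a toy in plain Python, not dplyr's design: the pipeline format and the functions `run_in_memory` and `to_sql` are assumptions made for illustration, and the SQL generator does no escaping.

```python
# A pipeline is plain data: an ordered list of (verb, argument) steps.
pipeline = [
    ("filter", ("salary", ">", 100)),
    ("select", ["name", "salary"]),
]

OPS = {
    ">": lambda a, b: a > b,
    "<": lambda a, b: a < b,
    "==": lambda a, b: a == b,
}


def run_in_memory(rows, pipeline):
    """Backend 1: execute the pipeline directly on a list of row dicts."""
    for verb, arg in pipeline:
        if verb == "filter":
            col, op, value = arg
            rows = [r for r in rows if OPS[op](r[col], value)]
        elif verb == "select":
            rows = [{c: r[c] for c in arg} for r in rows]
        else:
            raise ValueError(f"unknown verb: {verb}")
    return rows


def to_sql(table_name, pipeline):
    """Backend 2: translate the same pipeline to a SQL string.

    Illustration only: no quoting or escaping, and it assumes filters
    only reference columns that survive any later select.
    """
    columns, conditions = "*", []
    for verb, arg in pipeline:
        if verb == "filter":
            col, op, value = arg
            sql_op = "=" if op == "==" else op
            conditions.append(f"{col} {sql_op} {value!r}")
        elif verb == "select":
            columns = ", ".join(arg)
        else:
            raise ValueError(f"unknown verb: {verb}")
    sql = f"SELECT {columns} FROM {table_name}"
    if conditions:
        sql += " WHERE " + " AND ".join(conditions)
    return sql


rows = [
    {"name": "ann", "salary": 120, "dept": "research"},
    {"name": "bob", "salary": 95, "dept": "sales"},
]
local_result = run_in_memory(rows, pipeline)
query = to_sql("people", pipeline)
print(local_result)
print(query)
```

Because the pipeline is data rather than code bound to one engine, the user-facing verbs stay the same while the execution engine is swapped out.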
  18. Spark DataFrames
      • R/pandas-inspired API for tabular data manipulation in Scala, Python, etc.
      • Logical operation graphs rewritten internally in more efficient form
      • Good interop with Spark SQL
      • Some interoperability with pandas
      • Partial API Decoupling! (it still binds you to Spark)
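To make "logical operation graphs rewritten internally in more efficient form" concrete, here is a toy plan rewriter in plain Python. It is a sketch of the general idea (moving a filter beneath a projection) and not Spark's optimizer. The tuple-based plan encoding and the functions `optimize` and `execute` are hypothetical.

```python
# Plan nodes are nested tuples (hypothetical encoding):
#   ("scan", table_name)
#   ("select", [columns], child)
#   ("filter", (column, minimum), child)   keeps rows where row[column] > minimum


def optimize(plan):
    """Rewrite Filter(Select(x)) into Select(Filter(x)) wherever it occurs.

    This is safe because a filter placed after a select can only refer to
    columns that the select kept, and those also exist before the select.
    """
    kind = plan[0]
    if kind == "scan":
        return plan
    if kind == "select":
        _, columns, child = plan
        return ("select", columns, optimize(child))
    if kind == "filter":
        _, predicate, child = plan
        child = optimize(child)
        if child[0] == "select":
            _, columns, grandchild = child
            return ("select", columns, optimize(("filter", predicate, grandchild)))
        return ("filter", predicate, child)
    raise ValueError(f"unknown node: {kind}")


def execute(plan, tables):
    """Naive interpreter, used here to check that both plans agree."""
    kind = plan[0]
    if kind == "scan":
        return list(tables[plan[1]])
    if kind == "select":
        _, columns, child = plan
        return [{c: r[c] for c in columns} for r in execute(child, tables)]
    if kind == "filter":
        _, (column, minimum), child = plan
        return [r for r in execute(child, tables) if r[column] > minimum]
    raise ValueError(f"unknown node: {kind}")


tables = {"people": [
    {"name": "ann", "salary": 120, "dept": "research"},
    {"name": "bob", "salary": 95, "dept": "sales"},
]}
plan = ("filter", ("salary", 100), ("select", ["name", "salary"], ("scan", "people")))
optimized = optimize(plan)
print(optimized)
print(execute(plan, tables) == execute(optimized, tables))
```

The user writes operations in whatever order is natural; the engine is free to reorder them as long as the result is unchanged. In the rewritten plan the projection only touches rows that survive the filter.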
  19. pandas
      • Several key data structures, data frame among them
      • Considerably more complex internals than other data frame libraries
      • Some good things
          • Born of need
          • A “batteries included” approach
          • Hierarchical axis labeling: addresses some hard use cases at expense of semantic complexity
          • Strong time series support
  20. pandas: rough edges
      • Axis labelling can get in the way for folks needing “just a table”
      • Ceded control of its type system / data representation from day 1 to NumPy
      • Inefficient string handling (uses NumPy object arrays)
      • Missing data handling less precise than other tools
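Slide 20 does not spell out what "less precise" missing data handling means. One way such imprecision can arise is when a missing value in an integer column is encoded as a floating-point NaN, which forces the whole column to float. The plain-Python sketch below illustrates that effect under this assumed encoding; it is an analogy, not a reproduction of pandas internals.

```python
import math

# An integer column with one missing value.
ids = [2**53 + 1, None, 7]

# Assumed encoding for illustration: store the column as floats, NaN = missing.
as_float = [float("nan") if v is None else float(v) for v in ids]

is_missing = [math.isnan(v) for v in as_float]
round_trip = [None if math.isnan(v) else int(v) for v in as_float]

print(is_missing)
print(ids[0], round_trip[0], ids[0] == round_trip[0])
```

Two things are lost under this encoding: integers above 2**53 no longer round-trip exactly, and a genuinely computed NaN can no longer be told apart from "value not present". Tools with a dedicated missing-value marker per type avoid both problems.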
  21. Julia: DataFrames.jl
      • Started by Harlan Harris & co
      • Part of broader JuliaStats initiative
          • More R-like than pandas-like
          • Very active: > 50 contributors!
      • Still comparatively early
          • Less comprehensive API
          • More limited IO capabilities
  22. Other data frames
      • Saddle (Scala)
          • Developed by Adam Klein (ex-AQR) at Novus Partners (fintech startup)
          • Designed and used for financial use cases
      • Deedle (F# / .NET)
          • Developed by Adam Klein’s colleagues at BlueMountain (hedge fund)
      • GraphLab / Dato
          • Really good C++ data frame with Python interface
          • Dual-licensed: AGPL + Commercial
      • That’s not all! Haskell, Go, etc.
  23. We’re not done yet
      • The future is JSON-like
          • Support for nested types / semi-structured data is still weak
      • Wanted: Apache-licensed, community standard C/C++ data frame that we all use (R, Python, Julia)
      • Bring on the Great Decoupling
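The claim in slide 23 that support for nested, JSON-like data is weak in flat tables can be illustrated with a short standard-library sketch. The sample records and the `flatten` helper are assumptions made for illustration.

```python
import json

records = json.loads("""
[
  {"user": {"name": "ann", "langs": ["R", "Python"]}, "visits": 3},
  {"user": {"name": "bob"}, "visits": 1}
]
""")


def flatten(record, prefix=""):
    """Flatten nested dicts into dotted column names; lists are left as-is."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value        # a list stays an opaque cell value
    return flat


flat_rows = [flatten(r) for r in records]
columns = sorted({c for row in flat_rows for c in row})
table = {c: [row.get(c) for row in flat_rows] for c in columns}
print(table)
```

Nested objects flatten into dotted column names easily enough, but the variable-length `langs` list has no natural place in a flat table: it ends up as an opaque per-cell value, and the record that lacks it becomes a missing entry. Handling such fields as first-class typed columns is the gap the slide points at.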
  24. Thank you
      @wesmckinn
