Data Science Languages and Industry Analytics

Talk given on September 19, 2015 at the Berkeley Institute for Data Science, on how comparatively poor JSON / structured data tools pose a challenge for the data science languages (Python, R, Julia, etc.).

Published in: Technology

  1. Data Science Languages and Industry Analytics
     Wes McKinney, BIDS 2015-09-19
     © Cloudera, Inc. All rights reserved.
  2. Me
     • Serial creator of structured data tools / user interfaces
     • Mathematician, MIT '07
     • Professional SQL programmer 2007-2010 (@ AQR)
     • Created pandas, April 2008
     • Wrote Python for Data Analysis 2012
     • Founder of DataPad -> Cloudera
  3. A sample big data architecture
     [Diagram; labels in transcript order: Kafka (four instances), Application data, S3 or HDFS, JSON, Spark/MapReduce, Columnar storage, Analytic SQL Engine, User, SQL]
  4. Big data architectures are currently dominated by Java / JVM languages.
     Python/R/Julia don't have much of a "seat at the table".
  5. Some simplistic generalizations
     Industry Analytics                       Scientific Computing
     Heterogeneous data                       Homogeneous data
     Flat tables and JSON                     Multidimensional arrays
     Spark / MapReduce                        HPC tools
     SQL                                      Linear algebra
     DFS-friendly / streaming data formats    Scientific data formats
     More physical machines                   Fewer physical machines
  6. Many interactive-speed SQL engines ... and more
  7. Ibis: not the direct subject of this talk
     • http://blog.ibis-project.org
     • Crafting a compelling Python-on-Hadoop user experience
       • Remove SQL programming from user workflows
       • Develop high-performance Python extension APIs
     • Pythonic composable DSL designed to target SQL semantics
     • Development roadmap targets Impala (C++ / LLVM) query engine
       • ... but SQL compiler toolchain works well with other SQL dialects
  8. Enabling interoperability with big data systems
     • Distributed / MPP query engines: implemented in a host language
       • Typically C++, Java, or Scala
     • User-defined functions (UDFs) through various means
       • Implement in host language
       • Implement in user language through some external language protocol
     • External UDFs are usually very slow (cf. PL/Python, PySpark, etc.)
  9. What are UDFs good for?
     • Note: industry data scientists have libraries containing 100s of UDFs for Hive or other distributed query engines
     • Custom data transformations
     • Custom domain logic (date / time / data types)
     • Custom data types
     • Custom aggregations (incl. machine learning / statistics expressible as reductions)
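The last bullet, aggregations expressible as reductions, can be illustrated with a minimal sketch. The four-function shape below (`init`, `update`, `merge`, `finalize`) is an assumption chosen for illustration, not the UDF API of Hive, Impala, or any other engine; the partition data is made up.

```python
from functools import reduce

# Hypothetical aggregate-UDF shape: each worker folds its own partition
# into a small state, and the states are then merged pairwise.

def init():
    # State is (running sum, row count).
    return (0.0, 0)

def update(state, value):
    # Fold one input value into a partition-local state.
    return (state[0] + value, state[1] + 1)

def merge(a, b):
    # Combine two partial states; order does not matter.
    return (a[0] + b[0], a[1] + b[1])

def finalize(state):
    # Turn the merged state into the final answer.
    total, count = state
    return total / count if count else None

# Three partitions, as if the column were spread over three machines.
partitions = [[1.0, 2.0, 3.0], [4.0, 5.0], [6.0]]
partials = [reduce(update, part, init()) for part in partitions]
mean = finalize(reduce(merge, partials, init()))
```

Because `merge` is associative and commutative, the partial states can be combined in any order, which is what lets a distributed engine run the aggregation in parallel.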
  10. Why are external UDFs slow?
     • Serialization / deserialization overhead
     • Scalar vs vectorized computations
     • RPC overhead
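A rough sketch of the first two points, under stated assumptions: `pickle` stands in for whatever wire format a real external UDF protocol would use, and the helper names are hypothetical. It only counts round trips; it does not model any specific engine.

```python
import pickle

def add_one_scalar(x):
    # Scalar UDF: sees one value per call.
    return x + 1

def add_one_vectorized(xs):
    # Vectorized UDF: sees a whole column per call.
    return [x + 1 for x in xs]

def call_external_scalar(udf, column):
    # Simulated per-row protocol: every value is serialized on the way
    # in and on the way out, once per row.
    out, round_trips = [], 0
    for value in column:
        arg = pickle.loads(pickle.dumps(value))
        result = pickle.loads(pickle.dumps(udf(arg)))
        out.append(result)
        round_trips += 1
    return out, round_trips

def call_external_batch(udf, column):
    # Simulated batch protocol: one serialization round trip per column.
    arg = pickle.loads(pickle.dumps(column))
    result = pickle.loads(pickle.dumps(udf(arg)))
    return result, 1

column = list(range(1000))
scalar_out, scalar_trips = call_external_scalar(add_one_scalar, column)
batch_out, batch_trips = call_external_batch(add_one_vectorized, column)
```

Both paths produce the same column, but the scalar path pays the serialization and call overhead 1000 times while the batch path pays it once.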
  11. How to make them fast?
     • Common runtime memory representation for tabular data
     • Shared-memory (zero-copy or memcpy-only) external UDF protocol
     • Vectorized UDF interface (for interpreted languages)
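As a small in-process analogy for the zero-copy idea, the sketch below hands a vectorized function a `memoryview` over a column buffer instead of a serialized copy. The function name `scale_in_place` is hypothetical, and a real shared-memory protocol would cross a process boundary, which this example does not attempt.

```python
from array import array

def scale_in_place(values, factor):
    # Hypothetical vectorized UDF: receives the whole column as one
    # buffer and writes results back into the same memory.
    for i in range(len(values)):
        values[i] = values[i] * factor

# A column of float64 values held in one contiguous buffer.
column = array("d", [1.0, 2.0, 3.0, 4.0])

# memoryview exposes the buffer without copying it.
view = memoryview(column)
scale_in_place(view, 10.0)
```

After the call, `column` itself holds the scaled values, because the view and the array share the same underlying memory.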
  12. Memory representation
     • Many query engines are standardizing on in-memory columnar representation of materialized transient data
       • Apache Drill: https://drill.apache.org/faq/
       • Spark
       • Impala: http://blog.cloudera.com/blog/2015/07/whats-next-for-impala-more-reliability-usability-and-performance-at-even-greater-scale/
     • Industry-standard serialization format: Apache Parquet
       • https://parquet.apache.org/
  13. Serialization vs in-memory
     • Serialization formats (e.g. Parquet)
       • Optimize for IO / DFS throughput at the expense of CPU / memory bus throughput
       • Do not consider random access or in-memory analytics as a goal
     • No standardized in-memory containers for materialized data from file / RPC protocols (Parquet, Thrift, protobuf, Avro, etc.)
  14. One possible proposal
     • Standardize on an augmented variant of the Apache Drill in-memory columnar memory layout
       • https://drill.apache.org/docs/value-vectors/
     • Common / shared C implementation for R/Python/Julia
       • Currently all languages have poor support for JSON-like data
       • Make your needs known!
       • Enumerate required data types and other requirements
  15. More on the Drill layout
     persons = [
       {
         name: 'wes',
         addresses: [
           {number: 2, street: 'a'},
           {number: 3, street: 'bb'},
         ]
       },
       {
         name: 'mark',
         addresses: [
           {number: 4, street: 'ccc'},
           {number: 5, street: 'dddd'},
           {number: 6, street: 'f'},
         ]
       },
     ]
  16. Strings in Drill
     person.name
       offset: 0 3
       data:   w e s m a r k
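The offsets-plus-bytes layout on this slide can be sketched as follows. The function names are hypothetical, and the trailing offset (7) that closes the last string is an assumption; the transcript shows only 0 and 3.

```python
def encode_strings(values):
    # Pack strings into one contiguous byte buffer plus an offsets list.
    # offsets[i] is where string i starts and offsets[i + 1] is where it
    # ends, so n strings need n + 1 offsets (the final one is assumed).
    offsets = [0]
    data = bytearray()
    for s in values:
        data.extend(s.encode("utf-8"))
        offsets.append(len(data))
    return offsets, bytes(data)

def get_string(offsets, data, i):
    # Random access to string i is a single slice of the buffer.
    return data[offsets[i]:offsets[i + 1]].decode("utf-8")

offsets, data = encode_strings(["wes", "mark"])
# offsets == [0, 3, 7], data == b"wesmark"
```

Any single name can be read back with `get_string(offsets, data, i)` without scanning the values before it.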
  17. Array<Struct> example
     person.addresses
       offset: 0 2 5
     person.addresses.street
       offset: 0 1 3 6 10
       data:   a b b c c c d d d d f
     person.addresses.number
       values: 2 3 4 5 6
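A minimal sketch of how the `persons` records from the earlier slide could be flattened into these buffers. The function name is hypothetical, and the closing street offset (11) is an assumption, since the transcript stops at 10.

```python
persons = [
    {"name": "wes", "addresses": [
        {"number": 2, "street": "a"},
        {"number": 3, "street": "bb"},
    ]},
    {"name": "mark", "addresses": [
        {"number": 4, "street": "ccc"},
        {"number": 5, "street": "dddd"},
        {"number": 6, "street": "f"},
    ]},
]

def flatten_addresses(records):
    # One offsets list marks where each person's address list begins and
    # ends; each struct field becomes its own flat child column.
    list_offsets = [0]
    numbers = []
    street_offsets = [0]
    street_data = bytearray()
    for rec in records:
        for addr in rec["addresses"]:
            numbers.append(addr["number"])
            street_data.extend(addr["street"].encode("utf-8"))
            street_offsets.append(len(street_data))
        list_offsets.append(len(numbers))
    return list_offsets, numbers, street_offsets, bytes(street_data)

list_offsets, numbers, street_offsets, street_data = flatten_addresses(persons)
```

Person `i` owns the child rows from `list_offsets[i]` up to `list_offsets[i + 1]`, so the second person's street numbers are `numbers[2:5]`.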
  18. Array<Array<Int32>> example
     persons = [
       {
         name: 'wes',
         fav_sequences: [
           [0, 1, 2],
           [2, 3]
         ]
       },
       {
         name: 'mark',
         fav_sequences: [
           [3],
           [4, 5],
           [6, 7]
         ]
       },
     ]
     person.fav_sequences
       offset: 0 2 5
     person.fav_sequences/values
       offset: 0 3 5 6 8
       values: 0 1 2 2 3 3 4 5 6 7
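The two levels of offsets on this slide can be sketched as an encode/decode pair. The function names are hypothetical, and the final inner offset (10) is an assumption; the transcript stops at 8.

```python
def encode_nested(rows):
    # rows is a list of lists of int lists (one entry per person).
    # outer offsets index into the inner offsets; inner offsets index
    # into the flat values buffer.
    outer, inner, values = [0], [0], []
    for row in rows:
        for seq in row:
            values.extend(seq)
            inner.append(len(values))
        outer.append(len(inner) - 1)
    return outer, inner, values

def decode_row(outer, inner, values, i):
    # Rebuild the nested lists for row i using slices only.
    return [values[inner[j]:inner[j + 1]]
            for j in range(outer[i], outer[i + 1])]

rows = [
    [[0, 1, 2], [2, 3]],    # wes
    [[3], [4, 5], [6, 7]],  # mark
]
outer, inner, values = encode_nested(rows)
```

Decoding reverses the encoding exactly: `decode_row(outer, inner, values, 1)` rebuilds the three sequences of the second record.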
  19. Thank you
     Wes McKinney @wesmckinn
     Views are my own