PyData: The Next Generation

  1. PyData: The Next Generation
     Wes McKinney @wesmckinn
     Data Day Texas 2015 #ddtx15
     © Cloudera, Inc. All rights reserved.
  2. PyData: Everything’s awesome…or is it?
     Wes McKinney @wesmckinn
     Data Day Texas 2015 #ddtx15
  3. Me
     • Data systems, tools, Python guru at Cloudera
     • Formerly Founder/CEO of DataPad (visual analytics startup)
     • Created pandas in 2008, lead developer until 2013
     • Python for Data Analysis, published 10/2012
       • O’Reilly’s best-selling data book of 2014
     • Pythonista since 2007
  4. What’s this about?
     • Hopes and fears for the community and ecosystem
     • Why do I care?
       • Python is fun!
       • Leverage
       • Accessibility for newbies
       • Community: smart, nice, humble people
  5. Python at Cloudera
     • Want Cloudera platform users to be successful with Python
     • Spark/PySpark part of the Enterprise Data Hub / CDH
     • Actively investing in Python tooling
       • (p.s. we’re hiring?)
       • (p.p.s. we have an Austin office now!)
  6. Historical perspective and background
     • 20 years of fast numerical computing in Python (Numeric 1995)
     • 10 years of NumPy
     • PyData becomes a thing in 2012
     • Python as a data language goes mainstream
       • Job descriptions tell all
       • Shift in larger Python community from web towards data
     • PyCon 2015 committee reported substantial growth in data-related submissions!
  7. How’d this happen?
     • Data, data everywhere
     • Science! scikit-learn, statsmodels, and friends
     • Comprehensive data wrangling tools and in-memory analytics/reporting (pandas)
     • IPython Notebook
     • Learning resources (books, conferences, blogs, etc.)
     • Python environment/library management that “just works”
  8. Put a Python (interface) on it!
     Something no one got fired for, ever.
  9. Meanwhile…
     • Hadoop and Big Data go mainstream in 2009 onward
       • First Hadoop World: Fall 2009
       • First Strata conference: Spring 2011
     • Lots of smart engineers in fast-growing businesses with massive analytics / ETL problems
     • Solutions built, frameworks developed, companies founded
     • Python was generally not a central part of those solutions
       • A lot of our nice things weren’t much help for data munging and counting at scale (more on this later)
  10. We’re lucky to have lots of nice things
     • What a language!
     • IPython: interactive computing and collaboration
     • Libraries to solve nearly any (non-big data) problem
     • Trustworthy (medium) data wrangling, statistics, machine learning
     • HPC / GPU / parallel computing frameworks
     • FFI tools
     • … and much more
  11. “If this isn’t nice, what is?”
     - Kurt Vonnegut
  12. So, what kind of big data?
     • Big multidimensional arrays / linear algebra
     • Big tables (structured data)
     • Big text data (unstructured data)
     • Empirically I personally am mostly interested in big tables
  13. What kind of big data problems?
     • ETL / Data Wrangling
       • Python has been used here for years with Hadoop Streaming
     • BI / Analytics (“things you can do in SQL”)
     • Advanced Analytics / Machine Learning
  14. Some ways we are #winning
     • Python seen as a viable alternative to SAS/MATLAB/proprietary software without nearly as much arguing
     • Huge uptake in the financial sector
     • Many current and upcoming generations of data scientists learning Python as a first language
     • Python in HPC / scientific computing
  15. Some ways we are not #winning
     • Python still doesn’t have a great “big data story”
     • Little venture capital trickling down to Python projects
     • Data structures and programming APIs lagging modern realities
     • Weak support for emerging data formats
     • Many companies with Python big data successes have not open-sourced their work
  16. Python in big data workflows in practice
     [Diagram labels: Big Data, Many machines; Small/Medium Data, One Machine; HDFS; Hadoop-MR; Spark; SQL; DSLs; pandas; Viz tools; ML / Stats; More counting / ETL; More insights / reporting]
  17. Big data storage formats
     • JSON and CSV are not a good way to warehouse data
     • Apache Avro
       • Compact binary data serialization format
       • RPC framework
     • Apache Parquet
       • Efficient columnar data format optimized for HDFS
       • Supports nested and repeated fields, compression, encoding schemes
       • Co-developed by Twitter and Cloudera
       • Reference impl’s in Impala (C++), and standalone Java/Scala (used in Spark)
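As a rough illustration of why the slide prefers columnar binary formats to JSON or CSV for warehousing, here is a minimal sketch using only the Python standard library. It does not use Avro or Parquet; the field names (`user_id`, `score`) and the fixed-width layout are assumptions made up for the example.

```python
import json
import struct

# Hypothetical sample records; the field names are illustrative only.
rows = [{"user_id": i, "score": i * 0.5} for i in range(1000)]

# Row-oriented text: every record repeats the field names.
as_json = "\n".join(json.dumps(r) for r in rows).encode("utf-8")

# Column-oriented binary: one contiguous, fixed-width buffer per column.
user_ids = struct.pack("<%dq" % len(rows), *(r["user_id"] for r in rows))
scores = struct.pack("<%dd" % len(rows), *(r["score"] for r in rows))


def sum_scores_from_columns(score_buffer):
    """Scan a single column without touching any other field."""
    count = len(score_buffer) // 8  # 8 bytes per float64
    return sum(struct.unpack("<%dd" % count, score_buffer))


def sum_scores_from_json(payload):
    """Parse every whole record just to read one field."""
    return sum(json.loads(line)["score"] for line in payload.splitlines())
```

Both functions return the same total, but the columnar version reads only the bytes of the one column it needs, and the two binary buffers together are smaller than the JSON text. Real columnar formats add compression and encoding schemes on top of this basic layout.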
  18. We’re living in a JVM world
     • Scala rapidly taking over big data analytics
       • Functional, concise, good for building high level DSLs
       • Build nice Scala APIs to clunkier Java frameworks
     • JVM legitimately good for concurrent, distributed systems
     • Binary interface with Python a major issue
  19. Dremel, baby, Dremel…
     • VLDB 2010: Dremel: Interactive Analysis of Web-Scale Datasets
     • Inspiration for Parquet (cf blog “Dremel made easy with Parquet”)
     • Peta-scale analytics directly on nested data
     • Google BigQuery said to be an IaaS-ification of Dremel
       • Supports SQL variant + new user-defined functions with JavaScript + V8
     SELECT COUNT(c1 > c2) FROM
       (SELECT SUM(a.b.c.d) WITHIN RECORD AS c1,
               SUM(a.b.p.q.r) WITHIN RECORD AS c2
        FROM T3)
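To make the `WITHIN RECORD` idea in the query above concrete, the following is a minimal pure-Python sketch of aggregating over nested, repeated fields one record at a time. It is not how Dremel or BigQuery execute the query. The sample records are invented, and reading the outer `COUNT(c1 > c2)` as "count the records where c1 exceeds c2" is an assumption. An empty path also sums to 0 here, where SQL would give NULL.

```python
def leaf_values(node, path):
    """Yield every value reachable along `path`, fanning out over repeated (list) fields."""
    if isinstance(node, list):
        for item in node:
            yield from leaf_values(item, path)
    elif not path:
        yield node
    elif isinstance(node, dict) and path[0] in node:
        yield from leaf_values(node[path[0]], path[1:])
    # A missing optional field contributes no values.


def sum_within_record(record, dotted_path):
    """Sum all leaves under `dotted_path` inside a single record."""
    return sum(leaf_values(record, dotted_path.split(".")))


# Hypothetical nested records standing in for table T3.
records = [
    {"a": {"b": [{"c": {"d": [1, 2]}, "p": {"q": {"r": [1]}}},
                 {"c": {"d": [3]}}]}},
    {"a": {"b": [{"p": {"q": {"r": [5, 5]}}}]}},
]

per_record = [
    (sum_within_record(r, "a.b.c.d"), sum_within_record(r, "a.b.p.q.r"))
    for r in records
]
matching = sum(1 for c1, c2 in per_record if c1 > c2)
```

The point of the sketch is that each record yields one `(c1, c2)` pair no matter how many repeated values sit beneath the path, which is what lets the outer query treat nested data like a flat table.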
  20. Cloudera Impala
     • Open-source interactive SQL for Hadoop
     • Analytical query processor written in C++ with LLVM code generation
     • Optimized to scan tables (best as Parquet format) in HDFS
     • SQL front-end and query optimizer / planner
     • User-defined function API (C++)
       • impyla enables Python UDFs to be compiled with Numba to LLVM IR
  21. Cloudera Impala (cont’d)
     • For high performance big data analytics, Impala could be Python’s best friend
     • C++/LLVM backend is lower-level than SQL
     • Nested data support is coming
  22. Some interesting things in recent times
  23. Set point: Hadley Wickham
     • R has upped its game with dplyr, tidyr, and other new projects
     • New standard for a uniform interface to either in-memory or in-database data processing
     • Composable table primitive operations
     • Multiple major versions shipped, getting adopted
     80dc69b 2012-10-28 | Initial commit of dplyr [hadley]
     tbl %>% filter(c == 'bar') %>% group_by(a, b)
         %>% summarise(metric = mean(d - f))
         %>% arrange(desc(metric))
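For readers who do not use R, the dplyr pipeline above can be read as four composable table operations. The sketch below mirrors them in plain Python over a list of dicts. It is an illustration of the semantics only, not an existing Python library, and the helper names (`filter_rows`, `group_by`, `summarise`, `arrange_desc`) and sample rows are made up for the example.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows with the columns used in the slide's pipeline.
tbl = [
    {"a": "x", "b": 1, "c": "bar", "d": 10.0, "f": 4.0},
    {"a": "x", "b": 1, "c": "bar", "d": 8.0, "f": 4.0},
    {"a": "y", "b": 2, "c": "bar", "d": 20.0, "f": 5.0},
    {"a": "y", "b": 2, "c": "foo", "d": 99.0, "f": 1.0},
]


def filter_rows(rows, predicate):
    """Keep only the rows for which the predicate holds."""
    return [row for row in rows if predicate(row)]


def group_by(rows, *keys):
    """Bucket rows by the values of the given key columns."""
    groups = defaultdict(list)
    for row in rows:
        groups[tuple(row[k] for k in keys)].append(row)
    return keys, groups


def summarise(grouped, **aggregations):
    """Collapse each group to one row holding the keys plus each named aggregate."""
    keys, groups = grouped
    out = []
    for key_values, members in groups.items():
        row = dict(zip(keys, key_values))
        for name, aggregate in aggregations.items():
            row[name] = aggregate(members)
        out.append(row)
    return out


def arrange_desc(rows, column):
    """Sort rows by one column, largest first."""
    return sorted(rows, key=lambda row: row[column], reverse=True)


result = arrange_desc(
    summarise(
        group_by(filter_rows(tbl, lambda r: r["c"] == "bar"), "a", "b"),
        metric=lambda rows: mean(r["d"] - r["f"] for r in rows),
    ),
    "metric",
)
```

The nested calls read inside-out, which is exactly the readability problem the `%>%` pipe operator solves in R.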
  24. Blaze
     • Shares some semantics with dplyr
     • Uses a generalized datashape protocol
     • Fresh start in 2014 under Matthew Rocklin’s (Continuum) direction
       • Deferred expression API
       • Support for piping data between storage systems
       • Multiple backends (pandas, SQL, MongoDB, PySpark, …)
       • Growing support for out-of-core analytics
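The combination of a deferred expression API with multiple backends can be sketched in a few lines. The code below is a toy illustration of the general pattern, not Blaze's API. Every class name here (`Expr`, `Column`, `Literal`, `BinOp`, `ListBackend`, `SqlTextBackend`) is hypothetical.

```python
import operator


class Expr:
    """Base class: operators build a tree instead of computing a value."""

    def __add__(self, other):
        return BinOp(operator.add, self, as_expr(other))

    def __mul__(self, other):
        return BinOp(operator.mul, self, as_expr(other))


class Column(Expr):
    def __init__(self, name):
        self.name = name

    def evaluate(self, backend):
        return backend.column(self.name)


class Literal(Expr):
    def __init__(self, value):
        self.value = value

    def evaluate(self, backend):
        return backend.literal(self.value)


class BinOp(Expr):
    def __init__(self, op, left, right):
        self.op, self.left, self.right = op, left, right

    def evaluate(self, backend):
        return backend.apply(
            self.op, self.left.evaluate(backend), self.right.evaluate(backend)
        )


def as_expr(value):
    """Wrap plain Python values so they can take part in an expression tree."""
    return value if isinstance(value, Expr) else Literal(value)


class ListBackend:
    """Computes the expression over in-memory Python lists."""

    def __init__(self, table):
        self.table = table
        self.length = len(next(iter(table.values())))

    def column(self, name):
        return list(self.table[name])

    def literal(self, value):
        return [value] * self.length

    def apply(self, op, left, right):
        return [op(l, r) for l, r in zip(left, right)]


class SqlTextBackend:
    """Renders the same expression as a SQL fragment instead of computing it."""

    symbols = {operator.add: "+", operator.mul: "*"}

    def column(self, name):
        return name

    def literal(self, value):
        return repr(value)

    def apply(self, op, left, right):
        return "(%s %s %s)" % (left, self.symbols[op], right)


# Building the expression performs no computation.
expr = Column("price") * Column("qty") + 1
in_memory = expr.evaluate(ListBackend({"price": [2, 3], "qty": [5, 7]}))
as_sql = expr.evaluate(SqlTextBackend())
```

Because the expression is only a description, the same tree can be computed locally or handed to a database as text, which is the property that dplyr and Blaze both rely on.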
  25. libdynd
     • Led by Mark Wiebe at Continuum Analytics
     • Pure C++11 modern reimagining of NumPy
     • Python bindings
     • Supports variadic data cells and nested types (datashape protocol)
     • Development has focused on the data container design over analytics
  26. PySpark
     • Popularity may exceed official Scala API
     • Spark was not exactly designed to be an ideal companion to Python
     • General architecture
       • Users build Spark deferred expression graphs in Python
       • User-supplied functions are serialized and broadcast around the cluster
       • Spark plans job and breaks work into tasks executed by Python worker jobs
     • Data is managed / shuffled by the Spark Scala master process
     • Python used largely as a black box to transform input to output
  27. PySpark: Some more gory details
     • Spark master controlled using py4j
       • Py4J docs: “If performance is critical to your application, accessing Java objects from Python programs might not be the best idea”
     • Data is marshalled mostly with files with various serialization protocols (pickle + bespoke formats)
     • Does not natively interface with NumPy (yet)
     • But, the in-memory benefits of Spark over Hadoop Streaming alternatives massively outweigh the downsides
     # pass large object by py4j is very slow and need much memory
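The "Python as a black box" pattern, with data marshalled through pickled files, can be sketched with the standard library alone. This is not PySpark's actual code. The function names are hypothetical, everything runs in one process, and the real system also has to ship the user function to remote workers.

```python
import os
import pickle
import tempfile


def write_batch(path, records):
    """Serialize one partition of records to a file."""
    with open(path, "wb") as handle:
        pickle.dump(records, handle, protocol=pickle.HIGHEST_PROTOCOL)


def read_batch(path):
    """Deserialize one partition of records from a file."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


def run_worker_task(func, input_path, output_path):
    """Black-box step: deserialize input, apply the user function, serialize output."""
    write_batch(output_path, [func(record) for record in read_batch(input_path)])


def square(x):
    return x * x


with tempfile.TemporaryDirectory() as workdir:
    partitions = [[1, 2, 3], [4, 5]]  # hypothetical data split into tasks
    results = []
    for index, partition in enumerate(partitions):
        in_path = os.path.join(workdir, "in-%d.pkl" % index)
        out_path = os.path.join(workdir, "out-%d.pkl" % index)
        write_batch(in_path, partition)             # scheduler side hands data over
        run_worker_task(square, in_path, out_path)  # Python worker side
        results.extend(read_batch(out_path))        # scheduler side collects output
```

Every record crosses a serialization boundary twice, once in and once out, which is the overhead the slide is pointing at.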
  28. Spartan
     • http://github.com/spartan-array/spartan
     • Python distributed array expression evaluator (“distributed NumPy”)
     • Developed by Russell Power & others at NYU
     • Uses ZeroMQ and custom RPC implementation
  29. Things I think we should do
     • Create high fidelity data structures for Dremel-style data
     • Get serious about Avro, Parquet, and other new data format standards
     • Invest in the Python-Impala-LLVM relationship
     • Efficient binary protocols to receive and emit data from Python processes
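One way to picture an "efficient binary protocol" for moving data in and out of a Python process is a length-prefixed stream of typed columns. The sketch below is only an illustration under stated assumptions. The 5-byte header layout is invented for this example, and the payload uses the machine's native byte order, so it is only suitable for two processes on the same host.

```python
import io
import struct
from array import array

# Hypothetical framing: 1-byte array type code, then a 4-byte element count.
HEADER = struct.Struct("<cI")


def write_column(stream, values):
    """Emit one typed column as a small header followed by raw machine values."""
    stream.write(HEADER.pack(values.typecode.encode("ascii"), len(values)))
    stream.write(values.tobytes())


def read_column(stream):
    """Read one column back without parsing or boxing values one by one."""
    typecode, count = HEADER.unpack(stream.read(HEADER.size))
    values = array(typecode.decode("ascii"))
    values.frombytes(stream.read(count * values.itemsize))
    return values


# An in-memory buffer stands in for a pipe or socket between processes.
buffer = io.BytesIO()
write_column(buffer, array("d", [1.5, 2.5, 4.0]))
write_column(buffer, array("q", [10, 20]))
buffer.seek(0)
first = read_column(buffer)
second = read_column(buffer)
```

Compared with pickling Python objects, the receiver here copies each column's bytes straight into a typed buffer, so the per-value cost stays close to a memory copy.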
  30. Conclusions
     • Python + PyData stack is as strong as ever, and still gaining momentum
     • The time for a “dark horse” Python-centric big data solution has probably passed us by. Maybe better to pursue alliances.
     • Focused work is needed to still be relevant in 2020. Some of our competitive advantages are eroding
  31. Thank you
     Wes McKinney @wesmckinn
     wes@cloudera.com