Ibis: Scaling the Python Data Experience

Delivered at Data Science Summit on July 20, 2015. See http://ibis-project.org for more.

Published in: Technology

  1. Ibis: Scaling the Python Data Experience
     Wes McKinney, Marcel Kornacker, Justin Erickson, Silvius Rus
     © Cloudera, Inc. All rights reserved.
  2. Wes McKinney
     • A key person in building today's open source Python data community
     • Creator of pandas, a standard Python data wrangling and analytics toolkit used by data scientists
     • Author of best-selling canonical text Python for Data Analysis (2012)
     • Formerly Founder/CEO of DataPad (acquired by Cloudera in 2014)
  3. Python is popular…
     • Python has become a standard language of data science
     • Why is it popular?
         • Maximizes productivity for data engineers and data scientists
         • Build robust software and do interactive data analysis with 100% Python code
         • Easy to learn, and makes for happy and productive data teams
         • Large, diverse open source development community
         • Comprehensive libraries: data wrangling, ML, visualization, etc.
     • Main use case: data science & engineering Swiss Army knife on small-to-medium size data
  4. …but Python does not scale today
     • Python ecosystem confined to single-node analysis
         • Great for smaller data sets
         • Requires sampling or aggregations for larger data
         • Distributed tools compromise in various ways
     • Extracting samples or aggregations for larger data means:
         • "Scales" by losing more fidelity
         • Additional ETL overhead to extract samples/aggregations
         • Loss of productivity with multiple languages, tools, etc.
         • Blocks certain analysis and use cases
  5. Ibis: Same Python, now at scale
     • Target user:
         • Data scientists and data engineers ("Python data users")
     • Goals:
         • Mirrors single-node Python experience
         • Scales to any node and data size
         • No compromise in functionality or usability
         • Interactive experience at native hardware speeds
  6. What's announced?
     • First public release of Ibis
         • http://ibis-project.org
     • Beta release to Cloudera Labs
     • Inviting usage and community development
     • Apache-licensed open source
  7. Ibis's Vision
     • Uncompromised Python experience
         • 100% Python end-to-end user workflows
         • Enable integration with the existing Python data ecosystem (pandas, scikit-learn, NumPy, etc.)
     • Interactive at big data scale
         • Full-fidelity analysis without extractions
         • Scalability for big data
         • Native hardware speeds for a broad set of use cases
  8. (No text transcribed for this slide.)
  9. Advantages of our approach
     • Analyze big data 100% in Python, with the same ease as small/medium data on the local filesystem
     • Full-fidelity data access
     • Familiar Python experience and integration with existing Python data libraries
     • Provide a means for Python high performance computing tools to be leveraged at Hadoop scale
  10. Beta 0.3 release
     • High-level Python API for describing analytics and ETL that can be executed by Impala
         • Familiar API for users of pandas
         • Comprehensive coverage of operations expressible as relational data flows
     • Integrated tools for managing data in HDFS
     • Simple workflows to query data files in several formats (Parquet, Avro, Text)
     • pandas data interchange
  11. Ibis/Impala Joint Roadmap
     • More natural data modeling
         • Complex types support
     • Integration with full Python data ecosystem
         • Advanced analytics + machine learning
         • Enable use of performance computing tools
     • User extensibility with native performance
         • In-memory columnar format
         • Python-to-LLVM IR compilation
     • Workflow and usability tools
  12. Benefits of Ibis
     • Maximize developer productivity
         • Mirrors single-node Python experience
         • Solve big data problems without leaving Python
         • Leverage Python skills, ecosystem, and tools
     • Python as first-class language for Hadoop
         • Full-fidelity analysis without extractions
         • Python analysis at any scale
         • Native hardware speeds for a broad set of use cases
  13. Thank you
     wes@cloudera.com
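
Slide 10 describes a high-level Python API whose analytics are "expressible as relational data flows" and executed by Impala rather than by the local Python process. The slides do not show that API, so the following is only a toy sketch of the general idea: build a deferred expression in Python, then compile it to SQL text that an engine could run. The classes `Table`, `Column`, `Predicate`, and `Aggregate`, and the `trips` table with its `distance`, `fare`, and `city` columns, are all hypothetical and are not Ibis's actual API.

```python
# Toy sketch only: hypothetical classes, NOT the Ibis API.
# It shows deferred expressions compiled to SQL instead of being run locally.
from dataclasses import dataclass, replace
from typing import Tuple


@dataclass(frozen=True)
class Predicate:
    sql: str  # SQL text of a boolean condition


@dataclass(frozen=True)
class Aggregate:
    sql: str    # SQL text of the aggregate call
    alias: str  # output column name


@dataclass(frozen=True)
class Column:
    name: str

    def __gt__(self, value):
        # Comparing a column records a condition; no data is touched here.
        # repr() is enough for this toy; real code must escape literals.
        return Predicate(f"{self.name} > {value!r}")

    def sum(self):
        return Aggregate(f"SUM({self.name})", f"sum_{self.name}")


@dataclass(frozen=True)
class Table:
    name: str
    filters: Tuple[Predicate, ...] = ()
    group_keys: Tuple[str, ...] = ()
    aggregates: Tuple[Aggregate, ...] = ()

    def __getitem__(self, column_name):
        return Column(column_name)

    # Each method returns a new expression; the original is left unchanged.
    def filter(self, predicate):
        return replace(self, filters=self.filters + (predicate,))

    def group_by(self, *keys):
        return replace(self, group_keys=self.group_keys + keys)

    def aggregate(self, *aggs):
        return replace(self, aggregates=self.aggregates + aggs)

    def compile(self):
        """Render the accumulated expression as a single SQL string."""
        items = list(self.group_keys)
        items += [f"{a.sql} AS {a.alias}" for a in self.aggregates]
        select_list = ", ".join(items) if items else "*"
        sql = f"SELECT {select_list} FROM {self.name}"
        if self.filters:
            sql += " WHERE " + " AND ".join(p.sql for p in self.filters)
        if self.group_keys:
            sql += " GROUP BY " + ", ".join(self.group_keys)
        return sql


# Hypothetical table and column names, for illustration only.
trips = Table("trips")
expr = (
    trips.filter(trips["distance"] > 10)
    .group_by("city")
    .aggregate(trips["fare"].sum())
)
query = expr.compile()
print(query)
```

Nothing is computed until `compile()` is called, and even then the result is only a query string. In a design like the one the slides describe, that query would be sent to the engine so the full data set never has to be sampled or pulled into the local Python process.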
