Next-generation Python Big Data Tools, powered by Apache Arrow

Given at SF Big Analytics Meetup 4/5/2016

Published in: Technology

  1. Next-generation Python Big Data Tools, powered by Apache Arrow
     Wes McKinney @wesmckinn
     SF Big Analytics Meetup, 2016-04-05
     © Cloudera, Inc. All rights reserved.
  2. Me
     • Data Science Tools at Cloudera, formerly DataPad CEO/founder
     • Serial creator of structured data tools / user interfaces
     • Wrote bestseller Python for Data Analysis (2012)
     • Open source projects
       • Python {pandas, Ibis, statsmodels}
       • Apache {Arrow, Parquet, Kudu (incubating)}
     • Mostly work in Python and Cython/C/C++
  3. In process: Python for Data Analysis, 2nd Edition
     Coming late 2016 / early 2017
  4. Python + Big Data: The State of Things
     • See “Python and Apache Hadoop: A State of the Union” from February 17
     • Areas where much more work needed
       • Binary file format read/write support (e.g. Parquet files)
       • File system libraries (HDFS, S3, etc.)
       • Client drivers (Spark, Hive, Impala, Kudu)
       • Compute system integration (Spark, Impala, etc.)
  5. Apache Arrow
     Many slides here from my joint talk with Jacques Nadeau, VP Apache Arrow
  6. Arrow in a Slide
     • New top-level Apache Software Foundation project
     • Announced Feb 17, 2016
     • Focused on Columnar In-Memory Analytics
       1. 10-100x speedup on many workloads
       2. Common data layer enables companies to choose best-of-breed systems
       3. Designed to work with any programming language
       4. Support for both relational and complex data as-is
     • Developers from 13+ major open source projects involved
     • A significant % of the world’s data will be processed through Arrow!
     [Project logos: Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R]
  7. Apache Arrow: What is it?
     • http://arrow.apache.org
     • Not a piece of software, exactly!
     • A standardized in-memory representation for columnar data
     • Enables
       • High-performance in-memory analytics (think “pandas internals”)
       • Cheap data interchange amongst systems, with little or no serialization
       • Flexible support for complex JSON-like data
     • Targets: Impala, Kudu, Parquet, Spark
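As a rough illustration of the “complex JSON-like data” point, a column of variable-length lists can be held in two flat arrays: one of offsets and one of values. This is the general idea behind columnar list layouts such as Arrow's. The sketch below is pure Python; the helpers `encode_list_column` and `get_list` are hypothetical names and not part of any Arrow API.

```python
from array import array


def encode_list_column(rows):
    """Flatten a column of lists into (offsets, values).

    offsets[i] and offsets[i + 1] delimit the slice of `values`
    that belongs to row i.
    """
    offsets = array("i", [0])
    values = array("q")
    for row in rows:
        values.extend(row)
        offsets.append(len(values))
    return offsets, values


def get_list(offsets, values, i):
    """Recover row i from the flat representation."""
    return list(values[offsets[i]:offsets[i + 1]])


rows = [[1, 2, 3], [], [4, 5]]
offsets, values = encode_list_column(rows)
```

Every row, including the empty one, is recovered by slicing between two adjacent offsets, so nested data needs no per-row objects.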
  8. Focus on CPU Efficiency
     [Figure: the same table laid out in a Traditional Memory Buffer (row by row: Row 1, Row 2, Row 3, Row 4) and in an Arrow Memory Buffer (column by column: session_id, timestamp, source_ip)]
              session_id   timestamp         source_ip
       Row 1  1331246660   3/8/2012 2:44PM   99.155.155.225
       Row 2  1331246351   3/8/2012 2:38PM   65.87.165.114
       Row 3  1331244570   3/8/2012 2:09PM   71.10.106.181
       Row 4  1331261196   3/8/2012 6:46PM   76.102.156.138
     • Cache Locality
     • Super-scalar & vectorized operation
     • Minimal Structure Overhead
     • Constant value access
       • With minimal structure overhead
     • Operate directly on columnar compressed data
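The row-versus-column contrast on this slide can be sketched with the standard library alone. This is only an analogy: the tuples stand in for a traditional row buffer and `array` stands in for a contiguous column buffer. It is not how Arrow itself is implemented.

```python
from array import array

# Row-oriented: each record keeps all of its fields together.
rows = [
    (1331246660, "3/8/2012 2:44PM", "99.155.155.225"),
    (1331246351, "3/8/2012 2:38PM", "65.87.165.114"),
    (1331244570, "3/8/2012 2:09PM", "71.10.106.181"),
    (1331261196, "3/8/2012 6:46PM", "76.102.156.138"),
]

# Column-oriented: one sequence per column; the integer column is a
# single contiguous buffer of 8-byte values.
session_id = array("q", (r[0] for r in rows))
timestamp = [r[1] for r in rows]
source_ip = [r[2] for r in rows]

# A scan over one column in the row layout has to visit every record...
max_from_rows = max(r[0] for r in rows)
# ...while the columnar layout reads one contiguous buffer.
max_from_column = max(session_id)
```

Both scans give the same answer; the columnar one touches only the bytes of the column it needs, which is where the cache-locality and vectorization benefits come from.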
  9. High Performance Sharing & Interchange
     Today
     • Each system has its own internal memory format
     • 70-80% CPU wasted on serialization and deserialization
     • Similar functionality implemented in multiple projects
     [Diagram: Pandas, Drill, Impala, HBase, Kudu, Cassandra, Parquet, Spark linked to each other by “Copy & Convert” steps]
     With Arrow
     • All systems utilize the same memory format
     • No overhead for cross-system communication
     • Projects can share functionality (e.g., Parquet-to-Arrow reader)
     [Diagram: the same systems all connected through shared Arrow Memory]
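The difference between “Copy & Convert” and a shared memory format can be hinted at with a small standard-library sketch. Here `pickle` plays the role of serialization between two systems, and `memoryview` plays the role of two consumers reading the same buffer. This is an analogy under those assumptions, not Arrow code.

```python
import pickle
from array import array

column = array("d", [1.0, 2.0, 3.0, 4.0])

# "Today": serialize, ship, deserialize. The receiver gets its own copy.
payload = pickle.dumps(column)
received_copy = pickle.loads(payload)
received_copy[0] = 99.0  # the original column is unaffected

# "With a shared format": the consumer looks at the very same bytes.
shared_view = memoryview(column)
shared_view[1] = 42.0  # the change is visible through `column`
```

The copy costs CPU time and memory proportional to the data size, while the view costs neither, which is the point of agreeing on one in-memory representation.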
  10. Big Data Systems: Poor Python IO performance
      http://wesmckinney.com/blog/pandas-and-apache-arrow/
  11. Real World Example: Feather File Format for Python and R
      • Problem: fast, language-agnostic binary data frame file format
      • Written by Wes McKinney (Python) and Hadley Wickham (R)
      • Read speeds close to disk IO performance
      [Diagram: a Feather file holds Arrow array 0, Arrow array 1, …, Arrow array n (Apache Arrow memory), followed by Feather metadata (Google flatbuffers)]
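The “arrays first, metadata after” layout in the diagram can be mimicked with a toy container. To be clear, this is not the Feather format: the JSON footer, the 4-byte length suffix, and the functions `write_toy_file` and `read_toy_column` are all invented for illustration.

```python
import io
import json
import struct
from array import array


def write_toy_file(columns):
    """Write raw column buffers back to back, then a metadata footer."""
    buf = io.BytesIO()
    meta = []
    for name, arr in columns.items():
        meta.append({
            "name": name,
            "typecode": arr.typecode,
            "offset": buf.tell(),
            "length": len(arr),
        })
        buf.write(arr.tobytes())
    footer = json.dumps(meta).encode("utf-8")
    buf.write(footer)
    buf.write(struct.pack("<I", len(footer)))  # footer size goes last
    return buf.getvalue()


def read_toy_column(data, name):
    """Locate one column through the footer and load only its bytes."""
    (footer_len,) = struct.unpack("<I", data[-4:])
    meta = json.loads(data[-4 - footer_len:-4].decode("utf-8"))
    for col in meta:
        if col["name"] == name:
            arr = array(col["typecode"])
            nbytes = col["length"] * arr.itemsize
            arr.frombytes(data[col["offset"]:col["offset"] + nbytes])
            return arr
    raise KeyError(name)


data = write_toy_file({
    "session_id": array("q", [1331246660, 1331246351]),
    "score": array("d", [0.5, 1.5]),
})
score = read_toy_column(data, "score")
```

Because the column bytes on disk match the in-memory layout, reading a column is a bounded byte copy with no parsing, which is the intuition behind read speeds close to disk IO performance.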
  12. Real World Example: Feather File Format for Python and R
      R
        library(feather)
        path <- "my_data.feather"
        write_feather(df, path)
        df <- read_feather(path)
      Python
        import feather
        path = 'my_data.feather'
        feather.write_dataframe(df, path)
        df = feather.read_dataframe(path)
  13. Apache Parquet: Binary columnar storage format
      • I just became a Parquet committer!
      • github.com/apache/parquet-cpp
      • Python users will soon be able to read Parquet files via PyArrow
      • parquet-cpp <-> PyArrow <-> pandas
  14. Language Bindings
      • Target Languages
        • Java (beta)
        • CPP (underway)
        • Python & Pandas (underway)
        • R
        • Julia
      • Initial Focus
        • Read a structure
        • Write a structure
        • Manage Memory
  15. pandas and Arrow in context
  16. RPC & IPC: Moving Data Between Systems
      RPC
      • Avoid Serialization & Deserialization
      • Layer TBD: Focused on supporting vectored io
        • Scatter/gather reads/writes against socket
      IPC
      • Alpha implementation using memory mapped files
        • Moving data between Python and Drill
      • Working on shared allocation approach
        • Shared reference counting and well-defined ownership semantics
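The memory-mapped-file approach to IPC can be sketched with Python's `mmap` module. This is a minimal single-process illustration, assuming a temporary file as the shared region and a plain buffer of doubles as the payload. It says nothing about how the Arrow alpha implementation is actually structured.

```python
import mmap
import os
import tempfile
from array import array

values = array("d", [1.5, 2.5, 3.5])
nbytes = len(values) * values.itemsize

fd, path = tempfile.mkstemp()
os.close(fd)
try:
    # "Producer": size the file and write the column bytes via a mapping.
    with open(path, "r+b") as f:
        f.truncate(nbytes)
        with mmap.mmap(f.fileno(), nbytes) as m:
            m[:] = values.tobytes()
            m.flush()

    # "Consumer": map the same file read-only and reinterpret the bytes
    # as doubles in place, without deserializing.
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), nbytes, access=mmap.ACCESS_READ) as m:
            with memoryview(m) as raw, raw.cast("d") as view:
                received = list(view)
finally:
    os.remove(path)
```

In a real two-process setup the producer and consumer would map the same file concurrently, and ownership of the region would need the kind of well-defined semantics the slide mentions.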
  17. Executing data science languages in the compute layer
      [Diagram: layered stack]
      UI: Ibis, SQL, Spark API, …
      Compute: Analytic SQL, Spark, MapReduce
      Storage: HDFS, Kudu, HBase
      Python, R, Julia, …?
  18. Real World Example: Python With Spark, Drill, Impala
      [Diagram: a SQL engine passes input partitions (in partition 0 … in partition n - 1) as input to a Python function running user-supplied Python code; the function's output forms output partitions (out partition 0 … out partition n - 1) that go back to a SQL engine]
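The partition-in, partition-out flow in this diagram can be sketched in plain Python. The driver `run_udf` and the example function `double_values` are hypothetical names; a real engine would hand partitions to the Python process through shared memory or a socket rather than a list.

```python
from array import array


def run_udf(partitions, func):
    """Apply a user-supplied function to each input partition independently."""
    return [func(part) for part in partitions]


def double_values(part):
    """Example user code: a column of doubles in, a column of doubles out."""
    return array("d", (2.0 * x for x in part))


in_partitions = [array("d", [1.0, 2.0]), array("d", [3.0]), array("d")]
out_partitions = run_udf(in_partitions, double_values)
```

Each partition is processed on its own, so the engine can run the function in parallel, and a columnar buffer format lets it do so without converting rows one at a time.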
  19. What’s Next
      • Parquet for Python & C++
        • Using Arrow as intermediary
      • Available IPC Implementation
      • Spark, Drill Integration
        • Faster UDFs, Storage interfaces
  20. Apache Arrow in practice
  21. Get Involved
      • Join the community
        • dev@arrow.apache.org
        • Slack: https://apachearrowslackin.herokuapp.com/
        • http://arrow.apache.org
        • @ApacheArrow
  22. Thank you
      Wes McKinney @wesmckinn
      Views are my own
