Next-generation Python Big Data Tools, powered by Apache Arrow

Director of Ursa Labs, Open Source Developer at Ursa Labs
Apr. 6, 2016

  1. Next-generation Python Big Data Tools, powered by Apache Arrow
     Wes McKinney @wesmckinn
     SF Big Analytics Meetup, 2016-04-05
     © Cloudera, Inc. All rights reserved.
  2. Me
     • Data Science Tools at Cloudera, formerly DataPad CEO/founder
     • Serial creator of structured data tools / user interfaces
     • Wrote bestseller Python for Data Analysis (2012)
     • Open source projects
       • Python {pandas, Ibis, statsmodels}
       • Apache {Arrow, Parquet, Kudu (incubating)}
     • Mostly work in Python and Cython/C/C++
  3. In process: Python for Data Analysis: 2nd Edition
     Coming late 2016 / early 2017
  4. Python + Big Data: The State of things
     • See “Python and Apache Hadoop: A State of the Union” from February 17
     • Areas where much more work is needed
       • Binary file format read/write support (e.g. Parquet files)
       • File system libraries (HDFS, S3, etc.)
       • Client drivers (Spark, Hive, Impala, Kudu)
       • Compute system integration (Spark, Impala, etc.)
  5. Apache Arrow
     Many slides here from my joint talk with Jacques Nadeau, VP Apache Arrow
  6. Arrow in a Slide
     • New Top-level Apache Software Foundation project
     • Announced Feb 17, 2016
     • Focused on Columnar In-Memory Analytics
       1. 10-100x speedup on many workloads
       2. Common data layer enables companies to choose best-of-breed systems
       3. Designed to work with any programming language
       4. Support for both relational and complex data as-is
     • Developers from 13+ major open source projects involved
     • A significant % of the world’s data will be processed through Arrow!
     Projects: Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R
  7. Apache Arrow: What is it?
     • http://arrow.apache.org
     • Not a piece of software, exactly!
     • A standardized in-memory representation for columnar data
     • Enables
       • High-performance in-memory analytics (think of it like “pandas internals”)
       • Cheap data interchange amongst systems, with little or no serialization
       • Flexible support for complex JSON-like data
     • Targets: Impala, Kudu, Parquet, Spark
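The "complex JSON-like data" bullet can be made concrete with a small sketch. Columnar formats such as Arrow typically store a column of variable-length lists as one flat values array plus an offsets array marking where each row begins. The code below is a minimal pure-Python illustration, assuming plain lists stand in for memory buffers and ignoring null handling; `to_list_column` and `get_row` are hypothetical helper names, not Arrow APIs.

```python
def to_list_column(rows):
    """Flatten a list of variable-length lists into (offsets, values)."""
    offsets = [0]
    values = []
    for row in rows:
        values.extend(row)
        # Each offset records where the next row starts in `values`.
        offsets.append(len(values))
    return offsets, values


def get_row(offsets, values, i):
    """Recover row i by slicing the flat values with two offsets."""
    return values[offsets[i]:offsets[i + 1]]


rows = [[1, 2, 3], [], [4, 5]]
offsets, values = to_list_column(rows)
print(offsets)  # [0, 3, 3, 5]
print(values)   # [1, 2, 3, 4, 5]
print(get_row(offsets, values, 2))  # [4, 5]
```

Because every row is found by offset arithmetic, nested data keeps the same contiguous, scan-friendly layout as a flat numeric column.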
  8. Focus on CPU Efficiency
     [Figure: the same four records laid out two ways. The Traditional Memory Buffer stores Row 1 to Row 4 one after another; the Arrow Memory Buffer stores each column (session_id, timestamp, source_ip) contiguously.]

     session_id   timestamp         source_ip
     1331246660   3/8/2012 2:44PM   99.155.155.225
     1331246351   3/8/2012 2:38PM   65.87.165.114
     1331244570   3/8/2012 2:09PM   71.10.106.181
     1331261196   3/8/2012 6:46PM   76.102.156.138

     • Cache Locality
     • Super-scalar & vectorized operation
     • Minimal Structure Overhead
     • Constant value access
       • With minimal structure overhead
     • Operate directly on columnar compressed data
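The difference between the two buffers on this slide can be sketched in a few lines of Python. This is only an illustration using the slide's sample values: the standard-library `array` type stands in for a contiguous column buffer, and the variable names are made up for the example.

```python
from array import array

# Row-oriented: each record keeps all of its fields together.
rows = [
    (1331246660, "3/8/2012 2:44PM", "99.155.155.225"),
    (1331246351, "3/8/2012 2:38PM", "65.87.165.114"),
    (1331244570, "3/8/2012 2:09PM", "71.10.106.181"),
    (1331261196, "3/8/2012 6:46PM", "76.102.156.138"),
]

# Column-oriented: each field is stored as its own contiguous sequence.
columns = {
    "session_id": array("q", (r[0] for r in rows)),  # packed 64-bit integers
    "timestamp": [r[1] for r in rows],
    "source_ip": [r[2] for r in rows],
}

# A scan over one field touches every record in the row layout...
max_from_rows = max(r[0] for r in rows)
# ...but only one tightly packed buffer in the column layout.
max_from_column = max(columns["session_id"])

session_id_bytes = len(columns["session_id"].tobytes())
print(max_from_rows == max_from_column, session_id_bytes)
```

The packed `session_id` column occupies 8 bytes per value with no per-record wrapper, which is the kind of layout that cache-friendly and vectorized scans rely on.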
  9. High Performance Sharing & Interchange
     Today
     • Each system has its own internal memory format
     • 70-80% CPU wasted on serialization and deserialization
     • Similar functionality implemented in multiple projects
     With Arrow
     • All systems utilize the same memory format
     • No overhead for cross-system communication
     • Projects can share functionality (e.g., Parquet-to-Arrow reader)
     [Diagram: Pandas, Drill, Impala, HBase, Kudu, Cassandra, Parquet, and Spark. Today the systems are linked by "Copy & Convert" steps; with Arrow they all share "Arrow Memory".]
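As a rough analogy for the "Copy & Convert" versus shared-memory contrast, the sketch below compares a serialize/deserialize round trip with a zero-copy view over the same buffer. It uses only the Python standard library (`pickle`, `array`, `memoryview`) and stays inside one process, so it illustrates the idea rather than any Arrow mechanism.

```python
import pickle
from array import array

data = array("d", [1.0, 2.0, 3.0, 4.0])

# "Copy & convert": serialize to bytes, then rebuild a separate object.
copied = pickle.loads(pickle.dumps(data))
copied[0] = 99.0   # changes only the copy

# Shared format: a second consumer views the same memory in place.
view = memoryview(data)
view[1] = 42.0     # visible through `data`, because nothing was copied

print(data[0], data[1])  # 1.0 42.0
```

When both sides agree on the byte layout, the consumer needs neither the encode step nor the decode step.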
  10. Big Data Systems: Poor Python IO performance
      http://wesmckinney.com/blog/pandas-and-apache-arrow/
  11. Real World Example: Feather File Format for Python and R
      • Problem: fast, language-agnostic binary data frame file format
      • Written by Wes McKinney (Python) and Hadley Wickham (R)
      • Read speeds close to disk IO performance
      [Diagram: a Feather file holds Arrow array 0, Arrow array 1, …, Arrow array n, followed by Feather metadata. The arrays use Apache Arrow memory; the metadata uses Google flatbuffers.]
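The file layout in the diagram (column arrays first, metadata last) can be imitated with a toy format. The sketch below is not the Feather format: it assumes JSON in place of Flatbuffers metadata, a 4-byte length footer, and native byte order, and `write_toy_frame` and `read_toy_column` are hypothetical names chosen for the example.

```python
import io
import json
import struct
from array import array


def write_toy_frame(f, columns):
    """Write each column's raw bytes, then a metadata block, then its length."""
    meta = []
    for name, col in columns.items():
        raw = col.tobytes()
        meta.append({"name": name, "typecode": col.typecode,
                     "offset": f.tell(), "nbytes": len(raw)})
        f.write(raw)
    meta_bytes = json.dumps(meta).encode("utf-8")
    f.write(meta_bytes)
    f.write(struct.pack("<I", len(meta_bytes)))  # footer: metadata length


def read_toy_column(f, name):
    """Read one column by locating it through the trailing metadata."""
    f.seek(-4, io.SEEK_END)
    (meta_len,) = struct.unpack("<I", f.read(4))
    f.seek(-(4 + meta_len), io.SEEK_END)
    meta = json.loads(f.read(meta_len).decode("utf-8"))
    entry = next(m for m in meta if m["name"] == name)
    f.seek(entry["offset"])
    col = array(entry["typecode"])
    col.frombytes(f.read(entry["nbytes"]))  # bulk load, no per-value parsing
    return col


buf = io.BytesIO()
write_toy_frame(buf, {"x": array("q", [1, 2, 3]),
                      "y": array("d", [0.5, 1.5, 2.5])})
y = read_toy_column(buf, "y")
print(y.tolist())  # [0.5, 1.5, 2.5]
```

Reading a column is a seek plus one bulk read of bytes already in their in-memory form, which suggests why read speeds can approach raw disk IO.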
  12. Real World Example: Feather File Format for Python and R
      R

          library(feather)

          path <- "my_data.feather"
          # df is an existing data frame
          write_feather(df, path)

          df <- read_feather(path)

      Python

          import feather

          path = 'my_data.feather'

          # df is an existing pandas DataFrame
          feather.write_dataframe(df, path)
          df = feather.read_dataframe(path)
  13. Apache Parquet: Binary columnar storage format
      • I just became a Parquet committer!
      • github.com/apache/parquet-cpp
      • Python users will soon be able to read Parquet files via PyArrow
      • parquet-cpp <-> PyArrow <-> pandas
  14. Language Bindings
      • Target Languages
        • Java (beta)
        • CPP (underway)
        • Python & Pandas (underway)
        • R
        • Julia
      • Initial Focus
        • Read a structure
        • Write a structure
        • Manage Memory
  15. pandas and Arrow in context
  16. RPC & IPC: Moving Data Between Systems
      RPC
      • Avoid Serialization & Deserialization
      • Layer TBD: Focused on supporting vectored IO
        • Scatter/gather reads/writes against socket
      IPC
      • Alpha implementation using memory mapped files
        • Moving data between Python and Drill
      • Working on shared allocation approach
        • Shared reference counting and well-defined ownership semantics
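The memory-mapped-file approach to IPC can be illustrated with the Python standard library. This is a single-process sketch under simplifying assumptions: a temporary file plays the role of the shared file, the "producer" and "consumer" are just two code sections, and the column is a bare run of float64 values with no metadata.

```python
import mmap
import os
import tempfile
from array import array

# "Producer": write a column of float64 values to a file.
values = array("d", [1.5, 2.5, 3.5])
fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(values.tobytes())

# "Consumer": map the file and reinterpret the bytes in place.
with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    with memoryview(mm) as raw, raw.cast("d") as column:
        # No parsing step: values are read straight from the mapped pages.
        total = sum(column)
        first = column[0]
    mm.close()

os.remove(path)
print(total)  # 7.5
```

In a real two-process setup the consumer would map a file written by another program, and the ownership questions raised on the slide (who may free or modify the buffer) become the hard part.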
  17. Executing data science languages in the compute layer
      [Diagram: three stacked layers, with Python, R, Julia, …? placed alongside them.]
      • UI: Ibis, SQL, Spark API, …
      • Compute: Analytic SQL, Spark, MapReduce
      • Storage: HDFS, Kudu, HBase
  18. Real World Example: Python With Spark, Drill, Impala
      [Diagram: the SQL engine feeds input partitions 0 … n - 1 to a Python function running user-supplied Python code; each function's output is returned to the SQL engine as output partitions 0 … n - 1.]
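The data flow in this diagram amounts to applying one user-supplied function to each partition independently. Below is a minimal sketch of that pattern with in-memory lists standing in for partitions; `run_udf` and `double_values` are hypothetical names, and no actual Spark, Drill, or Impala API is involved.

```python
def run_udf(partitions, func):
    """Apply a user-supplied function to each input partition independently."""
    return [func(part) for part in partitions]


def double_values(column):
    """Example user code: operates on a whole column of values at once."""
    return [2 * x for x in column]


in_partitions = [[1, 2], [3], [4, 5, 6]]
out_partitions = run_udf(in_partitions, double_values)
print(out_partitions)  # [[2, 4], [6], [8, 10, 12]]
```

The function receives a whole partition rather than one value at a time, so the cost of crossing between the engine and Python is paid once per batch.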
  19. What’s Next
      • Parquet for Python & C++
        • Using Arrow as intermediary
      • Available IPC Implementation
      • Spark, Drill Integration
        • Faster UDFs, Storage interfaces
  20. Apache Arrow in practice
  21. Get Involved
      • Join the community
        • dev@arrow.apache.org
        • Slack: https://apachearrowslackin.herokuapp.com/
        • http://arrow.apache.org
        • @ApacheArrow
  22. Thank you
      Wes McKinney @wesmckinn
      Views are my own