High Performance Python on Apache Spark

Slides from my talk at Spark Summit West 2016 on June 7, 2016

1. High Performance Python on Apache Spark
   Wes McKinney @wesmckinn
   Spark Summit West -- June 7, 2016
2. Me
   • Data Science Tools at Cloudera
   • Serial creator of structured data tools / user interfaces
   • Wrote bestseller Python for Data Analysis (2012)
     • Working on an expanded and revised 2nd edition, coming 2017
   • Open source projects
     • Python {pandas, Ibis, statsmodels}
     • Apache {Arrow, Parquet, Kudu (incubating)}
   • Focused on C++, Python, and hybrid projects
3. Agenda
   • Why care about Python?
   • What does "high performance Python" even mean?
   • A modern approach to Python data software
   • Spark and Python: performance analysis and development directions
4. Why care about (C)Python?
   • Accessible, "swiss army knife" programming language
   • Highly productive for software engineering and data science alike
   • Has excelled as the agile "orchestration" or "glue" layer for application business logic
   • Easy to interface with C / C++ / Fortran code; well-designed Python C API
5. Defining "High Performance Python"
   • The end-user workflow involves primarily Python programming; programs can be invoked with "python app_entry_point.py ..."
   • The software uses system resources within an acceptable factor of an equivalent program developed completely in Java or C++
     • Preferably 1-5x slower, not 20-50x
   • The software is suitable for interactive / exploratory computing on modestly large data sets (gigabytes) on a single node
6. Building fast Python software means embracing certain limitations
7. Having a healthy relationship with the interpreter
   • The Python interpreter itself is "slow" compared with hand-coded C or Java
     • Each line of Python code may feature multiple internal C API calls, temporary data structures, etc.
   • Python built-in data structures (numbers, strings, tuples, lists, dicts, etc.) have significant memory and performance overhead
   • Threads performing concurrent CPU or IO work must take care not to block other threads
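A rough way to see this overhead (a sketch, not from the slides): sum ten million numbers with Python built-ins versus a NumPy array and compare the timings.

    # Boxed Python ints in a list vs. one contiguous int64 buffer
    import timeit
    import numpy as np

    data_list = list(range(10_000_000))
    data_array = np.arange(10_000_000)

    t_interp = timeit.timeit(lambda: sum(data_list), number=3)
    t_native = timeit.timeit(lambda: data_array.sum(), number=3)
    print(t_interp, t_native)   # the interpreted path is typically 10-100x slower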
8. Mantras for great success
   • Key question 1: Am I making the Python interpreter do a lot of work?
   • Key question 2: Am I blocking other interpreted code from executing?
   • Key question 3: Am I handling data (memory) in a "good" way?
9. Toy example: interpreted vs. compiled code
10. Toy example: interpreted vs. compiled code
    Cython: 78x faster than interpreted
11. Toy example: interpreted vs. compiled code
    NumPy: creating a full 80 MB temporary array + PyArray_Sum is only 35% slower than a fully inlined Cython (C) function.
    Interesting: ndarray.sum by itself is almost 2x faster than the hand-coded Cython function.
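The code behind these measurements appears only as images in the slides; the following is a hedged reconstruction of the kind of comparison being made (the 80 MB figure matches ten million float64 values).

    import numpy as np

    def py_sum(values):
        # Interpreted: every iteration goes through the Python C API
        total = 0.0
        for v in values:
            total += v
        return total

    # A Cython version of the same loop (compiled; roughly what the "78x
    # faster" figure refers to) would look something like:
    #
    #   def cy_sum(double[:] values):
    #       cdef double total = 0
    #       cdef Py_ssize_t i
    #       for i in range(values.shape[0]):
    #           total += values[i]
    #       return total

    arr = np.random.randn(10_000_000)    # ~80 MB of float64
    py_sum(arr)                          # interpreted loop
    (arr + 0.0).sum()                    # temporary array + PyArray_Sum
    arr.sum()                            # ndarray.sum on its own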
12. Submarines and Icebergs: metaphors for fast Python software
13. Suboptimal control flow
    [Diagram: data arrives from elsewhere, is deserialized into Python data structures, run through pure Python computation (repeatedly), then serialized back out. "Time for a coffee break..."]
14. Better control flow
    [Diagram: Python app logic calls C functions in extension code (C/C++) that pass native data between themselves; users only see the Python layer. "Zoom zoom! (if the extension code is good)"]
15. But it's much easier to write 100% Python!
    • Building hybrid C/C++ and Python systems adds a lot of complexity to the engineering process
      • (but it's often worth it)
      • See: Cython, SWIG, Boost.Python, Pybind11, and other "hybrid" software creation tools
    • BONUS: Python programs can orchestrate multi-threaded / concurrent systems written in C/C++ (no Python C API needed)
      • The GIL only comes in when you need to "bubble up" data or control flow (e.g. Python callbacks) into the Python interpreter
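A small sketch of the "bonus" point, assuming the heavy lifting happens in extension code that releases the GIL (NumPy releases it around many of its inner loops); the pool and array sizes here are illustrative, not from the talk.

    from concurrent.futures import ThreadPoolExecutor
    import numpy as np

    chunks = [np.random.randn(2_000_000) for _ in range(8)]

    def crunch(x):
        # Sorting and summing run in C with the GIL released for most of the
        # work, so the threads can actually overlap.
        return np.sort(x).sum()

    with ThreadPoolExecutor(max_workers=8) as pool:
        results = list(pool.map(crunch, chunks))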
16. A story of reading a CSV file

    f = get_stream(...)
    df = pandas.read_csv(f, **csv_options)

    Internally (pseudocode):

    while more_data():
        buffer = f.read()
        parse_bytes(buffer)
    df = type_infer_columns()

    Concerns:
    • Uses PyString_FromStringAndSize, must hold the GIL for this
    • Synchronous or asynchronous with IO?
    • Type infer in parallel?
    • Data structures used?
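For concreteness, here is a runnable version of the user-facing two lines above; get_stream and csv_options are placeholders on the slide, so an in-memory buffer and arbitrary options stand in for them.

    import io
    import pandas as pd

    csv_options = {"sep": ",", "header": 0}            # stand-in options
    f = io.BytesIO(b"a,b\n1,2.5\n3,4.5\n")             # stands in for get_stream(...)
    df = pd.read_csv(f, **csv_options)
    print(df.dtypes)   # parsing and type inference happened in extension code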
17. It's All About the Benjamins (Data Structures)
    • The hard currency of data software is: in-memory data structures
      • How costly are they to send and receive?
      • How costly to manipulate and munge in-memory?
      • How difficult is it to add new proprietary computation logic?
    • In Python: NumPy established a gold standard for interoperable array data
      • pandas is built on NumPy, and made it easy to "plug in" to the ecosystem
      • (but there are plenty of warts still)
18. What's this have to do with Spark?
    • Some known performance issues in PySpark
      • IO throughput
        • Python to Spark
        • Spark to Python (or Python extension code)
      • Running interpreted Python code on RDDs / Spark DataFrames
        • Lambda mappers / reducers (rdd.map(...))
        • Spark SQL UDFs (registerFunction(...))
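The two interpreted-Python code paths named above, sketched against the Spark 1.6-era PySpark API (assumes a local Spark installation; the "plus_one" UDF name is just for illustration).

    from pyspark import SparkContext
    from pyspark.sql import SQLContext
    from pyspark.sql.types import IntegerType

    sc = SparkContext(appName="pyspark-lambdas")
    sqlContext = SQLContext(sc)

    # 1) Lambda mappers/reducers: the function is pickled and shipped to
    #    Python worker processes, which handle one object at a time.
    rdd = sc.parallelize(range(1000))
    doubled = rdd.map(lambda x: x * 2).collect()

    # 2) Spark SQL UDFs: same story, one row at a time through the interpreter.
    sqlContext.registerFunction("plus_one", lambda x: x + 1, IntegerType())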
19. Spark IO throughput to/from Python
    • 1.15 MB/s in, 9.82 MB/s out
    • Spark 1.6.1 running on localhost, 76 MB pandas.DataFrame
20. Spark IO throughput to/from Python
    • Unofficial improved toPandas: 25.6 MB/s out
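A hedged sketch of the kind of measurement behind these numbers: push a roughly 76 MB pandas.DataFrame into Spark and pull it back out, timing each direction (throughput will vary by machine and Spark version).

    import time
    import numpy as np
    import pandas as pd
    from pyspark import SparkContext
    from pyspark.sql import SQLContext

    sc = SparkContext(appName="topandas-benchmark")
    sqlContext = SQLContext(sc)

    pdf = pd.DataFrame(np.random.randn(1_000_000, 10),
                       columns=["c%d" % i for i in range(10)])   # ~76 MB of float64

    t0 = time.time()
    sdf = sqlContext.createDataFrame(pdf)    # Python -> Spark
    t1 = time.time()
    back = sdf.toPandas()                    # Spark -> Python
    t2 = time.time()
    print("in: %.1f s, out: %.1f s" % (t1 - t0, t2 - t1))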
21. Compared with HiveServer2 Thrift RPC fetch
    • Impala 2.5 + Parquet file on localhost
    • ibis + impyla: 41.46 MB/s read
    • hs2client (C++ / Python): 90.8 MB/s
    • Task benchmarked: Thrift TFetchResultsReq + deserialization + conversion to pandas.DataFrame
22. Back-of-envelope comparison with file formats
    • Feather: 1105 MB/s write; CSV (pandas): 6.2 MB/s write
    • Feather: 2414 MB/s read; CSV (pandas): 51.9 MB/s read
    • Disclaimer: warm NVMe / OS file cache
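A minimal sketch of the operations being timed, using the original feather-format package (newer pandas exposes the same round trip as DataFrame.to_feather / read_feather); file names are arbitrary.

    import numpy as np
    import pandas as pd
    import feather

    df = pd.DataFrame(np.random.randn(1_000_000, 10),
                      columns=["c%d" % i for i in range(10)])

    df.to_csv("data.csv", index=False)            # text path
    feather.write_dataframe(df, "data.feather")   # columnar binary path

    df_csv = pd.read_csv("data.csv")
    df_fea = feather.read_dataframe("data.feather")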
23. Aside: CSVs can be fast
    See: https://github.com/wiseio/paratext
24. How Python lambdas work in PySpark
    [Diagram: the Spark RDD sends a data stream plus a pickled PyFunction to a pool of Python worker processes.]
    • See: spark/api/python/PythonRDD.scala and python/pyspark/worker.py
    • The inner loop of RDD.map: map(f, iterator)
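A conceptual sketch (not the real worker code) of what that inner loop amounts to: the driver pickles the function, PythonRDD.scala streams serialized data to a Python worker, and the worker applies map(f, iterator). The helper name and the deserialize/serialize hooks here are hypothetical.

    import pickle

    def fake_worker(pickled_func, batches, deserialize, serialize):
        f = pickle.loads(pickled_func)          # the shipped lambda
        for batch in batches:                   # data stream from the JVM
            for record in deserialize(batch):
                yield serialize(f(record))      # one object at a time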
25. How Python lambdas perform
    • NumPy array-oriented operations are about 100x faster... but that's not the whole story
    • Disclaimer: this isn't a remotely "fair" comparison, but it helps illustrate the real pitfalls associated with introducing serialization and RPC/IPC into a computational process
26. How Python lambdas perform
    [Chart comparing runtimes on 8 cores vs. 1 core]
    Lessons learned: Python data analytics should not be based around scalar object iteration
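The lesson in code: the same aggregate expressed as scalar-at-a-time iteration versus one array-oriented operation (the slide's point is the order-of-magnitude gap, not an exact figure).

    import numpy as np

    values = np.random.randn(5_000_000)

    # Scalar object iteration: every element becomes a boxed Python float
    slow = sum(v * v for v in values)

    # Array-oriented: one pass in compiled code, no per-element interpreter work
    fast = (values * values).sum()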
27. Asides / counterpoints
    • Spark <-> Python IO may not be important -- can leave all of the data remote
    • Spark DataFrame operations have reduced the need for many types of lambda functions
    • Can use binary file formats as an alternate IO interface
      • Parquet (Python support soon via apache/parquet-cpp)
      • Avro (see cavro, fastavro, pyavroc)
      • ORC (needs a Python champion)
      • ...
28. Apache Arrow
    http://arrow.apache.org
    Some slides from Strata-HW talk w/ Jacques Nadeau
29. Apache Arrow in a Slide
    • New top-level Apache Software Foundation project
      • http://arrow.apache.org
    • Focused on columnar in-memory analytics
      1. 10-100x speedup on many workloads
      2. Common data layer enables companies to choose best-of-breed systems
      3. Designed to work with any programming language
      4. Support for both relational and complex data as-is
    • Oriented at collaboration amongst other OSS projects: Calcite, Cassandra, Deeplearning4j, Drill, Hadoop, HBase, Ibis, Impala, Kudu, Pandas, Parquet, Phoenix, Spark, Storm, R
30. High Performance Sharing & Interchange
    Today:
    • Each system has its own internal memory format
    • 70-80% of CPU wasted on serialization and deserialization
    • Similar functionality implemented in multiple projects
    With Arrow:
    • All systems utilize the same memory format
    • No overhead for cross-system communication
    • Projects can share functionality (e.g., Parquet-to-Arrow reader)
31. Arrow and PySpark
    • Build a C API level data protocol to move data between Spark and Python
    • Either
      • (Fast) Convert Arrow to/from pandas.DataFrame
      • (Faster) Perform native analytics on Arrow data in-memory
    • Use Arrow
      • For efficiently handling nested Spark SQL data in-memory
      • IO: pandas/NumPy data push/pull
      • Lambda/UDF evaluation
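What the "(Fast) Convert Arrow to/from pandas.DataFrame" bullet looks like with the pyarrow library; this API matured after the talk and is shown only as an illustration, not as the Spark integration itself.

    import pandas as pd
    import pyarrow as pa

    df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})
    table = pa.Table.from_pandas(df)   # pandas -> Arrow columnar memory
    df_back = table.to_pandas()        # Arrow -> pandas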
32. Arrow in action: Feather File Format for Python and R
    • Problem: fast, language-agnostic binary data frame file format
    • Creators: Wes McKinney (Python) and Hadley Wickham (R)
    • Read speeds close to disk IO performance
33. More on Feather
    [Diagram: a Feather file is a sequence of arrays (array 0 ... array n - 1) plus metadata, read and written by the libfeather C++ library, with Rcpp bindings producing an R data.frame and Cython bindings producing a pandas DataFrame.]
34. Summary
    • It's essential to improve Spark's low-level data interoperability with the Python data ecosystem
    • I'm personally excited to work with the Spark + Arrow + PyData + other communities to help make this a reality
35. Thank you
    Wes McKinney @wesmckinn
    Views are my own
