Enabling Python to be a Better Big Data Citizen

These slides are from my talk at the NYC Python Meetup at ODSC Office NYC on February 17, 2016. It discusses Python's architectural challenges to interoperate with the Hadoop ecosystem and how a new project, Apache Arrow, will help.

  1. Enabling Python to be a Better Big Data Citizen. Wes McKinney (@wesmckinn), NYC Python Meetup, 2016-02-17. © Cloudera, Inc. All rights reserved.
  2. Me
     • R&D at Cloudera, formerly DataPad CEO/founder
     • Serial creator of structured data tools / user interfaces
     • Wrote bestseller Python for Data Analysis (2012)
     • Open source projects
       • Python {pandas, Ibis, statsmodels}
       • Apache {Arrow, Parquet, Kudu (incubating)}
     • Mostly work in Python and Cython/C/C++
  3. Some simplistic generalizations
     Industry Analytics                      | Scientific Computing
     Heterogeneous data                      | Homogeneous data
     Flat tables and JSON                    | Multidimensional arrays
     Spark / MapReduce                       | HPC tools
     SQL                                     | Linear algebra
     DFS-friendly / streaming data formats   | Scientific data formats (e.g. HDF5)
     More physical machines                  | Fewer physical machines
     • Python: heavy investment, generally
     • Python: light investment, generally
  4. A sample big data architecture
     [Diagram labels: Kafka (×4), Application data, HDFS, JSON, Spark/MapReduce, Columnar storage, Analytic SQL Engine, User, SQL]
  5. pandas
     • Hugely popular Python table / "data frame" library
       • Labeled table, array, and time series data structures
     • Popular for data preparation, ETL, and in-memory analytics
     • Built using Python's scientific computing stack
       • User API / domain specific language
       • Bespoke in-memory analytics / relational algebra engine
       • IO interfaces (CSV, SQL, etc.)
       • Expanded data type system (beyond NumPy)
     • Supports flat data only (or semi-structured data that can be flattened)
  6. 2016 Python Data Trends
     • Improved Python interoperability with the Apache Hadoop ecosystem
       • I'm working with {Arrow, Kudu, Impala, Parquet, Spark}
     • Support for big data file formats like Apache Parquet
     • Native in-memory Python support for nested / JSON-like data
  7. Ibis in a nutshell
     • For Python programmers doing analytics in industry
     • Project blog: http://blog.ibis-project.org
     • Cross-team project @ Cloudera
     • Apache-licensed, open source: http://github.com/cloudera/ibis
     • Crafting a compelling Python-on-Hadoop user experience
       • Remove SQL coding from user workflows
       • Develop high performance extensions in Python
  8. (no text content)
  9. Enabling interoperability with big data systems
     • Distributed / MPP query engines: implemented in a host language
       • Typically C/C++ or Java/Scala
     • User-defined functions (UDFs) through various means
       • Implement in host language
       • Implement in user language through some external language protocol (often RPC-based)
     • External UDFs are usually very slow (cf. PL/Python, PySpark, etc.)
  10. Executing data science languages in the compute layer
      • UI: Ibis, SQL, Spark API, …
      • Compute: Analytic SQL, Spark, MapReduce
      • Storage: HDFS, Kudu, HBase
      • Python, R, Julia, …?
  11. Python interoperability challenges
      • Problem 1: Serialization / deserialization overhead
      [Diagram: Big data system (in partition 0 … in partition n - 1) → Python function input → User-supplied Python code → output → Big data system (out partition 0 … out partition n - 1)]
  12. Data movement can be extremely costly
      [Diagram: in partition 0 → Python function input]
      Questions
      • How to represent "data in-flight" (RPC)?
      • Cost of conversion between in-memory data structures and RPC representation
      • How to communicate schemas / metadata?
  13. Data movement can be extremely costly
      [Diagram: in partition 0 → Python function input]
      Slow data movement / conversion can largely undermine the performance benefits of Python's high performance in-memory data tools
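
To make the cost on slides 12 and 13 concrete, here is a minimal sketch using only the Python standard library. It is not from the talk: `pickle` stands in for any serialize/deserialize round trip across a process boundary, and `memoryview` stands in for a shared in-memory representation that needs no conversion. The variable names are illustrative.

```python
import pickle
from array import array

# A small column of doubles held in one contiguous buffer.
values = array("d", [1.0, 2.0, 3.0, 4.0])

# Serialization path: the data is encoded to bytes, then decoded into
# a brand-new object. Both steps touch every element.
payload = pickle.dumps(values)
copied = pickle.loads(payload)

# Shared-buffer path: a memoryview exposes the same memory, no copy made.
view = memoryview(values)

# Mutating the original shows which object shares memory with it.
values[0] = 99.0
print(copied[0])  # the deserialized copy is unaffected
print(view[0])    # the view sees the change
```

The copy made by the round trip is the work that grows with data size; the view does no per-element work at all, which is the property a common in-memory format aims for.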
  14. Python interoperability challenges
      • Problem 2: Scalar vs vectorized computations
      SCALAR:
          result = np.empty(n)
          for i in range(n):
              result[i] = f(a[i], b[i])
      VECTORIZED (often 100-1000x faster):
          result = f(a, b)
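
The two snippets on slide 14 can be run side by side as follows. This is a sketch under assumptions: the slide does not define `f`, `a`, `b`, or `n`, so the arithmetic function and random inputs below are placeholders chosen only so that both forms produce the same result.

```python
import numpy as np

def f(x, y):
    # Placeholder for the slide's f: plain arithmetic, so it accepts
    # either two scalars or two whole arrays.
    return x * y + 1.0

n = 100_000
rng = np.random.default_rng(0)
a = rng.random(n)
b = rng.random(n)

# SCALAR: one Python-level function call per element.
scalar_result = np.empty(n)
for i in range(n):
    scalar_result[i] = f(a[i], b[i])

# VECTORIZED: a single call; the per-element loop runs in compiled code.
vectorized_result = f(a, b)

# Both forms compute the same values.
assert np.allclose(scalar_result, vectorized_result)
```

Timing the two sections (for example with `time.perf_counter`) shows the gap on a given machine; the exact ratio depends on `f` and on the array size.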
  15. Apache Arrow: What is it?
      • http://arrow.apache.org
      • Not a piece of software, exactly!
      • A standardized in-memory representation for columnar data
      • Enables
        • Suitable for implementing high-performance analytics in-memory (think like "pandas internals")
        • Cheap data interchange amongst systems, little or no serialization
        • Flexible support for complex JSON-like data
      • Targets: Impala, Kudu, Parquet, Spark
  16. Columnar data
      persons = [
        {
          name: 'wes',
          addresses: [
            {number: 2, street: 'a'},
            {number: 3, street: 'bb'},
          ]
        },
        {
          name: 'mark',
          addresses: [
            {number: 4, street: 'ccc'},
            {number: 5, street: 'dddd'},
            {number: 6, street: 'f'},
          ]
        },
  17. Columnar data
      • person.addresses: offset 0 2 5
      • person.addresses.street: offset 0 1 3 6 10; data a b b c c c d d d d f
      • person.addresses.number: 2 3 4 5 6
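
The layout on slides 16 and 17 can be reproduced with a short pure-Python sketch. This illustrates the offsets idea only; it is not Arrow's actual implementation or API, and the helper names `to_columnar` and `streets_of` are hypothetical. One assumption differs from the slide: each offsets list here also stores the final end position, so every slice has an explicit stop.

```python
persons = [
    {"name": "wes", "addresses": [
        {"number": 2, "street": "a"},
        {"number": 3, "street": "bb"},
    ]},
    {"name": "mark", "addresses": [
        {"number": 4, "street": "ccc"},
        {"number": 5, "street": "dddd"},
        {"number": 6, "street": "f"},
    ]},
]

def to_columnar(records):
    """Flatten nested address records into flat buffers plus offsets."""
    list_offsets = [0]    # person.addresses: bounds of each person's list
    numbers = []          # person.addresses.number, flattened
    street_offsets = [0]  # person.addresses.street: bounds of each string
    street_chars = []     # all street characters, concatenated
    for person in records:
        for address in person["addresses"]:
            numbers.append(address["number"])
            street_chars.extend(address["street"])
            street_offsets.append(len(street_chars))
        list_offsets.append(len(numbers))
    return list_offsets, numbers, street_offsets, "".join(street_chars)

list_offsets, numbers, street_offsets, street_data = to_columnar(persons)

def streets_of(i):
    """Recover person i's street names using only the flat buffers."""
    start, end = list_offsets[i], list_offsets[i + 1]
    return [street_data[street_offsets[j]:street_offsets[j + 1]]
            for j in range(start, end)]

print(list_offsets)    # [0, 2, 5]
print(street_offsets)  # [0, 1, 3, 6, 10, 11]
print(street_data)     # abbcccddddf
print(streets_of(1))   # ['ccc', 'dddd', 'f']
```

Because every field lives in one contiguous buffer, a scan over all `number` values never touches the street characters, and nested lists are recovered by slicing rather than by walking per-record objects.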
  18. Apache Arrow in practice
  19. Thank you. Wes McKinney (@wesmckinn). Views are my own.