Enabling Python to be a Better Big Data Citizen

These slides are from my talk at the NYC Python Meetup at ODSC Office NYC on February 17, 2016. They discuss the architectural challenges Python faces in interoperating with the Hadoop ecosystem and how a new project, Apache Arrow, will help.

Published in: Technology

  1. Enabling Python to be a Better Big Data Citizen
     Wes McKinney @wesmckinn
     NYC Python Meetup, 2016-02-17
     © Cloudera, Inc. All rights reserved.
  2. Me
     • R&D at Cloudera, formerly DataPad CEO/founder
     • Serial creator of structured data tools / user interfaces
     • Wrote bestseller Python for Data Analysis (2012)
     • Open source projects
       • Python {pandas, Ibis, statsmodels}
       • Apache {Arrow, Parquet, Kudu (incubating)}
     • Mostly work in Python and Cython/C/C++
  3. Some simplistic generalizations
     Industry Analytics
     • Heterogeneous data
       • Flat tables and JSON
     • Spark / MapReduce
     • SQL
     • DFS-friendly / streaming data formats
     • More physical machines
     Scientific Computing
     • Homogeneous data
       • Multidimensional arrays
     • HPC tools
     • Linear algebra
     • Scientific data formats (e.g. HDF5)
     • Fewer physical machines
     Python: heavy investment, generally
     Python: light investment, generally
  4. A sample big data architecture
     [Diagram labels: Application data, Kafka (four instances), HDFS, JSON, Spark/MapReduce, Columnar storage, Analytic SQL Engine, User, SQL]
  5. pandas
     • Hugely popular Python table / “data frame” library
       • Labeled table, array, and time series data structures
     • Popular for data preparation, ETL, and in-memory analytics
     • Built using Python’s scientific computing stack
       • User API / domain specific language
       • Bespoke in-memory analytics / relational algebra engine
       • IO interfaces (CSV, SQL, etc.)
       • Expanded data type system (beyond NumPy)
     • Supports flat data only (or semi-structured data that can be flattened)
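As an illustration of the last bullet, here is a minimal sketch of what "flattening" semi-structured data can mean: nested objects become dotted column names so every record fits a flat table. The `flatten` helper and the sample records are hypothetical and use only the standard library; pandas has its own helper for this (`json_normalize`), and this is not its implementation.

```python
from typing import Any


def flatten(record: dict[str, Any], prefix: str = "") -> dict[str, Any]:
    """Flatten nested dicts into a single-level dict with dotted keys."""
    flat: dict[str, Any] = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            # Recurse into nested objects, extending the dotted prefix.
            flat.update(flatten(value, prefix=f"{name}."))
        else:
            flat[name] = value
    return flat


# Hypothetical sample records with one nested object each.
records = [
    {"name": "wes", "address": {"number": 2, "street": "a"}},
    {"name": "mark", "address": {"number": 4, "street": "ccc"}},
]

rows = [flatten(r) for r in records]
columns = sorted({key for row in rows for key in row})
```

This only works because each record holds a single nested object. A field holding a list of repeated records does not map onto one flat row, which is the limitation the bullet points at.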
  6. 2016 Python Data Trends
     • Improved Python interoperability with the Apache Hadoop ecosystem
       • I’m working with {Arrow, Kudu, Impala, Parquet, Spark}
     • Support for big data file formats like Apache Parquet
     • Native in-memory Python support for nested / JSON-like data
  7. Ibis in a nutshell
     • For Python programmers doing analytics in industry
     • Project Blog: http://blog.ibis-project.org
     • Cross-team project @ Cloudera
     • Apache-licensed, open source http://github.com/cloudera/ibis
     • Crafting a compelling Python-on-Hadoop user experience
       • Remove SQL coding from user workflows
       • Develop high performance extensions in Python
  8. [No slide text]
  9. Enabling interoperability with big data systems
     • Distributed / MPP query engines: implemented in a host language
       • Typically C/C++ or Java/Scala
     • User-defined functions (UDFs) through various means
       • Implement in host language
       • Implement in user language through some external language protocol (often RPC-based)
     • External UDFs are usually very slow (cf. PL/Python, PySpark, etc.)
  10. Executing data science languages in the compute layer
      • UI: Ibis, SQL, Spark API, …
      • Compute: Analytic SQL, Spark, MapReduce
      • Storage: HDFS, Kudu, HBase
      Python, R, Julia, …?
  11. Python interoperability challenges
      • Problem 1: Serialization / deserialization overhead
      [Diagram: big data system (in partition 0 … in partition n - 1) → Python function input → user-supplied Python code → output → big data system (out partition 0 … out partition n - 1)]
  12. Data movement can be extremely costly
      [Diagram: in partition 0 → Python function input]
      Questions
      • How to represent “data in-flight” (RPC)?
      • Cost of conversion between in-memory data structures and RPC representation
      • How to communicate schemas / metadata?
  13. Data movement can be extremely costly
      [Diagram: in partition 0 → Python function input]
      Slow data movement / conversion can largely undermine the performance benefits of Python’s high performance in-memory data tools
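The cost described on these slides comes from copying and converting data on its way into the Python function. The sketch below contrasts two ways of handing a partition to Python code using only the standard library: a serialize/deserialize round trip, which builds a full copy, and a shared buffer, which does not. The `partition` array is a hypothetical stand-in for one input partition; no real big data system or RPC protocol is involved.

```python
import pickle
from array import array

# One hypothetical input partition: a contiguous buffer of float64 values.
partition = array("d", (float(i) for i in range(1000)))

# Path 1: convert to Python objects, serialize for the wire, then
# deserialize on the receiving side. Every value is copied and converted.
wire_bytes = pickle.dumps(partition.tolist())
received = pickle.loads(wire_bytes)

# Path 2: expose the same memory through the buffer protocol.
# Creating a memoryview does not copy the underlying data.
view = memoryview(partition)

# Writing to the original buffer is visible through the view,
# but not in the deserialized copy.
partition[0] = 42.0
```

After the final assignment, `view[0]` reflects the new value while `received[0]` still holds the old one, which is a small way to see that only the first path paid for a copy.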
  14. Python interoperability challenges
      • Problem 2: Scalar vs vectorized computations
      SCALAR
          result = np.empty(n)
          for i in range(n):
              result[i] = f(a[i], b[i])
      VECTORIZED (often 100-1000x faster)
          result = f(a, b)
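A runnable version of the scalar and vectorized snippets from this slide might look like the following. The function `f`, the array contents, and the size `n` are assumptions chosen for illustration; the slide does not define them. The sketch checks only that both forms produce the same result and makes no timing claim of its own.

```python
import numpy as np


def f(x, y):
    # Hypothetical element-wise function; it accepts scalars or arrays.
    return x * y + 1.0


n = 10_000
a = np.arange(n, dtype=np.float64)
b = np.full(n, 2.0)

# Scalar: one Python-level function call per element.
scalar_result = np.empty(n)
for i in range(n):
    scalar_result[i] = f(a[i], b[i])

# Vectorized: a single call over whole arrays; the per-element loop
# runs inside NumPy's compiled code instead of the Python interpreter.
vectorized_result = f(a, b)
```

A system that calls user code once per row is stuck with the scalar form, so handing the Python function whole columns at a time is what makes the vectorized form possible.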
  15. Apache Arrow: What is it?
      • http://arrow.apache.org
      • Not a piece of software, exactly!
      • A standardized in-memory representation for columnar data
      • Enables
        • Suitable for implementing high-performance analytics in-memory (think like “pandas internals”)
        • Cheap data interchange amongst systems, little or no serialization
        • Flexible support for complex JSON-like data
      • Targets: Impala, Kudu, Parquet, Spark
  16. Columnar data
          persons = [
            {
              name: 'wes',
              addresses: [
                {number: 2, street: 'a'},
                {number: 3, street: 'bb'},
              ]
            },
            {
              name: 'mark',
              addresses: [
                {number: 4, street: 'ccc'},
                {number: 5, street: 'dddd'},
                {number: 6, street: 'f'},
              ]
            },
          ]
  17. Columnar data
      • person.addresses
        offset: 0 2 5
      • person.addresses.street
        offset: 0 1 3 6 10
        values: a b b c c c d d d d f
      • person.addresses.number
        values: 2 3 4 5 6
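The offset and value arrays on this slide can be derived from the nested `persons` records on the previous one. The sketch below does so in plain Python, assuming each offsets array stores a start position per entry plus one final end position; it therefore ends the street offsets with 11, a value the slide text does not show. The variable names and the `streets_of` helper are illustrative, not part of any Arrow API.

```python
persons = [
    {"name": "wes", "addresses": [
        {"number": 2, "street": "a"},
        {"number": 3, "street": "bb"},
    ]},
    {"name": "mark", "addresses": [
        {"number": 4, "street": "ccc"},
        {"number": 5, "street": "dddd"},
        {"number": 6, "street": "f"},
    ]},
]

# Person i owns addresses [address_offsets[i], address_offsets[i + 1]).
address_offsets = [0]
numbers = []
# Address j owns characters [street_offsets[j], street_offsets[j + 1]).
street_offsets = [0]
street_chars = []

for person in persons:
    for address in person["addresses"]:
        numbers.append(address["number"])
        street_chars.extend(address["street"])
        street_offsets.append(len(street_chars))
    address_offsets.append(len(numbers))


def streets_of(i):
    """Rebuild the street names of person i from the flat columns."""
    first, last = address_offsets[i], address_offsets[i + 1]
    return [
        "".join(street_chars[street_offsets[j]:street_offsets[j + 1]])
        for j in range(first, last)
    ]
```

Each nested field ends up as one contiguous array, and the offsets are enough to recover which values belong to which parent record, as `streets_of(1)` shows for the second person.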
  18. Apache Arrow in practice
  19. Thank you
      Wes McKinney @wesmckinn
      Views are my own
