Practical Medium Data Analytics with Python (10 Things I Hate About pandas, PyData NYC 2013)

95,148 views

Published on

by Wes McKinney (@wesmckinn) at PyData NYC 2013

Published in: Technology
2 Comments
38 Likes
Statistics
Notes
No Downloads
Views
Total views
95,148
On SlideShare
0
From Embeds
0
Number of Embeds
77,343
Actions
Shares
0
Downloads
299
Comments
2
Likes
38
Embeds 0
No embeds

No notes for slide

Practical Medium Data Analytics with Python (10 Things I Hate About pandas, PyData NYC 2013)

  1. 1. Practical Medium Data Analytics with Python PyData NYC 2013
  2. 2. Practical Medium Data Analytics with Python 10 Things I Hate About pandas PyData NYC 2013
  3. 3. Wes McKinney @wesmckinn • Former quant and MIT math dude • Creator of Pandas project for Python • Author of Python for Data Analysis — O’Reilly • Founder and CEO of DataPad 3 www.datapad.io
  4. 4. • • 4 > 20k copies since Oct 2012 Bringing many new people to Python and data analysis with code www.datapad.io
  5. 5. • http://datapad.io Founded in 2013, located in SF • In private beta, join us! • • Hiring for engineering www.datapad.io
  6. 6. Why hate on pandas?
  7. 7. 7 www.datapad.io
  8. 8. pandas rocks!
  9. 9. So, pandas • Easy-to-use, fast in-memory data wrangling and analytics library • Enabled loads of complex data work to be done by mere mortals in Python • Might have kept R from taking over the world (hehe) 10 www.datapad.io
  10. 10. 11 www.datapad.io
  11. 11. pandas, the project • 170 distinct contributors • Over 5400 issues and pull requests on GitHub • 12 Upcoming 0.13 release www.datapad.io
  12. 12. But. • pandas’s broad applicability also a liability • pandas being used in some • Only game in town for lot of things unplanned ways 13 www.datapad.io
  13. 13. Some things to love • No more structured dtype drudgery! • Easy IO! • Data alignment! • Hierarchical indexing! • Time series analytics! 14 www.datapad.io
  14. 14. More things to love • Table reshaping • Missing data handling pandas.merge, pandas.concat • Expressive groupby machinery • 15 www.datapad.io
  15. 15. Some pandas use cases • General data wrangling • ETL jobs Business analytics (incl. BI uses) • Time series analysis, statistical • modeling 16 www.datapad.io
  16. 16. pandas does many things that are tedious, slow, or difficult to do correctly without it
  17. 17. Unfortunately, pandas is not a database
  18. 18. #1 Slightly too far from the metal • DataFrame’s internal structure intended to make row-oriented ops fast on numerical data • 19 Python objects can be used as data, indices (a feature, not a bug) www.datapad.io
  19. 19. #2 No support (yet) for memory maps • Many analytics ops require a small portion of the data • Many ways to “materialize” the full data set in memory by accident • Axis indexes wouldn’t necessarily make sense on out of core data sets 20 www.datapad.io
  20. 20. #2 No support (yet) for memory maps • N.B. HDF5/PyTables support is a partial solution 21 www.datapad.io
  21. 21. #3 No tight database integration • Makes it difficult to be a serious tool in an ETL toolchain on top of some SQL-ish system • 22 Inadequacy of pandas/NumPy data type systems www.datapad.io
  22. 22. #3 No tight database integration • Jobs with heavy SQL-reading are slow and use tons of memory • 23 TODO: integrate pandas with ODBC C API and write out SQL data directly into NumPy arrays www.datapad.io
  23. 23. #4 Best-efforts NA representation • Inconsistent representation of missing data • NA needs to be a first class citizen in • No Boolean or Integer NA values analytics operations 24 www.datapad.io
  24. 24. #5 RAM management • Difficult to understand footprint of pandas object • Ample data copying throughout library • Would benefit from being able to compress data in-memory or shuttle data temporarily to disk 25 www.datapad.io
  25. 25. #6 Weak support for categorical data • Makes pandas not quite a fullyfledged R replacement • 26 GroupBy and Joins slower than they could be www.datapad.io
  26. 26. #7 Complex GroupBy operations get messy • Must write custom functions to pass to .apply(..) • 27 Easy to run up against DRY problems and general Python syntax limitations www.datapad.io
  27. 27. #8 Appending data slow and tedious • DataFrame not intended as a database table • Makes streaming data use a challenge • B+ tree tables interesting? 28 www.datapad.io
  28. 28. #9 Limited type system, column metadata • Currencies, units • Time zones Geographic data • Composite data types • 29 www.datapad.io
  29. 29. #10 No true query processing layer • • • • • • 30 Filter Group Join Aggregate Limit/TopK Sorting WHERE, HAVING GROUP BY JOIN SUM, MEAN, ... LIMIT ORDER BY www.datapad.io
  30. 30. #11 “Slow”: no multicore / distributed algos • Hampered by use of Python data structures / GIL interactions • 31 Object internals not designed for concurrent use www.datapad.io
  31. 31. Oh no what do we do
  32. 32. Stop believing in the “one tool to rule them all”
  33. 33. “Real Artists Ship” - Steve Jobs
  34. 34. www.datapad.io
  35. 35. Focus on results • I am heavily biased by focus on business analytics/BI use cases • 36 Need production-ready software to ship in relatively short time frame www.datapad.io
  36. 36. A new project • In internal development at DataPad • Code named “badger” pandas-ish syntax: designed for • data processing and analytical queries 37 www.datapad.io
  37. 37. Badger in a nutshell • Compressed columnar binary storage • • High perf analytical query processor • Data preparation/cleaning tools Consistent data type system 38 www.datapad.io
  38. 38. Badger in a nutshell • Immutable array data, little copying • • Analytics kernels: written C with no Time series analytics dependencies • 39 Caching of useful intermediates www.datapad.io
  39. 39. Some benchmarks • Data set: 2012 Election data (FEC) 5.3 mm records 7 columns • • Tools • pandas badger • • R: data.table SQL: PostgreSQL, SQLite • 40 www.datapad.io
  40. 40. Query 1 • Total contributions by candidate SELECT  cand_nm,                sum(contb_receipt_amt)  AS  total FROM  fec GROUP  BY  cand_nm 41 www.datapad.io
  41. 41. Query 1 • Total contributions by candidate badger  (in-­‐memory)  :      19ms  (1x) badger  (from-­‐disk)  :    131ms  (6.9x) pandas  (in-­‐memory)  :    273ms  (14.3x) R  data.table  1.8.10:    382ms  (20x) PostgreSQL                  :      4.7s  (247x) SQLite                          :        72s  (3800x) 42 www.datapad.io
  42. 42. Query 2 contributions by candidate • Totalstate and SELECT  cand_nm,  contbr_st,              sum(contb_receipt_amt)  AS  total FROM  fec GROUP  BY  cand_nm,  contbr_st 43 www.datapad.io
  43. 43. Query 2 • Total contributions by candidate and state badger  (in-­‐memory)  :    269ms  (1x) badger  (from-­‐disk)  :    391ms  (1.5x) R  data.table  1.8.10:    500ms  (1.8x) pandas  (in-­‐memory)  :    770ms  (2.9x) PostgreSQL                  :    5.96s  (23x) 44 www.datapad.io
  44. 44. Query 3 • Total contributions by candidate and state with 2 filter predicates SELECT  cand_nm,              sum(contb_receipt_amt)  as  total FROM  fec WHERE  contb_receipt_dt  BETWEEN                '2012-­‐05-­‐01'  and  '2012-­‐11-­‐05'    AND  contb_receipt_amt  BETWEEN                  0  and  2500 GROUP  BY  cand_nm 45 www.datapad.io
  45. 45. Query 3 • Total contributions by candidate and state with 2 filter predicates badger  (in-­‐memory)  :      96ms  (1x) badger  (from-­‐disk)  :    275ms  (2.9x) pandas  (in-­‐memory)  :    946ms  (9.8x) PostgreSQL                  :      6.2s  (65x) 46 www.datapad.io
  46. 46. Badger, the future • Distributed in-memory analytics • Multicore algorithms • ETL job-building tools • Open source in some form someday Looking for algorithms hackers to help • 47 www.datapad.io
  47. 47. Thank you! 48 www.datapad.io

×