Practical Medium Data Analytics with Python (10 Things I Hate About pandas, PyData NYC 2013)


by Wes McKinney (@wesmckinn) at PyData NYC 2013


  • @cvaartjes I don't think so. It's a library we are running in production at DataPad, custom tailored and optimized for our use cases, and not being designed as next-gen general purpose compute infrastructure for scientific Python users.
  • Congrats, very nice one !-)

    Btw, I'm looking forward to trying *badger* :-)

Transcript

  • 1. Practical Medium Data Analytics with Python PyData NYC 2013
  • 2. Practical Medium Data Analytics with Python 10 Things I Hate About pandas PyData NYC 2013
  • 3. Wes McKinney @wesmckinn • Former quant and MIT math dude • Creator of the pandas project for Python • Author of Python for Data Analysis (O’Reilly) • Founder and CEO of DataPad • www.datapad.io
  • 4. • > 20k copies since Oct 2012 • Bringing many new people to Python and data analysis with code
  • 5. http://datapad.io • Founded in 2013, located in SF • In private beta, join us! • Hiring for engineering
  • 6. Why hate on pandas?
  • 7. [Slide with no text]
  • 8. pandas rocks!
  • 9. So, pandas • Easy-to-use, fast in-memory data wrangling and analytics library • Enabled loads of complex data work to be done by mere mortals in Python • Might have kept R from taking over the world (hehe)
  • 10. [Slide with no text]
  • 11. pandas, the project • 170 distinct contributors • Over 5400 issues and pull requests on GitHub • Upcoming 0.13 release
  • 12. But. • pandas’s broad applicability is also a liability • Only game in town for a lot of things • pandas being used in some unplanned ways
  • 13. Some things to love • No more structured dtype drudgery! • Easy IO! • Data alignment! • Hierarchical indexing! • Time series analytics!
  • 14. More things to love • Table reshaping • Missing data handling • pandas.merge, pandas.concat • Expressive groupby machinery
  • 15. Some pandas use cases • General data wrangling • ETL jobs • Business analytics (incl. BI uses) • Time series analysis, statistical modeling
  • 16. pandas does many things that are tedious, slow, or difficult to do correctly without it
  • 17. Unfortunately, pandas is not a database
  • 18. #1 Slightly too far from the metal • DataFrame’s internal structure intended to make row-oriented ops fast on numerical data • Python objects can be used as data, indices (a feature, not a bug)
  • 19. #2 No support (yet) for memory maps • Many analytics ops require a small portion of the data • Many ways to “materialize” the full data set in memory by accident • Axis indexes wouldn’t necessarily make sense on out of core data sets
  • 20. #2 No support (yet) for memory maps • N.B. HDF5/PyTables support is a partial solution
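The point in #2 that many operations need only a small portion of the data can be illustrated with NumPy's `np.memmap`. This is only a sketch of the general idea, not something pandas or the talk provides; the file name and array size are made up for the example.

```python
import os
import tempfile

import numpy as np

# Write a 1-million-element float64 array to a raw binary file on disk.
# (Hypothetical file, created only for this illustration.)
path = os.path.join(tempfile.mkdtemp(), "values.dat")
np.arange(1_000_000, dtype="float64").tofile(path)

# Map the file instead of reading it; the OS loads pages lazily on access.
mapped = np.memmap(path, dtype="float64", mode="r")

# Only the pages backing this ten-element slice need to be read.
window_sum = float(mapped[500_000:500_010].sum())
print(window_sum)
```

The rest of the file is never pulled into the process unless something touches it, which is the behavior a DataFrame could not offer at the time.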
  • 21. #3 No tight database integration • Makes it difficult to be a serious tool in an ETL toolchain on top of some SQL-ish system • Inadequacy of pandas/NumPy data type systems
  • 22. #3 No tight database integration • Jobs with heavy SQL-reading are slow and use tons of memory • TODO: integrate pandas with ODBC C API and write out SQL data directly into NumPy arrays
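A minimal sketch of the SQL-reading path described in #3, assuming a recent pandas and the standard-library `sqlite3` driver. The table contents are invented; only the column names are borrowed from the benchmark queries later in the talk.

```python
import sqlite3

import pandas as pd

# Small in-memory database standing in for a real SQL source.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE fec (cand_nm TEXT, contb_receipt_amt REAL)")
conn.executemany(
    "INSERT INTO fec VALUES (?, ?)",
    [("A", 100.0), ("B", 250.0), ("A", 50.0)],
)

# The driver hands rows back as Python tuples; pandas then converts them
# into column arrays, so the data passes through Python objects on the way.
df = pd.read_sql_query("SELECT cand_nm, contb_receipt_amt FROM fec", conn)
conn.close()
print(df.dtypes)
```

Writing database results straight into NumPy arrays, as the TODO on the slide proposes, would skip that intermediate layer of Python objects.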
  • 23. #4 Best-efforts NA representation • Inconsistent representation of missing data • No Boolean or Integer NA values • NA needs to be a first-class citizen in analytics operations
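The lack of Boolean or Integer NA values shows up as silent dtype changes with the default NumPy-backed dtypes. A small sketch with made-up data:

```python
import pandas as pd

counts = pd.Series([1, 2, 3], index=["a", "b", "c"], dtype="int64")

# Reindexing introduces a missing value for "d". With no integer NA,
# the column is upcast to float64 and the gap is stored as NaN.
with_gap = counts.reindex(["a", "b", "c", "d"])
print(with_gap.dtype)

# Booleans fare worse: the column falls back to generic Python objects.
flags = pd.Series([True, False], index=["a", "b"]).reindex(["a", "b", "c"])
print(flags.dtype)
```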
  • 24. #5 RAM management • Difficult to understand footprint of pandas object • Ample data copying throughout library • Would benefit from being able to compress data in-memory or shuttle data temporarily to disk
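One way to see why the footprint is hard to reason about, assuming a pandas version that has `DataFrame.memory_usage` (it was added after this talk). The data is invented for the example.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "amount": np.arange(1000, dtype="float64"),
    # Force object dtype so each value is a separate Python string.
    "state": pd.Series(["NY", "CA"] * 500, dtype=object),
})

# Shallow usage counts only the array buffers (8 bytes per pointer or float).
shallow = df.memory_usage(index=False)
# Deep usage also counts the Python string objects the pointers refer to.
deep = df.memory_usage(index=False, deep=True)

print(shallow["amount"], deep["amount"])
print(shallow["state"], deep["state"])
```

The numeric column reports the same size either way, while the string column is several times larger than its buffer suggests.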
  • 25. #6 Weak support for categorical data • Makes pandas not quite a fully-fledged R replacement • GroupBy and Joins slower than they could be
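For context, a sketch of what categorical storage looks like, using the `category` dtype that pandas gained after this talk. It is shown only to illustrate the idea of integer codes plus a small label table; the data is made up.

```python
import pandas as pd

states = pd.Series(["NY", "CA", "NY", "TX"] * 250, dtype=object)
as_cat = states.astype("category")

# Each value is stored as a small integer code pointing into the
# sorted array of unique labels.
categories = list(as_cat.cat.categories)
first_codes = as_cat.cat.codes[:4].tolist()
print(categories)   # ['CA', 'NY', 'TX']
print(first_codes)  # [1, 0, 1, 2]
```

Grouping and joining on integer codes avoids hashing the same strings over and over, which is the slowdown the slide refers to.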
  • 26. #7 Complex GroupBy operations get messy • Must write custom functions to pass to .apply(..) • Easy to run up against DRY problems and general Python syntax limitations
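A sketch of the kind of custom function #7 has in mind. The `summarize` helper and the 500 threshold are hypothetical, and the data is invented.

```python
import pandas as pd

df = pd.DataFrame({
    "cand_nm": ["A", "A", "A", "B", "B"],
    "contb_receipt_amt": [100.0, 2500.0, 50.0, 300.0, 700.0],
})


def summarize(amounts):
    # Hypothetical metric: share of a group's total coming from
    # contributions of 500 or more.
    total = amounts.sum()
    large = amounts[amounts >= 500].sum()
    return pd.Series({"total": total, "large_share": large / total})


# apply() returns one small Series per group; unstack() turns the
# resulting two-level index into one row per candidate.
summary = df.groupby("cand_nm")["contb_receipt_amt"].apply(summarize).unstack()
print(summary)
```

Anything that does not fit a built-in aggregation ends up in a function like this, and variations on it tend to be copied rather than composed.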
  • 27. #8 Appending data slow and tedious • DataFrame not intended as a database table • Makes streaming data use a challenge • B+ tree tables interesting?
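Because growing a DataFrame means allocating new arrays and copying, a common workaround is to buffer incoming batches and concatenate once. A sketch, with a hypothetical `incoming_batches` generator standing in for a stream:

```python
import pandas as pd


def incoming_batches():
    # Hypothetical stand-in for a stream delivering small batches.
    for start in range(0, 6, 2):
        yield pd.DataFrame({"ts": [start, start + 1], "value": [0.5, 0.7]})


# Collect the pieces in a plain list (cheap), then copy the data
# a single time at the end instead of once per batch.
pieces = list(incoming_batches())
combined = pd.concat(pieces, ignore_index=True)
print(len(combined))
```

This keeps the total copying linear in the data size, but the full table is still only available after the final concat.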
  • 28. #9 Limited type system, column metadata • Currencies, units • Time zones • Geographic data • Composite data types
  • 29. #10 No true query processing layer • Filter: WHERE, HAVING • Group: GROUP BY • Join: JOIN • Aggregate: SUM, MEAN, ... • Limit/TopK: LIMIT • Sorting: ORDER BY
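To make the SQL mapping concrete, here is a sketch of how those steps are usually spelled in pandas. Each call runs eagerly and produces its own intermediate result; nothing plans the chain as a whole. The data is invented.

```python
import pandas as pd

df = pd.DataFrame({
    "cand_nm": ["A", "A", "B", "B", "C"],
    "contbr_st": ["NY", "CA", "NY", "NY", "TX"],
    "contb_receipt_amt": [100.0, 200.0, 50.0, 75.0, 10.0],
})

result = (
    df[df["contb_receipt_amt"] > 20]           # WHERE
    .groupby("cand_nm")["contb_receipt_amt"]   # GROUP BY
    .sum()                                     # SUM
    .sort_values(ascending=False)              # ORDER BY ... DESC
    .head(1)                                   # LIMIT 1
)
print(result)
```

A query processor could, for example, fuse the filter into the aggregation or keep only the top group while scanning, instead of sorting every group.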
  • 30. #11 “Slow”: no multicore / distributed algos • Hampered by use of Python data structures / GIL interactions • Object internals not designed for concurrent use
  • 31. Oh no what do we do
  • 32. Stop believing in the “one tool to rule them all”
  • 33. “Real Artists Ship” - Steve Jobs
  • 34. [Slide with no text]
  • 35. Focus on results • I am heavily biased by focus on business analytics/BI use cases • Need production-ready software to ship in relatively short time frame
  • 36. A new project • In internal development at DataPad • Code named “badger” • pandas-ish syntax: designed for data processing and analytical queries
  • 37. Badger in a nutshell • Compressed columnar binary storage • High perf analytical query processor • Data preparation/cleaning tools • Consistent data type system
  • 38. Badger in a nutshell • Immutable array data, little copying • Analytics kernels: written in C with no dependencies • Time series analytics • Caching of useful intermediates
  • 39. Some benchmarks • Data set: 2012 Election data (FEC), 5.3 million records, 7 columns • Tools: pandas • badger • R: data.table • SQL: PostgreSQL, SQLite
  • 40. Query 1 • Total contributions by candidate
        SELECT cand_nm,
               sum(contb_receipt_amt) AS total
        FROM fec
        GROUP BY cand_nm
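For reference, Query 1 could be written in pandas roughly as follows. The slides do not show the pandas code that was benchmarked, so this is an assumption about its general shape, run here on a few invented rows rather than the FEC data.

```python
import pandas as pd

# Tiny stand-in for the fec table; values are invented.
fec = pd.DataFrame({
    "cand_nm": ["A", "B", "A", "B", "A"],
    "contb_receipt_amt": [100.0, 250.0, 50.0, 25.0, 10.0],
})

# GROUP BY cand_nm, SUM(contb_receipt_amt) AS total
totals = (
    fec.groupby("cand_nm")["contb_receipt_amt"]
    .sum()
    .rename("total")
    .reset_index()
)
print(totals)
```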
  • 41. Query 1 • Total contributions by candidate
        badger (in-memory)  :   19ms (1x)
        badger (from-disk)  :  131ms (6.9x)
        pandas (in-memory)  :  273ms (14.3x)
        R data.table 1.8.10 :  382ms (20x)
        PostgreSQL          :   4.7s (247x)
        SQLite              :    72s (3800x)
  • 42. Query 2 • Total contributions by candidate and state
        SELECT cand_nm, contbr_st,
               sum(contb_receipt_amt) AS total
        FROM fec
        GROUP BY cand_nm, contbr_st
  • 43. Query 2 • Total contributions by candidate and state
        badger (in-memory)  :  269ms (1x)
        badger (from-disk)  :  391ms (1.5x)
        R data.table 1.8.10 :  500ms (1.8x)
        pandas (in-memory)  :  770ms (2.9x)
        PostgreSQL          :  5.96s (23x)
  • 44. Query 3 • Total contributions by candidate and state with 2 filter predicates
        SELECT cand_nm,
               sum(contb_receipt_amt) AS total
        FROM fec
        WHERE contb_receipt_dt BETWEEN
              '2012-05-01' AND '2012-11-05'
          AND contb_receipt_amt BETWEEN
              0 AND 2500
        GROUP BY cand_nm
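A possible pandas rendering of Query 3, again an assumption rather than the benchmarked code, on invented rows. `Series.between` is inclusive on both ends by default, matching SQL's BETWEEN.

```python
import pandas as pd

# Invented rows; two of the four fall outside a filter predicate.
fec = pd.DataFrame({
    "cand_nm": ["A", "A", "B", "B"],
    "contb_receipt_dt": pd.to_datetime(
        ["2012-04-30", "2012-06-01", "2012-11-05", "2012-07-04"]
    ),
    "contb_receipt_amt": [100.0, 200.0, 2500.0, 3000.0],
})

# Each predicate builds a full-length boolean mask before any grouping.
in_window = fec["contb_receipt_dt"].between(
    pd.Timestamp("2012-05-01"), pd.Timestamp("2012-11-05")
)
in_range = fec["contb_receipt_amt"].between(0, 2500)

total = fec[in_window & in_range].groupby("cand_nm")["contb_receipt_amt"].sum()
print(total)
```

The masks and the filtered copy are all materialized before the aggregation starts, which may account for part of the gap in the timings for this query.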
  • 45. Query 3 • Total contributions by candidate and state with 2 filter predicates
        badger (in-memory)  :   96ms (1x)
        badger (from-disk)  :  275ms (2.9x)
        pandas (in-memory)  :  946ms (9.8x)
        PostgreSQL          :   6.2s (65x)
  • 46. Badger, the future • Distributed in-memory analytics • Multicore algorithms • ETL job-building tools • Open source in some form someday • Looking for algorithms hackers to help
  • 47. Thank you!