
Python for Data Analysis: Past, Present, and Future (PyCon Colombia 2020)

Slides from Wes McKinney's keynote at PyCon Colombia in Medellín, Colombia, on February 8, 2020.


  1. Wes McKinney (@wesmckinn). PyCon Colombia 2020: Python for Data Analysis: Past, Present, and Future
  2. Wes’s professional timeline: pandas (2008), DataPad (2013), Apache Arrow (2014 - Present)
  3. Perspectives on the last 12 years
  4. January 2020: pandas 1.0 ● 26th major release after 10 years of development ● ~2,000 unique contributors ● Thanks, Indeed!
  5. Dec 2009 - pandas 0.1 ● First open source release after ~18 months of proprietary use ● Still on PyPI!
  6. Funding pandas development ● pandas received its first formal grant in 2019 from the Chan Zuckerberg Initiative ● Core devs are primarily volunteers, self-funded, or company-funded (Anaconda, others)
  7. The early pandas gang (2011 - 2012): Wes McKinney, Chang She, Adam Klein
  8. pandas’s amazing core dev team (Core Dev Meetup, 2019). Partial cast of characters: Jeff Reback, Tom Augspurger, Brock Mendel, Marc Garcia, Joris van den Bossche
  9. Community engagement
  10. Python’s journey to mainstream data language
  11. 11. "We believe that in the coming years there will be great opportunity to attract users in need of statistical data analysis tools to Python who might have previously chosen R, MATLAB, or another research environment. By designing robust, easy to-use data structures that cohere with the rest of the scientific Python stack, we can make Python compelling choice for data analysis applications. In our opinion, pandas provides a solid foundation upon which a very powerful data analysis ecosystem can be established." Me, Proceedings of SciPy 2011
  12. [Chart] StackOverflow data from September 2017
  13. [Chart] StackOverflow data from September 2017 (continued)
  14. Factors driving Python’s growth
  15. Contributing factors ● Massive need for data wranglers + scientists ● “Perfect storm” of necessary packages ● New data science education ● Successful early adopters ● Packaging improvements
  16. Perfect storm of packages
  17. View from 2008
  18. Confronting Fear, Uncertainty, Doubt
  19. Common concerns ● Large codebase concerns ● Long-term software lifecycle ● Interpreted languages ○ ... unsafe? ○ ... slow? ● Open source... trustworthy?
  20-23. May 2011 - “PyData” core dev meetings: "Need a toolset that is robust, fast, and suitable for a production environment..." "... but also good for interactive research..." "... and easy / intuitive for non-software engineers to use" (* also, we need to fix packaging)
  24. July 2011 - Concerns: "... the current state of affairs has me rather anxious … these tools [e.g. pandas] have largely not been integrated with any other tools because of the community's collective commitment anxiety" http://wesmckinney.com/blog/a-roadmap-for-rich-scientific-data-structures-in-python/
  25. Reading CSV files
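A minimal sketch of the pandas CSV-reading API the slide refers to; the file name, columns, and options are illustrative, not from the talk:

    import pandas as pd

    # Read a CSV into a DataFrame in one call; pandas infers column types.
    # "flights.csv" and the column names are hypothetical.
    df = pd.read_csv(
        "flights.csv",
        parse_dates=["flight_date"],    # parse a date column during the read
        dtype={"carrier": "category"},  # pin a dtype to reduce memory use
    )
    print(df.dtypes)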
  26. Python for Data Analysis book - 2012 ● A primer on data manipulation in Python ● Focus: NumPy, IPython / Jupyter, pandas, matplotlib ● 2 editions (2012, 2017) ● 8 translations so far
  27. PyData NYC 2013: 10 Things I Hate About pandas ● November 2013 ● Summary: “pandas is not designed like, or intended to be used as, a database query engine”
  28. Fall 2014: Python in a Big Data World ● Task: Helping Python become a first-class technology for Big Data ● Some problems: file formats, JVM interop, non-array-oriented interfaces
  29. Difficulties in pandas (and R) dataframes ● Limited built-in data types ● Performance and memory use issues ● Challenges with larger-than-memory datasets ● Naive execution strategies (no “query optimization”)
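On the larger-than-memory point above: pandas can at least stream a file in pieces rather than loading it whole. A hedged sketch using the real chunksize option of pandas.read_csv; the file and column names are hypothetical:

    import pandas as pd

    # Aggregate a column from a CSV too large to hold in memory,
    # one million rows at a time.
    total = 0.0
    count = 0
    for chunk in pd.read_csv("big.csv", usecols=["dep_delay"], chunksize=1_000_000):
        total += chunk["dep_delay"].sum()
        count += int(chunk["dep_delay"].count())
    print("mean dep_delay:", total / count)

This works around memory limits but not the "naive execution strategies" point: each chunk is still evaluated eagerly, with no query planning.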
  30. [Image] Does not cut down trees.
  31. Out of memory on 10GB of CSVs
  32. A … of doubt
  33. Changing the tides: [project logos] … and others
  34. Fragmentation of data and code
  35. Other thoughts ● Projects like pandas may be taking responsibility for too many things ● It would be more productive (long-term) to have a reusable computational foundation for data frames
  36. Apache Arrow ● New data frame format designed for speed ● Computational foundation for data processing libraries ● Fast cross-language data interchange ● [Diagram: Arrow memory linking the JVM data ecosystem, database systems, and data science libraries]
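To make the "fast cross-language data interchange" point concrete, here is a minimal sketch of moving data between pandas and Arrow with pyarrow; the DataFrame contents are made up:

    import pandas as pd
    import pyarrow as pa

    df = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})

    # pandas -> Arrow: the table now uses Arrow's columnar memory layout,
    # which any Arrow-supporting language or system can consume.
    table = pa.Table.from_pandas(df)

    # Arrow -> pandas: conversion back, zero-copy in some numeric cases.
    roundtrip = table.to_pandas()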
  37. Defragmenting Data
  38. Apache Arrow ● https://github.com/apache/arrow ● Over 400 unique contributors ● Some level of support for 11 programming languages
  39. Important features ● CPU/GPU-friendly columnar memory layout ● Memory map huge datasets ● Relocate data structures without serialization
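A hedged sketch of the memory-mapping feature using pyarrow's IPC file format; the file name is hypothetical, and the pattern follows the pyarrow documentation rather than anything shown on the slide:

    import pyarrow as pa

    table = pa.table({"x": list(range(1000))})

    # Write the table in the Arrow IPC file format.
    with pa.OSFile("data.arrow", "wb") as sink:
        with pa.ipc.new_file(sink, table.schema) as writer:
            writer.write_table(table)

    # Memory-map the file: the OS pages data in on demand, so even a huge
    # file can be opened without copying or deserializing it.
    with pa.memory_map("data.arrow", "r") as source:
        loaded = pa.ipc.open_file(source).read_all()
    print(loaded.num_rows)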
  40. Arrow C++ Platform: [Diagram: Core Data Platform and Multi-core Work Scheduler, with a Query Engine, Datasets Framework, and Arrow Flight RPC over network and storage]
  41. “New Data Frame” projects ● dask.dataframe ● Modin ● NVIDIA RAPIDS ● Vaex ● ... and more surely in development
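As an illustration of the first project in the list above, a small dask.dataframe sketch: the same pandas-style verbs, evaluated lazily and in parallel across partitions. The glob path and column names are hypothetical:

    import dask.dataframe as dd

    # Each CSV becomes one or more partitions processed in parallel.
    df = dd.read_csv("flights-*.csv")

    # Nothing is computed until .compute() is called.
    result = df.groupby("carrier")["dep_delay"].mean().compute()
    print(result)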
  42. Learning from R ● Domain-specific language culture (“same code, different backends”) ● Non-standard evaluation ○ Inspect and manipulate unevaluated code fragments
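Python has no true non-standard evaluation, but pandas gestures at the idea with string expressions that it parses and evaluates itself. A small sketch; the data is made up:

    import pandas as pd

    df = pd.DataFrame({"arr": [10, 45, 5], "dep": [35, 2, 8]})

    # The string is handed to pandas unevaluated, loosely analogous to how
    # dplyr inspects unevaluated R code fragments.
    late = df.query("arr > 30 or dep > 30")
    print(late)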
  43-45. Arrow’s relationship with dplyr and friends:

      flights %>%
        group_by(year, month, day) %>%
        select(arr_delay, dep_delay) %>%
        summarise(
          arr = mean(arr_delay, na.rm = TRUE),
          dep = mean(dep_delay, na.rm = TRUE)
        ) %>%
        filter(arr > 30 | dep > 30)

  ● The flights input can be a massive Arrow dataset ● dplyr verbs can be translated to Arrow computation graphs, executed by a parallel runtime ● R expressions can be JIT-compiled with LLVM
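For comparison, a hedged Python sketch of a related scan-with-filter against a large on-disk dataset, using the pyarrow.dataset API (brand new and still maturing at the time of this talk); the path and schema are hypothetical:

    import pyarrow.dataset as ds

    # Scan a potentially larger-than-memory Parquet dataset, pushing the
    # column selection and filter down into the scan itself.
    dataset = ds.dataset("flights/", format="parquet")
    table = dataset.to_table(
        columns=["arr_delay", "dep_delay"],
        filter=(ds.field("arr_delay") > 30) | (ds.field("dep_delay") > 30),
    )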
  46. Funding ambitious new open source projects
  47. Ursa Labs ● https://ursalabs.org ● Apache Arrow-powered data science tools ● Funded by corporate partners ● Built in collaboration with RStudio ● Some partners: [logos]
  48. Looking forward
