Looking backward, looking forward
Wes McKinney @wesmckinn
PyCon DE / PyData Karlsruhe 2018
Talk given in Karlsruhe, Germany, on October 25, 2018.
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"

  1. Looking backward, looking forward Wes McKinney @wesmckinn PyCon DE / PyData Karlsruhe 2018
  2. Motivations
  3. Guiding questions
  4. How to make data analysis “easier”?
  5. Making individuals more productive
  6. More fruitful open source collaborations
  7. Better hardware utilization
  8. Examining the status quo
  9. Change is difficult
  10. From one existential crisis to another
  11. April 2008 - Avant-garde PyData ● Socializing Python inside AQR, a quantitative hedge fund ● scipy.stats.models enabled some R -> Python workload migration
  12. Dec 2009 - pandas 0.1 ● First open source release after ~18 months of internal-only use
  13. May 2011 - “PyData” core dev meetings "Need a toolset that is robust, fast, and suitable for a production environment..."
  14. May 2011 - “PyData” core dev meetings "Need a toolset that is robust, fast, and suitable for a production environment..." "... but also good for interactive research..."
  15. May 2011 - “PyData” core dev meetings "Need a toolset that is robust, fast, and suitable for a production environment..." "... but also good for interactive research..." "... and easy / intuitive for non-software engineers to use"
  16. May 2011 - “PyData” core dev meetings * also, we need to fix packaging
  17. July 2011 - Concerns "... the current state of affairs has me rather anxious … these tools [e.g. pandas] have largely not been integrated with any other tools because of the community's collective commitment anxiety" http://wesmckinney.com/blog/a-roadmap-for-rich-scientific-data-structures-in-python/
  18. July 2011 - Concerns "Fragmentation is killing us" http://wesmckinney.com/blog/a-roadmap-for-rich-scientific-data-structures-in-python/
  19. Reading CSV files
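The "Reading CSV files" slide points at one of pandas's signature conveniences: parsing a text file into a typed DataFrame in one call. A minimal sketch, using made-up example data (the column names and values here are illustrative, not from the talk):

```python
import io

import pandas as pd

# Illustrative CSV content standing in for a file on disk.
csv_data = io.StringIO(
    "date,ticker,price\n"
    "2011-05-01,AAPL,350.13\n"
    "2011-05-01,GOOG,544.10\n"
    "2011-05-02,AAPL,348.20\n"
)

# pandas.read_csv infers column types and can parse dates directly.
df = pd.read_csv(csv_data, parse_dates=["date"])

print(df.dtypes)           # date: datetime64[ns], price: float64
print(df["price"].mean())
```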
  20. Python for Data Analysis book - 2012 ● A primer on data manipulation in Python ● Focus: NumPy, IPython/Jupyter, pandas, matplotlib ● 2 editions (2012, 2017) ● 8 translations so far
  21. 2013-2014 - An Entrepreneurial Detour DataPad: Python-powered Business Analytics ● Backend built with PyData stack + custom analytics ● Goal to contribute tech back to OSS ecosystem
  22. DataPad learnings ● 200ms threshold for interactivity ● Multitenant query execution, resource management ● pandas performance / memory use problems
  23. PyData NYC 2013: 10 Things I Hate About pandas ● November 2013 ● Summary: “pandas is not designed like, or intended to be used as, a database query engine”
  24. Vertical Integration - The Good ● Control ● Development speed ● Releases
  25. Vertical Integration - The Bad ● Large scope of code ownership ● Lack of code reuse ● Bit rot
  26. Fall 2014: Python in a Big Data World Task: Helping Python become a first-class technology for Big Data Some problems: ● File formats ● JVM interop ● Non-array-oriented interfaces
  27. Fragmentation of data and code
  28. Apache Arrow: Defragmenting data systems ● Language-independent open standard in-memory representation for columnar data (i.e. data frames) ● Easily reuse code targeting Arrow memory ● Efficient memory interchange between the JVM data ecosystem, database systems, and data science libraries
  29. Apache Arrow: Defragmenting data systems ● https://github.com/apache/arrow ● Over 200 unique contributors ● Some level of support for 11 programming languages
  30. Funding ambitious new open source projects
  31. Early Partners ● https://ursalabs.org ● Apache Arrow-powered Data Science Tools ● Funded by corporate partners ● Built in collaboration with RStudio
  32. Looking forward
