My Data Journey with Python (SciPy 2015 Keynote)
Wes McKinney
Director of Ursa Labs, Open Source Developer at Ursa Labs
Jul. 9, 2015
Stories from my time as a Python programmer. Delivered July 9, 2015 in Austin at SciPy 2015.
1. My Data Journey with Python
Wes McKinney @wesmckinn
SciPy 2015 Keynote, 2015-07-09
© Cloudera, Inc. All rights reserved.
2. Who am I?
3. This talk
• 2007-present, from my perspective
• Celebrating our successes
• Challenges and opportunities for the future
4. Why are we all here?
5. My pre-2007 existence
• I was a mathematician!
• No exposure to Python, SQL, R (or any analytics for that matter)
• Rude awakening ahead
6. My first job: AQR (quant hedge fund)
• A quant finance operation that lived and breathed SQL and Excel
• Production systems in C++, Java, Visual Basic, and C# .NET
• Some PhD-level researchers used MATLAB for research (as was common in finance / economics departments)
7. Productivity frustrations
• First year: several analytics and statistical data analysis projects
  • A huge amount of SQL
  • Some Java
  • A little bit of R
  • … and TONS of Excel
• Projects felt like 5% conceptualization, 95% tedium
8. Python in early 2008: different times
• A bleeding edge stack
  • NumPy 1.0.4
  • SciPy 0.6.0
  • matplotlib 0.91.2
  • IPython 0.8.4, SVN history begins 2/2008
  • Cython 0.9.8
• The scientific Python community seemed mainly focused on attracting MATLAB, HPC, and scientific lab users
9. 2008: Things SciPythonistas didn’t care too much about
• Relational data or SQL
• Missing data handling (outside numpy.ma)
• Statistics and econometrics (first statsmodels release: 2011)
• Statistical graphics
• Machine learning (scikit-learn 0.1: 2/2010)
• Analytics and business intelligence
10. Taking a gamble
• Decided to give Python a shot for AQR projects after seeing part of MASS R package ported in scipy.stats.models by Jonathan Taylor at Stanford
• proto-pandas first version built in April 2008
• Focused on porting an R project to Python
• May ‘08: Embedded Python interpreter in a legacy C++ system
• 5/2008 – 12/2008: Skunkworks Python ports and evangelism across company
11. Why did Python work out?
• Batteries included
• Interoperability with C++
  • Embedding Python interpreter
  • Wrapping C++ in Python C extensions
• Productive user interface
  • Python language
  • IPython + matplotlib
13. Some other cool things we built
• A global macro risk modeling system (using pandas + NumPy + PyTables)
• A heterogeneous market data loading and cleaning system
• A task-based cluster computing system (similar to Celery)
• Tick data storage and analytics
• Various GUIs with wxPython + matplotlib
14. End 2009: pandas!
• AQR lets me open source pandas 0.1 on Christmas, 2009.

~/Downloads/pandas-0.1 $ cloc --exclude-ext pandas
---------------------------------------------------------------
Language          files      blank    comment       code
---------------------------------------------------------------
Python               41       3124       2933       8225
Cython                7        418         93       1247
C/C++ Header          1          0          0          1
---------------------------------------------------------------
SUM:                 49       3542       3026       9473
---------------------------------------------------------------
15. 2010 – 2011: Python’s data growing pains
• pandas did not evolve much after its initial release
• No consensus or momentum behind any project for analytics / data wrangling
• AQR -> Duke Statistical Science
• AQR sponsors bug fixes and new features in pandas
16. May 2011: Getting inspired
• 2011-05-13: Enthought Datarray Summit
  • Discuss how to enable Python to become more useful for statistical computing
  • Me: “Library fragmentation is destructive; integration is better”
  • Data structures, missing data, and data wrangling tools
• 2011-05-23 – 2011-06-03: Python finance consulting engagement
  • Realized that Python data tools sorely needed in industry
  • But not nearly mature enough yet
17. 2011-05-30
18. Tell me about your use cases
19. Making pandas a better tool
• Consulting at AppNexus (NYC ad tech company) opened eyes to new problems
• June 2011 – December 2012
• Fix some pandas design issues
• Build out data wrangling capabilities (hierarchical indexes, etc.)
• Create “killer apps” (time series capabilities)
• Evangelize and collaborate with other projects
20. Taking advantage of temporary financial freedom
21. Making a book happen
• A chicken-and-egg problem
• Fernando Pérez, Brian Granger, and John Hunter had been toying with the idea of a “SciPy Book” for a couple years
• Decided to forge my own path in Nov 2011
• Writing took about 9 months
• Helped motivate me to “finish” parts of pandas
• ~ 50,000 copies in circulation
22. Clarity and software engineering
• Progress in software not just about hard work
• Solving the right problems
• … in the right order
• … while wasting little time/energy on non-impactful issues
• … while being faced with real world concerns (80/20 rule)
• Taking the time to develop a clear vision and scope for a project is a major factor in its success or failure
23. It took a village
• Fernando Pérez & Brian Granger (IPython)
• Skipper Seabold & Josef Perktold (statsmodels)
• Eric Jones (Enthought)
• Travis Oliphant & Peter Wang (Enthought & Continuum)
• John Hunter (matplotlib)
• … and many others
24. An unlikely train ride
SEA -> PDX
November 18, 2011
25. Seatmate: “Are you a programmer?” (he saw my Emacs buffers)
26. Wes: “Yeah, I do Python”
27. Seatmate: “Oh, I do a bit of Python too”
28. Wes: “Cool, well, there’s this awesome new thing called the IPython notebook”
29. My seatmate was computational biology professor and 5-year PSF member Titus Brown
30. And he would later assist the IPython team in their Sloan Foundation $1mm grant in 2012
31. Some words about John Hunter (1968 – 2012)
32. Business ventures 2012 – 2014
• 2012: Lambda Foundry
  • Support and develop pandas
  • Explored creating a commercial Python financial toolkit
• 2013 – 2014: DataPad
  • “Google Drive for Analytics / BI”
  • With Chang She (MIT -> AQR -> pandas)
  • Silicon Valley VC-backed
  • Acquired by Cloudera in September 2014
33. Cloudera
• Sort of “the Red Hat of Big Data”
• The leading open source Hadoop platform
• Supporting and developing a little over 20 Apache-licensed open source projects
• A dream job
  • Full time open source development
  • Solving hard data problems faced by the world’s largest companies
• P.S. we’re hiring engineers in Austin + Bay Area
34. What I’m interested in right now
• Ways to enable collaboration on data tools across programming languages
• Domain specific language design and compilation
• Improving the Python-on-Hadoop experience
• LLVM + Code generation
35. Different kinds of Big Data
• Python programmers have been dealing with big scientific data in HPC settings for years
• Big…
  • Text data
  • Homogeneous array data
  • Tabular (structured) data
  • JSON-like (semi-structured) data
36. The Great Data Tool Decoupling™
• Thesis: over time, user interfaces, data storage, and execution engines will decouple and specialize
• In fact, you should really want this to happen
  • Share systems among languages
  • Reduce fragmentation and “lock-in”
  • Shift developer focus to usability
• Prediction: we’ll be there by 2025; sooner if we all get our act together
37. Thank you
@wesmckinn