Python Pandas 
Lessons Learned in Performance and 
Design
Who we are 
Chang She - CTO/Cofounder @ DataPad, core 
pandas contributor, recovering financial quant. 
Follow me on twitter: @changhiskhan 
Andy Hayden - core pandas contributor, analyst 
and software engineer from the UK turned Data 
Scientist in CA, avid data tool maker
What are we talking about 
- Why pandas? 
- What’s cool about pandas? 
- How do we improve and track performance 
- A few data structures and algorithms 
- Bad idioms and how to fix them
What is it? 
- Python library for analyzing real world data 
- Created by Wes McKinney, now led by Jeff 
Reback 
- Supported on all platforms 
- Supports Python 3.4 as of latest version 
- Big and active community
Pandas Highlights 
- Labelled data and automatic alignment 
- Easy data integration 
- Flexible slicing and dicing of data 
- Analytics made to fit your brain, not vice versa 
(I’m looking at you SQL) 
USER PRODUCTIVITY
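A minimal sketch of what "labelled data and automatic alignment" means in practice. The data and the names `a`, `b`, `total` and `subset` are made up for illustration and are not from the talk's notebook.

```python
import pandas as pd

# Two Series whose labels overlap only partly and arrive in different orders.
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0, 30.0], index=["z", "y", "w"])

# Arithmetic aligns on labels, not positions. Labels that appear in only
# one operand ("x" and "w" here) come out as NaN.
total = a + b

# Label-based selection with .loc, one form of "slicing and dicing".
subset = total.loc[["y", "z"]]
print(subset)
```

No manual join or sort is needed before the addition; the index does that work.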
Productivity via better workflow 
- Single tool to minimize cognitive dissonance 
- Iterative and not linear workflow 
- Performant enough for interactive work
Pandas basics 
(notebook)
Priorities 
- Build the right abstractions 
- Get the API right 
- Then optimize for performance
Open source APIs 
- Sometimes you can’t be all things to all 
people 
- You can only add to an API, rarely change it, 
and never get rid of any part of it 
- Documentation Documentation 
Documentation
An example 
- DataFrame started life as essentially a dict of 
Series 
- There was also DataMatrix 
- Unified under DataFrame by combining 
homogeneous blocks: performant, with a single 
API
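The "dict of Series" view can still be sketched with the public API. The column names and values below are invented for illustration. The block storage is internal to pandas and is not inspected here; only the per-column dtypes that the blocks are grouped by are shown.

```python
import pandas as pd

# A DataFrame built literally from a dict of Series that share an index.
columns = {
    "price": pd.Series([10.5, 11.0, 9.75], index=["a", "b", "c"]),
    "volume": pd.Series([100, 250, 175], index=["a", "b", "c"]),
    "ticker": pd.Series(["AAA", "BBB", "CCC"], index=["a", "b", "c"]),
}
df = pd.DataFrame(columns)

# Selecting a column hands back a Series, much like a dict lookup.
price = df["price"]

# Each column keeps its own dtype; columns sharing a dtype are the
# candidates for being stored together in one homogeneous block.
print(df.dtypes)
```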
Optimization 
- Push slow code paths into Cython or directly 
into C 
- Try to be smart about minimizing cache 
misses and not creating unnecessary copies 
- Careful with NAs
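One way "careful with NAs" shows up, as a small sketch with made-up data: introducing a missing value into an integer Series changes its dtype, and NaN-aware reductions behave differently from plain NumPy ones.

```python
import numpy as np
import pandas as pd

ints = pd.Series([1, 2, 3])

# Reindexing to a label that does not exist introduces a missing value.
# NaN is a float, so the integer data is upcast to float64.
with_na = ints.reindex([0, 1, 2, 3])
print(with_na.dtype)

# pandas reductions skip missing values by default...
total = with_na.sum()

# ...while a plain NumPy sum over the same values propagates NaN.
raw_total = np.sum(with_na.to_numpy())
print(total, raw_total)
```

The upcast also means a new array is allocated, which is the kind of unnecessary copy the slide warns about when it happens inside a hot path.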
Tracking Performance (vbench)
what to track? 
use vbench to track everything we care about 
(read: whatever users have complained is slow) 
unofficial vbench repos exist for numpy and scikit 
(look)
why 
Once users are using your API, they’ll notice 
performance changes: “it feels slower”. 
Then they timeit and have a legitimate grievance… 
We want to automate this process (before users get upset).
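A rough sketch of automating that check with the standard-library `timeit` module. This is not vbench's API; the helpers `best_time` and `is_regression`, the benchmark statement, and the 25% tolerance are all hypothetical choices for illustration.

```python
import timeit

# Setup code runs once per repeat and is excluded from the timing.
SETUP = """
import numpy as np
import pandas as pd
df = pd.DataFrame({
    "key": np.arange(10_000) % 100,
    "value": np.arange(10_000, dtype="float64"),
})
"""
STMT = "df.groupby('key')['value'].sum()"


def best_time(stmt, setup, repeat=3, number=5):
    """Return the best per-call time in seconds over several repeats."""
    timings = timeit.repeat(stmt, setup=setup, repeat=repeat, number=number)
    return min(timings) / number


def is_regression(current, baseline, tolerance=0.25):
    """Flag a timing that is more than `tolerance` slower than the baseline."""
    return current > baseline * (1.0 + tolerance)


current = best_time(STMT, SETUP)
print(f"groupby sum: {current * 1e3:.3f} ms per call")
```

In a real setup the baseline would come from timings stored for an earlier commit, so a slowdown is flagged before a user reports it.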
how 
(notebook)
Pandorable pandas 
(notebook)
The End

Pandas/Data Analysis at Baypiggies
