pandas: a Foundational Python library for Data Analysis
                    and Statistics

                                   Wes McKinney


                            PyHPC 2011, 18 November 2011




Wes McKinney (@wesmckinn)          Data analysis with pandas   PyHPC 2011   1 / 25
An alternate title




       High Performance Structured Data
            Manipulation in Python




My background



     Former quant hacker at AQR Capital, now entrepreneur
     Background: math, statistics, computer science, quant finance.
     Shaken, not stirred
     Active in scientific Python community
     My blog: http://blog.wesmckinney.com
     Twitter: @wesmckinn
     Book! “Python for Data Analysis”, to hit the shelves later next year
     from O’Reilly




Structured data



      cname              year   agefrom      ageto             ls     lsc    pop   ccode
0     Australia          1950   15           19                64.3   15.4   558   AUS
1     Australia          1950   20           24                48.4   26.4   645   AUS
2     Australia          1950   25           29                47.9   26.2   681   AUS
3     Australia          1950   30           34                44     23.8   614   AUS
4     Australia          1950   35           39                42.1   21.9   625   AUS
5     Australia          1950   40           44                38.9   20.1   555   AUS
6     Australia          1950   45           49                34     16.9   491   AUS
7     Australia          1950   50           54                29.6   14.6   439   AUS
8     Australia          1950   55           59                28     12.9   408   AUS
9     Australia          1950   60           64                26.3   12.1   356   AUS




Structured data



     A familiar data model
           Heterogeneous columns or hyperslabs
           Each column/hyperslab is homogeneously typed
           Relational databases (SQL, etc.) are just a special case
     Need good performance in row- and column-oriented operations
     Support for axis metadata
     Data alignment is critical
     Seamless integration with Python data structures and NumPy
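
A minimal sketch of this data model, using the first three rows of the table from the previous slide. It assumes a recent pandas release; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# First three rows of the table shown on the previous slide.
df = pd.DataFrame({
    "cname": ["Australia", "Australia", "Australia"],
    "year": [1950, 1950, 1950],
    "agefrom": [15, 20, 25],
    "ageto": [19, 24, 29],
    "ls": [64.3, 48.4, 47.9],
    "lsc": [15.4, 26.4, 26.2],
    "pop": [558, 645, 681],
    "ccode": ["AUS", "AUS", "AUS"],
})

# The table is heterogeneous, but each column has a single dtype.
print(df.dtypes)

# A numeric column is backed by a NumPy array and can be handed to NumPy code.
ls_values = df["ls"].to_numpy()
print(type(ls_values), ls_values.mean())
```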




Structured data challenges



     Table modification: column insertion/deletion
     Axis indexing and data alignment
     Aggregation and transformation by group (“group by”)
     Missing data handling
     Pivoting and reshaping
     Merging and joining
     Time series-specific manipulations
     Fast IO: flat files, databases, HDF5, ...




Not all fun and games




     We care nearly equally about
           Performance
           Ease-of-use (syntax / API fits your mental model)
           Expressiveness
     Clean, consistent API design is hard and underappreciated




The big picture



     Build a foundation for data analysis and statistical computing
     Craft the most expressive / flexible in-memory data manipulation tool
     in any language
           Preferably also one of the fastest, too
     Vastly simplify the data preparation, munging, and integration process
     Comfortable abstractions: master data-fu without needing to be a
     computer scientist
     Later: extend API with distributed computing backend for
     larger-than-memory datasets




pandas: a brief history




     Started building in April 2008 back at AQR
     Open-sourced (BSD license) mid-2009
     29075 lines of Python/Cython code as of yesterday, and growing fast
     Heavily tested, being used by many companies (inc. lots of financial
     firms) in production




Cython: getting good performance



     My tool of choice for writing performant code
     High level access to NumPy C API internals
     Buffer syntax/protocol abstracts away striding details of
     non-contiguous arrays, very low overhead vs. working with raw C
     pointers
     Reduce/remove interpreter overhead associated with working with
     Python data structures
     Interface directly with C/C++ code when necessary




Axis indexing




     Key pandas feature
     The axis index is a data structure itself, which can be customized to
     support things like:
           1-1 O(1) indexing with hashable Python objects
           Datetime indexing for time series data
           Hierarchical (multi-level) indexing
     Use Python dict to support O(1) lookups and O(n) realignment ops.
     Can specialize to get better performance and memory usage
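
The three kinds of index listed above can be sketched as follows. This uses the present-day pandas API, which may differ from the 2011 version; all values are toy data made up for illustration.

```python
import pandas as pd

# Label-based lookup through an index of hashable Python objects.
s = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
position = s.index.get_loc("b")   # label -> integer position
value = s.loc["b"]

# Datetime indexing for time series data.
ts = pd.Series([10.0, 11.0, 12.0],
               index=pd.date_range("2011-11-16", periods=3, freq="D"))
on_the_18th = ts.loc[pd.Timestamp("2011-11-18")]

# Hierarchical (multi-level) indexing; the values here are made up.
mi = pd.MultiIndex.from_tuples(
    [("AUS", 1950), ("AUS", 1955), ("NZL", 1950)], names=["ccode", "year"])
h = pd.Series([1.0, 2.0, 3.0], index=mi)
aus = h.loc["AUS"]                # all rows whose first-level key is "AUS"
```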




Axis indexing



     Every axis has an index
     Automatic alignment between differently-indexed objects: makes it
     nearly impossible to accidentally combine misaligned data
     Hierarchical indexing provides an intuitive way of structuring and
     working with higher-dimensional data
     Natural way of expressing “group by” and join-type operations
     As good as, or in many cases much more integrated/flexible than,
     commercial or open-source alternatives to pandas/Python
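
A small illustration of automatic alignment, assuming a recent pandas release. Arithmetic between two differently-indexed Series matches values by label, not by position, and labels present on only one side produce missing values.

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0, 30.0], index=["y", "z", "w"])

# Values are matched on their labels; "x" and "w" exist on one side only,
# so the result holds NaN for them instead of silently mixing positions.
total = a + b
print(total)
```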




The trouble with Python dicts...




     Python dict memory footprint can be quite large
           1MM key-value pairs: something like 70 MB on a 64-bit system
           Even though sizeof(PyObject*) == 8
     Python dict is great, but should use a faster, threadsafe hash table for
     primitive C types (like 64-bit integer)
     BUT: using a hash table only necessary in the general case. With
     monotonic indexes you don’t need one for realignment ops
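
Why a monotonic index needs no hash table can be sketched with a single merge pass over two sorted, unique integer arrays. The function name `sorted_outer_join` is hypothetical, and this pure-Python loop only illustrates the logic; it is not the pandas implementation, which would be written in Cython for speed.

```python
import numpy as np

def sorted_outer_join(left, right):
    """Outer-join two sorted, unique integer arrays in one merge pass.

    Returns the union of labels plus, for each input, the position of every
    union label in that input (-1 where the label is absent).
    """
    i = j = 0
    union, lidx, ridx = [], [], []
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            union.append(left[i]); lidx.append(i); ridx.append(j)
            i += 1; j += 1
        elif left[i] < right[j]:
            union.append(left[i]); lidx.append(i); ridx.append(-1)
            i += 1
        else:
            union.append(right[j]); lidx.append(-1); ridx.append(j)
            j += 1
    # Drain whichever input still has labels left.
    while i < len(left):
        union.append(left[i]); lidx.append(i); ridx.append(-1)
        i += 1
    while j < len(right):
        union.append(right[j]); lidx.append(-1); ridx.append(j)
        j += 1
    return (np.array(union, dtype=np.int64),
            np.array(lidx, dtype=np.int64),
            np.array(ridx, dtype=np.int64))

left = np.array([1, 3, 5, 7], dtype=np.int64)
right = np.array([3, 4, 7, 9], dtype=np.int64)
union, lidx, ridx = sorted_outer_join(left, right)
```

The two indexer arrays are what a realignment step would use to pull values from each side into the joined result.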




Some alignment numbers



     Hardware: Macbook Pro Core i7 laptop, Python 2.7.2
     Outer-join 500k-length indexes chosen from 1MM elements
           Dict-based with random strings: 2.2 seconds
           Sorted strings: 400ms (5.5x faster)
           Sorted int64: 19ms (115x faster)
     Fortunately, time series data falls into this last category
     Alignment ops with C primitives could be fairly easily parallelized with
     OpenMP in Cython




DataFrame, the pandas workhorse




     A 2D tabular data structure with row and column indexes
     Hierarchical indexing is one way to support higher-dimensional data in a
     lower-dimensional structure
     Simplified NumPy type system: float, int, boolean, object
     Rich indexing operations, SQL-like join/merges, etc.
     Support heterogeneous columns WITHOUT sacrificing performance in
     the homogeneous (e.g. floating point only) case
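
A short sketch of a SQL-like join and a label-based selection on a DataFrame, assuming a recent pandas release. The New Zealand row is made up for illustration and does not come from the slides.

```python
import pandas as pd

# Toy tables; the "NZL" population value is invented for this example.
people = pd.DataFrame({"ccode": ["AUS", "AUS", "NZL"],
                       "pop": [558, 645, 100]})
names = pd.DataFrame({"ccode": ["AUS", "NZL"],
                      "cname": ["Australia", "New Zealand"]})

# Roughly: SELECT * FROM people LEFT JOIN names USING (ccode)
merged = people.merge(names, on="ccode", how="left")

# Boolean row selection combined with column selection by label.
big = merged.loc[merged["pop"] > 600, "cname"].tolist()
```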




DataFrame, under the hood




Supporting size mutability



     In order to have good row-oriented performance, need to store
     like-typed columns in a single ndarray
     “Column” insertion: accumulate 1 × N × . . . homogeneous columns,
     later consolidate with other like-typed into a single block
     I.e. avoid reallocate-copy or array concatenation steps as long as
     possible
     Column deletions can be no-copy events (since ndarrays support
     views)
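
The insert-then-consolidate idea can be sketched with a toy column store built on NumPy. `ToyFrame` and its methods are hypothetical names for this illustration; this is not the pandas internals, only a simplified model of the strategy described above.

```python
import numpy as np

class ToyFrame:
    """Toy column store: new columns sit in their own 1 x N blocks until
    consolidate() stacks like-typed blocks into a single 2D ndarray."""

    def __init__(self):
        self.blocks = []   # list of (column_names, 2D array of shape (k, N))

    def insert(self, name, values):
        # Cheap: add a new 1 x N block; no existing block is touched.
        arr = np.asarray(values)
        self.blocks.append(([name], arr.reshape(1, -1)))

    def consolidate(self):
        # Group blocks by dtype and stack each group exactly once.
        by_dtype = {}
        for names, arr in self.blocks:
            by_dtype.setdefault(arr.dtype, []).append((names, arr))
        merged = []
        for group in by_dtype.values():
            names = [n for ns, _ in group for n in ns]
            merged.append((names, np.vstack([a for _, a in group])))
        self.blocks = merged

    def column(self, name):
        for names, arr in self.blocks:
            if name in names:
                return arr[names.index(name)]   # a view into the block
        raise KeyError(name)

frame = ToyFrame()
frame.insert("ls", [64.3, 48.4])
frame.insert("pop", [558, 645])
frame.insert("lsc", [15.4, 26.4])
n_before = len(frame.blocks)   # one block per inserted column
frame.consolidate()
n_after = len(frame.blocks)    # one float block, one integer block
lsc = frame.column("lsc")
```

Deferring the `np.vstack` call until `consolidate()` is what avoids a reallocate-and-copy on every insertion.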




Hierarchical indexing




     New this year, but really should have done long ago
     Natural result of multi-key groupby
     An intuitive way to work with higher-dimensional data
     Much less ad hoc way of expressing reshaping operations
     Once you have it, things like Excel-style pivot tables just “fall out”




Reshaping




Reshaping

In [5]: df.unstack('agefrom').stack('year')
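
The slide's `df` is not shown, so here is a self-contained sketch of the same call chain under an assumed layout: rows indexed by (`ccode`, `agefrom`) and a column level named `year`. The 1955 values are made up.

```python
import pandas as pd

# Assumed layout: (ccode, agefrom) on the rows, year on the columns.
index = pd.MultiIndex.from_product([["AUS"], [15, 20]],
                                   names=["ccode", "agefrom"])
columns = pd.Index([1950, 1955], name="year")
df = pd.DataFrame([[558, 600], [645, 650]], index=index, columns=columns)

wide = df.unstack("agefrom")     # agefrom moves from the rows to the columns
reshaped = wide.stack("year")    # year moves from the columns to the rows
print(reshaped)
```

After the two steps the rows are keyed by (`ccode`, `year`) and the columns by `agefrom`.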




Reshaping implementation nuances




     Must deal with unbalanced group sizes / missing data
     Play vectorization tricks with the NumPy C-contiguous memory
     layout: no Python for loops allowed
     Care must be taken to handle heterogeneous and homogeneous data
     cases
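
One possible vectorization trick of this kind, sketched in plain NumPy: compute each value's flat position in a C-contiguous output and scatter all values in one step. This illustrates the idea only and is not taken from the pandas source.

```python
import numpy as np

# Long-format input: one (row label, column label, value) triple per
# observation. The ("b", "y") combination is missing on purpose.
row_labels = np.array(["a", "a", "b", "c", "c"])
col_labels = np.array(["x", "y", "x", "x", "y"])
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

# Turn labels into integer codes (np.unique returns sorted unique labels).
rows, row_codes = np.unique(row_labels, return_inverse=True)
cols, col_codes = np.unique(col_labels, return_inverse=True)

# Allocate a NaN-filled output so unbalanced groups show up as missing data.
wide = np.full((len(rows), len(cols)), np.nan)

# In a C-contiguous 2D array, element (i, j) lives at flat offset
# i * ncols + j, so all values can be scattered without a Python loop.
flat_pos = row_codes * len(cols) + col_codes
wide.ravel()[flat_pos] = values   # ravel() is a view for contiguous arrays
```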




GroupBy




     High level process
           split data set into groups
           apply function to each group (an aggregation or a transformation)
           combine results intelligently into a result data structure
     Can be used to emulate SQL GROUP BY operations
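
The three steps can be spelled out by hand and compared with the built-in machinery. This is a sketch assuming a recent pandas release, with toy data.

```python
import pandas as pd

df = pd.DataFrame({"key": ["a", "b", "a", "b", "a"],
                   "value": [1.0, 2.0, 3.0, 4.0, 8.0]})

# Split: one piece of the data set per group key.
pieces = {key: group["value"] for key, group in df.groupby("key")}
# Apply: run a function on every piece (here an aggregation).
applied = {key: piece.mean() for key, piece in pieces.items()}
# Combine: assemble the per-group results into one structure.
manual = pd.Series(applied)

# The same aggregation in one call,
# roughly SELECT key, AVG(value) ... GROUP BY key.
agg = df.groupby("key")["value"].mean()

# A transformation returns a result shaped like the input.
demeaned = df["value"] - df.groupby("key")["value"].transform("mean")
```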




GroupBy



     Grouping closely related to indexing
     Create correspondence between axis labels and group labels using one
     of:
           Array of group labels (like a DataFrame column)
           Python function to be applied to each axis tick
     Can group by multiple keys
     For a hierarchically indexed axis, can select a level and group by that
     (or some transformation thereof)
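
The ways of specifying groups listed above, sketched with the present-day pandas API on toy data:

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0],
              index=["apple", "avocado", "banana", "blueberry"])

# Array of group labels, matched positionally with the axis.
labels = np.array(["x", "y", "x", "y"])
by_array = s.groupby(labels).sum()

# Python function applied to each axis label (here: the first letter).
by_func = s.groupby(lambda label: label[0]).sum()

# Multiple keys produce a hierarchically indexed result.
first_letter = np.array([label[0] for label in s.index])
by_both = s.groupby([labels, first_letter]).sum()

# For a hierarchically indexed axis, group by one of its levels.
mi = pd.MultiIndex.from_tuples(
    [("AUS", 1950), ("AUS", 1955), ("NZL", 1950)], names=["ccode", "year"])
h = pd.Series([1.0, 2.0, 3.0], index=mi)
by_level = h.groupby(level="ccode").sum()
```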




GroupBy implementation challenges


     Computing the group labels from arbitrary Python objects is very
     expensive
           77ms for 1MM strings with 1K groups
           107ms for 1MM strings with 10K groups
           350ms for 1MM strings with 100K groups
     To sort or not to sort (for iteration)?
           Once you have the labels, can reorder the data set in O(n) (with a
           much smaller constant than computing the labels)
           Roughly 35ms to reorder 1MM float64 data points given the labels
     (By contrast, computing the mean of 1MM elements takes 1.4ms)
     Python function call overhead is significant in cases with lots of small
     groups; much better (orders of magnitude speedup) to write
     specialized Cython routines
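
The stages discussed above can be sketched with public pandas and NumPy calls. Two substitutions are made here for brevity: `np.argsort` (which is O(n log n)) stands in for the O(n) reordering the slide describes, and `np.bincount` stands in for a specialized Cython aggregation routine.

```python
import numpy as np
import pandas as pd

keys = np.array(["b", "a", "c", "a", "b", "a"], dtype=object)
data = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Step 1 (the expensive one): map arbitrary Python objects to integer
# group labels. Labels are assigned in order of first appearance.
labels, uniques = pd.factorize(keys)

# Step 2: given the labels, reorder the data so each group is contiguous.
# A stable sort keeps the original order within each group.
order = np.argsort(labels, kind="stable")
sorted_data = data[order]

# Step 3: aggregate without calling a Python function once per group.
sums = np.bincount(labels, weights=data, minlength=len(uniques))
counts = np.bincount(labels, minlength=len(uniques))
means = sums / counts
```

With many small groups, keeping step 3 out of the Python interpreter is where the large speedups come from.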


Demo, time permitting





More Related Content

What's hot

Python Seaborn Data Visualization
Python Seaborn Data Visualization Python Seaborn Data Visualization
Python Seaborn Data Visualization
Sourabh Sahu
 
Pandas
PandasPandas
Pandas
maikroeder
 
Data Analysis with Python Pandas
Data Analysis with Python PandasData Analysis with Python Pandas
Data Analysis with Python Pandas
Neeru Mittal
 
Data Analysis in Python
Data Analysis in PythonData Analysis in Python
Data Analysis in Python
Richard Herrell
 
Introduction to NumPy
Introduction to NumPyIntroduction to NumPy
Introduction to NumPy
Huy Nguyen
 
Data Structures for Statistical Computing in Python
Data Structures for Statistical Computing in PythonData Structures for Statistical Computing in Python
Data Structures for Statistical Computing in PythonWes McKinney
 
Python Pandas
Python PandasPython Pandas
Python Pandas
Sunil OS
 
Numpy
NumpyNumpy
MatplotLib.pptx
MatplotLib.pptxMatplotLib.pptx
MatplotLib.pptx
Paras Intotech
 
Data Analysis and Visualization using Python
Data Analysis and Visualization using PythonData Analysis and Visualization using Python
Data Analysis and Visualization using Python
Chariza Pladin
 
Python pandas tutorial
Python pandas tutorialPython pandas tutorial
Python pandas tutorial
HarikaReddy115
 
Data Analysis in Python-NumPy
Data Analysis in Python-NumPyData Analysis in Python-NumPy
Data Analysis in Python-NumPy
Devashish Kumar
 
Data visualization in Python
Data visualization in PythonData visualization in Python
Data visualization in Python
Marc Garcia
 
Data Visualization(s) Using Python
Data Visualization(s) Using PythonData Visualization(s) Using Python
Data Visualization(s) Using Python
Aniket Maithani
 
Python Class | Python Programming | Python Tutorial | Edureka
Python Class | Python Programming | Python Tutorial | EdurekaPython Class | Python Programming | Python Tutorial | Edureka
Python Class | Python Programming | Python Tutorial | Edureka
Edureka!
 
Data Visualization in Python
Data Visualization in PythonData Visualization in Python
Data Visualization in Python
Jagriti Goswami
 
Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...
Simplilearn
 
Python Interview Questions And Answers 2019 | Edureka
Python Interview Questions And Answers 2019 | EdurekaPython Interview Questions And Answers 2019 | Edureka
Python Interview Questions And Answers 2019 | Edureka
Edureka!
 
Numpy tutorial
Numpy tutorialNumpy tutorial
Numpy tutorial
HarikaReddy115
 
Introduction to python
Introduction to pythonIntroduction to python
Introduction to python
Agung Wahyudi
 

What's hot (20)

Python Seaborn Data Visualization
Python Seaborn Data Visualization Python Seaborn Data Visualization
Python Seaborn Data Visualization
 
Pandas
PandasPandas
Pandas
 
Data Analysis with Python Pandas
Data Analysis with Python PandasData Analysis with Python Pandas
Data Analysis with Python Pandas
 
Data Analysis in Python
Data Analysis in PythonData Analysis in Python
Data Analysis in Python
 
Introduction to NumPy
Introduction to NumPyIntroduction to NumPy
Introduction to NumPy
 
Data Structures for Statistical Computing in Python
Data Structures for Statistical Computing in PythonData Structures for Statistical Computing in Python
Data Structures for Statistical Computing in Python
 
Python Pandas
Python PandasPython Pandas
Python Pandas
 
Numpy
NumpyNumpy
Numpy
 
MatplotLib.pptx
MatplotLib.pptxMatplotLib.pptx
MatplotLib.pptx
 
Data Analysis and Visualization using Python
Data Analysis and Visualization using PythonData Analysis and Visualization using Python
Data Analysis and Visualization using Python
 
Python pandas tutorial
Python pandas tutorialPython pandas tutorial
Python pandas tutorial
 
Data Analysis in Python-NumPy
Data Analysis in Python-NumPyData Analysis in Python-NumPy
Data Analysis in Python-NumPy
 
Data visualization in Python
Data visualization in PythonData visualization in Python
Data visualization in Python
 
Data Visualization(s) Using Python
Data Visualization(s) Using PythonData Visualization(s) Using Python
Data Visualization(s) Using Python
 
Python Class | Python Programming | Python Tutorial | Edureka
Python Class | Python Programming | Python Tutorial | EdurekaPython Class | Python Programming | Python Tutorial | Edureka
Python Class | Python Programming | Python Tutorial | Edureka
 
Data Visualization in Python
Data Visualization in PythonData Visualization in Python
Data Visualization in Python
 
Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...
 
Python Interview Questions And Answers 2019 | Edureka
Python Interview Questions And Answers 2019 | EdurekaPython Interview Questions And Answers 2019 | Edureka
Python Interview Questions And Answers 2019 | Edureka
 
Numpy tutorial
Numpy tutorialNumpy tutorial
Numpy tutorial
 
Introduction to python
Introduction to pythonIntroduction to python
Introduction to python
 

Similar to pandas: a Foundational Python Library for Data Analysis and Statistics

Structured Data Challenges in Finance and Statistics
Structured Data Challenges in Finance and StatisticsStructured Data Challenges in Finance and Statistics
Structured Data Challenges in Finance and Statistics
Wes McKinney
 
Slides 111017220255-phpapp01
Slides 111017220255-phpapp01Slides 111017220255-phpapp01
Slides 111017220255-phpapp01Ken Mwai
 
Python for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasPython for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasWes McKinney
 
From flat files to deconstructed database
From flat files to deconstructed databaseFrom flat files to deconstructed database
From flat files to deconstructed database
Julien Le Dem
 
Polyglot metadata for Hadoop
Polyglot metadata for HadoopPolyglot metadata for Hadoop
Polyglot metadata for Hadoop
Jim Dowling
 
President Election of Korea in 2017
President Election of Korea in 2017President Election of Korea in 2017
President Election of Korea in 2017
Jongwook Woo
 
Bridging Batch and Real-time Systems for Anomaly Detection
Bridging Batch and Real-time Systems for Anomaly DetectionBridging Batch and Real-time Systems for Anomaly Detection
Bridging Batch and Real-time Systems for Anomaly Detection
DataWorks Summit
 
Strata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed databaseStrata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed database
Julien Le Dem
 
Data Wrangling and Visualization Using Python
Data Wrangling and Visualization Using PythonData Wrangling and Visualization Using Python
Data Wrangling and Visualization Using Python
MOHITKUMAR1379
 
xldb2012_wed_0950_TimFrazier
xldb2012_wed_0950_TimFrazierxldb2012_wed_0950_TimFrazier
xldb2012_wed_0950_TimFrazierTim Frazier
 
Datasets and tools_from_ncbi_and_elsewhere_for_microbiome_research_v_62817
Datasets and tools_from_ncbi_and_elsewhere_for_microbiome_research_v_62817Datasets and tools_from_ncbi_and_elsewhere_for_microbiome_research_v_62817
Datasets and tools_from_ncbi_and_elsewhere_for_microbiome_research_v_62817
Ben Busby
 
No sql databases
No sql databasesNo sql databases
No sql databases
Ashish Kumar Thakur
 
Big data & hadoop framework
Big data & hadoop frameworkBig data & hadoop framework
Big data & hadoop frameworkTu Pham
 
PDS Unit - 1 Introdiction to DS.ppt
PDS Unit - 1 Introdiction to DS.pptPDS Unit - 1 Introdiction to DS.ppt
PDS Unit - 1 Introdiction to DS.ppt
ssuser52a19e
 
Modernizing Your Data Warehouse using APS
Modernizing Your Data Warehouse using APSModernizing Your Data Warehouse using APS
Modernizing Your Data Warehouse using APS
Stéphane Fréchette
 
Pentaho Data Integration Introduction
Pentaho Data Integration IntroductionPentaho Data Integration Introduction
Pentaho Data Integration Introduction
mattcasters
 
Minimizing the Complexities of Machine Learning with Data Virtualization
Minimizing the Complexities of Machine Learning with Data VirtualizationMinimizing the Complexities of Machine Learning with Data Virtualization
Minimizing the Complexities of Machine Learning with Data Virtualization
Denodo
 
Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIs
Takuya UESHIN
 
Distributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark MeetupDistributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark Meetup
Vijay Srinivas Agneeswaran, Ph.D
 
Dc python meetup
Dc python meetupDc python meetup
Dc python meetup
Jeffrey Clark
 

Similar to pandas: a Foundational Python Library for Data Analysis and Statistics (20)

Structured Data Challenges in Finance and Statistics
Structured Data Challenges in Finance and StatisticsStructured Data Challenges in Finance and Statistics
Structured Data Challenges in Finance and Statistics
 
Slides 111017220255-phpapp01
Slides 111017220255-phpapp01Slides 111017220255-phpapp01
Slides 111017220255-phpapp01
 
Python for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandasPython for Financial Data Analysis with pandas
Python for Financial Data Analysis with pandas
 
From flat files to deconstructed database
From flat files to deconstructed databaseFrom flat files to deconstructed database
From flat files to deconstructed database
 
Polyglot metadata for Hadoop
Polyglot metadata for HadoopPolyglot metadata for Hadoop
Polyglot metadata for Hadoop
 
President Election of Korea in 2017
President Election of Korea in 2017President Election of Korea in 2017
President Election of Korea in 2017
 
Bridging Batch and Real-time Systems for Anomaly Detection
Bridging Batch and Real-time Systems for Anomaly DetectionBridging Batch and Real-time Systems for Anomaly Detection
Bridging Batch and Real-time Systems for Anomaly Detection
 
Strata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed databaseStrata NY 2018: The deconstructed database
Strata NY 2018: The deconstructed database
 
Data Wrangling and Visualization Using Python
Data Wrangling and Visualization Using PythonData Wrangling and Visualization Using Python
Data Wrangling and Visualization Using Python
 
xldb2012_wed_0950_TimFrazier
xldb2012_wed_0950_TimFrazierxldb2012_wed_0950_TimFrazier
xldb2012_wed_0950_TimFrazier
 
Datasets and tools_from_ncbi_and_elsewhere_for_microbiome_research_v_62817
Datasets and tools_from_ncbi_and_elsewhere_for_microbiome_research_v_62817Datasets and tools_from_ncbi_and_elsewhere_for_microbiome_research_v_62817
Datasets and tools_from_ncbi_and_elsewhere_for_microbiome_research_v_62817
 
No sql databases
No sql databasesNo sql databases
No sql databases
 
Big data & hadoop framework
Big data & hadoop frameworkBig data & hadoop framework
Big data & hadoop framework
 
PDS Unit - 1 Introdiction to DS.ppt
PDS Unit - 1 Introdiction to DS.pptPDS Unit - 1 Introdiction to DS.ppt
PDS Unit - 1 Introdiction to DS.ppt
 
Modernizing Your Data Warehouse using APS
Modernizing Your Data Warehouse using APSModernizing Your Data Warehouse using APS
Modernizing Your Data Warehouse using APS
 
Pentaho Data Integration Introduction
Pentaho Data Integration IntroductionPentaho Data Integration Introduction
Pentaho Data Integration Introduction
 
Minimizing the Complexities of Machine Learning with Data Virtualization
Minimizing the Complexities of Machine Learning with Data VirtualizationMinimizing the Complexities of Machine Learning with Data Virtualization
Minimizing the Complexities of Machine Learning with Data Virtualization
 
Koalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIsKoalas: Unifying Spark and pandas APIs
Koalas: Unifying Spark and pandas APIs
 
Distributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark MeetupDistributed Deep Learning + others for Spark Meetup
Distributed Deep Learning + others for Spark Meetup
 
Dc python meetup
Dc python meetupDc python meetup
Dc python meetup
 

More from Wes McKinney

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
Wes McKinney
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
Wes McKinney
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Wes McKinney
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
Wes McKinney
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
Wes McKinney
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
Wes McKinney
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
Wes McKinney
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
Wes McKinney
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
Wes McKinney
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics Stack
Wes McKinney
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Wes McKinney
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science Stack
Wes McKinney
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
Wes McKinney
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
Wes McKinney
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
Wes McKinney
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
Wes McKinney
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
Wes McKinney
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data Science
Wes McKinney
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)
Wes McKinney
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine Learning
Wes McKinney
 

More from Wes McKinney (20)

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
pandas: a Foundational Python Library for Data Analysis and Statistics

  • 1. pandas: a Foundational Python library for Data Analysis and Statistics Wes McKinney PyHPC 2011, 18 November 2011 Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 1 / 25
  • 2. An alternate title
      High Performance Structured Data Manipulation in Python
  • 3. My background
      Former quant hacker at AQR Capital, now entrepreneur
      Background: math, statistics, computer science, quant finance. Shaken, not stirred
      Active in scientific Python community
      My blog: http://blog.wesmckinney.com
      Twitter: @wesmckinn
      Book! “Python for Data Analysis”, to hit the shelves later next year from O’Reilly
  • 4. Structured data
           cname      year  agefrom  ageto  ls    lsc   pop  ccode
      0    Australia  1950  15       19     64.3  15.4  558  AUS
      1    Australia  1950  20       24     48.4  26.4  645  AUS
      2    Australia  1950  25       29     47.9  26.2  681  AUS
      3    Australia  1950  30       34     44    23.8  614  AUS
      4    Australia  1950  35       39     42.1  21.9  625  AUS
      5    Australia  1950  40       44     38.9  20.1  555  AUS
      6    Australia  1950  45       49     34    16.9  491  AUS
      7    Australia  1950  50       54     29.6  14.6  439  AUS
      8    Australia  1950  55       59     28    12.9  408  AUS
      9    Australia  1950  60       64     26.3  12.1  356  AUS
  • 5. Structured data
      A familiar data model
        Heterogeneous columns or hyperslabs
        Each column/hyperslab is homogeneously typed
        Relational databases (SQL, etc.) are just a special case
      Need good performance in row- and column-oriented operations
      Support for axis metadata
        Data alignment is critical
      Seamless integration with Python data structures and NumPy
  • 6. Structured data challenges
      Table modification: column insertion/deletion
      Axis indexing and data alignment
      Aggregation and transformation by group (“group by”)
      Missing data handling
      Pivoting and reshaping
      Merging and joining
      Time series-specific manipulations
      Fast IO: flat files, databases, HDF5, ...
  • 7. Not all fun and games
      We care nearly equally about
        Performance
        Ease-of-use (syntax / API fits your mental model)
        Expressiveness
      Clean, consistent API design is hard and underappreciated
  • 8. The big picture
      Build a foundation for data analysis and statistical computing
      Craft the most expressive / flexible in-memory data manipulation tool in any language
        Preferably one of the fastest, too
      Vastly simplify the data preparation, munging, and integration process
      Comfortable abstractions: master data-fu without needing to be a computer scientist
      Later: extend API with distributed computing backend for larger-than-memory datasets
  • 9. pandas: a brief history
      Started building in April 2008 back at AQR
      Open-sourced (BSD license) mid-2009
      29075 lines of Python/Cython code as of yesterday, and growing fast
      Heavily tested, being used by many companies (incl. lots of financial firms) in production
  • 10. Cython: getting good performance
      My tool of choice for writing performant code
      High level access to NumPy C API internals
      Buffer syntax/protocol abstracts away striding details of non-contiguous arrays, very low overhead vs. working with raw C pointers
      Reduce/remove interpreter overhead associated with working with Python data structures
      Interface directly with C/C++ code when necessary
  • 11. Axis indexing
      Key pandas feature
      The axis index is a data structure itself, which can be customized to support things like:
        1-1 O(1) indexing with hashable Python objects
        Datetime indexing for time series data
        Hierarchical (multi-level) indexing
      Use Python dict to support O(1) lookups and O(n) realignment ops. Can specialize to get better performance and memory usage
  • 12. Axis indexing
      Every axis has an index
      Automatic alignment between differently-indexed objects: makes it nearly impossible to accidentally combine misaligned data
      Hierarchical indexing provides an intuitive way of structuring and working with higher-dimensional data
      Natural way of expressing “group by” and join-type operations
      As good as, or in many cases much more integrated/flexible than, commercial or open-source alternatives to pandas/Python
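A minimal sketch of the automatic alignment described on this slide, assuming a current pandas release rather than the 2011 version shown in the talk. The labels and values are made up for illustration.

```python
import pandas as pd

# Two Series whose indexes overlap only partly and are ordered differently.
# All labels and values here are invented for the example.
a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0, 30.0], index=["z", "y", "w"])

# Arithmetic matches values by label, not by position. The result is indexed
# by the union of both indexes; labels present on only one side become NaN.
total = a + b
print(total)
```

Because "z" is paired with "z" regardless of where it sits in each Series, a positional mix-up cannot silently produce a wrong sum; the mismatch shows up as missing data instead.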
  • 13. The trouble with Python dicts...
      Python dict memory footprint can be quite large
        1MM key-value pairs: something like 70 MB on a 64-bit system
        Even though sizeof(PyObject*) == 8
      Python dict is great, but should use a faster, threadsafe hash table for primitive C types (like 64-bit integer)
      BUT: using a hash table is only necessary in the general case. With monotonic indexes you don’t need one for realignment ops
  • 14. Some alignment numbers
      Hardware: MacBook Pro Core i7 laptop, Python 2.7.2
      Outer-join 500k-length indexes chosen from 1MM elements
        Dict-based with random strings: 2.2 seconds
        Sorted strings: 400ms (5.5x faster)
        Sorted int64: 19ms (115x faster)
      Fortunately, time series data falls into this last category
      Alignment ops with C primitives could be fairly easily parallelized with OpenMP in Cython
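The gap between the dict-based and sorted cases comes from the point made on the previous slide: with monotonic indexes, an outer join needs no hash table. The following is a rough pure-Python sketch of that idea, not the pandas implementation; the function name `outer_join_sorted` and the use of -1 as a "missing" marker are assumptions made for this example.

```python
def outer_join_sorted(left, right):
    """Outer-join two sorted sequences of unique labels with a two-pointer merge.

    Returns (joined, left_pos, right_pos). A position of -1 means the label
    is absent from that side. Hypothetical helper, for illustration only.
    """
    joined, left_pos, right_pos = [], [], []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            # Label present on both sides.
            joined.append(left[i]); left_pos.append(i); right_pos.append(j)
            i += 1
            j += 1
        elif left[i] < right[j]:
            # Label only on the left.
            joined.append(left[i]); left_pos.append(i); right_pos.append(-1)
            i += 1
        else:
            # Label only on the right.
            joined.append(right[j]); left_pos.append(-1); right_pos.append(j)
            j += 1
    # Drain whichever side still has labels remaining.
    while i < len(left):
        joined.append(left[i]); left_pos.append(i); right_pos.append(-1)
        i += 1
    while j < len(right):
        joined.append(right[j]); left_pos.append(-1); right_pos.append(j)
        j += 1
    return joined, left_pos, right_pos


joined, left_pos, right_pos = outer_join_sorted([1, 3, 5, 7], [3, 4, 7, 9])
print(joined)     # [1, 3, 4, 5, 7, 9]
print(left_pos)   # [0, 1, -1, 2, 3, -1]
print(right_pos)  # [-1, 0, 1, -1, 2, 3]
```

Each label is visited once, so the merge is linear in the combined length and touches memory sequentially, which is the kind of loop that translates well to typed Cython code over int64 arrays.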
  • 15. DataFrame, the pandas workhorse
      A 2D tabular data structure with row and column indexes
      Hierarchical indexing one way to support higher-dimensional data in a lower-dimensional structure
      Simplified NumPy type system: float, int, boolean, object
      Rich indexing operations, SQL-like join/merges, etc.
      Support heterogeneous columns WITHOUT sacrificing performance in the homogeneous (e.g. floating point only) case
  • 16. DataFrame, under the hood
  • 17. Supporting size mutability
      In order to have good row-oriented performance, need to store like-typed columns in a single ndarray
      “Column” insertion: accumulate 1 × N × ... homogeneous columns, later consolidate with other like-typed into a single block
      I.e. avoid reallocate-copy or array concatenation steps as long as possible
      Column deletions can be no-copy events (since ndarrays support views)
  • 18. Hierarchical indexing
      New this year, but really should have done long ago
      Natural result of multi-key groupby
      An intuitive way to work with higher-dimensional data
      Much less ad hoc way of expressing reshaping operations
      Once you have it, things like Excel-style pivot tables just “fall out”
  • 19. Reshaping
  • 20. Reshaping
      In [5]: df.unstack('agefrom').stack('year')
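The `df` used in this call is not shown in the transcript. For the call to work, `agefrom` has to be a level of the row index and `year` a level of the columns, so the sketch below builds a small frame with that layout. It assumes a current pandas release; only the two 1950 Australia values (64.3 and 48.4) come from the table on slide 4, and every other number, along with the second country, is invented.

```python
import pandas as pd

# Rows: (cname, agefrom); columns: year. Values are hypothetical except
# 64.3 and 48.4, which appear in the slide 4 table.
rows = pd.MultiIndex.from_product(
    [["Australia", "Austria"], [15, 20]], names=["cname", "agefrom"]
)
cols = pd.Index([1950, 1955], name="year")
df = pd.DataFrame(
    [[64.3, 60.0], [48.4, 45.0], [70.0, 66.0], [50.0, 47.0]],
    index=rows,
    columns=cols,
)

# unstack moves the 'agefrom' row level into the columns;
# stack then moves the 'year' column level into the rows.
reshaped = df.unstack("agefrom").stack("year")
print(reshaped)
```

The net effect is that `agefrom` and `year` trade places: the result is indexed by (cname, year) with one column per `agefrom` value.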
  • 21. Reshaping implementation nuances
      Must deal with unbalanced group sizes / missing data
      Play vectorization tricks with the NumPy C-contiguous memory layout: no Python for loops allowed
      Care must be taken to handle heterogeneous and homogeneous data cases
  • 22. GroupBy
      High level process
        split data set into groups
        apply function to each group (an aggregation or a transformation)
        combine results intelligently into a result data structure
      Can be used to emulate SQL GROUP BY operations
  • 23. GroupBy
      Grouping closely related to indexing
      Create correspondence between axis labels and group labels using one of:
        Array of group labels (like a DataFrame column)
        Python function to be applied to each axis tick
      Can group by multiple keys
      For a hierarchically indexed axis, can select a level and group by that (or some transformation thereof)
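The four ways of specifying groups listed on this slide can be sketched as follows, assuming a current pandas release. The column names follow the slide 4 table, but the rows and values are invented for illustration.

```python
import pandas as pd

# Hypothetical data: two countries, two years each.
df = pd.DataFrame(
    {
        "ccode": ["AUS", "AUS", "AUT", "AUT"],
        "year": [1950, 1965, 1950, 1965],
        "pop": [558, 645, 500, 520],
    }
)

# 1. Array of group labels: here, a DataFrame column.
by_column = df.groupby("ccode")["pop"].sum()

# 2. Python function applied to each axis label (maps a year to its decade).
pop_by_year = df.set_index("year")["pop"]
by_decade = pop_by_year.groupby(lambda year: year // 10 * 10).sum()

# 3. Multiple keys: the result carries a hierarchical (ccode, year) index.
by_both = df.groupby(["ccode", "year"])["pop"].sum()

# 4. Select one level of a hierarchically indexed axis and group by it.
by_level = by_both.groupby(level="ccode").mean()

print(by_column, by_decade, by_both, by_level, sep="\n\n")
```

Step 3 also shows the link to the earlier hierarchical indexing slide: a multi-key groupby naturally produces a multi-level index, which step 4 then groups on directly.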
  • 24. GroupBy implementation challenges
      Computing the group labels from arbitrary Python objects is very expensive
        77ms for 1MM strings with 1K groups
        107ms for 1MM strings with 10K groups
        350ms for 1MM strings with 100K groups
      To sort or not to sort (for iteration)?
        Once you have the labels, can reorder the data set in O(n) (with a much smaller constant than computing the labels)
        Roughly 35ms to reorder 1MM float64 data points given the labels
        (By contrast, computing the mean of 1MM elements takes 1.4ms)
      Python function call overhead is significant in cases with lots of small groups; much better (orders of magnitude speedup) to write specialized Cython routines
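The two phases on this slide, computing integer group labels and then reordering the data in O(n), can be illustrated with a counting sort. This is a sketch of the general technique, not the routine pandas uses internally; `pd.factorize` and the NumPy calls are real, while the sample keys and values are made up. The explicit Python loop is kept for readability, which is exactly the overhead the slide suggests moving into Cython.

```python
import numpy as np
import pandas as pd

# Hypothetical group keys (arbitrary Python objects) and data points.
keys = np.array(["b", "a", "c", "a", "b", "a"], dtype=object)
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Phase 1 (the expensive part): map each key to an integer group label.
# Labels are assigned in order of first appearance.
labels, uniques = pd.factorize(keys)

# Phase 2: counting sort on the labels, O(n) in the number of rows.
counts = np.bincount(labels, minlength=len(uniques))
starts = np.concatenate(([0], np.cumsum(counts)[:-1]))  # first slot of each group
order = np.empty(len(labels), dtype=np.intp)
next_slot = starts.copy()
for row, label in enumerate(labels):
    order[next_slot[label]] = row
    next_slot[label] += 1

# Rows of the same group are now contiguous, in their original relative order.
sorted_values = values[order]

# With contiguous groups, per-group aggregates reduce to segment operations.
group_means = np.add.reduceat(sorted_values, starts) / counts
print(dict(zip(uniques, group_means)))
```

Once the data is grouped contiguously, iterating over groups is a matter of slicing between consecutive `starts` offsets, with no per-row hashing left to do.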
  • 25. Demo, time permitting