pandas: a Foundational Python Library for Data Analysis and Statistics

Wes McKinney


                            PyHPC 2011, 18 November 2011




Wes McKinney (@wesmckinn)          Data analysis with pandas   PyHPC 2011   1 / 25
An alternate title




       High Performance Structured Data
            Manipulation in Python




My background



     Former quant hacker at AQR Capital, now entrepreneur
     Background: math, statistics, computer science, quant finance.
     Shaken, not stirred
     Active in scientific Python community
     My blog: http://blog.wesmckinney.com
     Twitter: @wesmckinn
     Book! “Python for Data Analysis”, to hit the shelves later next year
     from O’Reilly




Structured data



      cname              year   agefrom      ageto             ls     lsc    pop   ccode
0     Australia          1950   15           19                64.3   15.4   558   AUS
1     Australia          1950   20           24                48.4   26.4   645   AUS
2     Australia          1950   25           29                47.9   26.2   681   AUS
3     Australia          1950   30           34                44     23.8   614   AUS
4     Australia          1950   35           39                42.1   21.9   625   AUS
5     Australia          1950   40           44                38.9   20.1   555   AUS
6     Australia          1950   45           49                34     16.9   491   AUS
7     Australia          1950   50           54                29.6   14.6   439   AUS
8     Australia          1950   55           59                28     12.9   408   AUS
9     Australia          1950   60           64                26.3   12.1   356   AUS




Structured data



     A familiar data model
           Heterogeneous columns or hyperslabs
           Each column/hyperslab is homogeneously typed
           Relational databases (SQL, etc.) are just a special case
     Need good performance in row- and column-oriented operations
     Support for axis metadata
     Data alignment is critical
     Seamless integration with Python data structures and NumPy




Structured data challenges



     Table modification: column insertion/deletion
     Axis indexing and data alignment
     Aggregation and transformation by group (“group by”)
     Missing data handling
     Pivoting and reshaping
     Merging and joining
     Time series-specific manipulations
     Fast IO: flat files, databases, HDF5, ...




Not all fun and games




     We care nearly equally about
           Performance
           Ease-of-use (syntax / API fits your mental model)
           Expressiveness
     Clean, consistent API design is hard and underappreciated




The big picture



     Build a foundation for data analysis and statistical computing
     Craft the most expressive / flexible in-memory data manipulation tool
     in any language
           Preferably one of the fastest, too
     Vastly simplify the data preparation, munging, and integration process
     Comfortable abstractions: master data-fu without needing to be a
     computer scientist
     Later: extend API with distributed computing backend for
     larger-than-memory datasets




pandas: a brief history




     Started building in April 2008 back at AQR
     Open-sourced (BSD license) mid-2009
     29075 lines of Python/Cython code as of yesterday, and growing fast
     Heavily tested and used in production by many companies (including
     many financial firms)




Cython: getting good performance



     My tool of choice for writing performant code
     High level access to NumPy C API internals
     Buffer syntax/protocol abstracts away striding details of
     non-contiguous arrays, very low overhead vs. working with raw C
     pointers
     Reduce/remove interpreter overhead associated with working with
     Python data structures
     Interface directly with C/C++ code when necessary




Axis indexing




     Key pandas feature
     The axis index is a data structure itself, which can be customized to
     support things like:
           1-1 O(1) indexing with hashable Python objects
           Datetime indexing for time series data
           Hierarchical (multi-level) indexing
     Use Python dict to support O(1) lookups and O(n) realignment ops.
     Can specialize to get better performance and memory usage
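
A toy illustration of the dict-backed lookup and realignment described above, in plain Python. The class name ToyIndex and its methods are hypothetical and are not pandas internals; this is only a sketch of the idea.

```python
import math


class ToyIndex:
    """Hypothetical label index: maps each unique, hashable label to its position."""

    def __init__(self, labels):
        self.labels = list(labels)
        # A dict gives average O(1) label -> position lookups.
        self.positions = {label: i for i, label in enumerate(self.labels)}
        if len(self.positions) != len(self.labels):
            raise ValueError("labels must be unique")

    def get_loc(self, label):
        """Return the integer position of a label."""
        return self.positions[label]

    def reindex(self, values, new_labels):
        """Realign values to new_labels in one O(n) pass; missing labels become NaN."""
        out = []
        for label in new_labels:
            pos = self.positions.get(label)
            out.append(values[pos] if pos is not None else math.nan)
        return out


index = ToyIndex(["a", "b", "c"])
values = [1.0, 2.0, 3.0]
realigned = index.reindex(values, ["c", "a", "z"])
```

Labels absent from the original index ("z" here) come back as NaN, which is how missing data shows up after realignment.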




Axis indexing



     Every axis has an index
     Automatic alignment between differently-indexed objects: makes it
     nearly impossible to accidentally combine misaligned data
     Hierarchical indexing provides an intuitive way of structuring and
     working with higher-dimensional data
     Natural way of expressing “group by” and join-type operations
     As good as, and in many cases much more integrated/flexible than,
     commercial or open-source alternatives to pandas/Python
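
A minimal example of automatic alignment, using the current public pandas API (which may differ in detail from the 2011 release). The labels and values are made up for illustration.

```python
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["c", "b", "d"])

# Arithmetic matches values by label, not by position.
# Labels present on only one side ("a" and "d") produce NaN.
total = s1 + s2
```

Here "c" pairs 3.0 with 10.0 even though the two values sit at different positions in their Series.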




The trouble with Python dicts...




     Python dict memory footprint can be quite large
           1MM key-value pairs: something like 70 MB on a 64-bit system
           Even though sizeof(PyObject*) == 8
     Python dict is great, but we should use a faster, threadsafe hash table
     for primitive C types (like 64-bit integers)
     BUT: a hash table is only necessary in the general case. With
     monotonic indexes you don’t need one for realignment ops
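
Why sorted indexes need no hash table can be sketched with a linear merge. The function name sorted_union_indexers is hypothetical and this pure-Python version only illustrates the algorithm; it is not the pandas implementation and will not reproduce the timings on these slides.

```python
def sorted_union_indexers(left, right):
    """Outer-join two sorted, duplicate-free sequences with a linear merge.

    Returns the union of labels plus, for each side, the source position of
    every union element (-1 where that side lacks the label).
    """
    union, left_pos, right_pos = [], [], []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            union.append(left[i])
            left_pos.append(i)
            right_pos.append(j)
            i += 1
            j += 1
        elif left[i] < right[j]:
            union.append(left[i])
            left_pos.append(i)
            right_pos.append(-1)
            i += 1
        else:
            union.append(right[j])
            left_pos.append(-1)
            right_pos.append(j)
            j += 1
    # One side is exhausted; copy whatever remains of the other.
    for k in range(i, len(left)):
        union.append(left[k])
        left_pos.append(k)
        right_pos.append(-1)
    for k in range(j, len(right)):
        union.append(right[k])
        left_pos.append(-1)
        right_pos.append(k)
    return union, left_pos, right_pos


union, left_pos, right_pos = sorted_union_indexers([1, 3, 5, 7], [2, 3, 7, 9])
```

Each element is visited once and compared directly, so no hashing or per-key memory is needed.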




Some alignment numbers



     Hardware: MacBook Pro Core i7 laptop, Python 2.7.2
     Outer-join 500k-length indexes chosen from 1MM elements
           Dict-based with random strings: 2.2 seconds
           Sorted strings: 400ms (5.5x faster)
           Sorted int64: 19ms (115x faster)
     Fortunately, time series data falls into this last category
     Alignment ops with C primitives could be fairly easily parallelized with
     OpenMP in Cython




DataFrame, the pandas workhorse




     A 2D tabular data structure with row and column indexes
     Hierarchical indexing is one way to support higher-dimensional data in
     a lower-dimensional structure
     Simplified NumPy type system: float, int, boolean, object
     Rich indexing operations, SQL-like join/merges, etc.
     Support heterogeneous columns WITHOUT sacrificing performance in
     the homogeneous (e.g. floating point only) case
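
A small DataFrame built from the first three rows of the table on the "Structured data" slide, using the current pandas API. The derived column name ls_share is hypothetical and exists only to show column insertion.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "cname": ["Australia", "Australia", "Australia"],
        "year": [1950, 1950, 1950],
        "agefrom": [15, 20, 25],
        "ls": [64.3, 48.4, 47.9],
        "pop": [558, 645, 681],
    }
)

# Each column keeps its own type (strings, integers, floats).
dtypes = df.dtypes

# Column insertion and deletion modify the table in place.
df["ls_share"] = df["ls"] / 100.0  # hypothetical derived column
del df["pop"]
```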




DataFrame, under the hood




Supporting size mutability



     In order to have good row-oriented performance, need to store
     like-typed columns in a single ndarray
     “Column” insertion: accumulate 1 × N × ... homogeneous columns,
     later consolidate with other like-typed columns into a single block
     I.e. avoid reallocate-copy or array concatenation steps for as long as
     possible
     Column deletions can be no-copy events (since ndarrays support
     views)
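
The deferred-consolidation idea can be sketched with NumPy. The class ToyFloatBlocks is hypothetical and handles only float columns; it illustrates the strategy described above and is not the pandas internals.

```python
import numpy as np


class ToyFloatBlocks:
    """Hypothetical store for float columns, held as 2D blocks of shape (k, N)."""

    def __init__(self, n_rows):
        self.n_rows = n_rows
        self.blocks = []  # list of (column_names, 2D ndarray)

    def insert(self, name, values):
        # Cheap: append a new 1 x N block without copying existing data.
        arr = np.asarray(values, dtype=np.float64).reshape(1, self.n_rows)
        self.blocks.append(([name], arr))

    def consolidate(self):
        # Deferred cost: a single concatenation merges all like-typed blocks.
        if len(self.blocks) > 1:
            names = [n for block_names, _ in self.blocks for n in block_names]
            data = np.concatenate([arr for _, arr in self.blocks], axis=0)
            self.blocks = [(names, data)]

    def row(self, i):
        # After consolidation a row is one strided slice of a single ndarray.
        self.consolidate()
        return self.blocks[0][1][:, i]


store = ToyFloatBlocks(n_rows=3)
store.insert("a", [1.0, 2.0, 3.0])
store.insert("b", [4.0, 5.0, 6.0])
n_blocks_before = len(store.blocks)
second_row = store.row(1)

# "Deleting" the first column by slicing yields a view, not a copy.
names, data = store.blocks[0]
trimmed = data[1:]
```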




Hierarchical indexing




     New this year, but really should have been done long ago
     Natural result of multi-key groupby
     An intuitive way to work with higher-dimensional data
     A much less ad hoc way of expressing reshaping operations
     Once you have it, things like Excel-style pivot tables just “fall out”




Reshaping




Reshaping

In [5]: df.unstack('agefrom').stack('year')
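
The df used on the slide is not shown, so the following is a guess at a compatible shape: rows indexed by (ccode, agefrom) and columns labelled by year. The country codes and numbers are placeholders.

```python
import pandas as pd

rows = pd.MultiIndex.from_product(
    [["AUS", "NZL"], [15, 20]], names=["ccode", "agefrom"]
)
cols = pd.Index([1950, 1955], name="year")
df = pd.DataFrame([[1, 2], [3, 4], [5, 6], [7, 8]], index=rows, columns=cols)

# unstack moves the 'agefrom' row level into the columns;
# stack then moves the 'year' column level into the rows.
reshaped = df.unstack("agefrom").stack("year")
```

The result is indexed by (ccode, year) with one column per agefrom value, so the two levels have effectively swapped axes.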




Reshaping implementation nuances




     Must deal with unbalanced group sizes / missing data
     Play vectorization tricks with the NumPy C-contiguous memory
     layout: no Python for loops allowed
     Care must be taken to handle heterogeneous and homogeneous data
     cases
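
One vectorization trick of the kind mentioned above, sketched with NumPy under the assumption that row and column labels have already been converted to integer codes. This illustrates the idea and is not the pandas code.

```python
import numpy as np

# Long-format input: one value per (row, column) pair; some pairs are missing.
row_codes = np.array([0, 0, 1, 2, 2])  # integer code of each value's row label
col_codes = np.array([0, 1, 1, 0, 1])  # integer code of each value's column label
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
n_rows, n_cols = 3, 2

# Pre-fill with NaN so unbalanced groups show up as missing data.
wide = np.full((n_rows, n_cols), np.nan)

# In C-contiguous layout, element (i, j) sits at flat offset i * n_cols + j,
# so a single vectorized assignment scatters every value without a Python loop.
flat_positions = row_codes * n_cols + col_codes
wide.flat[flat_positions] = values
```

The pair (row 1, column 0) never appears in the input, so that cell stays NaN.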




GroupBy




     High level process
           split data set into groups
           apply function to each group (an aggregation or a transformation)
           combine results intelligently into a result data structure
     Can be used to emulate SQL GROUP BY operations
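
The split / apply / combine steps above, shown with the current pandas API on made-up data.

```python
import pandas as pd

df = pd.DataFrame(
    {"key": ["a", "b", "a", "b", "a"], "value": [1.0, 2.0, 3.0, 4.0, 5.0]}
)

# Split: partition the rows by the values in "key".
grouped = df.groupby("key")["value"]

# Apply + combine (aggregation): one result per group,
# comparable to SELECT key, AVG(value) ... GROUP BY key.
means = grouped.mean()

# Apply + combine (transformation): the result has the same length as the input.
demeaned = df["value"] - grouped.transform("mean")
```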




GroupBy



     Grouping closely related to indexing
     Create correspondence between axis labels and group labels using one
     of:
           Array of group labels (like a DataFrame column)
           Python function to be applied to each axis tick
     Can group by multiple keys
     For a hierarchically indexed axis, can select a level and group by that
     (or some transformation thereof)
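
The grouping options listed above, each shown once with the current pandas API. All labels and values are made up.

```python
import numpy as np
import pandas as pd

s = pd.Series(
    [1.0, 2.0, 3.0, 4.0], index=["apple", "avocado", "banana", "blueberry"]
)

# An array of group labels, one per row.
by_array = s.groupby(np.array(["u", "v", "u", "v"])).sum()

# A Python function applied to each axis label (here: the first letter).
by_func = s.groupby(lambda label: label[0]).sum()

df = pd.DataFrame(
    {"k1": ["a", "a", "b", "b"], "k2": [1, 2, 1, 2], "v": [1.0, 2.0, 3.0, 4.0]}
)

# Multiple keys: the result carries a hierarchical index (k1, k2).
by_two_keys = df.groupby(["k1", "k2"])["v"].sum()

# For a hierarchically indexed axis, group by one of its levels.
by_level = by_two_keys.groupby(level="k1").sum()
```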




GroupBy implementation challenges


     Computing the group labels from arbitrary Python objects is very
     expensive
           77ms for 1MM strings with 1K groups
           107ms for 1MM strings with 10K groups
           350ms for 1MM strings with 100K groups
     To sort or not to sort (for iteration)?
           Once you have the labels, can reorder the data set in O(n) (with a
           much smaller constant than computing the labels)
           Roughly 35ms to reorder 1MM float64 data points given the labels
     (By contrast, computing the mean of 1MM elements takes 1.4ms)
     Python function call overhead is significant in cases with lots of small
     groups; much better (orders of magnitude speedup) to write
     specialized Cython routines
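
The two steps discussed above (computing integer group labels, then reordering by them) can be sketched with pd.factorize and NumPy. A stable argsort stands in here for the O(n) counting sort that the integer labels would allow, and np.bincount stands in for a specialized compiled aggregation routine; neither is claimed to be what pandas does internally.

```python
import numpy as np
import pandas as pd

keys = np.array(["b", "a", "c", "a", "b", "a"], dtype=object)
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])

# Step 1 (the expensive part): hash arbitrary objects into integer group labels.
labels, uniques = pd.factorize(keys)
n_groups = len(uniques)

# Step 2: reorder so that each group's values are contiguous.
order = np.argsort(labels, kind="stable")
sorted_values = values[order]

# Group means computed in compiled code, with no Python call per group.
counts = np.bincount(labels, minlength=n_groups)
sums = np.bincount(labels, weights=values, minlength=n_groups)
group_means = sums / counts
```

Labels are assigned in order of first appearance, so group 0 is "b", group 1 is "a", and group 2 is "c".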


Demo, time permitting




pandas: a Foundational Python Library for Data Analysis and Statistics

  • 1. pandas: a Foundational Python library for Data Analysis and Statistics Wes McKinney PyHPC 2011, 18 November 2011 Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 1 / 25
  • 2. An alternate title High Performance Structured Data Manipulation in Python Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 2 / 25
  • 3. My background Former quant hacker at AQR Capital, now entrepreneur Background: math, statistics, computer science, quant finance. Shaken, not stirred Active in scientific Python community My blog: http://blog.wesmckinney.com Twitter: @wesmckinn Book! “Python for Data Analysis”, to hit the shelves later next year from O’Reilly Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 3 / 25
  • 4. Structured data cname year agefrom ageto ls lsc pop ccode 0 Australia 1950 15 19 64.3 15.4 558 AUS 1 Australia 1950 20 24 48.4 26.4 645 AUS 2 Australia 1950 25 29 47.9 26.2 681 AUS 3 Australia 1950 30 34 44 23.8 614 AUS 4 Australia 1950 35 39 42.1 21.9 625 AUS 5 Australia 1950 40 44 38.9 20.1 555 AUS 6 Australia 1950 45 49 34 16.9 491 AUS 7 Australia 1950 50 54 29.6 14.6 439 AUS 8 Australia 1950 55 59 28 12.9 408 AUS 9 Australia 1950 60 64 26.3 12.1 356 AUS Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 4 / 25
  • 5. Structured data A familiar data model Heterogeneous columns or hyperslabs Each column/hyperslab is homogeneously typed Relational databases (SQL, etc.) are just a special case Need good performance in row- and column-oriented operations Support for axis metadata Data alignment is critical Seamless integration with Python data structures and NumPy Wes McKinney (@wesmckinn) Data analysis with pandas PyHPC 2011 5 / 25
  • 6. Structured data challenges
      - Table modification: column insertion/deletion
      - Axis indexing and data alignment
      - Aggregation and transformation by group (“group by”)
      - Missing data handling
      - Pivoting and reshaping
      - Merging and joining
      - Time series-specific manipulations
      - Fast IO: flat files, databases, HDF5, ...
  • 7. Not all fun and games
      - We care nearly equally about:
          - Performance
          - Ease-of-use (syntax / API fits your mental model)
          - Expressiveness
      - Clean, consistent API design is hard and underappreciated
  • 8. The big picture
      - Build a foundation for data analysis and statistical computing
      - Craft the most expressive / flexible in-memory data manipulation tool in any language
          - Preferably one of the fastest, too
      - Vastly simplify the data preparation, munging, and integration process
      - Comfortable abstractions: master data-fu without needing to be a computer scientist
      - Later: extend API with distributed computing backend for larger-than-memory datasets
  • 9. pandas: a brief history
      - Started building in April 2008, back at AQR
      - Open-sourced (BSD license) mid-2009
      - 29075 lines of Python/Cython code as of yesterday, and growing fast
      - Heavily tested, being used by many companies (incl. lots of financial firms) in production
  • 10. Cython: getting good performance
      - My tool of choice for writing performant code
      - High level access to NumPy C API internals
      - Buffer syntax/protocol abstracts away striding details of non-contiguous arrays, very low overhead vs. working with raw C pointers
      - Reduce/remove interpreter overhead associated with working with Python data structures
      - Interface directly with C/C++ code when necessary
  • 11. Axis indexing
      - Key pandas feature
      - The axis index is a data structure itself, which can be customized to support things like:
          - 1-1 O(1) indexing with hashable Python objects
          - Datetime indexing for time series data
          - Hierarchical (multi-level) indexing
      - Use Python dict to support O(1) lookups and O(n) realignment ops. Can specialize to get better performance and memory usage
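
The dict-backed lookup and realignment described on this slide can be sketched in plain Python. This is only an illustration of the idea, not pandas' actual index implementation: the class `DictIndex`, its method `indexer_for`, the helper `reindex`, and all values other than 558 are hypothetical.

```python
class DictIndex:
    """Toy axis index: maps unique hashable labels to integer positions."""

    def __init__(self, labels):
        self.labels = list(labels)
        # One dict entry per label gives O(1) average-case lookups.
        self._positions = {label: i for i, label in enumerate(self.labels)}
        if len(self._positions) != len(self.labels):
            raise ValueError("labels must be unique (1-1 mapping)")

    def position(self, label):
        return self._positions[label]

    def indexer_for(self, target_labels):
        """O(n) realignment: position of each target label, or -1 if absent."""
        return [self._positions.get(label, -1) for label in target_labels]


def reindex(values, index, target_labels, fill=None):
    """Rearrange `values` (aligned with `index`) to follow `target_labels`."""
    indexer = index.indexer_for(target_labels)
    return [values[i] if i >= 0 else fill for i in indexer]


index = DictIndex(["AUS", "NZL", "USA"])
values = [558, 120, 900]  # made-up numbers, one per label
aligned = reindex(values, index, ["USA", "CAN", "AUS"])
```

Labels missing from the source index ("CAN" here) come back as the fill value, which is the role missing-data markers play in alignment.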
  • 12. Axis indexing
      - Every axis has an index
      - Automatic alignment between differently-indexed objects: makes it nearly impossible to accidentally combine misaligned data
      - Hierarchical indexing provides an intuitive way of structuring and working with higher-dimensional data
      - Natural way of expressing “group by” and join-type operations
      - As good as, and in many cases much more integrated/flexible than, commercial or open-source alternatives to pandas/Python
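
A short sketch of automatic alignment, using the present-day pandas API (which may differ in detail from the 2011 release the talk describes). The labels and values are invented for illustration.

```python
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=["x", "y", "z"])
b = pd.Series([10.0, 20.0, 30.0], index=["z", "y", "w"])

# Arithmetic matches values by label, not by position. Labels present in
# only one operand yield NaN rather than a silently misaligned number.
total = a + b
```

Here "y" and "z" are summed by label even though they sit at different positions in the two Series, while "x" and "w" come out as missing.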
  • 13. The trouble with Python dicts...
      - Python dict memory footprint can be quite large
          - 1MM key-value pairs: something like 70 MB on a 64-bit system
          - Even though sizeof(PyObject*) == 8
      - Python dict is great, but should use a faster, threadsafe hash table for primitive C types (like 64-bit integer)
      - BUT: using a hash table is only necessary in the general case. With monotonic indexes you don’t need one for realignment ops
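
To illustrate why monotonic indexes need no hash table, here is a minimal merge-style outer join over two sorted, duplicate-free label lists. It is a plain-Python sketch of the general technique, not the routine pandas uses; the function name `outer_join_sorted` is hypothetical.

```python
def outer_join_sorted(left, right):
    """Union of two sorted, duplicate-free label lists plus indexers.

    Returns (joined, left_indexer, right_indexer). Each indexer entry is the
    position of the joined label in that input, or -1 if it is absent there.
    """
    joined, left_idx, right_idx = [], [], []
    i = j = 0
    while i < len(left) and j < len(right):
        if left[i] == right[j]:
            joined.append(left[i])
            left_idx.append(i)
            right_idx.append(j)
            i += 1
            j += 1
        elif left[i] < right[j]:
            joined.append(left[i])
            left_idx.append(i)
            right_idx.append(-1)
            i += 1
        else:
            joined.append(right[j])
            left_idx.append(-1)
            right_idx.append(j)
            j += 1
    # Drain whichever input still has labels left.
    for k in range(i, len(left)):
        joined.append(left[k])
        left_idx.append(k)
        right_idx.append(-1)
    for k in range(j, len(right)):
        joined.append(right[k])
        left_idx.append(-1)
        right_idx.append(k)
    return joined, left_idx, right_idx


joined, left_idx, right_idx = outer_join_sorted([1, 3, 5], [2, 3, 6])
```

Each label is visited once and only compared, never hashed, so the join runs in a single linear pass over both inputs.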
  • 14. Some alignment numbers
      - Hardware: MacBook Pro Core i7 laptop, Python 2.7.2
      - Outer-join 500k-length indexes chosen from 1MM elements
          - Dict-based with random strings: 2.2 seconds
          - Sorted strings: 400 ms (5.5x faster)
          - Sorted int64: 19 ms (115x faster)
      - Fortunately, time series data falls into this last category
      - Alignment ops with C primitives could be fairly easily parallelized with OpenMP in Cython
  • 15. DataFrame, the pandas workhorse
      - A 2D tabular data structure with row and column indexes
      - Hierarchical indexing is one way to support higher-dimensional data in a lower-dimensional structure
      - Simplified NumPy type system: float, int, boolean, object
      - Rich indexing operations, SQL-like join/merges, etc.
      - Support heterogeneous columns WITHOUT sacrificing performance in the homogeneous (e.g. floating point only) case
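
A minimal sketch of a heterogeneous DataFrame and a SQL-like merge, written against the current pandas API. The first three rows reuse values from the table on slide 4; the `codes` lookup table is a hypothetical addition for the join.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "cname": ["Australia", "Australia", "Australia"],
        "year": [1950, 1950, 1950],
        "agefrom": [15, 20, 25],
        "ls": [64.3, 48.4, 47.9],
    }
)

# Each column is homogeneously typed; the frame as a whole is not.
dtypes = df.dtypes

# Hypothetical lookup table, joined SQL-style on the shared key column.
codes = pd.DataFrame({"cname": ["Australia"], "ccode": ["AUS"]})
merged = df.merge(codes, on="cname", how="left")
```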
  • 16. DataFrame, under the hood
  • 17. Supporting size mutability
      - In order to have good row-oriented performance, need to store like-typed columns in a single ndarray
      - “Column” insertion: accumulate 1 × N × ... homogeneous columns, later consolidate with other like-typed into a single block
      - I.e. avoid reallocate-copy or array concatenation steps as long as possible
      - Column deletions can be no-copy events (since ndarrays support views)
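
The deferred-consolidation idea can be sketched with NumPy as follows. This is a toy model under stated assumptions, not the pandas internals: the class `ToyBlockFrame` and its attributes are hypothetical, and it handles only 1 × N columns.

```python
import numpy as np


class ToyBlockFrame:
    """Hypothetical sketch: like-typed columns stored in per-dtype 2D blocks."""

    def __init__(self):
        self.blocks = {}   # dtype -> 2D ndarray, one row per stored column
        self.names = {}    # dtype -> column names, in block row order
        self.pending = []  # (name, 1 x N array) pieces awaiting consolidation

    def insert(self, name, values):
        # Cheap: keep the new column as its own 1 x N piece for now.
        self.pending.append((name, np.asarray(values).reshape(1, -1)))

    def consolidate(self):
        # One concatenation per dtype, however many columns were inserted.
        by_dtype = {}
        for name, arr in self.pending:
            by_dtype.setdefault(arr.dtype, []).append((name, arr))
        for dtype, items in by_dtype.items():
            pieces = [arr for _, arr in items]
            if dtype in self.blocks:
                pieces.insert(0, self.blocks[dtype])
            self.blocks[dtype] = np.vstack(pieces)
            self.names.setdefault(dtype, []).extend(name for name, _ in items)
        self.pending = []


frame = ToyBlockFrame()
frame.insert("ls", [64.3, 48.4])
frame.insert("lsc", [15.4, 26.4])
frame.insert("pop", [558, 645])
frame.consolidate()

float_block = frame.blocks[np.dtype("float64")]
# Dropping the trailing column is a slice, so no data is copied.
without_lsc = float_block[:1]
```

Three insertions cost three cheap appends and then one concatenation per dtype, instead of a reallocate-and-copy on every insert.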
  • 18. Hierarchical indexing
      - New this year, but really should have done long ago
      - Natural result of multi-key groupby
      - An intuitive way to work with higher-dimensional data
      - Much less ad hoc way of expressing reshaping operations
      - Once you have it, things like Excel-style pivot tables just “fall out”
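
As a sketch of how a pivot table "falls out" of hierarchical indexing, the example below groups by two keys and then moves one index level into the columns. It uses the current pandas API; apart from 558, the population numbers and the second country are made up.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "cname": ["Australia", "Australia", "Austria", "Austria"],
        "year": [1950, 1955, 1950, 1955],
        "pop": [558, 600, 500, 520],  # illustrative values
    }
)

# A multi-key groupby naturally yields a two-level (hierarchical) index.
by_key = df.groupby(["cname", "year"])["pop"].sum()

# Moving one index level into the columns gives a pivot-table layout.
table = by_key.unstack("year")
```

The result has one row per country and one column per year, with no dedicated pivot routine involved.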
  • 19. Reshaping
  • 20. Reshaping
      - In [5]: df.unstack('agefrom').stack('year')
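
The transcript does not show the layout of `df` on this slide, so the sketch below assumes one that makes the call meaningful: rows indexed by (`cname`, `agefrom`) and a column axis named `year`. The 1955 values are invented. Recent pandas 2.x releases may emit a FutureWarning from `stack` about its changing implementation.

```python
import pandas as pd

rows = pd.MultiIndex.from_product(
    [["Australia"], [15, 20]], names=["cname", "agefrom"]
)
cols = pd.Index([1950, 1955], name="year")
df = pd.DataFrame([[64.3, 60.0], [48.4, 45.0]], index=rows, columns=cols)

wide = df.unstack("agefrom")  # agefrom moves from the rows to the columns
result = wide.stack("year")   # year moves from the columns to the rows
```

After the round trip, the rows are keyed by (`cname`, `year`) and the columns by `agefrom`: the two levels have swapped axes.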
  • 21. Reshaping implementation nuances
      - Must deal with unbalanced group sizes / missing data
      - Play vectorization tricks with the NumPy C-contiguous memory layout: no Python for loops allowed
      - Care must be taken to handle heterogeneous and homogeneous data cases
  • 22. GroupBy
      - High level process
          - split data set into groups
          - apply function to each group (an aggregation or a transformation)
          - combine results intelligently into a result data structure
      - Can be used to emulate SQL GROUP BY operations
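
The three steps can be spelled out in plain Python before reaching for any library. This is a minimal sketch with invented records, roughly equivalent to `SELECT key, AVG(value) ... GROUP BY key`.

```python
from collections import defaultdict
from statistics import mean

# (group key, value) pairs; the numbers are illustrative.
records = [("AUS", 64.3), ("AUT", 50.0), ("AUS", 48.4), ("AUT", 40.0)]

# Split: bucket the values by group key.
groups = defaultdict(list)
for key, value in records:
    groups[key].append(value)

# Apply an aggregation to each group, then combine into one result mapping.
result = {key: mean(values) for key, values in groups.items()}
```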
  • 23. GroupBy
      - Grouping closely related to indexing
      - Create correspondence between axis labels and group labels using one of:
          - Array of group labels (like a DataFrame column)
          - Python function to be applied to each axis tick
      - Can group by multiple keys
      - For a hierarchically indexed axis, can select a level and group by that (or some transformation thereof)
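
Each way of specifying groups listed on this slide has a counterpart in the current pandas API, sketched below. The data are invented apart from the value 558.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "ccode": ["AUS", "AUS", "AUT", "AUT"],
        "year": [1950, 1955, 1950, 1955],
        "pop": [558, 600, 500, 520],  # illustrative values
    }
)

# 1. An array of group labels (here, a DataFrame column).
by_column = df.groupby(df["ccode"])["pop"].sum()

# 2. A Python function applied to each axis label (the integer row labels).
by_function = df["pop"].groupby(lambda label: label % 2).sum()

# 3. Multiple keys at once.
by_two_keys = df.groupby(["ccode", "year"])["pop"].sum()

# 4. One level of a hierarchically indexed axis.
indexed = df.set_index(["ccode", "year"])
by_level = indexed.groupby(level="year")["pop"].sum()
```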
  • 24. GroupBy implementation challenges
      - Computing the group labels from arbitrary Python objects is very expensive
          - 77 ms for 1MM strings with 1K groups
          - 107 ms for 1MM strings with 10K groups
          - 350 ms for 1MM strings with 100K groups
      - To sort or not to sort (for iteration)?
          - Once you have the labels, can reorder the data set in O(n) (with a much smaller constant than computing the labels)
          - Roughly 35 ms to reorder 1MM float64 data points given the labels
      - (By contrast, computing the mean of 1MM elements takes 1.4 ms)
      - Python function call overhead is significant in cases with lots of small groups; much better (orders of magnitude speedup) to write specialized Cython routines
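
The two stages discussed here, computing integer group labels and then reordering in O(n), can be sketched as a dict-based factorization followed by a counting sort. This is an illustration of the general technique, not the pandas routines. The function names `compute_labels` and `group_order` are hypothetical, and the explicit Python loops are the part the slide suggests moving into Cython.

```python
import numpy as np


def compute_labels(keys):
    """Map arbitrary hashable keys to integer group labels 0..ngroups-1."""
    table = {}
    labels = np.empty(len(keys), dtype=np.int64)
    for i, key in enumerate(keys):
        # Hashing every Python object is the expensive step.
        labels[i] = table.setdefault(key, len(table))
    return labels, list(table)


def group_order(labels, ngroups):
    """Counting sort: positions that bring equal labels together, in O(n)."""
    counts = np.bincount(labels, minlength=ngroups)
    starts = np.zeros(ngroups, dtype=np.int64)
    starts[1:] = np.cumsum(counts)[:-1]  # first output slot of each group
    order = np.empty(len(labels), dtype=np.int64)
    for i, label in enumerate(labels):
        order[starts[label]] = i
        starts[label] += 1
    return order


keys = ["b", "a", "b", "c", "a"]
values = np.array([1.0, 2.0, 3.0, 4.0, 5.0])

labels, uniques = compute_labels(keys)
order = group_order(labels, len(uniques))
grouped_values = values[order]  # values of each group are now contiguous
```

Groups appear in order of first occurrence ("b", "a", "c") rather than sorted order, which corresponds to the slide's "not to sort" option.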
  • 25. Demo, time permitting