pandas: Powerful data
analysis tools for Python
Wes McKinney
Lambda Foundry, Inc.
@wesmckinn
PhillyPUG 3/27/2012
Me
• Recovering mathematician
• 3 years in the quant finance industry
• Last 2: statistics + freelance + open source
• My new company: Lambda Foundry
• High productivity data analysis and
research tools for quant finance
Me
• Blog: http://blog.wesmckinney.com
• GitHub: http://github.com/wesm
• Twitter: @wesmckinn
In the works
Python for Data Analysis: Agile Tools for Real World Data
Wes McKinney
• Pragmatic intro to scientific Python
• pandas
• Case studies
• ETA: Late 2012
pandas?
• http://pandas.pydata.org
• Rich relational data tool built on top of
NumPy
• Like R’s data.frame on steroids
• Excellent performance
• Easy-to-use, highly consistent API
• A foundation for data analysis in Python
pandas
• In heavy production use in the financial
industry, among others
• Generally much better performance than
other open source alternatives (e.g. R)
• Hope: basis for the “next generation”
statistical computing and analysis environment
Simplifying data wrangling
• Data munging / preparation / cleaning /
integration is slow, error prone, and time
consuming
• Everyone already <3’s Python for data
wrangling: pandas takes it to the next level
Explosive pandas growth
• 10 significant releases since 9/2011
• Hugely increased user base
Battle tested
• > 98% line coverage as measured by
coverage.py
• v0.3.0 (2/19/2011): 533 test functions
• v0.7.3dev (3/27/2012): >1500 test functions
IPython
• Simply put: one of the hottest Python
projects out there
• Tab completion, introspection, interactive
debugger, command history
• Designed to enhance your productivity in
every way. I can’t live without it
• IPython HTML notebook is #winning
Series
• Subclass of numpy.ndarray
• Data: any type
• Index labels need not be ordered
• Duplicates are possible (but
result in reduced functionality)
[Figure: a Series with index labels A, B, C, D, E and values 5, 6, 12, -5, 6.7]
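The Series in the diagram can be rebuilt in a few lines. This is a minimal sketch assuming a current pandas release (the API at the time of the talk differed in places, and in later releases Series is no longer an ndarray subclass); the `dup` Series is a made-up illustration of duplicate labels.

```python
import pandas as pd

# Values and labels taken from the slide's diagram.
s = pd.Series([5, 6, 12, -5, 6.7], index=["A", "B", "C", "D", "E"])

# Label-based lookup: the index label, not the position, identifies the value.
c_value = s["C"]

# Labels need not be ordered; selecting labels in any order carries the values along.
reordered = s[["E", "A"]]

# Duplicate labels are allowed, but a lookup then returns several values.
dup = pd.Series([1, 2, 3], index=["A", "A", "B"])  # hypothetical data
a_values = dup["A"]
```

With duplicate labels, `dup["A"]` is itself a Series rather than a scalar, which is one example of the reduced functionality the slide mentions.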
DataFrame
• NumPy array-like
• Each column can have a
different type
• Row and column index
• Size mutable: insert and delete
columns
[Figure: a DataFrame with index A to E and columns foo, bar, baz, qux]
index  foo  bar  baz  qux
A        0  x    2.7  True
B        4  y    6    True
C        8  z    10   False
D      -12  w    NA   False
E       16  a    18   False
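The DataFrame in the diagram might be built as follows. This is a sketch assuming a current pandas release; the assignment of values to columns is a reading of the diagram, and the `foo_doubled` column is a hypothetical addition used to show size mutability.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {
        "foo": [0, 4, 8, -12, 16],
        "bar": ["x", "y", "z", "w", "a"],
        "baz": [2.7, 6, 10, np.nan, 18],  # NA is stored as NaN
        "qux": [True, True, False, False, False],
    },
    index=["A", "B", "C", "D", "E"],
)

# Each column keeps its own dtype (integer, object/string, float, bool).
dtypes = df.dtypes

# Size mutable: insert a derived column, then delete an existing one.
df["foo_doubled"] = df["foo"] * 2
del df["qux"]
```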
DataFrame
In [10]: tips[:10]
Out[10]:
    total_bill  tip   sex     smoker  day  time    size
1   16.99       1.01  Female  No      Sun  Dinner  2
2   10.34       1.66  Male    No      Sun  Dinner  3
3   21.01       3.50  Male    No      Sun  Dinner  3
4   23.68       3.31  Male    No      Sun  Dinner  2
5   24.59       3.61  Female  No      Sun  Dinner  4
6   25.29       4.71  Male    No      Sun  Dinner  4
7   8.770       2.00  Male    No      Sun  Dinner  2
8   26.88       3.12  Male    No      Sun  Dinner  4
9   15.04       1.96  Male    No      Sun  Dinner  2
10  14.78       3.23  Male    No      Sun  Dinner  2
DataFrame
• Axis indexing enables rich data alignment,
joins / merges, reshaping, selection, etc.
day              Fri    Sat    Sun    Thur
sex    smoker
Female No        3.125  2.725  3.329  2.460
       Yes       2.683  2.869  3.500  2.990
Male   No        2.500  3.257  3.115  2.942
       Yes       2.741  2.879  3.521  3.058
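A table of this shape, with (sex, smoker) on the rows and day on the columns, is what a pivot table produces. The sketch below assumes a current pandas release and assumes the cells are mean tips; the miniature `tips` frame is made up for illustration and is not the data set used in the talk.

```python
import pandas as pd

# Hypothetical miniature stand-in for the tips data; the numbers are invented.
tips = pd.DataFrame(
    {
        "sex": ["Female", "Female", "Male", "Male", "Male", "Female"],
        "smoker": ["No", "Yes", "No", "No", "Yes", "No"],
        "day": ["Sun", "Sat", "Sun", "Sat", "Sun", "Sun"],
        "tip": [1.0, 2.0, 3.0, 4.0, 5.0, 3.0],
    }
)

# Rows are indexed by (sex, smoker), columns by day, cells hold the mean tip.
table = tips.pivot_table(
    values="tip", index=["sex", "smoker"], columns="day", aggfunc="mean"
)
```

Combinations that never occur in the data, such as ("Female", "Yes") on Sun here, come back as NaN rather than being dropped.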
Axis indexing, the special
pandas-flavored sauce
• Enables “alignment-free” programming
• Prevents major source of data munging
frustration and errors
• Fast data selection
• Powerful way of describing reshape / join /
merge / pivot-table operations
Data alignment
• Binary operations are joins!
[Figure: Series (B=1, C=2, D=3, E=4) + Series (A=0, B=1, C=2, D=3) = Series (A=NA, B=2, C=4, D=6, E=NA)]
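The addition in the diagram can be reproduced directly. A minimal sketch, assuming a current pandas release; the `fill_value` variant is an extra illustration that is not on the slide.

```python
import pandas as pd

# The two operands from the slide's diagram.
left = pd.Series([1, 2, 3, 4], index=["B", "C", "D", "E"])
right = pd.Series([0, 1, 2, 3], index=["A", "B", "C", "D"])

# Addition aligns on labels: the result index is the union of both indexes,
# and labels present on only one side become NaN.
total = left + right

# fill_value treats the missing side as 0 instead of propagating NaN.
filled = left.add(right, fill_value=0)
```

In this sense the binary operation behaves like an outer join on the index followed by element-wise arithmetic.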
GroupBy
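[Figure: split-apply-combine. Rows keyed A, B, C, A, B, C, A, B, C with values 0, 5, 10, 5, 10, 15, 10, 15, 20 are split by key into A: 0, 5, 10; B: 5, 10, 15; C: 10, 15, 20. sum is applied to each group, and the results are combined into A=15, B=30, C=45.]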
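The split-apply-combine diagram maps onto a one-line groupby. A minimal sketch, assuming a current pandas release; the column names `key` and `data` are hypothetical.

```python
import pandas as pd

# Keys and values from the slide's diagram.
df = pd.DataFrame(
    {
        "key": ["A", "B", "C", "A", "B", "C", "A", "B", "C"],
        "data": [0, 5, 10, 5, 10, 15, 10, 15, 20],
    }
)

# Split rows by key, apply sum to each group, combine into one Series.
totals = df.groupby("key")["data"].sum()

# The same three steps written out by hand, for comparison.
manual = {key: group["data"].sum() for key, group in df.groupby("key")}
```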
Hierarchical indexes
• Semantics: a tuple at each tick
• Enables easy group selection
• Terminology: “multiple levels”
• Natural part of GroupBy and
reshape operations
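[Figure: a two-level index in which outer label A spans inner labels 1, 2, 3 and outer label B spans inner labels 1, 2, 3, 4]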
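The two-level index in the diagram can be sketched as below, assuming a current pandas release. The level names `outer` and `inner` and the values 0 to 6 are hypothetical; only the labels come from the slide.

```python
import pandas as pd

# A tuple at each tick: (outer label, inner label).
index = pd.MultiIndex.from_tuples(
    [("A", 1), ("A", 2), ("A", 3), ("B", 1), ("B", 2), ("B", 3), ("B", 4)],
    names=["outer", "inner"],
)
s = pd.Series(range(7), index=index)

# Group selection: an outer label returns its whole group, indexed by the inner level.
group_a = s.loc["A"]

# A full tuple addresses a single value.
b3 = s.loc[("B", 3)]

# Reshape: move the inner level into the columns.
wide = s.unstack("inner")
```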
Let’s have a little fun
To the IPython Notebook!
What’s in pandas?
• A big library: 40k SLOC
Tests!
• Huge accumulation of use cases originating
in real world applications
• 68 lines of tests for every 100 lines of code
pandas.core
• Data structures
• Series (1D)
• DataFrame (2D)
• Panel (3D)
• NA-friendly statistics
• Index implementations / label-indexing
pandas.core
• GroupBy engine
• Time series tools
• Date range generation
• Extensible date offsets
• Hierarchical indexing stuff
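Date range generation and date offsets could look like this. A sketch assuming a current pandas release, not the 0.7-era API; the dates are chosen for illustration.

```python
import pandas as pd
from pandas.tseries.offsets import BDay, MonthEnd

# Date range generation: five business days starting on the date of the talk.
days = pd.date_range("2012-03-27", periods=5, freq="B")

# Date offsets are objects that can be added to timestamps.
next_bday = pd.Timestamp("2012-03-30") + BDay(1)      # Friday rolls to Monday
month_end = pd.Timestamp("2012-03-27") + MonthEnd(1)  # rolls to end of March
```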
Elsewhere
• Join / concatenation algorithms
• Sparse versions of Series, DataFrame...
• IO tools: CSV files, HDF5, Excel 2003/2007
• Moving window statistics (rolling mean, ...)
• Pivot tables
• High level matplotlib interface
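Three of the items above, sketched with made-up data and assuming a current pandas release (at the time of the talk, moving window functions had a different spelling):

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])  # hypothetical data

# Moving window statistic: mean over a sliding window of three observations.
rolling_mean = s.rolling(window=3).mean()

# Join: combine two tables on a shared key column, keeping matching keys only.
left = pd.DataFrame({"key": ["a", "b", "c"], "x": [1, 2, 3]})
right = pd.DataFrame({"key": ["b", "c", "d"], "y": [20, 30, 40]})
joined = pd.merge(left, right, on="key", how="inner")

# Concatenation: stack objects end to end along an axis.
stacked = pd.concat([s, s], ignore_index=True)
```

The first two entries of `rolling_mean` are NaN because the window is not yet full.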
Hmm, pandas/src
• ~6000 lines of mostly Cython code
• Fast data algorithms that power the library
• pandas in PyPy?
Ok, so why Python?
• Look around you!
• Build a superior data analysis and statistical
computing environment
• Build mission-critical, data-driven
production systems
Trolling #rstats
Hash tables, anyone?
The pandas roadmap
• Improved time series capabilities
• Port GroupBy engine to NumPy only
• Better integration with statsmodels and
scikit-learn
• R integration via rpy2
The pandas roadmap
• Integration with JavaScript visualization
frameworks: D3, Flot, others
• Alternate DataFrame “backends”
• Memory maps
• HDF5 / PyTables
• SQL or NoSQL-backed
• Tighter IPython Notebook integration
ggplot2 for Python
• We need to build a better interface
for creating statistical graphics in Python
• Use pandas as the base layer!
• Upcoming project from Peter Wang: bokeh
pandas for “Big Data”
• Quite common to need to process larger-
than-RAM data sets
• Alternate DataFrame backends are the
likely solution
• Ripe for integration with MapReduce
frameworks
Better time series
• Integration of scikits.timeseries codebase
• NumPy datetime64 dtype
• Higher performance, less memory
Better time series
• Fixed frequency handling
• Time zones
• Multiple time concepts
• Intervals: 1984, or “1984 Q4”
• Timestamps: moment in time, to micro-
or nanosecond resolution
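The distinction between intervals and timestamps is visible in later pandas releases as `Period` and `Timestamp`. The sketch below assumes a current pandas release, not the roadmap as planned in 2012; the timestamp value is made up.

```python
import pandas as pd

# Intervals (periods): a span of time rather than a single instant.
year = pd.Period("1984")
quarter = pd.Period("1984Q4")

# Timestamps: a single moment, stored with nanosecond resolution.
ts = pd.Timestamp("1984-11-15 09:30:00.000000001")

# A period knows its boundaries, so containment can be checked directly.
in_quarter = quarter.start_time <= ts <= quarter.end_time
```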
Thanks!
• Follow me on Twitter: @wesmckinn
• pydata/pandas on GitHub!

More Related Content

What's hot

Python NumPy Tutorial | NumPy Array | Edureka
Python NumPy Tutorial | NumPy Array | EdurekaPython NumPy Tutorial | NumPy Array | Edureka
Python NumPy Tutorial | NumPy Array | Edureka
Edureka!
 
Python Seaborn Data Visualization
Python Seaborn Data Visualization Python Seaborn Data Visualization
Python Seaborn Data Visualization
Sourabh Sahu
 
What is Python JSON | Edureka
What is Python JSON | EdurekaWhat is Python JSON | Edureka
What is Python JSON | Edureka
Edureka!
 
Python pandas Library
Python pandas LibraryPython pandas Library
Python pandas Library
Md. Sohag Miah
 
Python Pandas
Python PandasPython Pandas
Python Pandas
Sunil OS
 
Data Analysis in Python
Data Analysis in PythonData Analysis in Python
Data Analysis in Python
Richard Herrell
 
Python
PythonPython
Python Scipy Numpy
Python Scipy NumpyPython Scipy Numpy
Python Scipy Numpy
Girish Khanzode
 
Introduction to pandas
Introduction to pandasIntroduction to pandas
Introduction to pandas
Piyush rai
 
Numpy tutorial
Numpy tutorialNumpy tutorial
Numpy tutorial
HarikaReddy115
 
NumPy.pptx
NumPy.pptxNumPy.pptx
NumPy.pptx
EN1036VivekSingh
 
PYTHON-Chapter 4-Plotting and Data Science PyLab - MAULIK BORSANIYA
PYTHON-Chapter 4-Plotting and Data Science  PyLab - MAULIK BORSANIYAPYTHON-Chapter 4-Plotting and Data Science  PyLab - MAULIK BORSANIYA
PYTHON-Chapter 4-Plotting and Data Science PyLab - MAULIK BORSANIYA
Maulik Borsaniya
 
RDM 2020: Python, Numpy, and Pandas
RDM 2020: Python, Numpy, and PandasRDM 2020: Python, Numpy, and Pandas
RDM 2020: Python, Numpy, and Pandas
Henry Schreiner
 
pandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statisticspandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statistics
Wes McKinney
 
Python - Numpy/Pandas/Matplot Machine Learning Libraries
Python - Numpy/Pandas/Matplot Machine Learning LibrariesPython - Numpy/Pandas/Matplot Machine Learning Libraries
Python - Numpy/Pandas/Matplot Machine Learning Libraries
Andrew Ferlitsch
 
Data visualization
Data visualizationData visualization
Data visualization
Moushmi Dasgupta
 
Python Functions Tutorial | Working With Functions In Python | Python Trainin...
Python Functions Tutorial | Working With Functions In Python | Python Trainin...Python Functions Tutorial | Working With Functions In Python | Python Trainin...
Python Functions Tutorial | Working With Functions In Python | Python Trainin...
Edureka!
 
Zero to Hero - Introduction to Python3
Zero to Hero - Introduction to Python3Zero to Hero - Introduction to Python3
Zero to Hero - Introduction to Python3
Chariza Pladin
 
Essential NumPy
Essential NumPyEssential NumPy
Essential NumPy
zekeLabs Technologies
 
Data Structures in Python
Data Structures in PythonData Structures in Python
Data Structures in Python
Devashish Kumar
 

What's hot (20)

Python NumPy Tutorial | NumPy Array | Edureka
Python NumPy Tutorial | NumPy Array | EdurekaPython NumPy Tutorial | NumPy Array | Edureka
Python NumPy Tutorial | NumPy Array | Edureka
 
Python Seaborn Data Visualization
Python Seaborn Data Visualization Python Seaborn Data Visualization
Python Seaborn Data Visualization
 
What is Python JSON | Edureka
What is Python JSON | EdurekaWhat is Python JSON | Edureka
What is Python JSON | Edureka
 
Python pandas Library
Python pandas LibraryPython pandas Library
Python pandas Library
 
Python Pandas
Python PandasPython Pandas
Python Pandas
 
Data Analysis in Python
Data Analysis in PythonData Analysis in Python
Data Analysis in Python
 
Python
PythonPython
Python
 
Python Scipy Numpy
Python Scipy NumpyPython Scipy Numpy
Python Scipy Numpy
 
Introduction to pandas
Introduction to pandasIntroduction to pandas
Introduction to pandas
 
Numpy tutorial
Numpy tutorialNumpy tutorial
Numpy tutorial
 
NumPy.pptx
NumPy.pptxNumPy.pptx
NumPy.pptx
 
PYTHON-Chapter 4-Plotting and Data Science PyLab - MAULIK BORSANIYA
PYTHON-Chapter 4-Plotting and Data Science  PyLab - MAULIK BORSANIYAPYTHON-Chapter 4-Plotting and Data Science  PyLab - MAULIK BORSANIYA
PYTHON-Chapter 4-Plotting and Data Science PyLab - MAULIK BORSANIYA
 
RDM 2020: Python, Numpy, and Pandas
RDM 2020: Python, Numpy, and PandasRDM 2020: Python, Numpy, and Pandas
RDM 2020: Python, Numpy, and Pandas
 
pandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statisticspandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statistics
 
Python - Numpy/Pandas/Matplot Machine Learning Libraries
Python - Numpy/Pandas/Matplot Machine Learning LibrariesPython - Numpy/Pandas/Matplot Machine Learning Libraries
Python - Numpy/Pandas/Matplot Machine Learning Libraries
 
Data visualization
Data visualizationData visualization
Data visualization
 
Python Functions Tutorial | Working With Functions In Python | Python Trainin...
Python Functions Tutorial | Working With Functions In Python | Python Trainin...Python Functions Tutorial | Working With Functions In Python | Python Trainin...
Python Functions Tutorial | Working With Functions In Python | Python Trainin...
 
Zero to Hero - Introduction to Python3
Zero to Hero - Introduction to Python3Zero to Hero - Introduction to Python3
Zero to Hero - Introduction to Python3
 
Essential NumPy
Essential NumPyEssential NumPy
Essential NumPy
 
Data Structures in Python
Data Structures in PythonData Structures in Python
Data Structures in Python
 

Similar to pandas: Powerful data analysis tools for Python

An R primer for SQL folks
An R primer for SQL folksAn R primer for SQL folks
An R primer for SQL folks
Thomas Hütter
 
What's new in pandas and the SciPy stack for financial users
What's new in pandas and the SciPy stack for financial usersWhat's new in pandas and the SciPy stack for financial users
What's new in pandas and the SciPy stack for financial users
Wes McKinney
 
Webinar - Macy’s: Why Your Database Decision Directly Impacts Customer Experi...
Webinar - Macy’s: Why Your Database Decision Directly Impacts Customer Experi...Webinar - Macy’s: Why Your Database Decision Directly Impacts Customer Experi...
Webinar - Macy’s: Why Your Database Decision Directly Impacts Customer Experi...
DataStax
 
Webinar: Introducing the MongoDB Connector for BI 2.0 with Tableau
Webinar: Introducing the MongoDB Connector for BI 2.0 with TableauWebinar: Introducing the MongoDB Connector for BI 2.0 with Tableau
Webinar: Introducing the MongoDB Connector for BI 2.0 with Tableau
MongoDB
 
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Wes McKinney
 
Machine Learning with ML.NET and Azure - Andy Cross
Machine Learning with ML.NET and Azure - Andy CrossMachine Learning with ML.NET and Azure - Andy Cross
Machine Learning with ML.NET and Azure - Andy Cross
Andrew Flatters
 
Building Better Analytics Workflows (Strata-Hadoop World 2013)
Building Better Analytics Workflows (Strata-Hadoop World 2013)Building Better Analytics Workflows (Strata-Hadoop World 2013)
Building Better Analytics Workflows (Strata-Hadoop World 2013)
Wes McKinney
 
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
Big Data Spain
 
Python ml
Python mlPython ml
Python ml
Shubham Sharma
 
Tableau Seattle BI Event How Tableau Changed My Life
Tableau Seattle BI Event How Tableau Changed My LifeTableau Seattle BI Event How Tableau Changed My Life
Tableau Seattle BI Event How Tableau Changed My Life
Russell Spangler
 
Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...
Tech Triveni
 
Dc python meetup
Dc python meetupDc python meetup
Dc python meetup
Jeffrey Clark
 
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
Mark Rittman
 
The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)
The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)
The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)
DataPad Inc.
 
Big Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft AzureBig Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft Azure
Mark Kromer
 
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudFSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
Amazon Web Services
 
DataMass Summit - Machine Learning for Big Data in SQL Server
DataMass Summit - Machine Learning for Big Data  in SQL ServerDataMass Summit - Machine Learning for Big Data  in SQL Server
DataMass Summit - Machine Learning for Big Data in SQL Server
Łukasz Grala
 
Nodes2020 | Graph of enterprise_metadata | NEO4J Conference
Nodes2020 | Graph of enterprise_metadata | NEO4J ConferenceNodes2020 | Graph of enterprise_metadata | NEO4J Conference
Nodes2020 | Graph of enterprise_metadata | NEO4J Conference
Deepak Chandramouli
 
Partner Enablement: Key Differentiators of Denodo Platform 6.0 for the Field
Partner Enablement: Key Differentiators of Denodo Platform 6.0 for the FieldPartner Enablement: Key Differentiators of Denodo Platform 6.0 for the Field
Partner Enablement: Key Differentiators of Denodo Platform 6.0 for the Field
Denodo
 
Postgres Vision 2018: Five Sharding Data Models
Postgres Vision 2018: Five Sharding Data ModelsPostgres Vision 2018: Five Sharding Data Models
Postgres Vision 2018: Five Sharding Data Models
EDB
 

Similar to pandas: Powerful data analysis tools for Python (20)

An R primer for SQL folks
An R primer for SQL folksAn R primer for SQL folks
An R primer for SQL folks
 
What's new in pandas and the SciPy stack for financial users
What's new in pandas and the SciPy stack for financial usersWhat's new in pandas and the SciPy stack for financial users
What's new in pandas and the SciPy stack for financial users
 
Webinar - Macy’s: Why Your Database Decision Directly Impacts Customer Experi...
Webinar - Macy’s: Why Your Database Decision Directly Impacts Customer Experi...Webinar - Macy’s: Why Your Database Decision Directly Impacts Customer Experi...
Webinar - Macy’s: Why Your Database Decision Directly Impacts Customer Experi...
 
Webinar: Introducing the MongoDB Connector for BI 2.0 with Tableau
Webinar: Introducing the MongoDB Connector for BI 2.0 with TableauWebinar: Introducing the MongoDB Connector for BI 2.0 with Tableau
Webinar: Introducing the MongoDB Connector for BI 2.0 with Tableau
 
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
Practical Medium Data Analytics with Python (10 Things I Hate About pandas, P...
 
Machine Learning with ML.NET and Azure - Andy Cross
Machine Learning with ML.NET and Azure - Andy CrossMachine Learning with ML.NET and Azure - Andy Cross
Machine Learning with ML.NET and Azure - Andy Cross
 
Building Better Analytics Workflows (Strata-Hadoop World 2013)
Building Better Analytics Workflows (Strata-Hadoop World 2013)Building Better Analytics Workflows (Strata-Hadoop World 2013)
Building Better Analytics Workflows (Strata-Hadoop World 2013)
 
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
 
Python ml
Python mlPython ml
Python ml
 
Tableau Seattle BI Event How Tableau Changed My Life
Tableau Seattle BI Event How Tableau Changed My LifeTableau Seattle BI Event How Tableau Changed My Life
Tableau Seattle BI Event How Tableau Changed My Life
 
Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...
 
Dc python meetup
Dc python meetupDc python meetup
Dc python meetup
 
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
IlOUG Tech Days 2016 - Unlock the Value in your Data Reservoir using Oracle B...
 
The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)
The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)
The Last Mile: Challenges and Opportunities in Data Tools (Strata 2014)
 
Big Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft AzureBig Data Analytics in the Cloud with Microsoft Azure
Big Data Analytics in the Cloud with Microsoft Azure
 
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the CloudFSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
FSI201 FINRA’s Managed Data Lake – Next Gen Analytics in the Cloud
 
DataMass Summit - Machine Learning for Big Data in SQL Server
DataMass Summit - Machine Learning for Big Data  in SQL ServerDataMass Summit - Machine Learning for Big Data  in SQL Server
DataMass Summit - Machine Learning for Big Data in SQL Server
 
Nodes2020 | Graph of enterprise_metadata | NEO4J Conference
Nodes2020 | Graph of enterprise_metadata | NEO4J ConferenceNodes2020 | Graph of enterprise_metadata | NEO4J Conference
Nodes2020 | Graph of enterprise_metadata | NEO4J Conference
 
Partner Enablement: Key Differentiators of Denodo Platform 6.0 for the Field
Partner Enablement: Key Differentiators of Denodo Platform 6.0 for the FieldPartner Enablement: Key Differentiators of Denodo Platform 6.0 for the Field
Partner Enablement: Key Differentiators of Denodo Platform 6.0 for the Field
 
Postgres Vision 2018: Five Sharding Data Models
Postgres Vision 2018: Five Sharding Data ModelsPostgres Vision 2018: Five Sharding Data Models
Postgres Vision 2018: Five Sharding Data Models
 

More from Wes McKinney

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
Wes McKinney
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
Wes McKinney
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Wes McKinney
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
Wes McKinney
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
Wes McKinney
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
Wes McKinney
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
Wes McKinney
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
Wes McKinney
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
Wes McKinney
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics Stack
Wes McKinney
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Wes McKinney
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science Stack
Wes McKinney
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
Wes McKinney
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
Wes McKinney
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
Wes McKinney
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
Wes McKinney
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
Wes McKinney
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data Science
Wes McKinney
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)
Wes McKinney
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine Learning
Wes McKinney
 

More from Wes McKinney (20)

The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
The Future Roadmap for the Composable Data Stack - Wes McKinney - Data Counci...
 
Solving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache ArrowSolving Enterprise Data Challenges with Apache Arrow
Solving Enterprise Data Challenges with Apache Arrow
 
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise NecessityApache Arrow: Open Source Standard Becomes an Enterprise Necessity
Apache Arrow: Open Source Standard Becomes an Enterprise Necessity
 
Apache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data FrameworkApache Arrow: High Performance Columnar Data Framework
Apache Arrow: High Performance Columnar Data Framework
 
New Directions for Apache Arrow
New Directions for Apache ArrowNew Directions for Apache Arrow
New Directions for Apache Arrow
 
Apache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data TransportApache Arrow Flight: A New Gold Standard for Data Transport
Apache Arrow Flight: A New Gold Standard for Data Transport
 
ACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data FramesACM TechTalks : Apache Arrow and the Future of Data Frames
ACM TechTalks : Apache Arrow and the Future of Data Frames
 
Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020Apache Arrow: Present and Future @ ScaledML 2020
Apache Arrow: Present and Future @ ScaledML 2020
 
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
PyCon Colombia 2020 Python for Data Analysis: Past, Present, and Future
 
Apache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics StackApache Arrow: Leveling Up the Analytics Stack
Apache Arrow: Leveling Up the Analytics Stack
 
Apache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS SessionApache Arrow Workshop at VLDB 2019 / BOSS Session
Apache Arrow Workshop at VLDB 2019 / BOSS Session
 
Apache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science StackApache Arrow: Leveling Up the Data Science Stack
Apache Arrow: Leveling Up the Data Science Stack
 
Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019Ursa Labs and Apache Arrow in 2019
Ursa Labs and Apache Arrow in 2019
 
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
PyCon.DE / PyData Karlsruhe keynote: "Looking backward, looking forward"
 
Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018Apache Arrow at DataEngConf Barcelona 2018
Apache Arrow at DataEngConf Barcelona 2018
 
Apache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory DataApache Arrow: Cross-language Development Platform for In-memory Data
Apache Arrow: Cross-language Development Platform for In-memory Data
 
Apache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory dataApache Arrow -- Cross-language development platform for in-memory data
Apache Arrow -- Cross-language development platform for in-memory data
 
Shared Infrastructure for Data Science
Shared Infrastructure for Data ScienceShared Infrastructure for Data Science
Shared Infrastructure for Data Science
 
Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)Data Science Without Borders (JupyterCon 2017)
Data Science Without Borders (JupyterCon 2017)
 
Memory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine LearningMemory Interoperability in Analytics and Machine Learning
Memory Interoperability in Analytics and Machine Learning
 

pandas: Powerful data analysis tools for Python

  • 1. pandas: Powerful data analysis tools for Python Wes McKinney Lambda Foundry, Inc. @wesmckinn PhillyPUG 3/27/2012
  • 2. Me • Recovering mathematician • 3 years in the quant finance industry • Last 2: statistics + freelance + open source • My new company: Lambda Foundry • High productivity data analysis and research tools for quant finance
  • 3. Me • Blog: http://blog.wesmckinney.com • GitHub: http://github.com/wesm • Twitter: @wesmckinn
  • 4. In the works • Python for Data Analysis: Agile Tools for Real World Data, by Wes McKinney • Pragmatic intro to scientific Python • pandas • Case studies • ETA: Late 2012
  • 5. pandas? • http://pandas.pydata.org • Rich relational data tool built on top of NumPy • Like R’s data.frame on steroids • Excellent performance • Easy-to-use, highly consistent API • A foundation for data analysis in Python
  • 6. pandas • In heavy production use in the financial industry, among others • Generally much better performance than other open source alternatives (e.g. R) • Hope: basis for the “next generation” statistical computing and analysis environment
  • 7. Simplifying data wrangling • Data munging / preparation / cleaning / integration is slow, error prone, and time consuming • Everyone already <3’s Python for data wrangling: pandas takes it to the next level
  • 8.
  • 9. Explosive pandas growth • 10 significant releases since 9/2011 • Hugely increased user base
  • 10. Battle tested • > 98% line coverage as measured by coverage.py • v0.3.0 (2/19/2011): 533 test functions
  • 11. Battle tested • > 98% line coverage as measured by coverage.py • v0.3.0 (2/19/2011): 533 test functions • v0.7.3dev (3/27/2012): >1500 test functions
  • 12. IPython • Simply put: one of the hottest Python projects out there • Tab completion, introspection, interactive debugger, command history • Designed to enhance your productivity in every way. I can’t live without it • IPython HTML notebook is #winning
  • 13. Series • Subclass of numpy.ndarray • Data: any type • Index labels need not be ordered • Duplicates are possible (but result in reduced functionality) (Figure: a Series with index A, B, C, D, E and values 5, 6, 12, -5, 6.7)
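A minimal sketch of the Series described on this slide, using the labels and values from its diagram. It assumes a recent pandas release, whose internals differ from the 2012 version (a modern Series is no longer an ndarray subclass); the `dup` example is hypothetical.

```python
import pandas as pd

# Values and labels taken from the slide's diagram.
s = pd.Series([5, 6, 12, -5, 6.7], index=["A", "B", "C", "D", "E"])

# Label-based lookup of a single value.
value_at_c = s.loc["C"]

# Labels need not be sorted, and duplicates are allowed (hypothetical data).
dup = pd.Series([1, 2, 3], index=["b", "a", "b"])
both_b = dup.loc["b"]  # a Series holding every value labelled "b"
```

With duplicate labels a lookup returns several values instead of one, which is one example of the reduced functionality the slide mentions.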
  • 14. DataFrame • NumPy array-like • Each column can have a different type • Row and column index • Size mutable: insert and delete columns (Figure: a DataFrame with row index A, B, C, D, E and columns foo, bar, baz, qux holding 0, 4, 8, -12, 16; x, y, z, w, a; 2.7, 6, 10, NA, 18; True, True, False, False, False)
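The DataFrame in this slide's diagram can be rebuilt roughly as follows. This is a sketch for a recent pandas release; the pairing of values with column names is read off the flattened figure and is an assumption, and the `total` column is hypothetical.

```python
import pandas as pd

# Each column has its own type: integers, strings, floats (with a missing value), booleans.
df = pd.DataFrame(
    {
        "foo": [0, 4, 8, -12, 16],
        "bar": ["x", "y", "z", "w", "a"],
        "baz": [2.7, 6, 10, None, 18],
        "qux": [True, True, False, False, False],
    },
    index=["A", "B", "C", "D", "E"],
)

# Size mutable: insert a derived column, then delete an existing one.
df["total"] = df["foo"] + df["baz"]  # hypothetical derived column
del df["qux"]
```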
  • 15. DataFrame In [10]: tips[:10] Out[10]:
          total_bill   tip     sex  smoker  day    time  size
      1        16.99  1.01  Female      No  Sun  Dinner     2
      2        10.34  1.66    Male      No  Sun  Dinner     3
      3        21.01  3.50    Male      No  Sun  Dinner     3
      4        23.68  3.31    Male      No  Sun  Dinner     2
      5        24.59  3.61  Female      No  Sun  Dinner     4
      6        25.29  4.71    Male      No  Sun  Dinner     4
      7        8.770  2.00    Male      No  Sun  Dinner     2
      8        26.88  3.12    Male      No  Sun  Dinner     4
      9        15.04  1.96    Male      No  Sun  Dinner     2
      10       14.78  3.23    Male      No  Sun  Dinner     2
  • 16. DataFrame • Axis indexing enables rich data alignment, joins / merges, reshaping, selection, etc.
      day              Fri    Sat    Sun   Thur
      sex    smoker
      Female No      3.125  2.725  3.329  2.460
             Yes     2.683  2.869  3.500  2.990
      Male   No      2.500  3.257  3.115  2.942
             Yes     2.741  2.879  3.521  3.058
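A table shaped like the one above (rows indexed by sex and smoker, columns by day) can be produced with a pivot table. The sketch below uses a tiny made-up frame rather than the real tips data, and it assumes the aggregated quantity is the mean tip, which the slide does not state.

```python
import pandas as pd

# Hypothetical stand-in for the tips data set.
tips = pd.DataFrame(
    {
        "sex": ["Female", "Female", "Male", "Male", "Male", "Female"],
        "smoker": ["No", "Yes", "No", "No", "Yes", "No"],
        "day": ["Sun", "Sat", "Sun", "Sun", "Sat", "Sun"],
        "tip": [1.0, 2.0, 3.0, 5.0, 4.0, 2.0],
    }
)

# Rows get a two-level (sex, smoker) index; each distinct day becomes a column.
table = tips.pivot_table(
    values="tip", index=["sex", "smoker"], columns="day", aggfunc="mean"
)
```

Combinations with no observations, such as (Female, No) on Sat here, come out as missing values.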
  • 17. Axis indexing, the special pandas-flavored sauce • Enables “alignment-free” programming • Prevents major source of data munging frustration and errors • Fast data selection • Powerful way of describing reshape / join / merge / pivot-table operations
  • 18. Data alignment • Binary operations are joins! (Figure: a Series with B=1, C=2, D=3, E=4 plus a Series with A=0, B=1, C=2, D=3 equals A=NA, B=2, C=4, D=6, E=NA)
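The addition in this slide's figure can be reproduced directly. The sketch assumes a recent pandas release; the variable names are illustrative.

```python
import pandas as pd

left = pd.Series([1, 2, 3, 4], index=["B", "C", "D", "E"])
right = pd.Series([0, 1, 2, 3], index=["A", "B", "C", "D"])

# Addition aligns on labels (an outer join of the two indexes);
# labels present on only one side produce a missing value.
total = left + right

# Treat a label missing from one side as 0 instead of propagating NA.
filled = left.add(right, fill_value=0)
```

Here `total` has NA at A and E and the values 2, 4, 6 at B, C, D, matching the figure.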
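  • 19. GroupBy (Figure: split-apply-combine. Rows keyed A, B, C, A, B, C, A, B, C with values 0, 5, 10, 5, 10, 15, 10, 15, 20 are split by key into A: 0, 5, 10; B: 5, 10, 15; C: 10, 15, 20; sum is applied to each group; the results are combined into A=15, B=30, C=45)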
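The split-apply-combine picture can be sketched with the keys and values from the figure. This assumes a recent pandas release; the column names `key` and `value` are hypothetical.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "key": ["A", "B", "C", "A", "B", "C", "A", "B", "C"],
        "value": [0, 5, 10, 5, 10, 15, 10, 15, 20],
    }
)

# One call: split by key, apply sum to each group, combine the results.
sums = df.groupby("key")["value"].sum()

# The same three steps spelled out by hand.
pieces = {key: group["value"].sum() for key, group in df.groupby("key")}
manual = pd.Series(pieces)
```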
  • 20. Hierarchical indexes • Semantics: a tuple at each tick • Enables easy group selection • Terminology: “multiple levels” • Natural part of GroupBy and reshape operations (Figure: a two-level index with outer label A over inner labels 1, 2, 3 and outer label B over inner labels 1, 2, 3, 4)
  • 21. Hierarchical indexes • Semantics: a tuple at each tick • Enables easy group selection • Terminology: “multiple levels” • Natural part of GroupBy and reshape operations (Figure: the same two-level index, with braces marking the A group and the B group)
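A small sketch of the two-level index in the figure, assuming a recent pandas release. The level names `outer` and `inner` and the stored values are hypothetical.

```python
import pandas as pd

# "A tuple at each tick": every label is an (outer, inner) pair.
index = pd.MultiIndex.from_tuples(
    [("A", 1), ("A", 2), ("A", 3), ("B", 1), ("B", 2), ("B", 3), ("B", 4)],
    names=["outer", "inner"],
)
s = pd.Series(range(7), index=index)

# Easy group selection: everything under outer label "A".
group_a = s.loc["A"]

# A full tuple selects a single value.
one = s.loc[("B", 2)]

# Reshape: move the inner level into the columns.
wide = s.unstack("inner")
```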
  • 22. Let’s have a little fun To the IPython Notebook!
  • 23. What’s in pandas? • A big library: 40k SLOC
  • 24. Tests! • Huge accumulation of use cases originating in real world applications • 68 lines of tests for every 100 lines of code
  • 25.
  • 26. pandas.core • Data structures • Series (1D) • DataFrame (2D) • Panel (3D) • NA-friendly statistics • Index implementations / label-indexing
  • 27. pandas.core • GroupBy engine • Time series tools • Date range generation • Extensible date offsets • Hierarchical indexing stuff
  • 28. Elsewhere • Join / concatenation algorithms • Sparse versions of Series, DataFrame... • IO tools: CSV files, HDF5, Excel 2003/2007 • Moving window statistics (rolling mean, ...) • Pivot tables • High level matplotlib interface
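As an illustration of the moving window statistics listed above, here is a rolling mean over a short series. The sketch uses the `rolling` method from recent pandas releases, which is not the interface that existed when this talk was given; the data is made up.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# Mean over a sliding window of three observations; the first two
# positions have too few observations and come out as missing.
rolling_mean = s.rolling(window=3).mean()
```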
  • 29. Hmm, pandas/src • ~6000 lines of mostly Cython code • Fast data algorithms that power the library and make it fast • pandas in PyPy?
  • 30. Ok, so why Python? • Look around you! • Build a superior data analysis and statistical computing environment • Build mission-critical, data-driven production systems
  • 32. The pandas roadmap • Improved time series capabilities • Port GroupBy engine to NumPy only • Better integration with statsmodels and scikit-learn • R integration via rpy2
  • 33. The pandas roadmap • Integration with JavaScript visualization frameworks: D3, Flot, others • Alternate DataFrame “backends” • Memory maps • HDF5 / PyTables • SQL or NoSQL-backed • Tighter IPython Notebook integration
  • 34. ggplot2 for Python • We need to build a better interface for creating statistical graphics in Python • Use pandas as the base layer! • Upcoming project from Peter Wang: bokeh
  • 35. pandas for “Big Data” • Quite common to need to process larger-than-RAM data sets • Alternate DataFrame backends are the likely solution • Ripe for integration with MapReduce frameworks
  • 36. Better time series • Integration of scikits.timeseries codebase • NumPy datetime64 dtype • Higher performance, less memory
  • 37. Better time series • Fixed frequency handling • Time zones • Multiple time concepts • Intervals: 1984, or “1984 Q4” • Timestamps: moment in time, to micro- or nanosecond resolution
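The distinction between intervals and timestamps can be sketched with the `Period` and `Timestamp` types found in recent pandas releases. These postdate the talk, so this shows how the roadmap items look today rather than the API as it stood in 2012.

```python
import pandas as pd

# Intervals: a span of time such as a year or a quarter.
year = pd.Period("1984")
quarter = pd.Period("1984Q4")

# Timestamps: a single moment, here with microsecond detail.
ts = pd.Timestamp("2012-03-27 12:00:00.000001")

# A timestamp falls inside exactly one interval of a given frequency.
containing_quarter = ts.to_period("Q")

# Time zones: attach one to a naive timestamp.
ts_utc = ts.tz_localize("UTC")
```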
  • 38. Thanks! • Follow me on Twitter: @wesmckinn • pydata/pandas on GitHub!