Getting started with pandas
1. Getting started with Pandas
Maik Röder
Barcelona Python Meetup Group
17.05.2012
Friday, May 18, 2012
2. Pandas
• Python data analysis library
• Built on top of NumPy
• Panel Data System
• Open sourced by AQR Capital Management, LLC in late 2009
• 30,000 lines of tested Python/Cython code
• Used in production in many companies
3. The ideal tool for data scientists
• Munging data
• Cleaning data
• Analyzing
• Modeling data
• Organizing the results of the analysis into a form suitable for plotting or tabular display
4. Installation
• Install Python 2.6.8 or later
• Current versions (as of May 2012):
• NumPy 1.6.1 and Pandas 0.7.3
• Recommendation: Install with pip
pip install numpy
pip install pandas
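A quick way to confirm that the installation worked is to import both libraries and print their versions. This is only a sketch; the `np` and `pd` aliases are common conventions, not something these slides prescribe.

```python
import numpy as np
import pandas as pd

# Collect the installed versions; the exact numbers will differ from
# the 2012 releases named on this slide.
versions = {"numpy": np.__version__, "pandas": pd.__version__}

for name, version in versions.items():
    print(name, version)
```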
5. Axis Indexing
• Every axis has an index
• Highly optimized data structure
• Hierarchical indexing
• group by and join-type operations
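Hierarchical indexing and group-by can be illustrated together. The sketch below assumes a recent pandas release, and the city/year data is made up for illustration.

```python
import pandas as pd

# Hypothetical data: visitor counts keyed by (city, year).
index = pd.MultiIndex.from_tuples(
    [("Barcelona", 2011), ("Barcelona", 2012),
     ("Madrid", 2011), ("Madrid", 2012)],
    names=["city", "year"],
)
visits = pd.Series([10, 20, 30, 40], index=index)

# Select everything under one label of the outer level.
barcelona = visits.loc["Barcelona"]

# Group on one level of the index and aggregate.
per_city = visits.groupby(level="city").sum()
print(per_city)
```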
6. Series data structure
• 1-dimensional
import numpy as np
from pandas import Series, DataFrame

randn = np.random.randn
s = Series(randn(3), index=['a', 'b', 'c'])
s
a -0.889880
b 1.102135
c -2.187296
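A Series can be read by label or by position. The sketch below assumes a recent pandas and NumPy; the seeded generator is an addition for reproducibility and is not part of the original slide.

```python
import numpy as np
import pandas as pd

# Seeded generator so repeated runs give the same values (an assumption
# added here; the slide uses unseeded np.random.randn).
rng = np.random.default_rng(0)
s = pd.Series(rng.standard_normal(3), index=["a", "b", "c"])

first_by_label = s["a"]        # lookup by index label
first_by_position = s.iloc[0]  # lookup by integer position
print(first_by_label, first_by_position)
```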
7. Series to/from dict
d = dict(s)
d
{'a': -0.88988001423312313,
'c': -2.1872960440695666,
'b': 1.1021347373670938}
Series(d)
a -0.889880
b 1.102135
c -2.187296
• Index comes from sorted dictionary keys (in Pandas 0.7.3; recent versions keep the dict's insertion order)
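The round trip can be sketched with `Series.to_dict()`, using the values from the slide. Calling `sort_index()` explicitly is an assumption added here so that the result does not depend on how a given pandas version orders dictionary keys.

```python
import pandas as pd

s = pd.Series([-0.889880, 1.102135, -2.187296], index=["a", "b", "c"])

# Series -> dict: labels become keys.
d = s.to_dict()

# dict -> Series: sort explicitly instead of relying on key order.
restored = pd.Series(d).sort_index()
print(restored)
```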
8. Reindexing labels
>>> s
a -0.496848
b 0.607173
c -1.570596
>>> s.reindex(['c','b','a'])
c -1.570596
b 0.607173
a -0.496848
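Reindexing can also introduce labels that are not in the original index. A minimal sketch, assuming a recent pandas: missing labels get NaN unless a `fill_value` is given. The label 'd' is made up for illustration.

```python
import pandas as pd

s = pd.Series([-0.496848, 0.607173, -1.570596], index=["a", "b", "c"])

# Same labels, new order: values follow their labels.
reordered = s.reindex(["c", "b", "a"])

# A label that does not exist yet ('d') is filled with NaN...
extended = s.reindex(["a", "b", "c", "d"])

# ...or with an explicit fill value.
filled = s.reindex(["a", "b", "c", "d"], fill_value=0.0)
print(extended)
```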
9. Vectorization
>>> s + s
a -1.779760
b 2.204269
c -4.374592
>>> np.exp(s)
a 0.410705
b 3.010586
c 0.112220
• Series work with NumPy functions
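Vectorized arithmetic matches values by label, not by position. The sketch below uses two made-up Series with partly overlapping indexes to show this; labels present in only one operand yield NaN.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# Addition aligns on labels; 'a' and 'd' have no partner and become NaN.
total = s1 + s2

# NumPy functions applied to a Series return a Series with the same index.
exponentials = np.exp(s1)
print(total)
```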
10. Structured Data
• Data that can be represented as tables
• rows and columns
• Each row is a different object
• Columns represent attributes of the object
11. Structured data
• Like SQL Table or Excel Sheet
• Heterogeneous columns, but each column homogeneously typed
• Row and column-oriented operations
• Axis metadata
• Seamless integration with Python data structures and NumPy
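Row- and column-oriented operations can be sketched with the `axis` argument. The table below is hypothetical, and the example assumes a recent pandas.

```python
import pandas as pd

# Hypothetical table: two regions (rows) by two quarters (columns).
table = pd.DataFrame(
    {"q1": [1, 2], "q2": [3, 4]},
    index=["north", "south"],
)

# axis=0 collapses the rows: one total per column.
column_totals = table.sum(axis=0)

# axis=1 collapses the columns: one total per row.
row_totals = table.sum(axis=1)
print(column_totals)
print(row_totals)
```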
12. DataFrame data structure
• Like data.frame in R
• 2-dimensional tabular data structure
• Data manipulation with integrated indexing
• Supports heterogeneous columns
• Each column homogeneously typed
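The difference between heterogeneous columns and homogeneously typed columns shows up in `dtypes`. A small sketch with made-up data, assuming a recent pandas:

```python
import pandas as pd

# Hypothetical table: columns differ in type, but each column has one type.
people = pd.DataFrame(
    {
        "name": ["Ana", "Ben", "Eva"],
        "age": [31, 25, 40],
        "height_m": [1.70, 1.82, 1.65],
    },
    index=["p1", "p2", "p3"],
)

# One dtype per column.
column_types = people.dtypes
print(column_types)
```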
13. DataFrame
>>> d = {'one': s*s,
...      'two': s+s}
>>> df = DataFrame(d)
>>> df
one two
a 0.791886 -1.779760
b 1.214701 2.204269
c 4.784264 -4.374592
14. DataFrame add column
>>> s
a -0.889880
b 1.102135
c -2.187296
>>> df['three'] = s * 3
>>> df
one two three
a 0.791886 -1.779760 -2.669640
b 1.214701 2.204269 3.306404
c 4.784264 -4.374592 -6.561888
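Columns can also be derived from other columns or removed again. The sketch below rebuilds the DataFrame from the slides with a recent pandas; the `flag` column is a made-up example.

```python
import pandas as pd

s = pd.Series([-0.889880, 1.102135, -2.187296], index=["a", "b", "c"])
df = pd.DataFrame({"one": s * s, "two": s + s})

# Add a column computed from a Series that shares the same index.
df["three"] = s * 3

# Add a boolean column derived from an existing one (hypothetical example).
df["flag"] = df["one"] > 1.0

# Remove a column in place.
del df["two"]
print(df)
```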
15. Select row by label
>>> row = df.xs('a')
>>> row
one 0.791886
two -1.779760
three -2.669640
Name: a
>>> type(row)
<class 'pandas.core.series.Series'>
>>> df.dtypes
one float64
two float64
three float64
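In recent pandas releases, rows are more commonly selected with the `.loc` (label) and `.iloc` (position) indexers. This sketch assumes such a release and reuses two of the columns from the slides.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "one": [0.791886, 1.214701, 4.784264],
        "two": [-1.779760, 2.204269, -4.374592],
    },
    index=["a", "b", "c"],
)

row_by_label = df.loc["a"]      # select by index label
row_by_position = df.iloc[0]    # select by integer position
print(row_by_label)
```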
16. Descriptive statistics
>>> df.mean()
one 2.263617
two -1.316694
three -1.975041
• Also: count, sum, median, min, max, abs, prod, std, var, skew, kurt, quantile, cumsum, cumprod, cummax, cummin
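The statistics work column by column by default, and along rows with `axis=1`. The sketch below assumes a recent pandas and reuses two columns from the slides; `describe()` bundles several of the listed statistics into one table.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "one": [0.791886, 1.214701, 4.784264],
        "two": [-1.779760, 2.204269, -4.374592],
    },
    index=["a", "b", "c"],
)

column_means = df.mean()      # one value per column
row_means = df.mean(axis=1)   # one value per row
summary = df.describe()       # count, mean, std, min, quartiles, max
print(summary)
```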
18. This and much more...
• Group by: split-apply-combine
• Merge, join and aggregate
• Reshaping and Pivot Tables
• Time Series / Date functionality
• Plotting with matplotlib
• IO Tools (Text, CSV, HDF5, ...)
• Sparse data structures
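Two of these features, group by and merge, can be sketched in a few lines. The data is made up for illustration and the example assumes a recent pandas.

```python
import pandas as pd

# Hypothetical sales records.
sales = pd.DataFrame(
    {
        "city": ["Barcelona", "Madrid", "Barcelona", "Madrid"],
        "amount": [10, 20, 30, 40],
    }
)

# Split by city, apply sum, combine into one Series.
totals = sales.groupby("city")["amount"].sum()

# Hypothetical lookup table, joined to the sales on the shared 'city' column.
regions = pd.DataFrame(
    {
        "city": ["Barcelona", "Madrid"],
        "region": ["Catalonia", "Community of Madrid"],
    }
)
merged = pd.merge(sales, regions, on="city")
print(totals)
print(merged)
```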
19. Resources
• http://pypi.python.org/pypi/pandas
• http://code.google.com/p/pandas