Getting started with pandas
1. Getting started with Pandas
Maik Röder
Barcelona Python Meetup Group
17.05.2012
Friday, May 18, 2012
2. Pandas
• Python data analysis library
• Built on top of Numpy
• Panel Data System
• Open Sourced by AQR Capital
Management, LLC in late 2009
• 30,000 lines of tested Python/Cython code
• Used in production in many companies
3. The ideal tool for data scientists
• Munging data
• Cleaning data
• Analyzing
• Modeling data
• Organizing the results of the analysis into a
form suitable for plotting or tabular display
4. Installation
• Install Python 2.6.8 or later
• Current versions:
• Numpy 1.6.1 and Pandas 0.7.3
• Recommendation: Install with pip
pip install numpy
pip install pandas
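A quick way to confirm that the installation worked is to import both packages and print their versions. This is only a minimal sketch; the versions printed depend on what pip installed, and the `np`/`pd` aliases are a common convention rather than something these slides use.

```python
import numpy as np
import pandas as pd

# Both packages expose their version as a string attribute.
versions = {"numpy": np.__version__, "pandas": pd.__version__}
for name, version in sorted(versions.items()):
    print(name, version)
```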
5. Axis Indexing
• Every axis has an index
• Highly optimized data structure
• Hierarchical indexing
• group by and join-type operations
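Hierarchical indexing and group-by operations on an index could look roughly like the sketch below. The sales figures and the level names `city` and `year` are made up for illustration, and the code assumes a reasonably recent pandas release (it uses `.loc`, which is newer than the version these slides were written for).

```python
import pandas as pd

# Hypothetical data: one value per (city, year) pair.
index = pd.MultiIndex.from_tuples(
    [("Barcelona", 2011), ("Barcelona", 2012), ("Madrid", 2011), ("Madrid", 2012)],
    names=["city", "year"],
)
sales = pd.Series([10.0, 12.0, 7.0, 9.0], index=index)

# Selecting on the outer level returns a Series indexed by the inner level.
barcelona = sales.loc["Barcelona"]

# Group by one level of the index and aggregate each group.
total_by_city = sales.groupby(level="city").sum()
```

Here `barcelona` is indexed by year only, and `total_by_city` holds one sum per city.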
6. Series data structure
• 1-dimensional
import numpy as np
randn = np.random.randn
from pandas import *
s = Series(randn(3),
index=['a','b','c'])
s
a -0.889880
b 1.102135
c -2.187296
7. Series to/from dict
d = dict(s)
{'a': -0.88988001423312313,
'c': -2.1872960440695666,
'b': 1.1021347373670938}
Series(d)
a -0.889880
b 1.102135
c -2.187296
• Index comes from sorted dictionary keys
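The key ordering described above reflects pandas as of these slides. More recent releases, as far as I know, keep the dictionary's insertion order instead, so a sketch that does not rely on either behaviour can sort the index explicitly. The values below are copied from the slide; `round_trip` is just an illustrative name.

```python
import pandas as pd

d = {"c": -2.187296, "a": -0.889880, "b": 1.102135}

# sort_index() gives label order regardless of how the dict was ordered.
s = pd.Series(d).sort_index()

# to_dict() converts the Series back to a plain dictionary.
round_trip = s.to_dict()
```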
8. Reindexing labels
>>> s
a -0.496848
b 0.607173
c -1.570596
>>> s.reindex(['c','b','a'])
c -1.570596
b 0.607173
a -0.496848
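Reindexing can also introduce labels that were not in the original index. The sketch below, which assumes a recent pandas release and adds a hypothetical label `'d'`, shows that such entries come back as NaN unless a `fill_value` is given.

```python
import numpy as np
import pandas as pd

s = pd.Series([-0.496848, 0.607173, -1.570596], index=["a", "b", "c"])

# 'd' is not in the original index, so its value is NaN.
with_gap = s.reindex(["c", "b", "a", "d"])

# fill_value is used for the missing entry instead of NaN.
filled = s.reindex(["c", "b", "a", "d"], fill_value=0.0)
```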
9. Vectorization
>>> s + s
a -1.779760
b 2.204269
c -4.374592
>>> np.exp(s)
a 0.410705
b 3.010586
c 0.112220
• Series work with Numpy
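One detail worth illustrating is that arithmetic between two Series aligns on labels rather than positions. The sketch below uses two made-up Series with partly overlapping indexes; it is not taken from the slides.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# Addition aligns on labels; labels present in only one Series give NaN.
total = s1 + s2

# NumPy ufuncs apply element-wise and keep the index.
squares = np.square(s1)
```

Only `'b'` and `'c'` appear in both inputs, so only those two entries of `total` hold a number.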
10. Structured Data
• Data that can be represented as tables
• rows and columns
• Each row is a different object
• Columns represent attributes of the object
11. Structured data
• Like SQL Table or Excel Sheet
• Heterogeneous columns, but each column
homogeneously typed
• Row and column-oriented operations
• Axis meta data
• Seamless integration with Python data
structures and Numpy
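As a rough illustration of these points, the sketch below builds a small table from plain Python lists. The names, ages and heights are invented, and the code assumes a recent pandas release.

```python
import pandas as pd

# Hypothetical table: each row is one person, each column one attribute.
people = pd.DataFrame(
    {
        "name": ["Ana", "Jordi", "Marta"],
        "age": [31, 45, 27],
        "height_m": [1.65, 1.80, 1.72],
    }
)

# Column-oriented: one column is a Series with a single dtype.
mean_age = people["age"].mean()

# Row-oriented: one row, selected by position, mixes the column types.
first_row = people.iloc[0]
```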
12. DataFrame data structure
• Like data.frame in R
• 2-dimensional tabular data structure
• Data manipulation with integrated indexing
• Supports heterogeneous columns
• Each column is homogeneously typed
13. DataFrame
>>> d = {'one': s*s,
...      'two': s+s}
>>> df = DataFrame(d)
>>> df
one two
a 0.791886 -1.779760
b 1.214701 2.204269
c 4.784264 -4.374592
14. DataFrame add column
>>> s
a -0.889880
b 1.102135
c -2.187296
>>> df['three'] = s * 3
>>> df
one two three
a 0.791886 -1.779760 -2.669640
b 1.214701 2.204269 3.306404
c 4.784264 -4.374592 -6.561888
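The same assignment syntax also works for columns derived from other columns, and `del` removes a column again. The sketch below rebuilds the frame from the slide values so that it runs on its own; the column name `sum` is made up for illustration.

```python
import pandas as pd

s = pd.Series([-0.889880, 1.102135, -2.187296], index=["a", "b", "c"])
df = pd.DataFrame({"one": s * s, "two": s + s})

# Assigning to a new key adds a column, aligned on the index.
df["three"] = s * 3

# A new column can be computed from existing ones.
df["sum"] = df["one"] + df["two"]

# del removes a column in place.
del df["two"]
```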
15. Select row by label
>>> row = df.xs('a')
>>> row
one 0.791886
two -1.779760
three -2.669640
Name: a
>>> type(row)
<class 'pandas.core.series.Series'>
>>> df.dtypes
one float64
two float64
three float64
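In more recent pandas releases, label-based and position-based selection is usually written with `.loc` and `.iloc`. These accessors are not shown on the slides, so treat the following as a sketch under that assumption; the frame reuses the values from the earlier slide.

```python
import pandas as pd

df = pd.DataFrame(
    {
        "one": [0.791886, 1.214701, 4.784264],
        "two": [-1.779760, 2.204269, -4.374592],
    },
    index=["a", "b", "c"],
)

row_by_label = df.loc["a"]      # same row that df.xs('a') returns
row_by_position = df.iloc[0]    # first row, regardless of its label
column = df["one"]              # a column is also a Series
```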
16. Descriptive statistics
>>> df.mean()
one 2.263617
two -1.316694
three -1.975041
• Also: count, sum, median, min, max, abs, prod,
std, var, skew, kurt, quantile, cumsum,
cumprod, cummax, cummin
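These methods work column by column by default. The sketch below uses a small made-up frame to show the `axis` argument and the `describe()` summary, assuming a recent pandas release.

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "two": [10.0, 20.0, 30.0]},
    index=["a", "b", "c"],
)

col_means = df.mean()          # one value per column
row_means = df.mean(axis=1)    # one value per row
summary = df.describe()        # count, mean, std, min, quartiles, max
```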
18. This and much more...
• Group by: split-apply-combine
• Merge, join and aggregate
• Reshaping and Pivot Tables
• Time Series / Date functionality
• Plotting with matplotlib
• IO Tools (Text, CSV, HDF5, ...)
• Sparse data structures
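To give a flavour of the first three items, here is a minimal sketch of a group-by, a merge and a pivot table. The `orders` and `cities` tables are hypothetical, and the code assumes a recent pandas release.

```python
import pandas as pd

# Hypothetical data, invented for this example.
orders = pd.DataFrame(
    {
        "customer": ["ana", "ana", "jordi", "jordi"],
        "product": ["tea", "coffee", "tea", "tea"],
        "amount": [3.0, 4.0, 2.0, 5.0],
    }
)
cities = pd.DataFrame(
    {"customer": ["ana", "jordi"], "city": ["Barcelona", "Girona"]}
)

# Split-apply-combine: split by customer, sum each group, combine the results.
total_per_customer = orders.groupby("customer")["amount"].sum()

# Join-type operation: attach each customer's city to their orders.
merged = orders.merge(cities, on="customer", how="left")

# Pivot table: customers as rows, products as columns, summed amounts as cells.
pivot = orders.pivot_table(
    index="customer", columns="product", values="amount", aggfunc="sum"
)
```

Combinations that never occur in `orders`, such as jordi buying coffee, show up as NaN in `pivot`.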
19. Resources
• http://pypi.python.org/pypi/pandas
• http://code.google.com/p/pandas