Getting started with
                             Pandas

                                    Maik Röder
                          Barcelona Python Meetup Group
                                    17.05.2012




Friday, May 18, 2012
Pandas
                       • Python data analysis library
                       • Built on top of NumPy
                       • Panel Data System
                       • Open Sourced by AQR Capital
                         Management, LLC in late 2009
                       • 30,000 lines of tested Python/Cython code
                       • Used in production in many companies

The ideal tool for data
                             scientists
                       • Munging data
                       • Cleaning data
                       • Analyzing data
                       • Modeling data
                       • Organizing the results of the analysis into a
                         form suitable for plotting or tabular display


Installation
                       • Install Python 2.6.8 or later
                       • Current versions:
                        • NumPy 1.6.1 and Pandas 0.7.3
                       • Recommendation: Install with pip
                         pip install numpy
                         pip install pandas
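
A quick way to confirm that both packages installed correctly is to import them and print their versions. This is a minimal sketch; the version strings you see depend on your environment and will differ from the 2012 releases listed above.

```python
import numpy as np
import pandas as pd

# Each package exposes its version as a plain string.
print("NumPy ", np.__version__)
print("pandas", pd.__version__)
```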



Axis Indexing

                       • Every axis has an index
                       • Highly optimized data structure
                       • Hierarchical indexing
                       • group by and join-type operations
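
The bullets above can be illustrated with a small hierarchical (two-level) index. This is a sketch that assumes a recent pandas release; the city and year labels and the numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: outer level "city", inner level "year".
index = pd.MultiIndex.from_tuples(
    [("Barcelona", 2011), ("Barcelona", 2012),
     ("Madrid", 2011), ("Madrid", 2012)],
    names=["city", "year"],
)
visitors = pd.Series([10, 12, 7, 9], index=index)

# Selecting on the outer level returns a Series indexed by the inner level.
barcelona = visitors["Barcelona"]

# Group-by on an index level: one total per city.
totals = visitors.groupby(level="city").sum()

print(barcelona)
print(totals)
```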


Series data structure
              • 1-dimensional
                       >>> import numpy as np
                       >>> from pandas import Series, DataFrame
                       >>> randn = np.random.randn  # draws standard-normal samples
                       >>> s = Series(randn(3),
                       ...            index=['a','b','c'])
                       >>> s
                       a   -0.889880
                       b    1.102135
                       c   -2.187296
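
Once a Series has labels, values can be looked up by label or by position. The sketch below reuses the values shown above as fixed numbers and assumes a recent pandas release (`.iloc` is not available in 0.7.x).

```python
import pandas as pd

s = pd.Series([-0.889880, 1.102135, -2.187296], index=["a", "b", "c"])

by_label = s["b"]         # look up a single value by index label
by_position = s.iloc[0]   # look up a single value by integer position
subset = s[["a", "c"]]    # a list of labels returns a new Series
negative = s[s < 0]       # a boolean mask keeps the matching labels

print(by_label, by_position)
print(subset)
print(negative)
```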


Series to/from dict
                       >>> d = dict(s)
                       >>> d
                       {'a': -0.88988001423312313,
                        'c': -2.1872960440695666,
                        'b': 1.1021347373670938}
                       >>> Series(d)
                       a   -0.889880
                       b    1.102135
                       c   -2.187296
                 • Index comes from the sorted dictionary keys
                   (Pandas 0.7; recent releases keep the dict's insertion order)
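
The round trip can also be sketched with `Series.to_dict()`. Passing an explicit `index` to the constructor selects and orders the keys; a label missing from the dict becomes NaN. The label `'z'` is a made-up example, and the sketch assumes a recent pandas release.

```python
import pandas as pd

s = pd.Series([-0.889880, 1.102135, -2.187296], index=["a", "b", "c"])

d = s.to_dict()         # plain dict mapping label -> value
restored = pd.Series(d) # back to a Series

# An explicit index picks keys from the dict; 'z' is not a key, so it is NaN.
picked = pd.Series(d, index=["c", "a", "z"])

print(restored)
print(picked)
```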
Reindexing labels
                       >>> s
                       a   -0.496848
                       b    0.607173
                       c   -1.570596
                       >>> s.reindex(['c','b','a'])
                       c   -1.570596
                       b    0.607173
                       a   -0.496848
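
Reindexing can also introduce labels that are not in the original index. A minimal sketch, assuming a recent pandas release and using made-up values: the new label `'d'` gets NaN unless a `fill_value` is given.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])

# 'd' does not exist in s, so its value is NaN after reindexing.
reordered = s.reindex(["c", "b", "a", "d"])

# fill_value replaces the NaN for labels that were missing.
filled = s.reindex(["c", "b", "a", "d"], fill_value=0.0)

print(reordered)
print(filled)
```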


Vectorization
                       >>> s + s
                       a   -1.779760
                       b    2.204269
                       c   -4.374592
                       >>> np.exp(s)
                       a    0.410705
                       b    3.010586
                       c    0.112220
                 • Series work with NumPy functions
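
Vectorized arithmetic between two Series matches values by label, not by position. The sketch below uses made-up values and assumes a recent pandas release: labels present in only one operand produce NaN, and `Series.add` with `fill_value` treats the missing side as a given number instead.

```python
import numpy as np
import pandas as pd

s1 = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
s2 = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

total = s1 + s2                      # 'a' and 'd' appear on one side only -> NaN
filled = s1.add(s2, fill_value=0.0)  # missing side counts as 0.0
roots = np.sqrt(s1)                  # NumPy ufuncs return a Series with the same index

print(total)
print(filled)
print(roots)
```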
Structured Data
          • Data that can be represented as tables
           • rows and columns
          • Each row is a different object
          • Columns represent attributes of the object




Structured data
                       • Like SQL Table or Excel Sheet
                       • Heterogeneous columns, but each column
                         homogeneously typed
                       • Row and column-oriented operations
                       • Axis metadata
                       • Seamless integration with Python data
                         structures and NumPy


DataFrame data structure

                       • Like data.frame in R
                       • 2-dimensional tabular data structure
                       • Data manipulation with integrated indexing
                       • Supports heterogeneous columns
                       • Each column is homogeneously typed
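
The point about column types can be seen in a small table. This is a sketch with made-up column names and values, assuming a recent pandas release: the columns differ in type from one another, but each column has a single dtype.

```python
import pandas as pd

# Hypothetical table: one text column, one integer column, one float column.
df = pd.DataFrame(
    {
        "name": ["a", "b", "c"],
        "count": [1, 2, 3],
        "score": [0.5, 1.5, 2.5],
    }
)

# One dtype per column; the exact dtype names depend on platform and version.
print(df.dtypes)
```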

DataFrame

                       >>> d = {'one': s*s,
                       ...      'two': s+s}
                       >>> df = DataFrame(d)
                       >>> df
                               one       two
                       a  0.791886 -1.779760
                       b  1.214701  2.204269
                       c  4.784264 -4.374592



DataFrame: add a column
                       >>> s
                       a   -0.889880
                       b    1.102135
                       c   -2.187296
                       >>> df['three'] = s * 3
                       >>> df
                               one       two     three
                       a  0.791886 -1.779760 -2.669640
                       b  1.214701  2.204269  3.306404
                       c  4.784264 -4.374592 -6.561888
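
Column assignment also accepts expressions over existing columns and plain scalars, and `del` removes a column again. A minimal sketch, assuming a recent pandas release; the column names `flag` and `const` are made up for illustration.

```python
import pandas as pd

s = pd.Series([-0.889880, 1.102135, -2.187296], index=["a", "b", "c"])
df = pd.DataFrame({"one": s * s, "two": s + s})

df["three"] = s * 3          # new column from a Series, aligned on the index
df["flag"] = df["one"] > 1   # new column derived from an existing column
df["const"] = 1.0            # a scalar is repeated for every row
del df["const"]              # remove a column in place

print(df)
```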
Select row by label
                 >>> row = df.xs('a')
                 >>> row
                 one      0.791886
                 two     -1.779760
                 three   -2.669640
                 Name: a
                 >>> type(row)
                 <class 'pandas.core.series.Series'>
                 >>> df.dtypes
                 one      float64
                 two      float64
                 three    float64
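
In later pandas releases the same row selection is usually written with `.loc` (by label) or `.iloc` (by position). The sketch below assumes such a release and uses made-up values; `xs` is shown alongside for comparison.

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "two": [4.0, 5.0, 6.0]},
    index=["a", "b", "c"],
)

row_xs = df.xs("a")        # cross-section by label, as on the slide
row_loc = df.loc["a"]      # label-based selection
row_iloc = df.iloc[0]      # position-based selection
cell = df.loc["b", "two"]  # single value by row and column label

print(row_loc)
print(cell)
```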
Descriptive statistics
                       >>> df.mean()
                       one      2.263617
                       two     -1.316694
                       three   -1.975041
                 • Also: count, sum, median, min, max, abs, prod,
                       std, var, skew, kurt, quantile, cumsum,
                       cumprod, cummax, cummin
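
These methods work per column by default; passing `axis=1` computes them per row instead. A short sketch with made-up values, assuming a recent pandas release; `describe()` bundles several of the statistics into one table.

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "two": [10.0, 20.0, 30.0]},
    index=["a", "b", "c"],
)

col_means = df.mean()        # one value per column
row_means = df.mean(axis=1)  # one value per row
running = df.cumsum()        # cumulative sum down each column
summary = df.describe()      # count, mean, std, min, quartiles, max per column

print(col_means)
print(row_means)
print(summary)
```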


Computational Tools

                 • Covariance
                       >>> s1 = Series(randn(1000))
                       >>> s2 = Series(randn(1000))
                       >>> s1.cov(s2)
                       0.013973709323221539
                 • Also: correlation with corr (methods: pearson,
                       kendall, spearman)
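
The three correlation methods can be compared on a small example. This sketch uses made-up data and `DataFrame.corr`, assuming a recent pandas release: `y` always increases with `x`, so the rank-based measures are 1, while the last value pulls the Pearson coefficient below 1.

```python
import pandas as pd

# Hypothetical data: y grows with x, but not linearly.
df = pd.DataFrame(
    {"x": [1.0, 2.0, 3.0, 4.0, 5.0], "y": [2.0, 4.0, 6.0, 8.0, 100.0]}
)

cov = df["x"].cov(df["y"])                        # sample covariance
pearson = df.corr(method="pearson").loc["x", "y"]
spearman = df.corr(method="spearman").loc["x", "y"]
kendall = df.corr(method="kendall").loc["x", "y"]

print(cov, pearson, spearman, kendall)
```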


This and much more...
                       • Group by: split-apply-combine
                       • Merge, join and aggregate
                       • Reshaping and Pivot Tables
                       • Time Series / Date functionality
                       • Plotting with matplotlib
                       • IO Tools (Text, CSV, HDF5, ...)
                       • Sparse data structures
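
As a taste of the first three items, here is a minimal sketch of group by, merge and a pivot table. It assumes a recent pandas release; the tables, column names and numbers are made up for illustration.

```python
import pandas as pd

# Hypothetical sales records.
sales = pd.DataFrame(
    {
        "city": ["Barcelona", "Madrid", "Barcelona", "Madrid"],
        "amount": [10, 20, 30, 40],
    }
)

# Split by city, apply sum, combine into one Series.
totals = sales.groupby("city")["amount"].sum()

# Join-type operation: attach a region to every sales row.
regions = pd.DataFrame(
    {
        "city": ["Barcelona", "Madrid"],
        "region": ["Catalonia", "Community of Madrid"],
    }
)
merged = sales.merge(regions, on="city")

# Pivot table: mean amount per city.
pivot = sales.pivot_table(index="city", values="amount", aggfunc="mean")

print(totals)
print(merged)
print(pivot)
```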
Resources


                       • http://pypi.python.org/pypi/pandas
                       • http://code.google.com/p/pandas


Book coming soon...



