Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.
Getting started with                             Pandas                                    Maik Röder                     ...
Pandas                       • Python data analysis library                       • Built on top of Numpy                 ...
The ideal tool for data                             scientists                       • Munging data                       ...
Installation                       • Install Python 2.6.8 or later                       • Current versions:              ...
Axis Indexing                       • Every axis has an index                       • Highly optimized data structure     ...
Series data structure              • 1-dimensional                       import numpy as np                       randn = ...
Series to/from dict                       d = dict(s)                       {a: -0.88988001423312313,                     ...
Reindexing labels                       >>>   s                       a     -0.496848                       b       0.6071...
Vectorization                       >>> s + s                       a   -1.779760                       b    2.204269     ...
Structured Data          • Data that can be represented as tables           • rows and columns          • Each row is a di...
Structured data                       • Like SQL Table or Excel Sheet                       • Heterogeneous columns, but e...
DataFrame data structure                       • Like data.frame in R                       • 2-dimensional tabular data s...
DataFrame                       >>> d = {one: s*s,                                two: s+s}                       >>> Data...
Dataframe add column                       >>> s                       a   -0.889880                       b     1.102135 ...
Select row by label                 >>> row = df.xs(a)                 one      0.791886                 two     -1.779760...
Descriptive statistics                       >>> df.mean()                       one      2.263617                       t...
Computational Tools                 • Covariance                       >>> s1 = Series(randn(1000))                       ...
This and much more...                       • Group by: split-apply-combine                       • Merge, join and aggreg...
Resources                       • http://pypi.python.org/pypi/pandas                       • http://code.google.com/p/pand...
Book coming soon...Friday, May 18, 2012
Upcoming SlideShare
Loading in …5
×

Getting started with pandas

7,770 views

Published on

A talk I gave at the Barcelona Python Meetup May 2012.

Published in: Technology
  • Login to see the comments

Getting started with pandas

  1. 1. Getting started with Pandas Maik Röder Barcelona Python Meetup Group 17.05.2012Friday, May 18, 2012
  2. 2. Pandas • Python data analysis library • Built on top of Numpy • Panel Data System • Open Sourced by AQR Capital Management, LLC in late 2009 • 30.000 lines of tested Python/Cython code • Used in production in many companiesFriday, May 18, 2012
  3. 3. The ideal tool for data scientists • Munging data • Cleaning data • Analyzing • Modeling data • Organizing the results of the analysis into a form suitable for plotting or tabular displayFriday, May 18, 2012
  4. 4. Installation • Install Python 2.6.8 or later • Current versions: • Numpy 1.6.1 and Pandas 0.7.3 • Recommendation: Install with pip pip install numpy pip install pandasFriday, May 18, 2012
  5. 5. Axis Indexing • Every axis has an index • Highly optimized data structure • Hierarchical indexing • group by and join-type operationsFriday, May 18, 2012
  6. 6. Series data structure • 1-dimensional import numpy as np randn = np.random.randn from pandas import * s = Series(randn(3), index=[a,b,c]) s a -0.889880 b 1.102135 c -2.187296Friday, May 18, 2012
  7. 7. Series to/from dict d = dict(s) {a: -0.88988001423312313, c: -2.1872960440695666, b: 1.1021347373670938} Series(d) a -0.889880 b 1.102135 c -2.187296 • Index comes from sorted dictionary keysFriday, May 18, 2012
  8. 8. Reindexing labels >>> s a -0.496848 b 0.607173 c -1.570596 >>> s.reindex([c,b,a]) c -1.570596 b 0.607173 a -0.496848Friday, May 18, 2012
  9. 9. Vectorization >>> s + s a -1.779760 b 2.204269 c -4.374592 >>> np.exp(s) a 0.410705 b 3.010586 c 0.112220 • Series work with NumpyFriday, May 18, 2012
  10. 10. Structured Data • Data that can be represented as tables • rows and columns • Each row is a different object • Columns represent attributes of the objectFriday, May 18, 2012
  11. 11. Structured data • Like SQL Table or Excel Sheet • Heterogeneous columns, but each column homogeneously typed • Row and column-oriented operations • Axis meta data • Seamless integration with Python data structures and NumpyFriday, May 18, 2012
  12. 12. DataFrame data structure • Like data.frame in R • 2-dimensional tabular data structure • Data manipulation with integrated indexing • Support heterogeneous columns • Homogeneous columnsFriday, May 18, 2012
  13. 13. DataFrame >>> d = {one: s*s, two: s+s} >>> DataFrame(d) one two a 0.791886 -1.779760 b 1.214701 2.204269 c 4.784264 -4.374592Friday, May 18, 2012
  14. 14. Dataframe add column >>> s a -0.889880 b 1.102135 c -2.187296 >>> df[three] = s * 3 >>> df one two three a 0.791886 -1.779760 -2.669640 b 1.214701 2.204269 3.306404 c 4.784264 -4.374592 -6.561888Friday, May 18, 2012
  15. 15. Select row by label >>> row = df.xs(a) one 0.791886 two -1.779760 three -2.669640 Name: a >>> type(row) <classpandas.core.series.Series> >>> df.dtypes one float64 two float64 three float64Friday, May 18, 2012
  16. 16. Descriptive statistics >>> df.mean() one 2.263617 two -1.316694 three -1.975041 • Also: count, sum, median, min, max, abs, prod, std, var, skew, kurt, quantile, cumsum, cumprod, cummax, cumminFriday, May 18, 2012
  17. 17. Computational Tools • Covariance >>> s1 = Series(randn(1000)) >>> s2 = Series(randn(1000)) >>> s1.cov(s2) 0.013973709323221539 • Also: pearson, kendall, spearmanFriday, May 18, 2012
  18. 18. This and much more... • Group by: split-apply-combine • Merge, join and aggregate • Reshaping and Pivot Tables • Time Series / Date functionality • Plotting with matplotlib • IO Tools (Text, CSV, HDF5, ...) • Sparse data structuresFriday, May 18, 2012
  19. 19. Resources • http://pypi.python.org/pypi/pandas • http://code.google.com/p/pandasFriday, May 18, 2012
  20. 20. Book coming soon...Friday, May 18, 2012

×