Getting started with pandas

6,262 views

Published on

A talk I gave at the Barcelona Python Meetup May 2012.

Published in: Technology
0 Comments
4 Likes
Statistics
Notes
  • Be the first to comment

No Downloads
Views
Total views
6,262
On SlideShare
0
From Embeds
0
Number of Embeds
5
Actions
Shares
0
Downloads
148
Comments
0
Likes
4
Embeds 0
No embeds

No notes for slide

Getting started with pandas

  1. 1. Getting started with Pandas Maik Röder Barcelona Python Meetup Group 17.05.2012Friday, May 18, 2012
  2. 2. Pandas • Python data analysis library • Built on top of Numpy • Panel Data System • Open Sourced by AQR Capital Management, LLC in late 2009 • 30.000 lines of tested Python/Cython code • Used in production in many companiesFriday, May 18, 2012
  3. 3. The ideal tool for data scientists • Munging data • Cleaning data • Analyzing • Modeling data • Organizing the results of the analysis into a form suitable for plotting or tabular displayFriday, May 18, 2012
  4. 4. Installation • Install Python 2.6.8 or later • Current versions: • Numpy 1.6.1 and Pandas 0.7.3 • Recommendation: Install with pip pip install numpy pip install pandasFriday, May 18, 2012
  5. 5. Axis Indexing • Every axis has an index • Highly optimized data structure • Hierarchical indexing • group by and join-type operationsFriday, May 18, 2012
  6. 6. Series data structure • 1-dimensional import numpy as np randn = np.random.randn from pandas import * s = Series(randn(3), index=[a,b,c]) s a -0.889880 b 1.102135 c -2.187296Friday, May 18, 2012
  7. 7. Series to/from dict d = dict(s) {a: -0.88988001423312313, c: -2.1872960440695666, b: 1.1021347373670938} Series(d) a -0.889880 b 1.102135 c -2.187296 • Index comes from sorted dictionary keysFriday, May 18, 2012
  8. 8. Reindexing labels >>> s a -0.496848 b 0.607173 c -1.570596 >>> s.reindex([c,b,a]) c -1.570596 b 0.607173 a -0.496848Friday, May 18, 2012
  9. 9. Vectorization >>> s + s a -1.779760 b 2.204269 c -4.374592 >>> np.exp(s) a 0.410705 b 3.010586 c 0.112220 • Series work with NumpyFriday, May 18, 2012
  10. 10. Structured Data • Data that can be represented as tables • rows and columns • Each row is a different object • Columns represent attributes of the objectFriday, May 18, 2012
  11. 11. Structured data • Like SQL Table or Excel Sheet • Heterogeneous columns, but each column homogeneously typed • Row and column-oriented operations • Axis meta data • Seamless integration with Python data structures and NumpyFriday, May 18, 2012
  12. 12. DataFrame data structure • Like data.frame in R • 2-dimensional tabular data structure • Data manipulation with integrated indexing • Support heterogeneous columns • Homogeneous columnsFriday, May 18, 2012
  13. 13. DataFrame >>> d = {one: s*s, two: s+s} >>> DataFrame(d) one two a 0.791886 -1.779760 b 1.214701 2.204269 c 4.784264 -4.374592Friday, May 18, 2012
  14. 14. Dataframe add column >>> s a -0.889880 b 1.102135 c -2.187296 >>> df[three] = s * 3 >>> df one two three a 0.791886 -1.779760 -2.669640 b 1.214701 2.204269 3.306404 c 4.784264 -4.374592 -6.561888Friday, May 18, 2012
  15. 15. Select row by label >>> row = df.xs(a) one 0.791886 two -1.779760 three -2.669640 Name: a >>> type(row) <classpandas.core.series.Series> >>> df.dtypes one float64 two float64 three float64Friday, May 18, 2012
  16. 16. Descriptive statistics >>> df.mean() one 2.263617 two -1.316694 three -1.975041 • Also: count, sum, median, min, max, abs, prod, std, var, skew, kurt, quantile, cumsum, cumprod, cummax, cumminFriday, May 18, 2012
  17. 17. Computational Tools • Covariance >>> s1 = Series(randn(1000)) >>> s2 = Series(randn(1000)) >>> s1.cov(s2) 0.013973709323221539 • Also: pearson, kendall, spearmanFriday, May 18, 2012
  18. 18. This and much more... • Group by: split-apply-combine • Merge, join and aggregate • Reshaping and Pivot Tables • Time Series / Date functionality • Plotting with matplotlib • IO Tools (Text, CSV, HDF5, ...) • Sparse data structuresFriday, May 18, 2012
  19. 19. Resources • http://pypi.python.org/pypi/pandas • http://code.google.com/p/pandasFriday, May 18, 2012
  20. 20. Book coming soon...Friday, May 18, 2012

×