Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our User Agreement and Privacy Policy.

Slideshare uses cookies to improve functionality and performance, and to provide you with relevant advertising. If you continue browsing the site, you agree to the use of cookies on this website. See our Privacy Policy and User Agreement for details.

Successfully reported this slideshow.

Like this presentation? Why not share!

7,770 views

Published on

A talk I gave at the Barcelona Python Meetup May 2012.

Published in:
Technology

No Downloads

Total views

7,770

On SlideShare

0

From Embeds

0

Number of Embeds

5

Shares

0

Downloads

213

Comments

4

Likes

5

No notes for slide

- 1. Getting started with Pandas Maik Röder Barcelona Python Meetup Group 17.05.2012Friday, May 18, 2012
- 2. Pandas • Python data analysis library • Built on top of Numpy • Panel Data System • Open Sourced by AQR Capital Management, LLC in late 2009 • 30.000 lines of tested Python/Cython code • Used in production in many companiesFriday, May 18, 2012
- 3. The ideal tool for data scientists • Munging data • Cleaning data • Analyzing • Modeling data • Organizing the results of the analysis into a form suitable for plotting or tabular displayFriday, May 18, 2012
- 4. Installation • Install Python 2.6.8 or later • Current versions: • Numpy 1.6.1 and Pandas 0.7.3 • Recommendation: Install with pip pip install numpy pip install pandasFriday, May 18, 2012
- 5. Axis Indexing • Every axis has an index • Highly optimized data structure • Hierarchical indexing • group by and join-type operationsFriday, May 18, 2012
- 6. Series data structure • 1-dimensional import numpy as np randn = np.random.randn from pandas import * s = Series(randn(3), index=[a,b,c]) s a -0.889880 b 1.102135 c -2.187296Friday, May 18, 2012
- 7. Series to/from dict d = dict(s) {a: -0.88988001423312313, c: -2.1872960440695666, b: 1.1021347373670938} Series(d) a -0.889880 b 1.102135 c -2.187296 • Index comes from sorted dictionary keysFriday, May 18, 2012
- 8. Reindexing labels >>> s a -0.496848 b 0.607173 c -1.570596 >>> s.reindex([c,b,a]) c -1.570596 b 0.607173 a -0.496848Friday, May 18, 2012
- 9. Vectorization >>> s + s a -1.779760 b 2.204269 c -4.374592 >>> np.exp(s) a 0.410705 b 3.010586 c 0.112220 • Series work with NumpyFriday, May 18, 2012
- 10. Structured Data • Data that can be represented as tables • rows and columns • Each row is a different object • Columns represent attributes of the objectFriday, May 18, 2012
- 11. Structured data • Like SQL Table or Excel Sheet • Heterogeneous columns, but each column homogeneously typed • Row and column-oriented operations • Axis meta data • Seamless integration with Python data structures and NumpyFriday, May 18, 2012
- 12. DataFrame data structure • Like data.frame in R • 2-dimensional tabular data structure • Data manipulation with integrated indexing • Support heterogeneous columns • Homogeneous columnsFriday, May 18, 2012
- 13. DataFrame >>> d = {one: s*s, two: s+s} >>> DataFrame(d) one two a 0.791886 -1.779760 b 1.214701 2.204269 c 4.784264 -4.374592Friday, May 18, 2012
- 14. Dataframe add column >>> s a -0.889880 b 1.102135 c -2.187296 >>> df[three] = s * 3 >>> df one two three a 0.791886 -1.779760 -2.669640 b 1.214701 2.204269 3.306404 c 4.784264 -4.374592 -6.561888Friday, May 18, 2012
- 15. Select row by label >>> row = df.xs(a) one 0.791886 two -1.779760 three -2.669640 Name: a >>> type(row) <classpandas.core.series.Series> >>> df.dtypes one float64 two float64 three float64Friday, May 18, 2012
- 16. Descriptive statistics >>> df.mean() one 2.263617 two -1.316694 three -1.975041 • Also: count, sum, median, min, max, abs, prod, std, var, skew, kurt, quantile, cumsum, cumprod, cummax, cumminFriday, May 18, 2012
- 17. Computational Tools • Covariance >>> s1 = Series(randn(1000)) >>> s2 = Series(randn(1000)) >>> s1.cov(s2) 0.013973709323221539 • Also: pearson, kendall, spearmanFriday, May 18, 2012
- 18. This and much more... • Group by: split-apply-combine • Merge, join and aggregate • Reshaping and Pivot Tables • Time Series / Date functionality • Plotting with matplotlib • IO Tools (Text, CSV, HDF5, ...) • Sparse data structuresFriday, May 18, 2012
- 19. Resources • http://pypi.python.org/pypi/pandas • http://code.google.com/p/pandasFriday, May 18, 2012
- 20. Book coming soon...Friday, May 18, 2012

No public clipboards found for this slide

Login to see the comments