Getting started with pandas

maikroeder

A talk I gave at the Barcelona Python Meetup May 2012.

Getting started with Pandas

Maik Röder
Barcelona Python Meetup Group
17.05.2012

Friday, May 18, 2012

Pandas
• Python data analysis library
• Built on top of NumPy
• Panel Data System
• Open sourced by AQR Capital Management, LLC in late 2009
• 30,000 lines of tested Python/Cython code
• Used in production in many companies

The ideal tool for data
scientists
• Munging data
• Cleaning data
• Analyzing
• Modeling data
• Organizing the results of the analysis into a
form suitable for plotting or tabular display

Installation
• Install Python 2.6.8 or later
• Current versions (May 2012): NumPy 1.6.1 and Pandas 0.7.3
• Recommendation: Install with pip
pip install numpy
pip install pandas
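
A quick way to check that both packages installed correctly is to import them and read their version strings. This is only a sketch; the `versions` dictionary is a name chosen here for illustration, and the numbers printed depend on your environment.

```python
import numpy as np
import pandas as pd

# Collect the installed versions; both packages expose __version__.
versions = {"numpy": np.__version__, "pandas": pd.__version__}
print(versions)
```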

Axis Indexing

• Every axis has an index
• Highly optimized data structure
• Hierarchical indexing
• group by and join-type operations
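
As a rough illustration of hierarchical indexing and group-by on an index level, the sketch below builds a Series with a two-level index. The city/year data is invented for the example, and the explicit `import pandas as pd` style is a choice made here, not something taken from the slides.

```python
import pandas as pd

# Hypothetical sample data: units sold per (city, year) pair.
index = pd.MultiIndex.from_tuples(
    [("Barcelona", 2011), ("Barcelona", 2012), ("Madrid", 2011), ("Madrid", 2012)],
    names=["city", "year"],
)
sales = pd.Series([10, 12, 7, 9], index=index)

# Selecting an outer label returns the entries below it, indexed by year.
barcelona = sales["Barcelona"]

# Group by one level of the index and aggregate.
totals = sales.groupby(level="city").sum()
print(totals)
```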

Series data structure
• 1-dimensional
>>> import numpy as np
>>> randn = np.random.randn
>>> from pandas import *
>>> s = Series(randn(3), index=['a','b','c'])
>>> s
a   -0.889880
b    1.102135
c   -2.187296
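
The same idea with explicit imports, as a minimal sketch. The seeded generator is an assumption added here so the values are repeatable; the slides use unseeded `randn`.

```python
import numpy as np
import pandas as pd

# Seeded generator (an assumption for repeatability, not from the slides).
rng = np.random.default_rng(42)
s = pd.Series(rng.standard_normal(3), index=["a", "b", "c"])

labels = list(s.index)   # the axis labels
values = s.values        # the underlying NumPy array
first = s["a"]           # label-based access to one element
print(s)
```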

Series to/from dict
>>> d = dict(s)
>>> d
{'a': -0.88988001423312313,
 'c': -2.1872960440695666,
 'b': 1.1021347373670938}
>>> Series(d)
a   -0.889880
b    1.102135
c   -2.187296
• Index comes from sorted dictionary keys (as of Pandas 0.7.3; recent releases keep the dictionary's insertion order)
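
A small round-trip sketch using the values shown on the slide. `Series.to_dict()` is used here as an alternative to `dict(s)`; the variable name `restored` is illustrative.

```python
import pandas as pd

# Values copied from the slide output, rounded to six decimals.
s = pd.Series([-0.889880, 1.102135, -2.187296], index=["a", "b", "c"])

d = s.to_dict()          # Series -> plain Python dict
restored = pd.Series(d)  # dict -> Series, keys become the index
print(restored)
```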

Reindexing labels
>>> s
a -0.496848
b 0.607173
c -1.570596
>>> s.reindex(['c','b','a'])
c -1.570596
b 0.607173
a -0.496848
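
Reindexing can also introduce labels that are not in the original index. The sketch below assumes a hypothetical extra label `'d'` to show what happens in that case.

```python
import numpy as np
import pandas as pd

# Values copied from the slide output.
s = pd.Series([-0.496848, 0.607173, -1.570596], index=["a", "b", "c"])

# Reordering existing labels keeps each value attached to its label.
reordered = s.reindex(["c", "b", "a"])

# A label missing from the original index ('d' is hypothetical) becomes NaN...
extended = s.reindex(["a", "b", "c", "d"])

# ...unless a fill value is supplied.
filled = s.reindex(["a", "b", "c", "d"], fill_value=0.0)
print(extended)
```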

Vectorization
>>> s + s
a -1.779760
b 2.204269
c -4.374592
>>> np.exp(s)
a 0.410705
b 3.010586
c 0.112220
• Series work with NumPy functions
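
Arithmetic between two Series is aligned on labels, not on position. The sketch below uses made-up values and a second Series `t` (not in the slides) whose index only partly overlaps.

```python
import numpy as np
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
t = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])  # hypothetical second Series

doubled = s + s     # element-wise, same labels on both sides
total = s + t       # aligned on labels; 'a' and 'd' have no partner and become NaN
exp_s = np.exp(s)   # NumPy ufuncs return a Series with the same index
print(total)
```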

Structured Data
• Data that can be represented as tables
• rows and columns
• Each row is a different object
• Columns represent attributes of the object

Structured data
• Like SQL Table or Excel Sheet
• Heterogeneous columns, but each column
homogeneously typed
• Row and column-oriented operations
• Axis metadata
• Seamless integration with Python data structures and NumPy

DataFrame data structure

• Like data.frame in R
• 2-dimensional tabular data structure
• Data manipulation with integrated indexing
• Supports heterogeneous columns (different columns may hold different types)
• Each individual column is homogeneously typed
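
To make the column typing concrete, here is a sketch with invented data: three columns of different types in one DataFrame, each column having a single dtype.

```python
import pandas as pd

# Hypothetical records; column names and values are made up for illustration.
df = pd.DataFrame(
    {
        "name": ["Ana", "Bernat", "Carla"],   # text
        "age": [31, 45, 28],                  # integers
        "score": [7.5, 8.0, 6.25],            # floats
    },
    index=["a", "b", "c"],
)

# One dtype per column; the columns differ from each other.
print(df.dtypes)
```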

DataFrame

>>> d = {'one': s*s,
...      'two': s+s}
>>> df = DataFrame(d)
>>> df
        one       two
a  0.791886 -1.779760
b  1.214701  2.204269
c  4.784264 -4.374592

DataFrame: add a column
>>> s
a   -0.889880
b    1.102135
c   -2.187296
>>> df['three'] = s * 3
>>> df
        one       two     three
a  0.791886 -1.779760 -2.669640
b  1.214701  2.204269  3.306404
c  4.784264 -4.374592 -6.561888
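
A self-contained version of the same steps, as a sketch. The `assign` call and the column name `four` are additions made here to show a non-mutating alternative; they are not part of the slides.

```python
import pandas as pd

# Values copied from the slide output, rounded to six decimals.
s = pd.Series([-0.889880, 1.102135, -2.187296], index=["a", "b", "c"])
df = pd.DataFrame({"one": s * s, "two": s + s})

# Item assignment adds the column in place, aligned on the index.
df["three"] = s * 3

# assign() returns a new DataFrame and leaves df untouched ('four' is hypothetical).
df2 = df.assign(four=df["one"] + df["two"])
print(df)
```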

Select row by label
>>> row = df.xs('a')
>>> row
one      0.791886
two     -1.779760
three   -2.669640
Name: a
>>> type(row)
<class 'pandas.core.series.Series'>
>>> df.dtypes
one      float64
two      float64
three    float64
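
In current pandas releases, `.loc` (by label) and `.iloc` (by position) are the usual ways to select a row, and `xs` still works. The sketch below rebuilds the table from the slide values and compares the three.

```python
import pandas as pd

# Table values copied from the slide output.
df = pd.DataFrame(
    {
        "one": [0.791886, 1.214701, 4.784264],
        "two": [-1.779760, 2.204269, -4.374592],
        "three": [-2.669640, 3.306404, -6.561888],
    },
    index=["a", "b", "c"],
)

row_xs = df.xs("a")     # cross-section by label, as on the slide
row_loc = df.loc["a"]   # label-based selection
row_iloc = df.iloc[0]   # position-based selection
print(row_loc)
```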

Descriptive statistics
>>> df.mean()
one 2.263617
two -1.316694
three -1.975041
• Also: count, sum, median, min, max, abs, prod,
std, var, skew, kurt, quantile, cumsum,
cumprod, cummax, cummin
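
These statistics work column-wise by default and row-wise with `axis=1`. A sketch using the table values from the slides:

```python
import pandas as pd

# Table values copied from the slide output.
df = pd.DataFrame(
    {
        "one": [0.791886, 1.214701, 4.784264],
        "two": [-1.779760, 2.204269, -4.374592],
        "three": [-2.669640, 3.306404, -6.561888],
    },
    index=["a", "b", "c"],
)

col_means = df.mean()         # one value per column (the default)
row_means = df.mean(axis=1)   # one value per row
summary = df.describe()       # count, mean, std, min, quartiles, max per column
print(summary)
```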

Computational Tools

• Covariance
>>> s1 = Series(randn(1000))
>>> s2 = Series(randn(1000))
>>> s1.cov(s2)
0.013973709323221539
• Also: correlation with s1.corr(s2), using the pearson, kendall or spearman method
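
A sketch of covariance and correlation on two random Series. The seeded generator is an assumption for repeatability. Spearman correlation is computed here as the Pearson correlation of the ranks, which avoids relying on `method="spearman"` (that option may need SciPy installed, depending on the pandas version).

```python
import numpy as np
import pandas as pd

# Seeded generator (an assumption for repeatability, not from the slides).
rng = np.random.default_rng(0)
s1 = pd.Series(rng.standard_normal(1000))
s2 = pd.Series(rng.standard_normal(1000))

cov = s1.cov(s2)          # sample covariance
pearson = s1.corr(s2)     # Pearson correlation is the default method

# Spearman correlation equals the Pearson correlation of the ranks.
spearman = s1.rank().corr(s2.rank())
print(cov, pearson, spearman)
```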

This and much more...
• Group by: split-apply-combine
• Merge, join and aggregate
• Reshaping and Pivot Tables
• Time Series / Date functionality
• Plotting with matplotlib
• IO Tools (Text, CSV, HDF5, ...)
• Sparse data structures
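
A brief taste of three of these features, as a sketch on invented data. The table and column names (`sales`, `regions`, `units`) are hypothetical.

```python
import pandas as pd

# Hypothetical sales records.
sales = pd.DataFrame(
    {
        "city": ["Barcelona", "Barcelona", "Madrid", "Madrid"],
        "year": [2011, 2012, 2011, 2012],
        "units": [10, 12, 7, 9],
    }
)
# Hypothetical lookup table to join against.
regions = pd.DataFrame(
    {"city": ["Barcelona", "Madrid"], "region": ["Catalonia", "Community of Madrid"]}
)

# Group by: split by city, apply a sum, combine the results.
totals = sales.groupby("city")["units"].sum()

# Merge: SQL-style join on the shared 'city' column.
merged = sales.merge(regions, on="city")

# Pivot table: cities as rows, years as columns.
table = sales.pivot_table(index="city", columns="year", values="units", aggfunc="sum")
print(table)
```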

Resources

• http://pypi.python.org/pypi/pandas
• http://code.google.com/p/pandas

Book coming soon...
