Getting started with pandas

A talk I gave at the Barcelona Python Meetup May 2012.

Getting started with
Pandas

Maik Röder
Barcelona Python Meetup Group
17.05.2012

Friday, May 18, 2012

Pandas
• Python data analysis library
• Built on top of NumPy
• Panel Data System
• Open-sourced by AQR Capital
Management, LLC in late 2009
• 30,000 lines of tested Python/Cython code
• Used in production in many companies

The ideal tool for data
scientists
• Munging data
• Cleaning data
• Analyzing
• Modeling data
• Organizing the results of the analysis into a
form suitable for plotting or tabular display

Installation
• Install Python 2.6.8 or later
• Current versions (May 2012):
• NumPy 1.6.1 and pandas 0.7.3
• Recommendation: Install with pip
pip install numpy
pip install pandas
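
A quick way to confirm the installation worked is to import both libraries and print their versions. This is only a minimal sketch; the `np` and `pd` aliases are a common convention rather than something the slides use, and the versions printed will reflect your environment, not the 2012 releases listed above.

```python
import numpy as np
import pandas as pd

# Print the installed versions; expect something newer than
# the NumPy 1.6.1 / pandas 0.7.3 mentioned on the slide.
print("NumPy", np.__version__)
print("pandas", pd.__version__)
```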

Axis Indexing

• Every axis has an index
• Highly optimized data structure
• Hierarchical indexing
• group by and join-type operations
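
The bullets above can be illustrated with a small sketch of hierarchical indexing and a group-by over one index level. The city and year labels and the sales figures are made up for illustration; they are not from the talk.

```python
import pandas as pd

# Hypothetical data: one value per (city, year) pair.
index = pd.MultiIndex.from_tuples(
    [("Barcelona", 2011), ("Barcelona", 2012),
     ("Madrid", 2011), ("Madrid", 2012)],
    names=["city", "year"],
)
sales = pd.Series([10.0, 12.0, 7.0, 9.0], index=index)

# Selecting on the outer level returns the rows below it,
# indexed by the remaining level (year).
barcelona = sales.loc["Barcelona"]

# Group by the outer index level and sum within each group.
totals = sales.groupby(level="city").sum()
print(totals)
```

Here `totals` holds one summed value per city, which is the kind of group-by operation the index makes possible.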

Series data structure
• 1-dimensional
import numpy as np
randn = np.random.randn
from pandas import Series, DataFrame
s = Series(randn(3),
index=['a','b','c'])
s
a -0.889880
b 1.102135
c -2.187296

Series to/from dict
d = dict(s)
d
{'a': -0.88988001423312313,
'c': -2.1872960440695666,
'b': 1.1021347373670938}
Series(d)
a -0.889880
b 1.102135
c -2.187296
• Index comes from sorted dictionary keys (in pandas 0.7.3; recent versions keep the dictionary's insertion order)
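
Because the ordering of the resulting index depends on the pandas version, a sketch that does not rely on it can sort explicitly. The values below are illustrative, and `sort_index` and `to_dict` are standard pandas methods that the slide itself does not show.

```python
import pandas as pd

# A dict whose keys are deliberately out of order.
d = {"c": 3.0, "a": 1.0, "b": 2.0}

# Build the Series, then sort by label so the result does not
# depend on how a given pandas version orders dict keys.
s = pd.Series(d).sort_index()
print(list(s.index))  # ['a', 'b', 'c']

# Converting back gives a dict with the same key/value pairs.
round_trip = s.to_dict()
```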

Reindexing labels
>>> s
a -0.496848
b 0.607173
c -1.570596
>>> s.reindex(['c','b','a'])
c -1.570596
b 0.607173
a -0.496848
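
Reindexing can also ask for labels that are not in the original index. The sketch below, with made-up values, shows that such labels come back as missing values unless a default is supplied through the `fill_value` argument.

```python
import pandas as pd

s = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])

# "d" is not in the original index, so its value is missing (NaN).
r = s.reindex(["c", "b", "a", "d"])

# fill_value supplies a default for labels that were not present.
filled = s.reindex(["c", "b", "a", "d"], fill_value=0.0)

print(r)
print(filled)
```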

Vectorization
>>> s + s
a -1.779760
b 2.204269
c -4.374592
>>> np.exp(s)
a 0.410705
b 3.010586
c 0.112220
• Series work with NumPy functions
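
Arithmetic between two Series is aligned on labels, not positions. The following sketch uses two small, made-up Series with partly different indexes to show the effect, and also that a NumPy function applied to a Series returns a Series.

```python
import numpy as np
import pandas as pd

a = pd.Series([1.0, 2.0, 3.0], index=["a", "b", "c"])
b = pd.Series([10.0, 20.0, 30.0], index=["b", "c", "d"])

# Values are matched by label; labels present in only one
# operand produce NaN in the result.
total = a + b  # a: NaN, b: 12.0, c: 23.0, d: NaN

# The add method accepts fill_value to treat missing labels as 0.
total_filled = a.add(b, fill_value=0.0)  # a: 1.0, b: 12.0, c: 23.0, d: 30.0

# NumPy ufuncs keep the index and return a Series.
e = np.exp(a)
```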

Structured Data
• Data that can be represented as tables
• rows and columns
• Each row is a different object
• Columns represent attributes of the object

Structured data
• Like SQL Table or Excel Sheet
• Heterogeneous columns, but each column
homogeneously typed
• Row and column-oriented operations
• Axis metadata
• Seamless integration with Python data
structures and NumPy

DataFrame data structure

• Like data.frame in R
• 2-dimensional tabular data structure
• Data manipulation with integrated indexing
• Supports heterogeneous columns
• Each column is homogeneously typed
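
A small sketch of what "heterogeneous columns, each homogeneously typed" means in practice. The table contents and column names are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical table: one row per person, one column per attribute.
df = pd.DataFrame(
    {
        "name": ["Ana", "Bernat", "Carla"],   # text column
        "age": [31, 45, 28],                  # integer column
        "height_m": [1.65, 1.80, 1.72],       # floating-point column
    },
    index=["a", "b", "c"],
)

# Each column carries its own dtype; the columns differ from one another.
print(df.dtypes)
```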

DataFrame

>>> d = {'one': s*s,
...      'two': s+s}
>>> df = DataFrame(d)
>>> df
one two
a 0.791886 -1.779760
b 1.214701 2.204269
c 4.784264 -4.374592

DataFrame: add a column
>>> s
a -0.889880
b 1.102135
c -2.187296
>>> df['three'] = s * 3
>>> df
one two three
a 0.791886 -1.779760 -2.669640
b 1.214701 2.204269 3.306404
c 4.784264 -4.374592 -6.561888

Select row by label
>>> row = df.xs('a')
>>> row
one 0.791886
two -1.779760
three -2.669640
Name: a
>>> type(row)
<class 'pandas.core.series.Series'>
>>> df.dtypes
one float64
two float64
three float64
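
In later pandas releases, rows are usually selected with the `.loc` (by label) and `.iloc` (by position) indexers. These are not part of the talk, which targets pandas 0.7.3; the sketch below uses made-up values and checks that `.loc` returns the same row as `xs`.

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "two": [4.0, 5.0, 6.0]},
    index=["a", "b", "c"],
)

row_by_label = df.loc["a"]      # select by index label
row_by_position = df.iloc[0]    # select by integer position
row_by_xs = df.xs("a")          # the method used on the slide

# All three are Series named after the row label.
print(row_by_label)
```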

Descriptive statistics
>>> df.mean()
one 2.263617
two -1.316694
three -1.975041
• Also: count, sum, median, min, max, abs, prod,
std, var, skew, kurt, quantile, cumsum,
cumprod, cummax, cummin
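
By default these methods work column by column. As a sketch with made-up numbers, the `axis` argument switches to row-wise computation, and `describe` collects several of the statistics in one table.

```python
import pandas as pd

df = pd.DataFrame(
    {"one": [1.0, 2.0, 3.0], "two": [4.0, 5.0, 6.0]},
    index=["a", "b", "c"],
)

col_means = df.mean()          # one value per column: one 2.0, two 5.0
row_means = df.mean(axis=1)    # one value per row: a 2.5, b 3.5, c 4.5
running = df.cumsum()          # cumulative sum down each column
summary = df.describe()        # count, mean, std, min, quartiles, max

print(summary)
```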

Computational Tools

• Covariance
>>> s1 = Series(randn(1000))
>>> s2 = Series(randn(1000))
>>> s1.cov(s2)
0.013973709323221539
• Also: correlation with s1.corr(s2), using the pearson, kendall or spearman method
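
Covariance and Pearson correlation are closely related: the correlation is the covariance divided by the product of the two standard deviations. The sketch below checks that relationship numerically. The fixed seed is an assumption added for reproducibility and is not in the original slides.

```python
import numpy as np
import pandas as pd

# Fixed seed (an assumption) so the run is repeatable.
rng = np.random.RandomState(0)
s1 = pd.Series(rng.randn(1000))
s2 = pd.Series(rng.randn(1000))

cov = s1.cov(s2)
corr = s1.corr(s2)  # Pearson is the default method

# Pearson correlation recomputed from the covariance.
manual = cov / (s1.std() * s2.std())

print(cov, corr, manual)
```

For two independent random samples both numbers should be close to zero, as in the slide's output.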

This and much more...
• Group by: split-apply-combine
• Merge, join and aggregate
• Reshaping and Pivot Tables
• Time Series / Date functionality
• Plotting with matplotlib
• IO Tools (Text, CSV, HDF5, ...)
• Sparse data structures
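
A few of the features listed above can be sketched briefly: split-apply-combine with `groupby`, a join with `merge`, and a pivot table. The two small tables and their column names are hypothetical.

```python
import pandas as pd

# Hypothetical tables: orders, and a lookup of customer cities.
orders = pd.DataFrame({
    "customer": ["ana", "ana", "ben", "ben", "ben"],
    "amount": [10.0, 20.0, 5.0, 5.0, 15.0],
})
customers = pd.DataFrame({
    "customer": ["ana", "ben"],
    "city": ["Barcelona", "Girona"],
})

# Split by customer, apply a sum, combine into one Series.
totals = orders.groupby("customer")["amount"].sum()  # ana 30.0, ben 25.0

# Join the city onto every order (SQL-style left join).
joined = orders.merge(customers, on="customer", how="left")

# Pivot: total amount per city.
by_city = joined.pivot_table(index="city", values="amount", aggfunc="sum")

print(totals)
print(by_city)
```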

Resources

• http://pypi.python.org/pypi/pandas
• http://code.google.com/p/pandas

Book coming soon...
