SlideShare a Scribd company logo
1 of 23
Python for Data Analysis
Andrew Henshaw
Georgia Tech Research Institute
• PowerPoint
• IPython (ipython –pylab=inline)
• Custom bridge (ipython2powerpoint)
• Python for Data
Analysis
• Wes McKinney
• Lead developer of
pandas
• Quantitative Financial
Analyst
• Python library to provide data analysis features, similar to:
• R
• MATLAB
• SAS
• Built on NumPy, SciPy, and to some extent, matplotlib
• Key components provided by pandas:
• Series
• DataFrame
• One-dimensional
array-like object
containing data
and labels (or
index)
• Lots of ways to
build a Series
>>> import pandas as pd
>>> s = pd.Series(list('abcdef'))
>>> s
0 a
1 b
2 c
3 d
4 e
5 f
>>> s = pd.Series([2, 4, 6, 8])
>>> s
0 2
1 4
2 6
3 8
• A Series index can be
specified
• Single values can be
selected by index
• Multiple values can be
selected with multiple
indexes
>>> s = pd.Series([2, 4, 6, 8],
index = ['f', 'a', 'c', 'e'])
>>>
>>> s
f 2
a 4
c 6
e 8
>>> s['a']
4
>>> s[['a', 'c']]
a 4
c 6
• Think of a Series as a
fixed-length, ordered
dict
• However, unlike a dict,
index items don't have
to be unique
>>> s2 = pd.Series(range(4),
index = list('abab'))
>>> s2
a 0
b 1
a 2
b 3
>>> s['a]
>>>
>>> s['a']
4
>>> s2['a']
a 0
a 2
>>> s2['a'][0]
0
• Filtering
• NumPy-type operations
on data
>>> s
f 2
a 4
c 6
e 8
>>> s[s > 4]
c 6
e 8
>>> s>4
f False
a False
c True
e True
>>> s*2
f 4
a 8
c 12
e 16
• pandas can
accommodate
incomplete data
>>> sdata = {'b':100, 'c':150, 'd':200}
>>> s = pd.Series(sdata)
>>> s
b 100
c 150
d 200
>>> s = pd.Series(sdata, list('abcd'))
>>> s
a NaN
b 100
c 150
d 200
>>> s*2
a NaN
b 200
c 300
d 400
• Unlike in a NumPy
ndarray, data is
automatically aligned
>>> s2 = pd.Series([1, 2, 3],
index = ['c', 'b', 'a'])
>>> s2
c 1
b 2
a 3
>>> s
a NaN
b 100
c 150
d 200
>>> s*s2
a NaN
b 200
c 150
d NaN
• Spreadsheet-like data structure containing an
ordered collection of columns
• Has both a row and column index
• Consider as dict of Series (with shared index)
>>> data = {'state': ['FL', 'FL', 'GA', 'GA', 'GA'],
'year': [2010, 2011, 2008, 2010, 2011],
'pop': [18.8, 19.1, 9.7, 9.7, 9.8]}
>>> frame = pd.DataFrame(data)
>>> frame
pop state year
0 18.8 FL 2010
1 19.1 FL 2011
2 9.7 GA 2008
3 9.7 GA 2010
4 9.8 GA 2011
>>> pop_data = {'FL': {2010:18.8, 2011:19.1},
'GA': {2008: 9.7, 2010: 9.7, 2011:9.8}}
>>> pop = pd.DataFrame(pop_data)
>>> pop
FL GA
2008 NaN 9.7
2010 18.8 9.7
2011 19.1 9.8
• Columns can be retrieved
as a Series
• dict notation
• attribute notation
• Rows can retrieved by
position or by name (using
ix attribute)
>>> frame['state']
0 FL
1 FL
2 GA
3 GA
4 GA
Name: state
>>> frame.describe
<bound method DataFrame.describe
of pop state year
0 18.8 FL 2010
1 19.1 FL 2011
2 9.7 GA 2008
3 9.7 GA 2010
4 9.8 GA 2011>
• New columns can be
added (by computation
or direct assignment)
>>> frame['other'] = NaN
>>> frame
pop state year other
0 18.8 FL 2010 NaN
1 19.1 FL 2011 NaN
2 9.7 GA 2008 NaN
3 9.7 GA 2010 NaN
4 9.8 GA 2011 NaN
>>> frame['calc'] = frame['pop'] * 2
>>> frame
pop state year other calc
0 18.8 FL 2010 NaN 37.6
1 19.1 FL 2011 NaN 38.2
2 9.7 GA 2008 NaN 19.4
3 9.7 GA 2010 NaN 19.4
4 9.8 GA 2011 NaN 19.6
>>> obj = pd.Series(['blue', 'purple', 'red'],
index=[0,2,4])
>>> obj
0 blue
2 purple
4 red
>>> obj.reindex(range(4))
0 blue
1 NaN
2 purple
3 NaN
>>> obj.reindex(range(5), fill_value='black')
0 blue
1 black
2 purple
3 black
4 red
>>> obj.reindex(range(5), method='ffill')
0 blue
1 blue
2 purple
3 purple
4 red
• Creation of new
object with the
data conformed
to a new index
>>> pop
FL GA
2008 NaN 9.7
2010 18.8 9.7
2011 19.1 9.8
>>> pop.sum()
FL 37.9
GA 29.2
>>> pop.mean()
FL 18.950000
GA 9.733333
>>> pop.describe()
FL GA
count 2.000000 3.000000
mean 18.950000 9.733333
std 0.212132 0.057735
min 18.800000 9.700000
25% 18.875000 9.700000
50% 18.950000 9.700000
75% 19.025000 9.750000
max 19.100000 9.800000
>>> pop
FL GA
2008 NaN 9.7
2010 18.8 9.7
2011 19.1 9.8
>>> pop < 9.8
FL GA
2008 False True
2010 False True
2011 False False
>>> pop[pop < 9.8] = 0
>>> pop
FL GA
2008 NaN 0.0
2010 18.8 0.0
2011 19.1 9.8
• pandas supports several ways to handle data loading
• Text file data
• read_csv
• read_table
• Structured data (JSON, XML, HTML)
• works well with existing libraries
• Excel (depends upon xlrd and openpyxl packages)
• Database
• pandas.io.sql module (read_frame)
>>> tips = pd.read_csv('/users/ah6/Desktop/pandas
talk/data/tips.csv')
>>> tips.ix[:2]
total_bill tip sex smoker day time size
0 16.99 1.01 Female No Sun Dinner 2
1 10.34 1.66 Male No Sun Dinner 3
2 21.01 3.50 Male No Sun Dinner 3
>>> party_counts = pd.crosstab(tips.day, tips.size)
>>> party_counts
size 1 2 3 4 5 6
day
Fri 1 16 1 1 0 0
Sat 2 53 18 13 1 0
Sun 0 39 15 18 3 1
Thur 1 48 4 5 1 3
>>> sum_by_day = party_counts.sum(1).astype(float)
>>> party_pcts = party_counts.div(sum_by_day, axis=0)
>>> party_pcts
size 1 2 3 4 5 6
day
Fri 0.052632 0.842105 0.052632 0.052632 0.000000 0.000000
Sat 0.022989 0.609195 0.206897 0.149425 0.011494 0.000000
Sun 0.000000 0.513158 0.197368 0.236842 0.039474 0.013158
Thur 0.016129 0.774194 0.064516 0.080645 0.016129 0.048387
>>> party_pcts.plot(kind='bar', stacked=True)
<matplotlib.axes.AxesSubplot at 0x6bf2ed0>
>>> tips['tip_pct'] = tips['tip'] / tips['total_bill']
>>> tips['tip_pct'].hist(bins=50)
<matplotlib.axes.AxesSubplot at 0x6c10d30>
>>> tips['tip_pct'].describe()
count 244.000000
mean 0.160803
std 0.061072
min 0.035638
25% 0.129127
50% 0.154770
75% 0.191475
max 0.710345
• Data Aggregation
• GroupBy
• Pivot Tables
• Time Series
• Periods/Frequencies
• Operations with Time Series with Different Frequencies
• Downsampling/Upsampling
• Plotting with TimeSeries (auto-adjust scale)
• Advanced Analysis
• Decile and Quartile Analysis
• Signal Frontier Analysis
• Future Contract Rolling
• Rolling Correlation and Linear Regression

More Related Content

What's hot

Python Pandas
Python PandasPython Pandas
Python PandasSunil OS
 
Data Analysis with Python Pandas
Data Analysis with Python PandasData Analysis with Python Pandas
Data Analysis with Python PandasNeeru Mittal
 
Introduction to Pandas and Time Series Analysis [PyCon DE]
Introduction to Pandas and Time Series Analysis [PyCon DE]Introduction to Pandas and Time Series Analysis [PyCon DE]
Introduction to Pandas and Time Series Analysis [PyCon DE]Alexander Hendorf
 
Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...Simplilearn
 
Presentation on data preparation with pandas
Presentation on data preparation with pandasPresentation on data preparation with pandas
Presentation on data preparation with pandasAkshitaKanther
 
Data Analysis in Python-NumPy
Data Analysis in Python-NumPyData Analysis in Python-NumPy
Data Analysis in Python-NumPyDevashish Kumar
 
Getting started with pandas
Getting started with pandasGetting started with pandas
Getting started with pandasmaikroeder
 
Introduction to NumPy (PyData SV 2013)
Introduction to NumPy (PyData SV 2013)Introduction to NumPy (PyData SV 2013)
Introduction to NumPy (PyData SV 2013)PyData
 
Python NumPy Tutorial | NumPy Array | Edureka
Python NumPy Tutorial | NumPy Array | EdurekaPython NumPy Tutorial | NumPy Array | Edureka
Python NumPy Tutorial | NumPy Array | EdurekaEdureka!
 
Data visualization in Python
Data visualization in PythonData visualization in Python
Data visualization in PythonMarc Garcia
 
pandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statisticspandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and StatisticsWes McKinney
 
PPT on Data Science Using Python
PPT on Data Science Using PythonPPT on Data Science Using Python
PPT on Data Science Using PythonNishantKumar1179
 
Python Pandas for Data Science cheatsheet
Python Pandas for Data Science cheatsheet Python Pandas for Data Science cheatsheet
Python Pandas for Data Science cheatsheet Dr. Volkan OBAN
 
Sorting in python
Sorting in python Sorting in python
Sorting in python Simplilearn
 

What's hot (20)

Python Pandas
Python PandasPython Pandas
Python Pandas
 
Data Analysis with Python Pandas
Data Analysis with Python PandasData Analysis with Python Pandas
Data Analysis with Python Pandas
 
Python pandas Library
Python pandas LibraryPython pandas Library
Python pandas Library
 
Introduction to Pandas and Time Series Analysis [PyCon DE]
Introduction to Pandas and Time Series Analysis [PyCon DE]Introduction to Pandas and Time Series Analysis [PyCon DE]
Introduction to Pandas and Time Series Analysis [PyCon DE]
 
Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...Data Science With Python | Python For Data Science | Python Data Science Cour...
Data Science With Python | Python For Data Science | Python Data Science Cour...
 
Presentation on data preparation with pandas
Presentation on data preparation with pandasPresentation on data preparation with pandas
Presentation on data preparation with pandas
 
Introduction to numpy
Introduction to numpyIntroduction to numpy
Introduction to numpy
 
Data Analysis in Python-NumPy
Data Analysis in Python-NumPyData Analysis in Python-NumPy
Data Analysis in Python-NumPy
 
Getting started with pandas
Getting started with pandasGetting started with pandas
Getting started with pandas
 
Introduction to NumPy (PyData SV 2013)
Introduction to NumPy (PyData SV 2013)Introduction to NumPy (PyData SV 2013)
Introduction to NumPy (PyData SV 2013)
 
NUMPY
NUMPY NUMPY
NUMPY
 
Python NumPy Tutorial | NumPy Array | Edureka
Python NumPy Tutorial | NumPy Array | EdurekaPython NumPy Tutorial | NumPy Array | Edureka
Python NumPy Tutorial | NumPy Array | Edureka
 
Data visualization in Python
Data visualization in PythonData visualization in Python
Data visualization in Python
 
NumPy
NumPyNumPy
NumPy
 
pandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statisticspandas: a Foundational Python Library for Data Analysis and Statistics
pandas: a Foundational Python Library for Data Analysis and Statistics
 
PPT on Data Science Using Python
PPT on Data Science Using PythonPPT on Data Science Using Python
PPT on Data Science Using Python
 
Python Pandas for Data Science cheatsheet
Python Pandas for Data Science cheatsheet Python Pandas for Data Science cheatsheet
Python Pandas for Data Science cheatsheet
 
Python Scipy Numpy
Python Scipy NumpyPython Scipy Numpy
Python Scipy Numpy
 
Numpy
NumpyNumpy
Numpy
 
Sorting in python
Sorting in python Sorting in python
Sorting in python
 

Similar to pandas - Python Data Analysis

R programming & Machine Learning
R programming & Machine LearningR programming & Machine Learning
R programming & Machine LearningAmanBhalla14
 
Sas Talk To R Users Group
Sas Talk To R Users GroupSas Talk To R Users Group
Sas Talk To R Users Groupgeorgette1200
 
Be A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data PipelineBe A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data PipelineChester Chen
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftBest Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftAmazon Web Services
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift Amazon Web Services
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftBest Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftAmazon Web Services
 
Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...Tech Triveni
 
What's New for Developers in SQL Server 2008?
What's New for Developers in SQL Server 2008?What's New for Developers in SQL Server 2008?
What's New for Developers in SQL Server 2008?ukdpe
 
SQL Server 2008 Overview
SQL Server 2008 OverviewSQL Server 2008 Overview
SQL Server 2008 OverviewEric Nelson
 
Prog1 chap1 and chap 2
Prog1 chap1 and chap 2Prog1 chap1 and chap 2
Prog1 chap1 and chap 2rowensCap
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guideRyan Blue
 
Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifyNeville Li
 
Don't optimize my queries, organize my data!
Don't optimize my queries, organize my data!Don't optimize my queries, organize my data!
Don't optimize my queries, organize my data!Julian Hyde
 
MariaDB ColumnStore
MariaDB ColumnStoreMariaDB ColumnStore
MariaDB ColumnStoreMariaDB plc
 
Deploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSDeploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSAmazon Web Services
 
Apache IOTDB: a Time Series Database for Industrial IoT
Apache IOTDB: a Time Series Database for Industrial IoTApache IOTDB: a Time Series Database for Industrial IoT
Apache IOTDB: a Time Series Database for Industrial IoTjixuan1989
 
5_MariaDB_What's New in MariaDB Server 10.2 and Big Data Analytics with Maria...
5_MariaDB_What's New in MariaDB Server 10.2 and Big Data Analytics with Maria...5_MariaDB_What's New in MariaDB Server 10.2 and Big Data Analytics with Maria...
5_MariaDB_What's New in MariaDB Server 10.2 and Big Data Analytics with Maria...Kangaroot
 
Big Data LDN 2017: Big Data Analytics with MariaDB ColumnStore
Big Data LDN 2017: Big Data Analytics with MariaDB ColumnStoreBig Data LDN 2017: Big Data Analytics with MariaDB ColumnStore
Big Data LDN 2017: Big Data Analytics with MariaDB ColumnStoreMatt Stubbs
 

Similar to pandas - Python Data Analysis (20)

R programming & Machine Learning
R programming & Machine LearningR programming & Machine Learning
R programming & Machine Learning
 
Sas Talk To R Users Group
Sas Talk To R Users GroupSas Talk To R Users Group
Sas Talk To R Users Group
 
Be A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data PipelineBe A Hero: Transforming GoPro Analytics Data Pipeline
Be A Hero: Transforming GoPro Analytics Data Pipeline
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftBest Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift
 
Best Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon RedshiftBest Practices for Migrating your Data Warehouse to Amazon Redshift
Best Practices for Migrating your Data Warehouse to Amazon Redshift
 
SAS - overview of SAS
SAS - overview of SASSAS - overview of SAS
SAS - overview of SAS
 
Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...
 
What's New for Developers in SQL Server 2008?
What's New for Developers in SQL Server 2008?What's New for Developers in SQL Server 2008?
What's New for Developers in SQL Server 2008?
 
SQL Server 2008 Overview
SQL Server 2008 OverviewSQL Server 2008 Overview
SQL Server 2008 Overview
 
Prog1 chap1 and chap 2
Prog1 chap1 and chap 2Prog1 chap1 and chap 2
Prog1 chap1 and chap 2
 
Parquet performance tuning: the missing guide
Parquet performance tuning: the missing guideParquet performance tuning: the missing guide
Parquet performance tuning: the missing guide
 
Sorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at SpotifySorry - How Bieber broke Google Cloud at Spotify
Sorry - How Bieber broke Google Cloud at Spotify
 
Don't optimize my queries, organize my data!
Don't optimize my queries, organize my data!Don't optimize my queries, organize my data!
Don't optimize my queries, organize my data!
 
MariaDB ColumnStore
MariaDB ColumnStoreMariaDB ColumnStore
MariaDB ColumnStore
 
Deploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWSDeploying your Data Warehouse on AWS
Deploying your Data Warehouse on AWS
 
Apache IOTDB: a Time Series Database for Industrial IoT
Apache IOTDB: a Time Series Database for Industrial IoTApache IOTDB: a Time Series Database for Industrial IoT
Apache IOTDB: a Time Series Database for Industrial IoT
 
HivePart1.pptx
HivePart1.pptxHivePart1.pptx
HivePart1.pptx
 
5_MariaDB_What's New in MariaDB Server 10.2 and Big Data Analytics with Maria...
5_MariaDB_What's New in MariaDB Server 10.2 and Big Data Analytics with Maria...5_MariaDB_What's New in MariaDB Server 10.2 and Big Data Analytics with Maria...
5_MariaDB_What's New in MariaDB Server 10.2 and Big Data Analytics with Maria...
 
Big Data LDN 2017: Big Data Analytics with MariaDB ColumnStore
Big Data LDN 2017: Big Data Analytics with MariaDB ColumnStoreBig Data LDN 2017: Big Data Analytics with MariaDB ColumnStore
Big Data LDN 2017: Big Data Analytics with MariaDB ColumnStore
 

Recently uploaded

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbuapidays
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FMESafe Software
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesrafiqahmad00786416
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...apidays
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusZilliz
 

Recently uploaded (20)

Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu SubbuApidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
Apidays Singapore 2024 - Modernizing Securities Finance by Madhu Subbu
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers:  A Deep Dive into Serverless Spatial Data and FMECloud Frontiers:  A Deep Dive into Serverless Spatial Data and FME
Cloud Frontiers: A Deep Dive into Serverless Spatial Data and FME
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
ICT role in 21st century education and its challenges
ICT role in 21st century education and its challengesICT role in 21st century education and its challenges
ICT role in 21st century education and its challenges
 
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
Apidays Singapore 2024 - Scalable LLM APIs for AI and Generative AI Applicati...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
A Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source MilvusA Beginners Guide to Building a RAG App Using Open Source Milvus
A Beginners Guide to Building a RAG App Using Open Source Milvus
 

pandas - Python Data Analysis

  • 1. Python for Data Analysis Andrew Henshaw Georgia Tech Research Institute
  • 2. • PowerPoint • IPython (ipython –pylab=inline) • Custom bridge (ipython2powerpoint)
  • 3. • Python for Data Analysis • Wes McKinney • Lead developer of pandas • Quantitative Financial Analyst
  • 4. • Python library to provide data analysis features, similar to: • R • MATLAB • SAS • Built on NumPy, SciPy, and to some extent, matplotlib • Key components provided by pandas: • Series • DataFrame
  • 5. • One-dimensional array-like object containing data and labels (or index) • Lots of ways to build a Series >>> import pandas as pd >>> s = pd.Series(list('abcdef')) >>> s 0 a 1 b 2 c 3 d 4 e 5 f >>> s = pd.Series([2, 4, 6, 8]) >>> s 0 2 1 4 2 6 3 8
  • 6. • A Series index can be specified • Single values can be selected by index • Multiple values can be selected with multiple indexes >>> s = pd.Series([2, 4, 6, 8], index = ['f', 'a', 'c', 'e']) >>> >>> s f 2 a 4 c 6 e 8 >>> s['a'] 4 >>> s[['a', 'c']] a 4 c 6
  • 7. • Think of a Series as a fixed-length, ordered dict • However, unlike a dict, index items don't have to be unique >>> s2 = pd.Series(range(4), index = list('abab')) >>> s2 a 0 b 1 a 2 b 3 >>> s['a] >>> >>> s['a'] 4 >>> s2['a'] a 0 a 2 >>> s2['a'][0] 0
  • 8. • Filtering • NumPy-type operations on data >>> s f 2 a 4 c 6 e 8 >>> s[s > 4] c 6 e 8 >>> s>4 f False a False c True e True >>> s*2 f 4 a 8 c 12 e 16
  • 9. • pandas can accommodate incomplete data >>> sdata = {'b':100, 'c':150, 'd':200} >>> s = pd.Series(sdata) >>> s b 100 c 150 d 200 >>> s = pd.Series(sdata, list('abcd')) >>> s a NaN b 100 c 150 d 200 >>> s*2 a NaN b 200 c 300 d 400
  • 10. • Unlike in a NumPy ndarray, data is automatically aligned >>> s2 = pd.Series([1, 2, 3], index = ['c', 'b', 'a']) >>> s2 c 1 b 2 a 3 >>> s a NaN b 100 c 150 d 200 >>> s*s2 a NaN b 200 c 150 d NaN
  • 11. • Spreadsheet-like data structure containing an ordered collection of columns • Has both a row and column index • Consider as dict of Series (with shared index)
  • 12. >>> data = {'state': ['FL', 'FL', 'GA', 'GA', 'GA'], 'year': [2010, 2011, 2008, 2010, 2011], 'pop': [18.8, 19.1, 9.7, 9.7, 9.8]} >>> frame = pd.DataFrame(data) >>> frame pop state year 0 18.8 FL 2010 1 19.1 FL 2011 2 9.7 GA 2008 3 9.7 GA 2010 4 9.8 GA 2011
  • 13. >>> pop_data = {'FL': {2010:18.8, 2011:19.1}, 'GA': {2008: 9.7, 2010: 9.7, 2011:9.8}} >>> pop = pd.DataFrame(pop_data) >>> pop FL GA 2008 NaN 9.7 2010 18.8 9.7 2011 19.1 9.8
  • 14. • Columns can be retrieved as a Series • dict notation • attribute notation • Rows can retrieved by position or by name (using ix attribute) >>> frame['state'] 0 FL 1 FL 2 GA 3 GA 4 GA Name: state >>> frame.describe <bound method DataFrame.describe of pop state year 0 18.8 FL 2010 1 19.1 FL 2011 2 9.7 GA 2008 3 9.7 GA 2010 4 9.8 GA 2011>
  • 15. • New columns can be added (by computation or direct assignment) >>> frame['other'] = NaN >>> frame pop state year other 0 18.8 FL 2010 NaN 1 19.1 FL 2011 NaN 2 9.7 GA 2008 NaN 3 9.7 GA 2010 NaN 4 9.8 GA 2011 NaN >>> frame['calc'] = frame['pop'] * 2 >>> frame pop state year other calc 0 18.8 FL 2010 NaN 37.6 1 19.1 FL 2011 NaN 38.2 2 9.7 GA 2008 NaN 19.4 3 9.7 GA 2010 NaN 19.4 4 9.8 GA 2011 NaN 19.6
  • 16. >>> obj = pd.Series(['blue', 'purple', 'red'], index=[0,2,4]) >>> obj 0 blue 2 purple 4 red >>> obj.reindex(range(4)) 0 blue 1 NaN 2 purple 3 NaN >>> obj.reindex(range(5), fill_value='black') 0 blue 1 black 2 purple 3 black 4 red >>> obj.reindex(range(5), method='ffill') 0 blue 1 blue 2 purple 3 purple 4 red • Creation of new object with the data conformed to a new index
  • 17. >>> pop FL GA 2008 NaN 9.7 2010 18.8 9.7 2011 19.1 9.8 >>> pop.sum() FL 37.9 GA 29.2 >>> pop.mean() FL 18.950000 GA 9.733333 >>> pop.describe() FL GA count 2.000000 3.000000 mean 18.950000 9.733333 std 0.212132 0.057735 min 18.800000 9.700000 25% 18.875000 9.700000 50% 18.950000 9.700000 75% 19.025000 9.750000 max 19.100000 9.800000
  • 18. >>> pop FL GA 2008 NaN 9.7 2010 18.8 9.7 2011 19.1 9.8 >>> pop < 9.8 FL GA 2008 False True 2010 False True 2011 False False >>> pop[pop < 9.8] = 0 >>> pop FL GA 2008 NaN 0.0 2010 18.8 0.0 2011 19.1 9.8
  • 19. • pandas supports several ways to handle data loading • Text file data • read_csv • read_table • Structured data (JSON, XML, HTML) • works well with existing libraries • Excel (depends upon xlrd and openpyxl packages) • Database • pandas.io.sql module (read_frame)
  • 20. >>> tips = pd.read_csv('/users/ah6/Desktop/pandas talk/data/tips.csv') >>> tips.ix[:2] total_bill tip sex smoker day time size 0 16.99 1.01 Female No Sun Dinner 2 1 10.34 1.66 Male No Sun Dinner 3 2 21.01 3.50 Male No Sun Dinner 3 >>> party_counts = pd.crosstab(tips.day, tips.size) >>> party_counts size 1 2 3 4 5 6 day Fri 1 16 1 1 0 0 Sat 2 53 18 13 1 0 Sun 0 39 15 18 3 1 Thur 1 48 4 5 1 3 >>> sum_by_day = party_counts.sum(1).astype(float)
  • 21. >>> party_pcts = party_counts.div(sum_by_day, axis=0) >>> party_pcts size 1 2 3 4 5 6 day Fri 0.052632 0.842105 0.052632 0.052632 0.000000 0.000000 Sat 0.022989 0.609195 0.206897 0.149425 0.011494 0.000000 Sun 0.000000 0.513158 0.197368 0.236842 0.039474 0.013158 Thur 0.016129 0.774194 0.064516 0.080645 0.016129 0.048387 >>> party_pcts.plot(kind='bar', stacked=True) <matplotlib.axes.AxesSubplot at 0x6bf2ed0>
  • 22. >>> tips['tip_pct'] = tips['tip'] / tips['total_bill'] >>> tips['tip_pct'].hist(bins=50) <matplotlib.axes.AxesSubplot at 0x6c10d30> >>> tips['tip_pct'].describe() count 244.000000 mean 0.160803 std 0.061072 min 0.035638 25% 0.129127 50% 0.154770 75% 0.191475 max 0.710345
  • 23. • Data Aggregation • GroupBy • Pivot Tables • Time Series • Periods/Frequencies • Operations with Time Series with Different Frequencies • Downsampling/Upsampling • Plotting with TimeSeries (auto-adjust scale) • Advanced Analysis • Decile and Quartile Analysis • Signal Frontier Analysis • Future Contract Rolling • Rolling Correlation and Linear Regression