pandas - Python Data Analysis

pandas talk given at Atlanta Python Meetup

Published in: Technology

  1. 1. Python for Data Analysis
        Andrew Henshaw, Georgia Tech Research Institute
  2. 2. • PowerPoint
        • IPython (ipython --pylab=inline)
        • Custom bridge (ipython2powerpoint)
  3. 3. • Python for Data Analysis
        • Wes McKinney
        • Lead developer of pandas
        • Quantitative Financial Analyst
  4. 4. • Python library to provide data analysis features, similar to:
          • R
          • MATLAB
          • SAS
        • Built on NumPy, SciPy, and to some extent, matplotlib
        • Key components provided by pandas:
          • Series
          • DataFrame
  5. 5. • One-dimensional array-like object containing data and labels (or index)
        • Lots of ways to build a Series
        >>> import pandas as pd
        >>> s = pd.Series(list('abcdef'))
        >>> s
        0    a
        1    b
        2    c
        3    d
        4    e
        5    f
        >>> s = pd.Series([2, 4, 6, 8])
        >>> s
        0    2
        1    4
        2    6
        3    8
  6. 6. • A Series index can be specified
        • Single values can be selected by index
        • Multiple values can be selected with multiple indexes
        >>> s = pd.Series([2, 4, 6, 8],
        ...               index=['f', 'a', 'c', 'e'])
        >>> s
        f    2
        a    4
        c    6
        e    8
        >>> s['a']
        4
        >>> s[['a', 'c']]
        a    4
        c    6
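The label-based selection on this slide can be tried as a standalone script. This is a minimal sketch that reuses the slide's values; the names `single` and `subset` are introduced here for illustration only.

```python
import pandas as pd

# Series with explicit string labels, using the values from the slide.
s = pd.Series([2, 4, 6, 8], index=["f", "a", "c", "e"])

single = s["a"]          # one label returns a scalar
subset = s[["a", "c"]]   # a list of labels returns a new Series

print(single)
print(subset.to_dict())
```

Passing a list of labels keeps the labels attached to the result, which is what makes later alignment between Series possible.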
  7. 7. • Think of a Series as a fixed-length, ordered dict
        • However, unlike a dict, index items don't have to be unique
        >>> s2 = pd.Series(range(4),
        ...                index=list('abab'))
        >>> s2
        a    0
        b    1
        a    2
        b    3
        >>> s['a']
        4
        >>> s2['a']
        a    0
        a    2
        >>> s2['a'][0]
        0
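A short sketch of how a repeated label behaves, using the same `s2` as the slide. The use of `index.is_unique` and `.iloc` is an addition here, not part of the talk; `.iloc[0]` is shown as an explicitly positional way to take the first match.

```python
import pandas as pd

# Index labels repeat: 'a' and 'b' each appear twice.
s2 = pd.Series(range(4), index=list("abab"))

has_unique_labels = s2.index.is_unique  # False for this index
matches = s2["a"]                       # a Series, because 'a' is repeated
first_match = matches.iloc[0]           # explicit positional access

print(has_unique_labels)
print(matches.tolist())
print(first_match)
```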
  8. 8. • Filtering
        • NumPy-type operations on data
        >>> s
        f    2
        a    4
        c    6
        e    8
        >>> s[s > 4]
        c    6
        e    8
        >>> s > 4
        f    False
        a    False
        c     True
        e     True
        >>> s * 2
        f     4
        a     8
        c    12
        e    16
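The filtering idiom above works in two steps: a comparison produces a boolean Series, and indexing with that Series keeps only the rows marked True. A minimal sketch with the slide's data (the names `mask`, `filtered`, and `doubled` are illustrative):

```python
import pandas as pd

s = pd.Series([2, 4, 6, 8], index=["f", "a", "c", "e"])

mask = s > 4        # element-wise comparison, same index as s
filtered = s[mask]  # keeps only the labels where mask is True
doubled = s * 2     # arithmetic is applied element-wise as well

print(mask.tolist())
print(filtered.to_dict())
print(doubled.tolist())
```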
  9. 9. • pandas can accommodate incomplete data
        >>> sdata = {'b': 100, 'c': 150, 'd': 200}
        >>> s = pd.Series(sdata)
        >>> s
        b    100
        c    150
        d    200
        >>> s = pd.Series(sdata, list('abcd'))
        >>> s
        a    NaN
        b    100
        c    150
        d    200
        >>> s * 2
        a    NaN
        b    200
        c    300
        d    400
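Once a Series contains NaN, the usual next step is to detect, drop, or fill the gaps. The sketch below builds the same Series as the slide; `isna`, `dropna`, and `fillna` are standard pandas methods but are not shown in the talk, so treat this as a supplementary illustration.

```python
import pandas as pd

sdata = {"b": 100, "c": 150, "d": 200}

# Label 'a' has no entry in sdata, so it becomes NaN.
s = pd.Series(sdata, index=list("abcd"))

missing = s.isna()     # boolean Series marking the gaps
present = s.dropna()   # drops the NaN entries
zeroed = s.fillna(0)   # replaces NaN with a chosen value

print(missing.tolist())
print(present.to_dict())
print(zeroed.tolist())
```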
  10. 10. • Unlike in a NumPy ndarray, data is automatically aligned
          >>> s2 = pd.Series([1, 2, 3],
          ...                index=['c', 'b', 'a'])
          >>> s2
          c    1
          b    2
          a    3
          >>> s
          a    NaN
          b    100
          c    150
          d    200
          >>> s * s2
          a    NaN
          b    200
          c    150
          d    NaN
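To make the contrast with NumPy concrete, the sketch below multiplies the same values twice: once as Series (matched by label) and once as raw arrays (matched by position). The Series `left` is a simplified stand-in for the slide's `s` without the NaN row; the variable names are illustrative.

```python
import pandas as pd

left = pd.Series([100.0, 150.0, 200.0], index=["b", "c", "d"])
s2 = pd.Series([1, 2, 3], index=["c", "b", "a"])

# Label alignment: 'b' pairs with 'b', 'c' with 'c'.
# Labels found on only one side ('a', 'd') become NaN.
aligned = left * s2

# Positional arithmetic on the underlying arrays ignores labels.
positional = left.to_numpy() * s2.to_numpy()

print(aligned.dropna().to_dict())
print(positional.tolist())
```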
  11. 11. • Spreadsheet-like data structure containing an ordered collection of columns
          • Has both a row and column index
          • Consider as dict of Series (with shared index)
  12. 12. >>> data = {'state': ['FL', 'FL', 'GA', 'GA', 'GA'],
          ...         'year': [2010, 2011, 2008, 2010, 2011],
          ...         'pop': [18.8, 19.1, 9.7, 9.7, 9.8]}
          >>> frame = pd.DataFrame(data)
          >>> frame
              pop state  year
          0  18.8    FL  2010
          1  19.1    FL  2011
          2   9.7    GA  2008
          3   9.7    GA  2010
          4   9.8    GA  2011
  13. 13. >>> pop_data = {'FL': {2010: 18.8, 2011: 19.1},
          ...             'GA': {2008: 9.7, 2010: 9.7, 2011: 9.8}}
          >>> pop = pd.DataFrame(pop_data)
          >>> pop
                  FL   GA
          2008   NaN  9.7
          2010  18.8  9.7
          2011  19.1  9.8
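With a dict of dicts, the outer keys become columns and the inner keys become row labels; a row label missing from one inner dict yields NaN in that column. A minimal sketch follows. The `sort_index()` call is an addition here, on the assumption that row order may differ between pandas versions.

```python
import pandas as pd

pop_data = {
    "FL": {2010: 18.8, 2011: 19.1},
    "GA": {2008: 9.7, 2010: 9.7, 2011: 9.8},
}

# Outer keys -> columns, inner keys -> row index.
# sort_index() makes the row order explicit rather than relying on defaults.
pop = pd.DataFrame(pop_data).sort_index()

print(pop)
```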
  14. 14. • Columns can be retrieved as a Series
            • dict notation
            • attribute notation
          • Rows can be retrieved by position or by name (using ix attribute)
          >>> frame['state']
          0    FL
          1    FL
          2    GA
          3    GA
          4    GA
          Name: state
          >>> frame.describe
          <bound method DataFrame.describe of     pop state  year
          0  18.8    FL  2010
          1  19.1    FL  2011
          2   9.7    GA  2008
          3   9.7    GA  2010
          4   9.8    GA  2011>
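A sketch of the retrieval styles listed above, written for current pandas. Note that the `ix` attribute mentioned on the slide has since been removed from pandas; `.loc` (by label) and `.iloc` (by position) are used here instead. The example also shows a caveat of attribute notation: a column named `pop` collides with the `DataFrame.pop` method, so only dict notation reaches it.

```python
import pandas as pd

data = {
    "state": ["FL", "FL", "GA", "GA", "GA"],
    "year": [2010, 2011, 2008, 2010, 2011],
    "pop": [18.8, 19.1, 9.7, 9.7, 9.8],
}
frame = pd.DataFrame(data)

by_key = frame["state"]    # dict notation
by_attr = frame.state      # attribute notation, same column

row_by_label = frame.loc[2]      # row whose index label is 2
row_by_position = frame.iloc[2]  # third row, regardless of labels

# frame.pop is the DataFrame method, not the 'pop' column.
pop_is_method = callable(frame.pop)

# Calling describe() (with parentheses) returns summary statistics
# for the numeric columns.
summary = frame.describe()
print(summary)
```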
  15. 15. • New columns can be added (by computation or direct assignment)
          >>> frame['other'] = NaN
          >>> frame
              pop state  year  other
          0  18.8    FL  2010    NaN
          1  19.1    FL  2011    NaN
          2   9.7    GA  2008    NaN
          3   9.7    GA  2010    NaN
          4   9.8    GA  2011    NaN
          >>> frame['calc'] = frame['pop'] * 2
          >>> frame
              pop state  year  other  calc
          0  18.8    FL  2010    NaN  37.6
          1  19.1    FL  2011    NaN  38.2
          2   9.7    GA  2008    NaN  19.4
          3   9.7    GA  2010    NaN  19.4
          4   9.8    GA  2011    NaN  19.6
  16. 16. • Creation of new object with the data conformed to a new index
          >>> obj = pd.Series(['blue', 'purple', 'red'],
          ...                 index=[0, 2, 4])
          >>> obj
          0      blue
          2    purple
          4       red
          >>> obj.reindex(range(4))
          0      blue
          1       NaN
          2    purple
          3       NaN
          >>> obj.reindex(range(5), fill_value='black')
          0      blue
          1     black
          2    purple
          3     black
          4       red
          >>> obj.reindex(range(5), method='ffill')
          0      blue
          1      blue
          2    purple
          3    purple
          4       red
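The three reindexing variants on this slide can be compared side by side. This is a minimal sketch with the slide's data; the result names are illustrative. In each case `obj` itself is left unchanged and a new Series is returned.

```python
import pandas as pd

obj = pd.Series(["blue", "purple", "red"], index=[0, 2, 4])

# New labels with no data become NaN by default.
with_gaps = obj.reindex(range(4))

# fill_value supplies a constant for the new labels.
with_constant = obj.reindex(range(5), fill_value="black")

# method='ffill' carries the last known value forward.
carried_forward = obj.reindex(range(5), method="ffill")

print(with_gaps.isna().tolist())
print(with_constant.tolist())
print(carried_forward.tolist())
```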
  17. 17. >>> pop
                  FL   GA
          2008   NaN  9.7
          2010  18.8  9.7
          2011  19.1  9.8
          >>> pop.sum()
          FL    37.9
          GA    29.2
          >>> pop.mean()
          FL    18.950000
          GA     9.733333
          >>> pop.describe()
                        FL        GA
          count   2.000000  3.000000
          mean   18.950000  9.733333
          std     0.212132  0.057735
          min    18.800000  9.700000
          25%    18.875000  9.700000
          50%    18.950000  9.700000
          75%    19.025000  9.750000
          max    19.100000  9.800000
  18. 18. >>> pop
                  FL   GA
          2008   NaN  9.7
          2010  18.8  9.7
          2011  19.1  9.8
          >>> pop < 9.8
                   FL     GA
          2008  False   True
          2010  False   True
          2011  False  False
          >>> pop[pop < 9.8] = 0
          >>> pop
                  FL   GA
          2008   NaN  0.0
          2010  18.8  0.0
          2011  19.1  9.8
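The assignment on this slide uses a boolean DataFrame as a mask: only cells where the comparison is True are overwritten. The sketch below rebuilds `pop` directly (rather than from the nested dict) so it runs on its own. It also shows why the NaN cell survives: a comparison against NaN evaluates to False.

```python
import numpy as np
import pandas as pd

pop = pd.DataFrame(
    {"FL": [np.nan, 18.8, 19.1], "GA": [9.7, 9.7, 9.8]},
    index=[2008, 2010, 2011],
)

mask = pop < 9.8   # NaN < 9.8 is False, so the missing cell is not selected
pop[mask] = 0      # overwrite only the cells where mask is True

print(pop)
```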
  19. 19. • pandas supports several ways to handle data loading
            • Text file data
              • read_csv
              • read_table
            • Structured data (JSON, XML, HTML)
              • works well with existing libraries
            • Excel (depends upon xlrd and openpyxl packages)
            • Database
              • pandas.io.sql module (read_frame)
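As a small illustration of text loading, `read_csv` accepts any file-like object, so the sketch below reads from an in-memory string instead of a file on disk. The three rows are the ones displayed on the next slide; the `csv_text` variable and the use of `io.StringIO` are assumptions made to keep the example self-contained.

```python
import io

import pandas as pd

# Three rows in the same layout as the tips data shown in the talk.
csv_text = """total_bill,tip,sex,smoker,day,time,size
16.99,1.01,Female,No,Sun,Dinner,2
10.34,1.66,Male,No,Sun,Dinner,3
21.01,3.50,Male,No,Sun,Dinner,3
"""

# The header row supplies the column names; column types are inferred.
tips = pd.read_csv(io.StringIO(csv_text))

print(tips.dtypes)
```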
  20. 20. >>> tips = pd.read_csv('/users/ah6/Desktop/pandastalk/data/tips.csv')
          >>> tips.ix[:2]
             total_bill   tip     sex smoker  day    time  size
          0       16.99  1.01  Female     No  Sun  Dinner     2
          1       10.34  1.66    Male     No  Sun  Dinner     3
          2       21.01  3.50    Male     No  Sun  Dinner     3
          >>> party_counts = pd.crosstab(tips.day, tips.size)
          >>> party_counts
          size  1   2   3   4  5  6
          day
          Fri   1  16   1   1  0  0
          Sat   2  53  18  13  1  0
          Sun   0  39  15  18  3  1
          Thur  1  48   4   5  1  3
          >>> sum_by_day = party_counts.sum(1).astype(float)
  21. 21. >>> party_pcts = party_counts.div(sum_by_day, axis=0)
          >>> party_pcts
          size         1         2         3         4         5         6
          day
          Fri   0.052632  0.842105  0.052632  0.052632  0.000000  0.000000
          Sat   0.022989  0.609195  0.206897  0.149425  0.011494  0.000000
          Sun   0.000000  0.513158  0.197368  0.236842  0.039474  0.013158
          Thur  0.016129  0.774194  0.064516  0.080645  0.016129  0.048387
          >>> party_pcts.plot(kind='bar', stacked=True)
          <matplotlib.axes.AxesSubplot at 0x6bf2ed0>
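The count-then-normalize steps from the two slides above can be sketched on a small, made-up table; the six rows below are hypothetical and are not the tips data set. Bracket notation is used for the `size` column because, in current pandas, `DataFrame.size` is a property that returns the number of elements. The `normalize="index"` option of `crosstab` is shown as a shortcut that should give the same row fractions.

```python
import pandas as pd

# Hypothetical stand-in for the tips data: one row per party.
tips = pd.DataFrame({
    "day": ["Fri", "Fri", "Sat", "Sat", "Sat", "Sun"],
    "size": [2, 2, 2, 3, 3, 4],
})

# Count parties for each (day, size) combination.
party_counts = pd.crosstab(tips["day"], tips["size"])

# Divide each row by its total so every row sums to 1.
sum_by_day = party_counts.sum(axis=1).astype(float)
party_pcts = party_counts.div(sum_by_day, axis=0)

# Same normalization in a single call.
normalized = pd.crosstab(tips["day"], tips["size"], normalize="index")

print(party_pcts)
```

Plotting is left out so the sketch runs without matplotlib; calling `party_pcts.plot(kind="bar", stacked=True)` as on the slide would draw the stacked bars.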
  22. 22. >>> tips['tip_pct'] = tips['tip'] / tips['total_bill']
          >>> tips['tip_pct'].hist(bins=50)
          <matplotlib.axes.AxesSubplot at 0x6c10d30>
          >>> tips['tip_pct'].describe()
          count    244.000000
          mean       0.160803
          std        0.061072
          min        0.035638
          25%        0.129127
          50%        0.154770
          75%        0.191475
          max        0.710345
  23. 23. • Data Aggregation
            • GroupBy
            • Pivot Tables
          • Time Series
            • Periods/Frequencies
            • Operations with Time Series with Different Frequencies
            • Downsampling/Upsampling
            • Plotting with TimeSeries (auto-adjust scale)
          • Advanced Analysis
            • Decile and Quartile Analysis
            • Signal Frontier Analysis
            • Future Contract Rolling
            • Rolling Correlation and Linear Regression
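The talk only lists these topics, so the following is a brief, hypothetical taste of three of them (GroupBy, pivot tables, and downsampling) rather than anything shown on the slides. All data and variable names are made up for illustration.

```python
import pandas as pd

# Hypothetical tip records.
sales = pd.DataFrame({
    "day": ["Sat", "Sat", "Sun", "Sun"],
    "time": ["Lunch", "Dinner", "Lunch", "Dinner"],
    "tip": [1.0, 3.0, 2.0, 4.0],
})

# GroupBy: split rows by day, then average the tips in each group.
mean_tip = sales.groupby("day")["tip"].mean()

# Pivot table: days as rows, time of day as columns, mean tip in each cell.
table = sales.pivot_table(values="tip", index="day", columns="time", aggfunc="mean")

# Downsampling: six daily observations summed into two-day bins.
ts = pd.Series(range(6), index=pd.date_range("2013-01-01", periods=6, freq="D"))
two_day = ts.resample("2D").sum()

print(mean_tip.to_dict())
print(table)
print(two_day.tolist())
```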
