Slides from my lightning talk at the Boston Predictive Analytics Meetup hosted at Predictive Analytics World, Boston, October 1, 2012.
Full code and data are available on GitHub: http://bit.ly/pawdata
Jeffrey Breen, Technology and travel analyst at Atmosphere Research Group
1. Tapping the Data Deluge with R
Finding and using supplemental data
to add context to your analysis
by Jeffrey Breen
Principal, Think Big Academy
Code & Data on github
http://bit.ly/pawdata
email: jeffrey.breen@thinkbiganalytics.com
blog: http://jeffreybreen.wordpress.com
Twitter: @JeffreyBreen
2. Data data everywhere!
This may be what you think the data deluge looks like if you work for the Economist.
But those of us who wrangle data for a living know that it’s usually not so prosaic or buttoned-down, proper or quaint.
3. Real data hits us in the face...
Real data can hit you in the face.
Yet we keep coming back for more.
4. ...and then there’s Big Data.
And I’m not even going to talk about Big Data tonight. (For a change!)
5. Finding the right data makes all the difference
Tonight we’re going to look at a few different places to find those data sets which can make a difference, and a few techniques
to access them so you can incorporate them into your analysis.
6. The two types of data
Data you have
Data you don’t
have... yet
Perhaps you’ve heard the joke: There are two kinds of people: People who think there are two kinds of people and people
who don’t.
I like to think that there are two kinds of data.
7. The two types of data
• Data you have
– CSV files, spreadsheets
– files from other statistics packages (SPSS, SAS, Stata,...)
– databases, data warehouses (SQL, NoSQL, HBase,...)
– whatever your boss emailed you on his way to lunch
– datasets within R and R packages
• Data you don’t have... yet
– file downloads & web scraping
– data marketplaces and other APIs
8. Reading CSV files is easy
$ head -5 data/mpg-3-13-2012.csv | cut -c 1-60
"Model Yr","Mfr Name","Division","Carline","Verify Mfr Cd","
2012,"aston martin","Aston Martin Lagonda Ltd","V12 Vantage"
2012,"aston martin","Aston Martin Lagonda Ltd","V8 Vantage",
2012,"aston martin","Aston Martin Lagonda Ltd","V8 Vantage",
2012,"aston martin","Aston Martin Lagonda Ltd","V8 Vantage",
data = read.csv('data/mpg-3-13-2012.csv')
View(data)
see R/01-read.csv-mpg.R
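One thing to watch with headers like "Model Yr": read.csv() rewrites column names into syntactically valid R names unless told otherwise. A minimal sketch, using a made-up three-line sample in place of the EPA file (the sample rows and the variable names csv.text, default and raw are illustrative, not taken from the repository):

```r
# Hypothetical sample mimicking the quoted, space-containing headers above
csv.text = '"Model Yr","Mfr Name","Carline"
2012,"aston martin","V12 Vantage"
2012,"aston martin","V8 Vantage"'

# Default behaviour: spaces in the header become dots
default = read.csv(text = csv.text)
names(default)

# Keep the header text as-is and leave strings as character vectors
raw = read.csv(text = csv.text, check.names = FALSE, stringsAsFactors = FALSE)
names(raw)
```

Columns kept with their original names have to be quoted with backticks, e.g. raw$`Model Yr`, so the default renaming is often the more convenient choice.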
9. But so is reading Excel files directly
library(XLConnect)
wb = loadWorkbook("data/mpg.xlsx", create=FALSE)
data = readWorksheet(wb, sheet='3-7-2012')
see R/02-XLConnect-mpg.R
11. Relational databases
library(RMySQL)
con = dbConnect(MySQL(), user="root", dbname="test")
data = dbGetQuery(con, "select * from airport")
dbDisconnect(con)
View(data)
  airport_code airport_name                       location                    state_code country_name time_zone_code
1 ATL          WILLIAM B. HARTSFIELD              ATLANTA,GEORGIA             GA         USA          EST
2 BOS          LOGAN INTERNATIONAL                BOSTON,MASSACHUSETTS        MA         USA          EST
3 BWI          BALTIMORE/WASHINGTON INTERNATIONAL BALTIMORE,MARYLAND          MD         USA          EST
4 DEN          STAPLETON INTERNATIONAL            DENVER,COLORADO             CO         USA          MST
5 DFW          DALLAS/FORT WORTH INTERNATIONAL    DALLAS/FT. WORTH,TEXAS      TX         USA          CST
6 OAK          METROPOLITAN OAKLAND INTERNATIONAL OAKLAND,CALIFORNIA          CA         USA          PST
7 PHL          PHILADELPHIA INTERNATIONAL         PHILADELPHIA PA/WILM'TON,DE PA         USA          EST
8 PIT          GREATER PITTSBURGH                 PITTSBURGH,PENNSYLVANIA     PA         USA          EST
9 SFO          SAN FRANCISCO INTERNATIONAL        SAN FRANCISCO,CALIFORNIA    CA         USA          PST
see R/04-RMySQL-airport.R
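The same connect / query / disconnect pattern carries over to other DBI back ends. As a sketch that runs without a MySQL server, the following assumes the RSQLite package is installed and uses an in-memory database holding a hypothetical three-row miniature of the airport table:

```r
library(DBI)

# Assumption: RSQLite is installed; ':memory:' creates a throwaway database
con = dbConnect(RSQLite::SQLite(), ':memory:')

# Hypothetical miniature of the airport table shown above
airport = data.frame(airport_code = c('BOS', 'PHL', 'PIT'),
                     state_code   = c('MA', 'PA', 'PA'),
                     stringsAsFactors = FALSE)
dbWriteTable(con, 'airport', airport)

# Bind the value as a parameter instead of pasting it into the SQL string
pa.airports = dbGetQuery(con,
    'select airport_code from airport where state_code = ? order by airport_code',
    params = list('PA'))

dbDisconnect(con)
pa.airports
```

Pushing the filter into the query keeps the data frame small, which matters more with a real warehouse table than with this toy one.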
12. Non-relational databases too
> library(rhbase)
> hb.init(serialize='raw')
> x = hb.get(tablename='tweets', rows='221325531868692480')
> str(x)
List of 1
$ :List of 3
..$ : chr "221325531868692480"
..$ : chr [1:10] "created:" "favorited:" "id:" "replyToSID:" ...
..$ :List of 10
.. ..$ : chr "2012-07-06 19:31:33"
.. ..$ : chr "FALSE"
.. ..$ : chr "221325531868692480"
.. ..$ : chr "NA"
.. ..$ : chr "NA"
.. ..$ : chr "NA"
.. ..$ : chr "arnicas"
.. ..$ : chr "<a href="http://www.tweetdeck.com"
rel="nofollow">TweetDeck</a>"
.. ..$ : chr "RT @bycoffe: From @DrewLinzer, an #Rstats function for querying
the HuffPost Pollster API. http://t.co/fXnG32JX cc @thewhyaxis"
.. ..$ : chr "FALSE"
13. weird emails from the boss
con = textConnection('
# Hi:
#
# Please invite these paid volunteers to the spontaneous rally at 3PM today:
#
Name Department "Hourly Rate" email
Alice Operations 32 alice@wonderland.org
Billy Logistics 5 billy.pilgrim@slaugterhouse5.com
Winston Records 20 winston.smith@truth.gov.oc
#
#Thanks,
#Your Boss
#
')
data = read.table(con, header=TRUE, comment.char='#')
close.connection(con)
View(data)
  Name    Department Hourly.Rate email
1 Alice   Operations 32          alice@wonderland.org
2 Billy   Logistics  5           billy.pilgrim@slaugterhouse5.com
3 Winston Records    20          winston.smith@truth.gov.oc
see R/05-textConnection-email.R
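For pasted text like this, read.table() also accepts the string directly through its text argument, which saves opening and closing the connection by hand. A minimal sketch with a shortened, made-up version of the email (the variable name email is illustrative):

```r
# Shortened, hypothetical version of the boss's email
email = '
# Hi:
Name Department "Hourly Rate" email
Alice Operations 32 alice@wonderland.org
Billy Logistics 5 billy.pilgrim@slaugterhouse5.com
#Thanks
'

# Lines starting with '#' are treated as comments; blank lines are skipped
data = read.table(text = email, header = TRUE, comment.char = '#',
                  stringsAsFactors = FALSE)
str(data)
```

The quoted header "Hourly Rate" is read as a single field and then converted to the valid R name Hourly.Rate.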
14. > data()
Data sets in package ‘datasets’:
AirPassengers Monthly Airline Passenger Numbers 1949-1960
BJsales Sales Data with Leading Indicator
BJsales.lead (BJsales)
Sales Data with Leading Indicator
BOD Biochemical Oxygen Demand
CO2 Carbon Dioxide Uptake in Grass Plants
ChickWeight Weight versus age of chicks on different diets
DNase Elisa assay of DNase
EuStockMarkets Daily Closing Prices of Major European Stock
Indices, 1991-1998
Formaldehyde Determination of Formaldehyde
HairEyeColor Hair and Eye Color of Statistics Students
Harman23.cor Harman Example 2.3
Harman74.cor Harman Example 7.4
Indometh Pharmacokinetics of Indomethacin
InsectSprays Effectiveness of Insect Sprays
JohnsonJohnson Quarterly Earnings per Johnson & Johnson Share
LakeHuron Level of Lake Huron 1875-1972
LifeCycleSavings Intercountry Life-Cycle Savings Data
Loblolly Growth of Loblolly pine trees
Nile Flow of the River Nile
Orange Growth of Orange Trees
OrchardSprays Potency of Orchard Sprays
PlantGrowth Results from an Experiment on Plant Growth
Puromycin Reaction Velocity of an Enzymatic Reaction
Seatbelts Road Casualties in Great Britain 1969-84
Theoph Pharmacokinetics of Theophylline
Titanic Survival of passengers on the Titanic
ToothGrowth The Effect of Vitamin C on Tooth Growth in
Guinea Pigs
UCBAdmissions Student Admissions at UC Berkeley
UKDriverDeaths Road Casualties in Great Britain 1969-84
UKgas UK Quarterly Gas Consumption
USAccDeaths Accidental Deaths in the US 1973-1978
USArrests Violent Crime Rates by US State
USJudgeRatings Lawyers' Ratings of State Judges in the US
Superior Court
USPersonalExpenditure Personal Expenditure Data
VADeaths Death Rates in Virginia (1940)
WWWusage Internet Usage per Minute
WorldPhones The World's Telephones
ability.cov Ability and Intelligence Tests
airmiles Passenger Miles on Commercial US Airlines,
1937-1960
airquality New York Air Quality Measurements
[...]
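The listing above can also be captured as an object rather than read off the screen, which helps when searching a package for a data set. A small sketch using only base R and the bundled datasets package:

```r
# data(package = ...) returns an object whose $results matrix has
# Package, LibPath, Item and Title columns
listing = data(package = 'datasets')$results
head(listing[, c('Item', 'Title')])

# Load one data set by name and inspect its structure
data(airquality)
str(airquality)
```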
15. > library(zipcode)
> data(zipcode)
> str(zipcode)
'data.frame': 44336 obs. of 5 variables:
$ zip : chr "00210" "00211" "00212" "00213" ...
$ city : chr "Portsmouth" "Portsmouth" "Portsmouth" "Portsmouth" ...
$ state : chr "NH" "NH" "NH" "NH" ...
$ latitude : num 43 43 43 43 43 ...
$ longitude: num -71 -71 -71 -71 -71 ...
> subset(zipcode, city=='Boston' & state=='MA')
zip city state latitude longitude
664 02101 Boston MA 42.37057 -71.02696
665 02102 Boston MA 42.33895 -70.91963
666 02103 Boston MA 42.33895 -70.91963
667 02104 Boston MA 42.33895 -70.91963
668 02105 Boston MA 42.33895 -70.91963
669 02106 Boston MA 42.35432 -71.07345
670 02107 Boston MA 42.33895 -70.91963
671 02108 Boston MA 42.35790 -71.06408
672 02109 Boston MA 42.36148 -71.05417
673 02110 Boston MA 42.35653 -71.05365
674 02111 Boston MA 42.34984 -71.06101
675 02112 Boston MA 42.33895 -70.91963
676 02113 Boston MA 42.36503 -71.05636
677 02114 Boston MA 42.36179 -71.06774
678 02115 Boston MA 42.34308 -71.09268
679 02116 Boston MA 42.34962 -71.07372
680 02117 Boston MA 42.33895 -70.91963
681 02118 Boston MA 42.33872 -71.07276
682 02119 Boston MA 42.32451 -71.08455
683 02120 Boston MA 42.33210 -71.09651
684 02121 Boston MA 42.30745 -71.08127
685 02122 Boston MA 42.29630 -71.05454
686 02123 Boston MA 42.33895 -70.91963
687 02124 Boston MA 42.28713 -71.07156
688 02125 Boston MA 42.31685 -71.05811
690 02127 Boston MA 42.33499 -71.04562
691 02128 Boston MA 42.37830 -71.02550
696 02133 Boston MA 42.33895 -70.91963
726 02163 Boston MA 42.36795 -71.12056
757 02196 Boston MA 42.33895 -70.91963
[...]
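A lookup table like this earns its keep once it is joined to your own records. The sketch below is one way to do that with merge(); it uses a hypothetical two-row excerpt of the table above and a made-up customers data frame so that it runs without the zipcode package:

```r
# Two rows copied from the Boston listing above, standing in for the full table
zip.lookup = data.frame(zip       = c('02108', '02116'),
                        city      = 'Boston',
                        latitude  = c(42.35790, 42.34962),
                        longitude = c(-71.06408, -71.07372),
                        stringsAsFactors = FALSE)

# Hypothetical records to enrich; ZIP codes stay character so the
# leading zeros survive
customers = data.frame(id  = 1:3,
                       zip = c('02116', '02108', '99999'),
                       stringsAsFactors = FALSE)

# all.x = TRUE keeps customers whose ZIP is missing from the lookup (NA coordinates)
located = merge(customers, zip.lookup, by = 'zip', all.x = TRUE)
located
```

Reading ZIP codes as numbers would turn "02108" into 2108 and silently break the join, which is presumably why the package stores them as character.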
17. The two types of data
• Data you have
– CSV files, spreadsheets
– files from other statistics packages (SPSS, SAS, Stata,...)
– databases, data warehouses (SQL, NoSQL, HBase,...)
– whatever your boss emailed you on his way to lunch
– datasets within R and R packages
• Data you don’t have... yet
– file downloads & web scraping
– data marketplaces and other APIs
20. Many base functions take URLs
library(ggplot2)
# Build the URL in pieces so no line break ends up inside the string
url = paste0('http://ichart.finance.yahoo.com/table.csv?',
             's=YHOO&d=8&e=28&f=2012&g=d&a=3&b=12&c=1996&',
             'ignore=.csv')
data = read.csv(url)
ggplot(data) + geom_point(aes(x=as.Date(Date), y=Close), size = 1) +
  scale_y_log10() + theme_bw()
see R/06-read.csv-url-yahoo.R
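Long query strings like the one above are easier to maintain when assembled from a named list. The helper below, build.url(), is a hypothetical convenience function written for this sketch, not part of base R or of the talk's code; it only builds the string and does not contact any server:

```r
# Hypothetical helper: join a base URL and a named list of query parameters
build.url = function(base, params) {
  values = sapply(params, function(v) URLencode(as.character(v), reserved = TRUE))
  query  = paste(names(params), values, sep = '=', collapse = '&')
  paste0(base, '?', query)
}

# Same parameters as on the slide, in the same order
url = build.url('http://ichart.finance.yahoo.com/table.csv',
                list(s = 'YHOO', d = 8, e = 28, f = 2012, g = 'd',
                     a = 3, b = 12, c = 1996, ignore = '.csv'))
url
```

Changing the ticker or the dates then means editing one list element rather than hunting through a long string.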
22. download.file() if URLs aren’t supported
library(XLConnect)
url = paste0("http://www.fueleconomy.gov/feg/EPAGreenGuide/xls/",
             "all_alpha_12.xls")
local.xls.file = 'data/all_alpha_12.xls'
# mode='wb' keeps the binary spreadsheet intact on Windows
download.file(url, local.xls.file, mode='wb')
wb = loadWorkbook(local.xls.file, create=FALSE)
data = readWorksheet(wb, sheet='all_alpha_12')
View(data)
see R/07-download.file-XLConnect-green.R
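When a script is rerun often, it can be worth skipping the download if the file is already on disk. A minimal sketch of that idea; fetch.once() is a hypothetical helper named for this example, built only on base R functions:

```r
# Hypothetical helper: download a file only if it is not already cached locally
fetch.once = function(url, dest, ...) {
  if (!file.exists(dest)) {
    # Create the target directory if needed, then download in binary mode
    dir.create(dirname(dest), showWarnings = FALSE, recursive = TRUE)
    download.file(url, dest, mode = 'wb', ...)
  }
  invisible(dest)
}

# Usage with the names from the slide (commented out: requires network access)
# fetch.once(url, local.xls.file)
```

Deleting the local copy is then the way to force a fresh download.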
23. image credit: http://groovynoms.com/2011/07/25/beer-of-the-week-2/
Now, I don’t mean to oversell this next one, but if you’ve spent as much time as I have finding -- and trying to deal with --
interesting data sets on web pages, you might agree that this next function alone is worth the price of admission.
24. not even HTML tables are safe
library(XML)
url = 'http://en.wikipedia.org/wiki/List_of_capitals_in_the_United_States'
state.capitals.df = readHTMLTable(url, which=2)
   State       Abr. Date of statehood Capital     Capital since Land area (mi²) Most populous city?
1  Alabama     AL   1819              Montgomery  1846          155.4           No
2  Alaska      AK   1959              Juneau      1906          2716.7          No
3  Arizona     AZ   1912              Phoenix     1889          474.9           Yes
4  Arkansas    AR   1836              Little Rock 1821          116.2           Yes
5  California  CA   1850              Sacramento  1854          97.2            No
6  Colorado    CO   1876              Denver      1867          153.4           Yes
7  Connecticut CT   1788              Hartford    1875          17.3            No
8  Delaware    DE   1787              Dover       1777          22.4            No
9  Florida     FL   1845              Tallahassee 1824          95.7            No
10 Georgia     GA   1788              Atlanta     1868          131.7           Yes
see R/08-readHTMLTable.R
As you’d expect from a package called “XML”, it parses well-formed XML files.
But I didn’t expect it would do such a good job with HTML.
And I certainly didn’t expect to find a function as handy as readHTMLTable()!
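Scraped tables usually arrive as text or factors, so numeric columns tend to need converting before any analysis. A sketch in base R, using a hypothetical three-row stand-in for the scraped table; the column name Land.area and the comma in "2,716.7" are invented here to show the cleanup step, and to.number() is a helper defined only for this example:

```r
# Hypothetical stand-in for a scraped table, with values stored as factors
scraped = data.frame(State     = c('Alabama', 'Alaska', 'Arizona'),
                     Land.area = c('155.4', '2,716.7', '474.9'),
                     stringsAsFactors = TRUE)

# Pitfall: as.numeric() on a factor returns the level codes, not the areas
as.numeric(scraped$Land.area)

# Convert via character, stripping thousands separators first
to.number = function(x) as.numeric(gsub(',', '', as.character(x)))
scraped$Land.area = to.number(scraped$Land.area)
str(scraped)
```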
27. ...and couldn’t be easier to access.
library(rdatamarket)
oil.prod = dmseries("http://data.is/nyFeP9")
plot(oil.prod)
see R/09-rdatamarket.R
DataMarket includes its own URL shortener -- like bit.ly but just for their data.
Long or short, just give dmseries() the URL, and it will download the data set for you.
28. Make a withdrawal from the World Bank
> library(WDI)
> WDIsearch('population, total')
indicator      name
"SP.POP.TOTL"  "Population, total"
> WDIsearch('fertility .*total')
indicator         name
"SP.DYN.TFRT.IN"  "Fertility rate, total (births per woman)"
> WDIsearch('life expectancy .*birth.*total')
indicator         name
"SP.DYN.LE00.IN"  "Life expectancy at birth, total (years)"
> WDIsearch('GDP per capita .*constant')
     indicator        name
[1,] "NY.GDP.PCAP.KD" "GDP per capita (constant 2000 US$)"
[2,] "NY.GDP.PCAP.KN" "GDP per capita (constant LCU)"
see R/10-WDI.R
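Judging by the calls above, the search strings behave like case-insensitive regular expressions matched against the indicator names. The sketch below imitates that behaviour offline with grepl(); the indicators vector holds only the five names shown on this slide, and find.indicator() is a hypothetical stand-in, not the WDI implementation:

```r
# The five indicators from the slide, keyed by indicator code
indicators = c(SP.POP.TOTL    = 'Population, total',
               SP.DYN.TFRT.IN = 'Fertility rate, total (births per woman)',
               SP.DYN.LE00.IN = 'Life expectancy at birth, total (years)',
               NY.GDP.PCAP.KD = 'GDP per capita (constant 2000 US$)',
               NY.GDP.PCAP.KN = 'GDP per capita (constant LCU)')

# Hypothetical stand-in for the search: case-insensitive regex on the names
find.indicator = function(pattern) {
  indicators[grepl(pattern, indicators, ignore.case = TRUE)]
}

find.indicator('GDP per capita .*constant')
find.indicator('fertility .*total')
```

The `.*` in each pattern is what lets a search skip over the words between the terms you care about.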
29. Swedish Accent Not Included
data = WDI(country=c('BR', 'CN', 'GB', 'JP', 'IN', 'SE', 'US'),
           indicator=c('SP.DYN.TFRT.IN', 'SP.DYN.LE00.IN', 'SP.POP.TOTL',
                       'NY.GDP.PCAP.KD'),
           start=1900, end=2010)
library(googleVis)
g = gvisMotionChart(data, idvar='country', timevar='year')
plot(g)
see R/10-WDI.R
30. quantmod: the king of symbols
• getSymbols() downloads time series data from the
source specified by the “src” parameter:
– yahoo = Yahoo! Finance
– google = Google Finance
– FRED = St. Louis Fed’s Federal Reserve Economic Data
– oanda = OANDA Forex Trading & Exchange Rates
– csv
– MySQL
– RData
31. Hello, FRED
55,000 economic time series from 45 sources:
• Automatic Data Processing, Inc.
• Banca d'Italia
• Banco de Mexico
• Bank of Japan
• Bankrate, Inc.
• Board of Governors of the Federal Reserve System
• BofA Merrill Lynch
• British Bankers' Association
• Central Bank of the Republic of Turkey
• Chicago Board Options Exchange
• CredAbility Nonprofit Credit Counseling & Education
• Deutsche Bundesbank
• Dow Jones & Company
• Eurostat
• Federal Financial Institutions Examination Council
• Federal Housing Finance Agency
• Federal Reserve Bank of Chicago
• Federal Reserve Bank of Kansas City
• Federal Reserve Bank of Philadelphia
• Federal Reserve Bank of St. Louis
• Freddie Mac
• Haver Analytics
• Institute for Supply Management
• International Monetary Fund
• London Bullion Market Association
• National Association of Realtors
• National Bureau of Economic Research
• Organisation for Economic Co-operation and Development
• Reserve Bank of Australia
• Standard and Poor's
• Swiss National Bank
• The White House: Council of Economic Advisors
• The White House: Office of Management and Budget
• Thomson Reuters/University of Michigan
• U.S. Congress: Congressional Budget Office
• U.S. Department of Commerce: Bureau of Economic Analysis
• U.S. Department of Commerce: Census Bureau
• U.S. Department of Energy: Energy Information Administration
• U.S. Department of Housing and Urban Development
• U.S. Department of Labor: Bureau of Labor Statistics
• U.S. Department of Labor: Employment and Training Administration
• U.S. Department of the Treasury: Financial Management Service
• U.S. Department of Transportation: Federal Highway Administration
• Wilshire Associates Incorporated
• World Bank
32. BLS Jobless data (FRED) + S&P (Yahoo!)
library(quantmod)
initial.claims = getSymbols('ICSA', src='FRED', auto.assign=FALSE)
sp500 = getSymbols('^GSPC', src='yahoo', auto.assign=FALSE)
# Convert daily quotes to weekly bars, then extract the closing price with Cl()
sp500.weekly = Cl(to.weekly(sp500))
see R/11-quantmod.R
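Putting a weekly series next to daily quotes means agreeing on what a "week" is. The sketch below shows the alignment step in base R only, without quantmod or network access; all prices and claim counts are made up, and it assumes the weekly series is dated by the Saturday that ends each week:

```r
# Hypothetical daily closes for two trading weeks (made-up prices)
days  = as.Date('2012-09-17') + 0:11
days  = days[!(as.POSIXlt(days)$wday %in% c(0, 6))]   # drop Saturday and Sunday
daily = data.frame(date = days, close = 1450 + seq_along(days) - 1)

# Label each day with the Saturday ending its week (wday: 0 = Sun ... 6 = Sat)
daily$week.ending = daily$date + (6 - as.POSIXlt(daily$date)$wday)

# Keep the last trading day of each week, i.e. the weekly close
weekly = daily[!duplicated(daily$week.ending, fromLast = TRUE),
               c('week.ending', 'close')]

# Hypothetical weekly claims keyed by week-ending Saturday (made-up counts)
claims = data.frame(week.ending = as.Date(c('2012-09-22', '2012-09-29')),
                    claims      = c(380000, 370000))

# Join the two series on the shared week-ending date
combined = merge(claims, weekly, by = 'week.ending')
combined
```

With real xts objects from getSymbols() the same idea applies, but the date labels of the two series should be checked before merging, since they may not fall on the same weekday.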
33. Resources
• Expanded code snippets and all data for this talk
– http://bit.ly/pawdata
• R Data Import/Export manual
– http://cran.r-project.org/doc/manuals/R-data.html
• CRAN: Comprehensive R Archive Network
– package lists: http://cran.r-project.org/web/packages/
– Featured: XLConnect, foreign, RMySQL, XML, quantmod, rdatamarket, WDI
– Database: RODBC, DBI, RJDBC, ROracle, RPostgreSQL, RSQLite, RMongo, RCassandra
– Data sets: zipcode, agridat, GANPAdata
– Data access: crn, rgbif, RISmed, govdat, myepisodes, msProstate, corpora
• rhbase from the RHadoop project
– https://github.com/RevolutionAnalytics/RHadoop
34. When I first said that R is my “Swiss Army
Knife” for data, you might have pictured this:
36. Thank you!
by Jeffrey Breen
Principal, Think Big Academy
Code & Data on github
http://bit.ly/pawdata
email: jeffrey.breen@thinkbiganalytics.com
blog: http://jeffreybreen.wordpress.com
Twitter: @JeffreyBreen