3. ... And Hype
The Economist, May 14-20,
2011: “Corporate chefs
are in demand again, office
rents are soaring and the
pay being offered to
talented folk in fashionable
fields like data science
is reaching Hollywood
levels.”
9. A Few Thoughts ...
• The underlying data unit is in minutes. So?
• Why is he only looking at this particular set
of years?
• How do the Red Sox compare to the other
teams in MLB?
• Crap, that last point will require
downloading a lot of data ... and my flight
was boarding in 10 minutes!
11. Where Can We Get Data?
http://www.baseball-reference.com/teams/COL/2010-schedule-scores.shtml
12. Where Can We Get Data?
http://www.baseball-reference.com/teams/COL/2010-schedule-scores.shtml
Just step through
all of the teams:
COL, BOS, etc.
13. Where Can We Get Data?
http://www.baseball-reference.com/teams/COL/2010-schedule-scores.shtml
Just step through all of the teams (COL, BOS, etc.)
and any years that you are interested in.
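Stepping through teams and years amounts to filling two slots in the URL above. A minimal sketch in Python, assuming the URL pattern holds for every team and year; the function name and the year range are illustrative, and only COL and BOS are team codes taken from the slide:

```python
# Template copied from the slide; only the team code and the year vary.
SCHEDULE_URL = (
    "http://www.baseball-reference.com/teams/{team}/{year}-schedule-scores.shtml"
)


def schedule_urls(teams, years):
    """Return one schedule URL per (team, year) pair (hypothetical helper)."""
    return [
        SCHEDULE_URL.format(team=team, year=year)
        for team in teams
        for year in years
    ]


# Illustrative inputs: two team codes from the slide, two seasons.
urls = schedule_urls(["COL", "BOS"], range(2009, 2011))
for url in urls:
    print(url)
```

Each URL would then be handed to whatever download and parsing step you prefer.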
18. Are The Games Getting Longer?
• I don’t know!
• I would say that the evidence supports an
increase up until 2000 and then it’s been
constant or slightly decreasing.
• This is not an exercise in statistical
inference; I was just mining the data and
looking for trends.
• thelogcabin.wordpress.com/
• github.com/rtelmore/MLB
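Looking for a trend like this one comes down to summarizing game length by season. A minimal sketch in Python, not the code from the repository above; it assumes the scraped data has been reduced to (year, minutes) pairs, and the sample values are made up purely to show the shape of the input:

```python
from collections import defaultdict
from statistics import mean


def mean_minutes_by_year(games):
    """Average game length in minutes for each season.

    `games` is an iterable of (year, minutes) pairs; this layout is an
    assumption, not the format of the original data.
    """
    by_year = defaultdict(list)
    for year, minutes in games:
        by_year[year].append(minutes)
    return {year: mean(values) for year, values in sorted(by_year.items())}


# Made-up values, for illustration only.
sample = [(1999, 170), (1999, 180), (2000, 185), (2000, 175)]
averages = mean_minutes_by_year(sample)
print(averages)
```

Plotting these yearly averages is exploration, not inference, in the same spirit as the slide.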
19. Another Exercise
• In a conversation with Paul Parker, he asked
if the minimum number of pitches per (full)
inning (6 pitches) has ever been attained.
• This is a hard problem!
• Where do you find this sort of data?
• Back to baseball-reference.com ... the box
scores.
http://www.baseball-reference.com/boxes/COL/COL201104010.shtml
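Once pitch counts have been pulled out of a box score, the check itself is small. A sketch in Python under a loud assumption: the counts are already available as one (top, bottom) pair per full inning. The slides do not describe the parsed layout, so the input format and function names here are hypothetical.

```python
MIN_PITCHES_FULL_INNING = 6  # the minimum quoted on the slide


def full_inning_totals(half_inning_pitches):
    """Sum the top and bottom pitch counts for each full inning."""
    return [top + bottom for top, bottom in half_inning_pitches]


def attains_minimum(half_inning_pitches):
    """True if any full inning used exactly the 6-pitch minimum."""
    totals = full_inning_totals(half_inning_pitches)
    return any(total == MIN_PITCHES_FULL_INNING for total in totals)


# Hypothetical games: one (top, bottom) pair per full inning.
ordinary_game = [(12, 15), (9, 11), (14, 8)]
perfect_inning_game = [(12, 15), (3, 3), (14, 8)]
print(attains_minimum(ordinary_game))
print(attains_minimum(perfect_inning_game))
```

The hard part of the problem remains getting per-half-inning pitch counts out of the pages in the first place.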
21. How Do We Proceed?
The most systematic way that I could find
was to break it down like this:
• 30 Teams
• 2005 - 2010
• Every day from Apr 1 through Oct 31
• This is a little more than 78K URLs!
• My program took about 3 hrs 25 min.
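The breakdown above can be sketched as three nested loops over teams, years, and days. This Python version is an illustration, not the original program: it copies the box-score URL shape shown earlier, including the trailing `0`, and assumes that shape holds for every team and date. Because it generates only that one suffix, it yields 214 URLs per team per season and will not reproduce the total quoted above.

```python
from datetime import date, timedelta

# Shape copied from the example box-score URL; the trailing "0" is kept as-is.
BOX_URL = "http://www.baseball-reference.com/boxes/{team}/{team}{day:%Y%m%d}0.shtml"


def season_days(year):
    """Yield every calendar day from Apr 1 through Oct 31 of `year`."""
    day = date(year, 4, 1)
    end = date(year, 10, 31)
    while day <= end:
        yield day
        day += timedelta(days=1)


def box_score_urls(teams, years):
    """Yield one candidate URL per team, year, and day (hypothetical helper)."""
    for team in teams:
        for year in years:
            for day in season_days(year):
                yield BOX_URL.format(team=team, day=day)


# One team code from the slides, seasons 2005 through 2010.
urls = list(box_score_urls(["COL"], range(2005, 2011)))
print(len(urls), urls[0])
```

Many of these candidates will not correspond to a game, so the download step needs to tolerate missing pages.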
22. Was That Minimum Attained?
• NO! Unless there is an error in my code.
• Did we learn something? Of course.
• Example: I should’ve stored everything in a
database while I was downloading and
processing the data. Why? I didn’t save any
of the data from the 3+ hrs of computing.
http://www.baseball-reference.com/boxes/COL/COL201104010.shtml
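The lesson about storing data while downloading can be sketched with Python's built-in sqlite3 module. This is one possible approach, not what the original program did; the table name, column names, and helper functions are all hypothetical.

```python
import sqlite3


def open_store(path=":memory:"):
    """Open (or create) a SQLite database with one table of raw pages."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages ("
        "url TEXT PRIMARY KEY, body TEXT NOT NULL)"
    )
    return conn


def save_page(conn, url, body):
    """Store a downloaded page immediately; re-downloads overwrite the old copy."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT OR REPLACE INTO pages (url, body) VALUES (?, ?)",
            (url, body),
        )


def load_page(conn, url):
    """Return the stored page body, or None if the URL was never saved."""
    row = conn.execute("SELECT body FROM pages WHERE url = ?", (url,)).fetchone()
    return row[0] if row else None


# Placeholder URL and body, for illustration only.
conn = open_store()
save_page(conn, "http://example.invalid/box", "<html>box score</html>")
print(load_page(conn, "http://example.invalid/box"))
```

Passing a file path instead of `:memory:` keeps the pages on disk, so a later bug fix in the parsing code would not require repeating the 3+ hours of downloading.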
23. Using Google Trends Data
• http://www.google.com/trends
• You can put a search term in and it will
return a lot of historical statistics related to
your query (e.g., flu trends)
• There is an R package (RGoogleTrends)
that allows access to the GT API if you have
a Google account (e.g., Gmail).
• Use the getGTrends("query") function
30. Conclusions/Discussion
• There is a lot of data available on the web!
• You can access this data from a browser;
however, you can access A LOT more data
if you let your computer do the work.
• Good tools for data mining: R, Python,
Perl, etc.
• Download data and see where you go