... And Hype
The Economist, May 14-20, 2011: “Corporate chefs are in demand again, office rents are soaring and the pay being offered to talented folk in fashionable fields like data science is reaching Hollywood levels.”
A Few Thoughts ...
• The underlying data unit is in minutes. So?
• Why is he only looking at this particular set of years?
• How do the Red Sox compare to the other teams in MLB?
• Crap, that last point will require downloading a lot of data ... and my flight was boarding in 10 minutes!
Are The Games Getting Longer?
• I don’t know!
• I would say that the evidence supports an increase up until 2000, and since then it’s been constant or slightly decreasing.
• This is not an exercise in statistical inference; I was just mining the data and looking for trends.
• thelogcabin.wordpress.com/
• github.com/rtelmore/MLB
Another Exercise
• In a conversation, Paul Parker asked whether the minimum number of pitches per (full) inning (6 pitches) has ever been attained.
• This is a hard problem!
• Where do you find this sort of data?
• Back to baseball-reference.com ... the box scores.
http://www.baseball-reference.com/boxes/COL/COL201104010.shtml
How Do We Proceed?
The most systematic way that I could find was to break it down like this:
• 30 teams
• 2005 - 2010
• Every day from Apr 1 through Oct 31
• This is a little more than 78K URLs!
• My program took about 3 hrs 25 min.
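The breakdown above can be sketched in Python as a simple enumeration of candidate box-score URLs. This is a minimal sketch, not the program used for the talk: the URL pattern is inferred from the single example URL shown earlier (team code, team code again, date as YYYYMMDD, then a game number), and the function names and the one-entry team list are made up for illustration. It generates only the game-number-0 URL per team and day, so its total will differ from the 78K figure above.

```python
from datetime import date, timedelta

# Base path taken from the example box-score URL on the earlier slide.
BASE_URL = "http://www.baseball-reference.com/boxes"


def box_score_url(team, day, game=0):
    """Build a URL following the pattern of the example URL (an inferred pattern)."""
    return f"{BASE_URL}/{team}/{team}{day:%Y%m%d}{game}.shtml"


def season_days(year):
    """Yield every calendar day from Apr 1 through Oct 31 of the given year."""
    day = date(year, 4, 1)
    end = date(year, 10, 31)
    while day <= end:
        yield day
        day += timedelta(days=1)


def candidate_urls(teams, first_year=2005, last_year=2010):
    """Yield one candidate URL per team, per season, per day."""
    for team in teams:
        for year in range(first_year, last_year + 1):
            for day in season_days(year):
                yield box_score_url(team, day)


# Only "COL" appears in the slides; the other 29 team codes would have to be
# looked up on the site and are deliberately not guessed here.
urls = list(candidate_urls(["COL"]))
print(len(urls), "candidate URLs for one team")
print(urls[0])
```

Most of these URLs will not correspond to a real game (off days, road games), so a downloader built on this list has to tolerate missing pages.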
Was That Minimum Attained?
• NO! Unless there is an error in my code.
• Did we learn something? Of course.
• Example: I should’ve stored everything in a database while I was downloading and processing the data. Why? I didn’t save any of the data from the 3+ hrs of computing.
http://www.baseball-reference.com/boxes/COL/COL201104010.shtml
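One way to act on that lesson is to write each page to a local database the moment it is downloaded, so a rerun never repeats the network work. The following is a minimal sketch using Python's built-in sqlite3 module; the table layout, the function names, and the stand-in `fake_fetch` downloader are all assumptions for illustration, not what the original program did.

```python
import sqlite3


def open_cache(path="boxscores.db"):
    """Open (or create) a SQLite file with one row per downloaded page."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages (url TEXT PRIMARY KEY, html TEXT NOT NULL)"
    )
    return conn


def get_page(conn, url, fetch):
    """Return the page for url, calling fetch(url) only if it is not stored yet."""
    row = conn.execute("SELECT html FROM pages WHERE url = ?", (url,)).fetchone()
    if row is not None:
        return row[0]
    html = fetch(url)
    conn.execute("INSERT INTO pages (url, html) VALUES (?, ?)", (url, html))
    conn.commit()  # commit per page so an interrupted run keeps its progress
    return html


# Stand-in downloader so the sketch runs offline; a real one would do an HTTP GET.
calls = []


def fake_fetch(url):
    calls.append(url)
    return "<html>placeholder box score</html>"


conn = open_cache(":memory:")  # in-memory database for the demonstration
url = "http://www.baseball-reference.com/boxes/COL/COL201104010.shtml"
first = get_page(conn, url, fake_fetch)
second = get_page(conn, url, fake_fetch)  # served from the database
print("downloads performed:", len(calls))
```

Passing a file path instead of `":memory:"` keeps the raw pages on disk, so the parsing step can be rewritten and rerun without another multi-hour download.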
Using Google Trends Data
• http://www.google.com/trends
• You can put a search term in and it will return a lot of historical statistics related to your query (e.g., flu trends).
• There is an R package (RGoogleTrends) that allows access to the GT API if you have a Google account (e.g., Gmail).
• Use the getGTrends("query") function.
Conclusions/Discussion
• There is a lot of data available on the web!
• You can access this data from a browser; however, you can access A LOT more data if you let your computer do the work.
• Good tools for data mining: R, Python, Perl, etc.
• Download data and see where you go.