Oscon 2013 Jesse Anderson

Jesse Anderson's OSCON 2013 talk

  • Extract value and insight. http://www.flickr.com/photos/billlublin/3972999678/sizes/o/
  • http://www.flickr.com/photos/nathaninsandiego/5159833527/sizes/o/
  • Unstructured data. Human generated. http://www.flickr.com/photos/nathaninsandiego/5159833527/sizes/o/
  • Incomplete passes to a receiver, averaged over seasons together: A.Luck to R.Wayne; G.Ferotte to C.Chambers; J.Freeman to V.Jackson; T.Brady to R.Moss; A.Luck to D.Avery
  • This breakup creates 96 different queryable columns. http://www.flickr.com/photos/modenadude/6150263821/sizes/o/in/photostream/
  • 1st downs are 52% run and 42% pass. 2nd downs are 45% run and 49% pass. 3rd downs are 26% run and 66% pass. http://www.flickr.com/photos/crackerbunny/3215652008/sizes/l/
  • Easy for humans to parse data, hard for computers. Natural language processing. While breaking down the data, we need to know what questions we want to answer. Look back at my commits to see what I've added. http://www.flickr.com/photos/nathaninsandiego/5159833527/sizes/o/
  • http://www.flickr.com/photos/modenadude/6150820962/sizes/o/
  • This breakup creates 96 different queryable columns. Limited to data about plays. http://www.flickr.com/photos/modenadude/6150263821/sizes/o/in/photostream/
  • 1 yard to go is 65% run. X and 24 has the highest chance of a sack, at 4.6%. X and 21 has the highest chance of a QB scramble, at 1.7%. X and 10 is about even between pass and run, in the high 40s. http://www.flickr.com/photos/crackerbunny/3215652008/sizes/l/
  • 6% of plays lack weather data. Hours spent diagnosing missing or bad data. Hours spent downloading data. http://www.flickr.com/photos/37611179@N00/2295452969/in/photolist-4uQNck-5SRuWS-5WYBDL-677pYM-7cscT7-7vyC7G-7XRk46-84U1Ft-ayVaRS-7ReJrS-dpXi1U-8cTwQ1-7Pq9iE-bEo82F-98LeR5-9Ue2aF-b3vtrz-7YWv62
  • 100-81: 9%; 80: 3%; 79-50: 41%; 49-21: 28%; 20-0: 18%. http://en.wikipedia.org/wiki/File:Acre_over_football_field.svg http://www.flickr.com/photos/10792703@N07/5753103429/in/photolist-9Loadr-cFWwK5-7EF4kv-d8HppU-aWhuw6-8HBrik-9X7RqK-9XaR7f-e81wbX-89PW2o-8u8GKc-dCM1x1-9bbf31-8Mco3M-ck72kf-bmuLcL-dPUGbG-8HEzxY-bSMizz-92FLxy-7LCu9g-8qcDik-81ASaj-81ASas-81ASam-81ASad-dqfGpZ-9X81MM-ck73Q3-dgnu17-dgnsVy-dgntA5-dgnrba-a85BMW-aBZgcM-beiJi2-boaW1F-7CbZ6C-a9FcCw-8nEGtU-8JwV5X-dAgFZu-doXFTj
  • Georgia Dome. http://www.flickr.com/photos/ucumari/481430551/sizes/o/
  • Date of game is important later on
  • http://www.flickr.com/photos/aneebaba/5154335641/sizes/o/
  • http://www.flickr.com/photos/aneebaba/5154335641/sizes/o/
  • http://www.flickr.com/photos/zruda/1807289958/in/photolist-3KGQkG-44bNJx-4js8Cg-4pQ1bg-4sNLUK-4wBzkz-4wFGmh-559J6y-5nxQVm-5qnF14-5r9AyS-5r9AJq-5KGLMR-5KGQNx-5W2oxe-5W2oKZ-5W6Gt9-5W6Gvs-6k6HX8-6k6J2B-6k6Jcn-6kaUuC-6kaUQE-6wffW7-7chpaN-dFfSAs-8RsNT8-9Pzgh1-9PwrNF-812vNy-a6s3Ec-8NFpHL-bpjMZq-bpjRu1-bnv3gS-8qemwV-dFfSuG-aKju4r-9gin1L/ http://www.flickr.com/photos/17251027@N00/2190657211/in/photolist-4kzG4V-4qfDjD-5e3UP6-5k4eSa-5m73Pf-5mR3nR-5nSv8u-5qnF14-5rGWN8-5rM4m3-5rM58f-5rMcT7-5rMdB3-5rMeko-5rMeZs-5rMhBN-5rMEqb-5rNvKb-5vrbfb-5zUrSt-5C3LQs-5CcaoK-5Cgq7N-5Cgtko-643317-6433ym-649s84-6EBd5T-6LwGEX-6XnJXg-6Y6D6D-71kkp7-741GVR-741H1z-741H5r-741Hcg-741HfM-741Hja-741Hoa-741HyT-741HBx-741HF6-741HJn-741HMR-741J5p-741J9r-741JdM-741Jiz-741JnM-741Jtv-741JxP http://www.flickr.com/photos/kevharb/3124008816/
  • http://www.flickr.com/photos/keithallison/2310794054/sizes/o/
  • There is no direct key between stadium and weather station. The average score in games with weather is 21-18; without weather it is 21-19.
  • Miami has the worst, 14-18. Pittsburgh has the biggest non-weather advantage, 24-14. http://www.flickr.com/photos/37611179@N00/2295452969/in/photolist-4uQNck-5SRuWS-5WYBDL-677pYM-7cscT7-7vyC7G-7XRk46-84U1Ft-ayVaRS-7ReJrS-dpXi1U-8cTwQ1-7Pq9iE-bEo82F-98LeR5-9Ue2aF-b3vtrz-7YWv62
  • Used by permission of Lego Police Force https://www.facebook.com/LegoPD
  • 2008 was the peak, with 29 of 32 teams having an arrest. Commissioner Goodell implemented a personal conduct policy in 2007 for the 2008 season. http://www.thebiglead.com/index.php/2013/07/01/nfl-offseason-arrests-are-up-61-since-roger-goodell-implemented-personal-conduct-policy-in-2007/
  • Weather is not as big an issue. Arrests are not a big issue. We need to use data to make decisions.
  • Learn more at the screencast. Use the QuickStart VM.
  • http://www.flickr.com/photos/paolo_rosa/5062025369/sizes/o/
  • http://www.flickr.com/photos/billlublin/3973002210/sizes/o/
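The notes above describe play descriptions as easy for humans to parse but hard for computers, and mention breaking them up into queryable columns. A minimal sketch of that idea for one kind of play, a completed pass, might look like the following. The class name `PlayDescriptionParser`, the regular expression, and the column names are illustrative assumptions, not the pattern or schema used in the talk's code.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PlayDescriptionParser {
    // Hypothetical pattern covering only completed passes written like the sample play.
    // Real play-by-play text has many more variants (runs, sacks, penalties, ...).
    private static final Pattern COMPLETED_PASS = Pattern.compile(
            "\\((\\d{1,2}:\\d{2})\\)\\s+(\\S+) pass (short|deep) (left|middle|right) to (\\S+)"
            + " to (\\w+) (\\d+) for (-?\\d+) yards?");

    /** Returns the extracted columns, or an empty map if the text is not a recognized completed pass. */
    public static Map<String, String> parse(String description) {
        Map<String, String> columns = new LinkedHashMap<>();
        Matcher m = COMPLETED_PASS.matcher(description);
        if (m.find()) {
            columns.put("clock", m.group(1));
            columns.put("passer", m.group(2));
            columns.put("depth", m.group(3));
            columns.put("direction", m.group(4));
            columns.put("receiver", m.group(5));
            columns.put("endSide", m.group(6));      // side of the field where the play ended
            columns.put("endYardLine", m.group(7));
            columns.put("yards", m.group(8));
        }
        return columns;
    }

    public static void main(String[] args) {
        String description = "(2:48) C.Kaepernick pass short right to M.Crabtree to SF 25 "
                + "for 1 yard (C.Tillman). Caught at SF 25. 0-yds YAC";
        // Print each extracted column on its own line.
        parse(description).forEach((name, value) -> System.out.println(name + " = " + value));
    }
}
```

Each additional play type would need its own pattern, which is one reason it helps to know which questions the data should answer before breaking it down.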

Oscon 2013 Jesse Anderson Presentation Transcript

  • 1 Doing Data Science on the NFL Play by Play Dataset. Jesse Anderson | Curriculum Developer and Instructor. July 2013, v2
  • 2 Plays: Advanced NFL Stats released all play-by-play data since the 2002 season; 2,898 total games; 471,392 plays
  • 3 Full Play Entry: 20121119_CHI@SF,3,17,48,SF,CHI,3,2,76,20,0,(2:48) C.Kaepernick pass short right to M.Crabtree to SF 25 for 1 yard (C.Tillman). Caught at SF 25. 0-yds YAC,0,3,0,27,7,2012
  • 4 Play Description: (2:48) C.Kaepernick pass short right to M.Crabtree to SF 25 for 1 yard (C.Tillman). Caught at SF 25. 0-yds YAC
  • 5 There's A Chart for That
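  • 6 There's A Custom MapReduce Behind That
        import java.io.IOException;
        import java.util.regex.Matcher;
        import org.apache.hadoop.io.LongWritable;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Mapper;

        // PassWritable (a custom Writable) and incompletePass (a compiled
        // java.util.regex.Pattern) are not shown on the slide.
        public class IncompletesMapper extends Mapper<LongWritable, Text, Text, PassWritable> {
            @Override
            public void map(LongWritable key, Text value, Context context)
                    throws IOException, InterruptedException {
                String line = value.toString();
                // Only consider plays whose text mentions an incomplete pass.
                if (line.contains("incomplete")) {
                    Matcher matcher = incompletePass.matcher(line);
                    if (matcher.find()) {
                        // Key: groups 1 and 2 joined by "-". Value: a count of 1 plus the integer in group 3.
                        context.write(new Text(matcher.group(1) + "-" + matcher.group(2)),
                                new PassWritable(1, Integer.parseInt(matcher.group(3))));
                    }
                }
            }
        }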
  • 7 The Hive Story Enter the Query
  • 8 Queryable Data: Give me every run play by New Orleans in the 2010 season
  • 9 From the Data: Fourth Downs. 15% of 4th down plays weren't kicks
  • 10 Play by Play Pieces: (2:48) C.Kaepernick pass short right to M.Crabtree to SF 25 for 1 yard (C.Tillman). Caught at SF 25. 0-yds YAC
  • 11 From the Data: Sacks. QB sacks and scrambles double on 3rd downs
  • 12 Hive: an abstraction on top of MapReduce; allows queries using a SQL-like language
  • 13 Hive Query: Give me every run by New Orleans in the 2010 season:
        SELECT * FROM playbyplay WHERE playtype = "RUN" and year = 2010 and game like "%NO%";
  • 14 From the Data: Yards to Go. With 1 yard to go, 65% of plays are runs
  • 15 Lost in data Algorithm Alone
  • 16 Data Janitorial
  • 17 From the Data: Number of Plays By Yard Line. Direction of Offense
  • 18 Stadium
  • 19 Figuring Out Stadium: 20121119_CHI@SF is Date Played (20121119), Away Team (CHI), Home Team (SF)
  • 20 From the Data: Stadium Attendance. Stadiums with the smallest capacities average the best scores, 20.55-17.79
  • 21 Stadium Data
    Stadium: The capacity of the stadium
    Expanded Capacity: The expanded capacity of the stadium
    Location: The location of the stadium
    Playing Surface: The type of grass, etc. that the stadium has
    Is Artificial: Is the playing surface artificial
    Team: The name of the team that plays at the stadium
    Roof Type: The type of roof in the stadium (None, Retractable, Dome)
    Elevation: The elevation of the stadium
  • 22 From the Data: Stadium Elevation. There is a 1% increase in passes at Mile High versus sea level stadiums
  • 23 Weather: 1,015 games had weather
  • 24 From the Data: Fumble. Games with weather have a fumble 93% of the time compared to 56% without
  • 25 Weather Data
    STATION: Station identifier
    STATION NAME: Station location name
    READING DATE: Date of reading
    PRCP: Precipitation
    AWND: Average daily wind speed
    WV20: Fog, ice fog, or freezing fog (may include heavy fog)
    TMAX: Maximum temperature
    TMIN: Minimum temperature
  • 26 From the Data: Home Field Advantage. Baltimore has the biggest weather advantage, 22-14
  • 27 Arrests
  • 28 Arrest Data
    Season: Player arrested in (February to February)
    Team: Team person played on
    Player: Name of player arrested
    Player Arrested: Was a player in the play arrested that season
    Offense Player Arrested: Offense had player arrested in season
    Defense Player Arrested: Defense had player arrested in season
    Home Team Player Arrested: Home Team had player arrested in season
    Away Team Player Arrested: Away Team had player arrested in season
  • 29 Arrest = Win? Whenever there are arrests in the home team, the away team, or both, the home team wins. From 2002 to 2012, each team had many arrests. From a low in 2002 of 56% to a high of [value missing]
  • 30
  • 31
  • 32 The Low Downs
    /me - http://www.jesse-anderson.com
    @jessetanderson
    Code - https://github.com/eljefe6a/nfldata
    *I am not in any way affiliated with the NFL or any Team
  • 33
  • 34 From the Data: Weather. Wind had the most effect on games. At calm winds, 41% pass and 37% run. At >30 MPH, 34% pass and 46% run
  • 35 From the Data: Field Goals. Weather only increases misses by 1%. 14% of field goals are missed. 21% of field goals are missed at 30-39 MPH average winds
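The "Figuring Out Stadium" slide splits a game identifier such as 20121119_CHI@SF into the date played, the away team, and the home team. A minimal sketch of that split is shown below. The class name `GameId` and its field names are illustrative assumptions rather than code from the talk's repository, and the sketch assumes every identifier follows the `yyyyMMdd_AWAY@HOME` shape seen on the slide.

```java
import java.time.LocalDate;
import java.time.format.DateTimeFormatter;

public class GameId {
    public final LocalDate datePlayed;
    public final String awayTeam;
    public final String homeTeam;

    private GameId(LocalDate datePlayed, String awayTeam, String homeTeam) {
        this.datePlayed = datePlayed;
        this.awayTeam = awayTeam;
        this.homeTeam = homeTeam;
    }

    /** Parses an identifier of the assumed form yyyyMMdd_AWAY@HOME. */
    public static GameId parse(String id) {
        int underscore = id.indexOf('_');
        int at = id.indexOf('@', underscore + 1);
        if (underscore < 0 || at < 0) {
            throw new IllegalArgumentException("Unexpected game id: " + id);
        }
        // BASIC_ISO_DATE reads dates written as yyyyMMdd, e.g. 20121119.
        LocalDate date = LocalDate.parse(id.substring(0, underscore),
                DateTimeFormatter.BASIC_ISO_DATE);
        String away = id.substring(underscore + 1, at);
        String home = id.substring(at + 1);
        return new GameId(date, away, home);
    }

    public static void main(String[] args) {
        GameId game = GameId.parse("20121119_CHI@SF");
        System.out.println(game.datePlayed + " " + game.awayTeam + " at " + game.homeTeam);
    }
}
```

Under these assumptions, the home team can be matched against the Team field of the stadium data, and the date can be matched against a weather reading date.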