1. Hands-on Hadoop with the NFL Play-by-Play Dataset
Ryan Bosshart | Systems Engineer
Oct 2013 v2
2. What’s Ahead?
“Hands on” with Hadoop using NFL Play-by-play
• No prior experience needed
• Feel free to ask questions
7. Stadium Data
Example record:
Lambeau Field,79594,79594,Green Bay Wisconsin,Desso GrassMaster,TRUE,GB,1957,GHCND:USW00014898,None,581

Stadium: The name of the stadium
Capacity: The capacity of the stadium
Expanded Capacity: The expanded capacity of the stadium
Location: The location of the stadium
Playing Surface: The type of grass, etc. that the stadium has
Is Artificial: Is the playing surface artificial
Team: The name of the team that plays at the stadium
Opened: The year the stadium opened
Weather Station: The weather station closest to the stadium
Roof Type: The type of roof in the stadium (None, Retractable, Dome)
Elevation: The elevation of the stadium
9. Weather Data
Example record:
GHCND:USW00014898,GREEN BAY AUSTIN STRAUBEL INTERNATIONAL AIRPORT WI US,20020101,-9999,-9999,-9999,0,0,0,-9999,477,-22,-133,-9999,0,-9999,23,30,20,9999,45,54,-9999,1514,1402,-9999,-9999,

STATION: Station identifier
STATION NAME: Station location name
READING DATE: Date of reading
PRCP: Precipitation
AWND: Average daily wind speed
WV20: Fog, ice fog, or freezing fog (may include heavy fog)
TMAX: Maximum temperature
TMIN: Minimum temperature
11. Arrest Data
Season: Season the player was arrested in (February to February)
Team: Team the player played on
Player: Name of the player arrested
Player Arrested: Was a player in the play arrested that season
Offense Player Arrested: Offense had a player arrested that season
Defense Player Arrested: Defense had a player arrested that season
Home Team Player Arrested: Home team had a player arrested that season
Away Team Player Arrested: Away team had a player arrested that season
14. HDFS: Hadoop Distributed File System
• Inspired by the Google File System
• Provides low-cost storage for massive amounts of data
• Not a general-purpose filesystem; optimized for processing data with Hadoop
• Cannot modify file content once written
• It's actually a user-space Java process
• Accessed using special commands or APIs (see the example below)
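For instance, day-to-day interaction happens through the hadoop fs shell, which mirrors familiar UNIX commands. A minimal sketch (the file and directory names here are illustrative):

$ hadoop fs -ls /user/cloudera
$ hadoop fs -mkdir input
$ hadoop fs -put nfldata.csv input/
$ hadoop fs -cat input/nfldata.csv | head

Relative paths resolve against your HDFS home directory, much like a regular shell.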
15. HDFS Blocks
• When data is loaded into HDFS, it's split into blocks
• Blocks are of a fixed size (64 MB by default)
• These are huge compared to typical UNIX filesystem blocks
[Diagram: a 230 MB input file is split into Block 1 (64 MB), Block 2 (64 MB), Block 3 (64 MB), and Block 4 (38 MB)]
16. HDFS Replication
• Each block is then replicated to multiple machines
• Default replication factor is three (but configurable)
[Diagram: Block 1 (64 MB) replicated to three of five slave nodes, A through E]
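If you're curious how a particular file was split up and where the replicas landed, the filesystem checker will show you; a quick sketch (the path is illustrative):

$ hadoop fsck /user/cloudera/input/nfldata.csv -files -blocks -locations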
24. Play by Play Pieces
(2:48) C.Kaepernick pass short right to M.Crabtree to SF 25 for 1 yard (C.Tillman). Caught at SF 25. 0-yds YAC
25. There's A Custom MapReduce Behind That
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class IncompletesMapper extends Mapper<LongWritable, Text, Text, PassWritable> {
    // Regex defined in the full source (not shown on the slide); it captures
    // passer, receiver, and yardage from an incomplete-pass description
    private static final Pattern incompletePass = Pattern.compile("...");

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        if (line.contains("incomplete")) {
            Matcher matcher = incompletePass.matcher(line);
            if (matcher.find()) {
                context.write(new Text(matcher.group(1) + "-" + matcher.group(2)),
                        new PassWritable(1, Integer.parseInt(matcher.group(3))));
            }
        }
    }
}
26. Mapper Input and Output
Input record:
20020905_SF@NYG,1,,0,SF,NYG,,,,J.Cortez kicks 75 yards from SF 30 to NYG -5. R.Dixon Touchback.,0,0,2002
Map(k,v) emits
Key: 20020905_SF@NYG
Value: ... false false false KICK NYG SF

Input record:
20020905_SF@NYG,1,60,0,NYG,SF,1,10,80,(15:00) T.Barber left end to NYG 24 for 4 yards (C.Okeafor J.Webster).,0,0,2002
Map(k,v) emits
Key: 20020905_SF@NYG
Value: ... false false false RUN NYG SF

… (additional plays, including passes, omitted) …

Input record:
20020905_SF@NYG,1,53,16,NYG,SF,1,10,36,(8:16) T.Barber right guard to SF 30 for 6 yards (J.Winborn).,0,0,2002
Map(k,v) emits
Key: 20020905_SF@NYG
Value: ... false false false RUN NYG SF
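The deck doesn't show the matching reducer, but shape-wise it just sums the per-pass counts and yardage for each key. A hypothetical sketch, assuming PassWritable exposes getCount() and getYards() accessors (its definition is never shown in the deck):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IncompletesReducer extends Reducer<Text, PassWritable, Text, PassWritable> {
    @Override
    public void reduce(Text key, Iterable<PassWritable> values, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        int yards = 0;
        for (PassWritable p : values) {
            count += p.getCount();   // assumed accessor, not shown in the deck
            yards += p.getYards();   // assumed accessor, not shown in the deck
        }
        context.write(key, new PassWritable(count, yards));
    }
}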
28. • Driver, Mapper, Reducer
• The driver does configuration and sets the mapper/reducer
• Our PlayByPlayDriver takes two arguments:
  • Input directory
  • Output directory
• Most common error in MapReduce:
  Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /home/cloudera/output already exists
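To make the driver's job concrete, here is a minimal sketch of what a driver like PlayByPlayDriver does; the actual class in the workshop project may differ, and IncompletesReducer is the hypothetical reducer sketched earlier:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PlayByPlayDriver {
    public static void main(String[] args) throws Exception {
        // args[0] = input directory, args[1] = output directory
        Job job = Job.getInstance(new Configuration(), "nfl play by play");
        job.setJarByClass(PlayByPlayDriver.class);
        job.setMapperClass(IncompletesMapper.class);
        job.setReducerClass(IncompletesReducer.class);  // hypothetical, see above
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(PassWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // This is where FileAlreadyExistsException comes from:
        // the output directory must not exist before the job runs
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}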
29. Import the Java project:
• Open Eclipse: File -> Import -> "Existing Projects into Workspace" -> home/cloudera/workspace/nfldata -> Finish -> OK
Create the job jar:
• $ cd src
• $ javac -classpath `hadoop classpath` *.java
  Note those are backquotes: this runs the hadoop classpath command and uses its output as the classpath for javac
• $ jar cf ../playbyplay.jar *.class
• $ cd ..
Run the job:
• $ hadoop jar playbyplay.jar PlayByPlayDriver input playoutput
• $ hadoop jar playbyplay.jar ArrestJoinDriver playoutput joinedoutput arrests.csv
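When a job finishes you can read its results straight out of HDFS; a quick sketch (reducer output follows the standard part-file naming, so the exact file name may vary):

$ hadoop fs -ls playoutput
$ hadoop fs -cat playoutput/part-r-00000 | head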
32. Stadium Data
Lambeau Field,79594,79594,Green Bay Wisconsin,Desso GrassMaster,TRUE,GB,1957,GHCND:USW00014898,None,581
(Same fields as slide 7: Stadium, Capacity, Expanded Capacity, Location, Playing Surface, Is Artificial, Team, Opened, Weather Station, Roof Type, Elevation. The Hive DDL on the next slide maps each one to a typed column.)
33. playbyplay_tablecreate.hql
DROP TABLE IF EXISTS stadium;
CREATE EXTERNAL TABLE stadium (
Stadium STRING COMMENT 'The name of the stadium',
Capacity INT COMMENT 'The capacity of the stadium',
ExpandedCapacity INT COMMENT 'The expanded capacity of the stadium',
StadiumLocation STRING COMMENT 'The location of the stadium',
PlayingSurface STRING COMMENT 'The type of grass, etc that the stadium has',
IsArtificial BOOLEAN COMMENT 'Is the playing surface artificial',
Team STRING COMMENT 'The name of the team that plays at the stadium',
Opened INT COMMENT 'The year the stadium opened',
WeatherStation STRING COMMENT 'The name of the weather station closest to the stadium',
RoofType STRING COMMENT '(Possible Values:None,Retractable,Dome) - The type of roof in the stadium',
Elevation INT COMMENT 'The altitude of the stadium'
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION "/user/cloudera/stadium";
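Because the table is EXTERNAL, Hive simply reads whatever delimited files sit in that HDFS directory. A minimal sketch of loading and sanity-checking the data (the CSV file name is illustrative):

$ hadoop fs -mkdir /user/cloudera/stadium
$ hadoop fs -put stadium.csv /user/cloudera/stadium/
hive> SELECT stadium, rooftype, elevation FROM stadium LIMIT 5;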
35. Hive Query
Give me every run by New Orleans in the 2010 season:

SELECT * FROM playbyplay
WHERE playtype = "RUN"
  AND year = 2010
  AND game LIKE "%NO%";
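Aggregations work the same way; for example, a run/pass breakdown over the same columns (same assumptions about the playbyplay schema as the query above):

hive> SELECT playtype, COUNT(*) FROM playbyplay WHERE year = 2010 GROUP BY playtype;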
37. From the Data: Field Goals
• 14% of field goals are missed
• Weather only increases misses by 1%
• 21% of field goals are missed in 30-39 MPH average winds
38. Sentry
• Open source authorization module for Impala & Hive
• Unlocks key RBAC requirements:
  • Secure, fine-grained, role-based authorization
  • Multi-tenant administration
• Open source: submitted to the ASF
• Initially supported in Impala 1.1 and HiveServer2
Fun. Structured + unstructured. Make some meaningful analysis.
I know we have a mixed group here, so some of you will probably have a good idea of what Hadoop does, the problems it's meant to solve, and some of the processing frameworks that sit on top. My goal here is to reach a broad audience. Hopefully you will learn: how to interact with HDFS, how to run a MapReduce job (probably not how to write one), and how to create a table in Hive (or Impala). The problem Jesse solved is pretty common: land raw data in Hadoop, then do ETL. Caveats: thankfully, I will not be teaching Java development; I will show conceptually what is happening.
Learn more at the screencast. Use the QuickStart VM. I am merely the tour guide; I am not the author, and there is plenty about this dataset and MapReduce that I don't know. Jesse Anderson is a curriculum developer for Cloudera University. We will be using MapReduce and Hive, and there are additional code examples for using Pig. This will be high level; for more in-depth coverage, see our training program and online tutorials.
Answer some questions about the NFL: instead of asking the “experts” on the NFL shows, you want to really know. How could we do that? I'll need the data, right? It's really nice to have granular data; I can ask different types of questions. SQL is really nice because I already know it; I might even want to make some visualizations with it.
Extract value and insight. Play by play vs. wins: different aspects, such as wins, a particular player, all punts, all kicks, roll-up by quarter. A good example of why Hadoop is valuable: granular data. The same is true of other systems: POS, medical records, etc. http://www.flickr.com/photos/billlublin/3972999678/sizes/o/
Why is there a home-field advantage? Turf? Size of the stadium? Certain fans? Characteristics of the stadium: turf, etc. http://www.flickr.com/photos/37611179@N00/2295452969/in/photolist-4uQNck-5SRuWS-5WYBDL-677pYM-7cscT7-7vyC7G-7XRk46-84U1Ft-ayVaRS-7ReJrS-dpXi1U-8cTwQ1-7Pq9iE-bEo82F-98LeR5-9Ue2aF-b3vtrz-7YWv62
There is no direct key between stadium and weather station. The average score with weather is 21-18; without weather it is 21-19.
Used by permission of Lego Police Force https://www.facebook.com/LegoPD
Underlying Hadoop is HDFS, inspired by Google's GFS: not only does it scale easily and cost-effectively, it is designed for processing massive amounts of information. There are design trade-offs. The first is that this is an append-only filesystem: once a file is written and closed, it's done; you can't update it. That said, you interact with it much like a regular Linux filesystem. Imagine a shared Linux filesystem: I can create user directories, list contents, and write files. But really you are interacting with a bunch of Java daemons running on top of the underlying Linux filesystem. There are REST APIs and Java APIs, and through those you interact with Hadoop like you would any other filesystem.
When we write data to disk in HDFS, we optimize it for processing: to avoid seeks and maximize throughput, we write data sequentially in very large blocks. The block size is configurable; the default is 64 MB, and most people use 128 MB. Avoiding seeks is perhaps obvious, but we don't want blocks too big either: imagine a 10 GB file all in one block. We could get to it with one seek, but at a 100 MB/sec read rate it would take well over a minute to read from disk (10,240 MB / 100 MB/sec is about 102 seconds). With fixed-size blocks we are balancing throughput and latency. One other thing to think about: we generally don't want files too small, for two reasons. First, the master node keeps track of the filesystem in RAM, and each file takes 120 bytes or so of that RAM; we don't care about this as much anymore since machines are bigger. Second, the main problem with small files shows up later, when we process the data: each file becomes at least one map task, so lots of small files means lots of per-task overhead.
What happens when we write data to HDFS? We replicate the data, by default three times. Why three? First, fault tolerance: HDFS can handle data corruption or the failure of a node, and we still have two copies left. HDFS is aware of its physical infrastructure and can handle node or rack failures. Second, and this is important, data locality: when we process one of these data blocks, we're not pulling it across the network; we send the job to the data. What if that node is busy processing some other data block? Worst case we pull the block over the network, but the preference is a node in the same rack.
Look at it in Hue; we're actually in our home directory.
Not a Java developer? Don't worry, you don't have to be; there are lots of other options: use ETL tools, or import from a database using Sqoop as a Hive table.
6% of plays lack weather data. Hours were spent diagnosing missing or bad data; hours were spent downloading data. http://www.flickr.com/photos/37611179@N00/2295452969/in/photolist-4uQNck-5SRuWS-5WYBDL-677pYM-7cscT7-7vyC7G-7XRk46-84U1Ft-ayVaRS-7ReJrS-dpXi1U-8cTwQ1-7Pq9iE-bEo82F-98LeR5-9Ue2aF-b3vtrz-7YWv62
This breakup creates 96 different queryable columns. http://www.flickr.com/photos/modenadude/6150263821/sizes/o/in/photostream/
Unstructured data. Human generated. http://www.flickr.com/photos/nathaninsandiego/5159833527/sizes/o/
Easy for humans to parse, hard for computers: natural language processing. While breaking down the data, we need to know what questions we want to answer. Look back at my commits to see what I've added. http://www.flickr.com/photos/nathaninsandiego/5159833527/sizes/o/
Look at the source code for our mapper, reducer, and driver. Go to the source directory where our drivers and mappers are.
1 yard to go is 65% run. X-and-24 has the highest chance of a sack, at 4.6%. X-and-21 has the highest chance of a QB scramble, at 1.7%. X-and-10 is about even between pass and run, both in the high 40s. http://www.flickr.com/photos/crackerbunny/3215652008/sizes/l/
This breakup creates 96 different queryable columns, limited to data about plays. http://www.flickr.com/photos/modenadude/6150263821/sizes/o/in/photostream/
If you want interactive SQL, use Impala. Impala is a SQL engine that lives on top of Hadoop; it bypasses MapReduce and gives you much lower-latency queries.
So, the directories and files in HDFS use POSIX-based security controls: owner, group, world, plus read, write, execute. One of the things our customers were coming across, though, is that although you could control who was able to access a file or a directory in HDFS, what happens if that file or directory contains a combination of protected and non-protected data? For example, a health record file contains some information you're okay with everyone seeing, alongside fields that must stay restricted.