1. Hands-on Hadoop with the NFL Play-by-Play Dataset
Ryan Bosshart | Systems Engineer
Oct 2013 v2
2. What’s Ahead?
“Hands on” with Hadoop using NFL Play-by-play
• No prior experience needed
• Feel free to ask questions
7. Stadium Data
Example record:
Lambeau Field,79594,79594,Green Bay Wisconsin,Desso GrassMaster,TRUE,GB,1957,GHCND:USW00014898,None,581

Stadium: The name of the stadium
Capacity: The capacity of the stadium
Expanded Capacity: The expanded capacity of the stadium
Location: The location of the stadium
Playing Surface: The type of grass, etc. that the stadium has
Is Artificial: Is the playing surface artificial
Team: The name of the team that plays at the stadium
Opened: The year the stadium opened
Weather Station: The weather station closest to the stadium
Roof Type: The type of roof in the stadium (None, Retractable, Dome)
Elevation: The elevation of the stadium
9. Weather Data
Example record:
GHCND:USW00014898,GREEN BAY AUSTIN STRAUBEL INTERNATIONAL AIRPORT WI US,20020101,-9999,-9999,-9999,0,0,0,-9999,477,-22,-133,-9999,0,-9999,23,30,20,9999,45,54,-9999,1514,1402,-9999,-9999,

STATION: Station identifier
STATION NAME: Station location name
READING DATE: Date of reading
PRCP: Precipitation
AWND: Average daily wind speed
WV20: Fog, ice fog, or freezing fog (may include heavy fog)
TMAX: Maximum temperature
TMIN: Minimum temperature
11. Arrest Data
Season: Season the player was arrested in (February to February)
Team: Team the player played on
Player: Name of the player arrested
Player Arrested: Was a player in the play arrested that season
Offense Player Arrested: Offense had a player arrested that season
Defense Player Arrested: Defense had a player arrested that season
Home Team Player Arrested: Home team had a player arrested that season
Away Team Player Arrested: Away team had a player arrested that season
14. HDFS: Hadoop Distributed File System
• Inspired by the Google File System
• Provides low-cost storage for massive amounts of data
• Not a general-purpose filesystem; optimized for processing data with Hadoop
• Cannot modify file content once written
• It's actually a user-space Java process
• Accessed using special commands or APIs (see the example below)
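For instance, day-to-day interaction happens through the hadoop fs shell, which mirrors familiar UNIX commands. A minimal sketch (the file and directory names here are illustrative):

$ hadoop fs -ls /user/cloudera
$ hadoop fs -mkdir input
$ hadoop fs -put nfldata.csv input/
$ hadoop fs -cat input/nfldata.csv | head

Relative paths resolve against your HDFS home directory, much like a regular shell.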
15. HDFS Blocks
• When data is loaded into HDFS, it's split into blocks
• Blocks are of a fixed size (64 MB by default)
• These are huge compared to typical UNIX filesystem blocks
[Diagram: a 230 MB input file is split into Block 1 (64 MB), Block 2 (64 MB), Block 3 (64 MB), and Block 4 (38 MB)]
16. HDFS Replication
• Each block is then replicated to multiple machines
• Default replication factor is three (but configurable)
[Diagram: Block 1 (64 MB) replicated to three of five slave nodes, A through E]
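If you're curious how a particular file was split up and where the replicas landed, the filesystem checker will show you; a quick sketch (the path is illustrative):

$ hadoop fsck /user/cloudera/input/nfldata.csv -files -blocks -locations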
24. Play by Play Pieces
(2:48) C.Kaepernick pass short right to M.Crabtree to SF 25 for 1 yard (C.Tillman). Caught at SF 25. 0-yds YAC
25. There's A Custom MapReduce Behind That
import java.io.IOException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class IncompletesMapper extends Mapper<LongWritable, Text, Text, PassWritable> {
    // Regex defined in the full source (not shown on the slide); it captures
    // passer, receiver, and yardage from an incomplete-pass description
    private static final Pattern incompletePass = Pattern.compile("...");

    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        if (line.contains("incomplete")) {
            Matcher matcher = incompletePass.matcher(line);
            if (matcher.find()) {
                context.write(new Text(matcher.group(1) + "-" + matcher.group(2)),
                        new PassWritable(1, Integer.parseInt(matcher.group(3))));
            }
        }
    }
}
26. Mapper Input and Output
Input record:
20020905_SF@NYG,1,,0,SF,NYG,,,,J.Cortez kicks 75 yards from SF 30 to NYG -5. R.Dixon Touchback.,0,0,2002
Map(k,v) emits
Key: 20020905_SF@NYG
Value: ... false false false KICK NYG SF

Input record:
20020905_SF@NYG,1,60,0,NYG,SF,1,10,80,(15:00) T.Barber left end to NYG 24 for 4 yards (C.Okeafor J.Webster).,0,0,2002
Map(k,v) emits
Key: 20020905_SF@NYG
Value: ... false false false RUN NYG SF

… (additional plays, including passes, omitted) …

Input record:
20020905_SF@NYG,1,53,16,NYG,SF,1,10,36,(8:16) T.Barber right guard to SF 30 for 6 yards (J.Winborn).,0,0,2002
Map(k,v) emits
Key: 20020905_SF@NYG
Value: ... false false false RUN NYG SF
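The deck doesn't show the matching reducer, but shape-wise it just sums the per-pass counts and yardage for each key. A hypothetical sketch, assuming PassWritable exposes getCount() and getYards() accessors (its definition is never shown in the deck):

import java.io.IOException;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class IncompletesReducer extends Reducer<Text, PassWritable, Text, PassWritable> {
    @Override
    public void reduce(Text key, Iterable<PassWritable> values, Context context)
            throws IOException, InterruptedException {
        int count = 0;
        int yards = 0;
        for (PassWritable p : values) {
            count += p.getCount();   // assumed accessor, not shown in the deck
            yards += p.getYards();   // assumed accessor, not shown in the deck
        }
        context.write(key, new PassWritable(count, yards));
    }
}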
28. • Driver, Mapper, Reducer
• The driver does configuration and sets the mapper/reducer
• Our PlayByPlayDriver takes two arguments:
  • Input directory
  • Output directory
• Most common error in MapReduce:
  Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /home/cloudera/output already exists
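To make the driver's job concrete, here is a minimal sketch of what a driver like PlayByPlayDriver does; the actual class in the workshop project may differ, and IncompletesReducer is the hypothetical reducer sketched earlier:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PlayByPlayDriver {
    public static void main(String[] args) throws Exception {
        // args[0] = input directory, args[1] = output directory
        Job job = Job.getInstance(new Configuration(), "nfl play by play");
        job.setJarByClass(PlayByPlayDriver.class);
        job.setMapperClass(IncompletesMapper.class);
        job.setReducerClass(IncompletesReducer.class);  // hypothetical, see above
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(PassWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        // This is where FileAlreadyExistsException comes from:
        // the output directory must not exist before the job runs
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}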
29. Import the Java project:
• Open Eclipse: File -> Import -> "Existing Projects into Workspace" -> home/cloudera/workspace/nfldata -> Finish -> OK
Create the job jar:
• $ cd src
• $ javac -classpath `hadoop classpath` *.java
  Note those are backquotes: this runs the hadoop classpath command and uses its output as the classpath for javac
• $ jar cf ../playbyplay.jar *.class
• $ cd ..
Run the job:
• $ hadoop jar playbyplay.jar PlayByPlayDriver input playoutput
• $ hadoop jar playbyplay.jar ArrestJoinDriver playoutput joinedoutput arrests.csv
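When a job finishes you can read its results straight out of HDFS; a quick sketch (reducer output follows the standard part-file naming, so the exact file name may vary):

$ hadoop fs -ls playoutput
$ hadoop fs -cat playoutput/part-r-00000 | head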
32. Stadium Data
Lambeau Field,79594,79594,Green Bay Wisconsin,Desso GrassMaster,TRUE,GB,1957,GHCND:USW00014898,None,581
(Same fields as slide 7: Stadium, Capacity, Expanded Capacity, Location, Playing Surface, Is Artificial, Team, Opened, Weather Station, Roof Type, Elevation. The Hive DDL on the next slide maps each one to a typed column.)
33. playbyplay_tablecreate.hql
DROP TABLE IF EXISTS stadium;
CREATE EXTERNAL TABLE stadium (
Stadium STRING COMMENT 'The name of the stadium',
Capacity INT COMMENT 'The capacity of the stadium',
ExpandedCapacity INT COMMENT 'The expanded capacity of the stadium',
StadiumLocation STRING COMMENT 'The location of the stadium',
PlayingSurface STRING COMMENT 'The type of grass, etc that the stadium has',
IsArtificial BOOLEAN COMMENT 'Is the playing surface artificial',
Team STRING COMMENT 'The name of the team that plays at the stadium',
Opened INT COMMENT 'The year the stadium opened',
WeatherStation STRING COMMENT 'The name of the weather station closest to the stadium',
RoofType STRING COMMENT '(Possible Values:None,Retractable,Dome) - The type of roof in the stadium',
Elevation INT COMMENT 'The altitude of the stadium'
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION "/user/cloudera/stadium";
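Because the table is EXTERNAL, Hive simply reads whatever delimited files sit in that HDFS directory. A minimal sketch of loading and sanity-checking the data (the CSV file name is illustrative):

$ hadoop fs -mkdir /user/cloudera/stadium
$ hadoop fs -put stadium.csv /user/cloudera/stadium/
hive> SELECT stadium, rooftype, elevation FROM stadium LIMIT 5;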
35. Hive Query
Give me every run by New Orleans in the 2010 season:

SELECT * FROM playbyplay
WHERE playtype = "RUN"
  AND year = 2010
  AND game LIKE "%NO%";
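Aggregations work the same way; for example, a run/pass breakdown over the same columns (same assumptions about the playbyplay schema as the query above):

hive> SELECT playtype, COUNT(*) FROM playbyplay WHERE year = 2010 GROUP BY playtype;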
37. From the Data: Field Goals
• 14% of field goals are missed
• Weather only increases misses by 1%
• 21% of field goals are missed in 30-39 MPH average winds
38. Sentry
• Open source authorization module for Impala & Hive
• Unlocks key RBAC requirements:
  • Secure, fine-grained, role-based authorization
  • Multi-tenant administration
• Open source: submitted to the ASF
• Initially supported in Impala 1.1 and HiveServer2
Fun. Structured + unstructured. Make some meaningful analysis.
I know we have a mixed group here, so some of you will probably have a good idea of what Hadoop does, the problems it's meant to solve, and some of the processing frameworks that sit on top. My goal here is to reach a broad audience. Hopefully you will learn: how to interact with HDFS, how to run a MapReduce job (probably not how to write one), and how to create a table in Hive (or Impala). The problem Jesse solved is pretty common: land raw data in Hadoop, then do ETL. Caveats: thankfully, I will not be teaching Java development; I will show conceptually what is happening.
Learn more at the screencast. Use the QuickStart VM. I am merely the tour guide; I am not the author, and there is plenty about this dataset and MapReduce that I don't know. Jesse Anderson is a curriculum developer for Cloudera University. We will be using MapReduce and Hive, and there are additional code examples for using Pig. This will be high level; for more in-depth coverage, see our training program and online tutorials.
Answer some questions about the NFL: instead of asking the “experts” on the NFL shows, you want to really know. How could we do that? I'll need the data, right? It's really nice to have granular data; I can ask different types of questions. SQL is really nice because I already know it; I might even want to make some visualizations with it.
Extract value and insight. Play by play vs. wins: different aspects, such as wins, a particular player, all punts, all kicks, roll-up by quarter. A good example of why Hadoop is valuable: granular data. The same is true of other systems: POS, medical records, etc. http://www.flickr.com/photos/billlublin/3972999678/sizes/o/
Why is there a home-field advantage? Turf? Size of the stadium? Certain fans? Characteristics of the stadium: turf, etc. http://www.flickr.com/photos/37611179@N00/2295452969/in/photolist-4uQNck-5SRuWS-5WYBDL-677pYM-7cscT7-7vyC7G-7XRk46-84U1Ft-ayVaRS-7ReJrS-dpXi1U-8cTwQ1-7Pq9iE-bEo82F-98LeR5-9Ue2aF-b3vtrz-7YWv62
There is no direct key between stadium and weather station. The average score with weather is 21-18; without weather it is 21-19.
Used by permission of Lego Police Force https://www.facebook.com/LegoPD
Underlying Hadoop is HDFS, inspired by Google's GFS: not only does it scale easily and cost-effectively, it is designed for processing massive amounts of information. There are design trade-offs. The first is that this is an append-only filesystem: once a file is written and closed, it's done; you can't update it. That said, you interact with it much like a regular Linux filesystem. Imagine a shared Linux filesystem: I can create user directories, list contents, and write files. But really you are interacting with a bunch of Java daemons running on top of the underlying Linux filesystem. There are REST APIs and Java APIs, and through those you interact with Hadoop like you would any other filesystem.
When we write data to disk in HDFS, we optimize it for processing: to avoid seeks and maximize throughput, we write data sequentially in very large blocks. The block size is configurable; the default is 64 MB, and most people use 128 MB. Avoiding seeks is perhaps obvious, but we don't want blocks too big either: imagine a 10 GB file all in one block. We could get to it with one seek, but at a 100 MB/sec read rate it would take well over a minute to read from disk (10,240 MB / 100 MB/sec is about 102 seconds). With fixed-size blocks we are balancing throughput and latency. One other thing to think about: we generally don't want files too small, for two reasons. First, the master node keeps track of the filesystem in RAM, and each file takes 120 bytes or so of that RAM; we don't care about this as much anymore since machines are bigger. Second, the main problem with small files shows up later, when we process the data: each file becomes at least one map task, so lots of small files means lots of per-task overhead.
What happens when we write data to HDFS? We replicate the data, by default three times. Why three? First, fault tolerance: HDFS can handle data corruption or the failure of a node, and we still have two copies left. HDFS is aware of its physical infrastructure and can handle node or rack failures. Second, and this is important, data locality: when we process one of these data blocks, we're not pulling it across the network; we send the job to the data. What if that node is busy processing some other data block? Worst case we pull the block over the network, but the preference is a node in the same rack.
Look at it in Hue; we're actually in our home directory.
Not a Java developer? Don't worry, you don't have to be; there are lots of other options: use ETL tools, or import from a database using Sqoop as a Hive table.
6% of plays lack weather data. Hours were spent diagnosing missing or bad data; hours were spent downloading data. http://www.flickr.com/photos/37611179@N00/2295452969/in/photolist-4uQNck-5SRuWS-5WYBDL-677pYM-7cscT7-7vyC7G-7XRk46-84U1Ft-ayVaRS-7ReJrS-dpXi1U-8cTwQ1-7Pq9iE-bEo82F-98LeR5-9Ue2aF-b3vtrz-7YWv62
This breakup creates 96 different queryable columns. http://www.flickr.com/photos/modenadude/6150263821/sizes/o/in/photostream/
Unstructured data. Human generated. http://www.flickr.com/photos/nathaninsandiego/5159833527/sizes/o/
Easy for humans to parse, hard for computers: natural language processing. While breaking down the data, we need to know what questions we want to answer. Look back at my commits to see what I've added. http://www.flickr.com/photos/nathaninsandiego/5159833527/sizes/o/
Look at the source code for our mapper, reducer, and driver. Go to the source directory where our drivers and mappers are.
1 yard to go is 65% run. X-and-24 has the highest chance of a sack, at 4.6%. X-and-21 has the highest chance of a QB scramble, at 1.7%. X-and-10 is about even between pass and run, both in the high 40s. http://www.flickr.com/photos/crackerbunny/3215652008/sizes/l/
This breakup creates 96 different queryable columns, limited to data about plays. http://www.flickr.com/photos/modenadude/6150263821/sizes/o/in/photostream/
If you want interactive SQL, use Impala. Impala is a SQL engine that lives on top of Hadoop; it bypasses MapReduce and gives you much lower-latency queries.
So, the directories and files in HDFS use POSIX-based security controls: owner, group, world, plus read, write, execute. One of the things our customers were coming across, though, is that although you could control who was able to access a file or a directory in HDFS, what happens if that file or directory contains a combination of protected and non-protected data? For example, a health record file contains some information you're okay with everyone seeing, alongside fields that must stay restricted.