Hands-on Hadoop with the
NFL Play by Play Dataset
Ryan Bosshart | Systems Engineer
Oct 2013 v2

What’s Ahead?
“Hands on” with Hadoop using NFL Play-by-play
• No prior experience needed
• Feel free to ask questions
Thanks, Coach
• http://www.jesse-anderson.com
• @jessetanderson
• Code - https://github.com/eljefe6a/nfldata

*We are not in any way affiliated with the NFL or any team.
Basic questions
• How does Brett Favre’s best season compare to Aaron Rodgers’s?
Plays
• Advanced NFL Stats released all play-by-play data since the 2002 season
• 2,898 total games
• 471,392 plays
Basic Questions: Home Field Advantage?

Stadium Data
Lambeau Field,79594,79594,Green Bay Wisconsin,Desso GrassMaster,TRUE,GB,1957,GHCND:USW00014898,None,581

Stadium             The name of the stadium
Capacity            The capacity of the stadium
Expanded Capacity   The expanded capacity of the stadium
Location            The location of the stadium
Playing Surface     The type of grass, etc. that the stadium has
Is Artificial       Is the playing surface artificial
Team                The name of the team that plays at the stadium
Roof Type           The type of roof in the stadium (None, Retractable, Dome)
Elevation           The elevation of the stadium
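As a minimal sketch of how the sample row maps onto those fields, the Java below splits the line on commas and converts each column. The class name StadiumRow is hypothetical. The eighth and ninth columns are not labeled on this slide, so the names opened and weatherStation are borrowed from the Hive table definition later in the deck. It assumes no field contains an embedded comma.

```java
/** Minimal sketch: turn one comma-separated stadium row into named fields. */
final class StadiumRow {
    final String stadium;
    final int capacity;
    final int expandedCapacity;
    final String location;
    final String playingSurface;
    final boolean isArtificial;
    final String team;
    final int opened;            // name borrowed from the Hive table definition
    final String weatherStation; // name borrowed from the Hive table definition
    final String roofType;
    final int elevation;

    private StadiumRow(String[] f) {
        stadium = f[0];
        capacity = Integer.parseInt(f[1]);
        expandedCapacity = Integer.parseInt(f[2]);
        location = f[3];
        playingSurface = f[4];
        isArtificial = Boolean.parseBoolean(f[5]); // case-insensitive, so "TRUE" parses as true
        team = f[6];
        opened = Integer.parseInt(f[7]);
        weatherStation = f[8];
        roofType = f[9];
        elevation = Integer.parseInt(f[10]);
    }

    /** Assumes no field contains an embedded comma, which holds for the sample row. */
    static StadiumRow parse(String line) {
        String[] f = line.split(",", -1);
        if (f.length != 11) {
            throw new IllegalArgumentException("expected 11 fields, got " + f.length);
        }
        return new StadiumRow(f);
    }

    public static void main(String[] args) {
        StadiumRow row = parse("Lambeau Field,79594,79594,Green Bay Wisconsin,"
                + "Desso GrassMaster,TRUE,GB,1957,GHCND:USW00014898,None,581");
        System.out.println(row.stadium + " | " + row.team + " | roof: " + row.roofType
                + " | elevation: " + row.elevation);
    }
}
```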
What about weather?

Weather Data
GHCND:USW00014898,GREEN BAY AUSTIN STRAUBEL INTERNATIONAL AIRPORT WI US,20020101,-9999,-9999,-9999,0,0,0,-9999,477,-22,-133,-9999,0,-9999,23,30,20,9999,45,54,-9999,1514,1402,-9999,-9999,

STATION        Station identifier
STATION NAME   Station location name
READING DATE   Date of reading
PRCP           Precipitation
AWND           Average daily wind speed
WV20           Fog, ice fog, or freezing fog (may include heavy fog)
TMAX           Maximum temperature
TMIN           Minimum temperature
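The sample row is full of -9999 values. A rough sketch of reading such a row, assuming -9999 marks a missing reading (the slide does not say so) and that the first three columns are the station, station name, and reading date. The class name WeatherRowSketch is hypothetical, and the sketch does not try to assign the remaining columns to PRCP, AWND, and so on, because the slide does not give their positions.

```java
import java.util.ArrayList;
import java.util.List;

/** Rough sketch: read the numeric columns of a weather row, mapping -9999 to null. */
final class WeatherRowSketch {
    static final int MISSING = -9999; // assumption: sentinel for "no reading"

    /** Returns every numeric column after the first three, with null for missing readings. */
    static List<Integer> readings(String line) {
        String[] f = line.split(",", -1);
        List<Integer> values = new ArrayList<>();
        for (int i = 3; i < f.length; i++) {
            String cell = f[i].trim();
            if (cell.isEmpty()) {
                continue; // the sample row ends with a trailing comma
            }
            int v = Integer.parseInt(cell);
            values.add(v == MISSING ? null : Integer.valueOf(v));
        }
        return values;
    }

    public static void main(String[] args) {
        String row = "GHCND:USW00014898,GREEN BAY AUSTIN STRAUBEL INTERNATIONAL AIRPORT WI US,"
                + "20020101,-9999,-9999,-9999,0,0,0,-9999,477,-22,-133,-9999,0,-9999,"
                + "23,30,20,9999,45,54,-9999,1514,1402,-9999,-9999,";
        long present = readings(row).stream().filter(v -> v != null).count();
        System.out.println("readings present: " + present);
    }
}
```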
Do arrests hurt the team?

Arrest Data

Season                      Season the player was arrested in (February to February)
Team                        Team the player played on
Player                      Name of the player arrested
Player Arrested             Was a player in the play arrested that season
Offense Player Arrested     Offense had a player arrested in season
Defense Player Arrested     Defense had a player arrested in season
Home Team Player Arrested   Home team had a player arrested in season
Away Team Player Arrested   Away team had a player arrested in season
[Diagram: data pipeline]
Raw play-by-play -> MR -> Cleaned play-by-play
Cleaned play-by-play + Arrest -> MR (map only) -> Arrest + Play-by-play
Arrest + Play-by-play + Stadium + Weather -> MR (Hive) -> Play-by-play + Arrest + Stadium + Weather
Step 1: Put the data in HDFS

HDFS: Hadoop Distributed File System
• Inspired by the Google File System
• Provides low-cost storage for massive amounts of data
• Not a general-purpose filesystem
  • Optimized for processing data with Hadoop
  • Cannot modify file content once written
  • It’s actually a user-space Java process
  • Accessed using special commands or APIs
HDFS Blocks
• When data is loaded into HDFS, it’s split into blocks
• Blocks are of a fixed size (64 MB by default)
  • These are huge when compared to UNIX filesystems

[Diagram: 230 MB input file -> Block 1 (64 MB), Block 2 (64 MB), Block 3 (64 MB), Block 4 (38 MB)]
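The block arithmetic for the 230 MB example can be sketched in a few lines of Java. This is only an illustration of the splitting rule, not how HDFS itself is implemented; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustration of the splitting rule: full-size blocks, then one smaller final block. */
final class BlockSplitSketch {
    /** Returns the size in MB of each block for a file of fileMb split at blockMb. */
    static List<Long> blockSizes(long fileMb, long blockMb) {
        if (fileMb < 0 || blockMb <= 0) {
            throw new IllegalArgumentException("sizes must be positive");
        }
        List<Long> sizes = new ArrayList<>();
        long remaining = fileMb;
        while (remaining > 0) {
            long size = Math.min(blockMb, remaining); // the last block holds only what is left
            sizes.add(size);
            remaining -= size;
        }
        return sizes;
    }

    public static void main(String[] args) {
        System.out.println(blockSizes(230, 64)); // prints [64, 64, 64, 38]
    }
}
```

The last block is not padded to 64 MB: it only holds the remaining 38 MB.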
HDFS Replication
• Each block is then replicated to multiple machines
• Default replication factor is three (but configurable)

[Diagram: Block 1 (64 MB) and slave nodes A, B, C, D, E]
Try this
1. $ whoami
2. $ hadoop fs -ls /
3. $ hadoop fs -ls /user/
4. $ hadoop fs -mkdir /user/test/
5. $ hadoop fs -mkdir /user/cloudera/test
6. $ hadoop fs -ls
Loading our data
Load the data:
• $ cd /home/cloudera/workspace/nfldata
• $ hadoop fs -put -f input
• $ hadoop fs -mkdir weather
• $ hadoop fs -put -f 173328.csv weather/
• $ hadoop fs -mkdir stadium
• $ hadoop fs -put -f stadiums.csv stadium/
• $ hadoop fs -put -f arrests.csv

Check it out in HUE:
Go to: http://localhost:8888/filebrowser/

Step 2: MapReduce

Data Janitorial

Full Play Entry
20121119_CHI@SF,3,17,48,SF,CHI,3,2,76,20,0,(2:48) C.Kaepernick pass short right to M.Crabtree to SF 25 for 1 yard (C.Tillman). Caught at SF 25. 0-yds YAC,0,3,0,27,7,2012
Queryable Data

Give me every run play by New Orleans in the 2010 season
Play Description
(2:48) C.Kaepernick pass short right to M.Crabtree to SF 25 for 1 yard (C.Tillman). Caught at SF 25. 0-yds YAC
Play by Play Pieces
(2:48) C.Kaepernick pass short right to M.Crabtree to SF 25 for 1 yard (C.Tillman). Caught at SF 25. 0-yds YAC
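One way to pull the pieces out of a description like this is a regular expression. The sketch below is written only against the single sample above: the pattern, the class name PassDescriptionParser, and the piece names (clock, passer, depth, and so on) are all hypothetical, and real play-by-play text has many more variations than this pattern covers.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: extract the pieces of a completed-pass description with one regex. */
final class PassDescriptionParser {
    // Hypothetical pattern, fitted to the one sample description on the slide.
    private static final Pattern COMPLETED_PASS = Pattern.compile(
            "^\\((\\d{1,2}):(\\d{2})\\)\\s+(\\S+) pass (short|deep) (left|middle|right)"
            + " to (\\S+) to (\\w+) (\\d+) for (-?\\d+) yards?");

    /** Returns the named pieces, or an empty map when the description does not match. */
    static Map<String, String> parse(String description) {
        Map<String, String> pieces = new LinkedHashMap<>();
        Matcher m = COMPLETED_PASS.matcher(description);
        if (m.find()) {
            pieces.put("clock", m.group(1) + ":" + m.group(2));
            pieces.put("passer", m.group(3));
            pieces.put("depth", m.group(4));
            pieces.put("direction", m.group(5));
            pieces.put("receiver", m.group(6));
            pieces.put("spot", m.group(7) + " " + m.group(8));
            pieces.put("yards", m.group(9));
        }
        return pieces;
    }

    public static void main(String[] args) {
        System.out.println(parse("(2:48) C.Kaepernick pass short right to M.Crabtree"
                + " to SF 25 for 1 yard (C.Tillman). Caught at SF 25. 0-yds YAC"));
    }
}
```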
There's A Custom MapReduce Behind That
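import java.io.IOException;
import java.util.regex.Matcher;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// PassWritable and the incompletePass Pattern field are not shown on this slide
public class IncompletesMapper extends Mapper<LongWritable, Text, Text, PassWritable> {
    @Override
    public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        if (line.contains("incomplete")) {
            Matcher matcher = incompletePass.matcher(line);
            if (matcher.find()) {
                // Key: group 1 + "-" + group 2; value: a count of 1 plus the integer in group 3
                context.write(new Text(matcher.group(1) + "-" + matcher.group(2)),
                        new PassWritable(1, Integer.parseInt(matcher.group(3))));
            }
        }
    }
}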
Input records:
20020905_SF@NYG,1,,0,SF,NYG,,,,J.Cortez kicks 75 yards from SF 30 to NYG -5. R.Dixon Touchback.,0,0,2002
…..
20020905_SF@NYG,1,60,0,NYG,SF,1,10,80,(15:00) T.Barber left end to NYG 24 for 4 yards (C.Okeafor J.Webster).,0,0,2002
…….
20020905_SF@NYG,1,53,16,NYG,SF,1,10,36,(8:16) T.Barber right guard to SF 30 for 6 yards (J.Winborn).,0,0,2002

Map(k,v) outputs:
Key: 20020905_SF@NYG   Value: ... false false false KICK NYG SF
Key: 20020905_SF@NYG   Value: ... false false false PASS NYG SF
Key: 20020905_SF@NYG   Value: ... false false false RUN NYG SF
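The map step above keys each record by its game and tags it with a play type. A simplified sketch of that idea in plain Java, without the Hadoop API: the keyword rules below are assumptions made up for illustration (the slide does not show how the real mapper decides), and they ignore punts, sacks, penalties, and every other play type.

```java
import java.util.Locale;

/** Simplified sketch of the map step: derive a key and a play type from one record. */
final class PlayTypeSketch {
    enum PlayType { KICK, PASS, RUN }

    /** The key is the first comma-separated field, for example 20020905_SF@NYG. */
    static String gameKey(String record) {
        int comma = record.indexOf(',');
        return comma < 0 ? record : record.substring(0, comma);
    }

    /** Hypothetical keyword rules; anything that is not a kick or a pass is called a run. */
    static PlayType classify(String description) {
        String d = description.toLowerCase(Locale.ROOT);
        if (d.contains(" kicks ")) {
            return PlayType.KICK;
        }
        if (d.contains(" pass ")) {
            return PlayType.PASS;
        }
        return PlayType.RUN;
    }

    public static void main(String[] args) {
        String record = "20020905_SF@NYG,1,60,0,NYG,SF,1,10,80,(15:00) T.Barber left end"
                + " to NYG 24 for 4 yards (C.Okeafor J.Webster).,0,0,2002";
        System.out.println(gameKey(record) + " -> " + classify(record));
    }
}
```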
• Driver, Mapper, Reducer
• Driver does configuration, sets mapper/reducer
• Our PlayByPlayDriver takes two arguments:
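  • Input directory
  • Output directory
• Most common error in MapReduce:
  • Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /home/cloudera/output already exists
  • The job refuses to overwrite existing output: delete the output directory or choose a new one before rerunning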
Import Java project:
• Open Eclipse: File -> Import -> "Existing Projects into Workspace" -> /home/cloudera/workspace/nfldata -> Finish -> OK
Create the job:
• $ cd src
• $ javac -classpath `hadoop classpath` *.java
  • Note those are back quotes: the shell runs the hadoop classpath command and passes its output to javac as the classpath
• $ jar cf ../playbyplay.jar *.class
• $ cd ..
Run the job:
• $ hadoop jar playbyplay.jar PlayByPlayDriver input playoutput
• $ hadoop jar playbyplay.jar ArrestJoinDriver playoutput joinedoutput arrests.csv
Enter the Query
The Hive Story

Hive
• Abstraction on top of MapReduce
• Allows queries using a SQL-like language
playbyplay_tablecreate.hql
drop table if exists stadium;
CREATE EXTERNAL TABLE stadium (
Stadium STRING COMMENT 'The name of the stadium',
Capacity INT COMMENT 'The capacity of the stadium',
ExpandedCapacity INT COMMENT 'The expanded capacity of the stadium',
StadiumLocation STRING COMMENT 'The location of the stadium',
PlayingSurface STRING COMMENT 'The type of grass, etc that the stadium has',
IsArtificial BOOLEAN COMMENT 'Is the playing surface artificial',
Team STRING COMMENT 'The name of the team that plays at the stadium',
Opened INT COMMENT 'The year the stadium opened',
WeatherStation STRING COMMENT 'The name of the weather station closest to the stadium',
RoofType STRING COMMENT '(Possible Values:None,Retractable,Dome) - The type of roof in the stadium',
Elevation INT COMMENT 'The altitude of the stadium'
)
ROW FORMAT DELIMITED
FIELDS TERMINATED BY ','
STORED AS TEXTFILE
LOCATION "/user/cloudera/stadium";

Yards to Go
$ hive -S -f playbyplay_tablecreate.hql
$ hive -S -f playbyplay_join.hql
$ hive -S -f adddrives.hql
$ hive -S -f adddriveresult.hql
Hive Query
Give me every run by New Orleans in the 2010 season:
SELECT * FROM playbyplay
WHERE playtype = "RUN"
  AND year = 2010
  AND game LIKE "%NO%";
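To see what the WHERE clause keeps, here is the same predicate applied to a few in-memory rows in plain Java. The rows are made up for illustration, and the offense field is a hypothetical column name that the slide does not show. Because the query tests the game identifier, it keeps runs by both teams in any game involving NO; a further check on the team with the ball, sketched here as byOffense, would narrow it to New Orleans runs only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

/** Sketch: the query's WHERE clause as a Java predicate over illustrative rows. */
final class RunPlayFilter {
    static final class Play {
        final String game;
        final String playType;
        final int year;
        final String offense; // hypothetical column, not shown on the slide

        Play(String game, String playType, int year, String offense) {
            this.game = game;
            this.playType = playType;
            this.year = year;
            this.offense = offense;
        }
    }

    /** playtype = "RUN" and year = 2010 and game like "%NO%". */
    static List<Play> matchesQuery(List<Play> plays) {
        List<Play> out = new ArrayList<>();
        for (Play p : plays) {
            if (p.playType.equals("RUN") && p.year == 2010 && p.game.contains("NO")) {
                out.add(p);
            }
        }
        return out;
    }

    /** Extra narrowing step (an assumption, not part of the slide's query). */
    static List<Play> byOffense(List<Play> plays, String team) {
        List<Play> out = new ArrayList<>();
        for (Play p : plays) {
            if (p.offense.equals(team)) {
                out.add(p);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Play> plays = Arrays.asList(
                new Play("20100909_MIN@NO", "RUN", 2010, "NO"),
                new Play("20100909_MIN@NO", "RUN", 2010, "MIN"),
                new Play("20100909_MIN@NO", "PASS", 2010, "NO"),
                new Play("20100912_GB@PHI", "RUN", 2010, "GB"));
        List<Play> matched = matchesQuery(plays);
        System.out.println("query matches: " + matched.size());
        System.out.println("of which NO on offense: " + byOffense(matched, "NO").size());
    }
}
```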
Impala
Modern MPP database built on top of HDFS
Really fast! Written in C++
10-100x faster than Hive
From the Data: Field Goals
Weather only increases misses by 1%
14% of Field Goals are missed
21% of Field Goals are missed in 30-39 MPH average winds
Sentry
Open Source authorization module for Impala & Hive
Unlocks Key RBAC Requirements
Secure, fine-grained, role-based authorization
Multi-tenant administration

Open Source
Submitted to ASF

Supported in Impala 1.1 & HiveServer2 initially
38
39

More Related Content

Viewers also liked

Cual es papel de la química en las diferentes formas artisticas
Cual es papel de la química en las diferentes formas artisticasCual es papel de la química en las diferentes formas artisticas
Cual es papel de la química en las diferentes formas artisticasDiego Medina
 
CloudSocial: A New Approach to Enabling Open Content for Broad Reuse
CloudSocial: A New Approach to Enabling Open Content for Broad ReuseCloudSocial: A New Approach to Enabling Open Content for Broad Reuse
CloudSocial: A New Approach to Enabling Open Content for Broad ReuseCharles Severance
 
Apresentação e Proposta de Trabalho Elion Publicidade
Apresentação e Proposta de Trabalho Elion PublicidadeApresentação e Proposta de Trabalho Elion Publicidade
Apresentação e Proposta de Trabalho Elion PublicidadeElion Elion
 
Y a mi que me toca? Actores y formas del desarrollo urbano en Argentina
Y a mi que me toca? Actores y formas del desarrollo urbano en ArgentinaY a mi que me toca? Actores y formas del desarrollo urbano en Argentina
Y a mi que me toca? Actores y formas del desarrollo urbano en ArgentinaPablo Guiraldes
 
The heart failure association global awareness programme.
The heart failure association global awareness programme.The heart failure association global awareness programme.
The heart failure association global awareness programme.drucsamal
 
Management and organization (jamuna group) PPT
Management and organization (jamuna group) PPTManagement and organization (jamuna group) PPT
Management and organization (jamuna group) PPTelena sopnita
 
Gregor Mendel y la Genética
Gregor Mendel y la GenéticaGregor Mendel y la Genética
Gregor Mendel y la GenéticaAnita Lopez Moure
 
Upward communication
Upward communicationUpward communication
Upward communicationrenujain1208
 
Downward communication
Downward communicationDownward communication
Downward communicationrenujain1208
 
Frederick herzberg-dual factor theory of motivation
Frederick herzberg-dual factor theory of  motivationFrederick herzberg-dual factor theory of  motivation
Frederick herzberg-dual factor theory of motivationrenujain1208
 

Viewers also liked (16)

Trabajo 4
Trabajo 4Trabajo 4
Trabajo 4
 
Apesar de você
Apesar de vocêApesar de você
Apesar de você
 
21
2121
21
 
Cual es papel de la química en las diferentes formas artisticas
Cual es papel de la química en las diferentes formas artisticasCual es papel de la química en las diferentes formas artisticas
Cual es papel de la química en las diferentes formas artisticas
 
CloudSocial: A New Approach to Enabling Open Content for Broad Reuse
CloudSocial: A New Approach to Enabling Open Content for Broad ReuseCloudSocial: A New Approach to Enabling Open Content for Broad Reuse
CloudSocial: A New Approach to Enabling Open Content for Broad Reuse
 
Apresentação e Proposta de Trabalho Elion Publicidade
Apresentação e Proposta de Trabalho Elion PublicidadeApresentação e Proposta de Trabalho Elion Publicidade
Apresentação e Proposta de Trabalho Elion Publicidade
 
Architecting for Big Data with AWS
Architecting for Big Data with AWSArchitecting for Big Data with AWS
Architecting for Big Data with AWS
 
Y a mi que me toca? Actores y formas del desarrollo urbano en Argentina
Y a mi que me toca? Actores y formas del desarrollo urbano en ArgentinaY a mi que me toca? Actores y formas del desarrollo urbano en Argentina
Y a mi que me toca? Actores y formas del desarrollo urbano en Argentina
 
The heart failure association global awareness programme.
The heart failure association global awareness programme.The heart failure association global awareness programme.
The heart failure association global awareness programme.
 
Planificacion de sistemas
Planificacion de sistemasPlanificacion de sistemas
Planificacion de sistemas
 
Management and organization (jamuna group) PPT
Management and organization (jamuna group) PPTManagement and organization (jamuna group) PPT
Management and organization (jamuna group) PPT
 
Gregor Mendel y la Genética
Gregor Mendel y la GenéticaGregor Mendel y la Genética
Gregor Mendel y la Genética
 
Upward communication
Upward communicationUpward communication
Upward communication
 
Downward communication
Downward communicationDownward communication
Downward communication
 
Frederick herzberg-dual factor theory of motivation
Frederick herzberg-dual factor theory of  motivationFrederick herzberg-dual factor theory of  motivation
Frederick herzberg-dual factor theory of motivation
 
File security system
File security systemFile security system
File security system
 

Similar to Hadoop hands on madison

April Big Data Milwaukee - Hands On Session
April Big Data Milwaukee - Hands On SessionApril Big Data Milwaukee - Hands On Session
April Big Data Milwaukee - Hands On SessionRyan Bosshart
 
Базы данных. HDFS
Базы данных. HDFSБазы данных. HDFS
Базы данных. HDFSVadim Tsesko
 
Data Warehousing in the Era of Big Data: Deep Dive into Amazon Redshift
Data Warehousing in the Era of Big Data: Deep Dive into Amazon RedshiftData Warehousing in the Era of Big Data: Deep Dive into Amazon Redshift
Data Warehousing in the Era of Big Data: Deep Dive into Amazon RedshiftAmazon Web Services
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture EMC
 
Lazy beats Smart and Fast
Lazy beats Smart and FastLazy beats Smart and Fast
Lazy beats Smart and FastJulian Hyde
 
A deeper-understanding-of-spark-internals-aaron-davidson
A deeper-understanding-of-spark-internals-aaron-davidsonA deeper-understanding-of-spark-internals-aaron-davidson
A deeper-understanding-of-spark-internals-aaron-davidsonCheng Min Chi
 
A deeper-understanding-of-spark-internals
A deeper-understanding-of-spark-internalsA deeper-understanding-of-spark-internals
A deeper-understanding-of-spark-internalsCheng Min Chi
 
Real-Time Integration Between MongoDB and SQL Databases
Real-Time Integration Between MongoDB and SQL DatabasesReal-Time Integration Between MongoDB and SQL Databases
Real-Time Integration Between MongoDB and SQL DatabasesEugene Dvorkin
 
Adversarial search
Adversarial searchAdversarial search
Adversarial searchDheerendra k
 
Data Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftData Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftAmazon Web Services
 
Real-Time Big Data with Storm, Kafka and GigaSpaces
Real-Time Big Data with Storm, Kafka and GigaSpacesReal-Time Big Data with Storm, Kafka and GigaSpaces
Real-Time Big Data with Storm, Kafka and GigaSpacesOleksii Diagiliev
 
Real-Time Integration Between MongoDB and SQL Databases
Real-Time Integration Between MongoDB and SQL Databases Real-Time Integration Between MongoDB and SQL Databases
Real-Time Integration Between MongoDB and SQL Databases MongoDB
 
AWS SSA Webinar 20 - Getting Started with Data Warehouses on AWS
AWS SSA Webinar 20 - Getting Started with Data Warehouses on AWSAWS SSA Webinar 20 - Getting Started with Data Warehouses on AWS
AWS SSA Webinar 20 - Getting Started with Data Warehouses on AWSCobus Bernard
 
Data Warehousing in the Era of Big Data
Data Warehousing in the Era of Big DataData Warehousing in the Era of Big Data
Data Warehousing in the Era of Big DataAmazon Web Services
 
SRV405 Ancestry's Journey to Amazon Redshift
SRV405 Ancestry's Journey to Amazon RedshiftSRV405 Ancestry's Journey to Amazon Redshift
SRV405 Ancestry's Journey to Amazon RedshiftAmazon Web Services
 

Similar to Hadoop hands on madison (20)

April Big Data Milwaukee - Hands On Session
April Big Data Milwaukee - Hands On SessionApril Big Data Milwaukee - Hands On Session
April Big Data Milwaukee - Hands On Session
 
Games.4
Games.4Games.4
Games.4
 
Базы данных. HDFS
Базы данных. HDFSБазы данных. HDFS
Базы данных. HDFS
 
Data Warehousing in the Era of Big Data: Deep Dive into Amazon Redshift
Data Warehousing in the Era of Big Data: Deep Dive into Amazon RedshiftData Warehousing in the Era of Big Data: Deep Dive into Amazon Redshift
Data Warehousing in the Era of Big Data: Deep Dive into Amazon Redshift
 
Hadoop Overview & Architecture
Hadoop Overview & Architecture  Hadoop Overview & Architecture
Hadoop Overview & Architecture
 
Lazy beats Smart and Fast
Lazy beats Smart and FastLazy beats Smart and Fast
Lazy beats Smart and Fast
 
Deep Dive on Amazon Redshift
Deep Dive on Amazon RedshiftDeep Dive on Amazon Redshift
Deep Dive on Amazon Redshift
 
Hadoop Overview kdd2011
Hadoop Overview kdd2011Hadoop Overview kdd2011
Hadoop Overview kdd2011
 
A deeper-understanding-of-spark-internals-aaron-davidson
A deeper-understanding-of-spark-internals-aaron-davidsonA deeper-understanding-of-spark-internals-aaron-davidson
A deeper-understanding-of-spark-internals-aaron-davidson
 
A deeper-understanding-of-spark-internals
A deeper-understanding-of-spark-internalsA deeper-understanding-of-spark-internals
A deeper-understanding-of-spark-internals
 
Real-Time Integration Between MongoDB and SQL Databases
Real-Time Integration Between MongoDB and SQL DatabasesReal-Time Integration Between MongoDB and SQL Databases
Real-Time Integration Between MongoDB and SQL Databases
 
Adversarial search
Adversarial searchAdversarial search
Adversarial search
 
Data Warehousing with Amazon Redshift
Data Warehousing with Amazon RedshiftData Warehousing with Amazon Redshift
Data Warehousing with Amazon Redshift
 
Deep Dive on Amazon Redshift
Deep Dive on Amazon RedshiftDeep Dive on Amazon Redshift
Deep Dive on Amazon Redshift
 
Real-Time Big Data with Storm, Kafka and GigaSpaces
Real-Time Big Data with Storm, Kafka and GigaSpacesReal-Time Big Data with Storm, Kafka and GigaSpaces
Real-Time Big Data with Storm, Kafka and GigaSpaces
 
Real-Time Integration Between MongoDB and SQL Databases
Real-Time Integration Between MongoDB and SQL Databases Real-Time Integration Between MongoDB and SQL Databases
Real-Time Integration Between MongoDB and SQL Databases
 
AWS SSA Webinar 20 - Getting Started with Data Warehouses on AWS
AWS SSA Webinar 20 - Getting Started with Data Warehouses on AWSAWS SSA Webinar 20 - Getting Started with Data Warehouses on AWS
AWS SSA Webinar 20 - Getting Started with Data Warehouses on AWS
 
Deep Dive on Amazon Redshift
Deep Dive on Amazon RedshiftDeep Dive on Amazon Redshift
Deep Dive on Amazon Redshift
 
Data Warehousing in the Era of Big Data
Data Warehousing in the Era of Big DataData Warehousing in the Era of Big Data
Data Warehousing in the Era of Big Data
 
SRV405 Ancestry's Journey to Amazon Redshift
SRV405 Ancestry's Journey to Amazon RedshiftSRV405 Ancestry's Journey to Amazon Redshift
SRV405 Ancestry's Journey to Amazon Redshift
 

Recently uploaded

My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsPrecisely
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhisoniya singh
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraDeakin University
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDGMarianaLemus7
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 

Recently uploaded (20)

My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Unlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power SystemsUnlocking the Potential of the Cloud for IBM Power Systems
Unlocking the Potential of the Cloud for IBM Power Systems
 
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | DelhiFULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
FULL ENJOY 🔝 8264348440 🔝 Call Girls in Diplomatic Enclave | Delhi
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Artificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning eraArtificial intelligence in the post-deep learning era
Artificial intelligence in the post-deep learning era
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
APIForce Zurich 5 April Automation LPDG
APIForce Zurich 5 April  Automation LPDGAPIForce Zurich 5 April  Automation LPDG
APIForce Zurich 5 April Automation LPDG
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 

Hadoop hands on madison

  • 1. Hands-on Hadoop with the NFL Play by Play Dataset Headline Goes Here Ryan Bosshart | Systems Engineer Speaker Name or Subhead Goes Here Oct 2013 v2 1 DO NOT USE PUBLICLY PRIOR TO 10/23/12
  • 2. What’s Ahead? “Hands on” with Hadoop using NFL Play-by-play • No prior experience needed • Feel free to ask questions •
  • 3. Thanks, Coach http://www.jesse-anderson.com • @jessetanderson • Code - https://github.com/eljefe6a/nfldata • *we are not in any way affiliated with the NFL or any Team 3
  • 4. Basic questions • How does Brett Favre’s best season compare to other Aaron Rodgers?
  • 5. Plays Advanced NFL stats released all Play by Play since 2002 season • 2,898 total games • 471,392 plays • 5
  • 6. Basic Questions: Home Field Advantage? 6
  • 7. Stadium Data Lambeau Field,79594,79594,Green Bay Wisconsin,Desso GrassMaster,TRUE,GB,1957,GHCND:USW00014898,None,581 Stadium Expanded Capacity Location Playing Surface Is Artificial Team The capacity of the stadium The expanded capacity of the stadium The location of the stadium The type of grass, etc that the stadium has Is the playing surface artificial The name of the team that plays at the stadium Roof Type Elevation 7 The type of roof in the stadium (None, Retractable, Dome) The elevation of the stadium
  • 9. Weather Data GHCND:USW00014898,GREEN BAY AUSTIN STRAUBEL INTERNATIONAL AIRPORT WI US,20020101,-9999,-9999,-9999,0,0,0,-9999,477,-22,-133,-9999,0,-9999,23,30,20,9999,45,54,-9999,1514,1402,-9999,-9999, STATION Station identifier STATION NAME Station location name READING DATE Date of reading PRCP Precipitation AWND Average daily wind speed WV20 Fog, ice fog, or freezing fog (may include heavy fog) TMAX TMIN 9 Maximum temperature Minimum temperature
  • 10. Do arrests hurt the team? 10
  • 11. Arrest Data Season Player Arrested in (February to February) Team Team person played on Player Name of player Arrested Player Arrested Was a player in the play arrested that season Offense Player Arrested Offense had player arrested in season Defense Player Arrested Defense had player arrested in season Home Team Player Arrested Away Team Player Arrested 11 Home Team had player arrested in season Away Team had player arrested in season
  • 12. Stadium Weather Arrest Raw play-by-play 12 MR Cleaned play-byplay MR (map only) Arrest + Play-byplay MR (Hive ) Play-by-play + Arrest + Stadium + Weather
  • 13. Step 1: Put the data in HDFS 13
  • 14. HDFS: Hadoop Distributed File System • Inspired by the Google File System • • Provides low-cost storage for massive amounts of data Not a general purpose filesystem optimized for processing data with Hadoop • Cannot modify file content once written • It’s actually a user-space Java process • Accessed using special commands or APIs •
  • 15. HDFS Blocks • When data is loaded into HDFS, it’s split into blocks Blocks are of a fixed size (64 MB by default) • These are huge when compared to UNIX filesystems • Block 1 (64 MB) 230 MB Input File Block 2 (64 MB) Block 3 (64 MB) Block 4 (38 MB)
  • 16. HDFS Replication • Each block is then replicated to multiple machines • Default replication factor is three (but configurable) Slave node A Slave node B Block 1 (64 MB) Slave node C Slave node D Slave node E
  • 17. Try this 1. $ whoami 1. $ hadoop fs -ls / 1. $ hadoop fs –ls /user/ 1. $ hadoop fs –mkdir /user/test/ 2. $ hadoop fs –mkdir /user/cloudera/test 3. $ hadoop fs –ls 17
  • 18. Loading our data Load the data: • $ cd /home/cloudera/workspace/nfldata • $ hadoop fs -put -f input • $ hadoop fs -mkdir weather • $ hadoop fs -put -f 173328.csv weather/ • $ hadoop fs -mkdir stadium • $ hadoop fs -put -f stadiums.csv stadium/ • $ hadoop fs -put -f arrests.csv Check it out in HUE: Go to: http://localhost:8888/filebrowser/ 18
  • 21. Full Play Entry
      20121119_CHI@SF,3,17,48,SF,CHI,3,2,76,20,0,(2:48) C.Kaepernick pass short right to M.Crabtree to SF 25 for 1 yard (C.Tillman). Caught at SF 25. 0-yds YAC,0,3,0,27,7,2012
  • 22. Queryable Data: Give me every run play by New Orleans in the 2010 season
  • 23. Play Description: (2:48) C.Kaepernick pass short right to M.Crabtree to SF 25 for 1 yard (C.Tillman). Caught at SF 25. 0-yds YAC
  • 24. Play by Play Pieces: (2:48) C.Kaepernick pass short right to M.Crabtree to SF 25 for 1 yard (C.Tillman). Caught at SF 25. 0-yds YAC
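Breaking the description on slides 23 and 24 into pieces is essentially a pattern-matching problem. The sketch below is one possible way to do it for this single sentence shape. The regular expression, the class `PlayPieces`, and the `Pass` holder are all assumptions made for illustration; they are not taken from the nfldata project, which handles many more play types.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: pull structured pieces out of a completed-pass description. */
class PlayPieces {

    /** The pieces recovered from one description. */
    static final class Pass {
        final int clockSeconds; // time left in the quarter, in seconds
        final String passer;
        final String receiver;
        final String spot;      // where the play ended, e.g. "SF 25"
        final int yards;

        Pass(int clockSeconds, String passer, String receiver, String spot, int yards) {
            this.clockSeconds = clockSeconds;
            this.passer = passer;
            this.receiver = receiver;
            this.spot = spot;
            this.yards = yards;
        }
    }

    // Hypothetical pattern for "(m:ss) PASSER pass ... to RECEIVER to TEAM nn for n yard(s)".
    private static final Pattern COMPLETED_PASS = Pattern.compile(
            "^\\((\\d{1,2}):(\\d{2})\\)\\s+(\\S+)\\s+pass\\s+.*?\\bto\\s+([A-Z]\\.\\S+)"
            + "\\s+to\\s+([A-Z]{2,3})\\s+(\\d+)\\s+for\\s+(-?\\d+)\\s+yards?\\b");

    /** Returns the parsed pieces, or empty when the text is not a completed pass of this shape. */
    static Optional<Pass> parse(String description) {
        Matcher m = COMPLETED_PASS.matcher(description);
        if (!m.find()) {
            return Optional.empty();
        }
        int clock = Integer.parseInt(m.group(1)) * 60 + Integer.parseInt(m.group(2));
        String spot = m.group(5) + " " + m.group(6);
        return Optional.of(new Pass(clock, m.group(3), m.group(4), spot, Integer.parseInt(m.group(7))));
    }

    public static void main(String[] args) {
        String description = "(2:48) C.Kaepernick pass short right to M.Crabtree to SF 25 "
                + "for 1 yard (C.Tillman). Caught at SF 25. 0-yds YAC";
        parse(description).ifPresent(p -> System.out.println(
                p.passer + " to " + p.receiver + ", " + p.yards + " yd, ends at " + p.spot
                + ", " + p.clockSeconds + "s left in the quarter"));
    }
}
```

Each captured group becomes a candidate column, which is how free text like this turns into queryable data.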
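  • 25. There's A Custom MapReduce Behind That
      import java.io.IOException;
      import java.util.regex.Matcher;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;

      public class IncompletesMapper extends Mapper<LongWritable, Text, Text, PassWritable> {
          // incompletePass is a java.util.regex.Pattern field; its definition is not shown on the slide.
          @Override
          public void map(LongWritable key, Text value, Context context)
                  throws IOException, InterruptedException {
              String line = value.toString();
              // Only lines describing an incomplete pass are of interest.
              if (line.contains("incomplete")) {
                  Matcher matcher = incompletePass.matcher(line);
                  if (matcher.find()) {
                      // Key is built from groups 1 and 2; the value carries a count of 1 plus group 3.
                      context.write(new Text(matcher.group(1) + "-" + matcher.group(2)),
                              new PassWritable(1, Integer.parseInt(matcher.group(3))));
                  }
              }
          }
      }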
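The mapper on slide 25 cannot run without Hadoop and the project's `PassWritable` class, so the following Hadoop-free sketch imitates the same idea in plain Java: a "map" step that emits a passer-receiver key for each incomplete pass, and a "reduce" step that sums a count of 1 per key. The regular expression and the sample lines are invented for illustration (the slide does not show the real `incompletePass` pattern), and `IncompletesSketch` is a hypothetical class name.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hadoop-free sketch of counting incomplete passes per passer-receiver pair. */
class IncompletesSketch {

    // Hypothetical pattern; the project's real pattern is not shown on the slide.
    private static final Pattern INCOMPLETE_PASS = Pattern.compile(
            "([A-Z]\\.[A-Za-z'\\-]+) pass incomplete.*? to ([A-Z]\\.[A-Za-z'\\-]+)");

    /** "Map" step: returns the passer-receiver key, or null when the line is not an incomplete pass. */
    static String mapKey(String line) {
        if (!line.contains("incomplete")) {
            return null;
        }
        Matcher m = INCOMPLETE_PASS.matcher(line);
        return m.find() ? m.group(1) + "-" + m.group(2) : null;
    }

    /** "Reduce" step: sums a count of 1 for every key the map step emits. */
    static Map<String, Integer> countIncompletes(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) {
            String key = mapKey(line);
            if (key != null) {
                counts.merge(key, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        // Invented sample lines with placeholder player names.
        List<String> lines = Arrays.asList(
                "(3:10) A.Passer pass incomplete short left to B.Receiver.",
                "(2:55) A.Passer pass incomplete deep right to B.Receiver.",
                "(2:50) A.Passer pass incomplete short middle to C.Other.",
                "(2:48) A.Passer pass short right to B.Receiver to SF 25 for 1 yard.");
        System.out.println(countIncompletes(lines));
    }
}
```

In a real job the framework does the grouping between the two steps; here a sorted map stands in for that shuffle.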
  • 26. Map(k,v): input records and emitted key/value pairs
      Input records:
      • 20020905_SF@NYG,1,,0,SF,NYG,,,,J.Cortez kicks 75 yards from SF 30 to NYG -5. R.Dixon Touchback.,0,0,2002
      • 20020905_SF@NYG,1,60,0,NYG,SF,1,10,80,(15:00) T.Barber left end to NYG 24 for 4 yards (C.Okeafor J.Webster).,0,0,2002
      • 20020905_SF@NYG,1,53,16,NYG,SF,1,10,36,(8:16) T.Barber right guard to SF 30 for 6 yards (J.Winborn).,0,0,2002
      • ...
      Map(k,v) output:
      • Key: 20020905_SF@NYG Value: ... false false false KICK NYG SF
      • Key: 20020905_SF@NYG Value: ... false false false PASS NYG SF
      • Key: 20020905_SF@NYG Value: ... false false false RUN NYG SF
      • ...
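Slide 26 shows each raw line going in and a (game, cleaned value) pair coming out. A minimal sketch of that step, without Hadoop, might look like the following. The keyword rules for KICK, PASS, and RUN are assumptions chosen to fit the sample lines, not the project's actual cleaning logic, and `PlayTypeSketch` is a hypothetical name. Only the play type is emitted as the value; the real output carries more fields.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;
import java.util.regex.Pattern;

/** Sketch of a map step: raw CSV line in, (game id, play type) out. */
class PlayTypeSketch {

    // Hypothetical keyword rule for rushing plays.
    private static final Pattern RUN_WORDS =
            Pattern.compile("\\b(left|right) (end|tackle|guard)\\b|\\bup the middle\\b");

    /** Classifies a line by keywords in its description; anything unrecognized is OTHER. */
    static String playType(String line) {
        if (line.contains(" kicks ")) {
            return "KICK";
        }
        if (line.contains(" pass ")) {
            return "PASS";
        }
        if (RUN_WORDS.matcher(line).find()) {
            return "RUN";
        }
        return "OTHER";
    }

    /** Emits the game id (the text before the first comma) as key and the play type as value. */
    static Map.Entry<String, String> map(String line) {
        int comma = line.indexOf(',');
        String gameId = comma < 0 ? line : line.substring(0, comma);
        return new SimpleEntry<>(gameId, playType(line));
    }

    public static void main(String[] args) {
        String run = "20020905_SF@NYG,1,60,0,NYG,SF,1,10,80,(15:00) T.Barber left end to NYG 24 "
                + "for 4 yards (C.Okeafor J.Webster).,0,0,2002";
        Map.Entry<String, String> out = map(run);
        System.out.println("Key: " + out.getKey() + " Value: " + out.getValue());
    }
}
```

Keying every play by its game id is what lets later steps group and join plays per game.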
  • 27.
  • 28. Driver, Mapper, Reducer
      • Driver does configuration, sets mapper/reducer
      • Our PlayByPlayDriver takes two arguments:
          • Input directory
          • Output directory
      • Most common error in MapReduce:
          • Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /home/cloudera/output already exists
          • Hadoop will not overwrite an existing output directory: remove it first (for example, hadoop fs -rm -r followed by the output directory) or pass a new output directory
  • 29. Import the Java project:
      • Open Eclipse: File -> Import -> "Existing Projects into Workspace" -> /home/cloudera/workspace/nfldata -> Finish -> OK
      Create the job:
      • $ cd src
      • $ javac -classpath `hadoop classpath` *.java
      • Note those are back quotes. This runs the hadoop classpath command and uses its output as the classpath for javac
      • $ jar cf ../playbyplay.jar *.class
      • $ cd ..
      Run the job:
      • $ hadoop jar playbyplay.jar PlayByPlayDriver input playoutput
      • $ hadoop jar playbyplay.jar ArrestJoinDriver playoutput joinedoutput arrests.csv
  • 30. Enter the Query: The Hive Story
  • 31. Hive
      • Abstraction on top of MapReduce
      • Allows queries using a SQL-like language
  • 32. Stadium Data
      Lambeau Field,79594,79594,Green Bay Wisconsin,Desso GrassMaster,TRUE,GB,1957,GHCND:USW00014898,None,581
      • Capacity: The capacity of the stadium
      • Expanded Capacity: The expanded capacity of the stadium
      • Location: The location of the stadium
      • Playing Surface: The type of grass, etc. that the stadium has
      • Is Artificial: Is the playing surface artificial
      • Team: The name of the team that plays at the stadium
      • Roof Type: The type of roof in the stadium (None, Retractable, Dome)
      • Elevation: The elevation of the stadium
  • 33. playbyplay_tablecreate.hql
      drop table if exists stadium;
      CREATE EXTERNAL TABLE stadium (
          Stadium STRING COMMENT 'The name of the stadium',
          Capacity INT COMMENT 'The capacity of the stadium',
          ExpandedCapacity INT COMMENT 'The expanded capacity of the stadium',
          StadiumLocation STRING COMMENT 'The location of the stadium',
          PlayingSurface STRING COMMENT 'The type of grass, etc that the stadium has',
          IsArtificial BOOLEAN COMMENT 'Is the playing surface artificial',
          Team STRING COMMENT 'The name of the team that plays at the stadium',
          Opened INT COMMENT 'The year the stadium opened',
          WeatherStation STRING COMMENT 'The name of the weather station closest to the stadium',
          RoofType STRING COMMENT '(Possible Values:None,Retractable,Dome) - The type of roof in the stadium',
          Elevation INT COMMENT 'The altitude of the stadium'
      )
      ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
      STORED AS TEXTFILE
      LOCATION "/user/cloudera/stadium";
  • 34. Yards to Go
      • $ hive -S -f playbyplay_tablecreate.hql
      • $ hive -S -f playbyplay_join.hql
      • $ hive -S -f adddrives.hql
      • $ hive -S -f adddriveresult.hql
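  • 35. Hive Query
      Give me every run by New Orleans in the 2010 season:
      SELECT * FROM playbyplay WHERE playtype = "RUN" AND year = 2010 AND game LIKE "%NO%";
      Note: the game filter matches every game New Orleans played in, so as written the query also returns the opponent's run plays in those games; limiting it to New Orleans runs needs an additional filter on the offensive team.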
  • 36. Impala
      • Modern MPP database built on top of HDFS
      • Really fast!
      • Written in C++
      • 10-100x faster than Hive
  • 37. From the Data: Field Goals
      • Weather only increases misses by 1%
      • 14% of field goals are missed
      • 21% of field goals are missed in 30-39 MPH average winds
  • 38. Sentry
      • Open source authorization module for Impala & Hive
      • Unlocks key RBAC requirements
          • Secure, fine-grained, role-based authorization
          • Multi-tenant administration
      • Open source
          • Submitted to ASF
          • Supported in Impala 1.1 & HiveServer2 initially
  • 39.

Editor's Notes

  1. Fun. Structured + unstructured. Make some meaningful analysis.
  2. I know we have a mixed group here, so some of you will probably have a good idea of what Hadoop does, the problems it is meant to solve, and some of the processing frameworks that sit on top. My goal here is to reach a broad audience. Hopefully you will learn: how to interact with HDFS; how to run a MapReduce job (probably not how to write one); how to create a table in Hive (or Impala). The problem that Jesse solved is a pretty common one: land raw data in Hadoop, do ETL. Caveats: thankfully, I will not be teaching Java development; I will show conceptually what is happening.
  3. Learn more at screencast. Use QuickStart VM. I am merely the tour guide: I am not the author, and there is plenty about this dataset and MapReduce that I don't know. Jesse Anderson is a curriculum developer for Cloudera University. We will be using MapReduce and Hive, and there are additional code examples for using Pig. This will be high level; more in-depth material is in our training program and online tutorials.
  4. Answer some questions about the NFL: instead of asking the "experts" on the NFL shows, you want to really know. How could we do that? I'll need the data, right? It is really nice to have granular data, so I can ask different types of questions. SQL is really nice because I already know it; I might even want to make some visualizations with it.
  5. Extract value and insight. Play by play vs wins: different aspects -> wins, a particular player, all punts, all kicks, roll-up by quarter. Good example of why Hadoop is valuable -> granular data. True of other systems -> POS, medical records, etc. http://www.flickr.com/photos/billlublin/3972999678/sizes/o/
  6. Why is there a home-field advantage? Turf? Size of stadium? Certain fans? Characteristics of the stadium (turf, etc.)? http://www.flickr.com/photos/37611179@N00/2295452969/in/photolist-4uQNck-5SRuWS-5WYBDL-677pYM-7cscT7-7vyC7G-7XRk46-84U1Ft-ayVaRS-7ReJrS-dpXi1U-8cTwQ1-7Pq9iE-bEo82F-98LeR5-9Ue2aF-b3vtrz-7YWv62
  7. http://www.flickr.com/photos/zruda/1807289958/in/photolist-3KGQkG-44bNJx-4js8Cg-4pQ1bg-4sNLUK-4wBzkz-4wFGmh-559J6y-5nxQVm-5qnF14-5r9AyS-5r9AJq-5KGLMR-5KGQNx-5W2oxe-5W2oKZ-5W6Gt9-5W6Gvs-6k6HX8-6k6J2B-6k6Jcn-6kaUuC-6kaUQE-6wffW7-7chpaN-dFfSAs-8RsNT8-9Pzgh1-9PwrNF-812vNy-a6s3Ec-8NFpHL-bpjMZq-bpjRu1-bnv3gS-8qemwV-dFfSuG-aKju4r-9gin1L/ http://www.flickr.com/photos/17251027@N00/2190657211/in/photolist-4kzG4V-4qfDjD-5e3UP6-5k4eSa-5m73Pf-5mR3nR-5nSv8u-5qnF14-5rGWN8-5rM4m3-5rM58f-5rMcT7-5rMdB3-5rMeko-5rMeZs-5rMhBN-5rMEqb-5rNvKb-5vrbfb-5zUrSt-5C3LQs-5CcaoK-5Cgq7N-5Cgtko-643317-6433ym-649s84-6EBd5T-6LwGEX-6XnJXg-6Y6D6D-71kkp7-741GVR-741H1z-741H5r-741Hcg-741HfM-741Hja-741Hoa-741HyT-741HBx-741HF6-741HJn-741HMR-741J5p-741J9r-741JdM-741Jiz-741JnM-741Jtv-741JxP http://www.flickr.com/photos/kevharb/3124008816/
  8. No direct key between stadium and weather station. The average score in games with weather is 21-18; without weather it is 21-19.
  9. Used by permission of Lego Police Force https://www.facebook.com/LegoPD
  10. Underlying Hadoop is HDFS. It was inspired by Google's GFS: not only does it scale easily and cost-effectively, it is designed for processing massive amounts of information. There are design trade-offs that were made. The first is that this is an append-only file system, so once a file is written and closed, it's done; you can't update it. That said, you interact with it much like a regular Linux filesystem. Imagine I have a shared Linux filesystem: I can create user directories, list contents, and write files. But really you are interacting with a bunch of Java daemons running on top of the underlying Linux filesystem. There are REST APIs and Java APIs, and through those you interact with Hadoop like you would any other file system.
  11. When we write data to disk in HDFS, we optimize it for processing: we avoid seeks and maximize throughput. To do this, we write data sequentially on disk in very large blocks. The block size is configurable; the default is 64 MB, and most people use 128 MB. For maximum throughput, we want to saturate reads off the disk. It's perhaps obvious that we want to avoid seeks, but we don't want blocks that are too big either. Imagine we had a 10 GB file: if it were all in one block, we could get there with one seek, but at 100 MB/sec it would take a really long time to read from disk, more than 1 minute (roughly 100 seconds). With these fixed-size blocks we are balancing throughput and latency. One other thing to think about: generally, we want our files to not be too small, for two reasons. 1. Memory on the master node that keeps track of our files: it keeps the filesystem metadata in RAM, and each file takes 120 bytes or so in RAM on the master node. We don't care about this as much anymore, since machines are bigger. 2. The main problem with small files comes later: when we process this data, we want
  12. What happens when we write data to HDFS: we replicate the data, by default 3 times. Why do we replicate it 3 times? 1. Fault tolerance: HDFS can handle data corruption or the failure of a node, and we still have 2 copies left. HDFS is aware of its physical infrastructure and can handle node or rack failures. 2. There is one other reason, and this is important: we refer to it as data locality. When we process one of these data blocks, we're not pulling it across the network; we send the job to the data. What if this node is busy processing some other data block? Worst case we pull it over the network; the preference is a node in the same rack.
  13. Look at it in HUE. We're actually in our home directory.
  14. Not a Java developer? Don't worry, you don't have to be one. There are lots of other options: use ETL tools, or import from a database using Sqoop as a Hive table.
  15. 6% of plays lack weather data. Hours spent diagnosing missing or bad data. Hours spent downloading data. http://www.flickr.com/photos/37611179@N00/2295452969/in/photolist-4uQNck-5SRuWS-5WYBDL-677pYM-7cscT7-7vyC7G-7XRk46-84U1Ft-ayVaRS-7ReJrS-dpXi1U-8cTwQ1-7Pq9iE-bEo82F-98LeR5-9Ue2aF-b3vtrz-7YWv62
  16. http://www.flickr.com/photos/nathaninsandiego/5159833527/sizes/o/
  17. This breakup creates 96 different queryable columns. http://www.flickr.com/photos/modenadude/6150263821/sizes/o/in/photostream/
  18. Unstructured data. Human generated. http://www.flickr.com/photos/nathaninsandiego/5159833527/sizes/o/
  19. Easy for humans to parse data, hard for computers. Natural language processing. While breaking down the data, we need to know what questions we want to answer. Look back at my commits to see what I've added. http://www.flickr.com/photos/nathaninsandiego/5159833527/sizes/o/
  20. The data passed to the Mapper is specified by an InputFormat: specified in the driver code; defines the location of the input data (a file or directory, for example); determines how to split the input data into input splits; each Mapper deals with a single input split. InputFormat is a factory for RecordReader objects to extract (key, value) records from the input source.
  21. Look at the source code for our mapper, reducer, driver. Go to the source directory where our drivers & mappers are.
  22. 1 yard to go is 65% run. X and 24 has the highest chance of a sack at 4.6%. X and 21 has the highest chance of a QB scramble at 1.7%. X and 10 is about even between pass and run, in the high 40s. http://www.flickr.com/photos/crackerbunny/3215652008/sizes/l/
  23. This breakup creates 96 different queryable columns. Limited to data about plays. http://www.flickr.com/photos/modenadude/6150263821/sizes/o/in/photostream/
  24. If you want interactive SQL, use Impala. Impala is a SQL engine that lives on top of Hadoop; it bypasses MapReduce and will give you
  25. http://www.flickr.com/photos/billlublin/3973002210/sizes/o/
  26. So, the directories and files in HDFS use POSIX-based security controls: owner, group, world plus read, write, execute. One of the things our customers were coming across, though, is that although you can control who is able to access a file or a directory in HDFS, what happens if that file or directory contains a combination of protected and non-protected data? For example, the health record file contains information you're okay with everyone