Hadoop hands on madison


Published in: Technology, Sports
  • Fun. Structured + unstructured. Make some meaningful analysis.
  • I know we have a mixed group here, so some of you will probably have a good idea of what Hadoop does, the problems it's meant to solve, and some of the processing frameworks that sit on top. My goal here is to reach a broad audience. Hopefully you will learn: how to interact with HDFS, how to run a MapReduce job (probably not how to write one), and how to create a table in Hive (or Impala). The problem Jesse solved is a pretty common one -> land raw data in Hadoop, do ETL. Caveats - thankfully, I will not be teaching Java development - I will show conceptually what is happening
  • Learn more at the screencast. Use the QuickStart VM. I am merely the tour guide – I am not the author, and there is plenty about this dataset and MapReduce that I don't know. Jesse Anderson is a curriculum developer for Cloudera University. We will be using MapReduce and Hive, and there are additional code examples for using Pig. This will be high level -> for more in-depth material, see our training program and online tutorials
  • Answer some questions about the NFL - instead of asking the "experts" on the NFL shows - you want to really know. How could we do that? - I'll need the data, right? - it's really nice to have granular data -> I can ask different types of questions - SQL is really nice because I already know it – I might even want to make some visualizations with it
  • Extract value and insight. Play by play vs. wins - different aspects -> wins, a particular player, all punts, all kicks, roll-up by quarter. Good example of why Hadoop is valuable -> granular data - true of other systems too -> POS, medical records, etc. http://www.flickr.com/photos/billlublin/3972999678/sizes/o/
  • Why is there a home-field advantage? - turf? - size of stadium? - certain fans? Characteristics of the stadium? - turf, etc. http://www.flickr.com/photos/37611179@N00/2295452969/in/photolist-4uQNck-5SRuWS-5WYBDL-677pYM-7cscT7-7vyC7G-7XRk46-84U1Ft-ayVaRS-7ReJrS-dpXi1U-8cTwQ1-7Pq9iE-bEo82F-98LeR5-9Ue2aF-b3vtrz-7YWv62
  • http://www.flickr.com/photos/zruda/1807289958/in/photolist-3KGQkG-44bNJx-4js8Cg-4pQ1bg-4sNLUK-4wBzkz-4wFGmh-559J6y-5nxQVm-5qnF14-5r9AyS-5r9AJq-5KGLMR-5KGQNx-5W2oxe-5W2oKZ-5W6Gt9-5W6Gvs-6k6HX8-6k6J2B-6k6Jcn-6kaUuC-6kaUQE-6wffW7-7chpaN-dFfSAs-8RsNT8-9Pzgh1-9PwrNF-812vNy-a6s3Ec-8NFpHL-bpjMZq-bpjRu1-bnv3gS-8qemwV-dFfSuG-aKju4r-9gin1L/
    http://www.flickr.com/photos/17251027@N00/2190657211/in/photolist-4kzG4V-4qfDjD-5e3UP6-5k4eSa-5m73Pf-5mR3nR-5nSv8u-5qnF14-5rGWN8-5rM4m3-5rM58f-5rMcT7-5rMdB3-5rMeko-5rMeZs-5rMhBN-5rMEqb-5rNvKb-5vrbfb-5zUrSt-5C3LQs-5CcaoK-5Cgq7N-5Cgtko-643317-6433ym-649s84-6EBd5T-6LwGEX-6XnJXg-6Y6D6D-71kkp7-741GVR-741H1z-741H5r-741Hcg-741HfM-741Hja-741Hoa-741HyT-741HBx-741HF6-741HJn-741HMR-741J5p-741J9r-741JdM-741Jiz-741JnM-741Jtv-741JxP
    http://www.flickr.com/photos/kevharb/3124008816/
  • No direct key between stadium and weather station. The average score in games with weather is 21-18, and without weather it is 21-19
  • Used by permission of Lego Police Force https://www.facebook.com/LegoPD
  • Underlying Hadoop is HDFS, inspired by Google's GFS - it not only scales easily and cost-effectively, it is designed for processing massive amounts of information. There are design trade-offs that were made: - the first is that this is an append-only file system - once a file is written and closed, it's done – you can't update it - that said, you interact with it much like a regular Linux filesystem - imagine I have a shared Linux filesystem - I can create user directories, list contents, write files - but really you are interacting with a bunch of Java daemons running on top of the underlying Linux filesystem. There are REST APIs and Java APIs – and through those you interact with Hadoop like you would any other file system
  • When we write data to disk in HDFS, we optimize it for processing - to avoid seeks and maximize throughput - to do this, we write data sequentially on disk in very large blocks - the size of the block is configurable; the default is 64 MB, and most people use 128 - for max throughput, we are going to saturate the reads off the disk. It's perhaps obvious to avoid seeks, but we also don't want blocks too big either: - imagine we had a 10 GB file; if that was all in one block, we could get to it with one seek - but at a 100 MB/sec read speed, it would take a really long time to read from disk: >1 minute - with these fixed-size blocks we are balancing throughput and latency. One other thing to think about: generally, we don't want our files too small, for two reasons: 1. Memory on the master node, which keeps track of our filesystem in RAM - each file takes 120 bytes or so of RAM on the master node - we don't care about this as much anymore since machines are bigger 2. The main problem with small files comes later, when we process this data: each small file becomes its own unit of work, which adds overhead
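The block arithmetic above can be sketched in plain Java. This is illustrative math only, not Hadoop code; the 100 MB/sec figure is the talk's example sequential read speed, and the 230 MB file matches the example on the HDFS Blocks slide:

```java
// A small sketch (not Hadoop code) of how HDFS carves a file into
// fixed-size blocks, and why one giant block would hurt latency.
public class BlockMath {
    static final long MB = 1024 * 1024;

    // Number of blocks a file of the given size occupies (ceiling division).
    static long blockCount(long fileSizeBytes, long blockSizeBytes) {
        return (fileSizeBytes + blockSizeBytes - 1) / blockSizeBytes;
    }

    // Seconds to read one block sequentially at the given throughput.
    static double readSeconds(long blockSizeBytes, long mbPerSec) {
        return (double) (blockSizeBytes / MB) / mbPerSec;
    }

    public static void main(String[] args) {
        long fileSize = 230 * MB;   // the 230 MB example file from the slides
        long blockSize = 64 * MB;   // HDFS default in this deck
        System.out.println(blockCount(fileSize, blockSize) + " blocks");  // 4 blocks
        System.out.println(readSeconds(blockSize, 100) + " sec/block");   // 0.64 sec
        // A single 10 GB block at 100 MB/sec would take over a minute to read:
        System.out.println(readSeconds(10L * 1024 * MB, 100) + " sec");   // 102.4 sec
    }
}
```

So a 64 MB block is big enough to amortize the seek, but small enough that one read finishes in well under a second, which is the throughput/latency balance the notes describe.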
  • What happens when we write data to HDFS: we replicate the data, by default 3 times. Why do we replicate it 3 times? 1. Fault tolerance - HDFS can handle data corruption or failure of a node, and we still have 2 copies left - HDFS is aware of its physical infrastructure and can handle node or rack failures 2. There is one other reason, and this is important – we refer to it as data locality - when we do processing on one of these data blocks, we're not pulling it across the network - we send the job to the data - what if that node is busy processing some other data block? - worst case we pull it over the network; the preference is a node in the same rack
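That scheduling preference (node holding the block, then same rack, then anywhere) can be sketched as a toy function. This is not the actual HDFS/MapReduce scheduler logic, and the node/rack names are made up for illustration:

```java
import java.util.List;

// A toy sketch of the data-locality preference described above:
// run the task where the block lives, else in the same rack as a
// replica, else pull the block over the network.
public class LocalityPicker {
    // Classifies a candidate node as "local", "rack-local", or "remote".
    static String locality(String node, String rack,
                           List<String> replicaNodes, List<String> replicaRacks) {
        if (replicaNodes.contains(node)) return "local";      // block is on this node
        if (replicaRacks.contains(rack)) return "rack-local"; // a replica is in this rack
        return "remote";                                      // worst case: network copy
    }

    public static void main(String[] args) {
        // Block replicas live on nodeA/rack1, nodeB/rack1, nodeC/rack2.
        List<String> nodes = List.of("nodeA", "nodeB", "nodeC");
        List<String> racks = List.of("rack1", "rack1", "rack2");
        System.out.println(locality("nodeA", "rack1", nodes, racks)); // local
        System.out.println(locality("nodeD", "rack2", nodes, racks)); // rack-local
        System.out.println(locality("nodeE", "rack3", nodes, racks)); // remote
    }
}
```

Three replicas make a "local" or at least "rack-local" assignment likely even when some nodes are busy, which is the second reason for replication the notes give.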
  • Look at it in HUE - we're actually in our home directory
  • (Not a Java developer?) -> don't worry, you don't have to be a Java developer - lots of other options - use ETL tools - import from a database using Sqoop as a Hive table
  • 6% of plays lack weather data. Hours spent diagnosing missing or bad data. Hours spent downloading data. http://www.flickr.com/photos/37611179@N00/2295452969/in/photolist-4uQNck-5SRuWS-5WYBDL-677pYM-7cscT7-7vyC7G-7XRk46-84U1Ft-ayVaRS-7ReJrS-dpXi1U-8cTwQ1-7Pq9iE-bEo82F-98LeR5-9Ue2aF-b3vtrz-7YWv62
  • http://www.flickr.com/photos/nathaninsandiego/5159833527/sizes/o/
  • This break-up creates 96 different queryable columns. http://www.flickr.com/photos/modenadude/6150263821/sizes/o/in/photostream/
  • Unstructured data. Human generated. http://www.flickr.com/photos/nathaninsandiego/5159833527/sizes/o/
  • Easy for humans to parse, hard for computers. Natural language processing. While breaking down the data, we need to know what questions we want to answer. Look back at my commits to see what I've added. http://www.flickr.com/photos/nathaninsandiego/5159833527/sizes/o/
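As a rough illustration of the kind of pattern matching this breakdown involves, here is a plain-Java regex pulling a few fields out of the sample play description. The pattern and its group layout are my own sketch, not the actual expressions from the nfldata repository:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// A sketch of extracting passer, receiver, and yards gained from a
// play-by-play description like the C.Kaepernick example in the deck.
public class PassParser {
    // Illustrative pattern: "<passer> pass ... to <receiver> to <team> <yd> for <n> yard(s)"
    static final Pattern PASS = Pattern.compile(
        "(\\S+) pass .*? to (\\S+) to \\S+ \\d+ for (-?\\d+) yards?");

    public static void main(String[] args) {
        String play = "(2:48) C.Kaepernick pass short right to M.Crabtree "
                    + "to SF 25 for 1 yard (C.Tillman). Caught at SF 25. 0-yds YAC";
        Matcher m = PASS.matcher(play);
        if (m.find()) {
            System.out.println("passer=" + m.group(1));   // C.Kaepernick
            System.out.println("receiver=" + m.group(2)); // M.Crabtree
            System.out.println("yards=" + m.group(3));    // 1
        }
    }
}
```

A pattern like this works for one phrasing; the hours of data janitorial work the notes mention come from the many variants (incompletions, scrambles, penalties) that each need their own expression.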
  • The data passed to the Mapper is specified by an InputFormat – specified in the driver code – defines the location of the input data – a file or directory, for example – determines how to split the input data into input splits – each Mapper deals with a single input split – InputFormat is a factory for RecordReader objects to extract (key, value) records from the input source
  • Look at the source code for our mapper, reducer, and driver. Go to the source directory where our drivers & mappers are.
  • 1 yard to go is 65% run. X and 24 has the highest chance of a sack at 4.6%. X and 21 has the highest chance of a QB scramble at 1.7%. X and 10 is about even between pass and run, both in the high 40s. http://www.flickr.com/photos/crackerbunny/3215652008/sizes/l/
  • This break-up creates 96 different queryable columns. Limited to data about plays. http://www.flickr.com/photos/modenadude/6150263821/sizes/o/in/photostream/
  • If you want interactive SQL – use Impala - Impala is a SQL engine that lives on top of Hadoop - this bypasses MapReduce and will give you much faster, interactive response times
  • http://www.flickr.com/photos/billlublin/3973002210/sizes/o/
  • So, the directories and files in HDFS use POSIX-based security controls – owner, group, world + read, write, execute. One thing our customers were coming across, though: although you could control who was able to access a file or a directory in HDFS, what happens if that file or directory contains a combination of protected and non-protected data? - the health record file contains information you're okay with everyone seeing, mixed with fields that should be restricted
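For reference, the file-level check itself can be sketched as a toy POSIX-style read test (owner/group/world bits). The class, method, and example users are hypothetical, not Hadoop's actual FsPermission API; the point is that the decision is per-file, so it cannot distinguish protected columns inside the file, which is the gap Sentry addresses:

```java
// A toy sketch of a POSIX-style read check: owner, group, and world
// each get read/write/execute bits, encoded in an octal mode like 0640.
public class PosixCheck {
    // Returns whether `user` (in `group`) may read a file owned by
    // `owner`:`ownerGroup` with the given mode.
    static boolean canRead(int mode, String user, String group,
                           String owner, String ownerGroup) {
        if (user.equals(owner))       return (mode & 0400) != 0; // owner read bit
        if (group.equals(ownerGroup)) return (mode & 0040) != 0; // group read bit
        return (mode & 0004) != 0;                               // world read bit
    }

    public static void main(String[] args) {
        // health.csv owned by hdfs:analysts, mode 0640 (rw- r-- ---)
        System.out.println(canRead(0640, "hdfs", "staff", "hdfs", "analysts"));     // true
        System.out.println(canRead(0640, "alice", "analysts", "hdfs", "analysts")); // true
        System.out.println(canRead(0640, "bob", "staff", "hdfs", "analysts"));      // false
    }
}
```

Either alice reads the whole file or she reads none of it; there is no bit for "only the non-sensitive columns", which is why role-based, column-aware authorization is layered on top.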

    1. Hands-on Hadoop with the NFL Play by Play Dataset. Ryan Bosshart | Systems Engineer. Oct 2013 v2
    2. What's Ahead? "Hands on" with Hadoop using NFL play-by-play • No prior experience needed • Feel free to ask questions
    3. Thanks, Coach. http://www.jesse-anderson.com • @jessetanderson • Code - https://github.com/eljefe6a/nfldata • *we are not in any way affiliated with the NFL or any team
    4. Basic questions • How does Brett Favre's best season compare to Aaron Rodgers'?
    5. Plays • Advanced NFL Stats released all play-by-play since the 2002 season • 2,898 total games • 471,392 plays
    6. Basic Questions: Home Field Advantage?
    7. Stadium Data: Lambeau Field,79594,79594,Green Bay Wisconsin,Desso GrassMaster,TRUE,GB,1957,GHCND:USW00014898,None,581
       • Stadium: the name of the stadium
       • Capacity: the capacity of the stadium
       • Expanded Capacity: the expanded capacity of the stadium
       • Location: the location of the stadium
       • Playing Surface: the type of grass, etc. that the stadium has
       • Is Artificial: is the playing surface artificial
       • Team: the name of the team that plays at the stadium
       • Roof Type: the type of roof in the stadium (None, Retractable, Dome)
       • Elevation: the elevation of the stadium
    8. What about weather?
    9. Weather Data: GHCND:USW00014898,GREEN BAY AUSTIN STRAUBEL INTERNATIONAL AIRPORT WI US,20020101,-9999,-9999,-9999,0,0,0,-9999,477,-22,-133,-9999,0,-9999,23,30,20,9999,45,54,-9999,1514,1402,-9999,-9999,
       • STATION: station identifier
       • STATION NAME: station location name
       • READING DATE: date of reading
       • PRCP: precipitation
       • AWND: average daily wind speed
       • WV20: fog, ice fog, or freezing fog (may include heavy fog)
       • TMAX: maximum temperature
       • TMIN: minimum temperature
    10. Do arrests hurt the team?
    11. Arrest Data
       • Season: season the player was arrested in (February to February)
       • Team: team the person played on
       • Player: name of player
       • Player Arrested: was a player in the play arrested that season
       • Offense Player Arrested: offense had a player arrested that season
       • Defense Player Arrested: defense had a player arrested that season
       • Home Team Player Arrested: home team had a player arrested that season
       • Away Team Player Arrested: away team had a player arrested that season
    12. Stadium + Weather + Arrest + Raw play-by-play -> MR -> Cleaned play-by-play -> MR (map only) -> Arrest + play-by-play -> MR (Hive) -> Play-by-play + Arrest + Stadium + Weather
    13. Step 1: Put the data in HDFS
    14. HDFS: Hadoop Distributed File System • Inspired by the Google File System • Provides low-cost storage for massive amounts of data • Not a general purpose filesystem; optimized for processing data with Hadoop • Cannot modify file content once written • It's actually a user-space Java process • Accessed using special commands or APIs
    15. HDFS Blocks • When data is loaded into HDFS, it's split into blocks • Blocks are of a fixed size (64 MB by default) • These are huge when compared to UNIX filesystems • Example: a 230 MB input file becomes Block 1 (64 MB), Block 2 (64 MB), Block 3 (64 MB), Block 4 (38 MB)
    16. HDFS Replication • Each block is then replicated to multiple machines • Default replication factor is three (but configurable) • (Diagram: Block 1 stored on three of slave nodes A-E)
    17. Try this
        1. $ whoami
        2. $ hadoop fs -ls /
        3. $ hadoop fs -ls /user/
        4. $ hadoop fs -mkdir /user/test/
        5. $ hadoop fs -mkdir /user/cloudera/test
        6. $ hadoop fs -ls
    18. Loading our data. Load the data:
        • $ cd /home/cloudera/workspace/nfldata
        • $ hadoop fs -put -f input
        • $ hadoop fs -mkdir weather
        • $ hadoop fs -put -f 173328.csv weather/
        • $ hadoop fs -mkdir stadium
        • $ hadoop fs -put -f stadiums.csv stadium/
        • $ hadoop fs -put -f arrests.csv
       Check it out in HUE: go to http://localhost:8888/filebrowser/
    19. Step 2: MapReduce
    20. Data Janitorial
    21. Full Play Entry: 20121119_CHI@SF,3,17,48,SF,CHI,3,2,76,20,0,(2:48) C.Kaepernick pass short right to M.Crabtree to SF 25 for 1 yard (C.Tillman). Caught at SF 25. 0-yds YAC,0,3,0,27,7,2012
    22. Queryable Data • Give me every run play by New Orleans in the 2010 season
    23. Play Description: (2:48) C.Kaepernick pass short right to M.Crabtree to SF 25 for 1 yard (C.Tillman). Caught at SF 25. 0-yds YAC
    24. Play by Play Pieces: (2:48) C.Kaepernick pass short right to M.Crabtree to SF 25 for 1 yard (C.Tillman). Caught at SF 25. 0-yds YAC
    25. There's A Custom MapReduce Behind That
        public class IncompletesMapper extends Mapper<LongWritable, Text, Text, PassWritable> {
          @Override
          public void map(LongWritable key, Text value, Context context)
              throws IOException, InterruptedException {
            String line = value.toString();
            if (line.contains("incomplete")) {
              Matcher matcher = incompletePass.matcher(line);
              if (matcher.find()) {
                context.write(new Text(matcher.group(1) + "-" + matcher.group(2)),
                    new PassWritable(1, Integer.parseInt(matcher.group(3))));
              }
            }
          }
        }
    26. Each input record is passed to Map(k,v), keyed by game:
        20020905_SF@NYG,1,,0,SF,NYG,,,,J.Cortez kicks 75 yards from SF 30 to NYG -5. R.Dixon Touchback.,0,0,2002
          -> Key: 20020905_SF@NYG  Value: ... false false false KICK NYG SF
        20020905_SF@NYG,1,60,0,NYG,SF,1,10,80,(15:00) T.Barber left end to NYG 24 for 4 yards (C.Okeafor J.Webster).,0,0,2002
          -> Key: 20020905_SF@NYG  Value: ... false false false RUN NYG SF
        20020905_SF@NYG,1,53,16,NYG,SF,1,10,36,(8:16) T.Barber right guard to SF 30 for 6 yards (J.Winborn).,0,0,2002
          -> Key: 20020905_SF@NYG  Value: ... false false false RUN NYG SF
    27. • Driver, Mapper, Reducer • Driver does configuration, sets mapper/reducer • Our PlayByPlayDriver takes two arguments: input directory and output directory • Most common error in MapReduce: Exception in thread "main" org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory /home/cloudera/output already exists
    28. Import the Java project:
        • Open Eclipse: File -> Import -> "Existing projects into workspace" -> home/cloudera/workspace/nfldata -> Finish -> OK
       Create the job:
        • $ cd src
        • $ javac -classpath `hadoop classpath` *.java
        • Note those are backquotes: this runs the hadoop classpath command and uses its output as the classpath for javac
        • $ jar cf ../playbyplay.jar *.class
        • $ cd ..
       Run the job:
        • $ hadoop jar playbyplay.jar PlayByPlayDriver input playoutput
        • $ hadoop jar playbyplay.jar ArrestJoinDriver playoutput joinedoutput arrests.csv
    29. Enter the Query: The Hive Story
    30. Hive • Abstraction on top of MapReduce • Allows queries using a SQL-like language
    31. Stadium Data: Lambeau Field,79594,79594,Green Bay Wisconsin,Desso GrassMaster,TRUE,GB,1957,GHCND:USW00014898,None,581 (fields as on slide 7: Stadium, Capacity, Expanded Capacity, Location, Playing Surface, Is Artificial, Team, Roof Type, Elevation)
    32. playbyplay_tablecreate.hql
        drop table if exists stadium;
        CREATE EXTERNAL TABLE stadium (
          Stadium STRING COMMENT 'The name of the stadium',
          Capacity INT COMMENT 'The capacity of the stadium',
          ExpandedCapacity INT COMMENT 'The expanded capacity of the stadium',
          StadiumLocation STRING COMMENT 'The location of the stadium',
          PlayingSurface STRING COMMENT 'The type of grass, etc that the stadium has',
          IsArtificial BOOLEAN COMMENT 'Is the playing surface artificial',
          Team STRING COMMENT 'The name of the team that plays at the stadium',
          Opened INT COMMENT 'The year the stadium opened',
          WeatherStation STRING COMMENT 'The name of the weather station closest to the stadium',
          RoofType STRING COMMENT '(Possible Values:None,Retractable,Dome) - The type of roof in the stadium',
          Elevation INT COMMENT 'The altitude of the stadium'
        )
        ROW FORMAT DELIMITED
        FIELDS TERMINATED BY ','
        STORED AS TEXTFILE
        LOCATION "/user/cloudera/stadium";
    33. Yards to Go
        $ hive -S -f playbyplay_tablecreate.hql
        $ hive -S -f playbyplay_join.hql
        $ hive -S -f adddrives.hql
        $ hive -S -f adddriveresult.hql
    34. Hive Query. Give me every run by New Orleans in the 2010 season:
        SELECT * FROM playbyplay WHERE playtype = "RUN" and year = 2010 and game like "%NO%";
    35. Impala • Modern MPP database built on top of HDFS • Really fast! Written in C++ • 10-100x faster than Hive
    36. From the Data: Field Goals • Weather only increases misses by 1% • 14% of field goals are missed overall • 21% of field goals are missed in 30-39 MPH average winds
    37. Sentry • Open source authorization module for Impala & Hive • Unlocks key RBAC requirements: secure, fine-grained, role-based authorization; multi-tenant administration • Open source: submitted to ASF • Supported in Impala 1.1 & HiveServer2 initially