XML Parsing with Map Reduce

Forrester predicts that CIOs who are late to the Hadoop game will finally make the platform a priority in 2015. Hadoop has become a must-know technology and has opened up better career, salary, and job opportunities for many professionals.

  1. Hadoop: The Ultimate Data Storage and Processing, Together (www.edureka.co/big-data-and-hadoop)
  2. Objectives. At the end of this module, you will be able to:
     » Analyze different use cases where MapReduce is used
     » Differentiate between the traditional way and the MapReduce way
     » Learn about the Hadoop 2.x MapReduce architecture and components
     » Understand the execution flow of a YARN MapReduce application
     » Implement basic MapReduce concepts
     » Run a MapReduce program
  3. Where Is MapReduce Used?
     Weather Forecasting
       » Problem statement: finding the maximum temperature recorded in a year.
     HealthCare
       » Problem statement: de-identify personal health information.
  4. Where Is MapReduce Used? (mind-map diagram; branch labels as extracted)
     » Features: large-scale distributed model, design pattern, parallel programming, a program model
     » Used in: classification, analytics, recommendation, index and search
     » Function: Map, Reduce
     » Examples: classification (e.g., top N records), analytics (e.g., join, selection), recommendation (e.g., sort), summarization (e.g., inverted index)
     » Implemented: Google, Apache Hadoop
     » For: HDFS, Pig, Hive, HBase
  5. The Traditional Way (diagram): Very Big Data is cut into several "Split Data" pieces; a separate grep runs over each piece and produces its own matches; cat then concatenates the partial results into "All matches".
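The split / grep / cat pipeline on this slide can be sketched roughly as below. This is only an illustration, assuming a small in-memory list of lines stands in for the "very big data"; the names `split_lines`, `grep`, `traditional_grep` and the `LOG` sample are hypothetical and do not come from the deck.

```python
from concurrent.futures import ThreadPoolExecutor


def split_lines(lines, n_splits):
    """Cut the input into at most n_splits chunks (the 'Split Data' boxes)."""
    size = max(1, -(-len(lines) // n_splits))  # ceiling division
    return [lines[i:i + size] for i in range(0, len(lines), size)]


def grep(chunk, pattern):
    """Return the lines of one chunk that contain the pattern."""
    return [line for line in chunk if pattern in line]


def traditional_grep(lines, pattern, n_splits=3):
    chunks = split_lines(lines, n_splits)
    # One grep per chunk, run side by side; map() keeps the chunk order.
    with ThreadPoolExecutor(max_workers=n_splits) as pool:
        partial = list(pool.map(lambda chunk: grep(chunk, pattern), chunks))
    # The 'cat' step: concatenate the partial results into "All matches".
    return [line for matches in partial for line in matches]


# Made-up sample input, for illustration only.
LOG = [
    "INFO start",
    "ERROR disk full",
    "INFO retry",
    "ERROR timeout",
    "INFO done",
]

all_matches = traditional_grep(LOG, "ERROR")
print(all_matches)
```

In this approach the splitting, the scheduling of each grep, and the final merge are all hand-written, which is the bookkeeping a MapReduce framework is meant to take over.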
  6. The MapReduce Way (diagram): Very Big Data is cut into several "Split Data" pieces; the MapReduce framework runs the MAP and REDUCE phases over them and produces "All matches".
  7. MapReduce Paradigm: The Overall MapReduce Word Count Process
     » Input (K1, V1): "Deer Bear River", "Car Car River", "Deer Car Bear"
     » Splitting: one line per split
     » Mapping, List(K2, V2): (Deer, 1) (Bear, 1) (River, 1) | (Car, 1) (Car, 1) (River, 1) | (Deer, 1) (Car, 1) (Bear, 1)
     » Shuffling, (K2, List(V2)): Bear, (1,1); Car, (1,1,1); Deer, (1,1); River, (1,1)
     » Reducing, List(K3, V3): Bear, 2; Car, 3; Deer, 2; River, 2
     » Final result: Bear, 2; Car, 3; Deer, 2; River, 2
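The word count stages above can be traced in a few lines of plain Python. This is a single-process sketch of the data flow only, not Hadoop code; the function names `map_phase`, `shuffle_phase` and `reduce_phase` are chosen here for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit (word, 1) for every word in one input line."""
    return [(word, 1) for word in line.split()]


def shuffle_phase(pairs):
    """Shuffle: group all values by key, with keys in sorted order."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reduce_phase(key, values):
    """Reduce: sum the counts collected for one word."""
    return (key, sum(values))


# The three input splits from the slide.
splits = ["Deer Bear River", "Car Car River", "Deer Car Bear"]

mapped = [pair for line in splits for pair in map_phase(line)]
grouped = shuffle_phase(mapped)
result = [reduce_phase(key, values) for key, values in grouped.items()]

print(grouped)  # {'Bear': [1, 1], 'Car': [1, 1, 1], 'Deer': [1, 1], 'River': [1, 1]}
print(result)   # [('Bear', 2), ('Car', 3), ('Deer', 2), ('River', 2)]
```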
  8. Anatomy of a MapReduce Program (each record is a key and a value)
     » Map: (K1, V1) -> List(K2, V2)
     » Reduce: (K2, List(V2)) -> List(K3, V3)
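The two signatures can be written down as type hints around a tiny in-memory driver. The sketch below is an assumption-laden illustration: `run_mapreduce` is a hypothetical helper, not a Hadoop API, and the temperature readings are made-up sample data used to echo the "maximum temperature recorded in a year" use case.

```python
from collections import defaultdict
from typing import Callable, Dict, Iterable, List, Tuple, TypeVar

K1 = TypeVar("K1")
V1 = TypeVar("V1")
K2 = TypeVar("K2")
V2 = TypeVar("V2")
K3 = TypeVar("K3")
V3 = TypeVar("V3")


def run_mapreduce(
    records: Iterable[Tuple[K1, V1]],
    mapper: Callable[[K1, V1], Iterable[Tuple[K2, V2]]],
    reducer: Callable[[K2, List[V2]], Iterable[Tuple[K3, V3]]],
) -> List[Tuple[K3, V3]]:
    """Hypothetical single-process driver: map, group by key, reduce."""
    groups: Dict[K2, List[V2]] = defaultdict(list)
    for k1, v1 in records:                 # Map: (K1, V1) -> List(K2, V2)
        for k2, v2 in mapper(k1, v1):
            groups[k2].append(v2)
    output: List[Tuple[K3, V3]] = []
    for k2 in sorted(groups):              # Reduce: (K2, List(V2)) -> List(K3, V3)
        output.extend(reducer(k2, groups[k2]))
    return output


def temp_mapper(offset: int, line: str):
    """K1 = byte offset, V1 = 'year,temperature'; emits (year, temperature)."""
    year, temp = line.split(",")
    yield year, int(temp)


def temp_reducer(year: str, temps: List[int]):
    """Emits the maximum temperature seen for one year."""
    yield year, max(temps)


# Made-up readings, keyed by the byte offset of each line.
READINGS = [
    (0, "1949,111"),
    (9, "1949,78"),
    (17, "1950,0"),
    (24, "1950,22"),
    (32, "1950,-11"),
]

max_per_year = run_mapreduce(READINGS, temp_mapper, temp_reducer)
print(max_per_year)  # [('1949', 111), ('1950', 22)]
```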
  9. Why MapReduce? Two biggest advantages:
     » Taking processing to the data
     » Processing data in parallel
     (Diagram labels: Data Center, Rack, Node, HDFS Block, Map Task; positions a, b, c.)
  10. Hadoop 2.x MapReduce Components
     ApplicationMaster
       » One per application
       » Short life
       » Coordinates and manages MapReduce jobs
       » Negotiates with the Resource Manager to schedule tasks
       » The tasks are started by NodeManager(s)
     Job History Server
       » Maintains information about submitted MapReduce jobs after their ApplicationMaster terminates
     Client
       » Submits a MapReduce job
     Resource Manager
       » Cluster-level resource manager
       » Long life, high-quality hardware
     Node Manager
       » One per Data Node
       » Monitors resources on the Data Node
     Container
       » Created by the NM when requested
       » Allocates a certain amount of resources (memory, CPU, etc.) on a slave node
  11. YARN: Moving beyond MapReduce
     » BATCH (MapReduce)
     » INTERACTIVE (Tez)
     » ONLINE (HBase)
     » STREAMING (Storm, S4, …)
     » GRAPH (Giraph)
     » IN-MEMORY (Spark)
     » HPC MPI (OpenMPI)
     » OTHER (Search) (Weave..)
     Source: http://hadoop.apache.org/docs/stable2/hadoop-yarn/hadoop-yarn-site/YARN.html
  12. MapReduce Application Execution: Executing a MapReduce Application on YARN
  13. YARN MR Application Execution Flow. MapReduce job execution:
     » Job Submission
     » Job Initialization
     » Tasks Assignment
     » Memory Assignment
     » Status Updates
     » Failure Recovery
  14-16. YARN MR Application Execution Flow (one diagram built up over three slides)
     Components shown: Client (Client JVM, Application Job Object, Run Job), HDFS, Resource Manager on the Management Node, Node Manager and App Master on a Data Node, and Node Managers with Map tasks and Blocks on Data node-1 and Data node-2.
     Steps:
       1. Notify Start Application
       2. Get New Application ID
       3. Prepare the Application submit context
          3.1 App Jar
          3.2 Job Resources (Block locations)
          3.3 User Information
       4. Submit Application Context
       5. Start AppMaster container / Allocate Context for AppMaster
       6. Allocate Container for AppMaster
       7. Request Resources
       8. Notify with resources Availability
       9. Start Container in the worker node
       10. NM allocates Container
  17. YARN MR Application Execution Flow (continued)
       11. The tasks get executed.
       12. If the job has any reducers, the AppMaster again requests the Node Manager to start and allocate a container.
       13. The output of all the maps is given to the reducer, and the reducer gets executed.
       14. Once the job has finished, the Application Master notifies the Resource Manager and the client library.
       15. The Application Master is closed.
  18. Hadoop 2.x: YARN Workflow (diagram): a Resource Manager containing the Scheduler and the Applications Manager (AsM), above a grid of Node Managers. App Master 1 runs with Containers 1.1 and 1.2; App Master 2 runs with Containers 2.1, 2.2 and 2.3.
  19-26. Summary: Application Workflow (one diagram of Client, RM, NM and AM, built up one step per slide)
     Execution sequence:
       1. Client submits an application
       2. RM allocates a container to start the AM
       3. AM registers with the RM
       4. AM asks the RM for containers
       5. AM notifies the NM to launch containers
       6. Application code is executed in the container
       7. Client contacts the RM/AM to monitor the application's status
       8. AM unregisters with the RM
  27. Input Splits: INPUT DATA is divided in two ways.
     » Physical division: HDFS Blocks
     » Logical division: Input Splits
  28. Relation Between Input Splits and HDFS Blocks (diagram: a file of lines 1 to 11 with block boundaries and the resulting splits)
     » Logical records do not fit neatly into the HDFS blocks.
     » Logical records are lines, and a line can cross a block boundary.
     » The first split contains line 5 even though that line spans two blocks.
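One way to see how lines that straddle a block boundary still end up in exactly one split is the sketch below. It follows the general idea used by line-oriented record readers (a line belongs to the split in which it starts, and the reader runs past the split end to finish it), but it is a simplified illustration rather than Hadoop's actual implementation; `read_split`, `DATA` and the 8-byte `BLOCK` size are assumptions made for the example.

```python
def read_split(data: bytes, start: int, end: int) -> list:
    """Return the lines that START inside [start, end), read in full."""
    pos = start
    # Unless the split begins exactly at a line start, the first partial
    # line belongs to the previous split, so skip past it.
    if start > 0 and data[start - 1:start] != b"\n":
        newline = data.find(b"\n", start)
        if newline == -1:
            return []
        pos = newline + 1
    records = []
    while pos < end and pos < len(data):
        newline = data.find(b"\n", pos)
        if newline == -1:
            newline = len(data)
        # The slice may extend beyond `end`: the last line is finished
        # even when it crosses the block boundary.
        records.append(data[pos:newline])
        pos = newline + 1
    return records


# Made-up file content and an artificially small block size.
DATA = b"alpha\nbravo charlie\ndelta\necho\n"
BLOCK = 8
SPLITS = [(i, min(i + BLOCK, len(DATA))) for i in range(0, len(DATA), BLOCK)]

per_split = [read_split(DATA, start, end) for start, end in SPLITS]
for (start, end), lines in zip(SPLITS, per_split):
    print(start, end, lines)
```

With these values the second split yields no records at all, because the only line touching it started in the first split; every line is still read exactly once overall.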
  29-34. MapReduce Job Submission Flow (one diagram built up over six slides: INPUT DATA, Map on Node 1 and Node 2, Reduce on Node 1 and Node 2)
     » Input data is distributed to nodes
     » Each map task works on a "split" of data
     » Mapper outputs intermediate data
     » Data is exchanged between nodes in a "shuffle" process
     » Intermediate data of the same key goes to the same reducer
     » Reducer output is stored
  35. Combiner (diagram)
     » Block 1 input: B C D E D B
       Mapper output: (B,1) (C,1) (D,1) (E,1) (D,1) (B,1)
       Combiner output: (B,2) (C,1) (D,2) (E,1)
     » Block 2 input: D A A C B D
       Mapper output: (D,1) (A,1) (A,1) (C,1) (B,1) (D,1)
       Combiner output: (D,2) (A,2) (C,1) (B,1)
     » Shuffle: (A, [2]) (B, [2,1]) (C, [1,1]) (D, [2,2]) (E, [1])
     » Reducer output: (A,2) (B,3) (C,2) (D,4) (E,1)
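The numbers on the combiner slide can be reproduced with a short single-process sketch. The function names `mapper`, `combiner`, `shuffle` and `reducer` are illustrative only; in a real job the combiner is an optional, framework-invoked step, which this toy version simply calls once per block.

```python
from collections import defaultdict


def mapper(block):
    """Emit (letter, 1) for every token in one block."""
    return [(token, 1) for token in block.split()]


def combiner(pairs):
    """Pre-aggregate one mapper's output locally, before the shuffle."""
    totals = {}
    for key, value in pairs:
        totals[key] = totals.get(key, 0) + value
    return list(totals.items())


def shuffle(outputs):
    """Group the combined pairs from all mappers by key, in key order."""
    grouped = defaultdict(list)
    for pairs in outputs:
        for key, value in pairs:
            grouped[key].append(value)
    return dict(sorted(grouped.items()))


def reducer(key, values):
    return (key, sum(values))


block1 = "B C D E D B"
block2 = "D A A C B D"

combined = [combiner(mapper(block)) for block in (block1, block2)]
grouped = shuffle(combined)
reduced = [reducer(key, values) for key, values in grouped.items()]

print(combined[0])  # [('B', 2), ('C', 1), ('D', 2), ('E', 1)]
print(combined[1])  # [('D', 2), ('A', 2), ('C', 1), ('B', 1)]
print(grouped)      # {'A': [2], 'B': [2, 1], 'C': [1, 1], 'D': [2, 2], 'E': [1]}
print(reduced)      # [('A', 2), ('B', 3), ('C', 2), ('D', 4), ('E', 1)]
```

Twelve mapper pairs shrink to eight after combining, so less data crosses the network during the shuffle while the reducer output stays the same.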
  36. Partitioner: Redirecting Output from Mapper (diagram): each of three Map tasks passes its output through its own Partitioner, which routes it to one of three Reducers.
  37. Getting Data to the Mapper (diagram): Input Files are divided into Input Splits; each Input Split is read by a RecordReader, which feeds one Mapper; each Mapper emits intermediates.
  38. Partition and Shuffle (diagram): each of four Mappers emits intermediates; a Partitioner per Mapper decides the destination of each record; the partitioned intermediates are shuffled to three Reducers.
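A hash-based partitioner is one common way to guarantee that intermediate data with the same key reaches the same reducer. The sketch below assumes that approach: Hadoop's default HashPartitioner computes `(key.hashCode() & Integer.MAX_VALUE) % numReduceTasks`, and here `zlib.crc32` is swapped in as a stand-in hash because Python's built-in string hash changes between runs. The names `partition` and `route` are hypothetical.

```python
import zlib


def partition(key: str, num_reducers: int) -> int:
    """Pick a reducer index for a key; equal keys always get the same index."""
    digest = zlib.crc32(key.encode("utf-8"))  # stand-in for key.hashCode()
    return (digest & 0x7FFFFFFF) % num_reducers


def route(pairs, num_reducers):
    """Send each intermediate (key, value) pair to its reducer's bucket."""
    buckets = [[] for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].append((key, value))
    return buckets


# Made-up intermediate output from several mappers.
intermediates = [("Deer", 1), ("Bear", 1), ("River", 1), ("Car", 1),
                 ("Car", 1), ("River", 1), ("Deer", 1), ("Car", 1), ("Bear", 1)]

buckets = route(intermediates, num_reducers=3)
for index, bucket in enumerate(buckets):
    print(index, bucket)
```

Which reducer a given key lands on depends on the hash function, so the printed bucket contents are not significant; the property that matters is that no key is spread over more than one bucket.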
  39. Demo of Word Count Program, to illustrate the default Input Format (Text Input Format)
  40. Input Format (diagram): the InputFormat divides the Input file into Input Splits; each Input Split has a Record Reader that feeds a Mapper, and each Mapper emits intermediates.
  41. Input Format: Class Hierarchy (package org.apache.hadoop.mapreduce). Classes shown:
     » InputFormat<K,V>
     » FileInputFormat<K,V>
     » CombineFileInputFormat<K,V>
     » TextInputFormat
     » KeyValueTextInputFormat
     » NLineInputFormat
     » SequenceFileInputFormat<K,V>
     » SequenceFileAsBinaryInputFormat
     » SequenceFileAsTextInputFormat
     » SequenceFileInputFilter<K,V>
     » ComposableInputFormat<K,V> (<<interface>>)
     » CompositeInputFormat<K,V>
     » DBInputFormat<T>
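To make the difference between two of these formats concrete, the sketch below imitates the records they hand to the mapper: TextInputFormat produces (byte offset, line) pairs, while KeyValueTextInputFormat splits each line at the first separator (a tab by default) into key and value. The Python functions are illustrative stand-ins and not the Hadoop classes themselves; the sample data is made up.

```python
def text_input_records(data: bytes):
    """Imitates TextInputFormat: key = byte offset of the line, value = the line."""
    offset = 0
    for raw in data.splitlines(keepends=True):
        yield offset, raw.rstrip(b"\r\n").decode("utf-8")
        offset += len(raw)


def key_value_text_records(data: bytes, separator: str = "\t"):
    """Imitates KeyValueTextInputFormat: split each line at the first separator.

    A line without the separator becomes the key, with an empty value.
    """
    for raw in data.splitlines():
        key, _, value = raw.decode("utf-8").partition(separator)
        yield key, value


# Made-up two-line input.
SAMPLE = b"hello world\nfoo\tbar\n"

as_text = list(text_input_records(SAMPLE))
as_key_value = list(key_value_text_records(SAMPLE))

print(as_text)       # [(0, 'hello world'), (12, 'foo\tbar')]
print(as_key_value)  # [('hello world', ''), ('foo', 'bar')]
```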
  42. Output Format (diagram): the OutputFormat provides a RecordWriter for each Reducer, and each RecordWriter writes that Reducer's results to its own Output file.
  43. Output Format: Class Hierarchy (package org.apache.hadoop.mapreduce). Classes shown:
     » OutputFormat<K,V>
     » FileOutputFormat<K,V>
     » TextOutputFormat<K,V>
     » SequenceFileOutputFormat<K,V>
     » SequenceFileAsBinaryOutputFormat
     » DBOutputFormat<K,V>
     » NullOutputFormat<K,V>
     » FilterOutputFormat<K,V>
     » LazyOutputFormat<K,V>
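As a rough picture of what a text-style record writer does with reducer output, the sketch below writes one `key<TAB>value` line per record, which matches the default layout of TextOutputFormat. `write_text_output` is a hypothetical helper for illustration, not a Hadoop API, and an in-memory buffer stands in for the output file.

```python
import io


def write_text_output(pairs, out, separator="\t"):
    """Write each (key, value) pair as one 'key<separator>value' line."""
    for key, value in pairs:
        out.write(f"{key}{separator}{value}\n")


# Reducer output from the word count example, written to an in-memory "file".
reducer_output = [("Bear", 2), ("Car", 3), ("Deer", 2), ("River", 2)]
buffer = io.StringIO()
write_text_output(reducer_output, buffer)

print(buffer.getvalue(), end="")
```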
  44. Demo: Custom Input Format
