Hadoop MapReduce

  1. Author: Tom White, Apache Hadoop committer, Cloudera
     Reported by: Tzu-Li Tai, NCKU, HPDS Lab
     Hadoop: The Definitive Guide, 3rd Edition
  2. Hadoop: The Definitive Guide, 3rd Edition
     By Tom White
     Published by O'Reilly Media, 2012
     Referenced chapters:
     Chapter 2 – MapReduce
     Chapter 6 – How MapReduce Works
  3. I. An Introduction: Weather Dataset II. Scaling Out: MapReduce for Large Inputs III. Anatomy of a MapReduce Job IV. MapReduce 2: YARN V. Interesting Topics
  4. I. An Introduction: Weather Dataset II. Scaling Out: MapReduce for Large Inputs III. Anatomy of a MapReduce Job IV. MapReduce 2: YARN V. Interesting Topics
  5. A. What is MapReduce? B. An Example: NCDC Weather Dataset C. Without Hadoop: Analyzing with Unix D. With Hadoop: Java MapReduce
  6. • A computation framework for distributed data processing on top of HDFS.
     • Consists of two phases: a Map phase and a Reduce phase.
     • Inherently parallel, so it works on very large-scale data inputs as well as on small inputs (for performance testing).
     • Data locality optimization.
  7. • Loops through all the year files and uses awk to extract the "temperature" and "quality" fields for processing.
     • A complete run over a century of data took 42 minutes on a single EC2 High-CPU Extra Large instance.
  8. • Straightforward(?): run parts of the program in parallel.
     • Dividing the work into appropriate pieces isn't easy.
     • Coordinating multiple machines (distributed computing) is troublesome.
     • This is where Hadoop and MapReduce come in!
  9. [Figure: the MAPPER function transforms input (key, value) records into intermediate (key, value) records]
  10. [Figure: shuffle and sort groups the intermediate (key, value) records by key and feeds them to the REDUCER function]
  11. Mapper Function in Java
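The mapper code appears in the deck only as an image. Below is a sketch following the book's MaxTemperatureMapper example for the NCDC dataset (the fixed-width field offsets are those used in the book's sample records):

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// For each line of an NCDC record, emit (year, temperature)
// when the reading is present and the quality code is acceptable.
public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    String year = line.substring(15, 19);        // year field
    int airTemperature;
    if (line.charAt(87) == '+') {                // signed temperature field
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);     // quality code
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}
```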
  12. Reducer Function in Java
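Again, the slide shows an image; a sketch of the book's MaxTemperatureReducer:

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

// For each year, find the maximum of all the temperatures
// emitted by the mappers for that year.
public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));
  }
}
```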
  13. Running the MapReduce Job in Java
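A sketch of the driver, per the book's MaxTemperature example (the class names match the two sketches above):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// Configures the job and blocks until it finishes.
public class MaxTemperature {
  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }

    Job job = new Job();                      // current API as of the 3rd edition
    job.setJarByClass(MaxTemperature.class);  // locate the job JAR via this class
    job.setJobName("Max temperature");

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // waitForCompletion() submits the job and polls its progress
    // (step 1 of the job-submission walkthrough later in the deck).
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```

With the three classes packaged in a JAR, the job is launched along the lines of `hadoop MaxTemperature input/ncdc/sample.txt output` (paths illustrative).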
  14. I. An Introduction: Weather Dataset II. Scaling Out: MapReduce for Large Inputs III. Anatomy of a MapReduce Job IV. MapReduce 2: YARN V. Interesting Topics
  15. A. Terminology B. Data Flow C. Combiner Functions
  16. • job – A unit of work that the client wants performed. Consists of the input data, the MapReduce program, and configuration information.
      • task – The job is run by dividing it into two types of tasks: map tasks and reduce tasks.
  17. • Two types of nodes control the job execution process:
      jobtracker – coordinates jobs, schedules tasks, keeps a record of progress.
      tasktrackers – run tasks and send progress reports to the jobtracker.
  18. • The input to a job is divided into input splits, or splits.
      • Each split contains several records.
      • The output of a reduce task is called a part.
  19. [Figure: each input split, made up of records, is processed by its own map task]
  20. Deciding the split size (how many splits?) is a trade-off between load balancing and overhead.
      • A good split tends to be the size of an HDFS block (64 MB by default).
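As an illustration (not from the slides), the new-API FileInputFormat exposes static helpers that bound the computed split size; a minimal sketch:

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeExample {
  // Splits normally default to the HDFS block size (64 MB in this era
  // of Hadoop). More splits improve load balancing; too many splits add
  // per-task overhead, hence the block-sized sweet spot.
  static void boundSplits(Job job) {
    long blockSize = 64L * 1024 * 1024; // 64 MB
    FileInputFormat.setMinInputSplitSize(job, blockSize);
    FileInputFormat.setMaxInputSplitSize(job, blockSize);
  }
}
```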
  21. • Data locality optimization – it is best to run a map task on a node where the input data resides in HDFS → saves cluster bandwidth.
  22. Data flow for a single reduce task
  23. • The default MapReduce job comes with a single reducer; the number of reducers is set with setNumReduceTasks() on Job.
      • For multiple reducers, map tasks partition their output, creating one partition for each reduce task.
      • The (key, value) records for any given key all end up in a single partition.
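A minimal sketch of what this looks like in a driver; the HashPartitioner shown is Hadoop's default and is set explicitly here only for illustration:

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.HashPartitioner;

public class MultipleReducersExample {
  static void configure(Job job) {
    // Two reduce tasks -> map output is split into two partitions, and
    // the job produces two output parts (part-r-00000, part-r-00001).
    job.setNumReduceTasks(2);

    // The default partitioner sends a record to partition
    // (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks,
    // so all records with the same key land in the same partition.
    job.setPartitionerClass(HashPartitioner.class);
  }
}
```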
  24. Data flow with multiple reduce tasks
  25. Data flow with no reduce tasks
  26. • Jobs are limited by the cluster's bandwidth.
      • The data transferred between map and reduce tasks should be minimized.
      • A combiner function can help cut down the amount of data shuffled between the map and reduce tasks.
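For the max-temperature example, the reducer function can double as the combiner, since taking a maximum is commutative and associative; a sketch:

```java
import org.apache.hadoop.mapreduce.Job;

public class CombinerExample {
  static void configure(Job job) {
    job.setMapperClass(MaxTemperatureMapper.class);
    // Combining partial maxima on the map side yields the same final
    // result as shipping every (year, temperature) record to the
    // reducer, but moves far less data across the network.
    job.setCombinerClass(MaxTemperatureReducer.class);
    job.setReducerClass(MaxTemperatureReducer.class);
  }
}
```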
  27. [Figure: shuffle and sort (off-node data transfer; costs bandwidth) from map tasks to the reduce task, without a combiner function – higher bandwidth consumption]
  28. [Figure: shuffle and sort (off-node data transfer; costs bandwidth) from map tasks to the reduce task, with a combiner function – lower bandwidth consumption]
  29. I. An Introduction: Weather Dataset II. Scaling Out: MapReduce for Large Inputs III. Anatomy of a MapReduce Job IV. MapReduce 2: YARN V. Interesting Topics
  30. A. Job submission B. Job initialization C. Task assignment D. Task execution E. Progress and status updates F. Job completion
  31. • The client: submits the MapReduce job.
      • The jobtracker: coordinates the job run → JobTracker
      • The tasktrackers: run the map/reduce tasks → TaskTracker
      • The distributed filesystem, HDFS
  32. 1. Run job
      • waitForCompletion()
      • calls the submit() method on Job
      • creates a JobSubmitter instance
      • calls submitJobInternal()
  33. 2. Get new job ID
      • JobSubmitter asks the jobtracker for a new job ID
      • calls getNewJobId() on JobTracker
  34. Input/output verification
      • Checks the output specification
      • Computes the input splits
  35. 3. Copy job resources
      • Job JAR file
      • Configuration file
      • Computed splits
      • Copied to the jobtracker's filesystem, in a directory named after the job ID
  36. 4. Submit job
      • JobSubmitter tells the jobtracker that the job is ready
      • calls submitJob() on JobTracker
  37. 5. Initialize job
      • job placed into an internal queue
      • the scheduler picks it up and initializes it
      • creates an object to represent the job
  38. 6. Retrieve input splits
      Create the list of tasks:
      • retrieve the computed splits
      • one map task for each split
      • create reduce tasks, the number known from setNumReduceTasks()
      • job setup and cleanup tasks
  39. 7. Heartbeat (returns task)
      • TaskTracker confirms it is operational and ready for a new task
      • JobTracker assigns it a new task
  40. 8. Retrieve job resources
      • localize the job JAR
      • create a local working directory
  41. 9. Launch and 10. Run
      • TaskTracker creates a TaskRunner instance
      • TaskRunner launches a child JVM
      • the child process runs the task
  42. Terminology
      • Status of a job and its tasks:
        – state of the job or task
        – progress of maps and reduces
        – values of the job's counters
        – status message set by the user
      • Progress: the proportion of the task completed.
        – half of the input processed for a map task: progress = 50%
        – half of the input processed for a reduce task: progress = 1/3 (copy phase) + 1/3 (sort phase) + (1/2 × 1/3) (half of the reduce input) = 5/6
  43. Updating hierarchy
      • Updating the TaskTracker:
        – the child sets a flag when its task is complete
        – every 3 s, the flag is checked
      • Updating the JobTracker:
        – every 5 s, the status of all tasks on the TaskTracker is sent to the JobTracker
  44. Status update for the client
      • The client polls the JobTracker every second for the job status.
      • getStatus() on Job → JobStatus instance
  45. • On completion of the job cleanup task, the JobTracker changes the job status to "successful".
      • Job learns the job has completed → prints a message → returns from waitForCompletion().
  46. I. An Introduction: Weather Dataset II. Scaling Out: MapReduce for Large Inputs III. Anatomy of a MapReduce Job IV. MapReduce 2: YARN V. Interesting Topics
  47. A. What is YARN? B. YARN Architecture C. Improvement of the SPOF under YARN
  48. • The next-generation MapReduce: YARN – Yet Another Resource Negotiator
      • The two roles of the jobtracker, job scheduling and task progress monitoring, are separated into two independent daemons: a resource manager and an application master.
      [Figure: the application master (1) asks the resource manager for resources; the resource manager (2) allocates a "container" on a node manager]
  49. • More general than MapReduce.
      • Higher manageability and cluster utilization.
      • It is even possible to run different versions of MapReduce on the same cluster → makes the MapReduce upgrade process more manageable.
  50. Entities of YARN MapReduce
      • The client: submits the job.
      • The YARN ResourceManager: coordinates the allocation of cluster resources → ResourceManager
      • The YARN NodeManager(s): launch and monitor containers → NodeManager
      • The MapReduce application master: coordinates the tasks running the MapReduce job → MRAppMaster
      • The distributed filesystem, HDFS
  51. • Get a new application ID
      • submit the job: submitApplication() on the ResourceManager
  52. 5a. Start container and 5b. launch MRAppMaster
      • Decide: run as an uber task?
      • A small job: < 10 mappers, 1 reducer
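A hedged sketch of how these thresholds are expressed in configuration (the property names are the standard MapReduce 2 uber-task settings; the values shown match the thresholds on the slide):

```java
import org.apache.hadoop.conf.Configuration;

public class UberTaskExample {
  static void configure(Configuration conf) {
    // Allow small jobs to run "uberized", i.e. in the same JVM as the
    // MRAppMaster, avoiding the overhead of allocating new containers.
    conf.setBoolean("mapreduce.job.ubertask.enable", true);
    conf.setInt("mapreduce.job.ubertask.maxmaps", 9);    // fewer than 10 mappers
    conf.setInt("mapreduce.job.ubertask.maxreduces", 1); // at most one reducer
  }
}
```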
  53. • Allocate containers for tasks (step 8)
      • Memory requirements are specified per task (unlike classic MapReduce)
      • Min. allocation: 1024 MB = 1 GB; max. allocation: 10240 MB = 10 GB
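As a sketch (the property names are the standard MapReduce 2 memory settings; the values are illustrative), a job states its per-task memory needs and the scheduler grants containers between its minimum and maximum allocation:

```java
import org.apache.hadoop.conf.Configuration;

public class MemoryRequestExample {
  static void configure(Configuration conf) {
    // Unlike classic MapReduce, where every task slot has a fixed size,
    // a YARN job requests the memory each task type needs; requests are
    // granted in increments of the scheduler's minimum allocation.
    conf.setInt("mapreduce.map.memory.mb", 1024);    // per map task
    conf.setInt("mapreduce.reduce.memory.mb", 2048); // per reduce task
  }
}
```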
  54. • The container is started by calling the NodeManager (step 9a)
      • launch the child JVM, YarnChild (step 9b)
  55. [Figure: comparison – in YARN, the task runs in a child JVM managed by the MRAppMaster; in classic MapReduce, it runs in a child JVM managed by the TaskTracker, under the JobTracker]
  56. • The ResourceManager is designed with a checkpoint mechanism to save its state.
      • The state consists of the node managers in the system as well as the running applications.
      • The amount of state to be stored is much smaller (more manageable) than in classic MapReduce.
  57. I. An Introduction: Weather Dataset II. Scaling Out: MapReduce for Large Inputs III. Anatomy of a MapReduce Job IV. MapReduce 2: YARN V. Interesting Topics
  58. • MapReduce is inherently long-running and batch-oriented.
      • Hive and Pig translate queries into MapReduce jobs, and are therefore not suited to ad hoc queries and have high latency.
      • Google Dremel does not use the MapReduce framework and supports ad hoc queries. (Note: do not confuse this with real-time streaming engines, such as Storm.)
      • The future of Hive/Pig? Apache Drill