Hadoop MapReduce

[Study Report]
Study Material:
http://shop.oreilly.com/product/0636920021773.do

Transcript

  • 1. Hadoop: The Definitive Guide, 3rd Edition. Author: Tom White, Apache Hadoop committer, Cloudera. Reported by: Tzu-Li Tai, NCKU HPDS Lab.
  • 2. Hadoop: The Definitive Guide, 3rd Edition, by Tom White. Published by O’Reilly Media, 2012. Referenced chapters: Chapter 2 – MapReduce; Chapter 6 – How MapReduce Works.
  • 3. I. An Introduction: Weather Dataset II. Scaling Out: MapReduce for Large Inputs III. Anatomy of a MapReduce Job IV. MapReduce 2: YARN V. Interesting Topics
  • 4. I. An Introduction: Weather Dataset II. Scaling Out: MapReduce for Large Inputs III. Anatomy of a MapReduce Job IV. MapReduce 2: YARN V. Interesting Topics
  • 5. A. What is MapReduce? B. An Example: NCDC Weather Dataset C. Without Hadoop: Analyzing with Unix D. With Hadoop: Java MapReduce
  • 6. A computation framework for distributed data processing on top of HDFS. Consists of two phases: a map phase and a reduce phase. Inherently parallel, so it works on very large-scale data inputs as well as small inputs (for performance testing). Provides data locality optimization.
  • 7. Loops through all the year files and uses awk to extract the “temperature” and “quality” fields for processing. A complete run over a century of data took 42 minutes on a single EC2 High-CPU Extra Large instance.
  • 8. Straightforward(?): run parts of the program in parallel. But dividing the work into appropriate pieces isn’t easy, and coordinating multiple machines (distributed computing) is troublesome. This is where Hadoop and MapReduce come in!
  • 9. Diagram: the MAPPER function transforms input (key, value) pairs into intermediate (key, value) pairs.
  • 10. Diagram: the shuffle and sort groups the intermediate (key, value) pairs by key, and the REDUCER function processes each key with its list of values.
  • 11. Mapper Function in Java
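The transcript drops the code shown on this slide. Below is a sketch along the lines of the book’s NCDC max-temperature mapper; the character offsets assume the fixed-width NCDC record format described in Chapter 2, and a temperature of 9999 marks a missing reading.

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class MaxTemperatureMapper
    extends Mapper<LongWritable, Text, Text, IntWritable> {

  private static final int MISSING = 9999;

  @Override
  public void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    String line = value.toString();
    // Fixed-width NCDC record: year at columns 15-19, temperature at 87-92.
    String year = line.substring(15, 19);
    int airTemperature;
    if (line.charAt(87) == '+') { // parseInt doesn't accept a leading plus sign
      airTemperature = Integer.parseInt(line.substring(88, 92));
    } else {
      airTemperature = Integer.parseInt(line.substring(87, 92));
    }
    String quality = line.substring(92, 93);
    // Emit (year, temperature) only for readings that are present and valid.
    if (airTemperature != MISSING && quality.matches("[01459]")) {
      context.write(new Text(year), new IntWritable(airTemperature));
    }
  }
}
```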
  • 12. Reducer Function in Java
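Again the slide’s listing is missing from the transcript; a matching reducer sketch, which selects the maximum temperature recorded for each year:

```java
import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class MaxTemperatureReducer
    extends Reducer<Text, IntWritable, Text, IntWritable> {

  @Override
  public void reduce(Text key, Iterable<IntWritable> values, Context context)
      throws IOException, InterruptedException {
    // The shuffle delivers all temperatures for one year together; keep the largest.
    int maxValue = Integer.MIN_VALUE;
    for (IntWritable value : values) {
      maxValue = Math.max(maxValue, value.get());
    }
    context.write(key, new IntWritable(maxValue));
  }
}
```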
  • 13. Running the MapReduce Job in Java
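A driver sketch in the style of the book’s MaxTemperature example (Job.getInstance() is the newer factory method; the book’s edition constructs Job directly):

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperature {
  public static void main(String[] args) throws Exception {
    if (args.length != 2) {
      System.err.println("Usage: MaxTemperature <input path> <output path>");
      System.exit(-1);
    }
    Job job = Job.getInstance();
    job.setJarByClass(MaxTemperature.class); // locate the job JAR by class
    job.setJobName("Max temperature");

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(MaxTemperatureMapper.class);
    job.setReducerClass(MaxTemperatureReducer.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    // waitForCompletion() submits the job and blocks until it finishes.
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```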
  • 14. I. An Introduction: Weather Dataset II. Scaling Out: MapReduce for Large Inputs III. Anatomy of a MapReduce Job IV. MapReduce 2: YARN V. Interesting Topics
  • 15. A. Terminology B. Data Flow C. Combiner Functions
  • 16. job – a unit of work that the client wants performed; it consists of the input data, the MapReduce program, and configuration information. task – the job is run by dividing it into two types of tasks: map tasks and reduce tasks.
  • 17. Two types of nodes control the job execution process: the jobtracker – coordinates jobs, schedules tasks, and keeps a record of progress; the tasktrackers – run tasks and send progress reports to the jobtracker.
  • 18. The input to a job is divided into input splits, or simply splits. Each split contains several records. The output of a reduce task is called a part.
  • 19. Diagram: each input split consists of records, and each map task processes one split.
  • 20. Deciding the split size is a trade-off between load balancing (more splits keep all nodes busy) and overhead (too many splits cost more to manage). A good split tends to be the size of an HDFS block (64 MB by default); see the sketch below.
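In practice the split size is simply left at the HDFS block size, but the new-API FileInputFormat does expose knobs for experimentation. A minimal sketch (the two static setters below are the org.apache.hadoop.mapreduce.lib.input.FileInputFormat methods):

```java
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class SplitSizeTuning {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    // Cap each split at one 64 MB HDFS block so that every map task
    // can read its whole input from a single (ideally local) block.
    FileInputFormat.setMaxInputSplitSize(job, 64L * 1024 * 1024);
    FileInputFormat.setMinInputSplitSize(job, 1L);
  }
}
```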
  • 21. Data locality optimization – it is best to run each map task on a node where its input data resides in HDFS, which saves cluster bandwidth.
  • 22. Data flow for a single reduce task
  • 23. The default MapReduce job comes with a single reducer; the number can be changed with setNumReduceTasks() on Job. For multiple reducers, map tasks partition their output, creating one partition for each reduce task. The (key, value) records for any given key all land in a single partition; see the partitioner sketch below.
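A sketch of the logic behind the default partitioner (this mirrors Hadoop’s HashPartitioner: hashing the key guarantees that all records with the same key reach the same reducer):

```java
import org.apache.hadoop.mapreduce.Partitioner;

public class HashLikePartitioner<K, V> extends Partitioner<K, V> {
  @Override
  public int getPartition(K key, V value, int numReduceTasks) {
    // Mask the sign bit so the result is non-negative, then spread
    // keys across the reduce tasks by hash code.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }
}
```

Registering it, and picking the reducer count, would look like job.setPartitionerClass(HashLikePartitioner.class) and job.setNumReduceTasks(2).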
  • 24. Data flow with multiple reduce tasks
  • 25. Data flow with no reduce tasks
  • 26. Jobs are limited by the cluster’s bandwidth, so the data transferred between map and reduce tasks should be minimized. A combiner function can help cut down the amount of data shuffled between the map and reduce tasks; a sketch follows the two diagrams below.
  • 27. Diagram: without a combiner function, the shuffle and sort transfers all map output off-node to the reduce task, at higher bandwidth consumption.
  • 28. Diagram: with a combiner function, map output is pre-aggregated locally, so the shuffle and sort transfers less data off-node, at lower bandwidth consumption.
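A sketch of wiring in a combiner, reusing the reducer from the earlier sketch. This is valid here because taking a maximum is commutative and associative, so applying it early cannot change the final result:

```java
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class MaxTemperatureWithCombiner {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance();
    job.setJarByClass(MaxTemperatureWithCombiner.class);
    job.setJobName("Max temperature with combiner");

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    job.setMapperClass(MaxTemperatureMapper.class);
    // The reducer doubles as the combiner: map output is pre-aggregated
    // on the map side, so less data crosses the network in the shuffle.
    job.setCombinerClass(MaxTemperatureReducer.class);
    job.setReducerClass(MaxTemperatureReducer.class);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);
    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
```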
  • 29. I. An Introduction: Weather Dataset II. Scaling Out: MapReduce for Large Inputs III. Anatomy of a MapReduce Job IV. MapReduce 2: YARN V. Interesting Topics
  • 30. A. Job submission B. Job initialization C. Task assignment D. Task execution E. Progress and status updates F. Job completion
  • 31. The client: submits the MapReduce job. The jobtracker: coordinates the job run (class JobTracker). The tasktrackers: run the map/reduce tasks (class TaskTracker). The distributed filesystem, HDFS: shares job files between the other entities.
  • 32. 1. Run job: waitForCompletion() calls the submit() method on Job, which creates a JobSubmitter instance and calls its submitJobInternal() method.
  • 33. 2. Get new job ID: the JobSubmitter asks the jobtracker for a new job ID (by calling getNewJobId() on JobTracker).
  • 34. Input/output verification: checks the job’s output specification and computes the input splits.
  • 35. 3. Copy job resources: the job JAR file, the configuration file, and the computed splits are copied to the jobtracker’s filesystem, in a directory named after the job ID.
  • 36. 4. Submit job: the JobSubmitter tells the jobtracker that the job is ready for execution (by calling submitJob() on JobTracker).
  • 37. 5. Initialize job: the job is placed into an internal queue, from which the job scheduler picks it up and initializes it, creating an object to represent the job.
  • 38. 6. Retrieve input splits: to create the list of tasks, the scheduler retrieves the computed splits and creates one map task for each split, the reduce tasks (their number set by setNumReduceTasks()), and the job setup and cleanup tasks.
  • 39. 7. Heartbeat (returns task): a TaskTracker signals that it is operational and ready for a new task, and the JobTracker assigns it one.
  • 40. 8. Retrieve job resources: localize the job JAR and create a local working directory.
  • 41. 9. Launch and 10. Run: the TaskTracker creates a TaskRunner instance, the TaskRunner launches a child JVM, and the child process runs the task.
  • 42. Terminology. The status of a job and its tasks comprises: the state of the job or task, the progress of maps and reduces, the values of the job’s counters, and a status message set by the user. Progress is the proportion of the task completed. A map task with half of its input processed: progress = 50%. A reduce task with half of its input processed: progress = 1/3 (copy phase) + 1/3 (sort phase) + 1/2 × 1/3 (half of the reduce input) = 5/6.
  • 43. Update hierarchy. Updating the TaskTracker: the child sets a flag when its task is complete, and the flag is checked every 3 seconds. Updating the JobTracker: every 5 seconds, the status of all tasks on a TaskTracker is sent to the JobTracker.
  • 44. Status updates for the client: the client polls the JobTracker every second for job status, via getStatus() on Job, which returns a JobStatus instance.
  • 45. On completion of the job cleanup task, the JobTracker changes the job status to “successful”. When Job learns the job has completed, it prints a message and returns from waitForCompletion().
  • 46. I. An Introduction: Weather Dataset II. Scaling Out: MapReduce for Large Inputs III. Anatomy of a MapReduce Job IV. MapReduce 2: YARN V. Interesting Topics
  • 47. A. What is YARN? B. YARN Architecture C. Improvement of the SPOF situation using YARN
  • 48. The next-generation MapReduce: YARN – Yet Another Resource Negotiator. The two roles of the jobtracker, job scheduling and task progress monitoring, are separated into two independent daemons: a resource manager and an application master. Diagram: the application master (1) asks the resource manager for resources and (2) is allocated a “container” on a node manager.
  • 49. More general than MapReduce, with higher manageability and cluster utilization. It is even possible to run different versions of MapReduce on the same cluster, which makes the MapReduce upgrade process more manageable.
  • 50. Entities of YARN MapReduce. The client: submits the job. The YARN resource manager: coordinates the allocation of cluster resources (class ResourceManager). The YARN node managers: launch and monitor containers (class NodeManager). The MapReduce application master: coordinates the tasks running the MapReduce job (class MRAppMaster). The distributed filesystem, HDFS.
  • 51. The client asks the resource manager for a new application ID, then submits the job by calling submitApplication() on the resource manager.
  • 52. 5a. Start container and 5b. Launch MRAppMaster. The application master then decides whether to run the job as an uber task, i.e., in its own JVM. A small job qualifies: fewer than 10 mappers and a single reducer; a configuration sketch follows below.
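A hedged sketch of the MRv2 properties behind the uber-task decision (these mapreduce.job.ubertask.* names are standard in MRv2; treat the exact threshold values as illustrative):

```java
import org.apache.hadoop.conf.Configuration;

public class UberTaskConfig {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Let the MRAppMaster run "small" jobs inside its own JVM rather
    // than requesting a fresh container for every task.
    conf.setBoolean("mapreduce.job.ubertask.enable", true);
    // Thresholds matching the slide: fewer than 10 mappers, one reducer.
    conf.setInt("mapreduce.job.ubertask.maxmaps", 9);
    conf.setInt("mapreduce.job.ubertask.maxreduces", 1);
  }
}
```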
  • 53. 8. Allocate containers for tasks. Memory requirements are specified per task (unlike classic MapReduce with its fixed slots): minimum allocation 1024 MB (1 GB), maximum allocation 10240 MB (10 GB). A sketch of the request properties follows below.
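A sketch of how a job states its per-task memory requests (mapreduce.map.memory.mb and mapreduce.reduce.memory.mb are standard MRv2 property names; the scheduler rounds each request to fit between the minimum and maximum allocations above):

```java
import org.apache.hadoop.conf.Configuration;

public class TaskMemoryConfig {
  public static void main(String[] args) {
    Configuration conf = new Configuration();
    // Request 2 GB per map task and 4 GB per reduce task; YARN grants
    // containers within the 1 GB minimum and 10 GB maximum allocations.
    conf.setInt("mapreduce.map.memory.mb", 2048);
    conf.setInt("mapreduce.reduce.memory.mb", 4096);
  }
}
```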
  • 54. 9a. The container is started by calling the NodeManager, which 9b. launches the child JVM, YarnChild.
  • 55. Diagram: comparison of the two hierarchies. In classic MapReduce, the JobTracker and TaskTracker manage the child JVM running each task; in YARN, the MRAppMaster takes over that coordinating role.
  • 56. The ResourceManager is designed with a checkpoint mechanism to save its state, which consists of the node managers in the system as well as the running applications. The amount of state to be stored is much smaller (and thus more manageable) than in classic MapReduce.
  • 57. I. An Introduction: Weather Dataset II. Scaling Out: MapReduce for Large Inputs III. Anatomy of a MapReduce Job IV. MapReduce 2: YARN V. Interesting Topics
  • 58. MapReduce is inherently long-running and batch-oriented. Hive and Pig translate queries into MapReduce jobs, so they are not ad hoc and have high latency. Google Dremel does not use the MapReduce framework and supports ad hoc queries (note: not to be confused with real-time streaming engines such as Storm). The future of Hive/Pig? Apache Drill.