Hadoop MapReduce

[Study Report]
Study Material: http://shop.oreilly.com/product/0636920021773.do

Transcript

  • 1. Hadoop: The Definitive Guide, 3rd Edition. Author: Tom White, Apache Hadoop committer, Cloudera. Reported by: Tzu-Li Tai, NCKU, HPDS Lab
  • 2. Hadoop: The Definitive Guide, 3rd Edition, by Tom White. Published by O’Reilly Media, 2012. Referenced chapters: Chapter 2 – MapReduce; Chapter 6 – How MapReduce Works
  • 3. I. An Introduction: Weather Dataset II. Scaling Out: MapReduce for Large Inputs III. Anatomy of a MapReduce Job IV. MapReduce 2: YARN V. Interesting Topics
  • 4. I. An Introduction: Weather Dataset II. Scaling Out: MapReduce for Large Inputs III. Anatomy of a MapReduce Job IV. MapReduce 2: YARN V. Interesting Topics
  • 5. A. What is MapReduce? B. An Example: NCDC Weather Dataset C. Without Hadoop: Analyzing with Unix D. With Hadoop: Java MapReduce
  • 6.  A computation framework for distributed data processing on top of HDFS.  Consists of two phases: a Map phase and a Reduce phase.  Inherently parallel, so it works on very large-scale data inputs as well as small inputs (for performance testing).  Data locality optimization.
  • 7. • Loops through all the year files and uses awk to extract the “temperature” and “quality” fields. • A complete run over a century of data took 42 minutes on a single EC2 High-CPU Extra Large instance.
  • 8. • Straightforward(?): run parts of the program in parallel. • Appropriate division of the work into pieces isn’t easy. • Using multiple machines (distributed computing) is troublesome. • This is where Hadoop and MapReduce come in!
  • 9. (slide diagram: (key, value) pairs fed to the MAPPER function)
  • 10. (slide diagram: shuffle and sort, then (key, value) pairs fed to the REDUCER function)
  • 11. Mapper Function in Java
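The transcript does not reproduce the slide's code. For reference, a minimal sketch of a mapper for the NCDC maximum-temperature example, modeled on the book's MaxTemperatureMapper (the fixed character offsets are those of the NCDC record format assumed in the book):

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Emits a (year, air temperature) pair for every valid NCDC record.
    public class MaxTemperatureMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

      private static final int MISSING = 9999; // NCDC sentinel for a missing reading

      @Override
      public void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String line = value.toString();
        String year = line.substring(15, 19);
        int airTemperature;
        if (line.charAt(87) == '+') { // skip a leading plus sign before parsing
          airTemperature = Integer.parseInt(line.substring(88, 92));
        } else {
          airTemperature = Integer.parseInt(line.substring(87, 92));
        }
        String quality = line.substring(92, 93);
        if (airTemperature != MISSING && quality.matches("[01459]")) {
          context.write(new Text(year), new IntWritable(airTemperature));
        }
      }
    }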
  • 12. Reducer Function in Java
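Likewise, a minimal reducer sketch in the spirit of the book's MaxTemperatureReducer, selecting the maximum temperature recorded for each year:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Receives all temperatures for a given year and keeps the maximum.
    public class MaxTemperatureReducer
        extends Reducer<Text, IntWritable, Text, IntWritable> {

      @Override
      public void reduce(Text key, Iterable<IntWritable> values, Context context)
          throws IOException, InterruptedException {
        int maxValue = Integer.MIN_VALUE;
        for (IntWritable value : values) {
          maxValue = Math.max(maxValue, value.get());
        }
        context.write(key, new IntWritable(maxValue));
      }
    }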
  • 13. Running the MapReduce Job in Java
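A driver sketch for wiring the two classes into a job and submitting it (pre-YARN style, as in the book; class name and paths are illustrative):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class MaxTemperature {
      public static void main(String[] args) throws Exception {
        if (args.length != 2) {
          System.err.println("Usage: MaxTemperature <input path> <output path>");
          System.exit(-1);
        }
        Job job = new Job();                     // Job.getInstance() in later Hadoop releases
        job.setJarByClass(MaxTemperature.class); // used to locate the job JAR on the cluster
        job.setJobName("Max temperature");

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        job.setMapperClass(MaxTemperatureMapper.class);
        job.setReducerClass(MaxTemperatureReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // Submit the job and block until it finishes, printing progress along the way.
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

It would typically be launched with something like "hadoop jar max-temperature.jar MaxTemperature input/ncdc output" (JAR name and paths illustrative).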
  • 14. I. An Introduction: Weather Dataset II. Scaling Out: MapReduce for Large Inputs III. Anatomy of a MapReduce Job IV. MapReduce 2: YARN V. Interesting Topics
  • 15. A. Terminology B. Data Flow C. Combiner Functions
  • 16.  job – A unit of work that the client wants performed. Consists of the input data, the MapReduce program, and configuration information.  task – The job is run by dividing it into two types of tasks: map tasks and reduce tasks.
  • 17.  Two types of nodes control the job execution process: jobtracker – coordinates jobs, schedules tasks, keeps a record of progress. tasktrackers – run tasks and send progress reports to the jobtracker.
  • 18.  The input to a job is divided into input splits, or splits.  Each split contains several records.  The output of a reduce task is called a part.
  • 19. (slide diagram: input splits, each made of records, feeding individual map tasks)
  • 20. Deciding split size: load balancing vs. overhead (how many splits?)  A good split tends to be the size of an HDFS block (64 MB by default).
  • 21.  Data locality optimization – Best to run the map task on a node where the input data resides in HDFS  saves cluster bandwidth
  • 22. Data flow for a single reduce task
  • 23.  The default MapReduce job comes with a single reducer; the number is set with setNumReduceTasks() on Job.  For multiple reducers, map tasks partition their output, creating one partition for each reduce task.  The (key, value) records for any given key are all in a single partition.
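As a small illustration (continuing the hypothetical driver sketch above), the reducer count is set on the Job; with more than one reducer the map output is partitioned, by default with HashPartitioner, so all records for a given key land in the same partition:

    // In the driver, before submitting the job (the value 2 is purely illustrative):
    job.setNumReduceTasks(2);   // two reduce tasks -> two output files, part-r-00000 and part-r-00001
    // job.setNumReduceTasks(0) would skip the reduce phase; map output is written straight to HDFS.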
  • 24. Data flow with multiple reduce tasks
  • 25. Data flow with no reduce tasks
  • 26.  Jobs are limited by the bandwidth.  Should minimize the data transferred between map and reduce tasks.  A combiner function can help cut down the amount of data shuffled between the map and reduce tasks.
  • 27. (slide diagram) Without a combiner function: higher bandwidth consumption – map tasks send all of their output through shuffle and sort (off-node data transfer that costs bandwidth) to the reduce task, which writes to HDFS.
  • 28. (slide diagram) Using a combiner function: lower bandwidth consumption – less data passes through shuffle and sort from the map tasks to the reduce task and HDFS.
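Since taking a maximum is commutative and associative, the reducer class from the earlier sketch can double as the combiner; wiring it in is one extra line in the driver (again an illustrative sketch, not the slide's exact code):

    // The combiner runs on each map task's output before the shuffle,
    // collapsing many (year, temperature) pairs into one per year and so
    // cutting the data transferred across the network to the reducers.
    job.setCombinerClass(MaxTemperatureReducer.class);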
  • 29. I. An Introduction: Weather Dataset II. Scaling Out: MapReduce for Large Inputs III. Anatomy of a MapReduce Job IV. MapReduce 2: YARN V. Interesting Topics
  • 30. A. Job submission B. Job initialization C. Task assignment D. Task execution E. Progress and status updates F. Job completion
  • 31.  The client: submits the MapReduce job.  The jobtracker: coordinates the job run  JobTracker  The tasktrackers: run the map/reduce tasks  TaskTracker  The distributed filesystem, HDFS
  • 32. 1. Run job  waitForCompletion() calls the submit() method on Job  creates a JobSubmitter instance  calls submitJobInternal()
  • 33. 2. Get new job ID  JobSubmitter asks the jobtracker for a new job ID  calls getNewJobId() on JobTracker.
  • 34. input/output verification  Checks output specification  Computes input splits
  • 35. 3. Copy job resources  Job JAR file  Configuration file  Computed splits  Copy to jobtracker’s filesystem in a directory named after the job ID
  • 36. 4. Submit job  JobSubmitter tells the jobtracker that the job is ready  calls submitJob() on JobTracker.
  • 37. 5. Initialize job  job placed into an internal queue  the scheduler picks it up and initializes it  creates an object to represent the job
  • 38. 6. Retrieve input splits  Create the list of tasks:  retrieve the computed splits  one map task for each split  create reduce tasks, the number given by setNumReduceTasks()  job setup and cleanup tasks
  • 39. 7. Heartbeat (returns task)  TaskTracker confirms it is operational and ready for a new task  JobTracker assigns a new task
  • 40. 8. Retrieve job resources  localize job JAR  create local working directory
  • 41. 9. Launch and 10. Run  TaskTracker creates a TaskRunner instance  TaskRunner launches a child JVM  the child process runs the task
  • 42. Terminology  Status of a job and its tasks:  state of the job or task  progress of maps and reduces  values of the job’s counters  status message set by the user.  Progress: the proportion of the task completed.  half of the input processed for a map task: progress = 50%  half of the input processed for a reduce task: progress = 1/3 (copy phase) + 1/3 (sort phase) + 1/2 × 1/3 (half of the input) = 5/6
  • 43. Updating Hierarchy  Updating the TaskTracker:  the child sets a flag when complete  every 3 s, the flag is checked  Updating the JobTracker:  every 5 s, the status of all tasks on the TaskTracker is sent to the JobTracker
  • 44. Status update for client  The client polls the JobTracker every second for the job status.  getStatus() on Job  returns a JobStatus instance
  • 45.  On completing the job cleanup task, JobTracker changes status to “successful”.  Job learns the job has completed  prints a message  returns from waitForCompletion().
  • 46. I. An Introduction: Weather Dataset II. Scaling Out: MapReduce for Large Inputs III. Anatomy of a MapReduce Job IV. MapReduce 2: YARN V. Interesting Topics
  • 47. A. What is YARN? B. YARN Architecture C. Improvement of SPOF using YARN
  • 48.  The next generation MapReduce: YARN – Yet Another Resource Negotiator  The two roles of the jobtracker, job scheduling and task progress monitoring, are separated into two independent daemons: a resource manager and an application master. (slide diagram: 1. the application master asks the resource manager for resources; 2. the resource manager allocates a “container” on a node manager)
  • 49.  More general than MapReduce.  Higher manageability and cluster utilization.  Even possible to run different versions of MapReduce on the same cluster  makes the MapReduce upgrade process more manageable.
  • 50. Entities of YARN MapReduce  The client: submits the job.  The YARN ResourceManager: coordinates allocation of cluster resources  ResourceManager  The YARN NodeManager(s): launch and monitor containers.  NodeManager  The MapReduce application master: coordinates tasks running the MapReduce job  MRAppMaster  The distributed filesystem, HDFS
  • 51.  Get a new application ID  submit the job by calling submitApplication() on the ResourceManager
  • 52.  5a. Start container and 5b. Launch MRAppMaster  Decide: run as an uber task?  Small job: < 10 mappers, 1 reducer
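The “small job” thresholds for running as an uber task are configurable; a sketch using the MRv2 property names (the values shown are illustrative, not the slide's figures, and the snippet continues the hypothetical driver above):

    // Requires org.apache.hadoop.conf.Configuration.
    Configuration conf = job.getConfiguration();
    conf.setBoolean("mapreduce.job.ubertask.enable", true); // allow small jobs to run inside the MRAppMaster's JVM
    conf.setInt("mapreduce.job.ubertask.maxmaps", 9);       // at most this many map tasks
    conf.setInt("mapreduce.job.ubertask.maxreduces", 1);    // at most one reduce task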
  • 53.  Allocate containers for tasks (8)  Memory requirements are specified (unlike classic MapReduce)  Min. allocation (1024MB = 1GB) Max. allocation (10240MB = 10GB)
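Per-task memory in MRv2 is requested rather than fixed by slots; requests must fall between the scheduler's minimum and maximum allocations quoted above. A sketch with the Hadoop 2.x property names (values illustrative, continuing the same driver):

    // Ask for 1 GB per map task and 2 GB per reduce task. Requests are
    // constrained by yarn.scheduler.minimum-allocation-mb and
    // yarn.scheduler.maximum-allocation-mb on the cluster side.
    conf.setInt("mapreduce.map.memory.mb", 1024);
    conf.setInt("mapreduce.reduce.memory.mb", 2048);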
  • 54.  The container is started by calling the NodeManager (9a.)  launches the child JVM, YarnChild (9b.)
  • 55. (slide diagram comparing the status-update hierarchies of YARN and classic MapReduce, involving: Task, Child JVM, MRAppMaster, TaskTracker, JobTracker)
  • 56.  The ResourceManager is designed with a checkpoint mechanism to save its state.  The state consists of the node managers in the system as well as the running applications.  The amount of state to be stored is much smaller (more manageable) than in classic MapReduce.
  • 57. I. An Introduction: Weather Dataset II. Scaling Out: MapReduce for Large Inputs III. Anatomy of a MapReduce Job IV. MapReduce 2: YARN V. Interesting Topics
  • 58.  MapReduce is inherently long-running and batch-oriented.  Hive and Pig translate queries into MapReduce jobs, and are therefore not ad hoc and have high latency.  Google Dremel does not use the MapReduce framework and supports ad hoc queries. (Note: do not confuse this with real-time streaming engines such as “Storm”.)  The future of Hive/Pig? Apache Drill
