Hadoop MapReduce
[Study Report]
Study Material:
http://shop.oreilly.com/product/0636920021773.do

    Presentation Transcript

    • Hadoop: The Definitive Guide, 3rd Edition. Author: Tom White (Apache Hadoop committer, Cloudera). Reported by: Tzu-Li Tai, NCKU HPDS Lab.
    • Hadoop: The Definitive Guide, 3rd Edition, by Tom White. Published by O’Reilly Media, 2012. Referenced chapters: Chapter 2 (MapReduce) and Chapter 6 (How MapReduce Works).
    • I. An Introduction: Weather Dataset II. Scaling Out: MapReduce for Large Inputs III. Anatomy of a MapReduce Job IV. MapReduce 2: YARN V. Interesting Topics
    • A. What is MapReduce? B. An Example: NCDC Weather Dataset C. Without Hadoop: Analyzing with Unix D. With Hadoop: Java MapReduce
    • A computation framework for distributed data processing on top of HDFS.
      - Consists of two phases: the Map phase and the Reduce phase.
      - Inherently parallel, so it works on very large inputs as well as on small inputs (for performance testing).
      - Provides data locality optimization.
    • The Unix approach loops through all the year files and uses awk to extract the “temperature” and “quality” fields.
      - A complete run over a century of data took 42 minutes on a single EC2 High-CPU Extra Large instance.
    • Straightforward(?): run parts of the program in parallel.
      - Dividing the work into appropriate pieces isn’t easy.
      - Coordinating multiple machines (distributed computing) is troublesome.
      - This is where Hadoop and MapReduce come in!
    • [Diagram: the MAPPER function transforms each input record into a (key, value) pair]
    • [Diagram: (key, value) pairs pass through the shuffle and sort into the REDUCER function]
    • Mapper Function in Java
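    The mapper code shown on this slide did not survive the transcript. Below is a sketch in the spirit of the book’s max-temperature example; the NCDC field offsets and the MISSING sentinel follow the book’s description, so treat the details as an assumption rather than a copy of the slide:

      // Mapper sketch (new MapReduce API): extracts (year, temperature) pairs
      // from NCDC weather records, skipping missing or low-quality readings.
      import java.io.IOException;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Mapper;

      public class MaxTemperatureMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final int MISSING = 9999;

        @Override
        public void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          String line = value.toString();
          String year = line.substring(15, 19);          // year field of the NCDC record
          int airTemperature;
          if (line.charAt(87) == '+') {                  // parseInt rejects a leading plus sign
            airTemperature = Integer.parseInt(line.substring(88, 92));
          } else {
            airTemperature = Integer.parseInt(line.substring(87, 92));
          }
          String quality = line.substring(92, 93);
          if (airTemperature != MISSING && quality.matches("[01459]")) {
            context.write(new Text(year), new IntWritable(airTemperature));
          }
        }
      }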
    • Reducer Function in Java
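    The reducer slide’s code image is likewise missing; a matching sketch, with the same caveat that it follows the book’s example rather than the slide:

      // Reducer sketch: emits the maximum temperature seen for each year.
      import java.io.IOException;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Reducer;

      public class MaxTemperatureReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int maxValue = Integer.MIN_VALUE;
          for (IntWritable value : values) {
            maxValue = Math.max(maxValue, value.get());  // running maximum over all readings
          }
          context.write(key, new IntWritable(maxValue)); // (year, max temperature)
        }
      }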
    • Running the MapReduce Job in Java
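    A sketch of the driver that wires the two classes together and submits the job; the structure follows the book’s pattern (Job.getInstance() is the factory method, equivalent constructors also exist):

      // Driver sketch: configures input/output paths and mapper/reducer classes,
      // then submits the job and waits for it to finish.
      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
      import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

      public class MaxTemperature {
        public static void main(String[] args) throws Exception {
          if (args.length != 2) {
            System.err.println("Usage: MaxTemperature <input path> <output path>");
            System.exit(-1);
          }
          Job job = Job.getInstance();
          job.setJarByClass(MaxTemperature.class);     // locates the job JAR by this class
          job.setJobName("Max temperature");
          FileInputFormat.addInputPath(job, new Path(args[0]));
          FileOutputFormat.setOutputPath(job, new Path(args[1]));
          job.setMapperClass(MaxTemperatureMapper.class);
          job.setReducerClass(MaxTemperatureReducer.class);
          job.setOutputKeyClass(Text.class);
          job.setOutputValueClass(IntWritable.class);
          // waitForCompletion() submits the job and polls its progress until done
          System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
      }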
    • I. An Introduction: Weather Dataset II. Scaling Out: MapReduce for Large Inputs III. Anatomy of a MapReduce Job IV. MapReduce 2: YARN V. Interesting Topics
    • A. Terminology B. Data Flow C. Combiner Functions
    • job: a unit of work that the client wants performed. It consists of the input data, the MapReduce program, and configuration information.
      - task: a job is run by dividing it into two types of tasks: map tasks and reduce tasks.
    • Two types of nodes control the job execution process:
      - jobtracker: coordinates jobs, schedules tasks, and keeps a record of progress.
      - tasktrackers: run tasks and send progress reports to the jobtracker.
    • The input to a job is divided into input splits, or simply splits.
      - Each split contains several records.
      - The output of a reduce task is called a part.
    • [Diagram: each input split, made up of records, feeds one map task]
    • Deciding the split size is a balance between load balancing and overhead (how many splits?).
      - A good split tends to be the size of an HDFS block (64 MB by default). (A tuning sketch follows.)
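    If the computed split size needs adjusting, FileInputFormat exposes min/max bounds; a minimal sketch (the byte values are illustrative, not recommendations):

      // Sketch: bound the split size that FileInputFormat computes.
      import org.apache.hadoop.mapreduce.Job;
      import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

      public class SplitSizeExample {
        static void configureSplits(Job job) {
          FileInputFormat.setMinInputSplitSize(job, 64 * 1024 * 1024L);   // 64 MB lower bound
          FileInputFormat.setMaxInputSplitSize(job, 128 * 1024 * 1024L);  // 128 MB upper bound
        }
      }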
    • Data locality optimization: it is best to run a map task on a node where its input data resides in HDFS, which saves cluster bandwidth.
    • Data flow for a single reduce task
    • The default MapReduce job comes with a single reducer; change this with setNumReduceTasks() on Job (see the sketch below).
      - With multiple reducers, map tasks partition their output, creating one partition for each reduce task.
      - The (key, value) records for any given key are all in a single partition.
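    A minimal sketch of changing the reducer count (the count of four is arbitrary); with more than one reducer, the default HashPartitioner routes each key to hash(key) mod numReduceTasks:

      // Sketch: request four reduce tasks; each produces one output part file
      // (part-r-00000 .. part-r-00003).
      import org.apache.hadoop.mapreduce.Job;

      public class ReducerCountExample {
        static void configure(Job job) {
          job.setNumReduceTasks(4);
        }
      }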
    • Data flow with multiple reduce tasks
    • Data flow with no reduce tasks
    • Jobs are limited by cluster bandwidth, so the data transferred between map and reduce tasks should be minimized.
      - A combiner function can help cut down the amount of data shuffled between the map and reduce tasks (see the sketch after the two data-flow comparisons below).
    • [Diagram: without a combiner function, the shuffle and sort moves all map output off-node to the reduce task, at higher bandwidth consumption]
    • [Diagram: with a combiner function, map output is pre-aggregated locally before the shuffle and sort, at lower bandwidth consumption]
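    Because taking a maximum is commutative and associative, the max-temperature reducer can double as the combiner; a sketch reusing the classes from the earlier slides (not every reducer can be reused this way):

      // Sketch: run the reducer logic on map-side output as a combiner,
      // shrinking the data shuffled to the reduce task.
      import org.apache.hadoop.mapreduce.Job;

      public class CombinerExample {
        static void configure(Job job) {
          job.setMapperClass(MaxTemperatureMapper.class);
          job.setCombinerClass(MaxTemperatureReducer.class);  // pre-aggregates map output
          job.setReducerClass(MaxTemperatureReducer.class);
        }
      }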
    • I. An Introduction: Weather Dataset II. Scaling Out: MapReduce for Large Inputs III. Anatomy of a MapReduce Job IV. MapReduce 2: YARN V. Interesting Topics
    • A. Job submission B. Job initialization C. Task assignment D. Task execution E. Progress and status updates F. Job completion
    • The entities of classic MapReduce:
      - The client: submits the MapReduce job.
      - The jobtracker: coordinates the job run (JobTracker).
      - The tasktrackers: run the map/reduce tasks (TaskTracker).
      - The distributed filesystem, HDFS.
    • 1. Run job: waitForCompletion() calls the submit() method on Job, which creates a JobSubmitter instance and calls submitJobInternal().
    • 2. Get new job ID: the JobSubmitter asks the jobtracker for a new job ID (calls getNewJobId() on JobTracker).
    • Input/output verification: checks the job’s output specification and computes the input splits.
    • 3. Copy job resources: the job JAR file, the configuration file, and the computed splits are copied to the jobtracker’s filesystem, into a directory named after the job ID.
    • 4. Submit job: the JobSubmitter tells the jobtracker that the job is ready for execution (calls submitJob() on JobTracker).
    • 5. Initialize job: the job is placed into an internal queue, from where the scheduler picks it up and initializes it, creating an object to represent the job.
    • 6. Retrieve input splits: create the list of tasks.
      - Retrieve the computed splits; one map task is created for each split.
      - Create the reduce tasks; their number is known from setNumReduceTasks().
      - Also create the job setup and cleanup tasks.
    • 7. Heartbeat (returns task): the TaskTracker confirms it is operational and ready for a new task; the JobTracker assigns it one.
    • 8. Retrieve job resources: localize the job JAR and create a local working directory.
    • 9. Launch and 10. Run: the TaskTracker creates a TaskRunner instance; the TaskRunner launches a child JVM; the child process runs the task.
    • Terminology
      - Status of a job and its tasks: the state of the job or task, the progress of maps and reduces, the values of the job’s counters, and any status message set by the user.
      - Progress: the proportion of the task completed.
      - For a map task with half of its input processed: progress = 50%.
      - For a reduce task with half of its input processed: progress = 1/3 (copy phase) + 1/3 (sort phase) + (1/2 × 1/3) (half of the reduce input) = 5/6.
    • Updating hierarchy
      - Updating the TaskTracker: the child sets a flag when its task is complete; the flag is checked every 3 seconds.
      - Updating the JobTracker: every 5 seconds, the status of all tasks on a TaskTracker is sent to the JobTracker.
    • Status update for the client: the client polls the JobTracker every second for the job status, calling getStatus() on Job to obtain a JobStatus instance. (A polling sketch follows.)
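    A client-side sketch of that polling loop using Job’s progress methods; the one-second interval mirrors the slide, and the output format is illustrative:

      // Sketch: poll the job roughly once per second until it completes.
      import org.apache.hadoop.mapreduce.Job;

      public class StatusPoller {
        static void waitAndReport(Job job) throws Exception {
          while (!job.isComplete()) {
            System.out.printf("map %.0f%% reduce %.0f%%%n",
                job.mapProgress() * 100, job.reduceProgress() * 100);
            Thread.sleep(1000);  // the client polls about once per second
          }
          System.out.println(job.isSuccessful() ? "Job succeeded" : "Job failed");
        }
      }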
    • On completion of the job cleanup task, the JobTracker changes the job status to “successful”.
      - The Job object learns that the job has completed, prints a message, and returns from waitForCompletion().
    • I. An Introduction: Weather Dataset II. Scaling Out: MapReduce for Large Inputs III. Anatomy of a MapReduce Job IV. MapReduce 2: YARN V. Interesting Topics
    • A. What is YARN? B. YARN Architecture C. Improvement of SPOF using YARN
    • The next generation of MapReduce: YARN (Yet Another Resource Negotiator).
      - The two roles of the jobtracker, job scheduling and task progress monitoring, are separated into two independent daemons: a resource manager and an application master.
      [Diagram: the application master (1) asks the resource manager for resources; the resource manager (2) allocates a “container”, which runs on a node manager]
    • More general than MapReduce.
      - Higher manageability and cluster utilization.
      - It is even possible to run different versions of MapReduce on the same cluster, which makes the MapReduce upgrade process more manageable.
    • Entities of YARN MapReduce:
      - The client: submits the job.
      - The YARN resource manager: coordinates the allocation of cluster resources (ResourceManager).
      - The YARN node manager(s): launch and monitor containers (NodeManager).
      - The MapReduce application master: coordinates the tasks running the MapReduce job (MRAppMaster).
      - The distributed filesystem, HDFS.
    • Job submission: obtain a new application ID, then call submitApplication() to submit the job to the resource manager.
    • 5a. Start container and 5b. Launch MRAppMaster.
      - Decide: run the job as an uber task?
      - A small job: fewer than 10 mappers and only 1 reducer. (See the configuration sketch below.)
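    Uber-task behaviour is driven by job configuration; a sketch using the standard property names (the thresholds mirror the slide’s “< 10 mappers, 1 reducer” rule):

      // Sketch: allow small jobs to run inside the MRAppMaster's own JVM
      // instead of requesting new containers for every task.
      import org.apache.hadoop.conf.Configuration;

      public class UberTaskExample {
        static void configure(Configuration conf) {
          conf.setBoolean("mapreduce.job.ubertask.enable", true);
          conf.setInt("mapreduce.job.ubertask.maxmaps", 9);     // "< 10 mappers"
          conf.setInt("mapreduce.job.ubertask.maxreduces", 1);  // "1 reducer"
        }
      }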
    • Allocate containers for tasks (step 8).
      - Memory requirements are specified per task (unlike classic MapReduce).
      - Minimum allocation: 1024 MB = 1 GB; maximum allocation: 10240 MB = 10 GB. (See the sketch below.)
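    The per-task memory requirement is likewise a plain configuration value; a sketch using the standard property names (2048 MB is only an example figure):

      // Sketch: request container memory per task type; the scheduler fits
      // the request into the cluster's configured min/max allocation range.
      import org.apache.hadoop.conf.Configuration;

      public class TaskMemoryExample {
        static void configure(Configuration conf) {
          conf.setInt("mapreduce.map.memory.mb", 2048);     // memory per map container
          conf.setInt("mapreduce.reduce.memory.mb", 2048);  // memory per reduce container
        }
      }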
    • The container is started by calling the NodeManager (step 9a), which launches the child JVM, YarnChild (step 9b).
    • [Diagram: role comparison. In YARN, the MRAppMaster and a child JVM run each task; in classic MapReduce, the JobTracker, a TaskTracker, and a child JVM fill those roles]
    • The ResourceManager is designed with a checkpoint mechanism to save its state.
      - The state consists of the node managers in the system as well as the running applications.
      - The amount of state to be stored is much smaller (and therefore more manageable) than in classic MapReduce.
    • I. An Introduction: Weather Dataset II. Scaling Out: MapReduce for Large Inputs III. Anatomy of a MapReduce Job IV. MapReduce 2: YARN V. Interesting Topics
    • MapReduce is inherently long-running and batch-oriented.
      - Hive and Pig translate queries into MapReduce jobs, and are therefore non-ad-hoc and high-latency.
      - Google Dremel does not use the MapReduce framework and supports ad-hoc queries. (Note: do not confuse this with real-time streaming engines such as Storm.)
      - The future of Hive/Pig? Apache Drill.