HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters

An increasing number of popular applications are data-intensive in nature. In the past decade, the World Wide Web has been adopted as an ideal platform for developing data-intensive applications, since the communication paradigm of the Web is sufficiently open and powerful. Data-intensive applications like data mining and web indexing need to access ever-expanding data sets ranging from a few gigabytes to several terabytes or even petabytes. Google leverages the MapReduce model to process approximately twenty petabytes of data per day in a parallel fashion. In this talk, we introduce Google's MapReduce framework for processing huge datasets on large clusters. We first outline the motivations of the MapReduce framework. Then, we describe the dataflow of MapReduce. Next, we show a couple of example applications of MapReduce. Finally, we present our research project on the Hadoop Distributed File System.

The current Hadoop implementation assumes that computing nodes in a cluster are homogeneous in nature. Data locality has not been taken into account for launching speculative map tasks, because it is assumed that most maps are data-local. Unfortunately, both the homogeneity and data-locality assumptions are not satisfied in virtualized data centers. We show that ignoring the data-locality issue in heterogeneous environments can noticeably reduce MapReduce performance. In this paper, we address the problem of how to place data across nodes so that each node has a balanced data processing load. Given a data-intensive application running on a Hadoop MapReduce cluster, our data placement scheme adaptively balances the amount of data stored in each node to achieve improved data-processing performance. Experimental results on two real data-intensive applications show that our data placement strategy can always improve MapReduce performance by rebalancing data across nodes before running a data-intensive application in a heterogeneous Hadoop cluster.

  • Cite here: What is Hadoop? http://www.cloudera.com/what-is-hadoop/ Hadoop is an open-source project administered by the Apache Software Foundation. Hadoop's contributors work for some of the world's biggest technology companies. That diverse, motivated community has produced a genuinely innovative platform for consolidating, combining, and understanding data. Technically, Hadoop consists of two key services: reliable data storage using the Hadoop Distributed File System (HDFS) and high-performance parallel data processing using a technique called MapReduce. Hadoop runs on a collection of commodity, shared-nothing servers. You can add or remove servers in a Hadoop cluster at will; the system detects and compensates for hardware or system problems on any server. Hadoop, in other words, is self-healing. It can deliver data, and run large-scale, high-performance processing jobs, in spite of system changes or failures.
  • Slow down; add an animation of a speculative task helping. Note: Mention that we tested this on a heterogeneous Hadoop cluster.
  • Copy page 7 to here. Note: Please add a legend to this diagram (e.g., black bars represent ******, red bars represent *****). Show data movement (migration).
  • Real results for the aforementioned motivational example.
  • Note: Explain the definition of the computing ratio.
  • Note: Name the over-utilized node list and the under-utilized node list on the diagram (a->A, b->B).
  • The heterogeneity measurement of a cluster depends on data-intensive applications. If multiple MapReduce applications must process the same input file, the data placement mechanism may need to distribute the input file's fragments in several ways - one for each MapReduce application. In the case where multiple applications are similar in terms of data processing speed, one data placement decision may fit the needs of all the applications.
  • Note: improve the resolution of the figures
  • Title: Application dependence; independence of data size. Note 1: improve the quality of the figures (larger figures and larger legends). Note 2:
  • number

HDFS-HC: A Data Placement Module for Heterogeneous Hadoop Clusters (Presentation Transcript)

  • MapReduce: Simplified Data Processing on Large Clusters 02/28/11 Xiao Qin Department of Computer Science and Software Engineering Auburn University http://www.eng.auburn.edu/~xqin [email_address] Slides 2-20 are adapted from notes by Subbarao Kambhampati (ASU), Dan Weld (U. Washington), Jeff Dean, Sanjay Ghemawat, (Google, Inc.)
  • Motivation
    • Large-Scale Data Processing
      • Want to use 1000s of CPUs
        • But don't want the hassle of managing things
    • MapReduce provides
      • Automatic parallelization & distribution
      • Fault tolerance
      • I/O scheduling
      • Monitoring & status updates
  • Map/Reduce
    • Map/Reduce
      • Programming model from Lisp
      • (and other functional languages)
    • Many problems can be phrased this way
    • Easy to distribute across nodes
    • Nice retry/failure semantics
  • Distributed Grep [figure: very big data -> split data (x4) -> grep on each split -> matches -> cat -> all matches]
  • Distributed Word Count [figure: very big data -> split data (x4) -> count on each split -> partial counts -> merge -> merged count]
  • Map Reduce
    • Map:
      • Accepts input key/value pair
      • Emits intermediate key/value pair
    • Reduce:
      • Accepts intermediate key/value* pair
      • Emits output key/value pair
    [figure: very big data -> MAP -> partitioning function -> REDUCE -> result]
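
To make the dataflow above concrete, here is a minimal single-process sketch of the model in plain Java (this is not Hadoop's API; the class name TinyMapReduce, the method runMapReduce, and the use of BiFunction are illustrative choices):

    import java.util.ArrayList;
    import java.util.HashMap;
    import java.util.List;
    import java.util.Map;
    import java.util.function.BiFunction;

    public class TinyMapReduce {

        // map(k1, v1) -> list of intermediate (k2, v2) pairs;
        // reduce(k2, list of v2) -> one result per unique intermediate key.
        public static <K1, V1, K2, V2, R> Map<K2, R> runMapReduce(
                Map<K1, V1> input,
                BiFunction<K1, V1, List<Map.Entry<K2, V2>>> map,
                BiFunction<K2, List<V2>, R> reduce) {

            // Map phase: apply map() to every input record and group the
            // intermediate pairs by intermediate key (the "shuffle").
            Map<K2, List<V2>> groups = new HashMap<>();
            for (Map.Entry<K1, V1> record : input.entrySet()) {
                for (Map.Entry<K2, V2> pair : map.apply(record.getKey(), record.getValue())) {
                    groups.computeIfAbsent(pair.getKey(), k -> new ArrayList<>()).add(pair.getValue());
                }
            }

            // Reduce phase: one reduce() call per unique intermediate key.
            Map<K2, R> output = new HashMap<>();
            for (Map.Entry<K2, List<V2>> group : groups.entrySet()) {
                output.put(group.getKey(), reduce.apply(group.getKey(), group.getValue()));
            }
            return output;
        }

        public static void main(String[] args) {
            // Word count expressed in the toy model: (url, contents) in, (word, count) out.
            Map<String, String> docs = Map.of("doc1", "see bob throw", "doc2", "see spot run");

            Map<String, Integer> counts =
                    TinyMapReduce.<String, String, String, Integer, Integer>runMapReduce(
                            docs,
                            (url, contents) -> {
                                List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
                                for (String w : contents.split("\\s+")) {
                                    pairs.add(Map.entry(w, 1));   // emit (word, 1)
                                }
                                return pairs;
                            },
                            (word, ones) -> ones.size());         // all values are 1, so count them

            System.out.println(counts);   // e.g. {bob=1, see=2, run=1, spot=1, throw=1}
        }
    }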
  • Map in Lisp (Scheme)
    • (map f list [list2 list3 ...])
    • (map square '(1 2 3 4))
      • (1 4 9 16)
    • (reduce + '(1 4 9 16))
      • (+ 16 (+ 9 (+ 4 1)))
      • 30
    • (reduce + (map square (map - l1 l2)))
    (square is a unary operator; + is a binary operator)
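
The same two building blocks exist outside Lisp. As a rough, illustrative rendering of the Scheme calls above in Java streams (the values in l1 and l2 are made up for the example):

    import java.util.List;
    import java.util.stream.Collectors;
    import java.util.stream.IntStream;

    public class LispStyle {
        public static void main(String[] args) {
            // (map square '(1 2 3 4))  =>  (1 4 9 16)
            List<Integer> squares = IntStream.of(1, 2, 3, 4)
                    .map(x -> x * x)
                    .boxed()
                    .collect(Collectors.toList());
            System.out.println(squares);   // [1, 4, 9, 16]

            // (reduce + '(1 4 9 16))  =>  30
            int sum = IntStream.of(1, 4, 9, 16).reduce(0, Integer::sum);
            System.out.println(sum);       // 30

            // (reduce + (map square (map - l1 l2)))  =>  sum of squared element-wise differences
            int[] l1 = {4, 6, 8};   // made-up values for illustration
            int[] l2 = {1, 2, 3};
            int sumOfSquaredDiffs = IntStream.range(0, l1.length)
                    .map(i -> l1[i] - l2[i])   // (map - l1 l2)
                    .map(d -> d * d)           // (map square ...)
                    .sum();                    // (reduce + ...)
            System.out.println(sumOfSquaredDiffs);   // 3^2 + 4^2 + 5^2 = 50
        }
    }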
  • Map/Reduce ala Google
    • map(key, val) is run on each item in set
      • emits new-key / new-val pairs
    • reduce(key, vals) is run for each unique key emitted by map()
      • emits final output
  • count words in docs
      • Input consists of (url, contents) pairs
      • map(key=url, val=contents):
        • For each word w in contents, emit (w, "1")
      • reduce(key=word, values=uniq_counts):
        • Sum all "1"s in values list
        • Emit result "(word, sum)"
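
This pseudocode corresponds closely to the canonical Hadoop WordCount program. The sketch below is written against the org.apache.hadoop.mapreduce API, which is newer than the API in use when these slides were prepared; treat it as an illustration of the mapper/reducer pair rather than the exact code used in this work:

    import java.io.IOException;
    import java.util.StringTokenizer;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

        // map(key=offset, val=line): for each word w in the line, emit (w, 1).
        public static class TokenizerMapper extends Mapper<Object, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            public void map(Object key, Text value, Context context)
                    throws IOException, InterruptedException {
                StringTokenizer itr = new StringTokenizer(value.toString());
                while (itr.hasMoreTokens()) {
                    word.set(itr.nextToken());
                    context.write(word, ONE);
                }
            }
        }

        // reduce(key=word, values=counts): sum the counts and emit (word, sum).
        public static class IntSumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
            private final IntWritable result = new IntWritable();

            @Override
            public void reduce(Text key, Iterable<IntWritable> values, Context context)
                    throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable val : values) {
                    sum += val.get();
                }
                result.set(sum);
                context.write(key, result);
            }
        }

        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);   // local pre-aggregation of (word, 1) pairs
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            FileOutputFormat.setOutputPath(job, new Path(args[1]));
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }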
  • Count, Illustrated
      • map(key=url, val=contents):
        • For each word w in contents, emit (w, "1")
      • reduce(key=word, values=uniq_counts):
        • Sum all "1"s in values list
        • Emit result "(word, sum)"
      • Input: "see bob throw", "see spot run"
      • Map output: (see, 1), (bob, 1), (throw, 1), (see, 1), (spot, 1), (run, 1)
      • Reduce output: (bob, 1), (run, 1), (see, 2), (spot, 1), (throw, 1)
  • Grep
      • Input consists of (url+offset, single line)
      • map(key=url+offset, val=line):
        • If contents matches regexp, emit (line, "1")
      • reduce(key=line, values=uniq_counts):
        • Don't do anything; just emit line
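
A sketch of the two grep functions in the same Hadoop API. Hadoop's bundled Grep example is built from a regex mapper and a sum reducer; the simplified classes below, and the grep.pattern configuration key, are illustrative only:

    import java.io.IOException;
    import java.util.regex.Pattern;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map: if the line matches the regexp, emit (line, 1).
    class GrepMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private Pattern pattern;

        @Override
        protected void setup(Context context) {
            // "grep.pattern" is an illustrative configuration key, not a Hadoop built-in.
            pattern = Pattern.compile(context.getConfiguration().get("grep.pattern", ".*"));
        }

        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            if (pattern.matcher(line.toString()).find()) {
                context.write(line, ONE);
            }
        }
    }

    // Reduce: do nothing with the counts; just emit each matching line once.
    class GrepReducer extends Reducer<Text, IntWritable, Text, NullWritable> {
        @Override
        protected void reduce(Text line, Iterable<IntWritable> counts, Context context)
                throws IOException, InterruptedException {
            context.write(line, NullWritable.get());
        }
    }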
  • Reverse Web-Link Graph
    • Map
      • For each URL linking to target, …
      • Output <target, source> pairs
    • Reduce
      • Concatenate list of all source URLs
      • Outputs: <target, list (source)> pairs
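
A sketch of the same pattern for link reversal. The input is assumed to be (source URL, page contents) pairs, and extractLinks() is a hypothetical placeholder for real link extraction; class names are illustrative:

    import java.io.IOException;
    import java.util.List;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    // Map: for each URL that the source page links to, emit <target, source>.
    class LinkReverseMapper extends Mapper<Text, Text, Text, Text> {
        @Override
        protected void map(Text source, Text contents, Context context)
                throws IOException, InterruptedException {
            for (String target : extractLinks(contents.toString())) {   // hypothetical helper
                context.write(new Text(target), source);
            }
        }

        private List<String> extractLinks(String html) {
            // Placeholder: a real implementation would parse href attributes.
            return List.of();
        }
    }

    // Reduce: concatenate all sources that link to a given target,
    // producing <target, list(source)> pairs.
    class LinkReverseReducer extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text target, Iterable<Text> sources, Context context)
                throws IOException, InterruptedException {
            StringBuilder list = new StringBuilder();
            for (Text source : sources) {
                if (list.length() > 0) list.append(',');
                list.append(source.toString());
            }
            context.write(target, new Text(list.toString()));
        }
    }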
  • Model is Widely Applicable: MapReduce programs in the Google source tree. Example uses: distributed grep, distributed sort, web link-graph reversal, term-vector per host, web access log stats, inverted index construction, document clustering, machine learning, statistical machine translation, ...
  • Implementation Overview
    • Typical cluster:
          • 100s/1000s of 2-CPU x86 machines, 2-4 GB of memory
          • Limited bisection bandwidth
          • Storage is on local IDE disks
          • GFS: distributed file system manages data (SOSP'03)
          • Job scheduling system: jobs made up of tasks, scheduler assigns tasks to machines
    • Implementation is a C++ library linked into user programs
  • Execution
    • How is this distributed?
      • Partition input key/value pairs into chunks, run map() tasks in parallel
      • After all map()s are complete, consolidate all emitted values for each unique emitted key
      • Now partition space of output map keys, and run reduce() in parallel
    • If map() or reduce() fails, reexecute!
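
The step that partitions the space of output map keys usually reduces to hashing each key modulo the number of reduce tasks, which is essentially what Hadoop's default HashPartitioner does. A minimal sketch of that idea:

    // Decide which reduce task receives a given intermediate key.
    // Masking with Integer.MAX_VALUE keeps the result non-negative
    // even when hashCode() is negative.
    public final class SimpleHashPartitioner {
        public static int partitionFor(Object key, int numReduceTasks) {
            return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
        }

        public static void main(String[] args) {
            // All occurrences of the same word land in the same reduce task.
            System.out.println(partitionFor("see", 4));
            System.out.println(partitionFor("spot", 4));
        }
    }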
  • Job Processing [figure: a JobTracker dispatching a "grep" job to TaskTracker 0 through TaskTracker 5]
    • Client submits "grep" job, indicating code and input files
    • JobTracker breaks input file into k chunks (in this case 6) and assigns work to TaskTrackers
    • After map(), TaskTrackers exchange map output to build the reduce() keyspace
    • JobTracker breaks reduce() keyspace into m chunks (in this case 6) and assigns work
    • reduce() output may go to NDFS
  • Execution
  • Parallel Execution
  • Task Granularity & Pipelining
    • Fine granularity tasks: map tasks >> machines
      • Minimizes time for fault recovery
      • Can pipeline shuffling with map execution
      • Better dynamic load balancing
    • Often use 200,000 map & 5000 reduce tasks
    • Running on 2000 machines
  • MapReduce outside Google
    • Hadoop (Java)
      • Emulates MapReduce and GFS
    • The architecture of Hadoop MapReduce and DFS is master/slave
    MapReduce: master = JobTracker, slave = TaskTracker; DFS: master = NameNode, slave = DataNode
  • Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters Download Software at: http://www.eng.auburn.edu/~xqin/software/hdfs-hc This HDFS-HC tool was described in our paper, Improving MapReduce Performance through Data Placement in Heterogeneous Hadoop Clusters, by J. Xie, S. Yin, X.-J. Ruan, Z.-Y. Ding, Y. Tian, J. Majors, and X. Qin, published in Proc. 19th Int'l Heterogeneity in Computing Workshop, Atlanta, Georgia, April 2010.
  • Hadoop Overview (J. Dean and S. Ghemawat. Mapreduce: Simplified data processing on large clusters. OSDI ’04, pages 137–150, 2008)
  • One time setup
    • set hadoop-site.xml and slaves
    • Initiate namenode
    • Run Hadoop MapReduce and DFS
    • Upload your data to DFS
    • Run your process…
    • Download your data from DFS
  • Hadoop Distributed File System (http://lucene.apache.org/hadoop)
  • Motivational Example [figure: task timeline in minutes for Node A (fast, 1 task/min), Node B (2x slower), and Node C (3x slower)]
  • The Native Strategy [figure: loading, transferring, and processing time for Nodes A, B, and C under the default placement; task labels 3, 2, and 6]
  • Our Solution: reducing data transfer time [figure: the same workload on Nodes A', B', and C' after placement by computing ratio; task labels 3, 2, and 6]
  • Preliminary Results: impact of data placement on the performance of Grep
  • Challenges  
    • Does computing ratio depend on the application?
    • Initial data distribution
    • Data skew problem
      • New data arrival
      • Data deletion  
      • New joining node
      • Data updating
  • Measure Computing Ratios
    • Computing ratio
    • Fast machines process large data sets
    [figure: task timeline for Node A (1 task/min), Node B (2x slower), and Node C (3x slower)]
  • Steps to Measure Computing Ratios
    • 1. Run the application on each node with the same amount of data and collect each node's response time
    • 2. Set the ratio of the shortest response time to 1, and set the ratios of the other nodes accordingly
    • 3. Calculate the least common multiple of these ratios
    • 4. Count the portion (number of file fragments) of each node
    Node   | Response time (s) | Ratio | # of file fragments | Speed
    Node A | 10                | 1     | 6                   | Fastest
    Node B | 20                | 2     | 3                   | Average
    Node C | 30                | 3     | 2                   | Slowest
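
As a worked version of these steps, using the same numbers as the table (response times of 10 s, 20 s, and 30 s give ratios 1, 2, 3, a least common multiple of 6, and portions of 6, 3, and 2 fragments), here is an illustrative sketch that assumes the ratios come out as integers:

    import java.util.LinkedHashMap;
    import java.util.Map;

    // Sketch of the computing-ratio measurement: normalize each node's response
    // time by the fastest node, then use the least common multiple of the ratios
    // to derive how many file fragments each node should hold.
    public class ComputingRatios {

        static long gcd(long a, long b) { return b == 0 ? a : gcd(b, a % b); }
        static long lcm(long a, long b) { return a / gcd(a, b) * b; }

        public static void main(String[] args) {
            // Measured response times (seconds) for the same-sized input on each node.
            Map<String, Long> responseTime = new LinkedHashMap<>();
            responseTime.put("Node A", 10L);
            responseTime.put("Node B", 20L);
            responseTime.put("Node C", 30L);

            long fastest = responseTime.values().stream().min(Long::compare).orElseThrow();

            // Step 2: the fastest node gets ratio 1; others scale accordingly
            // (assumes the times divide evenly, as in the slide's example).
            Map<String, Long> ratio = new LinkedHashMap<>();
            responseTime.forEach((node, t) -> ratio.put(node, t / fastest));

            // Step 3: least common multiple of all ratios.
            long lcmOfRatios = ratio.values().stream().reduce(1L, ComputingRatios::lcm);

            // Step 4: portion (number of fragments) of each node = LCM / ratio.
            ratio.forEach((node, r) ->
                    System.out.printf("%s: ratio=%d, fragments=%d%n", node, r, lcmOfRatios / r));
            // Node A: ratio=1, fragments=6; Node B: ratio=2, fragments=3; Node C: ratio=3, fragments=2
        }
    }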
  • Initial Data Distribution [figure: the namenode assigns the twelve fragments of File1 (1-9, a-c) to datanodes A, B, and C in the portion 3:2:1]
    • Input files are split into 64 MB blocks
    • Round-robin data distribution algorithm
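
One way to realize a 3:2:1 portion with round-robin passes over the node list is sketched below; the class name and data structures are illustrative and are not the actual HDFS-HC implementation:

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Weighted round-robin assignment of file fragments to datanodes.
    // Each pass over the node list hands a node at most `portion` fragments,
    // so faster nodes accumulate proportionally more data.
    public class RoundRobinPlacement {

        public static Map<String, List<Integer>> place(int numFragments, Map<String, Integer> portion) {
            Map<String, List<Integer>> assignment = new LinkedHashMap<>();
            portion.keySet().forEach(n -> assignment.put(n, new ArrayList<>()));

            int next = 0;
            while (next < numFragments) {
                for (Map.Entry<String, Integer> e : portion.entrySet()) {
                    for (int i = 0; i < e.getValue() && next < numFragments; i++) {
                        assignment.get(e.getKey()).add(next++);   // fragment IDs 0..numFragments-1
                    }
                }
            }
            return assignment;
        }

        public static void main(String[] args) {
            // 12 fragments, portion A:B:C = 3:2:1  ->  A gets 6, B gets 4, C gets 2.
            Map<String, Integer> portion = new LinkedHashMap<>();
            portion.put("A", 3);
            portion.put("B", 2);
            portion.put("C", 1);
            System.out.println(place(12, portion));
        }
    }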
  • Data Redistribution
    • 1. Get the network topology, the computing ratios, and the disk utilization of each node
    • 2. Build and sort two lists: an under-utilized node list L1 and an over-utilized node list L2
    • 3. Select a source node from the over-utilized list and a destination node from the under-utilized list
    • 4. Transfer data from the source to the destination
    • 5. Repeat steps 3 and 4 until the lists are empty
    [figure: the namenode moves fragments from over-utilized datanodes to under-utilized datanodes to reach the portion 3:2:1]
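
An illustrative sketch of the redistribution loop, under the simplifying assumption that utilization is measured as the number of fragments a node holds relative to its target share. The list names L1 and L2 follow the slide; everything else is assumed:

    import java.util.ArrayDeque;
    import java.util.Deque;
    import java.util.LinkedHashMap;
    import java.util.Map;

    // Sketch of the data-redistribution pass: move fragments from over-utilized
    // nodes (holding more than their target share) to under-utilized ones until
    // every node reaches its target.
    public class Redistribution {

        public static void rebalance(Map<String, Integer> current, Map<String, Integer> target) {
            // Step 2: build the under-utilized (L1) and over-utilized (L2) node lists.
            Deque<String> underUtilized = new ArrayDeque<>();   // L1
            Deque<String> overUtilized = new ArrayDeque<>();    // L2
            for (String node : current.keySet()) {
                int diff = current.get(node) - target.get(node);
                if (diff < 0) underUtilized.add(node);
                else if (diff > 0) overUtilized.add(node);
            }

            // Steps 3-5: repeatedly pick a source from L2 and a destination from L1
            // and transfer one fragment until the lists are empty.
            while (!underUtilized.isEmpty() && !overUtilized.isEmpty()) {
                String src = overUtilized.peek();
                String dst = underUtilized.peek();

                current.merge(src, -1, Integer::sum);   // "transfer" one fragment
                current.merge(dst, 1, Integer::sum);
                System.out.printf("move 1 fragment: %s -> %s%n", src, dst);

                if (current.get(src).equals(target.get(src))) overUtilized.poll();
                if (current.get(dst).equals(target.get(dst))) underUtilized.poll();
            }
        }

        public static void main(String[] args) {
            // Even initial placement of 12 fragments vs. a 6:4:2 target (portion 3:2:1).
            Map<String, Integer> current = new LinkedHashMap<>(Map.of("A", 4, "B", 4, "C", 4));
            Map<String, Integer> target = new LinkedHashMap<>(Map.of("A", 6, "B", 4, "C", 2));
            rebalance(current, target);
        }
    }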
  • Sharing Files among Multiple Applications
    • The computing ratio depends on data-intensive applications.
      • Redistribution
      • Redundancy
  • Experimental Environment: five nodes in a heterogeneous Hadoop cluster
    Node   | CPU model        | CPU (GHz)     | L1 cache (KB)
    Node A | Intel Core 2 Duo | 2 x 1 = 2     | 204
    Node B | Intel Celeron    | 2.8           | 256
    Node C | Intel Pentium 3  | 1.2           | 256
    Node D | Intel Pentium 3  | 1.2           | 256
    Node E | Intel Pentium 3  | 1.2           | 256
  • Grep and WordCount
    • Grep is a tool searching for a regular expression in a text file
    • WordCount is a program used to count words in a text file
  • Computing ratios for two applications: computing ratios of the five nodes with respect to the Grep and WordCount applications
    Computing node | Ratio for Grep | Ratio for WordCount
    Node A         | 1              | 1
    Node B         | 2              | 2
    Node C         | 3.3            | 5
    Node D         | 3.3            | 5
    Node E         | 3.3            | 5
  • Response time of Grep and WordCount on each node [figures: application dependence; data size independence]
  • Six Data Placement Decisions
  • Impact of data placement on performance of Grep
  • Impact of data placement on performance of WordCount
  • Conclusion
    • Identified the performance degradation caused by heterogeneity.
    • Designed and implemented a data placement mechanism in HDFS.
  • Future Work
    • Data redundancy issue
    • Dynamic data distribution mechanism
    • Prefetching
  • Fellowship Program Samuel Ginn College of Engineering at Auburn University
    • Dean's Fellowship: $32,000 per year plus tuition fellowship
    • College Fellowship: $24,000 per year plus tuition fellowship
    • Departmental Fellowship: $20,000 per year plus tuition fellowship.
    • Tuition Fellowships: Tuition Fellowships provide a full tuition waiver for a student with a 25 percent or greater full-time-equivalent (FTE) assignment. Both graduate research assistants (GRAs) and graduate teaching assistants (GTAs) are eligible.
  • http://www.eng.auburn.edu/programs/grad-school/fellowship-program/
  • http://www.eng.auburn.edu
  • Download the presentation slides http://www.slideshare.net/xqin74 Google: slideshare Xiao Qin
  • Questions