Behm Shah Pagerank: Presentation Transcript

    • Computing PageRank Using Hadoop (+ Introduction to MapReduce). Alexander Behm, Ajey Shah, University of California, Irvine. Instructor: Prof. Chen Li
    • Outline
      • Introduction to MapReduce
        • Motivation + Goals
        • MapReduce Paradigm + Example
      • Introduction to Hadoop
        • Architecture
        • Our setup
      • Computing PageRank using Map Reduce
        • Link Analysis
        • Matrix Multiplication
    • Motivation for MapReduce
      • How can we process huge amounts of data quickly (think web-scale)?
        • Mainframe (one big machine)
          • expensive, one vendor, hard to scale radically, single point of failure
        • COTS Cluster (many small machines)
          • cheap components, many vendors, easy to scale
      • COTS Clusters very popular because of price and scalability
      • Main drawback is complexity of programming parallel applications on them
      • COTS = Commodity Off The Shelf
    • Motivation for MapReduce
      • What are the main challenges of programming a COTS cluster?
        • 1. Fault Tolerance (many machines → many failures)
        • 2. Transparency: how to hide underlying details of cluster
        • 3. Scheduling and load balancing
      Parallel Programming Models: a spectrum of trade-offs
      • At one end:
        • High exposure to programmer
        • Complex programming
        • High efficiency
        • (Long development time)
      • At the other end, where MapReduce sits:
        • Low exposure to programmer
        • Simple programming
        • Lower efficiency
        • (Short development time)
    • MapReduce Goals
      • Provide easy but general model for programmers to use cluster resources
      • Hide network communication (i.e. RPCs)
      • Hide storage details, file chunks are automatically distributed and replicated
      • Provide transparent fault tolerance
        • Failed tasks are automatically rescheduled on live nodes
      • High throughput and automatic load balancing
        • E.g. scheduling tasks on nodes that already have data
      • RPC = Remote Procedure Call
    • MapReduce is NOT…
      • An operating system
      • A programming language
      • Meant for online processing
      • Hadoop (it is an implementation of MapReduce)
      MapReduce is a programming paradigm!
    • MapReduce Flow
      • Split the input into Key-Value pairs; for each K-V pair, call Map.
      • Each Map produces a new set of K-V pairs.
      • The framework sorts and groups the Map output by key.
      • For each distinct key, call Reduce(K, V[ ]); this produces one K-V pair per distinct key.
      • The output is again a set of Key-Value pairs.
    • MapReduce WordCount Example
      • Input: a file containing words, e.g. the lines “Hello World Bye World”, “Hello Hadoop Bye Hadoop”, “Bye Hadoop Hello Hadoop”
      • Output: the number of occurrences of each word: Bye 3, Hadoop 4, Hello 3, World 2
      • How can we do this within the MapReduce framework? Basic idea: parallelize on lines in the input file!
    • MapReduce WordCount Example: Map

      Map(K, V) {
        For each word w in V
          Collect(w, 1);
      }

      • Input: (1, “Hello World Bye World”), (2, “Hello Hadoop Bye Hadoop”), (3, “Bye Hadoop Hello Hadoop”)
      • Map output: <Hello,1> <World,1> <Bye,1> <World,1>; <Hello,1> <Hadoop,1> <Bye,1> <Hadoop,1>; <Bye,1> <Hadoop,1> <Hello,1> <Hadoop,1>
    • MapReduce WordCount Example: Reduce

      Reduce(K, V[ ]) {
        Int count = 0;
        For each v in V
          count += v;
        Collect(K, count);
      }

      • Internal grouping: <Bye → 1, 1, 1>, <Hadoop → 1, 1, 1, 1>, <Hello → 1, 1, 1>, <World → 1, 1>
      • Reduce output: <Bye, 3>, <Hadoop, 4>, <Hello, 3>, <World, 2>
    • Hadoop
      • Open Source implementation of MapReduce by Apache
      • Java software framework
      • In use and supported by Yahoo!
      • Hadoop consists of the following components:
        • Processing: MapReduce
        • Storage: HDFS, HBase (modeled after Google Bigtable)
      Hadoop @ Yahoo! Some Webmap size data:
      • Number of links between pages in the index: roughly 1 trillion links
      • Size of output: over 300 TB, compressed!
      • Number of cores used to run a single MapReduce job: over 10,000
      • Raw disk used in the production cluster: over 5 Petabytes
      (source: http://developer.yahoo.com/blogs/hadoop/2008/02/yahoo-worlds-largest-production-hadoop.html)
    • Typical Hadoop Setup
    • Our Hadoop Setup
      • MASTER: peach (NameNode, JobTracker, TaskTracker, DataNode)
      • SLAVES: watermelon, cherry, avocado, blueberry (each running a DataNode and a TaskTracker)
      • All nodes connected via a switch
    • Our Hadoop Setup Demo: Hadoop Admin Pages!
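    As an aside on how such a cluster is wired together in Hadoop configurations of that era: every node points at the master in conf/hadoop-site.xml, and the slave host names go in conf/slaves. The sketch below is hypothetical; the port numbers are assumptions, not taken from the slides.

      <!-- conf/hadoop-site.xml (hypothetical values; peach is the master) -->
      <configuration>
        <property>
          <name>fs.default.name</name>        <!-- NameNode address -->
          <value>hdfs://peach:9000</value>
        </property>
        <property>
          <name>mapred.job.tracker</name>     <!-- JobTracker address -->
          <value>peach:9001</value>
        </property>
      </configuration>

    conf/slaves would then simply list watermelon, cherry, avocado, and blueberry, one host name per line.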
    • Storage: HDFS
      • Single NameNode: manages metadata and block placement
      • DataNode: stores blocks
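    To illustrate the division of labor between NameNode and DataNodes, here is a minimal client sketch using Hadoop's Java FileSystem API: the NameNode resolves the path metadata, while the file's blocks land on DataNodes. This is an illustration, not code from the slides; paths come from the command line.

      import org.apache.hadoop.conf.Configuration;
      import org.apache.hadoop.fs.FileSystem;
      import org.apache.hadoop.fs.Path;

      // Copy a local file into HDFS. The NameNode handles the metadata;
      // the file's blocks are distributed and replicated across DataNodes.
      public class HdfsPut {
        public static void main(String[] args) throws Exception {
          Configuration conf = new Configuration(); // picks up the cluster config
          FileSystem fs = FileSystem.get(conf);     // talks to the NameNode
          fs.copyFromLocalFile(new Path(args[0]),   // local source
                               new Path(args[1])); // HDFS destination
        }
      }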
    • Job Execution Diagram: the application submits a job to the JobTracker, which hands tasks to the TaskTrackers; to the user, Hadoop is a black box.
    • Processing: Hadoop MapReduce
      • Input -> Map -> Shuffle -> Reduce -> Output
    • Using Hadoop To Program: your Map(…) and Reduce(…) classes extend MapReduceBase and implement the Mapper and Reducer interfaces.
    • Sample Map Class

      public static class Map extends MapReduceBase
          implements Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private Text word = new Text();

        public void map(LongWritable key, Text value,
                        OutputCollector<Text, IntWritable> output,
                        Reporter reporter) throws IOException {
          String line = value.toString();
          StringTokenizer tokenizer = new StringTokenizer(line);
          while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
          }
        }
      }
    • Sample Reduce Class

      public static class Reduce extends MapReduceBase
          implements Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterator<IntWritable> values,
                           OutputCollector<Text, IntWritable> output,
                           Reporter reporter) throws IOException {
          int sum = 0;
          while (values.hasNext()) {
            sum += values.next().get();
          }
          output.collect(key, new IntWritable(sum));
        }
      }
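    The slides stop at the Map and Reduce classes; for completeness, a job driver in the same old org.apache.hadoop.mapred API would look roughly like the sketch below. The class name and argument handling are illustrative assumptions.

      import org.apache.hadoop.fs.Path;
      import org.apache.hadoop.io.IntWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapred.*;

      // Hypothetical driver wiring the sample Map and Reduce classes into a job.
      public class WordCount {
        public static void main(String[] args) throws Exception {
          JobConf conf = new JobConf(WordCount.class);
          conf.setJobName("wordcount");

          conf.setOutputKeyClass(Text.class);          // reduce output key type
          conf.setOutputValueClass(IntWritable.class); // reduce output value type

          conf.setMapperClass(Map.class);              // the sample classes above
          conf.setReducerClass(Reduce.class);

          FileInputFormat.setInputPaths(conf, new Path(args[0]));  // input in HDFS
          FileOutputFormat.setOutputPath(conf, new Path(args[1])); // must not exist yet

          JobClient.runJob(conf);                      // submit and wait
        }
      }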
    • Running a Job Demo: Show WordCount Example
    • Project5: PageRank on Hadoop
    • Link Analysis: Crawled Pages → Link Extractor → Output
      • Output file format: #|colNum| NumOfRows| <R,val>…..<R,val>| #.....
    • PageRank on MapReduce: Very Basic PageRank Algorithm

      Input: PageRankVector, DistributionMatrix
      ComputePageRank {
        Until converged {
          PageRankVector = DistributionMatrix * PageRankVector;
        }
      }
      Output: PageRankVector
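    As a minimal single-machine sketch of this loop (before it is distributed with MapReduce), the iteration plus one way to handle the "determine convergence" challenge listed next could look as follows. The dense matrix, the L1 convergence test, and the eps threshold are illustrative assumptions; damping/teleportation is omitted, as in the slide.

      // Basic power iteration: rank = M * rank until the change is small.
      public class SimplePageRank {
        static double[] computePageRank(double[][] m, double[] rank, double eps) {
          while (true) {
            double[] next = new double[rank.length];
            for (int row = 0; row < m.length; row++) {      // next = M * rank
              for (int col = 0; col < m[row].length; col++) {
                next[row] += m[row][col] * rank[col];
              }
            }
            double diff = 0;                                // L1 distance between iterations
            for (int i = 0; i < rank.length; i++) {
              diff += Math.abs(next[i] - rank[i]);
            }
            rank = next;
            if (diff < eps) {
              return rank;                                  // converged
            }
          }
        }
      }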
      • Challenges
      • Storage of matrix and vector
      • Parallel matrix multiplication
      • Determine convergence
      • Implementation on Hadoop
    • PageRank on MapReduce: Why is storage a challenge?
      • UCI domain: 500000 pages, assuming 4 Bytes per entry
      • Size of vector: 500000 * 4 = 2000000 Bytes = 2MB
      • Size of matrix: 500000 * 500000 * 4 = 10^12 Bytes = 1TB
      • This assumes a fully connected graph. Clearly this is very unrealistic for web pages!
      • Solution: a sparse matrix. But row-wise or column-wise? It depends on usage patterns (i.e. how we do parallel matrix multiplication, updating of the matrix, etc.)
    • PageRank on MapReduce: Parallel Matrix Multiplication
      • Requirement: make it work! A simple but practical solution:
      • M x V: every row of M is “combined” with V, yielding one element of M x V each
      • Intuition: parallelize on rows, so each parallel task computes one final value
      • Use a row-wise sparse matrix, so the above can be done easily! (column-wise is actually better for PageRank)
    • PageRank on MapReduce: Row-Wise Sparse Matrix
      • The original 6 x 6 link matrix is stored row-wise: each row keeps only the (column, value) pairs of its nonzero entries, e.g. row 1 of the example matrix becomes the single pair (5, 1)
      • New storage requirements for the UCI domain: 500000 pages, assuming 4 Bytes per entry and at most 100 outgoing links per page
      • Size of matrix: 500000 * 100 * (4 + 4) = 400 * 10^6 Bytes = 400MB
      • Notice: no more random access!
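    As a concrete (hypothetical) rendering of this layout, a sparse row can be a list of (column, value) pairs keyed by its row number; the class and field names below are invented for illustration.

      import java.util.ArrayList;
      import java.util.List;

      // Row-wise sparse representation: only nonzero entries are stored.
      public class SparseRow {
        static class Element {
          final int columnNumber;
          final double value;
          Element(int columnNumber, double value) {
            this.columnNumber = columnNumber;
            this.value = value;
          }
        }

        final int rowNumber;
        final List<Element> elements = new ArrayList<Element>();

        SparseRow(int rowNumber) {
          this.rowNumber = rowNumber;
        }
      }

    Row 1 of the slide's example matrix would then be a SparseRow holding the single Element (5, 1).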
    • PageRank on MapReduce: Map-Reduce procedures for parallel matrix*vector multiplication using a row-wise sparse matrix

      Map(Key, Row) {
        Vector v = getVector();
        Int sum = 0;
        For each Element e in Row
          sum += e.value * v.at(e.columnNumber);
        collect(Key, sum);
      }

      Reduce(Key, Value) {
        collect(Key, Value);
      }
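    A sketch of this Map as an actual old-API Hadoop mapper follows. It assumes each input record carries a row id as its key and the row's nonzero entries serialized as "col,val col,val …" as its value, and that the current rank vector is small enough to load into memory in configure(); loadVector() is a hypothetical helper, not a Hadoop API.

      import java.io.IOException;
      import org.apache.hadoop.io.DoubleWritable;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapred.*;

      // One matrix*vector iteration: each map call computes one element of M * v.
      public class MultiplyMap extends MapReduceBase
          implements Mapper<LongWritable, Text, LongWritable, DoubleWritable> {

        private double[] vector;  // current PageRank vector, kept in memory

        public void configure(JobConf job) {
          vector = loadVector(job);  // assumption: reads the vector from a side file
        }

        public void map(LongWritable rowId, Text row,
                        OutputCollector<LongWritable, DoubleWritable> output,
                        Reporter reporter) throws IOException {
          double sum = 0;
          for (String entry : row.toString().split(" ")) {  // each "col,val" pair
            String[] parts = entry.split(",");
            sum += Double.parseDouble(parts[1]) * vector[Integer.parseInt(parts[0])];
          }
          output.collect(rowId, new DoubleWritable(sum));   // one element of M * v
        }

        private double[] loadVector(JobConf job) {
          throw new UnsupportedOperationException("illustrative placeholder");
        }
      }

    The identity Reduce from the slide corresponds to Hadoop's built-in IdentityReducer.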
    • Matrix Vector Multiplication Demo: Show Matrix-Vector Multiplication
    • Hadoop: Implementing Your Own File Format
      • InputFormat
        • Splits an HDFS file into chunks (“InputSplits”)
        • Byte oriented
        • Provides the appropriate RecordReader
      • InputSplit: filename, start offset, end offset, hosts in HDFS
      • RecordReader
        • Record oriented
        • Extracts records from an InputSplit
      • Each InputSplit gets its own RecordReader, which feeds records to a Map task
    • Flow
      • FileInputFormat implements InputFormat: getSplits(), getRecordReader()
      • TextInputFormat implements InputFormat: getSplits(), getRecordReader()
      • FileSplit implements InputSplit: file, start offset, end offset, hosts where chunks of the file live
      • LineRecordReader implements RecordReader: next(Key, Value); one is created for each split
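    To make this flow concrete, here is a hedged sketch of a custom file format in the old org.apache.hadoop.mapred API. The record layout ("rowId|rest of record") and all class names are invented for illustration; splitting and byte-level record extraction are delegated to FileInputFormat and the stock LineRecordReader.

      import java.io.IOException;
      import org.apache.hadoop.io.LongWritable;
      import org.apache.hadoop.io.Text;
      import org.apache.hadoop.mapred.*;

      // Custom InputFormat: inherits getSplits() from FileInputFormat and
      // supplies a RecordReader that turns each line into a (rowId, rest) record.
      public class RowRecordInputFormat extends FileInputFormat<Text, Text> {

        public RecordReader<Text, Text> getRecordReader(InputSplit split, JobConf job,
                                                        Reporter reporter) throws IOException {
          return new RowRecordReader(job, (FileSplit) split);
        }

        static class RowRecordReader implements RecordReader<Text, Text> {
          private final LineRecordReader lines;  // does the actual byte-level work
          private final LongWritable offset = new LongWritable();
          private final Text line = new Text();

          RowRecordReader(JobConf job, FileSplit split) throws IOException {
            lines = new LineRecordReader(job, split);
          }

          public boolean next(Text key, Text value) throws IOException {
            if (!lines.next(offset, line)) {
              return false;                      // end of this InputSplit
            }
            String[] parts = line.toString().split("\\|", 2);  // "rowId|rest"
            key.set(parts[0]);
            value.set(parts.length > 1 ? parts[1] : "");
            return true;
          }

          public Text createKey() { return new Text(); }
          public Text createValue() { return new Text(); }
          public long getPos() throws IOException { return lines.getPos(); }
          public float getProgress() throws IOException { return lines.getProgress(); }
          public void close() throws IOException { lines.close(); }
        }
      }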