CSE509 Lecture 4

Lecture 4 of CSE509: Web Science and Technology Summer Course

  • In traditional high-performance computing (HPC) applications (e.g., for climate or nuclear simulations), it is commonplace for a supercomputer to have “processing nodes” and “storage nodes” linked together by a high-capacity interconnect. Many data-intensive workloads are not very processor-demanding, which means that the separation of compute and storage creates a bottleneck in the network. As an alternative to moving data around, it is more efficient to move the processing around. That is, MapReduce assumes an architecture where processors and storage (disk) are co-located. In such a setup, we can take advantage of data locality by running code on the processor directly attached to the block of data we need. The distributed file system is responsible for managing the data over which MapReduce operates.
    Data-intensive processing by definition means that the relevant datasets are too large to fit in memory and must be held on disk. Seek times for random disk access are fundamentally limited by the mechanical nature of the devices: read heads can only move so fast and platters can only spin so rapidly. As a result, it is desirable to avoid random data access, and instead organize computations so that data are processed sequentially.
    A simple scenario illustrates the large performance gap between sequential operations and random seeks: assume a 1 terabyte database containing 10^10 100-byte records. Given reasonable assumptions about disk latency and throughput, a back-of-the-envelope calculation will show that updating 1% of the records (by accessing and then mutating each record) will take about a month on a single machine. On the other hand, if one simply reads the entire database and rewrites all the records (mutating those that need updating), the process would finish in under a work day on a single machine. Sequential data access is, literally, orders of magnitude faster than random data access.
    The development of solid-state drives is unlikely to change this balance for at least two reasons. First, the cost differential between traditional magnetic disks and solid-state disks remains substantial: large data will for the most part remain on mechanical drives, at least in the near future. Second, although solid-state disks have substantially faster seek times, order-of-magnitude differences in performance between sequential and random access still remain.
    MapReduce is primarily designed for batch processing over large datasets. To the extent possible, all computations are organized into long streaming operations that take advantage of the aggregate bandwidth of many disks in a cluster. Many aspects of MapReduce’s design explicitly trade latency for throughput.
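The terabyte-database scenario in the notes above can be checked with a few lines of arithmetic. The sketch below is only an illustration: the 10 ms seek time and the 100 MB/s transfer rate are assumed figures chosen as plausible values for a mechanical disk, and are not stated in the notes.

```python
# Back-of-the-envelope comparison: random updates vs. a full sequential rewrite.
# The two hardware figures below are illustrative assumptions, not from the notes.
SEEK_SECONDS = 0.010               # assumed: about 10 ms per random disk access
THROUGHPUT_BYTES_PER_SEC = 100e6   # assumed: about 100 MB/s sustained transfer

NUM_RECORDS = 10**10               # from the scenario: 10^10 records
RECORD_BYTES = 100                 # from the scenario: 100 bytes each (1 TB total)
UPDATED_RECORDS = NUM_RECORDS // 100   # 1% of the records


def random_update_seconds(updated_records, seek_seconds):
    """Each update needs one seek to read the record and one to write it back."""
    return updated_records * 2 * seek_seconds


def sequential_rewrite_seconds(total_bytes, throughput):
    """Read the whole database once and write it once, streaming."""
    return 2 * total_bytes / throughput


random_days = random_update_seconds(UPDATED_RECORDS, SEEK_SECONDS) / 86400
sequential_hours = sequential_rewrite_seconds(
    NUM_RECORDS * RECORD_BYTES, THROUGHPUT_BYTES_PER_SEC) / 3600

print(f"random updates:     about {random_days:.0f} days")
print(f"sequential rewrite: about {sequential_hours:.1f} hours")
```

Under these assumed figures the random updates take a little over three weeks and the sequential rewrite takes under six hours, which is consistent with the "about a month" versus "under a work day" estimate.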


  • CSE509: Introduction to Web Science and Technology
    Lecture 4: Dealing with Large-Scale Web data: Large-Scale File Systems and MapReduce
    Muhammad Atif Qureshi
    Web Science Research Group
    Institute of Business Administration (IBA)
  • Last Time…
    Search Engine Architecture
    Overview of Web Crawling
    Web Link Structure
    Ranking Problem
    SEO and Web Spam
    Web Spam Research
    July 30, 2011
  • Today
    Web Data Explosion
    Part I
    MapReduce Basics
    MapReduce Example and Details
    MapReduce Case-Study: Web Crawler based on MapReduce Architecture
    Part II
    Large-Scale File Systems
    Google File System Case-Study
  • Introduction
    Web data sets can be very large
    Tens to hundreds of terabytes
    Cannot mine on a single server (why?)
    “Big data” is a fact on the World Wide Web
    Larger data implies effective algorithms
    Web-scale processing: Data-intensive processing
    Also applies to startups and niche players
  • How Much Data?
    Google processes 20 PB a day (2008)
    Facebook has 2.5 PB of user data + 15 TB/day (4/2009)
    eBay has 6.5 PB of user data + 50 TB/day (5/2009)
    CERN’s LHC will generate 15 PB a year (??)
  • Cluster Architecture
    2-10 Gbps backbone between racks
    1 Gbps between any pair of nodes in a rack
    Each rack contains 16-64 nodes
  • Concerns
    If we had to abort and restart the computation every time one component fails, then the computation might never complete successfully
    If one node fails, all its files would be unavailable until the node is replaced
    Can also lead to permanent loss of files
    Solutions: MapReduce and Google File System
  • PART I: MapReduce
  • Major Ideas
    Scale “out”, not “up” (Distributed vs. SMP)
    Limits of SMP and large shared-memory machines
    Move processing to the data
    Clusters have limited bandwidth
    Process data sequentially, avoid random access
    Seeks are expensive, disk throughput is reasonable
    Seamless scalability
    From the traditional "mythical man-month" to the newer "tradable machine-hour"
    Twenty-one chickens together cannot make an egg hatch in a day
  • Traditional Parallelization: Divide and Conquer
  • Parallelization Challenges
    How do we assign work units to workers?
    What if we have more work units than workers?
    What if workers need to share partial results?
    How do we aggregate partial results?
    How do we know all the workers have finished?
    What if workers die?
  • Common Theme
    Parallelization problems arise from:
    Communication between workers (e.g., to exchange state)
    Access to shared resources (e.g., data)
    Thus, we need a synchronization mechanism
  • Parallelization is Hard
    Concurrency is already difficult to reason about on uniprocessor and small-scale architectures
    It is even more difficult to reason about:
    At the scale of datacenters (even across datacenters)
    In the presence of failures
    In terms of multiple interacting services
    Not to mention debugging…
    The reality:
    Write your own dedicated library, then program with it
    Burden on the programmer to explicitly manage everything
  • Solution: MapReduce
    Programming model for expressing distributed computations at a massive scale
    Hides system-level details from the developers
    No more race conditions, lock contention, etc.
    Separating the what from the how
    Developer specifies the computation that needs to be performed
    Execution framework (“runtime”) handles actual execution
  • What is MapReduce Used For?
    At Google:
    Index building for Google Search
    Article clustering for Google News
    Statistical machine translation
    At Yahoo!:
    Index building for Yahoo! Search
    Spam detection for Yahoo! Mail
    At Facebook:
    Data mining
    Ad optimization
    Spam detection
  • Typical MapReduce Execution
    Iterate over a large number of records
    Map: extract something of interest from each
    Shuffle and sort intermediate results
    Reduce: aggregate intermediate results
    Generate final output
    Key idea: provide a functional abstraction for these two operations (map and reduce)
    (Dean and Ghemawat, OSDI 2004)
  • MapReduce Basics
    Programmers specify two functions:
    map (k, v) -> <k’, v’>*
    reduce (k’, list of v’) -> <k’, v’>*
    All values with the same key are sent to the same reducer
    The execution framework handles everything else…
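One common way to make sure that all values with the same key reach the same reducer is to hash the key and take the result modulo the number of reducers. The sketch below assumes that approach; the function names `partition` and `shuffle` and the sample pairs are hypothetical and only serve the illustration.

```python
import zlib


def partition(key, num_reducers):
    """Map a key to a reducer index with a hash that is stable across processes.

    zlib.crc32 is used instead of Python's built-in hash(), which is
    randomized per process for strings.
    """
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def shuffle(pairs, num_reducers):
    """Route each (key, value) pair to the bucket of the reducer that owns the key."""
    buckets = [dict() for _ in range(num_reducers)]
    for key, value in pairs:
        buckets[partition(key, num_reducers)].setdefault(key, []).append(value)
    return buckets


# Hypothetical intermediate pairs, as if emitted by several mappers.
pairs = [("see", 1), ("bob", 1), ("see", 1), ("spot", 1)]
buckets = shuffle(pairs, num_reducers=3)
for index, bucket in enumerate(buckets):
    print(index, bucket)
```

Because the reducer index depends only on the key, both ("see", 1) pairs land in the same bucket no matter which mapper produced them.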
  • Warm Up Example: Word Count
    We have a large file of words, one word to a line
    Count the number of times each distinct word appears in the file
    Sample application: analyze web server logs to find popular URLs
  • Word Count (2)
    Case 1: Entire file fits in memory
    Case 2: File too large for mem, but all <word, count> pairs fit in mem
    Case 3: File on disk, too many distinct words to fit in memory
    sort datafile | uniq -c
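The difference between the cases can be sketched in a few lines. In this illustration `count_in_memory` covers Cases 1 and 2 (the counts fit in memory) and `count_sorted` mimics what `uniq -c` does on sorted input for Case 3. Both function names are hypothetical, and the in-memory `sorted()` call merely stands in for the external sort that `sort` would perform on a file too large for memory.

```python
from collections import Counter
from itertools import groupby


def count_in_memory(words):
    """Cases 1 and 2: keep one counter per distinct word in memory.

    Counter consumes any iterable, so the words can be streamed from a file.
    """
    return Counter(words)


def count_sorted(sorted_words):
    """Case 3: in sorted input equal words are adjacent, so a single pass
    that remembers only the current run can count each word."""
    for word, run in groupby(sorted_words):
        yield word, sum(1 for _ in run)


words = ["see", "bob", "run", "see", "spot", "throw"]
print(count_in_memory(words))
print(list(count_sorted(sorted(words))))
```

The sorted variant never holds more than one word and one running count at a time, which is why it still works when there are too many distinct words to fit in memory.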
  • Word Count (3)
    To make it slightly harder, suppose we have a large corpus of documents
    Count the number of times each distinct word occurs in the corpus
    words(docs/*) | sort | uniq -c
    where words takes a file and outputs the words in it, one to a line
    The above captures the essence of MapReduce
    Great thing is it is naturally parallelizable
  • Word Count using MapReduce
    map(key, value):
        // key: document name; value: text of document
        for each word w in value:
            emit(w, 1)
    reduce(key, values):
        // key: a word; values: an iterator over counts
        result = 0
        for each count v in values:
            result += v
        emit(key, result)
  • Word Count Illustration
    map(key=url, val=contents):
    For each word w in contents, emit (w, “1”)
    reduce(key=word, values=uniq_counts):
    Sum all “1”s in values list
    Emit result “(word, sum)”
    Input documents: “see bob run”, “see spot throw”
    Map output: (see, 1), (bob, 1), (run, 1), (see, 1), (spot, 1), (throw, 1)
    Reduce output: (bob, 1), (run, 1), (see, 2), (spot, 1), (throw, 1)
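The illustration above can be reproduced with a small single-process program. This is a minimal sketch, not how Google's library or Hadoop is implemented: `run_mapreduce` is a hypothetical stand-in for the execution framework, and the document names `doc1` and `doc2` are made up for the example.

```python
from collections import defaultdict


def map_fn(key, value):
    """key: document name; value: text of document."""
    for word in value.split():
        yield word, 1


def reduce_fn(key, values):
    """key: a word; values: an iterable of counts."""
    yield key, sum(values)


def run_mapreduce(documents, mapper, reducer):
    """Single-process stand-in for the execution framework."""
    # Map phase: apply the mapper to every input record.
    intermediate = defaultdict(list)
    for name, text in documents.items():
        for k, v in mapper(name, text):
            # Shuffle: group all values that share a key.
            intermediate[k].append(v)
    # Reduce phase: process keys in sorted order, one call per key.
    output = []
    for k in sorted(intermediate):
        output.extend(reducer(k, intermediate[k]))
    return output


documents = {"doc1": "see bob run", "doc2": "see spot throw"}
result = run_mapreduce(documents, map_fn, reduce_fn)
print(result)
# [('bob', 1), ('run', 1), ('see', 2), ('spot', 1), ('throw', 1)]
```

Only `map_fn` and `reduce_fn` are specific to word count; grouping and ordering are handled by the runner, which mirrors the division of labor between the programmer and the execution framework.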
  • Implementation Overview
    100s/1000s of 2-CPU x86 machines, 2-4 GB of memory
    Limited bandwidth
    Storage is on local IDE disks
    GFS: distributed file system manages data (SOSP'03)
    Job scheduling system: jobs made up of tasks, scheduler assigns tasks to machines
    Implementation at Google is a C++ library linked to user programs
  • Distributed Execution Overview
    [Figure: execution flow. (1) submit; (2) schedule map, schedule reduce; (3) read; (4) local write; (5) remote read; (6) write. Input: split 0 to split 4. Intermediate files (on local disk). Output: file 0, file 1.]
    Adapted from (Dean and Ghemawat, OSDI 2004)
  • MapReduce Implementations
    Google has a proprietary implementation in C++
    Bindings in Java, Python
    Hadoop is an open-source implementation in Java
    Development led by Yahoo, used in production
    Now an Apache project
    Rapidly expanding software ecosystem
    Lots of custom research implementations
    For GPUs, cell processors, etc.
  • Bonus Assignment
    Write MapReduce version of Assignment no. 2
  • MapReduce in VisionerBOT
  • VisionerBOT Distributed Design
  • PART II: Google File System
  • Distributed File System
    Don’t move data to workers… move workers to the data!
    Store data on the local disks of nodes in the cluster
    Start up the workers on the node that holds the data locally
    Not enough RAM to hold all the data in memory
    Disk access is slow, but disk throughput is reasonable
    A distributed file system is the answer
    GFS (Google File System) for Google’s MapReduce
    HDFS (Hadoop Distributed File System) for Hadoop
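The idea of moving workers to the data can be sketched as a scheduler that prefers an idle node already holding a replica of the block a task needs. The greedy policy, the function name `schedule`, and all task, block, and node names below are hypothetical and chosen only for illustration; real schedulers are considerably more elaborate.

```python
def schedule(tasks, block_locations, free_nodes):
    """Assign each task to a free node, preferring one that stores its block.

    tasks: dict mapping task name -> block id
    block_locations: dict mapping block id -> set of nodes holding a replica
    free_nodes: iterable of idle node names
    Returns a dict mapping task name -> (node, is_local).
    """
    free = set(free_nodes)
    assignment = {}
    for task, block in tasks.items():
        if not free:
            break  # remaining tasks wait until a node becomes idle
        local = sorted(free & block_locations.get(block, set()))
        node = local[0] if local else sorted(free)[0]
        assignment[task] = (node, bool(local))
        free.remove(node)
    return assignment


tasks = {"map0": "b0", "map1": "b1", "map2": "b2"}
block_locations = {"b0": {"n1", "n2"}, "b1": {"n2"}, "b2": {"n1"}}
assignment = schedule(tasks, block_locations, ["n1", "n2", "n3"])
for task, (node, is_local) in assignment.items():
    print(task, node, "local" if is_local else "remote")
```

In this run two tasks read their block from the local disk and only the third has to fetch it over the network, which is the saving that data locality aims for.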
  • GFS: Assumptions
    Commodity hardware over “exotic” hardware
    Scale “out”, not “up”
    High component failure rates
    Inexpensive commodity components fail all the time
    “Modest” number of huge files
    Multi-gigabyte files are common, if not encouraged
    Files are write-once, mostly appended to
    Perhaps concurrently
    Large streaming reads over random access
    High sustained throughput over low latency
    GFS slides adapted from material by (Ghemawat et al., SOSP 2003)
  • GFS: Design Decisions
    Files stored as chunks
    Fixed size (64MB)
    Reliability through replication
    Each chunk replicated across 3+ chunkservers
    Single master to coordinate access, keep metadata
    Simple centralized management
    No data caching
    Little benefit due to large datasets, streaming reads
    Simplify the API
    Push some of the issues onto the client (e.g., data layout)
    HDFS = GFS clone (same basic ideas)
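With fixed-size chunks, finding the chunk that holds a given byte of a file is plain integer division. The sketch below illustrates that, along with a replica placement for each chunk. The chunk size and replica count come from the slide; the round-robin placement and the chunkserver names are assumptions made for the example and are not the actual GFS placement policy.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # fixed 64 MB chunks, as on the slide
REPLICAS = 3                   # minimum replication factor from the slide


def chunk_index(byte_offset):
    """Chunk that contains a given byte offset of a file."""
    return byte_offset // CHUNK_SIZE


def chunks_for_read(byte_offset, length):
    """Indices of all chunks touched by reading `length` (>= 1) bytes at `byte_offset`."""
    first = chunk_index(byte_offset)
    last = chunk_index(byte_offset + length - 1)
    return list(range(first, last + 1))


def place_replicas(index, chunkservers):
    """Hypothetical round-robin placement; not the actual GFS policy."""
    return [chunkservers[(index + i) % len(chunkservers)] for i in range(REPLICAS)]


servers = ["cs0", "cs1", "cs2", "cs3", "cs4"]  # hypothetical chunkserver names
touched = chunks_for_read(byte_offset=100 * 1024 * 1024, length=64 * 1024 * 1024)
for index in touched:
    print(index, place_replicas(index, servers))
```

A 64 MB read starting at the 100 MB mark spans chunks 1 and 2, so the client would need the locations of two chunks, each available from three chunkservers.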