CSE509 Lecture 4


Lecture 4 of CSE509: Web Science and Technology Summer Course

Published in: Technology
  • Slide 2 notes: In traditional high-performance computing (HPC) applications (e.g., climate or nuclear simulations), it is commonplace for a supercomputer to have “processing nodes” and “storage nodes” linked together by a high-capacity interconnect. Many data-intensive workloads are not very processor-demanding, which means that separating compute from storage turns the interconnect into a bottleneck. Rather than moving data to the processing, it is more efficient to move the processing to the data. That is, MapReduce assumes an architecture where processors and storage (disk) are co-located. In such a setup, we can take advantage of data locality by running code on the processor directly attached to the block of data it needs. The distributed file system is responsible for managing the data over which MapReduce operates.

  • Slide 3 notes: Data-intensive processing by definition means that the relevant datasets are too large to fit in memory and must be held on disk. Seek times for random disk access are fundamentally limited by the mechanical nature of the devices: read heads can only move so fast and platters can only spin so rapidly. As a result, it is desirable to avoid random data access and instead organize computations so that data are processed sequentially. A simple scenario poignantly illustrates the large performance gap between sequential operations and random seeks: assume a 1-terabyte database containing 10^10 100-byte records. Given reasonable assumptions about disk latency and throughput, a back-of-the-envelope calculation shows that updating 1% of the records (seeking to and then mutating each record) would take about a month on a single machine. On the other hand, if one simply reads the entire database sequentially and rewrites all the records (mutating those that need updating), the process would finish in under a work day on a single machine. Sequential data access is, literally, orders of magnitude faster than random data access. The development of solid-state drives is unlikely to change this balance, for at least two reasons. First, the cost differential between traditional magnetic disks and solid-state disks remains substantial: large data will for the most part remain on mechanical drives, at least in the near future. Second, although solid-state disks have substantially faster seek times, order-of-magnitude differences in performance between sequential and random access still remain. MapReduce is primarily designed for batch processing over large datasets. To the extent possible, all computations are organized into long streaming operations that take advantage of the aggregate bandwidth of many disks in a cluster. Many aspects of MapReduce’s design explicitly trade latency for throughput.
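The month-versus-workday claim in the notes can be checked with a few lines of arithmetic. The disk figures below (10 ms per random access, three accesses per in-place update, 100 MB/s sustained sequential throughput) are assumptions chosen as "reasonable" values, not numbers from the lecture:

```python
# Back-of-the-envelope check of the "update 1% of a 1 TB database" scenario.
# Assumed figures: 10 ms per random access (seek + rotation), ~3 accesses
# per in-place update (read, seek back, write), 100 MB/s sequential I/O.

RECORDS = 10**10          # 100-byte records -> 1 TB database
UPDATED = RECORDS // 100  # 1% of records

# Random-access strategy: touch each updated record individually.
random_seconds = UPDATED * 3 * 0.010          # 3 accesses x 10 ms each
random_days = random_seconds / 86_400         # roughly a month

# Sequential strategy: stream-read the whole 1 TB and stream-write it back.
seq_bytes = 2 * RECORDS * 100                 # read + rewrite
seq_hours = seq_bytes / (100 * 10**6) / 3_600 # well under a work day

print(f"random: ~{random_days:.0f} days, sequential: ~{seq_hours:.1f} hours")
```

With these assumptions the random-access plan takes about 35 days and the sequential rewrite about 5.6 hours, matching the notes' "about a month" versus "under a work day".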
    1. 1. CSE509: Introduction to Web Science and Technology<br />Lecture 4: Dealing with Large-Scale Web data: Large-Scale File Systems and MapReduce<br />Muhammad Atif Qureshi<br />Web Science Research Group<br />Institute of Business Administration (IBA)<br />
    2. 2. Last Time…<br />Search Engine Architecture<br />Overview of Web Crawling<br />Web Link Structure<br />Ranking Problem<br />SEO and Web Spam<br />Web Spam Research<br />July 30, 2011<br />
    3. 3. Today<br />Web Data Explosion<br />Part I<br />MapReduce Basics<br />MapReduce Example and Details<br />MapReduce Case-Study: Web Crawler based on MapReduce Architecture<br />Part II<br />Large-Scale File Systems<br />Google File System Case-Study<br />July 30, 2011<br />
    4. 4. Introduction<br />Web data sets can be very large <br />Tens to hundreds of terabytes<br />Cannot mine on a single server (why?)<br />“Big data” is a fact on the World Wide Web<br />Larger data demands more effective algorithms<br />Web-scale processing: Data-intensive processing<br />Also applies to startups and niche players<br />July 30, 2011<br />
    5. 5. How Much Data?<br />Google processes 20 PB a day (2008)<br />Facebook has 2.5 PB of user data + 15 TB/day (4/2009) <br />eBay has 6.5 PB of user data + 50 TB/day (5/2009)<br />CERN’s LHC will generate 15 PB a year (??)<br />July 30, 2011<br />
    6. 6. Cluster Architecture<br />July 30, 2011<br />CPU<br />CPU<br />CPU<br />CPU<br />Mem<br />Mem<br />Mem<br />Mem<br />Disk<br />Disk<br />Disk<br />Disk<br />2-10 Gbps backbone between racks<br />1 Gbps between <br />any pair of nodes<br />in a rack<br />Switch<br />Switch<br />Switch<br />…<br />…<br />Each rack contains 16-64 nodes<br />
    7. 7. Concerns<br />If we had to abort and restart the computation every time one component fails, then the computation might never complete successfully<br />If one node fails, all its files would be unavailable until the node is replaced<br />Can also lead to permanent loss of files<br />July 30, 2011<br />Solutions: MapReduce and Google File system<br />
    8. 8. PART I: MapReduce<br />July 30, 2011<br />
    9. 9. Major Ideas<br />Scale “out”, not “up” (Distributed vs. SMP) <br />Limits of SMP and large shared-memory machines<br />Move processing to the data<br />Clusters have limited bandwidth<br />Process data sequentially, avoid random access<br />Seeks are expensive, disk throughput is reasonable<br />Seamless scalability<br />From the traditional mythical man-month to the newly tradable machine-hour<br />Twenty-one chickens together cannot make an egg hatch in a day<br />July 30, 2011<br />
    10. 10. Traditional Parallelization: Divide and Conquer<br />July 30, 2011<br />“Work”<br />Partition<br />w1<br />w2<br />w3<br />“worker”<br />“worker”<br />“worker”<br />r1<br />r2<br />r3<br />Combine<br />“Result”<br />
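The partition → workers → combine pattern on this slide can be sketched with Python's standard library. The work items, the `worker` body, and the three-way split are all illustrative placeholders, not anything specific to MapReduce:

```python
from concurrent.futures import ThreadPoolExecutor

def partition(work, n):
    """Split the "Work" into n roughly equal pieces (w1, w2, w3, ...)."""
    return [work[i::n] for i in range(n)]

def worker(chunk):
    """Each worker produces a partial result (here: a partial sum)."""
    return sum(chunk)

def combine(partials):
    """Merge the partial results (r1, r2, r3, ...) into the "Result"."""
    return sum(partials)

work = list(range(100))              # hypothetical work units
with ThreadPoolExecutor(max_workers=3) as pool:
    partials = list(pool.map(worker, partition(work, 3)))
result = combine(partials)
print(result)  # 4950
```

The next slide's challenges (assignment, stragglers, failures, aggregation) are exactly what this toy version ignores and what the MapReduce runtime must handle.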
    11. 11. Parallelization Challenges<br />How do we assign work units to workers?<br />What if we have more work units than workers?<br />What if workers need to share partial results?<br />How do we aggregate partial results?<br />How do we know all the workers have finished?<br />What if workers die?<br />July 30, 2011<br />
    12. 12. Common Theme<br />Parallelization problems arise from:<br />Communication between workers (e.g., to exchange state)<br />Access to shared resources (e.g., data)<br />Thus, we need a synchronization mechanism<br />July 30, 2011<br />
    13. 13. Parallelization is Hard<br />Traditionally, concurrency is difficult to reason about (even on uniprocessor and small-scale architectures)<br />Concurrency is even more difficult to reason about<br />At the scale of datacenters (even across datacenters)<br />In the presence of failures<br />In terms of multiple interacting services<br />Not to mention debugging…<br />The reality:<br />Write your own dedicated library, then program with it<br />Burden on the programmer to explicitly manage everything<br />July 30, 2011<br />
    14. 14. Solution: MapReduce<br />Programming model for expressing distributed computations at a massive scale<br />Hides system-level details from the developers<br />No more race conditions, lock contention, etc.<br />Separating the what from how<br />Developer specifies the computation that needs to be performed<br />Execution framework (“runtime”) handles actual execution<br />July 30, 2011<br />
    15. 15. What is MapReduce Used For?<br />At Google:<br />Index building for Google Search<br />Article clustering for Google News<br />Statistical machine translation<br />At Yahoo!:<br />Index building for Yahoo! Search<br />Spam detection for Yahoo! Mail<br />At Facebook:<br />Data mining<br />Ad optimization<br />Spam detection<br />July 30, 2011<br />
    16. 16. Typical MapReduce Execution<br />Iterate over a large number of records<br />Extract something of interest from each<br />Shuffle and sort intermediate results<br />Aggregate intermediate results<br />Generate final output<br />Map<br />Reduce<br />Key idea: provide a functional abstraction for these two operations<br />(Dean and Ghemawat, OSDI 2004)<br />
    17. 17. MapReduce Basics<br />Programmers specify two functions:<br />map (k, v) -> <k’, v’>*<br />reduce (k’, v’) -> <k’, v’>*<br />All values with the same key are sent to the same reducer<br />The execution framework handles everything else…<br />July 30, 2011<br />
    18. 18. Warm Up Example: Word Count<br />We have a large file of words, one word to a line<br />Count the number of times each distinct word appears in the file<br />Sample application: analyze web server logs to find popular URLs<br />July 30, 2011<br />
    19. 19. Word Count (2)<br />Case 1: Entire file fits in memory<br />Case 2: File too large for mem, but all <word, count> pairs fit in mem<br />Case 3: File on disk, too many distinct words to fit in memory<br />sort datafile | uniq -c<br />July 30, 2011<br />
    20. 20. Word Count (3)<br />To make it slightly harder, suppose we have a large corpus of documents<br />Count the number of times each distinct word occurs in the corpus<br />words(docs/*) | sort | uniq -c<br />where words takes a file and outputs the words in it, one to a line<br />The above captures the essence of MapReduce<br />Great thing is it is naturally parallelizable<br />July 30, 2011<br />
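A minimal Python equivalent of the `words(docs/*) | sort | uniq -c` pipeline, where `collections.Counter` plays the role of `sort | uniq -c` (the `words` helper and the glob pattern are illustrative):

```python
import glob
from collections import Counter

def words(path):
    """Like the slide's hypothetical `words` command: yield one word per line."""
    with open(path) as f:
        for line in f:
            yield from line.split()

def corpus_word_count(pattern):
    """Count occurrences of each distinct word across all matching files."""
    counts = Counter()
    for path in glob.glob(pattern):   # e.g. "docs/*"
        counts.update(words(path))
    return counts
```

This single-machine version captures the essence of the computation; MapReduce's contribution is running the `words` step and the counting step in parallel across many machines.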
    21. 21. Word Count using MapReduce<br />July 30, 2011<br />map(key, value):<br />// key: document name; value: text of document<br /> for each word w in value:<br /> emit(w, 1)<br />reduce(key, values):<br />// key: a word; values: an iterator over counts<br /> result = 0<br /> for each count v in values:<br /> result += v<br /> emit(key,result)<br />
    22. 22. Word Count Illustration<br />July 30, 2011<br />map(key=url, val=contents):<br />For each word w in contents, emit (w, “1”)<br />reduce(key=word, values=uniq_counts):<br />Sum all “1”s in values list<br />Emit result “(word, sum)”<br />see 1<br />bob 1 <br />run 1<br />see 1<br />spot 1<br />throw 1<br />bob 1 <br />run 1<br />see 2<br />spot 1<br />throw 1<br />see bob run<br />see spot throw<br />
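The pseudocode on the previous two slides can be made runnable with a small single-machine simulation of the framework. The `map_fn`/`reduce_fn` names and the in-memory shuffle are illustrative stand-ins for what the real runtime does across a cluster; the map and reduce bodies follow the slides:

```python
from itertools import groupby
from operator import itemgetter

def map_fn(key, value):
    # key: document name; value: text of document
    for word in value.split():
        yield (word, 1)

def reduce_fn(key, values):
    # key: a word; values: an iterator over counts
    yield (key, sum(values))

def map_reduce(inputs, map_fn, reduce_fn):
    # Map phase: apply map_fn to every input record.
    intermediate = [kv for k, v in inputs for kv in map_fn(k, v)]
    # Shuffle/sort: group all values by key, as the framework would,
    # so that all values with the same key reach the same reducer call.
    intermediate.sort(key=itemgetter(0))
    output = []
    for key, group in groupby(intermediate, key=itemgetter(0)):
        output.extend(reduce_fn(key, (v for _, v in group)))
    return dict(output)

docs = [("d1", "see bob run"), ("d2", "see spot throw")]
print(map_reduce(docs, map_fn, reduce_fn))
# {'bob': 1, 'run': 1, 'see': 2, 'spot': 1, 'throw': 1}
```

Running it on the slide's two documents reproduces the illustrated result: `see` maps to 2 and every other word to 1.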
    23. 23. Implementation Overview<br />100s/1000s of 2-CPU x86 machines, 2-4 GB of memory<br />Limited bandwidth <br />Storage is on local IDE disks <br />GFS: distributed file system manages data (SOSP'03) <br />Job scheduling system: jobs made up of tasks, scheduler assigns tasks to machines <br />July 30, 2011<br />Implementation at Google is a C++ library linked to user programs<br />
    24. 24. Distributed Execution Overview<br />July 30, 2011<br />UserProgram<br />(1) submit<br />Master<br />(2) schedule map<br />(2) schedule reduce<br />worker<br />split 0<br />(6) write<br />output<br />file 0<br />(5) remote read<br />worker<br />split 1<br />(3) read<br />split 2<br />(4) local write<br />worker<br />split 3<br />output<br />file 1<br />split 4<br />worker<br />worker<br />Input<br />files<br />Map<br />phase<br />Intermediate files<br />(on local disk)<br />Reduce<br />phase<br />Output<br />files<br />Adapted from (Dean and Ghemawat, OSDI 2004)<br />
    25. 25. MapReduce Implementations<br />Google has a proprietary implementation in C++<br />Bindings in Java, Python<br />Hadoop is an open-source implementation in Java<br />Development led by Yahoo, used in production<br />Now an Apache project<br />Rapidly expanding software ecosystem<br />Lots of custom research implementations<br />For GPUs, cell processors, etc.<br />July 30, 2011<br />
    26. 26. Bonus Assignment<br />Write MapReduce version of Assignment no. 2<br />July 30, 2011<br />
    27. 27. MapReduce in VisionerBOT<br />July 30, 2011<br />
    28. 28. VisionerBOT Distributed Design<br />July 30, 2011<br />
    29. 29. PART II: Google File System<br />July 30, 2011<br />
    30. 30. Distributed File System<br />Don’t move data to workers… move workers to the data!<br />Store data on the local disks of nodes in the cluster<br />Start up the workers on the node that has the data local<br />Why?<br />Not enough RAM to hold all the data in memory<br />Disk access is slow, but disk throughput is reasonable<br />A distributed file system is the answer<br />GFS (Google File System) for Google’s MapReduce<br />HDFS (Hadoop Distributed File System) for Hadoop<br />
    31. 31. GFS: Assumptions<br />Commodity hardware over “exotic” hardware<br />Scale “out”, not “up”<br />High component failure rates<br />Inexpensive commodity components fail all the time<br />“Modest” number of huge files<br />Multi-gigabyte files are common, if not encouraged<br />Files are write-once, mostly appended to<br />Perhaps concurrently<br />Large streaming reads over random access<br />High sustained throughput over low latency<br />GFS slides adapted from material by (Ghemawat et al., SOSP 2003)<br />
    32. 32. GFS: Design Decisions<br />Files stored as chunks<br />Fixed size (64MB)<br />Reliability through replication<br />Each chunk replicated across 3+ chunkservers<br />Single master to coordinate access, keep metadata<br />Simple centralized management<br />No data caching<br />Little benefit due to large datasets, streaming reads<br />Simplify the API<br />Push some of the issues onto the client (e.g., data layout)<br />HDFS = GFS clone (same basic ideas)<br />
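The chunking and replication decisions above can be sketched as a toy master-side placement policy. The chunkserver names and the round-robin replica choice are purely illustrative; real GFS placement also weighs rack topology, disk utilization, and recent load:

```python
CHUNK_SIZE = 64 * 2**20   # fixed-size 64 MB chunks
REPLICAS = 3              # each chunk replicated across 3+ chunkservers

def place_chunks(file_size, chunkservers):
    """Master metadata sketch: map each chunk index to its replica set."""
    n_chunks = -(-file_size // CHUNK_SIZE)   # ceiling division
    placement = {}
    for i in range(n_chunks):
        # Toy policy: rotate replicas round-robin over the servers,
        # guaranteeing 3 distinct chunkservers per chunk.
        placement[i] = [chunkservers[(i + r) % len(chunkservers)]
                        for r in range(REPLICAS)]
    return placement

servers = ["cs0", "cs1", "cs2", "cs3", "cs4"]
meta = place_chunks(200 * 2**20, servers)    # a 200 MB file -> 4 chunks
print(meta[0])  # ['cs0', 'cs1', 'cs2']
```

Only this mapping lives on the single master; clients fetch it once, then read and write chunk data directly from the chunkservers, which is what keeps the centralized design from becoming a bottleneck.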
    33. 33. QUESTIONS?<br />July 30, 2011<br />