Hadoop & MapReduce



This is a deck of slides from a recent meetup of AWS Usergroup Greece, presented by Ioannis Konstantinou from the National Technical University of Athens.
The presentation gives an overview of the MapReduce framework and a description of its open-source implementation (Hadoop). Amazon's own Elastic MapReduce (EMR) service is also mentioned. With the growing interest in Big Data, this is a good introduction to the subject.

Published in: Technology


  1. 1. Hadoop & MapReduce. Dr. Ioannis Konstantinou, http://www.cslab.ntua.gr/~ikons. AWS Usergroup Greece, 18/07/2012. Computing Systems Laboratory, School of Electrical and Computer Engineering, National Technical University of Athens.
  2. 2. Big Data90% of todays data was created in the last 2 yearsMoores law: Data volume doubles every 18 monthsYouTube: 13 million hours and 700 billion views in 2010Facebook: 20TB/day (compressed)CERN/LHC: 40TB/day (15PB/year)Many more examplesWeb logs, presentation files, medical files etc
  3. 3. Problem: Data explosion. 1 EB (exabyte = 10^18 bytes) = 1000 PB (petabyte = 10^15 bytes): data traffic of mobile telephony in the USA in 2010. 1.2 ZB (zettabyte) = 1200 EB: total digital data in 2010. 35 ZB (zettabyte = 10^21 bytes): estimated volume of total digital data in 2020.
  4. 4. Solution: scalability. How?
  5. 5. Source: Wikipedia (IBM Roadrunner)
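  6. 6. Divide and Conquer (diagram): a "Problem" is partitioned into work units w1, w2, w3; each unit is handled by a "worker", producing partial results r1, r2, r3; the partial results are combined into the "Result".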
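The partition / worker / combine flow in the diagram can be sketched in a few lines of Python. This is only an illustration on a single machine: the function names (`partition`, `worker`, `combine`) and the choice of summing numbers as the "problem" are assumptions made for the example, not part of the slides.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(problem, n_parts):
    """Split a list into n_parts contiguous chunks of near-equal size."""
    size, extra = divmod(len(problem), n_parts)
    chunks, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        chunks.append(problem[start:end])
        start = end
    return chunks


def worker(chunk):
    """Hypothetical unit of work: sum the numbers in one chunk."""
    return sum(chunk)


def combine(partial_results):
    """Merge the partial results (r1, r2, r3) into the final result."""
    return sum(partial_results)


def divide_and_conquer(problem, n_workers=3):
    chunks = partition(problem, n_workers)             # w1, w2, w3
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(worker, chunks))      # r1, r2, r3
    return combine(partials)


total = divide_and_conquer(list(range(1, 101)))
print(total)
```

The sketch sidesteps every hard question listed on the next slide (failures, shared intermediate data, more work units than workers), which is exactly what a framework such as MapReduce is meant to handle.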
  7. 7. Parallelization challenges: How to assign units of work to the workers? What if there are more units of work than workers? What if the workers need to share intermediate incomplete data? How do we aggregate such intermediate data? How do we know when all workers have completed their assignments? What if some workers fail?
  8. 8. What is MapReduce?A programming modelA programming frameworkUsed to develop solutions that will  Process large amounts of data in a parallelized fashion  In clusters of computing nodesOriginally a closed-source implementation at Google  Scientific papers of ’03 & ’04 describe the frameworkHadoop: opensource implementation of the algorithms described in the scientific papers  http://hadoop.apache.org/
  9. 9. What is Hadoop? 2 large subsystems, 1 for data management & 1 for computation: HDFS (Hadoop Distributed File System), and the MapReduce computation framework, which runs above HDFS. HDFS is essentially the I/O of Hadoop. Written in Java: a set of Java processes running on multiple nodes. Who uses it: Yahoo!, Amazon, Facebook, Twitter, plus many more...
  10. 10. HDFS – distributed file system. A scalable distributed file system for applications dealing with large data sets. Distributed: runs in a cluster. Scalable: 10K nodes, 100K files, 10 PB storage. Storage space is seamless for the whole cluster. Files are broken into blocks; typical block size: 128 MB. Replication: each block is copied to multiple data nodes.
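To make the block and replication arithmetic concrete, here is a small sketch. The 128 MB block size comes from the slide; the replication factor of 3 and the round-robin placement are assumptions made for illustration (real HDFS placement is rack-aware and is not modeled here), and the DataNode names are hypothetical.

```python
import math

BLOCK_SIZE = 128 * 1024**2   # 128 MB, the typical block size quoted on the slide
REPLICATION = 3              # assumed replication factor (not stated on the slide)


def block_count(file_size):
    """Number of blocks a file of file_size bytes is split into."""
    return math.ceil(file_size / BLOCK_SIZE)


def raw_storage(file_size, replication=REPLICATION):
    """Bytes consumed across the cluster once every block is replicated."""
    return file_size * replication


def place_blocks(file_size, data_nodes, replication=REPLICATION):
    """Toy round-robin placement: each block goes to `replication` distinct nodes."""
    if replication > len(data_nodes):
        raise ValueError("need at least as many DataNodes as replicas")
    return {
        b: [data_nodes[(b + r) % len(data_nodes)] for r in range(replication)]
        for b in range(block_count(file_size))
    }


one_gib = 1024**3
nodes = ["dn1", "dn2", "dn3", "dn4"]   # hypothetical DataNode names
placement = place_blocks(one_gib, nodes)
print(block_count(one_gib), raw_storage(one_gib) // 1024**3, placement[0])
```

Under these assumptions a 1 GiB file becomes 8 blocks and occupies 3 GiB of raw cluster storage.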
  11. 11. Architecture of HDFS/MapReduce: Master/Slave scheme. HDFS: a central NameNode administers multiple DataNodes. NameNode: holds information about which DataNode holds which files. DataNodes: "dumb" servers that hold raw file chunks. MapReduce: a central JobTracker administers multiple TaskTrackers. NameNode and JobTracker run on the master; DataNode and TaskTracker run on the slaves.
  12. 12. MapReduceThe problem is broken down in 2 phases. ● Map: Non overlapping sets of data input (<key,value> records) are assigned to different processes (mappers) that produce a set of intermediate <key,value> results ● Reduce: Data of Map phase are fed to a typically smaller number of processes(reducers) that aggregate the input results to a smaller number of <key,value> records.
  13. 13. How does it work?
  14. 14. Initialization phaseInput is uploaded to HDFS and is split into pieces of fixed sizeEach TaskTracker node that participates in the computation is executing a copy of the MapReduce programOne of the nodes plays the JobTracker master role. This node will assign tasks to the rest (workers). Tasks can either be of type map or reduce.
  15. 15. JobTracker (Master)The jobTracker holds data about: Status of tasks Location of input, output and intermediate data (runs together with NameNode - HDFS master)The master is responsible for timecheduling of work tasks execution.
  16. 16. TaskTracker (Slave)The TaskTracker runs tasks assigned by the master.Runs at the same node as the DataNode (HFDS slave)Task can be either of type Map or type ReduceTypically the maximum number of concurrent tasks that can be run by a node is equal to the number of cpu cores it has (achieving optimal CPU utilization)
  17. 17. Map task: A worker (TaskTracker) that has been assigned a map task ● Reads the relevant input data (input split) from HDFS and parses it into <key, value> pairs, which are passed as input to the map function. ● The map function processes the pairs and produces intermediate pairs that are buffered in memory. ● Periodically a partition function is executed, which stores the intermediate key-value pairs in the local node storage while grouping them into R sets. This function is user defined. ● When the partition function has finished storing the key-value pairs, the worker informs the master that the task is complete and where the data are stored. ● The master forwards this information to the workers that run the reduce tasks.
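The grouping of intermediate pairs into R sets can be sketched as follows. This is a minimal illustration: `default_partition` follows the common "hash of the key modulo R" idea (using CRC32 here so that results are stable across Python runs), and `by_first_letter` is a hypothetical user-defined alternative. Neither is taken from the slides or from Hadoop's source.

```python
import zlib
from collections import defaultdict


def default_partition(key, num_reducers):
    """Assumed default: a stable hash of the key, modulo the number of reducers R."""
    return zlib.crc32(key.encode("utf-8")) % num_reducers


def by_first_letter(key, num_reducers):
    """Hypothetical custom partitioner: keys starting with a-m go to set 0, the rest to the last set."""
    return 0 if key[:1].lower() < "n" else num_reducers - 1


def partition_map_output(pairs, num_reducers, partition_fn=default_partition):
    """Group intermediate (key, value) pairs into at most R sets, one per reducer."""
    buckets = defaultdict(list)
    for key, value in pairs:
        buckets[partition_fn(key, num_reducers)].append((key, value))
    return dict(buckets)


intermediate = [("apple", 1), ("zebra", 1), ("apple", 1), ("mango", 1), ("nut", 1)]
hashed = partition_map_output(intermediate, num_reducers=2)
custom = partition_map_output(intermediate, 2, partition_fn=by_first_letter)
print(custom)
```

Whatever function is used, the property that matters is that every occurrence of a key lands in the same set, so that a single reducer sees all values for that key.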
  18. 18. Reduce task: A worker that has been assigned a reduce task ● Reads, from every map task that has been executed, the pairs that belong to its partition, using the locations provided by the master. ● When all intermediate pairs have been retrieved, they are sorted by key. Entries with the same key are grouped together. ● The reduce function is executed with the <key, group_of_values> pairs produced by the previous step as input. ● The reduce task processes the input data and produces the final pairs. ● The output pairs are appended to a file in the local file system. When the reduce task is completed, the file becomes available in the distributed file system.
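The sort-then-group step of a reduce task can be sketched with the Python standard library. The helper names (`run_reduce_task`, `sum_reducer`) and the sample pairs are assumptions for illustration; fetching data from the map workers and writing to the distributed file system are left out.

```python
from itertools import groupby
from operator import itemgetter


def run_reduce_task(fetched_pairs, reduce_fn):
    """Sort the fetched pairs by key, group equal keys, and call reduce_fn once per key."""
    ordered = sorted(fetched_pairs, key=itemgetter(0))
    output = []
    for key, group in groupby(ordered, key=itemgetter(0)):
        values = [value for _, value in group]   # <key, group_of_values>
        output.extend(reduce_fn(key, values))
    return output


def sum_reducer(key, values):
    """Example reduce function: add up all values seen for a key."""
    yield key, sum(values)


# Pairs as they might arrive from several map tasks, in no particular order.
pairs = [("w2", 3), ("w1", 2), ("w1", 3), ("w2", 4), ("w1", 4)]
result = run_reduce_task(pairs, sum_reducer)
print(result)
```

Sorting first is what lets `groupby` see all values of a key as one contiguous run, which mirrors why the framework sorts reducer input.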
  19. 19. Task CompletionWhen a worker has completed its task it informs the master.When all workers have informed the master then the master will return the function to the original program of the user.
  20. 20. Example (diagram): the Master coordinates the workers. The Input is split into Part 1, Part 2 and Part 3; each part is processed by a Map worker, and the Reduce workers produce the Output.
  21. 21. MapReduce
  22. 22. Example: Word count 1/3. Objective: measure the frequency of appearance of words in a large set of documents. Potential use case: discovery of popular URLs in a set of web server log files. Implementation plan: "upload" documents to MapReduce; author a map function; author a reduce function; run a MapReduce task; retrieve results.
  23. 23. Example: Word count 2/3
      map(key, value):
          // key: document name; value: text of document
          for each word w in value:
              emit(w, 1)
      reduce(key, values):
          // key: a word; values: an iterator over counts
          result = 0
          for each count v in values:
              result += v
          emit(key, result)
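The pseudocode above translates almost directly into runnable Python. The sketch below runs everything in a single process, so `run_job` is only a stand-in (an assumed helper, not a Hadoop API) for the shuffle that the framework performs between the two phases; the three sample documents are chosen for illustration.

```python
from collections import defaultdict


def map_fn(doc_name, text):
    """key: document name; value: text of the document. Emits (word, 1) per word."""
    for word in text.split():
        yield word, 1


def reduce_fn(word, counts):
    """key: a word; values: the counts emitted for it. Emits (word, total)."""
    yield word, sum(counts)


def run_job(documents):
    """Single-process stand-in for the framework: map, group by key, then reduce."""
    intermediate = defaultdict(list)
    for name, text in documents.items():
        for key, value in map_fn(name, text):
            intermediate[key].append(value)
    results = {}
    for key in sorted(intermediate):           # reducer input is sorted by key
        for out_key, out_value in reduce_fn(key, intermediate[key]):
            results[out_key] = out_value
    return results


docs = {"d1": "w1 w2 w4", "d2": "w1 w2 w3 w4", "d3": "w2 w3 w4"}
counts = run_job(docs)
print(counts)
```

Because addition is associative and commutative, the same `reduce_fn` could also be applied on each mapper's local output before the shuffle to cut down the data sent over the network.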
  24. 24. Example: Word count 3/3 (diagram, M=3 mappers, R=2 reducers)
      Input documents: (d1, 'w1 w2 w4'), (d2, 'w1 w2 w3 w4'), (d3, 'w2 w3 w4'), (d4, 'w1 w2 w3'), (d5, 'w1 w3 w4'), (d6, 'w1 w4 w2 w2'), (d7, 'w4 w2 w1'), (d8, 'w2 w2 w3'), (d9, 'w1 w1 w3 w3'), (d10, 'w2 w1 w4 w3')
      Mapper 1 (d1-d3), per-word counts: (w1,2) (w2,3) (w3,2) (w4,3)
      Mapper 2 (d4-d7), per-word counts: (w1,4) (w2,4) (w3,2) (w4,3)
      Mapper 3 (d8-d10), per-word counts: (w1,3) (w2,3) (w3,4) (w4,1)
      Reducer 1 output: (w1,9) (w2,10)
      Reducer 2 output: (w3,8) (w4,7)
  25. 25. Extra functions
  26. 26. LocalityMove computation near the data: The master tries to have a task executed on a worker that is as “near” as possible to the input data, thus reducing the bandwidth usage How does the master know?
  27. 27. Task distributionThe number of tasks is usually higher than the number of the available workersOne worker can execute more than one tasksThe balance of work load is improved. In the case of a single worker failure there is faster recovery and redistribution of tasks to other nodes.
  28. 28. Redundant task executionsSome tasks can be delayed, resulting in a delay in the overall work executionThe solution to the problem is the creation of task copies that can be executed in parallel from 2 or more different workers (speculative execution)A task is considered complete when the master is informed about its completion by at least one node.
  29. 29. PartitioningA user can specify a custom function that will partition the tasks during shuffling.The type of input and output data can be defined by the user and has no limitation on what form it should have.
  30. 30. The input of a reducer is always sorted. Tasks can also be executed locally in a serial manner. The master provides web interfaces for monitoring task progress and browsing HDFS.
  31. 31. When should I use it?Good choice for jobs that can be broken into parallelized jobs:  Indexing/Analysis of log files  Sorting of large data sets  Image processing• Bad choice for serial or low latency jobs: – Computation of number π with precision of 1,000,000 digits – Computation of Fibonacci sequence – Replacing MySQL
  32. 32. Use cases 1/3 • Large-scale image conversions • 100 Amazon EC2 instances, 4 TB raw TIFF data • 11 million PDFs in 24 hours for $240 • Internal log processing • Reporting, analytics and machine learning • Cluster of 1110 machines, 8800 cores and 12 PB raw storage • Open source contributors (Hive) • Store and process tweets, logs, etc. • Open source contributors (hadoop-lzo) • Large-scale machine learning
  33. 33. Use cases 2/3 • 100,000 CPUs in 25,000 computers • Content/Ads optimization, search index • Machine learning (e.g. spam filtering) • Open source contributors (Pig) • Natural language search (through Powerset) • 400 nodes in EC2, storage in S3 • Open source contributors (!) to HBase • ElasticMapReduce service • On-demand elastic Hadoop clusters for the Cloud
  34. 34. Use cases 3/3 • ETL processing, statistics generation • Advanced algorithms for behavioral analysis and targeting • Used for discovering People You May Know, and for other apps • 3x30 node cluster, 16 GB RAM and 8 TB storage • Leading Chinese-language search engine • Search log analysis, data mining • 300 TB per week • 10 to 500 node clusters
  35. 35. Amazon Elastic MapReduce (EMR): A hosted Hadoop-as-a-service solution provided by AWS. No need for management or tuning of Hadoop clusters: ● upload your input data and store your output data on S3 ● procure as many EC2 instances as you need and only pay for the time you use them. Hive and Pig support makes it easy to write data analysis scripts. Java, Perl, Python, PHP, C++ for more sophisticated algorithms. Integrates with DynamoDB (process combined datasets in S3 & DynamoDB). Support for HBase (NoSQL).
  36. 36. Questions