Hadoop & MapReduce

This is a deck of slides from a recent meetup of the AWS Usergroup Greece, presented by Ioannis Konstantinou from the National Technical University of Athens.
The presentation gives an overview of the MapReduce framework and a description of its open-source implementation (Hadoop). Amazon's own Elastic MapReduce (EMR) service is also mentioned. With the growing interest in Big Data, this is a good introduction to the subject.


Hadoop & MapReduce: Presentation Transcript

  • Hadoop & MapReduce. Dr. Ioannis Konstantinou, http://www.cslab.ntua.gr/~ikons. AWS Usergroup Greece, 18/07/2012. Computing Systems Laboratory, School of Electrical and Computer Engineering, National Technical University of Athens.
  • Big Data90% of todays data was created in the last 2 yearsMoores law: Data volume doubles every 18 monthsYouTube: 13 million hours and 700 billion views in 2010Facebook: 20TB/day (compressed)CERN/LHC: 40TB/day (15PB/year)Many more examplesWeb logs, presentation files, medical files etc
  • Problem: data explosion. 1 EB (exabyte = 10^18 bytes) = 1000 PB (petabyte = 10^15 bytes): the data traffic of mobile telephony in the USA in 2010. 1.2 ZB (zettabyte = 10^21 bytes) = 1200 EB: the total of digital data in 2010. 35 ZB: the estimated volume of total digital data in 2020.
  • Solution: scalability. How?
  • [Photo of a supercomputer. Source: Wikipedia (IBM Roadrunner)]
  • Divide and Conquer: [Diagram: a “Problem” is partitioned into work units w1, w2, w3; each is handled by a “worker” producing a result r1, r2, r3; the partial results are combined into the final “Result”.]
  • Parallelization challenges: How to assign units of work to the workers? What if there are more units of work than workers? What if the workers need to share intermediate, incomplete data? How do we aggregate such intermediate data? How do we know when all workers have completed their assignments? What if some workers fail?
  • What is MapReduce?A programming modelA programming frameworkUsed to develop solutions that will  Process large amounts of data in a parallelized fashion  In clusters of computing nodesOriginally a closed-source implementation at Google  Scientific papers of ’03 & ’04 describe the frameworkHadoop: opensource implementation of the algorithms described in the scientific papers  http://hadoop.apache.org/
  • What is Hadoop? 2 large subsystems, 1 for data management & 1 for computation:  HDFS (Hadoop Distributed File System)  MapReduce computation framework runs above HDFS  HDFS is essentially the I/O of Hadoop Written in java: A set of java processes running in multiple nodes Who uses it:  Yahoo!  Amazon  Facebook  Twitter  Plus many more...
  • HDFS – distributed file system: a scalable distributed file system for applications dealing with large data sets. Distributed: runs on a cluster. Scalable: 10K nodes, 100K files, 10 PB of storage. Storage space appears seamless across the whole cluster. Files are broken into blocks; typical block size: 128 MB. Replication: each block is copied to multiple DataNodes.
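    A minimal sketch (not from the deck) of how an application might write and read a file through HDFS using Hadoop's Java FileSystem API; the file path is a made-up example, and the NameNode address is assumed to come from the usual core-site.xml/hdfs-site.xml configuration:

```java
import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FSDataOutputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

public class HdfsExample {
    public static void main(String[] args) throws Exception {
        // Configuration picks up core-site.xml / hdfs-site.xml from the classpath;
        // fs.defaultFS would normally point at the cluster's NameNode.
        Configuration conf = new Configuration();
        FileSystem fs = FileSystem.get(conf);

        Path path = new Path("/user/demo/hello.txt"); // hypothetical path

        // Write a small file; HDFS splits large files into blocks and
        // replicates each block across several DataNodes.
        try (FSDataOutputStream out = fs.create(path, true)) {
            out.write("hello hdfs\n".getBytes(StandardCharsets.UTF_8));
        }

        // Read the file back through the same API.
        try (BufferedReader in = new BufferedReader(
                new InputStreamReader(fs.open(path), StandardCharsets.UTF_8))) {
            System.out.println(in.readLine());
        }
    }
}
```

    The block size and replication factor mentioned on the slide are normally taken from the cluster configuration (e.g. dfs.replication) rather than set by the application.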
  • Architecture of HDFS/MapReduce: a master/slave scheme. HDFS: a central NameNode administers multiple DataNodes; the NameNode holds information about which DataNode holds which files, while the DataNodes are «dummy» servers that hold raw file chunks. MapReduce: a central JobTracker administers multiple TaskTrackers. The NameNode and JobTracker run on the master; the DataNode and TaskTracker run on the slaves.
  • MapReduceThe problem is broken down in 2 phases. ● Map: Non overlapping sets of data input (<key,value> records) are assigned to different processes (mappers) that produce a set of intermediate <key,value> results ● Reduce: Data of Map phase are fed to a typically smaller number of processes(reducers) that aggregate the input results to a smaller number of <key,value> records.
  • How does it work?
  • Initialization phaseInput is uploaded to HDFS and is split into pieces of fixed sizeEach TaskTracker node that participates in the computation is executing a copy of the MapReduce programOne of the nodes plays the JobTracker master role. This node will assign tasks to the rest (workers). Tasks can either be of type map or reduce.
  • JobTracker (Master)The jobTracker holds data about: Status of tasks Location of input, output and intermediate data (runs together with NameNode - HDFS master)The master is responsible for timecheduling of work tasks execution.
  • TaskTracker (Slave)The TaskTracker runs tasks assigned by the master.Runs at the same node as the DataNode (HFDS slave)Task can be either of type Map or type ReduceTypically the maximum number of concurrent tasks that can be run by a node is equal to the number of cpu cores it has (achieving optimal CPU utilization)
  • Map task A worker (TaskTracker) that has been assigned a map task ● Reads the relevant input data (input split) from HDFS, analyzes the <key, value> pairs and the output is passed as input to the map function. ● The map function processes the pairs and produces intermediate pairs that are aggregated in memory. ● Periodically a partition function is executed which stores the intermediate key- value pairs in the local node storage, while grouping them in R sets.This function is user defined. ● When the partition function completes the storage of the key-value pairs it informs the master that the task is complete and where the data are stored. ● The master forwards this information to the workers that run the reduce tasks
  • Reduce task A worker that has been assigned a reduce task  Reads from every map process that has been executed the pairs that correspond to itself based on the locations instructed by the master.  When all intermediate pairs have been retrieved they are sorted based on their key. Entries with the same key are grouped together.  Function reduce is executed with input the pairs <key, group_of_values> that were the result of the previous phase.  The reduce task processes the input data and produces the final pairs.  The output pairs are attached in a file in the local file system. When the reduce task is completed the file becomes available in the distributed file system.
  • Task CompletionWhen a worker has completed its task it informs the master.When all workers have informed the master then the master will return the function to the original program of the user.
  • Example: [Diagram: the input is split into Part 1, Part 2 and Part 3; the Master assigns each part to a worker running a Map task, the intermediate results are passed to workers running Reduce tasks, and their results form the Output.]
  • MapReduce
  • Example: Word count 1/3. Objective: measure the frequency of appearance of words in a large set of documents. Potential use case: discovery of popular URLs in a set of web server logfiles. Implementation plan: ● “upload” the documents to MapReduce ● author a map function ● author a reduce function ● run a MapReduce task ● retrieve the results.
  • Example: Word count 2/3
      map(key, value):
        // key: document name; value: text of the document
        for each word w in value:
          emit(w, 1)

      reduce(key, values):
        // key: a word; values: an iterator over counts
        result = 0
        for each count v in values:
          result += v
        emit(key, result)
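    For reference, a sketch of the same word count written against Hadoop's Java MapReduce API (the class name and the command-line input/output paths are illustrative, not part of the deck):

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCount {

    // map(key, value): emit (word, 1) for every word in the input line
    public static class TokenizerMapper
            extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, ONE);
            }
        }
    }

    // reduce(key, values): sum the counts emitted for each word
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            result.set(sum);
            context.write(key, result);
        }
    }

    public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // optional local aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));   // HDFS input dir
        FileOutputFormat.setOutputPath(job, new Path(args[1])); // HDFS output dir
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```

    It would typically be packaged into a jar and submitted with something like hadoop jar wordcount.jar WordCount <input dir> <output dir> (paths are placeholders).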
  • Example: Word count 3/3: [Dataflow diagram: ten example documents d1 to d10, containing words w1 to w4, are split across M=3 mappers; each mapper emits partial (word, count) pairs, which are partitioned between R=2 reducers that sum them into the final per-word totals.]
  • Extra functions
  • LocalityMove computation near the data: The master tries to have a task executed on a worker that is as “near” as possible to the input data, thus reducing the bandwidth usage How does the master know?
  • Task distributionThe number of tasks is usually higher than the number of the available workersOne worker can execute more than one tasksThe balance of work load is improved. In the case of a single worker failure there is faster recovery and redistribution of tasks to other nodes.
  • Redundant task executionsSome tasks can be delayed, resulting in a delay in the overall work executionThe solution to the problem is the creation of task copies that can be executed in parallel from 2 or more different workers (speculative execution)A task is considered complete when the master is informed about its completion by at least one node.
  • PartitioningA user can specify a custom function that will partition the tasks during shuffling.The type of input and output data can be defined by the user and has no limitation on what form it should have.
  • The input of a reducer is always sorted. There is also the possibility to execute tasks locally in a serial manner. The master provides web interfaces for monitoring task progress and for browsing HDFS.
  • When should I use it?Good choice for jobs that can be broken into parallelized jobs:  Indexing/Analysis of log files  Sorting of large data sets  Image processing• Bad choice for serial or low latency jobs: – Computation of number π with precision of 1,000,000 digits – Computation of Fibonacci sequence – Replacing MySQL
  • Use cases 1/3: ● Large-scale image conversions: 100 Amazon EC2 instances, 4 TB of raw TIFF data, 11 million PDFs in 24 hours for $240. ● Internal log processing; reporting, analytics and machine learning; a cluster of 1110 machines, 8800 cores and 12 PB of raw storage; open-source contributors (Hive). ● Storing and processing tweets, logs, etc.; open-source contributors (hadoop-lzo); large-scale machine learning.
  • Use cases 2/3: ● 100,000 CPUs in 25,000 computers; content/ads optimization, search index; machine learning (e.g. spam filtering); open-source contributors (Pig). ● Natural language search (through Powerset); 400 nodes in EC2, storage in S3; open-source contributors (!) to HBase. ● The ElasticMapReduce service: on-demand elastic Hadoop clusters for the cloud.
  • Use cases 3/3: ● ETL processing, statistics generation, advanced algorithms for behavioral analysis and targeting. ● Used for discovering People You May Know and for other apps; 3×30-node clusters, 16 GB RAM and 8 TB storage. ● A leading Chinese-language search engine: search log analysis, data mining, 300 TB per week, 10- to 500-node clusters.
  • Amazon ElasticMapReduce (EMR): a hosted Hadoop-as-a-service solution provided by AWS. No need for management or tuning of Hadoop clusters: ● upload your input data to, and store your output data on, S3 ● provision as many EC2 instances as you need and pay only for the time you use them. Hive and Pig support makes it easy to write data-analysis scripts; Java, Perl, Python, PHP and C++ can be used for more sophisticated algorithms. Integrates with DynamoDB (process combined datasets in S3 & DynamoDB). Support for HBase (NoSQL).
  • Questions