Hadoop & MapReduce
          Dr. Ioannis Konstantinou
      http://www.cslab.ntua.gr/~ikons


           AWS Usergroup Greece
               18/07/2012


        Computing Systems Laboratory
 School of Electrical and Computer Engineering
    National Technical University of Athens
Big Data
• 90% of today's data was created in the last 2 years
• A "Moore's law" for data: volume doubles every 18 months
• YouTube: 13 million hours of video and 700 billion views in 2010
• Facebook: 20 TB/day of new data (compressed)
• CERN/LHC: 40 TB/day (15 PB/year)
• Many more examples: web logs, presentation files, medical files, etc.
Problem: Data explosion
• 1 EB (exabyte = 10^18 bytes) = 1000 PB (petabyte = 10^15 bytes):
  the data traffic of mobile telephony in the USA in 2010
• 1.2 ZB (zettabyte = 10^21 bytes) = 1200 EB:
  the total volume of digital data in 2010
• 35 ZB: the estimated volume of total digital data in 2020
Solution: scalability
How? By dividing the work across many machines.
[Image: the IBM Roadrunner supercomputer. Source: Wikipedia]
Divide and Conquer
[Diagram: a "Problem" is partitioned into work units w1, w2, w3; each unit is
processed by a "worker", producing partial results r1, r2, r3, which are
combined into the final "Result".]
Parallelization challenges
• How do we assign units of work to the workers?
• What if there are more units of work than workers?
• What if the workers need to share intermediate, incomplete data?
• How do we aggregate such intermediate data?
• How do we know when all workers have completed their assignments?
• What if some workers fail?
What is MapReduce?
• A programming model
• A programming framework
• Used to develop solutions that:
  – process large amounts of data in parallel
  – run on clusters of computing nodes
• Originally a closed-source implementation at Google
  – the '03 & '04 scientific papers describe the framework
• Hadoop: an open-source implementation of the algorithms described in
  those papers
  – http://hadoop.apache.org/
What is Hadoop?
• 2 large subsystems, 1 for data management & 1 for computation:
  – HDFS (Hadoop Distributed File System)
  – the MapReduce computation framework, which runs on top of HDFS
  – HDFS is essentially the I/O layer of Hadoop
• Written in Java: a set of Java processes running on multiple nodes
• Who uses it:
  – Yahoo!
  – Amazon
  – Facebook
  – Twitter
  – plus many more...
HDFS – distributed file system
• A scalable distributed file system for applications dealing with large data sets
  – Distributed: runs on a cluster
  – Scalable: 10K nodes, 100K files, 10 PB of storage
• Storage is presented as a single, seamless space across the whole cluster
• Files are broken into blocks
  – Typical block size: 128 MB
• Replication: each block is copied to multiple data nodes
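As a back-of-the-envelope illustration of blocks and replication, here is a
short Python sketch. The 128 MB block size matches the slide; the replication
factor of 3 is an assumed, commonly used default:

    # Back-of-the-envelope: how a file is laid out in HDFS.
    # Assumes a 128 MB block size and replication factor 3 (common default).

    BLOCK_SIZE = 128 * 1024 * 1024   # bytes
    REPLICATION = 3

    def hdfs_layout(file_size_bytes):
        """Return (number of logical blocks, total physical copies stored)."""
        blocks = -(-file_size_bytes // BLOCK_SIZE)   # ceiling division
        return blocks, blocks * REPLICATION

    # A 1 GB file -> 8 logical blocks, 24 physical block copies in the cluster.
    print(hdfs_layout(1 * 1024**3))   # (8, 24)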
Architecture of HDFS/MapReduce
• Master/slave scheme
  – HDFS: a central NameNode administers multiple DataNodes
    – NameNode: holds metadata about which DataNodes hold which blocks
    – DataNodes: "dumb" servers that store raw file chunks
  – MapReduce: a central JobTracker administers multiple TaskTrackers
• NameNode and JobTracker run on the master node
• DataNode and TaskTracker run on the slave nodes
MapReduce
The problem is broken down into 2 phases:
• Map: non-overlapping subsets of the input data (<key, value> records) are
  assigned to different processes (mappers), each of which produces a set of
  intermediate <key, value> results
• Reduce: the output of the Map phase is fed to a (typically smaller) number
  of processes (reducers) that aggregate the intermediate results into a
  smaller set of <key, value> records
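To make the two phases concrete before the Hadoop specifics, here is a toy
single-process simulation in Python; the squaring and summing functions are
arbitrary examples, not part of MapReduce itself:

    from collections import defaultdict

    # Toy simulation of the two phases: map turns each input record into
    # intermediate (key, value) pairs, the "shuffle" groups them by key,
    # and reduce aggregates each group.

    def map_fn(record):
        key, value = record
        yield key, value * value        # example: square each value

    def reduce_fn(key, values):
        yield key, sum(values)          # example: sum the group

    inputs = [("a", 1), ("b", 2), ("a", 3)]

    groups = defaultdict(list)
    for record in inputs:               # map phase
        for k, v in map_fn(record):
            groups[k].append(v)         # shuffle: group pairs by key

    results = [out for k in sorted(groups)          # reduce phase
                   for out in reduce_fn(k, groups[k])]
    print(results)                      # [('a', 10), ('b', 4)]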
How does it work?
Initialization phase
• The input is uploaded to HDFS and split into pieces of fixed size
• Each TaskTracker node that participates in the computation executes a copy
  of the MapReduce program
• One of the nodes plays the JobTracker (master) role: it assigns tasks to
  the rest (the workers). Tasks are either map tasks or reduce tasks.
JobTracker (Master)
• The JobTracker holds data about:
  – the status of tasks
  – the location of input, output and intermediate data (it runs together
    with the NameNode, the HDFS master)
• The master is responsible for scheduling the execution of work tasks.
TaskTracker (Slave)
• The TaskTracker runs the tasks assigned to it by the master
• It runs on the same node as a DataNode (the HDFS slave)
• A task is either a Map task or a Reduce task
• Typically the maximum number of concurrent tasks a node may run equals the
  number of CPU cores it has (for optimal CPU utilization)
Map task
A worker (TaskTracker) that has been assigned a map task:
• Reads the relevant input data (its input split) from HDFS and parses it
  into <key, value> pairs, which are passed as input to the map function.
• The map function processes the pairs and produces intermediate pairs,
  which are buffered in memory.
• Periodically, a partition function is executed that spills the
  intermediate key-value pairs to the node's local storage, grouping them
  into R sets (one per reducer). This function is user-definable; a sketch
  of the default follows below.
• When the partition function has stored the key-value pairs, it informs the
  master that the task is complete and where the data are stored.
• The master forwards this information to the workers that run the reduce
  tasks.
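A minimal sketch of the default partitioning rule, hashing each key into one
of the R sets. Hadoop's real implementation is the Java HashPartitioner
class, so this Python version is only the idea:

    # Hash each intermediate key into one of R buckets, so every occurrence
    # of the same key lands on the same reducer. (Python salts str hashes
    # per process; within one run this stays consistent, which is all a
    # single job needs.)

    def partition(key, num_reducers):
        return (hash(key) & 0x7FFFFFFF) % num_reducers

    R = 2
    for key in ["w1", "w2", "w3", "w4"]:
        print(key, "-> set", partition(key, R))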
Reduce task
A worker that has been assigned a reduce task:
• Fetches from every completed map task the intermediate pairs that belong
  to it, at the locations indicated by the master.
• Once all intermediate pairs have been retrieved, sorts them by key;
  entries with the same key are grouped together. (A sketch of this
  sort-and-group step follows below.)
• Executes the reduce function with the pairs <key, group_of_values>
  produced by the previous step as input.
• The reduce function processes the input data and produces the final pairs.
• The output pairs are appended to a file in the local file system; when the
  reduce task completes, the file is made available in the distributed file
  system.
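The sort-and-group step is easy to picture with Python's itertools.groupby.
This is a local illustration only; a real reducer performs an external merge
sort over the fetched files:

    from itertools import groupby
    from operator import itemgetter

    # Sort the fetched intermediate pairs by key, then hand each key with
    # its grouped values to the reduce function.

    fetched = [("w2", 3), ("w1", 2), ("w2", 4), ("w1", 3)]

    fetched.sort(key=itemgetter(0))                 # sort by key
    for key, group in groupby(fetched, key=itemgetter(0)):
        values = [v for _, v in group]              # group_of_values
        print(key, sum(values))                     # reduce: here, a sum
    # w1 5
    # w2 7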
Task Completion
• When a worker has completed its task, it informs the master.
• When all workers have reported completion, the master returns control to
  the user's original program.
Example
[Diagram: the Master coordinates the job. The Input is split into Part 1,
Part 2 and Part 3; each part is processed by a Map worker. The intermediate
data are shuffled to Reduce workers, which together produce the Output.]
Example: Word count 1/3
• Objective: measure the frequency of each word in a large set of documents
• Potential use case: discovering popular URLs in a set of web-server log files
• Implementation plan:
  – "Upload" the documents to HDFS
  – Write a map function
  – Write a reduce function
  – Run the MapReduce job
  – Retrieve the results
Example: Word count 2/3
map(key, value):
// key: document name; value: text of document
    for each word w in value:
        emit(w, 1)

reduce(key, values):
// key: a word; values: an iterator over counts
    result = 0
    for each count v in values:
        result += v
    emit(key, result)
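For a runnable version, here is the same logic as a pair of Hadoop
Streaming-style scripts in one file: the mapper reads lines from stdin and
emits tab-separated word/count pairs, and the reducer sums counts for
consecutive identical keys (Streaming delivers the reducer's input already
sorted by key). The file name wordcount.py is just an illustration:

    #!/usr/bin/env python
    # wordcount.py -- word count in the Hadoop Streaming style.
    # Try it locally with a pipe standing in for the shuffle:
    #   cat docs.txt | python wordcount.py map | sort | python wordcount.py reduce
    import sys

    def mapper(stream):
        for line in stream:
            for word in line.split():
                print("%s\t%d" % (word, 1))            # emit(w, 1)

    def reducer(stream):
        current, total = None, 0
        for line in stream:                            # input sorted by key
            word, count = line.rsplit("\t", 1)
            if word != current:
                if current is not None:
                    print("%s\t%d" % (current, total)) # emit(key, result)
                current, total = word, 0
            total += int(count)
        if current is not None:
            print("%s\t%d" % (current, total))

    if __name__ == "__main__":
        mapper(sys.stdin) if sys.argv[1] == "map" else reducer(sys.stdin)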
Example: Word count 3/3
[Diagram: ten input documents d1–d10 are split among M=3 mappers. Each
mapper emits partial per-word counts such as (w1,2), (w2,3), (w3,2), (w4,3).
The shuffle routes the pairs for w1 and w2 to one of R=2 reducers and those
for w3 and w4 to the other; each reducer sums its partial counts into the
final total for each word.]
Extra features
Locality
• Move computation near the data: the master tries to schedule a task on a
  worker that is as "near" as possible to its input data, reducing network
  bandwidth usage.
• How does the master know? The JobTracker runs alongside the NameNode,
  which holds the block-to-DataNode mapping for every file.
Task distribution
• The number of tasks is usually higher than the number of available workers
• One worker can execute more than one task
• This improves load balancing, and if a single worker fails, its tasks can
  be recovered and redistributed to other nodes faster.
Redundant task executions
• Some tasks may straggle, delaying the overall job execution
• The solution is to launch copies of such tasks to be executed in parallel
  by 2 or more different workers (speculative execution)
• A task is considered complete as soon as the master is informed of its
  completion by at least one node.
Partitioning
• A user can specify a custom function to partition the intermediate keys
  during shuffling; an example follows below.
• The types of the input and output data are user-defined and are not
  restricted to any particular form.
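As an illustration of why one might supply a custom partition function, here
is a hypothetical range partitioner sketched in Python: keys are routed by
their first letter rather than by hash, so concatenating the reducers'
(individually sorted) outputs yields one globally sorted result. In real
Hadoop this would be a subclass of the Java Partitioner class:

    # Hypothetical range partitioner: spread keys over reducers by their
    # first letter, so reducer outputs can be concatenated into one
    # globally sorted file.

    def range_partition(key, num_reducers):
        first = key[0].lower()
        bucket = (ord(first) - ord("a")) * num_reducers // 26
        return max(0, min(bucket, num_reducers - 1))   # clamp odd keys

    for key in ["apple", "melon", "night", "zebra"]:
        print(key, "-> reducer", range_partition(key, 2))
    # apple -> 0, melon -> 0, night -> 1, zebra -> 1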
• The input of a reducer is always sorted by key
• Tasks can also be executed locally, in a serial manner (useful for debugging)
• The master provides web interfaces for:
  – monitoring task progress
  – browsing HDFS
When should I use it?
• Good choice for jobs that can be broken into parallel sub-tasks:
  – indexing/analysis of log files
  – sorting of large data sets
  – image processing
• Bad choice for serial or low-latency jobs:
  – computing the number π to a precision of 1,000,000 digits
  – computing the Fibonacci sequence
  – replacing MySQL
Use cases 1/3
• The New York Times: large-scale image conversions
  – 100 Amazon EC2 instances, 4 TB of raw TIFF data
  – 11 million PDFs in 24 hours, for $240
• Facebook: internal log processing
  – reporting, analytics and machine learning
  – cluster of 1110 machines, 8800 cores and 12 PB of raw storage
  – open-source contributors (Hive)
• Twitter: storing and processing tweets, logs, etc.
  – open-source contributors (hadoop-lzo)
  – large-scale machine learning
Use cases 2/3
• Yahoo!: 100,000 CPUs in 25,000 computers
  – content/ads optimization, search index
  – machine learning (e.g. spam filtering)
  – open-source contributors (Pig)
• Microsoft: natural-language search (through Powerset)
  – 400 nodes in EC2, storage in S3
  – open-source contributors (!) to HBase
• Amazon: the ElasticMapReduce service
  – on-demand elastic Hadoop clusters for the cloud
Use cases 3/3
• LinkedIn: ETL processing, statistics generation
  – advanced algorithms for behavioral analysis and targeting
  – used for discovering "People You May Know" and for other apps
  – 3 x 30-node clusters, 16 GB RAM and 8 TB storage
• Baidu: leading Chinese-language search engine
  – search log analysis, data mining
  – 300 TB per week
  – 10- to 500-node clusters
Amazon ElasticMapReduce (EMR)
• A hosted Hadoop-as-a-service solution provided by AWS
• No need to manage or tune Hadoop clusters yourself:
  – upload your input data to S3 and store your output data there too
  – procure as many EC2 instances as you need and pay only for the time you
    use them
• Hive and Pig support makes it easy to write data-analysis scripts
• Java, Perl, Python, PHP, C++ for more sophisticated algorithms
• Integrates with DynamoDB (process combined datasets in S3 & DynamoDB)
• Support for HBase (NoSQL)
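As a sketch of how a job was launched programmatically in that era, here is
a hedged example using the boto 2.x Python library; the bucket names and
scripts are placeholders, and the exact keyword arguments should be checked
against the boto documentation rather than taken as authoritative:

    # Launch a streaming word-count job flow on EMR with boto 2.x.
    # All S3 paths below are placeholders.

    from boto.emr import connect_to_region
    from boto.emr.step import StreamingStep

    conn = connect_to_region("us-east-1")

    step = StreamingStep(
        name="wordcount",
        mapper="s3n://my-bucket/wordcount-mapper.py",   # hypothetical scripts
        reducer="s3n://my-bucket/wordcount-reducer.py",
        input="s3n://my-bucket/input/",
        output="s3n://my-bucket/output/",
    )

    jobflow_id = conn.run_jobflow(
        name="wordcount demo",
        log_uri="s3n://my-bucket/logs/",
        steps=[step],
        num_instances=4,                                # 1 master + 3 slaves
        master_instance_type="m1.small",
        slave_instance_type="m1.small",
    )
    print("Started job flow:", jobflow_id)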
Questions
