Guest Lecture Eindhoven University of Technology
                                      Notes on Data-Intensive Processing
                                                with Hadoop MapReduce
                                                                                                 Evert Lammerts
                                                                                                   May 30, 2012




Image source: http://valley-of-the-shmoon.blogspot.com/2011/04/pushing-elephant-up-stairs.html
To start with...

●   About me
●   Note on this lecture
    ●   Adapted from Jimmy Lin's Cloud Computing course...
        http://www.umiacs.umd.edu/~jimmylin/cloud-2010-Spring/index.html
    ●   … and from Jimmy's slide deck from the SIKS Big Data course and his talk at UvA
        http://www.umiacs.umd.edu/~jimmylin/
    ●   Today's slides available at
        http://www.slideshare.net/evertlammerts
●   About you
    ●   Big Data?
    ●   Cloud computing?
    ●   Supercomputing?
    ●   Hadoop and / or MapReduce?
The lecture

●   Why “Big Data”?
●   How “Big Data”?

●   MapReduce
●   Implementations
Why “Big Data”?




The Economist, Feb 25th 2010
1. Science

●   The emergence of the 4th paradigm
    ●   http://research.microsoft.com/en-us/collaboration/fourthparadigm/
    ●   CERN stores 15 PB of LHC data per year, only a fraction of the data
        actually produced
    ●   Square Kilometer Array expectation: 10 PB / hour




                        Adapted from (Jimmy Lin, University of Maryland / Twitter, 2011)
2. Engineering

        ●      Count and normalize




http://infrawatch.liacs.nl/




                               Adapted from (Jimmy Lin, University of Maryland / Twitter, 2011)
3. Commerce

●   Know thy customers
●   Data → Insights → Competitive advantages
    ●   Google was processing 20 PB each day... in 2008!
    ●   Facebook collected 25 TB of HTTP logs each day... in 2009!
    ●   eBay had ~9 PB of user data, and a growth rate of more than 50 TB /
        day in 2011




                       Adapted from (Jimmy Lin, University of Maryland / Twitter, 2011)
IEEE Intelligent Systems, March/April 2009
s/knowledge/data/g




  Jimmy Lin, University of Maryland / Twitter, 2011
Also see

●   P. Russom, Big Data Analytics, The Data Warehousing Institute, 2011
●   James G. Kobielus, The Forrester Wave™: Enterprise Hadoop
    Solutions, Forrester Research, 2012
●   James Manyika et al., Big data: The next frontier for innovation,
    competition, and productivity, McKinsey Global Institute, 2011
●   Dirk de Roos et al., Understanding Big Data: Analytics for Enterprise
    Class Hadoop and Streaming Data, IBM, 2011


    Etcetera
How “Big Data”?
Divide and Conquer



                           “Work”
                                                                    Partition


  w1                           w2                              w3

“worker”                    “worker”                     “worker”


  r1                            r2                             r3




                          “Result”                                  Combine




           Jimmy Lin, University of Maryland / Twitter, 2011
Amdahl's Law
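A minimal sketch of what this slide title refers to, assuming the usual formulation of Amdahl's Law: speedup = 1 / ((1 - p) + p / n), where p is the fraction of the work that can run in parallel and n is the number of workers. The function name and the example numbers are illustrative only.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Upper bound on speedup when only part of the work can run in parallel."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    serial_fraction = 1.0 - parallel_fraction
    # The serial part is not sped up; the parallel part is divided over the workers.
    return 1.0 / (serial_fraction + parallel_fraction / workers)


# Illustrative numbers: 95% of the work is parallelizable.
speedup_1000 = amdahl_speedup(0.95, 1000)
```

With 95% parallelizable work, 1000 workers give a speedup of about 19.6, and no number of workers can push it to 20: the serial 5% dominates.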
Challenges in Parallel systems

●   How do we divide the work into separate tasks?
●   How do we get these tasks to our workers?
●   What if we have more tasks than workers?
●   What if our tasks need to exchange information?
●   What if workers crash? (That's no exception!)
●   How do we aggregate results?
Managing Parallel Applications

●   A synchronization mechanism is needed
    ●   To coordinate communication (like exchanging state) between workers
    ●   To manage access to shared resources like data

●   What if you don't?
    ●   Mutual Exclusion
    ●   Resource Starvation
    ●   Race Conditions
    ●   Dining philosophers, sleeping barber, cigarette smokers, readers-writers,
        producers-consumers, etcetera



                      Managing parallelism is hard!
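To make the race-condition bullet concrete, here is a small sketch using Python threads. The Counter class and run_workers helper are hypothetical names for illustration; the point is only that a shared read-modify-write needs mutual exclusion.

```python
import threading


class Counter:
    """A shared counter whose increments are protected by a lock."""

    def __init__(self) -> None:
        self.value = 0
        self._lock = threading.Lock()

    def increment(self) -> None:
        # Without the lock, the read-modify-write below could interleave
        # between threads and lose updates (a race condition).
        with self._lock:
            self.value += 1


def run_workers(num_workers: int, increments_per_worker: int) -> int:
    """Start the workers, wait for all of them, and return the final count."""
    counter = Counter()

    def work() -> None:
        for _ in range(increments_per_worker):
            counter.increment()

    threads = [threading.Thread(target=work) for _ in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter.value


final_count = run_workers(4, 1000)
```

Every shared resource needs this kind of care, and the programmer has to get it right everywhere. That is the burden the rest of the lecture tries to move out of application code.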
Source: Ricardo Guimarães Herrmann
Well known tools and patterns

●   Programming models
    ●   Shared memory (pthreads)
    ●   Message passing (MPI)
●   Design patterns
    ●   Master-slave
    ●   Producer-consumer
    ●   Shared queues

[Figure: shared memory (P1 to P5 attached to one memory) versus message passing (P1 to P5); a master with its slaves; producers and consumers around a work queue]

                            Jimmy Lin, University of Maryland / Twitter, 2011
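The producer-consumer and shared-queue patterns can be sketched with Python's standard queue and threading modules. This is an illustrative toy, not part of the original slides; run_pipeline and the squaring "work" are assumptions made for the example.

```python
import queue
import threading


def run_pipeline(items, num_consumers=3):
    """One producer puts items on a shared work queue; consumers square them."""
    work_queue = queue.Queue()      # thread-safe FIFO shared by all threads
    results = []
    results_lock = threading.Lock()
    stop = object()                 # sentinel telling a consumer to exit

    def producer():
        for item in items:
            work_queue.put(item)
        # One sentinel per consumer, queued after all real work.
        for _ in range(num_consumers):
            work_queue.put(stop)

    def consumer():
        while True:
            item = work_queue.get()
            if item is stop:
                break
            with results_lock:
                results.append(item * item)

    threads = [threading.Thread(target=producer)]
    threads += [threading.Thread(target=consumer) for _ in range(num_consumers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return sorted(results)


squares = run_pipeline([1, 2, 3, 4])
```

Even in this small example the programmer has to handle shutdown (the sentinel) and protect the shared result list, which illustrates why these patterns are useful but still low-level.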
From Von Neumann...




http://www.lrr.in.tum.de/~jasmin/neumann.html
… to a datacenter
Where to go from here

●   The search for the right level of abstraction
    ●   How do we build an architecture for a scaled environment?
    ●   From HAL to DCAL

●   Hiding parallel application management from the developer
    ●   It's hard!

●   Separating the what from the how
    ●   The developer specifies the computation
    ●   The runtime environment handles the execution




           Barroso, 2009
Ideas on scaling

●   Scale “out”, don't scale “up”
    ●   Hard upper-bound on the capacity of a single machine
    ●   No upper-bound on the number of machines you can buy (in theory)

●   When dealing with large data...
    ●   Prefer sequential reads over random reads,
        and store a million big files rather than a trillion small ones
         –   Disk access is slow, but throughput is reasonable!
    ●   Try to understand when a NAS / SAN architecture is really necessary
         –   It's expensive to scale!
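A back-of-the-envelope sketch of why sequential reads win. The seek time (10 ms) and throughput (100 MB/s) are assumed round numbers for a spinning disk, not figures from the lecture; the function names are illustrative.

```python
SEEK_TIME_S = 0.010            # assumed average seek plus rotational latency
TRANSFER_RATE_BPS = 100e6      # assumed sustained throughput, bytes per second


def random_read_seconds(num_records: int, record_bytes: int) -> float:
    """Each record costs one seek plus the time to transfer it."""
    return num_records * (SEEK_TIME_S + record_bytes / TRANSFER_RATE_BPS)


def sequential_read_seconds(total_bytes: int) -> float:
    """One seek, then stream everything at full throughput."""
    return SEEK_TIME_S + total_bytes / TRANSFER_RATE_BPS


# 1 GB of data, read either as 10 million 100-byte records or as one stream.
random_s = random_read_seconds(10_000_000, 100)
sequential_s = sequential_read_seconds(10**9)
```

Under these assumptions the random reads take roughly 28 hours, while streaming the same gigabyte takes roughly 10 seconds.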
MapReduce
An abstraction of typical large-data problems

(1) Iterate over a large number of records
(2) Extract something of interest from each
(3) Shuffle and sort intermediate results
(4) Aggregate intermediate results
(5) Generate final output
An abstraction of typical large-data problems

(1) Iterate over a large number of records
(2) Extract something of interest from each     (Map)
(3) Shuffle and sort intermediate results
(4) Aggregate intermediate results              (Reduce)
(5) Generate final output


   MapReduce provides a functional abstraction of step 2 and step 4
Roots in functional programming

Map(S: array, f())
●   Apply f(s ∈ S) for all items in S


Fold(S: array, f())
●   Recursively apply f() to each item in S and the result of the previous
    operation, or nil if such an operation does not exist




                                  Source: Wikipedia
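As a minimal sketch, Map and Fold as defined above can be written in Python with a list comprehension and functools.reduce. The names map_items and fold are illustrative, and an explicit initial value stands in for the "nil" case.

```python
from functools import reduce


def map_items(items, f):
    """Apply f to every item independently of all the others."""
    return [f(s) for s in items]


def fold(items, f, initial):
    """Combine items left to right, feeding each result into the next call."""
    return reduce(f, items, initial)


squares = map_items([1, 2, 3, 4], lambda x: x * x)    # [1, 4, 9, 16]
total = fold(squares, lambda acc, x: acc + x, 0)      # 30
```

Because each call to f in Map depends on one item only, the calls can run on different machines. Fold needs the previous result, so it is the step where results come together.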
MapReduce

The programmer specifies two functions:
●   map(k, v) → <k', v'>*
●   reduce(k', v'[ ]) → <k', v'>*
       All values associated with the same key are sent to the same reducer


The execution framework handles everything else
k1 v1 k2 v2 k3 v3 k4 v4 k5 v5 k6 v6




 map                 map                    map                       map


a 1    b 2        c 3     c 6           a 5     c 2             b 7     c 8

      Shuffle and Sort: aggregate values by keys
             a    1 5              b    2 7              c    2 3 6 8




        reduce                reduce                reduce


          r1 s1                 r2 s2                 r3 s3




                  Jimmy Lin, University of Maryland / Twitter, 2011
MapReduce “Hello World”: WordCount

●   Question: how often does each unique word occur in a given text?
    ●   Line-based input (a record is one line)
    ●   Key: position of first character in the whole document
    ●   Value: a line not including the EOL character
    ●   Input looks like:
           Key: 0,     value: “a wise old owl lived in an oak”
           Key: 31,    value: “the more he saw the less he spoke”
           Key: 65,    value: “the less he spoke the more he heard”
           Key: 101,   value: “why can't we all be like that wise old bird”
    ●   Output looks like:
           (a,1)            (an,1)       (be,1)
           (he,4)           (in,1)       (we,1)
           (all,1)          (oak,1)      (old,2)
           (owl,1)          (saw,1)      (the,4)
           (why,1)          (bird,1)     (less,2)
           (like,1)         (more,2)     (that,1)
           (wise,2)         (can't,1)    (heard,1)
           (lived,1)        (spoke,2)
MapReduce “Hello World”: WordCount
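A single-process sketch of WordCount in Python, assuming the model described above: map emits (word, 1) per word, the framework groups values by key, and reduce sums them. This is not Hadoop code. run_wordcount is a hypothetical stand-in for the execution framework, and it assumes a one-byte EOL character when computing the line offsets.

```python
from collections import defaultdict


def map_fn(offset, line):
    """map(k, v): emit (word, 1) for every word on the line."""
    for word in line.split():
        yield word, 1


def reduce_fn(word, counts):
    """reduce(k', v'[]): sum all counts for one word."""
    yield word, sum(counts)


def run_wordcount(text):
    # Build (byte offset, line) records, as line-based input would.
    records, offset = [], 0
    for line in text.split("\n"):
        records.append((offset, line))
        offset += len(line.encode("utf-8")) + 1  # +1 for the EOL character

    # Map phase: every record is processed independently.
    intermediate = [pair for k, v in records for pair in map_fn(k, v)]

    # Shuffle and sort: group all values that share a key.
    groups = defaultdict(list)
    for word, count in intermediate:
        groups[word].append(count)

    # Reduce phase: one call per key, in sorted key order.
    output = {}
    for word in sorted(groups):
        for k, v in reduce_fn(word, groups[word]):
            output[k] = v
    return output


TEXT = (
    "a wise old owl lived in an oak\n"
    "the more he saw the less he spoke\n"
    "the less he spoke the more he heard\n"
    "why can't we all be like that wise old bird"
)
counts = run_wordcount(TEXT)
```

The map function ignores its key (the offset), which is common in practice: the key is there because the input format supplies one, not because the computation needs it.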
MapReduce

The programmer specifies two functions:
●   map(k, v) → <k', v'>*
●   reduce(k', v'[ ]) → <k', v'>*
       All values associated with the same key are sent to the same reducer


The “execution framework” handles... everything else?
MapReduce execution framework

●   Handles scheduling
    ●   Assigns map and reduce tasks to workers
    ●   Handles “data-awareness”: moves processes to data
●   Handles synchronization
    ●   Gathers, sorts, and shuffles intermediate data
●   Handles errors and faults
    ●   Detects worker failures and restarts the failed tasks
●   Handles communication with the distributed filesystem
MapReduce

The programmer specifies two functions:
●   map (k, v) → <k', v'>*
●   reduce (k', v'[ ]) → <k', v'>*
        All values associated with the same key are sent to the same reducer


The execution framework handles everything else...
Not quite... usually, programmers also specify:
●   partition (k', number of partitions) → partition for k'
    ●   Often a simple hash of the key, e.g., hash(k') mod n
    ●   Divides up key space for parallel reduce operations
●   combine (k', v') → <k', v'>*
    ●   Mini-reducers that run in memory after the map phase
    ●   Used as optimization to reduce network traffic
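A small sketch of both functions in Python. It assumes a summing job such as WordCount, where the combiner can reuse the reduce logic because addition is associative and commutative. zlib.crc32 is used as the hash because Python's built-in hash() for strings varies between runs. The pairs (c, 3) and (c, 6) are taken from the diagram that follows.

```python
import zlib
from collections import defaultdict


def partition(key: str, num_partitions: int) -> int:
    """Pick the reducer for a key: stable hash(k') mod n."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def combine(pairs):
    """Mini-reducer: sum the values per key within one mapper's output."""
    sums = defaultdict(int)
    for key, value in pairs:
        sums[key] += value
    return sorted(sums.items())


# Output of one mapper in which the key c appears twice.
mapper_output = [("c", 3), ("c", 6)]
combined = combine(mapper_output)   # only one pair per key leaves the node

# Route every combined pair to one of three reduce partitions.
by_partition = defaultdict(list)
for key, value in combined:
    by_partition[partition(key, 3)].append((key, value))
```

Since the partition depends only on the key, every mapper sends a given key to the same reducer, which is what guarantees that all values for one key meet in one place.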
k1 v1 k2 v2 k3 v3 k4 v4 k5 v5 k6 v6




  map                   map                   map                        map


a 1    b 2           c 3     c 6            a 5    c 2             b 7     c 8

 combine              combine                combine                 combine



a 1    b 2                 c 9              a 5    c 2             b 7     c 8

 partition             partition             partition               partition

      Shuffle and Sort: aggregate values by keys
               a     1 5              b     2 7             c     2 9 8




         reduce                    reduce                reduce


             r1 s1                  r2 s2                 r3 s3




                     Jimmy Lin, University of Maryland / Twitter, 2011
Quick note...

The term “MapReduce” can refer to:
●   The programming model
●   The “execution framework”
●   The specific implementation
Implementation(s)
MapReduce implementations

●   Google (C++)
    ●   Dean & Ghemawat, MapReduce: simplified data processing on large
        clusters, 2004
    ●   Ghemawat, Gobioff, Leung, The Google File System, 2003
●   Apache Hadoop (Java)
    ●   Open source implementation
    ●   Originally led by Yahoo!
    ●   Broadly adopted
●   Custom research implementations
    ●   For GPUs, supercomputers, etcetera
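[Figure: MapReduce execution overview]

(1) The user program submits the job to the master
(2) The master schedules map tasks and reduce tasks on the workers
(3) Map workers read the input splits (split 0 to split 4)
(4) Map workers write intermediate files to their local disk
(5) Reduce workers read the intermediate files remotely
(6) Reduce workers write the output files (output file 0, output file 1)

Input files → Map phase → Intermediate files (on local disk) → Reduce phase → Output files
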
                     Jimmy Lin, Adapted from (Dean and Ghemawat, OSDI 2004)
How do we get our input data to the map()'s on the workers?



Distributed File System

●   Don't move data to the workers... move workers to the data!
    ●   Store data on the local disks of nodes in the cluster
    ●   Start up the work on the node that holds the data locally

●   A distributed file system is the answer
    ●   GFS (Google File System) for Google's MapReduce
    ●   HDFS (Hadoop Distributed File System) for Hadoop
GFS: Design decisions

●   Files stored as chunks
    ●   Fixed size (64MB)
●   Reliability through replication
    ●   Each chunk replicated across 3+ chunkservers
●   Single master to coordinate access, keep metadata
    ●   Simple centralized management
●   No data caching
    ●   Little benefit due to large datasets, streaming reads
●   Simplify the API
    ●   Push some of the issues onto the client (e.g., data layout)


               HDFS = GFS clone (same basic ideas)

                         Jimmy Lin, Adapted from (Ghemawat, SOSP 2003)
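The first two design decisions can be illustrated with a few lines of Python. The 64 MB chunk size and the replication factor of 3 come from the slide. The round-robin placement and the server names are assumptions made for this sketch, not the actual GFS placement policy.

```python
CHUNK_SIZE = 64 * 1024 * 1024   # fixed chunk size (64 MB)
REPLICATION = 3                 # replicas per chunk


def num_chunks(file_size_bytes: int) -> int:
    """Number of fixed-size chunks needed to hold the file (integer ceiling)."""
    return (file_size_bytes + CHUNK_SIZE - 1) // CHUNK_SIZE


def place_replicas(file_size_bytes: int, servers: list) -> dict:
    """Assign each chunk to REPLICATION distinct servers, round-robin.

    Round-robin is an illustrative assumption, not the GFS policy.
    """
    if len(servers) < REPLICATION:
        raise ValueError("need at least as many servers as replicas")
    placement = {}
    for chunk in range(num_chunks(file_size_bytes)):
        placement[chunk] = [
            servers[(chunk + r) % len(servers)] for r in range(REPLICATION)
        ]
    return placement


# A 200 MB file on four hypothetical chunkservers.
layout = place_replicas(200 * 1024 * 1024, ["s1", "s2", "s3", "s4"])
```

A 200 MB file needs four chunks, the last one only partly filled. Losing any single server still leaves two replicas of every chunk.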
From GFS to HDFS

●   Terminology differences:
    ●   GFS Master = Hadoop NameNode
    ●   GFS Chunkserver = Hadoop DataNode
    ●   Chunk = Block
●   Functional differences
    ●   File append support in HDFS is relatively new
    ●   HDFS performance is (likely) slower
    ●   Blocksize is configurable by the client




                      We use Hadoop terminology
HDFS Architecture


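[Figure: HDFS architecture]

●   The application talks to an HDFS client
●   Client → namenode: (file name, block id)
●   Namenode → client: (block id, block location)
●   The namenode holds the file namespace, e.g., /foo/bar → block 3df2
●   Namenode → datanodes: instructions to datanode
●   Datanodes → namenode: datanode state
●   Client → datanode: (block id, byte range)
●   Datanode → client: block data
●   Each HDFS datanode stores its blocks on the local Linux file system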




                           Jimmy Lin, Adapted from (Ghemawat, SOSP 2003)
Namenode Responsibilities

●   Managing the file system namespace:
    ●   Holds file/directory structure, metadata, file-to-block mapping, access
        permissions, etcetera
●   Coordinating file operations
    ●   Directs clients to DataNodes for reads and writes
    ●   No data is moved through the NameNode
●   Maintaining overall health:
    ●   Periodic communication with the DataNodes
    ●   Block re-replication and rebalancing
    ●   Garbage collection
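A toy model of the read path in Python, showing that the NameNode hands out metadata only while the data itself comes from the DataNodes. The classes are hypothetical and have nothing to do with the real HDFS API. The path /foo/bar and block id 3df2 come from the architecture figure; block 9a01 and the node names dn1 and dn2 are made up.

```python
class NameNode:
    """Holds metadata only: file-to-block mapping and block locations."""

    def __init__(self):
        self.file_blocks = {}      # file name -> list of block ids
        self.block_locations = {}  # block id -> list of datanode names

    def add_file(self, name, blocks):
        """blocks: list of (block id, [datanode names]) pairs."""
        self.file_blocks[name] = [block_id for block_id, _ in blocks]
        for block_id, nodes in blocks:
            self.block_locations[block_id] = list(nodes)

    def locate(self, name):
        """Return (block id, locations) per block; never returns file data."""
        return [(b, self.block_locations[b]) for b in self.file_blocks[name]]


class DataNode:
    """Stores the actual block contents."""

    def __init__(self):
        self.blocks = {}  # block id -> bytes

    def read(self, block_id):
        return self.blocks[block_id]


def read_file(name, namenode, datanodes):
    """Client read path: ask the namenode where, then fetch from datanodes."""
    data = b""
    for block_id, locations in namenode.locate(name):
        data += datanodes[locations[0]].read(block_id)  # use the first replica
    return data


dn1, dn2 = DataNode(), DataNode()
dn1.blocks["3df2"] = b"hello "
dn2.blocks["3df2"] = b"hello "
dn2.blocks["9a01"] = b"world"
nn = NameNode()
nn.add_file("/foo/bar", [("3df2", ["dn1", "dn2"]), ("9a01", ["dn2"])])
content = read_file("/foo/bar", nn, {"dn1": dn1, "dn2": dn2})
```

Keeping file data off the NameNode is what lets a single coordinating node serve a large cluster: it answers small metadata questions, and the heavy traffic flows between clients and DataNodes.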
Putting everything together



                     namenode                  job submission node


             namenode daemon                          jobtracker




   tasktracker                     tasktracker                        tasktracker

datanode daemon                 datanode daemon                   datanode daemon

 Linux file system               Linux file system                 Linux file system

                 …                                 …                                …
   slave node                      slave node                         slave node




                         Jimmy Lin, University of Maryland / Twitter, 2011
Questions?

The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?XfilesPro
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machinePadma Pradeep
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 

Recently uploaded (20)

From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other Frameworks
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?How to Remove Document Management Hurdles with X-Docs?
How to Remove Document Management Hurdles with X-Docs?
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Install Stable Diffusion in windows machine
Install Stable Diffusion in windows machineInstall Stable Diffusion in windows machine
Install Stable Diffusion in windows machine
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 

Notes on data-intensive processing with Hadoop MapReduce

  • 1. Guest Lecture Eindhoven University of Technology Notes on Data-Intensive Processing with Hadoop MapReduce Evert Lammerts May 30, 2012 Image source: http://valley-of-the-shmoon.blogspot.com/2011/04/pushing-elephant-up-stairs.html
  • 2. To start with... ● About me ● Note on this lecture ● Adapted from Jimmy Lin's Cloud Computing course... http://www.umiacs.umd.edu/~jimmylin/cloud-2010-Spring/index.html ● … and from Jimmy's slidedeck from the SIKS Big Data course and his talk at UvA http://www.umiacs.umd.edu/~jimmylin/ ● Today's slides available at http://www.slideshare.net/evertlammerts ● About you ● Big Data? ● Cloud computing? ● Supercomputing? ● Hadoop and / or MapReduce?
  • 3. The lecture ● Why “Big Data”? ● How “Big Data”? ● MapReduce ● Implementations
  • 4. Why “Big Data”? The Economist, Feb 25th 2010
  • 5. 1. Science ● The emergence of the 4th paradigm ● http://research.microsoft.com/en-us/collaboration/fourthparadigm/ ● CERN stores 15 PB LHC data per year, a fraction of the actual produced data ● Square Kilometer Array expectation: 10 PB / hour Adapted from (Jimmy Lin, University of Maryland / Twitter, 2011)
  • 6. 2. Engineering ● Count and normalize http://infrawatch.liacs.nl/ Adapted from (Jimmy Lin, University of Maryland / Twitter, 2011)
  • 7. 3. Commerce ● Know thy customers ● Data → Insights → Competitive advantages ● Google was processing 20 PB each day... in 2008! ● Facebook collected 25 TB of HTTP logs each day... in 2009! ● eBay had ~9 PB of user data, and a growth rate of more than 50 TB / day in 2011 Adapted from (Jimmy Lin, University of Maryland / Twitter, 2011)
  • 8. IEEE Intelligent Systems, March/April 2009
  • 9. s/knowledge/data/g Jimmy Lin, University of Maryland / Twitter, 2011
  • 10. Also see ● P. Russom, Big Data Analytics, The Data Warehousing Institute, 2011 ● James G. Kobielus, The Forrester Wave™: Enterprise Hadoop Solutions, Forrester Research, 2012 ● James Manyika et al., Big data: The next frontier for innovation, competition, and productivity, McKinsey Global Institute, 2011 ● Dirk de Roos et al., Understanding Big Data: Analytics for Enterprise Class Hadoop and Streaming Data, IBM, 2011 Etcetera
  • 12.
  • 13. Divide and Conquer [Figure: “Work” → Partition → w1, w2, w3 → one “worker” per part → r1, r2, r3 → Combine → “Result”] Jimmy Lin, University of Maryland / Twitter, 2011
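The partition / worker / combine flow of the figure can be sketched in a few lines of Python. This is only an illustration under assumptions: the function names `partition`, `worker` and `combine`, the choice of three threads, and the "sum of squares" workload are all made up for the example and do not come from the lecture.

```python
from concurrent.futures import ThreadPoolExecutor


def partition(work, n_parts):
    """Split a list into n_parts contiguous slices of near-equal size."""
    size, extra = divmod(len(work), n_parts)
    parts, start = [], 0
    for i in range(n_parts):
        end = start + size + (1 if i < extra else 0)
        parts.append(work[start:end])
        start = end
    return parts


def worker(part):
    """Process one partition; the hypothetical 'work' is summing squares."""
    return sum(x * x for x in part)


def combine(results):
    """Merge the partial results into the final result."""
    return sum(results)


work = list(range(1, 11))
parts = partition(work, 3)                      # w1, w2, w3
with ThreadPoolExecutor(max_workers=3) as pool:
    partial = list(pool.map(worker, parts))     # r1, r2, r3
result = combine(partial)
print(parts, partial, result)
```

Even this toy version already has to decide how to split the work and how to merge the results, which is where the challenges on the following slides come from.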
  • 15. Challenges in Parallel systems ● How do we divide the work into separate tasks? ● How do we get these tasks to our workers? ● What if we have more tasks than workers? ● What if our tasks need to exchange information? ● What if workers crash? (That's no exception!) ● How do we aggregate results?
  • 16. Managing Parallel Applications ● A synchronization mechanism is needed ● To coordinate communication (like exchanging state) between workers ● To manage access to shared resources like data ● What if you don't? ● Mutual Exclusion ● Resource Starvation ● Race Conditions ● Dining philosophers, sleeping barber, cigarette smokers, readers-writers, producers-consumers, etcetera Managing parallelism is hard!
  • 18. Well known tools and patterns ● Programming models ● Shared memory (pthreads) ● Message passing (MPI) ● Design patterns ● Master-slave ● Producer-consumer ● Shared queues [Figures: shared memory versus message passing between processes P1 to P5; a master feeding slaves through a work queue; producers and consumers around a shared queue] Jimmy Lin, University of Maryland / Twitter, 2011
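As a rough sketch of the master / work-queue pattern named on this slide, the snippet below uses Python's standard `queue` and `threading` modules. The number of workers, the `None` sentinel convention and the squaring task are assumptions chosen for the illustration.

```python
import queue
import threading

NUM_WORKERS = 3
tasks = queue.Queue()      # shared work queue, filled by the master
results = queue.Queue()    # workers report (task, result) pairs here


def worker():
    while True:
        item = tasks.get()
        if item is None:   # sentinel: the master has no more work
            break
        results.put((item, item * item))


threads = [threading.Thread(target=worker) for _ in range(NUM_WORKERS)]
for t in threads:
    t.start()

for n in range(10):        # the master produces ten tasks
    tasks.put(n)
for _ in threads:          # one sentinel per worker
    tasks.put(None)
for t in threads:
    t.join()

# All workers have finished, so exactly ten results are waiting.
squares = dict(results.get() for _ in range(10))
print(squares)
```

`queue.Queue` does its own locking, so the synchronization that the previous slides warn about is hidden inside the queue rather than written by hand.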
  • 20. … to a datacenter
  • 21.
  • 22. Where to go from here ● The search for the right level of abstraction ● How do we build an architecture for a scaled environment? ● From HAL to DCAL ● Hiding parallel application management from the developer ● It's hard! ● Separating the what from the how ● The developer specifies the computation ● The runtime environment handles the execution Barroso, 2009
  • 23. Ideas on scaling ● Scale “out”, don't scale “up” ● Hard upper-bound on the capacity of a single machine ● No upper-bound on the number of machines you can buy (in theory) ● When dealing with large data... ● Prefer sequential reads over random reads, and prefer storing a million big files over a trillion small ones – Disk access is slow, but throughput is reasonable! ● Try to understand when a NAS / SAN architecture is really necessary – It's expensive to scale!
  • 25. An abstraction of typical large-data problems (1) Iterate over a large number of records (2) Extract something of interest from each (3) Shuffle and sort intermediate results (4) Aggregate intermediate results (5) Generate final output
  • 26. An abstraction of typical large-data problems (1) Iterate over a large number of records (2) Extract something of interest from each [MAP] (3) Shuffle and sort intermediate results (4) Aggregate intermediate results [REDUCE] (5) Generate final output MapReduce provides a functional abstraction of step 2 and step 4
  • 27. Roots in functional programming Map(S: array, f()) ● Apply f(s ∈ S) for all items in S Fold(S: array, f()) ● Recursively apply f() to each item in S and the result of the previous operation, or nil if such an operation does not exist Source: Wikipedia
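In Python, the two functional primitives on this slide correspond to the built-in `map` and to `functools.reduce` (a left fold). The list `S` and the lambdas below are illustrative values only.

```python
from functools import reduce

S = [1, 2, 3, 4]

# Map: apply f to every item of S independently.
squared = list(map(lambda s: s * s, S))

# Fold: combine each item with the result of the previous step,
# starting from an initial value (0) because no previous result exists yet.
total = reduce(lambda acc, s: acc + s, squared, 0)

print(squared, total)
```

The map step has no dependencies between items, which is what makes it easy to parallelize; the fold step is where the results have to be brought together.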
  • 28. MapReduce The programmer specifies two functions: ● map(k, v) → <k', v'>* ● reduce(k', v'[ ]) → <k', v'>* All values associated with the same key are sent to the same reducer The execution framework handles everything else
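To make the two signatures concrete, here is a minimal single-process sketch of what an execution framework does around them. The driver `run_mapreduce`, the sample records and the "maximum per key" job are hypothetical and only meant to show the data flow; this is not Hadoop's API.

```python
from collections import defaultdict


def run_mapreduce(records, map_fn, reduce_fn):
    """Toy driver: map every record, group values by key, reduce per key."""
    intermediate = defaultdict(list)
    for k, v in records:                      # map phase
        for k2, v2 in map_fn(k, v):
            intermediate[k2].append(v2)
    output = []
    for k2 in sorted(intermediate):           # shuffle and sort by key
        output.extend(reduce_fn(k2, intermediate[k2]))  # reduce phase
    return output


def map_fn(record_id, line):
    """map(k, v) -> <k', v'>*: emit (city, temperature) for one line."""
    city, temp = line.split()
    yield city, int(temp)


def reduce_fn(city, temps):
    """reduce(k', v'[]) -> <k', v'>*: emit the maximum per city."""
    yield city, max(temps)


records = [(0, "ams 12"), (1, "ein 15"), (2, "ams 17"), (3, "ein 9")]
result = run_mapreduce(records, map_fn, reduce_fn)
print(result)
```

Only `map_fn` and `reduce_fn` are specific to the job; everything inside `run_mapreduce` is the part the slide assigns to the execution framework.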
  • 29. [Figure: MapReduce data flow] Input: (k1, v1) ... (k6, v6) → four map tasks emit (a, 1) (b, 2) | (c, 3) (c, 6) | (a, 5) (c, 2) | (b, 7) (c, 8) → Shuffle and Sort: aggregate values by keys → a: 1 5, b: 2 7, c: 2 3 6 8 → three reduce tasks emit (r1, s1) (r2, s2) (r3, s3) Jimmy Lin, University of Maryland / Twitter, 2011
  • 30. MapReduce “Hello World”: WordCount ● Question: how can we count the occurrences of each unique word in a given text? ● Line-based input (a record is one line) ● Key: position of first character in the whole document ● Value: a line not including the EOL character ● Input looks like: Key: 0, value: “a wise old owl lived in an oak” Key: 31, value: “the more he saw the less he spoke” Key: 65, value: “the less he spoke the more he heard” Key: 101, value: “why can't we all be like that wise old bird” ● Output looks like: (a,1) (an,1) (be,1) (he,4) (in,1) (we,1) (all,1) (oak,1) (old,2) (owl,1) (saw,1) (the,4) (why,1) (bird,1) (less,2) (like,1) (more,2) (that,1) (wise,2) (can't,1) (heard,1) (lived,1) (spoke,2)
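WordCount can be sketched as a pair of map and reduce functions plus a small in-memory shuffle. The helper `read_records` and the assumption of a single-character EOL are illustrative; a real Hadoop job would get its (offset, line) records from the framework's input format rather than from a Python list.

```python
from collections import defaultdict

lines = [
    "a wise old owl lived in an oak",
    "the more he saw the less he spoke",
    "the less he spoke the more he heard",
    "why can't we all be like that wise old bird",
]


def read_records(lines):
    """Yield (offset of first character, line), assuming a 1-character EOL."""
    offset = 0
    for line in lines:
        yield offset, line
        offset += len(line) + 1


def map_fn(offset, line):
    """Emit (word, 1) for every word on the line; the key is not used."""
    for word in line.split():
        yield word, 1


def reduce_fn(word, ones):
    """Sum all the 1s emitted for this word."""
    yield word, sum(ones)


groups = defaultdict(list)                     # stand-in for shuffle and sort
for offset, line in read_records(lines):
    for word, one in map_fn(offset, line):
        groups[word].append(one)

counts = dict(
    pair for word in sorted(groups) for pair in reduce_fn(word, groups[word])
)
print(counts)
```

Note that the mapper ignores its input key entirely; the line offset only matters to the framework for splitting the input.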
  • 32. MapReduce The programmer specifies two functions: ● map(k, v) → <k', v'>* ● reduce(k', v'[ ]) → <k', v'>* All values associated with the same key are sent to the same reducer The “execution framework” handles everything else?
  • 33. MapReduce execution framework ● Handles scheduling ● Assigns map and reduce tasks to workers ● Handles “data-awareness”: moves processes to data ● Handles synchronization ● Gathers, sorts, and shuffles intermediate data ● Handles errors and faults ● Detects worker failures and restarts their tasks ● Handles communication with the distributed filesystem
  • 34. MapReduce The programmer specifies two functions: ● map (k, v) → <k', v'>* ● reduce (k', v'[ ]) → <k', v'>* All values associated with the same key are sent to the same reducer The execution framework handles everything else... Not quite... usually, programmers also specify: ● partition (k', number of partitions) → partition for k' ● Often a simple hash of the key, e.g., hash(k') mod n ● Divides up key space for parallel reduce operations ● combine (k', v') → <k', v'>* ● Mini-reducers that run in memory after the map phase ● Used as optimization to reduce network traffic
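The partitioner and combiner can be illustrated with the numbers from the data-flow figure. The sketch below is an assumption-laden toy: `partition` uses `zlib.crc32` as a stable stand-in for "a simple hash of the key" (Python's built-in `hash()` of strings varies between processes), and `combine` sums values, which is only valid because the reduce operation here is a sum.

```python
import zlib
from collections import Counter


def partition(key, num_partitions):
    """hash(k') mod n, using a hash that is stable across processes."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def combine(pairs):
    """Mini-reducer: pre-aggregate one mapper's output before the shuffle."""
    totals = Counter()
    for k, v in pairs:
        totals[k] += v
    return sorted(totals.items())


# Output of the four map tasks in the figure.
mapper_outputs = [
    [("a", 1), ("b", 2)],
    [("c", 3), ("c", 6)],
    [("a", 5), ("c", 2)],
    [("b", 7), ("c", 8)],
]

num_reducers = 3
buckets = [[] for _ in range(num_reducers)]
for output in mapper_outputs:
    for k, v in combine(output):               # (c, 3), (c, 6) become (c, 9)
        buckets[partition(k, num_reducers)].append((k, v))

totals = {}                                     # each bucket goes to one reducer
for bucket in buckets:
    for k, v in bucket:
        totals[k] = totals.get(k, 0) + v
print(buckets, totals)
```

The combiner shrinks what the second mapper sends over the network from two pairs to one, and the partitioner guarantees that every pair with the same key ends up in the same bucket.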
  • 35. [Figure: MapReduce data flow with combiners and partitioners] Input: (k1, v1) ... (k6, v6) → four map tasks emit (a, 1) (b, 2) | (c, 3) (c, 6) | (a, 5) (c, 2) | (b, 7) (c, 8) → four combine tasks emit (a, 1) (b, 2) | (c, 9) | (a, 5) (c, 2) | (b, 7) (c, 8) → four partition tasks → Shuffle and Sort: aggregate values by keys → a: 1 5, b: 2 7, c: 2 9 8 (instead of 2 3 6 8) → three reduce tasks emit (r1, s1) (r2, s2) (r3, s3) Jimmy Lin, University of Maryland / Twitter, 2011
  • 36. Quick note... The term “MapReduce” can refer to: ● The programming model ● The “execution framework” ● The specific implementation
  • 38. MapReduce implementations ● Google (C++) ● Dean & Ghemawat, MapReduce: simplified data processing on large clusters, 2004 ● Ghemawat, Gobioff, Leung, The Google File System, 2003 ● Apache Hadoop (Java) ● Open source implementation ● Originally led by Yahoo! ● Broadly adopted ● Custom research implementations ● For GPUs, supercomputers, etcetera
  • 39. [Figure: MapReduce execution overview] User Program → (1) submit → Master; Master → (2) schedule map and (2) schedule reduce → workers; Input files (split 0 to split 4) → (3) read → Map phase workers → (4) local write → Intermediate files (on local disk) → (5) remote read → Reduce phase workers → (6) write → Output files (output file 0, output file 1) Jimmy Lin, Adapted from (Dean and Ghemawat, OSDI 2004)
  • 41. [Figure: MapReduce execution overview] User Program → (1) submit → Master; Master → (2) schedule map and (2) schedule reduce → workers; Input files (split 0 to split 4) → (3) read → Map phase workers → (4) local write → Intermediate files (on local disk) → (5) remote read → Reduce phase workers → (6) write → Output files (output file 0, output file 1) How do we get our input data to the map()'s on the workers? Jimmy Lin, Adapted from (Dean and Ghemawat, OSDI 2004)
  • 42. Distributed File System ● Don't move data to the workers... move workers to the data! ● Store data on the local disks of nodes in the cluster ● Start up the work on the node that has the data locally ● A distributed file system is the answer ● GFS (Google File System) for Google's MapReduce ● HDFS (Hadoop Distributed File System) for Hadoop
  • 43. GFS: Design decisions ● Files stored as chunks ● Fixed size (64MB) ● Reliability through replication ● Each chunk replicated across 3+ chunkservers ● Single master to coordinate access, keep metadata ● Simple centralized management ● No data caching ● Little benefit due to large datasets, streaming reads ● Simplify the API ● Push some of the issues onto the client (e.g., data layout) HDFS = GFS clone (same basic ideas) Jimmy Lin, Adapted from (Ghemawat, SOSP 2003)
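The chunking and replication figures on this slide translate into simple arithmetic. The sketch below uses the 64 MB chunk size and a replication factor of 3 from the slide; the helper name `chunk_layout`, the example file sizes, and the simplification that the last chunk only occupies the bytes it actually holds are assumptions.

```python
CHUNK_SIZE = 64 * 1024 ** 2    # 64 MB, as on the slide
REPLICATION = 3                # lower bound of "3+ chunkservers"


def chunk_layout(file_size):
    """Return (chunks, chunk replicas, raw bytes stored) for one file."""
    n_chunks = -(-file_size // CHUNK_SIZE)     # ceiling division
    n_replicas = n_chunks * REPLICATION
    raw_bytes = file_size * REPLICATION        # assumes a partial last chunk
    return n_chunks, n_replicas, raw_bytes


one_gb = chunk_layout(1024 ** 3)
small = chunk_layout(200 * 1024 ** 2)
print(one_gb, small)
```

A 1 GB file therefore needs only 16 metadata entries on the master, which is part of why a single master and a preference for few large files go together.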
  • 44. From GFS to HDFS ● Terminology differences: ● GFS Master = Hadoop NameNode ● GFS Chunkservers = Hadoop DataNodes ● Chunk = Block ● Functional differences ● Support for file appends in HDFS is relatively new ● HDFS performance is (likely) slower ● Blocksize is configurable by the client We use Hadoop terminology
  • 45. HDFS Architecture [Figure: the application's HDFS client sends (file name, block id) to the HDFS namenode, which holds the file namespace (e.g. /foo/bar → block 3df2) and answers with (block id, block location); the namenode sends instructions to the datanodes and receives datanode state; the client sends (block id, byte range) to an HDFS datanode and receives block data; each datanode stores its blocks on a local Linux file system] Jimmy Lin, Adapted from (Ghemawat, SOSP 2003)
  • 46. Namenode Responsibilities ● Managing the file system namespace: ● Holds file/directory structure, metadata, file-to-block mapping, access permissions, etcetera ● Coordinating file operations ● Directs clients to DataNodes for reads and writes ● No data is moved through the NameNode ● Maintaining overall health: ● Periodic communication with the DataNodes ● Block re-replication and rebalancing ● Garbage collection
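The division of labour between namenode and datanodes can be sketched with two toy classes. `ToyNameNode`, `ToyDataNode` and their methods are invented for this illustration and are not the HDFS client API; the file name `/foo/bar` and block id `3df2` are borrowed from the architecture figure, and the block contents are made up.

```python
class ToyDataNode:
    """Stores block data; stands in for a datanode's local file system."""

    def __init__(self, name):
        self.name = name
        self.blocks = {}                      # block id -> bytes

    def read(self, block_id, start, end):
        return self.blocks[block_id][start:end]


class ToyNameNode:
    """Holds metadata only: the namespace and the block locations."""

    def __init__(self):
        self.namespace = {}                   # file name -> list of block ids
        self.locations = {}                   # block id -> list of datanodes

    def lookup(self, file_name, block_index):
        block_id = self.namespace[file_name][block_index]
        return block_id, self.locations[block_id]


dn1, dn2 = ToyDataNode("dn1"), ToyDataNode("dn2")
for dn in (dn1, dn2):                         # two replicas of the same block
    dn.blocks["3df2"] = b"hello hdfs"

nn = ToyNameNode()
nn.namespace["/foo/bar"] = ["3df2"]
nn.locations["3df2"] = [dn1, dn2]

# Client read: ask the namenode where the block is, then read from a datanode.
block_id, nodes = nn.lookup("/foo/bar", 0)
data = nodes[0].read(block_id, 0, 5)
print(block_id, [n.name for n in nodes], data)
```

The bytes travel directly from the datanode to the client; the namenode only answers the metadata question, which matches the point that no data is moved through the NameNode.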
  • 47. Putting everything together [Figure: a namenode running the namenode daemon; a job submission node running the jobtracker; several slave nodes, each running a tasktracker and a datanode daemon on top of its local Linux file system] Jimmy Lin, University of Maryland / Twitter, 2011