Data Processing in NoSQL?

                  An Introduction to Map Reduce
                                By Dan Harvey




No SQL?

                        People thinking about
                         their data storage.



Storage Patterns

                            Denormalization
                            Sharding / Hashing  (see the sketch below)
                               Replication
                                    ...




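A minimal sketch of the sharding / hashing pattern above, assuming a simple
Python key-to-shard routine (the shard count and the md5 choice are
illustrative, not any particular store's implementation):

import hashlib

NUM_SHARDS = 4  # hypothetical cluster size

def shard_for(key):
    # Hash the key and map it onto one of NUM_SHARDS partitions,
    # so every read and write for a given key lands on the same node.
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# shard_for("user:42") always returns the same shard index.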
Ad Hoc Queries?


                           Hard to do...



The problem

[Quadrant chart: Query Entropy (low to high) vs Query Latency (low to high)]

        High entropy, low latency   ->  In-memory
        High entropy, high latency  ->  Hadoop  (Offline)
        Low entropy,  low latency   ->  Key-value  (Online)
        Low entropy,  high latency  ->  ?

“The Apache Hadoop project
                   develops open-source software
                        for reliable, scalable,
                       distributed computing”



Google’s MapReduce

[Screenshot: first page of “MapReduce: Simplified Data Processing on Large
 Clusters”, Jeffrey Dean and Sanjay Ghemawat, Google, Inc. Users specify a
 map function that processes a key/value pair to generate a set of
 intermediate key/value pairs, and a reduce function that merges all
 intermediate values associated with the same intermediate key.]
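A minimal single-process sketch of that map/reduce model, using word count as
the canonical example (the tiny in-memory runner is illustrative only; a real
framework distributes and parallelises these steps across machines):

from collections import defaultdict

def map_fn(doc_id, text):
    # map: (key, value) -> intermediate (key, value) pairs
    for word in text.split():
        yield word, 1

def reduce_fn(word, counts):
    # reduce: merge all values that share an intermediate key
    yield word, sum(counts)

def run(documents):
    grouped = defaultdict(list)          # "shuffle": group values by key
    for doc_id, text in documents.items():
        for word, count in map_fn(doc_id, text):
            grouped[word].append(count)
    results = {}
    for word, counts in grouped.items():
        for key, value in reduce_fn(word, counts):
            results[key] = value
    return results

print(run({"d1": "the quick brown fox", "d2": "the lazy dog"}))
# {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}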
Distributed Storage

[Diagram: the paper page from the previous slide repeated many times,
 illustrating a file stored as “Replicated Blocks” spread across the cluster.]
                                                                                                                                                                  indices, various of special-purpose ally mentation ’04:the However, theinterface tailoredoriginal simple a Implementation
                                                                                                                                                                                                                   straightforward.a large cluster ofto obscureisthe towards     Design compu-
                                                                                                                                                             values associated with and USENIX Association executed on6th Symposium inputexplores Systemsof MapReduce withinbut hides the messy de- 137
                                                                                                                                                                                       the same intermediate and OSDI
                                                                                                                                                                                             cally parallelized key. Many
                                                                                                                                          MapReduce is athat of web documents, summaries of the number the pages computing to                            com- data
                                                                                                                                                                                                                                                 large be distributed thewe 4 de- trying to with
                                                                                                                                                                                                                                   tation with Section 6amounts ofSection were to deal perform
                                                                                                                                                                                                                                                                   tions use code
                                                                                                                                                                                                                                                                     complex
                                                                  and distribution of large-scale computations, combined                    computations programming model                  an associ-
                                                                                                                                                                   process large amounts of raw data,                and of computations have environment. across
                                                                                                                                                                                                                   our cluster-based tasks.
                                                                  with an implementation of this interface that achieves                                     real crawled per and generating in this largeThe run-time
                                                                                                                                       ated such as crawled for processing are expressible large model,scribes severalthese issues. including our totails of in in using it as fault-tolerance, data distribution
                                                                                                                                                                  world tasks host, the set of most   machines. as shown
                                                                                                                                             implementation documents, web request modityetc., frequent queries in asystem takes carethe order experiences
                                                                                                                                                                                                                                        Google of in theof                  parallelization, the basis
                                                                                                                                                                                              logs,        to hundreds or thousands of machines
                                                                                                                                                                                                                                     refinements             programming model
                                                                                                                                                                                                                                                                     finish
    1 Introduction                                                high performance on large clusters of commodity PCs.                 data compute various theapaper.derived data, such as inverteda reasonable have data, schedulingSection 5 complexity, we designed aanew
                                                                                                                                                             in
                                                                                                                                             sets. Users specify map function that processesof partitioning the input found time. The the pro- hasand load balancing in library. Our abstraction is in-
                                                                                                                                                                kinds of
                                                                                                                                                                                             details a                                As a reaction to this
                                                                                                                                                                                                                   that we amount of useful. issues of how to par-   performance
                                                                     Section 2 describes the basic programming model and               key/value pair to generate a set of intermediatefunctional style are measurementsmachines, handling ma- forspired by the map and reduce primitives present in Lisp
                                                                                                                                            indices, various representations of
                                                                                                                                                                                             gram’s execution across a set of abstraction that allows us to express the simple computa-
                                                                                                                                                                Programs written inthe graph structure
                                                                                                                                                                                       this key/value               automati-
                                                                                                                                                                                                              allelize the computation, distribute the data, anda handle of
                                                                                                                                                                                                                                    of our implementation               variety
    Over the past five years, the authors and many others at       gives several examples. Section 3 describes an imple-                pairs, and a reduce cally parallelized of the number of large cluster of com- the tions we wereoriginalto perform many other functional languages. We realized that
                                                                                                                                                              function that merges executed on failures,failures conspire to required inter-machine
                                                                                                                                            of web documents, summaries and all intermediate
                                                                                                                                                                                             chine a
                                                                                                                                                                                                       pages
                                                                                                                                                                                                               and managing                                        and but hides
                                                                                                                                                                                                                   tasks. Section 6 explores thetrying simple compu-
                                                                                                                                                                                                                                     obscure the     use of MapReduce within the messy de-
    Google have implemented hundreds of special-purpose           mentation of the MapReduce interface tailored towards                values associated host, the same intermediate key.’04: 6thin takes This of the programmerscomplex inand to most with data distribution
                                                                                                                                                             modity set of most frequent queries Symposium allows tails Systems Design using it as of basis
                                                                                                                                                             USENIX Association OSDI system  communication. care on Operating of parallelization, fault-tolerance,
                                                                                                                                                                                                                                                 without any Implementation
                                                                                                                                            crawled per with the machines. The run-timeMany a tation with large amounts experiences code deal the our computations involved applying a map op-
                                                                                                                                                                                                                   Google including our of                                                              137
    computations that process large amounts of raw data,          our cluster-based computing environment. Section 4 de-               real world tasks are expressible in this model, experience with parallelpro- distributed systems to eas-a library. Our each logical is in-
                                                                                                                                                             details of partitioning the inputshownscheduling the and
                                                                                                                                                                                             as data,         these issues. a largeand load balancing in
                                                                                                                                                                                                                                                                   eration to abstraction “record” in our input in order to
                                                                                                                                       in the paper.         gram’s execution across a set of machines, handlingof
                                                                                                                                                                                             ily utilize the resources ma-          distributed system.            compute a set of intermediate key/value pairs, and then
    such as crawled documents, web request logs, etc., to         scribes several refinements of the programming model                                                                                            As a reaction to spired by the mapwe designed primitives present in Lisp
                                                                                                                                                                                                                                    this complexity, and reduce a new
    compute various kinds of derived data, such as inverted       that we have found useful. Section 5 has performance                                       chine failures, and managing the required inter-machine
                                                                                                                                          Programs written in this functional style are automati-                                  and many other the simple computa-a reduce operation to all the values that shared
                                                                                                                                                                                                Our implementation of MapReduce runs on a large                    applying
                                                                                                                                                                                                              abstraction that allows us to express functional languages. We realized that
                                                                                                                                                             communication. This allows programmers without any                                                    the same key, 137in order to combine the derived data ap-
    indices, various representations of the graph structure       measurements of our implementation for a variety of                  cally parallelized and executed’04:a6th Symposium com- commodity machines mostis Implementation the involved applying a map op-
                                                                                                                                       USENIX Association OSDI on large cluster of on of     cluster Operating Systems Design andofhighly scalable:
                                                                                                                                                                                                                                   and
                                                                                                                                                                                                              tions we were trying to performcomputations messy de-
                                                                                                                                                                                                                                              our but hides
    of web documents, summaries of the number of pages            tasks. Section 6 explores the use of MapReduce within                                      experience with parallel and distributed systems computation processes many ter- “record” in our input in orderato
                                                                                                                                       modity machines. The run-time system takes care of the MapReduceto eas-
                                                                                                                                                                                             a typical                             eration to each logical         propriately. Our use of functional model with user-
                                                                                                                                                                                                              tails of parallelization, fault-tolerance, data distribution
                                                                                                                                                             ily utilize the resources of a abytesdistributed thousands of machines. a set of intermediate key/value pairs, reduce operations allows us to paral-
                                                                                                                                       details of partitioning the input data, scheduling the pro- dataandsystem.
                                                                                                                                                                                             large of          on                              Programmers         specified map and and then
    crawled per host, the set of most frequent queries in a       Google including our experiences in using it as the basis                                                                                        load balancingcompute
                                                                                                                                                                                                                                     in a library. Our abstraction is in-
                                                                                                                                                                                             find ma-                               applying a primitives present in Lisp computations easily and to use re-execution
                                                                                                                                                                                                              spired a the map and reduce reduce
                                                                                                                                                                Our implementation of MapReduce runs on by large                                          pro-     lelize large
                                                                                                                                       gram’s execution across a set of machines, handlingthe system easy to use: hundreds of MapReduceoperation to all the values that shared
                                                                                                                                                                                                                                                                   as the primary mechanism for fault tolerance.
                                                                                                                                       chine failures, and cluster of commodity machines and is been implemented functional languages. Weto combine the derived data ap-
                                                                                                                                                             managing the required inter-machine highlymany other and upwards of one order realized that
                                                                                                                                                                                             grams have
                                                                                                                                                                                                              and scalable:        the same key, in thou-
                                                                                                                                       communication. This allows programmers without MapReduce jobs our executed on Google’s clusters a afunctional model with user- this work are a simple and
                                                                                                                                                                                             sand any
                                                                                                                                                             a typical MapReduce computation processes many ter-
                                                                                                                                                                                                              most of
                                                                                                                                                                                                                        are        propriately. Our use of
                                                                                                                                                                                                                            computations involved applying map op-
                                                                                                                                                                                                                                                                      The major contributions of
 USENIX Association OSDI ’04: 6th Symposium on Operating Systems Design and Implementation                                       137                                                          of machines. eration to each logical “record” in our input operations interface thatparal- automatic parallelization
                                                                                                                                                                                                                                   specified map and reduce in order to allows us to enables
                                                                                                                                                                                                                                                                   powerful
                                                                                                                                       experience with parallel and data on thousandseveryeas-
                                                                                                                                                             abytes of distributed systems to day.             Programmers
                                                                                                                                       ily utilize the resources thea large distributed system. of MapReduceapro- of intermediate computations easily distributionre-execution computations, combined
                                                                                                                                                             find of system easy to use: hundreds                                   lelize large key/value pairs, and and to use of large-scale
Thursday, 12 April 12                                                                                                                     Our implementation ofhave been implementedaand upwards of one thou-
                                                                                                                                                             grams MapReduce runs on large
                                                                                                                                                                                                              compute set                                           and then
                                                                                                                                                                                                                                                                   with an implementation of this interface that achieves
                                                                                                                                                                                                              applying a reduce operation to allmechanismthat fault tolerance.
                                                                                                                                                                                                                                   as the primary the values for shared
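The model the paper describes is easiest to see with its canonical word-count example: the user supplies only a map function that emits intermediate key/value pairs and a reduce function that merges all values sharing a key. The sketch below is a plain-Python illustration of those two user-supplied functions; the names map_fn and reduce_fn are illustrative only and are not part of the Hadoop or Google MapReduce APIs.

# Illustrative only: the two user-supplied functions of the MapReduce model,
# written as plain Python generators. Names are hypothetical, not Hadoop API.

def map_fn(doc_id, text):
    """Map: emit an intermediate (word, 1) pair for every word in one input record."""
    for word in text.split():
        yield word.lower(), 1

def reduce_fn(word, counts):
    """Reduce: merge all intermediate values emitted for the same word."""
    yield word, sum(counts)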
Map, Shuffle, Reduce




Thursday, 12 April 12
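As a rough, single-process sketch of how the three phases on this slide fit together (helper names are hypothetical, not Hadoop's API), the shuffle step can be simulated by grouping the intermediate pairs by key before handing each group to the reducer:

from collections import defaultdict

def run_mapreduce(records, map_fn, reduce_fn):
    """Toy local simulation of map -> shuffle -> reduce; illustrative only."""
    # Map phase: apply the user's map function to every input record.
    intermediate = []
    for key, value in records:
        intermediate.extend(map_fn(key, value))

    # Shuffle phase: group intermediate values by key. In a real Hadoop job this
    # is the sort/partition step that moves data between machines over the
    # network; locally, a dictionary stands in for it.
    groups = defaultdict(list)
    for key, value in intermediate:
        groups[key].append(value)

    # Reduce phase: merge the values collected for each key.
    output = []
    for key, values in sorted(groups.items()):
        output.extend(reduce_fn(key, values))
    return output

if __name__ == "__main__":
    docs = [(1, "map reduce map"), (2, "reduce shuffle reduce")]
    wc_map = lambda _doc_id, text: [(w.lower(), 1) for w in text.split()]
    wc_reduce = lambda word, counts: [(word, sum(counts))]
    print(run_mapreduce(docs, wc_map, wc_reduce))
    # -> [('map', 2), ('reduce', 3), ('shuffle', 1)]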

More Related Content

What's hot

Cloud batch a batch job queuing system on clouds with hadoop and h-base
João Gabriel Lima

Implementation of p pic algorithm in map reduce to handle big data
eSAT Publishing House

Survey of Parallel Data Processing in Context with MapReduce
cscpconf

IRJET - Evaluating and Comparing the Two Variation with Current Scheduling Al...
IRJET Journal

Cisco and Greenplum Partner to Deliver High-Performance Hadoop Reference ...
EMC

Hadoop by kamran khan
KamranKhan587

lec1_ref.pdf
vishal choudhary

A Survey on Big Data Analysis Techniques
ijsrd.com

Jovian DATA: A multidimensional database for the cloud
Bharat Rane

A NOBEL HYBRID APPROACH FOR EDGE DETECTION
ijcses

Integrating dbm ss as a read only execution layer into hadoop
João Gabriel Lima

Introduccion a Hadoop / Introduction to Hadoop
GERARDO BARBERENA

lec13_ref.pdf
vishal choudhary

MAP REDUCE BASED ON CLOAK DHT DATA REPLICATION EVALUATION
ijdms

Design architecture based on web
csandit

Python in an Evolving Enterprise System (PyData SV 2013)
PyData

Big Data Analysis and Its Scheduling Policy – Hadoop
IOSR Journals

Similar to Data Processing in the Work of NoSQL? An Introduction to Hadoop

An Introduction to Hadoop
Dan Harvey

An Intro to Hadoop
Matthew McCullough

Introduction to map reduce s. jency jayastina II MSC COMPUTER SCIENCE BON SEC...
jencyjayastina

A data aware caching 2415
SANTOSH WAYAL

Design Issues and Challenges of Peer-to-Peer Video on Demand System
cscpconf

LARGE-SCALE DATA PROCESSING USING MAPREDUCE IN CLOUD COMPUTING ENVIRONMENT
ijwscjournal

Hadoop v0.3.1
Matthew McCullough

Seminar_Report_hadoop
Varun Narang

Aug 2012 HUG: Random vs. Sequential
Yahoo Developer Network

Big Data Processing: Performance Gain Through In-Memory Computation
UT, San Antonio

big data and hadoop
Shamama Kamal

Enhancement of Map Function Image Processing System Using DHRF Algorithm on B...
AM Publications

B04 06 0918

A Survey on Data Mapping Strategy for data stored in the storage cloud 111
NavNeet KuMar

Mochi: Visual Log-Analysis Based Tools for Debugging Hadoop
George Ang

Eg4301808811
IJERA Editor

Self adjusting slot configurations for homogeneous and heterogeneous hadoop c...
LeMeniz Infotech

On Traffic-Aware Partition and Aggregation in Map Reduce for Big Data Applica...
dbpublications

Introduction to Apache Hadoop
Christopher Pezza

Recently uploaded

Slack (or Teams) Automation for Bonterra Impact Management (fka Social Soluti...
Jeffrey Haguewood

Connector Corner: Automate dynamic content and events by pushing a button
DianaGray10

Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered Quality
Inflectra

Assuring Contact Center Experiences for Your Customers With ThousandEyes
ThousandEyes

JMeter webinar - integration with InfluxDB and Grafana
RTTS

GraphRAG is All You need? LLM & Knowledge Graph
Guy Korland

LF Energy Webinar: Electrical Grid Modelling and Simulation Through PowSyBl -...
DanBrown980551

Essentials of Automations: Optimizing FME Workflows with Parameters
Safe Software

Monitoring Java Application Security with JDK Tools and JFR Events
Ana-Maria Mihalceanu

Bits & Pixels using AI for Good.........
Alison B. Lowndes

How world-class product teams are winning in the AI era by CEO and Founder, P...
Product School

State of ICS and IoT Cyber Threat Landscape Report 2024 preview
Prayukth K V

Designing Great Products: The Power of Design and Leadership by Chief Designe...
Product School

Knowledge engineering: from people to machines and back
Elena Simperl

Dev Dives: Train smarter, not harder – active learning and UiPath LLMs for do...
UiPathCommunity

Unsubscribed: Combat Subscription Fatigue With a Membership Mentality by Head...
Product School

The Future of Platform Engineering
Jemma Hussein Allen

GenAISummit 2024 May 28 Sri Ambati Keynote: AGI Belongs to The Community in O...
Sri Ambati

Mission to Decommission: Importance of Decommissioning Products to Increase E...
Product School

Kubernetes & AI - Beauty and the Beast !?! @KCD Istanbul 2024
Tobias Schneck

Data Processing in NoSQL? An Introduction to Hadoop

  • 1. Data Processing in NoSQL? An Introduction to Map Reduce By Dan Harvey Thursday, 12 April 12
  • 2. No SQL? People thinking about their data storage. Thursday, 12 April 12
  • 3. Storage Patterns Denormalization Sharding / Hashing Replication ... Thursday, 12 April 12
  • 4. Ad Hoc Queries? Hard to do... Thursday, 12 April 12
  • 5. The problem High In-memory Hadoop Query Entropy (Offline) Key-value ? Low (Online) Low High Query Latency Thursday, 12 April 12
  • 6. “The Apache Hadoop project develops open-source software for reliable, scalable, distributed computing” Thursday, 12 April 12
  • 7. Google’s MapReduce: the slide shows the first page of “MapReduce: Simplified Data Processing on Large Clusters” by Jeffrey Dean and Sanjay Ghemawat (jeff@google.com, sanjay@google.com), Google, Inc., reproducing the paper’s abstract and the start of its introduction (a word-count sketch of the map/reduce model follows this transcript). Thursday, 12 April 12
  • 8. Distributed Storage: Replicated Blocks. The slide layers overlapping copies of the MapReduce paper’s pages as a visual for data being split into blocks and replicated across machines (a sketch of configuring HDFS block replication follows the Editor’s Notes). Thursday, 12 April 12
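
Slide 7 reproduces the abstract of the Dean and Ghemawat paper: a map function turns each input key/value pair into intermediate key/value pairs, and a reduce function merges all intermediate values that share a key, while the runtime handles parallelization, data distribution and failures. The deck does not include code for this, but a minimal word-count sketch against Hadoop’s org.apache.hadoop.mapreduce API (the shape of the standard Hadoop example; input and output paths are placeholders supplied on the command line) illustrates the model:

    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map: for every token in an input line, emit the intermediate pair (word, 1).
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce: sum all the counts that arrive for the same word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable v : values) {
            sum += v.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class);  // local pre-aggregation before the shuffle
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));    // e.g. an input directory
        FileOutputFormat.setOutputPath(job, new Path(args[1]));  // e.g. an output directory
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

On a cluster this class would be packaged into a jar and submitted with the hadoop jar command; the framework then splits the input, shuffles the intermediate (word, count) pairs by key to the reducers, and re-executes failed tasks, which is the fault-tolerance the abstract describes.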

Editor's Notes

  1. An introduction to Hadoop. Slides will be a mix of technical and non-technical content; higher level. Not sure of the audience.
  8. Data split into blocks; replicated > three times on different machines; fault-tolerant storage.
  10. Data split into blocks; replicated > three times on different machines; fault-tolerant storage.
  11. Data split into blocks; replicated > three times on different machines; fault-tolerant storage.
  12. Data split into blocks; replicated > three times on different machines; fault-tolerant storage.
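
The speaker notes for slides 8 and 10 through 12 describe the storage side: files are split into blocks, each block is replicated on several different machines, and the replicas make the storage fault tolerant. As a hedged illustration (not taken from the deck), HDFS exposes this through the dfs.replication setting and the FileSystem API; the path and replication values below are examples only:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FileStatus;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class ReplicationDemo {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default replication for files created with this configuration
        // (3 copies is the usual HDFS default).
        conf.set("dfs.replication", "3");

        FileSystem fs = FileSystem.get(conf);
        Path file = new Path("/data/events.log");  // example path, assumed to exist

        // Ask the NameNode to keep more copies of this particular file's blocks.
        fs.setReplication(file, (short) 4);

        // Report the replication factor and block size recorded for the file.
        FileStatus status = fs.getFileStatus(file);
        System.out.println("replication = " + status.getReplication()
            + ", block size = " + status.getBlockSize() + " bytes");
      }
    }

setReplication only records a new target count; the NameNode then copies or deletes block replicas in the background until the target is met, which is what lets reads and MapReduce jobs keep running when individual machines fail.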