Hadoop
                               Divide and conquer gigantic data




© Matthew McCullough, Ambient Ideas, LLC
Talk Metadata
Twitter
  @matthewmccull
  #HadoopIntro
Matthew McCullough
  Ambient Ideas, LLC
  matthewm@ambientideas.com
  http://ambientideas.com/blog
  http://speakerrate.com/matthew.mccullough
MapReduce: Simplified Data Processing on Large Clusters

Jeffrey Dean and Sanjay Ghemawat
jeff@google.com, sanjay@google.com

Google, Inc.

Abstract

MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper.

Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system.

Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google’s clusters every day.

1 Introduction

Over the past five years, the authors and many others at Google have implemented hundreds of special-purpose computations that process large amounts of raw data, such as crawled documents, web request logs, etc., to compute various kinds of derived data, such as inverted indices, various representations of the graph structure of web documents, summaries of the number of pages crawled per host, the set of most frequent queries in a given day, etc. Most such computations are conceptually straightforward. However, the input data is usually large and the computations have to be distributed across hundreds or thousands of machines in order to finish in a reasonable amount of time. The issues of how to parallelize the computation, distribute the data, and handle failures conspire to obscure the original simple computation with large amounts of complex code to deal with these issues.

As a reaction to this complexity, we designed a new abstraction that allows us to express the simple computations we were trying to perform but hides the messy details of parallelization, fault-tolerance, data distribution and load balancing in a library. Our abstraction is inspired by the map and reduce primitives present in Lisp and many other functional languages. We realized that most of our computations involved applying a map operation to each logical “record” in our input in order to compute a set of intermediate key/value pairs, and then applying a reduce operation to all the values that shared the same key, in order to combine the derived data appropriately. Our use of a functional model with user-specified map and reduce operations allows us to parallelize large computations easily and to use re-execution as the primary mechanism for fault tolerance.

The major contributions of this work are a simple and powerful interface that enables automatic parallelization and distribution of large-scale computations, combined with an implementation of this interface that achieves high performance on large clusters of commodity PCs.

Section 2 describes the basic programming model and gives several examples. Section 3 describes an implementation of the MapReduce interface tailored towards our cluster-based computing environment. Section 4 describes several refinements of the programming model that we have found useful. Section 5 has performance measurements of our implementation for a variety of tasks. Section 6 explores the use of MapReduce within Google including our experiences in using it as the basis

To appear in OSDI 2004
MapReduce history




“A programming model and implementation for processing and generating large data sets”
Origins
  MapReduce implementation

  Founded by

  OpenSource at
Today
0.20.1 current version

Dozens of companies contributing
Hundreds of companies using
Why Hadoop?
$74.85
4 GB
1 TB
vs
$10,000
vs
$1,000
vs
Buy your way out of failure
vs
Failure is
inevitable

Go Cheap
Sproinnnng!

Bzzzt!

                       Crrrkt!
Server Funerals



No pagers go off when machines die
Report of dead machines once a week
 Clean out the carcasses
Robustness attributes prevented from bleeding into application code
          Data redundancy
          Node death
          Retries
          Data geography
          Parallelism
          Scalability
Hadoop for what?
Structured




Unstructured
NOSQL

 Death of the RDBMS is a lie
 NoJOINs
 NoNormalization
 Big-data tools are solving
 different issues than RDBMSes
Applications

 Protein folding
 (pharmaceuticals)


 Search engines
 Sorting
 Classification
 (government intelligence)
Applications

 Price search
 Steganography
 Analytics
 Primes
 (code breaking)
Particle Physics


 Large Hadron Collider
 15 petabytes of data per year
Financial Trends
 Daily trade performance analysis
 Market trending

 Uses employee desktops during off
 hours
 Fiscally responsible/economical
Contextual Ads
30%
of Amazon sales are from
recommendations
Not right now...

 Do you expect to tackle a very
 large problem before you:
   change jobs
   change industries
   retire
   die
   see the heat death of the universe
In the next decade, the class (scale) of
problems we are aiming to solve will
grow exponentially.
MapReduce
 map then... um... reduce.
The process
 Every item in the dataset is a parallel candidate for Map
 Map(k1,v1) -> list(k2,v2)
 Collects and groups pairs from all lists by key
 Reduce in parallel on each group
 Reduce(k2, list(v2)) -> list(v3)
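
The Map and Reduce signatures above can be sketched in a few lines of single-process Python. This is only an illustration of the data flow, not Hadoop's API: the `map_reduce` helper, the max-temperature mapper and reducer, and the sample readings are all assumptions made up for this example, and a real cluster would run the map and reduce calls in parallel rather than in a loop.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Toy, sequential model of Map -> group by key -> Reduce."""
    groups = defaultdict(list)
    # Map(k1, v1) -> list(k2, v2): every record is independent of the others.
    for k1, v1 in records:
        for k2, v2 in mapper(k1, v1):
            # Collect and group the intermediate pairs by key.
            groups[k2].append(v2)
    # Reduce(k2, list(v2)) -> list(v3): every group is independent too.
    return {k2: reducer(k2, values) for k2, values in groups.items()}


def max_temp_mapper(_line_no, line):
    """Hypothetical mapper: turn 'year temperature' into (year, temperature)."""
    year, temp = line.split()
    return [(year, int(temp))]


def max_temp_reducer(_year, temps):
    """Hypothetical reducer: keep the highest temperature seen for the year."""
    return [max(temps)]


# Made-up input records of the form (line number, "year temperature").
readings = [(0, "1949 11"), (1, "1950 22"), (2, "1949 7"), (3, "1950 -3")]
result = map_reduce(readings, max_temp_mapper, max_temp_reducer)
print(result)
```

Because the mapper only sees one record and the reducer only sees one group, either phase could be spread across machines without changing the functions themselves.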
FP For the Grid


 MapReduce
  Functional programming
  on
  a distributed processing platform
The Goal


Provide the occurrence count
of each distinct word across
all documents
Start
Map
Grouping
Reduce
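
The four stages named above (Start, Map, Grouping, Reduce) can be walked through for the word-count goal with a small in-memory sketch. The three sample documents and the `map_words` function are assumptions for illustration; this is plain Python, not a Hadoop job.

```python
from collections import defaultdict

# Start: a made-up corpus keyed by document name.
documents = {
    "doc1": "the quick brown fox",
    "doc2": "the lazy dog",
    "doc3": "the fox",
}


def map_words(_doc_id, text):
    """Map: emit (word, 1) for every word in one document."""
    return [(word, 1) for word in text.lower().split()]


# Map: each document could be handled on a different machine.
mapped = [
    pair
    for doc_id, text in documents.items()
    for pair in map_words(doc_id, text)
]

# Grouping: gather all the 1s emitted for the same word.
grouped = defaultdict(list)
for word, one in mapped:
    grouped[word].append(one)

# Reduce: sum each word's list to get its occurrence count.
counts = {word: sum(ones) for word, ones in grouped.items()}
print(counts)
```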
MapReduce
  Demo
Have Code,
  Will Travel


 Code travels to the data
 Opposite of traditional systems
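
One way to picture "code travels to the data" is a scheduler that prefers to run a task on a node that already stores the block it needs. The sketch below is a toy under that assumption, not Hadoop's actual scheduler; the `schedule` function, the node names, and the block IDs are hypothetical.

```python
def schedule(block_locations, free_nodes):
    """Assign one task per block, preferring a node that already holds the block.

    Returns {block: (node, is_local)}; is_local is False when the data
    would have to be shipped to the chosen node.
    """
    assignments = {}
    available = list(free_nodes)
    for block, holders in block_locations.items():
        if not available:
            break  # no free node left; remaining blocks wait
        local = [node for node in available if node in holders]
        node = local[0] if local else available[0]
        assignments[block] = (node, node in holders)
        available.remove(node)
    return assignments


# Made-up cluster state: which nodes hold a copy of each block.
block_locations = {
    "blk_0": ["n1", "n2"],
    "blk_1": ["n2", "n3"],
    "blk_2": ["n1"],
}
plan = schedule(block_locations, ["n1", "n2", "n3"])
print(plan)
```

In this run two of the three tasks land on a node that already has their block, so only one block needs to cross the network.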
Speed Test
Competition
 TeraSort
 Jim Gray, MSFT
  1985 paper
  Derived sort benchmark
  http://sortbenchmark.org/

 209 seconds (2007)
 120 seconds (2009)
Nodes
Processing Nodes


 Anonymous
 “No identity” is good
 Commodity equipment
Master Node

 Master is a special machine
 Use high quality hardware
 Single point of failure
  But recoverable
Hadoop Family
Hadoop
Components
 Pig
 Hive
 Core
  Common
 Chukwa
 HBase
 HDFS
the Players
the PlayAs

 Chukwa
 ZooKeeper
 Common
 HBase
 Hive
 HDFS
HDFS
HDFS Basics


 Based on the Google File System (GFS)
 Replicated data store
 Stored in 64 MB blocks
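
To make the block size concrete, here is a small arithmetic sketch of how a file's bytes would divide into 64 MB blocks. The `split_into_blocks` helper and the 200 MB example file are assumptions for illustration; real HDFS does this bookkeeping internally.

```python
BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB, the block size quoted on the slide


def split_into_blocks(file_size, block_size=BLOCK_SIZE):
    """Return an (offset, length) pair for each block of a file of file_size bytes."""
    blocks = []
    offset = 0
    while offset < file_size:
        # The last block holds only the bytes that are left over.
        length = min(block_size, file_size - offset)
        blocks.append((offset, length))
        offset += length
    return blocks


# A hypothetical 200 MB file: three full blocks plus one 8 MB remainder.
blocks = split_into_blocks(200 * 1024 * 1024)
for offset, length in blocks:
    print(offset, length)
```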
Data Overload
HDFS

 Replicating
 Rack-location aware
 Configurable redundancy factor
 Self-healing
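
The four properties above can be illustrated with a toy placement model. It is a simplification written for this deck, not the real HDFS placement policy: `place_replicas`, `heal`, the rack layout, and the node names are all hypothetical, and the replication factor of 3 is just the example value chosen here.

```python
def place_replicas(racks, replication, exclude=()):
    """Pick nodes for replicas, cycling across racks so copies are spread out."""
    excluded = set(exclude)
    pools = {
        rack: [node for node in nodes if node not in excluded]
        for rack, nodes in racks.items()
    }
    chosen = []
    while len(chosen) < replication and any(pools.values()):
        for rack in pools:
            if pools[rack] and len(chosen) < replication:
                chosen.append(pools[rack].pop(0))
    return chosen


def heal(block_locations, racks, dead_node, replication):
    """Self-healing: re-create any replica that was lost with dead_node."""
    live_racks = {
        rack: [node for node in nodes if node != dead_node]
        for rack, nodes in racks.items()
    }
    for block, nodes in block_locations.items():
        survivors = [node for node in nodes if node != dead_node]
        missing = replication - len(survivors)
        survivors += place_replicas(live_racks, missing, exclude=survivors)
        block_locations[block] = survivors
    return block_locations


# Made-up two-rack cluster with a configurable redundancy factor of 3.
racks = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"]}
locations = {"blk_0": place_replicas(racks, replication=3)}
initial = list(locations["blk_0"])

# Node n3 dies; the lost copy is rebuilt on another live node.
heal(locations, racks, dead_node="n3", replication=3)
print(initial, "->", locations["blk_0"])
```

The point of the sketch is that the application never asks for any of this: the copies, their rack spread, and the repair after a node death all happen below the job's code.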
HDFS Demo
Pig
Pig Basics

 Yahoo-authored add-on DSL & tool
 Origin: Pig Latin
 Analyzes large data sets
 High-level language for expressing data
 analysis programs
Pig Questions
 Ask big questions on unstructured
 data
  How many ___?
  Should we ____?
 Decide on the questions you want to
 ask long after you've collected the
 data.
Pig Sample


A = load 'passwd' using PigStorage(':');
B = foreach A generate $0 as id;
dump B;
store B into 'id.out';
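
For readers new to Pig, the script above loads colon-delimited lines, keeps the first field of each as `id`, and outputs the result. A rough single-machine Python equivalent is sketched below; the `extract_ids` function and the two sample passwd lines are assumptions for illustration, and unlike Pig this runs in one process rather than as a distributed job.

```python
def extract_ids(lines, delimiter=":"):
    """Mimic 'foreach A generate $0 as id': keep the first field of each line."""
    return [
        line.rstrip("\n").split(delimiter)[0]
        for line in lines
        if line.strip()  # skip blank lines
    ]


# Made-up lines in /etc/passwd format (fields separated by ':').
sample_passwd = [
    "root:x:0:0:root:/root:/bin/bash\n",
    "daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin\n",
]

ids = extract_ids(sample_passwd)
for user_id in ids:  # roughly what 'dump B' shows
    print(user_id)
```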
Pig demo
HBase
HBase Basics


 Structured data store
 Notice we didn’t say relational
 Relies on ZooKeeper and HDFS
NoSQL

 Voldemort
 Google BigTable
 MongoDB
 HBase
HBase Demo
Hive
Hive Basics

 Authored by
 SQL-like query interface (HiveQL) to data stored in Hadoop
 Hive is low-level
 Hive-specific metadata
Sqoop


  Sqoop by             is higher
  level
  Importing from RDBMS to Hive
  sqoop --connect jdbc:mysql://database.example.com/
Sync, Async


 RDBMS SQL is realtime
 Hadoop is primarily asynchronous
on
Amazon
 Elastic
  MapReduce
  Hosted Hadoop clusters
  True use of cloud computing
  Easy to set up
  Pay per use
EMR Languages
 Supports applications in...

 Java
 Perl
 Ruby
 Python
 PHP
 R
 C++
EMR Pricing
EMR Functions
 RunJobFlow: Creates a job flow request, starts
 EC2 instances and begins processing.
 DescribeJobFlows: Provides status of your job
 flow request(s).
 AddJobFlowSteps: Adds an additional step to an
 already running job flow.
 TerminateJobFlows: Terminates a running job flow
 and shuts down all instances.
Final
Thoughts
“Ha! Your Hadoop is slower than my Hadoop!”
“Shut up! I’m reducing.”
The RDBMS is not dead
Has new friends, helpers
NoSQL is taking the world by
storm
No more throwing away
perfectly good historical data
Failure is acceptable
❖    Failure is inevitable
❖    Go cheap

❖    Go distributed
Use Hadoop!
Hadoop
          Divide and conquer gigantic data


          Matthew McCullough
Email     matthewm@ambientideas.com
Twitter   @matthewmccull
Blog      http://ambientideas.com/blog
Credits
http://www.fontspace.com/david-rakowski/tribeca
http://www.cern.ch/
http://www.robinmajumdar.com/2006/08/05/google-dalles-data-centre-has-serious-cooling-needs/
http://www.greenm3.com/2009/10/googles-secret-to-efficient-data-center-design-ability-to-predict-performance.html
http://upload.wikimedia.org/wikipedia/commons/f/fc/CERN_LHC_Tunnel1.jpg
http://www.flickr.com/photos/mandj98/3804322095/
http://www.flickr.com/photos/8583446@N05/3304141843/
http://www.flickr.com/photos/joits/219824254/
http://www.flickr.com/photos/streetfly_jz/2312194534/
http://www.flickr.com/photos/sybrenstuvel/2811467787/
http://www.flickr.com/photos/lacklusters/2080288154/
http://www.flickr.com/photos/sybrenstuvel/2811467787/
http://www.flickr.com/photos/robryb/14826417/sizes/l/
http://www.flickr.com/photos/mckaysavage/1037160492/sizes/l/
http://www.flickr.com/photos/robryb/14826486/sizes/l/
All others, iStockPhoto.com

Hadoop v0.3.1

  • 1. Hadoop Divide and conquer gigantic data © Matthew McCullough, Ambient Ideas, LLC
  • 2. Talk Metadata Twitter @matthewmccull #HadoopIntro Matthew McCullough Ambient Ideas, LLC matthewm@ambientideas.com http://ambientideas.com/blog http://speakerrate.com/matthew.mccullough
  • 7. MapReduce: Simplified Data Processing on Large Clusters. Jeffrey Dean and Sanjay Ghemawat (jeff@google.com, sanjay@google.com), Google, Inc. To appear in OSDI 2004. Abstract: MapReduce is a programming model and an associated implementation for processing and generating large data sets. Users specify a map function that processes a key/value pair to generate a set of intermediate key/value pairs, and a reduce function that merges all intermediate values associated with the same intermediate key. Many real world tasks are expressible in this model, as shown in the paper. Programs written in this functional style are automatically parallelized and executed on a large cluster of commodity machines. The run-time system takes care of the details of partitioning the input data, scheduling the program’s execution across a set of machines, handling machine failures, and managing the required inter-machine communication. This allows programmers without any experience with parallel and distributed systems to easily utilize the resources of a large distributed system. Our implementation of MapReduce runs on a large cluster of commodity machines and is highly scalable: a typical MapReduce computation processes many terabytes of data on thousands of machines. Programmers find the system easy to use: hundreds of MapReduce programs have been implemented and upwards of one thousand MapReduce jobs are executed on Google’s clusters every day.
  • 8. [Slides 8 to 10: progressively closer views of the same first page of the paper.]
  • 11. MapReduce history “A programming model and implementation for processing and generating large data sets”
  • 14. Origins MapReduce implementation Founded by OpenSource at
  • 15. Today 0.20.1 current version Dozens of companies contributing Hundreds of companies using
  • 25. 1tb $74.85 4gb
  • 27. $10,000 vs $1,000
  • 29. Buy your way out of failure vs Failure is inevitable: go cheap
  • 34. Server Funerals: No pagers go off when machines die. Report of dead machines once a week. Clean out the carcasses.
  • 35. Robustness attributes prevented from bleeding into application code: data redundancy, node death, retries, data geography, parallelism, scalability
  • 44. NOSQL: Death of the RDBMS is a lie; NoJOINs; NoNormalization; big-data tools are solving different issues than RDBMSes
  • 49. Applications: protein folding (pharmaceuticals); search engines; sorting; classification (government intelligence)
  • 54. Applications: price search; steganography; analytics; primes (code breaking)
  • 57. Particle Physics: Large Hadron Collider; 15 petabytes of data per year
  • 62. Financial Trends: daily trade performance analysis; market trending; uses employee desktops during off hours; fiscally responsible/economical
  • 65. 30% of Amazon sales are from recommendations
  • 67. Not right now... Do you expect to tackle a very large problem before you: change jobs, change industries, retire, die, see the heat death of the universe?
  • 68. In the next decade, the class (scale) of problems we are aiming to solve will grow exponentially.
  • 70. MapReduce: map, then... um... reduce.
  • 76. The process: Every item in the dataset is a parallel candidate for Map. Map(k1, v1) -> list(k2, v2). The framework collects and groups pairs from all lists by key. Reduce runs in parallel on each group. Reduce(k2, list(v2)) -> list(v3).
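The map, group, and reduce steps can be sketched in a few lines of plain Python. This is a minimal, single-process illustration and not Hadoop's implementation: the `map_reduce` helper, the sample readings, and both callbacks are hypothetical names chosen for the example, and a real framework would run the map and reduce calls in parallel across machines.

```python
from collections import defaultdict


def map_reduce(records, mapper, reducer):
    """Apply mapper to every (k1, v1), group the output by k2, reduce each group."""
    groups = defaultdict(list)
    for k1, v1 in records:
        # Map(k1, v1) -> list(k2, v2)
        for k2, v2 in mapper(k1, v1):
            # Collect and group pairs from all lists by key
            groups[k2].append(v2)
    # Reduce(k2, list(v2)) -> list(v3), one call per group
    return {k2: reducer(k2, values) for k2, values in groups.items()}


# Hypothetical input: (station id, (year, temperature)) pairs.
readings = [
    ("s1", (2008, 21)),
    ("s2", (2008, 35)),
    ("s1", (2009, 30)),
    ("s3", (2009, 28)),
]


def year_temp_mapper(station, reading):
    year, temp = reading
    return [(year, temp)]


def max_reducer(year, temps):
    return [max(temps)]


result = map_reduce(readings, year_temp_mapper, max_reducer)
print(result)
```

Because each mapper call sees only one record and each reducer call sees only one key's values, neither callback needs to know how the work is distributed.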
  • 77. FP for the Grid: MapReduce is functional programming on a distributed processing platform
  • 79. The Goal: Provide the occurrence count of each distinct word across all documents
  • 80. Start
  • 81. Map
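The word-count goal can be sketched as a mapper that emits (word, 1) pairs, a sort that stands in for the framework's shuffle, and a reducer that sums each run of identical keys. This is an illustrative sketch only: the function names and the three sample documents are assumptions, and the tokenizer is a plain whitespace split.

```python
from itertools import groupby
from operator import itemgetter


def map_words(document):
    """Map step: emit (word, 1) for every whitespace-separated word."""
    for word in document.lower().split():
        yield (word, 1)


def reduce_counts(sorted_pairs):
    """Reduce step: sum the counts for each run of identical keys."""
    for word, group in groupby(sorted_pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))


def word_count(documents):
    pairs = [pair for doc in documents for pair in map_words(doc)]
    pairs.sort(key=itemgetter(0))  # stands in for the framework's sort/shuffle
    return dict(reduce_counts(pairs))


docs = ["the quick brown fox", "the lazy dog", "the fox"]
counts = word_count(docs)
print(counts)
```

The sort matters: `groupby` only merges adjacent equal keys, which mirrors why the pairs must be grouped by key before the reduce step runs.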
  • 85. Have Code, Will Travel: Code travels to the data. Opposite of traditional systems.
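One way to picture "code travels to the data" is a scheduler that prefers a free node already holding the input block and only falls back to a remote node when none is available. The sketch below is a simplified assumption for illustration, not Hadoop's scheduler; `pick_node` and the node names are hypothetical.

```python
def pick_node(block_locations, free_nodes):
    """Return (node, is_local): prefer a free node that already stores the block."""
    for node in block_locations:
        if node in free_nodes:
            return node, True  # run the task where the data already lives
    if free_nodes:
        # No local option: the data has to cross the network to some free node.
        return sorted(free_nodes)[0], False
    return None, False


local_choice = pick_node(["n2", "n3"], {"n1", "n3"})
remote_choice = pick_node(["n2"], {"n1", "n4"})
print(local_choice, remote_choice)
```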
  • 87. Competition: TeraSort. Jim Gray, MSFT, 1985 paper. Derived sort benchmark: http://sortbenchmark.org/. 209 seconds (2007), 120 seconds (2009).
  • 88. Nodes
  • 89. Processing Nodes: Anonymous (“no identity” is good). Commodity equipment.
  • 90. Master Node: Master is a special machine. Use high-quality hardware. Single point of failure, but recoverable.
  • 92. Hadoop Components: Pig, Hive, Core, Common, Chukwa, HBase, HDFS
  • 95. The players: Common, Chukwa, ZooKeeper, HBase, Hive, HDFS
  • 96. HDFS
  • 100. HDFS Basics: Based on the Google File System (GFS). Replicated data store. Stored in 64 MB blocks.
  • 106. HDFS Replicating: Rack-location aware. Configurable redundancy factor. Self-healing.
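The block and replication ideas can be made concrete with a toy model. The sketch below assumes the 64 MB block size from the slides and a replication factor of 3; the placement rule (interleave nodes across racks, then take consecutive entries) is a deliberately simple stand-in and not HDFS's actual placement policy. The rack and node names are hypothetical.

```python
from itertools import zip_longest

BLOCK_SIZE = 64 * 1024 * 1024  # 64 MB blocks, as on the slide


def block_count(file_size_bytes, block_size=BLOCK_SIZE):
    """Number of blocks needed to store a file (integer ceiling division)."""
    return -(-file_size_bytes // block_size)


def place_replicas(block_id, nodes_by_rack, replication=3):
    """Toy rack-aware placement: spread replicas over racks where possible."""
    # Interleave nodes rack by rack so neighbours in the list sit on different racks.
    columns = zip_longest(*(nodes_by_rack[rack] for rack in sorted(nodes_by_rack)))
    interleaved = [node for column in columns for node in column if node is not None]
    if replication > len(interleaved):
        raise ValueError("not enough nodes for the requested replication factor")
    start = block_id % len(interleaved)
    return [interleaved[(start + i) % len(interleaved)] for i in range(replication)]


cluster = {"rack1": ["n1", "n2"], "rack2": ["n3", "n4"], "rack3": ["n5"]}
blocks = block_count(200 * 1024 * 1024)  # a hypothetical 200 MB file
placement = {block: place_replicas(block, cluster) for block in range(blocks)}
print(blocks, placement)
```

Self-healing would build on the same bookkeeping: when a node is reported dead, any block whose replica list includes it gets a new replica placed on a surviving node.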
  • 109. Pig
  • 110. Pig Basics: Yahoo-authored add-on DSL & tool. Origin: Pig Latin. Analyzes large data sets. High-level language for expressing data analysis programs.
  • 111. Pig Questions: Ask big questions on unstructured data. How many ___? Should we ____? Decide on the questions you want to ask long after you've collected the data.
  • 112. Pig Sample: A = load 'passwd' using PigStorage(':'); B = foreach A generate $0 as id; dump B; store B into 'id.out';
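For readers unfamiliar with Pig Latin, the sample loads a colon-delimited file, keeps the first field of each row as `id`, prints it, and stores it. A rough single-machine equivalent in Python is sketched below; the `extract_ids` helper and the two sample passwd lines are assumptions for illustration, and the file-writing step of `store` is left out.

```python
def extract_ids(lines, delimiter=":"):
    """Return the first delimited field of each non-empty line (Pig's $0)."""
    return [line.rstrip("\n").split(delimiter)[0] for line in lines if line.strip()]


# Hypothetical passwd-style input rows.
sample = [
    "root:x:0:0:root:/root:/bin/bash\n",
    "daemon:x:1:1:daemon:/usr/sbin:/usr/sbin/nologin\n",
]

ids = extract_ids(sample)
for user_id in ids:  # roughly what `dump B` shows
    print(user_id)
```

The difference is in where it runs: Pig compiles the same three statements into MapReduce jobs, so the script does not change when the input grows from one file to a cluster's worth.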
  • 114. HBase
  • 115. HBase Basics: Structured data store. Notice we didn’t say relational. Relies on ZooKeeper and HDFS.
  • 116. NoSQL: Voldemort, Google BigTable, MongoDB, HBase
  • 118. Hive
  • 119. Hive Basics: Authored by Facebook. SQL-like interface (HiveQL) to data stored in Hadoop. Hive is low-level. Hive-specific metadata.
  • 120. Sqoop: Sqoop, by Cloudera, is higher level. Imports from an RDBMS into Hive: sqoop --connect jdbc:mysql://database.example.com/
  • 121. Sync, Async: RDBMS SQL is realtime. Hadoop is primarily asynchronous.
  • 128. Amazon Elastic MapReduce: Hosted Hadoop clusters. True use of cloud computing. Easy to set up. Pay per use.
  • 130. EMR Languages: Supports applications in Java, PHP, Perl, R, Ruby, C++, Python
  • 134. EMR Functions: RunJobFlow creates a job flow request, starts EC2 instances, and begins processing. DescribeJobFlows provides the status of your job flow request(s). AddJobFlowSteps adds additional steps to an already running job flow. TerminateJobFlows terminates a running job flow and shuts down all instances.
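The lifecycle those four operations describe can be modelled with a small in-memory class. This is a toy stand-in written for illustration and is not the AWS API or SDK: the class name, method signatures, identifier format, and state strings are all assumptions, and nothing here contacts Amazon.

```python
import itertools


class ToyJobFlowService:
    """In-memory model of the four job flow operations (not the real service)."""

    def __init__(self):
        self._flows = {}
        self._ids = itertools.count(1)

    def run_job_flow(self, name, steps):
        """Create a job flow and mark it as running; returns a made-up id."""
        flow_id = f"j-{next(self._ids)}"
        self._flows[flow_id] = {"name": name, "steps": list(steps), "state": "RUNNING"}
        return flow_id

    def describe_job_flows(self, flow_ids=None):
        """Return a snapshot of the requested job flows (all of them by default)."""
        ids = list(self._flows) if flow_ids is None else flow_ids
        return {fid: dict(self._flows[fid], steps=list(self._flows[fid]["steps"])) for fid in ids}

    def add_job_flow_steps(self, flow_id, steps):
        """Append steps, but only while the job flow is still running."""
        flow = self._flows[flow_id]
        if flow["state"] != "RUNNING":
            raise RuntimeError("steps can only be added to a running job flow")
        flow["steps"].extend(steps)

    def terminate_job_flows(self, flow_ids):
        """Mark each job flow as terminated (the real service also stops instances)."""
        for fid in flow_ids:
            self._flows[fid]["state"] = "TERMINATED"


service = ToyJobFlowService()
flow = service.run_job_flow("word-count", ["map-reduce"])
service.add_job_flow_steps(flow, ["export-results"])
running = service.describe_job_flows([flow])[flow]
service.terminate_job_flows([flow])
final_state = service.describe_job_flows()[flow]["state"]
print(running, final_state)
```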
  • 137. “Ha! Your Hadoop is slower than my Hadoop!” “Shut up! I’m reducing.”
  • 138. The RDBMS is not dead: Has new friends, helpers. NoSQL is taking the world by storm. No more throwing away perfectly good historical data.
  • 143. Failure is acceptable ❖ Failure is inevitable ❖ Go cheap ❖ Go distributed
  • 145. Hadoop Divide and conquer gigantic data Matthew McCullough Email matthewm@ambientideas.com Twitter @matthewmccull Blog http://ambientideas.com/blog