Large Scale Math with
 Hadoop MapReduce

   Tsz-Wo (Nicholas) Sze, PhD


          Hadoop Summit
           June 29, 2011

Who am I?
• Hortonworks Software Engineer
• Apache Hadoop PMC Member
• Mathematician


  Interests:
       Distributed Computing
       Algorithms
       Number Theory

Agenda
    • Introduction

    • Integer Multiplication
              • MapReduce-FFT
              • MapReduce-Sum
              • MapReduce-SSA

    • A New World Record
              • The “Machine” Behind the Computation

Tsz-Wo Sze,         Hadoop Summit 2011                 3
Typical Hadoop Applications
          Major applications of Hadoop include
              •   Search and crawling
              •   Text processing
              •   Machine learning
              •   ...

          But not yet commonly used in scientific
         or mathematical applications.

                                           Why?
Why Not Math?

          No MapReduce math libraries available, and

         More fundamentally,
         MapReduce math algorithms are not well studied.




Existing Library
         Really no MapReduce Math Library?
         Not exactly.

          Apache Mahout
              • A machine learning library.
              • Includes packages for matrix operations.

          Apache Hama (Incubation)
              • A matrix computational package.

Computationally Intensive Problems (1)

           Integer Factoring
               • a.k.a. breaking the RSA cryptosystem
                    Given N, e and c, compute m such that
                         \( c \equiv m^e \pmod{N} \),
                    where N is a product of two primes.

               • a 768-bit RSA modulus was factored in 2009 [note 1]

  [note 1] Kleinjung et al., Factorization of a 768-bit RSA modulus, CRYPTO 2010.
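
The link between factoring and RSA on this slide can be illustrated with a toy sketch. It assumes a modulus small enough for trial division, and the helper names `factor_semiprime` and `recover_message` are hypothetical, not taken from the talk. The modular inverse via `pow(e, -1, phi)` requires Python 3.8 or later.

```python
def factor_semiprime(n):
    """Find p, q with n = p * q by trial division (feasible only for tiny n)."""
    p = 2
    while p * p <= n:
        if n % p == 0:
            return p, n // p
        p += 1
    raise ValueError("n has no nontrivial factor")


def recover_message(n, e, c):
    """Given N, e and c, return m with c = m^e (mod N) by factoring N first."""
    p, q = factor_semiprime(n)
    # Once the factors are known, the private exponent follows directly.
    d = pow(e, -1, (p - 1) * (q - 1))
    return pow(c, d, n)


# Toy parameters (illustrative only): N = 61 * 53, e = 17, m = 65.
ciphertext = pow(65, 17, 3233)
print(recover_message(3233, 17, ciphertext))
```

For a 768-bit modulus the trial-division step is hopeless, which is why factoring counts as a computationally intensive problem.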


Computationally Intensive Problems (2)




          Solving PDEs (Partial Differential Equations)
              •   Fluid dynamics
              •   Electromagnetism
              •   Financial analysis
              •   ...




                                           (Two-dimensional Turbulence, courtesy of Y.K. Tsang)

Computationally Intensive Problems (3)

          Finding complex zeros of the Riemann zeta function

                 \[ \zeta(s) = \sum_{n=1}^{\infty} \frac{1}{n^s} \qquad \text{for } s \in \mathbb{C},\ \Re(s) > 1 \]

         and then analytically continued to all \( s \neq 1 \).
              • Disprove the Riemann Hypothesis (RH).
                Then, you will get $1,000,000 [note 2].
                However, RH is unlikely to be false.
              • More likely:
                Obtain more evidence supporting RH.

  [note 2] See http://www.claymath.org/millennium/Riemann_Hypothesis/.
Computationally Intensive Problems (4)

         Computing π
         Latest world records:
              • Five trillion decimal digits (August 2010)
                by Alexander Yee & Shigeru Kondo [note 3]

              • The two quadrillionth bit (July 2010)
                by Tsz-Wo Sze & the Yahoo! Cloud Computing Team [note 4]

  [note 3] See http://www.numberworld.org/misc_runs/pi-5t/announce_en.html
  [note 4] See http://developer.yahoo.net/blogs/hadoop/2010/09/two_quadrillionth_bit_pi.html
Missing Functionalities
          Fast Fourier Transform (FFT)
         – the basic routine behind many algorithms.

          Arbitrary Precision Arithmetic
                Integer functions
                Floating-point functions
                Complex functions

          ...


Why Integer Multiplication?
          There exist fast algorithms.

          Many applications
              •   Division
              •   Logarithm
              •   Trigonometric functions
              •   ...




Prerequisite of Algorithms




                                       D.J. Bernstein, Fast
                                       multiplication and its
                                       applications, ANTS 2008.

Integer Multiplication Algorithms
          Naïve, \( O(N^2) \)

          Karatsuba, \( O(N^{\log_2 3}) = O(N^{1.585}) \)

          Toom-Cook, \( O(N^{\log(2D-1)/\log D}) \)
         If D = 3, then \( O(N^{\log 5/\log 3}) = O(N^{1.465}) \)

          FFT-based algorithms, \( O(N \log N \cdots) \)
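
To make the Karatsuba entry concrete, here is a minimal sketch on Python integers. It assumes non-negative inputs; the cutoff of 64 bits and the function name are arbitrary choices for illustration. Each level replaces four half-size products with three, which is where the exponent \( \log_2 3 \) comes from.

```python
CUTOFF_BITS = 64  # below this size, fall back to the built-in product


def karatsuba(x, y):
    """Multiply non-negative integers x and y with three recursive products."""
    if x.bit_length() <= CUTOFF_BITS or y.bit_length() <= CUTOFF_BITS:
        return x * y
    half = max(x.bit_length(), y.bit_length()) // 2
    mask = (1 << half) - 1
    x_hi, x_lo = x >> half, x & mask
    y_hi, y_lo = y >> half, y & mask
    low = karatsuba(x_lo, y_lo)
    high = karatsuba(x_hi, y_hi)
    # (x_lo + x_hi)(y_lo + y_hi) - low - high equals x_lo*y_hi + x_hi*y_lo.
    middle = karatsuba(x_lo + x_hi, y_lo + y_hi) - low - high
    return (high << (2 * half)) + (middle << half) + low


print(karatsuba(3 ** 500, 7 ** 400) == 3 ** 500 * 7 ** 400)
```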




FFT-based Algorithms
          Basic FFT, \( O(N \log N \log\log N \log\log\log N \cdots) \)

          Schönhage-Strassen, \( O(N \log N \log\log N) \)

          Nussbaumer, \( O(N \log N \log\log N) \)

          Fürer, \( O(N (\log N)\, 2^{\log^* N}) \)

          De-Kurur-Saha-Saptharishi, \( O(N (\log N)\, 2^{\log^* N}) \)



Convolution
          By the convolution theorem,

                 \[ a \times b = \mathrm{dft}^{-1}(\mathrm{dft}(a) * \mathrm{dft}(b)), \]

         where

                 \( \times \)               denotes the convolution operator,
                 \( * \)                    denotes componentwise multiplication,
                 \( \mathrm{dft}(\cdot) \)  denotes the discrete Fourier transform.
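
The convolution theorem can be checked numerically on a small example. The sketch below assumes the cyclic convolution, a naive quadratic-time DFT over the complex numbers, and small integer inputs so that rounding recovers exact values. The function names are illustrative.

```python
import cmath


def dft(a, inverse=False):
    """Naive O(n^2) discrete Fourier transform over the complex numbers."""
    n = len(a)
    sign = 1 if inverse else -1
    out = [sum(a[k] * cmath.exp(sign * 2j * cmath.pi * j * k / n) for k in range(n))
           for j in range(n)]
    return [v / n for v in out] if inverse else out


def cyclic_convolution(a, b):
    """Direct definition: (a x b)_j = sum over k of a_k * b_{(j - k) mod n}."""
    n = len(a)
    return [sum(a[k] * b[(j - k) % n] for k in range(n)) for j in range(n)]


def convolution_via_dft(a, b):
    """a x b = dft^{-1}(dft(a) * dft(b)), with * taken componentwise."""
    p_hat = [u * v for u, v in zip(dft(a), dft(b))]
    return [round(v.real) for v in dft(p_hat, inverse=True)]


a, b = [1, 2, 3, 4], [5, 6, 7, 8]
print(convolution_via_dft(a, b) == cyclic_convolution(a, b))
```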



Schönhage-Strassen Algorithm (SSA)

          Represent integers as polynomials. Then, compute
          convolution with DFTs modulo an integer [note 5].

  [note 5] It has the form \( 2^n + 1 \) and is called the Schönhage-Strassen modulus.


SSA Steps
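          Step 1: two DFTs,
                 \( \hat{a} \overset{\text{def}}{=} \mathrm{dft}(a) \)  and  \( \hat{b} \overset{\text{def}}{=} \mathrm{dft}(b) \);

          Step 2: componentwise multiplication,
                 \( \hat{p} \overset{\text{def}}{=} \hat{a} * \hat{b} \);

          Step 3: a DFT inverse,
                 \( p = \mathrm{dft}^{-1}(\hat{p}) \);

          Step 4: normalization.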
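
The four steps can be sketched end to end at a small scale. The sketch makes several assumptions that are not from the talk: a fixed modulus \( 2^{32} + 1 \), a transform length of 64 with the root of unity 2, 8-bit components (operands up to 256 bits), and a naive quadratic-time DFT in place of an FFT. `pow(d, -1, modulus)` requires Python 3.8 or later.

```python
def dft_mod(a, omega, modulus):
    """Naive DFT of a over the integers modulo `modulus` with root `omega`."""
    d = len(a)
    return [sum(a[k] * pow(omega, j * k, modulus) for k in range(d)) % modulus
            for j in range(d)]


def ssa_multiply(x, y, n=32, d=64, digit_bits=8):
    """Multiply x and y via DFTs modulo 2^n + 1 (a toy version of the SSA steps)."""
    modulus = (1 << n) + 1
    omega = pow(2, 2 * n // d, modulus)   # 2 has order 2n modulo 2^n + 1
    half = d // 2
    if x < 0 or y < 0 or max(x, y) >> (digit_bits * half):
        raise ValueError("operands must fit in d/2 components")
    # Product coefficients must stay below the modulus to be recovered exactly.
    assert half * ((1 << digit_bits) - 1) ** 2 < modulus
    mask = (1 << digit_bits) - 1
    a = [(x >> (digit_bits * k)) & mask for k in range(half)] + [0] * half
    b = [(y >> (digit_bits * k)) & mask for k in range(half)] + [0] * half
    # Step 1: two DFTs.
    a_hat, b_hat = dft_mod(a, omega, modulus), dft_mod(b, omega, modulus)
    # Step 2: componentwise multiplication.
    p_hat = [u * v % modulus for u, v in zip(a_hat, b_hat)]
    # Step 3: a DFT inverse (inverse root, then divide by the length d).
    omega_inv = pow(omega, d - 1, modulus)
    d_inv = pow(d, -1, modulus)
    p = [c * d_inv % modulus for c in dft_mod(p_hat, omega_inv, modulus)]
    # Step 4: normalization, done here by adding the shifted coefficients.
    return sum(c << (digit_bits * k) for k, c in enumerate(p))


x = 0x0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF0123456789ABCDEF
y = (1 << 256) - 1
print(ssa_multiply(x, y) == x * y)
```

Zero-padding each operand to half the transform length keeps the cyclic convolution from wrapping around, so the coefficients are those of the ordinary polynomial product.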
Calculating DFTs



          DFT can be calculated by a family of algorithms
         called Fast Fourier Transform (FFT).




FFT Family
          Recursive-FFT
          Parallel-FFT
          Cooley-Tukey (decimation-in-time)
          Gentleman-Sande (decimation-in-frequency)
          Danielson-Lanczos
          Ping-pong FFT
          ...

Data Model    (1)




          Need a data model which allows accessing
         terabit integers efficiently.

          An integer x is represented as a D-dimensional
         tuple
                 \( x = (x_{D-1}, x_{D-2}, \ldots, x_0) \).




Data Model             (2)




          Write
                 \( D = IJ \),
         where I and J are powers of two.

          Define J-dimensional tuples
                 \( x^{(i)} \overset{\text{def}}{=} (x_{(J-1)I+i}, x_{(J-2)I+i}, \ldots, x_i) \)

         for \( 0 \le i < I \).


Data Model     (3)




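         Then,
         \[
           \begin{pmatrix} x^{(0)} \\ x^{(1)} \\ \vdots \\ x^{(I-1)} \end{pmatrix}
           =
           \begin{pmatrix}
             x_{(J-1)I}       & x_{(J-2)I}       & \cdots & x_0 \\
             x_{(J-1)I+1}     & x_{(J-2)I+1}     & \cdots & x_1 \\
             \vdots           & \vdots           & \ddots & \vdots \\
             x_{(J-1)I+(I-1)} & x_{(J-2)I+(I-1)} & \cdots & x_{I-1}
           \end{pmatrix}
         \]

          We call it the (I, J)-format of x.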




Data Model     (4)




          Each \( x^{(i)} \) is a sequence of J records.

          Each record is a key-value pair.
              Record #     <Key, Value>
                 0         < i, \( x_i \) >
                 1         < I + i, \( x_{I+i} \) >
                 ...       ...
               J − 1       < (J − 1)I + i, \( x_{(J-1)I+i} \) >



Data Model    (5)




         Thus, an integer is stored as I SequenceFiles in
         HDFS, each SequenceFile contains J records.
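
The (I, J)-format can be sketched with plain Python lists. In this illustration each inner list stands in for one SequenceFile, and record number j of sequence i carries the key jI + i, as in the tuple definition. The function name and the in-memory representation are assumptions, and no HDFS API is used.

```python
def to_ij_format(x, I, J):
    """Split components x[0..D-1], D = I*J, into I sequences of J (key, value) records."""
    if len(x) != I * J:
        raise ValueError("expected exactly I * J components")
    # Sequence i holds x_i, x_{I+i}, ..., x_{(J-1)I+i}, keyed by component index.
    return [[(j * I + i, x[j * I + i]) for j in range(J)] for i in range(I)]


components = [100, 101, 102, 103, 104, 105]   # x_0 .. x_5
for i, records in enumerate(to_ij_format(components, I=2, J=3)):
    print(i, records)
```

Storing every I-th component together means each sequence is exactly the input of one inner DFT in the parallel FFT.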




Parallel-FFT Steps
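          Step 1: I inner J-point DFTs,
                 \( \hat{a}^{(i)} \overset{\text{def}}{=} \mathrm{dft}(a^{(i)}) \);
          Step 2: componentwise shifting,
                 \( z_{jI+i} \overset{\text{def}}{=} \zeta^{ij}\, \hat{a}^{(i)}_j \);
          Step 3: transposition,
                 \( z^{[j]} \overset{\text{def}}{=} (z_{jI+(I-1)}, z_{jI+(I-2)}, \ldots, z_{jI}) \);
          Step 4: J outer I-point DFTs,
                 \( \hat{z}^{[j]} \overset{\text{def}}{=} \mathrm{dft}(z^{[j]}) \).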
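
The four steps can be checked against a direct D-point DFT. This sketch assumes complex arithmetic with \( \zeta = e^{-2\pi i/D} \) (the talk's SSA setting uses a modular root of unity instead) and a naive DFT for the inner and outer transforms. The output placement, with component i' of the j-th outer transform landing at index j + J·i', follows from the index algebra and is an assumption of this sketch, not something stated on the slide.

```python
import cmath


def dft(a, root):
    """Naive DFT with an explicit root of unity: A_j = sum_k a_k * root^(j*k)."""
    n = len(a)
    return [sum(a[k] * root ** (j * k) for k in range(n)) for j in range(n)]


def parallel_fft(a, I, J):
    """D-point DFT (D = I*J) computed as inner DFTs, shift, transpose, outer DFTs."""
    D = I * J
    zeta = cmath.exp(-2j * cmath.pi / D)
    # Step 1: I inner J-point DFTs on the strided subsequences a^(i).
    inner = [dft(a[i::I], zeta ** I) for i in range(I)]
    # Step 2: componentwise shifting by zeta^(i*j).
    z = [[zeta ** (i * j) * inner[i][j] for j in range(J)] for i in range(I)]
    # Step 3: transposition, gathering z^[j] over all i.
    z_t = [[z[i][j] for i in range(I)] for j in range(J)]
    # Step 4: J outer I-point DFTs.
    outer = [dft(z_t[j], zeta ** J) for j in range(J)]
    result = [0j] * D
    for j in range(J):
        for i2 in range(I):
            result[j + J * i2] = outer[j][i2]
    return result


data = [complex(k * k % 7, k % 3) for k in range(16)]
direct = dft(data, cmath.exp(-2j * cmath.pi / 16))
print(max(abs(u - v) for u, v in zip(parallel_fft(data, 4, 4), direct)) < 1e-9)
```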

MapReduce Model
      [Diagram: Input → Map1, Map2, Map3, Map4 → Shuffle → Reduce1, Reduce2, Reduce3, Reduce4 → Output]



MapReduce-FFT
      [Diagram: Input → Inner FFT1, Inner FFT2, Inner FFT3, Inner FFT4 → Transposition (by shuffle) → Outer FFT1, Outer FFT2, Outer FFT3, Outer FFT4 → Output]



Data Locality



         The FFT transposition, for which preserving locality
         is traditionally difficult, becomes trivial in
         MapReduce.




MapReduce-FFT                   (1)




          Map function:

                 \( (k_1, v_1) \longrightarrow \mathrm{list}\langle k_2, v_2 \rangle \)

         Algorithm 1 (Forward FFT, Mapper).
         (f.m.1) read key i, value \( a^{(i)} \);
         (f.m.2) calculate a J-point DFT;
         (f.m.3) componentwise multiply;
         (f.m.4) for \( 0 \le j < J \), emit key j, value \( (i, z_{jI+i}) \).

MapReduce-FFT                  (2)




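          Reduce function:

                 \( (k_2, \mathrm{list}\langle v_2 \rangle) \longrightarrow \mathrm{list}\langle k_3, v_3 \rangle \).

         Algorithm 2 (Forward FFT, Reducer).
         (f.r.1) receive key j, list \( [(i, z_{jI+i})]_{0 \le i < I} \);
         (f.r.2) calculate an I-point DFT;
         (f.r.3) write key j, value \( \hat{z}^{[j]} \).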
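
The mapper and reducer can be simulated in a single process to see how the shuffle performs the transposition. This is a sketch under stated assumptions: an in-memory dictionary stands in for the shuffle, complex arithmetic replaces the modular arithmetic of the SSA setting, and the function names are hypothetical rather than taken from DistFft.

```python
import cmath
from collections import defaultdict


def dft(a, root):
    """Naive DFT with an explicit root of unity."""
    n = len(a)
    return [sum(a[k] * root ** (j * k) for k in range(n)) for j in range(n)]


def fft_mapper(i, a_i, I, zeta):
    """Algorithm 1: J-point DFT of a^(i), shift, then emit key j, value (i, z)."""
    a_hat = dft(a_i, zeta ** I)
    for j, value in enumerate(a_hat):
        yield j, (i, zeta ** (i * j) * value)


def fft_reducer(j, values, J, zeta):
    """Algorithm 2: order the received values by i, then take an I-point DFT."""
    z_j = [z for _, z in sorted(values, key=lambda pair: pair[0])]
    return j, dft(z_j, zeta ** J)


def mapreduce_fft(a, I, J):
    zeta = cmath.exp(-2j * cmath.pi / (I * J))
    shuffled = defaultdict(list)          # grouping by key j is the transposition
    for i in range(I):
        for key, value in fft_mapper(i, a[i::I], I, zeta):
            shuffled[key].append(value)
    return dict(fft_reducer(j, values, J, zeta) for j, values in shuffled.items())


signal = [complex(k % 5, 0) for k in range(8)]
output = mapreduce_fft(signal, I=2, J=4)
print(sorted(output))
```

In this simulation, component i' of the value written for key j equals coefficient j + J·i' of the full D-point DFT.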



Normalization



          Normalization can be viewed as a summation of
         three integers.




Summation



          Integer summation can be done by (1) componentwise
         summation, (2) carry evaluation and then
         (3) parallel carrying.
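
The three phases can be sketched on little-endian component lists. This is an illustration only: the carry evaluation below is a simple sequential pass, the base is a free parameter, and the sketch assumes fewer summands than the base so that the final carry fits in one component. None of the function names come from DistCompSum or DistCarrying.

```python
def to_digits(n, length, base):
    """Little-endian components of n in the given base."""
    return [(n // base ** k) % base for k in range(length)]


def from_digits(digits, base):
    return sum(d * base ** k for k, d in enumerate(digits))


def componentwise_sum(*numbers):
    """Phase 1: add components position by position, ignoring carries."""
    return [sum(column) for column in zip(*numbers)]


def evaluate_carries(sums, base):
    """Phase 2: carries[k] is the carry entering position k."""
    carries = [0]
    for s in sums:
        carries.append((s + carries[-1]) // base)
    return carries


def apply_carries(sums, carries, base):
    """Phase 3: every position is fixed up independently, so it parallelizes."""
    digits = [(s + c) % base for s, c in zip(sums, carries)]
    return digits + [carries[-1]]         # final carry is the top component


a, b, c = 987654321, 123456789, 999999999
sums = componentwise_sum(*(to_digits(n, 9, 10) for n in (a, b, c)))
result = apply_carries(sums, evaluate_carries(sums, 10), 10)
print(from_digits(result, 10) == a + b + c)
```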




MapReduce-Sum
      [Diagram: Input → Summation1, Summation2, Summation3, Summation4 → Carry Evaluation (modified shuffle) → Carrying1, Carrying2, Carrying3, Carrying4 → Output]



Job 1: Componentwise Summation
      [Diagram: Input → Summation1, Summation2, Summation3, Summation4 → Output]

              A map-only job.




Job 2: Carrying
      [Diagram: Input → Carry Evaluation → Carrying1, Carrying2, Carrying3, Carrying4 → Output]



MapReduce-SSA
          • two concurrent forward FFT jobs;

          • a backward FFT job with componentwise
            multiplication and splitting;

          • a componentwise summation map-only job;

          • a carrying job [note 6].

  [note 6] It is possible to combine the last two jobs if we modify the shuffle process in MapReduce [.next].


Prototype Implementation
          DistMpMult
         – distributed multi-precision multiplication
               DistFft – distributed FFT
               DistCompSum – distributed componentwise
              summation
               DistCarrying – distributed carrying

          Open source – available at
         https://issues.apache.org/jira/browse/MAPREDUCE-2471



Cluster Configuration
          A shared cluster:
               Apache Hadoop 0.20
               1350 nodes
               6 GB memory per node
               2 map tasks & 1 reduce task per node
               Imposed a limitation on the aggregated
              memory usage of individual jobs.




Running Time
      [Plot: actual running time for \( 2^{36} \le N \le 2^{40} \). x-axis: log(N), ticks 32 to 40; y-axis: log(t), ticks 7 to 11.5, where t is the elapsed time in seconds.]


What is π?

          π is a mathematical constant such that, for any circle,

                 \[ \pi = \frac{\text{circumference}}{\text{diameter}} = \frac{C}{d}. \]

          We have π ≈ 3.244 (in hexadecimal).

Decimal, Hexadecimal & Binary
          Representing π in different bases

         π = 3.1415926535 8979323846 2643383279 ...
              = 3.243F6A88 85A308D3 13198A2E ...
              = 11.00100100 00111111 01101010 ...

          Bit position is counted after the radix point.
          e.g., the eight bits starting at the ninth bit position
         are 00111111 in binary or 3F in hexadecimal.
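
The bit-position convention can be checked with a small fixed-point computation. The sketch below uses Machin's formula with integer arithmetic, which is practical only for small positions and is not the method used for the record computation. The function names and the number of guard bits are assumptions.

```python
def arctan_inverse(x, one):
    """Approximate arctan(1/x) * one with the alternating Taylor series."""
    x_squared = x * x
    power = one // x                      # one / x^(2k+1), updated each round
    total = power
    k, sign = 3, -1
    while power:
        power //= x_squared
        total += sign * (power // k)
        k += 2
        sign = -sign
    return total


def pi_bits(start, count, guard_bits=64):
    """Bits of pi at positions start .. start+count-1 after the radix point."""
    precision = start + count - 1
    one = 1 << (precision + guard_bits)
    # Machin: pi = 16 * arctan(1/5) - 4 * arctan(1/239).
    pi_scaled = 16 * arctan_inverse(5, one) - 4 * arctan_inverse(239, one)
    return (pi_scaled >> guard_bits) & ((1 << count) - 1)


print(format(pi_bits(9, 8), "08b"), format(pi_bits(9, 8), "02X"))
```

Positions are 1-indexed after the radix point, so `pi_bits(1, 32)` covers the first eight hexadecimal digits of the fraction.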

A New World Record
          Yahoo! Cloud Computing (July 2010)
              • Machines: Idle slices of 1000-node clusters
                     Each node has two quad-core 1.8-2.5 GHz CPUs
              • Duration: 23 days
              • CPU time: 503 years
              • Verification: 582 years CPU time




A New World Record
          Bit values (in hexadecimal)
                 0E6C1294 AED40403 F56D2D76 4026265B
                 CA98511D 0FCFFAA1 0F4D28B1 BB5392B8
                 (256 bits)

              The first bit position: 1,999,999,999,999,997 (= \( 2 \cdot 10^{15} - 3 \))

              The last bit position: 2,000,000,000,000,252 (= \( 2 \cdot 10^{15} + 252 \))

              The two quadrillionth (\( 2 \cdot 10^{15} \)-th) bit is 0.


BBC News                  (16 Sep 2010)


          Pi record smashed as team finds two-quadrillionth digit
         http://www.bbc.co.uk/news/technology-11313194




NewScientist                  (17 Sep 2010)


          New pi record exploits Yahoo’s computers
         http://www.newscientist.com/article/dn19465-new-pi-record-exploits-yahoos-com
         html




Other News Coverage
          • New Pi Record Exploits Yahoo’s Computers
         http://cacm.acm.org/news/99207-new-pi-record-exploits-yahoos-computers

          • The Yahoo! boffin scores pi’s two quadrillionth bit
         http://www.theregister.co.uk/2010/09/16/pi_record_at_yahoo

          • Pi calculation more than doubles old record
         http://www.radionz.co.nz/news/world/57128/pi-calculation-more-than-doubles-ol

          • Hadoop used to calculate Pi’s two quadrillionth bit
         http://www.zdnet.co.uk/blogs/mapping-babel-10017967/hadoop-used-to-calculate-

          • Yahoo! researcher breaks Pi record in finding the two-quadrillionth digit
         http://www.engadget.com/2010/09/17/yahoo-researcher-breaks-pi-record-in-findi

          • Nicholas Sze of Yahoo Finds Two-Quadrillionth Digit of Pi
         http://science.slashdot.org/story/10/09/16/2155227/Nicholas-Sze-of-Yahoo-Find

          • The 2,000,000,000,000,000th digit of the mathematical constant pi discovered
         http://news.gather.com/viewArticle.action?articleId=281474978525563

          • Researcher Shatters Pi Record by Finding Two-Quadrillionth Digit
         http://www.maximumpc.com/article/news/researcher_shatters_pi_record_finding_
         two-quadrillionth_digit

          • A bigger slice of pi
         http://radar.oreilly.com/2010/09/strata-week-grabbing-a-slice.html

          • 2 Quadrillionth digit of PI is found: Scientist celebration in worldwide Pandemonium
         http://engforum.pravda.ru/showthread.php?296242-2-Quadrillionth-digit-of-PI-i

          • And the number is...0
         http://www.hexus.net/content/item.php?item=26505

          • Pi Record Smashed as Team Finds Two-Quadrillionth Digit
         http://hardocp.com/news/2010/09/16/pi_record_smashed_as_team_finds_twoquadril
         digit

          • Yahoo Engineer Calculates Two Quadrillionth Bit Of Pi
         http://www.webpronews.com/topnews/2010/09/17/yahoo-engineer-calculates-two-qu

          • A Cloud Computing Milestone: Yahoo! Reaches the 2 Quadrillionth Bit of Pi
         http://www.readwriteweb.com/cloud/2010/09/a-cloud-computing-milestone-ya.
         php

          • Yahoo researcher Nicolas Sze determines the 2,000,000,000,000,000th digit of the mathematical constant pi
         http://www.thaindian.com/newsportal/sci-tech/yahoo-researcher-nicolas-sze-det
         100430278.html

          ...
Computing π
          How to compute the nth bits of π?


              Let’s ignore this question in this talk ...
              and focus on:

          How to execute such a huge computation?




Map- & Reduce-side Computations
          Developed a generic framework to execute tasks
         on either the map-side or the reduce-side.

          Applications define two functions:

              • partition(c, m):
                partition the computation c into m parts.
              • compute(c):
                execute the computation c
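
The two application-defined functions can be sketched as follows. The `RangeSum` class, its summand, and the `run_map_side` driver are hypothetical stand-ins chosen for illustration. They are not the classes from the talk's implementation, and the driver runs the parts sequentially rather than as Hadoop tasks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeSum:
    """A computation c: the sum of k*k for start <= k < end (illustrative)."""
    start: int
    end: int

    def partition(self, m):
        """Partition the computation into m parts covering the same range."""
        size = self.end - self.start
        bounds = [self.start + size * i // m for i in range(m + 1)]
        return [RangeSum(lo, hi) for lo, hi in zip(bounds, bounds[1:])]

    def compute(self):
        """Execute the computation."""
        return sum(k * k for k in range(self.start, self.end))


def run_map_side(c, m):
    """Stand-in for a map-side job: one part per mapper, results combined."""
    return sum(part.compute() for part in c.partition(m))


job = RangeSum(0, 1000)
print(run_map_side(job, 7) == job.compute())
```

Because the parts are independent, the same two functions would serve whether the parts run in mappers or in reducers.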


Map-side Job
          Contains multiple mappers and zero reducers
              • A PartitionInputFormat partitions c
                into m parts
              • Each part is executed by a mapper




Reduce-side Job
          Contains a mapper and multiple reducers
              • A SingletonInputFormat launches
                a PartitionMapper
              • An Indexer launches m reducers.




Abstract Machine           (1)




          Machine
         – an abstract base class that allows abstract Runner(s)
         to execute MachineComputable tasks.
          Machine subclasses
              • Map Side Machine
                m100t3: 100 maps with 3 threads each.
              • Reduce Side Machine
                r50t2: 50 reduces with 2 threads each.


Abstract Machine                      (2)




          More Machine subclasses
              • Mix Machine – chooses Map-/Reduce-side
                jobs according to the cluster status.
                x-m200t1-r100t2-5: either launch a job with 200 maps
               with 1 thread each, or a job with 100 reduces with 2 threads each.

              • Alternation Machine – alternates Map-side
                and Reduce-side jobs in a regular pattern.
                a-m200t1-r100t2-mrr: submit a map job, then a reduce job,
               then another reduce job, and repeat this pattern.

              • Null Machine – does nothing for testing.
Utilizing The Idle Slices
          Monitor cluster status
              • Submit a map-side (or reduce-side) job if there
                are sufficient available map (or reduce) slots.

          Small jobs
              • Hold resource only for a short period of time

          Interruptible & resumable
              • can be interrupted at any time by simply
                killing the running jobs

Running The Jobs




The Implementation
          Main programs:
              DistBbp – a program to submit jobs.
              DistSum – distributed summation.


          Open source – available at
         https://issues.apache.org/jira/browse/MAPREDUCE-1923




The World Record Computation
          35,000 MapReduce jobs, each job either has:
              • 200 map tasks with one thread each, or
              • 100 reduce tasks with two threads each.

          Each thread computes 200,000,000 terms
              • ∼45 minutes.

          Submit up to 60 concurrent jobs
          The entire computation took:
              • 23 days of real time and 503 CPU years
References
    •    [1] Tsz-Wo Sze. Schönhage-Strassen Algorithm with MapReduce for
         Multiplying Terabit Integers. Symbolic-Numeric Computation 2011,
         to appear. Preprint available at
         http://people.apache.org/~szetszwo/ssmr20110430.pdf


    •    [2] Tsz-Wo Sze. The Two Quadrillionth Bit of Pi is 0! Distributed
         Computation of Pi with Apache Hadoop. In IEEE 2nd International
         Conference on Cloud Computing Technology and Science (CloudCom),
         pages 727-732, 2010. (Earlier versions available at
         http://arxiv.org/abs/1008.3171)




Thank you!



Tsz-Wo Sze,   Hadoop Summit 2011   77

More Related Content

What's hot

Open vSwitch - Stateful Connection Tracking & Stateful NAT
Open vSwitch - Stateful Connection Tracking & Stateful NATOpen vSwitch - Stateful Connection Tracking & Stateful NAT
Open vSwitch - Stateful Connection Tracking & Stateful NATThomas Graf
 
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...
Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...Introduction to Graph Neural Networks: Basics and Applications - Katsuhiko Is...
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 

Large Scale Math with Hadoop MapReduce

  • 1. Large Scale Math with Hadoop MapReduce Tsz-Wo (Nicholas) Sze, PhD Hadoop Summit June 29, 2011 1
  • 2. Who am I? • Hortonworks Software Engineer • Apache Hadoop PMC Member • Mathematician Interests: Distributed Computing Algorithms Number Theory 2
  • 3. Agenda • Introduction • Integer Multiplication • MapReduce-FFT • MapReduce-Sum • MapReduce-SSA • A New World Record • The “Machine” Behind the Computation Tsz-Wo Sze, Hadoop Summit 2011 3
  • 5. Typical Hadoop Applications Major applications of Hadoop include • Search and crawling • Text processing • Machine learning • ...
  • 6. Typical Hadoop Applications Major applications of Hadoop include • Search and crawling • Text processing • Machine learning • ... But Hadoop is not yet commonly used in scientific or mathematical applications. Why?
  • 7. Why Not Math? No MapReduce math libraries are available and, more fundamentally, MapReduce math algorithms are not well studied.
  • 8. Existing Library Really no MapReduce Math Library? Not exactly.
  • 9. Existing Library Really no MapReduce Math Library? Not exactly. Apache Mahout • A machine learning library. • Includes packages for matrix operations.
  • 10. Existing Library Really no MapReduce Math Library? Not exactly. Apache Mahout • A machine learning library. • Includes packages for matrix operations. Apache Hama (Incubation) • A matrix computational package.
  • 11. Computationally Intensive Problems (1) Integer Factoring • a.k.a. breaking the RSA cryptosystem: given $N$, $e$ and $c$, compute $m$ such that $c \equiv m^e \pmod{N}$, where $N$ is a product of two primes. • A 768-bit RSA modulus was factored in 2009 (Kleinjung et al., Factorization of a 768-bit RSA modulus, CRYPTO 2010).
  • 12. Computationally Intensive Problems (2) Solving PDEs (Partial Differential Equations) • Fluid dynamics • Electromagnetism • Financial analysis • ... (Two-dimensional Turbulence, courtesy of Y.K. Tsang)
  • 13. Computationally Intensive Problems (3) Finding complex zeros of the Riemann zeta function, $\zeta(s) = \sum_{n=1}^{\infty} \frac{1}{n^s}$ for $s \in \mathbb{C}$, $\Re(s) > 1$, and then analytically continued to all $s \neq 1$.
  • 14. Computationally Intensive Problems (3) Finding complex zeros of the Riemann zeta function, $\zeta(s) = \sum_{n=1}^{\infty} \frac{1}{n^s}$ for $s \in \mathbb{C}$, $\Re(s) > 1$, and then analytically continued to all $s \neq 1$. • Disprove the Riemann Hypothesis (RH). Then, you will get 1,000,000 dollars (see http://www.claymath.org/millennium/Riemann_Hypothesis/). However, RH is unlikely to be false.
  • 15. Computationally Intensive Problems (3) Finding complex zeros of the Riemann zeta function, $\zeta(s) = \sum_{n=1}^{\infty} \frac{1}{n^s}$ for $s \in \mathbb{C}$, $\Re(s) > 1$, and then analytically continued to all $s \neq 1$. • Disprove the Riemann Hypothesis (RH). Then, you will get 1,000,000 dollars. However, RH is unlikely to be false. • More likely: obtain more evidence which supports RH.
  • 16. Computationally Intensive Problems (4) Computing π Latest world records: • Five trillion decimal digits (August 2010) by Alexander Yee & Shigeru Kondo (see http://www.numberworld.org/misc_runs/pi-5t/announce_en.html)
  • 17. Computationally Intensive Problems (4) Computing π Latest world records: • Five trillion decimal digits (August 2010) by Alexander Yee & Shigeru Kondo • The two quadrillionth bit (July 2010) by Tsz-Wo Sze & the Yahoo! Cloud Computing Team (see http://developer.yahoo.net/blogs/hadoop/2010/09/two_quadrillionth_bit_pi.html)
  • 18. Missing Functionalities • Fast Fourier Transform (FFT) – the basic routine behind many algorithms. • Arbitrary Precision Arithmetic: integer functions, floating-point functions, complex functions, ...
  • 19. Agenda • Introduction • Integer Multiplication • MapReduce-FFT • MapReduce-Sum • MapReduce-SSA • A New World Record • The “Machine” Behind the Computation
  • 20. Why Integer Multiplication? There exist fast algorithms. Many applications • Division • Logarithm • Trigonometric functions • ...
  • 21. Prerequisite of Algorithms D.J. Bernstein, Fast multiplication and its applications, ANTS 2008.
  • 22. Integer Multiplication Algorithms • Naïve, $O(N^2)$ • Karatsuba, $O(N^{\log_2 3}) = O(N^{1.585})$ • Toom-Cook, $O(N^{\log(2D-1)/\log D})$; if $D = 3$, then $O(N^{\log 5/\log 3}) = O(N^{1.465})$ • FFT-based algorithms, $O(N \log N \cdots)$
  • 23. FFT-based Algorithms • Basic FFT, $O(N \log N \log\log N \log\log\log N \cdots)$ • Schönhage-Strassen, $O(N \log N \log\log N)$ • Nussbaumer, $O(N \log N \log\log N)$ • Fürer, $O(N (\log N)\, 2^{\log^* N})$ • De-Kurur-Saha-Saptharishi, $O(N (\log N)\, 2^{\log^* N})$
  • 24. Convolution By the convolution theorem, $a \times b = \mathrm{dft}^{-1}(\mathrm{dft}(a) * \mathrm{dft}(b))$, where $\times$ denotes the convolution operator, $*$ denotes componentwise multiplication, and $\mathrm{dft}(\cdot)$ denotes the discrete Fourier transform.
  • 25. Schönhage-Strassen Algorithm (SSA) Represent integers as polynomials. Then, compute the convolution with DFTs modulo an integer of the form $2^n + 1$, which is called the Schönhage-Strassen modulus.
  • 26. SSA Steps Step 1: two DFTs, $\hat{a} \overset{\text{def}}{=} \mathrm{dft}(a)$ and $\hat{b} \overset{\text{def}}{=} \mathrm{dft}(b)$; Step 2: componentwise multiplication, $\hat{p} \overset{\text{def}}{=} \hat{a} * \hat{b}$; Step 3: a DFT inverse, $p = \mathrm{dft}^{-1}(\hat{p})$; Step 4: normalization.
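The four SSA steps can be tried out at toy scale. The following is a minimal sketch, not the talk's implementation: it assumes a modulus 2^32 + 1, 8-bit coefficients and a naive quadratic-time DFT in place of an FFT, and all names (`MOD`, `to_poly`, `multiply`) are made up for illustration. It relies on the fact that 2 has order 2n modulo 2^n + 1, so 2 can serve as the root of unity for a transform of length 2n.

```python
N_EXP = 32                  # n in the modulus 2**n + 1 (toy size, an assumption)
MOD = 2 ** N_EXP + 1        # a modulus of the Schönhage-Strassen form
D = 2 * N_EXP               # transform length: 2 has order 2n modulo 2**n + 1
OMEGA = 2                   # D-th root of unity modulo MOD
CHUNK_BITS = 8              # bits per polynomial coefficient (an assumption)


def dft(a, root):
    """Naive O(D^2) DFT of the length-D list a modulo MOD (no FFT here)."""
    return [sum(a[k] * pow(root, j * k, MOD) for k in range(D)) % MOD
            for j in range(D)]


def to_poly(x):
    """Split x into D coefficients; index 0 holds the least significant chunk."""
    # Only the lower D/2 coefficients may be nonzero, so the cyclic
    # convolution of two such polynomials equals the ordinary product.
    assert 0 <= x < 2 ** (CHUNK_BITS * D // 2)
    return [(x >> (CHUNK_BITS * k)) & (2 ** CHUNK_BITS - 1) for k in range(D)]


def multiply(x, y):
    a_hat = dft(to_poly(x), OMEGA)                        # Step 1
    b_hat = dft(to_poly(y), OMEGA)
    p_hat = [s * t % MOD for s, t in zip(a_hat, b_hat)]   # Step 2
    inv_d = pow(D, -1, MOD)                               # needs Python 3.8+
    inv_omega = pow(OMEGA, -1, MOD)
    p = [c * inv_d % MOD for c in dft(p_hat, inv_omega)]  # Step 3
    # Step 4: evaluating the polynomial at 2**CHUNK_BITS propagates carries.
    return sum(c << (CHUNK_BITS * k) for k, c in enumerate(p))
```

With these sizes each operand is limited to 256 bits and every product coefficient stays below `MOD`, which is what makes the result exact.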
  • 27. Calculating DFTs DFT can be calculated by a family of algorithms called Fast Fourier Transform (FFT).
  • 28. FFT Family • Recursive-FFT • Parallel-FFT • Cooley-Tukey (decimation-in-time) • Gentleman-Sande (decimation-in-frequency) • Danielson-Lanczos • Ping-pong FFT • ...
  • 29. Data Model (1) Need a data model which allows accessing terabit integers efficiently. An integer $x$ is represented as a $D$-dimensional tuple $x = (x_{D-1}, x_{D-2}, \ldots, x_0)$.
  • 30. Data Model (2) Write $D = IJ$, where $I$ and $J$ are powers of two. Define $J$-dimensional tuples $x^{(i)} \overset{\text{def}}{=} (x_{(J-1)I+i}, x_{(J-2)I+i}, \ldots, x_i)$ for $0 \le i < I$.
  • 31. Data Model (3) Then, $\begin{pmatrix} x^{(0)} \\ x^{(1)} \\ \vdots \\ x^{(I-1)} \end{pmatrix} = \begin{pmatrix} x_{(J-1)I} & x_{(J-2)I} & \cdots & x_0 \\ x_{(J-1)I+1} & x_{(J-2)I+1} & \cdots & x_1 \\ \vdots & \vdots & \ddots & \vdots \\ x_{(J-1)I+(I-1)} & x_{(J-2)I+(I-1)} & \cdots & x_{I-1} \end{pmatrix}$. We call it the $(I, J)$-format of $x$.
  • 32. Data Model (4) Each $x^{(i)}$ is a sequence of $J$ records. Each record is a key-value pair <Key, Value>. Record 0: <$i$, $x_i$>; Record 1: <$I + i$, $x_{I+i}$>; ...; Record $J-1$: <$(J-1)I + i$, $x_{(J-1)I+i}$>.
  • 33. Data Model (5) Thus, an integer is stored as $I$ SequenceFiles in HDFS, and each SequenceFile contains $J$ records.
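The (I, J)-format can be illustrated with plain Python lists standing in for SequenceFiles. This is a sketch under assumptions: `to_ij_format` is a hypothetical helper, the list `x` is indexed so that `x[k]` is the component with subscript k, and no HDFS API is involved.

```python
def to_ij_format(x, I, J):
    """Split the D = I*J components of x into I 'files' of J records each.

    File i holds the key-value records for subscripts i, I+i, ..., (J-1)I+i,
    i.e. every I-th component starting at i.
    """
    assert len(x) == I * J
    return {i: [(j * I + i, x[j * I + i]) for j in range(J)] for i in range(I)}


# Example: D = 8 components stored as I = 2 files with J = 4 records each.
files = to_ij_format(list(range(100, 108)), I=2, J=4)
```

Here `files[0]` holds the records with even subscripts and `files[1]` those with odd subscripts, so each file is a strided slice of the integer.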
  • 34. Parallel-FFT Steps Step 1: $I$ inner DFTs with $J$-point, $\hat{a}^{(i)} \overset{\text{def}}{=} \mathrm{dft}(a^{(i)})$; Step 2: componentwise shifting, $z_{jI+i} \overset{\text{def}}{=} \zeta^{ij}\, \hat{a}^{(i)}_j$; Step 3: transposition, $z^{[j]} \overset{\text{def}}{=} (z_{jI+(I-1)}, z_{jI+(I-2)}, \ldots, z_{jI})$; Step 4: $J$ outer DFTs with $I$-point, $\hat{z}^{[j]} \overset{\text{def}}{=} \mathrm{dft}(z^{[j]})$.
  • 35. MapReduce Model Input → Map1, Map2, Map3, Map4 → Shuffle → Reduce1, Reduce2, Reduce3, Reduce4 → Output
  • 36. MapReduce-FFT Input → Inner FFT1, Inner FFT2, Inner FFT3, Inner FFT4 → Transposition (by shuffle) → Outer FFT1, Outer FFT2, Outer FFT3, Outer FFT4 → Output
  • 37. Data Locality The FFT transposition, which is traditionally difficult to perform while preserving locality, becomes trivial in MapReduce.
  • 38. MapReduce-FFT (1) Map function: $(k_1, v_1) \to \mathrm{list}\langle k_2, v_2 \rangle$. Algorithm 1 (Forward FFT, Mapper). (f.m.1) read key $i$, value $a^{(i)}$; (f.m.2) calculate a $J$-point DFT; (f.m.3) componentwise multiply; (f.m.4) for $0 \le j < J$, emit key $j$, value $(i, z_{jI+i})$.
  • 39. MapReduce-FFT (2) Reduce function: $(k_2, \mathrm{list}\langle v_2 \rangle) \to \mathrm{list}\langle k_3, v_3 \rangle$. Algorithm 2 (Forward FFT, Reducer). (f.r.1) receive key $j$, list $[(i, z_{jI+i})]_{0 \le i < I}$; (f.r.2) calculate an $I$-point DFT; (f.r.3) write key $j$, value $\hat{z}^{[j]}$.
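The mapper and reducer above can be simulated in a single process to check that the inner DFTs, the twiddle multiplication, the shuffle and the outer DFTs together reproduce an ordinary DFT. This sketch makes several assumptions that are not in the talk: it uses complex floating-point arithmetic instead of a modular transform, a dictionary stands in for the shuffle, and the names `mapper`, `reducer` and `mapreduce_fft` are illustrative only.

```python
import cmath
from collections import defaultdict


def dft(v):
    """Naive DFT, standing in for the small inner and outer transforms."""
    n = len(v)
    return [sum(v[t] * cmath.exp(-2j * cmath.pi * s * t / n) for t in range(n))
            for s in range(n)]


def mapper(i, a_i, I, J):
    """(f.m.1)-(f.m.4): J-point DFT of a^(i), twiddle by zeta^(ij), emit by j."""
    a_hat = dft(a_i)
    for j in range(J):
        zeta_ij = cmath.exp(-2j * cmath.pi * i * j / (I * J))
        yield j, (i, zeta_ij * a_hat[j])


def reducer(j, values, I):
    """(f.r.1)-(f.r.3): order the I received values by i, then I-point DFT."""
    z_j = [z for _, z in sorted(values, key=lambda pair: pair[0])]
    return j, dft(z_j)


def mapreduce_fft(a, I, J):
    """Run the map, shuffle and reduce phases on a list a of length I*J."""
    shuffle = defaultdict(list)            # plays the role of the transposition
    for i in range(I):
        for key, value in mapper(i, a[i::I], I, J):
            shuffle[key].append(value)
    out = [0j] * (I * J)
    for j in range(J):
        _, z_hat = reducer(j, shuffle[j], I)
        for i2 in range(I):
            out[j + J * i2] = z_hat[i2]    # reducer j yields outputs j, j+J, ...
    return out
```

Each reducer ends up holding every J-th output component, so the result is naturally laid out in (J, I)-format.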
  • 40. Normalization Normalization can be viewed as a summation of three integers.
  • 41. Summation Integer summation can be done by (1) componentwise summation, (2) carry evaluation and then (3) parallel carrying.
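The three phases of summation can be sketched on lists of base-2^16 digits. The talk does not spell out how carries are evaluated, so this sketch assumes a simple sequential scan for that phase. The base and all function names are assumptions chosen for illustration.

```python
BASE = 2 ** 16  # digit base (an assumption for this sketch)


def to_digits(x, length):
    """Little-endian base-BASE digits of x, padded to the given length."""
    return [(x >> (16 * k)) & (BASE - 1) for k in range(length)]


def from_digits(digits):
    return sum(d << (16 * k) for k, d in enumerate(digits))


def componentwise_sum(*numbers):
    """Phase 1: add digit by digit with no carrying; entries may exceed BASE."""
    return [sum(column) for column in zip(*numbers)]


def evaluate_carries(s):
    """Phase 2: carries[k] is the carry entering position k (sequential scan)."""
    carries = [0]
    for d in s:
        carries.append((d + carries[-1]) // BASE)
    return carries


def apply_carries(s, carries):
    """Phase 3: once carries are known, every digit is fixed independently."""
    digits = [(d + c) % BASE for d, c in zip(s, carries)]
    return digits + [carries[-1]]          # final carry becomes the top digit
```

Phases 1 and 3 touch each position independently, which is why they parallelize easily; only the carry evaluation needs information from lower positions.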
  • 42. MapReduce Model Input → Map1, Map2, Map3, Map4 → Shuffle → Reduce1, Reduce2, Reduce3, Reduce4 → Output
  • 43. MapReduce-Sum Input → Summation1, Summation2, Summation3, Summation4 → Carry Evaluation (modified shuffle) → Carrying1, Carrying2, Carrying3, Carrying4 → Output
  • 44. Job 1: Componentwise Summation Input → Summation1, Summation2, Summation3, Summation4 → Output. A map-only job.
  • 45. Job 2: Carrying Input → Carry Evaluation → Carrying1, Carrying2, Carrying3, Carrying4 → Output
  • 46. MapReduce-SSA • two concurrent forward FFT jobs; • a backward FFT job with componentwise multiplication and splitting; • a componentwise summation map-only job; • a carrying job. It is possible to combine the last two jobs if we modify the shuffle process in MapReduce [.next].
  • 47. Prototype Implementation • DistMpMult – distributed multi-precision multiplication • DistFft – distributed FFT • DistCompSum – distributed componentwise summation • DistCarrying – distributed carrying Open source – available at https://issues.apache.org/jira/browse/MAPREDUCE-2471
  • 48. Cluster Configuration A shared cluster: • Apache Hadoop 0.20 • 1350 nodes • 6 GB memory per node • 2 map tasks & 1 reduce task per node. A limitation was imposed on the aggregated memory usage of individual jobs.
  • 49. Running Time Actual running time for $2^{36} \le N \le 2^{40}$. [Figure: log(t) plotted against log(N), where t is the elapsed time in seconds.]
  • 50. Agenda • Introduction • Integer Multiplication • MapReduce-FFT • MapReduce-Sum • MapReduce-SSA • A New World Record • The “Machine” Behind the Computation
  • 51. What is π? π is a mathematical constant such that, for any circle, $\pi = \frac{\text{circumference}}{\text{diameter}} = \frac{C}{d}$.
  • 52. What is π? π is a mathematical constant such that, for any circle, $\pi = \frac{\text{circumference}}{\text{diameter}} = \frac{C}{d}$. We have π = 3.244
  • 53. What is π? π is a mathematical constant such that, for any circle, $\pi = \frac{\text{circumference}}{\text{diameter}} = \frac{C}{d}$. We have π = 3.244 (in hexadecimal)
  • 54. Decimal, Hexadecimal & Binary Representing π in different bases: π = 3.1415926535 8979323846 2643383279 ... (decimal) = 3.243F6A88 85A308D3 13198A2E ... (hexadecimal) = 11.00100100 00111111 01101010 ... (binary). Bit position is counted after the radix point, e.g., the eight bits starting at the ninth bit position are 00111111 in binary or 3F in hexadecimal.
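The bit-position convention can be checked against the hexadecimal expansion quoted on the slide. This small sketch hard-codes those 24 hexadecimal digits; the helper name `bits_of_pi` is made up for illustration.

```python
# Fractional hexadecimal digits of pi as quoted on the slide.
PI_HEX_FRACTION = "243F6A8885A308D313198A2E"


def bits_of_pi(start, count):
    """Return `count` bits starting at 1-based position `start` after the radix point."""
    bits = "".join(f"{int(h, 16):04b}" for h in PI_HEX_FRACTION)
    return bits[start - 1:start - 1 + count]


ninth_to_sixteenth = bits_of_pi(9, 8)   # the eight bits starting at position 9
```

Because each hexadecimal digit carries four bits, position 9 is the first bit of the third hexadecimal digit, so those eight bits correspond to the digits 3F.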
  • 55. A New World Record Yahoo! Cloud Computing (July 2010) • Machines: idle slices of 1000-node clusters; each node has two quad-core 1.8-2.5 GHz CPUs • Duration: 23 days • CPU time: 503 years • Verification: 582 years CPU time
  • 56. A New World Record Bit values (in hexadecimal) 0E6C1294 AED40403 F56D2D76 4026265B CA98511D 0FCFFAA1 0F4D28B1 BB5392B8
  • 57. A New World Record Bit values (in hexadecimal) 0E6C1294 AED40403 F56D2D76 4026265B CA98511D 0FCFFAA1 0F4D28B1 BB5392B8 (256 bits) The first bit position: 1,999,999,999,999,997 ($= 2 \cdot 10^{15} - 3$). The last bit position: 2,000,000,000,000,252 ($= 2 \cdot 10^{15} + 252$). The two quadrillionth ($2 \cdot 10^{15}$th) bit is 0.
  • 58. BBC News (16 Sep 2010) Pi record smashed as team finds two-quadrillionth digit http://www.bbc.co.uk/news/technology-11313194
  • 59. NewScientist (17 Sep 2010) New pi record exploits Yahoo’s computers http://www.newscientist.com/article/dn19465-new-pi-record-exploits-yahoos-com html
  • 60. Other News Coverage New Pi Record Exploits Yahoo’s Computers http://cacm.acm.org/news/99207-new-pi-record-exploits-yahoos-computers The Yahoo! boffin scores pi’s two quadrillionth bit http://www.theregister.co.uk/2010/09/16/pi_record_at_yahoo Pi calculation more than doubles old record http://www.radionz.co.nz/news/world/57128/pi-calculation-more-than-doubles-ol Hadoop used to calculate Pi’s two quadrillionth bit http://www.zdnet.co.uk/blogs/mapping-babel-10017967/hadoop-used-to-calculate-
  • 61. Yahoo! researcher breaks Pi record in finding the two-quadrillionth digit http://www.engadget.com/2010/09/17/yahoo-researcher-breaks-pi-record-in-findi Nicholas Sze of Yahoo Finds Two-Quadrillionth Digit of Pi http://science.slashdot.org/story/10/09/16/2155227/Nicholas-Sze-of-Yahoo-Find The 2,000,000,000,000,000th digit of the mathematical constant pi discovered http://news.gather.com/viewArticle.action?articleId=281474978525563 Researcher Shatters Pi Record by Finding Two-Quadrillionth Digit http://www.maximumpc.com/article/news/researcher_shatters_pi_record_finding_ two-quadrillionth_digit
  • 62. A bigger slice of pi http://radar.oreilly.com/2010/09/strata-week-grabbing-a-slice.html 2 Quadrillionth digit of PI is found: Scientist celebration in worldwide Pandemonium http://engforum.pravda.ru/showthread.php?296242-2-Quadrillionth-digit-of-PI-i And the number is...0 http://www.hexus.net/content/item.php?item=26505 Pi Record Smashed as Team Finds Two-Quadrillionth Digit http://hardocp.com/news/2010/09/16/pi_record_smashed_as_team_finds_twoquadril digit
  • 63. Yahoo Engineer Calculates Two Quadrillionth Bit Of Pi http://www.webpronews.com/topnews/2010/09/17/yahoo-engineer-calculates-two-qu A Cloud Computing Milestone: Yahoo! Reaches the 2 Quadrillionth Bit of Pi http://www.readwriteweb.com/cloud/2010/09/a-cloud-computing-milestone-ya. php Yahoo researcher Nicolas Sze determines the 2,000,000,000,000,000th digit of the mathematical constant pi http://www.thaindian.com/newsportal/sci-tech/yahoo-researcher-nicolas-sze-det 100430278.html ...
  • 64. Computing π How to compute the nth bit of π?
  • 65. Computing π How to compute the nth bit of π? Let’s ignore this question in this talk ... and focus on:
  • 66. Computing π How to compute the nth bit of π? Let’s ignore this question in this talk ... and focus on: How to execute such a huge computation?
  • 67. Map- & Reduce-side Computations Developed a generic framework to execute tasks on either the map-side or the reduce-side. Applications define two functions: • partition(c, m): partition the computation c into m parts. • compute(c): execute the computation c.
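The `partition`/`compute` contract can be illustrated with a toy computation. The sketch below is an assumption-laden example rather than the framework's actual interface: the `RangeSum` class and the sum of squares it evaluates are hypothetical stand-ins for a real application such as a series summation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RangeSum:
    """A toy computation: the sum of k*k for start <= k < stop."""
    start: int
    stop: int


def partition(c, m):
    """Split computation c into m contiguous parts of nearly equal size."""
    size, extra = divmod(c.stop - c.start, m)
    parts, lo = [], c.start
    for p in range(m):
        hi = lo + size + (1 if p < extra else 0)
        parts.append(RangeSum(lo, hi))
        lo = hi
    return parts


def compute(c):
    """Execute computation c; each part could run in its own task."""
    return sum(k * k for k in range(c.start, c.stop))


whole = RangeSum(0, 1000)
partial_results = [compute(part) for part in partition(whole, 7)]
```

Since the parts are independent, the same two functions can be driven from mappers or from reducers, with the partial results combined afterwards.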
  • 68. Map-side Job Contains multiple mappers and zero reducers • A PartitionInputFormat partitions c into m parts • Each part is executed by a mapper
  • 69. Reduce-side Job Contains a mapper and multiple reducers • A SingletonInputFormat launches a PartitionMapper • An Indexer launches m reducers.
  • 70. Abstract Machine (1) Machine – an abstract base class that allows abstract Runner(s) to execute MachineComputable tasks. Machine subclasses • Map Side Machine, e.g. m100t3: 100 maps with 3 threads each. • Reduce Side Machine, e.g. r50t2: 50 reduces with 2 threads each.
  • 71. Abstract Machine (2) More Machine subclasses • Mix Machine – chooses Map-/Reduce-side jobs according to the cluster status. x-m200t1-r100t2-5: launch either a job with 200 maps with 1 thread each, or a job with 100 reduces with 2 threads each. • Alternation Machine – alternates Map-side and Reduce-side jobs in a regular pattern. a-m200t1-r100t2-mrr: submit a map job, then a reduce job, then another reduce job, and repeat this pattern. • Null Machine – does nothing; used for testing.
  • 72. Utilizing The Idle Slices Monitor cluster status • Submit a map-side (or reduce-side) job if there are sufficient available map (or reduce) slots. Small jobs • Hold resources only for a short period of time. Interruptible & resumable • Can be interrupted at any time by simply killing the running jobs.
  • 73. Running The Jobs
  • 74. The Implementation Main programs: • DistBbp – a program to submit jobs. • DistSum – distributed summation. Open source – available at https://issues.apache.org/jira/browse/MAPREDUCE-1923
  • 75. The World Record Computation 35,000 MapReduce jobs, each job either has: • 200 map tasks with one thread each, or • 100 reduce tasks with two threads each. Each thread computes 200,000,000 terms • ∼45 minutes. Submit up to 60 concurrent jobs. The entire computation took: • 23 days of real time and 503 CPU years
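The figures on this slide can be combined into two rough totals. The arithmetic below is a back-of-the-envelope sketch: it assumes every one of the 35,000 jobs ran its full complement of 200 threads and takes a CPU year to be 365.25 days, neither of which is stated in the talk.

```python
JOBS = 35_000
THREADS_PER_JOB = 200            # 200 maps x 1 thread, or 100 reduces x 2 threads
TERMS_PER_THREAD = 200_000_000

# Total number of terms evaluated, if every job ran all of its threads.
total_terms = JOBS * THREADS_PER_JOB * TERMS_PER_THREAD

CPU_YEARS = 503
WALL_CLOCK_DAYS = 23

# Average number of CPU cores kept busy over the whole run
# (assumes a CPU year of 365.25 days).
average_busy_cores = CPU_YEARS * 365.25 / WALL_CLOCK_DAYS
```

Under these assumptions the run evaluated about 1.4 · 10^15 terms and kept roughly eight thousand cores busy on average.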
  • 76. References • [1] Tsz-Wo Sze. Schönhage-Strassen Algorithm with MapReduce for Multiplying Terabit Integers. Symbolic-Numeric Computation 2011, to appear. Preprint available at http://people.apache.org/~szetszwo/ssmr20110430.pdf • [2] Tsz-Wo Sze. The Two Quadrillionth Bit of Pi is 0! Distributed Computation of Pi with Apache Hadoop. In IEEE 2nd International Conference on Cloud Computing Technology and Science (CloudCom), pages 727-732, 2010. (Earlier versions available at http://arxiv.org/abs/1008.3171)
  • 77. Thank you!