Big Data Technologies and
           Techniques
                Ryan Brush
Distinguished Engineer, Cerner Corporation
               @ryanbrush
Relational Databases are Awesome
Atomic, transactional updates
   Guaranteed consistency

Relational Databases are Awesome
             Declarative queries
Easy to reason about
     Long track record of success
Relational Databases are Awesome
          …so use them!

        But…
Those advantages have a cost
Global, atomic state means global,
atomic coordination

      Coordination does not scale linearly
The costs of coordination
     Remember the
     network effect?
The costs of coordination

  $\text{channels} = \frac{n(n - 1)}{2}$

  2 nodes = 1 channel
  5 nodes = 10 channels
  12 nodes = 66 channels
  25 nodes = 300 channels
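
The channel counts above follow directly from the formula. A minimal Python sketch to reproduce them (the function name `channels` is illustrative, not from the talk):

```python
def channels(n: int) -> int:
    """Number of pairwise communication channels among n nodes: n(n - 1) / 2."""
    return n * (n - 1) // 2


for nodes in (2, 5, 12, 25):
    print(f"{nodes} nodes = {channels(nodes)} channels")
```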
So we better be able to scale
The costs of coordination
  Databases have optimized this in
  many clever ways, but a limit on
  scalability still exists
Let’s look at some ways to scale
Bulk processing billions of records
 Data aggregation and storage
    Real-time processing of updates
 Serving data for: Online Apps
                   Analytics
Let’s start with scalability of
bulk processing
Quiz: which one is scalable?
    1000-node Hadoop cluster where
    jobs depend on a common process
    1000 Windows ME machines running
    independent Excel macros
Independence → Parallelizable
Parallelizable → Scalable
“Shared Nothing” architectures are the
most scalable…
     …but most real-world problems require
     us to share something…
  …so our designs usually have a parallel
  part and a serial part
The key is to make sure the vast majority
of our work in the cloud is independent and
parallelizable.
Amdahl’s Law
$$S(N) = \frac{1}{(1 - P) + \frac{P}{N}}$$
S: speed improvement
P: ratio of the problem that can be parallelized
N: number of processors
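
To get a feel for the formula, here is a small Python sketch that evaluates it. The function name `amdahl_speedup` and the sample value P = 0.95 are illustrative choices, not from the talk:

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Speedup S(N) = 1 / ((1 - P) + P / N) for parallel fraction p on n processors."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    return 1.0 / ((1.0 - p) + p / n)


# Speedup for a problem that is 95% parallelizable, on more and more processors.
for n in (1, 10, 100, 1000):
    print(n, round(amdahl_speedup(0.95, n), 2))
```

As N grows, the speedup approaches 1 / (1 - P), so with P = 0.95 it can never exceed 20 no matter how many processors are added.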
MapReduce Primer
[Diagram: Input Data (Split 1 … Split N) → Map Phase (Mapper 1 … Mapper N, one per split) → Shuffle → Reduce Phase (Reducer 1 … Reducer N)]
MapReduce Example: Word Count
[Diagram: Books → Map Phase (count words per book) → Shuffle → Reduce Phase (sum words, one reducer per letter range: A-C, D-E, …, W-Z)]
Notice there is still a serial part of the
problem: the output of the reducers must be
combined
   …but this is much smaller, and can be
   handled by a single process
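
The word count flow, including the small serial step at the end, can be sketched in plain Python. This is an in-memory illustration only, not Hadoop code. The function names and the letter-based `partition` scheme are assumptions made for the example:

```python
from collections import Counter, defaultdict


def map_book(text):
    """Map phase: count words within a single book."""
    return Counter(text.lower().split())


def partition(word, num_reducers):
    """Shuffle: pick a reducer from the word's first letter (an illustrative scheme)."""
    return (ord(word[0]) - ord("a")) % num_reducers


def word_count(books, num_reducers=3):
    # Map: each book is processed independently, so this loop could run in parallel.
    per_book = [map_book(text) for text in books]

    # Shuffle: route every (word, count) pair to the reducer that owns that word.
    buckets = defaultdict(lambda: defaultdict(list))
    for counts in per_book:
        for word, count in counts.items():
            buckets[partition(word, num_reducers)][word].append(count)

    # Reduce: each reducer sums the partial counts for its own words.
    reduced = [{w: sum(c) for w, c in bucket.items()} for bucket in buckets.values()]

    # Serial step: combine the reducers' disjoint outputs.
    total = {}
    for part in reduced:
        total.update(part)
    return total


print(word_count(["the cat sat", "the dog sat down"]))
```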
Also notice that the network is a shared
resource when processing big data
 So rather than moving data to computation,
 we move computation to data.
MapReduce Data Locality
[Diagram: Input Data (Split 1 … Split N) → Map Phase (Mapper 1 … Mapper N) → Shuffle → Reduce Phase (Reducer 1 … Reducer N); boxes on the original slide mark physical machines]
Data locality is only guaranteed in the Map
phase
 So the most data-intensive work should be
 done in the map, with smaller sets sent to the
 reducer
Some Map/Reduce jobs have no reducer at
all!
MapReduce Gone Wrong
[Diagram: Books → Map Phase (count words per book) → Shuffle → Reduce Phase (sum words: A-C, D-E, …, W-Z), with the reducers calling out to a remote Word Addition Service]
Even if our Word Addition Service is
scalable, we’d need to scale it to the size of
the largest Map/Reduce job that will ever
use it
So for data processing, prefer embedded
libraries over remote services
Use remote services for configuration, to
prime caches, etc. – just not for every data
element!
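
One way to picture this advice: make a single remote call when a task starts, then do all per-record work against local data. The sketch below is a toy under stated assumptions. `StopWordService` is a hypothetical stand-in for a remote service, and the `setup`/`map` method names only loosely mirror a mapper lifecycle:

```python
class StopWordService:
    """Stand-in for a remote service; `calls` counts simulated network round trips."""

    def __init__(self):
        self.calls = 0

    def fetch_stop_words(self):
        self.calls += 1
        return {"the", "a", "of"}


class WordMapper:
    def __init__(self, service):
        self.service = service
        self.stop_words = None

    def setup(self):
        # One remote call per task to prime a local cache.
        self.stop_words = self.service.fetch_stop_words()

    def map(self, word):
        # Per-record work touches only in-process data.
        return None if word in self.stop_words else (word, 1)


service = StopWordService()
mapper = WordMapper(service)
mapper.setup()
pairs = [p for p in map(mapper.map, "the cat of the house".split()) if p is not None]
print(pairs, "remote calls:", service.calls)
```

The remote call count stays at one per task regardless of how many records the task processes.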
Joining a billion records
Word counts are great, but many real-world
problems mean bringing together multiple
datasets.

 So how do we “join” with MapReduce?
Map-Side Joins
When joining one big input to a small one,
simply copy the small data set to each mapper
[Diagram: Data Set 1 (Split 1 … Split 3) → Map Phase (Mapper 1 … Mapper 3, each holding its own copy of Data Set 2) → Shuffle → Reduce Phase (Reducer 1, Reducer 2, …)]
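
A minimal in-memory sketch of a map-side join, assuming the small data set fits in memory as a dictionary. The sample records and the inner-join behavior (unmatched keys are dropped) are choices made for illustration:

```python
def map_side_join(big_split, small_table):
    """Join one split of the big data set against an in-memory copy of the small one."""
    for key, value in big_split:
        if key in small_table:
            yield key, value, small_table[key]


small = {1: "Alice", 2: "Bob"}  # small data set, copied to every mapper
splits = [[(1, "visit-a"), (3, "visit-b")], [(2, "visit-c")]]  # big data set, split up

# Each split is joined independently; no shuffle is needed for the join itself.
joined = [row for split in splits for row in map_side_join(split, small)]
print(joined)
```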
Merge in Reducer
Route common items to the same reducer
[Diagram: Data Set 1 (Split 1 … Split 3) and Data Set 2 (Split 1 … Split 3) → Map Phase (group by key on every split) → Shuffle → Reduce Phase (Reducer 1 … Reducer N)]
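
The reducer-side merge can be sketched the same way. In this toy version each record is tagged with its source during the map phase so the reducer can tell the two inputs apart. The tags "left" and "right" and all function names are assumptions for the example:

```python
from collections import defaultdict


def tag(records, source):
    """Map phase: emit (key, (source, value)) so the reducer can tell inputs apart."""
    return [(key, (source, value)) for key, value in records]


def shuffle(pairs):
    """Shuffle: group all tagged values that share a key."""
    groups = defaultdict(list)
    for key, tagged in pairs:
        groups[key].append(tagged)
    return groups


def reduce_join(key, tagged_values):
    """Reduce phase: pair every left value with every right value for one key."""
    left = [v for s, v in tagged_values if s == "left"]
    right = [v for s, v in tagged_values if s == "right"]
    return [(key, l, r) for l in left for r in right]


def reduce_side_join(left_records, right_records):
    groups = shuffle(tag(left_records, "left") + tag(right_records, "right"))
    return [row for key in sorted(groups) for row in reduce_join(key, groups[key])]


print(reduce_side_join([(1, "a"), (2, "b")], [(1, "x"), (1, "y"), (3, "z")]))
```

Unlike the map-side join, neither input has to fit in memory on a single mapper, at the cost of moving both data sets through the shuffle.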
Higher-Level Constructs
MapReduce is a primitive operation for
higher-level constructs
Hive, Pig, Cascading, and Crunch all compile
into MapReduce
                  Use one!


                                Crunch!
MapReduce and MPP Databases
MapReduce                               MPP Databases
Data in a distributed filesystem        Data in sharded relational databases
Oriented towards unstructured           Oriented towards structured data
or semi-structured data
Java or Domain-Specific Languages       SQL
(e.g., Pig and Hive)
Poor support for iterative operations   Good support for iterative operations
Arbitrarily complex programs            SQL and User-Defined Functions
running next to data                    running next to data
Poor interactive query support          Good interactive query support
MapReduce           MPP Databases
          …are complementary!

Map/Reduce to clean, normalize, reconcile
and codify data to load into an MPP system
for interactive analysis
Bulk processing billions of records
 Data aggregation and storage
Hadoop Distributed Filesystem
  Scales to many petabytes
  Splits all files into blocks and spreads
  them across data nodes
  The name node keeps track of what
  blocks belong to what file
  All blocks written in triplicate
  Write and append only –
  no random updates!
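
The block and replica bookkeeping can be pictured with a toy Python model. This is not HDFS code. The tiny block size, the node names and the round-robin placement are all assumptions made to keep the example short, and real placement policies are more involved:

```python
def split_into_blocks(data: bytes, block_size: int):
    """Cut a file's bytes into fixed-size blocks (the last one may be shorter)."""
    return [data[i:i + block_size] for i in range(0, len(data), block_size)]


def place_blocks(num_blocks: int, data_nodes, replicas: int = 3):
    """Name-node bookkeeping: map each block index to `replicas` distinct data nodes."""
    if replicas > len(data_nodes):
        raise ValueError("need at least as many data nodes as replicas")
    return {
        b: [data_nodes[(b + r) % len(data_nodes)] for r in range(replicas)]
        for b in range(num_blocks)
    }


blocks = split_into_blocks(b"x" * 10, block_size=4)
placement = place_blocks(len(blocks), ["dn1", "dn2", "dn3", "dn4"])
for block_index, nodes in placement.items():
    print(block_index, len(blocks[block_index]), nodes)
```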
HDFS Writes

[Diagram: the client looks up a data node through the name node, then writes a block to Data Node 1; the block is replicated to Data Node 2 … Data Node N]
HDFS Reads
[Diagram: the client looks up block locations through the name node, then reads the blocks from Data Node 1 … Data Node N]
HDFS Shortcomings
 No random reads
 No random writes
 Doesn’t deal with many small files


             Enter HBase
“Random Access To Your Planet-Size Data”
HBase
 Emulates random I/O with a
 Write Ahead Log (WAL)
 Periodically flushes log to sorted files
 Files accessible as tables, split across
 many regions, hosted by region servers
 Preserves scalability, data locality, and
 Map/Reduce features of Hadoop
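
The idea of emulating random I/O on top of append-only storage can be sketched as a toy log-structured store. This is not HBase's implementation. The class name `TinyStore`, the in-memory buffer and the flush threshold are assumptions made for illustration, and the sketch never truncates its log or merges its files:

```python
import bisect


class TinyStore:
    """Toy log-structured store: append to a log, buffer in memory, flush sorted files."""

    def __init__(self, flush_threshold=2):
        self.wal = []            # write-ahead log: append-only record of every write
        self.memstore = {}       # recent writes, held in memory
        self.sorted_files = []   # each flush produces one immutable, key-sorted list
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.wal.append((key, value))  # log first, so the write could be replayed
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        self.sorted_files.append(sorted(self.memstore.items()))
        self.memstore = {}

    def get(self, key):
        if key in self.memstore:
            return self.memstore[key]
        # Newest file first so later writes win; files are sorted, so binary search works.
        for f in reversed(self.sorted_files):
            keys = [k for k, _ in f]
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return f[i][1]
        return None


store = TinyStore()
store.put("b", 1)
store.put("a", 2)   # reaches the threshold and triggers a flush
store.put("a", 3)   # newer value stays in memory and shadows the flushed one
print(store.get("a"), store.get("b"), store.get("z"))
```

Every write is an append, and every file on disk is written once and never modified, which is what lets this style of store sit on top of a write-and-append-only filesystem.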
Use HBase when:
 You have noisy, semi-structured data
 You want to apply massively parallel
 processing to your problem
 You need to handle huge write loads
 You need a scalable key/value store
But there are drawbacks:
  Limited schema support
  Limited atomicity guarantees
  No built-in secondary indexes

HBase is a great tool for many jobs,
but not every job
The data store should align
with the needs of the application
So a pattern is emerging:
     Collection:  Millennium, CCDs, Claims, HL7
     Aggregation: Hadoop with HBase
     Processing:  MapReduce Jobs
     Storage:     MPP, Relational, Document Store, HBase
But we have a potential bottleneck
Direct inserts are designed for online
updates, not massively parallel data loads
So shift the work into MapReduce, and pre-
build files for bulk import

      Oracle Loader for Hadoop
 HBase HFile Import     Bulk Loads for MPP
And we’re missing an important piece:
     Collection:  Millennium, CCDs, Claims, HL7
     Aggregation: Hadoop with HBase
     Processing:  Realtime Processing; Map/Reduce Jobs (batch)
     Storage:     MPP, Relational, Document Store, HBase
How do we make it fast?

                             Speed Layer




                              Batch Layer


http://www.slideshare.net/nathanmarz/the-secrets-of-building-realtime-big-data-systems
How do we make it fast?
Speed Layer: move data to computation
  Hours of data
  Low latency (seconds to process)
  Incremental updates

Batch Layer: move computation to data
  Years of data
  High latency (minutes or hours to process)
  Bulk loads
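
A toy sketch of how the two layers might fit together. It assumes results are combined at query time, which the slides do not spell out, and it uses a simple per-key event count as the computation. All names here are illustrative:

```python
from collections import Counter


def batch_view(all_events):
    """Batch layer: recompute totals from the full history (slow, run periodically)."""
    return Counter(e["key"] for e in all_events)


class SpeedView:
    """Speed layer: incrementally update totals for events the batch has not seen yet."""

    def __init__(self):
        self.counts = Counter()

    def update(self, event):
        self.counts[event["key"]] += 1


def query(key, batch, speed):
    """Serve a result by combining the batch view with the recent increments."""
    return batch[key] + speed.counts[key]


history = [{"key": "a"}, {"key": "a"}, {"key": "b"}]
batch = batch_view(history)        # years of data, bulk recomputation
speed = SpeedView()
speed.update({"key": "a"})         # hours of data, incremental update
print(query("a", batch, speed), query("b", batch, speed), query("c", batch, speed))
```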
How do we make it fast?
Speed Layer: Complex Event Processing, Storm
Batch Layer: Hadoop, MapReduce
And now, the challenge…
Process all data overnight
   Quickly create new data models
   Fast iteration cycles mean fast innovation
   Simple correction of any bugs
   Much easier to understand and work with
Questions?

Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"

Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)
Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)
Matthew Lease
 
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and CassandraBrief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Brief introduction on Hadoop,Dremel, Pig, FlumeJava and Cassandra
Somnath Mazumdar
 
WELCOME TO BIG DATA TRANING
WELCOME TO BIG DATA TRANINGWELCOME TO BIG DATA TRANING
WELCOME TO BIG DATA TRANING
Abhishek Mukherjee
 
WELCOME TO BIG DATA TRANING
WELCOME TO BIG DATA TRANINGWELCOME TO BIG DATA TRANING
WELCOME TO BIG DATA TRANING
Utkarsh Srivastava
 
Hadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User GroupHadoop and Mapreduce for .NET User Group
Hadoop and Mapreduce for .NET User Group
Csaba Toth
 
Introduction to MapReduce
Introduction to MapReduceIntroduction to MapReduce
Introduction to MapReduce
Chicago Hadoop Users Group
 
The Powerful Marriage of Hadoop and R (David Champagne)
The Powerful Marriage of Hadoop and R (David Champagne)The Powerful Marriage of Hadoop and R (David Champagne)
The Powerful Marriage of Hadoop and R (David Champagne)
Revolution Analytics
 
MapReduce and NoSQL
MapReduce and NoSQLMapReduce and NoSQL
MapReduce and NoSQL
Aaron Cordova
 
Hadoop ecosystem
Hadoop ecosystemHadoop ecosystem
Hadoop ecosystem
Mohamed Ali Mahmoud khouder
 
MapReduce-Notes.pdf
MapReduce-Notes.pdfMapReduce-Notes.pdf
MapReduce-Notes.pdf
AnilVijayagiri
 
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Hadoop World 2011: The Powerful Marriage of R and Hadoop - David Champagne, R...
Cloudera, Inc.
 
Intro to threp
Intro to threpIntro to threp
Intro to threp
Hong Wu
 
MAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptxMAP REDUCE IN DATA SCIENCE.pptx
MAP REDUCE IN DATA SCIENCE.pptx
HARIKRISHNANU13
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce Paradigm
Dilip Reddy
 
MapReduce Paradigm
MapReduce ParadigmMapReduce Paradigm
MapReduce Paradigm
Dilip Reddy
 
Hadoop Map Reduce
Hadoop Map ReduceHadoop Map Reduce
Hadoop Map Reduce
VNIT-ACM Student Chapter
 
Map Reduce
Map ReduceMap Reduce
Map Reduce
Prashant Gupta
 
2 mapreduce-model-principles
2 mapreduce-model-principles2 mapreduce-model-principles
2 mapreduce-model-principles
Genoveva Vargas-Solar
 
MapReduce
MapReduceMapReduce
MapReduce
KavyaGo
 
Adaptive MapReduce using Situation-Aware Mappers
Adaptive MapReduce using Situation-Aware MappersAdaptive MapReduce using Situation-Aware Mappers
Adaptive MapReduce using Situation-Aware Mappers
rvernica
 

Similar to Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques" (20)

Lecture 3: Data-Intensive Computing for Text Analysis (Fall 2011)
Kansas City Big Data: The Future Of Insights - Keynote: "Big Data Technologies and Techniques"

  • 1. Big Data Technologies and Techniques Ryan Brush Distinguished Engineer, Cerner Corporation @ryanbrush
  • 3. Relational Databases are Awesome: Atomic, transactional updates; Guaranteed consistency; Declarative queries; Easy to reason about; Long track record of success
  • 4. Relational Databases are Awesome …so use them!
  • 5. Relational Databases are Awesome …so use them! But…
  • 6. Those advantages have a cost Global, atomic state means global, atomic coordination Coordination does not scale linearly
  • 7. The costs of coordination Remember the network effect?
  • 8. The costs of coordination: $\text{channels} = \frac{n(n-1)}{2}$. 2 nodes = 1 channel; 5 nodes = 10 channels; 12 nodes = 66 channels; 25 nodes = 300 channels
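The channel counts on this slide follow from the pairwise-connection formula n(n-1)/2. A minimal sketch that reproduces them (the function name `channels` is illustrative, not from the talk):

```python
def channels(n: int) -> int:
    """Number of pairwise communication channels among n nodes: n(n-1)/2."""
    if n < 0:
        raise ValueError("n must be non-negative")
    # n * (n - 1) is always even, so integer division is exact.
    return n * (n - 1) // 2


# Reproduce the node counts listed on the slide.
for n in (2, 5, 12, 25):
    print(f"{n} nodes = {channels(n)} channels")
```

Because the count grows roughly with the square of the node count, doubling the nodes roughly quadruples the channels that may need coordinating.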
  • 9. So we better be able to scale
  • 10. The costs of coordination Databases have optimized this in many clever ways, but a limit on scalability still exists
  • 11. Let’s look at some ways to scale
  • 13. Bulk processing billions of records Data aggregation and storage
  • 14. Bulk processing billions of records Data aggregation and storage Real-time processing of updates
  • 15. Bulk processing billions of records Data aggregation and storage Real-time processing of updates Serving data for: Online Apps Analytics
  • 16. Let’s start with scalability of bulk processing
  • 17. Quiz: which one is scalable?
  • 18. Quiz: which one is scalable? 1000-node Hadoop cluster where jobs depend on a common process
  • 19. Quiz: which one is scalable? 1000-node Hadoop cluster where jobs depend on a common process 1000 Windows ME machines running independent Excel macros
  • 21. Independence → Parallelizable
  • 22. Independence → Parallelizable; Parallelizable → Scalable
  • 23. “Shared Nothing” architectures are the most scalable…
  • 24. “Shared Nothing” architectures are the most scalable… …but most real-world problems require us to share something…
  • 25. “Shared Nothing” architectures are the most scalable… …but most real-world problems require us to share something… …so our designs usually have a parallel part and a serial part
  • 26. The key is to make sure the vast majority of our work in the cloud is independent and parallelizable.
  • 27. Amdahl’s Law: $S(N) = \frac{1}{(1 - P) + \frac{P}{N}}$, where S is the speed improvement, P is the ratio of the problem that can be parallelized, and N is the number of processors
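To make Amdahl's Law concrete, here is a small sketch that evaluates S(N) = 1 / ((1 - P) + P/N) for a given parallel fraction P and processor count N. The function name and the sample values are illustrative assumptions, not figures from the talk.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Speed improvement S(N) for parallel fraction p on n processors."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if n < 1:
        raise ValueError("n must be at least 1")
    # The serial part (1 - p) is untouched; only the parallel part shrinks with n.
    return 1.0 / ((1.0 - p) + p / n)


# Illustrative values only: a job that is 95% parallelizable.
for n in (10, 100, 1000):
    print(f"N={n}: speedup = {amdahl_speedup(0.95, n):.2f}")
```

As N grows, the speedup approaches 1 / (1 - P), so with P = 0.95 it can never exceed 20 no matter how many processors are added. This is why the serial part of a design matters so much.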
  • 28. MapReduce Primer (diagram): Input Data (Split 1, Split 2, Split 3, …, Split N) → Map Phase (Mapper 1, Mapper 2, Mapper 3, …, Mapper N) → Shuffle → Reduce Phase (Reducer 1, Reducer 2, …, Reducer N)
  • 29. MapReduce Example: Word Count (diagram): Books → Map Phase (count words per book) → Shuffle → Reduce Phase (sum words A-C, sum words D-E, …, sum words W-Z)
  • 30. Notice there is still a serial part of the problem: the results of the reducers must be combined
  • 31. Notice there is still a serial part of the problem: the results of the reducers must be combined …but this is much smaller, and can be handled by a single process
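The word count flow can be sketched in a single process to show where the parallel and serial parts sit. This is an in-memory illustration, not Hadoop code: the function names are hypothetical, and the partitioner uses the first character's code point as a stand-in for the alphabetical ranges (A-C, D-E, …) shown on the slide.

```python
from collections import defaultdict


def map_phase(book_text):
    """Mapper: count words within one book, emitting (word, count) pairs."""
    counts = defaultdict(int)
    for word in book_text.lower().split():
        counts[word] += 1
    return list(counts.items())


def shuffle(mapped, num_reducers):
    """Route every (word, count) pair for a given word to the same reducer."""
    partitions = [defaultdict(list) for _ in range(num_reducers)]
    for pairs in mapped:
        for word, count in pairs:
            # Assumed partitioner: first character decides the reducer.
            partitions[ord(word[0]) % num_reducers][word].append(count)
    return partitions


def reduce_phase(partition):
    """Reducer: sum the per-book counts for each word it was handed."""
    return {word: sum(counts) for word, counts in partition.items()}


def word_count(books, num_reducers=3):
    mapped = [map_phase(book) for book in books]        # independent per book
    partitions = shuffle(mapped, num_reducers)
    reduced = [reduce_phase(part) for part in partitions]  # independent per reducer
    # The small serial step: combine the reducers' outputs in one process.
    result = {}
    for part in reduced:
        result.update(part)
    return result


print(word_count(["the cat sat", "the dog sat on the cat"]))
```

Each call to `map_phase` and `reduce_phase` depends only on its own input, so on a cluster those calls could run on separate machines. Only the final merge is serial, and it touches far less data than the input.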
  • 32. Also notice that the network is a shared resource when processing big data
  • 33. Also notice that the network is a shared resource when processing big data So rather than moving data to computation, we move computation to data.
  • 34. MapReduce Data Locality (diagram): Input Data (Split 1, Split 2, Split 3, …, Split N) → Map Phase (Mapper 1, Mapper 2, Mapper 3, …, Mapper N) → Shuffle → Reduce Phase (Reducer 1, Reducer 2, …, Reducer N); the legend marks which components share a physical machine
  • 35. Data locality is only guaranteed in the Map phase
  • 36. Data locality is only guaranteed in the Map phase. So the most data-intensive work should be done in the map, with smaller sets sent to the reducer
  • 37. Data locality is only guaranteed in the Map phase. So the most data-intensive work should be done in the map, with smaller sets sent to the reducer. Some Map/Reduce jobs have no reducer at all!
  • 38. MapReduce Gone Wrong (diagram): Books → Map Phase (count words per book) → Shuffle → Reduce Phase (sum words A-C, sum words D-E, …, sum words W-Z), plus a shared remote Word Addition Service that the job depends on
  • 39. Even if our Word Addition Service is scalable, we’d need to scale it to the size of the largest Map/Reduce job that will ever use it
  • 40. So for data processing, prefer embedded libraries over remote services
  • 41. So for data processing, prefer embedded libraries over remote services Use remote services for configuration, to prime caches, etc. – just not for every data element!
  • 42. Joining a billion records Word counts are great, but many real-world problems mean bringing together multiple datasets. So how do we “join” with MapReduce?
  • 43. Map-Side Joins: When joining one big input to a small one, simply copy the small data set to each mapper. (Diagram: Data Set 1 is divided into Split 1, Split 2, Split 3, …; Mapper 1, Mapper 2, and Mapper 3 each hold a full copy of Data Set 2; output flows through the Shuffle to Reducer 1, Reducer 2, ….)
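A map-side join can be sketched as follows, assuming the small data set fits in memory as a dictionary that every mapper receives a copy of. The record layout and names (`map_side_join`, the patient and visit values) are hypothetical and chosen only for illustration.

```python
def map_side_join(big_split, small_table):
    """Mapper: join each record of one split against the in-memory small table."""
    for key, value in big_split:
        # Inner join: records without a match in the small table are dropped.
        if key in small_table:
            yield key, value, small_table[key]


# Hypothetical data: the small table is copied to every mapper.
small_table = {"p1": "Alice", "p2": "Bob"}
splits = [
    [("p1", "visit-a"), ("p3", "visit-b")],  # handled by mapper 1
    [("p2", "visit-c")],                     # handled by mapper 2
]

joined = [row for split in splits for row in map_side_join(split, small_table)]
print(joined)
```

No shuffle is needed for the join itself, because each mapper already has everything required to match its own records.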
  • 44. Merge in Reducer: Route common items to the same reducer. (Diagram: Splits 1-3 of Data Set 1 and Splits 1-3 of Data Set 2 are each grouped by key in the Map Phase, then shuffled to Reducer 1, Reducer 2, …, Reducer N in the Reduce Phase.)
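When both inputs are large, the join moves to the reducer. The sketch below assumes each mapper tags its records with the data set they came from, so the reducer can tell the two sides apart after the shuffle. All names and sample records are hypothetical, and the whole flow runs in one process for illustration.

```python
from collections import defaultdict


def tag(records, source):
    """Map phase: emit (key, (source, value)) so the reducer can tell inputs apart."""
    return [(key, (source, value)) for key, value in records]


def shuffle(tagged_pairs):
    """Group all tagged values that share a key, as the shuffle would."""
    groups = defaultdict(list)
    for key, tagged in tagged_pairs:
        groups[key].append(tagged)
    return groups


def reduce_join(key, tagged_values):
    """Reducer: pair every left-side value with every right-side value for a key."""
    left = [value for source, value in tagged_values if source == "left"]
    right = [value for source, value in tagged_values if source == "right"]
    return [(key, l, r) for l in left for r in right]


def reduce_side_join(left, right):
    groups = shuffle(tag(left, "left") + tag(right, "right"))
    joined = []
    for key in sorted(groups):
        joined.extend(reduce_join(key, groups[key]))
    return joined


people = [("p1", "Alice"), ("p2", "Bob")]
visits = [("p1", "visit-a"), ("p1", "visit-b"), ("p3", "visit-c")]
print(reduce_side_join(people, visits))
```

Unlike the map-side join, every record crosses the network during the shuffle, which is the price of not needing either input to fit in memory.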
  • 45. Higher-Level Constructs: MapReduce is a primitive operation for higher-level constructs. Hive, Pig, Cascading, and Crunch all compile into MapReduce. Use one! Crunch!
  • 46. MapReduce and MPP Databases
  • 47-52. MapReduce vs. MPP Databases (comparison built up one row per slide):
      MapReduce | MPP Databases
      Data in a distributed filesystem | Data in sharded relational databases
      Oriented towards unstructured or semi-structured data | Oriented towards structured data
      Java or Domain-Specific Languages (e.g., Pig and Hive) | SQL
      Poor support for iterative operations | Good support of iterative operations
      Arbitrarily complex programs running next to data | SQL and User-Defined Functions running next to data
      Poor interactive query support | Good interactive query support
  • 53. MapReduce and MPP Databases …are complementary!
  • 54. MapReduce and MPP Databases …are complementary! Map/Reduce to clean, normalize, reconcile and codify data to load into an MPP system for interactive analysis
  • 55. Bulk processing of millions of records Data aggregation and storage
  • 56. Hadoop Distributed Filesystem Scales to many petabytes
  • 57. Hadoop Distributed Filesystem Scales to many petabytes Splits all files into blocks and spreads them across data nodes
  • 58. Hadoop Distributed Filesystem Scales to many petabytes Splits all files into blocks and spreads them across data nodes The name node keeps track of what blocks belong to what file
  • 59. Hadoop Distributed Filesystem Scales to many petabytes Splits all files into blocks and spreads them across data nodes The name node keeps track of what blocks belong to what file All blocks written in triplicate
  • 60. Hadoop Distributed Filesystem Scales to many petabytes Splits all files into blocks and spreads them across data nodes The name node keeps track of what blocks belong to what file All blocks written in triplicate Write and append only – no random updates!
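The block-splitting and triplicate-write ideas can be sketched as a placement plan. This is a toy model and not the HDFS API: the round-robin replica placement, the function name, and the node names are assumptions made for illustration (real HDFS placement also considers rack topology).

```python
def plan_blocks(file_size, block_size, data_nodes, replication=3):
    """Return [(block_index, [replica nodes])] for one file, like name node metadata."""
    if block_size <= 0:
        raise ValueError("block_size must be positive")
    if replication > len(data_nodes):
        raise ValueError("not enough data nodes for the requested replication")
    num_blocks = -(-file_size // block_size)  # ceiling division
    placements = []
    for index in range(num_blocks):
        # Assumed policy: simple round-robin, which keeps replicas on distinct nodes.
        replicas = [
            data_nodes[(index + offset) % len(data_nodes)]
            for offset in range(replication)
        ]
        placements.append((index, replicas))
    return placements


# Hypothetical 300-unit file stored in 128-unit blocks on four data nodes.
for index, replicas in plan_blocks(300, 128, ["dn1", "dn2", "dn3", "dn4"]):
    print(f"block {index}: {replicas}")
```

The returned list plays the role of the name node's bookkeeping: it records which blocks make up the file and where each copy lives, while the data itself stays on the data nodes.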
  • 61. HDFS Writes (diagram): Client → Name Node: look up data node; Client → Data Node 1: write block; the block is then replicated from Data Node 1 to Data Node 2 and on to Data Node N
  • 62. HDFS Reads (diagram): Client → Name Node: look up block locations; Client → Data Node 1, Data Node 2, …, Data Node N: read blocks
  • 63. HDFS Shortcomings No random reads No random writes Doesn’t deal with many small files
  • 64. HDFS Shortcomings No random reads No random writes Doesn’t deal with many small files Enter HBase “Random Access To Your Planet-Size Data”
  • 65. HBase Emulates random I/O with a Write Ahead Log (WAL) Periodically flushes log to sorted files
  • 66. HBase Emulates random I/O with a Write Ahead Log (WAL) Periodically flushes log to sorted files Files accessible as tables, split across many regions, hosted by region servers
  • 67. HBase Emulates random I/O with a Write Ahead Log (WAL) Periodically flushes log to sorted files Files accessible as tables, split across many regions, hosted by region servers Preserves scalability, data locality, and Map/Reduce features of Hadoop
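The write-ahead-log-plus-sorted-files idea can be illustrated with a toy key/value store. This is a sketch of the general technique only, not HBase's implementation or API: the class name, the in-memory buffer, and the flush threshold are assumptions for illustration.

```python
import bisect


class ToyStore:
    """Toy store: append writes to a log, then flush them to immutable sorted files."""

    def __init__(self, flush_threshold=2):
        self.wal = []           # append-only log of (key, value) writes
        self.memstore = {}      # in-memory view of writes not yet flushed
        self.sorted_files = []  # immutable sorted lists of (key, value), newest last
        self.flush_threshold = flush_threshold

    def put(self, key, value):
        self.wal.append((key, value))  # sequential append, never an in-place update
        self.memstore[key] = value
        if len(self.memstore) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write buffered entries out as one sorted file and reset the log."""
        if self.memstore:
            self.sorted_files.append(sorted(self.memstore.items()))
            self.memstore = {}
            self.wal = []

    def get(self, key):
        if key in self.memstore:
            return self.memstore[key]
        # Newest file wins, so search the files in reverse order.
        for sorted_file in reversed(self.sorted_files):
            keys = [k for k, _ in sorted_file]
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return sorted_file[i][1]
        return None


store = ToyStore(flush_threshold=2)
store.put("b", 1)
store.put("a", 2)   # reaches the threshold and triggers a flush
store.put("b", 3)   # newer value shadows the flushed one
print(store.get("a"), store.get("b"), store.get("z"))
```

Every write is an append and every flushed file is written once, which fits a filesystem that allows writes and appends but no random updates. Random reads are served by checking the buffer and then binary-searching the sorted files.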
  • 68. Use HBase when: You have noisy, semi-structured data
  • 69. Use HBase when: You have noisy, semi-structured data You want to apply massively parallel processing to your problem
  • 70. Use HBase when: You have noisy, semi-structured data You want to apply massively parallel processing to your problem To handle huge write loads
  • 71. Use HBase when: You have noisy, semi-structured data You want to apply massively parallel processing to your problem To handle huge write loads As a scalable key/value store
  • 72. But there are drawbacks: Limited schema support Limited atomicity guarantees No built-in secondary indexes HBase is a great tool for many jobs, but not every job
  • 73. The data store should align with the needs of the application
  • 74. So a pattern is emerging: Collection (Millennium, CCDs, Claims, HL7) → Aggregation (Hadoop with HBase) → Processing (MapReduce Jobs) → Storage (MPP, Relational, Document Store, HBase)
  • 75. But we have a potential bottleneck: Collection (Millennium, CCDs, Claims, HL7) → Aggregation (Hadoop with HBase) → Processing (MapReduce Jobs) → Storage (MPP, Relational, Document Store, HBase)
  • 76. Direct inserts are designed for online updates, not massively parallel data loads. So shift the work into MapReduce, and pre-build files for bulk import: Oracle Loader for Hadoop, HBase HFile Import, Bulk Loads for MPP
  • 77. And we’re missing an important piece: Collection (Millennium, CCDs, Claims, HL7) → Aggregation (Hadoop with HBase) → Processing (MapReduce Jobs) → Storage (MPP, Relational, Document Store, HBase)
  • 78. And we’re missing an important piece: Collection (Millennium, CCDs, Claims, HL7) → Aggregation (Hadoop with HBase) → Processing (Realtime Processing; Map/Reduce Jobs (batch)) → Storage (MPP, Relational, Document Store, HBase)
  • 79. How do we make it fast? Speed Layer Batch Layer http://www.slideshare.net/nathanmarz/the-secrets-of-building-realtime-big-data-systems
  • 80. How do we make it fast? Speed Layer: move data to computation; hours of data; incremental updates; low latency (seconds to process). Batch Layer: move computation to data; years of data; bulk loads; high latency (minutes or hours to process)
  • 81. How do we make it fast? Speed Layer: Complex Event Processing, Storm. Batch Layer: Hadoop MapReduce
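The split between a speed layer and a batch layer can be sketched with two views of the same counts. This is a single-process illustration, not Storm or Hadoop code; the event shape, the function names, and the choice to add the two views together at query time are assumptions drawn from the general pattern in the linked talk.

```python
from collections import Counter


def batch_view(all_events):
    """Batch layer: recompute counts from the full history (bulk load, high latency)."""
    return Counter(event["key"] for event in all_events)


def apply_update(speed_view, event):
    """Speed layer: fold one new event into a small incremental view (low latency)."""
    speed_view[event["key"]] += 1


def query(key, batch, speed):
    """Assumed query path: combine the batch result with the recent increments."""
    return batch.get(key, 0) + speed.get(key, 0)


# Hypothetical events: the history was processed in bulk, one arrives afterwards.
history = [{"key": "a"}, {"key": "a"}, {"key": "b"}]
batch = batch_view(history)

speed = Counter()
apply_update(speed, {"key": "a"})

print(query("a", batch, speed), query("b", batch, speed), query("c", batch, speed))
```

In this sketch the speed view only has to cover events that arrived since the last batch run, so it stays small. Once the batch layer has reprocessed those events, the speed view can be discarded and rebuilt.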
  • 82. And now, the challenge…
  • 83. Process all data overnight
  • 84. Process all data overnight: Quickly create new data models; Fast iteration cycles mean fast innovation; Simple correction of any bugs; Much easier to understand and work with