Big Data Technologies and
           Techniques
                Ryan Brush
Distinguished Engineer, Cerner Corporation
               @ryanbrush
Relational Databases are Awesome
Atomic, transactional updates
   Guaranteed consistency

Relational Databases are Awesome
             Declarative queries
Easy to reason about
     Long track record of success
Relational Databases are Awesome
           …so use them!

        But…
Those advantages have a cost
Global, atomic state means global,
atomic coordination

      Coordination does not scale linearly
The costs of coordination
     Remember the network effect?

  channels = n(n - 1) / 2

  2 nodes = 1 channel
  5 nodes = 10 channels
  12 nodes = 66 channels
  25 nodes = 300 channels
So we better be able to scale
The costs of coordination
  Databases have optimized this in
  many clever ways, but a limit on
  scalability still exists
Let’s look at some ways to scale
Bulk processing billions of records
 Data aggregation and storage
 Real-time processing of updates
 Serving data for: Online Apps
                   Analytics
Let’s start with scalability of
bulk processing
Quiz: which one is scalable?
    1000-node Hadoop cluster where jobs depend on a common process
    1000 Windows ME machines running independent Excel macros
Independence → Parallelizable

Parallelizable → Scalable
“Shared Nothing” architectures are the most scalable…
     …but most real-world problems require us to share something…
  …so our designs usually have a parallel part and a serial part
The key is to make sure the vast majority
of our work in the cloud is independent and
parallelizable.
Amdahl’s Law

  S(N) = 1 / ((1 - P) + P/N)

  S: speed improvement
  P: ratio of the problem that can be parallelized
  N: number of processors
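To make the formula concrete, here is a minimal sketch (illustrative, not from the talk) that evaluates it in Java: with 95% of the work parallelizable, even an enormous cluster tops out near a 20x speedup, because the serial 5% dominates.

    // Minimal sketch of Amdahl's Law: speedup for a parallel fraction p on n processors.
    public final class Amdahl {
        static double speedup(double p, int n) {
            return 1.0 / ((1.0 - p) + p / n);
        }

        public static void main(String[] args) {
            System.out.println(speedup(0.95, 1000));   // ~19.63 - the serial 5% dominates
            System.out.println(speedup(0.95, 10000));  // ~19.96 - approaching the 20x ceiling
        }
    }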
MapReduce Primer

  Input Data (Split 1 … Split N) → Map Phase (Mapper 1 … Mapper N)
    → Shuffle → Reduce Phase (Reducer 1 … Reducer N)
MapReduce Example: Word Count

  Books → Map Phase: count words per book → Shuffle
    → Reduce Phase: sum the counts per word range (A-C, D-E, … W-Z)
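To show what those boxes look like in code, here is a minimal word-count sketch against the standard Hadoop MapReduce Java API; the class names, tokenization, and paths are illustrative.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map: runs next to each input split and emits (word, 1) for every word it sees.
      public static class TokenizerMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
              word.set(token);
              context.write(word, ONE);
            }
          }
        }
      }

      // Reduce: the shuffle has already grouped every count for a word together; just sum them.
      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable count : values) {
            sum += count.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);  // pre-aggregate on the map side to cut shuffle traffic
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }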
Notice there is still a serial part of the problem: the outputs of the reducers must be combined
   …but this is much smaller, and can be handled by a single process
Also notice that the network is a shared resource when processing big data
 So rather than moving data to computation, we move computation to data.
MapReduce Data Locality

  Same flow as the primer, but each input split and the mapper that processes it sit on
  the same physical machine, so the map phase reads local disk instead of the network.
Data locality is only guaranteed in the Map phase
 So the most data-intensive work should be done in the map, with smaller sets sent to the reducer
Some Map/Reduce jobs have no reducer at all!
MapReduce Gone Wrong

  The same word-count flow, except the reducers call out to a remote “Word Addition
  Service” for every sum instead of adding the counts locally.
Even if our Word Addition Service is
scalable, we’d need to scale it to the size of
the largest Map/Reduce job that will ever
use it
So for data processing, prefer embedded libraries over remote services
Use remote services for configuration, to prime caches, etc. – just not for every data element!
Joining a billion records
Word counts are great, but many real-world
problems mean bringing together multiple
datasets.

 So how do we “join” with MapReduce?
Map-Side Joins
When joining one big input to a small one, simply copy the small data set to each mapper

  Data Set 1 (Split 1 … Split N) → Map Phase: each mapper holds its own copy of Data Set 2
  and joins against it locally → Shuffle → Reduce Phase
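A minimal sketch of that idea in Hadoop Java, assuming the small data set has already been shipped to every map task (for example via the distributed cache) as a local file named small.txt with tab-separated key/value lines; all names here are illustrative.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Map-side join: every mapper loads the small data set into memory once,
    // then joins each record of the big input against it locally - no shuffle needed.
    public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
      private final Map<String, String> small = new HashMap<>();

      @Override
      protected void setup(Context context) throws IOException {
        // Assumes the small data set is available to each task as "small.txt"
        // with lines of the form key<TAB>value.
        try (BufferedReader reader = new BufferedReader(new FileReader("small.txt"))) {
          String line;
          while ((line = reader.readLine()) != null) {
            String[] parts = line.split("\t", 2);
            small.put(parts[0], parts[1]);
          }
        }
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String[] parts = value.toString().split("\t", 2);
        String match = small.get(parts[0]);  // look up the join key in memory
        if (match != null) {
          context.write(new Text(parts[0]), new Text(parts[1] + "\t" + match));
        }
      }
    }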
Merge in Reducer
Route common items to the same reducer

  Data Set 1 splits and Data Set 2 splits → Map Phase: group by the join key → Shuffle
    → Reduce Phase: each reducer receives every record for its keys, from both data sets
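A minimal sketch of the reduce side of that pattern, assuming the two map phases have already emitted records keyed by the join key and tagged with a source prefix ("1:" or "2:" – an illustrative convention, not part of any API):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Reduce-side join: the shuffle routes every record with the same key - from both
    // data sets - to the same reducer, which pairs them up.
    public class JoinReducer extends Reducer<Text, Text, Text, Text> {
      @Override
      protected void reduce(Text key, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        List<String> fromSetOne = new ArrayList<>();
        List<String> fromSetTwo = new ArrayList<>();
        for (Text value : values) {
          String v = value.toString();
          if (v.startsWith("1:")) {          // tag written by the data-set-1 mapper
            fromSetOne.add(v.substring(2));
          } else if (v.startsWith("2:")) {   // tag written by the data-set-2 mapper
            fromSetTwo.add(v.substring(2));
          }
        }
        // Emit the joined pairs for this key.
        for (String left : fromSetOne) {
          for (String right : fromSetTwo) {
            context.write(key, new Text(left + "\t" + right));
          }
        }
      }
    }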
Higher-Level Constructs
MapReduce is a primitive operation for higher-level constructs
Hive, Pig, Cascading, and Crunch all compile into MapReduce
                  Use one!
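As one illustration, word count in Apache Crunch looks roughly like this (a sketch modeled on Crunch's standard word-count example; the paths are placeholders):

    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;
    import org.apache.crunch.PCollection;
    import org.apache.crunch.PTable;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.types.writable.Writables;
    import org.apache.hadoop.conf.Configuration;

    public class CrunchWordCount {
      public static void main(String[] args) throws Exception {
        // One logical pipeline; Crunch compiles it into the necessary MapReduce jobs.
        Pipeline pipeline = new MRPipeline(CrunchWordCount.class, new Configuration());
        PCollection<String> lines = pipeline.readTextFile("/books");

        // Split each line into words (runs in the map phase).
        PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
          @Override
          public void process(String line, Emitter<String> emitter) {
            for (String word : line.split("\\s+")) {
              emitter.emit(word);
            }
          }
        }, Writables.strings());

        PTable<String, Long> counts = words.count();  // group + sum, i.e. the reduce
        pipeline.writeTextFile(counts, "/word-counts");
        pipeline.done();
      }
    }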
MapReduce                               MPP Databases
Data in a distributed filesystem        Data in sharded relational databases
Oriented towards unstructured           Oriented towards structured data
or semi-structured data
Java or Domain-Specific Languages       SQL
(e.g., Pig and Hive)
Poor support for iterative operations   Good support for iterative operations
Arbitrarily complex programs            SQL and User-Defined Functions
running next to data                    running next to data
Poor interactive query support          Good interactive query support
MapReduce and MPP Databases are complementary!

Map/Reduce to clean, normalize, reconcile and codify data to load into an MPP system
for interactive analysis
Bulk processing of billions of records
 Data aggregation and storage
Hadoop Distributed Filesystem
  Scales to many petabytes
  Splits all files into blocks and spreads them across data nodes
  The name node keeps track of what blocks belong to what file
  All blocks written in triplicate
  Write and append only – no random updates!
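A minimal sketch of what that looks like from a client, using the standard Hadoop FileSystem API (the path is made up); note there is no call for rewriting bytes in the middle of an existing file:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsClientSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // picks up fs.defaultFS from core-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Write a new file: the client streams bytes, HDFS splits them into blocks
        // and replicates each block across data nodes.
        Path path = new Path("/data/events/2013-01-01.txt");
        try (FSDataOutputStream out = fs.create(path)) {
          out.writeUTF("first record\n");
        }

        // Read it back: the name node supplies block locations, the data nodes serve the bytes.
        try (FSDataInputStream in = fs.open(path)) {
          System.out.println(in.readUTF());
        }
        // There is no API to modify bytes in place - only create and append.
      }
    }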
HDFS Writes

  The client asks the Name Node which Data Nodes to write to, then streams blocks to the
  first Data Node, which replicates each block on to the next (three copies in total).
HDFS Reads

  The client asks the Name Node for the block locations, then reads the blocks directly
  from the Data Nodes that hold them.
HDFS Shortcomings
 No random reads
 No random writes
 Doesn’t deal with many small files

             Enter HBase
“Random Access To Your Planet-Size Data”
HBase
 Emulates random I/O with a Write Ahead Log (WAL)
 Periodically flushes the log to sorted files
 Files accessible as tables, split across many regions, hosted by region servers
 Preserves scalability, data locality, and Map/Reduce features of Hadoop
Use HBase when:
 You have noisy, semi-structured data
 You want to apply massively parallel processing to your problem
 To handle huge write loads
 As a scalable key/value store
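To make the key/value point concrete, here is a minimal put-and-get sketch with the HBase 1.x-style Java client; the table, column family, and row key are made up for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("patients"))) {

          // Write: one row keyed by patient id, one cell in the "demographics" family.
          Put put = new Put(Bytes.toBytes("patient-42"));
          put.addColumn(Bytes.toBytes("demographics"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
          table.put(put);

          // Read it back by row key - random access on top of HDFS.
          Result result = table.get(new Get(Bytes.toBytes("patient-42")));
          byte[] name = result.getValue(Bytes.toBytes("demographics"), Bytes.toBytes("name"));
          System.out.println(Bytes.toString(name));
        }
      }
    }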
But there are drawbacks:
  Limited schema support
  Limited atomicity guarantees
  No built-in secondary indexes

HBase is a great tool for many jobs,
but not every job
The data store should align
with the needs of the application
So a pattern is emerging:

  Collection: Millennium, CCDs, Claims, HL7
  Aggregation: Hadoop with HBase
  Processing: MapReduce Jobs
  Storage: MPP, Relational, Document Store, HBase
But we have a potential bottleneck: loading the processed data into the storage systems
(the Processing → Storage step of the pipeline above)
Direct inserts are designed for online updates, not massively parallel data loads
So shift the work into MapReduce, and pre-build files for bulk import

  Oracle Loader for Hadoop
  HBase HFile Import
  Bulk Loads for MPP
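A rough sketch of the HBase half of that idea, assuming an HBase 1.x-style API and made-up table, column, and path names: the job writes region-aligned HFiles via HFileOutputFormat2 instead of issuing a put per record.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class BulkLoadPrep {

      // Hypothetical mapper: turns a tab-separated claim line into a Put keyed by claim id.
      public static class ClaimsMapper
          extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          String[] fields = value.toString().split("\t", 2);
          Put put = new Put(Bytes.toBytes(fields[0]));
          put.addColumn(Bytes.toBytes("c"), Bytes.toBytes("raw"), Bytes.toBytes(fields[1]));
          context.write(new ImmutableBytesWritable(put.getRow()), put);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf)) {
          TableName table = TableName.valueOf("claims");

          Job job = Job.getInstance(conf, "claims-hfile-prep");
          job.setJarByClass(BulkLoadPrep.class);
          job.setMapperClass(ClaimsMapper.class);
          job.setMapOutputKeyClass(ImmutableBytesWritable.class);
          job.setMapOutputValueClass(Put.class);
          FileInputFormat.addInputPath(job, new Path("/raw/claims"));
          FileOutputFormat.setOutputPath(job, new Path("/staging/claims-hfiles"));

          // Wires in the sort and partitioning (and reducers) so the output HFiles
          // line up with the table's regions.
          HFileOutputFormat2.configureIncrementalLoad(job,
              connection.getTable(table), connection.getRegionLocator(table));
          job.waitForCompletion(true);

          // The finished HFiles are then handed to the region servers in one step,
          // e.g. with HBase's completebulkload tool, instead of millions of direct puts.
        }
      }
    }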
And we’re missing an important piece:

  Collection: Millennium, CCDs, Claims, HL7
  Aggregation: Hadoop with HBase
  Processing: Realtime Processing alongside the (batch) Map/Reduce Jobs
  Storage: MPP, Relational, Document Store, HBase
How do we make it fast?

  Speed Layer
  Batch Layer

http://www.slideshare.net/nathanmarz/the-secrets-of-building-realtime-big-data-systems
How do we make it fast?

  Speed Layer: hours of data, low latency (seconds to process),
    moves data to computation, incremental updates
  Batch Layer: years of data, high latency (minutes or hours to process),
    moves computation to data, bulk loads
How do we make it fast?

  Speed Layer: Complex Event Processing with Storm
  Batch Layer: MapReduce on Hadoop
And now, the challenge…
Process all data overnight
Quickly create new data models
   Fast iteration cycles mean fast innovation

Process all data overnight
   Simple correction of any bugs
   Much easier to understand and work with
Questions?
