Big Data Technologies and
           Techniques
                Ryan Brush
Distinguished Engineer, Cerner Corporation
               @ryanbrush
Relational Databases are Awesome
Atomic, transactional updates
   Guaranteed consistency

Relational Databases are Awesome
             Declarative queries
Easy to reason about
     Long track record of success
Relational Databases are Awesome
           …so use them!

        But…
Those advantages have a cost
Global, atomic state means global,
atomic coordination

      Coordination does not scale linearly
The costs of coordination
     Remember the network effect?

  channels = n(n - 1) / 2

  2 nodes = 1 channel
  5 nodes = 10 channels
  12 nodes = 66 channels
  25 nodes = 300 channels
So we better be able to scale
The costs of coordination
  Databases have optimized this in
  many clever ways, but a limit on
  scalability still exists
Let’s look at some ways to scale
Bulk processing billions of records
 Data aggregation and storage
 Real-time processing of updates
 Serving data for: Online Apps
                   Analytics
Let’s start with scalability of
bulk processing
Quiz: which one is scalable?
    1000-node Hadoop cluster where jobs depend on a common process
    1000 Windows ME machines running independent Excel macros
Independence → Parallelizable

Parallelizable → Scalable
“Shared Nothing” architectures are the most scalable…
     …but most real-world problems require us to share something…
  …so our designs usually have a parallel part and a serial part
The key is to make sure the vast majority
of our work in the cloud is independent and
parallelizable.
Amdahl’s Law

  S(N) = 1 / ((1 - P) + P/N)

  S: speed improvement
  P: ratio of the problem that can be parallelized
  N: number of processors
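To make the formula concrete, here is a minimal sketch (illustrative, not from the talk) that evaluates it in Java: with 95% of the work parallelizable, even an enormous cluster tops out near a 20x speedup, because the serial 5% dominates.

    // Minimal sketch of Amdahl's Law: speedup for a parallel fraction p on n processors.
    public final class Amdahl {
        static double speedup(double p, int n) {
            return 1.0 / ((1.0 - p) + p / n);
        }

        public static void main(String[] args) {
            System.out.println(speedup(0.95, 1000));   // ~19.63 - the serial 5% dominates
            System.out.println(speedup(0.95, 10000));  // ~19.96 - approaching the 20x ceiling
        }
    }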
MapReduce Primer

  Input Data (Split 1 … Split N) → Map Phase (Mapper 1 … Mapper N)
    → Shuffle → Reduce Phase (Reducer 1 … Reducer N)
MapReduce Example: Word Count

  Books → Map Phase: count words per book → Shuffle
    → Reduce Phase: sum the counts per word range (A-C, D-E, … W-Z)
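To show what those boxes look like in code, here is a minimal word-count sketch against the standard Hadoop MapReduce Java API; the class names, tokenization, and paths are illustrative.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map: runs next to each input split and emits (word, 1) for every word it sees.
      public static class TokenizerMapper
          extends Mapper<LongWritable, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          for (String token : value.toString().split("\\s+")) {
            if (!token.isEmpty()) {
              word.set(token);
              context.write(word, ONE);
            }
          }
        }
      }

      // Reduce: the shuffle has already grouped every count for a word together; just sum them.
      public static class SumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable count : values) {
            sum += count.get();
          }
          context.write(key, new IntWritable(sum));
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(SumReducer.class);  // pre-aggregate on the map side to cut shuffle traffic
        job.setReducerClass(SumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }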
Notice there is still a serial part of the problem: the outputs of the reducers must be combined
   …but this is much smaller, and can be handled by a single process
Also notice that the network is a shared resource when processing big data
 So rather than moving data to computation, we move computation to data.
MapReduce Data Locality

  Same flow as the primer, but each input split and the mapper that processes it sit on
  the same physical machine, so the map phase reads local disk instead of the network.
Data locality is only guaranteed in the Map phase
 So the most data-intensive work should be done in the map, with smaller sets sent to the reducer
Some Map/Reduce jobs have no reducer at all!
MapReduce Gone Wrong

  The same word-count flow, except the reducers call out to a remote “Word Addition
  Service” for every sum instead of adding the counts locally.
Even if our Word Addition Service is
scalable, we’d need to scale it to the size of
the largest Map/Reduce job that will ever
use it
So for data processing, prefer embedded libraries over remote services
Use remote services for configuration, to prime caches, etc. – just not for every data element!
Joining a billion records
Word counts are great, but many real-world
problems mean bringing together multiple
datasets.

 So how do we “join” with MapReduce?
Map-Side Joins
When joining one big input to a small one, simply copy the small data set to each mapper

  Data Set 1 (Split 1 … Split N) → Map Phase: each mapper holds its own copy of Data Set 2
  and joins against it locally → Shuffle → Reduce Phase
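A minimal sketch of that idea in Hadoop Java, assuming the small data set has already been shipped to every map task (for example via the distributed cache) as a local file named small.txt with tab-separated key/value lines; all names here are illustrative.

    import java.io.BufferedReader;
    import java.io.FileReader;
    import java.io.IOException;
    import java.util.HashMap;
    import java.util.Map;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Map-side join: every mapper loads the small data set into memory once,
    // then joins each record of the big input against it locally - no shuffle needed.
    public class MapSideJoinMapper extends Mapper<LongWritable, Text, Text, Text> {
      private final Map<String, String> small = new HashMap<>();

      @Override
      protected void setup(Context context) throws IOException {
        // Assumes the small data set is available to each task as "small.txt"
        // with lines of the form key<TAB>value.
        try (BufferedReader reader = new BufferedReader(new FileReader("small.txt"))) {
          String line;
          while ((line = reader.readLine()) != null) {
            String[] parts = line.split("\t", 2);
            small.put(parts[0], parts[1]);
          }
        }
      }

      @Override
      protected void map(LongWritable key, Text value, Context context)
          throws IOException, InterruptedException {
        String[] parts = value.toString().split("\t", 2);
        String match = small.get(parts[0]);  // look up the join key in memory
        if (match != null) {
          context.write(new Text(parts[0]), new Text(parts[1] + "\t" + match));
        }
      }
    }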
Merge in Reducer
Route common items to the same reducer

  Data Set 1 splits and Data Set 2 splits → Map Phase: group by the join key → Shuffle
    → Reduce Phase: each reducer receives every record for its keys, from both data sets
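A minimal sketch of the reduce side of that pattern, assuming the two map phases have already emitted records keyed by the join key and tagged with a source prefix ("1:" or "2:" – an illustrative convention, not part of any API):

    import java.io.IOException;
    import java.util.ArrayList;
    import java.util.List;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Reduce-side join: the shuffle routes every record with the same key - from both
    // data sets - to the same reducer, which pairs them up.
    public class JoinReducer extends Reducer<Text, Text, Text, Text> {
      @Override
      protected void reduce(Text key, Iterable<Text> values, Context context)
          throws IOException, InterruptedException {
        List<String> fromSetOne = new ArrayList<>();
        List<String> fromSetTwo = new ArrayList<>();
        for (Text value : values) {
          String v = value.toString();
          if (v.startsWith("1:")) {          // tag written by the data-set-1 mapper
            fromSetOne.add(v.substring(2));
          } else if (v.startsWith("2:")) {   // tag written by the data-set-2 mapper
            fromSetTwo.add(v.substring(2));
          }
        }
        // Emit the joined pairs for this key.
        for (String left : fromSetOne) {
          for (String right : fromSetTwo) {
            context.write(key, new Text(left + "\t" + right));
          }
        }
      }
    }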
Higher-Level Constructs
MapReduce is a primitive operation for higher-level constructs
Hive, Pig, Cascading, and Crunch all compile into MapReduce
                  Use one!
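As one illustration, word count in Apache Crunch looks roughly like this (a sketch modeled on Crunch's standard word-count example; the paths are placeholders):

    import org.apache.crunch.DoFn;
    import org.apache.crunch.Emitter;
    import org.apache.crunch.PCollection;
    import org.apache.crunch.PTable;
    import org.apache.crunch.Pipeline;
    import org.apache.crunch.impl.mr.MRPipeline;
    import org.apache.crunch.types.writable.Writables;
    import org.apache.hadoop.conf.Configuration;

    public class CrunchWordCount {
      public static void main(String[] args) throws Exception {
        // One logical pipeline; Crunch compiles it into the necessary MapReduce jobs.
        Pipeline pipeline = new MRPipeline(CrunchWordCount.class, new Configuration());
        PCollection<String> lines = pipeline.readTextFile("/books");

        // Split each line into words (runs in the map phase).
        PCollection<String> words = lines.parallelDo(new DoFn<String, String>() {
          @Override
          public void process(String line, Emitter<String> emitter) {
            for (String word : line.split("\\s+")) {
              emitter.emit(word);
            }
          }
        }, Writables.strings());

        PTable<String, Long> counts = words.count();  // group + sum, i.e. the reduce
        pipeline.writeTextFile(counts, "/word-counts");
        pipeline.done();
      }
    }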
MapReduce                               MPP Databases
Data in a distributed filesystem        Data in sharded relational databases
Oriented towards unstructured           Oriented towards structured data
or semi-structured data
Java or Domain-Specific Languages       SQL
(e.g., Pig and Hive)
Poor support for iterative operations   Good support for iterative operations
Arbitrarily complex programs            SQL and User-Defined Functions
running next to data                    running next to data
Poor interactive query support          Good interactive query support
MapReduce and MPP Databases are complementary!

Map/Reduce to clean, normalize, reconcile and codify data to load into an MPP system
for interactive analysis
Bulk processing of billions of records
 Data aggregation and storage
Hadoop Distributed Filesystem
  Scales to many petabytes
  Splits all files into blocks and spreads them across data nodes
  The name node keeps track of what blocks belong to what file
  All blocks written in triplicate
  Write and append only – no random updates!
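A minimal sketch of what that looks like from a client, using the standard Hadoop FileSystem API (the path is made up); note there is no call for rewriting bytes in the middle of an existing file:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FSDataOutputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;

    public class HdfsClientSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();  // picks up fs.defaultFS from core-site.xml
        FileSystem fs = FileSystem.get(conf);

        // Write a new file: the client streams bytes, HDFS splits them into blocks
        // and replicates each block across data nodes.
        Path path = new Path("/data/events/2013-01-01.txt");
        try (FSDataOutputStream out = fs.create(path)) {
          out.writeUTF("first record\n");
        }

        // Read it back: the name node supplies block locations, the data nodes serve the bytes.
        try (FSDataInputStream in = fs.open(path)) {
          System.out.println(in.readUTF());
        }
        // There is no API to modify bytes in place - only create and append.
      }
    }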
HDFS Writes

  The client asks the Name Node which Data Nodes to write to, then streams blocks to the
  first Data Node, which replicates each block on to the next (three copies in total).
HDFS Reads

  The client asks the Name Node for the block locations, then reads the blocks directly
  from the Data Nodes that hold them.
HDFS Shortcomings
 No random reads
 No random writes
 Doesn’t deal with many small files

             Enter HBase
“Random Access To Your Planet-Size Data”
HBase
 Emulates random I/O with a Write Ahead Log (WAL)
 Periodically flushes the log to sorted files
 Files accessible as tables, split across many regions, hosted by region servers
 Preserves scalability, data locality, and Map/Reduce features of Hadoop
Use HBase when:
 You have noisy, semi-structured data
 You want to apply massively parallel processing to your problem
 To handle huge write loads
 As a scalable key/value store
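To make the key/value point concrete, here is a minimal put-and-get sketch with the HBase 1.x-style Java client; the table, column family, and row key are made up for illustration:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Get;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.client.Result;
    import org.apache.hadoop.hbase.client.Table;
    import org.apache.hadoop.hbase.util.Bytes;

    public class HBaseSketch {
      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf);
             Table table = connection.getTable(TableName.valueOf("patients"))) {

          // Write: one row keyed by patient id, one cell in the "demographics" family.
          Put put = new Put(Bytes.toBytes("patient-42"));
          put.addColumn(Bytes.toBytes("demographics"), Bytes.toBytes("name"), Bytes.toBytes("Ada"));
          table.put(put);

          // Read it back by row key - random access on top of HDFS.
          Result result = table.get(new Get(Bytes.toBytes("patient-42")));
          byte[] name = result.getValue(Bytes.toBytes("demographics"), Bytes.toBytes("name"));
          System.out.println(Bytes.toString(name));
        }
      }
    }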
But there are drawbacks:
  Limited schema support
  Limited atomicity guarantees
  No built-in secondary indexes

HBase is a great tool for many jobs,
but not every job
The data store should align
with the needs of the application
So a pattern is emerging:

  Collection: Millennium, CCDs, Claims, HL7
  Aggregation: Hadoop with HBase
  Processing: MapReduce Jobs
  Storage: MPP, Relational, Document Store, HBase
But we have a potential bottleneck: loading the processed data into the storage systems
(the Processing → Storage step of the pipeline above)
Direct inserts are designed for online updates, not massively parallel data loads
So shift the work into MapReduce, and pre-build files for bulk import

  Oracle Loader for Hadoop
  HBase HFile Import
  Bulk Loads for MPP
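A rough sketch of the HBase half of that idea, assuming an HBase 1.x-style API and made-up table, column, and path names: the job writes region-aligned HFiles via HFileOutputFormat2 instead of issuing a put per record.

    import java.io.IOException;
    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.hbase.HBaseConfiguration;
    import org.apache.hadoop.hbase.TableName;
    import org.apache.hadoop.hbase.client.Connection;
    import org.apache.hadoop.hbase.client.ConnectionFactory;
    import org.apache.hadoop.hbase.client.Put;
    import org.apache.hadoop.hbase.io.ImmutableBytesWritable;
    import org.apache.hadoop.hbase.mapreduce.HFileOutputFormat2;
    import org.apache.hadoop.hbase.util.Bytes;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class BulkLoadPrep {

      // Hypothetical mapper: turns a tab-separated claim line into a Put keyed by claim id.
      public static class ClaimsMapper
          extends Mapper<LongWritable, Text, ImmutableBytesWritable, Put> {
        @Override
        protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
          String[] fields = value.toString().split("\t", 2);
          Put put = new Put(Bytes.toBytes(fields[0]));
          put.addColumn(Bytes.toBytes("c"), Bytes.toBytes("raw"), Bytes.toBytes(fields[1]));
          context.write(new ImmutableBytesWritable(put.getRow()), put);
        }
      }

      public static void main(String[] args) throws Exception {
        Configuration conf = HBaseConfiguration.create();
        try (Connection connection = ConnectionFactory.createConnection(conf)) {
          TableName table = TableName.valueOf("claims");

          Job job = Job.getInstance(conf, "claims-hfile-prep");
          job.setJarByClass(BulkLoadPrep.class);
          job.setMapperClass(ClaimsMapper.class);
          job.setMapOutputKeyClass(ImmutableBytesWritable.class);
          job.setMapOutputValueClass(Put.class);
          FileInputFormat.addInputPath(job, new Path("/raw/claims"));
          FileOutputFormat.setOutputPath(job, new Path("/staging/claims-hfiles"));

          // Wires in the sort and partitioning (and reducers) so the output HFiles
          // line up with the table's regions.
          HFileOutputFormat2.configureIncrementalLoad(job,
              connection.getTable(table), connection.getRegionLocator(table));
          job.waitForCompletion(true);

          // The finished HFiles are then handed to the region servers in one step,
          // e.g. with HBase's completebulkload tool, instead of millions of direct puts.
        }
      }
    }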
And we’re missing an important piece:

  Collection: Millennium, CCDs, Claims, HL7
  Aggregation: Hadoop with HBase
  Processing: Realtime Processing alongside the (batch) Map/Reduce Jobs
  Storage: MPP, Relational, Document Store, HBase
How do we make it fast?

  Speed Layer
  Batch Layer

http://www.slideshare.net/nathanmarz/the-secrets-of-building-realtime-big-data-systems
How do we make it fast?

  Speed Layer: hours of data, low latency (seconds to process),
    moves data to computation, incremental updates
  Batch Layer: years of data, high latency (minutes or hours to process),
    moves computation to data, bulk loads
How do we make it fast?

  Speed Layer: Complex Event Processing with Storm
  Batch Layer: MapReduce on Hadoop
And now, the challenge…
Process all data overnight
Quickly create new data models
   Fast iteration cycles mean fast innovation

Process all data overnight
   Simple correction of any bugs
   Much easier to understand and work with
Questions?
