This document discusses design patterns for big data applications. It begins by defining what a design pattern is and providing examples from architecture and software design. It then analyzes characteristics of big data applications to determine appropriate patterns, including volume, velocity, variety, and more. Common patterns are presented, such as percolation, recommendation, and encapsulated processes. Examples include personalized search, medicine, and market segmentation. The document concludes that applying the right patterns can improve the productivity, performance, and maintainability of big data systems.
HORTONWORKS DATA PLATFORM AND IBM SYSTEMS – A COMPLETE SOLUTION FOR COGNITIVE BUSINESS
SynerScope has been helping European organizations across industries unlock competitive business value from data for almost a decade. Now, by leveraging state-of-the-art access control and audit mechanisms from Hortonworks combined with the latest generation high-performance computing and storage solutions from IBM, SynerScope can connect and correlate enterprise data at a scale not previously possible. SynerScope will demonstrate end-to-end analytics workflows including deep-learning based automation using new integrated solutions from Hortonworks and IBM.
Hadoop and Spark are big data frameworks used to extract useful insights across a variety of scenarios, from ingestion, data prep, and data management to processing, analyzing, and visualizing data. Each step requires specialized toolsets to be productive. In this talk I will share solution examples from the big data ecosystem, such as Cask, StreamSets, Datameer, AtScale, and Dataiku on Microsoft’s Azure HDInsight, that simplify your big data solutions. Azure HDInsight is a cloud Spark and Hadoop service for the enterprise, so you can take advantage of all the benefits of HDInsight and get the best of both worlds. Join this session for practical information that will enable faster time to insights for you and your business.
In 2015/16 Worldpay deployed its Enterprise Data Platform - a highly secure cluster used for analysis of over 65 billion card transactions and the subject of last year's Hadoop Summit keynote in Dublin. A year on, we are now rapidly expanding our platform with true multi-tenancy. For our first tenant we have built and deployed the analytics and reporting for our central platforms. Our second tenant is to deploy 'decision engines' into our core business systems. These allow Worldpay to make decisions derived from machine learning on how we authorise and route payments traffic and how these affect the consumer, merchant and other business partners. We are also developing other tenants for systems management and security. This talk will look at what it means to truly have a single enterprise data lake and multiple tenants that share that data, and look forward to how we will extend the platform in 2017 with Hadoop 3.
Genomics applications like the Genome Analysis Toolkit (GATK) have long used techniques like MapReduce to parallelize I/O, but have never before run on Hadoop. We will describe what we did to build an end-to-end GATK-based genome analysis pipeline on Hadoop, show how it scaled at lower platform cost, and demonstrate the results.
We’re in the midst of an exciting paradigm shift in terms of how we process events data in real time to better react to business opportunities or risk. To stay ahead of your competition, you need the ability to react to business-critical events as they happen. These critical events are created through diverse sources such as social interaction, machine sensors, or a customer transaction. How can you understand the meaning and context of these events that ultimately define your business?
Xactly: How to Build a Successful Converged Data Platform with Hadoop, Spark,...MapR Technologies
Big data presents both enormous challenges and incredible opportunities for companies in today’s competitive environment. To deal with the rapid growth of global data, companies have turned to Hadoop to help them with performing real-time search, obtaining fast and efficient analytics, and predicting behaviors and trends. In this session, we’ll demonstrate how we successfully leveraged Hadoop and its ecosystem components to build a converged data infrastructure to meet these needs.
Pouring the Foundation: Data Management in the Energy IndustryDataWorks Summit
At CenterPoint Energy, both structured and unstructured data are continuing to grow at a rapid pace. This growth presents many opportunities to deliver business value and many challenges to control costs. To maximize the value of this data while controlling costs, CenterPoint Energy created a data lake using SAP HANA and Hadoop. During this presentation, CenterPoint will discuss their journey of moving smart meter data to Hadoop, how Hadoop is allowing CenterPoint to derive value from big data and their future use case road map.
Introducing Big Data concepts & Hadoop to those who wish to begin their journey in the future of Information Technology. It is certain that data is going to play a major role in days to come, from our daily lives to the biggest of ventures we might undertake. Hence, knowing about big data and the technologies to work with it is going to be essential for IT professionals.
We will start with some simple presentations and then will go on building upon it. For more intense, focused introduction & training on Big Data and related technologies, visit our website or write to us.
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Proc...Agile Testing Alliance
Introduction To Big Data with Hadoop and Spark - For Batch and Real Time Processing by "Sampat Kumar" from "Harman". The presentation was given at the #doppa17 DevOps++ Global Summit 2017. All copyrights are reserved by the author.
Real World Use Cases: Hadoop and NoSQL in ProductionCodemotion
"Real World Use Cases: Hadoop and NoSQL in Production" by Tugdual Grall.
What’s important about a technology is what you can use it to do. I’ve looked at what a number of groups are doing with Apache Hadoop and NoSQL in production, and I will relay what worked well for them and what did not. Drawing from real-world use cases, I show how people who understand these new approaches can employ them well in conjunction with traditional approaches and existing applications. Threat detection, data warehouse optimization, marketing efficiency, and biometric databases are some of the examples presented during this session.
Improving business performance is never easy! The Natixis Pack is like Rugby. Working together is key to scrum success. Our data journey would undoubtedly have been so much more difficult if we had not made the move together.
This session is the story of how ‘The Natixis Pack’ has driven change in its current IT architecture so that legacy systems can leverage some of the many components in Hortonworks Data Platform in order to improve the performance of business applications. During this session, you will hear:
• How and why the business and IT requirements originated
• How we leverage the platform to fulfill security and production requirements
• How we organize a community to:
o Guard all the players, no one gets left on the ground!
o Use the platform appropriately (not every problem is eligible for big data, and standard databases are not dead)
• What are the most usable, the most interesting and the most promising technologies in the Apache Hadoop community
We will finish the story of a successful rugby team with insight into the special skills needed from each player to win the match!
DETAILS
This session is part business, part technical. We will talk about infrastructure, security and project management as well as the industrial usage of Hive, HBase, Kafka, and Spark within an industrial Corporate and Investment Bank environment, framed by regulatory constraints.
A Study Review of Common Big Data Architecture for Small-Medium EnterpriseRidwan Fadjar
This slide was created to present the result of my paper about "A Study Review of Common Big Data Architecture for Small-Medium Enterprise" at MSCEIS FPMIPA Universitas Pendidikan Indonesia 2019.
In cooperation with: https://www.linkedin.com/in/faijinali and https://www.linkedin.com/in/fajriabdillah
LendingClub RealTime BigData Platform with Oracle GoldenGateRajit Saha
LendingClub RealTime BigData Platform with Oracle GoldenGate BigData Adapter. This was presented at Oracle Open World 2017 in San Francisco.
Speakers:
Rajit Saha
Vengata Guruswami
NoSQL Application Development with JSON and MapR-DBMapR Technologies
NoSQL databases are being used everywhere by startups and Global 2000 companies alike for data environments that require cost-effective scaling. These environments also typically need to represent data in a more flexible way than is practical with relational databases.
We're introducing MapR Streams, a reliable, global event streaming system that connects data producers and data consumers across shared topics of information. With the integration of MapR Streams, comes the industry’s first and only converged data platform that integrates file, database, event streaming, and analytics to accelerate data-driven applications and address emerging IoT needs.
Are you ready to accelerate your business with the power of a truly global platform for integrating data-in-motion with data-at-rest?
MaaS (Model as a Service): Modern Streaming Data Science with Apache Metron (...DataWorks Summit
Apache Metron (Incubating) is a streaming cybersecurity application built on Apache Storm and Hadoop. One of its core missions is to enable advanced analytics through machine learning and data science for its users. Because of the relative immaturity of data science platform infrastructure that is integrated into Hadoop and oriented to streaming analytics applications, we have been forced to create the requisite platform components out of necessity, utilizing many pieces of the Hadoop ecosystem.
In this talk, we will speak about the Metron analytics architecture and how it utilizes a custom data science model deployment and autodiscovery service that is tightly integrated with Hadoop via YARN and ZooKeeper. We will discuss how we interact with the models deployed there via a custom domain-specific language that can query models as data streams past. We will also discuss the full-stack data science tooling that has been created to enable data science at scale on an advanced analytics streaming application.
Predicting failure in power networks, detecting fraudulent activities in payment card transactions, and identifying next logical products targeted at the right customer at the right time all require machine learning around massive data sets. This form of artificial intelligence requires complex self-learning algorithms, rapid data iteration for advanced analytics and a robust big data architecture that’s up to the task.
Learn how you can quickly exploit your existing IT infrastructure and scale operations in line with your budget to enjoy advanced data modeling, without having to invest in a large data science team.
How big data and AI saved the day: critical IP almost walked out the doorDataWorks Summit
Cybersecurity threats have evolved beyond what traditional SIEMs and firewalls can detect. We present case studies highlighting how:
•An advanced manufacturer was able to identify new insider threats, enabling them to protect their IP
•A media company’s security operations center was able to verify they weren’t the source of a high-profile media leak.
The common thread across these real-world case studies is how businesses can expand their threat analysis using security analytics powered by artificial intelligence in a big data environment.
Cybersecurity threats increasingly require the aggregation and analysis of multiple data sources. Siloed tools and technologies serve their purpose, but can’t be applied to look across the ever-growing variety and volume of traffic. Big data technologies are a proven solution to aggregating and analysing data across enormous volumes and varieties of data in a scalable way. However, as security professionals well know, more data doesn’t mean more leads or detection. In fact, all too often more data means slower threat hunting and more missed incidents. The solution is to leverage advanced analytical methods like machine learning.
Machine learning is a powerful mathematical approach that can learn patterns in data to identify relevant areas to focus. By applying these methods, we can automatically learn baseline activity and detect deviations across all data sources to flag high-risk entities that behave differently from their peers or past activity. ROY WILDS, Principal Data Scientist, Interset
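To make the baseline-and-deviation idea concrete, here is a minimal sketch in Python. The entities, event counts, and z-score threshold are illustrative assumptions for demonstration, not Interset's actual method.

```python
# Learn per-entity baseline activity, then flag entities whose latest
# behavior deviates strongly from that baseline (z-score rule of thumb).
import numpy as np

history = {                          # daily event counts per entity (toy data)
    "host-a": [10, 12, 11, 9, 10],
    "host-b": [10, 11, 9, 10, 95],   # sudden spike on the last day
}

for entity, counts in history.items():
    baseline = np.mean(counts[:-1])
    spread = np.std(counts[:-1]) + 1e-9   # avoid divide-by-zero
    z = (counts[-1] - baseline) / spread
    if abs(z) > 3:                        # common rule-of-thumb threshold
        print(f"{entity}: deviation flagged (z={z:.1f})")
```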
Beyond Kerberos and Ranger - Tips to discover, track and manage risks in hybr...DataWorks Summit
Even after deploying traditional security measures like authentication and authorization to secure sensitive data, data owners and security teams are still struggling to manage and get visibility into risks to their data. The challenge multiplies when data is moving and shared across different data silos such as on-premises Hadoop and public cloud infrastructures such as AWS, Azure and Google Cloud. To control the risks that come with data, enterprises need a comprehensive data-centric approach to easily identify risks, manage security and compliance policies and implement behavior analytics to differentiate between good and bad behavior. This talk will explain a three-step process for implementing data-centric controls in your hybrid environment: discovering where sensitive data is stored, tracking where data is moving, and identifying and controlling potential misuse of the data in near real time.
First-passage percolation on random planar mapsTimothy Budd
Recently, two- and three-point functions have been derived for general planar maps with control over both the number of edges and the number of faces. In the limit of a large number of edges, the multi-point functions reduce to those for random cubic planar maps with random exponential edge lengths, and they can be interpreted in terms of either first-passage percolation (FPP) or an Eden model. We observe a surprisingly simple relation between the asymptotic first-passage time, the hop count (the number of edges in a shortest-time path), and the graph distance (the number of edges in a shortest path). Using (heuristic) transfer matrix arguments, we show that this relation remains valid for random p-valent maps for any p > 2.
Slides for a ten-minute talk on the topic of values. It is about the dark side of the force within all of us, which magically attracts us and prevents real success. It is about recognizing this behavior in oneself, in myself.
Network-Growth Rule Dependence of Fractal Dimension of Percolation Cluster on...Shu Tanaka
Our paper entitled “Network-Growth Rule Dependence of Fractal Dimension of Percolation Cluster on Square Lattice" was published in Journal of the Physical Society of Japan. This work was done in collaboration with Dr. Ryo Tamura (NIMS).
http://journals.jps.jp/doi/abs/10.7566/JPSJ.82.053002
Machine Learning and Logging for Monitoring Microservices Daniel Berman
In this talk I go over the use cases for using machine learning and centralized logging to monitor a distributed, multi-layered microservices architecture.
Scalable and Reliable Logging at PinterestKrishna Gade
At Pinterest, hundreds of services and third-party tools that are implemented in various programming languages generate billions of events every day. To achieve scalable and reliable low-latency logging, there are several challenges: (1) uploading logs that are generated in various formats from tens of thousands of hosts to Kafka in a timely manner; (2) running Kafka reliably on Amazon Web Services, where virtual instances are less reliable than on-premises hardware; (3) moving tens of terabytes of data per day from Kafka to cloud storage reliably and efficiently, while guaranteeing exactly-once persistence per message.
In this talk, we will present Pinterest’s logging pipeline, and share our experience addressing these challenges. We will dive deep into the three components we developed: data uploading from service hosts to Kafka, data transportation from Kafka to S3, and data sanitization. We will also share our experience in operating Kafka at scale in the cloud.
Interlayer-Interaction Dependence of Latent Heat in the Heisenberg Model on a...Shu Tanaka
Our paper entitled “Interlayer-Interaction Dependence of Latent Heat in the Heisenberg Model on a Stacked Triangular Lattice with Competing Interactions" was published in Physical Review E. This work was done in collaboration with Dr. Ryo Tamura (NIMS).
http://pre.aps.org/abstract/PRE/v88/i5/e052138
BigData & Supply Chain: A "Small" IntroductionIvan Gruer
As part of the LOG2020 master's program in logistics at IUAV, a brief presentation about big data and its impact on supply chains.
Topics and contents have been developed along the research for the MBA final dissertation at MIB School of Management.
Advanced Analytics and Machine Learning with Data Virtualization (India)Denodo
Watch full webinar here: https://bit.ly/3dMN503
Advanced data science techniques, like machine learning, have proven an extremely useful tool to derive valuable insights from existing data. Platforms like Spark, and complex libraries for R, Python, and Scala put advanced techniques at the fingertips of the data scientists. However, these data scientists spend most of their time looking for the right data and massaging it into a usable format. Data virtualization offers a new alternative to address these issues in a more efficient and agile way.
Watch this session to learn how companies can use data virtualization to:
- Create a logical architecture to make all enterprise data available for advanced analytics exercise
- Accelerate data acquisition and massaging, providing the data scientist with a powerful tool to complement their practice
- Integrate popular tools from the data science ecosystem: Spark, Python, Zeppelin, Jupyter, etc
Real-Time With AI – The Convergence Of Big Data And AI by Colin MacNaughtonSynerzip
Making AI real-time to meet mission-critical system demands puts a new spin on your architecture. To deliver AI-based applications that will scale as your data grows takes a new approach, one where the data doesn't become the bottleneck. We all know that the deeper the data, the better the results and the lower the risk. However, doing thousands of computations on big data requires new data structures and messaging to be used together to deliver real-time AI. During this session we will look at real reference architectures and review the new techniques that were needed to make AI real-time.
Geospatial Intelligence Middle East 2013_Big Data_Steven RamageSteven Ramage
Some initial considerations and discussion points around geospatial big data. Location adds context and relevance. Need to consider a number of V factors including Value.
Turning Petabytes of Data into Profit with Hadoop for the World’s Biggest Ret...Cloudera, Inc.
PRGX is the world's leading provider of accounts payable audit services and works with leading global retailers. As new forms of data started to flow into their organization, standard RDBMS systems were not allowing them to scale. Now, by using Talend with Cloudera Enterprise, they are able to achieve a 9-10x performance benefit in processing data, reduce errors, and provide more innovative products and services to end customers.
Watch this webinar to learn how PRGX worked with Cloudera and Talend to create a high-performance computing platform for data analytics and discovery that rapidly allows them to process, model, and serve massive amounts of structured and unstructured data.
Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Scienc...Sarah Aerni
Slides from the Pivotal Open Source Hub Meetup
"Data Science as a Commodity: Use MADlib, R, & other OSS Tools for Data Science!"
As the need for data science as a key differentiator grows in all industries, from large corporations to startups, the need to get to results quickly is enabled by sharing ideas and methods in the community. The data science team at Pivotal leverages and contributes to this community of publicly available and open source technologies as part of their practice. We will share the resources we use by highlighting specific toolkits for building models (e.g. MADlib, R) and visualization (e.g. Gephi and Circos) along with their benefits and limitations by sharing examples from Pivotal's data science engagements. At the end of this session we hope to have answered the questions: Where can I get started with Data Science? Which toolkit is most appropriate for building a model with my dataset? How can I visualize my results to have the greatest impact?
Bio: Sarah Aerni is a member of the Pivotal Data Science team with a focus on healthcare and life science. She has a background in the field of Bioinformatics, developing tools to help biomedical researchers understand their data. She holds a B.S. In Biology with a specialization in Bioinformatics and minor in French Literature from UCSD, and an M.S. and Ph.D in Biomedical Informatics from Stanford University. During her time as a researcher she focused on the interface between machine learning and biology, building computational models enabling research for a broad range of fields in biomedicine. She also co-founded a start-up providing informatics services to researchers and small companies. At Pivotal she works with customers in life science and healthcare building models to derive insight and business value from their data.
Why Your Data Science Architecture Should Include a Data Virtualization Tool ...Denodo
Watch full webinar here: https://bit.ly/35FUn32
Presented at CDAO New Zealand
Advanced data science techniques, like machine learning, have proven an extremely useful tool to derive valuable insights from existing data. Platforms like Spark, and complex libraries for R, Python, and Scala put advanced techniques at the fingertips of the data scientists.
However, most architecture laid out to enable data scientists miss two key challenges:
- Data scientists spend most of their time looking for the right data and massaging it into a usable format
- Results and algorithms created by data scientists often stay out of the reach of regular data analysts and business users
Watch this session on-demand to understand how data virtualization offers an alternative to address these issues and can accelerate data acquisition and massaging. And a customer story on the use of Machine Learning with data virtualization.
Watch full webinar here: https://bit.ly/3mdj9i7
You will often hear that "data is the new gold". In this context, data management is one of the areas that has received the most attention from the software community in recent years. From artificial intelligence and machine learning to new ways to store and process data, the landscape for data management is in constant evolution. From the privileged perspective of an enterprise middleware platform, we at Denodo have the advantage of seeing many of these changes happen.
In this webinar, we will discuss the technology trends that will drive the enterprise data strategies in the years to come. Don't miss it if you want to keep yourself informed about how to convert your data to strategic assets in order to complete the data-driven transformation in your company.
Watch this on-demand webinar as we cover:
- The most interesting trends in data management
- How to build a data fabric architecture
- How to manage your data integration strategy in the new hybrid world
- Our predictions on how those trends will change the data management world
- How companies can monetize data through a data-as-a-service infrastructure
- The role of voice computing in future data analytics
Genome Analysis Pipelines, Big Data Style
Allen Day, Chief Scientist, MapR
Powerful new tools exist for processing large volumes of data quickly across a cluster of networked computers. Typical bioinformatics workflow requirements are well-matched to these tools' capabilities. However, the tool Spark, for example, is not commonly used because many legacy bioinformatics applications make assumptions about their computing environment. These assumptions present a barrier to integrating the tools into more modern computing environments. Fortunately, these barriers are quickly coming down. In this presentation, we'll examine a few operations common to many bioinformatics pipelines, show how they were usually implemented in the past, and how they're being re-implemented right now to save time, money, and make new types of analysis possible. Some code examples will also be provided.
Video Presentation:
https://youtu.be/iwgfjHiHr7Q
Deep learning in medicine: An introduction and applications to next-generatio...Allen Day, PhD
Deep learning has enabled dramatic advances in image recognition performance. In this talk I will discuss using a deep convolutional neural network to detect genetic variation in aligned next-generation sequencing human read data. Our method, called DeepVariant, both outperforms existing genotyping tools and generalizes across genome builds and even to other species. DeepVariant represents a significant step from expert-driven statistical modeling towards more automatic deep learning approaches for developing software to interpret biological instrumentation data.
In this session we will explore how Google's Cloud services (CloudML, Vision, Genomics API) can be used to process genomic and phenotypic data and solve problems in healthcare and agriculture.
Genome Analysis Pipelines with Spark and ADAMAllen Day, PhD
Spark is a powerful new tool for processing large volumes of data quickly across a cluster of networked computers.
Typical bioinformatics workflow requirements are well-matched to Spark’s capabilities. However, Spark is not commonly used because many legacy bioinformatics applications make assumptions about their computing environment. These assumptions present a barrier to integrating the tools into more modern computing environments.
These barriers are quickly coming down. ADAM is a software library and set of tools built on top of Spark that make it easy to work with file formats commonly used for genome analysis, like FastQ, BAM, and VCF.
In this presentation, we’ll explore how a step that is common to many bioinformatics workflows, sequence alignment, can be done with Bowtie and ADAM inside a Spark environment to quickly align short reads to a reference genome. A complete code example is demonstrated and provided at https://github.com/allenday/spark-genome-alignment-demo
Hadoop as a Platform for Genomics - Strata 2015, San JoseAllen Day, PhD
Personalized medicine holds much promise to improve the quality of human life.
However, personalizing medicine depends on genome analysis software that does not scale well. Given the potential impact on society, genomics takes first place among fields of science that can benefit from Hadoop.
A single human genome contains about 3 billion base pairs. This is less than 1 gigabyte of data but the intermediate data produced by a DNA sequencer, required to produce a sequenced human genome, is many hundreds of times larger. Beyond the huge storage requirement, deep genomic analysis across large populations of humans requires enormous computational capacity as well.
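As a quick sanity check on the storage figure, here is a back-of-envelope sketch assuming the common 2-bits-per-base packing (a four-letter alphabet fits in two bits); the encoding choice is an assumption, not stated in the talk.

```python
# Back-of-envelope check of the "less than 1 gigabyte" claim,
# assuming each base is packed into 2 bits.
bases = 3_000_000_000            # ~3 billion base pairs in a human genome
packed_bytes = bases * 2 // 8    # 2 bits per base, 8 bits per byte
print(packed_bytes / 1e9, "GB")  # -> 0.75 GB
```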
Interestingly enough, while genome scientists have adopted the concept of MapReduce for parallelizing I/O, they have not embraced the Hadoop ecosystem. For example, the popular Genome Analysis Toolkit (GATK) uses a proprietary MapReduce implementation that can scale vertically but not horizontally.
The science driving genomic analyses is rapidly changing, but the operational problems of processing data from DNA sequencers quickly and reliably are not new.
I present an analysis of the parallels in the fundamental limiting components of the '90s internet boom and the DNA sequencing boom that is currently underway, and illustrate how Hadoop, a proven application architecture used widely in BigData and commercial internet applications can be reused in the genomics sector.
Renaissance in Medicine - Strata - NoSQL and GenomicsAllen Day, PhD
Renaissance in Medicine: Next-Generation Big Data Workloads
Instead of using 1s and 0s (base2), biological software is encoded as A, T, C, and G (base4). DNA sequencers are simply devices for converting information encoded in base4 to base2. Improvements in DNA sequencing technology are happening at a rate that outstrips even Moore’s Law of Computing. As a result, the number of human genomes converted to base2 and uploaded for analysis is rapidly increasing.
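As a toy illustration of that base4-to-base2 conversion, the sketch below maps each nucleotide to two bits; the particular encoding table is an arbitrary choice for demonstration.

```python
# Each nucleotide (A, C, G, T) maps to one of four 2-bit codes.
ENCODE = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}

def pack(seq):
    """Pack a DNA string into an integer, 2 bits per base."""
    bits = 0
    for base in seq:
        bits = (bits << 2) | ENCODE[base]
    return bits

print(bin(pack("GATC")))  # 0b10001101
```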
Medicine is undergoing a renaissance made possible by analyzing and creating insights from this huge and growing number of genomes. Personalized medicine is simply the practical application of these insights.
In this session, I will show how ETL and MapReduce can be applied in a clinical setting. I will also show how NoSQL and advanced analytics can be used to “reverse engineer” the genetic causes of disease. Such information can be used to predict and prevent individual suffering, as well as to increase the overall health of a society.
2014.06.30 - Renaissance in Medicine - Singapore Management University - Data...Allen Day, PhD
First draft for upcoming Hadoop World presentation "Renaissance in Medicine" that gives an overview of the upcoming changes in medical practice that are enabled by BigData technologies. Specific algorithmic techniques are detailed that enable this use case.
R + Storm Moneyball - Realtime Advanced Statistics - Hadoop Summit - San JoseAllen Day, PhD
Architecting R into the Storm Application Development Process
~~~~~
The business need for real-time analytics at large scale has focused attention on the use of Apache Storm, but an approach that is sometimes overlooked is the use of Storm and R together. This novel combination of real-time processing with Storm and the practical but powerful statistical analysis offered by R substantially extends the usefulness of Storm as a solution to a variety of business critical problems. By architecting R into the Storm application development process, Storm developers can be much more effective. The aim of this design is not necessarily to deploy faster code but rather to deploy code faster. Just a few lines of R code can be used in place of lengthy Storm code for the purpose of early exploration – you can easily evaluate alternative approaches and quickly make a working prototype.
In this presentation, Allen will build a bridge from basic real-time business goals to the technical design of solutions. We will take an example of a real-world use case, compose an implementation of the use case as Storm components (spouts, bolts, etc.) and highlight how R can be an effective tool in prototyping a solution.
Q: Can I simply hire one rockstar data scientist to cover all this kind of work?
A: No, interdisciplinary work requires teams
A: Hire leads who can speak the lingo of each required discipline
A: Hire individual contributors who cover 2+ roles, when possible
Statistical Thinking – Solve the Whole Problem
BONUS: Meta Organization – Integration with Adjacent Teams
Co-authors Allen Day @allenday and Paco Nathan @pacoid
Software Delivery At the Speed of AI: Inflectra Invests In AI-Powered QualityInflectra
In this insightful webinar, Inflectra explores how artificial intelligence (AI) is transforming software development and testing. Discover how AI-powered tools are revolutionizing every stage of the software development lifecycle (SDLC), from design and prototyping to testing, deployment, and monitoring.
Learn about:
• The Future of Testing: How AI is shifting testing towards verification, analysis, and higher-level skills, while reducing repetitive tasks.
• Test Automation: How AI-powered test case generation, optimization, and self-healing tests are making testing more efficient and effective.
• Visual Testing: Explore the emerging capabilities of AI in visual testing and how it's set to revolutionize UI verification.
• Inflectra's AI Solutions: See demonstrations of Inflectra's cutting-edge AI tools like the ChatGPT plugin and Azure Open AI platform, designed to streamline your testing process.
Whether you're a developer, tester, or QA professional, this webinar will give you valuable insights into how AI is shaping the future of software delivery.
DevOps and Testing slides at DASA ConnectKari Kakkonen
My and Rik Marselis' slides from the DASA Connect conference on 30.5.2024. We discuss what testing is, then what agile testing is, and finally what testing in DevOps means. We closed with a lovely workshop in which participants tried to find different ways to think about quality and testing in different parts of the DevOps infinity loop.
Accelerate your Kubernetes clusters with Varnish CachingThijs Feryn
A presentation about the usage and availability of Varnish on Kubernetes. This talk explores the capabilities of Varnish caching and shows how to use the Varnish Helm chart to deploy it to Kubernetes.
This presentation was delivered at K8SUG Singapore. See https://feryn.eu/presentations/accelerate-your-kubernetes-clusters-with-varnish-caching-k8sug-singapore-28-2024 for more details.
PHP Frameworks: I want to break free (IPC Berlin 2024)Ralf Eggert
In this presentation, we examine the challenges and limitations of relying too heavily on PHP frameworks in web development. We discuss the history of PHP and its frameworks to understand how this dependence has evolved. The focus will be on providing concrete tips and strategies to reduce reliance on these frameworks, based on real-world examples and practical considerations. The goal is to equip developers with the skills and knowledge to create more flexible and future-proof web applications. We'll explore the importance of maintaining autonomy in a rapidly changing tech landscape and how to make informed decisions in PHP development.
This talk is aimed at encouraging a more independent approach to using PHP frameworks, moving towards a more flexible and future-proof approach to PHP development.
GraphRAG is All You need? LLM & Knowledge GraphGuy Korland
Guy Korland, CEO and Co-founder of FalkorDB, will review two articles on the integration of language models with knowledge graphs.
1. Unifying Large Language Models and Knowledge Graphs: A Roadmap.
https://arxiv.org/abs/2306.08302
2. Microsoft Research's GraphRAG paper and a review paper on various uses of knowledge graphs:
https://www.microsoft.com/en-us/research/blog/graphrag-unlocking-llm-discovery-on-narrative-private-data/
Neuro-symbolic is not enough, we need neuro-*semantic*Frank van Harmelen
Neuro-symbolic (NeSy) AI is on the rise. However, simply machine learning on just any symbolic structure is not sufficient to really harvest the gains of NeSy. These will only be gained when the symbolic structures have an actual semantics. I give an operational definition of semantics as “predictable inference”.
All of this illustrated with link prediction over knowledge graphs, but the argument is general.
Connector Corner: Automate dynamic content and events by pushing a buttonDianaGray10
Here is something new! In our next Connector Corner webinar, we will demonstrate how you can use a single workflow to:
Create a campaign using Mailchimp with merge tags/fields
Send an interactive Slack channel message (using buttons)
Have the message received by managers and peers along with a test email for review
But there’s more:
In a second workflow supporting the same use case, you’ll see:
Your campaign sent to target colleagues for approval
If the “Approve” button is clicked, a Jira/Zendesk ticket is created for the marketing design team
But—if the “Reject” button is pushed, colleagues will be alerted via Slack message
Join us to learn more about this new, human-in-the-loop capability, brought to you by Integration Service connectors.
And...
Speakers:
Akshay Agnihotri, Product Manager
Charlie Greenberg, Host
GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using Deplo...James Anderson
Effective Application Security in Software Delivery lifecycle using Deployment Firewall and DBOM
The modern software delivery process (or the CI/CD process) includes many tools, distributed teams, open-source code, and cloud platforms. Constant focus on speed to release software to market, along with the traditional slow and manual security checks has caused gaps in continuous security as an important piece in the software supply chain. Today organizations feel more susceptible to external and internal cyber threats due to the vast attack surface in their applications supply chain and the lack of end-to-end governance and risk management.
The software team must secure its software delivery process to avoid vulnerability and security breaches. This needs to be achieved with existing tool chains and without extensive rework of the delivery processes. This talk will present strategies and techniques for providing visibility into the true risk of the existing vulnerabilities, preventing the introduction of security issues in the software, resolving vulnerabilities in production environments quickly, and capturing the deployment bill of materials (DBOM).
Speakers:
Bob Boule
Robert Boule is a technology enthusiast with PASSION for technology and making things work along with a knack for helping others understand how things work. He comes with around 20 years of solution engineering experience in application security, software continuous delivery, and SaaS platforms. He is known for his dynamic presentations in CI/CD and application security integrated in software delivery lifecycle.
Gopinath Rebala
Gopinath Rebala is the CTO of OpsMx, where he has overall responsibility for the machine learning and data processing architectures for Secure Software Delivery. Gopi also has a strong connection with our customers, leading design and architecture for strategic implementations. Gopi is a frequent speaker and well-known leader in continuous delivery and integrating security into software delivery.
UiPath Test Automation using UiPath Test Suite series, part 4DianaGray10
Welcome to UiPath Test Automation using UiPath Test Suite series part 4. In this session, we will cover Test Manager overview along with SAP heatmap.
The UiPath Test Manager overview with SAP heatmap webinar offers a concise yet comprehensive exploration of the role of a Test Manager within SAP environments, coupled with the utilization of heatmaps for effective testing strategies.
Participants will gain insights into the responsibilities, challenges, and best practices associated with test management in SAP projects. Additionally, the webinar delves into the significance of heatmaps as a visual aid for identifying testing priorities, areas of risk, and resource allocation within SAP landscapes. Through this session, attendees can expect to enhance their understanding of test management principles while learning practical approaches to optimize testing processes in SAP environments using heatmap visualization techniques.
What will you get from this session?
1. Insights into SAP testing best practices
2. Heatmap utilization for testing
3. Optimization of testing processes
4. Demo
Topics covered:
Execution from the test manager
Orchestrator execution result
Defect reporting
SAP heatmap example with demo
Speaker:
Deepak Rai, Automation Practice Lead, Boundaryless Group and UiPath MVP
Shapes too big; overwhelming. I would describe the three projects by short name, then add three distinct shapes, making two of them hearts since both are healthcare; start with all line drawings, as color would be too distracting.
Talk track: Both the genotyping and market segmentation solutions have a useful design component known as percolation. The key idea is that there is a fast push to store data and an offline processing step that modifies the data. The modified data could go back to the same data store or….
Speaker: you might note that we show real-time steps in red and non-real-time steps in black.
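As a minimal sketch of the percolation pattern just described: an in-memory Python dict stands in for the persistence layer, and all names are hypothetical; a real deployment would use HBase or M7 tables as discussed later.

```python
# Percolation pattern sketch: a fast write path plus an offline step
# that modifies stored data and writes the result back out.
import time

raw_store = {}      # fast-path persistence layer (stand-in for an HBase table)
derived_store = {}  # percolated results; could also go back to raw_store

def realtime_insert(key, record):
    """Fast push: do the minimum work needed to persist the event."""
    raw_store[key] = {"record": record, "ts": time.time(), "percolated": False}

def percolate():
    """Offline step: scan for new data, derive/modify it, store the result."""
    for key, row in raw_store.items():
        if row["percolated"]:
            continue
        derived_store[key] = row["record"].upper()  # placeholder transform
        row["percolated"] = True

realtime_insert("cust-42", "clicked product page")
percolate()  # in production this runs asynchronously, not inline
print(derived_store["cust-42"])
```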
Talk track: In market segmentation, you want to identify useful segments of your customer base to target for a marketing campaign, for retention, for specific product offerings, etc. What makes "good" segments depends on what you want to do and how the environment changes. You may not know ahead of time what categories make useful segments. One way to find this is to capture customer histories and do a clustering step for discovery and definition of the market segments. This market segment db is then queried and updated in response to new real-time data insertion or new rounds of clustering. Specific feature extraction may also be a useful step from the customer history persistence layer.
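A small sketch of that clustering step, using scikit-learn's KMeans on synthetic customer-history features; the feature set and segment count are illustrative assumptions, not from the deck.

```python
# Discover market segments by clustering customer-history features,
# then assign a new customer to a segment at query time.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Each row is a customer history reduced to features, e.g.
# [monthly_spend, visits_per_month, days_since_last_order]
histories = rng.normal(loc=[50, 4, 20], scale=[20, 2, 10], size=(500, 3))

# Offline discovery/definition of segments via clustering
segments = KMeans(n_clusters=4, n_init=10, random_state=0).fit(histories)

# The "market segment db": centroids answer queries for new real-time data
new_customer = np.array([[120.0, 9.0, 3.0]])
print("segment:", segments.predict(new_customer)[0])
```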
Talk track: the feature extraction step could be triggered by real-time data insertion…
Talk track: a second percolator processes new customer histories relative to the market segments.
Talk track: the clustering step is not triggered by the real-time insertion; it is a scheduled step and thus not an example of percolation. What about the other use case we said was similar, the genotyping?
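One way to picture the distinction between the insertion-triggered percolators and the scheduled clustering step is the sketch below; all names are hypothetical and the function bodies are stubs.

```python
# Contrast the two trigger styles: percolators fire on insertion,
# while re-clustering runs on a timer, independent of insertions.
import sched
import time

EVENTS = {}

def store_event(customer_id, event):
    EVENTS.setdefault(customer_id, []).append(event)

def feature_extraction_percolator(customer_id):
    pass  # recompute features for this customer only

def segment_assignment_percolator(customer_id):
    pass  # re-score this customer against current segment definitions

PERCOLATORS = [feature_extraction_percolator, segment_assignment_percolator]

def on_insert(customer_id, event):
    """Real-time path: persist first, then notify each percolator."""
    store_event(customer_id, event)
    for percolator in PERCOLATORS:
        percolator(customer_id)

def scheduled_reclustering():
    """Not percolation: a full clustering pass, run on a schedule."""
    pass

scheduler = sched.scheduler(time.time, time.sleep)
scheduler.enter(24 * 3600, 1, scheduled_reclustering)  # e.g. once a day
# scheduler.run() would block and fire the job after the delay
```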
Here, we trigger updates to the persona index based on EITHER updates to the persona history, OR updates to the document index. The idea here being that if enough docs have changed or personas are finding "unusual" stuff, the persona is stale and we should recompute it.
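A sketch of that staleness trigger follows; the OR condition mirrors the note above, while the thresholds and scoring are illustrative assumptions.

```python
# Recompute a persona when enough documents have changed OR its
# recent activity looks unusual (i.e., the persona has gone stale).
DOC_CHANGE_THRESHOLD = 1000     # docs changed since the last persona build
UNUSUAL_RATE_THRESHOLD = 0.8    # 0..1, fraction of "unusual" recent finds

def persona_is_stale(persona):
    return (persona["docs_changed_since_build"] > DOC_CHANGE_THRESHOLD
            or persona["unusual_find_rate"] > UNUSUAL_RATE_THRESHOLD)

def maybe_recompute(persona, rebuild_index):
    if persona_is_stale(persona):
        rebuild_index(persona["id"])

persona = {"id": "p7", "docs_changed_since_build": 1500, "unusual_find_rate": 0.2}
maybe_recompute(persona, rebuild_index=lambda pid: print("rebuilding", pid))
```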
Talk track: MapR advantages include the smooth use of HBase on a MapR cluster for the persistence layer at the insertion point, or even better, the use of MapR M7 tables instead. There are two specific advantages to M7 (besides the all-important reliability): a) less risk of the delays/IO storms etc. that can happen with HBase, which is VERY important when pushing real-time data to a data store; b) the strategic advantage of using in-memory flags on column families, which is very efficient in M7 where you can have lots of column families, as opposed to only a few in HBase, operationally speaking.
Best practice: use one column family per percolator to manage their independent I/O characteristics and prevent I/O storms.
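A sketch of this best practice using happybase, a Python Thrift client for HBase; the host, table, and family names are hypothetical, and the in_memory option here corresponds to the in-memory column-family flag mentioned in the talk track.

```python
# One column family per percolator, so each gets independent I/O settings.
# Requires a running HBase Thrift server at the assumed host.
import happybase

connection = happybase.Connection('hbase-thrift-host')  # assumed endpoint
connection.create_table(
    'customer_histories',
    {
        'raw': dict(max_versions=1),       # fast real-time inserts
        'features': dict(in_memory=True),  # feature-extraction percolator
        'segments': dict(in_memory=True),  # segment-assignment percolator
    },
)
```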
Talk track: Now let’s consider the other health data example, genome sequencing for personalized medicine. This is an approach that can be used to get the particular genomic characteristics of a cancerous tumor and compare them to known patient histories in order to select the best option for a customized therapy.
Talk track: While percolation is not used in this example, it does represent a specialized form of recommendation: user-based recommendation. In this genome sequencing / personalized medicine example, a very high bar is set for the accuracy of the recommendation. Here a user-based pattern is best. Let’s look at the generalized form…
Talk track: here is the basic pattern for user-based recommendation, as used in the real use case of personalized medicine. In contrast, in consumer recommendation for shopping, movies, or music, rapid response is key and accuracy is slightly less important. There, item-based recommendation is generally best, because the expensive step of computing co-occurrence can be done offline, prior to a user query.
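To make that contrast concrete, here is a toy sketch of the offline co-occurrence step behind item-based recommendation; the interaction matrix is synthetic, and a production system would compute this as a batch job on the cluster.

```python
# Offline: build an item-item co-occurrence matrix from user-item
# interactions. Online: answer queries with a cheap lookup.
import numpy as np

# Rows = users, columns = items; 1 means the user interacted with the item.
interactions = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
])

# Offline (expensive) step: how often each pair of items appears together.
cooccurrence = interactions.T @ interactions
np.fill_diagonal(cooccurrence, 0)

# Online step: for a user who just touched item 0, suggest the top
# co-occurring item with a simple lookup.
print("recommend item:", int(np.argmax(cooccurrence[0])))
```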