Design Patterns for Big Data
Architecture: Best Strategies for
Streamlined [Simple, Powerful]
Design

Allen Day, PhD
Data ...
Me, Us
• Allen Day, Principal Data Scientist, MapR
R contributor (10 yr), Hadoop (6 yr)
Human Genetics (UCLA Medicine), Ma...
Three Business Use Cases
Personalized
Search

©MapR Technologies - Confidential

Personalized
Medicine

Market
Segmentatio...
Three Business Use Cases
Personalized
Search

Personalized
Medicine

• Public web index
+ personal
search history
• Custom...
But First…

WHAT IS A DESIGN PATTERN?

“a design pattern is a general reusable
solution to a commonly occurring
problem within a given context in software
design...
History of Design Pattern Ideation

1977
Architecture &
Civil Engineering

1994
OO Soft...
Not Just Software

http://en.wikipedia.org/wiki/A-line
Big Data Application Shapes
1. How big is your input record?
2. How big is the data that is relevant to processing the
inp...
Choose a Pattern: Volume & Velocity
1. How big is your target data?
<10 GB

mid
?

?

A

Single element
at a time

>200 GB...
Twitter Zeitgeist as a
Composite of Design Patterns
Live data source
e.g.
Twitter Firehose

B

C

Big storage

Streaming

...
Application
characteristic

Personalized
Search

Personalized
Medicine

Market
Segmenting

Input record size
Co-processed ...
Percolation in Classic Form
Real-time data
source
Real-time
insertion

Data
store

Offline
percolation
of recent data

Larg...
Percolation in Classic Form
Real-time data
source
Data
store

Offline
percolation
of recent data

Queue

Data
store

Real-t...
Percolation in Classic Form
Real-time data
source
Real-time
insertion

Data
store

Offli...
Percolation of a Composite Store
Real-time data
source
Real-time
insertion

Data
store

Offline
percolation
Index

Both par...
Market Segmentation
• Divide customers into subsets with common
needs
• Design specific strategies for each subset
• Major...
Market Segmentation
Feature
Extraction
Real-time
transactions
Customer
history

Assign
Segment
(search)
db
Market
Segments...
Percolator 1
Feature
Extraction
Real-time
transactions
Customer
history

Feature extrac...
Percolator 2
Real-time
transactions
Customer
history

Market segment assignment
is percolation because it is
triggered by ...
Scheduled Update - Not Percolation

Customer
history

Clustering
The clustering loop is not
percolation since it runs at
fi...
Personalized Search
• Observe web users’ activity over an extended
period
• Understand individual user interests
• Customi...
Personal Search History and Web Index
Search
Persona
Activity

db
query

Persona update
Histories
trigger

query

Search
W...
Percolator 1

Expensive feature
extraction does not
block document ingest

Web
Crawl

feature
extraction

Doc
Store
©MapR ...
Percolators 2 and 3
Persona
Activity
Persona update
Histories

Web
Crawl
Doc
Store

upda...
Percolator 4
Updates to personas
trigger updates in
related personas

Search
Persona
Activity

db
query

Persona update
Hi...
Percolator 5?

Persona
Index

Persona
Histories
trigger

query

Search
db

trigger

Doc
Index
©MapR Technologies - Confide...
Pattern Context
Persona
Activity

Web
Crawl

Encapsulated
Process
Cyclic Dependency Graph

Percolator Thoughts
• M7 tables are great as the first persistence point
in percolation
• In-memory flag column family wor...
Cyclic Dependency Graph, M7 Schema

Personalized Medicine
5. Interpretation
& Follow-up

4. Reporting

1. Select Tests

2. Draw Biosample

3. Genome Sequencin...
Personalized Medicine Applications
• Pre-conception screening
• Clinical research & trials
– Drug re-targeting

• Therapeu...
Personalized Medicine
Patient
history
(EHR)

EHR
archive

Insert
(eventually)

db
Sequence
extraction

Patient
health
cont...
Personalized Medicine
Patient
history
(EHR)

EHR
archive

Insert
(eventually)

db
Sequence
extraction
Genome
Sample

©MapR...
Recommendation in Classic Form

Queue

History
Archive

db
Recent
history

query

User
...
Item-Based Recommendation
in Classic Form
Queue

History
archive

Cooccurrence
analysis

Off-line analysis

Recent
history...
Recommendation Thoughts
• Item-based recommendation is for efficiency
– expensive step in computing co-occurrence can be
d...
Business Use Cases
& Design Patterns
Recommender –
Personalized
Medicine

Pattern X –
Health data

Percolator –
Personaliz...
Summary: Best Practices
• Look at the big picture
– Find recurring patterns

• Design systems at a high-level
– Solve prob...
Thank You!
Allen Day, PhD
Principal Data Scientist, MapR Technologies
aday@maprtech.com, allenday@allenday.com
@allenday, ...

20131011 - Los Gatos - Netflix - Big Data Design Patterns

  • Shapes are too big and overwhelming. I would describe the three projects by short name, then add three distinct shapes, making two of them hearts since both are healthcare; start with all line drawings, since color is too distracting.
  • Talk track: Both genotyping and market segmentation solutions have a useful design component known as percolation. The key idea is that there is a fast push to store data and an offline processing step that modifies data. The modified data could go back to the same data store or… Speaker: you might note that we show real-time steps in red and non-real-time steps in black.
  • Talk track: In market segmentation, you want to identify useful segments of your customer base to target for a marketing campaign, for retention, for specific product offerings, etc. What makes “good” segments depends on what you want to do and how the environment changes. You may not know ahead of time what categories make useful segments. One way to find them is to capture customer histories and run a clustering step to discover and define the market segments. This market segment db is then queried and updated in response to new real-time data insertion or new rounds of clustering. Specific feature extraction from the customer history persistence layer may also be a useful step.
  • Talk track: the feature extraction step could be triggered by real-time data insertion…
  • Talk track: a second percolator processes new customer histories relative to the market segments.
  • Talk track: the clustering step is not triggered by the real-time insertion; it is a scheduled step and thus not an example of percolation. What about the other use case we said was similar, genotyping?
  • Here, we trigger updates to the persona index based on either updates to the persona history or updates to the document index. The idea is that if enough docs have changed, or personas are finding “unusual” stuff, the persona is stale and we should recompute it.
  • Talk track: MapR advantages include the smooth use of HBase on a MapR cluster for the persistence layer at the insertion point, or even better, the use of MapR M7 tables instead. There are two specific advantages to M7 (besides the all-important reliability): a) Less risk of delays, I/O storms, etc. that can happen with HBase. This is VERY important when pushing real-time data to a data store. b) Strategic advantage of using in-memory flags on column families: this is very efficient in M7, where you can have lots of column families, as opposed to only a few in HBase, operationally speaking.
  • Best practice: use one column family per percolator to manage their independent I/O characteristics and prevent I/O storms.
  • Talk track: Now let’s consider the other health data example, genome sequencing for personalized medicine. This is an approach that can be used to get the particular genomic characteristics of a cancerous tumor and compare to known patient histories in order to select the best option for a customized therapy.
  • Talk track: While percolation is not used in this example, it does represent a specialized form of recommendation: user-based recommendation. In this genome sequencing / personalized medicine example, a very high bar is set for the accuracy of the recommendation. Here a user-based pattern is best. Let’s look at the generalized form…
  • Talk track: here is the basic pattern for user-based recommendation, as used in the real use case of personalized medicine. In contrast, in consumer recommendation for shopping, movies, or music, rapid response is key and accuracy is slightly less important. There, item-based recommendation is generally best, because the expensive step in computing co-occurrence can be done offline prior to a user query.
  • 20131011 - Los Gatos - Netflix - Big Data Design Patterns

    1. Design Patterns for Big Data Architecture: Best Strategies for Streamlined [Simple, Powerful] Design. Allen Day, PhD, Data Scientist, MapR Technologies. October 2013.
    2. Me, Us • Allen Day, Principal Data Scientist, MapR: R contributor (10 yr), Hadoop (6 yr), Human Genetics (UCLA Medicine), Machine Learning • MapR: distributes open source components for Hadoop; adds major enhancements for performance, high-availability, and ease-of-use • See also: “allenday” most places (twitter, github, etc.); aday@maprtech.com, allenday@allenday.com; @mapR
    3. Three Business Use Cases: Personalized Search; Personalized Medicine; Market Segmentation
    4. Three Business Use Cases. Personalized Search: • Public web index + personal search history • Custom ranking of results. Personalized Medicine: • Patient medical history • Genomic info. • Match against database of therapies. Market Segmentation: • Group similar customers • Target with cross-sell / upsell campaign
    5. Three Business Use Cases. Personalized Search (personal data): • Public web index + personal search history • Custom ranking of results. Personalized Medicine (personal data): • Patient medical history • Genomic info. • Match against database of therapies. Market Segmentation (marketing): • Group similar customers • Target with cross-sell / upsell campaign. Which ones are similar?
    6. Three Business Use Cases. Personalized Search (personal data): • Public web index + personal search history • Custom ranking of results. Personalized Medicine (personal data): • Patient medical history • Genomic info. • Match against database of therapies. Market Segmentation (marketing): • Group similar customers • Target with cross-sell / upsell campaign. Which ones are similar?
    7. Three Business Use Cases. Personalized Search (personal data): • Public web index + personal search history • Custom ranking of results. Personalized Medicine (personal data): • Patient medical history • Genomic info. • Match against database of therapies. Market Segmentation (marketing): • Group similar customers • Target with cross-sell / upsell campaign. How can you tell?
    8. But First… WHAT IS A DESIGN PATTERN?
    9. “a design pattern is a general reusable solution to a commonly occurring problem within a given context in software design. A design pattern is not a finished design that can be transformed directly into source or machine code. It is a description or template for how to solve a problem that can be used in many different situations” http://en.wikipedia.org/wiki/Software_design_pattern
    10. History of Design Pattern Ideation: 1977 Architecture & Civil Engineering; 1994 OO Software Architecture; 2012 Parallelization Software; ? Application Parallelization
    11. Not Just Software http://en.wikipedia.org/wiki/A-line
    12. Big Data Application Shapes 1. How big is your input record? 2. How big is the data that is relevant to processing the input record? 3. How big is the total data that could be relevant to processing the input? 4. How fast do inputs flow in? 5. How fast do outputs need to flow out? 6. How complex (unstructured) are 1-5? 7. How predictable are 1-6? (spikiness, variance) 8. Is accuracy more important than speed? 9. Does the processing contain cycles (feedback loops)?
    13. Big Data Application Shapes 1. How big is your input record? 2. How big is the data that is relevant to processing the input record? 3. How big is the total data that could be relevant to processing the input? 4. How fast do inputs flow in? 5. How fast do outputs need to flow out? 6. How complex (unstructured) are 1-5? 7. How predictable are 1-6? (spikiness, variance) 8. Is accuracy more important than speed? 9. Does the processing contain cycles (feedback loops)? Volume; Velocity; Variety
    14. Big Data Application Shapes 1. How big is your input record? 2. How big is the data that is relevant to processing the input record? 3. How big is the total data that could be relevant to processing the input? 4. How fast do inputs flow in? 5. How fast do outputs need to flow out? 6. How complex (unstructured) are 1-5? 7. How predictable are 1-6? (spikiness, variance) 8. Is accuracy more important than speed? 9. Does the processing contain cycles (feedback loops)?
    15. Big Data Application Shapes 1. How big is your input record? 2. How big is the data that is relevant to processing the input record? 3. How big is the total data that could be relevant to processing the input? 4. How fast do inputs flow in? 5. How fast do outputs need to flow out? 6. How complex (unstructured) are 1-5? 7. How predictable are 1-6? (spikiness, variance) 8. Is accuracy more important than speed? 9. Does the processing contain cycles (feedback loops)?
    16. Choose a Pattern: Volume & Velocity 1. How big is your target data? <10 GB mid ? ? A Single element at a time >200 GB 2. How big is your query data? One pass over 100% B C Big storage Streaming Multiple passes over big chunks 3. How fast do you need a result? Throughput > response D Nearline Analytics < 100s (human scale) E Exploratory Analysis
    17. Twitter Zeitgeist as a Composite of Design Patterns: Live data source, e.g. Twitter Firehose B C Big storage Streaming D Nearline Analytics Downstream applications
    18. Big Data Application Shapes 1. How big is your input record? 2. How big is the data that is relevant to processing the input record? 3. How big is the total data that could be relevant to processing the input? 4. How fast do inputs flow in? 5. How fast do outputs need to flow out? 6. How complex (unstructured) are 1-5? 7. How predictable are 1-6? (spikiness, variance) 8. Is accuracy more important than speed? 9. Does the processing contain cycles (feedback loops)? Volume; Velocity; Variety
    19. Big Data Application Shapes 1. How big is your input record? 2. How big is the data that is relevant to processing the input record? 3. How big is the total data that could be relevant to processing the input? 4. How fast do inputs flow in? 5. How fast do outputs need to flow out? 6. How complex (unstructured) are 1-5? 7. How predictable are 1-6? (spikiness, variance) 8. Is accuracy more important than speed? 9. Does the processing contain cycles (feedback loops)? Volume; Velocity; Variety; Intents & Methods
    20. Application characteristics by use case:

        Application characteristic   Personalized Search   Personalized Medicine   Market Segmenting
        Input record size            Small                 Large                   Small
        Co-processed data size       Large                 Large                   Large
        Archive size                 Large                 Small                   Large
        Input rate                   Fast                  Fast                    Fast
        Output rate                  Fast                  Slow                    Fast
        Process complexity           High                  High                    Low
        Input/process spikiness      Low                   Low                     High
        Speed or accuracy?           Speed                 Accuracy                Speed
        Cycles?                      Yes                   No                      Yes

    21. Percolation in Classic Form: Real-time data source; Real-time insertion; Data store; Offline percolation of recent data. Reference: Large-scale Incremental Processing Using Distributed Transactions and Notifications, http://research.google.com/pubs/pub36726.html
    22. Percolation in Classic Form: Real-time data source; Data store; Offline percolation of recent data; Queue; Data store; Real-time insertion; Real-time insertion; Delayed insertion. Queued data are unavailable for action – not percolation
    23. Percolation in Classic Form: Real-time data source; Real-time insertion; Data store; Offline percolation of recent data
    24. Percolation of a Composite Store: Real-time data source; Real-time insertion; Data store; Offline percolation; Index. Both parts visible
    25. Market Segmentation • Divide customers into subsets with common needs • Design specific strategies for each subset • Major emphasis on “fresh” data
    26. Market Segmentation: Feature Extraction; Real-time transactions; Customer history; Assign Segment (search); db; Market Segments; query; Clustering. What does this have to do with percolation?
    27. Percolator 1: Feature Extraction; Real-time transactions; Customer history. Feature extraction is percolation because it is triggered by the arrival of a new record and because it updates that new record.
    28. Percolator 2: Real-time transactions; Customer history; Assign Segment (search); db; Market Segments; query. Market segment assignment is percolation because it is triggered by the arrival of a new record and because only that record's segment is updated. What about the clustering step?
    29. Scheduled Update - Not Percolation: Customer history; Clustering; Market Segments. The clustering loop is not percolation since it runs at fixed intervals instead of incrementally as updates are received. It also doesn't update just a single customer record.
    30. Personalized Search • Observe web users’ activity over an extended period • Understand individual user interests • Customize search results for each user • …as fast as possible
    31. Personal Search History and Web Index: Search Persona Activity db query Persona update Histories trigger query Search Web Crawl feature extraction Doc Store db update trigger Doc Index Persona Index
    32. Percolator 1: Web Crawl; feature extraction; Doc Store. Expensive feature extraction does not block document ingest
    33. Percolators 2 and 3: Persona Activity Persona update Histories Web Crawl Doc Store update Doc Index Persona Index
    34. Percolator 4: Search Persona Activity db query Persona update Histories Persona Index. Updates to personas trigger updates in related personas
    35. Percolator 5? Persona Index Persona Histories trigger query Search db trigger Doc Index. Persona and doc index updates trigger a personalization refresh
    36. Pattern Context: Persona Activity; Web Crawl; Encapsulated Process
    37. Cyclic Dependency Graph
    38. Percolator Thoughts • M7 tables are great as the first persistence point in percolation • In-memory flag column family works great for triggering updates – Efficient - eliminates need for queuing – Fast triggering with row & column Bloom filters • Percolation is best supported by dedicated column families – Percolators' I/O characteristics differ – M7 works especially well because it supports lots of column families
    39. Cyclic Dependency Graph, M7 Schema
    40. Personalized Medicine: 1. Select Tests 2. Draw Biosample 3. Genome Sequencing & Analysis 4. Reporting 5. Interpretation & Follow-up
    41. Personalized Medicine Applications • Pre-conception screening • Clinical research & trials – Drug re-targeting • Therapeutics – Companion diagnostics – Therapy selection
    42. Personalized Medicine: Patient history (EHR) EHR archive Insert (eventually) db Sequence extraction Patient health context query Search Ranked therapies Genome Sample. Here we do not see real-time data pushed to a persistence layer and processed offline. This pattern does not fit with percolation…
    43. Personalized Medicine: Patient history (EHR) EHR archive Insert (eventually) db Sequence extraction Genome Sample Patient health context query Search Ranked therapies. User-based recommendation pattern
    44. Recommendation in Classic Form: Queue; History Archive; db; Recent history; query; User; Search; Ranked similar histories
    45. Item-Based Recommendation in Classic Form: Queue; History archive; Cooccurrence analysis; Off-line analysis; Recent history; query; Item linkage db; Search; Interactive recommendation; Ranked items
    46. Recommendation Thoughts • Item-based recommendation is for efficiency – expensive step in computing co-occurrence can be done offline and cached prior to a user query • User-based recommendation is for accuracy – user comparisons are done online to find the current best recommendation • MapR is great for recommendation – M7 tables are high I/O performance, can eliminate queues – Faster archive updates with optimized MapReduce – High-availability for mission LIFE critical applications
    47. Business Use Cases & Design Patterns: Recommender – Personalized Medicine; Pattern X – Health data; Percolator – Personalized Search; Percolator – Other Industry; Percolator – Personalized Medicine; Pattern X – Other Industry
    48. Summary: Best Practices • Look at the big picture – Find recurring patterns • Design systems at a high-level – Solve problems once and reuse components – Increase R&D productivity – Decrease operational and maintenance overhead
    49. Thank You! Allen Day, PhD, Principal Data Scientist, MapR Technologies. aday@maprtech.com, allenday@allenday.com; @allenday, @mapr
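
The percolation pattern from slides 21 to 29 (a fast real-time insert, followed by an offline step that is triggered by each new record and updates only that record) can be sketched in a few lines of Python. This is a minimal in-memory illustration under stated assumptions, not MapR M7 or HBase code: the `Store` class, the `dirty` field (standing in for the in-memory flag column family mentioned on slide 38), and the `word_counts` feature extractor are all hypothetical names invented for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Row:
    raw: str                          # written by the real-time insert
    features: Optional[dict] = None   # filled in later by the percolator
    dirty: bool = True                # hypothetical stand-in for the flag column


class Store:
    """Toy in-process data store (an assumption, not an M7/HBase client)."""

    def __init__(self) -> None:
        self.rows: Dict[str, Row] = {}

    def insert(self, key: str, raw: str) -> None:
        # Real-time path: write the record and flag it; no expensive work here.
        self.rows[key] = Row(raw=raw)

    def percolate(self, extract) -> int:
        # Offline path: visit only flagged rows and update each one in place.
        updated = 0
        for row in self.rows.values():
            if row.dirty:
                row.features = extract(row.raw)
                row.dirty = False
                updated += 1
        return updated


def word_counts(raw: str) -> dict:
    # Hypothetical "expensive" feature extraction: count words in the record.
    counts: dict = {}
    for word in raw.lower().split():
        counts[word] = counts.get(word, 0) + 1
    return counts


store = Store()
store.insert("t1", "big data big patterns")
store.insert("t2", "design patterns")
first_pass = store.percolate(word_counts)   # processes t1 and t2

store.insert("t3", "percolation")
second_pass = store.percolate(word_counts)  # processes only the new record t3
```

Because the insert only writes and flags, ingest is never blocked by feature extraction, and each pass touches only records that arrived since the previous pass. A scheduled job that recomputes over every row regardless of the flag would correspond to the clustering step on slide 29, which the deck does not count as percolation.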

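
The contrast drawn on slide 46 between item-based recommendation (co-occurrence computed offline and cached) and user-based recommendation (histories compared online at query time) can be illustrated with a small sketch. Everything here is an assumption made for illustration: the toy `histories` data, the function names, and the use of Jaccard similarity to rank histories are not taken from the deck, which does not name a similarity measure.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical history archive: user id -> set of items seen.
histories = {
    "u1": {"a", "b", "c"},
    "u2": {"a", "b"},
    "u3": {"b", "d"},
}


def build_cooccurrence(histories):
    """Offline step (item-based): count how often two items share a history."""
    counts = defaultdict(lambda: defaultdict(int))
    for items in histories.values():
        for x, y in combinations(sorted(items), 2):
            counts[x][y] += 1
            counts[y][x] += 1
    return counts


def recommend_items(recent, cooc, k=3):
    """Online step (item-based): only look up the cached co-occurrence counts."""
    scores = defaultdict(int)
    for item in recent:
        for other, n in cooc.get(item, {}).items():
            if other not in recent:
                scores[other] += n
    # Highest score first; ties broken alphabetically for a stable order.
    return sorted(scores, key=lambda i: (-scores[i], i))[:k]


def jaccard(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_similar_histories(recent, histories, k=2):
    """User-based: compare against every stored history at query time."""
    ranked = sorted(histories, key=lambda u: (-jaccard(recent, histories[u]), u))
    return ranked[:k]


cooc = build_cooccurrence(histories)              # done ahead of any query
items = recommend_items({"a", "b"}, cooc)         # cheap lookup per query
similar = rank_similar_histories({"a", "b"}, histories)  # full scan per query
```

The item-based query cost depends only on the size of the recent history, since the expensive pairwise counting has already been done. The user-based query scans the whole archive each time, which is slower but always reflects the current set of histories.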