Design Patterns for Big Data
Architecture: Best Strategies for
Streamlined [Simple, Powerful]
Design

Allen Day, PhD
Data ...
BIG DATA
©MapR Technologies - Confidential
Me, Us
• Allen Day, Principal Data Scientist, MapR
R contributor (10 yr), Hadoop developer (6 yr)
Human Genetics (UCLA Med...
Three Business Use Cases
Personalized
Search

Personalized
Medicine

Market
Segmentatio...
Three Business Use Cases
Personalized
Search

Personalized
Medicine

• Public web index
+ personal
search history
• Custom...
But First…

WHAT IS A DESIGN PATTERN?

But Before That…

SURPRISE!

Design Pattern Idea
• a general reusable solution to a commonly
occurring problem
• not a finished design
• not code
• can...
History of SW Design Patterns

1977
Architecture &
Civil Engineering

1994
OO Software
...
Not Just Software Designs

http://en.wikipedia.org/wiki/A-line
Identifying the Pattern
Pattern Dimensions
1. Volume
2. Variety
3. Velocity
4. Business Intents & Methods
5. SLAs

©MapR ...
Choose a Pattern: Volume & Velocity
1. How big is your target data?
<10 GB

mid
?

?

A

Single element
at a time

>200 GB...
Twitter Zeitgeist as a
Composite of Design Patterns
Live data source
e.g.
Twitter Firehose

B

C

Big storage

Streaming

...
Percolation in Classic Form
Real-time data
source
Real-time
insertion

Data
store

Offline
percolation
of recent data

Larg...
Percolation in Classic Form
Real-time data
source
Real-time
insertion

Data
store

Offline
percolation
of recent data

Queu...
Percolation in Classic Form
Real-time data
source
Real-time
insertion

Data
store

Offli...
Percolation of a Composite Store
Real-time data
source
Real-time
insertion

Data
store

Offline
percolation
Index

Both par...
Market Segmentation
• Divide customers into subsets with common
needs
• Design specific strategies for each subset
• Major...
Market Segmentation
Feature
Extraction
Real-time
transactions
Customer
history

What does
this have to
do with
percolation...
Percolator 1
Feature
Extraction
Real-time
transactions
Customer
history

Feature extrac...
Percolator 2
Real-time
transactions
Customer
history

Market segment assignment
is percolation because it is
triggered by ...
Scheduled Update - Not Percolation

Customer
history

Clustering
The clustering loop is not
percolation since it runs at
fi...
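
The distinction drawn on the Percolator 1, Percolator 2, and Scheduled Update slides can be sketched in a few lines of Python. This is only an illustration under stated assumptions: an in-memory dict stands in for the customer-history table, average spend is a made-up feature, and `CustomerStore`, `percolate`, and `recluster` are hypothetical names, not part of any MapR or Google Percolator API.

```python
from collections import defaultdict


class CustomerStore:
    """Toy stand-in for the customer-history table (an assumption, not a MapR API)."""

    def __init__(self, centroids):
        self.rows = {}              # customer_id -> record dict
        self.pending = []           # ids of new records awaiting percolation
        self.centroids = centroids  # segment name -> average spend (hypothetical feature)

    def insert(self, customer_id, purchases):
        # Real-time insertion: write the raw record and return immediately.
        self.rows[customer_id] = {"purchases": list(purchases)}
        self.pending.append(customer_id)

    def percolate(self):
        # Triggered by the arrival of new records; touches only those records.
        while self.pending:
            row = self.rows[self.pending.pop(0)]
            purchases = row["purchases"]
            # Percolator 1: feature extraction updates the new record.
            row["avg_spend"] = sum(purchases) / len(purchases) if purchases else 0.0
            # Percolator 2: segment assignment updates only this record's segment.
            row["segment"] = min(
                self.centroids,
                key=lambda s: abs(self.centroids[s] - row["avg_spend"]),
            )

    def recluster(self):
        # Scheduled update, not percolation: reads every record and redefines segments.
        by_segment = defaultdict(list)
        for row in self.rows.values():
            if "segment" in row:
                by_segment[row["segment"]].append(row["avg_spend"])
        for segment, values in by_segment.items():
            self.centroids[segment] = sum(values) / len(values)
```

In this sketch `insert` never waits for feature extraction, `percolate` is driven by what has just arrived, and `recluster` would be called on a timer regardless of arrivals.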
Personalized Search
• Observe web users’ activity over an extended
period
• Understand individual user interests
• Customi...
Personal Search History and Web Index
Search
Persona
Activity

db
query

Persona update
Histories
trigger

query

Search
W...
Percolator 1

Expensive feature
extraction does not
block document ingest

Web
Crawl

feature
extraction

Doc
Store
©MapR ...
Percolators 2 and 3
Persona
Activity
Persona update
Histories

Web
Crawl
Doc
Store
upda...
Percolator 4
Updates to personas
trigger updates in
related personas

Search
Persona
Activity

db
query

Persona update
Hi...
Percolator 5?

Persona
Index

Persona
Histories
trigger

query

Search
db

trigger

Doc
Index
©MapR Technologies - Confide...
Pattern Context
Persona
Activity

Web
Crawl

Encapsulated
Process
Cyclic Dependency Graph

Percolator Thoughts
• M7 tables are great as the first persistence point
in percolation
• In-memory flag column family wor...
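
A rough Python sketch of the flag idea on this slide, assuming nested dicts in place of a real table. `FlaggedTable`, the `flag_<name>` families, and the `dirty` column are hypothetical names chosen for illustration; this is not the M7 or HBase API. Each percolator gets its own flag family, so a write raises flags instead of enqueueing work, and each percolator clears only its own flag.

```python
class FlaggedTable:
    """Toy table: row key -> column family -> columns. Not the M7 or HBase API."""

    def __init__(self, percolators):
        self.rows = {}
        self.percolators = percolators  # name -> function(row) run on flagged rows

    def put(self, key, family, columns):
        row = self.rows.setdefault(key, {})
        row.setdefault(family, {}).update(columns)
        # Raise one flag per percolator, each in its own (hypothetical) flag
        # family, instead of pushing the key onto a queue.
        for name in self.percolators:
            row.setdefault("flag_" + name, {})["dirty"] = True

    def run(self, name):
        """Run one percolator over the rows whose flag for it is set."""
        processed = []
        for key, row in self.rows.items():
            flag = row.get("flag_" + name, {})
            if flag.get("dirty"):
                self.percolators[name](row)
                flag["dirty"] = False  # clear only this percolator's flag
                processed.append(key)
        return processed
```

Because the flags are independent, a slow percolator can lag behind a fast one without either blocking the write path.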
Cyclic Dependency Graph, M7 Schema

Personalized Medicine
5. Interpretation
& Follow-up

4. Reporting

1. Select Tests

2. Draw Biosample

3. Genome Sequencin...
Personalized Medicine Applications
• Pre-conception screening
• Clinical research & trials
– Drug re-targeting

• Therapeu...
Personalized Medicine
Patient
history
(EHR)

EHR
archive

Insert
(eventually)

db
Sequence
extraction
Genome
Sample

Patie...
Recommendation in Classic Form

Queue

History
Archive

db
Recent
history

query

User
...
Item-Based Recommendation
in Classic Form
Queue

History
archive

Cooccurrence
analysis

Off-line analysis

Recent
history...
Recommendation Thoughts
• Item-based recommendation is for efficiency
– expensive step in computing co-occurrence can be
d...
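
The trade-off between the two recommenders can be sketched as follows. This is a minimal illustration, assuming histories are plain lists of item ids; the function names are hypothetical, and Jaccard similarity is a stand-in choice, since the slides do not name a similarity measure.

```python
from collections import Counter
from itertools import combinations


def cooccurrence(histories):
    """Offline step: count how often two items appear in the same history."""
    counts = {}
    for items in histories.values():
        for a, b in combinations(sorted(set(items)), 2):
            counts.setdefault(a, Counter())[b] += 1
            counts.setdefault(b, Counter())[a] += 1
    return counts


def recommend_items(recent, counts, k=3):
    """Item-based, online step: a cheap lookup in the cached co-occurrence counts."""
    scores = Counter()
    for item in recent:
        scores.update(counts.get(item, {}))
    for item in recent:
        scores.pop(item, None)  # do not recommend what the user already has
    return [item for item, _ in scores.most_common(k)]


def similar_users(query, histories, k=3):
    """User-based, online step: compare the query against every stored history."""
    q = set(query)

    def jaccard(items):
        s = set(items)
        return len(q & s) / len(q | s) if q | s else 0.0

    ranked = sorted(histories, key=lambda u: jaccard(histories[u]), reverse=True)
    return ranked[:k]
```

`cooccurrence` is the expensive part and can run ahead of time, which leaves `recommend_items` fast at query time. `similar_users` scans all histories per query, which costs more but always reflects the current data.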
Business Use Cases
& Design Patterns
Recommender –
Personalized
Medicine

Pattern X –
Health data

Percolator –
Personaliz...
Summary: Best Practices
• Look at the big picture
– Find recurring patterns

• Design systems at a high-level
– Solve prob...
Thank
You!

Allen Day, PhD
Principal Data Scientist, MapR Technologies
aday@maprtech.com, allenday@allenday.com
@allenday,...
Evolution of Data Storage
Scalability
Over decades of progress,
Unix-based systems have set
the standard for compatibility...
Evolution of Data Storage
Scalability
Hadoop
Hadoop achieves much higher
scalability by trading away
essentially all of th...
Evolution of Data Storage
Scalability
Hadoop

MapR enhances Apache Hadoop by
restoring the compatibility while
increasing ...
MapR Data Storage: How it’s done
HBase
NoSQL Tables API

POSIX NFS

implements

depends

Apache
HBase

implements

impleme...
MapR Data Storage: How it’s done
Vertical Integration = High Performance
HBase
NoSQL Tables API

POSIX NFS

implements

de...
Hadoop on MapR No Longer Stands
Apart
Legacy code &
applications

New technologies
d3
node.js
Apache Storm

Multiple types...

2013.12.12 - Sydney - Big Data Analytics

Published on

http://www.meetup.com/Big-Data-Analytics/events/153606372/

Published in: Technology, Business
  • Shapes too big; they overwhelm. I would describe the three projects by short name, then add three distinct shapes, making two hearts since both are healthcare. Start with all line drawings; color is too distracting.
  • Talk track: Both the genotyping and market segmentation solutions share a useful design component known as percolation. The key idea is a fast push to store data, followed by an offline processing step that modifies the data. The modified data could go back to the same data store or… Speaker: you might note that we show real-time steps in red and non-real-time steps in black.
  • Talk track: In market segmentation, you want to identify useful segments of your customer base to target for a marketing campaign, for retention, for specific product offerings, etc. What makes “good” segments depends on what you want to do and how the environment changes. You may not know ahead of time what categories make useful segments. One way to find them is to capture customer histories and run a clustering step to discover and define the market segments. This market segment db is then queried and updated in response to new real-time data insertion or new rounds of clustering. Specific feature extraction from the customer history persistence layer may also be a useful step.
  • Talk track: the feature extraction step could be triggered by real-time data insertion…
  • Talk track: a second percolator processes new customer histories relative to the market segments.
  • Talk track: the clustering step is not triggered by the real-time insertion; it is a scheduled step and thus not an example of percolation. What about the other use case we said was similar, the genotyping?
  • Here, we trigger updates to the persona index based on EITHER updates to the persona history OR updates to the document index. The idea is that if enough docs have changed, or personas are finding “unusual” stuff, the persona is stale and we should recompute it.
  • Talk track: MapR advantages include the smooth use of HBase on a MapR cluster for the persistence layer at the insertion point or, even better, the use of MapR M7 tables instead. There are two specific advantages to M7 (besides the all-important reliability): a) Less risk of the delays, IO storms, etc. that can happen with HBase. This is VERY important when pushing real-time data to a data store. b) The strategic advantage of using in-memory flags on column families, which is very efficient in M7, where you can have lots of column families, as opposed to only a few in HBase, operationally speaking.
  • Best practice: use one column family per percolator to manage their independent I/O characteristics and prevent I/O storms.
  • Talk track: Now let’s consider the other health data example, genome sequencing for personalized medicine. This is an approach that can be used to get the particular genomic characteristics of a cancerous tumor and compare to known patient histories in order to select the best option for a customized therapy.
  • Talk track: While percolation is not used in this example, it does represent a specialized form of recommendation: user-based recommendation. In this genome sequencing / personalized medicine example, a very high bar is set for the accuracy of the recommendation. Here a user-based pattern is best. Let’s look at the generalized form…
  • Talk track: here is the basic pattern for user-based recommendation, as used in the real use case of personalized medicine. In contrast, in consumer recommendation for shopping, movies, or music, rapid response is key and accuracy is slightly less important. There, item-based recommendation is generally best, because the expensive step of computing co-occurrence can be done offline prior to a user query.
  • Gives up random access read on files; gives up strong authentication / authorization model; gives up random access write / append on files.
  • Transcript of "2013.12.12 - Sydney - Big Data Analytics"

    1. Design Patterns for Big Data Architecture: Best Strategies for Streamlined [Simple, Powerful] Design. Allen Day, PhD, Data Scientist, MapR Technologies. December 2013
    2. BIG DATA
    3. Me, Us • Allen Day, Principal Data Scientist, MapR: R contributor (10 yr), Hadoop developer (6 yr), Human Genetics (UCLA Medicine), Machine Learning • MapR: distributes open source components for Hadoop; adds major enhancements for performance, high-availability, and ease-of-use • See Also – “allenday” most places (twitter, github, etc.) – aday@maprtech.com, @mapR – http://slideshare.net/allenday
    4. Three Business Use Cases: Personalized Search, Personalized Medicine, Market Segmentation
    5. Three Business Use Cases. Personalized Search: • Public web index + personal search history • Custom ranking of results. Personalized Medicine: • Patient medical history • Genomic info. • Match against database of therapies. Market Segmentation: • Group similar customers • Target with cross-sell / upsell campaign
    6. Three Business Use Cases. Personalized Search: • Public web index + personal search history • Custom ranking of results. Personalized Medicine: • Patient medical history • Genomic info. • Match against database of therapies. Market Segmentation: • Group similar customers • Target with cross-sell / upsell campaign. Labels: Personal data, Personal data, Marketing. Which ones are similar?
    7. Three Business Use Cases. Personalized Search: • Public web index + personal search history • Custom ranking of results. Personalized Medicine: • Patient medical history • Genomic info. • Match against database of therapies. Market Segmentation: • Group similar customers • Target with cross-sell / upsell campaign. Labels: Personal data, Personal data, Marketing. Which ones are similar?
    8. Three Business Use Cases. Personalized Search: • Public web index + personal search history • Custom ranking of results. Personalized Medicine: • Patient medical history • Genomic info. • Match against database of therapies. Market Segmentation: • Group similar customers • Target with cross-sell / upsell campaign. Labels: Personal data, Personal data, Marketing. Surprise! How can you tell?
    9. But First… WHAT IS A DESIGN PATTERN?
    10. But Before That… SURPRISE!
    11. Design Pattern Idea • a general reusable solution to a commonly occurring problem • not a finished design • not code • can be used in many different situations
    12. History of SW Design Patterns: 1977 Architecture & Civil Engineering; 1994 OO Software Architecture; 2012 Parallelization Software; ? Application Parallelization
    13. Not Just Software Designs http://en.wikipedia.org/wiki/A-line
    14. Identifying the Pattern. Pattern Dimensions: 1. Volume 2. Variety 3. Velocity 4. Business Intents & Methods 5. SLAs
    15. Choose a Pattern: Volume & Velocity 1. How big is your target data? <10 GB mid ? ? A Single element at a time >200 GB 2. How big is your query data? One pass over 100% B C Big storage Streaming Multiple passes over big chunks 3. How fast do you need a result? Throughput > response D Nearline Analytics < 100s (human scale) E Exploratory Analysis
    16. Twitter Zeitgeist as a Composite of Design Patterns Live data source e.g. Twitter Firehose B C Big storage Streaming D Nearline Analytics Downstream applications
    17. Percolation in Classic Form Real-time data source Real-time insertion Data store Offline percolation of recent data. Reference: Large-scale Incremental Processing Using Distributed Transactions and Notifications, http://research.google.com/pubs/pub36726.html
    18. Percolation in Classic Form Real-time data source Real-time insertion Data store Offline percolation of recent data Queue Real-time insertion Delayed insertion Data store. Queued data are unavailable for action – not percolation
    19. Percolation in Classic Form Real-time data source Real-time insertion Data store Offline percolation of recent data
    20. Percolation of a Composite Store Real-time data source Real-time insertion Data store Offline percolation Index Both parts visible
    21. Market Segmentation • Divide customers into subsets with common needs • Design specific strategies for each subset • Major emphasis on “fresh” data
    22. Market Segmentation Feature Extraction Real-time transactions Customer history Assign Segment (search) db Market Segments query Clustering. What does this have to do with percolation?
    23. Percolator 1 Feature Extraction Real-time transactions Customer history. Feature extraction is percolation because it is triggered by the arrival of a new record and because it updates that new record.
    24. Percolator 2 Real-time transactions Customer history Assign Segment (search) db Market Segments query. Market segment assignment is percolation because it is triggered by the arrival of a new record and because only that record's segment is updated. What about the clustering?
    25. Scheduled Update - Not Percolation Customer history Clustering Market Segments. The clustering loop is not percolation since it runs at fixed intervals instead of incrementally as updates are received. It also doesn't update just a single customer record.
    26. Personalized Search • Observe web users’ activity over an extended period • Understand individual user interests • Customize search results for each user • …as fast as possible
    27. Personal Search History and Web Index Search Persona Activity db query Persona update Histories trigger query Search Web Crawl feature extraction Doc Store db update trigger Doc Index Persona Index
    28. Percolator 1 Web Crawl feature extraction Doc Store. Expensive feature extraction does not block document ingest
    29. Percolators 2 and 3 Persona Activity Persona update Histories Web Crawl Doc Store update Doc Index Persona Index
    30. Percolator 4 Search Persona Activity db query Persona update Histories Persona Index. Updates to personas trigger updates in related personas
    31. Percolator 5? Persona Index Persona Histories trigger query Search db trigger Doc Index. Persona and doc index updates trigger a personalization refresh
    32. Pattern Context Persona Activity Web Crawl Encapsulated Process
    33. Cyclic Dependency Graph
    34. Percolator Thoughts • M7 tables are great as the first persistence point in percolation • In-memory flag column family works great for triggering updates – Efficient - eliminates need for queuing – Fast triggering with row & column Bloom filters • Percolation is best supported by dedicated column families – Percolators' I/O characteristics differ – M7 works especially well because it supports lots of column families
    35. Cyclic Dependency Graph, M7 Schema
    36. Personalized Medicine: 1. Select Tests 2. Draw Biosample 3. Genome Sequencing & Analysis 4. Reporting 5. Interpretation & Follow-up
    37. Personalized Medicine Applications • Pre-conception screening • Clinical research & trials – Drug re-targeting • Therapeutics – Companion diagnostics – Therapy selection
    38. Personalized Medicine Patient history (EHR) EHR archive Insert (eventually) db Sequence extraction Genome Sample Patient health context query Search Ranked therapies Here we do not see real-time data pushed to a persistence layer and processed offline. This pattern does
    39. Personalized Medicine Patient history (EHR) EHR archive Insert (eventually) db Sequence extraction Genome Sample Patient health context query Search Ranked therapies. User-based recommendation pattern. Surprise! It’s the recommender
    40. Recommendation in Classic Form Queue History Archive db Recent history query User Search Ranked similar histories
    41. Item-Based Recommendation in Classic Form Queue History archive Cooccurrence analysis Off-line analysis Recent history query Item linkage db Search Interactive recommendation Ranked items
    42. Recommendation Thoughts • Item-based recommendation is for efficiency – expensive step in computing co-occurrence can be done offline and cached prior to a user query • User-based recommendation is for accuracy – user comparisons are done online to find the current best recommendation • MapR is great for recommendation – M7 tables are high I/O performance, can eliminate queues – Faster archive updates with optimized MapReduce – High-availability for mission life critical applications
    43. Business Use Cases & Design Patterns: Recommender – Personalized Medicine; Pattern X – Health data; Percolator – Personalized Search; Percolator – Other Industry; Percolator – Personalized Medicine; Pattern X – Other Industry
    44. Summary: Best Practices • Look at the big picture – Find recurring patterns • Design systems at a high-level – Solve problems once and reuse components – Increase R&D productivity – Decrease operational and maintenance overhead
    45. Thank You! Allen Day, PhD, Principal Data Scientist, MapR Technologies. aday@maprtech.com, allenday@allenday.com, @allenday, @mapr
    46. Evolution of Data Storage: Over decades of progress, Unix-based systems have set the standard for compatibility and functionality. Labels: Scalability, Linux, POSIX, Functionality, Compatibility
    47. Evolution of Data Storage: Hadoop achieves much higher scalability by trading away essentially all of this compatibility. Labels: Scalability, Hadoop, Linux, POSIX, Functionality, Compatibility
    48. Evolution of Data Storage: MapR enhances Apache Hadoop by restoring the compatibility while increasing scalability and performance. Labels: Scalability, Hadoop, Linux, POSIX, Functionality, Compatibility
    49. MapR Data Storage: How it’s done HBase NoSQL Tables API POSIX NFS implements depends Apache HBase implements implements depends Hadoop HDFS API implements MapR Filesystem implements Apache Hadoop HDFS
    50. MapR Data Storage: How it’s done Vertical Integration = High Performance HBase NoSQL Tables API POSIX NFS implements depends Apache HBase implements implements depends Hadoop HDFS API implements MapR Filesystem implements Apache Hadoop HDFS
    51. Hadoop on MapR No Longer Stands Apart: Legacy code & applications; New technologies (d3, node.js, Apache Storm); Multiple types of data sources; New custom applications; MapR cluster
