2013.12.12 - Sydney - Big Data Analytics

http://www.meetup.com/Big-Data-Analytics/events/153606372/

  • Shapes too big; they overwhelm. I would describe the three projects by short name, then add three distinct shapes, making two of them hearts since both are healthcare. Start with all line drawings; color is too distracting.
  • Talk track: Both the genotyping and market segmentation solutions share a useful design component known as percolation. The key idea is that there is a fast push to store data and an offline processing step that modifies the data. The modified data could go back to the same data store or…. Speaker: you might note that we show real-time steps in red and non-real-time steps in black.
  • Talk track: In market segmentation, you want to identify useful segments of your customer base to target for a marketing campaign, for retention, for specific product offerings, etc. What makes “good” segments depends on what you want to do and how the environment changes. You may not know ahead of time what categories make useful segments. One way to find out is to capture customer histories and run a clustering step to discover and define the market segments. This market segment db is then queried and updated in response to new real-time data insertion or new rounds of clustering. Specific feature extraction from the customer history persistence layer may also be a useful step.
  • Talk track: the feature extraction step could be triggered by real-time data insertion…
  • Talk track: a second percolator processes new customer histories relative to the market segments.
  • Talk track: the clustering step is not triggered by the real-time insertion; it is a scheduled step and thus not an example of percolation. What about the other use case we said was similar, genotyping?
  • Here, we trigger updates to the persona index based on EITHER updates to the persona history OR updates to the document index. The idea is that if enough docs have changed, or personas are finding “unusual” stuff, the persona is stale and we should recompute it.
  • Talk track: MapR advantages include the smooth use of HBase on a MapR cluster for the persistence layer at the insertion point or, even better, the use of MapR M7 tables instead. M7 has two specific advantages (besides the all-important reliability): (a) less risk of the delays, I/O storms, etc. that can happen with HBase, which is VERY important when pushing real-time data to a data store; (b) the strategic advantage of using in-memory flags on column families, which is very efficient in M7, where you can have lots of column families, as opposed to only a few in HBase, operationally speaking.
  • Best practice: use one column family per percolator to manage their independent I/O characteristics. Prevent I/O storms.
  • Talk track: Now let’s consider the other health data example, genome sequencing for personalized medicine. This is an approach that can be used to get the particular genomic characteristics of a cancerous tumor and compare to known patient histories in order to select the best option for a customized therapy.
  • Talk track: While percolation is not used in this example, it does represent a specialized form of recommendation: user-based recommendation. In this genome sequencing / personalized medicine example, a very high bar is set for the accuracy of the recommendation, so a user-based pattern is best. Let’s look at the generalized form…
  • Talk track: here is the basic pattern for user-based recommendation, as used in the real use case of personalized medicine. In contrast, in consumer recommendation for shopping, movies, or music, rapid response is key and accuracy is slightly less important. There, item-based recommendation is generally best, because the expensive step of computing co-occurrence can be done offline prior to a user query.
  • Gives up random access read on files; gives up strong authentication / authorization model; gives up random access write / append on files.

2013.12.12 - Sydney - Big Data Analytics Presentation Transcript

  • Design Patterns for Big Data Architecture: Best Strategies for Streamlined [Simple, Powerful] Design. Allen Day, PhD, Data Scientist, MapR Technologies. December 2013. ©MapR Technologies - Confidential
  • BIG DATA
  • Me, Us
    • Allen Day, Principal Data Scientist, MapR: R contributor (10 yr), Hadoop developer (6 yr); Human Genetics (UCLA Medicine), Machine Learning
    • MapR: distributes open source components for Hadoop; adds major enhancements for performance, high availability, and ease of use
    • See also: “allenday” most places (twitter, github, etc.); aday@maprtech.com, @mapR; http://slideshare.net/allenday
  • Three Business Use Cases: Personalized Search; Personalized Medicine; Market Segmentation
  • Three Business Use Cases
    • Personalized Search (personal data): public web index + personal search history; custom ranking of results
    • Personalized Medicine (personal data): patient medical history; genomic info.; match against database of therapies
    • Market Segmentation (marketing): group similar customers; target with cross-sell / upsell campaign
    • Which ones are similar? Surprise! How can you tell?
  • But First… WHAT IS A DESIGN PATTERN?
  • But Before That… SURPRISE!
  • Design Pattern Idea
    • a general reusable solution to a commonly occurring problem
    • not a finished design
    • not code
    • can be used in many different situations
  • History of SW Design Patterns: 1977, Architecture & Civil Engineering; 1994, OO Software Architecture; 2012, Parallelization Software; ?, Application Parallelization
  • Not Just Software Designs: http://en.wikipedia.org/wiki/A-line
  • Identifying the Pattern. Pattern dimensions: 1. Volume; 2. Variety; 3. Velocity; 4. Business Intents & Methods; 5. SLAs
  • Choose a Pattern: Volume & Velocity
    1. How big is your target data? <10 GB; mid; >200 GB
    2. How big is your query data? Single element at a time; one pass over 100%; multiple passes over big chunks
    3. How fast do you need a result? Throughput > response; < 100s (human scale)
    Patterns shown: A; B, Big storage; C, Streaming; D, Nearline Analytics; E, Exploratory Analysis (two cells are marked “?”)
  • Twitter Zeitgeist as a Composite of Design Patterns. Diagram labels: live data source (e.g. Twitter Firehose); B, Big storage; C, Streaming; D, Nearline Analytics; downstream applications.
  • Percolation in Classic Form. Diagram labels: real-time data source; real-time insertion; data store; offline percolation of recent data. Reference: “Large-scale Incremental Processing Using Distributed Transactions and Notifications”, http://research.google.com/pubs/pub36726.html
  • Percolation in Classic Form, contrasted with a queue. Percolation: real-time data source; real-time insertion; data store; offline percolation of recent data. Queue: real-time insertion; queue; delayed insertion; data store. Queued data are unavailable for action, so a queue is not percolation.
  • Percolation in Classic Form. Diagram labels: real-time data source; real-time insertion; data store; offline percolation of recent data.
  • Percolation of a Composite Store. Diagram labels: real-time data source; real-time insertion; data store; offline percolation; index. Both parts visible.
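
The percolation slides above describe a fast real-time insert followed by an offline step that revisits only the recently inserted data. A minimal Python sketch of that idea follows. The `Store` class, the `pending` set, and the `enrich` callback are all hypothetical names used for illustration; an in-memory dict stands in for the data store, and nothing here reflects a real HBase or M7 API.

```python
from dataclasses import dataclass, field


@dataclass
class Store:
    """Toy stand-in for the data store: row key -> row dict."""
    rows: dict = field(default_factory=dict)
    pending: set = field(default_factory=set)  # keys awaiting percolation

    def insert(self, key, raw):
        # Real-time path: write the raw record and flag it, nothing else.
        self.rows[key] = {"raw": raw}
        self.pending.add(key)


def percolate(store, enrich):
    """Offline path: enrich only the recently inserted rows, in place."""
    processed = []
    while store.pending:
        key = store.pending.pop()
        row = store.rows[key]
        row["derived"] = enrich(row["raw"])
        processed.append(key)
    return sorted(processed)


store = Store()
store.insert("r1", "hello world")
store.insert("r2", "big data")
# Rows are readable immediately, before any percolation has run.
visible_before = "derived" not in store.rows["r1"]
done = percolate(store, enrich=lambda text: len(text.split()))
```

Unlike the queue variant on the slide, the raw row is readable as soon as `insert` returns; the percolator only adds to it later.
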
  • Market Segmentation
    • Divide customers into subsets with common needs
    • Design specific strategies for each subset
    • Major emphasis on “fresh” data
  • Market Segmentation. Diagram labels: feature extraction; real-time transactions; customer history; assign segment (search); db; market segments; query; clustering. What does this have to do with percolation?
  • Percolator 1. Diagram labels: feature extraction; real-time transactions; customer history. Feature extraction is percolation because it is triggered by the arrival of a new record and because it updates that new record.
  • Percolator 2. Diagram labels: real-time transactions; customer history; assign segment (search); db; market segments; query. Market segment assignment is percolation because it is triggered by the arrival of a new record and because only that record's segment is updated. What about the clustering?
  • Scheduled Update - Not Percolation. Diagram labels: customer history; clustering; market segments. The clustering loop is not percolation since it runs at fixed intervals instead of incrementally as updates are received. It also doesn't update just a single customer record.
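
The distinction drawn on the market segmentation slides can be sketched in Python: two per-record steps that run when a record arrives, and one batch step that runs on a schedule. Everything below is an illustrative assumption rather than the presenter's implementation: the two-number feature vector, the nearest-centroid rule, the segment names, and the function names are all made up for the example.

```python
import math


def extract_features(transactions):
    """Percolator 1: turn one customer's transactions into a feature vector."""
    total = sum(transactions)
    return (len(transactions), total / len(transactions))


def assign_segment(features, centroids):
    """Percolator 2: pick the nearest existing segment for one customer."""
    return min(centroids, key=lambda name: math.dist(features, centroids[name]))


def recluster(customers, centroids):
    """Scheduled job (not percolation): recompute every centroid at once."""
    updated = {}
    for name in centroids:
        members = [c["features"] for c in customers.values() if c["segment"] == name]
        if members:
            updated[name] = tuple(sum(col) / len(members) for col in zip(*members))
        else:
            updated[name] = centroids[name]
    return updated


centroids = {"occasional": (2.0, 10.0), "frequent": (20.0, 50.0)}
customers = {}


def on_new_record(customer_id, transactions):
    # Triggered by the arrival of one record; touches only that record.
    features = extract_features(transactions)
    customers[customer_id] = {
        "features": features,
        "segment": assign_segment(features, centroids),
    }


on_new_record("c1", [5.0, 15.0])
on_new_record("c2", [40.0] * 18)
centroids = recluster(customers, centroids)  # would run on a timer in practice
```

`on_new_record` reads the segments but never rewrites them, while `recluster` touches every segment at once, which is why only the former fits the percolation definition given on the slides.
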
  • Personalized Search
    • Observe web users’ activity over an extended period
    • Understand individual user interests
    • Customize search results for each user
    • …as fast as possible
  • Personal Search History and Web Index. Diagram labels: Search; Persona Activity; db; query; update; Persona Histories; trigger; Web Crawl; feature extraction; Doc Store; Doc Index; Persona Index.
  • Percolator 1. Diagram labels: Web Crawl; feature extraction; Doc Store. Expensive feature extraction does not block document ingest.
  • Percolators 2 and 3. Diagram labels: Persona Activity; update; Persona Histories; Persona Index; Web Crawl; Doc Store; update; Doc Index.
  • Percolator 4. Diagram labels: Search; Persona Activity; db; query; update; Persona Histories; Persona Index. Updates to personas trigger updates in related personas.
  • Percolator 5? Diagram labels: Persona Index; Persona Histories; trigger; query; Search; db; trigger; Doc Index. Persona and doc index updates trigger a personalization refresh.
  • Pattern Context. Diagram labels: Persona Activity; Web Crawl; Encapsulated Process.
  • Cyclic Dependency Graph
  • Percolator Thoughts
    • M7 tables are great as the first persistence point in percolation
    • In-memory flag column family works great for triggering updates
      – Efficient: eliminates need for queuing
      – Fast triggering with row & column Bloom filters
    • Percolation is best supported by dedicated column families
      – Percolators' I/O characteristics differ
      – M7 works especially well because it supports lots of column families
  • Cyclic Dependency Graph, M7 Schema
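
The "Percolator Thoughts" slide suggests giving each percolator its own column family and using a flag to trigger work instead of a queue. The Python sketch below models that layout with nested dicts. It is a toy under stated assumptions: the family names (`features`, `doc_index`), the `dirty` flag, and the `percolator` decorator are hypothetical, and no real HBase or M7 call is used.

```python
# Each row has one "column family" per percolator: a dict holding that
# percolator's output plus a dirty flag. All names here are hypothetical.
PERCOLATORS = {}  # family name -> (function, families to flag next)


def percolator(family, triggers=()):
    def register(func):
        PERCOLATORS[family] = (func, tuple(triggers))
        return func
    return register


@percolator("features", triggers=("doc_index",))
def extract(row):
    return sorted(set(row["raw"]["text"].lower().split()))


@percolator("doc_index")
def index(row):
    return {term: 1 for term in row["features"]["value"]}


def insert(table, key, text):
    # Real-time insert: store the raw text and flag only the first percolator.
    table[key] = {
        "raw": {"text": text},
        "features": {"value": None, "dirty": True},
        "doc_index": {"value": None, "dirty": False},
    }


def run_once(table, family):
    """Scan one family's flags; rows that are not flagged are left untouched."""
    func, triggers = PERCOLATORS[family]
    touched = 0
    for row in table.values():
        if row[family]["dirty"]:
            row[family]["value"] = func(row)
            row[family]["dirty"] = False
            for nxt in triggers:
                row[nxt]["dirty"] = True  # hand off to the next percolator
            touched += 1
    return touched


table = {}
insert(table, "doc1", "Big data big ideas")
first = run_once(table, "doc_index")   # nothing flagged yet
second = run_once(table, "features")   # extracts, then flags doc_index
third = run_once(table, "doc_index")   # now has work to do
```

Because each percolator scans only its own flags, a slow step such as feature extraction never sits in front of ingest, and one percolator finishing is what schedules the next.
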
  • Personalized Medicine: 1. Select Tests; 2. Draw Biosample; 3. Genome Sequencing & Analysis; 4. Reporting; 5. Interpretation & Follow-up
  • Personalized Medicine Applications
    • Pre-conception screening
    • Clinical research & trials
      – Drug re-targeting
    • Therapeutics
      – Companion diagnostics
      – Therapy selection
  • Personalized Medicine. Diagram labels: patient history (EHR); EHR archive; insert (eventually); db; sequence extraction; genome sample; patient health context; query; search; ranked therapies. Here we do not see real-time data pushed to a persistence layer and processed offline. This pattern does
  • Personalized Medicine. Diagram labels: patient history (EHR); EHR archive; insert (eventually); db; sequence extraction; genome sample; patient health context; query; search; ranked therapies. User-based recommendation pattern. Surprise! It’s the recommender.
  • Recommendation in Classic Form. Diagram labels: queue; history archive; db; recent history; query; user; search; ranked similar histories.
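
User-based recommendation, as outlined on the slide above, compares the current user's history against archived histories at query time and returns the most similar ones. A small Python sketch follows. Jaccard similarity over feature sets is an assumption chosen for brevity (the slides do not name a similarity measure), and the patient and marker names are made up.

```python
def jaccard(a, b):
    """Similarity of two feature sets: shared items over all items."""
    a, b = set(a), set(b)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_similar_histories(query, archive, top_n=2):
    """Compare the query against every archived history at query time."""
    scored = [(jaccard(query, feats), key) for key, feats in archive.items()]
    # Highest similarity first; ties broken by key for a stable order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [key for score, key in scored[:top_n] if score > 0]


archive = {
    "patient_a": {"marker_1", "marker_2", "marker_3"},
    "patient_b": {"marker_2", "marker_4"},
    "patient_c": {"marker_5"},
}
ranked = rank_similar_histories({"marker_1", "marker_2"}, archive)
```

All of the comparison work happens online, inside `rank_similar_histories`, which is the cost the slides accept in exchange for accuracy.
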
  • Item-Based Recommendation in Classic Form. Diagram labels: queue; history archive; co-occurrence analysis; off-line analysis; recent history; query; item linkage; db; search; interactive recommendation; ranked items.
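
Item-based recommendation moves the expensive co-occurrence analysis offline, so that the interactive step is only a lookup against the item linkage. The Python sketch below separates those two phases. Raw co-occurrence counts stand in for whatever scoring a production system would use, and the function names and sample items are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_item_linkage(histories):
    """Offline step: count how often each ordered pair of items co-occurs."""
    linkage = defaultdict(Counter)
    for items in histories:
        for a, b in permutations(set(items), 2):
            linkage[a][b] += 1
    return linkage


def recommend(recent, linkage, top_n=2):
    """Online step: sum the precomputed counts for the user's recent items."""
    scores = Counter()
    for item in recent:
        scores.update(linkage.get(item, {}))
    for item in recent:
        scores.pop(item, None)  # do not recommend what the user already has
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:top_n]]


histories = [
    ["guitar", "strings", "tuner"],
    ["guitar", "strings"],
    ["guitar", "amp"],
]
linkage = build_item_linkage(histories)  # cached ahead of any query
picks = recommend(["guitar"], linkage)
```

Only `recommend` runs while the user waits, and it never looks at another user's history.
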
  • Recommendation Thoughts
    • Item-based recommendation is for efficiency
      – expensive step in computing co-occurrence can be done offline and cached prior to a user query
    • User-based recommendation is for accuracy
      – user comparisons are done online to find the current best recommendation
    • MapR is great for recommendation
      – M7 tables are high I/O performance, can eliminate queues
      – Faster archive updates with optimized MapReduce
      – High availability for mission- and life-critical applications
  • Business Use Cases & Design Patterns: Recommender – Personalized Medicine; Pattern X – Health data; Percolator – Personalized Search; Percolator – Other Industry; Percolator – Personalized Medicine; Pattern X – Other Industry
  • Summary: Best Practices
    • Look at the big picture
      – Find recurring patterns
    • Design systems at a high level
      – Solve problems once and reuse components
      – Increase R&D productivity
      – Decrease operational and maintenance overhead
  • Thank You! Allen Day, PhD, Principal Data Scientist, MapR Technologies. aday@maprtech.com, allenday@allenday.com; @allenday, @mapr
  • Evolution of Data Storage. Chart labels: Scalability; Functionality; Compatibility; Linux; POSIX. Over decades of progress, Unix-based systems have set the standard for compatibility and functionality.
  • Evolution of Data Storage. Chart labels: Scalability; Functionality; Compatibility; Linux; POSIX; Hadoop. Hadoop achieves much higher scalability by trading away essentially all of this compatibility.
  • Evolution of Data Storage. Chart labels: Scalability; Functionality; Compatibility; Linux; POSIX; Hadoop. MapR enhances Apache Hadoop by restoring the compatibility while increasing scalability and performance.
  • MapR Data Storage: How it’s done. Diagram labels: HBase NoSQL Tables API; POSIX NFS; Hadoop HDFS API; Apache HBase; MapR Filesystem; Apache Hadoop HDFS; arrows labeled “implements” and “depends”.
  • MapR Data Storage: How it’s done. Vertical Integration = High Performance. Diagram labels: HBase NoSQL Tables API; POSIX NFS; Hadoop HDFS API; Apache HBase; MapR Filesystem; Apache Hadoop HDFS; arrows labeled “implements” and “depends”.
  • Hadoop on MapR No Longer Stands Apart. Diagram labels: legacy code & applications; new technologies (d3, node.js, Apache Storm); multiple types of data sources; new custom applications; MapR cluster.