
20131011 - Los Gatos - Netflix - Big Data Design Patterns

Published in: Technology

  1. 1. Design Patterns for Big Data Architecture: Best Strategies for Streamlined [Simple, Powerful] Design Allen Day, PhD Data Scientist, MapR Technologies October 2013 ©MapR Technologies - Confidential
  2. 2. Me, Us • Allen Day, Principal Data Scientist, MapR: R contributor (10 yr), Hadoop (6 yr), Human Genetics (UCLA Medicine), Machine Learning • MapR: distributes open source components for Hadoop; adds major enhancements for performance, high availability, and ease of use • See also: “allenday” most places (twitter, github, etc.); aday@maprtech.com, allenday@allenday.com; @mapR
  3. 3. Three Business Use Cases: Personalized Search, Personalized Medicine, Market Segmentation
  4. 4. Three Business Use Cases • Personalized Search: public web index + personal search history; custom ranking of results • Personalized Medicine: patient medical history; genomic info; match against database of therapies • Market Segmentation: group similar customers; target with cross-sell / upsell campaign
  5. 5. Three Business Use Cases (same three use cases and bullets as slide 4), each labeled by data type • Personalized Search: personal data • Personalized Medicine: personal data • Market Segmentation: marketing. Which ones are similar?
  7. 7. Three Business Use Cases (same three use cases and bullets as slide 4), each labeled by data type • Personalized Search: personal data • Personalized Medicine: personal data • Market Segmentation: marketing. How can you tell?
  8. 8. But First… WHAT IS A DESIGN PATTERN?
  9. 9. “a design pattern is a general reusable solution to a commonly occurring problem within a given context in software design. A design pattern is not a finished design that can be transformed directly into source or machine code. It is a description or template for how to solve a problem that can be used in many different situations” (http://en.wikipedia.org/wiki/Software_design_pattern)
  10. 10. History of Design Pattern Ideation • 1977: Architecture & Civil Engineering • 1994: OO Software Architecture • 2012: Parallelization Software • ?: Application Parallelization
  11. 11. Not Just Software (http://en.wikipedia.org/wiki/A-line)
  12. 12. Big Data Application Shapes 1. How big is your input record? 2. How big is the data that is relevant to processing the input record? 3. How big is the total data that could be relevant to processing the input? 4. How fast do inputs flow in? 5. How fast do outputs need to flow out? 6. How complex (unstructured) are 1-5? 7. How predictable are 1-6? (spikiness, variance) 8. Is accuracy more important than speed? 9. Does the processing contain cycles (feedback loops)?
  13. 13. Big Data Application Shapes 1. How big is your input record? 2. How big is the data that is relevant to processing the input record? 3. How big is the total data that could be relevant to processing the input? 4. How fast do inputs flow in? 5. How fast do outputs need to flow out? 6. How complex (unstructured) are 1-5? 7. How predictable are 1-6? (spikiness, variance) 8. Is accuracy more important than speed? 9. Does the processing contain cycles (feedback loops)? Labels on the slide: Volume, Velocity, Variety
  16. 16. Choose a Pattern: Volume & Velocity (decision chart; only the labels survive in this transcript)
       • 1. How big is your target data? Options: <10 GB, mid, >200 GB
       • 2. How big is your query data? Options: single element at a time, one pass over 100%, multiple passes over big chunks
       • 3. How fast do you need a result? Options: throughput > response, < 100 s (human scale)
       • Patterns: A (marked “?”), B Big storage, C Streaming, D Nearline Analytics, E Exploratory Analysis
  17. 17. Twitter Zeitgeist as a Composite of Design Patterns. Diagram components: live data source (e.g. Twitter Firehose), B Big storage, C Streaming, D Nearline Analytics, downstream applications
  18. 18. Big Data Application Shapes 1. How big is your input record? 2. How big is the data that is relevant to processing the input record? 3. How big is the total data that could be relevant to processing the input? 4. How fast do inputs flow in? 5. How fast do outputs need to flow out? 6. How complex (unstructured) are 1-5? 7. How predictable are 1-6? (spikiness, variance) 8. Is accuracy more important than speed? 9. Does the processing contain cycles (feedback loops)? Labels on the slide: Volume, Velocity, Variety
  19. 19. Big Data Application Shapes 1. How big is your input record? 2. How big is the data that is relevant to processing the input record? 3. How big is the total data that could be relevant to processing the input? 4. How fast do inputs flow in? 5. How fast do outputs need to flow out? 6. How complex (unstructured) are 1-5? 7. How predictable are 1-6? (spikiness, variance) 8. Is accuracy more important than speed? 9. Does the processing contain cycles (feedback loops)? Labels on the slide: Volume, Velocity, Variety, Intents & Methods
  20. 20. Application characteristics, given as Personalized Search / Personalized Medicine / Market Segmenting:
       • Input record size: Small / Large / Small
       • Co-processed data size: Large / Large / Large
       • Archive size: Large / Small / Large
       • Input rate: Fast / Fast / Fast
       • Output rate: Fast / Slow / Fast
       • Process complexity: High / High / Low
       • Input/process spikiness: Low / Low / High
       • Speed or accuracy?: Speed / Accuracy / Speed
       • Cycles?: Yes / No / Yes
  21. 21. Percolation in Classic Form. Diagram components: real-time data source, real-time insertion, data store, offline percolation of recent data. Reference: “Large-scale Incremental Processing Using Distributed Transactions and Notifications”, http://research.google.com/pubs/pub36726.html
  22. 22. Percolation in Classic Form, shown next to a queued design. Percolation diagram: real-time data source, real-time insertion, data store, offline percolation of recent data. Queued diagram: queue, data store, real-time insertion, delayed insertion. Queued data are unavailable for action, so the queued design is not percolation.
  23. 23. Percolation in Classic Form. Diagram components: real-time data source, real-time insertion, data store, offline percolation of recent data
  24. 24. Percolation of a Composite Store. Diagram components: real-time data source, real-time insertion, data store, offline percolation, index. Both parts visible.
  25. 25. Market Segmentation • Divide customers into subsets with common needs • Design specific strategies for each subset • Major emphasis on “fresh” data
  26. 26. Market Segmentation. Diagram components: Real-time transactions, Feature Extraction, Customer history, Assign Segment (search), db, query, Clustering, Market Segments. What does this have to do with percolation?
  27. 27. Percolator 1. Diagram components: Feature Extraction, Real-time transactions, Customer history. Feature extraction is percolation because it is triggered by the arrival of a new record and because it updates that new record.
  28. 28. Percolator 2. Diagram components: Real-time transactions, Customer history, Assign Segment (search), db, query, Market Segments. Market segment assignment is percolation because it is triggered by the arrival of a new record and because only that record's segment is updated. What about the clustering step?
  29. 29. Scheduled Update - Not Percolation. Diagram components: Customer history, Clustering, Market Segments. The clustering loop is not percolation since it runs at fixed intervals instead of incrementally as updates are received. It also doesn't update just a single customer record.
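
The distinction drawn on slides 27-29 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `CustomerStore`, its method names, the two-number feature vector, and the nearest-centroid rule are all hypothetical and do not come from the slides. The point is that the two percolators run when one record arrives and touch only that record, while `recluster` is meant to be called on a schedule and rewrites the segments from all records.

```python
from dataclasses import dataclass, field


@dataclass
class CustomerStore:
    """Hypothetical in-memory stand-in for the customer history table."""
    centroids: dict[str, tuple[float, float]]            # segment name -> centroid
    records: dict[str, dict] = field(default_factory=dict)

    def insert(self, customer_id: str, transactions: list[float]) -> None:
        """Real-time insertion: the arrival of one record triggers both percolators."""
        self.records[customer_id] = {"transactions": transactions}
        self._extract_features(customer_id)   # percolator 1
        self._assign_segment(customer_id)     # percolator 2

    def _extract_features(self, customer_id: str) -> None:
        # Updates only the new record (assumes a non-empty transaction list).
        rec = self.records[customer_id]
        tx = rec["transactions"]
        rec["features"] = (sum(tx) / len(tx), float(len(tx)))

    def _assign_segment(self, customer_id: str) -> None:
        # Nearest-centroid lookup; only this record's segment is written.
        rec = self.records[customer_id]
        fx, fy = rec["features"]
        rec["segment"] = min(
            self.centroids,
            key=lambda s: (self.centroids[s][0] - fx) ** 2
            + (self.centroids[s][1] - fy) ** 2,
        )

    def recluster(self) -> None:
        """Scheduled update, not percolation: recomputes centroids from all records."""
        members: dict[str, list[tuple[float, float]]] = {}
        for rec in self.records.values():
            members.setdefault(rec["segment"], []).append(rec["features"])
        for seg, feats in members.items():
            n = len(feats)
            self.centroids[seg] = (
                sum(f[0] for f in feats) / n,
                sum(f[1] for f in feats) / n,
            )
```

In this sketch `insert` would sit on the real-time path and `recluster` would be driven by a timer or batch job, which is the reason the slide calls the clustering loop a scheduled update.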
  30. 30. Personalized Search • Observe web users’ activity over an extended period • Understand individual user interests • Customize search results for each user • …as fast as possible
  31. 31. Personal Search History and Web Index. Diagram components: Persona Activity, Persona Histories, Persona Index, Web Crawl, Doc Store, Doc Index, Search, db. Edge labels: feature extraction, update, trigger, query.
  32. 32. Percolator 1: Expensive feature extraction does not block document ingest. Diagram components: Web Crawl, feature extraction, Doc Store.
  33. 33. Percolators 2 and 3. Diagram components: Persona Activity, Persona Histories, Persona Index, Web Crawl, Doc Store, Doc Index. Edge label: update.
  34. 34. Percolator 4: Updates to personas trigger updates in related personas. Diagram components: Persona Activity, Persona Histories, Persona Index, Search, db. Edge labels: query, update.
  35. 35. Percolator 5? Persona and doc index updates trigger a personalization refresh. Diagram components: Persona Index, Persona Histories, Doc Index, Search, db. Edge labels: trigger, query.
  36. 36. Pattern Context. Diagram components: Persona Activity, Web Crawl, Encapsulated Process.
  37. 37. Cyclic Dependency Graph
  38. 38. Percolator Thoughts • M7 tables are great as the first persistence point in percolation • In-memory flag column family works great for triggering updates – Efficient: eliminates need for queuing – Fast triggering with row & column Bloom filters • Percolation is best supported by dedicated column families – Percolators' I/O characteristics differ – M7 works especially well because it supports lots of column families
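
The flag-column idea on slide 38 can be illustrated with a toy table. This is a sketch only: it uses plain Python dictionaries, not the M7 or HBase API, and `FlaggedTable`, `run_percolator`, and the flag and family names are hypothetical. A write sets a flag on the row, a percolator later picks up flagged rows, writes its result into its own column family, and clears the flag, so no separate queue is needed.

```python
class FlaggedTable:
    """Toy table: each row holds named column families plus a set of flags."""

    def __init__(self):
        self.rows = {}

    def put(self, row_key, family, value, flag=None):
        """Write one column family; optionally mark the row as needing work."""
        row = self.rows.setdefault(row_key, {"flags": set()})
        row[family] = value
        if flag is not None:
            row["flags"].add(flag)

    def flagged(self, flag):
        """Row keys whose flag is set (a real store might answer this from memory)."""
        return [key for key, row in self.rows.items() if flag in row["flags"]]

    def clear(self, row_key, flag):
        self.rows[row_key]["flags"].discard(flag)


def run_percolator(table, flag, source_family, target_family, fn):
    """Process every flagged row once, writing results to a dedicated family."""
    processed = 0
    for key in table.flagged(flag):
        table.put(key, target_family, fn(table.rows[key][source_family]))
        table.clear(key, flag)
        processed += 1
    return processed
```

For example, ingest could call `table.put("doc1", "raw", text, flag="needs_features")` and return immediately, while a background call to `run_percolator(table, "needs_features", "raw", "features", extract)` does the expensive work later; `extract` here stands for whatever feature function the application supplies.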
  39. 39. Cyclic Dependency Graph, M7 Schema
  40. 40. Personalized Medicine 1. Select Tests 2. Draw Biosample 3. Genome Sequencing & Analysis 4. Reporting 5. Interpretation & Follow-up
  41. 41. Personalized Medicine Applications • Pre-conception screening • Clinical research & trials – Drug re-targeting • Therapeutics – Companion diagnostics – Therapy selection
  42. 42. Personalized Medicine. Diagram components: Patient history (EHR), EHR archive, Insert (eventually), db, Genome Sample, Sequence extraction, Patient health context, query, Search, Ranked therapies. Here we do not see real-time data pushed to a persistence layer and processed offline. This pattern does not fit with percolation…
  43. 43. Personalized Medicine. Diagram components: Patient history (EHR), EHR archive, Insert (eventually), db, Genome Sample, Sequence extraction, Patient health context, query, Search, Ranked therapies. User-based recommendation pattern.
  44. 44. Recommendation in Classic Form. Diagram components: Queue, History Archive, db, Recent history, query, User Search, Ranked similar histories.
  45. 45. Item-Based Recommendation in Classic Form. Diagram components: Queue, History archive, Cooccurrence analysis, Item linkage, db, Recent history, query, Search, Ranked items. Region labels: Off-line analysis, Interactive recommendation.
  46. 46. Recommendation Thoughts • Item-based recommendation is for efficiency – expensive step in computing co-occurrence can be done offline and cached prior to a user query • User-based recommendation is for accuracy – user comparisons are done online to find the current best recommendation • MapR is great for recommendation – M7 tables are high I/O performance, can eliminate queues – Faster archive updates with optimized MapReduce – High-availability for mission LIFE critical applications
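
The trade-off on slide 46 can be made concrete with a small sketch. Everything named here is an assumption for illustration: the function names, the use of raw co-occurrence counts as the item linkage, and Jaccard overlap as the user similarity are not taken from the slides. The item-based path does its expensive pass offline and answers queries from the cached linkage; the user-based path compares the query history against stored histories at query time.

```python
from collections import Counter, defaultdict
from itertools import permutations


def build_item_linkage(histories):
    """Offline step: count how often each pair of items appears in one history."""
    linkage = defaultdict(Counter)
    for items in histories.values():
        for a, b in permutations(set(items), 2):
            linkage[a][b] += 1
    return linkage


def recommend_item_based(linkage, recent, k=3):
    """Online step (cheap): sum cached co-occurrence counts for recent items."""
    scores = Counter()
    for item in recent:
        for other, count in linkage.get(item, {}).items():
            if other not in recent:
                scores[other] += count
    return [item for item, _ in scores.most_common(k)]


def rank_similar_users(histories, recent, k=3):
    """Online step (costlier): rank stored histories by Jaccard overlap with the query."""
    recent = set(recent)

    def jaccard(items):
        items = set(items)
        union = recent | items
        return len(recent & items) / len(union) if union else 0.0

    ranked = sorted(histories, key=lambda user: jaccard(histories[user]), reverse=True)
    return ranked[:k]
```

With histories `{"u1": ["a", "b", "c"], "u2": ["a", "b"], "u3": ["b", "d"]}`, the linkage is built once and reused for every item-based query, whereas `rank_similar_users` rescans all histories on each call, which is the cost paid for always using the current data.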
  47. 47. Business Use Cases & Design Patterns • Recommender – Personalized Medicine • Pattern X – Health data • Percolator – Personalized Search • Percolator – Other Industry • Percolator – Personalized Medicine • Pattern X – Other Industry
  48. 48. Summary: Best Practices • Look at the big picture – Find recurring patterns • Design systems at a high level – Solve problems once and reuse components – Increase R&D productivity – Decrease operational and maintenance overhead
  49. 49. Thank You! Allen Day, PhD, Principal Data Scientist, MapR Technologies; aday@maprtech.com, allenday@allenday.com; @allenday, @mapr