
Extending Data Lake using the Lambda Architecture June 2015

Hadoop Summit 2015

Published in: Technology

  1. 1. Extending Data Lake using the Lambda Architecture, June 2015. Dr. William Kornfeld – R&D Director, Think Big, a Teradata company. Peyman Mohajerian – UDA Architecture COE, Teradata
  2. 2. Agenda  Considerations for choosing a real-time architecture  Use cases
  3. 3. Introduction • What does it mean to be a real-time architecture? • What are the use cases that a real-time architecture serves? • When would it be a mistake to use a real-time architecture? • What are useful design patterns for implementing real-time architectures (including lambda)?
  4. 4. What is “Real Time”? Diagram: Data In → Data Store → Info Out. Generally means something is happening in seconds, not minutes or hours.
  5. 5. What is “Real Time”? Diagram: Data In → Data Store → Info Out. Generally means something is happening in a second or so, not minutes or hours. Push or Pull.
  6. 6. What is “Real Time”? Diagram: Data In → Data Store → Info Out. Generally means something is happening in a second, give or take, not minutes or hours. Push or Pull. For purposes of this talk, “Real Time” is measured from Data In through Info Out.
  7. 7. Two General Classes of Information for Storage and Retrieval • Atomic: The significant component of each individual message coming in is stored. Example: individual prescription records to be retrieved. • Aggregate: Each of the messages coming in contributes to one or more aggregates. Example: number of prescriptions for penicillin on June 9, 2015.
  8. 8. Atomic Retrieval • Question to ask: If a new message comes in, do I need to be able to see or react to it nearly immediately? • Case 1: A message represents a doctor ordering a prescription. • Case 2: A message represents a student completing the SAT with a certain score.
  9. 9. Aggregate Retrieval • Some aggregate types make sense in real time as an instantaneous snapshot at the present moment. • The “real time” value of some aggregate types is really an estimate of the value of something at some indeterminate time in the past. • Some aggregate types lose their meaning as real-time values. • Some real-time processes can be enabled by batch aggregates.
  10. 10. Aggregates with Instantaneous Meaning in Real Time • Includes sums and counts. • Examples: − Dollars of revenue earned so far today − Number of prescriptions for penicillin written today
  11. 11. Aggregates Whose Current Value May Not Be an Accurate Reflection of What Is Happening NOW • Includes aggregates which are ratios. • Examples: − Click-through rate on an ad − Conversion rate on an email marketing campaign − Percent of prescriptions filled
  12. 12. Aggregates Whose Current Value May Not Be an Accurate Reflection of What Is Happening NOW • Includes aggregates which are ratios. • Examples: − Click-through rate on an ad − Conversion rate on an email marketing campaign − Percent of prescriptions filled. Diagram label: Now
  13. 13. Aggregates that Have No Instantaneous Meaning • Includes unique user counts • Well-defined meaning only on intervals. Diagram labels: Joe, Ken, Sue, Fred, Jane, Bob, Joe, Ken, Joe, Fred, Joe
  14. 14. Real Time Aggregate Update Can be Significantly More Expensive Than Batch. Diagram labels: Web Server; PC/Male, PC/Female, Mac/Male, Mac/Female; PC, Mac, Male, Female; Everyone
  15. 15. Real Time Aggregate Update Can be Significantly More Expensive Than Batch. Diagram labels: Web Server; PC/Male, PC/Female, Mac/Male, Mac/Female; PC, Mac, Male, Female; Everyone
  16. 16. Real Time Processes that Use Batch Aggregates. Diagram labels: Data Model Periodically Rebuild Web Server
  17. 17. Suppose Your Information Can Be Real Time: Should You Use a Real Time Architecture? Do you need to know about or react to changes in the Real World within a couple of minutes of the changes? Diagram labels: Real World, Big Data System
  18. 18. Real Time vs. Batch • There are use cases for both batch and real-time data processing. • Batch tools are more stable and less subject to frequent revision. • Real-time architectures can be significantly more expensive. • Many systems will have some of each.
  19. 19. Lambda Architecture. Diagram labels: Streaming, Batch, Serving Stream, Serving Batch
  20. 20. Kappa Architecture. Diagram labels: Streaming, Serving Stream, Kafka
  21. 21. Mu Architecture. Diagram labels: Streaming, Batch, Serving
  22. 22. Real-Time Use Cases  Lambda Architecture - Medical: Patient Critical Care  Event Driven Architecture - Marketing: Customer Engagement
  23. 23. Why Big Data? Challenges in Medical Data • Health data tends to be “wide”, not “deep” • New data types are becoming more important • Unstructured • Real-time streaming • A challenge to generally move from retrospective “BI” viewing to event-based and predictive analytics usage • Multiple layers • Lots of events, data • Complex • Lots of different languages and data structures • Difficult to maintain • Lots of moving pieces/components/technologies • Lots of changes in the business
  24. 24. Project • Optimize an existing Natural Language Processing pipeline in support of critical Colorectal Surgery (move to tens of thousands of documents processed) • Replace an existing free-text search facility used by Clinical Web Service for cancer (move search to milliseconds)
  25. 25. Overall Architecture
  26. 26. Operational Statistics • Current Storm throughput up to 1.5 million documents per hour • Average of 140,000 HL7 messages actually processed per day with average latency of 60 milliseconds from ingest to persistence • Average of 50,000 documents passed through annotators per day versus 5,000 historically • Actual annotations of documents up to 6 times faster than previously accomplished • Free-text search use cases that took over 30 minutes on old infrastructure completing in milliseconds in Elasticsearch
  27. 27. Applications Deliver the Company’s Brand and Customer Experience. Diagram labels: Social Media, The Customer, Marketing Channels, Mobile Apps, Devices & Form-factors • The entirety of applications combines to deliver the full customer experience • Today they are mostly designed in a siloed manner • Applications are not designed to solicit and extract customer experience data well • At the core of application design should be the considerations for obtaining and delivering information about the customer experience
  28. 28. The Customer Experience Universe. A universe of customer experience data: • Create threads • Build graphs • Identify patterns. Diagram labels: Social Media, The Customer, Marketing Channels, Mobile Apps, Devices & Form-factors. Timeline: Day 1, Day 3, Day 7, Day 17, Day 21, Day 25. Fragments: IM Campaign Fragment, Email Campaign Fragment, Customers Services Fragment. Event labels: PaidSearch LandingPage CreateAccount TXN AttachedCC EmailSent EmailOpened EmailLinkClicked EmailClicked AccountLogin BannerAd1Impression BannerAd2Impression AddBank EmailSent EmailSent TXN AccountLogin HelpCenter EnterDispute C.S.EmailSent EmailOpened EmailLinkClicked HelpCenterHP DisputePage VirtualAgent CallsIntoIVR IVR:DisputeWorkflow TransferredtoAgent DisputeResolved C.S.SurveyEmailed
  29. 29. Event Analytics Ecosystem. Diagram labels: Social Media Email Marketing Display Marketing Website Activity Customer Account Products Transactions Customer Care Event Repository EAP Metadata Dictionary & Library Core Event Dictionary, Library & Data Source Adapters Custom Business Event Dictionary & Library Machine Learning Customer Experience Best Offers Digital Marketing Applications Reporting High Speed Query & Reporting APIs Guided UI Driven Analytics Funnel Path Graph Guided UI Funnel & Path Processing Functions Graph Engine & Functions Business Analyst Business Analyst
  30. 30. Event Analytics Ecosystem EAP Metadata Dictionary & Library Core Event Dictionary, Library & Data Source Adapters Custom Business Event Dictionary & Library Event Repository Offers Best Offers Machine Learning A/B Testing Reporting High Speed Query & Reporting APIs Guided UI Driven Analytics Funnel Path Graph Guided UI Funnel & Path Processing Functions Graph Engine & Functions Business Analyst Business Analyst Product, Customer and Transaction Data Mobile Apps Web Site Activity Social Media Display & Search Marketing Customer State eComm Customer Care 3rd Party Tracking Batch Ingest Data Dictionary Event Pattern Matching & Scoring Decisioning Buffer Serve LWIftp Aster Analytic Engine Event Metadata Dictionary Guided UI Funnel Reporting UI Processing Engine Dashboard Engine Dashboard API R-T Events for Decisioning Dashboard API Data Warehouse Product, Customer, Transaction Event Processing & Event Repository Event Processing Engine HDFS (Time) Event Repository (HBase) Event Repository (Hive) Stream Ingest Spark
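
As a minimal sketch of the Lambda Architecture on slide 19, combined with the dimensional roll-up on slides 14 and 15, the Python below keeps a batch view that is periodically rebuilt from all stored events and a speed view that is updated per event, and answers queries by adding the two. The class and function names (`LambdaCounts`, `rollup_keys`) and the event fields (`platform`, `gender`) are illustrative assumptions, not taken from the talk, and a real system would run the batch rebuild concurrently rather than in-line as done here.

```python
from collections import Counter
from itertools import combinations

# Hypothetical dimensions, mirroring the PC/Mac and Male/Female diagram labels.
DIMS = ("platform", "gender")


def rollup_keys(event, dims=DIMS):
    """Return every aggregate cell a single event contributes to.

    The empty tuple () stands for the "Everyone" cell.
    """
    keys = []
    for r in range(len(dims) + 1):
        for subset in combinations(dims, r):
            keys.append(tuple((d, event[d]) for d in subset))
    return keys


class LambdaCounts:
    """Toy batch + speed + serving layers for event counts (an assumption-laden sketch)."""

    def __init__(self):
        self.master = []             # append-only store of raw events
        self.batch_view = Counter()  # rebuilt from scratch by run_batch()
        self.speed_view = Counter()  # incremented per event since the last batch run

    def ingest(self, event):
        """Speed layer: store the event and update every cell it touches."""
        self.master.append(event)
        for key in rollup_keys(event):
            self.speed_view[key] += 1

    def run_batch(self):
        """Batch layer: recompute the view from all stored events, then reset the speed view."""
        view = Counter()
        for event in self.master:
            for key in rollup_keys(event):
                view[key] += 1
        self.batch_view = view
        self.speed_view = Counter()

    def query(self, **filters):
        """Serving layer: counts are additive, so the two views can simply be summed."""
        key = tuple((d, filters[d]) for d in DIMS if d in filters)
        return self.batch_view[key] + self.speed_view[key]


if __name__ == "__main__":
    lam = LambdaCounts()
    lam.ingest({"platform": "PC", "gender": "Male"})
    lam.ingest({"platform": "Mac", "gender": "Female"})
    lam.run_batch()
    lam.ingest({"platform": "PC", "gender": "Female"})  # arrives after the batch run
    print(lam.query())               # everyone
    print(lam.query(platform="PC"))  # all PC users
```

Each incoming event updates four cells in this sketch (everyone, platform, gender, and platform with gender), which is one way to read the cost argument on slides 14 and 15. Summing the two views works only because counts are additive; unique user counts (slide 13) would not merge this way.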
