
Design Patterns For Real Time Streaming Data Analytics

Hadoop Summit 2015

Published in: Technology

  1. Design Patterns For Real Time Streaming Data Analytics. 15 Apr 2015. Sheetal Dolas, Principal Architect, Hortonworks. © Hortonworks Inc. 2011 – 2014. All Rights Reserved
  2. Who am I?
       • Principal Architect @ Hortonworks
       • Most of my career has been in the field, solving real-life business problems
       • Last 5+ years in Big Data, including Hadoop, Storm etc.
       • Co-developed Cisco OpenSOC ( http://opensoc.github.io )
       • sheetal@hortonworks.com, @sheetal_dolas
  3. Agenda
       • Streaming Architectural Patterns - Overview
       • Design Patterns
           o What
           o Why
           o Illustrations
       • QA
  4. Streaming Architectural Patterns
  5. Real Time Streaming Architecture (diagram layers)
       • Source Systems: Syslog, Machine Data, External Streams, Other
       • Data Collection (Flume / Custom): Agent A, Agent B, Agent N
       • Messaging System (Kafka): Topic A, Topic B, Topic N
       • Real Time Processing (Storm): Topology A, Topology B, Topology N
       • Storage: Search (Elastic Search / Solr), Low Latency NoSql (HBase), Historic (Hive / HDFS)
       • Access: Web Services, REST API, Web Apps, Analytic Tools (R / Python), BI Tools, Alerting Systems
  6. Lambda Architecture (diagram components)
       • New Data, Data Stream
       • Batch Layer: All Data, Pre-compute Views
       • Speed Layer: Stream Processing, Real Time View
       • Serving Layer: Batch View, Batch View
       • Data Access: Query
  7. Kappa Architecture (diagram components)
       • Data Source, Data Stream
       • Stream Processing System: Job Version n, Job Version n + 1
       • Serving DB: Output table n, Output table n + 1
       • Data Access: Query
  8. Design Patterns
  9. Design Pattern - What is it? A general reusable solution to a commonly occurring problem within a given context in software design.
       • Key terms: Reusable Solution, Commonly Occurring Problem, Contextual, Software Design
  10. Design Patterns - Why?
       • Streaming use cases have distinct characteristics
           o Unpredictable incoming data patterns
           o Correlating multiple streams
           o Out-of-sequence and late events
  11. Design Patterns - Why?
       • High-scale, continuous streams pose new challenges
           o Peaks and valleys
           o Data characteristics that change over time
           o Maintaining latency and throughput SLAs
  12. Streaming Patterns
       • Architectural Patterns: Real-time Streaming, Near-real-time Streaming, Lambda Architecture, Kappa Architecture
       • Functional Patterns: Stream Joins, Top N (Trending), Rolling Windows
       • Data Management Patterns: External Lookup, Responsive Shuffling, Out-of-Sequence Events
       • Data Security Patterns: Message Encryption, Authorized Access, Secure Cluster Authentication
  13. Streaming Patterns - Being Discussed
       • Architectural Patterns: Real-time Streaming, Near-real-time Streaming, Lambda Architecture, Kappa Architecture
       • Functional Patterns: Stream Joins, Top N (Trending), Rolling Windows
       • Data Management Patterns: External Lookup, Responsive Shuffling, Out-of-Sequence Events
       • Data Security Patterns: Message Encryption, Authorized Access, Secure Cluster Authentication
  14. External Lookup: Dynamic, High Speed Enrichments With External Data Lookup
  15. External Lookup - Description: Referencing frequently changing external system data for event enrichment, filtering or validation, while minimizing event processing latency and system bottlenecks and maintaining high throughput.
  16. External Lookup - Challenges
       • Increased latency due to frequent external system calls
       • Insufficient memory to hold all reference data
       • Scalability and performance issues with large reference data sets
       • Reference data needs frequent cache purges and refreshes
       • External systems can become a bottleneck
  17. External Lookup - Potential Options
       • Comparison criteria: Performance, Scalability, Fault Tolerance
       • Options: Always Fetch, Cache Everything, Partition and Cache on the go
  18. External Lookup - A Reference Use Case
       • Real Time Credit Card Fraud Identification and Alert
           o Credit card transaction data arrives as a stream (typically through Kafka)
           o An external system holds information about the card holder's recent location
           o Each credit card transaction is looked up against the user's current location
           o If the geographic distance between the transaction location and the user's most recent known location is significant, the transaction is flagged as potential fraud
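
The fraud rule in the use case above can be sketched as a great-circle distance check. This is only an illustration, not the presenter's code: the haversine formula, the function names and the 500 km threshold are all assumptions, since the slides only say the distance must be "significant".

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius, an approximation


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in kilometres between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


def is_potential_fraud(txn_location, user_location, threshold_km=500.0):
    """Flag a transaction whose location is far from the user's last known location.

    threshold_km is a hypothetical tuning knob, not a value from the slides.
    """
    return haversine_km(*txn_location, *user_location) > threshold_km
```

A transaction in San Francisco for a user last seen in New York would be flagged, while one made at the user's last known location would not.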
  19. External Lookup - Topology Overview
       • Flow (in Storm): Source Stream, Credit Card Transaction Spout, Partitioner Bolt, Fraud Analyzer Bolt, Alerting System (Fraud Alert Email)
       • Partitioner Bolt: partitions data based on the area code of the mobile numbers
       • Fraud Analyzer Bolt: looks up the user's current location (User Location Information) from the External Reference Data system and finds the geo distance between the transaction location and the user location
       • Fraud Analyzer Bolt: locally caches the user location data; cache validity is time bound
  20. External Lookup - Peek in the Bolts
       • Partitioner Bolt instances 1, 2, n: the stream is partitioned based on area code
       • Fraud Analyzer Bolt instance 1: CA, NV, TX
       • Fraud Analyzer Bolt instance 2: NY, CT, MA
       • Fraud Analyzer Bolt instance n: FL, NC, OH
       • Each Fraud Analyzer Bolt instance keeps a time-sensitive local cache (use a lightweight caching solution like Guava)
  21. External Lookup - Benefits of the approach
       • Only required data is cached (on demand)
       • Each bolt caches only its partition of the reference data
       • Data is cached locally, so trips to the external system are reduced
       • Cache is time sensitive
       • On-the-go cache building handles failures elegantly
  22. External Lookup - Applicability
       • Stream processing depends on external data
       • External data is too large to be held in the memory of each task
       • External data keeps changing
       • External system has scalability limitations
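
A minimal sketch of "partition and cache on the go", under stated assumptions: `TimeBoundCache` is a hypothetical stand-in for a Guava-style expiring cache, `partition_for` is a hypothetical hash partitioner standing in for the Partitioner Bolt, and the loader represents the call to the external location system. None of these names come from the slides.

```python
import time
import zlib


class TimeBoundCache:
    """Local cache that loads entries on demand and expires them after ttl_seconds."""

    def __init__(self, ttl_seconds, loader, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._loader = loader      # called on a miss, e.g. a lookup against the external system
        self._clock = clock        # injectable so expiry can be exercised without sleeping
        self._entries = {}         # key -> (value, expires_at)

    def get(self, key):
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[1] > now:
            return entry[0]        # still valid: no trip to the external system
        value = self._loader(key)  # miss or expired: fetch and cache on the go
        self._entries[key] = (value, now + self._ttl)
        return value


def partition_for(area_code, num_tasks):
    """Send every event with the same area code to the same downstream task."""
    return zlib.crc32(area_code.encode("utf-8")) % num_tasks
```

Because each area code always lands on the same task, each task's cache only ever holds the slice of reference data it actually needs. A task that restarts simply begins with an empty cache and refills it from the events it sees. This sketch has no size-based eviction, which a production cache would likely need.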
  23. Responsive Shuffling
  24. Responsive Shuffling - Description: Automatically adjust shuffling for better performance and throughput during peaks and varying data skews in streams.
  25. Responsive Shuffling - Challenges
       • Incoming data stream is unpredictable and can be skewed
       • Skew can change from time to time
       • Managing latency and throughput under skew is difficult
       • Since streams flow continuously, restarting the topology with new shuffling logic is practically not possible
  26. Shuffling - Potential Options
       • Comparison criteria: Latency & Throughput, System Reliability, Uptime
       • Options: Static Shuffle, Responsive Shuffle
  27. Responsive Shuffling - A Reference Use Case
       • Optimized HBase Inserts
           o Event data is stored in HBase after Storm processing
           o Group events so that a bolt can insert more events into HBase with fewer trips to region servers
           o Over time, HBase regions can split or merge
           o Automatically adjust the event grouping as the HBase region layout changes
  28. Example - HBase writes w/o responsive shuffling
       • App Bolt instances 1, 2, 3 (100 events each): 300 events sent
       • HBase Bolt instances 1, 2, 3 (100 events each): 300 events sent
       • Region Server instances 1, 2, 3 (100 events each): 300 events received
       • 9 trips to region servers
  29. Responsive Shuffling - Design
  30. Example - HBase writes with responsive shuffling
       • App Bolt instances 1, 2, 3 (100 events each): 300 events sent, each through an RS Aware Partitioner
       • HBase Bolt instances 1, 2, 3 (100 events each): 300 events sent
       • Region Server instances 1, 2, 3 (100 events each): 300 events received
       • 3 trips to region servers
       • The partitioner automatically adapts to splitting/merging HBase regions
  31. Responsive Shuffling - Benefits
       • Topology responds to changes in data patterns and adapts accordingly
       • Maintains a high level of SLA and throughput adherence
       • Minimizes the need for maintenance and hence downtime
  32. Responsive Shuffling - Applicability
       • A change in shuffle pattern does not impact the final outcome
       • Data stream has varying skews
       • Target/reference system specifications change over time
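
The 9-trips versus 3-trips comparison above can be reproduced with a small sketch of a region-aware partitioner. Assumptions, none of which come from the slides: regions are described only by their sorted start keys (the first being the empty string), each region lives on its own region server, and `RegionAwarePartitioner`, `regroup` and `count_trips` are hypothetical names. A real topology would obtain the region layout from HBase and call `refresh` when it changes.

```python
import bisect


class RegionAwarePartitioner:
    """Maps a row key to the index of the region whose key range contains it."""

    def __init__(self, region_start_keys):
        self.refresh(region_start_keys)

    def refresh(self, region_start_keys):
        # Call again whenever regions split or merge; the first start key must be "".
        self._starts = sorted(region_start_keys)

    def region_index(self, row_key):
        # The owning region is the last one whose start key is <= row_key.
        return bisect.bisect_right(self._starts, row_key) - 1


def regroup(batches, partitioner, num_bolts):
    """Reshuffle events so each writer bolt receives keys from the same region."""
    out = [[] for _ in range(num_bolts)]
    for batch in batches:
        for key in batch:
            out[partitioner.region_index(key) % num_bolts].append(key)
    return out


def count_trips(batches, partitioner):
    """One trip per distinct region touched by each writer's batch."""
    return sum(len({partitioner.region_index(k) for k in batch}) for batch in batches)
```

With three regions and three writers that each hold keys from every region, `count_trips` gives 9; after `regroup` each writer touches one region and the total drops to 3. Regrouping is only safe when, as the applicability slide says, the shuffle pattern does not affect the final outcome.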
  33. Out-of-Sequence Events
  34. Out-of-Sequence Events - Description: An out-of-sequence event is one that is received late, sufficiently late that you have already processed events that should have been processed after it.
  35. Out-of-Sequence Events - Challenges
       • Hard to determine whether all events in a given window have been received
       • Late events need relevant data to be referenced again
       • Builds more pressure on processing components
       • Increased latency and degraded overall system performance
  36. Out-of-Sequence Events - Potential Options
       • Comparison criteria: Latency, Result Accuracy, Operational Ease
       • Options: Drop, Wait, Fan Out
  37. Out-of-Sequence Events - Processing
       • Flow: Source, Spout, Event Filter Bolt, then Typical Processing Bolt (in-sequence events) or Special Handling Bolt (out-of-sequence events)
       • Event Filter Bolt: monitors the events currently being processed and identifies out-of-sequence events
       • Special Handling Bolt: depending on the complexity of processing, this can be split out into a separate topology
  38. Out-of-Sequence Events - Benefits
       • Separation of concerns
       • Maintains the overall throughput and latency requirements
       • Independent scaling of components
  39. Out-of-Sequence Events - Applicability
       • The order of events matters
       • Processing out-of-sequence events needs special and complex logic
       • Stream has a relatively low volume of out-of-sequence events
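
The routing decision made by the Event Filter Bolt can be sketched as a comparison against the latest event time seen so far. This is one possible rule under stated assumptions, not the presenter's implementation: the class name, the stream labels and the `allowed_lateness` tolerance are hypothetical.

```python
class EventFilter:
    """Routes events to the normal path or to special handling based on event time."""

    def __init__(self, allowed_lateness=0):
        self._max_seen = None                  # latest event time observed so far
        self._allowed_lateness = allowed_lateness

    def route(self, event_time):
        if self._max_seen is None or event_time >= self._max_seen:
            self._max_seen = event_time
            return "in_sequence"
        # The event is older than something already seen: tolerate small delays,
        # and send anything later than the tolerance to the special handling path.
        if self._max_seen - event_time <= self._allowed_lateness:
            return "in_sequence"
        return "out_of_sequence"
```

Keeping this check cheap is what lets the typical processing path hold its latency target while the special handling path scales independently.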
  40. Summary
  41. Summary
       • A stream application is a continuously running process, as opposed to a batch process
       • Think long term and account for data patterns that change over time
       • Simplicity gives more reliability and predictability
       • Use one or more patterns in conjunction to address the use case
       • Patterns are contextual and may not be suitable for every case
  42. Thank You! sheetal@hortonworks.com @sheetal_dolas
  43. Appendix
  44. Data Security in Kafka
  45. Data Security in Kafka - Description: The ability to use Kafka as a secure data transfer mechanism. Apache Kafka is a widely used messaging platform in streaming applications. Unfortunately, Kafka does not have built-in support for authentication and authorization (yet).
  46. Data Security in Kafka - Flow
       • Source Systems: Syslog
       • Data Collection: Custom Collector with an Encrypting Producer
       • Messaging System: Kafka, holding Encrypted Messages
       • Real Time Processing (Storm): Kafka Spout, Decrypting Bolt, App Bolt
  47. Data Security in Kafka - Encryption Details
       • Event Producer (Data Collection): encrypts event(s) with AES, then encrypts the AES key with RSA
       • Event(s) Envelope (sent through the Kafka Topic): Encrypted AES Key (w/ RSA) and Encrypted Event (w/ AES)
       • Decrypting Bolt (Storm): decrypts the AES key with RSA, then decrypts event(s) with AES
  48. Data Security in Kafka - Encryption Details
       • RSA public/private keys are generated ahead of time and securely shared with the topology
       • The AES key is randomly generated and periodically refreshed
       • Only a user holding the appropriate RSA private key can read the data
       • One event or a batch of events can be encrypted together, as needed
  49. Data Security in Kafka - Applicability
       • Multiple applications want to use Kafka as the source of their stream
       • Data is sensitive and cannot be shared between applications
       • Other components in the pipeline are secured
  50. Micro Batching
  51. Micro Batching - Description: Micro-batching is a technique that allows a process or task to treat a stream as a sequence of small batches or chunks of data. For incoming streams, events can be packaged into small batches and delivered to a batch system for processing.
  52. Micro Batching - Challenges
       • Data delivery reliability
       • Unnecessary data duplication
       • Increased latency
       • Complexity in time-bound batching
  53. Micro Batching - Potential Options
       • Comparison criteria: Simplicity, Reusability, Reliability
       • Options: Batch Triggering Thread, Controller Stream, Tick Tuples
  54. Tick Tuples: Tick tuples are system-generated tuples that Storm can send to your bolt if you need to perform some action at a fixed interval.
  55. Tick Tuple based Micro Batching - Benefits
       • Takes advantage of system characteristics by batching events together
       • Adheres to processing latency needs by ensuring that batches are executed at fixed intervals
       • Prevents data loss by acknowledging events only after successful processing
       • Simple, elegant and easy-to-maintain code
  56. Micro Batching - Applicability
       • Target systems are more efficient with bulk transactions
       • Processing a group of events is more efficient than processing individual events
       • End-to-end event latency is not highly sensitive
  57. Micro Batching - Sample Code
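
The code shown on this slide is not part of the transcript. As a stand-in, here is a minimal sketch of tick-driven micro-batching that simulates the bolt outside Storm. Everything in it is an assumption for illustration: `TICK` represents a Storm tick tuple, `sink` is the bulk write to the target system, and `ack` is the callback that acknowledges a tuple. In a real topology the tick interval would come from the `topology.tick.tuple.freq.secs` setting.

```python
TICK = object()  # hypothetical marker standing in for a Storm tick tuple


class MicroBatchingBolt:
    """Buffers events and writes them in bulk on every tick or when the batch is full."""

    def __init__(self, sink, ack, max_batch_size=100):
        self._sink = sink            # bulk write to the target system
        self._ack = ack              # acknowledges a single tuple
        self._max = max_batch_size
        self._pending = []

    def execute(self, tup):
        if tup is TICK:
            self._flush()            # time-bound flush keeps latency within the tick interval
            return
        self._pending.append(tup)
        if len(self._pending) >= self._max:
            self._flush()            # size-bound flush keeps memory use predictable

    def _flush(self):
        if not self._pending:
            return
        batch, self._pending = self._pending, []
        self._sink(batch)            # if this raises, nothing below runs and nothing is acked
        for t in batch:
            self._ack(t)             # ack only after the bulk write succeeded
```

Acknowledging only after a successful write is what the benefits slide relies on to prevent data loss: in Storm, tuples that are never acked are eventually replayed by the spout, at the cost of possible duplicates.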
  58. Thank You! sheetal@hortonworks.com @sheetal_dolas
