Successfully reported this slideshow.
We use your LinkedIn profile and activity data to personalize ads and to show you more relevant ads. You can change your ad preferences anytime.

BDA307 Real-time Streaming Applications on AWS, Patterns and Use Cases


Published on

In this session, you will learn best practices for implementing simple to advanced real-time streaming data use cases on AWS. First, we’ll review decision points on near real-time versus real time scenarios. Next, we will take a look at streaming data architecture patterns that include Amazon Kinesis Analytics, Amazon Kinesis Firehose, Amazon Kinesis Streams, Spark Streaming on Amazon EMR, and other open source libraries. Finally, we will dive deep into the most common of these patterns and cover design and implementation considerations.

Published in: Technology
  • Be the first to comment

BDA307 Real-time Streaming Applications on AWS, Patterns and Use Cases

  1. 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Ryan Nienhuis, Senior Technical Product Manager July 26, 2017 Real-time Streaming Applications on AWS, Patterns and Use Cases
  2. 2. Batch Processing Hourly server logs Weekly or monthly bills Daily website clickstream Daily fraud reports It’s all about the pace Stream Processing Real-time metrics Real-time spending alerts/caps Real-time clickstream analysis Real-time detection
  3. 3. Simple pattern for streaming data Continuously creates data Continuously writes data to a stream Can be almost anything Data Producer Durably stores data Provides temporary buffer Supports very high throughput Streaming Storage Continuously processes data Cleans, prepares, & aggregates Transforms data to information Data Consumer Mobile client Amazon Kinesis Amazon Kinesis app
  4. 4. Why streaming? Decouple producers & consumers Persistent buffer Collect multiple streams Preserve ordering Parallel consumption Streaming MapReduce 1 4 1 3 1 2 1 1 Producer A 1 1 1 1 4 4 3 3 2 2 1 1 shard 1 shard 2 Consumer App 1 Counts Consumer App 2 Sums Producer B 4 3 2 1 Producer C 4 3 2 1 Producer D 4 3 2 1 Key = orange Key = blue Key = green Key = yellow 4 4 4 4 4 10 10 10 Amazon Kinesis stream
  5. 5. Important services for streaming data applications
  6. 6. Amazon Kinesis: streaming data made easy Services make it easy to capture, deliver, process streams on AWS Amazon Kinesis Streams Amazon Kinesis Analytics Amazon Kinesis Firehose
  7. 7. Amazon Kinesis Streams • Easy administration • Build real-time applications with framework of choice • Low cost
  8. 8. Amazon Kinesis Client Library • Build applications with Kinesis Client Library (KCL) in Java, Ruby, Python, or Node.js • Deploy on your Amazon EC2 instances • Three primary components: 1. Worker – Processing unit that maps to each application instance 2. Record Processor Factory – Creates the record processor 3. Record Processor – Processor unit that processes data from a shard in Amazon Kinesis Streams
  9. 9. State management with Kinesis Client Library • One record processor maps to one shard and processes data records from that shard • One worker maps to one or more record processors • Balances shard-worker associations when worker/instance counts change • Balances shard-worker associations when shards added/removed
  10. 10. Amazon Kinesis Firehose • Zero administration and seamless elasticity • Direct-to-data store integration • Continuous data transformations Capture and submit streaming data Analyze streaming data using your favorite BI tools Firehose loads streaming data continuously into Amazon S3, Amazon Redshift, and Amazon ES
  11. 11. Easily capture and deliver data • Write data to a Kinesis Firehose delivery stream from a variety of sources • Transform, encrypt, and/or compress data along the way • Buffer and aggregate data by time and size before it is written to destination • Elastically scale with no resource provisioning AWS Platform SDKs Mobile SDKs Kinesis Agent AWS IoT Amazon S3 Amazon Redshift Amazon Kinesis Firehose Amazon Elasticsearch Service
  12. 12. Amazon Kinesis Analytics • Apply SQL on streams • Build real-time, stream-processing applications • Easy scalability Connect to Kinesis Streams, Firehose delivery streams Run standard SQL queries against data streams Kinesis Analytics can send processed data to analytics tools so you can create alerts and respond in real time
  13. 13. Use SQL to build real-time applications Easily write SQL code to process streaming data Connect to streaming source Continuously deliver SQL results
  14. 14. AWS Lambda • Function code triggered from newly arriving events • Simple event-based processing of records • Serverless processing with low administration Social media stream is loaded into Kinesis in real time Lambda runs code that generates hashtag trend data and stores it in DynamoDB Social media trend data immediately available for business users to query LAMBDASTREAMS DYNAMODB
  15. 15. Amazon Elastic MapReduce (EMR) • Ingest streaming data from many sources • Easily configure clusters with latest versions of open source frameworks • Less underlying performance management Ingest streaming data through Amazon Kinesis Streams Your choice of stream data processing engine, Spark Streaming, or Apache Flink EMRSTREAMS S3 Send processed data to S3, HDFS, or a custom destination using an open source connector
  16. 16. Streaming app patterns & use cases
  17. 17. Streaming Ingest-Transform-Load • Ingest and store raw data at high volume • Atomic transformations • Simple data enrichment Continuous Metric Generation • Windowed analytics (count X over 5 minutes) • Event correlation like sessionization • Visualization Actionable Insights • Act on the data by triggering events or alerts • Machine learning • Real-time feedback loops Stream processing use cases Use Case Characteristics
  18. 18. Streaming Ingest-Transform-Load Use Cases • Ingest raw log data and store in Amazon S3 for later analysis • Ingest and clean clickstream data, persist to a data warehouse • Capture and validate IoT sensor, device telemetry data Key Requirements • Durably ingest and buffer large volumes of small events • Perform simple transformations • Persist and store data efficiently
  19. 19. Deep dive: analyzing VPC Flow Logs • VPC Flow Logs enable you to capture information about the IP traffic going to and from network interfaces in your VPC • Solution provides both cost-effective storage of logs as well as near real-time analysis VPC subnet Amazon CloudWatch Logs AWS Lambda AmazonKinesis Firehose Amazon S3 bucket Amazon Athena Amazon QuickSight Forward VPC Flow Logs with subscription Aggregation and transformation Near real-time queries and visualization
  20. 20. Best practices for VPC Flow Log ingestion Buffering data • Amazon Kinesis makes it easy to aggregate data in one or more steps • Further buffering on producer is sometimes used for efficiency Reduce latency • VPC Flow Logs buffers data locally, which adds latency • Use Amazon Kinesis Streams or Firehose directly to buffer if you need to react closer to real time
  21. 21. Best practices for VPC Flow Log ingestion Storage optimization • Data is delivered and stored in GZIP format in Amazon S3 • Amazon Athena is faster with Apache Parquet or partitioning • Perform secondary transform using: • Spark cluster on Amazon EMR to transform to Parquet • AWS Lambda function to apply different partitions as data arrives on S3
  22. 22. Continuous metric generation Examples • Operational metrics and dashboards (e.g., API error counts) • Massively multiplayer online game (MMOG) live dashboard (e.g., top 10 players) • Clickstream analytics like impressions and page views Key Requirements • Produce accurate results with late and out-of-order data • Combine results with historical data • Quickly provide data to tech and non-tech end users
  23. 23. Deep dive: time-series analytics • Ingest IoT data via MQTT and forward to Amazon Kinesis Streams • Compute important metrics like average temperature every 10 seconds with Amazon Kinesis Analytics • Persist time-series analytics to RDS MySQL database that are immediately ready to serve via queries and BI tools IoT sensors AWS IoT RDS MySQL DB instance Amazon Kinesis Streams Amazon Kinesis Streams Amazon Kinesis Analytics AWS Lambda function Compute average temperature every 10secIngest sensor data Persist time-series analytics to database
  24. 24. Best practices for time-series analytics Data arrives out-of-order data • Data is arriving from sensors intermittingly and often out of order • Sort the data using: • ORDER BY clause in Kinesis Analytics • Pre-process data using AWS Lambda Event time versus process time • Real-time use cases – processing time is OK • Data is stored for usage later – aggregate data using event time as a key
  25. 25. Responsive analysis Examples • Recommendation engines for retail websites • Device operational intelligence and alerts • Detecting trends and anomalies on user behavior Key Requirements • Ability to notify users and machines with low latency • Long-running, stateful operations over streams
  26. 26. Deep dive: real-time anomaly detection • Ingest data from website through API Gateway and Amazon Kinesis Streams • Use Amazon Kinesis Analytics to produce an anomaly score for each data point and identify trends in data • Send users and machines notifications through Amazon SNS Amazon API Gateway Amazon Kinesis Streams Amazon Kinesis Streams Amazon Kinesis Analytics Lambda function Amazon SNS topic email notification users SMS notification SMS Ingest clickstream data Detect anomalies& take action Notify users
  27. 27. Best practices for anomaly detection Scalable ingest and processing is important • Real-time apps become significantly less valuable when you fall behind • Architecture is highly scalable but must be closely monitored Your data is different • Different data in different domains produce different results in ML algorithms • Algorithms are a tool for scientists, not a replacement • Must understand your data, its patterns (logarithmic, circadian, etc.), and algorithmic results to make decisions
  28. 28. Where to go next?
  29. 29. Try these use cases yourself Many variations of these use cases have sample code on the AWS Big Data Blog. Follow the blog! Some good examples: • Analyzing VPC Flow Logs with Amazon Kinesis Firehose, Amazon Athena, and Amazon QuickSight • Real-time Clickstream Anomaly Detection with Amazon Kinesis Analytics • Writing SQL on Streaming Data with Amazon Kinesis Analytics | Part 1, Part 2
  30. 30. Lots of customer examples 1 billion events/wk from connected devices | IoT 17 PB of game data per season | Entertainment 80 billion ad impressions/day, 30 ms response time | Ad Tech 100 GB/day clickstreams from 250+ sites | Enterprise 50 billion ad impressions/day sub-50 ms responses | Ad Tech 10 million events/day | Retail Amazon Kinesis as Databus - Migrate from Kafka to Kinesis | Enterprise Funnel all production events through Amazon Kinesis
  31. 31. Integrate with your current solutions
  32. 32. Get help from partner systems integrators
  33. 33. Thank you!