Serverless Real Time Analytics

Presentation given during Start Up Day Hong Kong on September 15, 2017 within the Architecture track

  1. 1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Iolaire McKinnon, AWS Professional Services 2017-09-15 Serverless Real Time Analytics AWS Startup Day – Hong Kong
  2. 2. Three types of data-driven development Retrospective analysis and reporting Here-and-now real-time processing and dashboards Predictions to enable smart applications
  3. 3. The diminishing value of data Recent data is highly valuable • If you act on it in time • Perishable Insights (M. Gualtieri, Forrester) Old + Recent data is more valuable • If you have the means to combine them
  4. 4. Most data is produced continuously Mobile apps Web clickstream Application logs Metering records IoT sensors Smart buildings [Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] client denied by server configuration: /export/home/live/ap/htdocs/test
  5. 5. Data Processing: Data → Ingest/Collect → Store → Process/Analyze → Consume/Visualize → Answers & Insights. Trade-offs: Time to answer (Latency), Cost
  6. 6. AWS Data Pipeline AWS Database Migration Service EMR Analyze Amazon Glacier S3 Store Collect Amazon Kinesis Direct Connect Amazon Machine Learning Amazon Redshift DynamoDB AWS IoT AWS Snowball QuickSight Amazon Athena EC2 Amazon Elasticsearch Service Lambda
  7. 7. No one tool rules them all
  8. 8. I want to …. 1. Convert RAW JSON to CSV 2. Aggregate (min, max, avg) in 1 minute 3. Real Time Anomaly Detection Alert
  9. 9. Amazon Kinesis Platform Overview
  10. 10. Amazon Kinesis makes it easy to work with real-time streaming data Amazon Kinesis Streams • For technical developers • Collect and stream data for ordered, replayable, real-time processing Amazon Kinesis Firehose • For all developers, data scientists • Easily load massive volumes of streaming data into Amazon S3, Amazon Redshift, Amazon Elasticsearch Service Amazon Kinesis Analytics • For all developers, data scientists • Easily analyze data streams using standard SQL queries
  11. 11. Amazon Kinesis Streams • Reliably ingest and durably store streaming data at low cost • Build custom real-time applications to process streaming data
  12. 12. Sending & Reading Data from Kinesis Streams AWS SDK LOG4J Flume Fluentd Get* APIs Kinesis Client Library + Connector Library Apache Storm Amazon Elastic MapReduce Sending Consuming AWS Mobile SDK Kinesis Producer Library AWS Lambda Apache Spark
  13. 13. Amazon Kinesis Stream Data Sources App.4 [Machine Learning] AWSEndpoint App.1 [Aggregate & De-Duplicate] Data Sources Data Sources Data Sources App.2 [Metric Extraction] AmazonS3 DynamoDB Redshift App.3 [Sliding Window Analysis] Data Sources Availability Zone Shard 1 Shard 2 Shard N Availability Zone Availability Zone Kinesis
  14. 14. Sensors S3 bucket Data Ingestion, Processing, and Store Amazon Kinesis Streams Real Time Processing Application
  15. 15. What if I don’t need to write an application to store the data?
  16. 16. Amazon Kinesis Firehose • Reliably ingest and deliver batched, compressed, and encrypted data to S3, Amazon Redshift, and Amazon Elasticsearch Service (Amazon ES) • Point-and-click setup with zero administration and seamless elasticity
  17. 17. Kinesis Firehose Delivery Method to S3 • Single delivery stream delivers to single S3 bucket • Buffer size / interval values • Buffer size: 1 MB to 128 MB, or • Buffer interval: 60 to 900 seconds • Firehose concatenates records into a single larger object • (Optional) Compression • Compress records before delivering them to your S3 bucket • GZIP, ZIP, SNAPPY • (Optional) Encryption • Encrypted in S3 bucket using a KMS master key
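
The buffering rule on this slide (deliver once either the size or the interval threshold is reached) can be sketched with a toy in-memory model. This is not the Firehose API: `DeliveryBuffer`, its parameters, and the caller-supplied clock are hypothetical names for illustration, and the thresholds are tiny so the behaviour is easy to follow.

```python
class DeliveryBuffer:
    """Toy model of size-or-interval buffering (hypothetical, not the Firehose API)."""

    def __init__(self, max_bytes, max_seconds):
        self.max_bytes = max_bytes
        self.max_seconds = max_seconds
        self.records = []
        self.size = 0
        self.opened_at = None  # time the first buffered record arrived

    def add(self, record, now):
        """Buffer one record; return a concatenated object if a threshold is hit, else None."""
        if self.opened_at is None:
            self.opened_at = now
        self.records.append(record)
        self.size += len(record)
        if self.size >= self.max_bytes or now - self.opened_at >= self.max_seconds:
            return self.flush()
        return None

    def flush(self):
        """Concatenate buffered records into one larger object and reset the buffer."""
        obj = b"".join(self.records)
        self.records, self.size, self.opened_at = [], 0, None
        return obj


buf = DeliveryBuffer(max_bytes=10, max_seconds=60)
first = buf.add(b"abcd\n", now=0.0)    # 5 bytes buffered, nothing delivered yet
second = buf.add(b"efgh\n", now=1.0)   # 10 bytes reached, buffer is delivered
third = buf.add(b"x\n", now=2.0)       # a new buffer opens at t=2
fourth = buf.add(b"y\n", now=62.0)     # 60 s elapsed since the buffer opened
```

One simplification to keep in mind: this model only checks the interval when a record arrives, whereas a managed service would presumably flush on a timer even when the stream is idle.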
  18. 18. Amazon Kinesis Firehose vs. Amazon Kinesis Streams Amazon Kinesis Streams is for use cases that require custom processing, per incoming record, with sub-1 second processing latency, and a choice of stream processing frameworks. Amazon Kinesis Firehose is for use cases that require zero administration, ability to use existing analytics tools based on Amazon S3, Amazon Redshift and Amazon Elasticsearch, and a data latency of 60 seconds or higher.
  19. 19. Sensors S3 bucket Data Ingestion and Store Amazon Kinesis Firehose
  20. 20. What About Real Time Data Analytics?
  21. 21. Processing real-time, streaming data • Durable • Continuous • Fast • Correct • Reactive • Reliable What are the key requirements? Ingest Transform Analyze React Persist
  22. 22. Amazon Kinesis Analytics • Interact with streaming data in real time using SQL • Build fully managed and elastic stream processing applications that process data for real-time visualizations and alarms
  23. 23. Use SQL to build real-time applications Easily write SQL code to process streaming data Connect to streaming source Continuously deliver SQL results
  24. 24. Sensors Real Time Data Ingestion and Analytics Amazon Kinesis Analytics Process Data using SQL Source Stream Amazon Kinesis Streams
  25. 25. How are data mapped to a schema? Amazon Kinesis stream Amazon Kinesis Analytics { "sensorId": 4, "eventTimeStamp": "2017-09-04T15:01:55+08:00", "currentTemp": 31, "status": "OK" } sensorId eventTimeStamp currentTemp status 4 2017-09-04… 31 OK Schema is Inferred (editable) Source data for Amazon Kinesis Analytics Data Ingested in JSON Format
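
To make the JSON-to-columns mapping on this slide concrete, here is a minimal sketch that derives column names and rough types from the sample record. The `infer_schema` helper and its type names are assumptions for illustration; the actual inference in Kinesis Analytics may choose different types (for example, a timestamp type for `eventTimeStamp`).

```python
import json


def infer_schema(record_json):
    """Return (column, type name) pairs guessed from a single JSON record."""
    # Illustrative type names only; not the service's real inference rules.
    type_names = {bool: "BOOLEAN", int: "INTEGER", float: "DOUBLE", str: "VARCHAR"}
    record = json.loads(record_json)
    return [(key, type_names.get(type(value), "VARCHAR")) for key, value in record.items()]


sample = (
    '{"sensorId": 4, "eventTimeStamp": "2017-09-04T15:01:55+08:00", '
    '"currentTemp": 31, "status": "OK"}'
)
schema = infer_schema(sample)           # column names and guessed types
row = list(json.loads(sample).values())  # the same record as a row of values
```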
  26. 26. How is streaming data accessed with SQL? STREAM • Analogous to a TABLE • Represents continuous data flow PUMP • Continuous INSERT query • Inserts data from one in-application stream to another PUMP SOURCE_STREAM DESTINATION_STREAM
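
As a rough mental model of the STREAM and PUMP concepts, a stream can be pictured as a queue and a pump as a loop that reads from one queue, applies a query, and inserts into another. The sketch below is a Python analogy, not Kinesis Analytics code; `pump` and the `select` callback are hypothetical, and a real pump runs continuously rather than draining once.

```python
from collections import deque


def pump(source, destination, select):
    """Drain the records currently in `source`, apply `select`, insert into `destination`."""
    while source:
        row = select(source.popleft())
        if row is not None:  # None means the query filtered the row out
            destination.append(row)


source_stream = deque([
    {"sensorId": 1, "status": "OK"},
    {"sensorId": 2, "status": "FAIL"},
])
destination_stream = deque()
pump(source_stream, destination_stream, lambda r: {"sensor_id": r["sensorId"]})
```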
  27. 27. I want to …. 1. Convert RAW JSON to CSV 2. Aggregate (min, max, avg) in 1 minute 3. Real Time Anomaly Detection Alert
  28. 28. How do we model our data? DESTINATION_STREAM • sensor_id • event_timestamp • current_temp • status SOURCE_STREAM • sensorId • eventTimeStamp • currentTemperature • status Amazon Kinesis stream PUMP_1 SELECT STREAM "sensorId", CAST("eventTimeStamp" AS TIMESTAMP), "currentTemperature", "status" FROM "SOURCE_STREAM"; Amazon Kinesis Firehose In-Application Stream SCHEMA=CSV
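
The effect of PUMP_1 with a CSV output schema can be approximated outside the service as follows. This is a sketch under assumptions: the column mapping is read off the two schemas on the slide, the timestamp parse stands in for the `CAST(... AS TIMESTAMP)`, and the exact CSV formatting the service emits may differ.

```python
import csv
import io
import json
from datetime import datetime

# Source field -> destination column, following the slide's two schemas.
COLUMN_MAP = {
    "sensorId": "sensor_id",
    "eventTimeStamp": "event_timestamp",
    "currentTemperature": "current_temp",
    "status": "status",
}


def json_to_csv_line(raw):
    """Convert one raw JSON record into one CSV line in destination column order."""
    record = json.loads(raw)
    # Parse the ISO 8601 string, mirroring the CAST(... AS TIMESTAMP) step.
    record["eventTimeStamp"] = datetime.fromisoformat(record["eventTimeStamp"])
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerow([record[f] for f in COLUMN_MAP])
    return out.getvalue()


raw = (
    '{"sensorId": 4, "eventTimeStamp": "2017-09-04T15:01:55+08:00", '
    '"currentTemperature": 31, "status": "OK"}'
)
header = ",".join(COLUMN_MAP.values())
line = json_to_csv_line(raw)
```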
  29. 29. Sensors S3 bucket Real Time Data Ingestion and Analytics Amazon Kinesis Analytics RAW CSV Data Process Data using SQL Source Stream Destination Stream RAW JSON to CSV Amazon Kinesis Firehose Amazon Kinesis Streams
  30. 30. I want to …. 1. Convert RAW JSON to CSV 2. Aggregate (min, max, avg) in 1 minute 3. Real Time Anomaly Detection Alert
  31. 31. Windowing Concepts • Windows can be tumbling or sliding • Windows are fixed length • Output record will have the timestamp of the end of the window [Figure: events on a time axis (t1 to t6) grouped into Window1, Window2 and Window3; the aggregate function (Sum) emits output events 18 and 14]
  32. 32. Comparing Types of Windows • Output created at the end of the window • The output of the window will be single event based on the aggregate function used Tumbling window Aggregate per time interval Sliding window Windows constantly re-evaluated
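
The difference between the two window types can be illustrated with a small sum over made-up events. Both functions are hypothetical helpers written for this comparison; they operate on an in-memory list rather than a live stream.

```python
def tumbling_sums(events, width):
    """Sum values per fixed, non-overlapping window; one output per window, stamped with its end."""
    totals = {}
    for t, value in events:
        window_end = (t // width + 1) * width
        totals[window_end] = totals.get(window_end, 0) + value
    return sorted(totals.items())


def sliding_sums(events, width):
    """Re-evaluate the sum on every event over the trailing `width` time units."""
    out = []
    for i, (t, _) in enumerate(events):
        total = sum(v for (s, v) in events[: i + 1] if t - width < s <= t)
        out.append((t, total))
    return out


events = [(0, 1), (1, 5), (2, 4), (3, 2), (4, 6), (5, 8)]  # (time, value), made-up data
tumbling = tumbling_sums(events, width=3)  # one result per 3-unit interval
sliding = sliding_sums(events, width=3)    # one result per incoming event
```

The tumbling version emits two outputs for six events, while the sliding version emits one output per event because the window is re-evaluated each time.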
  33. 33. How do we model our data? AGGREGATE_STREAM • event_timestamp • highest_temp • lowest_temp • avg_temp SOURCE_STREAM • sensorId • eventTimeStamp • currentTemperature • status Amazon Kinesis stream PUMP_2 SELECT STREAM FLOOR("SOURCE_STREAM".ROWTIME TO MINUTE) AS "eventTimeStamp", MAX("currentTemperature") AS "highest_temp", MIN("currentTemperature") AS "lowest_temp", AVG("currentTemperature") AS "avg_temp" FROM "SOURCE_STREAM" GROUP BY FLOOR("SOURCE_STREAM".ROWTIME TO MINUTE); Amazon Kinesis Firehose In-Application Stream SCHEMA=CSV
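
For readers who want to check what PUMP_2 computes, the same one-minute min/max/avg aggregation can be reproduced offline. This sketch assumes the grouping key is the reading's own timestamp; the SQL on the slide groups by `ROWTIME` instead, so results can differ when events arrive late. `aggregate_per_minute` and the sample readings are made up for illustration.

```python
from collections import defaultdict
from datetime import datetime


def aggregate_per_minute(readings):
    """Group (timestamp, temperature) pairs by minute and return max, min, avg per group."""
    groups = defaultdict(list)
    for ts, temp in readings:
        # Truncate to the minute, analogous to FLOOR(... TO MINUTE).
        groups[ts.replace(second=0, microsecond=0)].append(temp)
    return [
        {
            "event_timestamp": minute,
            "highest_temp": max(temps),
            "lowest_temp": min(temps),
            "avg_temp": sum(temps) / len(temps),
        }
        for minute, temps in sorted(groups.items())
    ]


readings = [
    (datetime(2017, 9, 4, 15, 1, 5), 31),
    (datetime(2017, 9, 4, 15, 1, 40), 35),
    (datetime(2017, 9, 4, 15, 2, 10), 30),
]
result = aggregate_per_minute(readings)
```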
  34. 34. Sensors S3 bucket Real Time Data Ingestion and Analytics Amazon Kinesis Analytics Amazon Kinesis Firehose RAW CSV Data Process Data using SQL Source Stream Destination Stream Aggregate Data in CSV RAW JSON to CSV Aggregate Data In CSV Amazon Kinesis Firehose Amazon Kinesis Streams
  35. 35. How to Consume the Data?
  36. 36. Serverless Query Processing • Serverless query service for querying data in S3 using standard SQL with no infrastructure to manage • No data loading required; query directly from Amazon S3 • Use standard ANSI SQL queries with support for joins, JSON, and window functions • Support for multiple data formats, including text, CSV, TSV, JSON, Avro, ORC, Parquet • Pay per query, based on data scanned, only when you’re running queries. If you compress your data, you pay less and your queries run faster Amazon Athena
  37. 37. Familiar Technologies Under the Covers • Presto: Used for SQL Queries; In-memory distributed query engine; ANSI-SQL compatible with extensions • Hive: Used for DDL functionality; Complex data types; Multitude of formats; Supports data partitioning
  38. 38. Sensors S3 bucket Consume Data with Amazon Athena Amazon Kinesis Analytics Amazon Kinesis Firehose Amazon Athena RAW CSV Data Aggregate Data in CSV Source Stream Amazon Kinesis Streams Process Data using SQL Amazon Kinesis Firehose Ad-Hoc Query
  39. 39. What About Visualization?
  40. 40. Business Intelligence • Fast and cloud-powered • Easy to use, no infrastructure to manage • Scales to hundreds of thousands of users • Quick calculations with SPICE (Super-fast, Parallel, In-memory optimized Calculation Engine) • 1/10th the cost of legacy BI software Amazon QuickSight
  41. 41. Sensors S3 bucket Visualize your data with Amazon QuickSight Amazon Kinesis Analytics Amazon Kinesis Firehose Amazon Athena RAW CSV Data Aggregate Data in CSV Source Stream Amazon Kinesis Streams Process Data using SQL Amazon Kinesis Firehose Amazon QuickSight Ad-Hoc Query Visualization
  42. 42. Alternative - Feed real-time dashboards • Validate and transform raw data, and then process to calculate meaningful statistics • Send processed data downstream for visualization in BI and visualization services Amazon QuickSight Analytics Amazon ES Amazon Redshift Amazon RDS Streams Firehose
  43. 43. I want to …. 1. Convert RAW JSON to CSV 2. Aggregate (min, max, avg) in 1 minute 3. Real Time Anomaly Detection Alert
  44. 44. How do we model our data? FAIL_STREAM • sensorId • eventTimeStamp • currentTemperature • status SOURCE_STREAM • sensorId • eventTimeStamp • currentTemperature • status Amazon Kinesis stream PUMP_3 SELECT STREAM * FROM "SOURCE_STREAM" WHERE "status" = 'FAIL'; In-Application Stream Amazon Kinesis stream
  45. 45. Detecting Data Anomalies - Random Cut Forest The function detects anomalies by scoring data flowing through a dynamic data stream.
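
The idea of scoring each record as it flows through the stream can be illustrated with a much simpler stand-in. The sketch below is not Random Cut Forest: it swaps in a trailing-window z-score purely to show the score-then-threshold pattern, and the window size, threshold, and readings are arbitrary assumptions.

```python
import statistics
from collections import deque


def anomaly_scores(values, window=5):
    """Score each value by its distance from the trailing-window mean, in standard deviations."""
    history = deque(maxlen=window)
    scores = []
    for v in values:
        if len(history) == window:
            mean = statistics.mean(history)
            stdev = statistics.pstdev(history)
            scores.append(abs(v - mean) / stdev if stdev > 0 else 0.0)
        else:
            scores.append(0.0)  # not enough history to score yet
        history.append(v)
    return scores


temps = [30, 31, 30, 32, 31, 75, 31]  # made-up readings with one spike
scores = anomaly_scores(temps)
alerts = [t for t, s in zip(temps, scores) if s > 3.0]  # arbitrary threshold
```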
  46. 46. Sensors Amazon Kinesis Stream S3 bucket Real Time Detection Amazon Kinesis Analytics Amazon Kinesis Firehose Amazon Athena Amazon QuickSight Amazon Kinesis Stream AWS Lambda Amazon SNS email notification Amazon Kinesis Firehose RAW CSV Data Aggregate Data in CSV Source Stream Process Data using SQL Ad-Hoc Query Visualization Anomaly Data Take Action Notification Service
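
The "Take Action" step, where AWS Lambda reads anomaly records from a Kinesis stream and hands them to a notification service, might look roughly like this. The sketch assumes the anomaly records are JSON with `sensorId` and `status` fields; `publish` is a hypothetical injected callable standing in for an SNS publish call, so the example runs without AWS credentials.

```python
import base64
import json


def handler(event, context, publish=print):
    """Decode Kinesis records from a Lambda event and publish one alert per record."""
    alerts = []
    for record in event["Records"]:
        # Lambda delivers Kinesis record payloads base64-encoded.
        payload = json.loads(base64.b64decode(record["kinesis"]["data"]))
        message = f"Sensor {payload['sensorId']} reported status {payload['status']}"
        publish(message)  # stand-in for the real notification call
        alerts.append(message)
    return alerts


def make_event(payload):
    """Build a minimal Kinesis-style Lambda event for local testing."""
    data = base64.b64encode(json.dumps(payload).encode()).decode()
    return {"Records": [{"kinesis": {"data": data}}]}


sent = []
result = handler(make_event({"sensorId": 4, "status": "FAIL"}), None, publish=sent.append)
```

Injecting `publish` keeps the decoding logic testable on its own; in a deployed function it would be replaced by a call to the notification service.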
  47. 47. I want to …. 1. Convert RAW JSON to CSV 2. Aggregate (min, max, avg) in 1 minute 3. Real Time Anomaly Detection Alert 4. Real Time Prediction
  48. 48. Sensors Amazon Kinesis Stream S3 bucket Real Time Prediction Amazon Kinesis Analytics Amazon Kinesis Firehose Amazon Athena Amazon QuickSight Amazon Kinesis Stream AWS Lambda Amazon Kinesis Firehose RAW CSV Data Aggregate Data in CSV Source Stream Process Data using SQL Ad-Hoc Query Visualization Take Action Amazon Machine Learning Prediction
  49. 49. Thank you! Iolaire McKinnon @iolaire
