ABD217_From Batch to Streaming

  1. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. AWS re:INVENT From Batch to Streaming: How Amazon Flex Uses Real-time Analytics to Deliver Packages on Time. November 28, 2017
  2. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Agenda • Real-time streaming data overview • Streaming data services • Benefits of streaming analytics • Batch to streaming best practices • How Amazon Flex moved from batch to streaming
  3. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. What is batch processing? Execution of a series of jobs in a program on a computer without manual intervention - Wikipedia • Data is collected over a period of time • Process and analyze on a schedule • Combine several processes to obtain final result
  4. Most data is produced continuously Mobile apps Web clickstream Application logs Metering records IoT sensors Smart buildings
  5. The diminishing value of data Recent data is highly valuable • If you act on it in time • Perishable insights (M. Gualtieri, Forrester) Old + recent data is more valuable • If you have the means to combine them
  6. Processing real-time, streaming data • Durable • Continuous • Fast • Correct • Reactive • Reliable What are the key requirements? Collect Transform Analyze React Persist
  7. Amazon Kinesis makes it easy to work with real-time streaming data Kinesis Streams • For technical developers • Collect and stream data for ordered, replayable, real-time processing Kinesis Firehose • For all developers, data scientists • Easily load massive volumes of streaming data into Amazon S3, Redshift, ElasticSearch Kinesis Analytics • For all developers, data scientists • Easily analyze data streams using standard SQL queries • Compute analytics in real time
  8. Amazon Kinesis Streams • Reliably ingest and durably store streaming data at low cost • Build custom real-time applications to process streaming data • Use your stream-processing framework of choice
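To make the "custom real-time applications" point concrete, here is a minimal producer sketch using boto3; the stream name, partition key, and event shape are illustrative assumptions rather than details from the talk:

```python
import json
import boto3

kinesis = boto3.client("kinesis")

def publish_event(event: dict) -> None:
    """Write one telemetry event to a Kinesis stream (names here are hypothetical)."""
    kinesis.put_record(
        StreamName="flex-telemetry",            # illustrative stream name
        Data=json.dumps(event).encode("utf-8"),
        PartitionKey=event["deviceId"],         # keeps one device's events ordered on one shard
    )
```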
  9. Amazon Kinesis Firehose • Reliably ingest and deliver batched, compressed, and encrypted data to S3, Redshift, and Elasticsearch • Point and click setup with zero administration and seamless elasticity • Managed stream-processing consumer
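A similarly minimal sketch for Firehose; the delivery stream name is an assumption, and Firehose itself handles the batching, compression, and delivery to S3/Redshift/Elasticsearch described above:

```python
import json
import boto3

firehose = boto3.client("firehose")

def deliver_batch(events: list[dict]) -> None:
    """Hand a batch of records to Firehose; it buffers and delivers them downstream."""
    firehose.put_record_batch(
        DeliveryStreamName="flex-telemetry-to-s3",  # illustrative delivery stream name
        Records=[{"Data": (json.dumps(e) + "\n").encode("utf-8")} for e in events],
    )
```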
  10. Amazon Kinesis Analytics • Interact with streaming data in real time using SQL • Build fully managed and elastic stream processing applications that process data for real-time visualizations and alarms
  11. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Benefits of streaming analysis Immediate results • Real-time aggregations • Filtering • Anomaly detection Reduced complexity • Fewer scheduled jobs to manage • Kinesis is a fully managed solution Scalable • Enables parallel processing • Horizontally scales based on your ingest rate
  12. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Batch to streaming best practices Migrate incrementally • Don’t boil the ocean • Begin by streaming data in parallel to existing batch processes • Persist streaming data into durable storage, like Amazon S3 • Add in streaming analysis results to replace batch analysis Application databases Data warehouse Data producer Amazon Kinesis ETL ETL Amazon S3 Streaming data
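As a sketch of the "persist streaming data into durable storage" step, this creates a Firehose delivery stream that lands raw events in S3 alongside the existing batch path; the names, ARNs, and buffering settings are illustrative, not the configuration from the talk:

```python
import boto3

firehose = boto3.client("firehose")

# Create a delivery stream that writes raw events to S3 in parallel with existing batch ETL.
# Firehose appends a YYYY/MM/DD/HH/ date prefix by default, which also gives the
# date partitioning that simplifies retention policies.
firehose.create_delivery_stream(
    DeliveryStreamName="flex-telemetry-to-s3",
    DeliveryStreamType="DirectPut",
    ExtendedS3DestinationConfiguration={
        "RoleARN": "arn:aws:iam::123456789012:role/firehose-to-s3",   # illustrative
        "BucketARN": "arn:aws:s3:::flex-telemetry-raw",                # illustrative
        "Prefix": "raw/",
        "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 60},
        "CompressionFormat": "GZIP",
    },
)
```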
  13. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Batch to streaming best practices Perform ITL rather than ETL • ITL: Ingest-Transform-Load • ETL: Extract-Transform-Load • Transform data in near-real time rather than a scheduled job • Enrich data in near-real time • Persist transformed and/or enriched data Data producer Amazon Kinesis Firehose Raw streaming data AWS Lambda function Amazon S3 Transformed data Transform data Enrichment source data Raw data Transformed and/or enriched data
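A minimal sketch of the transform step as a Firehose data-transformation Lambda; the enrichment itself (a processing timestamp) is a placeholder, while the recordId/data/result envelope is the shape Firehose expects back from a transform function:

```python
import base64
import json
from datetime import datetime, timezone

def handler(event, context):
    """Firehose data-transformation Lambda: enrich each record in near-real time."""
    output = []
    for record in event["records"]:
        payload = json.loads(base64.b64decode(record["data"]))
        payload["processedAt"] = datetime.now(timezone.utc).isoformat()  # example enrichment only
        output.append({
            "recordId": record["recordId"],
            "result": "Ok",
            "data": base64.b64encode((json.dumps(payload) + "\n").encode("utf-8")).decode("utf-8"),
        })
    return {"records": output}
```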
  14. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Batch to streaming best practices Aggregate upon arrival • Continuously write raw data to persistent data store for archival and other analysis • Aggregate in real time when window size < 1 hour • Write aggregated data to persistent data store for immediate value Amazon Kinesis Firehose Raw streaming data Amazon S3 Raw data Aggregated data Amazon Kinesis Analytics Aggregate Results Data producer
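A minimal sketch of the kind of tumbling-window SQL a Kinesis Analytics application runs for "aggregate upon arrival"; the column names and the one-minute window are assumptions, and the SQL is held in a Python constant here only because the deck itself contains no code:

```python
# Kinesis Analytics application code (streaming SQL), shown as a Python constant.
# Counts events per type in one-minute tumbling windows; column names are illustrative.
AGGREGATION_SQL = """
CREATE OR REPLACE STREAM "AGGREGATED_STREAM" (event_type VARCHAR(64), event_count INTEGER);

CREATE OR REPLACE PUMP "AGGREGATION_PUMP" AS
  INSERT INTO "AGGREGATED_STREAM"
  SELECT STREAM "event_type", COUNT(*) AS event_count
  FROM "SOURCE_SQL_STREAM_001"
  GROUP BY "event_type",
           STEP("SOURCE_SQL_STREAM_001".ROWTIME BY INTERVAL '1' MINUTE);
"""
```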
  15. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Batch to streaming example
  16. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Brandon Smith • Senior software engineer • Worked at Amazon for 12 years in Kindle, AWS, and now Last Mile Delivery • Currently working on Amazon Flex
  17. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. • Amazon delivery app (Android/iOS) • Crowd-sourced model launched in 30+ U.S. cities • Used by Amazon Logistics worldwide
  18. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. • Deliveries for Amazon.com, Prime Now, Amazon Fresh, restaurants, grocery stores • Millions of packages per year
  19. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. The problem • Collecting, processing, and storing telemetry data • Telemetry data = remote measurements • Includes metrics, crashes, logs, sensor data, clickstream data, etc.
  20. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. The goal • Understand what’s happening in the field • Analyze all the data and make performance optimizations • Focus our time on improving the app and the delivery flow
  21. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Use cases
  22. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Use case 1: Alarming • We want to know within minutes if there are problems • Example: If the delivery count drops below our expected/historical value, we want to alarm
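One way to express this kind of alarm with boto3; the metric name, namespace, threshold, and SNS topic are illustrative, not the team's actual configuration:

```python
import boto3

cloudwatch = boto3.client("cloudwatch")

# Alarm if the deliveries reported over the last 5 minutes drop below an expected floor.
cloudwatch.put_metric_alarm(
    AlarmName="flex-delivery-count-low",
    Namespace="Flex/Telemetry",                    # illustrative namespace
    MetricName="DeliveryCount",                    # illustrative metric
    Statistic="Sum",
    Period=300,
    EvaluationPeriods=1,
    Threshold=100,                                 # illustrative expected/historical floor
    ComparisonOperator="LessThanThreshold",
    TreatMissingData="breaching",                  # receiving no data at all is also a problem
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:flex-oncall"],  # illustrative topic
)
```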
  23. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Use case 2: Troubleshooting • Logs and crashes published to Amazon CloudWatch Logs in near-real time • Can filter and search to troubleshoot issues
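A small sketch of the filter-and-search step with boto3; the log group name and filter pattern are assumptions:

```python
import time
import boto3

logs = boto3.client("logs")

# Pull the last 15 minutes of crash lines from the app's log group.
response = logs.filter_log_events(
    logGroupName="/flex/app/crashes",               # illustrative log group
    filterPattern="FATAL",                          # illustrative filter
    startTime=int((time.time() - 15 * 60) * 1000),  # CloudWatch Logs uses epoch milliseconds
)
for event in response["events"]:
    print(event["timestamp"], event["message"])
```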
  24. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Use case 3: Dashboards • We can write SQL, generate reports, and create visualizations • But we really want real-time dashboards instead of daily reports Daily reports Real-time dashboards
  25. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Use case 4: Releases • Deploying new app versions and monitoring adoption in real time • Release new code smoothly and with confidence
  26. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Use case 5: Sharing data • Consumers get notifications of new data in real time • Consumers can join their data with other data in the data lake S3 bucket Data lake
  27. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Use case 6: Deeper analytics • Look at the stream of data and the historical data • Build ML models, create predictions, detect anomalies
  28. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. How did we build it?
  29. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Getting from batch to streaming • To solve our use cases, we had to incrementally improve our system • We evolved from a batch-based system to a stream-based system • Let’s walk through the iterations
  30. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. • Collect metrics and send to an existing metrics service • ETL jobs to load data into a big Oracle Data Warehouse Iteration 1: Use existing systems Existing metrics service App DW ETL Data collection
  31. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 1. Batch process with 24-hour delay 2. Fixed, inflexible DB schema 3. Analysis difficult and slow via SQL Iteration 1: Use existing systems Existing metrics service App DW ETL Data collection
  32. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. • Collect metrics in the app using the Amazon Mobile Analytics SDK, which automatically loads data into Redshift Iteration 2: Use AWS App CloudFormation ETL system Data collection
  33. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 1. Batch process, delay reduced from 24 hours to 2 hours 2. Fixed, inflexible DB schema 3. Analysis difficult and slow via SQL Iteration 2: Use AWS App CloudFormation ETL system Data Collection
  34. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. • Add shared configuration that is used in the app and automatically updates the Redshift schema Iteration 3: Automated DB schema App CloudFormation ETL system Data collection Schema config
  35. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 1. Batch process with 2-hour delay 2. DB schema now auto-updating instead of fixed and inflexible 3. Analysis still difficult and slow via SQL Iteration 3: Automated DB schema App Schema config CloudFormation ETL system Data collection
  36. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. • Introduce a Kinesis stream and Kinesis Firehose to publish to Redshift • Partition data by date to simplify data retention policies Iteration 4: Use Streams App Data collection Via Pinpoint Schema config
  37. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 1. Streaming process (formerly batch) with a delay of a couple of minutes 2. Auto-updating DB schema 3. Analysis still difficult and slow via SQL Iteration 4: Use Streams App Data collection Via Pinpoint Schema config
  38. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. • Use generic message types • Publish the data to: • S3 • Redshift • ElasticSearch Iteration 5: Generic message types App ElasticSearch
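A sketch of what a generic message envelope could look like; the field names are assumptions, and since the data-flow diagram mentions ProtoBuf, the real payload would be a serialized protocol buffer rather than the JSON used here for readability:

```python
import json
import uuid
from datetime import datetime, timezone

def wrap(message_type: str, payload: dict) -> str:
    """Wrap any payload in a generic envelope so new message types need no schema change."""
    return json.dumps({
        "messageId": str(uuid.uuid4()),
        "type": message_type,          # e.g. "metric", "log", "crash" (illustrative types)
        "emittedAt": datetime.now(timezone.utc).isoformat(),
        "payload": payload,            # opaque to the pipeline; consumers interpret it by type
    })
```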
  39. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Iteration 5 App Data collection ElasticSearch Consumer Lambdas SQL reports Dashboards ProtoBuf Consumer Redshifts
  40. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. 1. Streaming process with a delay of a few seconds 2. Auto-updating DB schema and generic message types 3. Analysis now flexible by processing the message payload Iteration 5: Generic message types
  41. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Data flow App ElasticSearch Consumer Redshifts Consumer Lambdas SQL reports Dashboards
  42. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Future improvements Some ideas to make the system even better: 1. Use Kinesis Analytics to query the real-time data stream 2. Use Amazon Athena to query data directly from S3 3. Use Amazon AI services to do deeper data analysis
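For idea 2, a minimal sketch of querying the S3 data with Amazon Athena via boto3; the database, table, and result bucket are illustrative:

```python
import boto3

athena = boto3.client("athena")

# Ad hoc query against the raw telemetry already landing in S3.
athena.start_query_execution(
    QueryString="SELECT type, COUNT(*) AS events FROM telemetry "
                "WHERE dt = '2017-11-28' GROUP BY type",           # illustrative table/partition
    QueryExecutionContext={"Database": "flex_telemetry"},          # illustrative database
    ResultConfiguration={"OutputLocation": "s3://flex-athena-results/"},
)
```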
  43. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Summary Did we solve our use cases? 1. Real-time metrics and alarming 2. Real-time dashboards 3. Real-time logs and crash troubleshooting 4. Monitoring new releases 5. Sharing data with other teams 6. Deeper analytics
  44. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Benefits of Streaming 1. Agility: real-time data means your business can react quicker 2. Flexibility: generic message types give you flexible schemas so your system can handle multiple data types and future use cases 3. Shareability: streams allow you to multiplex and share your data easily with your consumers 4. Extensibility: processing streams of data allows you to write the data to multiple storage systems, which enables a variety of analytics tools
  45. © 2017, Amazon Web Services, Inc. or its Affiliates. All rights reserved. Thank you!