Streaming data for real time analysis


See how to use Amazon Web Services such as Amazon Kinesis and Amazon Redshift for real-time big data streaming and analytics.

Published in: Technology, Business

  1. Streaming Data for Analysis. Brett Francis, Enterprise Solutions Architect. © 2014, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of, Inc.
  2. Talk Outline
     • Streaming Big Data
     • Analytics with Redshift
     • Generalizing the Streaming for Analytics design pattern
     • Cost Influences on Architecture
  3. You’re likely already “streaming”
     • Sensor network analytics
     • Ad network analytics
     • Log shipping and centralization
     • Click stream analysis
     • Gaming status
     • Hardware and software appliance metrics
     • …more…
  4. Example Streaming Big Data Source
  5. Let’s explore common challenges of streaming
  6. One common starting point is ingesting records for analysis. [Diagram labels: Elastic Beanstalk; Global top-10]
  7. Too big to handle on one box. [Diagram labels: Elastic Beanstalk; Global top-10]
  8. The solution needs record sorting and grouping. [Diagram labels: Elastic Beanstalk; Local top-10 (×3); Global top-10]
  9. The solution: streaming map/reduce. [Diagram labels: Elastic Beanstalk; Local top-10 (×3); Global top-10; Data Record; Shard; Sequence Number: 14, 17, 18, 21, 23]
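The local top-10 / global top-10 idea from the preceding slides can be sketched in a few lines of Python. This is a minimal in-memory illustration, not Kinesis code: the function names, the CRC32-based shard assignment, and the shard count are all assumptions made for the example. Because every key is routed to exactly one shard, merging the per-shard lists gives the same answer as counting everything on one box.

```python
import zlib
from collections import Counter


def shard_for(key: str, shard_count: int) -> int:
    """Route a key to a shard. CRC32 is an illustrative choice, not what Kinesis uses."""
    return zlib.crc32(key.encode("utf-8")) % shard_count


def local_top_n(records, n):
    """Map step: count one shard's records and keep its n most frequent keys."""
    return Counter(records).most_common(n)


def global_top_n(local_results, n):
    """Reduce step: merge the per-shard lists and keep the overall top n."""
    merged = Counter()
    for local in local_results:
        for key, count in local:
            merged[key] += count
    return merged.most_common(n)


def top_n_over_shards(records, shard_count=3, n=10):
    """Group records by shard, compute local top-n lists, then merge them."""
    shards = [[] for _ in range(shard_count)]
    for key in records:
        shards[shard_for(key, shard_count)].append(key)
    return global_top_n((local_top_n(shard, n) for shard in shards), n)
```

For example, `top_n_over_shards(["a"] * 5 + ["b"] * 3 + ["c"], shard_count=3, n=2)` returns the two most frequent keys with their counts, whichever shards the keys happen to land on.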
  10. When to use Stream Processing
      • “Real-time” starts appearing on the radar
      • The answer can’t wait for batch processing to finish
      • A fan-out pattern would serve better than processing serially as A > B > C
      • The records are just a means to an end; most can be archived immediately once an “answer” is determined
  11. How this relates to Kinesis. [Diagram labels: Elastic Beanstalk; Kinesis; Kinesis Application; Global top-10]
  12. Core streaming concepts. [Diagram labels: Elastic Beanstalk; Data Record; Stream; Shard; Partition Key; Worker; My top-10; Global top-10; Shard; Sequence Number: 14, 17, 18, 21, 23]
  13. Kinesis Managed Stream Processing
      • Move from batch to continuous processing
      • Scale shards and time series elastically UP or DOWN without losing sequencing
      • Workers can replay records for up to 24 hours
      • Scale up to GB/sec without losing durability
      • Records are stored across multiple Availability Zones
      • Multiple parallel Kinesis Apps can output to anything: RDBMS, S3, in-house data warehouse, messaging, another stream, etc., using the Java SDK, Python SDK, and others
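The replay bullet can be made concrete with a short Python sketch. It assumes a boto3-style Kinesis client is passed in (for instance one created with `boto3.client("kinesis")`, which the talk itself does not mention); `replay_shard` and its parameters are names invented for this example. The `TRIM_HORIZON` iterator type starts at the oldest record the stream still retains, which is what makes a replay possible.

```python
def replay_shard(client, stream_name, shard_id, batch_size=100):
    """Yield every retained record of one shard, oldest first.

    `client` is assumed to expose the Kinesis GetShardIterator and
    GetRecords calls with boto3-style keyword arguments.
    """
    iterator = client.get_shard_iterator(
        StreamName=stream_name,
        ShardId=shard_id,
        ShardIteratorType="TRIM_HORIZON",  # start at the oldest retained record
    )["ShardIterator"]

    while iterator:
        response = client.get_records(ShardIterator=iterator, Limit=batch_size)
        yield from response["Records"]
        # MillisBehindLatest == 0 means the reader has caught up with the tip
        # of the shard; this sketch stops there instead of polling forever.
        if response.get("MillisBehindLatest", 0) == 0:
            break
        # A closed shard returns no next iterator, which also ends the loop.
        iterator = response.get("NextShardIterator")
```

A long-running worker would keep polling rather than stop when it catches up, and would checkpoint the last sequence number it processed so that a restart can resume from there instead of from the trim horizon.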
  14. Amazon Kinesis. [Architecture diagram labels: Data Sources (×5); AWS Endpoint; Availability Zone (×3); Shard 1, Shard 2, … Shard N; App. 1 to App. 4: [Aggregate & De-Duplicate], [Metric Extraction], [Sliding Window Analysis], [Machine Learning]; outputs: S3, DynamoDB, Redshift]
  15. Core Concepts Recapped
      • Data Record ~ a single generated record
      • Stream ~ all records (a.k.a. the fire hose)
      • Partition Key ~ all records for a specific topic / sensor
      • Shard ~ all data records belonging to a set of topics, grouped together
      • Sequence Number ~ generated and assigned to each data record when ingested
      • Worker ~ processes the records of a shard in sequence order
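The concepts recapped above fit together as in the following Python sketch. It is a toy in-memory model, not the service: the `Stream` and `Shard` classes and the `worker` function are invented for illustration, and the single global counter used for sequence numbers is a simplification. The one detail taken from Kinesis itself is that a partition key is hashed with MD5 to a 128-bit integer, and each shard owns a contiguous range of that hash space.

```python
import hashlib
from dataclasses import dataclass, field
from itertools import count

HASH_SPACE = 2 ** 128  # an MD5 digest read as an integer falls in [0, 2**128)


@dataclass
class Shard:
    start: int  # first hash value owned by this shard (inclusive)
    end: int    # last hash value owned by this shard (inclusive)
    records: list = field(default_factory=list)


class Stream:
    """Toy stream: evenly split hash ranges, one shared sequence counter."""

    def __init__(self, shard_count):
        step = HASH_SPACE // shard_count
        self.shards = [
            Shard(i * step, HASH_SPACE - 1 if i == shard_count - 1 else (i + 1) * step - 1)
            for i in range(shard_count)
        ]
        self._sequence = count(1)  # simplification: real sequence numbers differ

    def shard_index(self, partition_key):
        digest = hashlib.md5(partition_key.encode("utf-8")).digest()
        hashed = int.from_bytes(digest, "big")
        for index, shard in enumerate(self.shards):
            if shard.start <= hashed <= shard.end:
                return index
        raise AssertionError("shard ranges must cover the whole hash space")

    def put_record(self, partition_key, data):
        """Assign a sequence number at ingest and append to the owning shard."""
        sequence_number = next(self._sequence)
        index = self.shard_index(partition_key)
        self.shards[index].records.append((sequence_number, partition_key, data))
        return index, sequence_number


def worker(shard):
    """Process one shard's records in sequence-number order."""
    for sequence_number, partition_key, data in sorted(shard.records):
        yield sequence_number, partition_key, data
```

Putting two records with the partition key `"sensor-1"` always lands them in the same shard, so a single worker sees that sensor's records in the order they were ingested.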
  16. Amazon Redshift
  17. Analysis using Redshift
      • Compatible with existing SQL Business Intelligence tools
      • Start small and grow massively
      • Scalable from 160 GB to a petabyte or more
      • Elastic data warehousing
      • Queries automatically keep running against the old cluster while the new one is being provisioned
      • Run it when you need it
  18. Redshift Architecture
      • Ingest from S3, EMR, DynamoDB, or the API
      • Backups to S3
      • JDBC / ODBC access
  19. Generalizing a Streaming for Analytics design pattern
  20. Example: Kinesis for Clickstream Analytics. [Diagram labels: Clickstream processing applications; Aggregated clickstream statistics; Clickstream archive; Clickstream trend analysis]
  21. Example: Kinesis for Simple Metering & Billing. [Diagram labels: Billing auditors; Incremental bill computation; Metering record archive; Billing mgmt service]
  22. Kinesis Poster Worker Demo (a.k.a. The Egg Finder)
      • Published at AWSlabs
      • https://github.com/awslabs/kinesis-poster-worker
      • Poster ~ multi-threaded client that posts random characters into a stream
      • Worker ~ a thread-per-shard client that gets batches of records, looking for the word ‘egg’
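The poster/worker idea can be tried without an AWS account using the Python sketch below. It is not the code from the awslabs repository: it is single-threaded, keeps records in memory instead of a stream, and the function names and parameters are assumptions made for the example.

```python
import random
import string


def poster(record_count, record_size, seed=None):
    """Generate records of random lowercase characters, as the demo poster does."""
    rng = random.Random(seed)
    for _ in range(record_count):
        yield "".join(rng.choice(string.ascii_lowercase) for _ in range(record_size))


def worker(records, word="egg"):
    """Scan a batch of records and return (record index, offset) for every hit."""
    hits = []
    for index, record in enumerate(records):
        offset = record.find(word)
        while offset != -1:
            hits.append((index, offset))
            offset = record.find(word, offset + 1)
    return hits
```

Calling `worker(list(poster(10_000, 100)))` shows how often the word turns up by chance. A word that straddles two records is not detected, since each record is scanned on its own.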
  23. Cost Influences on Architecture
  24. Streaming Analysis Cost Dimensions
      • Amazon Kinesis is priced in increments of:
          • Shards, each providing 1 MB/sec ingest and 2 MB/sec egress
          • 1M PUTs
      • Amazon EC2 Kinesis Apps are priced by instance
      • Amazon Redshift prices are hourly and:
          • One tenth the cost of alternatives (e.g. with a 3-year Reserved Instance)
          • Scale from 160 GB to more than 1 PB
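The shard increments above translate into a simple sizing calculation, sketched here in Python. The per-shard limits (1 MB/sec in, 2 MB/sec out) come from the slide; the function names and the 730-hour month are assumptions, and the prices are left as parameters on purpose because the talk quotes none.

```python
import math

INGEST_MB_PER_SHARD = 1.0  # MB/sec a single shard can ingest (from the slide)
EGRESS_MB_PER_SHARD = 2.0  # MB/sec a single shard can serve to readers (from the slide)


def shards_needed(ingest_mb_per_sec, egress_mb_per_sec):
    """Smallest shard count that covers both the ingest and the egress rate."""
    by_ingest = math.ceil(ingest_mb_per_sec / INGEST_MB_PER_SHARD)
    by_egress = math.ceil(egress_mb_per_sec / EGRESS_MB_PER_SHARD)
    return max(1, by_ingest, by_egress)


def monthly_kinesis_cost(shards, puts_per_month, price_per_shard_hour,
                         price_per_million_puts, hours=730):
    """Shard-hours plus PUT charges; both prices are caller-supplied placeholders."""
    shard_cost = shards * hours * price_per_shard_hour
    put_cost = puts_per_month / 1_000_000 * price_per_million_puts
    return shard_cost + put_cost
```

A stream taking in 3.5 MB/sec and read at 10 MB/sec needs `shards_needed(3.5, 10)` shards: four to cover ingest, five to cover egress, so five. Several applications reading the same stream multiply the egress figure, which is one way cost ends up shaping the architecture.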
  25. Thank You. Please send me feedback on this presentation. brettf@ Follow-up Links