Streaming data for real time analysis

© 2014 Amazon.com, Inc. and its affiliates. All rights reserved. May not be copied, modified, or distributed in whole or in part without the express consent of Amazon.com, Inc.
Streaming Data for Analysis
Brett Francis
Enterprise Solutions Architect

Talk Outline
• Streaming Big Data
• Analytics with Redshift
• Generalizing the Streaming for Analytics design pattern
• Cost Influences on Architecture

You’re likely already “streaming”
• Sensor network analytics
• Ad network analytics
• Log shipping and centralization
• Clickstream analysis
• Gaming status
• Hardware and software appliance metrics
• …more…

Example Streaming Big Data Source

Let’s explore common challenges of streaming

One common starting point is ingesting records for analysis
[Diagram: foo-analysis.com on Elastic Beanstalk, producing a global top-10]

Too big to handle on one box
[Diagram: foo-analysis.com on Elastic Beanstalk, global top-10]

The solution needs record sorting and grouping
[Diagram: three local top-10 lists feeding a global top-10 for foo-analysis.com on Elastic Beanstalk]

The solution: streaming map/reduce
[Diagram: foo-analysis.com on Elastic Beanstalk; a shard holds data records ordered by sequence number (14, 17, 18, 21, 23); three local top-10 lists feed the global top-10]
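The local top-10 / global top-10 split on this slide can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function names, the sample page paths, and the three in-memory "shards" are all assumptions. It also assumes every key lands in exactly one shard; if a key were spread across shards, merging truncated local lists would only approximate the global ranking.

```python
from collections import Counter

def local_top_n(records, n=10):
    """Map step: count the keys one worker sees and keep the n most frequent."""
    return Counter(records).most_common(n)

def global_top_n(local_results, n=10):
    """Reduce step: merge the local (key, count) lists and keep the n largest."""
    merged = Counter()
    for result in local_results:
        for key, count in result:
            merged[key] += count
    return merged.most_common(n)

# Hypothetical data: each inner list stands in for the records of one shard,
# and each key appears in only one shard.
shards = [
    ["/home", "/home", "/cart", "/home"],
    ["/search", "/search", "/about"],
    ["/checkout"],
]

local_results = [local_top_n(shard, n=2) for shard in shards]
print(global_top_n(local_results, n=3))
```

Each worker only ever holds the counts for its own shard, so no single box needs to see the whole stream; only the small local lists travel to the reducer.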

When to use Stream Processing
• “Real-time” starts coming onto the radar
• The time to answer can’t wait for batch processing
• Instead of processing serially as A > B > C, it would be better to have a fan-out pattern
• The records are just a means to an end; most records can be archived immediately after an “answer” is determined

How this relates to Kinesis
[Diagram: foo-analysis.com, Kinesis, Kinesis Application]

Core streaming concepts
[Diagram: foo-analysis.com, Data Record, Stream, Shard, Partition Key, Worker, “My top-10”; a shard holding data records ordered by sequence number (14, 17, 18, 21, 23)]

Kinesis Managed Stream Processing
• Moved from batch to continuous processing
• Scale shards and time series elastically UP or DOWN without losing sequencing
• Workers can replay records for up to 24 hours
• Scale up to GB/sec without losing durability
• Records stored across multiple Availability Zones
• Multiple parallel Kinesis Apps output to anything…
• RDBMS, S3, in-house data warehouse, messaging, another stream, Java SDK, Python SDK, etc.

Amazon Kinesis
[Diagram: Data Sources send records through the AWS Endpoint into shards (Shard 1, Shard 2, Shard N) spanning three Availability Zones. Applications read the stream: App. 1 [Aggregate & De-Duplicate], App. 2 [Metric Extraction], App. 3 [Sliding Window Analysis], App. 4 [Machine Learning]. Destinations shown: S3, DynamoDB, Redshift.]

Core Concepts Recapped
• Data Record ~ a single generated record
• Stream ~ all records (aka the fire hose)
• Partition Key ~ all records for a specific topic / sensor
• Shard ~ all data records belonging to a set of topics, grouped together
• Sequence Number ~ generated and assigned to each data record when ingested
• Worker ~ processes the records of a shard in sequence order

Analysis using Redshift
• Compatible with existing SQL Business Intelligence tools
• Start small and grow massively
• Scalable from 160GB to Petabyte+
• Elastic data warehousing
• Automatically run queries against the old cluster while the new one is being provisioned
• Run it when you need it

Redshift Architecture
• Ingest from S3, EMR, DynamoDB or API
• Backups to S3
• JDBC / ODBC Access

Generalizing a Streaming for Analytics design pattern

Example: Kinesis for Clickstream Analytics
[Diagram labels: clickstream processing applications; aggregated clickstream statistics; clickstream archive; clickstream trend analysis]

Example: Kinesis for Simple Metering & Billing
[Diagram labels: billing auditors; incremental bill computation; metering record archive; billing mgmt service]

Kinesis Poster Worker Demo
(aka. The Egg Finder)
• Published at AWSlabs
• https://github.com/awslabs/kinesis-poster-worker
• Poster ~ multi-threaded client that posts random characters into a stream
• Worker ~ a thread-per-shard client that gets batches of records, looking for the word ‘egg’
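The idea behind the demo can be imitated without any AWS account. The sketch below is not the code from the awslabs repository: it replaces the stream with plain in-memory lists, uses a single-threaded poster for brevity, and every name in it (`post_records`, `find_eggs`, the batch size, the record length) is an assumption made for illustration. It keeps the two roles from the slide: a poster that writes random characters, and one worker thread per shard that reads batches and looks for 'egg'.

```python
import random
import string
import threading


def post_records(shards, record_count, record_length=50, seed=0):
    """Poster: append random lowercase strings to randomly chosen shards."""
    rng = random.Random(seed)
    for _ in range(record_count):
        record = "".join(rng.choice(string.ascii_lowercase) for _ in range(record_length))
        rng.choice(shards).append(record)


def find_eggs(shard, found, lock, batch_size=25):
    """Worker: read one shard in batches and collect records containing 'egg'."""
    for start in range(0, len(shard), batch_size):
        batch = shard[start:start + batch_size]
        hits = [record for record in batch if "egg" in record]
        with lock:  # several workers share the same result list
            found.extend(hits)


shards = [[], [], []]  # in-memory stand-ins for three shards
post_records(shards, record_count=10000)

found, lock = [], threading.Lock()
workers = [threading.Thread(target=find_eggs, args=(shard, found, lock)) for shard in shards]
for thread in workers:
    thread.start()
for thread in workers:
    thread.join()

print(f"{len(found)} of 10000 records contain 'egg'")
```

One simplification worth noting: an 'egg' that straddles two records is not detected here, since each record is searched on its own.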

Cost Influences on Architecture

Streaming Analysis Cost Dimensions
• Amazon Kinesis priced in shard increments of:
• 1MB/sec ingest and 2MB/sec egress per shard
• 1M PUTs
• Amazon EC2 Kinesis Apps priced by instance
• Amazon Redshift prices are hourly and:
• One tenth the cost of alternatives (e.g. 3-year RI)
• Scales from 160GB to >1PB
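The per-shard limits above translate directly into a sizing estimate. The following sketch uses only the figures on this slide (1MB/sec ingest and 2MB/sec egress per shard, PUTs counted in millions); the function names and the sample workload are hypothetical, no prices are included, and it assumes one PUT per record.

```python
import math

INGEST_MB_PER_SHARD = 1.0  # from the slide: 1MB/sec ingest per shard
EGRESS_MB_PER_SHARD = 2.0  # from the slide: 2MB/sec egress per shard


def shards_needed(ingest_mb_per_sec, egress_mb_per_sec):
    """Smallest shard count that covers both the ingest and the egress rate."""
    by_ingest = math.ceil(ingest_mb_per_sec / INGEST_MB_PER_SHARD)
    by_egress = math.ceil(egress_mb_per_sec / EGRESS_MB_PER_SHARD)
    return max(1, by_ingest, by_egress)


def monthly_puts_in_millions(records_per_sec, days=30):
    """Number of PUTs per month, in millions, assuming one PUT per record."""
    return records_per_sec * 86400 * days / 1_000_000


# Hypothetical workload: 3.5 MB/sec in, two applications each reading everything.
print(shards_needed(ingest_mb_per_sec=3.5, egress_mb_per_sec=2 * 3.5))  # 4
print(monthly_puts_in_millions(records_per_sec=1000))  # 2592.0
```

In this example ingest and egress both require four shards; adding a third consuming application would make egress the binding dimension, which is the kind of cost influence on architecture this section is about.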

Thank You.
Please send me feedback on this presentation.
brettf@
Follow-up Links
aws.amazon.com/kinesis
aws.amazon.com/redshift
aws.amazon.com/elasticbeanstalk
