This session is recommended for anyone interested in using AWS big data services to develop real-time analytics applications. You will get an overview of the Amazon big data and analytics services that enable you to build highly scalable cloud applications that immediately and continuously analyze large sets of distributed data. We'll explain how services like Amazon Kinesis, Amazon EMR, and Amazon Redshift can be used for data ingestion, processing, and storage to enable real-time insight into customer, operational, and machine-generated data and log files. We'll cover system requirements and design considerations, and walk through a specific customer use case to illustrate the business impact of real-time insights.
10. Amazon Kinesis – streams and shards
• Stream: A named entity that captures and stores data
• Shard: The unit of capacity
  – Put: 1 MB/s or 1,000 TPS
  – Get: 2 MB/s or 5 TPS
• Scale by adding or removing shards
• Replay records within a 24-hour window
11. How to size your Amazon Kinesis stream
Consider 2 producers, each producing 2 KB records at 500 TPS:
• Each producer writes 2 KB × 500 TPS = 1,000 KB/s, for a total ingress of 2 MB/s
• A minimum of 2 shards is needed for an ingress of 2 MB/s
• 2 applications can read from the stream with an egress of 4 MB/s
[Diagram: two producers writing 1,000 KB/s each into a 2-shard stream read by two applications]
12. How to size your Amazon Kinesis stream
Consider 3 consuming applications, each processing the full stream:
• Ingress is still 2 MB/s, but the required egress is now 3 × 2 MB/s = 6 MB/s
• Simple! Add another shard to the stream to spread the load: 3 shards provide 6 MB/s of egress
[Diagram: the same producers writing into a 3-shard stream read by three applications]
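The sizing arithmetic on these two slides can be sketched as a small helper. The per-shard limits (1 MB/s or 1,000 records/s in, 2 MB/s out) come from slide 10; the sketch assumes each consuming application reads the full stream.

```python
import math

def required_shards(ingress_mb_per_sec: float,
                    ingress_records_per_sec: float,
                    num_consumer_apps: int) -> int:
    """Estimate the minimum shard count for an Amazon Kinesis stream.

    Per-shard limits: 1 MB/s or 1,000 records/s ingress, 2 MB/s egress.
    Each consuming application reads the whole stream, so total egress
    is ingress bandwidth multiplied by the number of applications.
    """
    by_ingress_bandwidth = math.ceil(ingress_mb_per_sec / 1.0)
    by_ingress_records = math.ceil(ingress_records_per_sec / 1000.0)
    by_egress_bandwidth = math.ceil(ingress_mb_per_sec * num_consumer_apps / 2.0)
    return max(by_ingress_bandwidth, by_ingress_records, by_egress_bandwidth, 1)

# Slide 11: 2 producers x 2 KB x 500 TPS = 2 MB/s in, 2 readers -> 2 shards
# Slide 12: same ingress, 3 readers -> 6 MB/s egress -> 3 shards
```

Note the third term: with many consuming applications, egress (2 MB/s per shard) becomes the binding constraint before ingress does.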
13. Amazon Kinesis – distributed streams
• From batch to continuous processing
• Scale UP or DOWN without losing sequencing
• Workers can replay records for up to 24 hours
• Scale up to GB/sec without losing durability
– Records stored across multiple Availability Zones
• Run multiple parallel Amazon Kinesis applications
15. Pattern for real-time analytics
The processing spectrum runs from batch through micro-batch to real time:
• Data streams → streaming analytics (Spark Streaming, Apache Storm, Amazon KCL), feeding notifications & alerts, dashboards/visualizations, and APIs
• Data streams → data archive → batch analysis (Hadoop, data warehouse, deep learning), feeding dashboards/visualizations
[Diagram: data streams fanning out into the streaming and batch paths above]
16. Real-time analytics
• Streaming
  – Event-based response within seconds; for example, detecting whether a transaction is fraudulent
• Micro-batch
  – Operational insights within minutes; for example, monitoring transactions from different regions
Both approaches can be built on the Kinesis Client Library.
18. Amazon KCL design components
• Worker: The processing unit that maps to each application instance
• Record processor: The processing unit that processes data from a shard of an Amazon Kinesis stream
• Check-pointer: Keeps track of the records that have already been processed in a given shard
If a worker fails, Amazon KCL restarts processing of the shard at the last-known processed record.
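The real KCL is a Java library that checkpoints shard progress to DynamoDB. As an illustration of the record-processor/check-pointer split, here is a minimal in-memory Python sketch (all names are hypothetical, not KCL APIs) showing how checkpointing lets a replacement worker resume after a failure:

```python
class InMemoryCheckpointer:
    """Hypothetical stand-in for the KCL check-pointer (the real KCL
    persists checkpoints in a DynamoDB table)."""
    def __init__(self):
        self._checkpoints = {}  # shard_id -> last processed sequence number

    def checkpoint(self, shard_id, sequence_number):
        self._checkpoints[shard_id] = sequence_number

    def last_checkpoint(self, shard_id):
        return self._checkpoints.get(shard_id)


def process_shard(records, shard_id, checkpointer, handle):
    """Record processor for one shard: skip anything at or before the
    last checkpoint, process the rest, and checkpoint after each record.
    If the worker fails, its replacement resumes from the checkpoint."""
    last = checkpointer.last_checkpoint(shard_id)
    for seq, data in records:
        if last is not None and seq <= last:
            continue  # already processed before the failure
        handle(data)
        checkpointer.checkpoint(shard_id, seq)
```

Checkpointing after every record, as above, trades DynamoDB write cost for a smaller replay window; real applications often checkpoint periodically instead and accept some reprocessing (hence the de-duplication later in this talk).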
19. Amazon Kinesis Connector Library
• Amazon S3 – data archival
• Amazon Redshift – micro-batch loads
• Amazon DynamoDB – real-time counters
• Elasticsearch – search and indexing
[Diagram: Amazon Kinesis fanning out to S3, DynamoDB, and Amazon Redshift]
20. EMR integration with Amazon Kinesis
• Read data directly into Hive, Pig, Streaming, and Cascading from Amazon Kinesis
• Brings real-time sources into batch-oriented systems
• Multi-application support & check-pointing
21. Spark Streaming – basic concepts
• Higher-level abstraction called Discretized Streams (DStreams)
• Represented as sequences of Resilient Distributed Datasets (RDDs)
[Diagram: a receiver turns incoming messages into a DStream of RDDs, one per time interval (RDD@T1, RDD@T2)]
http://spark.apache.org/docs/latest/streaming-kinesis-integration.html
22. Apache Storm: Basic concepts
• Streams: Unbounded sequence of tuples
• Spout: Source of stream
• Bolts: Process input streams and produce new streams
• Topologies: Network of spouts and bolts
https://github.com/awslabs/kinesis-storm-spout
24. Further reading:
• Best Practices for Micro-Batch Loading on Amazon Redshift
• Implement a Real-time, Sliding-Window Application Using Amazon Kinesis and Apache Storm
• Visualizing Real-time, Geotagged Data with Amazon Kinesis
26. GREE: A Global Gaming Powerhouse
Offices: GREE headquarters (Tokyo, Japan); GREE International, Inc. (San Francisco, CA); GREE Canada (Vancouver, BC)
Quick facts: 6 continents playing GREE games; 1,882 employees worldwide; 13 games made in North America
Milestones: 2004, 2011, 2013
Game stats – 4 titles in top 100 grossing*:
• Crime City (Studios): Reached Top 10 Grossing in 140 countries; Top 100 Grossing in 19 countries, over 3 years since launch
• Knights & Dragons (Publishing): Reached Top 10 Grossing in 41 countries; Top 100 Grossing in 22 countries
*As of Sep. 2014 – Source: App Annie
28. Data collection
Sources of data
• Mobile devices
• Game servers
• Ad networks
Data size & growth
• 500 GB+/day
• 500 M+ events/day
• Size of event: ~1 KB
Sample analytics event:
{"player_id":"323726381807586881","player_level":169,"device":"iPhone 5","version":"iOS 7.1.2","platfrom":"ios","client_build":"440","db":"mw_dw_ios","table":"player_login","uuid":"1414566719-rsl3hvhu7o","time_created":"2014-10-29 00:11:59"}
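Consumers have to validate events like the sample above before loading them downstream. A minimal Python sketch of that check (the required-field set here is an assumption for illustration, not from the talk):

```python
import json

# Fields downstream consumers rely on: db/table route the event to a
# target table, uuid is used for de-duplication (assumed set).
REQUIRED_FIELDS = {"player_id", "db", "table", "uuid", "time_created"}

def parse_event(raw: str) -> dict:
    """Parse one analytics event and reject corrupt records early."""
    event = json.loads(raw)
    missing = REQUIRED_FIELDS - event.keys()
    if missing:
        raise ValueError(f"corrupt event, missing fields: {sorted(missing)}")
    return event
```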
29. Key requirements
• Guaranteed data delivery
• Zero data loss
• Zero data corruption
• Ease of adding consumers
• Near real-time data latency
• Real-time ad-hoc analysis
• Managed service
33. Design choices for sender
• Single-stream vs. stream per game
• Batch vs. single event
• Compressed vs. uncompressed
• PartitionKey vs. ExplicitHashKey
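Two of the choices above, batching and compression, can be sketched as a sender-side packing step. This is a hypothetical illustration (batch size and newline-delimited-JSON-plus-gzip encoding are assumptions), showing how many ~1 KB events can travel in one Kinesis record:

```python
import gzip
import json

def pack_events(events, max_events_per_batch=500):
    """Group events into batches, serialize each batch as
    newline-delimited JSON, and gzip it into one payload."""
    payloads = []
    for i in range(0, len(events), max_events_per_batch):
        lines = "\n".join(json.dumps(e) for e in events[i:i + max_events_per_batch])
        payloads.append(gzip.compress(lines.encode("utf-8")))
    return payloads

def unpack_payload(payload):
    """Consumer-side inverse: decompress and split back into events."""
    text = gzip.decompress(payload).decode("utf-8")
    return [json.loads(line) for line in text.splitlines()]
```

A real sender would also cap each compressed payload below the 1 MB Kinesis record limit before calling PutRecord.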
34. Consumer – Amazon S3 store in DSV format
KCL consumers run in an Auto Scaling group; each worker's record processors read from the stream's shards and, per record: decompress, de-dupe, transform to DSV, and validate against the target table. Records are buffered until a size or timeout threshold is reached, then compressed and written to Amazon S3, with each file registered in a file metadata DB.
[Diagram: Amazon Kinesis stream (shards 1..n) → KCL consumers in an Auto Scaling group → decompress / de-dupe / DSV transformation / validation / buffer / compress → S3 + file metadata DB]
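Two steps of that pipeline, de-duplication and DSV transformation, are easy to sketch. This is a hypothetical illustration (field names and the pipe delimiter are assumptions); de-duplication keys on the event uuid because, as noted under lessons learned, PutRecord retries can insert the same event twice:

```python
def dedupe(events, seen=None):
    """Drop events whose 'uuid' has already been processed."""
    seen = set() if seen is None else seen
    unique = []
    for event in events:
        if event["uuid"] not in seen:
            seen.add(event["uuid"])
            unique.append(event)
    return unique

def to_dsv(event, fields, delimiter="|"):
    """Flatten one event dict into a delimiter-separated line for the
    S3 files that Redshift COPY will later load."""
    return delimiter.join(str(event.get(f, "")) for f in fields)
```

A production consumer would keep the seen-set bounded (e.g., per time window or in an external store) rather than growing it forever.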
35. Loading data into Amazon Redshift
For each micro-batch, the loader reads pending files from the file metadata DB, creates an Amazon S3 manifest, and executes a Redshift COPY inside a transaction; file status is updated in the metadata DB once the load commits.
[Diagram: S3 + file metadata DB → create manifest → execute COPY → Amazon Redshift, with status updates flowing back to the metadata DB]
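The manifest step can be sketched directly: Redshift's COPY accepts a JSON manifest listing the exact S3 files of one batch (the JSON shape below is the documented Redshift manifest format; the bucket/paths are hypothetical):

```python
import json

def make_manifest(s3_urls):
    """Build a Redshift COPY manifest for one micro-batch.
    'mandatory': true makes COPY fail if any listed file is missing,
    which supports the zero-data-loss requirement."""
    return json.dumps(
        {"entries": [{"url": url, "mandatory": True} for url in s3_urls]},
        indent=2,
    )

# The COPY itself runs inside a transaction, then the metadata DB is updated:
#   COPY target_table
#   FROM 's3://bucket/manifests/batch-0001.manifest'
#   CREDENTIALS '...' MANIFEST GZIP DELIMITER '|';
```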
36. Consumer – Real-time stats
A second KCL consumer fleet (also in an Auto Scaling group) reads the same stream: each record processor decompresses and de-dupes records, filters events against a configuration, and increments counters keyed by metric, segment & value, and timeslot in the target table in ElastiCache (Redis), which backs the dashboard.
[Diagram: Amazon Kinesis stream (shards 1..n) → KCL consumers in an Auto Scaling group → decompress / de-dupe / filter events → ElastiCache (Redis) → dashboard]
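The counter keying described above can be sketched in a few lines. This is a hypothetical illustration (the 60-second slot width and key shape are assumptions); an in-memory dict stands in for ElastiCache (Redis), where the same keys would drive Redis INCR:

```python
from collections import defaultdict

def timeslot(epoch_seconds, slot_seconds=60):
    """Bucket an event timestamp into a fixed-width time slot."""
    return epoch_seconds - epoch_seconds % slot_seconds

class RealtimeCounters:
    """Aggregate (metric, segment value, timeslot) -> count, the shape
    the dashboard reads back from Redis."""
    def __init__(self, slot_seconds=60):
        self.slot_seconds = slot_seconds
        self.counts = defaultdict(int)

    def add(self, metric, segment_value, epoch_seconds):
        key = (metric, segment_value,
               timeslot(epoch_seconds, self.slot_seconds))
        self.counts[key] += 1
        return key
```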
38. Lessons learned
Sender
• Decouple data generation from sending
• Batch and compress
• PutRecord HTTP 5xx responses can result in duplicates
• Monitor ProvisionedThroughputExceeded exceptions
39. Lessons learned (Cont.)
Consumer
• Use Amazon KCL
• Auto-scale and monitor load
Overall
• Provision enough shards
• Handle shutdown gracefully
• Follow AWS best practices for error retries and
exponential back-off
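The last lesson, retries with exponential back-off, can be sketched as a delay schedule. Full jitter (randomizing each delay) follows AWS's general retry guidance; the base and cap values here are assumptions:

```python
import random

def backoff_delays(max_retries=5, base=0.1, cap=5.0, rng=random.random):
    """Delays (in seconds) for retrying throttled or failed Kinesis
    calls: exponential back-off capped at `cap`, scaled by full jitter.
    `rng` is injectable so the schedule can be tested deterministically."""
    return [rng() * min(cap, base * (2 ** attempt))
            for attempt in range(max_retries)]
```

Jitter matters under a ProvisionedThroughputExceeded burst: it spreads retries out instead of having every blocked sender hammer the shard again at the same instant.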
41. Takeaway
Amazon Kinesis
• Data available for processing within seconds
• Robust API, Amazon KCL, and the Amazon Kinesis Connector Library
AWS
• Managed
• Scalable
• Cost effective
• Quick to get up and running