This session is recommended for anyone interested in using AWS big data services to develop real-time analytics applications. You will get an overview of the Amazon big data and analytics services that enable you to build highly scalable cloud applications that immediately and continuously analyze large sets of distributed data. We'll explain how services like Amazon Kinesis, Amazon EMR, and Amazon Redshift can be used for data ingestion, processing, and storage to enable real-time insight into customer, operational, and machine-generated data and log files. We'll cover system requirements and design considerations, and walk through a specific customer use case to illustrate the business impact of real-time insights.
10. Amazon Kinesis – streams and shards
• Stream: A named entity that captures and stores data
• Shard: The unit of capacity
  – Put: 1 MB/s or 1,000 TPS
  – Get: 2 MB/s or 5 TPS
• Scale by adding or removing shards
• Replay records within a 24-hour window
11. How to size your Amazon Kinesis stream
Consider 2 producers, each producing 2 KB records at 500 TPS:
• Each producer writes 2 KB × 500 TPS = 1,000 KB/s, for a total ingress of 2 MB/s
• A minimum of 2 shards is needed for an ingress of 2 MB/s
• 2 applications can read from the stream with an egress of 4 MB/s
[Diagram: two producers writing 1,000 KB/s each into a 2-shard stream read by two applications]
12. How to size your Amazon Kinesis stream
Consider 3 consuming applications, each processing the full stream:
• Ingress is still 2 MB/s, but the required egress is now 3 × 2 MB/s = 6 MB/s
• Simple! Add another shard to the stream to spread the load: 3 shards provide 6 MB/s of egress
[Diagram: the same producers writing into a 3-shard stream read by three applications]
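The sizing arithmetic on these two slides can be sketched as a small helper. The per-shard limits (1 MB/s or 1,000 records/s in, 2 MB/s out) come from slide 10; the sketch assumes each consuming application reads the full stream.

```python
import math

def required_shards(ingress_mb_per_sec: float,
                    ingress_records_per_sec: float,
                    num_consumer_apps: int) -> int:
    """Estimate the minimum shard count for an Amazon Kinesis stream.

    Per-shard limits: 1 MB/s or 1,000 records/s ingress, 2 MB/s egress.
    Each consuming application reads the whole stream, so total egress
    is ingress bandwidth multiplied by the number of applications.
    """
    by_ingress_bandwidth = math.ceil(ingress_mb_per_sec / 1.0)
    by_ingress_records = math.ceil(ingress_records_per_sec / 1000.0)
    by_egress_bandwidth = math.ceil(ingress_mb_per_sec * num_consumer_apps / 2.0)
    return max(by_ingress_bandwidth, by_ingress_records, by_egress_bandwidth, 1)

# Slide 11: 2 producers x 2 KB x 500 TPS = 2 MB/s in, 2 readers -> 2 shards
# Slide 12: same ingress, 3 readers -> 6 MB/s egress -> 3 shards
```

Note the third term: with many consuming applications, egress (2 MB/s per shard) becomes the binding constraint before ingress does.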
13. Amazon Kinesis – distributed streams
• From batch to continuous processing
• Scale UP or DOWN without losing sequencing
• Workers can replay records for up to 24 hours
• Scale up to GB/sec without losing durability
– Records stored across multiple Availability Zones
• Run multiple parallel Amazon Kinesis applications
15. Pattern for real-time analytics
The processing spectrum runs from batch through micro-batch to real time:
• Data streams → streaming analytics (Spark Streaming, Apache Storm, Amazon KCL), feeding notifications & alerts, dashboards/visualizations, and APIs
• Data streams → data archive → batch analysis (Hadoop, data warehouse, deep learning), feeding dashboards/visualizations
[Diagram: data streams fanning out into the streaming and batch paths above]
16. Real-time analytics
• Streaming
  – Event-based response within seconds; for example, detecting whether a transaction is fraudulent
• Micro-batch
  – Operational insights within minutes; for example, monitoring transactions from different regions
Both approaches can be built on the Kinesis Client Library.
18. Amazon KCL design components
• Worker: The processing unit that maps to each application instance
• Record processor: The processing unit that processes data from a shard of an Amazon Kinesis stream
• Check-pointer: Keeps track of the records that have already been processed in a given shard
If a worker fails, Amazon KCL restarts processing of the shard at the last-known processed record.
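The real KCL is a Java library that checkpoints shard progress to DynamoDB. As an illustration of the record-processor/check-pointer split, here is a minimal in-memory Python sketch (all names are hypothetical, not KCL APIs) showing how checkpointing lets a replacement worker resume after a failure:

```python
class InMemoryCheckpointer:
    """Hypothetical stand-in for the KCL check-pointer (the real KCL
    persists checkpoints in a DynamoDB table)."""
    def __init__(self):
        self._checkpoints = {}  # shard_id -> last processed sequence number

    def checkpoint(self, shard_id, sequence_number):
        self._checkpoints[shard_id] = sequence_number

    def last_checkpoint(self, shard_id):
        return self._checkpoints.get(shard_id)


def process_shard(records, shard_id, checkpointer, handle):
    """Record processor for one shard: skip anything at or before the
    last checkpoint, process the rest, and checkpoint after each record.
    If the worker fails, its replacement resumes from the checkpoint."""
    last = checkpointer.last_checkpoint(shard_id)
    for seq, data in records:
        if last is not None and seq <= last:
            continue  # already processed before the failure
        handle(data)
        checkpointer.checkpoint(shard_id, seq)
```

Checkpointing after every record, as above, trades DynamoDB write cost for a smaller replay window; real applications often checkpoint periodically instead and accept some reprocessing (hence the de-duplication later in this talk).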
19. Amazon Kinesis Connector Library
• Amazon S3 – data archival
• Amazon Redshift – micro-batch loads
• Amazon DynamoDB – real-time counters
• Elasticsearch – search and indexing
[Diagram: Amazon Kinesis fanning out to S3, DynamoDB, and Amazon Redshift]
20. EMR integration with Amazon Kinesis
• Read data directly into Hive, Pig, Streaming, and Cascading from Amazon Kinesis
• Brings real-time sources into batch-oriented systems
• Multi-application support & check-pointing
21. Spark Streaming – basic concepts
• Higher-level abstraction called Discretized Streams (DStreams)
• Represented as sequences of Resilient Distributed Datasets (RDDs)
[Diagram: a receiver turns incoming messages into a DStream of RDDs, one per time interval (RDD@T1, RDD@T2)]
http://spark.apache.org/docs/latest/streaming-kinesis-integration.html
22. Apache Storm: Basic concepts
• Streams: Unbounded sequence of tuples
• Spout: Source of stream
• Bolts: Process input streams and produce new streams
• Topologies: Network of spouts and bolts
https://github.com/awslabs/kinesis-storm-spout
24. Further reading:
• Best Practices for Micro-Batch Loading on Amazon Redshift
• Implement a Real-time, Sliding-Window Application Using Amazon Kinesis and Apache Storm
• Visualizing Real-time, Geotagged Data with Amazon Kinesis
26. GREE: A Global Gaming Powerhouse
Offices: GREE headquarters (Tokyo, Japan); GREE International, Inc. (San Francisco, CA); GREE Canada (Vancouver, BC)
Quick facts: 6 continents playing GREE games; 1,882 employees worldwide; 13 games made in North America
Milestones: 2004, 2011, 2013
Game stats – 4 titles in top 100 grossing*:
• Crime City (Studios): Reached Top 10 Grossing in 140 countries; Top 100 Grossing in 19 countries, over 3 years since launch
• Knights & Dragons (Publishing): Reached Top 10 Grossing in 41 countries; Top 100 Grossing in 22 countries
*As of Sep. 2014 – Source: App Annie
28. Data collection
Sources of data
• Mobile devices
• Game servers
• Ad networks
Data size & growth
• 500 GB+/day
• 500 M+ events/day
• Size of event: ~1 KB
Sample analytics event:
{"player_id":"323726381807586881","player_level":169,"device":"iPhone 5","version":"iOS 7.1.2","platfrom":"ios","client_build":"440","db":"mw_dw_ios","table":"player_login","uuid":"1414566719-rsl3hvhu7o","time_created":"2014-10-29 00:11:59"}
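Consumers have to validate events like the sample above before loading them downstream. A minimal Python sketch of that check (the required-field set here is an assumption for illustration, not from the talk):

```python
import json

# Fields downstream consumers rely on: db/table route the event to a
# target table, uuid is used for de-duplication (assumed set).
REQUIRED_FIELDS = {"player_id", "db", "table", "uuid", "time_created"}

def parse_event(raw: str) -> dict:
    """Parse one analytics event and reject corrupt records early."""
    event = json.loads(raw)
    missing = REQUIRED_FIELDS - event.keys()
    if missing:
        raise ValueError(f"corrupt event, missing fields: {sorted(missing)}")
    return event
```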
29. Key requirements
• Guaranteed data delivery
• Zero data loss
• Zero data corruption
• Ease of adding consumers
• Near real-time data latency
• Real-time ad-hoc analysis
• Managed service
33. Design choices for sender
• Single-stream vs. stream per game
• Batch vs. single event
• Compressed vs. uncompressed
• PartitionKey vs. ExplicitHashKey
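Two of the choices above, batching and compression, can be sketched as a sender-side packing step. This is a hypothetical illustration (batch size and newline-delimited-JSON-plus-gzip encoding are assumptions), showing how many ~1 KB events can travel in one Kinesis record:

```python
import gzip
import json

def pack_events(events, max_events_per_batch=500):
    """Group events into batches, serialize each batch as
    newline-delimited JSON, and gzip it into one payload."""
    payloads = []
    for i in range(0, len(events), max_events_per_batch):
        lines = "\n".join(json.dumps(e) for e in events[i:i + max_events_per_batch])
        payloads.append(gzip.compress(lines.encode("utf-8")))
    return payloads

def unpack_payload(payload):
    """Consumer-side inverse: decompress and split back into events."""
    text = gzip.decompress(payload).decode("utf-8")
    return [json.loads(line) for line in text.splitlines()]
```

A real sender would also cap each compressed payload below the 1 MB Kinesis record limit before calling PutRecord.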
34. Consumer – Amazon S3 store in DSV format
KCL consumers run in an Auto Scaling group; each worker's record processors read from the stream's shards and, per record: decompress, de-dupe, transform to DSV, and validate against the target table. Records are buffered until a size or timeout threshold is reached, then compressed and written to Amazon S3, with each file registered in a file metadata DB.
[Diagram: Amazon Kinesis stream (shards 1..n) → KCL consumers in an Auto Scaling group → decompress / de-dupe / DSV transformation / validation / buffer / compress → S3 + file metadata DB]
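Two steps of that pipeline, de-duplication and DSV transformation, are easy to sketch. This is a hypothetical illustration (field names and the pipe delimiter are assumptions); de-duplication keys on the event uuid because, as noted under lessons learned, PutRecord retries can insert the same event twice:

```python
def dedupe(events, seen=None):
    """Drop events whose 'uuid' has already been processed."""
    seen = set() if seen is None else seen
    unique = []
    for event in events:
        if event["uuid"] not in seen:
            seen.add(event["uuid"])
            unique.append(event)
    return unique

def to_dsv(event, fields, delimiter="|"):
    """Flatten one event dict into a delimiter-separated line for the
    S3 files that Redshift COPY will later load."""
    return delimiter.join(str(event.get(f, "")) for f in fields)
```

A production consumer would keep the seen-set bounded (e.g., per time window or in an external store) rather than growing it forever.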
35. Loading data into Amazon Redshift
For each micro-batch, the loader reads pending files from the file metadata DB, creates an Amazon S3 manifest, and executes a Redshift COPY inside a transaction; file status is updated in the metadata DB once the load commits.
[Diagram: S3 + file metadata DB → create manifest → execute COPY → Amazon Redshift, with status updates flowing back to the metadata DB]
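The manifest step can be sketched directly: Redshift's COPY accepts a JSON manifest listing the exact S3 files of one batch (the JSON shape below is the documented Redshift manifest format; the bucket/paths are hypothetical):

```python
import json

def make_manifest(s3_urls):
    """Build a Redshift COPY manifest for one micro-batch.
    'mandatory': true makes COPY fail if any listed file is missing,
    which supports the zero-data-loss requirement."""
    return json.dumps(
        {"entries": [{"url": url, "mandatory": True} for url in s3_urls]},
        indent=2,
    )

# The COPY itself runs inside a transaction, then the metadata DB is updated:
#   COPY target_table
#   FROM 's3://bucket/manifests/batch-0001.manifest'
#   CREDENTIALS '...' MANIFEST GZIP DELIMITER '|';
```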
36. Consumer – Real-time stats
A second KCL consumer fleet (also in an Auto Scaling group) reads the same stream: each record processor decompresses and de-dupes records, filters events against a configuration, and increments counters keyed by metric, segment & value, and timeslot in the target table in ElastiCache (Redis), which backs the dashboard.
[Diagram: Amazon Kinesis stream (shards 1..n) → KCL consumers in an Auto Scaling group → decompress / de-dupe / filter events → ElastiCache (Redis) → dashboard]
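The counter keying described above can be sketched in a few lines. This is a hypothetical illustration (the 60-second slot width and key shape are assumptions); an in-memory dict stands in for ElastiCache (Redis), where the same keys would drive Redis INCR:

```python
from collections import defaultdict

def timeslot(epoch_seconds, slot_seconds=60):
    """Bucket an event timestamp into a fixed-width time slot."""
    return epoch_seconds - epoch_seconds % slot_seconds

class RealtimeCounters:
    """Aggregate (metric, segment value, timeslot) -> count, the shape
    the dashboard reads back from Redis."""
    def __init__(self, slot_seconds=60):
        self.slot_seconds = slot_seconds
        self.counts = defaultdict(int)

    def add(self, metric, segment_value, epoch_seconds):
        key = (metric, segment_value,
               timeslot(epoch_seconds, self.slot_seconds))
        self.counts[key] += 1
        return key
```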
38. Lessons learned
Sender
• Decouple data generation from sending
• Batch and compress
• PutRecord HTTP 5xx responses can result in duplicates
• Monitor ProvisionedThroughputExceeded exceptions
39. Lessons learned (Cont.)
Consumer
• Use Amazon KCL
• Auto-scale and monitor load
Overall
• Provision enough shards
• Handle shutdown gracefully
• Follow AWS best practices for error retries and
exponential back-off
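The last lesson, retries with exponential back-off, can be sketched as a delay schedule. Full jitter (randomizing each delay) follows AWS's general retry guidance; the base and cap values here are assumptions:

```python
import random

def backoff_delays(max_retries=5, base=0.1, cap=5.0, rng=random.random):
    """Delays (in seconds) for retrying throttled or failed Kinesis
    calls: exponential back-off capped at `cap`, scaled by full jitter.
    `rng` is injectable so the schedule can be tested deterministically."""
    return [rng() * min(cap, base * (2 ** attempt))
            for attempt in range(max_retries)]
```

Jitter matters under a ProvisionedThroughputExceeded burst: it spreads retries out instead of having every blocked sender hammer the shard again at the same instant.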
41. Takeaway
Amazon Kinesis
• Data available for processing within seconds
• Robust API, Amazon KCL, and the Amazon Kinesis Connector Library
AWS
• Managed
• Scalable
• Cost effective
• Quick to get up and running