OLX Group presentation on real-time serverless analytics at the 2018 OLX internal data summit in Barcelona.
The presentation focuses on best practices in real-time data applications, including AWS technologies such as Kinesis, Lambda (with the Serverless Framework) and ElastiCache.
The presentation examines a case study of real-time product recommendations built on top of a serverless architecture.
2. 2
What to expect…
▪ The goal is to give you a sweeping view of the Shedd serverless real-time analytics stack
▪ We will cover a lot of new tools and tech building blocks, though we will steer clear of the nitty-gritty details
▪ Expect technical content and hands-on exercises – for the non-technical folk in the audience, try to focus on the high-level understanding of the concepts
▪ We hope the presentation gives you inspiration and smooths the learning curve in case you decide to pursue a similar approach
6. 6
Example: Consider this insight regarding first-time Shedd users
Segment | Day 1 activity
Browser | Does not view any ads
Viewer | Views 1 or more ads
Buyer | Makes 1 or more replies
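The three day-1 segments above can be expressed as a simple rule. The following is a minimal sketch, not the actual Shedd implementation: the function name and arguments are hypothetical, and it assumes that a reply takes precedence over ad views when both are present.

```python
def day1_segment(ad_views: int, replies: int) -> str:
    """Classify a first-time user by day-1 activity (hypothetical helper)."""
    # Assumption: a user who replied counts as a Buyer regardless of ad views.
    if replies >= 1:
        return "Buyer"
    if ad_views >= 1:
        return "Viewer"
    return "Browser"


if __name__ == "__main__":
    for views, replies in [(0, 0), (3, 0), (5, 1)]:
        print(views, replies, day1_segment(views, replies))
```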
8. 8
Example: Consider this insight regarding first-time Shedd users
Segment | Day 1 activity | Days 2-30 activity
Browser | Does not view any ads | 2.9 ad views, 0.02 replies, 1.3 active days
Viewer | Views 1 or more ads | 150 ad views, 0.4 replies, 4.7 active days
Buyer | Makes 1 or more replies | 670 ad views, 6.7 replies, 11.2 active days
How can real-time analytics help?
9. 9
Real-time analytics unlocks a number of capabilities
▪ Segmentation: Segment user behaviour and build a real-time single customer view
▪ Personalisation: Instantly personalise product experience based on up-to-date user preferences and behaviour
▪ Targeting: Target users with push notifications, in-app messaging and custom product flows based on real-time triggers and rules
▪ Reporting: Build mission-critical reports for real-time decision-making (e.g. during large live marketing campaigns or new product releases)
▪ A/B testing: Continuously optimise live A/B tests based on real-time results
▪ Data-driven products: Enable integration of data analytics & models within our products
10. 10
Real-time analytics enables us to unlock the full value of data
The diminishing value of data
▪ Recent data is highly valuable, if you act on it in time
▪ Old + recent data is more valuable, if you have the means to combine them
Perishable Insights (M. Gualtieri, F…
11. 11
Today we will take a peek at Shedd’s real-time data stack
[Architecture diagram; components shown:]
▪ Operational data layer (listings, replies, users, orders, etc.): Tracking (Ninja / Hydra), Platform DB (Mongo), Adjust / Facebook / Google, …
▪ BATCH DATA STACK: Raw data layer (data lake), DATAWAREHOUSE, consumers (BI, Segmentation, Performance marketing, CLM, Batch recommender, …)
▪ REAL-TIME DATA STACK: Raw data streams (Tracking (Ninja / Hydra), Platform DB (Mongo), …), Real-time data processing (Lambda functions, λ), Real-time database (Online customer view), API gateway, consumers (Real-time recommender, Real-time segmentation, Other real-time applications)
16. 16
Kinesis Data Stream architecture
▪ 1 MB / sec data input
▪ 2 MB / sec data output
▪ 1000 records / sec
▪ 24 hours data retention
▪ $0.015 / shard / hour ($10.80 / shard / month)
▪ $0.014 / 1M records ($14 / 1B records)
[Diagram: a stream is made up of shards (…); each event / data record (e.g. JSON object) is written to a stream shard and read from a stream shard]
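The shard limits and prices above translate into a quick back-of-the-envelope estimate. The sketch below is illustrative only: the function names are hypothetical, the constants are the figures quoted on the slide, and it assumes each record counts as one billable unit and that a month has 30 days.

```python
import math

# Figures taken from the slide above (per shard).
SHARD_MB_PER_SEC_IN = 1.0
SHARD_RECORDS_PER_SEC = 1000
SHARD_PRICE_PER_HOUR = 0.015
PRICE_PER_MILLION_RECORDS = 0.014


def shards_needed(records_per_sec: float, avg_record_kb: float) -> int:
    """Shards required to absorb the given write load (at least one)."""
    mb_per_sec = records_per_sec * avg_record_kb / 1024
    by_bytes = math.ceil(mb_per_sec / SHARD_MB_PER_SEC_IN)
    by_count = math.ceil(records_per_sec / SHARD_RECORDS_PER_SEC)
    return max(1, by_bytes, by_count)


def monthly_cost(records_per_sec: float, avg_record_kb: float, days: int = 30) -> float:
    """Approximate monthly cost in USD: shard hours plus per-record charges."""
    shards = shards_needed(records_per_sec, avg_record_kb)
    shard_cost = shards * SHARD_PRICE_PER_HOUR * 24 * days
    records = records_per_sec * 3600 * 24 * days
    record_cost = records / 1_000_000 * PRICE_PER_MILLION_RECORDS
    return shard_cost + record_cost


if __name__ == "__main__":
    # Hypothetical load: 2,500 events per second of 0.5 KB each.
    print(shards_needed(2500, 0.5), round(monthly_cost(2500, 0.5), 2))
```

With no traffic at all, a single shard still costs $0.015 × 24 × 30 = $10.80 per month, which matches the per-shard figure on the slide.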
17. 17
Exercise: Create stream and feed with sample data
1. Create Kinesis data stream: https://us-west-2.console.aws.amazon.com/kinesis/home?region=us-west-2#/streams/create
2. Feed sample real-time data: https://awslabs.github.io/amazon-kinesis-data-generator/
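As an alternative to the data generator, sample events can also be pushed from a script. This is a rough sketch under stated assumptions: the event fields (`user_id`, `event`, `ad_id`, `ts`) and the helper names are hypothetical and not taken from the presentation, while `put_records` is the boto3 Kinesis client call for batched writes. The `send` helper needs boto3 and AWS credentials, so it is not called here.

```python
import json
import random
import time


def make_event(user_id: int) -> dict:
    """Build one hypothetical tracking event."""
    return {
        "user_id": user_id,
        "event": random.choice(["ad_view", "reply"]),
        "ad_id": random.randint(1, 1000),
        "ts": time.time(),
    }


def to_kinesis_records(events: list) -> list:
    """Shape events as entries for the Records argument of put_records."""
    return [
        # The partition key determines which shard receives the record.
        {"Data": json.dumps(e).encode("utf-8"), "PartitionKey": str(e["user_id"])}
        for e in events
    ]


def send(stream_name: str, events: list, region: str = "us-west-2"):
    """Write a batch of events to the stream (put_records accepts up to 500 per call)."""
    import boto3  # imported lazily so the helpers above work without AWS access

    client = boto3.client("kinesis", region_name=region)
    return client.put_records(StreamName=stream_name, Records=to_kinesis_records(events))


if __name__ == "__main__":
    print(to_kinesis_records([make_event(7)]))
```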
19. 19
Exercise: Create Kinesis Analytics application and run some real-time SQL analysis
1. Create Kinesis Analytics app
2. Run real-time SQL analysis
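The SQL used in the exercise is not included in the slides. To give a feel for the kind of analysis a real-time query typically performs, here is a plain-Python sketch of a tumbling-window count; the function name and the `(timestamp, event_type)` input shape are assumptions made for illustration, and this is not Kinesis Analytics code.

```python
from collections import Counter


def tumbling_window_counts(events, window_sec: int = 60) -> dict:
    """Count events per type in fixed, non-overlapping time windows.

    events: iterable of (timestamp_in_seconds, event_type) pairs.
    Returns {window_start_in_seconds: Counter of event types}.
    """
    windows = {}
    for ts, event_type in events:
        # Each event belongs to exactly one window, identified by its start time.
        start = int(ts // window_sec) * window_sec
        windows.setdefault(start, Counter())[event_type] += 1
    return windows


if __name__ == "__main__":
    sample = [(0, "ad_view"), (30, "ad_view"), (59, "reply"), (60, "ad_view")]
    for start, counts in sorted(tumbling_window_counts(sample).items()):
        print(start, dict(counts))
```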
20. 20
We leverage 3 AWS building blocks for real-time data analytics
▪ KINESIS: Stream data
▪ LAMBDA: Process data
▪ ELASTICACHE: Store data
21. 21
Evolution of computing models
▪ ON-PREMISE: Physical servers
▪ SERVER as a service: Virtual server in the cloud (Amazon EC2)
▪ APP as a service: Virtual app container (Amazon ECS)
▪ FUNCTION as a service: Serverless computing (AWS Lambda)
22. 22
Lambda is Amazon’s serverless event-driven compute service
▪ Write code in Python, Node.js, Java, and others and upload to Lambda
▪ Trigger code from other AWS services, HTTP endpoints or in-app activity
▪ Scale seamlessly and elastically with number of events, only using required compute resource
▪ Only pay for the compute time used (per 100ms execution time)
Forget about infrastructure, administration and scaling – focus 100% on your app logic
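To make the event-driven model concrete, here is a minimal sketch of a Python Lambda function triggered by a Kinesis stream. Lambda delivers Kinesis records base64-encoded under `Records[i]["kinesis"]["data"]`; the JSON payload and its `event` field are assumptions made for illustration.

```python
import base64
import json


def handler(event, context):
    """Decode Kinesis records delivered to Lambda and count them by event type."""
    counts = {}
    for record in event.get("Records", []):
        # Kinesis payloads arrive base64-encoded inside the Lambda event.
        payload = base64.b64decode(record["kinesis"]["data"])
        data = json.loads(payload)
        key = data.get("event", "unknown")  # hypothetical field name
        counts[key] = counts.get(key, 0) + 1
    return counts


if __name__ == "__main__":
    # Local smoke test with a hand-built event; no AWS account needed.
    body = base64.b64encode(json.dumps({"event": "ad_view"}).encode()).decode()
    print(handler({"Records": [{"kinesis": {"data": body}}]}, None))
```

Because the handler is a plain function, it can be exercised locally with a hand-built event before it is deployed.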
26. 26
Exercise: Create APIs with Serverless Framework + API gateway + Lambda
1. Create Hello World endpoint
2. Create mock API endpoint
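A Hello World endpoint behind API Gateway can be as small as the sketch below. It assumes the Lambda proxy integration, where the function returns a dict with `statusCode`, `headers` and a string `body`; the function name and the `name` query parameter are hypothetical, and the exercise's actual code is not shown in the slides.

```python
import json


def hello(event, context):
    """Return a JSON greeting in the shape API Gateway's proxy integration expects."""
    # queryStringParameters is None when the request has no query string.
    params = event.get("queryStringParameters") or {}
    name = params.get("name", "world")
    return {
        "statusCode": 200,
        "headers": {"Content-Type": "application/json"},
        "body": json.dumps({"message": f"Hello, {name}!"}),
    }


if __name__ == "__main__":
    print(hello({"queryStringParameters": {"name": "Barcelona"}}, None))
```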
28. 28
ElastiCache is Amazon’s managed service for Redis: an INSANELY fast in-memory key-value database
▪ In-memory
▪ Low latency
▪ Ridiculously fast
▪ NoSQL → key-value store
▪ Open source
29. 29
Redis + Redshift =
Redshift:
▪ Run few queries infrequently
▪ Process billions of records per query
▪ Standard SQL
▪ Batch
Redis:
▪ Run millions of commands continuously
▪ Process few records per command
▪ 200 Redis commands + Lua scripting
▪ Real-time
30. 30
Redis is a key-value store supporting 5 basic data types
Key => { Data Structures }
▪ String (Strings/Blobs/Bitmaps): "I'm a Plain Text String!"
▪ Hash (Hash Tables, objects!): Key1 → Val1, Key2 → Val2
▪ List (Linked Lists): C B B A C
▪ Set (Sets): A B C D
▪ Sorted set (Sorted Sets): A: 0.1, B: 0.3, C: 500, D: 500
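For readers more familiar with Python than Redis, the five types map roughly onto built-in Python structures. The sketch below is only an analogy and does not talk to a Redis server; the comments name the Redis commands that would normally write and read each type, and `zrange_with_scores` is a hypothetical helper mimicking sorted-set ordering.

```python
# Python analogues of the five Redis types, using the sample values from the slide.
string_value = "I'm a Plain Text String!"               # SET / GET
hash_value = {"Key1": "Val1", "Key2": "Val2"}           # HSET / HGETALL
list_value = ["C", "B", "B", "A", "C"]                  # RPUSH / LRANGE (duplicates allowed)
set_value = {"A", "B", "C", "D"}                        # SADD / SMEMBERS (unique members)
sorted_set = {"A": 0.1, "B": 0.3, "C": 500, "D": 500}   # ZADD / ZRANGE (member -> score)


def zrange_with_scores(zset: dict) -> list:
    """Order members by score, breaking ties by member name as Redis does."""
    return sorted(zset.items(), key=lambda kv: (kv[1], kv[0]))


if __name__ == "__main__":
    print(zrange_with_scores(sorted_set))
    print(len(list_value), len(set(list_value)))
```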
31. 31
Exercise: Let’s have a look at Redis in action
1. Play with Redis commands
2. Test Redis speed
32. 32
Recap: We covered the 3 AWS building blocks for real-time data
▪ KINESIS: Stream data
▪ LAMBDA: Process data
▪ ELASTICACHE: Store data
34. 34
Real-time vs offline data stacks
Dimension | Offline stack | Real-time stack
Raw data | Files on S3 | Kinesis streams
Database | Redshift | Redis
Volume | High – processing millions / billions of records at the same time | Low – processing single records at a time
Velocity | Low – running few queries at a time | High – running thousands / millions of queries at the same time
Query language | SQL | Python + Redis commands
End-user | Humans, BI tools | Lambda, APIs, products
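The "Python + Redis commands" style of the real-time stack, processing one record at a time, might look like the sketch below. It is a hedged illustration rather than Shedd's code: `FakeRedis` is a hypothetical in-memory stand-in that mimics the `hincrby` and `hgetall` methods of the redis-py client so the example runs without a server, and the `user:<id>` key layout and event fields are assumptions.

```python
from collections import defaultdict


class FakeRedis:
    """In-memory stand-in for a Redis client (hypothetical, for local testing only)."""

    def __init__(self):
        self._hashes = defaultdict(dict)

    def hincrby(self, name, key, amount=1):
        # Same call shape as redis-py's hincrby: increment a hash field, return the new value.
        self._hashes[name][key] = self._hashes[name].get(key, 0) + amount
        return self._hashes[name][key]

    def hgetall(self, name):
        # Note: a real redis-py client returns bytes unless decode_responses=True.
        return dict(self._hashes[name])


def update_customer_view(store, event: dict) -> int:
    """Apply a single event to the per-user counters (the online customer view)."""
    key = f"user:{event['user_id']}"  # assumed key layout
    return store.hincrby(key, event["event"], 1)


if __name__ == "__main__":
    store = FakeRedis()
    for e in [{"user_id": 1, "event": "ad_view"}, {"user_id": 1, "event": "ad_view"},
              {"user_id": 1, "event": "reply"}]:
        update_customer_view(store, e)
    print(store.hgetall("user:1"))
```

Each event touches a single key, which is the "low volume, high velocity" access pattern listed in the real-time column of the table.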
35. 35
Shedd end-to-end data stack architecture
[Architecture diagram; components shown:]
▪ Operational data layer (listings, replies, users, orders, etc.): Tracking (Ninja / Hydra), Platform DB (Mongo), Adjust / Facebook / Google, …
▪ BATCH DATA STACK: Raw data layer (data lake), DATAWAREHOUSE, consumers (BI, Segmentation, Performance marketing, CLM, Batch recommender, …)
▪ REAL-TIME DATA STACK: Raw data streams (Tracking (Ninja / Hydra), Platform DB (Mongo), …), Real-time data processing (Lambda functions, λ), Real-time database (Online customer view), API gateway, consumers (Real-time recommender, Real-time segmentation, Other real-time applications)