These are the slides of my Berlin Buzzwords 2019 presentation, given on 2019-06-17.
This talk is essentially about the choices we made in designing and implementing our streaming system that can do behavioral analytics.
Note that the UserAgent analyzer used in this project is an open-source, Apache 2.0 licensed project of mine. See https://yauaa.basjes.nl for more information about this tool.
9. Measuring interaction
• What are we showing (and why)?
• How are our visitors responding?
• Pages
• Products/Offers
• Add to cart
• Purchase
• Advertising
• Inspiration
10. Use cases
• Dashboards
• Personalization
• Site optimization
• Fraud prevention
• Data Science
• …
15. Measurements are…
too old.
• Available once every 24 hours.
• So personalization is a ‘day behind’.
Useless inspiration:
I was interested in this YESTERDAY
17. Time is not always important
Crowd pattern analysis of website usage
Building a better website for future visitors: Batch
Individual pattern analysis of website usage
Supporting and advising the current visitor: Realtime
21. It’s really all about…
• Measuring
• Better
• Processing
• Faster
• Applications
• More relevant
22. Goals of “Measuring 2.0”
• Measure everything on our website
• All interactions (also AJAX)
• All channels (also mobile, email, …)
• All countries
• All details
• All visitors (also Googlebot)
• More reliable data
• Lowest possible load on the client
• Lowest possible latency (< 1 second)
23. Goals of “Measuring 2.0”
• Developer
• Easy to build
• Easy to validate
• Test automation
• Business
• Always measure everything
• Data is “independent”
• New questions are allowed
24. Goals of “Measuring 2.0”
• Privacy by design
• General Data Protection Regulation (GDPR)
• Algemene verordening gegevensbescherming (AVG), the Dutch name for the GDPR
• No long term profiling
• Security
• Avoid storing “login” info.
• Business
• Do long term profiling
25. THE goal of “Measuring 2.0”
Make the best possible
interaction data stream.
28. Where to measure?
• Measure where “it” happens.
• “In” the responsible “frontend” service!
• Webshop
• App API
• Basket Service
• Order Service
• …
• Usually NOT in the browser
29. Measure pages
• Serverside
• What is in the page
• Clientside (Javascript)
• What part was viewed
• Screen resolution
30. Measure orders
• Listen to Order events!
• with website/app sessionid.
• The “Order confirmation” page.
• Just a viewing of the order.
31. Record everything at the start
• Measure what “really” happens.
• Keep all relevant details
• Product: ProductId, Product type, …
• Offer: OfferId, ProductId, Price, Condition, SellerId, …
• Later joining on productid/offerid is “impossible”.
• Webshop caching
• Data volume / Extra latency
36. Event ordering matters
• Click banner, Buy product
• Banner WAS (possibly) part of reason to buy.
• Buy product, Click banner
• Banner WAS NOT part of reason to buy.
“WAS” or “WAS NOT” is based on
The ordering of the events
38. Pushdown automaton
• State machine
• with a memory stack
• Simple, low latency,
pattern detection
• Ordered events
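The "state machine with a memory stack" idea can be sketched in a few lines. This is a minimal illustration, not the talk's actual code: the event names ("CLICK_BANNER", "BUY") are hypothetical, and the stack is the automaton's memory that lets a later "buy" be attributed to an earlier banner click — which only works when the events arrive in order.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal pushdown-automaton sketch: detect whether a BUY event was
// preceded by a CLICK_BANNER event, using a stack as memory.
public class BannerAutomaton {
    private final Deque<String> stack = new ArrayDeque<>();

    // Feed one event; returns true when a BUY is attributed to a banner click.
    public boolean onEvent(String event) {
        switch (event) {
            case "CLICK_BANNER":
                stack.push(event);          // remember the click
                return false;
            case "BUY":
                return !stack.isEmpty()     // attributed only if a click came first
                        && stack.pop().equals("CLICK_BANNER");
            default:
                return false;
        }
    }
}
```

Because the decision depends purely on which event came first, a swapped pair of events flips the outcome — exactly the ordering problem the next slides show.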
39. Event ordering matters
• A fast temperature change is dangerous
• should alert IMMEDIATELY
• Delta stays in bounds
• Expect “Ordered”
Event prev = null;
Event curr;
while ((curr = newEvent()) != null) {
    if (prev != null && tooBig(curr, prev)) {
        sendAlert();
    }
    prev = curr;
}
[Chart: Temperature and Delta over time steps T1…T9, arriving in order]
This is a simple pushdown automaton.
40. Event ordering matters
• A fast temperature change is dangerous
• should alert IMMEDIATELY
Ordering problems
• Many false positives !
• Many false negatives !
[Chart: Temperature and Delta with events arriving out of order (T1 T5 T7 T4 T2 T8 T3 T6 T9); the "!" markers show false alerts]
41. Repairing event ordering
• Is hard
• Needless complexity
• Takes time
• Buffer for the maximum ‘out-of-orderness’ period.
• Several minutes
• We want really low latency
[Diagram: events 1…9 arriving out of order, re-ordered by a sliding time-based sort buffer]
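A sliding time-based sort buffer can be sketched as follows. This is a hypothetical, simplified version (Flink's watermarking does something similar internally): events are held until the high-water mark has advanced past them by the maximum allowed out-of-orderness, then released in timestamp order. The buffering period is latency you pay on every event, which is why the talk calls this approach slow.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Minimal sketch of a sliding time-based sort buffer.
// Events are long[] pairs: {timestamp, payload}.
public class SortBuffer {
    private final long maxOutOfOrderness;
    private long maxTimestampSeen = Long.MIN_VALUE;
    private final PriorityQueue<long[]> buffer =
            new PriorityQueue<>((a, b) -> Long.compare(a[0], b[0]));

    public SortBuffer(long maxOutOfOrderness) {
        this.maxOutOfOrderness = maxOutOfOrderness;
    }

    // Add one event; returns all events now safe to emit, in timestamp order.
    public List<long[]> add(long timestamp, long payload) {
        buffer.add(new long[]{timestamp, payload});
        maxTimestampSeen = Math.max(maxTimestampSeen, timestamp);
        long watermark = maxTimestampSeen - maxOutOfOrderness;
        List<long[]> ready = new ArrayList<>();
        while (!buffer.isEmpty() && buffer.peek()[0] <= watermark) {
            ready.add(buffer.poll());
        }
        return ready;
    }
}
```

Note the trade-off: a larger `maxOutOfOrderness` repairs more reordering but delays every event by up to that amount.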
42. Exactly once please
• At least once
• Need data deduplication
• Is hard
• Large memory buffer
• Idempotent output
• Takes time
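The "large memory buffer" problem of deduplication can be made concrete with a sketch (hypothetical code, not the talk's implementation): keep the last N event ids and drop repeats. The catch the slide points at is visible immediately: if the buffer is too small, an old duplicate falls out of the window and slips through.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch: "at least once" delivery turned into "exactly once"
// by remembering the last `capacity` event ids.
public class Deduplicator {
    private final Map<String, Boolean> seen;

    public Deduplicator(final int capacity) {
        // LinkedHashMap in insertion order, evicting the eldest entry
        // once the buffer is full.
        this.seen = new LinkedHashMap<String, Boolean>(16, 0.75f, false) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Boolean> e) {
                return size() > capacity;
            }
        };
    }

    // Returns true if this event id has not been seen (within the buffer).
    public boolean accept(String eventId) {
        return seen.put(eventId, Boolean.TRUE) == null;
    }
}
```

Making the buffer large enough for real traffic is exactly the memory cost the slide complains about.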
45. The measuring point
• Single entity
• single measuring instance
• Multiple instances
• Multiple output buffers
• Race conditions
• Ordering problems
46. The measuring point
In IOT:
• One temperature sensor
• one recording device
At bol.com
• One visitor
• Single webshop instance
• Session routing is a MUST have!!
• Not perfect!
• Impact negligible
• “View” measurements
• Orders
48. Message transport
• We need ordering per session: FIFO
• “Queue” or “Partitioned Queue”
• Session pinned to a specific partition
https://en.wikipedia.org/wiki/Queue_(abstract_data_type)
Partitioned Queue
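Pinning a session to a specific partition can be sketched like this. The code below is illustrative: Kafka's default key partitioner uses a different hash (murmur2), but the principle is identical — hashing the session id deterministically means every event of one session lands on the same partition, so the partition's FIFO guarantee gives per-session ordering.

```java
// Minimal sketch of session-to-partition pinning.
public class SessionPartitioner {
    public static int partitionFor(String sessionId, int numPartitions) {
        // Mask the sign bit so the modulo result is never negative.
        return (sessionId.hashCode() & 0x7fffffff) % numPartitions;
    }
}
```

One consequence worth noting: changing `numPartitions` reshuffles which partition each session maps to, so ordering only holds while the partition count is stable.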
49. Many “Queues” are not a Queue!
https://en.wikipedia.org/wiki/Java_Message_Service
https://stackoverflow.com/questions/16300353/activemq-lifo-ordering
50. Many “Queues” are not a Queue!
https://cloud.google.com/pubsub/docs/ordering
51. High volume partitioned queues
• Apache Kafka
• https://kafka.apache.org/
• Production ready
• Apache Pulsar
• https://pulsar.apache.org/
• The Flink connector is very new.
• Pravega
• http://pravega.io/
• Not yet production ready
• Amazon Kinesis
• Sorry, wrong cloud
• Microsoft Azure Event Hubs
• Sorry, wrong cloud
55. Measurement processing
Requirements
• Low latency
• Exactly once
• Ordering guarantees
• A pushdown automaton per session
• Keyed Stateful processing
• Where the ‘key’ is the ‘session id’
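"Keyed stateful processing where the key is the session id" can be sketched on a single worker (hypothetical code; in Flink, `keyBy(sessionId)` shards this state across the cluster and checkpoints it, which is what makes the approach scale): each session id owns its own independent piece of state.

```java
import java.util.HashMap;
import java.util.Map;

// Minimal single-worker sketch of keyed stateful processing:
// one independent counter per session id.
public class KeyedCounter {
    private final Map<String, Integer> stateBySession = new HashMap<>();

    // Process one event for a session; returns that session's event count.
    public int process(String sessionId) {
        return stateBySession.merge(sessionId, 1, Integer::sum);
    }
}
```

A per-session pushdown automaton fits the same shape: replace the `Integer` state with the automaton's state plus stack.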
56. Choosing a Processing toolkit
Apache Beam
• Low latency … except
• Exactly once by deduplication
• NO ordering guarantees
• NO natural keyed stateful processing
• “Dynamic” scaling
• Abstract Java API
• Runs on
• DataFlow
• Flink
57. Choosing a Processing toolkit
Apache Flink
• Low latency
• Exactly once
• Ordering guarantees
• Keyed Stateful processing
• “Fixed” scaling
• Easy Java API
• Runs on
• Hadoop
• Kubernetes
59. Applications change!
• New business
• New insights
• New wishes
• New scope
• New …
The records will
• get new fields
• have obsolete fields
60. [Diagram: multiple data producers → streaming interface → streaming applications → multiple data consumers]
The real payload is a “byte array”
Multiple Applications
Rolling upgrades
Canary releases
61. Kafka persists messages
• A message is retained until the TTL expires.
• So a topic will contain several message versions!
• With different fields
[Diagram: messages with schema versions V1…V4 coexisting in one topic]
62. So we need something to
• Serialize records into bytes
• Data types
• Nested records
• Bidirectional Schema evolution
65. Avro Message format
• Single record into bytes encoding
• Designed for evolving streaming applications
• Need schema database:
• Key = a 64-bit long (the schema id)
• Value = a String: the JSON representation of the schema
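The resulting on-the-wire framing can be sketched as follows. This is a hypothetical layout for illustration, not a wire-compatible Avro or schema-registry format: each message starts with the 64-bit schema id, followed by the serialized record bytes. A consumer reads the id first, looks the writer's schema up in the schema database, and only then decodes the rest.

```java
import java.nio.ByteBuffer;

// Minimal sketch of schema-id framing for a single-record message.
public class SchemaFraming {
    // message = [8-byte schema id][record bytes]
    public static byte[] frame(long schemaId, byte[] recordBytes) {
        return ByteBuffer.allocate(8 + recordBytes.length)
                .putLong(schemaId)
                .put(recordBytes)
                .array();
    }

    public static long schemaIdOf(byte[] message) {
        return ByteBuffer.wrap(message).getLong();
    }
}
```

Carrying the writer's schema id with every message is what makes bidirectional schema evolution work: old and new readers can each resolve any message version they encounter in the topic.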