Presenter: Gordon Tai
Video Link: https://www.youtube.com/watch?v=Uho24uN1YZQ
Flink.tw Meetup Event (2016/07/19):
"Stream Processing with Apache Flink w/ Flink PMC Robert Metzger"
2. 00 This Talk is About ...
● How FlinkCEP got me interested in Flink
● CEP use cases & applications
○ Use case study #1: tracking an order process
○ Use case study #2: advertisement targeting
● A look at the API
3. 00 Me & Flink
● Tzu-Li (Gordon) Tai (戴資力)
● Data Engineer @ VMFive
● Java, Scala
● Using Flink as a user on VMFive’s Adtech platform
● Enjoys working on distributed computing systems
● Works on Flink during free time
● Contributor: Flink Kinesis Consumer connector
4. Tale of a Data Engineer trying to figure
out how to build up a streaming analytics
pipeline ...
1. First lesson: non-trivial streaming applications are never stateless
2. Second lesson: stateful streaming topologies are a pain
5. Applications I was working on:
streaming aggregation for reporting &
conversion patterns for alerting
1. Exactly-once state updates on failures for correctness
2. Idempotence w.r.t. external state stores
3. Out-of-order events
4. Aggregating on time windows
5. Rapid application development
6. TL;DR. It isn’t fun. At all.
● Reference: Building a Stream Processing System for Playable Ads Data at VMFive @ HadoopCon 2015
● Redis was used as an external state store
● All state updates had to be idempotent
● Exactly-once & replay on failover implemented with Storm’s tuple acking mechanism
7. 01 Complex Event Processing
● Generate derived events when a specified pattern of raw events occurs in a data stream
○ if A and then B → infer complex event C
● Goal: identify meaningful event patterns and respond to them as quickly as possible
● Demanding on the stream processor: robust state handling & out-of-order event support, while keeping low latency at high throughput
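As a toy illustration of the "if A and then B → infer C" idea (plain Scala, deliberately not the Flink API), a matcher can scan a time-ordered event sequence and emit an inferred complex event for every A that is later followed by a B:

```scala
// Toy CEP illustration: infer a complex event whenever an event of kind
// "A" is followed (later in the stream) by an event of kind "B".
case class RawEvent(kind: String, timestamp: Long)

// Returns one (A-timestamp, B-timestamp) pair per inferred complex event.
def inferComplex(events: Seq[RawEvent]): Seq[(Long, Long)] =
  for {
    (a, i) <- events.zipWithIndex if a.kind == "A"
    b      <- events.drop(i + 1).find(_.kind == "B")
  } yield (a.timestamp, b.timestamp)

val stream   = Seq(RawEvent("A", 1L), RawEvent("X", 2L), RawEvent("B", 3L))
val inferred = inferComplex(stream) // one complex event: A@1 followed by B@3
```

A real CEP engine does this incrementally over an unbounded stream with state and timers; the batch scan above only shows the matching semantics.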
8. 02 Apache Flink CEP Library
● Built upon Flink’s DataStream API
● Allows users to define patterns, inject them into event streams, and generate new event streams based on the patterns
● Exploits Flink’s exactly-once semantics for guaranteed correctness
9. eCommerce Order Process Tracking
Use case study #1
** Note: the illustrations & content in this section are from Data Artisans’ presentation:
Streaming Analytics & CEP - Two Sides of the Same Coin?
10. 03 Order Tracking Data Model
● Order(orderId, tStamp, “received”) extends Event
● Shipment(orderId, tStamp, “shipped”) extends Event
● Delivery(orderId, tStamp, “delivered”) extends Event
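The model above can be sketched as plain Scala case classes. The field types and the shared `status` field are assumptions, chosen so that the later `where(_.status == ...)` clauses in the pattern definition would type-check:

```scala
// Order-tracking event model from the slide (types are assumptions).
sealed trait Event {
  def orderId: Long
  def tStamp: Long
  def status: String
}
case class Order(orderId: Long, tStamp: Long, status: String = "received") extends Event
case class Shipment(orderId: Long, tStamp: Long, status: String = "shipped") extends Event
case class Delivery(orderId: Long, tStamp: Long, status: String = "delivered") extends Event

val order: Event = Order(orderId = 42L, tStamp = 1000L)
```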
12. 05 Glimpse at the FlinkCEP API
val processingPattern = Pattern
  .begin[Event]("orderReceived").subtype(classOf[Order])
  .followedBy("orderShipped").where(_.status == "shipped")
  .within(Time.hours(1))

val processingPatternStream = CEP.pattern(
  input.keyBy("orderId"),
  processingPattern)

val procResult: DataStream[Either[ProcessWarn, ProcessSucc]] =
  processingPatternStream.select {
    (pP, timestamp) => // Timeout handler
      ProcessWarn(pP("orderReceived").orderId, timestamp)
  } {
    fP => // Select function
      ProcessSucc(
        fP("orderReceived").orderId, fP("orderShipped").tStamp,
        fP("orderShipped").tStamp - fP("orderReceived").tStamp)
  }
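The branching behind `select` — a success event if the shipment arrives within the window, a timeout warning otherwise — can be sketched as a pure function. The types below are toy stand-ins named after the slide's `ProcessWarn` / `ProcessSucc`, not Flink code:

```scala
// Toy result types mirroring the slide's names (field layout assumed).
case class ProcessWarn(orderId: Long, timeoutStamp: Long)
case class ProcessSucc(orderId: Long, shippedStamp: Long, durationMs: Long)

// If the shipment timestamp exists and falls within the allowed window,
// emit a success carrying the processing duration; otherwise a warning.
def classify(orderId: Long, receivedStamp: Long, shippedStamp: Option[Long],
             windowMs: Long): Either[ProcessWarn, ProcessSucc] =
  shippedStamp match {
    case Some(s) if s - receivedStamp <= windowMs =>
      Right(ProcessSucc(orderId, s, s - receivedStamp))
    case _ =>
      Left(ProcessWarn(orderId, receivedStamp + windowMs))
  }

val oneHourMs = 60L * 60 * 1000
val ok   = classify(7L, 0L, Some(30L * 60 * 1000), oneHourMs) // shipped in 30 min
val late = classify(8L, 0L, None, oneHourMs)                  // timed out
```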
13. 06 Glimpse at the FlinkCEP API
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
val input: DataStream[Event] = env.addSource(new FlinkKafkaConsumer09(...))
val processingPattern = Pattern.begin(...)...
val processingPatternStream = CEP.pattern(input.keyBy("orderId"), processingPattern)
val procResult: DataStream[Either[ProcessWarn, ProcessSucc]] = processingPatternStream.select(...)
procResult.addSink(new RedisSink(...))
// .addSink(new FlinkKafkaProducer09(...))
// .addSink(new ElasticsearchSink(...))
// .map(new MapFunction{...})
// … anything you’d like to continue to do with the inferred event stream
env.execute()
14. 07 Glimpse at the FlinkCEP API
val env: StreamExecutionEnvironment = StreamExecutionEnvironment.getExecutionEnvironment
env.setStreamTimeCharacteristic(TimeCharacteristic.EventTime)
val input: DataStream[Event] = env
.addSource(new FlinkKafkaConsumer09(...))
.assignTimestampsAndWatermarks(new CustomExtractor)
val processingPattern = Pattern.begin(...)...
val processingPatternStream = CEP.pattern(input.keyBy("orderId"), processingPattern)
val procResult: DataStream[Either[ProcessWarn, ProcessSucc]] = processingPatternStream.select(...)
procResult.addSink(new RedisSink(...))
// .addSink(new FlinkKafkaProducer09(...))
// .addSink(new ElasticsearchSink(...))
// .map(new MapFunction{...})
env.execute()
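The `CustomExtractor` above is left unspecified in the slides. One common strategy it might implement is bounded out-of-orderness (the idea behind Flink's `BoundedOutOfOrdernessTimestampExtractor`): the watermark trails the largest timestamp seen so far by a fixed allowed lateness. Sketched here without any Flink dependency:

```scala
// Toy bounded-out-of-orderness watermarking: late events do not move the
// watermark backwards; it always trails the max timestamp seen so far.
class BoundedLatenessWatermarks(maxOutOfOrdernessMs: Long) {
  private var maxTimestamp = Long.MinValue

  // Called once per event; returns the event's own timestamp.
  def extractTimestamp(eventTimestamp: Long): Long = {
    maxTimestamp = math.max(maxTimestamp, eventTimestamp)
    eventTimestamp
  }

  // Current watermark: "no event with a smaller timestamp is expected".
  def currentWatermark: Long = maxTimestamp - maxOutOfOrdernessMs
}

val wm = new BoundedLatenessWatermarks(maxOutOfOrdernessMs = 5000L)
wm.extractTimestamp(10000L)
wm.extractTimestamp(8000L) // late event: watermark does not regress
val watermark = wm.currentWatermark // 10000 - 5000 = 5000
```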
15. 08 Combining Stream SQL & CEP
● Further reading: Streaming Analytics & CEP - Two Sides of the Same Coin?
16. Ad Targeting based on User Attribution
Use case study #2
** Note: the content in this section is heavily based on my experience at VMFive
17. 09 Ad Targeting 101
● What an ad server does, in a nutshell → determine an appropriate advertisement, chosen from an advertisement campaign pool, for each incoming ad request
AdServer
Campaign Pool
(1) request advertisement
(2) return appropriate advertisement info from campaign pool
● “appropriate”: fulfills the targeting rules of each campaign
18. 10 Ad Targeting Rule Types
● Fundamental campaign targeting rule types:
○ Target users’ current location, ex. users in Taipei
○ Target specific user device type, ex. tablet or phone
○ ...
● Advanced campaign targeting rule types:
○ Target user’s past location trace, ex. in Taipei for the past 7 days
○ Target users entering / departing countries
○ Target users with specific attribution, ex. viewed
○ ...
19. 11 Ad Targeting Rule Types
● Fundamental campaign targeting rule types:
○ Target users’ current location, ex. users in Taipei
○ Target specific user device type, ex. tablet or phone
○ ...
● Advanced campaign targeting rule types:
○ Target user’s past location trace, ex. in Taipei for the past 7 days
○ Target users entering / departing countries
○ Target users with specific attribution, ex. viewed
○ ...
● Does not require event aggregation
● The rules can be matched simply
based on info at request time
● Requires aggregation of historical events
● Aggregating at request time will be far too slow
● Requires inferring complex events from patterns in
raw event stream → CEP to the rescue!
20. 12 Basic Ad Targeting Architecture
Campaign Pool
Targeting Cache
Ad Targeter
register ad campaigns
Event Logger
WebService
AdServer
Data Warehouse
(1) initial connection
21. 12 Basic Ad Targeting Architecture
Campaign Pool
Targeting Cache
Ad Targeter
Event Logger
WebService
AdServer
Data Warehouse
(2) fetch ad
22. 12 Basic Ad Targeting Architecture
Ad Targeter
Event Logger
WebService
AdServer
Data Warehouse
Raw Logs
Event Bus Service
Reporting & analytics
services
Batch
Streaming
...
Campaign Pool
Targeting Cache
(3) event tracking
23. 13 Advanced Ad Targeting Architecture
Ad Targeter
Event Logger
WebService
AdServer
Data Warehouse
Raw Logs
Event Bus Service
Reporting & analytics
services
Batch
Streaming
...
RulesService
Campaign Pool
Targeting Cache
CEP
24. 13 Advanced Ad Targeting Architecture
Data Warehouse
Raw Logs
Event Bus Service
Batch
Streaming
...
RulesService
CEP
CEP-Rule Templates
Rule Fulfillment Cache (Redis)
Entry / Depart
User Attribution
...
(1) Inject a rule to start matching on the event stream
(2) Return Rule ID
(3) Submit CEP topology
25. 13 Advanced Ad Targeting Architecture
Data Warehouse
Raw Logs
Event Bus Service
Batch
Streaming
...
RulesService
CEP
CEP-Rule Templates
Rule Fulfillment Cache (Redis)
Entry / Depart
User Attribution
...
(4) When a CEP pattern is fulfilled, write to cache: UID → RuleID
(5) Look up whether a UID has fulfilled a RuleID
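Steps (4) and (5) boil down to a keyed-set cache (Redis in the deck, e.g. a set per UID). A toy in-memory stand-in, with hypothetical method names, keeping the UID → RuleID contract:

```scala
import scala.collection.mutable

// Toy stand-in for the Redis rule-fulfillment cache: UID → set of RuleIDs.
class RuleFulfillmentCache {
  private val fulfilled = mutable.Map.empty[String, mutable.Set[String]]

  // (4) Called when a CEP pattern fires for a user.
  def markFulfilled(uid: String, ruleId: String): Unit =
    fulfilled.getOrElseUpdate(uid, mutable.Set.empty) += ruleId

  // (5) Called by the ad targeter at request time.
  def hasFulfilled(uid: String, ruleId: String): Boolean =
    fulfilled.get(uid).exists(_.contains(ruleId))
}

val cache = new RuleFulfillmentCache
cache.markFulfilled(uid = "user-1", ruleId = "in-taipei-7d")
cache.hasFulfilled("user-1", "in-taipei-7d") // true
cache.hasFulfilled("user-2", "in-taipei-7d") // false
```

With Redis this maps naturally onto a set per UID (add on fulfillment, membership check at request time), which keeps the request-time lookup O(1) instead of aggregating history per request.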
26. 13 Advanced Ad Targeting Architecture
Ad Targeter
register ad campaigns
Event Logger
WebService
AdServer
Data Warehouse
Raw Logs
Event Bus Service
Reporting & analytics
services
Batch
Streaming
...
RulesService
Campaign Pool
Targeting Cache
CEP
(1) register rule for campaign
(2) look up whether a user fulfils a rule
27. 14 Some Discussion
● Why a fixed pool of CEP-Rule Templates?
○ Prevent rogue rules from matching, ex. rules that would consume too many resources
○ It’s a lot less work and complication ;)
● Would be very nice to have a freestyle rule service
○ Pattern matching across different event streams of an organization
○ For BI, there will be arbitrarily complex events / patterns that analysts want to monitor
● Further study for similar use case: King’s RBEA
○ RBEA: Rule-Based Event Aggregator
○ https://techblog.king.com/rbea-scalable-real-time-analytics-king/
○ http://data-artisans.com/rbea-scalable-real-time-analytics-at-king/
29. XX Closing
● Complex Event Processing is an emerging way to draw insights from data streams, and it demands exactly-once semantics from the underlying stream processor for correctness
● FlinkCEP builds on the DataStream API to make this both possible and easy