This document describes the requirements for a platform that detects suspicious behavior in an organization. It involves three patterns:
1) Time-based aggregations, to detect behaviors like many login failures within a short time. This requires windowing and aggregating events.
2) Data enrichment, to report the details of alerts, e.g. fetching user profiles to identify the users involved. Side inputs allow querying external databases during event processing.
3) Dynamic processing, since rules change over time. Broadcast state stores the evolving rules and connects them to the user event streams for continuous checking.
As a running example for this session, we will use the case of an organization that uses 3rd-party software for its day-to-day operations, e.g. ...
And given that the organization grows, we want to build a platform for suspicious behavior detection.
Each action on these services ...
Focusing on the first, we can rephrase/generalize it by the ... (slide)
Same as the first...
Same as before
Other Use cases
Give me the number of tweet impressions per tweet for every hour/day/…
Calculate the average temperature over 10 minute intervals for each sensor in my warehouse
Aggregate user interaction data for my website to display on my internal dashboards
We observe that they all have a time constraint, which is reasonable, as we have a continuous stream of incoming data.
All of the above fall into the category of time-based aggregations. Now let’s see what Flink offers for these use cases...
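Before turning to Flink’s windowing APIs, the semantics can be sketched in a few lines of plain Python (this is not Flink code; the function and event layout are hypothetical): a keyed tumbling window that counts events per key, e.g. login failures per user per minute.

```python
from collections import defaultdict

def tumbling_counts(events, window_size):
    """Count events per (key, window) for a tumbling window of `window_size` seconds.

    `events` is an iterable of (timestamp, key) pairs; returns a dict
    mapping (key, window_start) -> count, mimicking a keyed tumbling window.
    """
    counts = defaultdict(int)
    for ts, key in events:
        # Each event belongs to exactly one non-overlapping window.
        window_start = (ts // window_size) * window_size
        counts[(key, window_start)] += 1
    return dict(counts)

# Three failed logins by "alice" inside one 60-second window, one in the next:
events = [(0, "alice"), (10, "alice"), (59, "alice"), (61, "alice"), (5, "bob")]
result = tumbling_counts(events, 60)
# result[("alice", 0)] == 3, result[("alice", 60)] == 1, result[("bob", 0)] == 1
```

An alert rule such as “more than N failures per window” is then just a threshold check over these counts; Flink’s windowed aggregations compute the same thing incrementally over an unbounded stream.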
Now you have the alert, but you only have the userID and, for example, the documentID of the doc that was shared.
Other Use cases
Enrich user events with known user data
Add geolocation information to geotagged events
NOTE: The fact that this happens for each incoming element puts the enrichment process in the critical path of your application... so be careful.
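One common way to keep the per-event lookup off the critical path is to cache recent results. A plain-Python sketch (the “database” here is just a dict, and all names are hypothetical, not a real Flink API):

```python
from functools import lru_cache

# Stand-in for an external user-profile store (hypothetical data).
USER_DB = {"u1": {"name": "Alice", "dept": "Finance"},
           "u2": {"name": "Bob", "dept": "Engineering"}}

@lru_cache(maxsize=1024)
def fetch_profile(user_id):
    # In a real job this would be a (possibly asynchronous) query to an
    # external store; caching keeps repeated lookups off the per-event path.
    return USER_DB.get(user_id)

def enrich(alert):
    """Attach the user profile to an alert that only carries a userID."""
    return {**alert, "user": fetch_profile(alert["userID"])}

alert = enrich({"userID": "u1", "documentID": "d42"})
# alert["user"]["name"] == "Alice"; the original fields are preserved
```

The trade-off is staleness: cached profiles may lag behind the database, which is one motivation for the changelog-based approach described next.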
Finally, a more pure stream-y approach is to “connect” your main stream with the stream containing the changelog of your enrichment data, keep the enrichment data in state managed by Flink, and use this state for the actual enrichment, as shown in the figure. Once again, Flink guarantees that the state will be fault tolerant.
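The mechanics of this connected-stream enrichment can be simulated in plain Python (a sketch of the semantics only; in Flink the `state` dict below would be Flink-managed, fault-tolerant state):

```python
def run(merged):
    """Process a time-ordered stream of ("changelog" | "event", payload) pairs.

    Changelog records upsert into local enrichment state; main events are
    enriched from whatever state has been received so far.
    """
    state = {}   # stands in for Flink-managed keyed state: userID -> profile
    out = []
    for kind, payload in merged:
        if kind == "changelog":
            user_id, profile = payload
            state[user_id] = profile          # upsert from the changelog stream
        else:
            out.append({**payload, "user": state.get(payload["userID"])})
    return out

out = run([
    ("changelog", ("u1", {"name": "Alice"})),
    ("event", {"userID": "u1", "documentID": "d42"}),
])
# out[0]["user"]["name"] == "Alice"
```

Note that an event arriving before its changelog record is enriched with `None`; handling that race (buffering, defaults) is a design decision the real job has to make.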
Other use cases
Update of processing rules via DSL, think dynamic fraud-detection rules/policies
Live-update of machine learning models
Imagine that we have our stream of user actions. In the figure, this is the top stream of objects of different colors and shapes, with the color representing the userID and the shape, the type of action.
Now we want to find pairs of actions of the same user (color) that follow a certain pattern, e.g. a rectangle followed by a triangle (i.e. a login from location A followed by a login from location B).
In addition, the set of interesting patterns evolves over time.
In this case, we would have our stream of user actions (streamA) and our rules (streamB), and we want to feed these streams into our green operator of parallelism 3, which will detect the matching sequences.
We want the matches to have objects of the same color, so we first partition our data stream by the color of each object, using the keyBy color. This will give us a keyed stream, where elements are partitioned by color, as shown in the figure.
Then, given that we want to detect pairs of objects, we need to store somewhere each matching first element. Given that our stream is now keyed, we can use Flink’s keyed state for that, as shown in the figure.
Now let’s move on to the second stream, the one containing our rules. We want those rules to be applied to all the objects of streamA, i.e. all the colors. For this, we need to broadcast the rules to all the parallel tasks of our operator.
And, as before, we need to be able to store these rules for future use. This is where the new type of state comes into play, as it allows us to store the elements of a broadcast stream, as shown in the figure.
Now that our operator has the necessary data from both streams, data and rules, it needs to be able to connect the two, i.e. each side needs to be able to ”see” the state of the other.
This is done so that when the yellow triangle arrives, the rules will be read from the broadcast state, the already-received yellow rectangle will be read from the keyed state, and each rule will be evaluated.
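The whole pattern just described, keyed state holding the pending first element and broadcast state holding the rules, can be simulated in plain Python (a sketch of the semantics only; rule and payload shapes are hypothetical):

```python
def process(stream):
    """Process ("rule", (name, first, second)) and ("action", (color, shape)) records.

    Rules land in a shared rule store (the "broadcast state"); for each color
    (the key), the first element of a potential match is kept in keyed state
    until the matching second element arrives.
    """
    rules = {}         # broadcast state: rule name -> (first_shape, second_shape)
    keyed_state = {}   # keyed state: (color, rule name) -> pending first element
    matches = []
    for kind, payload in stream:
        if kind == "rule":
            name, first, second = payload
            rules[name] = (first, second)
        else:
            color, shape = payload
            for name, (first, second) in rules.items():
                if shape == second and keyed_state.pop((color, name), None):
                    matches.append((color, name))      # pair completed
                elif shape == first:
                    keyed_state[(color, name)] = shape  # remember first element
    return matches

matches = process([
    ("rule", ("r1", "rectangle", "triangle")),
    ("action", ("yellow", "rectangle")),
    ("action", ("green", "triangle")),   # no pending rectangle for green: no match
    ("action", ("yellow", "triangle")),  # completes the yellow pair
])
# matches == [("yellow", "r1")]
```

In Flink, `rules` would be the broadcast state (identical on every parallel task) and `keyed_state` would be per-key state scoped by the `keyBy` on color.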
The grey requirements are the ones that Flink offers out of the box, without broadcast state.
KEYBY: ...
BROADCAST: Then, for the rule stream, we will do a broadcast with a MapStateDescriptor. This will broadcast the elements in the stream to all downstream tasks and create the state to store them. As we will also see later, broadcast state has a map format, so it stores key/value pairs. In this case we can have a String representing the name of the rule (or an identifier) and a list of all the currently accepted but not yet matched elements.
CONNECT: Finally, we will connect the keyed stream with the non-keyed, broadcast stream and call process() on the result with the function containing our matching logic.
As said earlier, we use a MapStateDescriptor in the broadcast() call, as the broadcast state has a map format.
In addition, as shown in the previous slide, we connect() the keyed with the broadcast stream. For those of you familiar with Flink’s APIs, this means that your function will have “two sides”, each describing how to react to an incoming element from one of the two streams.
In the case of the broadcast state pattern, the broadcast side has read-write access to the broadcast state, while the non-broadcast side has only read access.
The reason for that is that ...
Broadcast state does NOT mean that whatever your function does on one parallel instance (task), gets sent to all other parallel instances. So make sure that your computation on an element is the same across all instances.
...
Finally, the KeyedBroadcastProcessFunction has the onTimer() method, which contains the logic to execute when a timer fires. As we will see later, when operating on the keyed side of our KeyedBroadcastProcessFunction, we have access to an internal timerService, which allows us to register timers in either event or processing time. This is also aligned with the ProcessFunction and KeyedProcessFunction offered by Flink.
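The onTimer() semantics can be sketched with a tiny simulation (plain Python, all names hypothetical, not Flink’s actual timer service): timers are registered per key, and every timer with a timestamp at or before the current time fires its callback.

```python
import heapq

class TimerService:
    """Minimal sketch of per-key timers: register, then fire on time advance."""

    def __init__(self):
        self._timers = []                    # min-heap of (timestamp, key)

    def register(self, timestamp, key):
        heapq.heappush(self._timers, (timestamp, key))

    def advance_to(self, now, on_timer):
        # Fire every timer whose timestamp is <= the current (simulated) time.
        while self._timers and self._timers[0][0] <= now:
            ts, key = heapq.heappop(self._timers)
            on_timer(ts, key)                # stands in for onTimer()

fired = []
svc = TimerService()
svc.register(100, "yellow")
svc.register(50, "green")
svc.advance_to(60, lambda ts, key: fired.append(key))
# fired == ["green"]; the "yellow" timer at t=100 is still pending
```

In the rule-matching job, such timers are how you would, for example, evict a pending first element that was never matched within some timeout.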
To guarantee that the contents of the Broadcast State are the same across all parallel instances of our operator, we give read-write access only to the broadcast side,
And we require the computation on each incoming element to be identical across all tasks.
On the non-broadcast side, apart from the access to the broadcast state, the processElement() can do the same stuff as in the normal ProcessFunction.
And all of them can emit to side outputs and query the timestamp of the element, the current processing time, and the current watermark.
(Keep this slide up during the Q&A part of your talk. Having this up in the final 5-10 minutes of the session gives the audience something useful to look at.)