This document describes the requirements for a platform that detects suspicious behavior in an organization. It involves three patterns:
1) Time-based aggregations, to detect behaviors like many login failures within a short time. This requires windowing and aggregating events.
2) Data enrichment, to report the details of alerts, e.g. fetching user profiles to identify the users involved. Side inputs allow querying external databases during event processing.
3) Dynamic processing, since rules change over time. Broadcast state stores the evolving rules and connects them to the user event streams for continuous checking.
As a running example for this session, we will use the case of an organization that uses 3rd-party software for its day-to-day operations, e.g. ...
And given that the organization grows, we want to build a platform for suspicious behavior detection.
Each action on these services ...
Focusing on the first, we can rephrase/generalize it by the ... (slide)
Same as the first...
Same as before
Other Use cases
Give me the number of tweet impressions per tweet for every hour/day/…
Calculate the average temperature over 10 minute intervals for each sensor in my warehouse
Aggregate user interaction data for my website to display on my internal dashboards
We observe that they all have a time constraint, which is reasonable, as we have a continuous stream of incoming data.
All of the above fall into the category of time-based aggregations. Now let’s see what Flink offers for these use cases...
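Before turning to Flink’s windowing APIs, the semantics can be sketched in a few lines of plain Python (this is not Flink code; the function and event layout are hypothetical): a keyed tumbling window that counts events per key, e.g. login failures per user per minute.

```python
from collections import defaultdict

def tumbling_counts(events, window_size):
    """Count events per (key, window) for a tumbling window of `window_size` seconds.

    `events` is an iterable of (timestamp, key) pairs; returns a dict
    mapping (key, window_start) -> count, mimicking a keyed tumbling window.
    """
    counts = defaultdict(int)
    for ts, key in events:
        # Each event belongs to exactly one non-overlapping window.
        window_start = (ts // window_size) * window_size
        counts[(key, window_start)] += 1
    return dict(counts)

# Three failed logins by "alice" inside one 60-second window, one in the next:
events = [(0, "alice"), (10, "alice"), (59, "alice"), (61, "alice"), (5, "bob")]
result = tumbling_counts(events, 60)
# result[("alice", 0)] == 3, result[("alice", 60)] == 1, result[("bob", 0)] == 1
```

An alert rule such as “more than N failures per window” is then just a threshold check over these counts; Flink’s windowed aggregations compute the same thing incrementally over an unbounded stream.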
Now you have the alert, but you only have the userID and, for example, the documentID of the doc that was shared.
Other Use cases
Enrich user events with known user data
Add geolocation information to geotagged events
NOTE: The fact that this happens for each incoming element puts the enrichment process in the critical path of your application... so be careful.
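One common way to keep the per-event lookup off the critical path is to cache recent results. A plain-Python sketch (the “database” here is just a dict, and all names are hypothetical, not a real Flink API):

```python
from functools import lru_cache

# Stand-in for an external user-profile store (hypothetical data).
USER_DB = {"u1": {"name": "Alice", "dept": "Finance"},
           "u2": {"name": "Bob", "dept": "Engineering"}}

@lru_cache(maxsize=1024)
def fetch_profile(user_id):
    # In a real job this would be a (possibly asynchronous) query to an
    # external store; caching keeps repeated lookups off the per-event path.
    return USER_DB.get(user_id)

def enrich(alert):
    """Attach the user profile to an alert that only carries a userID."""
    return {**alert, "user": fetch_profile(alert["userID"])}

alert = enrich({"userID": "u1", "documentID": "d42"})
# alert["user"]["name"] == "Alice"; the original fields are preserved
```

The trade-off is staleness: cached profiles may lag behind the database, which is one motivation for the changelog-based approach described next.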
Finally, a more pure stream-y approach is to “connect” your main stream with the stream containing the changelog of your enrichment data, keep the enrichment data in state managed by Flink, and use this state for the actual enrichment, as shown in the figure. Once again, Flink guarantees that the state will be fault tolerant.
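The mechanics of this connected-stream enrichment can be simulated in plain Python (a sketch of the semantics only; in Flink the `state` dict below would be Flink-managed, fault-tolerant state):

```python
def run(merged):
    """Process a time-ordered stream of ("changelog" | "event", payload) pairs.

    Changelog records upsert into local enrichment state; main events are
    enriched from whatever state has been received so far.
    """
    state = {}   # stands in for Flink-managed keyed state: userID -> profile
    out = []
    for kind, payload in merged:
        if kind == "changelog":
            user_id, profile = payload
            state[user_id] = profile          # upsert from the changelog stream
        else:
            out.append({**payload, "user": state.get(payload["userID"])})
    return out

out = run([
    ("changelog", ("u1", {"name": "Alice"})),
    ("event", {"userID": "u1", "documentID": "d42"}),
])
# out[0]["user"]["name"] == "Alice"
```

Note that an event arriving before its changelog record is enriched with `None`; handling that race (buffering, defaults) is a design decision the real job has to make.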
Other use cases
Update of processing rules via DSL, think dynamic fraud-detection rules/policies
Live-update of machine learning models
Imagine that we have our stream of user actions. In the figure, this is the top stream of objects of different colors and shapes, with the color representing the userID and the shape, the type of action.
Now we want to find pairs of actions of the same user (color) that follow a certain pattern, e.g. a rectangle followed by a triangle (i.e. a login from location A followed by a login from location B).
In addition, the set of interesting patterns evolves over time.
In this case, we would have our stream of user actions (streamA) and our rules (streamB), and we want to feed these streams into our green operator of parallelism 3, which will detect the matching sequences.
We want the matches to have objects of the same color, so we first partition our data stream by the color of each object, using the keyBy color. This will give us a keyed stream, where elements are partitioned by color, as shown in the figure.
Then, given that we want to detect pairs of objects, we need to store somewhere each matching first element. Given that our stream is now keyed, we can use Flink’s keyed state for that, as shown in the figure.
Now let’s move on to the second stream, the one containing our rules. We want those rules to be applied to all the objects of streamA, i.e. all the colors. For this, we need to broadcast the rules to all the parallel tasks of our operator.
And, as before, we need to be able to store these rules for future use. This is where the new type of state comes into play, as it allows us to store the elements of a broadcast stream, as shown in the figure.
Now that our operator has the necessary data from both streams, data and rules, it needs to be able to connect the two, i.e. each side needs to be able to ”see” the state of the other.
This is done so that when the yellow triangle arrives, the rules will be read from the broadcast state, the already-received yellow rectangle will be read from the keyed state, and each rule will be evaluated.
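The whole pattern just described, keyed state holding the pending first element and broadcast state holding the rules, can be simulated in plain Python (a sketch of the semantics only; rule and payload shapes are hypothetical):

```python
def process(stream):
    """Process ("rule", (name, first, second)) and ("action", (color, shape)) records.

    Rules land in a shared rule store (the "broadcast state"); for each color
    (the key), the first element of a potential match is kept in keyed state
    until the matching second element arrives.
    """
    rules = {}         # broadcast state: rule name -> (first_shape, second_shape)
    keyed_state = {}   # keyed state: (color, rule name) -> pending first element
    matches = []
    for kind, payload in stream:
        if kind == "rule":
            name, first, second = payload
            rules[name] = (first, second)
        else:
            color, shape = payload
            for name, (first, second) in rules.items():
                if shape == second and keyed_state.pop((color, name), None):
                    matches.append((color, name))      # pair completed
                elif shape == first:
                    keyed_state[(color, name)] = shape  # remember first element
    return matches

matches = process([
    ("rule", ("r1", "rectangle", "triangle")),
    ("action", ("yellow", "rectangle")),
    ("action", ("green", "triangle")),   # no pending rectangle for green: no match
    ("action", ("yellow", "triangle")),  # completes the yellow pair
])
# matches == [("yellow", "r1")]
```

In Flink, `rules` would be the broadcast state (identical on every parallel task) and `keyed_state` would be per-key state scoped by the `keyBy` on color.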
The grey requirements are the ones that Flink offers out of the box, without broadcast state.
KEYBY: ...
BROADCAST: Then, for the rule stream, we will do a broadcast with a MapStateDescriptor. This will broadcast the elements in the stream to all downstream tasks and create the state to store them. As we will also see later, broadcast state has a map format, so it stores key/value pairs. In this case we can have a String representing the name of the rule (or an identifier) and a list of all the currently accepted but not yet matched elements.
CONNECT: Finally, we will connect the keyed stream with the non-keyed, broadcast stream and call process() on the result with the function containing our matching logic.
As said earlier, we use a MapStateDescriptor in the broadcast() call, as the broadcast state has a map format.
In addition, as shown in the previous slide, we connect() the keyed with the broadcast stream. For those of you familiar with Flink’s APIs, this means that your function will have “two sides”, each describing how to react to an incoming element from one of the two streams.
In the case of the broadcast state pattern, the broadcast side has read-write access to the broadcast state, while the non-broadcast side has only read access.
The reason for that is that ...
Broadcast state does NOT mean that whatever your function does on one parallel instance (task), gets sent to all other parallel instances. So make sure that your computation on an element is the same across all instances.
...
Finally, the KeyedBroadcastProcessFunction has the onTimer() method, which contains the logic to execute when a timer fires. As we will see later, when operating on the keyed side of our KeyedBroadcastProcessFunction, we have access to an internal timerService, which allows us to register timers in either event or processing time. This is also aligned with the ProcessFunction and KeyedProcessFunction offered by Flink.
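The onTimer() semantics can be sketched with a tiny simulation (plain Python, all names hypothetical, not Flink’s actual timer service): timers are registered per key, and every timer with a timestamp at or before the current time fires its callback.

```python
import heapq

class TimerService:
    """Minimal sketch of per-key timers: register, then fire on time advance."""

    def __init__(self):
        self._timers = []                    # min-heap of (timestamp, key)

    def register(self, timestamp, key):
        heapq.heappush(self._timers, (timestamp, key))

    def advance_to(self, now, on_timer):
        # Fire every timer whose timestamp is <= the current (simulated) time.
        while self._timers and self._timers[0][0] <= now:
            ts, key = heapq.heappop(self._timers)
            on_timer(ts, key)                # stands in for onTimer()

fired = []
svc = TimerService()
svc.register(100, "yellow")
svc.register(50, "green")
svc.advance_to(60, lambda ts, key: fired.append(key))
# fired == ["green"]; the "yellow" timer at t=100 is still pending
```

In the rule-matching job, such timers are how you would, for example, evict a pending first element that was never matched within some timeout.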
To guarantee that the contents of the Broadcast State are the same across all parallel instances of our operator, we give read-write access only to the broadcast side,
And we require the computation on each incoming element to be identical across all tasks.
On the non-broadcast side, apart from the access to the broadcast state, the processElement() can do the same stuff as in the normal ProcessFunction.
And all of them can emit to side outputs and query the timestamp of the element, the current processing time, and the current watermark.
(Keep this slide up during the Q&A part of your talk. Having this up in the final 5-10 minutes of the session gives the audience something useful to look at.)