Storm at Forter
Picture https://www.flickr.com/photos/silentmind8/15865860242 by silentmind8 under CC BY 2.0 http://creativecommons.org/licenses/by/2.0/
• We detect fraud
• A lot of data is collected
• New data can introduce new data sources
• At transaction time, we do our magic. Fast.
• We deny less
What’s Storm?
• Streaming/data-pipeline infrastructure
• What’s a pipeline?
• “Topology” driven flow, static
• Written on the JVM; also supports Python and Node.js
• Easy clustering
• Apache top level project, large community
Storm Lingo
• Tuples
• The basic data transfer object in Storm. Basically a dictionary (key->val).
• Spouts
• Entry points into the pipe. This is where data comes from.
• Bolts
• Components that can transform and route tuples
• Joins
• Joins are where async branches of the topology meet and merge
• Streams
• Streams allow for flow control in the topology (wiring sketch after this slide)
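To make the lingo concrete, here is a minimal wiring sketch in Java (using the org.apache.storm 2.x API); TransactionSpout, EnrichBolt and ErrorBolt are hypothetical stand-ins for real components, not Forter's actual topology.

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class LingoExampleTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // Spout: the entry point, where tuples enter the pipe.
        builder.setSpout("tx-spout", new TransactionSpout(), 1);

        // Bolt: transforms/routes tuples; fieldsGrouping keys the stream so
        // all tuples with the same "customerId" reach the same task.
        builder.setBolt("enrich-bolt", new EnrichBolt(), 4)
               .fieldsGrouping("tx-spout", new Fields("customerId"));

        // A second bolt subscribing to a named stream ("errors") of the first.
        builder.setBolt("error-bolt", new ErrorBolt(), 1)
               .shuffleGrouping("enrich-bolt", "errors");

        Config conf = new Config();
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("lingo-example", conf, builder.createTopology());
    }
}
```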
System challenges
• Latency should be determined by business needs - flexible per customer (300ms for customers who just don't care)
• Data dependencies in decision part can get very complex
• Getting data can be slow, especially 3rd party
• Data scientists write in Python
• Should be scalable, because we’re ever growing
• Should be very granularly monitored
Bird’s eye view
• Two systems:
• System 1: data prefetching & preparing
• System 2: decision engine, must have all
available data handy at TX time
System 1: high-throughput pipeline
• Stream Batching
• Prefetching / Preparing
• Common use case, lots of competitors
System 2: low-latency decision
• Dedicated everything
• Complex dependency graph
• Less common, fewer players
System 1: High Throughput
Cache and cache layering
• Storm constructs make it easy to tweak caches,
add enrichment steps transparently
• Different enrichment operations may require
different execution power
• Each operation can be replaced by a sub-topology
- layering of cache levels
• Field grouping makes it possible to maintain state in components - local cache or otherwise (bolt sketch after this slide)
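A minimal sketch of the local-cache idea, assuming the topology routes with fieldsGrouping on "customerId" so each key always reaches the same task; the bolt, field names and fallback lookup are illustrative, not Forter's.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Hypothetical enrichment bolt: fields grouping guarantees each key always
// lands on the same task, so a plain in-process map works as the first cache
// layer; a miss falls through to a slower layer (e.g. a sub-topology).
public class CachedEnrichBolt extends BaseRichBolt {
    private transient Map<String, String> localCache;
    private transient OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext ctx, OutputCollector collector) {
        this.localCache = new HashMap<>();
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        String key = tuple.getStringByField("customerId");
        String enrichment = localCache.get(key);
        if (enrichment == null) {
            enrichment = fetchFromSlowerLayer(key);   // next cache level / origin
            localCache.put(key, enrichment);
        }
        collector.emit(tuple, new Values(key, enrichment));
        collector.ack(tuple);
    }

    private String fetchFromSlowerLayer(String key) {
        return "enrichment-for-" + key;               // placeholder for the real lookup
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("customerId", "enrichment"));
    }
}
```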
Maintain a stored state
• Many events coming in, some cause a state to
change
• State of a working set is saved in memory
• New/old states are fetched from an external data
source
• State updates are saved immediately
• State machine is scalable - again, field grouping (sketch after this slide)
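A hedged sketch of the stored-state pattern described above: the working set lives in an in-process map, missing states are loaded from an external source, and every update is persisted right away. The bolt, field names and state transition are placeholders.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Hypothetical state-keeping bolt: fields grouping on the state key makes
// this scale out, since each key is always handled by the same task.
public class StateBolt extends BaseBasicBolt {
    private transient Map<String, Long> workingSet;   // in-memory working set

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext ctx) {
        workingSet = new HashMap<>();
    }

    @Override
    public void execute(Tuple event, BasicOutputCollector collector) {
        String key = event.getStringByField("entityId");

        // Cold state: fetch from the external data source.
        Long state = workingSet.computeIfAbsent(key, this::loadState);

        // Apply the event and save the new state immediately.
        Long newState = state + 1;                     // placeholder transition
        workingSet.put(key, newState);
        persistState(key, newState);

        collector.emit(new Values(key, newState));
    }

    private Long loadState(String key) { return 0L; }               // stand-in for a DB read
    private void persistState(String key, Long state) { /* DB write */ }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("entityId", "state"));
    }
}
```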
And the rest…
• Batching content for writing (Storm’s tick tuples; sketch after this slide)
• Aggregating events in memory
• Throttling/Circuit-breaking external calls
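For the tick-tuple batching bullet, a minimal sketch of a writer bolt that buffers records and flushes them whenever Storm delivers a tick tuple; the flush frequency, field name and flush() body are illustrative.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.Constants;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

// Hypothetical writer bolt: tuples are buffered in memory and written to
// storage in one batch on every tick tuple.
public class BatchingWriterBolt extends BaseBasicBolt {
    private final List<Object> buffer = new ArrayList<>();

    @Override
    public Map<String, Object> getComponentConfiguration() {
        // Ask Storm to send this bolt a tick tuple every 2 seconds.
        Map<String, Object> conf = new HashMap<>();
        conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 2);
        return conf;
    }

    private static boolean isTick(Tuple tuple) {
        return Constants.SYSTEM_COMPONENT_ID.equals(tuple.getSourceComponent())
            && Constants.SYSTEM_TICK_STREAM_ID.equals(tuple.getSourceStreamId());
    }

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        if (isTick(tuple)) {
            flush(buffer);          // bulk write of the whole batch
            buffer.clear();
        } else {
            buffer.add(tuple.getValueByField("record"));
        }
    }

    private void flush(List<Object> batch) { /* stand-in for a bulk write */ }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Terminal bolt: nothing emitted downstream.
    }
}
```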
System 2: Low Latency
Unique Challenges
• Scaling: resources need to be very dedicated; parallelizing is bad
• Join logic is much stricter, with short timeouts
• Data validity is crucial for the stream routing
• Error handling
• Component graph is immense and hard to contain
mentally - especially considering the delicate time
window configurations.
Scalability
• Each topology is built to handle a fixed number of parallel TXs - Storm’s max-spout-pending (config sketch after this slide)
• Each topology atomically polls a queue
• Trying to keep as much of the logic in the same
process to reduce network and serialization costs
• Latency is the only measure
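A small, illustrative config sketch for the points above (max-spout-pending bounding parallel TXs, a single worker to avoid network/serialization hops, a tight message timeout); the numbers are made up and depend entirely on the latency budget.

```java
import org.apache.storm.Config;

public class LowLatencyConfig {
    // Illustrative values only; real settings depend on the latency budget.
    public static Config build() {
        Config conf = new Config();

        // Cap un-acked tuples per spout task, i.e. how many transactions
        // the topology handles in parallel (Storm's max-spout-pending).
        conf.setMaxSpoutPending(8);

        // Keep the whole pipeline in one worker process to avoid
        // inter-worker network hops and serialization costs.
        conf.setNumWorkers(1);

        // Fail any transaction that does not finish within the budget.
        conf.setMessageTimeoutSecs(1);

        return conf;
    }
}
```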
Joining and errors
• Waiting is not an option
• Tick tuples are no good here - they break the single-thread illusion
• Static topologies are easy to analyze, edit at runtime, and intervene in
• Fallback streams are an elegant solution to the problem, sparing developers from explicitly defining escape routes (bolt sketch after this slide)
• They also allow for “try->finally” semantics
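Storm itself only provides named streams; purely as a hedged sketch of how a fallback stream with "try -> finally" flavor could look, here a failed third-party call routes the transaction to a fallback branch instead of stalling the join. All names are hypothetical.

```java
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Hypothetical bolt with a "fallback" stream: on success the enriched tuple
// goes downstream as usual; on failure a minimal tuple is emitted on the
// fallback stream so the decision can still be finalized.
public class ThirdPartyLookupBolt extends BaseRichBolt {
    public static final String FALLBACK_STREAM = "fallback";

    private transient OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext ctx, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        String txId = tuple.getStringByField("txId");
        try {
            String result = slowThirdPartyCall(txId);
            collector.emit(tuple, new Values(txId, result));
        } catch (Exception e) {
            // Waiting is not an option: route to the fallback branch instead.
            collector.emit(FALLBACK_STREAM, tuple, new Values(txId, e.getMessage()));
        } finally {
            collector.ack(tuple);
        }
    }

    private String slowThirdPartyCall(String txId) { return "data-for-" + txId; }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("txId", "result"));
        declarer.declareStream(FALLBACK_STREAM, new Fields("txId", "error"));
    }
}
```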
Multilang
• Storm allows running bolts as external processes (shell-bolts), with built-in communication over standard I/O (sketch after this slide)
• Not hugely scalable, but works
• Implemented are: Node.js (our contribution) and
Python
• We use it for legacy code and to keep data scientists happy
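A minimal multilang sketch: a ShellBolt subclass that launches a Python process and speaks to it over stdin/stdout via Storm's multilang protocol. The script name and output fields are illustrative.

```java
import java.util.Map;
import org.apache.storm.task.ShellBolt;
import org.apache.storm.topology.IRichBolt;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.tuple.Fields;

// Hypothetical multilang bolt: Storm spawns the Python process and talks to
// it over standard i/o. "score_model.py" is an illustrative script name.
public class PythonModelBolt extends ShellBolt implements IRichBolt {

    public PythonModelBolt() {
        super("python", "score_model.py");
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("txId", "score"));
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        return null;
    }
}
```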
Data Validity
• Wrapping the bolts, we implemented contracts for
outputs
• Java POJOs with Hibernate Validator (contract sketch after this slide)
• Contracts allow us “hard-typing” the links in the
topologies
• Also help minimize data flow, especially to shell-bolts
• Check out storm-data-contracts on GitHub
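The real implementation is the storm-data-contracts project; purely as an illustration of the idea, here is a hypothetical output contract as a POJO validated with Hibernate Validator (the package may be jakarta.validation in newer versions).

```java
import java.util.Set;
import javax.validation.ConstraintViolation;
import javax.validation.Validation;
import javax.validation.Validator;
import javax.validation.constraints.Max;
import javax.validation.constraints.Min;
import javax.validation.constraints.NotNull;
import javax.validation.constraints.Size;

// Hypothetical output contract for an enrichment bolt: a plain POJO with
// validation annotations, checked before the tuple leaves the bolt.
public class EnrichmentOutput {

    @NotNull
    private String txId;

    @Min(0) @Max(100)
    private Integer riskScore;

    @NotNull @Size(min = 2, max = 2)
    private String countryCode;

    // getters/setters omitted for brevity

    public static void validate(EnrichmentOutput out) {
        Validator validator = Validation.buildDefaultValidatorFactory().getValidator();
        Set<ConstraintViolation<EnrichmentOutput>> violations = validator.validate(out);
        if (!violations.isEmpty()) {
            // e.g. divert the tuple to a fallback stream instead of emitting it
            throw new IllegalArgumentException(violations.toString());
        }
    }
}
```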
Managing Complexity
• Complexity of the data dependencies is maintained
by literally drawing it.
• Nimbus REST APIs offer access to the topology layout (query sketch after this slide)
• Timing complexity reduced by synchronizing the
joins to a shared point-in-time. Still pretty complex.
• Proves better than our previous iterative solution
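A hedged sketch of pulling the topology layout from Storm's UI REST API (the endpoint follows the documented /api/v1/topology/:id form; host, port and topology id are made up), which can then feed whatever draws the dependency graph.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Fetch the topology description (spouts, bolts, input streams) as JSON.
public class TopologyLayoutFetcher {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://storm-ui.example.com:8080/api/v1/topology/decision-topology-1"))
                .GET()
                .build();

        // Rendering the JSON (e.g. as a Graphviz graph) is left to the caller.
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```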
Monitoring
• Nimbus metrics give out averages - not
good enough
• Riemann is used to efficiently monitor latencies for every tuple in the system (event sketch after this slide)
• Inherent low latency monitoring issue:
CPU utilization monitoring
• More at Itai Frenkel’s lecture
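A hedged sketch of reporting a per-tuple latency to Riemann with the riemann-java-client (class and package names vary between client versions; the host, service name and TTL are illustrative).

```java
import java.util.concurrent.TimeUnit;
import io.riemann.riemann.client.RiemannClient;

// Send one latency measurement as a Riemann event.
public class TupleLatencyReporter {
    public static void main(String[] args) throws Exception {
        RiemannClient client = RiemannClient.tcp("riemann.example.com", 5555);
        client.connect();

        long startNanos = System.nanoTime();
        // ... process one tuple ...
        double latencyMs = (System.nanoTime() - startNanos) / 1_000_000.0;

        client.event()
              .service("storm.decision.enrich-bolt.latency")
              .metric(latencyMs)
              .tags("storm", "latency")
              .ttl(30)
              .send()
              .deref(1, TimeUnit.SECONDS);   // block briefly so the event is flushed

        client.close();
    }
}
```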
Questions?
Contact info:
Re’em Bensimhon
reem@forter.com / reem.bs@gmail.com
linkedin.com/in/bensimhon
twitter: @reembs
