Storm at Forter
Picture https://www.flickr.com/photos/silentmind8/15865860242 by silentmind8 under CC BY 2.0 http://creativecommons.org/licenses/by/2.0/
• We detect fraud
• A lot of data is collected
• New data can introduce new data sources
• At transaction time, we do our magic. Fast.
• We deny less
What’s Storm?
• Streaming/data-pipeline infrastructure
• What’s a pipeline?
• “Topology” driven flow, static
• Written on the JVM; also supports Python and Node.js
• Easy clustering
• Apache top level project, large community
Storm Lingo
• Tuples
• The basic data transfer object in Storm. Basically a dictionary (key->val).
• Spouts
• Entry points into the pipe. This is where data comes from.
• Bolts
• Components that can transform and route tuples
• Joins
• Joins are where async branches of the topology meet and merge
• Streams
• Streams allow for flow control in the topology (wiring sketch after this slide)
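To make the lingo concrete, here is a minimal wiring sketch in Java (using the org.apache.storm 2.x API); TransactionSpout, EnrichBolt and ErrorBolt are hypothetical stand-ins for real components, not Forter's actual topology.

```java
import org.apache.storm.Config;
import org.apache.storm.LocalCluster;
import org.apache.storm.topology.TopologyBuilder;
import org.apache.storm.tuple.Fields;

public class LingoExampleTopology {
    public static void main(String[] args) throws Exception {
        TopologyBuilder builder = new TopologyBuilder();

        // Spout: the entry point, where tuples enter the pipe.
        builder.setSpout("tx-spout", new TransactionSpout(), 1);

        // Bolt: transforms/routes tuples; fieldsGrouping keys the stream so
        // all tuples with the same "customerId" reach the same task.
        builder.setBolt("enrich-bolt", new EnrichBolt(), 4)
               .fieldsGrouping("tx-spout", new Fields("customerId"));

        // A second bolt subscribing to a named stream ("errors") of the first.
        builder.setBolt("error-bolt", new ErrorBolt(), 1)
               .shuffleGrouping("enrich-bolt", "errors");

        Config conf = new Config();
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("lingo-example", conf, builder.createTopology());
    }
}
```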
System challenges
• Latency should be determined by business needs - flexible per customer (300ms for customers who just don't care)
• Data dependencies in decision part can get very complex
• Getting data can be slow, especially 3rd party
• Data scientists write in Python
• Should be scalable, because we’re ever growing
• Should be very granularly monitored
Bird’s eye view
• Two systems:
• System 1: data prefetching & preparing
• System 2: decision engine, must have all
available data handy at TX time
System 1: high-throughput pipeline
• Stream Batching
• Prefetching / Preparing
• Common use case, lots of competitors
System 2: low-latency decision
• Dedicated everything
• Complex dependency graph
• Less common, fewer players
System 1: High Throughput
Cache and cache layering
• Storm constructs make it easy to tweak caches,
add enrichment steps transparently
• Different enrichment operations may require
different execution power
• Each operation can be replaced by a sub-topology
- layering of cache levels
• Field grouping makes it possible to maintain state in components - local cache or otherwise (bolt sketch after this slide)
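A minimal sketch of the local-cache idea, assuming the topology routes with fieldsGrouping on "customerId" so each key always reaches the same task; the bolt, field names and fallback lookup are illustrative, not Forter's.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Hypothetical enrichment bolt: fields grouping guarantees each key always
// lands on the same task, so a plain in-process map works as the first cache
// layer; a miss falls through to a slower layer (e.g. a sub-topology).
public class CachedEnrichBolt extends BaseRichBolt {
    private transient Map<String, String> localCache;
    private transient OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext ctx, OutputCollector collector) {
        this.localCache = new HashMap<>();
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        String key = tuple.getStringByField("customerId");
        String enrichment = localCache.get(key);
        if (enrichment == null) {
            enrichment = fetchFromSlowerLayer(key);   // next cache level / origin
            localCache.put(key, enrichment);
        }
        collector.emit(tuple, new Values(key, enrichment));
        collector.ack(tuple);
    }

    private String fetchFromSlowerLayer(String key) {
        return "enrichment-for-" + key;               // placeholder for the real lookup
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("customerId", "enrichment"));
    }
}
```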
Maintain a stored state
• Many events coming in, some cause a state to
change
• State of a working set is saved in memory
• New/old states are fetched from an external data
source
• State updates are saved immediately
• State machine is scalable - again, field grouping (sketch after this slide)
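A hedged sketch of the stored-state pattern described above: the working set lives in an in-process map, missing states are loaded from an external source, and every update is persisted right away. The bolt, field names and state transition are placeholders.

```java
import java.util.HashMap;
import java.util.Map;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Hypothetical state-keeping bolt: fields grouping on the state key makes
// this scale out, since each key is always handled by the same task.
public class StateBolt extends BaseBasicBolt {
    private transient Map<String, Long> workingSet;   // in-memory working set

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext ctx) {
        workingSet = new HashMap<>();
    }

    @Override
    public void execute(Tuple event, BasicOutputCollector collector) {
        String key = event.getStringByField("entityId");

        // Cold state: fetch from the external data source.
        Long state = workingSet.computeIfAbsent(key, this::loadState);

        // Apply the event and save the new state immediately.
        Long newState = state + 1;                     // placeholder transition
        workingSet.put(key, newState);
        persistState(key, newState);

        collector.emit(new Values(key, newState));
    }

    private Long loadState(String key) { return 0L; }               // stand-in for a DB read
    private void persistState(String key, Long state) { /* DB write */ }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("entityId", "state"));
    }
}
```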
And the rest…
• Batching content for writing (Storm’s tick tuples; sketch after this slide)
• Aggregating events in memory
• Throttling/Circuit-breaking external calls
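For the tick-tuple batching bullet, a minimal sketch of a writer bolt that buffers records and flushes them whenever Storm delivers a tick tuple; the flush frequency, field name and flush() body are illustrative.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.apache.storm.Config;
import org.apache.storm.Constants;
import org.apache.storm.topology.BasicOutputCollector;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseBasicBolt;
import org.apache.storm.tuple.Tuple;

// Hypothetical writer bolt: tuples are buffered in memory and written to
// storage in one batch on every tick tuple.
public class BatchingWriterBolt extends BaseBasicBolt {
    private final List<Object> buffer = new ArrayList<>();

    @Override
    public Map<String, Object> getComponentConfiguration() {
        // Ask Storm to send this bolt a tick tuple every 2 seconds.
        Map<String, Object> conf = new HashMap<>();
        conf.put(Config.TOPOLOGY_TICK_TUPLE_FREQ_SECS, 2);
        return conf;
    }

    private static boolean isTick(Tuple tuple) {
        return Constants.SYSTEM_COMPONENT_ID.equals(tuple.getSourceComponent())
            && Constants.SYSTEM_TICK_STREAM_ID.equals(tuple.getSourceStreamId());
    }

    @Override
    public void execute(Tuple tuple, BasicOutputCollector collector) {
        if (isTick(tuple)) {
            flush(buffer);          // bulk write of the whole batch
            buffer.clear();
        } else {
            buffer.add(tuple.getValueByField("record"));
        }
    }

    private void flush(List<Object> batch) { /* stand-in for a bulk write */ }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        // Terminal bolt: nothing emitted downstream.
    }
}
```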
System 2: Low Latency
Unique Challenges
• Scaling: resources need to be very dedicated; parallelizing is bad
• Join logic is much stricter, with short timeouts
• Data validity is crucial for the stream routing
• Error handling
• Component graph is immense and hard to contain
mentally - especially considering the delicate time
window configurations.
Scalability
• Each topology is built to handle a fixed number of parallel TXs - Storm’s max-spout-pending (config sketch after this slide)
• Each topology atomically polls a queue
• Trying to keep as much of the logic in the same
process to reduce network and serialization costs
• Latency is the only measure
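A small, illustrative config sketch for the points above (max-spout-pending bounding parallel TXs, a single worker to avoid network/serialization hops, a tight message timeout); the numbers are made up and depend entirely on the latency budget.

```java
import org.apache.storm.Config;

public class LowLatencyConfig {
    // Illustrative values only; real settings depend on the latency budget.
    public static Config build() {
        Config conf = new Config();

        // Cap un-acked tuples per spout task, i.e. how many transactions
        // the topology handles in parallel (Storm's max-spout-pending).
        conf.setMaxSpoutPending(8);

        // Keep the whole pipeline in one worker process to avoid
        // inter-worker network hops and serialization costs.
        conf.setNumWorkers(1);

        // Fail any transaction that does not finish within the budget.
        conf.setMessageTimeoutSecs(1);

        return conf;
    }
}
```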
Joining and errors
• Waiting is not an option
• Tick tuples are no good here - they break the single-thread illusion
• Static topologies are easy to analyze, edit at runtime, and intervene in
• Fallback streams are an elegant solution to the problem, sparing developers from explicitly defining escape routes (bolt sketch after this slide)
• They also allow for “try->finally” semantics
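Storm itself only provides named streams; purely as a hedged sketch of how a fallback stream with "try -> finally" flavor could look, here a failed third-party call routes the transaction to a fallback branch instead of stalling the join. All names are hypothetical.

```java
import java.util.Map;
import org.apache.storm.task.OutputCollector;
import org.apache.storm.task.TopologyContext;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.topology.base.BaseRichBolt;
import org.apache.storm.tuple.Fields;
import org.apache.storm.tuple.Tuple;
import org.apache.storm.tuple.Values;

// Hypothetical bolt with a "fallback" stream: on success the enriched tuple
// goes downstream as usual; on failure a minimal tuple is emitted on the
// fallback stream so the decision can still be finalized.
public class ThirdPartyLookupBolt extends BaseRichBolt {
    public static final String FALLBACK_STREAM = "fallback";

    private transient OutputCollector collector;

    @Override
    public void prepare(Map<String, Object> conf, TopologyContext ctx, OutputCollector collector) {
        this.collector = collector;
    }

    @Override
    public void execute(Tuple tuple) {
        String txId = tuple.getStringByField("txId");
        try {
            String result = slowThirdPartyCall(txId);
            collector.emit(tuple, new Values(txId, result));
        } catch (Exception e) {
            // Waiting is not an option: route to the fallback branch instead.
            collector.emit(FALLBACK_STREAM, tuple, new Values(txId, e.getMessage()));
        } finally {
            collector.ack(tuple);
        }
    }

    private String slowThirdPartyCall(String txId) { return "data-for-" + txId; }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("txId", "result"));
        declarer.declareStream(FALLBACK_STREAM, new Fields("txId", "error"));
    }
}
```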
Multilang
• Storm allows running bolts as external processes (shell-bolts), with built-in communication over standard I/O (sketch after this slide)
• Not hugely scalable, but works
• Implemented are: Node.js (our contribution) and
Python
• We use it for legacy code and to keep data scientists happy
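A minimal multilang sketch: a ShellBolt subclass that launches a Python process and speaks to it over stdin/stdout via Storm's multilang protocol. The script name and output fields are illustrative.

```java
import java.util.Map;
import org.apache.storm.task.ShellBolt;
import org.apache.storm.topology.IRichBolt;
import org.apache.storm.topology.OutputFieldsDeclarer;
import org.apache.storm.tuple.Fields;

// Hypothetical multilang bolt: Storm spawns the Python process and talks to
// it over standard i/o. "score_model.py" is an illustrative script name.
public class PythonModelBolt extends ShellBolt implements IRichBolt {

    public PythonModelBolt() {
        super("python", "score_model.py");
    }

    @Override
    public void declareOutputFields(OutputFieldsDeclarer declarer) {
        declarer.declare(new Fields("txId", "score"));
    }

    @Override
    public Map<String, Object> getComponentConfiguration() {
        return null;
    }
}
```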
Data Validity
• Wrapping the bolts, we implemented contracts for
outputs
• Java POJOs with Hibernate Validator (contract sketch after this slide)
• Contracts allow us “hard-typing” the links in the
topologies
• Also help minimize data flow, especially to shell-bolts
• Check out storm-data-contracts on GitHub
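The real implementation is the storm-data-contracts project; purely as an illustration of the idea, here is a hypothetical output contract as a POJO validated with Hibernate Validator (the package may be jakarta.validation in newer versions).

```java
import java.util.Set;
import javax.validation.ConstraintViolation;
import javax.validation.Validation;
import javax.validation.Validator;
import javax.validation.constraints.Max;
import javax.validation.constraints.Min;
import javax.validation.constraints.NotNull;
import javax.validation.constraints.Size;

// Hypothetical output contract for an enrichment bolt: a plain POJO with
// validation annotations, checked before the tuple leaves the bolt.
public class EnrichmentOutput {

    @NotNull
    private String txId;

    @Min(0) @Max(100)
    private Integer riskScore;

    @NotNull @Size(min = 2, max = 2)
    private String countryCode;

    // getters/setters omitted for brevity

    public static void validate(EnrichmentOutput out) {
        Validator validator = Validation.buildDefaultValidatorFactory().getValidator();
        Set<ConstraintViolation<EnrichmentOutput>> violations = validator.validate(out);
        if (!violations.isEmpty()) {
            // e.g. divert the tuple to a fallback stream instead of emitting it
            throw new IllegalArgumentException(violations.toString());
        }
    }
}
```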
Managing Complexity
• Complexity of the data dependencies is maintained
by literally drawing it.
• Nimbus REST APIs offer access to the topology layout (query sketch after this slide)
• Timing complexity reduced by synchronizing the
joins to a shared point-in-time. Still pretty complex.
• Proves better than our previous iterative solution
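A hedged sketch of pulling the topology layout from Storm's UI REST API (the endpoint follows the documented /api/v1/topology/:id form; host, port and topology id are made up), which can then feed whatever draws the dependency graph.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;

// Fetch the topology description (spouts, bolts, input streams) as JSON.
public class TopologyLayoutFetcher {
    public static void main(String[] args) throws Exception {
        HttpClient client = HttpClient.newHttpClient();
        HttpRequest request = HttpRequest.newBuilder()
                .uri(URI.create("http://storm-ui.example.com:8080/api/v1/topology/decision-topology-1"))
                .GET()
                .build();

        // Rendering the JSON (e.g. as a Graphviz graph) is left to the caller.
        HttpResponse<String> response = client.send(request, HttpResponse.BodyHandlers.ofString());
        System.out.println(response.body());
    }
}
```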
Monitoring
• Nimbus metrics give out averages - not
good enough
• Riemann is used to efficiently monitor latencies for every tuple in the system (event sketch after this slide)
• Inherent low latency monitoring issue:
CPU utilization monitoring
• More at Itai Frenkel’s lecture
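A hedged sketch of reporting a per-tuple latency to Riemann with the riemann-java-client (class and package names vary between client versions; the host, service name and TTL are illustrative).

```java
import java.util.concurrent.TimeUnit;
import io.riemann.riemann.client.RiemannClient;

// Send one latency measurement as a Riemann event.
public class TupleLatencyReporter {
    public static void main(String[] args) throws Exception {
        RiemannClient client = RiemannClient.tcp("riemann.example.com", 5555);
        client.connect();

        long startNanos = System.nanoTime();
        // ... process one tuple ...
        double latencyMs = (System.nanoTime() - startNanos) / 1_000_000.0;

        client.event()
              .service("storm.decision.enrich-bolt.latency")
              .metric(latencyMs)
              .tags("storm", "latency")
              .ttl(30)
              .send()
              .deref(1, TimeUnit.SECONDS);   // block briefly so the event is flushed

        client.close();
    }
}
```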
Questions?
Contact info:
Re’em Bensimhon
reem@forter.com / reem.bs@gmail.com
linkedin.com/in/bensimhon
twitter: @reembs
