
Storm at Forter



Lecture by Forter Sr. Engineer Re'em Bensimhon about general usage of Apache Storm with examples taken from the fraud detection world.

Published in: Engineering


  1. 1. Storm at Forter • Picture by silentmind8 under CC BY 2.0
  2. 2. Forter • We detect fraud • A lot of data is collected • New data can introduce new data sources • At transaction time, we do our magic. Fast. • We deny less
  3. 3. What’s Storm? • Streaming/data-pipeline infrastructure • What’s a pipeline? • Static, “topology”-driven flow • Runs on the JVM and also supports Python and Node.js components • Easy clustering • Apache top-level project, large community
  4. 4. Storm Lingo • Tuples • The basic data transfer object in Storm. Basically a dictionary (key->val). • Spouts • Entry points into the pipe. This is where data comes from. • Bolts • Components that can transform and route tuples • Joins • Joins are where async branches of the topology meet and merge • Streams • Streams allow for flow control in the topology
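The lingo above can be illustrated with a minimal pure-Python sketch. It does not use Storm's API; `transaction_spout`, `enrich_bolt`, `run`, and the field names are hypothetical, chosen only to show a spout feeding dictionary-like tuples into a bolt that routes them to named streams.

```python
from typing import Dict, Iterable, Iterator, List, Tuple

StormTuple = Dict[str, object]  # a tuple is basically a key -> value dictionary


def transaction_spout() -> Iterator[StormTuple]:
    """Spout: the entry point that feeds tuples into the pipe (made-up data)."""
    yield {"tx_id": "t1", "amount": 20}
    yield {"tx_id": "t2", "amount": 5000}


def enrich_bolt(tup: StormTuple) -> Iterable[Tuple[str, StormTuple]]:
    """Bolt: transforms a tuple and routes it to a named stream."""
    out = dict(tup, large=tup["amount"] > 1000)
    stream = "review" if out["large"] else "default"
    yield stream, out


def run(spout, bolt) -> Dict[str, List[StormTuple]]:
    """Drive every spout tuple through the bolt and collect output per stream."""
    streams: Dict[str, List[StormTuple]] = {}
    for tup in spout():
        for stream, out in bolt(tup):
            streams.setdefault(stream, []).append(out)
    return streams


result = run(transaction_spout, enrich_bolt)
```

In a real topology the spout and bolt run as separate, parallel tasks; the single loop here only shows the direction of data flow.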
  5. 5. System challenges • Latency should be determined by business needs - flexible per customer (from 300ms to customers who just don’t care) • Data dependencies in the decision part can get very complex • Getting data can be slow, especially from 3rd parties • Data scientists write in Python • Should be scalable, because we’re ever growing • Should be very granularly monitored
  6. 6. Bird’s eye view • Two systems: • System 1: data prefetching & preparing • System 2: decision engine, must have all available data handy at TX time
  7. 7. System 1: high throughput pipeline • Stream Batching • Prefetching / Preparing • Common use case, lots of competitors
  8. 8. System 2: low latency decision • Dedicated everything • Complex dependency graph • Less common, fewer players
  9. 9. System 1 High Throughput
  10. 10. Cache and cache layering • Storm constructs make it easy to tweak caches, add enrichment steps transparently • Different enrichment operations may require different execution power • Each operation can be replaced by a sub-topology - layering of cache levels • Field grouping allows the ability to maintain state in components - local cache or otherwise
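Why field grouping makes a local cache effective can be sketched as follows. This is an illustration of the idea, not Storm's implementation: the hash function, `task_for`, `CachingTask`, and the email-to-domain "enrichment" are all assumptions made for the example.

```python
import zlib
from typing import Dict, List


def task_for(field_value: str, num_tasks: int) -> int:
    """Pick a task index from a stable hash of the grouping field."""
    return zlib.crc32(field_value.encode("utf-8")) % num_tasks


class CachingTask:
    """One bolt task holding a local cache for the keys routed to it."""

    def __init__(self) -> None:
        self.cache: Dict[str, str] = {}
        self.slow_lookups = 0

    def enrich(self, email: str) -> str:
        if email not in self.cache:
            self.slow_lookups += 1  # stands in for a slow external call
            self.cache[email] = email.split("@")[1]
        return self.cache[email]


tasks: List[CachingTask] = [CachingTask() for _ in range(4)]
events = ["a@x.com", "b@y.com", "a@x.com", "a@x.com", "b@y.com"]
# The same field value always lands on the same task, so its cache entry is reused.
domains = [tasks[task_for(e, len(tasks))].enrich(e) for e in events]
total_slow_lookups = sum(t.slow_lookups for t in tasks)
```

With a shuffle-style grouping the same key could reach any task, and each task would have to pay for its own slow lookup.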
  11. 11. Maintain a stored state • Many events coming in, some cause a state to change • State of a working set is saved in memory • New/old states are fetched from an external data source • State updates are saved immediately • State machine is scalable - again, field grouping
  12. 12. And the rest… • Batching content for writing (Storm’s tick tuples) • Aggregating events in memory • Throttling/Circuit-breaking external calls
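Batching writes with tick tuples might look roughly like the sketch below. It is plain Python rather than Storm code: `TICK` is a hypothetical marker standing in for Storm's tick tuple, and `BatchingWriter` with its `written` list stands in for a bolt that writes to an external store.

```python
from typing import Dict, List

TICK: Dict = {"__tick__": True}  # hypothetical stand-in for Storm's tick tuple


class BatchingWriter:
    """Buffers tuples and flushes on a full batch or on the next tick."""

    def __init__(self, max_batch: int) -> None:
        self.max_batch = max_batch
        self.pending: List[Dict] = []
        self.written: List[List[Dict]] = []  # stands in for the external store

    def _flush(self) -> None:
        if self.pending:
            self.written.append(self.pending)
            self.pending = []

    def execute(self, tup: Dict) -> None:
        if tup is TICK:
            self._flush()  # periodic flush bounds how long a tuple can wait
            return
        self.pending.append(tup)
        if len(self.pending) >= self.max_batch:
            self._flush()  # size-based flush bounds memory use


writer = BatchingWriter(max_batch=3)
for t in [{"id": 1}, {"id": 2}, {"id": 3}, {"id": 4}, TICK, TICK]:
    writer.execute(t)
batch_sizes = [len(b) for b in writer.written]
```

The tick gives a time-based upper bound on write delay without a separate timer thread inside the component.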
  13. 13. System 2: Low Latency
  14. 14. Unique Challenges • Scaling. Resources need to be very dedicated, parallelizing is bad • Join logic is much stricter, with short timeouts • Data validity is crucial for the stream routing • Error handling • Component graph is immense and hard to contain mentally - especially considering the delicate time window configurations.
  15. 15. Scalability • Each topology is built to handle a fixed number of parallel TXs (Storm’s max-spout-pending setting) • Each topology atomically polls a queue • We try to keep as much of the logic as possible in the same process to reduce network and serialization costs • Latency is the only measure
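The effect of capping the number of in-flight transactions can be sketched in a few lines. This mimics the idea behind max-spout-pending without using Storm; `BoundedSpout`, `next_tuple`, and `ack` here are illustrative names in a toy class, not Storm's classes.

```python
from collections import deque
from typing import Iterable, Optional, Set


class BoundedSpout:
    """Emits from a queue only while fewer than max_pending TXs are in flight."""

    def __init__(self, queue: Iterable[str], max_pending: int) -> None:
        self.queue = deque(queue)
        self.max_pending = max_pending
        self.pending: Set[str] = set()

    def next_tuple(self) -> Optional[str]:
        if len(self.pending) >= self.max_pending or not self.queue:
            return None  # at capacity (or idle): leave work on the queue
        tx = self.queue.popleft()
        self.pending.add(tx)
        return tx

    def ack(self, tx: str) -> None:
        self.pending.discard(tx)  # a finished TX frees one slot


spout = BoundedSpout(["t1", "t2", "t3"], max_pending=2)
emitted = [spout.next_tuple() for _ in range(3)]
spout.ack("t1")
after_ack = spout.next_tuple()
```

Work that is not taken stays on the shared queue, where another topology instance can pick it up.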
  16. 16. Joining and errors • Waiting is not an option • Tick tuples are no good: they break the single-thread illusion • Static topologies are easy to analyze, to edit at runtime, and to intervene in • Fallback streams are an elegant solution to the problem, sparing developers from explicitly defining escape routes • They also allow for “try->finally” semantics
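One way to picture a strict join with a fallback stream is the sketch below. It is an interpretation of the slide, not Forter's implementation: `DeadlineJoin`, the stream names, and the explicit `now_ms` timestamps (used instead of real clocks to keep the example deterministic) are all assumptions.

```python
from typing import Dict, Optional, Set, Tuple

Emit = Tuple[str, Dict[str, object]]  # (stream name, joined values)


class DeadlineJoin:
    """Joins branch results for one transaction, or falls back at the deadline."""

    def __init__(self, expected: Set[str], deadline_ms: int) -> None:
        self.expected = expected
        self.deadline_ms = deadline_ms
        self.results: Dict[str, object] = {}

    def on_result(self, branch: str, value: object, now_ms: int) -> Optional[Emit]:
        if now_ms > self.deadline_ms:
            return None  # late results are ignored; the join never waits for them
        self.results[branch] = value
        if self.expected.issubset(self.results):
            return ("default", dict(self.results))
        return None

    def on_deadline(self) -> Emit:
        """Always emits something, which gives try->finally style behaviour."""
        missing = self.expected.difference(self.results)
        stream = "fallback" if missing else "default"
        return (stream, dict(self.results))


join = DeadlineJoin({"geo", "device"}, deadline_ms=300)
first = join.on_result("geo", "IL", now_ms=40)
late = join.on_result("device", "ios", now_ms=350)
outcome = join.on_deadline()
```

Downstream components subscribe to the fallback stream once, so individual bolts do not each need their own escape route.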
  17. 17. Multilang • Storm allows running bolt processes (shell-bolt) with the built-in capability of communicating through standard I/O • Not hugely scalable, but works • Implementations exist for Node.js (our contribution) and Python • We use it for legacy code and to keep data scientists happy
  18. 18. Data Validity • Wrapping the bolts, we implemented contracts for outputs • Java POJOs with Hibernate Validator • Contracts let us “hard-type” the links in the topologies • They also help minimize data flow, especially to shell-bolts • Check out storm-data-contracts on GitHub
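The talk describes contracts as Java POJOs checked with Hibernate Validator. The sketch below swaps in a Python dataclass with hand-written checks purely to illustrate the idea; `GeoOutput`, `with_contract`, and `geo_bolt` are hypothetical names and are not part of storm-data-contracts.

```python
from dataclasses import asdict, dataclass, fields
from typing import Callable, Dict


@dataclass(frozen=True)
class GeoOutput:
    """Hypothetical output contract: the only fields this bolt may emit."""

    country: str
    risk: float

    def __post_init__(self) -> None:
        if len(self.country) != 2:
            raise ValueError("country must be a 2-letter code")
        if not 0.0 <= self.risk <= 1.0:
            raise ValueError("risk must be within [0, 1]")


def with_contract(bolt: Callable[[Dict], Dict], contract) -> Callable[[Dict], Dict]:
    """Wrap a bolt so its output is validated and trimmed to the contract."""

    def wrapped(tup: Dict) -> Dict:
        raw = bolt(tup)
        names = [f.name for f in fields(contract)]
        # Building the contract object validates; extra keys are dropped.
        return asdict(contract(**{name: raw[name] for name in names}))

    return wrapped


def geo_bolt(tup: Dict) -> Dict:
    return {"country": "IL", "risk": 0.2, "debug_blob": "x" * 1000}


checked = with_contract(geo_bolt, GeoOutput)
emitted = checked({"ip": "203.0.113.7"})
```

Dropping everything outside the contract is what keeps the payload small when the next hop is a shell-bolt reading from standard input.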
  19. 19. Managing Complexity • Complexity of the data dependencies is maintained by literally drawing it. • Nimbus REST APIs offer access to the topology layout • Timing complexity reduced by synchronizing the joins to a shared point-in-time. Still pretty complex. • Proves better than our previous iterative solution
  20. 20. Monitoring • Nimbus metrics give out averages - not good enough • Riemann is used to efficiently monitor latencies for every tuple in the system • Inherent low-latency monitoring issue: CPU utilization monitoring • More at Itai Frenkel’s lecture
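A small worked example shows why averages alone are not good enough for latency monitoring. The latency values are invented for illustration, and the nearest-rank percentile used here is one common definition, not necessarily what Riemann computes.

```python
import statistics

# Made-up sample: 95 fast transactions and 5 slow ones, in milliseconds.
latencies_ms = [20] * 95 + [900] * 5

mean_ms = statistics.mean(latencies_ms)

# Nearest-rank 99th percentile: the smallest value with at least 99% of samples at or below it.
ordered = sorted(latencies_ms)
rank = (99 * len(ordered) + 99) // 100  # integer ceiling of 0.99 * n
p99_ms = ordered[rank - 1]
```

The mean comes out at 64 ms, which looks comfortably within a 300 ms budget, while the 99th percentile is 900 ms: one transaction in twenty blows the budget and the average hides it.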
  21. 21. Questions? Contact info: Re’em Bensimhon / twitter: @reembs