
Big Events, Mob Scale - Darach Ennis (Push Technology)



Presented at JAX London

MapReduce begat Hadoop begat Big Data. NoSQL moved us away from the stricture of monolithic storage architectures to fit-for-purpose designs. But, Houston, we still have a problem. Architects are still designing systems as if this were the '70s. SOA went from buzzword to the bank with the emergence and evolution of the cloud and on-demand, right-now elasticity. Yet most systems are still designed to store-then-compute rather than to observe, orient, decide and act on in-flight data.

Published in: Technology


  1. 1. big DATA mob SCALE JAX London 2013 - Darach Ennis - @darachennis
  2. 2. small FAST DATA guy
  3. 3. Big Data - “The techniques and technologies for such data-intensive science are so different that it is worth distinguishing data-intensive science from computational science as a new, fourth paradigm” - Jim Gray, The Fourth Paradigm: Data-Intensive Scientific Discovery. - Microsoft 2009
  4. 4. DATA intensive science SCALE
  5. 5. Compute Sympathy
  6. 6. Compute Sympathy
  7. 7. Compute Sympathy
  8. 8. A Wall Street Second
  9. 9. A Swiss Second
  10. 10. Small Data? <= 128 bytes HTTP GET/POST - A typical RESTful performance [Chart: Req/Sec, Bw/Sec (MB), Avg Latency (ms), Max Latency (ms) and Stdev (ms) plotted against Concurrent Connections (1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024). Values appearing in the extracted chart text: 12,616; 14,642; 15,499; 15,787; 15,445; 15,330; 15,173; 14,998; 8,705; 3,907; 4,279]
  11. 11. Small Data? <= 1K HTTP GET/POST - A typical RESTful performance [Chart: Req/Sec, Bw/Sec (MB), Avg Latency (ms), Max Latency (ms) and Stdev (ms) plotted against Concurrent Connections (1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024). Values appearing in the extracted chart text: 1,288; 1,951; 2,722; 2,849; 2,790; 2,858; 2,916; 2,830; 2,788; 2,842; 690]
  12. 12. Big Events - 1 Billion Sources. Ballpark number of boxes if each box can handle 2500 events/second [Chart: number of boxes for 1 million, 10 million, 100 million and 1 billion sources, each emitting at 1/dy, 1/hr, 1/mn and 1/sc. Values appearing in the extracted chart text include: 400,000; 40,000; 16,667; 4,000; 1,667; 167; 112; 35; 17; 12; 5; 2; 1]
  13. 13. Data Sympathy?
  14. 14. 5 V's
  15. 15. 5 V’s via [V-PEC-T] • Business Factors: ‘Veracity’ - The What; ‘Value’ - The Why • Technical Domain (Policies, Events, Content): Volume, Velocity, Variety
  16. 16. Source: Ashwani Roy, Charles Cai - QCON London 2013 -
  17. 17. Source: Ashwani Roy, Charles Cai - QCON London 2013 -
  18. 18. Source: Ashwani Roy, Charles Cai - QCON London 2013 -
  19. 19. Incremental: The needs of the individual event or query outweigh the needs of the aggregate events or queries in flight in the system
  20. 20. Batch: The needs of the system outweigh the needs of individual events and queries running in flight or active within the system
  21. 21. “Computing arbitrary functions on an arbitrary dataset in real time is a daunting problem..” - Nathan Marz
  22. 22. Lambda Architecture “Twitter Scale”: 5000 msgs/second inbound, <1K “Small data”. “Firehose” outbound - but that's just a broadcast problem (easy)
  23. 23. Lambda: [Diagram labels: Batch (Time Series, Docs, K/V, Rel); Serving (Views); Speed (Views); Web, Data, MQ; "New Data"; Apps]
  24. 24. Lambda: A All new data is sent to both the batch layer and the speed layer. In the batch layer, new data is appended to the master dataset. In the speed layer, the new data is consumed to do incremental updates of the realtime views.
  25. 25. Lambda: B The master dataset is an immutable, append-only set of data. The master dataset only contains the rawest information that is not derived from any other information you have.
  26. 26. Lambda: Master data set • From B: “rawest … not derived” • In many environments it may be preferable to normalise data for later ease of retrieval (e.g. Dremel, strongly typed nested records) to support scalable ad hoc query.
 • Derivation allows other forms of efficient retrieval, e.g. using SAX (Symbolic Aggregate Approximation), PAA (Piecewise Aggregate Approximation), etc.
  27. 27. Lambda: [Diagram labels: Batch (Time Series, Docs, K/V, Rel); Serving (Views); Speed (Views); Web, Data, MQ; "New Data"; Apps; with a "?" placed next to the batch stores]
  28. 28. SAX & PAA Piecewise Aggregate Approximation Symbolic Aggregate Approximation 1sc -> 1mn -> 1hr -> 1dy -> 1wk -> 1mh -> 1yr
  29. 29. Lambda: C The batch layer precomputes query functions from scratch. The results of the batch layer are called batch views. The batch layer runs in a while(true) loop and continuously recomputes the batch views from scratch. The strength of the batch layer is its ability to compute arbitrary functions on arbitrary data. This gives it the power to support any application.
  30. 30. Lambda: D The serving layer indexes the batch views produced by the batch layer and makes it possible to get particular values out of a batch view very quickly. The serving layer is a scalable database that swaps in new batch views as they’re made available. Because of the latency of the batch layer, the results available from the serving layer are always out of date by a few hours.
  31. 31. Lambda: [Diagram labels: Batch (Time Series, Docs, K/V, Rel); Serving (Views); Speed (Views); Web, Data, MQ; "New Data"; Apps; with a "?" placed in the diagram]
  32. 32. Think ‘Statistical Compression'
  33. 33. Lambda: E The speed layer compensates for the high latency of updates to the serving layer. It uses fast incremental algorithms and read/write databases to produce realtime views that are always up to date. The speed layer only deals with recent data, because any data older than that has been absorbed into the batch layer and accounted for in the serving layer. The speed layer is significantly more complex than the batch and serving layers, but that complexity is compensated by the fact that the realtime views can be continuously discarded as data makes its way through the batch and serving layers. So, the potential negative impact of that complexity is greatly limited.
  34. 34. Lambda: [Diagram labels: Batch (Time Series, Docs, K/V, Rel); Serving (Views); Speed (Views); Web, Data, MQ; "New Data"; Apps; with a "?" placed next to Speed]
  35. 35. Use a DSP + CEP/ESP or ‘Scalable CEP' • Storm/S4 + Esper/… • Embed a CEP/ESP within a Distributed Stream processing Engine • Use Drill for large scale ad hoc query [leverage nested records] • Already have middleware? Have well defined queries? Roll your own minimal EEP (or use mine!)
  36. 36. Lambda: F Queries are resolved by getting results from both the batch and realtime views and merging them together.
  37. 37. MillWheel: Google’s “Zeitgeist pipeline” [Diagram labels: Web Query; Queries; Window Counter; Model; Stats; Out of Trend?; Alerts; Monitor]
  38. 38. Lambda: Batch View • Precomputed queries are central to Complex Event Processing / Event Stream Processing architectures. • Unfortunately, though, most DBMSs still offer only synchronous, blocking RPC access to underlying data, when asynchronous guaranteed delivery would be preferable for view construction leveraging CEP/ESP techniques.
  39. 39. Lambda: Merging … • Possibly one of the most difficult aspects of near real-time and historical data integration is combining flows sensibly. • For example, is the order of interleaving across merge sources applied in a known deterministically recomputable order? If not, how can results be recomputed subsequently? Will data converge? 
  40. 40. Lambda: A start … [Diagram labels: Batch (Time Series, Docs, K/V, Rel); Serving (Views); Speed (Views); Web, Data, MQ; "New Data"; Apps]
  41. 41. mob DATA Not a Jedi … yet …
  42. 42. Thanks. Questions? @darachennis
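
The ballpark on slide 12 (boxes needed if each box handles 2500 events/second) can be approximated with a few lines of arithmetic. The sketch below is an illustration only: the function name `boxes_needed`, the rate keys and the round-up-to-at-least-one-box rule are assumptions, not taken from the talk. Labels such as 400,000, 40,000, 4,000, 112, 12 and 5 agree with this arithmetic; the sketch does not attempt to reproduce every value on the chart.

```python
# Assumed per-box capacity, as quoted on slide 12.
EVENTS_PER_BOX_PER_SEC = 2500

# Seconds between events for each emission rate used on the slide.
SECONDS_PER_EVENT = {"1/dy": 86_400, "1/hr": 3_600, "1/mn": 60, "1/sc": 1}


def boxes_needed(sources: int, rate: str, per_box: int = EVENTS_PER_BOX_PER_SEC) -> int:
    """Ballpark box count: aggregate events/second over per-box capacity, rounded up.

    Integer ceiling division avoids floating-point rounding surprises.
    """
    divisor = SECONDS_PER_EVENT[rate] * per_box
    return max(1, (sources + divisor - 1) // divisor)


if __name__ == "__main__":
    for sources in (10**6, 10**7, 10**8, 10**9):
        row = {rate: boxes_needed(sources, rate) for rate in SECONDS_PER_EVENT}
        print(f"{sources:>13,} sources: {row}")
```

Changing `per_box` shows how sensitive the estimate is to per-node throughput, which is the point of the slide: at a billion sources, even one event per hour per source is already a cluster-sized problem.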
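
The Lambda slides (A through F) describe a flow that is easier to follow in miniature. The following is a toy, single-process sketch under stated assumptions: the class `LambdaSketch`, its method names and the choice of a per-key count as the query are all hypothetical, and a real deployment would use a distributed batch system, a serving database and a stream processor rather than in-memory counters.

```python
from collections import Counter


class LambdaSketch:
    """Toy illustration of the batch, serving and speed layers (not production code)."""

    def __init__(self):
        self.master = []                # append-only master dataset of raw events
        self.batch_view = Counter()     # serving layer: last completed batch view
        self.realtime_view = Counter()  # speed layer: counts not yet in a batch view

    def ingest(self, key):
        """Lambda A: new data goes to both the batch layer and the speed layer."""
        self.master.append(key)       # batch layer appends to the master dataset
        self.realtime_view[key] += 1  # speed layer updates incrementally

    def run_batch(self):
        """Lambda C and D: recompute the batch view from scratch and swap it in."""
        self.batch_view = Counter(self.master)
        # Lambda E: realtime state now covered by the batch view can be discarded.
        self.realtime_view = Counter()

    def query(self, key):
        """Lambda F: resolve a query by merging batch and realtime views."""
        return self.batch_view[key] + self.realtime_view[key]


if __name__ == "__main__":
    demo = LambdaSketch()
    for page in ["home", "about", "home"]:
        demo.ingest(page)
    print(demo.query("home"))  # answered from the realtime view only
    demo.run_batch()
    demo.ingest("home")
    print(demo.query("home"))  # batch view plus one realtime increment
```

Counting is a forgiving case because addition is commutative and associative, so the merge order does not matter. The merging questions raised on slide 39 become harder for queries where interleaving order changes the result.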
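
Slides 26 and 28 mention PAA and SAX as derived forms that support efficient retrieval. As a rough sketch of the two techniques (not the speaker's implementation): PAA replaces each equal-width frame of a series with its mean, and SAX z-normalises the series, applies PAA, then maps each frame mean to a symbol using breakpoints that split a standard normal distribution into equiprobable regions. The function names, the equal-frame-width restriction and the default four-letter alphabet are assumptions made for illustration.

```python
from bisect import bisect_right
from statistics import NormalDist, mean, pstdev


def paa(series, segments):
    """Piecewise Aggregate Approximation: the mean of each equal-width frame."""
    if segments <= 0 or len(series) % segments != 0:
        raise ValueError("this sketch requires len(series) to be a multiple of segments")
    width = len(series) // segments
    return [mean(series[i:i + width]) for i in range(0, len(series), width)]


def sax(series, segments, alphabet="abcd"):
    """Symbolic Aggregate Approximation: z-normalise, apply PAA, then discretise."""
    mu, sigma = mean(series), pstdev(series)
    if sigma == 0:
        raise ValueError("a constant series cannot be z-normalised")
    normalised = [(x - mu) / sigma for x in series]
    # Breakpoints cut the standard normal into len(alphabet) equiprobable regions.
    size = len(alphabet)
    breakpoints = [NormalDist().inv_cdf(k / size) for k in range(1, size)]
    return "".join(alphabet[bisect_right(breakpoints, v)] for v in paa(normalised, segments))


if __name__ == "__main__":
    readings = [1, 2, 3, 4, 5, 6, 7, 8]
    print(paa(readings, 4))
    print(sax(readings, 4))
```

Applying `paa` repeatedly at coarser frame widths gives the kind of roll-up chain shown on slide 28 (1sc -> 1mn -> 1hr and so on), with each level storing far fewer points than the one below it.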