
DEBS 2015 Tutorial: Patterns for Realtime Streaming Analytics


This tutorial will discuss and demonstrate how to implement different realtime streaming analytics patterns. We will start with counting use cases and progress to complex patterns like time windows, tracking objects, and detecting trends. We will start with Apache Storm and progress to Complex Event Processing based technologies.



  1. 1. ACM DEBS 2015: Realtime Streaming Analytics Patterns Srinath Perera Sriskandarajah Suhothayan WSO2 Inc.
  2. 2. Data Analytics (Big Data) o Scientists have been doing this for 25 years with MPI (1991) using special hardware o Took off with Google’s MapReduce paper (2004); Apache Hadoop, Hive, and a whole ecosystem were created o Later Spark emerged, and it is faster o But processing takes time
  3. 3. Value of Some Insights Degrades Fast! o For some use cases (e.g. stock markets, traffic, surveillance, patient monitoring) the value of insights degrades very quickly with time o E.g. stock markets and the speed of light o We need technology that can produce outputs fast o Static queries, but need very fast output (alerts, realtime control) o Dynamic and interactive queries (data exploration)
  4. 4. History ▪Realtime Analytics are not new either!! - Active Databases (2000+) - Stream processing (Aurora, Borealis (2005+), and later Storm) - Distributed Streaming Operators (e.g. a database research topic around 2005) - CEP Vendor Roadmap (from http://www.complexevents.com/2014/12/03/cep-tooling-market-survey-2014/)
  5. 5. Data Analytics Landscape
  6. 6. Realtime Interactive Analytics o Usually done to support interactive queries o Index data to make it readily accessible so you can respond to queries fast (e.g. Apache Drill) o Tools like Druid, VoltDB, and SAP HANA can do this with all data in memory to make things really fast
  7. 7. Realtime Streaming Analytics o Process data without storing (as data comes in) o Queries are fixed (static) o Triggers when given conditions are met o Technologies o Stream Processing (Apache Storm, Apache Samza) o Complex Event Processing/CEP (WSO2 CEP, Esper, StreamBase) o Micro batches (Spark Streaming)
  8. 8. Realtime Football Analytics ● Video: https://www.youtube.com/watch?v=nRI6buQ0NOM ● More Info: http://www.slideshare.net/hemapani/strata-2014- talktracking-a-soccer-game-with-big-data
  9. 9. Why Realtime Streaming Analytics Patterns? o Reason 1: The usual advantages o Give us better understanding o Give us a better vocabulary to teach and communicate o Tools can implement them o Reason 2: Under the theme of realtime analytics, a lot of people get carried away with the word count example. Patterns show that word count is just the tip of the iceberg.
  10. 10. Earlier Work on Patterns o Patterns from SQL (project, join, filter, etc.) o Event Processing Technical Society’s (EPTS) reference architecture o higher-level patterns such as tracking, prediction, and learning, in addition to the low-level operators that come from SQL-like languages o Esper’s Solution Patterns document (50 patterns) o Coral8 white paper
  11. 11. Basic Patterns o Pattern 1: Preprocessing ( filter, transform, enrich, project .. ) o Pattern 2: Alerts and Thresholds o Pattern 3: Simple Counting and Counting with Windows o Pattern 4: Joining Event Streams o Pattern 5: Data Correlation, Missing Events, and Erroneous Data
  12. 12. Patterns for Handling Trends o Pattern 7: Detecting Temporal Event Sequence Patterns o Pattern 8: Tracking (track something over space or time) o Pattern 9: Detecting Trends (rise, fall, turn, triple bottom) o Pattern 13: Online Control
  13. 13. Mixed Patterns o Pattern 6: Interacting with Databases o Pattern 10: Running the same Query in Batch and Realtime Pipelines o Pattern 11: Detecting and switching to Detailed Analysis o Pattern 12: Using a Machine Learning Model
  14. 14. Earlier Work on Patterns
  15. 15. Realtime Streaming Analytics Tools
  16. 16. Implementing Realtime Analytics o It is tempting to write custom code. Filters look very easy. Too complex!! Don’t! o Option 1: Stream Processing (e.g. Storm). Kind of works. It is like MapReduce; you have to write code. o Option 2: Spark Streaming - more compact than Storm, but cannot do some stateful operations. o Option 3: Complex Event Processing - compact, SQL-like language, fast
  17. 17. Stream Processing o Program a set of processors and wire them up; data flows through the graph. o A middleware framework handles data flow, distribution, and fault tolerance (e.g. Apache Storm, Samza) o Processors may be on the same machine or on multiple machines
  18. 18. Writing a Storm Program o Write Spout(s) o Write Bolt(s) o Wire them up o Run
  19. 19. Write Bolts We will use a shorthand like the one on the left to explain.
      public static class WordCount extends BaseBasicBolt {
        @Override
        public void execute(Tuple tuple, BasicOutputCollector collector) {
          // .. do something ...
          collector.emit(new Values(word, count));
        }
        @Override
        public void declareOutputFields(OutputFieldsDeclarer declarer) {
          declarer.declare(new Fields("word", "count"));
        }
      }
  20. 20. Wire up and Run
      TopologyBuilder builder = new TopologyBuilder();
      builder.setSpout("spout", new RandomSentenceSpout(), 5);
      builder.setBolt("split", new SplitSentence(), 8).shuffleGrouping("spout");
      builder.setBolt("count", new WordCount(), 12).fieldsGrouping("split", new Fields("word"));
      Config conf = new Config();
      if (args != null && args.length > 0) {
        conf.setNumWorkers(3);
        StormSubmitter.submitTopologyWithProgressBar(args[0], conf, builder.createTopology());
      } else {
        conf.setMaxTaskParallelism(3);
        LocalCluster cluster = new LocalCluster();
        cluster.submitTopology("word-count", conf, builder.createTopology());
        ...
      }
  21. 21. Complex Event Processing
  22. 22. Micro Batches (e.g. Spark Streaming) o Process data in small batches, and then combine the results for the final result (e.g. Spark) o Works for simple aggregates, but tricky for complex operations (e.g. event sequences) o Can do it with MapReduce as well if the deadlines are not too tight
  23. 23. SQL-Like Query Languages o SQL-like data processing languages (e.g. Apache Hive) o Since many understand SQL, Hive made large-scale Big Data processing accessible to many o Expressive, short, and sweet o Defines core operations that cover 90% of problems o Lets experts dig in when they like!
  24. 24. CEP = SQL for Realtime Analytics o Easy to follow from SQL o Expressive, short, and sweet o Defines core operations that cover 90% of problems o Lets experts dig in when they like!
  25. 25. Pattern Implementations
  26. 26. Code and other details o Sample code - https://github.com/suhothayan/DEBS-2015-Realtime-Analytics-Patterns o WSO2 CEP o pack - http://svn.wso2.org/repos/wso2/people/suho/packs/cep/4.0.0/debs2015/wso2cep-4.0.0-SNAPSHOT.zip o docs - https://docs.wso2.com/display/CEP400/WSO2+Complex+Event+Processor+Documentation o Apache Storm - https://storm.apache.org/ o We have packs on a pen drive
  27. 27. Pattern 1: Preprocessing o What? Cleanup and prepare data via operations like filter, project, enrich, split, and transform o Use cases? o From a Twitter data stream: we extract author, timestamp, and location fields and then filter by the location of the author o From a temperature stream we extract the temperature & room number of the sensor and filter by them
  28. 28. Filter — In Storm (shown as a diagram) and in CEP (Siddhi): from TempStream[roomNo > 245 and roomNo <= 365] select roomNo, temp insert into ServerRoomTempStream ;
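As a plain-Java illustration (no Storm or Siddhi dependency), the same filter pattern can be sketched as below. The `TempEvent` record and method names are hypothetical, with fields mirroring the TempStream definition used later in the tutorial.

```java
import java.util.List;
import java.util.stream.Collectors;

public class TempFilter {
    // Hypothetical event type mirroring the Siddhi TempStream fields.
    public record TempEvent(long deviceID, int roomNo, double temp) {}

    // Keep only readings from server rooms 246..365, as in the Siddhi filter.
    public static List<TempEvent> filterServerRooms(List<TempEvent> in) {
        return in.stream()
                 .filter(e -> e.roomNo() > 245 && e.roomNo() <= 365)
                 .collect(Collectors.toList());
    }
}
```

Projection (dropping deviceID) is left to the caller; in a Storm bolt the same predicate would simply guard the `collector.emit(...)` call.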
  29. 29. Architecture of WSO2 CEP
  30. 30. CEP Event Adapters Support for several transports (network access) ● SOAP ● HTTP ● JMS ● SMTP ● SMS ● Thrift ● Kafka ● Websocket ● MQTT Supports database writes using Map messages ● Cassandra ● RDBMS Supports custom event adapters via its pluggable architecture!
  31. 31. Stream Definition (Data Model) { 'name':'soft.drink.coop.sales', 'version':'1.0.0', 'nickName': 'Soft_Drink_Sales', 'description': 'Soft drink sales', 'metaData':[ {'name':'region','type':'STRING'} ], 'correlationData':[ {'name':'transactionID','type':'STRING'} ], 'payloadData':[ {'name':'brand','type':'STRING'}, {'name':'quantity','type':'INT'}, {'name':'total','type':'INT'}, {'name':'user','type':'STRING'} ] }
  32. 32. Projection define stream TempStream (deviceID long, roomNo int, temp double); from TempStream select roomNo, temp insert into OutputStream ;
  33. 33. Inferred Streams from TempStream select roomNo, temp insert into OutputStream ; define stream OutputStream (roomNo int, temp double);
  34. 34. Enrich from TempStream select roomNo, temp, 'C' as scale insert into OutputStream ; define stream OutputStream (roomNo int, temp double, scale string); from TempStream select deviceID, roomNo, avg(temp) as avgTemp insert into OutputStream ;
  35. 35. Transformation from TempStream select concat(deviceID, '-', roomNo) as uid, toFahrenheit(temp) as tempInF, 'F' as scale insert into OutputStream ;
  36. 36. Split from TempStream select roomNo, temp insert into RoomTempStream ; from TempStream select deviceID, temp insert into DeviceTempStream ;
  37. 37. Pattern 2: Alerts and Thresholds o What? detects a condition and generates alerts based on it (e.g. alarm on high temperature) o These alerts can be based on a simple value or on more complex conditions such as the rate of increase o Use cases? o Raise an alert when a vehicle is going too fast o Alert when a room is too hot
  38. 38. Filter Alert from TempStream [ roomNo > 245 and roomNo <= 365 and temp > 40 ] select roomNo, temp insert into AlertServerRoomTempStream ;
  39. 39. Pattern 3: Simple Counting and Counting with Windows o What? aggregate functions like Min, Max, Percentiles, etc. o Often they can be computed without storing any data o Most useful when used with a window o Use cases? o Most metrics need a time bound so we can compare (errors per day, transactions per second) o The Linux load average gives us an idea of the overall trend by reporting the 1, 5, and 15 minute means
  40. 40. Types of windows o Sliding windows vs. Batch (tumbling) windows o Time vs. Length windows Also supports o Unique window o First unique window o External time window
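The sliding time window described above can be sketched in plain Java: keep a buffer of (timestamp, value) pairs, expire entries older than the window on each arrival, and report the running average. Class and method names are illustrative, not from any of the frameworks discussed here.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Minimal sketch of a sliding time window average; timestamps are in ms.
public class SlidingAvg {
    private final long windowMs;
    private final Deque<long[]> buf = new ArrayDeque<>(); // {timestamp, value}
    private long sum = 0;

    public SlidingAvg(long windowMs) { this.windowMs = windowMs; }

    // Add a reading and return the average over the last windowMs.
    public double add(long ts, long value) {
        buf.addLast(new long[]{ts, value});
        sum += value;
        // Expire events that have fallen out of the window.
        while (!buf.isEmpty() && buf.peekFirst()[0] <= ts - windowMs) {
            sum -= buf.pollFirst()[1];
        }
        return (double) sum / buf.size();
    }
}
```

A batch (tumbling) window differs only in that the buffer is cleared wholesale once per window interval instead of sliding event by event.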
  41. 41. Window In Storm
  42. 42. Aggregation In CEP (Siddhi) from TempStream select roomNo, avg(temp) as avgTemp insert into HotRoomsStream ;
  43. 43. Sliding Time Window from TempStream#window.time(1 min) select roomNo, avg(temp) as avgTemp insert all events into AvgRoomTempStream ;
  44. 44. Group By from TempStream#window.time(1 min) select roomNo, avg(temp) as avgTemp group by roomNo insert all events into HotRoomsStream ;
  45. 45. Batch Time Window from TempStream#window.timeBatch(5 min) select roomNo, avg(temp) as avgTemp group by roomNo insert all events into HotRoomsStream ;
  46. 46. Pattern 4: Joining Event Streams o What? Create a new event stream by joining multiple streams o The complication comes with time, so the join needs at least one window o Use cases? o To detect when a player has kicked the ball in a football game o To correlate TempStream with the state of the regulator and trigger control commands
  47. 47. Join with Storm
  48. 48. Join define stream TempStream (deviceID long, roomNo int, temp double); define stream RegulatorStream (deviceID long, roomNo int, isOn bool); In CEP (Siddhi)
  49. 49. Join In CEP (Siddhi) define stream TempStream (deviceID long, roomNo int, temp double); define stream RegulatorStream (deviceID long, roomNo int, isOn bool); from TempStream[temp > 30.0]#window.time(1 min) as T join RegulatorStream[isOn == false]#window.length(1) as R on T.roomNo == R.roomNo select T.roomNo, R.deviceID, 'start' as action insert into RegulatorActionStream ;
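In plain Java, the essence of this join is to keep the latest regulator event per room (the length-1 window) and check it whenever a hot temperature reading arrives. A minimal sketch with illustrative names:

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the temp/regulator join: remember the latest regulator state per
// room and emit a 'start' action for hot rooms whose regulator is off.
public class TempRegulatorJoin {
    private final Map<Integer, Boolean> regulatorOn = new HashMap<>(); // roomNo -> isOn

    public void onRegulator(int roomNo, boolean isOn) { regulatorOn.put(roomNo, isOn); }

    // Returns "start" when the room is hot and its regulator is known to be off.
    public String onTemp(int roomNo, double temp) {
        if (temp > 30.0 && Boolean.FALSE.equals(regulatorOn.get(roomNo))) {
            return "start";
        }
        return null;
    }
}
```

The sketch omits the one-minute time window on TempStream; a production version would also expire stale temperature readings, as in the sliding-window example earlier.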
  50. 50. Pattern 5: Data Correlation, Missing Events, and Erroneous Data o What? find correlations and use them to detect and handle missing and erroneous data o Use Cases? o Detecting a missing event (e.g., detect a customer request that has not been responded to within 1 hour of its reception) o Detecting erroneous data (e.g., detecting failed sensors using a set of sensors that monitor overlapping regions; we can use the redundant data to find erroneous sensors and remove their data from further processing)
  51. 51. Missing Event in Storm
  52. 52. Missing Event in CEP In CEP (Siddhi) from RequestStream#window.time(1h) insert expired events into ExpiryStream ; from r1=RequestStream->r2=Response[id=r1.id] or r3=ExpiryStream[id=r1.id] select r1.id as id ... insert into AlertStream having r2.id == null;
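A plain-Java sketch of the missing-event idea: track pending request ids with their arrival times, clear them when the matching response arrives, and periodically flag those that exceed the timeout. Names are illustrative.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Sketch of missing-event detection: alert on requests not answered in time.
public class MissingEventDetector {
    private final long timeoutMs;
    private final Map<String, Long> pending = new HashMap<>(); // id -> arrival time

    public MissingEventDetector(long timeoutMs) { this.timeoutMs = timeoutMs; }

    public void onRequest(String id, long ts) { pending.put(id, ts); }

    public void onResponse(String id) { pending.remove(id); }

    // Called periodically (or on every event); returns ids that have timed out.
    public List<String> expire(long now) {
        List<String> alerts = new ArrayList<>();
        Iterator<Map.Entry<String, Long>> it = pending.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, Long> e = it.next();
            if (now - e.getValue() >= timeoutMs) {
                alerts.add(e.getKey());
                it.remove();
            }
        }
        return alerts;
    }
}
```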
  53. 53. Pattern 6: Interacting with Databases o What? Combine realtime data with historical data o Use Cases? o On a transaction, looking up the customer age using the ID from the customer database to detect fraud (enrichment) o Checking a transaction against blacklists and whitelists in the database o Receiving an input from the user (e.g., the daily discount amount may be updated in the database, and the query will pick it up automatically without human intervention)
  54. 54. In Storm Querying Databases
  55. 55. In CEP (Siddhi) Event Table define table CardUserTable (name string, cardNum long) ; @from(eventtable = 'rdbms', datasource.name = 'CardDataSource', table.name = 'UserTable', caching.algorithm = 'LRU') define table CardUserTable (name string, cardNum long) Cache types supported ● Basic: A size-based algorithm based on FIFO. ● LRU (Least Recently Used): The least recently used event is dropped when the cache is full. ● LFU (Least Frequently Used): The least frequently used event is dropped when the cache is full.
  56. 56. Join : Event Table define stream Purchase (price double, cardNo long, place string); define table CardUserTable (name string, cardNum long) ; from Purchase#window.length(1) join CardUserTable on Purchase.cardNo == CardUserTable.cardNum select Purchase.cardNo as cardNo, CardUserTable.name as name, Purchase.price as price insert into PurchaseUserStream ;
  57. 57. Insert : Event Table define stream FraudStream (price double, cardNo long, userName string); define table BlacklistedUserTable (name string, cardNum long) ; from FraudStream select userName as name, cardNo as cardNum insert into BlacklistedUserTable ;
  58. 58. Update : Event Table define stream LoginStream (userID string, islogin bool, loginTime long); define table LastLoginTable (userID string, time long) ; from LoginStream select userID, loginTime as time update LastLoginTable on LoginStream.userID == LastLoginTable.userID ;
  59. 59. Pattern 7: Detecting Temporal Event Sequence Patterns o What? detect a temporal sequence of events or conditions arranged in time o Use Cases? o Detect suspicious activities like a small transaction immediately followed by a large transaction o Detect ball possession in a football game o Detect suspicious financial patterns like large buy and sell behaviour within a small time period
  60. 60. In Storm Pattern
  61. 61. In CEP (Siddhi) Pattern define stream Purchase (price double, cardNo long,place string); from every (a1 = Purchase[price < 100] -> a3= ..) -> a2 = Purchase[price >10000 and a1.cardNo == a2.cardNo] within 1 day select a1.cardNo as cardNo, a2.price as price, a2.place as place insert into PotentialFraud ;
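The temporal sequence above amounts to a small per-card state machine: remember the last small purchase for each card and fire when a large purchase follows within a day. A minimal plain-Java sketch (class, thresholds, and method names are illustrative):

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the fraud sequence: a purchase under $100 followed within a day
// by one over $10,000 on the same card.
public class FraudSequence {
    private static final long DAY_MS = 24L * 3600 * 1000;
    private final Map<Long, Long> smallPurchase = new HashMap<>(); // cardNo -> timestamp

    // Returns true when this event completes the suspicious sequence.
    public boolean onPurchase(long cardNo, double price, long ts) {
        Long smallTs = smallPurchase.get(cardNo);
        if (price > 10000 && smallTs != null && ts - smallTs <= DAY_MS) {
            smallPurchase.remove(cardNo); // sequence matched, reset the state
            return true;
        }
        if (price < 100) smallPurchase.put(cardNo, ts);
        return false;
    }
}
```

A CEP engine generalizes exactly this bookkeeping: each `->` step in the Siddhi pattern corresponds to a state transition, partitioned by the correlation key (here, the card number).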
  62. 62. Pattern 8: Tracking o What? detecting an overall trend over time o Use Cases? o Tracking a fleet of vehicles, making sure that they adhere to speed limits, routes, and Geo-fences o Tracking wildlife, making sure they are alive (they will not move if they are dead) and making sure they will not go out of the reservation o Tracking airline luggage and making sure it has not been sent to the wrong destination o Tracking a logistic network and figuring out bottlenecks and unexpected conditions
  63. 63. TFL: Traffic Analytics Built using TFL ( Transport for London) open data feeds. http://goo.gl/9xNiCm http://goo.gl/04tX6k
  64. 64. Pattern 9: Detecting Trends o What? tracking something over space and time and detecting given conditions o Useful in stock markets, SLA enforcement, auto scaling, predictive maintenance o Use Cases? o Rise and fall of values, and turns (switch from a rise to a fall) o Outliers - values that deviate from the current trend by a large amount o Complex trends like “Triple Bottom” and “Cup and Handle” [17]
  65. 65. Trend in Storm Build and apply a state machine
  66. 66. In CEP (Siddhi) Sequence from t1=TempStream, t2=TempStream[(isNull(t2[last].temp) and t1.temp < temp) or (t2[last].temp < temp and not(isNull(t2[last].temp)))]+ within 5 min select t1.temp as initialTemp, t2[last].temp as finalTemp, t1.deviceID, t1.roomNo insert into IncreasingHotRoomsStream ;
  67. 67. In CEP (Siddhi) Partition partition with (roomNo of TempStream) begin from t1=TempStream, t2=TempStream[(isNull(t2[last].temp) and t1.temp < temp) or (t2[last].temp < temp and not(isNull(t2[last].temp)))]+ within 5 min select t1.temp as initialTemp, t2[last].temp as finalTemp, t1.deviceID, t1.roomNo insert into IncreasingHotRoomsStream ; end;
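The core check behind the Siddhi sequence — consecutive readings that keep increasing — can be sketched in plain Java as a monotonic-rise test over a buffered window of values. Class and method names are illustrative.

```java
import java.util.List;

// Sketch of rise detection: the check behind the "increasing temperature"
// sequence. A fall or a turn can be detected the same way with the
// comparison reversed or split at the maximum.
public class TrendDetector {
    // True when values rise strictly from first to last (needs >= 2 readings).
    public static boolean isRising(List<Double> values) {
        if (values.size() < 2) return false;
        for (int i = 1; i < values.size(); i++) {
            if (values.get(i) <= values.get(i - 1)) return false;
        }
        return true;
    }
}
```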
  68. 68. Detecting Trends in Real Life o The paper “A Complex Event Processing Toolkit for Detecting Technical Chart Patterns” (HPBC 2015) used this idea to identify stock chart patterns o Used kernel regression for smoothing and detected maxima and minima o Then any pattern can be written as a temporal event sequence
  69. 69. Pattern 10: Lambda Architecture o What? runs the same query in both realtime and batch pipelines. This uses realtime analytics to fill the lag in batch analytics results. o Also called the “Lambda Architecture” (Nathan Marz); see also Jay Kreps’s “Questioning the Lambda Architecture” o Use Cases? o For example, if batch processing takes 15 minutes, results would always lag 15 minutes behind the current data. Here realtime processing fills the gap.
  70. 70. Lambda Architecture. How?
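The merge step of the Lambda Architecture can be sketched as combining a batch view with the realtime (speed) view, here as per-key counts. This is a simplified, illustrative sketch; real serving layers also track the batch cutoff time so realtime counts cover only the lag.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of the Lambda merge: serve batch results and fill the lag with
// realtime counts accumulated since the last batch run.
public class LambdaMerge {
    // Combine per-key counts from the batch view and the speed view.
    public static Map<String, Long> merge(Map<String, Long> batch, Map<String, Long> speed) {
        Map<String, Long> out = new HashMap<>(batch);
        speed.forEach((k, v) -> out.merge(k, v, Long::sum));
        return out;
    }
}
```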
  71. 71. Pattern 11: Detecting and switching to Detailed Analysis o What? detect a condition that suggests some anomaly, and further analyze it using historical data o Use Cases? o Use basic rules to detect fraud (e.g., a large transaction), then pull out all transactions done against that credit card for a longer period (e.g., 3 months of data) from the batch pipeline and run a detailed analysis o While monitoring weather, detect conditions like high temperature or low pressure in a given region, and then start a high-resolution localized forecast for that region o Detect good customers (e.g., through expenditure of more than $1000 within a month), and then run a detailed model to decide whether to offer a deal
  72. 72. Pattern 11: How?
  73. 73. Pattern 12: Using a Machine Learning Model o What? The idea is to train a model (often a Machine Learning model), and then use it with the Realtime pipeline to make decisions o For example, you can build a model using R, export it as PMML (Predictive Model Markup Language) and use it within your realtime pipeline. o Use Cases? o Fraud Detection o Segmentation o Predict Churn
  74. 74. Predictive Analytics o Build models and use them with WSO2 CEP, BAM, and ESB using the upcoming WSO2 Machine Learner product (2015 Q2) o Build models using R, export them as PMML, and use them within WSO2 CEP o Call R scripts from CEP queries
  75. 75. In CEP (Siddhi) PMML Model from TransactionStream #ml:applyModel('/path/logisticRegressionModel1.xml', timestamp, amount, ip) insert into PotentialFraudsStream ;
  76. 76. Pattern 13: Online Control o What? Control something online. This involves problems like current situation awareness, predicting next value(s), and deciding on corrective actions o Use Cases? o Autopilot o Self-driving o Robotics
  77. 77. Fraud Demo
  78. 78. Scaling & HA for Pattern Implementations
  79. 79. So how do we scale a system? o Vertical Scaling o Horizontal Scaling
  80. 80. Vertical Scaling
  81. 81. Horizontal Scaling E.g. Calculate Mean
  82. 82. Horizontal Scaling ... E.g. Calculate Mean
  83. 83. Horizontal Scaling ... E.g. Calculate Mean
  84. 84. Horizontal Scaling ... How about scaling the median?
  85. 85. Horizontal Scaling ... How about scaling the median? If and only if we can partition!
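Why the mean partitions cleanly: each node can emit a partial (sum, count) pair, and the pairs combine associatively into the exact global mean — something that is not possible for the exact median without seeing (or approximating) the whole distribution. A sketch with illustrative names:

```java
import java.util.List;

// Sketch of a horizontally scalable mean: per-partition partial results
// merge associatively into the global answer.
public class DistributedMean {
    public record Partial(double sum, long count) {
        public Partial merge(Partial o) { return new Partial(sum + o.sum, count + o.count); }
        public double mean() { return sum / count; }
    }

    // Run on each partition independently.
    public static Partial summarize(List<Double> partition) {
        double s = 0;
        for (double v : partition) s += v;
        return new Partial(s, partition.size());
    }
}
```

The median has no such small mergeable summary, which is why it scales only when the data itself can be partitioned by key (each partition computing its own median), as the slide notes.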
  86. 86. Scalable Realtime solutions ... Spark Streaming o Supports distributed processing o Runs micro batches o Does not support pattern & sequence detection
  87. 87. Scalable Realtime solutions ... Spark Streaming o Supports distributed processing o Runs micro batches o Does not support pattern & sequence detection Apache Storm o Supports distributed processing o Stream processing engine
  88. 88. Why not use Apache Storm? Advantages o Supports distributed processing o Supports partitioning o Extendable o Open source Disadvantages o Need to write Java code o Need to start from basic principles (& data structures) o Adapting to change is slow o No support to govern artifacts
  89. 89. WSO2 CEP += Apache Storm Advantages o Supports distributed processing o Supports partitioning o Extendable o Open source Disadvantages addressed o No need to write Java code (supports a SQL-like query language) o No need to start from basic principles (supports a high-level language) o Adapting to change is fast o Govern artifacts using Toolboxes o etc ...
  90. 90. How do we scale?
  91. 91. How do we scale ...
  92. 92. Scaling with Storm
  93. 93. Siddhi QL define stream StockStream (symbol string, volume int, price double); @name('Filter Query') from StockStream[price > 75] select * insert into HighPriceStockStream ; @name('Window Query') from HighPriceStockStream#window.time(10 min) select symbol, sum(volume) as sumVolume insert into ResultStockStream ;
  94. 94. Siddhi QL - with partition define stream StockStream (symbol string, volume int, price double); @name('Filter Query') from StockStream[price > 75] select * insert into HighPriceStockStream ; @name('Window Query') partition with (symbol of HighPriceStockStream) begin from HighPriceStockStream#window.time(10 min) select symbol, sum(volume) as sumVolume insert into ResultStockStream ; end;
  95. 95. Siddhi QL - distributed define stream StockStream (symbol string, volume int, price double); @name('Filter Query') @dist(parallel='3') from StockStream[price > 75] select * insert into HighPriceStockStream ; @name('Window Query') @dist(parallel='2') partition with (symbol of HighPriceStockStream) begin from HighPriceStockStream#window.time(10 min) select symbol, sum(volume) as sumVolume insert into ResultStockStream ; end;
  96. 96. On Storm UI
  97. 97. On Storm UI
  98. 98. High Availability
  99. 99. HA / Persistence o Option 1: Side by side o Recommended o Takes 2X hardware o Gives zero downtime o Option 2: Snapshot and restore o Uses less HW o Will lose events between snapshots o Downtime during recovery o ** In some scenarios you can use event tables to keep intermediate state
  100. 100. Siddhi Extensions ● Function extension ● Aggregator extension ● Window extension ● Transform extension
  101. 101. Siddhi Query: Function Extension from TempStream select deviceID, roomNo, custom:toKelvin(temp) as tempInKelvin, 'K' as scale insert into OutputStream ;
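The conversion behind a hypothetical `custom:toKelvin` extension is a plain arithmetic function; a sketch of the underlying conversions (with `toFahrenheit` matching the earlier Transformation slide) in plain Java:

```java
// Illustrative conversion functions a custom Siddhi function extension
// would delegate to; inputs are assumed to be in Celsius.
public class TempFunctions {
    public static double toKelvin(double celsius) { return celsius + 273.15; }
    public static double toFahrenheit(double celsius) { return celsius * 9.0 / 5.0 + 32.0; }
}
```

In Siddhi the extension class additionally implements the framework's function-executor interface and is registered under the `custom` namespace; the details are in the WSO2 CEP documentation linked earlier.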
  102. 102. Siddhi Query: Aggregator Extension from TempStream select deviceID, roomNo, temp, custom:stdev(temp) as stdevTemp, 'C' as scale insert into OutputStream ;
  103. 103. Siddhi Query: Window Extension from TempStream #window.custom:lastUnique(roomNo, 2 min) select * insert into OutputStream ;
  104. 104. Siddhi Query: Transform Extension from XYZSpeedStream #transform.custom:getVelocityVector(v, vx, vy, vz) select velocity, direction insert into SpeedStream ;
  105. 105. Contact us!
