Processing Flows of InformationFrom Data Streams to Complex Event ProcessingGianpaoloCugola and Alessandro MargaraPolitecnicodi MilanoDip. Elettronica e Informazione[cugola,margara]@elet.polimi.it“Processing Flows of Information: from data streams to complex event processing”ACM Computing Surveys, to appear (but available from: http://home.dei.polimi.it/cugola)
Background and motivationBack in 2007 CEP was already a hot topic…… but having a good grasp of the area was rather hardAs observed by OpherEtzion the area was looking like the “Tower of Babel”Event Processing and the Babylon Tower – Event process thinking blog – Sept. 8, 2007eventdatamessageprimitivecomplexcompositestreamflowsequencepatternjoinqueryruleconditionactionpublishsubscribenotifyadvertisesendreceivemiddlewaresystemapplicationprotocolnetworkrouting2DEBS 2011 - Tutorial
Background and motivationSeveral communities were contributing to the area…… each bringing its own expertise and vocabulary……but often working in isolationeventResearchersdatamessageDEBSprimitivedata management& databasesprocess modeling& automationcomplexcompositestreamflowsequencepatternjoinqueryruleconditionactionpublishsubscribenotifymiddlewaresystemsad-hoc tools(intrusion det., …)advertisesendreceiveapplicationserversDBMSsmiddlewaresystemapplicationprotocolnetworkTool vendorsrouting3DEBS 2011 - Tutorial
Background and motivationThat was 2007. What about today?Things did not change muchFrom the “Event Process Thinking” blog – Sept. 2010[Which is] the relation between event processing and data stream management?They are aliases -- stream is just a collection of events, likewise, an event is just a member in a stream, and the functionality is the sameStream management is a subset of event processing -- there are different ways to do event processing, streams is one of themEvent processing is a subset of stream management -- event streams is just one type of stream, but there are voice stream, video stream, …Event processing and stream management are distinct and there is no overlapping between themAt the same time tool vendors are building tools that try to combine  ¿ juxtapose ?  different approaches4DEBS 2011 - Tutorial
Our goalWe would like to compare different systems in a precise wayWe would like to compare different approaches in a precise wayWe would like to help people coming from different areas communicate and compare their work with othersWe would like to isolate the open issues from those already solvedWe would like to precisely identify the challengesWe would like to isolate the best part of the various approaches…… finding a way to combine themWe need a modeling frameworkthat could accommodate all the proposals5DEBS 2011 - Tutorial
Forget your own vocabulary(temporarily)CEP, DSMSs, Active Databases… All these terms hide a peculiar view of the domain we have in mindWith subtle, “unsaid” (and often unclear) differencesBefore learning something new we have to forget what we already know6DEBS 2011 - Tutorial
The Information Flow Processing domainThe IFP engine processes incoming flows of information according to a set of processing rulesProcessing is “on line”Sources produce the incoming information flows, sinks consume the results of processing, rule managers add or remove rulesInformation flows are composed of information itemsItems part of the same flow are neither necessarily ordered nor of the same kindProcessing involve filtering,combining, and aggregatingflows, item by item as theyenter the engineSourcesSinksIFP EngineInformation FlowsInformation Flows--------------------Rules--------------------Rule managers7DEBS 2011 - Tutorial
IFP: One name several incarnationsA lot of applicationsEnvironmental monitoring through sensor networksFinancial applicationsFraud detection toolsNetwork intrusion detection systemsRFID-based inventory managementManufacturing control systems…Several technologiesActive databasesDSMSsCEP Systems8DEBS 2011 - Tutorial
One model, several modelsDifferent models to capture different viewpointsFunctional modelProcessing modelDeployment modelInteraction modelTime modelData modelRule modelLanguage model9DEBS 2011 - Tutorial
Functional modelKnowledgebaseAAAHistoryHistorySeqHistoryDeciderForwarderReceiverProducer------------------------------------------------------------------------------------------ClockRules10DEBS 2011 - Tutorial
Functional modelImplements the transport protocol to move information items along the net
Acts as a demultiplexerKnowledgebaseAAAHistoryHistorySeqHistoryDeciderForwarderReceiverProducer------------------------------------------------------------------------------------------ClockRules11DEBS 2011 - Tutorial
Functional modelImplements the transport protocol to move information items along the net
Acts as a multiplexerKnowledgebaseAAAHistoryHistorySeqHistoryDeciderForwarderReceiverProducer------------------------------------------------------------------------------------------ClockRules12DEBS 2011 - Tutorial
A short digressionWe assume rules can be (logically)decomposed in two parts: C -> AC is the conditionA is the actionExample (in CQL):  Select IStream(Count(*))From F1 [Range 1 Minute]Where F1.A > 0This way we can split processingin two phases:The detection phase determines the items that trigger the ruleThe production phase use those items to produce the output of the ruleKnowledgebaseAAAHistoryHistorySeqHistoryactionDeciderForwarderReceiverconditionProducer------------------------------------------------------------------------------------------ClockRules13DEBS 2011 - Tutorial
Functional modelKnowledgebaseAAASeqDeciderForwarderHistoryReceiverHistoryProducerHistory------------------------------------------------------------------------------------------ClockRules14DEBS 2011 - Tutorial
Functional modelImplements the detection phase
Accumulates partial results into the history
When a rule fires passes to the producer its action part and the triggering itemsKnowledgebaseAAASeqDeciderForwarderHistoryReceiverHistoryProducerHistory------------------------------------------------------------------------------------------ClockRules15DEBS 2011 - Tutorial
Functional modelKnowledgebaseAAAImplements the production phase
Uses the items in Seq as stated in action ASeqDeciderForwarderHistoryReceiverHistoryProducerHistory------------------------------------------------------------------------------------------ClockRules16DEBS 2011 - Tutorial
Functional modelKnowledgebaseAAASeqDeciderForwarderHistoryReceiverHistoryProducerHistorySome systems allow rules to be added or removed at processing time------------------------------------------------------------------------------------------ClockRules17DEBS 2011 - Tutorial
Functional modelIf present models the ability of performing recursive processing building hierarchies of itemsKnowledgebaseAAASeqDeciderForwarderHistoryReceiverHistoryProducerHistory------------------------------------------------------------------------------------------ClockRules18DEBS 2011 - Tutorial
Functional modelSome systems allows rules to combine flowing items with items previously stored into a (read only) storageKnowledgebaseAAASeqDeciderForwarderHistoryReceiverHistoryProducerHistory------------------------------------------------------------------------------------------ClockRules19DEBS 2011 - Tutorial
Functional modelKnowledgebaseAAAOptional component
Periodically creates special information items holding current time
Its presence models the ability of performingperiodic processing of inputsSeqDeciderForwarderHistoryReceiverHistoryProducerHistory------------------------------------------------------------------------------------------ClockRules20DEBS 2011 - Tutorial
Functional model: ConsiderationsThe detection-production cycleFired by a new item I entering the engine through the ReceiverIncluding those periodically produced by the Clock, if presentDetection phase: Evaluates all the rules to find those enabledUsing item I plus the data into the Knowledge base, if presentThe item I can be accumulated into the History for partially enabled rulesThe action part of the enabled rules together with the triggering items (A+Seq) is passed to the producerProduction phase: Produces the output itemsCombining the items that triggered the rule with data present in the Knowledge base, if presentNew items are sent to subscribed sinks (through the Forwarder)……but they could also be sent internally to be processed again (recursive processing)In some systems the action part of fired rules may also change the set of deployed rules21DEBS 2011 - Tutorial
Functional model: ConsiderationsMaximum length of Seq a key aspect1  PubSubBounded  CQL like languages without time based windowsPattern based languages without a Kleene+ operatorOther key aspects that impact expressivenessPresence of the ClockModels the ability to process rulesperiodicallyAvailable in almost half of the systemsreviewedMost Active DBs and DSMSs butfew CEP systemsKnowledgebaseAAAHistoryHistorySeqHistoryDeciderForwarderReceiverProducer------------------------------------------------------------------------------------------ClockRules22DEBS 2011 - Tutorial
Functional model: ConsiderationsPresence of the Knowledge baseOnly available in systems coming from the database communityPresence of the looping flow exitingthe ProducerModels the ability of performing recursiveprocessingHalf CEP systems have itAll Active DBs but very few DSMSshave itThey have nested rulesSupport to dynamic rule changeFew systems support itCan be implemented externally…Through sinks acting also as rule managers…but we think it is nice to have it internallyKnowledgebaseAAAHistoryHistorySeqHistoryDeciderForwarderReceiverProducer------------------------------------------------------------------------------------------ClockRules23DEBS 2011 - Tutorial
The semantics of processingWhat determines the output of each detection-production cycle?The new item entering the engineThe set of deployed rulesThe items stored into the HistoryThe content of the Knowledge BaseIs this enough?Example (in Padres and CQL):Smoke && Temp>50	Select IStream(Smoke.area)From Smoke[Rows 30 Slide 10], Temp[Rows 50 Slide 5]Where Smoke.area = Temp.area AND Temp.value > 50KnowledgebaseAAAHistoryHistoryHistoryHistoryHistoryHistoryRulesKnowledgebaseSeqDecider------------------------------------------------------------Producer------------------------------------------------------------------------------------------------------------------------24DEBS 2011 - Tutorial
Processing modelThree policies affect the behavior of the systemThe selection policyThe consumption policyThe load shedding policy25DEBS 2011 - Tutorial
Selection policyDetermines if a rule fires once or multiple times and the items actually selected from the HistoryExample:A0  BA1  BA0  BA1  BRRRRmultiple?DeciderReceiverAA AA0 A1singleorBAA B26DEBS 2011 - Tutorial
Selection policy: ConsiderationsMost systems adopt a multiple selection policyIt is simpler to implementIs it adequate?Example: Alert fire when smoke and high temperature in a short time frameIf 10 sensors read high temperature and immediately afterward one detects smoke I would like to receive a single alert, not 10A few systems allow this policy to be programmed……some of them on a per-rule baseE.g., Amit, T-Rex27DEBS 2011 - Tutorial
Selection policy: The TESLA caseTESLA (Trio-based Event Specification Language): the T-Rex languageA rule language for CEP. Tries to combine expressiveness and efficiencyHas a formally defined semanticsExpressed in Trio, a Metric Temporal Logic (see DEBS 2010)Allows rule managers to choose their own selection policy on a per rule baseExample: Multiple selection	define  Fire(area: string, measuredTemp: double)from    Smoke(area=$a) and each Temp(area=$a and val>50) within 1min. from Smokewhere   area=Smoke.area and measuredTemp=Temp.valueExample: Single selection	define  Fire(area: string, measuredTemp: double)from    Smoke(area=$a) and last Temp(area=$a and val>50) within 1min. from Smokewhere   area=Smoke.area and measuredTemp=Temp.val28DEBS 2011 - Tutorial
Selection policy: The TESLA caseTESLA (Trio-based Event Specification Language): the T-Rex languageA rule language for CEP. Tries to combine expressiveness and efficiencyHas a formally defined semanticsExpressed in Trio, a Metric Temporal Logic (see DEBS 2010)Allows rule managers to choose their own selection policy on a per rule baseExample: Multiple selection	define  Fire(area: string, measuredTemp: double)from    Smoke(area=$a) and each Temp(area=$a and val>50) within 1min. from Smokewhere   area=Smoke.area and measuredTemp=Temp.valueExample: Single selection	define  Fire(area: string, measuredTemp: double)from    Smoke(area=$a) and last Temp(area=$a and val>50) within 1min. from Smokewhere   area=Smoke.area and measuredTemp=Temp.valAlternatively you may use:
first…within
n-first…within   n-last…within29DEBS 2011 - Tutorial
Consumption policyDetermines how the history changes after firing of a rule  what happens when new items enter the DeciderExample:A   BA   BRR?DeciderReceiverAzeroBAA Bselected30DEBS 2011 - Tutorial
Consumption policy: ConsiderationsMost systems couple a multiple selection policy with a zero consumption policyThis is the common case with DSMSs, which use (sliding) windows to select relevant eventsExample (in CQL)	Select IStream(Smoke.area)From Smoke[Range 1 min], Temp[Range 1 min]Where Smoke.area = Temp.area AND Temp.val > 50The systems that allow the selection policy to be programmed often allow the consumption policy to be programmed, tooE.g., Amit, T-Rex31DEBS 2011 - Tutorial
Consumption policy: The TESLA caseZero consumption policydefine   Fire(area: string, measuredTemp: double)from     Smoke(area=$a) and each Temp(area=$a and val>50)within 1min. from Smokewhere    area=Smoke.area and measuredTemp=Temp.valueSelected consumption policydefine    Fire(area: string, measuredTemp: double)from      Smoke(area=$a) and each Temp(area=$a and val>50)within 1min. from Smokewhere     area=Smoke.area and measuredTemp=Temp.valueconsuming TempFire!Fire!Fire!Fire!Fire!Fire!Fire!Fire!Fire!Fire!Fire!Fire!TTTTSSTTTTSSTTTT32DEBS 2011 - Tutorial
Load shedding policyProblem: How to manage bursts of input dataIt may seem a system issuei.e., an issue to solve into theReceiverBut it strongly impacts the resultsproducedi.e., the “semantics” of the rulesAccordingly, some systems allows this issue to be determined on a per-rule basise.g., Aurora allows rules to specify the expected QoS and sheds input to stay within limits with the available resourcesConceptually the issue is addressed into the deciderKnowledgebaseAAAHistoryHistorySeqHistoryDeciderForwarderReceiverProducer------------------------------------------------------------------------------------------ClockRules33DEBS 2011 - Tutorial
Deployment modelIFP applications may include a large number of sources and sinksPossibly dispersed over a wide geographical areaIt becomes important to consider the deployment architecture of the engineHow the components of the functional model can be distributed to achieve scalability34DEBS 2011 - Tutorial
Deployment modelSourcesSinksIFP EngineCentralizedClusteredNetworkedInformation FlowsInformation Flows--------------------Rules--------------------DistributedRule managers35DEBS 2011 - Tutorial
Deployment ModelMost existing systems adopt a centralized solutionWhen distributed processing is allowed, it is usually based on clustered solutionsA few systems have recognized the importance of networked deployment for some applicationsE.g. Microsoft StreamInsight (part of SQLServer)Filtering near sourcesAggregation and correlation in-networkAnalytics and historical data in a centralized server/clusterIn most cases, deployment/configuration is not automatic36DEBS 2011 - Tutorial
Deployment modelAutomatic distribution of processing introduces the operator placementproblemGiven a set of rules (composed of operators) and a set of nodesHow to split the processing loadHow to assign operators to available nodesIn other words (Event Processing in Action)Given an event processing networkHow to map it onto the physical network of nodes37DEBS 2011 - Tutorial
Operator placementThe operator placement problem is still openSeveral proposalsOften adopting techniques coming from the Operational ResearchDifficult to compare solutions and resultsEven in its simplest form the problem is NP-hardGood survey paper by Lakshmanan, Li, and Strom“Placement Strategies for Internet-Scale Data Stream Systems”38DEBS 2011 - Tutorial
Operator placementGoalsOperator Placement39DEBS 2011 - Tutorial
Operator placementDifferent goals have been considered for operator placementDepending from the deployment architectureDepending from the application requirementsExamples:Minimize delay / network usage (bandwidth)Pietzuch et. Al. “Network-Aware Operator Placement for Stream-Processing Systems”Reduce hardware resource usage / Obtain load balancingRelevant in clustered scenariosSeveral papers for IBM System S40DEBS 2011 - Tutorial
Operator placementHeterogeneous ResourcesRule RewritingGoalUser-defined QOSOperator PlacementCentralized vs DistributedRe-use vs ReplicationOn-line vsOff-line41DEBS 2011 - Tutorial
More on deployment modelOperator placement is only part of the problemOther issuesHow to build the network of nodes?How to maintain it?How to gather the information required to solve the operator placement problem?How to actually “place” the operators?How to “replace” them when the situation changes?New rules added, old rules removed……new sources/sinks42DEBS 2011 - Tutorial
Deployment model and dynamicsHow to cope with mobile nodes?Mobile sinks and sources……but also mobile “processors”The issue is relevantWe leave in a mobile worldVery few proposalsA lot of work in the area of pure publish/subscribeSeveral works published in DEBS, not to mention other major conferences/journalsMay we reuse some of this work?43DEBS 2011 - Tutorial
Interaction ModelSourcesSinksIFP EngineIt is interesting to study the characteristics of the interactions among the main component of an IFP systemWho starts the communication?44DEBS 2011 - Tutorial
Interaction ModelSourcesSinksIFP EngineObservation ModelForwarding ModelNotification ModelPush
Pull
Push
Pull
Push
Pull45DEBS 2011 - Tutorial
Time ModelRelationship between information items and passing of timeAbility of an IFP system to associate some kind of happened-before (ordering) relationship to information itemsWe identified 4 classes:Stream-onlyCausalAbsoluteInterval46DEBS 2011 - Tutorial
Stream-Only Time ModelUsed in original DSMSsTimestamps may be present or notWhen present, they are used only to group items when they enter the processing engineThey are not exposed to the languageWith the exception of windowing constructsOrdering in output streams is conceptually separate from the ordering in input streamsCQL/StreamSelect DStream(*)From F1[Rows 5], F2[Range 1 Minute]Where	 F1.A = F2.A47DEBS 2011 - Tutorial
Causal Time ModelEach item has a label reflecting some kind of causal relationshipPartial orderE.g. RapideAn event is causally ordered after all events that led to its occurrenceGigascopeSelectcount(*)From	 A, BWhere	 A.a-1 <= B.b and A.a+1 > B.bA.a, B.b monotonically increaseABC48DEBS 2011 - Tutorial
Absolute Time ModelInformation items have an associated timestampDefining a single point in time w.r.t. a (logically)unique clockTotal orderTimestamps are fully exposed to the languageTimestamps can be given by the source, or by the processing engineTESLA/T-RexDefine	Fire(area: string,measuredTemp: double) From	Smoke(area=$a) and last 	Temp(area=$a and value>45)	within 5 min. from Smoke Where 	area=Smoke.areaandmeasuredTemp=Temp.value49DEBS 2011 - Tutorial
Interval Time ModelUsed for events to include “duration”SnoopIB, Cayuga, NextCEP, …At a first sight, it is a simple extension of the absolute time modelTimestamps with two values: start time and end timeHowever, it opens many issuesWhat is the successor of an event?What is the timestamp associated to a composite event?50DEBS 2011 - Tutorial
Interval Time ModelWhich is the immediate successor of A?Choose according to end time only: BBut it started before A!Exclude B: C, DBoth of them?Which of them?No other event strictly between A and its successor: C, D, ESeems a natural definitionUnfortunately we loose associativity!X(YZ) ≠ (X Y)ZMay impede some rule rewriting for processing optimizations51DEBS 2011 - Tutorial
Interval Time Model“What is “Next” in event processing?” by White et. AlProposes a number of desired properties to be satisfied by the “Next” functionThere is one model that satisfies them allComplete HistoryIt is not sufficient to encode timestamps using a couple of valuesTimestamps of composite events must embed the timestamps of all the events that led to their occurrencePossibly, timestamps of unbounded sizeIn case of unbounded Seq52DEBS 2011 - Tutorial
Data ModelStudies how the different systemsRepresent single data itemsOrganize them into data flowsDataData ItemsNature of ItemsGeneric Data
Event NotificationsFormatRecords
Tuples
Objects
…Support for UncertaintyData FlowsHomogeneous
Heterogeneous53DEBS 2011 - Tutorial
Nature of ItemsThe meaning we associate to information itemsGeneric dataEvent notificationsDeeply influences several other aspects of an IFP systemTime model !!!Rule languageSemantics of processingHeritage of the heterogeneous backgrounds of different communitiesDataData ItemsNature of ItemsGeneric Data
Event NotificationsFormatRecords
Tuples
Objects
…Support for UncertaintyData FlowsHomogeneous
Heterogeneous54DEBS 2011 - Tutorial
Nature of ItemsCQL/Stream (Generic Data)SelectIStream(*)From	 F1[Rows 5], F2[Range 1 Minute]Where	 F1.A = F2.ATESLA/T-Rex (Event Notifications)Define	Fire (area: string,measuredTemp: double) From	Smoke(area=$a)and last 	Temp(area=$a and value>45)within 5 min.from Smoke Where 	area=Smoke.area andmeasuredTemp=Temp.value55DEBS 2011 - Tutorial
Format of ItemsHow information is representedInfluences the way items are processedE.g., Relational model requires tuplesDataData ItemsNature of ItemsGeneric Data
Event NotificationsFormatRecords
Tuples
Objects
…Support for UncertaintyData FlowsHomogeneous
Heterogeneous56DEBS 2011 - Tutorial
Support for UncertaintyAbility to associate a degree of uncertainty to information itemsTo the content of itemsImprecise temperature readingTo the presence of an item (occurrence of an event)Spurious RFID readingWhen present, probabilistic information is usually exploited in rules during processingDataData ItemsNature of ItemsGeneric Data
Event NotificationsFormatRecords
Tuples
Objects
…Support for UncertaintyData FlowsHomogeneous
Heterogeneous57DEBS 2011 - Tutorial
Data FlowsHomogeneousEach flow contains data with the same format and “kind”E.g. Tuples with identical structureOften associated with “database-like” rule languagesHeterogeneousInformation flows are seen as channels connecting sources, processors, and sinksEach channel may transport items with different kind and formatDataData ItemsNature of ItemsGeneric Data
Event NotificationsFormatRecords
Tuples
Objects
…Support for UncertaintyData FlowsHomogeneous
Heterogeneous58DEBS 2011 - Tutorial
Rule ModelRules are much more complex entities than data itemsLarge number of different approachesAlready observed in the previous slidesLooking back to our functional model, we classify them into two macro classesTransforming rulesDetecting rulesRuleType of RulesTransforming Rules
Detecting RulesSupport for Uncertainty59DEBS 2011 - Tutorial
Transforming RulesDo not present an explicit distinction between detection and productionDefine an execution plan combining primitiveoperatorsEach operator transforms one or more input flows into one or more output flowsThe execution plan can be definedexplicitly (e.g., through graphical notation)implicitly (using a high level language)Often used with homogeneous information flowsTo take advantage of the predefined structure of input and outputRuleType of RulesTransforming Rules
Detecting RulesSupport for Uncertainty60DEBS 2011 - Tutorial
Detecting RulesPresent an explicit distinction between detection and productionUsually, the detection is based on a logical predicate that captures patterns of interest in the history of received itemsRuleType of RulesTransforming Rules
Detecting RulesSupport for Uncertainty61DEBS 2011 - Tutorial
Support for UncertaintyTwo orthogonal aspectsSupport for uncertain inputAllows rules to deal with/reason about uncertain input dataSupport for uncertain outputAllows rules to associate a degree of uncertainty to the output producedRuleType of RulesTransforming Rules
Detecting RulesSupport for Uncertainty62DEBS 2011 - Tutorial
Language ModelSpecify operations toFilterJoinAggregateinput flows …… to produce one or more output flowsFollowing the rule model, we define two classes of languages:Transforming languagesDeclarative languagesImperative languagesDetecting languagesPattern-based63DEBS 2011 - Tutorial
Following the rule model, we define two classes of languages:Transforming languagesDeclarative languagesImperative languagesDetecting languagesPattern-basedLanguage ModelSpecify the expected result rather than the desired execution flowUsually derive from relational languagesRelational algebra / SQLCQL/Stream:Select IStream(*)From	  F1[Rows 5],  F2[Rows 10]Where  F1.A = F2.A64DEBS 2011 - Tutorial
Following the rule model, we define two classes of languages:Transforming languagesDeclarative languagesImperative languagesDetecting languagesPattern-basedLanguage ModelSpecify the desired execution flowStarting from primitive operatorsCan be user-definedUsually adopt a graphical notation65DEBS 2011 - Tutorial
Imperative LanguagesAurora (Boxes & Arrows Model)66DEBS 2011 - Tutorial
Hybrid LanguagesOracle CEP67DEBS 2011 - Tutorial
Following the rule model, we define two classes of languages:Transforming languagesDeclarative languagesImperative languagesDetecting languagesPattern-basedLanguage ModelSpecify a firing condition as a patternSelect a portion of incoming flows throughLogic operatorsContent / timing constraintsThe action uses selected items to produce new knowledge68DEBS 2011 - Tutorial
Detecting LanguagesActionTESLA / T-RexDefineFire(area: string, measuredTemp: double) From 	 Smoke(area=$a) and last Temp(area=$a and value>45)within 5 min. from Smoke Where area=Smoke.areaandmeasuredTemp=Temp.valueCondition (Pattern)69DEBS 2011 - Tutorial
Language ModelDifferent syntaxes / constructs / operatorsComparison of languages semantics still an open issueOur approach:Review all operators encountered in the analysis of systemsSpecifying the classes of languages adopting themTrying to capture some semantics relationshipAmong operators70DEBS 2011 - Tutorial
Language ModelSingle-Item operatorsSelection operatorsFilter items according to their contentElaboration operatorsProjectionExtracts a part of the content of an itemRenamingChanges the name of a field in languages based on records or tuplesPresent in all languagesDefined as primitive operators in imperative languagesDeclarative languages inherit selection, projection, and renaming from relational algebraRenamingSelect RStream(I.Price as HighPrice)From	Items[Rows 1] as IWhere	I.Price > 100ProjectionSelection71DEBS 2011 - Tutorial
Language ModelSingle-Item operatorsSelection operatorsFilter items according to their contentElaboration operatorsProjectionExtracts a part of the content of an itemRenamingChanges the name of a field in languages based on records or tuplesPattern-based languagesSelection inside the condition part (pattern)Elaboration as part of the actionProjectionDefineExpensiveItem 	(highPrice: double) From 	Item(price>100)Where 	highPrice = priceSelectionRenaming72DEBS 2011 - Tutorial
Language ModelLogic OperatorsConjunctionDisjunctionRepetitionNegationExplicitly present in pattern-based languagesPADRES(A & B) || (C & D)ConjunctionDisjunction73DEBS 2011 - Tutorial
Language ModelLogic OperatorsConjunctionDisjunctionRepetitionNegationSome logic operators are blockingExpress pattern whose validity cannot be decided into a bounded amount of timeE.g., NegationUsed in conjunction with windowsDefineFire()From 	 Smoke(area=$a) and not Rain(area=$a) within 10 min from SmokeNegationWindow74DEBS 2011 - Tutorial
Language ModelLogic OperatorsConjunctionDisjunctionRepetitionNegationTraditionally, logic operators were not explicitly offered by declarative and imperative languagesHowever, they could be expressed as transformation of input flowsConjunction of A and BSelect IStream (F1.A, F2.B)From F1 [Rows 10], F2 [Rows 20]75DEBS 2011 - Tutorial
Language ModelSequencesSimilar to logic operatorsBased on timing relations among itemsPresent in almost all pattern-based languagesDefineFire() From 	 Smoke(area=$a) and last  Temp(area=$a and value>45) within 5 min. from Smoke Sequence (time-bounded)76DEBS 2011 - Tutorial
Language ModelSequencesSimilar to logic operatorsBased on timing relations among itemsTraditionally, transforming languages did not provide sequences explicitlyCould be expressed with an explicit reference to timestampsIf present inside itemsSelect IStream (F1.A, F2.B)From	 F1 [Rows 10], F2 [Rows 20]Where	 F1.timestamp <F2.timestampImpose timestamp order77DEBS 2011 - Tutorial
Language ModelIterationsExpress possibly unbounded sequences of items …… satisfying an iterating conditionImplicitly defines an ordering among itemsIteration (Kleene +)SASE+PATTERN SEQ(Alert a, Shipment+ b[ ])WHERE skip_till_any_match(a, b[ ]) {a.type= ’contaminated’ and	b[1].from = a.site and b[i].from = b[i-1].to} WITHIN 3 hours78DEBS 2011 - Tutorial

Processing Flows of Information DEBS 2011

  • 1.
    Processing Flows ofInformationFrom Data Streams to Complex Event ProcessingGianpaoloCugola and Alessandro MargaraPolitecnicodi MilanoDip. Elettronica e Informazione[cugola,margara]@elet.polimi.it“Processing Flows of Information: from data streams to complex event processing”ACM Computing Surveys, to appear (but available from: http://home.dei.polimi.it/cugola)
  • 2.
    Background and motivationBackin 2007 CEP was already a hot topic…… but having a good grasp of the area was rather hardAs observed by OpherEtzion the area was looking like the “Tower of Babel”Event Processing and the Babylon Tower – Event process thinking blog – Sept. 8, 2007eventdatamessageprimitivecomplexcompositestreamflowsequencepatternjoinqueryruleconditionactionpublishsubscribenotifyadvertisesendreceivemiddlewaresystemapplicationprotocolnetworkrouting2DEBS 2011 - Tutorial
  • 3.
    Background and motivationSeveralcommunities were contributing to the area…… each bringing its own expertise and vocabulary……but often working in isolationeventResearchersdatamessageDEBSprimitivedata management& databasesprocess modeling& automationcomplexcompositestreamflowsequencepatternjoinqueryruleconditionactionpublishsubscribenotifymiddlewaresystemsad-hoc tools(intrusion det., …)advertisesendreceiveapplicationserversDBMSsmiddlewaresystemapplicationprotocolnetworkTool vendorsrouting3DEBS 2011 - Tutorial
  • 4.
    Background and motivationThatwas 2007. What about today?Things did not change muchFrom the “Event Process Thinking” blog – Sept. 2010[Which is] the relation between event processing and data stream management?They are aliases -- stream is just a collection of events, likewise, an event is just a member in a stream, and the functionality is the sameStream management is a subset of event processing -- there are different ways to do event processing, streams is one of themEvent processing is a subset of stream management -- event streams is just one type of stream, but there are voice stream, video stream, …Event processing and stream management are distinct and there is no overlapping between themAt the same time tool vendors are building tools that try to combine ¿ juxtapose ? different approaches4DEBS 2011 - Tutorial
  • 5.
    Our goalWe wouldlike to compare different systems in a precise wayWe would like to compare different approaches in a precise wayWe would like to help people coming from different areas communicate and compare their work with othersWe would like to isolate the open issues from those already solvedWe would like to precisely identify the challengesWe would like to isolate the best part of the various approaches…… finding a way to combine themWe need a modeling frameworkthat could accommodate all the proposals5DEBS 2011 - Tutorial
  • 6.
    Forget your ownvocabulary(temporarily)CEP, DSMSs, Active Databases… All these terms hide a peculiar view of the domain we have in mindWith subtle, “unsaid” (and often unclear) differencesBefore learning something new we have to forget what we already know6DEBS 2011 - Tutorial
  • 7.
    The Information FlowProcessing domainThe IFP engine processes incoming flows of information according to a set of processing rulesProcessing is “on line”Sources produce the incoming information flows, sinks consume the results of processing, rule managers add or remove rulesInformation flows are composed of information itemsItems part of the same flow are neither necessarily ordered nor of the same kindProcessing involve filtering,combining, and aggregatingflows, item by item as theyenter the engineSourcesSinksIFP EngineInformation FlowsInformation Flows--------------------Rules--------------------Rule managers7DEBS 2011 - Tutorial
  • 8.
    IFP: One nameseveral incarnationsA lot of applicationsEnvironmental monitoring through sensor networksFinancial applicationsFraud detection toolsNetwork intrusion detection systemsRFID-based inventory managementManufacturing control systems…Several technologiesActive databasesDSMSsCEP Systems8DEBS 2011 - Tutorial
  • 9.
    One model, severalmodelsDifferent models to capture different viewpointsFunctional modelProcessing modelDeployment modelInteraction modelTime modelData modelRule modelLanguage model9DEBS 2011 - Tutorial
  • 10.
  • 11.
    Functional modelImplements thetransport protocol to move information items along the net
  • 12.
    Acts as ademultiplexerKnowledgebaseAAAHistoryHistorySeqHistoryDeciderForwarderReceiverProducer------------------------------------------------------------------------------------------ClockRules11DEBS 2011 - Tutorial
  • 13.
    Functional modelImplements thetransport protocol to move information items along the net
  • 14.
    Acts as amultiplexerKnowledgebaseAAAHistoryHistorySeqHistoryDeciderForwarderReceiverProducer------------------------------------------------------------------------------------------ClockRules12DEBS 2011 - Tutorial
  • 15.
    A short digressionWeassume rules can be (logically)decomposed in two parts: C -> AC is the conditionA is the actionExample (in CQL): Select IStream(Count(*))From F1 [Range 1 Minute]Where F1.A > 0This way we can split processingin two phases:The detection phase determines the items that trigger the ruleThe production phase use those items to produce the output of the ruleKnowledgebaseAAAHistoryHistorySeqHistoryactionDeciderForwarderReceiverconditionProducer------------------------------------------------------------------------------------------ClockRules13DEBS 2011 - Tutorial
  • 16.
  • 17.
  • 18.
  • 19.
    When a rulefires passes to the producer its action part and the triggering itemsKnowledgebaseAAASeqDeciderForwarderHistoryReceiverHistoryProducerHistory------------------------------------------------------------------------------------------ClockRules15DEBS 2011 - Tutorial
  • 20.
  • 21.
    Uses the itemsin Seq as stated in action ASeqDeciderForwarderHistoryReceiverHistoryProducerHistory------------------------------------------------------------------------------------------ClockRules16DEBS 2011 - Tutorial
  • 22.
    Functional modelKnowledgebaseAAASeqDeciderForwarderHistoryReceiverHistoryProducerHistorySome systemsallow rules to be added or removed at processing time------------------------------------------------------------------------------------------ClockRules17DEBS 2011 - Tutorial
  • 23.
    Functional modelIf presentmodels the ability of performing recursive processing building hierarchies of itemsKnowledgebaseAAASeqDeciderForwarderHistoryReceiverHistoryProducerHistory------------------------------------------------------------------------------------------ClockRules18DEBS 2011 - Tutorial
  • 24.
    Functional modelSome systemsallows rules to combine flowing items with items previously stored into a (read only) storageKnowledgebaseAAASeqDeciderForwarderHistoryReceiverHistoryProducerHistory------------------------------------------------------------------------------------------ClockRules19DEBS 2011 - Tutorial
  • 25.
  • 26.
    Periodically creates specialinformation items holding current time
  • 27.
    Its presence modelsthe ability of performingperiodic processing of inputsSeqDeciderForwarderHistoryReceiverHistoryProducerHistory------------------------------------------------------------------------------------------ClockRules20DEBS 2011 - Tutorial
  • 28.
    Functional model: ConsiderationsThedetection-production cycleFired by a new item I entering the engine through the ReceiverIncluding those periodically produced by the Clock, if presentDetection phase: Evaluates all the rules to find those enabledUsing item I plus the data into the Knowledge base, if presentThe item I can be accumulated into the History for partially enabled rulesThe action part of the enabled rules together with the triggering items (A+Seq) is passed to the producerProduction phase: Produces the output itemsCombining the items that triggered the rule with data present in the Knowledge base, if presentNew items are sent to subscribed sinks (through the Forwarder)……but they could also be sent internally to be processed again (recursive processing)In some systems the action part of fired rules may also change the set of deployed rules21DEBS 2011 - Tutorial
  • 29.
    Functional model: ConsiderationsMaximumlength of Seq a key aspect1  PubSubBounded  CQL like languages without time based windowsPattern based languages without a Kleene+ operatorOther key aspects that impact expressivenessPresence of the ClockModels the ability to process rulesperiodicallyAvailable in almost half of the systemsreviewedMost Active DBs and DSMSs butfew CEP systemsKnowledgebaseAAAHistoryHistorySeqHistoryDeciderForwarderReceiverProducer------------------------------------------------------------------------------------------ClockRules22DEBS 2011 - Tutorial
  • 30.
    Functional model: ConsiderationsPresenceof the Knowledge baseOnly available in systems coming from the database communityPresence of the looping flow exitingthe ProducerModels the ability of performing recursiveprocessingHalf CEP systems have itAll Active DBs but very few DSMSshave itThey have nested rulesSupport to dynamic rule changeFew systems support itCan be implemented externally…Through sinks acting also as rule managers…but we think it is nice to have it internallyKnowledgebaseAAAHistoryHistorySeqHistoryDeciderForwarderReceiverProducer------------------------------------------------------------------------------------------ClockRules23DEBS 2011 - Tutorial
  • 31.
    The semantics ofprocessingWhat determines the output of each detection-production cycle?The new item entering the engineThe set of deployed rulesThe items stored into the HistoryThe content of the Knowledge BaseIs this enough?Example (in Padres and CQL):Smoke && Temp>50 Select IStream(Smoke.area)From Smoke[Rows 30 Slide 10], Temp[Rows 50 Slide 5]Where Smoke.area = Temp.area AND Temp.value > 50KnowledgebaseAAAHistoryHistoryHistoryHistoryHistoryHistoryRulesKnowledgebaseSeqDecider------------------------------------------------------------Producer------------------------------------------------------------------------------------------------------------------------24DEBS 2011 - Tutorial
  • 32.
    Processing modelThree policiesaffect the behavior of the systemThe selection policyThe consumption policyThe load shedding policy25DEBS 2011 - Tutorial
  • 33.
    Selection policyDetermines ifa rule fires once or multiple times and the items actually selected from the HistoryExample:A0 BA1 BA0 BA1 BRRRRmultiple?DeciderReceiverAA AA0 A1singleorBAA B26DEBS 2011 - Tutorial
  • 34.
    Selection policy: ConsiderationsMostsystems adopt a multiple selection policyIt is simpler to implementIs it adequate?Example: Alert fire when smoke and high temperature in a short time frameIf 10 sensors read high temperature and immediately afterward one detects smoke I would like to receive a single alert, not 10A few systems allow this policy to be programmed……some of them on a per-rule baseE.g., Amit, T-Rex27DEBS 2011 - Tutorial
  • 35.
    Selection policy: TheTESLA caseTESLA (Trio-based Event Specification Language): the T-Rex languageA rule language for CEP. Tries to combine expressiveness and efficiencyHas a formally defined semanticsExpressed in Trio, a Metric Temporal Logic (see DEBS 2010)Allows rule managers to choose their own selection policy on a per rule baseExample: Multiple selection define Fire(area: string, measuredTemp: double)from Smoke(area=$a) and each Temp(area=$a and val>50) within 1min. from Smokewhere area=Smoke.area and measuredTemp=Temp.valueExample: Single selection define Fire(area: string, measuredTemp: double)from Smoke(area=$a) and last Temp(area=$a and val>50) within 1min. from Smokewhere area=Smoke.area and measuredTemp=Temp.val28DEBS 2011 - Tutorial
  • 36.
    Selection policy: TheTESLA caseTESLA (Trio-based Event Specification Language): the T-Rex languageA rule language for CEP. Tries to combine expressiveness and efficiencyHas a formally defined semanticsExpressed in Trio, a Metric Temporal Logic (see DEBS 2010)Allows rule managers to choose their own selection policy on a per rule baseExample: Multiple selection define Fire(area: string, measuredTemp: double)from Smoke(area=$a) and each Temp(area=$a and val>50) within 1min. from Smokewhere area=Smoke.area and measuredTemp=Temp.valueExample: Single selection define Fire(area: string, measuredTemp: double)from Smoke(area=$a) and last Temp(area=$a and val>50) within 1min. from Smokewhere area=Smoke.area and measuredTemp=Temp.valAlternatively you may use:
  • 37.
  • 38.
    n-first…within n-last…within29DEBS 2011 - Tutorial
  • 39.
    Consumption policyDetermines howthe history changes after firing of a rule  what happens when new items enter the DeciderExample:A BA BRR?DeciderReceiverAzeroBAA Bselected30DEBS 2011 - Tutorial
  • 40.
    Consumption policy: ConsiderationsMostsystems couple a multiple selection policy with a zero consumption policyThis is the common case with DSMSs, which use (sliding) windows to select relevant eventsExample (in CQL) Select IStream(Smoke.area)From Smoke[Range 1 min], Temp[Range 1 min]Where Smoke.area = Temp.area AND Temp.val > 50The systems that allow the selection policy to be programmed often allow the consumption policy to be programmed, tooE.g., Amit, T-Rex31DEBS 2011 - Tutorial
  • 41.
    Consumption policy: TheTESLA caseZero consumption policydefine Fire(area: string, measuredTemp: double)from Smoke(area=$a) and each Temp(area=$a and val>50)within 1min. from Smokewhere area=Smoke.area and measuredTemp=Temp.valueSelected consumption policydefine Fire(area: string, measuredTemp: double)from Smoke(area=$a) and each Temp(area=$a and val>50)within 1min. from Smokewhere area=Smoke.area and measuredTemp=Temp.valueconsuming TempFire!Fire!Fire!Fire!Fire!Fire!Fire!Fire!Fire!Fire!Fire!Fire!TTTTSSTTTTSSTTTT32DEBS 2011 - Tutorial
  • 42.
    Load shedding policyProblem:How to manage bursts of input dataIt may seem a system issuei.e., an issue to solve into theReceiverBut it strongly impacts the resultsproducedi.e., the “semantics” of the rulesAccordingly, some systems allows this issue to be determined on a per-rule basise.g., Aurora allows rules to specify the expected QoS and sheds input to stay within limits with the available resourcesConceptually the issue is addressed into the deciderKnowledgebaseAAAHistoryHistorySeqHistoryDeciderForwarderReceiverProducer------------------------------------------------------------------------------------------ClockRules33DEBS 2011 - Tutorial
  • 43.
    Deployment modelIFP applicationsmay include a large number of sources and sinksPossibly dispersed over a wide geographical areaIt becomes important to consider the deployment architecture of the engineHow the components of the functional model can be distributed to achieve scalability34DEBS 2011 - Tutorial
  • 44.
    Deployment modelSourcesSinksIFP EngineCentralizedClusteredNetworkedInformationFlowsInformation Flows--------------------Rules--------------------DistributedRule managers35DEBS 2011 - Tutorial
  • 45.
    Deployment ModelMost existingsystems adopt a centralized solutionWhen distributed processing is allowed, it is usually based on clustered solutionsA few systems have recognized the importance of networked deployment for some applicationsE.g. Microsoft StreamInsight (part of SQLServer)Filtering near sourcesAggregation and correlation in-networkAnalytics and historical data in a centralized server/clusterIn most cases, deployment/configuration is not automatic36DEBS 2011 - Tutorial
  • 46.
    Deployment modelAutomatic distributionof processing introduces the operator placementproblemGiven a set of rules (composed of operators) and a set of nodesHow to split the processing loadHow to assign operators to available nodesIn other words (Event Processing in Action)Given an event processing networkHow to map it onto the physical network of nodes37DEBS 2011 - Tutorial
  • 47.
    Operator placementThe operatorplacement problem is still openSeveral proposalsOften adopting techniques coming from the Operational ResearchDifficult to compare solutions and resultsEven in its simplest form the problem is NP-hardGood survey paper by Lakshmanan, Li, and Strom“Placement Strategies for Internet-Scale Data Stream Systems”38DEBS 2011 - Tutorial
  • 48.
  • 49.
    Operator placementDifferent goalshave been considered for operator placementDepending from the deployment architectureDepending from the application requirementsExamples:Minimize delay / network usage (bandwidth)Pietzuch et. Al. “Network-Aware Operator Placement for Stream-Processing Systems”Reduce hardware resource usage / Obtain load balancingRelevant in clustered scenariosSeveral papers for IBM System S40DEBS 2011 - Tutorial
  • 50.
    Operator placementHeterogeneous ResourcesRuleRewritingGoalUser-defined QOSOperator PlacementCentralized vs DistributedRe-use vs ReplicationOn-line vsOff-line41DEBS 2011 - Tutorial
  • 51.
    More on deploymentmodelOperator placement is only part of the problemOther issuesHow to build the network of nodes?How to maintain it?How to gather the information required to solve the operator placement problem?How to actually “place” the operators?How to “replace” them when the situation changes?New rules added, old rules removed……new sources/sinks42DEBS 2011 - Tutorial
  • 52.
    Deployment model anddynamicsHow to cope with mobile nodes?Mobile sinks and sources……but also mobile “processors”The issue is relevantWe leave in a mobile worldVery few proposalsA lot of work in the area of pure publish/subscribeSeveral works published in DEBS, not to mention other major conferences/journalsMay we reuse some of this work?43DEBS 2011 - Tutorial
  • 53.
    Interaction ModelSourcesSinksIFP EngineItis interesting to study the characteristics of the interactions among the main component of an IFP systemWho starts the communication?44DEBS 2011 - Tutorial
  • 54.
    Interaction ModelSourcesSinksIFP EngineObservationModelForwarding ModelNotification ModelPush
  • 55.
  • 56.
  • 57.
  • 58.
  • 59.
  • 60.
    Time ModelRelationship betweeninformation items and passing of timeAbility of an IFP system to associate some kind of happened-before (ordering) relationship to information itemsWe identified 4 classes:Stream-onlyCausalAbsoluteInterval46DEBS 2011 - Tutorial
  • 61.
    Stream-Only Time ModelUsedin original DSMSsTimestamps may be present or notWhen present, they are used only to group items when they enter the processing engineThey are not exposed to the languageWith the exception of windowing constructsOrdering in output streams is conceptually separate from the ordering in input streamsCQL/StreamSelect DStream(*)From F1[Rows 5], F2[Range 1 Minute]Where F1.A = F2.A47DEBS 2011 - Tutorial
  • 62.
    Causal Time ModelEachitem has a label reflecting some kind of causal relationshipPartial orderE.g. RapideAn event is causally ordered after all events that led to its occurrenceGigascopeSelectcount(*)From A, BWhere A.a-1 <= B.b and A.a+1 > B.bA.a, B.b monotonically increaseABC48DEBS 2011 - Tutorial
  • 63.
    Absolute Time ModelInformationitems have an associated timestampDefining a single point in time w.r.t. a (logically)unique clockTotal orderTimestamps are fully exposed to the languageTimestamps can be given by the source, or by the processing engineTESLA/T-RexDefine Fire(area: string,measuredTemp: double) From Smoke(area=$a) and last Temp(area=$a and value>45) within 5 min. from Smoke Where area=Smoke.areaandmeasuredTemp=Temp.value49DEBS 2011 - Tutorial
  • 64.
    Interval Time ModelUsedfor events to include “duration”SnoopIB, Cayuga, NextCEP, …At a first sight, it is a simple extension of the absolute time modelTimestamps with two values: start time and end timeHowever, it opens many issuesWhat is the successor of an event?What is the timestamp associated to a composite event?50DEBS 2011 - Tutorial
  • 65.
    Interval Time ModelWhichis the immediate successor of A?Choose according to end time only: BBut it started before A!Exclude B: C, DBoth of them?Which of them?No other event strictly between A and its successor: C, D, ESeems a natural definitionUnfortunately we loose associativity!X(YZ) ≠ (X Y)ZMay impede some rule rewriting for processing optimizations51DEBS 2011 - Tutorial
  • 66.
    Interval Time Model“Whatis “Next” in event processing?” by White et. AlProposes a number of desired properties to be satisfied by the “Next” functionThere is one model that satisfies them allComplete HistoryIt is not sufficient to encode timestamps using a couple of valuesTimestamps of composite events must embed the timestamps of all the events that led to their occurrencePossibly, timestamps of unbounded sizeIn case of unbounded Seq52DEBS 2011 - Tutorial
  • 67.
    Data ModelStudies howthe different systemsRepresent single data itemsOrganize them into data flowsDataData ItemsNature of ItemsGeneric Data
  • 68.
  • 69.
  • 70.
  • 71.
  • 72.
  • 73.
    Nature of ItemsThemeaning we associate to information itemsGeneric dataEvent notificationsDeeply influences several other aspects of an IFP systemTime model !!!Rule languageSemantics of processingHeritage of the heterogeneous backgrounds of different communitiesDataData ItemsNature of ItemsGeneric Data
  • 74.
  • 75.
  • 76.
  • 77.
  • 78.
  • 79.
    Nature of ItemsCQL/Stream(Generic Data)SelectIStream(*)From F1[Rows 5], F2[Range 1 Minute]Where F1.A = F2.ATESLA/T-Rex (Event Notifications)Define Fire (area: string,measuredTemp: double) From Smoke(area=$a)and last Temp(area=$a and value>45)within 5 min.from Smoke Where area=Smoke.area andmeasuredTemp=Temp.value55DEBS 2011 - Tutorial
  • 80.
    Format of ItemsHowinformation is representedInfluences the way items are processedE.g., Relational model requires tuplesDataData ItemsNature of ItemsGeneric Data
  • 81.
  • 82.
  • 83.
  • 84.
  • 85.
  • 86.
    Support for UncertaintyAbilityto associate a degree of uncertainty to information itemsTo the content of itemsImprecise temperature readingTo the presence of an item (occurrence of an event)Spurious RFID readingWhen present, probabilistic information is usually exploited in rules during processingDataData ItemsNature of ItemsGeneric Data
  • 87.
  • 88.
  • 89.
  • 90.
  • 91.
  • 92.
    Data FlowsHomogeneousEach flowcontains data with the same format and “kind”E.g. Tuples with identical structureOften associated with “database-like” rule languagesHeterogeneousInformation flows are seen as channels connecting sources, processors, and sinksEach channel may transport items with different kind and formatDataData ItemsNature of ItemsGeneric Data
  • 93.
  • 94.
  • 95.
  • 96.
  • 97.
  • 98.
    Rule ModelRules aremuch more complex entities than data itemsLarge number of different approachesAlready observed in the previous slidesLooking back to our functional model, we classify them into two macro classesTransforming rulesDetecting rulesRuleType of RulesTransforming Rules
  • 99.
    Detecting RulesSupport forUncertainty59DEBS 2011 - Tutorial
  • 100.
    Transforming RulesDo notpresent an explicit distinction between detection and productionDefine an execution plan combining primitiveoperatorsEach operator transforms one or more input flows into one or more output flowsThe execution plan can be definedexplicitly (e.g., through graphical notation)implicitly (using a high level language)Often used with homogeneous information flowsTo take advantage of the predefined structure of input and outputRuleType of RulesTransforming Rules
  • 101.
    Detecting RulesSupport forUncertainty60DEBS 2011 - Tutorial
  • 102.
    Detecting RulesPresent anexplicit distinction between detection and productionUsually, the detection is based on a logical predicate that captures patterns of interest in the history of received itemsRuleType of RulesTransforming Rules
  • 103.
    Detecting RulesSupport forUncertainty61DEBS 2011 - Tutorial
  • 104.
    Support for UncertaintyTwoorthogonal aspectsSupport for uncertain inputAllows rules to deal with/reason about uncertain input dataSupport for uncertain outputAllows rules to associate a degree of uncertainty to the output producedRuleType of RulesTransforming Rules
  • 105.
    Detecting RulesSupport forUncertainty62DEBS 2011 - Tutorial
  • 106.
    Language ModelSpecify operationstoFilterJoinAggregateinput flows …… to produce one or more output flowsFollowing the rule model, we define two classes of languages:Transforming languagesDeclarative languagesImperative languagesDetecting languagesPattern-based63DEBS 2011 - Tutorial
  • 107.
    Following the rulemodel, we define two classes of languages:Transforming languagesDeclarative languagesImperative languagesDetecting languagesPattern-basedLanguage ModelSpecify the expected result rather than the desired execution flowUsually derive from relational languagesRelational algebra / SQLCQL/Stream:Select IStream(*)From F1[Rows 5], F2[Rows 10]Where F1.A = F2.A64DEBS 2011 - Tutorial
  • 108.
    Following the rulemodel, we define two classes of languages:Transforming languagesDeclarative languagesImperative languagesDetecting languagesPattern-basedLanguage ModelSpecify the desired execution flowStarting from primitive operatorsCan be user-definedUsually adopt a graphical notation65DEBS 2011 - Tutorial
  • 109.
    Imperative LanguagesAurora (Boxes& Arrows Model)66DEBS 2011 - Tutorial
  • 110.
  • 111.
    Following the rulemodel, we define two classes of languages:Transforming languagesDeclarative languagesImperative languagesDetecting languagesPattern-basedLanguage ModelSpecify a firing condition as a patternSelect a portion of incoming flows throughLogic operatorsContent / timing constraintsThe action uses selected items to produce new knowledge68DEBS 2011 - Tutorial
  • 112.
    Detecting LanguagesActionTESLA /T-RexDefineFire(area: string, measuredTemp: double) From Smoke(area=$a) and last Temp(area=$a and value>45)within 5 min. from Smoke Where area=Smoke.areaandmeasuredTemp=Temp.valueCondition (Pattern)69DEBS 2011 - Tutorial
  • 113.
    Language ModelDifferent syntaxes/ constructs / operatorsComparison of languages semantics still an open issueOur approach:Review all operators encountered in the analysis of systemsSpecifying the classes of languages adopting themTrying to capture some semantics relationshipAmong operators70DEBS 2011 - Tutorial
  • 114.
    Language ModelSingle-Item operatorsSelectionoperatorsFilter items according to their contentElaboration operatorsProjectionExtracts a part of the content of an itemRenamingChanges the name of a field in languages based on records or tuplesPresent in all languagesDefined as primitive operators in imperative languagesDeclarative languages inherit selection, projection, and renaming from relational algebraRenamingSelect RStream(I.Price as HighPrice)From Items[Rows 1] as IWhere I.Price > 100ProjectionSelection71DEBS 2011 - Tutorial
  • 115.
    Language ModelSingle-Item operatorsSelectionoperatorsFilter items according to their contentElaboration operatorsProjectionExtracts a part of the content of an itemRenamingChanges the name of a field in languages based on records or tuplesPattern-based languagesSelection inside the condition part (pattern)Elaboration as part of the actionProjectionDefineExpensiveItem (highPrice: double) From Item(price>100)Where highPrice = priceSelectionRenaming72DEBS 2011 - Tutorial
  • 116.
    Language ModelLogic OperatorsConjunctionDisjunctionRepetitionNegationExplicitlypresent in pattern-based languagesPADRES(A & B) || (C & D)ConjunctionDisjunction73DEBS 2011 - Tutorial
  • 117.
    Language ModelLogic OperatorsConjunctionDisjunctionRepetitionNegationSomelogic operators are blockingExpress pattern whose validity cannot be decided into a bounded amount of timeE.g., NegationUsed in conjunction with windowsDefineFire()From Smoke(area=$a) and not Rain(area=$a) within 10 min from SmokeNegationWindow74DEBS 2011 - Tutorial
  • 118.
    Language ModelLogic OperatorsConjunctionDisjunctionRepetitionNegationTraditionally,logic operators were not explicitly offered by declarative and imperative languagesHowever, they could be expressed as transformation of input flowsConjunction of A and BSelect IStream (F1.A, F2.B)From F1 [Rows 10], F2 [Rows 20]75DEBS 2011 - Tutorial
  • 119.
    Language ModelSequencesSimilar tologic operatorsBased on timing relations among itemsPresent in almost all pattern-based languagesDefineFire() From Smoke(area=$a) and last Temp(area=$a and value>45) within 5 min. from Smoke Sequence (time-bounded)76DEBS 2011 - Tutorial
  • 120.
    Language ModelSequencesSimilar tologic operatorsBased on timing relations among itemsTraditionally, transforming languages did not provide sequences explicitlyCould be expressed with an explicit reference to timestampsIf present inside itemsSelect IStream (F1.A, F2.B)From F1 [Rows 10], F2 [Rows 20]Where F1.timestamp <F2.timestampImpose timestamp order77DEBS 2011 - Tutorial
  • 121.
    Language ModelIterationsExpress possiblyunbounded sequences of items …… satisfying an iterating conditionImplicitly defines an ordering among itemsIteration (Kleene +)SASE+PATTERN SEQ(Alert a, Shipment+ b[ ])WHERE skip_till_any_match(a, b[ ]) {a.type= ’contaminated’ and b[1].from = a.site and b[i].from = b[i-1].to} WITHIN 3 hours78DEBS 2011 - Tutorial
  • 122.
    Language ModelLogic operators,sequences, and iterations traditionally not offered by transforming languagesAnd now?Current trend:Embed patterns inside declarative languagesEspecially adopted in commercial systemsEsperSelect A.priceFrom pattern [every (A (B or C))]Where A.price > 100 IMO: difficult to write / understandMailing list of Esper!79DEBS 2011 - Tutorial
  • 123.
    Language ModelWindowsKind:Logical (Time-Based)Physical(Count-Based)User-DefinedMovementFixedLandmarkSlidingPaneTumbleLogicalSelectIStream(Count(*))From F1[Range 1 Minute]PhysicalSelect IStream(Count(*))From F1[Rows 50 Slide 10]80DEBS 2011 - Tutorial
  • 124.
    Language ModelUser-Defined (Stream-Mill)WINDOWAGGREGATE positive min(Next Real): Real { TABLE inwindow(wnext real); INITIALIZE : { } ITERATE : { DELETE FROM inwindow WHERE wnext < 0 INSERT INTO RETURN SELECT min(wnext) FROM inwindow } EXPIRE : { }}81DEBS 2011 - Tutorial
  • 125.
    Language ModelWindows areused to limit the scope of blocking operatorsThey are generally available in declarative and imperative languagesThey are not present in all pattern-based languagesSome of them do not include blocking operatorsSome of them “embed” windows inside operatorsMaking them unblockingCEDREVENT Test-RuleWHENUNLESS(A, B, 12 hours)WHERE A.a< B.b82DEBS 2011 - Tutorial
  • 126.
    Language ModelFlow managementoperatorsRequired by declarative and imperative languages to merge, split, organize, and process incoming flows of informationFlow Management OperatorsJoinOrder ByUnionGroup ByExceptFlow CreationBag OperatorsIntersectDuplicateRemove-duplicates83DEBS 2011 - Tutorial
  • 127.
    Language ModelParameterizationAllows thebinding of different information items based on their contentOffered implicitly by declarative and imperative languagesThrough a combination of join and selectionOffered as an explicit operator in pattern-based languagesJoinCQL / StreamSelectIStream (F1.A, F2.B)From F1 [Rows 10], F2 [Rows 20]Where F1.A > F2.BSelectionExplicit Parameter84TESLA / T-RexDefine Fire() From Smoke(area=$a) and last Temp(area=$a and value>45) within 5 min. from Smoke DEBS 2011 - Tutorial
  • 128.
    Language ModelDetectionAggregateDefineFire(area: string,measuredTemp:double) From Smoke(area=$a) and45 < Avg(Temp(area=$a).valuewithin 5 min. from Smoke)Where area=Smoke.areaand measuredTemp=Temp.valueAggregatesScopeDetection Aggregates
  • 129.
    Production AggregatesDefinitionDefineFire(area: string,measuredTemp:double) From Smoke(area=$a) andlast Temp(area=$a and value>45)within 5 min. from Smoke)Where area=Smoke.areaand measuredTemp=Avg(Temp(area=$a).value)within 1 hour from SmokePredefined
  • 130.
  • 131.
  • 132.
    AbstractionIFP languages andsystems offer a level of abstraction over flowing informationSimilar to the role SQL / DBMSs play for static dataThe heterogeneity of solutions suggests that finding the “right” abstraction is still an open issueSeveral open questionsApplicationsAPI (Including Rules)IFP System87DEBS 2011 - Tutorial
  • 133.
    AbstractionWhich are therequirements of applications?According to the results of EPTS surveySimple operations (mainly aggregate / join)Few sources (mainly 1-10, and 10-100)Low rates (mainly 10-100, and 100-1000 event/s)Data coming from DBMSs / files88DEBS 2011 - Tutorial
  • 134.
    AbstractionHow can weexplain these answers?IFP systems used as “enhanced” DBMSsIMO, companies are using the technology they knowE.g. Language: no need for patterns or no knowledge about patterns?Features vs Simplicity/IntegrationThis is the “consolidated” part of IFP systemsIMO, raising the level of abstraction would make IFP systems more appealing for a larger audienceIs it possible to combine advanced features and ease of use?89DEBS 2011 - Tutorial
  • 135.
    AbstractionOther details arepresent in the EPTS surveyMost of the people was referring to technologies that were “already deployed” or “in development”The systems were chosen based on the trust on the vendorThis suggests that the last (research) advancements in event processing may be still unknownThey lack a solid implementation …… that can easily work side by side with existing data processing systems90DEBS 2011 - Tutorial
  • 136.
    AbstractionHow many kindsof languages / abstractions?Traditionally, twoTransforming rulesBased on the Stream modelDetecting rulesBased on patternsOften merged together in commercial systems“Merged together” does not mean “organically combined”The underlying model is usually derived from the Stream modelGood integration with relational languagesDifficult to integrate with patternsDifficult to understand the semantics of rulesCan we offer a better model / formalism?91DEBS 2011 - Tutorial
  • 137.
    AbstractionDo we reallyneed “One abstraction to rule ‘em all”?Again, we need to better understand application requirements …… but we also need to better analyze the expressiveness and semantics of languages!In our survey, we offer a description of existing operatorsOpen issues:How do operators contribute to the overall expressiveness of languages?Is it possible to find a “minimal set” of operators to organically combine the capabilities of transforming and detecting rules?92DEBS 2011 - Tutorial
  • 138.
    Our Proposal: TESLA“Onesize fits all” language does not existAt least, we have to find it yetWe started from the following requirementsFocus specifically on eventsNot generic data processingDefine a clear and precise semanticsIssues of selection and consumption policiesGeneral issue of formal semanticsBe expressiveUseful for applicationsKeep it simpleEasy to read and write93DEBS 2011 - Tutorial
  • 139.
    DefineCE(Att1 : Type1,…, Attn : Typen)From PatternWhere Att1 = f1(..), …, Attn = fn(..)Consuminge1, …, emTESLA94DEBS 2011 - Tutorial
  • 140.
    Patterns in TESLACESelectionof a single eventA(x>10)Timer()Selection of sequencesA(x>10) and each B within 5 min from AA(x>10) and last B within 5 min from AA(x>10) and first B within 5 min from AGeneralizationn-first / n-lastA(X=15)BB5 minCECEA (X=15)BBCEA (X=15)BBCEA (X=15)BB95DEBS 2011 - Tutorial
  • 141.
    Patterns in TESLATESLAallows *-within operators to be composed with each other:In chains of eventsA and each Bwithin 3 min from A and last Cwithin 2 min from BIn parallelA and each Bwithin 3 min from A and last Cwithin 4 min from AParameters can be added between events in a patternABCBAC96DEBS 2011 - Tutorial
  • 142.
    Negations and AggregatesTwokinds of negations:Interval based:A and last Bwithin 3 min from Aand not C between B and ATime based:A and not C within 3 min from ASimilarly, two kinds of aggregatesInterval basedUse values appearing between two eventsTime basedUse values appearing in a time interval97DEBS 2011 - Tutorial
  • 143.
    IterationsWe believe thatexplicit operators for repetitions are difficult to use/understandEspecially when different selection/consumption policies are allowedWe achieve the same goal using hierarchies of eventsComplex events can be used to define (more) complex eventsRecurring schemes of repetitionsMacros! 98DEBS 2011 - Tutorial
  • 144.
    Formal SemanticsThe semanticsof each TESLA operator / rule is formally defined through a set of TRIO metric temporal logic formulasThey define an equivalence between the presence of a given pattern and the occurrence of one or more complex events, specifying:The occurrence time of complex eventsThe content of complex eventsThe set of consumed eventsThe language is unambiguousA user can in advance check the semantics of her rulesThis makes it possible to check the correctness of an event detection engine using a model checkerGiven a history of events and a set of TESLA rules …… does the engine satisfy all the formulas ?99DEBS 2011 - Tutorial
  • 145.
    AbstractionSupport for uncertaintyManyapplications deal with imprecise (or even incorrect) data from sourcesE.g., sensors, RFID, …Uncertain rules may increase the expressiveness / usefulness of an IFP systemAddressed only in a few research papers …… and still missing in existing systems100DEBS 2011 - Tutorial
  • 146.
    AbstractionWhich are thekey requirements for an uncertainty model?It should be suitable for the application it is designed forIs it possible to have a single model?It should be easy to manage / understandTo allow application domain experts to easily build rules on top of itIt should have a limited impact on the performance of the overall systemTo preserve the required QoS101DEBS 2011 - Tutorial
  • 147.
    AbstractionSupport for QoSpoliciesAllow users to define application-dependent policiesWhich data is more important?Which items is it better to discard in case of system overload?Which QoS metric is more significant for the application?Completeness of resultsLatency…Traditionally offered only by DSMSsMost systems simply adopt a best-effort strategy102DEBS 2011 - Tutorial
  • 148.
    AbstractionSupport for aKnowledge BaseSome systems offer special operations to read from persistent storagePossibly a key-feature for the integration with existing DBMSsSupport for periodic evaluation of rulesAs opposed to purely reactive evaluation103DEBS 2011 - Tutorial
  • 149.
    AbstractionCapability to dynamicallychange the set of deployed rulesAdd / RemoveActivate / DeactivateContext awarenessTutorial at DEBS 2010Could be used to increase the expressiveness of IFP …… but also to simplify / speedup processingContext / Situation used to reason on the status of the system / applicationPossibly combined with the possibility to activate / deactivate rules104DEBS 2011 - Tutorial
  • 150.
    AbstractionWhere to offerthe IFP abstraction?As a standalone product/systemSimilar to DBMSsAs a middleware componentSimilar to a Publish/Subscribe systemTo guide the interaction of other componentsAs part of a programming languageSimilar to what LINQ (Language-Integrated Query) offers to C# and to the .Net FrameworkExplored in EventJavaExtension of Java for event correlation105DEBS 2011 - Tutorial
  • 151.
    TimeGoing back toour time model …… is there a “right approach” to deal with time?Using models that define a total order among information items seems to me easier to manage/understandDo we really need interval time for events?How do we define operators like sequences?To offer a proper level of abstraction …… but also to allow an efficient implementation106DEBS 2011 - Tutorial
  • 152.
    TimeMany existing systemsoffer operators that rely of some notion of timeMain problem: out-of-order arrival of information itemsRequirementsPreserve the semantics of processingKeep the delay for obtaining results as small as possibleSolutionsTrade-off between correct and low latency processingSometimes controlled by users through predefined or customizable policiesOpen issuesSolutions for distributed / incremental processing may introduce delay at each processing stepProblem not addressed in existing solutions for distributed processing107DEBS 2011 - Tutorial
  • 153.
    ProcessingMany processing algorithmsproposedDSMSs cannot exploit traditional indexing techniquesNew effort is required to efficiently evaluate standing queriesMost CEP systems are based on automata …… or in general incremental processingPerform processing incrementally, as new information is availableAcademia: ODE, Cayuga, NextCEP, Amit, TRex, etc.Industry: Esper/NesperWe are currently investigating other solutionsKeep the whole history of all received information itemsAd-hoc data structures that simplify information retrievalStart evaluation only when requiredFaster!Easier to parallelize (more on this later …)108DEBS 2011 - Tutorial
  • 154.
    Our ExperienceProcessing Time(microsec)Multiple Selection PolicyProcessing Time (microsec)Single Selection Policy109DEBS 2011 - Tutorial
  • 155.
    ProcessingAre we usingthe right algorithms?Can we isolate the best algorithm for a particular set of operators?Accurate analysis of performanceCan we switch among different algorithms depending from the workload?Deployed rules …… but also analysis of trafficAdaptive middleware110DEBS 2011 - Tutorial
  • 156.
    ProcessingExploit parallel hardwareMulti-coreCPUs, but also GP-GPUsTo handle several rules concurrentlyTo speedup complex operations inside a single ruleOnly partially investigatedE.g., streaming aggregatesCan be extended to different operators?Also in this case the best solution may depend from the workload111DEBS 2011 - Tutorial
  • 157.
  • 158.
    Our ExperienceGPU cansignificantly speedup huge computationsFixed overhead to activate it makes it unsuitable for simple tasksOptimal workload: reduced set of very complex rulesAdditional advantage: CPU cores are free for other taskProcess simple rules …… but also handle marshalling/unmarshalling and communication with sources and sinksThe trend. GPUs are becoming:Easier to programSuitable for an increasing number of tasksNot only data parallel algorithms113DEBS 2011 - Tutorial
  • 159.
    ProcessingExploit existing modelsfor computation on clusters / cloudsRecently received great attentionE.g. Map-ReduceAre these models suitable for IFP systems?To efficiently support the expressiveness of existing languagesCan we create new models?Explicitly conceived to deal with IFP rules114DEBS 2011 - Tutorial
  • 160.
    ProcessingExploit domain-dependent informationto speedup processingE.g., implication between complex events or situationsCan be part of context information, or combined with itIn certain cases information can be automatically deducedDepending from the data and rule models adopted115DEBS 2011 - Tutorial
  • 161.
    Distribution of ProcessingMostsystems rely on centralized processingWhen distributed processing is allowed, a clustered deployment model is often adoptedCooperating processors are co-locatedThe problem of networked distribution has been addressed only marginallyAre there applications that involve large scale scenarios?E.g. monitoring applicationsPotentially introduce new issuesBandwidth may become a bottleneckData may be transferred over non-dedicated channelsSeveral applications may run on the same “processing network”116DEBS 2011 - Tutorial
  • 162.
    Operator PlacementCurrent approachesmainly focus on (sub)optimal placement with global knowledgeOften based on over-simplified model …… that ignore important implementation detailsResearch directionsDistributed protocols allowing autonomous decisions at each nodeFast adaptation to application loadMonitoring and statistical analysis of dataOptimization of multiple rules on the same processing network117DEBS 2011 - Tutorial
  • 163.
    WorkloadsA major problemwhen it comes to evaluate and compare different systems is the lack of workloadsBenchmarkOnly one benchmark available: Linear RoadStrongly tailored for the Stream modelDifficult to adapt to other systemsHuge parameter spaceDifficult to isolate relevant casesDEBS Grand Challenge may be a first step in this direction …Data coming from real / realistic use casesTo understand which operators are more important …… and what to optimize in the processing algorithm118DEBS 2011 - Tutorial
  • 164.
    Security / PrivacyDefinesecurity models and policiesWhich pieces of information can be consumed be a single operator or a set of operatorsImportant: output data may have a different visibility w.r.t. input data!The system may be allowed to consume single data items, but not to extract new information from them, or to distribute this informationImplement protocols and algorithms to enforce them119DEBS 2011 - Tutorial
  • 165.
    Thanks for yourAttention!120DEBS 2011 - Tutorial
  • 166.

Editor's Notes

  • #15 Parlarediintgrazione con I DB quandoparlidiKnowledgeBase
  • #16 Parlarediintgrazione con I DB quandoparlidiKnowledgeBase
  • #17 Parlarediintgrazione con I DB quandoparlidiKnowledgeBase
  • #18 Parlarediintgrazione con I DB quandoparlidiKnowledgeBase
  • #19 Parlarediintgrazione con I DB quandoparlidiKnowledgeBase
  • #20 Parlarediintgrazione con I DB quandoparlidiKnowledgeBase
  • #21 Parlarediintgrazione con I DB quandoparlidiKnowledgeBase
  • #36 Clustered -&gt; Processing nodes are co-located. Strictly connected. Usually adopting a shared-memory model. The aim is to split the processing load among different processing elements.Networked -&gt; Processing nodes are geographically distributed. No shared memory. The aim is to reduce network usage by pushing processing as near as possible to sources.
  • #37 MicrosoftStreamInsight allows users / system administrators to deploy multiple services, so that [elencopuntato]
  • #38 What do we mean by automatic distribution? Take for example P2P applications for file sharing. No need to manually configure the network when new nodes join or leave it.
  • #42 Heteorgeneous Resources -&gt; e.g., when the algorithm tries to deploy part of the processing load on small devices like mobile phones or even wireless sensorsRule Rewriting -&gt; Not only optimization inside single rules, but also multi-rule optimization. The rewriting algorithm depends from a set of rules.It is not clear if reuse and replications are assumptons from which starting or they are a “value” of the result and which one is better.
  • #46 Observation model -&gt; hybrid is common in existing applications. Indeed, they include both sources that actively send information (e.g., sensors) and sources with stable information the engine can use during processing (e.g., databases). These are modeled in our functional model using the knowledge base component.
  • #48 Ordering of output streams is conceptually separate from the ordering of input streams -&gt; Consider for example the use of Dstream, that outputs all the elements that are removed from the processing table. Or consider a sort by clause over a generic attribute.
  • #49 Gigascope -&gt; explicitly conceived to monitor traffic. Does not use the traditional Stream model. It does not use intermediate database tables for processing, but directly defines output streams starting from input streams.This requires some assumptions on the ordering of items to define a semantics for operators like join and group by.Allows the system to decide how long to keep an information item
  • #55 Data model -&gt; studies how the different systems represent information and organize them into streams.
  • #56 Processing of generic data, as derived from the relational modelAs already discussed in the time model, no relation between processing and time / timestampsIn TESLA  event notifications: strong focus on time (usually totally ordered, absolute or interval)
  • #57 Data model -&gt; studies how the different systems represent information and organize them into streams.
  • #58 Siriesce a trovare un esempio?
  • #60 This distinction also appears (sometimes implicitly) in the related work section of many papers.
  • #61 We will provide a few examples to better clarify these aspects in the next few slides, when we enter more into the details about languages and operators.
  • #88 Probably for our background (we are part of a software engineering group), the first topic that we want to address is about the abstraction we are building with event processing.As we will see this is also a rich field: despite many works have been carried out, in my opinion much has to be done.
  • #89 If we believe this data, we have already finished our work. We already build (many) systems that satisfy these requirements.