Automated Correlation Discovery for Semi-Structured Business Processes

DEBS 2011. Szabolcs Rozsnyai, Aleksander Slominski, Geetika T. Lakshmanan

Agenda

Motivation
We present a novel algorithm that automatically determines correlation rules for monitoring, discovery, and other applications.

Solution Overview

Related Work

Overview

Data Pre-Processing
Raw Event: common fields (Key, Timestamp, Type, Raw) plus Event Attributes (alias) per EventType
EventType OrderReceived
  Common: Key = 32123…, Timestamp = 2011-01-01T09:35:52.50, Type = OrderReceived, Raw = <OrderReceived…
  Event Attributes: DateTime = 2011-01-01T09:35:52.50, OrderId = 166635, Product = ProductA, …
EventType Shipment Created
  Common: Key = 213131…, Timestamp = 2011-01-01T09:40:54.50, Type = Shipment Created, Raw = <Shipment Created…
  Event Attributes: DateTime = 2011-01-01T09:31:52.50, ShipmentId = 253355, OrderId = 166635, …

Statistics Calculation 2/2
Attribute Cardinality (Index): A map of each value and how often it occurred.
Card: The number of different values for the attribute.
Cnt: The total number of instances in which the attribute occurs. Because the data structure does not rely on a defined schema, an attribute may not occur in every instance.
AvgAttributeLength: The average length of the attribute's values. This indicates the potential uniqueness of a value: a long value might be a sign that the attribute is a unique identifier. A unique identifier such as OrderId is an attribute that potentially occurs in other types and thus forms a correlation. The indicator can also be misleading, since a textual description may be very long and in fact unique, yet it is never used for correlating artefacts.
InferencedType: The inferred data type of the attribute.
NoOfNumeric: Depending on the InferencedType, the number of values that are of a numeric type.
NoOfAlphaNum: Depending on the InferencedType, the number of values that are of an alpha-numeric type.

Example - Index
The attribute cardinality (i.e. Index) contains a map of each value and how often each of those values occurred.

Example - Card
Card determines the number of different values for the attribute. In the example there are 4 unique values.

Example - Cnt
Cnt represents the total number of instances in which the attribute occurs. Because the data structure does not rely on a defined schema, an attribute may not occur in every instance. In the example Cnt = 5; for certain attributes the number might be smaller because they can be null or missing.

Example - AvgAttributeLength
AvgAttributeLength represents the average length of the attribute's values. This indicates the potential uniqueness of a value: a long value might be a sign that the attribute is a unique identifier. A unique identifier such as OrderId is an attribute that potentially occurs in other types and thus forms a correlation. The indicator can also be misleading, since a textual description may be very long and in fact unique, yet it is never used for correlating artefacts.

Example - InferencedType
InferencedType defines the data type of an attribute. The type is an important characteristic for correlation discovery because it reduces the problem space of correlation candidates: the chances that a numeric attribute correlates with an attribute that contains mostly alpha-numeric values are very low. The type is determined with a fault tolerance of 0.9 (e.g. at least 90% of the values must be numeric); we refer to this parameter as Phi.
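The statistics above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: the names `AttributeStats`, `compute_stats` and `is_numeric`, the type labels "numeric", "alphanumeric" and "mixed", and the rule that every non-numeric value counts as alpha-numeric are all hypothetical. Only the statistic names and Phi = 0.9 come from the slides.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class AttributeStats:
    index: Counter               # Index: value -> number of occurrences
    card: int                    # Card: number of distinct values
    cnt: int                     # Cnt: instances in which the attribute occurs
    avg_attribute_length: float  # AvgAttributeLength
    no_of_numeric: int           # NoOfNumeric
    no_of_alpha_num: int         # NoOfAlphaNum
    inferenced_type: str         # hypothetical labels, see lead-in


def is_numeric(value: str) -> bool:
    # Simplification: anything float() can parse counts as numeric.
    try:
        float(value)
        return True
    except ValueError:
        return False


def compute_stats(values, phi=0.9):
    """values: the attribute's value per event instance, None when missing."""
    present = [str(v) for v in values if v is not None]
    index = Counter(present)
    cnt = len(present)
    numeric = sum(1 for v in present if is_numeric(v))
    alpha_num = cnt - numeric  # assumption: non-numeric means alpha-numeric
    if cnt and numeric / cnt >= phi:
        inferred = "numeric"
    elif cnt and alpha_num / cnt >= phi:
        inferred = "alphanumeric"
    else:
        inferred = "mixed"
    avg_len = sum(len(v) for v in present) / cnt if cnt else 0.0
    return AttributeStats(index, len(index), cnt, avg_len,
                          numeric, alpha_num, inferred)


# Made-up sample values; None marks an instance without the attribute.
stats = compute_stats(["166635", "166636", "166635", "166637", None, "166638"])
print(stats.card, stats.cnt, stats.inferenced_type)
```

Missing values are skipped before counting, which is why Cnt can be smaller than the number of event instances.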

Example – The rest of the types…

Determining Correlation Candidates

Difference Set 1/2
[1] I. Ilyas, V. Markl, P. Haas, P. Brown. (2004). CORDS: Automatic discovery of correlations and soft functional dependencies.
[2] A. Rostin, O. Albrecht, F. Naumann, J. Bauckmann, U. Leser. (2009). A Machine Learning Approach to Foreign Key Discovery (WebDB).

Example – Determining Highly Indexables
Calculate Card / Cnt for each attribute. Values in the example: 1, 1, 0.8, 0.8, 0.2, 1.
Conditions: Card / Cnt > Alpha, where Alpha = 0.9; AvgAttributeLength > Epsilon, where Epsilon = 5.
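A minimal sketch of the highly-indexable check, assuming that both conditions must hold together (the slide lists them side by side without stating how they combine). The function name `is_highly_indexable` is hypothetical; Alpha = 0.9 and Epsilon = 5 are the slide's values.

```python
def is_highly_indexable(card, cnt, avg_attribute_length, alpha=0.9, epsilon=5):
    """Return True when the attribute looks like a near-unique identifier."""
    if cnt == 0:
        return False  # the attribute never occurs, so there is nothing to index
    # Assumption: the two conditions are combined with a logical AND.
    return card / cnt > alpha and avg_attribute_length > epsilon


# Card/Cnt = 1.0 with six-character values passes; Card/Cnt = 0.8 does not.
print(is_highly_indexable(card=5, cnt=5, avg_attribute_length=6.0))
print(is_highly_indexable(card=4, cnt=5, avg_attribute_length=6.0))
```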

Example – Determining Mappables
A mappable attribute can be seen as a means to reduce the search space of potentially correlating attributes of a type. One approach is to set an upper threshold on how often a value of an attribute can occur. The assumption is that if a value occurs more than Gamma times, it is unlikely to be a correlation candidate.
x … cardinality of a value; i … attribute of a type
{ x_i | x < Gamma }
Card < Gamma, where Gamma = 10
For instance, in this domain it might be unlikely that a shipment has more than 10 orders. However, this might cause problems in other domains or for certain relationships (one customer definitely has more than 10 orders).
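One way to read the threshold is sketched below. It assumes the check applies to the occurrence count of each individual value (the "cardinality of a value" in the slide's wording), so an attribute is mappable when no value occurs Gamma times or more. The slide's "Card < Gamma" line could be read differently, so treat this as one possible interpretation; `is_mappable` is a hypothetical name.

```python
from collections import Counter


def is_mappable(values, gamma=10):
    """True when every value of the attribute occurs fewer than gamma times."""
    index = Counter(v for v in values if v is not None)
    if not index:
        return False  # no values observed, nothing to map
    return max(index.values()) < gamma


# A value repeated 3 times stays under Gamma = 10; one repeated 10 times does not.
print(is_mappable(["a", "a", "a", "b"]))
print(is_mappable(["c"] * 10))
```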

Difference Set 2/2

Example
DateTime attributes are excluded because timestamps are of a type that is not suitable for correlation pairs. This also applies to booleans and description texts.

Example – DifferenceSet for all Permutations
A \ B = {x | x ∈ A ∧ x ∉ B}
|A \ B| <= DiffThreshold, where DiffThreshold = 0.95
Candidate pair | SetDiff
OrderReceived.OrderId = ShipmentCreated.ShipmentId | 100%
OrderReceived.OrderId = ShipmentCreated.OrderId | 0%
OrderReceived.OrderId = TransportStarted.TransportId | 100%
OrderReceived.OrderId = TransportStarted.ShipmentId | 100%
… | …
Resulting candidates of correlation pairs with 100% overlap:
Candidate pair | SetDiff
OrderReceived.OrderId = ShipmentCreated.OrderId | 0%
ShipmentCreated.ShipmentId = TransportStarted.ShipmentId | 0%
ShipmentCreated.ShipmentId = TransportEnded.ShipmentId | 0%
TransportStarted.TransportId = TransportEnded.TransportId | 0%
TransportEnded.TransportId = TransportStarted.TransportId | 0%
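The difference-set step can be sketched as follows. Several points are assumptions rather than facts from the slides: the difference is normalized as |A \ B| / |A| (the slides show percentages but do not give the denominator), pairs within the same event type are skipped, and the sample values are made up. The names `set_diff_ratio` and `correlation_candidates` are hypothetical.

```python
from itertools import permutations


def set_diff_ratio(a_values, b_values):
    """Share of distinct values of A that do not appear in B."""
    a, b = set(a_values), set(b_values)
    if not a:
        return 1.0  # nothing to match; treat as fully different
    return len(a - b) / len(a)


def correlation_candidates(attributes, diff_threshold=0.95):
    """attributes: dict mapping 'EventType.Attribute' to its observed values."""
    result = []
    for (name_a, vals_a), (name_b, vals_b) in permutations(attributes.items(), 2):
        if name_a.split(".")[0] == name_b.split(".")[0]:
            continue  # assumption: only pairs across event types are compared
        ratio = set_diff_ratio(vals_a, vals_b)
        if ratio <= diff_threshold:
            result.append((name_a, name_b, ratio))
    return result


# Made-up sample values for illustration only.
attributes = {
    "OrderReceived.OrderId": ["166635", "166636"],
    "ShipmentCreated.ShipmentId": ["253355", "253356"],
    "ShipmentCreated.OrderId": ["166635", "166636"],
}
for pair in correlation_candidates(attributes):
    print(pair)
```

Because permutations are ordered, each surviving pair is reported in both directions, which matches the mirrored TransportStarted/TransportEnded rows in the slide's result list.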

Difference between AvgAttributeLength

Example – AvgAttributeLength
Candidate pair | SetDiff | AvgAttrLength
OrderReceived.OrderId = ShipmentCreated.OrderId | 0% | 0
ShipmentCreated.ShipmentId = TransportStarted.ShipmentId | 0% | 0
ShipmentCreated.ShipmentId = TransportEnded.ShipmentId | 0% | 0
TransportStarted.TransportId = TransportEnded.TransportId | 0% | 0
TransportEnded.TransportId = TransportStarted.TransportId | 0% | 0

LevenshteinDistance

Example – LevenshteinDistance
Candidate pair | SetDiff | AvgAttrLength | LevenshteinDistance
OrderReceived.OrderId = ShipmentCreated.OrderId | 0% | 0 | 0
ShipmentCreated.ShipmentId = TransportStarted.ShipmentId | 0% | 0 | 0
ShipmentCreated.ShipmentId = TransportEnded.ShipmentId | 0% | 0 | 0
TransportStarted.TransportId = TransportEnded.TransportId | 0% | 0 | 0
TransportEnded.TransportId = TransportStarted.TransportId | 0% | 0 | 0
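For reference, a standard dynamic-programming Levenshtein distance is sketched below. The slides do not say which strings are compared; applying it to the attribute names is an assumption, though it is consistent with the table, where every pair shares the same attribute name and has distance 0. The function name `levenshtein` is hypothetical.

```python
def levenshtein(s, t):
    """Minimum number of single-character edits that turn s into t."""
    if len(s) < len(t):
        s, t = t, s  # keep the shorter string in the inner loop
    previous = list(range(len(t) + 1))
    for i, cs in enumerate(s, start=1):
        current = [i]
        for j, ct in enumerate(t, start=1):
            cost = 0 if cs == ct else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


# Assumption: the distance is taken between attribute names.
print(levenshtein("OrderId", "OrderId"))
```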

Example – Weight Calculation
Candidate pair | SetDiff | AvgAttrLength | LevenshteinDistance | Confidence
OrderReceived.OrderId = ShipmentCreated.OrderId | 0% | 0 | 0 | 100%
ShipmentCreated.ShipmentId = TransportStarted.ShipmentId | 0% | 0 | 0 | 100%
ShipmentCreated.ShipmentId = TransportEnded.ShipmentId | 0% | 0 | 0 | 100%
TransportStarted.TransportId = TransportEnded.TransportId | 0% | 0 | 0 | 100%
TransportEnded.TransportId = TransportStarted.TransportId | 0% | 0 | 0 | 100%
Weights: SetDiff 60%, AvgAttrLength 20%, LevenshteinDistance 20%. The weights are adjustable.
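The slides give the weights (60% / 20% / 20%) and the resulting confidence, but not the formula that combines them. The sketch below assumes a weighted sum in which each indicator is first normalized to [0, 1], with 0 meaning "identical", and then inverted into a similarity score. This reproduces 100% confidence when all three indicators are 0, but it is a guess at the combination rule, and `confidence` is a hypothetical name.

```python
def confidence(set_diff, avg_length_diff, levenshtein_distance,
               weights=(0.6, 0.2, 0.2)):
    """Weighted confidence in [0, 1].

    Assumption: every input is already normalized to [0, 1], where 0 means
    the two attributes are identical with respect to that indicator.
    """
    scores = (1.0 - set_diff, 1.0 - avg_length_diff, 1.0 - levenshtein_distance)
    return sum(w * s for w, s in zip(weights, scores))


# All indicators at 0, as in the example table.
print(f"{confidence(0.0, 0.0, 0.0):.0%}")
```

Passing a different `weights` tuple reflects the slide's remark that the weights are adjustable.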

Correlation Discovery Refinement

Evaluation, Conclusion and Future Work
Precision: 99.56%, calculated as (No.of.RelevantCorrelationRules / (No.of.RelevantCorrelationRules + FalsePositives)) * 100.
False positive example: correlation by "orderVolume". Its values are always of similar size and the attribute has a minimum length.
