# Automated Correlation Discovery for Semi-Structured Business Processes

### Transcript

• 1. Automated Correlation Discovery for Semi-Structured Business Processes DEBS 2011 Szabolcs Rozsnyai, Aleksander Slominski, Geetika T. Lakshmanan
• 2. Agenda
• Motivation
• Big Picture and Context
• Related Work
• Algorithm (with Examples)
• Data Pre-Processing
• Statistics Calculation
• Determining Correlation Candidates
• Screenshots of prototype application
• Conclusion & Future Work
• 3. Motivation
• Event-producing systems are
• distributed,
• rapidly changing,
• federated,
• loosely coupled,
• generating huge numbers of events.
• Correlating events requires a lot of knowledge about the source systems and their data.
We present a novel algorithm that automatically determines correlation rules for monitoring, discovery, and other applications.
• 4. Solution Overview
• Correlation rules describe common identifiers, defined as a correspondence between the attributes of two different types.
• Correlation Rule Example: A.x = B.y, where A and B are types and x and y are attributes.
• The correlation rules are determined by a unique combination of statistics applied to event attributes: several attribute statistics are taken into account to improve the precision of correlation candidate detection and to calculate a confidence score.
• The algorithm requires no knowledge about the structure of artifacts (e.g. the event format could be anything, such as XML) or the data types of their attributes, nor does it require a normalized organization of artifacts.
• The confidence score defines the significance of a correlation rule.
• Correlation rules discovered by our algorithm can be used either at runtime to group related artifacts together, such as events belonging to a process instance, or to create a graph of relationships that enables querying and walking the paths of relationships.
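A minimal sketch of how a discovered rule of the form A.x = B.y might be applied at runtime to link related events. The dictionary-based event layout, the `Rule` tuple and the `correlate` function are hypothetical names chosen for illustration; they are not taken from the presented prototype.

```python
from collections import namedtuple

# Hypothetical representation of a correlation rule A.x = B.y.
Rule = namedtuple("Rule", ["type_a", "attr_a", "type_b", "attr_b"])

def correlate(events, rule):
    """Return all pairs (a, b) of events that satisfy the rule A.x = B.y."""
    # Index events of type B by the value of attribute y.
    index = {}
    for event in events:
        if event["type"] == rule.type_b and rule.attr_b in event:
            index.setdefault(event[rule.attr_b], []).append(event)
    # Look up each event of type A by the value of attribute x.
    pairs = []
    for event in events:
        if event["type"] == rule.type_a and rule.attr_a in event:
            for match in index.get(event[rule.attr_a], []):
                pairs.append((event, match))
    return pairs

events = [
    {"type": "OrderReceived", "OrderId": "166635", "Product": "ProductA"},
    {"type": "ShipmentCreated", "ShipmentId": "253355", "OrderId": "166635"},
    {"type": "ShipmentCreated", "ShipmentId": "253356", "OrderId": "999999"},
]
rule = Rule("OrderReceived", "OrderId", "ShipmentCreated", "OrderId")
pairs = correlate(events, rule)
```

Each returned pair can be read as one edge in a graph of relationships; chaining several rules would then group the events of one process instance.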
• 6. Big Picture and Context
• 8. Related Work
• Motahari Nezhad et al. (HP)
• Their approach mainly takes instance-based measures into account to determine the “interestingness” of correlation pairs (and groups of pairs).
• That means that they first prune the large space of potential correlation pairs using techniques similar to DePauw et al. and then correlate the data with this large set of correlation rules to generate various correlated instances. Then they apply certain statistics to the instances to determine whether the correlation rules make sense.
• DePauw et al. (IBM)
• The work by DePauw et al. has at its core a certain similarity to our algorithm. For instance, we also take the notion of Indexable and Mappable Paths into account, but with the major purpose of reducing the problem space of candidate-pair permutations that need to be checked against each other for potential correlations. In our algorithm this step is optional; without it, every attribute of a type is matched against every attribute of another type.
• In addition, our correlation algorithm takes several attribute-based statistics into account to improve the precision of correlation candidate detection and also calculates a confidence score based on those statistics.
• CORDS (IBM) is a tool that uses statistical methods to discover correlations and soft functional dependencies between database columns and produces a dependency graph to improve the performance of query optimizers.
• In the database world, detailed knowledge about the data is available, defined either in the schema or in metadata. This means that there are defined relations and attributes whose type (e.g. integer, string, timestamp, …) is known.
• A key difference between our algorithm and other approaches is that it does not assume that artifacts are grouped together in a normalized schema, nor does it have any metadata that describes an artifact's attributes.
• 10. Overview
• Our algorithm for correlation discovery is divided into three major steps:
• Data Pre-Processing.
• The first step of the correlation discovery process is to load and integrate the data into a data store (e.g. database, cloud storage, etc.) that is then used to calculate statistics and determine correlation candidates.
• Statistics Calculation.
• After the data has been loaded and integrated into the internal representation, various statistics, mainly on attribute values, are calculated and stored into a fast accessible data structure.
• Determining Correlation Candidates.
• In the last step the correlation discovery algorithm determines correlation pairs with a certain confidence value based on the previously calculated statistics.
• 11. Data Pre-Processing
• Raw events are stored in a data store
• Attributes of events are extracted (method of extraction is not in scope of this idea)
• Events have a type assigned (e.g. OrderReceived, ShipmentCreated, TransportStarted, …)
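Example of the stored representation. Each Raw Event has the common columns Key, Timestamp, Type and Raw; the Event Attributes differ per EventType:
• OrderReceived: Key = 32123…, Timestamp = 2011-01-01T09:35:52.50, Raw = <OrderReceived…; Event Attributes: DateTime = 2011-01-01T09:35:52.50, OrderId = 166635, Product = ProductA, …
• ShipmentCreated: Key = 213131…, Timestamp = 2011-01-01T09:40:54.50, Raw = <ShipmentCreated…; Event Attributes: DateTime = 2011-01-01T09:31:52.50, ShipmentId = 253355, OrderId = 166635, …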
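The slides leave the extraction method out of scope, so the following is only a sketch of what one stored record could look like. It assumes, purely for illustration, that a raw event is a flat XML document whose root tag is the event type and whose child elements are the attributes; `preprocess` and the key `evt-1` are hypothetical.

```python
import xml.etree.ElementTree as ET

def preprocess(key, timestamp, raw_xml):
    """Build one stored record: the raw event fields plus extracted attributes."""
    root = ET.fromstring(raw_xml)
    # Assumption: each direct child element of the root is one event attribute.
    attributes = {child.tag: (child.text or "") for child in root}
    return {
        "Key": key,
        "Timestamp": timestamp,
        "Type": root.tag,       # assumption: the root tag names the event type
        "Raw": raw_xml,         # the raw event is kept unchanged
        "Attributes": attributes,
    }

raw = (
    "<OrderReceived><DateTime>2011-01-01T09:35:52.50</DateTime>"
    "<OrderId>166635</OrderId><Product>ProductA</Product></OrderReceived>"
)
record = preprocess("evt-1", "2011-01-01T09:35:52.50", raw)
```

All attribute values are kept as strings, which matches the statement that the algorithm needs no knowledge of attribute data types.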
• 12. Statistics Calculation 1/2
• 13. Statistics Calculation 2/2
• Attribute Cardinality (Index): Contains a map of each value and how often each of those values occurred.
• Card: Determines the number of different values for the attribute.
• Cnt: Represents the total number of instances in which the attribute occurs. As the data structure does not work on a defined schema, it is possible that the attribute does not occur in every instance.
• AvgAttributeLength: Represents the average length of the attribute's values. This is an indicator of the potential uniqueness of a value. A long value might be a sign that the attribute is a unique identifier. A unique identifier such as OrderId is an attribute that potentially occurs in other types and thus forms a correlation. This may also be misleading, since a textual description may be very long and in fact unique but is never used for correlating artefacts.
• InferencedType:
• Defines the type of an attribute. The type of an attribute is an important characteristic for correlation discovery because it reduces the problem space of correlation candidates. The chance that an attribute of one type (e.g. numeric) correlates with an attribute that contains mostly values of another type (e.g. alphanumeric) is very low.
• The determination of the type is made with a fault tolerance of 0.9 (e.g. min. 90% of the values must be numeric), and we refer to this as a parameter Phi.
• Currently, the following type distinctions are supported:
• Numeric or Alphanumeric
• Timestamp/DateTime
• Boolean
• Description text
• NoOfNumeric: Depending on the InferencedType, this variable contains the number of values that are of a numeric type.
• NoOfAlphaNum: Depending on the InferencedType, this variable contains the number of values that are of an alphanumeric type.
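The statistics above can be sketched for a single attribute as follows. This is an illustration under assumptions, not the prototype's code: values are treated as strings, `None` marks an instance where the attribute is missing, the per-value classification rules and the fallback to Alphanumeric when no type reaches Phi are choices made here, and description texts are not detected because the slides give no criterion for them.

```python
from collections import Counter
from datetime import datetime

PHI = 0.9  # fault tolerance for type inference, as named on the slide

def classify(value):
    """Classify one string value (assumed rules, for illustration only)."""
    if value.lower() in ("true", "false"):
        return "Boolean"
    try:
        float(value)
        return "Numeric"
    except ValueError:
        pass
    try:
        datetime.strptime(value, "%Y-%m-%dT%H:%M:%S.%f")
        return "Timestamp"
    except ValueError:
        return "Alphanumeric"

def attribute_statistics(values, phi=PHI):
    """Compute Index, Card, Cnt, AvgAttributeLength and InferencedType."""
    present = [v for v in values if v is not None]
    index = Counter(present)                 # Index: value -> occurrences
    cnt = len(present)                       # Cnt: instances with the attribute
    card = len(index)                        # Card: number of distinct values
    avg_len = sum(len(v) for v in present) / cnt if cnt else 0.0
    type_counts = Counter(classify(v) for v in present)
    inferred = "Alphanumeric"                # assumed fallback
    for type_name, n in type_counts.items():
        if cnt and n / cnt >= phi:
            inferred = type_name
    return {
        "Index": index,
        "Card": card,
        "Cnt": cnt,
        "AvgAttributeLength": avg_len,
        "InferencedType": inferred,
        "NoOfNumeric": type_counts["Numeric"],
        "NoOfAlphaNum": type_counts["Alphanumeric"],
    }

stats = attribute_statistics(["166635", "166636", "166637", "166637", None])
```

In this call one of five instances lacks the attribute, so Cnt is smaller than the number of instances, which is the situation the Cnt definition describes.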
• 14. Example
• 15. Example - Index
• The attribute cardinality (i.e. Index) contains a map of each value and how often each of those values occurred.
• 16. Example - Card
• 4 unique values. Card determines the number of different values for the attribute.
• 17. Example - Cnt
• Cnt = 5. For certain attributes the number might be smaller, as they can be null or missing.
• Cnt represents the total number of instances in which the attribute occurs. As the data structure does not work on a defined schema, it is possible that the attribute does not occur in every instance.
• 18. Example - AvgAttributeLength
• AvgAttributeLength is calculated. It represents the average length of the attribute's values. This is an indicator of the potential uniqueness of a value. A long value might be a sign that the attribute is a unique identifier. A unique identifier such as OrderId is an attribute that potentially occurs in other types and thus forms a correlation. This may also be misleading, since a textual description may be very long and in fact unique but is never used for correlating artefacts.
• 19. Example - InferencedType
• Determines the data type. InferencedType defines the type of an attribute. The type of an attribute is an important characteristic for correlation discovery because it reduces the problem space of correlation candidates. The chance that an attribute of one type (e.g. numeric) correlates with an attribute that contains mostly values of another type (e.g. alphanumeric) is very low. The determination of the type is made with a fault tolerance of 0.9 (e.g. min. 90% of the values must be numeric), and we refer to this as a parameter Phi.
• 20. Example – The rest of the types…
• 21. Example – The rest of the types…
• 22. Determining Correlation Candidates
• The confidence score of correlation candidates is determined by the following three parameters with a default set of weights.
• Set Difference. The set difference between the values of two correlation candidates is assigned a weight of 60%.
• Difference between AvgAttributeLength. The difference between the lengths of the values of two correlation candidates is assigned a weight of 20%.
• LevenshteinDistance. The Levenshtein distance between attribute names is assigned a weight of 20%.
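One way to read the weighting is as a weighted sum of three similarity scores. The sketch below assumes that each of the three differences has already been normalized to the range 0 to 1 (0 meaning identical); the slides do not state how the normalization is done, so that part and the `confidence` function itself are assumptions.

```python
# Default weights from the slide; the weighting is adjustable.
WEIGHTS = {"set_diff": 0.6, "avg_length": 0.2, "name_distance": 0.2}

def confidence(set_diff, avg_length_diff, name_distance, weights=WEIGHTS):
    """Combine three normalized differences (0 = identical) into a percentage."""
    for value in (set_diff, avg_length_diff, name_distance):
        if not 0.0 <= value <= 1.0:
            raise ValueError("inputs must be normalized to [0, 1]")
    # Each difference is turned into a similarity and weighted.
    similarity = (
        weights["set_diff"] * (1.0 - set_diff)
        + weights["avg_length"] * (1.0 - avg_length_diff)
        + weights["name_distance"] * (1.0 - name_distance)
    )
    return similarity * 100.0

perfect = confidence(0.0, 0.0, 0.0)   # no difference in any parameter
partial = confidence(0.5, 0.0, 0.0)   # half of the values differ
```

With all three differences at zero the score reaches the maximum, which agrees with the 100% confidence shown for the example pairs later in the deck.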
• 23. Difference Set 1/2
• The first confidence score is calculated by creating the difference set of all permutations of pairs of all attribute candidates.
• To reduce the search space of candidates we applied an approach similar to [1][2], where we first determine so-called Highly Indexable Attributes for each type and then Mappable Attributes to form pair candidates.
• Highly Indexable Attribute: A Highly Indexable Attribute is an attribute that is potentially unique for each instance of a type. This attribute is determined by the following formula: Card / Cnt > Alpha ∧ AvgAttributeLength > Epsilon
• Alpha is a threshold parameter that determines the minimum ratio (i.e. uniqueness) of Card / Cnt and thus allows a small deviation that can be caused for instance by duplicates.
• Epsilon is an additional parameter that defines the minimum average length of an attribute.
• Mappable Attribute: Mappable Attributes are a means to reduce the search space of potentially correlating attributes of a type. One approach is to set an upper threshold on how often a value of an attribute can occur. The assumption is that if a value occurs Gamma times or more, the attribute is unlikely to be a correlation candidate: { x_i | x < Gamma }, where x is the cardinality of a value and i is an attribute of a type.
• Gamma is a threshold parameter that can be set experimentally and customized to the application scenario based on knowledge of the artefacts.
[1] I. Ilyas, V. Markl, P. Haas, P. Brown (2004). CORDS: Automatic discovery of correlations and soft functional dependencies.
[2] A. Rostin, O. Albrecht, F. Naumann, J. Bauckmann, U. Leser (2009). A Machine Learning Approach to Foreign Key Discovery (WebDB).
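The two pruning checks on this slide can be sketched as two small predicates. The default values for Alpha, Epsilon and Gamma are the ones used in the example slides of this deck; reading "mappable" as "no single value occurs Gamma times or more" is one interpretation of the slide, and the function names are hypothetical.

```python
def is_highly_indexable(card, cnt, avg_attribute_length, alpha=0.9, epsilon=5):
    """Card / Cnt > Alpha and AvgAttributeLength > Epsilon."""
    if cnt == 0:
        return False  # the attribute never occurs, so it cannot be indexable
    return card / cnt > alpha and avg_attribute_length > epsilon

def is_mappable(index, gamma=10):
    """index maps each value to its number of occurrences.

    Assumed reading: the attribute is mappable if every value
    occurs fewer than gamma times.
    """
    return all(occurrences < gamma for occurrences in index.values())

unique_id = is_highly_indexable(card=5, cnt=5, avg_attribute_length=6.0)
duplicated = is_highly_indexable(card=4, cnt=5, avg_attribute_length=6.0)
short_value = is_highly_indexable(card=5, cnt=5, avg_attribute_length=3.0)
few_repeats = is_mappable({"166635": 2, "166636": 1})
many_repeats = is_mappable({"ProductA": 12, "ProductB": 1})
```

The second call has a Card / Cnt ratio of 0.8, which falls below Alpha = 0.9, so that attribute is not treated as highly indexable.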
• 24. Example – Determining Highly Indexables
• Card / Cnt > Alpha ∧ AvgAttributeLength > Epsilon
• Alpha is a threshold parameter that determines the minimum ratio (i.e. uniqueness) of Card / Cnt and thus allows a small deviation that can be caused for instance by duplicates.
• Epsilon is an additional parameter that defines the minimum average length of an attribute.
Calculate Card / Cnt. Ratios shown on the slide: 1, 1, 0.8, 0.8, 0.2, 1
• 25. Example – Determining Highly Indexables
• Card / Cnt > Alpha ∧ AvgAttributeLength > Epsilon
• Alpha is a threshold parameter that determines the minimum ratio (i.e. uniqueness) of Card / Cnt and thus allows a small deviation that can be caused for instance by duplicates.
• Epsilon is an additional parameter that defines the minimum average length of an attribute.
Check Card / Cnt > Alpha (where Alpha = 0.9) ∧ AvgAttributeLength > Epsilon (where Epsilon = 5) against the Card / Cnt ratios shown on the slide: 1, 1, 0.8, 0.8, 0.2, 1
• 26. Example – Determining Mappables
• Mappable Attributes are a means to reduce the search space of potentially correlating attributes of a type. One approach is to set an upper threshold on how often a value of an attribute can occur. The assumption is that if a value occurs Gamma times or more, the attribute is unlikely to be a correlation candidate.
• { x_i | x < Gamma }, where x is the cardinality of a value and i is an attribute of a type.
• Card < Gamma where Gamma = 10
• For instance, in this domain it might be unlikely that a shipment has more than 10 orders. However, this might cause problems in other domains or for certain relationships (one customer definitely has more than 10 orders).
• 27. Difference Set 2/2
• After determining the Indexable and Mappable Attributes of all types, the next step is to find candidate pairs of attributes that potentially correlate with each other.
• Therefore a difference set A/B = {x | x ∈ A ∧ x ∉ B} between all permutations of attribute candidates A and B is created.
• A/B must be below a certain threshold in order to be taken into account: |A/B| <= DiffTreshold
• Candidate pairs of the permutation mixes are excluded if they have a mismatch of types based on the previously determined InferencedType.
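The candidate search on this slide can be sketched as follows. The example slides report the set difference as a percentage, so this sketch normalizes |A/B| by |A|; that denominator, the input layout, the restriction to attributes of different event types and the function names are assumptions made for illustration.

```python
from itertools import permutations

def set_difference_ratio(values_a, values_b):
    """Share of distinct values of A that do not occur in B (|A/B| / |A|)."""
    a, b = set(values_a), set(values_b)
    if not a:
        return 1.0  # assumption: an empty attribute never correlates
    return len(a - b) / len(a)

def candidate_pairs(attributes, diff_threshold):
    """attributes: {(event_type, name): {"values": [...], "inferred_type": str}}"""
    result = []
    for (key_a, a), (key_b, b) in permutations(attributes.items(), 2):
        if key_a[0] == key_b[0]:
            continue  # assumption: only compare attributes of different types
        if a["inferred_type"] != b["inferred_type"]:
            continue  # mismatch of InferencedType: no difference set is built
        ratio = set_difference_ratio(a["values"], b["values"])
        if ratio <= diff_threshold:
            result.append((key_a, key_b, ratio))
    return result

attributes = {
    ("OrderReceived", "OrderId"): {"values": ["1", "2", "3"], "inferred_type": "Numeric"},
    ("ShipmentCreated", "OrderId"): {"values": ["1", "2", "3"], "inferred_type": "Numeric"},
    ("ShipmentCreated", "ShipmentId"): {"values": ["7", "8", "9"], "inferred_type": "Numeric"},
}
found = candidate_pairs(attributes, diff_threshold=0.5)
```

Because permutations are used, a matching pair appears in both directions; a later step has to drop one of the two.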
• 28. Example
• Indexable Attributes
• OrderReceived: DateTime, OrderId, CustomerId
• ShipmentCreated: DateTime, ShipmentId, OrderId
• TransportStarted: TransportId, ShipmentId
• TransportEnded: DateTime, TransportId, ShipmentId
• Mappable Attributes
• To keep the examples simple, every attribute in our scenario is considered a Mappable Attribute, because the total number of instances is lower than the threshold.
• DateTime attributes are excluded because timestamps are of a type that is not suitable for correlation pairs. This also applies to booleans and description texts.
• 29. Example – DifferenceSet for all Permutations
• A/B = {x | x ∈ A ∧ x ∉ B}, |A/B| <= DiffTreshold, DiffTreshold = 0.95
• SetDiff for all permutations:
• OrderReceived.OrderId = ShipmentCreated.ShipmentId: 100%
• OrderReceived.OrderId = ShipmentCreated.OrderId: 0%
• OrderReceived.OrderId = TransportStarted.TransportId: 100%
• OrderReceived.OrderId = TransportStarted.ShipmentId: 100%
• …
• Resulting candidates of Correlation Pairs with 100% overlap:
• OrderReceived.OrderId = ShipmentCreated.OrderId: SetDiff 0%
• ShipmentCreated.ShipmentId = TransportStarted.ShipmentId: SetDiff 0%
• ShipmentCreated.ShipmentId = TransportEnded.ShipmentId: SetDiff 0%
• TransportStarted.TransportId = TransportEnded.TransportId: SetDiff 0%
• TransportEnded.TransportId = TransportStarted.TransportId: SetDiff 0%
• 30. Example – DifferenceSet for all Permutations
• A/B = {x | x ∈ A ∧ x ∉ B}, |A/B| <= DiffTreshold
• OrderReceived.OrderId = ShipmentCreated.OrderId: SetDiff 0%
• ShipmentCreated.ShipmentId = TransportStarted.ShipmentId: SetDiff 0%
• ShipmentCreated.ShipmentId = TransportEnded.ShipmentId: SetDiff 0%
• TransportStarted.TransportId = TransportEnded.TransportId: SetDiff 0%
• TransportEnded.TransportId = TransportStarted.TransportId: SetDiff 0%
• A difference often occurs, especially when processes are not completed, have been prematurely terminated/aborted, or events are not always generated because of decision forks. Bear in mind that this is a very simplified example!
• Pairs that are associative are removed!
• In this case every pair has the same type. In practice this is not the case! If they are not of the same type they are excluded from the permutation set and thus the difference set is not calculated.
• 31. Difference between AvgAttributeLength
• The second weighting factor for the confidence is the difference between the AvgAttributeLength of the two correlation candidates.
• If the average attribute lengths of two candidates differ strongly, it might mean that they do not share a significant relationship.
• 32. Example – AvgAttributeLength
• OrderReceived.OrderId = ShipmentCreated.OrderId: SetDiff 0%, AvgAttrLength 0
• ShipmentCreated.ShipmentId = TransportStarted.ShipmentId: SetDiff 0%, AvgAttrLength 0
• ShipmentCreated.ShipmentId = TransportEnded.ShipmentId: SetDiff 0%, AvgAttrLength 0
• TransportStarted.TransportId = TransportEnded.TransportId: SetDiff 0%, AvgAttrLength 0
• TransportEnded.TransportId = TransportStarted.TransportId: SetDiff 0%, AvgAttrLength 0
• 33. LevenshteinDistance
• The last variable that influences confidence weighting is the Levenshtein distance between the names of two attributes.
• Attributes from different sources that have the same meaning commonly have the same or comparable names.
• For example, in one system the attribute that contains the identifier for an order is named OrderId and in the other it is named order-id.
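For reference, the Levenshtein distance between two attribute names can be computed with the standard dynamic-programming recurrence. The sketch below is a generic textbook implementation, not code from the prototype; lowercasing the names before comparison is shown as an optional choice that the slides do not mention.

```python
def levenshtein(s, t):
    """Minimum number of single-character insertions, deletions
    and substitutions needed to turn s into t."""
    previous = list(range(len(t) + 1))   # distances from "" to prefixes of t
    for i, char_s in enumerate(s, start=1):
        current = [i]                    # distance from s[:i] to ""
        for j, char_t in enumerate(t, start=1):
            cost = 0 if char_s == char_t else 1
            current.append(min(
                previous[j] + 1,         # delete char_s
                current[j - 1] + 1,      # insert char_t
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]

case_sensitive = levenshtein("OrderId", "order-id")
case_insensitive = levenshtein("OrderId".lower(), "order-id".lower())
```

For the two names from the slide, the case-sensitive distance is 3 (two case changes and one inserted hyphen) and the case-insensitive distance is 1.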
• 34. Example – LevenshteinDistance
• OrderReceived.OrderId = ShipmentCreated.OrderId: SetDiff 0%, AvgAttrLength 0, LevenshteinDistance 0
• ShipmentCreated.ShipmentId = TransportStarted.ShipmentId: SetDiff 0%, AvgAttrLength 0, LevenshteinDistance 0
• ShipmentCreated.ShipmentId = TransportEnded.ShipmentId: SetDiff 0%, AvgAttrLength 0, LevenshteinDistance 0
• TransportStarted.TransportId = TransportEnded.TransportId: SetDiff 0%, AvgAttrLength 0, LevenshteinDistance 0
• TransportEnded.TransportId = TransportStarted.TransportId: SetDiff 0%, AvgAttrLength 0, LevenshteinDistance 0
• 35. Example – Weight Calculation
• Weights: SetDiff 60%, AvgAttrLength 20%, LevenshteinDistance 20%. Weight is adjustable!
• OrderReceived.OrderId = ShipmentCreated.OrderId: SetDiff 0%, AvgAttrLength 0, LevenshteinDistance 0, Confidence 100%
• ShipmentCreated.ShipmentId = TransportStarted.ShipmentId: SetDiff 0%, AvgAttrLength 0, LevenshteinDistance 0, Confidence 100%
• ShipmentCreated.ShipmentId = TransportEnded.ShipmentId: SetDiff 0%, AvgAttrLength 0, LevenshteinDistance 0, Confidence 100%
• TransportStarted.TransportId = TransportEnded.TransportId: SetDiff 0%, AvgAttrLength 0, LevenshteinDistance 0, Confidence 100%
• TransportEnded.TransportId = TransportStarted.TransportId: SetDiff 0%, AvgAttrLength 0, LevenshteinDistance 0, Confidence 100%
• 37. Correlation Discovery
• 38. Correlation Discovery Refinement
• 40. Evaluation, Conclusion and Future Work
• Export compliance regulation
• Wide range of heterogeneous systems
• Order Management,
• Document Management,
• E-Mail,
• Export Violation Detection Services
• Workflow-supported human-driven interactions (Process Management System).
• 24 EventTypes
• 95 Attributes
Precision: 99.56%, calculated as No.of.RelevantCorrelationRules / (No.of.RelevantCorrelationRules + FalsePositives) * 100.
False positive example: correlation by “orderVolume”. Its values are always of similar size and the attribute has a min. length.
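The precision formula can be written as a one-line function. The counts in the example call are made up to show the arithmetic; the slides report only the resulting percentage, not the underlying counts.

```python
def precision(relevant_rules, false_positives):
    """Precision in percent: relevant / (relevant + false positives) * 100."""
    total = relevant_rules + false_positives
    if total == 0:
        raise ValueError("no correlation rules were discovered")
    return relevant_rules / total * 100

# Hypothetical counts, not the evaluation data from the slide.
example = precision(relevant_rules=3, false_positives=1)
```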
• 41. THANK YOU! Questions?