Correlation rules are common identifiers, defined as a correspondence between the attributes of two different types.
Correlation Rule Example: A.x = B.y, where A and B are types and x and y are attributes.
The correlation rules are determined by a unique combination of statistics applied to event attributes: several attribute statistics are taken into account to improve the precision of the correlation candidate detection and to calculate a confidence score.
The algorithm requires no prior knowledge about the structure of artifacts (e.g. the event format could be anything, such as XML) or the data type of their attributes, nor does it require a normalized organization of artifacts.
The confidence score precisely defines the significance of a correlation rule.
Correlation rules discovered by our algorithm can be used either at runtime to group related artifacts together, such as events belonging to a process instance, or to create a graph of relationships that enables querying and walking the paths of relationships.
Their approach mainly takes instance-based measures into account to determine the “interestingness” of correlation pairs (and groups of pairs).
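As a rough illustration of the runtime use described above, the following sketch applies a single rule of the form A.x = B.y to a list of events. The CorrelationRule class, the correlate function, the dictionary-based event layout and the sample values are assumptions made for this example; they are not part of the algorithm as presented.

```python
from collections import defaultdict
from typing import NamedTuple


class CorrelationRule(NamedTuple):
    """A rule of the form A.x = B.y (hypothetical helper type)."""
    type_a: str
    attr_a: str
    type_b: str
    attr_b: str


def correlate(events, rule):
    """Return (a, b) pairs of events whose attribute values satisfy the rule."""
    # Index events of type B by the value of attribute y for fast lookup.
    by_value = defaultdict(list)
    for ev in events:
        if ev["type"] == rule.type_b and rule.attr_b in ev:
            by_value[ev[rule.attr_b]].append(ev)

    # Join every event of type A with the type B events sharing its value.
    pairs = []
    for ev in events:
        if ev["type"] == rule.type_a and rule.attr_a in ev:
            for match in by_value.get(ev[rule.attr_a], []):
                pairs.append((ev, match))
    return pairs


# Made-up sample events, for illustration only.
events = [
    {"type": "OrderReceived", "OrderId": "O-1001"},
    {"type": "OrderReceived", "OrderId": "O-1002"},
    {"type": "ShipmentCreated", "ShipmentId": "S-77", "OrderId": "O-1001"},
]
rule = CorrelationRule("OrderReceived", "OrderId", "ShipmentCreated", "OrderId")
related = correlate(events, rule)
```

The same pairs could also serve as edges when building a graph of relationships.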
That means that they first prune the large space of potential correlation pairs based on some techniques similar to DePauw and then correlate the data with this large set of correlation rules to generate various correlated instances. Then they apply certain statistics on the instances to determine if the correlation rules make sense.
DePauw et al. (IBM)
The work by DePauw et al. has at its core a certain similarity to our algorithm. For instance, we also take the notion of Indexable and Mappable Paths into account, but mainly to reduce the problem space of candidate-pair permutations that need to be checked against each other for potential correlations. In our algorithm this step is optional; without it, the algorithm attempts to match every attribute of a type against the attributes of the other types.
In addition our correlation algorithm takes several attribute-based statistics into account to improve the precision of the correlation candidate detection and also calculates a confidence score based on those statistics.
CORDS (IBM) is a tool that uses statistical methods to discover correlations and soft functional dependencies between database columns, producing a dependency graph that improves the performance of query optimizers.
In the database world, detailed knowledge about the data is available, defined either in the schema or in metadata. This means that there are defined relations and attributes whose type (e.g. integer, string, timestamp, …) is known.
A key difference between our algorithm and other approaches is that it does not assume that artifacts are grouped together in a normalized schema, nor does it have any metadata describing an artifact's attributes.
Our algorithm for correlation discovery is divided into three major steps:
Data Pre-Processing.
The first step of the correlation discovery process is to load and integrate the data into a data store (e.g. database, cloud storage, etc.) that is then used to calculate statistics and determine correlation candidates.
Statistics Calculation.
After the data has been loaded and integrated into the internal representation, various statistics, mainly on attribute values, are calculated and stored in a data structure that allows fast access.
Determining Correlation Candidates.
In the last step the correlation discovery algorithm determines correlation pairs with a certain confidence value based on the previously calculated statistics.
Statistics Calculation 2/2
Attribute Cardinality: Contains a map of each value and how often each of those values occurred.
Card: Determines the number of different values for the attribute.
Cnt: Represents the total number of instances in which the attribute occurs. As the data structure does not work on a defined schema, it is possible that the attribute does not occur in every instance.
AvgAttributeLength: Represents the average length of the values of the current attribute. This is an indicator of the potential uniqueness of a value. A long value might be a sign that the attribute is a unique identifier. A unique identifier such as OrderId is a potential attribute that occurs in other types and thus forms a correlation. This may also be misleading, since a textual description may be very long and in fact unique, yet it is never used for correlating artefacts.
InferencedType: Defines the type of an attribute. The type of an attribute is an important characteristic for correlation discovery, as it reduces the problem space of correlation candidates. The chances that an attribute correlates with another attribute whose values are mostly of a different type (e.g. numeric versus alpha-numeric) are very low.
The determination of the type is made with a fault tolerance of 0.9 (e.g. min. 90% of the values must be numeric); we refer to this parameter as Phi.
Currently the following type distinctions are supported:
Numeric or Alphanumeric
Timestamp/DateTime
Boolean
Descriptiontext
NoOfNumeric: Depending on the InferencedType, this variable contains the number of values that are of a numeric type.
NoOfAlphaNum: Depending on the InferencedType, this variable contains the number of values that are of an alpha-numeric type.
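The basic statistics listed above could be gathered in a single pass over the instances of a type. The sketch below is one possible way to do so; the AttributeStats class, the compute_stats function, the dictionary-based instances and the sample data are assumptions made for illustration.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class AttributeStats:
    index: Counter               # attribute cardinality: value -> occurrences
    card: int                    # number of different values
    cnt: int                     # instances in which the attribute occurs
    avg_attribute_length: float  # average length of the values


def compute_stats(instances, attribute):
    # There is no fixed schema, so skip instances where the attribute
    # is missing or None.
    values = [inst[attribute] for inst in instances
              if inst.get(attribute) is not None]
    index = Counter(values)
    cnt = len(values)
    avg_len = sum(len(str(v)) for v in values) / cnt if cnt else 0.0
    return AttributeStats(index, len(index), cnt, avg_len)


# Made-up instances; the third one lacks the "Product" attribute.
instances = [
    {"OrderId": "O-1001", "Product": "Pen"},
    {"OrderId": "O-1002", "Product": "Pen"},
    {"OrderId": "O-1003"},
]
stats = compute_stats(instances, "Product")
```

Here Cnt for "Product" is smaller than the number of instances because the attribute does not occur in every instance.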
Example - Index: The attribute cardinality (i.e. Index) contains a map of each value and how often each of those values occurred.
Example - Card: Determines the number of different values for the attribute. The example shows 4 unique values.
Example - Cnt: Represents the total number of instances in which the attribute occurs (Cnt=5 in the example). For certain attributes the number might be smaller, as they can be null or missing. As the data structure does not work on a defined schema, it is possible that the attribute does not occur in every instance.
Example - AvgAttributeLength: Represents the average length of the values of the current attribute. This is an indicator of the potential uniqueness of a value.
Example - InferencedType: Determines the data type of an attribute. The determination is made with a fault tolerance of 0.9 (e.g. min. 90% of the values must be numeric); we refer to this parameter as Phi.
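Type inference with the fault tolerance Phi might look roughly as follows. Only the threshold of 0.9 and the list of supported types come from the slides; the individual checks (ISO timestamps, "true"/"false" booleans, three or more words for a description text) and the order in which they are tried are assumptions of this sketch.

```python
from datetime import datetime

PHI = 0.9  # fault tolerance: min. 90% of the values must match a type


def _is_numeric(value):
    try:
        float(value)
        return True
    except ValueError:
        return False


def _is_boolean(value):
    return value.strip().lower() in {"true", "false"}


def _is_timestamp(value):
    # Assumption: timestamps are ISO 8601 strings.
    try:
        datetime.fromisoformat(value)
        return True
    except ValueError:
        return False


def _is_description(value):
    # Assumption: free text consists of several whitespace-separated words.
    return len(value.split()) >= 3


def infer_type(values, phi=PHI):
    """Return the first type matched by at least a share phi of the values."""
    checks = [
        ("Boolean", _is_boolean),
        ("Numeric", _is_numeric),
        ("Timestamp", _is_timestamp),
        ("DescriptionText", _is_description),
    ]
    for name, check in checks:
        matches = sum(1 for v in values if check(v))
        if values and matches / len(values) >= phi:
            return name
    return "Alphanumeric"  # fallback when no other type reaches phi


# Nine numeric strings and one outlier: still inferred as numeric.
mostly_numeric = [str(1000 + i) for i in range(9)] + ["n/a"]
inferred = infer_type(mostly_numeric)
```

The single outlier "n/a" is tolerated because 9 of 10 values still satisfy the numeric check.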
The first confidence score is calculated by creating the difference set of all permutations of pairs of all attribute candidates.
To reduce the search space of candidates, we applied an approach similar to [1][2]: we first determine the so-called Highly Indexable Attributes for each type and then the Mappable Attributes to form pair candidates.
Highly Indexable Attribute: A Highly Indexable Attribute is an attribute that is potentially unique for each instance of a type. It is determined by the following conditions:
Card / Cnt > Alpha
AvgAttributeLength > Epsilon
Alpha is a threshold parameter that determines the minimum ratio (i.e. uniqueness) of Card / Cnt and thus allows a small deviation that can be caused for instance by duplicates.
Epsilon is an additional parameter that defines the minimum average length of an attribute.
Mappable Attribute: The Mappable Attribute can be seen as a means to reduce the search space of potentially correlating attributes of a type. One approach is to set an upper threshold on how often a value of an attribute can occur. The assumption is that if it occurs more than x times, it is unlikely to be a correlation candidate.
x … Cardinality of a value
i … Attribute of a type
{ x_i | x < Gamma }
Gamma is a threshold parameter that can be set experimentally and customized to the application scenario based on knowledge of the artefacts.
[1] I. Ilyas, V. Markl, P. Haas, P. Brown (2004). CORDS: Automatic discovery of correlations and soft functional dependencies.
[2] A. Rostin, O. Albrecht, F. Naumann, J. Bauckmann, U. Leser (2009). A Machine Learning Approach to Foreign Key Discovery. WebDB.
Example: Card / Cnt > Alpha where Alpha = 0.9; AvgAttributeLength > Epsilon where Epsilon = 5. Values shown in the example: 1, 1, 0.8, 0.8, 0.2, 1.
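The two conditions can be checked directly once Card, Cnt and AvgAttributeLength are known. The following minimal sketch uses the example parameters Alpha = 0.9 and Epsilon = 5 and assumes that both conditions must hold; the function name and the sample numbers are hypothetical.

```python
ALPHA = 0.9   # minimum Card / Cnt ratio (value from the example)
EPSILON = 5   # minimum average attribute length (value from the example)


def is_highly_indexable(card, cnt, avg_attribute_length,
                        alpha=ALPHA, epsilon=EPSILON):
    """Check both conditions; assumes they must hold together."""
    if cnt == 0:
        return False  # the attribute never occurs, so it cannot be an index
    return card / cnt > alpha and avg_attribute_length > epsilon


# Made-up statistics for three attributes of one type.
unique_and_long = is_highly_indexable(card=5, cnt=5, avg_attribute_length=6.0)
has_duplicates = is_highly_indexable(card=4, cnt=5, avg_attribute_length=6.0)
too_short = is_highly_indexable(card=5, cnt=5, avg_attribute_length=3.0)
```

With these thresholds a ratio of 0.8 fails the first condition, while a ratio of 1 passes it only if the values are also long enough.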
Example – Determining Mappables: Card < Gamma where Gamma = 10. For instance, in this domain it might be unlikely that a shipment has more than 10 orders. However, this might cause problems in other domains or for certain relationships (one customer definitely has more than 10 orders).
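One possible reading of the Gamma threshold is that an attribute counts as mappable only if none of its values occurs Gamma or more times. The sketch below implements that reading; the function name, the interpretation itself and the sample values are assumptions.

```python
from collections import Counter

GAMMA = 10  # upper threshold from the example


def is_mappable(values, gamma=GAMMA):
    """True if every value of the attribute occurs fewer than gamma times."""
    index = Counter(values)  # value -> number of occurrences
    return all(count < gamma for count in index.values())


# Made-up data: a shipment id referenced by a few orders, and a customer id
# referenced by twelve orders.
shipment_ids = ["S-1"] * 3 + ["S-2"] * 2
customer_ids = ["C-1"] * 12

shipment_ok = is_mappable(shipment_ids)
customer_ok = is_mappable(customer_ids)
```

The customer example shows the caveat mentioned above: a fixed Gamma can exclude relationships that are legitimate in other domains.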
After determining all the Indexable and Mappable Attributes of all types, the next step is to find candidate pairs of attributes that potentially correlate with each other.
To do so, a difference set A/B = {x | x ∈ A ∧ x ∉ B} is created for all permutations of attribute candidates A and B.
A/B must be below a certain threshold in order to be taken into account: |A/B| <= DiffThreshold
Candidate pairs of the permutations are excluded if their types, based on the previously determined InferencedType, do not match.
To reduce the complexity of the examples, in our scenario every attribute is considered a Mappable Attribute, as the total number of instances is lower than the threshold.
Example: DateTimes are excluded, as timestamps are of a type that is not suitable for correlation pairs. This also applies to booleans and description texts.
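The candidate search described above can be sketched as a loop over all ordered pairs of attributes. Because the example threshold is given as a fraction (0.95), this sketch normalizes the size of the difference set by the number of distinct values of A; that normalization, the function names, the data layout and the sample values are assumptions rather than details taken from the slides.

```python
from itertools import permutations

DIFF_THRESHOLD = 0.95
# Types excluded from correlation pairs, as stated in the text.
EXCLUDED_TYPES = {"Timestamp", "Boolean", "DescriptionText"}


def difference_ratio(a_values, b_values):
    """Share of distinct values of A that do not occur in B."""
    a, b = set(a_values), set(b_values)
    if not a:
        return 1.0
    return len(a - b) / len(a)


def candidate_pairs(attributes, threshold=DIFF_THRESHOLD):
    """attributes maps (event_type, attribute) -> (inferred_type, values)."""
    result = []
    for (key_a, (type_a, vals_a)), (key_b, (type_b, vals_b)) in \
            permutations(attributes.items(), 2):
        if key_a[0] == key_b[0]:
            continue  # a rule relates attributes of two different types
        if type_a != type_b or type_a in EXCLUDED_TYPES:
            continue  # mismatching or unsuitable inferred types
        ratio = difference_ratio(vals_a, vals_b)
        if ratio <= threshold:
            result.append((key_a, key_b, ratio))
    return result


# Made-up attribute values for two event types.
attributes = {
    ("OrderReceived", "OrderId"): ("Alphanumeric", ["O-1", "O-2", "O-3"]),
    ("ShipmentCreated", "OrderId"): ("Alphanumeric", ["O-1", "O-2", "O-3"]),
    ("ShipmentCreated", "ShipmentId"): ("Alphanumeric", ["S-1", "S-2"]),
    ("ShipmentCreated", "CreatedAt"): ("Timestamp", ["2012-01-01T00:00:00"]),
}
candidates = candidate_pairs(attributes)
```

Both directions of the OrderId pair survive in this sketch, since permutations are ordered.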
Example – DifferenceSet for all Permutations
A/B = {x | x ∈ A ∧ x ∉ B}
|A/B| <= DiffThreshold, with DiffThreshold = 0.95
OrderReceived.OrderId = ShipmentCreated.ShipmentId: SetDiff 100%
OrderReceived.OrderId = ShipmentCreated.OrderId: SetDiff 0%
OrderReceived.OrderId = TransportStarted.TransportId: SetDiff 100%
OrderReceived.OrderId = TransportStarted.ShipmentId: SetDiff 100%
…
Resulting candidates of Correlation Pairs with 100% overlapping:
OrderReceived.OrderId = ShipmentCreated.OrderId: SetDiff 0%
ShipmentCreated.ShipmentId = TransportStarted.ShipmentId: SetDiff 0%
ShipmentCreated.ShipmentId = TransportEnded.ShipmentId: SetDiff 0%
TransportStarted.TransportId = TransportEnded.TransportId: SetDiff 0%
TransportEnded.TransportId = TransportStarted.TransportId: SetDiff 0%
Example – DifferenceSet for all Permutations
A/B = {x | x ∈ A ∧ x ∉ B}
|A/B| <= DiffThreshold
OrderReceived.OrderId = ShipmentCreated.OrderId
ShipmentCreated.ShipmentId = TransportStarted.ShipmentId
ShipmentCreated.ShipmentId = TransportEnded.ShipmentId
TransportStarted.TransportId = TransportEnded.TransportId
TransportEnded.TransportId = TransportStarted.TransportId
A difference often occurs, especially when processes are not completed, have been prematurely terminated or aborted, or when events are not always generated because of decision forks. Bear in mind that this is a very simplified example!
Pairs that are associative are removed!
In this case every pair has the same type. In practice this is not the case! Pairs that are not of the same type are excluded from the permutation set, and thus their difference set is not calculated.
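The removal of redundant pairs can be sketched as follows, reading "associative" pairs as mirrored ones (A.x = B.y and B.y = A.x). That reading and the function name are assumptions; the pairs themselves are the resulting candidates from the example.

```python
def remove_mirrored_pairs(pairs):
    """Keep the first occurrence of each pair, ignoring the order of its sides."""
    seen = set()
    unique = []
    for left, right in pairs:
        key = frozenset((left, right))  # same key for (a, b) and (b, a)
        if key not in seen:
            seen.add(key)
            unique.append((left, right))
    return unique


pairs = [
    ("OrderReceived.OrderId", "ShipmentCreated.OrderId"),
    ("ShipmentCreated.ShipmentId", "TransportStarted.ShipmentId"),
    ("ShipmentCreated.ShipmentId", "TransportEnded.ShipmentId"),
    ("TransportStarted.TransportId", "TransportEnded.TransportId"),
    ("TransportEnded.TransportId", "TransportStarted.TransportId"),
]
unique_pairs = remove_mirrored_pairs(pairs)
```

Under this reading the last pair is dropped because it mirrors the fourth one.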
Precision: 99.56%
Precision = No.of.RelevantCorrelationRules / (No.of.RelevantCorrelationRules + FalsePositives) * 100
False Positive Example: correlation by “orderVolume”. The values always have a similar size and the attribute has a min. length.
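The precision formula translates directly into a small helper. The function name and the counts used below are made up for illustration; the slides do not state the counts behind the reported 99.56%.

```python
def precision_percent(relevant_rules, false_positives):
    """Precision in percent: relevant / (relevant + false positives) * 100."""
    total = relevant_rules + false_positives
    if total == 0:
        raise ValueError("no correlation rules were discovered")
    return relevant_rules / total * 100


# Hypothetical counts: 3 relevant rules and 1 false positive.
example_precision = precision_percent(relevant_rules=3, false_positives=1)
```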