15. Example - Index The attribute cardinality index (Index) is a map from each distinct value of the attribute to the number of times that value occurred.
16. Example - Card Card is the number of different values of the attribute (in the example: 4 unique values).
17. Example - Cnt Cnt is the total number of instances in which the attribute occurs (in the example: Cnt=5). Because the data structure does not work on a defined schema, the attribute need not occur in every instance. For certain attributes the number is therefore smaller than the number of instances, as the value can be null or missing.
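The three statistics Index, Card and Cnt can be derived in one pass over the instances. The following is a minimal sketch, not the authors' implementation: the sample instances, the dictionary representation and the function name `attribute_stats` are assumptions made for illustration. The data is chosen so that it reproduces the figures on the slides (4 unique values, Cnt=5).

```python
from collections import Counter

# Hypothetical instances of one event type. There is no fixed schema,
# so "OrderId" may be missing or null (None) in some instances.
instances = [
    {"OrderId": "A1"},
    {"OrderId": "A2"},
    {"OrderId": "A2"},
    {"OrderId": "A3"},
    {"OrderId": "A4"},
    {"Note": "instance without an OrderId"},
    {"OrderId": None},
]

def attribute_stats(instances, attribute):
    """Return Index, Card and Cnt for one attribute (illustrative helper)."""
    # Keep only instances in which the attribute is present and not null.
    values = [inst[attribute] for inst in instances
              if inst.get(attribute) is not None]
    index = Counter(values)          # Index: value -> number of occurrences
    return {
        "Index": dict(index),
        "Card": len(index),          # Card: number of distinct values
        "Cnt": len(values),          # Cnt: instances in which the attribute occurs
    }

stats = attribute_stats(instances, "OrderId")
```

For this sample, `stats["Card"]` is 4 and `stats["Cnt"]` is 5, although there are 7 instances in total.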
18. Example - AvgAttributeLength AvgAttributeLength is the average length of the values of the current attribute. It is an indicator of the potential uniqueness of a value: a long value may be a sign that the attribute is a unique identifier. A unique identifier such as OrderId is likely to occur in other types as well and thus to form a correlation. The indicator can also be misleading, since a textual description may be very long and in fact unique, yet it is never used for correlating artefacts.
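As a rough illustration, the average can be computed over the string form of all non-null values. This is a sketch under assumptions: the slides do not say how non-string values or nulls are handled, and the function name `avg_attribute_length` is hypothetical.

```python
def avg_attribute_length(values):
    """Average string length of the non-null values (illustrative helper)."""
    # Assumption: null values are skipped and every value is measured
    # by the length of its string representation.
    present = [str(v) for v in values if v is not None]
    if not present:
        return 0.0
    return sum(len(v) for v in present) / len(present)

# Hypothetical sample: two values of length 2 and 3, one null.
example_avg = avg_attribute_length(["A1", "A22", None])
```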
19. Example - InferencedType InferencedType is the data type determined for an attribute. The type of an attribute is an important characteristic for correlation discovery because it reduces the problem space of correlation candidates. The chance that an attribute correlates with another attribute of a different type, for instance one that contains mostly alphanumeric values, is very low. The type is determined with a fault tolerance of 0.9 (e.g. at least 90% of the values must be numeric); we refer to this parameter as Phi.
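The fault-tolerant type determination can be sketched as follows. The slides only state the 90% rule for numeric values, so everything else here is an assumption: the type labels `"numeric"`, `"alphanumeric"` and `"unknown"`, the use of `float()` as the numeric test, and the function names.

```python
def _is_numeric(value):
    """True if the value parses as a number (simplified test via float())."""
    try:
        float(value)
        return True
    except (TypeError, ValueError):
        return False

def infer_type(values, phi=0.9):
    """Infer an attribute type with fault tolerance phi (illustrative helper)."""
    present = [v for v in values if v is not None]
    if not present:
        return "unknown"
    # Share of values that look numeric; at least phi of them must match.
    numeric_share = sum(_is_numeric(v) for v in present) / len(present)
    return "numeric" if numeric_share >= phi else "alphanumeric"

# Hypothetical samples: 9 of 10 numeric values versus 8 of 10.
mostly_numeric = infer_type(["1", "2", "3", "4", "5", "6", "7", "8", "9", "n/a"])
too_noisy = infer_type(["1", "2", "3", "4", "5", "6", "7", "8", "x", "y"])
```

With Phi = 0.9 the first sample is still typed as numeric despite one faulty value, while the second falls back to alphanumeric.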
26. Example – Determining Mappables The mappable attributes can be seen as a means to reduce the search space of potentially correlating attributes of a type. One approach is to set an upper threshold on how often a value of an attribute can occur. The assumption is that if a value occurs more than Gamma times, the attribute is unlikely to be a correlation candidate. With x the cardinality of a value and i an attribute of a type, the mappable attributes are { x_i | x < Gamma }, i.e. Card < Gamma, where Gamma = 10. For instance, in this domain it might be unlikely that a shipment has more than 10 orders. However, this might cause problems in other domains or for certain relationships (one customer can easily have more than 10 orders).
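One possible reading of the Gamma rule is that an attribute is mappable when no single value occurs Gamma times or more. The sketch below implements that reading; it is an interpretation of the slide, and the function name `is_mappable` and the sample values are hypothetical.

```python
from collections import Counter

def is_mappable(values, gamma=10):
    """True if every value occurs fewer than gamma times (illustrative helper)."""
    index = Counter(v for v in values if v is not None)
    # Assumption: an attribute without any values is not a candidate.
    return bool(index) and all(count < gamma for count in index.values())

# Hypothetical samples: shipment ids repeat a few times,
# one customer id repeats 12 times.
shipment_ids = ["S1"] * 3 + ["S2"] * 2
customer_ids = ["C1"] * 12 + ["C2"]

shipment_mappable = is_mappable(shipment_ids)   # every count is below 10
customer_mappable = is_mappable(customer_ids)   # "C1" occurs 12 times
```

The second sample shows the limitation named on the slide: a customer with more than 10 orders is excluded unless Gamma is raised for that domain.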
27.
28.
29. Example – DifferenceSet for all Permutations
Candidate pairs and their SetDiff:
OrderReceived.OrderId = ShipmentCreated.ShipmentId: 100%
OrderReceived.OrderId = ShipmentCreated.OrderId: 0%
OrderReceived.OrderId = TransportStarted.TransportId: 100%
OrderReceived.OrderId = TransportStarted.ShipmentId: 100%
…
A \ B = {x | x ∈ A ∧ x ∉ B}, |A \ B| <= DiffThreshold, DiffThreshold = 0.95
Resulting candidates of correlation pairs with 100% overlap and their SetDiff:
OrderReceived.OrderId = ShipmentCreated.OrderId: 0%
ShipmentCreated.ShipmentId = TransportStarted.ShipmentId: 0%
ShipmentCreated.ShipmentId = TransportEnded.ShipmentId: 0%
TransportStarted.TransportId = TransportEnded.TransportId: 0%
TransportEnded.TransportId = TransportStarted.TransportId: 0%
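The difference-set filter can be sketched with Python sets. Several details are assumptions, since the slide does not spell them out: SetDiff is taken as |A \ B| relative to |A|, pairs within the same type are skipped, and the attribute values, the `Type.Attribute` naming and the function names are hypothetical.

```python
from itertools import permutations

def set_diff_ratio(a_values, b_values):
    """Share of distinct values of A that do not occur in B."""
    a, b = set(a_values), set(b_values)
    if not a:
        return 1.0  # assumption: an empty attribute never matches
    return len(a - b) / len(a)

def correlation_candidates(attributes, diff_threshold=0.95):
    """Ordered pairs (A, B) across types whose SetDiff is within the threshold."""
    candidates = []
    for (name_a, vals_a), (name_b, vals_b) in permutations(attributes.items(), 2):
        # Assumption: only attributes of different types are compared.
        if name_a.split(".")[0] == name_b.split(".")[0]:
            continue
        ratio = set_diff_ratio(vals_a, vals_b)
        if ratio <= diff_threshold:
            candidates.append((name_a, name_b, ratio))
    return candidates

# Hypothetical attribute values keyed by "Type.Attribute".
attributes = {
    "OrderReceived.OrderId": ["O1", "O2", "O3"],
    "ShipmentCreated.OrderId": ["O1", "O2", "O3"],
    "ShipmentCreated.ShipmentId": ["S1", "S2"],
    "TransportStarted.ShipmentId": ["S1", "S2"],
}

pairs = {(a, b) for a, b, _ in correlation_candidates(attributes)}
```

Because the set difference is not symmetric, each pair is evaluated in both directions, which is why a pair can appear twice with its sides swapped in the resulting candidates.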