IID assumption (Taskar et al., 2002): test instances are related to each other and their labels are not independent (Lu & Getoor, 2003; Jensen, 1999). Traditional statistical inference assumes that instances are independent, which can lead to inappropriate conclusions (Lu & Getoor, 2003): "traditional data mining tasks such as association rule mining, market basket analysis, and cluster analysis commonly attempt to find patterns in a dataset characterized by a collection of independent instances of a single relation. This is consistent with the classical statistical inference problem of trying to identify a model given a random sample from a common underlying distribution." Latent semantics are also important for Information Retrieval (IR) and Text Mining applications. For example, in LSI, the latent aspects of term similarity that LSI reveals depend on the higher-order paths between terms.
Explicit links (e.g., hyperlinks between web pages or citation links between scientific papers): in the collective classification phase, the model is applied to a separate network (a set of unlabeled test instances with links). Collective inference means making inferences about multiple data instances simultaneously; it can significantly reduce classification error (Jensen et al., 2004). The basic idea in these iterative algorithms is to start with a labeling of reasonable quality (from a content-only classifier) and refine it using a coupled distribution of content and the labels of neighbors.
Several simple link attributes are constructed from statistics computed over the categories of the different sets of linked objects: mode-link, a single attribute computed from the in-links, out-links, and co-citation links; count-link, essentially the frequency of the classes of the linked instances; and binary-link, a simple binary feature vector in which, for each class label, the corresponding feature is 1 if a link to an instance of that class occurs at least once. To determine a label (dependent variable) c in {-1,+1} given an input vector (explanatory variable) x, model P(c=1|w,x) and find the optimal w for the discriminative function. Datasets: CORA, 4187 machine learning papers, 7 classes, dictionary of 1400 words after stemming and stopword removal; WebKB, web pages from four computer science departments, 4 topics plus "other" (5 classes in total; 700 pages when "other" is excluded).
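The three link encodings above are straightforward to compute from the class labels of an instance's linked neighbors. A minimal sketch (function and variable names are my own, not from the cited work):

```python
from collections import Counter

def link_features(neighbor_labels, classes):
    """Compute the three link-attribute encodings from neighbor class labels."""
    counts = Counter(neighbor_labels)
    # count-link: frequency of each class among the linked instances
    count_link = [counts.get(c, 0) for c in classes]
    # binary-link: 1 if at least one link to that class occurs
    binary_link = [1 if counts.get(c, 0) > 0 else 0 for c in classes]
    # mode-link: the single most frequent class among the links
    mode_link = counts.most_common(1)[0][0] if counts else None
    return mode_link, count_link, binary_link

# toy example: a paper linked to neighbors labeled NN, NN, RL
mode, cnt, bin_ = link_features(["NN", "NN", "RL"], ["NN", "RL", "GA"])
```

In the link-based classification setting, these vectors would then be appended to (or combined with) the content features before fitting w.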
(Edmond, 1997) The application described selects the most appropriate term when a context (such as a sentence) is provided. (Schütze, 1998) Uses second-order co-occurrence of the terms in the training set to create context vectors that represent a specific sense of a word to be discriminated. (Xu & Croft, 1998) A strong correlation between terms A and B, and also between terms B and C, will result in the placement of terms A, B, and C in the same equivalence class; the result is a transitive semantic relationship between A and C. Orders of co-occurrence higher than two are also possible in this application. Latent Semantic Indexing (LSI), a well-known approach to information retrieval, implicitly depends on higher-order co-occurrences; previous work demonstrated empirically that higher-order co-occurrences play a key role in the effectiveness of systems based on LSI. (Zhang, 2000) A related effort used second-order co-occurrences to improve the runtime performance of LSI. Literature-based discovery (LBD) employs second-order co-occurrence to discover connections between concepts (entities). A well-known example is the discovery of a novel migraine–magnesium connection in the medical domain: the researchers found that in the Medline database some terms co-occur frequently with "migraine" in article titles, e.g. "stress" and "calcium channel blockers." They also discovered that "stress" co-occurs frequently with "magnesium" in other titles. As a result, they hypothesized a link between "migraine" and "magnesium," and clinical evidence has since been obtained that supports this hypothesis.
Preliminary conclusion: frequency distributions of higher-order itemsets capture distinguishing characteristics of the classes in supervised machine learning datasets.
Cybersecurity: abnormal BGP events often affect the global routing infrastructure. For example, in January 2003, the Slammer worm caused a surge of BGP updates. Since BGP anomaly events often cause major disruptions in the Internet, the ability to detect and categorize BGP events is extremely useful. Our aim is to distinguish whether Border Gateway Protocol (BGP) traffic is caused by an anomalous event such as a power failure, a worm attack, or a node/link failure. This differs from the mushroom dataset in that the attributes are integer valued.
For the Slammer worm and Blackout events, the t-test probability starts increasing as the sliding window approaches the 25th window. When the number of abnormal event instances inside the current window exceeds a certain threshold (around the 21st–23rd window), we observe a sharp increase. After the 25th window, the probability stays above 5%, revealing that we are in the event period and have detected and distinguished both the Slammer and Blackout events using their respective event models. We detect and distinguish these events in 360 seconds or less. Results are similar for the Witty worm event in figure 6, although detection takes slightly longer.
This figure depicts three documents, D1, D2 and D3, each containing two terms, or entities, represented by the letters A, B, C and D. Below, the three documents form a higher-order path that links entity A with entity D through B and C. This is a third-order path, since three links, or "hops," connect A and D. D1, D2 and D3 are not always documents: they might be records in a database or instances in a labeled training dataset. Likewise, the entities A, B, C, etc. need not be terms; they may be values in a database record, or items (attribute–value pairs) in an instance. In fact, co-occurrence relations can be extracted wherever there is a meaningful context of entities.
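The figure's third-order path can be reproduced in a small sketch: build a co-occurrence graph from toy contexts matching the figure (D1 = {A, B}, D2 = {B, C}, D3 = {C, D}) and search for a path from A to D. All names here are illustrative:

```python
from itertools import combinations

# Toy contexts mirroring the figure.
docs = {"D1": {"A", "B"}, "D2": {"B", "C"}, "D3": {"C", "D"}}

# Co-occurrence graph: an edge joins two entities sharing a context,
# labeled with the document(s) supplying that co-occurrence.
edges = {}
for doc, entities in docs.items():
    for a, b in combinations(sorted(entities), 2):
        edges.setdefault(frozenset((a, b)), set()).add(doc)

def find_path(start, end):
    """Breadth-first search for a co-occurrence path; returns the entity chain."""
    frontier, seen = [[start]], {start}
    while frontier:
        path = frontier.pop(0)
        if path[-1] == end:
            return path
        for e in edges:
            if path[-1] in e:
                (nxt,) = e - {path[-1]}
                if nxt not in seen:
                    seen.add(nxt)
                    frontier.append(path + [nxt])
    return None

path = find_path("A", "D")   # A-B-C-D: three hops, i.e. a third-order path
order = len(path) - 1
```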
Transcript
1.
Higher Order Learning William M. Pottenger, Ph.D. Rutgers University ARO Workshop
Data mining tasks such as association rule mining, cluster analysis, classification aim to find patterns/form a model from a collection of instances.
Traditionally instances are assumed to be independent and identically distributed (IID).
In classification, a model is applied to a single instance and the decision is based on the feature vector of this instance in a “context-free” manner, independent of the other instances in the test set.
This context-free approach does not exploit the available information about relationships between instances in the dataset (Angelova & Weikum, 2006)
Collective classification / Link based classification
Link prediction
Link based clustering
Social network modeling
Object identification
Bibliometrics
…
Ref: P. Domingos and M. Richardson, Markov Logic: A Unifying Framework for Statistical Relational Learning . Proceedings of the ICML-2004 Workshop on Statistical Relational Learning and its Connections to Other Fields (pp. 49-54), 2004. Banff, Canada: IMLS.
Need for reasoning from evidence, even in the face of information that may be incomplete, inexact, inaccurate, or from diverse sources
Evidence is provided by sets of diverse, distributed, and noisy sensors and information.
Build a quantitative theoretical framework for reasoning by abduction in the face of real-world uncertainties.
Reasoning by leveraging higher order relations…
8.
Gathering Evidence [diagram of co-occurring term pairs: stress–migraine, CCB–magnesium, PA–magnesium, SCD–magnesium] Slide reused with permission of Marti Hearst @ UCB
9.
A Higher Order Co-Occurrence Relation! [diagram: migraine linked to magnesium through stress, CCB, PA and SCD] No single author knew/wrote about this connection… this distinguishes Text Mining from Information Retrieval. Slide reused with permission of Marti Hearst @ UCB
Yes! The answer is in the following theorem, which we proved: if the ij-th element of the truncated term-by-term matrix Y is non-zero, then there exists a co-occurrence path of order ≥ 1 between terms i and j.
Kontostathis, A. and Pottenger, W. M. (2006) A Framework for Understanding LSI Performance. Information Processing & Management.
We have both proven mathematically and demonstrated empirically that LSI is based on the use of higher order co-occurrence relations.
Next step? Extend the theoretical foundation by studying characteristics of higher-order relations in other machine learning datasets/algorithms such as association rule mining, supervised learning, etc.
Start by analyzing higher-order relations in labeled training data used in supervised machine learning
Is there a theoretical basis for the use of higher order co-occurrence relations in LSI?
13.
What role do higher-order relations play in supervised machine learning?
Goal: discover patterns in higher-order paths useful in separating the classes
Co-occurrence relations in a record or instance set can be represented as an undirected graph G = (V, E)
V : a finite set of vertices (e.g., entities in a record)
E : the set of edges representing co-occurrence relations (edges are labeled with the record(s) in which the entities co-occur)
Path definition from graph theory: two vertices x_0 and x_k are linked by a path P = x_0 x_1 … x_k in which the vertices x_i are distinct; the number of edges in P is its length.
Higher-order path: not only must the vertices (entities) be distinct, but the edges (records) must be distinct as well.
An example of a fourth-order path between e1 and e5, as well as several shorter paths
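A minimal check of this definition, assuming each hop of a candidate path is annotated with the record that supplies it (function and variable names are my own):

```python
def is_higher_order_path(vertices, edge_records):
    """Higher-order path condition from the slide: the vertices (entities)
    must be pairwise distinct AND the records labeling the consecutive
    edges must be pairwise distinct. `edge_records` holds one record per
    hop, so len(edge_records) == len(vertices) - 1."""
    if len(edge_records) != len(vertices) - 1:
        return False
    distinct_vertices = len(set(vertices)) == len(vertices)
    distinct_records = len(set(edge_records)) == len(edge_records)
    return distinct_vertices and distinct_records

# e1-e2-e3-e4-e5 via records r1..r4: a valid fourth-order path
ok = is_higher_order_path(["e1", "e2", "e3", "e4", "e5"],
                          ["r1", "r2", "r3", "r4"])
# reusing record r1 on two hops violates the edge-distinctness condition
bad = is_higher_order_path(["e1", "e2", "e3"], ["r1", "r1"])
```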
14.
What role do higher-order relations play in supervised machine learning?
Path Group: a path (length ≥ 2) is extracted per the definition of a path from graph theory. In the example, a 2nd-order path group comprises two sets of records: S1 = {1, 2, 5} and S2 = {1, 2, 3, 4}. A path group may be composed of several higher-order paths.
A bipartite graph G = (V1 ∪ V2, E) is formed, where V1 is the collection of record sets and V2 is the set of records. Enumerating all maximum matchings in this graph yields all higher-order paths in the path group. Another approach is to discover the systems of distinct representatives (SDRs) of these sets.
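For small path groups, the SDR approach can be sketched by brute force: pick one record from each set so that all picks are pairwise distinct. This is an illustrative enumeration, not the authors' algorithm:

```python
from itertools import product

def distinct_representatives(record_sets):
    """Enumerate all systems of distinct representatives (SDRs): one
    record per set, all chosen records pairwise distinct. Each SDR
    supplies the distinct edge records of one higher-order path."""
    sdrs = []
    for choice in product(*record_sets):
        if len(set(choice)) == len(choice):
            sdrs.append(choice)
    return sdrs

# the 2nd-order path group from the slide: S1={1,2,5}, S2={1,2,3,4}
sdrs = distinct_representatives([{1, 2, 5}, {1, 2, 3, 4}])
```

Of the 12 raw pairs, (1,1) and (2,2) repeat a record, leaving 10 valid SDRs, i.e. 10 higher-order paths in this group.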
Approach: Discover frequent itemsets in higher-order paths
For labeled datasets, divide instances by class and enumerate k-itemsets (initially for k in {3,4})
Results in a distribution of k-itemset frequencies for a given class
Compare the distributions using a simple statistical measure such as the t-test to determine whether they differ significantly
If two distributions are statistically significantly different, we conclude that the higher-order path patterns (i.e., itemset frequencies) distinguish the classes
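A rough sketch of the comparison step, using Welch's t statistic on two hypothetical k-itemset frequency distributions. The data are invented, and the |t| > 2 cutoff is only a stand-in for a proper p-value computed from the t-distribution:

```python
from statistics import mean, variance

def welch_t(a, b):
    """Welch's t statistic for two independent samples of
    itemset frequencies (unequal variances allowed)."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    return (mean(a) - mean(b)) / se

# hypothetical 3-itemset frequency distributions for two classes
class_a = [12, 15, 11, 14, 13, 16, 12, 15]
class_b = [4, 6, 5, 7, 5, 6, 4, 5]

t = welch_t(class_a, class_b)
significantly_different = abs(t) > 2   # rough cutoff in place of p < 0.05
```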
Labeled training data analyzed
Mushroom dataset: decision-tree learners perform well on it
Border gateway protocol updates: relevant to cybersecurity
What role do higher-order relations play in supervised machine learning?
240 instances to characterize a particular abnormal event
Sliding window approach for detection
Window size: 120 instances (360 seconds)
Sliding 10 instances (sampling every 30 seconds)
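The sliding-window scheme can be sketched directly from the numbers on this slide (a window of 120 instances, i.e. 360 s of updates, advanced 10 instances, i.e. 30 s, at a time):

```python
def sliding_windows(instances, size=120, step=10):
    """Yield the detection windows described on the slide: 120-instance
    windows (360 s of BGP updates) advanced 10 instances (30 s) per step."""
    for start in range(0, len(instances) - size + 1, step):
        yield instances[start:start + size]

# 240 instances characterize one abnormal event; count the windows over them
windows = list(sliding_windows(list(range(240))))
```

Each window would then be scored (e.g., by the t-test above) against the event model before the window slides forward.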
Ganiz, M., Pottenger, W.M., Kanitkar, S., Chuah, M.C. (2006b). Detection of Interdomain Routing Anomalies Based on Higher-Order Path Analysis. Proceedings of the Sixth IEEE International Conference on Data Mining (ICDM'06), December 2006, Hong Kong, China.
t-test results (Event 1 vs. Event 2):
Slammer vs. Witty: 0.00023
Blackout vs. Witty: 0.00016
Slammer vs. Blackout: 0.018
21.
Preliminary Results – Naïve Bayes on Higher-order Paths
Cora (McCallum et al., 2000)
Scientific paper dataset
Several classes: case based, neural networks, etc.
2708 documents, 1433 terms, 5429 links
Terms are ordered most sparse first
Instead of links, we used higher order paths in a Naïve Bayes framework
E.g., when 2nd-order paths are used, the F-measure (β = 1) is higher starting from dictionary size 400
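A minimal multinomial Naive Bayes over count vectors illustrates the framework; in the setting above, the counts would be higher-order path counts between a document and each dictionary term rather than raw term or link counts. The data and class names below are toy values, not Cora results:

```python
from math import log

def train_nb(count_vectors, labels):
    """Multinomial Naive Bayes with Laplace smoothing over
    per-feature count vectors (here standing in for path counts)."""
    priors, likelihoods = {}, {}
    n_features = len(count_vectors[0])
    for c in set(labels):
        rows = [v for v, y in zip(count_vectors, labels) if y == c]
        priors[c] = log(len(rows) / len(labels))
        totals = [sum(col) for col in zip(*rows)]
        denom = sum(totals) + n_features          # Laplace smoothing
        likelihoods[c] = [log((t + 1) / denom) for t in totals]
    return priors, likelihoods

def predict(vec, priors, likelihoods):
    """Pick the class maximizing log prior + sum of count-weighted log likelihoods."""
    return max(priors, key=lambda c: priors[c] +
               sum(n * likelihoods[c][i] for i, n in enumerate(vec)))

# toy data: two classes with different path-count profiles
X = [[5, 0, 1], [4, 1, 0], [0, 5, 2], [1, 4, 3]]
y = ["NN", "NN", "CB", "CB"]
priors, lik = train_nb(X, y)
pred = predict([3, 0, 1], priors, lik)
```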
22.
What role do higher-order relations play in unsupervised machine learning?
Next step? Consider unsupervised learning…
Association Rule Mining (ARM)
ARM is one of the most widely used algorithms in data mining
Extend ARM to higher order… Higher Order Apriori
Experiments confirm the value of Higher Order Apriori on real world e-marketplace data
Given j instances of an nth-order k-recordset rs, its size is defined as:
Since the same k-itemset can be generated at different orders, the global support for a given k-itemset must include the local support at each order u. So we have the following formula:
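The formula itself did not survive in this transcript. A plausible reconstruction of the preceding sentence, with symbols of my own choosing, sums the local supports over the orders at which the itemset is generated:

```latex
\operatorname{supp}(I) \;=\; \sum_{u=1}^{n} \operatorname{supp}_{u}(I)
```

where supp_u(I) denotes the local support of k-itemset I at order u.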
Our algorithm was tested on real-world e-commerce data from the KDD Cup 2000 competition. There are 530 transactions involving 244 products in the dataset.
We compared the itemsets generated by Higher Order Apriori with two other algorithms:
Apriori (1st order)
Indirect (using our algorithm limited to 2nd order)
Higher Order Apriori limited to 6th order
We conducted experiments on multiple systems, including at the National Center for Supercomputing Applications (NCSA)
This relationship is also discovered by Apriori and Indirect, but Higher Order Apriori discovered a new nugget, which provides extra evidence for this relationship
Higher Order Apriori discovers itemsets that demonstrate novel relationships not discovered by lower order methods.
For example, the following are reasonable relationships. While Apriori and Indirect failed to discover itemsets representing such relationships in the SIGKDD dataset, they might discover such links given a larger dataset.
Shaver – Lotion/Cream:
Apriori: No
Indirect: No
Higher Order Apriori: (Pedicure Care Kit, Toning Lotion), (wet/dry shaver, Herb Lotion), (Pedicure Care Kit, Leg Cream)
Foot cream – women's socks:
Apriori: No
Indirect: No
Higher Order Apriori: (foot cream, Women's Ultra Sheer Knee High), (foot cream, women's cotton dog sock)