Higher Order Learning
  • IID (Taskar et al., 2002): test instances are related to each other, so their labels are not independent. (Lu & Getoor, 2003; Jensen, 1999): traditional statistical inference assumes that instances are independent, which can lead to inappropriate conclusions. (Lu & Getoor, 2003): “traditional data mining tasks such as association rule mining, market basket analysis, and cluster analysis commonly attempt to find patterns in a dataset characterized by a collection of independent instances of a single relation. This is consistent with the classical statistical inference problem of trying to identify a model given a random sample from a common underlying distribution.” Latent semantics are important for Information Retrieval (IR) and text mining applications. For example, the latent aspects of term similarity that LSI reveals depend on the higher-order paths between terms.
  • Explicit links: e.g., hyperlinks between web pages or citation links between scientific papers. In the collective classification phase, the model is applied to a separate network (a set of unlabeled test instances with links). Collective inference means making inferences about multiple data instances simultaneously; it can significantly reduce classification error (Jensen et al., 2004). The basic idea in these iterative algorithms is to start with a labeling of reasonable quality (from a content-only classifier) and refine it using a coupled distribution of content and labels of neighbors.
  • Several simple link attributes are constructed from statistics computed over the categories of the different sets of linked objects. mode-link: a single attribute computed from the in-links, out-links, and co-citation links. count-link: the frequency of the classes of linked instances. binary-link: a simple binary feature vector; for each class label, the corresponding feature is 1 if a link to an instance of that class occurs at least once. To determine a label (dependent variable) c in {-1,+1} given an input vector (explanatory variable) x, P(c=1|w,x), find the optimal w for the discriminative function. Cora: 4187 machine learning papers, 7 classes, dictionary of 1400 words after stemming and stopword removal. WebKB: web pages from four computer science departments; 4 topics plus an “others” category (5 in total); 700 pages without “others”.
  • (Edmonds, 1997): the application described selects the most appropriate term when a context (such as a sentence) is provided. (Schütze, 1998): uses second-order co-occurrence of the terms in the training set to create context vectors that represent a specific sense of a word to be discriminated. (Xu & Croft, 1998): a strong correlation between terms A and B, and also between terms B and C, will result in the placement of terms A, B, and C into the same equivalence class. The result is a transitive semantic relationship between A and C. Orders of co-occurrence higher than two are also possible in this application. (LSI): a well-known approach to information retrieval that implicitly depends on higher-order co-occurrences. Previous work demonstrated empirically that higher-order co-occurrences play a key role in the effectiveness of systems based on LSI. (Zhang, 2000): a related effort used second-order co-occurrences to improve the runtime performance of LSI. (LBD): employs second-order co-occurrence to discover connections between concepts (entities). A well-known example is the discovery of a novel migraine-magnesium connection in the medical domain. The researchers found that in the Medline database some terms co-occur frequently with “migraine” in article titles, e.g. “stress” and “calcium channel blockers.” They also discovered that “stress” co-occurs frequently with “magnesium” in other titles. As a result, they hypothesized a link between “migraine” and “magnesium,” and some clinical evidence has been obtained that supports this hypothesis.
  • preliminary conclusion: frequency distributions of higher-order itemsets capture distinguishing characteristics of the classes in supervised machine learning datasets
  • Cybersecurity: abnormal BGP events often affect the global routing infrastructure. For example, in January 2003, the Slammer worm caused a surge of BGP updates. Since BGP anomaly events often cause major disruptions on the Internet, the ability to detect and categorize BGP events is extremely useful. Our aim is to distinguish whether Border Gateway Protocol (BGP) traffic is caused by an anomalous event such as a power failure, a worm attack or a node/link failure. This differs from the mushroom dataset because the attributes are integer valued.
  • For the Slammer worm and Blackout events, the t-test probability starts increasing as the sliding window approaches the 25th window. When the number of abnormal event instances inside the current window exceeds a certain threshold (around the 21st – 23rd window), we observe a sharp increase. After the 25th window, the probability stays above 5%, revealing that we are in the event period and have detected and distinguished both the Slammer and Blackout events using their respective event models. We detect and distinguish these events in 360 seconds or less. Results are similar for the Witty worm event in figure 6, although the detection takes slightly longer.
  • This figure depicts three documents, D1, D2 and D3, each containing two terms, or entities, represented by the letters A, B, C and D. Below, the three documents form a higher-order path that links entity A with entity D through B and C. This is a third-order path since three links, or “hops,” connect A and D. D1, D2 and D3 are not always documents; they might be records in a database or instances in a labeled training dataset. Likewise, the entities A, B, C etc. need not be terms; they may be values in a database record, or items (attribute-value pairs) in an instance. In fact, we can extract co-occurrence relations as long as there is a meaningful context of entities.
  • Transcript

    • 1. Higher Order Learning. William M. Pottenger, Ph.D., Rutgers University. ARO Workshop
    • 2.
      • Introduction
      • Overview
        • IID Assumption in Machine Learning
        • Statistical Relational Learning (SRL)
        • Higher-order Co-occurrence Relations
      • Approach
        • Supervised Higher Order Learning
        • Unsupervised Higher Order Learning
      • Conclusion
    • 3. IID Assumption in Machine Learning
      • Data mining tasks such as association rule mining, cluster analysis, classification aim to find patterns/form a model from a collection of instances.
      • Traditionally instances are assumed to be independent and identically distributed (IID).
        • In classification, a model is applied to a single instance and the decision is based on the feature vector of this instance in a “context-free” manner, independent of the other instances in the test set.
      • This context-free approach does not exploit the available information about relationships between instances in the dataset (Angelova & Weikum, 2006)
    • 4. Statistical Relational Learning (SRL)
      • Underlying assumption
        • Linked instances are often correlated
      • SRL operates on relational data with explicit links between instances
        • Explicitly leverages correlations between related instances
      • Collective Inference / Classification
        • Simultaneously label all test instances together
          • Exploit the correlations between class labels of related instances
        • Learn from one network (a set of labeled training instances with links)
        • Apply the model to a separate network (a set of unlabeled test instances with links)
      • Iterative algorithms
        • First assign initial class labels (content-only traditional classifier)
        • Adjust class label using the class labels of linked instances
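The two iterative steps above can be sketched in a few lines. This is only an illustration, not the algorithm of any paper cited in the deck: a plain majority vote over linked neighbours stands in for the coupled content/label distribution, and the function name, toy graph and labels are all hypothetical.

```python
from collections import Counter

def iterative_classify(initial_labels, links, known, max_iters=10):
    """Refine labels by majority vote over linked neighbours.

    initial_labels: node -> label from a content-only classifier
    links:          node -> list of neighbouring nodes
    known:          nodes whose labels are fixed (e.g. training instances)
    """
    labels = dict(initial_labels)
    for _ in range(max_iters):
        updated = dict(labels)
        for node, neighbours in links.items():
            if node in known or not neighbours:
                continue
            votes = Counter(labels[n] for n in neighbours)
            top = max(votes.values())
            # Keep the current label on ties; otherwise adopt the majority label.
            if votes[labels[node]] < top:
                updated[node] = min(lab for lab, v in votes.items() if v == top)
        if updated == labels:  # converged
            break
        labels = updated
    return labels

# Hypothetical toy network: "a" and "b" are labeled, "c" and "d" start
# with (possibly wrong) content-only predictions.
links = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"], "d": ["c"]}
initial = {"a": "ML", "b": "ML", "c": "DB", "d": "DB"}
result = iterative_classify(initial, links, known={"a", "b"})
```

Labels are updated synchronously in each pass, so a change at one node only influences its neighbours in the next pass.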
    • 5. Statistical Relational Learning (SRL)
      • Several tasks
        • Collective classification / Link based classification
        • Link prediction
        • Link based clustering
        • Social network modeling
        • Object identification
        • Bibliometrics
        • Ref: P. Domingos and M. Richardson, Markov Logic: A Unifying Framework for Statistical Relational Learning . Proceedings of the ICML-2004 Workshop on Statistical Relational Learning and its Connections to Other Fields (pp. 49-54), 2004. Banff, Canada: IMLS.
    • 6. Some Related Work in SRL
      • Relational Markov Networks (Taskar et al., 2002)
        • Extend Markov networks for relational data
        • Discriminatively train undirected graphical model - for every link between two pages, there is an edge between labels of these pages
        • Significant improvement over the flat model (logistic regression)
      • Link-based Classification (Lu & Getoor, 2003)
        • Structured logistic regression
        • Iterative classification algorithm
        • Outperforms content-only classifier on WebKB, Cora, CiteSeer datasets
      • Relational Dependency Networks (Neville & Jensen, 2004)
        • Extend Dependency Networks (DNs) for relational data
        • Experiments on IMDb, Cora, WebKB, Gene datasets
        • Results: RDN model is superior to IID classifier
      • Graph-based Text Classification (Angelova & Weikum, 2006)
        • Has graph in which nodes are instances and edges are the relationships between instances in the dataset
        • Increase in performance on DBLP, IMDb, Wikipedia datasets
        • Interesting observation: gains are most prominent for small training sets
    • 7. Reasoning by Abductive Inference
      • Need for reasoning from evidence, even in the face of information that may be incomplete, inexact, inaccurate, or from diverse sources
      • Evidence is provided by sets of diverse, distributed, and noisy sensors and information.
      • Build a quantitative theoretical framework for reasoning by abduction in the face of real-world uncertainties.
      • Reasoning by leveraging higher order relations…
    • 8. Gathering Evidence [Figure: the terms stress, CCB, PA and SCD shown together with migraine and magnesium] Slide reused with permission of Marti Hearst @ UCB
    • 9. A Higher Order Co-Occurrence Relation! [Figure: migraine and magnesium connected through stress, CCB, PA and SCD] No single author knew/wrote about this connection… this distinguishes Text Mining from Information Retrieval. Slide reused with permission of Marti Hearst @ UCB
    • 10. Uses of Higher-order Co-occurrence Relations
      • Higher-order co-occurrences play a key role in the effectiveness of systems used for information retrieval and text mining
      • Literature Based Discovery (LBD) (Swanson, 1988)
        • Migraine↔(stress, calcium channel blockers)↔ Magnesium
      • Improve the runtime performance of LSI (Zhang et al., 2000)
        • Explicitly use 2nd-order co-occurrence to reduce MT×D
      • Word sense disambiguation (Schütze, 1998)
        • Similarity in word space is based on 2nd-order co-occurrence
      • Identifying synonyms in a given context (Edmonds, 1997)
        • Precision of system using 3rd order > 2nd order > 1st order
      • Stemming algorithm (Xu & Croft, 1998)
        • Implicitly uses higher orders of co-occurrence
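As a minimal sketch of the second-order co-occurrence idea behind the migraine/magnesium example, the snippet below finds intermediate terms B that co-occur with A in one context and with C in another, when A and C never co-occur directly. The function name and the toy "titles" are hypothetical; they are not Medline data.

```python
from collections import defaultdict
from itertools import combinations

def second_order_links(contexts, a, c):
    """Return terms b such that (a, b) and (b, c) each co-occur in some
    context, while a and c never co-occur directly."""
    where = defaultdict(set)  # sorted term pair -> indices of contexts holding both
    for idx, terms in enumerate(contexts):
        for x, y in combinations(sorted(set(terms)), 2):
            where[(x, y)].add(idx)

    def shared(x, y):
        return where.get(tuple(sorted((x, y))), set())

    if shared(a, c):  # a first-order link already exists
        return set()
    vocab = {t for terms in contexts for t in terms} - {a, c}
    return {b for b in vocab if shared(a, b) and shared(b, c)}

# Hypothetical article titles reduced to sets of terms.
titles = [
    {"migraine", "stress"},
    {"stress", "magnesium"},
    {"migraine", "calcium channel blockers"},
    {"calcium channel blockers", "magnesium"},
    {"migraine", "aspirin"},
]
bridges = second_order_links(titles, "migraine", "magnesium")
```

Because A and C never share a context, any context holding (A, B) is automatically different from any context holding (B, C), so the two hops always come from distinct contexts.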
    • 11. Is there a theoretical basis for the use of higher order co-occurrence relations?
      • Research agenda: study machine learning algorithms in search of a theoretical foundation for the use of higher order relations
      • First algorithm: Latent Semantic Indexing (LSI)
        • Widely used technique in text mining and IR based on the Singular Value Decomposition (SVD) matrix factoring algorithm
        • Semantically similar terms lie closer in LSI vector space even though they don’t co-occur
          • LSI reveals hidden or latent relationships
        • Research question: Does LSI leverage higher order term co-occurrence?
    • 12. Is there a theoretical basis for the use of higher order co-occurrence relations in LSI?
      • Yes! The answer is in the following theorem we proved: if the ij-th element of the truncated term-by-term matrix, Y, is non-zero, then there exists a co-occurrence path of order ≥ 1 between terms i and j.
        • Kontostathis, A. and Pottenger, W. M. (2006) A Framework for Understanding LSI Performance. Information Processing & Management.
      • We have both proven mathematically and demonstrated empirically that LSI is based on the use of higher order co-occurrence relations.
      • Next step? Extend the theoretical foundation by studying characteristics of higher-order relations in other machine learning datasets/algorithms such as association rule mining, supervised learning, etc.
        • Start by analyzing higher-order relations in labeled training data used in supervised machine learning
    • 13. What role do higher-order relations play in supervised machine learning?
      • Goal: discover patterns in higher-order paths useful in separating the classes
      • Co-occurrence relations in a record or instance set can be represented as an undirected graph G = (V, E)
        • V : a finite set of vertices (e.g., entities in a record)
        • E is the set of edges representing co-occurrence relations (edges are labeled with the record(s) in which entities co-occur)
      • Path definition from graph theory: Two vertices x_i and x_k are linked by a path P (with all nodes x_i distinct), where the number of edges in P is its length.
      • Higher-order path: Not only must the vertices (entities) be distinct, but the edges (records) must be distinct as well.
      [Figure: an example of a fourth-order path between e1 and e5, as well as several shorter paths]
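The higher-order path definition can be made concrete with a small depth-first enumeration. This is a sketch under the stated definition only (distinct entities and distinct records along the path), not the authors' implementation; the function name is hypothetical, and the toy data follows the D1/D2/D3 example from the slide notes.

```python
def higher_order_paths(records, source, target, order):
    """Enumerate paths with exactly `order` hops from source to target.

    records: record id -> set of entities co-occurring in that record
    A path is a list of (entity, record_id, entity) hops in which no
    entity and no record is used twice.
    """
    paths = []

    def extend(current, visited, used, hops):
        if len(hops) == order:
            if current == target:
                paths.append(list(hops))
            return
        for rid, entities in sorted(records.items()):
            if rid in used or current not in entities:
                continue
            for nxt in sorted(entities - visited):
                hops.append((current, rid, nxt))
                extend(nxt, visited | {nxt}, used | {rid}, hops)
                hops.pop()

    extend(source, {source}, set(), [])
    return paths

# Three records, each holding two entities.
records = {"D1": {"A", "B"}, "D2": {"B", "C"}, "D3": {"C", "D"}}
third_order = higher_order_paths(records, "A", "D", 3)
```

The search is exhaustive, so it is only practical for small record sets; it is meant to show the two distinctness constraints, not to scale.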
    • 14. What role do higher-order relations play in supervised machine learning?
      • Path Group: A path (length ≥ 2) is extracted per the definition of a path from graph theory. In the example, a 2nd-order path group comprises two sets of records: S_1 = {1,2,5} and S_2 = {1,2,3,4}. A path group may be composed of several higher-order paths.
      • A bipartite graph G = (V_1 ∪ V_2, E) is formed, where V_1 is the set of record sets and V_2 is the set of records. Enumerating all maximum matchings in this graph yields all higher-order paths in the path group. Another approach is to discover the systems of distinct representatives (SDRs) of these sets.
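For the small example above, the systems of distinct representatives can be listed by brute force: pick one record from each set and keep the choices in which no record repeats. This is an illustrative sketch (the function name is hypothetical); it is exponential in the number of sets, so a matching-based enumeration would be preferable for larger path groups.

```python
from itertools import product

def distinct_representatives(record_sets):
    """Yield every choice of one record per set with no record repeated."""
    for choice in product(*(sorted(s) for s in record_sets)):
        if len(set(choice)) == len(choice):
            yield choice

# The 2nd-order path group from the slide: S_1 = {1,2,5}, S_2 = {1,2,3,4}.
sdrs = list(distinct_representatives([{1, 2, 5}, {1, 2, 3, 4}]))
```

Each resulting tuple assigns a distinct record to each edge of the path group, which corresponds to one higher-order path.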
    • 15. What role do higher-order relations play in supervised machine learning?
      • Approach: Discover frequent itemsets in higher-order paths
        • For labeled datasets, divide instances by class and enumerate k-itemsets (initially for k in {3,4})
          • Results in a distribution of k-itemset frequencies for a given class
        • Compare distributions using a simple statistical measure such as the t-test to determine independence
        • If two distributions are statistically significantly different, we conclude that the higher-order path patterns (i.e., itemset frequencies) distinguish the classes
      • Labeled training data analyzed
        • Mushroom dataset: decision trees perform well on it
        • Border Gateway Protocol updates: relevant to cybersecurity
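A rough sketch of the comparison step follows. Two simplifications are assumptions on my part: k-itemsets are counted within individual instances rather than along higher-order paths, and Welch's unequal-variance t statistic is used because the slides do not say which t-test variant was applied. All names and the toy data are hypothetical.

```python
from collections import Counter
from itertools import combinations
from math import sqrt
from statistics import mean, variance

def itemset_frequencies(instances, k):
    """Count how often each k-itemset occurs across a class's instances."""
    counts = Counter()
    for items in instances:
        counts.update(combinations(sorted(items), k))
    return sorted(counts.values())

def welch_t(xs, ys):
    """Welch's t statistic for two samples with possibly unequal variances."""
    return (mean(xs) - mean(ys)) / sqrt(
        variance(xs) / len(xs) + variance(ys) / len(ys)
    )

# Hypothetical instances for one class, as sets of attribute-value items.
class_e = [{"a", "b", "c"}, {"a", "b", "c"}, {"a", "b", "d"}]
freqs_e = itemset_frequencies(class_e, 3)

# Toy frequency distributions for two classes.
t_stat = welch_t([1, 2, 3, 4], [2, 3, 4, 5])
```

The t statistic would then be compared against a critical value (or converted to a p-value) to decide whether the two frequency distributions differ significantly.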
    • 16. Preliminary Results – Supervised ML dataset
      • For each fold, we compared 3-itemset frequencies of the E (edible) set vs. the P (poisonous) set
      • Interesting result: six of the 10 folds had a confidence of 95% or greater that the E and P instances are statistically significantly different
        • Other folds between 80-95% (see below)
      Fold   t Stat   P(T<=t) one-tail   t Critical one-tail   P(T<=t) two-tail   t Critical two-tail
      0      -2.684   0.0037             1.6471                0.0074             1.9634
      1      -1.357   0.0875             1.6467                0.1751             1.9629
      2      -1.554   0.0603             1.6468                0.1205             1.9629
      3      -2.924   0.0018             1.6472                0.0036             1.9636
      4      -1.908   0.0284             1.6469                0.0568             1.9631
      5      -2.047   0.0205             1.6469                0.041              1.9631
      6      -1.455   0.073              1.6467                0.146              1.9629
      7      -2.023   0.0217             1.6469                0.0434             1.9631
      8      -2.795   0.0027             1.6471                0.0053             1.9635
      9      -2.71    0.0034             1.647                 0.0069             1.9633
      Ganiz, M., Pottenger, W.M. and Yang, X. (2006). Link Analysis of Higher-Order Paths in Supervised Learning Datasets. In the Proceedings of the Workshop on Link Analysis, Counterterrorism and Security, 2006 SIAM Conference on Data Mining, Bethesda, MD, April.
    • 17. What role do higher-order relations play in supervised machine learning?
      • Detection of Interdomain Routing Anomalies Based on Higher-Order Path Analysis
        • Border Gateway Protocol (BGP) is the de facto interdomain routing protocol for the Internet.
        • Anomalous BGP events (misconfigurations, attacks and large-scale power failures) often affect the global routing infrastructure.
          • Slammer worm attack (January 25, 2003)
          • Witty worm attack (March 19, 2004)
          • 2003 East Coast Blackout (i.e., power failure)
        • Goal: detect and categorize such events
    • 18. What role do higher-order relations play in supervised machine learning?
      • Detection of Interdomain Routing Anomalies Based on Higher-Order Path Analysis
        • The data are divided into three-second bins
        • Each bin is a single instance in our training data
      ID   Attribute         Definition
      1    Announce          # of BGP announcements
      2    Withdrawal        # of BGP withdrawals
      3    Update            # of BGP updates (= Announce + Withdrawal)
      4    Announce Prefix   # of announced prefixes
      5    Withdraw Prefix   # of withdrawn prefixes
      6    Updated Prefix    # of updated prefixes (= Announce Prefix + Withdraw Prefix)
    • 19. Preliminary Results – BGP dataset
      • Border Gateway Protocol (BGP) routing data
        • BGP messages generated during interdomain routing
        • Relevant to cybersecurity
        • Detect abnormal BGP events
          • Internet worm attacks (Slammer, Witty, …), power failures, etc.
        • Data from a period of time surrounding/including worm propagation
        • Instance -> three-second sample of BGP traffic
          • Six numeric attributes (Li et al., 2005)
        • A decision tree was previously applied successfully to two classes: worm vs. normal (Li et al., 2005)
          • Cannot distinguish different worms!
    • 20. Preliminary Results – BGP dataset
      • 240 instances to characterize a particular abnormal event
      • Sliding window approach for detection
        • Window size: 120 instances (360 seconds)
        • Sliding 10 instances (sampling every 30 seconds)
      Event 1    Event 2    t-test results
      Slammer    Witty      0.00023
      Blackout   Witty      0.00016
      Slammer    Blackout   0.018
      Ganiz, M., Pottenger, W.M., Kanitkar, S., Chuah, M.C. (2006b). Detection of Interdomain Routing Anomalies Based on Higher-Order Path Analysis. Proceedings of the Sixth IEEE International Conference on Data Mining (ICDM’06), December 2006, Hong Kong, China.
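The sliding-window settings on this slide (120 instances per window, advancing 10 instances at a time) can be sketched as a simple generator. The function name is hypothetical and the instances here are placeholder integers; in the actual setting each instance would be a three-second bin of BGP attributes, and each window would be tested against an event model.

```python
def sliding_windows(instances, size=120, step=10):
    """Yield (start_index, window) pairs over a sequence of instances.

    With three-second instances, size=120 covers 360 seconds and
    step=10 moves the window forward by 30 seconds.
    """
    for start in range(0, len(instances) - size + 1, step):
        yield start, instances[start:start + size]

# Placeholder stream of 240 instances.
stream = list(range(240))
windows = list(sliding_windows(stream))
```

Only complete windows are produced, so the last window starts at the final position where 120 instances are still available.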
    • 21. Preliminary Results – Naïve Bayes on Higher-order Paths
      • Cora (McCallum et al., 2000)
        • Scientific paper dataset
        • Several classes: case based, neural networks, etc.
        • 2708 documents, 1433 terms, 5429 links
        • Terms are ordered most sparse first
      • Instead of links, we used higher order paths in a Naïve Bayes framework
      • E.g., when 2nd-order paths are used, F-beta (beta=1) is higher starting from dictionary size 400
    • 22. What role do higher-order relations play in unsupervised machine learning?
      • Next step? Consider unsupervised learning…
        • Association Rule Mining (ARM)
      • ARM is one of the most widely used algorithms in data mining
        • Extend ARM to higher order… Higher Order Apriori
      • Experiments confirm the value of Higher Order Apriori on real world e-marketplace data
    • 23. Higher Order Apriori: Approach
      • First we extend the itemset definition to incorporate k-itemsets up to nth order
          • Definition 1: Items a and b are nth-order associated if a and b can be associated across n distinct records.
          • Definition 2: An nth-order k-itemset is a k-itemset for which each pair of its items is nth-order associated.
          • Definition 3: Two records are nth-order linked if they can be linked through n-2 distinct records.
          • Definition 4: An nth-order itemset i1i2…in is supported by an nth-order recordset r1r2...rn if no two items come from the same record.
    • 24. Higher Order Apriori: Approach
      • Given j instances of an nth-order k-recordset rs, its size is defined as:
      • Since the same k-itemset can be generated at different orders, the global support for a given k-itemset must include the local support at each order u. So we have this formula:
    • 25. Higher Order Apriori: Approach
      • Higher Order Apriori is structured in a level-wise order-first manner.
        • Level-wise:
        • the size of k -itemsets increases in each iteration (as is the case for Apriori),
        • Order-first:
        • at each level, itemsets are generated across all orders.
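For reference, the level-wise part can be illustrated with classic first-order Apriori, which Higher Order Apriori extends. The sketch below is plain Apriori only: the order-first step (generating itemsets across all orders at each level) is not reproduced, since the transcript does not preserve the size and support formulas. The function name and the toy transactions are hypothetical.

```python
from itertools import combinations

def apriori(transactions, min_support):
    """Classic first-order Apriori: return {itemset: support count}."""
    transactions = [frozenset(t) for t in transactions]
    level = {frozenset([i]) for t in transactions for i in t}
    frequent = {}
    k = 1
    while level:
        # Count support of every candidate at the current level.
        counts = {c: sum(1 for t in transactions if c <= t) for c in level}
        survivors = {c: n for c, n in counts.items() if n >= min_support}
        frequent.update(survivors)
        k += 1
        # Join step: merge frequent (k-1)-itemsets into k-item candidates.
        candidates = {a | b for a in survivors for b in survivors
                      if len(a | b) == k}
        # Prune step: every (k-1)-subset of a candidate must be frequent.
        level = {c for c in candidates
                 if all(frozenset(s) in survivors
                        for s in combinations(c, k - 1))}
    return frequent

baskets = [{"A", "B", "C"}, {"A", "B"}, {"A", "C"}, {"B", "C"}]
frequent = apriori(baskets, min_support=2)
```

The itemset size k grows by one in each iteration, which is the "level-wise" behaviour the slide refers to.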
    • 26. Higher Order Apriori: Results
      • Our algorithm was tested on real-world e-commerce data from the KDD Cup 2000 competition. There are 530 transactions involving 244 products in the dataset.
      • We compared the itemsets generated by Higher Order Apriori with two other algorithms:
          • Apriori (1st order)
          • Indirect (using our algorithm limited to 2nd order)
          • Higher Order Apriori limited to 6th order
      • We conducted experiments on multiple systems, including at the National Center for Supercomputing Applications (NCSA)
    • 27. Higher Order Apriori: Results
      • Higher Order Apriori mines significantly more final itemsets than Apriori and Indirect
      • Next we will show high support itemsets are discovered using smaller datasets than required by Apriori or Indirect
    • 28. Higher Order Apriori: Results
      • {CU, DQ} is the top ranked 2-itemset using Apriori on all 530 transactions.
        • Neither Apriori nor Indirect leverages the latent higher-order information in runs of 75, 100 and 200 random transactions
        • Higher Order Apriori, by contrast, discovered this itemset as top ranked using only 75 transactions
        • In addition, the gap between the supports increases as the transaction sets get larger
    • 29. Higher Order Apriori: Results
      • Discovering Novel Itemsets
      • AY- Girdle-at-the-top Classic Sheer Pantyhose
      • Q- Men’s City Rib Socks - 3 Pack
      • This relationship is also discovered by Apriori and Indirect, but Higher Order Apriori discovered a new nugget, which provides extra evidence for this relationship
      Itemsets discovered
        Apriori: {AY, X}, {X, K}, {K, Q}
        Indirect: {AY, K}, {X, Q} + Apriori itemsets
        Higher Order Apriori: {AY, Q} + Indirect itemsets + Apriori itemsets
      Itemsets undiscovered: {AY, K}, {X, Q}, {AY, Q}
      Shaver : Women’s Pantyhose relationship
        Apriori: (Donna Karan’s Extra Thin Pantyhose, Wet/Dry Shaver)
        Indirect: (Berkshire’s Ultra Nudes Pantyhose, Epilady Wet/Dry Shaver)
        Higher Order Apriori: (Donna Karan’s Pantyhose, Epilady Wet/Dry Shaver)
    • 30. Higher Order Apriori: Results
      • Discovering Novel Relationships
          • Higher Order Apriori discovers itemsets that demonstrate novel relationships not discovered by lower order methods.
          • For example, the following are reasonable relationships. While Apriori and Indirect failed to discover itemsets representing such relationships in the SIGKDD dataset, they might discover such links given a larger data set.
      Shaver : Lotion/Cream
        Apriori: No
        Indirect: No
        Higher Order Apriori: (Pedicure Care Kit, Toning Lotion), (wet/dry shaver, Herb Lotion), (Pedicure Care Kit, Leg Cream)
      Foot cream : Women’s socks
        Apriori: No
        Indirect: No
        Higher Order Apriori: (foot cream, Women's Ultra Sheer Knee High), (foot cream, women’s cotton dog sock)
    • 31. Conclusions
      • Many traditional machine learning algorithms assume instances are independent and identically distributed (I.I.D.)
        • Apply model to a single instance (decision is based on the feature vector) in a “context-free” manner
        • Independent of the other instances in the test set
      • Statistical Relational Learning (SRL)
        • Classifies a set of instances simultaneously (collective classification)
        • Utilizes relations (links) between instances in dataset
        • Usually considers immediate neighbors
        • Violates “independence” assumption
      • Our approach utilizes the latent information based on higher order paths
        • Utilizes higher order paths of order greater than or equal to two
        • Higher-order paths are implicit; based on co-occurrences of entities
          • We do not use the explicit links in the dataset!
        • Captures “latent semantics” (aka Latent Semantic Indexing)
    • 32. Thanks
      • Q&A
    • 33. Higher-order Co-occurrence: 3rd-order co-occurrence as a chain of co-occurrences (Kontostathis & Pottenger, 2006) [Figure: example chain in which contexts (document, instance, record, …) link entities (term, AVP, item, …)]