This paper describes BABAR, a knowledge extraction and representation system, implemented entirely in CLOS, that is primarily geared towards organizing and reasoning about knowledge extracted from the Wikipedia website. The system combines natural language processing techniques, knowledge representation paradigms and machine learning algorithms. BABAR is currently an ongoing independent research project that, when sufficiently mature, may provide various commercial opportunities.
BABAR uses natural language processing to parse both page names and page contents. It automatically generates Wikipedia topic taxonomies, thus providing a model for organizing the approximately 4,000,000 existing Wikipedia pages. It uses similarity metrics to establish concept relevancy and clustering algorithms to group topics by semantic relevancy. Novel algorithms are presented that combine approaches from the areas of machine learning and recommender systems. The system also generates a knowledge hypergraph which will ultimately be used in conjunction with an automated reasoner to answer questions about particular topics.
1. (Knowledge Extraction)
Raymond Pierre de Lacaze
(RPL)
LispNYC July 10th, 2012
rpl@lispnyc.org
2. (John McCarthy)
September 4th, 1927 – October 24th, 2011
This talk is dedicated to the memory of John McCarthy
Inventor of the Lisp Language (1958)
Founder of Artificial Intelligence
Winner of the Turing Award (1971)
Designer of Elephant 2000
A programming language based on speech acts
http://www-formal.stanford.edu/jmc/elephant/elephant.html
May He Rest in Peace
3. BABAR: Project Goals
Leverage Wikipedia as a Knowledge Base
Infer Infrastructure & Extract Content
Create Wiki Topic Taxonomies
Generate Knowledge Hypergraphs
Investigate Conceptual Relevance Metrics
Generate Knowledge Summaries
Answer Knowledge Base Queries
Evolve a new generation of web browsers:
Knowledge Browsers
4. Overview
Brief Overview of AI
Knowledge Representation
Natural Language Processing
Examine Specific Algorithms
Semantic Nets & Hypergraphs
Recursive Descent Parsing
Clustering Algorithms
Similarity Metrics
Describe Aspects of the BABAR System
Semantic Link Analysis
Automatic Topic Taxonomy Generation
Knowledge Category Assignment
Content Extraction
English Phrase to Clausal Form Logic
5. AI Technologies Discussed
Knowledge Representation
Clausal Form Logic
Semantic Nets
Hypergraphs
Natural Language Processing
Lexical Analysis
Syntactic Analysis
Recursive Descent Parsing
Semantic Analysis
Machine Learning Techniques
Clustering Algorithms
K-Means, Agglomerative and SR Clustering
Similarity Metrics
Jaccard Index
Pearson Correlation
6. Logics used in Artificial Intelligence
Monotonic Logic (standard)
Non-Monotonic Logic (exceptions)
(1) Birds can fly, (2) Penguins are birds, (3) Penguins can't fly
Sorted Logics (types)
Fuzzy Logic (continuous truth values)
Higher-Order Logics (meta-statements)
Modal Logics (may, can, must)
Intentional Logics (know, believe, think)
Temporal Logics (temporal operators)
Point-Based Temporal Logic (moments)
Interval Time Logic (Allen 1983, 13 temporal operators)
Before, Meets, Starts, Finishes, Overlaps, Contains, their inverses, and Equals.
Logics can be expressed in clausal form:
(ancestor ?x ?y) ← (parent ?x ?y)
(ancestor ?x ?y) ← (parent ?x ?z) (ancestor ?z ?y)
Note: The variables ?x and ?y are universally quantified, whereas the variable
?z is existentially quantified.
7. Clausal Form Logic
Propositional Calculus (PC)
Fully grounded clauses
No variables
(Brother John Jill),
(Parent Jane Jill) ← (Mother Jane Jill)
First Order Predicate Calculus (FOPC)
Variables
Universally quantified (for all ?x)
Existentially quantified (there exists ?x)
(Elephant ?x) → (Has-Tusks ?x)
Converting 1st order logic to clausal form
Skolem constants (there exists x for all y such that…)
Skolem functions (for each x there exists a y such that…)
Second Order Predicate Calculus
Predicates and clauses can be arguments
Meta statements
Gödel's Incompleteness Theorem
Horn Clauses
Wikipedia: In computational logic, a Horn clause is a clause with at most
one positive literal
B ← (A1 ∧ … ∧ An) ≡ ¬A1 ∨ … ∨ ¬An ∨ B
(<LHS> ← <RHS>) ≡ ((B) ← (A1 … An))
8. Automated Reasoning
Unification Algorithm
Clausal pattern matching and variable binding
(unify (P ?x ?y) (P A (Q ?x)))
Returns bindings: ((?x A) (?y (Q ?x)))
Instantiation: (P A (Q A))
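Below is a minimal sketch of such a unifier over list-based clauses, assuming variables are symbols whose names begin with ? (as in the example above). It is illustrative rather than BABAR's actual implementation, and it omits the occurs check.

(defun variable-p (x)
  "True for symbols like ?X and ?Y."
  (and (symbolp x) (char= (char (symbol-name x) 0) #\?)))

(defun unify (x y &optional bindings)
  "Return an alist of bindings unifying X with Y, or :FAIL."
  (cond ((eq bindings :fail) :fail)
        ((equal x y) bindings)
        ((variable-p x) (unify-variable x y bindings))
        ((variable-p y) (unify-variable y x bindings))
        ((and (consp x) (consp y))
         (unify (rest x) (rest y)
                (unify (first x) (first y) bindings)))
        (t :fail)))

(defun unify-variable (var val bindings)
  "Bind VAR to VAL, or unify through an existing binding."
  (let ((binding (assoc var bindings)))
    (if binding
        (unify (second binding) val bindings)
        (cons (list var val) bindings))))

;; (unify '(P ?x ?y) '(P A (Q ?x))) => ((?y (Q ?x)) (?x A))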
Rete Algorithm
Charles L. Forgy, CMU, 1974
Addresses the many-to-many matching problem
Matching facts to rules in rule-based systems
Donald Knuth, The Art of Computer Programming, Volume 3
Automated Reasoners
Backward Chaining Reasoners
Work from conclusion → axioms (facts)
Good when the state space branching factor is large
Forward Chaining Reasoners
Work from axioms → conclusion
Good when the depth of the state space is large
Mixed methods
Perform both forward & backward chaining
GPS (Ernst & Newell, 1969)
Island hopping
9. Semantic Nets
Labeled, directed (or not) and weighted (or not) Graphs
Equivalent in expressiveness to FOPC
Graphical representation of 1st order logic.
ISA Hierarchies
Subsumption (Bill Woods)
KL-ONE System: R.J. Brachman and J. Schmolze (1985)
A whole family of KL-ONE like systems
Concepts
Distinguish Primitive and Defined concepts
Only defined concepts are classifiable
Frames
Marvin Minsky, "A Framework for Representing Knowledge", 1974
OO Languages (CLOS) ≡ Frame Language
Think of class definitions as frames, where slots are attribute-value pairs
and you use pattern matching to fill in all the slots, at which point a
concept becomes defined and classifiable.
10. HyperGraphs
A hypergraph is a graph in which edges are first-class
objects and can be linked to other edges or vertices.
Hypergraphs are a natural and convenient way of
representing sentences and meta-statements.
[Figure: a hypergraph with vertices Mom, John, Jane and Jim. A Married edge links Jane and Jim; a Disapproves edge links John to the Married edge; a Resents edge links Mom to the Disapproves edge; Loves and Likes edges further relate the vertices.]
Mom resents the fact that John disapproves of Jane and
Jim's marriage.
BABAR uses an in-memory hypergraph semantic net.
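A minimal CLOS sketch of such a hypergraph, with edges as first-class objects that may link other edges as well as vertices. The class and accessor names are illustrative assumptions, not BABAR's actual API.

(defclass node () ())                       ; anything an edge may link

(defclass vertex (node)
  ((name :initarg :name :accessor node-name)))

(defclass edge (node)
  ((label :initarg :label :accessor edge-label)
   (ends  :initarg :ends  :accessor edge-ends)))  ; vertices and/or edges

(defun make-vertex (name) (make-instance 'vertex :name name))
(defun make-edge (label &rest ends)
  (make-instance 'edge :label label :ends ends))

;; "Mom resents the fact that John disapproves of Jane and Jim's marriage."
(let* ((jane (make-vertex 'jane))
       (jim  (make-vertex 'jim))
       (john (make-vertex 'john))
       (mom  (make-vertex 'mom))
       (marriage    (make-edge 'married jane jim))
       (disapproval (make-edge 'disapproves john marriage)))
  (make-edge 'resents mom disapproval))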
11. Natural Language Processing
Lexical Analysis
Understanding the role and morphological nature of words.
Morphology, Orthography, Part of Speech Tagging
Typically use Lexicons: Dictionaries, etc…
Programs that do this are called Scanners or Lexical Analyzers
ScanGen and LEX on Unix systems for Programming Languages
Syntactic Analysis
Understanding the grammatical nature of groups of words
Programs that do this are called Parsers.
They take tokens produced by scanners/analyzers and apply them
to a grammar.
In doing so they typically produce parse trees.
NLP parsing methodologies include:
Top-Down Parsers (recursive descent)
Bottom-Up Parsers
ParseGen and YACC on Unix systems for Programming Languages
Semantic Analysis
Extracting phrase structure from parse trees and producing
statements in some knowledge representation language such as
clausal-form logic.
KRL: "An Overview of KRL, a Knowledge Representation
Language", D.G. Bobrow and T. Winograd, (1977).
12. Lexical Analysis
Morphology
The rules that govern word morphing
foxes ≡ fox+<plural>
Orthography
The rules that govern spelling
Plural of fox ≡ fox+’es’
Transducers
Define languages consisting of pairs of strings
Loosely: Finite Automaton with 2 state transition functions.
Formally: Q (states), Σ (input alphabet), Δ (output alphabet), q0 (start), F (final states), δ(q, w) and σ(q, w).
FST: Finite State Transducer
Surface level, Intermediate level, Lexical level
E.g. foxes → fox+es → fox+N+PL
Parsing, Generating & Translating
Morphological Parser
Lexicons, Morphotactics and Orthographic Rules
Penn Treebank Parts of Speech Tags (50)
Probabilistic Approaches
N-Gram model
Counting word frequency
See Chapter 4 of Jurafsky & Martin, Speech & Language Processing, 2009
Google Translate
13. Lexical Analysis in BABAR
Lexicons
Regular words Lexicon
http://www.merriam-webster.com/
Query the site and extract parts of speech
About 50,000 locally cached entries.
Irregular Words Lexicons
Irregular nouns
Irregular verbs
Irregular auxiliaries
Orthographic Rules
Reverse-engineer morphed words
(analyze-morphed-word <word>)
Analyzes word suffixes then queries MW.
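A hedged sketch of the suffix-analysis idea. The rule table below is illustrative only; BABAR's actual orthographic rules are richer and finish by querying merriam-webster.com.

(defparameter *suffix-rules*
  '(("ally" . "al")    ; traditionally -> traditional
    ("al"   . "")      ; traditional   -> tradition
    ("es"   . "")      ; foxes         -> fox
    ("s"    . "")))    ; cats          -> cat

(defun strip-one-suffix (word)
  "Return a candidate base form of WORD, or NIL if no rule applies."
  (loop for (suffix . replacement) in *suffix-rules*
        for pos = (- (length word) (length suffix))
        when (and (plusp pos) (string-equal suffix word :start2 pos))
          return (concatenate 'string (subseq word 0 pos) replacement)))

;; (strip-one-suffix "traditionally") => "traditional"
;; (strip-one-suffix "traditional")   => "tradition"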
14. Lexical Analysis Example
KB(5): (parser::analyze-morphed-word "traditionally")
Loading #P"C:\\Projects\\trunk\\Data\\Lexicons\\Parts-of-Speech.lisp"
Loading table from file English-Irregular-Nouns ...
Loading table from file English-Irregular-Verbs ...
Loading table from file English-Irregular-Auxiliary ...
Initializing reverse lexicon table...
URL: "http://www.merriam-webster.com/dictionary/tradition"
Returns five values:
Base Form: "tradition"
Actual Form: "traditionally"
Primary POS: :ADVERB
Additional POS: NIL
Complete POS: (:ADVERB)
Reverse Engineering:
traditionally (adverb) → traditional (adjective) → tradition (noun)
The Parts-of-Speech Lexicon currently has about 50,000 entries.
There are approximately one million words in the English language.
15. Syntactic Analysis
Grammars
Productions (grammatical rules)
LHS: A non-terminal symbol
RHS: A disjunction of conjunctions of TS & NTS
Can be recursive
Non-Terminal Symbols
Terminal Symbols (lexicon entries)
Start Symbol
Implicitly Define an AND-OR Tree.
Context-Free Grammars, Attribute Grammars
Parsers
Traverse a grammar while consuming input tokens in an attempt to find a
valid path through the grammar that accommodates the input tokens.
Produce parse trees in which the internal nodes are Non-Terminal Symbols
(NTS) and the leaves are Terminal Symbols (TS)
Three typical ways to handle non-determinism
Backtracking
Look-ahead
Parallelism
16. Parsing in BABAR
Implements a Recursive Descent Parser which performs a
top-down traversal of the grammar.
Uses backtracking to handle non-determinism
3 Types of objects: tokens, grammars and parse-nodes
Scanner
Creates seven fundamental token classes based on
character composition
alphabetic, numeric, special, alpha-numeric, alpha-special,
numeric-special and alpha-numeric-special
Implemented using multiple-inheritance:
alphabetic-mixin, numeric-mixin and special-mixin classes
Parser Module (Scanner, Analyzer, Parser)
Implements a set of classes and generic functions geared towards
making it easy to develop particular domain-specific parsers.
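A sketch of how the seven token classes can fall out of the three mixins. The mixin names come from the slide; the TOKEN class and CLASSIFY-TOKEN helper are illustrative assumptions.

(defclass alphabetic-mixin () ())
(defclass numeric-mixin () ())
(defclass special-mixin () ())

(defclass token () ((text :initarg :text :accessor token-text)))

(defclass alphabetic-token            (token alphabetic-mixin) ())
(defclass numeric-token               (token numeric-mixin) ())
(defclass special-token               (token special-mixin) ())
(defclass alpha-numeric-token         (token alphabetic-mixin numeric-mixin) ())
(defclass alpha-special-token         (token alphabetic-mixin special-mixin) ())
(defclass numeric-special-token       (token numeric-mixin special-mixin) ())
(defclass alpha-numeric-special-token
    (token alphabetic-mixin numeric-mixin special-mixin) ())

(defun classify-token (string)
  "Choose one of the seven classes from STRING's character composition."
  (let ((a (some #'alpha-char-p string))
        (n (some #'digit-char-p string))
        (s (some (lambda (c) (not (alphanumericp c))) string)))
    (cond ((and a n s) 'alpha-numeric-special-token)
          ((and a n)   'alpha-numeric-token)
          ((and a s)   'alpha-special-token)
          ((and n s)   'numeric-special-token)
          (a           'alphabetic-token)
          (n           'numeric-token)
          (t           'special-token))))

;; (classify-token "R2-D2") => ALPHA-NUMERIC-SPECIAL-TOKEN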
18. Crawling Wikipedia
Wikipedia has approximately 4 million pages.
(initialize-wiki-graph <topic><depth>)
Returns a graph object
(crawl-wiki-topic <topic> <depth>)
Returns a Hash-Table of related-topics
For topic=elephant and depth=3:
#<EQUALP hash-table with 2580 entries>
(generate-wiki-graph <hash-table>)
Only create a vertex for keys (pruning)
Non-key related topics are ignored (pruning)
Create a ‘related-to edge for every (<key> <related-topic>) pair.
Without pruning: #<Graph Elephant-3: 154833 vertices, 553604 edges>
With Pruning: #<Graph Elephant-3: 2580 vertices, 182562 edges> (2.7%)
With Pruning: #<Graph Elephant-4: 25577 vertices, 2355810 edges> (0.3%)
A complete graph of n vertices has n(n−1)/2 edges ≡ O(n²)
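A sketch of the pruning logic described above, assuming a simple graph API; MAKE-GRAPH, ADD-VERTEX and ADD-EDGE are illustrative names, not BABAR's actual functions.

(defun generate-wiki-graph (related-topics)
  "RELATED-TOPICS maps each crawled topic to its list of related topics."
  (let ((graph (make-graph)))
    ;; Pruning: only keys (topics actually crawled) become vertices.
    (loop for key being the hash-keys of related-topics
          do (add-vertex graph key))
    ;; Create a RELATED-TO edge for every (key related-topic) pair,
    ;; ignoring related topics that are not themselves keys.
    (loop for key being the hash-keys of related-topics
            using (hash-value related)
          do (dolist (topic related)
               (when (gethash topic related-topics)
                 (add-edge graph 'related-to key topic))))
    graph))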
19. Link Name Organization
Internal, External and Intra-page hyperlinks
I chose the Elephant page as my entry page for crawling
There are 228 internal links from the Elephant page.
These occur throughout 103 paragraphs of text
Goal: Organize the 228 links into a meaningful taxonomy
Example taxonomy fragment:
-> Elephant
  -> Asian_Elephant
  -> African_Elephant
    -> African_Bush_Elephant
    -> African_Forest_Elephant
Apply NLP to link names: i.e. parse the link names.
Partition link names into subtopic, supertopic and related.
Subtopic candidate elimination
Partition related topics into strongly and weakly related
based on link bi-directionality
20. Subtopic Taxonomy Generation Algorithm
(generate-subtopic-relations-in-graph <graph>)
1. Produce Candidates: a list of pairs of concepts, in which the first
concept is a generalization of the second concept. This is determined
by noting concepts that, when parsed, produce a set of tokens that is a
subset of the set of tokens produced by parsing the second concept
(see the sketch below).
2. Eliminate False Positives: These are eliminated by ensuring that
the subjects of the phrases of each set of parsed tokens are
identical.
E.g. Elephant_Hotel is not a subtopic of Elephant whereas
Hotel_Elephant would be a subtopic of Elephant. This is one place
where NLP really adds value.
3. Replace ‘related-to relations with ‘generalizes relations.
4. Eliminate direct ‘generalizes relationships between children and
non-parent ancestors.
E.g. Elephant and North_African_Elephant.
5. Eliminate Singletons: Prune the list of subtrees by eliminating
singleton subtrees, thus leaving them in a yet-to-be-classified state.
Finally return a forest of trees, i.e. a list of root nodes.
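A sketch of the token-subset test behind step 1, using the split-sequence library as a stand-in for BABAR's link-name parser; LINK-NAME-TOKENS and GENERALIZATION-CANDIDATE-P are illustrative names.

(defun link-name-tokens (name)
  "Tokenize a Wikipedia link name such as \"African_Bush_Elephant\"."
  (mapcar #'string-downcase
          (split-sequence:split-sequence #\_ name)))

(defun generalization-candidate-p (a b)
  "True when A's tokens are a proper subset of B's tokens (step 1).
Step 2 additionally requires the two phrases' subjects to match."
  (let ((ta (link-name-tokens a))
        (tb (link-name-tokens b)))
    (and (subsetp ta tb :test #'string=)
         (< (length ta) (length tb)))))

;; (generalization-candidate-p "Elephant" "African_Bush_Elephant") => T
;; (generalization-candidate-p "Elephant" "Elephant_Hotel") => T, a false
;; positive that step 2 eliminates: the subject is Hotel, not Elephant.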
21. Subtopic Taxonomies
Organize 2,580 topics into a forest of 131 trees consisting of 1,594 nodes
(62%) and 986 yet-to-be-classified nodes.
Elephant Tree:
-> Elephant
  -> Dwarf_elephant
  -> Sri_Lankan_elephant
  -> Year_of_the_Elephant
  -> Sumatran_Elephant
  -> White_elephant
  -> War_elephant
  -> Crushing_by_elephant
  -> Babar_the_Elephant
  -> Indian_Elephant
  -> African_elephant
    -> African_Forest_Elephant
    -> North_African_Elephant
    -> African_Bush_Elephant
  -> Execution_by_elephant
  -> Borneo_pygmy_elephant
  -> Horton_the_Elephant
  -> Asian_elephant
  -> Elmer_the_Patchwork_Elephant

Elephant_Seal Tree:
-> Elephant_seal
  -> Southern_elephant_seal
  -> Northern_elephant_seal

Intelligence Tree:
-> Intelligence
  -> Fish_intelligence
  -> Cat_intelligence
  -> Artificial_intelligence
    -> Electronic_Transactions_on_Artificial_Intelligence
  -> Swarm_intelligence
  -> Cephalopod_intelligence
  -> Dinosaur_intelligence
  -> Cetacean_intelligence
  -> Evolution_of_human_intelligence
  -> Elephant_intelligence
  -> Dog_intelligence
  -> Pigeon_intelligence
  -> Primate_intelligence
  -> Bird_intelligence
23. Clustering
Two Fundamental Perspectives:
Top-Down: Partitioning a set into disjoint subsets
Bottom-Up: Grouping data points into disjoint clusters
Goes hand-in-hand with classification
Typically involves a metric: Euclidean or Manhattan distance
Many, many different algorithms & books.
Some really popular algorithms:
K-Means Clustering (EM, PCA)
Hierarchical Agglomerative Clustering
K-Nearest Neighbor (classification)
SR-Clustering: This is something I (re)invented.
Effectively: The world’s simplest clustering algorithm.
24. K-Means Clustering (1)
Given an initial set of cluster centroids, determine
the actual centroids of each cluster via an
iterative refinement algorithm.
Each refinement iteration consists of two steps :
1. Computing new data point centroid assignments
2. Computing new centroid positions as the mean of the
data points assigned to each centroid.
Convergence, divergence, oscillation…
Also known as Lloyd’s Algorithm in CS.
25. K-Means Clustering (2)
Wikipedia: Given a set of observations
(x1, x2, …, xn), where each observation is a d-
dimensional real vector, k-means clustering aims
to partition the n observations into k sets (k ≤ n) S
= {S1, S2, …, Sk} so as to minimize the within-
cluster sum of squares (WCSS):

argmin_S Σ_{i=1}^{k} Σ_{xj ∈ Si} ‖xj − μi‖²

where μi is the mean of points in Si
26. K-Means Clustering (3)
Assignment Step:
Assign each observation to the cluster whose centroid it deviates least from:
Si = { xp : ‖xp − μi‖² ≤ ‖xp − μj‖² for all j }
Update Step:
Calculate the new means to be the centroids of the
observations in the cluster:
μi = (1 / |Si|) Σ_{xj ∈ Si} xj
I.e. the average along each dimension
27. K-Means Clustering (4)
K-Means is *really* a 3-step algorithm
Step 1: Initialize K-Means (non-trivial)
Problem 1: Estimate K
Problem 2: Pick an initial centroid for each of the K clusters
Iterative Refinement
Step 2: Centroid Assignments
Step 3: Centroid Update
Many initialization approaches:
Random, Forgy, MacQueen and Kaufman
Performance depends on initialization and instance ordering
Popular because of its robustness
Related to:
EM Algorithm and
Principal Component Analysis (PCA)
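A compact sketch of Lloyd's algorithm over points represented as lists of numbers; for brevity, initialization (the hard part, per this slide) just takes the first K points as centroids.

(defun squared-distance (p q)
  (reduce #'+ (mapcar (lambda (a b) (expt (- a b) 2)) p q)))

(defun nearest-centroid (point centroids)
  "Index of the centroid nearest to POINT."
  (let ((best 0)
        (best-d (squared-distance point (first centroids))))
    (loop for c in (rest centroids)
          for i from 1
          for d = (squared-distance point c)
          when (< d best-d) do (setf best i best-d d))
    best))

(defun centroid (points)
  "Element-wise mean: the average along each dimension."
  (let ((n (length points)))
    (apply #'mapcar (lambda (&rest xs) (/ (reduce #'+ xs) n)) points)))

(defun k-means (points k &key (iterations 20))
  (let ((centroids (subseq points 0 k)))   ; naive initialization
    (dotimes (iter iterations centroids)
      (let ((clusters (make-array k :initial-element nil)))
        ;; Assignment step
        (dolist (p points)
          (push p (aref clusters (nearest-centroid p centroids))))
        ;; Update step (empty clusters keep their old centroid)
        (setf centroids
              (loop for j below k
                    collect (if (aref clusters j)
                                (centroid (aref clusters j))
                                (nth j centroids))))))))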
28. Hierarchical Agglomerative Clustering
The Algorithm
1. Cluster each data point with its nearest neighbor(s)
and make that a new data point (cluster).
2. Repeat until some fixed number of clusters is reached.
K-Nearest Neighbor is often used hand-in-hand with
agglomerative clustering to compute the nearest
neighbor(s).
End up with a tree of clusters (clustering history)
This tree is called a dendrogram
See Chapter 6 of Duda & Hart (SRI, 1973)
Pattern Classification & Scene Analysis
29. SR-Clustering (1)
Simple Ray Clustering
Sort of like non-hierarchical agglomerative
clustering
Basic Algorithm
For each data point, place it in the correct cluster
If it doesn’t belong to any cluster, create a new
cluster consisting of that single data point
Cluster Membership
Defined as being within a certain proximity
threshold of every data point in that cluster.
Proximity Metric
The Jaccard Index
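A sketch of the basic algorithm just described, assuming SIMILARITY is a two-argument function such as the Jaccard index of the next slide.

(defun sr-cluster (points similarity threshold)
  "Place each point in the first cluster whose every member is within
THRESHOLD similarity; otherwise seed a new singleton cluster."
  (let ((clusters '()))
    (dolist (point points (nreverse clusters))
      (let ((home (find-if
                   (lambda (cluster)
                     (every (lambda (member)
                              (>= (funcall similarity point member) threshold))
                            cluster))
                   clusters)))
        (if home
            (push point (cdr home))            ; join an existing cluster
            (push (list point) clusters))))))  ; seed a new singleton cluster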
30. Recommender Systems
Used by Netflix, Amazon, etc…
Objects: Users, Items & Preferences
User vs. Item based recommendations
The former is also known as collaborative filtering
Mixed method recommendations
Based on User Similarity and/or Item Similarity
Jaccard Index takes into account dissimilarity and
does not require preference measurements.
Apache Mahout (leverages Hadoop)
31. Jaccard Index
Defines a Similarity Metric between two sets
Wikipedia: The Jaccard coefficient measures
similarity between sample sets, and is defined
as the size of the intersection divided by the size
of the union of the sample sets:

J(A, B) = |A ∩ B| / |A ∪ B|

The Jaccard distance is its complement:

dJ(A, B) = 1 − J(A, B)
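A direct Lisp transcription over lists treated as sets; a sketch, since BABAR's actual set representation may differ.

(defun jaccard-index (set-a set-b &key (test #'equal))
  "|A ∩ B| / |A ∪ B|, or 0 when both sets are empty."
  (let ((i (length (intersection set-a set-b :test test)))
        (u (length (union set-a set-b :test test))))
    (if (zerop u) 0 (/ i u))))

(defun jaccard-distance (set-a set-b)
  (- 1 (jaccard-index set-a set-b)))

;; (jaccard-index '(a b c) '(b c d)) => 1/2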
32. Another Similarity Metric
Pearson Correlation Coefficient
Wikipedia: Defined as the covariance of the two
variables divided by the product of their
standard deviations:

ρ(X, Y) = cov(X, Y) / (σX · σY)
33. (compute-similarity-matrix <topics>)
Computes the Jaccard index for pairs of topics, using the related
topics of each topic as the sets to be compared. The matrix below
shows the resulting indices as percentages.
African Asian Indian Babar Horton War
African 100.00 38.46 21.05 4.35 6.82 7.94
Asian 38.46 100.00 37.74 4.00 6.25 20.00
Indian 21.05 37.74 100.00 6.90 7.14 24.39
Babar 4.35 4.00 6.90 100.00 28.57 7.14
Horton 6.82 6.25 7.14 28.57 100.00 7.41
War 7.94 20.00 24.39 7.14 7.41 100.00
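A sketch of the computation, reusing JACCARD-INDEX from the previous slide; RELATED-TOPICS-OF is an illustrative accessor, not BABAR's actual function.

(defun compute-similarity-matrix (topics related-topics-of)
  "Return an NxN array of pairwise Jaccard indices, as percentages."
  (let* ((n (length topics))
         (matrix (make-array (list n n))))
    (dotimes (i n matrix)
      (dotimes (j n)
        (setf (aref matrix i j)
              (* 100.0 (jaccard-index
                        (funcall related-topics-of (elt topics i))
                        (funcall related-topics-of (elt topics j)))))))))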
35. Knowledge Categories (1)
Human schooling as a decades-long knowledge
acquisition process
Spanning Kindergarten – Post-Doctoral work
Idea is to use grade school topics as initial
knowledge categories.
Science, History, Geography, Literature & Art
Goal: Assign categories to subtopic clusters
Use Jaccard Index to determine the category
Automatically create subtopic category names
e.g. Babar → Literature_Elephant
36. (compute-cluster-categories <clusters>)
Wiki Crawl each Knowledge Category (pre-run)
Compute subtopics of each knowledge category
Compute a category relevancy vector for each
cluster member
Combine the relevancy vectors of each cluster to
compute a relevancy vector for the cluster
Assign a category to the cluster
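A sketch of the last three steps, reusing JACCARD-INDEX; RELATED-TOPICS-OF and CATEGORY-SUBTOPICS are illustrative accessors standing in for the pre-run crawl results.

(defun relevancy-vector (topic categories related-topics-of category-subtopics)
  "One Jaccard score per knowledge category for a single cluster member."
  (mapcar (lambda (category)
            (jaccard-index (funcall related-topics-of topic)
                           (funcall category-subtopics category)))
          categories))

(defun compute-cluster-category (cluster categories
                                 related-topics-of category-subtopics)
  "Combine the members' relevancy vectors and assign the best category."
  (let ((total (reduce (lambda (v w) (mapcar #'+ v w))
                       (mapcar (lambda (topic)
                                 (relevancy-vector topic categories
                                                   related-topics-of
                                                   category-subtopics))
                               cluster))))
    (elt categories (position (reduce #'max total) total))))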
43. Things Overlooked
Wiki Page Contents Pane
Provides page taxonomy
Provides category names
Provides related topic names
Concept Weights
44. Future Direction
Enhance English Parser
Incorporate Variables into Semantic Net
Leverage topic weights
Work on language generation
Produce Wiki Summary Pages
Knowledge Queries
Develop Client Side Browser
Top Menu Bar: Knowledge Categories
RHS: Dynamic Subtopic Tree
LHS: Wiki Page Content Pane
45. (references)
Speech and Language Processing
Jurafsky and Martin, 2009
Artificial Intelligence: A Modern Approach
Russell & Norvig
Principles of Semantic Networks
Edited by John F. Sowa, Morgan Kaufmann, 1991
Machine Learning
Tom Mitchell, 1997
Pattern Classification and Scene Analysis
Duda and Hart, 1973
Algorithms of the Intelligent Web
Marmanis and Babenko, 2009