(Knowledge Extraction)

  Raymond Pierre de Lacaze

          (RPL)

          LispNYC July 10th, 2012

                        rpl@lispnyc.org
(John McCarthy)
September 4th, 1927 – October 24th, 2011


This talk is dedicated to the memory of John McCarthy

   Inventor of the Lisp Language (1958)
   Founder of Artificial Intelligence
   Winner of the Turing award (1971)
   Designer of Elephant 2000
       Programming Language based on speech acts
       http://www-formal.stanford.edu/jmc/elephant/elephant.html

   May He Rest in Peace
BABAR: Project Goals
   Leverage Wikipedia as a Knowledge Base

   Infer Infrastructure & Extract Content
       Create Wiki Topic Taxonomies
       Generate Knowledge Hypergraphs

   Investigate Conceptual Relevance Metrics

   Generate Knowledge summaries
   Answer Knowledge base queries

   Evolve a new generation of web browsers:
    Knowledge Browsers
Overview
   Brief Overview of AI
       Knowledge Representation
       Natural Language Processing

   Examine Specific Algorithms
       Semantic Nets & Hypergraphs
       Recursive Descent Parsing
       Clustering Algorithms
       Similarity Metrics

   Describe Aspects of the BABAR System
       Semantic Link Analysis
           Automatic Topic Taxonomy Generation
           Knowledge Category Assignment
       Content Extraction
           English Phrase to Clausal Form Logic
AI Technologies Discussed
   Knowledge Representation
       Clausal Form Logic
       Semantic Nets
       Hypergraphs

   Natural Language Processing
       Lexical Analysis
       Syntactic Analysis
           Recursive Descent Parsing
       Semantic Analysis

   Machine Learning Techniques
       Clustering Algorithms
       K-Means, Agglomerative and SR Clustering

   Similarity Metrics
       Jaccard Index
       Pearson Correlation
Logics used in Artificial Intelligence
   Monotonic Logic (standard)
   Non-Monotonic Logic (exceptions)
       (1) Birds can fly, (2) Penguins are birds, (3) Penguins can't fly

   Sorted Logics (types)
   Fuzzy Logic (continuous truth values)
   Higher-Order Logics (meta-statements)
       Modal Logics (may, can, must)
       Intentional Logics (know, believe, think)

   Temporal Logics (temporal operators)
       Point-Based Temporal Logic (moments)
       Interval Time Logic (Allen 1983, 13 temporal operators)
           Before, Meets, Starts, Finishes, Overlaps, Contains, their inverses, and Equals.

   Logics can be expressed in clausal form:
    (ancestor ?x ?y) ← (parent ?x ?y)
    (ancestor ?x ?y) ← (parent ?x ?z) (ancestor ?z ?y)

    Note: The variables ?x and ?y are universally quantified, whereas the variable
           ?z, which appears only in the body, is in effect existentially quantified.
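
The two ancestor clauses can be run mechanically. The following is a minimal Python sketch of naive forward chaining over them; the function name and the sample facts are illustrative assumptions, not part of BABAR.

```python
def ancestors(parent_facts):
    """Derive every (ancestor x y) pair implied by a set of (parent x y) facts."""
    # Clause 1: (ancestor ?x ?y) <- (parent ?x ?y)
    derived = set(parent_facts)
    changed = True
    while changed:
        changed = False
        # Clause 2: (ancestor ?x ?y) <- (parent ?x ?z) (ancestor ?z ?y)
        for (x, z) in parent_facts:
            for (z2, y) in list(derived):
                if z == z2 and (x, y) not in derived:
                    derived.add((x, y))
                    changed = True
    return derived


# Hypothetical facts: Ann is a parent of Bob, Bob of Cid, Cid of Dee.
facts = {("Ann", "Bob"), ("Bob", "Cid"), ("Cid", "Dee")}
result = ancestors(facts)
```

The loop stops once a full pass adds no new pair, which is the fixed point of the two rules.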
Clausal Form Logic
   Propositional Calculus (PC)
       Fully grounded clauses
       No variables
           (Brother John Jill),
           (Parent Jane Jill) ← (Mother Jane Jill)

   First Order Predicate Calculus (FOPC)
       Variables
           Universally quantified (for all ?x)
           Existentially quantified (there exists ?x)
           (Elephant ?x)  (Has-Tusks ?x)
       Converting 1st order logic to clausal form
           Skolem constants (there exists x for all y such that…)
           Skolem functions (for each x there exists a y such that…)

   Second Order Predicate Calculus
       Predicates and clauses can be arguments
       Meta statements
       Gödel's Incompleteness Theorem

   Horn Clauses
       Wikipedia: In computational logic, a Horn clause is a clause with at most
        one positive literal
       B ← (A1 ∧ … ∧ An) ≡ ¬A1 ∨ … ∨ ¬An ∨ B
       (<LHS> <RHS>) ≡ ((B) (A1…An))
Automated Reasoning
   Unification Algorithm
       Clausal pattern matching and variable binding
       (unify (P ?x ?y) (P A (Q ?x)))
           Returns bindings: ((?x A) (?y (Q ?x)))
           Instantiation: (P A (Q A))
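
A minimal sketch of unification in Python, assuming clauses are written as nested tuples and variables as strings starting with "?". The names unify, walk and instantiate are illustrative, and the sketch omits the occurs check.

```python
def is_var(term):
    return isinstance(term, str) and term.startswith("?")


def walk(term, bindings):
    """Follow variable bindings until reaching a non-variable or an unbound variable."""
    while is_var(term) and term in bindings:
        term = bindings[term]
    return term


def unify(a, b, bindings=None):
    """Return a dict of variable bindings that makes a and b match, or None."""
    bindings = {} if bindings is None else bindings
    a, b = walk(a, bindings), walk(b, bindings)
    if a == b:
        return bindings
    if is_var(a):
        return {**bindings, a: b}   # no occurs check in this sketch
    if is_var(b):
        return {**bindings, b: a}
    if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
        for x, y in zip(a, b):
            bindings = unify(x, y, bindings)
            if bindings is None:
                return None
        return bindings
    return None


def instantiate(term, bindings):
    """Substitute bindings into term, recursively."""
    term = walk(term, bindings)
    if isinstance(term, tuple):
        return tuple(instantiate(t, bindings) for t in term)
    return term


pattern = ("P", "?x", "?y")
bindings = unify(pattern, ("P", "A", ("Q", "?x")))
instance = instantiate(pattern, bindings)
```

With the slide's example, bindings pairs ?x with A and ?y with (Q ?x), and instance is the tuple form of (P A (Q A)).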

   Rete Algorithm
       Charles L. Forgy, CMU, 1974
       Addresses the many-many matching problem
       Matching facts to rules in rule-based systems
       Donald Knuth, Volume 3.

   Automated Reasoners
       Backward Chaining Reasoners
           Work from conclusion → axioms (facts)
           Good when state space branching factor is large
       Forward Chaining Reasoners
           Work from axioms → conclusion
           Good when the depth of the state space is large
       Mixed methods
        Perform both forward & backward chaining
         GPS (Ernst & Newell, 1969)
         Island hopping
Semantic Nets
   Labeled, directed (or not) and weighted (or not) Graphs
   Equivalent in expressiveness to FOPC
   Graphical representation of 1st order logic.
   ISA Hierarchies
   Subsumption (Bill Woods)

   KL-ONE System: R.J. Brachman and J. Schmolze (1985)
   A whole family of KL-ONE like systems

   Concepts
       Distinguish Primitive and Defined concepts
       Only defined concepts are classifiable

   Frames
       Marvin Minsky, "A Framework for Representing Knowledge", 1974
       OO Languages (CLOS) ≡ Frame Language
       Think of class definitions as frames, where slots are attribute-value pairs
        and you use pattern matching to fill in all the slots at which point a
        concept becomes defined and classifiable.
HyperGraphs
   A hypergraph is a graph in which edges are first class
    objects and can be linked to other edges or vertices.
   Hypergraphs are a natural and convenient way of
    representing sentences and meta-statements.
   [Figure: hypergraph over the vertices Jane, Jim, John and Mom, with edges labeled Married, Loves, Likes, Disapproves and Resents]


   Mom resents the fact that John disapproves of Jane and
    Jim’s Marriage.
   BABAR uses an in memory HyperGraph  Semantic Net
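
To make "edges are first class objects" concrete, here is a small Python sketch in which an edge may have another edge as an endpoint. The Vertex and Edge classes are assumptions made for illustration and say nothing about BABAR's actual in-memory representation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Vertex:
    name: str


@dataclass(frozen=True)
class Edge:
    label: str
    source: object  # a Vertex or another Edge
    target: object  # a Vertex or another Edge


def describe(node):
    """Render a vertex or a (possibly nested) edge in prefix form."""
    if isinstance(node, Vertex):
        return node.name
    return f"({node.label} {describe(node.source)} {describe(node.target)})"


jane, jim, john, mom = (Vertex(n) for n in ("Jane", "Jim", "John", "Mom"))
married = Edge("Married", jane, jim)
disapproves = Edge("Disapproves", john, married)   # edge pointing at an edge
resents = Edge("Resents", mom, disapproves)        # meta-statement about a statement
sentence = describe(resents)
```

Here sentence encodes "Mom resents the fact that John disapproves of Jane and Jim's marriage" as nested edges.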
Natural Language Processing
   Lexical Analysis
       Understanding the role and morphological nature of words.
       Morphology, Orthography, Part of Speech Tagging
       Typically use Lexicons: Dictionaries, etc…
       Programs that do this are called Scanners or Lexical Analyzers
       ScanGen and LEX on Unix systems for Programming Languages

   Syntactic Analysis
       Understanding the grammatical nature of groups of words
       Programs that do this are called Parsers.
       They take tokens produced by scanners/analyzers and apply them
        to a grammar.
       In doing so they typically produce parse trees.
       NLP parsing methodologies include:
           Top-Down Parsers (recursive descent)
           Bottom-Up Parsers
       ParseGen and YACC on Unix systems for Programming Languages

   Semantic Analysis
       Extracting phrase structure from parse trees and producing
        statements in some knowledge representation language such as
        clausal-form logic.
       KRL: "An Overview of KRL, a Knowledge Representation
        Language", D.G. Bobrow and T. Winograd, (1977).
Lexical Analysis
   Morphology
       The rules that govern word morphing
       foxes ≡ fox+<plural>

   Orthography
       The rules that govern spelling
       Plural of fox ≡ fox+’es’

   Transducers
       Define languages consisting of pairs of strings
       Loosely: Finite Automaton with 2 state transition functions.
       Formally: Q (states), Σ (i-alph), Δ (o-alph), q0 (start), F (final), δ(q, w) and σ(q, w).
       FST: Finite State Transducer
       Surface level, Intermediate level, Lexical level
           E.g. foxes → fox+es → fox+N+PL
       Parsing, Generating & Translating

   Morphological Parser
           Lexicons, Morphotactics and Orthographic Rules
           Penn Treebank Parts of Speech Tags (50)

   Probabilistic Approaches
       N-Gram model
       Counting word frequency
       See Chapter 4 of Jurafsky & Martin, Speech & Language Processing, 2009
       Google Translate
Lexical Analysis in BABAR
   Lexicons
       Regular words Lexicon
         http://www.merriam-webster.com/
         Query the site and extract parts of speech
         About 50,000 locally cached entries.
       Irregular Words Lexicons
         Irregular nouns
         Irregular verbs
         Irregular auxiliaries


   Orthographic Rules
       reverse engineer morphed words

   (analyze-morphed-word <word>)
       Analyzes word suffixes then queries MW.
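
A rough Python sketch of the idea behind reverse engineering morphed words: strip suffixes until a known base form is found. The tiny lexicon and the suffix rules are hypothetical stand-ins; the real system queries Merriam-Webster and applies proper orthographic rules.

```python
# Hypothetical local lexicon: base form -> part of speech.
LEXICON = {"tradition": "noun", "fox": "noun"}

# Hypothetical suffix rules: (suffix, part of speech of the morphed word).
SUFFIX_RULES = [
    ("ly", "adverb"),
    ("al", "adjective"),
    ("es", "plural noun"),
]


def analyze_morphed_word(word, lexicon=LEXICON):
    """Return the chain [(form, pos), ...] from the morphed word down to its base form."""
    chain = []
    current = word
    while current not in lexicon:
        for suffix, pos in SUFFIX_RULES:
            if current.endswith(suffix) and len(current) > len(suffix):
                chain.append((current, pos))
                current = current[: -len(suffix)]
                break
        else:
            return None  # unknown word and no suffix rule applies
    chain.append((current, lexicon[current]))
    return chain


chain = analyze_morphed_word("traditionally")
```

The sketch is greedy and never backtracks over a suffix choice, which is enough for this example but not for English in general.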
Lexical Analysis Example
KB(5): (parser::analyze-morphed-word "traditionally")

Loading #P"C:\Projects\trunk\Data\Lexicons\Parts-of-Speech.lisp"

Loading table from file English-Irregular-Nouns ...
Loading table from file English-Irregular-Verbs ...
Loading table from file English-Irregular-Auxiliary ...

Initializing reverse lexicon table...

URL: "http://www.merriam-webster.com/dictionary/tradition"

   Returns five values:
    Base Form:           "tradition"
    Actual Form:         "traditionally"
    Primary POS:         :ADVERB
    Additional:          NIL
    Complete POS:        (:ADVERB)

   Reverse Engineering:

    traditionally (adverb) → traditional (adjective) → tradition (noun)

   Parts-of-Speech Lexicon currently has about 50,000 entries.
   Approximately one million words in the English language
Syntactic Analysis
   Grammars
       Productions (grammatical rules)
           LHS: A non-terminal symbol
           RHS: A disjunction of conjunctions of TS & NTS
           Can be recursive
       Non-Terminal Symbols
       Terminal Symbols (lexicon entries)
       Start Symbol

       Implicitly Define an AND-OR Tree.
       Context-Free Grammars, Attribute Grammars

   Parsers
       Traverse a grammar while consuming input tokens in an attempt to find a
        valid path through the grammar that accommodates the input tokens.

       Produce parse trees in which the internal nodes are Non-Terminal Symbols
        (NTS) and the leaves are Terminal Symbols (TS)

       Three typical ways to handle non-determinism
           Backtracking
           Look-ahead
           Parallelism
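
The following Python sketch shows a recursive descent parser that handles non-determinism by backtracking. The grammar is a dictionary from non-terminal to a disjunction (list) of conjunctions (lists of symbols). The toy grammar and lexicon are assumptions chosen for illustration, not BABAR's English grammar.

```python
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["DET", "N"], ["N"]],
    "VP": [["V", "NP"], ["V"]],
}
# Terminal symbols are parts of speech looked up in a toy lexicon.
LEXICON = {"the": "DET", "elephant": "N", "elephants": "N",
           "grass": "N", "eat": "V", "eats": "V"}


def parse(symbol, tokens, pos):
    """Yield (tree, next_pos) for every way symbol can match tokens[pos:]."""
    if symbol not in GRAMMAR:  # terminal symbol
        if pos < len(tokens) and LEXICON.get(tokens[pos]) == symbol:
            yield (symbol, tokens[pos]), pos + 1
        return
    for alternative in GRAMMAR[symbol]:  # OR: try each production in turn
        yield from parse_sequence(symbol, alternative, tokens, pos)


def parse_sequence(symbol, seq, tokens, pos, children=()):
    """AND: match every symbol of seq in order, backtracking on failure."""
    if not seq:
        yield (symbol, *children), pos
        return
    for tree, nxt in parse(seq[0], tokens, pos):
        yield from parse_sequence(symbol, seq[1:], tokens, nxt, children + (tree,))


def parse_sentence(text):
    tokens = text.lower().split()
    for tree, end in parse("S", tokens, 0):
        if end == len(tokens):  # accept only parses that consume all input
            return tree
    return None


tree = parse_sentence("the elephant eats grass")
```

Generators make the backtracking implicit: when a later symbol fails, the enclosing loop simply asks the earlier symbol for its next alternative.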
Parsing in BABAR
   Implements a Recursive Descent Parser which performs a
    top-down traversal of the grammar.

   Uses backtracking to handle non-determinism
   3 Types of objects: tokens, grammars and parse-nodes

   Scanner
       Creates tokens belonging to one of seven fundamental token classes based on
        character composition
       alphabetic, numeric, special, alpha-numeric, alpha-special,
        numeric-special and alpha-numeric-special
       Implemented using multiple-inheritance:
           alphabetic-mixin, numeric-mixin and special-mixin classes
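
The seven token classes are exactly the non-empty combinations of three mixins. A Python sketch of the same idea, assuming ASCII input; the class names mirror the slide but the code is only an illustration, not the CLOS implementation.

```python
class Token:
    def __init__(self, text):
        self.text = text


class AlphabeticMixin:
    pass


class NumericMixin:
    pass


class SpecialMixin:
    pass


_CLASSES = {}  # cache: tuple of mixins -> generated token class


def token_class(text):
    """Pick (and lazily create) the token class matching the characters in text."""
    mixins = tuple(
        mixin
        for mixin, present in (
            (AlphabeticMixin, any(c.isalpha() for c in text)),
            (NumericMixin, any(c.isdigit() for c in text)),
            (SpecialMixin, any(not c.isalnum() for c in text)),
        )
        if present
    )
    if not mixins:
        raise ValueError("token has no classifiable characters")
    if mixins not in _CLASSES:
        name = "".join(m.__name__.replace("Mixin", "") for m in mixins) + "Token"
        # Multiple inheritance: the selected mixins plus the Token base class.
        _CLASSES[mixins] = type(name, mixins + (Token,), {})
    return _CLASSES[mixins]


def make_token(text):
    return token_class(text)(text)


token = make_token("K-9")
```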

   Parser Module (Scanner, Analyzer, Parser)
       Implements a set of classes and generic functions geared towards
        making it easy to develop particular domain-specific parsers.
Level 1 (simple)
Class      grammar
Macro      (define-grammar <name> <prods> <preds> &key <class>)
GF         (scan-tokens <string> <grammar> &key <delimiter>)
GF         (parse-tokens <tokens> <grammar>)

Level 2 (context)
Class      context-grammar

Macro      (define-context-grammar <name> <prods> <preds> <context>)

Macro      (with-grammar-context (<context> <grammar>) &body <body>)

GF         (analyze-tokens <tokens> <grammar>)

Level 3 (domain)
Macro      (define-lexicon <name> <fields>)
Macro      (define-word-class <word-type> &optional <slots>)
Level 4 (english)
Adds       english-grammar, scan-tokens, analyze-word-morphology
Crawling Wikipedia
   Wikipedia has approximately 4 million pages.

(initialize-wiki-graph <topic> <depth>)
     Returns a graph object

(crawl-wiki-topic <topic> <depth>)
   Returns a Hash-Table of related-topics
   For topic=elephant and depth=
     #<EQUALP hash-table with 2580 entries>

(generate-wiki-graph <hash-table>)
   Only create a vertex for keys (pruning)
   Non-key related topics are ignored (pruning)
   Create a 'related-to edge for every (<key> <related-topic>) pair.

   Without pruning: #<Graph Elephant-3: 154833 vertices, 553604 edges>
   With Pruning:    #<Graph Elephant-3: 2580 vertices, 182562 edges> (2.7%)
   With Pruning:    #<Graph Elephant-4: 25577 vertices, 2355810 edges> (0.3%)

   A complete graph of n vertices has n(n-1)/2 edges, i.e. O(n²)
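
A Python sketch of the pruning rule, assuming the crawl result is a dictionary from each crawled topic to the topics it links to. The function names and toy data are illustrative. The density helper divides the edge count by n², which appears to be how the 2.7% figure above was obtained (182562 / 2580²).

```python
def generate_wiki_graph(related, prune=True):
    """Build (vertices, edges) from a {topic: [related topics]} table."""
    vertices = set(related)  # crawled topics are the keys
    if not prune:
        for targets in related.values():
            vertices.update(targets)  # keep non-key related topics too
    edges = {
        (topic, target)
        for topic, targets in related.items()
        for target in targets
        if target in vertices and target != topic
    }
    return vertices, edges


def density(n_vertices, n_edges):
    """Fraction of the n^2 possible ordered pairs that are edges."""
    return n_edges / (n_vertices ** 2)


# Hypothetical two-page crawl.
related = {
    "Elephant": ["Mammal", "Ivory"],
    "Mammal": ["Elephant", "Animal"],
}
pruned_vertices, pruned_edges = generate_wiki_graph(related)
full_vertices, full_edges = generate_wiki_graph(related, prune=False)
```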
Link Name Organization
   Internal, External and Intranal hyperlinks
   I chose the Elephant page as my entry page for crawling
   There are 228 internal links from the Elephant page.
   These occur throughout 103 paragraphs of text
   Goal: Organize the 228 links into a meaningful taxonomy

                       Asian_Elephant
    Elephant
                                             African_Bush_Elephant
                      African_Elephant
                                            African_Forest_Elephant


   Apply NLP to link names: i.e. parse the link names.
   Partition link names into subtopic, supertopic and related.
       Subtopic candidate elimination
   Partition related topics into strongly and weakly related
    based on link bi-directionality
Subtopic Taxonomy Generation Algorithm
(generate-subtopic-relations-in-graph <graph>)
1. Produce Candidates: a list of pairs of concepts. Each pair of
concepts is such that the first concept is a generalization of the
second concept. This is determined by noting concepts that,
when parsed, produce a set of tokens that is a subset of the set
of tokens produced by parsing the second concept.
2. Eliminate False Positives: These are eliminated by ensuring that
the subjects of the phrases of each set of parsed tokens are
identical.
      E.g. Elephant_Hotel is not a subtopic of Elephant whereas
       Hotel_Elephant would be a subtopic of Elephant. This is one place
       where NLP really adds value.
3. Replace 'related-to relations with 'generalizes relations.
4. Eliminate direct 'generalizes relationships between children and
non-parent ancestors.
      E.g. Elephant and North_African_Elephant.
5. Eliminate Singletons: Prune the list of subtrees by eliminating
singleton subtrees, thus leaving them in a state of yet to be
classified.
 Finally return a forest of trees, i.e. a list of root nodes.
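
The steps above can be sketched in Python as follows. One loud assumption: instead of parsing each link name, the sketch treats the last token as the subject of the phrase. That is a crude stand-in for the NLP step, and all names here are illustrative.

```python
def tokens(topic):
    return topic.lower().split("_")


def generalizes(general, specific):
    """Steps 1-2: token subset test, then require the same subject (last token)."""
    g, s = tokens(general), tokens(specific)
    return set(g) < set(s) and g[-1] == s[-1]


def subtopic_forest(topics):
    pairs = {(g, s) for g in topics for s in topics if generalizes(g, s)}
    # Step 4: drop g -> s when some intermediate m gives g -> m -> s.
    direct = {
        (g, s)
        for (g, s) in pairs
        if not any((g, m) in pairs and (m, s) in pairs for m in topics)
    }
    children = {}
    for g, s in sorted(direct):
        children.setdefault(g, []).append(s)
    has_parent = {s for _, s in direct}
    # Step 5: singletons have neither parent nor children, so they are not roots.
    roots = [t for t in topics if t in children and t not in has_parent]
    return roots, children


topics = ["Elephant", "African_elephant", "North_African_Elephant",
          "Elephant_Hotel", "Lion", "Sea_lion"]
roots, children = subtopic_forest(topics)
```

Note that this simplification also files Sea_lion under Lion, since both end in the same token.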
Subtopic Taxonomies
  Organize 2,580 topics into a forest of 131 trees consisting of 1,594 nodes
   (62%) and 986 yet to be classified nodes.

 Elephant Tree
-> Elephant
  -> Dwarf_elephant
  -> Sri_Lankan_elephant
  -> Year_of_the_Elephant
  -> Sumatran_Elephant
  -> White_elephant
  -> War_elephant
  -> Crushing_by_elephant
  -> Babar_the_Elephant
  -> Indian_Elephant
  -> African_elephant
     -> African_Forest_Elephant
     -> North_African_Elephant
     -> African_Bush_Elephant
  -> Execution_by_elephant
  -> Borneo_pygmy_elephant
  -> Horton_the_Elephant
  -> Asian_elephant
  -> Elmer_the_Patchwork_Elephant

 Elephant_Seal Tree
-> Elephant_seal
  -> Southern_elephant_seal
  -> Northern_elephant_seal

 Intelligence Tree
-> Intelligence
  -> Fish_intelligence
  -> Cat_intelligence
  -> Artificial_intelligence
     -> Electronic_Transactions_on_Artificial_Intelligence
  -> Swarm_intelligence
  -> Cephalopod_intelligence
  -> Dinosaur_intelligence
  -> Cetacean_intelligence
  -> Evolution_of_human_intelligence
  -> Elephant_intelligence
  -> Dog_intelligence
  -> Pigeon_intelligence
  -> Primate_intelligence
  -> Bird_intelligence
Subtopic Taxonomy Issues
 -> Lion
   -> Congolese_Spotted_Lion
   -> Asiatic_Lion
   -> Masai_lion
   -> Barbary_lion
   -> Henry_the_Lion
   -> Sri_Lanka_lion
   -> Nemean_lion
   -> Western_African_lion
   -> Transvaal_Lion
   -> West_African_lion
   -> Tsavo_lion
   -> Southwest_African_Lion
   -> European_lion
   -> Cape_Lion
   -> Sea_lion
      -> Steller_sea_lion
      -> Australian_sea_lion
      -> South_American_sea_lion
      -> New_Zealand_sea_lion
      -> California_sea_lion
   -> American_lion
   -> White_lion
      -> Kimba_the_White_Lion
   -> Cowardly_Lion
   -> Tiger_versus_lion

WRT Nomenclature purity, Lion_Seal is a better name than Sea_Lion.
Clustering
   Two Fundamental Perspectives:
     Top-Down: Partitioning a set into disjoint subsets
     Bottom-Up: Grouping data points into disjoint clusters


   Goes hand-in-hand with classification

   Typically involves a metric: Euclidean or Manhattan distance

   Many, many different algorithms & books.

   Some really popular algorithms:
     K-Means Clustering (EM, PCA)
     Hierarchical Agglomerative Clustering
     K-Nearest Neighbor (classification)


   SR-Clustering: This is something I (re)invented.
       Effectively: The world’s simplest clustering algorithm.
K-Means Clustering (1)
   Given an initial set of cluster centroids, determine
    the actual centroids of each cluster via an
    iterative refinement algorithm.

   Each refinement iteration consists of two steps :
    1. Computing new data point centroid assignments
    2. Computing new centroid positions based on the
    mean deviation of the data points from the previous
    centroid positions.

   Convergence, Divergence, Oscillation…

   Also known as Lloyd’s Algorithm in CS.
K-Means Clustering (2)
Wikipedia: Given a set of observations (x1, x2, …, xn), where each
observation is a d-dimensional real vector, k-means clustering aims
to partition the n observations into k sets (k ≤ n) S = {S1, S2, …, Sk}
so as to minimize the within-cluster sum of squares (WCSS):

  \arg\min_{S} \sum_{i=1}^{k} \sum_{x_j \in S_i} \lVert x_j - \mu_i \rVert^2

where μi is the mean of points in Si
K-Means Clustering (3)
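 Assignment Step:

  S_i^{(t)} = \{ x_j : \lVert x_j - \mu_i^{(t)} \rVert^2 \le \lVert x_j - \mu_l^{(t)} \rVert^2 \ \text{for all } l,\ 1 \le l \le k \}

  Defines Si to be the set of xj that deviate least from the current mean μi

 Update Step:

  \mu_i^{(t+1)} = \frac{1}{\lvert S_i^{(t)} \rvert} \sum_{x_j \in S_i^{(t)}} x_j

  Calculate the new means to be the centroid of the
  observations in the cluster.
  I.e. The average along each dimension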
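
A compact Python sketch of the two-step refinement loop (Lloyd's algorithm), assuming the initial centroids are supplied by the caller and points are tuples of floats. Initialization strategies are left out.

```python
def kmeans(points, centroids, max_iter=100):
    """Refine centroids until assignments stop changing; return (centroids, clusters)."""
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: each centroid moves to the mean of its cluster,
        # i.e. the average along each dimension (empty clusters stay put).
        new_centroids = [
            tuple(sum(dim) / len(cluster) for dim in zip(*cluster)) if cluster else c
            for cluster, c in zip(clusters, centroids)
        ]
        if new_centroids == centroids:
            break  # converged
        centroids = new_centroids
    return centroids, clusters


points = [(1.0, 1.0), (1.0, 2.0), (9.0, 9.0), (10.0, 9.0)]
centroids, clusters = kmeans(points, [(0.0, 0.0), (5.0, 5.0)])
```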
K-Means Clustering (4)
   K-Means is *really* a 3 step algorithm
     Step 1: Initialize K-Means (non-trivial)
        Problem 1: Estimate K
        Problem 2: Pick an Initial Centroid for each of the K clusters
     Iterative Refinement
        Step 2: Centroid Assignments
        Step 3: Centroid Update


   Many initialization approaches:
       Random, Forgy, MacQueen and Kaufman

   Performance depends on initialization and instance ordering
   Popular because of its robustness
   Related to:
       EM Algorithm and
       Principal Component Analysis (PCA)
Hierarchical Agglomerative Clustering
   The Algorithm
    1. Cluster each data point with its nearest neighbor(s)
    and make that a new data point (cluster).
    2. Repeat until some fixed number of clusters is reached.

   K-Nearest Neighbor is often used hand-in-hand with
    agglomerative clustering to compute the nearest
    neighbor(s).

   End up with a tree of clusters (clustering history)

   This tree is called a dendrogram

   See Chapter 6 of Duda & Hart (SRI, 1973)
    Pattern Classification & Scene Analysis
SR-Clustering (1)
 Simple Ray Clustering
     Sort of like non-hierarchical agglomerative
      clustering
 Basic Algorithm
     For each data point, place it in the correct cluster
     If it doesn’t belong to any cluster, create a new
      cluster consisting of that single data point
 Cluster Membership
     Defined as being within a certain proximity
      threshold of every data point in that cluster.
 Proximity Metric
     The Jaccard Index
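
A Python sketch of the basic algorithm as described, with two assumptions: each data point is a set of features, and a point goes into the first cluster that accepts it. The talk does not say how ties or multiple eligible clusters are handled, so that choice is purely illustrative.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def sr_cluster(items, threshold):
    """items maps a name to its feature set; returns a list of clusters (lists of names)."""
    clusters = []
    for name, features in items.items():
        for cluster in clusters:
            # Membership: within the threshold of EVERY point already in the cluster.
            if all(jaccard(features, items[m]) >= threshold for m in cluster):
                cluster.append(name)
                break
        else:
            clusters.append([name])  # no cluster accepts it: start a new one
    return clusters


# Hypothetical feature sets.
items = {
    "a": {1, 2, 3},
    "b": {1, 2, 3, 4},
    "c": {7, 8},
    "d": {7, 8, 9},
    "e": {1, 2, 3, 4, 5, 6},
}
clusters = sr_cluster(items, threshold=0.5)
```

The result depends on the order in which points are visited, a property this sketch shares with other single-pass clustering schemes.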
Recommender Systems
   Used by Netflix, Amazon, etc…
   Objects: Users, Items & Preferences

   User vs. Item based recommendations
   Former aka collaborative filtering
   Mixed method recommendations
   Based on User Similarity and/or Item Similarity

   Jaccard Index takes into account dissimilarity and
    does not require preference measurements.

   Apache Mahout (leverages Hadoop)
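Jaccard Index
 Defines a Similarity Metric between two sets

 Wikipedia: The Jaccard coefficient measures
 similarity between sample sets, and is defined
 as the size of the intersection divided by the size
 of the union of the sample sets:

  J(A, B) = \frac{\lvert A \cap B \rvert}{\lvert A \cup B \rvert}

 Jaccard Distance:

  d_J(A, B) = 1 - J(A, B) = \frac{\lvert A \cup B \rvert - \lvert A \cap B \rvert}{\lvert A \cup B \rvert}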
Another Similarity Metric
 Pearson Correlation Coefficient

 Wikipedia: Defined as the covariance of the
 two variables divided by the product of their
 standard deviations
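
The definition translates directly into code. A minimal Python sketch for two equal-length samples; the 1/n factors in the covariance and the standard deviations cancel, so they are omitted. The result is undefined when either sample is constant.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation: covariance divided by the product of standard deviations."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


r_up = pearson([1, 2, 3], [2, 4, 6])    # perfectly correlated samples
r_down = pearson([1, 2, 3], [3, 2, 1])  # perfectly anti-correlated samples
```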
(compute-similarity-matrix <topics>)
   Computes the Jaccard index for pairs of topics by
    using the related topics of each topic as the sets to
    be compared.


             African   Asian    Indian   Babar    Horton     War

African      100.00     38.46    21.05     4.35     6.82     7.94

Asian         38.46    100.00    37.74     4.00     6.25    20.00

Indian        21.05     37.74   100.00     6.90     7.14    24.39

Babar           4.35     4.00     6.90   100.00    28.57     7.14

Horton          6.82     6.25     7.14    28.57   100.00     7.41

War             7.94    20.00    24.39     7.14     7.41   100.00
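
A Python sketch of how such a matrix could be computed, with scores scaled to percentages as in the table above. The related-topic sets are invented for illustration and do not reproduce the talk's numbers.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def compute_similarity_matrix(related):
    """related maps a topic to its set of related topics; returns {(p, q): percent}."""
    names = list(related)
    return {
        (p, q): round(100 * jaccard(related[p], related[q]), 2)
        for p in names
        for q in names
    }


# Hypothetical related-topic sets.
related = {
    "African": {"Ivory", "Mammal", "Africa", "Tusk"},
    "Asian": {"Ivory", "Mammal", "Asia"},
    "Babar": {"Literature"},
}
matrix = compute_similarity_matrix(related)
```

The matrix is symmetric with 100 on the diagonal, matching the shape of the table above.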
(cluster-subtopics <subtopics> <matrix> <threshold>)

Threshold = 20

Cluster 1
Asian_elephant(49)
African_elephant(60)

Cluster 2
Babar_the_Elephant(7)
Horton_the_Elephant(5)
Elmer_the_Patchwork_Elephant(4)

Cluster 3
Asian_elephant(49)
Indian_Elephant(18)
Sri_Lankan_elephant(12)
Sumatran_Elephant(11)
Borneo_pygmy_elephant(3)

Cluster 4
War_elephant(22)
Execution_by_elephant(5)
Crushing_by_elephant(4)

Cluster 5
Year_of_the_Elephant(8)

Cluster 6
Dwarf_elephant(24)

Cluster 7
White_elephant(10)
Knowledge Categories (1)
   Human schooling as a decade(s) long knowledge
    acquisition process

   Spanning Kindergarten – Post Doctoral work

   Idea is to use grade school topics as initial
    knowledge categories.

   Science, History, Geography, Literature & Art

   Goal: Assign categories to subtopic clusters

   Use Jaccard Index to determine the category

   Automatically create subtopic category names
    e.g. Babar → Literature_Elephant
(compute-cluster-categories <clusters>)

   Wiki Crawl each Knowledge Category (pre-run)
   Compute subtopics of each knowledge category

   Compute a category relevancy vector for each
    cluster member

   Combine the relevancy vectors of each cluster to
    compute a relevancy vector for the cluster

   Assign a category to the cluster
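
The talk does not say how member vectors are combined. The Python sketch below assumes a plain average and uses the relevancy figures listed for the three fictional elephants on the Individual Subtopic Categories slide. For that cluster the average gives ART about 0.333 and GEOGRAPHY about 0.307, but a plain average does not reproduce every cluster's figures in the talk, so the actual rule is evidently different or more involved.

```python
def cluster_relevancy(member_vectors):
    """Average the members' category relevancy vectors and rank the categories."""
    categories = member_vectors[0].keys()
    combined = {
        c: sum(v[c] for v in member_vectors) / len(member_vectors)
        for c in categories
    }
    return sorted(combined.items(), key=lambda item: item[1], reverse=True)


members = [
    # Babar_the_Elephant
    {"LITERATURE": 0.44, "ART": 0.25, "GEOGRAPHY": 0.23, "HISTORY": 0.2, "SCIENCE": 0.17},
    # Elmer_the_Patchwork_Elephant
    {"LITERATURE": 0.22, "ART": 0.25, "GEOGRAPHY": 0.23, "HISTORY": 0.2, "SCIENCE": 0.17},
    # Horton_the_Elephant
    {"LITERATURE": 0.22, "ART": 0.5, "GEOGRAPHY": 0.46, "HISTORY": 0.4, "SCIENCE": 0.35},
]
ranked = cluster_relevancy(members)
category = ranked[0][0]  # the category assigned to the cluster
```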
(compute-cluster-categories <clusters>)
((((:SCIENCE 0.47666672) (:HISTORY 0.44666672))
  (#<Concept(49): Asian_elephant> #<Concept(60): African_elephant>))

 (((:SCIENCE 0.39) (:GEOGRAPHY 0.37800002))
  (#<Concept(3): Borneo_pygmy_elephant> #<Concept(49): Asian_elephant>
   #<Concept(18): Indian_Elephant> #<Concept(12): Sri_Lankan_elephant>
   #<Concept(11): Sumatran_Elephant>))

 (((:ART 0.33333334) (:GEOGRAPHY 0.30666667))
  (#<Concept(7): Babar_the_Elephant> #<Concept(5): Horton_the_Elephant>
   #<Concept(4): Elmer_the_Patchwork_Elephant>))

 (((:HISTORY 0.6) (:GEOGRAPHY 0.46))
  (#<Concept(8): Year_of_the_Elephant>))

 (((:GEOGRAPHY 0.72333336) (:HISTORY 0.43666664))
  (#<Concept(5): Execution_by_elephant> #<Concept(22): War_elephant>
   #<Concept(4): Crushing_by_elephant>))

 (((:GEOGRAPHY 0.86) (:SCIENCE 0.5))
  (#<Concept(24): Dwarf_elephant>))

 (((:SCIENCE 0.69) (:ART 0.49))
  (#<Concept(10): White_elephant>)))
Individual Subtopic Categories
The following shows the knowledge category relevancies for some of the 16
subtopics of Elephant and helps explain the results of the previous slide

(#<Concept(7): Babar_the_Elephant>
 ((:LITERATURE 0.44) (:ART 0.25) (:GEOGRAPHY 0.23) (:HISTORY 0.2) (:SCIENCE 0.17)))

(#<Concept(4): Elmer_the_Patchwork_Elephant>
 ((:ART 0.25) (:GEOGRAPHY 0.23) (:LITERATURE 0.22) (:HISTORY 0.2) (:SCIENCE 0.17)))

(#<Concept(5): Horton_the_Elephant>
 ((:ART 0.5) (:GEOGRAPHY 0.46) (:HISTORY 0.4) (:SCIENCE 0.35) (:LITERATURE 0.22)))

(#<Concept(60): African_elephant>
 ((:ART 1.03) (:SCIENCE 0.91) (:HISTORY 0.85) (:GEOGRAPHY 0.77) (:LITERATURE 0.37)))

(#<Concept(49): Asian_elephant>
 ((:HISTORY 0.7) (:SCIENCE 0.62) (:GEOGRAPHY 0.59) (:ART 0.42) (:LITERATURE 0.19)))

(#<Concept(22): War_elephant>
 ((:HISTORY 0.93) (:GEOGRAPHY 0.85) (:LITERATURE 0.41) (:ART 0.23) (:SCIENCE 0.16)))
Categorized Subtopic Clusters
   Elephant
       ART_Elephant
          Elmer_the_Patchwork_Elephant
          Horton_the_Elephant
          Babar_the_Elephant
       GEOGRAPHY_Elephant
          Dwarf_elephant
          Crushing_by_elephant
          War_elephant
          Execution_by_elephant
       HISTORY_Elephant
          Year_of_the_Elephant
       SCIENCE_Elephant
          African_elephant
              African_Forest_Elephant
              African_Bush_Elephant
          Asian_elephant
          White_elephant
          Sumatran_Elephant
          Sri_Lankan_elephant
          Indian_Elephant
          Borneo_pygmy_elephant
Related Topics Associations
   Associate related topics to subtopic clusters using Jaccard Index
   Use associations to create related topic clusters

(find-compatible-clusters <strongly-related-topics> <clusters>)

(( #<Concept(60): African elephant>
  #<Concept(49): Asian_elephant>)
(#<Concept(10): Elephant intelligence>
 #<Concept(103): Animal cognition>
 #<Concept(4): Elephant tusk>
 #<Concept(15): African>
 #<Concept(102): Proboscidea>
 #<Concept(96): Mammalia>
 #<Concept(876): Mammal>
 #<Concept(143): Hippopotamus>
 #<Concept(590): Lion>
 #<Concept(10): Loxodonta>)

((#<Concept(22): War_elephant>
  #<Concept(5): Execution_by_elephant>
  #<Concept(4): Crushing_by_elephant>)
(#<Concept(55): Ivory>
 #<Concept(77): Kenya>
 #<Concept(31): Grief>
 #<Concept(8): History_of_elephants_in_Europe>))

((#<Concept(24): Dwarf elephant>)
(#<Concept(66): Mammoth>
 #<Concept(25): Mastodon>
 #<Concept(275): Genus>
 #<Concept(62): Afrotheria>
 #<Concept(86): Gestation>
 #<Concept(749): Eutheria>
 #<Concept(8): Gomphotherium>
 #<Concept(27): Tooth>
 #<Concept(8): Tooth_development>))

((#<Concept(7): Babar_the_Elephant>
  #<Concept(5): Horton_the_Elephant>
  #<Concept(4): Elmer_the_Patchwork_Elephant>)
 #<Concept(6): List_of_fictional_elephants>
 #<Concept(5): List_of_elephants_in_mythology_and_religion>
 #<Concept(5): Pinnawala>
 #<Concept(3): Katy_Payne>
 #<Concept(11): Infrasound>
 #<Concept(56): Incisor>
 #<Concept(14): Jeheskel_Shoshani>
 #<Concept(6): Aanayoottu> )
(sentence-to-clause <sentence>)
  english sentence string
Scanner
  Tokens
Analyzer
  morphologically analyzed words
Parser
  parse-tree
Phrase extractor
  phrases (flattened parse tree)
Semantic analyzer
  Frame for subject, verb, object and prep. phrases
Clauses Generator
  Clause Objects
Sample Extracted Clauses
   (HASA "Asian elephant species" "disjunct distributions")

   (ISA "Elephants" "herbivores")

   (HASA "African Elephants" "three nails")

   (HASA "Indian Elephants" "four nails")

   (HASA "female African Elephants" "large tusks")

   (ISA "Elephants" "large land mammals")
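
To show the shape of the output only, here is a deliberately shallow Python sketch that maps simple "X are Y" and "X have Y" sentences to ISA and HASA clauses with regular expressions. It is a pattern-based shortcut and not the scanner, parser and semantic analyzer pipeline described in the talk; the patterns and the function body are assumptions.

```python
import re

# Hypothetical surface patterns standing in for real semantic analysis.
PATTERNS = [
    (re.compile(r"^(.+?) (?:is|are) (?:an? )?(.+?)\.?$", re.IGNORECASE), "ISA"),
    (re.compile(r"^(.+?) (?:has|have) (.+?)\.?$", re.IGNORECASE), "HASA"),
]


def sentence_to_clause(sentence):
    """Return (predicate, subject, object) for a simple sentence, or None."""
    text = sentence.strip()
    for pattern, predicate in PATTERNS:
        match = pattern.match(text)
        if match:
            return (predicate, match.group(1), match.group(2))
    return None


clause = sentence_to_clause("Elephants are herbivores.")
```

Anything beyond such flat sentences (relative clauses, prepositional phrases, negation) needs the parse-tree based approach.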
Things Overlooked
 Wiki Page Contents Pane
     Provides page taxonomy
     Provides category names
     Provides related topic names

 Concept Weights
Future Direction
 Enhance English Parser
 Incorporate Variables into Semantic Net
 Leverage topic weights
 Work on language generation
 Produce Wiki Summary Pages
 Knowledge Queries
 Develop Client Side Browser
    Top Menu Bar Knowledge Categories
    RHS Dynamic Subtopic Tree
    LHS Wiki Page Content Pane
(references)
 Speech and Language Processing
  Jurafsky and Martin
 Artificial Intelligence: A Modern Approach
  Russell & Norvig
 Principles of Semantic Networks
  Edited by John F. Sowa, Morgan Kaufmann, 1991
 Machine Learning
  Tom Mitchell, 1997
 Pattern Classification and Scene Analysis
  Duda and Hart, 1973
 Algorithms of the Intelligent Web
  Marmanis and Babenko, 2009
(cluster-images)
(Love Elephants LispNYC)
 

Viewers also liked

Integration Of Declarative and Procedural Knowledge for The Management of Chr...
Integration Of Declarative and Procedural Knowledge for The Management of Chr...Integration Of Declarative and Procedural Knowledge for The Management of Chr...
Integration Of Declarative and Procedural Knowledge for The Management of Chr...Health Informatics New Zealand
 
Lecture 4 Meta Knowledge
Lecture 4 Meta KnowledgeLecture 4 Meta Knowledge
Lecture 4 Meta KnowledgeSimon Shurville
 
7. knowledge acquisition, representation and organization 8. semantic network...
7. knowledge acquisition, representation and organization 8. semantic network...7. knowledge acquisition, representation and organization 8. semantic network...
7. knowledge acquisition, representation and organization 8. semantic network...AhL'Dn Daliva
 
The Turing Test - A sociotechnological analysis and prediction - Machine Inte...
The Turing Test - A sociotechnological analysis and prediction - Machine Inte...The Turing Test - A sociotechnological analysis and prediction - Machine Inte...
The Turing Test - A sociotechnological analysis and prediction - Machine Inte...piero scaruffi
 
Knowledge Extraction from Social Media
Knowledge Extraction from Social MediaKnowledge Extraction from Social Media
Knowledge Extraction from Social MediaSeth Grimes
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligencesanjay_asati
 
Knowledge representation and Predicate logic
Knowledge representation and Predicate logicKnowledge representation and Predicate logic
Knowledge representation and Predicate logicAmey Kerkar
 
Knowledge Representation in Artificial intelligence
Knowledge Representation in Artificial intelligence Knowledge Representation in Artificial intelligence
Knowledge Representation in Artificial intelligence Yasir Khan
 
Knowledge representation in AI
Knowledge representation in AIKnowledge representation in AI
Knowledge representation in AIVishal Singh
 

Viewers also liked (14)

Integration Of Declarative and Procedural Knowledge for The Management of Chr...
Integration Of Declarative and Procedural Knowledge for The Management of Chr...Integration Of Declarative and Procedural Knowledge for The Management of Chr...
Integration Of Declarative and Procedural Knowledge for The Management of Chr...
 
Turing
TuringTuring
Turing
 
KNOWLEDGE: REPRESENTATION AND MANIPULATION
KNOWLEDGE: REPRESENTATION AND MANIPULATIONKNOWLEDGE: REPRESENTATION AND MANIPULATION
KNOWLEDGE: REPRESENTATION AND MANIPULATION
 
Lecture 4 Meta Knowledge
Lecture 4 Meta KnowledgeLecture 4 Meta Knowledge
Lecture 4 Meta Knowledge
 
7. knowledge acquisition, representation and organization 8. semantic network...
7. knowledge acquisition, representation and organization 8. semantic network...7. knowledge acquisition, representation and organization 8. semantic network...
7. knowledge acquisition, representation and organization 8. semantic network...
 
Representation of knowledge
Representation of knowledgeRepresentation of knowledge
Representation of knowledge
 
The Turing Test - A sociotechnological analysis and prediction - Machine Inte...
The Turing Test - A sociotechnological analysis and prediction - Machine Inte...The Turing Test - A sociotechnological analysis and prediction - Machine Inte...
The Turing Test - A sociotechnological analysis and prediction - Machine Inte...
 
Turing Test
Turing TestTuring Test
Turing Test
 
Knowledge Extraction from Social Media
Knowledge Extraction from Social MediaKnowledge Extraction from Social Media
Knowledge Extraction from Social Media
 
Turing test
Turing testTuring test
Turing test
 
Artificial Intelligence
Artificial IntelligenceArtificial Intelligence
Artificial Intelligence
 
Knowledge representation and Predicate logic
Knowledge representation and Predicate logicKnowledge representation and Predicate logic
Knowledge representation and Predicate logic
 
Knowledge Representation in Artificial intelligence
Knowledge Representation in Artificial intelligence Knowledge Representation in Artificial intelligence
Knowledge Representation in Artificial intelligence
 
Knowledge representation in AI
Knowledge representation in AIKnowledge representation in AI
Knowledge representation in AI
 

Similar to Knowledge Extraction

Lean Logic for Lean Times: Varieties of Natural Logic
Lean Logic for Lean Times: Varieties of Natural LogicLean Logic for Lean Times: Varieties of Natural Logic
Lean Logic for Lean Times: Varieties of Natural LogicValeria de Paiva
 
Ch3 4 regular expression and grammar
Ch3 4 regular expression and grammarCh3 4 regular expression and grammar
Ch3 4 regular expression and grammarmeresie tesfay
 
Package-based Description Logics – Preliminary Results
Package-based Description Logics – Preliminary ResultsPackage-based Description Logics – Preliminary Results
Package-based Description Logics – Preliminary ResultsJie Bao
 
Ch2 (8).pptx
Ch2 (8).pptxCh2 (8).pptx
Ch2 (8).pptxDeyaHani
 
First order predicate logic(fopl)
First order predicate logic(fopl)First order predicate logic(fopl)
First order predicate logic(fopl)surbhi jha
 
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...Facultad de Informática UCM
 
Functional and Structural Models of Commonsense Reasoning in Cognitive Archit...
Functional and Structural Models of Commonsense Reasoning in Cognitive Archit...Functional and Structural Models of Commonsense Reasoning in Cognitive Archit...
Functional and Structural Models of Commonsense Reasoning in Cognitive Archit...Antonio Lieto
 
Constructive Description Logics 2006
Constructive Description Logics 2006Constructive Description Logics 2006
Constructive Description Logics 2006Valeria de Paiva
 
First Order Logic
First Order LogicFirst Order Logic
First Order LogicMianMubeen3
 
Lean Logic for Lean Times: Entailment and Contradiction Revisited
Lean Logic for Lean Times: Entailment and Contradiction RevisitedLean Logic for Lean Times: Entailment and Contradiction Revisited
Lean Logic for Lean Times: Entailment and Contradiction RevisitedValeria de Paiva
 
Tutorial - Introduction to Rule Technologies and Systems
Tutorial - Introduction to Rule Technologies and SystemsTutorial - Introduction to Rule Technologies and Systems
Tutorial - Introduction to Rule Technologies and SystemsAdrian Paschke
 
Bridging the Systemic and Semantic Spheres
Bridging the Systemic and Semantic SpheresBridging the Systemic and Semantic Spheres
Bridging the Systemic and Semantic SpheresHelene Finidori
 
Natural Language parsing.pptx
Natural Language parsing.pptxNatural Language parsing.pptx
Natural Language parsing.pptxsiddhantroy13
 

Similar to Knowledge Extraction (20)

AI-09 Logic in AI
AI-09 Logic in AIAI-09 Logic in AI
AI-09 Logic in AI
 
Lean Logic for Lean Times: Varieties of Natural Logic
Lean Logic for Lean Times: Varieties of Natural LogicLean Logic for Lean Times: Varieties of Natural Logic
Lean Logic for Lean Times: Varieties of Natural Logic
 
Ch3 4 regular expression and grammar
Ch3 4 regular expression and grammarCh3 4 regular expression and grammar
Ch3 4 regular expression and grammar
 
Package-based Description Logics – Preliminary Results
Package-based Description Logics – Preliminary ResultsPackage-based Description Logics – Preliminary Results
Package-based Description Logics – Preliminary Results
 
Ch2 (8).pptx
Ch2 (8).pptxCh2 (8).pptx
Ch2 (8).pptx
 
First order predicate logic(fopl)
First order predicate logic(fopl)First order predicate logic(fopl)
First order predicate logic(fopl)
 
Lec 3.pdf
Lec 3.pdfLec 3.pdf
Lec 3.pdf
 
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...
Languages, Ontologies and Automatic Grammar Generation - Prof. Pedro Rangel H...
 
grammer genration
grammer genration grammer genration
grammer genration
 
Functional and Structural Models of Commonsense Reasoning in Cognitive Archit...
Functional and Structural Models of Commonsense Reasoning in Cognitive Archit...Functional and Structural Models of Commonsense Reasoning in Cognitive Archit...
Functional and Structural Models of Commonsense Reasoning in Cognitive Archit...
 
Constructive Description Logics 2006
Constructive Description Logics 2006Constructive Description Logics 2006
Constructive Description Logics 2006
 
First Order Logic
First Order LogicFirst Order Logic
First Order Logic
 
Lean Logic for Lean Times: Entailment and Contradiction Revisited
Lean Logic for Lean Times: Entailment and Contradiction RevisitedLean Logic for Lean Times: Entailment and Contradiction Revisited
Lean Logic for Lean Times: Entailment and Contradiction Revisited
 
Tutorial - Introduction to Rule Technologies and Systems
Tutorial - Introduction to Rule Technologies and SystemsTutorial - Introduction to Rule Technologies and Systems
Tutorial - Introduction to Rule Technologies and Systems
 
nlp (1).pptx
nlp (1).pptxnlp (1).pptx
nlp (1).pptx
 
Bridging the Systemic and Semantic Spheres
Bridging the Systemic and Semantic SpheresBridging the Systemic and Semantic Spheres
Bridging the Systemic and Semantic Spheres
 
AI Lesson 41
AI Lesson 41AI Lesson 41
AI Lesson 41
 
Lesson 41
Lesson 41Lesson 41
Lesson 41
 
Natural Language parsing.pptx
Natural Language parsing.pptxNatural Language parsing.pptx
Natural Language parsing.pptx
 
Inteligencia artificial
Inteligencia artificialInteligencia artificial
Inteligencia artificial
 

More from Pierre de Lacaze

Babar: Knowledge Recognition, Extraction and Representation
Babar: Knowledge Recognition, Extraction and RepresentationBabar: Knowledge Recognition, Extraction and Representation
Babar: Knowledge Recognition, Extraction and RepresentationPierre de Lacaze
 
Reinforcement Learning and Artificial Neural Nets
Reinforcement Learning and Artificial Neural NetsReinforcement Learning and Artificial Neural Nets
Reinforcement Learning and Artificial Neural NetsPierre de Lacaze
 

More from Pierre de Lacaze (7)

Babar: Knowledge Recognition, Extraction and Representation
Babar: Knowledge Recognition, Extraction and RepresentationBabar: Knowledge Recognition, Extraction and Representation
Babar: Knowledge Recognition, Extraction and Representation
 
Deep Learning
Deep LearningDeep Learning
Deep Learning
 
Reinforcement Learning and Artificial Neural Nets
Reinforcement Learning and Artificial Neural NetsReinforcement Learning and Artificial Neural Nets
Reinforcement Learning and Artificial Neural Nets
 
Logic Programming and ILP
Logic Programming and ILPLogic Programming and ILP
Logic Programming and ILP
 
Meta Object Protocols
Meta Object ProtocolsMeta Object Protocols
Meta Object Protocols
 
Prolog 7-Languages
Prolog 7-LanguagesProlog 7-Languages
Prolog 7-Languages
 
Clojure 7-Languages
Clojure 7-LanguagesClojure 7-Languages
Clojure 7-Languages
 

Recently uploaded

Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEarley Information Science
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsJoaquim Jorge
 

Recently uploaded (20)

Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptxEIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
EIS-Webinar-Prompt-Knowledge-Eng-2024-04-08.pptx
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 

Knowledge Extraction

  • 1. (Knowledge Extraction) Raymond Pierre de Lacaze (RPL) LispNYC July 10th, 2012 rpl@lispnyc.org
  • 2. (John McCarthy) September 4th, 1927 – October 24th, 2011 This talk is dedicated to the memory of John McCarthy  Inventor of the Lisp Language (1958)  Founder of Artificial Intelligence  Winner of the Turing Award (1971)  Designer of Elephant 2000  Programming Language based on speech acts  http://www-formal.stanford.edu/jmc/elephant/elephant.html  May He Rest in Peace
  • 3. BABAR: Project Goals  Leverage Wikipedia as a Knowledge Base  Infer Infrastructure & Extract Content  Create Wiki Topic Taxonomies  Generate Knowledge Hypergraphs  Investigate Conceptual Relevance Metrics  Generate Knowledge summaries  Answer Knowledge base queries  Evolve a new generation of web browsers: Knowledge Browsers
  • 4. Overview  Brief Overview of AI  Knowledge Representation  Natural Language Processing  Examine Specific Algorithms  Semantic Nets & Hypergraphs  Recursive Descent Parsing  Clustering Algorithms  Similarity Metrics  Describe Aspects of the BABAR System  Semantic Link Analysis  Automatic Topic Taxonomy Generation  Knowledge Category Assignment  Content Extraction  English Phrase to Clausal Form Logic
  • 5. AI Technologies Discussed  Knowledge Representation  Clausal Form Logic  Semantic Nets  Hypergraphs  Natural Language Processing  Lexical Analysis  Syntactic Analysis  Recursive Descent Parsing  Semantic Analysis  Machine Learning Techniques  Clustering Algorithms  K-Means, Agglomerative and SR Clustering  Similarity Metrics  Jaccard Index  Pearson Correlation
  • 6. Logics used in Artificial Intelligence  Monotonic Logic (standard)  Non-Monotonic Logic (exceptions)  (1) Birds can fly, (2) Penguins are birds, (3) Penguins can't fly  Sorted Logics (types)  Fuzzy Logic (continuous truth values)  Higher-Order Logics (meta-statements)  Modal Logics (may, can, must)  Intentional Logics (know, believe, think)  Temporal Logics (temporal operators)  Point-Based Temporal Logic (moments)  Interval Time Logic (Allen 1986, 13 temporal operators)  Before, Meets, Starts, Finishes, Overlaps, Contains, their inverses, and Equals.  Logics can be expressed in clausal form: (ancestor ?x ?y) ← (parent ?x ?y) and (ancestor ?x ?y) ← (parent ?x ?z) (ancestor ?z ?y)  Note: The variables ?x and ?y are universally quantified, whereas the variable ?z is existentially quantified.
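The two ancestor clauses on slide 6 can be read as rules that derive new facts from parent facts. The following is a minimal Python sketch of that reading, not code from BABAR (which is written in Lisp); the fact base and the function name `ancestors` are invented for illustration.

```python
# Toy fact base; the names are invented for illustration.
PARENT = {("jane", "jill"), ("jill", "tom")}

def ancestors(parent_facts):
    """Apply the two ancestor clauses repeatedly until no new fact appears."""
    # Clause 1: (ancestor ?x ?y) <- (parent ?x ?y)
    derived = set(parent_facts)
    changed = True
    while changed:
        changed = False
        # Clause 2: (ancestor ?x ?y) <- (parent ?x ?z) (ancestor ?z ?y)
        for (x, z) in parent_facts:
            for (z2, y) in list(derived):
                if z == z2 and (x, y) not in derived:
                    derived.add((x, y))
                    changed = True
    return derived

print(sorted(ancestors(PARENT)))
# [('jane', 'jill'), ('jane', 'tom'), ('jill', 'tom')]
```

Here ?z only has to exist: any intermediate person linking ?x to ?y is enough to conclude the ancestor fact.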
  • 7. Clausal Form Logic  Propositional Calculus (PC)  Fully grounded clauses  No variables  (Brother John Jill), (Parent Jane Jill)  (Mother Jane Jill)  First Order Predicate Calculus (FOPC)  Variables  Universally quantified (for all ?x)  Existentially quantified (there exists ?x)  (Elephant ?x) → (Has-Tusks ?x)  Converting 1st order logic to FOPC  Skolem constants (there exists x for all y such that…)  Skolem functions (for each x there exists a y such that…)  Second Order Predicate Calculus  Predicates and clauses can be arguments  Meta statements  Gödel's Incompleteness Theorem  Horn Clauses  Wikipedia: In computational logic, a Horn clause is a clause with at most one positive literal  B ← (A1 ∧ … ∧ An) ≡ ¬A1 ∨ … ∨ ¬An ∨ B  (<LHS> <RHS>) ≡ ((B) (A1 … An))
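The Horn clause equivalence at the end of slide 7 (a rule with body A1 … An and head B is the same as a disjunction with one positive literal) can be checked by brute force over truth values. This is a small Python sketch; the function names are chosen here for illustration.

```python
from itertools import product

def implication(body, head):
    # B <- (A1 and ... and An): false only when every Ai holds and B does not.
    return (not all(body)) or head

def disjunction(body, head):
    # (not A1) or ... or (not An) or B
    return any(not a for a in body) or head

def equivalent(n):
    """Compare both forms on every truth assignment of A1..An and B."""
    return all(
        implication(values[:-1], values[-1]) == disjunction(values[:-1], values[-1])
        for values in product([False, True], repeat=n + 1)
    )

print(equivalent(3))
# True
```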
  • 8. Automated Reasoning  Unification Algorithm  Clausal pattern matching and variable binding  (unify (P ?x ?y) (P A (Q ?x)))  Returns bindings: ((?x A) (?y (Q ?x)))  Instantiation: (P A (Q A))  Rete Algorithm  Charles L. Forgy, CMU, 1974  Addresses the many-many matching problem  Matching facts to rules in rule-based systems  Donald Knuth, Volume 3.  Automated Reasoners  Backward Chaining Reasoners  Work from conclusion → axioms (facts)  Good when state space branching factor is large  Forward Chaining Reasoners  Work from axioms → conclusion  Good when the depth of the state space is large  Mixed methods perform both forward & backward chaining  GPS (Ernst & Newell, 1969)  Island hopping
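The unification example on slide 8 can be reproduced with a short sketch. This is Python rather than the Lisp used in the talk, and it is a textbook-style unifier with an occurs check, not BABAR's implementation. Terms are modelled as nested tuples and variables as strings starting with "?"; both are assumptions of this sketch.

```python
def is_var(term):
    return isinstance(term, str) and term.startswith("?")

def walk(term, bindings):
    """Follow bindings until reaching a non-variable or an unbound variable."""
    while is_var(term) and term in bindings:
        term = bindings[term]
    return term

def occurs(var, term, bindings):
    """True if var appears inside term (prevents circular bindings)."""
    term = walk(term, bindings)
    if term == var:
        return True
    if isinstance(term, tuple):
        return any(occurs(var, t, bindings) for t in term)
    return False

def unify(a, b, bindings=None):
    """Return a dict of variable bindings, or None if a and b do not unify."""
    if bindings is None:
        bindings = {}
    a, b = walk(a, bindings), walk(b, bindings)
    if a == b:
        return bindings
    if is_var(a):
        return None if occurs(a, b, bindings) else {**bindings, a: b}
    if is_var(b):
        return None if occurs(b, a, bindings) else {**bindings, b: a}
    if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
        for x, y in zip(a, b):
            bindings = unify(x, y, bindings)
            if bindings is None:
                return None
        return bindings
    return None

def instantiate(term, bindings):
    """Substitute bindings into term, recursively."""
    term = walk(term, bindings)
    if isinstance(term, tuple):
        return tuple(instantiate(t, bindings) for t in term)
    return term

bindings = unify(("P", "?x", "?y"), ("P", "A", ("Q", "?x")))
print(bindings)
# {'?x': 'A', '?y': ('Q', '?x')}
print(instantiate(("P", "?x", "?y"), bindings))
# ('P', 'A', ('Q', 'A'))
```

The bindings and the instantiated term correspond to the values shown on the slide.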
  • 9. Semantic Nets  Labeled, directed (or not) and weighted (or not) Graphs  Equivalent in expressiveness to FOPC  Graphical representation of 1st order logic.  ISA Hierarchies  Subsumption (Bill Woods)  KL-ONE System: R.J. Brachman and J. Schmolze (1985)  A whole family of KL-ONE like systems  Concepts  Distinguish Primitive and Defined concepts  Only defined concepts are classifiable  Frames  Marvin Minsky, "A Framework for Representing Knowledge", 1974  OO Languages (CLOS) ≡ Frame Language  Think of class definitions as frames whose slots are attribute-value pairs. Pattern matching fills in the slots, at which point a concept becomes defined and classifiable.
  • 10. HyperGraphs  A hypergraph is a graph in which edges are first class objects and can be linked to other edges or vertices.  Hypergraphs are a natural and convenient way of representing sentences and meta-statements.  [Figure: hypergraph over the vertices Jane, Jim, John and Mom with edges labeled Married, Disapproves, Resents, Loves and Likes]  Mom resents the fact that John disapproves of Jane and Jim's marriage.  BABAR uses an in-memory HyperGraph Semantic Net
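The idea that an edge can itself be the endpoint of another edge is easiest to see in code. The sketch below encodes the slide's example sentence in Python; the `Vertex` and `Edge` classes are illustrative assumptions and say nothing about how BABAR's in-memory hypergraph is actually structured.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Vertex:
    name: str

@dataclass(frozen=True)
class Edge:
    label: str
    source: object  # a Vertex or another Edge
    target: object  # a Vertex or another Edge

jane, jim, john, mom = (Vertex(n) for n in ("Jane", "Jim", "John", "Mom"))

married = Edge("Married", jane, jim)             # vertex -> vertex
disapproves = Edge("Disapproves", john, married)  # vertex -> edge
resents = Edge("Resents", mom, disapproves)       # vertex -> edge about an edge

def describe(node):
    """Render a vertex or edge as a nested clause."""
    if isinstance(node, Vertex):
        return node.name
    return f"({node.label} {describe(node.source)} {describe(node.target)})"

print(describe(resents))
# (Resents Mom (Disapproves John (Married Jane Jim)))
```

The nested clause printed at the end is the meta-statement "Mom resents the fact that John disapproves of Jane and Jim's marriage".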
  • 11. Natural Language Processing  Lexical Analysis  Understanding the role and morphological nature of words.  Morphology, Orthography, Part of Speech Tagging  Typically use Lexicons: Dictionaries, etc.  Programs that do this are called Scanners or Lexical Analyzers  ScanGen and LEX on Unix systems for Programming Languages  Syntactic Analysis  Understanding the grammatical nature of groups of words  Programs that do this are called Parsers.  They take tokens produced by scanners/analyzers and apply them to a grammar.  In doing so they typically produce parse trees.  NLP parsing methodologies include:  Top-Down Parsers (recursive descent)  Bottom-Up Parsers  ParseGen and YACC on Unix systems for Programming Languages  Semantic Analysis  Extracting phrase structure from parse trees and producing statements in some knowledge representation language such as clausal-form logic.  KRL: "An Overview of KRL, a Knowledge Representation Language", D.G. Bobrow and T. Winograd, (1977).
  • 12. Lexical Analysis  Morphology  The rules that govern word morphing  foxes ≡ fox+<plural>  Orthography  The rules that govern spelling  Plural of fox ≡ fox+'es'  Transducers  Define languages consisting of pairs of strings  Loosely: Finite Automaton with 2 state transition functions.  Formally: Q (states), Σ (i-alph), Δ (o-alph), q0 (start), F (final), δ(q, w) and σ(q, w).  FST: Finite State Transducer  Surface level, Intermediate level, Lexical level  E.g. foxes → fox+es → fox+N+PL  Parsing, Generating & Translating  Morphological Parser  Lexicons, Morphotactics and Orthographic Rules  Penn Treebank Parts of Speech Tags (50)  Probabilistic Approaches  N-Gram model  Counting word frequency  See Chapter 4 of Jurafsky & Martin, Speech & Language Processing, 2009  Google Translate
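The three levels on slide 12 (surface, intermediate, lexical) can be illustrated with a toy morphological parser for plural nouns. This Python sketch uses plain string rules rather than a real finite state transducer, and the stem list and function names are invented for the example.

```python
# Toy lexicon of noun stems; entries are illustrative only.
NOUN_STEMS = {"fox", "cat", "bus", "dog"}

def intermediate(surface):
    """Surface level -> intermediate level: propose 'stem+suffix' splits."""
    candidates = []
    # Orthographic rule: stems ending in s, x, z, ch or sh take 'es'.
    if surface.endswith("es") and surface[:-2].endswith(("s", "x", "z", "ch", "sh")):
        candidates.append(surface[:-2] + "+es")
    if surface.endswith("s"):
        candidates.append(surface[:-1] + "+s")
    candidates.append(surface)  # the word may already be a bare stem
    return candidates

def lexical(surface):
    """Intermediate level -> lexical level, e.g. fox+es -> fox+N+PL."""
    for form in intermediate(surface):
        stem, _, suffix = form.partition("+")
        if stem in NOUN_STEMS:
            return stem + "+N+" + ("PL" if suffix else "SG")
    return None  # unknown word

print(lexical("foxes"))
# fox+N+PL
```

A word such as "bus" shows why the lexicon is needed: the split "bu+s" is proposed first and rejected because "bu" is not a known stem.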
  • 13. Lexical Analysis in BABAR  Lexicons  Regular words Lexicon  http://www.merriam-webster.com/  Query the site and extract parts of speech  About 50,000 locally cached entries.  Irregular Words Lexicons  Irregular nouns  Irregular verbs  Irregular auxiliaries  Orthographic Rules  reverse engineer morphed words  (analyze-morphed-word <word>)  Analyzes word suffixes then queries MW.
  • 14. Lexical Analysis Example
    KB(5): (parser::analyze-morphed-word "traditionally")
    Loading #P"C:\Projects\trunk\Data\Lexicons\Parts-of-Speech.lisp"
    Loading table from file English-Irregular-Nouns ...
    Loading table from file English-Irregular-Verbs ...
    Loading table from file English-Irregular-Auxiliary ...
    Initializing reverse lexicon table...
    URL: "http://www.merriam-webster.com/dictionary/tradition"
    Returns five values:
      Base Form: "tradition"
      Actual Form: "traditionally"
      Primary POS: :ADVERB
      Additional: NIL
      Complete POS: (:ADVERB)
    Reverse Engineering: traditionally (adverb) → traditional (adjective) → tradition (noun)
    Parts-of-Speech Lexicon currently has about 50,000 entries.
    Approximately one million words in the English language
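The reverse-engineering chain on slide 14 (traditionally → traditional → tradition) amounts to stripping suffixes until a known base form is reached. The Python sketch below imitates that behaviour under loud assumptions: the two suffix rules and the tiny `LEXICON` dictionary are invented stand-ins for BABAR's orthographic rules and its Merriam-Webster lookup, and the return shape is simplified.

```python
# Toy stand-in for the dictionary lookup described on the slides.
LEXICON = {"tradition": "noun", "quick": "adjective"}

# (suffix, part of speech of the morphed word); illustrative rules only.
SUFFIX_RULES = [("ly", "adverb"), ("al", "adjective")]

def analyze_morphed_word(word):
    """Return (base_form, actual_form, primary_pos, derivation) or None."""
    derivation = []
    current = word
    while current not in LEXICON:
        for suffix, pos in SUFFIX_RULES:
            if current.endswith(suffix) and len(current) > len(suffix) + 2:
                derivation.append((current, pos))
                current = current[: -len(suffix)]  # strip and retry
                break
        else:
            return None  # unknown word and no rule applies
    derivation.append((current, LEXICON[current]))
    primary_pos = derivation[0][1]  # POS of the word as it was written
    return current, word, primary_pos, derivation

print(analyze_morphed_word("traditionally"))
# ('tradition', 'traditionally', 'adverb',
#  [('traditionally', 'adverb'), ('traditional', 'adjective'), ('tradition', 'noun')])
```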
  • 15. Syntactic Analysis  Grammars  Productions (grammatical rules)  LHS: A non-terminal symbol  RHS: A disjunction of conjunctions of TS & NTS  Can be recursive  Non-Terminal Symbols  Terminal Symbols (lexicon entries)  Start Symbol  Implicitly Define an AND-OR Tree.  Context-Free Grammars, Attribute Grammars  Parsers  Traverse a grammar while consuming input tokens in an attempt to find a valid path through the grammar that accommodates the input tokens.  Produce parse trees in which the internal nodes are Non-Terminal Symbols (NTS) and the leaves are Terminal Symbols (TS)  Three typical ways to handle non-determinism  Backtracking  Look-ahead  Parallelism
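Slide 15 describes a grammar as an AND-OR tree (each non-terminal is a disjunction of conjunctions) and lists backtracking as one way to handle non-determinism. The following Python sketch shows a recursive descent parser of that kind. The toy grammar, lexicon and function names are assumptions made for the example; Python generators supply the backtracking.

```python
# Each non-terminal maps to a disjunction (list) of conjunctions (tuples).
GRAMMAR = {
    "S": [("NP", "VP")],
    "NP": [("Det", "N"), ("N",)],
    "VP": [("V", "NP"), ("V",)],
}
# Terminal symbols are lexicon entries: word -> category.
LEXICON = {"the": "Det", "elephant": "N", "elephants": "N",
           "grass": "N", "eats": "V", "eat": "V"}

def parse(symbol, tokens, pos):
    """Yield (tree, next_pos) for every way symbol matches tokens[pos:]."""
    if symbol not in GRAMMAR:  # terminal symbol: consume one token
        if pos < len(tokens) and LEXICON.get(tokens[pos]) == symbol:
            yield (symbol, tokens[pos]), pos + 1
        return
    for production in GRAMMAR[symbol]:  # OR node: try each alternative
        yield from parse_sequence(production, tokens, pos, (symbol,))

def parse_sequence(symbols, tokens, pos, tree):
    """AND node: every symbol in the production must match in order."""
    if not symbols:
        yield tree, pos
        return
    for subtree, next_pos in parse(symbols[0], tokens, pos):
        yield from parse_sequence(symbols[1:], tokens, next_pos, tree + (subtree,))

def parse_sentence(text):
    tokens = text.lower().split()
    for tree, end in parse("S", tokens, 0):
        if end == len(tokens):  # accept only parses that use all input
            return tree
    return None

print(parse_sentence("the elephant eats grass"))
# ('S', ('NP', ('Det', 'the'), ('N', 'elephant')),
#       ('VP', ('V', 'eats'), ('NP', ('N', 'grass'))))
```

In "elephants eat" the parser first tries VP → V NP, fails to find a noun phrase, and backtracks to VP → V. Internal nodes of the resulting tree are non-terminal symbols and the leaves are terminal symbols, as the slide describes.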
  • 16. Parsing in BABAR  Implements a Recursive Descent Parser which performs a top-down traversal of the grammar.  Uses backtracking to handle non-determinism  3 Types of objects: tokens, grammars and parse-nodes  Scanner  Creates tokens of one of seven fundamental token classes based on character composition  alphabetic, numeric, special, alpha-numeric, alpha-special, numeric-special and alpha-numeric-special  Implemented using multiple-inheritance:  alphabetic-mixin, numeric-mixin and special-mixin classes  Parser Module (Scanner, Analyzer, Parser)  Implements a set of classes and generic functions geared towards making it easy to develop particular domain-specific parsers.
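The seven token classes on slide 16 are exactly the non-empty combinations of three mixins. The Python sketch below builds them with multiple inheritance to show the idea; the class names, the `-token` suffix and the `make_token` helper are invented here and are not BABAR's CLOS definitions. It assumes non-empty tokens.

```python
from itertools import combinations

class Token:
    def __init__(self, text):
        self.text = text

class AlphabeticMixin:
    pass

class NumericMixin:
    pass

class SpecialMixin:
    pass

MIXINS = (("alpha", AlphabeticMixin),
          ("numeric", NumericMixin),
          ("special", SpecialMixin))

# One class per non-empty combination of mixins: 3 + 3 + 1 = 7 classes.
TOKEN_CLASSES = {}
for size in (1, 2, 3):
    for combo in combinations(MIXINS, size):
        kinds = frozenset(name for name, _ in combo)
        bases = tuple(mixin for _, mixin in combo) + (Token,)
        class_name = "-".join(name for name, _ in combo) + "-token"
        TOKEN_CLASSES[kinds] = type(class_name, bases, {})

def make_token(text):
    """Pick the token class from the kinds of characters in text."""
    kinds = set()
    for ch in text:
        if ch.isalpha():
            kinds.add("alpha")
        elif ch.isdigit():
            kinds.add("numeric")
        else:
            kinds.add("special")
    return TOKEN_CLASSES[frozenset(kinds)](text)

print(type(make_token("v1.0")).__name__)
# alpha-numeric-special-token
```

Because each class inherits from the relevant mixins, code can test `isinstance(token, NumericMixin)` without caring which of the seven classes the token belongs to.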
  • 17. Level 1 (simple)
      Class grammar
      Macro (define-grammar <name> <prods> <preds> &key <class>)
      GF (scan-tokens <string> <grammar> &key <delimiter>)
      GF (parse-tokens <tokens> <grammar>)
    Level 2 (context)
      Class context-grammar
      Macro (define-context-grammar <name> <prods> <preds> <context>)
      Macro (with-grammar-context (<context> <grammar>) &body <body>)
      GF (analyze-tokens <tokens> <grammar>)
    Level 3 (domain)
      Macro (define-lexicon <name> <fields>)
      Macro (define-word-class <word-type> &optional <slots>)
    Level 4 (english)
      Adds english-grammar, scan-tokens, analyze-word-morphology
  • 18. Crawling Wikipedia  Wikipedia has approximately 4 million pages. (initialize-wiki-graph <topic> <depth>)  Returns a graph object (crawl-wiki-topic <topic> <depth>)  Returns a hash table of related topics  For topic=elephant and depth=  #<EQUALP hash-table with 2580 entries> (generate-wiki-graph <hash-table>)  Only create a vertex for keys (pruning)  Non-key related topics are ignored (pruning)  Create a 'related-to edge for every (<key> <related-topic>) pair.  Without pruning: #<Graph Elephant-3: 154833 vertices, 553604 edges>  With pruning: #<Graph Elephant-3: 2580 vertices, 182562 edges> (2.7%)  With pruning: #<Graph Elephant-4: 25577 vertices, 2355810 edges> (0.3%)  A complete graph of n vertices has n(n-1)/2 edges ≡ O(n^2)
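The pruning rule on this slide can be sketched in a few lines of Python. The function names echo the Lisp names but are hypothetical analogues, and the crawl table below is invented. The density helper assumes the percentages on the slide are edge counts divided by n^2, which roughly matches both quoted figures (about 2.7% and about 0.36%).

```python
def generate_wiki_graph(related):
    """Build a pruned graph from a crawl table.

    related maps each crawled topic (key) to the topics its page links to.
    Only keys become vertices; links to non-key topics are dropped.
    """
    vertices = set(related)
    edges = {
        (key, topic)
        for key, topics in related.items()
        for topic in topics
        if topic in vertices  # pruning: ignore non-key related topics
    }
    return vertices, edges


def edge_density(n_vertices, n_edges):
    """Fraction of the n^2 possible directed (key, related-topic) pairs."""
    return n_edges / (n_vertices * n_vertices)


# Hypothetical crawl result, for illustration only.
crawl = {
    "Elephant": ["Mammal", "Ivory", "Africa"],
    "Mammal": ["Elephant", "Animal"],
    "Ivory": ["Elephant"],
}
vertices, edges = generate_wiki_graph(crawl)
```

Here `Africa` and `Animal` are never created as vertices because they were not crawled themselves, which is what keeps the pruned graph at 2,580 vertices instead of 154,833.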
  • 19. Link Name Organization  Internal, External and Intranal hyperlinks  I chose the Elephant page as my entry page for crawling  There are 228 internal links from the Elephant page.  These occur throughout 103 paragraphs of text  Goal: Organize the 228 links into a meaningful taxonomy, for example:
      Elephant
        -> Asian_Elephant
        -> African_Elephant
             -> African_Bush_Elephant
             -> African_Forest_Elephant
    Apply NLP to link names, i.e. parse the link names.  Partition link names into subtopic, supertopic and related.  Subtopic candidate elimination  Partition related topics into strongly and weakly related based on link bi-directionality
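The last step, splitting related topics by link bi-directionality, might look like the following Python sketch. The link table is invented and `partition_related` is a hypothetical name; the assumption is that a topic is "strongly related" when its page links back to the entry topic.

```python
def partition_related(topic, links):
    """Split the topics linked from `topic` into (strong, weak).

    links maps a page to the pages it links to. A related topic is
    strong if it links back to `topic` (bi-directional), else weak.
    """
    strong, weak = [], []
    for other in links.get(topic, ()):
        if topic in links.get(other, ()):
            strong.append(other)
        else:
            weak.append(other)
    return strong, weak


# Hypothetical link table, for illustration only.
links = {
    "Elephant": ["Mammoth", "Ivory", "Kenya"],
    "Mammoth": ["Elephant"],
    "Ivory": ["Elephant", "Tooth"],
    "Kenya": ["Nairobi"],
}
strong, weak = partition_related("Elephant", links)
```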
  • 20. Subtopic Taxonomy Generation Algorithm (generate-subtopic-relations-in-graph <graph>)
    1. Produce Candidates: a list of pairs of concepts in which the first concept is a generalization of the second. This is determined by finding concepts whose parsed tokens form a subset of the tokens produced by parsing the second concept.
    2. Eliminate False Positives: these are eliminated by ensuring that the subjects of the phrases of each set of parsed tokens are identical. E.g. Elephant_Hotel is not a subtopic of Elephant, whereas Hotel_Elephant would be a subtopic of Elephant. This is one place where NLP really adds value.
    3. Replace 'related-to relations with 'generalizes relations.
    4. Eliminate direct 'generalizes relationships between children and non-parent ancestors. E.g. Elephant and North_African_Elephant.
    5. Eliminate Singletons: prune the list of subtrees by eliminating singleton subtrees, leaving those concepts yet to be classified.
    Finally, return a forest of trees, i.e. a list of root nodes.
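Steps 1, 2 and 4 can be sketched in Python as follows. This is a simplification under a stated assumption: instead of parsing each link name to find its subject, the sketch treats the last token of the name as the subject, a head-final heuristic that stands in for BABAR's parser. All function names are hypothetical.

```python
def tokens(concept):
    """Split a link name such as North_African_Elephant into lowercase tokens."""
    return concept.lower().split("_")


def generalizes(general, specific):
    """Steps 1 and 2: token subset test plus a subject check.

    Assumption: the subject of a link name is its last token.
    """
    g, s = tokens(general), tokens(specific)
    return set(g) < set(s) and g[-1] == s[-1]


def subtopic_edges(concepts):
    """Return (parent, child) pairs after removing ancestor shortcuts (step 4)."""
    candidates = {
        (a, b) for a in concepts for b in concepts if generalizes(a, b)
    }
    return {
        (a, c)
        for (a, c) in candidates
        # Drop a -> c when some b sits between them: a -> b -> c.
        if not any((a, b) in candidates and (b, c) in candidates for b in concepts)
    }


concepts = [
    "Elephant",
    "African_elephant",
    "North_African_Elephant",
    "Elephant_Hotel",
    "Elephant_seal",
    "Southern_elephant_seal",
]
edges = subtopic_edges(concepts)
```

Under this heuristic Elephant_Hotel and Elephant_seal are rejected as subtopics of Elephant, while the direct Elephant to North_African_Elephant link is removed in favor of the path through African_elephant. The same heuristic would also file Sea_lion under Lion, so it shares the nomenclature problem the talk points out for lions.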
  • 21. Subtopic Taxonomies  Organize 2,580 topics into a forest of 131 trees consisting of 1,594 nodes (62%) and 986 yet-to-be-classified nodes.
    Elephant Tree
      -> Elephant
      -> Dwarf_elephant
      -> Sri_Lankan_elephant
      -> Year_of_the_Elephant
      -> Sumatran_Elephant
      -> White_elephant
      -> War_elephant
      -> Crushing_by_elephant
      -> Babar_the_Elephant
      -> Indian_Elephant
      -> African_elephant
      -> African_Forest_Elephant
      -> North_African_Elephant
      -> African_Bush_Elephant
      -> Execution_by_elephant
      -> Borneo_pygmy_elephant
      -> Horton_the_Elephant
      -> Asian_elephant
      -> Elmer_the_Patchwork_Elephant
    Elephant_Seal Tree
      -> Elephant_seal
      -> Southern_elephant_seal
      -> Northern_elephant_seal
    Intelligence Tree
      -> Intelligence
      -> Fish_intelligence
      -> Cat_intelligence
      -> Artificial_intelligence
      -> Electronic_Transactions_on_Artificial_Intelligence
      -> Swarm_intelligence
      -> Cephalopod_intelligence
      -> Dinosaur_intelligence
      -> Cetacean_intelligence
      -> Evolution_of_human_intelligence
      -> Elephant_intelligence
      -> Dog_intelligence
      -> Pigeon_intelligence
      -> Primate_intelligence
      -> Bird_intelligence
  • 22. Subtopic Taxonomy Issues
    Lion Tree
      -> Lion
      -> Congolese_Spotted_Lion
      -> Asiatic_Lion
      -> Masai_lion
      -> Barbary_lion
      -> Henry_the_Lion
      -> Sri_Lanka_lion
      -> Nemean_lion
      -> Western_African_lion
      -> Transvaal_Lion
      -> West_African_lion
      -> Tsavo_lion
      -> Southwest_African_Lion
      -> Cape_Lion
      -> Sea_lion
      -> Steller_sea_lion
      -> Australian_sea_lion
      -> South_American_sea_lion
      -> New_Zealand_sea_lion
      -> California_sea_lion
      -> American_lion
      -> White_lion
      -> Kimba_the_White_Lion
      -> Cowardly_Lion
      -> Tiger_versus_lion
      -> European_lion
    With respect to nomenclature purity, Lion_Seal is a better name than Sea_Lion.
  • 23. Clustering  Two Fundamental Perspectives:  Top-Down: Partitioning a set into disjoint subsets  Bottom-Up: Grouping data points into disjoint clusters  Goes hand-in-hand with classification  Typically involves a metric: Euclidean or Manhattan distance  Many, many different algorithms & books.  Some really popular algorithms:  K-Means Clustering (EM, PCA)  Hierarchical Agglomerative Clustering  K-Nearest Neighbor (classification)  SR-Clustering: This is something I (re)invented.  Effectively: The world’s simplest clustering algorithm.
  • 24. K-Means Clustering (1)  Given an initial set of cluster centroids, determine the actual centroids of each cluster via an iterative refinement algorithm.  Each refinement iteration consists of two steps: 1. Computing new data point centroid assignments 2. Computing new centroid positions based on the mean deviation of the data points from the previous centroid positions.  Convergence, divergence, oscillation…  Also known as Lloyd’s Algorithm in CS.
  • 25. K-Means Clustering (2) Wikipedia: Given a set of observations $(x_1, x_2, \ldots, x_n)$, where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k sets (k ≤ n) $S = \{S_1, S_2, \ldots, S_k\}$ so as to minimize the within-cluster sum of squares (WCSS): $\arg\min_S \sum_{i=1}^{k} \sum_{x \in S_i} \lVert x - \mu_i \rVert^2$, where $\mu_i$ is the mean of points in $S_i$.
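  • 26. K-Means Clustering (3)  Assignment Step: Assign each observation to the cluster whose mean is nearest, i.e. define $S_i = \{x_p : \lVert x_p - \mu_i \rVert^2 \le \lVert x_p - \mu_j \rVert^2 \ \forall j,\ 1 \le j \le k\}$.  Update Step: Calculate the new means to be the centroids of the observations in each cluster, i.e. the average along each dimension: $\mu_i = \frac{1}{|S_i|} \sum_{x_j \in S_i} x_j$.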
  • 27. K-Means Clustering (4)  K-Means is *really* a 3 step algorithm  Step 1: Initialize K-Means (non-trivial)  Problem 1: Estimate K  Problem 2: Pick Initial Centroid for each K  Iterative Refinement  Step 2: Centroid Assignments  Step 3: Centroid Update  Many initialization approaches:  Random, Forgy, MacQueen and Kaufman  Performance depends on initialization and instance ordering  Popular because of its robustness  Related to:  EM Algorithm and  Principal Component Analysis (PCA)
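The three steps can be put together in a short, dependency-free Python sketch. It assumes Forgy-style initialization (k observations picked at random as the initial centroids) and squared Euclidean distance; the function names, the fixed seed and the sample points are choices made for this example only.

```python
import random


def sq_dist(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Centroid of a non-empty list of points: average along each dimension."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, k, iterations=100, seed=0):
    """Lloyd's algorithm: returns (centroids, clusters)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # Step 1: Forgy-style initialization
    clusters = []
    for _ in range(iterations):
        # Step 2: assign every point to its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: sq_dist(p, centroids[i]))
            clusters[nearest].append(p)
        # Step 3: move each centroid to the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        updated = [
            mean(c) if c else centroids[i] for i, c in enumerate(clusters)
        ]
        if updated == centroids:  # converged: nothing moved
            break
        centroids = updated
    return centroids, clusters


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centroids, clusters = kmeans(points, k=2)
```

On well-separated data like this the result does not depend on which points are drawn first; on harder data a different seed can give a different partition, which is the initialization sensitivity noted on the slide.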
  • 28. Hierarchical Agglomerative Clustering  The Algorithm 1. Cluster each data point with its nearest neighbor(s) and make that a new data point (cluster). 2. Repeat until some fixed number of clusters is reached.  K-Nearest Neighbor is often used hand-in-hand with agglomerative clustering to compute the nearest neighbor(s).  End up with a tree of clusters (clustering history)  This tree is called a dendrogram  See Chapter 6 of Duda & Hart (SRI, 1973) Pattern Classification & Scene Analysis
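A minimal Python sketch of the bottom-up procedure follows. It makes two assumptions not stated on the slide: only the single closest pair of clusters is merged per iteration, and cluster distance is single linkage (the smallest distance between any two members). The merge history it records is the raw material of a dendrogram. All names are hypothetical.

```python
from itertools import combinations


def linkage(cluster_a, cluster_b, distance):
    """Single linkage: smallest distance between members of two clusters."""
    return min(distance(a, b) for a in cluster_a for b in cluster_b)


def agglomerate(points, target, distance):
    """Merge the two closest clusters until `target` clusters remain.

    Returns (clusters, history), where history lists the merged pairs
    in order and so encodes the clustering tree.
    """
    clusters = [[p] for p in points]  # every point starts as its own cluster
    history = []
    while len(clusters) > target:
        i, j = min(
            combinations(range(len(clusters)), 2),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]], distance),
        )
        history.append((clusters[i], clusters[j]))
        merged = clusters[i] + clusters[j]
        clusters = [c for n, c in enumerate(clusters) if n not in (i, j)]
        clusters.append(merged)
    return clusters, history


clusters, history = agglomerate(
    [1.0, 1.5, 5.0, 5.2, 9.0], target=2, distance=lambda a, b: abs(a - b)
)
```

Other linkage choices (complete, average) only change the `linkage` function.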
  • 29. SR-Clustering (1)  Simple Ray Clustering   Sort of like non-hierarchical agglomerative clustering  Basic Algorithm  For each data point, place it in the correct cluster  If it doesn’t belong to any cluster, create a new cluster consisting of that single data point  Cluster Membership  Defined as being within a certain proximity threshold of every data point in that cluster.  Proximity Metric  The Jaccard Index
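The basic algorithm can be written down almost verbatim. The Python sketch below is an interpretation, not the BABAR code: it places each point in the first cluster whose every member is within the threshold, and it uses the six similarity values (in percent) from the matrix on slide 33 with the threshold of 20 from slide 34. Because it uses only six topics and single membership, its grouping need not match slide 34.

```python
# Pairwise Jaccard similarities in percent, copied from slide 33.
SIMILARITY = {
    frozenset(("African", "Asian")): 38.46,
    frozenset(("African", "Indian")): 21.05,
    frozenset(("African", "Babar")): 4.35,
    frozenset(("African", "Horton")): 6.82,
    frozenset(("African", "War")): 7.94,
    frozenset(("Asian", "Indian")): 37.74,
    frozenset(("Asian", "Babar")): 4.00,
    frozenset(("Asian", "Horton")): 6.25,
    frozenset(("Asian", "War")): 20.00,
    frozenset(("Indian", "Babar")): 6.90,
    frozenset(("Indian", "Horton")): 7.14,
    frozenset(("Indian", "War")): 24.39,
    frozenset(("Babar", "Horton")): 28.57,
    frozenset(("Babar", "War")): 7.14,
    frozenset(("Horton", "War")): 7.41,
}


def similarity(a, b):
    """Look up the symmetric similarity of two topics."""
    return SIMILARITY[frozenset((a, b))]


def sr_cluster(points, similarity, threshold):
    """Single pass: join the first compatible cluster or start a new one."""
    clusters = []
    for p in points:
        for cluster in clusters:
            # Membership: within the threshold of EVERY member of the cluster.
            if all(similarity(p, q) >= threshold for q in cluster):
                cluster.append(p)
                break
        else:
            clusters.append([p])  # no compatible cluster found
    return clusters


topics = ["African", "Asian", "Indian", "Babar", "Horton", "War"]
clusters = sr_cluster(topics, similarity, threshold=20)
```

With this input the sketch yields three clusters: the three species pages, the two fictional elephants, and War on its own. War is close to Asian and Indian but not to African, so the all-members rule keeps it out of the first cluster. The result depends on the order in which points are visited.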
  • 30. Recommender Systems  Used by Netflix, Amazon, etc…  Objects: Users, Items & Preferences  User vs. Item based recommendations  Former aka collaborative filtering  Mixed method recommendations  Based on User Similarity and/or Item Similarity  Jaccard Index takes into account dissimilarity and does not require preference measurements.  Apache Mahout (leverages Hadoop)
  • 31. Jaccard Index  Defines a Similarity Metric between two sets  Wikipedia: The Jaccard coefficient measures similarity between sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets: $J(A, B) = \frac{|A \cap B|}{|A \cup B|}$  Jaccard Distance: $d_J(A, B) = 1 - J(A, B)$
  • 32. Another Similarity Metric  Pearson Correlation Coefficient  Wikipedia: Defined as the covariance of the two variables divided by the product of their standard deviations: $\rho_{X,Y} = \frac{\operatorname{cov}(X, Y)}{\sigma_X \sigma_Y}$
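For comparison with the set-based Jaccard index, here is a small Python sketch of the Pearson coefficient over two equal-length lists of numbers, for example two users' ratings of the same items. The function name is hypothetical, and the sketch assumes neither list is constant (otherwise the denominator is zero).

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    # The 1/n factors of covariance and standard deviations cancel out.
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return covariance / (spread_x * spread_y)
```

Unlike the Jaccard index, this metric needs a numeric preference value for every item, which is the trade-off mentioned on the recommender systems slide.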
  • 33. (compute-similarity-matrix <topics>)  Computes the Jaccard index for pairs of topics by using the related topics of each topic as the sets to be compared.
                African   Asian  Indian   Babar  Horton     War
      African    100.00   38.46   21.05    4.35    6.82    7.94
      Asian       38.46  100.00   37.74    4.00    6.25   20.00
      Indian      21.05   37.74  100.00    6.90    7.14   24.39
      Babar        4.35    4.00    6.90  100.00   28.57    7.14
      Horton       6.82    6.25    7.14   28.57  100.00    7.41
      War          7.94   20.00   24.39    7.14    7.41  100.00
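A matrix like this could be produced along the following lines. The Python sketch takes a table mapping each topic to the set of its related topics and returns Jaccard similarities in percent; the sets below are invented and the function names are hypothetical renderings of the Lisp name.

```python
def jaccard(a, b):
    """Size of the intersection divided by the size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def compute_similarity_matrix(related):
    """Map every ordered pair of topics to its Jaccard index in percent."""
    topics = sorted(related)
    return {
        (s, t): round(100 * jaccard(related[s], related[t]), 2)
        for s in topics
        for t in topics
    }


# Hypothetical related-topic sets, for illustration only.
related = {
    "African": {"Ivory", "Savanna", "Mammal", "Tusk"},
    "Asian": {"Ivory", "Mammal", "India"},
    "Babar": {"France", "Book"},
}
matrix = compute_similarity_matrix(related)
```

The matrix is symmetric with 100 on the diagonal, as in the table on the slide.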
  • 34. (cluster-subtopics <subtopics> <matrix> <threshold>)
    Threshold = 20
    Cluster 1: Asian_elephant(49), African_elephant(60)
    Cluster 2: Babar_the_Elephant(7), Horton_the_Elephant(5), Elmer_the_Patchwork_Elephant(4)
    Cluster 3: Asian_elephant(49), Indian_Elephant(18), Sri_Lankan_elephant(12), Sumatran_Elephant(11), Borneo_pygmy_elephant(3)
    Cluster 4: War_elephant(22), Execution_by_elephant(5), Crushing_by_elephant(4)
    Cluster 5: Year_of_the_Elephant(8)
    Cluster 6: Dwarf_elephant(24)
    Cluster 7: White_elephant(10)
  • 35. Knowledge Categories (1)  Human schooling as a decade(s)-long knowledge acquisition process  Spanning kindergarten to post-doctoral work  Idea is to use grade school topics as initial knowledge categories.  Science, History, Geography, Literature & Art  Goal: Assign categories to subtopic clusters  Use Jaccard Index to determine the category  Automatically create subtopic category names, e.g. Babar -> Literature_Elephant
  • 36. (compute-cluster-categories <clusters>)  Wiki Crawl each Knowledge Category (pre-run)  Compute subtopics of each knowledge category  Compute a category relevancy vector for each cluster member  Combine the relevancy vectors of each cluster to compute a relevancy vector for the cluster  Assign a category to the cluster
  • 37. (compute-cluster-categories <clusters>)
    ((((:SCIENCE 0.47666672) (:HISTORY 0.44666672))
      (#<Concept(49): Asian_elephant> #<Concept(60): African_elephant>))
     (((:SCIENCE 0.39) (:GEOGRAPHY 0.37800002))
      (#<Concept(3): Borneo_pygmy_elephant> #<Concept(49): Asian_elephant> #<Concept(18): Indian_Elephant>
       #<Concept(12): Sri_Lankan_elephant> #<Concept(11): Sumatran_Elephant>))
     (((:ART 0.33333334) (:GEOGRAPHY 0.30666667))
      (#<Concept(7): Babar_the_Elephant> #<Concept(5): Horton_the_Elephant> #<Concept(4): Elmer_the_Patchwork_Elephant>))
     (((:HISTORY 0.6) (:GEOGRAPHY 0.46))
      (#<Concept(8): Year_of_the_Elephant>))
     (((:GEOGRAPHY 0.72333336) (:HISTORY 0.43666664))
      (#<Concept(5): Execution_by_elephant> #<Concept(22): War_elephant> #<Concept(4): Crushing_by_elephant>))
     (((:GEOGRAPHY 0.86) (:SCIENCE 0.5))
      (#<Concept(24): Dwarf_elephant>))
     (((:SCIENCE 0.69) (:ART 0.49))
      (#<Concept(10): White_elephant>)))
  • 38. Individual Subtopic Categories  The following shows the knowledge category relevancies for some of the 16 subtopics of Elephant and helps explain the results on the previous slide.
    (#<Concept(7): Babar_the_Elephant>
     ((:LITERATURE 0.44) (:ART 0.25) (:GEOGRAPHY 0.23) (:HISTORY 0.2) (:SCIENCE 0.17)))
    (#<Concept(4): Elmer_the_Patchwork_Elephant>
     ((:ART 0.25) (:GEOGRAPHY 0.23) (:LITERATURE 0.22) (:HISTORY 0.2) (:SCIENCE 0.17)))
    (#<Concept(5): Horton_the_Elephant>
     ((:ART 0.5) (:GEOGRAPHY 0.46) (:HISTORY 0.4) (:SCIENCE 0.35) (:LITERATURE 0.22)))
    (#<Concept(60): African_elephant>
     ((:ART 1.03) (:SCIENCE 0.91) (:HISTORY 0.85) (:GEOGRAPHY 0.77) (:LITERATURE 0.37)))
    (#<Concept(49): Asian_elephant>
     ((:HISTORY 0.7) (:SCIENCE 0.62) (:GEOGRAPHY 0.59) (:ART 0.42) (:LITERATURE 0.19)))
    (#<Concept(22): War_elephant>
     ((:HISTORY 0.93) (:GEOGRAPHY 0.85) (:LITERATURE 0.41) (:ART 0.23) (:SCIENCE 0.16)))
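The talk does not say how the member vectors are combined into a cluster vector. The Python sketch below assumes a plain arithmetic mean per category, an assumption that happens to reproduce the :ART and :GEOGRAPHY values reported on slide 37 for the Babar, Horton and Elmer cluster from the individual values on this slide. The function name is hypothetical.

```python
# Relevancy vectors copied from slide 38.
RELEVANCY = {
    "Babar_the_Elephant": {
        "LITERATURE": 0.44, "ART": 0.25, "GEOGRAPHY": 0.23,
        "HISTORY": 0.2, "SCIENCE": 0.17,
    },
    "Horton_the_Elephant": {
        "ART": 0.5, "GEOGRAPHY": 0.46, "HISTORY": 0.4,
        "SCIENCE": 0.35, "LITERATURE": 0.22,
    },
    "Elmer_the_Patchwork_Elephant": {
        "ART": 0.25, "GEOGRAPHY": 0.23, "LITERATURE": 0.22,
        "HISTORY": 0.2, "SCIENCE": 0.17,
    },
}


def cluster_categories(members, relevancy, top=2):
    """Average the member vectors and return the `top` best categories.

    Assumption: the cluster vector is the per-category arithmetic mean.
    """
    categories = relevancy[members[0]].keys()
    combined = {
        c: sum(relevancy[m][c] for m in members) / len(members)
        for c in categories
    }
    ranked = sorted(combined.items(), key=lambda item: item[1], reverse=True)
    return ranked[:top]


best = cluster_categories(
    ["Babar_the_Elephant", "Horton_the_Elephant", "Elmer_the_Patchwork_Elephant"],
    RELEVANCY,
)
```

The same averaging does not reproduce the slide 37 values for the Asian and African elephant cluster from the numbers shown here, so the actual combination rule in BABAR may differ.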
  • 39. Categorized Subtopic Clusters
    Elephant
      ART_Elephant
        Elmer_the_Patchwork_Elephant
        Horton_the_Elephant
        Babar_the_Elephant
      GEOGRAPHY_Elephant
        Dwarf_elephant
        Crushing_by_elephant
        War_elephant
        Execution_by_elephant
      HISTORY_Elephant
        Year_of_the_Elephant
      SCIENCE_Elephant
        African_elephant
          African_Forest_Elephant
          African_Bush_Elephant
        Asian_elephant
        White_elephant
        Sumatran_Elephant
        Sri_Lankan_elephant
        Indian_Elephant
        Borneo_pygmy_elephant
  • 40. Related Topics Associations  Associate related topics to subtopic clusters using Jaccard Index  Use associations to create related topic clusters (find-compatible-clusters <strongly-related-topics> <clusters>) (( #<Concept(60): African_elephant> ((#<Concept(24): Dwarf_elephant>) #<Concept(49): Asian_elephant>) (#<Concept(66): Mammoth> (#<Concept(10): Elephant_intelligence> #<Concept(25): Mastodon> #<Concept(275): Genus> #<Concept(103): Animal_cognition> #<Concept(62): Afrotheria> #<Concept(4): Elephant_tusk> #<Concept(86): Gestation> #<Concept(15): African> #<Concept(749): Eutheria> #<Concept(102): Proboscidea> #<Concept(8): Gomphotherium> #<Concept(96): Mammalia> #<Concept(27): Tooth> #<Concept(876): Mammal> #<Concept(8): Tooth_development>)) #<Concept(143): Hippopotamus> #<Concept(590): Lion> #<Concept(10): Loxodonta>) ((#<Concept(7): Babar_the_Elephant> #<Concept(5): Horton_the_Elephant> #<Concept(4): Elmer_the_Patchwork_Elephant>) ((#<Concept(22): War_elephant> #<Concept(5): Execution_by_elephant> #<Concept(6): List_of_fictional_elephants> #<Concept(4): Crushing_by_elephant>) #<Concept(5): List_of_elephants_in_mythology_and_religion> #<Concept(5): Pinnawala> (#<Concept(55): Ivory> #<Concept(3): Katy_Payne> #<Concept(77): Kenya> #<Concept(11): Infrasound> #<Concept(31): Grief> #<Concept(56): Incisor> #<Concept(8): History_of_elephants_in_Europe>)) #<Concept(14): Jeheskel_Shoshani> #<Concept(6): Aanayoottu> )
  • 41. (sentence-to-clause <sentence>)
    English sentence string
      -> Scanner -> tokens
      -> Analyzer -> morphologically analyzed words
      -> Parser -> parse tree
      -> Phrase extractor -> phrases (flattened parse tree)
      -> Semantic analyzer -> frame for subject, verb, object and prepositional phrases
      -> Clause generator -> clause objects
  • 42. Sample Extracted Clauses  (HASA "Asian elephant species" "disjunct distributions")  (ISA "Elephants" "herbivores")  (HASA "African Elephants" "three nails")  (HASA "Indian Elephants" "four nails")  (HASA "female African Elephants" "large tusks")  (ISA "Elephants" "large land mammals")
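To make the shape of these clauses concrete, here is a deliberately simplified Python stand-in. It does not follow the scanner, analyzer and parser pipeline of the talk; it only matches two hard-coded sentence patterns with regular expressions. The patterns and the function name are assumptions made for this example.

```python
import re

# Hypothetical patterns: "<subject> is/are <object>" and "<subject> has/have <object>".
PATTERNS = [
    (re.compile(r"^(?P<subj>.+?) (?:is|are) (?:an? )?(?P<obj>.+?)\.?$"), "ISA"),
    (re.compile(r"^(?P<subj>.+?) (?:has|have) (?P<obj>.+?)\.?$"), "HASA"),
]


def sentence_to_clause(sentence):
    """Return (relation, subject, object) for a matching sentence, else None."""
    text = sentence.strip()
    for pattern, relation in PATTERNS:
        match = pattern.match(text)
        if match:
            return (relation, match.group("subj"), match.group("obj"))
    return None
```

For instance, `sentence_to_clause("Elephants are herbivores.")` yields an ISA clause and `sentence_to_clause("African Elephants have three nails")` yields a HASA clause. Anything beyond these two patterns, such as relative clauses or prepositional phrases, needs the real grammar-driven pipeline.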
  • 43. Things Overlooked  Wiki Page Contents Pane  Provides page taxonomy  Provides category names  Provides related topic names  Concept Weights
  • 44. Future Direction  Enhance English Parser  Incorporate Variables into Semantic Net  Leverage topic weights  Work on language generation  Produce Wiki Summary Pages  Knowledge Queries  Develop Client Side Browser  Top Menu Bar Knowledge Categories  RHS Dynamic Subtopic Tree  LHS Wiki Page Content Pane
  • 45. (references)  Speech and Language Processing, Jurafsky and Martin  Artificial Intelligence: A Modern Approach, Russell and Norvig  Principles of Semantic Networks, edited by John F. Sowa, Morgan Kaufmann, 1991  Machine Learning, Tom Mitchell, 1997  Pattern Classification and Scene Analysis, Duda and Hart, 1973  Algorithms of the Intelligent Web, Marmanis and Babenko, 2009