Knowledge Extraction

This paper describes BABAR, a knowledge extraction and representation system, implemented entirely in CLOS, that is primarily geared toward organizing and reasoning about knowledge extracted from the Wikipedia website. The system combines natural language processing techniques, knowledge representation paradigms, and machine learning algorithms. BABAR is an ongoing independent research project that, when sufficiently mature, may provide various commercial opportunities.
BABAR uses natural language processing to parse both page names and page contents. It automatically generates Wikipedia topic taxonomies, thus providing a model for organizing the approximately 4,000,000 existing Wikipedia pages. It uses similarity metrics to establish concept relevancy, and clustering algorithms to group topics by semantic relevancy. Novel algorithms are presented that combine approaches from machine learning and recommender systems. The system also generates a knowledge hypergraph, which will ultimately be used in conjunction with an automated reasoner to answer questions about particular topics.


Transcript

  • 1. (Knowledge Extraction)
    Raymond Pierre de Lacaze (RPL)
    LispNYC, July 10th, 2012
    rpl@lispnyc.org
  • 2. (John McCarthy)
    September 4th, 1927 – October 24th, 2011
    This talk is dedicated to the memory of John McCarthy:
    - Inventor of the Lisp language (1958)
    - Founder of Artificial Intelligence
    - Winner of the Turing Award (1971)
    - Designer of Elephant 2000, a programming language based on speech acts
      http://www-formal.stanford.edu/jmc/elephant/elephant.html
    May he rest in peace.
  • 3. BABAR: Project Goals
    - Leverage Wikipedia as a knowledge base
    - Infer infrastructure & extract content
      - Create wiki topic taxonomies
      - Generate knowledge hypergraphs
    - Investigate conceptual relevance metrics
    - Generate knowledge summaries
    - Answer knowledge base queries
    - Evolve a new generation of web browsers: knowledge browsers
  • 4. Overview
    - Brief overview of AI
      - Knowledge Representation
      - Natural Language Processing
    - Examine specific algorithms
      - Semantic nets & hypergraphs
      - Recursive descent parsing
      - Clustering algorithms
      - Similarity metrics
    - Describe aspects of the BABAR system
      - Semantic link analysis
      - Automatic topic taxonomy generation
      - Knowledge category assignment
      - Content extraction
      - English phrases to clausal form logic
  • 5. AI Technologies Discussed
    - Knowledge Representation
      - Clausal form logic
      - Semantic nets
      - Hypergraphs
    - Natural Language Processing
      - Lexical analysis
      - Syntactic analysis
      - Recursive descent parsing
      - Semantic analysis
    - Machine Learning Techniques
      - Clustering algorithms: K-means, agglomerative and SR clustering
    - Similarity Metrics
      - Jaccard index
      - Pearson correlation
  • 6. Logics Used in Artificial Intelligence
    - Monotonic Logic (standard)
    - Non-Monotonic Logic (exceptions)
      - (1) Birds can fly, (2) Penguins are birds, (3) Penguins can't fly
    - Sorted Logics (types)
    - Fuzzy Logic (continuous truth values)
    - Higher-Order Logics (meta-statements)
      - Modal Logics (may, can, must)
      - Intentional Logics (know, believe, think)
    - Temporal Logics (temporal operators)
      - Point-Based Temporal Logic (moments)
      - Interval Time Logic (Allen 1986, 13 temporal operators)
        - Before, Meets, Starts, Finishes, Overlaps, Contains, their inverses, and Equals
    - Logics can be expressed in clausal form:
        (ancestor ?x ?y) ← (parent ?x ?y)
        (ancestor ?x ?y) ← (parent ?x ?z) ∧ (ancestor ?z ?y)
      Note: the variables ?x and ?y are universally quantified, whereas the variable ?z is existentially quantified.
  • 7. Clausal Form Logic
    - Propositional Calculus (PC)
      - Fully grounded clauses, no variables
      - (Brother John Jill), (Parent Jane Jill), (Mother Jane Jill)
    - First Order Predicate Calculus (FOPC)
      - Variables
        - Universally quantified (for all ?x)
        - Existentially quantified (there exists ?x)
      - (Elephant ?x) → (Has-Tusks ?x)
      - Converting first-order logic to clausal form:
        - Skolem constants (there exists x for all y such that…)
        - Skolem functions (for each x there exists a y such that…)
    - Second Order Predicate Calculus
      - Predicates and clauses can be arguments
      - Meta-statements
      - Gödel's Incompleteness Theorem
    - Horn Clauses
      - Wikipedia: in computational logic, a Horn clause is a clause with at most one positive literal
      - B ← (A1 ∧ … ∧ An) ≡ ¬A1 ∨ … ∨ ¬An ∨ B
      - (<LHS> ← <RHS>) ≡ ((B) (A1 … An))
  • 8. Automated Reasoning
    - Unification Algorithm
      - Clausal pattern matching and variable binding
      - (unify (P ?x ?y) (P A (Q ?x)))
        returns the bindings ((?x A) (?y (Q ?x)))
        instantiation: (P A (Q A))
    - Rete Algorithm
      - Charles L. Forgy, CMU, 1974
      - Addresses the many-to-many matching problem: matching facts to rules in rule-based systems
      - See also Donald Knuth, Volume 3
    - Automated Reasoners
      - Backward-chaining reasoners
        - Work from conclusion → axioms (facts)
        - Good when the state-space branching factor is large
      - Forward-chaining reasoners
        - Work from axioms → conclusion
        - Good when the state space is deep
      - Mixed methods perform both forward & backward chaining
        - GPS (Ernst & Newell, 1969)
        - Island hopping
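    Since unification does the heavy lifting here, a minimal sketch in Common Lisp may help. This is illustrative only, not BABAR's implementation, and it omits the occurs check:

        ;; Variables are symbols whose names begin with #\?, as in the slides.
        (defun variable-p (x)
          (and (symbolp x)
               (plusp (length (symbol-name x)))
               (char= (char (symbol-name x) 0) #\?)))

        (defun unify (x y &optional (bindings '()))
          "Return an alist of bindings that unifies X and Y, or :FAIL."
          (cond ((eq bindings :fail) :fail)
                ((equal x y) bindings)
                ((variable-p x) (unify-variable x y bindings))
                ((variable-p y) (unify-variable y x bindings))
                ((and (consp x) (consp y))
                 (unify (cdr x) (cdr y) (unify (car x) (car y) bindings)))
                (t :fail)))

        (defun unify-variable (var term bindings)
          "Bind VAR to TERM, first chasing any binding VAR already has."
          (let ((binding (assoc var bindings)))
            (if binding
                (unify (cdr binding) term bindings)
                (cons (cons var term) bindings))))

        ;; (unify '(P ?x ?y) '(P A (Q ?x)))
        ;; => ((?Y Q ?X) (?X . A))   ; i.e. ?y = (Q ?x) and ?x = A, as on the slide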
  • 9. Semantic Nets
    - Labeled, directed (or not) and weighted (or not) graphs
    - Equivalent in expressiveness to FOPC: a graphical representation of first-order logic
    - ISA hierarchies
    - Subsumption (Bill Woods)
    - KL-ONE system: R.J. Brachman and J. Schmolze (1985), and a whole family of KL-ONE-like systems
    - Concepts
      - Distinguish primitive and defined concepts
      - Only defined concepts are classifiable
    - Frames
      - Marvin Minsky, "A Framework for Representing Knowledge", 1974
      - OO languages (CLOS) ≡ frame languages
      - Think of class definitions as frames whose slots are attribute-value pairs; pattern matching fills in the slots, at which point a concept becomes defined and classifiable.
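    To make the CLOS-as-frames analogy concrete, here is a small illustrative sketch; the class and slot names are invented for the example and are not BABAR's:

        ;; A frame is a class; its slots hold the attribute-value pairs.
        (defclass concept ()
          ((name  :initarg :name  :accessor concept-name)
           (isa   :initarg :isa   :accessor concept-isa   :initform nil)   ; ISA links
           (slots :initarg :slots :accessor concept-slots :initform nil))) ; attribute-value pairs

        (defun defined-p (concept required-attributes)
          "A concept becomes defined (hence classifiable) once every required slot is filled."
          (every (lambda (attribute) (assoc attribute (concept-slots concept)))
                 required-attributes))

        ;; (defined-p (make-instance 'concept :name 'elephant :isa '(mammal)
        ;;                           :slots '((has-tusks . t) (legs . 4)))
        ;;            '(has-tusks legs))
        ;; => T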
  • 10. HyperGraphs
    - A hypergraph is a graph in which edges are first-class objects and can be linked to other edges or vertices.
    - Hypergraphs are a natural and convenient way of representing sentences and meta-statements.
    - [Diagram: a hypergraph with vertices Mom, John, Jane and Jim, and edges Married, Loves, Likes, Disapproves and Resents, representing "Mom resents the fact that John disapproves of Jane and Jim's marriage."]
    - BABAR uses an in-memory hypergraph → semantic net.
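    A minimal sketch of such a structure, in which an edge's endpoints may be vertices or other edges (illustrative; the slides do not show BABAR's actual representation):

        (defclass vertex ()
          ((label :initarg :label :accessor label)))

        (defclass edge ()
          ((label :initarg :label :accessor label)
           ;; Endpoints may be vertices OR other edges; this is what makes
           ;; edges first-class and allows meta-statements.
           (endpoints :initarg :endpoints :accessor endpoints)))

        ;; "Mom resents the fact that John disapproves of Jane and Jim's marriage."
        (let* ((mom  (make-instance 'vertex :label "Mom"))
               (john (make-instance 'vertex :label "John"))
               (jane (make-instance 'vertex :label "Jane"))
               (jim  (make-instance 'vertex :label "Jim"))
               (marriage    (make-instance 'edge :label "Married"
                                           :endpoints (list jane jim)))
               (disapproval (make-instance 'edge :label "Disapproves"
                                           :endpoints (list john marriage))))
          (make-instance 'edge :label "Resents"
                         :endpoints (list mom disapproval)))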
  • 11. Natural Language Processing
    - Lexical Analysis
      - Understanding the role and morphological nature of words
      - Morphology, orthography, part-of-speech tagging
      - Typically uses lexicons: dictionaries, etc.
      - Programs that do this are called scanners or lexical analyzers
      - ScanGen, and LEX on Unix systems, for programming languages
    - Syntactic Analysis
      - Understanding the grammatical nature of groups of words
      - Programs that do this are called parsers
      - They take tokens produced by scanners/analyzers and apply them to a grammar, typically producing parse trees
      - NLP parsing methodologies include:
        - Top-down parsers (recursive descent)
        - Bottom-up parsers
      - ParseGen, and YACC on Unix systems, for programming languages
    - Semantic Analysis
      - Extracting phrase structure from parse trees and producing statements in some knowledge representation language such as clausal-form logic
      - KRL: "An Overview of KRL, a Knowledge Representation Language", D.G. Bobrow and T. Winograd (1977)
  • 12. Lexical Analysis
    - Morphology
      - The rules that govern word morphing: foxes ≡ fox + <plural>
    - Orthography
      - The rules that govern spelling: plural of fox ≡ fox + 'es'
    - Transducers
      - Define languages consisting of pairs of strings
      - Loosely: a finite automaton with two state-transition functions
      - Formally: Q (states), Σ (input alphabet), Δ (output alphabet), q0 (start), F (final states), δ(q, w) and σ(q, w)
      - FST: Finite State Transducer
        - Surface level, intermediate level, lexical level
        - E.g. foxes → fox+es → fox+N+PL
        - Parsing, generating & translating
    - Morphological Parser
      - Lexicons, morphotactics and orthographic rules
      - Penn Treebank part-of-speech tags (50)
    - Probabilistic Approaches
      - N-gram model: counting word frequency
      - See Chapter 4 of Jurafsky & Martin, Speech and Language Processing, 2009
      - Google Translate
  • 13. Lexical Analysis in BABAR
    - Lexicons
      - Regular-words lexicon
        - http://www.merriam-webster.com/
        - Query the site and extract parts of speech
        - About 50,000 locally cached entries
      - Irregular-words lexicons
        - Irregular nouns, irregular verbs, irregular auxiliaries
    - Orthographic Rules
      - Reverse-engineer morphed words: (analyze-morphed-word <word>)
      - Analyzes word suffixes, then queries Merriam-Webster
  • 14. Lexical Analysis Example
    KB(5): (parser::analyze-morphed-word "traditionally")
    Loading #P"C:\Projects\trunk\Data\Lexicons\Parts-of-Speech.lisp"
    Loading table from file English-Irregular-Nouns ...
    Loading table from file English-Irregular-Verbs ...
    Loading table from file English-Irregular-Auxiliary ...
    Initializing reverse lexicon table...
    URL: "http://www.merriam-webster.com/dictionary/tradition"
    Returns five values:
      Base Form:    "tradition"
      Actual Form:  "traditionally"
      Primary POS:  :ADVERB
      Additional:   NIL
      Complete POS: (:ADVERB)
    Reverse engineering: traditionally (adverb) → traditional (adjective) → tradition (noun)
    The parts-of-speech lexicon currently has about 50,000 entries; there are approximately one million words in the English language.
  • 15. Syntactic Analysis
    - Grammars
      - Productions (grammatical rules)
        - LHS: a non-terminal symbol
        - RHS: a disjunction of conjunctions of terminal and non-terminal symbols
        - Can be recursive
      - Non-terminal symbols
      - Terminal symbols (lexicon entries)
      - Start symbol
      - Implicitly define an AND-OR tree
      - Context-free grammars, attribute grammars
    - Parsers
      - Traverse a grammar while consuming input tokens, attempting to find a valid path through the grammar that accommodates the input tokens
      - Produce parse trees in which the internal nodes are non-terminal symbols (NTS) and the leaves are terminal symbols (TS)
      - Three typical ways to handle non-determinism: backtracking, look-ahead, parallelism
  • 16. Parsing in BABAR
    - Implements a recursive descent parser, which performs a top-down traversal of the grammar
    - Uses backtracking to handle non-determinism
    - Three types of objects: tokens, grammars and parse-nodes
    - Scanner
      - Creates seven fundamental token classes based on character composition: alphabetic, numeric, special, alpha-numeric, alpha-special, numeric-special and alpha-numeric-special
      - Implemented using multiple inheritance: alphabetic-mixin, numeric-mixin and special-mixin classes
    - Parser Module (Scanner, Analyzer, Parser)
      - Implements a set of classes and generic functions geared toward making it easy to develop domain-specific parsers
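    To illustrate the technique in isolation (a generic sketch over a toy grammar, not BABAR's parser module), here is a recursive-descent parser that backtracks across grammar alternatives; tokens are (word . part-of-speech) pairs such as a scanner/analyzer might produce:

        (defparameter *toy-grammar*
          ;; Each non-terminal maps to a list of alternative right-hand sides.
          '((s  ((np vp)))
            (np ((:det :noun) (:noun)))
            (vp ((:verb np) (:verb)))))

        (defun parse-symbol (symbol tokens)
          "Try to parse SYMBOL from TOKENS; return (tree . remaining) or NIL."
          (let ((alternatives (second (assoc symbol *toy-grammar*))))
            (if alternatives
                ;; Non-terminal: try each alternative in turn (backtracking).
                (loop for rhs in alternatives
                      for result = (parse-sequence rhs tokens)
                      when result
                        return (cons (cons symbol (car result)) (cdr result)))
                ;; Terminal: match the token's part-of-speech tag.
                (when (and tokens (eq (cdr (first tokens)) symbol))
                  (cons (first tokens) (rest tokens))))))

        (defun parse-sequence (rhs tokens)
          "Parse the symbols of RHS in order; return (trees . remaining) or NIL."
          (if (null rhs)
              (cons '() tokens)
              (let ((head (parse-symbol (first rhs) tokens)))
                (when head
                  (let ((tail (parse-sequence (rest rhs) (cdr head))))
                    (when tail
                      (cons (cons (car head) (car tail)) (cdr tail))))))))

        ;; (parse-symbol 's '(("the" . :det) ("elephant" . :noun) ("trumpets" . :verb)))
        ;; => ((S (NP ("the" . :DET) ("elephant" . :NOUN)) (VP ("trumpets" . :VERB))))
        ;; Note the backtracking: VP first tries (:verb np), fails on the empty
        ;; remainder, and falls back to the bare (:verb) alternative.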
  • 17. Level 1 (simple)
      Class: grammar
      Macro: (define-grammar <name> <prods> <preds> &key <class>)
      GF:    (scan-tokens <string> <grammar> &key <delimiter>)
      GF:    (parse-tokens <tokens> <grammar>)
    Level 2 (context)
      Class: context-grammar
      Macro: (define-context-grammar <name> <prods> <preds> <context>)
      Macro: (with-grammar-context (<context> <grammar>) &body <body>)
      GF:    (analyze-tokens <tokens> <grammar>)
    Level 3 (domain)
      Macro: (define-lexicon <name> <fields>)
      Macro: (define-word-class <word-type> &optional <slots>)
    Level 4 (english)
      Adds english-grammar, scan-tokens, analyze-word-morphology
  • 18. Crawling Wikipedia
    - Wikipedia has approximately 4 million pages.
    - (initialize-wiki-graph <topic> <depth>) returns a graph object
    - (crawl-wiki-topic <topic> <depth>) returns a hash table of related topics
      - For topic = elephant and depth = 3: #<EQUALP hash-table with 2580 entries>
    - (generate-wiki-graph <hash-table>)
      - Only create a vertex for keys (pruning); non-key related topics are ignored (pruning)
      - Create a 'related-to edge for every (<key> <related-topic>) pair
    - Without pruning: #<Graph Elephant-3: 154833 vertices, 553604 edges>
    - With pruning:    #<Graph Elephant-3: 2580 vertices, 182562 edges> (2.7%)
    - With pruning:    #<Graph Elephant-4: 25577 vertices, 2355810 edges> (0.3%)
    - A complete graph of n vertices has n(n-1)/2 edges ≡ O(n²)
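    The slides don't show crawl-wiki-topic's internals; a plausible breadth-first sketch of its shape follows, where get-related-topics is a hypothetical stand-in for whatever function extracts the internal link names from a Wikipedia page:

        (defun crawl-topic (topic depth get-related-topics)
          "Return a hash table mapping each crawled topic to its related topics."
          (let ((table (make-hash-table :test #'equalp))
                (frontier (list topic)))
            (dotimes (level depth table)
              (let ((next '()))
                (dolist (current frontier)
                  (unless (gethash current table)   ; skip topics already visited
                    (let ((related (funcall get-related-topics current)))
                      (setf (gethash current table) related)
                      (setf next (append related next)))))
                (setf frontier next)))))

    Pruning as described above then amounts to keeping only those edges whose endpoints are both keys of the resulting table.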
  • 19. Link Name Organization
    - Internal, external and intranal hyperlinks
    - I chose the Elephant page as my entry page for crawling
    - There are 228 internal links from the Elephant page, occurring throughout 103 paragraphs of text
    - Goal: organize the 228 links into a meaningful taxonomy
      [Diagram: Elephant with subtopics Asian_Elephant, African_Elephant, African_Bush_Elephant, African_Forest_Elephant]
    - Apply NLP to link names, i.e. parse the link names
    - Partition link names into subtopic, supertopic and related
      - Subtopic candidate elimination
    - Partition related topics into strongly and weakly related, based on link bi-directionality
  • 20. Subtopic Taxonomy Generation Algorithm
    (generate-subtopic-relations-in-graph <graph>)
    1. Produce candidates: a list of pairs of concepts, where the first concept of each pair is a generalization of the second. This is determined by noting concepts that, when parsed, produce a set of tokens that is a subset of the set of tokens produced by parsing the second concept.
    2. Eliminate false positives: these are eliminated by ensuring that the subjects of the phrases of each set of parsed tokens are identical.
       E.g. Elephant_Hotel is not a subtopic of Elephant, whereas Hotel_Elephant would be a subtopic of Elephant. This is one place where NLP really adds value.
    3. Replace 'related-to relations with 'generalizes relations.
    4. Eliminate direct 'generalizes relationships between children and non-parent ancestors, e.g. Elephant and North_African_Elephant.
    5. Eliminate singletons: prune the list of subtrees by eliminating singleton subtrees, leaving them in a yet-to-be-classified state.
    Finally, return a forest of trees, i.e. a list of root nodes.
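    A toy sketch of step 1's subset test, with naive underscore tokenization standing in for BABAR's NLP parse of link names:

        (defun name-tokens (name)
          "Split NAME on underscores into lowercased tokens."
          (loop with start = 0
                for pos = (position #\_ name :start start)
                collect (string-downcase (subseq name start pos))
                while pos
                do (setf start (1+ pos))))

        (defun generalization-candidate-p (general specific)
          "True when every token of GENERAL's name occurs in SPECIFIC's name."
          (subsetp (name-tokens general) (name-tokens specific) :test #'string=))

        ;; (generalization-candidate-p "Elephant" "African_Bush_Elephant") => T
        ;; (generalization-candidate-p "Elephant" "Elephant_Hotel") => T as a
        ;; candidate, but step 2's subject check rejects it: the head noun of
        ;; Elephant_Hotel is Hotel, not Elephant.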
  • 21. Subtopic Taxonomies
    Organize 2,580 topics into a forest of 131 trees consisting of 1,594 nodes (62%) and 986 yet-to-be-classified nodes.
    Elephant Tree:
      -> Elephant
         -> Dwarf_elephant
         -> Sri_Lankan_elephant
         -> Year_of_the_Elephant
         -> Sumatran_Elephant
         -> White_elephant
         -> War_elephant
         -> Crushing_by_elephant
         -> Babar_the_Elephant
         -> Indian_Elephant
         -> African_elephant
            -> African_Forest_Elephant
            -> North_African_Elephant
            -> African_Bush_Elephant
         -> Execution_by_elephant
         -> Borneo_pygmy_elephant
         -> Horton_the_Elephant
         -> Asian_elephant
         -> Elmer_the_Patchwork_Elephant
    Elephant_Seal Tree:
      -> Elephant_seal
         -> Southern_elephant_seal
         -> Northern_elephant_seal
    Intelligence Tree:
      -> Intelligence
         -> Fish_intelligence
         -> Cat_intelligence
         -> Artificial_intelligence
         -> Electronic_Transactions_on_Artificial_Intelligence
         -> Swarm_intelligence
         -> Cephalopod_intelligence
         -> Dinosaur_intelligence
         -> Cetacean_intelligence
         -> Evolution_of_human_intelligence
         -> Elephant_intelligence
         -> Dog_intelligence
         -> Pigeon_intelligence
         -> Primate_intelligence
         -> Bird_intelligence
  • 22. Subtopic Taxonomy Issues
    -> Lion
       -> Congolese_Spotted_Lion
       -> Asiatic_Lion
       -> Masai_lion
       -> Barbary_lion
       -> Henry_the_Lion
       -> Sri_Lanka_lion
       -> Nemean_lion
       -> Western_African_lion
       -> Transvaal_Lion
       -> West_African_lion
       -> Tsavo_lion
       -> Southwest_African_Lion
       -> Cape_Lion
       -> American_lion
       -> European_lion
       -> Tiger_versus_lion
       -> Cowardly_Lion
       -> Sea_lion
          -> Steller_sea_lion
          -> Australian_sea_lion
          -> South_American_sea_lion
          -> New_Zealand_sea_lion
          -> California_sea_lion
       -> White_lion
          -> Kimba_the_White_Lion
    WRT nomenclature purity, Lion_Seal is a better name than Sea_Lion.
  • 23. Clustering
    - Two fundamental perspectives:
      - Top-down: partitioning a set into disjoint subsets
      - Bottom-up: grouping data points into disjoint clusters
    - Goes hand-in-hand with classification
    - Typically involves a metric: Euclidean or Manhattan distance
    - Many, many different algorithms & books. Some really popular algorithms:
      - K-Means Clustering (EM, PCA)
      - Hierarchical Agglomerative Clustering
      - K-Nearest Neighbor (classification)
    - SR-Clustering: something I (re)invented; effectively the world's simplest clustering algorithm
  • 24. K-Means Clustering (1)
    - Given an initial set of cluster centroids, determine the actual centroids of each cluster via an iterative refinement algorithm.
    - Each refinement iteration consists of two steps:
      1. Computing new data-point-to-centroid assignments
      2. Computing new centroid positions based on the mean deviation of the data points from the previous centroid positions
    - Convergence, divergence, oscillation…
    - Also known as Lloyd's Algorithm in computer science.
  • 25. K-Means Clustering (2)
    Wikipedia: Given a set of observations (x1, x2, …, xn), where each observation is a d-dimensional real vector, k-means clustering aims to partition the n observations into k sets (k ≤ n), S = {S1, S2, …, Sk}, so as to minimize the within-cluster sum of squares (WCSS):

        arg min over S of  Σ(i=1..k) Σ(x ∈ Si) ‖x − μi‖²

    where μi is the mean of the points in Si.
  • 26. K-Means Clustering (3)
    - Assignment step: assign each observation to the cluster whose current mean it deviates least from:

        Si(t) = { xp : ‖xp − μi(t)‖² ≤ ‖xp − μj(t)‖² for all j, 1 ≤ j ≤ k }

    - Update step: calculate the new means to be the centroids of the observations in each cluster, i.e. the average along each dimension:

        μi(t+1) = (1 / |Si(t)|) Σ(xj ∈ Si(t)) xj
  • 27. K-Means Clustering (4)
    - K-means is *really* a three-step algorithm
      - Step 1: Initialize k-means (non-trivial)
        - Problem 1: estimate K
        - Problem 2: pick an initial centroid for each of the K clusters
      - Iterative refinement
        - Step 2: centroid assignments
        - Step 3: centroid update
    - Many initialization approaches: Random, Forgy, MacQueen and Kaufman
    - Performance depends on initialization and instance ordering
    - Popular because of its robustness
    - Related to the EM algorithm and Principal Component Analysis (PCA)
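    A compact sketch of the two-step refinement, over points represented as lists of numbers (illustrative, not BABAR code; initialization is taken as given, per Step 1):

        (defun sq-distance (p q)
          (reduce #'+ (mapcar (lambda (a b) (expt (- a b) 2)) p q)))

        (defun nearest-index (point centroids)
          "Index of the centroid nearest to POINT."
          (let ((best 0)
                (best-d (sq-distance point (first centroids))))
            (loop for c in (rest centroids)
                  for i from 1
                  for d = (sq-distance point c)
                  when (< d best-d) do (setf best i best-d d))
            best))

        (defun centroid (points)
          "Mean of POINTS along each dimension."
          (let ((n (length points)))
            (apply #'mapcar (lambda (&rest xs) (/ (reduce #'+ xs) n)) points)))

        (defun k-means (points centroids &optional (max-iters 100))
          "Iteratively refine CENTROIDS; return converged centroids and clusters."
          (loop repeat max-iters
                ;; Assignment step: group each point with its nearest centroid.
                for groups = (let ((bins (make-array (length centroids)
                                                     :initial-element '())))
                               (dolist (p points (coerce bins 'list))
                                 (push p (aref bins (nearest-index p centroids)))))
                ;; Update step: move each centroid to its group's mean
                ;; (an empty group keeps its old centroid).
                for new = (mapcar (lambda (g c) (if g (centroid g) c))
                                  groups centroids)
                until (equal new centroids)
                do (setf centroids new)
                finally (return (values centroids groups))))

        ;; (k-means '((1 1) (1 2) (8 8) (9 8)) '((0 0) (10 10)))
        ;; => ((1 3/2) (17/2 8)) plus the two clusters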
  • 28. Hierarchical Agglomerative Clustering
    - The algorithm:
      1. Cluster each data point with its nearest neighbor(s) and make that a new data point (cluster).
      2. Repeat until some fixed number of clusters is reached.
    - K-Nearest Neighbor is often used hand-in-hand with agglomerative clustering to compute the nearest neighbor(s).
    - You end up with a tree of clusters (the clustering history); this tree is called a dendrogram.
    - See Chapter 6 of Duda & Hart (SRI, 1973), Pattern Classification and Scene Analysis.
  • 29. SR-Clustering (1)
    - Simple Ray Clustering
    - Sort of like non-hierarchical agglomerative clustering
    - Basic algorithm
      - For each data point, place it in the correct cluster
      - If it doesn't belong to any cluster, create a new cluster consisting of that single data point
    - Cluster membership
      - Defined as being within a certain proximity threshold of every data point in that cluster
    - Proximity metric: the Jaccard Index
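    A sketch of that basic algorithm; SIMILARITY is assumed to be a function of two items returning a percentage, like the Jaccard matrix on slide 33, and THRESHOLD plays the role of the proximity threshold:

        (defun sr-cluster (items similarity threshold)
          "Each item joins the first cluster to whose every member it is at
        least THRESHOLD-similar; otherwise it founds a new singleton cluster."
          (let ((clusters '()))
            (dolist (item items (nreverse clusters))
              (let ((home (position-if
                           (lambda (cluster)
                             (every (lambda (member)
                                      (>= (funcall similarity item member)
                                          threshold))
                                    cluster))
                           clusters)))
                (if home
                    (push item (nth home clusters))
                    (push (list item) clusters))))))

    Note the order-dependence: as the slide says, this is about the simplest thing that can work, and the resulting clusters vary with item ordering.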
  • 30. Recommender Systems
    - Used by Netflix, Amazon, etc.
    - Objects: users, items & preferences
    - User-based vs. item-based recommendations; the former is also known as collaborative filtering
    - Mixed-method recommendations
    - Based on user similarity and/or item similarity
    - The Jaccard Index takes dissimilarity into account and does not require preference measurements
    - Apache Mahout (leverages Hadoop)
  • 31. Jaccard Index
    - Defines a similarity metric between two sets
    - Wikipedia: the Jaccard coefficient measures similarity between sample sets, and is defined as the size of the intersection divided by the size of the union of the sample sets:

        J(A, B) = |A ∩ B| / |A ∪ B|

    - Jaccard distance: d(A, B) = 1 − J(A, B)
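    As a sketch (not BABAR's compute-similarity-matrix), the index over two lists treated as sets, scaled to [0, 100] to match the matrix on slide 33:

        (defun jaccard-index (set-a set-b &key (test #'equal))
          "Size of the intersection divided by size of the union, as a percentage."
          (let ((both (intersection set-a set-b :test test))
                (all  (union set-a set-b :test test)))
            (if all
                (* 100.0 (/ (length both) (length all)))
                0.0)))

        ;; (jaccard-index '(trunk tusks ears) '(tusks ears tail mane))
        ;; => 40.0   ; 2 shared elements / 5 distinct elements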
  • 32. Another Similarity Metric
    - Pearson Correlation Coefficient
    - Wikipedia: defined as the covariance of the two variables divided by the product of their standard deviations:

        ρ(X, Y) = cov(X, Y) / (σX σY)
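    A corresponding sketch for two equal-length sequences of numbers:

        (defun mean (xs)
          (/ (reduce #'+ xs) (length xs)))

        (defun pearson (xs ys)
          "Covariance of XS and YS divided by the product of their standard
        deviations (the 1/(n-1) factors cancel, so raw sums suffice)."
          (let* ((mx (mean xs))
                 (my (mean ys))
                 (dx (mapcar (lambda (x) (- x mx)) xs))
                 (dy (mapcar (lambda (y) (- y my)) ys))
                 (cov (reduce #'+ (mapcar #'* dx dy)))
                 (sx  (sqrt (reduce #'+ (mapcar #'* dx dx))))
                 (sy  (sqrt (reduce #'+ (mapcar #'* dy dy)))))
            (if (or (zerop sx) (zerop sy))
                0
                (/ cov (* sx sy)))))

        ;; (pearson '(1 2 3 4) '(2 4 6 8)) => 1.0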
  • 33. (compute-similarity-matrix <topics>)
    Computes the Jaccard index for pairs of topics, using the related topics of each topic as the sets to be compared.

              African   Asian  Indian   Babar  Horton     War
    African    100.00   38.46   21.05    4.35    6.82    7.94
    Asian       38.46  100.00   37.74    4.00    6.25   20.00
    Indian      21.05   37.74  100.00    6.90    7.14   24.39
    Babar        4.35    4.00    6.90  100.00   28.57    7.14
    Horton       6.82    6.25    7.14   28.57  100.00    7.41
    War          7.94   20.00   24.39    7.14    7.41  100.00
  • 34. (cluster-subtopics <subtopics> <matrix> <threshold>), with threshold = 20
    Cluster 1: Asian_elephant(49), African_elephant(60)
    Cluster 2: Babar_the_Elephant(7), Horton_the_Elephant(5), Elmer_the_Patchwork_Elephant(4)
    Cluster 3: Asian_elephant(49), Indian_Elephant(18), Sri_Lankan_elephant(12), Sumatran_Elephant(11), Borneo_pygmy_elephant(3)
    Cluster 4: War_elephant(22), Execution_by_elephant(5), Crushing_by_elephant(4)
    Cluster 5: Year_of_the_Elephant(8)
    Cluster 6: Dwarf_elephant(24)
    Cluster 7: White_elephant(10)
  • 35. Knowledge Categories (1)
    - Human schooling is a decades-long knowledge acquisition process, spanning kindergarten through post-doctoral work.
    - The idea is to use grade-school subjects as initial knowledge categories: Science, History, Geography, Literature & Art
    - Goal: assign categories to subtopic clusters
    - Use the Jaccard Index to determine the category
    - Automatically create subtopic category names, e.g. Babar → Literature_Elephant
  • 36. (compute-cluster-categories <clusters>)
    - Wiki-crawl each knowledge category (pre-run)
    - Compute the subtopics of each knowledge category
    - Compute a category relevancy vector for each cluster member
    - Combine the relevancy vectors of the cluster's members to compute a relevancy vector for the cluster
    - Assign a category to the cluster
  • 37. (compute-cluster-categories <clusters>)
    ((((:SCIENCE 0.47666672) (:HISTORY 0.44666672))
      (#<Concept(49): Asian_elephant> #<Concept(60): African_elephant>))
     (((:SCIENCE 0.39) (:GEOGRAPHY 0.37800002))
      (#<Concept(3): Borneo_pygmy_elephant> #<Concept(49): Asian_elephant>
       #<Concept(18): Indian_Elephant> #<Concept(12): Sri_Lankan_elephant>
       #<Concept(11): Sumatran_Elephant>))
     (((:ART 0.33333334) (:GEOGRAPHY 0.30666667))
      (#<Concept(7): Babar_the_Elephant> #<Concept(5): Horton_the_Elephant>
       #<Concept(4): Elmer_the_Patchwork_Elephant>))
     (((:HISTORY 0.6) (:GEOGRAPHY 0.46))
      (#<Concept(8): Year_of_the_Elephant>))
     (((:GEOGRAPHY 0.72333336) (:HISTORY 0.43666664))
      (#<Concept(5): Execution_by_elephant> #<Concept(22): War_elephant>
       #<Concept(4): Crushing_by_elephant>))
     (((:GEOGRAPHY 0.86) (:SCIENCE 0.5))
      (#<Concept(24): Dwarf_elephant>))
     (((:SCIENCE 0.69) (:ART 0.49))
      (#<Concept(10): White_elephant>)))
  • 38. Individual Subtopic Categories
    The following shows the knowledge category relevancies for some of the 16 subtopics of Elephant and helps in understanding the results of the previous slide:
    (#<Concept(7): Babar_the_Elephant>
     ((:LITERATURE 0.44) (:ART 0.25) (:GEOGRAPHY 0.23) (:HISTORY 0.2) (:SCIENCE 0.17)))
    (#<Concept(4): Elmer_the_Patchwork_Elephant>
     ((:ART 0.25) (:GEOGRAPHY 0.23) (:LITERATURE 0.22) (:HISTORY 0.2) (:SCIENCE 0.17)))
    (#<Concept(5): Horton_the_Elephant>
     ((:ART 0.5) (:GEOGRAPHY 0.46) (:HISTORY 0.4) (:SCIENCE 0.35) (:LITERATURE 0.22)))
    (#<Concept(60): African_elephant>
     ((:ART 1.03) (:SCIENCE 0.91) (:HISTORY 0.85) (:GEOGRAPHY 0.77) (:LITERATURE 0.37)))
    (#<Concept(49): Asian_elephant>
     ((:HISTORY 0.7) (:SCIENCE 0.62) (:GEOGRAPHY 0.59) (:ART 0.42) (:LITERATURE 0.19)))
    (#<Concept(22): War_elephant>
     ((:HISTORY 0.93) (:GEOGRAPHY 0.85) (:LITERATURE 0.41) (:ART 0.23) (:SCIENCE 0.16)))
  • 39. Categorized Subtopic Clusters
    Elephant
      ART_Elephant
        Elmer_the_Patchwork_Elephant, Horton_the_Elephant, Babar_the_Elephant
      GEOGRAPHY_Elephant
        Dwarf_elephant, Crushing_by_elephant, War_elephant, Execution_by_elephant
      HISTORY_Elephant
        Year_of_the_Elephant
      SCIENCE_Elephant
        African_elephant, African_Forest_Elephant, African_Bush_Elephant, Asian_elephant, White_elephant, Sumatran_Elephant, Sri_Lankan_elephant, Indian_Elephant, Borneo_pygmy_elephant
  • 40. Related Topics Associations
    - Associate related topics to subtopic clusters using the Jaccard Index
    - Use the associations to create related-topic clusters
    (find-compatible-clusters <strongly-related-topics> <clusters>)
    ((#<Concept(60): African elephant>
      ((#<Concept(24): Dwarf elephant>) #<Concept(49): Asian_elephant>)
      (#<Concept(66): Mammoth>
       (#<Concept(10): Elephant intelligence> #<Concept(25): Mastodon>
        #<Concept(275): Genus> #<Concept(103): Animal cognition>
        #<Concept(62): Afrotheria> #<Concept(4): Elephant tusk>
        #<Concept(86): Gestation> #<Concept(15): African>
        #<Concept(749): Eutheria> #<Concept(102): Proboscidea>
        #<Concept(8): Gomphotherium> #<Concept(96): Mammalia>
        #<Concept(27): Tooth> #<Concept(876): Mammal>
        #<Concept(8): Tooth_development>))
      #<Concept(143): Hippopotamus> #<Concept(590): Lion> #<Concept(10): Loxodonta>)
     ((#<Concept(7): Babar_the_Elephant> #<Concept(5): Horton_the_Elephant>
       #<Concept(4): Elmer_the_Patchwork_Elephant>)
      ((#<Concept(22): War_elephant> #<Concept(5): Execution_by_elephant>
        #<Concept(6): List_of_fictional_elephants> #<Concept(4): Crushing_by_elephant>)
       #<Concept(5): List_of_elephants_in_mythology_and_religion>
       #<Concept(5): Pinnawala>
       (#<Concept(55): Ivory> #<Concept(3): Katy_Payne> #<Concept(77): Kenya>
        #<Concept(11): Infrasound> #<Concept(31): Grief> #<Concept(56): Incisor>
        #<Concept(8): History_of_elephants_in_Europe>))
      #<Concept(14): Jeheskel_Shoshani> #<Concept(6): Aanayoottu>))
  • 41. (sentence-to-clause <sentence>)
    English sentence string
      → Scanner → tokens
      → Analyzer → morphologically analyzed words
      → Parser → parse tree
      → Phrase extractor → phrases (a flattened parse tree)
      → Semantic analyzer → frame for subject, verb, object and prepositional phrases
      → Clause generator → clause objects
  • 42. Sample Extracted Clauses
    (HASA "Asian elephant species" "disjunct distributions")
    (ISA "Elephants" "herbivores")
    (HASA "African Elephants" "three nails")
    (HASA "Indian Elephants" "four nails")
    (HASA "female African Elephants" "large tusks")
    (ISA "Elephants" "large land mammals")
  • 43. Things Overlooked
    - The wiki page contents pane
      - Provides page taxonomy
      - Provides category names
      - Provides related-topic names
    - Concept weights
  • 44. Future Direction
    - Enhance the English parser
    - Incorporate variables into the semantic net
    - Leverage topic weights
    - Work on language generation
    - Produce wiki summary pages
    - Knowledge queries
    - Develop a client-side browser
      - Top menu bar: knowledge categories
      - RHS: dynamic subtopic tree
      - LHS: wiki page content pane
  • 45. (references)
    - Speech and Language Processing, Jurafsky & Martin, 2009
    - Artificial Intelligence: A Modern Approach, Russell & Norvig
    - Principles of Semantic Networks, edited by John F. Sowa, Morgan Kaufmann, 1991
    - Machine Learning, Tom Mitchell, 1997
    - Pattern Classification and Scene Analysis, Duda & Hart, 1973
    - Algorithms of the Intelligent Web, Marmanis & Babenko, 2009
  • 46. (cluster-images)
  • 47. (Love Elephants LispNYC)
