Semantic Transforms Using
 Collaborative Knowledge Bases


Yegin Genc, Winter Mason, Jeffrey V. Nickerson

          Stevens Institute of Technology
Overview


• Automatically understand online information

• Using network artifacts, such as Wikipedia, to
  help
Topic Models
       Algorithms to understand and
       organize documents by
       uncovering semantic structure
       of a document collection

       • Discover hidden themes –
         patterns of word use
       • Connect documents that
         exhibit similar patterns
Latent Dirichlet Allocation (LDA)

   “In the computer science field of artificial intelligence, a genetic algorithm (GA) is a
   search heuristic that mimics the process of natural evolution. This heuristic is
   routinely used to generate useful solutions to optimization and search problems.
   Genetic algorithms belong to the larger class of evolutionary algorithms (EA), which
   generate solutions to optimization problems using techniques inspired by natural
   evolution, such as inheritance, mutation, selection, and crossover.” 1


            Algorithms      – 0.28               Genetic         – 0.18
            Optimization    – 0.28               Natural         – 0.18
            Algorithm       – 0.14               Evolution       – 0.18
            Computer        – 0.14               Evolutionary    – 0.09
            Techniques      – 0.14               …
            ….
1 http://en.wikipedia.org/wiki/Genetic_algorithm
Topics from LDA
     computer          chemistry           cortex             orbit           infection
     methods            synthesis         stimulus            dust            immune
      number            oxidation             fig            jupiter             aids
         two            reaction            vision             line            infected
      principle          product           neuron            system              viral
       design            organic         recordings           solar              cells
Five topics from a 50-topic LDA model fit to Science from 1980 to 2002 (Blei and Lafferty, 2009)



   methods      k               of   the    for  the              the      operations     the
      the      the           objects of     the   o               and         the          of
       a        of              to     a  linear we                of      functional       a
       of   algorithm         and     to problem and               to       requires       is
   problems    for             the   we problems a                that        and          in
Ten randomly chosen topics from a 50-topic LDA model fit to abstracts from the Journal of
the ACM (JACM) from the years 1987 to 2004 (Blei et al., 2010).
The interpretation problem
1. Labeling the topics is difficult (J. Chang et al.,
   2009)
2. The relationships between topics are not
   identified
3. The information in the topics is based solely
   on the input corpus
4. The external validity of the topics may be
   limited
Collaborative Knowledge Bases
1. Labeled topics
2. Connected to each other in a meaningful way
3. Contain rich, focused information on
   particular topics
4. Contain fresh, up-to-date information about
   practically everything
Wikipedia Pages as Topics
LDA topic
   orbit, dust, jupiter, line, system, solar, gas, atmospheric, mars, field

Wikipedia Page
   Solar System
   “The Solar System consists of the Sun and the astronomical objects
   gravitationally bound in orbit around it, all of which formed from the
   collapse of a giant molecular cloud approximately 4.6 billion years ago…”
   (http://en.wikipedia.org/wiki/Solar_System)
Wikipedia Pages as Topics
Topics are characterized as distributions over observed words in
Wikipedia pages

 Word          Count in Wikipedia page   Word freq.
 orbit                   34                 0.12
 dust                     7                 0.02
 jupiter                 36                 0.12
 line                     0                 0.00
 system                  76                 0.26
 solar                  110                 0.38
 gas                     11                 0.04
 atmospheric              1                 0.00
 mars                     8                 0.03
 field                    8                 0.03

 \[ \beta_k = p(W_i \mid k) = \frac{\{W_i \in k\}}{\sum_{i}^{N} \{W_i \in k\}} \]

 β_k: per-topic word distribution
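
As a minimal sketch of the per-topic word distribution on this slide, the counts can be normalized so that they sum to 1. The sketch assumes that {W_i ∈ k} means the number of times word W_i occurs in Wikipedia page k, and that the normalization runs over only the ten words shown; both are readings of the slide, not something it states. The function name `word_distribution` is illustrative.

```python
# Word counts for the "Solar System" page, copied from the slide.
counts = {
    "orbit": 34, "dust": 7, "jupiter": 36, "line": 0, "system": 76,
    "solar": 110, "gas": 11, "atmospheric": 1, "mars": 8, "field": 8,
}


def word_distribution(counts):
    """Normalize raw word counts into a distribution that sums to 1."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("the page contains none of the vocabulary words")
    return {word: n / total for word, n in counts.items()}


# beta_k[word] approximates p(word | page k) under the assumptions above.
beta_k = word_distribution(counts)
for word, p in beta_k.items():
    print(f"{word:12s} {p:.2f}")
```

Under these assumptions, the values rounded to two decimals agree with the frequency column on the slide.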
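[Figure: matrix view of the two approaches. D: Documents, K: Topics, W: Words.
 LDA: DOCUMENT-TOPIC Θ (D x K), DOCUMENT-WORD W (D x W), TOPIC-WORD β (K x W), with topic assignments Z d,n and words W d,n.
 WIKI: DOCUMENT-TOPIC (D x K) = DOCUMENT-WORD (D x W) * Wiki (W x K).]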
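
The WIKI row of the figure can be read as a matrix product: document-word counts times a word-by-topic matrix built from Wikipedia pages. The sketch below illustrates that reading with toy numbers. The vocabulary, the two documents, the matrix values, and the final step of picking the highest-scoring page are all assumptions made for illustration, not data or steps taken from the slides.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    inner = len(b)
    if any(len(row) != inner for row in a):
        raise ValueError("inner dimensions do not match")
    cols = len(b[0])
    return [
        [sum(row[i] * b[i][j] for i in range(inner)) for j in range(cols)]
        for row in a
    ]


# Hypothetical vocabulary (W = 4) and Wikipedia pages used as topics (K = 2).
vocab = ["orbit", "solar", "kernel", "process"]
topics = ["Solar System", "Operating system"]

# Document-word counts (D x W) for two toy documents.
doc_word = [
    [3, 2, 0, 0],
    [0, 0, 4, 1],
]

# Wiki (W x K): column k holds p(word | page k), so each column sums to 1.
wiki = [
    [0.4, 0.0],
    [0.6, 0.0],
    [0.0, 0.5],
    [0.0, 0.5],
]

# Document-topic scores (D x K) = document-word (D x W) * Wiki (W x K).
doc_topic = matmul(doc_word, wiki)

# Assumed labeling step: take the highest-scoring page for each document.
best = [topics[max(range(len(topics)), key=row.__getitem__)] for row in doc_topic]
print(best)
```

Unlike LDA, nothing is fit to the corpus here: the word-by-topic matrix is fixed in advance, so each document is scored with a single multiplication.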
Experiment
Data
617 abstracts from Journal of the ACM
Classified into 80 categories by their authors
53 categories have corresponding Wikipedia Pages

Abstracts
{Article Name:        On the (Im)possibility of Obfuscating Programs,
    Category:         D.4 Operating Systems
    Add. Category:    F.1 Computation by Abstract Devices
    …
}

Category Mappings
    Category                                Wikipedia Page
    D.4 Operating Systems                   Operating System
    F.1 Computation by Abstract Devices     Abstract Machine
Three variations of our method



- Inbound links are Wikipedia pages that link to the topic page
- Outbound links are Wikipedia pages linked to by the topic
  page
- Text-based method only uses word distributions in topic pages
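
The two link definitions above can be made concrete with a small link graph. The sketch below only shows how the inbound and outbound sets of a topic page could be collected. The graph is hypothetical, and the slides do not say how the link sets were turned into scores, so no scoring step is shown.

```python
# Hypothetical link graph: each page maps to the pages it links to.
links = {
    "Operating system": ["Kernel (operating system)", "Computer"],
    "Kernel (operating system)": ["Operating system"],
    "Computer": ["Operating system", "Abstract machine"],
    "Abstract machine": ["Computer"],
}


def outbound(page, links):
    """Pages that the topic page links to."""
    return set(links.get(page, []))


def inbound(page, links):
    """Pages that link to the topic page."""
    return {source for source, targets in links.items() if page in targets}


print(sorted(inbound("Operating system", links)))
print(sorted(outbound("Operating system", links)))
```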
Results
      Method                    Primary                   Primary or Additional

         Text                 182 (29.5%)                      314 (50.8%)

   Inbound links              131 (21.2%)                      249 (40.0%)

  Outbound links               79 (12.8%)                      166 (26.9%)



The number (and percentage) of authors’ primary ACM topic labels, or authors’
primary + additional ACM topics successfully identified by each method.

LDA cannot be compared without an additional step mapping word distributions to
ACM topics.
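
The two columns of the results table can be computed as in the sketch below. It assumes each method outputs one predicted category per abstract, which the slides do not state. The document ids, labels, and predictions are made-up toy data, and `score` is an illustrative name.

```python
def score(predictions, labels):
    """Count matches against the primary label and against any author label.

    predictions: {doc_id: predicted category}
    labels:      {doc_id: (primary category, set of additional categories)}
    Returns (primary hits, primary-or-additional hits, and both as percentages).
    """
    primary_hits = 0
    any_hits = 0
    for doc_id, predicted in predictions.items():
        primary, additional = labels[doc_id]
        if predicted == primary:
            primary_hits += 1
            any_hits += 1
        elif predicted in additional:
            any_hits += 1
    n = len(predictions)
    return primary_hits, any_hits, 100 * primary_hits / n, 100 * any_hits / n


# Toy data: four abstracts with author-assigned categories.
labels = {
    "a1": ("D.4", {"F.1"}),
    "a2": ("F.1", set()),
    "a3": ("F.2", {"G.2"}),
    "a4": ("D.4", set()),
}
predictions = {"a1": "F.1", "a2": "F.1", "a3": "D.4", "a4": "D.4"}

primary_hits, any_hits, primary_pct, any_pct = score(predictions, labels)
print(primary_hits, any_hits, primary_pct, any_pct)
```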
Results (Qualitative)
Concluding Remarks
The Wiki categories often match the categories that
were chosen by the authors. When they don’t
match, they generally appear plausible.

Among the variations of our method, the text-based
approach performed better than the link-based
approaches.

Among the link-based approaches, inbound links
performed better than outbound links.
Next Steps

Dependent topic structures

Combine heuristics with generative models:
  Wikipedia as a prior for the topic distribution
  Learn from the documents observed.

More Related Content

What's hot

Application of a Novel Subject Classification Scheme for a Bibliographic Data...
Application of a Novel Subject Classification Scheme for a Bibliographic Data...Application of a Novel Subject Classification Scheme for a Bibliographic Data...
Application of a Novel Subject Classification Scheme for a Bibliographic Data...
National Institute of Informatics
 
Machine Learning in Materials Science and Chemistry, USPTO, Nathan C. Frey
Machine Learning in Materials Science and Chemistry, USPTO, Nathan C. FreyMachine Learning in Materials Science and Chemistry, USPTO, Nathan C. Frey
Machine Learning in Materials Science and Chemistry, USPTO, Nathan C. Frey
Nathan Frey, PhD
 
Accelerated Materials Discovery & Characterization with Classical, Quantum an...
Accelerated Materials Discovery & Characterization with Classical, Quantum an...Accelerated Materials Discovery & Characterization with Classical, Quantum an...
Accelerated Materials Discovery & Characterization with Classical, Quantum an...
KAMAL CHOUDHARY
 
Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Anubhav Jain
 
Data Mining The Sky
Data Mining The SkyData Mining The Sky
Data Mining The Sky
DataminingTools Inc
 
Smart Metrics for High Performance Material Design
Smart Metrics for High Performance Material DesignSmart Metrics for High Performance Material Design
Smart Metrics for High Performance Material Design
aimsnist
 
Software Tools, Methods and Applications of Machine Learning in Functional Ma...
Software Tools, Methods and Applications of Machine Learning in Functional Ma...Software Tools, Methods and Applications of Machine Learning in Functional Ma...
Software Tools, Methods and Applications of Machine Learning in Functional Ma...
Anubhav Jain
 
A Machine Learning Framework for Materials Knowledge Systems
A Machine Learning Framework for Materials Knowledge SystemsA Machine Learning Framework for Materials Knowledge Systems
A Machine Learning Framework for Materials Knowledge Systems
aimsnist
 

What's hot (8)

Application of a Novel Subject Classification Scheme for a Bibliographic Data...
Application of a Novel Subject Classification Scheme for a Bibliographic Data...Application of a Novel Subject Classification Scheme for a Bibliographic Data...
Application of a Novel Subject Classification Scheme for a Bibliographic Data...
 
Machine Learning in Materials Science and Chemistry, USPTO, Nathan C. Frey
Machine Learning in Materials Science and Chemistry, USPTO, Nathan C. FreyMachine Learning in Materials Science and Chemistry, USPTO, Nathan C. Frey
Machine Learning in Materials Science and Chemistry, USPTO, Nathan C. Frey
 
Accelerated Materials Discovery & Characterization with Classical, Quantum an...
Accelerated Materials Discovery & Characterization with Classical, Quantum an...Accelerated Materials Discovery & Characterization with Classical, Quantum an...
Accelerated Materials Discovery & Characterization with Classical, Quantum an...
 
Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...Evaluating Machine Learning Algorithms for Materials Science using the Matben...
Evaluating Machine Learning Algorithms for Materials Science using the Matben...
 
Data Mining The Sky
Data Mining The SkyData Mining The Sky
Data Mining The Sky
 
Smart Metrics for High Performance Material Design
Smart Metrics for High Performance Material DesignSmart Metrics for High Performance Material Design
Smart Metrics for High Performance Material Design
 
Software Tools, Methods and Applications of Machine Learning in Functional Ma...
Software Tools, Methods and Applications of Machine Learning in Functional Ma...Software Tools, Methods and Applications of Machine Learning in Functional Ma...
Software Tools, Methods and Applications of Machine Learning in Functional Ma...
 
A Machine Learning Framework for Materials Knowledge Systems
A Machine Learning Framework for Materials Knowledge SystemsA Machine Learning Framework for Materials Knowledge Systems
A Machine Learning Framework for Materials Knowledge Systems
 

Viewers also liked

H0ly L4nd
H0ly L4ndH0ly L4nd
H0ly L4nddanr
 
Discovering Context
Discovering ContextDiscovering Context
Discovering Context
Yegin Genc
 

Viewers also liked (6)

H0ly L4nd
H0ly L4ndH0ly L4nd
H0ly L4nd
 
windward5
windward5windward5
windward5
 
Discovering Context
Discovering ContextDiscovering Context
Discovering Context
 
Creative
CreativeCreative
Creative
 
Knights
KnightsKnights
Knights
 
Advertising
AdvertisingAdvertising
Advertising
 

Similar to Semantic Transforms Using Collaborative Knowledge Bases

Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Anubhav Jain
 
Materials Modelling: From theory to solar cells (Lecture 1)
Materials Modelling: From theory to solar cells  (Lecture 1)Materials Modelling: From theory to solar cells  (Lecture 1)
Materials Modelling: From theory to solar cells (Lecture 1)
cdtpv
 
Development of a Trans-Field Learning System Based on Multidimensional Topic ...
Development of a Trans-Field Learning System Based on Multidimensional Topic ...Development of a Trans-Field Learning System Based on Multidimensional Topic ...
Development of a Trans-Field Learning System Based on Multidimensional Topic ...
tmra
 
Performance Comparison of Ad-hoc Retrieval Models over Full-text vs. Titles o...
Performance Comparison of Ad-hoc Retrieval Models over Full-text vs. Titles o...Performance Comparison of Ad-hoc Retrieval Models over Full-text vs. Titles o...
Performance Comparison of Ad-hoc Retrieval Models over Full-text vs. Titles o...
Ahmed Saleh
 
Ontology driven Annotation
Ontology driven AnnotationOntology driven Annotation
Ontology driven AnnotationAshish Kulkarni
 
The Unbearable Lightness of Wiking
The Unbearable Lightness of Wiking The Unbearable Lightness of Wiking
The Unbearable Lightness of Wiking
Jie Bao
 
SWiM – A wiki for collaborating on mathematical ontologies
SWiM – A wiki for collaborating on mathematical ontologiesSWiM – A wiki for collaborating on mathematical ontologies
SWiM – A wiki for collaborating on mathematical ontologies
Christoph Lange
 
Linking KOS Data [using SKOS and OWL2]
Linking KOS Data [using SKOS and OWL2]Linking KOS Data [using SKOS and OWL2]
Linking KOS Data [using SKOS and OWL2]Marcia Zeng
 
Exploring Content with Wikipedia
Exploring Content with WikipediaExploring Content with Wikipedia
Exploring Content with Wikipedia
Yegin Genc
 
Blei lafferty2009
Blei lafferty2009Blei lafferty2009
Blei lafferty2009Ajay Ohri
 
Cyberinfrastructure for Einstein's Equations and Beyond
Cyberinfrastructure for Einstein's Equations and BeyondCyberinfrastructure for Einstein's Equations and Beyond
Cyberinfrastructure for Einstein's Equations and Beyond
University of Illinois at Urbana-Champaign
 
#4 Convolutional Neural Networks for Natural Language Processing
#4 Convolutional Neural Networks for Natural Language Processing#4 Convolutional Neural Networks for Natural Language Processing
#4 Convolutional Neural Networks for Natural Language Processing
Berlin Language Technology
 
Wikipedia as an Ontology for Describing Documents
Wikipedia as an Ontology for Describing DocumentsWikipedia as an Ontology for Describing Documents
Wikipedia as an Ontology for Describing Documents
Zareen Syed
 
bridging formal semantics and social semantics on the web
bridging formal semantics and social semantics on the webbridging formal semantics and social semantics on the web
bridging formal semantics and social semantics on the web
Fabien Gandon
 
The Computer Science Ontology: A Large-Scale Taxonomy of Research Areas
The Computer Science Ontology:  A Large-Scale Taxonomy of Research AreasThe Computer Science Ontology:  A Large-Scale Taxonomy of Research Areas
The Computer Science Ontology: A Large-Scale Taxonomy of Research Areas
Angelo Salatino
 
How To Make Linked Data More than Data
How To Make Linked Data More than DataHow To Make Linked Data More than Data
How To Make Linked Data More than Data
Artificial Intelligence Institute at UofSC
 
How To Make Linked Data More than Data
How To Make Linked Data More than DataHow To Make Linked Data More than Data
How To Make Linked Data More than DataAmit Sheth
 
AdS Biology and Quantum Information Science
AdS Biology and Quantum Information ScienceAdS Biology and Quantum Information Science
AdS Biology and Quantum Information Science
Melanie Swan
 
LDAvis
LDAvisLDAvis
LDAvis
曾 子芸
 
mx & dbs
mx & dbsmx & dbs

Similar to Semantic Transforms Using Collaborative Knowledge Bases (20)

Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
Accelerated Materials Discovery Using Theory, Optimization, and Natural Langu...
 
Materials Modelling: From theory to solar cells (Lecture 1)
Materials Modelling: From theory to solar cells  (Lecture 1)Materials Modelling: From theory to solar cells  (Lecture 1)
Materials Modelling: From theory to solar cells (Lecture 1)
 
Development of a Trans-Field Learning System Based on Multidimensional Topic ...
Development of a Trans-Field Learning System Based on Multidimensional Topic ...Development of a Trans-Field Learning System Based on Multidimensional Topic ...
Development of a Trans-Field Learning System Based on Multidimensional Topic ...
 
Performance Comparison of Ad-hoc Retrieval Models over Full-text vs. Titles o...
Performance Comparison of Ad-hoc Retrieval Models over Full-text vs. Titles o...Performance Comparison of Ad-hoc Retrieval Models over Full-text vs. Titles o...
Performance Comparison of Ad-hoc Retrieval Models over Full-text vs. Titles o...
 
Ontology driven Annotation
Ontology driven AnnotationOntology driven Annotation
Ontology driven Annotation
 
The Unbearable Lightness of Wiking
The Unbearable Lightness of Wiking The Unbearable Lightness of Wiking
The Unbearable Lightness of Wiking
 
SWiM – A wiki for collaborating on mathematical ontologies
SWiM – A wiki for collaborating on mathematical ontologiesSWiM – A wiki for collaborating on mathematical ontologies
SWiM – A wiki for collaborating on mathematical ontologies
 
Linking KOS Data [using SKOS and OWL2]
Linking KOS Data [using SKOS and OWL2]Linking KOS Data [using SKOS and OWL2]
Linking KOS Data [using SKOS and OWL2]
 
Exploring Content with Wikipedia
Exploring Content with WikipediaExploring Content with Wikipedia
Exploring Content with Wikipedia
 
Blei lafferty2009
Blei lafferty2009Blei lafferty2009
Blei lafferty2009
 
Cyberinfrastructure for Einstein's Equations and Beyond
Cyberinfrastructure for Einstein's Equations and BeyondCyberinfrastructure for Einstein's Equations and Beyond
Cyberinfrastructure for Einstein's Equations and Beyond
 
#4 Convolutional Neural Networks for Natural Language Processing
#4 Convolutional Neural Networks for Natural Language Processing#4 Convolutional Neural Networks for Natural Language Processing
#4 Convolutional Neural Networks for Natural Language Processing
 
Wikipedia as an Ontology for Describing Documents
Wikipedia as an Ontology for Describing DocumentsWikipedia as an Ontology for Describing Documents
Wikipedia as an Ontology for Describing Documents
 
bridging formal semantics and social semantics on the web
bridging formal semantics and social semantics on the webbridging formal semantics and social semantics on the web
bridging formal semantics and social semantics on the web
 
The Computer Science Ontology: A Large-Scale Taxonomy of Research Areas
The Computer Science Ontology:  A Large-Scale Taxonomy of Research AreasThe Computer Science Ontology:  A Large-Scale Taxonomy of Research Areas
The Computer Science Ontology: A Large-Scale Taxonomy of Research Areas
 
How To Make Linked Data More than Data
How To Make Linked Data More than DataHow To Make Linked Data More than Data
How To Make Linked Data More than Data
 
How To Make Linked Data More than Data
How To Make Linked Data More than DataHow To Make Linked Data More than Data
How To Make Linked Data More than Data
 
AdS Biology and Quantum Information Science
AdS Biology and Quantum Information ScienceAdS Biology and Quantum Information Science
AdS Biology and Quantum Information Science
 
LDAvis
LDAvisLDAvis
LDAvis
 
mx & dbs
mx & dbsmx & dbs
mx & dbs
 

Recently uploaded

A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
sonjaschweigert1
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Aggregage
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
SOFTTECHHUB
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
Kari Kakkonen
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
Aftab Hussain
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
Uni Systems S.M.S.A.
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
Neo4j
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
Neo4j
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
Alan Dix
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Malak Abu Hammad
 
20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website
Pixlogix Infotech
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
ThomasParaiso2
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Vladimir Iglovikov, Ph.D.
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
Zilliz
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
DianaGray10
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
DianaGray10
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
James Anderson
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
Octavian Nadolu
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
Matthew Sinclair
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
KatiaHIMEUR1
 

Recently uploaded (20)

A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...A tale of scale & speed: How the US Navy is enabling software delivery from l...
A tale of scale & speed: How the US Navy is enabling software delivery from l...
 
Generative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to ProductionGenerative AI Deep Dive: Advancing from Proof of Concept to Production
Generative AI Deep Dive: Advancing from Proof of Concept to Production
 
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
Goodbye Windows 11: Make Way for Nitrux Linux 3.5.0!
 
DevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA ConnectDevOps and Testing slides at DASA Connect
DevOps and Testing slides at DASA Connect
 
Removing Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software FuzzingRemoving Uninteresting Bytes in Software Fuzzing
Removing Uninteresting Bytes in Software Fuzzing
 
Microsoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdfMicrosoft - Power Platform_G.Aspiotis.pdf
Microsoft - Power Platform_G.Aspiotis.pdf
 
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
GraphSummit Singapore | Graphing Success: Revolutionising Organisational Stru...
 
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
GraphSummit Singapore | Enhancing Changi Airport Group's Passenger Experience...
 
Epistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI supportEpistemic Interaction - tuning interfaces to provide information for AI support
Epistemic Interaction - tuning interfaces to provide information for AI support
 
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdfUnlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
Unlock the Future of Search with MongoDB Atlas_ Vector Search Unleashed.pdf
 
20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website20 Comprehensive Checklist of Designing and Developing a Website
20 Comprehensive Checklist of Designing and Developing a Website
 
GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...GridMate - End to end testing is a critical piece to ensure quality and avoid...
GridMate - End to end testing is a critical piece to ensure quality and avoid...
 
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AIEnchancing adoption of Open Source Libraries. A case study on Albumentations.AI
Enchancing adoption of Open Source Libraries. A case study on Albumentations.AI
 
Full-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalizationFull-RAG: A modern architecture for hyper-personalization
Full-RAG: A modern architecture for hyper-personalization
 
UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5UiPath Test Automation using UiPath Test Suite series, part 5
UiPath Test Automation using UiPath Test Suite series, part 5
 
Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1Communications Mining Series - Zero to Hero - Session 1
Communications Mining Series - Zero to Hero - Session 1
 
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
Alt. GDG Cloud Southlake #33: Boule & Rebala: Effective AppSec in SDLC using ...
 
Artificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopmentArtificial Intelligence for XMLDevelopment
Artificial Intelligence for XMLDevelopment
 
20240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 202420240607 QFM018 Elixir Reading List May 2024
20240607 QFM018 Elixir Reading List May 2024
 
Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !Securing your Kubernetes cluster_ a step-by-step guide to success !
Securing your Kubernetes cluster_ a step-by-step guide to success !
 

Semantic Transforms Using Collaborative Knowledge Bases

  • 1. Semantic Transforms Using Collaborative Knowledge Bases Yegin Genc, Winter Mason, Jeffrey V. Nickerson Stevens Institute of Technology
  • 2. Overview • Automatically understand online information • Using network artifacts, such as Wikipedia, to help
  • 3. Topic Models Algorithms to understand and organize documents by uncovering semantic structure of a document collection • Discover hidden themes – patterns of word use • Connect documents that exhibit similar patterns
  • 4. Latent Dirichlet Allocation (LDA) “In the computer science field of artificial intelligence, a genetic algorithm (GA) is a search heuristic that mimics the process of natural evolution. This heuristic is routinely used to generate useful solutions to optimization and search problems. Genetic algorithms belong to the larger class of evolutionary algorithms (EA), which generate solutions to optimization problems using techniques inspired by natural evolution, such as inheritance, mutation, selection, and crossover.” 1 Algorithms – 0.28 Genetic – 0.18 Optimization – 0.28 Natural – 0.18 Algorithm – 0.14 Evolution – 0.18 Computer – 0.14 Evolutionary – 0.09 Techniques – 0.14 … …. 1http://en.wikipedia.org/wiki/Genetic_algorithm
  • 5. Topics from LDA
       computer    chemistry   cortex       orbit     infection
       methods     synthesis   stimulus     dust      immune
       number      oxidation   fig          jupiter   aids
       two         reaction    vision       line      infected
       principle   product     neuron       system    viral
       design      organic     recordings   solar     cells
       Five topics from a 50-topic LDA model fit to Science from 1980 – 2002 (Blei and Lafferty, 2009).
       methods k of the for the the operations the the the objects of the o and the of a of to a linear we of functional a of algorithm and to problem and to requires is problems for the we problems a that and in
       Ten randomly chosen topics from a 50-topic LDA model fit to abstracts from the Journal of the ACM (JACM) from the years 1987 to 2004 (Blei et al., 2010).
  • 6. The interpretation problem 1. Labeling the topics is difficult (J. Chang et al., 2009) 2. The relationships between topics are not identified 3. The information in the topics is based solely on the input corpus 4. The external validity of the topics may be limited
  • 7. Collaborative Knowledge Bases 1. Labeled topics 2. Connected to each other in a meaningful way 3. Contain rich, focused information on particular topics 4. Contain fresh, up-to-date information about practically everything
  • 8. Wikipedia Pages as Topics
       LDA topic: orbit, dust, jupiter, line, system, solar, gas, atmospheric, mars, field
       Wikipedia Page: Solar System
       “The Solar System consists of the Sun and the astronomical objects gravitationally bound in orbit around it, all of which formed from the collapse of a giant molecular cloud approximately 4.6 billion years ago…” (http://en.wikipedia.org/wiki/Solar_System)
  • 9. Wikipedia Pages as Topics
       Topics are characterized as distributions over observed words in Wikipedia pages
       β_k: Per-topic word distribution
       \beta_k = p(W_i \mid k) = \frac{\{W_i \in k\}}{\sum_{i}^{N} \{W_i \in k\}}
       Wikipedia word   Freq.   p(W_i | k)
       orbit              34    0.12
       dust                7    0.02
       jupiter            36    0.12
       line                0    0.00
       system             76    0.26
       solar             110    0.38
       gas                11    0.04
       atmospheric         1    0.00
       mars                8    0.03
       field               8    0.03
  • 10. DOCUMENT – TOPIC: Θ (D x K); DOCUMENT – WORD: W (D x W); TOPIC – WORD: β (K x W)
       LDA: [diagram over Z d,n and W d,n]
       Wiki: [diagram using a (W x K) matrix]
       D: Documents, K: Topics, W: Words
  • 11. Experiment Data
       617 abstracts from Journal of the ACM
       Classified into 80 categories by their authors
       53 categories have corresponding Wikipedia Pages
       Abstracts
       { Article Name: On the (Im)possibility of Obfuscating Programs,
         Category: D.4. Operating Systems
         Add. Category: F.1 Computation by Abstract Devices
         … }
       Category Mappings
       Category                                Wikipedia Page
       D.4 Operating Systems                   Operating System
       F.1 Computation by Abstract Devices     Abstract Machine
  • 12. Three variations of our method - Inbound links are Wikipedia pages that link to the topic page - Outbound links are Wikipedia pages linked to by the topic page - Text-based method only uses word distributions in topic pages
  • 13. Results
       Method            Primary         Primary or Additional
       Text              182 (29.5%)     314 (50.8%)
       Inbound links     131 (21.2%)     249 (40.0%)
       Outbound links     79 (12.8%)     166 (26.9%)
       The number (and percentage) of authors’ primary ACM topic labels, or authors’ primary + additional ACM topics, successfully identified by each method. LDA cannot be compared without an additional step mapping word distributions to ACM topics.
  • 15. Concluding Remarks
       The Wiki categories often match the categories that were chosen by the authors. When they don’t match, they generally appear plausible.
       Among the variations of our method, the text-based approach performed better than the link-based approaches.
       Among the link-based approaches, inbound links performed better than outbound links.
  • 16. Next Steps
       Dependent topic structures
       Combine heuristics with generative models: Wikipedia as a prior for the topic distribution
       Learn from the documents observed.
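
The per-topic word distribution on slide 9 can be reproduced by normalizing the word counts shown there. The sketch below is only an illustration: the counts are copied from the slide and cover just the ten words listed, the function names are hypothetical, and the document scoring (a dot product between a document's relative word frequencies and the topic's word distribution) is an assumed reading of the matrix view on slide 10, not a confirmed description of the authors' method.

```python
from collections import Counter

# Word counts for the "Solar System" Wikipedia page, copied from slide 9.
solar_system_counts = {
    "orbit": 34, "dust": 7, "jupiter": 36, "line": 0, "system": 76,
    "solar": 110, "gas": 11, "atmospheric": 1, "mars": 8, "field": 8,
}


def word_distribution(counts):
    """Normalize raw word counts into p(word | topic page)."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("page has no counted words")
    return {word: n / total for word, n in counts.items()}


def score_document(tokens, beta_k):
    """Hypothetical score: dot product of the document's relative
    word frequencies with the topic's word distribution beta_k."""
    if not tokens:
        return 0.0
    doc_counts = Counter(tokens)
    n_tokens = len(tokens)
    return sum((c / n_tokens) * beta_k.get(w, 0.0) for w, c in doc_counts.items())


beta = word_distribution(solar_system_counts)
for word, p in beta.items():
    # Rounded to two decimals, these match the probabilities on slide 9.
    print(f"{word:<12}{p:.2f}")

# Score a toy document (made up for illustration) against the topic page.
print(score_document(["solar", "system", "orbit", "planet"], beta))
```

Under this reading, a document would be assigned to the Wikipedia page with the highest score; words that never appear on the page contribute nothing.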

Editor's Notes

  1. Blei: “Much of my research is in topic models, which are a suite of algorithms to uncover the hidden thematic structure of a collection of documents. These algorithms help us develop new ways to search, browse and summarize large archives of texts.”
  2. Here is an example of a paragraph. We assume that some number of topics exist in a document set. Each document is a mixture of these corpus-wide topics. Each topic is a distribution over words. Each word is drawn from one of those topics.
  3. Describing what they mean is different.
  4. Use posterior expectations / approximate posterior inference: Gibbs sampling, variational inference
  5. We chose this so that we can validate our results.
  6. Pause… Thank you
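
The generative assumptions listed in note 2 (each document is a mixture of topics, each topic is a distribution over words, each word is drawn from one topic) can be sketched as a small sampler. This is a simplified illustration, not an LDA implementation: the two topics, their words, and their weights are made up, the document's topic mixture is fixed by hand rather than drawn from a Dirichlet prior, and no inference is performed.

```python
import random

# Hypothetical topics: each one is a distribution over words (weights are made up).
topics = {
    "computation": {"algorithm": 0.5, "optimization": 0.3, "computer": 0.2},
    "biology": {"genetic": 0.4, "evolution": 0.4, "natural": 0.2},
}


def generate_document(topic_mixture, n_words, rng):
    """Draw n_words (topic, word) pairs: pick a topic from the document's
    mixture, then pick a word from that topic's word distribution."""
    topic_names = list(topic_mixture)
    topic_weights = list(topic_mixture.values())
    pairs = []
    for _ in range(n_words):
        topic = rng.choices(topic_names, weights=topic_weights, k=1)[0]
        vocab = list(topics[topic])
        word_weights = list(topics[topic].values())
        word = rng.choices(vocab, weights=word_weights, k=1)[0]
        pairs.append((topic, word))
    return pairs


rng = random.Random(0)  # fixed seed so repeated runs give the same sample
doc = generate_document({"computation": 0.6, "biology": 0.4}, 8, rng)
print(" ".join(word for _, word in doc))
```

Topic modeling runs this story in reverse: given only the words, it estimates the topics and each document's mixture, which is where the approximate inference methods in note 4 come in.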