Topic Extraction for Domain Ontology
Guided By:
Prof. S.B. Karthick
Project By:
Keerti Bhogaraju TY – C – 13
Pratiksha Jadhav TY – C – 50
Rasika Khatke TY – C – 66
Prajakta Jawale TY – C – 71
BRACT’s VISHWAKARMA INSTITUTE of TECHNOLOGY
Pune
Department of Computer Engineering
 Domain Ontology
A domain ontology is a collection of vocabularies together with a specification of the
conceptualization of a given domain (Gruber, 1993).
Examples: -
 Specific Domain chosen
Knowledge-based search engine in the field of science for students in the 3rd, 4th and 5th grades.
Example: - The human body systems
 Purpose of Topic Extraction
• To identify relevant concepts hidden in the corpus of documents
• To obtain terms that may be considered linguistic realizations of domain-specific concepts
• To assign every term found in the corpus to a specific context
• To classify documents for information discovery
• To identify key concepts and the relationships among them in the ontology
 Project Development Stages
i. Obtain domain knowledge
ii. PDF to document conversion
iii. “Cleansing” of the document
a. Tokenizing
b. Filtering (Removal of stop words)
iv. Applying either of the methods mentioned below: -
a. Clustering using K-Means algorithm
b. Topic Modeling – Latent Dirichlet Allocation (LDA)
v. Extraction of topics
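The "cleansing" stage (tokenizing, then stop-word removal) can be sketched as below. This is a minimal illustration using only the Python standard library so that it runs as-is. The stop-word list `STOP_WORDS` and the function names are hypothetical, and the regex-based tokenizer is a stand-in for the tokenizer and stop-word corpus that a library such as nltk provides.

```python
import re

# Hypothetical, deliberately tiny stop-word list (an assumption for this sketch);
# a real pipeline would use a fuller list, e.g. the one shipped with nltk.
STOP_WORDS = {"a", "an", "and", "the", "is", "of", "to", "in", "it", "that"}


def tokenize(text):
    """Lower-case the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word set."""
    return [t for t in tokens if t not in stop_words]


def clean(text):
    """Run both cleansing steps in order: tokenizing, then filtering."""
    return remove_stop_words(tokenize(text))


if __name__ == "__main__":
    print(clean("The heart is a muscle that pumps blood to the body."))
```

The cleaned token lists are what either method in stage iv would take as input.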
Method 1: Clustering using K-Means
• Clustering is the process of partitioning a set of data points into a small number of clusters.
• K-Means clustering is a method of vector quantization that aims to partition n observations
into k clusters, in which each observation belongs to the cluster with the nearest mean; that mean
serves as the prototype of the cluster.
 Algorithm: -
 About K-Means
1. Initial centroids are often chosen randomly.
2. The centroid is (typically) the mean of the points in the cluster.
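3. ‘Closeness’ is measured by Euclidean distance, cosine similarity, correlation, etc.
Euclidean distance between two n-dimensional points p and q:
$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$
4. The centroid c of k n-dimensional points x_1, ..., x_k (here k is the number of points being averaged, not the number of clusters) is their coordinate-wise mean:
$$c = \frac{1}{k} \sum_{j=1}^{k} x_j$$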
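The points above (random initial centroids, the centroid as the mean of a cluster, Euclidean distance as the closeness measure) can be combined into a short K-Means loop. This is a sketch under stated assumptions, not the project's implementation: the function names, the `max_iter` and `seed` parameters, and the rule of keeping the old centroid when a cluster becomes empty are all choices made here for illustration.

```python
import math
import random


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def centroid(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iter=100, seed=0):
    """Cluster points into k groups; returns (labels, centroids)."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)  # initial centroids chosen at random
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing: converged
            break
        labels = new_labels
        # Update step: move each centroid to the mean of its current members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # assumption: an empty cluster keeps its old centroid
                centroids[c] = centroid(members)
    return labels, centroids


if __name__ == "__main__":
    data = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
    print(k_means(data, k=2))
```

Because the starting centroids are random, a different `seed` can give a different final clustering on less well-separated data.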
 Example of K-Means
 Example of K-Means
Advantages and Limitations of K-Means
Advantages
• With a large number of variables, K-Means is usually computationally faster than hierarchical clustering, provided k is kept small.
• K-Means produces tighter clusters than hierarchical clustering, especially if the clusters are globular.
Limitations
• It is difficult to predict the right value of k.
• It does not work well with non-globular clusters.
• Different initial partitions can result in different final clusters.
• It does not work well with clusters (in the original data) of different sizes and different densities.
Method 2: Topic Modeling using LDA
• Topic modeling is useful for organizing large blocks of textual data, for information retrieval from unstructured text, and for feature selection.
• It is a process that automatically identifies the topics present in a text object and derives the hidden patterns exhibited by a text corpus, thereby assisting better decision making.
• LDA is a generative probabilistic model of a corpus. The basic idea is that documents are
represented as random mixtures over latent topics, where each topic is characterized by a
distribution over words.
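The generative idea (a document is a random mixture over topics, and each topic is a distribution over words) can be illustrated by sampling a toy document. Everything named here is an assumption for illustration: the two hand-written topics in `TOPICS`, the `alpha` values, and the function names. In full LDA the topic-word distributions are themselves drawn from a Dirichlet prior, whereas this sketch fixes them by hand.

```python
import random

# Hypothetical topics: each maps words to probabilities (fixed by hand here).
TOPICS = [
    {"heart": 0.5, "blood": 0.3, "vein": 0.2},   # a "circulatory" topic
    {"lung": 0.5, "air": 0.3, "breath": 0.2},    # a "respiratory" topic
]


def sample_dirichlet(alpha, rng):
    """Draw one sample from Dirichlet(alpha) by normalizing Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, alpha, length, rng):
    """Sample a topic mixture, then a topic and a word for each position."""
    theta = sample_dirichlet(alpha, rng)  # this document's mixture over topics
    words = []
    for _ in range(length):
        t = rng.choices(range(len(topics)), weights=theta)[0]  # latent topic
        vocab, probs = zip(*topics[t].items())
        words.append(rng.choices(vocab, weights=probs)[0])     # word from topic t
    return theta, words


if __name__ == "__main__":
    rng = random.Random(0)
    theta, words = generate_document(TOPICS, alpha=[0.5, 0.5], length=10, rng=rng)
    print(theta)
    print(words)
```

Fitting LDA runs this story in reverse: given only the words, it infers plausible topic mixtures and topic-word distributions.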
 Latent Dirichlet Allocation (LDA)
Algorithm
1. A new topic t is assigned to word w with a probability P that is proportional to the product of two
probabilities, p1 and p2.
2. p1 = p(topic t | document d): the proportion of words in document d that are currently
assigned to topic t.
3. p2 = p(word w | topic t): the proportion of assignments to topic t, over all documents, that
come from word w.
4. The current topic-word assignment is replaced by a new topic drawn with probability
proportional to p1 × p2.
5. The procedure iterates over every word w in every document d, each time resampling the current
topic-word assignment.
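The update described in these steps can be sketched in plain Python. This is an illustrative resampling loop, not the project's code and not the API of any library. The function name `lda_resample` and the smoothing constants `alpha` and `beta` are assumptions added here so that no topic ever receives zero probability, and the word being resampled is removed from the counts before p1 and p2 are computed.

```python
import random
from collections import Counter


def lda_resample(docs, num_topics, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Repeatedly resample a topic for every word using p1 * p2."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    # Start from a random topic for every word occurrence.
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    doc_topic = [Counter(a) for a in assignments]        # words in doc d on topic t
    topic_word = [Counter() for _ in range(num_topics)]  # times word w is on topic t
    topic_total = [0] * num_topics                       # all assignments to topic t
    for doc, assign in zip(docs, assignments):
        for w, t in zip(doc, assign):
            topic_word[t][w] += 1
            topic_total[t] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                old = assignments[d][i]
                # Take the current assignment out of the counts.
                doc_topic[d][old] -= 1
                topic_word[old][w] -= 1
                topic_total[old] -= 1
                weights = []
                for t in range(num_topics):
                    # p1: smoothed share of document d's words on topic t.
                    p1 = (doc_topic[d][t] + alpha) / (len(doc) - 1 + num_topics * alpha)
                    # p2: smoothed share of topic t's assignments that are word w.
                    p2 = (topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
                    weights.append(p1 * p2)
                # Draw the new topic with probability proportional to p1 * p2.
                new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][w] += 1
                topic_total[new] += 1
    return assignments, topic_word


if __name__ == "__main__":
    docs = [["heart", "blood", "heart", "vein"], ["lung", "air", "lung", "breath"]]
    assignments, topic_word = lda_resample(docs, num_topics=2)
    for t, counts in enumerate(topic_word):
        print(t, counts.most_common(3))
```

The most frequent words under each topic in `topic_word` are the candidate topic terms.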
 Dirichlet Distribution
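For reference, the standard density of a Dirichlet distribution over a K-component probability vector is given below. The symbols θ (the probability vector, for example a document's topic mixture), α (the concentration parameters) and K (the number of components) are notation chosen here, not taken from the slide.

```latex
\mathrm{Dir}(\theta \mid \alpha)
  = \frac{\Gamma\!\left(\sum_{i=1}^{K} \alpha_i\right)}{\prod_{i=1}^{K} \Gamma(\alpha_i)}
    \prod_{i=1}^{K} \theta_i^{\alpha_i - 1},
\qquad \theta_i \ge 0,\quad \sum_{i=1}^{K} \theta_i = 1,\quad \alpha_i > 0
```

Small values of α_i (below 1) favor sparse mixtures in which a few components dominate, while large values favor near-uniform mixtures.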
 Example of LDA
 Example of LDA
 Advantages and Limitations of LDA
• Advantage: LDA is a probabilistic model that yields interpretable topics.
• Disadvantage: it is hard to know when LDA is working. Topics are soft clusters, so
there is no single objective metric that identifies the best choice of hyperparameters.
 Natural Language Text Processing
• Natural Language Processing (NLP) refers to the AI method of communicating with intelligent
systems using a natural language such as English.
Techniques from NLP used in the project: -
i.) Tokenizing
ii.) Stop-word removal
iii.) Named Entity Recognition
iv.) POS Tagging
v.) Lemmatizing
 Future Scope
 Information Extraction
 Retrieval of relations and hierarchies among concepts
 Ontology Building
 Tools and Libraries used: -
 Programming Language: Python
• Libraries used: nltk, gensim, scikit-learn, quandl, pandas
 References
 https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/
 sentdex machine learning algorithms tutorial
 sentdex nltk tutorial
 Automatic Building of an Ontology from a Corpus of Text Documents Using Data Mining Tools
J. I. Toledo-Alvarado*, A. Guzmán-Arenas, G. L. Martínez-Luna
Centro de Investigación en Computación (CIC), Instituto Politécnico Nacional (IPN)
Av. Juan de Dios Báti
 K-means Algorithm Cluster Analysis in Data Mining by Zijun Zhang
Thank You
