Topic Extraction on Domain Ontology

Topic Extraction for Domain Ontology
Guided By:
Prof. S.B. Karthick
Project By:
Keerti Bhogaraju TY – C - 13
Pratiksha Jadhav TY – C – 50
Rasika Khatke TY – C - 66
Prajakta Jawale TY – C - 71
BRACT’s VISHWAKARMA INSTITUTE of TECHNOLOGY
Pune
Department of Computer Engineering

 Domain Ontology
A domain ontology is a collection of vocabularies and a specification of the conceptualization of a
given domain (Gruber, 1993).
Examples: -

 Specific Domain chosen
Knowledge-based search engine in the field of science for students in 3rd, 4th and 5th grade.
Example: - The human body systems

 Purpose of Topic Extraction
• To identify relevant concepts hidden in the corpus of documents
• To obtain terms which may be considered linguistic realizations of domain-specific concepts
• To assign every term found in the corpus to a specific context
• To classify documents for information discovery
• To identify key concepts and the relationships among them in ontology

 Project Development Stages
i. Obtain domain knowledge
ii. PDF to document conversion
iii. “Cleansing” of the document
a. Tokenizing
b. Filtering (Removal of stop words)
iv. Applying either of the methods mentioned below: -
a. Clustering using K-Means algorithm
b. Topic Modeling – Latent Dirichlet Allocation (LDA)
v. Extraction of topics
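
The "cleansing" stage (tokenizing, then removing stop words) can be illustrated with a minimal sketch. It uses only the Python standard library; the small STOP_WORDS set, the function names and the sample sentence are assumptions made for illustration, and the project itself may have relied on a library such as nltk for these steps.

```python
import re

# Tiny illustrative stop-word list (an assumption; a real run would use a fuller list).
STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is", "are", "it", "that", "through"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Drop every token that appears in the stop-word list."""
    return [t for t in tokens if t not in stop_words]

def cleanse(text):
    """Run both cleansing steps: tokenizing, then stop-word filtering."""
    return remove_stop_words(tokenize(text))

sample = "The heart pumps blood through the body."
cleaned = cleanse(sample)
print(cleaned)
```

The filtered token list is what the later clustering or topic-modeling stage would consume.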

 Method 1: Clustering using K-Means
 Clustering is the process of partitioning a group of data points into a small number of clusters.
 K-Means clustering is a method of vector quantization which aims to partition n observations
into k clusters in which each observation belongs to the cluster with the nearest mean, which serves
as a prototype of the cluster.
 Algorithm: -
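
A minimal sketch of the standard K-Means loop (assign each point to its nearest centroid, recompute each centroid as the mean of its cluster, repeat until nothing changes). It is written in plain Python with Euclidean distance; the function names, the seed parameter and the sample points are assumptions for illustration, not the project's actual code.

```python
import math
import random

def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def k_means(points, k, max_iters=100, seed=0):
    rng = random.Random(seed)
    # Initial centroids are chosen randomly from the data points.
    centroids = rng.sample(points, k)
    for _ in range(max_iters):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: euclidean(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: each centroid becomes the mean of its cluster
        # (an empty cluster keeps its previous centroid).
        new_centroids = [
            mean_point(c) if c else centroids[i]
            for i, c in enumerate(clusters)
        ]
        if new_centroids == centroids:
            break  # converged: the assignments can no longer change
        centroids = new_centroids
    return centroids, clusters

points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5), (8.0, 8.0), (8.5, 9.0), (9.0, 8.0)]
centroids, clusters = k_means(points, k=2)
print(centroids)
print(clusters)
```

For text, each point would be a document vector (for example, term counts), and cosine similarity could replace the Euclidean distance.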

 About K-Means
1. Initial centroids are often chosen randomly.
2. The centroid is (typically) the mean of the points in the cluster.
3. ‘Closeness’ is measured by Euclidean distance, cosine similarity, correlation, etc.
Euclidean Distance
4. The n-dimensional centroid of k n-dimensional points is calculated as their coordinate-wise mean.
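
The two formulas referred to above can be written in their standard textbook form. The symbols x, y, c and the superscript index j are notation chosen here for illustration; k counts the points being averaged, as in item 4.

```latex
% Euclidean distance between two n-dimensional points x and y
d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

% Centroid c of k n-dimensional points x^{(1)}, ..., x^{(k)}
c = \frac{1}{k} \sum_{j=1}^{k} x^{(j)},
\qquad
c_i = \frac{1}{k} \sum_{j=1}^{k} x^{(j)}_i \quad (i = 1, \dots, n)
```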

 Advantages and Limitations of K-Means
Advantages
 With a large number of variables, K-Means is usually computationally faster than hierarchical clustering,
provided k is kept small.
 K-Means produces tighter clusters than hierarchical clustering, especially if the clusters are globular.
Limitations
 It is difficult to predict the value of k
 It does not work well with non-globular clusters
 Different initial partitions can result in different final clusters
 It does not work well with clusters (in the original data) of different size and different density

 Method 2: Topic Modeling using LDA
 Useful for organizing large blocks of textual data, information retrieval from unstructured text
and feature selection.
 A process to automatically identify topics present in a text object and to derive hidden
patterns exhibited by a text corpus, thus assisting better decision making.
 LDA is a generative probabilistic model of a corpus. The basic idea is that the documents are
represented as random mixtures over latent topics, where a topic is characterized by a
distribution over words.

 Latent Dirichlet Allocation (LDA)
Algorithm
1. A new topic “k” is assigned to word “w” with a probability P, which is the product of two
probabilities, p1 and p2.
2. p1 = p(topic t | document d): the proportion of words in document d that are currently
assigned to topic t.
3. p2 = p(word w | topic t): the proportion of assignments to topic t, over all documents, that
come from word w.
4. The current topic-word assignment is replaced by a new topic chosen with probability
p1 × p2.
5. The algorithm iterates through each word “w” in each document “d”, adjusting the current
topic-word assignment in this way.
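
The update described in the steps above can be sketched as a small Gibbs-style sampler in plain Python. This is an illustrative sketch, not the project's implementation (which may have used gensim): the function name lda_gibbs, the toy documents, and the smoothing constants alpha and beta are assumptions. The smoothing is added to the raw proportions so that no topic ever gets probability zero.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    doc_topic = [[0] * num_topics for _ in docs]                # words in doc d assigned to topic t
    topic_word = [defaultdict(int) for _ in range(num_topics)]  # times word w is assigned to topic t
    topic_total = [0] * num_topics                              # all assignments to topic t

    # Start from a random topic for every word occurrence.
    assignments = []
    for d, doc in enumerate(docs):
        doc_assign = []
        for w in doc:
            t = rng.randrange(num_topics)
            doc_assign.append(t)
            doc_topic[d][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
        assignments.append(doc_assign)

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Remove the current assignment of this word from the counts.
                old = assignments[d][i]
                doc_topic[d][old] -= 1
                topic_word[old][w] -= 1
                topic_total[old] -= 1

                # Weight each topic by p1 * p2 (with alpha/beta smoothing).
                weights = []
                for t in range(num_topics):
                    p1 = (doc_topic[d][t] + alpha) / (len(doc) - 1 + num_topics * alpha)
                    p2 = (topic_word[t][w] + beta) / (topic_total[t] + vocab_size * beta)
                    weights.append(p1 * p2)

                # Draw the new topic in proportion to the weights and restore the counts.
                new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][w] += 1
                topic_total[new] += 1

    return assignments, topic_word

# Toy corpus (hypothetical), already tokenized and stop-word filtered.
docs = [
    ["heart", "blood", "heart", "vessel"],
    ["lung", "air", "lung", "breath"],
    ["blood", "vessel", "heart"],
    ["air", "breath", "lung"],
]
assignments, topic_word = lda_gibbs(docs, num_topics=2)
for t, counts in enumerate(topic_word):
    top = sorted(counts.items(), key=lambda kv: -kv[1])[:3]
    print("topic", t, top)
```

The most frequent words under each topic serve as that topic's label candidates; the result depends on the random seed and on the number of topics chosen.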

 Advantages and Limitations of LDA
 The advantage of LDA is that it is a probabilistic model with interpretable topics.
 The disadvantage is that it is hard to know when LDA is working: topics are soft clusters, so
there is no objective metric to say "this is the best choice" of hyperparameters.

 Natural Language Text Processing
 Natural Language Processing (NLP) refers to the AI method of communicating with an intelligent
system using a natural language such as English.
Techniques from NLP used in the project: -
i.) Tokenizing
ii.) Stop words
iii.) Named Entity Recognition
iv.) POS Tagging
v.) Lemmatizing

 Future Scope
 Information Extraction
 Retrieval of relations and hierarchies among concepts
 Ontology Building

 Tools and Libraries used: -
 Programming Language: Python
 Libraries used: - nltk, gensim, scikit-learn, quandl, pandas

 References
 https://www.analyticsvidhya.com/blog/2016/08/beginners-guide-to-topic-modeling-in-python/
 sentdex machine learning algorithms tutorial
 sentdex nltk tutorial
 Automatic Building of an Ontology from a Corpus of Text Documents Using Data Mining Tools
J. I. Toledo-Alvarado*, A. Guzmán-Arenas, G. L. Martínez-Luna
Centro de Investigación en Computación (CIC), Instituto Politécnico Nacional (IPN)
Av. Juan de Dios Báti
 K-means Algorithm Cluster Analysis in Data Mining by Zijun Zhang
