This document discusses topic extraction for domain ontology. It describes domain ontology as a collection of vocabularies and conceptualization of a given domain. The purpose of topic extraction is to identify relevant concepts in documents, obtain domain-specific terms, classify documents, and identify key concepts and relationships for an ontology. The project stages include obtaining domain knowledge, preprocessing documents, and applying either K-Means clustering or Latent Dirichlet Allocation to extract topics. K-Means partitions data into clusters while LDA represents documents as mixtures over topics characterized by word distributions.
Topic Extraction on Domain Ontology
1. Topic Extraction for Domain Ontology
Guided By:
Prof. S.B. Karthick
Project By:
Keerti Bhogaraju TY – C - 13
Pratiksha Jadhav TY – C – 50
Rasika Khatke TY – C - 66
Prajakta Jawale TY – C - 71
BRACT’s VISHWAKARMA INSTITUTE of TECHNOLOGY
Pune
Department of Computer Engineering
2. Domain Ontology
Domain ontology is a collection of vocabularies and the specifications of the conceptualization of a
given domain (Gruber, 1993)
Examples: -
3. Specific Domain chosen
Knowledge-based search engine in the field of science for students in the 3rd, 4th and 5th grades.
Example: - The human body systems
4. Purpose of Topic Extraction
• To identify relevant concepts hidden in the corpus of documents
• To obtain terms which may be considered linguistic realizations of domain-specific concepts
• To assign every term found in the corpus to a specific context
• To classify documents for information discovery
• To identify key concepts and the relationships among them in the ontology
5. Project Development Stages
i. Obtain domain knowledge
ii. PDF to document conversion
iii. “Cleansing” of the document
a. Tokenizing
b. Filtering (Removal of stop words)
iv. Applying either of the methods mentioned below: -
a. Clustering using K-Means algorithm
b. Topic Modeling – Latent Dirichlet Allocation (LDA)
v. Extraction of topics
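The "cleansing" stage (tokenizing, then filtering stop words) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names, the regular-expression tokenizer, and the very small stop word list are all hypothetical choices made here, not the project's actual code.

```python
import re

# Assumption: a tiny illustrative stop word list; a real project would use a
# fuller list (for example one shipped with an NLP toolkit).
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "it", "of", "the", "to"}


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def remove_stop_words(tokens):
    """Drop tokens that appear in the stop word list."""
    return [t for t in tokens if t not in STOP_WORDS]


def clean(text):
    """Run both cleansing steps: tokenizing, then filtering."""
    return remove_stop_words(tokenize(text))


sample = "The heart is a muscle in the circulatory system."
print(clean(sample))  # ['heart', 'muscle', 'circulatory', 'system']
```

The cleaned token lists are what either method in stage iv would take as input.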
6. Method 1: Clustering using K-Means
Clustering is the process of partitioning a group of data points into a small number of clusters.
K-Means clustering is a method of vector quantization which aims to partition n observations
into k clusters in which each observation belongs to the cluster with the nearest mean, which
serves as a prototype of the cluster.
Algorithm: -
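The assign-and-update loop implied by the definition above can be sketched in plain Python. This is a minimal illustration, not the project's implementation: it assumes Euclidean distance, picks the initial centroids at random from the data, and the names `k_means`, `euclidean`, `mean_point` and the toy points are hypothetical.

```python
import math
import random


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def mean_point(points):
    """Coordinate-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def k_means(points, k, max_iter=100, seed=0):
    """Partition points into k clusters; returns (labels, centroids)."""
    rng = random.Random(seed)
    # Initial centroids are chosen randomly from the data points.
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda c: euclidean(p, centroids[c]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the loop has converged
        labels = new_labels
        # Update step: each centroid becomes the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster is empty
                centroids[c] = mean_point(members)
    return labels, centroids


# Toy data (assumption): two well-separated groups in two dimensions.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 1.5),
          (8.0, 8.0), (9.0, 8.5), (8.5, 9.0)]
labels, centroids = k_means(points, k=2)
print(labels)
print(centroids)
```

Which cluster receives label 0 depends on the random initialization, so only the grouping of the points is meaningful, not the label numbers.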
7. About K-Means
1. Initial centroids are often chosen randomly.
2. The centroid is (typically) the mean of the points in the cluster.
3. ‘Closeness’ is measured by Euclidean distance, cosine similarity, correlation, etc.
Euclidean Distance
4. The n-dimensional centroid of k n-dimensional points is computed as their
coordinate-wise mean.
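The formulas that points 3 and 4 refer to are not reproduced in this text. The standard definitions are given below as a reference; the symbols (x, y for points, c for the centroid, and the superscript (j) indexing the k points) are notation chosen here rather than taken from the slides.

```latex
% Euclidean distance between two n-dimensional points x and y
d(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}

% Centroid c of k n-dimensional points x^{(1)}, \dots, x^{(k)},
% computed coordinate by coordinate for i = 1, \dots, n
c_i = \frac{1}{k} \sum_{j=1}^{k} x^{(j)}_i
```

For example, the centroid of the two points (1, 2) and (3, 6) is ((1 + 3)/2, (2 + 6)/2) = (2, 4).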
10. Advantages and Limitations of K-Means
Advantages
With a large number of variables, K-Means is usually computationally faster than hierarchical
clustering, provided k is kept small.
K-Means produces tighter clusters than hierarchical clustering, especially if the clusters are globular.
Limitations
The value of k is difficult to predict
It does not work well with non-globular clusters
Different initial partitions can result in different final clusters
It does not work well with clusters (in the original data) of different sizes and different densities
11. Method 2: Topic Modeling using LDA
Topic modeling is useful for organizing large blocks of textual data, retrieving information
from unstructured text, and feature selection.
It is a process that automatically identifies the topics present in a text object and derives the
hidden patterns exhibited by a text corpus, thus assisting better decision making.
LDA is a generative probabilistic model of a corpus. The basic idea is that the documents are
represented as random mixtures over latent topics, where a topic is characterized by a
distribution over words.
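The generative view described above (a document is a random mixture over topics, and each topic is a distribution over words) can be illustrated with a short Python sketch. Everything specific here is an assumption made for illustration: the two toy topics, their word probabilities, the symmetric Dirichlet parameter `alpha`, and the function names.

```python
import random


def sample_dirichlet(alpha, rng):
    """Draw one sample from a Dirichlet distribution via normalized gammas."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]


def generate_document(topics, n_words, alpha=0.5, seed=0):
    """Generate one document from a list of topics (word -> probability)."""
    rng = random.Random(seed)
    # The document's mixture over topics is itself drawn at random.
    theta = sample_dirichlet([alpha] * len(topics), rng)
    words = []
    for _ in range(n_words):
        # Pick a topic according to the document's mixture...
        z = rng.choices(range(len(topics)), weights=theta)[0]
        # ...then pick a word according to that topic's word distribution.
        vocab, probs = zip(*topics[z].items())
        words.append(rng.choices(vocab, weights=probs)[0])
    return theta, words


# Hypothetical topics for a "human body systems" corpus.
topics = [
    {"heart": 0.5, "blood": 0.3, "vein": 0.2},
    {"lung": 0.5, "breath": 0.3, "oxygen": 0.2},
]
theta, words = generate_document(topics, n_words=8)
print(theta)
print(words)
```

Fitting LDA runs this story in reverse: given only the words, it estimates the topic mixtures and the word distributions that could plausibly have produced them.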
12. Latent Dirichlet Allocation (LDA)
Algorithm
1. A new topic "k" is assigned to word "w" with a probability P that is the product of two
probabilities, p1 and p2.
2. p1 = p(topic t | document d): the proportion of words in document d that are currently
assigned to topic t.
3. p2 = p(word w | topic t): the proportion of assignments to topic t, over all documents, that
come from word w.
4. The current topic-word assignment is replaced by a new topic drawn with probability
p1 × p2.
5. This is repeated for each word "w" in each document "d", adjusting the current topic-word
assignment each time.
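The reassignment loop in the steps above can be sketched as follows. This is a simplified illustration rather than the project's code. The smoothing constants `alpha` and `beta` are an added assumption (they keep every probability above zero and do not appear in the steps), and the function name, the toy documents, and the fixed iteration count are hypothetical.

```python
import random


def lda_gibbs(docs, n_topics, n_iter=50, alpha=0.1, beta=0.01, seed=0):
    """Resample the topic of every word using weights p1 * p2."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})

    doc_topic = [[0] * n_topics for _ in docs]    # words in doc d on topic t
    topic_word = [{} for _ in range(n_topics)]    # times word w is on topic t
    topic_total = [0] * n_topics                  # all words on topic t
    assignments = []

    # Start from a random topic for every word.
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            t = rng.randrange(n_topics)
            z_doc.append(t)
            doc_topic[d][t] += 1
            topic_word[t][w] = topic_word[t].get(w, 0) + 1
            topic_total[t] += 1
        assignments.append(z_doc)

    for _ in range(n_iter):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                # Take the word's current assignment out of the counts.
                t_old = assignments[d][i]
                doc_topic[d][t_old] -= 1
                topic_word[t_old][w] -= 1
                topic_total[t_old] -= 1

                weights = []
                for t in range(n_topics):
                    # p1: share of document d currently on topic t (smoothed).
                    p1 = (doc_topic[d][t] + alpha) / (
                        len(doc) - 1 + n_topics * alpha)
                    # p2: share of topic t's assignments that are word w.
                    p2 = (topic_word[t].get(w, 0) + beta) / (
                        topic_total[t] + vocab_size * beta)
                    weights.append(p1 * p2)

                # Draw the new topic with probability proportional to p1 * p2.
                t_new = rng.choices(range(n_topics), weights=weights)[0]
                assignments[d][i] = t_new
                doc_topic[d][t_new] += 1
                topic_word[t_new][w] = topic_word[t_new].get(w, 0) + 1
                topic_total[t_new] += 1

    return assignments, doc_topic, topic_word


# Toy corpus (assumption): already tokenized and stop-word filtered.
docs = [
    ["heart", "blood", "vein", "heart"],
    ["lung", "oxygen", "breath", "lung"],
    ["heart", "lung", "blood", "oxygen"],
]
assignments, doc_topic, topic_word = lda_gibbs(docs, n_topics=2)
print(doc_topic)
```

Each row of `doc_topic` counts how many words of that document ended up on each topic, which gives a rough view of the document's topic mixture.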
16. Advantages and Limitations of LDA
The advantage of LDA is that it is a probabilistic model with interpretable topics.
The disadvantage is that it is hard to know when LDA is working: topics are soft clusters, so
there is no objective metric to say "this is the best choice" of hyperparameters.
17. Natural Language Text Processing
Natural Language Processing (NLP) refers to the AI methods for communicating with an intelligent
system using a natural language such as English.
Techniques from NLP used in the project: -
i.) Tokenizing
ii.) Stop words
iii.) Named Entity Recognition
iv.) POS Tagging
v.) Lemmatizing
18. Future Scope
Information Extraction
Retrieval of relations and hierarchies among concepts
Ontology Building