A presentation of our project done as part of the Information Retrieval and Extraction course (2014). The project deals with clustering similar documents and extracting topics from them.
History of Types in Elasticsearch
Why they are being removed
How to migrate from old ES version using multiple types per index to the new version with one type per index or custom type fields
Introduction to Text Mining and Topic Modelling (David Paule)
A brief introduction to Text Mining and Topic Modelling given at the Urban Big Data Centre (University of Glasgow).
Want to know more? Visit my website davidpaule.es
Query Translation for Data Sources with Heterogeneous Content Semantics Jie Bao
The document discusses query translation for data sources with heterogeneous content semantics. It proposes using ontology-extended data sources to make explicit the implicit ontologies associated with data. The key aspects covered include translating queries between different data content ontologies using conversion functions and interoperation constraints to ensure sound, complete, or exact translations.
This document discusses various techniques for information retrieval (IR), including global and local methods. Global methods reformulate queries while local methods are relative to initial search results. Local methods discussed include relevance feedback, probabilistic relevance feedback, and indirect feedback. The Rocchio algorithm incorporates relevance feedback into the vector space model using cosine similarity. Naive Bayes classification and support vector machines are also covered as techniques for text classification.
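The Rocchio update mentioned above is easy to make concrete. A minimal sketch in Python, with illustrative term weights and the conventional alpha/beta/gamma parameters (the vectors and values here are made up, not taken from the slides):

```python
from collections import defaultdict

def rocchio(query_vec, relevant_docs, nonrelevant_docs,
            alpha=1.0, beta=0.75, gamma=0.15):
    """Rocchio relevance feedback: move the query vector toward the
    centroid of relevant documents and away from non-relevant ones."""
    new_q = defaultdict(float)
    for term, w in query_vec.items():
        new_q[term] += alpha * w
    for doc in relevant_docs:
        for term, w in doc.items():
            new_q[term] += beta * w / len(relevant_docs)
    for doc in nonrelevant_docs:
        for term, w in doc.items():
            new_q[term] -= gamma * w / len(nonrelevant_docs)
    # Negative weights are conventionally clipped to zero.
    return {t: w for t, w in new_q.items() if w > 0}

r = rocchio({"jaguar": 1.0},
            [{"jaguar": 0.8, "safari": 0.6}],   # judged relevant
            [{"jaguar": 0.5, "car": 0.9}])      # judged non-relevant
print(r)
```

Note how the feedback pulls in "safari" (co-occurring in the relevant document) while the unrelated "car" sense is pushed below zero and dropped.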
Using Text Comprehension Model for Learning Concepts, Context, and Topic of... (Kent State University)
Concepts in web ontologies help machines to understand data through the meanings they hold. Furthermore, learning the contexts and topics of web documents has also helped in better semantic-oriented structuring and retrieval of data on the web. In this short paper we present a novel approach for domain-independent open learning of the domain concepts, context, and topic of any given web document. Our approach is based on a computational version of the Construction-Integration (CI) model of text comprehension. Our proposed system mimics the way humans learn the meanings of textual units and identifies domain concepts, contexts, and topics in the form of semantic networks. We apply our system to a number of web documents with a range of topics and domains. The resulting semantic networks provide quantitative and qualitative insights into the nature of the given web documents.
This document discusses HTML, CSS, and data structures. It covers basic HTML and CSS concepts like the box model and selectors. It explains that data can take different structures like sequences, trees, and tables. CSS uses selectors to select and transform elements in the DOM tree based on patterns, addressing elements to apply styles. The key idea is using selectors to pick elements and apply rules to control appearance and behavior. Exercises are included to practice marking up a text with HTML and styling it with CSS. An assignment is given to create a simple multi-page website with styling.
The document discusses a study that trained a GPT-2 model to generate contextual definitions for words based on the provided context. The model was trained on a new dataset containing definition and context pairs from various sources. It was evaluated through surveys where human raters assessed definitions generated by the model for short and long contexts, as well as real human-generated definitions. The results found that while the model performed significantly better at generating definitions for short contexts compared to long ones, human-generated definitions were still significantly more accurate. Areas for improvement included reducing fluctuations depending on context and better interpreting some contexts.
Object-Oriented Writing: augmented writing for creating coherent and argument... (Seong-Young Her)
The document proposes an "Object-Oriented Writing" (OOW) approach to augment writing, especially for philosophy. It draws analogies between object-oriented programming principles and organizing writing. The OOW approach structures arguments, assets, and products in a formalized ontology to improve coherence, reduce redundancy, and enable collaboration. A proposed OOW tool would standardize information from diverse sources into machine-readable formats linked by metadata. It aims to help generate, organize, and share philosophical work. Empirical research prototyping the tool on existing texts could help refine and evaluate the OOW approach.
The document discusses logical database design principles including defining entities, attributes, relationships, and naming conventions. It describes entity-relationship diagrams and the three types of relationships: one-to-one, one-to-many, and many-to-many. Many-to-many relationships must be resolved into two one-to-many relationships with a linking table. The document also introduces the concept of cardinality which specifies the minimum and maximum number of relationships between entities.
SAX (Simple API for XML) and DOM (Document Object Model) both provide programmatic access to XML documents, but differ in their approaches. SAX processes XML documents as a stream of parsing events rather than building an in-memory tree. It is great for linear processing of large XML documents. Unlike DOM, SAX can only be used for parsing existing documents in a stream, not for generating documents. SAX notifies a client program through events as it reads an XML document sequentially.
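The event-driven style described above is visible even in a tiny example. A sketch using Python's standard-library `xml.sax`: the handler receives `startElement`/`characters`/`endElement` callbacks as the parser streams through the document, without ever building a DOM tree (the `<title>` vocabulary here is illustrative):

```python
import xml.sax

class TitleHandler(xml.sax.ContentHandler):
    """Collects the text of every <title> element as parsing events arrive."""
    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False

    def startElement(self, name, attrs):
        if name == "title":
            self._in_title = True
            self.titles.append("")

    def endElement(self, name):
        if name == "title":
            self._in_title = False

    def characters(self, content):
        # Text may arrive in several chunks, so append rather than assign.
        if self._in_title:
            self.titles[-1] += content

handler = TitleHandler()
xml.sax.parseString(b"<lib><book><title>SAX</title></book>"
                    b"<book><title>DOM</title></book></lib>", handler)
print(handler.titles)  # -> ['SAX', 'DOM']
```

Because only the handler's own state is kept in memory, the same pattern scales to XML files far larger than RAM.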
Order out of Chaos: Construction of Knowledge Models from PDF Textbooks (Isaac Alpizar-Chacon)
Textbooks are educational documents created, structured and formatted by domain experts with the main purpose to explain the knowledge in the domain to a novice. Authors use their understanding of the domain when structuring and formatting the content of a textbook to facilitate this explanation. As a result, the formatting and structural elements of textbooks carry the elements of domain knowledge implicitly encoded by their authors. Our paper presents an extendable approach towards automated extraction of this knowledge from textbooks taking into account their formatting rules and internal structure. We focus on PDF as the most common textbook representation format; however, the overall method is applicable to other formats as well. The evaluation experiments examine the accuracy of the approach, as well as the pragmatic quality of the obtained knowledge models using one of their possible applications --- semantic linking of textbooks in the same domain. The results indicate high accuracy of model construction on symbolic, syntactic and structural levels across textbooks and domains, and demonstrate the added value of the extracted models on the semantic level.
Presented at Document Engineering 2020
Cross-domain Document Retrieval: Matching between Conversational and Formal W... (Jinho Choi)
This paper challenges a cross-genre document retrieval task, where the queries are in formal writing and the target documents are in conversational writing. In this task, a query is a sentence extracted from either a summary or a plot of an episode in a TV show, and the target document consists of transcripts from the corresponding episode. To establish a strong baseline, we employ the current state-of-the-art search engine to perform document retrieval on the dataset collected for this work. We then introduce a structure reranking approach to improve the initial ranking by utilizing syntactic and semantic structures generated by NLP tools. Our evaluation shows an improvement of more than 4% when the structure reranking is applied, which is very promising.
Visualization approaches in text mining emphasize making large amounts of data easily accessible and identifying patterns within the data. Common visualization tools include simple concept graphs, histograms, line graphs, and circle graphs. These tools allow users to quickly explore relationships within text data and gain insights that may not be apparent from raw text alone. Architecturally, visualization tools are layered on top of text mining systems' core algorithms and allow for modular integration of different visualization front ends.
There are many examples of text-based documents (all in ‘electronic’ format…)
e-mails, corporate Web pages, customer surveys, résumés, medical records, DNA sequences, technical papers, incident reports, news stories and more…
Not enough time or patience to read
Can we extract the most vital kernels of information…
So, we wish to find a way to gain knowledge (in summarised form) from all that text, without reading or examining them fully first…!
Some others (e.g. DNA seq.) are hard to comprehend!
StaTIX - Statistical Type Inference on Linked Data (Artem Lutov)
StaTIX - Statistical Type Inference on Linked Data, presented at BigData 2018, Special Session on Intelligent Data Mining
https://github.com/eXascaleInfolab/StaTIX
This document discusses XML stylesheet language (XSL) which has two languages - XSL Transformation Language (XSLT) to convert XML documents to other formats, and XSL Formatting Objects Language (XSL-FO) to describe presentation of XML documents. It provides an example using an XML book document and XSL stylesheet to display book details by applying XSLT template rules and retrieving element values using <xsl:value-of> tags.
localeikki’s overall goal is to help people be active wherever they travel. We're creating a platform that allows individuals to coordinate their recreation and travel plans through local recommendations from like-minded folks and intelligence prompts based on travel patterns and recreation choices.
This document summarizes trends in non-motorized transportation for children and youth in urban areas globally. It finds that non-motorized transportation, especially walking, remains an important mode for many children's trips to school, with rates ranging from 23-53% in studies from various cities. However, some developed countries show a decreasing trend in walking and cycling to school over time. Factors like distance, traffic safety, and infrastructure affect children's transportation choices. Overall data on youth transportation is limited but non-motorized modes are likely still significant given age restrictions on driving.
The document discusses an app called Localeikki that helps travelers find local recreation activities like running trails when traveling. It provides statistics on the app's growth and user base. The company is seeking to raise $300,000 and has already raised $135,000. It was acquired by UnderArmour for $150 million and aims to leverage the intersection of travel and recreation trends.
7 facts that parents have taught us
Kolibree surveyed hundreds of parents in the USA and France to find out how they feel about their children's dental health routines.
Adobe Illustrator software course and tutorials on the Pathfinder palette by Bapu Graphics (Multimedia Computer Education), covering all Pathfinder palette commands with visual figures of their respective results and names.
Kolibree 2014 - The World's First Connected Electric Toothbrush (Kolibree)
Kolibree is the World’s First Connected Toothbrush. Connected via Bluetooth to any iOS or Android device, it gives you feedback to improve your brushing, uses gaming to engage kids, and helps you spend less money on expensive but avoidable dental procedures.
The document discusses the benefits of exercise for mental health. Regular physical activity can help reduce anxiety and depression and improve mood and cognitive function. Exercise causes chemical changes in the brain that may help protect against mental illness and improve symptoms.
This Bootstrap course will help you build responsive websites quickly and easily: websites that run on all screen sizes, whether tablets, mobiles, or monitors.
This document discusses skin hazards and how to protect yourself. It identifies several dangers such as burns, cuts, and dermatitis. It recommends assessing potential hazards, wearing appropriate protective clothing and equipment, washing the skin after work, and seeking medical attention for any wound or skin problem. The goal is to raise awareness of the importance of protecting the skin.
Adobe Photoshop Tools: get to know the Adobe Photoshop tools. Learn Adobe Photoshop from Bapu Graphics, the best Graphics and Web Designing institute in Delhi.
This document provides an overview of topic modeling. It defines topic modeling as discovering the thematic structure of a corpus by modeling relationships between words and documents through learned topics. The document introduces Latent Dirichlet Allocation (LDA) as a widely used topic modeling technique. It outlines LDA's generative process and inference methods like Gibbs sampling and variational inference. The document also discusses extensions to LDA, evaluation strategies, open questions, and applications like topic labeling and browsing.
This presentation covers the differences between Elasticsearch and relational databases, along with a glossary of Elasticsearch terms and its basic operations.
This document provides an introduction to topic modelling. It discusses how topic modelling can be used to summarize large collections of documents by clustering them into topics. It describes latent Dirichlet allocation (LDA) as a commonly used topic modelling technique that represents documents as mixtures of topics and topics as mixtures of words. The document outlines how LDA works using a generative process and Gibbs sampling. It also discusses other related methods like latent semantic analysis, word2vec, and lda2vec. Evaluation techniques for topic models like word and topic intrusion are presented.
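LDA's generative story, as summarized above, can be simulated in a few lines: for each word, draw a topic from the document's topic mixture, then draw a word from that topic's word distribution. A toy sketch in standard-library Python (the two topics, vocabularies, and probabilities are invented for illustration; real LDA infers these distributions rather than being given them):

```python
import random

random.seed(0)

# Illustrative topics: each is a distribution over words.
topics = {
    0: (["gene", "dna", "cell"],   [0.5, 0.3, 0.2]),  # a "biology" topic
    1: (["ball", "team", "score"], [0.4, 0.4, 0.2]),  # a "sports" topic
}

def generate_document(topic_mixture, n_words):
    """Sample a document from LDA's generative process, given a
    per-document topic mixture such as {0: 0.7, 1: 0.3}."""
    words = []
    for _ in range(n_words):
        # 1. Draw a topic for this word position from the topic mixture.
        k = random.choices(list(topic_mixture),
                           weights=list(topic_mixture.values()))[0]
        # 2. Draw a word from that topic's word distribution.
        vocab, probs = topics[k]
        words.append(random.choices(vocab, weights=probs)[0])
    return words

print(generate_document({0: 0.7, 1: 0.3}, 10))
```

Inference (e.g. by Gibbs sampling or variational methods) runs this story in reverse: given only the words, it recovers plausible topic mixtures and topic-word distributions.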
This presentation was provided by William Mattingly of the Smithsonian Institution, for the sixth session of NISO's 2023 Training Series on Text and Data Mining. Session six, "Text Mining Techniques" was held on Thursday, November 16, 2023.
Discourse Corpora about the subject of semantics (ssuseree197e)
This document discusses discourse processing and theories of discourse structure. It provides an overview of several theories including Rhetorical Structure Theory (RST), Segmented Discourse Representation Theory (SDRT), and the Penn Discourse Treebank (PDTB). It also discusses concepts like elementary discourse units, coherence relations, and discourse annotation corpora. The document aims to give a general introduction to discourse theories from the perspective of the relations and conceptions that link different units of text.
The document summarizes different techniques for automatic document summarization including extractive and abstractive approaches. It discusses simple techniques like frequency-based methods and cue phrases. Graph-based approaches like TextRank and LexRank that model text as a graph are explained. Linguistic methods involving lexical chains and rhetorical structure are covered. Finally, it summarizes WordNet-based semantic approaches and techniques for evaluating summaries.
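The graph-based idea behind TextRank can be sketched compactly: score sentences by running PageRank over a sentence-similarity graph, then extract the top-scoring ones. A minimal standard-library Python sketch (the similarity measure is a variant of the one in the TextRank paper, with +1 inside the logs so one-word sentences don't divide by zero; the example sentences are invented):

```python
import math

def similarity(s1, s2):
    """Word-overlap similarity between two sentences."""
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    overlap = len(w1 & w2)
    if overlap == 0:
        return 0.0
    return overlap / (math.log(len(w1) + 1) + math.log(len(w2) + 1))

def textrank(sentences, d=0.85, iterations=50):
    """PageRank over the weighted sentence-similarity graph."""
    n = len(sentences)
    sim = [[0.0 if i == j else similarity(a, b)
            for j, b in enumerate(sentences)]
           for i, a in enumerate(sentences)]
    out = [sum(row) for row in sim]          # total outgoing weight per node
    scores = [1.0] * n
    for _ in range(iterations):
        scores = [(1 - d) + d * sum(sim[j][i] / out[j] * scores[j]
                                    for j in range(n) if out[j] and sim[j][i])
                  for i in range(n)]
    return scores

sentences = ["text mining finds patterns in text collections",
             "mining text collections reveals hidden patterns",
             "the weather was sunny yesterday"]
scores = textrank(sentences)
print(sentences[scores.index(max(scores))])
```

The two mutually similar sentences reinforce each other and outrank the off-topic one, which is exactly the extractive-summarization intuition.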
EXPERT OPINION AND COHERENCE BASED TOPIC MODELING (ijnlc)
In this paper, we propose a novel algorithm that rearranges the topic assignment results obtained from topic modeling algorithms, including NMF and LDA. The effectiveness of the algorithm is measured by how much the results conform to expert opinion, captured in a data structure called TDAG that we defined to represent the probability that a pair of highly correlated words appears together. In order to make sure that the internal structure does not change too much from the rearrangement, coherence, a well-known metric for measuring the effectiveness of topic modeling, is used to control the balance of the internal structure. We developed two ways to systematically obtain the expert opinion from data, depending on whether the data has relevant expert writing or not. The final algorithm, which takes into account both coherence and expert opinion, is presented. Finally, we compare the amount of adjustment needed for each topic modeling method, NMF and LDA.
Machine learning can be used at government levels for applications like spam detection, sentiment analysis, text summarization, and topic modeling. Topic modeling uses algorithms like Latent Dirichlet Allocation to analyze documents and discover hidden topics. For example, LDA treats each document as a mixture of topics and each word's presence as attributable to one of the document's topics. It represents this with a graphical model that visualizes how parameters like document topic distributions relate. Machine learning can also be used for time series forecasting, such as predicting public library visits and book borrowing by reframing it as a supervised learning problem and using data visualization.
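The "reframing" of time series forecasting as supervised learning mentioned above is a sliding-window transform: each sample's features are the previous few observations and the target is the next one. A minimal sketch (the visit counts are invented placeholder numbers, not library data):

```python
def make_supervised(series, window):
    """Reframe a time series as a supervised learning dataset:
    features = the previous `window` values, target = the next value."""
    X, y = [], []
    for i in range(len(series) - window):
        X.append(series[i:i + window])
        y.append(series[i + window])
    return X, y

# e.g. monthly visit counts (illustrative)
X, y = make_supervised([120, 132, 128, 140, 151], window=2)
print(X)  # [[120, 132], [132, 128], [128, 140]]
print(y)  # [128, 140, 151]
```

Any standard regressor can then be trained on (X, y) to predict the next value from the most recent window.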
Post-conference workshop at tcworld India 2012. Provides background on structured authoring, XML, planning your topics, writing topics, and writing for re-use.
This document discusses text encoding and markup. It introduces XML and the Text Encoding Initiative (TEI), which uses XML to encode scholarly documents. Key points include:
- XML allows users to define their own semantic markup languages and impose interpretive models on texts through schemas like TEI.
- TEI is the dominant language for encoding scholarly texts and primary sources. It allows scholars to select elements to match their areas of interest.
- XML and TEI view texts as ordered hierarchies of content objects (OHCO), representing them as trees. This has advantages like easy processing but also limitations regarding overlaps in logical and physical structure.
- Different representational tools like tables and trees can be used to reconcile textual
Web classification of Digital Libraries using GATE Machine Learning (sstose)
This document provides an introduction to classifying digital library documents using machine learning with GATE. It discusses how text mining applies natural language processing to analyze textual content and extract knowledge. While humans can understand language intuitively, machines struggle with aspects like context and ambiguity. The document then outlines how natural language preprocessing is used for text classification, including tokenization, stopword removal, and representing documents as bags of words. The goal is to train machine learning algorithms on linguistic features to automatically categorize new documents.
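The preprocessing pipeline named above (tokenization, stopword removal, bag-of-words representation) fits in a few lines of standard-library Python. A sketch, with a deliberately tiny illustrative stopword list:

```python
import re
from collections import Counter

# A small illustrative stopword list; real systems use larger ones.
STOPWORDS = {"the", "a", "an", "is", "of", "to", "and", "in"}

def bag_of_words(text):
    """Tokenize, lowercase, drop stopwords, and count term frequencies."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return Counter(t for t in tokens if t not in STOPWORDS)

bow = bag_of_words("The library classifies the digital documents "
                   "in a digital archive.")
print(bow)
```

The resulting term-frequency vectors are the features a classifier is trained on; word order is discarded, which is exactly the "bag of words" simplification.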
Data Structures and Algorithms (DSA) form the backbone of efficient and optimized software solutions. Whether you’re preparing for coding interviews or aiming to enhance your problem-solving skills, understanding DSA is essential. In this comprehensive guide, we’ll explore the key topics and algorithms in DSA, equipping you with the knowledge to tackle complex programming challenges.
In this series on Data Structures and Algorithms (DSA), we dive deep into each topic, providing a clear understanding of their purpose, implementation, and use cases. These notes serve as a comprehensive resource, covering both fundamental concepts and advanced algorithms.
Preparing for coding interviews? These notes cover a range of algorithms, including popular graph algorithms like Breadth First Search (BFS) and Depth First Search (DFS), shortest path algorithms like Dijkstra’s Algorithm and Bellman-Ford Algorithm, and dynamic programming techniques. By studying these algorithms and understanding their implementation, you’ll be well-prepared to tackle interview questions that require efficient problem-solving skills.
Understanding the efficiency of algorithms is crucial. That’s why we cover Big O notation, enabling you to analyze and compare the time and space complexities of different algorithms.
From foundational data structures like arrays, linked lists, stacks, and queues to advanced concepts like trees, binary search trees, AVL trees, and heaps, these notes provide comprehensive coverage of DSA.
Unlock the power of data structures and algorithms by exploring these notes, which encompass both theory and practical implementation. Enhance your problem-solving skills, optimize your code, and excel in coding interviews.
The document discusses ontology matching, which is the process of finding relationships between entities in different ontologies. It describes various techniques for ontology matching including basic techniques that operate at the element-level or structure-level, as well as classifications of matching techniques based on the type of input used and level of interpretation. The document also provides examples of commonly used methods for ontology matching like string-based, language-based, and structure-based techniques.
This presentation is a briefing of a paper about Networks and Natural Language Processing. It describes many graph based methods and algorithms that help in syntactic parsing, lexical semantics and other applications.
This document presents the Duet model for document ranking. The Duet model uses a combination of local and distributed representations of text to perform both exact and inexact matching of queries to documents. The local model operates on a term interaction matrix to model exact matches, while the distributed model projects text into an embedding space for inexact matching. Results show the Duet model, which combines these approaches, outperforms models using only local or distributed representations. The Duet model benefits from training on large datasets and can effectively handle queries containing rare terms or needing semantic matching.
Digital scholarly editions (DSEs) facilitate collaboration, dissemination of resources, and new analyses. Digital epigraphy uses EpiDoc, a TEI subset, to encode inscriptions. EpiDoc divides texts into conventional parts and provides tools. XML describes text structures with tags, separating content from presentation. This allows flexible outputs from a single source and reuse across projects.
The document discusses various techniques for dimensionality reduction and analysis of text data, including latent semantic indexing (LSI), locality preserving indexing (LPI), and probabilistic latent semantic analysis (PLSA). LSI uses singular value decomposition to project documents into a lower-dimensional space while minimizing reconstruction error. LPI aims to preserve local neighborhood structures between similar documents. PLSA models documents as mixtures of underlying latent themes characterized by multinomial word distributions.
1. Hierarchical Topic Detection and Representation
Yash Vadalia (201001015)
Raj Mehta (201305504)
Lalit M (201101189)
Ashutosh Borkar (201101002)
2. Introduction
● Huge volume of news/information.
● Automatic processing of information is needed to keep up with the latest updates.
● Documents with similar stories are clustered together.
● Topics are extracted from these clusters.
● Applications: searching, topic-based document suggestion.
4. Parsing
● Corpus: Real news dataset (link).
● Unstructured data makes information extraction difficult.
● The data contains a huge amount of noise:
○ HTML tags
○ non-printable characters
5. ...continued
● Process the raw data and remove noise (HTML tags, comments, etc).
● Segment each document into sentences and further into words/tokens.
● Stop-word removal and stemming.
● Tag each token with its part of speech (POS).
● Store the tag and frequency of all nouns and verbs (the document vector).
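The steps above can be sketched in a few lines of Python. This is an illustrative toy: the stopword list, the suffix-stripping "stemmer", and the tiny POS lexicon are stand-ins for real NLP components (e.g. NLTK's stopword list, Porter stemmer, and POS tagger), not the ones used in the project.

```python
import re

STOPWORDS = {"the", "a", "an", "is", "are", "of", "in", "and", "to"}

# Hypothetical mini POS lexicon; a real tagger assigns tags from context.
POS = {"reporter": "NN", "file": "VB", "stories": "NN", "editor": "NN",
       "review": "VB", "article": "NN"}

def stem(word):
    # Crude suffix stripping, standing in for a real stemmer.
    for suffix in ("ies", "es", "s"):
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[:-len(suffix)]
    return word

def document_vector(text):
    """Tokenize, drop stopwords, tag, and count nouns/verbs only."""
    tokens = re.findall(r"[a-z]+", text.lower())
    vector = {}
    for tok in tokens:
        if tok in STOPWORDS:
            continue
        tag = POS.get(tok)          # unknown words are simply skipped here
        if tag in ("NN", "VB"):     # keep only nouns and verbs
            key = (stem(tok), tag)
            vector[key] = vector.get(key, 0) + 1
    return vector

print(document_vector("The reporter will file the stories"))
```

The resulting map from (stem, tag) pairs to frequencies is the document vector compared in the next slide.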
6. Document Similarity
● Document similarity: Cosine similarity of document vectors.
● The higher the similarity, the more likely two documents share a topic.
● Wt represents the weight of a word and is computed using the TF-IDF scheme.
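Cosine similarity over sparse term-weight vectors can be computed directly; a minimal sketch (the vectors and weights below are made-up examples):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two term->weight maps."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    nu = math.sqrt(sum(w * w for w in u.values()))
    nv = math.sqrt(sum(w * w for w in v.values()))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)

d1 = {"election": 2.0, "vote": 1.0}
d2 = {"election": 1.0, "poll": 1.0}
print(round(cosine_similarity(d1, d2), 3))  # shared "election" term drives the score
```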
7. Cluster Similarity
● Various linkage criteria are available for finding the similarity between clusters:
○ Single Linkage
○ Complete Linkage
○ Mean Linkage
○ Centroid Linkage
○ Minimum Energy, etc.
● Mean linkage is preferred over the others since it reduces the effect of chaining.
8. Clustering
● Agglomerative hierarchical clustering
○ Consider each document as single cluster
○ Find most (max) similar pair of clusters to merge
○ Merge into single cluster
○ Repeat
● Each iteration reduces the number of clusters by one.
● Termination:
○ Either the maximum similarity drops below a threshold,
○ Or the requisite number of clusters has been formed.
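The procedure above can be sketched as follows. This is a naive O(n³)-per-run illustration, assuming a precomputed pairwise similarity function `sim`; the toy documents and threshold are made up for the example:

```python
def mean_linkage(c1, c2, sim):
    """Mean linkage: average pairwise similarity between two clusters."""
    return sum(sim(a, b) for a in c1 for b in c2) / (len(c1) * len(c2))

def agglomerate(docs, sim, threshold):
    clusters = [[d] for d in docs]            # each document starts as a cluster
    while len(clusters) > 1:
        best, pair = -1.0, None
        for i in range(len(clusters)):        # find the most similar pair
            for j in range(i + 1, len(clusters)):
                s = mean_linkage(clusters[i], clusters[j], sim)
                if s > best:
                    best, pair = s, (i, j)
        if best < threshold:                  # termination: similarity too low
            break
        i, j = pair
        clusters[i] = clusters[i] + clusters[j]   # merge the pair
        del clusters[j]                           # one fewer cluster per iteration
    return clusters

# Toy similarity: documents sharing a label prefix are similar.
docs = ["sports1", "sports2", "politics1"]
sim = lambda a, b: 1.0 if a[:-1] == b[:-1] else 0.1
print(agglomerate(docs, sim, threshold=0.5))
```

Recording which pair is merged at each step yields the binary tree shown in the Results slide.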
9. Topic Extraction
● Used TF-IDF and the parsimonious model to weigh terms and extract the most relevant topics.
● Parsimonious Model
10. ...continued
● Words with low weight are ignored; those with the maximum weight are taken as the topic of that cluster.
● Instead of all word types, processing specific parts of speech yields more relevant topics.
● Proper nouns and verbs represent entities and events, respectively, in a document.
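A sketch of parsimonious-model term weighting, following the standard EM formulation: terms that occur more often in a cluster than in the background corpus get boosted, while common words are suppressed. The mixing weight `lam` and the toy counts below are illustrative assumptions, not the project's actual values:

```python
def parsimonious_weights(cluster_tf, corpus_tf, lam=0.5, iters=20):
    """EM re-estimation of P(term | cluster) against a background model."""
    corpus_total = sum(corpus_tf.values())
    p_bg = {t: c / corpus_total for t, c in corpus_tf.items()}
    total = sum(cluster_tf.values())
    p = {t: c / total for t, c in cluster_tf.items()}   # init with MLE
    for _ in range(iters):
        # E-step: expected term counts attributed to the cluster model
        e = {t: cluster_tf[t] * lam * p[t] / (lam * p[t] + (1 - lam) * p_bg[t])
             for t in cluster_tf}
        # M-step: renormalize into a probability distribution
        z = sum(e.values())
        p = {t: e[t] / z for t in e}
    return p

cluster_tf = {"the": 10, "election": 6, "vote": 4}
corpus_tf = {"the": 1000, "election": 20, "vote": 15}
w = parsimonious_weights(cluster_tf, corpus_tf)
print(max(w, key=w.get))   # the cluster-specific term wins over the common word
```

Here "the" is frequent in the cluster but equally frequent everywhere, so its weight collapses and "election" emerges as the topic term.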
11. Results
● Output is a binary tree having various clusters combined at each level.
● Each non-leaf node in the tree is a cluster.
● Each leaf node is a document.
● The tree is not well balanced and suffers slightly from chaining when almost all documents are on the same topic.
13. Conclusion
● HTD is a newer variant of topic detection.
● It provides multiple levels of granularity.
● The major issue with the statistical approach we followed is scaling:
○ Cubic complexity of processing (document similarity matrix, clustering).
● Relevance between documents can be improved as we move from documents towards events.