The document presents a tag recommendation model for collaborative bookmarking systems. It discusses using Lucene indexing and clustering approaches to suggest the most relevant tags for a given URL and description. The team describes their approach which involves preprocessing training data, crawling URLs to extract content, indexing crawled data with Lucene, and using similarity-based clustering and popularity of tags within clusters to recommend tags.
2. Abstract
We present a tag recommendation model for collaborative
bookmarking systems.
The model suggests the most relevant tags for a given URL and its
description.
We use a Lucene index and a clustering-based approach
to determine these tags.
3. Problem Statement
Design a tag recommendation system that forms a tag
cloud from a given corpus.
The tag recommendation problem can be described as follows:
For a given post P whose user is U and resource is R, a set of tags
is suggested for the post.
The commonly used approaches to choosing tags are rule-based
and classification-based methods, but both have
drawbacks: the rule-based approach relies on expert experience and
manual effort to set up the rules and tune the parameters;
the classification-based approach is restricted to a fixed tag space and is
inefficient when treated as a multi-label problem.
4. Related Work
Previous work on tag recommendation falls into
content-based and collaborative approaches.
In the content-based approach, a system applies Information Retrieval
techniques to a textual source in order to extract relevant
N-grams from the text.
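As an illustration of the content-based idea, the following minimal sketch counts word N-grams in a piece of text and returns the most frequent ones as candidate terms. The function names `tokenize` and `top_ngrams` and the sample text are hypothetical; the slides do not say which N-gram extraction technique the cited work uses.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and split it into alphanumeric word tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def top_ngrams(text, n=2, k=3):
    """Return the k most frequent word n-grams with their counts."""
    tokens = tokenize(text)
    grams = [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    return Counter(grams).most_common(k)


# Hypothetical sample text, not taken from the training data.
sample = ("Tag recommendation for social bookmarking. "
          "Social bookmarking needs tag recommendation.")
print(top_ngrams(sample, n=2, k=2))
```

A real system would usually also filter stop words and weight N-grams against the whole corpus rather than rely on raw counts.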
5. Approach
We started by pre-processing the given training data to make it
more suitable for crawling.
Then we crawl the URLs from the training dataset to extract
the web content (text, PDF, HTML documents, etc.) and
normalize it to remove unwanted markup.
Then we use Lucene to index the crawled data.
We use a similarity-score-based approach and a clustering-based
approach. We create clusters of similar links such that
each link in a cluster is similar to every other link in that cluster
according to a pre-determined criterion.
Finally, for each cluster we find the most popular tags.
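The clustering and popularity steps above can be sketched as follows. This is a simplified stand-in, not the team's implementation: it replaces Lucene's similarity scoring with a plain cosine similarity over word counts, and the threshold of 0.5, the function names, and the example URLs are all assumptions made for illustration. A link joins a cluster only if it is similar to every link already in it, which mirrors the requirement stated above.

```python
import math
from collections import Counter


def vectorize(text):
    """Bag-of-words vector (assumption: whitespace tokens, raw counts)."""
    return Counter(text.lower().split())


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def cluster_links(docs, threshold=0.5):
    """Greedy clustering: a link joins the first cluster in which it is
    similar (>= threshold) to every existing member, else starts a new one."""
    clusters = []  # each cluster is a list of (url, vector) pairs
    for url, text in docs.items():
        vec = vectorize(text)
        for members in clusters:
            if all(cosine(vec, other) >= threshold for _, other in members):
                members.append((url, vec))
                break
        else:
            clusters.append([(url, vec)])
    return [[url for url, _ in members] for members in clusters]


def popular_tags(cluster, tags_by_url, k=3):
    """Most frequent training tags among the links of one cluster."""
    counts = Counter(t for url in cluster for t in tags_by_url.get(url, []))
    return [tag for tag, _ in counts.most_common(k)]


# Hypothetical crawled text and training tags.
docs = {
    "http://example.org/a": "lucene search index java",
    "http://example.org/b": "lucene index search tutorial",
    "http://example.org/c": "chocolate cake recipe",
}
tags_by_url = {
    "http://example.org/a": ["search", "lucene"],
    "http://example.org/b": ["lucene", "java"],
    "http://example.org/c": ["cooking"],
}
clusters = cluster_links(docs)
print(clusters)
print(popular_tags(clusters[0], tags_by_url, k=1))
```

The result of this greedy pass depends on the order in which links are processed; a production system would more likely query the Lucene index for similar documents instead of comparing every pair.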
6. Approach… (continued)
To extract candidate tags we use the
following sources:
The URL given by the user
The user's previously tagged resources
The given description
Tags related to words extracted from the description
For ranking we use the user's history. The groups most
similar to the user's link and description are identified, and
the tags of these groups become the tags for the link.
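The candidate sources listed above could be combined roughly as in the sketch below, which records for every candidate tag the sources it came from. The stop-word list, the function names, and the example inputs are assumptions. The fourth source (tags related to description words) is left out because the slides do not say how relatedness is computed.

```python
import re
from urllib.parse import urlparse

# Assumed stop-word list; a real system would use a fuller one.
STOPWORDS = {"www", "com", "org", "net", "http", "https", "html",
             "the", "and", "for", "with"}


def words(text):
    """Lowercase alphabetic words, minus stop words and very short tokens."""
    return [w for w in re.findall(r"[a-z]+", text.lower())
            if w not in STOPWORDS and len(w) > 2]


def candidate_tags(url, description, user_history):
    """Map each candidate tag to the set of sources that produced it."""
    parsed = urlparse(url)
    candidates = {}
    for w in words(parsed.netloc + " " + parsed.path):
        candidates.setdefault(w, set()).add("url")
    for w in words(description):
        candidates.setdefault(w, set()).add("description")
    for tag in user_history:  # tags the user applied to earlier posts
        candidates.setdefault(tag.lower(), set()).add("history")
    return candidates


# Hypothetical post and user history.
found = candidate_tags(
    "http://lucene.apache.org/core/index.html",
    "Apache Lucene search library",
    ["Search", "java"],
)
for tag in sorted(found):
    print(tag, sorted(found[tag]))
```

Keeping the source labels makes it possible to weight candidates differently later, for example to prefer tags supported by more than one source.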
8. Theory
As part of our clustering model we calculate clusters based on the
following criteria:
Grouping the links based on their similarity to other links
Weighting the groups by their popularity in the user's links and
descriptions
Giving more weight to the title tag in the overall data
How closely a tag is related to the words in the given description
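One way to read the criteria above is as a single weighted score per tag. The sketch below is only an assumed combination: the slides do not give a formula, so the additive form, the weights `title_weight` and `description_weight`, and the input shape (a list of group similarity and tag-count pairs) are all hypothetical.

```python
def score_tags(groups, title, description,
               title_weight=2.0, description_weight=1.0):
    """Rank tags from similar groups.

    groups: list of (similarity, {tag: count}) pairs, where similarity is
    how close the group is to the new link and count is the tag's
    frequency inside that group.
    """
    title_words = set(title.lower().split())
    description_words = set(description.lower().split())
    scores = {}
    for similarity, tag_counts in groups:
        total = sum(tag_counts.values())
        for tag, count in tag_counts.items():
            # Popularity within the group, scaled by group similarity.
            scores[tag] = scores.get(tag, 0.0) + similarity * count / total
    for tag in scores:
        if tag in title_words:
            scores[tag] += title_weight        # extra weight for the title
        if tag in description_words:
            scores[tag] += description_weight  # tag appears in description
    # Highest score first; ties broken alphabetically.
    return sorted(scores, key=lambda t: (-scores[t], t))


# Hypothetical groups for a new link.
groups = [
    (0.8, {"lucene": 3, "search": 1}),
    (0.2, {"cooking": 2}),
]
ranking = score_tags(groups, "Lucene in Action",
                     "a book about search with lucene")
print(ranking)
```

Exact word matching is used here as a stand-in for "relatedness"; stemming or a co-occurrence measure would be natural refinements.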