We have developed a new technology for semantic text analysis and semantic search. The main idea is to use knowledge extracted from Wikipedia to facilitate text analysis. Wikipedia has by now grown into the largest database of concepts and their relationships that has ever existed. It is well suited to this purpose for several reasons:
1) Comprehensive coverage: it contains very general concepts such as car, computer, and government, as well as many niche concepts such as small new startup companies or people known only in certain communities.
2) Continuously kept up to date: articles are often updated within minutes of an announcement.
3) Well structured: it has redirects, which act as synonyms (Ivan the Terrible redirects to Ivan IV of Russia), and disambiguation pages, which cover homonyms by listing the different meanings of a term (IBM may stand for International Business Machines or International Brotherhood of Magicians).
Using Wikipedia as a large knowledge base allows us to significantly improve a number of existing techniques and to develop new ones that were not possible before. Here is the list of techniques that we developed: advanced NLP, etc. This is just a list of the techniques; I will explain how it all works.
Betweenness: how much an edge lies “in between” different communities
Modularity: a partition is good if there are many edges within communities and only a few between them
Zero-cost deployment and customization: No machine learning techniques which require human labor, no “cold start”
We analyze the Wikipedia link structure to compute the semantic relatedness of Wikipedia terms
We use the Dice measure with weighted links (bidirectional links, direct links, “see also” links, etc.)
Basic Technique: Semantic Relatedness of Terms
Dmitry Lizorkin, Pavel Velikhov, Maxim Grinev, Denis Turdakov
“Accuracy Estimate and Optimization Techniques for SimRank Computation”
VLDB, 2008
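The weighted Dice measure mentioned above could be sketched roughly as follows. This is only an illustration: the slides do not give the actual weights or the exact formula, so the `LINK_WEIGHTS` values, the function names, and the choice of taking the minimum weight for a shared neighbour are all assumptions.

```python
# Hypothetical link-type weights; the slides do not state the real values.
LINK_WEIGHTS = {"see_also": 3.0, "bidirectional": 2.0, "direct": 1.0}


def weighted_neighbours(links):
    """Map each linked concept to the weight of its strongest link type."""
    weights = {}
    for target, link_type in links:
        w = LINK_WEIGHTS[link_type]
        weights[target] = max(w, weights.get(target, 0.0))
    return weights


def dice_relatedness(links_a, links_b):
    """Weighted Dice: 2 * shared weight / (total weight of A + total weight of B)."""
    wa = weighted_neighbours(links_a)
    wb = weighted_neighbours(links_b)
    total = sum(wa.values()) + sum(wb.values())
    if total == 0:
        return 0.0
    # Assumption: a shared neighbour contributes the smaller of its two weights.
    shared = sum(min(wa[n], wb[n]) for n in wa.keys() & wb.keys())
    return 2.0 * shared / total


# Toy link lists, invented for illustration: (linked concept, link type).
links_a = [("X", "direct"), ("Y", "see_also")]
links_b = [("Y", "direct"), ("Z", "direct")]
score = dice_relatedness(links_a, links_b)
```

With all weights set to 1 this reduces to the ordinary Dice coefficient over the two sets of linked concepts.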
Terms Detection and Disambiguation
Example: IBM may stand for International Business Machines Corp. or International Brotherhood of Magicians
We use Wikipedia redirection (synonyms) and disambiguation pages (homonyms) to detect and disambiguate terms in a text
Example: Platform is mentioned in the context of implementation, open-source, web-server, HTTP
Denis Turdakov, Pavel Velikhov
“Semantic Relatedness Metric for Wikipedia Concepts Based on Link Analysis and its Application to Word Sense Disambiguation”
SYRCoDIS, 2008
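As a rough illustration of the disambiguation idea on this slide, the sketch below picks the candidate meaning that is most related to the other concepts found in the text. The candidate senses, the `TOY_SCORES` values, and the function names are invented for the example; a real system would plug in a relatedness measure computed from Wikipedia links.

```python
def disambiguate(candidate_senses, context_concepts, relatedness):
    """Return the candidate sense with the highest total relatedness to the context."""
    def score(sense):
        return sum(relatedness(sense, concept) for concept in context_concepts)
    return max(candidate_senses, key=score)


# Toy relatedness scores, invented for illustration only.
TOY_SCORES = {
    ("Computing platform", "Open source"): 0.5,
    ("Computing platform", "Web server"): 0.7,
    ("Computing platform", "HTTP"): 0.6,
    ("Railway platform", "HTTP"): 0.05,
}


def toy_relatedness(sense, concept):
    """Look up a toy score; unknown pairs are treated as unrelated."""
    return TOY_SCORES.get((sense, concept), 0.0)


# "Platform" appears next to implementation, open-source, web-server, HTTP.
best = disambiguate(
    ["Computing platform", "Oil platform", "Railway platform"],
    ["Implementation", "Open source", "Web server", "HTTP"],
    toy_relatedness,
)
```

Here `best` is the computing sense, because it is the only candidate with noticeable relatedness to the surrounding concepts.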
Keywords Extraction
Build document semantic graph using semantic relatedness between Wikipedia terms detected in the doc
Discover community structure of the document semantic graph
Community – densely interconnected group of nodes in a graph
Girvan-Newman algorithm for detecting community structure in networks
Select “best” communities:
Dense communities contain key terms
Sparse communities contain unimportant terms and possible disambiguation mistakes
Maria Grineva, Maxim Grinev, Dmitry Lizorkin
“Extracting Key Terms From Noisy and Multitheme Documents”
WWW2009: 18th International World Wide Web Conference
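The community-detection step can be sketched in plain Python as below. It performs a single Girvan-Newman split (remove the edges with the highest betweenness until the graph falls apart into more components) on an unweighted toy graph, then scores each community with a simple edge density. The toy edges, the function names, and the density score as a way of picking the “best” communities are assumptions made for this example, not the published method.

```python
from collections import deque


def connected_components(adj):
    """Return the connected components of an adjacency dict as a list of sets."""
    seen, comps = set(), []
    for start in adj:
        if start in seen:
            continue
        comp, queue = {start}, deque([start])
        seen.add(start)
        while queue:
            u = queue.popleft()
            for v in adj[u]:
                if v not in seen:
                    seen.add(v)
                    comp.add(v)
                    queue.append(v)
        comps.append(comp)
    return comps


def edge_betweenness(adj):
    """Brandes-style edge betweenness for an unweighted, undirected graph."""
    score = {}
    for s in adj:
        dist, sigma, preds, order = {s: 0}, {s: 1.0}, {s: []}, []
        queue = deque([s])
        while queue:  # BFS that counts shortest paths from s
            u = queue.popleft()
            order.append(u)
            for v in adj[u]:
                if v not in dist:
                    dist[v], sigma[v], preds[v] = dist[u] + 1, 0.0, []
                    queue.append(v)
                if dist[v] == dist[u] + 1:
                    sigma[v] += sigma[u]
                    preds[v].append(u)
        delta = {v: 0.0 for v in order}
        for v in reversed(order):  # accumulate path dependencies back to s
            for u in preds[v]:
                c = sigma[u] / sigma[v] * (1.0 + delta[v])
                edge = frozenset((u, v))
                score[edge] = score.get(edge, 0.0) + c
                delta[u] += c
    return score


def girvan_newman_split(edges):
    """Remove highest-betweenness edges until the graph gains a component.

    Assumes an undirected graph without self-loops.
    """
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    start = len(connected_components(adj))
    while len(connected_components(adj)) == start:
        score = edge_betweenness(adj)
        if not score:
            break
        u, v = max(score, key=score.get)
        adj[u].discard(v)
        adj[v].discard(u)
    return connected_components(adj)


def density(community, edges):
    """Fraction of possible edges inside the community that are present."""
    n = len(community)
    inside = sum(1 for u, v in edges if u in community and v in community)
    return inside / (n * (n - 1) / 2) if n > 1 else 0.0


# Toy semantic graph, invented for illustration: two tight groups and one bridge.
EDGES = [
    ("Apple Inc.", "iTunes"), ("iTunes", "iPod"), ("Apple Inc.", "iPod"),
    ("Blindness", "Screen reader"), ("Screen reader", "Accessibility"),
    ("Blindness", "Accessibility"),
    ("iTunes", "Accessibility"),
]
communities = girvan_newman_split(EDGES)
densities = [density(c, EDGES) for c in communities]
```

The full algorithm keeps splitting and uses modularity to decide where to stop; a real semantic graph would also carry relatedness weights on its edges, which this sketch ignores.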
Keywords Extraction (Example)
Semantic graph built from a news article “Apple to Make ITunes More Accessible For the Blind”
Advantages of the Keywords Extraction Method
No training. Instead of training the system with hand-created examples, we use semantic information derived from Wikipedia
Noise and multi-theme stability. Good at filtering out noise and discovering topics in Web pages
Thematically grouped key terms. Significantly improves subsequent inference of document topics
High accuracy. Evaluated using human judgments
Other Methods
General Topic Inference for a doc
using spreading activation over Wikipedia categories graph
Example: Amazon EC2, Microsoft Azure, Google MapReduce => Cloud Computing
Building Thematically Grouped Tag Clouds for many docs
Girvan-Newman algorithm to split into thematic groups
Topic inference for each group
Document classification
Semantic similarity is used to identify indirect relationships between terms (e.g. a doc about collaborative filtering is classified under recommender systems)
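The topic-inference idea listed above (spreading activation over the Wikipedia category graph) might look roughly like this. The category graph in `TOY_PARENTS`, the `decay` and `max_hops` parameters, and the function name are all invented for the sketch; the slides do not describe the actual activation scheme.

```python
from collections import defaultdict


def infer_topic(concepts, parents, decay=0.5, max_hops=3):
    """Spread activation upward from each concept and return the most activated category."""
    activation = defaultdict(float)
    for concept in concepts:
        frontier = {concept: 1.0}
        for _ in range(max_hops):
            next_frontier = defaultdict(float)
            for node, energy in frontier.items():
                for parent in parents.get(node, ()):
                    # Each hop passes on a decayed share of the energy.
                    next_frontier[parent] += energy * decay
            for node, energy in next_frontier.items():
                activation[node] += energy
            frontier = next_frontier
    return max(activation, key=activation.get) if activation else None


# Toy category graph, invented for illustration: node -> parent categories.
TOY_PARENTS = {
    "Amazon EC2": ["Cloud computing", "Amazon.com"],
    "Microsoft Azure": ["Cloud computing", "Microsoft"],
    "Google MapReduce": ["Cloud computing", "Google"],
    "Cloud computing": ["Computing"],
}

topic = infer_topic(["Amazon EC2", "Microsoft Azure", "Google MapReduce"], TOY_PARENTS)
```

The decay keeps very general categories (here, “Computing”) from winning just because everything eventually leads to them.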
Semantic Search & Navigation
Search by Concept:
Advantages of query and in-doc terms disambiguation
Result: documents about the concept and related concepts ordered by relevance (keywordness)
Smart Faceted Navigation: query-relevant facets using semantic relatedness
Concept-tips to help grasp the result documents
Each document in the result is accompanied by concept-tips that explain how the document is relevant to the query