Semantic Data Search and Analysis Using Web-based User-Generated Knowledge Bases
1. Semantic Data Search and Analysis
Using Web-based User-Generated
Knowledge Bases
Dr. Maria Grineva
Systems Group @ ETH Zurich
Sunday, April 7, 13
2. Today’s Search is Based On Links
• Full-text search is the main way to
access information on the Web
• The goal of Web search engines: find
the pages most relevant to the user’s
query
• Google employs the Web’s hyperlinks to
compute relevance of a Web page
(PageRank)
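As a rough illustration of link-based relevance, the sketch below runs the textbook power-iteration form of PageRank on a tiny hand-made link graph. The damping factor of 0.85, the iteration count, and the toy graph are assumptions chosen for the example; this is not Google's production algorithm.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Textbook PageRank by power iteration.

    links maps each page to the list of pages it links to.
    """
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives the "random jump" share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank evenly.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Toy graph (invented): "d" has no incoming links, "c" has the most.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

A page that nobody links to, such as "d" here, ends up with only the minimal random-jump share. That is the situation the next slide describes for fresh or poorly interlinked documents.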
22 March 2011 Systems Group @ ETH Zurich for Hasler Foundation
3. Domains Without Links
• PageRank does not work when documents
are not interlinked
• Breaking news and blog posts - must
be available in real time, before any links have
been created
• Enterprise databases - documents are
not well interconnected because of
organizational silos and the limited number of
people who create and use them
4. Web-based User-Generated
Knowledge Bases
• To rank and organize documents that are not
interlinked well, we need additional knowledge
bases:
• Wikipedia - Online encyclopedia
• Twitter - real-time microblogging service
5. The Goal of This Project
Develop a technology which automatically extracts semantic
information:
• from Wikipedia - term meanings, relationships,
ontologies ...
• from Twitter - real-time information about breaking
news, trends, people’s opinions ...
and applies this information to organize:
• news and blogs on the Web
• documents in enterprise databases
We will release our technology as an open source software
framework
6. Semantic Text Analysis Using
Wikipedia
• Leveraging Wikipedia to improve text analysis methods:
• Comprehensive coverage (6M terms vs. 65K in Britannica)
• Continuously brought up-to-date
• Rich structure (cross-references between articles, categories, redirect
pages, disambiguation pages, info-boxes)
• New algorithms:
• Advanced NLP: Word Sense Disambiguation, Keyword Extraction, Topic
Inference
• Automatic Ontology Management: Organizing Concepts into Thematically
Grouped Tag Clouds
• Semantic Search: Concept-based Similarity Search, Smart Faceted Navigation
• Zero-cost deployment and customization: No need to train methods, no
human labor, no “cold start” problem
7. Basic Technique:
Semantic Relatedness of Terms
• We analyze Wikipedia’s link structure to compute the
semantic relatedness of Wikipedia terms
• We use the Dice measure with weighted hyperlinks
(bi-directional links, direct links, “see also” links,
etc.)
Dmitry Lizorkin, Pavel Velikhov, Maria Grineva, Maxim Grinev
Accuracy Estimate and Optimization Techniques for SimRank Computation
VLDB 2008
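A minimal sketch of a Dice measure over weighted links follows. The slide does not give the weights or the exact weighting formula, so the values in `LINK_WEIGHTS`, the rule of keeping the strongest link per neighbor, and the toy link lists are all assumptions made for illustration. With unit weights the formula reduces to the standard Dice coefficient 2|A∩B| / (|A| + |B|).

```python
# Hypothetical weights per link type; the slide does not state the real values.
LINK_WEIGHTS = {"see_also": 3.0, "bidirectional": 2.0, "direct": 1.0}


def weighted_neighbors(links):
    """Map each linked article to the weight of its strongest link."""
    weights = {}
    for neighbor, link_type in links:
        weight = LINK_WEIGHTS[link_type]
        weights[neighbor] = max(weight, weights.get(neighbor, 0.0))
    return weights


def dice_relatedness(links_a, links_b):
    """Weighted Dice: weight of shared neighbors over total weight."""
    wa = weighted_neighbors(links_a)
    wb = weighted_neighbors(links_b)
    total = sum(wa.values()) + sum(wb.values())
    if total == 0:
        return 0.0
    common = wa.keys() & wb.keys()
    return sum(wa[n] + wb[n] for n in common) / total


# Invented link lists for two terms (not taken from a Wikipedia snapshot).
term_a = [
    ("Programming language", "bidirectional"),
    ("Sun Microsystems", "see_also"),
    ("Compiler", "direct"),
]
term_b = [
    ("Programming language", "bidirectional"),
    ("Compiler", "direct"),
    ("Guido van Rossum", "bidirectional"),
]
score = dice_relatedness(term_a, term_b)
print(round(score, 3))
```

The score lies between 0 (no shared neighbors) and 1 (identical weighted neighborhoods).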
8. Word Sense Disambiguation
• Example: IBM may stand for International Business
Machines Corp. or International Brotherhood of Magicians
• We use Wikipedia redirection (synonyms) and
disambiguation pages (homonyms) to detect and
disambiguate terms in a text
• Example: Platform is mentioned in the context of
implementation, open-source, web-server, HTTP
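The disambiguation step can be sketched as picking the candidate meaning that is most related to the concepts found in the surrounding text. The example below uses the plain (unweighted) Dice coefficient and invented link sets; the candidate list stands in for what a disambiguation page would provide, and none of the names come from the project's code.

```python
def dice(a, b):
    """Plain Dice coefficient of two sets of linked articles."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def disambiguate(candidates, context, links):
    """Return the candidate with the highest total relatedness to the context."""
    def score(candidate):
        return sum(
            dice(links.get(candidate, set()), links.get(term, set()))
            for term in context
        )
    return max(candidates, key=score)


# Invented link sets; a real system would read them from Wikipedia.
links = {
    "Computing platform": {"Software", "Operating system", "Web server", "Open source"},
    "Railway platform": {"Train", "Railway station"},
    "Oil platform": {"Petroleum", "Offshore drilling"},
    "Web server": {"HTTP", "Software", "Open source"},
    "HTTP": {"Web server", "Internet", "Software"},
    "Open source": {"Software", "License"},
}
candidates = ["Computing platform", "Railway platform", "Oil platform"]
context = ["Web server", "HTTP", "Open source"]
best = disambiguate(candidates, context, links)
print(best)
```

Summing relatedness over all context terms is only one possible scoring choice; the slide does not say how the context is aggregated.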
9. Prototype of a Semantic Search Engine for the Blogosphere
10. Twitter - A Real-Time News Medium
• ~200M users all over the world posting
short messages (tweets) via mobile devices
and web browsers
• ~140M tweets per day
• Twitter is an open social network where
everyone can follow everyone
• Retweets - a mechanism for fast news
spreading
11. Following + Retweets:
Twitter is the Fastest News Medium
• Twitter reacts faster than
mainstream media: Haiti
Earthquake, Hudson river plane crash
• Everyone can be a reporter: real-
time updates on the revolutions in
Tunisia, Egypt, Libya, Iran ...
12. Extracting Useful Information
From Twitter
• Popularity of a URL
• Sentiments, opinions about a news story
(tweets containing the news URL)
• Trending topics: what is being actively
discussed right now
• Personalization of news based on the user’s
friend connections:
The Tweeted Times http://tweetedtimes.com
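The first item, URL popularity, can be approximated by counting how many tweets mention each URL. The sketch below is an assumption-laden simplification: tweets are plain strings, URLs are found with a simple regular expression, and shortened links are not resolved. The sample tweets and `example.com` addresses are invented.

```python
import re
from collections import Counter

# Simplified URL pattern; real tweets would also need short-link expansion.
URL_PATTERN = re.compile(r"https?://\S+")


def url_popularity(tweets):
    """Count the number of tweets that mention each URL."""
    counts = Counter()
    for text in tweets:
        # A set ensures a URL repeated inside one tweet is counted once.
        urls = {u.rstrip(".,;!?)") for u in URL_PATTERN.findall(text)}
        counts.update(urls)
    return counts


sample_tweets = [
    "Big news http://example.com/story-1",
    "RT @someone: Big news http://example.com/story-1",
    "Unrelated http://example.com/story-2.",
    "no link here",
]
popularity = url_popularity(sample_tweets)
print(popularity.most_common())
```

Because retweets repeat the original URL, they raise the count automatically, which fits the role of retweets as a news-spreading mechanism.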
13. The Tweeted Times: personalized newspaper generated from
a user’s Twitter account
14. At the Systems Layer
• Scalable distributed architecture is required:
• Hadoop (MapReduce software framework)
for batch processing of Wikipedia
snapshots
• Real-time analytics based on distributed
key-value store for online Twitter stream
processing
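To make the batch side concrete, the sketch below imitates a MapReduce job in plain Python: the map step emits (target, source) pairs for every wiki link in an article, and the reduce step collects the incoming links of each target. It runs in a single process and does not use the Hadoop API; the function names, the simplified link pattern, and the two-article snapshot are assumptions for illustration.

```python
import re
from collections import defaultdict

# Simplified wiki-link pattern: captures the target of [[Target|label]].
WIKI_LINK = re.compile(r"\[\[([^\]|#]+)")


def map_article(title, wikitext):
    """Map step: emit (link target, source article) pairs."""
    for target in WIKI_LINK.findall(wikitext):
        yield target.strip(), title


def reduce_inlinks(target, sources):
    """Reduce step: deduplicate and sort the articles linking to a target."""
    return target, sorted(set(sources))


def run_job(articles):
    """Single-process stand-in for the shuffle between map and reduce."""
    grouped = defaultdict(list)
    for title, wikitext in articles.items():
        for key, value in map_article(title, wikitext):
            grouped[key].append(value)
    return dict(reduce_inlinks(k, v) for k, v in grouped.items())


# Invented two-article snapshot.
snapshot = {
    "Apache HTTP Server": "An [[open source|open-source]] [[web server]].",
    "Nginx": "A [[web server]] and [[reverse proxy]].",
}
inlinks = run_job(snapshot)
print(inlinks["web server"])
```

In a real deployment the map and reduce functions would run in parallel over partitions of the snapshot; only the data flow is shown here.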
15. Scalable Real-Time Analytics Based
On Distributed Key-Value Store
• At Systems Group, we are working on a system
for real-time analytics based on Cassandra:
• We extend Cassandra with:
• push-style procedure for real-time
analytics
• incremental computations (alternative
to batch-processing) - processing data as it
arrives from the stream
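The difference from batch processing can be sketched as follows: instead of recomputing counts over the whole data set, each incoming item updates the stored aggregate and may immediately push a notification. In this sketch a Python dictionary stands in for the key-value store, and the class, its threshold rule, and the callback are hypothetical; they are not Cassandra's API or the interface of the system described on the slide.

```python
from collections import defaultdict


class IncrementalCounter:
    """Keeps per-key counts up to date as items arrive from a stream."""

    def __init__(self, threshold, on_trending):
        self.counts = defaultdict(int)  # stands in for the key-value store
        self.threshold = threshold
        self.on_trending = on_trending  # push-style callback

    def process(self, key):
        # Incremental update: touch one key, no rescan of past data.
        self.counts[key] += 1
        # Push exactly once, at the moment the threshold is reached.
        if self.counts[key] == self.threshold:
            self.on_trending(key, self.counts[key])


alerts = []
counter = IncrementalCounter(
    threshold=3,
    on_trending=lambda key, count: alerts.append((key, count)),
)
for hashtag in ["#haiti", "#egypt", "#haiti", "#haiti", "#haiti"]:
    counter.process(hashtag)
print(alerts)
```

Each update costs the same regardless of how much data has already been seen, which is what makes the approach suitable for an unbounded tweet stream.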
16. References
• Prototype of the semantic search engine
Blognoon:
http://blognoon.com
• The Tweeted Times - personalized newspaper
based on a user’s Twitter account:
http://tweetedtimes.com
• Triggy: a system for real-time analytics:
http://www.systems.ethz.ch/research/projects