Semantic Data Search and Analysis Using Web-based User-Generated Knowledge Bases

  1. Semantic Data Search and Analysis Using Web-based User-Generated Knowledge Bases
     Dr. Maria Grineva, Systems Group @ ETH Zurich
     Sunday, April 7, 13
  2. Today’s Search is Based On Links
     • Full-text search is the main way to access information on the Web
     • The goal of Web search engines: find the pages most relevant to the user’s query
     • Google uses the Web’s hyperlinks to compute the relevance of a Web page (PageRank)
     22 March 2011, Systems Group @ ETH Zurich for Hasler Foundation
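The link-based ranking idea on this slide can be illustrated with a minimal power-iteration sketch in the spirit of PageRank. The toy link graph, the damping factor of 0.85, and the function name `pagerank` are assumptions made for illustration; this is not Google's production algorithm.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power iteration over a dict mapping page -> list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share (the "random jump").
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its rank evenly among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without outgoing links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Hypothetical four-page graph: "d" has no incoming links at all.
links = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(links)
best_page = max(ranks, key=ranks.get)
worst_page = min(ranks, key=ranks.get)
```

In this toy graph the page with no incoming links ends up with the lowest score, which is the situation the next slide describes for fresh or poorly interlinked documents.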
  3. Domains Without Links
     • PageRank does not work when documents are not interlinked
     • Breaking news and blog posts: must be available in real time, before any links to them have been created
     • Enterprise databases: documents are not well interconnected because of organizational silos and the limited number of people who create and use them
  4. Web-based User-Generated Knowledge Bases
     • To rank and organize documents that are not well interlinked, we need additional knowledge bases:
       • Wikipedia: online encyclopedia
       • Twitter: real-time microblogging service
  5. The Goal of This Project
     Develop a technology which automatically extracts semantic information:
     • from Wikipedia: term meanings, relationships, ontologies ...
     • from Twitter: real-time information about breaking news, trends, people’s opinions ...
     and applies this information to organize:
     • news and blogs on the Web
     • documents in enterprise databases
     We will release our technology as an open-source software framework.
  6. Semantic Text Analysis Using Wikipedia
     • Leveraging Wikipedia to improve text analysis methods:
       • Comprehensive coverage (6M terms vs. 65K in Britannica)
       • Continuously brought up to date
       • Rich structure (cross-references between articles, categories, redirect pages, disambiguation pages, info-boxes)
     • New algorithms:
       • Advanced NLP: Word Sense Disambiguation, Keyword Extraction, Topic Inference
       • Automatic Ontology Management: Organizing Concepts into Thematically Grouped Tag Clouds
       • Semantic Search: Concept-based Similarity Search, Smart Faceted Navigation
     • Zero-cost deployment and customization: no need to train methods, no human labor, no “cold start” problem
  7. Basic Technique: Semantic Relatedness of Terms
     • We analyze Wikipedia’s link structure to compute the semantic relatedness of Wikipedia terms
     • We use the Dice measure with weighted hyperlinks (bi-directional links, direct links, “see also” links, etc.)
     Dmitry Lizorkin, Pavel Velikhov, Maria Grineva, Maxim Grinev. Accuracy Estimate and Optimization Techniques for SimRank Computation. VLDB 2008.
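A minimal sketch of a Dice measure over weighted links may make the idea on this slide more concrete. The slide does not give the exact formula or the weights, so the per-link-type weights, the use of `min` for shared neighbors, and the toy link data below are all assumptions made for illustration.

```python
# Hypothetical weights per link type; the slides do not state the real values.
LINK_WEIGHTS = {"bidirectional": 1.0, "see_also": 0.8, "direct": 0.5}


def weighted_dice(links_a, links_b, weights=LINK_WEIGHTS):
    """Dice-style relatedness of two terms.

    Each argument maps a neighboring article title to the type of link
    that connects it to the term.
    """
    wa = {title: weights[kind] for title, kind in links_a.items()}
    wb = {title: weights[kind] for title, kind in links_b.items()}
    total = sum(wa.values()) + sum(wb.values())
    if total == 0:
        return 0.0
    # A shared neighbor contributes the smaller of its two link weights.
    shared = sum(min(wa[t], wb[t]) for t in wa.keys() & wb.keys())
    return 2.0 * shared / total


# Toy link neighborhoods, invented for this example.
ibm = {"Computer": "bidirectional", "Mainframe": "direct", "Software": "see_also"}
microsoft = {"Computer": "bidirectional", "Software": "direct", "Windows": "direct"}
magic = {"Illusion": "bidirectional", "Stage": "direct"}

related = weighted_dice(ibm, microsoft)
unrelated = weighted_dice(ibm, magic)
```

With unit weights this reduces to the plain Dice coefficient, 2|A ∩ B| / (|A| + |B|), over the sets of linked articles.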
  8. Word Sense Disambiguation
     • Example: IBM may stand for International Business Machines Corp. or International Brotherhood of Magicians
     • We use Wikipedia redirect pages (synonyms) and disambiguation pages (homonyms) to detect and disambiguate terms in a text
     • Example: “Platform” is mentioned in the context of implementation, open-source, web-server, HTTP
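One way to picture the disambiguation step is to score each candidate sense by its relatedness to the surrounding context terms and keep the best one. The sketch below assumes an unweighted Dice measure over link sets; the sense labels, link sets, and the `disambiguate` function are hypothetical and only illustrate the "Platform" example.

```python
def dice(a, b):
    """Plain Dice coefficient between two sets of linked article titles."""
    if not a and not b:
        return 0.0
    return 2.0 * len(a & b) / (len(a) + len(b))


def disambiguate(candidates, context, links):
    """Pick the candidate sense most related to the context terms."""
    def score(sense):
        return sum(dice(links[sense], links[term])
                   for term in context if term in links)
    return max(candidates, key=score)


# Toy link sets, invented for this example.
links = {
    "Computing platform": {"Software", "Operating system", "Web server"},
    "Railway platform": {"Train", "Station", "Track"},
    "Web server": {"HTTP", "Software", "Computing platform"},
    "HTTP": {"Web server", "Software"},
    "Open source": {"Software", "License"},
}

sense = disambiguate(
    candidates=["Computing platform", "Railway platform"],
    context=["Web server", "HTTP", "Open source"],
    links=links,
)
```

In a real system the candidate senses would come from a disambiguation page and the link sets from a Wikipedia snapshot rather than from a hand-written dictionary.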
  9. Prototype of a Semantic Search Engine for the Blogosphere
  10. Twitter - A Real-Time News Medium
     • ~200M users all over the world posting short messages (tweets) via mobile devices and web browsers
     • ~140M tweets per day
     • Twitter is an open social network where everyone can follow everyone
     • Retweets: a mechanism for spreading news fast
  11. Following + Retweets: Twitter is the Fastest News Medium
     • Twitter reacts faster than mainstream media: Haiti earthquake, Hudson River plane crash
     • Everyone can be a reporter: real-time updates on the revolutions in Tunisia, Egypt, Libya, Iran ...
  12. Extracting Useful Information From Twitter
     • Popularity of a URL
     • Sentiments and opinions about a news story (tweets containing the news URL)
     • Trending topics: what is being actively discussed right now
     • Personalization of news based on the user’s friend connections: The Tweeted Times http://tweetedtimes.com
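As a rough illustration of the first bullet, URL popularity can be approximated by counting how many tweets mention each URL. The regular expression, the sample tweets, and the `url_popularity` function are assumptions for this sketch; it does not use the Twitter API and ignores shortened or redirected URLs.

```python
import re
from collections import Counter

# Simplistic URL matcher: "http://" or "https://" followed by non-space characters.
URL_PATTERN = re.compile(r"https?://\S+")


def url_popularity(tweets):
    """Count the number of tweets that mention each URL."""
    counts = Counter()
    for text in tweets:
        # A URL repeated inside one tweet is counted only once.
        counts.update(set(URL_PATTERN.findall(text)))
    return counts


# Invented sample tweets.
tweets = [
    "Big news http://example.com/a",
    "RT @user: Big news http://example.com/a",
    "Other story http://example.com/b http://example.com/b",
    "no link here",
]

popularity = url_popularity(tweets)
top_url, top_count = popularity.most_common(1)[0]
```

The same per-URL grouping would also give the set of tweets to analyze for sentiment about a given news story.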
  13. The Tweeted Times: personalized newspaper generated from the user’s Twitter account
  14. At the Systems Layer
     • A scalable distributed architecture is required:
       • Hadoop (MapReduce software framework) for batch processing of Wikipedia snapshots
       • Real-time analytics based on a distributed key-value store for online Twitter stream processing
  15. Scalable Real-Time Analytics Based On Distributed Key-Value Store
     • At Systems Group, we are working on a system for real-time analytics based on Cassandra
     • We extend Cassandra with:
       • a push-style procedure for real-time analytics
       • incremental computations (an alternative to batch processing): processing data as it arrives from the stream
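The contrast between incremental and batch computation can be sketched in plain Python. This is only an in-memory illustration: the `IncrementalCounter` class, its threshold, and the callback are hypothetical and do not use or represent Cassandra's API or the group's actual extension.

```python
from collections import Counter, defaultdict


class IncrementalCounter:
    """Keeps per-key counts up to date as events arrive and pushes a
    notification when a key reaches a threshold."""

    def __init__(self, threshold, on_trending):
        self.counts = defaultdict(int)
        self.threshold = threshold
        self.on_trending = on_trending

    def insert(self, key):
        # Incremental step: update only the aggregate affected by this event.
        self.counts[key] += 1
        if self.counts[key] == self.threshold:
            # Push-style: the subscriber is notified, it does not have to poll.
            self.on_trending(key, self.counts[key])


# Invented stream of hashtags.
stream = ["#egypt", "#libya", "#egypt", "#egypt", "#libya"]

alerts = []
counter = IncrementalCounter(
    threshold=3,
    on_trending=lambda key, count: alerts.append((key, count)),
)
for tag in stream:
    counter.insert(tag)

# Batch alternative: recompute everything from the full data set.
batch_counts = Counter(stream)
```

Both approaches end with the same counts, but the incremental version does constant work per event and can raise the alert as soon as the third "#egypt" arrives.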
  16. References
     • Prototype of the semantic search engine Blognoon: http://blognoon.com
     • The Tweeted Times, a personalized newspaper based on the user’s Twitter account: http://tweetedtimes.com
     • Triggy, a system for real-time analytics: http://www.systems.ethz.ch/research/projects
