Semantic Data Search and Analysis
                  Using Web-based User-Generated
                          Knowledge Bases

                               Dr. Maria Grineva
                         Systems Group @ ETH Zurich




Sunday, April 7, 13
Today’s Search is Based On Links
                      • Full-text search is the main way to
                         access information on the Web
                      • The goal of Web search engines: find
                         the most relevant pages for the user’s
                         query
                      • Google uses the Web’s hyperlinks to
                         compute the relevance of a Web page
                         (PageRank)


         22 March 2011                          Systems Group @ ETH Zurich for Hasler Foundation
Domains Without Links
                      •   PageRank does not work when documents are
                          not interlinked
                          •   Breaking news and blog posts - must
                              be available in real time, before any links
                              have been created
                          •   Enterprise databases - documents are
                              not well interconnected because of
                              organizational silos and the limited number of
                              people who create and use them


Web-based User-Generated
                          Knowledge Bases

               • To rank and organize documents that are not
                      interlinked well, we need additional knowledge
                      bases:
                      •   Wikipedia - Online encyclopedia

                      •   Twitter - real-time microblogging service



The Goal of This Project
                      Develop a technology which automatically extracts semantic
                      information:
                          • from Wikipedia - term meanings, relationships,
                              ontologies ...
                          • from Twitter - real-time information about breaking
                              news, trends, people’s opinions ...
                      and applies this information to organize:
                          • news and blogs on the Web
                          • documents in enterprise databases
                      We will release our technology as an open-source software
                      framework

Semantic Text Analysis Using
                                       Wikipedia
                      •   Leveraging Wikipedia to improve text analysis methods:

                          •   Comprehensive coverage (6M terms vs. 65K in Britannica)

                          •   Continuously brought up-to-date

                          •   Rich structure (cross-references between articles, categories, redirect
                              pages, disambiguation pages, info-boxes)

                      •   New algorithms:

                          •   Advanced NLP: Word Sense Disambiguation, Keyword Extraction, Topic
                              Inference

                          •   Automatic Ontology Management: Organizing Concepts into Thematically
                              Grouped Tag Clouds

                          •   Semantic Search: Concept-based Similarity Search, Smart Faceted Navigation

                          •   Zero-cost deployment and customization: No need to train methods, no
                              human labor, no “cold start” problem

Basic Technique:
                          Semantic Relatedness of Terms
                      •   We analyze Wikipedia’s link structure to compute
                          the semantic relatedness of Wikipedia terms
                      •   We use the Dice measure with weighted hyperlinks
                          (bi-directional links, direct links, “see also” links,
                          etc.)
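A minimal sketch of a weighted Dice measure over link neighbourhoods, to make the idea concrete. The link-type weights, the function name `weighted_dice`, and the input layout (neighbour title mapped to link type) are assumptions for illustration; the slides do not give the actual weights or formula. With all weights equal it reduces to the ordinary Dice coefficient 2|A∩B| / (|A| + |B|).

```python
from typing import Dict

# Hypothetical weights per link type; the slides do not state real values.
LINK_WEIGHTS: Dict[str, float] = {
    "bidirectional": 1.0,
    "see_also": 0.8,
    "direct": 0.5,
}


def weighted_dice(links_a: Dict[str, str], links_b: Dict[str, str]) -> float:
    """Relatedness of two articles from their weighted link neighbourhoods.

    Each argument maps a neighbouring article title to the type of the
    link that connects it to the article in question.
    """
    def weight(links: Dict[str, str], title: str) -> float:
        return LINK_WEIGHTS[links[title]]

    shared = set(links_a) & set(links_b)
    # A shared neighbour contributes its weight from both sides.
    numerator = sum(weight(links_a, t) + weight(links_b, t) for t in shared)
    denominator = (
        sum(weight(links_a, t) for t in links_a)
        + sum(weight(links_b, t) for t in links_b)
    )
    return numerator / denominator if denominator else 0.0
```

The result lies between 0 (no shared neighbours) and 1 (identical weighted neighbourhoods), so scores for different term pairs can be compared directly.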




Dmitry Lizorkin, Pavel Velikhov, Maria Grineva, Maxim Grinev
Accuracy Estimate and Optimization Techniques for SimRank Computation
VLDB 2008
Word Sense Disambiguation
                      •   Example: IBM may stand for International Business
                          Machines Corp. or International Brotherhood of Magicians
                      •   We use Wikipedia redirection (synonyms) and
                          disambiguation pages (homonyms) to detect and
                          disambiguate terms in a text
                      •   Example: Platform is mentioned in the context of
                          implementation, open-source, web-server, HTTP
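One plausible way to use relatedness for disambiguation is to pick the candidate sense that is most related to the surrounding context terms. The sketch below assumes that approach; the candidate sense names, the toy relatedness scores, and the function names are invented for illustration and are not taken from the slides.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def disambiguate(
    candidates: List[str],
    context: Iterable[str],
    relatedness: Callable[[str, str], float],
) -> str:
    """Return the candidate sense with the highest total relatedness to the context."""
    context_terms = list(context)
    return max(
        candidates,
        key=lambda sense: sum(relatedness(sense, term) for term in context_terms),
    )


# Toy scores (invented); a real system would derive them from Wikipedia links.
TOY_SCORES: Dict[Tuple[str, str], float] = {
    ("Computing platform", "implementation"): 0.4,
    ("Computing platform", "open-source"): 0.5,
    ("Computing platform", "web-server"): 0.6,
    ("Computing platform", "HTTP"): 0.5,
    ("Railway platform", "implementation"): 0.05,
}


def toy_relatedness(sense: str, term: str) -> float:
    # Pairs missing from the table are treated as unrelated.
    return TOY_SCORES.get((sense, term), 0.0)


senses = ["Computing platform", "Railway platform", "Oil platform"]
context = ["implementation", "open-source", "web-server", "HTTP"]
best_sense = disambiguate(senses, context, toy_relatedness)
```

In practice the candidate list would come from a disambiguation page and `relatedness` from a link-based measure; both are stubbed here.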




Prototype of a Semantic Search Engine for the Blogosphere




Twitter - A Real-Time News Medium

                      • ~200M users all over the world posting
                         short messages (tweets) via mobile devices
                         and web browsers
                      • ~140M tweets per day
                      • Twitter is an open social network where
                         everyone can follow everyone
                      • Retweets - a mechanism for spreading
                         news quickly


Following + Retweets:
                Twitter is the Fastest News Medium
       •       Twitter reacts faster than
               mainstream media: Haiti
               earthquake, Hudson River plane crash
       •       Everyone can be a reporter: real-time
               updates on the revolutions in
               Tunisia, Egypt, Libya, Iran ...




Extracting Useful Information
                                From Twitter
                      • Popularity of a URL
                      • Sentiment and opinions about a news story
                         (tweets containing the news URL)
                      • Trending topics: what is being actively
                         discussed right now
                      • Personalization of news based on the user’s
                         friend connections:
                         The Tweeted Times http://tweetedtimes.com
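As a rough illustration of the first item, URL popularity can be approximated by counting how many tweets mention each URL. This is a sketch under simple assumptions: tweets are plain strings, URLs are found with a basic regular expression, and shortened links are not resolved. The function name and the sample tweets are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable

# Simple pattern: anything starting with http:// or https:// up to whitespace.
URL_PATTERN = re.compile(r"https?://\S+")


def url_popularity(tweets: Iterable[str]) -> Counter:
    """Count the number of tweets that mention each URL."""
    counts: Counter = Counter()
    for text in tweets:
        # A set ensures a URL repeated inside one tweet is counted once.
        counts.update(set(URL_PATTERN.findall(text)))
    return counts


sample_tweets = [
    "Big news http://example.com/a",
    "RT: http://example.com/a wow",
    "Also worth reading http://example.com/b",
]
popularity = url_popularity(sample_tweets)
top_url, top_count = popularity.most_common(1)[0]
```

Restricting the input to tweets from the accounts a user follows would give a personalized ranking in the same way.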

The Tweeted Times: a personalized newspaper generated from
                        the user’s Twitter account




At the Systems Layer
                      • A scalable distributed architecture is required:
                        • Hadoop (MapReduce software framework)
                          for batch processing of Wikipedia
                          snapshots
                        • Real-time analytics based on a distributed
                          key-value store for online Twitter stream
                          processing
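To show the shape of a batch job over a Wikipedia snapshot, here is a single-process imitation of the map and reduce phases that counts inbound links per article. It does not use the Hadoop API; the function names, the link pattern, and the tiny sample input are assumptions for illustration only.

```python
import re
from collections import defaultdict
from itertools import chain
from typing import Dict, Iterable, Iterator, Tuple

# Captures the target of [[Target]], [[Target|label]] or [[Target#Section]].
WIKI_LINK = re.compile(r"\[\[([^\]|#]+)")


def map_links(title: str, wikitext: str) -> Iterator[Tuple[str, int]]:
    """Map phase: emit (link target, 1) for every link in one article."""
    for target in WIKI_LINK.findall(wikitext):
        yield target.strip(), 1


def reduce_counts(pairs: Iterable[Tuple[str, int]]) -> Dict[str, int]:
    """Reduce phase: sum the emitted values per key."""
    totals: Dict[str, int] = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)


def inbound_link_counts(pages: Dict[str, str]) -> Dict[str, int]:
    """Run both phases over a mapping of article title to wikitext."""
    emitted = chain.from_iterable(map_links(t, w) for t, w in pages.items())
    return reduce_counts(emitted)


sample_pages = {"A": "See [[B]] and [[C|the C article]].", "B": "Related: [[C]]"}
inbound = inbound_link_counts(sample_pages)
```

In a real MapReduce framework the map calls would run in parallel over snapshot splits and the framework would group the emitted pairs by key before the reduce step.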


Scalable Real-Time Analytics Based
                   On Distributed Key-Value Store
                      •   At the Systems Group, we are working on a system
                          for real-time analytics based on Cassandra
                      •   We extend Cassandra with:
                          •   a push-style procedure for real-time
                              analytics
                          •   incremental computations (an alternative
                              to batch processing) - processing data as it
                              arrives from the stream
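The two extensions can be illustrated with a small in-memory sketch: each arriving event updates a stored count incrementally, and subscribers are notified (push-style) when a count reaches a threshold. The dictionary stands in for the key-value store; the class and method names are hypothetical and this is not Cassandra's API or the group's actual design.

```python
from collections import defaultdict
from typing import Callable, DefaultDict, List, Tuple


class IncrementalCounter:
    """Keeps per-key counts up to date as events arrive, one at a time."""

    def __init__(self, threshold: int) -> None:
        # In-memory stand-in for a distributed key-value store.
        self.counts: DefaultDict[str, int] = defaultdict(int)
        self.threshold = threshold
        self.subscribers: List[Callable[[str, int], None]] = []

    def subscribe(self, callback: Callable[[str, int], None]) -> None:
        """Register a callback to be pushed a notification when a key trends."""
        self.subscribers.append(callback)

    def on_event(self, key: str) -> None:
        """Process one stream event; no re-scan of past data is needed."""
        self.counts[key] += 1
        # Notify once, at the moment the count reaches the threshold.
        if self.counts[key] == self.threshold:
            for callback in self.subscribers:
                callback(key, self.counts[key])


alerts: List[Tuple[str, int]] = []
counter = IncrementalCounter(threshold=2)
counter.subscribe(lambda key, count: alerts.append((key, count)))
for topic in ["a", "b", "a", "a"]:
    counter.on_event(topic)
```

A batch job would instead recompute all counts from the full data set on every run; here each event costs one read-modify-write on a single key.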


References
                      • Prototype of the semantic search engine
                        Blognoon:
                        http://blognoon.com

                      • The Tweeted Times - personalized newspaper
                        based on user’s Twitter account:
                        http://tweetedtimes.com

                      • Triggy: a system for real-time analytics:
                        http://www.systems.ethz.ch/research/projects

