Semantic Data Search and Analysis
                  Using Web-based User-Generated
                          Knowledge Bases

                               Dr. Maria Grineva
                         Systems Group @ ETH Zurich




Sunday, April 7, 13
Today’s Search is Based On Links
                      • Full-text search is the main way to
                         access information on the Web
                      • The goal of Web search engines: find the
                         most relevant pages for the user’s query
                      • Google employs the Web’s hyperlinks to
                         compute relevance of a Web page
                         (PageRank)
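To make the idea of link-based relevance concrete, here is a minimal power-iteration sketch of PageRank on a toy graph. The graph, the damping factor of 0.85, the iteration count, and the function name are assumptions chosen for illustration; this is not a description of Google's production system.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank. `links` maps each page to the pages it links to."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page receives a small baseline share (the "random jump").
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped rank evenly among its out-links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page without out-links spreads its rank over all pages.
                share = damping * rank[page] / n
                for target in pages:
                    new_rank[target] += share
        rank = new_rank
    return rank


# Toy web: page "d" links out, but nothing links to it.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
```

In this toy graph, page "d" has no inbound links, so it keeps only the baseline share. The same thing happens to any document that nobody has linked to yet.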


         22 March 2011                          Systems Group @ ETH Zurich for Hasler Foundation
Domains Without Links
                      •   PageRank does not work when documents are
                          not interlinked
                          •   Breaking news and Blog posts - must
                              be available in real-time, when no links have
                              been created yet
                          •   Enterprise databases - documents are
                              not well interconnected because of
                              organizational silos and limited number of
                              people who create and use them


Web-based User-Generated
                          Knowledge Bases

               • To rank and organize documents that are not
                      interlinked well, we need additional knowledge
                      bases:
                      •   Wikipedia - Online encyclopedia

                      •   Twitter - real-time microblogging service



The Goal of This Project
                      Develop a technology which automatically extracts semantic
                      information:
                          • from Wikipedia - term meanings, relationships,
                              ontologies ...
                          • from Twitter - real-time information about breaking
                              news, trends, people’s opinions ...
                      and applies this information to organize:
                          • news and blogs on the Web
                          • documents in enterprise databases
                      We will release our technology as an open source software
                      framework

Semantic Text Analysis Using
                                       Wikipedia
                      •   Leveraging Wikipedia to improve text analysis methods:

                          •   Comprehensive coverage (6M terms vs. 65K in Britannica)

                          •   Continuously brought up-to-date

                          •   Rich structure (cross-references between articles, categories, redirect
                              pages, disambiguation pages, info-boxes)

                      •   New algorithms:

                          •   Advanced NLP: Word Sense Disambiguation, Keyword Extraction, Topic
                              Inference

                          •   Automatic Ontology Management: Organizing Concepts into Thematically
                              Grouped Tag Clouds

                          •   Semantic Search: Concept-based Similarity Search, Smart Faceted Navigation

                          •   Zero-cost deployment and customization: No need to train methods, no
                              human labor, no “cold start” problem

Basic Technique:
                          Semantic Relatedness of Terms
                      •   We analyze Wikipedia’s link structure to compute the
                          semantic relatedness of Wikipedia terms
                      •   We use the Dice measure with weighted hyperlinks
                          (bi-directional links, direct links, “see also” links,
                          etc.)
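One plausible reading of a Dice measure with weighted hyperlinks is sketched below. The slides give neither the exact formula nor the weights, so the weight values, the link-type names, and the toy articles are all assumptions for illustration. With every weight set to 1, the function reduces to the classic Dice coefficient 2|A ∩ B| / (|A| + |B|) over the two terms' sets of linked articles.

```python
# Hypothetical weights per link type; the slides do not state actual values.
LINK_WEIGHTS = {"bidirectional": 1.0, "see_also": 0.8, "direct": 0.5}


def link_weight(neighbors, article):
    """Weight of the link between a term and one of its neighboring articles."""
    return LINK_WEIGHTS[neighbors[article]]


def weighted_dice(neighbors_a, neighbors_b):
    """Relatedness in [0, 1]; each argument maps a linked article to its link type."""
    common = set(neighbors_a) & set(neighbors_b)
    # Weight carried by the links both terms share, counted from both sides.
    shared = sum(link_weight(neighbors_a, t) + link_weight(neighbors_b, t) for t in common)
    # Total weight of all links of both terms.
    total = (sum(link_weight(neighbors_a, t) for t in neighbors_a)
             + sum(link_weight(neighbors_b, t) for t in neighbors_b))
    return shared / total if total else 0.0


# Made-up link neighborhoods for two terms.
term_x = {"Article 1": "bidirectional", "Article 2": "direct", "Article 3": "see_also"}
term_y = {"Article 1": "bidirectional", "Article 3": "direct", "Article 4": "direct"}
score = weighted_dice(term_x, term_y)
```

Two terms with identical neighborhoods score 1.0 and two terms with no shared neighbors score 0.0, which keeps the measure comparable across terms with very different numbers of links.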




Dmitry Lizorkin, Pavel Velikhov, Maria Grineva, Maxim Grinev.
Accuracy Estimate and Optimization Techniques for SimRank Computation.
VLDB 2008.
Word Sense Disambiguation
                      •   Example: IBM may stand for International Business
                          Machines Corp. or International Brotherhood of Magicians
                      •   We use Wikipedia redirection (synonyms) and
                          disambiguation pages (homonyms) to detect and
                          disambiguate terms in a text
                      •   Example: Platform is mentioned in the context of
                          implementation, open-source, web-server, HTTP
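The "Platform" example can be sketched as picking the candidate sense that is most related to the surrounding context terms. The slides do not state the selection rule, so the sum-of-relatedness rule, the candidate sense names, and every score below are assumptions invented for illustration. In a real system, the candidates would come from a disambiguation page and the scores from a relatedness measure over Wikipedia links.

```python
# Toy relatedness scores between candidate senses and context terms.
# All values are invented; missing pairs count as unrelated (0.0).
TOY_RELATEDNESS = {
    ("Computing platform", "implementation"): 0.4,
    ("Computing platform", "open-source"): 0.6,
    ("Computing platform", "web-server"): 0.7,
    ("Computing platform", "HTTP"): 0.5,
    ("Railway platform", "implementation"): 0.05,
    ("Oil platform", "implementation"): 0.1,
}


def relatedness(sense, term):
    """Look up the toy relatedness of a sense to a context term."""
    return TOY_RELATEDNESS.get((sense, term), 0.0)


def disambiguate(candidate_senses, context_terms, score=relatedness):
    """Return the sense with the highest total relatedness to the context."""
    return max(candidate_senses,
               key=lambda sense: sum(score(sense, term) for term in context_terms))


senses = ["Railway platform", "Oil platform", "Computing platform"]
context = ["implementation", "open-source", "web-server", "HTTP"]
best = disambiguate(senses, context)
```

Passing the scoring function as a parameter keeps the selection rule separate from how relatedness is computed, so the toy table could be swapped for a link-based measure.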




Prototype of a Semantic Search Engine for the Blogosphere




Twitter - A Real-Time News Medium

                      • ~200M users all over the world posting
                         short messages (tweets) via mobile devices
                         and web browser
                      • ~140M tweets per day
                      • Twitter is an open social network where
                         everyone can follow everyone
                      • Retweets - a mechanism for fast news
                         spreading


Following + Retweets:
                Twitter is the Fastest News Medium
       •       Twitter reacts faster than
               mainstream media: Haiti
               earthquake, Hudson River plane crash
       •       Everyone can be a reporter: real-
               time updates on the revolutions in
               Tunisia, Egypt, Libya, Iran ...




Extracting Useful Information
                                From Twitter
                      • Popularity of a URL
                      • Sentiments, opinions about a news story
                         (tweets containing the news URL)
                      • Trending topics: what is being actively
                         discussed right now
                      • Personalization of news based on the user’s
                         friend connections:
                         The Tweeted Times http://tweetedtimes.com

The Tweeted Times: personalized newspaper generated from
                        a user’s Twitter account




At the Systems Layer
                      • Scalable distributed architecture is required:
                       • Hadoop (MapReduce software framework)
                          for batch processing of Wikipedia
                          snapshots
                        • Real-time analytics based on distributed
                          key-value store for online Twitter stream
                          processing


Scalable Real-Time Analytics Based
                   On Distributed Key-Value Store
                      •   At Systems Group, we are working on a system
                          for real-time analytics based on Cassandra:
                      •   We extend Cassandra with:
                          •   push-style procedure for real-time
                              analytics
                          •   incremental computations (alternative
                              to batch-processing) - processing data as it
                              arrives from the stream
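The difference between incremental and batch computation, combined with push-style notification, can be sketched with a plain in-memory counter. This is an illustrative stand-in only: the class and method names are hypothetical, and nothing here uses or mirrors the Cassandra API or the group's actual system. The example counts how many tweets mention each URL.

```python
from collections import Counter


class UrlPopularity:
    """Keeps per-URL tweet counts current as tweets arrive (incremental)."""

    def __init__(self):
        self.counts = Counter()
        self.subscribers = []

    def subscribe(self, callback):
        """Register a callback(url, new_count) to be pushed on every update."""
        self.subscribers.append(callback)

    def on_tweet(self, urls):
        """Process one tweet: update only the affected counters, then notify."""
        for url in set(urls):  # count each URL once per tweet
            self.counts[url] += 1
            for callback in self.subscribers:
                callback(url, self.counts[url])


def batch_counts(tweets):
    """Batch alternative: recompute all counts from the full tweet history."""
    counts = Counter()
    for urls in tweets:
        counts.update(set(urls))
    return counts


# Each tweet is represented by the list of URLs it contains (made-up data).
stream = [
    ["http://example.org/a"],
    ["http://example.org/a", "http://example.org/b"],
    [],
]
notifications = []
popularity = UrlPopularity()
popularity.subscribe(lambda url, count: notifications.append((url, count)))
for tweet_urls in stream:
    popularity.on_tweet(tweet_urls)
```

Both approaches end with the same counts, but the incremental version does work proportional to each arriving tweet rather than to the whole history, and subscribers learn about a change as soon as it happens instead of polling.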


References
                      • Prototype of the semantic search engine
                        Blognoon:
                        http://blognoon.com

                      • The Tweeted Times - personalized newspaper
                        based on user’s Twitter account:
                        http://tweetedtimes.com

                      • Triggy: a system for real-time analytics:
                        http://www.systems.ethz.ch/research/projects


         22 March 2011                              Systems Group @ ETH Zurich for Hasler Foundation
Sunday, April 7, 13

More Related Content

What's hot

The HathiTrust Research Center: Big Data Analytics in a Secure Data Framework
The HathiTrust Research Center: Big Data Analytics in a Secure Data FrameworkThe HathiTrust Research Center: Big Data Analytics in a Secure Data Framework
The HathiTrust Research Center: Big Data Analytics in a Secure Data FrameworkRobert H. McDonald
 
Preservation and institutional repositories for the digital arts and humanities
Preservation and institutional repositories for the digital arts and humanitiesPreservation and institutional repositories for the digital arts and humanities
Preservation and institutional repositories for the digital arts and humanitiesDorothea Salo
 
#mytweet via Instagram: Exploring User Behaviour Across Multiple Social Networks
#mytweet via Instagram: Exploring User Behaviour Across Multiple Social Networks#mytweet via Instagram: Exploring User Behaviour Across Multiple Social Networks
#mytweet via Instagram: Exploring User Behaviour Across Multiple Social NetworksBang Hui Lim
 
Linked Open Data in Libraries, Archives & Museums
Linked Open Data in Libraries, Archives & MuseumsLinked Open Data in Libraries, Archives & Museums
Linked Open Data in Libraries, Archives & MuseumsJon Voss
 
Mapping Online Publics: Researching the Uses of Twitter
Mapping Online Publics: Researching the Uses of TwitterMapping Online Publics: Researching the Uses of Twitter
Mapping Online Publics: Researching the Uses of TwitterAxel Bruns
 
The "social" side of digital science
The "social" side of digital scienceThe "social" side of digital science
The "social" side of digital scienceKaitlin Thaney
 

What's hot (7)

The HathiTrust Research Center: Big Data Analytics in a Secure Data Framework
The HathiTrust Research Center: Big Data Analytics in a Secure Data FrameworkThe HathiTrust Research Center: Big Data Analytics in a Secure Data Framework
The HathiTrust Research Center: Big Data Analytics in a Secure Data Framework
 
Preservation and institutional repositories for the digital arts and humanities
Preservation and institutional repositories for the digital arts and humanitiesPreservation and institutional repositories for the digital arts and humanities
Preservation and institutional repositories for the digital arts and humanities
 
#mytweet via Instagram: Exploring User Behaviour Across Multiple Social Networks
#mytweet via Instagram: Exploring User Behaviour Across Multiple Social Networks#mytweet via Instagram: Exploring User Behaviour Across Multiple Social Networks
#mytweet via Instagram: Exploring User Behaviour Across Multiple Social Networks
 
Linked Open Data in Libraries, Archives & Museums
Linked Open Data in Libraries, Archives & MuseumsLinked Open Data in Libraries, Archives & Museums
Linked Open Data in Libraries, Archives & Museums
 
Mapping Online Publics: Researching the Uses of Twitter
Mapping Online Publics: Researching the Uses of TwitterMapping Online Publics: Researching the Uses of Twitter
Mapping Online Publics: Researching the Uses of Twitter
 
The "social" side of digital science
The "social" side of digital scienceThe "social" side of digital science
The "social" side of digital science
 
Reference Rot and Linked Data: Threat and Remedy
Reference Rot and Linked Data: Threat and RemedyReference Rot and Linked Data: Threat and Remedy
Reference Rot and Linked Data: Threat and Remedy
 

Similar to Semantic Data Search and Analysis Using Web-based User-Generated Knowledge Bases

Building Effective Frameworks for Social Media Analysis
Building Effective Frameworks for Social Media AnalysisBuilding Effective Frameworks for Social Media Analysis
Building Effective Frameworks for Social Media AnalysisOpen Analytics
 
Building Effective Frameworks for Social Media Analysis
Building Effective Frameworks for Social Media AnalysisBuilding Effective Frameworks for Social Media Analysis
Building Effective Frameworks for Social Media Analysisikanow
 
Breaking Down Walls in Enterprise with Social Semantics
Breaking Down Walls in Enterprise with Social SemanticsBreaking Down Walls in Enterprise with Social Semantics
Breaking Down Walls in Enterprise with Social SemanticsJohn Breslin
 
New Methodologies for Capturing and Working with Publicly Available Twitter Data
New Methodologies for Capturing and Working with Publicly Available Twitter DataNew Methodologies for Capturing and Working with Publicly Available Twitter Data
New Methodologies for Capturing and Working with Publicly Available Twitter DataAxel Bruns
 
Visualizing Co-authorship Networks for Actionable Insights: Action Design Res...
Visualizing Co-authorship Networks for Actionable Insights: Action Design Res...Visualizing Co-authorship Networks for Actionable Insights: Action Design Res...
Visualizing Co-authorship Networks for Actionable Insights: Action Design Res...Mindtrek
 
Online data sources and information exposure
Online data sources and information exposureOnline data sources and information exposure
Online data sources and information exposureUniversity of Southampton
 
Conversations in Context: A Twitter Case for Social Media Systems Design
Conversations in Context: A Twitter Case for Social Media Systems DesignConversations in Context: A Twitter Case for Social Media Systems Design
Conversations in Context: A Twitter Case for Social Media Systems DesignCommunitySense
 
Open Data - Principles and Techniques
Open Data - Principles and TechniquesOpen Data - Principles and Techniques
Open Data - Principles and TechniquesBernhard Haslhofer
 
What is eScience, and where does it go from here?
What is eScience, and where does it go from here?What is eScience, and where does it go from here?
What is eScience, and where does it go from here?Daniel S. Katz
 
Next generation repositories
Next generation repositoriesNext generation repositories
Next generation repositoriesPaul Walk
 
Session 0.0 poster minutes madness
Session 0.0   poster minutes madnessSession 0.0   poster minutes madness
Session 0.0 poster minutes madnesssemanticsconference
 
Do Libraries Meet Research 2.0 : collaborative tools and relevance for Resear...
Do Libraries Meet Research 2.0 : collaborative tools and relevance for Resear...Do Libraries Meet Research 2.0 : collaborative tools and relevance for Resear...
Do Libraries Meet Research 2.0 : collaborative tools and relevance for Resear...Guus van den Brekel
 
Session 1.4 a distributed network of heritage information
Session 1.4   a distributed network of heritage informationSession 1.4   a distributed network of heritage information
Session 1.4 a distributed network of heritage informationsemanticsconference
 
A distributed network of digital heritage information - Semantics Amsterdam
A distributed network of digital heritage information - Semantics AmsterdamA distributed network of digital heritage information - Semantics Amsterdam
A distributed network of digital heritage information - Semantics AmsterdamEnno Meijers
 
Visualizing Co-authorship Networks for Actionable Insights: Action Design Res...
Visualizing Co-authorship Networks for Actionable Insights: Action Design Res...Visualizing Co-authorship Networks for Actionable Insights: Action Design Res...
Visualizing Co-authorship Networks for Actionable Insights: Action Design Res...Jukka Huhtamäki
 
Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012
Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012
Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012lljohnston
 

Similar to Semantic Data Search and Analysis Using Web-based User-Generated Knowledge Bases (20)

Building Effective Frameworks for Social Media Analysis
Building Effective Frameworks for Social Media AnalysisBuilding Effective Frameworks for Social Media Analysis
Building Effective Frameworks for Social Media Analysis
 
Building Effective Frameworks for Social Media Analysis
Building Effective Frameworks for Social Media AnalysisBuilding Effective Frameworks for Social Media Analysis
Building Effective Frameworks for Social Media Analysis
 
Breaking Down Walls in Enterprise with Social Semantics
Breaking Down Walls in Enterprise with Social SemanticsBreaking Down Walls in Enterprise with Social Semantics
Breaking Down Walls in Enterprise with Social Semantics
 
The Future of Research Communications and e-Scholarship: Are we there yet?
The Future of Research Communications and e-Scholarship: Are we there yet?The Future of Research Communications and e-Scholarship: Are we there yet?
The Future of Research Communications and e-Scholarship: Are we there yet?
 
New Methodologies for Capturing and Working with Publicly Available Twitter Data
New Methodologies for Capturing and Working with Publicly Available Twitter DataNew Methodologies for Capturing and Working with Publicly Available Twitter Data
New Methodologies for Capturing and Working with Publicly Available Twitter Data
 
Visualizing Co-authorship Networks for Actionable Insights: Action Design Res...
Visualizing Co-authorship Networks for Actionable Insights: Action Design Res...Visualizing Co-authorship Networks for Actionable Insights: Action Design Res...
Visualizing Co-authorship Networks for Actionable Insights: Action Design Res...
 
C N I20080404
C N I20080404C N I20080404
C N I20080404
 
Torsten Reimer
Torsten ReimerTorsten Reimer
Torsten Reimer
 
Online data sources and information exposure
Online data sources and information exposureOnline data sources and information exposure
Online data sources and information exposure
 
Linked (Open) Data
Linked (Open) DataLinked (Open) Data
Linked (Open) Data
 
Conversations in Context: A Twitter Case for Social Media Systems Design
Conversations in Context: A Twitter Case for Social Media Systems DesignConversations in Context: A Twitter Case for Social Media Systems Design
Conversations in Context: A Twitter Case for Social Media Systems Design
 
Open Data - Principles and Techniques
Open Data - Principles and TechniquesOpen Data - Principles and Techniques
Open Data - Principles and Techniques
 
What is eScience, and where does it go from here?
What is eScience, and where does it go from here?What is eScience, and where does it go from here?
What is eScience, and where does it go from here?
 
Next generation repositories
Next generation repositoriesNext generation repositories
Next generation repositories
 
Session 0.0 poster minutes madness
Session 0.0   poster minutes madnessSession 0.0   poster minutes madness
Session 0.0 poster minutes madness
 
Do Libraries Meet Research 2.0 : collaborative tools and relevance for Resear...
Do Libraries Meet Research 2.0 : collaborative tools and relevance for Resear...Do Libraries Meet Research 2.0 : collaborative tools and relevance for Resear...
Do Libraries Meet Research 2.0 : collaborative tools and relevance for Resear...
 
Session 1.4 a distributed network of heritage information
Session 1.4   a distributed network of heritage informationSession 1.4   a distributed network of heritage information
Session 1.4 a distributed network of heritage information
 
A distributed network of digital heritage information - Semantics Amsterdam
A distributed network of digital heritage information - Semantics AmsterdamA distributed network of digital heritage information - Semantics Amsterdam
A distributed network of digital heritage information - Semantics Amsterdam
 
Visualizing Co-authorship Networks for Actionable Insights: Action Design Res...
Visualizing Co-authorship Networks for Actionable Insights: Action Design Res...Visualizing Co-authorship Networks for Actionable Insights: Action Design Res...
Visualizing Co-authorship Networks for Actionable Insights: Action Design Res...
 
Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012
Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012
Leslie Johnston: Library Big Data Repository Services, Open Repositories 2012
 

More from maria.grineva

Analytics for the Real-Time Web
Analytics for the Real-Time WebAnalytics for the Real-Time Web
Analytics for the Real-Time Webmaria.grineva
 
Analytics for the Real-Time Web
Analytics for the Real-Time WebAnalytics for the Real-Time Web
Analytics for the Real-Time Webmaria.grineva
 
Getting Value From Social Media
Getting Value From Social MediaGetting Value From Social Media
Getting Value From Social Mediamaria.grineva
 
Architecture of Native XML Database Sedna
Architecture of Native XML Database SednaArchitecture of Native XML Database Sedna
Architecture of Native XML Database Sednamaria.grineva
 
XQuery Triggers in Native XML Database Sedna
XQuery Triggers in Native XML Database SednaXQuery Triggers in Native XML Database Sedna
XQuery Triggers in Native XML Database Sednamaria.grineva
 
Extracting Key Terms From Noisy and Multi-theme Documents
Extracting Key Terms From Noisy and Multi-theme DocumentsExtracting Key Terms From Noisy and Multi-theme Documents
Extracting Key Terms From Noisy and Multi-theme Documentsmaria.grineva
 
Effective Extraction of Thematically Grouped Key Terms From Text
Effective Extraction of Thematically Grouped Key Terms From TextEffective Extraction of Thematically Grouped Key Terms From Text
Effective Extraction of Thematically Grouped Key Terms From Textmaria.grineva
 

More from maria.grineva (8)

Analytics for the Real-Time Web
Analytics for the Real-Time WebAnalytics for the Real-Time Web
Analytics for the Real-Time Web
 
Analytics for the Real-Time Web
Analytics for the Real-Time WebAnalytics for the Real-Time Web
Analytics for the Real-Time Web
 
Getting Value From Social Media
Getting Value From Social MediaGetting Value From Social Media
Getting Value From Social Media
 
Filtering Twitter
Filtering TwitterFiltering Twitter
Filtering Twitter
 
Architecture of Native XML Database Sedna
Architecture of Native XML Database SednaArchitecture of Native XML Database Sedna
Architecture of Native XML Database Sedna
 
XQuery Triggers in Native XML Database Sedna
XQuery Triggers in Native XML Database SednaXQuery Triggers in Native XML Database Sedna
XQuery Triggers in Native XML Database Sedna
 
Extracting Key Terms From Noisy and Multi-theme Documents
Extracting Key Terms From Noisy and Multi-theme DocumentsExtracting Key Terms From Noisy and Multi-theme Documents
Extracting Key Terms From Noisy and Multi-theme Documents
 
Effective Extraction of Thematically Grouped Key Terms From Text
Effective Extraction of Thematically Grouped Key Terms From TextEffective Extraction of Thematically Grouped Key Terms From Text
Effective Extraction of Thematically Grouped Key Terms From Text
 

Recently uploaded

#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Benefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksBenefits Of Flutter Compared To Other Frameworks
Benefits Of Flutter Compared To Other FrameworksSoftradix Technologies
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
Understanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitectureUnderstanding the Laravel MVC Architecture
Understanding the Laravel MVC ArchitecturePixlogix Infotech
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 

Recently uploaded (20)

#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Semantic Data Search and Analysis Using Web-based User-Generated Knowledge Bases

  • 1. Semantic Data Search and Analysis Using Web-based User-Generated Knowledge Bases Dr. Maria Grineva Systems Group @ ETH Zurich Sunday, April 7, 13
  • 2. Today’s Search is Based On Links
      • Full-text search is the main way to access information on the Web
      • The goal of Web search engines: find the most relevant pages for the user’s query
      • Google employs the Web’s hyperlinks to compute the relevance of a Web page (PageRank)
      22 March 2011, Systems Group @ ETH Zurich for Hasler Foundation
  • 3. Domains Without Links
      • PageRank does not work when documents are not interlinked
          • Breaking news and blog posts - must be available in real time, when no links have been created yet
          • Enterprise databases - documents are not well interconnected because of organizational silos and the limited number of people who create and use them
  • 4. Web-based User-Generated Knowledge Bases
      • To rank and organize documents that are not interlinked well, we need additional knowledge bases:
          • Wikipedia - online encyclopedia
          • Twitter - real-time microblogging service
  • 5. The Goal of This Project
      Develop a technology which automatically extracts semantic information:
          • from Wikipedia - term meanings, relationships, ontologies ...
          • from Twitter - real-time information about breaking news, trends, people’s opinions ...
      and applies this information to organize:
          • news and blogs on the Web
          • documents in enterprise databases
      We will release our technology as an open source software framework
  • 6. Semantic Text Analysis Using Wikipedia
      • Leveraging Wikipedia to improve text analysis methods:
          • Comprehensive coverage (6M terms vs. 65K in Britannica)
          • Continuously brought up to date
          • Rich structure (cross-references between articles, categories, redirect pages, disambiguation pages, infoboxes)
      • New algorithms:
          • Advanced NLP: Word Sense Disambiguation, Keyword Extraction, Topic Inference
          • Automatic Ontology Management: Organizing Concepts into Thematically Grouped Tag Clouds
          • Semantic Search: Concept-based Similarity Search, Smart Faceted Navigation
      • Zero-cost deployment and customization: no need to train methods, no human labor, no “cold start” problem
  • 7. Basic Technique: Semantic Relatedness of Terms
      • We analyze Wikipedia’s link structure to compute the semantic relatedness of Wikipedia terms
      • We use the Dice measure with weighted hyperlinks (bi-directional links, direct links, “see also” links, etc.)
      Dmitry Lizorkin, Pavel Velikhov, Maria Grineva, Maxim Grinev. Accuracy Estimate and Optimization Techniques for SimRank Computation. VLDB 2008.
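The slide names the Dice measure and the link types but gives neither the formula nor the weights. The following is a minimal sketch of one possible weighted generalization, in which each shared neighbour contributes the weights of both links that reach it. The `LINK_WEIGHTS` values and the function name `weighted_dice` are assumptions for illustration, not the authors' parameters. With all weights equal to 1 it reduces to the standard Dice coefficient 2|A∩B| / (|A| + |B|).

```python
from typing import Dict

# Hypothetical weights per link type; the slides list the types but not the values.
LINK_WEIGHTS: Dict[str, float] = {"see_also": 5.0, "bidirectional": 3.0, "direct": 1.0}


def weighted_dice(links_a: Dict[str, str], links_b: Dict[str, str]) -> float:
    """Relatedness of two articles from their weighted link neighbourhoods.

    Each argument maps a neighbouring article title to the type of the link
    that connects it to the article in question.
    """
    weight_a = {title: LINK_WEIGHTS[kind] for title, kind in links_a.items()}
    weight_b = {title: LINK_WEIGHTS[kind] for title, kind in links_b.items()}
    total = sum(weight_a.values()) + sum(weight_b.values())
    if total == 0:
        return 0.0  # two articles without links share nothing measurable
    shared = weight_a.keys() & weight_b.keys()
    # A shared neighbour counts with the weight of both links that reach it.
    return sum(weight_a[t] + weight_b[t] for t in shared) / total
```

Identical neighbourhoods score 1.0 and disjoint ones score 0.0, so the value can be used directly as a similarity in ranking or clustering.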
  • 8. Word Sense Disambiguation
      • Example: IBM may stand for International Business Machines Corp. or International Brotherhood of Magicians
      • We use Wikipedia redirect pages (synonyms) and disambiguation pages (homonyms) to detect and disambiguate terms in a text
      • Example: “Platform” is mentioned in the context of implementation, open-source, web-server, HTTP
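The slide does not state how a sense is chosen once the disambiguation page has supplied the candidates. One common approach, sketched here as an assumption, is to pick the candidate whose article is most related to the other terms in the text. The `LINKS` table is invented toy data rather than real Wikipedia links, and plain (unweighted) Dice stands in for the relatedness measure.

```python
from typing import Dict, List, Set

# Toy link sets, invented for illustration only.
LINKS: Dict[str, Set[str]] = {
    "Computing platform": {"Software", "Operating system", "Web server", "Open source"},
    "Railway platform": {"Train", "Railway station", "Track"},
    "Web server": {"Software", "HTTP", "Operating system", "Open source"},
    "HTTP": {"Web server", "Software", "Internet"},
    "Open source": {"Software", "License"},
}


def dice(a: str, b: str) -> float:
    """Unweighted Dice coefficient over the link sets of two articles."""
    links_a, links_b = LINKS.get(a, set()), LINKS.get(b, set())
    if not links_a and not links_b:
        return 0.0
    return 2 * len(links_a & links_b) / (len(links_a) + len(links_b))


def disambiguate(candidates: List[str], context: List[str]) -> str:
    """Return the candidate sense with the highest total relatedness to the context."""
    return max(candidates, key=lambda sense: sum(dice(sense, term) for term in context))


if __name__ == "__main__":
    senses = ["Railway platform", "Computing platform"]
    print(disambiguate(senses, ["Web server", "HTTP", "Open source"]))
```

In this toy graph the railway sense shares no links with the context terms, so the computing sense wins, which matches the “Platform” example on the slide.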
  • 9. Prototype of a Semantic Search Engine for the Blogosphere
  • 10. Twitter - A Real-Time News Medium
      • ~200M users all over the world posting short messages (tweets) via mobile devices and web browsers
      • ~140M tweets per day
      • Twitter is an open social network where everyone can follow everyone
      • Retweets - a mechanism for fast news spreading
  • 11. Following + Retweets: Twitter is the Fastest News Medium
      • Twitter reacts faster than mainstream media: Haiti earthquake, Hudson River plane crash
      • Everyone can be a reporter: real-time updates on the revolutions in Tunisia, Egypt, Libya, Iran ...
  • 12. Extracting Useful Information From Twitter
      • Popularity of a URL
      • Sentiments and opinions about a news story (tweets containing the news URL)
      • Trending topics: what is being actively discussed right now
      • Personalization of news based on the user’s friend connections: The Tweeted Times http://tweetedtimes.com
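The first item, URL popularity, can be approximated by counting how many tweets mention each link. The sketch below assumes tweets arrive as plain strings and that links are already in their final form. Resolving shortened links and accessing the Twitter stream are left out, and `url_popularity` is a hypothetical helper name.

```python
import re
from collections import Counter
from typing import Iterable

URL_PATTERN = re.compile(r"https?://\S+")


def url_popularity(tweets: Iterable[str]) -> Counter:
    """Count, for every URL, the number of tweets that mention it."""
    counts: Counter = Counter()
    for text in tweets:
        # Trim trailing punctuation that is usually part of the sentence, not the link.
        urls = {url.rstrip(".,;:!?)") for url in URL_PATTERN.findall(text)}
        # Using a set means a URL repeated inside one tweet is counted once.
        counts.update(urls)
    return counts


if __name__ == "__main__":
    sample = [
        "Breaking: http://example.com/quake",
        "RT http://example.com/quake wow",
        "see http://example.com/other, and http://example.com/other",
    ]
    print(url_popularity(sample).most_common(1))
```

The same counter, restricted to a sliding time window, would give a rough signal for trending links.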
  • 13. The Tweeted Times: personalized newspaper generated from user’s Twitter account
  • 14. At the Systems Layer
      • Scalable distributed architecture is required:
          • Hadoop (MapReduce software framework) for batch processing of Wikipedia snapshots
          • Real-time analytics based on a distributed key-value store for online Twitter stream processing
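To illustrate the batch side, here is a single-process sketch of a MapReduce-style job that builds, for each article, the list of pages linking to it. It does not use Hadoop itself; `run_job` merely stands in for the shuffle step that a framework would perform across machines. The function names and the simplified handling of `[[Target|label]]` wiki links are assumptions.

```python
import re
from collections import defaultdict
from typing import Dict, Iterable, Iterator, List, Tuple

# Captures the target of [[Target]], [[Target|label]] or [[Target#section]].
WIKILINK = re.compile(r"\[\[([^\]|#]+)")


def map_page(title: str, wikitext: str) -> Iterator[Tuple[str, str]]:
    """Map phase: emit (target, source) for every internal link on a page."""
    for target in WIKILINK.findall(wikitext):
        yield target.strip(), title


def reduce_links(target: str, sources: Iterable[str]) -> Tuple[str, List[str]]:
    """Reduce phase: collect the distinct pages that link to one target."""
    return target, sorted(set(sources))


def run_job(pages: Dict[str, str]) -> Dict[str, List[str]]:
    """Group mapper output by key, then apply the reducer to each group."""
    grouped: Dict[str, List[str]] = defaultdict(list)
    for title, text in pages.items():
        for target, source in map_page(title, text):
            grouped[target].append(source)
    return dict(reduce_links(target, sources) for target, sources in grouped.items())
```

Because the mapper looks at one page at a time and the reducer at one target at a time, both phases can be spread over many machines without coordination.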
  • 15. Scalable Real-Time Analytics Based On Distributed Key-Value Store
      • At Systems Group, we are working on a system for real-time analytics based on Cassandra
      • We extend Cassandra with:
          • push-style procedure for real-time analytics
          • incremental computations (alternative to batch processing) - processing data as it arrives from the stream
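The two extensions can be pictured with a small in-memory stand-in. This is only a sketch of the idea: it does not use Cassandra or its API, and the class `IncrementalCounts` and its methods are hypothetical. Each write updates an aggregate in place (incremental computation) and immediately notifies subscribers (push style), so no batch recomputation is needed.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Listener = Callable[[str, int], None]


class IncrementalCounts:
    """In-memory stand-in for a key-value store with push-style triggers."""

    def __init__(self) -> None:
        self._counts: Dict[str, int] = defaultdict(int)
        self._listeners: List[Listener] = []

    def subscribe(self, listener: Listener) -> None:
        """Register a callback that receives (key, new_value) after each write."""
        self._listeners.append(listener)

    def put(self, key: str, delta: int = 1) -> None:
        """Apply one stream event: update the aggregate, then push the new value."""
        self._counts[key] += delta
        for listener in self._listeners:
            listener(key, self._counts[key])

    def get(self, key: str) -> int:
        """Read the current aggregate without creating an entry for unseen keys."""
        return self._counts.get(key, 0)


if __name__ == "__main__":
    store = IncrementalCounts()
    store.subscribe(lambda key, value: print(f"{key} -> {value}"))
    for hashtag in ["#egypt", "#egypt", "#libya"]:
        store.put(hashtag)
```

The cost of each update is proportional to the number of subscribers rather than to the size of the data set, which is the main contrast with a periodic batch job.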
  • 16. References
      • Prototype of the semantic search engine Blognoon: http://blognoon.com
      • The Tweeted Times - personalized newspaper based on user’s Twitter account: http://tweetedtimes.com
      • Triggy: a system for real-time analytics: http://www.systems.ethz.ch/research/projects