Search Interfaces on the Web:
Querying and Characterizing
Lectio Praecursoria
12.06.2008
Denis Shestakov
denis.shestakov@utu.fi
Department of Information Technology, University of Turku
Turku Centre for Computer Science
Background
• Search engines (e.g., Google) do not crawl and index a significant portion of the Web
• Information in the non-indexable part of the Web cannot be found or accessed via search engines
• An important type of badly indexed web content:
  • web pages generated from parameters that users provide via search interfaces
• Filling out a search form is a hard task for any automatic agent (e.g., search engines’ robots)
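To make "pages generated from parameters" concrete, here is a minimal sketch of what submitting a GET search form amounts to: the field values are encoded into the query string of the form's action URL. The host `cars.example.com` and the field names `make`, `model` and `max_price` are hypothetical, chosen only for illustration. The hard part for a crawler is not this encoding step but choosing meaningful values for fields it does not understand.

```python
from urllib.parse import urlencode


def build_query_url(action_url: str, fields: dict) -> str:
    """Return the URL a browser would request when a GET form is submitted."""
    return f"{action_url}?{urlencode(fields)}"


# Hypothetical form: the action URL and field names are invented for this sketch.
url = build_query_url(
    "http://cars.example.com/search",
    {"make": "Ford", "model": "Focus", "max_price": "5000"},
)
print(url)
```

Each distinct combination of field values yields a different result page, which is why such content cannot be reached by following ordinary hyperlinks alone.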

Background
• The part of the Web ‘behind’ search interfaces is known as the deep Web (or hidden Web)
• Search interfaces are entry points to myriads of databases on the Web
• The central problem:
  • High-quality, publicly available data stored in a huge number of databases is available only via search interfaces (to access a database of interest, a user has to know the location of its search interface)
  • Web pages in the deep Web (so-called data-rich pages) contain blocks of structured information (in contrast to ordinary web pages, which are typically unstructured)

Example of a search interface & search results
AutoTrader search form (http://autotrader.com/) [figure not included]

Deep Web: numbers & misconceptions
• Number of web databases:
  • Survey in April 2004: 450 000 web databases (and this is an underestimate)
• Size of the deep Web:
  • Survey of 2001: 400 to 550 times larger than the indexable Web; but it is not that big
  • No other reliable estimates of the entire size exist
  • According to my own indirect assessments: comparable with the size of the indexable Web
• Content of some web databases is, in fact, indexable:
  • No reliable estimates, but one can expect that one fourth is indexed
  • Correlation with database subjects: content of books/movies/music databases (relatively ‘static’ data) is indexed well
  • But, even if known to search engines, the data is often outdated

Thesis contributions:
querying search interfaces
• An approach to automating querying and retrieving information behind search interfaces
• Essential in the case of complex queries
• A form query language for formulating queries and extracting useful information from the result pages
• A prototype system for querying web databases
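The slides do not show the form query language itself, so the following is only a rough Python sketch of the extraction half of the task: turning the structured block of a result page into records. It assumes, purely for illustration, that results arrive as an HTML table whose first row holds column names; the sample page and its columns are invented, and this is not the thesis's language or prototype.

```python
from html.parser import HTMLParser


class TableRowExtractor(HTMLParser):
    """Collect the text of every table cell, grouped by row."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows, each a list of cell strings
        self._row = None   # row being built, or None outside <tr>
        self._cell = None  # text fragments of the current cell, or None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None


def extract_records(html):
    """Map each data row to a dict keyed by the header row's cell text."""
    parser = TableRowExtractor()
    parser.feed(html)
    parser.close()
    if not parser.rows:
        return []
    header, *body = parser.rows
    return [dict(zip(header, row)) for row in body]


# Invented result page, standing in for what a web database might return.
RESULT_PAGE = """
<html><body><h1>2 cars found</h1>
<table>
<tr><th>Make</th><th>Model</th><th>Price</th></tr>
<tr><td>Ford</td><td>Focus</td><td>4500</td></tr>
<tr><td>Ford</td><td>Fiesta</td><td>3900</td></tr>
</table></body></html>
"""

records = extract_records(RESULT_PAGE)
for record in records:
    print(record)
```

Real result pages are rarely this regular, which is part of why a dedicated query and extraction language is attractive for complex queries.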

Thesis contributions:
characterization of the deep Web
• Previous surveys are based on studies of deep web resources mainly in English
• Two new methods for characterizing the deep Web
• Two surveys of one national (Russian) segment of the Web
• A dataset describing more than 200 web databases (statistically reliable)
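The slides do not describe the two characterization methods. As generic background only, and not as a reconstruction of the thesis's methods, one common way to estimate the size of a population too large to inspect in full is to examine a random sample and scale the observed proportion up, reporting an interval rather than a single number. All figures below (population size, sample size, number of hits) are made up for the sketch.

```python
import math


def estimate_total(population, sample_size, hits, z=1.96):
    """Scale a sample proportion to the population.

    Returns (estimate, low, high), where low/high bound a normal-approximation
    confidence interval (z=1.96 corresponds to roughly 95%).
    """
    p = hits / sample_size
    half_width = z * math.sqrt(p * (1 - p) / sample_size)
    low = max(0.0, p - half_width) * population
    high = min(1.0, p + half_width) * population
    return p * population, low, high


# Made-up numbers: 10,000 hosts sampled out of 1,000,000, of which 150
# turned out to expose a search interface to a web database.
estimate, low, high = estimate_total(population=1_000_000, sample_size=10_000, hits=150)
print(f"estimated hosts with web databases: {estimate:.0f} ({low:.0f} to {high:.0f})")
```

The width of the interval shrinks as the sample grows, which is one sense in which a sample-based dataset can be called statistically reliable.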

Thesis contributions:
finding web databases
• For any given topic there are too many web databases with relevant content: discovery automation is required
• A system for finding and classifying search interfaces
• Intended for:
  • Deep Web characterization studies
  • Building directories of web databases
• Deals with JavaScript-rich and non-HTML search forms (these types of forms are ignored in almost all other approaches to the deep Web)
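As a rough illustration of what "finding and classifying search interfaces" involves at its simplest, the sketch below collects the forms on a static HTML page and applies a crude heuristic: a form with a password field is treated as a login form, and a form with at least one text or selection control is treated as a candidate search interface. The heuristic and the sample page are assumptions made for this example, not the thesis's classifier, and the sketch covers only plain HTML forms, not the JavaScript-rich or non-HTML forms mentioned above.

```python
from html.parser import HTMLParser


class FormCollector(HTMLParser):
    """Record, for every <form>, its action and the kinds of its controls."""

    def __init__(self):
        super().__init__()
        self.forms = []       # one dict per completed form
        self._current = None  # form being parsed, or None outside <form>

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "form":
            self._current = {"action": attrs.get("action") or "", "inputs": []}
        elif self._current is not None and tag in ("input", "select", "textarea"):
            # An <input> without a type attribute defaults to a text field.
            kind = (attrs.get("type") or "text").lower() if tag == "input" else tag
            self._current["inputs"].append(kind)

    def handle_endtag(self, tag):
        if tag == "form" and self._current is not None:
            self.forms.append(self._current)
            self._current = None


def looks_like_search_form(form):
    """Crude heuristic (an assumption of this sketch, not a validated rule)."""
    inputs = form["inputs"]
    if "password" in inputs:
        return False  # most likely a login or registration form
    return any(kind in ("text", "search", "select") for kind in inputs)


def find_search_forms(html):
    """Return the action URLs of forms that pass the heuristic."""
    collector = FormCollector()
    collector.feed(html)
    collector.close()
    return [f["action"] for f in collector.forms if looks_like_search_form(f)]


# Invented page with a login form, a search form and a subscribe button.
PAGE = """
<form action="/login"><input type="text" name="user"><input type="password" name="pw"></form>
<form action="/search"><input type="text" name="q"><select name="make"><option>Ford</option></select><input type="submit"></form>
<form action="/subscribe"><input type="hidden" name="list"><input type="submit"></form>
"""

print(find_search_forms(PAGE))
```

A form assembled by scripts at load time never appears in the static markup, so a parser like this one would miss it entirely; handling such forms requires rendering the page first.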

Applications
• Web search engines:
  • Eager to improve their coverage of the Web
  • In April 2008 Google announced they were experimenting with their form crawler (hence, most likely, other search engines will also test or implement one in their robots within 2008-2009)
• Information owners and providers:
  • Typically want to disseminate their (publicly available) information
  • Interested in discovery methods, as they want their resources to be discovered and searched
• Vertical/topical search engines:
  • Find information on a specialized topic
  • Need methods to extract data from relevant resources and aggregate it

Future work
• The most promising direction: discovery of web databases
• The goal: building a relatively complete directory (Yahoo!-like) of databases on the Web
  • Specialized directories already exist
  • Several ‘universal’ directories (e.g., completeplanet.com) also exist but, as reported, are outdated and cover only a small portion of deep web resources
  • Due to the huge number of existing web databases, building and then maintaining such a directory would require automatic methods (discovery, classification, etc.)
