2. Project Idea in Brief
Extracting structured data from deep Web pages is a
challenging problem due to the intricate underlying
structures of the pages.
Existing approaches for extraction have the following
limitations:
1. They are dependent on the programming language of the Web page.
2. They are incapable of handling the ever-increasing
complexity of the HTML source code of Web pages.
However, Web page designers generally arrange the
data records and data items with visual regularity
to suit the reading habits of human beings.
VIDE
4. Contd..
So we explore the visual regularity of data records and
data items on Web pages and implement a Vision-based
Data Extractor (ViDE) to extract structured results
from Web pages automatically.
This approach employs the following steps:
1. Identify and understand the visual structure of the
page/document.
2. Extract data records from the page.
3. Partition the extracted data records into data items.
To implement this, we develop a vision-based data
extraction tool that helps researchers find
documents related to authors in their research area.
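The three steps above can be sketched as a simple pipeline. The class and method names below are illustrative placeholders, not the actual ViDE implementation; plain text lines with a `;` separator stand in for visually regular blocks.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of the three-step pipeline described above.
public class VidePipeline {

    // Step 1: model the visual structure of the page as a list of blocks.
    static List<String> buildVisualBlocks(String page) {
        List<String> blocks = new ArrayList<>();
        for (String line : page.split("\n")) {
            if (!line.trim().isEmpty()) blocks.add(line.trim());
        }
        return blocks;
    }

    // Step 2: keep only blocks that look like data records
    // (a field separator stands in for visual regularity here).
    static List<String> extractDataRecords(List<String> blocks) {
        List<String> records = new ArrayList<>();
        for (String b : blocks) {
            if (b.contains(";")) records.add(b);
        }
        return records;
    }

    // Step 3: partition each record into its data items.
    static List<String[]> partitionIntoItems(List<String> records) {
        List<String[]> items = new ArrayList<>();
        for (String r : records) items.add(r.split(";"));
        return items;
    }

    public static void main(String[] args) {
        String page = "Header\nAlice;alice@uni.edu;Paper A\nBob;bob@uni.edu;Paper B\n";
        List<String[]> items =
            partitionIntoItems(extractDataRecords(buildVisualBlocks(page)));
        System.out.println(items.size() + " records, "
            + items.get(0).length + " items each"); // prints: 2 records, 3 items each
    }
}
```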
5. Contd…
Assumption 1:
The major assumption of the project work is that the
targeted PDF documents that the user wants to
search over HTTP follow a single, internationally
recognized format that is known to everyone.
6. Contd…
Searching for a specific person is one of the most popular
search queries.
However, when a person's name is queried, the returned
results often contain web pages related to several distinct
namesakes who share the queried name.
The task of disambiguating and finding the web pages
related to the specific person of interest is left to the user.
Assumption 2:
The cluster key is assumed to be the e-mail ID of the author,
with the help of which the user can segregate the
different papers published by the same author.
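Grouping by this cluster key can be sketched as follows; the record layout (title plus e-mail) and class name are assumptions for illustration, not a prescribed format.

```java
import java.util.*;

// Minimal sketch of grouping extracted papers by the author's e-mail ID,
// the cluster key assumed above.
public class EmailClusterer {

    static Map<String, List<String>> clusterByEmail(List<String[]> papers) {
        Map<String, List<String>> clusters = new LinkedHashMap<>();
        for (String[] p : papers) {            // p[0] = title, p[1] = e-mail
            clusters.computeIfAbsent(p[1], k -> new ArrayList<>()).add(p[0]);
        }
        return clusters;
    }

    public static void main(String[] args) {
        List<String[]> papers = Arrays.asList(
            new String[]{"Deep Web Extraction", "j.smith@uni.edu"},
            new String[]{"Vision-Based Wrappers", "j.smith@uni.edu"},
            new String[]{"Query Optimization", "j.smith@corp.com"});
        // Two distinct e-mail IDs resolve into two clusters,
        // separating papers by two namesakes.
        System.out.println(clusterByEmail(papers).size()); // prints 2
    }
}
```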
7. Contd..
The project work also employs a Web crawler, which
can crawl through an entire site on the
Internet or an intranet.
A web crawler is a program or automated script
which browses the World Wide Web in a
methodical, automated manner.
The architecture of the Web crawler uses multiple HTTP
connections to the WWW.
Web crawlers, also known as web spiders, web
robots, worms, walkers, and wanderers, are almost as
old as the Web itself. In this proposal, we highlight
the application of our approach to web
querying using the Yahoo! BOSS Search API along with a
clustering algorithm.
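A crawler of this kind can be sketched as a shared frontier served by a pool of worker threads, one per HTTP connection. In this sketch, `fetch()` is a stub over an in-memory link map rather than a real HTTP GET, so the example stays self-contained; the class and method names are illustrative.

```java
import java.util.*;
import java.util.concurrent.*;

// Sketch of a crawler frontier drained by a pool of workers, standing in
// for the multiple HTTP connections mentioned above.
public class MiniCrawler {
    // In-memory "site": each page maps to the links it contains.
    static final Map<String, List<String>> SITE = Map.of(
        "/",  List.of("/a", "/b"),
        "/a", List.of("/b", "/c"),
        "/b", List.of(),
        "/c", List.of());

    static List<String> fetch(String url) {        // stub for an HTTP GET
        return SITE.getOrDefault(url, List.of());
    }

    static Set<String> crawl(String seed, int workers) {
        Set<String> visited = ConcurrentHashMap.newKeySet();
        BlockingQueue<String> frontier = new LinkedBlockingQueue<>();
        frontier.add(seed);
        ExecutorService pool = Executors.newFixedThreadPool(workers);
        try {
            String url;
            // Drain the frontier until no new links arrive for 100 ms.
            while ((url = frontier.poll(100, TimeUnit.MILLISECONDS)) != null) {
                if (!visited.add(url)) continue;   // skip already-seen pages
                final String u = url;
                pool.submit(() -> fetch(u).forEach(frontier::add));
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        } finally {
            pool.shutdown();
        }
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(crawl("/", 4)); // visits /, /a, /b, /c
    }
}
```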
8. Aim of Project
The main aim of the project work is to build an easy
and reliable data extraction tool that will extract
queried information from the masses of
information on the Web in an organized and
unambiguous (unique) manner, and present it in a
friendly, easy-to-read format. The output of our
application will be an auto-generated HTML page.
9. Literature Survey
Searching for information on the Web is not an easy task. Searching
for personal information is sometimes even more complicated. Below
are several common problems we face when trying to get personal
details from the web:
The majority of the information is distributed across different sites.
The information is not kept up to date.
Multi-referent ambiguity – two or more people with the same name.
Multi-morphic ambiguity – one name may be referred to
in different forms.
In the most popular search engine, Google, one can set the target name,
but with the extremely limited facilities to narrow down the search,
the user is still very likely to receive irrelevant information in
the output search hits. Not only this, the user has to manually view, open,
and then download each respective file, which is extremely time
consuming. The major reason behind this is that there is no uniform
format for personal information.
10. YAHOO BOSS
BOSS is an open API that enables developers to use
Yahoo! Search to build search products leveraging
their own data, content, technology, social graph, or
other assets.
BOSS Services:
WEB – Search the web
NEWS – Search for news
IMAGES – Search for images
SPELLING SUGGESTIONS – Retrieve spelling suggestions
BOSS SITE EXPLORER – Get traffic and usage data for your websites
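Building a BOSS web-search request can be sketched as below. The endpoint pattern follows the BOSS v1 REST API; `APP_ID` is a placeholder credential, and since the BOSS service has been retired, this should be read as illustrative rather than a working call.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;

// Illustrative construction of a BOSS v1 web-search request URL.
// APP_ID is a placeholder; the service itself has since been retired.
public class BossQuery {
    static final String APP_ID = "YOUR_APP_ID";  // placeholder credential

    static String webSearchUrl(String query, int count) {
        String q = URLEncoder.encode(query, StandardCharsets.UTF_8);
        return "http://boss.yahooapis.com/ysearch/web/v1/" + q
             + "?appid=" + APP_ID + "&format=json&count=" + count;
    }

    public static void main(String[] args) {
        // A person-name query restricted to PDF files, as in this project.
        System.out.println(webSearchUrl("john smith filetype:pdf", 10));
    }
}
```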
11. System Requirement Specification
PRODUCT PERSPECTIVE
One of the key challenges that needs to be overcome to make the
project functionality a reality is to build an advanced query system that is
capable of reaching high disambiguation quality.
The project work is targeted at designing an advanced version of a search
engine using a Web data extraction framework and a clustering
algorithm.
In this research work, the focus is mainly on searching for personal
information of scientists and researchers.
The user has to set the proper target name for the search; when it
completes, the user will receive complete PDF and image files based on
the key (e-mail) of the search.
Each group of information items (cluster) will be defined by its key
(e-mail), and the user makes the choice.
The result page will be produced from the chosen clusters. To make
the search operationally accurate, we will assume the use of IEEE doc
files, as they carry a standard format of name, e-mail ID, publication,
images, and links to the full images.
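Producing the result page from a chosen cluster can be sketched as simple HTML generation; the markup layout and class name below are illustrative, not a fixed output format.

```java
import java.util.List;

// Sketch of the auto-generated HTML result page built from one chosen
// cluster (key = e-mail, values = paper titles).
public class ResultPage {

    static String render(String email, List<String> papers) {
        StringBuilder html = new StringBuilder();
        html.append("<html><body><h1>Results for ").append(email)
            .append("</h1><ul>");
        for (String p : papers) {
            html.append("<li>").append(p).append("</li>");
        }
        html.append("</ul></body></html>");
        return html.toString();
    }

    public static void main(String[] args) {
        System.out.println(render("j.smith@uni.edu",
            List.of("Deep Web Extraction", "Vision-Based Wrappers")));
    }
}
```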
12. System Requirement Specification
Resource-Requirements
Hardware Requirement specification:
◦ Intel Pentium III processor, 2 GB RAM, 20 GB HDD
◦ LAN/ Internet Connection to Server Machine
◦ TCP/IP network for communication between clients and
server
Software Requirement Specification:
◦ Operating System: Windows XP
◦ Programming Tool: Java Swing
◦ IDE: NetBeans
29. Future Development
A major open issue for future work is a detailed study of
how the system could become even more distributed
while retaining the quality of the content of the crawled
pages.
Due to the dynamic nature of the Web, the average freshness or
quality of the downloaded pages needs to be checked; the
crawler can be enhanced to check this, to detect links
written in JavaScript or VBScript, and to
support file formats such as XML, RTF, PDF, Microsoft Word,
and Microsoft PPT.
30. References
Base Paper: Wei Liu, Xiaofeng Meng, and Weiyi Meng, "ViDE: A Vision-Based Approach for Deep Web Data Extraction," IEEE
Transactions on Knowledge and Data Engineering, Vol. 22, IEEE, 2010.
[1] Rabia Nuray-Turan, Zhaoqi Chen, Dmitri V. Kalashnikov, and Sharad Mehrotra, "Exploiting Web Querying for Web People
Search in WePS2," IEEE, 2009.
[2] Javier Artiles, Satoshi Sekine, and Julio Gonzalo, "Web People Search – Results of the First Evaluation and the Plan for the
Second," ACM, April 21–25, 2008, Beijing, China.
[3] Javier Artiles, Julio Gonzalo, and Satoshi Sekine, "The SemEval-2007 WePS Evaluation: Establishing a Benchmark for the
Web People Search Task," Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval-2007),
pages 64–69.
[4] Ron Bekkerman and Andrew McCallum, "Disambiguating Web Appearances of People in a Social Network,"
International World Wide Web Conference Committee (IW3C2), 2005.
[5] Danushka Bollegala, Yutaka Matsuo, and Mitsuru Ishizuka, "Measuring Semantic Similarity between Words Using Web
Search Engines," International World Wide Web Conference Committee (IW3C2), 2007.
[6] Nguyen Bach and Simon Fung, "Co-reference Resolution for Person Names."
[7] Dmitri V. Kalashnikov, Rabia Nuray-Turan, and Sharad Mehrotra, "Towards Breaking the Quality Curse: A Web-Querying
Approach to Web People Search," ACM, 2008.
[8] Krisztian Balog, Leif Azzopardi, and Maarten de Rijke, "Personal Name Resolution of Web People Search," NLPIX 2008,
April 22, 2008, Beijing, China.