By Shuveta C. Chanchlani
Under the guidance of Prof. Sachin Bojewar and Prof. Varsha Bhosale
Project Idea in Brief
• Extracting structured data from deep Web pages is a challenging problem because of the intricate underlying structure of the pages.
• Existing extraction approaches have the following limitations:
1. They depend on the programming language of the Web page.
2. They cannot handle the ever-increasing complexity of the HTML source code of Web pages.
• However, Web page designers typically arrange data records and data items with visual regularity to suit human reading habits.
VIDE
Project Idea in Brief (contd.)
• We therefore exploit the visual regularity of data records and data items on Web pages and aim to implement a Vision-based Data Extractor (ViDE) that extracts structured results from Web pages automatically.
• The approach has the following steps:
1. Identify and understand the visual structure of the page or document.
2. Extract data records from the page.
3. Partition the extracted data records into data items.
• To implement this, we develop a vision-based data extractor tool that helps researchers find documents by authors in their research area.
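The three steps above can be illustrated with a deliberately simplified sketch. It is not the ViDE algorithm from the base paper; it only shows the idea of working from visual position rather than HTML structure. The `Block` type, the method names and the left-margin heuristic (a record starts whenever a block returns to the left-most x-coordinate) are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class VisualRecordSketch {

    /** A rendered text block with its top-left position on the page (hypothetical type). */
    static final class Block {
        final int x;
        final int y;
        final String text;

        Block(int x, int y, String text) {
            this.x = x;
            this.y = y;
            this.text = text;
        }
    }

    /**
     * Splits blocks (assumed to be sorted top to bottom) into data records.
     * Assumption: every record starts with a block aligned at the left-most x.
     * Each inner list holds the data items of one record.
     */
    static List<List<String>> extractRecords(List<Block> blocks) {
        // Step 1: understand the visual structure (here: find the left margin).
        int leftMargin = Integer.MAX_VALUE;
        for (Block b : blocks) {
            leftMargin = Math.min(leftMargin, b.x);
        }
        // Steps 2 and 3: cut the block sequence into records and their items.
        List<List<String>> records = new ArrayList<>();
        List<String> current = null;
        for (Block b : blocks) {
            if (b.x == leftMargin || current == null) {
                current = new ArrayList<>();   // a new data record begins here
                records.add(current);
            }
            current.add(b.text);               // each block is one data item
        }
        return records;
    }

    public static void main(String[] args) {
        List<Block> page = new ArrayList<>();
        page.add(new Block(10, 0, "Title A"));
        page.add(new Block(30, 20, "Author A"));
        page.add(new Block(10, 50, "Title B"));
        page.add(new Block(30, 70, "Author B"));
        System.out.println(extractRecords(page));
    }
}
```

A real extractor would also need to tolerate small alignment differences and use further visual features such as font and size; the exact-match comparison here is only for brevity.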
Project Idea in Brief (contd.)
Assumption 1:
• The main assumption of the project is that the target PDF documents the user searches for over HTTP all follow a single, internationally recognized format.
Project Idea in Brief (contd.)
• Searching for a specific person is one of the most popular kinds of search query.
• However, when a person's name is queried, the results often contain Web pages about several distinct namesakes who share that name.
• The task of disambiguating the results and finding the pages about the specific person of interest is left to the user.
Assumption 2:
• The cluster key is assumed to be the author's e-mail ID, which lets the user group together the different papers published by the same author.
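Assumption 2 can be sketched as follows: find the first e-mail address in each paper's text and use it as the grouping key. This is a minimal illustration, not the project's code. The class name, the method names and the simplified e-mail pattern are assumptions, and papers without an address are simply skipped.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class EmailKeySketch {

    // Simplified e-mail pattern (an assumption); real addresses can be more varied.
    private static final Pattern EMAIL =
            Pattern.compile("[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\\.[A-Za-z]{2,}");

    /** Returns the first e-mail address found in the text, lower-cased, or null if none. */
    static String firstEmail(String text) {
        Matcher m = EMAIL.matcher(text);
        return m.find() ? m.group().toLowerCase(Locale.ROOT) : null;
    }

    /** Groups paper titles by the author e-mail found in each paper's text. */
    static Map<String, List<String>> clusterByEmail(Map<String, String> textByTitle) {
        Map<String, List<String>> clusters = new LinkedHashMap<>();
        for (Map.Entry<String, String> paper : textByTitle.entrySet()) {
            String key = firstEmail(paper.getValue());
            if (key == null) {
                continue; // no key, so the paper cannot be clustered
            }
            clusters.computeIfAbsent(key, k -> new ArrayList<>()).add(paper.getKey());
        }
        return clusters;
    }

    public static void main(String[] args) {
        Map<String, String> papers = new LinkedHashMap<>();
        papers.put("Paper A", "John Smith, j.smith@uni.example");
        papers.put("Paper B", "John Smith (smith@lab.example)");
        papers.put("Paper C", "J. Smith <J.Smith@Uni.Example>");
        System.out.println(clusterByEmail(papers));
    }
}
```

Lower-casing the key means the same address written with different capitalization lands in one cluster, while two namesakes with different addresses stay apart.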
Project Idea in Brief (contd.)
• The project also uses a Web crawler, which can crawl a whole site on the Internet or an intranet.
• A Web crawler is a program or automated script that browses the World Wide Web in a methodical, automated manner.
• The crawler architecture uses multiple HTTP connections to the WWW.
• Web crawlers, also known as web spiders, web robots, worms, walkers and wanderers, are almost as old as the Web itself.
• In this proposal, we apply our approach to Web querying using the Yahoo BOSS Search API together with a clustering algorithm.
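The "methodical" browsing described above is commonly implemented as a breadth-first traversal with a set of already-seen URLs. The sketch below assumes that shape; the slides do not specify the traversal order. To stay self-contained it is single-threaded and takes a hypothetical `linksOf` function in place of real HTTP downloads and link extraction, so it does not show the multiple-connection architecture mentioned above.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Queue;
import java.util.Set;
import java.util.function.Function;

class CrawlerSketch {

    /**
     * Visits pages breadth-first from a seed URL, never visiting a URL twice,
     * and stops after maxPages pages. linksOf stands in for
     * "download the page and extract its links" (an assumption).
     */
    static List<String> crawl(String seed, Function<String, List<String>> linksOf, int maxPages) {
        Set<String> seen = new HashSet<>();
        Queue<String> frontier = new ArrayDeque<>();
        List<String> visited = new ArrayList<>();
        seen.add(seed);
        frontier.add(seed);
        while (!frontier.isEmpty() && visited.size() < maxPages) {
            String url = frontier.remove();
            visited.add(url);
            for (String link : linksOf.apply(url)) {
                if (seen.add(link)) {      // add() returns false for known URLs
                    frontier.add(link);
                }
            }
        }
        return visited;
    }

    public static void main(String[] args) {
        // A tiny in-memory "site" used instead of real HTTP requests.
        Map<String, List<String>> site = new HashMap<>();
        site.put("/index", Arrays.asList("/papers", "/about"));
        site.put("/papers", Arrays.asList("/index", "/papers/1"));
        Function<String, List<String>> linksOf =
                url -> site.getOrDefault(url, Collections.<String>emptyList());
        System.out.println(crawl("/index", linksOf, 10));
    }
}
```

Passing the link source in as a function keeps the traversal logic testable without network access; a real crawler would plug in an HTTP client and an HTML link parser at that point.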
Aim of Project
The main aim of the project is to build an easy-to-use, reliable data extraction tool that extracts the queried information from the mass of information on the Web in an organized and unambiguous (unique) manner, and presents it in a friendly, easy-to-read format. The output of the application is an auto-generated HTML page.
Literature Survey
Searching for information on the Web is not an easy task, and searching for personal information is sometimes even more complicated. Below are several common problems faced when trying to get personal details from the Web:
• Most of the information is spread across different sites.
• The information is often out of date.
• Multi-referent ambiguity: two or more people share the same name.
• Multi-morphic ambiguity: one name may be written in different forms.
• In a popular search engine such as Google, one can enter the target name, but the facilities for narrowing the search are very limited, so the user is still likely to receive irrelevant results among the search hits. In addition, the user has to inspect, open and download each file manually, which is extremely time consuming. The main reason is that there is no uniform format for personal information.
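Multi-morphic ambiguity from the list above can be partly reduced by mapping different written forms of a name to one canonical string before comparing. The sketch below is an illustrative assumption, not part of the project as described. It handles only "Last, First" ordering, periods, extra spaces and letter case; it does not match initials to full first names.

```java
import java.util.Locale;

class NameFormSketch {

    /** Maps simple variants of a written name to one canonical form. */
    static String canonical(String name) {
        String n = name.trim();
        int comma = n.indexOf(',');
        if (comma >= 0) {
            // "Last, First" becomes "First Last"
            n = n.substring(comma + 1).trim() + " " + n.substring(0, comma).trim();
        }
        // Drop periods, collapse runs of whitespace and ignore letter case.
        return n.replace(".", " ").trim().replaceAll("\\s+", " ").toLowerCase(Locale.ROOT);
    }

    public static void main(String[] args) {
        System.out.println(canonical("Smith, John"));
        System.out.println(canonical("  John   SMITH "));
    }
}
```

Both calls in `main` produce the same canonical string, so the two spellings would be treated as one name. Multi-referent ambiguity is a separate problem, which the project addresses with the e-mail cluster key.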
YAHOO BOSS
• BOSS is an open API that enables developers to use Yahoo! Search to build search products leveraging their own data, content, technology, social graph, or other assets.
• BOSS services:
◦ Web: search the Web
◦ News: search for news
◦ Images: search for images
◦ Spelling Suggestions: retrieve spelling suggestions
◦ BOSS Site Explorer: get traffic and usage of your websites
System Requirement Specification
PRODUCT PERSPECTIVE
• A key challenge in making the project functional is to build an advanced query system capable of high disambiguation quality.
• The project aims to design an advanced search engine using a Web data extraction framework and a clustering algorithm.
• In this research work, the focus is mainly on searching for personal information about scientists and researchers.
• The user sets the target name for the search; when the search completes, the user receives the PDF and image files found, grouped by the search key (e-mail).
• Each group of information items (cluster) is defined by its key (e-mail), and the user chooses among the clusters.
• The result page is produced from the chosen clusters. To keep the search accurate, we assume IEEE doc files, as they carry a standard format of name, e-mail ID, publication, images, and links to the full images.
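Producing the result page from the chosen clusters could look roughly like the sketch below, which renders one section per cluster with the e-mail key as its heading. The page layout, the class name and the method names are assumptions; the slides state only that an HTML page is generated.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class ResultPageSketch {

    /** Escapes the characters that are special in HTML text. */
    static String escape(String s) {
        return s.replace("&", "&amp;")
                .replace("<", "&lt;")
                .replace(">", "&gt;")
                .replace("\"", "&quot;");
    }

    /** Renders one section per chosen cluster: the e-mail key, then its document titles. */
    static String render(String targetName, Map<String, List<String>> titlesByEmail) {
        StringBuilder html = new StringBuilder();
        html.append("<html><head><title>")
            .append(escape(targetName))
            .append("</title></head><body>\n");
        for (Map.Entry<String, List<String>> cluster : titlesByEmail.entrySet()) {
            html.append("<h2>").append(escape(cluster.getKey())).append("</h2>\n<ul>\n");
            for (String title : cluster.getValue()) {
                html.append("<li>").append(escape(title)).append("</li>\n");
            }
            html.append("</ul>\n");
        }
        html.append("</body></html>\n");
        return html.toString();
    }

    public static void main(String[] args) {
        Map<String, List<String>> chosen = new LinkedHashMap<>();
        chosen.put("j.smith@uni.example", Arrays.asList("Paper A", "Paper C"));
        System.out.print(render("John Smith", chosen));
    }
}
```

Escaping is included because titles and names come from crawled documents and may contain characters such as `<` or `&` that would otherwise break the generated page.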
System Requirement Specification
Resource Requirements
• Hardware requirement specification:
◦ Intel Pentium III processor, 2 GB RAM, 20 GB HDD
◦ LAN/Internet connection to the server machine
◦ TCP/IP network for communication between clients and server
• Software requirement specification:
◦ Operating System: Windows XP
◦ Programming Tool: Java Swing
◦ IDE: NetBeans
Project Modules
• Crawler Module
• Parser Module
• Cluster Module
• Page Manager Module
Sub-Modules/Classes
1. Home Page Manager
2. Crawler
3. Data Object
4. Document (extends Data Object)
5. IEEE doc (extends Document)
6. Image (extends Data Object)
7. Cluster Manager
8. Document Info
9. Cluster
10. Key
11. Get Name GUI
12. Page GUI
Data Flow Diagram
[Figure: Level Zero Data Flow Diagram. Elements: Target Keyword, ViDE Application, Data Extraction.]
[Figure: Level One Data Flow Diagram. Elements: Crawler, Indexer, Cluster, Parser. Labels: Position Feature, Layout Feature, Download Pictures, Text extracted, Cluster Key.]
[Figure: Level Two Data Flow Diagram. Elements: Homepage Manager, Crawler, Document. Labels: Request for New Document using Iterator, Constructor (URL), New document Object, Document_Parse.]
[Figure: Level Three Data Flow Diagram. Elements: Document, Cluster Manager, Cluster, GUI Manager. Labels: Add_Doc_Info, Automatic Filtering of irrelevant, getCluster, List_Cluster.]
Class Diagrams
Cluster [from cluster]
- key : KeyClass
- clusterInfo : ArrayList<DocInfo>
~ Cluster(k : KeyClass)
+ getKey() : KeyClass
+ insertDocInfo(di : DocInfo) : void
+ getInfoSize() : int
+ getInfoIterator() : Iterator<DocInfo>

ClusterManager [from cluster]
- clusters : ArrayList<Cluster> = null
+ AUTO_FILTERING_RESULTS : int = 5
+ addToClusters(doc : VideDocument) : void
+ filterClusters() : void
+ getClusters() : ArrayList<Cluster>
+ setClusters(clu : ArrayList<Cluster>) : void

KeyClass [from cluster]
+ email : String
~ KeyClass(e : String)

Crawler [from crawler]
- docList : ArrayList<IEEEdoc>
- imgList : ArrayList<VideImage>
+ docIterator : Iterator<IEEEdoc>
+ imageIterator : Iterator<VideImage>
+ search(Name : String) : void

DocInfo [from crawler]
- infoMap : HashMap<String, String>
~ DocInfo()
+ addInfo(field : String, value : String) : void
+ getDetail(field : String) : String

VidePageManager [from videpage]
+ TEMP_PATH : String = "."
+ PATH_SLASH : String = "/"
+ DOC_NUMBER : String = "30"
+ targetName : String
+ nameGUI : GetNameGUI
+ pageGUI : PageGUI
+ message : String = ""
+ clusters : ArrayList<Cluster> = null
+ clustersReady : boolean = false
~ logicThread : Thread
+ main(args : String[]) : void
+ debugMessage(msg : String) : void
~ VidePageManager()
+ run() : void
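The cluster-related classes in the diagram could be fleshed out roughly as below. The field and method names follow the diagram, but the method bodies are guesses. Three points in particular are assumptions: `addToClusters` takes a `DocInfo` because `VideDocument` is not detailed in the slides, the e-mail is read from a field named "email", and `filterClusters` is assumed to keep only the `AUTO_FILTERING_RESULTS` largest clusters. The classes are nested in a hypothetical `ClusterSketch` wrapper so the example fits in one file.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.Iterator;

class ClusterSketch {

    static class KeyClass {
        final String email;
        KeyClass(String e) { this.email = e; }
    }

    static class DocInfo {
        private final HashMap<String, String> infoMap = new HashMap<>();
        void addInfo(String field, String value) { infoMap.put(field, value); }
        String getDetail(String field) { return infoMap.get(field); }
    }

    static class Cluster {
        private final KeyClass key;
        private final ArrayList<DocInfo> clusterInfo = new ArrayList<>();
        Cluster(KeyClass k) { this.key = k; }
        KeyClass getKey() { return key; }
        void insertDocInfo(DocInfo di) { clusterInfo.add(di); }
        int getInfoSize() { return clusterInfo.size(); }
        Iterator<DocInfo> getInfoIterator() { return clusterInfo.iterator(); }
    }

    static class ClusterManager {
        static final int AUTO_FILTERING_RESULTS = 5;
        private ArrayList<Cluster> clusters = new ArrayList<>();

        /** Adds the document to the cluster with the same e-mail key, creating it if needed. */
        void addToClusters(DocInfo doc) {
            String email = doc.getDetail("email");   // field name is an assumption
            if (email == null) {
                return;                               // no key, nothing to cluster
            }
            for (Cluster c : clusters) {
                if (c.getKey().email.equalsIgnoreCase(email)) {
                    c.insertDocInfo(doc);
                    return;
                }
            }
            Cluster created = new Cluster(new KeyClass(email));
            created.insertDocInfo(doc);
            clusters.add(created);
        }

        /** Guessed behaviour: keep only the AUTO_FILTERING_RESULTS largest clusters. */
        void filterClusters() {
            clusters.sort((a, b) -> Integer.compare(b.getInfoSize(), a.getInfoSize()));
            while (clusters.size() > AUTO_FILTERING_RESULTS) {
                clusters.remove(clusters.size() - 1);
            }
        }

        ArrayList<Cluster> getClusters() { return clusters; }
        void setClusters(ArrayList<Cluster> clu) { this.clusters = clu; }
    }

    public static void main(String[] args) {
        ClusterManager manager = new ClusterManager();
        for (String email : new String[] {"a@x.org", "b@x.org", "A@x.org"}) {
            DocInfo doc = new DocInfo();
            doc.addInfo("email", email);
            manager.addToClusters(doc);
        }
        manager.filterClusters();
        for (Cluster c : manager.getClusters()) {
            System.out.println(c.getKey().email + ": " + c.getInfoSize());
        }
    }
}
```

The linear scan in `addToClusters` mirrors the `ArrayList<Cluster>` field shown in the diagram; a map keyed by e-mail would be the more usual choice if the diagram did not need to be followed.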
Future Development
• A major open issue for future work is a detailed study of how the system could become more distributed while retaining the quality of the content of the crawled pages.
• Because of the dynamic nature of the Web, the average freshness or quality of downloaded pages needs to be checked. The crawler can be enhanced to check this, to detect links written in JavaScript or VBScript, and to support file formats such as XML, RTF, PDF, Microsoft Word and Microsoft PowerPoint.
References
• Base paper: Wei Liu, Xiaofeng Meng, and Weiyi Meng, ViDE: A Vision-Based Approach for Deep Web Data Extraction, IEEE Transactions on Knowledge and Data Engineering, Vol. 22, IEEE, 2010
• [1] Rabia Nuray-Turan, Zhaoqi Chen, Dmitri V. Kalashnikov, Sharad Mehrotra, Exploiting Web Querying for Web People Search in WePS2, IEEE, 2009-12-22
• [2] Javier Artiles, Satoshi Sekine, Julio Gonzalo, Web People Search: Results of the First Evaluation and the Plan for the Second, ACM portal, April 21-25, 2008, Beijing, China
• [3] Javier Artiles, Julio Gonzalo, Satoshi Sekine, The SemEval-2007 WePS Evaluation: Establishing a Benchmark for the Web People Search Task, Proceedings of the 4th International Workshop on Semantic Evaluations (SemEval-2007), pages 64–69
• [4] Ron Bekkerman, Andrew McCallum, Disambiguating Web Appearances of People in a Social Network, International World Wide Web Conference Committee (IW3C2), 2005
• [5] Danushka Bollegala, Yutaka Matsuo, Mitsuru Ishizuka, Measuring Semantic Similarity between Words Using Web Search Engines, International World Wide Web Conference Committee (IW3C2), 2007
• [6] Nguyen Bach, Simon Fung, Co-reference Resolution for Person Names
• [7] Dmitri V. Kalashnikov, Rabia Nuray-Turan, Sharad Mehrotra, Towards Breaking the Quality Curse. A Web-Querying Approach to Web People Search, ACM, 2008
• [8] Krisztian Balog, Leif Azzopardi, Maarten de Rijke, Personal Name Resolution of Web People Search, NLPIX2008, April 22, 2008, Beijing, China
Thank You
