3. Introduction [Figure: result pages A, B, and C from Web sites A, B, and C are combined into integrated information pages.]
4. Introduction Information integration is the merging of information from disparate sources with differing conceptual, contextual, and typographical representations. It is used to consolidate data from unstructured or semi-structured resources. The final result is displayed in rich modules such as tables, lists, graphs, and maps. Users can receive it via RSS, a gadget, or e-mail.
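As a minimal sketch of what merging records with differing representations can look like, the example below renames source-specific fields to shared names and tags each record with its origin. The site names, field names, and sample values are hypothetical and are not taken from the GoD system.

```python
def normalize(record, field_map):
    """Rename source-specific fields to shared names and trim whitespace."""
    return {field_map[k]: str(v).strip() for k, v in record.items() if k in field_map}


def integrate(sources):
    """Merge normalized records from several sources into one list."""
    merged = []
    for name, (records, field_map) in sources.items():
        for record in records:
            row = normalize(record, field_map)
            row["source"] = name  # remember where each row came from
            merged.append(row)
    return merged


# Hypothetical sample data: two sites describing the same kind of item
# with different field names and stray whitespace.
sources = {
    "site_a": ([{"Title": " Camera X ", "Price": "199"}],
               {"Title": "title", "Price": "price"}),
    "site_b": ([{"name": "Camera Y", "cost": "249"}],
               {"name": "title", "cost": "price"}),
}
integrated = integrate(sources)
```

The integrated list can then be handed to any display module (table, list, and so on), since every row uses the same field names.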
5. Related Work Relations, Cards, and Search Templates (UIST'07). In the figure, the three objects to the left of the arrow are search templates, relations, and cards; the cards to the right of the arrow represent the information after integration. It leverages a supervised extractor.
6. Related Work Damia: Data Mashups for Intranet Applications (SIGMOD'08). It integrates information from a company's internal data sources. Managers can operate the system easily and quickly without programmers. Employees get mashups from a feed server.
7. Related Work Transcendence: Enabling a Personal View of the Deep Web (IUI'08). It leverages an unsupervised extractor. Users must use the Firefox browser, whereas GoD has no such requirement because it is a web-based application.
8. Related Work User-centric Web Data Integration: Design and Implementation of Gadget on Demand System (GoD). It leverages an unsupervised extractor and integrates information from multiple sources. Only a few clicks are needed to perform the integration. Users can use the system without programming skills.
9. Related Work Dapper. Its purpose is similar to that of GoD. It leverages a supervised extractor. It provides a virtual browser to achieve “What You See Is What You Get”. Unlike GoD, it does not extract information from multiple sources.
10. Web Information Extraction Full operators for a wrapper; mapping of an incoming query; by hand; the construction of an extractor; construct a base framework.
12. Analysis Different Extractors Unsupervised extractor; supervised extractor; induction based on labeled page examples; knowledge-based extractors.
13. GoD with Unsupervised/Supervised Extractor [Figure: workflow with a supervised branch and an unsupervised branch. Labels shown: input web pages, label page, IE system, select fields & data, select display module, integrate sources, publish.]
14. GoD with Unsupervised/Supervised Extractor Extractor precision: supervised > unsupervised. Use-case flow: unsupervised is easier than supervised. User interface design: supervised is more complex than unsupervised.
15. Extractor for GoD Problem formulation: given a web page and a pattern tree produced by FiVaTech, the task is to use the pattern tree to extract data from the web page. The problem divides into two sub-problems: pattern matching, and approximate matching of textual attributes.
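The two sub-problems can be illustrated with a minimal sketch. It assumes a heavily simplified tree encoding, `(tag, children)` tuples with plain strings for text and `"*"` marking a data slot, which is not FiVaTech's actual pattern tree format. For approximate text matching it uses the standard library's `difflib.SequenceMatcher` as a stand-in for whatever similarity measure GoD uses; the 0.8 threshold is likewise an assumption.

```python
from difflib import SequenceMatcher

SLOT = "*"  # hypothetical marker for a data slot in the pattern tree


def extract(pattern, node):
    """Return the texts found at data slots, or None if the trees do not match."""
    if isinstance(pattern, str):
        if not isinstance(node, str):
            return None
        if pattern == SLOT:
            return [node]
        return [] if pattern == node else None
    if isinstance(node, str):
        return None
    (p_tag, p_children), (n_tag, n_children) = pattern, node
    if p_tag != n_tag or len(p_children) != len(n_children):
        return None
    values = []
    for p_child, n_child in zip(p_children, n_children):
        found = extract(p_child, n_child)
        if found is None:
            return None
        values.extend(found)
    return values


def similar(a, b, threshold=0.8):
    """Approximate, case-insensitive comparison of two textual attributes."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= threshold


# Hypothetical pattern and page fragment: one table row with two data cells.
pattern = ("tr", [("td", [SLOT]), ("td", [SLOT])])
dom = ("tr", [("td", ["Canon EOS 450D"]), ("td", ["$599"])])
values = extract(pattern, dom)
```

Here `extract` covers strict structural matching only; a real pattern tree would also need to handle optional and repeated nodes. `similar` shows how two slightly different spellings of the same attribute, for example "Canon EOS 450D" and "Canon EOS-450D", could still be treated as the same value.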
16. Extractor for GoD [Figure: extractor pipeline. Labels shown: preprocessing, pattern tree, DOM tree, pattern matching, candidate paths, content matching, existing data, data.]
28. Information Integration Tasks Real-time system phase: users use the system to create the gadget they have in mind. Background gadget execution phase: the system updates the gadget's content periodically or on request.
29. Real-time System Phase [Figure: flowchart. Labels shown: domain exists? (yes/no), web pages, FiVaTech, pattern tree, get pattern tree from DB, extractor, data.]
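One plausible reading of this flowchart is sketched below: if a pattern tree is already stored for the page's domain, it is fetched and passed to the extractor; otherwise FiVaTech is run on the pages. The function names, the use of a dictionary as the database, the callables `induce` and `extract`, and the step that stores a newly induced pattern tree are all assumptions made for illustration.

```python
from urllib.parse import urlparse


def realtime_extract(url, pages, pattern_db, induce, extract):
    """Return data for `pages`, reusing a stored pattern tree when one exists."""
    domain = urlparse(url).netloc
    if domain in pattern_db:
        # "Domain exists? Yes": get the pattern tree from the DB and extract.
        tree = pattern_db[domain]
        return [extract(tree, page) for page in pages]
    # "Domain exists? No": run the induction step (FiVaTech in the slides),
    # assumed here to return both a pattern tree and the extracted data.
    tree, data = induce(pages)
    pattern_db[domain] = tree  # assumption: the new tree is stored for reuse
    return data


# Stand-in callables so the sketch runs without a real extractor.
def fake_induce(pages):
    return "tree-for-pages", [page.upper() for page in pages]


def fake_extract(tree, page):
    return page.upper()


db = {}
first = realtime_extract("http://example.com/search?q=a", ["a"], db, fake_induce, fake_extract)
second = realtime_extract("http://example.com/search?q=b", ["b"], db, fake_induce, fake_extract)
```

The first call takes the "No" branch and populates the database; the second call for the same domain takes the "Yes" branch and skips induction.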
30. Background Gadget Execution Phase – Using Extractor [Figure: flowchart. Labels shown: DB, gadget's profile, download web pages, web pages, pattern tree, extractor, data, update gadget's profile.]
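The background update step can be sketched as follows, assuming a gadget profile is a dictionary holding the source URLs, a stored pattern tree, an update interval, and the time of the last update. These field names and the `download` and `extract` callables are hypothetical; the slides do not specify the profile's layout.

```python
def is_due(profile, now, requested=False):
    """A gadget is refreshed on request or once its interval has elapsed."""
    return requested or now - profile["last_update"] >= profile["interval"]


def refresh(profile, now, download, extract, requested=False):
    """Download the pages, re-run the extractor, and update the profile in place."""
    if not is_due(profile, now, requested):
        return False
    pages = download(profile["urls"])
    profile["data"] = [extract(profile["pattern_tree"], page) for page in pages]
    profile["last_update"] = now
    return True


# Hypothetical profile: refresh every 3600 seconds.
profile = {
    "urls": ["http://example.com/results"],
    "pattern_tree": "stored-tree",
    "interval": 3600,
    "last_update": 0,
    "data": [],
}
```

A scheduler would call `refresh` for each stored profile at regular intervals, and a user-triggered update would pass `requested=True` to bypass the interval check.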
31. Background Gadget Execution Phase – Using Schema Matching [Figure: flowchart. Labels shown: DB, gadget's profile, download web pages, web pages, FiVaTech, data, schema matching, update gadget's profile.]
32. Future Work We will implement the Web information extraction system. We will also redesign the interface to make it easier to use, and redesign the information integration chart.