
2009 God


Published in: Spiritual

  1. Information Extraction Tasks
     Yen Ling, 2009
  2. Outline
     • Information Integration
     • Generating an Extractor
     • Information Extraction Tasks
  3. Introduction
     [Figure: Web sites A, B, and C each return result pages (A, B, C), which are combined into integrated information pages.]
  4. Introduction
     • Information integration is the merging of information from disparate sources with differing conceptual, contextual, and typographical representations.
     • It is used to consolidate data from unstructured or semi-structured resources.
     • The final result is displayed in rich modules such as tables, lists, graphs, and maps.
     • Users can receive the results via RSS, gadgets, or email.
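The merging step described on this slide can be illustrated with a minimal sketch. It assumes each source delivers records as dictionaries with its own field names, and that a per-source mapping to shared field names is already known; `normalize`, `integrate`, and the field maps are hypothetical names for illustration, not part of the GoD system.

```python
def normalize(record: dict, field_map: dict[str, str]) -> dict[str, str]:
    """Rename source-specific fields to shared names and collapse whitespace.

    Fields that have no entry in field_map are dropped (an assumption).
    """
    return {
        field_map[key]: " ".join(str(value).split())
        for key, value in record.items()
        if key in field_map
    }


def integrate(sources: list[tuple[list[dict], dict[str, str]]]) -> list[dict[str, str]]:
    """Merge records from several sources into one list with a common schema.

    Each source is a pair: (records, mapping from its field names to shared names).
    """
    merged: list[dict[str, str]] = []
    for records, field_map in sources:
        merged.extend(normalize(record, field_map) for record in records)
    return merged
```

The merged list shares one set of field names, so it can be handed to a table, list, or other display module regardless of which site a record came from.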
  5. Related Work: Relations, Cards, and Search Templates (UIST’07)
     • In the figure, the three objects to the left of the arrow are search templates, relations, and cards.
     • The cards to the right of the arrow represent the information after integration.
     • Leverages a supervised extractor.
  6. Related Work: Damia: Data Mashups for Intranet Applications (SIGMOD’08)
     • Integrates information from a company's internal data sources.
     • Chiefs can operate the system easily and quickly without programmers.
     • Employees get mashups from a feed server.
  7. Related Work: Transcendence: Enabling a Personal View of the Deep Web (IUI’08)
     • Leverages an unsupervised extractor.
     • Users must use the Firefox browser, whereas GoD does not require a specific browser because it is a web-based application.
  8. Related Work: User-centric Web Data Integration: Design and Implementation of Gadget on Demand System
     • Leverages an unsupervised extractor.
     • Integrates information from multiple sources.
     • Only a few clicks are needed to integrate information from multiple sources.
     • Users can use the system without programming ability.
  9. Related Work: Dapper
     • Its purpose is similar to that of GoD.
     • Leverages a supervised extractor.
     • Provides a virtual browser to achieve “What You See Is What You Get”.
     • Unlike GoD, it does not extract information from multiple sources.
  10. Web Information Extraction
     • Full operators for a wrapper
     • Mapping of an incoming query
     • By hand
     • The construction of an extractor
     • Construct a base framework
  11. Outline
     • Information Integration
     • Generating an Extractor
     • Information Extraction Tasks
  12. Analysis of Different Extractors
     • Unsupervised extractor
     • Supervised extractor
     • Induction based on labeled page examples
     • Knowledge-based extractors
  13. GoD with Unsupervised/Supervised Extractor
     [Figure: workflow diagram with the labels Supervised, Input web pages, Label page, IE system, Unsupervised, Select Fields & Data, Select Display Module, Publish, and Integrate sources.]
  14. GoD with Unsupervised/Supervised Extractor
     • Extractor precision: supervised > unsupervised.
     • Use case flow: unsupervised is easier than supervised.
     • User interface design: supervised is more complex than unsupervised.
  15. Extractor for GoD
     • Problem formulation: given a web page and a pattern tree produced by FiVaTech, the task is to use the pattern tree to extract data from the web page.
     • The problem splits into two sub-problems: pattern matching and approximate matching of textual attributes.
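The pattern-matching sub-problem can be illustrated with a minimal sketch. It assumes both the page and the pattern are simple trees of tag names, and that pattern leaves may be marked as data slots; the `Node` class, its `slot` field, and the `match` and `extract_all` functions are hypothetical names for illustration and do not reflect FiVaTech's actual data structures. The sketch does strict structural matching only and ignores optional or repeated pattern nodes.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Node:
    """Simplified tree node used for both the DOM and the pattern (assumption)."""
    tag: str
    children: list["Node"] = field(default_factory=list)
    text: str = ""               # text content, used on DOM nodes
    slot: Optional[str] = None   # attribute name to extract, used on pattern nodes


def match(pattern: Node, dom: Node) -> Optional[dict[str, str]]:
    """Return the slot values if dom has exactly the pattern's shape, else None."""
    if pattern.tag != dom.tag or len(pattern.children) != len(dom.children):
        return None
    found: dict[str, str] = {}
    if pattern.slot is not None:
        found[pattern.slot] = dom.text.strip()
    for p_child, d_child in zip(pattern.children, dom.children):
        sub = match(p_child, d_child)
        if sub is None:
            return None
        found.update(sub)
    return found


def extract_all(pattern: Node, dom: Node) -> list[dict[str, str]]:
    """Collect one record for every DOM subtree that matches the pattern."""
    records: list[dict[str, str]] = []
    hit = match(pattern, dom)
    if hit is not None:
        records.append(hit)
    for child in dom.children:
        records.extend(extract_all(pattern, child))
    return records
```

For example, a pattern of one `tr` node with two `td` slots named `title` and `price` yields one record per table row of that shape when run over a page's tree.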
  16. Extractor for GoD
     [Figure: extractor pipeline with the labels Preprocessing, Pattern Tree, DOM Tree, Pattern matching, Candidate paths, Content Matching, Existing Data, and Data.]
  17. Extractor
     • Pattern matching
     • Peer node reorganization
     • Approximate matching of textual attributes
     • Find attributes from the data that FiVaTech extracted.
     • Attributes: Date, Money, Telephone, Word Number, …
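Recognizing attribute types such as Date, Money, and Telephone in extracted data can be sketched as follows. This is a deliberately simple regex-based type guess standing in for the approximate matching named on the slide, not the method the slides describe; the patterns, the `"text"` fallback label, and the 0.8 threshold are all assumptions chosen for illustration.

```python
import re
from collections import Counter

# Hypothetical patterns for three of the attribute types listed on the slide.
ATTRIBUTE_PATTERNS = {
    "date": re.compile(r"^\d{4}[-/.]\d{1,2}[-/.]\d{1,2}$"),
    "money": re.compile(r"^(?:NT\$|US\$|\$|€|£)\s?\d[\d,]*(?:\.\d+)?$"),
    "telephone": re.compile(r"^\(?\d{2,4}\)?[-\s]?\d{3,4}[-\s]?\d{3,4}$"),
}


def guess_attribute(value: str) -> str:
    """Return the first attribute type whose pattern matches, else 'text'."""
    text = value.strip()
    for name, pattern in ATTRIBUTE_PATTERNS.items():
        if pattern.match(text):
            return name
    return "text"


def column_type(values: list[str], threshold: float = 0.8) -> str:
    """Label a column by its most common guessed type.

    The label is accepted only if at least `threshold` of the values agree,
    which tolerates a few noisy cells; otherwise the column is 'text'.
    """
    if not values:
        return "text"
    counts = Counter(guess_attribute(v) for v in values)
    name, hits = counts.most_common(1)[0]
    return name if hits / len(values) >= threshold else "text"
```

Voting over a whole column rather than a single cell is what makes the match approximate here: a price column containing one "N/A" cell can still be labeled as money.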
  27. Outline
     • Information Integration
     • Generating an Extractor
     • Information Extraction Tasks
  28. Information Integration Tasks
     • Real-time system phase: users use the system to create the gadgets they have in mind.
     • Background gadget execution phase: the system updates the content of each gadget periodically or on request.
  29. Real-time System Phase
     [Figure: flowchart with the labels Web pages, Domain exists? (Yes/No), FiVaTech, Pattern Tree, Get Pattern Tree from DB, Extractor, and Data.]
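One plausible reading of the real-time phase is a lookup with a fallback: reuse a stored pattern tree when the domain is already known, otherwise induce one first. The sketch below assumes that reading. The function name, the dictionary standing in for the DB, the step that stores a newly induced tree, and the `induce` and `extract` callables (stand-ins for FiVaTech and the extractor) are all hypothetical.

```python
from typing import Any, Callable

PatternTree = Any                  # placeholder type; the real structure is not specified
Records = list[dict[str, str]]


def real_time_phase(
    domain: str,
    pages: list[str],
    pattern_db: dict[str, PatternTree],
    induce: Callable[[list[str]], PatternTree],
    extract: Callable[[PatternTree, list[str]], Records],
) -> Records:
    """Extract data from pages, inducing a pattern tree only for unseen domains."""
    tree = pattern_db.get(domain)
    if tree is None:
        # "Domain exists?" -> No: induce a pattern tree from the pages.
        tree = induce(pages)
        # Assumption: the new tree is stored so later requests can reuse it.
        pattern_db[domain] = tree
    # Both branches end by running the extractor with a pattern tree.
    return extract(tree, pages)
```

Passing the induction and extraction steps in as callables keeps the sketch independent of how either component is actually implemented.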
  30. Background Gadget Execution Phase – Using Extractor
     [Figure: flowchart with the labels DB, Gadget’s profile, Download web pages, Web Pages, Pattern Tree, Extractor, Data, and Update Gadget’s profile.]
  31. Background Gadget Execution Phase – Using Schema Matching
     [Figure: flowchart with the labels DB, Gadget’s profile, Download web pages, Web Pages, FiVaTech, Data, Schema Matching, and Update Gadget’s profile.]
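The schema-matching step can be illustrated with a minimal sketch that aligns freshly extracted columns with the fields stored in a gadget's profile by comparing their values. Jaccard overlap with greedy assignment is a simple stand-in chosen here for illustration, not the matching method the slides describe; the function names, the column dictionaries, and the 0.3 cut-off are assumptions.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    """Share of distinct values that two columns have in common."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def match_schema(
    profile_columns: dict[str, list[str]],
    new_columns: dict[str, list[str]],
    min_score: float = 0.3,
) -> dict[str, str]:
    """Map each profile field to the most similar unused new column.

    Fields with no column scoring at least `min_score` are left unmapped.
    """
    mapping: dict[str, str] = {}
    used: set[str] = set()
    for field_name, old_values in profile_columns.items():
        best, best_score = None, min_score
        for column, new_values in new_columns.items():
            if column in used:
                continue
            score = jaccard(set(old_values), set(new_values))
            if score >= best_score:
                best, best_score = column, score
        if best is not None:
            mapping[field_name] = best
            used.add(best)
    return mapping
```

Value overlap works only when part of the old content is still on the page after an update; a page whose content has changed completely would need a different signal, such as the attribute types of each column.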
  32. Future Work
     • We will implement the Web information extraction system.
     • We will also redesign the interface to be easy to use and redesign the information integration chart.
  33. Thanks for your time.