  • 04/26/10 Points to make: 1) XML is an extensible markup language for describing structured data. 2) XML is similar to HTML in that both are markup languages descended from SGML, and both use tags. 3) XML differs from HTML in that XML tags identify the meaning of data elements, rather than specifying how the data should be formatted, as in HTML. XML therefore separates the three components of a document: content, structure, and presentation. Relationships among data elements are expressed by simple nesting. The example should make these points clear: it shows data from the same source as published in HTML and in XML. Note that even though the XML document is more verbose, it provides the information in a far more convenient and usable format from a data management perspective.
  • Semantic Web

    1. 1. Information Extraction from the WWW using Machine Learning Techniques
        Lee McCluskey, Dept of Informatics
        email: lee@hud.ac.uk
    2. 2. Motivation
        • General: The WWW is a virtually limitless mass of information aimed mainly at human consumption. It is desirable to make this information generally available to computer programs so that they can provide higher levels of service to people.
        • This supports the new area of “Semantic Technologies”, apparently the new “billion dollar” market.
            • NOW: Desktop + Client-Server Technologies
            • COMING: Distributed Intelligent Services
        • Specific: This work is related to a Knowledge Transfer Partnership just starting with a local company called View Based Systems.
    3. 3. Overview of Talk
        • We will investigate Information Extraction: the process of extracting “meaningful” data from raw or semi-structured text.
        • We will investigate techniques from ‘similarity-based’ Machine Learning to learn/extract meaning from traditional web page content.
        • We will also look at Information Agents: programs that can retrieve information from web sites using database-like queries and can integrate information from several web sites to solve complex queries.
    4. 4. Information Extraction from the WWW – WHY?
        • Problem: You’re on eBay and you want a toilet cistern and a wash basin that have a combined width of under 90cm.
        • Current solution: waste all Sunday afternoon going through 673 entries for “toilet” looking for widths, and cross-checking them against 923 entries for “wash basin”!
        • Need a universally-recognised query language
        • Need universally-accessible vocabularies to avoid the problems of identity (!)
        • Need to be able to reason with acquired knowledge
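Once the widths have been extracted into structured records, the Sunday-afternoon search reduces to a short query. The following is a minimal sketch under stated assumptions: the records, the `width_cm` field, and the function name `pairs_under` are all invented for illustration and do not come from eBay or from the talk.

```python
from itertools import product

# Hypothetical extracted records; ids and widths are made up for illustration.
cisterns = [
    {"id": "c1", "width_cm": 38.0},
    {"id": "c2", "width_cm": 45.0},
]
basins = [
    {"id": "b1", "width_cm": 50.0},
    {"id": "b2", "width_cm": 56.0},
]


def pairs_under(cisterns, basins, max_total_cm):
    """Return (cistern id, basin id, combined width) for every pair narrower than the limit."""
    return [
        (c["id"], b["id"], c["width_cm"] + b["width_cm"])
        for c, b in product(cisterns, basins)
        if c["width_cm"] + b["width_cm"] < max_total_cm
    ]


matches = pairs_under(cisterns, basins, 90.0)
print(matches)
```

The query itself is trivial; the hard part, and the subject of the rest of the talk, is getting the `width_cm` values out of pages written for human readers.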
    5. 5. Information Extraction from the WWW – WHY?
        • Our (KTP) interest:
            • extract data from the WWW related to a “theme” or subculture, e.g. bee-keeping, role-playing games, Northern Soul music
        • We want to populate and maintain a central database with this information …
    6. 6. Information Extraction from The Web
        • Information extraction is the process of extracting “meaningful” data from raw or semi-structured text.
        • IE tasks form a spectrum, from EASIER to HARDER:
            • EASIER: “Feature Extraction”: extract a particular piece of data from a semi-structured or unstructured document and give it an XML markup, e.g. extract an address from an HTML web page.
            • HARDER: “Natural Language Understanding”: take raw (English) text from a web page and turn it into some logic representing its meaning.
    7. 7. Information Extraction from The Web
        [Diagram: WEB PAGES => WRAPPERS => STRUCTURED DATA]
        Example rows of structured data:
            BA    red   555  sue
            MSc   red   123  dave
            PhD   grey  345  bill
            BSc   blue  664  tom
    8. 8. Information Extraction
        • The Web’s HTML content makes it difficult to retrieve and integrate data from multiple sources.
        • An agent can use a wrapper to extract the information from a collection of similar-looking Web pages.
        • The wrapper ~ a grammar of the data in the web site + code to utilize the grammar
        • This is similar to turning HTML into XML + a grammar (DTD)
    9. 9. Example of Automated Extraction
        Source: HTML
            <h1> Residential Housing </h1>
            <ul> House For Sale
            <li> location: Hebden Bridge
            <li> agent-phone: 01422 843222
            <li> listed-price: £350,000
            <li> comments: Bijou residence on the edge of this popular little town...
            </ul>
            <hr>
            <ul> House For Sale ... </ul>
            ...
        ======> wrapper ======>
        Destination: XML
            <residential>
              <house>
                <location>
                  <city> Hebden Bridge </city>
                  <county> West Yorkshire </county>
                  <country> UK </country>
                </location>
                <agent-phone> 01422 843222 </agent-phone>
                <listed-price> £350,000 </listed-price>
                <comments> Bijou residence on the edge of this popular little town... </comments>
              </house>
              ...
            </residential>
        NB: XML + schema + recognised names
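The HTML-to-XML transformation on this slide can be sketched as a small hand-written wrapper. This is only an illustration under assumptions: the function name `html_to_xml` and the two regular expressions are hypothetical, element names are taken directly from the `<li>` labels, and `location` is left flat because the city/county/country split shown in the slide's XML cannot be recovered from this HTML alone.

```python
import re
import xml.etree.ElementTree as ET

# The HTML fragment from the slide (first house only).
HTML = """<h1> Residential Housing </h1>
<ul> House For Sale
<li> location: Hebden Bridge
<li> agent-phone: 01422 843222
<li> listed-price: £350,000
<li> comments: Bijou residence on the edge of this popular little town...
</ul>"""

# Assumed page "grammar": one <ul>...</ul> block per house,
# and one "field-name: value" pair per <li> line.
HOUSE_RE = re.compile(r"<ul>(.*?)</ul>", re.DOTALL)
FIELD_RE = re.compile(r"<li>\s*([\w-]+):\s*(.*)")


def html_to_xml(html):
    """Build a <residential> element tree from pages that follow the assumed grammar."""
    root = ET.Element("residential")
    for block in HOUSE_RE.findall(html):
        house = ET.SubElement(root, "house")
        for name, value in FIELD_RE.findall(block):
            ET.SubElement(house, name).text = value.strip()
    return root


tree = html_to_xml(HTML)
print(ET.tostring(tree, encoding="unicode"))
```

The two regular expressions play the role of the grammar and `html_to_xml` is the code that uses it. Both are tied to this one page layout, which is exactly the maintenance problem that motivates learning wrappers instead of writing them.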
    10. 10. Information Extraction
        • How can we create wrappers to ‘extract meaningful data’ from the current Web?
        • Option 1: write each wrapper by hand. BUT we would have to write a tool for every type of data / every type of web page, e.g. a C program to process every eBay page on toilets and output widths.
        • No: this is far too specific!
        • Option 2: write a tool that learns wrappers by inducing the format of web pages and/or particular fields.
        • This is more general and maintainable.
    11. 11. Using ‘Rule Induction’ to learn wrappers for HTML pages
        • The user is given, or acquires, ‘typical examples’ of the web pages containing the content to be learned.
        • The user points out to the agent the fields to be learned.
        • The agent builds up a characterization of the formats from the examples and transforms this into a wrapper in the form of a set of rules.
        • The agent uses the wrapper to recognize and extract data from similar web pages.
    12. 12. Rule Induction is an area of Machine Learning
        [Taxonomy diagram rooted at “Machine Learning”; node labels as extracted, hierarchy not preserved:]
        • Similarity-Based Learning
        • Explanation-Based Learning
        • Neural Networks
        • Learning from Examples
        • Learning by Observation
        • Rule Induction
        • Symbolic Learning
        • Sub-symbolic Learning
        • Genetic Approaches
    13. 13. Rule Induction from Examples
        • Roughly, the algorithm is as follows:
        • Input: a (large) number of +ve instances (examples) of concept C, plus (possibly) a number of –ve instances of C
        • Output: a characterization H of the examples, forming the rule H => C
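As one concrete instance of this input/output scheme, the sketch below uses a Find-S style generalization. This is a textbook similarity-based method swapped in here for illustration; the slides do not name a specific algorithm. The instances are hypothetical attribute tuples (text before the field, character class, length), and `induce` and `covers` are illustrative names.

```python
WILDCARD = "?"


def induce(positives):
    """Find-S style: the most specific conjunction of attribute values covering all positives."""
    hypothesis = list(positives[0])
    for example in positives[1:]:
        for i, value in enumerate(example):
            if hypothesis[i] != value:
                # Attribute values disagree, so generalize this attribute.
                hypothesis[i] = WILDCARD
    return tuple(hypothesis)


def covers(hypothesis, instance):
    """True if every non-wildcard attribute of the hypothesis matches the instance."""
    return all(h == WILDCARD or h == v for h, v in zip(hypothesis, instance))


# Hypothetical instances of the concept C = "area code".
positives = [("<b>", "digit", "3"), ("(", "digit", "3")]
negatives = [("<b>", "letter", "5")]

H = induce(positives)
# H is acceptable only if it excludes every negative instance.
consistent = not any(covers(H, n) for n in negatives)
print(H, consistent)
```

Here `H` is the characterization and the rule H => C reads: any instance covered by `H` is classified as an area code. The negative instances are used only to check that the generalization has not gone too far.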
    14. 14. Actual IE Example: the “Information Agent” of the University of Southern California’s Information Sciences Institute (ISI)
        • SPECIFIC PROBLEM: travel planning using the Web as an information source. There are a huge number of travel sites, with different types of information:
            • hotel and flight information
            • airports that are closest to your destination
            • directions to your hotel
            • weather in the destination city … ETC
        • Information Agents are capable of retrieving and integrating information from web sites to solve complex queries or tasks, e.g. “book my travel for my business trip next week”.
        • See the Heracles project (http://www.isi.edu/info-agents/)
    15. 15. Heracles’ Stalker inductive algorithm
        • This generates wrappers: in this case, rules that identify the start and end of an item within a web page.
        • It uses:
            • EXAMPLES
            • A HIERARCHICAL MODEL (ONTOLOGY) OF WHAT TO EXPECT IN A WEB PAGE
    16. 16. Example of training examples
        • Stalker is given examples of the ‘items’ it has to learn the wrapper for, e.g. examples of the item (or concept) “area code” of a telephone number:
            E1: 513 Pixco, <b>Venice</b>, Phone: 1-<b> 800 </b>-555-1515
            E2: 90 Colfax, <b> Palms </b>, Phone: ( 818 ) 508-1570
            E3: 523 1st St., <b> LA </b>, Phone: 1-<b> 888 </b>-578-2293
            E4: 403 La Tijera, <b> Watts </b>, Phone: ( 310 ) 798-0008
        • Stalker learns wrappers that detect the begin/end patterns of fields so that they can be used to ‘mine’ data in unseen web pages.
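To illustrate what begin/end rules look like when applied, the sketch below runs one hand-written pair of rules over E1 to E4. It is not Stalker itself: the landmarks are plain substrings rather than tokens, the rules are written by hand rather than induced, and `skip_to` and `extract` are hypothetical names chosen for this example.

```python
def skip_to(text, landmarks, start=0):
    """Try each landmark in order; return (start, end) indices of the first one found."""
    for landmark in landmarks:
        pos = text.find(landmark, start)
        if pos != -1:
            return pos, pos + len(landmark)
    return None


def extract(text, start_landmarks, end_landmarks):
    """Return the text between the start landmark and the next end landmark, or None."""
    begin = skip_to(text, start_landmarks)
    if begin is None:
        return None
    end = skip_to(text, end_landmarks, begin[1])
    if end is None:
        return None
    return text[begin[1]:end[0]].strip()


# Hand-written disjunctive rules: the area code starts after "(" or "-<b>"
# and ends before ")" or "</b>".
START = ["(", "-<b>"]
END = [")", "</b>"]

examples = [
    "513 Pixco, <b>Venice</b>, Phone: 1-<b> 800 </b>-555-1515",
    "90 Colfax, <b> Palms </b>, Phone: ( 818 ) 508-1570",
    "523 1st St., <b> LA </b>, Phone: 1-<b> 888 </b>-578-2293",
    "403 La Tijera, <b> Watts </b>, Phone: ( 310 ) 798-0008",
]

area_codes = [extract(e, START, END) for e in examples]
print(area_codes)
```

Two alternatives are needed in each rule because the examples use two different phone formats. An induction algorithm would have to discover such landmarks from the labelled examples instead of being handed them.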
    17. 17. Problems with Wrapper Induction
        • ISI report some success with their travel Information Agent and its IE process, BUT:
            • Wrapper brittleness: a website’s format may change, so maintenance is costly
            • Background knowledge (token hierarchy) is not strong
            • Unsupervised wrapper induction would be better
    18. 18. Summary
        • Information Extraction is the process of extracting “meaningful” data from raw or semi-structured text.
        • Wrappers are programs (rules) which are attached to web pages to extract data.
        • Machine Learning techniques can be used to create wrappers.
        • There are still many problems with these methods, especially in the learning and maintaining of wrappers.
    19. 19. Extra Reading
        • http://www.isi.edu/info-agents/
        • M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam and S. Slattery. “Learning to Extract Symbolic Knowledge from the World Wide Web”. AAAI-98. January 1998.
        • Ion Muslea, Steven Minton, Craig A. Knoblock. “Hierarchical Wrapper Induction for Semi-structured Information Sources”. Kluwer, 1999.
        • See the Kushmerick references; apparently he invented wrapper induction.
    20. 20. Related Legal / Ethical / Professional / Methodological Issues
        • Is it legal and/or ethical to automatically ‘harvest’ data from the WWW and re-use or sell it? In what cases is it illegal?
        • How does one automate checking the veracity of WWW data?
        • Will website owners conceal their data if the practice becomes widespread?
        • Future: do we really want distributed web intelligence?