Course Introduction

  1. Web Information Retrieval and Extraction
     Chia-Hui Chang, Associate Professor, National Central University, Taiwan
     [email_address]
     Sep. 16, 2005
  2. Course Content
       • Web Information Integration
       • Web Information Retrieval
       • Traditional IR systems
       • Web Mining
  3. Topic I: Web Information Integration
       • Search Interface Integration
       • Web page collection
       • Web data extraction
       • Search result integration
       • Web Service
  4. Web Page Collection
       • Metacrawler http://www.metacrawler.com/
           • Google · Yahoo · Ask Jeeves · About · LookSmart · Overture · FindWhat
       • eBay http://www.ebay.com/
           • Information asymmetry between buyers and sellers
       • Technology
           • Program generators
           • WNDL, W4F, XWrap, Robomaker
  5. Web Data Extraction
       • Example
       • Technology
           • Information Extraction Systems
           • WIEN, Softmealy, Stalker, IEPAD, DeLA, OLERA, Roadrunner, EXALG, XWrap, W4F, etc.
           • Data Annotation
       • Wrapper induction is an excellent exercise in applying machine learning techniques
  6. Topic II: Web Information Retrieval
       • From User Perspective
           • Browsing via categories
           • Searching via search engines
           • Query answering
       • From System Perspective
           • Web crawling
           • Indexing and querying
           • Link-based ranking
           • Query answering
           • Semantic Web, XML retrieval, etc.
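The slide above names link-based ranking without describing a specific algorithm. As one illustration, here is a minimal sketch of PageRank by power iteration. PageRank itself, the damping factor of 0.85, the function name `pagerank`, and the four-page toy graph are assumptions chosen for the example, not content from the slides.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    links maps each page to the list of pages it links to.
    The damping value and iteration count are illustrative defaults.
    """
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without out-links is spread evenly over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n for p in pages}
        for page, targets in links.items():
            for target in targets:
                # Each page shares its current rank equally among its out-links.
                new_rank[target] += damping * rank[page] / len(targets)
        rank = new_rank
    return rank


# Hypothetical toy web: "c" is linked from three pages, "d" from none.
toy_web = {"a": ["b", "c"], "b": ["c"], "c": ["a"], "d": ["c"]}
ranks = pagerank(toy_web)
for page in sorted(ranks, key=ranks.get, reverse=True):
    print(page, round(ranks[page], 3))
```

In this toy graph the page with the most in-links ends up ranked highest and the page with no in-links lowest, which is the intuition behind ranking by link structure rather than by keyword match alone.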
  7. Web Categories
       • Yahoo http://www.yahoo.com
           • Fourteen categories and ninety subcategories
           • Categorization by humans
       • Technology
           • Document classification
       • Pros and Cons
           • Overview of the content in the database
           • Browsing without specific targets
  8. Search Engines
       • Google http://www.google.com
           • Search by keyword matching
           • Business model
       • Technology
           • Web Crawling
           • Indexing for fast search
           • Ranking for good results
       • Pros and Cons
           • Search engines locate documents, not answers
  9. Question Answering
       • Ask Jeeves http://www.ask.com
           • Input a question or keywords
           • Relevance feedback from users to clarify the targets
       • ExtrAns (Molla et al., 2003)
       • Technology
           • Text information extraction
           • Natural Language Processing
  10. Topic III: Techniques from Traditional IR
       • Text Operations
           • Lexical analysis of the text
           • Elimination of stop words
           • Index term selection
       • Indexing and Searching
           • Inverted files
           • Suffix trees and suffix arrays
           • Signature files
       • IR Model and Ranking Technique
       • Query Operations
           • Relevance feedback
           • Query expansion
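To make the text operations and inverted files listed on the slide above concrete, here is a minimal sketch that tokenizes text, drops stop words, builds an inverted index, and answers a conjunctive (AND) keyword query. The stop-word list, the function names, and the three sample documents are assumptions made up for the example; real systems use much larger stop lists and store positions and weights in the postings.

```python
import re
from collections import defaultdict

# Tiny illustrative stop-word list (an assumption, not a standard list).
STOP_WORDS = {"a", "an", "and", "in", "is", "of", "on", "the", "to"}


def tokenize(text):
    """Lexical analysis: lowercase, split into alphanumeric terms, drop stop words."""
    return [t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOP_WORDS]


def build_inverted_index(docs):
    """Map each index term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the ids of documents containing every non-stop-word query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


# Hypothetical sample collection.
docs = {
    1: "Web crawling and indexing",
    2: "Indexing the Web with inverted files",
    3: "Suffix trees and suffix arrays",
}
index = build_inverted_index(docs)
print(search(index, "the web indexing"))
print(search(index, "suffix arrays"))
```

Because the index maps terms to documents, a query only touches the postings of its own terms instead of scanning every document, which is what makes the lookup fast.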
  11. Topic IV: Web Mining
       • Usage Analysis
       • Focused Crawling
       • Clustering of Web search results
       • Text classification
  12. Available Techniques
       • Artificial Intelligence
           • Search and Logic programming
       • Machine Learning
           • Supervised learning (classification)
           • Unsupervised learning (clustering)
       • Database and Warehousing
           • OLAP and Iceberg queries
       • Data Mining
           • Pattern mining from large data sets
       • Other Disciplines
           • Statistics, neural networks, genetic algorithms, etc.
  13. Classical Tasks
       • Classification
           • Artificial Intelligence, Machine Learning
       • Clustering
           • Pattern recognition, neural networks
       • Pattern Mining
           • Association rules, sequential patterns, episodes mining, periodic patterns, frequent continuities, etc.
  14. Classification Methods
       • Supervised Learning (Concept Learning)
           • General-to-specific ordering
           • Decision tree learning
           • Bayesian learning
           • Instance-based learning
           • Sequential covering algorithms
           • Artificial neural networks
           • Genetic algorithms
       • Reference: Mitchell, 1997
  15. Clustering Algorithms
       • Unsupervised learning (comparative analysis)
           • Partition Methods
           • Hierarchical Methods
           • Model-based Clustering Methods
           • Density-based Methods
           • Grid-based Methods
       • Reference: Han and Kamber (Chapter 8)
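As an illustration of the partition methods listed on the slide above, here is a minimal sketch of k-means, one well-known partitioning algorithm. The choice of k-means, the deterministic initialization from the first k points, the helper names, and the six sample points are all assumptions for the example; the slides do not single out an algorithm.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def mean(points):
    """Coordinate-wise mean of a non-empty list of points."""
    return tuple(sum(coords) / len(points) for coords in zip(*points))


def kmeans(points, k, iterations=20):
    """Basic k-means. Returns the centroids and the clusters from the final assignment pass."""
    # Deterministic initialization (an assumption): use the first k points.
    centroids = [points[i] for i in range(k)]
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        # Assignment step: each point joins the cluster of its nearest centroid.
        clusters = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda i: squared_distance(p, centroids[i]))
            clusters[nearest].append(p)
        # Update step: move each centroid to the mean of its cluster.
        # An empty cluster keeps its previous centroid.
        centroids = [mean(c) if c else centroids[i] for i, c in enumerate(clusters)]
    return centroids, clusters


# Hypothetical sample data: two visually separate groups of points.
points = [(1.0, 1.0), (1.5, 2.0), (1.0, 0.0), (8.0, 8.0), (9.0, 9.5), (8.5, 9.0)]
centroids, clusters = kmeans(points, k=2)
print(centroids)
print(clusters)
```

The result of k-means depends on the initial centroids, so practical implementations usually randomize the start and rerun several times; the fixed start here only keeps the example reproducible.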
  16. Pattern Mining
       • Various kinds of patterns
           • Association Rules
               • Closed itemsets, maximal itemsets, non-redundant rules, etc.
           • Sequential patterns
           • Episodes mining
           • Periodic patterns
           • Frequent continuities
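The association rules on the slide above are usually evaluated with two measures, support and confidence. Here is a minimal sketch that computes both and enumerates frequent itemsets by brute force. The function names, the `max_size` cap, and the market-basket transactions are assumptions for the example; practical miners such as Apriori prune candidates instead of enumerating every combination.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    itemset = set(itemset)
    return sum(itemset <= t for t in transactions) / len(transactions)


def frequent_itemsets(transactions, min_support, max_size=3):
    """Brute-force search for itemsets of up to max_size items with enough support."""
    items = sorted(set().union(*transactions))
    result = {}
    for size in range(1, max_size + 1):
        for candidate in combinations(items, size):
            s = support(candidate, transactions)
            if s >= min_support:
                result[candidate] = s
    return result


def confidence(antecedent, consequent, transactions):
    """Confidence of the rule antecedent -> consequent.

    Assumes the antecedent occurs in at least one transaction.
    """
    both = set(antecedent) | set(consequent)
    return support(both, transactions) / support(antecedent, transactions)


# Hypothetical market-basket data.
transactions = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
    {"bread", "milk", "beer"},
]
frequent = frequent_itemsets(transactions, min_support=0.6)
print(frequent)
print(confidence({"beer"}, {"diapers"}, transactions))
```

Closed and maximal itemsets, mentioned on the slide, are compact summaries of the full set of frequent itemsets that a search like this produces.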
  17. Applications
       • Relational Data
           • E.g. Northern Group Retail (Business Intelligence)
           • Banking, Insurance, Health, others
       • Web Information Retrieval and Extraction
       • Bioinformatics
       • Multimedia Mining
       • Spatial Data Mining
       • Time-series Data Mining
  18. Course Schedule
       • Web Data Extraction (3 weeks)
       • Web Interface Integration (1 week)
       • Web Page Collection (1 week)
       • Techniques from Traditional IR (2 weeks)
       • Query Answering (1 week)
       • Link-Based Analysis (1 week)
       • Focused Crawling (1 week)
       • Web Usage Mining (1 week)
       • Clustering Search Results (1 week)
       • Text Classification (1 week)
  19. Grading
       • Project I: 30%
           • Implementation of the chosen paper (W10)
       • Project II: 30%
           • Topic can be chosen freely (W16)
       • Paper reading: 20%
           • Presentation
       • Homework: 10%
       • Involvement in the Class: 10%
  20. References
       • Baeza-Yates, R. and Ribeiro-Neto, B. 1999. Modern Information Retrieval. Addison Wesley.
       • Han, J. and Kamber, M. 2001. Data Mining: Concepts and Techniques. Morgan Kaufmann Publishers.
       • Mitchell, T. M. 1997. Machine Learning. McGraw-Hill.
       • Molla, D., Schwitter, R., Rinaldi, F., Dowdall, J. and Hess, M. 2003. ExtrAns: Extracting Answers from Technical Texts. IEEE Intelligent Systems, July/August 2003, 12-17.