Course Introduction

  • Web Information Retrieval and Extraction. Chia-Hui Chang, Associate Professor, National Central University, Taiwan. [email_address]. Sep. 16, 2005
  • Course Content
    • Web Information Integration
    • Web Information Retrieval
    • Traditional IR systems
    • Web Mining
  • Topic I: Web Information Integration
    • Search Interface Integration
    • Web page collection
    • Web data extraction
    • Search result integration
    • Web Service
  • Web Page Collection
    • Metacrawler http://www.metacrawler.com/
      • Google · Yahoo · Ask Jeeves · About · LookSmart · Overture · FindWhat
    • Ebay http://www.ebay.com/
      • Information asymmetry between buyers and sellers
    • Technology
      • Program generators
      • WNDL, W4F, XWrap, Robomaker
  • Web Data Extraction
    • Example
    • Technology
      • Information Extraction Systems
      • WIEN, Softmealy, Stalker, IEPAD, DeLA, OLERA, Roadrunner, EXALG, XWrap, W4F, etc.
      • Data Annotation
    • Wrapper induction is an excellent exercise of machine learning technologies
  • Topic II: Web Information Retrieval
    • From User Perspective
      • Browsing via categories
      • Searching via search engines
      • Query answering
    • From System Perspective
      • Web crawling
      • Indexing and querying
      • Link-based ranking
      • Query answering
      • Semantic Web, XML retrieval, etc.
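
The "link-based ranking" item above can be illustrated with a small power-iteration sketch in the style of PageRank. This is only one well-known example of link-based ranking, not necessarily the method covered in the course. The function name `pagerank`, the damping value 0.85, and the toy link graph are assumptions made for illustration.

```python
def pagerank(links, damping=0.85, iterations=50):
    """Approximate PageRank-style scores by power iteration.

    `links` maps each page to the list of pages it links to.
    Assumption: every link target also appears as a key of `links`.
    """
    pages = list(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}  # start from a uniform distribution
    for _ in range(iterations):
        # Every page receives the "random jump" share first.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page, outlinks in links.items():
            if outlinks:
                # Split this page's damped score evenly among its out-links.
                share = damping * rank[page] / len(outlinks)
                for target in outlinks:
                    new_rank[target] += share
            else:
                # Dangling page (no out-links): spread its score over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Hypothetical three-page Web graph.
toy_links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
toy_ranks = pagerank(toy_links)
```

In this toy graph, page `c` is linked from both `a` and `b`, so it ends up with a higher score than `b`, which is linked only from `a`.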
  • Web Categories
    • Yahoo http://www.yahoo.com
      • Fourteen categories and ninety subcategories
      • Categorization by humans
    • Technology
      • Document classification
    • Pros and Cons
      • Overview of the content in the database
      • Browsing without specific targets
  • Search Engines
    • Google http://www.google.com
      • Search by keyword matching
      • Business model
    • Technology
      • Web Crawling
      • Indexing for fast search
      • Ranking for good results
    • Pros and Cons
      • Search engines locate documents, not answers
  • Question Answering
    • Ask Jeeves http://www.ask.com
      • Input a question or keywords
      • Relevance feedback from users to clarify the targets
    • ExtrAns (Molla et al., 2003)
    • Technology
      • Text information extraction
      • Natural Language Processing
  • Topic III: Techniques from Traditional IR
    • Text Operations
      • Lexical analysis of the text
      • Elimination of stop words
      • Index term selection
    • Indexing and Searching
      • Inverted files
      • Suffix trees and suffix arrays
      • Signature files
    • IR Model and Ranking Technique
    • Query Operations
      • Relevance feedback
      • Query expansion
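
The text operations and the inverted file listed above can be sketched in a few lines. This is a minimal illustration, assuming a toy stop-word list and a simple Boolean AND search. The names `STOP_WORDS`, `index_terms`, `build_inverted_index`, and `search` are hypothetical and do not come from the course material.

```python
import re
from collections import defaultdict

# Toy stop-word list (an assumption; real systems use much longer lists).
STOP_WORDS = {"a", "an", "and", "the", "of", "in", "to", "is", "for"}


def tokenize(text):
    """Lexical analysis: lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def index_terms(text):
    """Stop-word elimination: keep only tokens that are not stop words."""
    return [t for t in tokenize(text) if t not in STOP_WORDS]


def build_inverted_index(docs):
    """Build an inverted file: map each term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in index_terms(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Boolean AND search: return ids of documents containing every query term."""
    terms = index_terms(query)
    if not terms:
        return set()
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings)


# Hypothetical three-document collection.
docs = {
    1: "The crawler collects Web pages",
    2: "Indexing of Web pages for fast search",
    3: "Ranking the search results",
}
index = build_inverted_index(docs)
```

With this collection, `search(index, "web pages")` matches documents 1 and 2, while a query made up only of stop words matches nothing.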
  • Topic IV: Web Mining
    • Usage Analysis
    • Focused Crawling
    • Clustering of Web search results
    • Text classification
  • Available Techniques
    • Artificial Intelligence
      • Search and Logic programming
    • Machine Learning
      • Supervised learning (classification)
      • Unsupervised learning (clustering)
    • Database and Warehousing
      • OLAP and Iceberg queries
    • Data Mining
      • Pattern mining from large data sets
    • Other Disciplines
      • Statistics, neural networks, genetic algorithms, etc.
  • Classical Tasks
    • Classification
      • Artificial Intelligence, Machine Learning
    • Clustering
      • Pattern recognition, neural networks
    • Pattern Mining
      • Association rules, sequential patterns, episode mining, periodic patterns, frequent continuities, etc.
  • Classification Methods
    • Supervised Learning (Concept Learning)
      • General-to-specific ordering
      • Decision tree learning
      • Bayesian learning
      • Instance-based learning
      • Sequential covering algorithms
      • Artificial neural networks
      • Genetic algorithms
    • Reference: Mitchell, 1997
  • Clustering Algorithms
    • Unsupervised learning (comparative analysis)
      • Partition Methods
      • Hierarchical Methods
      • Model-based Clustering Methods
      • Density-based Methods
      • Grid-based Methods
    • Reference: Han and Kamber (Chapter 8)
  • Pattern Mining
    • Various kinds of patterns
      • Association Rules
        • Closed itemsets, maximal itemsets, non-redundant rules, etc.
      • Sequential patterns
      • Episode mining
      • Periodic patterns
      • Frequent continuities
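
For the association rules listed above, the two standard measures are support (the fraction of transactions containing an itemset) and confidence (support of the whole rule divided by support of its antecedent). The sketch below computes both on a toy data set. The function names and the transactions are assumptions made for illustration.

```python
def support(transactions, itemset):
    """Fraction of transactions that contain every item in `itemset`."""
    itemset = set(itemset)
    hits = sum(1 for t in transactions if itemset <= set(t))
    return hits / len(transactions)


def confidence(transactions, antecedent, consequent):
    """Confidence of the rule antecedent -> consequent.

    Defined as support(antecedent and consequent) / support(antecedent);
    returns 0.0 when the antecedent never occurs.
    """
    ante = support(transactions, antecedent)
    if ante == 0:
        return 0.0
    both = support(transactions, set(antecedent) | set(consequent))
    return both / ante


# Hypothetical market-basket transactions.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
bread_milk_support = support(baskets, {"bread", "milk"})
bread_to_milk_conf = confidence(baskets, {"bread"}, {"milk"})
```

Here {bread, milk} appears in 2 of 4 baskets (support 0.5), and 2 of the 3 baskets containing bread also contain milk (confidence 2/3).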
  • Applications
    • Relational Data
      • E.g. Northern Group Retail (Business Intelligence)
      • Banking, Insurance, Health, others
    • Web Information Retrieval and Extraction
    • Bioinformatics
    • Multimedia Mining
    • Spatial Data Mining
    • Time-series Data Mining
  • Course Schedule
    • Web Data Extraction (3 weeks)
    • Web Interface Integration (1 week)
    • Web Page Collection (1 week)
    • Techniques from Traditional IR (2 weeks)
    • Query Answering (1 week)
    • Link Based Analysis (1 week)
    • Focused Crawling (1 week)
    • Web Usage Mining (1 week)
    • Clustering Search Results (1 week)
    • Text Classification (1 week)
  • Grading
    • Project I: 30%
      • Implementation of the chosen paper (W10)
    • Project II: 30%
      • Topic can be chosen freely (W16)
    • Paper reading: 20%
      • Presentation
    • Homework: 10%
    • Involvement in the Class: 10%
  • References
    • Baeza-Yates, R. and Ribeiro-Neto, B. 1999. Modern Information Retrieval, Addison Wesley
    • Han, J. and Kamber, M. 2001. Data Mining: Concepts and Techniques, Morgan Kaufmann Publishers
    • Mitchell, T. M. 1997. Machine Learning, McGraw-Hill.
    • Molla, D., Schwitter, R., Rinaldi, F., Dowdall, J. and Hess, M. 2003. ExtrAns: Extracting Answers from Technical Texts. IEEE Intelligent Systems, July/August 2003, 12-17.