1
Information Retrieval and Extraction
資訊檢索與擷取
Chia-Hui Chang, Assistant Professor
Dept. of Computer Science & Information Engineering
National Central University, Taiwan
2
Information Retrieval
• generic information retrieval system
select and return to the user desired documents from a large set
of documents in accordance with criteria specified by the user
• functions
» document search
the selection of documents from an existing collection of
documents
» document routing
the dissemination of incoming documents to appropriate
users on the basis of user interest profiles
3
Detection Need
• Definition
a set of criteria specified by the user which describes the kind of
information desired.
» queries in document search task
» profiles in routing task
• forms
» keywords
» keywords with Boolean operators
» free text
» example documents
» ...
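The keyword and Boolean forms of a Detection Need can be sketched in a few lines. This is only an illustration under stated assumptions: the nested-tuple query syntax and the function names `tokenize`, `matches_keywords`, and `matches_boolean` are hypothetical and do not come from the slides.

```python
import re

def tokenize(text):
    """Lowercase a text and return its set of alphanumeric word tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def matches_keywords(doc, keywords):
    """Keyword form: the document matches if it contains any keyword."""
    words = tokenize(doc)
    return any(k.lower() in words for k in keywords)

def matches_boolean(doc, query):
    """Boolean form: query is a term or a nested tuple such as
    ("AND", "scanner", ("OR", "ocr", "imaging")). Hypothetical syntax."""
    words = tokenize(doc)

    def evaluate(node):
        if isinstance(node, str):          # a leaf is a single keyword
            return node.lower() in words
        op, *args = node
        if op == "AND":
            return all(evaluate(a) for a in args)
        if op == "OR":
            return any(evaluate(a) for a in args)
        if op == "NOT":
            return not evaluate(args[0])
        raise ValueError(f"unknown operator: {op}")

    return evaluate(query)

doc = "The company sells an image scanner with OCR software."
need = ("AND", "scanner", ("OR", "ocr", "imaging"), ("NOT", "printer"))
```

Free text and example documents need more than exact term matching, which is why later slides turn to preprocessing, indexing, and ranking.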
4
Example
<head> Tipster Topic Description
<num> Number: 033
<dom> Domain: Science and Technology
<title> Topic: Companies Capable of Producing Document Management
<des> Description:
Document must identify a company that has the capability to produce a document
management system by obtaining a turnkey system or by obtaining and integrating
the basic components.
<narr> Narrative:
To be relevant, the document must identify a turnkey document management system
or components which could be integrated to form a document management system
and the name of either the company developing the system or the company using the
system. These components are: a computer, image scanner or optical character
recognition system, and an information retrieval or text management system.
5
Example (Continued)
<con> Concepts:
1. document management, document processing, office automation,
electronic imaging
2. image scanner, optical character recognition (OCR)
3. text management, text retrieval, text database
4. optical disk
<fac> Factors:
<def> Definitions
Document Management: the creation, storage and retrieval of documents containing
text, images, and graphics.
Image Scanner: a device that converts a printed image into a video image, without
recognizing the actual content of the text or pictures.
Optical Disk: a disk that is written and read by light; optical disks are sometimes
associated with the storage of digital images because of their high storage capacity.
6
Search vs. Routing
• The search process matches a single Detection Need against
the stored corpus to return a subset of documents.
• Routing matches a single document against a group of Profiles
to determine which users are interested in the document.
• Profiles are long-term expressions of user needs.
• Search queries are ad hoc in nature.
• A generic detection architecture can be used for both search
and routing.
7
Search
• retrieval of desired documents from an existing corpus
• Retrospective search is frequently interactive.
• Methods
» index the corpus by keyword, stem, and/or phrase
» apply statistical and/or learning techniques to better
understand the content of the corpus
» analyze free-text Detection Needs to compare with the
indexed corpus or a single document
» ...
8
Document Detection: Search
9
Document Detection: Search (Continued)
• Document Corpus
» the content of the corpus may have a significant effect on
performance in some applications
• Preprocessing of Document Corpus
» stemming
» a list of stop words
» phrases, multi-term items
» ...
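A minimal sketch of the three preprocessing steps listed above. The stop-word list is illustrative rather than a standard one, the suffix stripper is a crude stand-in for a real stemmer (it is not Porter's algorithm), and all function names are hypothetical.

```python
import re

# Illustrative stop list; real systems use much longer lists.
STOP_WORDS = {"a", "an", "and", "by", "in", "of", "or", "the", "to"}
# Crude suffix list standing in for a real stemming algorithm.
SUFFIXES = ("ing", "ed", "es", "s")

def stem(word):
    """Strip the first matching suffix if at least three characters remain."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def preprocess(text):
    """Tokenize, drop stop words, and stem the remaining tokens."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [stem(t) for t in tokens if t not in STOP_WORDS]

def phrases(terms):
    """Form two-term items from adjacent terms (one simple notion of phrase)."""
    return [f"{a} {b}" for a, b in zip(terms, terms[1:])]

terms = preprocess("Scanning the documents and indexing images")
```

The stems produced this way ("scann", "imag") are not words, which is acceptable as long as documents and queries pass through the same function.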
10
Document Detection: Search (Continued)
• Building Index from Stems
» the key place for optimizing run-time performance
» the cost of building the index for a large corpus is a concern
• Document Index
» a list of terms, stems, phrases, etc.
» frequency of terms in the document and corpus
» frequency of the co-occurrence of terms within the corpus
» index may be as large as the original document corpus
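The contents of the document index listed above can be sketched with standard-library containers. This is an assumption-laden illustration, not the layout of any particular system: `build_index` and its three return values are hypothetical names, and co-occurrence is counted here at the document level.

```python
from collections import Counter, defaultdict
from itertools import combinations

def build_index(docs):
    """docs maps a document id to its list of preprocessed terms."""
    postings = defaultdict(dict)   # term -> {doc_id: frequency in that document}
    corpus_freq = Counter()        # term -> frequency in the whole corpus
    cooccurrence = Counter()       # (term_a, term_b) -> documents containing both
    for doc_id, terms in docs.items():
        counts = Counter(terms)
        corpus_freq.update(counts)
        for term, tf in counts.items():
            postings[term][doc_id] = tf
        # Sorting makes each unordered pair appear under one key only.
        for pair in combinations(sorted(counts), 2):
            cooccurrence[pair] += 1
    return postings, corpus_freq, cooccurrence

docs = {
    "d1": ["image", "scanner", "image"],
    "d2": ["text", "retrieval", "scanner"],
}
postings, corpus_freq, cooccurrence = build_index(docs)
```

The co-occurrence table grows with the square of the number of distinct terms per document, which is one way an index can come to rival the corpus in size.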
11
Document Detection: Search (Continued)
• Detection Need
» the user’s criteria for a relevant document
• Convert Detection Need to System-Specific Query
» the Detection Need is first transformed into a detection query,
and then into a retrieval query.
» detection query: specific to the retrieval engine, but
independent of the corpus
» retrieval query: specific to the retrieval engine, and to the
corpus
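The slides do not say how either query is constructed, so the following is only one plausible reading: the detection query normalizes the need into the engine's term vocabulary without consulting the corpus, and the retrieval query then attaches corpus statistics. The function names, the inverse-document-frequency weighting, and the `doc_freq` figures are all assumptions.

```python
import math
import re

def to_detection_query(detection_need):
    """Engine-specific but corpus-independent: normalize the need into terms."""
    return sorted(set(re.findall(r"[a-z0-9]+", detection_need.lower())))

def to_retrieval_query(detection_query, doc_freq, num_docs):
    """Engine- and corpus-specific: keep only terms the corpus contains and
    weight each one by its inverse document frequency."""
    return {
        term: math.log(num_docs / doc_freq[term])
        for term in detection_query
        if doc_freq.get(term, 0) > 0
    }

doc_freq = {"scanner": 2, "optical": 1}   # hypothetical corpus statistics
detection_query = to_detection_query("Optical scanner, OCR")
retrieval_query = to_retrieval_query(detection_query, doc_freq, num_docs=4)
```

Under this reading the same detection query can be reused across corpora, while the retrieval query must be rebuilt whenever the corpus statistics change.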
12
Document Detection: Search (Continued)
• Compare Query with Index
• Resultant Rank-Ordered List of Documents
» Return the top ‘N’ documents
» Rank the list of relevant documents from the most relevant to
the query to the least relevant
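Comparing a query with the index and returning the top N documents can be sketched as follows. The slides do not name a scoring function; summed TF-IDF is used here purely as a familiar stand-in, and `rank` and its parameters are hypothetical.

```python
import math
from collections import Counter

def rank(query_terms, docs, top_n=2):
    """Score each document by summed TF-IDF over the query terms and
    return the top_n (doc_id, score) pairs, most relevant first."""
    num_docs = len(docs)
    counts = {doc_id: Counter(terms) for doc_id, terms in docs.items()}
    # Document frequency: number of documents that contain each term.
    doc_freq = Counter(term for c in counts.values() for term in c)
    scores = {}
    for doc_id, c in counts.items():
        score = sum(
            c[t] * math.log(num_docs / doc_freq[t])
            for t in query_terms
            if t in c
        )
        if score > 0:                      # drop documents with no evidence
            scores[doc_id] = score
    # Sort by descending score; break ties by document id for stable output.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return ordered[:top_n]

docs = {
    "d1": ["image", "scanner", "image"],
    "d2": ["text", "retrieval", "scanner"],
    "d3": ["optical", "disk"],
}
results = rank(["image", "scanner"], docs)
```

Here `d1` ranks first because it contains both query terms, `d2` second because it contains only the more common one, and `d3` is not returned at all.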
13
Routing
14
Routing (Continued)
• Profile of Multiple Detection Needs
» A Profile is a group of individual Detection Needs that
describes a user’s areas of interest.
» All Profiles will be compared to each incoming document (via
the Profile index).
» If a document matches a Profile, the user is notified about the
existence of a relevant document.
15
Routing (Continued)
• Convert Detection Need to System-Specific Query
• Building Index from Queries
» similar to building the corpus index for searching
» the quantity of source data (Profiles) is usually much smaller
than a document corpus
» Profiles may have more specific, structured data in the form
of SGML-tagged fields
16
Routing (Continued)
• Routing Profile Index
» The index will be system specific and will make use of all the
preprocessing techniques employed by a particular detection
system.
• Document to be routed
» A stream of incoming documents is handled one at a time to
determine where each should be directed.
» A routing implementation may handle multiple document
streams and multiple Profiles.
17
Routing (Continued)
• Preprocessing of Document
» A document is preprocessed in the same manner that a
query would be set up in a search
» The document and query roles are reversed compared with
the search process
• Compare Document with Index
» Identify which Profiles are relevant to the document
» Given a document, which of the indexed Profiles match it?
18
Routing (Continued)
• Resultant List of Profiles
» The list of Profiles identifies which users should receive the
document
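The routing steps can be sketched with the roles of document and query reversed: the Profiles are indexed, and each incoming document acts as the query. This is a sketch under stated assumptions. The matching rule (every term of some Detection Need must occur in the document) is deliberately strict and is not taken from the slides, and the names `build_profile_index` and `route` are hypothetical.

```python
from collections import defaultdict

def build_profile_index(profiles):
    """profiles maps a user to a list of Detection Needs, each a set of terms.
    The index maps each term to the (user, need number) pairs that use it."""
    index = defaultdict(set)
    for user, needs in profiles.items():
        for need_no, terms in enumerate(needs):
            for term in terms:
                index[term].add((user, need_no))
    return index

def route(doc_terms, profiles, index):
    """Return the users having at least one Detection Need whose terms
    all occur in the document."""
    doc_terms = set(doc_terms)
    candidates = set()
    for term in doc_terms:                 # the document plays the query role
        candidates |= index.get(term, set())
    return sorted({
        user
        for user, need_no in candidates
        if profiles[user][need_no] <= doc_terms   # subset test
    })

profiles = {
    "alice": [{"image", "scanner"}, {"optical", "disk"}],
    "bob": [{"text", "retrieval"}],
}
index = build_profile_index(profiles)
recipients = route(["new", "image", "scanner", "released"], profiles, index)
```

Because the Profile index is consulted only for terms that occur in the document, Profiles sharing no term with it are never examined.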
19
Summary
• Generate a representation of the meaning or content
of each object based on its description.
• Generate a representation of the meaning of the
information need.
• Compare these two representations to select those
objects that are most likely to match the information
need.
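The three steps of the summary map onto a very small program. The bag-of-words representation and cosine similarity used here are common choices assumed for illustration; the slides do not prescribe them, and `represent` and `compare` are hypothetical names.

```python
import math
import re
from collections import Counter

def represent(description):
    """Steps 1 and 2: represent a text as a bag of lowercase word counts."""
    return Counter(re.findall(r"[a-z0-9]+", description.lower()))

def compare(a, b):
    """Step 3: cosine similarity between two bag-of-words representations."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

objects = {
    "d1": "image scanner for document imaging",
    "d2": "optical disk storage",
}
need = represent("document image scanner")
# Select the object whose representation is closest to the need.
best = max(objects, key=lambda doc_id: compare(represent(objects[doc_id]), need))
```

Using one `represent` function for both objects and needs is what makes the comparison meaningful; search and routing differ only in which side is indexed in advance.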
20
[Figure: Basic Architecture of an Information Retrieval System. Documents are turned into a Document Representation and Queries into a Query Representation; the two representations feed the Comparison step.]
21
Research Issues
• Given a set of descriptions for objects in the collection and a
description of an information need, we must consider:
• Issue 1
» What makes a good document representation?
» How can a representation be generated from a description of
the document?
» What are retrievable units and how are they organized?
22
Research Issues (Continued)
• Issue 2
How can we represent the information need and how can we
acquire this representation?
» from a description of the information need or
» through interaction with the user?
• Issue 3
How can we compare representations to judge the likelihood that a
document matches an information need?
• Issue 4
How can we evaluate the effectiveness of the retrieval process?
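The slide leaves Issue 4 open. As one common answer, assumed here rather than taken from the slides, effectiveness is often measured by precision (the fraction of retrieved documents that are relevant) and recall (the fraction of relevant documents that are retrieved). The function name and the document ids are hypothetical.

```python
def precision_recall(retrieved, relevant):
    """Return (precision, recall) for one query, given the ids the system
    returned and the ids judged relevant by a human assessor."""
    retrieved, relevant = set(retrieved), set(relevant)
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Hypothetical run: three documents returned, two judged relevant overall.
precision, recall = precision_recall(["d1", "d2", "d4"], ["d1", "d3"])
```

Both measures depend on relevance judgments, which is why test collections pair each topic description with a set of judged documents.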
23
Information Extraction
• Generic Information Extraction System
An information extraction system is a cascade of transducers or
modules that at each step add structure and often lose information,
hopefully irrelevant, by applying rules that are acquired manually and/or
automatically.
24
Information Extraction (Continued)
• What are the transducers or modules?
• What are their input and output?
• What structure is added?
• What information is lost?
• What is the form of the rules?
• How are the rules applied?
• How are the rules acquired?
25
Example: Parser
• Transducer: parser
• Input: the sequence of words or lexical items
• Output: a parse tree
• Information added: predicate-argument and
modification relations
• Information lost: none
• Rule form: unification grammars
• Application method: chart parser
• Acquisition method: manual
26
Modules
• Text Zoner
turn a text into a set of text segments
• Preprocessor
turn a text or text segment into a sequence of
sentences, each of which is a sequence of lexical
items, where a lexical item is a word together with its
lexical attributes
• Filter
turn a set of sentences into a smaller set of
sentences by filtering out the irrelevant ones
• Preparser
take a sequence of lexical items and try to identify
various reliably determinable, small-scale structures
27
Modules (Continued)
• Parser
input a sequence of lexical items and perhaps
small-scale structures (phrases) and output a set of parse
tree fragments, possibly complete
• Fragment Combiner
turn a set of parse tree or logical form fragments into
a parse tree or logical form for the whole sentence
• Semantic Interpreter
generate a semantic structure or logical form from a
parse tree or from parse tree fragments
28
Modules (Continued)
• Lexical Disambiguation
turn a semantic structure with general or ambiguous
predicates into a semantic structure with specific,
unambiguous predicates
• Coreference Resolution, or Discourse Processing
turn a tree-like structure into a network-like structure
by identifying different descriptions of the same entity
in different parts of the text
• Template Generator
derive the templates from the semantic structures
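The cascade idea can be sketched by chaining the first three modules as plain functions, each consuming the previous module's output. Everything here is a toy under stated assumptions: the segmentation rule, the single lexical attribute, the keyword-based filter, and all function names are hypothetical and far simpler than real modules.

```python
import re
from functools import reduce

def text_zoner(text):
    """Turn a text into text segments (here: blank-line separated blocks)."""
    return [seg.strip() for seg in text.split("\n\n") if seg.strip()]

def preprocessor(segments):
    """Turn segments into sentences, each a list of lexical items
    (here: a word paired with one toy attribute)."""
    sentences = []
    for seg in segments:
        for sent in re.split(r"(?<=[.!?])\s+", seg):
            words = re.findall(r"[A-Za-z0-9]+", sent)
            if words:
                sentences.append(
                    [(w, "capitalized" if w[0].isupper() else "lower") for w in words]
                )
    return sentences

def make_filter(keywords):
    """Build a filter that keeps only sentences mentioning a keyword."""
    def sentence_filter(sentences):
        return [s for s in sentences if any(w.lower() in keywords for w, _ in s)]
    return sentence_filter

def cascade(modules, text):
    """Apply each module to the output of the previous one."""
    return reduce(lambda data, module: module(data), modules, text)

text = "Acme builds scanners. The weather was fine.\n\nAcme acquired Beta Corp."
kept = cascade([text_zoner, preprocessor, make_filter({"acme"})], text)
```

Each stage adds structure (segments, sentences, attributes) and the filter discards a sentence, which mirrors the description of modules that add structure and lose hopefully irrelevant information.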

intro.ppt

  • 1. 1 Information Retrieval and Extraction 資訊檢索與擷取 Chia-Hui Chang, Assistant Professor Dept. of Computer Science & Information Engineering National Central University, Taiwan
  • 2. 2 Information Retrieval  generic information retrieval system select and return to the user desired documents from a large set of documents in accordance with criteria specified by the user  functions » document search the selection of documents from an existing collection of documents » document routing the dissemination of incoming documents to appropriate users on the basis of user interest profiles
  • 3. 3 Detection Need  Definition a set of criteria specified by the user which describes the kind of information desired. » queries in document search task » profiles in routing task  forms » keywords » keywords with Boolean operators » free text » example documents » ...
  • 4. 4 Example <head> Tipster Topic Description <num> Number: 033 <dom> Domain: Science and Technology <title> Topic: Companies Capable of Producing Document Management <des> Description: Document must identify a company who has the capability to produce document management system by obtaining a turnkey- system or by obtaining and integrating the basic components. <narr> Narrative: To be relevant, the document must identify a turnkey document management system or components which could be integrated to form a document management system and the name of either the company developing the system or the company using the system. These components are: a computer, image scanner or optical character recognition system, and an information retrieval or text management system.
  • 5. 5 Example (Continued) <con> Concepts: 1. document management, document processing, office automation electronic imaging 2. image scanner, optical character recognition (OCR) 3. text management, text retrieval, text database 4. optical disk <fac> Factors: <def> Definitions Document Management-The creation, storage and retrieval of documents containing, text, images, and graphics. Image Scanner-A device that converts a printed image into a video image, without recognizing the actual content of the text or pictures. Optical Disk-A disk that is written and read by light, and are sometimes associated with the storage of digital images because of their high storage capacity.
  • 6. 6 search vs. routing  The search process matches a single Detection Need against the stored corpus to return a subset of documents.  Routing matches a single document against a group of Profiles to determine which users are interested in the document.  Profiles stand long-term expressions of user needs.  Search queries are ad hoc in nature.  A generic detection architecture can be used for both the search and routing.
  • 7. 7 Search  retrieval of desired documents from an existing corpus  Retrospective search is frequently interactive.  Methods » indexing the corpus by keyword, stem and/or phrase » apply statistical and/or learning techniques to better understand the content of the corpus » analyze free text Detection Needs to compare with the indexed corpus or a single document » ...
  • 9. 9 Document Detection: Search(Continued)  Document Corpus » the content of the corpus may have significant the performance in some applications  Preprocessing of Document Corpus » stemming » a list of stop words » phrases, multi-term items » ...
  • 10. 10 Document Detection: Search(Continued)  Building Index from Stems » key place for optimizing run-time performance » cost to build the index for a large corpus  Document Index » a list of terms, stems, phrases, etc. » frequency of terms in the document and corpus » frequency of the co-occurrence of terms within the corpus » index may be as large as the original document corpus
  • 11. 11 Document Detection: Search(Continued)  Detection Need » the user’s criteria for a relevant document  Convert Detection Need to System Specific Query » first transformed into a detection query, and then a retrieval query. » detection query: specific to the retrieval engine, but independent of the corpus » retrieval query: specific to the retrieval engine, and to the corpus
  • 12. 12 Document Detection: Search(Continued)  Compare Query with Index  Resultant Rank Ordered List of Documents » Return the top ‘N’ documents » Rank the list of relevant documents from the most relevant to the query to the least relevant
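The slides do not name a scoring function, so the sketch below assumes a simple tf-idf weighting purely for illustration. The function `rank`, its `top_n` parameter, and the sample corpus are hypothetical.

```python
import math
from collections import Counter

def rank(query_terms, corpus, top_n=2):
    """Score each document by summed tf * idf and return the top_n doc ids."""
    n_docs = len(corpus)
    # Document frequency: in how many documents does each term occur?
    doc_freq = Counter(t for doc_terms in corpus.values() for t in set(doc_terms))
    scores = {}
    for doc_id, doc_terms in corpus.items():
        tf = Counter(doc_terms)
        score = sum(tf[t] * math.log(n_docs / doc_freq[t])
                    for t in query_terms if t in tf)
        if score > 0:  # keep only documents with a positive score
            scores[doc_id] = score
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:top_n]

# Hypothetical sample corpus of preprocessed documents.
corpus = {
    "d1": ["optical", "disk", "storage"],
    "d2": ["optical", "optical", "scanner"],
    "d3": ["earnings", "report"],
}
print(rank(["optical", "scanner"], corpus))
```

Documents that share no term with the query never enter the ranked list, and the cut at `top_n` corresponds to returning the top 'N' documents.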
  • 14. 14 Routing (Continued)  Profile of Multiple Detection Needs » A Profile is a group of individual Detection Needs that describes a user’s areas of interest. » All Profiles will be compared to each incoming document (via the Profile index). » If a document matches a Profile, the user is notified about the existence of a relevant document.
  • 15. 15 Routing (Continued)  Convert Detection Need to System Specific Query  Building Index from Queries » similar to building the corpus index for searching » the quantity of source data (Profiles) is usually much less than a document corpus » Profiles may have more specific, structured data in the form of SGML tagged fields
  • 16. 16 Routing (Continued)  Routing Profile Index » The index will be system specific and will make use of all the preprocessing techniques employed by a particular detection system.  Document to be routed » A stream of incoming documents is handled one at a time to determine where each should be directed. » Routing implementation may handle multiple document streams and multiple Profiles.
  • 17. 17 Routing (Continued)  Preprocessing of Document » A document is preprocessed in the same manner that a query would be set up in a search » The document and query roles are reversed compared with the search process  Compare Document with Index » Identify which Profiles are relevant to the document » Given a document, which of the indexed profiles match it?
  • 18. 18 Routing (Continued)  Resultant List of Profiles » The list of Profiles identifies which users should receive the document
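The routing steps can be sketched as an index built over Profiles rather than over documents. This is a rough illustration, not a description of any particular system: the names `build_profile_index` and `route`, the `min_hits` threshold, and the sample Profiles are all assumptions.

```python
from collections import defaultdict

def build_profile_index(profiles):
    """Map each term to the set of users whose Profile mentions it."""
    index = defaultdict(set)
    for user, needs in profiles.items():
        for need in needs:
            for term in need.lower().split():
                index[term].add(user)
    return index

def route(document, index, min_hits=2):
    """Return users whose Profile shares at least min_hits distinct terms
    with the document (the threshold is an assumed matching rule)."""
    hits = defaultdict(int)
    for term in set(document.lower().split()):
        for user in index.get(term, ()):
            hits[user] += 1
    return sorted(user for user, count in hits.items() if count >= min_hits)

# Hypothetical Profiles: each user has a group of Detection Needs.
profiles = {
    "alice": ["optical disk", "image scanner"],
    "bob": ["text retrieval", "text database"],
}
index = build_profile_index(profiles)

# Incoming documents are handled one at a time.
stream = [
    "vendor ships optical disk jukebox",
    "new text retrieval engine",
    "optical illusions explained",
]
for doc in stream:
    print(doc, "->", route(doc, index))
```

Here the document plays the role a query plays in search: it is preprocessed and looked up in an index, and the result is a list of Profiles (users) instead of a list of documents.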
  • 19. 19 Summary  Generate a representation of the meaning or content of each object based on its description.  Generate a representation of the meaning of the information need.  Compare these two representations to select those objects that are most likely to match the information need.
  • 21. 21 Research Issues  Given a set of descriptions for objects in the collection and a description of an information need, we must consider  Issue 1 » What makes a good document representation? » How can a representation be generated from a description of the document? » What are retrievable units and how are they organized?
  • 22. 22 Research Issues (Continued)  Issue 2 How can we represent the information need and how can we acquire this representation? » from a description of the information need or » through interaction with the user?  Issue 3 How can we compare representations to judge likelihood that a document matches an information need?  Issue 4 How can we evaluate the effectiveness of the retrieval process?
  • 23. 23 Information Extraction  Generic Information Extraction System An information extraction system is a cascade of transducers or modules that at each step add structure and often lose information, hopefully irrelevant, by applying rules that are acquired manually and/or automatically.
  • 24. 24 Information Extraction (Continued)  What are the transducers or modules?  What are their input and output?  What structure is added?  What information is lost?  What is the form of the rules?  How are the rules applied?  How are the rules acquired?
  • 25. 25 Example: Parser  Transducer: parser  Input: the sequence of words or lexical items  Output: a parse tree  Information added: predicate-argument and modification relations  Information lost: none  Rule form: unification grammars  Application method: chart parser  Acquisition method: manually
  • 26. 26 Modules  Text Zoner turn a text into a set of text segments  Preprocessor turn a text or text segment into a sequence of sentences, each of which is a sequence of lexical items, where a lexical item is a word together with its lexical attributes  Filter turn a set of sentences into a smaller set of sentences by filtering out the irrelevant ones  Preparser take a sequence of lexical items and try to identify various reliably determinable, small-scale structures
  • 27. 27 Modules (Continued)  Parser input a sequence of lexical items and perhaps small-scale structures (phrases) and output a set of parse tree fragments, possibly complete  Fragment Combiner turn a set of parse tree or logical form fragments into a parse tree or logical form for the whole sentence  Semantic Interpreter generate a semantic structure or logical form from a parse tree or from parse tree fragments
  • 28. 28 Modules (Continued)  Lexical Disambiguation turn a semantic structure with general or ambiguous predicates into a semantic structure with specific, unambiguous predicates  Coreference Resolution, or Discourse Processing turn a tree-like structure into a network-like structure by identifying different descriptions of the same entity in different parts of the text  Template Generator derive the templates from the semantic structures
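The idea of a cascade of transducers can be sketched with the first three modules only. Everything below is a simplification made for illustration: segments are blank-line-separated paragraphs, lexical items are bare word strings without attributes, and the filter is a keyword test. The function names and the sample text are hypothetical.

```python
import re

def text_zoner(text):
    """Text Zoner: split a text into segments (here, paragraphs)."""
    return [seg.strip() for seg in text.split("\n\n") if seg.strip()]

def preprocessor(segments):
    """Preprocessor: turn segments into sentences, each a list of word tokens."""
    sentences = []
    for seg in segments:
        for sent in re.split(r"(?<=[.!?])\s+", seg):
            tokens = re.findall(r"[A-Za-z]+", sent)
            if tokens:
                sentences.append(tokens)
    return sentences

def sentence_filter(sentences, keywords):
    """Filter: keep only sentences that mention at least one keyword."""
    return [s for s in sentences if keywords & {t.lower() for t in s}]

def run_cascade(text, keywords):
    """Apply the modules in order; each consumes the previous one's output."""
    data = text_zoner(text)
    data = preprocessor(data)
    return sentence_filter(data, keywords)

# Hypothetical sample text.
text = "Acme builds scanners. It was founded in Ohio.\n\nThe weather was mild."
print(run_cascade(text, {"scanners", "founded"}))
```

Each stage adds structure (segments, sentences, tokens) and discards something (paragraph boundaries, punctuation, irrelevant sentences), which mirrors the definition of an extraction system as modules that add structure and lose hopefully irrelevant information.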