(C) 2005, The University of Michigan 1
Information Retrieval
Dragomir R. Radev
University of Michigan
radev@umich.edu
September 19, 2005
(C) 2005, The University of Michigan 2
About the instructor
• Dragomir R. Radev
• Associate Professor, University of Michi...
(C) 2005, The University of Michigan 3
Introduction
(C) 2005, The University of Michigan 4
IR systems
• Google
• Vivísimo
• AskJeeves
• NSIR
• Lemur
• MG
• Nutch
(C) 2005, The University of Michigan 5
Examples of IR systems
• Conventional (library catalog).
Search by keyword, title, ...
(C) 2005, The University of Michigan 6
(C) 2005, The University of Michigan 7
(C) 2005, The University of Michigan 8
Need for IR
• Advent of WWW - more than 8 Billion
documents indexed on Google
• How...
(C) 2005, The University of Michigan 9
Some definitions of Information
Retrieval (IR)
Salton (1989): “Information-retrieva...
(C) 2005, The University of Michigan 10
Sample queries (from Excite)
In what year did baseball become an offical sport?
pl...
(C) 2005, The University of Michigan 11
Mappings and abstractions
Reality Data
Information need Query
From Korfhage’s book
(C) 2005, The University of Michigan 12
Typical IR system
• (Crawling)
• Indexing
• Retrieval
• User interface
(C) 2005, The University of Michigan 13
Key Terms Used in IR
• QUERY: a representation of what the user is looking for -
c...
(C) 2005, The University of Michigan 14
Documents
(C) 2005, The University of Michigan 15
Documents
• Not just printed paper
• collections vs. documents
• data structures: ...
(C) 2005, The University of Michigan 16
Document preprocessing
• Formatting
• Tokenization (Paul’s, Willow Dr., Dr.
Willow...
(C) 2005, The University of Michigan 17
Document representations
• Term-document matrix (m x n)
• term-term matrix (m x m ...
(C) 2005, The University of Michigan 18
Document representations
• Term-document matrix
– Evaluating queries (e.g., (A∧B)∨...
(C) 2005, The University of Michigan 19
IR models
(C) 2005, The University of Michigan 20
Major IR models
• Boolean
• Vector
• Probabilistic
• Language modeling
• Fuzzy ret...
(C) 2005, The University of Michigan 21
Major IR tasks
• Ad-hoc
• Filtering and routing
• Question answering
• Spoken docu...
(C) 2005, The University of Michigan 22
Venn diagrams
x w y z
D1
D2
(C) 2005, The University of Michigan 23
Boolean model
A B
(C) 2005, The University of Michigan 24
restaurants AND (Mideastern OR vegetarian) AND inexpensive
Boolean queries
• What ...
(C) 2005, The University of Michigan 25
Boolean queries
• Weighting (Beethoven AND sonatas)
• precedence
coffee AND croiss...
(C) 2005, The University of Michigan 27
Boolean model
• Partition
• Partial relevance?
• Operators: AND, NOT, OR, parenthe...
(C) 2005, The University of Michigan 28
Exercise
• D1 = “computer information retrieval”
• D2 = “computer retrieval”
• D3 ...
(C) 2005, The University of Michigan 29
Exercise
0
1 Swift
2 Shakespeare
3 Shakespeare Swift
4 Milton
5 Milton Swift
6 Mil...
(C) 2005, The University of Michigan 30
Stop lists
• 250-300 most common words in English
account for 50% or more of a giv...
(C) 2005, The University of Michigan 31
Vector models
Term 1
Term 2
Term 3
Doc 1
Doc 2
Doc 3
(C) 2005, The University of Michigan 32
Vector queries
• Each document is represented as a vector
• non-efficient represen...
(C) 2005, The University of Michigan 33
The matching process
• Document space
• Matching is done between a document and
a ...
(C) 2005, The University of Michigan 34
Miscellaneous similarity measures
• The Cosine measure
σ(D,Q) = Σ(di × qi) / √(Σ(di)² × Σ(qi)²)
(C) 2005, The University of Michigan 35
Exercise
• Compute the cosine measures σ (D1,D2) and
σ (D1,D3) for the documents: ...
(C) 2005, The University of Michigan 36
Evaluation
(C) 2005, The University of Michigan 37
Relevance
• Difficult to change: fuzzy, inconsistent
• Methods: exhaustive, sampli...
(C) 2005, The University of Michigan 38
Contingency table
w x
y z
n2 = w + y
n1 = w + x
N
relevant
not relevant
retrieved ...
(C) 2005, The University of Michigan 39
Precision and Recall
Recall:
Precision:
w
w+y
w+x
w
(C) 2005, The University of Michigan 40
Exercise
Go to Google (www.google.com) and search for documents on
Tolkien’s “Lord...
(C) 2005, The University of Michigan 41
n Doc. no Relevant? Recall Precision
1 588 x 0.2 1.00
2 589 x 0.4 1.00
3 576 0.4 0...
(C) 2005, The University of Michigan 42
P/R graph
[Figure: precision (y axis) vs. recall (x axis), both from 0 to 1]
(C) 2005, The University of Michigan 43
Issues
• Why not use accuracy A=(w+z)/N?
• Average precision
• Average P at given ...
(C) 2005, The University of Michigan 47
Relevance collections
• TREC ad hoc collections, 2-6 GB
• TREC Web collections, 2-...
(C) 2005, The University of Michigan 48
Sample TREC query
<top>
<num> Number: 305
<title> Most Dangerous Vehicles
<desc> D...
(C) 2005, The University of Michigan 49
<DOCNO> LA031689-0177 </DOCNO>
<DOCID> 31701 </DOCID>
<DATE><P>March 16, 1989, Thu...
(C) 2005, The University of Michigan 50
TREC (cont’d)
• http://trec.nist.gov/tracks.html
• http://
trec.nist.gov/presentat...
(C) 2005, The University of Michigan 51
Word distribution models
(C) 2005, The University of Michigan 52
Shakespeare
• Romeo and Juliet:
• And, 667; The, 661; I, 570; To, 515; A, 447; Of,...
(C) 2005, The University of Michigan 53
The BNC (Adam Kilgarriff)
• 1 6187267 the det
• 2 4239632 be v
• 3 3093444 of prep...
(C) 2005, The University of Michigan 54
Stop lists
• 250-300 most common words in English
account for 50% or more of a giv...
(C) 2005, The University of Michigan 55
Zipf’s law
Rank x Frequency ≈ Constant
Rank Term Freq. Z Rank Term Freq. Z
1 the 6...
(C) 2005, The University of Michigan 56
Zipf's law is fairly general!
• Frequency of accesses to web pages
• in particular...
(C) 2005, The University of Michigan 57
Zipf’s law (cont’d)
• Limitations:
– Low and high frequencies
– Lack of convergenc...
(C) 2005, The University of Michigan 60
Indexing
(C) 2005, The University of Michigan 61
Methods
• Manual: e.g., Library of Congress subject
headings, MeSH
• Automatic
(C) 2005, The University of Michigan 62
LOC subject headings
http://www.loc.gov/catdir/cpso/lcco/lcco.html
A -- GENERAL WO...
(C) 2005, The University of Michigan 63
MedicineCLASS R - MEDICINE
Subclass R
R5-920 Medicine (General)
R5-130.5 General w...
(C) 2005, The University of Michigan 64
Finding the most frequent terms in a
document
• Typically stop words: the, and, in...
(C) 2005, The University of Michigan 65
Luhn’s method
[Figure: word frequency plotted against words ranked by frequency]
(C) 2005, The University of Michigan 66
Computing term salience
• Term frequency (TF)
• Document frequency (DF)
• Inverse...
(C) 2005, The University of Michigan 67
Applications of TFIDF
• Cosine similarity
• Indexing
• Clustering
(C) 2005, The University of Michigan 68
Vector-based matching
• The cosine measure
sim(D,C) = Σk(dk · ck · idf(k)) / √(Σk(dk)² · Σk(ck)²)
(C) 2005, The University of Michigan 69
IDF: Inverse document frequency
N: number of documents
dk: number of documents con...
(C) 2005, The University of Michigan 70
Asian and European news
622.941 deng
306.835 china
196.725 beijing
153.608 chinese...
(C) 2005, The University of Michigan 71
Other topics
120.385 shuttle
99.487 space
90.128 telescope
70.224 hubble
59.992 ro...
(C) 2005, The University of Michigan 72
Compression
(C) 2005, The University of Michigan 73
Compression
• Methods
– Fixed length codes
– Huffman coding
– Ziv-Lempel codes
(C) 2005, The University of Michigan 74
Fixed length codes
• Binary representations
– ASCII
– Representational power (2k
s...
(C) 2005, The University of Michigan 75
Variable length codes
• Alphabet:
A .- N -. 0 -----
B -... O --- 1 .----
C -.-. P ...
(C) 2005, The University of Michigan 76
Most frequent letters in English
• Most frequent letters:
– E T A O I N S H R D L ...
(C) 2005, The University of Michigan 77
Useful links about cryptography
• http://world.std.com/~franl/crypto.html
• http:/...
(C) 2005, The University of Michigan 78
Huffman coding
• Developed by David Huffman (1952)
• Average of 5 bits per charact...
(C) 2005, The University of Michigan 79
Symbol Frequency
A 7
B 4
C 10
D 5
E 2
F 11
G 15
H 3
I 7
J 8
(C) 2005, The University of Michigan 80
[Figure: Huffman tree built from the symbol frequencies on the previous slide, with branches labeled 0 and 1 and leaves a–j]
(C) 2005, The University of Michigan 81
Symbol Code
A 0110
B 0010
C 000
D 0011
E 01110
F 010
G 10
H 01111
I 110
J 111
(C) 2005, The University of Michigan 82
Exercise
• Consider the bit string:
01101101111000100110001110100111000
1101011010...
(C) 2005, The University of Michigan 83
Ziv-Lempel coding
• Two types - one is known as LZ77 (used in
GZIP)
• Code: set of...
(C) 2005, The University of Michigan 84
• <0,0,p> p
• <0,0,e> pe
• <0,0,t> pet
• <2,1,r> peter
• <0,0,_> peter_
• <6,1,i> ...
(C) 2005, The University of Michigan 85
Links on text compression
• Data compression:
– http://www.data-compression.info/
...
(C) 2005, The University of Michigan 86
Relevance feedback
and
query expansion
(C) 2005, The University of Michigan 87
Relevance feedback
• Problem: initial query may not be the most
appropriate to sat...
(C) 2005, The University of Michigan 88
Relevance feedback
• Automatic
• Manual
• Method: identifying feedback terms
Q’ = ...
(C) 2005, The University of Michigan 89
Example
• Q = “safety minivans”
• D1 = “car safety minivans tests injury statistic...
(C) 2005, The University of Michigan 90
Pseudo relevance feedback
• Automatic query expansion
– Thesaurus-based expansion ...
(C) 2005, The University of Michigan 106
String matching
(C) 2005, The University of Michigan 107
String matching methods
• Index-based
• Full or approximate
– E.g., theater = the...
(C) 2005, The University of Michigan 108
Index-based matching
• Inverted files
• Position-based inverted files
• Block-bas...
(C) 2005, The University of Michigan 109
Inverted index (trie)
Letters: 60
Text: 11, 19
Words: 33, 40
Made: 50
Many: 28
l
...
(C) 2005, The University of Michigan 110
Sequential searching
• No indexing structure given
• Given: database d and search...
(C) 2005, The University of Michigan 111
Knuth-Morris-Pratt
• Average runtime similar to BF
• Worst case runtime is linear...
(C) 2005, The University of Michigan 112
Knuth-Morris-Pratt (cont’d)
• Example (http://en.wikipedia.org/wiki/Knuth-Morris-...
(C) 2005, The University of Michigan 113
Knuth-Morris-Pratt (cont’d)
ABC ABC ABC ABDAB ABCDABCDABDE
ABCDABD
^
ABC ABC ABC ...
(C) 2005, The University of Michigan 115
Word similarity
• Hamming distance - when words are of the same
length
• Levensht...
(C) 2005, The University of Michigan 116
Levenshtein edit distance
• Examples:
– Theatre-> theater
– Ghaddafi->Qadafi
– Co...
(C) 2005, The University of Michigan 117
Recurrence relation
• Three dependencies
– D(i,0)=i
– D(0,j)=j
– D(i,j)=min[D(i-1...
(C) 2005, The University of Michigan 118
Example
Gusfield 1997
W R I T E R S
0 1 2 3 4 5 6 7
0 0 1 2 3 4 5 6 7
V 1 1
I 2 2...
(C) 2005, The University of Michigan 119
Example (cont’d)
Gusfield 1997
W R I T E R S
0 1 2 3 4 5 6 7
0 0 1 2 3 4 5 6 7
V ...
(C) 2005, The University of Michigan 120
Tracebacks
Gusfield 1997
W R I T E R S
0 1 2 3 4 5 6 7
0 0 1 2 3 4 5 6 7
V 1 1 1 ...
(C) 2005, The University of Michigan 121
Weighted edit distance
• Used to emphasize the relative cost of
different edit op...
(C) 2005, The University of Michigan 122
• Web sites:
– http://www.merriampark.com/ld.htm
– http://odur.let.rug.nl/~kleiwe...
(C) 2005, The University of Michigan 123
Clustering
(C) 2005, The University of Michigan 124
Clustering
• Exclusive/overlapping clusters
• Hierarchical/flat clusters
• The cl...
(C) 2005, The University of Michigan 125
Representations for document
clustering
• Typically: vector-based
– Words: “cat”,...
(C) 2005, The University of Michigan 126
Hierarchical clustering
Dendrograms
http://odur.let.rug.nl/~kleiweg/clustering/cl...
(C) 2005, The University of Michigan 127
Another example
• Kingdom = animal
• Phylum = Chordata
• Subphylum = Vertebrata
•...
(C) 2005, The University of Michigan 128
Clustering using dendrograms
REPEAT
Compute pairwise similarities
Identify closes...
(C) 2005, The University of Michigan 129
Methods
• Single-linkage
– One common pair is sufficient
– disadvantages: long ch...
(C) 2005, The University of Michigan 130
k-means
• Needed: small number k of desired clusters
• hard vs. soft decisions
• ...
(C) 2005, The University of Michigan 131
k-means
1 initialize cluster centroids to arbitrary vectors
2 while further impro...
(C) 2005, The University of Michigan 132
Example
• Cluster the following vectors into two
groups:
– A = <1,6>
– B = <2,2>
...
(C) 2005, The University of Michigan 133
Complexity
• Complexity = O(kn) because at each step, n
documents have to be comp...
(C) 2005, The University of Michigan 136
Human clustering
• Significant disagreement in the number of
clusters, overlap of...
(C) 2005, The University of Michigan 137
Lexical networks
(C) 2005, The University of Michigan 138
Lexical Networks
• Used to represent relationships between
words
• Example: WordN...
(C) 2005, The University of Michigan 139
Lexical matrix
Word Meanings (rows) × Word Forms F1 F2 F3 … Fn (columns):
M1: E1,1  E1,2
M2:       E2,2
…
Mm:                 Em,n
(C) 2005, The University of Michigan 140
Synsets
• Disambiguation
– {board, plank}
– {board, committee}
• Synonyms
– subst...
(C) 2005, The University of Michigan 141
$ ./wn board -hypen
Synonyms/Hypernyms (Ordered by Frequency) of noun board
9 sen...
(C) 2005, The University of Michigan 142
Sense 4
display panel, display board, board
=> display
=> electronic device
=> de...
(C) 2005, The University of Michigan 143
Sense 7
control panel, instrument panel, control board, board, panel
=> electrica...
(C) 2005, The University of Michigan 144
Antonymy
• “x” vs. “not-x”
• “rich” vs. “poor”?
• {rise, ascend} vs. {fall, desce...
(C) 2005, The University of Michigan 145
Other relations
• Meronymy: X is a meronym of Y when
native speakers of English a...
(C) 2005, The University of Michigan 146
Other features of WordNet
• Index of familiarity
• Polysemy
(C) 2005, The University of Michigan 147
board used as a noun is familiar (polysemy count = 9)
bird used as a noun is comm...
(C) 2005, The University of Michigan 148
Compound nouns
advisory board
appeals board
backboard
backgammon board
baseboard
...
(C) 2005, The University of Michigan 149
Overview of senses
1. board -- (a committee having supervisory powers; "the board...
(C) 2005, The University of Michigan 150
Top-level concepts
{act, action, activity}
{animal, fauna}
{artifact}
{attribute,...
(C) 2005, The University of Michigan 151
WordNet and DistSim
wn reason -hypen - hypernyms
wn reason -synsn - synsets
wn re...
(C) 2005, The University of Michigan 152
System comparison
(C) 2005, The University of Michigan 153
Comparing two systems
• Comparing A and B
• One query?
• Average performance?
• N...
(C) 2005, The University of Michigan 154
The sign test
• Example 1:
– A > B (12 times)
– A = B (25 times)
– A < B (3 times...
(C) 2005, The University of Michigan 155
Other tests
• The t test:
– Takes into account the actual performances, not
just ...
(C) 2005, The University of Michigan 156
Techniques for dimensionality
reduction: SVD and LSI
(C) 2005, The University of Michigan 157
Techniques for dimensionality
reduction
• Based on matrix decomposition
(goal: pr...
(C) 2005, The University of Michigan 158
SVD: Singular Value
Decomposition
• A=UΣVT
• This decomposition exists for all ma...
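The decomposition and the rank-k approximations used on the following slides can be reproduced with a few lines of NumPy; the sketch below uses a small made-up term-document matrix (not the Berry and Browne example) just to show the mechanics.

    import numpy as np

    # Toy term-document matrix (rows = terms, columns = documents); the values
    # are illustrative only, not the Berry and Browne matrix used below.
    A = np.array([[1., 0., 1., 0.],
                  [0., 1., 0., 1.],
                  [1., 1., 0., 0.],
                  [0., 0., 1., 1.]])

    # A = U * diag(s) * Vt
    U, s, Vt = np.linalg.svd(A, full_matrices=False)

    # Rank-k approximation: keep only the k largest singular values
    k = 2
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    print(np.round(A_k, 3))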
(C) 2005, The University of Michigan 159
Term matrix normalization
[Worked example: a term-document matrix and its normalized version]
(C) 2005, The University of Michigan 160
Example (Berry and Browne)
• T1: baby
• T2: child
• T3: guide
• T4: health
• T5: ...
(C) 2005, The University of Michigan 161
Document term matrix
[Term-document matrix A for the Berry and Browne example]
(C) 2005, The University of Michigan 162
Decomposition
u =
-0.6976 -0.0945 0.0174 -0.6950 0.0000 0.0153 0.1442 -0.0000 0
-...
(C) 2005, The University of Michigan 163
Decomposition
s =
1.5849 0 0 0 0 0 0
0 1.2721 0 0 0 0 0
0 0 1.1946 0 0 0 0
0 0 0 ...
(C) 2005, The University of Michigan 164
Rank-4 approximation
s4 =
1.5849 0 0 0 0 0 0
0 1.2721 0 0 0 0 0
0 0 1.1946 0 0 0 ...
(C) 2005, The University of Michigan 165
Rank-4 approximation
u*s4*v'
-0.0019 0.5985 -0.0148 0.4552 0.7002 0.0102 0.7002
-...
(C) 2005, The University of Michigan 166
Rank-4 approximation
u*s4
-1.1056 -0.1203 0.0207 -0.5558 0 0 0
-0.4155 0.3748 0.5...
(C) 2005, The University of Michigan 167
Rank-4 approximation
s4*v'
-0.2674 -0.7087 -0.4266 -0.6292 -0.7451 -0.4996 -0.745...
(C) 2005, The University of Michigan 168
Rank-2 approximation
s2 =
1.5849 0 0 0 0 0 0
0 1.2721 0 0 0 0 0
0 0 0 0 0 0 0
0 0...
(C) 2005, The University of Michigan 169
Rank-2 approximation
u*s2*v'
0.1361 0.4673 0.2470 0.3908 0.5563 0.4089 0.5563
0.2...
(C) 2005, The University of Michigan 170
Rank-2 approximation
u*s2
-1.1056 -0.1203 0 0 0 0 0
-0.4155 0.3748 0 0 0 0 0
-0.5...
(C) 2005, The University of Michigan 171
Rank-2 approximation
s2*v'
-0.2674 -0.7087 -0.4266 -0.6292 -0.7451 -0.4996 -0.745...
(C) 2005, The University of Michigan 172
Documents to concepts and terms to
concepts
A(:,1)'*u*s
-0.4238 0.6784 -0.8541 0....
(C) 2005, The University of Michigan 173
Documents to concepts and terms to
concepts
>> A(:,4)'*u*s2
-0.9972 0.6478 0 0 0 ...
(C) 2005, The University of Michigan 174
Cont’d
>> (s2*v'*A(1,:)')'
-1.7523 -0.1530 0 0 0 0 0 0 0
>> (s2*v'*A(2,:)')'
-0.6...
(C) 2005, The University of Michigan 175
Cont’d
>> (s2*v'*A(6,:)')'
-0.4730 0.6078 0 0 0 0 0 0 0
>> (s2*v'*A(7,:)')'
-0.88...
(C) 2005, The University of Michigan 176
Properties
A*A'
1.5471 0.3364 0.5041 0.2025 0.3364 0.2025 0.5041 0.2025 0.2025
0....
(C) 2005, The University of Michigan 177
Latent semantic indexing (LSI)
• Dimensionality reduction = identification of
hid...
(C) 2005, The University of Michigan 178
Useful pointers
• http://lsa.colorado.edu
• http://lsi.research.telcordia.com/
• ...
(C) 2005, The University of Michigan 179
Models of the Web
(C) 2005, The University of Michigan 180
Size
• The Web is the largest repository of data and it
grows exponentially.
– 32...
(C) 2005, The University of Michigan 181
Bow-tie model of the Web
SCC
56 M
OUT
44 M
IN
44 M
Bröder & al. WWW 2000, Dill & ...
(C) 2005, The University of Michigan 182
Power laws
• Web site size (Huberman and Adamic 1999)
• Power-law connectivity (B...
(C) 2005, The University of Michigan 183
Small-world networks
• Diameter = average length of the shortest path
between all...
(C) 2005, The University of Michigan 184
Clustering coefficient
• Cliquishness (c): between the kv (kv – 1)/2 pairs
of nei...
(C) 2005, The University of Michigan 185
Models of the Web
• Random graph: Poisson degree distribution P(k) = ⟨k⟩^k e^(−⟨k⟩) / k!, with ⟨k⟩ = pN
• Power law: P(k) = k^(−α) / ζ(α)
[Figure: example networks A, B and a, b]
• E...
(C) 2005, The University of Michigan 188
Social network analysis for IR
(C) 2005, The University of Michigan 189
Social networks
• Induced by a relation
• Symmetric or not
• Examples:
– Friendsh...
(C) 2005, The University of Michigan 190
Krebs 2004
(C) 2005, The University of Michigan 191
Prestige and centrality
• Degree centrality: how many neighbors each node has.
• ...
(C) 2005, The University of Michigan 192
Graph-based representations
[Figure: an eight-node directed graph and its 8×8 adjacency matrix]
(C) 2005, The University of Michigan 193
Markov chains
• A homogeneous Markov chain is defined by an
initial distribution ...
(C) 2005, The University of Michigan 194
Stationary solutions
• The fundamental Ergodic Theorem for Markov chains [Grimmet...
(C) 2005, The University of Michigan 195
Example
[Figure: the eight-node graph]
This graph E has a second graph E’
(not drawn) superimpos...
(C) 2005, The University of Michigan 196
Eigenvectors
• An eigenvector is an implicit “direction” for a
matrix.
Mv = λv, w...
(C) 2005, The University of Michigan 197
Computing the stationary
distribution
pᵀE = pᵀ, i.e. (I − Eᵀ)p = 0
function PowerStatDist...
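A minimal power-iteration sketch for the equation above, in the spirit of the PowerStatDist function named on the slide (whose own code is not reproduced here); the two-state transition matrix in the example is made up.

    import numpy as np

    def power_stat_dist(E, iters=1000, tol=1e-10):
        # Iterate p <- E^T p until convergence; p is the stationary distribution
        n = E.shape[0]
        p = np.ones(n) / n
        for _ in range(iters):
            p_next = E.T @ p
            p_next /= p_next.sum()      # guard against numerical drift
            if np.abs(p_next - p).max() < tol:
                break
            p = p_next
        return p

    # Illustrative two-state chain (not from the slides)
    E = np.array([[0.9, 0.1],
                  [0.5, 0.5]])
    print(power_stat_dist(E))           # approx. [0.833, 0.167]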
(C) 2005, The University of Michigan 198
1
2
3
4
5
7
6 8
Example
0
0.1
0.2
0.3
0.4
0.5
0.6
0.7
0.8
0.9
1
1 2 3 4 5 6 7 8
P...
(C) 2005, The University of Michigan 199
How Google works
• Crawling
• Anchor text
• Fast query processing
• Pagerank
(C) 2005, The University of Michigan 200
More about PageRank
• Named after Larry Page, founder of Google (and
UM alum)
• R...
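For reference, a minimal sketch of the standard PageRank power iteration with damping; the three-page graph and the damping factor d = 0.85 are illustrative assumptions rather than values taken from the slides.

    import numpy as np

    def pagerank(adj, d=0.85, iters=100):
        # adj[i][j] = 1 if page i links to page j
        n = adj.shape[0]
        out_deg = adj.sum(axis=1)
        # Row-stochastic transition matrix; dangling pages jump uniformly
        P = np.where(out_deg[:, None] > 0,
                     adj / np.maximum(out_deg[:, None], 1.0),
                     1.0 / n)
        r = np.ones(n) / n
        for _ in range(iters):
            r = (1 - d) / n + d * (P.T @ r)
        return r / r.sum()

    # Tiny three-page web: 0 -> 1, 1 -> 2, 2 -> 0 and 2 -> 1
    adj = np.array([[0, 1, 0],
                    [0, 0, 1],
                    [1, 1, 0]], dtype=float)
    print(np.round(pagerank(adj), 3))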
(C) 2005, The University of Michigan 201
HITS
• Query-dependent model (Kleinberg 97)
• Hubs and authorities (e.g., cars, H...
(C) 2005, The University of Michigan 202
The link-content hypothesis
• Topical locality: page is similar (σ) to the page t...
(C) 2005, The University of Michigan 203
Measuring the Web
(C) 2005, The University of Michigan 204
Bharat and Broder 1998
• Based on crawls of HotBot, Altavista,
Excite, and InfoSe...
(C) 2005, The University of Michigan 205
Example (from Bharat&Broder)
A similar approach by Lawrence and Giles yields 320M...
(C) 2005, The University of Michigan 212
Question answering
(C) 2005, The University of Michigan 213
People ask questions
• Excite corpus of 2,477,283 queries (one
day’s worth)
• 8.4...
(C) 2005, The University of Michigan 214
People ask questions
In what year did baseball become an offical sport?
Who is th...
(C) 2005, The University of Michigan 215
(C) 2005, The University of Michigan 216
Question answering
What is the largest city in Northern Afghanistan?
(C) 2005, The University of Michigan 217
Possible approaches
• Map?
• Knowledge base
Find x: city (x) ∧ located (x,”Northe...
(C) 2005, The University of Michigan 218
The TREC Q&A evaluation
• Run by NIST [Voorhees and Tice 2000]
• 2GB of input
• 2...
(C) 2005, The University of Michigan 219
Q: When did Nelson Mandela become president of South Africa?
A: 10 May 1994
Q: Ho...
(C) 2005, The University of Michigan 220
Q: Why did David Koresh ask the FBI for a word processor?
Q: Name the designer of...
(C) 2005, The University of Michigan 221
NSIR
• Current project at U-M
– http://tangra.si.umich.edu/clair/NSIR/html/nsir.c...
(C) 2005, The University of Michigan 222
(C) 2005, The University of Michigan 223
... Afghanistan, Kabul, 2,450 ... Administrative capital and largest city (1997 e...
(C) 2005, The University of Michigan 224
Document retrieval
Query modulation
Sentence retrieval
Answer extraction
Answer r...
(C) 2005, The University of Michigan 225
(C) 2005, The University of Michigan 226
(C) 2005, The University of Michigan 227
Research problems
• Source identification:
– semi-structured vs. text sources
• Q...
(C) 2005, The University of Michigan 228
Document retrieval
• Use existing search engines: Google,
AlltheWeb, NorthernLigh...
(C) 2005, The University of Michigan 229
Sentence weight: W(S) = w1 · Σ_{i=1..N1} tf_i·idf_i + w2 · Σ_{j=1..N2} tf_j + w3 · Σ_{k=1..N3} tf_k
Sentence ...
(C) 2005, The University of Michigan 230
Probabilistic phrase reranking
• Answer extraction: probabilistic phrase
rerankin...
(C) 2005, The University of Michigan 231
Phrase types
PERSON PLACE DATE NUMBER DEFINITION
ORGANIZATION DESCRIPTION ABBREVI...
(C) 2005, The University of Michigan 232
Question Type Identification
• Wh-type not sufficient:
• Who: PERSON 77, DESCRIPT...
(C) 2005, The University of Michigan 233
Ripper performance
-20.69%-TREC8,9,10
30%17.03%TREC10TREC8,9
24%22.4%TREC8TREC9
T...
(C) 2005, The University of Michigan 234
Regex performance
7.6%5.5%4.6%TREC8,9,10
18.2%6%7.4%TREC8,9
18%15%7.8%TREC9
Test ...
(C) 2005, The University of Michigan 235
Phrase ranking
• Phrases are identified by a shallow parser
(ltchunk from Edinbur...
(C) 2005, The University of Michigan 236
Proximity
• Phrasal answers tend to appear near words
from the query
• Average di...
(C) 2005, The University of Michigan 237
Part of speech signature
NO (100%)
NO (86.7%) PERSON (3.8%) NUMBER (3.8%) ORG (2....
(C) 2005, The University of Michigan 238
Query overlap and frequency
• Query overlap:
– What is the capital of Zimbabwe?
–...
(C) 2005, The University of Michigan 239
Reranking
Rank Probability and phrase
1 0.599862 the_DT Space_NNP Flight_NNP Oper...
(C) 2005, The University of Michigan 240
Reranking
Rank Probability and phrase
1 0.465012 Space_NNP Administration_NNP ._....
(C) 2005, The University of Michigan 241
Reranking
Rank Probability and phrase
1 0.478857 Neptune_NNP Beach_NNP ._.
2 0.44...
(C) 2005, The University of Michigan 242
(C) 2005, The University of Michigan 243
(C) 2005, The University of Michigan 244
(C) 2005, The University of Michigan 245
Document level performance
164163149#>0
1.33611.04950.8355Avg
GoogleNLightAlltheW...
(C) 2005, The University of Michigan 246
Sentence level performance
1351371591191211599999148#>0
0.490.542.550.440.482.530...
(C) 2005, The University of Michigan 247
Phrase level performance
0.1990.1570.1170.105Combined
0.06460.0580.0540.038Global...
(C) 2005, The University of Michigan 248
(C) 2005, The University of Michigan 249
(C) 2005, The University of Michigan 250
Text classification
(C) 2005, The University of Michigan 251
Introduction
• Text classification: assigning documents to
predefined categories
...
(C) 2005, The University of Michigan 252
Generative models: knn
• K-nearest neighbors
• Very easy to program
• Issues: cho...
(C) 2005, The University of Michigan 253
Feature selection: The Χ2
test
• For a term t:
• Testing for independence:
P(C=0,...
(C) 2005, The University of Michigan 254
Feature selection: The Χ2
test
• High values of Χ2
indicate lower belief in
indep...
(C) 2005, The University of Michigan 255
Feature selection: mutual
information
• No document length scaling is needed
• Do...
(C) 2005, The University of Michigan 256
Naïve Bayesian classifiers
• Naïve Bayesian classifier
• Assuming statistical ind...
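A minimal multinomial Naïve Bayes sketch under the usual term-independence assumption, with add-one smoothing; the documents, labels, and smoothing choice are illustrative assumptions, not taken from the slides.

    import math
    from collections import Counter, defaultdict

    def train_nb(labeled_docs):
        # Count classes and per-class term frequencies
        class_counts, term_counts, vocab = Counter(), defaultdict(Counter), set()
        for text, label in labeled_docs:
            class_counts[label] += 1
            for term in text.lower().split():
                term_counts[label][term] += 1
                vocab.add(term)
        return class_counts, term_counts, vocab

    def classify_nb(model, text):
        class_counts, term_counts, vocab = model
        n_docs = sum(class_counts.values())
        scores = {}
        for label, c in class_counts.items():
            total = sum(term_counts[label].values())
            score = math.log(c / n_docs)
            for term in text.lower().split():
                # add-one (Laplace) smoothing
                score += math.log((term_counts[label][term] + 1) / (total + len(vocab)))
            scores[label] = score
        return max(scores, key=scores.get)

    model = train_nb([("cheap pills buy now", "spam"),
                      ("meeting agenda attached", "ham"),
                      ("buy cheap watches", "spam"),
                      ("project review meeting", "ham")])
    print(classify_nb(model, "buy now"))   # spam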
(C) 2005, The University of Michigan 257
Spam recognition
Return-Path: <ig_esq@rediffmail.com>
X-Sieve: CMU Sieve 2.2
From...
(C) 2005, The University of Michigan 258
Well-known datasets
• 20 newsgroups
– http://people.csail.mit.edu/u/j/jrennie/pub...
(C) 2005, The University of Michigan 259
Support vector machines
• Introduced by Vapnik in the early 90s.
(C) 2005, The University of Michigan 260
Semi-supervised learning
• EM
• Co-training
• Graph-based
(C) 2005, The University of Michigan 261
Additional topics
• Soft margins
• VC dimension
• Kernel methods
(C) 2005, The University of Michigan 262
• SVMs are widely considered to be the best
method for text classification (look ...
(C) 2005, The University of Michigan 263
Readings
• Books:
• 1. Ricardo Baeza-Yates and Berthier Ribeiro-Neto; Modern Info...
(C) 2005, The University of Michigan 264
Readings
• Erkan and Radev "LexRank: Graph-based Lexical Centrality as Salience i...
(C) 2005, The University of Michigan 265
More readings
• Gerard Salton, Automatic Text Processing, Addison-
Wesley (1989)
...
(C) 2005, The University of Michigan 266
Thank you!
Благодаря!
Slide notes
  • 1.5 million terabytes = 1.5 exabytes = 1.5 E+18 bytes.
    At 2,000 bytes/page = 1 E 15 pages = 1 billion years at 1 page/minute
    1 in every 28 page views on the Web is a search results page (3.5% of all page views)(June 1, 1999, Alexa Insider)
  • Indegree/outdegree plots
    Evolving networks: fundamental object of statistical physics
    Social networks
    Bow tie model (not?)
    My Erdos number is 4
    Growth of the Web (p 130)
    Why is the Web different (MN) – how do users really create the Web
    Fat-tailed distributions, small worlds (W&S, # of triangles), hard to destroy
    Communities (Kleinberg)
    Evolutionary networks, equilibrium/non-equilibrium
    Evaluation metrics: degree distribution, clustering coefficient, diameter (give comparison)
    W/S model is equilibrium (term from statistical mechanics)
    EN have history, memory
    Small world networks (+result)
    D&M 105-106
    Zipf
    What are typical clustering coefficients and diameters
    Random graph – clustering coefficient is much smaller
    Social networks
    Mesoscopic objects
    Phase transitions
    Bipartite networks
    Pagerank
    Google changed the way the Web works
  • Perltree/Lexpagerank/Web Models/Lex/Link
    Predict communities based on a few words
    Realistic attachment model (based on words)
    Dilemma: add links/remove links
    Property-driven attachment
  • Pagerank/HITS
  • Directed, undirected
  • Stochastic, aperiodic, irreducible (strongly connected)
    Unique solutions
  • Power methods (KHMG)
  • Each question can have multiple answers
    1. 1. (C) 2005, The University of Michigan 1 Information Retrieval Dragomir R. Radev University of Michigan radev@umich.edu September 19, 2005
    2. 2. (C) 2005, The University of Michigan 2 About the instructor • Dragomir R. Radev • Associate Professor, University of Michigan – School of Information – Department of Electrical Engineering and Computer Science – Department of Linguistics • Head of CLAIR (Computational Linguistics And Information Retrieval) at U. Michigan • Treasurer, North American Chapter of the ACL • Ph.D., 1998, Computer Science, Columbia University • Email: radev@umich.edu • Home page: http://tangra.si.umich.edu/~radev
    3. 3. (C) 2005, The University of Michigan 3 Introduction
    4. 4. (C) 2005, The University of Michigan 4 IR systems • Google • Vivísimo • AskJeeves • NSIR • Lemur • MG • Nutch
    5. 5. (C) 2005, The University of Michigan 5 Examples of IR systems • Conventional (library catalog). Search by keyword, title, author, etc. • Text-based (Lexis-Nexis, Google, FAST). Search by keywords. Limited search using queries in natural language. • Multimedia (QBIC, WebSeek, SaFe) Search by visual appearance (shapes, colors,… ). • Question answering systems (AskJeeves, NSIR, Answerbus) Search in (restricted) natural language
    6. 6. (C) 2005, The University of Michigan 6
    7. 7. (C) 2005, The University of Michigan 7
    8. 8. (C) 2005, The University of Michigan 8 Need for IR • Advent of WWW - more than 8 Billion documents indexed on Google • How much information? 200TB according to Lyman and Varian 2003. http://www.sims.berkeley.edu/research/projects/how-much-info/ • Search, routing, filtering • User’s information need
    9. 9. (C) 2005, The University of Michigan 9 Some definitions of Information Retrieval (IR) Salton (1989): “Information-retrieval systems process files of records and requests for information, and identify and retrieve from the files certain records in response to the information requests. The retrieval of particular records depends on the similarity between the records and the queries, which in turn is measured by comparing the values of certain attributes to records and information requests.” Kowalski (1997): “An Information Retrieval System is a system that is capable of storage, retrieval, and maintenance of information. Information in this context can be composed of text (including numeric and date data), images, audio, video, and other multi-media objects).”
    10. 10. (C) 2005, The University of Michigan 10 Sample queries (from Excite) In what year did baseball become an offical sport? play station codes . com birth control and depression government "WorkAbility I"+conference kitchen appliances where can I find a chines rosewood tiger electronics 58 Plymouth Fury How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a hero? emeril Lagasse Hubble M.S Subalaksmi running
    11. 11. (C) 2005, The University of Michigan 11 Mappings and abstractions Reality Data Information need Query From Korfhage’s book
    12. 12. (C) 2005, The University of Michigan 12 Typical IR system • (Crawling) • Indexing • Retrieval • User interface
    13. 13. (C) 2005, The University of Michigan 13 Key Terms Used in IR • QUERY: a representation of what the user is looking for - can be a list of words or a phrase. • DOCUMENT: an information entity that the user wants to retrieve • COLLECTION: a set of documents • INDEX: a representation of information that makes querying easier • TERM: word or concept that appears in a document or a query
    14. 14. (C) 2005, The University of Michigan 14 Documents
    15. 15. (C) 2005, The University of Michigan 15 Documents • Not just printed paper • collections vs. documents • data structures: representations • Bag of words method • document surrogates: keywords, summaries • encoding: ASCII, Unicode, etc.
    16. 16. (C) 2005, The University of Michigan 16 Document preprocessing • Formatting • Tokenization (Paul’s, Willow Dr., Dr. Willow, 555-1212, New York, ad hoc) • Casing (cat vs. CAT) • Stemming (computer, computation) • Soundex
    17. 17. (C) 2005, The University of Michigan 17 Document representations • Term-document matrix (m x n) • term-term matrix (m x m x n) • document-document matrix (n x n) • Example: 3,000,000 documents (n) with 50,000 terms (m) • sparse matrices • Boolean vs. integer matrices
    18. 18. (C) 2005, The University of Michigan 18 Document representations • Term-document matrix – Evaluating queries (e.g., (A∧B)∨C) – Storage issues • Inverted files – Storage issues – Evaluating queries – Advantages and disadvantages
    19. 19. (C) 2005, The University of Michigan 19 IR models
    20. 20. (C) 2005, The University of Michigan 20 Major IR models • Boolean • Vector • Probabilistic • Language modeling • Fuzzy retrieval • Latent semantic indexing
    21. 21. (C) 2005, The University of Michigan 21 Major IR tasks • Ad-hoc • Filtering and routing • Question answering • Spoken document retrieval • Multimedia retrieval
    22. 22. (C) 2005, The University of Michigan 22 Venn diagrams x w y z D1 D2
    23. 23. (C) 2005, The University of Michigan 23 Boolean model A B
    24. 24. (C) 2005, The University of Michigan 24 restaurants AND (Mideastern OR vegetarian) AND inexpensive Boolean queries • What types of documents are returned? • Stemming • thesaurus expansion • inclusive vs. exclusive OR • confusing uses of AND and OR dinner AND sports AND symphony 4 OF (Pentium, printer, cache, PC, monitor, computer, personal)
    25. 25. (C) 2005, The University of Michigan 25 Boolean queries • Weighting (Beethoven AND sonatas) • precedence coffee AND croissant OR muffin raincoat AND umbrella OR sunglasses • Use of negation: potential problems • Conjunctive and Disjunctive normal forms • Full CNF and DNF
    26. 26. (C) 2005, The University of Michigan 27 Boolean model • Partition • Partial relevance? • Operators: AND, NOT, OR, parentheses
    27. 27. (C) 2005, The University of Michigan 28 Exercise • D1 = “computer information retrieval” • D2 = “computer retrieval” • D3 = “information” • D4 = “computer information” • Q1 = “information ∧ retrieval” • Q2 = “information ∧ ¬computer”
    28. 28. (C) 2005, The University of Michigan 29 Exercise 0 1 Swift 2 Shakespeare 3 Shakespeare Swift 4 Milton 5 Milton Swift 6 Milton Shakespeare 7 Milton Shakespeare Swift 8 Chaucer 9 Chaucer Swift 10 Chaucer Shakespeare 11 Chaucer Shakespeare Swift 12 Chaucer Milton 13 Chaucer Milton Swift 14 Chaucer Milton Shakespeare 15 Chaucer Milton Shakespeare Swift ((chaucer OR milton) AND (NOT swift)) OR ((NOT chaucer) AND (swift OR shakespeare))
    29. 29. (C) 2005, The University of Michigan 30 Stop lists • 250-300 most common words in English account for 50% or more of a given text. • Example: “the” and “of” represent 10% of tokens. “and”, “to”, “a”, and “in” - another 10%. Next 12 words - another 10%. • Moby Dick Ch.1: 859 unique words (types), 2256 word occurrences (tokens). Top 65 types cover 1132 tokens (> 50%). • Token/type ratio: 2256/859 = 2.63
    30. 30. (C) 2005, The University of Michigan 31 Vector models Term 1 Term 2 Term 3 Doc 1 Doc 2 Doc 3
    31. 31. (C) 2005, The University of Michigan 32 Vector queries • Each document is represented as a vector • non-efficient representations (bit vectors) • dimensional compatibility W1 W2 W3 W4 W5 W6 W7 W8 W9 W10 C1 C2 C3 C4 C5 C6 C7 C8 C9 C10
    32. 32. (C) 2005, The University of Michigan 33 The matching process • Document space • Matching is done between a document and a query (or between two documents) • distance vs. similarity • Euclidean distance, Manhattan distance, Word overlap, Jaccard coefficient, etc.
    33. 33. (C) 2005, The University of Michigan 34 Miscellaneous similarity measures • The Cosine measure: σ(D,Q) = Σ(di × qi) / √(Σ(di)² × Σ(qi)²) = |X ∩ Y| / √(|X| × |Y|) • The Jaccard coefficient: σ(D,Q) = |X ∩ Y| / |X ∪ Y|
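A short sketch of the two measures above on raw term-count vectors and term sets; the example vectors are the ones from the exercise on the next slide.

    import math

    def cosine(d, q):
        # sigma(D,Q) = sum(d_i * q_i) / sqrt(sum(d_i^2) * sum(q_i^2))
        num = sum(di * qi for di, qi in zip(d, q))
        return num / math.sqrt(sum(di * di for di in d) * sum(qi * qi for qi in q))

    def jaccard(x, y):
        # |X intersect Y| / |X union Y| on sets of terms
        x, y = set(x), set(y)
        return len(x & y) / len(x | y)

    print(cosine([1, 3], [100, 300]))   # 1.0 -> D1 and D2 point in the same direction
    print(cosine([1, 3], [3, 1]))       # 0.6
    print(jaccard("car safety tests".split(), "safety tests injury".split()))   # 0.5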
    34. 34. (C) 2005, The University of Michigan 35 Exercise • Compute the cosine measures σ (D1,D2) and σ (D1,D3) for the documents: D1 = <1,3>, D2 = <100,300> and D3 = <3,1> • Compute the corresponding Euclidean distances, Manhattan distances, and Jaccard coefficients.
    35. 35. (C) 2005, The University of Michigan 36 Evaluation
    36. 36. (C) 2005, The University of Michigan 37 Relevance • Difficult to change: fuzzy, inconsistent • Methods: exhaustive, sampling, pooling, search-based
    37. 37. (C) 2005, The University of Michigan 38 Contingency table w x y z n2 = w + y n1 = w + x N relevant not relevant retrieved not retrieved
    38. 38. (C) 2005, The University of Michigan 39 Precision and Recall Recall: Precision: w w+y w+x w
    39. 39. (C) 2005, The University of Michigan 40 Exercise Go to Google (www.google.com) and search for documents on Tolkien’s “Lord of the Rings”. Try different ways of phrasing the query: e.g., Tolkien, “JRR Melville”, +”JRR Tolkien” +”Lord of the Rings”, etc. For each query, compute the precision (P) based on the first 10 documents returned by Google. Note! Before starting the exercise, have a clear idea of what a relevant document for your query should look like. Try different information needs. Later, try different queries.
    40. 40. (C) 2005, The University of Michigan 41 n Doc. no Relevant? Recall Precision 1 588 x 0.2 1.00 2 589 x 0.4 1.00 3 576 0.4 0.67 4 590 x 0.6 0.75 5 986 0.6 0.60 6 592 x 0.8 0.67 7 984 0.8 0.57 8 988 0.8 0.50 9 578 0.8 0.44 10 985 0.8 0.40 11 103 0.8 0.36 12 591 0.8 0.33 13 772 x 1.0 0.38 14 990 1.0 0.36 [From Salton’s book]
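The recall and precision columns above can be recomputed directly from the relevance marks (5 relevant documents, at ranks 1, 2, 4, 6, and 13):

    relevant_flags = [1, 1, 0, 1, 0, 1, 0, 0, 0, 0, 0, 0, 1, 0]   # x marks from the table
    total_relevant = 5

    hits = 0
    for rank, rel in enumerate(relevant_flags, start=1):
        hits += rel
        print(f"{rank:2d}  R={hits / total_relevant:.1f}  P={hits / rank:.2f}")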
    41. 41. (C) 2005, The University of Michigan 42 P/R graph [Figure: precision (y axis) vs. recall (x axis), both from 0 to 1] Interpolated average precision (e.g., 11pt) Interpolation – what is precision at recall=0.5?
    42. 42. (C) 2005, The University of Michigan 43 Issues • Why not use accuracy A=(w+z)/N? • Average precision • Average P at given “document cutoff values” • Report when P=R • F measure: F=(β2 +1)PR/(β2 P+R) • F1 measure: F1 = 2/(1/R+1/P) : harmonic mean of P and R
    43. 43. (C) 2005, The University of Michigan 47 Relevance collections • TREC ad hoc collections, 2-6 GB • TREC Web collections, 2-100GB
    44. 44. (C) 2005, The University of Michigan 48 Sample TREC query <top> <num> Number: 305 <title> Most Dangerous Vehicles <desc> Description: Which are the most crashworthy, and least crashworthy, passenger vehicles? <narr> Narrative: A relevant document will contain information on the crashworthiness of a given vehicle or vehicles that can be used to draw a comparison with other vehicles. The document will have to describe/compare vehicles, not drivers. For instance, it should be expected that vehicles preferred by 16-25 year-olds would be involved in more crashes, because that age group is involved in more crashes. I would view number of fatalities per 100 crashes to be more revealing of a vehicle's crashworthiness than the number of crashes per 100,000 miles, for example. </top> LA031689-0177 FT922-1008 LA090190-0126 LA101190-0218 LA082690-0158 LA112590-0109 FT944-136 LA020590-0119 FT944-5300 LA052190-0048 LA051689-0139 FT944-9371 LA032390-0172 LA042790-0172 LA021790-0136 LA092289-0167 LA111189-0013 LA120189-0179 LA020490-0021 LA122989-0063 LA091389-0119 LA072189-0048 FT944-15615 LA091589-0101 LA021289-0208
    45. 45. (C) 2005, The University of Michigan 49 <DOCNO> LA031689-0177 </DOCNO> <DOCID> 31701 </DOCID> <DATE><P>March 16, 1989, Thursday, Home Edition </P></DATE> <SECTION><P>Business; Part 4; Page 1; Column 5; Financial Desk </P></SECTION> <LENGTH><P>586 words </P></LENGTH> <HEADLINE><P>AGENCY TO LAUNCH STUDY OF FORD BRONCO II AFTER HIGH RATE OF ROLL-OVER ACCIDENTS </P></HEADLINE> <BYLINE><P>By LINDA WILLIAMS, Times Staff Writer </P></BYLINE> <TEXT> <P>The federal government's highway safety watchdog said Wednesday that the Ford Bronco II appears to be involved in more fatal roll-over accidents than other vehicles in its class and that it will seek to determine if the vehicle itself contributes to the accidents. </P> <P>The decision to do an engineering analysis of the Ford Motor Co. utility-sport vehicle grew out of a federal accident study of the Suzuki Samurai, said Tim Hurd, a spokesman for the National Highway Traffic Safety Administration. NHTSA looked at Samurai accidents after Consumer Reports magazine charged that the vehicle had basic design flaws. </P> <P>Several Fatalities </P> <P>However, the accident study showed that the "Ford Bronco II appears to have a higher number of single-vehicle, first event roll-overs, particularly those involving fatalities," Hurd said. The engineering analysis of the Bronco, the second of three levels of investigation conducted by NHTSA, will cover the 1984-1989 Bronco II models, the agency said. </P> <P>According to a Fatal Accident Reporting System study included in the September report on the Samurai, 43 Bronco II single-vehicle roll-overs caused fatalities, or 19 of every 100,000 vehicles. There were eight Samurai fatal roll-overs, or 6 per 100,000; 13 involving the Chevrolet S10 Blazers or GMC Jimmy, or 6 per 100,000, and six fatal Jeep Cherokee roll-overs, for 2.5 per 100,000. After the accident report, NHTSA declined to investigate the Samurai. </P> ... </TEXT> <GRAPHIC><P> Photo, The Ford Bronco II "appears to have a higher number of single-vehicle, first event roll-overs," a federal official said. </P></GRAPHIC> <SUBJECT> <P>TRAFFIC ACCIDENTS; FORD MOTOR CORP; NATIONAL HIGHWAY TRAFFIC SAFETY ADMINISTRATION; VEHICLE INSPECTIONS; RECREATIONAL VEHICLES; SUZUKI MOTOR CO; AUTOMOBILE SAFETY </P> </SUBJECT> </DOC>
    46. 46. (C) 2005, The University of Michigan 50 TREC (cont’d) • http://trec.nist.gov/tracks.html • http:// trec.nist.gov/presentations/presentations.html
    47. 47. (C) 2005, The University of Michigan 51 Word distribution models
    48. 48. (C) 2005, The University of Michigan 52 Shakespeare • Romeo and Juliet: • And, 667; The, 661; I, 570; To, 515; A, 447; Of, 382; My, 356; Is, 343; That, 343; In, 314; You, 289; Thou, 277; Me, 262; Not, 257; With, 234; It, 224; For, 223; This, 215; Be, 207; But, 181; Thy, 167; What, 163; O, 160; As, 156; Her, 150; Will, 147; So, 145; Thee, 139; Love, 135; His, 128; Have, 127; He, 120; Romeo, 115; By, 114; She, 114; Shall, 107; Your, 103; No, 102; Come, 96; Him, 96; All, 92; Do, 89; From, 86; Then, 83; Good, 82; Now, 82; Here, 80; If, 80; An, 78; Go, 76; On, 76; I'll, 71; Death, 69; Night, 68; Are, 67; More, 67; We, 66; At, 65; Man, 65; Or, 65; There, 64; Hath, 63; Which, 60; • … • A-bed, 1; A-bleeding, 1; A-weary, 1; Abate, 1; Abbey, 1; Abhorred, 1; Abhors, 1; Aboard, 1; Abound'st, 1; Abroach, 1; Absolved, 1; Abuse, 1; Abused, 1; Abuses, 1; Accents, 1; Access, 1; Accident, 1; Accidents, 1; According, 1; Accursed, 1; Accustom'd, 1; Ache, 1; Aches, 1; Aching, 1; Acknowledge, 1; Acquaint, 1; Acquaintance, 1; Acted, 1; Acting, 1; Action, 1; Acts, 1; Adam, 1; Add, 1; Added, 1; Adding, 1; Addle, 1; Adjacent, 1; Admired, 1; Ado, 1; Advance, 1; Adversary, 1; Adversity's, 1; Advise, 1; Afeard, 1; Affecting, 1; Afflicted, 1; Affliction, 1; Affords, 1; Affray, 1; Affright, 1; Afire, 1; Agate-stone, 1; Agile, 1; Agree, 1; Agrees, 1; Aim'd, 1; Alderman, 1; All-cheering, 1; All-seeing, 1; Alla, 1; Alliance, 1; Alligator, 1; Allow, 1; Ally, 1; Although, 1; http://www.mta75.org/curriculum/english/Shakes/indexx.html
    49. 49. (C) 2005, The University of Michigan 53 The BNC (Adam Kilgarriff) • 1 6187267 the det • 2 4239632 be v • 3 3093444 of prep • 4 2687863 and conj • 5 2186369 a det • 6 1924315 in prep • 7 1620850 to infinitive-marker • 8 1375636 have v • 9 1090186 it pron • 10 1039323 to prep • 11 887877 for prep • 12 884599 i pron • 13 760399 that conj • 14 695498 you pron • 15 681255 he pron • 16 680739 on prep • 17 675027 with prep • 18 559596 do v • 19 534162 at prep • 20 517171 by prep Kilgarriff, A. Putting Frequencies in the Dictionary. International Journal of Lexicography 10 (2) 1997. Pp 135--155
    50. 50. (C) 2005, The University of Michigan 54 Stop lists • 250-300 most common words in English account for 50% or more of a given text. • Example: “the” and “of” represent 10% of tokens. “and”, “to”, “a”, and “in” - another 10%. Next 12 words - another 10%. • Moby Dick Ch.1: 859 unique words (types), 2256 word occurrences (tokens). Top 65 types cover 1132 tokens (> 50%). • Token/type ratio: 2256/859 = 2.63
    51. 51. (C) 2005, The University of Michigan 55 Zipf’s law Rank x Frequency ≈ Constant Rank Term Freq. Z Rank Term Freq. Z 1 the 69,971 0.070 6 in 21,341 0.128 2 of 36,411 0.073 7 that 10,595 0.074 3 and 28,852 0.086 8 is 10,099 0.081 4 to 26.149 0.104 9 was 9,816 0.088 5 a 23,237 0.116 10 he 9,543 0.095
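A quick check of rank × frequency on the counts above; dividing by an assumed corpus size of about one million tokens reproduces the Z column (the corpus size itself is not stated on the slide).

    counts = [("the", 69971), ("of", 36411), ("and", 28852), ("to", 26149), ("a", 23237),
              ("in", 21341), ("that", 10595), ("is", 10099), ("was", 9816), ("he", 9543)]

    N = 1_000_000   # assumed corpus size
    for rank, (term, freq) in enumerate(counts, start=1):
        print(f"{rank:2d} {term:4s} rank*freq={rank * freq:6d}  Z={rank * freq / N:.3f}")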
    52. 52. (C) 2005, The University of Michigan 56 Zipf's law is fairly general! • Frequency of accesses to web pages • in particular the access counts on the Wikipedia page, with s approximately equal to 0.3 • page access counts on Polish Wikipedia (data for late July 2003) approximately obey Zipf's law with s about 0.5 • Words in the English language • for instance, in Shakespeare’s play Hamlet with s approximately 0.5 • Sizes of settlements • Income distributions amongst individuals • Size of earthquakes • Notes in musical performances http://en.wikipedia.org/wiki/Zipf's_law
    53. 53. (C) 2005, The University of Michigan 57 Zipf’s law (cont’d) • Limitations: – Low and high frequencies – Lack of convergence • Power law with coefficient c = -1 – Y=kxc • Li (1992) – typing words one letter at a time, including spaces
    54. 54. (C) 2005, The University of Michigan 60 Indexing
    55. 55. (C) 2005, The University of Michigan 61 Methods • Manual: e.g., Library of Congress subject headings, MeSH • Automatic
    56. 56. (C) 2005, The University of Michigan 62 LOC subject headings http://www.loc.gov/catdir/cpso/lcco/lcco.html A -- GENERAL WORKS B -- PHILOSOPHY. PSYCHOLOGY. RELIGION C -- AUXILIARY SCIENCES OF HISTORY D -- HISTORY (GENERAL) AND HISTORY OF EUROPE E -- HISTORY: AMERICA F -- HISTORY: AMERICA G -- GEOGRAPHY. ANTHROPOLOGY. RECREATION H -- SOCIAL SCIENCES J -- POLITICAL SCIENCE K -- LAW L -- EDUCATION M -- MUSIC AND BOOKS ON MUSIC N -- FINE ARTS P -- LANGUAGE AND LITERATURE Q -- SCIENCE R -- MEDICINE S -- AGRICULTURE T -- TECHNOLOGY U -- MILITARY SCIENCE V -- NAVAL SCIENCE Z -- BIBLIOGRAPHY. LIBRARY SCIENCE. INFORMATION RESOURCES (GENERAL)
    57. 57. (C) 2005, The University of Michigan 63 MedicineCLASS R - MEDICINE Subclass R R5-920 Medicine (General) R5-130.5 General works R131-687 History of medicine. Medical expeditions R690-697 Medicine as a profession. Physicians R702-703 Medicine and the humanities. Medicine and disease in relation to history, literature, etc. R711-713.97 Directories R722-722.32 Missionary medicine. Medical missionaries R723-726 Medical philosophy. Medical ethics R726.5-726.8 Medicine and disease in relation to psychology. Terminal care. Dying R727-727.5 Medical personnel and the public. Physician and the public R728-733 Practice of medicine. Medical practice economics R735-854 Medical education. Medical schools. Research R855-855.5 Medical technology R856-857 Biomedical engineering. Electronics. Instrumentation R858-859.7 Computer applications to medicine. Medical informatics R864 Medical records R895-920 Medical physics. Medical radiology. Nuclear medicine
    58. 58. (C) 2005, The University of Michigan 64 Finding the most frequent terms in a document • Typically stop words: the, and, in • Not content-bearing • Terms vs. words • Luhn’s method
    59. 59. (C) 2005, The University of Michigan 65 Luhn’s method WORDS FREQUENCY E
    60. 60. (C) 2005, The University of Michigan 66 Computing term salience • Term frequency (TF) • Document frequency (DF) • Inverse document frequency (IDF): IDF(w) = −log( DF(w) / N )
    61. 61. (C) 2005, The University of Michigan 67 Applications of TFIDF • Cosine similarity • Indexing • Clustering
    62. 62. (C) 2005, The University of Michigan 68 Vector-based matching • The cosine measure: sim(D,C) = Σk(dk · ck · idf(k)) / √(Σk(dk)² · Σk(ck)²)
    63. 63. (C) 2005, The University of Michigan 69 IDF: Inverse document frequency N: number of documents dk: number of documents containing term k fik: absolute frequency of term k in document i wik: weight of term k in document i idfk = log2(N/dk) + 1 = log2N - log2dk + 1 TF * IDF is used for automated indexing and for topic discrimination:
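A minimal sketch of TF*IDF weighting using the idf definition from this slide (idf_k = log2(N/d_k) + 1); the three short documents are borrowed from the relevance feedback example on slide 89.

    import math
    from collections import Counter

    def tfidf(documents):
        # w_ik = f_ik * idf_k, with idf_k = log2(N / d_k) + 1
        N = len(documents)
        docs = [d.lower().split() for d in documents]
        df = Counter(term for d in docs for term in set(d))
        return [{t: f * (math.log2(N / df[t]) + 1) for t, f in Counter(d).items()}
                for d in docs]

    for w in tfidf(["car safety minivans tests injury statistics",
                    "liability tests safety",
                    "car passengers injury reviews"]):
        print(w)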
    64. 64. (C) 2005, The University of Michigan 70 Asian and European news 622.941 deng 306.835 china 196.725 beijing 153.608 chinese 152.113 xiaoping 124.591 jiang 108.777 communist 102.894 body 85.173 party 71.898 died 68.820 leader 43.402 state 38.166 people 97.487 nato 92.151 albright 74.652 belgrade 46.657 enlargement 34.778 alliance 34.778 french 33.803 opposition 32.571 russia 14.095 government 9.389 told 9.154 would 8.459 their 6.059 which
    65. 65. (C) 2005, The University of Michigan 71 Other topics 120.385 shuttle 99.487 space 90.128 telescope 70.224 hubble 59.992 rocket 50.160 astronauts 49.722 discovery 47.782 canaveral 47.782 cape 40.889 mission 35.778 florida 27.063 center 74.652 compuserve 65.321 massey 55.989 salizzoni 29.996 bob 27.994 online 27.198 executive 15.890 interim 15.271 chief 11.647 service 11.174 second 6.781 world 6.315 president
    66. 66. (C) 2005, The University of Michigan 72 Compression
    67. 67. (C) 2005, The University of Michigan 73 Compression • Methods – Fixed length codes – Huffman coding – Ziv-Lempel codes
    68. 68. (C) 2005, The University of Michigan 74 Fixed length codes • Binary representations – ASCII – Representational power (2k symbols where k is the number of bits)
    69. 69. (C) 2005, The University of Michigan 75 Variable length codes • Alphabet: A .- N -. 0 ----- B -... O --- 1 .---- C -.-. P .--. 2 ..--- D -.. Q --.- 3 ...— E . R .-. 4 ....- F ..-. S ... 5 ..... G --. T - 6 -.... H .... U ..- 7 --... I .. V ...- 8 ---.. J .--- W .-- 9 ----. K -.- X -..- L .-.. Y -.— M -- Z --.. • Demo: – http://www.babbage.demon.co.uk/morse.html – http://www.scphillips.com/morse/
    70. 70. (C) 2005, The University of Michigan 76 Most frequent letters in English • Most frequent letters: – E T A O I N S H R D L U – http://www.math.cornell.edu/~mec/modules/cryptograp hy/subs/frequencies.html • Demo: – http://www.amstat.org/publications/jse/secure/v7n2/cou nt-char.cfm • Also: bigrams: – TH HE IN ER AN RE ND AT ON NT – http://www.math.cornell.edu/~mec/modules/cryptograp hy/subs/digraphs.html
    71. 71. (C) 2005, The University of Michigan 77 Useful links about cryptography • http://world.std.com/~franl/crypto.html • http://www.faqs.org/faqs/cryptography-faq/ • http://en.wikipedia.org/wiki/Cryptography
    72. 72. (C) 2005, The University of Michigan 78 Huffman coding • Developed by David Huffman (1952) • Average of 5 bits per character (37.5% compression) • Based on frequency distributions of symbols • Algorithm: iteratively build a tree of symbols starting with the two least frequent symbols
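A minimal sketch of the algorithm described above, repeatedly merging the two least frequent nodes; ties can be broken differently than in the worked example, so the exact bit patterns may differ from the code table two slides below.

    import heapq

    def huffman_codes(freqs):
        # Each heap entry: (total frequency, tie-breaker, {symbol: code-so-far})
        heap = [(f, i, {sym: ""}) for i, (sym, f) in enumerate(freqs.items())]
        heapq.heapify(heap)
        count = len(heap)
        while len(heap) > 1:
            f1, _, c1 = heapq.heappop(heap)   # least frequent subtree
            f2, _, c2 = heapq.heappop(heap)   # second least frequent subtree
            merged = {s: "0" + code for s, code in c1.items()}
            merged.update({s: "1" + code for s, code in c2.items()})
            count += 1
            heapq.heappush(heap, (f1 + f2, count, merged))
        return heap[0][2]

    freqs = {"A": 7, "B": 4, "C": 10, "D": 5, "E": 2,
             "F": 11, "G": 15, "H": 3, "I": 7, "J": 8}   # table on the next slide
    print(huffman_codes(freqs))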
    73. 73. (C) 2005, The University of Michigan 79 Symbol Frequency A 7 B 4 C 10 D 5 E 2 F 11 G 15 H 3 I 7 J 8
    74. 74. (C) 2005, The University of Michigan 80 [Figure: Huffman tree over symbols a–j from the previous slide, with branches labeled 0 and 1]
    75. 75. (C) 2005, The University of Michigan 81 Symbol Code A 0110 B 0010 C 000 D 0011 E 01110 F 010 G 10 H 01111 I 110 J 111
    76. 76. (C) 2005, The University of Michigan 82 Exercise • Consider the bit string: 01101101111000100110001110100111000 110101101011101 • Use the Huffman code from the example to decode it. • Try inserting, deleting, and switching some bits at random locations and try decoding.
    77. 77. (C) 2005, The University of Michigan 83 Ziv-Lempel coding • Two types - one is known as LZ77 (used in GZIP) • Code: set of triples <a,b,c> • a: how far back in the decoded text to look for the upcoming text segment • b: how many characters to copy • c: new character to add to complete segment
    78. 78. (C) 2005, The University of Michigan 84 • <0,0,p> p • <0,0,e> pe • <0,0,t> pet • <2,1,r> peter • <0,0,_> peter_ • <6,1,i> peter_pi • <8,2,r> peter_piper • <6,3,c> peter_piper_pic • <0,0,k> peter_piper_pick • <7,1,d> peter_piper_picked • <7,1,a> peter_piper_picked_a • <9,2,e> peter_piper_picked_a_pe • <9,2,_> peter_piper_picked_a_peck_ • <0,0,o> peter_piper_picked_a_peck_o • <0,0,f> peter_piper_picked_a_peck_of • <17,5,l> peter_piper_picked_a_peck_of_pickl • <12,1,d> peter_piper_picked_a_peck_of_pickled • <16,3,p> peter_piper_picked_a_peck_of_pickled_pep • <3,2,r> peter_piper_picked_a_peck_of_pickled_pepper • <0,0,s> peter_piper_picked_a_peck_of_pickled_peppers
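The triples above can be decoded mechanically; a small sketch of a decoder for this <back, length, char> format, replaying the first ten triples of the trace:

    def lz77_decode(triples):
        out = []
        for back, length, char in triples:
            start = len(out) - back          # how far back to start copying
            for i in range(length):          # copy `length` characters
                out.append(out[start + i])
            out.append(char)                 # then append the new character
        return "".join(out)

    triples = [(0, 0, "p"), (0, 0, "e"), (0, 0, "t"), (2, 1, "r"), (0, 0, "_"),
               (6, 1, "i"), (8, 2, "r"), (6, 3, "c"), (0, 0, "k"), (7, 1, "d")]
    print(lz77_decode(triples))              # peter_piper_picked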
    79. 79. (C) 2005, The University of Michigan 85 Links on text compression • Data compression: – http://www.data-compression.info/ • Calgary corpus: – http://links.uwaterloo.ca/calgary.corpus.html • Huffman coding: – http://www.compressconsult.com/huffman/ – http://en.wikipedia.org/wiki/Huffman_coding • LZ – http://en.wikipedia.org/wiki/LZ77
    80. 80. (C) 2005, The University of Michigan 86 Relevance feedback and query expansion
    81. 81. (C) 2005, The University of Michigan 87 Relevance feedback • Problem: initial query may not be the most appropriate to satisfy a given information need. • Idea: modify the original query so that it gets closer to the right documents in the vector space
    82. 82. (C) 2005, The University of Michigan 88 Relevance feedback • Automatic • Manual • Method: identifying feedback terms Q’ = a1Q + a2R - a3N Often a1= 1, a2 = 1/|R| and a3 = 1/|N|
    83. 83. (C) 2005, The University of Michigan 89 Example • Q = “safety minivans” • D1 = “car safety minivans tests injury statistics” - relevant • D2 = “liability tests safety” - relevant • D3 = “car passengers injury reviews” - non- relevant • R = ? • S = ? • Q’ = ?
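A sketch of the feedback step from the previous slide applied to this example, with documents as raw term-count vectors and a1 = 1, a2 = 1/|R|, a3 = 1/|N|:

    from collections import Counter

    def rocchio(query, relevant, nonrelevant, a1=1.0):
        a2 = 1.0 / len(relevant)
        a3 = 1.0 / len(nonrelevant)
        q_new = Counter()
        for t, f in Counter(query.split()).items():
            q_new[t] += a1 * f
        for doc in relevant:                  # add terms from relevant documents
            for t, f in Counter(doc.split()).items():
                q_new[t] += a2 * f
        for doc in nonrelevant:               # subtract terms from non-relevant ones
            for t, f in Counter(doc.split()).items():
                q_new[t] -= a3 * f
        return dict(q_new)

    Q = "safety minivans"
    R = ["car safety minivans tests injury statistics", "liability tests safety"]
    N = ["car passengers injury reviews"]
    print(rocchio(Q, R, N))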
    84. 84. (C) 2005, The University of Michigan 90 Pseudo relevance feedback • Automatic query expansion – Thesaurus-based expansion (e.g., using latent semantic indexing – later…) – Distributional similarity – Query log mining
    85. 85. (C) 2005, The University of Michigan 106 String matching
    86. 86. (C) 2005, The University of Michigan 107 String matching methods • Index-based • Full or approximate – E.g., theater = theatre
    87. 87. (C) 2005, The University of Michigan 108 Index-based matching • Inverted files • Position-based inverted files • Block-based inverted files 1 6 9 11 1719 24 28 33 40 46 50 55 60 This is a text. A text has many words. Words are made from letters. Text: 11, 19 Words: 33, 40 From: 55
    88. 88. (C) 2005, The University of Michigan 109 Inverted index (trie) Letters: 60 Text: 11, 19 Words: 33, 40 Made: 50 Many: 28 l m t w a d n
    89. 89. (C) 2005, The University of Michigan 110 Sequential searching • No indexing structure given • Given: database d and search pattern p. – Example: find “words” in the earlier example • Brute force method – try all possible starting positions – O(n) positions in the database and O(m) characters in the pattern so the total worst-case runtime is O(mn) – Typical runtime is actually O(n) given that mismatches are easy to notice
    90. 90. (C) 2005, The University of Michigan 111 Knuth-Morris-Pratt • Average runtime similar to BF • Worst case runtime is linear: O(n) • Idea: reuse knowledge • Need preprocessing of the pattern
    91. 91. (C) 2005, The University of Michigan 112 Knuth-Morris-Pratt (cont’d) • Example (http://en.wikipedia.org/wiki/Knuth-Morris-Pratt_algorithm) database: ABC ABC ABC ABDAB ABCDABCDABDE pattern: ABCDABD index 0 1 2 3 4 5 6 7 char A B C D A B D – pos -1 0 0 0 0 1 2 0 1234567 ABCDABD ABCDABD
    92. 92. (C) 2005, The University of Michigan 113 Knuth-Morris-Pratt (cont’d) ABC ABC ABC ABDAB ABCDABCDABDE ABCDABD ^ ABC ABC ABC ABDAB ABCDABCDABDE ABCDABD ^ ABC ABC ABC ABDAB ABCDABCDABDE ABCDABD ^ ABC ABC ABC ABDAB ABCDABCDABDE ABCDABD ^ ABC ABC ABC ABDAB ABCDABCDABDE ABCDABD ^
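A compact sketch of the two pieces illustrated above: building the partial-match table (it reproduces the -1 0 0 0 0 1 2 0 row for ABCDABD) and using it to scan the text without backing up.

    def kmp_table(p):
        # t[i] = length of the longest proper prefix of p[:i] that is also a suffix
        t = [-1] * (len(p) + 1)
        k = -1
        for i in range(len(p)):
            while k >= 0 and p[k] != p[i]:
                k = t[k]
            k += 1
            t[i + 1] = k
        return t

    def kmp_search(text, pattern):
        t, k, matches = kmp_table(pattern), 0, []
        for i, ch in enumerate(text):
            while k >= 0 and pattern[k] != ch:
                k = t[k]
            k += 1
            if k == len(pattern):
                matches.append(i - k + 1)    # start position of the match
                k = t[k]
        return matches

    print(kmp_table("ABCDABD"))              # [-1, 0, 0, 0, 0, 1, 2, 0]
    print(kmp_search("ABC ABC ABC ABDAB ABCDABCDABDE", "ABCDABD"))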
    93. 93. (C) 2005, The University of Michigan 115 Word similarity • Hamming distance - when words are of the same length • Levenshtein distance - number of edits (insertions, deletions, replacements) – color --> colour (1) – survey --> surgery (2) – com puter --> computer ? • Longest common subsequence (LCS) – lcs (survey, surgery) = surey
    94. 94. (C) 2005, The University of Michigan 116 Levenshtein edit distance • Examples: – Theatre-> theater – Ghaddafi->Qadafi – Computer->counter • Edit distance (inserts, deletes, substitutions) – Edit transcript • Done through dynamic programming
95. 95. (C) 2005, The University of Michigan 117 Recurrence relation • Three dependencies – D(i,0) = i – D(0,j) = j – D(i,j) = min[D(i-1,j)+1, D(i,j-1)+1, D(i-1,j-1)+t(i,j)] • Simple edit distance: – t(i,j) = 0 if S1(i) = S2(j), and 1 otherwise
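A small Python rendering of this recurrence (a sketch; the tables on the next slides fill in the same values by hand).

    # Dynamic-programming edit distance following the recurrence above.
    def edit_distance(s1, s2):
        m, n = len(s1), len(s2)
        D = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            D[i][0] = i                                  # D(i,0) = i
        for j in range(n + 1):
            D[0][j] = j                                  # D(0,j) = j
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                t = 0 if s1[i - 1] == s2[j - 1] else 1   # t(i,j)
                D[i][j] = min(D[i - 1][j] + 1,           # deletion
                              D[i][j - 1] + 1,           # insertion
                              D[i - 1][j - 1] + t)       # match / substitution
        return D[m][n]

    print(edit_distance("vintner", "writers"))   # 5
    print(edit_distance("color", "colour"))      # 1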
96. 96. (C) 2005, The University of Michigan 118 Example (Gusfield 1997)
Edit distance table for "vintner" vs. "writers", initialized with D(i,0) = i and D(0,j) = j:
          W  R  I  T  E  R  S
       0  1  2  3  4  5  6  7
    0  0  1  2  3  4  5  6  7
 V  1  1
 I  2  2
 N  3  3
 T  4  4
 N  5  5
 E  6  6
 R  7  7
97. 97. (C) 2005, The University of Michigan 119 Example (cont’d) (Gusfield 1997)
          W  R  I  T  E  R  S
       0  1  2  3  4  5  6  7
    0  0  1  2  3  4  5  6  7
 V  1  1  1  2  3  4  5  6  7
 I  2  2  2  2  2  3  4  5  6
 N  3  3  3  3  3  3  4  5  6
 T  4  4  4  4  4  *
 N  5  5
 E  6  6
 R  7  7
98. 98. (C) 2005, The University of Michigan 120 Tracebacks (Gusfield 1997)
(same table as the previous slide; traceback pointers record which of the three dependencies produced each cell, so an optimal edit transcript can be read off)
          W  R  I  T  E  R  S
       0  1  2  3  4  5  6  7
    0  0  1  2  3  4  5  6  7
 V  1  1  1  2  3  4  5  6  7
 I  2  2  2  2  2  3  4  5  6
 N  3  3  3  3  3  3  4  5  6
 T  4  4  4  4  4  *
 N  5  5
 E  6  6
 R  7  7
    99. 99. (C) 2005, The University of Michigan 121 Weighted edit distance • Used to emphasize the relative cost of different edit operations • Useful in bioinformatics – Homology information – BLAST – Blosum – http://eta.embl- heidelberg.de:8000/misc/mat/blosum50.html
    100. 100. (C) 2005, The University of Michigan 122 • Web sites: – http://www.merriampark.com/ld.htm – http://odur.let.rug.nl/~kleiweg/lev/
    101. 101. (C) 2005, The University of Michigan 123 Clustering
    102. 102. (C) 2005, The University of Michigan 124 Clustering • Exclusive/overlapping clusters • Hierarchical/flat clusters • The cluster hypothesis – Documents in the same cluster are relevant to the same query
    103. 103. (C) 2005, The University of Michigan 125 Representations for document clustering • Typically: vector-based – Words: “cat”, “dog”, etc. – Features: document length, author name, etc. • Each document is represented as a vector in an n-dimensional space • Similar documents appear nearby in the vector space (distance measures are needed)
    104. 104. (C) 2005, The University of Michigan 126 Hierarchical clustering Dendrograms http://odur.let.rug.nl/~kleiweg/clustering/clustering.html E.g., language similarity:
105. 105. (C) 2005, The University of Michigan 127 Another example • Kingdom = animal • Phylum = Chordata • Subphylum = Vertebrata • Class = Osteichthyes • Subclass = Actinopterygii • Order = Salmoniformes • Family = Salmonidae • Genus = Oncorhynchus • Species = Oncorhynchus kisutch (Coho salmon)
106. 106. (C) 2005, The University of Michigan 128 Clustering using dendrograms
REPEAT
  Compute pairwise similarities
  Identify closest pair
  Merge pair into single node
UNTIL only one node left
Q: what is the equivalent Venn diagram representation?
Example: cluster the following sentences: A B C B A A D C C A D E C D E F C D A E F G F D A A C D A B A
    107. 107. (C) 2005, The University of Michigan 129 Methods • Single-linkage – One common pair is sufficient – disadvantages: long chains • Complete-linkage – All pairs have to match – Disadvantages: too conservative • Average-linkage • Centroid-based (online) – Look at distances to centroids • Demo: – /clair4/class/ir-w05/clustering
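A minimal single-linkage sketch of the REPEAT/UNTIL loop from the previous slide; the 2-D points are made up for illustration (they happen to be the same six vectors used in the k-means example a few slides later).

    # Single-linkage agglomerative clustering: repeatedly merge the closest pair.
    from itertools import combinations

    def dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

    def single_link(points):
        clusters = [[i] for i in range(len(points))]     # start: one cluster per point
        merges = []
        while len(clusters) > 1:
            # closest pair of clusters = smallest pairwise point distance
            i, j = min(combinations(range(len(clusters)), 2),
                       key=lambda p: min(dist(points[a], points[b])
                                         for a in clusters[p[0]]
                                         for b in clusters[p[1]]))
            merges.append((clusters[i], clusters[j]))
            clusters[i] = clusters[i] + clusters[j]
            del clusters[j]
        return merges                                    # the dendrogram, bottom-up

    pts = [(1, 6), (2, 2), (4, 0), (3, 3), (2, 5), (2, 1)]
    for a, b in single_link(pts):
        print("merge", a, "+", b)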
    108. 108. (C) 2005, The University of Michigan 130 k-means • Needed: small number k of desired clusters • hard vs. soft decisions • Example: Weka
109. 109. (C) 2005, The University of Michigan 131 k-means
initialize cluster centroids to arbitrary vectors
while further improvement is possible do
  for each document d do
    find the cluster c whose centroid is closest to d
    assign d to cluster c
  end for
  for each cluster c do
    recompute the centroid of cluster c based on its documents
  end for
end while
    110. 110. (C) 2005, The University of Michigan 132 Example • Cluster the following vectors into two groups: – A = <1,6> – B = <2,2> – C = <4,0> – D = <3,3> – E = <2,5> – F = <2,1>
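A plain k-means sketch (k = 2) applied to the six example vectors; the initial centroids are an arbitrary choice, so a different initialization may give a different grouping.

    def kmeans(points, centroids, iters=20):
        for _ in range(iters):
            # assignment step: each point joins its closest centroid
            clusters = [[] for _ in centroids]
            for p in points:
                d = [sum((x - c) ** 2 for x, c in zip(p, cen)) for cen in centroids]
                clusters[d.index(min(d))].append(p)
            # update step: recompute each centroid from its members
            centroids = [tuple(sum(x) / len(cl) for x in zip(*cl)) if cl else cen
                         for cl, cen in zip(clusters, centroids)]
        return centroids, clusters

    points = [(1, 6), (2, 2), (4, 0), (3, 3), (2, 5), (2, 1)]   # A..F
    cents, groups = kmeans(points, centroids=[(1, 6), (4, 0)])
    print(cents)    # (1.5, 5.5) and (2.75, 1.5) with this initialization
    print(groups)   # {A, E} and {B, C, D, F}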
111. 111. (C) 2005, The University of Michigan 133 Complexity • Complexity = O(kn) per iteration: in each pass, all n documents have to be compared to all k centroids.
112. 112. (C) 2005, The University of Michigan 136 Human clustering • Significant disagreement in the number of clusters, overlap of clusters, and the composition of clusters (Macskassy et al. 1998).
    113. 113. (C) 2005, The University of Michigan 137 Lexical networks
    114. 114. (C) 2005, The University of Michigan 138 Lexical Networks • Used to represent relationships between words • Example: WordNet - created by George Miller’s team at Princeton • Based on synsets (synonyms, interchangeable words) and lexical matrices
115. 115. (C) 2005, The University of Michigan 139 Lexical matrix
Rows are word meanings, columns are word forms; an entry Ei,j means form Fj can express meaning Mi.
         F1     F2     F3    …    Fn
  M1     E1,1   E1,2
  M2            E2,2
  …
  Mm                               Em,n
    116. 116. (C) 2005, The University of Michigan 140 Synsets • Disambiguation – {board, plank} – {board, committee} • Synonyms – substitution – weak substitution – synonyms must be of the same part of speech
    117. 117. (C) 2005, The University of Michigan 141 $ ./wn board -hypen Synonyms/Hypernyms (Ordered by Frequency) of noun board 9 senses of board Sense 1 board => committee, commission => administrative unit => unit, social unit => organization, organisation => social group => group, grouping Sense 2 board => sheet, flat solid => artifact, artefact => object, physical object => entity, something Sense 3 board, plank => lumber, timber => building material => artifact, artefact => object, physical object => entity, something
    118. 118. (C) 2005, The University of Michigan 142 Sense 4 display panel, display board, board => display => electronic device => device => instrumentality, instrumentation => artifact, artefact => object, physical object => entity, something Sense 5 board, gameboard => surface => artifact, artefact => object, physical object => entity, something Sense 6 board, table => fare => food, nutrient => substance, matter => object, physical object => entity, something
    119. 119. (C) 2005, The University of Michigan 143 Sense 7 control panel, instrument panel, control board, board, panel => electrical device => device => instrumentality, instrumentation => artifact, artefact => object, physical object => entity, something Sense 8 circuit board, circuit card, board, card => printed circuit => computer circuit => circuit, electrical circuit, electric circuit => electrical device => device => instrumentality, instrumentation => artifact, artefact => object, physical object => entity, something Sense 9 dining table, board => table => furniture, piece of furniture, article of furniture => furnishings => instrumentality, instrumentation => artifact, artefact => object, physical object => entity, something
    120. 120. (C) 2005, The University of Michigan 144 Antonymy • “x” vs. “not-x” • “rich” vs. “poor”? • {rise, ascend} vs. {fall, descend}
    121. 121. (C) 2005, The University of Michigan 145 Other relations • Meronymy: X is a meronym of Y when native speakers of English accept sentences similar to “X is a part of Y”, “X is a member of Y”. • Hyponymy: {tree} is a hyponym of {plant}. • Hierarchical structure based on hyponymy (and hypernymy).
    122. 122. (C) 2005, The University of Michigan 146 Other features of WordNet • Index of familiarity • Polysemy
    123. 123. (C) 2005, The University of Michigan 147 board used as a noun is familiar (polysemy count = 9) bird used as a noun is common (polysemy count = 5) cat used as a noun is common (polysemy count = 7) house used as a noun is familiar (polysemy count = 11) information used as a noun is common (polysemy count = 5) retrieval used as a noun is uncommon (polysemy count = 3) serendipity used as a noun is very rare (polysemy count = 1) Familiarity and polysemy
    124. 124. (C) 2005, The University of Michigan 148 Compound nouns advisory board appeals board backboard backgammon board baseboard basketball backboard big board billboard binder's board binder board blackboard board game board measure board meeting board member board of appeals board of directors board of education board of regents board of trustees
    125. 125. (C) 2005, The University of Michigan 149 Overview of senses 1. board -- (a committee having supervisory powers; "the board has seven members") 2. board -- (a flat piece of material designed for a special purpose; "he nailed boards across the windows") 3. board, plank -- (a stout length of sawn timber; made in a wide variety of sizes and used for many purposes) 4. display panel, display board, board -- (a board on which information can be displayed to public view) 5. board, gameboard -- (a flat portable surface (usually rectangular) designed for board games; "he got out the board and set up the pieces") 6. board, table -- (food or meals in general; "she sets a fine table"; "room and board") 7. control panel, instrument panel, control board, board, panel -- (an insulated panel containing switches and dials and meters for controlling electrical devices; "he checked the instrument panel"; "suddenly the board lit up like a Christmas tree") 8. circuit board, circuit card, board, card -- (a printed circuit that can be inserted into expansion slots in a computer to increase the computer's capabilities) 9. dining table, board -- (a table at which meals are served; "he helped her clear the dining table"; "a feast was spread upon the board")
    126. 126. (C) 2005, The University of Michigan 150 Top-level concepts {act, action, activity} {animal, fauna} {artifact} {attribute, property} {body, corpus} {cognition, knowledge} {communication} {event, happening} {feeling, emotion} {food} {group, collection} {location, place} {motive} {natural object} {natural phenomenon} {person, human being} {plant, flora} {possession} {process} {quantity, amount} {relation} {shape} {state, condition} {substance} {time}
127. 127. (C) 2005, The University of Michigan 151 WordNet and DistSim
wn reason -hypen   (hypernyms)
wn reason -synsn   (synsets)
wn reason -simsn   (synonyms)
wn reason -over    (overview of senses)
wn reason -famln   (familiarity/polysemy)
wn reason -grepn   (compound nouns)
/data2/tools/relatedwords/relate reason
    128. 128. (C) 2005, The University of Michigan 152 System comparison
    129. 129. (C) 2005, The University of Michigan 153 Comparing two systems • Comparing A and B • One query? • Average performance? • Need: A to consistently outperform B [this slide: courtesy James Allan]
    130. 130. (C) 2005, The University of Michigan 154 The sign test • Example 1: – A > B (12 times) – A = B (25 times) – A < B (3 times) – p < 0.035 (significant at the 5% level) • Example 2: – A > B (18 times) – A < B (9 times) – p < 0.122 (not significant at the 5% level) [this slide: courtesy James Allan]
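A short Python check of the two examples above, computing the two-sided sign-test p-value directly from the binomial distribution (ties, where A = B, are discarded).

    from math import comb

    def sign_test(a_wins, b_wins):
        n, k = a_wins + b_wins, max(a_wins, b_wins)
        # probability of k or more wins out of n under p = 0.5, doubled for two sides
        p = 2 * sum(comb(n, i) for i in range(k, n + 1)) / 2 ** n
        return min(p, 1.0)

    print(round(sign_test(12, 3), 3))   # 0.035 -> significant at the 5% level
    print(round(sign_test(18, 9), 3))   # 0.122 -> not significant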
    131. 131. (C) 2005, The University of Michigan 155 Other tests • The t test: – Takes into account the actual performances, not just which system is better – http://nimitz.mcs.kent.edu/~blewis/stat/tTest.ht ml • The sign test: – http://www.fon.hum.uva.nl/Service/Statistics/Si gn_Test.html
    132. 132. (C) 2005, The University of Michigan 156 Techniques for dimensionality reduction: SVD and LSI
133. 133. (C) 2005, The University of Michigan 157 Techniques for dimensionality reduction • Based on matrix decomposition (goal: preserve clusters, explain away variance) • A quick review of matrices – Vectors – Matrices – Matrix multiplication (the slide works through multiplying a 3×3 matrix by a column vector)
134. 134. (C) 2005, The University of Michigan 158 SVD: Singular Value Decomposition • A = UΣV^T • This decomposition exists for all matrices, dense or sparse • If A has 3 rows and 5 columns, then U will be 3x3, Σ will be 3x5, and V will be 5x5 • In Matlab, use [U,S,V] = svd (A)
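The same decomposition in NumPy, plus the rank-k truncation used on the following slides; the small matrix here is made up, not the book example.

    import numpy as np

    A = np.array([[1., 0., 1., 0., 1.],     # 3 rows (terms), 5 columns (documents)
                  [0., 1., 1., 0., 0.],
                  [1., 1., 0., 1., 0.]])

    U, s, Vt = np.linalg.svd(A)             # A = U * diag(s) * Vt
    print(U.shape, s.shape, Vt.shape)       # (3, 3) (3,) (5, 5)

    k = 2                                   # keep the two largest singular values
    A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]
    print(np.round(A_k, 2))                 # best rank-2 approximation of A
                                            # (in the least-squares sense)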
135. 135. (C) 2005, The University of Michigan 159 Term matrix normalization
(the slide shows a binary term-document matrix A for documents D1–D5 and its normalized version A(n), in which each column is scaled to unit Euclidean length: a column with two 1s gets entries of 0.71 = 1/√2, three 1s give 0.58 = 1/√3, five 1s give 0.45 = 1/√5)
    136. 136. (C) 2005, The University of Michigan 160 Example (Berry and Browne) • T1: baby • T2: child • T3: guide • T4: health • T5: home • T6: infant • T7: proofing • T8: safety • T9: toddler • D1: infant & toddler first aid • D2: babies & children’s room (for your home) • D3: child safety at home • D4: your baby’s health and safety: from infant to toddler • D5: baby proofing basics • D6: your guide to easy rust proofing • D7: beanie babies collector’s guide
137. 137. (C) 2005, The University of Michigan 161 Document term matrix
A (terms T1–T9 as rows, documents D1–D7 as columns):
               D1  D2  D3  D4  D5  D6  D7
T1 baby         0   1   0   1   1   0   1
T2 child        0   1   1   0   0   0   0
T3 guide        0   0   0   0   0   1   1
T4 health       0   0   0   1   0   0   0
T5 home         0   1   1   0   0   0   0
T6 infant       1   0   0   1   0   0   0
T7 proofing     0   0   0   0   1   1   0
T8 safety       0   0   1   1   0   0   0
T9 toddler      1   0   0   1   0   0   0
A(n), with each column normalized to unit length:
               D1    D2    D3    D4    D5    D6    D7
T1 baby        0     0.58  0     0.45  0.71  0     0.71
T2 child       0     0.58  0.58  0     0     0     0
T3 guide       0     0     0     0     0     0.71  0.71
T4 health      0     0     0     0.45  0     0     0
T5 home        0     0.58  0.58  0     0     0     0
T6 infant      0.71  0     0     0.45  0     0     0
T7 proofing    0     0     0     0     0.71  0.71  0
T8 safety      0     0     0.58  0.45  0     0     0
T9 toddler     0.71  0     0     0.45  0     0     0
    138. 138. (C) 2005, The University of Michigan 162 Decomposition u = -0.6976 -0.0945 0.0174 -0.6950 0.0000 0.0153 0.1442 -0.0000 0 -0.2622 0.2946 0.4693 0.1968 -0.0000 -0.2467 -0.1571 -0.6356 0.3098 -0.3519 -0.4495 -0.1026 0.4014 0.7071 -0.0065 -0.0493 -0.0000 0.0000 -0.1127 0.1416 -0.1478 -0.0734 0.0000 0.4842 -0.8400 0.0000 -0.0000 -0.2622 0.2946 0.4693 0.1968 0.0000 -0.2467 -0.1571 0.6356 -0.3098 -0.1883 0.3756 -0.5035 0.1273 -0.0000 -0.2293 0.0339 -0.3098 -0.6356 -0.3519 -0.4495 -0.1026 0.4014 -0.7071 -0.0065 -0.0493 0.0000 -0.0000 -0.2112 0.3334 0.0962 0.2819 -0.0000 0.7338 0.4659 -0.0000 0.0000 -0.1883 0.3756 -0.5035 0.1273 -0.0000 -0.2293 0.0339 0.3098 0.6356 v = -0.1687 0.4192 -0.5986 0.2261 0 -0.5720 0.2433 -0.4472 0.2255 0.4641 -0.2187 0.0000 -0.4871 -0.4987 -0.2692 0.4206 0.5024 0.4900 -0.0000 0.2450 0.4451 -0.3970 0.4003 -0.3923 -0.1305 0 0.6124 -0.3690 -0.4702 -0.3037 -0.0507 -0.2607 -0.7071 0.0110 0.3407 -0.3153 -0.5018 -0.1220 0.7128 -0.0000 -0.0162 -0.3544 -0.4702 -0.3037 -0.0507 -0.2607 0.7071 0.0110 0.3407
    139. 139. (C) 2005, The University of Michigan 163 Decomposition s = 1.5849 0 0 0 0 0 0 0 1.2721 0 0 0 0 0 0 0 1.1946 0 0 0 0 0 0 0 0.7996 0 0 0 0 0 0 0 0.7100 0 0 0 0 0 0 0 0.5692 0 0 0 0 0 0 0 0.1977 0 0 0 0 0 0 0 0 0 0 0 0 0 0 Spread on the v1 axis
    140. 140. (C) 2005, The University of Michigan 164 Rank-4 approximation s4 = 1.5849 0 0 0 0 0 0 0 1.2721 0 0 0 0 0 0 0 1.1946 0 0 0 0 0 0 0 0.7996 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    141. 141. (C) 2005, The University of Michigan 165 Rank-4 approximation u*s4*v' -0.0019 0.5985 -0.0148 0.4552 0.7002 0.0102 0.7002 -0.0728 0.4961 0.6282 0.0745 0.0121 -0.0133 0.0121 0.0003 -0.0067 0.0052 -0.0013 0.3584 0.7065 0.3584 0.1980 0.0514 0.0064 0.2199 0.0535 -0.0544 0.0535 -0.0728 0.4961 0.6282 0.0745 0.0121 -0.0133 0.0121 0.6337 -0.0602 0.0290 0.5324 -0.0008 0.0003 -0.0008 0.0003 -0.0067 0.0052 -0.0013 0.3584 0.7065 0.3584 0.2165 0.2494 0.4367 0.2282 -0.0360 0.0394 -0.0360 0.6337 -0.0602 0.0290 0.5324 -0.0008 0.0003 -0.0008
    142. 142. (C) 2005, The University of Michigan 166 Rank-4 approximation u*s4 -1.1056 -0.1203 0.0207 -0.5558 0 0 0 -0.4155 0.3748 0.5606 0.1573 0 0 0 -0.5576 -0.5719 -0.1226 0.3210 0 0 0 -0.1786 0.1801 -0.1765 -0.0587 0 0 0 -0.4155 0.3748 0.5606 0.1573 0 0 0 -0.2984 0.4778 -0.6015 0.1018 0 0 0 -0.5576 -0.5719 -0.1226 0.3210 0 0 0 -0.3348 0.4241 0.1149 0.2255 0 0 0 -0.2984 0.4778 -0.6015 0.1018 0 0 0
    143. 143. (C) 2005, The University of Michigan 167 Rank-4 approximation s4*v' -0.2674 -0.7087 -0.4266 -0.6292 -0.7451 -0.4996 -0.7451 0.5333 0.2869 0.5351 0.5092 -0.3863 -0.6384 -0.3863 -0.7150 0.5544 0.6001 -0.4686 -0.0605 -0.1457 -0.0605 0.1808 -0.1749 0.3918 -0.1043 -0.2085 0.5700 -0.2085 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    144. 144. (C) 2005, The University of Michigan 168 Rank-2 approximation s2 = 1.5849 0 0 0 0 0 0 0 1.2721 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    145. 145. (C) 2005, The University of Michigan 169 Rank-2 approximation u*s2*v' 0.1361 0.4673 0.2470 0.3908 0.5563 0.4089 0.5563 0.2272 0.2703 0.2695 0.3150 0.0815 -0.0571 0.0815 -0.1457 0.1204 -0.0904 -0.0075 0.4358 0.4628 0.4358 0.1057 0.1205 0.1239 0.1430 0.0293 -0.0341 0.0293 0.2272 0.2703 0.2695 0.3150 0.0815 -0.0571 0.0815 0.2507 0.2412 0.2813 0.3097 -0.0048 -0.1457 -0.0048 -0.1457 0.1204 -0.0904 -0.0075 0.4358 0.4628 0.4358 0.2343 0.2454 0.2685 0.3027 0.0286 -0.1073 0.0286 0.2507 0.2412 0.2813 0.3097 -0.0048 -0.1457 -0.0048
    146. 146. (C) 2005, The University of Michigan 170 Rank-2 approximation u*s2 -1.1056 -0.1203 0 0 0 0 0 -0.4155 0.3748 0 0 0 0 0 -0.5576 -0.5719 0 0 0 0 0 -0.1786 0.1801 0 0 0 0 0 -0.4155 0.3748 0 0 0 0 0 -0.2984 0.4778 0 0 0 0 0 -0.5576 -0.5719 0 0 0 0 0 -0.3348 0.4241 0 0 0 0 0 -0.2984 0.4778 0 0 0 0 0
    147. 147. (C) 2005, The University of Michigan 171 Rank-2 approximation s2*v' -0.2674 -0.7087 -0.4266 -0.6292 -0.7451 -0.4996 -0.7451 0.5333 0.2869 0.5351 0.5092 -0.3863 -0.6384 -0.3863 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    148. 148. (C) 2005, The University of Michigan 172 Documents to concepts and terms to concepts A(:,1)'*u*s -0.4238 0.6784 -0.8541 0.1446 -0.0000 -0.1853 0.0095 >> A(:,1)'*u*s4 -0.4238 0.6784 -0.8541 0.1446 0 0 0 >> A(:,1)'*u*s2 -0.4238 0.6784 0 0 0 0 0 >> A(:,2)'*u*s2 -1.1233 0.3650 0 0 0 0 0 >> A(:,3)'*u*s2 -0.6762 0.6807 0 0 0 0 0
    149. 149. (C) 2005, The University of Michigan 173 Documents to concepts and terms to concepts >> A(:,4)'*u*s2 -0.9972 0.6478 0 0 0 0 0 >> A(:,5)'*u*s2 -1.1809 -0.4914 0 0 0 0 0 >> A(:,6)'*u*s2 -0.7918 -0.8121 0 0 0 0 0 >> A(:,7)'*u*s2 -1.1809 -0.4914 0 0 0 0 0
    150. 150. (C) 2005, The University of Michigan 174 Cont’d >> (s2*v'*A(1,:)')' -1.7523 -0.1530 0 0 0 0 0 0 0 >> (s2*v'*A(2,:)')' -0.6585 0.4768 0 0 0 0 0 0 0 >> (s2*v'*A(3,:)')' -0.8838 -0.7275 0 0 0 0 0 0 0 >> (s2*v'*A(4,:)')' -0.2831 0.2291 0 0 0 0 0 0 0 >> (s2*v'*A(5,:)')' -0.6585 0.4768 0 0 0 0 0 0 0
    151. 151. (C) 2005, The University of Michigan 175 Cont’d >> (s2*v'*A(6,:)')' -0.4730 0.6078 0 0 0 0 0 0 0 >> (s2*v'*A(7,:)')' -0.8838 -0.7275 0 0 0 0 0 0 0 >> (s2*v'*A(8,:)')' -0.5306 0.5395 0 0 0 0 0 0 0 >> (s2*v'*A(9,:)')‘ -0.4730 0.6078 0 0 0 0 0 0 0
    152. 152. (C) 2005, The University of Michigan 176 Properties A*A' 1.5471 0.3364 0.5041 0.2025 0.3364 0.2025 0.5041 0.2025 0.2025 0.3364 0.6728 0 0 0.6728 0 0 0.3364 0 0.5041 0 1.0082 0 0 0 0.5041 0 0 0.2025 0 0 0.2025 0 0.2025 0 0.2025 0.2025 0.3364 0.6728 0 0 0.6728 0 0 0.3364 0 0.2025 0 0 0.2025 0 0.7066 0 0.2025 0.7066 0.5041 0 0.5041 0 0 0 1.0082 0 0 0.2025 0.3364 0 0.2025 0.3364 0.2025 0 0.5389 0.2025 0.2025 0 0 0.2025 0 0.7066 0 0.2025 0.7066 A'*A 1.0082 0 0 0.6390 0 0 0 0 1.0092 0.6728 0.2610 0.4118 0 0.4118 0 0.6728 1.0092 0.2610 0 0 0 0.6390 0.2610 0.2610 1.0125 0.3195 0 0.3195 0 0.4118 0 0.3195 1.0082 0.5041 0.5041 0 0 0 0 0.5041 1.0082 0.5041 0 0.4118 0 0.3195 0.5041 0.5041 1.0082 A is a document to term matrix. What is A*A’, what is A’*A?
    153. 153. (C) 2005, The University of Michigan 177 Latent semantic indexing (LSI) • Dimensionality reduction = identification of hidden (latent) concepts • Query matching in latent space
    154. 154. (C) 2005, The University of Michigan 178 Useful pointers • http://lsa.colorado.edu • http://lsi.research.telcordia.com/ • http://www.cs.utk.edu/~lsi/ • http://javelina.cet.middlebury.edu/lsa/out/ls a_definition.htm • http://citeseer.nj.nec.com/deerwester90inde xing.html • http://www.pcug.org.au/~jdowling/
    155. 155. (C) 2005, The University of Michigan 179 Models of the Web
    156. 156. (C) 2005, The University of Michigan 180 Size • The Web is the largest repository of data and it grows exponentially. – 320 Million Web pages [Lawrence & Giles 1998] – 800 Million Web pages, 15 TB [Lawrence & Giles 1999] – 8 Billion Web pages indexed [Google 2005] • Amount of data – roughly 200 TB [Lyman et al. 2003]
157. 157. (C) 2005, The University of Michigan 181 Bow-tie model of the Web (Broder et al. WWW 2000, Dill et al. VLDB 2001)
• SCC (strongly connected core): 56 M pages
• IN: 44 M
• OUT: 44 M
• TENDRILS: 44 M
• DISCONNECTED: 17 M
• About 24% of pages are reachable from a given page
    158. 158. (C) 2005, The University of Michigan 182 Power laws • Web site size (Huberman and Adamic 1999) • Power-law connectivity (Barabasi and Albert 1999): exponents 2.45 for out-degree and 2.1 for the in-degree • Others: call graphs among telephone carriers, citation networks (Redner 1998), e.g., Erdos, collaboration graph of actors, metabolic pathways (Jeong et al. 2000), protein networks (Maslov and Sneppen 2002). All values of gamma are around 2-3.
159. 159. (C) 2005, The University of Michigan 183 Small-world networks • Diameter = average length of the shortest path between all pairs of nodes. Example… • Milgram experiment (1967) – Kansas/Omaha --> Boston (42/160 letters) – diameter = 6 • Albert et al. 1999 – average distance between two vertices is d = 0.35 + 2.06 log10 n. For n = 10^9, d = 18.89. • Six degrees of separation
160. 160. (C) 2005, The University of Michigan 184 Clustering coefficient
• Cliquishness (C): the fraction of the kv(kv – 1)/2 possible pairs of a node’s neighbors that are actually connected to each other.
• Examples:
  Network      n       k     d     drand  C     Crand
  Actors       225226  61    3.65  2.99   0.79  0.00027
  Power grid   4941    2.67  18.7  12.4   0.08  0.005
  C. elegans   282     14    2.65  2.25   0.28  0.05
161. 161. (C) 2005, The University of Michigan 185 Models of the Web
• Erdös/Rényi 59, 60: random graph with a Poisson degree distribution, P(k) = e^{-<k>} <k>^k / k!, where <k> = pN
• Barabási/Albert 99: preferential attachment, power-law degree distribution, P(k) = k^{-α} / ζ(α)
• Watts/Strogatz 98 • Kleinberg 98 • Menczer 02 • Radev 03
• Evolving networks: fundamental object of statistical physics, social networks, mathematical biology, and epidemiology
    162. 162. (C) 2005, The University of Michigan 188 Social network analysis for IR
    163. 163. (C) 2005, The University of Michigan 189 Social networks • Induced by a relation • Symmetric or not • Examples: – Friendship networks – Board membership – Citations – Power grid of the US – WWW
    164. 164. (C) 2005, The University of Michigan 190 Krebs 2004
    165. 165. (C) 2005, The University of Michigan 191 Prestige and centrality • Degree centrality: how many neighbors each node has. • Closeness centrality: how close a node is to all of the other nodes • Betweenness centrality: based on the role that a node plays by virtue of being on the path between two other nodes • Eigenvector centrality: the paths in the random walk are weighted by the centrality of the nodes that the path connects. • Prestige = same as centrality but for directed graphs.
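A quick illustration of the four centrality notions on a tiny made-up friendship graph, using the networkx library (an assumption here; networkx is not part of the course materials).

    import networkx as nx

    G = nx.Graph()
    G.add_edges_from([(1, 2), (1, 3), (2, 3), (3, 4), (4, 5), (5, 6)])

    print(nx.degree_centrality(G))        # how many neighbors each node has (normalized)
    print(nx.closeness_centrality(G))     # inverse average distance to all other nodes
    print(nx.betweenness_centrality(G))   # share of shortest paths passing through a node
    print(nx.eigenvector_centrality(G))   # importance weighted by neighbors' importance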
166. 166. (C) 2005, The University of Michigan 192 Graph-based representations
(the slide shows a directed graph G(V,E) on nodes 1–8 and its 8×8 square connectivity (incidence) matrix, with a 1 in cell (i,j) for each edge from node i to node j)
    167. 167. (C) 2005, The University of Michigan 193 Markov chains • A homogeneous Markov chain is defined by an initial distribution x and a Markov kernel E. • Path = sequence (x0, x1, …, xn). Xi = xi-1*E • The probability of a path can be computed as a product of probabilities for each step i. • Random walk = find Xj given x0, E, and j.
    168. 168. (C) 2005, The University of Michigan 194 Stationary solutions • The fundamental Ergodic Theorem for Markov chains [Grimmett and Stirzaker 1989] says that the Markov chain with kernel E has a stationary distribution p under three conditions: – E is stochastic – E is irreducible – E is aperiodic • To make these conditions true: – All rows of E add up to 1 (and no value is negative) – Make sure that E is strongly connected – Make sure that E is not bipartite • Example: PageRank [Brin and Page 1998]: use “teleportation”
169. 169. (C) 2005, The University of Michigan 195 Example
This graph E (on nodes 1–8) has a second graph E’ (not drawn) superimposed on it: E’ is the uniform transition graph.
(bar charts on the slide show the PageRank values of nodes 1–8 at t=0 and t=1)
    170. 170. (C) 2005, The University of Michigan 196 Eigenvectors • An eigenvector is an implicit “direction” for a matrix. Mv = λv, where v is non-zero, though λ can be any complex number in principle. • The largest eigenvalue of a stochastic matrix E is real: λ1 = 1. • For λ1, the left (principal) eigenvector is p, the right eigenvector = 1 • In other words, ET p = p.
171. 171. (C) 2005, The University of Michigan 197 Computing the stationary distribution
Solution for the stationary distribution: E^T p = p, i.e., (I - E^T) p = 0
function PowerStatDist (E):
begin
  p(0) = u;  (or p(0) = [1,0,…,0])
  i = 1;
  repeat
    p(i) = E^T p(i-1)
    L = ||p(i) - p(i-1)||_1;
    i = i + 1;
  until L < ε
  return p(i)
end
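A NumPy sketch of the same power iteration; the 3-state kernel below is made up, but it is stochastic, irreducible, and aperiodic, so the loop converges to the stationary distribution.

    import numpy as np

    def power_stat_dist(E, eps=1e-10):
        n = E.shape[0]
        p = np.full(n, 1.0 / n)                    # p(0) = uniform vector u
        while True:
            p_next = E.T @ p                       # p(i) = E^T p(i-1)
            if np.abs(p_next - p).sum() < eps:     # L1 change below epsilon
                return p_next
            p = p_next

    E = np.array([[0.1, 0.6, 0.3],                 # rows sum to 1
                  [0.4, 0.2, 0.4],
                  [0.5, 0.3, 0.2]])
    p = power_stat_dist(E)
    print(np.round(p, 4))
    print(np.round(E.T @ p - p, 8))                # ~0: E^T p = p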
172. 172. (C) 2005, The University of Michigan 198 Example
(bar charts on the slide show the PageRank values of nodes 1–8 at t=0, t=1, and t=10, illustrating convergence to the stationary distribution)
    173. 173. (C) 2005, The University of Michigan 199 How Google works • Crawling • Anchor text • Fast query processing • Pagerank
174. 174. (C) 2005, The University of Michigan 200 More about PageRank • Named after Larry Page, founder of Google (and UM alum) • Reading: “The anatomy of a large-scale hypertextual web search engine” by Brin and Page. • Independent of the query (although more recent work by Haveliwala (WWW 2002) has introduced topic-sensitive PageRank).
175. 175. (C) 2005, The University of Michigan 201 HITS • Query-dependent model (Kleinberg 97) • Hubs and authorities (e.g., cars, Honda) • Algorithm – obtain root set using input query – expand the root set by radius one – run iterations on the hub and authority scores together: a' = E^T h, h' = E a – report top-ranking authorities and hubs
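A compact sketch of the HITS iteration on a made-up 4-page adjacency matrix; normalizing after each step keeps the scores bounded, and they converge to the principal eigenvectors of E^T E and E E^T.

    import numpy as np

    E = np.array([[0, 1, 1, 0],        # E[i, j] = 1 if page i links to page j
                  [0, 0, 1, 0],
                  [1, 0, 0, 1],
                  [0, 0, 1, 0]], dtype=float)

    h = np.ones(4)
    a = np.ones(4)
    for _ in range(50):
        a = E.T @ h                    # a' = E^T h: authorities pointed to by good hubs
        h = E @ a                      # h' = E a: hubs pointing to good authorities
        a /= np.linalg.norm(a)
        h /= np.linalg.norm(h)

    print(np.round(a, 3))              # authority scores
    print(np.round(h, 3))              # hub scores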
176. 176. (C) 2005, The University of Michigan 202 The link-content hypothesis
• Topical locality: a page is similar (σ) to pages within a small link distance (δ) of it.
• Davison (TF*IDF, 100K pages) – 0.31 same domain – 0.23 linked pages – 0.19 sibling – 0.02 random
• Menczer (373K pages, non-linear least squares fit): σ(δ) ≈ σ∞ + (1 - σ∞) e^{-α1 δ^α2}, with σ∞ = 0.03, α1 = 1.8, α2 = 0.6
• Chakrabarti (focused crawling) - prob. of losing the topic
Van Rijsbergen 1979, Chakrabarti & al. WWW 1999, Davison SIGIR 2000, Menczer 2001
    177. 177. (C) 2005, The University of Michigan 203 Measuring the Web
    178. 178. (C) 2005, The University of Michigan 204 Bharat and Broder 1998 • Based on crawls of HotBot, Altavista, Excite, and InfoSeek • 10,000 queries in mid and late 1997 • Estimate is 200M pages • Only 1.4% are indexed by all of them
    179. 179. (C) 2005, The University of Michigan 205 Example (from Bharat&Broder) A similar approach by Lawrence and Giles yields 320M pages (Lawrence and Giles 1998).
    180. 180. (C) 2005, The University of Michigan 212 Question answering
    181. 181. (C) 2005, The University of Michigan 213 People ask questions • Excite corpus of 2,477,283 queries (one day’s worth) • 8.4% of them are questions – 43.9% factual (what is the country code for Belgium) – 56.1% procedural (how do I set up TCP/IP) or other • In other words, 100 K questions per day
    182. 182. (C) 2005, The University of Michigan 214 People ask questions In what year did baseball become an offical sport? Who is the largest man in the world? Where can i get information on Raphael? where can i find information on puritan religion? Where can I find how much my house is worth? how do i get out of debt? Where can I found out how to pass a drug test? When is the Super Bowl? who is California's District State Senator? where can I buy extra nibs for a foutain pen? how do i set up tcp/ip ? what time is it in west samoa? Where can I buy a little kitty cat? what are the symptoms of attention deficit disorder? Where can I get some information on Michael Jordan? How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a hero? When did the Neanderthal man live? Which Frenchman declined the Nobel Prize for Literature for ideological reasons? What is the largest city in Northern Afghanistan? How does the character Seyavash in Ferdowsi's Shahnameh exhibit characteristics of a hero? When did the Neanderthal man live?
    183. 183. (C) 2005, The University of Michigan 215
    184. 184. (C) 2005, The University of Michigan 216 Question answering What is the largest city in Northern Afghanistan?
185. 185. (C) 2005, The University of Michigan 217 Possible approaches • Map? • Knowledge base: Find x: city(x) ∧ located(x, ”Northern Afghanistan”) ∧ ¬exists(y): city(y) ∧ located(y, ”Northern Afghanistan”) ∧ greaterthan(population(y), population(x)) • Database? • World factbook? • Search engine?
    186. 186. (C) 2005, The University of Michigan 218 The TREC Q&A evaluation • Run by NIST [Voorhees and Tice 2000] • 2GB of input • 200 questions • Essentially fact extraction – Who was Lincoln’s secretary of state? – What does the Peugeot company manufacture? • Questions are based on text • Answers are assumed to be present • No inference needed
    187. 187. (C) 2005, The University of Michigan 219 Q: When did Nelson Mandela become president of South Africa? A: 10 May 1994 Q: How tall is the Matterhorn? A: The institute revised the Matterhorn 's height to 14,776 feet 9 inches Q: How tall is the replica of the Matterhorn at Disneyland? A: In fact he has climbed the 147-foot Matterhorn at Disneyland every week end for the last 3 1/2 years Q: If Iraq attacks a neighboring country, what should the US do? A: ?? Question answering
    188. 188. (C) 2005, The University of Michigan 220 Q: Why did David Koresh ask the FBI for a word processor? Q: Name the designer of the shoe that spawned millions of plastic imitations, known as "jellies". Q: What is the brightest star visible from Earth? Q: What are the Valdez Principles? Q: Name a film that has won the Golden Bear in the Berlin Film Festival? Q: Name a country that is developing a magnetic levitation railway system? Q: Name the first private citizen to fly in space. Q: What did Shostakovich write for Rostropovich? Q: What is the term for the sum of all genetic material in a given organism? Q: What is considered the costliest disaster the insurance industry has ever faced? Q: What is Head Start? Q: What was Agent Orange used for during the Vietnam War? Q: What did John Hinckley do to impress Jodie Foster? Q: What was the first Gilbert and Sullivan opera? Q: What did Richard Feynman say upon hearing he would receive the Nobel Prize in Physics? Q: How did Socrates die? Q: Why are electric cars less efficient in the north-east than in California?
    189. 189. (C) 2005, The University of Michigan 221 NSIR • Current project at U-M – http://tangra.si.umich.edu/clair/NSIR/html/nsir.cgi • Reading: – [Radev et al., 2005a] • Dragomir R. Radev, Weiguo Fan, Hong Qi, Harris Wu, and Amardeep Grewal. Probabilistic question answering on the web. Journal of the American Society for Information Science and Technology 56(3), March 2005 • http://tangra.si.umich.edu/~radev/bib2html/radev-bib.html
    190. 190. (C) 2005, The University of Michigan 222
    191. 191. (C) 2005, The University of Michigan 223 ... Afghanistan, Kabul, 2,450 ... Administrative capital and largest city (1997 est ... Undetermined. Panama, Panama City, 450,668. ... of the Gauteng, Northern Province, Mpumalanga ... www.infoplease.com/cgi-bin/id/A0855603 ... died in Kano, northern Nigeria's largest city, during two days of anti-American riots led by Muslims protesting the US-led bombing of Afghanistan, according to ... www.washingtonpost.com/wp-dyn/print/world/ ... air strikes on the city. ... the Taliban militia in northern Afghanistan in a significant blow ... defection would be the largest since the United States ... www.afgha.com/index.php - 60k ... Kabul is the capital and largest city of Afghanistan. . ... met. area pop. 2,029,889), is the largest city in Uttar Pradesh, a state in northern India. . ... school.discovery.com/homeworkhelp/worldbook/atozgeography/ k/k1menu.html ... Gudermes, Chechnya's second largest town. The attack ... location in Afghanistan's outlying regions ... in the city of Mazar-i-Sharif, a Northern Alliance-affiliated ... english.pravda.ru/hotspots/2001/09/17/ ... Get Worse By RICK BRAGG Pakistan's largest city is getting a jump on the ... Region: Education Offers Women in Northern Afghanistan a Ray of Hope. ... www.nytimes.com/pages/world/asia/ ... within three miles of the airport at Mazar-e-Sharif, the largest city in northern Afghanistan, held since 1998 by the Taliban. There was no immediate comment ... uk.fc.yahoo.com/photos/a/afghanistan.html Google
192. 192. (C) 2005, The University of Michigan 224
Question answering pipeline for the example question (diagram on the slide):
Question: What is the largest city in Northern Afghanistan?
• Query modulation: (largest OR biggest) city “Northern Afghanistan”
• Document retrieval: www.infoplease.com/cgi-bin/id/A0855603, www.washingtonpost.com/wp-dyn/print/world/, …
• Sentence retrieval: “Gudermes, Chechnya's second largest town …”, “… location in Afghanistan's outlying regions …”, “… within three miles of the airport at Mazar-e-Sharif, the largest city in northern Afghanistan”
• Answer extraction: Gudermes, Mazar-e-Sharif
• Answer ranking: Mazar-e-Sharif, Gudermes
    193. 193. (C) 2005, The University of Michigan 225
    194. 194. (C) 2005, The University of Michigan 226
    195. 195. (C) 2005, The University of Michigan 227 Research problems • Source identification: – semi-structured vs. text sources • Query modulation: – best paraphrase of a NL question given the syntax of a search engine? – Compare two approaches: noisy channel model and rule-based • Sentence ranking – n-gram matching, Okapi, co-reference? • Answer extraction – question type identification – phrase chunking – no general-purpose named entity tagger available • Answer ranking – what are the best predictors of a phrase being the answer to a given question: question type, proximity to query words, frequency • Evaluation (MRDR) – accuracy, reliability, timeliness
    196. 196. (C) 2005, The University of Michigan 228 Document retrieval • Use existing search engines: Google, AlltheWeb, NorthernLight • No modifications to question • CF: work on QASM (ACM CIKM 2001)
197. 197. (C) 2005, The University of Michigan 229 Sentence ranking
• Weighted N-gram matching:
  S = w1 Σ_{i=1..N1} tf_i · idf_i + w2 Σ_{j=1..N2} tf_j + w3 Σ_{k=1..N3} tf_k
  (the three sums run over the unigram, bigram, and trigram matches between the question and the sentence)
• Weights are determined empirically, e.g., 0.6, 0.3, and 0.1
198. 198. (C) 2005, The University of Michigan 230 Probabilistic phrase reranking • Answer extraction: probabilistic phrase reranking. What is: p(ph is answer to q | q, ph) • Evaluation: TRDR (total reciprocal rank), the sum of 1/r_i over the ranks r_i at which answers are found – Example: ranks (2, 8, 10) give 1/2 + 1/8 + 1/10 = .725 – Document, sentence, or phrase level • Criterion: presence of answer(s) • High correlation with manual assessment
    199. 199. (C) 2005, The University of Michigan 231 Phrase types PERSON PLACE DATE NUMBER DEFINITION ORGANIZATION DESCRIPTION ABBREVIATION KNOWNFOR RATE LENGTH MONEY REASON DURATION PURPOSE NOMINAL OTHER
200. 200. (C) 2005, The University of Michigan 232 Question Type Identification • Wh-type not sufficient: • Who: PERSON 77, DESCRIPTION 19, ORG 6 • What: NOMINAL 78, PLACE 27, DEF 26, PERSON 18, ORG 16, NUMBER 14, etc. • How: NUMBER 33, LENGTH 6, RATE 2, etc. • Ripper: – 13 features: Question-Words, Wh-Word, Word-Beside-Wh-Word, Is-Noun-Length, Is-Noun-Person, etc. – Top 2 question types • Heuristic algorithm: – About 100 regular expressions based on words and parts of speech
201. 201. (C) 2005, The University of Michigan 233 Ripper performance
Training     Test     Train error rate   Test error rate
TREC9        TREC8    22.4%              24%
TREC8,9      TREC10   17.03%             30%
TREC8,9,10   –        20.69%             –
202. 202. (C) 2005, The University of Michigan 234 Regex performance
Training     Test on TREC9   Test on TREC8   Test on TREC10
TREC9        7.8%            15%             18%
TREC8,9      7.4%            6%              18.2%
TREC8,9,10   4.6%            5.5%            7.6%
    203. 203. (C) 2005, The University of Michigan 235 Phrase ranking • Phrases are identified by a shallow parser (ltchunk from Edinburgh) • Four features: – Proximity – POS (part-of-speech) signature (qtype) – Query overlap – Frequency
    204. 204. (C) 2005, The University of Michigan 236 Proximity • Phrasal answers tend to appear near words from the query • Average distance = 7 words, range = 1 to 50 words • Use linear rescaling of scores
205. 205. (C) 2005, The University of Michigan 237 Part of speech signature
Signature     Phrase types
DT NNP        ORG (55.6%), NO (33.3%), PLACE (5.6%), DATE (5.6%)
NNP NNP       PLACE (37.3%), PERSON (35.6%), NO (16.9%), ORG (10.2%)
DT JJ NNP     NO (75.6%), NUMBER (11.1%), PLACE (4.4%), ORG (4.4%)
NNP           PERSON (37.4%), PLACE (29.6%), DATE (21.7%), NO (7.6%)
DT NN         NO (86.7%), PERSON (3.8%), NUMBER (3.8%), ORG (2.5%)
VBD           NO (100%)
Example: “Hugo/NNP Young/NNP” P (PERSON | “NNP NNP”) = .458
Example: “the/DT Space/NNP Flight/NNP Operations/NNP contractor/NN” P (PERSON | “DT NNP NNP NNP NN”) = 0
Penn Treebank tagset (DT = determiner, JJ = adjective)
    206. 206. (C) 2005, The University of Michigan 238 Query overlap and frequency • Query overlap: – What is the capital of Zimbabwe? – Possible choices: Mugabe, Zimbabwe, Luanda, Harare • Frequency: – Not necessarily accurate but rather useful
    207. 207. (C) 2005, The University of Michigan 239 Reranking Rank Probability and phrase 1 0.599862 the_DT Space_NNP Flight_NNP Operations_NNP contractor_NN ._. 2 0.598564 International_NNP Space_NNP Station_NNP Alpha_NNP 3 0.598398 International_NNP Space_NNP Station_NNP 4 0.598125 to_TO become_VB 5 0.594763 a_DT joint_JJ venture_NN United_NNP Space_NNP Alliance_NNP 6 0.593933 NASA_NNP Johnson_NNP Space_NNP Center_NNP 7 0.587140 will_MD form_VB 8 0.585410 The_DT purpose_NN 9 0.576797 prime_JJ contracts_NNS 10 0.568013 First_NNP American_NNP 11 0.567361 this_DT bulletin_NN board_NN 12 0.565757 Space_NNP :_: 13 0.562627 'Spirit_NN '_'' of_IN ... 41 0.516368 Alan_NNP Shepard_NNP Proximity = .5164
    208. 208. (C) 2005, The University of Michigan 240 Reranking Rank Probability and phrase 1 0.465012 Space_NNP Administration_NNP ._. 2 0.446466 SPACE_NNP CALENDAR_NNP _. 3 0.413976 First_NNP American_NNP 4 0.399043 International_NNP Space_NNP Station_NNP Alpha_NNP 5 0.396250 her_PRP$ third_JJ space_NN mission_NN 6 0.395956 NASA_NNP Johnson_NNP Space_NNP Center_NNP 7 0.394122 the_DT American_NNP Commercial_NNP Launch_NNP Industry_NNP 8 0.390163 the_DT Red_NNP Planet_NNP ._. 9 0.379797 First_NNP American_NNP 10 0.376336 Alan_NNP Shepard_NNP 11 0.375669 February_NNP 12 0.374813 Space_NNP 13 0.373999 International_NNP Space_NNP Station_NNP Qtype = .7288 Proximity * qtype = .3763
    209. 209. (C) 2005, The University of Michigan 241 Reranking Rank Probability and phrase 1 0.478857 Neptune_NNP Beach_NNP ._. 2 0.449232 February_NNP 3 0.447075 Go_NNP 4 0.437895 Space_NNP 5 0.431835 Go_NNP 6 0.424678 Alan_NNP Shepard_NNP 7 0.423855 First_NNP American_NNP 8 0.421133 Space_NNP May_NNP 9 0.411065 First_NNP American_NNP woman_NN 10 0.401994 Life_NNP Sciences_NNP 11 0.385763 Space_NNP Shuttle_NNP Discovery_NNP STS-60_NN 12 0.381865 the_DT Moon_NNP International_NNP Space_NNP Station_NNP 13 0.370030 Space_NNP Research_NNP A_NNP Session_NNP All four features
    210. 210. (C) 2005, The University of Michigan 242
    211. 211. (C) 2005, The University of Michigan 243
    212. 212. (C) 2005, The University of Michigan 244
213. 213. (C) 2005, The University of Michigan 245 Document level performance
TREC 8 corpus (200 questions)
Engine   AlltheWeb   NLight   Google
Avg      0.8355      1.0495   1.3361
#>0      149         163      164
214. 214. (C) 2005, The University of Michigan 246 Sentence level performance
Engine   AW U   AW L   AW O   NL U   NL L   NL O   GO U   GO L   GO O
Avg      2.13   0.31   0.26   2.53   0.48   0.44   2.55   0.54   0.49
#>0      148    99     99     159    121    119    159    137    135
215. 215. (C) 2005, The University of Michigan 247 Phrase level performance
                    AlltheWeb   NorthernLight   Google D+P   Google S+P
Upperbound          2.176       2.652           2.698        1.941
Appearance order    0.026       0.048           0.068        0.0646
Global proximity    0.038       0.054           0.058        0.0646
Combined            0.105       0.117           0.157        0.199
Experiments performed Oct-Nov. 2001
    216. 216. (C) 2005, The University of Michigan 248
    217. 217. (C) 2005, The University of Michigan 249
    218. 218. (C) 2005, The University of Michigan 250 Text classification
219. 219. (C) 2005, The University of Michigan 251 Introduction • Text classification: assigning documents to predefined categories • Hierarchical vs. flat • Many techniques: generative (e.g., Naïve Bayes) vs. discriminative (e.g., maxent, SVM, regression), plus instance-based methods such as kNN • Generative: model joint prob. p(x,y) and use Bayesian prediction to compute p(y|x) • Discriminative: model p(y|x) directly.
220. 220. (C) 2005, The University of Michigan 252 Instance-based models: knn • K-nearest neighbors • Very easy to program • Issues: choosing k, b? • Scoring: score(c, d_q) = b_c + Σ_{d ∈ kNN(d_q)} s(d, d_q), summing the similarities of the nearest training documents d that belong to class c
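A toy k-nearest-neighbor classifier over bag-of-words vectors; the two-category training set and the use of cosine similarity are illustrative assumptions, not the course's data.

    from collections import Counter
    from math import sqrt

    train = [("wheat corn grain harvest", "grain"),
             ("corn exports grain prices", "grain"),
             ("merger acquisition shares", "acquisitions"),
             ("company buys rival shares", "acquisitions")]

    def cosine(a, b):
        num = sum(a[t] * b[t] for t in a if t in b)
        den = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
        return num / den if den else 0.0

    def knn_classify(query, k=3, b=0.0):
        q = Counter(query.split())
        neighbors = sorted(train, key=lambda ex: -cosine(q, Counter(ex[0].split())))[:k]
        scores = {}
        for text, label in neighbors:            # score(c, q) = b + sum of similarities
            scores[label] = scores.get(label, b) + cosine(q, Counter(text.split()))
        return max(scores, key=scores.get)

    print(knn_classify("grain prices rise on corn harvest"))   # grain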
221. 221. (C) 2005, The University of Michigan 253 Feature selection: The Χ2 test
• For a term t, build the 2×2 contingency table of class C vs. term indicator It:
             It = 0   It = 1
  C = 0      k00      k01
  C = 1      k10      k11
• Testing for independence: P(C=0, It=0) should be equal to P(C=0) P(It=0)
  – P(C=0) = (k00+k01)/n
  – P(C=1) = 1 - P(C=0) = (k10+k11)/n
  – P(It=0) = (k00+k10)/n
  – P(It=1) = 1 - P(It=0) = (k01+k11)/n
222. 222. (C) 2005, The University of Michigan 254 Feature selection: The Χ2 test
Χ2 = n (k11 k00 - k10 k01)^2 / [(k11 + k10)(k01 + k00)(k11 + k01)(k10 + k00)]
• High values of Χ2 indicate lower belief in independence.
• In practice, compute Χ2 for all words and pick the top k among them.
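A one-function sketch of the Χ2 score for a single term, computed from its 2×2 counts; the counts below are invented.

    def chi_square(k00, k01, k10, k11):
        n = k00 + k01 + k10 + k11
        num = n * (k11 * k00 - k10 * k01) ** 2
        den = (k11 + k10) * (k01 + k00) * (k11 + k01) * (k10 + k00)
        return num / den if den else 0.0

    # made-up counts: term absent/present (columns) by class 0/1 (rows)
    print(round(chi_square(k00=40, k01=10, k10=15, k11=35), 2))   # 25.25
    # In practice: score every term this way and keep the top-k terms as features.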
223. 223. (C) 2005, The University of Michigan 255 Feature selection: mutual information
MI(X,Y) = Σ_x Σ_y P(x,y) log [ P(x,y) / (P(x) P(y)) ]
• No document length scaling is needed
• Documents are assumed to be generated according to the multinomial model
224. 224. (C) 2005, The University of Michigan 256 Naïve Bayesian classifiers
• Naïve Bayesian classifier:
  P(d ∈ C | F1, F2, …, Fk) = P(F1, F2, …, Fk | d ∈ C) P(d ∈ C) / P(F1, F2, …, Fk)
• Assuming statistical independence:
  P(d ∈ C | F1, F2, …, Fk) = [ Π_{j=1..k} P(Fj | d ∈ C) ] P(d ∈ C) / Π_{j=1..k} P(Fj)
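A minimal multinomial Naïve Bayes sketch with add-one smoothing, anticipating the spam example on the next slide; the tiny "spam"/"ham" training set is invented for illustration.

    from collections import Counter, defaultdict
    from math import log

    train = [("million dollars transfer funds account", "spam"),
             ("confidential transaction funds urgent", "spam"),
             ("meeting notes attached for review", "ham"),
             ("lecture slides posted on the website", "ham")]

    class_words = defaultdict(list)
    for text, label in train:
        class_words[label].extend(text.split())
    vocab = {w for text, _ in train for w in text.split()}

    def classify(text):
        scores = {}
        for c, words in class_words.items():
            counts, total = Counter(words), len(words)
            prior = sum(1 for _, l in train if l == c) / len(train)
            score = log(prior)                     # log P(C)
            for w in text.split():                 # + sum_j log P(F_j | C)
                score += log((counts[w] + 1) / (total + len(vocab)))   # add-one smoothing
            scores[c] = score
        return max(scores, key=scores.get)

    print(classify("urgent funds transfer"))       # spam
    print(classify("slides for the lecture"))      # ham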
    225. 225. (C) 2005, The University of Michigan 257 Spam recognition Return-Path: <ig_esq@rediffmail.com> X-Sieve: CMU Sieve 2.2 From: "Ibrahim Galadima" <ig_esq@rediffmail.com> Reply-To: galadima_esq@netpiper.com To: webmaster@aclweb.org Date: Tue, 14 Jan 2003 21:06:26 -0800 Subject: Gooday DEAR SIR FUNDS FOR INVESTMENTS THIS LETTER MAY COME TO YOU AS A SURPRISE SINCE I HAD NO PREVIOUS CORRESPONDENCE WITH YOU I AM THE CHAIRMAN TENDER BOARD OF INDEPENDENT NATIONAL ELECTORAL COMMISSION INEC I GOT YOUR CONTACT IN THE COURSE OF MY SEARCH FOR A RELIABLE PERSON WITH WHOM TO HANDLE A VERY CONFIDENTIAL TRANSACTION INVOLVING THE ! TRANSFER OF FUND VALUED AT TWENTY ONE MILLION SIX HUNDRED THOUSAND UNITED STATES DOLLARS US$20M TO A SAFE FOREIGN ACCOUNT THE ABOVE FUND IN QUESTION IS NOT CONNECTED WITH ARMS, DRUGS OR MONEY LAUNDERING IT IS A PRODUCT OF OVER INVOICED CONTRACT AWARDED IN 1999 BY INEC TO A
    226. 226. (C) 2005, The University of Michigan 258 Well-known datasets • 20 newsgroups – http://people.csail.mit.edu/u/j/jrennie/public_html/20Ne wsgroups/ • Reuters-21578 – Cats: grain, acquisitions, corn, crude, wheat, trade… • WebKB – http://www-2.cs.cmu.edu/~webkb/ – course, student, faculty, staff, project, dept, other – NB performance (2000) – P=26,43,18,6,13,2,94 – R=83,75,77,9,73,100,35
    227. 227. (C) 2005, The University of Michigan 259 Support vector machines • Introduced by Vapnik in the early 90s.
    228. 228. (C) 2005, The University of Michigan 260 Semi-supervised learning • EM • Co-training • Graph-based
    229. 229. (C) 2005, The University of Michigan 261 Additional topics • Soft margins • VC dimension • Kernel methods
230. 230. (C) 2005, The University of Michigan 262 • SVMs are widely considered to be the best method for text classification (look at papers by Sebastiani, Cristianini, Joachims), e.g. 86% accuracy on Reuters. • NB also good in many circumstances
231. 231. (C) 2005, The University of Michigan 263 Readings • Books: • 1. Ricardo Baeza-Yates and Berthier Ribeiro-Neto; Modern Information Retrieval, Addison-Wesley/ACM Press, 1999. • 2. Pierre Baldi, Paolo Frasconi, Padhraic Smyth; Modeling the Internet and the Web: Probabilistic Methods and Algorithms; Wiley, 2003, ISBN: 0-470-84906-1 • Papers: • Barabasi and Albert "Emergence of scaling in random networks" Science (286) 509-512, 1999 • Bharat and Broder "A technique for measuring the relative size and overlap of public Web search engines" WWW 1998 • Brin and Page "The Anatomy of a Large-Scale Hypertextual Web Search Engine" WWW 1998 • Bush "As We May Think" The Atlantic Monthly 1945 • Chakrabarti, van den Berg, and Dom "Focused Crawling" WWW 1999 • Cho, Garcia-Molina, and Page "Efficient Crawling Through URL Ordering" WWW 1998 • Davison "Topical locality on the Web" SIGIR 2000 • Dean and Henzinger "Finding related pages in the World Wide Web" WWW 1999 • Deerwester, Dumais, Landauer, Furnas, Harshman "Indexing by latent semantic analysis" JASIS 41(6) 1990
    232. 232. (C) 2005, The University of Michigan 264 Readings • Erkan and Radev "LexRank: Graph-based Lexical Centrality as Salience in Text Summarization" JAIR 22, 2004 • Jeong and Barabasi "Diameter of the world wide web" Nature (401) 130-131, 1999 • Hawking, Voorhees, Craswell, and Bailey "Overview of the TREC-8 Web Track" TREC 2000 • Haveliwala "Topic-sensitive pagerank" WWW 2002 • Kumar, Raghavan, Rajagopalan, Sivakumar, Tomkins, Upfal "The Web as a graph" PODS 2000 • Lawrence and Giles "Accessibility of information on the Web" Nature (400) 107-109, 1999 • Lawrence and Giles "Searching the World-Wide Web" Science (280) 98-100, 1998 • Menczer "Links tell us about lexical and semantic Web content" arXiv 2001 • Page, Brin, Motwani, and Winograd "The PageRank citation ranking: Bringing order to the Web" Stanford TR, 1998 • Radev, Fan, Qi, Wu and Grewal "Probabilistic Question Answering on the Web" JASIST 2005 • Singhal "Modern Information Retrieval: an Overview" IEEE 2001
233. 233. (C) 2005, The University of Michigan 265 More readings • Gerard Salton, Automatic Text Processing, Addison-Wesley (1989) • Gerald Kowalski, Information Retrieval Systems: Theory and Implementation, Kluwer (1997) • Gerard Salton and M. McGill, Introduction to Modern Information Retrieval, McGraw-Hill (1983) • C. J. van Rijsbergen, Information Retrieval, Butterworths (1979) • Ian H. Witten, Alistair Moffat, and Timothy C. Bell, Managing Gigabytes, Van Nostrand Reinhold (1994) • ACM SIGIR Proceedings, SIGIR Forum • ACM conferences in Digital Libraries
    234. 234. (C) 2005, The University of Michigan 266 Thank you! Благодаря!
