27. Vector similarity algorithms
Cosine similarity (cosine-based similarity)
Correlation similarity (Pearson correlation coefficient)
Adjusted cosine similarity (adjusted-cosine similarity)
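Adjusted cosine similarity (I_{uv}: items rated by both u and v; \bar{R}_i: mean rating of item i):
\[
sim(u,v) = \frac{\sum_{i \in I_{uv}} (R_{ui} - \bar{R}_i)(R_{vi} - \bar{R}_i)}{\sqrt{\sum_{i \in I_{uv}} (R_{ui} - \bar{R}_i)^2}\,\sqrt{\sum_{i \in I_{uv}} (R_{vi} - \bar{R}_i)^2}}
\]
Correlation similarity (\bar{R}_u, \bar{R}_v: mean ratings of users u and v):
\[
sim(u,v) = \frac{\sum_{i \in I_{uv}} (R_{ui} - \bar{R}_u)(R_{vi} - \bar{R}_v)}{\sqrt{\sum_{i \in I_{uv}} (R_{ui} - \bar{R}_u)^2}\,\sqrt{\sum_{i \in I_{uv}} (R_{vi} - \bar{R}_v)^2}}
\]
Cosine similarity:
\[
sim(u,v) = \cos(\vec{u},\vec{v}) = \frac{\vec{u} \cdot \vec{v}}{\|\vec{u}\| \times \|\vec{v}\|} = \frac{\sum_{i=1}^{n} R_{ui} R_{vi}}{\sqrt{\sum_{i=1}^{n} R_{ui}^2}\,\sqrt{\sum_{i=1}^{n} R_{vi}^2}}
\]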
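The three measures on this slide differ only in what is subtracted from each rating before the dot product is normalized. A minimal Java sketch follows. The class and method names are hypothetical, and it assumes the input arrays already hold only the co-rated items (the set I_uv) and that the means are supplied by the caller.

```java
final class RatingSimilarity {

    /** Plain cosine: no centering, ratings used as-is. */
    static double cosine(double[] u, double[] v) {
        double dot = 0, normU = 0, normV = 0;
        for (int i = 0; i < u.length; i++) {
            dot += u[i] * v[i];
            normU += u[i] * u[i];
            normV += v[i] * v[i];
        }
        double denom = Math.sqrt(normU) * Math.sqrt(normV);
        return denom == 0 ? 0 : dot / denom; // assumption: return 0 for a zero vector
    }

    /** Pearson correlation: each rating is centered on its user's mean. */
    static double pearson(double[] u, double[] v, double meanU, double meanV) {
        double num = 0, devU = 0, devV = 0;
        for (int i = 0; i < u.length; i++) {
            double a = u[i] - meanU;
            double b = v[i] - meanV;
            num += a * b;
            devU += a * a;
            devV += b * b;
        }
        double denom = Math.sqrt(devU) * Math.sqrt(devV);
        return denom == 0 ? 0 : num / denom;
    }

    /** Adjusted cosine: each rating is centered on the mean of item i. */
    static double adjustedCosine(double[] u, double[] v, double[] itemMeans) {
        double num = 0, devU = 0, devV = 0;
        for (int i = 0; i < u.length; i++) {
            double a = u[i] - itemMeans[i];
            double b = v[i] - itemMeans[i];
            num += a * b;
            devU += a * a;
            devV += b * b;
        }
        double denom = Math.sqrt(devU) * Math.sqrt(devV);
        return denom == 0 ? 0 : num / denom;
    }

    public static void main(String[] args) {
        double[] u = {1, 2, 3};
        double[] v = {3, 2, 1};
        System.out.println(cosine(u, v));
        System.out.println(pearson(u, v, 2.0, 2.0));
        System.out.println(adjustedCosine(new double[] {5, 1}, new double[] {4, 2}, new double[] {3, 3}));
    }
}
```

Returning 0 when a denominator is zero is a choice made for this sketch; a real recommender might instead treat the pair as undefined and skip it.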
34. In Lucene, a TermFreqVector is a representation of
all of the terms and term counts in a specific Field
of a Document instance
As a tuple:
termFreq = <term, term count_D>
<fieldName, <…, termFreq_i, termFreq_{i+1}, …>>
As Java:
public String getField();
public String[] getTerms();
public int[] getTermFrequencies();
Lucene Term Vectors (TV)
Parallel arrays: getTerms() and getTermFrequencies()
35. Lucene Term Vectors (TV)
Field.TermVector.NO: do not store term vectors
Field.TermVector.YES: store term vectors
Field.TermVector.WITH_POSITIONS: store term vectors (term values plus token position information)
Field.TermVector.WITH_OFFSETS: store term vectors (term values plus token offsets)
Field.TermVector.WITH_POSITIONS_OFFSETS: store term vectors (term values plus token position information and token offsets)
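The difference between a token's position and its offsets is easy to mix up: the position is the token's ordinal in the token stream, while the offsets are character indices into the original text. The sketch below illustrates this with a plain whitespace tokenizer. It is not Lucene's analyzer; the class names are hypothetical, and the end offset is treated as exclusive here.

```java
import java.util.ArrayList;
import java.util.List;

final class TokenInfoSketch {

    /** One token with the extra data that positions and offsets would add. */
    static final class Token {
        final String term;
        final int position;    // ordinal of the token: 0, 1, 2, ...
        final int startOffset; // index of the token's first character in the text
        final int endOffset;   // index one past the token's last character

        Token(String term, int position, int startOffset, int endOffset) {
            this.term = term;
            this.position = position;
            this.startOffset = startOffset;
            this.endOffset = endOffset;
        }
    }

    /** Splits on whitespace and records position and character offsets per token. */
    static List<Token> tokenize(String text) {
        List<Token> tokens = new ArrayList<>();
        int position = 0;
        int i = 0;
        while (i < text.length()) {
            // Skip any run of whitespace between tokens.
            while (i < text.length() && Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            if (i >= text.length()) {
                break;
            }
            int start = i;
            // Consume characters up to the next whitespace.
            while (i < text.length() && !Character.isWhitespace(text.charAt(i))) {
                i++;
            }
            tokens.add(new Token(text.substring(start, i), position++, start, i));
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (Token t : tokenize("to be or not to be")) {
            System.out.println(t.term + " pos=" + t.position
                    + " offsets=[" + t.startOffset + "," + t.endOffset + ")");
        }
    }
}
```

Positions are what phrase-style matching needs, while offsets are what highlighting needs, which is why the two can be stored independently.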
Term vectors were officially added in the 1.4 release.
See the TermFreqVector interface definition.
getTerms() and getTermFrequencies() return parallel arrays: the term getTerms()[i] occurs getTermFrequencies()[i] times in that field of the document. This is a within-document term frequency, not a document frequency.
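To make the parallel-array layout concrete, here is a small stand-in written in plain Java. SimpleTermVector is a hypothetical class that only mirrors the three accessors listed on the slide; it is not Lucene's TermFreqVector, and keeping the terms in sorted order is a choice made for this sketch. The cosine method walks the two arrays in lockstep to compare two documents.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

final class SimpleTermVector {
    private final String field;
    private final String[] terms;
    private final int[] freqs;

    /** Counts the tokens of one field and stores them as two parallel arrays. */
    SimpleTermVector(String field, String[] tokens) {
        this.field = field;
        TreeMap<String, Integer> counts = new TreeMap<>();
        for (String token : tokens) {
            counts.merge(token, 1, Integer::sum);
        }
        terms = new String[counts.size()];
        freqs = new int[counts.size()];
        int i = 0;
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            terms[i] = e.getKey();   // terms[i] ...
            freqs[i] = e.getValue(); // ... occurs freqs[i] times in this field
            i++;
        }
    }

    String getField() { return field; }
    String[] getTerms() { return terms; }
    int[] getTermFrequencies() { return freqs; }

    /** Cosine similarity between two term vectors, using raw term counts as weights. */
    static double cosine(SimpleTermVector a, SimpleTermVector b) {
        Map<String, Integer> bCounts = new HashMap<>();
        double normB = 0;
        for (int i = 0; i < b.getTerms().length; i++) {
            int f = b.getTermFrequencies()[i];
            bCounts.put(b.getTerms()[i], f);
            normB += (double) f * f;
        }
        double dot = 0, normA = 0;
        for (int i = 0; i < a.getTerms().length; i++) {
            int f = a.getTermFrequencies()[i];
            normA += (double) f * f;
            Integer other = bCounts.get(a.getTerms()[i]);
            if (other != null) {
                dot += (double) f * other; // only shared terms contribute
            }
        }
        double denom = Math.sqrt(normA) * Math.sqrt(normB);
        return denom == 0 ? 0 : dot / denom;
    }

    public static void main(String[] args) {
        SimpleTermVector a = new SimpleTermVector("body", "to be or not to be".split(" "));
        SimpleTermVector b = new SimpleTermVector("body", "to be".split(" "));
        for (int i = 0; i < a.getTerms().length; i++) {
            System.out.println(a.getTerms()[i] + " -> " + a.getTermFrequencies()[i]);
        }
        System.out.println(cosine(a, b));
    }
}
```

Raw counts are used as weights only to keep the example short; weighting by TF-IDF instead would change the numbers but not the array-walking pattern.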