Karam Abdulahhad
GESIS - Cologne
karam.abdulahhad@gesis.org
karam.abdulahhad@gmail.com
Beyond Classical Information Retrieval (IR)
Conceptual IR
Linguistic phenomena & IR problems
GESIS - K.Abdulahhad, 20-12-2018
How have “fiddles” changed over time
Violins
Like most technological breakthroughs, today's
violin is an evolutionary product. So far as we
know, there were no violins in 1500. A century
later, there were several types and probably
thousands of specimens north and south of the
Alps, and from England to Poland. A marvel of
craftsmanship and acoustical engineering, the
violin produced more sound than any stringed
instrument to date. Almost immediately,
composers, players and collectors liked what
they heard and saw. Italian and non-Italian
makers proliferated.
……….
Historical information about “sugar
river bank”
History and Mission Statement
…………
The Bank continues to grow at a healthy pace.
We have continued to do well and be a leader
in our industry. Our main branch was expanded
in 1982 and we now have branches in Sunapee,
New London, Warner, Grantham and Concord.
We at Sugar River Bank are proud of our
history and growth. It is the responsibility of
each and every member of our Bank's family to
insure continued growth in the future.
…………
www.sugarriverbank.com
The Life-Sustaining Sugar River
…………
The west branch of the Sugar River historically
supported a native trout population, but had
suffered from sedimentation, overgrazing of its
banks and warming water. “Restoration efforts
in the Dane County portion of the watershed
reduced nonpoint source pollution, installed
riverbank vegetative filter strips, improved in-
stream habitat, restricted cattle access to
streams, and improved management of animal
waste from barnyards,” says Hansis.
…………
northwestquarterly.com
Part-Whole: Hand, Body
Homonyms: Bank (com), Bank (geo)
Hyponym / Hypernym: B-cell, Lymphocyte
Synonyms: Violin, Fiddle
Co-hyponyms: Cat, Dog
Observations
1. Inadequacy of the term-independence assumption,
which leads to the term-mismatch problem
2. Retrieval process has an inferential nature, where the
classical word-based document-query comparison
paradigm is insufficient
Conceptual approach
• Concepts are categories encompassing all synonymous terms
• Using concept IDs instead of terms
WordNet
  Ticker, Watch: S04563183
  Cancer, Malignant neoplastic disease: S14263400
  Snake, Serpent, Ophidian: S01729333
UMLS
  Atrial fibrillation, Auricular fibrillation: C0004238
  Skin cancer, Melanoma, Malignant neoplasm of skin: C0004238
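The idea of indexing concept IDs instead of terms can be sketched as a simple lookup. This is only an illustration: the synonym sets and IDs are copied from the slide, while the table layout and the helper names `to_concept` and `same_concept` are hypothetical and not part of the talk.

```python
# Synonym sets and IDs as listed on the slide; the dictionary layout is an
# illustrative assumption, not an actual WordNet or UMLS interface.
CONCEPTS = {
    "S04563183": ["ticker", "watch"],
    "S14263400": ["cancer", "malignant neoplastic disease"],
    "S01729333": ["snake", "serpent", "ophidian"],
    "C0004238": ["atrial fibrillation", "auricular fibrillation"],
}

# Invert the table into a term -> concept ID lookup.
TERM_TO_CONCEPT = {term: cid for cid, terms in CONCEPTS.items() for term in terms}


def to_concept(term: str) -> str:
    """Return the concept ID of a term, or the normalized term if it is unknown."""
    key = term.strip().lower()
    return TERM_TO_CONCEPT.get(key, key)


def same_concept(a: str, b: str) -> bool:
    """Two terms match when they map to the same concept ID."""
    return to_concept(a) == to_concept(b)
```

With such a mapping, a query containing "serpent" and a document containing "snake" meet on the same index key, which is the term-mismatch effect the conceptual approach targets.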
Part I: Relative Concept Frequency
[1] K. Abdulahhad et al., Revisiting the Term Frequency in Concept-Based IR Models. DEXA 2013
[2] K. Abdulahhad et al., MRIM at ImageCLEF2012. From Words to Concepts: A New Counting Approach. CLEF 2012
Relative Concept Frequency (problem)
• Text to concepts mapping
• Using MetaMap & UMLS concepts
Precision
Recall
• Document length
  d = ‘lobar pneumonia x-ray’
  Word-space: |d| = 3
  Concept-space: |d| = ?
Relative Concept Frequency (idea)
• Use all concepts while maintaining the word-based document length
• Structure-based redistribution of the word-based document length over the concepts
Relative Concept Frequency (how)
• Computing relative frequency
• Hypothesis 1: the concepts of a longer phrase receive a larger count (more specific meaning)
• Hypothesis 2: the larger the set of concepts of a phrase, the smaller the count each of its concepts receives (ambiguity)
• Hypothesis 3: the word-based document length |d| is maintained
Computing Relative Concept Frequency (Step 1)
• Step 1: map text to concepts (e.g. via MetaMap)
  Input to MetaMap: ‘lobar pneumonia x-ray’

  Sub-phrases             Concepts
  T1: ‘lobar pneumonia’   C1 = {C0032300, C0155862}
  T2: ‘pneumonia x-ray’   C2 = {C0581647}
  T3: ‘lobar’             C3 = {C1511010, C1428707, C0796494}
  T4: ‘pneumonia’         C4 = {C0024109, C1278908, C0032285, C2707265, C2709248}
  T5: ‘x-ray’             C5 = {C0034571, C0043299, C0043309, C1306645, C1714805, C1962945}
Computing Relative Concept Frequency (Step 2)
• Step 2: build hierarchy
  $(T_i, C_i) < (T_j, C_j) \Leftrightarrow T_i \subset T_j$
  R (virtual node): children (T1, C1) and (T2, C2)
  (T1, C1): children (T3, C3) and (T4, C4)
  (T2, C2): children (T4, C4) and (T5, C5)
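The hierarchy of Step 2 can be sketched as follows. This is a minimal illustration under assumptions that are not in the slides: sub-phrases are compared as sets of whitespace-separated words, a node is attached only to its smallest containing phrases, and nodes without any containing phrase hang under the virtual root `R`. The names `PHRASES` and `build_hierarchy` are hypothetical.

```python
# Sub-phrases of 'lobar pneumonia x-ray', labelled as on the slide.
PHRASES = {
    "T1": "lobar pneumonia",
    "T2": "pneumonia x-ray",
    "T3": "lobar",
    "T4": "pneumonia",
    "T5": "x-ray",
}


def build_hierarchy(phrases, root="R"):
    """Return a dict node -> list of children, ordered by strict containment."""
    # Assumption: containment is tested on word sets, not on word positions.
    words = {name: frozenset(text.split()) for name, text in phrases.items()}
    children = {name: [] for name in phrases}
    children[root] = []
    for node, node_words in words.items():
        # Every phrase that strictly contains the current one.
        supers = [j for j, wj in words.items() if node_words < wj]
        # Keep only the smallest containing phrases as direct parents.
        parents = [j for j in supers
                   if not any(words[k] < words[j] for k in supers)]
        # Nodes without a containing phrase hang under the virtual root.
        for parent in parents or [root]:
            children[parent].append(node)
    return children
```

Note that ‘pneumonia’ (T4) ends up with two parents, T1 and T2, so the structure is a directed acyclic graph rather than a strict tree.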
Computing Relative Concept Frequency (Step 3)
• Step 3: compute relative frequency $rf_i$
  - Breadth-first search
  - The relative frequency $rf_i$ of $c \in C_i$ must be proportional to $|T_i|$ (Hypothesis 1), and inversely proportional to $|C_i|$ (Hypothesis 2)
  - Maintaining $|d|$ by distributing it over the concepts of $d$ (Hypothesis 3)

  Sub-phrases             |T_i|  |C_i|
  T1: ‘lobar pneumonia’   2      2
  T2: ‘pneumonia x-ray’   2      1
  T3: ‘lobar’             1      3
  T4: ‘pneumonia’         1      5
  T5: ‘x-ray’             1      6
• We distribute the $|d| = 3$ of the phrase ‘lobar pneumonia x-ray’ over its concepts, entering the hierarchy at the virtual root R
• Step 3: computing relative weight
• For each node $(T_i, C_i)$ we compute three values
  - $\alpha_i$: the amount that should be distributed over the concepts of the current node $(T_i, C_i)$ and its children
    $\alpha_i = \sum_{p \in parents(i)} \delta_p \times |T_i|$
  - $\delta_i$: the portion of the input amount $\alpha_i$ assigned to one single word
    $\delta_i = \frac{\alpha_i}{|T_i| + \sum_{ch \in children(i)} |T_{ch}|}$
  - $\beta_i$, or equivalently $rf_i$: the relative frequency of each concept $c \in C_i$
    $\beta_i = \frac{\delta_i \times |T_i|}{|C_i|}$
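The three values can be computed in one breadth-first pass over the hierarchy. The sketch below is an illustration, not the author's implementation: the function name `relative_frequencies`, the dictionary-based inputs, and the rule that a node is processed only once all of its parents are done are assumptions. Exact fractions are used so the results can be compared with the slide values.

```python
from collections import deque
from fractions import Fraction

# Hierarchy and sizes of the 'lobar pneumonia x-ray' example from the slides.
CHILDREN = {"R": ["T1", "T2"], "T1": ["T3", "T4"], "T2": ["T4", "T5"],
            "T3": [], "T4": [], "T5": []}
T_SIZE = {"R": 0, "T1": 2, "T2": 2, "T3": 1, "T4": 1, "T5": 1}  # |T_i|
C_SIZE = {"R": 0, "T1": 2, "T2": 1, "T3": 3, "T4": 5, "T5": 6}  # |C_i|


def relative_frequencies(children, t_size, c_size, doc_len, root="R"):
    """Return (alpha, delta, beta) dictionaries keyed by node name."""
    parents = {node: [] for node in children}
    for parent, kids in children.items():
        for kid in kids:
            parents[kid].append(parent)

    alpha, delta, beta = {root: Fraction(doc_len)}, {}, {}
    pending = {node: len(ps) for node, ps in parents.items()}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        if node != root:
            # alpha_i = sum over parents of delta_parent * |T_i|
            alpha[node] = sum(delta[p] for p in parents[node]) * t_size[node]
        # delta_i = alpha_i / (|T_i| + sum of |T_child| over children)
        denom = t_size[node] + sum(t_size[k] for k in children[node])
        delta[node] = alpha[node] / denom
        if c_size[node]:
            # beta_i = delta_i * |T_i| / |C_i|  (the virtual root has no concepts)
            beta[node] = delta[node] * t_size[node] / c_size[node]
        for kid in children[node]:
            pending[kid] -= 1
            if pending[kid] == 0:  # all parents processed
                queue.append(kid)
    return alpha, delta, beta


ALPHA, DELTA, BETA = relative_frequencies(CHILDREN, T_SIZE, C_SIZE, doc_len=3)
```

Summing `BETA[n] * C_SIZE[n]` over all nodes gives back the word-based length 3, which is Hypothesis 3.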
• Distributing $|d| = 3$ over the hierarchy, node by node
  Sizes: $|T_R| = 0$, $|C_R| = 0$; $|T_1| = 2$, $|C_1| = 2$; $|T_2| = 2$, $|C_2| = 1$; $|T_3| = 1$, $|C_3| = 3$; $|T_4| = 1$, $|C_4| = 5$; $|T_5| = 1$, $|C_5| = 6$
  R: $\alpha_R = 3$, $\delta_R = \frac{\alpha_R}{|T_R| + |T_1| + |T_2|} = \frac{3}{4}$
  T1: $\alpha_1 = \delta_R \times |T_1| = \frac{3}{2}$, $\delta_1 = \frac{\alpha_1}{|T_1| + |T_3| + |T_4|} = \frac{3}{8}$, $\beta_1 = \frac{\delta_1 \times |T_1|}{|C_1|} = \frac{3}{8}$
  T2: $\alpha_2 = \delta_R \times |T_2| = \frac{3}{2}$, $\delta_2 = \frac{\alpha_2}{|T_2| + |T_4| + |T_5|} = \frac{3}{8}$, $\beta_2 = \frac{\delta_2 \times |T_2|}{|C_2|} = \frac{3}{4}$
  T3: $\alpha_3 = \delta_1 \times |T_3| = \frac{3}{8}$, $\delta_3 = \frac{\alpha_3}{|T_3|} = \frac{3}{8}$, $\beta_3 = \frac{\delta_3 \times |T_3|}{|C_3|} = \frac{1}{8}$
  T4: $\alpha_4 = \delta_1 \times |T_4| + \delta_2 \times |T_4| = \frac{3}{4}$, $\delta_4 = \frac{\alpha_4}{|T_4|} = \frac{3}{4}$, $\beta_4 = \frac{\delta_4 \times |T_4|}{|C_4|} = \frac{3}{20}$
  T5: $\alpha_5 = \delta_2 \times |T_5| = \frac{3}{8}$, $\delta_5 = \frac{\alpha_5}{|T_5|} = \frac{3}{8}$, $\beta_5 = \frac{\delta_5 \times |T_5|}{|C_5|} = \frac{1}{16}$
  Sub-phrases             |T_i|  |C_i|  rf_i   Concepts
  T1: ‘lobar pneumonia’   2      2      3/8    C0032300, C0155862
  T2: ‘pneumonia x-ray’   2      1      3/4    C0581647
  T3: ‘lobar’             1      3      1/8    C1511010, C1428707, C0796494
  T4: ‘pneumonia’         1      5      3/20   C0024109, C1278908, C0032285, C2707265, C2709248
  T5: ‘x-ray’             1      6      1/16   C0034571, C0043299, C0043309, C1306645, C1714805, C1962945

• From this table, we can see that the concepts of the least ambiguous and longest phrase (T2) have the highest frequency
• The concepts of the most ambiguous and shortest phrase (T5) have the lowest frequency
• $\sum_i \sum_{c \in C_i} rf_i = 3$, so the word-based $|d|$ is maintained
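As a check on Hypothesis 3, the total can be written out with the sizes $|C_i|$ and the values $rf_i$ from the example (each of the $|C_i|$ concepts of a node receives $rf_i$):

```latex
\sum_{i=1}^{5} |C_i| \times rf_i
  = 2 \cdot \tfrac{3}{8} + 1 \cdot \tfrac{3}{4} + 3 \cdot \tfrac{1}{8}
    + 5 \cdot \tfrac{3}{20} + 6 \cdot \tfrac{1}{16}
  = \tfrac{3}{4} + \tfrac{3}{4} + \tfrac{3}{8} + \tfrac{3}{4} + \tfrac{3}{8}
  = 3 = |d|
```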
Relative Concept Frequency (results)
• Corpora
(*) indicates a statistically significant (α < 0.05) improvement w.r.t. the classical concept frequency (TF)
Relative Concept Frequency (conclusion)
• Dealing with the document length deformation
• Encouraging results
  - Increase recall
  - Maintain or even increase the precision
• Can be used with classical IR models
  - Change the (TF) component
Part II: Concept Embedding
[3] K. Abdulahhad, Concept embedding for information retrieval. ECIR 2018
Concept embedding (problem)
  fiddle, violin: S04544161
  skin cancer, melanoma: C0004238
  B-cell (C0004561) is-a lymphocyte (C0024264)
• Relation-based concept similarity is problematic
• Relations have different semantics & properties
  - fiddle, violin: synonymous
  - B-cell, lymphocyte: is-a
  - hand, body: part-of
Concept embedding (idea)
• Concepts as vectors
• Still using concepts to reduce the mismatch effect
• Avoiding the complexities of relation-based inter-concept similarity
Goal: check the adaptability of concept-embedding-based similarity to IR
Concept embedding (approaches)
• Flat embedding
  $c = F(w_1, \dots, w_n)$
• Hierarchical embedding
  $s_i = F(w_1^i, \dots, w_n^i)$
  $t_j = F(s_1^j, \dots, s_m^j)$
  $c = F(t_1, \dots, t_k)$
• Weighted embedding
  $c = F(\alpha_1 w_1, \dots, \alpha_n w_n)$
Concept embedding (experiments)
• Experiments consist of two parts
  1. Generating concept embedding vectors
  2. Testing a vector-based concept similarity for ad-hoc IR
1. Generating concept embedding vectors
• Word embedding
  - PubMed Central collection (vocabulary of 1,177,879 words)
  - Word2Vec
    - Vector size 500
    - Continuous bag of words
    - Window size 8
    - Negative sampling 25
• Concept embedding
  - UMLS2017 concepts (English content only)
  - For each concept, we build the corresponding set of words
  - Flat embedding: replace F by avg
  - Hierarchical embedding: replace F by avg
  - Weighted embedding: replace F by weighted-avg
    - The weight $\alpha_w$ of a word $w$ is $\alpha_w = \ln \frac{N+1}{n}$
    - $N$ is the number of documents in PubMed Central
    - $n$ is the document frequency of $w$ in PubMed Central
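The flat and weighted variants can be sketched with plain Python lists. The slides only say "avg" and "weighted-avg", so dividing by the sum of the weights is an assumption here, as are the function names and the toy two-dimensional vectors used in the test below.

```python
import math


def average(vectors, weights=None):
    """Weighted average of equal-length vectors given as lists of floats."""
    if weights is None:
        weights = [1.0] * len(vectors)
    total = sum(weights)
    dim = len(vectors[0])
    # Assumption: the weighted average is normalized by the sum of weights.
    return [sum(w * v[i] for w, v in zip(weights, vectors)) / total
            for i in range(dim)]


def idf_weight(n_docs, doc_freq):
    """alpha_w = ln((N + 1) / n), as defined on the slide."""
    return math.log((n_docs + 1) / doc_freq)


def flat_embedding(words, word_vectors):
    """c = avg(w_1, ..., w_n)."""
    return average([word_vectors[w] for w in words])


def weighted_embedding(words, word_vectors, n_docs, doc_freq):
    """c = weighted-avg(alpha_1 w_1, ..., alpha_n w_n) with idf-style weights."""
    weights = [idf_weight(n_docs, doc_freq[w]) for w in words]
    return average([word_vectors[w] for w in words], weights)
```

The hierarchical variant would apply `average` twice more, first over the words of each lower-level unit and then over the resulting vectors, following the $s_i$, $t_j$, $c$ composition.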
• Concept embedding (missing words)
  - Fixed random vectors
  - Several experiments for weighting missing words
    - The word is too popular: n = N (poor idf)
    - The word is too rare: n = 1 (high idf)
    - Or in between: n = N/2
2. Testing a vector-based concept similarity for ad-hoc IR
• Corpora
  - clef11 & clef12
• Text to concepts mapping
  - MetaMap
  - UMLS concepts
• IR model and concept similarity
  $RSV(d, q) = \sum_{c \in q} weight_q(c) \times sim(c, c^*) \times weight_d(c^*)$
  - Weight(c): BM25 and Pivoted Normalization
  - Concept similarity
    $sim(c_i, c_j) = \begin{cases} 0 & \cos\theta \le 0 \\ \beta \times (\cos\theta)^2 & \text{otherwise} \end{cases}$
  - For comparison (Leacock)
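A small sketch of the scoring formula follows. Two points are assumptions, since the slides do not state them: $c^*$ is taken to be the document concept most similar to the query concept $c$, and the trailing 2 in the similarity is read as a square of the cosine. The concept weights (BM25 or Pivoted Normalization in the talk) are passed in as precomputed numbers, and all function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sim(u, v, beta=1.0):
    """0 if cos(theta) <= 0, otherwise beta * cos(theta)^2."""
    cos = cosine(u, v)
    return 0.0 if cos <= 0 else beta * cos ** 2


def rsv(query_weights, doc_weights, vectors, beta=1.0):
    """Sum over query concepts of weight_q(c) * sim(c, c*) * weight_d(c*)."""
    score = 0.0
    if not doc_weights:
        return score
    for c, w_q in query_weights.items():
        # Assumption: c* is the document concept most similar to c.
        c_star = max(doc_weights,
                     key=lambda dc: sim(vectors[c], vectors[dc], beta))
        score += w_q * sim(vectors[c], vectors[c_star], beta) * doc_weights[c_star]
    return score
```

Clipping non-positive cosines to zero means a query concept never contributes through an unrelated or opposed document concept, whatever the weight of that concept.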
• Results
(*) indicates a statistically significant (α < 0.05) improvement w.r.t. “NoEmb-NoSim”
(†) indicates a statistically significant (α < 0.05) improvement w.r.t. “NoEmb-Leacock”
Concept embedding (conclusion)
• Three approaches to build concept vectors based on word embedding
• Promising results for using vector-based concept representation and similarity
• Concepts and words are represented in the same vector space
  → they are comparable
  → Improve approaches like MetaMap
Conclusion
• Dealing with the two observations
  - Inadequacy of the term-independence assumption
  - Retrieval process has an inferential nature
• Conceptual IR
  - Document length deformation
  - Inter-concept relations quantification
Thank you …

More Related Content

Similar to Beyond Classical Information Retrieval (IR): Conceptual IR

sigmod-keynote.pdf
sigmod-keynote.pdfsigmod-keynote.pdf
sigmod-keynote.pdf
ssuser56e850
 
V.8.0-Emerging Frontiers and Future Directions for Predictive Analytics
V.8.0-Emerging Frontiers and Future Directions for Predictive AnalyticsV.8.0-Emerging Frontiers and Future Directions for Predictive Analytics
V.8.0-Emerging Frontiers and Future Directions for Predictive Analytics
Elinor Velasquez
 
From health persona to societal health uci 131202
From health persona to societal health  uci  131202From health persona to societal health  uci  131202
From health persona to societal health uci 131202
Ramesh Jain
 

Similar to Beyond Classical Information Retrieval (IR): Conceptual IR (20)

sigmod-keynote.pdf
sigmod-keynote.pdfsigmod-keynote.pdf
sigmod-keynote.pdf
 
Teaching & Learning with Technology TLT 2016
Teaching & Learning with Technology TLT 2016Teaching & Learning with Technology TLT 2016
Teaching & Learning with Technology TLT 2016
 
Simplicial closure and higher-order link prediction --- SIAMNS18
Simplicial closure and higher-order link prediction --- SIAMNS18Simplicial closure and higher-order link prediction --- SIAMNS18
Simplicial closure and higher-order link prediction --- SIAMNS18
 
V.8.0-Emerging Frontiers and Future Directions for Predictive Analytics
V.8.0-Emerging Frontiers and Future Directions for Predictive AnalyticsV.8.0-Emerging Frontiers and Future Directions for Predictive Analytics
V.8.0-Emerging Frontiers and Future Directions for Predictive Analytics
 
Tim Brown ACEAS Phenocams
Tim Brown ACEAS PhenocamsTim Brown ACEAS Phenocams
Tim Brown ACEAS Phenocams
 
Deep red - The environmental impact of deep learning (Paolo Caressa)
Deep red - The environmental impact of deep learning (Paolo Caressa)Deep red - The environmental impact of deep learning (Paolo Caressa)
Deep red - The environmental impact of deep learning (Paolo Caressa)
 
Dgpg college kanpur_2015
Dgpg college kanpur_2015Dgpg college kanpur_2015
Dgpg college kanpur_2015
 
From health persona to societal health uci 131202
From health persona to societal health  uci  131202From health persona to societal health  uci  131202
From health persona to societal health uci 131202
 
Doing Scientific Investigations W2D3.pptx
Doing Scientific Investigations W2D3.pptxDoing Scientific Investigations W2D3.pptx
Doing Scientific Investigations W2D3.pptx
 
NLP support for clinical tasks and decisions
NLP support for clinical tasks and decisionsNLP support for clinical tasks and decisions
NLP support for clinical tasks and decisions
 
Deep Learning for Food Analysis
Deep Learning for Food Analysis Deep Learning for Food Analysis
Deep Learning for Food Analysis
 
"The data revolution", par Serena Capital
"The data revolution", par Serena Capital"The data revolution", par Serena Capital
"The data revolution", par Serena Capital
 
The Data Revolution - Serena Capital
The Data Revolution - Serena CapitalThe Data Revolution - Serena Capital
The Data Revolution - Serena Capital
 
QMC: Transition Workshop - Selected Highlights from the Probabilistic Numeric...
QMC: Transition Workshop - Selected Highlights from the Probabilistic Numeric...QMC: Transition Workshop - Selected Highlights from the Probabilistic Numeric...
QMC: Transition Workshop - Selected Highlights from the Probabilistic Numeric...
 
Thailand Policy Foresight in Covid-19 Era
Thailand Policy Foresight in Covid-19 EraThailand Policy Foresight in Covid-19 Era
Thailand Policy Foresight in Covid-19 Era
 
Big Data in Biomedicine: Where is the NIH Headed
Big Data in Biomedicine: Where is the NIH HeadedBig Data in Biomedicine: Where is the NIH Headed
Big Data in Biomedicine: Where is the NIH Headed
 
Bigdata AI
Bigdata AI Bigdata AI
Bigdata AI
 
Democratising biodiversity and genomics research: open and citizen science to...
Democratising biodiversity and genomics research: open and citizen science to...Democratising biodiversity and genomics research: open and citizen science to...
Democratising biodiversity and genomics research: open and citizen science to...
 
CHAPTER 14 CLUSTERING.PPTX
CHAPTER 14 CLUSTERING.PPTXCHAPTER 14 CLUSTERING.PPTX
CHAPTER 14 CLUSTERING.PPTX
 
15/3 -17 impact exponential technologies
15/3 -17 impact exponential technologies 15/3 -17 impact exponential technologies
15/3 -17 impact exponential technologies
 

Recently uploaded

Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotecAbortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
acoha1
 
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotecAbortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Riyadh +966572737505 get cytotec
 
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
dq9vz1isj
 
一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理
pyhepag
 
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
acoha1
 
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一
0uyfyq0q4
 

Recently uploaded (20)

Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotecAbortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
Abortion pills in Riyadh Saudi Arabia (+966572737505 buy cytotec
 
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(WashU毕业证书)圣路易斯华盛顿大学毕业证成绩单本科硕士学位证留信学历认证
 
Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...
Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...
Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...
 
How to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data AnalyticsHow to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data Analytics
 
ℂall Girls Balbir Nagar ℂall Now Chhaya ☎ 9899900591 WhatsApp Number 24/7
ℂall Girls Balbir Nagar ℂall Now Chhaya ☎ 9899900591 WhatsApp  Number 24/7ℂall Girls Balbir Nagar ℂall Now Chhaya ☎ 9899900591 WhatsApp  Number 24/7
ℂall Girls Balbir Nagar ℂall Now Chhaya ☎ 9899900591 WhatsApp Number 24/7
 
Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...
Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...
Data Visualization Exploring and Explaining with Data 1st Edition by Camm sol...
 
123.docx. .
123.docx.                                 .123.docx.                                 .
123.docx. .
 
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotecAbortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
Abortion pills in Dammam Saudi Arabia// +966572737505 // buy cytotec
 
What is Insertion Sort. Its basic information
What is Insertion Sort. Its basic informationWhat is Insertion Sort. Its basic information
What is Insertion Sort. Its basic information
 
2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group Meeting2024 Q2 Orange County (CA) Tableau User Group Meeting
2024 Q2 Orange County (CA) Tableau User Group Meeting
 
2024 Q1 Tableau User Group Leader Quarterly Call
2024 Q1 Tableau User Group Leader Quarterly Call2024 Q1 Tableau User Group Leader Quarterly Call
2024 Q1 Tableau User Group Leader Quarterly Call
 
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
1:1原版定制伦敦政治经济学院毕业证(LSE毕业证)成绩单学位证书留信学历认证
 
Artificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdfArtificial_General_Intelligence__storm_gen_article.pdf
Artificial_General_Intelligence__storm_gen_article.pdf
 
Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)Atlantic Grupa Case Study (Mintec Data AI)
Atlantic Grupa Case Study (Mintec Data AI)
 
一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理一比一原版西悉尼大学毕业证成绩单如何办理
一比一原版西悉尼大学毕业证成绩单如何办理
 
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
 
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一
如何办理滑铁卢大学毕业证(Waterloo毕业证)成绩单本科学位证原版一比一
 
Seven tools of quality control.slideshare
Seven tools of quality control.slideshareSeven tools of quality control.slideshare
Seven tools of quality control.slideshare
 
Heaps & its operation -Max Heap, Min Heap
Heaps & its operation -Max Heap, Min  HeapHeaps & its operation -Max Heap, Min  Heap
Heaps & its operation -Max Heap, Min Heap
 
NOAM AAUG Adobe Summit 2024: Summit Slam Dunks
NOAM AAUG Adobe Summit 2024: Summit Slam DunksNOAM AAUG Adobe Summit 2024: Summit Slam Dunks
NOAM AAUG Adobe Summit 2024: Summit Slam Dunks
 

Beyond Classical Information Retrieval (IR): Conceptual IR

  • 1. Karam Abdulahhad GESIS - Cologne karam.abdulahhad@gesis.org karam.abdulahhad@gmail.com Beyond Classical Information Retrieval (IR) Conceptual IR
  • 2. Linguistic phenomena & IR problems 20-12-2018GESIS - K.Abdulahhad2 How have “fiddles” changed over time Violins Like most technological breakthroughs, today's violin is an evolutionary product. So far as we know, there were no violins in 1500. A century later, there were several types and probably thousands of specimens north and south of the Alps, and from England to Poland. A marvel of craftsmanship and acoustical engineering, the violin produced more sound than any stringed instrument to date. Almost immediately, composers, players and collectors liked what they heard and saw. Italian and non-Italian makers proliferated. ……….
  • 3. Linguistic phenomena & IR problems 20-12-2018GESIS - K.Abdulahhad3 Historical information about “sugar river bank” History and Mission Statement ………… The Bank continues to grow at a healthy pace. We have continued to do well and be a leader in our industry. Our main branch was expanded in 1982 and we now have branches in Sunapee, New London, Warner, Grantham and Concord. We at Sugar River Bank are proud of our history and growth. It is the responsibility of each and every member of our Bank's family to insure continued growth in the future. ………… www.sugarriverbank.com
  • 4. Linguistic phenomena & IR problems 20-12-2018GESIS - K.Abdulahhad4 Historical information about “sugar river bank” The Life-Sustaining Sugar River ………… The west branch of the Sugar River historically supported a native trout population, but had suffered from sedimentation, overgrazing of its banks and warming water. “Restoration efforts in the Dane County portion of the watershed reduced nonpoint source pollution, installed riverbank vegetative filter strips, improved in- stream habitat, restricted cattle access to streams, and improved management of animal waste from barnyards,” says Hansis. ………… northwestquarterly.com
  • 5. Linguistic phenomena & IR problems 20-12-2018GESIS - K.Abdulahhad5 Part-Whole Hand Body Heteronyms Bank(com) Bank(geo) Hyponym / Hypernym B-cell Lymphocyte Synonyms Violin Fiddle Co-hyponym Cat Dog
  • 6. Observations 1. Inadequacy of the term-independence assumption, which leads to the term-mismatch problem 20-12-2018GESIS - K.Abdulahhad6
  • 7. Observations 1. Inadequacy of the term-independence assumption, which leads to the term-mismatch problem 2. Retrieval process has an inferential nature, where the classical word-based document-query comparison paradigm is insufficient 20-12-2018GESIS - K.Abdulahhad7
  • 11. Conceptual approach: concepts are categories encompassing all synonymous terms; concept IDs are used instead of terms. UMLS: atrial fibrillation, auricular fibrillation → C0004238; skin cancer, melanoma, malignant neoplasm of skin → C0004238. WordNet: ticker, watch → S04563183; cancer, malignant neoplastic disease → S14263400; snake, serpent, ophidian → S01729333.
  • 12. Part I: Relative Concept Frequency. [1] K. Abdulahhad et al., Revisiting the Term Frequency in Concept-Based IR Models. DEXA 2013. [2] K. Abdulahhad et al., MRIM at ImageCLEF2012. From Words to Concepts: A New Counting Approach. CLEF 2012.
  • 15. Relative Concept Frequency (problem): text-to-concept mapping, using MetaMap & UMLS concepts. Precision, Recall.
  • 16. Relative Concept Frequency (problem): document length. Word-space: $d$ = ‘lobar pneumonia x-ray’, $|d| = 3$. Concept-space: $|d| = ??$
  • 17. Relative Concept Frequency (idea): use all concepts while maintaining the word-based document length; structure-based redistribution of the word-based document length over the concepts.
  • 21. Relative Concept Frequency (how): computing the relative frequency. Hypothesis 1: concepts of a longer phrase receive a larger count (more specific meaning). Hypothesis 2: the larger the set of concepts for a phrase, the smaller the count each of its concepts receives (ambiguity). Hypothesis 3: the word-based $|d|$ is maintained.
  • 22. Computing Relative Concept Frequency (Step 1). Step 1: map text to concepts (e.g. via MetaMap). For ‘lobar pneumonia x-ray’, MetaMap gives the sub-phrases and their concept sets: $T_1$ ‘lobar pneumonia’: $C_1$ = {C0032300, C0155862}; $T_2$ ‘pneumonia x-ray’: $C_2$ = {C0581647}; $T_3$ ‘lobar’: $C_3$ = {C1511010, C1428707, C0796494}; $T_4$ ‘pneumonia’: $C_4$ = {C0024109, C1278908, C0032285, C2707265, C2709248}; $T_5$ ‘x-ray’: $C_5$ = {C0034571, C0043299, C0043309, C1306645, C1714805, C1962945}.
  • 23. Computing Relative Concept Frequency (Step 2). Step 2: build the hierarchy, with $(T_i, C_i) < (T_j, C_j) \Leftrightarrow T_i \subset T_j$. A virtual node $R$ is the root; its children are $(T_1, C_1)$ and $(T_2, C_2)$; the children of $(T_1, C_1)$ are $(T_3, C_3)$ and $(T_4, C_4)$; the children of $(T_2, C_2)$ are $(T_4, C_4)$ and $(T_5, C_5)$. Sub-phrases and concept sets are those of Step 1.
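The hierarchy of Step 2 can be sketched in a few lines of Python. This is only an illustration, not the author's implementation: it assumes that $T_i \subset T_j$ means "the words of $T_i$ form a proper contiguous sub-sequence of $T_j$", and the names `is_subphrase` and `build_hierarchy` are hypothetical.

```python
# Sketch of Step 2 (assumption: "T_i ⊂ T_j" is read as proper contiguous
# sub-sequence of words; the slides do not spell this out).

PHRASES = {
    "T1": ("lobar", "pneumonia"),
    "T2": ("pneumonia", "x-ray"),
    "T3": ("lobar",),
    "T4": ("pneumonia",),
    "T5": ("x-ray",),
}

def is_subphrase(small, big):
    """True if `small` is a proper contiguous sub-sequence of `big`."""
    n, m = len(small), len(big)
    if n >= m:
        return False
    return any(big[k:k + n] == small for k in range(m - n + 1))

def build_hierarchy(phrases):
    """Return {node: [direct children]}, with a virtual root named 'R'."""
    children = {name: [] for name in phrases}
    children["R"] = []
    for child, words in phrases.items():
        # Every phrase that strictly contains the child is an ancestor.
        ancestors = [p for p, w in phrases.items() if is_subphrase(words, w)]
        # Direct parents are the ancestors that contain no other ancestor.
        direct = [p for p in ancestors
                  if not any(is_subphrase(phrases[q], phrases[p])
                             for q in ancestors)]
        # Phrases without any parent hang under the virtual root.
        for p in direct or ["R"]:
            children[p].append(child)
    return children

HIERARCHY = build_hierarchy(PHRASES)
```

Note that ‘pneumonia’ ($T_4$) ends up with two parents, so the structure is a directed acyclic graph rather than a strict tree.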
  • 24. Computing Relative Concept Frequency (Step 3). Step 3: compute the relative frequency $rf_i$ by a breadth-first search over the hierarchy. The relative frequency $rf_i$ of $c \in C_i$ must be proportional to $|T_i|$ (Hypothesis 1) and inversely proportional to $|C_i|$ (Hypothesis 2). $|d|$ is maintained by distributing it over the concepts of $d$ (Hypothesis 3). Node sizes: $T_1$ ‘lobar pneumonia’: $|T_1| = 2$, $|C_1| = 2$; $T_2$ ‘pneumonia x-ray’: $|T_2| = 2$, $|C_2| = 1$; $T_3$ ‘lobar’: $|T_3| = 1$, $|C_3| = 3$; $T_4$ ‘pneumonia’: $|T_4| = 1$, $|C_4| = 5$; $T_5$ ‘x-ray’: $|T_5| = 1$, $|C_5| = 6$.
  • 25. Computing Relative Concept Frequency (Step 3). We distribute the $|d| = 3$ of the phrase ‘lobar pneumonia x-ray’ over its concepts, starting with the amount 3 at the virtual root $R$ of the hierarchy.
  • 26. Computing Relative Concept Frequency (Step 3). Step 3: computing the relative weight. For each node $(T_i, C_i)$ we compute three values. $\alpha_i$: the amount that should be distributed over the concepts of the current node $(T_i, C_i)$ and its children, $\alpha_i = \sum_{p \in parents(i)} \delta_p \times |T_i|$. $\delta_i$: the portion of the input amount $\alpha_i$ that goes to one single word, $\delta_i = \frac{\alpha_i}{|T_i| + \sum_{ch \in children(i)} |T_{ch}|}$. $\beta_i$, or equivalently $rf_i$: the relative frequency of each concept $c \in C_i$, $\beta_i = \frac{\delta_i \times |T_i|}{|C_i|}$.
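The three quantities can be checked with a short Python sketch on the ‘lobar pneumonia x-ray’ example. It is a minimal illustration under stated assumptions, not the author's code: the hierarchy is hard-coded from Step 2, the function name `relative_frequencies` is hypothetical, and nodes are visited in topological order (every parent before its children), which coincides with the breadth-first order on this example.

```python
from collections import deque
from fractions import Fraction

# |T_i| (number of words) and |C_i| (number of concepts) per node.
# "R" is the virtual root, with no words and no concepts.
T = {"R": 0, "T1": 2, "T2": 2, "T3": 1, "T4": 1, "T5": 1}
C = {"R": 0, "T1": 2, "T2": 1, "T3": 3, "T4": 5, "T5": 6}
CHILDREN = {"R": ["T1", "T2"], "T1": ["T3", "T4"], "T2": ["T4", "T5"],
            "T3": [], "T4": [], "T5": []}

def relative_frequencies(doc_len, T, C, children, root="R"):
    """Return {node: rf_i}, the relative frequency of each concept of node i."""
    parents = {n: [] for n in children}
    for p, kids in children.items():
        for k in kids:
            parents[k].append(p)

    alpha = {root: Fraction(doc_len)}   # amount entering each node
    delta, beta = {}, {}
    pending = {n: len(parents[n]) for n in children}
    queue = deque([root])
    while queue:
        n = queue.popleft()
        if n != root:
            # alpha_i = sum over parents of delta_parent * |T_i|
            alpha[n] = sum((delta[p] for p in parents[n]), Fraction(0)) * T[n]
        # delta_i = alpha_i / (|T_i| + sum of |T_child|)
        delta[n] = alpha[n] / (T[n] + sum(T[k] for k in children[n]))
        if C[n]:
            # beta_i = rf_i = delta_i * |T_i| / |C_i|
            beta[n] = delta[n] * T[n] / C[n]
        for k in children[n]:
            pending[k] -= 1
            if pending[k] == 0:     # all parents of k have been processed
                queue.append(k)
    return beta

RF = relative_frequencies(3, T, C, CHILDREN)
for node, value in RF.items():
    print(node, value)
```

Exact fractions are used so that the values can be compared directly with the ones on the slides.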
  • 27. Computing Relative Concept Frequency (Step 3). We distribute the $|d| = 3$ of the phrase ‘lobar pneumonia x-ray’ over its concepts. Root $R$ ($|T_R| = 0$, $|C_R| = 0$): $\alpha_R = 3$, $\delta_R = \frac{\alpha_R}{|T_R| + |T_1| + |T_2|} = \frac{3}{4}$.
  • 28. Computing Relative Concept Frequency (Step 3). Node $(T_1, C_1)$ ($|T_1| = 2$, $|C_1| = 2$): $\alpha_1 = \delta_R \times |T_1| = \frac{3}{2}$, $\delta_1 = \frac{\alpha_1}{|T_1| + |T_3| + |T_4|} = \frac{3}{8}$, $\beta_1 = \frac{\delta_1 \times |T_1|}{|C_1|} = \frac{3}{8}$.
  • 29. Computing Relative Concept Frequency (Step 3). Node $(T_2, C_2)$ ($|T_2| = 2$, $|C_2| = 1$): $\alpha_2 = \delta_R \times |T_2| = \frac{3}{2}$, $\delta_2 = \frac{\alpha_2}{|T_2| + |T_4| + |T_5|} = \frac{3}{8}$, $\beta_2 = \frac{\delta_2 \times |T_2|}{|C_2|} = \frac{3}{4}$.
  • 30. Computing Relative Concept Frequency (Step 3). Node $(T_3, C_3)$ ($|T_3| = 1$, $|C_3| = 3$): $\alpha_3 = \delta_1 \times |T_3| = \frac{3}{8}$, $\delta_3 = \frac{\alpha_3}{|T_3|} = \frac{3}{8}$, $\beta_3 = \frac{\delta_3 \times |T_3|}{|C_3|} = \frac{1}{8}$.
  • 31. Computing Relative Concept Frequency (Step 3). Node $(T_4, C_4)$ ($|T_4| = 1$, $|C_4| = 5$): $\alpha_4 = \delta_1 \times |T_4| + \delta_2 \times |T_4| = \frac{3}{4}$, $\delta_4 = \frac{\alpha_4}{|T_4|} = \frac{3}{4}$, $\beta_4 = \frac{\delta_4 \times |T_4|}{|C_4|} = \frac{3}{20}$.
  • 32. Computing Relative Concept Frequency (Step 3). Node $(T_5, C_5)$ ($|T_5| = 1$, $|C_5| = 6$): $\alpha_5 = \delta_2 \times |T_5| = \frac{3}{8}$, $\delta_5 = \frac{\alpha_5}{|T_5|} = \frac{3}{8}$, $\beta_5 = \frac{\delta_5 \times |T_5|}{|C_5|} = \frac{1}{16}$.
  • 33. Computing Relative Concept Frequency (Step 3). Result of distributing the $|d| = 3$ of the phrase ‘lobar pneumonia x-ray’ over its concepts ($rf_i$ is the value received by each concept of $C_i$): $T_1$ ‘lobar pneumonia’ ($|T_1| = 2$, $|C_1| = 2$): C0032300, C0155862, $rf_1 = \frac{3}{8}$; $T_2$ ‘pneumonia x-ray’ ($|T_2| = 2$, $|C_2| = 1$): C0581647, $rf_2 = \frac{3}{4}$; $T_3$ ‘lobar’ ($|T_3| = 1$, $|C_3| = 3$): C1511010, C1428707, C0796494, $rf_3 = \frac{1}{8}$; $T_4$ ‘pneumonia’ ($|T_4| = 1$, $|C_4| = 5$): C0024109, C1278908, C0032285, C2707265, C2709248, $rf_4 = \frac{3}{20}$; $T_5$ ‘x-ray’ ($|T_5| = 1$, $|C_5| = 6$): C0034571, C0043299, C0043309, C1306645, C1714805, C1962945, $rf_5 = \frac{1}{16}$.
  • 37. Computing Relative Concept Frequency (Step 3). From the resulting $rf_i$ values we can see that the concept of the least ambiguous and longest phrase has the highest frequency (‘pneumonia x-ray’, $\frac{3}{4}$), and that the concepts of the most ambiguous and shortest phrase have the lowest frequency (‘x-ray’, $\frac{1}{16}$). Summed over all concepts, $\sum rf_i = 3$, i.e. the word-based $|d|$ is maintained.
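As a worked check of the final sum, each $rf_i$ can be multiplied by the number of concepts $|C_i|$ that receive it, using the values from the example (the notation $|C_i| \cdot rf_i$ is introduced here only for this check):

```latex
\sum_{i=1}^{5} |C_i| \cdot rf_i
  = 2 \cdot \tfrac{3}{8} + 1 \cdot \tfrac{3}{4} + 3 \cdot \tfrac{1}{8}
    + 5 \cdot \tfrac{3}{20} + 6 \cdot \tfrac{1}{16}
  = \tfrac{3}{4} + \tfrac{3}{4} + \tfrac{3}{8} + \tfrac{3}{4} + \tfrac{3}{8}
  = 3 = |d|
```

The 17 concepts together therefore carry exactly the three words of the original phrase.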
  • 38. Relative Concept Frequency (results)  Corpora GESIS - K.Abdulahhad38 20-12-2018 104.26
  • 39. Relative Concept Frequency (results). (*) indicates a statistically significant ($\alpha < 0.05$) improvement w.r.t. the classical concept frequency (TF).
  • 40. Relative Concept Frequency (results) GESIS - K.Abdulahhad40 20-12-2018
  • 41. Relative Concept Frequency (conclusion): deals with the document length deformation; encouraging results (recall increases; precision is maintained or even increases); can be used with classical IR models by changing the TF component.
  • 42. 20-12-2018GESIS - K.Abdulahhad42 Part II: Concept Embedding [3] K. Abdulahhad, Concept embedding for information retrieval. ECIR 2018
  • 47. Concept embedding (problem): fiddle, violin → S04544161; skin cancer, melanoma → C0004238; B-cell (C0004561) is-a lymphocyte (C0024264). Relation-based concept similarity is problematic: relations have different semantics & properties, e.g. fiddle / violin (synonymous), B-cell / lymphocyte (is-a), hand / body (part-of).
  • 49. Concept embedding (idea): concepts as vectors; still using concepts to reduce the mismatch effect; avoiding the complexities of relation-based inter-concept similarity. Goal: check the adaptability of concept-embedding-based similarity to IR.
  • 50. Concept embedding (approaches). Flat embedding: $c = F(w_1, \dots, w_n)$.
  • 51. Concept embedding (approaches). Hierarchical embedding: $s_i = F(w_1^i, \dots, w_n^i)$, $t_j = F(s_1^j, \dots, s_m^j)$, $c = F(t_1, \dots, t_k)$.
  • 52. Concept embedding (approaches). Weighted embedding: $c = F(\alpha_1 w_1, \dots, \alpha_n w_n)$.
  • 53. Concept embedding (experiments) 20-12-2018GESIS - K.Abdulahhad53  Experiments consist of two parts  Generating concept embedding vectors  Testing a vector-based concept similarity for ad-hoc IR
  • 54. Concept embedding (experiments) 1. Generating concept embedding vectors. Word embedding: PubMed Central collection (vocabulary of 1,177,879 words); Word2Vec with vector size 500, continuous bag of words, window size 8, negative sampling 25.
  • 55. Concept embedding (experiments) 1. Generating concept embedding vectors. Concept embedding: UMLS2017 concepts (English content only); for each concept, we build the corresponding set of words. Flat embedding: replace $F$ by avg. Hierarchical embedding: replace $F$ by avg. Weighted embedding: replace $F$ by weighted-avg, where the weight $\alpha_w$ of a word $w$ is $\alpha_w = \ln\frac{N+1}{n}$, with $N$ the number of documents in PubMed Central and $n$ the document frequency of $w$ in PubMed Central.
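The three ways of building a concept vector can be sketched in plain Python. This is an illustration under assumptions, not the author's code: vectors are plain lists of floats, the weighted average is normalized by the sum of the weights (the slides only say "weighted-avg"), the intermediate levels $s_i$ and $t_j$ are treated as generic nested groups of words, and all function names are hypothetical.

```python
import math

def avg(vectors, weights=None):
    """Component-wise (optionally weighted) average of equal-length vectors."""
    vectors = list(vectors)
    if weights is None:
        weights = [1.0] * len(vectors)
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, vectors)) / total
            for k in range(dim)]

def idf_weight(n, N):
    """alpha_w = ln((N + 1) / n): n = document frequency, N = collection size."""
    return math.log((N + 1) / n)

def flat_embedding(words, word_vec):
    """c = F(w_1, ..., w_n) with F = avg."""
    return avg(word_vec[w] for w in words)

def weighted_embedding(words, word_vec, doc_freq, N):
    """c = F(alpha_1 w_1, ..., alpha_n w_n) with F = weighted average."""
    return avg((word_vec[w] for w in words),
               [idf_weight(doc_freq[w], N) for w in words])

def hierarchical_embedding(groups, word_vec):
    """c = F(t_1, ..., t_k), t_j = F(s_1, ..., s_m), s_i = F(w_1, ..., w_n).

    `groups` is a nested list: concept -> t_j -> s_i -> words.
    """
    return avg(avg(avg(word_vec[w] for w in s) for s in t) for t in groups)

# Toy two-dimensional word vectors, purely illustrative.
WORD_VEC = {"skin": [1.0, 0.0], "cancer": [0.0, 1.0]}
FLAT = flat_embedding(["skin", "cancer"], WORD_VEC)
WEIGHTED = weighted_embedding(["skin", "cancer"], WORD_VEC,
                              {"skin": 1, "cancer": 10}, N=99)
HIER = hierarchical_embedding([[["skin", "cancer"]], [["cancer"]]], WORD_VEC)
```

In the toy data the rarer word ‘skin’ gets twice the weight of ‘cancer’, so the weighted vector leans towards it, while the flat vector treats both words equally.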
  • 56. Concept embedding (experiments) 1. Generating concept embedding vectors. Concept embedding (missing words): fixed random vectors; several experiments for weighting missing words: the word is too popular, $n = N$ (low idf); the word is too rare, $n = 1$ (high idf); or in between, $n = N/2$.
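One possible reading of the missing-word handling is sketched below. Several details are assumptions: the slides do not say whether there is one fixed vector per missing word or a single shared one, nor how the random values are drawn; here each word gets its own reproducible vector, seeded from a hash of the word, with components drawn uniformly from [-1, 1]. The names `fixed_random_vector` and `missing_word_weight` are hypothetical.

```python
import hashlib
import math
import random

def fixed_random_vector(word, dim):
    """Reproducible random vector for a word without a word embedding.

    Assumption: seeding from a hash of the word keeps the vector fixed
    across runs; the sampling range [-1, 1] is arbitrary.
    """
    digest = hashlib.sha256(word.encode("utf-8")).digest()
    rng = random.Random(int.from_bytes(digest[:8], "big"))
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

def missing_word_weight(N, policy):
    """Weight ln((N + 1) / n) of a missing word under an assumed n.

    policy: 'popular' -> n = N, 'rare' -> n = 1, 'middle' -> n = N / 2.
    """
    assumed_n = {"popular": N, "rare": 1, "middle": N / 2}[policy]
    return math.log((N + 1) / assumed_n)
```

With any of the three policies the missing word can then enter the weighted average like any other word; only its influence on the concept vector changes.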
  • 57. Concept embedding (experiments) 2. Testing a vector-based concept similarity for ad-hoc IR 20-12-2018GESIS - K.Abdulahhad57  Corpora  clef11 & clef12  Text to concepts mapping  MetaMap  UMLS concepts
  • 61. Concept embedding (experiments) 2. Testing a vector-based concept similarity for ad-hoc IR. IR model and concept similarity: $RSV(d, q) = \sum_{c \in q} weight_q(c) \times sim(c, c^*) \times weight_d(c^*)$. Weight(c): BM25 and Pivoted Normalization. Concept similarity: $sim(c_i, c_j) = 0$ if $\cos\theta \le 0$, and $sim(c_i, c_j) = \beta \times (\cos\theta)^2$ otherwise. For comparison: Leacock.
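The scoring formula can be illustrated with a small Python sketch. It rests on assumptions the slides do not state: $c^*$ is taken to be the document concept most similar to the query concept $c$, $\theta$ is the angle between the two concept vectors, the garbled exponent is read as a square, and the BM25 or Pivoted Normalization weights are passed in as precomputed dictionaries. The function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def sim(u, v, beta=1.0):
    """0 when cos(theta) <= 0, beta * cos(theta)^2 otherwise."""
    c = cosine(u, v)
    return 0.0 if c <= 0 else beta * c ** 2

def rsv(query_weights, doc_weights, concept_vec, beta=1.0):
    """RSV(d, q) = sum over c in q of weight_q(c) * sim(c, c*) * weight_d(c*).

    Assumption: c* is the document concept with the highest sim to c.
    """
    score = 0.0
    if not doc_weights:
        return score
    for c, w_q in query_weights.items():
        best = max(doc_weights,
                   key=lambda cd: sim(concept_vec[c], concept_vec[cd], beta))
        score += (w_q * sim(concept_vec[c], concept_vec[best], beta)
                  * doc_weights[best])
    return score

# Toy concept vectors and weights, purely illustrative.
VEC = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [1.0, 1.0]}
SCORE = rsv({"a": 1.0}, {"b": 2.0, "c": 3.0}, VEC)
```

In the toy example the query concept `a` is orthogonal to `b` (similarity 0) and at 45 degrees to `c` (similarity 0.5), so the document is matched through `c` even though it never contains `a` itself.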
  • 62. Concept embedding (experiments) 2. Testing a vector-based concept similarity for ad-hoc IR. Results: (*) indicates a statistically significant ($\alpha < 0.05$) improvement w.r.t. “NoEmb-NoSim”; (†) indicates a statistically significant ($\alpha < 0.05$) improvement w.r.t. “NoEmb-Leacock”.
  • 63. Concept embedding (conclusion): three approaches to build concept vectors based on word embedding; promising results for vector-based concept representation and similarity; concepts and words are represented in the same vector space, so they are comparable, which could help improve approaches like MetaMap.
  • 65. Conclusion. Dealing with the two observations: inadequacy of the term-independence assumption; the retrieval process has an inferential nature. Conceptual IR: document length deformation; quantification of inter-concept relations.