Karam Abdulahhad
GESIS - Cologne
karam.abdulahhad@gesis.org
karam.abdulahhad@gmail.com
Beyond Classical Information Retrieval (IR)
Conceptual IR
Linguistic phenomena & IR problems
20-12-2018GESIS - K.Abdulahhad2
How have “fiddles” changed over time
Violins
Like most technological breakthroughs, today's
violin is an evolutionary product. So far as we
know, there were no violins in 1500. A century
later, there were several types and probably
thousands of specimens north and south of the
Alps, and from England to Poland. A marvel of
craftsmanship and acoustical engineering, the
violin produced more sound than any stringed
instrument to date. Almost immediately,
composers, players and collectors liked what
they heard and saw. Italian and non-Italian
makers proliferated.
……….
Linguistic phenomena & IR problems
Historical information about “sugar
river bank”
History and Mission Statement
…………
The Bank continues to grow at a healthy pace.
We have continued to do well and be a leader
in our industry. Our main branch was expanded
in 1982 and we now have branches in Sunapee,
New London, Warner, Grantham and Concord.
We at Sugar River Bank are proud of our
history and growth. It is the responsibility of
each and every member of our Bank's family to
insure continued growth in the future.
…………
www.sugarriverbank.com
Linguistic phenomena & IR problems
Historical information about “sugar
river bank”
The Life-Sustaining Sugar River
…………
The west branch of the Sugar River historically
supported a native trout population, but had
suffered from sedimentation, overgrazing of its
banks and warming water. “Restoration efforts
in the Dane County portion of the watershed
reduced nonpoint source pollution, installed
riverbank vegetative filter strips, improved in-stream
habitat, restricted cattle access to
streams, and improved management of animal
waste from barnyards,” says Hansis.
…………
northwestquarterly.com
Linguistic phenomena & IR problems
Relation | Example
Part-whole | Hand / Body
Homonyms | Bank(com) / Bank(geo)
Hyponym / Hypernym | B-cell / Lymphocyte
Synonyms | Violin / Fiddle
Co-hyponyms | Cat / Dog
Observations
1. Inadequacy of the term-independence assumption,
which leads to the term-mismatch problem
2. Retrieval process has an inferential nature, where the
classical word-based document-query comparison
paradigm is insufficient
Conceptual approach
• Concepts are categories encompassing all synonymous terms
• Using concept IDs instead of terms

Resource | Synonymous terms | Concept ID
WordNet | Ticker, Watch | S04563183
WordNet | Cancer, Malignant neoplastic disease | S14263400
WordNet | Snake, Serpent, Ophidian | S01729333
UMLS | Atrial fibrillation, Auricular fibrillation | C0004238
UMLS | Skin cancer, Melanoma, Malignant neoplasm of skin | C0004238
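To make the idea of matching on concept IDs concrete, here is a minimal Python sketch. The lookup table is a hypothetical stand-in filled with a few examples from the slide; a real system would obtain the IDs from a resource such as WordNet or UMLS. The function names are illustrative only.

```python
from __future__ import annotations

# Hypothetical term-to-concept lookup, filled from the slide's WordNet examples.
TERM_TO_CONCEPT = {
    "ticker": "S04563183",
    "watch": "S04563183",
    "snake": "S01729333",
    "serpent": "S01729333",
    "ophidian": "S01729333",
}


def to_concepts(text: str) -> set[str]:
    """Map each known word of `text` to its concept ID (unknown words are dropped)."""
    return {TERM_TO_CONCEPT[w] for w in text.lower().split() if w in TERM_TO_CONCEPT}


def word_overlap(query: str, doc: str) -> set[str]:
    """Classical word-based matching: shared surface forms."""
    return set(query.lower().split()) & set(doc.lower().split())


def concept_overlap(query: str, doc: str) -> set[str]:
    """Concept-based matching: shared concept IDs."""
    return to_concepts(query) & to_concepts(doc)


query = "serpent"
doc = "the snake shed its skin"
print(word_overlap(query, doc))     # the two texts share no word
print(concept_overlap(query, doc))  # but they share one concept ID
```

The word-based comparison finds nothing in common, while the concept-based comparison matches the two texts through their shared ID, which is the term-mismatch effect the conceptual approach aims to reduce.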
Part I: Relative Concept Frequency
[1] K. Abdulahhad et al., Revisiting the Term Frequency in Concept-Based IR Models. DEXA 2013
[2] K. Abdulahhad et al., MRIM at ImageCLEF2012. From Words to Concepts: A New Counting Approach. CLEF 2012
Relative Concept Frequency (problem)
• Text to concepts mapping
  • Using MetaMap & UMLS concepts
  • [Figure labels: Precision, Recall]
• Document length
  • Word-space: $d$ = ‘lobar pneumonia x-ray’, $|d| = 3$
  • Concept-space: $|d| = \,??$
Relative Concept Frequency (idea)
• Use all concepts while maintaining the word-based document length
• Structure-based redistribution of the word-based document length over the concepts
Relative Concept Frequency (how)
• Computing relative frequency
  • Hypothesis 1: the concepts of a longer phrase receive a larger count (more specific meaning)
  • Hypothesis 2: the larger the set of concepts of a phrase, the smaller the count each of its concepts receives (ambiguity)
  • Hypothesis 3: the word-based document length $|d|$ is maintained
Computing Relative Concept Frequency (Step 1)
• Step 1: map text to concepts (e.g. via MetaMap)
  • Input: ‘lobar pneumonia x-ray’

Sub-phrase | Concepts
$T_1$: ‘lobar pneumonia’ | $C_1$ = {C0032300, C0155862}
$T_2$: ‘pneumonia x-ray’ | $C_2$ = {C0581647}
$T_3$: ‘lobar’ | $C_3$ = {C1511010, C1428707, C0796494}
$T_4$: ‘pneumonia’ | $C_4$ = {C0024109, C1278908, C0032285, C2707265, C2709248}
$T_5$: ‘x-ray’ | $C_5$ = {C0034571, C0043299, C0043309, C1306645, C1714805, C1962945}
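The mapping step can be sketched in Python as follows. This is not MetaMap: the `LEXICON` dictionary is a hypothetical stand-in that simply hard-codes the concept IDs listed on the slide, and the helper names are illustrative. The sketch enumerates contiguous sub-phrases (longest first) and keeps those that have at least one candidate concept.

```python
from __future__ import annotations

# Hypothetical stand-in for a mapping tool: sub-phrase -> candidate concept IDs.
# The IDs are the ones listed on the slide for 'lobar pneumonia x-ray'.
LEXICON = {
    "lobar pneumonia": ["C0032300", "C0155862"],
    "pneumonia x-ray": ["C0581647"],
    "lobar": ["C1511010", "C1428707", "C0796494"],
    "pneumonia": ["C0024109", "C1278908", "C0032285", "C2707265", "C2709248"],
    "x-ray": ["C0034571", "C0043299", "C0043309", "C1306645", "C1714805", "C1962945"],
}


def sub_phrases(text: str) -> list[str]:
    """All contiguous word sequences of `text`, longest first."""
    words = text.split()
    n = len(words)
    return [
        " ".join(words[start:start + size])
        for size in range(n, 0, -1)
        for start in range(n - size + 1)
    ]


def map_to_concepts(text: str) -> dict[str, list[str]]:
    """Keep only the sub-phrases for which the lexicon knows some concept."""
    return {p: LEXICON[p] for p in sub_phrases(text) if p in LEXICON}


mapped = map_to_concepts("lobar pneumonia x-ray")
for phrase, concepts in mapped.items():
    print(phrase, len(phrase.split()), len(concepts))
```

The full three-word phrase has no entry in the lexicon, so only the five sub-phrases $T_1$ to $T_5$ survive, each with its number of words and its number of candidate concepts.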
Computing Relative Concept Frequency (Step 2)
• Step 2: build hierarchy
  $(T_i, C_i) < (T_j, C_j) \iff T_i \subset T_j$
• Resulting hierarchy for ‘lobar pneumonia x-ray’:
  • $R$ (virtual node)
    • $(T_1, C_1)$
      • $(T_3, C_3)$
      • $(T_4, C_4)$
    • $(T_2, C_2)$
      • $(T_4, C_4)$
      • $(T_5, C_5)$
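A minimal Python sketch of the hierarchy construction is given below. It assumes that phrase inclusion can be approximated by inclusion of word sets, which is a simplification chosen for this example; the node names and the function `build_hierarchy` are illustrative. Nodes that are contained in no other phrase are attached to the virtual root `R`.

```python
from __future__ import annotations

PHRASES = {
    "T1": "lobar pneumonia",
    "T2": "pneumonia x-ray",
    "T3": "lobar",
    "T4": "pneumonia",
    "T5": "x-ray",
}


def build_hierarchy(phrases: dict[str, str]) -> dict[str, list[str]]:
    """Return a map node -> direct children, with a virtual root "R".

    Assumption: T_i is included in T_j when the word set of T_i is a
    proper subset of the word set of T_j.
    """
    words = {name: set(text.split()) for name, text in phrases.items()}
    children: dict[str, list[str]] = {"R": []}
    for name in phrases:
        children[name] = []
    for i in phrases:
        supersets = [j for j in phrases if words[i] < words[j]]
        # Direct parents only: no other superset lies strictly between i and j.
        parents = [
            j for j in supersets
            if not any(words[k] < words[j] for k in supersets)
        ]
        for p in parents or ["R"]:
            children[p].append(i)
    return children


hierarchy = build_hierarchy(PHRASES)
for node, kids in hierarchy.items():
    print(node, "->", kids)
```

Note that ‘pneumonia’ ($T_4$) ends up with two parents, so the structure is a directed acyclic graph rather than a strict tree.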
Computing Relative Concept Frequency (Step 3)
• Step 3: compute relative frequency $rf_i$
  • Breadth-first search
  • The relative frequency $rf_i$ of $c \in C_i$ must be proportional to $|T_i|$ (Hypothesis 1) and inversely proportional to $|C_i|$ (Hypothesis 2)
  • Maintaining $|d|$ by distributing it over the concepts of $d$ (Hypothesis 3)

Sub-phrase | Words in $T_i$ | Concepts in $C_i$
$T_1$: ‘lobar pneumonia’ | 2 | 2
$T_2$: ‘pneumonia x-ray’ | 2 | 1
$T_3$: ‘lobar’ | 1 | 3
$T_4$: ‘pneumonia’ | 1 | 5
$T_5$: ‘x-ray’ | 1 | 6
• We distribute the $|d| = 3$ of the phrase ‘lobar pneumonia x-ray’ over its concepts, entering the hierarchy at the virtual node $R$
• Computing relative weight: for each node $(T_i, C_i)$ we compute three values
  • $\alpha_i$: the amount that should be distributed over the concepts of the current node $(T_i, C_i)$ and its children
    $\alpha_i = \sum_{p \in \mathrm{parents}(i)} \delta_p \times |T_i|$
  • $\delta_i$: the portion of the input amount $\alpha_i$ that goes to one single word
    $\delta_i = \frac{\alpha_i}{|T_i| + \sum_{ch \in \mathrm{children}(i)} |T_{ch}|}$
  • $\beta_i$, or equivalently $rf_i$: the relative frequency of each concept $c \in C_i$
    $\beta_i = \frac{\delta_i \times |T_i|}{|C_i|}$
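The three values can be computed in one breadth-first pass over the hierarchy. The following Python sketch is one possible reading of the formulas, not the author's implementation: the function name, the dictionaries and the rule "process a node once all its parents are done" are choices made for this example. Exact fractions are used so the results can be compared with the slides.

```python
from __future__ import annotations

from collections import deque
from fractions import Fraction

# Node -> (|T_i|, |C_i|) for the 'lobar pneumonia x-ray' example.
SIZES = {"R": (0, 0), "T1": (2, 2), "T2": (2, 1), "T3": (1, 3), "T4": (1, 5), "T5": (1, 6)}
# Node -> direct children; "R" is the virtual root.
CHILDREN = {"R": ["T1", "T2"], "T1": ["T3", "T4"], "T2": ["T4", "T5"],
            "T3": [], "T4": [], "T5": []}


def relative_frequencies(doc_len, sizes, children, root="R"):
    """Return node -> beta_i (= rf_i), the frequency of each concept of the node."""
    parents = {n: [] for n in sizes}
    for p, kids in children.items():
        for kid in kids:
            parents[kid].append(p)

    alpha = {root: Fraction(doc_len)}   # the root receives the whole |d|
    delta, beta = {}, {}
    pending = {n: len(ps) for n, ps in parents.items()}  # parents not yet processed
    queue = deque([root])
    while queue:
        i = queue.popleft()
        t_i, c_i = sizes[i]
        if i != root:
            # alpha_i = sum over parents of delta_parent * |T_i|
            alpha[i] = sum(delta[p] for p in parents[i]) * t_i
        # delta_i = alpha_i / (|T_i| + sum of |T_child|)
        delta[i] = alpha[i] / (t_i + sum(sizes[ch][0] for ch in children[i]))
        # beta_i = delta_i * |T_i| / |C_i| (the virtual root has no concepts)
        beta[i] = delta[i] * t_i / c_i if c_i else Fraction(0)
        for ch in children[i]:
            pending[ch] -= 1
            if pending[ch] == 0:        # enqueue once every parent is done
                queue.append(ch)
    return beta


rf = relative_frequencies(3, SIZES, CHILDREN)
for node in ["T1", "T2", "T3", "T4", "T5"]:
    print(node, rf[node])
print("total:", sum(rf[n] * SIZES[n][1] for n in rf))
```

Multiplying each $rf_i$ by the number of concepts $|C_i|$ and summing over all nodes gives back the word-based length $|d| = 3$, which is what Hypothesis 3 asks for.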
• Worked computation for ‘lobar pneumonia x-ray’, node by node (breadth-first from $R$):
  • $R$ ($|T_R| = 0$, $|C_R| = 0$): $\alpha_R = 3$, $\delta_R = \frac{\alpha_R}{|T_R| + |T_1| + |T_2|} = \frac{3}{4}$
  • $(T_1, C_1)$ ($|T_1| = 2$, $|C_1| = 2$): $\alpha_1 = \delta_R \times |T_1| = \frac{3}{2}$, $\delta_1 = \frac{\alpha_1}{|T_1| + |T_3| + |T_4|} = \frac{3}{8}$, $\beta_1 = \frac{\delta_1 \times |T_1|}{|C_1|} = \frac{3}{8}$
  • $(T_2, C_2)$ ($|T_2| = 2$, $|C_2| = 1$): $\alpha_2 = \delta_R \times |T_2| = \frac{3}{2}$, $\delta_2 = \frac{\alpha_2}{|T_2| + |T_4| + |T_5|} = \frac{3}{8}$, $\beta_2 = \frac{\delta_2 \times |T_2|}{|C_2|} = \frac{3}{4}$
  • $(T_3, C_3)$ ($|T_3| = 1$, $|C_3| = 3$): $\alpha_3 = \delta_1 \times |T_3| = \frac{3}{8}$, $\delta_3 = \frac{\alpha_3}{|T_3|} = \frac{3}{8}$, $\beta_3 = \frac{\delta_3 \times |T_3|}{|C_3|} = \frac{1}{8}$
  • $(T_4, C_4)$ ($|T_4| = 1$, $|C_4| = 5$): $\alpha_4 = \delta_1 \times |T_4| + \delta_2 \times |T_4| = \frac{3}{4}$, $\delta_4 = \frac{\alpha_4}{|T_4|} = \frac{3}{4}$, $\beta_4 = \frac{\delta_4 \times |T_4|}{|C_4|} = \frac{3}{20}$
  • $(T_5, C_5)$ ($|T_5| = 1$, $|C_5| = 6$): $\alpha_5 = \delta_2 \times |T_5| = \frac{3}{8}$, $\delta_5 = \frac{\alpha_5}{|T_5|} = \frac{3}{8}$, $\beta_5 = \frac{\delta_5 \times |T_5|}{|C_5|} = \frac{1}{16}$
Sub-phrase | Words in $T_i$ | Concepts in $C_i$ | Concept IDs | $rf_i$ per concept
$T_1$: ‘lobar pneumonia’ | 2 | 2 | C0032300, C0155862 | $\frac{3}{8}$
$T_2$: ‘pneumonia x-ray’ | 2 | 1 | C0581647 | $\frac{3}{4}$
$T_3$: ‘lobar’ | 1 | 3 | C1511010, C1428707, C0796494 | $\frac{1}{8}$
$T_4$: ‘pneumonia’ | 1 | 5 | C0024109, C1278908, C0032285, C2707265, C2709248 | $\frac{3}{20}$
$T_5$: ‘x-ray’ | 1 | 6 | C0034571, C0043299, C0043309, C1306645, C1714805, C1962945 | $\frac{1}{16}$

• From this table, we can see that the concepts of the least ambiguous and longest phrases have the highest frequency
• Concepts of the most ambiguous and shortest phrases have the lowest frequency
• Summed over all concepts, $\sum rf_i = 3$, i.e. the word-based $|d|$ is maintained
Relative Concept Frequency (results)
• Corpora
  [Table: corpora]
• Results
  [Table: results]
  (*) indicates a statistically significant ($\alpha < 0.05$) improvement w.r.t. the classical concept frequency (TF)
Relative Concept Frequency (conclusion)
• Dealing with the document length deformation
• Encouraging results
  • Increase recall
  • Maintain or even increase the precision
• Can be used with classical IR models
  • Change the (TF) component
Part II: Concept Embedding
[3] K. Abdulahhad, Concept embedding for information retrieval. ECIR 2018
Concept embedding (problem)
• Synonymous terms share one concept ID
  • fiddle, violin: S04544161
  • skin cancer, melanoma: C0004238
• Related terms map to different concepts
  • B-cell: C0004561, lymphocyte: C0024264, linked by is-a
• Relation-based concept similarity is problematic
• Relations have different semantics & properties
  • fiddle / violin: synonymous
  • B-cell / lymphocyte: is-a
  • hand / body: part-of
Concept embedding (idea)
• Concepts as vectors
  • Still using concepts to reduce the mismatch effect
  • Avoiding the complexities of relation-based inter-concept similarity
• Goal: check the adaptability of concept-embedding-based similarity to IR
Concept embedding (approaches)
• Flat embedding
  $c = F(w_1, \dots, w_n)$
• Hierarchical embedding
  $s_i = F(w_1^i, \dots, w_n^i)$
  $t_j = F(s_1^j, \dots, s_m^j)$
  $c = F(t_1, \dots, t_k)$
• Weighted embedding
  $c = F(\alpha_1 w_1, \dots, \alpha_n w_n)$
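The flat and hierarchical compositions can be sketched in a few lines of Python. The sketch assumes that $F$ is a component-wise average, which is only one possible choice, and uses toy 2-dimensional vectors; the function names and the nested-list layout for the $t_j$ / $s_i$ / $w$ levels are assumptions made for this example.

```python
from typing import Callable, List, Sequence

Vector = List[float]


def average(vectors: Sequence[Vector]) -> Vector:
    """Component-wise mean; one possible choice for the composition function F."""
    n = len(vectors)
    return [sum(component) / n for component in zip(*vectors)]


def flat_embedding(word_vectors: Sequence[Vector], F: Callable = average) -> Vector:
    """c = F(w_1, ..., w_n): compose all word vectors of the concept at once."""
    return F(word_vectors)


def hierarchical_embedding(levels, F: Callable = average) -> Vector:
    """c = F(t_1, ..., t_k), with t_j = F(s_1, ..., s_m) and s_i = F(w_1, ..., w_n).

    `levels` is a list of t-groups; each t-group is a list of s-groups;
    each s-group is a list of word vectors.
    """
    t_vectors = [F([F(s_group) for s_group in t_group]) for t_group in levels]
    return F(t_vectors)


# Toy 2-dimensional word vectors (made up for illustration).
flat = flat_embedding([[1.0, 0.0], [0.0, 1.0]])
hier = hierarchical_embedding([
    [[[1.0, 0.0], [0.0, 1.0]], [[1.0, 1.0]]],  # t_1 built from two s-groups
    [[[0.0, 0.0]]],                            # t_2 built from one s-group
])
print(flat, hier)
```

With the hierarchical variant, a group that contains many words does not dominate the result, because each level is averaged before it is passed upwards.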
Concept embedding (experiments)
• Experiments consist of two parts
  • Generating concept embedding vectors
  • Testing a vector-based concept similarity for ad-hoc IR
1. Generating concept embedding vectors
• Word embedding
  • PubMed Central collection (vocabulary of 1,177,879 words)
  • Word2Vec
    • Vector size: 500
    • Continuous bag of words
    • Window size: 8
    • Negative sampling: 25
• Concept embedding
  • UMLS2017 concepts (only English content)
  • For each concept, we build the corresponding set of words
  • Flat embedding: replace $F$ by avg
  • Hierarchical embedding: replace $F$ by avg
  • Weighted embedding: replace $F$ by weighted-avg
    • The weight $\alpha_w$ of a word $w$ is $\alpha_w = \ln\frac{N+1}{n}$
    • $N$ is the number of documents in PubMed Central
    • $n$ is the document frequency of $w$ in PubMed Central
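As a small illustration of the weighted variant, the Python sketch below computes the idf-style weight $\alpha_w = \ln\frac{N+1}{n}$ and a weighted average of word vectors. Normalising by the sum of the weights is an assumption of this sketch (the slide only says "weighted-avg"), and the vectors and counts are made up.

```python
import math
from typing import List, Sequence

Vector = List[float]


def idf_weight(n_docs: int, doc_freq: float) -> float:
    """alpha_w = ln((N + 1) / n) for a word with document frequency n."""
    return math.log((n_docs + 1) / doc_freq)


def weighted_average(vectors: Sequence[Vector], weights: Sequence[float]) -> Vector:
    """Weighted mean of the vectors, normalised by the total weight (assumption)."""
    total = sum(weights)
    return [
        sum(w * x for w, x in zip(weights, component)) / total
        for component in zip(*vectors)
    ]


# Toy example: a rare word (n = 1) and a frequent word (n = 50) among N = 99 documents.
alphas = [idf_weight(99, 1), idf_weight(99, 50)]
concept = weighted_average([[1.0, 0.0], [0.0, 1.0]], alphas)
print(alphas, concept)
```

The rare word receives the larger weight, so the concept vector is pulled towards it.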
• Concept embedding (missing words)
  • Fixed random vectors
  • Several experiments for weighting missing words
    • The word is too popular: $n = N$ (poor idf)
    • The word is too rare: $n = 1$ (high idf)
    • Or in between: $n = N/2$
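One way to read "fixed random vectors" is that a word without a trained embedding always receives the same pseudo-random vector. The Python sketch below implements that reading by seeding a generator with the word itself; the uniform range, the function names and the strategy labels are assumptions of this example, not details from the slides.

```python
import math
import random
from typing import List


def fixed_random_vector(word: str, dim: int) -> List[float]:
    """Pseudo-random vector that is always the same for a given word."""
    rng = random.Random(word)  # seeding with the word makes the vector reproducible
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def missing_word_weight(n_docs: int, strategy: str) -> float:
    """Weight ln((N + 1) / n) under the three assumed document frequencies."""
    assumed_n = {"popular": n_docs, "rare": 1, "between": n_docs / 2}[strategy]
    return math.log((n_docs + 1) / assumed_n)


vec = fixed_random_vector("pneumonitis", 4)
weights = {s: missing_word_weight(1000, s) for s in ("popular", "between", "rare")}
print(vec)
print(weights)
```

Treating a missing word as popular gives it almost no influence on the concept vector, while treating it as rare gives it the largest possible weight.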
2. Testing a vector-based concept similarity for ad-hoc IR
• Corpora
  • clef11 & clef12
• Text to concepts mapping
  • MetaMap
  • UMLS concepts
• IR model and concept similarity
  $RSV(d, q) = \sum_{c \in q} weight_q(c) \times sim(c, c^*) \times weight_d(c^*)$
  • Weight(c): BM25 and Pivoted Normalization
  • Concept similarity
    $sim(c_i, c_j) = \begin{cases} 0 & \cos\theta \le 0 \\ \beta \times \cos^2\theta & \text{otherwise} \end{cases}$
  • For comparison (Leacock)
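The scoring function can be sketched in Python as follows. Two points are assumptions of this sketch rather than statements from the slides: $c^*$ is taken to be the document concept most similar to the query concept $c$, and the exponent is read as the square of the cosine. The weights are plain numbers here; in the experiments they would come from BM25 or Pivoted Normalization.

```python
import math
from typing import Dict, List

Vector = List[float]


def cosine(u: Vector, v: Vector) -> float:
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def sim(u: Vector, v: Vector, beta: float = 1.0) -> float:
    """0 when cos(theta) <= 0, otherwise beta * cos(theta)^2."""
    cos = cosine(u, v)
    return 0.0 if cos <= 0 else beta * cos ** 2


def rsv(query_w: Dict[str, float], doc_w: Dict[str, float],
        vectors: Dict[str, Vector], beta: float = 1.0) -> float:
    """Sum over query concepts of weight_q(c) * sim(c, c*) * weight_d(c*)."""
    if not doc_w:
        return 0.0
    score = 0.0
    for c, w_q in query_w.items():
        # Assumption: c* is the document concept closest to c.
        c_star = max(doc_w, key=lambda x: sim(vectors[c], vectors[x], beta))
        score += w_q * sim(vectors[c], vectors[c_star], beta) * doc_w[c_star]
    return score


# Toy concept vectors and weights (made up for illustration).
vectors = {"q1": [1.0, 0.0], "d1": [1.0, 0.0], "d2": [0.0, 1.0]}
score = rsv({"q1": 2.0}, {"d1": 3.0, "d2": 1.0}, vectors)
print(score)
```

Clipping negative cosines to zero keeps unrelated or opposed concepts from contributing to the score, and $\beta$ scales how much a non-identical concept match is trusted.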
• Results
  (*) indicates a statistically significant ($\alpha < 0.05$) improvement w.r.t. “NoEmb-NoSim”
  (†) indicates a statistically significant ($\alpha < 0.05$) improvement w.r.t. “NoEmb-Leacock”
Concept embedding (conclusion)
• Three approaches to build concept vectors based on word embedding
• Promising results for vector-based concept representation and similarity
• Concepts and words are represented in the same vector space
  • They are comparable
  • Improve approaches like MetaMap
Conclusion
• Dealing with the two observations
  • Inadequacy of the term-independence assumption
  • Retrieval process has an inferential nature
• Conceptual IR
  • Document length deformation
  • Inter-concept relations quantification
Thank you …

More Related Content

Similar to Beyond Classical Information Retrieval (IR): Conceptual IR

sigmod-keynote.pdf
sigmod-keynote.pdfsigmod-keynote.pdf
sigmod-keynote.pdf
ssuser56e850
 
V.8.0-Emerging Frontiers and Future Directions for Predictive Analytics
V.8.0-Emerging Frontiers and Future Directions for Predictive AnalyticsV.8.0-Emerging Frontiers and Future Directions for Predictive Analytics
V.8.0-Emerging Frontiers and Future Directions for Predictive Analytics
Elinor Velasquez
 
From health persona to societal health uci 131202
From health persona to societal health  uci  131202From health persona to societal health  uci  131202
From health persona to societal health uci 131202
Ramesh Jain
 

Similar to Beyond Classical Information Retrieval (IR): Conceptual IR (20)

sigmod-keynote.pdf
sigmod-keynote.pdfsigmod-keynote.pdf
sigmod-keynote.pdf
 
Teaching & Learning with Technology TLT 2016
Teaching & Learning with Technology TLT 2016Teaching & Learning with Technology TLT 2016
Teaching & Learning with Technology TLT 2016
 
Simplicial closure and higher-order link prediction --- SIAMNS18
Simplicial closure and higher-order link prediction --- SIAMNS18Simplicial closure and higher-order link prediction --- SIAMNS18
Simplicial closure and higher-order link prediction --- SIAMNS18
 
V.8.0-Emerging Frontiers and Future Directions for Predictive Analytics
V.8.0-Emerging Frontiers and Future Directions for Predictive AnalyticsV.8.0-Emerging Frontiers and Future Directions for Predictive Analytics
V.8.0-Emerging Frontiers and Future Directions for Predictive Analytics
 
Tim Brown ACEAS Phenocams
Tim Brown ACEAS PhenocamsTim Brown ACEAS Phenocams
Tim Brown ACEAS Phenocams
 
Deep red - The environmental impact of deep learning (Paolo Caressa)
Deep red - The environmental impact of deep learning (Paolo Caressa)Deep red - The environmental impact of deep learning (Paolo Caressa)
Deep red - The environmental impact of deep learning (Paolo Caressa)
 
Dgpg college kanpur_2015
Dgpg college kanpur_2015Dgpg college kanpur_2015
Dgpg college kanpur_2015
 
From health persona to societal health uci 131202
From health persona to societal health  uci  131202From health persona to societal health  uci  131202
From health persona to societal health uci 131202
 
Doing Scientific Investigations W2D3.pptx
Doing Scientific Investigations W2D3.pptxDoing Scientific Investigations W2D3.pptx
Doing Scientific Investigations W2D3.pptx
 
NLP support for clinical tasks and decisions
NLP support for clinical tasks and decisionsNLP support for clinical tasks and decisions
NLP support for clinical tasks and decisions
 
Deep Learning for Food Analysis
Deep Learning for Food Analysis Deep Learning for Food Analysis
Deep Learning for Food Analysis
 
"The data revolution", par Serena Capital
"The data revolution", par Serena Capital"The data revolution", par Serena Capital
"The data revolution", par Serena Capital
 
The Data Revolution - Serena Capital
The Data Revolution - Serena CapitalThe Data Revolution - Serena Capital
The Data Revolution - Serena Capital
 
QMC: Transition Workshop - Selected Highlights from the Probabilistic Numeric...
QMC: Transition Workshop - Selected Highlights from the Probabilistic Numeric...QMC: Transition Workshop - Selected Highlights from the Probabilistic Numeric...
QMC: Transition Workshop - Selected Highlights from the Probabilistic Numeric...
 
Thailand Policy Foresight in Covid-19 Era
Thailand Policy Foresight in Covid-19 EraThailand Policy Foresight in Covid-19 Era
Thailand Policy Foresight in Covid-19 Era
 
Big Data in Biomedicine: Where is the NIH Headed
Big Data in Biomedicine: Where is the NIH HeadedBig Data in Biomedicine: Where is the NIH Headed
Big Data in Biomedicine: Where is the NIH Headed
 
Bigdata AI
Bigdata AI Bigdata AI
Bigdata AI
 
Democratising biodiversity and genomics research: open and citizen science to...
Democratising biodiversity and genomics research: open and citizen science to...Democratising biodiversity and genomics research: open and citizen science to...
Democratising biodiversity and genomics research: open and citizen science to...
 
CHAPTER 14 CLUSTERING.PPTX
CHAPTER 14 CLUSTERING.PPTXCHAPTER 14 CLUSTERING.PPTX
CHAPTER 14 CLUSTERING.PPTX
 
15/3 -17 impact exponential technologies
15/3 -17 impact exponential technologies 15/3 -17 impact exponential technologies
15/3 -17 impact exponential technologies
 

Recently uploaded

Abortion pills in Riyadh Saudi Arabia| +966572737505 | Get Cytotec, Unwanted Kit
Abortion pills in Riyadh Saudi Arabia| +966572737505 | Get Cytotec, Unwanted KitAbortion pills in Riyadh Saudi Arabia| +966572737505 | Get Cytotec, Unwanted Kit
Abortion pills in Riyadh Saudi Arabia| +966572737505 | Get Cytotec, Unwanted Kit
Abortion pills in Riyadh +966572737505 get cytotec
 
sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444
saurabvyas476
 
Client Researchhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh.pptx
Client Researchhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh.pptxClient Researchhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh.pptx
Client Researchhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh.pptx
Stephen266013
 
obat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di Bontang
obat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di  Bontangobat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di  Bontang
obat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di Bontang
siskavia95
 
bams-3rd-case-presentation-scabies-12-05-2020.pptx
bams-3rd-case-presentation-scabies-12-05-2020.pptxbams-3rd-case-presentation-scabies-12-05-2020.pptx
bams-3rd-case-presentation-scabies-12-05-2020.pptx
JocylDuran
 
Displacement, Velocity, Acceleration, and Second Derivatives
Displacement, Velocity, Acceleration, and Second DerivativesDisplacement, Velocity, Acceleration, and Second Derivatives
Displacement, Velocity, Acceleration, and Second Derivatives
23050636
 
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
acoha1
 

Recently uploaded (20)

Introduction to Statistics Presentation.pptx
Introduction to Statistics Presentation.pptxIntroduction to Statistics Presentation.pptx
Introduction to Statistics Presentation.pptx
 
Abortion pills in Riyadh Saudi Arabia| +966572737505 | Get Cytotec, Unwanted Kit
Abortion pills in Riyadh Saudi Arabia| +966572737505 | Get Cytotec, Unwanted KitAbortion pills in Riyadh Saudi Arabia| +966572737505 | Get Cytotec, Unwanted Kit
Abortion pills in Riyadh Saudi Arabia| +966572737505 | Get Cytotec, Unwanted Kit
 
Digital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham WareDigital Transformation Playbook by Graham Ware
Digital Transformation Playbook by Graham Ware
 
Case Study 4 Where the cry of rebellion happen?
Case Study 4 Where the cry of rebellion happen?Case Study 4 Where the cry of rebellion happen?
Case Study 4 Where the cry of rebellion happen?
 
sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444sourabh vyas1222222222222222222244444444
sourabh vyas1222222222222222222244444444
 
DBMS UNIT 5 46 CONTAINS NOTES FOR THE STUDENTS
DBMS UNIT 5 46 CONTAINS NOTES FOR THE STUDENTSDBMS UNIT 5 46 CONTAINS NOTES FOR THE STUDENTS
DBMS UNIT 5 46 CONTAINS NOTES FOR THE STUDENTS
 
Client Researchhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh.pptx
Client Researchhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh.pptxClient Researchhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh.pptx
Client Researchhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhhh.pptx
 
社内勉強会資料_Object Recognition as Next Token Prediction
社内勉強会資料_Object Recognition as Next Token Prediction社内勉強会資料_Object Recognition as Next Token Prediction
社内勉強会資料_Object Recognition as Next Token Prediction
 
Predictive Precipitation: Advanced Rain Forecasting Techniques
Predictive Precipitation: Advanced Rain Forecasting TechniquesPredictive Precipitation: Advanced Rain Forecasting Techniques
Predictive Precipitation: Advanced Rain Forecasting Techniques
 
obat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di Bontang
obat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di  Bontangobat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di  Bontang
obat aborsi Bontang wa 082135199655 jual obat aborsi cytotec asli di Bontang
 
Unsatisfied Bhabhi ℂall Girls Vadodara Book Esha 7427069034 Top Class ℂall Gi...
Unsatisfied Bhabhi ℂall Girls Vadodara Book Esha 7427069034 Top Class ℂall Gi...Unsatisfied Bhabhi ℂall Girls Vadodara Book Esha 7427069034 Top Class ℂall Gi...
Unsatisfied Bhabhi ℂall Girls Vadodara Book Esha 7427069034 Top Class ℂall Gi...
 
bams-3rd-case-presentation-scabies-12-05-2020.pptx
bams-3rd-case-presentation-scabies-12-05-2020.pptxbams-3rd-case-presentation-scabies-12-05-2020.pptx
bams-3rd-case-presentation-scabies-12-05-2020.pptx
 
Displacement, Velocity, Acceleration, and Second Derivatives
Displacement, Velocity, Acceleration, and Second DerivativesDisplacement, Velocity, Acceleration, and Second Derivatives
Displacement, Velocity, Acceleration, and Second Derivatives
 
How to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data AnalyticsHow to Transform Clinical Trial Management with Advanced Data Analytics
How to Transform Clinical Trial Management with Advanced Data Analytics
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24  Building Real-Time Pipelines With FLaNKDATA SUMMIT 24  Building Real-Time Pipelines With FLaNK
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK
 
Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...
Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...
Statistics Informed Decisions Using Data 5th edition by Michael Sullivan solu...
 
Las implicancias del memorándum de entendimiento entre Codelco y SQM según la...
Las implicancias del memorándum de entendimiento entre Codelco y SQM según la...Las implicancias del memorándum de entendimiento entre Codelco y SQM según la...
Las implicancias del memorándum de entendimiento entre Codelco y SQM según la...
 
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
如何办理(UPenn毕业证书)宾夕法尼亚大学毕业证成绩单本科硕士学位证留信学历认证
 
Northern New England Tableau User Group (TUG) May 2024
Northern New England Tableau User Group (TUG) May 2024Northern New England Tableau User Group (TUG) May 2024
Northern New England Tableau User Group (TUG) May 2024
 
DAA Assignment Solution.pdf is the best1
DAA Assignment Solution.pdf is the best1DAA Assignment Solution.pdf is the best1
DAA Assignment Solution.pdf is the best1
 

Beyond Classical Information Retrieval (IR): Conceptual IR

  • 1. Karam Abdulahhad GESIS - Cologne karam.abdulahhad@gesis.org karam.abdulahhad@gmail.com Beyond Classical Information Retrieval (IR) Conceptual IR
  • 2. Linguistic phenomena & IR problems (GESIS, K. Abdulahhad, 20-12-2018). Query: How have “fiddles” changed over time? Document, “Violins”: Like most technological breakthroughs, today's violin is an evolutionary product. So far as we know, there were no violins in 1500. A century later, there were several types and probably thousands of specimens north and south of the Alps, and from England to Poland. A marvel of craftsmanship and acoustical engineering, the violin produced more sound than any stringed instrument to date. Almost immediately, composers, players and collectors liked what they heard and saw. Italian and non-Italian makers proliferated. …
  • 3. Linguistic phenomena & IR problems. Query: Historical information about “sugar river bank”. Document, “History and Mission Statement” (www.sugarriverbank.com): … The Bank continues to grow at a healthy pace. We have continued to do well and be a leader in our industry. Our main branch was expanded in 1982 and we now have branches in Sunapee, New London, Warner, Grantham and Concord. We at Sugar River Bank are proud of our history and growth. It is the responsibility of each and every member of our Bank's family to insure continued growth in the future. …
  • 4. Linguistic phenomena & IR problems. Query: Historical information about “sugar river bank”. Document, “The Life-Sustaining Sugar River” (northwestquarterly.com): … The west branch of the Sugar River historically supported a native trout population, but had suffered from sedimentation, overgrazing of its banks and warming water. “Restoration efforts in the Dane County portion of the watershed reduced nonpoint source pollution, installed riverbank vegetative filter strips, improved in-stream habitat, restricted cattle access to streams, and improved management of animal waste from barnyards,” says Hansis. …
  • 5. Linguistic phenomena & IR problems. Part-whole: hand / body. Homonyms: bank (com) / bank (geo). Hyponym / hypernym: B-cell / lymphocyte. Synonyms: violin / fiddle. Co-hyponyms: cat / dog.
  • 7. Observations. 1. Inadequacy of the term-independence assumption, which leads to the term-mismatch problem. 2. The retrieval process has an inferential nature, for which the classical word-based document-query comparison paradigm is insufficient.
  • 11. Conceptual approach. Concepts are categories encompassing all synonymous terms; concept IDs are used instead of terms. WordNet examples: ticker, watch → S04563183; cancer, malignant neoplastic disease → S14263400; snake, serpent, ophidian → S01729333. UMLS examples: atrial fibrillation, auricular fibrillation → C0004238; skin cancer, melanoma, malignant neoplasm of skin → C0004238.
  • 12. Part I: Relative Concept Frequency. [1] K. Abdulahhad et al., Revisiting the Term Frequency in Concept-Based IR Models. DEXA 2013. [2] K. Abdulahhad et al., MRIM at ImageCLEF2012. From Words to Concepts: A New Counting Approach. CLEF 2012.
  • 15. Relative Concept Frequency (problem). Text-to-concepts mapping using MetaMap & UMLS concepts. (Figure labels: Precision, Recall.)
  • 16. Relative Concept Frequency (problem). Document length: in word space, $d$ = ‘lobar pneumonia x-ray’ has $|d| = 3$; in concept space, $|d| = ?$
  • 17. Relative Concept Frequency (idea). Use all concepts while maintaining the word-based document length, through a structure-based redistribution of the word-based document length over the concepts.
  • 21. Relative Concept Frequency (how). Computing relative frequency. Hypothesis 1: concepts of a larger phrase receive a larger count (more specific meaning). Hypothesis 2: the bigger the set of concepts for a phrase, the smaller the count its concepts receive (ambiguity). Hypothesis 3: the word-based document length $|d|$ is maintained.
  • 22. Computing Relative Concept Frequency (Step 1). Step 1: map text to concepts (e.g. via MetaMap). MetaMap maps ‘lobar pneumonia x-ray’ to the following sub-phrases and concept sets: $T_1$ = ‘lobar pneumonia’, $C_1 = \{C0032300, C0155862\}$; $T_2$ = ‘pneumonia x-ray’, $C_2 = \{C0581647\}$; $T_3$ = ‘lobar’, $C_3 = \{C1511010, C1428707, C0796494\}$; $T_4$ = ‘pneumonia’, $C_4 = \{C0024109, C1278908, C0032285, C2707265, C2709248\}$; $T_5$ = ‘x-ray’, $C_5 = \{C0034571, C0043299, C0043309, C1306645, C1714805, C1962945\}$.
  • 23. Computing Relative Concept Frequency (Step 2). Step 2: build the hierarchy, where $(T_i, C_i) < (T_j, C_j) \Leftrightarrow T_i \subset T_j$. A virtual node $R$ is the root, above the nodes $(T_1, C_1)$, $(T_2, C_2)$, $(T_3, C_3)$, $(T_4, C_4)$ and $(T_5, C_5)$ built from the sub-phrases and concept sets of Step 1.
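The hierarchy of Step 2 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: sub-phrase inclusion $T_i \subset T_j$ is read as "contiguous word sequence of", each node is linked only to its smallest enclosing phrases, and the names `is_proper_subphrase` and `build_hierarchy` are hypothetical helpers, not part of MetaMap or of the original work.

```python
def is_proper_subphrase(inner, outer):
    """True if `inner` is a shorter, contiguous word sequence of `outer`."""
    a, b = tuple(inner.split()), tuple(outer.split())
    if len(a) >= len(b):
        return False
    return any(b[i:i + len(a)] == a for i in range(len(b) - len(a) + 1))


def build_hierarchy(phrases, root="R"):
    """Map every node to its direct children; `root` is the virtual node."""
    children = {root: []}
    for name in phrases:
        children[name] = []
    for name, text in phrases.items():
        # All phrases that strictly contain this one.
        supers = [o for o, t in phrases.items() if is_proper_subphrase(text, t)]
        # Keep only the smallest enclosing phrases (direct parents).
        direct = [
            p for p in supers
            if not any(is_proper_subphrase(phrases[q], phrases[p])
                       for q in supers if q != p)
        ]
        # Phrases with no enclosing phrase hang under the virtual root.
        for parent in direct or [root]:
            children[parent].append(name)
    return children


phrases = {
    "T1": "lobar pneumonia",
    "T2": "pneumonia x-ray",
    "T3": "lobar",
    "T4": "pneumonia",
    "T5": "x-ray",
}
hierarchy = build_hierarchy(phrases)
```

For the running example, `hierarchy` places `T1` and `T2` under `R`, `T3` and `T4` under `T1`, and `T4` and `T5` under `T2`, so ‘pneumonia’ has two parents.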
  • 24. Computing Relative Concept Frequency (Step 3). Step 3: compute the relative frequency $rf_i$ with a breadth-first search over the hierarchy. The relative frequency $rf_i$ of $c \in C_i$ must be proportional to $|T_i|$ (Hypothesis 1) and inversely proportional to $|C_i|$ (Hypothesis 2). $|d|$ is maintained by distributing it over the concepts of $d$ (Hypothesis 3). Node sizes: $T_1$ ‘lobar pneumonia’, $|T_1| = 2$, $|C_1| = 2$; $T_2$ ‘pneumonia x-ray’, $|T_2| = 2$, $|C_2| = 1$; $T_3$ ‘lobar’, $|T_3| = 1$, $|C_3| = 3$; $T_4$ ‘pneumonia’, $|T_4| = 1$, $|C_4| = 5$; $T_5$ ‘x-ray’, $|T_5| = 1$, $|C_5| = 6$.
  • 25. Computing Relative Concept Frequency (Step 3). We distribute the length $|d| = 3$ of the phrase ‘lobar pneumonia x-ray’ over its concepts, starting from the virtual root $R$.
  • 26. Computing Relative Concept Frequency (Step 3). Step 3: computing the relative weight. For each node $(T_i, C_i)$ we compute three values. $\alpha_i$, the amount that should be distributed over the concepts of the current node $(T_i, C_i)$ and its children: $\alpha_i = \sum_{p \in parents(i)} \delta_p \times |T_i|$. $\delta_i$, the portion of the input amount $\alpha_i$ given to one single word: $\delta_i = \frac{\alpha_i}{|T_i| + \sum_{ch \in children(i)} |T_{ch}|}$. $\beta_i$, or equivalently $rf_i$, the relative frequency of each concept $c \in C_i$: $\beta_i = \frac{\delta_i \times |T_i|}{|C_i|}$.
  • 27. Computing Relative Concept Frequency (Step 3). We distribute the length $|d| = 3$ of the phrase ‘lobar pneumonia x-ray’ over its concepts. Node sizes: $|T_R| = 0$, $|C_R| = 0$; $|T_1| = 2$, $|C_1| = 2$; $|T_2| = 2$, $|C_2| = 1$; $|T_3| = 1$, $|C_3| = 3$; $|T_4| = 1$, $|C_4| = 5$; $|T_5| = 1$, $|C_5| = 6$. Root $R$: $\alpha_R = 3$, $\delta_R = \frac{\alpha_R}{|T_R| + |T_1| + |T_2|} = \frac{3}{4}$.
  • 28. Computing Relative Concept Frequency (Step 3). Node $(T_1, C_1)$: $\alpha_1 = \delta_R \times |T_1| = \frac{3}{2}$, $\delta_1 = \frac{\alpha_1}{|T_1| + |T_3| + |T_4|} = \frac{3}{8}$, $\beta_1 = \frac{\delta_1 \times |T_1|}{|C_1|} = \frac{3}{8}$.
  • 29. Computing Relative Concept Frequency (Step 3). Node $(T_2, C_2)$: $\alpha_2 = \delta_R \times |T_2| = \frac{3}{2}$, $\delta_2 = \frac{\alpha_2}{|T_2| + |T_4| + |T_5|} = \frac{3}{8}$, $\beta_2 = \frac{\delta_2 \times |T_2|}{|C_2|} = \frac{3}{4}$.
  • 30. Computing Relative Concept Frequency (Step 3). Node $(T_3, C_3)$: $\alpha_3 = \delta_1 \times |T_3| = \frac{3}{8}$, $\delta_3 = \frac{\alpha_3}{|T_3|} = \frac{3}{8}$, $\beta_3 = \frac{\delta_3 \times |T_3|}{|C_3|} = \frac{1}{8}$.
  • 31. Computing Relative Concept Frequency (Step 3). Node $(T_4, C_4)$: $\alpha_4 = \delta_1 \times |T_4| + \delta_2 \times |T_4| = \frac{3}{4}$, $\delta_4 = \frac{\alpha_4}{|T_4|} = \frac{3}{4}$, $\beta_4 = \frac{\delta_4 \times |T_4|}{|C_4|} = \frac{3}{20}$.
  • 32. Computing Relative Concept Frequency (Step 3). Node $(T_5, C_5)$: $\alpha_5 = \delta_2 \times |T_5| = \frac{3}{8}$, $\delta_5 = \frac{\alpha_5}{|T_5|} = \frac{3}{8}$, $\beta_5 = \frac{\delta_5 \times |T_5|}{|C_5|} = \frac{1}{16}$.
  • 33. Computing Relative Concept Frequency (Step 3). Resulting relative frequency $rf_i$ of each concept: $T_1$ ‘lobar pneumonia’ ($|T_1| = 2$, $|C_1| = 2$; C0032300, C0155862): $\frac{3}{8}$. $T_2$ ‘pneumonia x-ray’ ($|T_2| = 2$, $|C_2| = 1$; C0581647): $\frac{3}{4}$. $T_3$ ‘lobar’ ($|T_3| = 1$, $|C_3| = 3$; C1511010, C1428707, C0796494): $\frac{1}{8}$. $T_4$ ‘pneumonia’ ($|T_4| = 1$, $|C_4| = 5$; C0024109, C1278908, C0032285, C2707265, C2709248): $\frac{3}{20}$. $T_5$ ‘x-ray’ ($|T_5| = 1$, $|C_5| = 6$; C0034571, C0043299, C0043309, C1306645, C1714805, C1962945): $\frac{1}{16}$.
  • 37. Computing Relative Concept Frequency (Step 3). From the table of relative frequencies, we can see that the concepts of the least ambiguous, longest phrase have the highest frequency ($T_2$ ‘pneumonia x-ray’: 2 words, 1 concept, $rf = \frac{3}{4}$), while the concepts of the most ambiguous, shortest phrase have the lowest frequency ($T_5$ ‘x-ray’: 1 word, 6 concepts, $rf = \frac{1}{16}$). Summed over all concepts, $\sum rf_i = 3$, which is the word-based document length $|d|$.
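The Step 3 walk-through can be reproduced with a short Python sketch. It assumes the hierarchy is given explicitly as a parent-to-children mapping with a single virtual root, and it uses exact fractions so the results can be compared directly with the slide values; the function name `relative_concept_frequency` and the dictionary layout are illustrative choices, not the author's implementation.

```python
from collections import deque
from fractions import Fraction


def relative_concept_frequency(doc_len, n_words, n_concepts, children):
    """Return (alpha, delta, rf) per node, visiting parents before children."""
    parents = {node: [] for node in n_words}
    for parent, kids in children.items():
        for kid in kids:
            parents[kid].append(parent)

    alpha, delta, rf = {}, {}, {}
    waiting = {node: len(parents[node]) for node in n_words}
    queue = deque(node for node in n_words if waiting[node] == 0)
    while queue:
        node = queue.popleft()
        if parents[node]:
            # alpha_i = sum over parents of delta_parent * |T_i|
            alpha[node] = sum(delta[p] for p in parents[node]) * n_words[node]
        else:
            # The virtual root receives the whole word-based document length.
            alpha[node] = Fraction(doc_len)
        # delta_i = alpha_i / (|T_i| + sum of |T_child|)
        share = n_words[node] + sum(n_words[k] for k in children.get(node, []))
        delta[node] = alpha[node] / share if share else Fraction(0)
        # beta_i = rf_i = delta_i * |T_i| / |C_i| (the root has no concepts)
        if n_concepts[node]:
            rf[node] = delta[node] * n_words[node] / n_concepts[node]
        for kid in children.get(node, []):
            waiting[kid] -= 1
            if waiting[kid] == 0:
                queue.append(kid)
    return alpha, delta, rf


n_words = {"R": 0, "T1": 2, "T2": 2, "T3": 1, "T4": 1, "T5": 1}
n_concepts = {"R": 0, "T1": 2, "T2": 1, "T3": 3, "T4": 5, "T5": 6}
children = {"R": ["T1", "T2"], "T1": ["T3", "T4"], "T2": ["T4", "T5"]}

alpha, delta, rf = relative_concept_frequency(3, n_words, n_concepts, children)
total = sum(rf[node] * n_concepts[node] for node in rf)
```

Here `rf` holds 3/8, 3/4, 1/8, 3/20 and 1/16 for `T1` to `T5`, and `total` equals 3, the word-based length of ‘lobar pneumonia x-ray’.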
  • 38. Relative Concept Frequency (results). Corpora (table not preserved in the transcript; the only surviving value is 104.26).
  • 39. Relative Concept Frequency (results). (*) indicates a statistically significant improvement ($\alpha < 0.05$) w.r.t. the classical concept frequency (TF).
  • 40. Relative Concept Frequency (results).
  • 41. Relative Concept Frequency (conclusion). Deals with the document length deformation. Encouraging results: increases recall, and maintains or even increases precision. Can be used with classical IR models by changing the TF component.
  • 42. Part II: Concept Embedding. [3] K. Abdulahhad, Concept embedding for information retrieval. ECIR 2018.
  • 47. Concept embedding (problem). Synonymous terms share a concept: fiddle, violin → S04544161; skin cancer, melanoma → C0004238. Related terms map to different concepts linked by a relation: B-cell (C0004561) is-a lymphocyte (C0024264). Relation-based concept similarity is problematic, because relations have different semantics and properties: synonymy (fiddle, violin), is-a (B-cell, lymphocyte), part-of (hand, body).
  • 49. Concept embedding (idea). Concepts as vectors: still using concepts to reduce the mismatch effect, while avoiding the complexities of relation-based inter-concept similarity. Goal: check the adaptability of concept-embedding-based similarity to IR.
  • 50. Concept embedding (approaches). Flat embedding: the concept vector is built directly from its word vectors, $c = F(w_1, \cdots, w_n)$.
  • 51. Concept embedding (approaches). Hierarchical embedding: the concept vector is built level by level, $s_i = F(w_1^i, \cdots, w_n^i)$, $t_j = F(s_1^j, \cdots, s_m^j)$, $c = F(t_1, \cdots, t_k)$.
  • 52. Concept embedding (approaches). Weighted embedding: each word vector is weighted, $c = F(\alpha_1 w_1, \cdots, \alpha_n w_n)$.
  • 53. Concept embedding (experiments). The experiments consist of two parts: generating concept embedding vectors, and testing a vector-based concept similarity for ad-hoc IR.
  • 54. Concept embedding (experiments). 1. Generating concept embedding vectors. Word embedding: Word2Vec trained on the PubMed Central collection (vocabulary of 1,177,879 words), with vector size 500, continuous bag of words, window size 8 and negative sampling 25.
  • 55. Concept embedding (experiments). 1. Generating concept embedding vectors. Concept embedding: UMLS2017 concepts (English content only). For each concept, we build the corresponding set of words. Flat embedding: replace $F$ by avg. Hierarchical embedding: replace $F$ by avg. Weighted embedding: replace $F$ by weighted-avg, where the weight $\alpha_w$ of a word $w$ is $\alpha_w = \ln\frac{N+1}{n}$, with $N$ the number of documents in PubMed Central and $n$ the document frequency of $w$ in PubMed Central.
  • 56. Concept embedding (experiments). 1. Generating concept embedding vectors. Concept embedding (missing words): missing words receive fixed random vectors. Several experiments for weighting missing words: the word is treated as too popular, $n = N$ (poor idf); as too rare, $n = 1$ (high idf); or in between, $n = N/2$.
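The three ways of building a concept vector can be sketched in plain Python. Several details below are assumptions made only for illustration: the weighted average is normalized by the sum of the weights, the random vector of a missing word is drawn uniformly from [-1, 1] with a generator seeded by the word itself, and the helper names (`average`, `flat_embedding`, `hierarchical_embedding`, `weighted_embedding`) are hypothetical.

```python
import math
import random


def average(vectors, weights=None):
    """Component-wise (optionally weighted) average of equal-length vectors."""
    if weights is None:
        weights = [1.0] * len(vectors)
    total = sum(weights)
    dim = len(vectors[0])
    return [sum(w * v[k] for w, v in zip(weights, vectors)) / total
            for k in range(dim)]


def vector_of(word, word_vectors, dim):
    """Known words use their embedding; missing words get a fixed random vector."""
    if word in word_vectors:
        return word_vectors[word]
    rng = random.Random(word)  # same word, same vector on every call
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]


def flat_embedding(words, word_vectors, dim):
    """c = avg(w_1, ..., w_n)."""
    return average([vector_of(w, word_vectors, dim) for w in words])


def hierarchical_embedding(node, word_vectors, dim):
    """A string is a word; a list is averaged over its embedded children."""
    if isinstance(node, str):
        return vector_of(node, word_vectors, dim)
    return average([hierarchical_embedding(n, word_vectors, dim) for n in node])


def weighted_embedding(words, word_vectors, doc_freq, n_docs, missing_df, dim):
    """c = weighted-avg with alpha_w = ln((N + 1) / n)."""
    weights = [math.log((n_docs + 1) / doc_freq.get(w, missing_df))
               for w in words]
    return average([vector_of(w, word_vectors, dim) for w in words], weights)


toy_vectors = {"skin": [1.0, 0.0], "cancer": [0.0, 1.0]}
flat = flat_embedding(["skin", "cancer"], toy_vectors, dim=2)
nested = hierarchical_embedding([["skin", "cancer"], ["skin"]], toy_vectors, dim=2)
weighted = weighted_embedding(["skin", "cancer"], toy_vectors,
                              doc_freq={"skin": 2, "cancer": 1},
                              n_docs=3, missing_df=1, dim=2)
```

The `missing_df` argument plays the role of the assumed document frequency of a missing word: passing `n_docs`, `1` or `n_docs / 2` corresponds to the three weighting experiments listed above.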
  • 57. Concept embedding (experiments). 2. Testing a vector-based concept similarity for ad-hoc IR. Corpora: clef11 & clef12. Text-to-concepts mapping: MetaMap with UMLS concepts.
  • 61. Concept embedding (experiments). 2. Testing a vector-based concept similarity for ad-hoc IR. IR model and concept similarity: $RSV(d, q) = \sum_{c \in q} weight_q(c) \times sim(c, c^*) \times weight_d(c^*)$. Weight(c): BM25 and Pivoted Normalization. Concept similarity: $sim(c_i, c_j) = 0$ if $\cos\theta \le 0$, and $sim(c_i, c_j) = \beta \times (\cos\theta)^2$ otherwise. For comparison: Leacock.
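A minimal Python sketch of this scoring scheme follows. The slides do not define $c^*$; the sketch assumes it is the document concept most similar to the query concept $c$, and it reads the similarity as $\beta$ times the squared cosine. The weights are passed in as plain numbers standing in for BM25 or Pivoted Normalization values, and all names are illustrative.

```python
import math


def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def sim(u, v, beta=1.0):
    """0 for non-positive cosine, beta * cos^2 otherwise (assumed reading)."""
    cos = cosine(u, v)
    return 0.0 if cos <= 0 else beta * cos ** 2


def rsv(query_weights, doc_weights, vectors, beta=1.0):
    """Sum over query concepts of weight_q(c) * sim(c, c*) * weight_d(c*)."""
    score = 0.0
    if not doc_weights:
        return score
    for c, w_q in query_weights.items():
        # Assumption: c* is the document concept most similar to c.
        best = max(doc_weights,
                   key=lambda dc: sim(vectors[c], vectors[dc], beta))
        score += w_q * sim(vectors[c], vectors[best], beta) * doc_weights[best]
    return score


toy = {"A": [1.0, 0.0], "B": [0.0, 1.0], "C": [1.0, 1.0]}
score = rsv({"A": 2.0}, {"B": 1.0, "C": 3.0}, toy)
```

In the toy call, concept `A` is orthogonal to `B` and at 45 degrees to `C`, so `C` is selected as $c^*$ and contributes a squared cosine of about 0.5.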
  • 62. Concept embedding (experiments). 2. Testing a vector-based concept similarity for ad-hoc IR. Results: (*) indicates a statistically significant improvement ($\alpha < 0.05$) w.r.t. “NoEmb-NoSim”; (†) indicates a statistically significant improvement ($\alpha < 0.05$) w.r.t. “NoEmb-Leacock”.
  • 63. Concept embedding (conclusion). Three approaches to build concept vectors based on word embedding. Promising results for vector-based concept representation and similarity. Concepts and words are represented in the same vector space, so they are comparable, which could improve approaches like MetaMap.
  • 65. Conclusion. Dealing with the two observations: the inadequacy of the term-independence assumption, and the inferential nature of the retrieval process. Conceptual IR: document length deformation, and quantification of inter-concept relations.