SlideShare a Scribd company logo
Looking at the Long Tail
2nd Spinoza Workshop
#SpinozaLongTail
It is time to move
from “Big Data”
to “small data”
Language as
a system and
language use
● The world changes rapidly (Brexit)
● The language system changes slowly
(Nexit)
● The relation between the two changes
constantly
Small data example
Imagine you visit a dear friend for a game of chess. During the game you complain
about your white queen being captured too early. While chatting, your friend tells
you that his Ana is now already 13 years old, and is beginning with high school in
September. After the game, you offer to buy him a beer in O’Neil’s, which is just a
2-3 minutes walk.
Small data example
Imagine [2] you visit [11] a dear [8] friend [5] for a game [14] of chess [2]. During
the game [14] you complain [2] about your white [25] queen [12] being captured
[11] too early. While chatting [4], your friend [5] tells [9] you that his Ana [42] is
now already 13 years [4] old [9], and is beginning [11] with [high school [1]] in
September. After the game [14], you offer [16] to buy [6] him a beer [1] in
O’Neil’s[5], which is just a 2-3 minutes [8] walk [17].
2*11*8*5*14*2*14*2*25*12*11*4*5*9*42*4*9*11*1*14*16*6*1*5*8*17
= 2,1185E36 possible interpretations and still missing some
If a machine joined the conversation, what would it
understand?
Probably, it would think that you talk about: capturing “The White Queen”
TV-series, the ANA airways based in Japan being 13 years old, and the sport
equipment store O’Neil where you apparently can get a beer (or the recently
retired famous basketball player Shaq O’Neil).
An interpretation that may make sense from a Big Data perspective but that does
not make any sense as a combination! Where is the coherence?
Understanding of language is about the Long Tail
with many, many small data niches
“Today, a 6 year old has seen less data and read less
language than most machines, but still these machines make
mistakes that the 6 year old will never make.”
How to create semantic tasks/challenges:
● that force systems to use more
‘intelligence’,
● to understand small data and its details,
● without knowing in advance what these
details are.
Looking at the Long Tail
Looking at the Long Tail
Practicalities
Schedule overview
09:30-10:05 Welcome & Introduction
10:05-12:25 Invited talks
12:25-13:25 Lunch
13:25-14:05 Keynote
14:05-17:45 Practical session
17:45-18:00 Wrap-up
18:00- Drinks
Food & Beverages
● Coffee
○ Two official coffee breaks (one in the morning + one in the afternoon)
○ Coffee machines available in the hall at any time
● Lunch
○ Will be brought to the forum
● Drinks
○ After the workshop
○ In a nice place nearby the VU (follow us :-) )
#SpinozaLongTail
the Long Tail
Dinos born in different generations live in different worlds, i.e. what they know
about the world depends on the time they live in.
t
World(t)
Every generation of dinos has its own Ronaldo
t
Datasets as Time-bound World Proxies
t
Big Data can be a bad Proxy
t
The datasets fail to represent everything in the world,
and over-represent some things.
The Long Tail Phenomenon of Disambiguation (1)
In theory, any lexical expression can refer to any meaning of the world at that time
(and vice versa).
People may be able to use the full range of expressions and interpretation of a
language. But, they are not aware of it and only use a specific set in relation to a
real life situation.
This balances the trade-off between:
● using many different expressions, and
● resolving extreme ambiguity of a small set of expressions
● contextual competition (are you stating the obvious)
The Long Tail Phenomenon of Disambiguation (2)
The distribution of the lexical expressions and the denoted meanings in evaluation
datasets both follow Zipf’s Law.
But physically there is no Zipfian distribution between the meanings.
Any of these worlds contains a set of unique concepts and instances,
each existing physically exactly once.
On instance level, any entity, concept, or event is as prominent as any
other.
Tendencies between the long tails of expressions
and meanings
Dataset Analysis
Our analysis shows that the existing disambiguation datasets exhibit:
● Low ambiguity (~1.0)
● Low variance (~1.0)
● High dominance
● Temporal bias towards data from 1961 or 1990s
Semantic overfitting: what `world' do we consider when evaluating disambiguation of text? (Under review)
Datasets: How old is the data?
Semantic overfitting: what `world' do we consider when evaluating disambiguation of text? (Under review)
Datasets: How dominant is the data?
Marieke van Erp, Pablo Mendes, Heiko Paulheim, Filip Ilievski, Julien Plu, Giuseppe Rizzo and Joerg Waitelonis
(2016). Evaluating Entity Linking: An Analysis of Current Benchmark Datasets and a Roadmap for Doing a Better
Job. In Proceedings of LREC 2016.
Problem Statement: The Long Tail DeTail
Disambiguation datasets contain a lot from the “head”, and only
accidental details from the “tail”.
Hence, our machines are very good at identifying popular objects (the
“head”) whereas their performance is extremely low on unpopular objects
(the “tail”).
This is further complicated by the fact that the popularity is determined by
the context: topic, time, location, community.
Systems: Our analysis
For WSD, we tried to improve on the disambiguation of the less frequent senses
Outcome: better performance on the tail causes worse performance on the head!
Marten Postma, Ruben Izquierdo, Eneko Agirre, German Rigau, Piek Vossen (2016).
Addressing the MFS Bias in WSD systems. In Proceedings of LREC 2016.
We aim to propose this task as a “Long Tail Shared Disambiguation Task” to the
next call for SemEval-2018 tasks, which is expected late 2016/early 2017.
In addition, we plan to propose a workshop for ACL 2017.
Goal(s) of the workshop
Goal(s) of the workshop
We want systems to resolve extreme ambiguity within specific context:
Discriminate one context from another
Use more semantics: coherence, logic, inferencing, comprehensive
Combine semantic layers and subtasks
Use the complete document and more (external knowledge)
Answer more complex questions, e.g. quantification and identity
Be smarter than a 6 year old
Can explain why something is an answer
We do not want: yet another domain task
Looking at the Long Tail: What can be done?
Systems Resources
Evaluation
Datasets
#1 Datasets
What kind of datasets are needed for the long tail disambiguation task?
● Properties
● Multi-task
● Are current ones sufficient?
● Optimal acquisition methods
#1 Datasets: Property-driven data
MEANTIME -> Multi-task corpus
Anne-Lyse Minard, Manuela Speranza, Ruben Urizar, Begoña Altuna, Marieke van Erp, Anneleen Schoen and Chantal van
Son (2016). MEANTIME, the NewsReader Multilingual Event and Time Corpus. In Proceedings of LREC 2016.
Dutch SemCor -> Balanced WSD corpus
Piek Vossen, Rubén Izquierdo, and Attila Görög (2013). DutchSemCor: in quest of the ideal sense-tagged corpus. RANLP.
ECB+ -> Increased ambiguity for event coreference
Agata Cybulska and Piek Vossen (2014). Using a sledgehammer to crack a nut? Lexical diversity and event coreference
resolution. In LREC 2014.
#2 Resources
What kind of knowledge is needed for the long tail disambiguation task?
● Long tail knowledge bases
● Contextual knowledge bases
● Locating appropriate knowledge
#3 Evaluation
How should we evaluate the long tail disambiguation task(s)?
● Optimal evaluation metric(s)
● Generalizability over disambiguation tasks
● Incentivizing context- and long tail-aware systems
#4 Systems
What are the requirements for a system to perform well on the long tail
disambiguation task(s)?
● Existing systems
● Multi-task approach
● Long tail performance
● Sustainable systems
Thanks!

More Related Content

Similar to 2nd Spinoza workshop: Looking at the Long Tail - introductory slides

How To Make Outline For Essay
How To Make Outline For EssayHow To Make Outline For Essay
How To Make Outline For Essay
Julia Slater
 
Language And Culture Essay
Language And Culture EssayLanguage And Culture Essay
Language And Culture Essay
Germaine Newman
 
Broad Twitter Corpus: A Diverse Named Entity Recognition Resource
Broad Twitter Corpus: A Diverse Named Entity Recognition ResourceBroad Twitter Corpus: A Diverse Named Entity Recognition Resource
Broad Twitter Corpus: A Diverse Named Entity Recognition Resource
Leon Derczynski
 
Knowledge = Information + Context
Knowledge = Information + ContextKnowledge = Information + Context
Knowledge = Information + Context
Stefan Gradmann
 
Sample Of A Cause And Effect Essay
Sample Of A Cause And Effect EssaySample Of A Cause And Effect Essay
Sample Of A Cause And Effect Essay
Kathy Murray
 
What Shakespeare Taught Us About Visualization and Data Science
What Shakespeare Taught Us About Visualization and Data ScienceWhat Shakespeare Taught Us About Visualization and Data Science
What Shakespeare Taught Us About Visualization and Data Science
gleicher
 
How to supervise a thesis in NLP in the ChatGPT era? By Laure Soulier
How to supervise a thesis in NLP in the ChatGPT era? By Laure SoulierHow to supervise a thesis in NLP in the ChatGPT era? By Laure Soulier
How to supervise a thesis in NLP in the ChatGPT era? By Laure Soulier
Paris Women in Machine Learning and Data Science
 
BDS14 Big Data Analytics to the masses
BDS14 Big Data Analytics to the massesBDS14 Big Data Analytics to the masses
BDS14 Big Data Analytics to the masses
Jose Luis Lopez Pino
 
Minimal viable-datareuse-czi
Minimal viable-datareuse-cziMinimal viable-datareuse-czi
Minimal viable-datareuse-czi
Paul Groth
 
Individual functional atlasing of the human brain with multitask fMRI data: l...
Individual functional atlasing of the human brain with multitask fMRI data: l...Individual functional atlasing of the human brain with multitask fMRI data: l...
Individual functional atlasing of the human brain with multitask fMRI data: l...
Ana Luísa Pinho
 
An-Exploration-of-scientific-literature-using-Natural-Language-Processing
An-Exploration-of-scientific-literature-using-Natural-Language-ProcessingAn-Exploration-of-scientific-literature-using-Natural-Language-Processing
An-Exploration-of-scientific-literature-using-Natural-Language-Processing
Theodore J. LaGrow
 
LabClass1_pt1.ppt
LabClass1_pt1.pptLabClass1_pt1.ppt
LabClass1_pt1.ppt
AshenafiGebre4
 
NLP_guest_lecture.pdf
NLP_guest_lecture.pdfNLP_guest_lecture.pdf
NLP_guest_lecture.pdf
Soha82
 
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
Karthik Murugesan
 
Data management for researchers
Data management for researchersData management for researchers
Data management for researchers
Dirk Roorda
 
Use Your Words: Content Strategy to Influence Behavior
Use Your Words: Content Strategy to Influence BehaviorUse Your Words: Content Strategy to Influence Behavior
Use Your Words: Content Strategy to Influence Behavior
Liz Danzico
 
Open University - TU100 Day school 1
Open University - TU100 Day school 1Open University - TU100 Day school 1
Open University - TU100 Day school 1
Sarah Horrigan-Fullard
 
Big Data and Natural Language Processing
Big Data and Natural Language ProcessingBig Data and Natural Language Processing
Big Data and Natural Language Processing
Michel Bruley
 
Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.
Paul Groth
 
How machines learn to talk. Machine Learning for Conversational AI
How machines learn to talk. Machine Learning for Conversational AIHow machines learn to talk. Machine Learning for Conversational AI
How machines learn to talk. Machine Learning for Conversational AI
Verena Rieser
 

Similar to 2nd Spinoza workshop: Looking at the Long Tail - introductory slides (20)

How To Make Outline For Essay
How To Make Outline For EssayHow To Make Outline For Essay
How To Make Outline For Essay
 
Language And Culture Essay
Language And Culture EssayLanguage And Culture Essay
Language And Culture Essay
 
Broad Twitter Corpus: A Diverse Named Entity Recognition Resource
Broad Twitter Corpus: A Diverse Named Entity Recognition ResourceBroad Twitter Corpus: A Diverse Named Entity Recognition Resource
Broad Twitter Corpus: A Diverse Named Entity Recognition Resource
 
Knowledge = Information + Context
Knowledge = Information + ContextKnowledge = Information + Context
Knowledge = Information + Context
 
Sample Of A Cause And Effect Essay
Sample Of A Cause And Effect EssaySample Of A Cause And Effect Essay
Sample Of A Cause And Effect Essay
 
What Shakespeare Taught Us About Visualization and Data Science
What Shakespeare Taught Us About Visualization and Data ScienceWhat Shakespeare Taught Us About Visualization and Data Science
What Shakespeare Taught Us About Visualization and Data Science
 
How to supervise a thesis in NLP in the ChatGPT era? By Laure Soulier
How to supervise a thesis in NLP in the ChatGPT era? By Laure SoulierHow to supervise a thesis in NLP in the ChatGPT era? By Laure Soulier
How to supervise a thesis in NLP in the ChatGPT era? By Laure Soulier
 
BDS14 Big Data Analytics to the masses
BDS14 Big Data Analytics to the massesBDS14 Big Data Analytics to the masses
BDS14 Big Data Analytics to the masses
 
Minimal viable-datareuse-czi
Minimal viable-datareuse-cziMinimal viable-datareuse-czi
Minimal viable-datareuse-czi
 
Individual functional atlasing of the human brain with multitask fMRI data: l...
Individual functional atlasing of the human brain with multitask fMRI data: l...Individual functional atlasing of the human brain with multitask fMRI data: l...
Individual functional atlasing of the human brain with multitask fMRI data: l...
 
An-Exploration-of-scientific-literature-using-Natural-Language-Processing
An-Exploration-of-scientific-literature-using-Natural-Language-ProcessingAn-Exploration-of-scientific-literature-using-Natural-Language-Processing
An-Exploration-of-scientific-literature-using-Natural-Language-Processing
 
LabClass1_pt1.ppt
LabClass1_pt1.pptLabClass1_pt1.ppt
LabClass1_pt1.ppt
 
NLP_guest_lecture.pdf
NLP_guest_lecture.pdfNLP_guest_lecture.pdf
NLP_guest_lecture.pdf
 
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
BIng NLP Expert - Dl summer-school-2017.-jianfeng-gao.v2
 
Data management for researchers
Data management for researchersData management for researchers
Data management for researchers
 
Use Your Words: Content Strategy to Influence Behavior
Use Your Words: Content Strategy to Influence BehaviorUse Your Words: Content Strategy to Influence Behavior
Use Your Words: Content Strategy to Influence Behavior
 
Open University - TU100 Day school 1
Open University - TU100 Day school 1Open University - TU100 Day school 1
Open University - TU100 Day school 1
 
Big Data and Natural Language Processing
Big Data and Natural Language ProcessingBig Data and Natural Language Processing
Big Data and Natural Language Processing
 
Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.Data Communities - reusable data in and outside your organization.
Data Communities - reusable data in and outside your organization.
 
How machines learn to talk. Machine Learning for Conversational AI
How machines learn to talk. Machine Learning for Conversational AIHow machines learn to talk. Machine Learning for Conversational AI
How machines learn to talk. Machine Learning for Conversational AI
 

More from Filip Ilievski

The Commonsense Knowledge Graph
The Commonsense Knowledge GraphThe Commonsense Knowledge Graph
The Commonsense Knowledge Graph
Filip Ilievski
 
Commonsense knowledge in Wikidata
Commonsense knowledge in WikidataCommonsense knowledge in Wikidata
Commonsense knowledge in Wikidata
Filip Ilievski
 
SemEval-2018 task 5: Counting events and participants in the long tail
SemEval-2018 task 5: Counting events and participants in the long tailSemEval-2018 task 5: Counting events and participants in the long tail
SemEval-2018 task 5: Counting events and participants in the long tail
Filip Ilievski
 
A look inside Babelfy: Examining the bubble
A look inside Babelfy: Examining the bubbleA look inside Babelfy: Examining the bubble
A look inside Babelfy: Examining the bubble
Filip Ilievski
 
Systematic Study of Long Tail Phenomena in Entity Linking
Systematic Study of Long Tail Phenomena in Entity LinkingSystematic Study of Long Tail Phenomena in Entity Linking
Systematic Study of Long Tail Phenomena in Entity Linking
Filip Ilievski
 
NoSQL databases
NoSQL databasesNoSQL databases
NoSQL databases
Filip Ilievski
 
LOTUS: Adaptive Text Search for Big Linked Data
LOTUS: Adaptive Text Search for Big Linked DataLOTUS: Adaptive Text Search for Big Linked Data
LOTUS: Adaptive Text Search for Big Linked Data
Filip Ilievski
 
Lotus: Linked Open Text UnleaShed - ISWC COLD '15
Lotus: Linked Open Text UnleaShed - ISWC COLD '15Lotus: Linked Open Text UnleaShed - ISWC COLD '15
Lotus: Linked Open Text UnleaShed - ISWC COLD '15
Filip Ilievski
 
NAF2SEM and cross-document Event Coreference
NAF2SEM and cross-document Event CoreferenceNAF2SEM and cross-document Event Coreference
NAF2SEM and cross-document Event Coreference
Filip Ilievski
 
Mini seminar presentation on context-based NED optimization
Mini seminar presentation on context-based NED optimizationMini seminar presentation on context-based NED optimization
Mini seminar presentation on context-based NED optimization
Filip Ilievski
 
CLiN 25: NED with two-stage coherence optimization
CLiN 25: NED with two-stage coherence optimizationCLiN 25: NED with two-stage coherence optimization
CLiN 25: NED with two-stage coherence optimization
Filip Ilievski
 

More from Filip Ilievski (11)

The Commonsense Knowledge Graph
The Commonsense Knowledge GraphThe Commonsense Knowledge Graph
The Commonsense Knowledge Graph
 
Commonsense knowledge in Wikidata
Commonsense knowledge in WikidataCommonsense knowledge in Wikidata
Commonsense knowledge in Wikidata
 
SemEval-2018 task 5: Counting events and participants in the long tail
SemEval-2018 task 5: Counting events and participants in the long tailSemEval-2018 task 5: Counting events and participants in the long tail
SemEval-2018 task 5: Counting events and participants in the long tail
 
A look inside Babelfy: Examining the bubble
A look inside Babelfy: Examining the bubbleA look inside Babelfy: Examining the bubble
A look inside Babelfy: Examining the bubble
 
Systematic Study of Long Tail Phenomena in Entity Linking
Systematic Study of Long Tail Phenomena in Entity LinkingSystematic Study of Long Tail Phenomena in Entity Linking
Systematic Study of Long Tail Phenomena in Entity Linking
 
NoSQL databases
NoSQL databasesNoSQL databases
NoSQL databases
 
LOTUS: Adaptive Text Search for Big Linked Data
LOTUS: Adaptive Text Search for Big Linked DataLOTUS: Adaptive Text Search for Big Linked Data
LOTUS: Adaptive Text Search for Big Linked Data
 
Lotus: Linked Open Text UnleaShed - ISWC COLD '15
Lotus: Linked Open Text UnleaShed - ISWC COLD '15Lotus: Linked Open Text UnleaShed - ISWC COLD '15
Lotus: Linked Open Text UnleaShed - ISWC COLD '15
 
NAF2SEM and cross-document Event Coreference
NAF2SEM and cross-document Event CoreferenceNAF2SEM and cross-document Event Coreference
NAF2SEM and cross-document Event Coreference
 
Mini seminar presentation on context-based NED optimization
Mini seminar presentation on context-based NED optimizationMini seminar presentation on context-based NED optimization
Mini seminar presentation on context-based NED optimization
 
CLiN 25: NED with two-stage coherence optimization
CLiN 25: NED with two-stage coherence optimizationCLiN 25: NED with two-stage coherence optimization
CLiN 25: NED with two-stage coherence optimization
 

Recently uploaded

Immersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths ForwardImmersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths Forward
Leonel Morgado
 
SAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdfSAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdf
KrushnaDarade1
 
The cost of acquiring information by natural selection
The cost of acquiring information by natural selectionThe cost of acquiring information by natural selection
The cost of acquiring information by natural selection
Carl Bergstrom
 
Bob Reedy - Nitrate in Texas Groundwater.pdf
Bob Reedy - Nitrate in Texas Groundwater.pdfBob Reedy - Nitrate in Texas Groundwater.pdf
Bob Reedy - Nitrate in Texas Groundwater.pdf
Texas Alliance of Groundwater Districts
 
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
Sérgio Sacani
 
Direct Seeded Rice - Climate Smart Agriculture
Direct Seeded Rice - Climate Smart AgricultureDirect Seeded Rice - Climate Smart Agriculture
Direct Seeded Rice - Climate Smart Agriculture
International Food Policy Research Institute- South Asia Office
 
Micronuclei test.M.sc.zoology.fisheries.
Micronuclei test.M.sc.zoology.fisheries.Micronuclei test.M.sc.zoology.fisheries.
Micronuclei test.M.sc.zoology.fisheries.
Aditi Bajpai
 
GBSN - Biochemistry (Unit 6) Chemistry of Proteins
GBSN - Biochemistry (Unit 6) Chemistry of ProteinsGBSN - Biochemistry (Unit 6) Chemistry of Proteins
GBSN - Biochemistry (Unit 6) Chemistry of Proteins
Areesha Ahmad
 
Sharlene Leurig - Enabling Onsite Water Use with Net Zero Water
Sharlene Leurig - Enabling Onsite Water Use with Net Zero WaterSharlene Leurig - Enabling Onsite Water Use with Net Zero Water
Sharlene Leurig - Enabling Onsite Water Use with Net Zero Water
Texas Alliance of Groundwater Districts
 
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
vluwdy49
 
Compexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titrationCompexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titration
Vandana Devesh Sharma
 
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
hozt8xgk
 
Medical Orthopedic PowerPoint Templates.pptx
Medical Orthopedic PowerPoint Templates.pptxMedical Orthopedic PowerPoint Templates.pptx
Medical Orthopedic PowerPoint Templates.pptx
terusbelajar5
 
NuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyerNuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyer
pablovgd
 
ESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptxESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptx
PRIYANKA PATEL
 
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdfwaterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
LengamoLAppostilic
 
Phenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvementPhenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvement
IshaGoswami9
 
Equivariant neural networks and representation theory
Equivariant neural networks and representation theoryEquivariant neural networks and representation theory
Equivariant neural networks and representation theory
Daniel Tubbenhauer
 
Shallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptxShallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptx
Gokturk Mehmet Dilci
 
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốtmô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
HongcNguyn6
 

Recently uploaded (20)

Immersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths ForwardImmersive Learning That Works: Research Grounding and Paths Forward
Immersive Learning That Works: Research Grounding and Paths Forward
 
SAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdfSAR of Medicinal Chemistry 1st by dk.pdf
SAR of Medicinal Chemistry 1st by dk.pdf
 
The cost of acquiring information by natural selection
The cost of acquiring information by natural selectionThe cost of acquiring information by natural selection
The cost of acquiring information by natural selection
 
Bob Reedy - Nitrate in Texas Groundwater.pdf
Bob Reedy - Nitrate in Texas Groundwater.pdfBob Reedy - Nitrate in Texas Groundwater.pdf
Bob Reedy - Nitrate in Texas Groundwater.pdf
 
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
EWOCS-I: The catalog of X-ray sources in Westerlund 1 from the Extended Weste...
 
Direct Seeded Rice - Climate Smart Agriculture
Direct Seeded Rice - Climate Smart AgricultureDirect Seeded Rice - Climate Smart Agriculture
Direct Seeded Rice - Climate Smart Agriculture
 
Micronuclei test.M.sc.zoology.fisheries.
Micronuclei test.M.sc.zoology.fisheries.Micronuclei test.M.sc.zoology.fisheries.
Micronuclei test.M.sc.zoology.fisheries.
 
GBSN - Biochemistry (Unit 6) Chemistry of Proteins
GBSN - Biochemistry (Unit 6) Chemistry of ProteinsGBSN - Biochemistry (Unit 6) Chemistry of Proteins
GBSN - Biochemistry (Unit 6) Chemistry of Proteins
 
Sharlene Leurig - Enabling Onsite Water Use with Net Zero Water
Sharlene Leurig - Enabling Onsite Water Use with Net Zero WaterSharlene Leurig - Enabling Onsite Water Use with Net Zero Water
Sharlene Leurig - Enabling Onsite Water Use with Net Zero Water
 
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
在线办理(salfor毕业证书)索尔福德大学毕业证毕业完成信一模一样
 
Compexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titrationCompexometric titration/Chelatorphy titration/chelating titration
Compexometric titration/Chelatorphy titration/chelating titration
 
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
快速办理(UAM毕业证书)马德里自治大学毕业证学位证一模一样
 
Medical Orthopedic PowerPoint Templates.pptx
Medical Orthopedic PowerPoint Templates.pptxMedical Orthopedic PowerPoint Templates.pptx
Medical Orthopedic PowerPoint Templates.pptx
 
NuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyerNuGOweek 2024 Ghent programme overview flyer
NuGOweek 2024 Ghent programme overview flyer
 
ESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptxESR spectroscopy in liquid food and beverages.pptx
ESR spectroscopy in liquid food and beverages.pptx
 
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdfwaterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
waterlessdyeingtechnolgyusing carbon dioxide chemicalspdf
 
Phenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvementPhenomics assisted breeding in crop improvement
Phenomics assisted breeding in crop improvement
 
Equivariant neural networks and representation theory
Equivariant neural networks and representation theoryEquivariant neural networks and representation theory
Equivariant neural networks and representation theory
 
Shallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptxShallowest Oil Discovery of Turkiye.pptx
Shallowest Oil Discovery of Turkiye.pptx
 
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốtmô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
mô tả các thí nghiệm về đánh giá tác động dòng khí hóa sau đốt
 

2nd Spinoza workshop: Looking at the Long Tail - introductory slides

  • 1. Looking at the Long Tail 2nd Spinoza Workshop #SpinozaLongTail
  • 2. It is time to move from “Big Data” to “small data”
  • 3. Language as a system and language use
  • 4.
  • 5. ● The world changes rapidly (Brexit) ● The language system changes slowly (Nexit) ● The relation between the two changes constantly
  • 6. Small data example Imagine you visit a dear friend for a game of chess. During the game you complain about your white queen being captured too early. While chatting, your friend tells you that his Ana is now already 13 years old, and is beginning with high school in September. After the game, you offer to buy him a beer in O’Neil’s, which is just a 2-3 minutes walk.
  • 7. Small data example Imagine [2] you visit [11] a dear [8] friend [5] for a game [14] of chess [2]. During the game [14] you complain [2] about your white [25] queen [12] being captured [11] too early. While chatting [4], your friend [5] tells [9] you that his Ana [42] is now already 13 years [4] old [9], and is beginning [11] with [high school [1]] in September. After the game [14], you offer [16] to buy [6] him a beer [1] in O’Neil’s[5], which is just a 2-3 minutes [8] walk [17]. 2*11*8*5*14*2*14*2*25*12*11*4*5*9*42*4*9*11*1*14*16*6*1*5*8*17 = 2,1185E36 possible interpretations and still missing some
  • 8. If a machine joined the conversation, what would it understand? Probably, it would think that you talk about: capturing “The White Queen” TV-series, the ANA airways based in Japan being 13 years old, and the sport equipment store O’Neil where you apparently can get a beer (or the recently retired famous basketball player Shaq O’Neil). An interpretation that may make sense from a Big Data perspective but that does not make any sense as a combination! Where is the coherence?
  • 9. Understanding of language is about the Long Tail with many, many small data niches “Today, a 6 year old has seen less data and read less language than most machines, but still these machines make mistakes that the 6 year old will never make.”
  • 10. How to create semantic tasks/challenges: ● that force systems to use more ‘intelligence’, ● to understand small data and its details, ● without knowing in advance what these details are.
  • 11. Looking at the Long Tail
  • 12. Looking at the Long Tail Practicalities
  • 13. Schedule overview 09:30-10:05 Welcome & Introduction 10:05-12:25 Invited talks 12:25-13:25 Lunch 13:25-14:05 Keynote 14:05-17:45 Practical session 17:45-18:00 Wrap-up 18:00- Drinks
  • 14. Food & Beverages ● Coffee ○ Two official coffee breaks (one in the morning + one in the afternoon) ○ Coffee machines available in the hall at any time ● Lunch ○ Will be brought to the forum ● Drinks ○ After the workshop ○ In a nice place nearby the VU (follow us :-) )
  • 17. Dinos born in different generations live in different worlds, i.e. what they know about the world depends on the time they live in. t World(t)
  • 18. Every generation of dinos has its own Ronaldo t
  • 19. Datasets as Time-bound World Proxies t
  • 20. Big Data can be a bad Proxy t The datasets fail to represent everything in the world, and over-represent some things.
  • 21. The Long Tail Phenomenon of Disambiguation (1) In theory, any lexical expression can refer to any meaning of the world at that time (and vice versa). People may be able to use the full range of expressions and interpretation of a language. But, they are not aware of it and only use a specific set in relation to a real life situation. This balances the trade-off between: ● using many different expressions, and ● resolving extreme ambiguity of a small set of expressions ● contextual competition (are you stating the obvious)
  • 22. The Long Tail Phenomenon of Disambiguation (2) The distribution of the lexical expressions and the denoted meanings in evaluation datasets both follow Zipf’s Law. But physically there is no Zipfian distribution between the meanings. Any of these worlds contains a set of unique concepts and instances, each existing physically exactly once. On instance level, any entity, concept, or event is as prominent as any other.
  • 23. Tendencies between the long tails of expressions and meanings
  • 24. Dataset Analysis Our analysis shows that the existing disambiguation datasets exhibit: ● Low ambiguity (~1.0) ● Low variance (~1.0) ● High dominance ● Temporal bias towards data from 1961 or 1990s Semantic overfitting: what `world' do we consider when evaluating disambiguation of text? (Under review)
  • 25. Datasets: How old is the data? Semantic overfitting: what `world' do we consider when evaluating disambiguation of text? (Under review)
  • 26. Datasets: How dominant is the data? Marieke van Erp, Pablo Mendes, Heiko Paulheim, Filip Ilievski, Julien Plu, Giuseppe Rizzo and Joerg Waitelonis (2016). Evaluating Entity Linking: An Analysis of Current Benchmark Datasets and a Roadmap for Doing a Better Job. In Proceedings of LREC 2016.
  • 27. Problem Statement: The Long Tail DeTail Disambiguation datasets contain a lot from the “head”, and only accidental details from the “tail”. Hence, our machines are very good at identifying popular objects (the “head”) whereas their performance is extremely low on unpopular objects (the “tail”). This is further complicated by the fact that the popularity is determined by the context: topic, time, location, community.
  • 28. Systems: Our analysis For WSD, we tried to improve on the disambiguation of the less frequent senses Outcome: better performance on the tail causes worse performance on the head! Marten Postma, Ruben Izquierdo, Eneko Agirre, German Rigau, Piek Vossen (2016). Addressing the MFS Bias in WSD systems. In Proceedings of LREC 2016.
  • 29. We aim to propose this task as a “Long Tail Shared Disambiguation Task” to the next call for SemEval-2018 tasks, which is expected late 2016/early 2017. In addition, we plan to propose a workshop for ACL 2017. Goal(s) of the workshop
  • 30. Goal(s) of the workshop We want systems to resolve extreme ambiguity within specific context: Discriminate one context from another Use more semantics: coherence, logic, inferencing, comprehensive Combine semantic layers and subtasks Use the complete document and more (external knowledge) Answer more complex questions, e.g. quantification and identity Be smarter than a 6 year old Can explain why something is an answer We do not want: yet another domain task
  • 31. Looking at the Long Tail: What can be done? Systems Resources Evaluation Datasets
  • 32. #1 Datasets What kind of datasets are needed for the long tail disambiguation task? ● Properties ● Multi-task ● Are current ones sufficient? ● Optimal acquisition methods
  • 33. #1 Datasets: Property-driven data MEANTIME -> Multi-task corpus Anne-Lyse Minard, Manuela Speranza, Ruben Urizar, Begoña Altuna, Marieke van Erp, Anneleen Schoen and Chantal van Son (2016). MEANTIME, the NewsReader Multilingual Event and Time Corpus. In Proceedings of LREC 2016. Dutch SemCor -> Balanced WSD corpus Piek Vossen, Rubén Izquierdo, and Attila Görög (2013). DutchSemCor: in quest of the ideal sense-tagged corpus. RANLP. ECB+ -> Increased ambiguity for event coreference Agata Cybulska and Piek Vossen (2014). Using a sledgehammer to crack a nut? Lexical diversity and event coreference resolution. In LREC 2014.
  • 34. #2 Resources What kind of knowledge is needed for the long tail disambiguation task? ● Long tail knowledge bases ● Contextual knowledge bases ● Locating appropriate knowledge
  • 35. #3 Evaluation How should we evaluate the long tail disambiguation task(s)? ● Optimal evaluation metric(s) ● Generalizability over disambiguation tasks ● Incentivizing context- and long tail-aware systems
  • 36. #4 Systems What are the requirements for a system to perform well on the long tail disambiguation task(s)? ● Existing systems ● Multi-task approach ● Long tail performance ● Sustainable systems