Marieke van Erp, Pablo Mendes, Heiko Paulheim, Filip Ilievski, Julien Plu, Giuseppe Rizzo and Joerg Waitelonis
Presented at LREC 2016:
http://www.lrec-conf.org/proceedings/lrec2016/pdf/926_Paper.pdf
June 2013 StartUp Health Insights Funding ReportStartUp Health
See StartUp Health Insights (http://www.startuphealth.com/insights) for the most comprehensive digital health funding database. Apply to StartUp Health Academy here: http://www.startuphealth.com/about-us/application/
RIR Report: LACNIC Update from ARIN 32
Full meeting report from ARIN 32 available at: https://www.arin.net/participate/meetings/reports/ARIN_32/index.html
The domain as unifier, how focusing on social history can bring technical fie...Marieke van Erp
Invited talk given at the final CEDAR symposium about the interaction between (social) history, language technology, and semantic web.
https://socialhistory.org/en/events/final-cedar-mini-symposium
Using Watson to build Cognitive IoT Apps on BluemixIBM
Learn how IBM Watson is allowing developers to build cognitive applications in the IBM Cloud. Using the IoT foundation and Watson, the future of connected devices is staying connected in a cognitive way with smarter apps and smarter devices.
tidycf: Turning cashflows on their sides to turn analysis on its headEmily Riederer
Presentation from rstudio::conf 2018 describing work developing an internal R package for cash flow analysis and for spearheading the adoption of open science / reproducible research principles at Capital One
June 2013 StartUp Health Insights Funding ReportStartUp Health
See StartUp Health Insights (http://www.startuphealth.com/insights) for the most comprehensive digital health funding database. Apply to StartUp Health Academy here: http://www.startuphealth.com/about-us/application/
RIR Report: LACNIC Update from ARIN 32
Full meeting report from ARIN 32 available at: https://www.arin.net/participate/meetings/reports/ARIN_32/index.html
The domain as unifier, how focusing on social history can bring technical fie...Marieke van Erp
Invited talk given at the final CEDAR symposium about the interaction between (social) history, language technology, and semantic web.
https://socialhistory.org/en/events/final-cedar-mini-symposium
Using Watson to build Cognitive IoT Apps on BluemixIBM
Learn how IBM Watson is allowing developers to build cognitive applications in the IBM Cloud. Using the IoT foundation and Watson, the future of connected devices is staying connected in a cognitive way with smarter apps and smarter devices.
tidycf: Turning cashflows on their sides to turn analysis on its headEmily Riederer
Presentation from rstudio::conf 2018 describing work developing an internal R package for cash flow analysis and for spearheading the adoption of open science / reproducible research principles at Capital One
Frag Flow: Automated Fragment Detection in Scientific Workflowsdgarijo
eScience 2014, Guarujá (Brasil). Abstract—Scientific workflows provide the means to define, execute and reproduce computational experiments. However, reusing existing workflows still poses challenges for workflow designers. Workflows are often too large and too specific to reuse in their entirety, so reuse is more likely to happen for fragments of workflows. These fragments may be identified manually by users as sub-workflows, or detected automatically. In this paper we present the FragFlow approach, which detects workflow fragments automatically by analyzing existing workflow corpora with graph mining algorithms. FragFlow detects the most common workflow fragments, links them to the original workflows and visualizes them. We evaluate our approach by comparing FragFlow results against user-defined sub-workflows from three different corpora of the LONI Pipeline system. Based on this evaluation, we discuss how automated workflow fragment detection could facilitate workflow reuse
Open Access is increasingly a determining part of the structures and processes of scholarly communication, particularly in the emerging open science modus operandi, which presupposes the opening of all research components. Currently, most scholarly communication instances, products and services refer to open access in some way. The bibliographic indexes started to identify open access articles. New publishers were created, most commercial publishers started to publish open access journals or offer authors the possibility to publish open access articles in subscription journals. Open access mega journals have appeared. In developing countries, open access journals predominate, with emphasis on the pioneering SciELO Program, publishing open access journals from 1998, four years before the Budapest Open Access Initiative declaration. The preprints modality with open access availability of manuscripts before evaluation and publication in journals grows and new tools appear. Several innovative models have emerged in recent years to promote open access to journal articles, such as library consortia or crowdfunding. There is still difficulty and resistance from publishers in developing financial models that enable open access, and the calculation of article processing charges (APC) remains opaque. But the main force that can make the universalization of open access viable is public policies, the best example being currently the European Commission’s Horizon 2020 program.
Before this landscape, this panel will analyze progress already achieved, the promising solutions and the persistent barriers in the routes towards the universalization of open access.
Syllabus
The classical open access modalities – gold route journals, green route, new models of open access financing, metrics on the status of open access, barriers to the universalization of open access, and open access policies.
Calculation of portfolio risk and return using Markowitz model
Calculation of optimal risk portfolio using a single index model; analysis using regression output and ANOVA table
The excel file can be viewed at:
https://drive.google.com/file/d/0B6nSS-YiZgneZTQ5blpWNkY3N2M/view?usp=sharing
BILS 2015 TU Wien exputec BioVT Pharmaceutical Engineering
"Quicker Ways to a Scalable Process Based on Sound-Science Based Methodologies"
Christoph Herwig
Towards Culturally Aware AI Systems - TSDH SymposiumMarieke van Erp
Towards Culturally Aware AI Systems
Presented 23 June 2021
Slide credits: Cultural AI team members Andrei Nesterov, Laura Hollink, Ryan Brate, Valentin Vogelmann + input and inspiration from all Cultural AI Colleagues
Biases in data can be both explicit and implicit. Explicitly, ‘The Dutch Seventeenth Century’ and ‘The Dutch Golden Age’ are pseudo-synonymous and refer to a particular era of Dutch history. Implicitly, the ‘Golden Age’ moniker is contested due to the fact that the geopolitical and economic expansion came with great costs, such as the slave trade. A simple two-word phrase can carry strong contestations, and entire research fields, such as post-colonial studies, are devoted to them. However, these sometimes subtle (and sometimes not so subtle) differences in voice are as yet not often represented well in AI systems.
In this talk, I will discuss how the Cultural AI Lab is working towards creating AI systems that are implicitly or explicitly aware of the subtle and subjective complexity of human culture. I will highlight the different research strands and activities that look at AI from different angles as well as how we engage with our user communities to create synergies between the technology and the daily practice of cultural heritage professionals.
The Human in Digital Humanities
Online Symposium, Tilburg School of Humanities & Digital Sciences
Tilburg University
https://www.digitalhumanitiestilburg.com/
Marieke van Erp & Victor de Boer (2021, June). A Polyvocal and Contextualised Semantic Web. In European Semantic Web Conference (pp. 506-512). Springer, Cham.
Presented on 8 June, 2021
More Related Content
Similar to Evaluating entity linking an analysis of current benchmark datasets and a roadmap for doing a better job (3)
Frag Flow: Automated Fragment Detection in Scientific Workflowsdgarijo
eScience 2014, Guarujá (Brasil). Abstract—Scientific workflows provide the means to define, execute and reproduce computational experiments. However, reusing existing workflows still poses challenges for workflow designers. Workflows are often too large and too specific to reuse in their entirety, so reuse is more likely to happen for fragments of workflows. These fragments may be identified manually by users as sub-workflows, or detected automatically. In this paper we present the FragFlow approach, which detects workflow fragments automatically by analyzing existing workflow corpora with graph mining algorithms. FragFlow detects the most common workflow fragments, links them to the original workflows and visualizes them. We evaluate our approach by comparing FragFlow results against user-defined sub-workflows from three different corpora of the LONI Pipeline system. Based on this evaluation, we discuss how automated workflow fragment detection could facilitate workflow reuse
Open Access is increasingly a determining part of the structures and processes of scholarly communication, particularly in the emerging open science modus operandi, which presupposes the opening of all research components. Currently, most scholarly communication instances, products and services refer to open access in some way. The bibliographic indexes started to identify open access articles. New publishers were created, most commercial publishers started to publish open access journals or offer authors the possibility to publish open access articles in subscription journals. Open access mega journals have appeared. In developing countries, open access journals predominate, with emphasis on the pioneering SciELO Program, publishing open access journals from 1998, four years before the Budapest Open Access Initiative declaration. The preprints modality with open access availability of manuscripts before evaluation and publication in journals grows and new tools appear. Several innovative models have emerged in recent years to promote open access to journal articles, such as library consortia or crowdfunding. There is still difficulty and resistance from publishers in developing financial models that enable open access, and the calculation of article processing charges (APC) remains opaque. But the main force that can make the universalization of open access viable is public policies, the best example being currently the European Commission’s Horizon 2020 program.
Before this landscape, this panel will analyze progress already achieved, the promising solutions and the persistent barriers in the routes towards the universalization of open access.
Syllabus
The classical open access modalities – gold route journals, green route, new models of open access financing, metrics on the status of open access, barriers to the universalization of open access, and open access policies.
Calculation of portfolio risk and return using Markowitz model
Calculation of optimal risk portfolio using a single index model; analysis using regression output and ANOVA table
The excel file can be viewed at:
https://drive.google.com/file/d/0B6nSS-YiZgneZTQ5blpWNkY3N2M/view?usp=sharing
BILS 2015 TU Wien exputec BioVT Pharmaceutical Engineering
"Quicker Ways to a Scalable Process Based on Sound-Science Based Methodologies"
Christoph Herwig
Similar to Evaluating entity linking an analysis of current benchmark datasets and a roadmap for doing a better job (3) (20)
Towards Culturally Aware AI Systems - TSDH SymposiumMarieke van Erp
Towards Culturally Aware AI Systems
Presented 23 June 2021
Slide credits: Cultural AI team members Andrei Nesterov, Laura Hollink, Ryan Brate, Valentin Vogelmann + input and inspiration from all Cultural AI Colleagues
Biases in data can be both explicit and implicit. Explicitly, ‘The Dutch Seventeenth Century’ and ‘The Dutch Golden Age’ are pseudo-synonymous and refer to a particular era of Dutch history. Implicitly, the ‘Golden Age’ moniker is contested due to the fact that the geopolitical and economic expansion came with great costs, such as the slave trade. A simple two-word phrase can carry strong contestations, and entire research fields, such as post-colonial studies, are devoted to them. However, these sometimes subtle (and sometimes not so subtle) differences in voice are as yet not often represented well in AI systems.
In this talk, I will discuss how the Cultural AI Lab is working towards creating AI systems that are implicitly or explicitly aware of the subtle and subjective complexity of human culture. I will highlight the different research strands and activities that look at AI from different angles as well as how we engage with our user communities to create synergies between the technology and the daily practice of cultural heritage professionals.
The Human in Digital Humanities
Online Symposium, Tilburg School of Humanities & Digital Sciences
Tilburg University
https://www.digitalhumanitiestilburg.com/
Marieke van Erp & Victor de Boer (2021, June). A Polyvocal and Contextualised Semantic Web. In European Semantic Web Conference (pp. 506-512). Springer, Cham.
Presented on 8 June, 2021
Computationally Tracing Concepts Through Time and SpaceMarieke van Erp
Slides for HNR2020 Keynote presentation
Abstract:
Digitised sources are a treasure trove for scholars, but accessing the information contained in them is far from trivial. Due to scale, traditional methods are insufficient to analyse the big data coming from these sources. Hence, computational methods look to be the solution. Indeed, computational methods can be utilised to identify and model concepts in large digital datasets, however the nature of these datasets as well as that of humanities research questions requires caution. In particular, the ramifications of time and location on understanding concepts cannot be underestimated.
In this talk, Marieke will present ongoing work on computationally tracing concepts through time and across geography using language and semantic web technology. The work illustrates that seemingly simple concepts (e.g. sugar) prove to be much more complex than expected. We discuss the importance of semantics in helping not only to deal with this complexity but reify it so that it can be interrogated both computationally and via expert analysis.
Slides 5, 8, 11, 12, 15, 16, 17, 18, 19, 20 are based the presentation Tabea Tietz gave for the paper "Challenges of Knowledge Graph Evolution from an NLP Perspective" in the WHiSe Workshop @ ESWC 2020 (2 June 2020).
http://hnr2020.historicalnetworkresearch.org/
The Hitchhiker's Guide to the Future of Digital HumanitiesMarieke van Erp
Slides of my DHOxSS closing lecture
Oxford, 26 July 2019
Abstract
In the constellation of research fields, new configurations are continuously reshaping our ideas of what a field should be. This is particularly the case in the young field of digital humanities which, as David M. Berry noted, started with a focus on improving access to digital repositories and then moved to expanding the limits of archives to include born-digital materials as research objects. Both moves greatly impacted our research practice. However, I argue that we have only started scratching the surface of what digital methods can mean for humanities research.
In particular, as our methods and collaborations with other fields have matured, we can now start imagining new types of research questions that go beyond the sum of their ‘digital’ and ‘humanities’ parts -- to fundamentally change the nature of the humanities questions that we can ask. For such a reshaping to occur, we need to deepen the connection to our academic neighbours and keep looking beyond our own research community in order to ask these new questions. In my talk, I will present how multi-disciplinary collaborations between historians, linguists, and computer scientists can bring about new insights that may form the first steps to this future.
Why language technology can’t handle Game of Thrones (yet)Marieke van Erp
Natural language processing (NLP) tools are commonly used in many day-to-day applications such as Siri and Google, but the effectiveness of these technologies is not thoroughly understood. I will present joint work with colleagues from the Vrij Universiteit Amsterdam in which we perform a thorough evaluation of four different name recognition tools on 40 popular novels (including A Game of Thrones). I will highlight why literary texts are so difficult for NLP tools as well as solutions for improving their performance.
Finding common ground between text, maps, and tables for quantitative and qua...Marieke van Erp
Invited talk given at 8th AIUCD Conference 2019 – ‘Pedagogy, teaching, and research in the age of Digital Humanities’
http://aiucd2019.uniud.it/
24 January 2019, Udine, Italy
Slicing and Dicing a Newspaper Corpus for Historical Ecology ResearchMarieke van Erp
Presented at EKAW 2018
Historical newspapers are a novel source of information for historical ecologists to study the interactions between humans and animals through time and space. Newspaper archives are particularly interesting to analyse because of their breadth and depth. However, the size and the occasional noisiness of such archives also brings difficulties, as manual analysis is impossible. In this paper, we present experiments and results on automatic query expansion and categorisation for the perception of animal species between 1800 and 1940. For query expansion and to the manual annotation process, we used lexicons. For the categorisation we trained a Support Vector Machine model. Our results indicate that we can distinguish newspaper articles that are about animal species from those that are not with an F 1 of 0.92 and the subcategorisation of the different types of newspapers on animals up to 0.84 F 1 .
Lessons Learnt from the Named Entity rEcognition and Linking (NEEL) Challenge...Marieke van Erp
Giuseppe Rizzo, Biana Pereira, Andra Varga, Marieke van Erp, Amparo Elizabeth Cano Basave
Presented on Wednesday 10 October at the 17th International Semantic Web Conference (ISWC 2018)
Paper: http://www.semantic-web-journal.net/content/lessons-learnt-named-entity-recognition-and-linking-neel-challenge-series
Conference: http://iswc2018.semanticweb.org/
Entity Typing Using Distributional Semantics and DBpedia Marieke van Erp
Presentation given at NLP&DBpedia workshop on 18 October 2016. The presentation accompanies the work described in: https://nlpdbpedia2016.files.wordpress.com/2016/09/nlpdbpedia2016_paper_9.pdf
Finding Stories in 1,784,532 Events: Scaling up computational models of narr...Marieke van Erp
Slides of the NewsReader Computational Models of Narrative Presentation "Finding Stories in 1,784,532 Events: Scaling Up Computational Models of Narrative - Marieke van Erp, Antske Fokkens, and Piek Vossen"
Workshop page: http://narrative.csail.mit.edu/cmn14/
Project page: http://www.newsreader-project.eu
Evaluating Named Entity Recognition and Disambiguation in News and TweetsMarieke van Erp
Named entity recognition and disambiguation are important for information extraction and populating knowledge bases. Detecting and classifying named entities has traditionally been taken on by the natural language processing community, whilst linking of entities to external resources, such as DBpedia and GeoNames, has been the domain of the Semantic Web community. As these tasks are treated in different communities, it is difficult to assess the performance of these tasks combined.
We present results on an evaluation of the NERD-ML approach on newswire and tweets for both Named Entity Recognition and Named Entity Disambiguation.
Presented at CLIN 24: http://clin24.inl.nl/
http://nerd.eurecom.fr
https://github.com/giusepperizzo/nerdml
Slide 1: Title Slide
Extrachromosomal Inheritance
Slide 2: Introduction to Extrachromosomal Inheritance
Definition: Extrachromosomal inheritance refers to the transmission of genetic material that is not found within the nucleus.
Key Components: Involves genes located in mitochondria, chloroplasts, and plasmids.
Slide 3: Mitochondrial Inheritance
Mitochondria: Organelles responsible for energy production.
Mitochondrial DNA (mtDNA): Circular DNA molecule found in mitochondria.
Inheritance Pattern: Maternally inherited, meaning it is passed from mothers to all their offspring.
Diseases: Examples include Leber’s hereditary optic neuropathy (LHON) and mitochondrial myopathy.
Slide 4: Chloroplast Inheritance
Chloroplasts: Organelles responsible for photosynthesis in plants.
Chloroplast DNA (cpDNA): Circular DNA molecule found in chloroplasts.
Inheritance Pattern: Often maternally inherited in most plants, but can vary in some species.
Examples: Variegation in plants, where leaf color patterns are determined by chloroplast DNA.
Slide 5: Plasmid Inheritance
Plasmids: Small, circular DNA molecules found in bacteria and some eukaryotes.
Features: Can carry antibiotic resistance genes and can be transferred between cells through processes like conjugation.
Significance: Important in biotechnology for gene cloning and genetic engineering.
Slide 6: Mechanisms of Extrachromosomal Inheritance
Non-Mendelian Patterns: Do not follow Mendel’s laws of inheritance.
Cytoplasmic Segregation: During cell division, organelles like mitochondria and chloroplasts are randomly distributed to daughter cells.
Heteroplasmy: Presence of more than one type of organellar genome within a cell, leading to variation in expression.
Slide 7: Examples of Extrachromosomal Inheritance
Four O’clock Plant (Mirabilis jalapa): Shows variegated leaves due to different cpDNA in leaf cells.
Petite Mutants in Yeast: Result from mutations in mitochondrial DNA affecting respiration.
Slide 8: Importance of Extrachromosomal Inheritance
Evolution: Provides insight into the evolution of eukaryotic cells.
Medicine: Understanding mitochondrial inheritance helps in diagnosing and treating mitochondrial diseases.
Agriculture: Chloroplast inheritance can be used in plant breeding and genetic modification.
Slide 9: Recent Research and Advances
Gene Editing: Techniques like CRISPR-Cas9 are being used to edit mitochondrial and chloroplast DNA.
Therapies: Development of mitochondrial replacement therapy (MRT) for preventing mitochondrial diseases.
Slide 10: Conclusion
Summary: Extrachromosomal inheritance involves the transmission of genetic material outside the nucleus and plays a crucial role in genetics, medicine, and biotechnology.
Future Directions: Continued research and technological advancements hold promise for new treatments and applications.
Slide 11: Questions and Discussion
Invite Audience: Open the floor for any questions or further discussion on the topic.
Nutraceutical market, scope and growth: Herbal drug technologyLokesh Patil
As consumer awareness of health and wellness rises, the nutraceutical market—which includes goods like functional meals, drinks, and dietary supplements that provide health advantages beyond basic nutrition—is growing significantly. As healthcare expenses rise, the population ages, and people want natural and preventative health solutions more and more, this industry is increasing quickly. Further driving market expansion are product formulation innovations and the use of cutting-edge technology for customized nutrition. With its worldwide reach, the nutraceutical industry is expected to keep growing and provide significant chances for research and investment in a number of categories, including vitamins, minerals, probiotics, and herbal supplements.
Earliest Galaxies in the JADES Origins Field: Luminosity Function and Cosmic ...Sérgio Sacani
We characterize the earliest galaxy population in the JADES Origins Field (JOF), the deepest
imaging field observed with JWST. We make use of the ancillary Hubble optical images (5 filters
spanning 0.4−0.9µm) and novel JWST images with 14 filters spanning 0.8−5µm, including 7 mediumband filters, and reaching total exposure times of up to 46 hours per filter. We combine all our data
at > 2.3µm to construct an ultradeep image, reaching as deep as ≈ 31.4 AB mag in the stack and
30.3-31.0 AB mag (5σ, r = 0.1” circular aperture) in individual filters. We measure photometric
redshifts and use robust selection criteria to identify a sample of eight galaxy candidates at redshifts
z = 11.5 − 15. These objects show compact half-light radii of R1/2 ∼ 50 − 200pc, stellar masses of
M⋆ ∼ 107−108M⊙, and star-formation rates of SFR ∼ 0.1−1 M⊙ yr−1
. Our search finds no candidates
at 15 < z < 20, placing upper limits at these redshifts. We develop a forward modeling approach to
infer the properties of the evolving luminosity function without binning in redshift or luminosity that
marginalizes over the photometric redshift uncertainty of our candidate galaxies and incorporates the
impact of non-detections. We find a z = 12 luminosity function in good agreement with prior results,
and that the luminosity function normalization and UV luminosity density decline by a factor of ∼ 2.5
from z = 12 to z = 14. We discuss the possible implications of our results in the context of theoretical
models for evolution of the dark matter halo mass function.
This presentation explores a brief idea about the structural and functional attributes of nucleotides, the structure and function of genetic materials along with the impact of UV rays and pH upon them.
Comparing Evolved Extractive Text Summary Scores of Bidirectional Encoder Rep...University of Maribor
Slides from:
11th International Conference on Electrical, Electronics and Computer Engineering (IcETRAN), Niš, 3-6 June 2024
Track: Artificial Intelligence
https://www.etran.rs/2024/en/home-english/
What is greenhouse gasses and how many gasses are there to affect the Earth.moosaasad1975
What are greenhouse gasses how they affect the earth and its environment what is the future of the environment and earth how the weather and the climate effects.
Multi-source connectivity as the driver of solar wind variability in the heli...Sérgio Sacani
The ambient solar wind that flls the heliosphere originates from multiple
sources in the solar corona and is highly structured. It is often described
as high-speed, relatively homogeneous, plasma streams from coronal
holes and slow-speed, highly variable, streams whose source regions are
under debate. A key goal of ESA/NASA’s Solar Orbiter mission is to identify
solar wind sources and understand what drives the complexity seen in the
heliosphere. By combining magnetic feld modelling and spectroscopic
techniques with high-resolution observations and measurements, we show
that the solar wind variability detected in situ by Solar Orbiter in March
2022 is driven by spatio-temporal changes in the magnetic connectivity to
multiple sources in the solar atmosphere. The magnetic feld footpoints
connected to the spacecraft moved from the boundaries of a coronal hole
to one active region (12961) and then across to another region (12957). This
is refected in the in situ measurements, which show the transition from fast
to highly Alfvénic then to slow solar wind that is disrupted by the arrival of
a coronal mass ejection. Our results describe solar wind variability at 0.5 au
but are applicable to near-Earth observatories.
Evaluating entity linking an analysis of current benchmark datasets and a roadmap for doing a better job (3)
1. Evaluating Entity Linking: An Analysis of Current
Benchmark Datasets and a Roadmap for Doing a
Better Job
Marieke van Erp, Pablo Mendes, Heiko Paulheim, Filip Ilievski, Julien Plu, Giuseppe
Rizzo and Joerg Waitelonis
https://github.com/dbpedia-spotlight/evaluation-datasets
2. Take home message
• Existing entity linking datasets:
• are not interoperable
• do not cover many different domains
• skew towards popular and frequent entities
• We need to:
• Document & Standardise
• Diversify to cover different domains and the long tail
https://github.com/dbpedia-spotlight/evaluation-datasets/ 2
3. Why
• Named entity linking approaches achieve F1 scores of ~.80
on various benchmark datasets
• Are we really testing our approaches on all aspects of the
entity linking task?
It’s not just us:
Maud Ehrmann and Damien Nouvel and
Sophie Rosset. Named Entity Resources -
Overview and Outlook. LREC 2016
https://github.com/dbpedia-spotlight/evaluation-datasets/ 3
4. This work
• Analysis of 7 entity linking benchmark datasets
• Dataset characteristics (document type, domain, license etc)
• Entity, surface form & mention characterisation (overlap
between datasets, confusability, prominence, dominance,
types, etc)
• Annotation characteristics (nested entities, redundancy, IAA,
offsets)
+ Roadmap: how can we do better
https://github.com/dbpedia-spotlight/evaluation-datasets/ 4
5. Entity Overlap
• Number of entities present in one dataset that are also
present in other datasets
AIDA-YAGO2 (5,596)
NEEL2014 (2,380)
NEEL2015 (2,800)
OKE2015 (531)
RSS500 (849)
WES2015 (7,309)
Wikinews (279)
https://github.com/dbpedia-spotlight/evaluation-datasets/ 5
6. Datasets
Dataset Type Domain Doc length Format Encoding License
AIDA-YAGO2 news general medium TSV ASCII Agreement
2014/2015
NEEL
tweets general short TSV ASCII Open
OKE2015 encyclopaedia general long NIF/RDF UTF8 Open
RSS-500 news general medium NIF/RDF UTF8 Open
WES2015 blog science long NIF/RDF UTF8 Open
WikiNews news general medium XML UTF8 Open
https://github.com/dbpedia-spotlight/evaluation-datasets/ 6
7. Entity Overlap
• Number of entities present in one dataset that are also
present in other datasets
AIDA-YAGO2 NEEL2014 NEEL2015 OKE2015 RSS500 WES2015 Wikinews
AIDA-YAGO2 (5,596) 5.87% 8.06% 0.00% 1.26% 4.80% 1.16%
NEEL2014 (2,380) 13.73% 68.49% 2.39% 2.56% 12.35% 2.82%
NEEL2015 (2,800) 16.11% 58.21% 2.00% 2.54% 7.93% 2.57%
OKE2015 (531) 0.00% 10.73% 10.55% 2.44% 28.06% 3.95%
RSS500 (849) 8.24% 7.18% 8.36% 1.53% 3.18% 1.88%
WES2015 (7,309) 3.68% 4.02% 3.04% 2.04% 0.16% 0.66%
Wikinews (279) 23.30% 24.01% 25.81% 7.53% 5.73% 17.20%
https://github.com/dbpedia-spotlight/evaluation-datasets/ 7
8. Entity Overlap
• Number of entities present in one dataset that are also
present in other datasets
AIDA-YAGO2 NEEL2014 NEEL2015 OKE2015 RSS500 WES2015 Wikinews
AIDA-YAGO2 (5,596) 5.87% 8.06% 0.00% 1.26% 4.80% 1.16%
NEEL2014 (2,380) 13.73% 68.49% 2.39% 2.56% 12.35% 2.82%
NEEL2015 (2,800) 16.11% 58.21% 2.00% 2.54% 7.93% 2.57%
OKE2015 (531) 0.00% 10.73% 10.55% 2.44% 28.06% 3.95%
RSS500 (849) 8.24% 7.18% 8.36% 1.53% 3.18% 1.88%
WES2015 (7,309) 3.68% 4.02% 3.04% 2.04% 0.16% 0.66%
Wikinews (279) 23.30% 24.01% 25.81% 7.53% 5.73% 17.20%
https://github.com/dbpedia-spotlight/evaluation-datasets/ 8
9. Entity Overlap
• Number of entities present in one dataset that are also
present in other datasets
AIDA-YAGO2 NEEL2014 NEEL2015 OKE2015 RSS500 WES2015 Wikinews
AIDA-YAGO2 (5,596) 5.87% 8.06% 0.00% 1.26% 4.80% 1.16%
NEEL2014 (2,380) 13.73% 68.49% 2.39% 2.56% 12.35% 2.82%
NEEL2015 (2,800) 16.11% 58.21% 2.00% 2.54% 7.93% 2.57%
OKE2015 (531) 0.00% 10.73% 10.55% 2.44% 28.06% 3.95%
RSS500 (849) 8.24% 7.18% 8.36% 1.53% 3.18% 1.88%
WES2015 (7,309) 3.68% 4.02% 3.04% 2.04% 0.16% 0.66%
Wikinews (279) 23.30% 24.01% 25.81% 7.53% 5.73% 17.20%
https://github.com/dbpedia-spotlight/evaluation-datasets/ 9
10. Entity Overlap
• Number of entities present in one dataset that are also
present in other datasets
AIDA-YAGO2 NEEL2014 NEEL2015 OKE2015 RSS500 WES2015 Wikinews
AIDA-YAGO2 (5,596) 5.87% 8.06% 0.00% 1.26% 4.80% 1.16%
NEEL2014 (2,380) 13.73% 68.49% 2.39% 2.56% 12.35% 2.82%
NEEL2015 (2,800) 16.11% 58.21% 2.00% 2.54% 7.93% 2.57%
OKE2015 (531) 0.00% 10.73% 10.55% 2.44% 28.06% 3.95%
RSS500 (849) 8.24% 7.18% 8.36% 1.53% 3.18% 1.88%
WES2015 (7,309) 3.68% 4.02% 3.04% 2.04% 0.16% 0.66%
Wikinews (279) 23.30% 24.01% 25.81% 7.53% 5.73% 17.20%
https://github.com/dbpedia-spotlight/evaluation-datasets/ 10
17. How can we do better?
• Document your dataset!
• Use a standardised format
• Diversify both in domains and in entity distribution
https://github.com/dbpedia-spotlight/evaluation-datasets/ 17
18. Work in Progress & Future work
• Analyse more datasets
• Evaluate the temporal dimension of datasets (current work
by Filip Ilievski & Marten Postma)
• Integrate and build better datasets
https://github.com/dbpedia-spotlight/evaluation-datasets/ 18
19. Want to help?
Scripts and data used here can be found at:
Contact marieke.van.erp@vu.nl if you want to collaborate
https://github.com/dbpedia-spotlight/evaluation-datasets/
19