Entity Typing Using Distributional Semantics and DBpedia

Presentation given at NLP&DBpedia workshop on 18 October 2016. The presentation accompanies the work described in: https://nlpdbpedia2016.files.wordpress.com/2016/09/nlpdbpedia2016_paper_9.pdf

Entity Typing Using
Distributional Semantics and DBpedia
Marieke van Erp and Piek Vossen

Conclusions
• Fine-grained entity typing is necessary for semantic
queries over text
• The search space for Word2Vec is large; restricting
comparisons by topic helps
• Combining distributional semantics with DBpedia can
help overcome NIL and dark entities
https://github.com/MvanErp/entity-typing/

Dark entities: entities for which little or no information is available in the knowledge base (KB)

Distributional Semantics
• Similar concepts (denoted by words) occur in similar
contexts
• Word2Vec (Mikolov et al., 2013) is a popular
implementation of this notion
[Figure: example words: Sushi, Teriyaki, Udon, Okonomiyaki, Soba, Sashimi, Kimono, Yukata, Nemaki, KFC, Steak, Hamburger, McDonald’s, Jeans, T-shirt, Skirt]
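The notion that words from similar contexts end up with similar vectors can be sketched with cosine similarity. This is a minimal illustration, not the authors' code: the 3-dimensional vectors below are invented for the example (real Word2Vec vectors typically have hundreds of dimensions), and `cosine` and `most_similar` are hypothetical helper names.

```python
import math

# Toy vectors, invented for illustration only (an assumption, not model output).
VECTORS = {
    "Sushi":  [0.9, 0.1, 0.0],
    "Udon":   [0.8, 0.2, 0.1],
    "Kimono": [0.1, 0.9, 0.1],
    "Jeans":  [0.0, 0.8, 0.4],
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def most_similar(word, vectors=VECTORS):
    """Return all other words, ranked by cosine similarity to `word`."""
    others = [w for w in vectors if w != word]
    return sorted(others,
                  key=lambda w: cosine(vectors[word], vectors[w]),
                  reverse=True)

if __name__ == "__main__":
    print(most_similar("Sushi"))
    print(most_similar("Kimono"))
```

With these toy values the food words rank closest to each other, and likewise the clothing words.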

Research Question:
• Can we predict the type of the concept ‘Sushi’ by
modelling it in a distributional semantics space and
comparing its vector to the vectors of concepts for which
we do know the type?
[Figure: example words: Sushi, Teriyaki, Udon, Okonomiyaki, Soba, Sashimi, Kimono, Yukata, Nemaki, KFC, Steak, Hamburger, McDonald’s, Jeans, T-shirt, Skirt]
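The research question can be made concrete with a small sketch: compare the vector of an untyped concept to the vectors of typed concepts and rank the candidate types. Everything here is an assumption for illustration: the vectors are invented, `KNOWN_TYPES` stands in for type lookups in DBpedia, and scoring a type by its single most similar entity is one possible ranking, not necessarily the one used in the paper.

```python
import math

# Toy vectors, invented for illustration only.
VECTORS = {
    "Sushi":   [0.9, 0.1, 0.0],
    "Udon":    [0.8, 0.2, 0.1],
    "Sashimi": [0.9, 0.2, 0.0],
    "Kimono":  [0.1, 0.9, 0.1],
    "Jeans":   [0.0, 0.8, 0.4],
}

# Hypothetical stand-in for types retrieved from DBpedia.
KNOWN_TYPES = {
    "Udon": "Food",
    "Sashimi": "Food",
    "Kimono": "Clothing",
    "Jeans": "Clothing",
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def rank_types(target, vectors=VECTORS, known_types=KNOWN_TYPES):
    """Rank types by the best similarity of any typed entity to `target`."""
    best = {}
    for entity, etype in known_types.items():
        if entity == target or entity not in vectors:
            continue
        score = cosine(vectors[target], vectors[entity])
        if etype not in best or score > best[etype]:
            best[etype] = score
    return sorted(best.items(), key=lambda kv: kv[1], reverse=True)

if __name__ == "__main__":
    for etype, score in rank_types("Sushi"):
        print(f"{etype}: {score:.3f}")
```

The first element of the returned list plays the role of the highest-ranking type.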

Setup
• 7 Named Entity Linking Benchmark datasets (AIDA-YAGO,
2014 NEEL, 2015 NEEL, OKE2015, RSS500, WES2015,
Wikinews)
• 3 Word2Vec models: GoogleNews, English Wikipedia,
Reuters RCV1*
• Compare all entities within a dataset to each other and return
the highest-ranking type (as taken from DBpedia)
* AIDA-YAGO is part of Reuters RCV1

Initial results
• Not so great?

Initial results (some footnotes)
• The ranking approach favours fine-grained entity types
• The Word2Vec corpus matters! 2014 NEEL and 2015 NEEL are
derived from tweets, which typically have low coverage in
models trained on news
• Smaller datasets (Wikinews, WES2015, OKE2015) do better?
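The corpus effect noted above can be checked with a simple coverage measure: the fraction of entity mentions that have a vector in a given model. This is a sketch under stated assumptions; the vocabulary and mention lists are invented, and `coverage` is a hypothetical helper, not part of the authors' repository.

```python
def coverage(mentions, vocabulary):
    """Fraction of entity mentions that are present in the model vocabulary."""
    if not mentions:
        return 0.0
    found = sum(1 for mention in mentions if mention in vocabulary)
    return found / len(mentions)

# Invented example data (assumption): a news-trained vocabulary and
# mentions as they might appear in news text versus tweets.
news_vocab = {"Reuters", "Amsterdam", "Barack_Obama"}
news_mentions = ["Reuters", "Amsterdam", "Barack_Obama"]
tweet_mentions = ["@BarackObama", "Amsterdam", "#ajax", "Reuters"]

if __name__ == "__main__":
    print("news coverage:", coverage(news_mentions, news_vocab))
    print("tweet coverage:", coverage(tweet_mentions, news_vocab))
```

Mentions without a vector cannot be compared at all, so low coverage caps the achievable score before any ranking takes place.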

Let’s zoom in on topics
• Initially, all entities within a benchmark dataset were
compared to all other entities.
• What if we only compare entities from sports documents to
other entities from sports documents?
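Restricting comparisons to entities from documents on the same topic can be sketched as a filter applied before the nearest-neighbour lookup. The vectors, type labels, topic labels and the target entity are all invented for this illustration, and `predict_type` is a hypothetical helper rather than the authors' implementation.

```python
import math

# entity -> (toy vector, known type, topic of the source document); all invented.
ENTITIES = {
    "Ajax":      ([0.90, 0.10], "SoccerClub", "sports"),
    "Feyenoord": ([0.80, 0.30], "SoccerClub", "sports"),
    "Philips":   ([0.85, 0.20], "Company", "business"),
    "Shell":     ([0.20, 0.90], "Company", "business"),
}

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) *
                  math.sqrt(sum(b * b for b in v)))

def predict_type(target_vector, target_topic, entities=ENTITIES,
                 same_topic_only=False):
    """Return the type of the most similar typed entity.

    With same_topic_only=True, only entities from documents with the
    same topic as the target are considered as candidates.
    """
    candidates = [
        (vector, etype)
        for vector, etype, topic in entities.values()
        if not same_topic_only or topic == target_topic
    ]
    best_vector, best_type = max(
        candidates, key=lambda c: cosine(target_vector, c[0]))
    return best_type

if __name__ == "__main__":
    psv = [0.86, 0.20]  # invented vector for an untyped entity in a sports document
    print(predict_type(psv, "sports"))
    print(predict_type(psv, "sports", same_topic_only=True))
```

In this toy setup the unrestricted lookup picks a business entity as nearest neighbour, while the topic filter leaves only the sports entities as candidates.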
[Figure: six plots for AIDA-YAGO. Panel titles: "AIDA-YAGO Coarsegrained Categories GoogleNews Fine", "AIDA-YAGO Coarsegrained Categories RCV1 Fine", "AIDA-YAGO Coarsegrained Categories Wikipedia Fine", "AIDA-YAGO Finegrained Categories GoogleNews Fine", "AIDA-YAGO Finegrained Categories RCV1 Fine", "AIDA-YAGO Finegrained Categories Wikipedia Fine". Axis ticks: categories 1 to 22 (coarse-grained panels) or 1 to 69 (fine-grained panels); 20 to 100; 1, 5, 10.]

Conclusions and Future Work
• Difficult task, but topics help
• Ranking needs to be improved
• Multi-class classification (KFC: food & organisation,
Arnold Schwarzenegger: Actor & Politician)
• What else can we discover beyond type?
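The multi-class idea listed above, where an entity such as KFC carries more than one type, could be approached by keeping every type whose score clears a cut-off instead of only the top-ranked one. This is a sketch of one possible direction, not something the slides describe: the scores and the `threshold` value are invented, and `select_types` is a hypothetical helper.

```python
def select_types(type_scores, threshold=0.8):
    """Keep every type whose score reaches the threshold, best first."""
    kept = [(etype, score) for etype, score in type_scores.items()
            if score >= threshold]
    kept.sort(key=lambda pair: pair[1], reverse=True)
    return [etype for etype, _ in kept]

# Invented similarity scores per candidate type (assumption).
scores_kfc = {"Food": 0.91, "Organisation": 0.88, "Clothing": 0.12}

if __name__ == "__main__":
    print(select_types(scores_kfc))
    print(select_types(scores_kfc, threshold=0.9))
```

A fixed threshold is the simplest choice; it would need tuning per model, since similarity ranges differ between corpora.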

Thank you!

This research was made possible by the CLARIAH-CORE project
financed by NWO.
http://www.clariah.nl


Semantic Search: Fast Results from Large, Non-Native Language Corpora with Ro...

Databricks

The Semantic Engine is a custom search engine deployable on top of large, non-native language corpora that goes beyond keyword search and does NOT require translation. The large, on-the-fly calculations essential to making this an effective search engine necessitated development on a distributed platform capable of processing large volumes of unstructured data. Hear how the low barrier to entry provided by Apache Spark allowed the Novetta Solutions team to focus on the hard analytical challenges presented by their data, without having to spend much time grappling with the inherent difficulties normally associated with distributed computing.

Visually Exploring Patent Collections for Events and Patterns

Xiaoyu Wang

Benchmarking for Neural Information Retrieval: MS MARCO, TREC, and Beyond

Bhaskar Mitra

The emergence of deep learning-based methods for information retrieval (IR) poses several challenges and opportunities for benchmarking. Some of these are new, while others have evolved from existing challenges in IR exacerbated by the scale at which deep learning models operate. In this talk, I will present a brief overview of what we have learned from our work on MS MARCO and the TREC Deep Learning track, and reflect on the road ahead.






Entity Typing Using Distributional Semantics and DBpedia

1. Entity Typing Using Distributional Semantics and DBpedia Marieke van Erp and Piek Vossen

2. Conclusions
• Fine-grained entity typing is necessary for semantic queries over text
• Search space for word2vec is large, topics help
• Combining Distributional Semantics with DBpedia can help overcome NIL and Dark Entities
https://github.com/MvanErp/entity-typing/

3. Dark entities: little or no information available in KB

4. Dark entities: little or no information available in KB

5. Distributional Semantics
• Similar concepts (denoted by words) occur in similar contexts
• Word2Vec (Mikolov et al., 2013) is a popular implementation of this idea
Example words shown on the slide: Sushi, Teriyaki, Udon, Okonomiyaki, Soba, Sashimi, Kimono, Yukata, Nemaki, KFC, Steak, Hamburger, McDonald’s, Jeans, T-shirt, Skirt

6. Research Question:
• Can we predict the type of the concept ‘Sushi’ by modelling it in a distributional semantics space and comparing its vector to the vectors of concepts for which we do know the type?
Example words shown on the slide: Sushi, Teriyaki, Udon, Okonomiyaki, Soba, Sashimi, Kimono, Yukata, Nemaki, KFC, Steak, Hamburger, McDonald’s, Jeans, T-shirt, Skirt
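The research question can be illustrated with a minimal sketch. It is not the authors' implementation: the three-dimensional vectors, the type labels, and the helper names `cosine` and `predict_type` are all hypothetical, chosen only to show how an untyped concept could inherit the type of its nearest typed neighbour.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy vectors; a real Word2Vec model has hundreds of dimensions.
VECTORS = {
    "Sushi":   (0.90, 0.10, 0.00),
    "Sashimi": (0.80, 0.20, 0.10),
    "Udon":    (0.85, 0.05, 0.10),
    "Kimono":  (0.10, 0.90, 0.10),
    "Jeans":   (0.00, 0.80, 0.30),
    "KFC":     (0.50, 0.10, 0.80),
}

# Illustrative type labels for the concepts whose type is assumed to be known.
KNOWN_TYPES = {
    "Sashimi": "Food",
    "Udon": "Food",
    "Kimono": "Clothing",
    "Jeans": "Clothing",
    "KFC": "Company",
}

def predict_type(entity, vectors, known_types):
    """Return (nearest typed neighbour, its type) for an untyped entity."""
    target = vectors[entity]
    candidates = [n for n in known_types if n != entity and n in vectors]
    best = max(candidates, key=lambda n: cosine(target, vectors[n]))
    return best, known_types[best]

neighbour, predicted = predict_type("Sushi", VECTORS, KNOWN_TYPES)
print(neighbour, predicted)
```

With these toy numbers the nearest typed neighbour of 'Sushi' is 'Udon', so the sketch assigns the type 'Food'.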

7. Setup
• 7 Named Entity Linking Benchmark datasets (AIDA-YAGO, 2014 NEEL, 2015 NEEL, OKE2015, RSS500, WES2015, Wikinews)
• 3 Word2Vec models: GoogleNews, English Wikipedia, Reuters RCV1*
• Compare all entities within datasets to each other and return highest ranking type (as taken from DBpedia)
* AIDA-YAGO is part of Reuters RCV1
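The "return highest ranking type" step could look roughly like the sketch below. It is an assumption-laden illustration rather than the published code: scoring each type by its single best-matching entity, the `hit_at_k` check, the toy vectors, and the type labels are all hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def rank_types(target_vec, typed_vectors):
    """Order types by the best similarity of any entity carrying that type.

    typed_vectors is a list of (vector, type) pairs for entities whose type
    is already known (for example, looked up in a knowledge base).
    """
    best = {}
    for vec, entity_type in typed_vectors:
        score = cosine(target_vec, vec)
        if score > best.get(entity_type, -1.0):
            best[entity_type] = score
    return sorted(best, key=best.get, reverse=True)

def hit_at_k(ranked_types, gold_type, k):
    """True if the gold type appears among the k highest-ranked types."""
    return gold_type in ranked_types[:k]

# Hypothetical toy data: a 'KFC'-like target and three typed entities.
TARGET = (0.5, 0.1, 0.8)
TYPED = [
    ((0.8, 0.2, 0.1), "Food"),      # e.g. Sashimi
    ((0.4, 0.1, 0.9), "Company"),   # e.g. McDonald's
    ((0.0, 0.8, 0.3), "Clothing"),  # e.g. Jeans
]

ranked = rank_types(TARGET, TYPED)
print(ranked)
print(hit_at_k(ranked, "Food", 1), hit_at_k(ranked, "Food", 2))
```

Looking beyond the single top-ranked type is one way to see how often the correct type is nearly found; whether the original evaluation used such cut-offs is not stated on this slide.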

8. Initial results
• Not so great?

9. Initial results (some footnotes)
• Ranking approach favours fine-grained entity types
• The Word2Vec corpus matters! NEEL 2014 and 2015 are derived from tweets, which typically have low coverage when querying models trained on news
• Smaller datasets (Wikinews, WES2015, OKE2015) do better?

10. Let’s zoom in on topics
• Initially, all entities within a benchmark dataset were compared to all other entities.
• What if we only compare entities from sports documents to other entities from sports documents?
[Figure: six panels for AIDA-YAGO, showing coarse-grained categories (1 to 22) and fine-grained categories (1 to 69) for each of the GoogleNews, RCV1, and Wikipedia models; scale 20 to 100; legend 1, 5, 10]
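The topic restriction described above can be sketched as a filter on the candidate set before the nearest-neighbour lookup. Everything here is hypothetical: the entity names, topic labels, type labels, vectors, and the `predict_type` helper are illustrative assumptions, not taken from the authors' repository.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical typed entities: (name, document topic, type, toy vector).
KNOWN = [
    ("Jaguar Cars",     "business", "Company",    (0.9, 0.3, 0.1)),
    ("Ford",            "business", "Company",    (0.8, 0.4, 0.2)),
    ("Miami Dolphins",  "sports",   "SportsTeam", (0.3, 0.9, 0.1)),
    ("Serena Williams", "sports",   "Athlete",    (0.1, 0.6, 0.8)),
]

def predict_type(target_vec, known, topic=None):
    """Type of the most similar known entity, optionally within one topic.

    With topic=None every known entity is a candidate (the initial setup);
    with a topic, only entities from documents on that topic are compared.
    Returns None when no candidate remains.
    """
    candidates = [e for e in known if topic is None or e[1] == topic]
    if not candidates:
        return None
    best = max(candidates, key=lambda e: cosine(target_vec, e[3]))
    return best[2]

# A hypothetical mention of 'Jaguars' found in a sports document.
TARGET = (0.85, 0.45, 0.1)

print(predict_type(TARGET, KNOWN))                  # all entities compared
print(predict_type(TARGET, KNOWN, topic="sports"))  # sports entities only
```

In this toy setting the unrestricted comparison picks a business entity, while restricting candidates to the sports topic yields a sports type, which mirrors the intuition that topics shrink the search space.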

11. Conclusions and Future Work
• Difficult task, but topics help
• Ranking needs to be improved
• Multi-class classification (KFC: food & organisation, Arnold Schwarzenegger: Actor & Politician)
• What else can we discover beyond type?

12. Thank you! https://github.com/MvanErp/entity-typing/

13. This research was made possible by the CLARIAH-CORE project financed by NWO. http://www.clariah.nl
