This document discusses OpenWordNet-PT, an open-source Portuguese wordnet. It began as an automated projection of the English Princeton WordNet onto Portuguese, leveraging existing multilingual resources. The initial version contained over 120,000 synsets mapped to Portuguese with automated methods. Further work involved manual validation and revision of the most frequent synsets to improve precision. The resource has been used in applications such as word sense disambiguation, sentiment analysis, and linking deverbal nouns to their corresponding verbs. Ongoing work involves continued manual review and expansion efforts to improve coverage and relations. The document outlines the methodology and provides examples of entries. The resource is available via GitHub under a free license and as Linked Data to support interoperability.
4. WordNet…
- WordNet has been developed at Princeton University under George A. Miller since 1985. It is a lexical database for English: it groups English words into sets of synonyms called synsets, provides short, general definitions, and records the various semantic relations between these synsets.
- The result is a combination of dictionary and thesaurus that is intuitive, usable, and supports automatic text analysis and artificial intelligence applications. It is released under a BSD-style license and can be downloaded and used freely.
- WordNet is the most commonly used computational lexicon of English.
- Some complain that WordNet encodes sense distinctions that are too fine-grained even for humans. The granularity issue has been tackled by proposing clustering methods that automatically group together similar senses of the same word.
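The synset structure described on this slide can be pictured with a minimal sketch. The class name, the identifiers ("dog-n", "canine-n"), the glosses, and the relation label "hypernym" are all invented for illustration; they are not the actual WordNet file format or data.

```python
from dataclasses import dataclass, field
from typing import Dict, List


@dataclass
class Synset:
    """A set of synonymous words with a gloss and outgoing semantic relations."""
    synset_id: str                      # hypothetical identifier, not a real WordNet offset
    words: List[str]                    # the synonyms grouped in this synset
    gloss: str                          # short, general definition
    relations: Dict[str, List[str]] = field(default_factory=dict)  # relation name -> target synset ids


# Toy lexicon with two made-up entries.
LEXICON = {
    "dog-n": Synset("dog-n", ["dog", "domestic dog"], "a domesticated canine",
                    {"hypernym": ["canine-n"]}),
    "canine-n": Synset("canine-n", ["canine", "canid"], "a mammal of the dog family"),
}


def related_words(lexicon, synset_id, relation):
    """Collect the words of every synset reachable through one relation."""
    words = []
    for target_id in lexicon[synset_id].relations.get(relation, []):
        words.extend(lexicon[target_id].words)
    return words
```

Calling `related_words(LEXICON, "dog-n", "hypernym")` walks one relation edge and returns the synonyms of the more general synset, which is the dictionary-plus-thesaurus behaviour the slide refers to.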
5. Global WordNet Association
- Christiane Fellbaum and Piek Vossen (EuroWordNet 1996-1999, GWA since)
- The Global WordNet Association (GWA) is a free, public and non-commercial organization that provides a platform for discussing, sharing and connecting wordnets for all languages in the world.
- Global WordNet Grid since 2006. Open Multilingual Wordnet (Francis Bond): http://casta-net.jp/~kuribayashi/multi/
- A simple user interface: "Welcome to the Open Multi-lingual Wordnet (1.0)", http://casta-net.jp/~kuribayashi/cgi-bin/wn-multi.cgi
7. OpenWordnet-PT? (aren't all wordnets open?)
Previous work: WordNet.PT and WordNet.PT global (Lisboa), MultiWordNet.PT, and the Brazilian WordNet by Bento Dias da Silva.
We need a Portuguese wordnet for our work, but none of the previous projects is openly available.
8. Previous Portuguese WordNets…
- WordNet.PT (P. Marrafa), since 1999, part of EuroWordNet: 19K expressions, manually curated, online consultation only, some domains.
- MWN.PT, the MultiWordnet of Portuguese (A. Branco), since 2008, part of MWN: over 17,200 manually validated concepts/synsets, not free.
- WN.Br (B. Dias da Silva), since 2000: not open, not available online; REBECA (2010) covers only 'wheeled vehicles'.
9. OpenWN-PT: What?
- Leverage EuroWordNet, MultiWordNet, and Global WordNet experience.
- Recruited Gerard de Melo for the project.
- Leverage YAGO and UWN/MENTA experience…
- UWN/MENTA (de Melo/Weikum): a large-scale multilingual lexical knowledge base built using statistical methods, transforming WordNet into a massively multilingual resource (over 1 million words and several million named entities in a single large multilingual taxonomy).
- The Portuguese "projection" of UWN/MENTA is the basis of the automated version of OpenWordNet-PT, which is publicly available: https://github.com/arademaker/wordnet-br
10. OpenWN-PT: the basis…
Combined the following data:
- Princeton WordNet 3.0, used to obtain English glosses and English terms for synset IDs.
- The unreleased 2010-12 version of UWN and MENTA, which provided candidate terms in Portuguese, candidate glosses in Portuguese (from Wikipedia), and candidate terms in Spanish.
- The EuroWordNet base concept list (5000_bc.xml), which provides the base concept numbers. The original file was mapped from WordNet 2.0 to 3.0 using the mappings from WN-Map. When multiple mappings for a WordNet 2.0 synset existed, all possible WordNet 3.0 synsets were kept.
http://nlp.lsi.upc.edu/web/index.php?option=com_content&task=view&id=21&Itemid=57
https://github.com/arademaker/wordnet-br
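The version-mapping step for the base concept list (keep every WordNet 3.0 target when a WordNet 2.0 synset has several) can be sketched as follows. The identifiers are placeholders rather than real synset offsets, the dictionary stands in for the WN-Map files, and reporting unmapped concepts separately is an assumption of this sketch, not something the slide states.

```python
def map_base_concepts(base_concepts, wn20_to_wn30):
    """Map WordNet 2.0 base concepts to WordNet 3.0, keeping all targets.

    base_concepts: list of WordNet 2.0 synset ids (placeholders here).
    wn20_to_wn30:  dict from a 2.0 id to a list of candidate 3.0 ids.
    Returns (mapped 3.0 ids without duplicates, 2.0 ids with no mapping).
    """
    mapped, unmapped = [], []
    for old_id in base_concepts:
        new_ids = wn20_to_wn30.get(old_id, [])
        if not new_ids:
            unmapped.append(old_id)      # assumption: report rather than drop silently
        for new_id in new_ids:
            if new_id not in mapped:     # several 2.0 synsets may share a 3.0 target
                mapped.append(new_id)
    return mapped, unmapped


# Placeholder data: "wn20-b" has two possible targets, "wn20-c" has none.
BASE_CONCEPTS = ["wn20-a", "wn20-b", "wn20-c"]
WN20_TO_WN30 = {"wn20-a": ["wn30-a"], "wn20-b": ["wn30-b1", "wn30-b2"]}
MAPPED, UNMAPPED = map_base_concepts(BASE_CONCEPTS, WN20_TO_WN30)
```

Keeping all targets favours coverage: an ambiguous mapping enlarges the base concept set instead of forcing an arbitrary choice between WordNet 3.0 synsets.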
11. OpenWN-PT: the method
- A two-tiered methodology: high precision for the more frequent words of the language, but also high recall to cover a wide range of words in the long tail.
- Translation dictionaries are used to map the English members of a synset to possible Portuguese translation candidates. To disambiguate and choose the correct translations, feature vectors for possible translations are created by computing graph-based statistics in the graph of words, translations, and synsets. Monolingual wordnets and parallel corpora are used to enrich this graph. Statistical learning techniques are used to iteratively refine this information and build an output graph connecting Portuguese words to synsets.
- Wikipedia pages are then linked to relevant WordNet synsets by learning from similar graph-based features as well as gloss similarity scores.
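To make the candidate-generation idea concrete, here is a toy sketch of computing graph statistics for translation candidates. It is not the UWN/MENTA algorithm: the two features (coverage and specificity), the hand-set threshold that replaces the learned model, and the tiny dictionaries are all assumptions chosen for illustration.

```python
from collections import defaultdict


def candidate_features(synset_words, en_to_pt, pt_to_en):
    """Compute two toy graph statistics for each Portuguese candidate.

    coverage:    share of the synset's English words that translate to the candidate.
    specificity: share of the candidate's English back-translations inside the synset.
    """
    support = defaultdict(set)
    for en_word in synset_words:
        for pt_word in en_to_pt.get(en_word, []):
            support[pt_word].add(en_word)

    features = {}
    for pt_word, sources in support.items():
        coverage = len(sources) / len(synset_words)
        back = set(pt_to_en.get(pt_word, ()))
        specificity = len(back & set(synset_words)) / len(back) if back else 0.0
        features[pt_word] = (coverage, specificity)
    return features


def select_candidates(features, threshold=0.5):
    """Stand-in for the learned classifier: keep candidates whose score passes a threshold."""
    return sorted(pt for pt, (cov, spec) in features.items() if cov * spec >= threshold)


# Hypothetical dictionary fragments for an English synset {car, automobile}.
SYNSET = ["car", "automobile"]
EN_TO_PT = {"car": ["carro", "vagão"], "automobile": ["carro", "automóvel"]}
PT_TO_EN = {"carro": ["car", "automobile"],
            "vagão": ["car", "wagon", "coach"],
            "automóvel": ["automobile"]}

FEATURES = candidate_features(SYNSET, EN_TO_PT, PT_TO_EN)
SELECTED = select_candidates(FEATURES)
```

In this toy data "vagão" is rejected because most of its back-translations fall outside the synset, which is the kind of disambiguation signal the feature vectors are meant to capture. A real system would learn the decision from many more features instead of using a fixed threshold.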
12. More method…
- To have high precision for the most important concepts of a language, rely on human annotators.
- Set of 4689 "Common Base Concepts" from the GWA.
- 2,498 manually entered sense-word pairs, as well as an additional 1,299 manually written Portuguese synset glosses.
- Does it work?
15. OpenWN-PT: what does it look like?
- Typical good entry with minor manual improvements.
- The automatic step produces candidate Portuguese words for some of the WN 3.0 synsets.
- Annotators check the suggested words and add a Portuguese gloss and examples.
16. OpenWN-PT: what does it look like?
Not very useful
Good automatic suggestion
19. OpenWN-PT: manual revisions
Native speakers, but not linguists…
Plenty of errors…
20. OpenWN-PT: RDF Representation
- Why? To address the issue of interoperability between wordnets, and to be able to rely on Linked Data and Semantic Web standards such as RDF and OWL.
- The emergence of Linked Data projects for lexical and reasoning resources led us to encode and distribute OpenWN-PT in RDF/OWL.
- The standards allow both the data model and the actual data to be expressed in the same format. They also bring a range of existing data processing tools, including databases ("triple stores") with SQL-like query interfaces (SPARQL).
- There has been a standard W3C encoding of WordNet in RDF since 2006. OpenWN-PT is modelled after and fully interoperable with Princeton WordNet.
- This means that one can easily find Portuguese equivalents for specific English word senses and vice versa. It also means OpenWN-PT is part of a large ecosystem of compatible resources, including domain identifiers and mappings to Wikipedia.
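The cross-lingual lookup that the RDF encoding enables can be illustrated with a tiny in-memory triple list. This is only a sketch of the idea: the predicate names ("en:word", "pt:word", "sameAs") and the synset identifiers are invented here and are not the vocabulary actually used by OpenWN-PT or the W3C WordNet encoding. A real deployment would query a triple store with SPARQL instead.

```python
# Each triple is (subject, predicate, object); all names are hypothetical.
TRIPLES = [
    ("en:synset-dog", "en:word", "dog"),
    ("en:synset-cat", "en:word", "cat"),
    ("pt:synset-dog", "sameAs", "en:synset-dog"),   # link between the two wordnets
    ("pt:synset-cat", "sameAs", "en:synset-cat"),
    ("pt:synset-dog", "pt:word", "cão"),
    ("pt:synset-dog", "pt:word", "cachorro"),
    ("pt:synset-cat", "pt:word", "gato"),
]


def subjects(triples, predicate, obj):
    """All subjects s such that (s, predicate, obj) is in the graph."""
    return [s for s, p, o in triples if p == predicate and o == obj]


def objects(triples, subject, predicate):
    """All objects o such that (subject, predicate, o) is in the graph."""
    return [o for s, p, o in triples if s == subject and p == predicate]


def portuguese_equivalents(english_word, triples):
    """Follow word -> English synset -> linked Portuguese synset -> Portuguese words."""
    result = []
    for en_synset in subjects(triples, "en:word", english_word):
        for pt_synset in subjects(triples, "sameAs", en_synset):
            result.extend(objects(triples, pt_synset, "pt:word"))
    return result
```

`portuguese_equivalents("dog", TRIPLES)` follows three graph patterns in sequence, which corresponds to a three-pattern join in a SPARQL query; the reverse direction works the same way with the predicates swapped.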
21. Progress Report
- Checking is much easier than starting from scratch.
- But it is long and tedious work to check even the initial 5k synsets suggested by the GWA (not done yet!), let alone all synsets in OpenWN-PT.
- Necessary? YES! Lexical gaps of all sorts.
- But the resource is being used, warts and all…
- Improving the resource: new data from Bond/Foster and some manual additions.
22. Use Cases: FreeLing
- Word Sense Disambiguation via FreeLing 3.0, an open source suite of language analyzers.
- OpenWN-PT has been incorporated into FreeLing (Padró and Stanilovsky, 2012).
- Using FreeLing's word sense disambiguation framework, a given Portuguese text can automatically be annotated with word senses.
- UPC, Barcelona
23. Use Cases: Sentiment Analysis
- Sentiment analysis, using tweets about soccer games.
- OpenWN-PT and SentiWordNet were used as a point of comparison for the machine-learning-based sentiment analysis integrated into the IBM InfoSphere Streams (ISS) platform.
- 1 million tweets, 4 friendly matches of the Brazilian team in 2013, 7 classes of positivity.
- IBM Research, BR
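A lexicon-based scorer of the kind this slide alludes to could look roughly like the sketch below. It is a guess at the general approach, not the system that was built: the synset identifiers, the scores, the whitespace tokenizer, and the averaging rule are all assumptions, and the 7-class binning mentioned on the slide is deliberately left out because its thresholds are not given.

```python
# Hypothetical mapping from Portuguese words to synset ids (the role played by OpenWN-PT).
PT_WORD_TO_SYNSETS = {"vitória": ["syn-win"], "derrota": ["syn-loss"]}

# Hypothetical (positive, negative) scores per synset (the role played by SentiWordNet).
SYNSET_SCORES = {"syn-win": (0.5, 0.0), "syn-loss": (0.0, 0.625)}


def tweet_polarity(text, word_to_synsets, synset_scores):
    """Average (positive - negative) over every synset of every known word in the text."""
    scores = []
    for token in text.lower().split():          # naive tokenizer, an assumption
        for synset_id in word_to_synsets.get(token, []):
            positive, negative = synset_scores[synset_id]
            scores.append(positive - negative)
    return sum(scores) / len(scores) if scores else 0.0
```

Words missing from the lexicon contribute nothing, so `tweet_polarity("Grande vitória", PT_WORD_TO_SYNSETS, SYNSET_SCORES)` depends only on "vitória". This also shows why lexical coverage of the wordnet matters for such a baseline.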
24. Use Cases: NomLex-Br (Livy Real)
- NomLex-BR is an extension of OpenWN-PT that aims at incorporating links to connect deverbal nouns with their corresponding verbs.
- For English, NOMLEX (Macleod et al., 1998) has provided extensive descriptions of nominalizations via extensions of an initial core.
- NOMLEX was constructed starting out with nominalizations with the suffixes -ion, -ment and -er, taking samples of the most frequent words first in a list of nouns from a combination of the Brown Corpus and the Wall Street Journal (about 1 million words of each).
- NOMLEX-BR: translation of the initial core, plus the French Nomage.
- Overall, we have created over 2,000 entries. These have been integrated into OpenWN-PT, which will facilitate their use for linguistic research as well as information extraction.
Example: "The destruction of the city by Alexander in 330 BC…"
25. Use Cases: NomLex-Br (Livy Real)
- Incorporating NOMLEX-BR data into OpenWN-PT has proved useful in pinpointing some issues with the coherence and richness of OpenWN-PT.
- The word abasement corresponds in NOMLEX to the verb abase, and thus we would like a similar correspondence between the Portuguese noun aviltamento and the verb aviltar (our suggested translations). OpenWN-PT simply has two synsets: {humilhar, abaixar} and {humilhar, rebaixar}. The more common verb humilhar is repeated, while the uncommon aviltar was left out.
- Other useful kinds of relationships between parts of speech (say, the connections between adjectives and adverbs) are likely to also help to improve the accuracy and richness of OpenWN-PT.
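The kind of check that surfaces the aviltar gap can be sketched in a few lines. The two synsets and the noun-verb pair come from the slide; the function and variable names are hypothetical, and the real NOMLEX-BR integration is certainly more involved.

```python
from collections import Counter


def verb_coverage_report(noun_verb_pairs, verb_synsets):
    """Report verbs from nominalization pairs that no synset contains,
    and verbs that occur in more than one synset."""
    counts = Counter(word for synset in verb_synsets for word in synset)
    missing = [verb for _, verb in noun_verb_pairs if verb not in counts]
    repeated = sorted(word for word, n in counts.items() if n > 1)
    return missing, repeated


# Data taken from the slide's example.
PAIRS = [("aviltamento", "aviltar")]
VERB_SYNSETS = [["humilhar", "abaixar"], ["humilhar", "rebaixar"]]

MISSING, REPEATED = verb_coverage_report(PAIRS, VERB_SYNSETS)
```

Run over the full nominalization list, a report like this gives annotators a worklist: missing verbs point to lexical gaps, and repeated verbs point to synsets whose distinctions may need review.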
26. Miscellaneous Experiments
- Accuracy: we chose six relations: hypernymOf, memberHolonymOf, instanceOf, substanceHolonymOf, entails and causes.
- For each of these relations, we randomly chose 30 pairs of synsets and then random words from each synset. We ended up with 180 random sentences for manual verification.
- The linguist marked each sentence as "correct", "wrong" or "dubious": 150 sentences were marked correct (83% of the sentences), 17 wrong, and 13 dubious.
- A more systematic effort is needed, but the results were encouraging.
- Coverage: using DHBB (the Brazilian Historical Biographic Dictionary) to complete NOMLEX-BR.
- Other paper…
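The accuracy figures on this slide can be rechecked with a few lines of arithmetic. The counts and relation names are the ones reported above; the variable names are only for illustration.

```python
RELATIONS = ["hypernymOf", "memberHolonymOf", "instanceOf",
             "substanceHolonymOf", "entails", "causes"]
PAIRS_PER_RELATION = 30

# Judgements reported for the sampled sentences.
JUDGEMENTS = {"correct": 150, "wrong": 17, "dubious": 13}

TOTAL = sum(JUDGEMENTS.values())

# The sample size should equal relations x pairs per relation.
assert TOTAL == len(RELATIONS) * PAIRS_PER_RELATION

# Percentage of the sample in each category, rounded to one decimal place.
SHARES = {label: round(100 * count / TOTAL, 1) for label, count in JUDGEMENTS.items()}
```

With 180 sentences in total, the shares come out at about 83.3% correct, 9.4% wrong, and 7.2% dubious, which matches the 83% quoted on the slide.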
27. Conclusions
- We discussed the implementation and some applications of OpenWordNet-PT, an open WordNet for Brazilian Portuguese.
- Recent improvements include better coverage and nominalization links connecting nouns and verbs.
- The resource has been used in developing a high-throughput commercial system as well as in a cultural heritage project, and we anticipate that numerous further applications will follow.
- The data is freely available from http://github.com/arademaker/wordnet-br/ and through a SPARQL endpoint at logics.emap.fgv.br:10035.
- Browsing via the Open Multilingual Wordnet (http://www.casta-net.jp/~kuribayashi/cgi-bin/wn-multi.cgi) is fun.
28. OpenWN-PT: next steps?
- First, finish translating the "core" synsets in the Princeton WordNet to Portuguese.
- Increase the number of relations in OpenWN-PT as a way of improving adequacy and coherence.
- Add the Portuguese terms that satisfy different relations?
- OpenVerbNet-PT?
- Since we have a first target corpus, the Brazilian Historical Biographic Dictionary, we can also calculate word frequencies to prioritize the expansion of OpenWN-PT and go back to the ontology building...
30. References
- de Paiva, Valeria, and Alexandre Rademaker. 2012. Revisiting a Brazilian Wordnet. In Proceedings of the Global Wordnet Conference, Global Wordnet Association, Matsue.
- de Paiva, Valeria, Alexandre Rademaker, and Gerard de Melo. OpenWordNet-PT: An Open Brazilian WordNet for Reasoning. In Proceedings of the 24th International Conference on Computational Linguistics. http://hdl.handle.net/10438/10274.
- Rademaker, Alexandre, Valeria de Paiva, Gerard de Melo, Livy Real, and Maira Gatti. 2014. OpenWordNet-PT: A Project Report. In Proceedings of the 7th Global Wordnet Conference, Tartu, Estonia. Global Wordnet Association.
- Coelho, Livy Maria Real, Alexandre Rademaker, Valeria de Paiva, and Gerard de Melo. 2014. Embedding NomLex-BR Nominalizations into OpenWordnet-PT. In Proceedings of the 7th Global WordNet Conference, Tartu, Estonia.
31. Other stuff to add in?…
- Onto.PT, ES wordnet?
- Editing interfaces?
- BabelNet?
- NER issues?
- Temporal issues?
- Work with Claudia Freitas?… Leonel?
- Work on implicatives/factives in Portuguese?
- FOIS workshop
32. References
- Gerard de Melo and Gerhard Weikum. 2009. Towards a Universal Wordnet by Learning from Combined Evidence. 18th ACM Conference on Information and Knowledge Management (CIKM 2009), Hong Kong, China.
- Valeria de Paiva. 2010. Bridges from Language to Logic: Concepts, Contexts and Ontologies. Logical and Semantic Frameworks with Applications (LSFA'10), Natal, Brazil.
- "A Basic Logic for Textual Inference", AAAI Workshop on Inference for Textual Question Answering, 2005.
- "Textual Inference Logic: Take Two", CONTEXT 2007.
- "Precision-focused Textual Inference", Workshop on Textual Entailment and Paraphrasing, 2007.
- "PARC's Bridge and Question Answering System", Proceedings of Grammar Engineering Across Frameworks, 2007.
33. Simplifying the PARC's Bridge Architecture
[Figure: architecture diagram. Labels: Text; Parsing; KR Mapping; Inference Engines; F-structure; semantics; KR; Sources; Assertions; Query; Question; Grammar; Stanford Parser; Term rewriting; OpenWN-PT; SUMO-PT; KR mapping rules; Textual Inference logics]
Idea: Simplify and reproduce components in PORTUGUESE