The Importance of Being Earnest: Open Datasets in Portuguese

1/31
Introduction
Silicon Valley
PARC, XLE, Bridge
Applications
The Importance of Being Earnest:
Open Datasets in Portuguese
Valeria de Paiva
OPENCOR 2021
Dec 2021

2/31
Thanks, Livy!

3/31
Personal Stories
I’m an AI scientist, a mathematician, a computational semanticist
and a category theorist.
I work in Silicon Valley, have done so for the last 22 years, applying
pure mathematics to computing, in surprising ways.

4/31
In the Valley

5/31
Personal stories: Now

6/31
How?

7/31
Samsung Research America (2019-2020)
Dialogue and Knowledge Representation Lab, SRA, Mountain
View
project: systems to make Bixby (voice personal assistant)
communicate well with home appliances via SmartHome
Samsung acquired SmartThings in 2014 and Viv in Oct 2016;
it needed to integrate the stacks and grow Bixby
SmartThings: leading open platform for the smart home and
the consumer Internet of Things in 2014.
open-source project: develop an ontology of smart devices, based on
WikiData and customer-facing functionality.

8/31
Nuance Communications (2012-2018)
AI and NLP Lab in Sunnyvale
Nuance had the best voice recognition software system in
2012. It needed to add AI to turn sounds into knowledge. A big
effort in several labs: Montreal, Boston, Sunnyvale.
application areas: health systems, automotive, law, CRM,
banks, insurance, etc.
our lab projects: personal assistant for the living room (TV 2nd
screen), PA for automotive companies
Building small smartness into conventional search, e.g. ‘find
allergy medication near me’
possible open-source projects: voice interfaces to WikiData? student
internships?

9/31
Rearden Commerce (2011-2012)
AI/KR Lab in Foster City
a white-labelling shop for travel and expenses/procurement
systems. application areas: air travel tickets, hotels, shows &
sports, restaurants, ground transportation, parking, etc.
our lab project: a Groupon-like app, as RC acquired HomeRun
Using ontologies to discover what hotel reviewers really valued
possible open-source projects: projections of WikiData adapted to
Brazil, for ‘skill sets’, for Brazilian culture, etc.

10/31
Cuil (2008-2010)
Start-up search company in Menlo Park
a Google competitor created by ex-Googlers
all sorts of tasks, from baby-sitting servers to dealing with
customers
Learning-to-rank algorithms
PARC Forum talk: Adventures in Searchland
possible open-source project: timelines in Portuguese

11/31
PARC, XLE and Bridge

12/31
How to think about Conversational Assistants?

13/31
How to think about Conversational Assistants?

14/31
Several Conversational Assistants: applications
(AI Summit 2018, Luxembourg)

15/31
Several Conversational Assistants
(AI Summit 2018, Luxembourg)

16/31
Natural Language Inference (NLI)
Shock: the work of almost nine years at PARC was out of reach
when I left in 2008
I gave a talk at SRI proposing to redo it all as open source
(Bridges, ENTCS 2011)
Pleased to report that almost all of it is now available
open source, redone from scratch, using new techniques
Katerina Kalouli, ex-PhD student at Konstanz, now assistant
professor in Munich. GKR Demo:
https://cis.lmu.de/~kalouli/resources.html

17/31
What about Portuguese?
Alexandre Rademaker and I started OpenWordNet-PT (OWN-PT) in 2012:
There was no open-source WordNet for Portuguese then
Sources on GitHub
OWN-PT originally obtained from Universal WordNet (Weikum
and de Melo)
RDF distribution from the beginning
openwordnet-pt.org

18/31
OpenWordNet-PT Data

19/31
OpenWordNet-PT Examples

20/31
OpenWordNet-PT Basic Stats
1. OWN-PT is big, around 50K synsets.
2. PWN (Princeton WordNet) is much bigger, 117K synsets.
3. We have more than 7K synsets of verbs: definitely not enough,
but one can start to play
4. More than twice as big as Russian WordNet, bigger than
Spanish, only slightly smaller than French
5. And issues, many issues
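
To put the sizes above in proportion, here is a minimal arithmetic sketch. The constants are the rounded figures from the slide ("around 50K", "117K", "more than 7K"), so the resulting shares are approximations, not exact database counts; the function name `share` is only illustrative.

```python
# Rounded figures taken from the slide; the real counts differ slightly.
OWN_PT_SYNSETS = 50_000       # "around 50K synsets"
PWN_SYNSETS = 117_000         # Princeton WordNet, "117K synsets"
OWN_PT_VERB_SYNSETS = 7_000   # "more than 7K synsets of verbs"


def share(part: int, whole: int) -> float:
    """Return part/whole as a fraction between 0 and 1."""
    if whole <= 0:
        raise ValueError("whole must be positive")
    return part / whole


if __name__ == "__main__":
    print(f"OWN-PT size relative to PWN: {share(OWN_PT_SYNSETS, PWN_SYNSETS):.0%}")
    print(f"Verb synsets within OWN-PT: {share(OWN_PT_VERB_SYNSETS, OWN_PT_SYNSETS):.0%}")
```

With these rounded inputs OWN-PT comes out at a bit over two fifths of PWN's synset count, and verbs at roughly one seventh of OWN-PT.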

21/31
OpenWordNet-PT Papers
1. Papers trying to clean up the database
2. Nominalizations and their issues (Livy Real)
3. Using corpora to extend our vocabulary (Claudia Freitas)
4. Interfaces for progress (Fabricio Chalub)
5. Two papers on verb lexicon
6. Two papers on Historical archives (DHBB)
7. WordNets themselves (Hugo & Alberto)
8. Gentilics, Adverbs
9. Temporal expressions
10. Morpholinks
etc.

22/31
OpenWordNet-PT
We were doing so well...
Google Translate, Open Multilingual WordNet, BabelNet and FreeLing
used our OWN-PT.
https://translate.google.com/intl/en/about/license/ still says so.
But then Transformers arrived, and with them a series of new
challenges.

24/31
OpenWordNet-PT Questions
1. One can try to carry on cleaning the data, using the
lexicographer files (to the left). Is it worth doing?
2. Should we instead grow, not worrying much about precision?
3. I wish we had glosses in Portuguese. Alberto Simões produced
them for us, but we never added them to the
database, as the quality of the Portuguese text wasn’t great.
4. This data is open source: anyone can get it, make it
better, and let us have the result. Or not: our license is very broad
5. In a long-term project, goals eventually diverge and people want
to try other things. The beauty of GitHub is being able to keep the
version you want
In any case, a high-quality Portuguese WordNet is simply one
lexical resource we need. We need others, and we have been
working on those in parallel.

26/31
Beginning UD-PT
1. Corpus Bosque: traditional news corpus (EU-PT and BR-PT),
mangled by several versions and conversions.
2. PALAVRAS (Bick): a rule-based Constraint Grammar (CG)
system designed for Portuguese. It produces deep linguistic
analyses, with tags at the morphological, syntactic (dependency)
and semantic levels. (not open source)
3. First version of our data, UD 1.4 compliant, included in UD
release 1.4 as UD Portuguese-Bosque. Not too bad!
4. Then we “accepted the challenge” of updating UD-PT-Bosque
to the UD 2.0 guidelines and replacing the previous UD Portuguese
corpus. Phew!

27/31
Issues of UD-PT
1. Gender: underspecified gender, e.g. grande (big) or feliz (happy)
2. MWEs: changing from PALAVRAS to UD 1.x to UD 2.x was
complicated. MWEs still are.
3. Participles: verbs or adjectives?
4. Ellipses: changes from UD 1 to UD 2, plus ellipses are difficult
5. Clitics, and all the things that “se” and “que” can be.
6. Non-explicit subjects (sujeito oculto and others): see the excellent
new work of Freitas, de Souza.
7. Negation (UD changed its mind), and negation is hard
8. Appositives vs. nmod: PT had a different opinion
9. Auxiliary verbs
10. xcomp-that, ccomp-to
See http://medialab.di.unipi.it/depling/assets/docs/day2/02_demo2.pdf
for status in 2017. Now a meeting group.
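
The gender issue in item 1 can be made concrete with a small sketch over the CoNLL-U format used by UD treebanks (ten tab-separated columns, with morphological features in the sixth). The three-token fragment below is invented for illustration and is not copied from UD Portuguese-Bosque; the helper names are hypothetical. It lists adjectives that carry no `Gender` feature, as one would expect for invariable forms like grande.

```python
# Hypothetical CoNLL-U fragment for "A casa grande" (illustrative only,
# not taken from the treebank). Columns: ID FORM LEMMA UPOS XPOS FEATS
# HEAD DEPREL DEPS MISC.
CONLLU = (
    "# text = A casa grande\n"
    "1\tA\to\tDET\t_\tDefinite=Def|Gender=Fem|Number=Sing|PronType=Art\t2\tdet\t_\t_\n"
    "2\tcasa\tcasa\tNOUN\t_\tGender=Fem|Number=Sing\t0\troot\t_\t_\n"
    "3\tgrande\tgrande\tADJ\t_\tNumber=Sing\t2\tamod\t_\t_\n"
)


def parse_feats(feats: str) -> dict:
    """Turn 'Gender=Fem|Number=Sing' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def tokens(conllu: str):
    """Yield one dict per syntactic word, skipping comments and blank lines."""
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue
        cols = line.split("\t")
        # Skip multiword-token ranges (e.g. "1-2") and empty nodes (e.g. "1.1").
        if len(cols) != 10 or not cols[0].isdigit():
            continue
        yield {"form": cols[1], "upos": cols[3], "feats": parse_feats(cols[5])}


def without_gender(conllu: str, upos=("ADJ",)) -> list:
    """Return forms with the given UPOS tags that have no Gender feature."""
    return [
        t["form"]
        for t in tokens(conllu)
        if t["upos"] in upos and "Gender" not in t["feats"]
    ]


if __name__ == "__main__":
    print(without_gender(CONLLU))
```

Run over a real treebank file, a check of this kind would give a quick list of candidates to review when deciding how underspecified gender should be annotated.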

29/31
SICK-BR
e.g. https://www.ime.usp.br/~bruna/SICK_PT.pdf
1. A big group from GLIC, USP.
2. An easy-to-obtain, state-of-the-art automated translation (Milos
Stanojevic)
3. Lots of human work correcting the automated translation to get
SICK-BR, a Brazilian Portuguese corpus annotated with
inference relations and semantic relatedness between pairs of
sentences
4. SICK-BR is a translation and adaptation of the original SICK, a
corpus of English sentences used in several semantic evaluations
5. SICK-BR has around 10k sentence pairs annotated for
neutral/contradiction/entailment relations and for semantic
relatedness
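
The annotation scheme described above can be sketched as a small data structure. This is only an illustration: the field names and the three sentence pairs are invented, not taken from SICK-BR, and the relatedness values assume the 1 to 5 scale of the original English SICK.

```python
from collections import Counter
from dataclasses import dataclass

# The three inference labels named on the slide.
LABELS = ("entailment", "contradiction", "neutral")


@dataclass(frozen=True)
class Pair:
    """One annotated sentence pair (field names are hypothetical)."""
    premise: str
    hypothesis: str
    label: str
    relatedness: float  # assumed 1-5 scale, as in the original SICK

    def __post_init__(self):
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label!r}")
        if not 1.0 <= self.relatedness <= 5.0:
            raise ValueError("relatedness must be between 1 and 5")


# Invented examples in Brazilian Portuguese, for illustration only.
PAIRS = [
    Pair("Um homem está tocando violão",
         "Um homem está tocando um instrumento", "entailment", 4.0),
    Pair("Uma criança está dormindo",
         "Uma criança está acordada", "contradiction", 3.5),
    Pair("Um cachorro corre na praia",
         "Um gato dorme no sofá", "neutral", 1.0),
]


def label_counts(pairs) -> Counter:
    """Count how many pairs carry each inference label."""
    return Counter(p.label for p in pairs)


def mean_relatedness(pairs) -> float:
    """Average relatedness score over a non-empty list of pairs."""
    return sum(p.relatedness for p in pairs) / len(pairs)


if __name__ == "__main__":
    print(label_counts(PAIRS))
    print(round(mean_relatedness(PAIRS), 2))
```

Keeping the label and the relatedness score on the same record reflects the fact that each pair is annotated for both tasks at once.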

30/31
SICK-BR
https://www.ime.usp.br/~bruna/SICK_PT.pdf
1. Basic idea: logic is kind of universal, works the same in different
languages
2. NLI is very important in the new style of Natural Language
Understanding. Hence ASSIN, ASSIN2, SICK-BR.
3. But there are many difficulties of translation, even for simple sentences as
in SICK
4. It is difficult to decide whether the difficulties of translation are simply that
5. The phenomena described in SICK seem universal enough, but
language is structured differently
6. Much more work to do...

31/31
Conclusions
Open source datasets are important
Not only for English!
Bender Rule: always state the language you’re dealing with.
Also document your datasets properly!
A video worth watching: “Data Statements for Natural Language
Processing: Toward Mitigating System Bias and Enabling Better
Science”, https://vimeo.com/359686057 (only 19 min)
But indeed we have our work cut out for us! Thanks!
