1/31
Introduction
Silicon Valley
PARC, XLE, Bridge
Applications
The Importance of Being Earnest:
Open Datasets in Portuguese
Valeria de Paiva
OPENCOR 2021
Dec 2021
Valeria de Paiva OpenCor2021
Thanks, Livy!
Personal Stories
I’m an AI scientist, a mathematician, a computational semanticist
and a category theorist.
I work in Silicon Valley, have done so for the last 22 years, applying
pure mathematics to computing, in surprising ways.
In the Valley
Personal stories: Now
How?
Samsung Research America (2019-2020)
Dialogue and Knowledge Representation Lab, SRA, Mountain View
Project: systems to make Bixby (voice personal assistant) communicate well with home appliances via SmartHome
Samsung acquired Viv in Oct 2016 and had acquired SmartThings in 2014; it needed to integrate the stacks and grow Bixby
SmartThings: leading open platform for the smart home and the consumer Internet of Things in 2014
Open-source project: develop an ontology of smart devices, based on WikiData and customer-facing functionality
Nuance Communications (2012-2018)
AI and NLP Lab in Sunnyvale
Nuance had the best voice recognition software system in 2012 and needed to add AI to turn sounds into knowledge. A big effort across several labs: Montreal, Boston, Sunnyvale.
Application areas: health systems, automotive, law, CRM, banks, insurance, etc.
Our lab projects: personal assistant for the living room (TV second screen), personal assistants for automotive companies
Building small smartness into conventional search, e.g. ‘find allergy medication near me’
Open-source projects: voice interfaces to WikiData? Student internships?
Rearden Commerce (2011-2012)
AI/KR Lab in Foster City
A white-labelling shop for travel and expenses/procurement systems
Application areas: air travel tickets, hotels, shows & sports, restaurants, ground transportation, parking, etc.
Our lab project: a Groupon-like app, as RC had acquired HomeRun
Using ontologies to discover what hotel reviewers really valued
Possible open-source projects: a projection of WikiData adapted to Brazil, for ‘skill sets’, for Brazilian culture, etc.
Cuil (2008-2010)
Start-up search company in Menlo Park
A Google competitor created by ex-Googlers
All sorts of tasks, from babysitting servers to dealing with customers
Learning-to-rank algorithms
PARC Forum talk: Adventures in Searchland
Possible open-source project: timelines in Portuguese
PARC, XLE and Bridge
How to think about Conversational Assistants?
How to think about Conversational Assistants?
Several Conversational Assistants: applications
(AI Summit 2018, Luxembourg)
Several Conversational Assistants
(AI Summit 2018, Luxembourg)
Natural Language Inference (NLI)
Shock: the work of almost nine years at PARC was out of reach when I left in 2008
I gave a talk at SRI proposing to redo it all, open source (Bridges, ENTCS 2011)
Pleased to report that almost all of it is now available open source, redone from scratch, using new techniques
Katerina Kalouli, ex-PhD student at Konstanz, now assistant professor in Munich
GKR Demo: https://cis.lmu.de/~kalouli/resources.html
What about Portuguese?
Alexandre Rademaker and I started OpenWordNet-PT in 2012: there was no open-source WordNet of Portuguese then
Sources on GitHub
OWN-PT was originally obtained from Universal WordNet (Weikum and de Melo)
RDF distribution from the beginning
openwordnet-pt.org
OpenWordNet-PT Data
OpenWordNet-PT Examples
OpenWordNet-PT Basic Stats
1. OWN-PT is big, around 50K synsets.
2. PWN (Princeton WordNet) is much bigger, 117K synsets.
3. We have more than 7K verb synsets: definitely not enough, but one can start to play
4. More than twice as big as Russian WordNet, bigger than Spanish, only slightly smaller than French
5. And issues, many issues
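
As a rough sanity check on the figures above, the short Python sketch below turns the rounded synset counts quoted on this slide into percentages. The constant and function names are illustrative assumptions, and the ratio compares synset counts only; it says nothing about how many PWN synsets actually have a Portuguese counterpart.

```python
# Synset counts as quoted on the slide (rounded figures).
OWN_PT_SYNSETS = 50_000
PWN_SYNSETS = 117_000
OWN_PT_VERB_SYNSETS = 7_000  # "more than 7K", so treat this as a lower bound


def share(part: int, whole: int) -> float:
    """Return part/whole as a percentage rounded to one decimal place."""
    return round(100 * part / whole, 1)


# Ratio of synset counts, not a measure of aligned coverage.
size_vs_pwn = share(OWN_PT_SYNSETS, PWN_SYNSETS)
verb_share = share(OWN_PT_VERB_SYNSETS, OWN_PT_SYNSETS)

print(f"OWN-PT has about {size_vs_pwn}% as many synsets as PWN")
print(f"Verbs make up at least {verb_share}% of OWN-PT synsets")
```

With these rounded inputs, OWN-PT comes out at a bit under half the size of PWN, with verbs as a fairly small slice of it.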
OpenWordNet-PT Papers
1. Papers trying to clean up the database
2. Nominalizations and their issues (Livy Real)
3. Using corpora to extend our vocabulary (Claudia Freitas)
4. Interfaces for progress (Fabricio Chalub)
5. Two papers on verb lexicon
6. Two papers on Historical archives (DHBB)
7. WordNets themselves (Hugo & Alberto)
8. Gentilics, Adverbs
9. Temporal expressions
10. Morpholinks
etc.
OpenWordNet-PT
We were doing so well...
Google Translate, Open Multilingual WordNet, BabelNet and FreeLing used our OWN-PT.
https://translate.google.com/intl/en/about/license/ still says:
But then Transformers arrived, and with them a series of new challenges.
OpenWordNet-PT Questions
1. One can try to carry on cleaning the data, using the lexicographer files (to the left). Is it worth doing?
2. Should we instead grow, not worrying much about precision?
3. I wish we had glosses in Portuguese. Alberto Simões produced them for us, but we never added them to the database, as the quality of the Portuguese text wasn’t great.
4. This data is open source: anyone can get it, make it better, and let us have it. Or not: our license is very broad.
5. In a long-term project, goals eventually diverge and people want to try other things. The beauty of GitHub is being able to keep the version you want.
In any case, a high-quality Portuguese WordNet is simply one lexical resource we need. We need others, and we have been working on those in parallel.
Beginning UD-PT
1. Corpus Bosque: traditional news corpus (EU-PT and BR-PT), mangled by several versions and conversions.
2. PALAVRAS (Bick): a rule-based Constraint Grammar (CG) system designed for Portuguese. It produces deep linguistic analyses, with tags at the morphological, syntactic (dependency) and semantic levels. (Not open source.)
3. First version of our data, UD 1.4 compliant, included in UD release 1.4 as UD Portuguese-Bosque. Not too bad!
4. Then we "accepted the challenge" of updating UD-PT-Bosque to UD 2.0 guidelines and replacing the previous UD Portuguese corpus. Phew!
Issues of UD-PT
1. Gender: underspecified gender, e.g. grande (big) or feliz (happy)
2. MWEs: changing from PALAVRAS to UD1.x to UD2.x was complicated. MWEs still are.
3. Participles: verbs or adjectives?
4. Ellipses: changes from UD1 to UD2, plus ellipses are difficult
5. Clitics, and also all the things that "se" and "que" can be.
6. Non-explicit subjects (sujeito oculto, i.e. hidden subject, and others): see the excellent new work of Freitas and de Souza.
7. Negation (UD changed its mind), and negation is hard
8. Appositives vs. nmod: PT had a different opinion
9. Auxiliary verbs
10. xcomp-that, ccomp-to
See http://medialab.di.unipi.it/depling/assets/docs/day2/02_demo2.pdf for status in 2017. Now a meeting group.
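
To make issue 1 concrete, here is a minimal Python sketch that scans CoNLL-U text (the tab-separated, ten-column format used by UD treebanks) for adjectives that carry no Gender feature. The sample sentence is hand-written for illustration and is not taken from UD Portuguese-Bosque; its annotation choices and the function names are assumptions.

```python
# Hand-written CoNLL-U fragment for illustration only. It is NOT taken from
# UD Portuguese-Bosque; the annotation choices here are assumptions.
SAMPLE = "\n".join([
    "# text = A menina feliz",
    "1\tA\to\tDET\t_\tDefinite=Def|Gender=Fem|Number=Sing|PronType=Art\t2\tdet\t_\t_",
    "2\tmenina\tmenina\tNOUN\t_\tGender=Fem|Number=Sing\t0\troot\t_\t_",
    "3\tfeliz\tfeliz\tADJ\t_\tNumber=Sing\t2\tamod\t_\t_",
])


def parse_feats(feats: str) -> dict:
    """Turn 'Gender=Fem|Number=Sing' into a dict; '_' means no features."""
    if feats == "_":
        return {}
    return dict(pair.split("=", 1) for pair in feats.split("|"))


def adjectives_without_gender(conllu: str) -> list:
    """Return the forms of ADJ tokens that have no Gender feature."""
    forms = []
    for line in conllu.splitlines():
        if not line or line.startswith("#"):
            continue  # skip blank lines and sentence-level comments
        cols = line.split("\t")
        if len(cols) != 10:
            continue  # not a well-formed token line
        token_id, form, upos, feats = cols[0], cols[1], cols[3], cols[5]
        if "-" in token_id or "." in token_id:
            continue  # skip multiword-token ranges and empty nodes
        if upos == "ADJ" and "Gender" not in parse_feats(feats):
            forms.append(form)
    return forms


print(adjectives_without_gender(SAMPLE))
```

A scan like this only lists candidates. Whether a missing Gender on an invariable adjective is an error or the intended analysis is exactly the annotation question raised in the list.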
SICK-BR
e.g. https://www.ime.usp.br/~bruna/SICK_PT.pdf
1. A big group from GLIC, USP.
2. An easy-to-obtain, state-of-the-art automated translation (Milos Stanojevic)
3. Lots of human work correcting the automated translation to get SICK-BR, a Brazilian Portuguese corpus annotated with inference relations and semantic relatedness between pairs of sentences
4. SICK-BR is a translation and adaptation of the original SICK, a corpus of English sentences used in several semantic evaluations
5. SICK-BR has around 10k sentence pairs annotated for neutral/contradiction/entailment relations and for semantic relatedness
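
The annotation scheme described above can be sketched as a small Python record type: each sentence pair carries one of three inference labels plus a relatedness score. The class and field names are assumptions made for illustration, the 1 to 5 relatedness range follows the original English SICK, and the sentence pairs are invented rather than taken from SICK-BR.

```python
from collections import Counter
from dataclasses import dataclass

LABELS = ("entailment", "contradiction", "neutral")


@dataclass(frozen=True)
class NliPair:
    premise: str
    hypothesis: str
    label: str          # one of LABELS
    relatedness: float  # assumed to lie on a 1-5 scale, as in the original SICK

    def __post_init__(self):
        # Reject records that fall outside the annotation scheme.
        if self.label not in LABELS:
            raise ValueError(f"unknown label: {self.label!r}")
        if not 1.0 <= self.relatedness <= 5.0:
            raise ValueError(f"relatedness out of range: {self.relatedness}")


# Invented pairs for illustration; they are NOT taken from SICK-BR.
PAIRS = [
    NliPair("Um homem está tocando violão.",
            "Um homem está tocando um instrumento.", "entailment", 4.5),
    NliPair("Uma criança está dormindo.",
            "Uma criança está correndo.", "contradiction", 3.2),
    NliPair("Um cachorro corre na praia.",
            "Um gato dorme no sofá.", "neutral", 1.3),
]


def label_counts(pairs):
    """Count how many pairs carry each inference label."""
    return Counter(pair.label for pair in pairs)


print(label_counts(PAIRS))
```

Keeping the label and the relatedness score as separate fields mirrors the two annotation layers: a contradictory pair can still be fairly related, as in the second example.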
SICK-BR
https://www.ime.usp.br/~bruna/SICK_PT.pdf
1. Basic idea: logic is kind of universal and works the same in different languages
2. NLI is very important in the new style of Natural Language Understanding. Hence ASSIN, ASSIN2, SICK-BR.
3. But there are many difficulties of translation, even for simple sentences such as those in SICK
4. It is difficult to decide if the difficulties of translation are simply that
5. The phenomena described in SICK seem universal enough, but language is structured differently
6. Much more work to do...
Conclusions
Open source datasets are important
Not only for English!
Bender Rule: always say the language you’re dealing with.
Also document your datasets properly!
A video worth watching: "Data Statements for Natural Language Processing: Toward Mitigating System Bias and Enabling Better Science", https://vimeo.com/359686057 (only 19 min)
But indeed we have our work cut out for us! Thanks!