Revisiting a Brazilian WordNet


Valeria de Paiva and Alexandre Rademaker, Jan 2012, Matsue, Japan

Published in: Technology, Education

  1.
Revisiting a Brazilian WordNet

Valeria de Paiva, Rearden Commerce, Foster City, CA, USA
Alexandre Rademaker, Applied Mathematics School, FGV, Rio de Janeiro, Brazil

Abstract

Brazilian Portuguese needs a WordNet that is open access, downloadable and changeable, so that it can be improved by researchers, such as the community interested in automated deduction. This would be very valuable to linguists and computer scientists interested in representing knowledge obtained from texts. We discuss briefly why we want a Brazilian Portuguese WordNet and how we are going about getting one. These are only first steps, though, as our project is just starting.

1 Introduction

WordNet (Fellbaum, 1998) is an extremely valuable resource for research in Computational Linguistics and Natural Language Processing in general. WordNet has been used for a number of different purposes in information systems, including word sense disambiguation, information retrieval, automatic text classification, automatic text summarization, and dozens of other knowledge intensive projects.

But if there is still a lack of lexical resources for English, the problem is ten-fold more acute for other languages, which lack even easily accessible corpora and basic tools such as tokenizers, taggers and splitters. This lack of resources considerably slows down, and almost completely stops, any work on reasoning about knowledge obtained from language, our main goal.

We are starting a project at Fundação Getulio Vargas (FGV) in Brazil, where we want, in the long run, to use formal logical tools to reason about knowledge obtained from text in Portuguese. We are logicians, not linguists, so we want to minimize the amount of Computational Linguistics that we have to develop. Hence it would have been very sensible to use a Brazilian WordNet, if we could have one. While we originally expected to be able to use some existing Brazilian WordNet out of the box, it turns out that these are not available. There are some attempts.

There is the project WordNet.PT (Portuguese WordNet) from the “Centro de Linguística da Universidade de Lisboa”, headed by Palmira Marrafa. But this is available online only, with no download available and, as far as we can see on their webpages, little development has happened recently in this project. The WordNet.PT version available online has about 19000 lexical expressions, from different semantic fields. The fragment made available online includes expressions from subdomains such as art, clothing, geography, health, institutions, living entities and transportation, but no description of other domains and/or future releases of the database is discussed. The group also has a newer version of WordNet.PT called WordNet.PTglobal (Marrafa et al., 2011), which pays attention to different varieties of Portuguese, like African variations in the language. But while this is very interesting for linguistic comparative research and useful for online queries, this smaller version of WordNet.PT is still not available for download and/or modifications and improvements.

Then there is also the MultiWordNet project and its Portuguese version MWN.PT, developed by António Branco and colleagues at the NLX-Natural Language and Speech Group, of the University of Lisbon, Department of Informatics. According to their description, MWN.PT, the MultiWordnet of Portuguese (version 1), spans over 17,200 manually validated concepts/synsets, linked under the semantic relations of hyponymy and hypernymy. These concepts are made of over
  2.
21,000 word senses/word forms and 16,000 lemmas from both European and American variants of Portuguese. It includes the subontologies under the concepts of Person, Organization, Event, Location, and Art works, which are covered by the top ontology made of the Portuguese equivalents to all concepts in the 4 top layers of the English Princeton WordNet and to the 98 Base Concepts suggested by the Global Wordnet Association, and the 164 Core Base Concepts indicated by the EuroWordNet project. But again this wordnet is available online only and/or with a restrictive license that requires payment.

Thirdly, there is a first version of a Brazilian Portuguese version of Wordnet developed by Bento Dias da Silva and collaborators (Dias-Da-Silva et al., 2000; Scarton and Aluisio, 2009). But this also cannot be downloaded, is not available online and is not being maintained on an open access basis, which is one of the strongest points of Princeton WordNet. Open access availability is one of the main reasons we would like to have our own version of WordNet-BR, which we are calling WN-BR. This is because we believe that resources like Wikipedia and WordNet need to be open and modifiable by others in order to improve over time.

Finally, there is a whole batch of work on merging WordNet with Wikipedia categories and infoboxes, which tries to leverage the work already done by the Wikipedia volunteers. Amongst these we are particularly excited about YAGO/MENTA (de Melo and Weikum, 2010) (and YAGO2 for further work on temporal/spatial information), which we describe below. This kind of work fits in well with our goals of ultimately doing reasoning, in large scale, with knowledge obtained from text.

2 Global WordNet Grid

In this forum it is perhaps not necessary to recall that the Global WordNet Association aims at the development of wordnets for all languages of the world and at extending the existing wordnets to full coverage and all parts-of-speech. In 2006 the association launched a project to start building a completely free worldwide wordnet “grid”. This grid would be built around a shared set of concepts, and would be expressed in terms of the original Wordnet synsets and SUMO (Niles and Pease, 2001) terms. The idea of the grid is very appealing and the suggested procedure to create wordnets looks sensible and feasible.

To recap, the suggestion was to build the first version of the Grid around the set of 4689 “Common Base Concepts” and to make the Grid free, following the example of the Princeton WordNet. Now the Base Concepts are supposed to be the most important concepts in the various wordnets of different languages. The importance of the concepts was measured in terms of two main criteria: (1) a high position in the semantic hierarchy; (2) having many relationships to other concepts. The procedure described as the “expand approach” seems to us viable: first translate the synsets in the Princeton WordNet to Portuguese, then take over the relations from Princeton and revise, adding the Portuguese terms that satisfy different relations. Then revise and revise and revise until we can guarantee the consistency of the taxonomy.

In somewhat more detail, but still following the suggestions of the global wordnet grid, we think we can develop a core WordNet for Brazilian Portuguese as follows. First, represent the 1024 core basic concepts by one or more synsets in Portuguese that are either equivalent or very closely associated to the original core concepts in the Princeton WordNet. Then add Base Concepts that are important to Brazilian Portuguese, but not in the set of core basic concepts. (For that one could use lists of Portuguese words listed by frequency, and comparisons with other Wordnets for Romance languages.) Next, check that this forms a closed and consistent hierarchy. Finally, add further relations necessary to specify the semantics of the basic concepts.

While this procedure seems sensible and doable, despite being hard work, the existence of several versions of Wordnet, and the fact that the concepts uncovered as basic ones are not related to their synsets in WordNet 3.0, make things more difficult. WordNet 3.0 has the huge advantage of disambiguated glosses, a real plus if the goal is semantics. But the mechanics of following the procedure sketched above turn out to be more complicated than expected.

Our solution is to use the mapping from WordNet 2.0 to WordNet 3.0 provided by Daudé et al. (2000). The idea is to use the map to identify the WordNet 3.0 synsets equivalent to the 4689 WordNet 2.0 basic concepts listed in the EuroWordNet project (Stamou et al., 2002).
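The mapping step just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the mapping is assumed to be available as text lines of the form `source_offset target_offset confidence [target_offset confidence ...]`, all offsets and confidence values below are made up, and the names `parse_mapping` and `map_base_concepts` are hypothetical, not part of any released tool.

```python
from typing import Dict, Iterable, List, Tuple

# A WordNet 2.0 offset may map to several WordNet 3.0 candidates,
# each with a confidence score.
Candidates = List[Tuple[str, float]]


def parse_mapping(lines: Iterable[str]) -> Dict[str, Candidates]:
    """Parse lines of the ASSUMED form 'src tgt conf [tgt conf ...]'."""
    mapping: Dict[str, Candidates] = {}
    for line in lines:
        fields = line.split()
        # A valid line has one source plus one or more (target, conf) pairs.
        if len(fields) < 3 or len(fields) % 2 == 0:
            continue  # skip blank or malformed lines
        source, rest = fields[0], fields[1:]
        mapping[source] = [
            (rest[i], float(rest[i + 1])) for i in range(0, len(rest), 2)
        ]
    return mapping


def map_base_concepts(
    base_concepts: Iterable[str], mapping: Dict[str, Candidates]
) -> Tuple[Dict[str, str], List[str]]:
    """Pick the highest-confidence 3.0 synset for each 2.0 base concept.

    Returns the mapped concepts and the list of concepts with no target,
    which would need manual inspection.
    """
    mapped: Dict[str, str] = {}
    unmapped: List[str] = []
    for offset in base_concepts:
        candidates = mapping.get(offset)
        if not candidates:
            unmapped.append(offset)
            continue
        best_target, _ = max(candidates, key=lambda pair: pair[1])
        mapped[offset] = best_target
    return mapped, unmapped


# Toy data (hypothetical offsets, not real synsets).
TOY_MAPPING_LINES = [
    "00001740 00001740 1",
    "00002056 00002137 0.8 00002452 0.2",
    "",
]
TOY_BASE_CONCEPTS = ["00001740", "00002056", "00009999"]

mapped, unmapped = map_base_concepts(
    TOY_BASE_CONCEPTS, parse_mapping(TOY_MAPPING_LINES)
)
print("mapped:", mapped)
print("needs manual review:", unmapped)
```

Since WordNet offsets are only unique within a part of speech, a real run would apply this separately to nouns, verbs, adjectives and adverbs. Keeping the unmapped concepts in a separate list makes the manual revision work explicit.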
  3.
We plan to use the RDF version of WordNet 3.0 made freely available online by Mark van Assem and Jacco van Ossenbruggen. The RDF version has the benefit of better supporting our collaborative work, facilitating the maintenance in the same data structure of both English and Portuguese versions of the glosses and lexical forms, and, once loaded in a Triple Store, providing us with a working environment to add or remove relations, comments and so on. Unfortunately, there are at least two different versions of WordNet available in RDF. The RDF representation of WordNet 2.0 is described in http://www.w3.org/TR/wordnet-rdf/ and seems to be used as reference for the newly minted RDF representation of WordNet 3.0. Nevertheless, we still have to investigate the differences between the WordNet 2.0/3.0 mapping used in the RDF representation of WordNet 3.0 and the mapping provided by Daudé et al. (2000).

3 SUMO and WordNet-BR

The Global WordNet Grid approach has three aspects that seem to us worth considering. First, there is the work on determining which synsets (corresponding to concepts) are most popular in several languages. This work was done in the EuroWordNet projects and it would be a shame not to use it. The emphasis on many languages should help filter out personal and cultural biases. Then there is the proposed stepping up schedule, which seems to us very attractive, as a way of helping to get the “grunt” work done. Finally, and in some ways more importantly, there is the connection with SUMO that we discuss now.

According to Wikipedia, the Suggested Upper Merged Ontology or SUMO is an upper ontology intended as a foundation ontology for a variety of computer information processing systems. It can be downloaded and used freely and it has been available and in development since 2000. A mapping from WordNet synsets to SUMO has also been defined and maintained for several versions of WordNet. Most importantly for us, SUMO is organized for interoperability of automated reasoning engines. In particular, SUMO’s associated open source knowledge engineering environment, Sigma, already runs with Vampire (Ganzinger et al., 1999) and Leo-II (Benzmueller and Paulson, 2010), for example. Projections of SUMO into description logics, automatically available, can be run in the OWL reasoners.

In the beginning of 2010 we started an informal project of discussing how logic and automated reasoning could have a bigger impact if coupled with natural language processing, and how it would be a great thing to translate some of the advances already made for English text understanding to Portuguese text understanding and reasoning.

Since one of us (Valeria de Paiva) had worked for almost nine years at Xerox PARC, on the systems developed by the Natural Language Theory and Technology (NLTT) group, particularly on the system Bridge, we requested an academic license to the XLE (Xerox Language Engine) to try to adapt the systems to Brazilian Portuguese. However, we are both logicians, and our expertise lies at the end of the long pipeline of the system Bridge, so we tried to recruit, still informally, people with more expertise on the language side of the project. But at that stage we did not have any formal backing, so despite some interesting offers, nothing much happened. Recently we have been granted formal backing, although still on a small scale, and one of the opportunities that presented itself was to forestall the need for the creation of a Brazilian WordNet (or perhaps to help improve the creation of such), via the use of SUMO.

Wordnet is an important component of the XLE Unified Lexicon (UL; Crouch and King, 2005), as the logical formulae created by the Abstract Knowledge Representation (AKR) component of the system are given meaning in terms of Wordnet synsets. A previous version of the system used, instead of the Unified Lexicon, Cyc (Lenat, 1995) concepts as semantics. As discussed in De Paiva et al. (2007), the sparseness of Cyc concepts was the main reason to move away from Cyc onto a version of the Bridge system based on the UL and WordNet. Since a WordNet-BR is not available, a workaround might be obtained via SUMO, if this were to be available in Portuguese. As a warm-up exercise we translated the basic concepts used for the basic concept descriptions in SUMO to Brazilian Portuguese, and this is already available on the SourceForge repository for Sigma, SUMO’s knowledge engineering platform. This is not a substitute for a Brazilian Portuguese WordNet, but merely a stepping stone towards it.
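The workaround via SUMO amounts to composing two lookups: from a WordNet synset to its SUMO term, and from that SUMO term to a Portuguese label. A minimal Python sketch follows, assuming both tables are available as dictionaries. The synset identifiers, the SUMO assignments and the Portuguese labels are illustrative placeholders, not data taken from the actual WordNet-to-SUMO mapping files or from the translation in the Sigma repository, and the function names are hypothetical.

```python
from typing import Dict, Iterable, List, Optional

# Hypothetical excerpt of a synset -> SUMO term table.
SYNSET_TO_SUMO: Dict[str, str] = {
    "dog.n.01": "Canine",
    "walk.v.01": "Walking",
    "river.n.01": "River",
}

# Hypothetical excerpt of a SUMO term -> Portuguese label table.
SUMO_TO_PT: Dict[str, str] = {
    "Canine": "canino",
    "Walking": "caminhada",
}


def portuguese_label(synset_id: str) -> Optional[str]:
    """Return a Portuguese label for a synset by going through SUMO.

    Returns None when either lookup fails, i.e. the synset has no SUMO
    term in the table or the SUMO term has not been translated yet.
    """
    sumo_term = SYNSET_TO_SUMO.get(synset_id)
    if sumo_term is None:
        return None
    return SUMO_TO_PT.get(sumo_term)


def untranslated(synset_ids: Iterable[str]) -> List[str]:
    """List the synsets for which no Portuguese label can be reached."""
    return [s for s in synset_ids if portuguese_label(s) is None]


print(portuguese_label("dog.n.01"))
print(untranslated(["dog.n.01", "river.n.01", "tree.n.01"]))
```

Because many synsets can share one SUMO term, a label obtained this way is coarser than a synset-level translation, which is consistent with treating SUMO as a stepping stone rather than a substitute for a Brazilian Portuguese WordNet.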
  4.
4 Scaling Up?

One of the ways we are considering to scale up our proposal, from the five thousand concepts suggested in the grid page to the level that we think is necessary for our application, goes via the work on YAGO (Suchanek et al., 2007) and, perhaps, YAGO2. The YAGO approach to information extraction for building a searchable, large-scale, highly accurate knowledge base of common facts goes via harvesting infoboxes and category names from Wikipedia for facts about individual entities. It reconciles these with the taxonomic backbone of WordNet in order to ensure that all entities have proper classes and the class system is consistent. The work in YAGO at the Max-Planck Institute has led de Melo and Weikum to work on MENTA (Multilingual Taxonomies from Wikipedia), which can be considered a multilingual version of WordNet. From this multilingual version (with 254 languages) we want to ‘project’ the component consisting of Portuguese synsets only. (The plan is to use the work in the Portuguese version of MENTA to complement, automatically, the five thousand concepts for WN-BR, obtained through manual translation. We believe that the MENTA (de Melo and Weikum, 2010) projection into Portuguese could give us a reasonable basis in terms of synsets in Portuguese, to which we would like to compare the existing versions of Portuguese wordnets.) Since YAGO is already integrated with SUMO (De Melo et al., 2008), we hope to be able to maintain consistency of the database.

5 Conclusions

It is early days for our project and time will tell whether our decision to follow the global WordNet grid guidelines for seeding new wordnets will pay off or not, and if so, how well. We now have a master's student interested in the project and more interested students are expected. One thing is clear to us: whatever kinds of resource we end up with, we hope to make them freely available in one of the numerous sites (SourceForge, GitHub, etc.) at our disposal nowadays. We are not aware of any such specialized lexicons available for Brazilian Portuguese and it is about time that we had them openly and freely accessible.

References

C. Benzmueller and L.C. Paulson. 2010. Multimodal and intuitionistic logics in simple type theory. Logic Journal of IGPL, 18(6):881.

D. Crouch and T.H. King. 2005. Unifying lexical resources. In Proceedings of the Interdisciplinary Workshop on the Identification and Representation of Verb Features and Verb Classes, Saarbruecken, Germany.

J. Daudé, L. Padró, and G. Rigau. 2000. Mapping wordnets using structural information. In Proceedings of the 38th Annual Meeting on Association for Computational Linguistics, pages 504–511. Association for Computational Linguistics. index.php?option=com_content&task=view&id=21&Itemid=57.

G. de Melo and G. Weikum. 2010. Menta: inducing multilingual taxonomies from wikipedia. In Proceedings of the 19th ACM international conference on Information and knowledge management, pages 1099–1108. ACM.

G. De Melo, F. Suchanek, and A. Pease. 2008. Integrating yago into the suggested upper merged ontology. In Tools with Artificial Intelligence, 2008. ICTAI’08. 20th IEEE International Conference on, volume 1, pages 190–193. IEEE.

V. De Paiva, DG Bobrow, C. Condoravdi, R. Crouch, L. Karttunen, TH King, R. Nairn, and A. Zaenen. 2007. Textual inference logic: Take two. Proceedings of the Workshop on Contexts and Ontologies, Representation and Reasoning, page 27.

B. C. Dias-Da-Silva, H. R. Moraes, M. F. Oliveira, R. Hasegawa, D. A. Amorim, C. Paschoalino, and A. C. Nascimento. 2000. Construção de um thesaurus eletrônico para o português do brasil. In Processamento Computacional do Português escrito e falado (Propor), pages 1–10.

C. Fellbaum. 1998. WordNet: An electronic lexical database. The MIT press.

Harald Ganzinger, Alexandre Riazanov, and Andrei Voronkov. 1999. Vampire. In Automated Deduction CADE-16, volume 1632 of Lecture Notes in Computer Science, pages 674–674. Springer Berlin / Heidelberg.

D.B. Lenat. 1995. Cyc: A large-scale investment in knowledge infrastructure. Communications of the ACM, 38(11):33–38.

Palmira Marrafa, Raquel Amaro, and Sara Mendes. 2011. WordNet.PT global – extending WordNet.PT to portuguese varieties. In Proceedings of the First Workshop on Algorithms and Resources for Modelling of Dialects and Language Varieties, pages 70–74, Edinburgh, Scotland, July. Association for Computational Linguistics.
  5.
I. Niles and A. Pease. 2001. Towards a standard upper ontology. In Proceedings of the international conference on Formal Ontology in Information Systems-Volume 2001, pages 2–9. ACM.

Carolina Scarton and Sandra Aluisio. 2009. Herança automática das relações de hiperônimia para a wordnet.br. Technical Report NILC-TR-09-10, USP, São Carlos, SP, Brazil.

S. Stamou, K. Oflazer, K. Pala, D. Christoudoulakis, D. Cristea, D. Tufis, S. Koeva, G. Totkov, D. Dutoit, and M. Grigoriadou. 2002. Balkanet a multilingual semantic network for the balkan languages. In Proceedings of the International Wordnet Conference, Mysore, India, pages 21–25.

Fabian M. Suchanek, Gjergji Kasneci, and Gerhard Weikum. 2007. Yago: A Core of Semantic Knowledge. In 16th international World Wide Web conference (WWW 2007), New York, NY, USA. ACM Press.