SlideShare a Scribd company logo
1 of 8
Download to read offline
Multilingual Central Repository version 3.0:
                        upgrading a very large lexical knowledge base

                   Aitor Gonz´ lez Agirre, Egoitz Laparra, German Rigau
                             a
                                  Basque Country University
                                  Donostia, Basque Country
            {aitor.gonzalez, egoitz.laparra, german.rigau}@ehu.es


                        Abstract                     2 4003) (Vossen, 1998). The MCR is the re-
                                                     sult of the European MEANING project (IST-
        This paper describes the upgrading pro-      2001-34460) (Rigau et al., 2002), as well as
        cess of the Multilingual Central Reposi-     projects KNOW (TIN2006-15049-C03)3 (Agirre
        tory (MCR). The new MCR uses WordNet         et al., 2009), KNOW2 (TIN2009-14715-C04)4
        3.0 as Interlingual-Index (ILI). Now, the    and several complementary actions associated to
        current version of the MCR integrates in     the KNOW2 project. The original MCR was
        the same EuroWordNet framework word-         aligned to the 1.6 version of WordNet. In the
        nets from five different languages: En-       framework of the KNOW2 project, we decided to
        glish, Spanish, Catalan, Basque and Gali-    upgrade the MCR to be aligned to a most recent
        cian. In order to provide ontological co-    version of WordNet.
        herence to all the integrated wordnets,         The previous version of the MCR, aligned
        the MCR has also been enriched with a        to the English 1.6 WordNet version, also inte-
        disparate set of ontologies: Base Con-       grated the eXtended WordNet project (Mihalcea
        cepts, Top Ontology, WordNet Domains         and Moldovan, 2001), large collections of selec-
        and Suggested Upper Merged Ontology.         tional preferences acquired from SemCor (Agirre
        We also suggest a novel approach for im-     and Martinez, 2001) and different sets of named
        proving some of the semantic resources       entities (Alfonseca and Manandhar, 2002). It
        integrated in the MCR, including a semi-     was also enriched with semantic and ontological
        automatic method to propagate domain in-                                   ´
                                                     properties as Top Ontology (Alvez et al., 2008),
        formation. The whole content of the MCR      SUMO (Pease et al., 2002) or WordNet Domains
        is freely available.                         (Magnini and Cavagli` , 2000).
                                                                           a
                                                        The new MCR integrates wordnets of five
1       Introduction
                                                     different languages, including English, Spanish,
Building large and rich knowledge bases is a very    Catalan, Basque and Galician. This paper presents
costly effort which involves large research groups   the work carried out to upgrade the MCR to new
for long periods of development. For instance,       versions of these resources. By using technol-
hundreds of person-years have been invested in the   ogy to automatically align wordnets (Daud´ et al.,
                                                                                               e
development of wordnets for various languages        2003), we have been able to transport knowledge
(Fellbaum, 1998; Vossen, 1998; Tufis et al., 2004;    from different WordNet versions. Thus, we can
K. et al., 2010). In the case of the English Word-   maintain the compatibility between all the knowl-
Net, in more than ten years of manual construc-      edge bases that use a particular version of Word-
tion (from 1995 to 2006, that is, from version 1.5   Net as a sense repository. However, most of
to 3.0), WordNet grew from 103,445 to 235,402        the ontological knowledge have not been directly
semantic relations1 , which represents a growth of   ported from the previous version of the MCR.
around one thousand new relations per month.            Furthermore, WordNet Domains5 was gener-
   The Multilingual Central Repository (MCR)2        ated semi-automatically and has never been ver-
(Atserias et al., 2004b) follows the model pro-      ified completely. Additionaly, it was aligned to
posed by the European project EuroWordNet (LE-         3
                                                         http://ixa.si.ehu.es/know
    1                                                  4
      Symmetric relations are counted only once.         http://ixa.si.ehu.es/know2
    2                                                  5
      http://adimen.si.ehu.es/web/MCR                    http://wndomains.fbk.eu/
WordNet 1.6. Thus, one goal of this work is the
automatic construction of a new semantic resource
derived from WordNet Domains and aligned to
WordNet 3.0.
   To assist in the correction and maintenance of
the integrated resources in the MCR, we also
adapted and enhanced the Web EuroWordNet In-
terface (WEI) in both consult and edit modes.

2   Multilingual Central Repository 3.0
The first version of the MCR was built following
the model proposed by the EuroWordNet project.
The EuroWordNet architecture includes the Inter-       Figure 1: Example of a multiple intersection in the
Lingual Index (ILI), a Domain Ontology and a Top       mapping between two versions of WordNet.
Ontology (Vossen, 1998).
   Initially most of the knowledge uploaded into
the MCR was aligned to WordNet 1.6 and the                Upgrading ILI: The algorithm to align word-
Spanish, Catalan, Basque and Italian WordNet and       nets (Daud´ et al., 2000; Daud´ et al., 2001;
                                                                    e                      e
the MultiWordNet Domains, were using WordNet           Daud´ et al., 2003) produces two mappings for
                                                              e
1.6 as ILI (Bentivogli et al., 2002; Magnini and       each POS, one in each direction (from 1.6 to 3.0,
Cavagli` , 2000). Thus, the original MCR used
         a                                             and from 3.0 to 1.6). To upgrade the ILI, different
Princeton WordNet 1.6 as ILI. This option also         approaches were applied depending on the POS.
minimized side efects with other European initia-         For nouns, those synsets having multiple map-
tives (Balkanet, EuroTerm, etc.) and wordnet de-       pings from 1.6 to 3.0 were checked manually
velopments around Global WordNet Association.          (Pociello et al., 2008).
Thus, the Spanish, Catalan and Basque wordnets            For verbs, adjectives and adverbs, for those
as well as the EuroWordNet Top Ontology and the        synsets having multiple mappings, we took the in-
associated Base Concepts were transported from         tersection between the two mappings (from 1.6 to
its original WordNet 1.5 to WordNet 1.6 (Atserias      3.0, and from 3.0 to 1.6).
et al., 1997; Ben´tez et al., 1998; Atserias et al.,
                  ı                                       Upgrading WordNets: Finally, using the pre-
2004a).                                                vious mapping, we transported from ILI 1.6 to
   The release of new free versions of Spanish and     ILI 3.0 the Basque (Pociello et al., 2008) and
Galician wordnets aligned to Princeton WordNet         Catalan (Ben´tez et al., 1998) wordnets. The
                                                                      ı
3.0 (Fern´ ndez-Montraveta et al., 2008; Xavier
           a                                           English WordNet was uploaded directly from
et al., 2011) brought with it the need to update       the source files while the Spanish (Fern´ ndez-
                                                                                                    a
the MCR and transport all its previous content to      Montraveta et al., 2008) and Galician (Xavier
a new version using WordNet 3.0 as ILI. Thus, as       et al., 2011) wordnets were directly uploaded
a first step, we decided to transport Catalan and       from their database dumps.
Basque wordnets and the ontological knowledge:            It is possible to have multiple intersections for
Base Concepts, SUMO, WordNet Domains and               a source synset. When multiple intersections co-
Top Ontology.                                          lapsed into the same target synset, we decided to
                                                       join the set of variants from the source synsets to
2.1 Upgrading from 1.6 to 3.0                          the target synset.
This section describes the process carried out for        Figure 1 shows an example of this particular
adapting the MCR to ILI 3.0. Due to its size and       case (the intersections are displayed as dot lines).
complexity, all this process have been mainly au-      Therefore, the variants of the synsets A, B and C of
tomatic.                                               WordNet 1.6 will be placed together in the synset
                                                       Z of WordNet 3.0.
   To perform the porting between the wordnets
                                                          Upgrading Base Concepts: We used Base
1.6 and 3.0 we have followed a similar process to
                                                       Concepts directly generated for WN 3.06
the one used to port the Spanish and Catalan ver-
                                                          6
sions from 1.5 to 1.6 (Atserias et al., 2004a).               http://adimen.si.ehu.es/web/BLC
are highlighted using dot lines. In this specific
                                                        example synset soccer player inherits labels play
                                                        and athlete from its hypernyms player and athlete,
                                                        respectively. Note that synset hockey player does
                                                        not inherit any label form its hypernyms because
                                                        of it owns a domain (hockey). Similarly, synset
                                                        goalkeeper does not inherit domains coming from
                                                        the synset soccer player. Finally, synset titu-
                                                        lar goalkeeper inherits hockey domain (but nei-
                                                        ther play nor athlete domains).
                                                           Thus, some of the current content of the MCR
Figure 2: Example of a multiple intersection in the     will require a future revision. Fortunately, by
mapping between two versions of WordNet.                cross-checking its ontological knowledge most of
                                                        these errors can be easily detected.

(Izquierdo et al., 2007).                               2.2   Web EuroWordNet Interface
   Upgrading SUMO: SUMO has been directly
                                                        WEI is a web application that allows consult-
ported from version 1.6 using the mapping. Those
                                                        ing and editing the data contained in the MCR
unlabelled synsets have been filled through in-
                                                        and navigating throw them. Consulting refers to
heritance. The ontology of the previous version
                                                        exploring the content of the MCR by accessing
is a modified version of SUMO, trimmed and
                                                        words, a synsets, a variants or ILIs. The inter-
polished, to allow the use of first-order theorem
                                                        face presents different searching parameters and
provers (like E-prover or Vampire) for formal rea-
                                                        displays the query results. The different searching
soning, called AdimenSUMO7 . The next step is to
                                                        parameters are:
update AdimenSUMO using the latest version of
SUMO for WordNet 3.0 (available on the website            • Item: a value to search for, it can be a Word,
of SUMO8 ).                                                 a Synset a Variant or an ILI.
   Upgrading WordNet Domains: As SUMO,
what is currently in the MCR has been trans-              • Item type: the type of item to search for:
ported directly from version 1.6 using the map-             Word, a Synset a Variant or an ILI.
ping. Again, those unlabelled synsets have been
filled through inheritance. In addition, new ver-          • PoS: the item’ s gramatical cathegory or Part
sions have been generated using graph techniques            of Speech: Nouns, Verbs, Adjectives, Ad-
(see Section 4 for a detailed description of the pro-       verbs.
cess).
                                                          • Search: the type of search and subsearches
   Upgrading the Top Ontology: Similar to
                                                            (which are dinamically loaded from the
SUMO and WordNet Domains, what is currently
                                                            database): Synonyms, Hyponyms, etc.
available in the MCR has been transported directly
from version 1.6 using the mapping. Once more,            • WordNet Source: the WordNet from which
those unlabelled synsets have been filled through            navigate.
inheritance. It remains to check the incompatibili-
                                 ´
ties between labels following (Alvez et al., 2008).       • Navigation WordNet: the WordNet to which
   An example of how to perform the process of in-          navigate.
heritance used for SUMO, WordNet Domains and
Top Ontology is shown in Figure 2. The example            • Gloss: if selected it shows the glosses of the
is presented for domains, but it can be applied to          Synsets.
the other two cases.
                                                          • Score: if selected, it shows the confidence
   Figure 2 shows a sample hierarchy where each
                                                            factor.
node represents a synset. The domains are dis-
played on the sides. The inherited domain labels          • Rels: if selected, it shows information about
   7
       http://adimen.si.ehu.es/web/adimenSUMO               the relations that each Synset has in all the
   8
       http://www.ontologyportal.org/                       target languages.
• Full: if selected, makes a recursive search.        carried out using Apache, in the Operating Sys-
                                                        tem. This implies that the access control and user
  • Target WordNets: the target WordNets of             management was done outside WEI. The MCR is
    our search.                                         being edited in a distributed way. Several research
2.3 Automatic translations                              groups are editing the MCR in some of the lan-
                                                        guages. Each group has different users. Thus, the
The new version of WEI is able to use Automatic         responsibility of managing the users is also dis-
Translation Web Services for translating automati-      tributed.
cally the glosses and examples from other word-
nets. This new feature helps users to complete          3 Current state of the MCR
and/or improve the gloss or examples of a given
WordNet more quicky. Both glosses and exam-             In this section provide some information about the
ples are taken from the original English WordNet        current state of the MCR, including the progress
and translated to the target language. Suggestions      over the English WordNet.
for glosses and/or examples will appear below the          Tables 1, 2 and 3 present respectively the cur-
existing ones, and may choose the most appro-           rent number of synsets and variants, the number of
priate. In the current version, the translations of     glosses and the number of examples of each word-
the glosses and examples are translated only from       net per PoS.
English (despite the possibility of translating from
any available source).                                  4 A proposal for upgrading WordNet
                                                          Domains
2.4 Marks for synsets and variants
                                                        WordNet Domains9 (WND) is a lexical resource
In the new version of WEI it is possible to assign a
                                                        developed at ITC-IRST where synsets have been
mark to a variant or synset to indicate special prop-
                                                        semi-automatically annotated with one or more
erties. We can also write a small note or comment
                                                        domain labels from a set of 165 hierarchically or-
to explain better the reason to assign that mark.
                                                        ganized domains (Magnini and Cavagli` , 2000).
                                                                                                 a
   The available marks are the following:
                                                        WND allow to reduce the polysemy degree of the
  • Variant marks:                                      words, grouping those senses that belong to the
                                                        same domain (Magnini et al., 2002).
        – DUBLEX: For those variant with dubi-
                                                           But the semi-automatic method used to develop
          ous lexicalization.
                                                        this resource was not free of errors and inconsis-
        – INFL: Indicates that the variant is a in-     tencies. By cross-checking the ontological content
          flected one.                                   of the MCR it is possible to find some of these
        – RARE: Old fashioned or rarely used            problems. For instance, noun synset <diver 1
          variant.                                      frogman 1 underwater diver 1> defined as some-
        – SUBCAT: Subcategorization.                    one who works underwater has domain history be-
        – VULG: For those variants that are vul-        cause it inherits from its hypernym <explorer 1
          gar, rude, or offensive.                      adventurer 2>.

  • Synset marks:                                       4.0.1    Domain inheritance
        – GENLEX: Non-lexicalized general con-          WND was developed using WordNet 1.6. One
          cepts that are introduced to better orga-     consequence of the automatic mapping that we
          nize the hierarchy.                           used to upgrade version 1.6 to 3.0 is that many
                                                        synsets were left unlabeled (because there are new
        – HYPLEX: Indicates that the hypernym
                                                        synsets, changes in the structure, etc.).
          has identical lexicalization.
                                                          One of the first tasks undertaken has been to fill
        – SPECLEX: Domain specific terms that
                                                        these gaps. For them, we has carried out a propa-
          should be checked.
                                                        gation of the labels by inheritance for nominal and
2.5 User management                                     verbal synsets. The inherent structure of Word-
                                                        Net for adjectives and adverbs makes this spread
We also included a new user access control to
                                                          9
WEI. The previous user access control to WEI was              http://wndomains.fbk.eu/
WordNet        Nouns       Verbs    Adjectives       Adverbs       Synsets    WN %
             EngWN3.0      147,360     25,051       30,004          5,580       118,431     100%
             SpaWN3.0       40,009     11,107        7,005          1,106        59,227      50%
             CatWN3.0       51,598     11,577        7,679              2        46,027      39%
             EusWN3.0       41,071      9,472          148              0        30,615      26%
             GalWN3.0        9,114      1,413        4,866              0         9,320       8%

                     Table 1: Current number of synsets and variants of each WN.

             WordNet        Nouns      Verbs    Adjectives       Adverbs       Synsets     WN %
             EngWN3.0       82,379    13,767       18,156          3,621       117,923      100%
             SpaWN3.0       13,014     3,469        1,965            687        19,135       16%
             CatWN3.0        6,289        44          840              1         7,174        6%
             EusWN3.0        2,854        78             0             0         2,932        2%
             GalWN3.0        4,997         2        3,111              0         8,111        7%

                            Table 2: Current number of glosses of each WN.


not trivial. Therefore, this simple process has been   domain labels through WordNet. Additionaly, the
carried out only for nouns and verbs.                  method can also be used to detect anomalies in the
   We have worked exclusively on those synsets         original WND labels.
that had no labels at all. We inherited the labels
from its hypernyms. If a synset has more than one      4.0.2      A new graph based method
hypernym, the domain labels are taken from all of      UKB10    algorithm (Agirre and Soroa, 2009) ap-
them. We used a small list of incompatible labels      plies personalized PageRank on a graph derived
to detect incompatibilities. Therefore, the same       from a wordnet. This algorithm has proven to be
synset can not be both factotum and biology, or        very competitive on Word Sense Disambiguation
animals and plants.                                    tasks and it is easily portable to other languages
   This process increased our domain information       that have a wordnet (Agirre et al., 2010). Now,
by nearly a 18-19%, as shown in Tables 4 and 5:        we present a novel use of the UKB algorithm for
                                                       propagating information through a wordnet struc-
     PoS       Before       After    Increase          ture.
     Nouns     66,595      83,286       +25%              Given an input context, ’ukb ppv’ (Personal-
     Verbs     12,219      14,224       +16%           ized PageRank Vector) algorithm outputs a rank-
     All      100,315     119,011       +19%           ing vector over the nodes of a graph, after apply-
                                                       ing a Personalized PageRank over it. We just need
 Table 4: Number of synsets with domain labels.        to use a wordnet as a knowledge base and pass to
                                                       the application the contexts we want to process,
                                                       performing a kind of spreading activation through
     PoS       Before       After    Increase          the WordNet structure.
     Nouns     87,938     108,665       +24%
                                                          As context we use those synsets labelled with
     Verbs     13,026      15,051       +16%
                                                       a particular domain. Thus, for each of the 16911
     All      124,551     146,899       +18%
                                                       domain labels included in the MCR we generate a
                                                       context. Each file contains the list of offsets cor-
     Table 5: Total number of domain labels.           responding to those synsets with a particular do-
                                                       main label. After creating the context file, we just
  However, this process may also have propa-
                                                       need to execute ’ukb ppv’ that returns a ranking of
gated innapropriate domain labels to unlabeled
                                                       weights for each WordNet synset with respect to
synsets. It remains for future research an accurate
                                                       that particular domain.
evaluation of this new resource.
  In the next section we present some examples           10
                                                              http://ixa2.si.ehu.es/ukb/
                                                         11
using a new graph-based method for propagating                Excluding factotum labels.
WordNet        Nouns     Verbs    Adjectives    Adverbs     Synsets    WN %
             EngWN3.0       10,433   11,583       15,615       3,674      41,305     100%
             SpaWN3.0          478       27          195         967         700       2%
             CatWN3.0        2,103       46          368           0       2,517       6%
             EusWN3.0        2,377        0             0          0       2,377       6%
             GalWN3.0          270        2        4,291           0       4,563      11%

                          Table 3: Current number of examples of each WN.


   Once made the process for all domains we have      synset <pornography 1 porno 1 porn 1 erotica 1
weights for each synset and for each of the do-       smut 5> defined as creative activity (writing or
mains. Therefore, we know which are the highest       pictures or films etc.) of no literary or artistic
weights for each domain and the highest weights       value other than to stimulate sexual desire and la-
for each synset. This allows us to estimate which     belled with the domain law. The suggestions of
synsets are more representative of each domain        the algorithm seems to improve the current label-
and which domains are best for each synset.           ing because it suggests sexuality (possibly the best
   Basically, what we do is to mark some synsets      one) and cinema (possibly, the second best op-
with a domain (using the labels we already know       tion). Moreover, the wrong label dissapears.
from the original porting process) and use the
wordnet graph to propagate the new labelling. We                           Method 3
work on the assumption that a synset directly re-              Weight               Domain
lated to several synsets labelled with a particular         0.000123453:            sexuality
domain (i.e biology) would itself possibly be also          0.000112444:             cinema
related somehow to that domain (i.e. biology).              0.000077780:             theatre
Therefore, it makes no sense to use the domain              0.000075525:            painting
factotum for this technique.                                0.000062377:      telecommunication
   Table 6 shows the first ten domains and weights           0.000060640:           publishing
resulting from the application of this method on            0.000050370:    psychological features
synset <diver 1 frogman 1 underwater diver 1>.              0.000047003:          photography
The suggestions of the algorithm seems to im-               0.000046853:           artisanship
prove the current labeling because it suggests sub          0.000040458:          graphic arts
(possibly the best one) and diving (possibly, the
second best option). Moreover, the method sug-        Table 7: PPV weight rankings for sense porno1 .
                                                                                                  n
gests the wrong label with a much lower weight.

              Weight         Domain                   5 Concluding Remarks and Future
            0.0144335:         sub                      Directions
            0.0015939:        diving                  As a result of this work, the current version of
            0.0001725:     swimming                   the MCR consistently maintains new wordnet ver-
            0.0001297:       history                  sions for five languages (English, Spanish, Cata-
            0.0000557:       nautical                 lan, Basque and Galician), and the ontological
            0.0000529:       fashion                  knowledge from WordNet Domains, Top Ontol-
            0.0000412:      jewellery                 ogy and SUMO.
            0.0000315:     ethnology                     In particular, the main contributions of our work
            0.0000274:    archaeology                 can be summarized as follows:
            0.0000204:         gas                       We have created a new version of the MCR us-
                                                      ing WordNet 3.0 as ILI.
 Table 6: PPV weight rankings for sense diver1 .
                                             n           We have improved the existing Web EuroWord-
                                                      Net Interface (WEI) (both consult and edit inter-
   Table 7 shows the first ten domains and weights     faces) to work with the new version of the MCR.
resulting from the application of this method on      Now, the interface includes automatic translation
facilities, making it easier and faster the develop-   sible thanks to the support of the FP7 PATHS
ment of the resources integrated into the MCR.         project (ICT-2010-270082) and the Spanish na-
We also added new editing facilities for recording     tional projects KNOW (TIN2006-15049-C03) and
new linguistic information associated to the vari-     KNOW2 (TIN2009-14715-C04-04).
ants and synsets.
   We have uploaded into the new version of the
MCR the English WordNet 3.0, the new Spanish           References
WordNet 3.0 (Fern´ ndez-Montraveta et al., 2008)
                    a                                  Agirre, E., Cuadros, M., Rigau, G., and Soroa, A.
and a new Galician WordNet 3.0.                          (2010). Exploring Knowledge Bases for Similar-
                                                         ity. In Proceedings of the Seventh conference on
   We have used a complete mapping from Word-            International Language Resources and Evaluation
Net 1.6 to WordNet 3.0 (covering not only nouns,         (LREC10). European Language Resources Associ-
but verbs, adjectives and adverbs) to transport the      ation (ELRA). ISBN: 2-9517408-6-7. Pages 373–
Basque and Catalan wordnets and the ontological          377.”.
knowledge from the existing version of the MCR         Agirre, E. and Martinez, D. (2001). Knowledge
(using WordNet 1.6 as ILI) to the new MCR ver-           sources for Word Sense Disambiguation. In Pro-
sion (using WordNet 3.0 as ILI).                         ceedings of the Fourth International Conference
   We have applied a very simple estrategy to com-       TSD 2001, Plzen (Pilsen), Czech Republic. Pub-
                                                         lished in the Springer Verlag Lecture Notes in Com-
plete the ontological information by exploiting ba-      puter Science series. V´ clav Matousek, Pavel Maut-
                                                                                a
sic inheritance mechanisms. This process has been        ner, Roman Moucek, Karel Tauser (eds.) Copyright
applied to WordNet Domains, Top Ontology and             Springer-Verlag. ISBN: 3-540-42557-8. ”.
SUMO.                                                  Agirre, E., Rigau, G., Castell´ n, I., Alonso, L.,
                                                                                        o
   We have also investigated a new approach              Padr´ , L., Cuadros, M., Climent, S., and Coll-Florit,
                                                              o
for consistently propagating domain formation            M. (2009). KNOW: Developing large-scale mul-
through the WordNet structure by exploiting a            tilingual technologies for language understanding.
                                                         Procesamiento del Lenguaje Natural, (43):377–378.
well-known graph algorithm using UKB. Al-
though an exhaustive empirical evaluation should       Agirre, E. and Soroa, A. (2009). Personalizing PageR-
be addressed in a near future, a preliminary re-         ank for Word Sense Disambiguation. In Proceed-
                                                         ings of the 12th conference of the European chap-
view of the new resources created using this pro-
                                                         ter of the Association for Computational Linguistics
cess presents very interesting insights for future       (EACL-2009), Athens, Greece.
research.
   Obviously, further investigation is needed to as-   Alfonseca, E. and Manandhar, S. (2002). Distinguish-
                                                         ing concepts and instances in WordNet. In Proceed-
sess the quality of the new labelling of WordNet         ings of the first International Conference of Global
Domains. We plan to evaluate the quality of these        WordNet Association, Mysore, India.
new resources indirectly by comparing their per-
                                                       ´
                                                       Alvez, J., Atserias, J., Carrera, J., Climent, S., Laparra,
formance on a common Word Sense Disambigua-              E., Oliver, A., and Rigau, G. (2008). Complete and
tion task. We would also like to continue studing        Consistent Annotation of WordNet using the Top
different ways for selecting the most appropriate        Concept Ontology. In Proceedings of the the 6th
set of domain labels per synset. We also plan to         Conference on Language Resources and Evaluation
                                                         (LREC 2008), Marrakech (Morocco).
derive domain information from Wikipedia by ex-
ploiting WordNet++ (Navigli and Ponzetto, 2010).       Atserias, J., Climent, S., Farreres, J., Rigau, G., and
   The whole content of the MCR and the new              Rodr´guez, H. (1997). Combining Multiple Meth-
                                                               ı
WEI is freely available 12 .                             ods for the Automatic Construction of Multilingual
                                                         WordNets. In Proceedings of International Confer-
   Moreover, the maintenance of this type of             ence on Recent Advances in Natural Language Pro-
resources is continuous, and all the integrated          cessing (RANLP’97), Tzigov Chark, Bulgaria.
knowledge should be constantly updated and re-
                                                       Atserias, J., Rigau, G., and Villarejo, L. (2004a).
vised.                                                   Spanish WordNet 1.6: Porting the Spanish Wordnet
                                                         across Princeton versions. In LREC’04.
Acknowledgments
                                                       Atserias, J., Villarejo, L., Rigau, G., Agirre, E., Car-
We thank the IXA NLP group from the Basque               roll, J., Vossen, P., and Magnini, B. (2004b). The
Country University. This work was been pos-              MEANING Multilingual Central Respository. In
                                                         Proceedings of the Second International Global
  12
       http://adimen.si.ehu.es/web/MCR                   WordNet Conference (GWC’04).
Bentivogli, L., Pianta, E., and Girardi, C. (2002).        Navigli, R. and Ponzetto, S. P. (2010). Building a very
  MultiWordNet: developing an aligned multilingual           large multilingual semantic network. In Proceed-
  database. In Proceedings of the First International        ings of the 48th Annual Meeting of the Association
  Conference on Global WordNet, Mysore, India.               for Computational Linguistics, pages 216–225, Up-
                                                             psala, Sweden.
Ben´tez, L., Cervell, S., Escudero, G., L´ pez, M.,
   ı                                       o
  Rigau, G., and Taul´ , M. (1998). Methods and Tools
                     e                                     Pease, A., Niles, I., and Li, J. (2002). The Suggested
  for Building the Catalan WordNet. In Proceedings           Upper Merged Ontology: A Large Ontology for the
  of ELRA Workshop on Language Resources for Eu-             Semantic Web and its Applications. In AAAI-2002.
  ropean Minority Languages, Granada, Spain.
                                                           Pociello, E., Gurrutxaga, A., Agirre, E., Aldezabal,
                                                             I., and Rigau, G. (2008). WNTERM: Combining
Daud´ , J., Padr´ , L., and Rigau, G. (2000). Mapping
    e           o
                                                             the Basque WordNet and a Terminological Dictio-
  WordNets Using Structural Information. In 38th An-
                                                             nary. In 6th international conference on Language
  nual Meeting of the Association for Computational
                                                             Resources and Evaluation, LREC’08.
  Linguistics (ACL’2000)., Hong Kong.
                                                           Rigau, G., Magnini, B., Agirre, E., Vossen, P., and Car-
Daud´ , J., Padr´ , L., and Rigau, G. (2001). A Com-
     e          o                                            roll, J. (2002). MEANING: A Roadmap to Knowl-
  plete WN1.5 to WN1.6 Mapping. In Proceedings of            edge Technologies. In Proceedings of COLING
  NAACL Workshop ’WordNet and Other Lexical Re-              Workshop A Roadmap for Computational Linguis-
  sources: Applications, Extensions and Customiza-           tics, Taipei, Taiwan.
  tions’ (NAACL’2001)., Pittsburg, PA, USA.
                                                           Tufis, D., Cristea, D., and Stamou, S. (2004). Balka-
Daud´ , J., Padr´ , L., and Rigau, G. (2003). Mak-
     e          o                                            Net: Aims, Methods, Results and Perspectives. A
  ing Wordnet Mappings Robust. In Proceedings of             General Overview. Romanian Journal on Science
  the 19th Congreso de la Sociedad Espa˜ ola para el
                                         n                   Technology of Information. Special Issue on Balka-
  Procesamiento del Lenguage Natural, SEPLN, Uni-            net, 7(3–4):9–44.
  versidad Universidad de Alcal´ de Henares. Madrid,
                                a
  Spain.                                                   Vossen, P. (1998). EuroWordNet: A Multilingual
                                                             Database with Lexical Semantic Networks. Kluwer
Fellbaum, C. (1998). WordNet. An Electronic Lexical          Academic Publishers.
  Database. Language, Speech, and Communication.
                                                           Xavier, G. G., Clemente, X. M. G., Pereira, A. G.,
  The MIT Press.
                                                             and Lorenzo, V. T. (2011). Galnet: WordNet 3.0
                                                             do galego. Linguam´ tica, 3(1):61–67.
                                                                               a
Fern´ ndez-Montraveta, A., V´ zquez, G., and Fell-
    a                          a
  baum, C. (2008). Text Resources and Lexical
  Knowledge, volume 33 of Text, Translation, Compu-
  tational Processing, chapter The Spanish Version of
  WordNet 3.0, pages 175–182. Mouton de Gruyter.

Izquierdo, R., Su´ rez, A., and Rigau, G. (2007). Ex-
                  a
   ploring the Automatic Selection of Basic Level Con-
   cepts. In Proceedings of RANLP.

K., R., S., T., T., C., V., S., and H., I. (2010). WNMS:
  Connecting the Distributed WordNet in the Case of
  Asian WordNet. In Proceedings of the 5th Global
  WordNet Conference (GWC’10), Mumbai (India).

Magnini, B. and Cavagli` , G. (2000). Integrating sub-
                       a
 ject field codes into WordNet. In Proceedings of the
 Second Internatgional Conference on Language Re-
 sources and Evaluation (LREC), Athens. Greece.

Magnini, B., Satrapparava, C., Pezzulo, G., and
 Gliozzo, A. (2002). The Role of Domains Infor-
 mations. In In Word Sense Disambiguation, Treto,
 Cambridge.

Mihalcea, R. and Moldovan, D. (2001). eXtended
  WordNet: Progress Report. In Proceedings of
  NAACL Workshop WordNet and Other Lexical Re-
  sources: Applications, Extensions and Customiza-
  tions, pages 95–100, Pittsburg, PA, USA.

More Related Content

Similar to Multilingual Central Repository version 3.0: upgrading a very large lexical knowledge base

Semantic Hand-Tagging of the SenSem Corpus Using Spanish WordNet Senses
Semantic Hand-Tagging of the SenSem Corpus Using Spanish WordNet SensesSemantic Hand-Tagging of the SenSem Corpus Using Spanish WordNet Senses
Semantic Hand-Tagging of the SenSem Corpus Using Spanish WordNet Sensespathsproject
 
Acceptance Testing Of A Spoken Language Translation System
Acceptance Testing Of A Spoken Language Translation SystemAcceptance Testing Of A Spoken Language Translation System
Acceptance Testing Of A Spoken Language Translation SystemMichele Thomas
 
Computational linguistics
Computational linguisticsComputational linguistics
Computational linguisticsSpringer
 
TEXT ADVERTISEMENTS ANALYSIS USING CONVOLUTIONAL NEURAL NETWORKS
TEXT ADVERTISEMENTS ANALYSIS USING CONVOLUTIONAL NEURAL NETWORKSTEXT ADVERTISEMENTS ANALYSIS USING CONVOLUTIONAL NEURAL NETWORKS
TEXT ADVERTISEMENTS ANALYSIS USING CONVOLUTIONAL NEURAL NETWORKSijdms
 
Interoperability in public sector
Interoperability in public sectorInteroperability in public sector
Interoperability in public sectorVestforsk.no
 
An Open Online Dictionary for Endangered Uralic Languages.pdf
An Open Online Dictionary for Endangered Uralic Languages.pdfAn Open Online Dictionary for Endangered Uralic Languages.pdf
An Open Online Dictionary for Endangered Uralic Languages.pdfJackie Gold
 
The first FOSD-tacotron-2-based text-to-speech application for Vietnamese
The first FOSD-tacotron-2-based text-to-speech application for VietnameseThe first FOSD-tacotron-2-based text-to-speech application for Vietnamese
The first FOSD-tacotron-2-based text-to-speech application for VietnamesejournalBEEI
 
Modeling of Speech Synthesis of Standard Arabic Using an Expert System
Modeling of Speech Synthesis of Standard Arabic Using an Expert SystemModeling of Speech Synthesis of Standard Arabic Using an Expert System
Modeling of Speech Synthesis of Standard Arabic Using an Expert Systemcsandit
 
Interoperability in public sector presentation at e gove 2010 lausanne
Interoperability in public sector   presentation at e gove 2010 lausanneInteroperability in public sector   presentation at e gove 2010 lausanne
Interoperability in public sector presentation at e gove 2010 lausanneVestforsk.no
 
OpenWordnet-PT: A Project Report
OpenWordnet-PT: A Project ReportOpenWordnet-PT: A Project Report
OpenWordnet-PT: A Project ReportAlexandre Rademaker
 
Using translation memory_to_speed_up_tra
Using translation memory_to_speed_up_traUsing translation memory_to_speed_up_tra
Using translation memory_to_speed_up_traCamillaTonanzi
 
English kazakh parallel corpus for statistical machine translation
English kazakh parallel corpus for statistical machine translationEnglish kazakh parallel corpus for statistical machine translation
English kazakh parallel corpus for statistical machine translationijnlc
 
Logics and Ontologies for Portuguese Understanding
Logics and Ontologies for Portuguese UnderstandingLogics and Ontologies for Portuguese Understanding
Logics and Ontologies for Portuguese UnderstandingValeria de Paiva
 
Developing an architecture for translation engine using ontology
Developing an architecture for translation engine using ontologyDeveloping an architecture for translation engine using ontology
Developing an architecture for translation engine using ontologyAlexander Decker
 
French machine reading for question answering
French machine reading for question answeringFrench machine reading for question answering
French machine reading for question answeringAli Kabbadj
 
IRJET- Text to Speech Synthesis for Hindi Language using Festival Framework
IRJET- Text to Speech Synthesis for Hindi Language using Festival FrameworkIRJET- Text to Speech Synthesis for Hindi Language using Festival Framework
IRJET- Text to Speech Synthesis for Hindi Language using Festival FrameworkIRJET Journal
 

Similar to Multilingual Central Repository version 3.0: upgrading a very large lexical knowledge base (20)

Semantic Hand-Tagging of the SenSem Corpus Using Spanish WordNet Senses
Semantic Hand-Tagging of the SenSem Corpus Using Spanish WordNet SensesSemantic Hand-Tagging of the SenSem Corpus Using Spanish WordNet Senses
Semantic Hand-Tagging of the SenSem Corpus Using Spanish WordNet Senses
 
Acceptance Testing Of A Spoken Language Translation System
Acceptance Testing Of A Spoken Language Translation SystemAcceptance Testing Of A Spoken Language Translation System
Acceptance Testing Of A Spoken Language Translation System
 
Computational linguistics
Computational linguisticsComputational linguistics
Computational linguistics
 
TEXT ADVERTISEMENTS ANALYSIS USING CONVOLUTIONAL NEURAL NETWORKS
TEXT ADVERTISEMENTS ANALYSIS USING CONVOLUTIONAL NEURAL NETWORKSTEXT ADVERTISEMENTS ANALYSIS USING CONVOLUTIONAL NEURAL NETWORKS
TEXT ADVERTISEMENTS ANALYSIS USING CONVOLUTIONAL NEURAL NETWORKS
 
Lit mtap
Lit mtapLit mtap
Lit mtap
 
Interoperability in public sector
Interoperability in public sectorInteroperability in public sector
Interoperability in public sector
 
An Open Online Dictionary for Endangered Uralic Languages.pdf
An Open Online Dictionary for Endangered Uralic Languages.pdfAn Open Online Dictionary for Endangered Uralic Languages.pdf
An Open Online Dictionary for Endangered Uralic Languages.pdf
 
AICOL2015_paper_16
AICOL2015_paper_16AICOL2015_paper_16
AICOL2015_paper_16
 
The first FOSD-tacotron-2-based text-to-speech application for Vietnamese
The first FOSD-tacotron-2-based text-to-speech application for VietnameseThe first FOSD-tacotron-2-based text-to-speech application for Vietnamese
The first FOSD-tacotron-2-based text-to-speech application for Vietnamese
 
Modeling of Speech Synthesis of Standard Arabic Using an Expert System
Modeling of Speech Synthesis of Standard Arabic Using an Expert SystemModeling of Speech Synthesis of Standard Arabic Using an Expert System
Modeling of Speech Synthesis of Standard Arabic Using an Expert System
 
Lemon at-mlw3
Lemon at-mlw3Lemon at-mlw3
Lemon at-mlw3
 
Interoperability in public sector presentation at e gove 2010 lausanne
Interoperability in public sector   presentation at e gove 2010 lausanneInteroperability in public sector   presentation at e gove 2010 lausanne
Interoperability in public sector presentation at e gove 2010 lausanne
 
OpenWordnet-PT: A Project Report
OpenWordnet-PT: A Project ReportOpenWordnet-PT: A Project Report
OpenWordnet-PT: A Project Report
 
Managing domain ontologies within the AOS
Managing domain ontologies within the AOSManaging domain ontologies within the AOS
Managing domain ontologies within the AOS
 
Using translation memory_to_speed_up_tra
Using translation memory_to_speed_up_traUsing translation memory_to_speed_up_tra
Using translation memory_to_speed_up_tra
 
English kazakh parallel corpus for statistical machine translation
English kazakh parallel corpus for statistical machine translationEnglish kazakh parallel corpus for statistical machine translation
English kazakh parallel corpus for statistical machine translation
 
Logics and Ontologies for Portuguese Understanding
Logics and Ontologies for Portuguese UnderstandingLogics and Ontologies for Portuguese Understanding
Logics and Ontologies for Portuguese Understanding
 
Developing an architecture for translation engine using ontology
Developing an architecture for translation engine using ontologyDeveloping an architecture for translation engine using ontology
Developing an architecture for translation engine using ontology
 
French machine reading for question answering
French machine reading for question answeringFrench machine reading for question answering
French machine reading for question answering
 
IRJET- Text to Speech Synthesis for Hindi Language using Festival Framework
IRJET- Text to Speech Synthesis for Hindi Language using Festival FrameworkIRJET- Text to Speech Synthesis for Hindi Language using Festival Framework
IRJET- Text to Speech Synthesis for Hindi Language using Festival Framework
 

More from pathsproject

Roadmap from ESEPaths to EDMPaths: a note on representing annotations resulti...
Roadmap from ESEPaths to EDMPaths: a note on representing annotations resulti...Roadmap from ESEPaths to EDMPaths: a note on representing annotations resulti...
Roadmap from ESEPaths to EDMPaths: a note on representing annotations resulti...pathsproject
 
Aletras, Nikolaos and Stevenson, Mark (2013) "Evaluating Topic Coherence Us...
Aletras, Nikolaos  and  Stevenson, Mark (2013) "Evaluating Topic Coherence Us...Aletras, Nikolaos  and  Stevenson, Mark (2013) "Evaluating Topic Coherence Us...
Aletras, Nikolaos and Stevenson, Mark (2013) "Evaluating Topic Coherence Us...pathsproject
 
PATHSenrich: A Web Service Prototype for Automatic Cultural Heritage Item Enr...
PATHSenrich: A Web Service Prototype for Automatic Cultural Heritage Item Enr...PATHSenrich: A Web Service Prototype for Automatic Cultural Heritage Item Enr...
PATHSenrich: A Web Service Prototype for Automatic Cultural Heritage Item Enr...pathsproject
 
Implementing Recommendations in the PATHS system, SUEDL 2013
Implementing Recommendations in the PATHS system, SUEDL 2013Implementing Recommendations in the PATHS system, SUEDL 2013
Implementing Recommendations in the PATHS system, SUEDL 2013pathsproject
 
User-Centred Design to Support Exploration and Path Creation in Cultural Her...
 User-Centred Design to Support Exploration and Path Creation in Cultural Her... User-Centred Design to Support Exploration and Path Creation in Cultural Her...
User-Centred Design to Support Exploration and Path Creation in Cultural Her...pathsproject
 
Generating Paths through Cultural Heritage Collections Latech2013 paper
Generating Paths through Cultural Heritage Collections Latech2013 paperGenerating Paths through Cultural Heritage Collections Latech2013 paper
Generating Paths through Cultural Heritage Collections Latech2013 paperpathsproject
 
Supporting User's Exploration of Digital Libraries, Suedl 2012 workshop proce...
Supporting User's Exploration of Digital Libraries, Suedl 2012 workshop proce...Supporting User's Exploration of Digital Libraries, Suedl 2012 workshop proce...
Supporting User's Exploration of Digital Libraries, Suedl 2012 workshop proce...pathsproject
 
PATHS state of the art monitoring report
PATHS state of the art monitoring reportPATHS state of the art monitoring report
PATHS state of the art monitoring reportpathsproject
 
Recommendations for the automatic enrichment of digital library content using...
Recommendations for the automatic enrichment of digital library content using...Recommendations for the automatic enrichment of digital library content using...
Recommendations for the automatic enrichment of digital library content using...pathsproject
 
Generating Paths through Cultural Heritage Collections, LATECH 2013 paper
Generating Paths through Cultural Heritage Collections, LATECH 2013 paperGenerating Paths through Cultural Heritage Collections, LATECH 2013 paper
Generating Paths through Cultural Heritage Collections, LATECH 2013 paperpathsproject
 
PATHS @ LATECH 2013
PATHS @ LATECH 2013PATHS @ LATECH 2013
PATHS @ LATECH 2013pathsproject
 
PATHS at the eChallenges conference
PATHS at the eChallenges conferencePATHS at the eChallenges conference
PATHS at the eChallenges conferencepathsproject
 
PATHS at the EAA conference 2013
PATHS at the EAA conference 2013PATHS at the EAA conference 2013
PATHS at the EAA conference 2013pathsproject
 
PATHS at the eCult dialogue day 2013
PATHS at the eCult dialogue day 2013PATHS at the eCult dialogue day 2013
PATHS at the eCult dialogue day 2013pathsproject
 
Comparing taxonomies for organising collections of documents presentation
Comparing taxonomies for organising collections of documents presentationComparing taxonomies for organising collections of documents presentation
Comparing taxonomies for organising collections of documents presentationpathsproject
 
Comparing taxonomies for organising collections of documents
Comparing taxonomies for organising collections of documentsComparing taxonomies for organising collections of documents
Comparing taxonomies for organising collections of documentspathsproject
 
PATHS Final prototype interface design v1.0
PATHS Final prototype interface design v1.0PATHS Final prototype interface design v1.0
PATHS Final prototype interface design v1.0pathsproject
 
PATHS Evaluation of the 1st paths prototype
PATHS Evaluation of the 1st paths prototypePATHS Evaluation of the 1st paths prototype
PATHS Evaluation of the 1st paths prototypepathsproject
 
PATHS Second prototype-functional-spec
PATHS Second prototype-functional-specPATHS Second prototype-functional-spec
PATHS Second prototype-functional-specpathsproject
 
PATHS Final state of art monitoring report v0_4
PATHS  Final state of art monitoring report v0_4PATHS  Final state of art monitoring report v0_4
PATHS Final state of art monitoring report v0_4pathsproject
 

More from pathsproject (20)

Roadmap from ESEPaths to EDMPaths: a note on representing annotations resulti...
Roadmap from ESEPaths to EDMPaths: a note on representing annotations resulti...Roadmap from ESEPaths to EDMPaths: a note on representing annotations resulti...
Roadmap from ESEPaths to EDMPaths: a note on representing annotations resulti...
 
Aletras, Nikolaos and Stevenson, Mark (2013) "Evaluating Topic Coherence Us...
Aletras, Nikolaos  and  Stevenson, Mark (2013) "Evaluating Topic Coherence Us...Aletras, Nikolaos  and  Stevenson, Mark (2013) "Evaluating Topic Coherence Us...
Aletras, Nikolaos and Stevenson, Mark (2013) "Evaluating Topic Coherence Us...
 
PATHSenrich: A Web Service Prototype for Automatic Cultural Heritage Item Enr...
PATHSenrich: A Web Service Prototype for Automatic Cultural Heritage Item Enr...PATHSenrich: A Web Service Prototype for Automatic Cultural Heritage Item Enr...
PATHSenrich: A Web Service Prototype for Automatic Cultural Heritage Item Enr...
 
Implementing Recommendations in the PATHS system, SUEDL 2013
Implementing Recommendations in the PATHS system, SUEDL 2013Implementing Recommendations in the PATHS system, SUEDL 2013
Implementing Recommendations in the PATHS system, SUEDL 2013
 
User-Centred Design to Support Exploration and Path Creation in Cultural Her...
 User-Centred Design to Support Exploration and Path Creation in Cultural Her... User-Centred Design to Support Exploration and Path Creation in Cultural Her...
User-Centred Design to Support Exploration and Path Creation in Cultural Her...
 
Generating Paths through Cultural Heritage Collections Latech2013 paper
Generating Paths through Cultural Heritage Collections Latech2013 paperGenerating Paths through Cultural Heritage Collections Latech2013 paper
Generating Paths through Cultural Heritage Collections Latech2013 paper
 
Supporting User's Exploration of Digital Libraries, Suedl 2012 workshop proce...
Supporting User's Exploration of Digital Libraries, Suedl 2012 workshop proce...Supporting User's Exploration of Digital Libraries, Suedl 2012 workshop proce...
Supporting User's Exploration of Digital Libraries, Suedl 2012 workshop proce...
 
PATHS state of the art monitoring report
PATHS state of the art monitoring reportPATHS state of the art monitoring report
PATHS state of the art monitoring report
 
Recommendations for the automatic enrichment of digital library content using...
Recommendations for the automatic enrichment of digital library content using...Recommendations for the automatic enrichment of digital library content using...
Recommendations for the automatic enrichment of digital library content using...
 
Generating Paths through Cultural Heritage Collections, LATECH 2013 paper
Generating Paths through Cultural Heritage Collections, LATECH 2013 paperGenerating Paths through Cultural Heritage Collections, LATECH 2013 paper
Generating Paths through Cultural Heritage Collections, LATECH 2013 paper
 
PATHS @ LATECH 2013
PATHS @ LATECH 2013PATHS @ LATECH 2013
PATHS @ LATECH 2013
 
PATHS at the eChallenges conference
PATHS at the eChallenges conferencePATHS at the eChallenges conference
PATHS at the eChallenges conference
 
PATHS at the EAA conference 2013
PATHS at the EAA conference 2013PATHS at the EAA conference 2013
PATHS at the EAA conference 2013
 
PATHS at the eCult dialogue day 2013
PATHS at the eCult dialogue day 2013PATHS at the eCult dialogue day 2013
PATHS at the eCult dialogue day 2013
 
Comparing taxonomies for organising collections of documents presentation
Comparing taxonomies for organising collections of documents presentationComparing taxonomies for organising collections of documents presentation
Comparing taxonomies for organising collections of documents presentation
 
Comparing taxonomies for organising collections of documents
Comparing taxonomies for organising collections of documentsComparing taxonomies for organising collections of documents
Comparing taxonomies for organising collections of documents
 
PATHS Final prototype interface design v1.0
PATHS Final prototype interface design v1.0PATHS Final prototype interface design v1.0
PATHS Final prototype interface design v1.0
 
PATHS Evaluation of the 1st paths prototype
PATHS Evaluation of the 1st paths prototypePATHS Evaluation of the 1st paths prototype
PATHS Evaluation of the 1st paths prototype
 
PATHS Second prototype-functional-spec
PATHS Second prototype-functional-specPATHS Second prototype-functional-spec
PATHS Second prototype-functional-spec
 
PATHS Final state of art monitoring report v0_4
PATHS  Final state of art monitoring report v0_4PATHS  Final state of art monitoring report v0_4
PATHS Final state of art monitoring report v0_4
 

Recently uploaded

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsEnterprise Knowledge
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsHyundai Motor Group
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetEnjoy Anytime
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationSafe Software
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAndikSusilo4
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 3652toLead Limited
 

Recently uploaded (20)

IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Pigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping ElbowsPigging Solutions Piggable Sweeping Elbows
Pigging Solutions Piggable Sweeping Elbows
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter RoadsSnow Chain-Integrated Tire for a Safe Drive on Winter Roads
Snow Chain-Integrated Tire for a Safe Drive on Winter Roads
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your BudgetHyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
Hyderabad Call Girls Khairatabad ✨ 7001305949 ✨ Cheap Price Your Budget
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
The transition to renewables in India.pdf
The transition to renewables in India.pdfThe transition to renewables in India.pdf
The transition to renewables in India.pdf
 
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry InnovationBeyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
Beyond Boundaries: Leveraging No-Code Solutions for Industry Innovation
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
Azure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & ApplicationAzure Monitor & Application Insight to monitor Infrastructure & Application
Azure Monitor & Application Insight to monitor Infrastructure & Application
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
Tech-Forward - Achieving Business Readiness For Copilot in Microsoft 365
 

Multilingual Central Repository version 3.0: upgrading a very large lexical knowledge base

  • 1. Multilingual Central Repository version 3.0: upgrading a very large lexical knowledge base Aitor Gonz´ lez Agirre, Egoitz Laparra, German Rigau a Basque Country University Donostia, Basque Country {aitor.gonzalez, egoitz.laparra, german.rigau}@ehu.es Abstract 2 4003) (Vossen, 1998). The MCR is the re- sult of the European MEANING project (IST- This paper describes the upgrading pro- 2001-34460) (Rigau et al., 2002), as well as cess of the Multilingual Central Reposi- projects KNOW (TIN2006-15049-C03)3 (Agirre tory (MCR). The new MCR uses WordNet et al., 2009), KNOW2 (TIN2009-14715-C04)4 3.0 as Interlingual-Index (ILI). Now, the and several complementary actions associated to current version of the MCR integrates in the KNOW2 project. The original MCR was the same EuroWordNet framework word- aligned to the 1.6 version of WordNet. In the nets from five different languages: En- framework of the KNOW2 project, we decided to glish, Spanish, Catalan, Basque and Gali- upgrade the MCR to be aligned to a most recent cian. In order to provide ontological co- version of WordNet. herence to all the integrated wordnets, The previous version of the MCR, aligned the MCR has also been enriched with a to the English 1.6 WordNet version, also inte- disparate set of ontologies: Base Con- grated the eXtended WordNet project (Mihalcea cepts, Top Ontology, WordNet Domains and Moldovan, 2001), large collections of selec- and Suggested Upper Merged Ontology. tional preferences acquired from SemCor (Agirre We also suggest a novel approach for im- and Martinez, 2001) and different sets of named proving some of the semantic resources entities (Alfonseca and Manandhar, 2002). It integrated in the MCR, including a semi- was also enriched with semantic and ontological automatic method to propagate domain in- ´ properties as Top Ontology (Alvez et al., 2008), formation. The whole content of the MCR SUMO (Pease et al., 2002) or WordNet Domains is freely available. (Magnini and Cavagli` , 2000). a The new MCR integrates wordnets of five 1 Introduction different languages, including English, Spanish, Building large and rich knowledge bases is a very Catalan, Basque and Galician. This paper presents costly effort which involves large research groups the work carried out to upgrade the MCR to new for long periods of development. For instance, versions of these resources. By using technol- hundreds of person-years have been invested in the ogy to automatically align wordnets (Daud´ et al., e development of wordnets for various languages 2003), we have been able to transport knowledge (Fellbaum, 1998; Vossen, 1998; Tufis et al., 2004; from different WordNet versions. Thus, we can K. et al., 2010). In the case of the English Word- maintain the compatibility between all the knowl- Net, in more than ten years of manual construc- edge bases that use a particular version of Word- tion (from 1995 to 2006, that is, from version 1.5 Net as a sense repository. However, most of to 3.0), WordNet grew from 103,445 to 235,402 the ontological knowledge have not been directly semantic relations1 , which represents a growth of ported from the previous version of the MCR. around one thousand new relations per month. Furthermore, WordNet Domains5 was gener- The Multilingual Central Repository (MCR)2 ated semi-automatically and has never been ver- (Atserias et al., 2004b) follows the model pro- ified completely. Additionaly, it was aligned to posed by the European project EuroWordNet (LE- 3 http://ixa.si.ehu.es/know 1 4 Symmetric relations are counted only once. http://ixa.si.ehu.es/know2 2 5 http://adimen.si.ehu.es/web/MCR http://wndomains.fbk.eu/
  • 2. WordNet 1.6. Thus, one goal of this work is the automatic construction of a new semantic resource derived from WordNet Domains and aligned to WordNet 3.0. To assist in the correction and maintenance of the integrated resources in the MCR, we also adapted and enhanced the Web EuroWordNet In- terface (WEI) in both consult and edit modes. 2 Multilingual Central Repository 3.0 The first version of the MCR was built following the model proposed by the EuroWordNet project. The EuroWordNet architecture includes the Inter- Figure 1: Example of a multiple intersection in the Lingual Index (ILI), a Domain Ontology and a Top mapping between two versions of WordNet. Ontology (Vossen, 1998). Initially most of the knowledge uploaded into the MCR was aligned to WordNet 1.6 and the Upgrading ILI: The algorithm to align word- Spanish, Catalan, Basque and Italian WordNet and nets (Daud´ et al., 2000; Daud´ et al., 2001; e e the MultiWordNet Domains, were using WordNet Daud´ et al., 2003) produces two mappings for e 1.6 as ILI (Bentivogli et al., 2002; Magnini and each POS, one in each direction (from 1.6 to 3.0, Cavagli` , 2000). Thus, the original MCR used a and from 3.0 to 1.6). To upgrade the ILI, different Princeton WordNet 1.6 as ILI. This option also approaches were applied depending on the POS. minimized side efects with other European initia- For nouns, those synsets having multiple map- tives (Balkanet, EuroTerm, etc.) and wordnet de- pings from 1.6 to 3.0 were checked manually velopments around Global WordNet Association. (Pociello et al., 2008). Thus, the Spanish, Catalan and Basque wordnets For verbs, adjectives and adverbs, for those as well as the EuroWordNet Top Ontology and the synsets having multiple mappings, we took the in- associated Base Concepts were transported from tersection between the two mappings (from 1.6 to its original WordNet 1.5 to WordNet 1.6 (Atserias 3.0, and from 3.0 to 1.6). et al., 1997; Ben´tez et al., 1998; Atserias et al., ı Upgrading WordNets: Finally, using the pre- 2004a). vious mapping, we transported from ILI 1.6 to The release of new free versions of Spanish and ILI 3.0 the Basque (Pociello et al., 2008) and Galician wordnets aligned to Princeton WordNet Catalan (Ben´tez et al., 1998) wordnets. The ı 3.0 (Fern´ ndez-Montraveta et al., 2008; Xavier a English WordNet was uploaded directly from et al., 2011) brought with it the need to update the source files while the Spanish (Fern´ ndez- a the MCR and transport all its previous content to Montraveta et al., 2008) and Galician (Xavier a new version using WordNet 3.0 as ILI. Thus, as et al., 2011) wordnets were directly uploaded a first step, we decided to transport Catalan and from their database dumps. Basque wordnets and the ontological knowledge: It is possible to have multiple intersections for Base Concepts, SUMO, WordNet Domains and a source synset. When multiple intersections co- Top Ontology. lapsed into the same target synset, we decided to join the set of variants from the source synsets to 2.1 Upgrading from 1.6 to 3.0 the target synset. This section describes the process carried out for Figure 1 shows an example of this particular adapting the MCR to ILI 3.0. Due to its size and case (the intersections are displayed as dot lines). complexity, all this process have been mainly au- Therefore, the variants of the synsets A, B and C of tomatic. WordNet 1.6 will be placed together in the synset Z of WordNet 3.0. To perform the porting between the wordnets Upgrading Base Concepts: We used Base 1.6 and 3.0 we have followed a similar process to Concepts directly generated for WN 3.06 the one used to port the Spanish and Catalan ver- 6 sions from 1.5 to 1.6 (Atserias et al., 2004a). http://adimen.si.ehu.es/web/BLC
  • 3. are highlighted using dot lines. In this specific example synset soccer player inherits labels play and athlete from its hypernyms player and athlete, respectively. Note that synset hockey player does not inherit any label form its hypernyms because of it owns a domain (hockey). Similarly, synset goalkeeper does not inherit domains coming from the synset soccer player. Finally, synset titu- lar goalkeeper inherits hockey domain (but nei- ther play nor athlete domains). Thus, some of the current content of the MCR Figure 2: Example of a multiple intersection in the will require a future revision. Fortunately, by mapping between two versions of WordNet. cross-checking its ontological knowledge most of these errors can be easily detected. (Izquierdo et al., 2007). 2.2 Web EuroWordNet Interface Upgrading SUMO: SUMO has been directly WEI is a web application that allows consult- ported from version 1.6 using the mapping. Those ing and editing the data contained in the MCR unlabelled synsets have been filled through in- and navigating throw them. Consulting refers to heritance. The ontology of the previous version exploring the content of the MCR by accessing is a modified version of SUMO, trimmed and words, a synsets, a variants or ILIs. The inter- polished, to allow the use of first-order theorem face presents different searching parameters and provers (like E-prover or Vampire) for formal rea- displays the query results. The different searching soning, called AdimenSUMO7 . The next step is to parameters are: update AdimenSUMO using the latest version of SUMO for WordNet 3.0 (available on the website • Item: a value to search for, it can be a Word, of SUMO8 ). a Synset a Variant or an ILI. Upgrading WordNet Domains: As SUMO, what is currently in the MCR has been trans- • Item type: the type of item to search for: ported directly from version 1.6 using the map- Word, a Synset a Variant or an ILI. ping. Again, those unlabelled synsets have been filled through inheritance. In addition, new ver- • PoS: the item’ s gramatical cathegory or Part sions have been generated using graph techniques of Speech: Nouns, Verbs, Adjectives, Ad- (see Section 4 for a detailed description of the pro- verbs. cess). • Search: the type of search and subsearches Upgrading the Top Ontology: Similar to (which are dinamically loaded from the SUMO and WordNet Domains, what is currently database): Synonyms, Hyponyms, etc. available in the MCR has been transported directly from version 1.6 using the mapping. Once more, • WordNet Source: the WordNet from which those unlabelled synsets have been filled through navigate. inheritance. It remains to check the incompatibili- ´ ties between labels following (Alvez et al., 2008). • Navigation WordNet: the WordNet to which An example of how to perform the process of in- navigate. heritance used for SUMO, WordNet Domains and Top Ontology is shown in Figure 2. The example • Gloss: if selected it shows the glosses of the is presented for domains, but it can be applied to Synsets. the other two cases. • Score: if selected, it shows the confidence Figure 2 shows a sample hierarchy where each factor. node represents a synset. The domains are dis- played on the sides. The inherited domain labels • Rels: if selected, it shows information about 7 http://adimen.si.ehu.es/web/adimenSUMO the relations that each Synset has in all the 8 http://www.ontologyportal.org/ target languages.
  • 4. • Full: if selected, makes a recursive search. carried out using Apache, in the Operating Sys- tem. This implies that the access control and user • Target WordNets: the target WordNets of management was done outside WEI. The MCR is our search. being edited in a distributed way. Several research 2.3 Automatic translations groups are editing the MCR in some of the lan- guages. Each group has different users. Thus, the The new version of WEI is able to use Automatic responsibility of managing the users is also dis- Translation Web Services for translating automati- tributed. cally the glosses and examples from other word- nets. This new feature helps users to complete 3 Current state of the MCR and/or improve the gloss or examples of a given WordNet more quicky. Both glosses and exam- In this section provide some information about the ples are taken from the original English WordNet current state of the MCR, including the progress and translated to the target language. Suggestions over the English WordNet. for glosses and/or examples will appear below the Tables 1, 2 and 3 present respectively the cur- existing ones, and may choose the most appro- rent number of synsets and variants, the number of priate. In the current version, the translations of glosses and the number of examples of each word- the glosses and examples are translated only from net per PoS. English (despite the possibility of translating from any available source). 4 A proposal for upgrading WordNet Domains 2.4 Marks for synsets and variants WordNet Domains9 (WND) is a lexical resource In the new version of WEI it is possible to assign a developed at ITC-IRST where synsets have been mark to a variant or synset to indicate special prop- semi-automatically annotated with one or more erties. We can also write a small note or comment domain labels from a set of 165 hierarchically or- to explain better the reason to assign that mark. ganized domains (Magnini and Cavagli` , 2000). a The available marks are the following: WND allow to reduce the polysemy degree of the • Variant marks: words, grouping those senses that belong to the same domain (Magnini et al., 2002). – DUBLEX: For those variant with dubi- But the semi-automatic method used to develop ous lexicalization. this resource was not free of errors and inconsis- – INFL: Indicates that the variant is a in- tencies. By cross-checking the ontological content flected one. of the MCR it is possible to find some of these – RARE: Old fashioned or rarely used problems. For instance, noun synset <diver 1 variant. frogman 1 underwater diver 1> defined as some- – SUBCAT: Subcategorization. one who works underwater has domain history be- – VULG: For those variants that are vul- cause it inherits from its hypernym <explorer 1 gar, rude, or offensive. adventurer 2>. • Synset marks: 4.0.1 Domain inheritance – GENLEX: Non-lexicalized general con- WND was developed using WordNet 1.6. One cepts that are introduced to better orga- consequence of the automatic mapping that we nize the hierarchy. used to upgrade version 1.6 to 3.0 is that many synsets were left unlabeled (because there are new – HYPLEX: Indicates that the hypernym synsets, changes in the structure, etc.). has identical lexicalization. One of the first tasks undertaken has been to fill – SPECLEX: Domain specific terms that these gaps. For them, we has carried out a propa- should be checked. gation of the labels by inheritance for nominal and 2.5 User management verbal synsets. The inherent structure of Word- Net for adjectives and adverbs makes this spread We also included a new user access control to 9 WEI. The previous user access control to WEI was http://wndomains.fbk.eu/
  • 5. WordNet Nouns Verbs Adjectives Adverbs Synsets WN % EngWN3.0 147,360 25,051 30,004 5,580 118,431 100% SpaWN3.0 40,009 11,107 7,005 1,106 59,227 50% CatWN3.0 51,598 11,577 7,679 2 46,027 39% EusWN3.0 41,071 9,472 148 0 30,615 26% GalWN3.0 9,114 1,413 4,866 0 9,320 8% Table 1: Current number of synsets and variants of each WN. WordNet Nouns Verbs Adjectives Adverbs Synsets WN % EngWN3.0 82,379 13,767 18,156 3,621 117,923 100% SpaWN3.0 13,014 3,469 1,965 687 19,135 16% CatWN3.0 6,289 44 840 1 7,174 6% EusWN3.0 2,854 78 0 0 2,932 2% GalWN3.0 4,997 2 3,111 0 8,111 7% Table 2: Current number of glosses of each WN. not trivial. Therefore, this simple process has been domain labels through WordNet. Additionaly, the carried out only for nouns and verbs. method can also be used to detect anomalies in the We have worked exclusively on those synsets original WND labels. that had no labels at all. We inherited the labels from its hypernyms. If a synset has more than one 4.0.2 A new graph based method hypernym, the domain labels are taken from all of UKB10 algorithm (Agirre and Soroa, 2009) ap- them. We used a small list of incompatible labels plies personalized PageRank on a graph derived to detect incompatibilities. Therefore, the same from a wordnet. This algorithm has proven to be synset can not be both factotum and biology, or very competitive on Word Sense Disambiguation animals and plants. tasks and it is easily portable to other languages This process increased our domain information that have a wordnet (Agirre et al., 2010). Now, by nearly a 18-19%, as shown in Tables 4 and 5: we present a novel use of the UKB algorithm for propagating information through a wordnet struc- PoS Before After Increase ture. Nouns 66,595 83,286 +25% Given an input context, ’ukb ppv’ (Personal- Verbs 12,219 14,224 +16% ized PageRank Vector) algorithm outputs a rank- All 100,315 119,011 +19% ing vector over the nodes of a graph, after apply- ing a Personalized PageRank over it. We just need Table 4: Number of synsets with domain labels. to use a wordnet as a knowledge base and pass to the application the contexts we want to process, performing a kind of spreading activation through PoS Before After Increase the WordNet structure. Nouns 87,938 108,665 +24% As context we use those synsets labelled with Verbs 13,026 15,051 +16% a particular domain. Thus, for each of the 16911 All 124,551 146,899 +18% domain labels included in the MCR we generate a context. Each file contains the list of offsets cor- Table 5: Total number of domain labels. responding to those synsets with a particular do- main label. After creating the context file, we just However, this process may also have propa- need to execute ’ukb ppv’ that returns a ranking of gated innapropriate domain labels to unlabeled weights for each WordNet synset with respect to synsets. It remains for future research an accurate that particular domain. evaluation of this new resource. In the next section we present some examples 10 http://ixa2.si.ehu.es/ukb/ 11 using a new graph-based method for propagating Excluding factotum labels.
  • 6. WordNet Nouns Verbs Adjectives Adverbs Synsets WN % EngWN3.0 10,433 11,583 15,615 3,674 41,305 100% SpaWN3.0 478 27 195 967 700 2% CatWN3.0 2,103 46 368 0 2,517 6% EusWN3.0 2,377 0 0 0 2,377 6% GalWN3.0 270 2 4,291 0 4,563 11% Table 3: Current number of examples of each WN. Once made the process for all domains we have synset <pornography 1 porno 1 porn 1 erotica 1 weights for each synset and for each of the do- smut 5> defined as creative activity (writing or mains. Therefore, we know which are the highest pictures or films etc.) of no literary or artistic weights for each domain and the highest weights value other than to stimulate sexual desire and la- for each synset. This allows us to estimate which belled with the domain law. The suggestions of synsets are more representative of each domain the algorithm seems to improve the current label- and which domains are best for each synset. ing because it suggests sexuality (possibly the best Basically, what we do is to mark some synsets one) and cinema (possibly, the second best op- with a domain (using the labels we already know tion). Moreover, the wrong label dissapears. from the original porting process) and use the wordnet graph to propagate the new labelling. We Method 3 work on the assumption that a synset directly re- Weight Domain lated to several synsets labelled with a particular 0.000123453: sexuality domain (i.e biology) would itself possibly be also 0.000112444: cinema related somehow to that domain (i.e. biology). 0.000077780: theatre Therefore, it makes no sense to use the domain 0.000075525: painting factotum for this technique. 0.000062377: telecommunication Table 6 shows the first ten domains and weights 0.000060640: publishing resulting from the application of this method on 0.000050370: psychological features synset <diver 1 frogman 1 underwater diver 1>. 0.000047003: photography The suggestions of the algorithm seems to im- 0.000046853: artisanship prove the current labeling because it suggests sub 0.000040458: graphic arts (possibly the best one) and diving (possibly, the second best option). Moreover, the method sug- Table 7: PPV weight rankings for sense porno1 . n gests the wrong label with a much lower weight. Weight Domain 5 Concluding Remarks and Future 0.0144335: sub Directions 0.0015939: diving As a result of this work, the current version of 0.0001725: swimming the MCR consistently maintains new wordnet ver- 0.0001297: history sions for five languages (English, Spanish, Cata- 0.0000557: nautical lan, Basque and Galician), and the ontological 0.0000529: fashion knowledge from WordNet Domains, Top Ontol- 0.0000412: jewellery ogy and SUMO. 0.0000315: ethnology In particular, the main contributions of our work 0.0000274: archaeology can be summarized as follows: 0.0000204: gas We have created a new version of the MCR us- ing WordNet 3.0 as ILI. Table 6: PPV weight rankings for sense diver1 . n We have improved the existing Web EuroWord- Net Interface (WEI) (both consult and edit inter- Table 7 shows the first ten domains and weights faces) to work with the new version of the MCR. resulting from the application of this method on Now, the interface includes automatic translation
  • 7. facilities, making it easier and faster the develop- sible thanks to the support of the FP7 PATHS ment of the resources integrated into the MCR. project (ICT-2010-270082) and the Spanish na- We also added new editing facilities for recording tional projects KNOW (TIN2006-15049-C03) and new linguistic information associated to the vari- KNOW2 (TIN2009-14715-C04-04). ants and synsets. We have uploaded into the new version of the MCR the English WordNet 3.0, the new Spanish References WordNet 3.0 (Fern´ ndez-Montraveta et al., 2008) a Agirre, E., Cuadros, M., Rigau, G., and Soroa, A. and a new Galician WordNet 3.0. (2010). Exploring Knowledge Bases for Similar- ity. In Proceedings of the Seventh conference on We have used a complete mapping from Word- International Language Resources and Evaluation Net 1.6 to WordNet 3.0 (covering not only nouns, (LREC10). European Language Resources Associ- but verbs, adjectives and adverbs) to transport the ation (ELRA). ISBN: 2-9517408-6-7. Pages 373– Basque and Catalan wordnets and the ontological 377.”. knowledge from the existing version of the MCR Agirre, E. and Martinez, D. (2001). Knowledge (using WordNet 1.6 as ILI) to the new MCR ver- sources for Word Sense Disambiguation. In Pro- sion (using WordNet 3.0 as ILI). ceedings of the Fourth International Conference We have applied a very simple estrategy to com- TSD 2001, Plzen (Pilsen), Czech Republic. Pub- lished in the Springer Verlag Lecture Notes in Com- plete the ontological information by exploiting ba- puter Science series. V´ clav Matousek, Pavel Maut- a sic inheritance mechanisms. This process has been ner, Roman Moucek, Karel Tauser (eds.) Copyright applied to WordNet Domains, Top Ontology and Springer-Verlag. ISBN: 3-540-42557-8. ”. SUMO. Agirre, E., Rigau, G., Castell´ n, I., Alonso, L., o We have also investigated a new approach Padr´ , L., Cuadros, M., Climent, S., and Coll-Florit, o for consistently propagating domain formation M. (2009). KNOW: Developing large-scale mul- through the WordNet structure by exploiting a tilingual technologies for language understanding. Procesamiento del Lenguaje Natural, (43):377–378. well-known graph algorithm using UKB. Al- though an exhaustive empirical evaluation should Agirre, E. and Soroa, A. (2009). Personalizing PageR- be addressed in a near future, a preliminary re- ank for Word Sense Disambiguation. In Proceed- ings of the 12th conference of the European chap- view of the new resources created using this pro- ter of the Association for Computational Linguistics cess presents very interesting insights for future (EACL-2009), Athens, Greece. research. Obviously, further investigation is needed to as- Alfonseca, E. and Manandhar, S. (2002). Distinguish- ing concepts and instances in WordNet. In Proceed- sess the quality of the new labelling of WordNet ings of the first International Conference of Global Domains. We plan to evaluate the quality of these WordNet Association, Mysore, India. new resources indirectly by comparing their per- ´ Alvez, J., Atserias, J., Carrera, J., Climent, S., Laparra, formance on a common Word Sense Disambigua- E., Oliver, A., and Rigau, G. (2008). Complete and tion task. We would also like to continue studing Consistent Annotation of WordNet using the Top different ways for selecting the most appropriate Concept Ontology. In Proceedings of the the 6th set of domain labels per synset. We also plan to Conference on Language Resources and Evaluation (LREC 2008), Marrakech (Morocco). derive domain information from Wikipedia by ex- ploiting WordNet++ (Navigli and Ponzetto, 2010). Atserias, J., Climent, S., Farreres, J., Rigau, G., and The whole content of the MCR and the new Rodr´guez, H. (1997). Combining Multiple Meth- ı WEI is freely available 12 . ods for the Automatic Construction of Multilingual WordNets. In Proceedings of International Confer- Moreover, the maintenance of this type of ence on Recent Advances in Natural Language Pro- resources is continuous, and all the integrated cessing (RANLP’97), Tzigov Chark, Bulgaria. knowledge should be constantly updated and re- Atserias, J., Rigau, G., and Villarejo, L. (2004a). vised. Spanish WordNet 1.6: Porting the Spanish Wordnet across Princeton versions. In LREC’04. Acknowledgments Atserias, J., Villarejo, L., Rigau, G., Agirre, E., Car- We thank the IXA NLP group from the Basque roll, J., Vossen, P., and Magnini, B. (2004b). The Country University. This work was been pos- MEANING Multilingual Central Respository. In Proceedings of the Second International Global 12 http://adimen.si.ehu.es/web/MCR WordNet Conference (GWC’04).
  • 8. Bentivogli, L., Pianta, E., and Girardi, C. (2002). Navigli, R. and Ponzetto, S. P. (2010). Building a very MultiWordNet: developing an aligned multilingual large multilingual semantic network. In Proceed- database. In Proceedings of the First International ings of the 48th Annual Meeting of the Association Conference on Global WordNet, Mysore, India. for Computational Linguistics, pages 216–225, Up- psala, Sweden. Ben´tez, L., Cervell, S., Escudero, G., L´ pez, M., ı o Rigau, G., and Taul´ , M. (1998). Methods and Tools e Pease, A., Niles, I., and Li, J. (2002). The Suggested for Building the Catalan WordNet. In Proceedings Upper Merged Ontology: A Large Ontology for the of ELRA Workshop on Language Resources for Eu- Semantic Web and its Applications. In AAAI-2002. ropean Minority Languages, Granada, Spain. Pociello, E., Gurrutxaga, A., Agirre, E., Aldezabal, I., and Rigau, G. (2008). WNTERM: Combining Daud´ , J., Padr´ , L., and Rigau, G. (2000). Mapping e o the Basque WordNet and a Terminological Dictio- WordNets Using Structural Information. In 38th An- nary. In 6th international conference on Language nual Meeting of the Association for Computational Resources and Evaluation, LREC’08. Linguistics (ACL’2000)., Hong Kong. Rigau, G., Magnini, B., Agirre, E., Vossen, P., and Car- Daud´ , J., Padr´ , L., and Rigau, G. (2001). A Com- e o roll, J. (2002). MEANING: A Roadmap to Knowl- plete WN1.5 to WN1.6 Mapping. In Proceedings of edge Technologies. In Proceedings of COLING NAACL Workshop ’WordNet and Other Lexical Re- Workshop A Roadmap for Computational Linguis- sources: Applications, Extensions and Customiza- tics, Taipei, Taiwan. tions’ (NAACL’2001)., Pittsburg, PA, USA. Tufis, D., Cristea, D., and Stamou, S. (2004). Balka- Daud´ , J., Padr´ , L., and Rigau, G. (2003). Mak- e o Net: Aims, Methods, Results and Perspectives. A ing Wordnet Mappings Robust. In Proceedings of General Overview. Romanian Journal on Science the 19th Congreso de la Sociedad Espa˜ ola para el n Technology of Information. Special Issue on Balka- Procesamiento del Lenguage Natural, SEPLN, Uni- net, 7(3–4):9–44. versidad Universidad de Alcal´ de Henares. Madrid, a Spain. Vossen, P. (1998). EuroWordNet: A Multilingual Database with Lexical Semantic Networks. Kluwer Fellbaum, C. (1998). WordNet. An Electronic Lexical Academic Publishers. Database. Language, Speech, and Communication. Xavier, G. G., Clemente, X. M. G., Pereira, A. G., The MIT Press. and Lorenzo, V. T. (2011). Galnet: WordNet 3.0 do galego. Linguam´ tica, 3(1):61–67. a Fern´ ndez-Montraveta, A., V´ zquez, G., and Fell- a a baum, C. (2008). Text Resources and Lexical Knowledge, volume 33 of Text, Translation, Compu- tational Processing, chapter The Spanish Version of WordNet 3.0, pages 175–182. Mouton de Gruyter. Izquierdo, R., Su´ rez, A., and Rigau, G. (2007). Ex- a ploring the Automatic Selection of Basic Level Con- cepts. In Proceedings of RANLP. K., R., S., T., T., C., V., S., and H., I. (2010). WNMS: Connecting the Distributed WordNet in the Case of Asian WordNet. In Proceedings of the 5th Global WordNet Conference (GWC’10), Mumbai (India). Magnini, B. and Cavagli` , G. (2000). Integrating sub- a ject field codes into WordNet. In Proceedings of the Second Internatgional Conference on Language Re- sources and Evaluation (LREC), Athens. Greece. Magnini, B., Satrapparava, C., Pezzulo, G., and Gliozzo, A. (2002). The Role of Domains Infor- mations. In In Word Sense Disambiguation, Treto, Cambridge. Mihalcea, R. and Moldovan, D. (2001). eXtended WordNet: Progress Report. In Proceedings of NAACL Workshop WordNet and Other Lexical Re- sources: Applications, Extensions and Customiza- tions, pages 95–100, Pittsburg, PA, USA.