EMLex-A5: Specialized Dictionaries


Half course on Specialized Dictionaries from the European Masters in Lexicography. The other half, with Prof. Ulrich Heid, is not available at the moment.

    EMLex-A5: Specialized Dictionaries - Presentation Transcript

    • A5 - Specialized Dictionaries. Alberto Simões, ambs@ilch.uminho.pt. EMLex 2012/2013, Erlangen.
    • Part I: Terminology vs Lexicography
    • Overview
        1 Term Orientation vs Concept Orientation
        2 Classification Systems: What are Classification Systems?; Folksonomies; Taxonomies; Thesauri; Ontologies
        3 Further Reading
    • Term vs Concept Orientation
        • Most dictionaries are organized by terms:
            • users look up entries by the word;
            • entries describe all possible senses;
            • the same explanation can appear for different words (synonyms).
        • Most terminologies are organized by concepts:
            • users look up entries by an instance word;
            • but each concept exists as a single block;
            • each concept is represented only once;
            • all synonyms (and antonyms) are presented together.
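A minimal sketch, not taken from the slides, of how the two orientations might look as data structures. The entries, the concept identifier `C001` and the helper `build_term_index` are all hypothetical and only illustrate the contrast described above.

```python
from collections import defaultdict

# Term-oriented (dictionary): one entry per headword, so the same
# explanation is repeated under each synonym.
term_oriented = {
    "car": ["a road vehicle with an engine and four wheels"],
    "automobile": ["a road vehicle with an engine and four wheels"],
}

# Concept-oriented (terminology): one record per concept, stored once,
# with all of its terms kept together.
concept_oriented = {
    "C001": {  # hypothetical concept identifier
        "definition": "a road vehicle with an engine and four wheels",
        "terms": ["car", "automobile"],
    },
}


def build_term_index(concepts):
    """Map every term to the identifiers of the concepts it names."""
    index = defaultdict(list)
    for concept_id, record in concepts.items():
        for term in record["terms"]:
            index[term].append(concept_id)
    return dict(index)


# Users still look up a word, but both synonyms lead to the same block.
term_index = build_term_index(concept_oriented)
```

In the concept-oriented version the definition is written once, and the lookup by word is just an index built on top of the concept records.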
    • Term vs Concept Orientation
        • Term Orientation: dictionary definition from Dictionary.com (May 3rd, 2013).
    • Term vs Concept Orientation
        • Concept Orientation: terminology entry from DeCS - Health Sciences Descriptors (May 3rd, 2013).
    • Classification Systems
        • Humans tend to organize (“disorganization is a kind of organization”);
        • This organization is usually done by classification;
        • Classification can be as simple as tagging an object (“this is the pile of important documents, that of the unimportant ones”);
        • Classification is used everywhere!
    • Where are classification systems used?
        • Internet Social Networks (tagging);
        • Libraries (e.g. Universal Decimal Classification);
        • Medicine (e.g. Unified Medical Language System);
        • Chemistry (e.g. Periodic Table);
        • Geography (e.g. Geographic Taxonomy);
        • Biology (e.g. Linnaean taxonomy, protein classification, ...).
    • Classification Systems: Classes
        • Classification systems can also be classified;
        • One way to classify classification systems is by their ability to include properties and relations between the classified objects;
        • We will discuss four types of classification systems: Folksonomies, Taxonomies, Thesauri and Ontologies.
    • Folksonomies
    • Folksonomies
        “A folksonomy is a system of classification derived from the practice and method of collaboratively creating and managing tags to annotate and categorize content; this practice is also known as collaborative tagging, social classification, social indexing, and social tagging. Folksonomy, a term coined by Thomas Vander Wal, is a portmanteau of folk and taxonomy.”
        Source: Folksonomy (Wikipedia, 2012)
    • Folksonomies: How they work
        • Other classification techniques often put someone or some group in charge of creating the classification system structure (authority);
        • This group of people sees the world from a specific point of view, which may or may not be shared by others;
        • Folksonomies solve this problem: power to the people;
        • Instead of partitioning the world according to one particular view, they let users present facets of objects;
        • Users assign keywords (or tags, or labels) to objects (individuals);
        • These keywords can be searched and indexed, and mathematical models can be applied to this data.
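The tagging workflow can be sketched in a few lines. This is an illustrative toy, not a description of any real tagging service: the class name `Folksonomy`, its methods, and the users, photos and tags in the example are all assumptions.

```python
from collections import Counter, defaultdict


class Folksonomy:
    """Toy collaborative-tagging store: users attach free keywords to objects."""

    def __init__(self):
        self.assignments = set()         # (user, object, tag) triples
        self.by_tag = defaultdict(set)   # tag -> objects carrying it
        self.tag_counts = Counter()      # tag -> number of assignments

    def tag(self, user, obj, *tags):
        """Record that `user` labelled `obj` with each of `tags`."""
        for raw in tags:
            label = raw.strip().lower()  # light normalisation only
            if (user, obj, label) in self.assignments:
                continue                 # same user, same tag: count once
            self.assignments.add((user, obj, label))
            self.by_tag[label].add(obj)
            self.tag_counts[label] += 1

    def search(self, label):
        """Return the objects carrying a tag, in alphabetical order."""
        return sorted(self.by_tag.get(label.strip().lower(), set()))


# Hypothetical users and objects.
folk = Folksonomy()
folk.tag("ana", "photo1", "Lisbon", "river")
folk.tag("rui", "photo1", "lisbon")
folk.tag("rui", "photo2", "river")
```

There is no authority deciding which tags exist: the vocabulary is whatever users type, and the counts in `tag_counts` are the raw material for the statistical models mentioned above.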
    • Folksonomies
        “An empirical analysis of the complex dynamics of tagging systems, published in 2007, has shown that consensus around stable distributions and shared vocabularies does emerge, even in the absence of a central controlled vocabulary. For content to be searchable, it should be categorized and grouped. While this was believed to require commonly agreed on sets of content describing tags (much like keywords of a journal article), recent research has found that, in large folksonomies, common structures also emerge on the level of categorizations. Accordingly, it is possible to devise mathematical models that allow for translating from personal tag vocabularies (personomies) to the vocabulary shared by most users.”
        Source: Folksonomy (Wikipedia, 2012)
    • Folksonomies: example
        Top categories in the Portuguese Wikipedia (single words):
        375 Sociologia; 383 Ponerinae; 395 Afro-brasileiros; 404 Drilliidae; 413 Filosofia;
        415 Coleophoridae; 424 Psicologia; 428 Terebridae; 445 Clathurellinae; 445 Digimons;
        445 Teuto-brasileiros; 451 Apiaceae; 483 Asteroides; 486 Luso-brasileiros; 492 Acaena;
        526 Rubiaceae; 537 Dolichoderinae; 730 Agonoxenidae; 735 Acalypha; 753 Mangeliinae;
        762 Crambidae; 787 Poaceae; 808 Coletâneas; 824 Theraphosidae; 854 Myrmicinae;
        962 Fabaceae; 974 Formicidae; 1065 Agrostis; 1096 Formicinae; 1177 Aloe;
        1328 Conus; 1338 Ítalo-brasileiros; 1395 Asteraceae; 1433 Coleophora; 1514 Arctiidae;
        1516 Alchemilla; 1689 Turridae; 1879 Camponotus; 2163 Acer; 2744 Acacia
    • Folksonomies: Pros and Cons
        Pros:
        • doesn’t require expert cataloguers, authoritative sources or expert users;
        • capability of matching users’ real needs and language (inclusive: includes everyone’s words and vocabulary);
        • controlled vocabularies are not practically and economically extensible, while folksonomies are;
        • a low-investment bridge between personal classification and shared classification;
        • easy to use and quick to classify big quantities of individuals;
        • not all the limitations of folksonomies are defects :-)
    • Folksonomies: Pros and Cons
        Cons:
        • by itself, the vocabulary is flat (there is no structure, just terms);
        • not usable for small collections or those with few users (statistical methods depend on population size);
        • without some technological help, vocabularies become inexact or ambiguous;
        • they have a very low findability quotient: they are great for serendipity and browsing, but not aimed at a targeted approach or search.
    • Taxonomies
    • Taxonomies
        “Taxonomy is the science of identifying and naming species, and arranging them into a classification. The field of taxonomy, sometimes referred to as “biological taxonomy”, revolves around the description and use of taxonomic units, known as taxa. A resulting taxonomy is a particular classification, arranged in a hierarchical structure or classification scheme.”
        Source: Taxonomy (Wikipedia, 2012)
    • Taxonomies
        taxonomy [tækˈsɒnəmɪ] n. (Life Sciences & Allied Applications / Biology)
        1. the branch of biology concerned with the classification of organisms into groups based on similarities of structure, origin, etc.
        2. the practice of arranging organisms in this way.
        3. the science or practice of classification.
        [from French taxonomie, from Greek taxis “order” + -nomy]
        Source: Collins English Dictionary - Complete and Unabridged © HarperCollins Publishers 1991, 1994, 1998, 2000, 2003
    • Taxonomies: How they work?
        • Used to partition the world into disjoint classes or groups;
        • Each group is, again, partitioned into sub-classes or sub-groups;
        • And sub-classes are partitioned, and...
        • Individuals are classified in one leaf category (a classification is a path in the tree).
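To make "a classification is a path in the tree" concrete, here is a small sketch. The class names in `PARENT` are invented for illustration and do not come from any real taxonomy; the single-parent table mirrors the disjoint partitioning described above.

```python
# child -> parent; every class has exactly one parent, so the structure
# is a tree. The class names are hypothetical.
PARENT = {
    "Animals": None,
    "Mammals": "Animals",
    "Birds": "Animals",
    "Felines": "Mammals",
    "Canines": "Mammals",
}


def classification_path(leaf, parent=PARENT):
    """Return the path from the root of the tree down to `leaf`."""
    path = []
    node = leaf
    while node is not None:
        path.append(node)
        node = parent[node]   # KeyError if the class is unknown
    return list(reversed(path))


path = classification_path("Felines")
```

Because each class has a single parent, the path is unique: classifying an individual under "Felines" also places it under "Mammals" and "Animals".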
    • Taxonomies: The typical example
    • Taxonomies: examples used every day
        Main index (top level) of the Universal Decimal Classification:
        0 Generalities (now Science and knowledge. Organization. Computer Science. Information. Documentation. Librarianship. Institutions. Publications)
        1 Philosophy. Psychology
        2 Religion. Theology
        3 Social Sciences
        4 Vacant
        5 Mathematics and natural sciences
        6 Applied sciences. Medicine. Technology
        7 The arts. Recreation. Entertainment. Sport
        8 Language. Linguistics. Literature
        9 Geography. Biography. History
    • Taxonomies: examples used every day
        8 Language. Linguistics. Literature
            80 General questions [...] linguistics and literature. Philology
            81 Linguistics and languages
                81-11 Schools and trends in linguistics
                81-13 Methodology of linguistics. Methods and means
                811 Languages
                    811.1/.2 Indo-European Languages
                    811.3 Dead languages of unknown affiliation. Caucasian languages
                    811.4 Afro-Asiatic, Nilo-Saharan, Congo-Kordofanian, Khoisan languages
                    811.5 Ural-Altaic, Palaeo-Siberian, Eskimo-Aleut, Dravidian and Sino-Tibetan languages. Japanese. Korean...
                    811.6 Austro-Asiatic languages. Austronesian languages
                    811.7 Indo-Pacific (non-Austronesian) languages. Australian languages
                    811.8 American indigenous languages
                    811.9 Artificial languages
            82 Literature
    • Taxonomies: Class Task
        Book: Prolog Programming for Artificial Intelligence, Prof. Ivan Bratko
        0 Science and knowledge. Organization. Computer Science. Information...
        1 Philosophy. Psychology
        2 Religion. Theology
        3 Social Sciences
        5 Mathematics and natural sciences
        6 Applied sciences. Medicine. Technology
        7 The arts. Recreation. Entertainment. Sport
        8 Language. Linguistics. Literature
        9 Geography. Biography. History
    • Taxonomies: Class Task
        Book: Prolog Programming for Artificial Intelligence, Prof. Ivan Bratko
        University of Minho Library:
        5 Mathematics, Natural Sciences
            51 Mathematics
                519 (no name, virtual class)
                    519.6 Computational mathematics. Numerical Analysis
    • Taxonomies: Class Task
        Book: Prolog Programming for Artificial Intelligence, Prof. Ivan Bratko
        Aveiro University Library:
        0 Science and knowledge. Organization. Computer Science. Information...
            00 Prolegomena. Fundamentals of knowledge and culture. Propaedeutics
                004 Computer science and technology. Computing. Data processing
                    004.4 Software
                        004.42 Computer programming. Computer programs
    • Taxonomies: Class Task
        Book: Prolog Programming for Artificial Intelligence, Prof. Ivan Bratko
        Porto Polytechnic Institute Library:
        0 Science and knowledge. Organization. Computer Science. Information...
            00 Prolegomena. Fundamentals of knowledge and culture. Propaedeutics
                004 Computer science and technology. Computing. Data processing
                    004.4 Software
                        004.43 Computer Languages
    • Taxonomies: Class Task
        Book: Prolog Programming for Artificial Intelligence, Prof. Ivan Bratko
        Algarve’s University Library:
        0 Science and knowledge. Organization. Computer Science. Information...
            00 Prolegomena. Fundamentals of knowledge and culture. Propaedeutics
                004 Computer science and technology. Computing. Data processing
                    004.8 Artificial intelligence
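The class task shows that a UDC number encodes its own path: each extra digit is one level down the tree. The sketch below reproduces the chains shown on the slides. It is an assumption-laden simplification: the function names are invented, and it only handles plain numeric notations such as 004.42, not auxiliaries like 81-11 or ranges like 811.1/.2.

```python
def udc_ancestors(notation):
    """Return the chain of classes from the top level down to `notation`.

    Assumption: `notation` is a simple main number (digits with a point
    after every third digit), e.g. "004.42" or "519.6".
    """
    digits = notation.replace(".", "")
    if not digits.isdigit():
        raise ValueError(f"unsupported notation: {notation!r}")
    chain = []
    for length in range(1, len(digits) + 1):
        prefix = digits[:length]
        # Re-insert the point after every third digit.
        groups = [prefix[i:i + 3] for i in range(0, len(prefix), 3)]
        chain.append(".".join(groups))
    return chain


def deepest_common_class(first, second):
    """Return the most specific class shared by two notations, or None."""
    common = None
    for a, b in zip(udc_ancestors(first), udc_ancestors(second)):
        if a != b:
            break
        common = a
    return common
```

With these helpers, the Aveiro (004.42) and Algarve (004.8) classifications of the same book meet at 004, while the Minho classification (519.6) shares no class with them at all, which is the disagreement the task is meant to expose.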
    • Taxonomies: Pros and Cons
        Pros:
        • rigid tree, which makes it easy to process;
        • suitable for some areas (like life classification);
        • the hierarchy helps searching for terms (abstraction).
        Cons:
        • rigid tree, which makes it difficult to classify (different people classify objects differently);
        • the structure is defined by some authority group (for example, the UDC Consortium);
        • forces the subdivision of the world (categories are single-parental);
        • as a workaround, people classify in more than one category (so the rigid-tree Pro becomes a Con).
    • Thesauri
    • Thesauri
        “A thesaurus is a reference work that lists words grouped together according to similarity of meaning (containing synonyms and sometimes antonyms), in contrast to a dictionary, which contains definitions and pronunciations.”
        Source: Thesauri (Wikipedia, 2012)
    • Thesauri
        “In Information Science, Library Science, and Information Technology, specialized thesauri are designed for information retrieval. They are a type of controlled vocabulary, for indexing or tagging purposes. Such a thesaurus can be used as the basis of an index for online material. [...] Unlike a literary thesaurus, these specialized thesauri typically focus on one discipline, subject or field of study.”
        Source: Thesauri (Wikipedia, 2012)
    • Thesauri: How they work!
        “Thesauri for information retrieval are typically constructed by information specialists, and have their own unique vocabulary defining different kinds of terms and relationships. Terms are the basic semantic units for conveying concepts. They are usually single-word nouns, since nouns are the most concrete part of speech. [...] When a term is ambiguous, a “scope note” can be added to ensure consistency, and give direction on how to interpret the term. “Term relationships” are links between terms. These relationships can be divided into three types: hierarchical, equivalency or associative.”
        Source: Thesauri (Wikipedia, 2012)
    • Thesauri: How they work!
        “Hierarchical relationships are used to indicate terms which are narrower and broader in scope. A “Broader Term” (BT) or hyperonym is a more general term. Reciprocally, a “Narrower Term” (NT) or hyponym is a more specific term. BT and NT are reciprocals; a broader term necessarily implies at least one other term which is narrower. BT and NT are used to indicate class relationships, as well as part-whole relationships (meronyms and holonyms).”
        Source: Thesauri (Wikipedia, 2012)
    • Thesauri: How they work!
        Example of a thesaurus with hierarchical relations.
        Feline
            NT Cat
            NT Panther
        Cat
            BT Feline
        Panther
            BT Feline
            NT Pink Panther
        Pink Panther
            BT Panther
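Since BT and NT are reciprocals, only one direction needs to be stored; the other can be derived. The sketch below does this for the Feline example. The function names are assumptions, and for brevity each term has a single broader term, although a thesaurus is a graph and may allow several.

```python
from collections import defaultdict

# Stored direction: term -> its Broader Term (BT), from the Feline example.
BROADER = {
    "Cat": "Feline",
    "Panther": "Feline",
    "Pink Panther": "Panther",
}


def narrower_terms(broader):
    """Derive the reciprocal Narrower Term (NT) lists from the BT links."""
    narrower = defaultdict(list)
    for term, bt in broader.items():
        narrower[bt].append(term)
    return {term: sorted(nts) for term, nts in narrower.items()}


def all_broader(term, broader=BROADER):
    """Follow the BT links upwards, from most specific to most general."""
    chain = []
    while term in broader:
        term = broader[term]
        chain.append(term)
    return chain


NARROWER = narrower_terms(BROADER)
```

Deriving NT from BT guarantees the reciprocity the standard asks for: no term can be listed as broader without the matching narrower entry existing.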
    • Thesauri: How they work!
        “The equivalency relationship is used primarily to connect synonyms and near-synonyms. “Use” (USE) and “Used For” (UF) indicators are used when an authorized term is to be used for another, unauthorized, term. Unauthorized terms are often called “entry vocabulary”, “entry points”, “lead-in terms”, or “non-preferred terms”, pointing to the authorized term (also referred to as the “preferred term” or “descriptor”) that has been chosen to stand for the concept.”
        Source: Thesauri (Wikipedia, 2012)
    • Thesauri: How they work!
        Example of a thesaurus with equivalency relations.
        Parliament
            USE European Parliament
        Parliament of Europe
            USE European Parliament
        European Parliament
            UF Parliament
            UF Parliament of Europe
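The USE/UF pair works like a redirect table: non-preferred terms point to the descriptor, and the UF list is the same table read backwards. A minimal sketch using the European Parliament example (the function names `preferred` and `used_for` are assumptions):

```python
# Non-preferred term -> preferred term (USE), from the example above.
USE = {
    "Parliament": "European Parliament",
    "Parliament of Europe": "European Parliament",
}


def preferred(term, use=USE):
    """Resolve an entry term to its descriptor; descriptors map to themselves."""
    return use.get(term, term)


def used_for(use=USE):
    """Derive the reciprocal Used For (UF) lists from the USE links."""
    uf = {}
    for entry_term, descriptor in use.items():
        uf.setdefault(descriptor, []).append(entry_term)
    return uf


UF = used_for()
```

An indexer would call `preferred` on whatever the user typed before searching, so that documents tagged with the descriptor are found regardless of which synonym was used.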
    • Thesauri: How they work!
        “Associative relationships are used to connect two related terms whose relationship is neither hierarchical nor equivalent. This relationship is described by the indicator “Related Term” (RT). Associative relationships should be applied with caution, since excessive use of RTs will reduce specificity in searches. Consider the following: if the typical user is searching with term “A”, would they also want resources tagged with term “B”? If the answer is no, then an associative relationship should not be established.”
        Source: Thesauri (Wikipedia, 2012)
    • Thesauri: How they work!
        Example of a thesaurus with associative relations.
        Douro
            BT River
            RT Porto
            RT Gaia
        River
            NT Douro
        Gaia
            BT Portugal
        Porto
            BT Portugal
        Portugal
            NT Porto
            NT Gaia
        City
            NT Gaia
            NT Porto
        Note: RT is not symmetrical: a RT b ⇏ b RT a.
    • Thesauri: a simple example
        [Figure: extract of Food Safety relationships in AGROVOC, linking the terms Quality, Asia, China, Food Safety, Contamination, Food and Food Contamination through BT, NT, RT and USE relations.]
    • Thesauri: Pros and Cons
        Pros:
        • More flexible than Taxonomies (they do not require a tree and work as a graph);
        • Have other types of relationships than simple hierarchy (like the associative relation);
        • There is an ISO standard that documents their correct use;
        • The standard defines mathematical properties for relationships.
        Cons:
        • Standardized types of relationships are somewhat limited (same relation for hyperonyms and meronyms; the non-hierarchical relation is too vague: related);
        • No support for relationships with non-terms (features).
    • Ontologies
    • Ontologies
        “Ontology is the philosophical study of the nature of being, existence, or reality as such, as well as the basic categories of being and their relations. Traditionally listed as a part of the major branch of philosophy known as metaphysics, ontology deals with questions concerning what entities exist or can be said to exist, and how such entities can be grouped, related within a hierarchy, and subdivided according to similarities and differences.”
        Source: Ontology (Wikipedia, 2012)
    • Ontologies
        “In computer science and information science, an ontology formally represents knowledge as a set of concepts within a domain, and the relationships between those concepts. It can be used to reason about the entities within that domain and may be used to describe the domain.”
        Source: Ontology: information science (Wikipedia, 2012)
    • Ontologies
        “Contemporary ontologies share many structural similarities, regardless of the language in which they are expressed. Most ontologies describe individuals (instances), classes (concepts), attributes, and relations.”
        Source: Ontology: information science (Wikipedia, 2012)
    • Ontologies
        “Individuals are the instances or objects (the basic or “ground level” objects).”
        Source: Ontology: information science (Wikipedia, 2012)
        Unlike any of the other classification systems, Ontologies clearly include the individuals (or objects being classified) in the structure.
    • Ontologies
        “Classes are sets, collections, concepts, [...] or kinds of things.”
        Source: Ontology: information science (Wikipedia, 2012)
        Classes are the concepts used in Thesauri and Taxonomies. They can be super-classes, including sub-classes, or can just include individuals (low-level classes, or leaves if we were talking about taxonomies).
    • Ontologies
        “Attributes are aspects, properties, features, characteristics, or parameters that objects (and classes) can have.”
        Source: Ontology: information science (Wikipedia, 2012)
        Attributes are properties of individuals or classes. If the individual is a book in a library, a property can be the number of pages, the title, the author. For a class, like “mammal”, an attribute can be a reference to its fur. Attributes are usually specified as a pair, the name of the attribute and its value.
    • Ontologies
        “Relations are ways in which classes and individuals can be related to one another.”
        Source: Ontology: information science (Wikipedia, 2012)
        Relations are similar to the relations used in Thesauri, but unlike them, there isn’t a list of valid relations. They can be the common hierarchical relations, or the relation “eat” relating animals with the animals they eat.
    • Ontologies
        “Function terms: complex structures formed from certain relations that can be used in place of an individual term in a statement.”
        Source: Ontology: information science (Wikipedia, 2012)
        Suppose you are adding Portuguese rivers to an Ontology. One can define a simple macro to add some default relations to the river:
        River(name) ≅ { Term → name; Is a → river; Is at → Portugal }
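One possible reading of the River macro, as a sketch: a function that expands a name into the default relations. The dictionary keys simply transliterate the relation names used on the slide; the function name `river` and the extra river names are illustrative assumptions, not part of the original example.

```python
def river(name):
    """Expand the River(name) macro into its default relations."""
    return {
        "term": name,
        "is-a": "river",
        "is-at": "Portugal",
    }


# Adding several Portuguese rivers without repeating the defaults.
entries = [river(name) for name in ("Douro", "Tejo", "Minho")]
```

The macro saves the editor from re-typing the same two relations for every river, and keeps them consistent across entries.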
    • Ontologies
        “Restrictions: formally stated descriptions of what must be true in order for some assertion to be accepted as input.”
        Source: Ontology: information science (Wikipedia, 2012)
        We can enforce that the capital of a country is a city:
        add (X capital-of Y) iff X is-a City
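A restriction can be sketched as a guard on insertion. The class `Ontology`, the exception `RestrictionError` and the triple representation are assumptions made for this example; real ontology tools express such constraints in their own formalisms.

```python
class RestrictionError(ValueError):
    """Raised when an assertion violates a restriction."""


class Ontology:
    """Toy triple store that checks one restriction before accepting input."""

    def __init__(self):
        self.triples = set()   # (subject, relation, object)

    def is_a(self, individual, cls):
        return (individual, "is-a", cls) in self.triples

    def add(self, subject, relation, obj):
        # Restriction: add (X capital-of Y) iff X is-a City
        if relation == "capital-of" and not self.is_a(subject, "City"):
            raise RestrictionError(f"{subject} is not known to be a City")
        self.triples.add((subject, relation, obj))


onto = Ontology()
onto.add("Lisbon", "is-a", "City")
onto.add("Lisbon", "capital-of", "Portugal")   # accepted: Lisbon is a City
```

Had the two `add` calls been made in the opposite order, the second assertion would have been rejected, which is the point of a restriction: the input is refused rather than repaired.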
    • Ontologies
        “Rules: statements in the form of an antecedent-consequent sentence that describe the logical inferences that can be drawn from an assertion in a particular form.”
        Source: Ontology: information science (Wikipedia, 2012)
        On the other hand, if we trust whoever is editing the ontology, we can automatically classify the capital as a city, and its country as a... country:
        X capital-of Y ⇒ X is-a City ∧ Y is-a Country
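The same capital-of statement can instead drive an inference. The sketch below applies the rule once over a set of triples; the function name and the triple representation are assumptions for illustration.

```python
def apply_capital_rule(triples):
    """Apply: X capital-of Y  =>  X is-a City  and  Y is-a Country."""
    inferred = set(triples)
    for subject, relation, obj in triples:
        if relation == "capital-of":
            inferred.add((subject, "is-a", "City"))
            inferred.add((obj, "is-a", "Country"))
    return inferred


facts = {("Lisbon", "capital-of", "Portugal")}
closed = apply_capital_rule(facts)
```

Where a restriction would refuse the input until Lisbon had been declared a city, the rule accepts it and adds the two missing classifications itself.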
    • Ontologies
        “Axioms: assertions (including rules) in a logical form that together comprise the overall theory that the ontology describes in its domain of application.”
        Source: Ontology: information science (Wikipedia, 2012)
        Axioms differ from Rules: they are tests that guarantee the ontology structure. They are not used to infer new relations. They assert, and can/should be used for consistency checking.
    • Ontologies
        “Events: the changing of attributes or relations.”
        Source: Ontology: information science (Wikipedia, 2012)
        Similar to rules, but they react to events. For example, if the user adds a feature stating that an individual lays eggs, classify it as oviparous.
        Note that the division into Rules, Axioms and Events is not universal, and depends a lot on the application that is used to support the ontology.
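An event can be sketched as a callback fired when an attribute changes. Everything here (the `Individual` class, the listener mechanism, the feature string "lays eggs") is an assumption made for illustration; as noted above, how events are supported depends on the application.

```python
class Individual:
    """Toy individual that notifies listeners whenever a feature is added."""

    def __init__(self, name, listeners=()):
        self.name = name
        self.features = set()
        self.classes = set()
        self.listeners = list(listeners)

    def add_feature(self, feature):
        self.features.add(feature)
        for listener in self.listeners:   # the "event" being fired
            listener(self, feature)


def classify_oviparous(individual, feature):
    """React to the event: anything that lays eggs is oviparous."""
    if feature == "lays eggs":
        individual.classes.add("oviparous")


hen = Individual("hen", [classify_oviparous])
hen.add_feature("lays eggs")

cat = Individual("cat", [classify_oviparous])
cat.add_feature("has fur")
```

The classification happens at the moment the feature is added, rather than in a later inference pass over the whole ontology.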
    • Ontologies: Example 1
    • Ontologies: Example 2
    • Ontologies: Pros and Cons
        Pros:
        • More flexible than Thesauri (graph with ad-hoc relationships);
        • Lots of formalisms and standards (OWL, SKOS, ...);
        • Lots of tools to edit (like Protégé);
        • Languages for querying and completion (like SPARQL).
        Cons:
        • As a classification approach, requires an authority for its definition, just like Taxonomies or Thesauri;
        • Complexity: not everybody is able to create a detailed ontology.
    • Further Reading
        Folksonomies:
        • Folksonomy Coinage and Definition. http://vanderwal.net/folksonomy.html
        • Folksonomies: A User-Driven Approach to Organizing Content. http://www.uie.com/articles/folksonomies/
        • Folksonomies: power to the people. http://www.iskoi.org/doc/folksonomies.htm
        • Folksonomies: Tidying up Tags? http://www.dlib.org/dlib/january06/guy/01guy.html
        • Folksonomies - Cooperative Classification and Communication Through Shared Metadata. http://www.adammathes.com/academic/computer-mediated-communication/folksonomies.html
    • Further Reading
        Taxonomies:
        • Taxonomy. http://en.wikipedia.org/wiki/Taxonomy
        • Perspectives on Taxonomy, Classification, Structure and Find-ability. http://www.serviceinnovation.org/included/docs/kcs_taxonomy.pdf
        • Universal Decimal Classification. http://www.udcc.org/udcsummary/php/index.php
        Thesauri:
        • Thesaurus. http://en.wikipedia.org/wiki/Thesaurus
        • Thesaurus principles and practice. http://www.willpowerinfo.co.uk/thesprin.htm
    • Further Reading
        Ontologies:
        • Ontology (information science). http://en.wikipedia.org/wiki/Ontology_(information_science)
        • Protégé Ontology Editor. http://protege.stanford.edu/
        • OWL Web Ontology Language. http://www.w3.org/TR/owl-features/
        • SPARQL Query Language for RDF. http://www.w3.org/TR/rdf-sparql-query/
    • Part II: Terminology and Translation
    • Overview
        4 How translation works
        5 The role of Terminology in Translation
        6 Translation Software: Standard translation software; Standard terminology management software
    • How Translation Works
    • How Translation Works
        • Manual Translation: the translator uses resources such as dictionaries and terminologies, but searches them manually. The type of translation done in the last century.
        • Computer Assisted Translation: the translator uses tools (CAT tools) to help the translation process. They help the translator reuse previous translations, integrate with terminologies, and help the translator deal with different file formats.
        • Exploratory Translation: using machine translation tools, like Google Translate, to do a quick translation and understand texts. Not really a professional translation process.
        • Machine Translation: computer systems that translate text using different techniques, from statistical information to translation rules. Quality has been rising in recent years, but is still far from the result of real translation work.
    • Computer Assisted Translation
        CAT tools translation process:
        1 The document is opened in the CAT tool;
        2 The first sentence is extracted and presented to be translated;
        3 The sentence is looked up in a database of previously translated sentences, searching for similar sentences (fuzzy matching);
        4 If found, the translation is done (or a fuzzy translation);
        5 A terminology database is queried to check whether the sentence includes relevant terms to be translated;
        6 The translator reviews the translation;
        7 The system saves the translation in a database of translations;
        8 The system saves the translation in the translated document;
        9 The next sentence is extracted, and the process goes back to step 3.
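Steps 3, 5 and 7 of the process above can be sketched with Python's standard `difflib` module. This is not how any particular CAT tool works: the similarity measure (`SequenceMatcher.ratio`), the 0.75 threshold, the function names and the English-Portuguese sample data are all assumptions chosen for illustration.

```python
from difflib import SequenceMatcher


def best_match(sentence, memory, threshold=0.75):
    """Step 3: look for the most similar source sentence in the memory.

    Returns (source, translation, score), or None when nothing in the
    memory reaches the (arbitrary) similarity threshold.
    """
    best_source, best_score = None, 0.0
    for source in memory:
        score = SequenceMatcher(None, sentence.lower(), source.lower()).ratio()
        if score > best_score:
            best_source, best_score = source, score
    if best_source is None or best_score < threshold:
        return None
    return best_source, memory[best_source], best_score


def find_terms(sentence, termbase):
    """Step 5: list the termbase entries that occur in the sentence."""
    lowered = sentence.lower()
    return {term: target for term, target in termbase.items()
            if term.lower() in lowered}


def store(memory, source, translation):
    """Step 7: save the reviewed translation for later reuse."""
    memory[source] = translation


# Hypothetical translation memory and termbase (English -> Portuguese).
memory = {"Open the file.": "Abra o ficheiro."}
termbase = {"file": "ficheiro"}

exact = best_match("Open the file.", memory)
fuzzy = best_match("Open the files.", memory)
miss = best_match("Good morning.", memory)
terms = find_terms("Open the files.", termbase)
```

An exact match scores 1.0 and could be inserted directly, while a fuzzy match only proposes a translation that the translator must still adapt in step 6.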
    • Computer Assisted Translation
    • Translation Memories
        • Databases of translations;
        • Store sentences in two or more languages;
        • Grow along with the work of the translator;
        • Can be shared between translators in the same project;
        • Some big companies make their TM available to contracted translators in order to guarantee homogeneity in their translations.
    • Terminology and Translation
        • Translating terminology takes up to 40% of the time in translation:
            • Translators are not aware of technical areas;
            • Translators need to understand the term being translated;
            • Researching a specific area takes time;
            • Terminology reduces the time spent researching term translations.
        • Terminology helps the comprehension of concepts:
            • There is no way to translate without understanding;
            • Terminology might/should include explanations of terms.
        • Terminology helps with Consistency and Standardization:
            • Translate terms the same way throughout the document;
            • Translate terms the same way across all documents;
            • Companies, Organizations and Governmental Institutions define specific terminologies that should be used by translators.
    • Further Reading CAT software Discover the benefits of using a CAT Tool: How can CAT Tools help you? by Jonathan T. Hine Jr. http://www.translationzone.com/en/translator-solutions/translation-memory/cat-tools/ What is a translation memory? by SDL Trados. http://www.translationzone.com/en/translator-solutions/translation-memory/default.asp What is terminology? by SDL Trados. http://www.translationzone.com/en/translator-solutions/terminology-management/default.asp Alberto Sim˜oes A5 - Specialized Dictionaries 69/138
    • Further Reading Terminology in Translation Terminology in translation, by Thorsten Trippel (1999) http://www.spectrum.uni-bielefeld.de/~ttrippel/terminology/node19.html Terminology Management in Translation, by Gabriele Sauberer (2009) http://www.termnet.org/downloads/english/events/itaindia_workshop/GS_Terminology_Management_in_Translation.pdf The Role of Terminology Management in Localization, by Sue Ellen Wright (2006) http://www.translationzone.com/en/images/sue_ellen_slides_tcm18-25819.pdf Managing Terminology for Translation Using Translation Environment Tools: Towards a Definition of Best Practices, by Marta Gómez Palou Allard (2012) http://www.ruor.uottawa.ca/fr/bitstream/handle/10393/22837/Gomez_Palou_Allard_Marta_2012_thesis.pdf
    • Part III Introduction to Corpora Alberto Sim˜oes A5 - Specialized Dictionaries 71/138
    • Overview 7 Corpora Monolingual Corpora Parallel Corpora Corpora in the Web 8 The web as Corpora Do-it-yourself Corpora Basic Crawling Tools Alberto Sim˜oes A5 - Specialized Dictionaries 72/138
    • What is a Corpus? cor·pus /ˈkôrpəs/ Noun 1. A collection of written texts, esp. the entire works of a particular author or a body of writing on a particular subject; 2. A collection of written or spoken material in machine-readable form, assembled for the purpose of studying linguistic structures, frequencies, etc. Corpora is the plural of corpus.
    • Corpora Classification Corpora are usually classified according to the number of languages they contain: Monolingual Corpus: documents are all written in one language (in some cases with more than one variant); Multilingual Corpus: documents are written in more than one language.
    • Corpora Classification There are two especially relevant types of multilingual corpora: Parallel Corpus: a text placed alongside its translation or translations. Parallel text alignment is the identification of corresponding blocks in both halves of the parallel text. Comparable Corpus: is one which selects similar texts in more than one language or variety. There is as yet no agreement on the nature of the similarity, because there are very few examples of comparable corpora. (Expert Advisory Group on Language Engineering Standards Guidelines, 1996)
    • Monolingual Corpora Examples British National Corpus (http://www.natcorp.ox.ac.uk/) The British National Corpus (BNC) is a 100 million word collection of samples of written and spoken language from a wide range of sources, designed to represent a wide cross-section of current British English, both spoken and written. CETEMPúblico (http://www.linguateca.pt/cetempublico/) Corpus de Extractos de Textos Electrónicos MCT/Público is a corpus of approximately 180 million words in European Portuguese. It was created by the Computational Processing of Portuguese Project after an agreement between the Ministry of Science and Technology and the Público newspaper, in April, 2000. CETENFolha (http://www.linguateca.pt/cetenfolha/) Corpus de Extractos de Textos Electrónicos NILC/Folha de S. Paulo is a corpus of approximately 24 million words in Brazilian Portuguese, created by the Computational Processing of Portuguese Project using texts from the newspaper Folha de S. Paulo, that are part of the NILC/São Carlos
    • Monolingual Corpora Examples Russian National Corpus (http://ruscorpora.ru/en/index.html) RNC is a corpus of the modern Russian language incorporating over 300 million words. The corpus of Russian is a reference system based on a collection of Russian texts in electronic form. Croatian National Corpus (http://www.hnk.ffzg.hr/cnc.htm) HNK is a systematized collection of selected texts mainly written in contemporary Croatian covering different media, genres, styles, fields and topics. The Corpus is accompanied by additional linguistic and non-linguistic data and stored in a database on our server which can be accessed with the search client program Bonito. KOTONOHA Corpus (http://www.kotonoha.gr.jp/) The Balanced Corpus of Contemporary Written Japanese includes text samples collected to be able to grasp an overall picture of the modern Japanese written language and includes about 100 million words. Alberto Sim˜oes A5 - Specialized Dictionaries 77/138
    • Parallel Corpora Examples Aligned Hansards (http://isi.edu/natural-language/download/hansard/) Aligned Hansards of the 36th Parliament of Canada, contains 1.3 million pairs of aligned text chunks (sentences or smaller fragments). COMPARA ( http://www.linguateca.pt/COMPARA/) COMPARA is a bidirectional parallel corpus of English and Portuguese. In other words, it is a type of database with original and translated texts in these two languages that have been linked together sentence by sentence. Europarl ( http://www.statmt.org/europarl/) The Europarl parallel corpus is extracted from the proceedings of the European Parliament. It includes versions in 11 European languages: Romanic (French, Italian, Spanish, Portuguese), Germanic (English, Dutch, German, Danish, Swedish), Greek and Finnish. Alberto Sim˜oes A5 - Specialized Dictionaries 78/138
    • Parallel Corpora Examples JRC-Acquis (http://langtech.jrc.it/JRC-Acquis.html) The Acquis Communautaire is the total body of European Union (EU) law applicable in the EU Member States. It is a collection of parallel texts in 22 languages: Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Hungarian, Italian, Lithuanian, Latvian, Maltese, Dutch, Polish, Portuguese, Romanian, Slovak, Slovene and Swedish. OPUS (http://opus.lingfil.uu.se/) OPUS is a growing collection of translated texts from the web. In the OPUS project we try to convert and align free online data, to add linguistic annotation, and to provide the community with a publicly available parallel corpus. Per-Fide (http://per-fide.di.uminho.pt/cquery) The Per-Fide Project aims to develop parallel corpora between Portuguese and six other languages: English, Russian, French, Italian, German and Spanish.
    • Querying Corpora Using http://corpus.leeds.ac.uk/protected/query.html Concordances of a single word: dog Concordances for a sequence of words: big bang Concordances for lemmas: [lemma="have"] Concordances for part of speech: [pos="NNS"] Combinations of the above: [lemma="have"] dog [lemma="be"] [lemma="have"] Regular expressions can be used: [pos="N.*"] [pos="V.*"] Multiple restrictions for same word: [pos="N.*" & word="d.*"] [pos="V.*"] Empty words: [pos="N.*"] [] [pos="V.*"] Alberto Sim˜oes A5 - Specialized Dictionaries 80/138
    • The Web as Corpora To study “purposeful language behavior,” corpus linguists require collections of authentic texts (spoken and/or written). It is therefore not surprising that many (corpus) linguists have recently turned to the World Wide Web as the richest and most easily accessible source of language material available. At the same time, for language technologists, who have been arguing for long that “more data is better data,” the WWW is a virtually unlimited source of “more data.” Wacky! A Wacky Introduction Silvia Bernardini, Marco Baroni and Stefan Evert Alberto Sim˜oes A5 - Specialized Dictionaries 81/138
    • Do-it-yourself Corpora The WWW has data on virtually any subject; There is data in almost any language; Therefore, it is possible to build custom corpora! Collect text from the web... in a specific language... on the subject you want to study... and retrieve as much text as you need.
    • Basic Crawling Tools There are standard download tools that follow HTML links, and are able to download complete websites. They are known as web spiders, or web robots; Examples include “wget”, “wGetGUI” or “HTTrack”; But you need to process the files yourself. There are some projects that developed tools specific for corpora building. The most well known is “BootCaT” Alberto Sim˜oes A5 - Specialized Dictionaries 83/138
    • Further Reading Corpora: Corpus Creation - Handbook of NLP http://cgi.cse.unsw.edu.au/~handbookofnlp/index.php?n=Chapter7.Chapter7 Building and Using Your Own Corpora http://www.lancs.ac.uk/fss/courses/ling/corpus/blue/diy_top.htm CQP Query Language Tutorial http://cwb.sourceforge.net/files/CQP_Tutorial/ Web as Corpora: Wacky! Working papers on the Web as Corpus http://wackybook.sslmit.unibo.it/ Wacky Wiki http://wacky.sslmit.unibo.it/doku.php
    • Part IV Terminology Extraction from Monolingual Corpora Alberto Sim˜oes A5 - Specialized Dictionaries 85/138
    • Overview 9 Corpora for Terminology Building 10 Obtaining candidate terms from Corpora N-grams and Frequencies Lexical Difference Exploring Mutual Information Morphology Constraints 11 Exploring a Tool: Term-o-Matic Alberto Sim˜oes A5 - Specialized Dictionaries 86/138
    • Corpora for Terminology Building Studying one or more texts from a specific domain is a relevant way to understand that domain's terminology; Words in context give more information than words alone; There is no automatic method to extract specific domain terminology from a specific domain corpus; Nevertheless, there are automatic methods to obtain candidate terms, which can later be analysed and incorporated into a terminology, or just discarded.
    • Words n-Grams In the fields of computational linguistics and probability, an n-gram is a contiguous sequence of n items from a given sequence of text or speech. The items in question can be phonemes, syllables, letters, words or base pairs according to the application. n-grams are collected automatically from a text or speech corpus. Alberto Sim˜oes A5 - Specialized Dictionaries 88/138
    • One-Grams 1-Grams are usually known as words/tokens. :-) Peter Piper picked a peck of pickled peppers. A peck of pickled peppers Peter Piper picked. If Peter Piper picked a peck of pickled peppers, Where’s the peck of pickled peppers Peter Piper picked? peter 4 piper 4 picked 4 a 3 peck 4 of 4 pickled 4 ... ...
    • Bigrams All sequences of two words/tokens found in the text. Peter Piper picked a peck of pickled peppers. A peck of pickled peppers Peter Piper picked. If Peter Piper picked a peck of pickled peppers, Where’s the peck of pickled peppers Peter Piper picked? peter piper 4 piper picked 4 picked a 2 a peck 3 peck of 4 of pickled 4 pickled peppers 4 ... ... Alberto Sim˜oes A5 - Specialized Dictionaries 90/138
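Counting word n-grams such as these takes only a few lines. The sketch below is an assumption about how it could be done, not the code behind the slides: the tokenizer (lowercasing, keeping letters and apostrophes) and the function names are illustrative choices, and n-grams are counted across sentence boundaries.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and keep runs of letters/apostrophes as tokens."""
    return re.findall(r"[a-z']+", text.lower())


def ngram_counts(tokens, n):
    """Count every contiguous sequence of n tokens (as tuples)."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


text = (
    "Peter Piper picked a peck of pickled peppers. "
    "A peck of pickled peppers Peter Piper picked. "
    "If Peter Piper picked a peck of pickled peppers, "
    "Where's the peck of pickled peppers Peter Piper picked?"
)

tokens = tokenize(text)
bigrams = ngram_counts(tokens, 2)

# Print the most frequent bigrams with their occurrence counts
for gram, count in bigrams.most_common(5):
    print(" ".join(gram), count)
```

Calling `ngram_counts(tokens, 3)` gives trigram counts in the same way.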
    • Top occurring trigrams for a real corpus in accordance with 31148 referred to in 27581 the member states 16999 accordance with the 16535 of the european 14772 laid down in 13301 to in article 13211 having regard to 12588 regard to the 11416 member states shall 11392 in order to 10563 in the case 10029 the provisions of 9825 the case of 9575 provided for in 9560 the member state 9360 of the member 8656 the commission shall 8013 of this directive 6679 a member state 6306 on the basis 6292 the european parliament 6274 the basis of 6265 and in particular 6225 down in article 6200 of the community 5958 accordance with article 5758 to in paragraph 5690 opinion of the 5599 the opinion of 5191 the competent authorities 5074 for the purposes 5024 the purposes of 4946 with the procedure 4878 to the commission 4843 the european community 4834 Alberto Sim˜oes A5 - Specialized Dictionaries 91/138
    • n-grams frequency n-Grams are usually computed together with their occurrence count, or frequency; In some situations, such as statistical language models, other measures are also computed (probability, i.e. relative frequency; conditional probability; etc.); One-gram frequencies do not help much in term candidate extraction: they only say whether a word is more or less frequent. n-grams for n ≥ 2 can help find sequences of words that occur many times.
    • Stop Words and Lexical Difference There are words that rarely occur in terminology; At least, they rarely occur in the beginning or end of a multi-word term; For example, pronouns, articles, prepositions; These words are usually known as stop words; It is easy to find bigger or smaller lists of stop words for every language; We can ignore these words when computing n-grams. Alberto Sim˜oes A5 - Specialized Dictionaries 93/138
    • Detecting stop-words in accordance with 31148 referred to in 27581 the member states 16999 accordance with the 16535 of the european 14772 laid down in 13301 to in article 13211 having regard to 12588 regard to the 11416 member states shall 11392 in order to 10563 in the case 10029 the provisions of 9825 the case of 9575 provided for in 9560 the member state 9360 of the member 8656 the commission shall 8013 of this directive 6679 a member state 6306 on the basis 6292 the european parliament 6274 the basis of 6265 and in particular 6225 down in article 6200 of the community 5958 accordance with article 5758 to in paragraph 5690 opinion of the 5599 the opinion of 5191 the competent authorities 5074 for the purposes 5024 the purposes of 4946 with the procedure 4878 to the commission 4843 the european community 4834
    • Replacing stop words by a special token <tk> member states 32517 member states <tk> 30108 <tk> member state 19345 member state <tk> 17882 council directive <tk> 7869 <tk> council directive 7129 <tk> european parliament 5397 council regulation <tk> 5259 european parliament <tk> 5125 <tk> council regulation 4995 <tk> competent authorities 4964 competent authorities <tk> 4736 procedure laid <tk> 4472 <tk> treaty establishing 4375 treaty establishing <tk> 4373 <tk> competent authority 3694 official journal <tk> 3530 competent authority <tk> 3507 annex ii <tk> 3429 commission regulation <tk> 3171 <tk> commission regulation 2967 commission decision <tk> 2545 <tk> customs authorities 2542 <tk> commission decision 2429 customs authorities <tk> 2410 <tk> european economic 2285 <tk> administrative provisions 2017 <tk> contracting parties 2010 conditions laid <tk> 1998 contracting parties <tk> 1779 commission directive <tk> 1764 detailed rules <tk> 1738 <tk> community industry 1728 <tk> contracting party 1702 Alberto Sim˜oes A5 - Specialized Dictionaries 95/138
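A minimal sketch of this masking step, under stated assumptions: the stop-word list is a tiny hypothetical one, the marker is the `<tk>` token used on the slide, and the function names are made up for the example.

```python
from collections import Counter

# Hypothetical, very small stop-word list; real lists are much longer
STOP_WORDS = {"the", "of", "shall"}


def mask_stop_words(tokens, stop_words, marker="<tk>"):
    """Replace every stop word with a single special token."""
    return [marker if t in stop_words else t for t in tokens]


def trigram_counts(tokens):
    """Count contiguous sequences of three tokens."""
    return Counter(zip(tokens, tokens[1:], tokens[2:]))


sentence = ("the competent authorities of the member states "
            "shall apply maximum residue limits")
masked = mask_stop_words(sentence.split(), STOP_WORDS)
counts = trigram_counts(masked)

# Trigrams with no masked position are the stop-word-free candidates
clean = {gram: c for gram, c in counts.items() if "<tk>" not in gram}
```

Keeping the marker, instead of deleting stop words, preserves word positions, so a trigram never joins words that were not adjacent in the text.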
    • Trigrams that do not include stop words member states relating 1523 member state concerned 1200 veterinary medicinal products 955 maximum residue limits 814 physically modified derivatives 700 european economic community 691 community trade mark 538 member states concerned 508 plant protection products 464 home member state 442 host member state 388 council common position 377 community plant variety 368 european atomic energy 346 animal health conditions 342 authorised representative established 327 implementing powers conferred 311 regional economic integration 263 median longitudinal plane 258 plant protection product 249 separate technical unit 246 national regulatory authorities 241 apply mutatis mutandis 241 common technical regulation 229 separate technical units 226 emission limit values 219 technically permissible maximum 215 maximum residue levels 212 retail trade services 200 temporary importation procedure 196 medicinal products intended 195 community transit procedure 195 atomic energy community 193 classical swine fever 189
    • Basic Lexical Difference What if we remove not just stop words, but common words? It is unusual to find Osteoarthritis in everyday text. Therefore, it should be some kind of domain term. We can obtain a list of common words from a generic corpus (say, journalistic text) and subtract that lexicon from the one-grams we obtained. The result should include good term candidates!
    • Basic Lexical Difference - Experiment Two random abstracts from PubMed articles related to cirrhosis; Top 1 000 occurring words in English; Compute one-grams on the abstracts; Subtract the top occurring words. Before: liver 8, is 7, fibrosis 6, myofibroblast 6, pathway 5, kidney 5, expression 5, interstitial 4, signaling 3, target 3, differentiation 3, diseases 3, medullary 3, antioxidant 3. After: liver 8, myofibroblast 6, fibrosis 6, pathway 5, kidney 5, interstitial 4, β-catenin 3, target 3, signaling 3, genes 3, differentiation 3, medullary 3, renal 3, adult 3.
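The subtraction used in this experiment can be sketched as follows. The token list and the set of common words are toy stand-ins (assumptions) for the PubMed abstracts and the top 1 000 English words, and `lexical_difference` is a name chosen for the example.

```python
from collections import Counter


def lexical_difference(domain_tokens, common_words):
    """Count one-grams in the domain text, dropping every common word."""
    counts = Counter(domain_tokens)
    return Counter({w: c for w, c in counts.items() if w not in common_words})


# Toy data standing in for the abstracts and the common-word list
domain_tokens = "liver fibrosis is a liver disease".split()
common_words = {"is", "a", "disease"}

candidates = lexical_difference(domain_tokens, common_words)

# Remaining words, most frequent first, are the term candidates
for word, count in candidates.most_common():
    print(word, count)
```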
    • Lexical Distribution Difference The previous example could benefit from a larger standard lexicon list; Abstracts are crowded with terminology, with few other words; Long lists may include words that are considered terminology! For example, in Informatics, folder or file can be terms. Instead of considering words as present or not, use their frequency; For instance, compute relative frequencies and compare/subtract them; Use a distribution comparison metric, e.g., the Kullback-Leibler terms: $P(i) \, \log \frac{P(i)}{Q(i)}$
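A sketch of the per-word Kullback-Leibler terms, where P is the relative frequency in the domain corpus and Q the relative frequency in a reference corpus. The small `floor` value used for words missing from the reference corpus is an assumption of this example (some smoothing is needed to avoid dividing by zero), as are the function names and the toy data.

```python
import math
from collections import Counter


def relative_freq(tokens):
    """Relative frequency of each word in a token list."""
    counts = Counter(tokens)
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def kl_terms(domain_tokens, reference_tokens, floor=1e-6):
    """Per-word term P(i) * log(P(i) / Q(i)).

    Words absent from the reference corpus get the probability `floor`
    (an assumed smoothing choice)."""
    p = relative_freq(domain_tokens)
    q = relative_freq(reference_tokens)
    return {w: pw * math.log(pw / q.get(w, floor)) for w, pw in p.items()}


# Toy corpora: 'folder' never occurs in the reference text
domain = ["file", "folder", "the", "file"]
reference = ["the", "the", "the", "file"]

scores = kl_terms(domain, reference)
ranking = sorted(scores, key=scores.get, reverse=True)
```

Words that are relatively more frequent in the domain corpus than in the reference corpus get positive scores; words that are more typical of general text get negative ones.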
    • Pointwise Mutual Information The Mutual Information (MI) is a quantity that measures the mutual dependence of two random variables X and Y. $MI(X, Y) = \sum_{x \in X} \sum_{y \in Y} P(x, y) \log_2 \frac{P(x, y)}{P(x)\,P(y)}$ Intuitively, mutual information measures the information that X and Y share: it measures how much knowing one of these variables reduces uncertainty about the other.
    • Pointwise Mutual Information When computing Mutual Information for two specific outcomes, the Pointwise Mutual Information (PMI) lets us measure their mutual dependence: $PMI(x, y) = \log_2 \frac{P(x, y)}{P(x)\,P(y)}$ Given the number of tokens in the document, N, and the number of occurrences of x, Oc(x): $P(x) = \frac{Oc(x)}{N}$ Given the number of tokens in the document, N, and the number of occurrences of the bigram x, y, Oc(x, y): $P(x, y) = \frac{Oc(x, y)}{N}$
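These definitions translate almost directly into code. The sketch below follows the formulas above, using the token count N as the denominator for both word and bigram probabilities; the function name and the toy sentence are assumptions made for illustration.

```python
import math
from collections import Counter


def bigram_pmi(tokens):
    """PMI(x, y) = log2( P(x, y) / (P(x) * P(y)) ) for every bigram,
    with P(x) = Oc(x) / N and P(x, y) = Oc(x, y) / N."""
    n = len(tokens)
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    return {
        (x, y): math.log2((c / n) / ((unigrams[x] / n) * (unigrams[y] / n)))
        for (x, y), c in bigrams.items()
    }


tokens = "black hole near the black hole and the milky way".split()
pmi = bigram_pmi(tokens)

# Print bigrams from highest to lowest PMI
for gram in sorted(pmi, key=pmi.get, reverse=True):
    print(" ".join(gram), round(pmi[gram], 4))
```

Note that a pair of words that each occur only once, and only together, gets the highest possible PMI, so PMI is usually read together with the occurrence count rather than on its own.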
    • Pointwise Mutual Information Sorted by occurrence count sonic fabric 14 7.3566 black holes 9 8.0912 black hole 7 8.0912 cassette tape 6 8.4968 build things 4 9.5348 smartphone makers 3 9.0087 alyce santoro 3 8.0912 like scratching 3 9.0087 barnard said 3 8.3042 milky way 3 9.1787 possible black 3 7.6762 neutron star 3 8.8567 just right 3 8.5937 records backwards 3 10.5937 Sorted by PMI special shuttle 1 12.1787 immediately reminded 1 12.1787 remain aware 1 12.1787 richard branson 1 12.1787 supercooled pods 1 12.1787 richie havens 1 12.1787 auspicious locations 1 12.1787 jimi hendrix 1 12.1787 account settings 1 12.1787 baggage carousel 1 12.1787 buddhist prayer 1 12.1787 reinvents electronics 1 12.1787 melbourne institute 1 12.1787 cow manure 1 12.1787 From a very small corpus constructed with 5 CNN news stories. Alberto Sim˜oes A5 - Specialized Dictionaries 102/138
    • Morphology Patterns Commonly, terms are nouns or noun phrases; Sometimes verbs are also interesting; Typically the morphological structure of terms is well known; There is software that computes morphological information about each word in a sentence; We can use that information to obtain better term candidates: specify the terms' part of speech, gender, number, verb tense, etc.
    • Morphological Analysis How it (usually) works: 1 A sentence splitter and a tokenizer split the text into sentences and tokens (different tools apply them in a different order, some as a single tool); 2 A morphological analyzer associates all possible analyses with each word (it does not resolve ambiguity, it just tags every possible analysis); 3 A tagger or parser chooses the most likely analysis (using knowledge from manually annotated corpora and machine learning algorithms).
    • Morphological Patterns - Examples Noun Noun Noun 659 Community trade mark 483 plant protection products 475 EEC component type-approval 448 document number C 320 Community transit procedure 290 plant protection product 288 Community plant variety 257 EC type-examination certificate 214 EC component type-approval 176 EEC pattern approval 157 African swine fever 155 three-wheel motor vehicles 155 foot-and-mouth disease virus 153 conformity assessment procedures 148 emission limit values Adjective Adjective Noun 912 veterinary medicinal products 453 common agricultural policy 365 separate technical unit 291 separate technical units 265 median longitudinal plane 223 regional economic integration 202 competent national authorities 200 trans-European high-speed rail 199 sound financial management 189 veterinary medicinal product 182 certain agricultural products 176 national regulatory authorities 175 common technical regulation 168 certain third countries 166 other third countries 166 definitive anti-dumping duty 162 certain dangerous substances Alberto Sim˜oes A5 - Specialized Dictionaries 105/138
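Once every token carries a part-of-speech tag, filtering by a pattern such as Noun Noun Noun or Adjective Adjective Noun is a sliding-window comparison. In the sketch below the tagged sentence is written by hand and the tag names (`DET`, `ADJ`, `NOUN`, `VERB`) are assumptions; in practice the tags would come from a tagger.

```python
from collections import Counter


def match_pattern(tagged, pattern):
    """Count word sequences whose tags equal `pattern`, position by position.

    `tagged` is a list of (word, tag) pairs; `pattern` is a list of tags."""
    n = len(pattern)
    hits = Counter()
    for i in range(len(tagged) - n + 1):
        window = tagged[i:i + n]
        if all(tag == want for (_, tag), want in zip(window, pattern)):
            hits[" ".join(word for word, _ in window)] += 1
    return hits


# Hand-tagged toy sentence (hypothetical tags)
tagged = [
    ("the", "DET"), ("common", "ADJ"), ("agricultural", "ADJ"),
    ("policy", "NOUN"), ("covers", "VERB"), ("plant", "NOUN"),
    ("protection", "NOUN"), ("products", "NOUN"),
]

adj_adj_noun = match_pattern(tagged, ["ADJ", "ADJ", "NOUN"])
noun_noun_noun = match_pattern(tagged, ["NOUN", "NOUN", "NOUN"])
```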
    • Term-o-Matic http://www.termomatic.com/ Alberto Sim˜oes A5 - Specialized Dictionaries 106/138
    • Term-o-Matic What it is: A simple web application; Without user control; Developed specifically for this class; Implements some of the methods presented before. What it is not: Commercial software; A professional tool; A tool free of bugs; A multilingual tool.
    • Term-o-Matic: overview Main screen, shows options, and summary on available data. Alberto Sim˜oes A5 - Specialized Dictionaries 108/138
    • Term-o-Matic: Add Text Use the Add Text option to add one-grams, bigrams and trigrams into the database (English, please!). Alberto Sim˜oes A5 - Specialized Dictionaries 109/138
    • Term-o-Matic: Add Text feedback After adding some text, a summary of the amount of data added is shown. Alberto Sim˜oes A5 - Specialized Dictionaries 110/138
    • Term-o-Matic: Manage Stopwords The Stop Words option lets you manage the list of stop words. It is possible to add words (to add more than one, just separate them with spaces or punctuation) and to delete them.
    • Term-o-Matic: Manage Lexicon The Standard Lexicon option is very similar to the Stop Words option, but for the generic words. Alberto Sim˜oes A5 - Specialized Dictionaries 112/138
    • T-o-M: Words, Bigrams and Trigrams The Study Words, Study Bigrams and Study Trigrams work all in the same way, showing a list of words/bigrams/trigrams. Alberto Sim˜oes A5 - Specialized Dictionaries 113/138
    • T-o-M: Words, Bigrams and Trigrams Note that the PMI column is empty. This measure takes some time to compute, and therefore should be computed only when needed. Alberto Sim˜oes A5 - Specialized Dictionaries 114/138
    • T-o-M: Words, Bigrams and Trigrams To compute PMI, use the Compute bi/trigrams PMI option. After the software issues an "OK" message, hit the back button on your browser and refresh.
    • T-o-M: Words, Bigrams and Trigrams By default the list is sorted by occurrence count. You can change to PMI order as soon as it is computed. Alberto Sim˜oes A5 - Specialized Dictionaries 116/138
    • T-o-M: Words, Bigrams and Trigrams It is possible to remove entries with stop-words or punctuation; or entries with common words. Alberto Sim˜oes A5 - Specialized Dictionaries 117/138
    • T-o-M: Filtering by pattern To filter by a morphological pattern you must make sure that you have run the Compute Morph. Analysis option after the last time you entered text. When the software says the process is complete (OK), hit the back button, and you are ready to use the pattern filtering. Just choose the categories you are looking for, and search for them.
    • T-o-M: Filtering by Pattern Alberto Sim˜oes A5 - Specialized Dictionaries 119/138
    • Term-o-Matic: standard operation guide 1 Use the Add Text option to add text. Use it as many times as you need to create a big enough corpus; Do not add too much text at once. Add it in blocks. Be sure to add thematic text; 2 Define a list of stop words (you might already have one); 3 Define a list of common words. Look for such lists on the web; 4 Compute PMIs and the Morphological Analysis; 5 Do queries!
    • Evaluation Task Five students, Five subject areas, Five Term-o-Matic. Computer Science (http://termomatic.com/termomatic1) Medicine (http://termomatic.com/termomatic2) Europe (http://termomatic.com/termomatic3) Animal Biology (http://termomatic.com/termomatic4) Sports (http://termomatic.com/termomatic5) Alberto Sim˜oes A5 - Specialized Dictionaries 121/138
    • Part V Terminology Extraction from Multilingual Corpora Alberto Sim˜oes A5 - Specialized Dictionaries 122/138
    • Overview 12 Sentence and Word Alignment 13 Parallel Patterns Alberto Sim˜oes A5 - Specialized Dictionaries 123/138
    • Sentence Alignment Sentence alignment is the task of detecting translation relationships between sentences in parallel corpora. If sα is a sentence in a language Lα and sβ is a sentence in a language Lβ, the alignment process creates the pair (sα, sβ) if (there is a high probability that) sβ is a translation of sα. Alberto Sim˜oes A5 - Specialized Dictionaries 124/138
    • Word Alignment Word Alignment is the task of detecting translation relationships between words or terms in sentence-aligned parallel corpora. There are two approaches to word alignment: for each aligned sentence pair, create a link between every word and its translation; for the complete corpus, obtain a relationship between a word and a set of probable translations, together with a confidence measure (a kind of translation probability).
    • Probabilistic Translation Dictionaries Obtained with one of the word alignment methods; Define a relationship between a word and a set of probable translations; T(europe) = { europa 94.7%, europeus 3.4%, europeu 0.8%, europeia 0.1% } T(stupid) = { estúpido 47.6%, estúpida 11.0%, estúpidos 7.4%, avisada 5.6%, direita 5.6%, impasse 4.5%, ocupado 3.8% }
    • Translation Matrix Using the probabilistic translation dictionaries we are able to construct a translation matrix; Each cell has a translation probability obtained from the dictionary.
      Columns (English): 1 discussion, 2 about, 3 alternative, 4 sources, 5 of, 6 financing, 7 for, 8 the, 9 european, 10 radical, 11 alliance, 12 "."
                       1   2   3   4   5   6   7   8   9  10  11  12
      discussão       44   0   0   0   0   0   0   0   0   0   0   0
      sobre            0  11   0   0   0   0   0   0   0   0   0   0
      fontes           0   0   0  74   0   0   0   0   0   0   0   0
      de               0   3   0   0  27   0   6   3   0   0   0   0
      financiamento    0   0   0   0   0  56   0   0   0   0   0   0
      alternativas     0   0  23   0   0   0   0   0   0   0   0   0
      para             0   0   0   0   0   0  28   0   0   0   0   0
      a                0   1   0   0   1   0   4  33   0   0   0   0
      aliança          0   0   0   0   0   0   0   0   0   0  65   0
      radical          0   0   0   0   0   0   0   0   0  80   0   0
      europeia         0   0   0   0   0   0   0   0  59   0   0   0
      .                0   0   0   0   0   0   0   0   0   0   0  80
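Building such a matrix is a lookup per cell. The sketch below assumes the probabilistic translation dictionary is stored as a nested mapping (source word to target word to probability); the three entries are illustrative values only, and the slides do not say how the tool actually stores or combines its dictionaries.

```python
def translation_matrix(source_tokens, target_tokens, ptd):
    """One row per source token, one column per target token; each cell
    holds the dictionary probability, or 0.0 when the pair is unknown."""
    return [
        [ptd.get(s, {}).get(t, 0.0) for t in target_tokens]
        for s in source_tokens
    ]


# Hypothetical dictionary fragment (Portuguese -> English)
ptd = {
    "aliança": {"alliance": 0.65},
    "radical": {"radical": 0.80},
    "europeia": {"european": 0.59},
}

source = ["aliança", "radical", "europeia"]
target = ["european", "radical", "alliance"]
matrix = translation_matrix(source, target, ptd)

for word, row in zip(source, matrix):
    print(word, row)
```

For this fragment the non-zero cells fall on the anti-diagonal, which is the visual signature of a reversed word order between the two languages.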
    • Translation Patterns Translation changes word order (for some language pairs!); This change can be foreseen; This change can be defined formally as a pattern; These patterns can be used to obtain term candidates. Alberto Sim˜oes A5 - Specialized Dictionaries 128/138
    • Translation Pattern 1: ABBA Jogos Olímpicos ⇒ Olympic Games (Jogos aligns with Games, Olímpicos with Olympic). Formally, T(A · B) = T(B) · T(A) Or in the tool syntax: [ABBA] A B = B A
    • Translation Pattern 2: IDH índice de desenvolvimento humano ⇒ human development index (índice aligns with index, desenvolvimento with development, humano with human). T(I · "de" · D · H) = T(H) · T(D) · T(I) [IDH] I "de" D H = H D I
    • Translation Pattern 3: FTP protocolo de transferência de ficheiros ⇒ file transfer protocol (protocolo aligns with protocol, transferência with transfer, ficheiros with file). T(P · "de" · T · "de" · F) = T(F) · T(T) · T(P) [FTP] P "de" T "de" F = F T P
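One way to look for the simplest pattern, A B = B A, is to scan a translation matrix for two adjacent source words whose best translations appear in swapped order. The sketch below is only an assumption about how such a search could work, not the tool's implementation; the 0.5 threshold, the function name and the matrix values are all hypothetical.

```python
def find_abba(source, target, matrix, threshold=0.5):
    """Return (source pair, target pair) candidates where source word i
    translates target word j+1 and source word i+1 translates target word j."""
    pairs = []
    for i in range(len(source) - 1):
        for j in range(len(target) - 1):
            if matrix[i][j + 1] >= threshold and matrix[i + 1][j] >= threshold:
                pairs.append(
                    (" ".join(source[i:i + 2]), " ".join(target[j:j + 2]))
                )
    return pairs


source = ["os", "jogos", "olímpicos"]
target = ["the", "olympic", "games"]

# Hypothetical translation probabilities (rows: source, columns: target)
matrix = [
    [0.6, 0.0, 0.0],   # os
    [0.0, 0.0, 0.9],   # jogos
    [0.0, 0.8, 0.0],   # olímpicos
]

candidates = find_abba(source, target, matrix)
```

Longer patterns, such as those with a fixed "de" in the middle, would need one extra check per pattern position but follow the same idea.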
    • Patterns in Translation Matrix In the translation matrix for the sentence pair "discussão sobre fontes de financiamento alternativas para a aliança radical europeia ." / "discussion about alternative sources of financing for the european radical alliance .", two boxes mark blocks of cells. The first box covers "fontes de financiamento alternativas" / "alternative sources of financing"; the second covers "aliança radical europeia" / "european radical alliance". The two boxes correspond to the following two patterns: [P1] F "de" N A = A F "of" N [P2] A B C = C B A
    • Terms extracted using A B = B A 21007 união europeia ⇒ european union; 9301 parlamento europeu ⇒ european parliament; 4171 direitos humanos ⇒ human rights; 3504 estados unidos ⇒ united states; 2353 mercado interno ⇒ internal market; 1911 posição comum ⇒ common position; 1826 países candidatos ⇒ candidate countries; 1776 comissão europeia ⇒ european commission; 1708 conselho europeu ⇒ european council; 1629 saúde pública ⇒ public health; 1558 direitos fundamentais ⇒ fundamental rights; 1546 nações unidas ⇒ united nations; 1337 países terceiros ⇒ third countries; 1294 conferência intergovernamental ⇒ intergovernmental conference; 1258 fundos estruturais ⇒ structural funds
    • Terms extracted using A "de" B = B A 729 plano de acção ⇒ action plan; 722 conselho de segurança ⇒ security council; 680 processo de paz ⇒ peace process; 582 mercado de trabalho ⇒ labour market; 580 pena de morte ⇒ death penalty; 492 pacto de estabilidade ⇒ stability pact; 431 política de defesa ⇒ defence policy; 353 acordo de associação ⇒ association agreement; 348 protocolo de quioto ⇒ kyoto protocol; 343 programa de acção ⇒ action programme; 259 branqueamento de capitais ⇒ money laundering; 258 comité de conciliação ⇒ conciliation committee; 241 política de concorrência ⇒ competition policy; 226 processo de conciliação ⇒ conciliation procedure; 217 requerentes de asilo ⇒ asylum seekers
    • Terms extracted using A B C = C B A
      531  política agrícola comum ⇒ common agricultural policy
      418  banco central europeu ⇒ european central bank
      329  tribunal penal internacional ⇒ international criminal court
      166  aliança livre europeia ⇒ european free alliance
      156  modelo social europeu ⇒ european social model
      153  partidos políticos europeus ⇒ european political parties
       83  fundo monetário internacional ⇒ international monetary fund
       75  política externa comum ⇒ common foreign policy
       66  organização marítima internacional ⇒ international maritime organisation
       65  própria união europeia ⇒ european union itself
       65  fundo social europeu ⇒ european social fund
       55  direitos humanos fundamentais ⇒ fundamental human rights
       45  relações económicas externas ⇒ external economic relations
       45  homens e mulheres ⇒ women and men
       45  agência espacial europeia ⇒ european space agency
    • Terms extracted using I "de" D H = H D I
      95  mandato de captura europeu ⇒ european arrest warrant
      85  fontes de energia renováveis ⇒ renewable energy sources
      80  mandado de captura europeu ⇒ european arrest warrant
      67  sistemas de segurança social ⇒ social security systems
      64  zona de comércio livre ⇒ free trade area
      55  força de reacção rápida ⇒ rapid reaction force
      54  orientações de política económica ⇒ economic policy guidelines
      46  planos de acção nacionais ⇒ national action plans
      46  direitos de propriedade intelectual ⇒ intellectual property rights
      33  sistema de alerta rápido ⇒ rapid alert system
      29  política de defesa comum ⇒ common defence policy
      29  método de coordenação aberta ⇒ open coordination method
      27  método de coordenação aberto ⇒ open coordination method
      27  conselho de empresa europeu ⇒ european works council
      25  acordo de comércio livre ⇒ free trade agreement
    • Adding Morphological Constraints
      The pattern language supports constraints;
      Constraints can be of different types;
      The most interesting are the morphological ones:
      [ABBA] A B[CAT<-adj] = B[CAT<-adj] A
      With this kind of constraint we can force the words in specific positions to belong to a specific morphological category.
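The effect of the [ABBA] constraint can be sketched in Python as a filter over candidates already matched by the unconstrained pattern A B = B A. This is only an illustration under stated assumptions: the two category lexicons are toy stand-ins for a real morphological analyser, the function name `satisfies_abba_adj` is hypothetical, and the last candidate is an invented noisy match, not corpus data.

```python
# Toy category lexicons (hypothetical): a real system would query a
# morphological analyser instead of a hand-written dictionary.
PT_CAT = {"união": "noun", "europeia": "adj",
          "saúde": "noun", "pública": "adj",
          "que": "conj", "não": "adv"}
EN_CAT = {"union": "noun", "european": "adj",
          "health": "noun", "public": "adj",
          "that": "conj", "not": "adv"}

# Candidate pairs already matched by the unconstrained pattern A B = B A.
# The first two appear in the slides; the third is an invented noisy match.
CANDIDATES = [
    ("união europeia", "european union"),
    ("saúde pública", "public health"),
    ("que não", "not that"),
]


def satisfies_abba_adj(pt_term, en_term, category="adj"):
    """Check the constraint B[CAT<-adj] on both sides of A B = B A."""
    _a_pt, b_pt = pt_term.split()   # source side: A B
    b_en, _a_en = en_term.split()   # target side: B A
    return PT_CAT.get(b_pt) == category and EN_CAT.get(b_en) == category


kept = [pair for pair in CANDIDATES if satisfies_abba_adj(*pair)]
print(kept)
```

Only the pairs whose B word is tagged as an adjective in both languages survive, which is the kind of filtering the constraint is meant to provide.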
    • Further Reading
      Alignment tasks
        Sentence Alignment Survey: http://www.statmt.org/survey/Topic/SentenceAlignment
        An overview of bitext alignment algorithms: http://www.ida.liu.se/~jodfo/gslt/bitext-alignment-jody.pdf
        Word Alignment Survey: http://www.statmt.org/survey/Topic/WordAlignment
      Terminology from Parallel Corpora
        Parallel corpus-based bilingual terminology extraction: http://ambs.perl-hackers.net/publications/tia09.pdf