Introduction to development of lexical databases

Muhammad Shoaib
PhD Researcher (Biomedical Engineering)
Asan Medical Complex
College of Medicine University of Ulsan
Researcher Gachon University Gil Medical Center
Republic Of Korea

About me: Son of Soil
BS Computer Science (2006-2010)
FAST National University of Computer and Emerging
Sciences
ME Computer Engineering (2011-2013)
Jeju National University Republic of Korea
PhD Biomedical Engineering (2015 – To date)
Asan Medical Center, University of Ulsan, Republic of
Korea
Lecturer at Institute of Space Technology 2013-2015

Overview
Lexical Databases and DBMS
WordNet (we’ll see how we can adapt it)
Computational Linguistics
Lexical Ontologies

Database Management System
Data:
Facts and statistics collected together for
reference or analysis.
Database:
A structured set of data held in a computer,
especially one that is accessible in various ways.
Database Management System:
A computer software application that interacts
with end users, other applications, and the
database itself to capture and analyze data.

What are we talking about today?
Globalization requires more texts and speech
to be translated faster across more languages
Manual translation is difficult, expensive, and
time-consuming
Machine translation is of low quality, often
unacceptable

Why a Lexical Database?
How can computers understand what we read?
Why do we need computers for translation?
They are faster than humans
Can computers do the same job as humans?
In linguistics, probably not

Lexical Database
Machine Readable Dictionary
“A lexical database is a lexical resource which has an
associated software environment database which permits
access to its contents”
What is a Lexical Resource?
“A lexical resource (LR) is a database consisting of one or
several dictionaries.”

What Does a Lexical Database Contain?
Information typically stored in a lexical
database includes
lexical category of words
synonyms of words,
semantic and phonological relations between
different words or sets of words.
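The contents listed above could be stored in an ordinary relational database. The following is a minimal sketch using Python's built-in sqlite3 module. The two-table schema (sense, relation), the column names, and the sample rows are assumptions made for illustration; they are not the storage format of any real lexical database.

```python
import sqlite3

# Hypothetical schema: one row per (word form, synonym set) pair,
# plus one row per semantic relation between two synonym sets.
SCHEMA = """
CREATE TABLE sense (
    lemma     TEXT NOT NULL,      -- the word form
    category  TEXT NOT NULL,      -- lexical category: noun, verb, ...
    synset_id INTEGER NOT NULL,   -- words sharing an id are synonyms
    PRIMARY KEY (lemma, synset_id)
);
CREATE TABLE relation (
    source_synset INTEGER NOT NULL,
    kind          TEXT NOT NULL,  -- e.g. 'hypernym'
    target_synset INTEGER NOT NULL
);
"""


def build_lexical_db():
    """Create an in-memory database and load a few illustrative rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.executemany(
        "INSERT INTO sense VALUES (?, ?, ?)",
        [
            ("car", "noun", 1),
            ("automobile", "noun", 1),
            ("auto", "noun", 1),
            ("vehicle", "noun", 2),
        ],
    )
    conn.execute("INSERT INTO relation VALUES (1, 'hypernym', 2)")
    return conn


def synonyms(conn, lemma):
    """Return the other word forms that share a synonym set with lemma."""
    rows = conn.execute(
        """
        SELECT DISTINCT other.lemma
        FROM sense AS mine
        JOIN sense AS other ON other.synset_id = mine.synset_id
        WHERE mine.lemma = ? AND other.lemma <> ?
        ORDER BY other.lemma
        """,
        (lemma, lemma),
    ).fetchall()
    return [row[0] for row in rows]


conn = build_lexical_db()
print(synonyms(conn, "car"))
```

Synonymy is modelled here as membership in the same set rather than as word-to-word links, so adding a synonym means adding one row.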

Why Lexical Databases?
Natural language generation systems that produce
coherent discourses by verbalizing a set of triples
Question Answering systems that interpret user
questions with respect to one or more ontologies
Text interpretation systems that extract triples with
respect to one or more ontologies
Query interpretation and semantic search in
information retrieval systems
Natural language based interfaces to ontologies,
Semantic Web and Linked Data.

What is WordNet?
A large lexical database, or “electronic
dictionary,” developed and maintained at
Princeton
http://wordnet.princeton.edu
Includes most English nouns, verbs, adjectives,
adverbs
Can be used by humans and machines
Princeton WordNet is for English only, but it is
linked to wordnets in many other languages

Authors of the (first) WordNet
WordNet was created in the Cognitive
Science Laboratory of Princeton University under the
direction of psychology professor George Armitage
Miller starting in 1985 and has been directed in
recent years by Christiane Fellbaum
That is why it is usually called “the Princeton WordNet”
(PWN)
George Miller and Christiane Fellbaum were awarded
the 2006 Antonio Zampolli Prize for their work with
WordNet.

WordNet as described by authors
WordNet is an on-line lexical reference system
whose design is inspired by current
psycholinguistic theories of human lexical
memory. English nouns, verbs, and adjectives
are organized into synonym sets, each
representing one underlying lexical concept.
Different relations link the synonym sets.

What’s special about WordNet?
Traditional paper dictionaries are organized
alphabetically: words that are found together (on the
same page) are not related by meaning
WordNet is organized by meaning: words in close
proximity are semantically similar
Human users and computers can browse WordNet
and find words that are meaningfully related to their
queries (somewhat like in a hyperdimensional
thesaurus)

What’s special about WordNet?
WordNet gives information about two fundamental,
universal properties of human language:
polysemy and synonymy
Polysemy = one:many mapping of form and
meaning
Synonymy = one:many mapping of meaning and
form

Polysemy
One word form expresses multiple meanings
{table, tabular_array}
{table, piece_of_furniture}
{table, mesa}
{table, postpone}
Note: the most frequent word forms are the most
polysemous!

Synonymy
One concept is expressed by several different
word forms:
{beat, hit, strike}
{car, motorcar, auto, automobile}

Polysemy and synonymy
Understanding and generating language (as for
translation) means matching a word form with
the intended, context-appropriate meaning
People (fluent speakers of a language) do this
very efficiently

Synonymy in WordNet
WordNet groups (roughly) synonymous,
denotationally equivalent, words into unordered
sets of synonyms (“synsets”)
{hit, beat, strike}
{big, large}
{queue, line}
By definition, each synset expresses a distinct
meaning/concept
Each word form-meaning pair is unique

Polysemy in WordNet
A word form that appears in n synsets
is n-fold polysemous
{table, tabular_array}
{table, piece_of_furniture}
{table, mesa}
{table, postpone}
table is fourfold polysemous/has four senses
four distinct concepts are associated with the word form table
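The synset idea can be sketched in a few lines of Python. This is an illustration built only from the example synsets on these slides; the names SYNSETS, build_sense_index, polysemy and synonyms are hypothetical and do not come from the WordNet software.

```python
from collections import defaultdict

# Example synsets taken from the slides; each set is one concept.
SYNSETS = [
    {"table", "tabular_array"},
    {"table", "piece_of_furniture"},
    {"table", "mesa"},
    {"table", "postpone"},
    {"hit", "beat", "strike"},
    {"big", "large"},
    {"queue", "line"},
]


def build_sense_index(synsets):
    """Map each word form to the positions of the synsets containing it."""
    index = defaultdict(list)
    for synset_id, synset in enumerate(synsets):
        for form in synset:
            index[form].append(synset_id)
    return index


def polysemy(index, form):
    """A word form that appears in n synsets is n-fold polysemous."""
    return len(index.get(form, []))


def synonyms(synsets, index, form):
    """Collect every other word form sharing at least one synset with form."""
    result = set()
    for synset_id in index.get(form, []):
        result |= synsets[synset_id]
    result.discard(form)
    return result


index = build_sense_index(SYNSETS)
print(polysemy(index, "table"))           # 4
print(synonyms(SYNSETS, index, "big"))    # {'large'}
```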

Hypernymy relates noun synsets
Relates more/less general concepts
Creates hierarchies, or “trees”
              {vehicle}
             /         \
{car, automobile}   {bicycle, bike}
    /        \             |
{convertible} {SUV}  {mountain bike}
“A car is a kind of vehicle” <=> “The class of vehicles includes cars, bikes”
Hierarchies can have up to 16 levels

Hyponymy (Association Rules)
Transitivity
A car is a kind of vehicle
An SUV is a kind of car
=> An SUV is a kind of vehicle
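Transitivity can be checked by walking up the hierarchy. The sketch below assumes, for simplicity, that every concept has at most one direct hypernym (WordNet itself allows more than one); the HYPERNYM table holds only the vehicle example from the slides, and the function names are hypothetical.

```python
# Direct "is a kind of" links from the vehicle example (child -> parent).
HYPERNYM = {
    "car": "vehicle",
    "bicycle": "vehicle",
    "convertible": "car",
    "SUV": "car",
    "mountain bike": "bicycle",
}


def hypernym_chain(concept, hypernym=HYPERNYM):
    """Return all ancestors of concept, nearest first."""
    chain = []
    seen = {concept}
    while concept in hypernym:
        concept = hypernym[concept]
        if concept in seen:
            raise ValueError("cycle in hypernym links")
        seen.add(concept)
        chain.append(concept)
    return chain


def is_a(specific, general, hypernym=HYPERNYM):
    """True if general is reachable from specific via hypernym links."""
    return general in hypernym_chain(specific, hypernym)


print(hypernym_chain("SUV"))        # ['car', 'vehicle']
print(is_a("SUV", "vehicle"))       # True
print(is_a("vehicle", "SUV"))       # False
```

Only the direct links are stored; the inferred fact "an SUV is a kind of vehicle" is derived on demand.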

Meronymy/holonymy
(part-whole relation)
  {car, automobile}
         |
      {engine}
      /      \
{spark plug} {cylinder}
“An engine has spark plugs”
“Spark plugs and cylinders are parts of
an engine”

Meronymy/Holonymy (Inheritance)
A finger is part of a hand
A hand is part of an arm
An arm is part of a body
=>a finger is part of a body

Upward hierarchy in WordNet
{entity}
{physical_entity}
{object, physical_object}
{whole, unit}
{living_thing, animate_thing}
{organism, being}
{animal, animate_being, beast, brute, creature, fauna}
{chordate}
{vertebrate, craniate}
{mammal, mammalian}
{placental, placental_mammal, eutherian, eutherian_mammal}
{carnivore}
{canine, canid}
{dog, domestic_dog, Canis_familiaris}

25 unique beginners for noun synsets
{act, action, activity} {food} {possession}
{animal, fauna} {location, place} {process}
{artifact} {motive} {quantity, amount}
{attribute, property} {group, collection} {relation}
{body, corpus} {natural object} {shape}
{cognition, knowledge} {natural phenomenon} {state, condition}
{communication} {person, human being} {substance}
{event, happening} {plant, flora} {time}
{feeling, emotion}

Verb clusters
Verbs of Bodily Functions and Care (sweat)
Verbs of Change (change)
Verbs of Communication (tell)
Competition Verbs (race)
Consumption Verbs (drink)
Contact Verbs (touch)
Cognition Verbs (think)
Creation Verbs (create)
Motion Verbs (move)
Emotion or Psych Verbs (feel)
Stative Verbs (have, wear)
Perception Verbs (see)
Verbs of Possession (possess, own)
Verbs of Social Interaction (request, impeach)
Weather Verbs (thunder)

What is Computational Lexical Semantics?
Any computational process involving word
meaning!
Computing Word Similarity
Distributional (Vector) Models of Meaning
Computing Word Relations
Word Sense Disambiguation
Semantic Role Labeling
Computing word connotation and sentiment
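As one illustration of computing word similarity, a simple path-based measure scores two concepts by the number of hypernym links between them: similarity = 1 / (1 + shortest path length). The sketch below applies this to the small vehicle hierarchy from the earlier slides. The single-parent HYPERNYM table and the function names are assumptions made for the example; distributional (vector) models work quite differently.

```python
# Direct "is a kind of" links from the vehicle example (child -> parent).
HYPERNYM = {
    "car": "vehicle",
    "bicycle": "vehicle",
    "convertible": "car",
    "SUV": "car",
    "mountain bike": "bicycle",
}


def distances_up(concept):
    """Map concept and each of its ancestors to the number of links away."""
    distances = {concept: 0}
    steps = 0
    while concept in HYPERNYM:
        concept = HYPERNYM[concept]
        steps += 1
        distances[concept] = steps
    return distances


def path_similarity(a, b):
    """Return 1 / (1 + shortest path between a and b), or 0.0 if unconnected."""
    up_a = distances_up(a)
    up_b = distances_up(b)
    shared = set(up_a) & set(up_b)
    if not shared:
        return 0.0
    shortest = min(up_a[c] + up_b[c] for c in shared)
    return 1.0 / (1 + shortest)


print(path_similarity("SUV", "convertible"))     # siblings under car
print(path_similarity("SUV", "mountain bike"))   # related only via vehicle
```

An SUV comes out closer to a convertible (two links apart) than to a mountain bike (four links apart), which matches intuition.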

Concrete Applications
corpus linguistics
machine translation
text retrieval
text summarization
word processing help (discussed above)
expert systems
speech recognition/synthesis (touched upon above)
toys, games
automatic telephone interpretation system
ultimately … artificial intelligence, robotics

Corpus Linguistics
This is a generic name for various computer
applications that make use of large language
databases (called corpora)
Having access to a large database enables us
to process linguistic data in a statistical way,
rather than in an analytical way.
This conflict of two opposing views
(statistical vs. analytical) is very apparent in
machine translation.

Machine Translation (1)
Text-to-text translation (great need for
translation at the UN and the EC (European
Community))
Works best when the two languages in
question are similar in structure
Usually, pre-editing and/or post-editing by
a human translator is required
(machine-assisted translation).

Machine Translation (2)
Traditionally, MT required parsing, possibly
some semantic analysis, then mapping to a
syntactic tree of the sentence in the target
language.
An alternative is to appeal to statistical means
of mapping a surface string in the source
language to a surface string in the target
language.
Difficulty with word-for-word translation

Computational Semantics
The study of how to automate the process of
constructing and reasoning with meaning
representations of natural language expressions.
This could play an important role in such application
areas as machine translation when two
typologically distinct languages are involved (e.g.
English and Japanese).

Text Retrieval
key word → text/book
key word: morphology
1. Principles of Polymer Morphology
2. Image Analysis and Mathematical Morphology
3. Drainage Basin Morphology
4. French Morphology
We need morphological, syntactic, and semantic
information to find the right text/book.
Further applications: search engines, etc.
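The "morphology" example can be reproduced with a plain keyword index. The sketch below builds a small inverted index over the four titles from this slide and answers boolean AND queries; the function names are hypothetical, and this is a toy, not a real retrieval engine.

```python
import re
from collections import defaultdict

# The four titles from the slide.
TITLES = [
    "Principles of Polymer Morphology",
    "Image Analysis and Mathematical Morphology",
    "Drainage Basin Morphology",
    "French Morphology",
]


def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())


def build_index(titles):
    """Map each token to the set of title positions that contain it."""
    index = defaultdict(set)
    for doc_id, title in enumerate(titles):
        for token in tokenize(title):
            index[token].add(doc_id)
    return index


def search(index, titles, query):
    """Return the titles that contain every term of the query."""
    terms = tokenize(query)
    if not terms:
        return []
    matching = set.intersection(*(index.get(term, set()) for term in terms))
    return [titles[doc_id] for doc_id in sorted(matching)]


index = build_index(TITLES)
print(search(index, TITLES, "morphology"))              # all four titles
print(search(index, TITLES, "linguistics morphology"))  # []
```

The key word alone matches all four books, and adding "linguistics" matches none, because nothing tells the index that French Morphology is the linguistics title. That gap is what lexical and semantic information is meant to fill.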

Text Summarization
We need to be able to select the right
information from the electronic documents
available (esp. on the web).
Automatic text summarization is a
technique that can help people to quickly
grasp the concepts presented in a
document by creating an abstract or
summary of the original text.

Semantic Web
Some people (e.g. Evergreen U) are trying
to classify contents of web pages so that
they are meaningful to computers. But this
is not an easy task since the categories
must presumably be pre-selected by
people.
The semantic Web provides a common
framework that allows data to be shared
and reused across application, enterprise,
and community boundaries.
http://www.w3.org/2001/sw/

Ontology: Origins and History
Ontology in Philosophy
A philosophical discipline: a branch of
philosophy that deals with the nature and the
organisation of reality
Science of Being (Aristotle, Metaphysics, IV, 1)
Tries to answer the questions:
What characterizes being?
Eventually, what is being?

Ontology in Computer Science
An ontology is an engineering artifact:
It is constituted by a specific vocabulary
used to describe a certain reality, plus
a set of explicit assumptions regarding
the intended meaning of the vocabulary.
Thus, an ontology describes a formal specification of a
certain domain:
Shared understanding of a domain of
interest
Formal and machine manipulable model
of a domain of interest

How to use Lexical Ontologies
1. Ontology-based Information Extraction and
Ontology Population from Text
2. Ontology-based Question Answering
3. Natural Language Generation from Triples
4. Integration and publishing of legacy language
resources
5. Representation of Translations in the Web of
Data
6. Ontology-based Machine Translation
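To make item 3 concrete, here is a minimal sketch of verbalizing triples with sentence templates. The triples reuse the is-a and part-of examples from the earlier slides; the predicate names (is_a, part_of), the TEMPLATES table, and the naive a/an rule are all assumptions made for illustration and would not be enough for a real generation system.

```python
# Hypothetical sentence templates, one per predicate.
TEMPLATES = {
    "is_a": "{subject} is a kind of {object}.",
    "part_of": "{subject} is part of {object_with_article}.",
}


def with_article(noun):
    """Prefix a noun with 'a' or 'an' using a naive first-letter rule."""
    article = "an" if noun[0].lower() in "aeiou" else "a"
    return f"{article} {noun}"


def verbalize(triple):
    """Turn a (subject, predicate, object) triple into an English sentence."""
    subject, predicate, obj = triple
    sentence = TEMPLATES[predicate].format(
        subject=with_article(subject),
        object=obj,
        object_with_article=with_article(obj),
    )
    return sentence[0].upper() + sentence[1:]


TRIPLES = [
    ("car", "is_a", "vehicle"),
    ("finger", "part_of", "hand"),
    ("arm", "part_of", "body"),
]

for triple in TRIPLES:
    print(verbalize(triple))
```

The first-letter rule fails on words such as "SUV" (which takes "an"), a small reminder of why generation benefits from lexical information about each word.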

Conclusion
Lexical database development is a basic building block for Machine
Translation, Natural Language Processing and
Computational Linguistics
WordNet is one of the richest resources, and its structure can
be used to create new lexical databases for our languages
(Urdu/Persian/Arabic)
Ontologies can be used to add enhanced semantics to
lexical resources beyond the limits of databases because of
their nature and capability to describe things
