Annotated Bibliography

   Information Extraction and Natural Language

                                   Processing

                               Jun-ichi Tsujii
                         tsujii@is.s.u-tokyo.ac.jp
                    Department of Information Science
                           University of Tokyo


                    Bunkyo-ku, Tokyo 113-0033 JAPAN



The papers below are intended to illustrate the work that has taken place in
information extraction (IE) over the five years up to October 2000. Much of this work has
been influenced by the DARPA-sponsored Message Understanding Conferences (MUC), which
have subdivided the IE task into up to four distinct phases: named entity, coreference
resolution, template element and scenario template. This subdivision has shaped a generation
of IE systems, and it is reflected in the headings of the papers listed below as well.


The list is not complete, and there are undoubtedly several papers that have been
unintentionally left out. In particular, the list for the named entity task is oriented
towards systems that learn automatically from corpora. A list of papers on
more structure-oriented methods can be found in the list prepared by Dr. S. Ananiadou for the
tutorial.


(Note: ** Key paper - shows some technique that has influenced the development of
the field. * Recommended reading.)


Acknowledgements
I wish to thank the members of the GENIA project at the University of Tokyo. In
particular, the bibliography provided by Dr. N. Collier helped greatly in compiling this
annotated list.

1. General Introduction


(1-1)*Appelt, D., and Israel, D.(1999): Introduction to Information
Extraction Technology, Tutorial for IJCAI-99, (http://www.ai.sri.com/~appelt/
ie-tutorial/)

This is a good, easy-to-read survey of the current state of the art in the field. A list of useful
web sites is also given, covering publicly available linguistic resources, etc. A simple IE
toolkit is also available from the web site.


(1-2) Pazienza, M.(Ed.) (1999): Information Extraction, Lecture Notes in
Artificial Intelligence 1714, Springer

A collection of papers that show the breadth of the field, including future directions
(adaptability of IE systems, relationships with other fields like digital libraries, IR,
speech, etc.) and a description of a concrete system by the University of Rome.


(1-3) Grishman, R.(1998): Information Extraction and Speech Recognition,
in Proceedings of the Broadcast News Transcription and Understanding
Workshop (available from http://www.cs.nyu.edu/cs/projects/proteus/)

A short paper. While the second half is on IE and Speech, the first half of the paper is a
brief explanation of IE. If you need a short description of IE, this gives you a good
introduction to the field.


(1-4)*MUC-6, Proceedings of the Sixth Message Understanding Conference
(MUC-6), Morgan Kaufmann, San Mateo, CA.
(1-5)*MUC-7 (1998), Proceedings of the Seventh Message Understanding
Conference (MUC-7), published on the web site http://www.muc.saic.com/

The DARPA-sponsored Message Understanding Conferences (MUC) have greatly
affected the development of the field, including concrete task definitions, evaluation
methods, selection of domains, etc. From these proceedings, you can see how IE
systems work and how their performance is evaluated. The task definitions,
annotation scheme and evaluation methods of MUC-6 can be found on the MUC-6 web site
(http://cs.nyu.edu/cs/faculty/grishman/muc6).

2. Actual IE systems: MUC Systems


(2-1)*Appelt, D. et al. (1993), “FASTUS: a finite-state processor for
information extraction from real-world texts”, Proceedings of the
International Joint Conference on Artificial Intelligence.

The system developed by SRI, FASTUS, is one of the representative IE systems
developed in MUC. If you are interested in a more thorough description of FASTUS, see
Hobbs, J.R. et al.: FASTUS: A Cascaded Finite-State Transducer for Extracting
Information from Natural-Language Text, which appears in (8-5) and is also
available at http://www.ai.sri.com/~appelt/fastus-schabes.


(2-2)*Grishman, R. (1995), “The NYU system for MUC-6 or where's the
syntax?”, Proceedings of the 6th Message Understanding Conference, pp
167-176, DARPA.

Another representative of the current IE systems based on pattern-matching
techniques (with expressive power almost equivalent to that of finite-state machines). While in
FASTUS given patterns are expanded to surface patterns by macros, the NYU system has
a special input component that accepts surface examples and expands them, through
question-answering with a system designer, into generalized surface patterns.


The system works deterministically and uses hand-made patterns. However, the NYU
research group has applied various machine learning techniques to the named-entity
recognition task. Those are based on decision trees (5-4), Maximum Entropy (5-3), etc.


Their tool kit, which enables users to define a simple ontology of a domain and check
against actual corpora whether generated patterns work properly, is described in (Sekine,
S. et al.(1998): An Information Extraction System and a Customization Tool, in
Proceedings of the New Challenges in Natural Language Processing and its
Application, Tokyo, available from http://www.cs.nyu.edu/cs/projects/proteus/ ).


(2-3) Krupka, G. R. and Hausman, K. (1998), “IsoQuest Inc.: Description of
the NetOwl (tm) Extractor System as Used in MUC-7”, Proceedings of the 7th
Message Understanding Conference (MUC-7), DARPA.
(2-4) Srihari, R. (1998), “A Domain Independent Event Extraction Toolkit”,
AFRL-IF-RS-TR-1998-152 Final Technical Report, published by the Air
Force Research Laboratory, Information Directorate, Rome Research Site,
New York.
(2-5)*Srihari, R. and Li, W. (1999), “Information Extraction Supported
Question Answering”, Proceedings of TREC-8.

TREC (Text Retrieval Conference: http://trec.nist.gov/) is a twin brother of MUC under
the DARPA-sponsored TIPSTER text program. While TREC used to focus on conventional
Information Retrieval, it has started a new track called the QA (question answering) track.
The paper (2-5) describes an attempt to apply an IE system developed in the MUC
framework (Textract, developed by Cymfony Inc.) to the QA task.


There are several interesting comments in this paper. For example, the authors claim that
scenario templates (ST) in MUC were too domain specific and that they had to
redesign them as GE (General Event) templates in order to cope with the open-ended
nature of the QA task. A similar observation was made by Appelt in (1-1), though his
remark concerned the discrepancy between linguistic representations and the
output templates. FASTUS by SRI has an internal representation directly extracted
from texts, from which a post-processor generates the official MUC-6 templates.


3. IE Systems for Biology, Biomedicine, Biochemistry, etc.


(3-1) * Rindflesch, T. C. et al. (2000), “EDGAR: Extraction of Drugs, Genes
and Relations from the Biomedical Literature”, Proceedings of the Pacific
Symposium on Biocomputing (PSB'2000), Hawaii, USA, January.

Unlike most of the other systems, which were developed by NLP research groups, this
system was developed by a research group specializing in bioinformatics. Domain-specific
resources like UMLS are used effectively, together with special programs
like MetaMap, which maps noun phrases to UMLS semantic types. The terms that the
authors use in this paper are somewhat different from the standard terms of the NLP
community. Their “under-specified parser”, for example, is like a shallow parser or a
chunker in the NLP community. Some of the techniques illustrated look domain
specific and highly dependent on available resources and their properties. However, the
paper is full of insightful observations that have to be reflected in IE systems for this
domain (See also (7-4)).


(3-2) Craven, M. and Kumlien, J. (1999): Constructing Biological Knowledge
Bases by Extracting Information from Text Sources, in Proceedings of the 7th
International Conference on Intelligent Systems for Molecular Biology
(ISMB-99).

Assuming that a semantic lexicon (i.e. pairs of terms and their semantic classes) is given,
they train two kinds of models: Naïve Bayes and a relational learning
algorithm. While the Naïve Bayes model identifies sentences in abstracts that contain
relevant information, the relational learning algorithm identifies which phrases in the
identified sentences participate in relevant relationships. The system is not an IE
system, but a knowledge acquisition system for IE (See also Section 7).
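
The sentence-filtering step can be illustrated with a minimal sketch of a Naive Bayes
classifier over bags of words, with add-one smoothing. This is not the authors'
implementation: the function names, the toy training sentences and their labels are
all invented here for illustration.

```python
import math
from collections import Counter

def train_naive_bayes(labelled_sentences):
    """Count words per class. Input: (sentence, is_relevant) pairs (hypothetical data)."""
    word_counts = {True: Counter(), False: Counter()}
    class_counts = Counter()
    for sentence, label in labelled_sentences:
        class_counts[label] += 1
        word_counts[label].update(sentence.lower().split())
    vocabulary = set(word_counts[True]) | set(word_counts[False])
    return word_counts, class_counts, vocabulary

def is_relevant(sentence, model):
    """Return True if the 'relevant' class has the higher posterior log-score."""
    word_counts, class_counts, vocabulary = model
    total = sum(class_counts.values())
    scores = {}
    for label in (True, False):
        score = math.log(class_counts[label] / total)  # log prior
        denom = sum(word_counts[label].values()) + len(vocabulary)
        for word in sentence.lower().split():
            if word in vocabulary:  # words never seen in training are ignored
                score += math.log((word_counts[label][word] + 1) / denom)
        scores[label] = score
    return scores[True] > scores[False]

# Toy training set, invented for this sketch.
TRAIN = [
    ("protein A is located in the nucleus", True),
    ("kinase B localizes to the membrane", True),
    ("we thank the reviewers for comments", False),
    ("the study was funded by a grant", False),
]
model = train_naive_bayes(TRAIN)
print(is_relevant("protein C is located in the membrane", model))
```

In a setting like the one described, only the sentences accepted by such a filter would
be passed on to the second model, which decides which phrases take part in a relation.
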


(3-3) Proux, D., et al. (2000): A Pragmatic Information Extraction Strategy
for gathering Data on Genetic Interaction, in Proceedings of the 8th
International Conference on Intelligent Systems for Molecular Biology, La
Jolla, Calif., pp 279-285

A conventional IE system is applied to extract gene interactions. It uses POS-tagging
based on FSTs and an HMM, shallow parsing of local structures around verbs, and
knowledge-based processing using Sowa's Conceptual Graphs. The authors report that
complex linguistic structures in Medline abstracts, including nominalized events,
co-ordinations, etc., hamper the performance.


(3-4) Milward, T., et al. (2000): Automatic Extraction of Protein Interactions
from Scientific Abstracts, in Proceedings of Pacific Symposium on
Biocomputing, pp 538-549, World Scientific Press.
(3-5) Humphreys, K., et al. (2000): Two Applications of Information
Extraction to Biological Science Journal Articles: Enzyme Interactions and
Protein Structures, in Proceedings of Pacific Symposium on Biocomputing,
pp 72-80, World Scientific Press.

These two papers report preliminary results of applying IE toolkits to biology texts.
(3-4) is by SRI, Cambridge (Highlight), while (3-5) is by Sheffield University (GATE,
LaSIE). Because the experiments are preliminary, the details of their evaluation
methods are not given.


4. IE Systems with full parsers

Recently, there have been a few attempts to use full-fledged sentential parsing for
IE. Their authors claim that the later stages of IE, i.e. merging of templates, co-reference
recognition, etc., become convoluted without explicit recognition of sentence structures
(See (6-3)). It is also noted in (3-1) and (3-2) that sentences in abstracts of scientific
papers tend to have complex sentential structures like nested co-ordinations, which
cause difficulties for simple shallow parsing or pattern-matching techniques.
While (4-1) is a general-purpose IE system, (4-2) and (4-3) are applied to
biochemistry.


(4-1) Ciravegna, F. et al. (1999) “Full Text Parsing using cascades of Rules”,
in Proceedings of the Ninth Conference of the European Chapter of the
Association for Computational Linguistics (EACL99)

This is a paper from the European project FACILE. Their work is also described in (Black, W.,
et al. (1998): “FACILE: Description of the NE System Used for MUC-7”, in Proceedings
of the Seventh Message Understanding Conference (MUC-7)). An IE toolkit
(Pinocchio) has also been developed (http://ecate.itc.it:1025/cirave/pinocchio).


(4-2) Park, J. C., et al. (2001): Bi-directional Incremental Parsing for
Automatic Pathway Identification with Combinatory Categorial Grammar,
in Proceedings of PSB, Hawaii (to appear)

Many interesting examples are given to show why more linguistically
sound frameworks are required for IE in biochemistry applications. The authors focus on
extracting protein-protein interactions from texts. They use CCG (Combinatory
Categorial Grammar) as the grammar formalism, and improve the efficiency of full parsing
by restricting the analysis to structures around a set of designated verbs.


(4-3)*Yakushiji, A., et al. (2001): Event Extraction from Biomedical Papers
using a Full Parser, in Proceedings of PSB 2001, Hawaii
(http://www-tsujii.is.s.u-tokyo.ac.jp/) (to appear).

They use a special algorithm for efficient parsing with HPSG-like grammar formalisms
and devise a special method of extracting relevant information from partial parse
results, so that relevant information can still be extracted even when full sentential
parses are not obtained.


5. Named entity recognition using machine learning techniques

While techniques based on hand-made patterns had been dominant until MUC-5, there
was strong interest in applying machine learning techniques to the task in MUC-6
and MUC-7. Since the NE task in IE shares common techniques with term recognition,
a comprehensive list is given in the bibliography of Dr. S. Ananiadou. The
following list focuses on NE recognition in IE and, in particular, on systems using machine
learning techniques (See also Section 7).


(5-1)**Bikel, D. M., et al. (1997), “Nymble: a High-Performance Learning
Name-finder”, Proceedings of the Fifth Conference on Applied Natural
Language Processing, Morgan Kaufmann Publishers, pp. 194-201.

Among NE systems based on learning methods, the system in this paper showed the
best performance of its time. This is the first system that uses an HMM for the NE task.
The paper shows that an HMM based on a set of simple orthographic features gives a
remarkably good result, e.g. around 90% accuracy.
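
To give a feel for what "simple orthographic features" can mean, the following sketch
maps each token to one coarse word-shape class. The class names and the rule order are
assumptions made for this illustration and are not claimed to be the exact feature
set of the paper.

```python
import re

def word_feature(token):
    """Map a token to one coarse orthographic class (the first matching rule wins)."""
    if re.fullmatch(r"\d{2}", token):
        return "twoDigitNum"
    if re.fullmatch(r"\d{4}", token):
        return "fourDigitNum"
    if re.fullmatch(r"\d+", token):
        return "otherNum"
    if any(ch.isdigit() for ch in token) and any(ch.isalpha() for ch in token):
        return "containsDigitAndAlpha"
    if token.isalpha() and token.isupper():
        return "allCaps"
    if token[:1].isupper() and token[1:].islower():
        return "initCap"
    if token.islower():
        return "lowercase"
    return "other"

# Each token of a sentence is replaced by (or paired with) its class.
print([word_feature(t) for t in "IBM opened in Tokyo in 1998".split()])
```

A sequence model such as an HMM can then be trained on these classes together with the
words themselves, so that unseen words still carry some evidence about being a name.
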


(5-2) *Collier, N., Nobata, C., and Tsujii, J. (2000), “Extracting the Names of
Genes and Gene Products with a Hidden Markov Model”, Proceedings of the
18th International Conference on Computational Linguistics
(COLING-2000), Saarbrucken, Germany.

In this paper, HMM-based named entity recognition techniques were applied for the first
time to terms in the biochemical domain. The result shows that simple orthographic features work
fairly well in this domain as well, although the overall performance is not as good as in the
NE task in MUC. The same group applied a decision tree method to the same problem
with the same set of features, with results worse than those of the HMM (See: Nobata, C., et
al. (1999): “Automatic Term Identification and Classification in Biology Texts”, in
Proceedings of the 5th Natural Language Processing Pacific Rim Symposium, Beijing). The
results show that the NE task in the biology domain is much harder than in the MUC
domain, due to abundant use of multi-word expressions, abbreviations, complex
co-ordination within term expressions, etc.


(5-3)**Borthwick, A. et al. (1998), “Exploiting Diverse Knowledge Sources
via Maximum Entropy in Named Entity Recognition”, Proceedings of the
Sixth Workshop on Very Large Corpora, pp 152-160.
(5-4) Sekine, S., Grishman, R. and Shinnou, H. (1998), “A decision tree
method for finding and classifying names in Japanese texts”, Proceedings of
the Sixth Workshop on Very Large Corpora.

These two papers are by the NYU group. While a simple HMM does not allow multi-faceted
features, decision trees and ME (maximum entropy) can deal with them. In particular,
ME accepts a large set of features, from which it learns which features are relevant to
the task. ME seems to outperform other learning methods.


(5-5)*Collins, M. and Singer, Y. (1999), “Unsupervised Models for Named
Entity Classification”, Proceedings of the 1999 Joint SIGDAT Conference on
Empirical Methods in Natural Language Processing and Very Large
Corpora, University of Maryland, USA (http://www.research.att.com/~singer/).

Supervised learning methods tend to require substantial human effort in preparing
annotated corpora. This nullifies the advantage of trainable systems over the
knowledge engineering approach (See the discussion in (1-1) by Appelt). This paper
shows that a new technique called “Co-Training” can reduce the burden (Blum, A. and
Mitchell, T. (1998): Combining Labeled and Unlabeled Data with Co-Training, in
Proceedings of COLT-98, Madison, Wisconsin, http://www.cs.cmu.edu/~avrim/). Starting
with a small set of rules, the system learns semantic classifiers of PERSON,
ORGANIZATION and LOCATION, from unlabeled data.


(5-6) Fukuda, et al. (1998): “Toward Information extraction: Identifying
protein names from biological papers”, in Proc. of the Pacific Symposium on
Biocomputing 98 (PSB 98), Hawaii.

This is not a paper on NE recognition based on machine learning. However, the paper
shows for the first time that terms in biochemical domains, in particular protein
names, can be recognized by a set of simple heuristics. The performance of the system
was good, although evaluation methods had not yet been well established.


6. Coreference resolution

After creating templates of individual entities and events with their properties, an IE
system merges them to form integrated templates from which all kinds of information
about the same entities and events can be obtained. Coreference identification plays a
crucial role in this stage.

(6-1) Aone, C. and Bennett, S. W. (1995), “Evaluating automated and manual
acquisition of anaphora resolution strategies”, Proceedings of the 33rd
Annual Meeting of the Association for Computational Linguistics (ACL-95),
pp 122-129, Cambridge, MA, June.


(6-2) Hirschman, L. et al. (1997), “Automating Coreference: The Role of
Annotated Training Data”, Proceedings of the AAAI Spring Symposium on
Applying Machine Learning to Discourse Processing.


(6-3)*Kehler, A. (1997), “Probabilistic Coreference in Information
Extraction”, Proceedings of the 1997 Joint SIGDAT Conference on Empirical
Methods in Natural Language Processing and Very Large Corpora.

The paper discusses probabilistic models based on Maximum Entropy (ME) for
coreference resolution. It assumes that the first stage (entity and event
recognition) is performed by FASTUS, and it returns possible coreference relationships
with their ratings, which will be used by a downstream application system. The paper
also makes the interesting comment that the lack of reasonable linguistic representations
in FASTUS makes the coreference resolution task unnecessarily difficult.


(6-4) Kennedy, C. and Boguraev, B. (1996), “Anaphora for everyone:
Pronominal anaphora resolution without a parser”, Proceedings of the 16th
International Conference on Computational Linguistics (COLING-96).


7. Knowledge Acquisition and Ontology


(7-1)*Muslea, I. (1999): “Extraction Patterns for Information Extraction
Tasks: A Survey”, in Proceedings of The AAAI-99 Workshop on Machine
Learning for Information Extraction.

In order to build a pattern-based IE system, you have to prepare a set of patterns
manually. Most IE toolkits provide some devices that lessen this
manual effort, like macros in FASTUS, generalization of examples in the NYU system,
etc. However, these devices still require substantial human effort. The design of patterns
also assumes the existence of a proper domain ontology. Though there exist a few
domain-independent ontologies like Cyc, WordNet, EuroWordNet, EDR, etc., it is often the case
that these domain-independent ontologies are not very effective for IE. This is definitely
the case for IE systems for specific scientific texts.


Therefore, automatic acquisition of patterns and ontologies from texts has attracted
significant interest recently. While this paper is not comprehensive (e.g. it does not
cover the substantial amount of corpus-based research in computational linguistics such
as acquisition of sub-categorization frames, word clustering, etc.), it is a good
introduction to the field and shows how the two fields, KA (Knowledge Acquisition) and
IE, are now merging.


(7-2) Soderland, S., et al. (1995): CRYSTAL: Inducing a Conceptual
Dictionary, in Proceedings of the 14th International Joint Conference on
Artificial Intelligence (IJCAI 95).

CRYSTAL is a part of the IE system developed by the University of Massachusetts,
which is used by their language analyzer BADGER. While BADGER is domain
independent, the dictionary of CN (Concept Nodes), which CRYSTAL is to build from
an annotated corpus, contains domain specific information. The information in a CN is
essentially the same as patterns that are used by other IE systems based on pattern-
matching techniques.


CRYSTAL accepts annotated texts (annotated in terms of semantic classes of phrases)
and takes them as examples of patterns. It tries to generalize those examples into
more general patterns by using a hierarchy of concepts. In this paper, they use UMLS
as the semantic class hierarchy (See also Aseltine, J. (1999): WAVE: An Incremental
Algorithm for Information Extraction, in Proceedings of the AAAI Workshop on
Machine Learning for Information Extraction).

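
The idea of generalizing example patterns through a concept hierarchy can be sketched
as follows: two patterns that disagree on the semantic class of a slot are merged by
replacing the two classes with their lowest common ancestor. This is only an
illustration of the general idea, not CRYSTAL's actual induction algorithm; the toy
hierarchy, the slot names and the function names are invented and are not taken from UMLS.

```python
# Hypothetical toy hierarchy (child -> parent), invented for this sketch.
PARENT = {
    "Enzyme": "Protein",
    "Receptor": "Protein",
    "Protein": "Substance",
    "Drug": "Substance",
    "Substance": "Entity",
}

def ancestors(concept):
    """Return the concept followed by all of its ancestors, nearest first."""
    chain = [concept]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain

def lowest_common_ancestor(a, b):
    """Return the most specific concept that covers both a and b, or None."""
    seen = set(ancestors(a))
    for concept in ancestors(b):
        if concept in seen:
            return concept
    return None

def generalize(pattern1, pattern2):
    """Merge two patterns (slot -> semantic class) that have the same slots."""
    if pattern1.keys() != pattern2.keys():
        return None
    merged = {}
    for slot in pattern1:
        lca = lowest_common_ancestor(pattern1[slot], pattern2[slot])
        if lca is None:
            return None
        merged[slot] = lca
    return merged

example1 = {"subject": "Enzyme", "object": "Drug"}
example2 = {"subject": "Receptor", "object": "Drug"}
print(generalize(example1, example2))
```

A real learner also has to decide when to stop generalizing, for instance by checking
a merged pattern against the annotated corpus and rejecting it if it matches too many
negative examples.
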
(7-3) Riloff, E. and Jones, R. (1999): Learning Dictionaries for Information
Extraction by Multi-Level Bootstrapping, in Proceedings of the 16th National
Conference on Artificial Intelligence (AAAI-99).

CRYSTAL in (7-2) assumes that semantic classes are given before learning of patterns.
In this paper, they propose “mutual bootstrapping” that learns semantic classes and
patterns simultaneously.


(7-4) Rindflesch, et al. (1999): Mining Molecular Binding Terminology from
Biomedical Text, in Proceedings of AMIA-99.

This is a part of the project of (3-1). The program ARBITER recognizes terms that are
relevant to molecular binding relationships.


(7-4) Maynard, D., and Ananiadou, S. (2000): “Identifying Terms by their
Family and Friends”, in Proceedings of Coling 2000, Saarbrucken, Germany.

The first step in acquiring domain ontologies is to collect the terms of a given domain. In
particular, there are many multi-word terms in medicine and biology, which is one of
the major causes of difficulty in term recognition in these fields. The paper addresses
how to collect these multi-word terms based on the collocation distribution of words.




(7-5) Guarino, N., et al. (1995): Ontologies and Knowledge Bases: Towards a
Terminological Clarification, in Proceedings of Towards Very Large
Knowledge Bases, pp 25-32

Interest in ontology has emerged through discussions in various research fields such
as knowledge representation and sharing in Artificial Intelligence, multi-lingual
machine translation, data management, CALS, community software, etc. It has its
roots in philosophy. This paper discusses what ontology means, what ontology is and
what it is not. Guarino’s web site
(http://www.ladseb.pd.cnr.it/infor/people/Guarino.html) provides a comprehensive list of
useful sites in this field.


(7-6) Useful sites of Domain Independent Ontology:
(7-6-1) Cyc (http://www.cyc.com/): Encyclopedic knowledge base.
(7-6-2) WordNet (http://www.cogsci.princeton.edu/)
(7-6-3) EuroWordNet (http://www.hum.uva.nl/~ewn/)
(7-6-4) EDR (http://www.iijnet.or.jp/edr/): Lexical resources: mono-lingual dictionaries of
Japanese and English and a concept dictionary. The concept dictionary can be used as
a general ontology.
(7-6-5) Mikrokosmos (http://crl.nmsu.edu/Research/Projects/mikro/): Ontology for
knowledge-based MT




(7-7)*McEntire, R., Karp, P., et al. (2000): An Evaluation of Ontology
Exchange Languages for Bioinformatics, in Proceedings of 8th International
Conference on Intelligent Systems for Molecular Biology, La Jolla, pp
239-250

The languages for representation and exchange of ontologies are compared and
evaluated with respect to biology applications. Seven candidates, ASN.1, ODL, Onto,
OML/CKML, OPM, XML/RDF and UML, are evaluated in detail for biochemistry
applications.


(7-8)*Tateishi, Y. et al. (2000): Building an Annotated Corpus in the
Molecular-Biology Domain, in Proceedings of Workshop on Semantic
Annotation and Intelligent Content, Coling 2000, Saarbrucken, Germany, pp
28-34.

This is a paper on text annotation for biology texts, not ontology. However, the
semantic annotation assumes a certain background ontology of the field, which is now
being constructed by a group at the University of Tokyo
(http://www-tsujii.is.s.u-tokyo.ac.jp/).


8. Basic Techniques


(8-1) ** Rabiner, L. and Juang, B. (1986), “An introduction to hidden
Markov models”, IEEE ASSP Magazine, pp 4-16, January.

A standard introductory paper on HMMs.


(8-2) Quinlan, J.R. (1993): “C4.5: Programs for Machine Learning”, Morgan
Kaufmann Publishers, San Mateo, Calif.

A standard textbook on decision trees. The program C4.5 is available from
http://www.cse.unsw.edu.au/~quinlan/. The advanced version C5.0 is commercially
available.


(8-3) **Viterbi, A. J. (1967), “Error bounds for convolutional codes and an
asymptotically optimum decoding algorithm”, IEEE Trans. Information
Theory, IT-13(2), pp 260-269.

When the state transition probabilities and emission probabilities are given, a naive
way of finding the optimal transition path (the most probable path) involves a large
search space and is time-consuming. The Viterbi algorithm is an efficient algorithm for
solving this problem. The time complexity of this algorithm is O(TN^2), where T and N
are the length of the sequence and the number of states. A modified version of this
algorithm, which combines the original version with certain rules that constrain
legitimate sequences, is widely used.
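
As an illustration, here is a minimal sketch of the basic (unmodified) algorithm in
Python, working in log space to avoid numerical underflow. The two-state model, its
state names and all probabilities are invented for the example. The triple loop over
time steps, current states and previous states is where the O(TN^2) cost comes from.

```python
import math

def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return the most probable state sequence for the given observations."""
    if not observations:
        return []

    def log(p):
        return math.log(p) if p > 0 else float("-inf")

    # best[t][s]: log-probability of the best path that ends in state s at time t
    best = [{s: log(start_p[s]) + log(emit_p[s][observations[0]]) for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev, score = max(
                ((r, best[t - 1][r] + log(trans_p[r][s])) for r in states),
                key=lambda pair: pair[1],
            )
            best[t][s] = score + log(emit_p[s][observations[t]])
            back[t][s] = prev

    # Follow the back-pointers from the best final state.
    path = [max(states, key=lambda s: best[-1][s])]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path

# Toy model (all numbers are made up): tag tokens as NAME or OTHER
# from a coarse orthographic class of each token.
STATES = ["NAME", "OTHER"]
START = {"NAME": 0.3, "OTHER": 0.7}
TRANS = {"NAME": {"NAME": 0.5, "OTHER": 0.5}, "OTHER": {"NAME": 0.2, "OTHER": 0.8}}
EMIT = {
    "NAME": {"initCap": 0.9, "lowercase": 0.1},
    "OTHER": {"initCap": 0.1, "lowercase": 0.9},
}
print(viterbi(["lowercase", "initCap", "initCap", "lowercase"], STATES, START, TRANS, EMIT))
```

One simple way to approximate the constrained variant mentioned above is to set the
probability of every forbidden transition to zero, so that illegitimate sequences can
never be selected.
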


(8-4)**Pereira, F., et al. (1991): “Finite State Approximation of Phrase
Structure Grammars”, Proceedings of 29th Meeting of the Association for
Computational Linguistics, Berkeley, California, pp 246-255.

The discussion in this paper justified the use of FS (finite-state) methods instead of more
expressive frameworks in NLP, and has influenced the direction of NLP research. One of the
precursors of this direction was a discussion given by Church, K.W. (On Memory
Limitations in Natural Language Processing, MIT Laboratory for Computer Science
Technical Report MIT/LCS/TR-245, 1980).


(8-5)*Roche, E. and Schabes, Y. (eds.) (1997): Finite State Language
Processing, The MIT Press.

A good textbook on finite state techniques. The techniques and the ideas of CG
(Constraint Grammar) in this book are materialized as a commercial product, ENGCG
by Lingsoft, whose performance in tagging English texts is impressive. Their
web site is http://www.lingsoft.fi/cgi-bin/engcg. FST approximation of phrase structure
grammar by Pereira (8-4) also appears in this textbook.

More Related Content

What's hot

A Novel approach for Document Clustering using Concept Extraction
A Novel approach for Document Clustering using Concept ExtractionA Novel approach for Document Clustering using Concept Extraction
A Novel approach for Document Clustering using Concept ExtractionAM Publications
 
Computational Approaches to Systems Biology
Computational Approaches to Systems BiologyComputational Approaches to Systems Biology
Computational Approaches to Systems BiologyMike Hucka
 
from text and ontology : methodologies and tools - Text2Onto
from text and ontology : methodologies and tools - Text2Ontofrom text and ontology : methodologies and tools - Text2Onto
from text and ontology : methodologies and tools - Text2OntoRadhoueneRouached
 
AN APPROACH FOR IRIS PLANT CLASSIFICATION USING NEURAL NETWORK
AN APPROACH FOR IRIS PLANT CLASSIFICATION USING NEURAL NETWORKAN APPROACH FOR IRIS PLANT CLASSIFICATION USING NEURAL NETWORK
AN APPROACH FOR IRIS PLANT CLASSIFICATION USING NEURAL NETWORKijsc
 
Mahesh Joshi
Mahesh JoshiMahesh Joshi
Mahesh Joshibutest
 
A Semantic Retrieval System for Extracting Relationships from Biological Corpus
A Semantic Retrieval System for Extracting Relationships from Biological CorpusA Semantic Retrieval System for Extracting Relationships from Biological Corpus
A Semantic Retrieval System for Extracting Relationships from Biological Corpusijcsit
 
ICMLDA_poster.doc
ICMLDA_poster.docICMLDA_poster.doc
ICMLDA_poster.docbutest
 
A NAIVE METHOD FOR ONTOLOGY CONSTRUCTION
A NAIVE METHOD FOR ONTOLOGY CONSTRUCTIONA NAIVE METHOD FOR ONTOLOGY CONSTRUCTION
A NAIVE METHOD FOR ONTOLOGY CONSTRUCTIONijscai
 
Protein structure prediction by means
Protein structure prediction by meansProtein structure prediction by means
Protein structure prediction by meansijaia
 
Novel Database-Centric Framework for Incremental Information Extraction
Novel Database-Centric Framework for Incremental Information ExtractionNovel Database-Centric Framework for Incremental Information Extraction
Novel Database-Centric Framework for Incremental Information Extractionijsrd.com
 
Bioinformatics n bio-bio-1_uoda_workshop_4_july_2013_v1.0
Bioinformatics n bio-bio-1_uoda_workshop_4_july_2013_v1.0Bioinformatics n bio-bio-1_uoda_workshop_4_july_2013_v1.0
Bioinformatics n bio-bio-1_uoda_workshop_4_july_2013_v1.0Fokhruz Zaman
 
ONTOLOGY SERVICE CENTER: A DATAHUB FOR ONTOLOGY APPLICATION
ONTOLOGY SERVICE CENTER: A DATAHUB FOR  ONTOLOGY APPLICATION ONTOLOGY SERVICE CENTER: A DATAHUB FOR  ONTOLOGY APPLICATION
ONTOLOGY SERVICE CENTER: A DATAHUB FOR ONTOLOGY APPLICATION dannyijwest
 
A SURVEY ON QUESTION ANSWERING SYSTEMS: THE ADVANCES OF FUZZY LOGIC
A SURVEY ON QUESTION ANSWERING SYSTEMS: THE ADVANCES OF FUZZY LOGICA SURVEY ON QUESTION ANSWERING SYSTEMS: THE ADVANCES OF FUZZY LOGIC
A SURVEY ON QUESTION ANSWERING SYSTEMS: THE ADVANCES OF FUZZY LOGICcscpconf
 
Keywords- Based on Arabic Information Retrieval Using Light Stemmer
Keywords- Based on Arabic Information Retrieval Using Light Stemmer Keywords- Based on Arabic Information Retrieval Using Light Stemmer
Keywords- Based on Arabic Information Retrieval Using Light Stemmer IJCSIS Research Publications
 

What's hot (17)

A Novel approach for Document Clustering using Concept Extraction
A Novel approach for Document Clustering using Concept ExtractionA Novel approach for Document Clustering using Concept Extraction
A Novel approach for Document Clustering using Concept Extraction
 
Computational Approaches to Systems Biology
Computational Approaches to Systems BiologyComputational Approaches to Systems Biology
Computational Approaches to Systems Biology
 
from text and ontology : methodologies and tools - Text2Onto
from text and ontology : methodologies and tools - Text2Ontofrom text and ontology : methodologies and tools - Text2Onto
from text and ontology : methodologies and tools - Text2Onto
 
AN APPROACH FOR IRIS PLANT CLASSIFICATION USING NEURAL NETWORK
  • 2. 1. General Introduction (1-1)*Appelt, D., and Israel, D.(1999): Introduction to Information Extraction Technology, Tutorial for IJCAI-99, (http://www.ai.sri.com/~appelt/ ie-tutorial/) This is a good current state of the arts survey of the field, easy to read. A list of useful web sites is also given, such as publicly available linguistic resources, etc. A simple tool kit of IE is also available from the web site. (1-2) Pazienza, M.(Ed.) (1999): Information Extraction, Lecture Notes in Artificial Intelligence 1714, Springer A collection of papers which show the breadth of the field, including future directions (adaptability of IE systems, the relationships with other fields like digital library, IR, Speech, etc.) and description of a concrete system by the University of Rome. (1-3) Grishman, R.(1998): Information Extraction and Speech Recognition, in Proceedings of the Broadcast News Transcription and Understanding Workshop (available from http://www.cs.nyu.edu/cs/projects/proteus/) A short paper. While the second half is on IE and Speech, the first half of the paper is a brief explanation of IE. If you need a short description of IE, this gives you a good introduction to the field. (1-4)*MUC-6, Proceedings of the Sixth Message Understanding Conference (MUC-6), Morgan Kaufmann, San Meteo, CA. (1-5)*MUC-7 (1998), Proceedings of the Seventh Message Understanding Conference (MUC-7)}, published on the web site http://www.muc.saic.com/ The DARPA-sponsored Message Understanding Conferences (MUC) have greatly affected the development of the field, including concrete task definitions, evaluation methods, selection of domains, etc. From these proceedings, you can see how IE systems work and how their performances are evaluated. The task definitions, annotation scheme and evaluation methods of MUC-6 are found in the MUC-6 web site (http://cs.nyu.edu/cs/faculty/grishman/muc6).
  • 3. 2. Actual IE systems: MUC Systems (2-1)*Appelt, D. et al. (1993), “FASTUS: a finite-state processor for information extraction from real-world texts”, Proceedings of the International Joint Conference on Artificial Intelligence. The system developed by SRI, FASTUS, is one of the representative systems of IE developed in MUC. If you are interested in more thorough descriptions of FASTUS, see (Hobbs,J.R. et al.: FASTUS: A Cascaded Finite-State Transducer for Extracting Information from Natural-Language Text, which appear in (8-5). The article is also available at http://www.ai.sri.com/~appelt/fastus-schabes). (2-2)*Grishman, R. (1995), “The NYU system for MUC-6 or where's the syntax?”, Proceedings of the 6th Message Understanding Conference}, pp 167-176, DARPA. Another representative of the current IE systems based on pattern-matching techniques (the expressive power almost equivalent to finite-state machines). While in FASTUS given patterns are expanded to surface patterns by macros, their system has a special input system that accepts surface examples and expands them, through question-answering with a system designer, to generalized surface patterns. The system works deterministically and uses hand-made patterns. However, the NYU research group has tried various machine learning techniques to the named-entity recognition task. Those are based on decision trees (5-4), Maximum Entropy (5-5), etc. Their tool kit, which enables users to define a simple ontology of a domain and check with actual corpora whether generated patterns properly work, is described in (Sekine, S. et al.(1998): An Information Extraction System and a Customization Tool, in Proceedings of the New Challenges in Natural Language Processing and its Application, Tokyo, available from http://www.cs.nyu.edu/cs/projects/proteus/ ). (2-3) Krupka, G. R. and Hausman, K. 
(1998), “IsoQuest Inc.: Description of the NetOwl (tm) Extractor System as Used in MUC-7!”, Proceedings of 7th Message Understanding Conference (MUC-7)}, DARPA.
  • 4. (2-4) Srihari, R. (1998), “A Domain Independent Event Extraction Toolkit”, AFRL-IF-RS-TR-1998-152 Final Technical Report, published by the Air Force Research Laboratory, Information Directorate, Rome Research Site, New York. (2-5)*Srihari, R. and Li, W. (1999), “Information Extraction Supported Question Answering”, Proceedings of TREC-8. TREC (Text Retrieval Conference: http://trec.nist.gov/) is a twin brother of MUC under DARPA-sponsored TIPSTER text program. While TREC used to focus on conventional Information Retrieval, it has started a new track called QA (question answering) track. The paper (2-5) describes an attempt of using an IE system (Textract developed by Cymfony Inc.) in the MUC framework to the QA task. There are several interesting comments in this paper. For example, they claim that scenario templates (ST) in MUC were too domain specific and that the group had to redesign them for GE (General Event) templates in order to cope with open-ended nature of the QA task. A similar observation was made by Appelt in (1-1), though his remark was concerned with the discrepancy between linguistic representations and the output templates. FASTUS by SRI has an internal representation directly extracted from texts, from which a post processor generates the official MUC-6 templates. 3. IE Systems for Biology, Biomedicine, Biochemistry, etc. (3-1) * Rindflesch, T. C. et al. (2000), “EDGAR: Extraction of Drugs, Genes and Relations from the Biomedical Literature”, Proceedings of the Pacific Symposium on Bio-informatics (PSB'2000), Hawaii, USA, January. Unlike most of the other systems that were developed by NLP research groups, this system was developed by a research group specialized in Bioinformatics. Domain specific resources like UMLS, etc. are effectively used, together with special programs like MetaMap that maps noun phrases to UMLS semantic types. 
The terms that the authors used in this paper are somewhat different from standard terms of the NLP community. Their “under-specified parser”, for example, is like a shallow parser or a chunker in the NLP community. Some of the techniques illustrated look domain specific and highly dependent on available resources and their properties. However, the paper is full of insightful observations that have to be reflected in IE systems for this
  • 5. domain (See also (7-4)). (3-2) Craven, M. and Kumlian, J. (1999): Constructing Biological Knowledge Base by Extracting Information from Text Sources, in Proceedings of the 7th International Conference on Intelligent Systems for Molecular Biology (ISMB-99). Assuming that semantic lexicon (i.e. pairs of terms and their semantic classes) is given, they try to train their probabilistic models, Naïve Bayes and a relational learning algorithm. While the naïve Bayes model identifies sentences in abstracts that contain relevant information, the relational learning algorithm identifies which phrases in the identified sentences participate in relevant relationships. The system is not an IE system, but a knowledge acquisition system for IE (See also Section 7). (3-3) Proux, D., et al. (2000): A Pragmatic Information Extraction Strategy for gathering Data on Genetic Interaction, in proceedings of 8th International Conference on Intelligent Systems for Molecular Biology, La Jolla, Calif., pp 279 - 285 A conventional IE system is applied to extract gene interactions. It uses POS-tagging based on FST and HMM, shallow parsing of local structures around verbs, and knowledge-based processing by using Conceptual Graph of Sowa. They reported that complex linguistic structures including nominalized events, co-ordinations etc. in Medline abstracts hamper the performance. (3-4) Milward, T., et al. (2000): Automatic Extraction of Protein Interactions from Scientific Abstracts, in Proceedings of Pacific Symposium on Biocomputing, pp538-549, World Scientific Press. (3-5)Hamphrays, K., et al. (2000): Two Applications of Information Extraction to Biological Science Journal Articles: Enzyme Interactions and Protein Structures, in Proceedings of Pacific Symposium on Biocomputing, pp 72-80, World Scientific Press These two papers reported preliminary results of applying IE tool kits to biology texts. 
(3-4) is by SRI, Cambridge (Highlight) , while (3-5) is by Sheffield University (Gate, LaSIE). Due to the nature of preliminary experiments, the details of their evaluation
  • 6. methods are not given. 4. IE Systems with full parsers Recently, there have been a few attempts of using full-fledged sentential parsing for IE. They claim that the later stages of IE i.e. merging of templates, co-reference recognition, etc. become convoluted without explicit recognition of sentence structures (See (6-3)). It is also noted in (3-1) and (3-2) that sentences in abstracts of scientific papers tend to have complex sentential structures like nested co-ordinations, which cause difficulties on simple shallow parsing or pattern-matching techniques. While (4-1) is a general purpose IE system, (4-2) and (4-3) are applied to Biochemistry fields. (4-1) Ciravegna,F. et al. (1999) “Full Text Parsing using cascades of Rules”, in Proceedings. of the Ninth Conference of the European Chapter of the Association for Computational Linguistics (EACL99) This is a paper by a European project FACILE. Their work is also found in (Black, W., et al. (1998): “FACILE: Description of the NE System Used for MUC-7", in Proceedings of Message Understanding Conference Proceedings (MUC-7)). A tool kit of IE (Pinocchio) has also been developed (http://ecate.itc.it:1025/cirave/pinocchio). (4-2) Park, J. C., et al. (2001): Bi-directional Incremental Parsing for Automatic Pathway Identification with Combinatory Categorical Grammar, in Proceeding of PSB, Hawaii (to appear) Many interesting examples are given to show the reasons why more linguistically sound frameworks are required for IE in Bio-chemistry application. They focus on extracting protein-protein interactions from texts. Efficiency of full parsing based on CUG is improved by restricting the analysis to structures around a set of designated verbs. They use CUG (Categorical Unification Grammar) as the grammar formalism. (4-3)*Yakushiji, A., et al. (2001): Event Extraction from Biomedical Papers using a Full Parser, in Proceedings of PSB 2001, Hawaii http://www- tsujii.is.s.u-tokyo.ac.jp/)(to appear).
  • 7. They use a special algorithm for efficient parsing for HPSG-like grammar formalisms and devise a special method of extracting relevant information from partial parse results. Even if full sentential parses are not obtained, relevant information is to be extracted from partial parse results. 5. Named entity recognition using machine learning techniques While techniques based on hand-made patterns had been dominant till MUC-5, there were a strong interest in applying machine learning techniques to the task in MUC-6 and MUC-7. Since the NE task in IE shares common techniques with term recognition, a comprehensive bibliography is given by the bibliography of Dr. S.Ananiadou. The following list focuses on NE recognition in IE and particular, those using machine learning techniques (See also Section 7). (5-1)**Bikel, D. M., et al. (1997), “Nymble: a High-Performance Learning Name-finder”, Proceedings of the Fifth Conference on Applied Natural Language Processing, Morgan Kaufmann Publishers, pp. 194-201. Among NE systems based on learning methods, the system in this paper shows the best performance of the time. This is the first system that uses HMM for the NE task. The paper shows that HMM based on a set of simple orthographic features gives a remarkably good result, e.g. around 90 % accuracy. (5-2) *Collier, N., Nobata, C., and Tsujii, J. (2000), “Extracting the Names of Genes and Gene Products with a Hidden Markov Model”, Proceedings of the 18th International Conference on Computational Linguistics (COLING-2000), Saarbrucken, Germany. The named entity recognition techniques based on HMM were first applied to terms in the bio-chemical domain. The result shows that simple orthographical features work fairy well in this domain as well, while the overall performance is not as good as the NE task in MUC. 
The same group applied a decision tree method to the same problem with the same set of features, the result of which is less than HMM (See: Nobata, C., et al.(1999): “Automatic Term Identification and Classification in Biology Texts”, in Proceeding. of 5th Natural Language Processing Pacific Rim Symposium, Beijing). The results show that the NE task in the biology domain is much harder than the MUC
  • 8. domain, due to abundant uses of multi-word expressions, abbreviations, complex co- ordination within term expressions, etc. (5-3)**Borthwick, A. et al. (1998), “Exploiting Diverse Knowledge Sources via Maximum Entropy in Named Entity Recognition”, Proceedings of the Sixth Workshop on Very Large Corpora, pp 152-160. (5-4) Sekine, S., Grishman, R. and Shinou, H. (1998), “A decision tree method for finding and classifying names in Japanese texts”, Proceedings of the Sixth Workshop on Very Large Corpora. These two papers are by the NYU group. While simple HMM does not allow multi-facet features, decision trees and ME (maximum entropy) can deal with them. In particular, ME accept a large set of features, from which it learns which features are relevant to the task. ME seem to outperform other learning methods. (5-5)*Collins, M. and Singer, Y. (1999), “Unsupervised Models for Named Entity Classification”, Proceedings of the 1999 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora, University of Maryland, USA (http://www.research.att.com/~singer/). Supervised learning methods tend to require substantial human effort of preparing annotated corpora. This nullifies the advantage of trainable systems over the knowledge engineering approach (See the discussion in (1-1) by Appelt). This paper shows that a new technique called “Co-Training” can reduce the burden (Blum, A. and Mitchell, T. (1998): Combining Labeled and Unlabeled Data with Co-Training, in Proceedings of COLT-98, Madison, Wisconsin, http://www.cs.cmu.edu/~avrim/). Starting with a small set of rules, the system learns semantic classifiers of PERSON, ORGANIZATION and LOCATION, from unlabeled data. (5-6) Fukuda, et al., (1999): “Toward Information extraction: Identifying protein names from biological papers”, in Proc. of the Pacific Symposium on Biocomputing 98 (PSB 98), Hawaii. This is not a paper of NE recognition based on machine learning. 
However, the paper shows for the first time that terms in bio-chemical domains, in particular, protein names can be recognized by a set of simple heuristics. The performance of the system
  • 9. was good, while the evaluation method had not been well established. 6. Coreference resolution After creating templates of individual entities and events with their properties, an IE system merges them to form integrated templates from which all kinds of information of the same entities and events are to be obtained. Coreference identification plays a crucial role in this stage. (6-1) Aone, C. and Bennet, S. W. (1995), “ Evaluating automated and manual acquisition of anaphora resoluation strategies”, Proceedings of the 33rd Annual Meeting of the Association for Computational Linguistics (ACL-95), pp 122-129, Cambridge, MA, June. (6-2) Hirschman, L. et al. (1997), “Automating Coreference: The Role of Annotated Training Data”, Proceedings of the AAAI Spring Symposium on Applying Machine Learning to Discourse Processing. (6-3)*Kehler, A. (1997), “Probabilistic Coreference in Information Extraction”, Proceedings of the 1997 Joint SIGDAT Conference on Empirical Methods in Natural Language Processing and Very Large Corpora. The paper discusses probabilistic models based on Maximum Entropy (ME) for coreference resolution. The paper assumes that the first stage (entity and event recognition) is performed by FASTUS and returns possible coreference relationships with their ratings, which will be used by a downstream application system. An interesting comment that lack of reasonable linguistic representations in FASTUS makes the coreference resolution task unnecessary difficult is found in this paper. (6-4) Kennedy, C. and Boguraev, B. (1996), “Anaphora for everyone: Pronominal anaphora resolution without a parser”, Proceedings of the 16th International Conference on Computational Linguistics (COLING-96). 7. Knowledge Acquisition and Ontology (7-1)*Muslea, I. (1999): “Extraction Patterns for Information Extraction Tasks: A Survey”, in Proceedings of The AAAI-99 Workshop on Machine
Learning for Information Extraction.

In order to build a pattern-based IE system, you have to prepare a set of patterns manually. Most IE tool-kits provide some devices that lessen this manual effort, such as macros in FASTUS, generalization of examples in the NYU system, etc. However, these devices still require substantial human effort. The design of patterns also assumes the existence of a proper domain ontology. Though there exist a few domain-independent ontologies such as Cyc, WordNet, EuroWordNet, EDR, etc., it is often the case that these domain-independent ontologies are not very effective for IE. This is definitely the case for IE systems for specific scientific texts. Therefore, automatic acquisition of patterns and ontologies from texts has attracted significant interest recently. While this paper is not comprehensive (e.g., it does not cover the substantial amount of corpus-based research in computational linguistics, such as acquisition of sub-categorization frames, word clustering, etc.), it is a good introduction to the field and shows how the two fields, KA (Knowledge Acquisition) and IE, are now merging.

(7-2) Soderland, S., et al. (1995): CRYSTAL: Inducing a Conceptual Dictionary, in Proceedings of the 14th International Joint Conference on Artificial Intelligence (IJCAI 95).

CRYSTAL is a part of the IE system developed at the University of Massachusetts; the dictionary it builds is used by their language analyzer BADGER. While BADGER is domain independent, the dictionary of CNs (Concept Nodes), which CRYSTAL builds from an annotated corpus, contains domain-specific information. The information in a CN is essentially the same as the patterns used by other IE systems based on pattern-matching techniques. CRYSTAL accepts annotated texts (annotated in terms of semantic classes of phrases) and takes them as examples of patterns. It tries to generalize those examples into more general patterns by using a hierarchy of concepts. In this paper, UMLS is used as the semantic class hierarchy. [See also (Aseltine, J. (1999): WAVE: An Incremental Algorithm for Information Extraction, in Proceedings of the AAAI Workshop on Machine Learning for Information Extraction).]
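The idea of generalizing example patterns through a concept hierarchy, as described for (7-2), can be illustrated with a small sketch. This is not CRYSTAL's actual induction algorithm: the hierarchy, the class names, the slot names and the functions below are all hypothetical, and the sketch only shows the single step of relaxing two examples to their least common ancestor classes. A real system would need much more, for instance a check of the generalized pattern against the training corpus.

```python
# Toy concept hierarchy (child -> parent). All class names are hypothetical;
# they are NOT taken from UMLS.
PARENT = {
    "entity": None,
    "finding": "entity",
    "symptom": "finding",
    "disease": "finding",
    "body_part": "entity",
}


def ancestors(concept):
    """Return the chain from a concept up to the root, the concept itself first."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def least_common_ancestor(a, b):
    """Return the most specific class that subsumes both a and b."""
    seen = set(ancestors(b))
    for concept in ancestors(a):
        if concept in seen:
            return concept
    raise ValueError("concepts do not share a root")


def generalize(example1, example2):
    """Merge two example patterns (slot -> semantic class) into one pattern.

    Only slots present in both examples are kept; each class constraint is
    relaxed to the least common ancestor of the two example classes.
    """
    return {
        slot: least_common_ancestor(example1[slot], example2[slot])
        for slot in example1
        if slot in example2
    }


# Two annotated examples, given as hypothetical slot/class pairs.
ex1 = {"subject": "symptom", "object": "body_part"}
ex2 = {"subject": "disease", "object": "body_part"}
general_pattern = generalize(ex1, ex2)
print(general_pattern)
```

Here the two examples differ only in the class of the subject slot, so the merged pattern keeps "body_part" for the object and relaxes the subject to the shared parent class "finding".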
(7-3) Riloff, E. and Jones, R. (1999): Learning Dictionaries for Information Extraction by Multi-Level Bootstrapping, in Proceedings of the 16th National Conference on Artificial Intelligence (AAAI-99).

CRYSTAL in (7-2) assumes that semantic classes are given before patterns are learned. This paper proposes “mutual bootstrapping”, which learns semantic classes and patterns simultaneously.

(7-4) Rindflesch, et al. (1999): Mining Molecular Binding Terminology from Biomedical Text, in Proceedings of AMIA-99.

This is a part of the project of (3-1). The program ARBITER recognizes terms that are relevant to molecular binding relationships.

(7-4) Maynard, D., and Ananiadou, S. (2000): “Identifying Terms by their Family and Friends”, in Proceedings of Coling 2000, Saarbrucken, Germany.

The first step in acquiring a domain ontology is to collect the terms of the given domain. In particular, there are many multi-word terms in medicine and biology, which is one of the major causes of difficulty in term recognition in these fields. The paper addresses how to collect these multi-word terms based on the collocation distribution of words.

(7-5) Guarino, N., et al. (1995): Ontologies and Knowledge Bases: Towards a Terminological Clarification, in Proceedings of Towards Very Large Knowledge Bases, pp 25-32.

Interest in ontology has emerged through discussions in various research fields such as knowledge representation and sharing in Artificial Intelligence, multi-lingual machine translation, data management, CALS, community software, etc. It also has its roots in philosophy. This paper discusses what ontology means, what ontology is and what it is not. Guarino’s web site (http://www.ladseb.pd.cnr.it/infor/people/Guarino.html) provides a comprehensive list of useful sites in this field.

(7-6) Useful sites of Domain Independent Ontology:
(7-6-1) Cyc (http://www.cyc.com/): Encyclopedic knowledge base.
(7-6-2) WordNet (http://www.cogsci.princeton.edu/)
(7-6-3) EuroWordNet (http://www.hum.uva.nl/~ewn/)
(7-6-4) EDR (http://www.iijnet.or.jp/edr/): Lexical resources: mono-lingual dictionaries of Japanese and English and a concept dictionary. The concept dictionary can be used as a general ontology.
(7-6-5) Mikrokosmos (http://crl.nmsu.edu/Research/Projects/mikro/): Ontology for knowledge-based MT.

(7-7)*McEntire, R., Karp, P., et al. (2000): An Evaluation of Ontology Exchange Languages for Bioinformatics, in Proceedings of 8th International Conference on Intelligent Systems for Molecular Biology, La Jolla, pp 239-250.

Languages for the representation and exchange of ontologies are compared and evaluated with respect to biology applications. Seven candidates, ASN.1, ODL, Onto, OML/CKML, OPM, XML/RDF and UML, are evaluated in detail for a biochemistry application.

(7-8)*Tateishi, Y. et al. (2000): Building an Annotated Corpus in the Molecular-Biology Domain, in Proceedings of Workshop on Semantic Annotation and Intelligent Content, Coling 2000, Saarbrucken, Germany, pp 28-34.

This is a paper on text annotation for biology texts, not on ontology. However, the semantic annotation assumes a certain background ontology of the field, which is now being constructed by a group at the University of Tokyo (http://www-tsujii.is.s.u-tokyo.ac.jp/).


8. Basic Techniques

(8-1) ** Rabiner, L. and Juang, B. (1986), “An introduction to hidden
Markov models”, IEEE ASSP Magazine, pp 4-16, January.

A standard introductory paper on HMMs.

(8-2) Quinlan, J.R. (1993): “C4.5: Programs for Machine Learning”, Morgan Kaufmann Publishers, San Mateo, Calif.

A standard textbook on decision trees. The program C4.5 is available from http://www.cse.unsw.edu.au/~quinlan/. The advanced version C5.0 is commercially available.

(8-3) **Viterbi, A. J. (1967), “Error bounds for convolutional codes and an asymptotically optimum decoding algorithm”, IEEE Trans. Information Theory, IT-13(2), pp 260-269.

When the state transition probabilities and the emission probabilities are given, a naive way of finding the optimal transition path (the most probable path) involves a large search space and is time-consuming. The Viterbi algorithm is an efficient algorithm for solving this problem. The time complexity of this algorithm is O(TN^2), where T and N are the length of the sequence and the number of states, respectively. A modified version of this algorithm, which combines the original version with certain rules that constrain legitimate sequences, is widely used.

(8-4)**Pereira, F., et al. (1991): “Finite State Approximation of Phrase Structure Grammars”, Proceedings of 29th Meeting of the Association for Computational Linguistics, Berkeley, California, pp 246-255.

The discussion in this paper justified the use of FS (finite state) frameworks instead of more expressive frameworks in NLP, and has influenced the direction of NLP research. One of the precursors of this direction was the discussion given by Church, K.W. (On Memory Limitations in Natural Language Processing, MIT Laboratory of Computer Science Technical Report MIT/LCS/TR-245, 1980).

(8-5)*Roche, E. and Schabes, Y. (eds.) (1997): Finite State Language Processing, The MIT Press.
A good textbook on finite state techniques. The techniques and ideas of CG (Constraint Grammar) in this book have been realized as a commercial product, ENGCG of Lingsoft, whose performance in tagging English texts is impressive. Their web site is http://www.lingsoft.fi/cgi-bin/engcg. The FST approximation of phrase structure grammar by Pereira (8-4) also appears in this textbook.
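As an illustration of the Viterbi algorithm mentioned in (8-3), the following is a minimal sketch for a discrete HMM, not code taken from any of the papers listed. The function name, the dictionary-based model layout and the toy two-state model are assumptions made for this example. The two nested loops over states inside the loop over time steps are where the O(TN^2) cost comes from.

```python
def viterbi(observations, states, start_p, trans_p, emit_p):
    """Return (most probable state path, its probability) for a discrete HMM.

    start_p[s]    : probability of starting in state s
    trans_p[r][s] : probability of moving from state r to state s
    emit_p[s][o]  : probability of emitting observation o in state s
    """
    if not observations:
        return [], 1.0

    # best[t][s]: probability of the best path that ends in state s at time t
    best = [{s: start_p[s] * emit_p[s][observations[0]] for s in states}]
    # back[t][s]: predecessor of s on that best path
    back = [{}]

    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            # Pick the predecessor that maximizes the path probability.
            prev = max(states, key=lambda r: best[t - 1][r] * trans_p[r][s])
            best[t][s] = (
                best[t - 1][prev] * trans_p[prev][s] * emit_p[s][observations[t]]
            )
            back[t][s] = prev

    # Follow the back pointers from the best final state.
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]


# Toy model, purely illustrative: hidden health state, observed symptom.
STATES = ["Healthy", "Fever"]
START = {"Healthy": 0.6, "Fever": 0.4}
TRANS = {
    "Healthy": {"Healthy": 0.7, "Fever": 0.3},
    "Fever": {"Healthy": 0.4, "Fever": 0.6},
}
EMIT = {
    "Healthy": {"normal": 0.5, "cold": 0.4, "dizzy": 0.1},
    "Fever": {"normal": 0.1, "cold": 0.3, "dizzy": 0.6},
}

best_path, best_prob = viterbi(["normal", "cold", "dizzy"], STATES, START, TRANS, EMIT)
print(best_path, best_prob)
```

For long sequences, products of probabilities underflow, so a practical implementation would usually add log probabilities instead of multiplying. The rule-constrained variant mentioned in (8-3) could be approximated in this sketch by setting the transition probability of every illegitimate state pair to zero.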