Information Extraction and Natural Languag
Department of Information Science
University of Tokyo
Bunkyo-ku, Tokyo 113-0033 JAPAN
The papers below are intended to illustrate the work that has taken place in
information extraction in the last five years, as of October 2000. Much of this work has
been influenced by the DARPA-sponsored Message Understanding Conferences, which
have subdivided the IE task into up to four distinct phases: named entity, coreference
resolution, template element and scenario template. This subdivision has influenced a
generation of IE systems, and it is reflected in the headings of the papers listed below as well.
The list is not complete and there are undoubtedly several papers that have been
unintentionally missed out. In particular, the list for the named entity task is oriented
towards systems that learn automatically from corpora. You can find papers on
more structure-oriented methods in the list prepared by Dr. S. Ananiadou for the
(Note: ** Key paper - shows some technique that has influenced the development of
the field. * Recommended reading.)
I wish to thank the members of the GENIA project at the University of Tokyo. In
particular, the bibliography provided by Dr. N. Collier helped greatly to compile this
1. General Introduction
(1-1)*Appelt, D., and Israel, D.(1999): Introduction to Information
Extraction Technology, Tutorial for IJCAI-99, (http://www.ai.sri.com/~appelt/
This is a good, easy-to-read survey of the current state of the art in the field. A list of
useful web sites, such as those offering publicly available linguistic resources, is also
given. A simple IE tool kit is also available from the web site.
(1-2) Pazienza, M.(Ed.) (1999): Information Extraction, Lecture Notes in
Artificial Intelligence 1714, Springer
A collection of papers which show the breadth of the field, including future directions
(adaptability of IE systems, the relationships with other fields like digital library, IR,
Speech, etc.) and description of a concrete system by the University of Rome.
(1-3) Grishman, R.(1998): Information Extraction and Speech Recognition,
in Proceedings of the Broadcast News Transcription and Understanding
Workshop (available from http://www.cs.nyu.edu/cs/projects/proteus/)
A short paper. While the second half is on IE and Speech, the first half of the paper is a
brief explanation of IE. If you need a short description of IE, this gives you a good
introduction to the field.
(1-4)*MUC-6, Proceedings of the Sixth Message Understanding Conference
(MUC-6), Morgan Kaufmann, San Mateo, CA.
(1-5)*MUC-7 (1998), Proceedings of the Seventh Message Understanding
Conference (MUC-7), published on the web site http://www.muc.saic.com/
The DARPA-sponsored Message Understanding Conferences (MUC) have greatly
affected the development of the field, including concrete task definitions, evaluation
methods, selection of domains, etc. From these proceedings, you can see how IE
systems work and how their performance is evaluated. The task definitions,
annotation scheme and evaluation methods of MUC-6 can be found on the MUC-6 web site
2. Actual IE systems: MUC Systems
(2-1)*Appelt, D. et al. (1993), “FASTUS: a finite-state processor for
information extraction from real-world texts”, Proceedings of the
International Joint Conference on Artificial Intelligence.
FASTUS, the system developed by SRI, is one of the representative IE systems
developed in MUC. If you are interested in a more thorough description of FASTUS, see
Hobbs, J.R. et al.: FASTUS: A Cascaded Finite-State Transducer for Extracting
Information from Natural-Language Text, which appears in (8-5). The article is also
available at http://www.ai.sri.com/~appelt/fastus-schabes.
(2-2)*Grishman, R. (1995), “The NYU system for MUC-6 or where's the
syntax?”, Proceedings of the 6th Message Understanding Conference, pp
Another representative of the current IE systems based on pattern-matching
techniques (with expressive power almost equivalent to that of finite-state machines).
While in FASTUS given patterns are expanded to surface patterns by macros, the NYU
system has a special input system that accepts surface examples and expands them,
through question-answering with a system designer, into generalized surface patterns.
The system works deterministically and uses hand-made patterns. However, the NYU
research group has also applied various machine learning techniques to the named-entity
recognition task. Those are based on decision trees (5-4), Maximum Entropy (5-3), etc.
Their tool kit, which enables users to define a simple ontology of a domain and to check
against actual corpora whether the generated patterns work properly, is described in
(Sekine, S. et al. (1998): An Information Extraction System and a Customization Tool, in
Proceedings of the New Challenges in Natural Language Processing and its
Application, Tokyo, available from http://www.cs.nyu.edu/cs/projects/proteus/ ).
(2-3) Krupka, G. R. and Hausman, K. (1998), “IsoQuest Inc.: Description of
the NetOwl (tm) Extractor System as Used in MUC-7”, Proceedings of 7th
Message Understanding Conference (MUC-7), DARPA.
(2-4) Srihari, R. (1998), “A Domain Independent Event Extraction Toolkit”,
AFRL-IF-RS-TR-1998-152 Final Technical Report, published by the Air
Force Research Laboratory, Information Directorate, Rome Research Site,
(2-5)*Srihari, R. and Li, W. (1999), “Information Extraction Supported
Question Answering”, Proceedings of TREC-8.
TREC (Text Retrieval Conference: http://trec.nist.gov/) is a sibling of MUC under the
DARPA-sponsored TIPSTER text program. While TREC used to focus on conventional
Information Retrieval, it has started a new track called the QA (question answering) track.
The paper (2-5) describes an attempt to apply an IE system built in the MUC framework
(Textract, developed by Cymfony Inc.) to the QA task.
There are several interesting comments in this paper. For example, the authors claim that
scenario templates (ST) in MUC were too domain specific and that the group had to
redesign them as GE (General Event) templates in order to cope with the open-ended
nature of the QA task. A similar observation was made by Appelt in (1-1), though his
remark was concerned with the discrepancy between linguistic representations and the
output templates. FASTUS by SRI has an internal representation directly extracted
from texts, from which a post processor generates the official MUC-6 templates.
3. IE Systems for Biology, Biomedicine, Biochemistry, etc.
(3-1) * Rindflesch, T. C. et al. (2000), “EDGAR: Extraction of Drugs, Genes
and Relations from the Biomedical Literature”, Proceedings of the Pacific
Symposium on Biocomputing (PSB'2000), Hawaii, USA, January.
Unlike most of the other systems, which were developed by NLP research groups, this
system was developed by a research group specializing in bioinformatics. Domain-
specific resources such as UMLS are used effectively, together with special programs
such as MetaMap, which maps noun phrases to UMLS semantic types. The terms that the
authors used in this paper are somewhat different from standard terms of the NLP
community. Their “under-specified parser”, for example, is like a shallow parser or a
chunker in the NLP community. Some of the techniques illustrated look domain
specific and highly dependent on available resources and their properties. However, the
paper is full of insightful observations that have to be reflected in IE systems for this
domain (See also (7-4)).
(3-2) Craven, M. and Kumlien, J. (1999): Constructing Biological Knowledge
Bases by Extracting Information from Text Sources, in Proceedings of the 7th
International Conference on Intelligent Systems for Molecular Biology
Assuming that a semantic lexicon (i.e. pairs of terms and their semantic classes) is given,
the authors train two models: a Naïve Bayes classifier and a relational learning
algorithm. While the Naïve Bayes model identifies sentences in abstracts that contain
relevant information, the relational learning algorithm identifies which phrases in the
identified sentences participate in relevant relationships. The system is not an IE
system in itself, but a knowledge acquisition system for IE (See also Section 7).
(3-3) Proux, D., et al. (2000): A Pragmatic Information Extraction Strategy
for gathering Data on Genetic Interaction, in proceedings of 8th
International Conference on Intelligent Systems for Molecular Biology, La
Jolla, Calif., pp 279 - 285
A conventional IE system is applied to extract gene interactions. It uses POS-tagging
based on FST and HMM, shallow parsing of local structures around verbs, and
knowledge-based processing using Sowa's Conceptual Graphs. The authors report that
complex linguistic structures in Medline abstracts, including nominalized events,
co-ordinations, etc., hamper the performance.
(3-4) Milward, T., et al. (2000): Automatic Extraction of Protein Interactions
from Scientific Abstracts, in Proceedings of Pacific Symposium on
Biocomputing, pp538-549, World Scientific Press.
(3-5) Humphreys, K., et al. (2000): Two Applications of Information
Extraction to Biological Science Journal Articles: Enzyme Interactions and
Protein Structures, in Proceedings of Pacific Symposium on Biocomputing,
pp 72-80, World Scientific Press
These two papers report preliminary results of applying IE tool kits to biology texts.
(3-4) is by SRI, Cambridge (Highlight), while (3-5) is by Sheffield University (Gate,
LaSIE). Since the experiments are preliminary, the details of their evaluation
methods are not given.
4. IE Systems with full parsers
Recently, there have been a few attempts to use full-fledged sentential parsing for
IE. The authors claim that the later stages of IE, i.e. merging of templates, co-reference
recognition, etc., become convoluted without explicit recognition of sentence structures
(See (6-3)). It is also noted in (3-1) and (3-2) that sentences in abstracts of scientific
papers tend to have complex sentential structures like nested co-ordinations, which
cause difficulties for simple shallow parsing or pattern-matching techniques.
While (4-1) is a general purpose IE system, (4-2) and (4-3) are applied to
(4-1) Ciravegna, F. et al. (1999) “Full Text Parsing using cascades of Rules”,
in Proceedings of the Ninth Conference of the European Chapter of the
Association for Computational Linguistics (EACL99)
This is a paper from the European project FACILE. Their work is also described in (Black, W.,
et al. (1998): “FACILE: Description of the NE System Used for MUC-7”, in Proceedings
of the Seventh Message Understanding Conference (MUC-7)). An IE tool kit
(Pinocchio) has also been developed (http://ecate.itc.it:1025/cirave/pinocchio).
(4-2) Park, J. C., et al. (2001): Bi-directional Incremental Parsing for
Automatic Pathway Identification with Combinatory Categorial Grammar,
in Proceedings of PSB, Hawaii (to appear)
Many interesting examples are given to show why more linguistically
sound frameworks are required for IE in biochemistry applications. The authors focus on
extracting protein-protein interactions from texts. They use CCG (Combinatory
Categorial Grammar) as the grammar formalism, and improve the efficiency of full
parsing by restricting the analysis to structures around a set of designated verbs.
(4-3)*Yakushiji, A., et al. (2001): Event Extraction from Biomedical Papers
using a Full Parser, in Proceedings of PSB 2001, Hawaii http://www-
They use a special algorithm for efficient parsing with HPSG-like grammar formalisms
and devise a special method of extracting relevant information from partial parse
results, so that relevant information can still be extracted even when a full sentential
parse is not obtained.
5. Named entity recognition using machine learning techniques
While techniques based on hand-made patterns had been dominant until MUC-5, there
was a strong interest in applying machine learning techniques to the task in MUC-6
and MUC-7. Since the NE task in IE shares common techniques with term recognition,
a comprehensive list is given in the bibliography of Dr. S. Ananiadou. The
following list focuses on NE recognition in IE and, in particular, on systems using
machine learning techniques (See also Section 7).
(5-1)**Bikel, D. M., et al. (1997), “Nymble: a High-Performance Learning
Name-finder”, Proceedings of the Fifth Conference on Applied Natural
Language Processing, Morgan Kaufmann Publishers, pp. 194-201.
Among NE systems based on learning methods, the system in this paper showed the
best performance of the time. This is the first system that uses an HMM for the NE task.
The paper shows that an HMM based on a set of simple orthographic features gives a
remarkably good result, e.g. around 90% accuracy.
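To make the notion of "simple orthographic features" concrete, the following is a minimal sketch of a function that maps a token to one coarse orthographic class. The class names and the function `word_feature` are assumptions made for illustration only; they are not the feature set defined in the paper.

```python
import re

def word_feature(token):
    """Map a token to one coarse orthographic class (hypothetical feature set)."""
    if re.fullmatch(r"\d+", token):
        return "allDigits"            # e.g. a year or a count
    if re.search(r"\d", token) and re.search(r"[A-Za-z]", token):
        return "digitAndAlpha"        # e.g. a product code or a gene name
    if token.isupper():
        return "allCaps"              # e.g. an acronym
    if token[:1].isupper():
        return "initCap"              # e.g. a capitalized proper name
    if token.islower():
        return "lowercase"
    return "other"                    # punctuation and anything else

tokens = ["Mr.", "Smith", "joined", "IBM", "in", "1999", "."]
features = [word_feature(t) for t in tokens]
```

In an HMM-based name finder, classes of this kind would typically be attached to each word so that the model can generalize to words it has never seen in training.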
(5-2) *Collier, N., Nobata, C., and Tsujii, J. (2000), “Extracting the Names of
Genes and Gene Products with a Hidden Markov Model”, Proceedings of the
18th International Conference on Computational Linguistics
(COLING-2000), Saarbrucken, Germany.
In this paper, named entity recognition techniques based on HMMs were applied for the
first time to terms in the biochemical domain. The result shows that simple orthographic
features work fairly well in this domain as well, while the overall performance is not as
good as in the NE task in MUC. The same group applied a decision tree method to the same
problem with the same set of features, with results worse than those of the HMM (See:
Nobata, C., et al. (1999): “Automatic Term Identification and Classification in Biology
Texts”, in Proceedings of the 5th Natural Language Processing Pacific Rim Symposium,
Beijing). The results show that the NE task in the biology domain is much harder than in
the MUC domain, due to the abundant use of multi-word expressions, abbreviations,
complex co-ordination within term expressions, etc.
(5-3)**Borthwick, A. et al. (1998), “Exploiting Diverse Knowledge Sources
via Maximum Entropy in Named Entity Recognition”, Proceedings of the
Sixth Workshop on Very Large Corpora, pp 152-160.
(5-4) Sekine, S., Grishman, R. and Shinnou, H. (1998), “A decision tree
method for finding and classifying names in Japanese texts”, Proceedings of
the Sixth Workshop on Very Large Corpora.
These two papers are by the NYU group. While a simple HMM does not allow multi-faceted
features, decision trees and ME (maximum entropy) can deal with them. In particular,
ME accepts a large set of features, from which it learns which features are relevant to
the task. ME seems to outperform other learning methods.
(5-5)*Collins, M. and Singer, Y. (1999), “Unsupervised Models for Named
Entity Classification”, Proceedings of the 1999 Joint SIGDAT Conference on
Empirical Methods in Natural Language Processing and Very Large
Corpora, University of Maryland, USA (http://www.research.att.com/~singer/).
Supervised learning methods tend to require substantial human effort in preparing
annotated corpora. This nullifies the advantage of trainable systems over the
knowledge engineering approach (See the discussion in (1-1) by Appelt). This paper
shows that a new technique called “Co-Training” can reduce the burden (Blum, A. and
Mitchell, T. (1998): Combining Labeled and Unlabeled Data with Co-Training, in
Proceedings of COLT-98, Madison, Wisconsin, http://www.cs.cmu.edu/~avrim/). Starting
with a small set of rules, the system learns semantic classifiers of PERSON,
ORGANIZATION and LOCATION, from unlabeled data.
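The idea of starting from a few seed rules and letting two views of the data label each other can be sketched as follows. This is a heavily simplified illustration, not the algorithm of the paper: the two views ("spelling" and "context"), the feature strings, the thresholds and all function names are assumptions chosen for the example.

```python
from collections import Counter, defaultdict

def induce_rules(examples, labels, view, min_count=2, min_precision=0.9):
    """Learn feature -> label rules for one view from the currently labeled examples."""
    counts = defaultdict(Counter)
    for ex, label in zip(examples, labels):
        if label is not None:
            for feat in ex[view]:
                counts[feat][label] += 1
    rules = {}
    for feat, by_label in counts.items():
        label, n = by_label.most_common(1)[0]
        # keep only rules that are frequent and (almost) unambiguous
        if n >= min_count and n / sum(by_label.values()) >= min_precision:
            rules[feat] = label
    return rules

def apply_rules(examples, rules, view):
    """Label each example by its first matching rule in the given view, else None."""
    labels = []
    for ex in examples:
        label = None
        for feat in ex[view]:
            if feat in rules:
                label = rules[feat]
                break
        labels.append(label)
    return labels

def co_train(examples, seed_rules, rounds=3):
    """Alternate between the two views, each labeling data for the other."""
    spelling_rules = dict(seed_rules)
    context_rules = {}
    for _ in range(rounds):
        labels = apply_rules(examples, spelling_rules, "spelling")
        context_rules = induce_rules(examples, labels, "context")
        labels = apply_rules(examples, context_rules, "context")
        learned = induce_rules(examples, labels, "spelling")
        spelling_rules = {**learned, **seed_rules}  # seed rules always win
    return spelling_rules, context_rules

# Toy unlabeled data: each example has a spelling view and a context view.
examples = [
    {"spelling": ["contains:Mr."], "context": ["ctx:said"]},
    {"spelling": ["contains:Mr."], "context": ["ctx:said"]},
    {"spelling": ["contains:Inc."], "context": ["ctx:shares_of"]},
    {"spelling": ["contains:Inc."], "context": ["ctx:shares_of"]},
    {"spelling": ["full:Smith"], "context": ["ctx:said"]},
    {"spelling": ["full:Smith"], "context": ["ctx:said"]},
    {"spelling": ["full:Acme"], "context": ["ctx:shares_of"]},
    {"spelling": ["full:Acme"], "context": ["ctx:shares_of"]},
]
seed_rules = {"contains:Mr.": "PERSON", "contains:Inc.": "ORGANIZATION"}
spelling_rules, context_rules = co_train(examples, seed_rules)
```

In this toy run the seed spelling rules label four examples, those labels yield context rules, and the context rules in turn label the remaining examples, from which new spelling rules are induced. No hand-annotated corpus is used at any point.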
(5-6) Fukuda, et al. (1998): “Toward Information extraction: Identifying
protein names from biological papers”, in Proc. of the Pacific Symposium on
Biocomputing 98 (PSB 98), Hawaii.
This is not a paper on NE recognition based on machine learning. However, the paper
shows for the first time that terms in biochemical domains, in particular protein
names, can be recognized by a set of simple heuristics. The performance of the system
was good, although the evaluation method had not been well established.
6. Coreference resolution
After creating templates of individual entities and events with their properties, an IE
system merges them to form integrated templates, from which all kinds of information
about the same entities and events can be obtained. Coreference identification plays a
crucial role in this stage.
(6-1) Aone, C. and Bennett, S. W. (1995), “Evaluating automated and manual
acquisition of anaphora resolution strategies”, Proceedings of the 33rd
Annual Meeting of the Association for Computational Linguistics (ACL-95),
pp 122-129, Cambridge, MA, June.
(6-2) Hirschman, L. et al. (1997), “Automating Coreference: The Role of
Annotated Training Data”, Proceedings of the AAAI Spring Symposium on
Applying Machine Learning to Discourse Processing.
(6-3)*Kehler, A. (1997), “Probabilistic Coreference in Information
Extraction”, Proceedings of the 1997 Joint SIGDAT Conference on Empirical
Methods in Natural Language Processing and Very Large Corpora.
The paper discusses probabilistic models based on Maximum Entropy (ME) for
coreference resolution. The paper assumes that the first stage (entity and event
recognition) is performed by FASTUS, and the model returns possible coreference
relationships with their ratings, which will be used by a downstream application system.
The paper makes the interesting comment that the lack of reasonable linguistic
representations in FASTUS makes the coreference resolution task unnecessarily difficult.
(6-4) Kennedy, C. and Boguraev, B. (1996), “Anaphora for everyone:
Pronominal anaphora resolution without a parser”, Proceedings of the 16th
International Conference on Computational Linguistics (COLING-96).
7. Knowledge Acquisition and Ontology
(7-1)*Muslea, I. (1999): “Extraction Patterns for Information Extraction
Tasks: A Survey”, in Proceedings of The AAAI-99 Workshop on Machine
Learning for Information Extraction.
In order to build a pattern-based IE system, you have to prepare a set of patterns
manually. Most IE tool kits provide some devices that lessen this
manual effort, like macros in FASTUS, generalization of examples in the NYU system,
etc. However, these devices still require substantial human effort. The design of patterns
also assumes the existence of a proper domain ontology. Though there exist a few domain-
independent ontologies like Cyc, WordNet, EuroWordNet, EDR, etc., it is often the case
that these domain-independent ontologies are not very effective for IE. This is definitely
the case for IE systems for specific scientific texts.
Therefore, automatic acquisition of patterns and ontologies from texts has attracted
significant interest recently. While this paper is not comprehensive (e.g. it does not
cover the substantial amount of corpus-based research in computational linguistics, such
as acquisition of sub-categorization frames, word clustering, etc.), it is a good
introduction to the field and shows how the two fields, KA (Knowledge Acquisition) and
IE, are now merging.
(7-2) Soderland, S., et al. (1995): CRYSTAL: Inducing a Conceptual
Dictionary, in Proceedings of the 14th International Joint Conference on
Artificial Intelligence (IJCAI 95).
CRYSTAL is a part of the IE system developed by the University of Massachusetts,
which is used by their language analyzer BADGER. While BADGER is domain
independent, the dictionary of CN (Concept Nodes), which CRYSTAL is to build from
an annotated corpus, contains domain specific information. The information in a CN is
essentially the same as the patterns that are used by other IE systems based on pattern
matching.
CRYSTAL accepts annotated texts (annotated in terms of semantic classes of phrases)
and takes them as examples of patterns. It tries to generalize those examples into
more general patterns by using a hierarchy of concepts. In this paper, the authors use
UMLS as the semantic class hierarchy [See also (Aseltine, J. (1999): WAVE: An Incremental
Algorithm for Information Extraction, In Proceedings of the AAAI Workshop on
Machine Learning for Information Extraction).]
(7-3) Riloff, E. and Jones, R. (1999): Learning Dictionaries for Information
Extraction by Multi-Level Bootstrapping, in Proceedings of the 16th National
Conference on Artificial Intelligence (AAAI-99).
CRYSTAL in (7-2) assumes that semantic classes are given before learning of patterns.
In this paper, they propose “mutual bootstrapping” that learns semantic classes and
(7-4) Rindflesch, et al. (1999): Mining Molecular Binding Terminology from
Biomedical Text, in Proceedings of AMIA-99.
This is a part of the project of (3-1). The program ARBITER recognizes terms that are
relevant to molecular binding relationships.
(7-4) Maynard, D., and Ananiadou, S. (2000): “Identifying Terms by their
Family and Friends”, in Proceedings of Coling 2000, Saarbrucken, Germany.
The first step in acquiring domain ontologies is to collect the terms of a given domain. In
particular, there are many multi-word terms in medicine and biology, which is one of
the major causes of difficulty in term recognition in these fields. The paper addresses
how to collect these multi-word terms based on the collocation distribution of words.
(7-5) Guarino, N., et al. (1995): Ontologies and Knowledge bases towards a
Terminological Clarification, in Proceedings of Towards Very large
Knowledge Bases, pp 25- 32
Interest in ontology has emerged through discussions in various research fields such
as knowledge representation and sharing in Artificial Intelligence, multi-lingual
machine translation, data management, CALS, community software, etc. It has its
roots in philosophy. This paper discusses what ontology means, what ontology is and
what it is not. Guarino’s web site
(http://www.ladseb.pd.cnr.it/infor/people/Guarino.html) provides a comprehensive list of
useful sites in this field.
(7-6) Useful sites of Domain Independent Ontology:
(7-6-1) Cyc (http://www.cyc.com/): Encyclopedic knowledge base.
(7-6-2) Wordnet (http://www.cogsci.princeton.edu/)
(7-6-3) Euro-wordnet (http://www.hum.uva.nl/~ewn/)
(7-6-4) EDR (http://www.iijnet.or.jp/edr/): Lexical resources: monolingual dictionaries of
Japanese and English and a Concept dictionary. The concept dictionary can be used as
(7-6-5) Mikrokosmos (http://crl.nmsu.edu/Research/Projects/mikro/): Ontology for
(7-7)*McEntire, R., Karp, P., et al. (2000): An Evaluation of Ontology
Exchange Languages for Bioinformatics, in Proceedings of 8th International
Conference on Intelligent Systems for Molecular Biology, La Jolla, pp
Languages for the representation and exchange of ontologies are compared and
evaluated with respect to biology applications. Seven candidates, ASN.1, ODL, Onto,
OML/CKML, OPM, XML/RDF and UML, are evaluated in detail for biochemistry
(7-8)*Tateishi, Y. et al. (2000): Building an Annotated Corpus in the
Molecular-Biology Domain, in Proceedings of Workshop on Semantic
Annotation and Intelligent Content, Coling 2000, Saarbrucken, Germany, pp
This is a paper on text annotation for biology texts, not on ontology. However, the
semantic annotation assumes a certain background ontology of the field, which is now
being constructed by a group at the University of Tokyo (http://www-tsujii.is.s.u-
8. Basic Techniques
(8-1) ** Rabiner, L. and Juang, B. (1986), “An introduction to hidden
Markov models”, IEEE ASSP Magazine, pp 4-16, January.
A standard introductory paper on HMMs.
(8-2) Quinlan, J.R. (1993): “C4.5: Programs for Machine Learning”, Morgan
Kaufmann Publishers, San Mateo, Calif.
A standard textbook on decision trees. The program C4.5 is available from
http://www.cse.unsw.edu.au/~quinlan/. The advanced version C5.0 is commercially
(8-3) **Viterbi, A. J. (1967), “Error bounds for convolutional codes and an
asymptotically optimum decoding algorithm”, IEEE Trans. Information
Theory, IT-13(2), pp 260-269.
When state transition probabilities and emission probabilities are given, a naive
way of finding the optimal transition path (i.e. the most probable path) involves a large
search space and is time-consuming. The Viterbi algorithm is an efficient algorithm for
solving this problem. The time complexity of this algorithm is O(TN^2), where T and N
are the length of the sequence and the number of states, respectively. A modified version
of this algorithm, which combines the original version with certain rules that constrain
legitimate sequences, is widely used.
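The dynamic programming described above can be sketched in a few lines. The code below is a minimal illustration in log space, not taken from the paper; the function name `viterbi`, the two-state toy model and its probabilities are assumptions made for the example. The two nested loops over states inside the loop over positions give the O(TN^2) cost.

```python
import math

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return (best_path, best_log_prob) for an HMM in O(T * N^2) time."""
    # best[t][s]: log-probability of the best path ending in state s at position t
    best = [{s: log_start[s] + log_emit[s][observations[0]] for s in states}]
    back = [{}]  # back[t][s]: predecessor of s on that best path
    for t in range(1, len(observations)):
        best.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: best[t - 1][p] + log_trans[p][s])
            best[t][s] = (best[t - 1][prev] + log_trans[prev][s]
                          + log_emit[s][observations[t]])
            back[t][s] = prev
    # follow the back-pointers from the best final state
    last = max(states, key=lambda s: best[-1][s])
    path = [last]
    for t in range(len(observations) - 1, 0, -1):
        path.append(back[t][path[-1]])
    path.reverse()
    return path, best[-1][last]

def logs(table):
    """Convert a dict of probabilities to log-probabilities."""
    return {k: math.log(v) for k, v in table.items()}

# Hypothetical toy model: tag tokens as NAME or OTHER from a capitalization feature.
states = ["NAME", "OTHER"]
log_start = logs({"NAME": 0.3, "OTHER": 0.7})
log_trans = {"NAME": logs({"NAME": 0.5, "OTHER": 0.5}),
             "OTHER": logs({"NAME": 0.2, "OTHER": 0.8})}
log_emit = {"NAME": logs({"initCap": 0.9, "lower": 0.1}),
            "OTHER": logs({"initCap": 0.1, "lower": 0.9})}

path, log_prob = viterbi(["lower", "initCap", "initCap", "lower"],
                         states, log_start, log_trans, log_emit)
```

For this toy model the decoded path tags the two capitalized positions as NAME and the others as OTHER. Working with log-probabilities avoids numerical underflow on long sequences.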
(8-4)**Pereira, F., et al. (1991): “Finite State Approximation of Phrase
Structure Grammars”, Proceedings of 29th Meeting of the Association for
Computational Linguistics, Berkeley, California, pp246-255.
The discussion in this paper justified the use of FS (finite-state) techniques instead of
more expressive frameworks in NLP, and has influenced the direction of NLP research.
One of the precursors of this direction was a discussion given by Church, K.W. (On Memory
limitations in Natural Language Processing, MIT Laboratory of Computer Science
Technical Report MIT/LCS/TR-245, 1980).
(8-5)*Roche, E. and Schabes, Y. (eds.) (1997): Finite State Language
Processing, The MIT Press.
A good textbook on finite state techniques. The techniques and the ideas of CG
(Constraint Grammar) in this book are materialized as a commercial product ENGCG
of Lingsoft, the performance of which in tagging of English texts is impressive. Their
web site is http://www.lingsoft.fi/cgi-bin/engcg. FST approximation of phrase structure
grammar by Pereira (8-4) also appears in this textbook.