Semi-automatic Text Mining
Project Proposal for "Future and Emerging Technologies"
in the EU-IST Programme
S. Staab¹, R. Studer Karlsruhe University
K. Markert, B. Webber University of Edinburgh
N. Kushmerick University College Dublin
B. Bremdal, R. Engels Cognit a.s
http://www.aifb.uni-karlsruhe.de/~sst/Research/Projects/TextMining/
1 Abstract
Motivation: The revolutionary step from printed text to digital documents has led to an
explosive growth of knowledge available (semi-)publicly through the internet or through
community and corporate intranets. With this flood of potentially useful information
comes the urgent need to sift through it, find the golden nuggets of information and
analyze them in order to make informed decisions.
Problem: The vision in text understanding has been that of fully automatic techniques
that may be exploited for purposes like detecting relevant information in texts,
summarizing the relevant information, or answering questions on texts. Nevertheless,
fully automatic text mining appears to be as distant as ever. Approaches that actually
work rely almost exclusively on information retrieval techniques, hardly exploit the fast
progress in computational linguistics research, and thus exhibit well-known limitations
that lead to inconclusive summarizations or to the abundance of hits in search engines like
AltaVista. In addition, the connotation of text mining (the aggregation and analysis of
information into a piece of knowledge that may lead to an informed action) has hardly
been investigated so far.
Objectives: Our project proposal pursues a threefold objective. First, we want to bridge
the gap between the techniques that are actually used for text mining and the current and
upcoming progress in the fields of knowledge acquisition, computational linguistics,
information retrieval, information extraction and machine learning, and thus draw on
that progress.
Second, we want to exploit the particularities found in current web documents. This
implies that we need to consider new web standards for document structuring, viz. XML,
and we must consider semi-structuring information such as given through layout, in tables
or lists.
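As a rough illustration of the second objective, the following Python sketch reads semi-structured information (a table and a list) from an XML document. The XML fragment, its element and attribute names, and the function extract_structured are hypothetical assumptions made for this example only; they do not describe a schema or component of the proposed project.

```python
import xml.etree.ElementTree as ET

# Hypothetical XML fragment of an annual report. Element and attribute
# names are illustrative assumptions, not a real schema.
REPORT_XML = """
<report company="ExampleTel" year="1999">
  <section title="Key figures">
    <table>
      <row><cell>Revenue</cell><cell>120</cell></row>
      <row><cell>Employees</cell><cell>4500</cell></row>
    </table>
  </section>
  <section title="Outlook">
    <list>
      <item>Expand mobile services</item>
      <item>Enter new markets</item>
    </list>
  </section>
</report>
"""


def extract_structured(xml_text):
    """Collect table rows and list items, keeping the section they came from."""
    root = ET.fromstring(xml_text.strip())
    facts = []
    for section in root.iter("section"):
        title = section.get("title")
        # Each table row becomes one record holding the texts of its cells.
        for row in section.iter("row"):
            cells = [cell.text.strip() for cell in row.findall("cell")]
            facts.append({"section": title, "kind": "table", "values": cells})
        # Each list item becomes one record with a single value.
        for item in section.iter("item"):
            facts.append({"section": title, "kind": "list", "values": [item.text.strip()]})
    return facts


if __name__ == "__main__":
    for fact in extract_structured(REPORT_XML):
        print(fact)
```

Keeping the section title with every extracted record preserves the context that the layout provides, so that later steps can interpret the values.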
Finally, we want to go beyond information extraction towards text information
exploitation. This means we want to combine extracted information in order to deduce
knowledge that may not have been in the mind of the authors of the text.
¹ Contact: Steffen Staab, AIFB, Karlsruhe University, D-76128 Karlsruhe, email:
staab@aifb.uni-karlsruhe.de, Tel.: +49(0)721/608 7363, Fax.: +49(0)721/693717
Method: We consider text mining a semi-automatic process that is designed and set up
with a particular application in mind. The design involves the construction of a domain
ontology, the formulation and/or learning of interesting structures with computational
linguistics and/or information retrieval techniques, and the exploration of the
corresponding results. Once the domain-specific text mining application is set up, the
naive user may run it to extract information and, in particular, to find associations and
rules that are not stated in any single original text but can only be found by
considering, integrating and comparing various text sources.
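The process described under Method can be sketched roughly as follows. This is a minimal Python illustration under strong simplifying assumptions: the "ontology" is a hand-written mapping from concepts to keywords, extraction is plain keyword matching, and an "association" is a pair of concepts that co-occurs in at least min_support sources. All names (ONTOLOGY, extract_concepts, mine_associations) and the sample texts are hypothetical and are not the techniques the project would develop.

```python
from collections import defaultdict
from itertools import combinations
import re

# Hypothetical miniature domain ontology: concept -> surface terms.
ONTOLOGY = {
    "Merger": ["merger", "acquisition", "takeover"],
    "MobileMarket": ["mobile", "wireless"],
    "Restructuring": ["restructuring", "reorganization"],
}


def extract_concepts(text, ontology=ONTOLOGY):
    """Return the set of ontology concepts whose terms occur in the text."""
    tokens = set(re.findall(r"[a-z]+", text.lower()))
    return {concept for concept, terms in ontology.items()
            if tokens.intersection(terms)}


def mine_associations(documents, min_support=2):
    """Keep concept pairs that co-occur in at least min_support sources."""
    pair_sources = defaultdict(set)
    for source, text in documents.items():
        concepts = sorted(extract_concepts(text))
        for pair in combinations(concepts, 2):
            pair_sources[pair].add(source)
    return {pair: sorted(sources) for pair, sources in pair_sources.items()
            if len(sources) >= min_support}


if __name__ == "__main__":
    # Invented sample sentences standing in for reports from three companies.
    docs = {
        "A-1999": "The merger strengthens our mobile business.",
        "B-1999": "After the takeover, wireless revenue grew.",
        "C-1999": "A restructuring of the fixed-line unit was completed.",
    }
    print(mine_associations(docs))
```

In this toy setting, neither report states on its own that mergers and the mobile market are linked across the sector. The pair only emerges when both sources are compared, which is the kind of cross-document result the method aims at.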
Scenario: As a case study we choose the mining of annual business reports
and analysts' reports that comment on companies from a particular area (e.g.,
telecommunication). This scenario is well suited because
1. It allows the observation of competitors and the detection of trends that are extremely
important for decision makers, such as trends in organizational structures or in
markets and products.
2. The understanding of these texts cannot be performed in isolation. Rather the
knowledge that needs to be found is mostly available in the annual changes that take
place and in the comparisons between companies in the same trade.
3. The setting is observed and understood well enough by professionals to allow us to
verify the techniques we develop.
2 Chances for Europe
The application of semi-automatic text mining opens up opportunities on several
levels:
1. Informed Decisions: Results from our project may deliver critical information to
European businesses, thus keeping them competitive and able to react quickly to new
trends and possibilities.
2. Individual Learning: The more time the individual may spend on understanding
interconnections and the less time she spends on searching for information and
testing hypotheses, the more she profits from the information technology that is
now at hand.
3. Research: Though our scenario develops a particular business case, many research
issues may profit from semi-automatic text mining, too. Indeed, research hypotheses
may be easier to (pre-)test or even to generate (cf. Hearst (1999)).
All these factors are critical for developing the high potential of Europeans, for Europeans.
Informed decisions, faster learning and improved research all work together in keeping
Europe competitive.
4 Partner Profile
We consider text mining as being a knowledge acquisition process that should be
facilitated by learning approaches and by the techniques found in information retrieval
and computational linguistics. Hence, the consortium includes people from these
different communities:
Prof. Dr. Studer holds a chair for knowledge management at Karlsruhe University. He
has carried out research and organized numerous activities in the fields of knowledge
acquisition, knowledge management and data mining for over 20 years.
Dr. Steffen Staab is senior researcher and lecturer at Karlsruhe University. His research
interests include knowledge management, ontology engineering, information extraction,
and data mining. He is now project manager for Karlsruhe in the project GETESS
(http://www.getess.de), which aims at a specific information extraction system for the
tourism domain and which is funded by the German government.
Prof. Dr. Bonnie Webber...
Dr. Katja Markert....
Dr. Nicholas Kushmerick is College Lecturer in the Department of Computer Science,
University College Dublin, Ireland. Dr. Kushmerick received his Ph.D. in 1997 at the
University of Washington, and his dissertation was nominated for the ACM
Distinguished Dissertation award. Dr. Kushmerick has worked in the areas of planning,
machine learning, and information extraction, integration, and retrieval. His work
has been published in several international journals, and he has been on the organizing
committee of numerous conferences and workshops. Dr. Kushmerick’s current work
focuses on the use of machine learning to scale up knowledge engineering on the
Internet, in service of problems such as information extraction and designing intelligent
browsing assistants.
Dr. Bernt Arild Bremdal: Studied Marine Technology in Trondheim, Norway. After
finishing his MSc at NTNU, he stayed at the same university for his PhD, which he
received in 1988 for work on the application of artificial intelligence, rule-based and
object-oriented programming in project planning. After having been affiliated with a
variety of companies, he co-founded CognIT a.s, which he now directs. Author of more
than 50 articles and published reports on computer applications in engineering and
industry, design and planning, object-oriented technology and artificial intelligence.
Most recent publication is Braunschweig
and Bremdal, “AI in the Petroleum Industry.” Volume 2. Edition Technip 1996.
Dr. Robert Engels: Studied Artificial Intelligence, Psychology and (partly) Computer
Science at the University of Amsterdam, NL. He wrote his MSc thesis on
applications of Inductive Logic Programming in Stockholm, Sweden. In 1999 he received his
PhD from the University of Karlsruhe for research conducted in the area of Knowledge
Discovery and Data Mining. He has (co-)authored a variety of papers and organised several
international and national (German) workshops on practical applications of Data Mining.
Currently he is affiliated with CognIT as a senior systems architect.
The work packages would be split along the following lines (bold face indicates
leadership for a particular work package):
Areas: Knowledge Acquisition | Computational Linguistics | Machine Learning | Information Retrieval
Univ. Karlsruhe: Ontology acquisition | Mining Information
Univ. Edinburgh: Information Extraction with Layout
Univ. College Dublin: Wrappers with Ontologies; Mining Information | Indexing and querying structured documents
Cognit: Ontology induction | Understanding XML Texts
5 Partner Addresses
Dr. Steffen Staab, Prof. Dr. Rudi Studer
Institute for Applied Computer Science and Formal Description Methods (AIFB),
Karlsruhe University, D-76128 Karlsruhe, Germany
http://www.aifb.uni-karlsruhe.de/WBS
mailto:staab@aifb.uni-karlsruhe.de,studer@aifb.uni-karlsruhe.de
Dr. Katja Markert, Prof. Dr. Bonnie Webber
Division of Informatics, University of Edinburgh, 80 South Bridge
Edinburgh EH1 1HN, Scotland
http://www.informatics.ed.ac.uk/research/irr/
mailto:markert@cogsci.ed.ac.uk,bonnie@dai.ed.ac.uk
Dr. Nicholas Kushmerick
Department of Computer Science, University College Dublin, Dublin 4, Ireland
http://www.cs.ucd.ie/staff/nick/
mailto:nick@ucd.ie
Dr. Robert Engels, Dr. Bernt Bremdal
Cognit a.s, P.B. 610, N-1754 Halden, Norway
http://www.cognit.no/
mailto:robert.engels@cognit.no,bernt.bremdal@cognit.no