Text Mining:
     Beyond Extraction Towards Exploitation
         TIE - Text Information Exploitation
    Project Proposal for "Future and Emerging Technologies"
                   in the EU-IST Programme
           S. Staab1, R. Studer                              Karlsruhe University
          K. Markert, B. Webber                             University of Edinburgh
             N. Kushmerick                                 University College Dublin
          B. Bremdal, R. Engels                                   Cognit a.s

        http://www.aifb.uni-karlsruhe.de/~sst/Research/Projects/TextMining/


1 Abstract
Motivation: The revolutionary step from printed text to digital documents has led to
an explosive growth of knowledge available (semi-)publicly through the internet or
through community and corporate intranets. With this flood of potentially useful
information comes the urgent need to sift through it, find the golden nuggets of
information, and analyze them in order to make informed decisions.
Problem: So far, work in text mining has mostly concentrated on tasks such as
extracting information from texts, summarizing the relevant information, or answering
questions about texts. However, this information extraction-based vision, which has been
elaborated, e.g., in approaches to message understanding, mostly neglects the amount and
complexity of information that the user must deal with and act upon. In contrast, the
further connotations of text mining that go well beyond information extraction (the
aggregation and analysis of information into golden nuggets of knowledge that may lead
to informed actions) have hardly been investigated so far.
Objectives: In our project we want to go beyond information extraction towards text
information exploitation. This means we want to aggregate extracted information in order
to deduce knowledge that may not have been in the minds of the authors of the texts.
For example, a decade ago there were many separate medical research reports describing
symptoms of a particular type of migraine headache, such as a lack of magnesium (among
others). However, the causal link between magnesium deficits and migraine headaches
was implicit, and sometimes it was given only indirectly via other symptoms. A manual
literature search found that the literature reported eight different types of direct and
indirect links between magnesium deficits and migraine headaches. This result strongly
supported the hypothesis that a lack of magnesium causes this type of migraine headache,
a hypothesis which easily proved true in subsequent medical experiments and thus
became very valuable to know. However, though all the information was in the
documents, the knowledge about the potential causal link was very hard to discover.

1
 Contact: Steffen Staab, AIFB, Karlsruhe University, D-76128 Karlsruhe,
 email: staab@aifb.uni-karlsruhe.de, Tel.: +49(0)721/608 7363, Fax.: +49(0)721/ 693717

Text mining in the sense we envision will allow for applications that handle tasks like
finding causal links described in texts (semi-)automatically. Once the text mining
application has been set up by a systems engineer, the naive user lets the system extract
information, aggregate it, and discover those nuggets of knowledge that may actually help
them solve a problem.
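
The idea of aggregating links that are spread over separate documents can be sketched
as follows. This is only a minimal illustration, not the method proposed by the project:
the relation triples, the document identifiers, the placeholder terms "finding X" and
"finding Y", and the function name indirect_links are all assumptions made for the
example. It looks for intermediate terms B such that one document reports a link from A
to B and another reports a link from B to C.

```python
from collections import defaultdict

# Hypothetical output of an information extraction step:
# (document id, source term, target term). The terms are placeholders,
# not actual medical findings.
EXTRACTED = [
    ("doc1", "magnesium deficit", "finding X"),
    ("doc2", "finding X", "migraine"),
    ("doc3", "magnesium deficit", "finding Y"),
    ("doc4", "finding Y", "migraine"),
    ("doc5", "stress", "migraine"),
]


def indirect_links(relations, start, goal):
    """Return {intermediate term: supporting documents} for all two-step
    chains start -> intermediate -> goal found across the documents."""
    outgoing = defaultdict(set)  # term -> {(target term, document id)}
    for doc, src, dst in relations:
        outgoing[src].add((dst, doc))

    links = {}
    for mid, doc_ab in outgoing.get(start, ()):
        for dst, doc_bc in outgoing.get(mid, ()):
            if dst == goal:
                links.setdefault(mid, set()).update({doc_ab, doc_bc})
    return links


links = indirect_links(EXTRACTED, "magnesium deficit", "migraine")
for mid in sorted(links):
    print(mid, sorted(links[mid]))
```

No single document in this toy data states the link between the start and goal terms;
it only becomes visible once the extracted relations are aggregated.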
Method:
The objective just described will be put to work in a realistic environment. This means
we must consider:
1.   Real-world texts: We must include semi-structured information such as that
     given in de facto web standards like HTML and XML. This also implies that we
     must try to extract information from layout structures and from more rigid formats,
     such as tables appearing in natural language texts.
2.   Integrating Information Extraction Techniques: We need a broad basis for
     information extraction in order to solve the information exploitation task. For this
     purpose, we want to build on the experience gained with TREC- and MUC-style
     approaches as well as with machine learning techniques that have been applied
     to wrapping semi-structured data. This requires strong competence in the fields of
     Information Retrieval and Extraction, Computational Linguistics, and Machine
     Learning.
3.   Domain Ontology: In order to go beyond "simple" phrase extraction, we need a
     domain ontology that acts as a semantic mediator between different information
     extraction methods, that allows for knowledge discovery at different levels of
     granularity, and that allows for mappings between different terminologies.
4.   Text mining as a semi-automatic process: We consider text mining a semi-automatic
     process that is designed and set up with a particular application and particular topics
     in mind.
     The design involves the construction of a domain ontology and a domain lexicon,
     the formulation and/or learning of interesting structures with computational
     linguistics and/or information retrieval techniques, and the exploration of the
     corresponding results. Once the domain-specific text mining application is set up, the
     naive user may run it to extract information and, in particular, to find associations
     and rules that were not present in the original texts but that could only be found by
     considering, integrating, and comparing various text sources.
     This approach parallels the development in data mining, where the utopia of a fully
     automatic knowledge discovery process has matured with great success into an
     engineering approach towards this problem.
5.   Knowledge Discovery: Finally, we need to apply machine learning techniques
     to aggregate and analyze the extracted information, yielding "golden nuggets of
     knowledge".
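
The role of the domain ontology as a semantic mediator (item 3) and the aggregation step
(item 5) can be illustrated with a small sketch. It is an assumption-laden toy, not the
project's design: the ontology contents, the concept names, the sample documents, and the
functions normalize, document_frequency, and pair_counts are hypothetical. Terms from
different terminologies are mapped to shared concepts, counts can be taken at two levels
of granularity, and simple co-occurrence counts stand in for a real knowledge discovery
technique.

```python
from collections import Counter
from itertools import combinations

# Hypothetical toy ontology: surface term -> concept, concept -> parent concept.
TERM_TO_CONCEPT = {
    "mobile phone": "MobilePhone",
    "cell phone": "MobilePhone",
    "handset": "MobilePhone",
    "dsl": "BroadbandAccess",
    "broadband": "BroadbandAccess",
}
PARENT = {
    "MobilePhone": "TelecomProduct",
    "BroadbandAccess": "TelecomProduct",
}

# Hypothetical terms extracted from three documents.
DOCS = [
    ["cell phone", "DSL"],
    ["handset", "broadband"],
    ["mobile phone", "annual meeting"],
]


def normalize(terms, level=0):
    """Map extracted terms to ontology concepts.
    level=0 keeps the direct concept; each further level climbs one step up."""
    concepts = set()
    for term in terms:
        concept = TERM_TO_CONCEPT.get(term.lower())
        if concept is None:
            continue  # term is not covered by the ontology
        for _ in range(level):
            concept = PARENT.get(concept, concept)
        concepts.add(concept)
    return concepts


def document_frequency(docs, level=0):
    """Count in how many documents each concept occurs."""
    counts = Counter()
    for terms in docs:
        counts.update(normalize(terms, level))
    return counts


def pair_counts(docs, level=0):
    """Count how often two concepts occur together in the same document."""
    counts = Counter()
    for terms in docs:
        counts.update(combinations(sorted(normalize(terms, level)), 2))
    return counts


print(document_frequency(DOCS, level=0))
print(document_frequency(DOCS, level=1))
print(pair_counts(DOCS))
```

Because "cell phone", "handset", and "mobile phone" all map to one concept, the counts
reflect what the documents talk about rather than which wording each source happens to use.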
Research Issues:
Open research issues in this field are manifold, e.g.:
1. Extracting the semantics of layout with computational linguistics, aligning
   semi-structured data with the corresponding ontology information
2. Aligning several information extraction techniques (TREC-style) with
3. Integrating techniques: ontology, machine learning, information retrieval and
   extraction, computational linguistics (learning with ontologies, inducing ontologies from
   computational linguistics and information extraction techniques, aligning wrapper
   induction with ontologies, applying information extraction measures to the syntactic
   and semantic level)
Scenario: As a case study we choose the mining of annual business reports
and analysts' reports that comment on companies from a particular sector (e.g.,
telecommunications). This scenario is well suited because:
1.   It allows the observation of competitors and the detection of trends that are
     extremely important for decision makers, such as trends in organizational structures
     or in markets and products.
2.   These texts cannot be understood in isolation. Rather, the knowledge that needs
     to be found lies mostly in the annual changes that take place
     and in the comparisons between companies in the same trade.
3.   The setting is sufficiently well observed and understood by professionals to
     verify the techniques we develop.
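
Point 2 of the scenario, that the interesting knowledge lies in annual changes and in
comparisons between companies, can be made concrete with a short sketch. The company
names, figures, attribute name, and the function yearly_change are invented for
illustration only; in practice the facts would come from the extraction step applied to
the reports.

```python
# Hypothetical facts extracted from annual reports:
# (company, year, attribute) -> value. All names and numbers are made up.
FACTS = {
    ("AlphaTel", 1998, "employees"): 1000,
    ("AlphaTel", 1999, "employees"): 1300,
    ("BetaCom", 1998, "employees"): 2000,
    ("BetaCom", 1999, "employees"): 1900,
}


def yearly_change(facts, attribute, year_from, year_to):
    """Relative change of one attribute per company between two years.
    Companies with a missing or zero base value are skipped."""
    companies = {company for (company, _, attr) in facts if attr == attribute}
    changes = {}
    for company in sorted(companies):
        old = facts.get((company, year_from, attribute))
        new = facts.get((company, year_to, attribute))
        if old and new is not None:
            changes[company] = (new - old) / old
    return changes


changes = yearly_change(FACTS, "employees", 1998, 1999)
for company, change in changes.items():
    print(f"{company}: {change:+.1%}")
```

Neither report alone shows that one company is growing while its competitor is shrinking;
the contrast only appears when facts from several reports and years are put side by side.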


2 Opportunities for Europe

The application of semi-automatic text mining offers opportunities on several levels:

1    Informed Decisions: Results from our project may deliver critical information to
     European businesses, thus keeping them competitive and able to react quickly to
     new trends and possibilities.

2    Individual Learning: The more time an individual can spend on understanding
     interconnections and the less time she spends searching for information and
     testing hypotheses, the more she profits from the information technology that is
     now at hand.

3    Research: Though our scenario develops a particular business case, many research
     areas may profit from semi-automatic text mining, too. Indeed, research hypotheses
     may become easier to (pre-)test or even to generate.

All these factors are critical to developing high potential among Europeans and for
Europeans. Informed decisions, faster learning and improved research all work together
in keeping Europe competitive.


4 Partner Profile

We consider text mining to be a knowledge acquisition process that should be
facilitated by learning approaches and by the techniques found in information retrieval and
computational linguistics. Hence, the consortium includes people from these different
communities:
Prof. Dr. Studer holds a chair for knowledge management at Karlsruhe University. He has
carried out research and organized numerous activities in the fields of knowledge
acquisition, knowledge management and data mining for over 20 years.

Dr. Steffen Staab is a senior researcher and lecturer at Karlsruhe University. His research
interests include knowledge management, ontology engineering, information extraction,
and data mining. He is currently project manager for Karlsruhe in the project GETESS
(http://www.getess.de), which aims at an information extraction system for the tourism
domain and which is funded by the German government.

Prof. Dr. Bonnie Webber...

Dr. Katja Markert....

Dr. Nicholas Kushmerick is College Lecturer in the Department of Computer Science,
University College Dublin, Ireland. Dr. Kushmerick received his Ph.D. in 1997 at the
University of Washington, and his dissertation was nominated for the ACM
Distinguished Dissertation award. Dr. Kushmerick has worked in the areas of planning,
machine learning, and information extraction, integration, and retrieval. His work has
been published in several international journals, and he has been on the organizing
committee of numerous conferences and workshops. Dr. Kushmerick's current work focuses
on the use of machine learning to scale up knowledge engineering on the Internet, in
service of problems such as information extraction and designing intelligent browsing
assistants.

Dr. Bernt Arild Bremdal studied Marine Technology in Trondheim, Norway. After
finishing his MSc at the NTNU, he wrote his PhD at the same university. He received his
PhD, on the application of artificial intelligence, rule-based and object-oriented
programming in project planning, in 1988. After being affiliated with a variety of
companies, he co-founded CognIT a.s, which he now directs. He is the author of more than
50 articles and published reports on computer applications in engineering and industry,
design and planning, object-oriented technology and artificial intelligence. His most
recent publication is Braunschweig and Bremdal, "AI in the Petroleum Industry."
Volume 2. Edition Technip 1996.

Dr. Robert Engels studied Artificial Intelligence, Psychology and (partly) Computer
Science at the University of Amsterdam, NL. He conducted his MSc thesis on
applications of Inductive Logic Programming in Stockholm, Sweden. In 1999 he received
his PhD from the University of Karlsruhe for research conducted in the area of Knowledge
Discovery and Data Mining. He has (co-)authored a variety of papers and organised several
international and national (German) workshops on practical applications of Data Mining.
Currently he is affiliated with CognIT as a senior systems architect.
The work packages would be split along the following lines (bold face indicates
leadership for a particular work package):

                      Knowledge     Computational  Machine             Information
                      Acquisition   Linguistics    Learning            Retrieval
Univ. Karlsruhe       Ontology                     Mining Information
                      acquisition
Univ. Edinburgh                     Information
                                    Extraction
                                    with Layout
Univ. College Dublin                               Wrappers with       Indexing and querying
                                                   Ontologies;         structured documents
                                                   Mining Information
Cognit                Ontology                                         Understanding
                      induction                                        XML Texts




5 Partner Addresses
Dr. Steffen Staab, Prof. Dr. Rudi Studer
     Institute for Applied Computer Science and Formal Description Methods (AIFB),
     Karlsruhe University, D-76128 Karlsruhe, Germany
     http://www.aifb.uni-karlsruhe.de/WBS
     mailto:staab@aifb.uni-karlsruhe.de,studer@aifb.uni-karlsruhe.de


Dr. Katja Markert, Prof. Dr. Bonnie Webber
     Division of Informatics, University of Edinburgh, 80 South Bridge
     Edinburgh EH1 1HN, Scotland
     http://www.informatics.ed.ac.uk/research/irr/
     mailto:markert@cogsci.ed.ac.uk,bonnie@dai.ed.ac.uk


Dr. Nicholas Kushmerick
     Department of Computer Science, University College Dublin, Dublin 4, Ireland
     http://www.cs.ucd.ie/staff/nick/
     mailto:nick@ucd.ie




Dr. Robert Engels, Dr. Bernt Bremdal
     Cognit a.s, P.B. 610, N-1754 Halden, Norway
     http://www.cognit.no/
     mailto:robert.engels@cognit.no,bernt.bremdal@cognit.no
