This document provides an overview of text mining, including its history and definitions. Key points:
- Text mining aims to extract useful information and discover new knowledge from large amounts of unstructured text data without having to read it all.
- Don Swanson is considered a pioneer in text mining for discovering new biomedical relationships through analyzing complementary sets of literature.
- There is no single agreed-upon definition, but text mining generally involves retrieving relevant texts, representing their content, and analyzing the representation to find patterns or associations.
- Current text mining systems are still fairly primitive and rely heavily on human input, but the goal is more automated analysis of large text collections to extract meaningful patterns rather than just
This document summarizes a presentation given by three librarians on the role of librarians in the intelligence process. It discusses the competencies and skills that librarians possess, such as open source intelligence collection, data and metadata management, knowledge management, understanding human information behavior, and instructional design. It argues that these enable librarians to take on new roles in intelligence work, including advising, analyzing, and teaching analysts. The document also outlines principles that allow librarians to collaborate effectively, such as assessing information quality, adding value through analysis, focusing communication, and having a mission focus of improving effectiveness through knowledge creation and application.
History of Information: Classical, Medieval, Modern theory
Open problems of Information: The unification of various theories of information; What is useful/meaningful information? What is an adequate logic of information? Continuous versus discrete models of nature; Computation versus thermodynamics; Classical information versus quantum information; Information and the theory of everything; The Church-Turing Hypothesis; P versus NP?
It from Bit: Why the Quantum? It from Bit? A Participatory Universe?: Three Far-reaching, Visionary Questions from John Archibald Wheeler
Physics, Math, Information: String Theory, Quantum, Sporadic finite Groups, Leech Lattice, Gravity as emergent
Universe digital copy conjecture: representation of universal information
Emergent Transformation Conjecture: the math of emergent
Potential applications: Deep Learning; Capability Transformation using Enterprise Architect
What’s in it for us: Information science, getting ready for Industry 4.0
IJCER (www.ijceronline.com) International Journal of computational Engineerin...ijceronline
This document summarizes text mining techniques for information retrieval, extraction, and indexing. It discusses common information retrieval techniques like inverted indices and signature files. It also covers stemming, domain dictionaries, exclusion lists, and research directions in text mining like finding better representations for extracted information, enabling multilingual analysis, and integrating domain knowledge. The key techniques discussed are text indexing, query processing, and information extraction from text.
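The indexing techniques named in this summary (an inverted index, stemming, and an exclusion list) can be sketched in a few lines of Python. This is a minimal illustration, not the method of the summarized paper: the `EXCLUSION_LIST` contents, the `naive_stem` suffix rules, and the sample documents are all assumptions made for the example, and a real system would use an established stemmer such as Porter's.

```python
from collections import defaultdict

# Hypothetical exclusion list (stopwords); real lists are much longer.
EXCLUSION_LIST = {"the", "a", "of", "and", "to", "in", "is"}


def naive_stem(word):
    """Crude suffix stripping, standing in for a real stemming algorithm."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


def build_inverted_index(docs):
    """Map each stemmed, non-excluded term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for token in text.lower().split():
            token = token.strip(".,;:!?()\"'")
            if token and token not in EXCLUSION_LIST:
                index[naive_stem(token)].add(doc_id)
    return index


def query(index, terms):
    """Return ids of documents that contain every query term (boolean AND)."""
    postings = [index.get(naive_stem(t.lower()), set()) for t in terms]
    return set.intersection(*postings) if postings else set()


# Hypothetical three-document collection.
docs = {
    1: "Indexing the documents of a library",
    2: "Mining text and indexing terms",
    3: "The history of libraries",
}
index = build_inverted_index(docs)
print(query(index, ["indexing", "text"]))
```

The query is stemmed with the same function as the documents, so "indexing" in a query matches the stored term "index". The crude stemmer does not conflate "library" and "libraries", which is one reason production systems rely on better stemmers or lemmatizers.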
The document provides background information on ProQuest, an electronic database subscribed to by La Salle University-Ozamiz City. It discusses how ProQuest provides access to citations, full text articles, dissertations, and other materials. It also reviews literature related to e-resources, awareness of ProQuest, and barriers to its use. The study aims to determine students' use, awareness, frequency of use, reasons for using, and barriers in using ProQuest. It describes the research methodology, including the descriptive research design, setting of the study at LSU, respondents which were 505 randomly selected students, and the questionnaire used as the data collection instrument.
Arcomem training Topic Analysis Models beginners
arcomem
Probabilistic topic models are algorithms that aim to discover and annotate large collections of documents with thematic information without any prior annotations. They work by analyzing the statistical co-occurrence of words to identify topics, where a topic is a probability distribution over words. Documents are represented as mixtures of topics. For example, a document may have a 60% probability of being about biology, 30% about physics, and 10% about mathematics. Topics emerge from the statistical analysis and provide interpretable groups of correlated terms.
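The mixture idea can be made concrete with a small sketch. If each topic is a distribution over words and a document is a weighted mixture of topics, the probability of a word in the document is the weighted sum of its probability under each topic. The topic names and the 60/30/10 weights follow the example above; the four-word vocabulary and the per-topic word probabilities are invented for illustration.

```python
# Hypothetical topics: each is a probability distribution over a tiny vocabulary.
topics = {
    "biology": {"cell": 0.5, "gene": 0.4, "energy": 0.1, "proof": 0.0},
    "physics": {"cell": 0.1, "gene": 0.0, "energy": 0.8, "proof": 0.1},
    "mathematics": {"cell": 0.0, "gene": 0.0, "energy": 0.1, "proof": 0.9},
}

# Document-level topic mixture from the example in the text.
doc_mixture = {"biology": 0.6, "physics": 0.3, "mathematics": 0.1}


def word_probability(word, mixture, topics):
    """p(word | doc) = sum over topics of p(topic | doc) * p(word | topic)."""
    return sum(weight * topics[topic].get(word, 0.0) for topic, weight in mixture.items())


vocabulary = ["cell", "gene", "energy", "proof"]
word_probs = {w: word_probability(w, doc_mixture, topics) for w in vocabulary}
for word, prob in word_probs.items():
    print(f"{word}: {prob:.2f}")
```

This only shows the generative direction (from topics to words). Fitting a topic model runs the other way, estimating the topic distributions and mixtures from observed word counts, and is not attempted here.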
The document discusses different types of information retrieval systems such as traditional query-based systems, text categorization systems, text routing systems, and text filtering systems. It also describes some common techniques used in information retrieval systems like inverted indexing, stopword removal, stemming, and vector space models. Finally, it discusses opportunities for integrating information retrieval techniques with natural language processing to develop more accurate and effective retrieval systems.
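As a rough illustration of the vector space model mentioned above, the sketch below represents documents and a query as TF-IDF weighted term vectors and ranks documents by cosine similarity. The smoothed IDF formula is one common variant chosen for the example, and the sample documents and function names are assumptions; stopword removal and stemming are left out to keep the sketch short.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenization; no stopword removal or stemming here."""
    return text.lower().split()


def tf_idf_vectors(docs):
    """Return one sparse TF-IDF vector (dict) per document, plus the IDF table."""
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Assumed IDF variant: log(N / df) + 1, so terms in every document keep weight.
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}
    vectors = [
        {term: count * idf[term] for term, count in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0


def rank(query, docs):
    """Return document positions ordered from most to least similar to the query."""
    vectors, idf = tf_idf_vectors(docs)
    q_vec = {t: c * idf.get(t, 0.0) for t, c in Counter(tokenize(query)).items()}
    scores = [(cosine(q_vec, vec), i) for i, vec in enumerate(vectors)]
    return [i for _, i in sorted(scores, key=lambda pair: (-pair[0], pair[1]))]


# Hypothetical collection.
docs = [
    "inverted index for text retrieval",
    "stemming and stopword removal",
    "vector space retrieval model",
]
ranking = rank("vector space model", docs)
print(ranking)
```

Query terms that never occur in the collection get zero weight, so they do not affect the ranking. Ties are broken by document position to keep the output deterministic.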
Information Retrieval Methods in Libraries and Information Centers
Edeama Onwuchekwa
The document discusses various information retrieval methods used in libraries and information centers. It describes traditional methods like cataloguing, classification, indexing, and abstracting. It also discusses newer methods like metadata and online public catalogs. The goal of these various methods is to facilitate the storage and retrieval of information to meet users' needs.
Information science is an interdisciplinary field that incorporates aspects of computer science, library science, communication, and other diverse fields. It is concerned with the origination, collection, organization, sharing, and use of information. While related to fields like information theory and library science, information science has a broader focus on the nature, behavior, and management of information across different technologies and domains. It examines information generation, storage, retrieval, and application through databases and knowledge management. Information science is taught jointly with library science and applies emerging technologies to analyze and improve how information is accessed and used.
Search, Signals & Sense: An Analytics Fueled Vision
Seth Grimes
The document discusses how text analytics can fuel semantic search and sensemaking by extracting features from documents, analyzing relationships between entities, and integrating search with other data sources. It outlines trends toward more unified search platforms that incorporate user context and infer intent to provide categorized, clustered results rather than just hit lists. The goal is for search to be the starting point for iterative sensemaking through analysis and synthesis of information.
Relationship of information science with library science
Sadaf Batool
Relationship of information science with library science
Presentation by Sadaf Batool
MPhil 1st semester
Table of contents
1. Definition of information science
2. Definition of library science
3. Primary history of library
4. Primary history of information
5. Progress of library science as (Library and information science)
6. IS & LS concerned tasks
7. Relationship of Information science with library science
8. According to S. R. Ranganathan’s five laws
9. Difference of Information science & Library science
10. Conclusion
11. References
Definition of information science
Information science is that discipline that investigates the properties and behavior of information, the forces governing the flow of information, and the means of processing information for optimum accessibility and usability.
It is primarily concerned with the analysis, collection, classification, manipulation, storage, retrieval, movement, dissemination, and protection of information.
This includes the investigation of information representations in both natural and artificial systems, the use of codes for efficient message transmission, and the study of information processing devices and techniques such as computers and their programming systems.
It is an interdisciplinary science derived from and related to such fields as mathematics, logic, linguistics, psychology, computer technology, operations research, the graphic arts, communications, library science, management, and other similar fields. It has both a pure science component, which inquires into the subject without regard to its application, and an applied science component, which develops services and products. (Borko, 1968, p. 3)
The study of the use of information, its sources and development; usually taken to refer to the role of scientific, industrial and specialized libraries and information units in the handling and dissemination of information. (Prytherch, 2005)
The systematic study and analysis of the sources, development, collection, organization, dissemination, evaluation, use, and management of information in all its forms, including the channels (formal and informal) and technology used in its communication. (Reitz, 2004)
Definition of library science
The study of principles and practices of library care, and organization and administration of a library, and of its technical, informational, and reference services.
Library science is “a generic term for the study of libraries and information units, the role they play in society, their various component routines and processes, and their history and future development.” (Harrod’s Librarians’ Glossary)
Collection of reading material, its processing, organization and dissemination started with the advent of library. The knowledge and its implementation in respect of library may therefore be called library science.
The professional kn
Automatic indexing is the process of analyzing documents to extract information to be included in an index. This can be done through statistical, natural language, concept-based, or hypertext linkage techniques. Statistical techniques are the most common, identifying words and phrases to index documents. Natural language techniques perform additional parsing of text. Concept indexing correlates words to concepts, while hypertext linkages create connections between documents. The goal of automatic indexing is to preprocess documents to allow for relevant search results by representing concepts in the index.
Phyloinformatics combines phylogenetics and informatics to systematically study and classify evolutionary relationships. It has progressed from closed private data to more open and linked data through standards like ontologies and semantic web technologies. This allows phylogenetic concepts and data to be formalized and connected across resources using unique identifiers and statements called triples. Querying linked phylogenetic data from integrated sources will enable new synthetic research, though challenges remain in deploying these technologies and unlocking legacy data currently locked in publications.
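The idea of linking data through identifiers and triples can be sketched with a tiny in-memory store. This is only an illustration of subject-predicate-object statements and pattern queries: the identifiers, the `phylo:` predicate names, and the helper functions are hypothetical and do not come from any particular ontology or from the summarized document.

```python
# Hypothetical triples: (subject, predicate, object), using made-up identifiers.
triples = {
    ("taxon:Homo_sapiens", "rdf:type", "phylo:Taxon"),
    ("taxon:Pan_troglodytes", "rdf:type", "phylo:Taxon"),
    ("taxon:Homo_sapiens", "phylo:has_parent", "node:Hominini"),
    ("taxon:Pan_troglodytes", "phylo:has_parent", "node:Hominini"),
    ("node:Hominini", "phylo:has_parent", "node:Homininae"),
}


def match(pattern, store):
    """Return triples matching a pattern; None in a position acts as a wildcard."""
    return sorted(
        t for t in store
        if all(p is None or p == value for p, value in zip(pattern, t))
    )


def ancestors(node, store):
    """Follow has_parent links upward; assumes a tree (one parent, no cycles)."""
    lineage = []
    current = node
    while True:
        parents = match((current, "phylo:has_parent", None), store)
        if not parents:
            return lineage
        current = parents[0][2]
        lineage.append(current)


print(ancestors("taxon:Homo_sapiens", triples))
print([s for s, _, _ in match((None, "phylo:has_parent", "node:Hominini"), triples)])
```

Because every statement has the same three-part shape, data from different sources can be merged by taking the union of their triple sets, provided they agree on identifiers. In practice this role is played by RDF stores and SPARQL queries rather than a Python set.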
This document discusses text and data mining (TDM) and provides definitions from 1982, 1999, and 2008 that describe mining as automatically generating logical representations of text passages, the (semi)automated discovery of trends and patterns across large datasets, and the use of automated methods to exploit knowledge in biomedical literature. It also lists different types of content that can be mined, such as images, graphs, tables, datasets, and text, and provides 101 potential uses for content mining, such as finding papers about chemistry in German or papers acknowledging support from the Wellcome Trust.
The document discusses various topics related to unstructured data analytics including text mining, web mining, and big data. It provides details on text mining tasks like information extraction, topic tracking, summarization, classification, clustering, and association. The key aspects of text mining discussed are preprocessing text data through tokenization, part-of-speech tagging, and semantic analysis. Text mining aims to extract useful information and discover patterns from large collections of unstructured text documents.
This document discusses the field of information science. It defines information science as an interdisciplinary field concerned with analyzing, collecting, storing, retrieving, and disseminating information. Information science incorporates aspects of computer science, as well as fields like library science, communication, management, and social science. The document traces the evolution of information science from focusing on applying computer technology to documents in the 1960s to becoming a broader field that studies the nature, collection, and management of information. Information science is described as an intersection of various disciplines and as being interdisciplinary in nature.
Applying machine learning techniques to big data in the scholarly domain
Angelo Salatino
Slides of the Lecture at the 5th International School on Applied Probability Theory, Communications Technologies & Data Science (APTCT-2020)
12 Nov 2020
Information science is an interdisciplinary field that incorporates aspects of computer science, library science, communication, and other diverse fields. It is concerned with the origination, collection, organization, transformation, and utilization of information. While it is related to information theory and library science, information science has a broader focus on the nature, collection, and management of information across different domains. Information science has become increasingly important with the development and widespread use of computer and information technology to store, retrieve, interpret, and use information through databases and information systems.
Are New Digital Literacies Skills Neededrscd2018SusanMRob
"Remarrying research and collection services around access to corpora and text mining: are new technical literacy skills needed?" was presented by Ingrid Mason (Deployment Strategist, AARNet) at the Research Support Community Day 2018.
Text Mining: Beyond Extraction Towards Exploitation
butest
This document proposes a project called TIE (Text Information Exploitation) to go beyond information extraction from text and towards exploiting text to deduce new knowledge. The objectives are to aggregate extracted information to find causal links and other insights not explicitly stated in source texts. The methodology involves using domain ontologies to integrate information extraction techniques from real-world texts. The goal is to semi-automate knowledge discovery for applications like analyzing business reports. The project brings together experts in knowledge acquisition, computational linguistics, machine learning and information retrieval to address open research issues and apply text mining to the scenario of mining annual reports.
Finding Your Literature Match - A Recommender System
Edwin Henneken
1. The document introduces a recommender system that uses a "topic space" constructed from document metadata like keywords to cluster documents and users based on similarity.
2. It provides an example recommendation for a recent arXiv paper based on its bibliographic metadata and usage logs of expert users.
3. Some open questions about optimizing and updating the system are discussed, like whether keywords fully describe the document universe and how to handle literature without keywords.
A Case Study Protocol For Meta-Research Into Digital Practices In The Humanities
Jeff Brooks
This document presents a case study protocol for conducting meta-research on digital practices in the humanities. The protocol was developed by the Digital Methods and Practices Observatory working group to help researchers adopt this methodology across disciplines and approaches. The document discusses three pilot meta-research studies on digital practices that informed the protocol's development. It also provides several examples of how digital tools are being integrated into various stages of humanities research in uneven ways and highlights how research practices are unpredictable and assembled in response to specific project needs.
RELATIONSHIP OF LIBRARY SCIENCE WITH INFORMATION SCIENCE
Libcorpio
LS relationship IS; Library and Information Science (LIS); Library Science and Information Science; LS vs IS; Relationship of Library science with Information science; Library Science vs Information Science: Similarities and Differences
Informatics is the study of the structure, behavior, and interactions of natural and artificial systems that store, process and communicate information. It includes the representation, processing and communication of information, as well as the fundamentals of computation and technologies used. Informatics also examines the social aspects of technology and its role in social and organizational change. Library informatics applies these principles to study how information systems can best deliver the right information to users in the right way. The goal is to integrate information and communication technologies into libraries and information organizations.
This document analyzes YouTube's business model. It explains that YouTube and other online video sites represent a new business model for audiovisual content, driven by the change in consumption habits brought about by new technologies. It describes how YouTube leverages user participation to improve continuously and to attract an audience different from that of traditional media.
Search, Signals & Sense: An Analytics Fueled VisionSeth Grimes
The document discusses how text analytics can fuel semantic search and sensemaking by extracting features from documents, analyzing relationships between entities, and integrating search with other data sources. It outlines trends toward more unified search platforms that incorporate user context and infer intent to provide categorized, clustered results rather than just hit lists. The goal is for search to be the starting point for iterative sensemaking through analysis and synthesis of information.
Relationship of information science with library scienceSadaf Batool
Relationship of information science with library science
Presentation by Sadaf Batool
MPhil 1st semester
Table of contents
1. Definition of information science
2. Definition of library science
3. Primary history of library
4. Primary history of information
5. Progress of library science as (Library and information science)
6. IS &LS concerned task
7. Relationship of Information science with library science
8. According to S.R Nathan’s five laws
9. Difference of Information science &Library science
10. Conclusion
11. References
Definition of information science
Information science is that discipline that investigates the properties and behavior of information, the forces governing the flow of information, and the means of processing information for optimum accessibility and usability.
It primarily concerned with the analysis, collection, classification, manipulation, storage, retrieval, movement, dissemination, and protection of information.
This includes the investigation of information representations in both natural and artificial systems, the use of codes for efficient message transmission, and the study of information processing devices and techniques such as computers and their programming systems.
It is an interdisciplinary science derived from and related to such fields as mathematics, logic, linguistics, psychology, computer technology, operations research, the graphic arts, communications, library science, management, and other similar fields. It has both a pure science component, which inquiries into the subject without regard to its application, and an applied science component, which develops services and products." (Borko, 1968, p.3The study of – the use of information, – its sources and development; – usually taken to refer to the role of scientific, industrial and specialized libraries and information units – in the handling and – dissemination of information. (Prytherch, 2005)
The systematic study and analysis of the – sources, – development, – collection, – organization, – dissemination, – evaluation, – use, and – management of information in all its forms, including the channels (formal and informal) and technology used in its communication. – –(Reitz, 2004) Definition of library science
The study of principles and practices of library care, and organization and administration of a library, and of its technical, informational, and reference services.
Library science as “a generic term for the study of libraries and information units, the role they play in society, their various component routines and processes, and their history and future development. (Harrods ‘Librarian’s Glossary)
Collection of reading material, its processing, organization and dissemination started with the advent of library. The knowledge and its implementation in respect of library may therefore be called library science.
The professional kn
Automatic indexing is the process of analyzing documents to extract information to be included in an index. This can be done through statistical, natural language, concept-based, or hypertext linkage techniques. Statistical techniques are the most common, identifying words and phrases to index documents. Natural language techniques perform additional parsing of text. Concept indexing correlates words to concepts, while hypertext linkages create connections between documents. The goal of automatic indexing is to preprocess documents to allow for relevant search results by representing concepts in the index.
Phyloinformatics combines phylogenetics and informatics to systematically study and classify evolutionary relationships. It has progressed from closed private data to more open and linked data through standards like ontologies and semantic web technologies. This allows phylogenetic concepts and data to be formalized and connected across resources using unique identifiers and statements called triples. Querying linked phylogenetic data from integrated sources will enable new synthetic research, though challenges remain in deploying these technologies and unlocking legacy data currently locked in publications.
This document discusses text and data mining (TDM) and provides definitions from 1982, 1999, and 2008 that describe mining as automatically generating logical representations of text passages, the (semi)automated discovery of trends and patterns across large datasets, and the use of automated methods to exploit knowledge in biomedical literature. It also lists different types of content that can be mined, such as images, graphs, tables, datasets, and text, and provides 101 potential uses for content mining, such as finding papers about chemistry in German or papers acknowledging support from the Wellcome Trust.
The document discusses various topics related to unstructured data analytics including text mining, web mining, and big data. It provides details on text mining tasks like information extraction, topic tracking, summarization, classification, clustering, and association. The key aspects of text mining discussed are preprocessing text data through tokenization, part-of-speech tagging, and semantic analysis. Text mining aims to extract useful information and discover patterns from large collections of unstructured text documents.
This document discusses the field of information science. It defines information science as an interdisciplinary field concerned with analyzing, collecting, storing, retrieving, and disseminating information. Information science incorporates aspects of computer science, as well as fields like library science, communication, management, and social science. The document traces the evolution of information science from focusing on applying computer technology to documents in the 1960s to becoming a broader field that studies the nature, collection, and management of information. Information science is described as an intersection of various disciplines and as being interdisciplinary in nature.
This document discusses the field of information science. It defines information science as an interdisciplinary field concerned with analyzing, collecting, storing, retrieving, and disseminating information. Information science incorporates aspects of computer science, as well as fields like library science, communication, management, and social science. The document traces the evolution of information science from focusing on applying computer technology to documents in the 1960s to becoming a broader field that studies the nature, collection, and management of information. Information science is described as an intersection of various disciplines and as being interdisciplinary in nature.
Applying machine learning techniques to big data in the scholarly domainAngelo Salatino
Slides of the Lecture at the 5th International School on Applied Probability Theory,Communications Technologies & Data Science (APTCT-2020)
12 Nov 2020
Information science is an interdisciplinary field that incorporates aspects of computer science, library science, communication, and other diverse fields. It is concerned with the origination, collection, organization, transformation, and utilization of information. While it is related to information theory and library science, information science has a broader focus on the nature, collection, and management of information across different domains. Information science has become increasingly important with the development and widespread use of computer and information technology to store, retrieve, interpret, and use information through databases and information systems.
Are New Digital Literacies Skills Needed (rscd2018, SusanMRob)
"Remarrying research and collection services around access to corpora and text mining: are new technical literacy skills needed?" was presented by Ingrid Mason (Deployment Strategist, AARNet) at the Research Support Community Day 2018.
Text Mining: Beyond Extraction Towards Exploitation (butest)
This document proposes a project called TIE (Text Information Exploitation) to go beyond information extraction from text and towards exploiting text to deduce new knowledge. The objectives are to aggregate extracted information to find causal links and other insights not explicitly stated in source texts. The methodology involves using domain ontologies to integrate information extraction techniques from real-world texts. The goal is to semi-automate knowledge discovery for applications like analyzing business reports. The project brings together experts in knowledge acquisition, computational linguistics, machine learning and information retrieval to address open research issues and apply text mining to the scenario of mining annual reports.
Finding Your Literature Match - A Recommender System (Edwin Henneken)
1. The document introduces a recommender system that uses a "topic space" constructed from document metadata like keywords to cluster documents and users based on similarity.
2. It provides an example recommendation for a recent arXiv paper based on its bibliographic metadata and usage logs of expert users.
3. Some open questions about optimizing and updating the system are discussed, like whether keywords fully describe the document universe and how to handle literature without keywords.
A Case Study Protocol For Meta-Research Into Digital Practices In The Humanities (Jeff Brooks)
This document presents a case study protocol for conducting meta-research on digital practices in the humanities. The protocol was developed by the Digital Methods and Practices Observatory working group to help researchers adopt this methodology across disciplines and approaches. The document discusses three pilot meta-research studies on digital practices that informed the protocol's development. It also provides several examples of how digital tools are being integrated into various stages of humanities research in uneven ways and highlights how research practices are unpredictable and assembled in response to specific project needs.
RELATIONSHIP OF LIBRARY SCIENCE WITH INFORMATION SCIENCE (Libcorpio)
Keywords: LS relationship IS; Library and Information Science (LIS); Library Science vs Information Science; relationship of library science with information science; similarities and differences.
Informatics is the study of the structure, behavior, and interactions of natural and artificial systems that store, process and communicate information. It includes the representation, processing and communication of information, as well as the fundamentals of computation and technologies used. Informatics also examines the social aspects of technology and its role in social and organizational change. Library informatics applies these principles to study how information systems can best deliver the right information to users in the right way. The goal is to integrate information and communication technologies into libraries and information organizations.
1. Text Mining 1
Running head: TEXT MINING
Text Mining
Mark Sharp
Rutgers University, School of Communication, Information and Library Studies
Abstract
The general idea of text mining – getting small "nuggets" of desired information out of
"mountains" of textual data without having to read it all – is nearly as old as information retrieval
(IR) itself. Currently text mining is enjoying a surge of interest fueled by the popularity of the
Internet, the success of bioinformatics, and a rebirth of computational linguistics. It can be
viewed as one of a class of nontraditional IR strategies which attempt to treat entire text
collections holistically, avoid the bias of human queries, objectify the IR process with principled
algorithms, and "let the data speak for itself." These strategies share many techniques such as
semantic parsing and statistical clustering, and the boundaries between them are fuzzy.
Therefore in this paper several related concepts are briefly reviewed in addition to text mining
proper, including data mining, machine learning, natural language processing, text
summarization, template mining, theme finding, text categorization, clustering, filtering, text
visualization, and text compression. Current text mining systems per se appear to be fairly
primitive, but to have the following goals which may serve as a useful definition to distinguish
text mining from other IR concepts: (1) to operate on large, natural language text collections; (2)
to use principled algorithms more than heuristics and manual filtering; (3) to extract
phenomenological units of information (e.g., patterns) rather than or in addition to documents;
(4) to discover new knowledge. Interest in text mining for biomedical research purposes is
especially pervasive and can be viewed as a major new frontier in bioinformatics. Text mining
systems designed for use with science and technology text databases such as MEDLINE
currently seem to have an undue emphasis on expert human filtering which contradicts goal (2).
Whether this represents premature surrender to difficulty or a necessary temporary expedient
remains to be seen.
Text Mining
Why Text Mining?
It has become a cliché to describe information space and the challenge of navigating it in
dramatic, even histrionic terms ("explosion," "avalanche," "flood," and the like), especially with
regard to scientific, technical, and scholarly literature. We moderns may like to think we are the
first to face this problem, but scientists have always complained about keeping up with their
literature (Saracevic, 2001). The promise of better science through better information technology
has been a major theme in information science since Vannevar Bush (1945) proposed his famous
Memex machine to deal with the "growing mountain of research."
Text mining is data mining applied to textual data. Text is "unstructured, amorphous, and
difficult to deal with" but also "the most common vehicle for formal exchange of information."
Therefore, the "motivation for trying to extract information from it is compelling – even if success
is only partial …. Whereas data mining belongs in the corporate world because that's where most
databases are, text mining promises to move machine learning technology out of the companies
and into the home" as an increasingly necessary Internet adjunct (Witten & Frank, 2000) – i.e., as
"web data mining" (Hearst, 1997). Laender, Ribeiro-Neto, da Silva, and Teixeira (2001) provide a
current review of web data extraction tools.
Text mining is one of a class of what I will call "nontraditional information retrieval (IR)
strategies." The goal of these strategies is to reduce the effort required of users to obtain useful
information from large computerized text data sources. Traditional IR often simultaneously
retrieves both "too little" information and "too much" text (Humphreys, Demetriou, & Gaizauskas,
2000). The nontraditional strategies represent a "broader definition of IR" and the view that "a
truly useful system must go beyond simple retrieval" (Liddy, 2000). I see them as treating the
entire database or collection more holistically, recognizing that the selectivity of anthropogenic
queries has a downside or bias which can be counterproductive to obtaining the best information,
and attempting to "objectify" the IR process with principled algorithms.[1] I like to think that they
try to "let the data speak for itself."
When I started to research this paper I made a list of all the IR concepts (traditional and
non-) that were explicitly related to text mining by the first wave of authorities I identified. It was
a daunting list (Table 1), but I thought it would be possible to rule them all either "in" or "out" and
thus define their boundaries and hierarchical relationships to text mining. However, it soon
became clear that the boundaries were fuzzy, the hierarchy was a mass of convoluted loops, and
even seemingly outlandish claims to text mining relevance had, on closer inspection, a grain of
truth.[2] Therefore I decided to try to cover them all instead of focusing on text mining proper,
whatever that turned out to be. Fortunately, time and literature resource limitations intervened to
significantly curtail this plan. Hopefully the result will serve as a sensible compromise.
[1] E.g., "'Objectivity' [means] the results solely depend on the outcome of the linguistic processing algorithms and
statistical calculations" (Dorre, Gerstl, & Seiffert, 1999). I recognize that such computational exotica, stripped of their
mathematical mystique, "can be regarded as a form of transformed cognitive structure" (Ingwersen & Willett, 1995)
and are therefore ultimately just as human and arbitrary as the traditional methods. But I also believe that there can be
degrees of objectivity (operationally defined as general validity or utility) and that in general abstract computational
approaches will tend to be more objective.
[2] There is one website, however, that goes too far. Greenfield (2001) lists virtually every text processing and
database technology I have ever heard of under the title "Text Mining." As a kind of rite of passage into the subject,
Patrick Perrin asked me to look at it and tell him if all of that was really text mining, so apparently it's somewhat
notorious in the field.
History of Text Mining
H. P. Luhn (1958), in a seminal paper on automatic abstracting, noted "the resolving power
of significant words" in primary text. Lauren B. Doyle (1961) also captured the spirit of text
mining and related methods when he said that "natural characterization and organization of
information can come from analysis of frequencies and distributions of words in libraries"
("libraries" representing what we would now more generally call collections or corpora). Text
mining per se may be new, but the dream of training a computer to extract information from
"mountains" of textual data is nearly as old as IR itself.
Don R. Swanson (1988) articulated the idea that the scientific literature should be regarded
as a natural phenomenon worthy of "exploration, correlation, and synthesis." He contrasted
scientists' attitudes toward information usage with those of intelligence analysts.
'To the working scientist or engineer, time spent gathering information or writing reports is
often regarded as a wasteful encroachment on time that would otherwise be spent
producing results that he believes to be new' [Weinberg et al., 1963] …. The intelligence
analyst, by contrast, is much more intimate with the available base of recorded information.
New knowledge, or finished intelligence, is seen as emerging from large numbers of
individually unimportant but carefully hoarded fragments that were not necessarily
recognized as related to one another at the time they were acquired. Use of stored data is
intensively interactive; "information retrieval" is an inadequate and even misleading
metaphor. The analyst is continually interacting with units of stored data as though they
were pieces selected from a thousand scrambled jigsaw puzzles. Relevant patterns, not
relevant documents, are sought.
Swanson called upon scientists to be more like intelligence analysts; to "take seriously the idea that
new knowledge is to be gained from the library as well as the laboratory [and] to develop attitudes
toward information indistinguishable from attitudes toward research itself."
Not content to lecture scientists from a theoretical pedestal, by the time these words were
published Swanson had already put the idea into practice by developing a system to discover
meaningful new knowledge in the biomedical literature (see references in Swanson & Smalheiser,
1999). Software now called ARROWSMITH and freely available on the web
(http://kiwi.uchicago.edu) helps by finding common keywords and phrases in "complementary and
noninteractive" sets of articles or "literatures" and juxtaposing representative citations likely to
reveal interesting co-occurrences. Two literatures are "complementary if together they can reveal
useful information not apparent in the two sets considered separately" – e.g., one may reveal a
natural relationship between A and B, and the other a relationship between B and C, so that
together they suggest a relationship between A and C. The two literatures are "noninteractive" if
their articles do not cross-cite and are not co-cited elsewhere in the literature. Swanson has
discovered at least three biomedically important relationships using this system: between fish oil
and Raynaud's syndrome, between magnesium and both migraines and epilepsy, and between arginine
and somatomedin C (Lindsay & Gordon, 1999). Most recently he has used it to identify several dozen viruses as
potential bioweapons (Swanson, Smalheiser, & Bookstein, 2001).
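The A-B-C reasoning behind complementary literatures can be sketched in a few lines of Python. This is only an illustration of the idea, not ARROWSMITH's actual implementation: the function name `abc_candidates`, the representation of each article as a set of terms, and the placeholder terms ("A", "b1", and so on) are all assumptions made for the example.

```python
def abc_candidates(literature_a, literature_c, a_term, c_term):
    """Return candidate B terms that link A and C across two literatures.

    Each literature is a list of articles; each article is modeled
    (hypothetically) as a set of keywords or phrases.
    """
    # B terms that co-occur with A somewhere in the first literature
    b_from_a = set()
    for terms in literature_a:
        if a_term in terms:
            b_from_a |= terms - {a_term}

    # B terms that co-occur with C somewhere in the second literature
    b_from_c = set()
    for terms in literature_c:
        if c_term in terms:
            b_from_c |= terms - {c_term}

    # Terms seen on both sides suggest an indirect A-C relationship
    return sorted(b_from_a & b_from_c)


# Placeholder data, invented for illustration only
literature_a = [{"A", "b1", "x"}, {"A", "b2"}, {"y", "b3"}]
literature_c = [{"C", "b1"}, {"C", "b2", "z"}, {"C", "w"}]

links = abc_candidates(literature_a, literature_c, "A", "C")
print(links)  # ['b1', 'b2']
```

The sketch only proposes candidate links; deciding whether a shared term reflects a meaningful relationship is still left to a human reader, as in Swanson's system.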
Swanson's system remains far from fully automated and is highly specific to the medical domain,
and to my knowledge Swanson has never referred to it as text mining. But I believe it meets the
criteria at least partially (see below), and Swanson has been recognized as an early pioneer by
self-described text mining practitioners Marti Hearst (1999) and Ronald Kostoff (1999). I would like
to go further and propose that, because of the ideas he expressed in his 1988 JASIS paper,
Swanson is the father of modern text mining.
What is Text Mining?
Text mining per se is new and is still defining itself. It "has the peculiar distinction of
having a name and a fair amount of hype but as yet almost no practitioners" (Hearst, 1999), and
most of the information about it on the web is "misleading" (Perrin, 2001). The mining metaphor
"implies extracting precious nuggets of ore from otherwise worthless rock" (Hearst, 1999), "gold
hidden in … mountains of textual data" (Dorre, Gerstl, & Seiffert, 1999), or the idea that "the
computer rediscovers information that was encoded in the text by its author" (IBM, 1998b).
Hearst (1997, 1999) has argued for a narrow definition of text mining which distinguishes
it from "information access" (traditional IR). Traditional IR is concerned primarily with the
retrieval of documents (perhaps it should be called "DR"!) relevant to a user's information need,
but getting the desired information out of the documents is left entirely up to the user. According
to Hearst, data mining (of which text mining is a subtype, see below) not only deals directly with
the information, it tries to discover or derive new information from the data (text) which was
previously unknown even to the author(s) of the data (text[s]). She says "data mining is
opportunistic, whereas information access is goal-driven" and that IR tricks such as clustering,
finding terms for query expansion, and co-citation analysis are not text mining, although they can
aid it by improving the target dataset. Thus, IR can be viewed as a complementary technique
supporting text mining, rather than its broader term.
Text mining always involves (a) getting some texts relevant to the domain of interest
(traditional IR); (b) representing the content of the text in some medium useful for processing
(natural language processing, statistical modeling, etc.); and (c) doing something with the
representation (finding associations, dominant themes, etc.) (Perrin, 2001).
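The three steps (a) to (c) can be sketched as a toy pipeline. This is a minimal sketch under simplifying assumptions, not a description of any system cited here: retrieval is reduced to a keyword match, the representation is a bag of words, and the analysis is a simple count of dominant terms. The function names, the stopword list, and the sample texts are invented for illustration.

```python
import re
from collections import Counter

STOPWORDS = {"and", "the", "of"}  # tiny illustrative list, an assumption


def retrieve(collection, keyword):
    # (a) get texts relevant to the domain of interest (keyword match only)
    return [text for text in collection if keyword.lower() in text.lower()]


def represent(text):
    # (b) represent the content as a bag of words (term -> count)
    words = re.findall(r"[a-z]+", text.lower())
    return Counter(w for w in words if w not in STOPWORDS)


def analyze(representations, top_n=3):
    # (c) do something with the representation: find dominant terms
    total = Counter()
    for rep in representations:
        total.update(rep)
    return total.most_common(top_n)


# Invented sample texts
collection = [
    "Magnesium deficiency and migraine",
    "Migraine and stress",
    "Fish oil and blood viscosity",
]

relevant = retrieve(collection, "migraine")
themes = analyze([represent(t) for t in relevant], top_n=1)
print(themes)  # [('migraine', 2)]
```

Real systems differ mainly in how much sophistication goes into each stage; the division of labor among the three stages is the point of the sketch.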
IBM is marketing a product named "Intelligent Miner for Text" (IBM, 1998a,b; Dorre et al.,
1999). It is a set of tools which "can be seen as information extractors which enrich
documents with information about their contents" in the form of structured metadata. "Features"
are classes of data which can be extracted, such as the language of the text, proper names, dates,
currency amounts, abbreviations, and "multiword terms" (significant phrases). The feature
extraction component is "fully automatic – the vocabulary is not predefined." It may operate on
single documents or on collections of documents. Word counts are based on normalization to
canonical forms (e.g., surgeries, surgical, and surgically might all be normalized to surgery).
The phrase extractor "uses a set of simple heuristics… based on a dictionary containing
part-of-speech information for English words [and] simple pattern matching to find expressions having
the noun phrase structures characteristic of technical terms. This process is much faster than
alternative approaches." There is also a clustering tool, a classification tool, and a search engine/
web crawler. The clustering similarity measure is based on "lexical affinities" – correlated
groups of words which appear frequently within a short distance of each other and which can be
used to label the clusters.
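The notion of "lexical affinities" can be illustrated with a small co-occurrence counter. This is a rough sketch of the general idea only; IBM's actual similarity measure is not specified in the sources cited here. The function name, the `max_distance` and `min_count` parameters, and the sample text are assumptions, and the sketch neither normalizes words to canonical forms nor respects sentence boundaries.

```python
import re
from collections import Counter


def lexical_affinities(text, max_distance=3, min_count=2):
    """Count unordered word pairs that appear within max_distance words."""
    words = re.findall(r"[a-z]+", text.lower())
    pairs = Counter()
    for i, word in enumerate(words):
        # Look ahead at most max_distance positions
        for j in range(i + 1, min(i + 1 + max_distance, len(words))):
            if words[j] != word:
                pairs[tuple(sorted((word, words[j])))] += 1
    # Keep only pairs that recur often enough to count as an affinity
    return {pair: n for pair, n in pairs.items() if n >= min_count}


# Invented sample text
sample = "Text mining finds patterns. Text mining needs data. Data mining came first."
affinities = lexical_affinities(sample, max_distance=1, min_count=2)
print(affinities)  # {('mining', 'text'): 2}
```

Pairs that survive the frequency threshold could then serve as cluster labels, in the spirit of the description above.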
Lindsay and Gordon (1999) and Kostoff (1999) have extended Swanson's approach
without calling it text mining, but Kostoff's other work explicitly uses that label and so he serves
as a kind of bridge. Swanson's system is essentially as follows: MEDLINE searches are done on
two subjects (say, magnesium and migraines) and the results (titles or abstracts) are dumped into
ARROWSMITH, which generates a list of all significant words and phrases common to the two
result sets, and uses this information to "juxtapose pairs of text passages for the user to consider
as possibly complementary" (Swanson & Smalheiser, 1999). Lindsay and Gordon (1999) added
lexical frequency statistics (tf*idf) to rank the common words and phrases by probable
discriminatory value, but their system, like Swanson's, still requires "human filters" at several
points.
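The combination of shared-term listing and tf*idf ranking can be sketched as follows. This is a simplified illustration, not the code of either ARROWSMITH or Lindsay and Gordon: here idf is computed over the two result sets only (an assumption; a real system would more plausibly use statistics from a much larger collection), and the function names and sample titles are invented.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z]+", text.lower())


def rank_shared_terms(results_a, results_c):
    """Rank words common to both result sets by tf*idf, highest first."""
    docs = [tokenize(t) for t in results_a + results_c]
    vocab_a = {w for t in results_a for w in tokenize(t)}
    vocab_c = {w for t in results_c for w in tokenize(t)}
    shared = vocab_a & vocab_c

    # tf: total occurrences; df: number of records containing the word
    tf = Counter(w for d in docs for w in d if w in shared)
    df = Counter(w for d in docs for w in set(d) if w in shared)
    n_docs = len(docs)

    scores = {w: tf[w] * math.log(n_docs / df[w]) for w in shared}
    # Sort by descending score, breaking ties alphabetically
    return sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))


# Invented titles standing in for two MEDLINE result sets
results_a = ["magnesium and vascular tone", "magnesium and stress response"]
results_c = ["migraine and vascular tone", "migraine and stress trigger"]

ranked = rank_shared_terms(results_a, results_c)
for word, score in ranked:
    print(f"{word}: {score:.3f}")
```

In this toy data the function word "and" occurs in every record, so its idf is zero and it falls to the bottom of the list, while the rarer shared words rise to the top. The ranked list still has to be reviewed by a person, which is the "human filter" step noted above.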
Kostoff and co-workers have published several papers on the Web describing various text
mining systems and applications. Losiewicz, Oard, and Kostoff (2000) describe a "TDM [text
data mining] architecture that unifies information retrieval from text collections, information
extraction from individual texts, knowledge discovery in databases, knowledge management in
organizations, and visualization of data and information." What they mean by "unifies" is
unclear, but this statement clearly betokens a broad view of text mining, almost as a synonym for
the entire family of nontraditional IR strategies. The "TDM architecture" they describe includes
subsystems for data collection (source selection and text retrieval), data warehousing
(information extraction and data storage), and data exploitation (data mining and presentation).
It thus appears to be a system for extracting and analyzing metadata. The authors discuss
linguistic analysis and numerous exotic pattern-finding techniques, but these appear to be
long-range goals. Current work focuses on the more pedestrian challenges of relevance feedback
("simulated nucleation"), bibliometrics, and phrase extraction and statistics. The system is "time
and labor intensive" by the authors' own admission, "requires the close involvement of technical
domain expert(s)" at every level of processing, and aims for a "main output [consisting of]
technical experts who have had their horizon and perspectives broadened substantially through
participation in the data mining process. The data mining tools, techniques and tangible products
are of secondary importance…"
Kostoff, Toothman, Eberhart, and Humenik (2000) connect text mining to "database
tomography," a system for phrase extraction and proximity analysis. The authors capture the
spirit of text mining when they say "techniques that identify, select, gather, cull, and interpret
large amounts of technological information semi-autonomously can expand greatly the
capabilities of human beings…" The idea of "tomography" also evokes text visualization, an
important nontraditional IR strategy related to text mining (see below). The authors cite
unpublished studies showing that in "real-world text mining applications" there is a "strong de-
coupling of the text mining research performer from the text mining user. The performer tended
to focus on exotic automated techniques, to the relative exclusion of the components of judgment
necessary for user credibility and acceptance." Users tended to favor simpler techniques, even if
it meant "reading copious numbers of articles." Database tomography aims to couple text mining
research and technology more closely with the user through "heavy involvement of topical
domain experts (either users or their proxies)" in the development of "strategic database maps"
on the "front end." "The authors believe that this is the proper use of automated techniques for
text mining: to augment and amplify the capabilities of the expert by providing insights to the
database structure and contents, not to replace the experts by a combination of machines and
non-experts."
Kostoff and DeMarco (2001) define science and technology text mining as "the
extraction of information from technical literature." It has three components: information
retrieval (gathering relevant documents), information processing, and information integration.
"Information processing is the extraction of patterns from the retrieved records" by bibliometrics,
computational linguistics, and clustering. "Information integration is the synergistic combination
of the information processing computer output with the [human] reading of the retrieved relevant
records. The information processing output serves as a framework for the analysis, and the
insights from reading the records enhance the skeleton structure to provide a logical integrated
product." Again, "substantial manual labor" is noted, and technical details are not given, leaving
doubt as to what kind of and how much "computational linguistics" and "clustering" were
actually implemented. This work was also published under the title "Citation mining: Integrating
text mining and bibliometrics for research user profiling" by Kostoff, del Rio, Humenik, Garcia,
and Ramirez (2001).
In all of Kostoff's articles, there is a disturbingly high ratio of shifting, florid, technical
jargon and speculation to actual accomplishment. He seems to be re-inventing several
well-established techniques such as relevance feedback, co-citation analysis, and phrase extraction,
giving them flashy new names, and failing to cite prior work by others. It is often unclear where
the boundary is between the computer and human filtering, particularly in Kostoff's phrase
extraction process. Given the authors' constant emphasis on the importance of human judgment
it seems likely that they have not automated the phrase selection process at all, and therefore
have not added anything to classical word proximity analysis for phrase identification.
Unrestricted human filtering or intervention in what are supposed to be algorithmic processes is,
in some sense, a form of "fudging" or "cheating." It is antithetical to the goals of standardizing
and objectifying the IR process, and it is hard to see how it contributes anything progressive to
text mining research. This is not to disagree with Kostoff about the importance of domain
expertise and user credibility and acceptance, only to caution against using such concerns as a
fig leaf for excessively primitive IR technology.
Based on the foregoing, I propose the following criteria for a true text mining system.
The keywords are highlighted.
• It must operate on large, natural language text collections.
• It must use principled algorithms more than heuristics and manual filtering.
• It must extract phenomenological units of information (e.g., patterns) rather than or in
addition to documents.
• It must discover new knowledge.
It is to be expected that different systems will meet these criteria to different extents.
Currently Swanson's and Kostoff's systems are on shaky ground on at least the first two, possibly
three. Perhaps text mining, by these criteria, is still more dream than reality. So let's look at
some related concepts.
Data Mining
It seems fairly noncontroversial that text mining is a subdiscipline of the broader and
slightly older field of data mining, the subdiscipline which deals with textual data. An
intermediate evolutionary lexical form, in fact, is "text data mining" (Hearst, 1999; Losiewicz et al,
2000). The mining metaphor implying "extracting precious nuggets of ore from otherwise
worthless rock" is actually more appropriate for text mining than for data mining, which tends to
deal with trends and patterns across whole databases (Hearst, 1999).
Data mining is considered a synonym for "knowledge discovery in databases" (KDD) by
some writers (e.g. Hearst, 1999) and as a narrower term by others (e.g. Liddy, 2000). The most
cited definition of KDD is that given by Fayyad, Piatetsky-Shapiro, and Smyth (1996, cited by Qin,
2000, and Hearst, 1997): the nontrivial process of identifying valid, novel, potentially useful, and
ultimately understandable patterns in data. "Information archaeology" is a synonym for both data
mining and KDD, according to Hearst (1999). Two unusually practical, down-to-earth books on
data mining are Witten and Frank (2000) and Han and Kamber (2001) (Perrin, 2001).
Data mining usually deals with structured data, but text is usually fairly unstructured. The
crux of the text mining problem, then, can be viewed as imposing structure on text to make it
amenable to the analytic techniques of data mining. This is often conceptualized as extracting
metadata from text (Losiewicz et al, 2000).
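As a minimal sketch of what "imposing structure on text" can mean in practice, the following Python fragment turns a few free-text documents into a table of term counts, one row per document. The tokenizer, the function names, and the two sample sentences are illustrative assumptions of mine, not anything taken from the systems reviewed here.

```python
import re
from collections import Counter


def tokenize(text):
    """Lowercase the text and return its alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def term_document_table(docs):
    """Return (vocabulary, rows), where each row holds one document's term counts."""
    counts = [Counter(tokenize(d)) for d in docs]
    vocab = sorted(set().union(*counts))
    rows = [[c[term] for term in vocab] for c in counts]
    return vocab, rows


# Toy documents, invented for illustration only.
docs = ["Text mining finds patterns.", "Data mining finds trends in data."]
vocab, rows = term_document_table(docs)
for doc, row in zip(docs, rows):
    print(row, doc)
```

Once text is in this tabular form, the rows can be handed to the same statistical techniques that data mining applies to structured databases.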
Machine Learning
Data mining is based on a variety of computational techniques, some of which fall under
the rubric of machine learning. Examples are decision trees, neural networks, association rules,
and clustering. In this context, machine learning involves "the acquisition of structural descriptions
from examples [which] can be used for prediction, explanation, and understanding." When the
description can be used to classify the examples, all three are enabled, unlike purely statistical
modeling which only supports prediction. By some views, however, machine learning is little
more than practical statistics as it evolved in the field of computer science; i.e., with an emphasis
on searching "through a space of possible concept descriptions for one that fits the data" (Witten &
Frank, 2000).
From a broader artificial intelligence (AI) perspective, machine learning is one of the four
capabilities needed for an AI system such as a robot to pass the "Turing test" – that is, to appear
logical, rational, and intelligent to an intelligent human interrogator. In this context machine
learning involves the ability "to adapt to new circumstances and to detect and extrapolate patterns"
(Russell & Norvig, 1995).
From a biomedical research perspective, Mjolsness and DeCoste (2001) define machine
learning as "the study of computer algorithms capable of learning to improve their performance of
a task on the basis of their own previous experience" primarily through pattern recognition and
statistical inference. They see a legitimate future role for it in "every element of scientific method,
from hypothesis generation to model construction to decisive experimentation." Text mining
could help with the "high data volumes" involved in literature searching. However, most work to
date has focused on experimental data reduction such as visualization of high-dimensional vector
data resulting from gene expression microarray studies (see footnote 6, p. 25).
Natural Language Processing
Natural language processing (NLP) or understanding (NLU) is the branch of linguistics
which deals with computational models of language. A brief history is given by Bates (1995).
Its motivations are both scientific (to better understand language) and practical (to build
intelligent computer systems). NLP has several levels of analysis: phonological (speech),
morphological (word structure), syntactic (grammar), semantic (meaning of multiword
structures, especially sentences), pragmatic (sentence interpretation), discourse (meaning of
multi-sentence structures), and world (how general knowledge affects language usage) (Allen,
1995). When applied to IR, NLP could in principle combine the computational (Boolean, vector
space, and probabilistic) models' practicality with the cognitive model's willingness to wrestle
with meaning. NLP can differentiate how words are used, for example by sentence parsing and
part-of-speech tagging, and thereby might add discriminatory power to statistical text analysis.
Clearly, NLP could be a powerful tool for text mining. Interest in it for that purpose is
widespread but the jury remains out.
Rau (1988) described an early NLP system named SCISOR which was developed by
General Electric. Limited applicability to "constrained domains" was emphasized; SCISOR was
programmed to deal only with information on corporate mergers. Input (news stories, etc.) was
described as being converted to "conceptual format" permitting natural language interrogation
(i.e., question answering) and summarization. SCISOR employed a parallel strategy of top-down
(expectation-driven conceptual analysis) and bottom-up (partial linguistic analysis) parsing.
Parsing is the identification of subjects, verbs, objects, phrases, modifiers, etc., within sentences.
Computerized parsing of free text "is an extremely difficult and challenging problem," according
to Rau. The two parsers in SCISOR interacted with a domain-specific knowledge base
containing grammatical and lexical information. The double parsing strategy of SCISOR
allowed flexibility to perform in-depth analysis when complete grammatical and lexical
knowledge is available, and superficial analysis when unknown words and syntax are
encountered, giving the system robustness. The top-down parser could also be used for text
skimming (looking for particular pieces of information).
However, semantic analysis "is very expensive and furthermore depends on a lot of
domain-dependent knowledge that has to be constructed manually or obtained from other sources"
(IBM, 1998a). Early NLP's image also suffered from the poor performance of phrase-based
indexing in comparison with stemmed single words in the Cranfield and SMART tests (Salton,
1992). Interest in NLP revived when request-oriented (as opposed to document-oriented) IR came
of age and it was realized that the limitations of the linguistic techniques did not prevent them from
being effective within restricted subject domains (Ingwersen and Willett, 1995). Unlike its more
successful sibling field of speech recognition, NLP has the severe disadvantages of diffuse goals
and lack of robust machine learning algorithms (Bates, 1995). There seems to be wide consensus
that NLP is still not competitive with statistical approaches to traditional IR, but that it may be
practical and even critical for applications such as phrase extraction and text summarization. Even
Salton, the godfather of statistical IR, said, "In the absence of deep linguistic analysis methods that
are applicable to unrestricted subject areas, it is not possible to build intellectually satisfactory text
summaries" (Salton, Allan, Buckley, & Singhal, 1994).
Liz Liddy (2000, 2001) has become a prominent advocate for NLP in text mining. Her
definition of the goal of text mining, in fact, is "capturing semantic information" as tabular
metadata amenable to statistical data mining techniques. In her work, NLP includes stemming
(morphological level), part-of-speech tagging (syntactic level), phrase and proper name
extraction (semantic level), and disambiguation (discourse level). Goals include automating text
mark-up for hypertext linkages in digital libraries, and machine learning algorithms for text
classification (see below).
A "reverse flow" of purely statistical methods to NLP has been going on since about
1990 and has made "substantial contributions" (Kantor, 2001), increasing interest in hybrid
approaches (Marcus, 1995; Losee, 2001a; Perrin, 2001). Statistical enrichment has been shown
to significantly improve the accuracy of proper name classification, part-of-speech tagging, word
sense disambiguation, and parsing under certain conditions (Marcus, 1995), and tagging and
disambiguation improve probabilistic document retrieval ranking discrimination by some parts of
speech (Losee, 2001a). Ultimately, lexical statistics are a reflection of term dependencies which
in turn reflect natural languages' relation to "naturally occurring dependencies in the physical
world" (Losee, 2001b). However, higher-level NLP proved far inferior to "shallow" tricks like
stemming and query expansion in improving the performance of an advanced IR system under
rigorous test conditions (Perez-Carballo & Strzalkowski, 2000).
Computational linguistics is used as a synonym for NLP by some writers and as a
narrower term by others. According to Hearst (1999), it is the branch of NLP which deals with
finding statistical patterns in large text collections to inform algorithms for NLP techniques such
as part-of-speech tagging, word sense disambiguation, and bilingual dictionary creation; i.e.,
computational linguistics is a form of text mining. Thus, to Hearst and Liddy, text mining
subserves NLP, rather than the reverse. Both Hearst and Liddy refer often to metadata as being
the bridge between NLP and statistics. They both envision text mining as a component of a full-
featured information access system which also includes source detection, content retrieval, and
analytical aids such as text visualization (see below).
A major problem in text analysis is "dangling anaphors" – pronouns and demonstratives
(this, that, the latter, etc.) which refer back to other sentences (Johnson, Paice, Black, & Neal,
1993). Therefore a good job for NLP would be to detect anaphors and search backwards to
resolve their referents. In the language of logic, this might be called identifying the point in the
text where each significant new proposition begins. In 1993, that was beyond available text
processing capabilities, so the authors had to exclude anaphoric sentences from further analysis
regardless of their information content.
In summary, all this activity and interest raise hopes, but NLP still "has not delivered the
goods" (Saracevic, 2001) and so the jury remains out.
Text Summarization
An obvious example of text mining would be to find previously unknown natural
correlations by looking at co-occurrences of themes in a corpus of texts. Before one can do that,
of course, one must identify the themes. A theme being a form of summary, automated theme-
finding is a form of automatic text summarization (or automatic abstracting), a proud old IR
tradition.
Johnson, Paice, Black, and Neal (1993) trace the history of automatic abstract generation
from Luhn (1958), who proposed extracting sentences based on their computed word content
weights, and Baxendale (1958, cited by Johnson et al, 1993), who drew attention to the
importance of the first and last sentences of paragraphs. Edmundson (1969, cited by Johnson et
al, 1993) found that both of these methods were inferior to extraction on the basis of cues (bonus
words and stigma words). Paice (1981, cited by Johnson et al, 1993) sharpened Edmundson's
idea of cues to "indicator constructs" such as In this paper we show that…
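The word-weight approach to sentence extraction can be sketched as follows. This is a simplified frequency-average score of my own, assuming a tiny stopword list and a naive sentence splitter; it is not Luhn's exact significance factor, and it does not implement the positional or cue-based methods.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not a standard list).
STOPWORDS = {"the", "a", "an", "of", "and", "in", "is", "to", "that", "we", "this", "are", "for", "on"}


def split_sentences(text):
    """Split on whitespace that follows sentence-ending punctuation."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def content_words(sentence):
    """Return the lowercase non-stopword tokens of a sentence."""
    return [w for w in re.findall(r"[a-z]+", sentence.lower()) if w not in STOPWORDS]


def extract(text, n=1):
    """Return the n highest-scoring sentences, in their original order."""
    sentences = split_sentences(text)
    freq = Counter(w for s in sentences for w in content_words(s))

    def score(sentence):
        words = content_words(sentence)
        # Average corpus frequency of the sentence's content words.
        return sum(freq[w] for w in words) / len(words) if words else 0.0

    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    return [sentences[i] for i in sorted(ranked[:n])]
```

Sentences built from the document's most frequent content words float to the top, which is the intuition behind extraction by word content weights.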
Johnson et al (1993) built an NLP-based auto-abstracting system which selected non-anaphoric,
indicator-containing sentences and ran them through a bottom-up parser, dictionary-based
part-of-speech tagger (noun, verb, etc.) and morphology-based tagger (-ly = adverb, etc.).
Each word was then indexed by its sentence number, position within the sentence, part of speech,
verb tense if applicable, and whether it was plural or singular. The result was then "cleaned
up" by a set of corrective heuristics and a grammar-based tag disambiguator (see footnote 3). A
global parser
then identified noun phrases based on definitive cues such as being separated by a preposition
(e.g., the primary factor in public health), and then parsed the sentence. The resulting sample
abstract was "far from perfect" as the authors admitted, but it was a plausible condensation down
to 22% of the original text size. Since 22% is an inadequate degree of data reduction for most
text summarization needs, the next step might be to take a page from statistical IR and develop
ways of ranking the selected sentences.
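The dictionary-plus-morphology tagging step described above might look roughly like the sketch below. The lexicon entries, suffix rules, and tag names are hypothetical placeholders of mine; the actual dictionaries and rules used by Johnson et al are not given here.

```python
# Hypothetical mini-lexicon and suffix rules, for illustration only.
LEXICON = {"the": "DET", "in": "PREP", "factor": "NOUN", "is": "VERB"}
SUFFIX_RULES = [("ly", "ADV"), ("ing", "VERB"), ("tion", "NOUN"), ("ous", "ADJ")]


def tag_word(word):
    """Look the word up in the dictionary first, then fall back to suffix rules."""
    w = word.lower()
    if w in LEXICON:
        return LEXICON[w]
    for suffix, tag in SUFFIX_RULES:
        # Require a stem of at least two letters before the suffix.
        if w.endswith(suffix) and len(w) > len(suffix) + 1:
            return tag
    return "UNK"  # left for later heuristics or disambiguation


def tag_sentence(sentence):
    """Return (position, word, tag) triples, with positions starting at 1."""
    return [(i, w, tag_word(w)) for i, w in enumerate(sentence.split(), start=1)]
```

Words that neither the dictionary nor the suffix rules can label are marked as unknown, which is where corrective heuristics and a tag disambiguator would have to take over.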
Template mining
SCISOR's (Rau, 1988) text summarization capabilities were based on filling in values
specified by domain-dependent, manually formulated "scripts" – e.g., company A offered B
dollars per share in a takeover bid for company C on date D. The values were extracted from
raw text by parsing and stored in relational data tables. Then summaries of the parsed data
values could be written by a natural language generator. This seems to be a form of template
mining, where the script or metadata table field structure constitutes the template.
Footnote 3. An example of a sentence with intractable tag ambiguity would be "Rice flies like sand," which could refer to the
behavior of grain or insects (Allen, 1995, p. 13). Such a sentence would require higher (pragmatic and discourse)
levels of analysis to disambiguate.
Chowdhury (1999) describes template mining as a form of information extraction using
NLP "to extract data directly from the text if either the data and/or text surrounding the data form
recognizable patterns. When text matches a template, the system extracts data according to the
instructions associated with that template." Chowdhury traces its history from the mid-1960s
Linguistic String Project at New York University, where "fact retrieval" was conducted against
template data mined from natural language text, up to its current (1999) use in the AltaVista and
Ask Jeeves web search engines. He cites some of the same work I reviewed under NLP and
below (the Rau, Paice, and Gaizauskas groups), perhaps implying that template mining is a
general term for NLP-based metadata approaches to text mining. He also cites Croft (1995) in
reference to the U.S. Advanced Research Projects Agency (ARPA) initiative in this area, the
Message Understanding Conferences (MUCs).
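To make the idea of a template concrete, here is a minimal sketch that fills the takeover "script" mentioned earlier (company A offered B dollars per share in a takeover bid for company C on date D) using a single regular expression. The pattern, the field names, and the sample sentence with its invented companies are my own assumptions; real systems such as SCISOR relied on parsing rather than surface patterns.

```python
import re

# One hypothetical surface template for the takeover script.
TAKEOVER = re.compile(
    r"(?P<bidder>.+?) offered \$?(?P<price>\d+(?:\.\d+)?) (?:dollars )?per share "
    r"in a takeover bid for (?P<target>.+?) on (?P<date>.+?)\.?$"
)


def fill_template(sentence):
    """Return the filled template as a dict, or None if the sentence does not match."""
    match = TAKEOVER.match(sentence.strip())
    if match is None:
        return None
    fields = match.groupdict()
    fields["price"] = float(fields["price"])
    return fields


# Invented example sentence; the companies are fictitious.
print(fill_template("Acme Corp offered $45 per share in a takeover bid for Widget Inc on March 3, 1988."))
```

Each filled dictionary corresponds to one row of a relational table, which is the sense in which the template doubles as a metadata field structure. The brittleness is also apparent: any rewording of the sentence defeats the pattern.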
To facilitate template mining, Chowdhury recommends "standardization in the
presentation and layout of information within digital documents" through the use of templates for
document creation. But this is contrary to the spirit of text mining, which is to liberate both the
creators and the users of text from as much tedium and artificiality as possible. Like Kostoff's
unrestricted reliance on human filters, it represents a form of surrender in the face of difficulty –
hopefully premature!
Theme Finding
Salton, Allan, Buckley, and Singhal (1994) looked at how traditional IR models can be
applied to theme generation and text summarization. The authors derived the notion of passage
retrieval from the problem of ranking vector matches when the vectors are of different lengths,
e.g. very short queries against long documents, or clustering documents of different sizes. One
solution is to decompose the documents into subunits of roughly equal size, called "passages." A
common passage unit is a paragraph.
The passages may be converted to normalized vectors and compared. Those with
similarities above a certain threshold (which may be chosen to deliver a desired degree of
abstraction) are considered connected. If the documents are plotted as arcs on the circumference
of a circle and their component passages connected by straight lines in accordance with their
vector similarities, the resulting starburst pattern can convey themes within and between
documents. These themes can be focused by expressing each triangle of passage similarities
as a centroid and doing similarity calculations on the centroids.
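The passage-linking step can be sketched as follows, assuming raw term-frequency vectors and cosine similarity. The weighting scheme actually used by Salton and colleagues is not specified here, and the threshold value is an arbitrary placeholder of mine.

```python
import math
import re
from collections import Counter
from itertools import combinations


def vectorize(passage):
    """Represent a passage as a bag of lowercase word counts."""
    return Counter(re.findall(r"[a-z]+", passage.lower()))


def cosine(u, v):
    """Cosine similarity of two sparse count vectors (0.0 if either is empty)."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0


def passage_links(passages, threshold=0.3):
    """Return index pairs of passages whose similarity reaches the threshold."""
    vecs = [vectorize(p) for p in passages]
    return [
        (i, j)
        for i, j in combinations(range(len(vecs)), 2)
        if cosine(vecs[i], vecs[j]) >= threshold
    ]
```

Raising the threshold thins out the links and yields a coarser, more abstract picture of the themes; lowering it does the opposite.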
One may want to compute an estimate of the "most important" passages for the purpose
of selective text traversal ("skimming") or text summarization. Such passages might be
identified as (a) having a large number of above-threshold similarity connections, (b) strategic
position (e.g., the first paragraph in each section), or (c) high similarity to some reference node.
The last criterion (c) is called "depth first" selection. In practice, all three of these criteria can be
combined; e.g., start with some desired passage (as in "more like this"), go to the most similar
sectional heading passage, then go to its strongest link, then select the other densely connected
nodes in that cluster in chronological order. For text summarization, repetition can be edited out
on the basis of similarities between sentences or other subunits which are "too high."
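Criterion (a) can be sketched in a few lines, assuming the above-threshold links are already available as pairs of passage indices. The function name and the tie-breaking rule (earlier passage first) are my own choices.

```python
from collections import Counter


def central_passages(links, n_passages, k=2):
    """Pick the k passages with the most links, returned in text order."""
    degree = Counter()
    for i, j in links:
        degree[i] += 1
        degree[j] += 1
    # Rank by descending link count; break ties in favor of earlier passages.
    ranked = sorted(range(n_passages), key=lambda p: (-degree[p], p))
    return sorted(ranked[:k])
```

Returning the selected passages in their original order gives a crude extract that can be read straight through as a skimming path or a draft summary.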
Text Categorization
Text categorization should not be considered a form of text mining because it is a
"boiling down" of document content to "pre-defined labels" which "does not lead to discovery of
new information" since "presumably the person who wrote the document knew what it was
about," according to Hearst (1999). Presumably she would also rule out text summarization and
auto-indexing for the same reason. She makes exceptions, however, for cases where the goal of
categorization is to find "unexpected patterns" or "new events" because these "tell us something
about the world, outside of the text collection itself" and therefore qualify as new information.
I would argue, however, that it is not so easy to predict where "new information" will
come from, that novelty is in the eye of the beholder, and that any form of text data reduction is a
form of separating "precious nuggets" from "worthless rock" according to the human
idiosyncrasies of whoever is doing the separating, be it a traditional library cataloguer/indexer or
a vector space modeler. This is not to say that cataloguing, indexing, and other IR tools are all
text mining, but just to highlight the fuzziness of the boundaries between them.
Clustering
Clustering can be used to classify texts or passages in natural categories that arise from
statistical, lexical, and semantic analysis rather than the arbitrarily pre-determined categories of
traditional manual indexing systems. In the context of text mining, it is the derivation of the
categories which is of interest, since this is a form of theme finding and therefore text
summarization. Once the texts are clustered on the basis of common themes, it may also be useful
to correlate their divergent themes, a la Swanson. Texts may also be clustered on the basis of
length, cost, date, etc. (IBM, 1998b), or bibliographic data such as author, institution, or country of
origin (Kostoff, 1999). Computational aspects of clustering are reviewed by Witten and Frank
(2000, Section 6.6).
Filtering
E-mail filtering is often mentioned as an example of text mining (e.g., Witten and Frank,
2000). The relevance of related techniques such as name recognition, theme finding, and text
categorization is obvious, and it is even possible to imagine software which modifies its own
filtering criteria by discovering new patterns in the whole e-mail stream. However, I was unable
to find reports of any actual work on such a system.
Belkin and Croft (1992) built a model of information filtering (IF) based on Belkin's
famous anomalous states of knowledge (ASK) model of IR. In a side-by-side comparison, the
two (IF and IR) appear strikingly similar, the biggest difference being the "stable, long-term…
regular information interests" of IF compared to the "periodic… information need or ASK" of
IR. Extending the side-by-side modeling to Bayesian inference networks, the authors arrive at
another striking comparison: the IF network looks exactly like an upside-down IR network! That
is, in IR multiple documents are percolating down to a single user, while in IF each single
incoming document is percolating down to multiple users. However, the authors reject this
analogy for reasons not entirely clear to me (see footnote 4).
Text Visualization
Text visualization shares text mining's goals of using computational transformations to
reduce the cognitive effort of dealing with large text corpora, highlight patterns across
documents, and help discover new knowledge. Text mining implies homing in on "precious
nuggets" whereas text visualization seems to be concerned with the "big picture," but in practice
both may be regarded as elements of a holistic approach to multi-text corpora. The text mining
systems of Hearst, Kostoff, and Liddy all have explicit text visualization components.
Footnote 4. They seem to feel that "P(oj|pi)", the probability that the incoming document will satisfy the information need
given a user's filtering profile, is poorly understood compared to the conventional Bayesian need-query-document
relationships, but I'm not sure the latter are so well-understood, either.
Wise (1999) developed a text visualization paradigm for intelligence analysis named
Spatial Paradigm for Information Retrieval and Exploration (SPIRE) "to find a means of
‘visualizing text’ in order to reduce information processing load and to improve productivity" by
representing large numbers of documents to permit "rapid retrieval, categorization, abstraction,
and comparison, without the requirement to read them all." The theory behind SPIRE was that
humans’ most highly evolved perceptual abilities are those involved in interpreting "visual
features of the natural world." Therefore the goal was to represent text as natural, ecological
images from our early hominid past which require no "prolonged training to appreciate and use"
such as star fields or landscapes (Figure 1). This transformation was accomplished using
standard vector space algorithms and involves clustering and text summarization. SPIRE is an
excellent example of how a cognitive theory can be helpful in inspiring IR innovation and
guiding system development, despite its apparent lack of commercial success (see footnote 5).
Text Compression
As mentioned at the beginning, I started this paper by trying to narrow the definition and
scope of text mining by differentiating it from other nontraditional IR strategies (Table 1). One
by one, however, the other strategies refused to be cleanly differentiated, and the foregoing
polyglot review is the result. The only concept I thought I had succeeded in banishing from the
scope of text mining was data compression, which showed up in the title of a single citation in a
literature search performed for me by Melissa Yonteck. Data compression, a la PKZIP, was
surely not related in any meaningful way to text mining, Yonteck and I agreed. Here at last was
something I could confidently rule out.
Footnote 5. Cartia, Inc., which was marketing the ThemeScape™ software (Figure 2, downloaded Fall 2000), no longer has
any detectable presence on the Web.
But on page 334, Witten and Frank (2000), in discussing statistical character-based
models for token classification (names, dates, money amounts, etc.), note that "there is a close
connection with prediction and compression: the number of bits required to compress an item
with respect to a model can be interpreted as the negative logarithm of the probability with which
that item is produced by the model." That is, text compression algorithms might function as
token classifiers in reverse! So I give up. Text mining appears to be related to just about
everything on my original list.
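The link between compression and classification can be illustrated with a small sketch: train one character model per token class, compute the number of bits each model would need to encode a new token (the negative base-2 logarithm of its probability), and assign the token to the class whose model encodes it most cheaply. The smoothed character-unigram model and the training examples below are simplifying assumptions of mine, much cruder than the character-based models Witten and Frank discuss.

```python
import math
from collections import Counter


class CharModel:
    """Character-unigram model with add-one smoothing (an illustrative simplification)."""

    def __init__(self, examples, alphabet_size=256):
        self.counts = Counter("".join(examples))
        self.total = sum(self.counts.values())
        self.alphabet_size = alphabet_size

    def prob(self, ch):
        return (self.counts[ch] + 1) / (self.total + self.alphabet_size)

    def bits(self, token):
        """Ideal code length of the token under this model, in bits."""
        return -sum(math.log2(self.prob(ch)) for ch in token)


def classify(token, models):
    """Return the label of the model that encodes the token in the fewest bits."""
    return min(models, key=lambda label: models[label].bits(token))


# Tiny hypothetical training sets, for illustration only.
models = {
    "date": CharModel(["12/03/1999", "01/15/2000", "07/04/1996"]),
    "name": CharModel(["Swanson", "Hearst", "Kostoff"]),
}
print(classify("05/21/2001", models), classify("Smith", models))
```

A token that compresses well under the "date" model is, by the same arithmetic, a token that the "date" model considers probable.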
Biomedical Applications
My interest in text mining is motivated primarily by the belief that it can be fruitfully
applied to biomedical literature, specifically the MEDLINE database, to discover new knowledge.
I see text analysis as a major new frontier in bioinformatics, whose smashing success in the area of
gene sequence analysis is based, after all, on nothing more than algorithms for finding and
comparing patterns in the four-letter language of DNA. Swanson's work has focused on
MEDLINE, and Hearst (1999) has also declared a research interest in "automating the discovery of
the function of newly sequenced genes" by determining which novel genes are "co-expressed with
already understood genes which are known to be involved in disease."
Humphreys, Demetriou, and Gaizauskas (2000) used information extraction, defined as
"extracting information about predefined classes of entities and relationships from natural
language texts and placing this information into a structured representation called a template" [is it
therefore template mining?], to build a database of information about enzymes, metabolic
pathways, and protein structure from full text biomedical research articles. The LaSIE (Large
Scale Information Extraction) system includes modules for datatype recognition (names, dates,
etc.), co-reference resolution (pronouns, anaphors, metonyms, etc.), and different types of template
filling. It does linguistic analysis at all levels up to discourse using lexical knowledge,
morphology, and grammars to identify significant words. The enzyme and metabolic pathway
variant of LaSIE is called (of course) EMPathIE and fills the following template fields: enzyme
name, EC (Enzyme Commission) number, organism, pathway, compounds involved and their roles
(substrate, product, cofactor, etc.), and, interestingly, compounds not involved. Optional fields
include concentration and temperature. The PASTA variant deals with protein structure
information such as which amino acid residues occupy given positions, active and binding sites,
secondary structure, subunits, interactions with other molecules, source organism, and SCOP
category. The prototype has been tested on only six journal papers, so it is far from satisfying the
large text corpus requirement for true text mining, but the authors make no such claim.
The U.S. National Institutes of Health (NIH) have also gotten involved. Tanabe, Scherf,
Smith, Lee, Hunter, and Weinstein (1999) developed a system named MedMiner to help them sort
out the thousands of gene expression correlations resulting from microarray experiments (see footnote 6) to
separate "interesting biological stories" from mere epiphenomena and statistical coincidences. The
first module gathers the relevant texts by querying PubMed (MEDLINE) and GeneCards (an
Israeli gene information database) on the expressed genes. [Gene names generally make good
search words because they are different from normal English words, e.g. "JAK3".] The second
module filters the retrieved texts by user-specifiable relevance criteria based on classical proximity
or term frequency scores (NLP criteria being regarded as too computationally expensive). The
third module is a "carefully designed user interface" to facilitate access to the most likely-to-be-
interesting documents.
Footnote 6. Basically, a square chip coated with an array of known DNA sequences at known locations on the chip is dipped
into a broth containing the expressed messenger RNA (mRNA) from cells under given conditions. The mRNA is
labeled so that when it binds to its complementary DNA on the chip the gene expression pattern is revealed. Gifford
(2001) briefly reviewed the direct application of data visualization to gene expression data not involving any text.
Despite the name, then, MedMiner is not a true text mining system, but rather a search and
display enhancement to PubMed (which offers only flat Boolean search logic, unranked retrieval,
and no integration with GeneCards, although it is integrated with other gene and protein
databases). Like Kostoff's system, it is designed to deal with highly technical information by
assisting expert users in their traditional IR tasks rather than attempting to automate them
completely. MedMiner is freely available online at http://discover.nci.nih.gov.
Another NIH group, Rindflesch, Hunter, and Aronson (1999), developed a true NLP
system named ARBITER for mining molecular binding terms from MEDLINE. ARBITER
attempts to identify noun phrases representing molecular entities such as drugs, receptors,
enzymes, toxins, genes, messenger molecules, etc., and their structural features (box, chain,
sequence, subunit, etc.) likely to be involved in binding. ARBITER makes use of MeSH indexing,
the lexical and semantic knowledge bases of the Unified Medical Language System (UMLS) and
GenBank, co-word adjacency to forms of bind, and a variety of linguistic strategies to deal with
acronyms, anaphors, modifiers, coordinated phrases, and nested phrases (e.g., "…a previously
unrecognized coiled-coil domain within the C terminus of the PKD1 gene product, polycystin, and
demonstrate…"). A test on a small sample (116 abstracts containing a form of bind, one month's
worth from MEDLINE) yielded 72% recall and 79% precision of manually marked binding terms.
While terminology extraction might be considered a fairly trivial form of text mining, it is
obviously a logical step toward the mining of binding relationships (A binds B) which would have
enormous potential for knowledge discovery.
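To make the evaluation figures concrete, the following sketch shows how recall and precision of extracted terms are computed against a manually marked set, together with a deliberately crude stand-in for co-word adjacency to forms of bind. It is not ARBITER's method: the window size, the list of bind forms, and all function names are assumptions for illustration only.

```python
import re

# Assumed list of surface forms of "bind"; the real system's list is not given here.
BIND_FORMS = re.compile(r"\b(bind|binds|binding|bound)\b", re.IGNORECASE)

def contains_bind_form(text):
    """True if the text contains a standalone form of 'bind'."""
    return BIND_FORMS.search(text) is not None

def terms_near_bind(text, window=3):
    """Lowercased tokens within `window` positions of a form of 'bind'.
    A naive adjacency heuristic, far simpler than noun-phrase analysis."""
    tokens = re.findall(r"[A-Za-z0-9-]+", text)
    hits = set()
    for i, token in enumerate(tokens):
        if BIND_FORMS.fullmatch(token):
            start, stop = max(0, i - window), i + window + 1
            hits.update(t.lower() for t in tokens[start:stop]
                        if not BIND_FORMS.fullmatch(t))
    return hits

def precision_recall(extracted, gold):
    """Precision = correct / extracted; recall = correct / manually marked."""
    extracted, gold = set(extracted), set(gold)
    correct = len(extracted & gold)
    precision = correct / len(extracted) if extracted else 0.0
    recall = correct / len(gold) if gold else 0.0
    return precision, recall
```

For example, if a system extracts four terms of which three appear among five manually marked terms, precision is 3/4 and recall is 3/5.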
Stapley and Benoit (2000) developed a system named “BioBiblioMetrics” (Stapley,
2000) which uses text visualization to suggest functional clusters of genes from the yeast
Saccharomyces cerevisiae. The system uses a subset of MEDLINE records containing the
yeast's name, a lexical knowledge base of all the known, nontrivial yeast genes and their aliases
from the SGD (Saccharomyces Genome Database), and a matrix of gene name pair co-occurrence
statistics. When one does a search on a gene name or function (e.g. "DNA replication"), the co-
occurring genes are displayed in a graph with “nodes” representing genes and edge lengths
between the nodes representing biological proximity (Figure 2). Nodes are hypertext-linked to
sequence databases, and edges to those MEDLINE documents that generated them, creating a
biomedical information “landscape” and inference network. BioBiblioMetrics is freely available
online at http://www.bmm.icnet.uk/~stapleyb/biobib/.
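The co-occurrence idea can be sketched in a few lines. This is an illustrative approximation, not the BioBiblioMetrics implementation: the alias lexicon below is a toy example rather than data drawn from SGD, the function names are hypothetical, and the raw counts are not converted into edge lengths because that mapping is not specified here.

```python
import re
from collections import Counter
from itertools import combinations

def genes_in_abstract(text, alias_to_gene):
    """Canonical gene names whose aliases appear as whole tokens in the text.
    Aliases containing punctuation would need a more careful matcher."""
    tokens = set(re.findall(r"[A-Za-z0-9]+", text.upper()))
    return {gene for alias, gene in alias_to_gene.items() if alias.upper() in tokens}

def cooccurrence_counts(abstracts, alias_to_gene):
    """Count, for each unordered gene pair, the abstracts mentioning both."""
    counts = Counter()
    for text in abstracts:
        genes = sorted(genes_in_abstract(text, alias_to_gene))
        for pair in combinations(genes, 2):
            counts[pair] += 1
    return counts

def neighbours(gene, counts):
    """Genes co-occurring with `gene`, most frequent first (ties alphabetical)."""
    linked = []
    for (a, b), n in counts.items():
        if gene == a:
            linked.append((b, n))
        elif gene == b:
            linked.append((a, n))
    return sorted(linked, key=lambda item: (-item[1], item[0]))

# Toy lexicon for illustration only: alias -> canonical name.
TOY_LEXICON = {"RAD51": "RAD51", "RAD52": "RAD52", "CDC28": "CDC28", "CDK1": "CDC28"}
```

Keeping, for each pair, the list of abstracts that produced the count (rather than only the count) would allow edges to link back to their source documents, as the system described above does.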
Other MEDLINE text mining papers which I did not have a chance to review in full
involve dictionary-controlled natural language processing for extraction of drug-gene relationships
(Rindflesch, Tanabe, Weinstein, & Hunter, 2000); statistical term strength analysis (Wilbur &
Yang, 1996); statistical text classification and a relational machine-learning method (Craven &
Kumlien, 1999); statistical identification of key phrases against an evolutionary protein family
background (Andrade & Valencia, 1997, 1998); pre-specified protein names and a limited set of
action verbs (Blaschke, Andrade, Ouzounis, & Valencia, 1999); and a proprietary information
extraction system (Thomas, Milward, Ouzounis, Pulman, & Carroll, 2000). Futrelle (2001a)
provides online full-text access to many biomedical text mining papers, including those from the
hard-to-get 2000 and 2001 Pacific Symposia on Biocomputing.
Bob Futrelle (2001a,b) has organized a large "bio-NLP" information network and
enunciated a radical vision which includes several of the themes of this paper, such as the
analogy between text and genome analysis, and the long history of information extraction in its
many guises. He sees the challenge as "understanding the nature of biological text, whatever that
turns out to be, linguistic theories not withstanding." He seems to feel that the traditional rules
and grammars of Chomskian linguistics are more hindrance than help.
    Frankly, a fresh new approach is needed, fueled by the conviction that language is a
    biological phenomenon, not a logical phenomenon. By this we mean that the nature of
    language is as messy as the genome. The data and observed phenomena in all their richness
    and variety are dominant and cannot [be] subsumed by any elegant theories. This means that in
    many ways, biologists have far better hopes of cracking the NLP problem than the
    computational linguists, who are focused on mathematics and logic. Even when they look
    at data, it is primarily as grist for their math mills.
Futrelle recommends, for example, building visualization tools such as a protein noun phrase
highlighter which could be used to "assemble a large collection of the standard textual
expression forms [and] map these onto the query forms for which they are the answers."
But Futrelle also goes beyond immediate practical needs. Like Wise (1999), he has a
coherent theory based on the biological nature of language.
    By this I mean that language is a communicative capability of living organisms that has
    evolved from deep biological roots and from social interactions over millions, and
    ultimately, billions of years. I claim that language is not logical and mathematical,
    because that's not the nature of the organism (us) that exhibits the language capability.
    An example of this is found in our vocabularies. A technically skilled adult will have a
    vocabulary of over 100,000 words, basically all memorized. The meaning of "bear" or
    "ship" does not follow from the characters that make them up. We simply commit them
    to memory. Linguists would like us to believe that our natural ability to "parse" is
    radically different and can be explained as a rule-based system.
    My radical view is that we understand language not by generalization to abstract rules as
    much as by retaining examples and generalizing from them as needed. This is quite
    within our capacity, given our 100,000 word vocabularies. We also do reason. I would
    claim, again in the biological view, that this is done more by "imagined life" than by
    logic. Humans have superb abilities to remember events and to build detailed mental
    plans for future activities …. So we need to build this type of reasoning into our systems.
The analogy to genomics is clear. The coding of a particular protein by a particular
sequence of DNA bases is just an accident of evolution. Whatever rules now appear to prevail
(such as "zinc fingers" for DNA-binding proteins) can only be derived empirically, by looking
for patterns within the data. Purely logical approaches must wait for a richer knowledge base.
Only now, after the massive effort of half a century of molecular genetic research, sequencing
whole genomes, and building databases and tools such as GenBank, GeneCards, and Proteome,
can we begin to think about prediction of protein structure and function from sequence data
alone. Biological linguistics now stands at the beginning of a comparably arduous journey.
These considerations put Swanson's, Kostoff's, Tanabe's, and Chowdhury's reliance on
human expertise and manual filtering in a better light. Perhaps they do not represent premature
surrender to difficulty so much as a necessary but hopefully temporary expedient. Perhaps they
are keeping "the human in the loop" (Kantor, 2001) only long enough to "study the human to learn
what to put in the machine" (Saracevic, 2001). This surprising interface between biomedical text
mining and the cognitive tradition in IR would make a worthy topic for another paper.
References
Allen, J. (1995). Natural Language Understanding, Second Edition. Redwood City, CA:
Benjamin/Cummings.
Andrade, M. A., & Valencia, A. (1997). Automatic annotation for biological sequences
by extraction of keywords from MEDLINE abstracts. Development of a prototype system.
Proceedings of the International Conference on Intelligent Systems for Molecular Biology, 5, 25-32.
Andrade, M. A., & Valencia, A. (1998). Automatic extraction of keywords from
scientific text: application to the knowledge domain of protein families. Bioinformatics,
14(7), 600-607.
Bates, M. (1995). Models of natural language understanding. Proceedings of the
National Academy of Sciences, 92, 9977-9982.
Belkin, N. J., & Croft, W. B. (1992). Information filtering and information retrieval:
Two sides of the same coin? Communications of the ACM, 35, 29-38.
Blaschke, C., Andrade, M. A., Ouzounis, C., & Valencia, A. (1999). Automatic extraction
of biological information from scientific text: protein-protein interactions. Proceedings of
the International Conference on Intelligent Systems for Molecular Biology, pp. 60-67.
Bush, V. (1945). As We May Think. Atlantic Monthly, 176 (11), 101-108.
Cartia, Inc. (2000). ThemeScape product suite. Formerly online:
http://www.cartia.com/products/index.html [no longer accessible].
Chowdhury, G. G. (1999). Template mining for information extraction from digital
documents. Library Trends, 48, 182-208.
Craven, M., & Kumlien, J. (1999). Constructing biological knowledge bases by
extracting information from text sources. Proceedings of the International Conference on
Intelligent Systems for Molecular Biology, pp. 77-86.
Dorre, J., Gerstl, P., & Seiffert, R. (1999). Text mining: Finding nuggets in mountains of
textual data. KDD-99, Association for Computing Machinery.
Doyle, L. (1961). Semantic road maps for literature searchers. Journal of the
Association for Computing Machinery, 8, 223-239.
Fan, W. (2001). Text mining, web mining, information retrieval and extraction from the
WWW references. Online: http://www-personal.umich.edu/~wfan/text_mining.html
Futrelle, R. P. (2001a). Natural language processing of biology texts. Online:
http://www.ccs.neu.edu/home/futrelle/bionlp/
Futrelle, R. P. (2001b). The past, present and future of biology text understanding.
Presented at the Conference on Biological Research with Information Extraction (BRIE), Tivoli
Gardens, Copenhagen, Denmark, July 26. Online:
http://www.ccs.neu.edu/home/futrelle/brie2001/index.html
Gifford, D. K. (2001). Blazing pathways through genetic mountains. Science, 293,
2049-2051.
Greenfield, L. (2001). Text mining. Online: http://www.dwinfocenter.org/docum.html
Hearst, M. (1997). Distinguishing between web data mining and information access.
Presentation for the Panel on Web Data Mining, KDD 97, August 16, Newport Beach, CA.
Online: http://www.sims.berkeley.edu/~hearst/talks/data-mining-panel/index.htm
Hearst, M. (1999). Untangling text data mining. In Proceedings of ACL'99: the 37th
Annual Meeting of the Association for Computational Linguistics, University of Maryland, June
20-26, 1999 (invited paper). Online:
http://www.sims.berkeley.edu/~hearst/papers/acl99/acl99-tdm.html
Hearst, M. (2001). About TextTiling. Online:
http://www.sims.berkeley.edu/~hearst/tiling-about.html
Humphreys, K., Demetriou, G., & Gaizauskas, R. (2000). Bioinformatics applications of
information extraction for scientific journal articles. Journal of Information Science, 26, 75-85.
IBM (1998a). Text analysis tools. Slide #8 of Intelligent Miner for Text Overview.
Online:
http://www-4.ibm.com/software/data/iminer/fortext/presentations/im4t23over/im4t23over8.htm
IBM (1998b). Text mining technology: Turning information into knowledge: A white
paper from IBM. Daniel Tkach (Ed.). Online:
http://www-4.ibm.com/software/data/iminer/fortext/download/whiteweb.pdf
Ingwersen, P., & Willett, P. (1995). An introduction to algorithmic and cognitive
approaches for information retrieval. Libri, 45, 160-177.
Johnson, F. C., Paice, C. D., Black, W. J., & Neal, A. P. (1993). The application of
linguistic processing to automatic abstract generation. Journal of Document and Text
Management, 1, 215-241.
Kantor, P. B. (2001). Lecture K: Natural language concepts. Information Retrieval class,
Rutgers University, School of Communication, Information, and Library Studies, New
Brunswick, NJ.
Kostoff, R. N. (1999). Science and technology innovation. Technovation, 19. Online:
http://www.dtic.mil/dtic/kostoff/Swanson2.txt
Kostoff, R. N., & DeMarco, R. A. (2001). Information extraction from scientific
literature with text mining. Analytical Chemistry (in press). Online:
http://www.onr.navy.mil/sci_tech/special/technowatch/kdocs/anchem2/txt
Kostoff, R. N., del Rio, J. A., Humenik, J. A., Garcia, E. O., & Ramirez, A. M. (2001).
Citation mining: Integrating text mining and bibliometrics for research user profiling. Journal of
the American Society for Information Science, 52, 1148-1156.
Kostoff, R. N., Toothman, D. R., Eberhart, H. J., & Humenik, J. A. (2000). Text mining
using database tomography and bibliometrics: A review. Online:
http://www.onr.navy.mil/sci_tech/special/technowatch/textmine.htm
KRDL (2001). Text mining: transforming raw text into actionable knowledge (white
paper). Kent Ridge Digital Labs. Online: http://textmining.krdl.org.sg/
Laender, A. H. F., Ribeiro-Neto, B., da Silva, A. S., & Teixeira, J. S. (2001). A brief
survey of web data extraction tools. In press.
Liddy, E. D. (2000). Text mining. Bulletin of the American Society for Information
Science, 27. Online: http://www.asis.org/Bulletin/Oct-00/liddy.html
Liddy, E. D. (2001). Data mining, meta-data, and digital libraries. DIMACS Workshop
on Data Analysis and Digital Libraries, May 17, Center for Discrete Mathematics and
Theoretical Computer Science, Rutgers University, New Brunswick, NJ.
Lindsay, R. K., & Gordon, M. D. (1999). Literature-based discovery by lexical statistics.
Journal of the American Society for Information Science, 50, 574-587.
Losee, R. M. (2001a). Natural language processing in support of decision-making:
phrases and part-of-speech tagging. Information Processing and Management, 37, 769-787.
Losee, R. M. (2001b). Term dependence: A basis for Luhn and Zipf models. Journal of
the American Society for Information Science, 52, 1019-1025.
Losiewicz, P., Oard, D. W., & Kostoff, R. N. (2000). Textual data mining to support
science and technology management. Online:
http://www.onr.navy.mil/sci_tech/special/technowatch/textmine.htm
Luhn, H. P. (1958). The automatic creation of literature abstracts. IBM Journal of
Research and Development, 2, 159-165.
Marcus, M. (1995). New trends in natural language processing: Statistical natural
language processing. Proceedings of the National Academy of Sciences, 92, 10052-10059.
Mjolsness, E., & DeCoste, D. (2001). Machine learning for science: State of the art and
future prospects. Science, 293, 2051-2055.
Perez-Carballo, J., & Strzalkowski, T. (2000). Natural language information retrieval:
Progress report. Information Processing and Management, 37, 155-178.
Perrin, P. (2001). Personal communication, Molecular Systems research group, Merck &
Co., Inc., Rahway, NJ.
Qin, J. (2000). Working with data: Discovering knowledge through mining and analysis.
Bulletin of the American Society for Information Science, 27. Online:
http://www.asis.org/Bulletin/Oct-00/qin.html
Rau, L. F. (1988). Conceptual information extraction and retrieval from natural language
input. In RIAO 88, pp. 424-437. Paris: Centre des Hautes Etudes Internationales d'Informatique
Documentaire, 1997, General Electric, USA.
Rindflesch, T. C., Hunter, L., & Aronson, A. R. (1999). Mining molecular binding
terminology from biomedical text. Proceedings of the American Medical Informatics
Association Symposium, 1999, 127-131. Online:
http://www.amia.org/pubs/symposia/D005564.PDF
Rindflesch, T. C., Tanabe, L., Weinstein, J. N., & Hunter, L. (2000). EDGAR: extraction
of drugs, genes and relations from the biomedical literature. Pacific Symposium on
Biocomputing, 2000, 517-528.
Russell, S., & Norvig, P. (1995). Artificial Intelligence: A Modern Approach. Upper
Saddle River, NJ: Prentice Hall.
Salton, G. (1992). The state of retrieval systems evaluation. Information Processing and
Management, 28, 441-449.
Salton, G., Allan, J., Buckley, C., & Singhal, A. (1994). Automatic analysis, theme
generation, and summarization of machine-readable texts. Science, 264, 1421-1426.
Saracevic, T. (2001). Personal communication and class discussions, Seminar in
Information Studies, Rutgers University, School of Communication, Information and Library
Studies, New Brunswick, NJ.
SDM (2001). Text mining 2002 [workshop prospectus]. Second SIAM International
Conference on Data Mining, Arlington, VA, April 13, 2002. Online:
http://www.cs.utk.edu/tmw02/
Sneiderman, C. A., Rindflesch, T. C., & Aronson, A. R. (1996). Finding the findings:
identification of findings in medical literature using restricted natural language processing.
Proceedings of the American Medical Informatics Association Annual Fall Symposium, 1996,
239-243.
Stapley, B. J. (2000). BioBiblioMetrics. Online:
http://www.bmm.icnet.uk/~stapleyb/biobib/
Stapley, B. J., & Benoit, G. (2000). Biobibliometrics: information retrieval and
visualization from co-occurrences of gene names in Medline abstracts. Pacific Symposium on
Biocomputing, 2000, 529-540.
Swanson, D. R. (1988). Historical note: Information retrieval and the future of an
illusion. Journal of the American Society for Information Science, 39, 92-98.
Swanson, D. R., & Smalheiser, N. R. (1997). An interactive system for finding
complementary literatures: A stimulus to scientific discovery. Artificial Intelligence, 91,
183-203.
Swanson, D. R., & Smalheiser, N. R. (1999). Implicit text linkages between Medline
records: Using Arrowsmith as an aid to scientific discovery. Library Trends, 48, 48-51.
Swanson, D. R., Smalheiser, N. R., & Bookstein, A. (2001). Information discovery from
complementary literatures: Categorizing viruses as potential weapons. Journal of the American
Society for Information Science and Technology, 52, 797-812.
Tanabe, L., Scherf, U., Smith, L. H., Lee, J. K., Hunter, L., & Weinstein, J. N. (1999).
MedMiner: An Internet text-mining tool for biomedical information, with application to gene
expression profiling. BioTechniques, 27, 1210-1217.
Thomas, J., Milward, D., Ouzounis, C., Pulman, S., & Carroll, M. (2000). Automatic
extraction of protein interactions from scientific abstracts. Pacific Symposium on Biocomputing,
2000, 541-552.
Wilbur, W. J., & Yang, Y. (1996). An analysis of statistical term strength and its use in
the indexing and retrieval of molecular biology texts. Computers in Biology and Medicine,
26(3), 209-222.
Wise, J. A. (1999). The ecological approach to text visualization. Journal of the
American Society for Information Science, 50(13), 1224-1233.
Witten, I. H., & Frank, E. (2000). Data Mining: Practical Machine Learning Tools and
Techniques with Java Implementations. San Francisco: Morgan Kaufmann (Academic Press).
Table 1.
Initial List of Information Retrieval (IR) Concepts Related to Text Mining.
IR concept                          Authority (see References)
Artificial Intelligence             Fan; Perrin
Bioinformatics                      Futrelle; Perrin
Citation Mining                     Kostoff
Computational Linguistics           Fan; Hearst
Conceptual Graphs                   KRDL
Data Abstraction                    Fan
Data Mining                         Fan; Perrin; SDM
Database Tomography                 Kostoff
Document Mining                     Fan
Domain Knowledge                    KRDL
Electronic Commerce                 Fan
Factor Analysis                     SDM
Information Access                  Hearst
Information Extraction              Chowdhury; Fan; Futrelle; Kostoff; Perrin
Information Filtering               Fan
Information Integration             Fan
Information Retrieval               Fan; Perrin
Information Visualization/Mapping   Futrelle; Fan; SDM
Intelligent Agents ("bots")         Fan
Knowledge Discovery                 Fan
Knowledge Extraction                Perrin
Knowledge Representation            Perrin
Language Identification             IBM
Machine Learning                    Fan; Futrelle; Perrin
Metadata Generation                 SDM
Natural Language Processing         Fan; Futrelle; Perrin; Rindflesch; Saracevic
Ontologies/Vocabularies/Lexicons    Futrelle
Phrase Extraction                   Fan
Question Answering                  Futrelle
Resource Discovery                  Fan
Resource Indexing                   Fan
Semantic Modeling                   Perrin; SDM
Semantic Processing                 Rindflesch
Statistical Language Modeling       Fan
Stemming                            SDM
Syntactic Processing                Saracevic
Template Mining                     Chowdhury; KRDL
Text Analysis                       Futrelle; IBM
Text Classification/Categorization  Fan; Hearst (distinct); IBM; SDM
Text Clustering                     Fan; IBM
Text Data Mining                    Hearst; Kostoff
Text Parsing                        SDM
Text Purification                   SDM
Text Segmentation/"TextTiling"      Hearst; SDM
Text Summarization                  Futrelle; IBM; Saracevic; SDM
Text Understanding                  Futrelle; Fan
Web Data Mining                     Hearst
Web Mining                          Fan
Web Utilization Mining              Fan
Figure 1. ThemeScape™ visualization of a collection of 4,314 Y2K debate forum documents
(Cartia, 2000, expired website).
Figure 2. BioBiblioMetrics retrieval from a search on “DNA repair” and “recombination”
(Stapley, 2000).