Notes.doc.doc
Notes.doc.doc Document Transcript

  • 1. IR&NLP Coursework P1 Text Analysis Within The Fields Of Information Retrieval and Natural Language Processing By Ben Addley 2003695 Academic Year 2004 - 2005
  • 2. Ben Addley IR&NLP Coursework P1 Abstract As users and producers of information, we have created a spiral of ever-increasing quantities of stored data. With the amount of directly accessible data available to us today, we need to find new ways to manage and review this overabundance of textual information. Text analysis, and in the case of this coursework Automatic Text Analysis, is the study of how a computer can be used to identify the context and importance of text without the intervention of a human. This paper introduces the technical and linguistic terminology necessary for an understanding of the role Text Analysis plays in the two core application areas of Information Retrieval (IR) and Natural Language Processing (NLP). Text analysis is not new: understanding what fundamentally links words, sentences, passages and even whole books has been of interest since the Middle Ages. The first advancement in purely computer based text analysis (TA) came during the 1950s; however, research into computational linguistics and the “holy grail” of an understanding machine began in the 1940s, right at the birth of modern computing. The basic building block of text analysis is textual understanding and, in particular, the language it is built upon. Text analysis is based around two differing approaches. The first is rule based: text is parsed through a series of modular analytical components, each of which concentrates on a particular component of natural language (morphemes, words, syntax, semantics etc.) and calls upon a number of pre-defined rules. The other is statistical based learning (sometimes referred to as machine or example based learning). It requires the computer to take in huge amounts of text and, by comparing common compositions of texts, gradually “learn” (through statistical analysis) how language is formed. 
As well as the two approaches to the underlying technology (rule and statistical based systems), a “third way” exists in hybrid systems that combine the two to achieve more effective results. This is still an emerging and exciting area of Computer Science, with research being undertaken to improve quality in all areas. There is a real sense that a number of advancements are still possible in automating processes currently undertaken by humans. There is also a business requirement driving research and development. Whether we approve of it or not, we now live in a truly global village with a need for fast, efficient and above all accurate tools to assist in our international trading environment. Keywords: Text Analysis, Textual Understanding, Information Retrieval, Natural Language Processing, Example Based Machine Learning, Rule Based Machine Learning. -2-
  • 3. Ben Addley IR&NLP Coursework P1 Contents
    ACADEMIC YEAR 2004 - 2005 ... 1
    ABSTRACT ... 2
    CONTENTS ... 3
    DISCUSSION NOTES ... 5
    QUESTIONS FOR DISCUSSION ... 6
    1. INTRODUCTION ... 7
    2. HISTORY OF TEXT ANALYSIS ... 8
        2.1 THE INNOVATORS ... 8
    3. THE BUILDING BLOCKS OF TEXTUAL UNDERSTANDING ... 9
        3.1 RULE BASED APPROACH ... 9
        3.2 STATISTICAL BASED APPROACH ... 9
        3.3 AMBIGUITY ... 11
    4. TEXT ANALYSIS: TECHNOLOGY OVERVIEW ... 11
        4.1 PRE-PROCESSING ... 11
        4.2 PRODUCT TYPES ... 11
    -3-
  • 4. Ben Addley IR&NLP Coursework P1
            4.2.1 LANGUAGE ... 11
            4.2.2 CONTENT ... 11
    5. TEXT ANALYSIS WITHIN THE FIELD OF INFORMATION RETRIEVAL ... 12
        5.1 DISTILLING THE MEANING OF A DOCUMENT ... 12
        5.2 TEXT BASED NAVIGATION ... 13
        5.2 TOPIC STRUCTURE ... 13
        5.3 CLUSTERING ... 15
        5.4 AUTOMATIC TEXTUAL SUMMARISATION ... 15
            5.4.1 EXTRACTION ... 15
            5.4.2 ABSTRACTION ... 15
    6. TEXT ANALYSIS WITHIN THE FIELD OF NATURAL LANGUAGE PROCESSING ... 16
        6.1 MACHINE TRANSLATION ... 16
        6.2 SPEECH TO TEXT SYSTEMS ... 16
        6.3 TEXT TO SPEECH SYSTEMS ... 17
        6.4 CHATTERBOTS (UNDERSTANDER-SYSTEMS) ... 17
        6.5 ANTI PLAGIARISM TOOLS ... 17
    7. WHAT THE FUTURE HOLDS (IN MY OPINION) ... 18
    8. CONCLUSIONS ... 19
    9. REFERENCES ... 20
        9.1 WEB REFERENCES ... 20
            9.1.1 GENERAL WEB RESOURCES ... 22
            9.1.2 GENERAL BOOK RESOURCES ... 23
    FIGURE 1: COMPUTERISED UNDERSTANDING OF LANGUAGE, WINOGRAD T. STANFORD AI SERIES ... 10
    FIGURE 2: EXAMPLE OF TOPIC STRUCTURE (REPORT WIDE) ... 14
    -4-
  • 5. Ben Addley IR&NLP Coursework P1 Discussion Notes Article from The New York Times December 25, 2003: ‘Get Me Rewrite!’ ‘Hold On, I’ll Pass You the Computer.’ By Anne Eisenberg. In the famous sketch from the TV show “Monty Python’s Flying Circus,” the actor John Cleese had many ways of saying a parrot was dead, among them, “This parrot is no more,” “He’s expired and gone to meet his maker,” and “His metabolic processes are now history.” Computers can’t do nearly that well at paraphrasing. English sentences with the same meaning take so many different forms that it has been difficult to get computers to recognize paraphrases, much less produce them. Now, using several methods, including statistical techniques borrowed from gene analysis, two researchers have created a program that can automatically generate paraphrases of English sentences. The program gathers text from online news services on specific subjects, learns the characteristic patterns of sentences in these groupings and then uses those patterns to create new sentences that give equivalent information in different words. The researchers, Regina Barzilay, an assistant professor in the department of electrical engineering and computer science at the Massachusetts Institute of Technology, and Lillian Lee, an associate professor of computer science at Cornell University, said that while the program would not yield paraphrases as zany as those in the Monty Python sketch, it is fairly adept at rewording the flat cadences of news service prose. Give it a sentence like “The surprise bombing injured 20 people, 5 of them seriously,” Dr. Barzilay said, and it can match it to equivalent patterns in its databank and then produce a handful of paraphrases. For instance, it might come up with “Twenty people were wounded in the explosion, among them five in serious condition.” Programs that can detect or crank out multiple paraphrases for English sentences could one day have wide use. 
They might help create summaries of reports or check a document for repetition or plagiarism. Questions typed into a computer with such a program might in the future be automatically paraphrased to make it easier for a search engine to find data. Such programs might even be an aid to writers who want to adapt their prose to the background of their readers. Dr. Lee said the researchers had thought about using it “as a kind of ‘style dial’ “ to rewrite documents automatically for different groups - adapting articles on technical subjects for a children’s encyclopaedia, for example. She cautioned, however, that the work was preliminary and much more research was needed before it might be available for practical use. Fernando Pereira, chairman of the computer and information science department at the University of Pennsylvania said that the paraphrasing work had given him pause. “It’s a little bit humbling if you have the idea that we are creative when we write,” he said, only to discover that one’s special turns of phrase have already been tried by hundreds of other writers and can be found online. “The real insight of this work,” he said, “is that if there is a way of saying something, someone has already said it.” -5-
  • 6. Ben Addley IR&NLP Coursework P1 Questions for Discussion 1. The program outlined in the article above gathers text from online news services, learns the characteristic patterns of sentences in these groupings and then uses those patterns to create new sentences that give equivalent information in different words. Do you think a program like this can only operate within a limited domain like news reports, or can it be applied to wider subjects? 2. What are your thoughts on Dr Fernando Pereira’s comment: “The real insight of this work, is that if there is a way of saying something, someone has already said it”? 3. What are the potential problems with applications that adapt prose or paraphrase sentences? Do you think that the context or subtle meaning might be lost if text is parsed through software as described in the article? 4. Do you think that tools such as Barzilay and Lee’s are applications to aid humans in the work they do, or should they be used instead of humans? What are the problems with the latter option, and are there any ways in which other areas of NLP could be applied to assist? -6-
  • 7. Ben Addley IR&NLP Coursework P1 1. Introduction What is Text Analysis? It is a very basic question, which you would expect to have a very basic answer. Unfortunately, this area, concerned with the textual understanding of a document, is more challenging, and a two-line definition simply doesn’t suffice. In this paper I will introduce the technical and linguistic terminology necessary for an understanding of the role Text Analysis plays in the two core application areas of Information Retrieval (IR) and Natural Language Processing (NLP). I will not cover the complicated and highly technical processes that go into making Text Analysis actually work. Instead, what is presented below is an introduction that should allow the reader to research areas of interest in more depth. In its rawest sense, text analysis is key to any IR or NLP process. We as users must know some information about what it is we are trying to retrieve; this comes from perhaps the title of the work, the author’s name, section or chapter headers and of course the content. As human beings we are able to ingest and process this data in a number of sophisticated ways, the most important of which is our cognitive ability to understand the context and importance of certain words and phrases. Text analysis, and in the case of this coursework Automatic Text Analysis, is the study of how a computer can be used to identify the context and importance of text without the intervention of a human. Wouldn’t it be nice not to have to rely on others’ IT literacy when searching for documents via search engines or the web? To be able to simply enter a phrase, question or query and have a computer return not just a document that contains those keywords, but somehow understand what it was you were looking for and return only those documents you were actually seeking? 
These techniques, as well as being fundamental to IR, can aid and extend the effectiveness of other areas such as research within NLP: they can aid interpretation, textual navigation, translation and speech based tools, and facilitate word/phrase-spotting programs. As our world becomes a smaller place with increased and more effective communication networks, we have to utilise these techniques to add value to our interactions. Asking a natural question and receiving a natural, sensible and contextual response from a machine is an important goal in our global community. Text Analysis plays a fundamental role in achieving that goal, but it is a complicated one! This coursework will attempt to answer some of the basic questions involved in how a computer achieves this feat of Artificial Intelligence (AI). It will also look at where Text Analysis technology currently resides within the fields of IR and NLP, and what the future might hold if we continue to research and develop in this exciting and challenging area. A question you may wish to bear in mind whilst reading this document is: does Text Analysis constitute actual understanding of the textual input, or simply an electronic approximation of understanding? -7-
  • 8. Ben Addley IR&NLP Coursework P1 2. History of Text Analysis Text analysis is not new: understanding what fundamentally links words, sentences, passages and even whole books has been of interest since the Middle Ages. The first threads of text analysis can be followed back to medieval biblical scholarship. Intellectuals would try to find parallels between the New and Old Testaments, where passages might be linked according to places, periods and people. This resulted in the first concordances¹, a tool still used in computer based text analysis today. The first advancement in purely computer based TA came during the 1950s; however, research into computational linguistics and the “holy grail” of an understanding machine began in the 1940s, right at the birth of modern computing. 2.1 The Innovators “In 1949, Warren Weaver proposed that computers might be useful for ‘the solution of world-wide translation problems’ and the resulting effort, called machine translation, attempted to simulate with a computer the presumed functions of a human translator” [Avron Barr 1980]. IBM first demonstrated a basic word for word translation machine in 1954, but these early attempts at machine translation failed because of the simplistic assumption that word equivalency techniques and sentence re-ordering would suffice for a translation machine. AI research took on new ideas, the most important of these being textual understanding. The work of Chomsky in the field of linguistic theory, coupled with advancements in programming languages in the 1960s, heralded a surge in AI/NLP work. By the 1970s the approach was to model human language as knowledge based systems, and to understand how language works in order to build up rules that apply to these systems. Terry Winograd, in 1972, “groups natural language programs according to how they represent and use knowledge of their subject matter” [Avron Barr 1980]. 
He proposed four historical groupings based on this approach. The first programs of this period that attempted to analyse and understand textual input were built on limited domains (BASEBALL, ELIZA etc.). Then followed systems that used semantic memories and indexing to retrieve and understand words or phrases. A third approach, called limited logic systems, followed during the mid to late sixties, and finally came a fourth group, knowledge based systems, using first order logic and semantic nets, such as William Woods’s LUNAR program and Winograd’s SHRDLU system. More recently we have seen a shift away from the ideas of Winograd and the early pioneers of textual understanding within NLP. With increased processor speeds and the advent of super-computing, new methods and theories have been developed in example based machine learning. Today the arguments rage about the respective merits of the two main approaches, with new developments in hybrid systems combining the two. ¹ In its most basic form, a concordance is an index that includes a line of context against each entry or occurrence of a word. -8-
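The concordance described in the footnote above can be illustrated with a short sketch. This is only a minimal illustration in Python, not a tool referred to in this paper: the function name `concordance`, the `width` parameter and the sample sentence are all assumptions made for the example.

```python
import re

def concordance(text, keyword, width=3):
    """Return one line of context for every occurrence of `keyword`.

    `width` (a hypothetical parameter) is the number of words of context
    kept on each side of the keyword.
    """
    tokens = re.findall(r"[A-Za-z']+", text)
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append(f"{left} [{token}] {right}".strip())
    return lines

# Illustrative sample text only.
sample = ("In the beginning God created the heaven and the earth. "
          "And the earth was without form.")
for line in concordance(sample, "earth"):
    print(line)
```

Each printed line is one index entry: the keyword in brackets with a few words of context either side, which is the key-word-in-context layout that later computer based tools adopted.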
  • 9. Ben Addley IR&NLP Coursework P1 3. The Building Blocks of Textual Understanding The basic building block of text analysis is textual understanding and, in particular, the language it is built upon. When we discuss the fundamentals of this subject we must also understand the guiding principles of how language is made up. We differ from all other species of animal in that we don’t just communicate through signals (as other creatures can) but use sophisticated properties of language to do so. Artificial Intelligence (AI) is the branch of Computer Science that undertakes investigation and development of models and tools to replicate this “behaviour” in machines. AI has been defined as “the science of making machines do things that would require intelligence if done by men” [Minsky 1968]. Winograd proposed a series of pre-defined stages that must be adhered to for computerised understanding of language to occur. These follow closely the traditional linguistic approach to word formation processes. 3.1 Rule Based Approach Figure 1 (below) demonstrates this logical rule based approach to textual understanding, in which written language is parsed through a series of modular analytical components. The output of one module acts as the input to the next, and so on until the process has been completed. Each analytical area concentrates on a particular component of natural language (morphemes, words, syntax, semantics etc.) and calls upon a number of pre-defined rules (shown in the ellipses) to judge whether that component of text fits a particular rule. Once identified, it is passed on to the next module for further analysis; if it is later (usually at the semantic or pragmatic stage) proved to be incompatible or incorrect, it is passed back to a previous layer. The disadvantage of such an approach is that it is language dependent and will need to be re-modelled for each additional language on which you want to perform textual analysis. 
The other main problem is that it requires large amounts of initial human intervention in the programming and rule development stage, which ties in with the first problem. Such problems have an impact on applications, as we’ll see later in this paper. 3.2 Statistical Based Approach Another name for this approach is machine learning or example based learning. It requires the computer to parse huge amounts of text and, by comparing common compositions of texts (usually domain specific), gradually “learn” through statistical analysis how syntax and certain bigrams and trigrams of words and phrases within a language are formed. The advantage of this method is that it is language independent and requires little human intervention. In theory, if you have a large enough corpus in any language, you can teach the system that language in a relatively short period of time. The disadvantage comes from the scale of corpus required: it must contain enough compositions and varied bigrams and trigrams to develop a sufficient understanding of language. -9-
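To make the idea of learning from bigrams and trigrams more concrete, here is a minimal sketch in Python. It assumes whitespace tokenisation and a three-sentence toy corpus invented for the example; the function names are hypothetical, and a real system would need a far larger corpus, as noted above.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def ngram_counts(corpus, n):
    """Count n-grams over a list of sentences (lower-cased, split on spaces)."""
    counts = Counter()
    for sentence in corpus:
        counts.update(ngrams(sentence.lower().split(), n))
    return counts

def next_word_probability(corpus, first, second):
    """Estimate P(second | first) as count(first, second) / count(first, *)."""
    bigrams = ngram_counts(corpus, 2)
    total = sum(c for (w1, _), c in bigrams.items() if w1 == first)
    return bigrams[(first, second)] / total if total else 0.0

# Toy corpus, invented purely for illustration.
toy_corpus = [
    "the parrot is dead",
    "the parrot is no more",
    "the bird is dead",
]

print(ngram_counts(toy_corpus, 2).most_common(3))
print(next_word_probability(toy_corpus, "is", "dead"))
```

Nothing in the sketch is specific to English, which reflects the language independence claimed for this approach; the estimates are only as good as the corpus they are counted from.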
  • 10. Ben Addley IR&NLP Coursework P1 Figure 1: Computerised Understanding of Language, Winograd T. Stanford AI Series - 10 -
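The modular, rule based pipeline of Section 3.1 can be sketched in a few lines of Python. This is a toy under stated assumptions, not Winograd's system: the lexicon, the suffix list and the single syntactic pattern are invented for the example, and the passing back of failed analyses to an earlier layer is left out (a failed rule simply raises an error).

```python
# Toy rule sets; every entry here is an illustrative assumption.
LEXICON = {"the": "DET", "parrot": "NOUN", "dog": "NOUN",
           "sleep": "VERB", "bark": "VERB"}
SUFFIXES = ("ing", "ed", "s")

def morphological(words):
    """Strip a known suffix when the remainder is a word in the lexicon."""
    stems = []
    for word in words:
        stem = word
        for suffix in SUFFIXES:
            if word.endswith(suffix) and word[:-len(suffix)] in LEXICON:
                stem = word[:-len(suffix)]
                break
        stems.append(stem)
    return stems

def lexical(stems):
    """Attach a word class to each stem; unknown words break the rules."""
    unknown = [s for s in stems if s not in LEXICON]
    if unknown:
        raise ValueError(f"no lexical rule for: {unknown}")
    return [(s, LEXICON[s]) for s in stems]

def syntactic(tagged):
    """Accept only the single pattern DET NOUN VERB."""
    if [tag for _, tag in tagged] != ["DET", "NOUN", "VERB"]:
        raise ValueError("no syntactic rule matches")
    return {"subject": tagged[1][0], "action": tagged[2][0]}

def analyse(sentence):
    """Run each module in turn; the output of one is the input of the next."""
    result = sentence.lower().split()
    for stage in (morphological, lexical, syntactic):
        result = stage(result)
    return result

print(analyse("The parrot sleeps"))
```

Even at this scale the two disadvantages noted in the text are visible: every rule was written by hand, and none of them would carry over to another language.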
  • 11. Ben Addley IR&NLP Coursework P1 3.3 Ambiguity Ambiguity increases the range of possible interpretations of natural language, and a computer has to find a way to deal with this. [Inman D. 1997] This is another key issue for Text Analysis, as computers have to choose how words and phrases are interpreted. It is an easier problem to overcome with an example based learning approach. 4. Text Analysis: Technology Overview Within Text Analysis we have seen the two approaches to the underlying technology (rule and statistical based systems). There is of course a third way: a hybrid method combining the best elements of rule based and statistical based methodologies. 4.1 Pre-processing When undertaking an analysis of target text it is helpful to carry out automatic pre-processing to strip away the words with no semantic meaning. This can aid example based learning, as unusual or one-off bigrams/trigrams are negated. Another approach is to carry out stemming, which identifies and removes prefixes, suffixes and endings (the affix morphemes), leaving just the stem (or core meaning) of the word. For example, learn is the stem of learning; by concentrating on the stem, both terms are identified as having the same core, thereby improving the analysis of the text. 4.2 Product Types There is a rich variety of software that supports the general task of Text Analysis within the different disciplines of Human Computer Interaction (HCI). Due to this variety it is helpful to set out a brief general classification of the main areas. The classification below, taken from Harold Klein’s discussions after the Acapulco ICA conference in 2000, breaks text analysis software down into two broad areas. 
Firstly, language and its makeup, and secondly content, which deals with the “what” being communicated:
    4.2.1 Language Dealing with the use of language, of which there are two further sub-categories:
      • Linguistic: applications like parsing, lemmatising words²
      • Data bank: information retrieval in texts, indexers, concordances, word lists, KWIC/KWOC (key-word-in-context, key-word-out-of-context)
    4.2.2 Content Dealing with the content of human communication, mainly texts.
      • Qualitative: looking for regularities and differences in text, exploring the whole text (QDA - qualitative data analysis). A few programs also allow the processing of audio and video information.
    ² Lemmatising means grouping related words together under a single headword. A Lemmatiser (tool) allows you to define groups of related words and then apply your groupings to words displayed in the Wordlist. - 11 -
  • 12. Ben Addley IR&NLP Coursework P1
      • Event data: analysis of events in textual data
      • Quantitative: analyse the text selectively to test hypotheses and draw statistical inferences. Output is a data matrix that represents the numerical results of the coding.
          o Category systems: provided by the software developer (instrumental) or by the researcher (representational). This is selective: only search patterns are searched for in the text and coded. Software packages with built-in dictionaries are often language restricted; some have limits on the text unit size and are restricted to processing responses to open ended questions rather than analysing mass media texts. The categories can be thematic or semantic; this can have implications for the definition of text units and external variables.
          o No category system: using co-occurrences of words/strings and/or concepts, which are displayed as graphs or dendrograms.
          o For coding responses to open ended questions only: these programs cannot analyse huge amounts of text; they are suited to rather homogeneous texts only and are often limited in the size of a text unit.
    [Klein 2002] Obviously the above is just one interpretation of the many approaches taken in the field of Text Analysis. 5. Text Analysis Within the Field of Information Retrieval So we now know a little about what text analysis is and the linguistic theory behind the concept. The key question now is: how do we transform this powerful concept into products that can actually help us in our day-to-day lives? Text analysis is already used in many commercial systems, and the first part of this section will investigate where the technology currently lies, what it is used for and who is using it. 
5.1 Distilling the Meaning of a Document Making informed and ultimately correct decisions in our busy working lives often requires the time consuming analysis of large volumes of textual information. Students, researchers and professionals (such as analysts, lawyers and editors) are faced with various TA tasks, all requiring the extraction of the core meaning from a document. In today’s “information rich” environment, huge piles of information build up in traditional repositories held in libraries, businesses, individual PCs and, of course, the ubiquitous World Wide Web. The amount of information being produced and stored is growing at an extraordinary rate, with some predictions stating that we will, at current growth rates, have more information than atoms to store it on! - 12 -
  • 13. Ben Addley IR&NLP Coursework P1 Whether you believe the doomsday-like prophecies or not, it is a fact that human beings are increasingly unable to meet the challenges of this growth. “Mankind is searching for intelligent electronic assistants to help with text analysis projects” [HALLoGRAM Publishing, 2000]. In particular, we require help to derive the semantic value of a document in a concise form. Once this is achieved we can apply the knowledge to a number of other applications, as we’ll discuss throughout the rest of this section. 5.2 Text Based Navigation To understand this application of Text Analysis we need to understand semantic networks. “Semantic networks are knowledge representation schemes involving nodes and links between nodes” [Duke University]. Essentially, a semantic network is a conceptual web of linked nodes that point to all other nodes containing listed objects. “Concepts stored in the semantic network are hyper linked to those sentences where they have been encountered, and the sentences are in turn hyper linked to the places in the original text from where they have been retrieved” [Sergei Ananyan, Alexander Kharlamov 2004]. By using Text Analysis principles to build semantic networks automatically, we can efficiently navigate through stored texts and useful linked documents, which is a core application within the wider field of Information Retrieval. This can be done across multiple documents simultaneously, thus creating a very powerful IR tool. Direct applications for this can be found in website design and navigation, and I myself have investigated a version of this tool for my BSc final project. I have used Microsoft Indexing Services (part of the MS IIS suite of administrative tools) to index and analyse documents stored within a catalogue. It analyses the textual content of documents and sorts it into categories, which can then be searched using a simple query language. 
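The nodes-and-links idea behind text based navigation can be sketched as follows. This Python sketch makes strong simplifying assumptions: every non-stop word counts as a “concept”, two concepts are linked whenever they share a sentence, and the stop word list and sample text are invented for the example. It is not the algorithm of the quoted authors, nor how Microsoft Indexing Services works.

```python
import re
from collections import defaultdict
from itertools import combinations

# A tiny, hypothetical stop word list.
STOP_WORDS = {"a", "an", "and", "are", "in", "is", "of", "the", "to"}

def build_network(text):
    """Build a toy semantic network from raw text.

    Returns the sentence list, a map from each concept to the sentences
    it occurs in, and a map from concept pairs to co-occurrence counts.
    """
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    occurrences = defaultdict(set)   # concept -> indices of its sentences
    links = defaultdict(int)         # (concept, concept) -> link strength
    for index, sentence in enumerate(sentences):
        words = re.findall(r"[a-z]+", sentence.lower())
        concepts = sorted({w for w in words if w not in STOP_WORDS})
        for concept in concepts:
            occurrences[concept].add(index)
        for pair in combinations(concepts, 2):
            links[pair] += 1
    return sentences, occurrences, links

def navigate(concept, sentences, occurrences):
    """Follow the 'hyperlinks' from a concept back to its sentences."""
    return [sentences[i] for i in sorted(occurrences.get(concept, ()))]

sample = ("Semantic networks link nodes. Nodes represent concepts. "
          "Concepts link documents.")
sentences, occurrences, links = build_network(sample)
print(navigate("nodes", sentences, occurrences))
```

Looking a concept up returns the sentences it was found in, which mirrors the concept-to-sentence hyperlinking described in the quotation; the link counts give a crude measure of how strongly two concepts are related.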
5.2 Topic Structure What do we mean when we talk about topic structure? We can use text analysis to identify the most relevant and significant concepts from the semantic network of the text and transform them into a tree-like structure of topics sorted by importance. The limbs of the tree represent relations between headings and content in the text. Some of these limbs are strong and form the basis of the structure; others are weaker, often irrelevant or indirect, and need to be replaced with more direct ones. To take this paper as an example: I have an introduction with no nested topics, followed by a series of topics, some with sub-topics and areas of interest. This is built up over the document as a whole to construct a visual structure that reveals a hierarchy of themes within the text, which can then be used as a powerful information retrieval method. Below is a tree-like listing of this coursework as viewed through a topic structure. Main headings hug the left hand side (red line), with secondary (blue) and tertiary (green) topics indented to the right. Content text is omitted. This is a rather simplistic model but demonstrates the point. - 13 -
  • 14. Ben Addley IR&NLP Coursework P1 Figure 2: Example of Topic Structure (Report Wide) - 14 -
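A heading-based topic tree of the kind shown in Figure 2 can be built with a short sketch. The Python below assumes that the hierarchy is already encoded in the section numbers, which is a simplification: deriving topics and their importance from a semantic network, as described above, is not attempted. The function names are hypothetical; the headings are taken from this coursework's own contents list.

```python
def build_topic_tree(headings):
    """Nest headings by the depth of their section number.

    A number such as '5.4.1' has depth 3, so it becomes a child of the
    most recent depth-2 heading.
    """
    root = {"title": "root", "children": []}
    stack = [(0, root)]                  # path from the root to the current node
    for number, title in headings:
        depth = number.count(".") + 1
        node = {"title": f"{number} {title}", "children": []}
        while stack[-1][0] >= depth:     # climb back up to this heading's parent
            stack.pop()
        stack[-1][1]["children"].append(node)
        stack.append((depth, node))
    return root

def render(node, indent=0):
    """Return the tree as indented lines, main headings hugging the left."""
    lines = []
    for child in node["children"]:
        lines.append("    " * indent + child["title"])
        lines.extend(render(child, indent + 1))
    return lines

headings = [
    ("5", "Text Analysis Within the Field of Information Retrieval"),
    ("5.3", "Clustering"),
    ("5.4", "Automatic Textual Summarisation"),
    ("5.4.1", "Extraction"),
    ("5.4.2", "Abstraction"),
    ("6", "Text Analysis Within the Field of Natural Language Processing"),
]
tree = build_topic_tree(headings)
print("\n".join(render(tree)))
```

The printed listing has the same shape as the figure: primary topics on the left, with secondary and tertiary topics indented beneath them.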
  • 15. Ben Addley IR&NLP Coursework P1 5.3 Clustering Clustering is built upon the previous technology of topic structure but goes one stage further. Topic structures sever the links representing weak relations in the text and substitute certain indirect relations with direct ones. With clustering, on the other hand, those links that fall below a pre-defined level of strength are eliminated altogether. This breaks up texts that have been collected together into individual groups that more clearly represent a common subject or theme. Documents can thus be grouped into particular subject areas, facilitating searching, indexing and analysis on that theme as well as on the original topic structure. 5.4 Automatic Textual Summarisation “The goal of automatic summarisation is to take an information source, extract content from it, and present the most important content to the user in a condensed form and in a manner sensitive to the user’s or application’s need”. [Inderjeet Mani 2001] Textual summarisation techniques have their historical roots in the 1960s; however, with the proliferation of the Internet and growth in document production, new interest in the technology has developed. Broadly speaking, techniques can be divided into two categories: 5.4.1 Extraction Extraction techniques, in general, simply copy the information deemed most important into a summary. There are numerous methods for carrying out extraction summarisation; one important and widely used methodology utilises algorithms to score individual sentences in the target text. It is both robust and accurate, and is based on the number of important semantic concepts in a sentence. The larger the number of these concepts and the stronger they are, coupled with the relationship they have with each other, the higher the semantic weight (or score) of the sentence. 
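The sentence scoring just described, together with a pre-set selection threshold, can be sketched as follows. Note the substitution: this Python sketch uses plain word frequency as a stand-in for the strength of a semantic concept, so it is a deliberately crude approximation rather than the semantic network weighting the text refers to. The function name, stop word list and sample text are assumptions made for the example.

```python
import re
from collections import Counter

# A tiny, hypothetical stop word list.
STOP_WORDS = {"a", "an", "and", "is", "of", "the", "to"}

def summarise(text, threshold):
    """Keep the sentences whose score reaches `threshold`.

    A word's weight is how often it occurs in the whole text; a
    sentence's score is the sum of the weights of its distinct words.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    tokenised = [
        [w for w in re.findall(r"[a-z]+", s.lower()) if w not in STOP_WORDS]
        for s in sentences
    ]
    weights = Counter(w for words in tokenised for w in words)
    scores = [sum(weights[w] for w in set(words)) for words in tokenised]
    return [s for s, score in zip(sentences, scores) if score >= threshold]

sample = "Text analysis helps retrieval. Retrieval needs analysis. Cats sleep."
print(summarise(sample, threshold=5))
print(summarise(sample, threshold=6))
```

Raising the threshold shortens the summary and lowering it lengthens it, so the threshold is the single control over summary size in this sketch.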
The summarisation tool then sorts and collects only those sentences that fall above a pre-set score or weight, resulting in a truncated piece of text that summarises the original…hopefully accurately!

“The size of the summary is controlled through changing the sentence selection threshold. An advanced algorithm used for developing an accurate semantic network ensures the high quality and relevance of the created summary”. [Sergei Ananyan, Alexander Kharlamov 2004]

The same concept can be applied to paragraphs and other units of text within documents.

5.4.2 Abstraction

Abstraction in its purest sense is the process of distancing objects from ideas. In the context of summarisation it involves paraphrasing sections of the source document. Usually, abstraction can condense a text more thoroughly than the extraction process discussed above, but the programs that do this are generally harder to develop and implement. An example of abstraction in summarisation would be to “understand” the concept of a sentence or phrase.

So how do we use automatic textual summarisation and what is it good for? There are numerous practical applications and professions using the technology in everyday tasks. Probably the most common is the news and broadcasting industry, where summarisation is used for newspaper articles and scientific and technological journals. Another important and widely used application is in search engine technology and IR. Automatic textual summarisation is a “cross over” application used in both IR and NLP.

6. Text Analysis Within the Field of Natural Language Processing

In this section I’ll discuss some of the realised applications and products for Text Analysis technology and some current and future advancements. TA is just one part of the wider AI problem within NLP that we face. Core to the issue is NLP and how we can develop systems that are able to interpret the way we, as humans, communicate. It is the basis of many other technologies as well as information retrieval: machine translation, for example, is based on the textual analysis of data, as are text to speech engines and language spotters, all discussed further in this section.

6.1 Machine Translation

“Machine translation (MT) is the application of computers to the task of translating texts from one natural language to another. One of the very earliest pursuits in computer science, MT has proved to be an elusive goal, but today a number of systems are available which produce output which, if not perfect, is of sufficient quality to be useful in a number of specific domains” [EAMT June 2004].

MT is probably the most well known of the text analysis applications in existence today but still remains one of the most intangible. Although there have been many advances in the field in recent years, products are still far from perfect and often can’t deal with truly natural language such as colloquialisms. The primary role of MT software is twofold:

1. Providing gisting (or rough translations) to non-native speakers of a particular language, enabling them to gain an understanding of the document and whether it is of relevance to them.
2. As part of professional transcription and translation workbenches.
This application allows users with knowledge of a language to filter and triage large amounts of text. This can free up the professional’s time, give them a head start on the target text and enable them to work more efficiently.

There are other associated tools within this area, such as translation memories and glossaries, that also rely on text analysis techniques; however, these fall outside the main scope of the project.

6.2 Speech to Text Systems

I’m sure we’ve now all been exposed to the technological wonder that is speech to text systems. Every time we call up our bank or our gas supplier we are faced with the prospect of an automated voice asking us to tell it, through speech, which service we want. After repeatedly asking for our balance we are invariably put through to the section dealing with selling services of one description or another! It’s true that the technology doesn’t seem to be accurate, but big business has identified it as a major efficiency saving, and that is the driver behind it. There are of course other, more refined applications for the technology: language recognition and speech recognition, biometric tools and speaker verification tools. There are some links to companies providing these applications in the general web resources section.

6.3 Text to Speech Systems

One of the first aspects of natural language to be modelled was the actual articulation of speech sounds.

“Early models of ‘talking machines’ were essentially devices which mechanically simulated the operation of the human vocal tract. More modern attempts to create speech electronically, are generally referred to as speech synthesis”. [Yule G. The Study of Language; Pg 115]

The concept is to take a text and, using advanced TA techniques, tokenise it into the phonemes that make up the individual words. You then electronically reproduce the acoustic properties of those phonemes as sounds that can be played. It is a little trickier than the over-simplified explanation I have given, but it demonstrates the idea. There are multiple uses, such as:

• Call centre technology
• Mobile texting to landline phones, where the text message is translated into speech and transferred automatically to the designated landline connection. A working product has been developed by Loquendo (a subsidiary of Telecom Italia).

6.4 Chatterbots (Understander-Systems)

Things have come a long way since the early days of Eliza in the 1960s, and since Michael Mauldin of Carnegie Mellon University coined the term “Chatterbot” in 1994. Basic Chatterbots such as Eliza, Alice and Brian (see footnote 3) use a process of pattern recognition to analyse text and create an illusion of understanding. Keywords or phrases occurring in the user’s input trigger a particular type of pre-determined response, usually a question, that incorporates the keyword or phrase.

6.5 Anti Plagiarism tools

Text analysis and textual understanding are at the heart of most anti plagiarism tools.
To take an example from LSBU’s own efforts in this area, Thomas Lancaster (my old Java lecturer!) has developed a number of tools, one of which is:

“Text Analysis Tool (TAT), a system which presents a rolling representation of the stylistic properties of a submission so find areas that are likely to represent extra-corpal plagiarism.” [Lancaster 2002]

This is based around syntactic similarities in two target texts. Some tools use this core technique and provide visual representations of the results (VAST), while others simply present the two offending articles in a way that allows the user to make a decision more easily on the material they have in front of them (SSS, TRANK). This is an obvious labour saving tool for teachers and lecturers…and a source of fear and loathing for students the world over!

3 Visit the AAAI website for an extensive listing of Chatterbots (see reference section)
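To make the idea of measuring similarity between two target texts (section 6.5) more concrete, here is a minimal Python sketch. It is not how TAT, VAST, SSS or TRANK work; it swaps in a simpler, well-known technique, the Jaccard overlap of word trigrams, purely as an illustration. The function names are hypothetical.

```python
import re


def word_ngrams(text, n=3):
    """Set of overlapping word n-grams (as tuples) from lower-cased text."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def jaccard_similarity(text_a, text_b, n=3):
    """Shared n-grams divided by all distinct n-grams; 0.0 when both sets are empty."""
    a, b = word_ngrams(text_a, n), word_ngrams(text_b, n)
    if not a and not b:
        return 0.0
    return len(a & b) / len(a | b)


original = "the quick brown fox jumps over the lazy dog"
suspect = "the quick brown fox leaps over the lazy dog"
print(jaccard_similarity(original, suspect))
```

A score near 1 suggests heavy overlap and a score near 0 suggests little in common. Any cut-off for flagging a submission would be a judgment call, and the final decision still rests with the person reviewing the two texts.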
7. What the Future Holds (in my opinion)

This is still an emerging and exciting area of Computer Science, with research being undertaken to improve quality in all the areas mentioned in the sections above. There is a real sense that a number of advancements are still possible in automating processes currently undertaken by humans. There is also a business requirement driving research and development. Regardless of whether we approve of it or not, we now live in a truly global village with a need for fast, efficient and above all accurate tools to assist in our international electronic trading environment.

Below is a summary of some of the major technological advancements I predict will occur in the near to middle future. Some are logical progressions of the current technology available from commercial companies. Others are slightly more experimental, in that they may require a radical change in the way we look at the problems before viable solutions can be developed.

• We should start to see more enterprise solutions with Text Analysis at the core. There is a lot of development in the areas of speech to text and translation, with the goal being combined technologies that are transparent to the user, with seamless input of foreign speech to English textual output.
• Convergence of NLP and IR technologies, with more efficient methods to search for information in other languages, understanding the inferences of our query and returning results in an intelligent way.
• Although English has gradually been accepted as the lingua franca in most industries, there is a real need for translation tools, especially those dedicated to textual translation and the associated fields of retrieving that information. The EU alone dealt with 1,416,817 pages of text translation relying on text analysis in 2003, a rise of 9.4% from 2002 [EU DGT Annual Activity Report 2003], and other large bodies (both governmental and NGOs) are increasing budgets in the area to cope. This will be a major growth area and as such technological developments will follow.
• Email dialogue systems that take automatic input from mail servers, carry out a Text Analysis process, interrogate associated applications such as diary programs and address books, and output a response in the form of replies to the original emails. An example would be a meeting planner trying to arrange for everyone in a special interest group (SIG) to attend a quarterly meeting. One email is sent out by the chair suggesting dates; once it is received, each recipient’s computer carries out the process described above and returns an email to the chair with their availability. This may continue through a number of iterations before everyone is available, but the key thing is that the process has been automatic and independent of human interference.

There are of course many more examples; however, these seem logical and doable considering the current level of knowledge and application of the technology.
8. Conclusions

As users and producers of information, we have created a spiral of ever increasing quantities of stored data. With the amount of directly accessible data available to us today, we have had to find new ways in which to manage and review this overabundance of textual information. In this paper we have started to explore how a computer can be used to manage this problem through identifying the context and importance of textual input. We’ve looked at the applications of Text Analysis as part of NLP and within the field of Information Retrieval, and explored possible convergence in these areas.

The key areas and technologies presented above are a mere introduction and do not constitute an in-depth view. By showing how they broadly connect with Text Analysis and providing further resources (see reference section), I hope that the reader will further research the areas that directly interest them.

It is obvious to me from the study carried out for this project that work in the areas of text analysis and textual understanding within AI is far from complete. Research and development continues in universities, government and industry, trying to achieve better, more natural responses to the inputs we give machines. There are many approaches to this task, and we’ve briefly looked at rule-based analysis of text and example-based machine learning, as well as hybrids of the two.

At the beginning of this coursework report I posed a question: does Text Analysis constitute actual understanding of the textual input or simply an electronic approximation of understanding? The answer, in my opinion, is that the technology has come a long way in the past twenty years, but we have not developed, and are nowhere near developing, machines that understand in the way we understand. The question we should really ask is: can we ever truly develop understanding in computers?
I’ll conclude with the words of a philosopher rather than a scientist.

“It is a very remarkable fact that there are none so depraved and stupid, without even excepting of idiots, that they cannot arrange different words together, forming of them a statement by which they make known their thoughts; while on the other hand, there is no animal, however perfect and fortunately circumstanced it may be, which can do the same…” René Descartes (1637)

…What would he have made of man’s attempts to program machines to do the same!
9. References

Eisenberg A. 2003. “Get Me Rewrite!” “Hold On, I’ll Pass You to the Computer.” Article from The New York Times, December 25, 2003.
Mani I. 2001. Automatic Summarization. John Benjamins Publishing Company, Amsterdam/Philadelphia.
Minsky M. L. 1968. Semantic Information Processing. MIT Press.
Yule G. 1985. The Study of Language. Cambridge University Press.

9.1 Web References

o Avron Barr
  http://www.aaai.org/Library/Magazine/Vol01/01-01/vol01-01.html
  Last Updated: Spring 1981
  Downloaded: 19 November 2004
  A paper entitled “Natural Language Understanding” by Barr of Stanford University. Although published many years ago it still has a wealth of useful information; the history of NLP (page 2) is especially good.
o Duke University
  http://www.duke.edu/~mccann/mwb/15semnet.htm
  Last Updated: Unknown
  Downloaded: 16 November 2004

o EAMT
  http://www.eamt.org/mt.html
  Last Updated: 3 June 2004
  Downloaded: 20 November 2004
  Home page of the European Association of Machine Translation; this site is a good guide and technical resource in the area of MT.

o European Union Annual Activity Report 2003
  http://europa.eu.int/geninfo/query/engine/search/query.pl(AAR 2003 - Report ONLY.doc)
  Last Updated: 1 April 2004
  Downloaded: 21 November 2004
  Report into EU translation activities, very dry but full of useful statistics.

o HALLoGRAM Publishing
  http://www.hallogram.com/textanalyst/
  Last Updated: Unknown
  Downloaded: 7 November 2004
  A commercial site promoting a product called textanalyst. Has numerous information and definition pages. Very good for an introduction to the subject.

o Dave Inman
  http://www.scism.sbu.ac.uk/inmandw/tutorials/nlp/ambiguity/ambiguity.html
  Last Updated: 25 February 1997
  Downloaded: 21 November 2004
  All pages within the NLP tutorial site are directly relevant to text analysis and the wider fields of IR and NLP. A very good place to start any research in the area.
o Harold Klein
  http://www.textanalysis.info/terms.htm
  Last Updated: 19 May 2002
  Downloaded: 30 October 2004
  A really good general site housing a huge range of information on text analysis. An excellent place to start any research into the subject.

o Thomas Lancaster
  http://www.radford.edu/~sigcse/DC01/participants/lancaster.html
  Last Updated: 2002
  Downloaded: 20 November 2004
  A former PhD student at LSBU, Thomas developed a number of tools in the area of anti plagiarism. For how this links in with TA in more detail, either follow the above link or type his name and the term “anti plagiarism” into Google.

o Sergei Ananyan, Alexander Kharlamov
  http://www.megaputer.com/tech/wp/tm.php3#nav
  Last Updated: 2004
  Downloaded: 7 November 2004
  Automated Analysis of Natural Language Texts white paper offering all round useful information.

9.1.1 General Web Resources

These links act as an additional resource on the subject of Text Analysis within Information Retrieval. Most were used as general background reading for this coursework, some were quoted above and some were not used at all.

http://www.textanalysis.com/ - VisualText™ - part of Text Analysis International, Inc.
• Incorporated company providing products in Info Extraction and NLP. Very useful for the FAQ section: http://www.textanalysis.com/FAQs/faqs.html and the two main product sections that give an overview of current technology capabilities: http://www.textanalysis.com/Products/products.html

http://www.intext.de/eindex.html - Social Science Consulting (English language)
• Very good site dedicated to all things TA. Has a very good history section and applications for this technology.
http://www.semantic-knowledge.com/tropes.htm - Semantic Knowledge Product Site
• Another product on the market, this time from Semantic Knowledge. Tropes offers TA plus IR technologies as a joined up product. Not much in the way of background reading. There is some useful info on how the engine works though: http://www.semantic-knowledge.com/fonction.htm

http://www.megaputer.com/tech/wp/tm.php3
• White paper from Megaputer™ (company offering TextAnalysis product). Good background and introduction to the subject. Very useful section on the history of the subject: http://www.megaputer.com/tech/wp/tm.php3#history and new opportunities looked at from the business perspective:

http://www.scism.lsbu.ac.uk/inmandw/tutorials/nlp/index.html
• Paper by Dave Inman on the complexities and possibilities of NLP via computers. Not specifically aimed at my coursework area but it raises very useful questions and will be useful to add context to TA within the wider field of IR. The two key links off this page are “can computers understand language?” and “does the structure of language help NLP?”

http://www-2.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/0.html
• What appears to be a very useful repository on all things NLP from Carnegie Mellon University. It is fairly out of date but good for general background.

http://www.loquendo.it/en/index.htm
• English homepage of Loquendo, an Italian company providing application tools involving speech to text and vice versa.

9.1.2 General Book Resources

As well as the above links, there is also a good grounding to be found in the core textbook, FOA: A cognitive perspective on search engine technology and the www (Belew R. K.). There is also the secondary textbook, Information Retrieval (Van Rijsbergen C. J.). This second book is available on the www and has a specific section on Automatic Text Analysis: http://www.dcs.gla.ac.uk/Keith/Preface.html