An Introduction to Text Mining


  • According to Gartner Group: “Predictive analysis helps you connect data to effective action by drawing reliable conclusions about current conditions and future events.” Predictive analysis leverages an organization’s business knowledge by applying sophisticated analytic techniques to enterprise data. It turns that data into insights that lead to the development of programs to increase revenues, reduce costs, improve processes, and prevent criminal or fraudulent activities. It encourages actions that demonstrably change how people behave as your customers, employees, patients, students, and citizens. Bottom line: it turns data into effective actions that positively impact your bottom line.
  • Here are some of the stats you may want to know about SPSS (read highlights from the slide). SPSS has been a cornerstone of the software industry since 1968. We’ve also been at the forefront of blending both new and established technologies to help customers around the world solve business problems. We’ve continued to grow, deliberately and thoughtfully, over the years, acquiring companies and technology complementary to our existing business. The bottom line for you: We’re here with innovative, proven solutions to help you solve your immediate business problems. And we’ll be here in the future to support you and your organization…
  • The Clementine Server data mining workbench fits within SPSS’ overall business intelligence product strategy. Our entire business intelligence product line includes products for collecting data, preparing data, reporting and OLAP, as well as modeling. Because different users have different needs and levels of expertise, they are presented with the appropriate product interface. SPSS delivers the right product for every person, supported by our 30 years of experience in data analysis and data mining. The analytical solutions we deliver are scalable and can be deployed throughout your organization to help you transform your business with information.
  • It’s been estimated that 80% of the data in an organization is in an unstructured format, that is, in the form of documents, HTML pages, database notes, email, open-ended survey responses, etc. This means that decision-makers often rely on only 20% of the data available, plus the few documents they have time to read. Take open-ended surveys, for example: cross-tab reports of closed responses are common, but the open-ended responses, which hold valuable information that qualifies those responses and brings up new themes, often go unanalyzed because organizations rarely have the tools or the time to truly process and disseminate this important information. In a similar fashion, database notes on customer contacts are effectively used to manage individual contacts, but this valuable source of customer information is never used to really understand the customer experience overall. What if you could use this information to keep and grow customers and increase customer lifetime value?
  • So, I think we can agree that there is a need for text analysis, but where can this technology be applied? [click] Well, surveys are the most obvious; we’ve just talked about that. [click] We could apply this to email, reading the email and making a handling decision based on its content. [click] Call centre data is another candidate for text analysis. What are my customers’ complaints? Where are there problems with my product? What did the customers who left have to say about my service? [click] Reading comment data is an important potential application, one that is currently laborious or subjective. In the State of Georgia example, comments are triaged and only those indicating a definite problem or requirement for re-arrest are used. The majority are ignored, even though there is a real sense that there is something in that group. [click] The ability to read abstracts from online databases using a more intelligent engine than a simple word search is an application. [click] Document management, the ability to categorize gigabytes of documents, policies and procedures, is an application, and along with that [click] corporate history, the ability to manage corporate information resources, is an application. Finally, [click] we have seen some use of this technology in the analysis of the messages in websites. What concepts is your website conveying, and are these concepts the appropriate ones, appropriately placed?
  • At a high level, you need linguistics, or Natural Language Processing, to extract concepts. These concepts form the basis of business user interfaces such as concept maps, or feed data mining techniques that predict customer behavior.
  • Morphology is the study of the structure and form of words. Syntax is the study of how words and phrases form sentences. Semantics relates to the meaning of words and statements. Phonology is the study of sounds in language. Pragmatics is the study of how context affects meaning, which includes idiomatic phrases that cannot be analyzed with strict semantic analysis. We tend to deal with the first three and ignore the last two when we are talking about natural language processing.
  • So how do we get from text to concepts? Linguistics, the science of language, includes ideas such as 1) morphology, or how words change based on part of speech, 2) syntax, or how sentences are structured, 3) semantics, or the meaning of words, and 4) statistics, such as the frequency of terms and patterns. It takes linguistics to cut through the noise of text to find relevant concepts without leaving important concepts undiscovered. Other statistical or machine learning approaches fall short of linguistic extraction, because only a linguistic approach can deal with the ambiguity and complexity of text. That is, linguistic extraction is [click] accurate, [click] scalable, [click] customizable and [click] discovery-oriented. By accurate, I mean that [click] compound words, proper nouns, etc., [click] like these examples, are extracted. [click] In terms of scalability, we can process about 1 GB per hour, in multiple formats and multiple languages [click]. By customizable, [click] I mean that you can use dictionaries, rules and patterns [click] to tailor your extraction. Vertical resources can also be used, such as MeSH, the National Library of Medicine's medical thesaurus. And finally, [click] by discovery-oriented, I mean that, depending on your analysis, you can focus on known terms, unknown terms, new terms and [click] trends.
  • For the next step, moving from concepts to Predictive Analytics, tools are available that address specific business needs, from delivering knowledge to adding prediction to operational systems. [click] LexiQuest Mine enables users to quickly identify key concepts, and the relationships between them, within thousands of documents. Mine displays these concepts and the links between them in an easy-to-navigate, color-coded graphical map and in trend analysis charts. Mine is designed for people who want to discover, structure and anticipate. [click] LexiQuest Categorize automatically catalogues documents into a predefined taxonomy based on their content. Able to “read” and understand content, Categorize automatically and accurately places a document into its proper category. From there, the document can be sent to the right audience based on their profile or simply reside there for easy retrieval from a portal, intranet or extranet site. [click] Text Mining for Clementine, a new component of Clementine which we will see in a few minutes, unlocks knowledge contained in unstructured text data so that it can be combined with information from databases and other data sources to build better models using traditional data mining techniques.
  • The extraction process works in three parts. First, a linguistic processor reads the text and comes up with a set of concepts. These concepts are then passed to one of three different applications, depending upon the objective. One is a stand-alone concept understanding application, such as our LexiQuest Mine, which represents the concepts and illustrates their relationships. Our Clementine application uses the concepts as data, as part of a larger data mining application. Finally, categorization uses the concept information as the basis of further analysis. The final application layer can thus be used for visualization, data mining, or strict probabilistic assignment of information to known categories.
  • Another text mining example is categorization. The folders on the left represent categories of different types of incoming emails. Text mining can be used to learn which emails, depending on their content, should be placed in each category (and therefore routed appropriately and automatically). [click] In this case, an email about a problem with an ActiveX control [click] can be routed to Dev support.
  • MeSH is the National Library of Medicine's controlled vocabulary thesaurus. It consists of sets of terms naming descriptors in a hierarchical structure that permits searching at various levels of specificity. MeSH descriptors are arranged in both an alphabetic and a hierarchical structure. At the most general level of the hierarchical structure are very broad headings such as "Anatomy" or "Mental Disorders." At more narrow levels are found more specific headings such as "Ankle" and "Conduct Disorder." There are 21,973 descriptors in MeSH. There are also thousands of cross-references that assist in finding the most appropriate MeSH Heading, for example, Vitamin C see Ascorbic Acid. These entries include 23,512 printed see references and 102,346 other entry points
  • Let’s define data mining: “The process of discovering meaningful new relationships, patterns and trends by sifting through data using pattern recognition technologies as well as statistical and mathematical techniques.” Data mining means finding patterns or relationships in your data that you can use to solve your organization’s problems.
  • How does one mine data? The Cross Industry Standard Process for Data Mining (CRISP-DM) provides a framework for all data mining efforts. This process focuses on business issues, allows the user to work with and interact with the data, covers the data mining process from beginning to end, and uses the results. Its phases are: Business Understanding, where you might convert a business problem to a data mining problem; Data Understanding, where you get your first look at the data; Data Preparation, the hardest part, where you clean the data; Modeling, the neatest part, where you build predictions or models; Evaluation, where you examine performance; and Deployment, where you actually integrate the results of your data mining into your organization. Notice the arrows going around the chart and back and forth among the boxes? These arrows show that data mining is an iterative process in which the miner may step from one box to another and back.
  • Broadly speaking, data mining can serve four basic purposes: prediction, segmentation, association and outlier detection. Prediction takes a known result and attempts to combine input fields in order to best replicate this result. An example may be deciding whether someone is a good or bad credit risk, or whether someone will churn or not. Segmentation finds groups within cases. The number of groups is unknown, i.e., any number of groups is possible. An example may be the examination of customer segments, or the creation of groupings within financial data. Association methods try to develop an “if this, then that” type of analysis. An example is the finding that people who watch news programming also watch the weather network. Outlier detection is used to find atypical cases, that is, cases of unusual behaviour vis-à-vis the rest of the data. An example is fraud detection when examining claims data.
  • The ability to move to the why from the other questions: who, what, when, where, and how much. Personalization for customer knowledge: this allows marketers to craft messages aimed more at individuals than at generalized groups. Understanding events inside and outside the organization and how they relate. The ability to assess competitors’ technology and to understand the technology positions of the market.
  • Questions
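The extraction steps described in these notes and shown on slides 25 to 27 (tag each word with a part of speech, then match the tag sequence against known patterns) can be sketched in a few lines of Python. This is only a minimal illustration, not the LexiQuest extractor: the tags are copied from the tagged sentence on slide 26, while the two noun patterns and the function name `extract_concepts` are assumptions made up for the example.

```python
# Toy concept extractor: scan a tagged sentence and keep word runs whose
# tags match a known pattern. Tags follow slide 26 (V, P, N, A).
TAGGED = [
    ("Using", "V"), ("a", "P"), ("tool", "N"), ("like", "A"),
    ("LexiQuest", "N"), ("Mine", "N"), ("is", "V"), ("a", "P"),
    ("great", "A"), ("idea", "N"), ("for", "P"), ("any", "A"),
    ("organization", "N"), ("that", "P"), ("is", "V"), ("interested", "V"),
    ("in", "P"), ("maintaining", "V"), ("information", "N"), ("on", "P"),
    ("competitive", "N"), ("intelligence", "N"),
]

# Hypothetical patterns (longest first). The real extractor is said to use
# 32 patterns for English; these two are stand-ins for illustration only.
PATTERNS = [("N", "N", "N"), ("N", "N")]


def extract_concepts(tagged, patterns):
    """Return the word runs whose tag sequence equals one of the patterns."""
    concepts = []
    i = 0
    while i < len(tagged):
        for pattern in patterns:
            window = tagged[i:i + len(pattern)]
            if tuple(tag for _, tag in window) == pattern:
                concepts.append(" ".join(word for word, _ in window))
                i += len(pattern)  # skip past the matched words
                break
        else:
            i += 1  # no pattern starts here; move to the next word
    return concepts


concepts = extract_concepts(TAGGED, PATTERNS)
print(concepts)
```

Under these assumed patterns the sketch returns both `LexiQuest Mine` and `competitive intelligence`. The slide shows only the latter, so the real extractor presumably applies further rules that this toy version leaves out.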

    1. 1. An Introduction to Text Mining Tim Daciuk SPSS, Inc. Services Manager, Canada
    2. 2. Agenda <ul><li>Introductions </li></ul><ul><li>An Overview of Document Warehousing </li></ul><ul><li>Understanding Unstructured Text </li></ul><ul><li>Concept Extraction </li></ul><ul><li>Text Mining </li></ul><ul><li>Data Mining </li></ul><ul><li>Demonstration </li></ul>
    3. 3. Tim Daciuk <ul><li>Background </li></ul><ul><ul><li>Social research </li></ul></ul><ul><ul><li>Survey research </li></ul></ul><ul><li>SPSS </li></ul><ul><ul><li>25 years working with the product </li></ul></ul><ul><ul><li>12 years working with the company </li></ul></ul><ul><ul><li>5 years working with text analysis </li></ul></ul><ul><li>Prior history </li></ul><ul><ul><li>Consulting </li></ul></ul><ul><ul><li>Education </li></ul></ul>
    4. 4. <ul><li>Predictive analysis helps connect data to effective action by drawing reliable conclusions about current conditions and future events. </li></ul><ul><ul><li>— Gareth Herschel, Research Director, Gartner Group </li></ul></ul>Predictive Analytics: Defined
    5. 5. SPSS At A Glance <ul><li>Leadership </li></ul><ul><ul><li>Market leader in Predictive Analytics </li></ul></ul><ul><ul><li>Focus on online & offline customer data acquisition and analysis </li></ul></ul><ul><li>Stability </li></ul><ul><ul><li>Founded in 1968 </li></ul></ul><ul><ul><li>30+ year heritage in analytic technologies </li></ul></ul><ul><li>Proven track record </li></ul><ul><ul><li>250,000+ customers worldwide </li></ul></ul><ul><ul><li>NASDAQ: SPSS </li></ul></ul><ul><li>Analytics standard </li></ul><ul><ul><li>80% of Fortune 500 are SPSS customers </li></ul></ul><ul><ul><li>80% plus market share in Survey & Market Research sector </li></ul></ul><ul><ul><li>Ranked #1 Data Mining solution by KD Nuggets </li></ul></ul>
    6. 6. Some of Our Brands
    7. 7. Unstructured Data Management <ul><li>Text Mining is a subset of Unstructured Data Management. </li></ul><ul><li>UDM can be broken down into: </li></ul><ul><ul><li>Content and Document Management </li></ul></ul><ul><ul><li>Search and Retrieval </li></ul></ul><ul><ul><li>XML database and tools </li></ul></ul><ul><ul><li>Categorization, Classification, and Visualization </li></ul></ul>
    8. 8. 80% of Data is Unstructured <ul><li>Database notes: </li></ul><ul><ul><li>Call center transcripts </li></ul></ul><ul><ul><li>Other CRM </li></ul></ul><ul><li>Email </li></ul><ul><li>Open-ended survey responses </li></ul><ul><li>Web pages </li></ul><ul><li>NewsGroups </li></ul><ul><li>Documents themselves </li></ul><ul><li>Competitive information </li></ul>
    9. 9. Applications for Text Analysis <ul><li>Surveys </li></ul><ul><li>‘Reading’ email </li></ul><ul><li>Call centre data </li></ul><ul><li>Comment data </li></ul><ul><li>Abstracts </li></ul><ul><li>Document management </li></ul><ul><li>Corporate history </li></ul><ul><li>Thematic understanding of website </li></ul>
    10. 10. Data Warehouse vs. Document Warehouse <ul><li>Data warehouse </li></ul><ul><ul><li>Who, what, when, where, how much </li></ul></ul><ul><ul><li>Internally focused </li></ul></ul><ul><ul><li>Operational information </li></ul></ul><ul><ul><li>Rarely include external information </li></ul></ul><ul><li>Document warehouse </li></ul><ul><ul><li>Why </li></ul></ul><ul><ul><li>May not be internally focused </li></ul></ul><ul><ul><li>May contain a range of information </li></ul></ul><ul><ul><li>Often integrate external information </li></ul></ul>
    11. 11. Document Warehouse Features <ul><li>There is no single document structure or document type </li></ul><ul><li>Documents are drawn from multiple sources </li></ul><ul><li>Essential features of documents are automatically extracted and explicitly stored in the document warehouse </li></ul><ul><li>Document warehouses are designed to integrate semantically related documents </li></ul>
    12. 12. Building the Document Warehouse <ul><li>Identify Sources </li></ul><ul><li>Retrieve Document </li></ul><ul><li>Text Analysis </li></ul><ul><li>Pre-process Document </li></ul><ul><li>Compile Metadata </li></ul>
    13. 13. Predict, Impact, Deploy [Diagram labels: Business UI; Expert UI; Expert UI; NLP; Customer Data; Attitudes; Actions; Attributes; Business User; Grow; Retain; Fraud; Outcomes; Attract; Data Collection; Text; Surveys; Web; Channel; Operational Systems; Text; Concepts; Concept Maps; Clustering; Categorization; Trending; Information Extraction; Prediction]
    14. 14. The Building Blocks of Language <ul><li>Morphology </li></ul><ul><li>Syntax </li></ul><ul><li>Semantics </li></ul><ul><li>Phonology </li></ul><ul><li>Pragmatics </li></ul>
    15. 15. Morphology <ul><li>Understanding words </li></ul><ul><ul><li>Stems </li></ul></ul><ul><ul><li>Affixes </li></ul></ul><ul><ul><ul><li>Prefix </li></ul></ul></ul><ul><ul><ul><li>Suffix </li></ul></ul></ul><ul><ul><li>Inflectional elements </li></ul></ul><ul><li>Reduces complexity of analysis </li></ul><ul><li>Reduces complexity of representation </li></ul><ul><li>Supports text mining </li></ul>[Diagram: word structure with nodes Noun, Prefix, Noun, Stem, Suffix; example pieces in-, dispute, -able]
    16. 16. Syntax <ul><li>The Bank of Canada will curb inflation with higher interest rates </li></ul>[Parse tree of the sentence. Node labels: Sentence, Noun phrase, Verb phrase, Aux, Verb, Noun, Adjective, Prepositional phrase. Leaves: The Bank of Canada; will; curb; inflation; with; higher; interest rates]
    17. 17. Semantics <ul><li>The meaning of it all </li></ul><ul><li>Approaches to meaning </li></ul><ul><ul><li>Semantic networks </li></ul></ul><ul><ul><li>Deductive logic </li></ul></ul><ul><ul><li>Rule-based systems </li></ul></ul><ul><li>Useful for classification </li></ul>
    18. 18. Problems with NLP <ul><li>Limitations of Natural Language Processing </li></ul><ul><ul><li>Correctly identifying the role of noun phrases </li></ul></ul><ul><ul><li>Representing abstract concepts </li></ul></ul><ul><ul><li>Classifying synonyms </li></ul></ul><ul><ul><li>Representing the number of concepts </li></ul></ul>
    19. 19. Problems with NLP <ul><li>Limitations of technology </li></ul><ul><ul><li>Language specific designs are required </li></ul></ul><ul><ul><li>Classification speed </li></ul></ul><ul><ul><li>Classifying hybrid words and sentences </li></ul></ul>
    20. 20. Underlying Technology is Based on Linguistics <ul><li>The Linguistic Approach: </li></ul><ul><ul><li>Does not treat a document as a bag of words </li></ul></ul><ul><ul><li>Removes ambiguity by extracting structured concepts </li></ul></ul><ul><li>Concepts are the DNA of text. </li></ul>Text is unstructured, ambiguous, and language dependent.
    21. 21. From Text to Concepts Morphology Syntax Semantics Statistics Linguistic Terminology Extractor <ul><li>Accurate </li></ul><ul><ul><li>Compound words </li></ul></ul><ul><ul><li>Proper nouns </li></ul></ul><ul><ul><li>Figures </li></ul></ul><ul><ul><li>Named entities </li></ul></ul><ul><ul><li>Domain specifics </li></ul></ul><ul><li>Scalable </li></ul><ul><ul><li>Speed: 1GB/hour </li></ul></ul><ul><ul><li>Multiple formats: PDF, MS Office, text… </li></ul></ul><ul><ul><li>Multiple languages: English, French, German, Spanish, Italian, Dutch, Japanese </li></ul></ul><ul><li>Customizable </li></ul><ul><ul><li>SPSS dictionaries </li></ul></ul><ul><ul><li>User dictionaries </li></ul></ul><ul><ul><li>Extraction rules </li></ul></ul><ul><ul><li>Extraction patterns </li></ul></ul><ul><li>Discovery-Oriented </li></ul><ul><ul><li>Known terms </li></ul></ul><ul><ul><li>Unknown terms </li></ul></ul><ul><ul><li>New terms </li></ul></ul><ul><ul><li>Trends </li></ul></ul><ul><li>Examples shown on the slide </li></ul><ul><ul><li>Inserm; merck & co … </li></ul></ul><ul><ul><li>tnp-470; glut-4 … </li></ul></ul><ul><ul><li>factor receptor; Inhibitory effect; </li></ul></ul><ul><ul><li>D. John Paganoni, .. </li></ul></ul><ul><ul><li>Positive/Negative opinion… </li></ul></ul><ul><ul><li>London, Paris… </li></ul></ul><ul><ul><li>Names, Orgs… </li></ul></ul><ul><ul><li>MeSH, genes... </li></ul></ul><ul><ul><li>Predicates </li></ul></ul><ul><ul><li>Synonyms, stop words.. </li></ul></ul>
    22. 22. From Concepts to Predictive Analytics Components Linguistic Terminology Extractor <ul><li>LexiQuest Mine: Discover concepts, relationships and trends </li></ul><ul><li>LexiQuest Categorize: Understand documents and assign in pre-defined categories </li></ul><ul><li>Text Mining for Clementine: Add text fields to data mining for better prediction </li></ul>
    23. 23. Concept Extraction Engine. The extractor turns unstructured text into concepts. [Diagram labels: LexiQuest Extractor Engine; Linguistic Processor; Visualization; Probabilities; LexiQuest Mine; Clementine; LexiQuest Categorize]
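    24. 24. Part-of-Speech Tagging <ul><li>a: adjective </li></ul><ul><li>b: adverb </li></ul><ul><li>c: preposition </li></ul><ul><li>d: determiner </li></ul><ul><li>n: noun </li></ul><ul><li>v: verb </li></ul><ul><li>o: coordination </li></ul><ul><li>p: participle </li></ul><ul><li>s: stop word </li></ul>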
    25. 25. How is a Concept Extracted? <ul><li>Step 1: Part-of-Speech Tagging </li></ul>Using/V a/P tool/N like/A LexiQuest/N Mine/N is/V a/P great/A idea/N for/P any/A organization/N that/P is/V interested/V in/P maintaining/V information/N on/P competitive/N intelligence/N
    26. 26. How is a Concept Extracted? <ul><li>Step 2: Matching to Known Patterns </li></ul><ul><li>This: </li></ul><ul><li>V P N A N N V P A N P A N P V V P V N P N N </li></ul><ul><li>Looks Most Like: </li></ul><ul><li>N C D N N </li></ul><ul><li>(32 Known patterns for English) </li></ul>
    27. 27. How is the Concept Extracted? <ul><li>The extractor looks at this sentence: </li></ul><ul><ul><li>Using a tool like LexiQuest Mine is a great idea for any organization that is interested in maintaining information on competitive intelligence. </li></ul></ul><ul><li>And extracts the concept: </li></ul><ul><ul><li>Competitive Intelligence </li></ul></ul><ul><li>Concepts are: </li></ul><ul><ul><li>Noun based </li></ul></ul><ul><ul><li>Can be longer than one word </li></ul></ul>
    28. 28. Example: Categorization
    29. 29. The Issue of Language <ul><li>NLP requires separate language understanding </li></ul><ul><li>Clementine text mining </li></ul><ul><ul><li>French </li></ul></ul><ul><ul><li>English </li></ul></ul><ul><ul><li>English/French </li></ul></ul><ul><ul><li>German </li></ul></ul><ul><ul><li>Spanish </li></ul></ul><ul><ul><li>Dutch </li></ul></ul><ul><ul><li>Japanese </li></ul></ul><ul><ul><li>Italian </li></ul></ul><ul><ul><li>MeSH (Medical Subject Headings) </li></ul></ul><ul><ul><ul><li>http:// </li></ul></ul></ul>
    30. 30. <ul><li>“The process of discovering meaningful new relationships, patterns and trends by sifting through data using pattern recognition technologies as well as statistical and mathematical techniques.” </li></ul><ul><li>- The Gartner Group. </li></ul>Data Mining Defined
    31. 31. Why data mining? <ul><li>Data Mining software generally employs modeling algorithms designed to handle non-linearities and unusual patterns in data </li></ul><ul><ul><li>As opposed to classical linear models (e.g., linear regression) that aren’t as capable </li></ul></ul><ul><li>A related issue is ‘noise’ in the data: where, for example, 2 seemingly similar sets of inputs yield a different output </li></ul>
    32. 32. <ul><li>Use the cross industry standard process for data mining (CRISP-DM) </li></ul><ul><li>Based on real-world lessons: </li></ul><ul><ul><li>Focus on business issues </li></ul></ul><ul><ul><ul><li>User-centric & interactive </li></ul></ul></ul><ul><ul><ul><li>Full process </li></ul></ul></ul><ul><ul><ul><li>Results are used </li></ul></ul></ul>A Data Mining Methodology
    33. 33. Data Mining is not… <ul><li>Keep in mind that data mining is not… </li></ul><ul><ul><li>“Blind” application of analysis/modeling algorithms </li></ul></ul><ul><ul><li>Brute-force crunching of bulk data </li></ul></ul><ul><ul><li>Black box technology </li></ul></ul><ul><ul><li>Magic </li></ul></ul>
    34. 34. Back to the Process Text Mining
    35. 35. Understanding <ul><li>Business Understanding </li></ul><ul><ul><li>Determine objective </li></ul></ul><ul><ul><li>Assess situation </li></ul></ul><ul><ul><li>Determine data mining goals </li></ul></ul><ul><ul><li>Produce project plan </li></ul></ul><ul><li>Data Understanding </li></ul><ul><ul><li>Collect initial data </li></ul></ul><ul><ul><li>Describe data </li></ul></ul><ul><ul><li>Explore data </li></ul></ul><ul><ul><li>Verify data quality </li></ul></ul>
    36. 36. Data Preparation <ul><li>Data </li></ul><ul><ul><li>Data set </li></ul></ul><ul><ul><li>Data set description </li></ul></ul><ul><ul><li>Select data </li></ul></ul><ul><ul><li>Clean data </li></ul></ul><ul><ul><li>Construct data set / Integrate data </li></ul></ul><ul><ul><li>Format data </li></ul></ul><ul><li>Text </li></ul><ul><ul><li>Concept extraction </li></ul></ul><ul><ul><li>Concept combination </li></ul></ul><ul><ul><li>Concept assessment </li></ul></ul>
    37. 37. Modeling <ul><li>Select modeling technique </li></ul><ul><ul><li>Universe of techniques </li></ul></ul><ul><ul><li>Appropriate techniques </li></ul></ul><ul><ul><ul><li>Data </li></ul></ul></ul><ul><ul><ul><li>Text </li></ul></ul></ul><ul><ul><li>Requirements </li></ul></ul><ul><ul><li>Constraints </li></ul></ul><ul><ul><li>Selected tools </li></ul></ul><ul><li>Generate test design </li></ul><ul><li>Run model(s) </li></ul><ul><li>Assess model(s) </li></ul>
    38. 38. Evaluation <ul><li>Results = Models + Findings </li></ul><ul><li>Evaluate results </li></ul><ul><li>Review process </li></ul><ul><li>Determine next steps </li></ul>
    39. 39. Deployment <ul><li>Plan deployment </li></ul><ul><li>Plan monitoring and maintenance </li></ul><ul><li>Final report </li></ul><ul><li>Project review </li></ul>
    40. 40. <ul><li>Unsupervised methods: </li></ul><ul><ul><li>Group patients by drugs and demographic information and try to find unusual patients </li></ul></ul><ul><li>Supervised methods: </li></ul><ul><ul><li>Attempt to predict amount due and find sets of cases where the amount due is very different from the predicted amount </li></ul></ul>Data Mining Approaches
    41. 41. What Does Data Mining Do? <ul><li>Data mining uses existing data to: </li></ul><ul><ul><li>Predict </li></ul></ul><ul><ul><ul><li>Category membership </li></ul></ul></ul><ul><ul><ul><li>Numeric value </li></ul></ul></ul><ul><ul><ul><li>e.g., credit risk </li></ul></ul></ul><ul><ul><li>Group </li></ul></ul><ul><ul><ul><li>Cluster (group) things together based on their characteristics </li></ul></ul></ul><ul><ul><ul><li>e.g., different types of TV viewers </li></ul></ul></ul><ul><ul><li>Associate </li></ul></ul><ul><ul><ul><li>Find events that occur together, or in a sequence </li></ul></ul></ul><ul><ul><ul><li>e.g., beer and diapers </li></ul></ul></ul><ul><ul><li>Find outliers </li></ul></ul><ul><ul><ul><li>Identify cases that don’t follow expected behavior </li></ul></ul></ul><ul><ul><ul><li>e.g., fraudulent behaviour </li></ul></ul></ul>
    42. 42. Benefits of Document Warehousing <ul><li>Richer operational business intelligence </li></ul><ul><li>Knowing your customers </li></ul><ul><li>Macroenvironmental monitoring </li></ul><ul><li>Technology assessment </li></ul>
    43. 43. Conclusions <ul><li>Text mining is </li></ul><ul><ul><li>More than word counts </li></ul></ul><ul><ul><li>Linguistically based </li></ul></ul><ul><ul><li>Concept extraction </li></ul></ul><ul><li>Data mining is </li></ul><ul><ul><li>Advanced analytics applied to datasets </li></ul></ul><ul><ul><li>A family of techniques </li></ul></ul><ul><ul><li>Supervised or unsupervised </li></ul></ul>
    44. 44. Conclusions <ul><li>Text and data mining </li></ul><ul><ul><li>Add dimensionality to the data </li></ul></ul><ul><ul><li>Allow for automation of the text analysis event </li></ul></ul><ul><ul><li>Create 360 degree view </li></ul></ul><ul><li>Applications </li></ul><ul><ul><li>Websites </li></ul></ul><ul><ul><li>Surveys </li></ul></ul><ul><ul><li>Email </li></ul></ul><ul><ul><li>Call centre </li></ul></ul><ul><ul><li>Documentation </li></ul></ul>
    45. 45. ?
    46. 46. So How Do I Get Started? <ul><li>Document Warehousing and Text Mining </li></ul><ul><ul><li>Dan Sullivan, Wiley, 2001 </li></ul></ul><ul><li>Survey of Text Mining: Clustering, Classification and Retrieval </li></ul><ul><ul><li>Michael W. Berry (ed.), Springer, 2003 </li></ul></ul><ul><li>Natural Language Processing for Online Applications: Text Retrieval, Extraction and Categorization </li></ul><ul><ul><li>P. Jackson and I. Moulinier, John Benjamins, 2002 </li></ul></ul>
    47. 47. SPSS Canada <ul><li>Tim Daciuk </li></ul><ul><li>Services Manager, Canada </li></ul><ul><li>416-410-7921 </li></ul><ul><li>800-543-6607 ext. 5156 </li></ul><ul><li>[email_address] </li></ul><ul><li>Hugh Rooney </li></ul><ul><li>SPSS Sales Canada </li></ul><ul><li>416-410-7921 </li></ul><ul><li>905-886-4322 </li></ul><ul><li>[email_address] </li></ul>
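The supervised approach on slide 40 (predict the amount due, then look for cases where the actual amount is very different from the prediction) can be sketched as follows. This is a minimal illustration under stated assumptions, not the method used in Clementine: the model is a plain least-squares line, the data are invented, and the names `fit_line`, `flag_outliers`, `visits` and `amount_due`, along with the two-standard-deviation threshold, are hypothetical choices.

```python
from statistics import mean, pstdev


def fit_line(xs, ys):
    """Ordinary least-squares fit of y = slope * x + intercept."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, y_bar - slope * x_bar


def flag_outliers(xs, ys, threshold=2.0):
    """Return indices of cases whose actual value is far from the prediction."""
    slope, intercept = fit_line(xs, ys)
    residuals = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
    spread = pstdev(residuals)  # typical size of a prediction error
    return [i for i, r in enumerate(residuals) if abs(r) > threshold * spread]


# Invented claims data: amount due grows with the number of visits,
# except for one case (index 4) that is billed far above the trend.
visits = [1, 2, 3, 4, 5, 6, 7, 8, 9]
amount_due = [10, 20, 30, 40, 150, 60, 70, 80, 90]

suspicious = flag_outliers(visits, amount_due)
print(suspicious)
```

A flagged case is only a candidate for review. Whether it is fraud, a data entry error, or a legitimate exception still has to be decided by someone who knows the business.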