This document provides an introduction to text mining, including defining key concepts such as structured vs. unstructured data, why text mining is useful, and some common challenges. It also outlines important text mining techniques like pre-processing text through normalization, tokenization, stemming, and removing stop words to prepare text for analysis. Text mining methods support applications such as sentiment analysis, market prediction, and customer churn prediction.
This document provides tips for conducting cost-effective online legal research on Westlaw. It discusses choosing the most appropriate pricing option (hourly vs. transactional), formulating efficient search strategies, selecting the smallest relevant database, and optimal printing methods. The document advises utilizing services like Westlaw reference attorneys, print directories, and Find by Citation to save money on research.
This document provides an overview of advanced natural language and terms and connectors searching techniques in Westlaw. It discusses how to manipulate natural language searches by adding alternative terms, excluding terms, and conducting field searches. It also covers best practices for using terms, expanders, connectors and fields to refine terms and connectors searches, including how different connectors are processed. The document aims to help users get the most out of natural language and terms and connectors searching in Westlaw.
Vectors in Search - Towards More Semantic Matching (Simon Hughes)
With the advent of deep learning and algorithms like word2vec and doc2vec, vector-based representations are increasingly being used in search to represent anything from documents to images and products. However, search engines work with documents made of tokens, not vectors, and are typically not designed for fast vector matching out of the box. In this talk, I will give an overview of how vectors can be derived from documents to produce a semantic representation of a document that can be used to implement semantic / conceptual search without hurting performance. I will then describe a few different techniques for efficiently searching vector-based representations in an inverted index, such as learning sparse representations of vectors, clustering, and learning binary vectors. Finally, I will discuss some of the pitfalls of vector-based search, and how to get the best of both worlds by combining vector-based scoring with traditional relevancy metrics such as BM25.
There are many examples of text-based documents (all in ‘electronic’ format): e-mails, corporate Web pages, customer surveys, résumés, medical records, DNA sequences, technical papers, incident reports, news stories and more. There is not enough time or patience to read them all, and some (e.g. DNA sequences) are hard to comprehend. Can we extract the most vital kernels of information? In other words, we wish to find a way to gain knowledge, in summarised form, from all that text without reading or examining it fully first.
This document provides tips for researching effectively with digital databases. It discusses accessing the Kansas Library Card and EBSCO databases, the differences between basic and advanced searching, and search techniques like using subject terms, Boolean operators, quotation marks, wildcards and truncation to refine results. Databases are recommended over general internet searches as they contain vetted sources and provide citation formatting. The document aims to teach students how to search efficiently and evaluate results.
The document discusses various retrieval approaches, including basic and advanced techniques. For basic techniques, it describes Boolean operators, phrase searching, truncation, case-sensitive searching, range searching, and stop word searching. For advanced techniques, it discusses fuzzy searching, query expansion, and searching multiple databases. It provides examples and explanations for each technique.
This document discusses using databases and SQL to store and organize text data. It explains that arrays in PHP can be used to represent text as data structures like tables and trees, but databases provide more efficient storage and retrieval. Specifically, relational databases use SQL, which allows defining schemas to represent ontologies and then querying the data through logical operations. The document introduces MySQL as an open source relational database and phpMyAdmin as a PHP interface for managing MySQL databases.
The document discusses plagiarism and methods for detecting it. It defines plagiarism as passing off another's work as one's own and lists several types, including directly copying text, paraphrasing from one or multiple sources without proper citation, and borrowing from one's own previous work. It then describes an algorithmic approach to detecting plagiarism by comparing documents and text segments at the document, paragraph, and sentence levels using thresholds and word similarity techniques. WordNet and the Lesk algorithm are also referenced for analyzing word meanings and signatures to identify copied text. The document concludes by listing members of an NLP team and mentioning a demo.
The document discusses the key aspects of thesauri including their purpose, structure, types of relationships displayed, and evaluation criteria. Specifically, it notes that a thesaurus provides a standardized vocabulary for information retrieval by displaying hierarchical (e.g. broader and narrower terms) and equivalence (e.g. synonyms) relationships between terms. It also discusses how terms are organized in a thesaurus and criteria for evaluating the effectiveness of a thesaurus.
Subject analysis involves determining what an item is about conceptually and translating that analysis into subject headings from a controlled vocabulary. Key aspects of subject analysis include:
1) Objectively analyzing the content of an item by examining elements like the title, table of contents, and illustrations to identify topics, names, time periods, and other concepts.
2) Distinguishing between the subject of an item (what it is about) and its form or genre.
3) Translating identified keywords and concepts into the preferred terms from a controlled vocabulary to allow for multiple access points and relationships between terms. Controlled vocabularies help compensate for complexity in language.
This document provides tips for troubleshooting different types of issues that may occur when conducting searches: too few articles, too many articles, mostly off-topic articles, and needing more Canadian content. Suggestions include checking spelling, using synonyms and Boolean operators, refining results by date or field, using subject-specific databases, and consulting library staff for assistance.
Text mining seeks to extract useful information from unstructured text documents. It involves preprocessing the text, identifying features, and applying techniques from data mining, machine learning and natural language processing to discover patterns. The core operations of text mining include analyzing distributions of concepts, identifying frequent concept sets and associations between concepts. Text mining systems aim to analyze document collections over time to identify trends, ephemeral relationships and anomalous patterns.
The class outline covers introduction to unstructured data analysis, word-level analysis using vector space model and TF-IDF, beyond word-level analysis using natural language processing, and a text mining demonstration in R mining Twitter data. The document provides background on text mining, defines what text mining is and its tasks. It discusses features of text data and methods for acquiring texts. It also covers word-level analysis methods like vector space model and TF-IDF, and applications. It discusses limitations of word-level analysis and how natural language processing can help. Finally, it demonstrates Twitter mining in R.
Text mining refers to extracting knowledge from unstructured text data. It is needed because most biological knowledge exists in unstructured research papers, making it difficult for scientists to manually analyze large amounts of text. Challenges include dealing with noisy, unstructured data and complex relationships between concepts. The text mining process involves preprocessing text through steps like tokenization, feature selection, and parsing to extract meaningful features before analysis can be done through classification, clustering, or other techniques. Potential applications are wide-ranging across domains like customer profiling, trend analysis, and web search.
The document discusses UX design workflows and deliverables for projects with tight deadlines and many stakeholders. It recommends prioritizing high-level design over detailed design to provide early guidance to teams. High-level design should focus on structure, flows and relevant information using placeholders and modular components. "Thumbflows" allow for early feedback before detailed design, which involves real content, images and platform-specific design. Collaborative tools like shared servers and styleguides help integrate work across teams.
Email client assignment (dinda_yulya_agustina_dindayulya)
The steps for configuring the Gmail email client in Microsoft Outlook include opening Account Settings in Outlook, clicking New to add a new email account, selecting Internet Email and filling in the user, server, and login information, setting the outgoing server to use port 26, and saving the settings.
Diversity in the Media: How the Media Sees Me (Andrea Ruiz)
Written from a robot's perspective, the document argues that the media often portrays robots in an inaccurate and misleading manner, frequently showing them as dangerous machines that will harm or replace humans.
Soumyadip Chandra is seeking a challenging position where he can contribute his skills in LiDAR data processing and 3D modeling. He has over 5 years of experience leading LiDAR projects involving feature extraction, 3D modeling, and data classification. He has expertise in software such as Microstation, Terrascan, and ArcGIS. Chandra holds an M.Tech in Geomatics from IIT Kanpur and has published papers on remote sensing and hyperspectral data analysis.
This document discusses different aspects of nonverbal communication that are important for second language acquisition. It covers kinesics, including body language and gestures; eye contact norms that vary between cultures; proxemics, or appropriate physical distances in conversations; the meaning conveyed by artifacts like clothing and jewelry; kinesthetics regarding cultural touch norms; and how olfactory dimensions of human odors are perceived differently across cultures. Understanding these nonverbal elements is fundamental to avoiding ambiguous communication between cultural groups.
Angelo Gabriel Trinidad presents his top 3 favorite games - Clash of Clans, an epic combat strategy game; Minecraft, a sandbox game where you build things or survive the night; and Roblox, similar to Minecraft where you create maps and items without limits. Images are included for each game.
The document contains a list of supported merchant sites with their image URL prefixes and merchant names. There are over 100 merchant sites listed with various categories including clothing, electronics, sporting goods, jewelry, home goods, and more.
This short document promotes creating presentations with Haiku Deck on SlideShare, encouraging the reader to get started on their own Haiku Deck presentation via a button that begins the process.
This document describes a project to design a table lamp inspired by Santiago Calatrava's Turning Torso. It explains the concepts of serial planes and volume, and describes the steps of the project, including a theoretical analysis, objectives, the materials used, and the construction process of cutting and gluing cardboard planes to create the three-dimensional form.
Presented by Andrew Duck - Audience Media (Vietnam)
This slideshow is from a presentation at the M2 Marketing & Media events in Ho Chi Minh City, Vietnam, organized by ITV-Asia.com and VietnamBusiness.TV
To see videos from the events, interviews with speakers and to get information on upcoming M2 - Marketing & Media Network events please visit VietnamBusiness.TV
This document evaluates how a student media product follows, develops, or challenges conventions of real media products.
It follows conventions by including elements like a masthead, cover lines about the featured band, and a main image of the band leader.
It develops conventions by not having the person in the main image cover the masthead, and placing the dateline and price under the barcode rather than on it.
It challenges conventions by only including one main image on the cover rather than more, not offering pull-out posters, and having a wider color scheme.
Introduction to natural language processing (NLP) (Alia Hamwi)
The document provides an introduction to natural language processing (NLP). It defines NLP as a field of artificial intelligence devoted to creating computers that can use natural language as input and output. Some key NLP applications mentioned include data analysis of user-generated content, conversational agents, translation, classification, information retrieval, and summarization. The document also discusses various linguistic levels of analysis like phonology, morphology, syntax, and semantics that involve ambiguity challenges. Common NLP tasks like part-of-speech tagging, named entity recognition, parsing, and information extraction are described. Finally, the document outlines the typical steps in an NLP pipeline including data collection, text cleaning, preprocessing, feature engineering, modeling and evaluation.
Skills and language objectives CRWE Feb 9 2020 (RJWilks)
This document provides objectives and guidance for developing critical reading and writing skills in English. It covers key concepts like critical thinking, genres, analyzing texts, and checking writing. Various writing assignments are described, including a perfect paragraph, website content, letters, manuals, reports, and essays. Guidelines are provided for structure, style, and language use for different text types. Paraphrasing, avoiding plagiarism, and overcoming writer's block are also addressed.
This document discusses using Naive Bayes classifiers for text classification with natural language processing. It describes text classification, natural language processing, and how preprocessing steps like cleaning, tokenization, and normalization are used to transform text into feature vectors for classification with algorithms like Naive Bayes. The key steps covered are data cleaning, tokenization, stopword removal, stemming/lemmatization, and representing tokens as bag-of-words feature vectors for classification.
This document provides tips for conducting effective literature searches and reading academic documents. It recommends searching a variety of sources, including books, journal articles, and websites. Keywords and subject terms should be used strategically, considering synonyms, related words, and antonyms. Databases like EBSCOhost, ProQuest Academic, and ProQuest Ebook Central should be searched using advanced search functions and limiters. Search results should be evaluated based on currency, relevance, authority, accuracy, and purpose. Effective reading strategies include having a clear purpose, skimming and scanning, and focusing on specific sections like introductions and conclusions. Taking notes should include reference details, paraphrasing and summarizing, and creating a personal glossary.
Sentiment analysis involves automatically detecting the polarity of a text, extracting the author's views on the subject, and finally classifying the text. In many research approaches, the textual data classification is done using deep learning models, owing to their ability to classify text with high accuracy and to model sequences of textual data with word dependencies throughout the sentence. One such deep learning model is the RNN (Recurrent Neural Network). To use these models, the textual data and words must be converted into numerical vectors, for which various algorithms and methods have been proposed [10]. Today's pretrained word embedding libraries, such as FastText, offer high accuracy and quality in vector representations of words; accordingly, most current systems and research approaches use these libraries to convert textual data into numerical vectors.
Engineering Intelligent NLP Applications Using Deep Learning – Part 1 (Saurabh Kaushik)
This document discusses natural language processing (NLP) and language modeling. It covers the basics of NLP including what NLP is, its common applications, and basic NLP processing steps like parsing. It also discusses word and sentence modeling in NLP, including word representations using techniques like bag-of-words, word embeddings, and language modeling approaches like n-grams, statistical modeling, and neural networks. The document focuses on introducing fundamental NLP concepts.
The document discusses text normalization, which involves segmenting and standardizing text for natural language processing. It describes tokenizing text into words and sentences, lemmatizing words into their root forms, and standardizing formats. Tokenization involves separating punctuation, normalizing word formats, and segmenting sentences. Lemmatization determines that words have the same root despite surface differences. Sentence segmentation identifies sentence boundaries, which can be ambiguous without context. Overall, text normalization prepares raw text for further natural language analysis.
The document provides an overview of the resources and services available at the Annandale Campus Library. It describes key services like circulation, reference help, and reserves. It outlines the library's collections, facilities, and equipment which are available to students. It also reviews important research skills like developing search terms, evaluating sources, and citing references. The goal is to help students effectively use the library tools and resources to complete their academic work.
Charlie Greenbacker, founder and co-organizer of the DC NLP meetup group, provides a "crash course" in Natural Language Processing techniques and applications.
This document discusses information retrieval techniques. It begins by defining information retrieval as selecting the most relevant documents from a large collection based on a query. It then discusses some key aspects of information retrieval including document representation, indexing, query representation, and ranking models. The document also covers specific techniques used in information retrieval systems like parsing documents, tokenization, removing stop words, normalization, stemming, and lemmatization.
This document provides an overview of natural language processing (NLP). It discusses several commercial applications of NLP including information retrieval, information extraction, machine translation, question answering, and processing user-generated content. It notes that major tech companies have strong NLP research labs. The document then discusses why NLP is important due to the huge amount of online data and need to process large texts. It also notes challenges for computers in understanding language due to their lack of common sense knowledge. The rest of the document outlines various issues and subfields within NLP including syntax, semantics, information extraction, information retrieval, machine translation and more. It concludes by overviewing what will be covered in the NLP course.
This document provides an overview of natural language processing (NLP). It discusses how NLP allows computers to understand human language through techniques like speech recognition, text analysis, and language generation. The document outlines the main components of NLP including natural language understanding and natural language generation. It also describes common NLP tasks like part-of-speech tagging, named entity recognition, and dependency parsing. Finally, the document explains how to build an NLP pipeline by applying these techniques in a sequential manner.
The document discusses text mining, including defining it as the extraction of information from unstructured text using computational methods. It covers topics such as structured vs unstructured data, common text mining practice areas like information retrieval and document clustering, and challenges in text mining including ambiguity in language. Pre-processing techniques for text mining are also outlined, such as normalization, tokenization, stemming and removing stop words to clean and prepare text for analysis.
Natural Language Processing (NLP).pptx (SHIBDASDUTTA)
The document discusses natural language processing (NLP), which uses technology to help computers understand human language through tasks like audio to text conversion, text processing, and responding to humans in their own language. It describes the key components of NLP as natural language understanding to analyze language and natural language generation to convert data into language. The document also outlines how to build an NLP pipeline with steps like sentence segmentation, tokenization, stemming, and named entity recognition.
This document provides an overview of the Natural Language Toolkit (NLTK), a Python library for natural language processing. It discusses NLTK's modules for common NLP tasks like tokenization, part-of-speech tagging, parsing, and classification. It also describes how NLTK can be used to analyze text corpora, frequency distributions, collocations and concordances. Key functions of NLTK include tokenizing text, accessing annotated corpora, analyzing word frequencies, part-of-speech tagging, and shallow parsing.
This document provides an overview of natural language processing (NLP). It discusses how NLP is used by major tech companies for applications like information retrieval, extraction, and machine translation. It also outlines some of the core challenges in NLP, including understanding syntax, semantics, anaphora resolution, and information extraction. The document concludes by listing some of the key topics that will be covered over the course of the NLP class, such as part-of-speech tagging, parsing, IR, question answering, and text summarization.
2. Agenda
• Defining Text Mining
• Structured vs. Unstructured Data
• Why Text Mining
• Some Text Mining Ambiguities
• Pre-processing the Text
3. Text Mining
• The discovery by computer of new, previously unknown information, by
automatically extracting information from a usually large amount of different
unstructured textual resources
Previously unknown means:
• Discovering genuinely new information
• Discovering new knowledge vs. merely finding patterns is like the difference
between a detective following clues to find the criminal vs. analysts looking at
crime statistics to assess overall trends in car theft
Unstructured means:
• Free naturally occurring text
• As opposed to HTML, XML, etc.
4. Text Mining Vs. Data Mining
• Data in data mining is a series of numbers; data for text mining is a collection of documents.
• Data mining methods see data in spreadsheet format; text mining methods see data in document format.
5. Structured vs. Unstructured Data
• Structured data
• Loadable into “spreadsheets”
• Arranged into rows and columns
• Each cell filled or could be filled
• Data mining friendly
• Unstructured data
• Microsoft Word, HTML, PDF documents, PPTs
• Usually converted into XML (semi-structured)
• Not structured into cells
• Variable record length, notes, free-form survey answers
• Text is relatively sparse, inconsistent and not uniform
• Also images, video, music etc.
6. Why Text Mining?
• Leveraging text should improve decisions and predictions
• Text mining is gaining momentum
• Sentiment analysis (Twitter, Facebook)
• Predicting stock market
• Predicting churn
• Customer influence
• Customer service and help desk
• Not to mention Watson
7. Why Is Text Mining Hard?
• Language is ambiguous
• Context is needed to clarify
• The same word can have different meanings (homographs)
• Bear (verb) – to support or carry
• Bear (noun) – a large animal
• Different words can mean the same (synonyms)
• Language is subtle
• Concept / word extraction usually results in a huge number of dimensions
• Thousands of new fields
• Each field typically has low information content (sparse)
• Misspellings, abbreviations, spelling variants
• These render search engines, SQL queries, etc. ineffective.
8. Some Text Mining Ambiguities
• Homonymy: same word, different meaning
• Mary walked along the bank of the river
• HarborBank is the richest bank in the city
• Synonymy: different words with similar or the same meaning; one word can be substituted for the other without changing the meaning
• Miss Nelson became a kind of big sister to Benjamin
• Miss Nelson became a kind of large sister to Benjamin
• Polysemy: same word or form, but different, albeit related meaning
• The bank raised its interest rates yesterday
• The store is next to the newly constructed bank
• The bank first appeared in Italy in the Renaissance
• Hyponymy: Concept hierarchy or subclass
• Animal (noun) – cat, dog
• Injury – broken leg, contusion
9. Seven Types of Text Mining
• Search and Information Retrieval – storage and retrieval of text documents, including
search engines and keyword search
• Document Clustering – Grouping and categorizing terms, snippets, paragraphs or
documents using clustering methods
• Document Classification – grouping and categorizing snippets, paragraphs or documents using data mining classification methods, based on models trained on labelled examples
• Web Mining – Data and text mining on the internet, with specific focus on the scale and interconnectedness of the web
• Information Extraction – Identification and extraction of relevant facts and relationships
from unstructured text
• Natural Language Processing – Low-level language processing and understanding tasks (e.g. part-of-speech tagging)
• Concept extraction – Grouping of words and phrases into semantically similar groups
10. Text Mining – Some Definitions
• Document – a sequence of words and punctuation, following the grammatical
rules of the language.
• Term – usually a word, but can be a word-pair or phrase
• Corpus – a collection of documents
• Lexicon – set of all unique words in corpus
11. Pre-processing the Text
• Text Normalization
• Tokenization
• Parts of Speech (POS) Tagging
• Removal of stop words
Stop words – common words that don’t add meaningful content to the document
• Stemming
• Removing suffixes and prefixes, leaving the root or stem of the word
• Term weighting
12. Text Normalization
• Case
• Make all lower case (if you don’t care about proper nouns, titles, etc.)
• Clean up transcription and typing errors
• e.g. “do n’t”, “movei”
• Correct misspelled words
• Phonetically: use fuzzy matching algorithms such as Soundex, Metaphone or string edit distance
• Dictionaries: use POS and context to make a good guess
13. Parts of Speech Tagging
• Useful for recognizing names of people, places, organizations, titles
• English language
• Minimum set includes nouns, verbs, adjectives, adverbs, prepositions and conjunctions
POS Tags from Penn Tree Bank
Tag   Description                Tag   Description               Tag   Description
CC    Coordinating conjunction   CD    Cardinal number           DT    Determiner
EX    Existential there          FW    Foreign word              IN    Preposition or subordinating conjunction
JJ    Adjective                  JJR   Adjective, comparative    JJS   Adjective, superlative
LS    List item marker           MD    Modal                     NN    Noun, singular or mass
NNS   Noun, plural               NNPS  Proper noun, plural       PDT   Predeterminer
POS   Possessive ending          PRP   Personal pronoun          PRP$  Possessive pronoun
RB    Adverb                     RBR   Adverb, comparative       RBS   Adverb, superlative
RP    Particle                   SYM   Symbol                    TO    to
UH    Interjection               VB    Verb, base form           VBD   Verb, past tense
14. Example of Tagging
• In this talk, Mr. Pole discussed how Target was using Predictive Analytics including
descriptions of using potential value models, coupon models, and yes predicting
when a woman is due
• In/IN this/DT talk/NN, Mr./NNP Pole/NNP discussed/VBD how/WRB Target/NNP
was/VBD using/VBG Predictive/NNP Analytics/NNP including/VBG
descriptions/NNS of/IN using/VBG potential/JJ value/NN models/NNS,
coupon/NN models/NNS, and yes predicting/VBG when/WRB a/DT woman/NN is
due/JJ
15. Tokenization
• Converts streams of characters into words
• Main clues (in English): Whitespace
• No single algorithm works in all cases
• Some languages do not use whitespace (Chinese, Japanese)
16. Stemming
• Normalizes / unifies variations of the same data
• ‘walking’, ‘walks’, ‘walked’ → walk
• Inflectional stemming
• Remove plurals
• Normalize verb tenses
• Remove other affixes
• Stemming to root
• Reduce word to most basic element
• More aggressive than inflectional
• ‘denormalization’ → norm
• ‘apply’, ‘applications’, ‘reapplied’ → apply
17. Common English Stop Words
• a, an, and, are, as, at, be, but, by, for, if, in, into, is, it, no, not, of, on, or, such, that, the, their, then, these, they, this, to, was, will, with
• Stop words are very common and rarely provide useful information for
information extraction and concept extraction
• Removing stop words also reduces dimensionality
18. Dictionaries and Lexicons
• Highly recommended, though building one can be very time-consuming
• Reduces the set of keywords to focus on
• Words of interest
• Dictionary words
• Increases the set of keywords to focus on
• Proper nouns
• Acronyms
• Titles
• Numbers
• Key ways to use dictionary
• Local dictionary (specialized words)
• Stop words and too frequent words
• Stemming – reduce stems to dictionary words
• Synonyms – replace synonyms with root words in the list
• Resolve abbreviations and acronyms
19. Sentiment Analysis Workflow
Web data retrieval:
• Content Retrieval
• Content Extraction
Corpus pre-processing:
• Corpus Generation
• Corpus Transformation
• Corpus Filtering
Sentiment analysis:
• Sentiment Calculation