N-gram models are statistical language models that can be used for word prediction. They calculate the probability of a word based on the previous N-1 words. Higher N-gram models require more data but capture context better. N-gram probabilities are estimated by counting word sequences in a large training corpus and calculating conditional probabilities. These probabilities provide information about syntactic patterns and semantic relationships in language.
Text mining - from Bayes rule to dependency parsing - Florian Leitner
The updated slides for my introductory course on text mining. The slides take you from basic Bayesian statistics over Markov chains and language models (incl. neural embeddings) to text similarity measures (Locality Sensitive Hashing, Latent Semantic Analysis, and Latent Dirichlet Allocation) and classification techniques (Naïve Bayes, MaxEnt [logistic regression]). Finally, they introduce you to shallow parsing with dynamic graphical models (HMMs, MEMMs, and CRFs) and provide you with a glimpse at dependency parsing (shift-reduce, arc-standard).
Information Extraction, Named Entity Recognition, NER, text analytics, text mining, e-discovery, unstructured data, structured data, calendaring, standard evaluation per entity, standard evaluation per token, sequence classifier, sequence labeling, word shapes, semantic analysis in language technology
Crash Course in Natural Language Processing (2016) - Vsevolod Dyomkin
This document provides an overview of natural language processing (NLP) including:
1. An introduction to NLP and its intersection with computational linguistics, computer science, and statistics.
2. A discussion of common NLP problems like tokenization, tagging, parsing, and their rule-based and statistical approaches.
3. An explanation of machine learning techniques for NLP like language models, naive Bayes classifiers, and dependency parsing.
4. Steps for developing an NLP system including translating requirements, experimentation, and going to production.
Enriching Word Vectors with Subword Information - Seonghyun Kim
1) The document proposes a new word vector model that represents words as the sum of their character n-gram vectors to better capture morphological information.
2) It tests this model on nine languages and shows it outperforms previous models on word similarity and analogy tasks.
3) Representing words as combinations of character n-grams allows the model to learn representations for out-of-vocabulary words.
This document provides an overview of natural language processing (NLP) including the linguistic basis of NLP, common NLP problems and approaches, sources of NLP data, and steps to develop an NLP system. It discusses tokenization, part-of-speech tagging, parsing, machine learning approaches like naive Bayes classification and dependency parsing, measuring word similarity, and distributional semantics. The document also provides advice on going from research to production systems and notes areas not covered like machine translation and deep learning methods.
This document provides an overview of natural language processing (NLP) including popular NLP problems, levels of NLP, the role of linguistics, sources of NLP data, tools and algorithms used in NLP, types of models including language models, and considerations for building practical NLP systems. It also describes a practical example of building a language detection system using word language models trained on Wiktionary data and evaluated using Wikipedia test data.
Yi-Shin Chen gives a presentation on natural language processing (NLP) and text mining. The presentation covers basic concepts in NLP including part-of-speech tagging, parsing, word segmentation, stemming, and vector space models. It also discusses word embedding techniques like word2vec, specifically the continuous bag-of-words and skip-gram models used to generate word vectors that capture semantic relationships. The goal is to obtain word representations that can better model linguistic knowledge compared to bag-of-words models.
A general method applicable to the search for anglicisms in Russian social network texts - Ilia Karpov
In the process of globalization, the number of English words in other languages has rapidly increased. In automatic speech recognition systems, spell checking, tagging, and other software in the field of natural language processing, loan words are not easily recognized and should be evaluated separately. In this paper we present a corpora-based approach to the automatic detection of anglicisms in Russian social network texts. The proposed method is based on the idea of simultaneous scripting, phonetic, and semantic similarity between the original Latin word and its Cyrillic analogue. We used a set of transliteration, phonetic transcription, and morphological analysis methods to find possible hypotheses, and distributional semantic models to filter them. The resulting list of borrowings, gathered from approximately 20 million LiveJournal texts, shows good overlap with a manually collected dictionary. The proposed method is fully automated and can be applied to any domain-specific area.
Full paper available at:
https://www.academia.edu/29834070/A_General_Method_Applicable_to_the_Search_for_Anglicisms_in_Russian_Social_Network_Texts
Named Entity Recognition (NER) with NLTK - Janu Jahnavi
https://www.learntek.org/blog/named-entity-recognition-ner-with-nltk/
Learntek is a global online training provider for Big Data Analytics, Hadoop, Machine Learning, Deep Learning, IoT, AI, Cloud Technology, DevOps, Digital Marketing, and other IT and management courses.
This lecture provides students with an introduction to natural language processing, with a specific focus on the basics of two applications: vector semantics and text classification.
(Lecture at the QUARTZ PhD Winter School, http://www.quartz-itn.eu/training/winter-school/, in Padua, Italy, on February 12, 2018.)
This document summarizes a paper on using simple lexical overlap features with support vector machines (SVMs) for Russian paraphrase identification. It introduces paraphrase identification and various paraphrase corpora. It then describes a knowledge-lean approach using only tokenization, lowercasing, and overlap features like union and intersection size as inputs to linear and RBF kernel SVMs. The method achieves competitive results on English, Turkish, and Russian paraphrase identification tasks.
This document provides an introduction to natural language processing (NLP) through a presentation given by Rutu Mulkar-Mehta. The presentation covers understanding natural language, common NLP tasks like text categorization and sentiment analysis, and challenges like ambiguity. It also discusses part-of-speech tagging and linguistic resources. The overall goal is to introduce attendees to the field of NLP and some of its applications.
Introduction to natural language processing - Minh Pham
This document provides an introduction to natural language processing (NLP). It discusses what NLP is, why NLP is a difficult problem, the history of NLP, fundamental NLP tasks like word segmentation, part-of-speech tagging, syntactic analysis and semantic analysis, and applications of NLP like information retrieval, question answering, text summarization and machine translation. The document aims to give readers an overview of the key concepts and challenges in the field of natural language processing.
High level introduction to text mining analytics, which covers the building blocks or most commonly used techniques of text mining along with useful additional references/links where required for background/literature and R codes to get you started.
The document discusses different approaches to generating biographies through natural language processing, including information extraction and language modeling. It describes using information extraction patterns learned from Wikipedia to extract fields like date of birth and place of birth, and bouncing between Wikipedia and Google search results to learn patterns for other fields with less structured data. It also proposes selecting and ranking sentences from search results to improve recall when information extraction may miss relevant sentences. The goal is to build biographies by combining these techniques for high precision on structured fields and better recall on more complex fields.
This document discusses practical aspects of natural language processing (NLP) work. It contrasts research work, which involves setting goals, devising algorithms, training models, and testing accuracy, with development work, which focuses on implementing algorithms as scalable APIs. The document emphasizes that obtaining data is crucial for NLP and describes sources for structured, semi-structured, and unstructured data. It recommends Lisp as a language that supports the interactivity, flexibility, and tree processing needed for NLP research and development work.
The document provides an overview of natural language processing (NLP) and its related areas. It discusses the classical view of NLP involving stages of processing like syntax, semantics, pragmatics, etc. It also discusses the statistical/machine learning view of NLP, where NLP tasks are framed as classification problems and cues from language help reduce uncertainty. Finally, it provides examples of lower-level NLP tasks like part-of-speech tagging that can be viewed as sequence labeling problems.
Context-aware Recommendation: A Quick View - YONG ZHENG
Context-aware recommendation systems take into account additional contextual information beyond just the user and item, such as time, location, and companion. There are three main approaches: contextual prefiltering splits items or users based on context; contextual modeling directly integrates context into models like matrix factorization; and CARSKit is an open source Java library for building context-aware recommender systems.
This document is a lecture on tokenization and word counts in natural language processing. It discusses concepts like types and tokens, and Zipf's law and Heaps' law, which relate the number of word types to the number of tokens in a text. The document also covers challenges in tokenization like sentence segmentation and provides examples of rule-based and machine learning approaches to tokenization. It introduces word normalization techniques like lemmatization and stemming and provides exercises for students to practice word counting, lemmatization, stemming and removing stop words from texts.
This document argues that businesses should implement tokenized electronic payment systems to protect against credit card fraud. It explains how tokenization works by replacing static credit card numbers with dynamic tokens that are meaningless to hackers. The document cites several large data breaches from 2013 that compromised millions of customer records and cost companies hundreds of millions of dollars, noting that tokenization could have prevented the theft of usable payment information. It concludes that tokenization provides strong security at low cost, and should be adopted by all businesses that accept electronic payments.
The document discusses various text preprocessing techniques used in information retrieval systems to improve how systems understand search queries. It covers techniques like tokenization, removing stop words, normalization, stemming/lemmatization, and sentence segmentation. It also discusses challenges with techniques like accent removal, capitalization, and morphological analysis, noting the importance of balancing precision and recall based on the application.
This document discusses using Python in the animation industry. It describes how one person learned Python while working in animation. It also discusses using Python in Maya for animation work. The document then talks about building an animation pipeline using Python and ideas like version control, asset management databases, and behavior logging to analyze user interactions. It considers options like CouchDB and MongoDB for managing asset data and user behavior logs. Finally, it emphasizes that version control, databases, and cloud techniques can benefit many creative roles beyond just programmers.
Tokenization: Life beyond the Information Age - Shannon Code
This document discusses the transition from the industrial age to the information age to a future "imagination age" enabled by technologies like blockchain and tokenization. It provides examples of how digital assets could be tokenized to ensure proof of ownership, identity, and provenance using cryptography and blockchain. Specific media examples are given where audio could be tokenized frame-by-frame on a blockchain to prevent tampering and verify ownership and integrity.
This document discusses text mining and information extraction. It covers the goals of information extraction including extracting structured data from unstructured text. It also discusses named entity recognition, challenges in NER, maximum entropy methods for NER, template filling using statistical and finite-state approaches, and applications of information extraction.
Token classification using Bengali Tokenizer - Jeet Das
Sujit Kumar Das presented research on token classification in Bengali language using a Bangla tokenizer. The goals were to develop a system to tokenize Bengali text and classify tokens into predefined categories. The current status shows words being split and stored in an Excel file after removing stop words from input text. Future work includes stemming, part-of-speech tagging, and token classification. The paper reviewed related work on tokenizers and discussed challenges in developing an effective Bangla tokenizer due to the richness of Bengali grammar and text structure.
The document discusses text similarity techniques for detecting duplicate or plagiarized documents. It covers topics like data retrieval methods including TF-IDF, document term matrices, vector space models and latent semantic analysis. For similarity measurements, it describes cosine similarity and second order co-occurrence pointwise mutual information. It also discusses applications in plagiarism detection and copyright violation. Finally, it outlines a prototype for calculating text similarity between files using TF-IDF and measures like cosine, Pearson and co-occurrence similarities.
Three companies are using the web and the most advanced technologies to disrupt loyalty programs. How will internet, software-as-a-service (Saas), blockchain, predictive technologies, mobile beacons, NFC, reshape the loyalty and rewards programs into the new Fintech era?
This presentation provides an introduction to tokenization. It describes what tokenization is and how it is implemented, and also compares it with encryption. Most people try to separate tokenization from encryption; however, that distinction may not really hold, as tokenization can be a form of encryption as well.
What is Payment Tokenization?
Tokenization enables banks, acquirers and merchants to offer more secure (mobile) payment services.
It is the process of replacing card data with alternate values.
The original personal account number (PAN) is disconnected and replaced with a unique identifier called a payment token.
The ‘mapping’ between the real PAN and the payment tokens is safely stored in the token vault.
With tokenization the original PAN information is removed from environments where data can be vulnerable.
Why tokenization?
Tokenization heavily reduces payment fraud by removing confidential consumer credit card data from the network.
The original data stays in the bank’s control. External systems have no access to this.
Tokens are not based on cryptography and can therefore not be traced back to the original value.
How does tokenization work?
Step 1: A payment token is generated from the PAN for one time use within a specific domain such as a merchant’s website or channel.
Tokens are sent to the token vault and stored in a PCI-compliant environment which does not allow merchants to store credit card numbers.
Step 2: Tokens are loaded on the mobile device.
Step 3: The NFC device makes a payment at a merchant’s NFC point-of-sales (POS) terminal.
Step 4: The POS terminal sends the token to the acquiring bank, which sends it to the issuing bank through the payment network.
Step 5: The issuer de-tokenizes the token to the real PAN and, if in order, approves the payment.
Step 6: After authorization from the card issuer, the token is returned to the merchant’s POS terminal.
Payment tokens perform like the original PAN for returns, sales reports, marketing analysis, recurring payments etc.
How can I issue tokens?
In order to use tokenization, a bank or merchant should become a token service provider (TSP).
A TSP manages the entire lifecycle of payment credentials including:
1. Tokenization: replaces the PAN with a payment token.
2. De-Tokenization: converts the token back to the PAN using the token vault.
3. Token vault: establishes and maintains the payment token to PAN mapping.
4. Domain management: improves protection by defining payment tokens for specific use.
5. Clearing and settlement: ad-hoc de-tokenization during clearing and settlement process.
6. Identification and verification: ensures the original PAN is legitimately used by the token requestor.
Thinking of issuing payment tokens to e.g. secure mobile payments or secure your online sales channel? Bell ID can help: www.bellid.com – info@bellid.com
Martin Cox – Global Head of Sales
The document discusses facts about the growth of big data and how data is generated from many sources. It notes that every person and object generates data, an average person now processes more data than people in history, and data is doubling every two years. It also provides examples of how companies are using big data to personalize experiences, optimize operations, and drive higher sales and conversions.
Tutorial: Context In Recommender Systems - YONG ZHENG
This document provides an overview of a tutorial on context-aware recommender systems. The tutorial will cover traditional recommendation techniques, context-aware recommendation which incorporates additional contextual information such as time and location, and context suggestion. It includes an agenda with topics, background information on recommender systems and evaluation metrics, and descriptions of techniques for context-aware recommendation including context filtering and modeling.
Locality Sensitive Hashing (LSH) is a technique for finding similar items in large datasets. It works in 3 steps:
1. Shingling converts documents to sets of n-grams (sequences of tokens). This represents documents as high-dimensional vectors.
2. MinHashing maps these high-dimensional sets to short signatures or sketches, in a way that preserves similarity according to the Jaccard coefficient. It uses random permutations to select the minimum value in each permutation.
3. LSH partitions the signature matrix into bands and hashes each band separately, so that similar signatures are likely to hash to the same buckets. Candidate pairs are those that share buckets in one or more bands, reducing the number of pairs that need to be compared directly.
The class outline covers introduction to unstructured data analysis, word-level analysis using vector space model and TF-IDF, beyond word-level analysis using natural language processing, and a text mining demonstration in R mining Twitter data. The document provides background on text mining, defines what text mining is and its tasks. It discusses features of text data and methods for acquiring texts. It also covers word-level analysis methods like vector space model and TF-IDF, and applications. It discusses limitations of word-level analysis and how natural language processing can help. Finally, it demonstrates Twitter mining in R.
The document discusses processing Boolean queries in an information retrieval system using an inverted index. It describes the steps to process a simple conjunctive query by locating terms in the dictionary, retrieving their postings lists, and intersecting the lists. More complex queries involving OR and NOT operators are also processed in a similar way. The document also discusses optimizing query processing by considering the order of accessing postings lists.
This document discusses text mining in R. It introduces important text mining concepts like tokenization, tagging, and stemming. It outlines popular R packages for text mining like tm, SnowballC, qdap, and dplyr. The document explains how to create a corpus from text files, explore and transform a corpus, create a document term matrix, and analyze term frequencies. Visualization techniques like word clouds and heatmaps are also summarized.
A Matching Approach Based on Term Clusters for eRecruitment - Kemal Can Kara
As the Internet occupies our daily lives in all aspects, finding jobs/employees online has an important role for job seekers and companies that hire. However, it is difficult for a job applicant to find the best job that matches his/her qualifications and also it is difficult for a company to find the best qualified candidates based on the company’s job advertisement. In this paper, we propose a system that extracts data from free-structured job advertisements in an ontological way in Turkish language. We describe a system that extracts data from resumés and jobs to generate a matching system that provides job applicants with the best jobs to match their qualifications. Moreover, the system also provides companies to find the best fit for their job advertisement.
This document provides an introduction to text analytics and natural language processing techniques. It discusses bag-of-words models, term frequency-inverse document frequency (TF-IDF), vector space models, distance measures, document clustering, word embeddings using word2vec, and recurrent neural networks. The agenda covers traditional "frequentist" text analysis methods as well as deep learning techniques for semantic analysis. Hands-on examples in Python are provided to illustrate document clustering, creating word embeddings, and generating text with recurrent neural networks.
02 Text Operations.pdf - beshahashenafe20
The document discusses five key text operations for information retrieval: 1) lexical analysis, 2) stop word elimination, 3) stemming, 4) term selection, and 5) thesaurus construction. It describes challenges in text operations like tokenization and normalization. Specifically, it covers issues in identifying valid tokens, determining a list of stop words, and conflating word variants through stemming algorithms like affix removal. The overall goal is to preprocess text for indexing terms to improve retrieval performance.
This document discusses various techniques for feature engineering on text data, including both structured and unstructured data. It covers preprocessing techniques like tokenization, stopword removal, and stemming. It then discusses methods for feature extraction, such as bag-of-words, n-grams, TF-IDF, word embeddings, topic models like LDA. It also discusses document similarity metrics and applications of feature engineered text data to text classification. The goal is to transform unstructured text into structured feature vectors that can be used for machine learning applications.
Full-text search allows searching the full text of documents for exact matches or substrings of search terms. It examines all words in every stored document to match search criteria. A common full-text search technique uses an inverted index to map terms to their locations in documents, allowing fast searching in O(m) time where m is the length of the search query. Updating an inverted index is challenging as it is optimized for reads and requires rewriting segments on changes.
Multimedia Data Navigation and the Semantic Web (SemTech 2006) - Bradley Allen
The document describes a system for faceted navigation of multimedia content using semantic web technologies. It discusses using ontologies expressed in RDF(S) and OWL to represent metadata, BBC rush footage used as a case study, and visual facets for color, texture and combinations that were generated through MPEG-7 feature extraction and self-organizing map clustering. The system allows retrieval of clips and shots based on textual and visual facet filtering of the RDF represented multimedia data.
This document provides an overview of text mining techniques and processes for analyzing Twitter data with R. It discusses concepts like term-document matrices, text cleaning, frequent term analysis, word clouds, clustering, topic modeling, sentiment analysis and social network analysis. It then provides a step-by-step example of applying these techniques to Twitter data from an R Twitter account, including retrieving tweets, text preprocessing, building term-document matrices, and various analyses.
MEBI 591C/598 Data and Text Mining in Biomedical Informatics - butest
This document provides an overview of topics related to text mining and natural language processing (NLP) subproblems. It discusses sentence delimiters, tokenizers, part-of-speech tagging, and collocations. For each topic, it provides definitions and examples. It also lists various NLP tools and libraries that can be used, such as OpenNLP, Stanford NLP, Mallet, and biomedical text mining tools like MetaMap and cTAKES.
Information retrieval chapter 2 - Text Operations.ppt - SamuelKetema1
This document discusses text operations for information retrieval systems, including tokenization, stemming, and removing stop words. It explains that tokenization breaks text into discrete tokens or words. Stemming reduces words to their root form by removing affixes like prefixes and suffixes. Stop words, which are very common words like "the" and "of", are filtered out since they provide little meaning. The goal of these text operations is to select more meaningful index terms to represent document contents for retrieval tasks.
This document discusses cracking linguistically correct passphrases using Markov chains and n-grams. It describes implementing a passphrase cracking tool that generates passphrases based on n-gram statistics extracted from source texts. Testing was done on hashed passwords of different lengths, comparing the generated passphrases to the target entropy of English. Results showed some lists cracked a small number of hashes while others fell short of the target entropy. Overall this demonstrated an approach for modeling language to crack passphrases, but also limitations in the current implementation.
PyGotham NY 2017: Natural Language Processing from ScratchNoemi Derzsy
This document discusses using natural language processing techniques to analyze over 32,000 open datasets from NASA. It describes preprocessing text data, generating word clouds, calculating term frequency-inverse document frequency, performing topic modeling with Latent Dirichlet Allocation and K-means clustering. The topic modeling helps evaluate how accurately descriptions and keywords capture dataset topics. Connections between NASA datasets and other US government collections are also examined.
R is a free software environment for statistical analysis and graphics. This document discusses using R for text mining, including preprocessing text data through transformations like stemming, stopword removal, and part-of-speech tagging. It also demonstrates building term document matrices and classifying text with k-nearest neighbors (KNN) algorithms. Specifically, it shows classifying speeches from Obama and Romney with over 90% accuracy using KNN classification in R.
CUHK intern PPT. Machine Translation Evaluation: Methods and Tools - Lifeng (Aaron) Han
Abstract of Aaron Han’s Presentation
The main topic of this presentation is the evaluation of machine translation. With the rapid development of machine translation (MT), MT evaluation becomes more and more important for telling whether systems actually make progress. Traditional human judgments are very time-consuming and expensive. On the other hand, there are some weaknesses in the existing automatic MT evaluation metrics:
– perform well in certain language pairs but weak on others, which we call the language-bias problem;
– consider no linguistic information (leading the metrics result in low correlation with human judgments) or too many linguistic features (difficult in replicability), which we call the extremism problem;
– design incomprehensive factors (e.g. precision only).
To address the existing problems, he has developed several automatic evaluation metrics:
– Design tunable parameters to address the language-bias problem;
– Use concise linguistic features for the linguistic extremism problem;
– Design augmented factors.
The experiments on ACL-WMT corpora show that the proposed metrics yield higher correlation with human judgments. The proposed metrics have been published at top international conferences, e.g. COLING and MT SUMMIT. In fact, evaluation work is closely related to similarity measurement, so this work can be further developed in other areas, such as information retrieval, question answering, and search.
A brief introduction to some of his other research will also be given, such as Chinese named entity recognition, word segmentation, and multilingual treebanks, which have been published in the Springer LNCS and LNAI series. Suggestions and comments are much appreciated, and opportunities for further collaboration are welcome.
Similar to OUTDATED Text Mining 3/5: String Processing (20)
ViewShift: Hassle-free Dynamic Policy Enforcement for Every Data Lake - Walaa Eldin Moustafa
Dynamic policy enforcement is becoming an increasingly important topic in today’s world where data privacy and compliance is a top priority for companies, individuals, and regulators alike. In these slides, we discuss how LinkedIn implements a powerful dynamic policy enforcement engine, called ViewShift, and integrates it within its data lake. We show the query engine architecture and how catalog implementations can automatically route table resolutions to compliance-enforcing SQL views. Such views have a set of very interesting properties: (1) They are auto-generated from declarative data annotations. (2) They respect user-level consent and preferences (3) They are context-aware, encoding a different set of transformations for different use cases (4) They are portable; while the SQL logic is only implemented in one SQL dialect, it is accessible in all engines.
#SQL #Views #Privacy #Compliance #DataLake
The Ipsos - AI - Monitor 2024 Report.pdf - Social Samosa
According to Ipsos AI Monitor's 2024 report, 65% Indians said that products and services using AI have profoundly changed their daily life in the past 3-5 years.
STATATHON: Unleashing the Power of Statistics in a 48-Hour Knowledge Extravag... - sameer shah
"Join us for STATATHON, a dynamic 2-day event dedicated to exploring statistical knowledge and its real-world applications. From theory to practice, participants engage in intensive learning sessions, workshops, and challenges, fostering a deeper understanding of statistical methodologies and their significance in various fields."
Build applications with generative AI on Google Cloud - Márton Kodok
We will explore Vertex AI Model Garden powered experiences and learn more about the integration of these generative AI APIs. We will see in action how the Gemini family of generative models lets developers build and deploy AI-driven applications. Vertex AI includes a suite of foundation models, referred to as the PaLM and Gemini families of generative AI models, which come in different versions. We will cover how to use the API to: execute prompts in text and chat; cover multimodal use cases with image prompts; fine-tune and distill models to improve knowledge domains; and run function calls with foundation models to optimize them for specific tasks. At the end of the session, developers will understand how to innovate with generative AI and develop apps following current generative AI industry trends.
"Financial Odyssey: Navigating Past Performance Through Diverse Analytical Lens"sameer shah
Embark on a captivating financial journey with 'Financial Odyssey,' our hackathon project. Delve deep into the past performance of two companies as we employ an array of financial statement analysis techniques. From ratio analysis to trend analysis, uncover insights crucial for informed decision-making in the dynamic world of finance."
Codeless Generative AI Pipelines
(GenAI with Milvus)
https://ml.dssconf.pl/user.html#!/lecture/DSSML24-041a/rate
Discover the potential of real-time streaming in the context of GenAI as we delve into the intricacies of Apache NiFi and its capabilities. Learn how this tool can significantly simplify the data engineering workflow for GenAI applications, allowing you to focus on the creative aspects rather than the technical complexities. I will guide you through practical examples and use cases, showing the impact of automation on prompt building. From data ingestion to transformation and delivery, witness how Apache NiFi streamlines the entire pipeline, ensuring a smooth and hassle-free experience.
Timothy Spann
https://www.youtube.com/@FLaNK-Stack
https://medium.com/@tspann
https://www.datainmotion.dev/
milvus, unstructured data, vector database, zilliz, cloud, vectors, python, deep learning, generative ai, genai, nifi, kafka, flink, streaming, iot, edge
End-to-end pipeline agility - Berlin Buzzwords 2024 - Lars Albertsson
We describe how we achieve high change agility in data engineering by eliminating the fear of breaking downstream data pipelines through end-to-end pipeline testing, and by using schema metaprogramming to safely eliminate boilerplate involved in changes that affect whole pipelines.
A quick poll on agility in changing pipelines from end to end indicated a huge span in capabilities. For the question "How long does it take for all downstream pipelines to be adapted to an upstream change?", the median response was 6 months, but some respondents could do it in less than a day. When quantitative data engineering differences between the best and the worst are measured, the span is often 100x-1000x, sometimes even more.
A long time ago, we suffered at Spotify from fear of changing pipelines due to not knowing what the impact might be downstream. We made plans for a technical solution to test pipelines end-to-end to mitigate that fear, but the effort failed for cultural reasons. We eventually solved this challenge, but in a different context. In this presentation we will describe how we test full pipelines effectively by manipulating workflow orchestration, which enables us to make changes in pipelines without fear of breaking downstream.
Making schema changes that affect many jobs also involves a lot of toil and boilerplate. Using schema-on-read mitigates some of it, but has drawbacks since it makes it more difficult to detect errors early. We will describe how we have rejected this tradeoff by applying schema metaprogramming, eliminating boilerplate but keeping the protection of static typing, thereby further improving agility to quickly modify data pipelines without fear.
Open Source Contributions to Postgres: The Basics - POSETTE 2024 - ElizabethGarrettChri
Postgres is the most advanced open-source database in the world and it's supported by a community, not a single company. So how does this work? How does code actually get into Postgres? I recently had a patch submitted and committed and I want to share what I learned in that process. I’ll give you an overview of Postgres versions and how the underlying project codebase functions. I’ll also show you the process for submitting a patch and getting that tested and committed.
7. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Lexers and Tokenization
• Lexer: converts (“tokenizes”) strings into token sequences
‣ a.k.a.: Tokenizer, Scanner
• Tokenization strategies
‣ whitespace (“\s+”) [“PhD.”, “U.S.A.”]
‣ word (letter vs. non-letter) boundary (“\b”) [“PhD”, “.”, “U”, “.”, “S”, “.”, “A”, “.”]
‣ Unicode category-based ([“C”, “3”, “P”, “0”], [“Ph”, “D”, “.”, “U”, “.”, “S”, “.” …])
‣ linguistic ([“that”, “ʼs”]; [“do”, “nʼt”]; [“adjective-like”]; [“PhD”, “.”, “U.S.A”, “.”])
‣ whitespace/newline-preserving ([“This”, “ ”, “is”, “ ”, …]) or not
• Task-oriented choice: document classification, entity detection,
linguistic analysis, …
64
This is a sentence .
This is a sentence.
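A minimal sketch of the first two strategies, plus a crude linguistic-style variant, using Python's re module; the patterns and the sample sentence are illustrative only, not the reference tokenizers used in the course.

import re

text = "Don't tokenize U.S.A. naively, PhD."

# whitespace tokenization: split on runs of whitespace; keeps "U.S.A." in one piece
print(re.split(r"\s+", text.strip()))
# ["Don't", 'tokenize', 'U.S.A.', 'naively,', 'PhD.']

# word-boundary tokenization: runs of word characters vs. single punctuation marks
print(re.findall(r"\w+|[^\w\s]", text))
# ['Don', "'", 't', 'tokenize', 'U', '.', 'S', '.', 'A', '.', 'naively', ',', 'PhD', '.']

# a crude "linguistic" tokenizer: keeps word-internal periods and splits off contractions
print(re.findall(r"\w+(?:\.\w+)*\.?|'\w+|[^\w\s]", text))
# ['Don', "'t", 'tokenize', 'U.S.A.', 'naively', ',', 'PhD.']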
8. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
String Matching with
Regular Expressions
‣ /^[^@]{1,64}@\w(?:[\w.-]{0,254}\w)?\.\w{2,}$/
• (provided \w has Unicode support)
• http://ex-parrot.com/~pdw/Mail-RFC822-Address.html
!
‣ Wild card: . (a dot → matches any character)
‣ Groupings: [A-Z0-9]; Negations: [^a-z]; Captures (save_this)
‣ Markers: ^ “start of string/line”; $ “end of string/line”; [\b “word boundary”]
‣ Pattern repetitions: * “zero or more”; *? “zero or more, non-greedy”;
+ “one or more”; +? “one or more, non-greedy”; ? “zero or one”
‣ Example: \b([a-z]+)\b.+$ captures lower-case words except at the EOS
65
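To make the anchors, captures, and greedy vs. non-greedy repetition concrete, here is a small hedged example with Python's re module; the sample sentence and patterns are invented for illustration.

import re

text = "Regular expressions are a compact way to match patterns."

# \b marks word boundaries; [a-z]+ matches runs of lower-case letters
print(re.findall(r"\b[a-z]+\b", text))
# ['expressions', 'are', 'a', 'compact', 'way', 'to', 'match', 'patterns']

# greedy vs. non-greedy repetition
print(re.search(r"R.+s", text).group())    # runs to the LAST 's' in the string
print(re.search(r"R.+?s", text).group())   # stops at the FIRST possible 's'

# ^ and $ anchor the match to the start/end of the string (or line)
assert re.search(r"^Regular", text) and re.search(r"patterns\.$", text)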
9. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
String Matching with
Finite State Automata 1/2
• Matching single words or patterns:
‣ Regular Expressions are “translated” to finite-state automata
‣ Automata can be deterministic (DFA, O(n)) or non-deterministic (NFA, O(nm))
• where n is the length of the string being scanned and m is the length of the string being matched
‣ Programming languages’ default implementations are (slower) NFAs because
certain special expressions (e.g., “look-ahead” and “look-behind”) cannot be
implemented with a DFA
‣ DFA implementations “in the wild”
• RE2 by Google: https://code.google.com/p/re2/
[C++, and the default RE engine of Golang,
with wrappers for a few other languages]
• dk.brics.automaton: http://www.brics.dk/automaton/
[Java, but very memory “hungry”]
66
[Figure: state diagrams of the automata for the expression (ab)+, with states 1-3 and transitions on a and b]
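For illustration, the (ab)+ automaton from the figure can be hand-coded as a table-driven DFA in a few lines of Python; this is only a toy sketch of the O(n) single-pass scan, not how RE2 or dk.brics.automaton are implemented.

# DFA for (ab)+: state 1 = start, state 2 = saw "a", state 3 = saw "ab" (accepting)
TRANSITIONS = {
    (1, "a"): 2,
    (2, "b"): 3,
    (3, "a"): 2,
}
ACCEPTING = {3}

def matches_ab_plus(s):
    """Scan the string once; reject as soon as a transition is undefined."""
    state = 1
    for ch in s:
        state = TRANSITIONS.get((state, ch))
        if state is None:
            return False
    return state in ACCEPTING

assert matches_ab_plus("ab") and matches_ab_plus("ababab")
assert not matches_ab_plus("") and not matches_ab_plus("aba")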
10. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
String Matching with
Finite State Automata 2/2
Dictionaries: exact string matching of word collections
Compressed Prefix Trees (Tries): PATRICIA Trie; prefixes → autocompletion functionality
Minimal (compressed) DFA (a.k.a. MADFA, DAWG, or DAFSA): a compressed dictionary
Python: pip install DAWG
https://github.com/kmike/DAWG [“kmike” being an NLTK dev.]
Daciuk et al. 2000
67
[Figure: a Trie, a PATRICIA Trie, and a MADFA over the words “top”, “tops”, “tap”, “taps”. Images source: WikiMedia Commons, Saffles & Chkno]
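A minimal sketch of a plain (uncompressed) trie over the figure's four words, with prefix-based autocompletion; a PATRICIA trie or MADFA would compress this structure, and the DAWG package mentioned above is a ready-made alternative.

def build_trie(words):
    """Nested dicts keyed by character; '$' marks the end of a complete word."""
    root = {}
    for word in words:
        node = root
        for ch in word:
            node = node.setdefault(ch, {})
        node["$"] = True
    return root

def completions(trie, prefix):
    """All dictionary words that start with the given prefix (autocompletion)."""
    node = trie
    for ch in prefix:
        if ch not in node:
            return []
        node = node[ch]
    results = []
    def collect(n, suffix):
        if "$" in n:
            results.append(prefix + suffix)
        for ch, child in n.items():
            if ch != "$":
                collect(child, suffix + ch)
    collect(node, "")
    return results

trie = build_trie(["top", "tops", "tap", "taps"])
print(sorted(completions(trie, "ta")))   # ['tap', 'taps']
print("$" in trie["t"]["o"]["p"])        # exact match for "top": True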
12. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Sentence Segmentation
• Sentences are the fundamental linguistic unit
‣ Sentence: boundary or “constraint” for linguistic phenomena
‣ collocations [“United Kingdom”, “vice president”], idioms [“drop me a line”],
phrases [PP: “of great fame”] and clauses, statements, …
• Rule/pattern-based segmentation
‣ Segment sentences if the marker is followed by an upper-case letter
• /[.!?]\s+[A-Z]/
‣ Special cases: abbreviations, digits, lower-case proper nouns (genes,
“amnesty international”, …), hyphens, quotation marks, …
• Supervised sentence boundary detection
‣ Use some Markov model or a conditional random field to identify possible
sentence segmentation tokens
‣ Requires labeled examples (segmented sentences)
69
turns out letter case is a poor manʼs rule!
maintaining a proper rule set gets messy fast
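A hedged sketch of this rule in Python, using look-around so that the split keeps the marker and the capital letter; it also shows exactly where the rule breaks on abbreviations.

import re

text = "Dr. Watson lives in the U.S.A. He met Mr. Holmes. They talked."

# split wherever a marker is followed by whitespace and an upper-case letter
print(re.split(r"(?<=[.!?])\s+(?=[A-Z])", text))
# ['Dr.', 'Watson lives in the U.S.A.', 'He met Mr.', 'Holmes.', 'They talked.']
# "Dr." and "Mr." trigger false sentence breaks: letter case is a poor man's rule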
13. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Punkt Sentence Tokenizer
(PST) 1/2
• Unsupervised Multilingual Sentence Boundary Detection
• P(●|w-1) > c
• Determines if a marker ● is used as an abbreviation marker by comparing the conditional probability that the word w-1 before ● is followed by the marker against some (high) cutoff probability c.
‣ P(●|w-1) = P(w-1, ●) ÷ P(w-1)
‣ K&S set c = 0.99
• P(w+1|w-1) > P(w+1)
• Evaluates the likelihood that w-1 and w+1 surrounding the marker ● are more commonly collocated than would be
expected by chance: ● is assumed an abbreviation marker (“not independent”) if the LHS is greater than the RHS.
• F_length(w) × F_markers(w) × F_penalty(w) ≥ c_abbr
• Evaluates if any of w’s morphology (length of w w/o marker characters, number of periods inside w (e.g., [“U.S.A”]),
penalized when w is not followed by a ●) makes it more likely that w is an abbreviation against some (low) cutoff.
• F_ortho(w); P_sent-starter(w+1|●); …
• Orthography: lower-case, upper-case, or capitalized word after a probable ● or not
• Sentence Starter: probability that w+1 is found after a ●
70
Examples: “Dr.”, “Mrs. Watson”, “U.S.A.”, “. Therefore”
14. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Punkt Sentence Tokenizer
(PST) 2/2
• Unsupervised Multilingual
Sentence Boundary Detection
‣ Kiss & Strunk, MIT Press 2006.
‣ Available from NLTK: nltk.tokenize.punkt (http://www.nltk.org/api/nltk.tokenize.html)
• PST is language agnostic
‣ Requires that the language uses the sentence segmentation marker as an abbreviation marker
‣ Otherwise, the problem PST solves is not present
• PST factors in word length
‣ Abbreviations are relatively shorter than regular words
• PST takes “internal” markers into account
‣ E.g., “U.S.A”
• Main weakness: long lists of abbreviations
‣ E.g., author lists in citations
‣ Can be fixed with a pattern-based post-processing strategy
• NB: a marker must be present
‣ E.g., chats or fora
71
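A short usage sketch, assuming NLTK is installed and its pre-trained Punkt model has been downloaded via nltk.download('punkt'); "my_corpus.txt" is a hypothetical placeholder for a large raw-text corpus of the target language.

from nltk.tokenize import sent_tokenize                 # pre-trained English Punkt model
from nltk.tokenize.punkt import PunktSentenceTokenizer  # train your own, unsupervised

print(sent_tokenize("Mrs. Watson met Dr. Holmes. They talked about the U.S.A."))

# Unsupervised training on raw, unsegmented text from the target domain/language:
raw_corpus_text = open("my_corpus.txt", encoding="utf-8").read()
tokenizer = PunktSentenceTokenizer(raw_corpus_text)     # trains on construction
print(tokenizer.tokenize("Mrs. Watson met Dr. Holmes. They talked about the U.S.A."))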
16. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
The Parts of Speech
• The Parts of Speech:
‣ noun: NN, verb: VB, adjective: JJ, adverb: RB, preposition: IN,
personal pronoun: PRP, …
‣ e.g. the full Penn Treebank PoS tagset contains 48 tags:
‣ 34 grammatical tags (i.e., “real” parts-of-speech) for words
‣ one for cardinal numbers (“CD”; i.e., a series of digits)
‣ 13 for [mathematical] “SYM” and currency “$” symbols, various types of
punctuation, as well as for opening/closing parenthesis and quotes
73
I ate the pizza with green peppers .
PRP VB DT NN IN JJ NN .
“PoS tags”
17. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
The Parts of Speech
• Automatic PoS Tagging ➡ Supervised Machine Learning
‣ Maximum Entropy Markov Models
‣ Conditional Random Fields
‣ Ensemble Methods
74
I ate the pizza with green peppers .
PRP VB DT NN IN JJ NN .
will be explained in the Text Mining #5!
(last lesson)
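For a quick impression of the task, NLTK ships a pre-trained tagger (an averaged perceptron, i.e. not one of the sequence models listed above); a minimal sketch, assuming the required NLTK data packages ('punkt', 'averaged_perceptron_tagger') have been downloaded.

import nltk

tokens = nltk.word_tokenize("I ate the pizza with green peppers.")
print(nltk.pos_tag(tokens))
# roughly: [('I', 'PRP'), ('ate', 'VBD'), ('the', 'DT'), ('pizza', 'NN'),
#           ('with', 'IN'), ('green', 'JJ'), ('peppers', 'NNS'), ('.', '.')]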
19. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Linguistic Morphology
• [Verbal] Inflections
‣ conjugation (Indo-European
languages)
‣ tense (“availability” and use varies
across languages)
‣ modality (subjunctive, conditional,
imperative)
‣ voice (active/passive)
‣ …
!
• Contractions
‣ don’t say you’re in a man’s world…
• Declensions
• on nouns, pronouns, adjectives, determiners
‣ case (nominative, accusative,
dative, ablative, genitive, …)
‣ gender (female, male, neuter)
‣ number forms (singular, plural,
dual)
‣ possessive pronouns (I➞my,
you➞your, she➞her, it➞its, … car)
‣ reflexive pronouns (for myself,
yourself, …)
‣ …
76
not a contraction:
possessive s
➡ token normalization
20. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Stemming vs Lemmatization
• Stemming
‣ produced by “stemmers”
‣ produces a word’s “stem”
!
‣ am ➞ am
‣ the going ➞ the go
‣ having ➞ hav
!
‣ fast and simple (pattern-based)
‣ Snowball; Lovins; Porter
• nltk.stem.*
• Lemmatization
‣ produced by “lemmatizers”
‣ produces a word’s “lemma”
!
‣ am ➞ be
‣ the going ➞ the going
‣ having ➞ have
!
‣ requires: a dictionary and PoS
‣ LemmaGen; morpha
• nltk.stem.wordnet
77
➡ token normalization
a.k.a. token “regularization”
(although that is technically the wrong wording)
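A small sketch contrasting the two with NLTK's Porter stemmer and WordNet lemmatizer (the WordNet data must be downloaded first; exact outputs may vary slightly across NLTK versions).

from nltk.stem import PorterStemmer
from nltk.stem.wordnet import WordNetLemmatizer   # needs nltk.download('wordnet')

stemmer = PorterStemmer()
lemmatizer = WordNetLemmatizer()

print(stemmer.stem("going"))                      # 'go': pattern-based suffix stripping
print(stemmer.stem("relational"))                 # 'relat': stems need not be real words
print(lemmatizer.lemmatize("having", pos="v"))    # 'have': dictionary lookup, needs the PoS
print(lemmatizer.lemmatize("am", pos="v"))        # 'be'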
22. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
• Finding similarly spelled or misspelled words
• Finding “similar” texts/documents
‣ N.B.: no semantics (yet…)
• Detecting entries in a gazetteer
‣ a list of domain-specific words
• Detecting entities in a dictionary
‣ a list of words, each mapped to some URI
‣ N.B.: “entities”, not “concepts” (no semantics…)
String Matching
79
23. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Jaccard Similarity
80
Comparing the characters of “words” vs “woody”:
J(A, B) = |A ∩ B| ÷ |A ∪ B| = 3 ÷ 7 ≈ 0.43
(counting the duplicated “o” in “woody”; over plain character sets it would be 3 ÷ 6 = 0.5)
[Figure: overlap of the characters in “words” and “woody”]
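A minimal sketch of both readings in Python; the multiset (bag) variant appears to be the one behind the 3/7 figure on the slide.

from collections import Counter

def jaccard(a, b):
    """Jaccard similarity over plain sets: |A ∩ B| / |A ∪ B|."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

def jaccard_bag(a, b):
    """Multiset variant: duplicated elements (the second 'o' in 'woody') count."""
    ca, cb = Counter(a), Counter(b)
    return sum((ca & cb).values()) / sum((ca | cb).values())

print(jaccard("words", "woody"))      # 3/6 = 0.5
print(jaccard_bag("words", "woody"))  # 3/7 ≈ 0.43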
24. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
String Similarity
• Edit Distance Measures
‣ Hamming Distance [1950]
‣ Levenshtein-Damerau Distance [1964/5]
‣ Needleman-Wunsch [1970] and
Smith-Waterman [1981] Distance
• also align the strings (“sequence alignment”)
• use dynamic programming
• Modern approaches: BLAST [1990], BWT [1994]
• Other Distance Metrics
‣ Jaro-Winkler Distance [1989/90]
• coefficient of matching characters within a dynamic
window minus transpositions
‣ Soundex (➞ homophones; spelling!)
‣ …
• Basic Operations: “indels”
‣ Insertions
• ac ➞ abc
‣ Deletions
• abc ➞ ac
!
• “Advanced” Operations
‣ Require two “indels”
‣ Substitutions
• abc ➞ aBc
‣ Transpositions
• ab ➞ ba
81
Wikipedia is gold here!
(for once…)
25. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Distance Measures
• Hamming Distance
‣ only counts substitutions
‣ requires equal string lengths
‣ karolin ⬌ kathrin = 3
‣ karlo ⬌ carol = 3
‣ karlos ⬌ carol = undef
• Levenshtein Distance
‣ counts all but transpositions
‣ karlo ⬌ carol = 3
‣ karlos ⬌ carol = 4
• Damerau-Levenshtein D.
‣ also allows transpositions
‣ has quadratic complexity
‣ karlo ⬌ carol = 2
‣ karlos ⬌ carol = 3
• Jaro Distance (a, b)
‣ calculates (m/|a| + m/|b| + (m-t)/m) / 3
‣ 1. m, the # of matching characters
‣ 2. t, the # of transpositions
‣ window: ⌊max(|a|, |b|) ÷ 2⌋ - 1 chars
‣ karlos ⬌ carol = 0.74
‣ (4 ÷ 6 + 4 ÷ 5 + 3 ÷ 4) ÷ 3
• Jaro-Winkler Distance
‣ adds a bonus for matching prefixes
82
[0,1] range
typos!
nltk.metrics.distance
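A quick check of the slide's examples with the edit_distance function from nltk.metrics.distance; the transpositions flag switches between plain Levenshtein and the Damerau variant.

from nltk.metrics.distance import edit_distance

print(edit_distance("karlo", "carol"))                        # 3 (Levenshtein)
print(edit_distance("karlo", "carol", transpositions=True))   # 2 (Damerau-Levenshtein)
print(edit_distance("karlos", "carol"))                       # 4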
26. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Gazetteer/Dictionary
Matching
• Finding all tokens that match the entries
‣ hash table lookups: constant complexity - O(1)
• Exact, single token matches
‣ regular hash table lookup (e.g., MURMUR3)
• Exact, multiple tokens
‣ prefix tree of hash tables pointing to child tables or leaves (= matches)
• Approximate, single tokens
‣ use some string metric - but do not compare all n-to-m cases…
• Approximate, multiple tokens
‣ a prefix tree of whatever we will use for single tokens…
83
[Figure: a prefix tree of lookup tables]
LSH (next)
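A hedged sketch of the exact, multi-token case: a prefix tree of hash tables (plain Python dicts keyed by token) scanned left to right over a token sequence; the gazetteer entries are invented for illustration.

def build_gazetteer(entries):
    """Nested dicts keyed by token; '$' stores the complete entry at its leaf."""
    root = {}
    for entry in entries:
        node = root
        for token in entry.split():
            node = node.setdefault(token, {})
        node["$"] = entry
    return root

def find_matches(tokens, gazetteer):
    """Return (start, end, entry) spans; each lookup is a constant-time hash access."""
    matches = []
    for start in range(len(tokens)):
        node, end = gazetteer, start
        while end < len(tokens) and tokens[end] in node:
            node = node[tokens[end]]
            end += 1
            if "$" in node:
                matches.append((start, end, node["$"]))
    return matches

gaz = build_gazetteer(["United Kingdom", "United Nations", "vice president"])
tokens = "the vice president visited the United Kingdom".split()
print(find_matches(tokens, gaz))
# [(1, 3, 'vice president'), (5, 7, 'United Kingdom')]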
27. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Locality Sensitive Hashing
(LSH) 1/2
• A hashing approach to group near neighbors.
• Map similar items into the same [hash] buckets.
• LSH “maximizes” (instead of minimizing) hash collisions.
• It is another dimensionality reduction technique.
• For documents or words, minhashing can be used.
‣ Approach from Rajaraman & Ullman, Mining of Massive Datasets, 2010
• http://infolab.stanford.edu/~ullman/mmds/ch3a.pdf
84
30. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Minhash Signatures (1/2)
87
Create an n-gram × token or k-shingle × document matrix (likely very sparse!); rows are shingle/n-gram IDs, columns are texts.
Lemma: Two texts will have the same first “true” shingle/n-gram when looking from top to bottom with a probability equal to their Jaccard (set) similarity.
Idea: Create sufficient permutations of the row (shingle/n-gram) ordering so that the Jaccard similarity can be approximated by comparing the number of coinciding vs. differing rows.
     T1  T2  T3  T4  |  h1  h2
0:    T   F   F   T  |   1   1
1:    F   F   T   F  |   2   4
2:    F   T   F   T  |   3   2
3:    T   F   T   T  |   4   0
4:    F   F   T   F  |   0   3
A family of hash functions (n = 5):
h1(x) = (x+1) % n
h2(x) = (3x+1) % n
32. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Minhash Signatures (2/2)
89
The k-shingle × document matrix S (rows are shingle/n-gram IDs, columns are texts), with the two hash functions h1(x) = (x+1) % n and h2(x) = (3x+1) % n, n = 5:
S    T1  T2  T3  T4  |  h1  h2
0:    T   F   F   T  |   1   1
1:    F   F   T   F  |   2   4
2:    F   T   F   T  |   3   2
3:    T   F   T   T  |   4   0
4:    F   F   T   F  |   0   3

init matrix M = ∞
for each shingle c,
         hash h, and
         text t:
    if S[t,c] and M[t,h] > h(c):
        M[t,h] = h(c)
perfectly map-reduce-able and embarrassingly parallel!

States of the signature matrix M while iterating over the shingle rows c (two hash rows h1, h2; columns T1-T4):
init:  h1: ∞ ∞ ∞ ∞    h2: ∞ ∞ ∞ ∞
c=0:   h1: ∞ ∞ ∞ ∞    h2: ∞ ∞ ∞ ∞
c=1:   h1: 1 ∞ ∞ 1    h2: 1 ∞ ∞ 1
c=2:   h1: 1 ∞ 2 1    h2: 1 ∞ 4 1
c=3:   h1: 1 3 2 1    h2: 1 2 4 1
c=4:   h1: 1 3 2 1    h2: 0 2 0 0
Final signature matrix M:
       h1: 1 3 0 1    h2: 0 2 0 0
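The pseudocode above translates almost line by line into Python; a sketch using the slide's toy matrix and hash functions, which reproduces the final signature matrix.

INF = float("inf")

S = [           # rows = shingles 0..4, columns = texts T1..T4 (1 = shingle present)
    [1, 0, 0, 1],
    [0, 0, 1, 0],
    [0, 1, 0, 1],
    [1, 0, 1, 1],
    [0, 0, 1, 0],
]
n = len(S)
hashes = [lambda x: (x + 1) % n, lambda x: (3 * x + 1) % n]

# M[h][t] = minimum hash value over the shingles present in text t
M = [[INF] * len(S[0]) for _ in hashes]
for c, row in enumerate(S):                 # each shingle (row of S)
    for h, hash_fn in enumerate(hashes):
        for t, present in enumerate(row):   # each text (column of S)
            if present and M[h][t] > hash_fn(c):
                M[h][t] = hash_fn(c)

print(M)   # [[1, 3, 0, 1], [0, 2, 0, 0]], the final signature matrix from the slide

With only two hash functions the Jaccard estimate from such signatures is of course very coarse; in practice one uses on the order of a hundred or more hash functions (rows), which is what the banding on the next slide then exploits.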
34. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Banded Locality Sensitive
Minhashing
91
Signature matrix (hash rows × texts), split into b bands of r hash rows each:
           T1 T2 T3 T4 T5 T6 T7 T8 …
Band 1  h:  1  0  2  1  7  1  4  5
        h:  1  2  4  1  6  5  5  6
        h:  0  5  6  0  6  4  7  9
Band 2  h:  4  0  8  8  7  6  5  7
        h:  7  7  0  8  3  8  7  3
        h:  8  9  0  7  2  4  8  2
Band 3  h:  8  5  4  0  9  8  4  7
        h:  9  4  3  9  0  8  3  9
        h:  8  5  8  0  0  6  8  0
…
Bands: b ∝ p_agreement
Hashes per band: r ∝ 1/p_agreement
p_agreement = 1 - (1 - s^r)^b   (agreement in at least one band)
s = Jaccard(A, B)
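A tiny sketch of the band-agreement probability with b = 3 bands of r = 3 rows, as in the figure above; the S-shaped curve is what lets banding filter out low-similarity pairs while keeping highly similar ones as candidates.

def p_agreement(s, r, b):
    """Probability that two texts with Jaccard similarity s agree in at least one band."""
    return 1 - (1 - s ** r) ** b

for s in (0.2, 0.5, 0.8):
    print(s, round(p_agreement(s, r=3, b=3), 3))
# prints roughly: 0.2 -> 0.024, 0.5 -> 0.33, 0.8 -> 0.884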
35. Florian Leitner <florian.leitner@upm.es> MSS/ASDM: Text Mining
Practical: Faster Spelling
Correction with LSH
92
• Using the spelling correction tutorial from yesterday and the provided
banded, minhashing-based LSH implementation, develop a spelling
corrector that is nearly as good as Peter Norvig’s, but at least twice
as fast (3x is possible!).
• Tip 1: if you are using NLTK 2.0.x (you probably are!), you might
want to copy-paste the 3.0.x sources for the
nltk.metrics.distance.edit_distance function
• The idea is to play around with the LSH parameters (threshold, size),
the parameter K for K-shingling, and the post-processing of the
matched set to pick the best solution from this “<< n subspace” (see
LSH 2/2 slide).
• Tip 2/remark: without any post-processing of matched sets, you can
achieve about 30% accuracy at a 100x speed up!
36. John Corvino, 2013
“There is a fine line between reading a message
from the text and reading one into the text.”