A First Look at TF-IDF: PDX Data Science Meetup

How to Measure Document Similarity and Build Text Classifiers: A First Look at Term Frequency-Inverse Document Frequency (TF-IDF) Representations

Text data is potentially valuable for many data science projects, but working with text is different from working with structured data. One representation of text that has worked well for many text mining and machine learning applications is the term frequency-inverse document frequency (TF-IDF) vector. In spite of the long-winded name, this method is easy to understand, performs well in many applications, and has been implemented in commonly used data science tools. This presentation will introduce TF-IDF and show examples of how to use TF-IDF for document classification and for measuring the similarity between documents. This presentation does not assume any background in text mining or natural language processing. Examples will use Python.

A First Look at TF-IDF
Dan Sullivan
Portland Data Science Group
March 2, 2017
Portland Data Science Meetup

Challenges
No obvious structure
Fully understanding language is hard
Large number of documents
Want to
Find documents based on similarity
Classify documents

Fortunately ...
Measuring similarity and
classifying documents
do not require fully understanding text

“PDX Data Science is all about data.”
Term:   about  all  data  is  pdx  science
Count:  1      1    2     1   1    1
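
The word counts for this sentence can be reproduced with a few lines of Python. This is a minimal sketch using only the standard library; the helper name `tokenize` and the lowercase, letters-only tokenization rule are assumptions, since the slides do not say how the text was split.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens."""
    return re.findall(r"[a-z]+", text.lower())

sentence = "PDX Data Science is all about data."
counts = Counter(tokenize(sentence))

# Print terms in alphabetical order, matching the layout on the slide.
for term in sorted(counts):
    print(term, counts[term])
```

Word order is discarded; only the count per term is kept.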

Corpus to Vectors
[Figure: a table of words and their counts (term frequency)]

Improvement 1: Remove Stop Words
[Figure: the words/count table with the stop words marked]
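
Stop-word removal might look like the sketch below. The stop list here is a tiny hypothetical one chosen for illustration, and the function name `term_counts` is an assumption; libraries such as NLTK and scikit-learn ship much longer lists.

```python
import re
from collections import Counter

# A tiny illustrative stop list (an assumption, not the list used in the talk).
STOP_WORDS = {"a", "about", "all", "an", "and", "is", "of", "the", "to"}

def term_counts(text, stop_words=STOP_WORDS):
    """Count word tokens in text, skipping any that appear in stop_words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in stop_words)

counts = term_counts("PDX Data Science is all about data.")
print(counts)
```

Only "pdx", "data", and "science" survive the filter, so the vector is shorter and dominated by content words.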

Improvement 2: N-grams
[Figure: the words/count table, where “computer”, “science” is also counted as the bigram “computer science”]
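
A minimal sketch of n-gram extraction, assuming the text has already been tokenized. The function name `ngrams` and the example tokens are illustrative.

```python
def ngrams(tokens, n):
    """Return the n-grams in a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["computer", "science", "is", "fun"]
unigrams = ngrams(tokens, 1)
bigrams = ngrams(tokens, 2)

# Using unigrams and bigrams together keeps "computer science" as one term.
print(unigrams + bigrams)
```

Adding bigrams lets the vector distinguish "computer science" from separate mentions of "computer" and "science", at the cost of a larger vocabulary.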

Example: Corpus of Machine Learning Papers
Some terms appear frequently
“Feature”
“Algorithm”
“Training”
Some less frequently
“Reinforcement”
“Non-linear”
“Convolution”

Intuition
Combinations of words are good indicators of a document's topic
Self-driving cars: “automobile”, “driver”, “radar”, “image”, “sensor”
Text mining: “corpus”, “term vector”, “syntax”
Social Network: “graph”, “communities”, “users”, “influence”

Formalizing Intuition: TF-IDF
Notation
t - a term
d - a document
D - a set of documents or corpus
N - number of documents in corpus

TF - term frequency
tf(t,d) is the number of times a term t occurs in document d
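
The extracted slide text defines TF but not IDF. A commonly used definition, written in the notation above, is given here as an assumption; the version presented in the talk may differ in details such as smoothing or the base of the logarithm.

```latex
\mathrm{idf}(t, D) = \log \frac{N}{\left|\{\, d \in D : t \in d \,\}\right|},
\qquad
\text{tf-idf}(t, d, D) = \mathrm{tf}(t, d) \cdot \mathrm{idf}(t, D)
```

The denominator is the number of documents in D that contain t, so a term that appears in every document gets an IDF of log(1) = 0.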

TF-IDF is
Large when:
There is a large count of a term in a document (large TF) and ...
A low number of documents contain the term
Small when:
The term appears in many documents in the corpus
[Figure: TF-IDF versus frequency, with regions labeled Stop Words, Common Words, and Rare Words]
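
The behavior described on this slide can be checked on a toy corpus. The sketch below assumes the common formulation tf(t,d) * log(N / df(t)), which the slide text does not spell out; the corpus and function names are hypothetical.

```python
import math
from collections import Counter

# Toy corpus (hypothetical): each document is already tokenized.
corpus = [
    ["the", "radar", "sensor", "the", "radar"],
    ["the", "corpus", "syntax"],
    ["the", "graph", "users"],
]

def tf(term, doc):
    """Number of times term occurs in doc."""
    return Counter(doc)[term]

def idf(term, docs):
    """log(N / number of documents containing term); assumes term occurs somewhere."""
    n_containing = sum(1 for doc in docs if term in doc)
    return math.log(len(docs) / n_containing)

def tf_idf(term, doc, docs):
    """Term frequency weighted by inverse document frequency."""
    return tf(term, doc) * idf(term, docs)

radar_weight = tf_idf("radar", corpus[0], corpus)  # frequent here, absent elsewhere
the_weight = tf_idf("the", corpus[0], corpus)      # appears in every document
print(radar_weight, the_weight)
```

"radar" and "the" both occur twice in the first document, but "the" occurs in all three documents, so its IDF is log(3/3) = 0 and its weight vanishes.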

Populating a Term Vector with TF-IDF
V[index(“emmy”)] = tf-idf(“emmy”,d)
V[index(“noether”)] = tf-idf(“noether”,d)
V[index(“known”)] = tf-idf(“known”,d)
V[index(“landmark”)] = tf-idf(“landmark”,d)
V[index(“contribution”)] = tf-idf(“contribution”,d)
V[index(“abstract”)] = tf-idf(“abstract”,d)
V[index(“algebra”)] = tf-idf(“algebra”,d)
V[index(“theoretical”)] = tf-idf(“theoretical”,d)
V[index(“physics”)] = tf-idf(“physics”,d)
“Emmy Noether is known for her landmark contributions to abstract algebra and theoretical physics”
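
A minimal sketch of populating the vector V, assuming stop words were removed and "contributions" was stemmed to "contribution" as the term list suggests. The second document, the alphabetical index order, and the tf-idf formula (count times log(N / document frequency)) are assumptions added for illustration.

```python
import math

# Terms from the slide's example sentence after preprocessing.
doc = ["emmy", "noether", "known", "landmark", "contribution",
       "abstract", "algebra", "theoretical", "physics"]

# Hypothetical second document, so that IDF is not the same for every term.
other = ["abstract", "algebra", "textbook"]
corpus = [doc, other]

# index(term): the position of each vocabulary term in the vector.
vocabulary = sorted({t for d in corpus for t in d})
index = {term: i for i, term in enumerate(vocabulary)}

def tf_idf(term, d, docs):
    """Raw count times log(N / document frequency)."""
    df = sum(1 for x in docs if term in x)
    return d.count(term) * math.log(len(docs) / df)

# V[index(t)] = tf-idf(t, d); terms absent from the document stay 0.
V = [0.0] * len(vocabulary)
for term in set(doc):
    V[index[term]] = tf_idf(term, doc, corpus)

print(dict(zip(vocabulary, V)))
```

Every document in the corpus gets a vector of the same length, one slot per vocabulary term, which is what makes documents comparable as points.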

Vector Space Model
[Figure: Doc 1, Doc 2, and Doc 3 plotted as points on the axes Term 1, Term 2, and Term 3]
        Term 1  Term 2  Term 3
Doc 1   0.4     0.1     0.6
Doc 2   0.3     0.5     0.0
Doc 3   0.0     0.2     0.6

Similarity Measures
[Figure: Doc 1, Doc 2, and Doc 3 on the axes Term 1, Term 2, and Term 3, illustrating Euclidean distance and cosine similarity]
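
Both measures can be computed directly for the three documents in the Vector Space Model table. This is a sketch using only the standard library; the function names are illustrative.

```python
import math

# TF-IDF vectors from the Vector Space Model table.
docs = {
    "Doc 1": [0.4, 0.1, 0.6],
    "Doc 2": [0.3, 0.5, 0.0],
    "Doc 3": [0.0, 0.2, 0.6],
}

def euclidean_distance(u, v):
    """Straight-line distance between two points; smaller means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; larger means more similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

for other in ("Doc 2", "Doc 3"):
    print(other,
          round(euclidean_distance(docs["Doc 1"], docs[other]), 3),
          round(cosine_similarity(docs["Doc 1"], docs[other]), 3))
```

For these vectors both measures agree that Doc 1 is closer to Doc 3 than to Doc 2. Cosine similarity ignores vector length, which is often useful when documents differ greatly in size.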

Classify by Vector (Point)
TF-IDF Vector
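
The slide text does not name a particular classifier. As one simple illustration of classifying a document by where its TF-IDF vector sits, the sketch below assigns the label of the most similar training vector (a nearest-neighbor rule using cosine similarity). The training vectors and labels are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

# Hypothetical labeled TF-IDF vectors over three terms.
training = [
    ([0.9, 0.1, 0.0], "self-driving cars"),
    ([0.8, 0.0, 0.2], "self-driving cars"),
    ([0.0, 0.7, 0.6], "text mining"),
    ([0.1, 0.9, 0.3], "text mining"),
]

def classify(vector, examples):
    """Return the label of the training vector most similar to `vector`."""
    _, best_label = max(examples, key=lambda ex: cosine_similarity(vector, ex[0]))
    return best_label

print(classify([0.7, 0.2, 0.1], training))
```

In practice the same vectors can be passed to any standard classifier; nearest neighbor is used here only because it needs no training step.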

NLP Tools
Python
Gensim
NLTK
spaCy & textacy
Scikit-Learn
TextBlob
R
tm
OpenNLP (R interface)
tidytext
Other
Mallet
Google Natural Language API

This document discusses text analytics techniques for summarizing and analyzing unstructured text documents, with examples from analyzing documents related to tobacco control. It covers data cleaning and standardization steps like removing punctuation, stopwords, stemming, and deduplication. It also discusses frequency analysis using document-term matrices, topic modeling using LDA, and unsupervised and supervised classification techniques. The document provides examples analyzing posts from new users versus highly active users on an online forum, identifying topics and comparing topic distributions between different user groups.

Lec 4,5

alaa223

This document discusses vector space retrieval models. It describes how documents and queries are represented as vectors in a common vector space based on terms. Terms are weighted using metrics like term frequency (TF) and inverse document frequency (IDF) to determine importance. The cosine similarity measure is used to calculate similarity between document and query vectors and rank results by relevance. While simple and effective in practice, vector space models have limitations like missing semantic and syntactic information.

Text Mining with R

Sanjay Mishra

The document provides an introduction to text mining in R using the tm package. It discusses how to import text data from various sources into a corpus, transform and preprocess text within a corpus using mappings, and manage metadata for documents and corpora. Specific transformations demonstrated include converting documents to plain text, removing whitespace, converting to lowercase, removing stopwords, and stemming. The document also discusses filtering documents based on metadata values or text content.

Ir models

Ambreen Angel

The document discusses two main types of retrieval models: Boolean models which use set theory and vector space models which use statistical and algebraic approaches. Vector space models represent documents and queries as vectors of keywords weighted by factors like term frequency and inverse document frequency. Similarity between document and query vectors is calculated using measures like the inner product or cosine similarity to retrieve and rank documents.

Term weighting

Primya Tamil

The document discusses various aspects of term weighting which is important for text retrieval systems, including term frequency, document frequency, inverse document frequency, and how they are used to calculate TF-IDF weights for terms. It also covers stoplists, stemming, and the bag-of-words model which represents text as vectors of word occurrences without considering word order. Term weighting schemes play a major role in the similarity measures used by information retrieval systems to determine document relevance.

hands on: Text Mining With R

Jahnab Kumar Deka

This document discusses text mining in R. It introduces important text mining concepts like tokenization, tagging, and stemming. It outlines popular R packages for text mining like tm, SnowballC, qdap, and dplyr. The document explains how to create a corpus from text files, explore and transform a corpus, create a document term matrix, and analyze term frequencies. Visualization techniques like word clouds and heatmaps are also summarized.

Lec1

Prafulla Kiran

This document provides an overview of the Introduction to Algorithms course, including the course modules and motivating problems. It introduces the Document Distance problem, which aims to define metrics to measure the similarity between documents based on word frequencies. It discusses an initial Python program ("docdist1.py") to calculate document distance that runs inefficiently due to quadratic time list concatenation. Profiling identifies this as the bottleneck. The solution is to use list extension, resulting in "docdist3.py". Further optimizations include using a dictionary to count word frequencies in constant time, creating "docdist4.py". The document outlines remaining opportunities like improving the word extraction and sorting algorithms.

The document discusses probabilistic information retrieval and Bayesian approaches. It introduces concepts like conditional probability, Bayes' theorem, and the probability ranking principle. It explains how probabilistic models estimate the probability of relevance between a document and query by representing them as term sets and making probabilistic assumptions. The goal is to rank documents by the probability of relevance to present the most likely relevant documents first.

RDataMining slides-text-mining-with-r

Yanchang Zhao

This document provides an overview of text mining techniques and processes for analyzing Twitter data with R. It discusses concepts like term-document matrices, text cleaning, frequent term analysis, word clouds, clustering, topic modeling, sentiment analysis and social network analysis. It then provides a step-by-step example of applying these techniques to Twitter data from an R Twitter account, including retrieving tweets, text preprocessing, building term-document matrices, and various analyses.

Author Topic Model

FReeze FRancis

Interactive Latent Dirichlet Allocation

Quentin Pleplé

The document describes an interactive Latent Dirichlet Allocation (LDA) model that allows users to provide feedback to guide the topic modeling process. It summarizes previous work using constraints to encode feedback. It then introduces an approach using variational EM for LDA that allows modifying the topic distributions between epochs based on user feedback, such as removing words from topics, deleting topics, merging topics, or splitting topics. The interactive LDA approach alternates between running LDA to convergence and applying user updates to the topic distributions.

Text classification-php-v4

Glenn De Backer

This document summarizes text classification in PHP. It discusses what text classification is, common natural language processing terminology like tokenization and stemming, Bayes' theorem and how it relates to naive Bayes classification. It provides examples of tokenizing, stemming, stopping words, and building a naive Bayes classifier in PHP using the NlpTools library. Key steps like training and testing a classifier on sample text data are demonstrated.

Natural Language Processing in R (rNLP)

fridolin.wild

Information Retrieval Models Part I

Ingo Frommholz

Image Similarity Detection at Scale Using LSH and Tensorflow with Andrey Gusev

Databricks

Learning over images and understanding the quality of content play an important role at Pinterest. This talk will present a Spark based system responsible for detecting near (and far) duplicate images. The system is used to improve the accuracy of recommendations and search results across a number of production surfaces at Pinterest. At the core of the pipeline is a Spark implementation of batch LSH (locality sensitive hashing) search capable of comparing billions of items on a daily basis. This implementation replaced an older (MR/Solr/OpenCV) system, increasing throughput by 13x and decreasing runtime by 8x. A generalized Spark Batch LSH is now used outside of the image similarity context by a number of consumers. Inverted index compression using variable byte encoding, dictionary encoding, and primitives packing are some examples of what allows this implementation to scale. The second part of this talk will detail training and integration of a Tensorflow neural net with Spark, used in the candidate selection step of the system. By directly leveraging vectorization in a Spark context we can reduce the latency of the predictions and increase the throughput. Overall, this talk will cover a scalable Spark image processing and prediction pipeline.

Text classification using Text kernels

Dev Nath

This document summarizes a presentation on using string kernels for text classification. It introduces text classification and the challenge of representing text documents as feature vectors. It then discusses how kernel methods can be used as an alternative, by mapping documents into a feature space without explicitly extracting features. Different string kernel algorithms are described that measure similarity between documents based on common subsequences of characters. The document evaluates the performance of these kernels on a text dataset and explores ways to improve efficiency, such as through kernel approximation.

Slides

butest

This document summarizes key concepts in information retrieval systems and algorithms for large data sets. It discusses the differences between information retrieval and data retrieval systems. It also describes several classic models for relevance ranking in IR, including the Boolean model and vector space model. The document outlines topics like text processing, indexing, searching, and evaluation in information retrieval systems.

Cross language information retrieval (clir)slide

Mohd Iqbal Al-farabi

The document describes cross-language information retrieval (CLIR) and summarizes an English-Chinese information retrieval system called ECIRS. ECIRS allows users to input queries in English and retrieves relevant Chinese documents through translation. It includes dictionaries, document indexes, and a Chinese search engine. Screenshots show the user interface where a user can enter an English keyword, view its Chinese translation, and see search results in Chinese.

Elements of Text Mining Part - I

Jaganadh Gopinadhan

This document discusses various text mining and natural language processing techniques in Python, including tokenization, sentence tokenization, word counting, finding word lengths, word proportions, word types and ratios, finding top N words, plotting word frequencies, lexical dispersion plots, tag clouds, word co-occurrence matrices, and stop words filtering. Code examples are provided for implementing each technique in Python.

Spatial LDA

North Carolina State University

Spatial Latent Dirichlet Allocation (SLDA) is an extension of LDA that incorporates spatial information to improve topic modeling of image data. SLDA treats each region of an image grid as a document and assigns visual words representing local image patches to the closest region. This allows it to capture co-occurrence relationships between visual words better than LDA. The paper demonstrates SLDA can outperform LDA on image classification tasks by incorporating spatial context between visual words.

Text Mining Using R

Knoldus Inc.

07 04-06

Gouranga123

This document discusses cross-language information retrieval (CLIR). It defines CLIR as retrieving information written in a language different from the user's query language. It describes approaches to CLIR such as dictionary-based query translation and pseudo-relevance feedback. Dictionary-based query translation uses bilingual dictionaries but requires disambiguation due to ambiguity. Pseudo-relevance feedback assumes top documents are relevant and selects terms from them to expand the query. The document also discusses using parallel corpora to estimate cross-lingual relevance models and evaluate CLIR using conferences like TREC and CLEF.

Cross-lingual Information Retrieval

Shadi Saleh

This document discusses cross-lingual information retrieval. It presents approaches for translating queries from other languages to the document language, including using online machine translation systems and developing a statistical machine translation system. It describes experiments on reranking translations to select the one most effective for retrieval and on adapting the reranking model to new languages. Results show the reranking approach improves over baselines and online translation systems. The document also explores document translation and query expansion techniques.

LDA on social bookmarking systems

Denis Parra Santander

EDBT 12 - Top-k interesting phrase mining in ad-hoc collections using sequenc...

Chuancong Gao

The document describes a method for mining top-k interesting phrases from ad-hoc document collections using sequence pattern indexing. It discusses existing approaches, presents the problem definition, and proposes a new approach that indexes prefix-maximal phrases ordered by position. This indexing structure allows efficient computation of top-k phrases through a merge join process combined with growing phrase patterns, enabled by optimizations like early termination and search space pruning. An evaluation compares the new approach to baseline methods.

Question Answering with Lydia

Jae Hong Kil

The document describes Lydia, a system for named entity recognition and text analysis that was adapted for question answering at TREC 2005. It summarizes Lydia's pipeline for entity recognition and relationship analysis. It then describes the question answering system, which takes questions as input, extracts targets, collects candidate answers from Lydia's database, scores and ranks candidates, and produces a single answer or list of answers. The system handles factoid, list, and other questions by analyzing the question type and scoring candidates based on features like target juxtaposition and question term matching.

Ir 1 lec 7

alaa223

The document discusses cross-language information retrieval (CLIR). It notes that while there are over 6,000 languages, 80% of websites are in English, creating a need for CLIR. CLIR aims to retrieve relevant documents in languages different from the query language. It is an important area as it allows for global information exchange and knowledge sharing, with applications in national security, access to foreign patents and medical information. CLIR draws on multiple disciplines including information retrieval, natural language processing and machine translation.

Artificial Intelligence

vini89

The document discusses various techniques for information retrieval and language modeling approaches to IR, including: - Clustering documents into similar groups to aid in retrieval - Using term frequency-inverse document frequency (TF-IDF) to measure word importance in documents - Language models that represent documents and queries as probability distributions over words - Smoothing language models to address data sparsity issues - Cluster-based scoring methods that incorporate information from query-relevant document clusters

Text Mining

Gokulks007

This document discusses text mining and summarizes some key differences between text mining and data mining. Text mining, also known as text data mining or knowledge discovery in textual databases, is the process of analyzing text to identify novel information from a collection of documents. Unlike data mining which directly analyzes structured numeric data, text mining applies natural language processing techniques to discover new information from unstructured text data. The document then provides an overview of common text retrieval methods like the Boolean model and document ranking, and discusses measures used to evaluate text retrieval systems like precision and recall.

What's hot

Probabilistic information retrieval models & systems

Selman Bozkır

RDataMining slides-text-mining-with-r

Yanchang Zhao

Author Topic Model

FReeze FRancis

Interactive Latent Dirichlet Allocation

Quentin Pleplé

Text classification-php-v4

Glenn De Backer

Natural Language Processing in R (rNLP)

fridolin.wild

Information Retrieval Models Part I

Ingo Frommholz

Image Similarity Detection at Scale Using LSH and Tensorflow with Andrey Gusev

Databricks

Text classification using Text kernels

Dev Nath

Slides

butest

Cross language information retrieval (clir)slide

Mohd Iqbal Al-farabi

Elements of Text Mining Part - I

Jaganadh Gopinadhan

Spatial LDA

North Carolina State University

Text Mining Using R

Knoldus Inc.

07 04-06

Gouranga123

Cross-lingual Information Retrieval

Shadi Saleh

LDA on social bookmarking systems

Denis Parra Santander

EDBT 12 - Top-k interesting phrase mining in ad-hoc collections using sequenc...

Chuancong Gao

Question Answering with Lydia

Jae Hong Kil

Ir 1 lec 7

alaa223

What's hot (20)

Probabilistic information retrieval models & systems

RDataMining slides-text-mining-with-r

Author Topic Model

Interactive Latent Dirichlet Allocation

Text classification-php-v4

Natural Language Processing in R (rNLP)

Information Retrieval Models Part I

Image Similarity Detection at Scale Using LSH and Tensorflow with Andrey Gusev

Text classification using Text kernels

Slides

Cross language information retrieval (clir)slide

Elements of Text Mining Part - I

Spatial LDA

Text Mining Using R

07 04-06

Cross-lingual Information Retrieval

LDA on social bookmarking systems

EDBT 12 - Top-k interesting phrase mining in ad-hoc collections using sequenc...

Question Answering with Lydia

Ir 1 lec 7

Similar to A first look at tf idf-pdx data science meetup

Artificial Intelligence

vini89

Text Mining

Gokulks007

Neural Text Embeddings for Information Retrieval (WSDM 2017)

Bhaskar Mitra

Probabilistic Retrieval Models - Sean Golliher Lecture 8 MSU CSCI 494

Sean Golliher

The document discusses probabilistic retrieval models in information retrieval. It provides an overview of older models like Boolean retrieval and vector space models. The main focus is on probabilistic models like BM25 and language models. It explains key concepts in probabilistic IR like the probability ranking principle, using Bayes' rule to estimate the probability that a document is relevant given features of the document, and estimating probabilities based on the frequencies of terms in relevant documents. The goal is to rank documents based on the probability of relevance to the query.

Text Mining Analytics 101

Manohar Swamynathan

Web and text

Institute of Technology Telkom

Text mining and natural language processing techniques can be used to extract useful information from text data. Common text mining tasks include text categorization to classify documents into predefined categories, document clustering to group similar documents without predefined categories, and keyword-based association analysis to find frequently co-occurring terms. Text classification algorithms such as support vector machines, naive Bayes classifiers, and neural networks are often applied to categorized documents. The vector space model is commonly used to represent documents as vectors of term weights.

Chapter 10 Data Mining Techniques

Houw Liong The

Text mining and natural language processing techniques can be used to extract useful information from text data. Common text mining tasks include text categorization to classify documents into predefined categories, document clustering to group similar documents without predefined categories, and keyword-based association analysis to find frequent patterns and relationships between keywords in a collection of documents. Text classification algorithms such as support vector machines, k-nearest neighbors, naive Bayes, and neural networks can be applied to categorize documents based on their contents.

Copy of 10text (2)

Uma Se

Text mining and natural language processing techniques can be used to extract useful information from text data. Common text mining tasks include text categorization to classify documents into predefined categories, document clustering to group similar documents without predefined categories, and keyword-based association analysis to find frequently co-occurring terms. Text classification algorithms such as support vector machines, naive Bayes classifiers, and neural networks are often applied to categorized documents into topics. The vector space model is commonly used to represent documents as vectors of term weights to enable similarity comparisons between documents.

Text Mining Infrastructure in R

Ashraf Uddin

R is a free software environment for statistical analysis and graphics. This document discusses using R for text mining, including preprocessing text data through transformations like stemming, stopword removal, and part-of-speech tagging. It also demonstrates building term document matrices and classifying text with k-nearest neighbors (KNN) algorithms. Specifically, it shows classifying speeches from Obama and Romney with over 90% accuracy using KNN classification in R.

Intro.ppt

WrushabhShirsat3

This document provides an introduction to information retrieval systems and their main components. It discusses how IR systems aim to find relevant documents from a large collection in response to a user's information need. The key processes involved are document indexing to represent contents, query formulation, retrieval of relevant documents, and system evaluation. Indexing involves selecting important keywords from documents and assigning them weights. Various retrieval models are described for comparing document and query representations, such as vector space and probabilistic models. The document also discusses challenges in document representation, query processing, and evaluating system effectiveness.

IRJET - Document Comparison based on TF-IDF Metric

IRJET Journal

This document discusses comparing documents based on the TF-IDF metric and cosine similarity. It begins by representing documents as vectors of terms weighted by TF-IDF. Cosine similarity is then used to measure the similarity between document vectors, with values ranging from 0 (completely dissimilar) to 1 (identical). The document demonstrates this approach on 5 sample documents from different domains, showing their pairwise cosine similarities. Comparing documents based on TF-IDF and cosine similarity allows analyzing relationships between documents in large corpora.

Web search engines

AbdusamadAbdukarimov2

This document discusses various techniques used in web search engines for indexing and ranking documents. It covers topics like inverted indices, stopword removal, stemming, relevance feedback, vector space models, and Bayesian inference networks. Web search engines prepare an index of keywords for documents and return ranked lists in response to queries by measuring similarities between query and document vectors based on term frequencies and inverse document frequencies.

DOMAIN KEYWORD EXTRACTION TECHNIQUE: A NEW WEIGHTING METHOD BASED ON FREQUENC...

cscpconf

On-line text documents rapidly increase in size with the growth of World Wide Web. To manage such a huge amount of texts,several text miningapplications came into existence. Those applications such as search engine, text categorization,summarization, and topic detection arebased on feature extraction.It is extremely time consuming and difficult task to extract keyword or feature manually.So an automated process that extracts keywords or features needs to be established.This paper proposes a new domain keyword extraction technique that includes a new weighting method on the base of the conventional TF•IDF. Term frequency-Inverse document frequency is widely used to express the documentsfeature weight, which can’t reflect the division of terms in the document, and then can’t reflect the significance degree and the difference between categories. This paper proposes a new weighting method to which a new weight is added to express the differences between domains on the base of original TF•IDF.The extracted feature can represent the content of the text better and has a better distinguished

Language Technology Enhanced Learning

telss09

This document provides an overview of using latent semantic analysis (LSA) and the R programming language for language technology enhanced learning applications. It describes using LSA to create a semantic space to compare documents and evaluate student writings. It also demonstrates clustering terms based on their semantic similarity and visualizing networks in R. Evaluation results show LSA machine scores for essay quality had a Spearman's rank correlation of 0.687 with human scores, outperforming a pure vector space model.

Information Retrieval

ShujaatZaheer3

This document provides an introduction to information retrieval systems and their key components. It discusses how search engines work by indexing documents, parsing user queries, and ranking results by relevance. The main components of a search engine are described as the crawler, indexer, query parser, ranking model, and interfaces for result display, evaluation, and feedback. The document also covers core concepts in IR like queries, documents, relevance, and information needs. It compares browsing and querying models and pull vs push information delivery.

The Geometry of Learning

fridolin.wild

A first look at tf idf-pdx data science meetup

1. A First Look at TF-IDF. Dan Sullivan, Portland Data Science Group. Portland Data Science Meetup, March 2, 2017.

2. What do we do with this?

3. Challenges: No obvious structure. Fully understanding language is hard. Large number of documents. Want to: find documents based on similarity; classify documents.

4. Fortunately ... measuring similarity and classifying documents do not require fully understanding text.

5. Counting Words

6. “PDX Data Science is all about data.” Term counts: about 1, all 1, data 2, is 1, pdx 1, science 1.
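The word counts on slide 6 can be reproduced with a few lines of Python. This is a minimal sketch, not the speaker's code: the function name `term_counts` and the regular-expression tokenizer are assumptions made for illustration.

```python
import re
from collections import Counter

def term_counts(text):
    """Lowercase the text, split it into alphanumeric tokens, and count them."""
    # Assumption: a token is any run of letters or digits; punctuation is dropped.
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tokens)

counts = term_counts("PDX Data Science is all about data.")

# Print the terms in alphabetical order, as on the slide.
for term in sorted(counts):
    print(term, counts[term])
```

Lowercasing is what makes "Data" and "data" count as the same term, which gives the count of 2 shown on the slide.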

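7. Corpus to Vectors: each document is represented by its words and their counts (term frequency).

8. Improvement 1: Remove Stop Words. Stop words are dropped from the words and counts.

9. Improvement 2: N-grams. Adjacent words such as “computer”, “science” are also counted together as “computer science”.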
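The two improvements on slides 8 and 9 can be sketched in plain Python. The stop-word list here is a tiny illustrative set, not a standard list, and the helper names (`tokenize`, `remove_stop_words`, `ngrams`) are hypothetical.

```python
import re
from collections import Counter

# Assumption: a very small, hand-picked stop-word list for illustration only.
STOP_WORDS = {"is", "all", "about", "the", "a", "of", "and"}

def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())

def remove_stop_words(tokens, stop_words=STOP_WORDS):
    """Keep only the tokens that are not in the stop-word set."""
    return [t for t in tokens if t not in stop_words]

def ngrams(tokens, n):
    """Join every run of n consecutive tokens into a single term."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = remove_stop_words(tokenize("PDX Data Science is all about data."))

# Count single words and two-word phrases (bigrams) together.
term_frequency = Counter(tokens + ngrams(tokens, 2))
print(term_frequency)
```

In this sketch the bigrams are built after stop words are removed, so words that were separated only by stop words become adjacent. Building n-grams before filtering is an equally reasonable choice.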

10. Example: Corpus of Machine Learning Papers. Some terms appear frequently: “Feature”, “Algorithm”, “Training”. Some less frequently: “Reinforcement”, “Non-linear”, “Convolution”.

11-13. Intuition. Combinations of words are good indicators of the topic of a document. Self-driving cars: “automobile”, “driver”, “radar”, “image”, “sensor”. Text mining: “corpus”, “term vector”, “syntax”. Social network: “graph”, “communities”, “users”, “influence”. Words that appear frequently across documents in a corpus are not good indicators of topic. Words that appear frequently only within documents about a single topic are good indicators of topic.

14-17. Formalizing Intuition: TF-IDF. Notation: t - a term; d - a document; D - a set of documents or corpus; N - number of documents in corpus. TF (term frequency): tf(t,d) is the number of times a term t occurs in document d. IDF (inverse document frequency): idf(t,D) = log(N / |{d in D : t in d}|). TF-IDF = tf(t,d) * idf(t,D).
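The definitions of tf, idf, and TF-IDF translate almost directly into code. The sketch below assumes the natural logarithm (the slides do not state a base) and uses a made-up three-document corpus that is already tokenized; the function names mirror the slide notation but are otherwise illustrative.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """tf(t, d): number of times the term occurs in the document."""
    return Counter(doc_tokens)[term]

def idf(term, corpus):
    """idf(t, D) = log(N / |{d in D : t in d}|), using the natural log."""
    n_containing = sum(1 for doc in corpus if term in doc)
    if n_containing == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc_tokens, corpus):
    """TF-IDF = tf(t, d) * idf(t, D)."""
    return tf(term, doc_tokens) * idf(term, corpus)

# Hypothetical corpus of three tokenized documents.
corpus = [
    ["training", "algorithm", "feature", "convolution", "convolution"],
    ["training", "algorithm", "feature"],
    ["training", "reinforcement", "feature"],
]

common = tf_idf("training", corpus[0], corpus)    # appears in every document
rare = tf_idf("convolution", corpus[0], corpus)   # appears in one document, twice
print(common, rare)
```

A term that occurs in every document gets idf = log(1) = 0, so its TF-IDF is zero no matter how often it appears, while a term concentrated in one document gets a large weight.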

18. TF-IDF is large when there is a large count of a term in a document (large TF) and a low number of documents with the term in them. It is small when the term appears in many documents in the corpus. [Figure: TF-IDF versus frequency, with regions labeled stop words, common words, and rare words.]

19. Improvement 3: TF-IDF. Each word is paired with its TF-IDF value instead of its count.

20. Populating a Term Vector with TF-IDF. Document d: “Emmy Noether is known for her landmark contributions to abstract algebra and theoretical physics”
V[index(“emmy”)] = tf-idf(“emmy”,d)
V[index(“noether”)] = tf-idf(“noether”,d)
V[index(“known”)] = tf-idf(“known”,d)
V[index(“landmark”)] = tf-idf(“landmark”,d)
V[index(“contribution”)] = tf-idf(“contribution”,d)
V[index(“abstract”)] = tf-idf(“abstract”,d)
V[index(“algebra”)] = tf-idf(“algebra”,d)
V[index(“theoretical”)] = tf-idf(“theoretical”,d)
V[index(“physics”)] = tf-idf(“physics”,d)
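One way to fill the vector V from slide 20 is to assign every vocabulary term a fixed position and write the term's TF-IDF value there. The sketch below is an assumption-laden illustration: the two extra documents are invented so that idf has something to compare against, the tokens are already stop-word filtered and reduced to base forms by hand (the slide uses "contribution" for "contributions"), and the natural log is assumed.

```python
import math
from collections import Counter

def tf_idf_vector(doc_tokens, corpus, index):
    """Return a dense vector with one TF-IDF entry per vocabulary term."""
    n_docs = len(corpus)
    vector = [0.0] * len(index)
    for term, count in Counter(doc_tokens).items():
        # Document frequency: number of corpus documents containing the term.
        df = sum(1 for doc in corpus if term in doc)
        if term in index and df > 0:
            vector[index[term]] = count * math.log(n_docs / df)
    return vector

# The slide's sentence, tokenized by hand for illustration.
d = ["emmy", "noether", "known", "landmark", "contribution",
     "abstract", "algebra", "theoretical", "physics"]
# Two hypothetical documents added so that the corpus has more than one entry.
corpus = [
    d,
    ["abstract", "algebra", "known", "group", "ring"],
    ["theoretical", "physics", "known", "quantum"],
]

# index(term): position of each term in the vector, in alphabetical order.
vocabulary = sorted({term for doc in corpus for term in doc})
index = {term: i for i, term in enumerate(vocabulary)}

V = tf_idf_vector(d, corpus, index)
print(dict(zip(vocabulary, V)))
```

Terms that do not occur in d (here "group", "ring", "quantum") keep a value of 0, which is why term vectors over a large vocabulary are mostly zeros.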

21. Vector Space Model. Each document is a point in a space with one axis per term (Term 1, Term 2, Term 3).

        Term 1  Term 2  Term 3
Doc 1   0.4     0.1     0.6
Doc 2   0.3     0.5     0.0
Doc 3   0.0     0.2     0.6

22. Similarity Measures: Euclidean distance and cosine, illustrated with Doc 1, Doc 2, and Doc 3 on the Term 1, Term 2, and Term 3 axes.
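Both similarity measures can be computed directly from the document vectors given on slide 21. The following is a small sketch using only the standard library; the function names are illustrative.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two vectors; smaller means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; larger means more similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Document vectors from the slide 21 table (Term 1, Term 2, Term 3).
doc1 = [0.4, 0.1, 0.6]
doc2 = [0.3, 0.5, 0.0]
doc3 = [0.0, 0.2, 0.6]

for name, other in (("Doc 2", doc2), ("Doc 3", doc3)):
    print("Doc 1 vs", name,
          "euclidean:", round(euclidean_distance(doc1, other), 3),
          "cosine:", round(cosine_similarity(doc1, other), 3))
```

For these vectors both measures agree that Doc 1 is closer to Doc 3 than to Doc 2. Cosine ignores vector length, so it is often preferred when documents differ a lot in size.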

23. Classify by Vector (Point): TF-IDF vector.

24. Text Classifier with Scikit Learn
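The transcript does not include the code shown on this slide, so the following is only a sketch of what a TF-IDF text classifier in scikit-learn can look like. The training sentences and labels are invented from the topic words on the Intuition slides, and the choice of a naive Bayes classifier is an assumption. Note that `TfidfVectorizer` by default uses a smoothed idf and normalizes each vector, so its weights differ slightly from the plain formula on the earlier slides.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical training data built from the topic words on the Intuition slides.
train_texts = [
    "the automobile uses radar and image sensor data to assist the driver",
    "radar and camera sensor fusion for the automobile driver",
    "a corpus is converted to term vector representations",
    "syntax and term vector features of a text corpus",
]
train_labels = ["self-driving", "self-driving", "text-mining", "text-mining"]

# Stop-word removal and bigrams correspond to Improvements 1 and 2;
# the TF-IDF weighting corresponds to Improvement 3.
model = make_pipeline(
    TfidfVectorizer(stop_words="english", ngram_range=(1, 2)),
    MultinomialNB(),
)
model.fit(train_texts, train_labels)

predictions = model.predict([
    "radar sensor for the driver",
    "term vector from a corpus",
])
print(list(predictions))
```

Putting the vectorizer and the classifier in one pipeline means new text is tokenized and weighted with the vocabulary and idf values learned from the training set.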

25. Document Similarity with Gensim

26. NLP Tools. Python: Gensim, NLTK, spaCy & textacy, Scikit-Learn, TextBlob. R: TM, OpenNLP (R interface), TidyText. Other: Mallet, Google Natural Language API.

27. Q & A
