A First Look at TF-IDF
Dan Sullivan
Portland Data Science Group
March 2, 2017
Portland Data Science Meetup
What do we do with this?
Challenges
No obvious structure
Fully understanding language is hard
Large number of documents
Want to
Find documents based on similarity
Classify documents
Fortunately ...
Measuring similarity and classifying documents do not require fully understanding the text
Counting Words
“PDX Data Science is all about data.”
about  all  data  is  pdx  science
1      1    2     1   1    1
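The counts on this slide can be reproduced with a short sketch. It assumes a simple tokenizer (lowercase, alphabetic runs only); `count_words` is an illustrative name, not something from the talk.

```python
import re
from collections import Counter

def count_words(text):
    """Lowercase the text, split it into alphabetic tokens, and count them."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

counts = count_words("PDX Data Science is all about data.")

# Print the words alphabetically with their counts, as on the slide.
for word in sorted(counts):
    print(word, counts[word])
```

Each distinct word becomes one position in the document's vector, and the count is the value stored there.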
Corpus to Vectors
[Figure: words mapped to counts (term frequency)]
Improvement 1: Remove Stop Words
[Figure: words and counts, with the stop words marked]
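A minimal sketch of stop-word removal, assuming a tiny hand-written stop list. `STOP_WORDS` and `content_tokens` are illustrative names; libraries such as NLTK and spaCy ship much longer lists.

```python
import re

# A tiny illustrative stop list (an assumption, not the list used in the talk).
STOP_WORDS = {"a", "about", "all", "an", "and", "is", "of", "the", "to"}

def content_tokens(text, stop_words=STOP_WORDS):
    """Tokenize the text and drop any token found in the stop list."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return [t for t in tokens if t not in stop_words]

tokens = content_tokens("PDX Data Science is all about data.")
print(tokens)  # ['pdx', 'data', 'science', 'data']
```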
Improvement 2: N-grams
[Figure: words and counts, with “computer”, “science” counted together as “computer science”]
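A sketch of n-gram extraction over an already tokenized document. The helper name `ngrams` and the sample tokens are assumptions for illustration.

```python
def ngrams(tokens, n):
    """Return every run of n consecutive tokens, joined with spaces."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = ["computer", "science", "is", "fun"]
bigrams = ngrams(tokens, 2)
print(bigrams)  # ['computer science', 'science is', 'is fun']
```

Counting bigrams alongside single words lets “computer science” get its own position in the vector instead of being split into two unrelated terms.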
Example: Corpus of Machine Learning Papers
Some terms appear frequently
“Feature”
“Algorithm”
“Training”
Some less frequently
“Reinforcement”
“Non-linear”
“Convolution”
Intuition
Combinations of words are good indicators of a document's topic
Self-driving cars: “automobile”, “driver”, “radar”, “image”, “sensor”
Text mining: “corpus”, “term vector”, “syntax”
Social networks: “graph”, “communities”, “users”, “influence”
Words that appear frequently across documents in a corpus are not good indicators of topic
Words that appear frequently only within documents about a single topic are good indicators of topic
Formalizing Intuition: TF-IDF
Notation
t - a term
d - a document
D - a set of documents or corpus
N - number of documents in corpus
TF - term frequency
tf(t,d) is the number of times a term t occurs in document d
IDF - inverse document frequency
idf(t,D) = log(N / |{d in D : t in d}|)
TF-IDF - the product of the two
tfidf(t,d,D) = tf(t,d) * idf(t,D)
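The definitions above translate almost line for line into code. This is a sketch under two assumptions: documents are already tokenized into lists of terms, and the logarithm is the natural log (the slide does not state a base). The three-document corpus is made up for illustration.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """tf(t, d): the number of times the term occurs in the document."""
    return Counter(doc_tokens)[term]

def idf(term, corpus):
    """idf(t, D) = log(N / number of documents that contain the term)."""
    n_containing = sum(1 for doc in corpus if term in doc)
    if n_containing == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc_tokens, corpus):
    """TF-IDF as the product tf(t, d) * idf(t, D)."""
    return tf(term, doc_tokens) * idf(term, corpus)

# Hypothetical corpus of three tokenized documents.
corpus = [
    ["feature", "algorithm", "training", "convolution", "convolution"],
    ["feature", "algorithm", "training"],
    ["feature", "training", "reinforcement"],
]

common = tf_idf("feature", corpus[0], corpus)    # in every document, so idf = log(1) = 0
rare = tf_idf("convolution", corpus[0], corpus)  # in one document: 2 * log(3)
print(common, rare)
```

A term that appears in every document gets a weight of zero no matter how often it occurs, which is the behavior the intuition slides ask for.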
TF-IDF is
Large when:
A term occurs many times in a document (large TF) and ...
Few documents in the corpus contain the term (large IDF)
Small when:
The term appears in many documents in the corpus (small IDF)
[Figure: TF-IDF plotted against frequency, with regions labeled Stop Words, Common Words, and Rare Words]
Improvement 3: TF-IDF
[Figure: words mapped to TF-IDF weights]
Populating a Term Vector with TF-IDF
V[index(“emmy”)] = tf-idf(“emmy”,d)
V[index(“noether”)] = tf-idf(“noether”,d)
V[index(“known”)] = tf-idf(“known”,d)
V[index(“landmark”)] = tf-idf(“landmark”,d)
V[index(“contribution”)] = tf-idf(“contribution”,d)
V[index(“abstract”)] = tf-idf(“abstract”,d)
V[index(“algebra”)] = tf-idf(“algebra”,d)
V[index(“theoretical”)] = tf-idf(“theoretical”,d)
V[index(“physics”)] = tf-idf(“physics”,d)
“Emmy Noether is known for her landmark contributions to abstract algebra and theoretical physics”
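The assignments on this slide can be sketched as a loop over the document's terms. The sketch assumes the sentence has already been tokenized, stripped of stop words, and normalized to the terms listed on the slide. The second document and the names `tf_idf_vector`, `index`, and `V` are hypothetical, and the log is taken as the natural log.

```python
import math
from collections import Counter

def tf_idf_vector(doc_tokens, corpus, index):
    """Return a dense vector V with V[index[t]] = tf(t, d) * idf(t, D).

    doc_tokens must be one of the documents in corpus, so every term
    has a document frequency of at least 1.
    """
    counts = Counter(doc_tokens)
    n_docs = len(corpus)
    vector = [0.0] * len(index)
    for term, count in counts.items():
        doc_freq = sum(1 for doc in corpus if term in doc)
        vector[index[term]] = count * math.log(n_docs / doc_freq)
    return vector

# The slide's document, reduced to the terms shown on the slide.
d = ["emmy", "noether", "known", "landmark", "contribution",
     "abstract", "algebra", "theoretical", "physics"]
# A made-up second document so that some terms are shared across the corpus.
other = ["theoretical", "physics", "known", "hard"]
corpus = [d, other]

# index(term): one fixed position per distinct term in the corpus.
vocabulary = sorted({term for doc in corpus for term in doc})
index = {term: i for i, term in enumerate(vocabulary)}

V = tf_idf_vector(d, corpus, index)
print(V[index["emmy"]], V[index["physics"]])
```

Terms unique to the document, such as “emmy”, get a positive weight, while terms shared by every document in this toy corpus, such as “physics”, get zero.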
Vector Space Model
[Figure: Doc 1, Doc 2, and Doc 3 plotted as points on axes Term 1, Term 2, and Term 3]
       Term 1  Term 2  Term 3
Doc 1  0.4     0.1     0.6
Doc 2  0.3     0.5     0.0
Doc 3  0.0     0.2     0.6
Similarity Measures
[Figure: Doc 1, Doc 2, and Doc 3 on axes Term 1, Term 2, and Term 3, annotated with Euclidean distance and cosine]
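Both measures can be sketched in a few lines, using the three document vectors from the Vector Space Model slide. The function names are illustrative.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two points; smaller means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between two nonzero vectors; larger means more similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Term 1, Term 2, Term 3 weights from the Vector Space Model slide.
doc1 = [0.4, 0.1, 0.6]
doc2 = [0.3, 0.5, 0.0]
doc3 = [0.0, 0.2, 0.6]

for name, other in [("Doc 2", doc2), ("Doc 3", doc3)]:
    print(name,
          round(euclidean_distance(doc1, other), 3),
          round(cosine_similarity(doc1, other), 3))
```

With these values both measures agree that Doc 1 is closer to Doc 3 than to Doc 2. Cosine ignores vector length, which can matter when documents differ a lot in size.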
Classify by Vector (Point)
TF-IDF Vector
Text Classifier with Scikit Learn
Document Similarity with Gensim
NLP Tools
Python
Gensim
NLTK
spaCy & textacy
Scikit-Learn
TextBlob
R
TM
OpenNLP (R interface)
TidyText
Other
Mallet
Google Natural Language API
Q & A

A first look at tf idf-pdx data science meetup

  • 1. A First Look at TF-IDF. Dan Sullivan, Portland Data Science Group. Portland Data Science Meetup, March 2, 2017
  • 2. What do we do with this?
  • 3. Challenges: no obvious structure; fully understanding language is hard; large number of documents. Want to: find documents based on similarity; classify documents
  • 4. Fortunately ... measuring similarity and classifying documents do not require fully understanding text
  • 6. Counting Words: “PDX Data Science is all about data.”
       Term:   about  all  data  is  pdx  science
       Count:  1      1    2     1   1    1
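The word counts on this slide can be reproduced in a few lines. This is a minimal sketch, assuming a simple tokenizer that lowercases the text and keeps only runs of letters; the function name `term_counts` is made up for illustration and does not come from the slides.

```python
import re
from collections import Counter

def term_counts(text):
    """Lowercase the text, split it into alphabetic tokens, and count each token."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

counts = term_counts("PDX Data Science is all about data.")

# Print the terms in alphabetical order, as on the slide.
for term in sorted(counts):
    print(term, counts[term])
```

The resulting counts form the term-frequency vector for this one-sentence document.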
  • 8. Improvement 1: Remove Stop Words (diagram labels: Words, Count, Stop Words)
  • 9. Improvement 2: N-grams (diagram labels: Words, Count): “computer”, “science” counted as the single term “computer science”
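Both improvements can be sketched together. The stop-word list below is a tiny hypothetical one chosen for this example (real lists are much longer), and the helper names are assumptions, not anything named in the slides.

```python
import re

# A tiny, hypothetical stop-word list for illustration only.
STOP_WORDS = {"a", "about", "all", "and", "is", "of", "the", "to"}

def tokenize(text):
    """Lowercase the text and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", text.lower())

def remove_stop_words(tokens):
    """Drop every token that appears in STOP_WORDS."""
    return [t for t in tokens if t not in STOP_WORDS]

def ngrams(tokens, n=2):
    """Join each run of n consecutive tokens into a single term."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = remove_stop_words(tokenize("Computer science is all about data"))
bigrams = ngrams(tokens, 2)

print(tokens)
print(bigrams)
```

Note that this sketch removes stop words before forming bigrams, so words that were separated only by stop words become adjacent. Whether that is desirable depends on the application; forming n-grams first is an equally plausible choice.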
  • 10. Example: Corpus of Machine Learning Papers
       Some terms appear frequently: “Feature”, “Algorithm”, “Training”
       Some less frequently: “Reinforcement”, “Non-linear”, “Convolution”
  • 13. Intuition
       Combinations of words are good indicators of the topic of a document:
         Self-driving cars: “automobile”, “driver”, “radar”, “image”, “sensor”
         Text mining: “corpus”, “term vector”, “syntax”
         Social Network: “graph”, “communities”, “users”, “influence”
       Words that appear frequently across documents in a corpus are not good indicators of topic
       Words that appear frequently only within documents about a single topic are good indicators of topic
  • 17. Formalizing Intuition: TF-IDF
       Notation:
         t - a term
         d - a document
         D - a set of documents or corpus
         N - number of documents in corpus
       TF - term frequency: tf(t,d) is the number of times a term t occurs in document d
       IDF - inverse document frequency: idf(t,D) = log(N / |{d in D: t in d}|)
       TF-IDF = tf(t,d) * idf(t,D)
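The definitions on this slide translate almost directly into code. The following is a minimal sketch, assuming documents are already tokenized into lists of terms and that the logarithm is the natural log (the slide does not state a base). The three-document corpus is made up for illustration.

```python
import math
from collections import Counter

def tf(term, doc_tokens):
    """tf(t,d): number of times the term occurs in the document."""
    return Counter(doc_tokens)[term]

def idf(term, corpus):
    """idf(t,D) = log(N / |{d in D: t in d}|), using the natural log (an assumption)."""
    n_containing = sum(1 for doc in corpus if term in doc)
    if n_containing == 0:
        # The slide's formula is undefined when no document contains the term.
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log(len(corpus) / n_containing)

def tf_idf(term, doc_tokens, corpus):
    """TF-IDF = tf(t,d) * idf(t,D)."""
    return tf(term, doc_tokens) * idf(term, corpus)

# Hypothetical toy corpus: each document is a list of tokens.
corpus = [
    ["data", "science", "data"],
    ["data", "mining"],
    ["graph", "communities"],
]

print(tf_idf("data", corpus[0], corpus))     # 2 * log(3/2)
print(tf_idf("science", corpus[0], corpus))  # 1 * log(3/1)
```

Although "data" occurs twice in the first document and "science" only once, "science" scores higher because it appears in fewer documents. A term that appears in every document gets idf = log(1) = 0, so its TF-IDF is zero regardless of its count.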
  • 18. TF-IDF is
       Large when: there is a large count of a term in a document (large TF) and a low number of documents contain the term
       Small when: the term appears in many documents in the corpus
       (Plot: TF-IDF vs. frequency; regions labeled Stop Words, Common Words, Rare Words)
  • 20. Populating a Term Vector with TF-IDF
       “Emmy Noether is known for her landmark contributions to abstract algebra and theoretical physics”
       V[index(“emmy”)] = tf-idf(“emmy”,d)
       V[index(“noether”)] = tf-idf(“noether”,d)
       V[index(“known”)] = tf-idf(“known”,d)
       V[index(“landmark”)] = tf-idf(“landmark”,d)
       V[index(“contribution”)] = tf-idf(“contribution”,d)
       V[index(“abstract”)] = tf-idf(“abstract”,d)
       V[index(“algebra”)] = tf-idf(“algebra”,d)
       V[index(“theoretical”)] = tf-idf(“theoretical”,d)
       V[index(“physics”)] = tf-idf(“physics”,d)
  • 21. Vector Space Model (plot axes: Term 1, Term 2, Term 3; points: Doc 1, Doc 2, Doc 3)
              Term 1  Term 2  Term 3
       Doc 1  0.4     0.1     0.6
       Doc 2  0.3     0.5     0.0
       Doc 3  0.0     0.2     0.6
  • 22. Similarity Measures: Euclidean distance, cosine (plot axes: Term 1, Term 2, Term 3; points: Doc 1, Doc 2, Doc 3)
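Both measures can be tried on the three document vectors from the slide 21 table. This is a minimal sketch in plain Python; the function names are chosen for this example.

```python
import math

def euclidean_distance(u, v):
    """Straight-line distance between two vectors; smaller means more similar."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; larger means more similar."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_u * norm_v)

# Term weights for Doc 1, Doc 2, Doc 3 as given in the slide 21 table.
docs = {
    "Doc 1": [0.4, 0.1, 0.6],
    "Doc 2": [0.3, 0.5, 0.0],
    "Doc 3": [0.0, 0.2, 0.6],
}

for other in ("Doc 2", "Doc 3"):
    print(
        "Doc 1 vs", other,
        "euclidean:", round(euclidean_distance(docs["Doc 1"], docs[other]), 3),
        "cosine:", round(cosine_similarity(docs["Doc 1"], docs[other]), 3),
    )
```

On these values both measures agree that Doc 1 is closer to Doc 3 than to Doc 2. Cosine similarity ignores vector length, which is often convenient when documents differ a lot in size.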
  • 23. Classify by Vector (Point) (diagram label: TF-IDF Vector)
  • 24. Text Classifier with Scikit Learn
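The code from the talk is not part of this transcript, so the following is only a sketch of what a TF-IDF text classifier can look like in scikit-learn, not the presenter's implementation. It assumes scikit-learn is installed; the training texts, labels, and the choice of a naive Bayes model are all made up for illustration, borrowing topic words from the Intuition slide.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Hypothetical toy training set.
train_texts = [
    "radar sensor image driver automobile",
    "automobile driver sensor radar",
    "corpus term vector syntax",
    "syntax corpus term vector tokens",
]
train_labels = [
    "self-driving cars",
    "self-driving cars",
    "text mining",
    "text mining",
]

# The pipeline turns raw text into TF-IDF vectors, then fits the classifier on them.
model = make_pipeline(TfidfVectorizer(), MultinomialNB())
model.fit(train_texts, train_labels)

predictions = model.predict([
    "the radar and sensor on the automobile",
    "syntax of a corpus",
])
print(list(predictions))
```

By default `TfidfVectorizer` uses a smoothed IDF and L2-normalizes each document vector, so its weights differ numerically from the plain tf(t,d) * idf(t,D) formula on slide 17, although the intuition is the same.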
  • 26. NLP Tools
       Python: Gensim, NLTK, spaCy & textacy, Scikit-Learn, TextBlob
       R: TM, OpenNLP (R interface), TidyText
       Other: Mallet, Google Natural Language API
  • 27. Q & A