Data Mining Techniques
MUMTAZ KHAN
MS (SEMANTIC WEB)
TF-IDF
TF-IDF stands for Term Frequency–Inverse Document Frequency.
 Important for search: it figures out which terms are most relevant to a document.
Term frequency: It measures how often a word
occurs in a document.
◦ A word that occurs frequently is probably important to that
document’s meaning.
 Mathematical Form:
TF = (Number of occurrences of the keyword in that particular
document) / (Total number of terms in the document)
TF-IDF (continued)
IDF: Inverse Document Frequency measures the rarity of a term in the whole corpus.
◦ Let N denote the total number of documents; then the inverse document frequency of term t is defined as
IDF = log(N/df), or
IDF = 1 + ln(Total Number of Documents / Number of Documents with that term in it), so
TF-IDF = TF * IDF
Practical Example of TF-IDF
◦ Suppose we have three documents d1, d2 and d3:
◦ d1 = The game of life is a game of everlasting learning
◦ d2 = The unexamined life is not worth living
◦ d3 = Never stop learning
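The TF, IDF and TF-IDF values worked out on the following slides can be reproduced with a small script. This is a minimal Python sketch (standard library only, naive whitespace tokenization); the printed values should match the tables below up to small rounding differences.

import math
from collections import Counter

docs = {
    "d1": "The game of life is a game of everlasting learning",
    "d2": "The unexamined life is not worth living",
    "d3": "Never stop learning",
}

# Normalized term frequency: count of term / total terms in the document.
tf = {}
for name, text in docs.items():
    tokens = text.lower().split()
    counts = Counter(tokens)
    tf[name] = {t: c / len(tokens) for t, c in counts.items()}

# Inverse document frequency: 1 + ln(N / df), as defined on the slide.
N = len(docs)
vocab = {t for weights in tf.values() for t in weights}
df = {t: sum(1 for weights in tf.values() if t in weights) for t in vocab}
idf = {t: 1 + math.log(N / df[t]) for t in vocab}

# TF-IDF = TF * IDF, e.g. for the query terms "life" and "learning".
for name in docs:
    for term in ("life", "learning"):
        print(name, term, tf[name].get(term, 0) * idf[term])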
TF-IDF (continued)
Steps for TF-IDF
TF for d1:
Term/Word   the  game  of  life  is  a  everlasting  learning
Frequency   1    2     2   1     1   1  1            1
TF for d2:
Term/Word   the  unexamined  life  is  not  worth  living
Frequency   1    1           1     1   1    1      1
TF for d3:
Term/Word   Never  Stop  learning
Frequency   1      1     1
Normalized TF for d1, d2 and d3:
d1:
Term/Word      the      game     of       life     is       a        everlasting  learning
Normalized TF  1/10=.1  2/10=.2  2/10=.2  1/10=.1  1/10=.1  1/10=.1  1/10=.1      1/10=.1
TF-IDF (continued)
d2:
Term/Word      the        unexamined  life       is         not        worth      living
Normalized TF  1/7=.1428  1/7=.1428   1/7=.1428  1/7=.1428  1/7=.1428  1/7=.1428  1/7=.1428
d3:
Term/Word      Never      Stop       learning
Normalized TF  1/3=.3333  1/3=.3333  1/3=.3333
◦ Note: d1 contains 10 terms/words, d2 contains 7 terms/words and d3 contains 3 terms/words.
Calculation of IDF for each term/word:
IDF = 1 + ln(Total Number of Documents / Number of Documents with that term in it)
TF-IDF (continued)
 Let us compute the IDF for the term "unexamined":
IDF = 1 + ln(Total Number of Documents / Number of Documents with the term "unexamined" in it)
◦ There are 3 documents in all (d1, d2, d3), but the term "unexamined" appears only in document d2.
IDF(unexamined) = 1 + ln(3/1) = 2.098726209, and similarly for the other terms.
Terms        IDF
The          1.405507135
Game         2.098726209
Of           2.098726209
Life         1.405507135
Is           1.405507135
A            2.098726209
Everlasting  2.098726209
Learning     1.405507135
Unexamined   2.098726209
Not          2.098726209
Worth        2.098726209
Living       2.098726209
Never        2.098726209
Stop         2.098726209
TF-IDF (continued)
 Let us calculate the TF-IDF values and find the relevant documents for the query: life learning
 Note: for each term, the TF-IDF is calculated by multiplying its normalized term frequency by its IDF, i.e.
TF-IDF = TF * IDF
Terms D1 D2 D3
life 0.140550715 0.200786736 0
learning 0.140550715 0 0.468502384
TF-IDF (continued)
 Vector Space Model-Cosine Similarity
◦ For each document we derive a vector
◦ Set of documents in a collection is viewed as a set of
vectors in a vector space.
◦ Each term will have its own axis
 Formula: to find the similarity between any two documents
Cosine Similarity(d1, d2) = Dot product(d1, d2) / (||d1|| * ||d2||)
Dot product(d1, d2) = d1[0]*d2[0] + d1[1]*d2[1] + … + d1[n]*d2[n]
||d1|| = square root(d1[0]² + d1[1]² + … + d1[n]²)
||d2|| = square root(d2[0]² + d2[1]² + … + d2[n]²)
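As a small illustration of the formula above, here is a minimal Python sketch of cosine similarity, applied to the query and d1 vectors that appear in the worked example on the following slides.

import math

def cosine_similarity(v1, v2):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# TF-IDF vectors over the terms (life, learning), taken from the slides.
query = [0.702753576, 0.702753576]
d1 = [0.140550715, 0.140550715]
print(cosine_similarity(query, d1))  # ~1.0: the two vectors point the same way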
TF-IDF (continued)
 Vector Space Model-Cosine Similarity
TF-IDF (continued)
 The TF-IDF for the query: life learning

Term      TF  IDF          TF-IDF
life      .5  1.405507153  0.702753576
learning  .5  1.405507153  0.702753576

 Let us calculate the cosine similarity between the query and document d1:
Cosine Similarity(Query, Document1) = Dot product(Query, Document1) / (||Query|| * ||Document1||)
Dot product(Query, Document1)
= (0.702753576)*(0.140550715) + (0.702753576)*(0.140550715)
= 0.197545035151
||Query|| = sqrt((0.702753576)² + (0.702753576)²) = 0.993843638185
||Document1|| = sqrt((0.140550715)² + (0.140550715)²) = 0.198768727354
Cosine Similarity(Query, Document1) = 0.197545035151 / ((0.993843638185) * (0.198768727354))
= 0.197545035151 / 0.197545035151
= 1
TF-IDF (continued)
 The similarity scores of d1, d2 and d3 against the query are:

Cosine Similarity   d1   d2           d3
                    1    0.707106781  0.707106781

 Some Demerits of TF-IDF
◦ It is based on the bag-of-words model
 therefore it does not capture position in text, semantics, co-occurrences in different documents, etc.
 For this reason, TF-IDF is only useful as a lexical-level feature
◦ Cannot capture semantics (e.g. as compared to topic models, word embeddings)
References
1. Van Rijsbergen, Cornelis J., Stephen Edward Robertson, and Martin F. Porter. New
models in probabilistic information retrieval. London: British Library Research and
Development Department, 1980
2. http://ethen8181.github.io/machine-learning/clustering_old/tf_idf/tf_idf.html
3. http://www.tfidf.com/
4. Wang, Wei, and Yongxin Tang. "Improvement and Application of TF-IDF Algorithm in
Text Orientation Analysis." (2016).
5. Wu, Ho Chung, et al. "Interpreting tf-idf term weights as making relevance
decisions." ACM Transactions on Information Systems (TOIS) 26.3 (2008): 13.
6. Sparck Jones, Karen. "A statistical interpretation of term specificity and its application in
retrieval." Journal of documentation 28.1 (1972): 11-21.
7. Salton, Gerard, and Christopher Buckley. "Term-weighting approaches in automatic text
retrieval." Information processing & management 24.5 (1988): 513-523.
LDA (Latent Dirichlet Allocation)
 Outline of LDA
◦ Introduction
◦ Model Definition
◦ Variational Inferences
◦ Example output and Simulation
◦ References
LDA (continued)
 Introduction
◦ When more information becomes available, it becomes more
difficult to find and discover what we need.
◦ We need tools to help us organize, search and understand this vast amount of information.
◦ Topic modeling provides methods for automatically organizing, understanding, searching, and summarizing large electronic archives.
LDA (continued)
 Introduction (Goal of Topic Model)
◦ Document has several topics
◦ Topics are associated with words
◦ Words are expressed through the topics into documents
Documents (observed) → Topics (latent) → Words (observed)
LDA (continued)
 LDA
◦ LDA is a generative probabilistic model of a corpus
◦ LDA basically makes pLSA a generative model by imposing a Dirichlet prior on the model parameters
◦ LDA is just a Bayesian version of pLSA, with the parameters now much more regularized
◦ LDA breaks down the collection of documents into topics
◦ Discovers the hidden themes in the collection
◦ Represents each document as a mixture of topics with their probability distribution
◦ Topics are represented as a mixture of words, with probabilities representing the importance of each word for the topic
LDA (continued)
 Twitter Using LDA
◦ Fetch tweet data using the “twitteR” package
◦ Load the data into the R environment
◦ Clean the data to remove: re-tweet information, links, special characters, emoticons, and frequent words like is, as, this, etc.
◦ Create a Term-Document Matrix (TDM) using the “tm” package
◦ Calculate TF-IDF (Term Frequency-Inverse Document Frequency) for all the words in the TDM
◦ Exclude all the words with TF-IDF <= 0.1, to remove the words which are less frequent
◦ Calculate the optimal number of topics (k) in the corpus using the log-likelihood function for the TDM
◦ Apply LDA using the “topicmodels” package to discover topics
◦ Evaluate the model
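The workflow above uses R packages; as a rough, illustrative analogue only, a similar pipeline can be sketched in Python with scikit-learn. The tweet list and topic count below are placeholders, and the slide's TF-IDF filtering and log-likelihood-based choice of k are simplified away.

from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

tweets = ["sample tweet about data mining and topics",
          "another tweet about topic models and text"]

# Term weighting with TF-IDF (the slide drops words with TF-IDF <= 0.1).
tfidf = TfidfVectorizer(stop_words="english")
weights = tfidf.fit_transform(tweets)
print(weights.shape)  # (documents, vocabulary)

# LDA itself is usually fit on raw term counts rather than TF-IDF weights.
counts = CountVectorizer(stop_words="english").fit_transform(tweets)
lda = LatentDirichletAllocation(n_components=2, random_state=0)  # k = 2 topics
doc_topics = lda.fit_transform(counts)
print(doc_topics)  # per-document topic proportions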
LDA (continued)
 LDA – generative process
◦ Setting up a generative model
 We have D documents using a vocabulary of V word types
 Each document contains (up to) N word tokens
 We assume K topics
 Each document has a K-dimensional multinomial θd over topics, with a common Dirichlet prior Dir(α)
 Each topic has a V-dimensional multinomial βk over words, with a common symmetric Dirichlet prior Dir(η)
LDA (continued)
 The Generative Process
◦ For each topic k = 1…K:
 Draw a word distribution (multinomial) βk ∼ Dir(η)
◦ For each document d = 1…D:
 Draw a topic distribution (multinomial) θd ∼ Dir(α)
 For each word Wd,n:
◦ Draw a topic Zd,n ∼ Mult(θd), with Zd,n ∈ [1…K]
◦ Draw a word Wd,n ∼ Mult(βZd,n)
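The generative story above can be simulated directly. This is a minimal numpy sketch; the sizes and hyperparameters (K, V, D, N, α, η) are illustrative values, not taken from the slides.

import numpy as np

rng = np.random.default_rng(0)
K, V, D, N = 3, 8, 2, 10          # topics, vocabulary size, documents, words per doc
alpha, eta = 0.5, 0.1             # Dirichlet hyperparameters (illustrative)

beta = rng.dirichlet([eta] * V, size=K)      # topic -> word distributions
for d in range(D):
    theta = rng.dirichlet([alpha] * K)       # document -> topic proportions
    for n in range(N):
        z = rng.choice(K, p=theta)           # topic assignment z_{d,n}
        w = rng.choice(V, p=beta[z])         # observed word w_{d,n}
        print(f"doc {d}, word {n}: topic {z}, word id {w}")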
LDA (continued)
 Graphical Model of LDA
LDA (continued)
 LDA Joint Distribution
LDA (continued)
 The LDA joint distribution defines a posterior p(θ, z, β | w)
◦ From a collection of documents we have to infer:
 Per-word topic assignments zd,n
 Per-document topic proportions θd
 Per-corpus topic distributions βk
LDA (continued)
 Why depends on and ?
 LDA Graphical Model with working Procedure
LDA (continued)
 LDA Graphical Model with working Procedure
LDA (continued)
LDA (continued)
 LDA inputs.
◦ Set of words per document for each document in Corpus
 LDA outputs.
◦ Corpus-wide topic vocabulary distributions
◦ Topic assignments per word
◦ Topic proportion per document
LDA (continued)
Topic models: a probabilistic generative process paired with statistical inference.
Three latent variables:
◦ Word distribution per topic (word-topic matrix)
◦ Topic distribution per document (topic-document matrix)
◦ Topic assignment per word
LDA (continued)
 The Dirichlet distribution is the conjugate prior of the multinomial distribution
 The parameter α controls the mean shape and sparsity of θ
◦ high α = uniform θ, small α = sparse θ
 In LDA the topics are drawn from a V-dimensional Dirichlet and the topic proportions from a K-dimensional Dirichlet
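A tiny numpy sketch of the effect of α (the values of α and the dimension below are arbitrary): a large α yields roughly uniform θ, while a small α yields sparse θ.

import numpy as np

rng = np.random.default_rng(1)
print(rng.dirichlet([10.0] * 5))   # high alpha -> roughly uniform proportions
print(rng.dirichlet([0.1] * 5))    # small alpha -> sparse proportions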
LDA (continued)
 The Geometric intuition (Simplex)
LDA (continued)
 The Dirichlet is a “dice factory”
◦ Multivariate equivalent of the Beta distribution (“coin factory”)
◦ Parameters α determine the form of the prior
 The Dirichlet is defined over the (K-1) simplex
◦ The K non-negative arguments which sum to one
LDA (continued)
LDA (continued)
LDA (continued)
LDA (continued)
LDA (continued)
LDA (continued)
 To which topics does a given document belong? We thus want to compute the posterior distribution of the hidden variables given a document.
LDA (continued)
◦ Variational Inference
LDA (continued)
LDA (continued)
LDA (continued)
 LDA Summary
◦ LDA Can:
 Visualize the hidden thematic structure in large corpora
 Generalize new data to fit into that structure
 Used for Feature reduction, bioinformatics
 Used for Sentiment analysis, object localization, automatic harmonic analysis for
music
◦ Note: LDA Main Goal
 In each document, allocate its words to few topics
 In each topic, assign high probability to few terms
◦ This follows from the joint distribution:
 Sparse proportions come from the 1st term
 Sparse topics come from the 2nd term
◦ Limitations:
 Must know the number of topics k in advance
 Dirichlet topic distribution cannot capture correlations among topics
References
1. Jelodar, Hamed, et al. "Latent Dirichlet Allocation (LDA) and Topic modeling: models,
applications, a survey." arXiv preprint arXiv:1711.04305 (2017).
2. http://blog.echen.me/2011/08/22/introduction-to-latent-dirichlet-allocation/
3. Blei, David M., Andrew Y. Ng, and Michael I. Jordan. "Latent Dirichlet allocation." Journal of
Machine Learning Research 3 (2003): 993-1022.
4. Video Lectures of David Blei on videolectures.net: http://videolectures.net/mlss09uk_blei_tm/
5. Campr, Michal, and Karel Ježek. "Comparing semantic models for evaluating automatic
document summarization." International Conference on Text, Speech, and Dialogue. Springer
International Publishing, 2015.
6. Hu, Diane J. "Latent dirichlet allocation for text, images, and music." University of California, San
Diego. Retrieved April 26 (2009): 2013.
7. Jayapal, Arun, and Martin Emms. "Topic Models-Latent Dirichlet Allocation." (2014).
8. Wang, Y. Distributed gibbs sampling of latent topic models: The gritty details. Technical report,
2008.
9. https://cs.stanford.edu/~ppasupat/a9online/1140.html
Latent Semantic Indexing (LSI)
Problems in Lexical matching
Motivation
Introduction
 How LSI Work?
LSI Procedure
SVD
Example
Application
Demerits
LSI (continued)
Problems in Lexical Matching
◦ Synonymy
- widespread synonym occurrences
- decreases recall
◦ Polysemy
- retrieval of irrelevant documents
- poor precision
◦ Noise
- Boolean search on specific words
- retrieval of documents with unrelated content
LSI (continued)
Motivation for LSI
◦ To find and fit a useful model of the relationships between terms and documents.
◦ To find out what terms are "really" implied by a query.
◦ LSI allows the user to search for concepts rather than specific words.
◦ Terms and documents are stored in a concept space.
◦ LSI can retrieve documents related to a user's query even when the query and the documents do not share any common terms.
◦ Mathematical model
 Relates documents and concepts
◦ LSI tries to overcome the problems of lexical matching.
LSI (continued)
Introduction
◦ LSI is a technique that projects queries and documents
into a space with “latent” semantic dimensions
◦ It uses a multidimensional vector space to place all documents and terms
◦ Each dimension in that space corresponds to a concept existing in the collection.
◦ Common related terms in a document and a query will pull the document and query vectors close to each other.
LSI (continued)
Concepts in Documents
LSI (continued)
How does LSI Work?
• Given a set of documents:
 how do we determine the similar ones?
 examine the documents
 try to find concepts in common
 classify the documents
• This is how LSI also works.
• LSI represents terms and documents in a high-dimensional space, allowing relationships between terms and documents to be exploited during searching.
• Convert the high-dimensional space to a lower-dimensional space, throw out the noise, and keep the good stuff.
LSI (continued)
LSI Procedure
◦ Obtain the term-document matrix.
◦ Compute the SVD.
◦ Truncate the SVD into a reduced-k LSI space.
- k-dimensional semantic structure
- similarity in the reduced space:
- term-term
- term-document
- document-document
 Query Procedure
◦ Map the query into the reduced k-space:
q' = qᵀ Uk Sk⁻¹
◦ Retrieve documents or terms within a proximity.
- cosine
- best m
LSI (continued)
Singular Value Decomposition (SVD)
◦ LSI uses SVD, a linear algebra method:
◦ SVD decomposes the original matrix into three matrices
 Document eigenvector matrix (V)
 Singular value (eigenvalue) matrix (Σ)
 Term eigenvector matrix (U)
◦ The SVD of a rectangular matrix A is given by:
 A = U Σ Vᵀ
LSI (continued)
Singular Value Decomposition (SVD)
◦ For an m × n matrix A of rank r there exists a factorization
(the Singular Value Decomposition, SVD) as follows:
 A = U Σ Vᵀ
◦ The columns of U are orthogonal eigenvectors of AAᵀ
◦ The columns of V are orthogonal eigenvectors of AᵀA
◦ The eigenvalues λ1 … λr of AAᵀ are also the eigenvalues of AᵀA
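These relations are easy to verify numerically; the following is a small numpy sketch using an arbitrary random matrix (not the example matrix from the next slides).

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))        # arbitrary m x n example matrix

U, s, Vt = np.linalg.svd(A, full_matrices=False)

# A Aᵀ U = U diag(s²): the columns of U are eigenvectors of A Aᵀ
print(np.allclose(A @ A.T @ U, U * s**2))
# Aᵀ A V = V diag(s²): the columns of V are eigenvectors of Aᵀ A
print(np.allclose(A.T @ A @ Vt.T, Vt.T * s**2))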
LSI (continued)
Example
◦ Suppose we have three documents:
 d1: Shipment of gold damaged in a fire.
 d2: Delivery of silver arrived in a silver truck.
 d3: Shipment of gold arrived in a truck.
◦ Problem: Use Latent Semantic Indexing (LSI) to rank these documents for the query gold silver truck.
Step 1: Set term weights and construct the term-document matrix A and the query matrix.
LSI (continued)
Step 2: Decompose matrix A and find the U, S and V matrices, where
LSI (continued)
 Step 3: Implement a Rank 2 Approximation by keeping the first two columns of U and V
and the first two columns and rows of S.
LSI (continued)
 Step 4: Find the new document vector coordinates in this reduced 2-dimensional space.
 The rows of V hold the eigenvector values. These are the coordinates of the individual document vectors, hence:
 d1(-0.4945, 0.6492)
 d2(-0.6458, -0.7194)
 d3(-0.5817, 0.2469)
 Step 5: Find the new query vector coordinates in the reduced 2-dimensional space.
 Note: these are the new coordinates of the query vector in two dimensions. Note how this matrix is now different from the original query matrix q given in Step 1.
LSI (continued)
 Step 6: Rank documents in decreasing order of query-document cosine similarities.
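The whole worked example (Steps 1–6) can be reproduced with a few lines of numpy. This is an illustrative sketch that uses raw term counts as the term weights; the signs of the SVD factors may differ from the slides, but the resulting ranking (d2 first) should agree.

import numpy as np

docs = ["shipment of gold damaged in a fire",
        "delivery of silver arrived in a silver truck",
        "shipment of gold arrived in a truck"]
query = "gold silver truck"

# Step 1: term-document matrix A (rows = terms, columns = documents) and query vector q.
vocab = sorted({w for d in docs for w in d.split()})
A = np.array([[d.split().count(t) for d in docs] for t in vocab], dtype=float)
q = np.array([query.split().count(t) for t in vocab], dtype=float)

# Steps 2-3: SVD and rank-2 approximation.
U, s, Vt = np.linalg.svd(A, full_matrices=False)
k = 2
Uk, Sk = U[:, :k], np.diag(s[:k])
doc_coords = Vt[:k, :].T                      # Step 4: document coordinates
q_k = q @ Uk @ np.linalg.inv(Sk)              # Step 5: query coordinates, q' = qᵀ Uk Sk⁻¹

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

sims = {f"d{i + 1}": cosine(q_k, d) for i, d in enumerate(doc_coords)}
print(sorted(sims.items(), key=lambda kv: -kv[1]))   # Step 6: d2 ranks highest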
LSI (continued)
 Graphical Representation
 We can see that document d2 scores higher than d3 and d1. Its vector is
closer to the query vector than the other vectors. Also note that Term
Vector Theory is still used at the beginning and at the end of LSI
LSI (continued)
Applications of LSI
◦ Information Retrieval
◦ Information Filtering
◦ Relevance Feedback
◦ Improving performance of Search Engines
 in ranking pages
◦ Cross-language retrieval
◦ Automated essay grading
◦ Optimizing link profile of your web page
◦ Modelling of human cognitive function
◦ Dynamic advertisements put on pages, Google’s
AdSense
LSI (continued)
Demerits of LSI
◦ Storage
◦ The complexity of the LSI model obtained from the truncated SVD is costly
◦ Its execution efficiency lags far behind that of the simpler Boolean models, especially on large data sets
◦ The latent topic dimension cannot be chosen arbitrarily; it is bounded by the rank of the matrix
◦ Bad for millions of words or documents
◦ Hard to incorporate new words or documents
References
◦ http://www.bluebit.gr/matrix-calculator/
◦ Rosario, Barbara. "Latent semantic indexing: An overview." Techn. rep. INFOSYS 240 (2000): 1-16.
◦ Ding, Chris HQ. "A probabilistic model for latent semantic indexing." Journal of the Association for
Information Science and Technology 56.6 (2005): 597-608.
◦ Dumais, Susan T. "Latent semantic indexing (LSI) and TREC-2." Nist Special Publication Sp (1994):
105-105.
◦ Alter, Orly, Patrick O. Brown, and David Botstein. "Singular value decomposition for genome-
wide expression data processing and modeling." Proceedings of the National Academy of
Sciences 97.18 (2000): 10101-10106.
◦ Golub, Gene H., and Charles F. Van Loan. Matrix computations. Vol. 3. JHU Press, 2012.
◦ http://www-db.deis.unibo.it/courses/SI-M/
◦ http://web.eecs.utk.edu/research/lsi/
◦ http://lsi.research.telcordia.com/
Word2Vec
◦ It is used to generate representation vectors for words
◦ Maps words to continuous vector representations
 i.e. points in an N-dimensional space
◦ Learns vectors from training data (generalizations)
◦ It is a numeric representation for each word
 that enables capturing relationships between words, such as synonyms and analogies
Word2Vec (continued)
Continuous Bag of Words (CBOW)
◦ It predicts the missing word given a window of context words
 Suppose we are given the words "Latent Dirichlet"; the CBOW model predicts the missing word "Allocation", giving "Latent Dirichlet Allocation"
◦ It is useful for identifying a missing word in a sentence
◦ It identifies effective sentiment orientations
◦ Randomly initialize input/output weight matrices of sizes VxN and NxV, where V: vocab size, N: vector size (a parameter)
◦ Update the weight matrices using SGD, backpropagation and cross entropy over the corpus
◦ The hidden layer size corresponds to the word vector dimensionality
Word2Vec (continued)
Skip-Gram
◦ The method is very similar, except now we predict a window of words given a single word vector
◦ It predicts the context words given a word
 Suppose we are given the word "Dirichlet"; the Skip-Gram model predicts the context words, as in "Latent Dirichlet Allocation"
◦ Boils down to maximizing the dot-product similarity of context words and the target word
◦ Skip-gram typically outperforms CBOW on semantic and syntactic accuracy (Mikolov et al.)
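As a minimal illustration (assuming the gensim library, 4.x API; the toy corpus and parameters are placeholders), the sg flag switches between the two architectures:

from gensim.models import Word2Vec

sentences = [
    ["latent", "dirichlet", "allocation", "is", "a", "topic", "model"],
    ["word2vec", "learns", "word", "vectors", "from", "context"],
]

cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)      # CBOW
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)  # skip-gram

print(skipgram.wv.most_similar("dirichlet", topn=3))  # nearest words in vector space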
Word2Vec (continued)
Demerits
• Quality depends on input data, number of samples, and
size of vectors (possibly long computation time!)
 But Google released 3 million word vectors trained on 100 billion words!
• Averaging word vectors does not work well (in my experience) on longer text (beyond tweet length)
• W2V cannot provide fixed-length feature vectors for
variable-length text (pretty much everything!)
Doc2Vec
◦ It generalizes Word2Vec to whole documents (phrases, sentences, etc.)
◦ Provides a fixed-length vector
◦ Learns via the Distributed Memory (DM) and Distributed Bag of Words (DBOW) models
Doc2Vec (continued)
Distributed Memory (DM)
◦ Assign and randomly initialize a paragraph vector for each document
◦ Predict the next word using the context words + the paragraph vector
◦ Slide the context window across the document but keep the paragraph vector fixed (hence "distributed memory")
◦ Updating is done via SGD and backpropagation
Doc2Vec (continued)
Distributed Bag of Words (DBOW)
◦ Only uses the paragraph vector (no word vectors!)
◦ Takes a window of words in the paragraph and randomly samples which one to predict using the paragraph vector (ignores word ordering)
◦ Simpler, more memory efficient
◦ DM typically outperforms DBOW
 but DM+DBOW is even better!
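A minimal gensim sketch (assuming gensim 4.x; the tagged toy corpus is a placeholder): dm=1 trains the Distributed Memory model, dm=0 the Distributed Bag of Words model.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

corpus = [
    TaggedDocument(words=["the", "game", "of", "life"], tags=["d1"]),
    TaggedDocument(words=["never", "stop", "learning"], tags=["d2"]),
]

dm = Doc2Vec(corpus, vector_size=50, min_count=1, dm=1)     # Distributed Memory
dbow = Doc2Vec(corpus, vector_size=50, min_count=1, dm=0)   # Distributed Bag of Words

# Fixed-length vector for an unseen piece of text.
print(dm.infer_vector(["life", "is", "learning"]))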
Doc2Vec Example
Doc2Vec on Wikipedia¹
¹ Document Embedding with Paragraph Vectors, A. Dai et al., 2014
Doc2Vec Example (continued)
Doc2Vec on Wikipedia¹
◦ LDA vs. Doc2Vec for nearest neighbors to “Machine learning” (bold = unrelated to machine learning)
¹ Document Embedding with Paragraph Vectors, A. Dai et al., 2014
Doc2Vec Example (continued)
Doc2Vec on Wikipedia¹
¹ Document Embedding with Paragraph Vectors, A. Dai et al., 2014
Word2Vec (continued)
Application
◦ Information Retrieval
◦ Documents classification
◦ Recommendation algorithms
◦ etc
Thank you !
Doc2Vec (continued)
Conclusion
◦ Doc2Vec is more efficient and robust than other methods such as LSI, LDA and TF-IDF
Thank you !
References
◦ http://www.bluebit.gr/matrix-calculator/