Julia text mining_inmobi

Published in: Technology, Education

  1. Hands-on Introduction to Text Mining with Julia - A Mathematical approach. Abhijith Chandraprabhu, April 19, 2014. Sections: Introduction, Preprocessing, VSM Model, Query and Performance Modeling, LSI Model.
  2. Overview: Introduction, Preprocessing, VSM Model, Query and Performance Modeling, LSI Model.
  3. About today's session. We will be dealing with an age-old technique (1999): old wine in a new bottle! The emphasis is on understanding the math behind it: vectors and matrices; dimension and space; projection and matrix factorization; SVD. This is a hands-on session.
  4. “For almost a decade the computational linguistics community has viewed large text collections as a resource to be tapped in order to produce better text analysis algorithms. In this paper, I have attempted to suggest a new emphasis: the use of large online text collections to discover new facts and trends about the world itself. I suggest that to make progress we do not need fully artificial intelligent text analysis; rather, a mixture of computationally-driven and user-guided analysis may open the door to exciting new results.” Untangling Text Data Mining, Marti A. Hearst
  5. Text Mining. A type of information retrieval in which we try to extract relevant information from a huge collection of textual data (documents). Documents: web pages, biomedical literature, movie reviews, research articles, etc. Approach: non-semantic, based on the vector-space model. Applications: web search engines, biomedical information retrieval, etc.
  6. Example documents.
     Document 1: It’s also important to point out that even though these arrays are generic, they’re not boxed: an Int8 array will take up much less memory than an Int64 array, and both will be laid out as continuous blocks of memory; Julia can deal seamlessly and generically with these different immediate types as well as pointer types like String.
     Document 2: To celebrate some of the amazing work that’s already been done to make Julia usable for day-to-day data analysis, I’d like to give a brief overview of the state of statistical programming in Julia. There are now several packages that, taken as a whole, suggest that Julia may really live up to its potential and become the next generation language for data analysis.
     Document 3: Only later did I realize what makes Julia different from all the others. Julia breaks down the second wall: the wall between your high-level code and native assembly. Not only can you write code with the performance of C in Julia, you can take a peek behind the curtain of any function into its LLVM Intermediate Representation as well as its generated assembly code, all within the REPL. Check it out.
     Document 4: Homoiconicity: the code can be operated on by other parts of the code. Again, R kind of has this too! Kind of, because I’m unaware of a good explanation for how to use it productively, and R’s syntax and scoping rules make it tricky to pull off. But I’m still excited to see it in Julia, because I’ve heard good things about macros and I’d like to appreciate them.
     Document 5: Graphics. One of the big advantages of R over similar languages is the sophisticated graphics available in ggplot2 or lattice. Work is underway on graphics models for Julia but, again, it is early days still.
  7. Term-Document Matrix

     Term                   Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Query 1  Query 2
     Array                    1      0      0      0      0       1        0
     continuous block         1      0      0      0      0       0        0
     Julia                    1      3      3      1      1       1        1
     types                    1      0      0      0      0       1        0
     string                   1      0      0      0      0       0        0
     amazing                  0      1      0      0      0       0        0
     data analysis            0      2      0      0      0       0        1
     statistical computing    0      1      0      0      0       0        1
     high-level               0      0      1      0      0       0        0
     performance              0      0      1      0      0       0        0
     LLVM                     0      0      1      0      0       0        0
     homoiconicity            0      0      0      1      0       0        0
     R                        0      0      0      2      0       0        0
     syntax                   0      0      0      4      0       0        0
     macros                   0      0      0      1      0       0        0
     graphics                 0      0      0      0      3       0        0
     advantages               0      0      0      0      1       0        0
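As a rough illustration of how a term-document matrix like this one is assembled, here is a minimal sketch in plain Julia. The toy documents and the helper name `term_document_matrix` are assumptions made for this example; they are not the slide's corpus and not part of TextAnalysis.jl.

```julia
# Toy tokenized documents (hypothetical data, not the slide's corpus).
docs = [
    ["julia", "array", "types", "julia"],
    ["julia", "data", "analysis"],
    ["graphics", "r", "graphics"],
]

# Lexicon: the sorted list of distinct terms across all documents.
lexicon = sort(unique(vcat(docs...)))

# Term-document matrix: one row per term, one column per document,
# each entry counting how often the term occurs in the document.
function term_document_matrix(docs, lexicon)
    index = Dict(term => i for (i, term) in enumerate(lexicon))
    tdm = zeros(Int, length(lexicon), length(docs))
    for (j, doc) in enumerate(docs), term in doc
        tdm[index[term], j] += 1
    end
    return tdm
end

tdm = term_document_matrix(docs, lexicon)
```

A query can be counted against the same lexicon in the same way and appended as an extra column, which is how the Query 1 and Query 2 columns sit next to the documents in the table.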
  8. Terminologies explained:
     Corpus: structured set of texts, obtained by preprocessing raw textual data.
     Lexicon: the collection of all distinct words/terms in the corpus.
     Document, Query: bag of words/terms, represented as a vector.
     Keyword, Term: elements of the lexicon.
     Term Frequency: frequency of occurrence of a term in a document.
     TDM: a matrix with term frequencies as entries, across all documents.
  9. Major Steps in Text Mining
     1. Preprocess the documents to form a corpus.
     2. Identify the lexicon from the corpus.
     3. Form the term-document matrix (numericized text).
     4. Apply VSM, LSI, or K-Means to measure the proximity between all the documents and the query.
  10. Preprocessing using the TextAnalysis.jl package. Raw input text is usually a stream of characters, which must be converted into a stream of terms (the basic processing units). The first step is to create documents, each of which is a collection of terms. Four types of documents can be created in Julia: a FileDocument, which represents a file on disk, and the three types constructed from a string below.
          str = "Julia is a high-level, high-performance dynamic programming language for technical computing"
          sd = StringDocument(str)
          td = TokenDocument(str)
          nd = NGramDocument(str)
  11. Preprocessing. Most of the time we have huge volumes of unstructured data, much of whose content carries no useful information for text mining: HTML tags; numbers; stop words; prepositions, articles, and pronouns. In Julia, we can remove these unnecessary elements using the functions
          remove_articles!(), remove_indefinite_articles!()
          remove_definite_articles!(), remove_pronouns!()
          remove_prepositions!(), remove_stop_words!()
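To make the idea of stripping uninformative content concrete, the following is a minimal plain-Julia sketch. It does not use TextAnalysis.jl; the `STOP_WORDS` list and the `simple_tokens` helper are hypothetical and deliberately tiny, so a real stop-word list would be far longer.

```julia
# Hypothetical, deliberately tiny stop-word list, for illustration only.
const STOP_WORDS = Set(["a", "an", "the", "is", "for", "of", "and", "to"])

function simple_tokens(text::AbstractString)
    # Lowercase, then turn every run of characters that is not a lowercase
    # letter, digit, or hyphen into a single space.
    cleaned = replace(lowercase(text), r"[^a-z0-9-]+" => " ")
    words = String.(split(cleaned))
    # Drop stop words and tokens that consist only of digits.
    return [w for w in words if !(w in STOP_WORDS) && !all(isdigit, w)]
end

str = "Julia is a high-level, high-performance dynamic programming language for technical computing"
tokens = simple_tokens(str)
```

Here the words "is", "a", and "for" are dropped and the remaining eight terms are kept in order.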
  12. Stemming. Consider: "I have a different opinion", "I differ with your opinion", "I opine differently". In information retrieval, morphological variants of words/terms that carry the same semantic information add redundancy. Stemming linguistically normalizes the lexicon of a corpus.
          stem!(Corpus)
          stem!(Document) # sd, td or nd
  13. Vector-Space Model. [Figure: documents D1, D2, D3 drawn as vectors from the origin in a space with term axes T1, T2, T3.] Documents are vectors in the term-document space. The elements of a document vector are the weights $w_{ij}$, corresponding to term $i$ and document $j$ (see Weighting Schemes). In the simplest case the weights are the frequencies of the terms in the documents. The proximity of two documents is calculated as the cosine of the angle between them.
  14. Term-Document Space. Let $D = \{d_1, d_2, \ldots, d_N\}$ be a collection of $N$ documents, and let $T = \{t_1, t_2, \ldots, t_m\}$ be the lexicon set. $w_{ij}$ is the frequency of occurrence of term $t_i$ in document $d_j$. A document is the vector $d_j = [w_{1j}, w_{2j}, w_{3j}, \ldots, w_{mj}]^T$, and the term-document matrix is $\mathrm{TDM} = [d_1 \; d_2 \; d_3 \; \ldots \; d_N] \in \mathbb{R}^{m \times N}$.
  15. Weighting Schemes. A weighting scheme associates each occurrence of a term with a weight that represents its relevance to the meaning of the document it appears in. Longer documents do not always carry more informative content (or more relevant content with respect to a query).
  16. Binary scheme: $w_{ij} = 1$ if $t_i$ occurs in $d_j$, and $0$ otherwise.
      Term Frequency (TF) scheme: $w_{ij} = f_{ij}$, i.e. the number of times $t_i$ occurs in document $d_j$.
      Term Frequency - Inverse Document Frequency (TF-IDF) scheme:
      Term frequency: $$tf_{ij} = \frac{f_{ij}}{\max_k f_{kj}}$$
      Inverse document frequency: $$idf_i = \log \frac{N}{df_i},$$ where $N$ is the number of documents and $df_i$ is the number of documents in which the term $t_i$ occurs.
      Weight: $$w_{ij} = tf_{ij} \times idf_i$$
          m = DocumentTermMatrix(crps)
          tf_idf(m)
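The TF-IDF formulas on this slide can be sketched directly in plain Julia. This is only an illustration of the formulas as written here: the helper name `tfidf_weights` and the guards against empty documents and unused terms are assumptions, and the result may differ from what `tf_idf` in TextAnalysis.jl returns, since that function may normalize differently.

```julia
# F is a term-by-document matrix of raw counts f_ij (rows: terms, columns: documents).
# tf_ij = f_ij / max_k f_kj,  idf_i = log(N / df_i),  w_ij = tf_ij * idf_i
function tfidf_weights(F::AbstractMatrix{<:Real})
    N = size(F, 2)                    # number of documents
    colmax = maximum(F, dims=1)       # largest term count in each document
    tf = F ./ max.(colmax, 1)         # guard (assumption): avoid 0/0 for empty documents
    df = vec(sum(F .> 0, dims=2))     # number of documents containing each term
    idf = log.(N ./ max.(df, 1))      # guard (assumption): avoid division by zero
    return tf .* idf                  # idf is applied row by row
end

F = [2 0 1;
     1 1 1;
     0 3 0]
W = tfidf_weights(F)
```

In this toy matrix the second term occurs in every document, so its inverse document frequency is log(1) = 0 and its whole row of weights vanishes, which is the intended effect of the scheme.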
  17. Query Matching. The goal is to find the documents relevant to a query $q$. The cosine measure is used: $$\cos(\theta) = \frac{q^T d_j}{\|q\|_2 \, \|d_j\|_2},$$ where $\theta$ is the angle between the query $q$ and document $d_j$. The documents for which $\cos(\theta) > tol$ are considered relevant, where $tol$ is a predefined tolerance.
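A minimal sketch of this query-matching rule, using only Julia's standard `LinearAlgebra` library. The function names `cosine` and `relevant_documents` and the toy matrix are assumptions for illustration; the sketch also assumes that neither the query nor any document is the zero vector.

```julia
using LinearAlgebra

# Cosine of the angle between a query vector q and a document vector d.
cosine(q, d) = dot(q, d) / (norm(q) * norm(d))

# Return the indices of the documents (columns of A) whose cosine with q
# exceeds tol, together with all the cosine scores.
function relevant_documents(A::AbstractMatrix, q::AbstractVector, tol::Real)
    scores = [cosine(q, A[:, j]) for j in 1:size(A, 2)]
    return findall(>(tol), scores), scores
end

A = [1.0 0.0 2.0;
     0.0 1.0 1.0;
     1.0 1.0 0.0]
q = [1.0, 0.0, 1.0]
hits, scores = relevant_documents(A, q, 0.6)
```

With this toy data the scores are 1.0, 0.5, and about 0.632, so a tolerance of 0.6 returns documents 1 and 3.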
  18. Performance Modeling. The tolerance $tol$ decides the number of documents returned. With a low $tol$, more documents are returned, but the chance that a returned document is irrelevant increases. Ideally we want a high number of documents returned, with the majority of them relevant.
      Precision: $$P = \frac{D_r}{D_t},$$ where $D_r$ is the number of relevant documents retrieved and $D_t$ is the total number of documents retrieved.
      Recall: $$R = \frac{D_r}{N_r},$$ where $N_r$ is the total number of relevant documents in the database.
          VSMModel(QueryNum::Int,A::Array{Float64,2},NumQueries::Int)
      This function is used to obtain $D_r$, $D_t$ and $N_r$ for any specific query using the vector space model.
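The two ratios can be computed from sets of document indices, as in the following sketch. The helper `precision_recall` is hypothetical and is not the `VSMModel` function from the session; returning 0.0 when nothing is retrieved or nothing is relevant is a convention chosen here, not something the slide specifies.

```julia
# retrieved: indices of documents returned for a query
# relevant:  indices of documents known to be relevant to that query
function precision_recall(retrieved, relevant)
    Dt = length(retrieved)                        # total documents retrieved
    Nr = length(relevant)                         # relevant documents in the database
    Dr = length(intersect(retrieved, relevant))   # relevant documents retrieved
    P = Dt == 0 ? 0.0 : Dr / Dt                   # convention: 0.0 when nothing is retrieved
    R = Nr == 0 ? 0.0 : Dr / Nr                   # convention: 0.0 when nothing is relevant
    return P, R
end

P, R = precision_recall([1, 3, 4, 7], [1, 2, 3])
```

In this example two of the four retrieved documents are relevant and two of the three relevant documents were found, so precision is 1/2 and recall is 2/3.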
  19. [Figure: four panels (Query 10, Query 13, Query 6, Query 23), each plotting No. of Documents against Tolerance (0.2 to 0.8) for the series Dr, Dt and Nr.]
  20. Latent Semantic Indexing. There exists an underlying latent semantic structure in the data, and we can identify this structure through the SVD. The data is projected onto two lower-dimensional spaces: the term space and the document space. Dimension reduction is achieved through the truncated SVD.
  21. Singular Value Decomposition. Theorem (SVD): for any matrix $A \in \mathbb{R}^{m \times n}$ with $m > n$, there exist two orthogonal matrices $U = (u_1, \ldots, u_m) \in \mathbb{R}^{m \times m}$ and $V = (v_1, \ldots, v_n) \in \mathbb{R}^{n \times n}$ such that
      $$A = U \begin{pmatrix} \Sigma \\ 0 \end{pmatrix} V^T,$$
      where $\Sigma \in \mathbb{R}^{n \times n}$ is a diagonal matrix, $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_n)$, with $\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n \ge 0$. The values $\sigma_1, \ldots, \sigma_n$ are called the singular values of $A$. The columns of $U$ and $V$ are called the left and right singular vectors of $A$, respectively.
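The factorization in the theorem can be checked numerically with Julia's standard `LinearAlgebra` library. This is a small sketch on a made-up 3-by-2 matrix; the variable names are chosen here to mirror the symbols in the theorem.

```julia
using LinearAlgebra

A = [1.0 0.0;
     0.0 2.0;
     1.0 1.0]            # m = 3, n = 2

F = svd(A, full=true)    # full=true requests the square m-by-m U of the theorem
U, σ, V = F.U, F.S, F.V

# Build the m-by-n block matrix [Σ; 0] with the singular values on its diagonal.
Σ0 = zeros(size(A))
for i in eachindex(σ)
    Σ0[i, i] = σ[i]
end

reconstruction = U * Σ0 * V'
```

Without `full=true`, `svd` returns the thin factorization, in which `U` has only n columns; that form is usually enough for the truncated SVD used in LSI.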
  22. Low Rank Matrix Approximation using SVD.
      $$A = \sum_{i=1}^{n} \sigma_i u_i v_i^T \approx \sum_{i=1}^{k} \sigma_i u_i v_i^T =: A_k, \qquad k < r,$$
      where $r$ denotes the rank of $A$. The above approximation is based on the Eckart-Young theorem. It helps in removing noise, solving ill-conditioned problems, and, mainly, reducing the dimension of the data. Use the function below to examine the effect of rank reduction. Why is SVD good enough to decompose the TDM?
          svdRedRank(A,k)
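A rank-k approximation of this kind can be sketched with the standard `LinearAlgebra` library. The helper `rank_k_approx` is a hypothetical stand-in written for this example; it is not the session's `svdRedRank`, whose implementation is not shown on the slide.

```julia
using LinearAlgebra

# A_k: the sum of the first k terms σ_i * u_i * v_i', written as a matrix product.
function rank_k_approx(A::AbstractMatrix, k::Integer)
    F = svd(A)
    return F.U[:, 1:k] * Diagonal(F.S[1:k]) * F.Vt[1:k, :]
end

A = [3.0 1.0 0.0;
     1.0 3.0 0.0;
     0.0 0.0 1.0;
     0.0 0.0 0.5]

A1 = rank_k_approx(A, 1)
σ = svdvals(A)
err = opnorm(A - A1)     # spectral norm (2-norm) of the approximation error
```

By the Eckart-Young theorem, the spectral-norm error of the best rank-k approximation equals the first discarded singular value, so here `err` should agree with `σ[2]` up to rounding.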
  23. [Figure: block diagram of the truncated SVD.] $$A_{m \times n} \approx U_{m \times k} \, S_{k \times k} \, V^T_{k \times n}, \qquad k \le n$$
  24. Hands-on functions.
          VSMModel(A::Array{Float64,2},nq::Int)
          SVDModel(A::Array{Float64,2},nq::Int,rank::Int)
      The above functions return the recall and precision using the vector space model and the LSI model. The first nq vectors of matrix A are the queries and the rest are the document vectors. For LSI (SVDModel), we also need to pass in the rank as a parameter. Use the functions shown below to view precision vs recall for VSM and LSI.
          plotNew_RecPrec()
          plotAdd_RecPrec()
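The internals of `SVDModel` are not shown in the slides, so the following is only a sketch of one common way to score a query under LSI: project both the query and the documents onto the first k left singular vectors, then compare them by cosine. The function name `lsi_scores` and the toy data are assumptions, and the sketch assumes the projected vectors are nonzero.

```julia
using LinearAlgebra

# Cosine scores between a query q and every document (column of A)
# in the k-dimensional space spanned by the first k left singular vectors.
function lsi_scores(A::AbstractMatrix, q::AbstractVector, k::Integer)
    F = svd(A)
    Uk = F.U[:, 1:k]
    Dk = Diagonal(F.S[1:k]) * F.Vt[1:k, :]   # document coordinates, one column each
    qk = Uk' * q                             # query coordinates in the same space
    return [dot(qk, Dk[:, j]) / (norm(qk) * norm(Dk[:, j])) for j in 1:size(Dk, 2)]
end

A = [1.0 0.0 2.0;
     0.0 1.0 1.0;
     1.0 1.0 0.0]
q = [1.0, 0.0, 1.0]

full_rank_scores = lsi_scores(A, q, 3)   # no truncation
reduced_scores = lsi_scores(A, q, 2)     # rank-2 LSI space
```

For this square, full-rank matrix, keeping all three singular vectors reproduces the plain vector-space cosines, while smaller k changes the scores by discarding the directions with the smallest singular values.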
  25. [Figure: Precision vs Recall, comparing VSM, LSI and KM, with Recall on the horizontal axis and Precision on the vertical axis.]
  26. Thank you.
