Intro to Vectorization Concepts - GaTech cse6242


Introduction to vectorizing data in machine learning

Published in: Data & Analytics

  1. Vectorization Core Concepts in Data Mining
  2. Topic Index • Why Vectorization? • Vector Space Model • Bag of Words • TF-IDF • N-Grams • Kernel Hashing
  3. “How is it possible for a slow, tiny brain, whether biological or electronic, to perceive, understand, predict, and manipulate a world far larger and more complicated than itself?” --- Stuart Russell and Peter Norvig, “Artificial Intelligence: A Modern Approach” WHY VECTORIZATION?
  4. Classic Scenario: “Classify some tweets for positive vs negative sentiment”
  5. What Needs to Happen? • We need each tweet as some structure that can be fed to a learning algorithm – To represent the knowledge of a “negative” vs “positive” tweet • How does that happen? – We need to take the raw text and convert it into what is called a “vector” • Vectors relate to the fundamentals of linear algebra – “Solving sets of linear equations”
  6. Wait. What’s a Vector Again? • An array of floating point numbers • Represents data – Text – Audio – Image • Example: – [ 1.0, 0.0, 1.0, 0.5 ]
  7. “I am putting myself to the fullest possible use, which is all I think that any conscious entity can ever hope to do.” --- HAL 9000, “2001: A Space Odyssey” VECTOR SPACE MODEL
  8. Vector Space Model • Common way of vectorizing text – every possible word is mapped to a specific integer • If we have a large enough array then every word fits into a unique slot in the array – the value at that index is the number of times the word occurs • Most often our array size is smaller than our corpus vocabulary – so we need a “vectorization strategy” to account for this
  9. Text Processing Can Include Several Stages • Sentence Segmentation – can skip straight to tokenization depending on use case • Tokenization – finding individual words • Lemmatization – finding the base or stem of words • Removing Stop Words – “the”, “and”, etc. • Vectorization – we take the output of the previous stages and make an array of floating point values
  10. “A man who carries a cat by the tail learns something he can learn in no other way.” --- Mark Twain TEXT VECTORIZATION STRATEGIES
  11. Bag of Words • A group of words or a document is represented as a bag – or “multi-set” of its words • A bag of words is a list of words and their word counts – the simplest vector model – but it can end up using a lot of columns due to the number of words involved • Grammar and word ordering are ignored – but we still track how many times each word occurs in the document • Has been used most frequently in the document classification and information retrieval domains
  12. Term Frequency Inverse Document Frequency (TF-IDF) • Fixes some issues with “bag of words” • Allows us to leverage the information about how often a word occurs in a document (TF) – while considering the frequency of the word in the corpus to control for the fact that some words will be more common than others (IDF) • Often more accurate than the basic bag of words model – but computationally more expensive
  13. TF-IDF Formula • w_i = TF_i * IDF_i • TF_i = (number of times term i appears in a document) / (total number of terms in the document) • IDF_i = log(N / DF_i) – N is the total number of documents in the corpus – DF_i is the number of documents containing term i
  14. N-grams • A group of words in a sequence is called an n-gram • A single word can be called a unigram • Two words like “Coca Cola” can be considered a single unit and called a bigram • Three or more terms can be called trigrams, 4-grams, 5-grams and so on
  15. N-Grams Usage • If we combine the unigrams and bigrams from a document and generate weights using TF-IDF – we will end up with large vectors with many meaningless bigrams – having large weights on account of their large IDF • Can pass n-grams through something called a log-likelihood test – which can determine whether two words occurred together by chance, or because they form a significant unit – It selects the most significant ones and prunes away the least significant ones • Using the remaining n-grams, the TF-IDF weighting scheme is applied and vectors are produced – In this way, significant bigrams like “Coca Cola” can be more properly accounted for in a TF-IDF weighting
  16. Kernel Hashing • Used when we want to vectorize the data in a single pass – making it a “just in time” vectorizer • Can be used when we want to vectorize text right before we feed it to our learning algorithm • We come up with a fixed-size vector that is typically smaller than the total number of possible words that we could index or vectorize – Then we use a hash function to create an index into the vector
  17. More Kernel Hashing • The advantage of kernel hashing is that we don’t need the precursor pass like we do with TF-IDF – but we run the risk of having collisions between words • In practice these collisions occur infrequently when the vector is large enough – and usually don’t have a noticeable impact on learning performance • For more reading: – kernels-for-wildly-unprincipled-machine-learning/
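
The vector space model and bag of words slides describe mapping each word to an integer slot and storing its count there. A minimal sketch of that idea follows, using only the Python standard library. The stop-word set, the regular-expression tokenizer, and the function names are assumptions made for illustration and are not taken from the slides.

```python
import re
from collections import Counter

# Hypothetical stop-word list for illustration; real lists are much longer.
STOP_WORDS = {"the", "and", "a", "is", "of"}

def tokenize(text):
    """Lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z']+", text.lower())

def build_vocabulary(documents):
    """Map every non-stop word in the corpus to a unique integer index."""
    vocab = {}
    for doc in documents:
        for token in tokenize(doc):
            if token not in STOP_WORDS and token not in vocab:
                vocab[token] = len(vocab)
    return vocab

def bag_of_words(text, vocab):
    """Return a vector whose i-th entry counts occurrences of word i."""
    vector = [0.0] * len(vocab)
    counts = Counter(t for t in tokenize(text) if t in vocab)
    for token, count in counts.items():
        vector[vocab[token]] = float(count)
    return vector

docs = ["the movie is great great fun", "the movie is a bore"]
vocab = build_vocabulary(docs)
vectors = [bag_of_words(d, vocab) for d in docs]
```

Word order is discarded here, as the slides note: only the per-word counts survive in each vector.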
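
The TF-IDF formula slide can be worked through in a few lines. This sketch assumes documents are already tokenized, uses the natural logarithm (the slide does not specify a base), and applies no smoothing. The function name `tf_idf` and the toy corpus are illustrative assumptions.

```python
import math
from collections import Counter

def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    # DF: number of documents that contain each term at least once.
    doc_freq = Counter()
    for tokens in documents:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in documents:
        counts = Counter(tokens)
        total = len(tokens)
        weights.append({
            # w = TF * IDF, with TF = count / total and IDF = log(N / DF)
            term: (count / total) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return weights

corpus = [
    ["great", "movie", "great", "fun"],
    ["boring", "movie"],
]
weights = tf_idf(corpus)
```

In this toy corpus "movie" appears in every document, so its IDF is log(2/2) = 0 and its weight vanishes, while "great" keeps a positive weight. This also shows why a separate pass over the corpus is needed before any document can be weighted.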
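
The n-gram slides can be illustrated by extracting bigrams from token lists. Note that the pruning step below is a plain frequency threshold, used here as a simpler stand-in for the log-likelihood test the slides mention; it is not that test. The names `ngrams`, `frequent_bigrams`, and `min_count` are assumptions for illustration.

```python
from collections import Counter

def ngrams(tokens, n):
    """Return all contiguous n-token sequences as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def frequent_bigrams(documents, min_count=2):
    """Keep bigrams seen at least min_count times across the corpus.

    A frequency cutoff is an assumption standing in for a proper
    significance test such as log-likelihood.
    """
    counts = Counter()
    for tokens in documents:
        counts.update(ngrams(tokens, 2))
    return {bigram for bigram, count in counts.items() if count >= min_count}

docs = [
    ["i", "drink", "coca", "cola"],
    ["coca", "cola", "is", "sweet"],
]
bigrams = ngrams(docs[0], 2)
kept = frequent_bigrams(docs)
```

Only the surviving bigrams would then be added to the vocabulary alongside the unigrams before TF-IDF weighting is applied.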
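
The kernel hashing slides describe hashing each word directly into a fixed-size vector, with no vocabulary pass. A minimal sketch follows. It assumes `zlib.crc32` as the hash function because Python's built-in `hash()` is randomized per process for strings; the function name and bucket count are illustrative choices, not something the slides prescribe.

```python
import zlib

def hash_vectorize(tokens, n_buckets=16):
    """Count tokens into a fixed-size vector indexed by a hash of each token."""
    vector = [0.0] * n_buckets
    for token in tokens:
        # A stable hash, so the same word always lands in the same slot.
        index = zlib.crc32(token.encode("utf-8")) % n_buckets
        vector[index] += 1.0
    return vector

tokens = ["good", "good", "bad"]
vector = hash_vectorize(tokens)
```

Two different words can land in the same slot, which is the collision risk the slides mention. Choosing a larger `n_buckets` makes collisions less likely at the cost of a longer vector.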