Vectorization
Core Concepts in Data Mining
Georgia Tech – CSE6242 – March 2015
Josh Patterson
Presenter: Josh Patterson
• Email:
– josh@pattersonconsultingtn.com
• Twitter:
– @jpatanooga
• Github:
– https://github.com/jpatanooga
Past:
• Published in IAAI-09: “TinyTermite: A Secure Routing Algorithm”
• Grad work in meta-heuristics, ant algorithms
• Tennessee Valley Authority (TVA): Hadoop and the Smartgrid
• Cloudera: Principal Solution Architect
Today:
• Patterson Consulting
Topic Index
• Why Vectorization?
• Vector Space Model
• Text Vectorization
• General Vectorization
WHY VECTORIZATION?
“How is it possible for a slow, tiny brain, whether biological or
electronic, to perceive, understand, predict, and manipulate a
world far larger and more complicated than itself?”
--- Peter Norvig, “Artificial Intelligence: A Modern Approach”
Classic Scenario:
“Classify some tweets for positive vs. negative sentiment”
What Needs to Happen?
• Need each tweet as some structure that can be
fed to a learning algorithm
– To represent the knowledge of a “negative” vs. a
“positive” tweet
• How does that happen?
– We need to take the raw text and convert it into what
is called a “vector”
• The term “vector” comes from the fundamentals of
linear algebra
– “Solving sets of linear equations”
Wait. What’s a Vector Again?
• An array of floating point numbers
• Represents data
– Text
– Audio
– Image
• Example:
– [ 1.0, 0.0, 1.0, 0.5 ] (see the sketch below)
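To make the example above concrete, here is a minimal sketch (plain Python, purely illustrative) showing that a vector is nothing more than an ordered array of floating point values:

```python
# A "vector" is simply an ordered array of floating point numbers.
# The same slots could hold text counts, audio samples, or pixel
# intensities -- the learning algorithm only ever sees the numbers.
example_vector = [1.0, 0.0, 1.0, 0.5]

print(len(example_vector))  # dimensionality of the vector: 4
print(example_vector[2])    # value held in slot 2: 1.0
```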
VECTOR SPACE MODEL
“I am putting myself to the fullest possible use, which is
all I think that any conscious entity can ever hope to do.”
--- HAL 9000, 2001: A Space Odyssey
Vector Space Model
• Common way of vectorizing text
– every possible word is mapped to a specific integer
• If we have a large enough array then every word
fits into a unique slot in the array
– the value at that index is the number of times the
word occurs
• Most often our array size is less than our corpus
vocabulary
– so we need a “vectorization strategy” to account for
this (see the sketch below)
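A minimal sketch of the model described above, using a tiny made-up corpus: every distinct word gets its own integer slot, and the value in that slot is how many times the word occurs in the document.

```python
# Tiny, hypothetical corpus used only for illustration.
corpus = ["the cat sat on the mat", "the dog sat"]

# Map every distinct word to a unique integer index.
vocabulary = {}
for document in corpus:
    for word in document.split():
        if word not in vocabulary:
            vocabulary[word] = len(vocabulary)

def vectorize(document):
    """Term-count vector with one slot per vocabulary word."""
    vector = [0.0] * len(vocabulary)
    for word in document.split():
        if word in vocabulary:
            vector[vocabulary[word]] += 1.0
    return vector

print(vocabulary)                # e.g. {'the': 0, 'cat': 1, 'sat': 2, ...}
print(vectorize("the cat sat"))  # 1.0 in the slots for "the", "cat", "sat"
```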
Text Vectorization Can Include Several Stages
• Sentence Segmentation
– can skip straight to tokenization depending on use case
• Tokenization
– find individual words
• Lemmatization
– finding the base or stem of words
• Removing stop words
– “the”, “and”, etc.
• Vectorization
– we take the output of the earlier stages and turn it into an
array of floating point values (see the pipeline sketch below)
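The stages above can be strung together into a small pipeline. Everything here is deliberately simplified and assumed for illustration: the sentence splitter is naive, the “stemmer” is a suffix-stripper standing in for real lemmatization, and the stop-word list is abbreviated.

```python
STOP_WORDS = {"the", "and", "a", "is", "on"}     # abbreviated stop-word list

def segment_sentences(text):
    # Sentence segmentation: naive split on periods.
    return [s.strip() for s in text.split(".") if s.strip()]

def tokenize(sentence):
    # Tokenization: find individual words.
    return sentence.lower().split()

def stem(word):
    # Stand-in for lemmatization: strip a trailing "s" if present.
    return word[:-1] if word.endswith("s") else word

def preprocess(text):
    tokens = []
    for sentence in segment_sentences(text):
        for word in tokenize(sentence):
            if word not in STOP_WORDS:           # remove stop words
                tokens.append(stem(word))
    return tokens                                # ready to be vectorized

print(preprocess("The cats sat on the mat. The dog barks."))
# ['cat', 'sat', 'mat', 'dog', 'bark']
```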
TEXT VECTORIZATION STRATEGIES
“A man who carries a cat by the tail learns something he can learn
in no other way.”
--- Mark Twain
Bag of Words
• A group of words or a document is represented as a bag
– or “multi-set” of its words
• Bag of words is a list of words and their word counts
– simplest vector model
– but can end up using a lot of columns due to the number of
words involved
• Grammar and word ordering are ignored
– but we still track how many times each word occurs in the
document
• Has been used most frequently in the document
classification and information retrieval domains
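Under the same simplifications as before, a bag of words is just the multiset of token counts. A minimal sketch (Python's `collections.Counter` is used here as the multiset):

```python
from collections import Counter

def bag_of_words(tokens):
    # Grammar and word order are discarded; only per-word counts remain.
    return Counter(tokens)

tokens = "good movie really good acting".split()
print(bag_of_words(tokens))
# Counter({'good': 2, 'movie': 1, 'really': 1, 'acting': 1})
```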
Term frequency inverse document
frequency (TF-IDF)
• Fixes some issues with “bag of words”
• Allows us to leverage the information about
how often a word occurs in a document (TF)
– while considering the frequency of the word in the
corpus to control for the fact that some words
will be more common than others (IDF)
• More accurate than the basic bag-of-words
model
– but computationally more expensive (a worked sketch follows below)
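A hedged sketch of the computation, using the common tf × log(N / df) weighting on a made-up corpus; real libraries vary in the exact smoothing and normalization they apply.

```python
import math

# Toy corpus of already-tokenized documents (made up for illustration).
docs = [["good", "movie", "good", "acting"],
        ["bad", "movie"],
        ["good", "acting"]]
n_docs = len(docs)

# Document frequency: in how many documents does each word appear?
df = {}
for doc in docs:
    for word in set(doc):
        df[word] = df.get(word, 0) + 1

def tfidf(doc):
    """TF-IDF weight per word: count(word) * log(N / df(word))."""
    counts = {w: doc.count(w) for w in set(doc)}
    return {w: counts[w] * math.log(n_docs / df[w]) for w in counts}

print(tfidf(docs[0]))
# "good" occurs twice but is common across the corpus, so IDF damps it;
# a word unique to one document would get the largest IDF boost.
```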
Kernel Hashing
• When we want to vectorize the data in a single
pass
– making it a “just in time” vectorizer.
• Can be used when we want to vectorize text right
before we feed it to our learning algorithm.
• We come up with a fixed-size vector that is
typically smaller than the total possible words
that we could index or vectorize
– Then we use a hash function to map each word to an
index into the vector (see the sketch below)
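A minimal sketch of the idea, assuming a toy vector size of 16 and an MD5-based hash chosen purely for stability across runs (production implementations typically use a faster hash such as MurmurHash):

```python
import hashlib

VECTOR_SIZE = 16  # fixed, and smaller than the total possible vocabulary

def slot(word):
    # Stable hash of the word, folded into the fixed-size vector.
    digest = hashlib.md5(word.encode("utf-8")).hexdigest()
    return int(digest, 16) % VECTOR_SIZE

def hash_vectorize(tokens):
    # Single pass: no precursor pass over the corpus is needed.
    vector = [0.0] * VECTOR_SIZE
    for word in tokens:
        vector[slot(word)] += 1.0   # colliding words share a slot
    return vector

print(hash_vectorize("good movie really good acting".split()))
```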
GENERAL VECTORIZATION STRATEGIES
“Everybody good? Plenty of slaves for my robot colony?”
--- TARS, Interstellar
Four Major Attribute Types
• Nominal
– Ex: “sunny”, “overcast”, and “rainy”
• Ordinal
– Like nominal but with order
• Interval
– values expressed in fixed, equal units (e.g. “year”)
• Ratio
– the scheme defines a true zero point, and values are
measured as distances from that fixed zero point
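As a small illustration of how each attribute type might land in a vector (the specific encodings below are one common choice, assumed for illustration, not the only option):

```python
# Nominal: no inherent order, so one-hot encode the categories.
OUTLOOK = ["sunny", "overcast", "rainy"]
def encode_outlook(value):
    return [1.0 if value == v else 0.0 for v in OUTLOOK]

# Ordinal: order matters, so ordered numeric codes are reasonable.
SIZE = {"small": 0.0, "medium": 1.0, "large": 2.0}

# Interval: fixed, equal units but no true zero (e.g. a calendar year).
year = 2015.0

# Ratio: a true zero point, so distances and ratios are meaningful.
distance_km = 42.0

print(encode_outlook("overcast") + [SIZE["large"], year, distance_km])
# [0.0, 1.0, 0.0, 2.0, 2015.0, 42.0]
```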
Techniques of Feature Engineering
• Taking the values directly from the attribute unchanged
– If the value is something we can use out of the box
• Feature scaling
– standardization or normalization of an attribute
(see the sketch below)
• Binarization of features
– 0 or 1
• Dimensionality reduction
– Use only the most interesting features
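A hedged sketch of the scaling and binarization techniques listed above (min-max normalization and z-score standardization are two common variants; the helper names are made up):

```python
def normalize(values):
    # Min-max normalization: rescale values into [0, 1].
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    # Standardization: subtract the mean, divide by the standard deviation.
    mean = sum(values) / len(values)
    std = (sum((v - mean) ** 2 for v in values) / len(values)) ** 0.5
    return [(v - mean) / std for v in values]

def binarize(values, threshold):
    # Binarization: 1 above the threshold, 0 otherwise.
    return [1.0 if v > threshold else 0.0 for v in values]

raw = [2.0, 4.0, 6.0, 8.0]
print(normalize(raw))       # [0.0, 0.33..., 0.66..., 1.0]
print(standardize(raw))     # zero mean, unit variance
print(binarize(raw, 5.0))   # [0.0, 0.0, 1.0, 1.0]
```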
Canova
• Command Line Based
– We don’t want to write custom code for every dataset
• Examples of Usage
– Convert the MNIST dataset from raw binary files to
the svmLight text format.
– Convert raw text into TF-IDF-based vectors in a text
vector format {svmLight, ARFF}
• Scales out on multiple runtimes
– Local, Hadoop
• Open Source, ASF 2.0 Licensed
– https://github.com/deeplearning4j/Canova


Editor's Notes

• #15 The advantage of kernel hashing is that we don’t need the precursor pass over the corpus that TF-IDF requires, but we run the risk of collisions between words. In practice these collisions occur very infrequently and don’t have a noticeable impact on learning performance.
• #18 Feature scaling (or “feature normalization”) can improve the convergence speed of certain algorithms (for example, stochastic gradient descent). When we “standardize” a vector we subtract a measure of location (minimum, maximum, median, etc.) and then divide by a measure of scale (variance, standard deviation, range, etc.). Another method of feature normalization is “pre-whitening”: a decorrelation transformation that makes the inputs independent by transforming them against the input covariance matrix. The transformation is called “pre-whitening” because it turns the input vector into a white noise vector.
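For the pre-whitening mentioned in note #18, a minimal sketch of one common form (ZCA-style whitening via the eigendecomposition of the covariance matrix; NumPy and the toy data are assumptions for illustration):

```python
import numpy as np

def prewhiten(X):
    """Decorrelate the input features: center them, then rotate and
    rescale by the inverse square root of the covariance matrix so
    the result has (approximately) identity covariance."""
    Xc = X - X.mean(axis=0)
    cov = np.cov(Xc, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)
    # Small epsilon keeps near-zero eigenvalues from blowing up.
    W = eigvecs @ np.diag(1.0 / np.sqrt(eigvals + 1e-8)) @ eigvecs.T
    return Xc @ W

X = np.array([[1.0, 2.0], [2.0, 4.1], [3.0, 6.2], [4.0, 7.9]])
print(np.cov(prewhiten(X), rowvar=False))  # approximately the identity matrix
```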