Vectorization
Core Concepts in Data Mining
Georgia Tech – CSE6242 – March 2015
Josh Patterson
Presenter: Josh Patterson
• Email:
– josh@pattersonconsultingtn.com
• Twitter:
– @jpatanooga
• Github:
– https://github.com/jpatanooga
Past
Published in IAAI-09:
“TinyTermite: A Secure Routing Algorithm”
Grad work in Meta-heuristics, Ant-algorithms
Tennessee Valley Authority (TVA)
Hadoop and the Smartgrid
Cloudera
Principal Solution Architect
Today: Patterson Consulting
Topic Index
• Why Vectorization?
• Vector Space Model
• Text Vectorization
• General Vectorization
WHY VECTORIZATION?
“How is it possible for a slow, tiny brain, whether biological or
electronic, to perceive, understand, predict, and manipulate a
world far larger and more complicated than itself?”
--- Stuart Russell and Peter Norvig, “Artificial Intelligence: A Modern Approach”
Classic Scenario:
“Classify some tweets
for positive vs
negative sentiment”
What Needs to Happen?
• Need each tweet as some structure that can be
fed to a learning algorithm
– To represent the knowledge of “negative” vs
“positive” tweet
• How does that happen?
– We need to take the raw text and convert it into what
is called a “vector”
• Vector relates to the fundamentals of linear
algebra
– “Solving sets of linear equations”
Wait. What’s a Vector Again?
• An array of floating point numbers
• Represents data
– Text
– Audio
– Image
• Example:
–[ 1.0, 0.0, 1.0, 0.5 ]
VECTOR SPACE MODEL
“I am putting myself to the fullest possible use, which is
all I think that any conscious entity can ever hope to do.”
--- HAL 9000, “2001: A Space Odyssey”
Vector Space Model
• Common way of vectorizing text
– every possible word is mapped to a specific integer
• If we have a large enough array then every word
fits into a unique slot in the array
– value at that index is the number of times the
word occurs
• Most often our array is smaller than our corpus
vocabulary
– so we have to have a “vectorization strategy” to
account for this
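A minimal sketch of the mapping described above, assuming whitespace tokenization and an array large enough to hold the whole vocabulary. The function names `build_vocabulary` and `vectorize` are hypothetical and chosen only for illustration.

```python
from collections import Counter


def build_vocabulary(documents):
    """Map every distinct word in the corpus to a unique integer index."""
    vocabulary = {}
    for document in documents:
        for word in document.lower().split():
            if word not in vocabulary:
                vocabulary[word] = len(vocabulary)
    return vocabulary


def vectorize(document, vocabulary):
    """Return a list of floats; slot i holds the count of the word mapped to i."""
    vector = [0.0] * len(vocabulary)
    for word, count in Counter(document.lower().split()).items():
        index = vocabulary.get(word)
        if index is not None:  # words outside the vocabulary are dropped
            vector[index] = float(count)
    return vector


corpus = ["good movie good plot", "bad movie"]
vocab = build_vocabulary(corpus)
print(vocab)                              # {'good': 0, 'movie': 1, 'plot': 2, 'bad': 3}
print(vectorize("good good bad", vocab))  # [2.0, 0.0, 0.0, 1.0]
```

When the array is smaller than the vocabulary, this one-slot-per-word scheme no longer works as written, which is where a vectorization strategy comes in.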
Text Processing Can Include Several Stages
• Sentence Segmentation
– can skip straight to tokenization depending on use case
• Tokenization
– find individual words
• Lemmatization
– finding the base or stem of words
• Removing Stop words
– “the”, “and”, etc
• Vectorization
– we take the output of the process and make an array of
floating point values
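The stages above can be sketched with the standard library alone. This is an illustration under loud assumptions: the stop word list is a tiny made-up sample, and `crude_stem` is a naive suffix stripper standing in for real lemmatization, which normally needs a dictionary-backed NLP library.

```python
import re

STOP_WORDS = {"the", "and", "a", "is", "of", "to"}  # tiny illustrative list


def segment_sentences(text):
    """Naive sentence segmentation: split after '.', '!' or '?'."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def tokenize(sentence):
    """Lowercase the sentence and pull out runs of letters and apostrophes."""
    return re.findall(r"[a-z']+", sentence.lower())


def crude_stem(token):
    """Stand-in for lemmatization: strip a few common suffixes."""
    for suffix in ("ing", "ed", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def preprocess(text):
    """Run segmentation, tokenization, stop word removal and stemming."""
    tokens = []
    for sentence in segment_sentences(text):
        for token in tokenize(sentence):
            if token not in STOP_WORDS:
                tokens.append(crude_stem(token))
    return tokens


print(preprocess("The cats jumped. The dog is barking and howling!"))
# ['cat', 'jump', 'dog', 'bark', 'howl']
```

The resulting token list is what a vectorizer would then turn into an array of floating point values.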
TEXT VECTORIZATION STRATEGIES
“A man who carries a cat by the tail learns something he can learn
in no other way.”
--- Mark Twain
Bag of Words
• A group of words or a document is represented as a bag
– or “multi-set” of its words
• Bag of words is a list of words and their word counts
– simplest vector model
– but can end up using a lot of columns due to number of words
involved.
• Grammar and word order are ignored
– but we still track how many times each word occurs in the
document
• Used most frequently in the document classification
and information retrieval domains
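As a small illustration of the multi-set idea, Python's `collections.Counter` behaves like a bag: it keeps counts and discards order. The helper names `bag_of_words` and `to_row` are hypothetical, and whitespace tokenization is assumed.

```python
from collections import Counter


def bag_of_words(document):
    """Represent a document as a multi-set of its lowercase words."""
    return Counter(document.lower().split())


def to_row(bag, columns):
    """Lay the bag out as one row of counts, one column per known word."""
    return [float(bag[word]) for word in columns]


a = bag_of_words("the dog bit the man")
b = bag_of_words("the man bit the dog")

print(a == b)     # True: word order is ignored
print(a["the"])   # 2: counts are still tracked

columns = sorted(set(a) | set(b))   # one column per distinct word
print(columns)                      # ['bit', 'dog', 'man', 'the']
print(to_row(a, columns))           # [1.0, 1.0, 1.0, 2.0]
```

The two sentences mean different things but produce the same bag, and the column list grows with every new word in the corpus, which is the "lot of columns" cost noted above.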
Term Frequency-Inverse Document Frequency (TF-IDF)
• Fixes some issues with “bag of words”
• Lets us use how often a word occurs in a document (TF)
– while accounting for how frequent the word is across the
corpus, to control for the fact that some words
are more common than others (IDF)
• Often more accurate than the basic bag of words
model
– but computationally more expensive
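A sketch of one common TF-IDF variant, assuming TF is the word count divided by document length and IDF is the natural log of (number of documents / number of documents containing the word). Other weightings and smoothing schemes exist; the slides do not say which one is intended, and the function name `tf_idf` is hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """Return one {word: weight} dict per document."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)

    # Document frequency: in how many documents does each word appear?
    document_frequency = Counter()
    for tokens in tokenized:
        document_frequency.update(set(tokens))

    weights = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights.append({
            word: (count / len(tokens)) * math.log(n_docs / document_frequency[word])
            for word, count in counts.items()
        })
    return weights


docs = ["the cat sat", "the dog barked", "the cat purred"]
weights = tf_idf(docs)
for row in weights:
    print(row)
```

In this toy corpus "the" appears in every document, so its IDF is log(1) = 0 and its weight vanishes, while a word unique to one document, such as "sat", gets the highest weight. The extra cost comes from needing corpus-wide document frequencies before any single document can be weighted.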
Kernel Hashing
• When we want to vectorize the data in a single
pass
– making it a “just in time” vectorizer.
• Can be used when we want to vectorize text right
before we feed it to our learning algorithm.
• We pick a fixed-size vector that is
typically smaller than the total number of words
that we could index or vectorize
– Then we use a hash function to map each word to an index in
the vector.
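A minimal sketch of this approach, often called feature hashing or the "hashing trick". It assumes MD5 as the hash function purely because it is deterministic across runs (Python's built-in `hash()` is salted per process for strings); the vector size of 8 and the helper names are illustrative choices, not taken from the slides.

```python
import hashlib


def stable_hash(token):
    """Deterministic integer hash of a token."""
    digest = hashlib.md5(token.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")


def hash_vectorize(tokens, size=8):
    """Single-pass vectorizer: no vocabulary is built or stored."""
    vector = [0.0] * size
    for token in tokens:
        # Different words may land in the same slot (a collision).
        vector[stable_hash(token) % size] += 1.0
    return vector


vec = hash_vectorize("good movie good plot".split(), size=8)
print(vec)
```

The trade-off is that collisions merge the counts of unrelated words, and the mapping cannot be inverted to recover which word a slot represents.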
GENERAL VECTORIZATION STRATEGIES
“Everybody good? Plenty of slaves for my robot colony?”
--- TARS, Interstellar
Four Major Attribute Types
• Nominal
– Ex: “sunny”, “overcast”, and “rainy”
• Ordinal
– Like nominal but with order
• Interval
– Like ordinal, but measured in fixed and equal units (Ex: “year”)
• Ratio
– scheme defines a zero point and then a distance
from this fixed zero point
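How an attribute is vectorized typically depends on its type. The sketch below assumes one-hot encoding for nominal values and a rank index for ordinal values; the "small/medium/large" scale and the function names are hypothetical examples, not from the slides.

```python
def one_hot(value, categories):
    """Nominal: no order, so each category gets its own 0/1 slot."""
    return [1.0 if value == category else 0.0 for category in categories]


def ordinal_rank(value, ordered_levels):
    """Ordinal: order matters, so encode the position in the ordering."""
    return float(ordered_levels.index(value))


OUTLOOK = ["sunny", "overcast", "rainy"]   # nominal example from the slide
SIZES = ["small", "medium", "large"]       # hypothetical ordinal scale

print(one_hot("overcast", OUTLOOK))   # [0.0, 1.0, 0.0]
print(ordinal_rank("large", SIZES))   # 2.0

# Interval: differences are meaningful, ratios are not.
print(2015 - 2010)                    # 5 (years apart)

# Ratio: a true zero point makes ratios meaningful as well.
print(10.0 / 5.0)                     # 2.0 (twice the distance)
```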
Techniques of Feature Engineering
• Taking the values directly from the attribute unchanged
– If the value is something we can use out of the box
• Feature scaling
– standardization
– or Normalizing an attribute
• Binarization of features
– 0 or 1
• Dimensionality reduction
– Use only the most interesting features
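The scaling and binarization techniques can be sketched in a few lines. This assumes standardization means zero mean and unit (population) standard deviation, and normalization means min-max scaling to [0, 1]; both terms are used with other meanings elsewhere. Dimensionality reduction is left out, and the `ages` data is made up.

```python
import statistics


def standardize(values):
    """Rescale to zero mean and unit standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # assumes the values are not all equal
    return [(v - mean) / std for v in values]


def min_max_normalize(values):
    """Rescale linearly so the smallest value is 0 and the largest is 1."""
    lo, hi = min(values), max(values)  # assumes hi > lo
    return [(v - lo) / (hi - lo) for v in values]


def binarize(values, threshold):
    """Map each value to 1.0 if it exceeds the threshold, else 0.0."""
    return [1.0 if v > threshold else 0.0 for v in values]


ages = [20.0, 30.0, 40.0]
print(standardize(ages))
print(min_max_normalize(ages))   # [0.0, 0.5, 1.0]
print(binarize(ages, 25.0))      # [0.0, 1.0, 1.0]
```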
Canova
• Command Line Based
– We don’t want to write custom code for every dataset
• Examples of Usage
– Convert the MNIST dataset from raw binary files to
the svmLight text format.
– Convert raw text into TF-IDF based vectors in a text
vector format {svmLight, arff}
• Scales out on multiple runtimes
– Local, hadoop
• Open Source, Apache License 2.0
– https://github.com/deeplearning4j/Canova
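For readers unfamiliar with the svmLight text format mentioned above, here is a sketch of how one vector could be written as one line of it: a label followed by sparse `index:value` pairs with 1-based, increasing indices. This illustrates the file format only. It is not Canova's code or command line, and the function name `to_svmlight_line` is hypothetical.

```python
def to_svmlight_line(label, vector):
    """Format one example as 'label index:value ...' (1-based indices)."""
    features = " ".join(
        f"{index}:{value:g}"
        for index, value in enumerate(vector, start=1)
        if value != 0.0  # the format is sparse: zero entries are omitted
    )
    return f"{label} {features}".rstrip()


# The example vector from the "What's a Vector Again?" slide, labeled 1.
print(to_svmlight_line(1, [1.0, 0.0, 1.0, 0.5]))   # 1 1:1 3:1 4:0.5
```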

More Related Content

What's hot

Deep Learning: DL4J and DataVec
Deep Learning: DL4J and DataVecDeep Learning: DL4J and DataVec
Deep Learning: DL4J and DataVec
Josh Patterson
 
Smart Data Conference: DL4J and DataVec
Smart Data Conference: DL4J and DataVecSmart Data Conference: DL4J and DataVec
Smart Data Conference: DL4J and DataVec
Josh Patterson
 
Deep learning on Hadoop/Spark -NextML
Deep learning on Hadoop/Spark -NextMLDeep learning on Hadoop/Spark -NextML
Deep learning on Hadoop/Spark -NextML
Adam Gibson
 
Building NLP solutions using Python
Building NLP solutions using PythonBuilding NLP solutions using Python
Building NLP solutions using Python
botsplash.com
 
Reactconf 2014 - Event Stream Processing
Reactconf 2014 - Event Stream ProcessingReactconf 2014 - Event Stream Processing
Reactconf 2014 - Event Stream Processing
Andy Piper
 
Hacking Lucene and Solr for Fun and Profit
Hacking Lucene and Solr for Fun and ProfitHacking Lucene and Solr for Fun and Profit
Hacking Lucene and Solr for Fun and Profitlucenerevolution
 
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
StampedeCon
 
Tensorflow vs MxNet
Tensorflow vs MxNetTensorflow vs MxNet
Tensorflow vs MxNet
Ashish Bansal
 
(BDT311) Deep Learning: Going Beyond Machine Learning
(BDT311) Deep Learning: Going Beyond Machine Learning(BDT311) Deep Learning: Going Beyond Machine Learning
(BDT311) Deep Learning: Going Beyond Machine Learning
Amazon Web Services
 
Open Source Search FTW
Open Source Search FTWOpen Source Search FTW
Open Source Search FTW
Grant Ingersoll
 
Real Time search using Spark and Elasticsearch
Real Time search using Spark and ElasticsearchReal Time search using Spark and Elasticsearch
Real Time search using Spark and Elasticsearch
Sigmoid
 
DeepLearning4J and Spark: Successes and Challenges - François Garillot
DeepLearning4J and Spark: Successes and Challenges - François GarillotDeepLearning4J and Spark: Successes and Challenges - François Garillot
DeepLearning4J and Spark: Successes and Challenges - François Garillot
Steve Moore
 
Apache Toree
Apache ToreeApache Toree
Apache Toree
Asim Jalis
 
Building a Lightweight Discovery Interface for China's Patents@NYC Solr/Lucen...
Building a Lightweight Discovery Interface for China's Patents@NYC Solr/Lucen...Building a Lightweight Discovery Interface for China's Patents@NYC Solr/Lucen...
Building a Lightweight Discovery Interface for China's Patents@NYC Solr/Lucen...
OpenSource Connections
 
The inherent complexity of stream processing
The inherent complexity of stream processingThe inherent complexity of stream processing
The inherent complexity of stream processing
nathanmarz
 
Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...
Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...
Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...
Lucidworks
 
R, Hadoop and Amazon Web Services
R, Hadoop and Amazon Web ServicesR, Hadoop and Amazon Web Services
R, Hadoop and Amazon Web Services
Portland R User Group
 
LuceneRDD for (Geospatial) Search and Entity Linkage
LuceneRDD for (Geospatial) Search and Entity LinkageLuceneRDD for (Geospatial) Search and Entity Linkage
LuceneRDD for (Geospatial) Search and Entity Linkage
zouzias
 
Natural Language Search in Solr
Natural Language Search in SolrNatural Language Search in Solr
Natural Language Search in Solr
Tommaso Teofili
 

What's hot (20)

Deep Learning: DL4J and DataVec
Deep Learning: DL4J and DataVecDeep Learning: DL4J and DataVec
Deep Learning: DL4J and DataVec
 
Smart Data Conference: DL4J and DataVec
Smart Data Conference: DL4J and DataVecSmart Data Conference: DL4J and DataVec
Smart Data Conference: DL4J and DataVec
 
Deep learning on Hadoop/Spark -NextML
Deep learning on Hadoop/Spark -NextMLDeep learning on Hadoop/Spark -NextML
Deep learning on Hadoop/Spark -NextML
 
Building NLP solutions using Python
Building NLP solutions using PythonBuilding NLP solutions using Python
Building NLP solutions using Python
 
Reactconf 2014 - Event Stream Processing
Reactconf 2014 - Event Stream ProcessingReactconf 2014 - Event Stream Processing
Reactconf 2014 - Event Stream Processing
 
Hacking Lucene and Solr for Fun and Profit
Hacking Lucene and Solr for Fun and ProfitHacking Lucene and Solr for Fun and Profit
Hacking Lucene and Solr for Fun and Profit
 
Go Deep
Go DeepGo Deep
Go Deep
 
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
Getting Started with Keras and TensorFlow - StampedeCon AI Summit 2017
 
Tensorflow vs MxNet
Tensorflow vs MxNetTensorflow vs MxNet
Tensorflow vs MxNet
 
(BDT311) Deep Learning: Going Beyond Machine Learning
(BDT311) Deep Learning: Going Beyond Machine Learning(BDT311) Deep Learning: Going Beyond Machine Learning
(BDT311) Deep Learning: Going Beyond Machine Learning
 
Open Source Search FTW
Open Source Search FTWOpen Source Search FTW
Open Source Search FTW
 
Real Time search using Spark and Elasticsearch
Real Time search using Spark and ElasticsearchReal Time search using Spark and Elasticsearch
Real Time search using Spark and Elasticsearch
 
DeepLearning4J and Spark: Successes and Challenges - François Garillot
DeepLearning4J and Spark: Successes and Challenges - François GarillotDeepLearning4J and Spark: Successes and Challenges - François Garillot
DeepLearning4J and Spark: Successes and Challenges - François Garillot
 
Apache Toree
Apache ToreeApache Toree
Apache Toree
 
Building a Lightweight Discovery Interface for China's Patents@NYC Solr/Lucen...
Building a Lightweight Discovery Interface for China's Patents@NYC Solr/Lucen...Building a Lightweight Discovery Interface for China's Patents@NYC Solr/Lucen...
Building a Lightweight Discovery Interface for China's Patents@NYC Solr/Lucen...
 
The inherent complexity of stream processing
The inherent complexity of stream processingThe inherent complexity of stream processing
The inherent complexity of stream processing
 
Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...
Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...
Where Search Meets Machine Learning: Presented by Diana Hu & Joaquin Delgado,...
 
R, Hadoop and Amazon Web Services
R, Hadoop and Amazon Web ServicesR, Hadoop and Amazon Web Services
R, Hadoop and Amazon Web Services
 
LuceneRDD for (Geospatial) Search and Entity Linkage
LuceneRDD for (Geospatial) Search and Entity LinkageLuceneRDD for (Geospatial) Search and Entity Linkage
LuceneRDD for (Geospatial) Search and Entity Linkage
 
Natural Language Search in Solr
Natural Language Search in SolrNatural Language Search in Solr
Natural Language Search in Solr
 

Viewers also liked

The road to the launch of vectoring in Belgium
The road to the launch of vectoring in BelgiumThe road to the launch of vectoring in Belgium
The road to the launch of vectoring in Belgium
Reinhard Laroy
 
Xgboost
XgboostXgboost
Частные компании: Кипр и Белиз
Частные компании: Кипр и БелизЧастные компании: Кипр и Белиз
Частные компании: Кипр и Белиз
Maxim Shvidkiy
 
Research & Planning Task 3
Research & Planning Task 3Research & Planning Task 3
Research & Planning Task 3
bcasey34
 
Zija International Produces GenM Skincare Products
Zija International Produces GenM Skincare ProductsZija International Produces GenM Skincare Products
Zija International Produces GenM Skincare ProductsKenneth Brailsford
 
51 marketing hang may mac
51 marketing hang may mac51 marketing hang may mac
51 marketing hang may mac
NGOC TRINH NGUYEN DANG
 
Paneles con Manta Filtrante - Serie PG-4
Paneles con Manta Filtrante - Serie PG-4Paneles con Manta Filtrante - Serie PG-4
Paneles con Manta Filtrante - Serie PG-4
MET MANN, Fabricante de Climatización y Ventilación
 
Аутсорсинг лабораторных услуг
Аутсорсинг лабораторных услугАутсорсинг лабораторных услуг
Аутсорсинг лабораторных услуг
BDA
 
Acesso Aberto a publicações e dados: requisitos dos financiadores de ciência ...
Acesso Aberto a publicações e dados: requisitos dos financiadores de ciência ...Acesso Aberto a publicações e dados: requisitos dos financiadores de ciência ...
Acesso Aberto a publicações e dados: requisitos dos financiadores de ciência ...
Pedro Príncipe
 
Hp NLB Singaopre
Hp NLB SingaopreHp NLB Singaopre
Hp NLB Singaopre
Satya Harish
 
Pirita- Kose sügisretk
Pirita- Kose sügisretkPirita- Kose sügisretk
Pirita- Kose sügisretkMairi
 
Tic´s en pedagogia
Tic´s en pedagogiaTic´s en pedagogia
Tic´s en pedagogia
Jorge Aconda
 
Learning design overview
Learning design overviewLearning design overview
Learning design overviewMartin Weller
 
School nr 5
School nr 5School nr 5
School nr 5
Mairi
 

Viewers also liked (16)

Vectorization
VectorizationVectorization
Vectorization
 
The road to the launch of vectoring in Belgium
The road to the launch of vectoring in BelgiumThe road to the launch of vectoring in Belgium
The road to the launch of vectoring in Belgium
 
Xgboost
XgboostXgboost
Xgboost
 
Частные компании: Кипр и Белиз
Частные компании: Кипр и БелизЧастные компании: Кипр и Белиз
Частные компании: Кипр и Белиз
 
Research & Planning Task 3
Research & Planning Task 3Research & Planning Task 3
Research & Planning Task 3
 
Zija International Produces GenM Skincare Products
Zija International Produces GenM Skincare ProductsZija International Produces GenM Skincare Products
Zija International Produces GenM Skincare Products
 
51 marketing hang may mac
51 marketing hang may mac51 marketing hang may mac
51 marketing hang may mac
 
Paneles con Manta Filtrante - Serie PG-4
Paneles con Manta Filtrante - Serie PG-4Paneles con Manta Filtrante - Serie PG-4
Paneles con Manta Filtrante - Serie PG-4
 
Аутсорсинг лабораторных услуг
Аутсорсинг лабораторных услугАутсорсинг лабораторных услуг
Аутсорсинг лабораторных услуг
 
Acesso Aberto a publicações e dados: requisitos dos financiadores de ciência ...
Acesso Aberto a publicações e dados: requisitos dos financiadores de ciência ...Acesso Aberto a publicações e dados: requisitos dos financiadores de ciência ...
Acesso Aberto a publicações e dados: requisitos dos financiadores de ciência ...
 
Hp NLB Singaopre
Hp NLB SingaopreHp NLB Singaopre
Hp NLB Singaopre
 
Pirita- Kose sügisretk
Pirita- Kose sügisretkPirita- Kose sügisretk
Pirita- Kose sügisretk
 
Tic´s en pedagogia
Tic´s en pedagogiaTic´s en pedagogia
Tic´s en pedagogia
 
Learning design overview
Learning design overviewLearning design overview
Learning design overview
 
School nr 5
School nr 5School nr 5
School nr 5
 
Disfrazámonos
DisfrazámonosDisfrazámonos
Disfrazámonos
 

Similar to Vectorization - Georgia Tech - CSE6242 - March 2015

Information Extraction
Information ExtractionInformation Extraction
Information Extraction
ssbd6985
 
Information Extraction
Information ExtractionInformation Extraction
Information Extraction
ssbd6985
 
Information Extraction
Information ExtractionInformation Extraction
Information Extraction
ssbd6985
 
AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101
vincent683379
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
Rahul Jain
 
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
Joaquin Delgado PhD.
 
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
 RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning... RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
S. Diana Hu
 
Text Mining
Text MiningText Mining
Text Mining
sathish sak
 
xAPI Vocabulary Stone Soup: LAK 2016 JISC Learning Analytics Hackathon
xAPI Vocabulary Stone Soup: LAK 2016 JISC Learning Analytics HackathonxAPI Vocabulary Stone Soup: LAK 2016 JISC Learning Analytics Hackathon
xAPI Vocabulary Stone Soup: LAK 2016 JISC Learning Analytics Hackathon
Russell Duhon
 
MRT 2018: reflecting on the past and the present with temporal graph models
MRT 2018: reflecting on the past and the present with temporal graph modelsMRT 2018: reflecting on the past and the present with temporal graph models
MRT 2018: reflecting on the past and the present with temporal graph models
Antonio García-Domínguez
 
Vectors in Search – Towards More Semantic Matching - Simon Hughes, Dice.com
Vectors in Search – Towards More Semantic Matching - Simon Hughes, Dice.com Vectors in Search – Towards More Semantic Matching - Simon Hughes, Dice.com
Vectors in Search – Towards More Semantic Matching - Simon Hughes, Dice.com
Lucidworks
 
Vectors in Search - Towards More Semantic Matching
Vectors in Search - Towards More Semantic MatchingVectors in Search - Towards More Semantic Matching
Vectors in Search - Towards More Semantic Matching
Simon Hughes
 
Searching with vectors
Searching with vectorsSearching with vectors
Searching with vectors
Simon Hughes
 
Haystack 2019 - Search with Vectors - Simon Hughes
Haystack 2019 - Search with Vectors - Simon HughesHaystack 2019 - Search with Vectors - Simon Hughes
Haystack 2019 - Search with Vectors - Simon Hughes
OpenSource Connections
 
Artificial Intelligence: Knowledge Acquisition
Artificial Intelligence: Knowledge AcquisitionArtificial Intelligence: Knowledge Acquisition
Artificial Intelligence: Knowledge Acquisition
The Integral Worm
 
Webinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with SolrWebinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with Solr
Lucidworks
 
Wither OWL
Wither OWLWither OWL
Wither OWL
James Hendler
 
Taming Text
Taming TextTaming Text
Taming Text
Grant Ingersoll
 
Introduction to Text Mining
Introduction to Text MiningIntroduction to Text Mining
Introduction to Text Mining
Minha Hwang
 
Learning Emergent Knowledge from Blog Postings
Learning Emergent Knowledge from Blog PostingsLearning Emergent Knowledge from Blog Postings
Learning Emergent Knowledge from Blog PostingsSaltlux Inc.
 

Similar to Vectorization - Georgia Tech - CSE6242 - March 2015 (20)

Information Extraction
Information ExtractionInformation Extraction
Information Extraction
 
Information Extraction
Information ExtractionInformation Extraction
Information Extraction
 
Information Extraction
Information ExtractionInformation Extraction
Information Extraction
 
AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101AI presentation and introduction - Retrieval Augmented Generation RAG 101
AI presentation and introduction - Retrieval Augmented Generation RAG 101
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
RecSys 2015 Tutorial - Scalable Recommender Systems: Where Machine Learning m...
 
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
 RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning... RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
RecSys 2015 Tutorial – Scalable Recommender Systems: Where Machine Learning...
 
Text Mining
Text MiningText Mining
Text Mining
 
xAPI Vocabulary Stone Soup: LAK 2016 JISC Learning Analytics Hackathon
xAPI Vocabulary Stone Soup: LAK 2016 JISC Learning Analytics HackathonxAPI Vocabulary Stone Soup: LAK 2016 JISC Learning Analytics Hackathon
xAPI Vocabulary Stone Soup: LAK 2016 JISC Learning Analytics Hackathon
 
MRT 2018: reflecting on the past and the present with temporal graph models
MRT 2018: reflecting on the past and the present with temporal graph modelsMRT 2018: reflecting on the past and the present with temporal graph models
MRT 2018: reflecting on the past and the present with temporal graph models
 
Vectors in Search – Towards More Semantic Matching - Simon Hughes, Dice.com
Vectors in Search – Towards More Semantic Matching - Simon Hughes, Dice.com Vectors in Search – Towards More Semantic Matching - Simon Hughes, Dice.com
Vectors in Search – Towards More Semantic Matching - Simon Hughes, Dice.com
 
Vectors in Search - Towards More Semantic Matching
Vectors in Search - Towards More Semantic MatchingVectors in Search - Towards More Semantic Matching
Vectors in Search - Towards More Semantic Matching
 
Searching with vectors
Searching with vectorsSearching with vectors
Searching with vectors
 
Haystack 2019 - Search with Vectors - Simon Hughes
Haystack 2019 - Search with Vectors - Simon HughesHaystack 2019 - Search with Vectors - Simon Hughes
Haystack 2019 - Search with Vectors - Simon Hughes
 
Artificial Intelligence: Knowledge Acquisition
Artificial Intelligence: Knowledge AcquisitionArtificial Intelligence: Knowledge Acquisition
Artificial Intelligence: Knowledge Acquisition
 
Webinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with SolrWebinar: Simpler Semantic Search with Solr
Webinar: Simpler Semantic Search with Solr
 
Wither OWL
Wither OWLWither OWL
Wither OWL
 
Taming Text
Taming TextTaming Text
Taming Text
 
Introduction to Text Mining
Introduction to Text MiningIntroduction to Text Mining
Introduction to Text Mining
 
Learning Emergent Knowledge from Blog Postings
Learning Emergent Knowledge from Blog PostingsLearning Emergent Knowledge from Blog Postings
Learning Emergent Knowledge from Blog Postings
 

More from Josh Patterson

Patterson Consulting: What is Artificial Intelligence?
Patterson Consulting: What is Artificial Intelligence?Patterson Consulting: What is Artificial Intelligence?
Patterson Consulting: What is Artificial Intelligence?
Josh Patterson
 
What is Artificial Intelligence
What is Artificial IntelligenceWhat is Artificial Intelligence
What is Artificial Intelligence
Josh Patterson
 
Modeling Electronic Health Records with Recurrent Neural Networks
Modeling Electronic Health Records with Recurrent Neural NetworksModeling Electronic Health Records with Recurrent Neural Networks
Modeling Electronic Health Records with Recurrent Neural Networks
Josh Patterson
 
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Josh Patterson
 
Hadoop Summit 2014 - San Jose - Introduction to Deep Learning on Hadoop
Hadoop Summit 2014 - San Jose - Introduction to Deep Learning on HadoopHadoop Summit 2014 - San Jose - Introduction to Deep Learning on Hadoop
Hadoop Summit 2014 - San Jose - Introduction to Deep Learning on Hadoop
Josh Patterson
 
MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN
MLConf 2013: Metronome and Parallel Iterative Algorithms on YARNMLConf 2013: Metronome and Parallel Iterative Algorithms on YARN
MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN
Josh Patterson
 
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARNHadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
Josh Patterson
 
Knitting boar atl_hug_jan2013_v2
Knitting boar atl_hug_jan2013_v2Knitting boar atl_hug_jan2013_v2
Knitting boar atl_hug_jan2013_v2Josh Patterson
 
Knitting boar - Toronto and Boston HUGs - Nov 2012
Knitting boar - Toronto and Boston HUGs - Nov 2012Knitting boar - Toronto and Boston HUGs - Nov 2012
Knitting boar - Toronto and Boston HUGs - Nov 2012
Josh Patterson
 
LA HUG Dec 2011 - Recommendation Talk
LA HUG Dec 2011 - Recommendation TalkLA HUG Dec 2011 - Recommendation Talk
LA HUG Dec 2011 - Recommendation Talk
Josh Patterson
 
Oct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on HadoopOct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on Hadoop
Josh Patterson
 
Machine Learning and Hadoop
Machine Learning and HadoopMachine Learning and Hadoop
Machine Learning and Hadoop
Josh Patterson
 
Classification with Naive Bayes
Classification with Naive BayesClassification with Naive Bayes
Classification with Naive Bayes
Josh Patterson
 

More from Josh Patterson (13)

Patterson Consulting: What is Artificial Intelligence?
Patterson Consulting: What is Artificial Intelligence?Patterson Consulting: What is Artificial Intelligence?
Patterson Consulting: What is Artificial Intelligence?
 
What is Artificial Intelligence
What is Artificial IntelligenceWhat is Artificial Intelligence
What is Artificial Intelligence
 
Modeling Electronic Health Records with Recurrent Neural Networks
Modeling Electronic Health Records with Recurrent Neural NetworksModeling Electronic Health Records with Recurrent Neural Networks
Modeling Electronic Health Records with Recurrent Neural Networks
 
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
Chattanooga Hadoop Meetup - Hadoop 101 - November 2014
 
Hadoop Summit 2014 - San Jose - Introduction to Deep Learning on Hadoop
Hadoop Summit 2014 - San Jose - Introduction to Deep Learning on HadoopHadoop Summit 2014 - San Jose - Introduction to Deep Learning on Hadoop
Hadoop Summit 2014 - San Jose - Introduction to Deep Learning on Hadoop
 
MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN
MLConf 2013: Metronome and Parallel Iterative Algorithms on YARNMLConf 2013: Metronome and Parallel Iterative Algorithms on YARN
MLConf 2013: Metronome and Parallel Iterative Algorithms on YARN
 
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARNHadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
Hadoop Summit EU 2013: Parallel Linear Regression, IterativeReduce, and YARN
 
Knitting boar atl_hug_jan2013_v2
Knitting boar atl_hug_jan2013_v2Knitting boar atl_hug_jan2013_v2
Knitting boar atl_hug_jan2013_v2
 
Knitting boar - Toronto and Boston HUGs - Nov 2012
Knitting boar - Toronto and Boston HUGs - Nov 2012Knitting boar - Toronto and Boston HUGs - Nov 2012
Knitting boar - Toronto and Boston HUGs - Nov 2012
 
LA HUG Dec 2011 - Recommendation Talk
LA HUG Dec 2011 - Recommendation TalkLA HUG Dec 2011 - Recommendation Talk
LA HUG Dec 2011 - Recommendation Talk
 
Oct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on HadoopOct 2011 CHADNUG Presentation on Hadoop
Oct 2011 CHADNUG Presentation on Hadoop
 
Machine Learning and Hadoop
Machine Learning and HadoopMachine Learning and Hadoop
Machine Learning and Hadoop
 
Classification with Naive Bayes
Classification with Naive BayesClassification with Naive Bayes
Classification with Naive Bayes
 

Recently uploaded

一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
一比一原版(UPenn毕业证)宾夕法尼亚大学毕业证成绩单
ewymefz
 
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Levelwise PageRank with Loop-Based Dead End Handling Strategy : SHORT REPORT ...
Subhajit Sahu
 
Tabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflowsTabula.io Cheatsheet: automate your data workflows
Tabula.io Cheatsheet: automate your data workflows
alex933524
 
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Algorithmic optimizations for Dynamic Levelwise PageRank (from STICD) : SHORT...
Subhajit Sahu
 
做(mqu毕业证书)麦考瑞大学毕业证硕士文凭证书学费发票原版一模一样
Vectorization - Georgia Tech - CSE6242 - March 2015

  • 1. Vectorization Core Concepts in Data Mining Georgia Tech – CSE6242 – March 2015 Josh Patterson
  • 2. Presenter: Josh Patterson • Email: – josh@pattersonconsultingtn.com • Twitter: – @jpatanooga • Github: – https://github.com/ jpatanooga Past Published in IAAI-09: “TinyTermite: A Secure Routing Algorithm” Grad work in Meta-heuristics, Ant- algorithms Tennessee Valley Authority (TVA) Hadoop and the Smartgrid Cloudera Principal Solution Architect Today: Patterson Consulting
  • 3. Topic Index • Why Vectorization? • Vector Space Model • Text Vectorization • General Vectorization
  • 4. WHY VECTORIZATION? “How is it possible for a slow, tiny brain, whether biological or electronic, to perceive, understand, predict, and manipulate a world far larger and more complicated than itself?” --- Peter Norvig, “Artificial Intelligence: A Modern Approach”
  • 5. Classic Scenario: “Classify some tweets for positive vs negative sentiment”
  • 6. What Needs to Happen? • Need each tweet as some structure that can be fed to a learning algorithm – To represent the knowledge of “negative” vs “positive” tweet • How does that happen? – We need to take the raw text and convert it into what is called a “vector” • Vector relates to the fundamentals of linear algebra – “Solving sets of linear equations”
  • 7. Wait. What’s a Vector Again? • An array of floating point numbers • Represents data – Text – Audio – Image • Example: –[ 1.0, 0.0, 1.0, 0.5 ]
  • 8. VECTOR SPACE MODEL “I am putting myself to the fullest possible use, which is all I think that any conscious entity can ever hope to do.” --- Hal, 2001
  • 9. Vector Space Model • Common way of vectorizing text – every possible word is mapped to a specific integer • If we have a large enough array then every word fits into a unique slot in the array – value at that index is the number of times the word occurs • Most often our array size is less than our corpus vocabulary – so we have to have a “vectorization strategy” to account for this
  • 10. Text Can Include Several Stages • Sentence Segmentation – can skip straight to tokenization depending on use case • Tokenization – find individual words • Lemmatization – finding the base or stem of words • Removing Stop words – “the”, “and”, etc • Vectorization – we take the output of the process and make an array of floating point values
  • 11. TEXT VECTORIZATION STRATEGIES “A man who carries a cat by the tail learns something he can learn in no other way.” --- Mark Twain
  • 12. Bag of Words • A group of words or a document is represented as a bag – or “multi-set” of its words • Bag of words is a list of words and their word counts – simplest vector model – but can end up using a lot of columns due to number of words involved. • Grammar and word ordering are ignored – but we still track how many times the word occurs in the document • Has been used most frequently in the document classification – and information retrieval domains.
  • 13. Term frequency inverse document frequency (TF-IDF) • Fixes some issues with “bag of words” • allows us to leverage the information about how often a word occurs in a document (TF) – while considering the frequency of the word in the corpus to control for the fact that some words will be more common than others (IDF) • more accurate than the basic bag of words model – but computationally more expensive
  • 14. Kernel Hashing • When we want to vectorize the data in a single pass – making it a “just in time” vectorizer. • Can be used when we want to vectorize text right before we feed it to our learning algorithm. • We come up with a fixed sized vector that is typically smaller than the total possible words that we could index or vectorize – Then we use a hash function to create an index into the vector.
  • 15. GENERAL VECTORIZATION STRATEGIES “Everybody good? Plenty of slaves for my robot colony?” --- TARS, Interstellar
  • 16. Four Major Attribute Types • Nominal – Ex: “sunny”, “overcast”, and “rainy” • Ordinal – Like nominal but with order • Interval – “year” but expressed in fixed and equal lengths • Ratio – scheme defines a zero point and then a distance from this fixed zero point
  • 17. Techniques of Feature Engineering • Taking the values directly from the attribute unchanged – If the value is something we can use out of the box • Feature scaling – standardization – or Normalizing an attribute • Binarization of features – 0 or 1 • Dimensionality reduction – Use only the most interesting features
  • 18. Canova • Command Line Based – We don’t want to write custom code for every dataset • Examples of Usage – Convert the MNIST dataset from raw binary files to the svmLight text format. – Convert raw text into TF-IDF based vectors in a text vector format {svmLight, arff} • Scales out on multiple runtimes – Local, hadoop • Open Source, ASF 2.0 Licensed – https://github.com/deeplearning4j/Canova
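
The three text strategies from slides 12 to 14 can be sketched side by side in plain Python. This is only an illustrative sketch, not how Canova implements them: the stop-word list, the function names, the plain log(N / df) form of IDF (one of several common variants), the CRC32 hash, and the vector size of 8 are all assumptions chosen for the example.

```python
import math
import zlib
from collections import Counter

STOP_WORDS = {"the", "and", "a", "of"}  # hypothetical, deliberately tiny stop list


def tokenize(text):
    """Lowercase, split on whitespace, and drop stop words."""
    return [t for t in text.lower().split() if t not in STOP_WORDS]


def bag_of_words(tokens, vocab):
    """Count vector: slot vocab[word] holds how often the word occurs."""
    vec = [0.0] * len(vocab)
    for word, count in Counter(tokens).items():
        if word in vocab:
            vec[vocab[word]] = float(count)
    return vec


def tf_idf(docs_tokens, vocab):
    """Weight raw counts (TF) by log(N / document frequency) (assumed IDF form)."""
    n_docs = len(docs_tokens)
    doc_freq = Counter()
    for tokens in docs_tokens:          # first pass: count documents per word
        doc_freq.update(set(tokens))
    vectors = []
    for tokens in docs_tokens:          # second pass: build weighted vectors
        vec = bag_of_words(tokens, vocab)
        for word, idx in vocab.items():
            if vec[idx] > 0:
                vec[idx] *= math.log(n_docs / doc_freq[word])
        vectors.append(vec)
    return vectors


def hashed_vector(tokens, size=8):
    """Single-pass vectorizer: a hash picks the slot, so no vocabulary is needed.

    Different words may collide in the same slot.
    """
    vec = [0.0] * size
    for word in tokens:
        vec[zlib.crc32(word.encode("utf-8")) % size] += 1.0
    return vec


docs = ["the cat sat", "the cat and the dog", "a dog barked"]
tokens = [tokenize(d) for d in docs]
vocab = {w: i for i, w in enumerate(sorted({t for ts in tokens for t in ts}))}

bow = [bag_of_words(ts, vocab) for ts in tokens]
tfidf = tf_idf(tokens, vocab)
hashed = [hashed_vector(ts) for ts in tokens]
```

Note the trade-off the slides describe: bag of words and TF-IDF need the vocabulary (and, for TF-IDF, the document frequencies) before any vector can be built, while the hashed version can vectorize each document on its own at the cost of possible collisions.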

Editor's Notes

  1. The advantage of kernel hashing is that we don’t need the precursor pass that TF-IDF requires, but we run the risk of collisions between words. The reality is that these collisions occur very infrequently and don’t have a noticeable impact on learning performance.
  2. Feature scaling (or “feature normalization”) can improve the convergence speed of certain algorithms (example: stochastic gradient descent). When we “standardize” a vector we subtract a measure of location (minimum, maximum, median, etc.) and then divide by a measure of scale (variance, standard deviation, range, etc.). Another method of feature normalization is “pre-whitening”. Pre-whitening is a decorrelation transformation that makes the input components uncorrelated by transforming the input against a transformed input covariance matrix. The transformation is called “pre-whitening” because it changes the input vector into a white noise vector.
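
The "subtract a location, divide by a scale" recipe in note 2 can be sketched for a single numeric attribute. This is a minimal sketch under stated assumptions: the function names and sample values are hypothetical, the first function uses mean and standard deviation (the usual z-score choice), and the second uses minimum and range. Binarization from slide 17 is included for comparison.

```python
import statistics


def standardize(values):
    """Z-score: location = mean, scale = population standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    if std == 0:                      # constant attribute: nothing to scale
        return [0.0] * len(values)
    return [(v - mean) / std for v in values]


def min_max_scale(values):
    """Location = minimum, scale = range; maps values into [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:                      # constant attribute: avoid dividing by zero
        return [0.0] * len(values)
    return [(v - lo) / (hi - lo) for v in values]


def binarize(values, threshold):
    """Binarization: 1.0 above the threshold, otherwise 0.0."""
    return [1.0 if v > threshold else 0.0 for v in values]


temps = [10.0, 20.0, 30.0]            # hypothetical attribute values
z_scores = standardize(temps)         # mean 0, standard deviation 1
scaled = min_max_scale(temps)         # [0.0, 0.5, 1.0]
flags = binarize(temps, threshold=15.0)
```

Both scalers guard against a constant attribute, where the chosen measure of scale would be zero.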