FEATURE ENGINEERING FOR
TEXT DATA
Presenter : Shruti Kar
Instructor : Dr. Guozhu Dong
Class : Feature Engineering
(CS 7900-05)
https://cdn-images-1.medium.com/max/2000/1*vXKKe3J-lfi1YQ7HC6onxQ.jpeg
“More data beats clever algorithms, but better data beats more data.” – Peter Norvig.
UNDERSTANDING THE TYPE OF TEXT DATA
TEXT DATA:
• STRUCTURED DATA
• SEMI-STRUCTURED DATA
• UNSTRUCTURED DATA
FEATURES FROM SEMI-STRUCTURED DATA
Examples: books, newspapers, XML documents, PDFs, etc.
Features:
• Table of contents / index
• Glossary
• Titles
• Subheadings
• Text styling (bold, color, and italics)
• Captions on photographs / diagrams
• Tables
• <...> tags in XML documents
Cleaning:
• Convert accented characters
• Expand contractions
• Lowercase
• Repair broken tokens (“C a s a C a f &eacute;” -> “Casa Café”) (Not in example)
Removing:
• Stopwords
• Tags
• Rare words (Not in example)
• Common words (Not in example)
• Non-alphanumeric characters
Roots:
• Spelling correction (Not in example)
• Chop (Not in example)
• Stem (root word)
• Lemmatize (semantic root), e.g., “I am late” -> “I be late”
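A minimal Python sketch of these cleaning, removal, and root-normalization steps. It assumes NLTK is installed with its 'punkt', 'stopwords', and 'wordnet' data downloaded; the contraction map is a tiny illustrative stand-in, not a complete list.

```python
# Minimal text-cleaning sketch (assumes: pip install nltk, plus
# nltk.download('punkt'), nltk.download('stopwords'), nltk.download('wordnet')).
import re
import unicodedata

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Tiny illustrative contraction map (a real one would be much larger).
CONTRACTIONS = {"don't": "do not", "i'm": "i am", "it's": "it is"}
STOPWORDS = set(stopwords.words('english'))

def clean(text):
    # Convert accented characters: "Café" -> "Cafe"
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
    # Lowercase, then expand contractions (before apostrophes are stripped)
    text = text.lower()
    for contraction, full in CONTRACTIONS.items():
        text = text.replace(contraction, full)
    # Remove tags and non-alphanumeric characters
    text = re.sub(r'<[^>]+>', ' ', text)
    text = re.sub(r'[^a-z0-9\s]', ' ', text)
    return text

def to_roots(text, lemmatize=True):
    tokens = [t for t in nltk.word_tokenize(clean(text)) if t not in STOPWORDS]
    if lemmatize:
        wnl = WordNetLemmatizer()
        return [wnl.lemmatize(t, pos='v') for t in tokens]   # semantic root, e.g. "am" -> "be"
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens]                  # crude root, e.g. "movies" -> "movi"

print(to_roots("I'm running LATE to the <b>Casa Café</b>!"))
```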
NATURAL LANGUAGE PROCESSING:
FEATURES FROM UNSTRUCTURED DATA
Tokenizing:
• Tokenize
• N-Grams
• Skip-Grams
• Char-grams
• Affixes (Not in example)
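A small plain-Python sketch of these tokenization-level features; the helper names are illustrative, not from any particular library.

```python
# N-grams, skip-grams, and character n-grams from a token list.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def skip_bigrams(tokens, k):
    # k-skip bigrams: ordered pairs of tokens at most k positions beyond adjacency.
    pairs = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + 2 + k, len(tokens))):
            pairs.append((tokens[i], tokens[j]))
    return pairs

def char_grams(text, n):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

tokens = "the quick brown fox".split()
print(ngrams(tokens, 2))        # [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
print(skip_bigrams(tokens, 1))  # also includes ('the', 'brown'), which plain bigrams miss
print(char_grams("quick", 3))   # ['qui', 'uic', 'ick']
```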
NATURAL LANGUAGE PROCESSING:
FEATURES FROM UNSTRUCTURED DATA
Enrich:
• Entity Insertion/Extraction
"Microsoft Releases Windows" -> "Microsoft(company) releases Windows(application)"
• Parse Trees
"Alice hits Bill" -> Alice/Noun_subject hits/Verb Bill/Noun_object
[('Mark', 'NNP', 'B-PERSON'), ('and', 'CC', 'O'), ('John', 'NNP', 'B-PERSON'), ('are', 'VBP', 'O'), ('working', 'VBG', 'O'),
('at', 'IN', 'O'), ('Google', 'NNP', 'B-ORGANIZATION'), ('.', '.', 'O')]
• Reading Level
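The tagged output above looks like NLTK's (token, POS tag, IOB entity tag) format; a minimal sketch that produces that kind of triple, assuming NLTK with its 'punkt', 'averaged_perceptron_tagger', 'maxent_ne_chunker', and 'words' data installed (the exact tags depend on NLTK's pretrained models).

```python
# Named-entity features as (token, POS, IOB-entity) triples, via NLTK.
import nltk
from nltk.chunk import tree2conlltags

sentence = "Mark and John are working at Google."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)          # part-of-speech tags, e.g. ('Mark', 'NNP')
tree = nltk.ne_chunk(tagged)           # named-entity chunk tree
print(tree2conlltags(tree))
# -> [('Mark', 'NNP', 'B-PERSON'), ('and', 'CC', 'O'), ..., ('Google', 'NNP', 'B-ORGANIZATION'), ('.', '.', 'O')]

# Reading level could be estimated with a library such as textstat
# (e.g. textstat.flesch_reading_ease(text)) -- not shown here.
```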
NATURAL LANGUAGE PROCESSING:
FEATURES FROM UNSTRUCTURED DATA
AFTER PREPROCESSING:
TEXT VECTORIZATION
BAG OF WORDS MODEL:
• The simplest vector space representation is the Bag of Words model.
• Vector space model – a mathematical model that represents unstructured text as numeric vectors,
where each dimension of the vector is a specific feature/attribute.
• Bag of Words model – represents each text document as a numeric vector in which each dimension
is a specific word from the corpus, and the value can be:
• its frequency in the document,
• its occurrence (denoted by 1 or 0), or
• a weighted value.
• Each document is represented literally as a ‘bag’ of its own words, disregarding:
• word order,
• sequences, and
• grammar.
TEXT VECTORIZATION
BAG OF WORDS MODEL:
(1) John likes to watch movies. Mary likes movies too.
(2) John also likes to watch football games.
Tokens (1): “John”, “likes”, “to”, “watch”, “movies”, “Mary”, “likes”, “movies”, “too”
Tokens (2): “John”, “also”, “likes”, “to”, “watch”, “football”, “games”
BOW1 = {“John”:1, “likes”:2, “to”:1, “watch”:1, “movies”:2, “Mary”:1, “too”:1}
BOW2 = {“John”:1, “also”:1, “likes”:1, “to”:1, “watch”:1, “football”:1, “games”:1}
(3) John likes to watch movies. Mary likes movies too. John also likes to watch football games.
BOW3 = {“John”:2, “likes”:3, “to”:2, “watch”:2, “movies”:2, “Mary”:1, “too”:1, “also”:1, “football”:1, “games”:1}

Document-term matrix (rows = document vectors, columns = word vectors):

         John  likes  to  watch  movies  Mary  too  also  football  games
Doc (1)    1     2     1     1      2      1    1     0       0       0
Doc (2)    1     1     1     1      0      0    0     1       1       1
Doc (3)    2     3     2     2      2      1    1     1       1       1
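A minimal sketch of this exact example with scikit-learn's CountVectorizer (assuming scikit-learn ≥ 1.0); note that CountVectorizer lowercases by default and sorts the vocabulary alphabetically, so the columns differ in order from the table above.

```python
# Bag of Words via scikit-learn: rows are document vectors, columns are word counts.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "John likes to watch movies. Mary likes movies too.",
    "John also likes to watch football games.",
]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)        # sparse document-term matrix
print(vectorizer.get_feature_names_out())
# ['also' 'football' 'games' 'john' 'likes' 'mary' 'movies' 'to' 'too' 'watch']
print(bow.toarray())
# [[0 0 0 1 2 1 2 1 1 1]
#  [1 1 1 1 1 0 0 1 0 1]]
```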
TEXT VECTORIZATION
BAG OF WORDS MODEL:
The Bag of Words representation here is simply a matrix of raw term frequencies.
TEXT VECTORIZATION
TF-IDF MODEL:
• Term frequencies are not necessarily the best representation of the text.
• A high raw count does not necessarily mean the corresponding word is more important.
• TF-IDF “normalizes” the term frequency by weighting each term by the inverse of its document frequency.
TF = (number of times term t appears in the document) / (total number of terms in the document)
IDF = log(N / n),
where N is the total number of documents and
n is the number of documents in which term t appears.
TF-IDF = TF × IDF
TEXT VECTORIZATION
TF-IDF MODEL:
TF(This, Document1) = 1/8
TF(This, Document2) = 1/5
IDF(This) = log(2/2) = 0
IDF(Messi) = log(2/1) = 0.301
TF-IDF(This, Document1) = (1/8) × 0 = 0
TF-IDF(This, Document2) = (1/5) × 0 = 0
TF-IDF(Messi, Document1) = (4/8) × 0.301 ≈ 0.15
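A hedged sketch with scikit-learn's TfidfVectorizer. The two documents below are illustrative stand-ins matching the counts above (the slide's original documents are not reproduced here), and scikit-learn's default IDF uses smoothing plus an added 1 and L2-normalizes each row, so its numbers differ slightly from the hand calculation.

```python
# TF-IDF features via scikit-learn (weights differ slightly from the slide's
# log(N/n) definition because sklearn smooths the IDF and L2-normalizes rows).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "This document is about Messi Messi Messi Messi",   # stand-in for Document1 (8 terms)
    "This document is about TF-IDF",                     # stand-in for Document2
]
tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(docs)
print(tfidf.get_feature_names_out())
print(matrix.toarray().round(2))   # 'this' gets a low weight, 'messi' a high one in Document1
```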
TEXT VECTORIZATION
BAG OF N-GRAMS MODEL:
• The Bag of Words model does not consider the order of words,
so different sentences can have exactly the same representation as long as the same words are used.
• Bag of N-Grams model – an extension of the Bag of Words model that leverages N-gram-based features.
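The same CountVectorizer with an ngram_range gives a bag of n-grams; a minimal sketch on two sentences that plain Bag of Words cannot tell apart.

```python
# Bag of N-grams: unigrams + bigrams, so some word order is preserved.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["John likes to watch movies", "Movies John likes to watch"]
vec = CountVectorizer(ngram_range=(1, 2))   # 1-grams and 2-grams
X = vec.fit_transform(docs)
print(vec.get_feature_names_out())
# bigrams such as 'john likes' and 'movies john' now distinguish the two sentences,
# whereas plain Bag of Words would give them identical vectors.
```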
TEXT VECTORIZATION
CO-OCCURRENCE MATRIX:
Corpus =“The quick brown fox jumps over the lazy dog.”
Window size: 2
TEXT VECTORIZATION
CO-OCCURRENCE MATRIX:
Corpus = “He is not lazy. He is intelligent. He is smart”.
Window size: 2
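A minimal plain-Python sketch that builds the window-size-2 co-occurrence counts for this corpus (the co-occurrence matrices shown on the original slides are not reproduced here).

```python
# Word-word co-occurrence counts with a symmetric window of size 2.
from collections import defaultdict

corpus = "He is not lazy. He is intelligent. He is smart."
tokens = [w.strip('.').lower() for w in corpus.split()]

window = 2
cooc = defaultdict(int)
for i, word in enumerate(tokens):
    # count every neighbor within 'window' positions on either side
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if i != j:
            cooc[(word, tokens[j])] += 1

vocab = sorted(set(tokens))
for w in vocab:
    print(w, {c: cooc[(w, c)] for c in vocab if cooc[(w, c)]})
```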
TEXT VECTORIZATION
CO-OCCURRENCE MATRIX:
PMI = Pointwise Mutual Information:
PMI(w, c) = log( P(w, c) / (P(w) · P(c)) ),
where w = word and c = context word.
A larger PMI means a stronger association between w and c.
ISSUE: many entries have PMI(w, c) = log 0, because the pair (w, c) is never observed.
SOLUTIONS:
• Set PMI(w, c) = 0 for all unobserved pairs.
• Drop all entries with PMI < 0 [POSITIVE POINTWISE MUTUAL INFORMATION (PPMI)]
This produces two different vectors for each word:
• one describing the word when it is the ‘target’ word in the window, and
• one describing the word when it is a ‘context’ word in the window.
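A self-contained sketch of turning raw co-occurrence counts into PPMI weights; the small count matrix is a toy example, not taken from the slides.

```python
# Positive PMI (PPMI) from a small word-context count matrix.
import numpy as np

words = ['he', 'is', 'lazy']            # target words (toy example)
contexts = ['he', 'is', 'lazy', 'not']  # context words
counts = np.array([[0., 4., 1., 1.],    # toy co-occurrence counts
                   [4., 0., 1., 1.],
                   [1., 1., 0., 1.]])

total = counts.sum()
p_wc = counts / total                   # joint probabilities P(w, c)
p_w = p_wc.sum(axis=1, keepdims=True)   # marginal P(w)
p_c = p_wc.sum(axis=0, keepdims=True)   # marginal P(c)

with np.errstate(divide='ignore'):      # log(0) -> -inf for unobserved pairs
    pmi = np.log(p_wc / (p_w * p_c))
ppmi = np.maximum(pmi, 0.0)             # clamp negative / -inf entries to 0: PPMI
print(np.round(ppmi, 2))
```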
PREDICTION BASED EMBEDDING:
Prediction based Embedding
• CBOW
• Skip-Gram
The CBOW and Skip-Gram models differ in the input and output of the neural network.
• CBOW: the input is the set of context words within a certain window surrounding a ‘target’ word, and the
output predicts the ‘target’ word, i.e., which word should occupy the target position.
• Skip-Gram: the input is the ‘target’ word at the center of the window, and the output predicts each ‘context’
word around it.
In both cases we learn a word vector w_i and a context vector w̃_i for each word in the vocabulary.
PREDICTION BASED EMBEDDING:
• The goal is to supply training samples and learn the weights.
• The learned weights are then used to predict probabilities for a new input word.
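A minimal sketch with gensim's Word2Vec (assuming gensim ≥ 4, where the size parameter is called vector_size); sg=0 trains CBOW and sg=1 trains Skip-Gram. The tiny corpus is illustrative only.

```python
# CBOW vs. Skip-Gram word embeddings with gensim's Word2Vec.
from gensim.models import Word2Vec

sentences = [
    ["john", "likes", "to", "watch", "movies"],
    ["mary", "likes", "movies", "too"],
    ["john", "also", "likes", "to", "watch", "football", "games"],
]
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)       # CBOW
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)   # Skip-Gram

print(cbow.wv["movies"][:5])                 # learned 50-d vector (first 5 dims)
print(skipgram.wv.most_similar("movies"))    # nearest words by cosine similarity
```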
PARAGRAPH VECTOR:
• Bag of Words – ignores word order / sequence and captures no semantics.
• Bag of N-grams – captures a little local ordering, but suffers from data sparsity and high dimensionality.
• Other methods:
• a weighted average of all the word vectors in the document (loses the order of words),
• combining the word vectors in an order given by a parse tree of the sentence, using
matrix-vector operations (works only for sentences).
• PARAGRAPH VECTOR – applicable to variable-length pieces of texts:
sentences,
paragraphs, and
documents
PARAGRAPH VECTOR:
A framework for learning word vectors. Context of three words
(“the,” “cat,” and “sat”) is used to predict the fourth word
(“on”). The input words are mapped to columns of the matrix
W to predict the output word.
PARAGRAPH VECTOR:
• Distributed Memory Model of Paragraph Vectors (PV-DM)
• Distributed Bag of Words version of Paragraph Vector (PV-DBOW)
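A hedged sketch of paragraph vectors via gensim's Doc2Vec (assuming gensim ≥ 4); dm=1 corresponds to PV-DM and dm=0 to PV-DBOW. The corpus and vector size are illustrative.

```python
# Paragraph vectors (PV-DM and PV-DBOW) with gensim's Doc2Vec.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

texts = [
    "john likes to watch movies",
    "mary likes movies too",
    "john also likes to watch football games",
]
corpus = [TaggedDocument(words=t.split(), tags=[i]) for i, t in enumerate(texts)]

pv_dm = Doc2Vec(corpus, vector_size=50, window=2, min_count=1, dm=1)    # PV-DM
pv_dbow = Doc2Vec(corpus, vector_size=50, min_count=1, dm=0)            # PV-DBOW

print(pv_dm.dv[0][:5])                                   # learned vector for document 0
print(pv_dm.infer_vector("john watches movies".split())) # vector for unseen text
```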
DOCUMENT SIMILARITY:
• Document similarity – using a distance or similarity metric to identify how similar one text document is to
another, based on features extracted from the documents such as bag of words or TF-IDF.
• Pairwise document similarity
• Several similarity and distance metrics:
• cosine distance/similarity
• Euclidean distance
• Manhattan distance
• BM25 similarity
• Jaccard distance
• Levenshtein distance
• Hamming distance
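A minimal sketch of pairwise cosine similarity over TF-IDF features with scikit-learn; the documents are illustrative.

```python
# Pairwise document similarity: cosine similarity over TF-IDF vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "John likes to watch movies. Mary likes movies too.",
    "John also likes to watch football games.",
    "Mary enjoys watching movies.",
]
tfidf = TfidfVectorizer().fit_transform(docs)
sim = cosine_similarity(tfidf)     # 3 x 3 matrix; sim[i, j] is in [0, 1]
print(sim.round(2))
```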
TOPIC MODELS
• We can also use summarization techniques to extract topic- or concept-based features from text documents.
• Topic modeling extracts the key themes or concepts in a corpus of documents, represented as topics.
• Each topic can be represented as a bag or collection of words/terms from the document corpus.
TOPIC MODELS
• Most topic models use matrix decomposition.
• e.g., Latent Semantic Indexing uses Singular Value Decomposition (SVD).
LATENT SEMANTIC INDEXING:
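The LSI worked example from the original slides is not reproduced here, so a small sketch of the idea instead: apply truncated SVD to a TF-IDF document-term matrix to obtain low-dimensional "topic" features (the usual LSA/LSI recipe in scikit-learn; corpus and topic count are illustrative).

```python
# Latent Semantic Indexing / Analysis: SVD over the TF-IDF document-term matrix.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "football players like to watch football games",
    "soccer fans watch football matches",
    "actors star in movies",
    "people like to watch movies",
]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

lsi = TruncatedSVD(n_components=2, random_state=0)   # 2 latent topics
doc_topic = lsi.fit_transform(X)                      # document-topic features
print(doc_topic.round(2))

# Top terms per latent topic (from the topic-term weights).
terms = tfidf.get_feature_names_out()
for k, weights in enumerate(lsi.components_):
    top = weights.argsort()[::-1][:3]
    print(f"topic {k}:", [terms[i] for i in top])
```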
TOPIC MODELS
• Latent Dirichlet Allocation (LDA) – uses a generative probabilistic model.
• Each document consists of a mixture of several topics.
• Each term or word can be assigned to a specific topic.
• Similar to the pLSI model (probabilistic LSI), except that in LDA each latent topic has a Dirichlet prior over it.
LATENT DIRICHLET ALLOCATION:
• Goal: extract K topics from M documents.
TOPIC MODELS
LATENT DIRICHLET ALLOCATION:
• When LDA is applied to a document-term matrix (a TF-IDF or Bag of Words feature matrix), the matrix is
decomposed into two main components:
• a document-topic matrix, which is the feature matrix we are looking for, and
• a topic-term matrix, which helps us inspect the potential topics in the corpus.
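A hedged sketch of that decomposition with scikit-learn's LatentDirichletAllocation applied to a Bag of Words matrix; the corpus and number of topics K are illustrative.

```python
# LDA: document-term counts -> document-topic matrix + topic-term matrix.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "football games football players watch games",
    "fans watch football matches",
    "actors star in movies",
    "people watch movies and films",
]
vec = CountVectorizer(stop_words='english')
X = vec.fit_transform(docs)                  # Bag of Words document-term matrix

K = 2                                        # number of topics to extract
lda = LatentDirichletAllocation(n_components=K, random_state=0)
doc_topic = lda.fit_transform(X)             # document-topic matrix (the features)
topic_term = lda.components_                 # topic-term matrix

terms = vec.get_feature_names_out()
for k in range(K):
    top = topic_term[k].argsort()[::-1][:3]
    print(f"topic {k}:", [terms[i] for i in top])
print(doc_topic.round(2))
```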
CONCLUSION
A typical text-classification pipeline:
• Text Document – structured or unstructured data
• Document Preprocessing – tokenization, stop-word removal, lemmatization, stemming
• Feature Selection – NLP techniques, NER, TF-IDF, Information Gain (IG), BOW, N-gram BOW
• Feature Extraction – word embeddings, GloVe, LSA, LDA
• Text Classification – neural networks, CNN, RNN, LSTM, SVM, Random Forest (RF)
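To tie the pipeline together, a minimal end-to-end sketch: TF-IDF features feeding a linear SVM classifier in a scikit-learn pipeline. The tiny labeled corpus is purely illustrative.

```python
# Minimal text-classification pipeline: vectorization + classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = [
    "great match and amazing goals",        # sports
    "the team won the football game",       # sports
    "a moving film with brilliant acting",  # movies
    "the actors made this movie special",   # movies
]
train_labels = ["sports", "sports", "movies", "movies"]

model = make_pipeline(
    TfidfVectorizer(stop_words='english'),  # preprocessing + feature extraction
    LinearSVC(),                            # text classification
)
model.fit(train_texts, train_labels)
print(model.predict(["an incredible goal decided the game",
                     "the movie had a great cast"]))
```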
Thank You

Editor's Notes

• #7 Flesch Reading Ease score = 206.835 – (1.015 × ASL) – (84.6 × ASW); Flesch-Kincaid Grade Level score = (0.39 × ASL) + (11.8 × ASW) – 15.59, where ASL = average sentence length (the number of words divided by the number of sentences) and ASW = average number of syllables per word (the number of syllables divided by the number of words).
  • #25 The paragraph vectors are also asked to contribute to the prediction task of the next word given many contexts sampled from the paragraph. The contexts are fixed-length and sampled from a sliding window over the paragraph. The paragraph vector is shared across all contexts generated from the same paragraph but not across paragraphs. The word vector matrix W, however, is shared across paragraphs. I.e., the vector for “powerful” is the same for all paragraphs. The paragraph token can be thought of as another word. It acts as a memory that remembers what is missing from the current context – or the topic of the paragraph. For this reason, we often call this model the Distributed Memory Model of Paragraph Vectors (PV-DM).
  • #27 Suppose that there are N paragraphs in the corpus, M words in the vocabulary, and we want to learn paragraph vectors such that each paragraph is mapped to p dimensions and each word is mapped to q dimensions, then the model has the total of N × p + M × q parameters (excluding the softmax parameters)
  • #28 Document similarity is the process of using a distance or similarity based metric that can be used to identify how similar a text document is with any other document(s) based on features extracted from the documents like bag of words or tf-idf.