FEATURE ENGINEERING FOR
TEXT DATA
Presenter : Shruti Kar
Instructor : Dr. Guozhu Dong
Class : Feature Engineering
(CS 7900-05)
https://cdn-images-1.medium.com/max/2000/1*vXKKe3J-lfi1YQ7HC6onxQ.jpeg
“More data beats clever algorithms, but better data beats more data.” – Peter Norvig.
UNDERSTANDING THE TYPE OF TEXT DATA
TEXT DATA:
• STRUCTURED DATA
• SEMI-STRUCTURED DATA
• UNSTRUCTURED DATA
FEATURES FROM SEMI-STRUCTURED DATA
Examples: books, newspapers, XML documents, PDFs, etc.
Features:
• Table of contents / index
• Glossary
• Titles
• Subheadings
• Text styling (bold, color, and italics)
• Captions on photographs / diagrams
• Tables
• <...> tags in XML documents
Cleaning:
• Convert accented characters
• Expand contractions
• Lowercase
• Repair broken tokens (“C a s a C a f &eacute;” -> “Casa Café”) (Not in example)
Removing:
• Stopwords
• Tags
• Rare words (Not in example)
• Common words (Not in example)
• Non-alphanumeric characters
Roots:
• Spelling correction (Not in example)
• Chop (Not in example)
• Stem (root word)
• Lemmatize (semantic root), e.g., “I am late” -> “I be late”
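A minimal Python sketch of these cleaning, removal, and root-normalization steps. It assumes NLTK is installed with its 'punkt', 'stopwords', and 'wordnet' data downloaded; the contraction map is a tiny illustrative stand-in, not a complete list.

```python
# Minimal text-cleaning sketch (assumes: pip install nltk, plus
# nltk.download('punkt'), nltk.download('stopwords'), nltk.download('wordnet')).
import re
import unicodedata

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer

# Tiny illustrative contraction map (a real one would be much larger).
CONTRACTIONS = {"don't": "do not", "i'm": "i am", "it's": "it is"}
STOPWORDS = set(stopwords.words('english'))

def clean(text):
    # Convert accented characters: "Café" -> "Cafe"
    text = unicodedata.normalize('NFKD', text).encode('ascii', 'ignore').decode('ascii')
    # Lowercase, then expand contractions (before apostrophes are stripped)
    text = text.lower()
    for contraction, full in CONTRACTIONS.items():
        text = text.replace(contraction, full)
    # Remove tags and non-alphanumeric characters
    text = re.sub(r'<[^>]+>', ' ', text)
    text = re.sub(r'[^a-z0-9\s]', ' ', text)
    return text

def to_roots(text, lemmatize=True):
    tokens = [t for t in nltk.word_tokenize(clean(text)) if t not in STOPWORDS]
    if lemmatize:
        wnl = WordNetLemmatizer()
        return [wnl.lemmatize(t, pos='v') for t in tokens]   # semantic root, e.g. "am" -> "be"
    stemmer = PorterStemmer()
    return [stemmer.stem(t) for t in tokens]                  # crude root, e.g. "movies" -> "movi"

print(to_roots("I'm running LATE to the <b>Casa Café</b>!"))
```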
NATURAL LANGUAGE PROCESSING:
FEATURES FROM UNSTRUCTURED DATA
Tokenizing:
• Tokenize
• N-Grams
• Skip-Grams
• Char-grams
• Affixes (Not in example)
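A small plain-Python sketch of these tokenization-level features; the helper names are illustrative, not from any particular library.

```python
# N-grams, skip-grams, and character n-grams from a token list.
def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def skip_bigrams(tokens, k):
    # k-skip bigrams: ordered pairs of tokens at most k positions beyond adjacency.
    pairs = []
    for i in range(len(tokens)):
        for j in range(i + 1, min(i + 2 + k, len(tokens))):
            pairs.append((tokens[i], tokens[j]))
    return pairs

def char_grams(text, n):
    return [text[i:i + n] for i in range(len(text) - n + 1)]

tokens = "the quick brown fox".split()
print(ngrams(tokens, 2))        # [('the', 'quick'), ('quick', 'brown'), ('brown', 'fox')]
print(skip_bigrams(tokens, 1))  # also includes ('the', 'brown'), which plain bigrams miss
print(char_grams("quick", 3))   # ['qui', 'uic', 'ick']
```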
NATURAL LANGUAGE PROCESSING:
FEATURES FROM UNSTRUCTURED DATA
Enrich:
• Entity Insertion/Extraction
"Microsoft Releases Windows" -> "Microsoft(company) releases Windows(application)"
• Parse Trees
"Alice hits Bill" -> Alice/Noun_subject hits/Verb Bill/Noun_object
[('Mark', 'NNP', 'B-PERSON'), ('and', 'CC', 'O'), ('John', 'NNP', 'B-PERSON'), ('are', 'VBP', 'O'), ('working', 'VBG', 'O'),
('at', 'IN', 'O'), ('Google', 'NNP', 'B-ORGANIZATION'), ('.', '.', 'O')]
• Reading Level
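The tagged output above looks like NLTK's (token, POS tag, IOB entity tag) format; a minimal sketch that produces that kind of triple, assuming NLTK with its 'punkt', 'averaged_perceptron_tagger', 'maxent_ne_chunker', and 'words' data installed (the exact tags depend on NLTK's pretrained models).

```python
# Named-entity features as (token, POS, IOB-entity) triples, via NLTK.
import nltk
from nltk.chunk import tree2conlltags

sentence = "Mark and John are working at Google."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)          # part-of-speech tags, e.g. ('Mark', 'NNP')
tree = nltk.ne_chunk(tagged)           # named-entity chunk tree
print(tree2conlltags(tree))
# -> [('Mark', 'NNP', 'B-PERSON'), ('and', 'CC', 'O'), ..., ('Google', 'NNP', 'B-ORGANIZATION'), ('.', '.', 'O')]

# Reading level could be estimated with a library such as textstat
# (e.g. textstat.flesch_reading_ease(text)) -- not shown here.
```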
NATURAL LANGUAGE PROCESSING:
FEATURES FROM UNSTRUCTURED DATA
AFTER PREPROCESSING:
TEXT VECTORIZATION
BAG OF WORDS MODEL:
• The simplest vector space representation is the Bag of Words model.
• Vector space model – a mathematical model that represents unstructured text as numeric vectors,
where each dimension of the vector is a specific feature/attribute.
• Bag of Words model – represents each text document as a numeric vector in which each dimension
is a specific word from the corpus, and the value can be:
• its frequency in the document,
• its occurrence (denoted by 1 or 0), or
• a weighted value.
• Each document is represented literally as a ‘bag’ of its own words, disregarding:
• word order,
• sequences, and
• grammar.
TEXT VECTORIZATION
BAG OF WORDS MODEL:
(1) John likes to watch movies. Mary likes movies too.
(2) John also likes to watch football games.
Tokens (1): “John”, “likes”, “to”, “watch”, “movies”, “Mary”, “likes”, “movies”, “too”
Tokens (2): “John”, “also”, “likes”, “to”, “watch”, “football”, “games”
BOW1 = {“John”:1, “likes”:2, “to”:1, “watch”:1, “movies”:2, “Mary”:1, “too”:1}
BOW2 = {“John”:1, “also”:1, “likes”:1, “to”:1, “watch”:1, “football”:1, “games”:1}
(3) John likes to watch movies. Mary likes movies too. John also likes to watch football games.
BOW3 = {“John”:2, “likes”:3, “to”:2, “watch”:2, “movies”:2, “Mary”:1, “too”:1, “also”:1, “football”:1, “games”:1}

Document-term matrix (rows = document vectors, columns = word vectors):

         John  likes  to  watch  movies  Mary  too  also  football  games
Doc (1)    1     2     1     1      2      1    1     0       0       0
Doc (2)    1     1     1     1      0      0    0     1       1       1
Doc (3)    2     3     2     2      2      1    1     1       1       1
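A minimal sketch of this exact example with scikit-learn's CountVectorizer (assuming scikit-learn ≥ 1.0); note that CountVectorizer lowercases by default and sorts the vocabulary alphabetically, so the columns differ in order from the table above.

```python
# Bag of Words via scikit-learn: rows are document vectors, columns are word counts.
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "John likes to watch movies. Mary likes movies too.",
    "John also likes to watch football games.",
]
vectorizer = CountVectorizer()
bow = vectorizer.fit_transform(docs)        # sparse document-term matrix
print(vectorizer.get_feature_names_out())
# ['also' 'football' 'games' 'john' 'likes' 'mary' 'movies' 'to' 'too' 'watch']
print(bow.toarray())
# [[0 0 0 1 2 1 2 1 1 1]
#  [1 1 1 1 1 0 0 1 0 1]]
```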
TEXT VECTORIZATION
BAG OF WORDS MODEL:
The Bag of Words representation here is simply a matrix of raw term frequencies.
TEXT VECTORIZATION
TF-IDF MODEL:
• Term frequencies are not necessarily the best representation of the text.
• A high raw count does not necessarily mean the corresponding word is more important.
• TF-IDF “normalizes” the term frequency by weighting each term by the inverse of its document frequency.
TF = (number of times term t appears in the document) / (total number of terms in the document)
IDF = log(N / n),
where N is the total number of documents and
n is the number of documents in which term t appears.
TF-IDF = TF × IDF
TEXT VECTORIZATION
TF-IDF MODEL:
TF(This, Document1) = 1/8
TF(This, Document2) = 1/5
IDF(This) = log(2/2) = 0
IDF(Messi) = log(2/1) = 0.301
TF-IDF(This, Document1) = (1/8) × 0 = 0
TF-IDF(This, Document2) = (1/5) × 0 = 0
TF-IDF(Messi, Document1) = (4/8) × 0.301 ≈ 0.15
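A hedged sketch with scikit-learn's TfidfVectorizer. The two documents below are illustrative stand-ins matching the counts above (the slide's original documents are not reproduced here), and scikit-learn's default IDF uses smoothing plus an added 1 and L2-normalizes each row, so its numbers differ slightly from the hand calculation.

```python
# TF-IDF features via scikit-learn (weights differ slightly from the slide's
# log(N/n) definition because sklearn smooths the IDF and L2-normalizes rows).
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "This document is about Messi Messi Messi Messi",   # stand-in for Document1 (8 terms)
    "This document is about TF-IDF",                     # stand-in for Document2
]
tfidf = TfidfVectorizer()
matrix = tfidf.fit_transform(docs)
print(tfidf.get_feature_names_out())
print(matrix.toarray().round(2))   # 'this' gets a low weight, 'messi' a high one in Document1
```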
TEXT VECTORIZATION
BAG OF N-GRAMS MODEL:
• The Bag of Words model does not consider the order of words,
so different sentences can have exactly the same representation as long as the same words are used.
• Bag of N-Grams model – an extension of the Bag of Words model that leverages N-gram-based features.
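The same CountVectorizer with an ngram_range gives a bag of n-grams; a minimal sketch on two sentences that plain Bag of Words cannot tell apart.

```python
# Bag of N-grams: unigrams + bigrams, so some word order is preserved.
from sklearn.feature_extraction.text import CountVectorizer

docs = ["John likes to watch movies", "Movies John likes to watch"]
vec = CountVectorizer(ngram_range=(1, 2))   # 1-grams and 2-grams
X = vec.fit_transform(docs)
print(vec.get_feature_names_out())
# bigrams such as 'john likes' and 'movies john' now distinguish the two sentences,
# whereas plain Bag of Words would give them identical vectors.
```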
TEXT VECTORIZATION
CO-OCCURRENCE MATRIX:
Corpus =“The quick brown fox jumps over the lazy dog.”
Window size: 2
TEXT VECTORIZATION
CO-OCCURRENCE MATRIX:
Corpus = “He is not lazy. He is intelligent. He is smart”.
Window size: 2
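A minimal plain-Python sketch that builds the window-size-2 co-occurrence counts for this corpus (the co-occurrence matrices shown on the original slides are not reproduced here).

```python
# Word-word co-occurrence counts with a symmetric window of size 2.
from collections import defaultdict

corpus = "He is not lazy. He is intelligent. He is smart."
tokens = [w.strip('.').lower() for w in corpus.split()]

window = 2
cooc = defaultdict(int)
for i, word in enumerate(tokens):
    # count every neighbor within 'window' positions on either side
    for j in range(max(0, i - window), min(len(tokens), i + window + 1)):
        if i != j:
            cooc[(word, tokens[j])] += 1

vocab = sorted(set(tokens))
for w in vocab:
    print(w, {c: cooc[(w, c)] for c in vocab if cooc[(w, c)]})
```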
TEXT VECTORIZATION
CO-OCCURRENCE MATRIX:
PMI = Pointwise Mutual Information:
PMI(w, c) = log( P(w, c) / (P(w) · P(c)) ),
where w = word and c = context word.
A larger PMI means a stronger association between w and c.
ISSUE: many entries have PMI(w, c) = log 0, because the pair (w, c) is never observed.
SOLUTIONS:
• Set PMI(w, c) = 0 for all unobserved pairs.
• Drop all entries with PMI < 0 [POSITIVE POINTWISE MUTUAL INFORMATION (PPMI)]
This produces two different vectors for each word:
• one describing the word when it is the ‘target’ word in the window, and
• one describing the word when it is a ‘context’ word in the window.
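A self-contained sketch of turning raw co-occurrence counts into PPMI weights; the small count matrix is a toy example, not taken from the slides.

```python
# Positive PMI (PPMI) from a small word-context count matrix.
import numpy as np

words = ['he', 'is', 'lazy']            # target words (toy example)
contexts = ['he', 'is', 'lazy', 'not']  # context words
counts = np.array([[0., 4., 1., 1.],    # toy co-occurrence counts
                   [4., 0., 1., 1.],
                   [1., 1., 0., 1.]])

total = counts.sum()
p_wc = counts / total                   # joint probabilities P(w, c)
p_w = p_wc.sum(axis=1, keepdims=True)   # marginal P(w)
p_c = p_wc.sum(axis=0, keepdims=True)   # marginal P(c)

with np.errstate(divide='ignore'):      # log(0) -> -inf for unobserved pairs
    pmi = np.log(p_wc / (p_w * p_c))
ppmi = np.maximum(pmi, 0.0)             # clamp negative / -inf entries to 0: PPMI
print(np.round(ppmi, 2))
```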
PREDICTION BASED EMBEDDING:
Prediction based Embedding
• CBOW
• Skip-Gram
The CBOW and Skip-Gram models differ in the input and output of the neural network.
• CBOW: the input is the set of context words within a certain window surrounding a ‘target’ word, and the
output predicts the ‘target’ word, i.e., which word should occupy the target position.
• Skip-Gram: the input is the ‘target’ word at the center of the window, and the output predicts each ‘context’
word around it.
In both cases we learn a word vector w_i and a context vector w̃_i for each word in the vocabulary.
PREDICTION BASED EMBEDDING:
• The goal is to supply training samples and learn the weights.
• The learned weights are then used to predict probabilities for a new input word.
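A minimal sketch with gensim's Word2Vec (assuming gensim ≥ 4, where the size parameter is called vector_size); sg=0 trains CBOW and sg=1 trains Skip-Gram. The tiny corpus is illustrative only.

```python
# CBOW vs. Skip-Gram word embeddings with gensim's Word2Vec.
from gensim.models import Word2Vec

sentences = [
    ["john", "likes", "to", "watch", "movies"],
    ["mary", "likes", "movies", "too"],
    ["john", "also", "likes", "to", "watch", "football", "games"],
]
cbow = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=0)       # CBOW
skipgram = Word2Vec(sentences, vector_size=50, window=2, min_count=1, sg=1)   # Skip-Gram

print(cbow.wv["movies"][:5])                 # learned 50-d vector (first 5 dims)
print(skipgram.wv.most_similar("movies"))    # nearest words by cosine similarity
```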
PARAGRAPH VECTOR:
• Bag of Words – ignores word order / sequence and captures no semantics.
• Bag of N-grams – captures a little local ordering, but suffers from data sparsity and high dimensionality.
• Other methods:
• a weighted average of all the word vectors in the document (loses the order of words),
• combining the word vectors in an order given by a parse tree of the sentence, using
matrix-vector operations (works only for sentences).
• PARAGRAPH VECTOR – applicable to variable-length pieces of texts:
sentences,
paragraphs, and
documents
PARAGRAPH VECTOR:
A framework for learning word vectors. Context of three words
(“the,” “cat,” and “sat”) is used to predict the fourth word
(“on”). The input words are mapped to columns of the matrix
W to predict the output word.
PARAGRAPH VECTOR:
• Distributed Memory Model of Paragraph Vectors (PV-DM)
• Distributed Bag of Words version of Paragraph Vector (PV-DBOW)
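A hedged sketch of paragraph vectors via gensim's Doc2Vec (assuming gensim ≥ 4); dm=1 corresponds to PV-DM and dm=0 to PV-DBOW. The corpus and vector size are illustrative.

```python
# Paragraph vectors (PV-DM and PV-DBOW) with gensim's Doc2Vec.
from gensim.models.doc2vec import Doc2Vec, TaggedDocument

texts = [
    "john likes to watch movies",
    "mary likes movies too",
    "john also likes to watch football games",
]
corpus = [TaggedDocument(words=t.split(), tags=[i]) for i, t in enumerate(texts)]

pv_dm = Doc2Vec(corpus, vector_size=50, window=2, min_count=1, dm=1)    # PV-DM
pv_dbow = Doc2Vec(corpus, vector_size=50, min_count=1, dm=0)            # PV-DBOW

print(pv_dm.dv[0][:5])                                   # learned vector for document 0
print(pv_dm.infer_vector("john watches movies".split())) # vector for unseen text
```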
DOCUMENT SIMILARITY:
• Document similarity – using a distance or similarity metric to identify how similar one text document is to
another, based on features extracted from the documents such as bag of words or TF-IDF.
• Pairwise document similarity
• Several similarity and distance metrics:
• cosine distance/similarity
• Euclidean distance
• Manhattan distance
• BM25 similarity
• Jaccard distance
• Levenshtein distance
• Hamming distance
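A minimal sketch of pairwise cosine similarity over TF-IDF features with scikit-learn; the documents are illustrative.

```python
# Pairwise document similarity: cosine similarity over TF-IDF vectors.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "John likes to watch movies. Mary likes movies too.",
    "John also likes to watch football games.",
    "Mary enjoys watching movies.",
]
tfidf = TfidfVectorizer().fit_transform(docs)
sim = cosine_similarity(tfidf)     # 3 x 3 matrix; sim[i, j] is in [0, 1]
print(sim.round(2))
```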
TOPIC MODELS
• We can also use summarization techniques to extract topic- or concept-based features from text documents.
• Topic modeling extracts the key themes or concepts in a corpus of documents, represented as topics.
• Each topic can be represented as a bag or collection of words/terms from the document corpus.
TOPIC MODELS
• Most topic models use matrix decomposition.
• e.g., Latent Semantic Indexing uses Singular Value Decomposition (SVD).
LATENT SEMANTIC INDEXING:
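The LSI worked example from the original slides is not reproduced here, so a small sketch of the idea instead: apply truncated SVD to a TF-IDF document-term matrix to obtain low-dimensional "topic" features (the usual LSA/LSI recipe in scikit-learn; corpus and topic count are illustrative).

```python
# Latent Semantic Indexing / Analysis: SVD over the TF-IDF document-term matrix.
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import TfidfVectorizer

docs = [
    "football players like to watch football games",
    "soccer fans watch football matches",
    "actors star in movies",
    "people like to watch movies",
]
tfidf = TfidfVectorizer()
X = tfidf.fit_transform(docs)

lsi = TruncatedSVD(n_components=2, random_state=0)   # 2 latent topics
doc_topic = lsi.fit_transform(X)                      # document-topic features
print(doc_topic.round(2))

# Top terms per latent topic (from the topic-term weights).
terms = tfidf.get_feature_names_out()
for k, weights in enumerate(lsi.components_):
    top = weights.argsort()[::-1][:3]
    print(f"topic {k}:", [terms[i] for i in top])
```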
TOPIC MODELS
• Latent Dirichlet Allocation (LDA) – uses a generative probabilistic model.
• Each document consists of a mixture of several topics.
• Each term or word can be assigned to a specific topic.
• Similar to the pLSI model (probabilistic LSI), except that in LDA each latent topic has a Dirichlet prior over it.
LATENT DIRICHLET ALLOCATION:
• Goal: extract K topics from M documents.
TOPIC MODELS
LATENT DIRICHLET ALLOCATION:
• When LDA is applied to a document-term matrix (a TF-IDF or Bag of Words feature matrix), the matrix is
decomposed into two main components:
• a document-topic matrix, which is the feature matrix we are looking for, and
• a topic-term matrix, which helps us inspect the potential topics in the corpus.
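A hedged sketch of that decomposition with scikit-learn's LatentDirichletAllocation applied to a Bag of Words matrix; the corpus and number of topics K are illustrative.

```python
# LDA: document-term counts -> document-topic matrix + topic-term matrix.
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = [
    "football games football players watch games",
    "fans watch football matches",
    "actors star in movies",
    "people watch movies and films",
]
vec = CountVectorizer(stop_words='english')
X = vec.fit_transform(docs)                  # Bag of Words document-term matrix

K = 2                                        # number of topics to extract
lda = LatentDirichletAllocation(n_components=K, random_state=0)
doc_topic = lda.fit_transform(X)             # document-topic matrix (the features)
topic_term = lda.components_                 # topic-term matrix

terms = vec.get_feature_names_out()
for k in range(K):
    top = topic_term[k].argsort()[::-1][:3]
    print(f"topic {k}:", [terms[i] for i in top])
print(doc_topic.round(2))
```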
CONCLUSION
A typical text-classification pipeline:
• Text Document – structured or unstructured data
• Document Preprocessing – tokenization, stop-word removal, lemmatization, stemming
• Feature Selection – NLP techniques, NER, TF-IDF, Information Gain (IG), BOW, N-gram BOW
• Feature Extraction – word embeddings, GloVe, LSA, LDA
• Text Classification – neural networks, CNN, RNN, LSTM, SVM, Random Forest (RF)
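To tie the pipeline together, a minimal end-to-end sketch: TF-IDF features feeding a linear SVM classifier in a scikit-learn pipeline. The tiny labeled corpus is purely illustrative.

```python
# Minimal text-classification pipeline: vectorization + classifier.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

train_texts = [
    "great match and amazing goals",        # sports
    "the team won the football game",       # sports
    "a moving film with brilliant acting",  # movies
    "the actors made this movie special",   # movies
]
train_labels = ["sports", "sports", "movies", "movies"]

model = make_pipeline(
    TfidfVectorizer(stop_words='english'),  # preprocessing + feature extraction
    LinearSVC(),                            # text classification
)
model.fit(train_texts, train_labels)
print(model.predict(["an incredible goal decided the game",
                     "the movie had a great cast"]))
```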
Thank You

Editor's Notes

• #7 Flesch Reading Ease score = 206.835 – (1.015 × ASL) – (84.6 × ASW); Flesch-Kincaid Grade Level score = (0.39 × ASL) + (11.8 × ASW) – 15.59, where ASL = average sentence length (the number of words divided by the number of sentences) and ASW = average number of syllables per word (the number of syllables divided by the number of words).
  • #25 The paragraph vectors are also asked to contribute to the prediction task of the next word given many contexts sampled from the paragraph. The contexts are fixed-length and sampled from a sliding window over the paragraph. The paragraph vector is shared across all contexts generated from the same paragraph but not across paragraphs. The word vector matrix W, however, is shared across paragraphs. I.e., the vector for “powerful” is the same for all paragraphs. The paragraph token can be thought of as another word. It acts as a memory that remembers what is missing from the current context – or the topic of the paragraph. For this reason, we often call this model the Distributed Memory Model of Paragraph Vectors (PV-DM).
  • #27 Suppose that there are N paragraphs in the corpus, M words in the vocabulary, and we want to learn paragraph vectors such that each paragraph is mapped to p dimensions and each word is mapped to q dimensions, then the model has the total of N × p + M × q parameters (excluding the softmax parameters)
  • #28 Document similarity is the process of using a distance or similarity based metric that can be used to identify how similar a text document is with any other document(s) based on features extracted from the documents like bag of words or tf-idf.