Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property
Dan Sullivan
Big Data TechCon Boston 2015
*
*
* Emerging Demand for Text Analytics
* Text Mining Techniques
*Sentiment Analysis
*Topic Modeling
*Classification
*Named Entity Recognition
*Event Extraction
* Workflows
* Performance Considerations
*
* First commercial work in natural language
processing in late 1980s
* Document Warehousing and Text Mining, 2001
* Most recent and current text mining work in
life sciences area
* Classification
* Named Entity Recognition
* Event Extraction
* Contact
* dan@dsapptech.com
* @dsapptech
* Linkedin.com/in/dansullivanpdx
*
Discount Code:
DATA35
• Available as book & eBook
• FREE shipping in the U.S.
• EPUB, PDF, and MOBI
eBook formats provided
Also available at booksellers and
online retailers – 35% off discount
only good at informit.com
*
* Emerging Demand for Text Analytics
* Text Mining Techniques
*Sentiment Analysis
*Topic Modeling
*Classification
*Named Entity Recognition
*Event Extraction
* Workflows, Procedures and Governance
* Performance Considerations
*
*Large volumes of
accessible and relevant
texts:
*Social media
*Email
*Patents and research
*Customer
communications
* Use Cases
*Market research
*Brand monitoring
*e-Discovery
*Intellectual property
management
Manual procedures are time
consuming and costly
Volume of literature continues
to grow
Commonly used search
techniques, such as keyword,
similarity searching, metadata
filtering, etc. can still yield
volumes of literature that are
difficult to analyze manually
Some success with popular tools
but limitations
*
* Emerging Demand for Text Analytics
* Text Mining Techniques
*Sentiment Analysis
*Topic Modeling
*Classification
*Named Entity Recognition
*Event Extraction
* Workflows
*Performance Considerations
*
* Analysis of tone or opinion of a
communication
* Polarity:
text → {positive, neutral, negative}
* Categorization:
text → {angry, pleased, confused, …}
* Scale:
text → -10 … +10
* Metadata about context essential
* subject area
* communication medium
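
A minimal sketch of polarity scoring in Python, assuming the TextBlob library listed on the tools slide that follows; TextBlob reports a polarity score in [-1.0, 1.0], which can be thresholded into the {positive, neutral, negative} buckets above. The threshold value is an assumption for illustration.

# Polarity sketch using TextBlob (pip install textblob).
from textblob import TextBlob

def polarity_bucket(text, neutral_band=0.1):   # neutral_band is an assumed threshold
    score = TextBlob(text).sentiment.polarity  # float in [-1.0, 1.0]
    if score > neutral_band:
        return "positive"
    if score < -neutral_band:
        return "negative"
    return "neutral"

print(polarity_bucket("The support team was helpful and fast."))
print(polarity_bucket("The firmware update bricked my device."))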
*
*Keywords
*Lexical Affinity
* Affective Norms for English Words (ANEW)
* Emotional Dimensions
* Arousal
* Dominance
* Valence
*Statistical Classification
*Semantic or Concept-based Classification
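
A sketch of the lexical-affinity idea: average per-word valence from an ANEW-style lexicon. The mini-lexicon below is hypothetical; a real implementation would load the published ANEW ratings (valence, arousal, dominance) instead.

# Toy lexical-affinity scoring with a hypothetical mini-lexicon.
# Real ANEW entries rate valence, arousal, and dominance on a 1-9 scale.
MINI_LEXICON = {
    "love": 8.7, "win": 8.4, "pleased": 8.0,     # hypothetical values
    "delay": 3.5, "angry": 2.9, "failure": 1.7,
}

def mean_valence(text, lexicon=MINI_LEXICON, default=5.0):
    words = [w.strip(".,!?").lower() for w in text.split()]
    scores = [lexicon[w] for w in words if w in lexicon]
    return sum(scores) / len(scores) if scores else default  # 5.0 = neutral midpoint

print(mean_valence("Customers love the new release"))       # high valence
print(mean_valence("Angry users report another failure"))   # low valence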
*
* Use Cases
* Brand monitoring
* Competitive intelligence
* Demographic modeling
* Campaign analysis
* Tools
* RapidMiner
* ViralHeat Sentiment Analysis API
* Python NLTK
* Python TextBlob
* R sentiment package
*
* Emerging Demand for Text Analytics
* Text Mining Techniques
*Sentiment Analysis
*Topic Modeling
*Classification
*Named Entity Recognition
*Event Extraction
* Workflows, Procedures and Governance
* Performance Considerations
*
* Technique for identifying dominant themes
in a document
* Does not require training
* Multiple Algorithms
* Probabilistic Latent Semantic Indexing
(PLSI)
* Latent Dirichlet allocation (LDA)
*Assumptions
*Documents about a mixture of topics
*Words used in document attributable to
topic
Source: http://www.keepcalm-o-matic.co.uk/p/keep-calm-theres-no-training-today/
Example topics from the New York Times business page, April 27, 2015:
Debt, Law, Graduation
Debt, EU, Greece, Euro
EU, Greece, Negotiations, Varoufakis
Source: http://www.nytimes.com/pages/business/index.html April 27, 2015
*
* Topics represented by words; documents about a
set of topics
*Doc 1: 50% politics, 50% presidential
*Doc 2: 25% CPU, 30% memory, 45% I/O
*Doc 3: 30% cholesterol, 40% arteries, 30% heart
* Learning Topics
*Assign each word to a topic
*For each word and topic, compute
* Probability of topic given a document P(topic|doc)
* Probability of word given a topic P(word|topic)
* Reassign word to new topic with probability
P(topic|doc) * P(word|topic)
* Reassignment based on probability that topic T
generated use of word W
Image Source: David Blei, “Probabilistic Topic Models”
http://yosinski.com/mlss12/MLSS-2012-Blei-Probabilistic-Topic-Models/
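
A toy sketch of the reassignment step in the Learning Topics list above: each word occurrence is moved to a topic with probability proportional to P(topic|doc) * P(word|topic), estimated from the current counts. This is a simplification (counts are recomputed once per sweep and the current word is not excluded, as a proper collapsed Gibbs sampler for LDA would do); the corpus, topic count, and smoothing values are assumptions.

# Toy topic reassignment loop (not an optimized LDA implementation).
import random
from collections import defaultdict

docs = [
    "politics president election president".split(),
    "cpu memory io memory cpu".split(),
    "cholesterol arteries heart heart".split(),
]
K = 2                      # number of topics (assumed)
alpha, beta = 0.1, 0.01    # smoothing hyperparameters (assumed)
vocab = sorted({w for d in docs for w in d})

# Random initial topic for every word occurrence
assignments = [[random.randrange(K) for _ in d] for d in docs]

def counts():
    doc_topic = [[0] * K for _ in docs]                  # topic counts per document
    topic_word = [defaultdict(int) for _ in range(K)]    # word counts per topic
    topic_total = [0] * K
    for di, d in enumerate(docs):
        for wi, w in enumerate(d):
            t = assignments[di][wi]
            doc_topic[di][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
    return doc_topic, topic_word, topic_total

for _ in range(50):                                      # a few sampling sweeps
    doc_topic, topic_word, topic_total = counts()
    for di, d in enumerate(docs):
        for wi, w in enumerate(d):
            # proportional to P(topic|doc) * P(word|topic), with smoothing
            weights = [
                (doc_topic[di][t] + alpha) *
                (topic_word[t][w] + beta) / (topic_total[t] + beta * len(vocab))
                for t in range(K)
            ]
            assignments[di][wi] = random.choices(range(K), weights=weights)[0]

for di, d in enumerate(docs):
    print(di, list(zip(d, assignments[di])))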
*
* Use Cases
* Data exploration in large corpus
* Pre-classification analysis
* Identify dominant themes
* Tools
*Stanford Topic Modeling Toolbox
*Mallet (UMass Amherst)
*R package: topicmodels
*Python package: Gensim
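
A short usage sketch with the Gensim package listed above; the toy corpus and num_topics setting are assumptions for illustration.

# LDA with Gensim: build a dictionary, a bag-of-words corpus, then fit the model.
from gensim import corpora, models

texts = [
    ["debt", "law", "graduation", "debt"],
    ["debt", "eu", "greece", "euro"],
    ["eu", "greece", "negotiations", "varoufakis"],
]
dictionary = corpora.Dictionary(texts)                # word <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in texts]   # sparse term counts

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)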
*
* Sentiment Analysis
* Topic Modeling
*
* Emerging Demand for Text Analytics
* Text Mining Techniques
*Sentiment Analysis
*Topic Modeling
*Classification
*Named Entity Recognition
*Event Extraction
* Workflows
* Performance Considerations
* 3 Key Components
* Data
* Representation scheme
* Algorithms
* Data
* Positive examples – Examples from representative
corpus
* Negative examples – Randomly selected from same
publications
* Representation
* TF-IDF
* Vector space representation
* Cosine of vectors as measure of similarity
* Algorithms
* Supervised learning
* SVMs
* Ridge Classifier
* Perceptrons
* kNN
* SGD Classifier
* Naïve Bayes
* Random Forest
* AdaBoost
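
A sketch of the three components wired together with scikit-learn (listed among the tools later in this section): TF-IDF representation plus a linear SVM. The labeled sentences are invented for illustration, and any of the other algorithms above could be swapped in for LinearSVC.

# Text classification sketch: TF-IDF features + linear SVM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

docs = [
    "EspB is translocated into the host cell",        # invented positive example
    "the plasmid was cloned into the PstI site",      # invented negative example
    "the protease influences intestinal colonization",
    "data were log-transformed before analysis",
]
labels = [1, 0, 1, 0]

clf = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("svm", LinearSVC()),
])
clf.fit(docs, labels)
print(clf.predict(["the toxin is secreted into host tissue"]))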
*
Source: Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python:
Analyzing Text with the Natural Language Toolkit. http://www.nltk.org/book/
Support Vector Machine (SVM) is a large-margin classifier
Commonly used in text classification
Initial results based on a life sciences sentence classifier
Image Source:http://en.wikipedia.org/wiki/File:Svm_max_sep_hyperplane_with_margin.png
*
*Term Frequency (TF)
tf(t,d) = # of occurrences of t in d
t is a term
d is a document
*Inverse Document Frequency (IDF)
idf(t,D) = log(N / |{d in D : t in d}|)
D is set of documents
N is the number of documents in D
*TF-IDF = tf(t,d) * idf(t,D)
*TF-IDF is
*large when the term is frequent in the document but rare across the corpus
*small when the term appears in many documents
(see the worked example below)
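
A worked sketch of the definitions above, computed directly over a toy corpus (the documents are assumptions; library implementations such as scikit-learn's TfidfVectorizer apply additional smoothing and normalization).

# Direct implementation of tf(t,d) * idf(t,D) as defined above.
import math

D = [
    "debt law graduation".split(),
    "debt eu greece euro".split(),
    "eu greece negotiations".split(),
]

def tf(t, d):
    return d.count(t)                    # occurrences of term t in document d

def idf(t, D):
    df = sum(1 for d in D if t in d)     # |{d in D : t in d}|
    return math.log(len(D) / df)         # log(N / document frequency)

def tfidf(t, d, D):
    return tf(t, d) * idf(t, D)

print(tfidf("debt", D[0], D))   # in 2 of 3 documents -> lower weight
print(tfidf("law", D[0], D))    # in 1 of 3 documents -> higher weight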
*
* Bag-of-words model
* Ignores structure (syntax) and
meaning (semantics) of sentences
* Representation vector length is the
size of set of unique words in corpus
* Stemming used to remove
morphological differences
* Each word is assigned an index in the
representation vector, V
* The value V[i] is non-zero if word
appears in sentence represented by
vector
* The non-zero value is a function of
the frequency of the word in the
sentence and the frequency of the
term in the corpus
*
Non-VF, Predicted VF:
* “Collectively, these data suggest that EPEC 30-5-1(3) translocates reduced levels of EspB into the host cell.”
* “Data were log-transformed to correct for heterogeneity of the variances where necessary.”
* “Subsequently, the kanamycin resistance cassette from pVK4 was cloned into the PstI site of pMP3, and the resulting plasmid pMP4 was used to target a disruption in the cesF region of EHEC strain 85-170.”
VF, Predicted Non-VF:
* “Here, it is reported that the pO157-encoded Type V-secreted serine protease EspP influences the intestinal colonization of calves.”
* “Here, we report that intragastric inoculation of a Shiga toxin 2 (Stx2)-producing E. coli O157:H7 clinical isolate into infant rabbits led to severe diarrhea and intestinal inflammation but no signs of HUS.”
* “The DsbLI system also comprises a functional redox pair”
* Adding more examples is unlikely to substantially improve results, as seen in the error curve below
[Learning curve plot: Training Error and Validation Error vs. number of training examples (0 to 10,000); error axis 0 to 0.5]
8 Alternative Algorithms
Select the 10,000 most important features using chi-square (sketch below)
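
A sketch of the chi-square selection step noted above, using scikit-learn; k is reduced here because the toy corpus is far smaller than the 10,000-feature setting on the slide, and the labels are invented.

# Chi-square feature selection ahead of classification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

docs = ["secretion system effector", "plasmid cloning protocol",
        "toxin translocation host", "buffer centrifugation protocol"]
labels = [1, 0, 1, 0]                      # invented labels for illustration

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("chi2", SelectKBest(chi2, k=5)),      # slide setting: k=10000 on the full corpus
    ("clf", SGDClassifier()),
])
pipeline.fit(docs, labels)
print(pipeline.predict(["effector translocation into host"]))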
*
* SAS Text Miner
* IBM Text Analytics
* Smartlogic
* Python: scikit-learn
* R: RTextTools
* R: tm
*
* Emerging Demand for Text Analytics
* Text Mining Techniques
*Sentiment Analysis
*Topic Modeling
*Classification
*Named Entity Recognition
*Event Extraction
* Workflows
* Performance Considerations
*
* Process of identifying words and phrases that refer to objects
in specific categories. Also known as:
*Entity identification
*Entity extraction
*Chunking
* Two steps:
* Detect entities
* Classify entities
* Common classes of entities:
* Persons
* Organizations
* Geographic locations
* Dates
* Monetary amounts
*
* Four Broad Techniques
*Linguistic - utilize structure of sentence
* Statistical – detect patterns in training
examples
* Custom patterns – regular expressions
* Dictionaries
*Challenges
*Creating training corpus
*Granularity
*
*Use Cases
* Name normalization
* Entity correlation
*Quantified metrics based on texts
*Building block for event extraction
*Tools
* Stanford Core NLP
* OpenNLP
* Mallet
* Basis Technology
* Lexalytics
* NetOwl
* Cogito API
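
A minimal detect-then-classify sketch using NLTK (mentioned earlier in this deck); the Java tools above, such as Stanford Core NLP and OpenNLP, provide comparable pipelines. The NLTK data packages noted in the comments are assumed to be installed.

# Named entity recognition with NLTK's chunker.
# One-time downloads assumed: punkt, averaged_perceptron_tagger,
# maxent_ne_chunker, words (via nltk.download).
import nltk

sentence = "Dan Sullivan spoke at Big Data TechCon in Boston in April 2015."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)          # part-of-speech tags
tree = nltk.ne_chunk(tagged)           # labels spans as PERSON, ORGANIZATION, GPE, ...

for subtree in tree:
    if hasattr(subtree, "label"):      # chunked subtrees are detected entities
        entity = " ".join(word for word, tag in subtree.leaves())
        print(subtree.label(), entity)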
*
* Emerging Demand for Text Analytics
* Text Mining Techniques
*Sentiment Analysis
*Topic Modeling
*Classification
*Named Entity Recognition
*Event Extraction
* Workflows
* Performance Considerations
*
* Entities and relations between
entities
* Company A acquires Company B
* Engineer A filed patent application
on Topic B on Date C
*Politician P announces A on Twitter
on Date B
* Assign roles to entities
* Assign subtypes
* Link to semantic data
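
None of the event-extraction tools listed on the next slide is shown here; as a stand-in, this toy sketch hard-codes one pattern and assigns roles for the “Company A acquires Company B” example above. Real systems learn or parse such relations rather than relying on a single regular expression.

# Toy pattern-based event extraction: assign roles around a trigger verb.
import re

CAP = r"[A-Z][\w&.]*(?:\s+[A-Z][\w&.]*)*"     # simplistic capitalized-phrase pattern
ACQUISITION = re.compile(
    rf"(?P<acquirer>{CAP})\s+acquir(?:es|ed)\s+(?P<target>{CAP})"
)

def extract_acquisitions(text):
    events = []
    for match in ACQUISITION.finditer(text):
        events.append({
            "type": "Acquisition",                  # event type
            "acquirer": match.group("acquirer"),    # role: agent
            "target": match.group("target"),        # role: theme
        })
    return events

print(extract_acquisitions("Acme Analytics acquired DataCo in 2015."))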
*
* Brenden’s Twitter NLP Tools -
https://github.com/aritter/twitter_nlp
* Alchemy API
* Turku BioNLP Event Extraction Software
* Stanford Biomedical Event Parser
Source: Turku Event Extraction System, http://jbjorne.github.io/TEES/
*
* Classification
* Named Entity Recognition
* Event Extraction
*
* Emerging Demand for Text Analytics
* Text Mining Techniques
*Sentiment Analysis
*Topic Modeling
*Classification
*Named Entity Recognition
*Event Extraction
* Workflows
*Performance Considerations
*
* Document Collection
* Text Extraction
* Pre-processing
* Case conversion
* Punctuation removal
* Stemming
* Normalization
* N-gram analysis
* Analysis
* Term Frequency – Inverse Document Frequency
* Conditional Probabilities and Topic Models
* NER and Entity Extraction
* Integration
* Link to Structured Data
* Augment with additional semantic information
* Utilization
* Improve information retrieval
* Identify brand perception problems
* Assess likelihood of customer churn
* Predict likelihood of …
Collect
Extract &
Pre-Process
Analyze
Integrate
Utilize
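
A sketch of the Extract & Pre-Process stage above: case conversion, punctuation removal, stemming, and n-gram generation with NLTK's Porter stemmer (the punkt tokenizer data is assumed to be downloaded).

# Pre-processing sketch: case conversion, punctuation removal, stemming, n-grams.
import string
from nltk import word_tokenize, ngrams
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def preprocess(text, n=2):
    text = text.lower()                                                 # case conversion
    text = text.translate(str.maketrans("", "", string.punctuation))    # drop punctuation
    tokens = word_tokenize(text)
    stems = [stemmer.stem(tok) for tok in tokens]                       # normalize morphology
    return stems, list(ngrams(stems, n))                                # stems + n-grams

stems, bigrams = preprocess("Customers reported intermittent failures, again!")
print(stems)
print(bigrams)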
*
Source: https://uima.apache.org/
*
* Emerging Demand for Text Analytics
* Text Mining Techniques
*Sentiment Analysis
*Topic Modeling
*Classification
*Named Entity Recognition
*Event Extraction
* Workflows
*Performance Considerations
*
* Scalability
* Multiple language support
* Quality
*Precision
*Recall
* Algorithm selection
* Reliability and timeliness of sources
* Integration rules
* Increase quantity of data (not always helpful; see
error curves)
* Improve quality of data
* Utilize multiple supervised algorithms,
ensemble and non-ensemble
* Use unlabeled data and semi-supervised
techniques
* Feature Selection
* Parameter Tuning
* Feature Engineering
* Given:
* High quality data in sufficient quantity
* State of the art machine learning algorithms
* How to improve results: Change Representation?
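
A sketch of the parameter-tuning item above: grid search over a TF-IDF + linear SVM pipeline with scikit-learn's GridSearchCV. The parameter grid and the tiny labeled set are assumptions; in practice docs and labels come from the corpus described in the classification section.

# Parameter tuning sketch: grid search over vectorizer and classifier settings.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("svm", LinearSVC())])
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],   # unigrams vs. unigrams + bigrams
    "svm__C": [0.1, 1.0, 10.0],               # margin / regularization trade-off
}

docs = ["secretion system effector", "plasmid cloning protocol",
        "toxin translocation host", "buffer centrifugation step"] * 3
labels = [1, 0, 1, 0] * 3                     # invented placeholder data

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(docs, labels)
print(search.best_params_, search.best_score_)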
*
*TF-IDF
*Loss of syntactic and
semantic information
*No relation between
term index and meaning
*No support for
disambiguation
*Feature engineering extends the vector representation or substitutes more general terms for specific ones – a crude way to capture semantic properties
*
* Ideal Representation
* Captures semantic similarity of words
* Does not require feature engineering
* Minimal pre-processing, e.g. no mapping to ontologies
* Improves precision and recall
*Words represented as set of
weights in vector
*Useful properties
* Semantically similar words in close
proximity
* Methods for capturing phrases, e.g.
“Secretion system”
* Captures some semantic features
*Trained with
* Skip-gram or CBOW algorithms
* Text, such as PubMed abstracts and
open access papers
*
T. Mikolov et al., “Efficient Estimation of Word Representations in Vector Space,” 2013. http://arxiv.org/pdf/1301.3781.pdf
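
A training sketch with Gensim's Word2Vec (sg=1 selects skip-gram, sg=0 CBOW). The toy sentences stand in for a real corpus such as tokenized PubMed abstracts; note that the vector_size argument is named size in Gensim versions before 4.0.

# Word embedding sketch: skip-gram training with Gensim.
from gensim.models import Word2Vec

sentences = [                                   # stand-in corpus
    ["type", "iii", "secretion", "system"],
    ["effector", "secretion", "into", "host", "cell"],
    ["toxin", "translocation", "into", "host", "cell"],
]

model = Word2Vec(
    sentences,
    vector_size=50,    # named `size` in Gensim < 4.0
    window=3,
    min_count=1,
    sg=1,              # 1 = skip-gram, 0 = CBOW
)
print(model.wv.most_similar("secretion", topn=3))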
*
*
* “Characterization of the Affective Norms for English Words by discrete emotional categories”
  http://indiana.edu/~panlab/papers/SraMjaJtw_ANEW.pdf
* “New Avenues in Opinion Mining and Sentiment Analysis”
  http://sentic.net/new-avenues-in-opinion-mining-and-sentiment-analysis.pdf
* “Empirical Study of Topic Modeling in Twitter”
  http://snap.stanford.edu/soma2010/papers/soma2010_12.pdf
* “Open Domain Event Extraction from Twitter”
  http://turing.cs.washington.edu/papers/kdd12-ritter.pdf

Editor's Notes

  • #28: 1. Process used in VF. 2. No idea why this was labeled as a 1. 3. Probably from a Methods section; refers to a resistance cassette. 4.