Tools and Techniques for Analyzing Texts: Tweets to Intellectual Property
Dan Sullivan
Big Data TechCon Boston 2015
*
*
* Emerging Demand for Text Analytics
* Text Mining Techniques
*Sentiment Analysis
*Topic Modeling
*Classification
*Named Entity Recognition
*Event Extraction
* Workflows
* Performance Considerations
*
* First commercial work in natural language
processing in late 1980s
* Document Warehousing and Text Mining, 2001
* Most recent and current text mining work in
life sciences area
* Classification
* Named Entity Recognition
* Event Extraction
* Contact
* dan@dsapptech.com
* @dsapptech
* Linkedin.com/in/dansullivanpdx
*
Discount Code:
DATA35
• Available as book & eBook
• FREE shipping in the U.S.
• EPUB, PDF, and MOBI
eBook formats provided
Also available at booksellers and
online retailers – 35% off discount
only good at informit.com
*
* Emerging Demand for Text Analytics
* Text Mining Techniques
*Sentiment Analysis
*Topic Modeling
*Classification
*Named Entity Recognition
*Event Extraction
* Workflows, Procedures and Governance
* Performance Considerations
*
*Large volumes of
accessible and relevant
texts:
*Social media
*Email
*Patents and research
*Customer
communications
* Use Cases
*Market research
*Brand monitoring
*e-Discovery
*Intellectual property
management
Manual procedures are time
consuming and costly
Volume of literature continues
to grow
Commonly used search
techniques, such as keyword,
similarity searching, metadata
filtering, etc. can still yield
volumes of literature that are
difficult to analyze manually
Some success with popular tools
but limitations
*
* Emerging Demand for Text Analytics
* Text Mining Techniques
*Sentiment Analysis
*Topic Modeling
*Classification
*Named Entity Recognition
*Event Extraction
* Workflows
*Performance Considerations
*
* Analysis of tone or opinion of a
communication
* Polarity:
text → {positive, neutral, negative}
* Categorization:
text → {angry, pleased, confused, …}
* Scale:
text → -10 … +10
* Metadata about context essential
* subject area
* communication medium
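
A minimal sketch of polarity scoring in Python, assuming the TextBlob library listed on the tools slide that follows; TextBlob reports a polarity score in [-1.0, 1.0], which can be thresholded into the {positive, neutral, negative} buckets above. The threshold value is an assumption for illustration.

# Polarity sketch using TextBlob (pip install textblob).
from textblob import TextBlob

def polarity_bucket(text, neutral_band=0.1):   # neutral_band is an assumed threshold
    score = TextBlob(text).sentiment.polarity  # float in [-1.0, 1.0]
    if score > neutral_band:
        return "positive"
    if score < -neutral_band:
        return "negative"
    return "neutral"

print(polarity_bucket("The support team was helpful and fast."))
print(polarity_bucket("The firmware update bricked my device."))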
*
*Keywords
*Lexical Affinity
* Affective Norms for English Words (ANEW)
* Emotional Dimensions
* Arousal
* Dominance
* Valence
*Statistical Classification
*Semantic or Concept-based Classification
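
A sketch of the lexical-affinity idea: average per-word valence from an ANEW-style lexicon. The mini-lexicon below is hypothetical; a real implementation would load the published ANEW ratings (valence, arousal, dominance) instead.

# Toy lexical-affinity scoring with a hypothetical mini-lexicon.
# Real ANEW entries rate valence, arousal, and dominance on a 1-9 scale.
MINI_LEXICON = {
    "love": 8.7, "win": 8.4, "pleased": 8.0,     # hypothetical values
    "delay": 3.5, "angry": 2.9, "failure": 1.7,
}

def mean_valence(text, lexicon=MINI_LEXICON, default=5.0):
    words = [w.strip(".,!?").lower() for w in text.split()]
    scores = [lexicon[w] for w in words if w in lexicon]
    return sum(scores) / len(scores) if scores else default  # 5.0 = neutral midpoint

print(mean_valence("Customers love the new release"))       # high valence
print(mean_valence("Angry users report another failure"))   # low valence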
*
* Use Cases
* Brand monitoring
* Competitive intelligence
* Demographic modeling
* Campaign analysis
* Tools
* RapidMiner
* ViralHeat Sentiment Analysis API
* Python NLTK
* Python TextBlob
* R sentiment package
*
* Emerging Demand for Text Analytics
* Text Mining Techniques
*Sentiment Analysis
*Topic Modeling
*Classification
*Named Entity Recognition
*Event Extraction
* Workflows, Procedures and Governance
* Performance Considerations
*
* Technique for identifying dominant themes
in a document
* Does not require training
* Multiple Algorithms
* Probabilistic Latent Semantic Indexing
(PLSI)
* Latent Dirichlet allocation (LDA)
*Assumptions
*Documents about a mixture of topics
*Words used in document attributable to
topic
Source: http://www.keepcalm-o-matic.co.uk/p/keep-calm-theres-no-training-today/
Example topics from the New York Times business page, April 27, 2015:
Debt, Law, Graduation
Debt, EU, Greece, Euro
EU, Greece, Negotiations, Varoufakis
Source: http://www.nytimes.com/pages/business/index.html April 27, 2015
*
* Topics represented by words; documents about a
set of topics
*Doc 1: 50% politics, 50% presidential
*Doc 2: 25% CPU, 30% memory, 45% I/O
*Doc 3: 30% cholesterol, 40% arteries, 30% heart
* Learning Topics
*Assign each word to a topic
*For each word and topic, compute
* Probability of topic given a document P(topic|doc)
* Probability of word given a topic P(word|topic)
* Reassign word to new topic with probability
P(topic|doc) * P(word|topic)
* Reassignment based on probability that topic T
generated use of word W
Image Source: David Blei, “Probabilistic Topic Models”
http://yosinski.com/mlss12/MLSS-2012-Blei-Probabilistic-Topic-Models/
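
A toy sketch of the reassignment step in the Learning Topics list above: each word occurrence is moved to a topic with probability proportional to P(topic|doc) * P(word|topic), estimated from the current counts. This is a simplification (counts are recomputed once per sweep and the current word is not excluded, as a proper collapsed Gibbs sampler for LDA would do); the corpus, topic count, and smoothing values are assumptions.

# Toy topic reassignment loop (not an optimized LDA implementation).
import random
from collections import defaultdict

docs = [
    "politics president election president".split(),
    "cpu memory io memory cpu".split(),
    "cholesterol arteries heart heart".split(),
]
K = 2                      # number of topics (assumed)
alpha, beta = 0.1, 0.01    # smoothing hyperparameters (assumed)
vocab = sorted({w for d in docs for w in d})

# Random initial topic for every word occurrence
assignments = [[random.randrange(K) for _ in d] for d in docs]

def counts():
    doc_topic = [[0] * K for _ in docs]                  # topic counts per document
    topic_word = [defaultdict(int) for _ in range(K)]    # word counts per topic
    topic_total = [0] * K
    for di, d in enumerate(docs):
        for wi, w in enumerate(d):
            t = assignments[di][wi]
            doc_topic[di][t] += 1
            topic_word[t][w] += 1
            topic_total[t] += 1
    return doc_topic, topic_word, topic_total

for _ in range(50):                                      # a few sampling sweeps
    doc_topic, topic_word, topic_total = counts()
    for di, d in enumerate(docs):
        for wi, w in enumerate(d):
            # proportional to P(topic|doc) * P(word|topic), with smoothing
            weights = [
                (doc_topic[di][t] + alpha) *
                (topic_word[t][w] + beta) / (topic_total[t] + beta * len(vocab))
                for t in range(K)
            ]
            assignments[di][wi] = random.choices(range(K), weights=weights)[0]

for di, d in enumerate(docs):
    print(di, list(zip(d, assignments[di])))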
*
* Use Cases
* Data exploration in large corpus
* Pre-classification analysis
* Identify dominant themes
* Tools
*Stanford Topic Modeling Toolbox
*Mallet (UMass Amherst)
*R package: topicmodels
*Python package: Gensim
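
A short usage sketch with the Gensim package listed above; the toy corpus and num_topics setting are assumptions for illustration.

# LDA with Gensim: build a dictionary, a bag-of-words corpus, then fit the model.
from gensim import corpora, models

texts = [
    ["debt", "law", "graduation", "debt"],
    ["debt", "eu", "greece", "euro"],
    ["eu", "greece", "negotiations", "varoufakis"],
]
dictionary = corpora.Dictionary(texts)                # word <-> id mapping
corpus = [dictionary.doc2bow(doc) for doc in texts]   # sparse term counts

lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2, passes=10)
for topic_id, words in lda.print_topics():
    print(topic_id, words)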
*
* Sentiment Analysis
* Topic Modeling
*
* Emerging Demand for Text Analytics
* Text Mining Techniques
*Sentiment Analysis
*Topic Modeling
*Classification
*Named Entity Recognition
*Event Extraction
* Workflows
* Performance Considerations
* 3 Key Components
* Data
* Representation scheme
* Algorithms
* Data
* Positive examples – Examples from representative
corpus
* Negative examples – Randomly selected from same
publications
* Representation
* TF-IDF
* Vector space representation
* Cosine of vectors as measure of similarity
* Algorithms
* Supervised learning
* SVMs
* Ridge Classifier
* Perceptrons
* kNN
* SGD Classifier
* Naïve Bayes
* Random Forest
* AdaBoost
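
A sketch of the three components wired together with scikit-learn (listed among the tools later in this section): TF-IDF representation plus a linear SVM. The labeled sentences are invented for illustration, and any of the other algorithms above could be swapped in for LinearSVC.

# Text classification sketch: TF-IDF features + linear SVM.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

docs = [
    "EspB is translocated into the host cell",        # invented positive example
    "the plasmid was cloned into the PstI site",      # invented negative example
    "the protease influences intestinal colonization",
    "data were log-transformed before analysis",
]
labels = [1, 0, 1, 0]

clf = Pipeline([
    ("tfidf", TfidfVectorizer(lowercase=True, stop_words="english")),
    ("svm", LinearSVC()),
])
clf.fit(docs, labels)
print(clf.predict(["the toxin is secreted into host tissue"]))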
*
Source: Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python:
Analyzing Text with the Natural Language Toolkit. http://www.nltk.org/book/
Support Vector Machine (SVM) is a large-margin classifier
Commonly used in text classification
Initial results based on a life sciences sentence classifier
Image Source:http://en.wikipedia.org/wiki/File:Svm_max_sep_hyperplane_with_margin.png
*
*Term Frequency (TF)
tf(t,d) = # of occurrences of t in d
t is a term
d is a document
*Inverse Document Frequency (IDF)
idf(t,D) = log(N / |{d in D : t in d}|)
D is set of documents
N is the number of documents in D
*TF-IDF = tf(t,d) * idf(t,D)
*TF-IDF is
*large when the term is frequent in the document but rare across the corpus
*small when the term appears in many documents
(see the worked example below)
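
A worked sketch of the definitions above, computed directly over a toy corpus (the documents are assumptions; library implementations such as scikit-learn's TfidfVectorizer apply additional smoothing and normalization).

# Direct implementation of tf(t,d) * idf(t,D) as defined above.
import math

D = [
    "debt law graduation".split(),
    "debt eu greece euro".split(),
    "eu greece negotiations".split(),
]

def tf(t, d):
    return d.count(t)                    # occurrences of term t in document d

def idf(t, D):
    df = sum(1 for d in D if t in d)     # |{d in D : t in d}|
    return math.log(len(D) / df)         # log(N / document frequency)

def tfidf(t, d, D):
    return tf(t, d) * idf(t, D)

print(tfidf("debt", D[0], D))   # in 2 of 3 documents -> lower weight
print(tfidf("law", D[0], D))    # in 1 of 3 documents -> higher weight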
*
* Bag-of-words model
* Ignores structure (syntax) and
meaning (semantics) of sentences
* Representation vector length is the
size of set of unique words in corpus
* Stemming used to remove
morphological differences
* Each word is assigned an index in the
representation vector, V
* The value V[i] is non-zero if word
appears in sentence represented by
vector
* The non-zero value is a function of
the frequency of the word in the
sentence and the frequency of the
term in the corpus
*
Non-VF, Predicted VF:
* “Collectively, these data suggest that EPEC 30-5-1(3) translocates reduced levels of EspB into the host cell.”
* “Data were log-transformed to correct for heterogeneity of the variances where necessary.”
* “Subsequently, the kanamycin resistance cassette from pVK4 was cloned into the PstI site of pMP3, and the resulting plasmid pMP4 was used to target a disruption in the cesF region of EHEC strain 85-170.”
VF, Predicted Non-VF:
* “Here, it is reported that the pO157-encoded Type V-secreted serine protease EspP influences the intestinal colonization of calves.”
* “Here, we report that intragastric inoculation of a Shiga toxin 2 (Stx2)-producing E. coli O157:H7 clinical isolate into infant rabbits led to severe diarrhea and intestinal inflammation but no signs of HUS.”
* “The DsbLI system also comprises a functional redox pair”
* Adding more examples is unlikely to substantially improve results, as seen in the error curve below
[Learning curve plot: Training Error and Validation Error vs. number of training examples (0 to 10,000); error axis 0 to 0.5]
8 Alternative Algorithms
Select the 10,000 most important features using chi-square (sketch below)
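
A sketch of the chi-square selection step noted above, using scikit-learn; k is reduced here because the toy corpus is far smaller than the 10,000-feature setting on the slide, and the labels are invented.

# Chi-square feature selection ahead of classification.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import Pipeline

docs = ["secretion system effector", "plasmid cloning protocol",
        "toxin translocation host", "buffer centrifugation protocol"]
labels = [1, 0, 1, 0]                      # invented labels for illustration

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("chi2", SelectKBest(chi2, k=5)),      # slide setting: k=10000 on the full corpus
    ("clf", SGDClassifier()),
])
pipeline.fit(docs, labels)
print(pipeline.predict(["effector translocation into host"]))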
*
* SAS Text Miner
* IBM Text Analytics
* Smartlogic
* Python: scikit-learn
* R: RTextTools
* R: tm
*
* Emerging Demand for Text Analytics
* Text Mining Techniques
*Sentiment Analysis
*Topic Modeling
*Classification
*Named Entity Recognition
*Event Extraction
* Workflows
* Performance Considerations
*
* Process of identifying words and phrases that refer to objects
in specific categories. Also known as:
*Entity identification
*Entity extraction
*Chunking
* Two steps:
* Detect entities
* Classify entities
* Common classes of entities:
* Persons
* Organizations
* Geographic locations
* Dates
* Monetary amounts
*
* Four Broad Techniques
*Linguistic - utilize structure of sentence
* Statistical – detect patterns in training
examples
* Custom patterns – regular expressions
* Dictionaries
*Challenges
*Creating training corpus
*Granularity
*
*Use Cases
* Name normalization
* Entity correlation
*Quantified metrics based on texts
*Building block for event extraction
*Tools
* Stanford Core NLP
* OpenNLP
* Mallet
* Basis Technology
* Lexalytics
* NetOwl
* Cogito API
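
A minimal detect-then-classify sketch using NLTK (mentioned earlier in this deck); the Java tools above, such as Stanford Core NLP and OpenNLP, provide comparable pipelines. The NLTK data packages noted in the comments are assumed to be installed.

# Named entity recognition with NLTK's chunker.
# One-time downloads assumed: punkt, averaged_perceptron_tagger,
# maxent_ne_chunker, words (via nltk.download).
import nltk

sentence = "Dan Sullivan spoke at Big Data TechCon in Boston in April 2015."
tokens = nltk.word_tokenize(sentence)
tagged = nltk.pos_tag(tokens)          # part-of-speech tags
tree = nltk.ne_chunk(tagged)           # labels spans as PERSON, ORGANIZATION, GPE, ...

for subtree in tree:
    if hasattr(subtree, "label"):      # chunked subtrees are detected entities
        entity = " ".join(word for word, tag in subtree.leaves())
        print(subtree.label(), entity)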
*
* Emerging Demand for Text Analytics
* Text Mining Techniques
*Sentiment Analysis
*Topic Modeling
*Classification
*Named Entity Recognition
*Event Extraction
* Workflows
* Performance Considerations
*
* Entities and relations between
entities
* Company A acquires Company B
* Engineer A filed patent application
on Topic B on Date C
*Politician P announces A on Twitter
on Date B
* Assign roles to entities
* Assign subtypes
* Link to semantic data
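
None of the event-extraction tools listed on the next slide is shown here; as a stand-in, this toy sketch hard-codes one pattern and assigns roles for the “Company A acquires Company B” example above. Real systems learn or parse such relations rather than relying on a single regular expression.

# Toy pattern-based event extraction: assign roles around a trigger verb.
import re

CAP = r"[A-Z][\w&.]*(?:\s+[A-Z][\w&.]*)*"     # simplistic capitalized-phrase pattern
ACQUISITION = re.compile(
    rf"(?P<acquirer>{CAP})\s+acquir(?:es|ed)\s+(?P<target>{CAP})"
)

def extract_acquisitions(text):
    events = []
    for match in ACQUISITION.finditer(text):
        events.append({
            "type": "Acquisition",                  # event type
            "acquirer": match.group("acquirer"),    # role: agent
            "target": match.group("target"),        # role: theme
        })
    return events

print(extract_acquisitions("Acme Analytics acquired DataCo in 2015."))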
*
* Brenden’s Twitter NLP Tools -
https://github.com/aritter/twitter_nlp
* Alchemy API
* Turku BioNLP Event Extraction Software
* Stanford Biomedical Event Parser
Source: Turku Event Extraction System, http://jbjorne.github.io/TEES/
*
* Classification
* Named Entity Recognition
* Event Extraction
*
* Emerging Demand for Text Analytics
* Text Mining Techniques
*Sentiment Analysis
*Topic Modeling
*Classification
*Named Entity Recognition
*Event Extraction
* Workflows
*Performance Considerations
*
* Document Collection
* Text Extraction
* Pre-processing
* Case conversion
* Punctuation removal
* Stemming
* Normalization
* N-gram analysis
* Analysis
* Term Frequency – Inverse Document Frequency
* Conditional Probabilities and Topic Models
* NER and Entity Extraction
* Integration
* Link to Structured Data
* Augment with additional semantic information
* Utilization
* Improve information retrieval
* Identify brand perception problems
* Assess likelihood of customer churn
* Predict likelihood of …
Collect
Extract &
Pre-Process
Analyze
Integrate
Utilize
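
A sketch of the Extract & Pre-Process stage above: case conversion, punctuation removal, stemming, and n-gram generation with NLTK's Porter stemmer (the punkt tokenizer data is assumed to be downloaded).

# Pre-processing sketch: case conversion, punctuation removal, stemming, n-grams.
import string
from nltk import word_tokenize, ngrams
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def preprocess(text, n=2):
    text = text.lower()                                                 # case conversion
    text = text.translate(str.maketrans("", "", string.punctuation))    # drop punctuation
    tokens = word_tokenize(text)
    stems = [stemmer.stem(tok) for tok in tokens]                       # normalize morphology
    return stems, list(ngrams(stems, n))                                # stems + n-grams

stems, bigrams = preprocess("Customers reported intermittent failures, again!")
print(stems)
print(bigrams)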
*
Source: https://uima.apache.org/
*
* Emerging Demand for Text Analytics
* Text Mining Techniques
*Sentiment Analysis
*Topic Modeling
*Classification
*Named Entity Recognition
*Event Extraction
* Workflows
*Performance Considerations
*
* Scalability
* Multiple language support
* Quality
*Precision
*Recall
* Algorithm selection
* Reliability and timeliness of sources
* Integration rules
* Increase quantity of data (not always helpful; see
error curves)
* Improve quality of data
* Utilize multiple supervised algorithms,
ensemble and non-ensemble
* Use unlabeled data and semi-supervised
techniques
* Feature Selection
* Parameter Tuning
* Feature Engineering
* Given:
* High quality data in sufficient quantity
* State of the art machine learning algorithms
* How to improve results: Change Representation?
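
A sketch of the parameter-tuning item above: grid search over a TF-IDF + linear SVM pipeline with scikit-learn's GridSearchCV. The parameter grid and the tiny labeled set are assumptions; in practice docs and labels come from the corpus described in the classification section.

# Parameter tuning sketch: grid search over vectorizer and classifier settings.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([("tfidf", TfidfVectorizer()), ("svm", LinearSVC())])
param_grid = {
    "tfidf__ngram_range": [(1, 1), (1, 2)],   # unigrams vs. unigrams + bigrams
    "svm__C": [0.1, 1.0, 10.0],               # margin / regularization trade-off
}

docs = ["secretion system effector", "plasmid cloning protocol",
        "toxin translocation host", "buffer centrifugation step"] * 3
labels = [1, 0, 1, 0] * 3                     # invented placeholder data

search = GridSearchCV(pipeline, param_grid, cv=3)
search.fit(docs, labels)
print(search.best_params_, search.best_score_)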
*
*TF-IDF
*Loss of syntactic and
semantic information
*No relation between
term index and meaning
*No support for
disambiguation
*Feature engineering extends the vector representation or substitutes more general terms for specific ones – a crude way to capture semantic properties
*
* Ideal Representation
* Captures semantic similarity of words
* Does not require feature engineering
* Minimal pre-processing, e.g. no mapping to ontologies
* Improves precision and recall
*Words represented as set of
weights in vector
*Useful properties
* Semantically similar words in close
proximity
* Methods for capturing phrases, e.g.
“Secretion system”
* Captures some semantic features
*Trained with
* Skip-gram or CBOW algorithms
* Text, such as PubMed abstracts and
open access papers
*
T. Mikolov et al., “Efficient Estimation of Word Representations in Vector Space,” 2013. http://arxiv.org/pdf/1301.3781.pdf
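
A training sketch with Gensim's Word2Vec (sg=1 selects skip-gram, sg=0 CBOW). The toy sentences stand in for a real corpus such as tokenized PubMed abstracts; note that the vector_size argument is named size in Gensim versions before 4.0.

# Word embedding sketch: skip-gram training with Gensim.
from gensim.models import Word2Vec

sentences = [                                   # stand-in corpus
    ["type", "iii", "secretion", "system"],
    ["effector", "secretion", "into", "host", "cell"],
    ["toxin", "translocation", "into", "host", "cell"],
]

model = Word2Vec(
    sentences,
    vector_size=50,    # named `size` in Gensim < 4.0
    window=3,
    min_count=1,
    sg=1,              # 1 = skip-gram, 0 = CBOW
)
print(model.wv.most_similar("secretion", topn=3))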
*
*
* “Characterization of the Affective Norms for English Words by discrete emotional categories”
  http://indiana.edu/~panlab/papers/SraMjaJtw_ANEW.pdf
* “New Avenues in Opinion Mining and Sentiment Analysis”
  http://sentic.net/new-avenues-in-opinion-mining-and-sentiment-analysis.pdf
* “Empirical Study of Topic Modeling in Twitter”
  http://snap.stanford.edu/soma2010/papers/soma2010_12.pdf
* “Open Domain Event Extraction from Twitter”
  http://turing.cs.washington.edu/papers/kdd12-ritter.pdf

Editor's Notes

  • #28: 1. Process used in VF. 2. No idea why this was labeled as a 1. 3. Probably from a Methods section; refers to a resistance cassette. 4.