The document discusses various text mining techniques including sentiment analysis, topic modeling, classification, named entity recognition, and event extraction. It provides examples of applications and considerations for each technique. Performance factors like scalability, language support, and integration rules are also covered. Overall the document serves as an introduction to common text analytics methods.
2. *
* Emerging Demand for Text Analytics
* Text Mining Techniques
*Sentiment Analysis
*Topic Modeling
*Classification
*Named Entity Recognition
*Event Extraction
* Workflows
* Performance Considerations
3. *
* First commercial work in natural language
processing in late 1980s
* Document Warehousing and Text Mining, 2001
* Most recent and current text mining work in
life sciences area
* Classification
* Named Entity Recognition
* Event Extraction
* Contact
* dan@dsapptech.com
* @dsapptech
* Linkedin.com/in/dansullivanpdx
4. *
Discount Code:
DATA35
• Available as book & eBook
• FREE shipping in the U.S.
• EPUB, PDF, and MOBI
eBook formats provided
Also available at booksellers and
online retailers – 35% off discount
only good at informit.com
5. *
* Emerging Demand for Text Analytics
* Text Mining Techniques
*Sentiment Analysis
*Topic Modeling
*Classification
*Named Entity Recognition
*Event Extraction
* Workflows, Procedures and Governance
* Performance Considerations
6. *
*Large volumes of
accessible and relevant
texts:
*Social media
*Email
*Patents and research
*Customer
communications
* Use Cases
*Market research
*Brand monitoring
*e-Discovery
*Intellectual property
management
Manual procedures are time-consuming and costly
Volume of literature continues to grow
Commonly used search techniques, such as keyword search, similarity searching, and metadata filtering, can still yield volumes of literature that are difficult to analyze manually
Some success with popular tools, but limitations remain
8. *
* Emerging Demand for Text Analytics
* Text Mining Techniques
*Sentiment Analysis
*Topic Modeling
*Classification
*Named Entity Recognition
*Event Extraction
* Workflows
*Performance Considerations
9. *
* Analysis of tone or opinion of a
communication
* Polarity:
text → {positive, neutral, negative}
* Categorization:
text → {angry, pleased, confused, …}
* Scale:
text → -10 … +10
* Metadata about context essential
* subject area
* communication medium
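A polarity mapping in the sense above can be sketched with a hand-built lexicon. The word list and scoring scheme below are invented for illustration only; production systems rely on curated resources and trained models rather than a six-word dictionary.

```python
# Toy lexicon-based polarity scorer. The lexicon is a made-up
# illustration, not drawn from ANEW or any published resource.
POLARITY = {"great": 1, "pleased": 1, "good": 1,
            "bad": -1, "angry": -1, "confused": -1}

def polarity(text):
    """Map a text to {positive, neutral, negative} by summing word scores."""
    score = sum(POLARITY.get(word.strip(".,!?").lower(), 0)
                for word in text.split())
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"
```

As the slide notes, metadata about context is essential: the same lexicon would misread domain-specific language, which is one reason subject area and communication medium must be tracked.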
10. *
*Keywords
*Lexical Affinity
* Affective Norms for English Words (ANEW)
* Emotional Dimensions
* Arousal
* Dominance
* Valence
*Statistical Classification
*Semantic or Concept-based Classification
12. *
* Emerging Demand for Text Analytics
* Text Mining Techniques
*Sentiment Analysis
*Topic Modeling
*Classification
*Named Entity Recognition
*Event Extraction
* Workflows, Procedures and Governance
* Performance Considerations
13. *
* Technique for identifying dominant themes
in a document
* Does not require training
* Multiple Algorithms
* Probabilistic Latent Semantic Indexing
(PLSI)
* Latent Dirichlet allocation (LDA)
*Assumptions
*Documents are about a mixture of topics
*Words used in a document are attributable
to its topics
Source: http://www.keepcalm-o-matic.co.uk/p/keep-calm-theres-no-training-today/
15. *
* Topics represented by words; documents about a
set of topics
*Doc 1: 50% politics, 50% presidential
*Doc 2: 25% CPU, 30% memory, 45% I/O
*Doc 3: 30% cholesterol, 40% arteries, 30% heart
* Learning Topics
*Assign each word to a topic
*For each word and topic, compute
* Probability of topic given a document P(topic|doc)
* Probability of word given a topic P(word|topic)
* Reassign word to new topic with probability
P(topic|doc) * P(word|topic)
* Reassignment based on probability that topic T
generated use of word W
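The reassignment loop above can be sketched in plain Python. This is a bare-bones illustration of the sampling idea only: it uses add-one smoothing in place of the Dirichlet hyperparameters a real LDA implementation (e.g. Mallet or Gensim) would use, and it refreshes counts once per sweep rather than per word.

```python
import random
from collections import defaultdict

def gibbs_pass(docs, num_topics, assignments, rng):
    """One sweep of reassignment: each word moves to topic k with
    probability proportional to P(k|doc) * P(word|k), estimated from
    the current topic assignments."""
    doc_topic = defaultdict(int)    # (doc index, topic) -> count
    topic_word = defaultdict(int)   # (topic, word) -> count
    topic_total = defaultdict(int)  # topic -> count
    for d, doc in enumerate(docs):
        for i, word in enumerate(doc):
            k = assignments[d][i]
            doc_topic[(d, k)] += 1
            topic_word[(k, word)] += 1
            topic_total[k] += 1
    for d, doc in enumerate(docs):
        for i, word in enumerate(doc):
            # P(topic|doc) * P(word|topic), with +1 smoothing so no
            # weight is ever exactly zero (a shortcut taken here to
            # keep the sketch small).
            weights = [(doc_topic[(d, k)] + 1) *
                       (topic_word[(k, word)] + 1) / (topic_total[k] + 1)
                       for k in range(num_topics)]
            assignments[d][i] = rng.choices(range(num_topics), weights)[0]
    return assignments
```

Repeated sweeps tend to concentrate each document's words into a few topics, which is the behavior the bullet points describe.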
16. Image Source: David Blei, “Probabilistic Topic Models”
http://yosinski.com/mlss12/MLSS-2012-Blei-Probabilistic-Topic-Models/
17. *
* Use Cases
* Data exploration in large corpus
* Pre-classification analysis
* Identify dominant themes
* Tools
*Stanford Topic Modeling Toolbox
*Mallet (UMass Amherst)
*R package: topicmodels
*Python package: Gensim
23. *
Source: Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python:
Analyzing Text with Natural Language Toolkit. http://www.nltk.org/book/
A Support Vector Machine (SVM) is a
large-margin classifier
Commonly used in text classification
Initial results based on a life sciences
sentence classifier
Image Source:http://en.wikipedia.org/wiki/File:Svm_max_sep_hyperplane_with_margin.png
*
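The large-margin idea can be illustrated with a minimal 2-D linear SVM trained by subgradient descent on the regularized hinge loss. This is a sketch of the objective only, not a production solver (libraries such as scikit-learn use far better optimizers); the toy points and hyperparameters are invented for the example.

```python
def train_linear_svm(points, labels, lam=0.01, lr=0.05, epochs=1000):
    """Fit w, b by full-batch subgradient descent on
    lam/2 * |w|^2 + mean(max(0, 1 - y * (w.x + b)))."""
    w = [0.0, 0.0]
    b = 0.0
    n = len(points)
    for _ in range(epochs):
        gw0, gw1, gb = lam * w[0], lam * w[1], 0.0
        for (x1, x2), y in zip(points, labels):
            # Points inside the margin contribute a hinge subgradient.
            if y * (w[0] * x1 + w[1] * x2 + b) < 1:
                gw0 -= y * x1 / n
                gw1 -= y * x2 / n
                gb -= y / n
        w = [w[0] - lr * gw0, w[1] - lr * gw1]
        b -= lr * gb
    return w, b

def predict(w, b, point):
    """Classify by the sign of the decision function."""
    return 1 if w[0] * point[0] + w[1] * point[1] + b >= 0 else -1
```

With two well-separated clusters the learned hyperplane sits between them, maximizing the margin as in the illustration the slide cites.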
*Term Frequency (TF)
tf(t,d) = # of occurrences of term t in document d
*Inverse Document Frequency (IDF)
idf(t,D) = log(N / |{d in D : t in d}|)
D is the set of documents
N is the number of documents in D
*TF-IDF = tf(t,d) * idf(t,D)
*TF-IDF is
*large when the term is frequent in the document
but rare across the corpus
*small when the term appears in many documents
*
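The formulas above translate directly to Python, treating each document as a list of tokens:

```python
import math

def tf(t, d):
    """tf(t,d): number of occurrences of term t in document d."""
    return d.count(t)

def idf(t, D):
    """idf(t,D) = log(N / |{d in D : t in d}|)."""
    n_containing = sum(1 for d in D if t in d)
    return math.log(len(D) / n_containing)

def tf_idf(t, d, D):
    """TF-IDF = tf(t,d) * idf(t,D)."""
    return tf(t, d) * idf(t, D)
```

A term occurring twice in one document out of three scores 2 * log(3), while a term present in every document scores 0, matching the large/small cases on the slide.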
* Bag-of-words model
* Ignores structure (syntax) and
meaning (semantics) of sentences
* Representation vector length is the
size of the set of unique words in the corpus
* Stemming is used to remove
morphological differences
* Each word is assigned an index in the
representation vector, V
* The value V[i] is non-zero if the word
appears in the sentence represented by
the vector
* The non-zero value is a function of
the frequency of the word in the
sentence and the frequency of the
term in the corpus
*
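A minimal version of this vectorization, with raw counts standing in for the TF-IDF weights described above and stemming omitted for brevity:

```python
def build_index(corpus):
    """Assign each unique word in the corpus an index in the
    representation vector."""
    index = {}
    for doc in corpus:
        for word in doc:
            index.setdefault(word, len(index))
    return index

def to_vector(doc, index):
    """Bag-of-words vector: V[i] is the count of the word with
    index i in this sentence, 0 if the word does not appear."""
    v = [0] * len(index)
    for word in doc:
        if word in index:
            v[index[word]] += 1
    return v
```

Note how word order is discarded: any permutation of the sentence yields the same vector, which is exactly the loss of syntax the slide points out.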
27. Non-VF, Predicted VF:
“Collectively, these data suggest that EPEC 30-5-1(3) translocates reduced levels of
EspB into the host cell.”
“Data were log-transformed to correct for heterogeneity of the variances where
necessary.”
“Subsequently, the kanamycin resistance cassette from pVK4 was cloned into the
PstI site of pMP3, and the resulting plasmid pMP4 was used to target a disruption
in the cesF region of EHEC strain 85-170.”
VF, Predicted Non-VF
“Here, it is reported that the pO157-encoded Type V-secreted serine protease
EspP influences the intestinal colonization of calves. “
“Here, we report that intragastric inoculation of a Shiga toxin 2 (Stx2)-producing
E. coli O157:H7 clinical isolate into infant rabbits led to severe diarrhea and
intestinal inflammation but no signs of HUS. “
“The DsbLI system also comprises a functional redox pair”
Adding additional examples is not likely to substantially
improve results, as seen in the error curve
[Learning-curve chart: training error and validation error vs. number of training examples (0–10,000); error axis 0–0.5]
30. *
* SAS Text Miner
* IBM Text Analytics
* Smartlogic
* Python: scikit-learn
* R: RTextTools
* R: tm
31. *
* Emerging Demand for Text Analytics
* Text Mining Techniques
*Sentiment Analysis
*Topic Modeling
*Classification
*Named Entity Recognition
*Event Extraction
* Workflows
* Performance Considerations
32. *
* Process of identifying words and phrases that refer to
objects in specific categories. Also known as:
*Entity identification
*Entity extraction
*Chunking
* Two steps:
* Detect entities
* Classify entities
* Common classes of entities:
* Persons
* Organizations
* Geographic locations
* Dates
* Monetary amounts
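The two steps can be caricatured in a few lines: detect candidate entities with a capitalization pattern, then classify them by dictionary lookup. The gazetteer and entity names here are hypothetical, and real NER systems (e.g. Stanford CoreNLP, OpenNLP) use trained sequence models for both steps rather than rules like these.

```python
import re

# Hypothetical gazetteer standing in for a trained classifier.
GAZETTEER = {
    "Jane Smith": "PERSON",
    "Acme Corp": "ORGANIZATION",
    "Portland": "LOCATION",
}

def extract_entities(text):
    """Step 1: detect candidates as runs of capitalized words.
    Step 2: classify each candidate by gazetteer lookup."""
    candidates = re.findall(r"[A-Z][a-z]+(?:\s[A-Z][a-z]+)*", text)
    return [(c, GAZETTEER.get(c, "UNKNOWN")) for c in candidates]
```

Even this toy shows why detection and classification are separated: sentence-initial words match the detection pattern but fail classification, ending up as UNKNOWN.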
37. *
*Use Cases
* Name normalization
* Entity correlation
*Quantified metrics based on texts
*Building block for event extraction
*Tools
* Stanford Core NLP
* OpenNLP
* Mallet
* Basis Technology
* Lexalytics
* NetOwl
* Cogito API
38. *
* Emerging Demand for Text Analytics
* Text Mining Techniques
*Sentiment Analysis
*Topic Modeling
*Classification
*Named Entity Recognition
*Event Extraction
* Workflows
* Performance Considerations
39. *
* Entities and relations between
entities
* Company A acquires Company B
* Engineer A filed patent application
on Topic B on Date C
*Politician P announces A on Twitter
on Date B
* Assign roles to entities
* Assign subtypes
* Link to semantic data
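One way to see the output of event extraction is as a structured record that assigns roles to the extracted entities. The field names below are illustrative, not drawn from any particular annotation schema:

```python
from dataclasses import dataclass
from datetime import date
from typing import Optional

@dataclass
class AcquisitionEvent:
    """An extracted event: entities plus the roles they play."""
    acquirer: str                    # role: the acquiring company
    acquired: str                    # role: the company being acquired
    announced: Optional[date] = None # date the event was reported
    source_sentence: str = ""        # text the event was extracted from

event = AcquisitionEvent(
    acquirer="Company A",
    acquired="Company B",
    announced=date(2015, 3, 1),
    source_sentence="Company A acquires Company B.",
)
```

Typed fields like these are what make it possible to assign subtypes and link the entities to semantic data downstream.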
40. *
* Alan Ritter’s Twitter NLP Tools -
https://github.com/aritter/twitter_nlp
* Alchemy API
* Turku BioNLP Event Extraction Software
* Stanford Biomedical Event Parser
Source: Turku Event Extraction System, http://jbjorne.github.io/TEES/
45. *
* Emerging Demand for Text Analytics
* Text Mining Techniques
*Sentiment Analysis
*Topic Modeling
*Classification
*Named Entity Recognition
*Event Extraction
* Workflows
*Performance Considerations
46. *
* Scalability
* Multiple language support
* Quality
*Precision
*Recall
* Algorithm selection
* Reliability and timeliness of sources
* Integration rules
47. * Increase quantity of data (not always helpful; see
error curves)
* Improve quality of data
* Utilize multiple supervised algorithms,
ensemble and non-ensemble
* Use unlabeled data and semi-supervised
techniques
* Feature Selection
* Parameter Tuning
* Feature Engineering
* Given:
* High-quality data in sufficient quantity
* State-of-the-art machine learning algorithms
* How to improve results: change the representation?
*
*TF-IDF
*Loss of syntactic and
semantic information
*No relation between
term index and meaning
*No support for
disambiguation
*Feature engineering
extends the vector
representation or
substitutes more general
terms for specific ones – a
crude way to capture
semantic properties
*
Ideal
Representation
◦ Captures semantic
similarity of words
◦ Does not require
feature engineering
◦ Minimal pre-processing,
e.g. no mapping to
ontologies
◦ Improves precision
and recall
*Words represented as vectors of
weights
*Useful properties
* Semantically similar words in close
proximity
* Methods for capturing phrases, e.g.
“Secretion system”
* Captures some semantic features
*Trained with
* Skip-gram or CBOW algorithms
* Text, such as PubMed abstracts and
open access papers
*
T. Mikolov et al., “Efficient Estimation of Word Representations in Vector Space,” 2013. http://arxiv.org/pdf/1301.3781.pdf
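The “close proximity” property is usually measured with cosine similarity. The three vectors below are hand-made stand-ins (real embeddings come from skip-gram/CBOW training on a corpus), but they show how the measurement works:

```python
import math

def cosine(u, v):
    """Cosine similarity: near 1 for vectors pointing the same way."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Toy vectors, NOT trained embeddings; chosen so that the two
# related terms point in a similar direction.
vectors = {
    "secretion": [0.9, 0.1, 0.0],
    "excretion": [0.8, 0.2, 0.1],
    "laptop":    [0.0, 0.1, 0.9],
}
```

With trained word2vec vectors the same comparison puts semantically similar words (and learned phrases such as “secretion system”) closer together than unrelated ones.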
51. *
* “Characterization of the Affective Norms for English Words
by discrete emotional categories”
http://indiana.edu/~panlab/papers/SraMjaJtw_ANEW.pdf
* “New Avenues in Opinion Mining and Sentiment Analysis”
http://sentic.net/new-avenues-in-opinion-mining-and-sentiment-analysis.pdf
* “Empirical Study of Topic Modeling in Twitter”
http://snap.stanford.edu/soma2010/papers/soma2010_12.pdf
* “Open Domain Event Extraction from Twitter”
http://turing.cs.washington.edu/papers/kdd12-ritter.pdf