Dan Sullivan
Big Data TechCon Boston 2015
*
* Emerging Demand for Text Analytics
* Text Mining Techniques
*Sentiment Analysis
*Topic Modeling
*Classification
*Named Entity Recognition
*Event Extraction
* Workflows
* Performance Considerations
*
* First commercial work in natural language
processing in late 1980s
* Document Warehousing and Text Mining, 2001
* Most recent and current text mining work in
life sciences area
* Classification
* Named Entity Recognition
* Event Extraction
* Contact
* dan@dsapptech.com
* @dsapptech
* Linkedin.com/in/dansullivanpdx
*
*
*Large volumes of
accessible and relevant
texts:
*Social media
*Email
*Patents and research
*Customer
communications
* Use Cases
*Market research
*Brand monitoring
*e-Discovery
*Intellectual property
management
Manual procedures are time
consuming and costly
Volume of literature continues
to grow
Commonly used search
techniques, such as keyword,
similarity searching, metadata
filtering, etc. can still yield
volumes of literature that are
difficult to analyze manually
Some success with popular tools
but limitations
*
* Analysis of tone or opinion of a
communication
* Polarity:
text → {positive, neutral, negative}
* Categorization:
text → {angry, pleased, confused, …}
* Scale:
text → -10 … +10
* Metadata about context essential
* subject area
* communication medium
*
*Keywords
*Lexical Affinity
* Affective Norms for English Words (ANEW)
* Emotional Dimensions
* Arousal
* Dominance
* Valence
*Statistical Classification
*Semantic or Concept-based Classification
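The keyword approach above can be sketched in a few lines. This is a minimal illustration, not a production method: the `LEXICON` dictionary and its weights are made up for the example, where real work would draw on a resource such as ANEW.

```python
import re

# Hypothetical toy lexicon (assumption): word -> polarity weight.
LEXICON = {
    "great": 2, "good": 1, "pleased": 1,
    "bad": -1, "angry": -2, "terrible": -2,
}

def polarity(text):
    """Score a text by summing keyword weights, then map to a polarity label."""
    tokens = re.findall(r"[a-z']+", text.lower())
    score = sum(LEXICON.get(tok, 0) for tok in tokens)
    if score > 0:
        label = "positive"
    elif score < 0:
        label = "negative"
    else:
        label = "neutral"
    return score, label
```

Keyword lookups like this ignore negation and context ("not good" still scores as positive), which is one reason the statistical and concept-based approaches exist.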
*
* Use Cases
* Brand monitoring
* Competitive intelligence
* Demographic modeling
* Campaign analysis
* Tools
* RapidMiner
* ViralHeat Sentiment Analysis API
* Python NLTK
* Python TextBlob
* R sentiment package
*
* Technique for identifying dominant themes
in documents
* Does not require labeled training data
* Multiple Algorithms
* Probabilistic Latent Semantic Indexing
(PLSI)
* Latent Dirichlet allocation (LDA)
*Assumptions
*Documents are about a mixture of topics
*Each word in a document is attributable to
a topic
Source: http://www.keepcalm-o-matic.co.uk/p/keep-calm-theres-no-training-today/
[Figure: example topics for business news headlines: {Debt, Law, Graduation}; {Debt, EU, Greece, Euro}; {EU, Greece, Negotiations, Varoufakis}]
Source: http://www.nytimes.com/pages/business/index.html April 27, 2015
*
* Topics represented by words; documents about a
set of topics
*Doc 1: 50% politics, 50% presidential
*Doc 2: 25% CPU, 30% memory, 45% I/O
*Doc 3: 30% cholesterol, 40% arteries, 30% heart
* Learning Topics
*Assign each word to a topic
*For each word and topic, compute
* Probability of topic given a document P(topic|doc)
* Probability of word given a topic P(word|topic)
* Reassign word to new topic with probability
P(topic|doc) * P(word|topic)
* Reassignment based on probability that topic T
generated use of word W
TOPICS
Image Source: David Blei, “Probabilistic Topic Models”
http://yosinski.com/mlss12/MLSS-2012-Blei-Probabilistic-Topic-Models/
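The reassignment loop described above is usually implemented as a collapsed Gibbs sampler. The sketch below is one way to write it, assuming symmetric smoothing parameters `alpha` and `beta` (not mentioned on the slides) and a random initial assignment; the function name `lda_gibbs` is hypothetical.

```python
import random
from collections import defaultdict

def lda_gibbs(docs, num_topics, iterations=200, alpha=0.1, beta=0.01, seed=0):
    """Toy collapsed Gibbs sampler for LDA. docs is a list of token lists."""
    rng = random.Random(seed)
    vocab_size = len({w for doc in docs for w in doc})
    doc_topic = [[0] * num_topics for _ in docs]              # words per topic, per doc
    topic_word = [defaultdict(int) for _ in range(num_topics)]  # word counts per topic
    topic_total = [0] * num_topics
    assignments = []

    # Step 1: assign each word to a topic (at random, an assumption).
    for d, doc in enumerate(docs):
        z_doc = []
        for w in doc:
            z = rng.randrange(num_topics)
            z_doc.append(z)
            doc_topic[d][z] += 1
            topic_word[z][w] += 1
            topic_total[z] += 1
        assignments.append(z_doc)

    # Step 2: repeatedly reassign each word with probability
    # proportional to P(topic|doc) * P(word|topic).
    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, w in enumerate(doc):
                z = assignments[d][i]
                # Remove the word's current assignment from the counts.
                doc_topic[d][z] -= 1
                topic_word[z][w] -= 1
                topic_total[z] -= 1
                weights = []
                for k in range(num_topics):
                    p_topic_doc = doc_topic[d][k] + alpha  # unnormalized P(topic|doc)
                    p_word_topic = (topic_word[k][w] + beta) / (
                        topic_total[k] + vocab_size * beta)
                    weights.append(p_topic_doc * p_word_topic)
                z = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = z
                doc_topic[d][z] += 1
                topic_word[z][w] += 1
                topic_total[z] += 1
    return doc_topic, topic_word
```

The returned `doc_topic` counts give each document's topic mixture after normalization, and `topic_word` gives the words that characterize each topic. Packages such as Gensim or Mallet are the practical choice for real corpora.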
*
* Use Cases
* Data exploration in large corpus
* Pre-classification analysis
* Identify dominant themes
* Tools
*Stanford Topic Modeling Toolbox
*Mallet (UMass Amherst)
*R package: topicmodels
*Python package: Gensim
*
* Sentiment Analysis
* Topic Modeling
* 3 Key Components
* Data
* Representation scheme
* Algorithms
* Data
* Positive examples – Examples from representative
corpus
* Negative examples – Randomly selected from same
publications
* Representation
* TF-IDF
* Vector space representation
* Cosine of the angle between vectors as the measure of similarity
* Algorithms
* Supervised learning
* SVMs
* Ridge Classifier
* Perceptrons
* kNN
* SGD Classifier
* Naïve Bayes
* Random Forest
* AdaBoost
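To show how the three components fit together, here is a minimal sketch that pairs labeled data, a term-count vector representation, and kNN with cosine similarity. The labels and helper names (`vectorize`, `knn_classify`) are illustrative assumptions, and raw counts stand in for TF-IDF weights.

```python
import math
from collections import Counter

def vectorize(text):
    """Represent a text as a sparse term-count vector."""
    return Counter(text.lower().split())

def cosine(a, b):
    """Cosine of the angle between two sparse vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def knn_classify(text, labeled_examples, k=3):
    """Label a text by majority vote among its k most similar examples."""
    query = vectorize(text)
    ranked = sorted(
        labeled_examples,
        key=lambda example: cosine(query, vectorize(example[0])),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical training data: (sentence, label) pairs.
TRAIN = [
    ("the protein is secreted into the host cell", "positive"),
    ("the toxin damages host tissue", "positive"),
    ("data were log transformed", "negative"),
    ("samples were stored at low temperature", "negative"),
]
```

Calling `knn_classify("secreted toxin enters the host cell", TRAIN)` ranks the two "positive" sentences first because they share the most terms with the query.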
*
Source: Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python:
Analyzing Text with Natural Language Toolkit. http://www.nltk.org/book/
A Support Vector Machine (SVM) is a
large-margin classifier
Commonly used in text classification
Initial results based on life sciences
sentence classifier
Image Source:http://en.wikipedia.org/wiki/File:Svm_max_sep_hyperplane_with_margin.png
*
*Term Frequency (TF)
tf(t,d) = # of occurrences of t in d
t is a term
d is a document
*Inverse Document Frequency (IDF)
idf(t,D) = log(N / |{d in D : t in d}|)
D is the set of documents
N is the number of documents in D
*TF-IDF = tf(t,d) * idf(t,D)
*TF-IDF is
*large when the term is frequent in the document but
appears in few documents in the corpus
*small when the term appears in many documents
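The formulas above translate directly into code. This is a minimal sketch using the natural logarithm; the function names are illustrative, and returning 0.0 for a term that appears in no document is an assumption made to avoid division by zero.

```python
import math

def tf(term, doc_tokens):
    """tf(t,d): number of occurrences of the term in the document."""
    return doc_tokens.count(term)

def idf(term, corpus):
    """idf(t,D) = log(N / number of documents containing the term)."""
    containing = sum(1 for doc in corpus if term in doc)
    if containing == 0:
        return 0.0  # assumption: unseen terms carry no weight
    return math.log(len(corpus) / containing)

def tf_idf(term, doc_tokens, corpus):
    return tf(term, doc_tokens) * idf(term, corpus)

# Hypothetical three-document corpus, already tokenized.
CORPUS = [
    ["toxin", "secreted", "toxin", "host"],
    ["host", "cell", "data"],
    ["data", "host", "variance"],
]
```

In this corpus "toxin" scores high in the first document (frequent there, absent elsewhere), while "host" scores zero because it appears in every document.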
*
* Bag-of-words model
* Ignores structure (syntax) and
meaning (semantics) of sentences
* Representation vector length is the
size of set of unique words in corpus
* Stemming used to remove
morphological differences
* Each word is assigned an index in the
representation vector, V
* The value V[i] is non-zero if word
appears in sentence represented by
vector
* The non-zero value is a function of
the frequency of the word in the
sentence and the frequency of the
term in the corpus
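The index-per-word scheme can be sketched as follows. For brevity this version stores raw counts in V[i] rather than a TF-IDF weight and skips stemming; both simplifications are assumptions of the example.

```python
def build_index(sentences):
    """Assign each unique word in the corpus an index in the vector."""
    vocab = sorted({w for s in sentences for w in s.lower().split()})
    return {word: i for i, word in enumerate(vocab)}

def to_vector(sentence, index):
    """Build vector V where V[i] counts occurrences of the word at index i."""
    vector = [0] * len(index)
    for word in sentence.lower().split():
        if word in index:
            vector[index[word]] += 1
    return vector

SENTENCES = ["the cell secretes the toxin", "the toxin binds"]
INDEX = build_index(SENTENCES)
```

Because only counts are kept, `to_vector("toxin the binds", INDEX)` equals `to_vector("the toxin binds", INDEX)`: word order, and with it syntax, is lost.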
*
Non-VF, Predicted VF:
* “Collectively, these data suggest that EPEC 30-5-1(3) translocates reduced levels of
EspB into the host cell.”
* “Data were log-transformed to correct for heterogeneity of the variances where
necessary.”
* “Subsequently, the kanamycin resistance cassette from pVK4 was cloned into the
PstI site of pMP3, and the resulting plasmid pMP4 was used to target a disruption
in the cesF region of EHEC strain 85-170.”
VF, Predicted Non-VF:
* “Here, it is reported that the pO157-encoded Type V-secreted serine protease
EspP influences the intestinal colonization of calves.”
* “Here, we report that intragastric inoculation of a Shiga toxin 2 (Stx2)-producing
E. coli O157:H7 clinical isolate into infant rabbits led to severe diarrhea and
intestinal inflammation but no signs of HUS.”
* “The DsbLI system also comprises a functional redox pair”
* Adding more examples is not likely to substantially
improve results, as the error curves show
[Figure: error curves ("All"): Training Error and Validation Error; y-axis 0 to 0.5, x-axis 0 to 10,000]
8 Alternative Algorithms
Select 10,000 most important features using chi-square
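Chi-square feature selection scores each term by how strongly its presence depends on the class label, then keeps the top k. The sketch below uses the textbook 2x2 contingency statistic; library implementations may compute the score differently, and the function names and toy data are assumptions.

```python
def chi_square(term, docs, labels):
    """Chi-square statistic for term presence versus a binary class label."""
    a = b = c = d = 0
    for tokens, is_positive in zip(docs, labels):
        present = term in tokens
        if present and is_positive:
            a += 1      # positive docs containing the term
        elif present:
            b += 1      # negative docs containing the term
        elif is_positive:
            c += 1      # positive docs without the term
        else:
            d += 1      # negative docs without the term
    n = a + b + c + d
    denominator = (a + b) * (c + d) * (a + c) * (b + d)
    if denominator == 0:
        return 0.0
    return n * (a * d - b * c) ** 2 / denominator

def select_features(docs, labels, k):
    """Return the k highest-scoring terms (ties broken alphabetically)."""
    vocab = {t for doc in docs for t in doc}
    ranked = sorted(vocab, key=lambda t: (-chi_square(t, docs, labels), t))
    return ranked[:k]

# Hypothetical data: each document is a set of terms; True marks the positive class.
DOCS = [{"toxin", "host"}, {"toxin", "cell"}, {"data", "host"}, {"data", "variance"}]
LABELS = [True, True, False, False]
```

Here "host" scores zero because it is spread evenly across both classes, so it would be among the first terms dropped. The slide keeps 10,000 features; the toy data supports only a handful.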
*
* SAS Text Miner
* IBM Text Analytics
* Smartlogic
* Python: scikit-learn
* R: RTextTools
* R: tm
*
* Process of identifying words and phrases that refer to objects
in specific categories. Also known as:
*Entity identification
*Entity extraction
*Chunking
* Two steps:
* Detect entities
* Classify entities
* Common classes of entities:
* Persons
* Organizations
* Geographic locations
* Dates
* Monetary amounts
*
* Four Broad Techniques
*Linguistic - utilize structure of sentence
* Statistical – detect patterns in training
examples
* Custom patterns – regular expressions
* Dictionaries
*Challenges
*Creating training corpus
*Granularity
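Two of the four techniques, custom patterns and dictionaries, are simple enough to sketch directly. The patterns and the `ORGANIZATIONS` dictionary below are hypothetical and deliberately narrow; they detect and classify in a single step.

```python
import re

# Hypothetical dictionary of known organization names (assumption).
ORGANIZATIONS = {"Acme Corp", "Globex"}

# Hypothetical custom patterns: ISO-style dates and dollar amounts only.
PATTERNS = {
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "MONEY": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d+)?"),
}

def find_entities(text):
    """Return (entity text, entity class) pairs in order of appearance."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(), label))
    for name in ORGANIZATIONS:
        for match in re.finditer(re.escape(name), text):
            found.append((match.start(), match.group(), "ORGANIZATION"))
    found.sort()
    return [(entity, label) for _, entity, label in found]
```

Patterns and dictionaries are precise but brittle: a date written "April 27, 2015" or an organization missing from the dictionary goes undetected, which is where the linguistic and statistical techniques come in.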
*
*Use Cases
* Name normalization
* Entity correlation
*Quantified metrics based on texts
*Building block for event extraction
*Tools
* Stanford CoreNLP
* OpenNLP
* Mallet
* Basis Technology
* Lexalytics
* NetOwl
* Cogito API
*
* Entities and relations between
entities
* Company A acquires Company B
* Engineer A filed patent application
on Topic B on Date C
*Politician P announces A on Twitter
on Date B
* Assign roles to entities
* Assign subtypes
* Link to semantic data
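The "Company A acquires Company B" case can be illustrated with a pattern that finds the relation and assigns a role to each entity. This is a rough sketch: the role names `acquirer` and `target` are assumptions, and a capitalized-word heuristic stands in for real named entity recognition.

```python
import re

# Heuristic stand-in for NER (assumption): a run of capitalized words.
NAME = r"[A-Z][\w&]*(?: [A-Z][\w&]*)*"

ACQUISITION = re.compile(
    rf"(?P<acquirer>{NAME}) (?:acquires|acquired) (?P<target>{NAME})"
)

def extract_acquisitions(text):
    """Return one event record per acquisition statement found in the text."""
    return [
        {
            "type": "Acquisition",
            "acquirer": match.group("acquirer"),
            "target": match.group("target"),
        }
        for match in ACQUISITION.finditer(text)
    ]
```

A capitalized word at the start of a sentence would be absorbed into the acquirer name, one of many reasons production systems build event extraction on top of a proper NER step.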
*
* Brenden’s Twitter NLP Tools -
https://github.com/aritter/twitter_nlp
* Alchemy API
* Turku BioNLP Event Extraction Software
* Stanford Biomedical Event Parser
Source: Turku Event Extraction System, http://jbjorne.github.io/TEES/
*
* Classification
* Named Entity Recognition
* Event Extraction
*
* Document Collection
* Text Extraction
* Pre-processing
* Case conversion
* Punctuation removal
* Stemming
* Normalization
* N-gram analysis
* Analysis
* Term Frequency – Inverse Document Frequency
* Conditional Probabilities and Topic Models
* NER and Entity Extraction
* Integration
* Link to Structured Data
* Augment with additional semantic information
* Utilization
* Improve information retrieval
* Identify brand perception problems
* Assess likelihood of customer churn
* Predict likelihood of …
Collect → Extract & Pre-Process → Analyze → Integrate → Utilize
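Three of the pre-processing steps (case conversion, punctuation removal, and n-gram analysis) can be sketched with the standard library alone. Stemming is left out because it needs a stemmer from a package such as NLTK; the function names here are illustrative.

```python
import string

def preprocess(text):
    """Lowercase the text, strip ASCII punctuation, and split into tokens."""
    lowered = text.lower()
    stripped = lowered.translate(str.maketrans("", "", string.punctuation))
    return stripped.split()

def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
```

For example, `ngrams(preprocess("Secretion systems, in E. coli!"), 2)` yields the bigrams that a later analysis step can count to find frequent phrases.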
*
Source: https://uima.apache.org/
*
* Scalability
* Multiple language support
* Quality
*Precision
*Recall
* Algorithm selection
* Reliability and timeliness of sources
* Integration rules
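Precision and recall, the two quality measures listed above, are computed from counts of true positives, false positives, and false negatives. A minimal sketch for binary labels follows; returning 0.0 when a denominator is zero is an assumption of this example.

```python
def precision_recall(predicted, actual):
    """Compute (precision, recall) from parallel lists of boolean labels."""
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p and a)
    false_pos = sum(1 for p, a in pairs if p and not a)
    false_neg = sum(1 for p, a in pairs if not p and a)
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall
```

Precision answers "of the items flagged, how many were right?" and recall answers "of the items that should have been flagged, how many were found?"; the two usually trade off against each other.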
* Increase quantity of data (not always helpful; see
error curves)
* Improve quality of data
* Utilize multiple supervised algorithms,
ensemble and non-ensemble
* Use unlabeled data and semi-supervised
techniques
* Feature Selection
* Parameter Tuning
* Feature Engineering
* Given:
* High quality data in sufficient quantity
* State of the art machine learning algorithms
* How to improve results: Change Representation?
*
*TF-IDF
*Loss of syntactic and
semantic information
*No relation between
term index and meaning
*No support for
disambiguation
*Feature engineering
extends the vector
representation or
substitutes more general
terms for specific ones, a
crude way to capture
semantic properties
*
* Ideal Representation
* Captures semantic similarity of words
* Does not require feature engineering
* Minimal pre-processing, e.g. no mapping to ontologies
* Improves precision and recall
*Words represented as set of
weights in vector
*Useful properties
* Semantically similar words in close
proximity
* Methods for capturing phrases, e.g.
“Secretion system”
* Captures some semantic features
*Trained with
* Skip-gram or CBOW algorithms
* Text, such as PubMed abstracts and
open access papers
*
T. Mikolov et al. “Efficient Estimation of Word Representations in Vector Space.” 2013. http://arxiv.org/pdf/1301.3781.pdf
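Skip-gram training learns word vectors by predicting the words around each center word. The sketch below covers only the data preparation step, building (center, context) pairs from a token list; the window size and function name are assumptions, and the network that learns the weights is not shown.

```python
def skipgram_pairs(tokens, window=2):
    """Build (center word, context word) training pairs within a window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        end = min(len(tokens), i + window + 1)
        for j in range(start, end):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs
```

Words that occur in similar contexts generate similar training pairs, which is why their learned vectors end up close together. CBOW reverses the direction and predicts the center word from its context.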
*
* “Characterization of the Affective Norms for English Words
by discrete emotional categories”
http://indiana.edu/~panlab/papers/SraMjaJtw_ANEW.pdf
* “New Avenues in Opinion Mining and Sentiment Analysis”
http://sentic.net/new-avenues-in-opinion-mining-and-sentiment-analysis.pdf
* “Empirical Study of Topic Modeling in Twitter”
http://snap.stanford.edu/soma2010/papers/soma2010_12.pdf
* “Open Domain Event Extraction from Twitter”
http://turing.cs.washington.edu/papers/kdd12-ritter.pdf

 
Google Cloud Certifications & Machine Learning
Google Cloud Certifications & Machine LearningGoogle Cloud Certifications & Machine Learning
Google Cloud Certifications & Machine Learning
 
Unstructured text to structured data
Unstructured text to structured dataUnstructured text to structured data
Unstructured text to structured data
 
A first look at tf idf-pdx data science meetup
A first look at tf idf-pdx data science meetupA first look at tf idf-pdx data science meetup
A first look at tf idf-pdx data science meetup
 
ACID vs BASE in NoSQL: Another False Dichotomy
ACID vs BASE in NoSQL: Another False DichotomyACID vs BASE in NoSQL: Another False Dichotomy
ACID vs BASE in NoSQL: Another False Dichotomy
 
Big data, bioscience and the cloud biocatalyst june 2015 sullivan
Big data, bioscience and the cloud   biocatalyst june 2015 sullivanBig data, bioscience and the cloud   biocatalyst june 2015 sullivan
Big data, bioscience and the cloud biocatalyst june 2015 sullivan
 
Modeling with Document Database: 5 Key Patterns
Modeling with Document Database: 5 Key PatternsModeling with Document Database: 5 Key Patterns
Modeling with Document Database: 5 Key Patterns
 
Sullivan GBCB Seminar Fall 2014 - Limits of RDMS for Bioinformatics v2
Sullivan GBCB Seminar Fall 2014 - Limits of RDMS for Bioinformatics v2Sullivan GBCB Seminar Fall 2014 - Limits of RDMS for Bioinformatics v2
Sullivan GBCB Seminar Fall 2014 - Limits of RDMS for Bioinformatics v2
 
Text Mining for Biocuration of Bacterial Infectious Diseases
Text Mining for Biocuration of Bacterial Infectious DiseasesText Mining for Biocuration of Bacterial Infectious Diseases
Text Mining for Biocuration of Bacterial Infectious Diseases
 
Limits of RDBMS and Need for NoSQL in Bioinformatics
Limits of RDBMS and Need for NoSQL in BioinformaticsLimits of RDBMS and Need for NoSQL in Bioinformatics
Limits of RDBMS and Need for NoSQL in Bioinformatics
 

Recently uploaded

Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsAndrey Dotsenko
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr LapshynFwdays
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Wonjun Hwang
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024BookNet Canada
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphNeo4j
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptxLBM Solutions
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxnull - The Open Security Community
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024BookNet Canada
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Alan Dix
 

Recently uploaded (20)

Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
Transcript: #StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
"Federated learning: out of reach no matter how close",Oleksandr Lapshyn
 
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort ServiceHot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
Hot Sexy call girls in Panjabi Bagh 🔝 9953056974 🔝 Delhi escort Service
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
Bun (KitWorks Team Study 노별마루 발표 2024.4.22)
 
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC BiblioShare - Tech Forum 2024
 
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge GraphSIEMENS: RAPUNZEL – A Tale About Knowledge Graph
SIEMENS: RAPUNZEL – A Tale About Knowledge Graph
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Key Features Of Token Development (1).pptx
Key  Features Of Token  Development (1).pptxKey  Features Of Token  Development (1).pptx
Key Features Of Token Development (1).pptx
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptxMaking_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
Making_way_through_DLL_hollowing_inspite_of_CFG_by_Debjeet Banerjee.pptx
 
Pigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food ManufacturingPigging Solutions in Pet Food Manufacturing
Pigging Solutions in Pet Food Manufacturing
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptxVulnerability_Management_GRC_by Sohang Sengupta.pptx
Vulnerability_Management_GRC_by Sohang Sengupta.pptx
 
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
#StandardsGoals for 2024: What’s new for BISAC - Tech Forum 2024
 
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...Swan(sea) Song – personal research during my six years at Swansea ... and bey...
Swan(sea) Song – personal research during my six years at Swansea ... and bey...
 

Dan Sullivan's Guide to Emerging Text Analytics Techniques

  • 1. Dan Sullivan Big Data TechCon Boston 2015 *
  • 2. * * Emerging Demand for Text Analytics * Text Mining Techniques *Sentiment Analysis *Topic Modeling *Classification *Named Entity Recognition *Event Extraction * Workflows * Performance Considerations
  • 3. * * First commercial work in natural language processing in late 1980s * Document Warehousing and Text Mining, 2001 * Most recent and current text mining work in life sciences area * Classification * Named Entity Recognition * Event Extraction * Contact * dan@dsapptech.com * @dsapptech * Linkedin.com/in/dansullivanpdx
  • 4. * Discount Code: DATA35 • Available as book & eBook • FREE shipping in the U.S. • EPUB, PDF, and MOBI eBook formats provided Also available at booksellers and online retailers – 35% off discount only good at informit.com
  • 5. * * Emerging Demand for Text Analytics * Text Mining Techniques *Sentiment Analysis *Topic Modeling *Classification *Named Entity Recognition *Event Extraction * Workflows, Procedures and Governance * Performance Considerations
  • 6. * *Large volumes of accessible and relevant texts: *Social media *Email *Patents and research *Customer communications * Use Cases *Market research *Brand monitoring *e-Discovery *Intellectual property management
  • 7. Manual procedures are time consuming and costly Volume of literature continues to grow Commonly used search techniques, such as keyword, similarity searching, metadata filtering, etc. can still yield volumes of literature that are difficult to analyze manually Some success with popular tools but limitations
  • 8. * * Emerging Demand for Text Analytics * Text Mining Techniques *Sentiment Analysis *Topic Modeling *Classification *Named Entity Recognition *Event Extraction * Workflows *Performance Considerations
  • 9. * * Analysis of tone or opinion of a communication * Polarity: text → {positive, neutral, negative} * Categorization: text → {angry, pleased, confused …} * Scale: text → -10 … +10 * Metadata about context essential * subject area * communication medium
  • 10. * *Keywords *Lexical Affinity * Affective Norms for English Words (ANEW) * Emotional Dimensions * Arousal * Dominance * Valence *Statistical Classification *Semantic or Concept-based Classification
  • 11. * * Use Cases * Brand monitoring * Competitive intelligence * Demographic modeling * Campaign analysis * Tools * RapidMiner * ViralHeat Sentiment Analysis API * Python NLTK * Python TextBlob * R sentiment package
  • 12. * * Emerging Demand for Text Analytics * Text Mining Techniques *Sentiment Analysis *Topic Modeling *Classification *Named Entity Recognition *Event Extraction * Workflows, Procedures and Governance * Performance Considerations
  • 13. * * Technique for identifying dominant themes in documents * Does not require training * Multiple Algorithms * Probabilistic Latent Semantic Indexing (PLSI) * Latent Dirichlet allocation (LDA) *Assumptions *Documents are about a mixture of topics *Words used in a document are attributable to a topic Source: http://www.keepcalm-o-matic.co.uk/p/keep-calm-theres-no-training-today/
  • 14. Debt, Law, Graduation Debt, EU, Greece, Euro Source: http://www.nytimes.com/pages/business/index.html April 27, 2015 EU, Greece, Negotiations, Varoufakis
  • 15. * * Topics represented by words; documents about a set of topics *Doc 1: 50% politics, 50% presidential *Doc 2: 25% CPU, 30% memory, 45% I/O *Doc 3: 30% cholesterol, 40% arteries, 30% heart * Learning Topics *Assign each word to a topic *For each word and topic, compute * Probability of topic given a document P(topic|doc) * Probability of word given a topic P(word|topic) * Reassign word to new topic with probability P(topic|doc) * P(word|topic) * Reassignment based on probability that topic T generated use of word W TOPICS
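The learning loop on slide 15 can be sketched in plain Python. This is a minimal illustration in the style of collapsed Gibbs sampling for LDA, which the slide does not name explicitly. The function name `learn_topics`, the smoothing parameters `alpha` and `beta`, and the toy documents are assumptions made for the example, not part of the talk.

```python
import random
from collections import Counter


def learn_topics(docs, num_topics, iterations=50, alpha=0.1, beta=0.01, seed=0):
    """Reassign each word to a topic with weight P(topic|doc) * P(word|topic).

    docs is a list of tokenized documents. alpha and beta are smoothing
    constants (an assumption of this sketch) that keep every weight positive.
    """
    rng = random.Random(seed)
    vocab_size = len({word for doc in docs for word in doc})

    # Step 1: assign each word occurrence to a random topic.
    assignments = [[rng.randrange(num_topics) for _ in doc] for doc in docs]
    doc_topic = [Counter(assigned) for assigned in assignments]
    topic_word = [Counter() for _ in range(num_topics)]
    topic_total = [0] * num_topics
    for doc, assigned in zip(docs, assignments):
        for word, topic in zip(doc, assigned):
            topic_word[topic][word] += 1
            topic_total[topic] += 1

    for _ in range(iterations):
        for d, doc in enumerate(docs):
            for i, word in enumerate(doc):
                # Remove the current assignment from the counts.
                old = assignments[d][i]
                doc_topic[d][old] -= 1
                topic_word[old][word] -= 1
                topic_total[old] -= 1

                # Step 2: weight each topic by P(topic|doc) * P(word|topic).
                # P(topic|doc) is left unnormalized; its denominator is the
                # same for every topic, so the sampling is unaffected.
                weights = [
                    (doc_topic[d][t] + alpha)
                    * (topic_word[t][word] + beta)
                    / (topic_total[t] + beta * vocab_size)
                    for t in range(num_topics)
                ]

                # Step 3: reassign the word to a topic drawn from those weights.
                new = rng.choices(range(num_topics), weights=weights)[0]
                assignments[d][i] = new
                doc_topic[d][new] += 1
                topic_word[new][word] += 1
                topic_total[new] += 1

    return assignments, topic_word


toy_docs = [["cpu", "memory", "cpu"], ["heart", "arteries", "heart"]]
topic_assignments, words_per_topic = learn_topics(toy_docs, num_topics=2, iterations=10)
```

For real corpora, the tools listed on slide 17 (for example Gensim or Mallet) are the more practical choice; the sketch only shows the shape of the reassignment step.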
  • 16. Image Source: David Blei, “Probabilistic Topic Models” http://yosinski.com/mlss12/MLSS-2012-Blei-Probabilistic-Topic-Models/
  • 17. * * Use Cases * Data exploration in large corpus * Pre-classification analysis * Identify dominant themes * Tools *Stanford Topic Modeling Toolbox *Mallet (UMass Amherst) *R package: topicmodels *Python package: Gensim
  • 18. * * Sentiment Analysis * Topic Modeling
  • 19. * * Emerging Demand for Text Analytics * Text Mining Techniques *Sentiment Analysis *Topic Modeling *Classification *Named Entity Recognition *Event Extraction * Workflows * Performance Considerations
  • 20. * 3 Key Components * Data * Representation scheme * Algorithms * Data * Positive examples – Examples from representative corpus * Negative examples – Randomly selected from same publications * Representation * TF-IDF * Vector space representation * Cosine of vectors measure of similarity * Algorithms * Supervised learning * SVMs * Ridge Classifier * Perceptrons * kNN * SGD Classifier * Naïve Bayes * Random Forest * AdaBoost *
  • 21. *
  • 22. *
  • 23. * Source: Steven Bird, Ewan Klein, and Edward Loper. Natural Language Processing with Python: Analyzing Text with Natural Language Toolkit. http://www.nltk.org/book/
  • 24. Support Vector Machine (SVM) is a large-margin classifier Commonly used in text classification Initial results based on life sciences sentence classifier Image Source: http://en.wikipedia.org/wiki/File:Svm_max_sep_hyperplane_with_margin.png *
  • 25. *Term Frequency (TF) tf(t,d) = # of occurrences of t in d t is a term d is a document *Inverse Document Frequency (IDF) idf(t,D) = log(N / |{d in D : t in d}|) D is set of documents N is number of documents *TF-IDF = tf(t,d) * idf(t,D) *TF-IDF is *large when the term is frequent in the document and appears in few documents overall *small when term appears in many documents *
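The definitions on slide 25 translate directly into a few lines of Python. This is a minimal sketch using raw counts and the natural logarithm; the function names, the zero weight for terms that appear in no document, and the toy sentences are assumptions made for illustration.

```python
import math
from collections import Counter


def tf(term, doc):
    """tf(t, d): number of occurrences of the term in a tokenized document."""
    return Counter(doc)[term]


def idf(term, docs):
    """idf(t, D) = log(N / number of documents in D that contain t)."""
    containing = sum(1 for doc in docs if term in doc)
    if containing == 0:
        return 0.0  # assumption: a term seen in no document gets weight 0
    return math.log(len(docs) / containing)


def tf_idf(term, doc, docs):
    """TF-IDF = tf(t, d) * idf(t, D)."""
    return tf(term, doc) * idf(term, docs)


toy_docs = [
    "the toxin damages the host cell".split(),
    "the plasmid was cloned".split(),
    "the data were log transformed".split(),
]
common = tf_idf("the", toy_docs[0], toy_docs)    # appears in every document
rare = tf_idf("toxin", toy_docs[0], toy_docs)    # appears in one document
```

A term found in every document gets an IDF of log(1) = 0, so its TF-IDF is zero no matter how often it occurs, which matches the "small when term appears in many documents" case.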
  • 26. * Bag-of-words model * Ignores structure (syntax) and meaning (semantics) of sentences * Representation vector length is the number of unique words in the corpus * Stemming used to remove morphological differences * Each word is assigned an index in the representation vector, V * The value V[i] is non-zero if word i appears in the sentence represented by the vector * The non-zero value is a function of the frequency of the word in the sentence and the frequency of the term in the corpus *
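The vector representation on slide 26 can be illustrated with a short sketch. For simplicity it stores raw counts in V[i] rather than the frequency-weighted value the slide describes, and it skips stemming; the cosine function is the similarity measure named on slide 20. All function names are hypothetical.

```python
import math


def build_vocabulary(sentences):
    """Assign each unique word in the corpus an index in the vector V."""
    vocab = {}
    for sentence in sentences:
        for word in sentence:
            vocab.setdefault(word, len(vocab))
    return vocab


def to_vector(sentence, vocab):
    """V[i] holds the count of word i in the sentence, or 0 if it is absent."""
    vector = [0] * len(vocab)
    for word in sentence:
        if word in vocab:  # words outside the corpus vocabulary are ignored
            vector[vocab[word]] += 1
    return vector


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


corpus = [["toxin", "host", "cell"], ["host", "cell", "plasmid"]]
vocab = build_vocabulary(corpus)
similarity = cosine(to_vector(corpus[0], vocab), to_vector(corpus[1], vocab))
```

Word order plays no part in the result: any permutation of a sentence maps to the same vector, which is the loss of syntax the slide points out.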
  • 27. Non-VF, Predicted VF: * “Collectively, these data suggest that EPEC 30-5-1(3) translocates reduced levels of EspB into the host cell.” * “Data were log-transformed to correct for heterogeneity of the variances where necessary.” * “Subsequently, the kanamycin resistance cassette from pVK4 was cloned into the PstI site of pMP3, and the resulting plasmid pMP4 was used to target a disruption in the cesF region of EHEC strain 85-170.” VF, Predicted Non-VF: * “Here, it is reported that the pO157-encoded Type V-secreted serine protease EspP influences the intestinal colonization of calves.” * “Here, we report that intragastric inoculation of a Shiga toxin 2 (Stx2)-producing E. coli O157:H7 clinical isolate into infant rabbits led to severe diarrhea and intestinal inflammation but no signs of HUS.” * “The DsbLI system also comprises a functional redox pair”
  • 28. * Adding more examples is not likely to substantially improve results, as seen in the error curve [Plot "All": Training Error and Validation Error, vertical axis 0 to 0.5, horizontal axis 0 to 10,000]
  • 29. 8 Alternative Algorithms Select 10,000 most important features using chi-square
  • 30. * * SAS Text Miner * IBM Text Analytics * Smartlogic * Python: scikit-learn * R: RTextTools * R: tm
  • 31. * * Emerging Demand for Text Analytics * Text Mining Techniques *Sentiment Analysis *Topic Modeling *Classification *Named Entity Recognition *Event Extraction * Workflows * Performance Considerations
  • 32. * * Process of identifying words and phrases that refer to objects in specific categories. Also known as: *Entity identification *Entity extraction *Chunking * Two steps: * Detect entities * Classify entities * Common classes of entities: * Persons * Organizations * Geographic locations * Dates * Monetary amounts
  • 33. *
  • 34. * * Four Broad Techniques *Linguistic - utilize structure of sentence * Statistical – detect patterns in training examples * Custom patterns – regular expressions * Dictionaries *Challenges *Creating training corpus *Granularity
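Of the four techniques on slide 34, custom patterns are the easiest to show in a few lines. The sketch below uses regular expressions to detect and label two of the entity classes from slide 32, dates and monetary amounts. The specific patterns (ISO-style dates, dollar amounts) and the labels are assumptions chosen for illustration; real text needs far broader patterns.

```python
import re

# Hypothetical patterns: ISO-style dates and US dollar amounts only.
PATTERNS = {
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
    "MONEY": re.compile(r"\$\d+(?:,\d{3})*(?:\.\d{2})?"),
}


def find_entities(text):
    """Return (entity text, label) pairs in order of appearance."""
    found = []
    for label, pattern in PATTERNS.items():
        for match in pattern.finditer(text):
            found.append((match.start(), match.group(), label))
    return [(entity, label) for _, entity, label in sorted(found)]


entities = find_entities("Company A paid $1,200.50 to Company B on 2015-04-27.")
```

Here the pattern that matches also determines the label, so the detect and classify steps collapse into one. Persons and organizations rarely follow such regular shapes, which is where the linguistic, statistical, and dictionary techniques come in.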
  • 35. *
  • 36. *
  • 37. * *Use Cases * Name normalization * Entity correlation *Quantified metrics based on texts *Building block for event extraction *Tools * Stanford Core NLP * OpenNLP * Mallet * Basis Technology * Lexalytics * NetOwl * Cogitio API
  • 38. * * Emerging Demand for Text Analytics * Text Mining Techniques *Sentiment Analysis *Topic Modeling *Classification *Named Entity Recognition *Event Extraction * Workflows * Performance Considerations
  • 39. * * Entities and relations between entities * Company A acquires Company B * Engineer A filed patent application on Topic B on Date C *Politician P announces A on Twitter on Date B * Assign roles to entities * Assign subtypes * Link to semantic data
  • 40. * * Brenden’s Twitter NLP Tools - https://github.com/aritter/twitter_nlp * Alchemy API * Turku BioNLP Event Extraction Software * Stanford Biomedical Event Parser Source: Turku Event Extraction System, http://jbjorne.github.io/TEES/
  • 41. * * Classification * Named Entity Recognition * Event Extraction
  • 42. * * Emerging Demand for Text Analytics * Text Mining Techniques *Sentiment Analysis *Topic Modeling *Classification *Named Entity Recognition *Event Extraction * Workflows *Performance Considerations
  • 43. * * Document Collection * Text Extraction * Pre-processing * Case conversion * Punctuation removal * Stemming * Normalization * N-gram analysis * Analysis * Term Frequency – Inverse Document Frequency * Conditional Probabilities and Topic Models * NER and Entity Extraction * Integration * Link to Structured Data * Augment with additional semantic information * Utilization * Improve information retrieval * Identify brand perception problems * Assess likelihood of customer churn * Predict likelihood of … Collect Extract & Pre-Process Analyze Integrate Utilize
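Three of the pre-processing steps on slide 43 (case conversion, punctuation removal, and n-gram analysis) can be sketched with the standard library alone. Stemming and normalization are left out because they usually rely on a library such as NLTK. The function names are hypothetical, and only ASCII punctuation is handled.

```python
import string


def preprocess(text):
    """Lowercase the text, strip ASCII punctuation, and split on whitespace."""
    lowered = text.lower()
    without_punctuation = lowered.translate(str.maketrans("", "", string.punctuation))
    return without_punctuation.split()


def ngrams(tokens, n):
    """Return every run of n consecutive tokens as a tuple."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


tokens = preprocess("Data were log-transformed, where necessary.")
bigrams = ngrams(tokens, 2)
```

Removing punctuation this way also joins hyphenated words ("log-transformed" becomes "logtransformed"), so the order and choice of steps should be checked against the downstream analysis.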
  • 45. * * Emerging Demand for Text Analytics * Text Mining Techniques *Sentiment Analysis *Topic Modeling *Classification *Named Entity Recognition *Event Extraction * Workflows *Performance Considerations
  • 46. * * Scalability * Multiple language support * Quality *Precision *Recall * Algorithm selection * Reliability and timeliness of sources * Integration rules
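The two quality measures on slide 46 have standard definitions: precision is the share of predicted positives that are correct, and recall is the share of actual positives that were found. A minimal sketch follows; the function name and the convention of returning 0.0 when a denominator is zero are assumptions of this example.

```python
def precision_recall(predicted, actual, positive=1):
    """Compute (precision, recall) for one class from parallel label lists."""
    pairs = list(zip(predicted, actual))
    true_pos = sum(1 for p, a in pairs if p == positive and a == positive)
    false_pos = sum(1 for p, a in pairs if p == positive and a != positive)
    false_neg = sum(1 for p, a in pairs if p != positive and a == positive)

    # Assumption: report 0.0 rather than raising when nothing was predicted
    # positive (precision) or nothing was actually positive (recall).
    precision = true_pos / (true_pos + false_pos) if true_pos + false_pos else 0.0
    recall = true_pos / (true_pos + false_neg) if true_pos + false_neg else 0.0
    return precision, recall


precision, recall = precision_recall(predicted=[1, 1, 0, 0], actual=[1, 0, 1, 0])
```

In the sentence classifier examples on slide 27, a non-VF sentence predicted as VF lowers precision, while a VF sentence predicted as non-VF lowers recall.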
  • 47. * Increase quantity of data (not always helpful; see error curves) * Improve quality of data * Utilize multiple supervised algorithms, ensemble and non-ensemble * Use unlabeled data and semi-supervised techniques * Feature Selection * Parameter Tuning * Feature Engineering * Given: * High quality data in sufficient quantity * State of the art machine learning algorithms * How to improve results: Change Representation? *
  • 48. *TF-IDF *Loss of syntactic and semantic information *No relation between term index and meaning *No support for disambiguation *Feature engineering extends vector representation or substitutes specific for more general terms – a crude way to capture semantic properties * Ideal Representation ◦ Captures semantic similarity of words ◦ Does not require feature engineering ◦ Minimal pre-processing, e.g. no mapping to ontologies ◦ Improves precision and recall
  • 49. *Words represented as a set of weights in a vector *Useful properties * Semantically similar words in close proximity * Methods for capturing phrases, e.g. “Secretion system” * Captures some semantic features *Trained with * Skip-gram or CBOW algorithms * Text, such as PubMed abstracts and open access papers * T. Mikolov et al. “Efficient Estimation of Word Representations in Vector Space.” 2013. http://arxiv.org/pdf/1301.3781.pdf
  • 50. *
  • 51. * * “Characterization of the Affective Norms for English Words by discrete emotional categories” http://indiana.edu/~panlab/papers/SraMjaJtw_ANEW.pdf * “New Avenues in Opinion Mining and Sentiment Analysis” http://sentic.net/new-avenues-in-opinion-mining-and-sentiment-analysis.pdf * “Empirical Study of Topic Modeling in Twitter” http://snap.stanford.edu/soma2010/papers/soma2010_12.pdf * “Open Domain Event Extraction from Twitter” http://turing.cs.washington.edu/papers/kdd12-ritter.pdf

Editor's Notes

  1. 1. – Process used in VF 2. – No idea why this is labeled as a 1 3. Probably from a Methods section, refers to resistance cassette 4.