GBCB Seminar
April 24, 2014
 Biocuration in bacterial infectious diseases
 Existing approaches to biocuration
 Goals of current research
 Sentence classification for virulence factor
(VF) curation
 Future Research
 Large scale sequencing,
transcriptomics, proteomics and
metabolomics provide large volumes
of data about structure and function
 Valuable information about genes,
proteins and other biological entities
derived from interpretation of data
 Publications capture information that
researchers extract from data by
aggregating, integrating,
summarizing and analyzing
experiment results and interpreting
those results with respect to other
published results
 Gene annotation
◦ Virulence factors
◦ Antibiotic resistance
◦ Genomic metadata
 Experiment Metadata
◦ Transcriptomic metadata
◦ Metabolomic metadata
 Literature
◦ Named entity recognition
◦ Metadata tagging
 Automated annotation
◦ Example – RAST
◦ Transfer annotations based on similarity
◦ Metabolic reconstruction
 Community curation
◦ Example – WikiGenes
◦ Collaborative manual curation
 Model building
◦ Example - MetaFlux
◦ Predict missing components of pathways based
on FBA models
 Dedicated manual curation
◦ Example – PATRIC
◦ Curate entries with statements
traceable to literature
◦ In 2009, half of biocurators were using text mining in
support of biocuration¹
◦ Common use cases:
 Document prioritization
 Linking entities and relations to biological resources
such as GO or UniProt
 Identification of evidence
◦ Identification of evidence
 Pattern recognition - genomic location information
 Named entity recognition – T4SS components
 Event extraction – positive/negative regulation
1. PMID: 23110974
 Manual procedures are time
consuming and costly
 Volume of literature continues
to grow
 Commonly used search
techniques, such as keyword
search, similarity search, and
metadata filtering, can still yield
volumes of literature that are
difficult to analyze manually
 Popular tools have had some
success but have limitations
http://www.nature.com/nrmicro/journal/v8/n1/fig_tab/nrmicro2260_F2.html
http://stroke.nih.gov/materials/strokechallenges.htm
 Potentially brittle
methods, e.g.
dictionary lookups
 Questions of effort
required to extend
 Named entity
recognition does not
always disambiguate
correctly
 Prioritizing documents
is still challenging
Textpresso Dictionary Entries
Adhesion to host
Adhesion to hosts
Adhesion to other organism during symbiotic interaction
Adhesion to other organism during symbiotic interactions
Adhesion to symbiont
Adhesion to symbionts
Agglutination during conjugation with cellular fusion
Agglutination during conjugation with cellular fusions
Agglutination during conjugation without cellular fusion
Agglutination during conjugation without cellular fusions
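
The brittleness of dictionary lookups can be illustrated with a minimal sketch. It assumes exact, case-insensitive substring matching, which may not be how Textpresso actually matches entries, and the two test sentences are hypothetical rather than taken from the curated set.

```python
import re

# A few entries copied from the Textpresso list above (lowercased for matching).
DICTIONARY = {
    "adhesion to host",
    "adhesion to hosts",
    "adhesion to symbiont",
    "adhesion to symbionts",
}

def dictionary_hits(sentence, dictionary=DICTIONARY):
    """Return the dictionary entries that occur verbatim in the sentence."""
    text = re.sub(r"\s+", " ", sentence.lower())
    return sorted(entry for entry in dictionary if entry in text)

# Hypothetical sentences: the second says the same thing in different words.
exact = dictionary_hits("FimH mediates adhesion to host epithelial cells.")
paraphrase = dictionary_hits("FimH mediates adherence to the host epithelium.")
print(exact)
print(paraphrase)
```

The paraphrase is missed because no entry anticipates "adherence to the host", which is why each new phrasing means another dictionary entry.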
 Generalized set of biocuration tools to:
◦ Filter and prioritize documents
◦ Identify relevant assertion sentences within documents
◦ Extract entities and events
◦ Require minimal manual intervention
 Approach
◦ Address each objective separately
◦ Topic modeling and similarity measures for document
classification
◦ Term Frequency-Inverse Document Frequency (TF-IDF) for
sentence classification
◦ Shallow semantic parsing for entity and event extraction
 Focus of this presentation is TF-IDF for sentence
classification and its limitations
 3 Key Components
◦ Data
◦ Representation scheme
◦ Algorithms
 Data
◦ Positive examples – VF assertion sentences
◦ Negative examples – Randomly selected from same
publications
 Representation
◦ TF-IDF
◦ Vector space representation
◦ Cosine of the angle between vectors as the measure of similarity
 Algorithms
◦ Supervised learning
 SVMs
 Ridge Classifier
 Perceptrons
 kNN
 SGD Classifier
 Naïve Bayes
 Random Forest
 AdaBoost
◦ Semi-supervised learning
 Label Spreading
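
As a rough illustration of how the representation and an algorithm fit together, the sketch below combines cosine similarity with a k-nearest-neighbor vote. The term-weight vectors and their labels are hypothetical stand-ins for TF-IDF vectors of curated sentences, and the function names are illustrative only.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine of the angle between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(term, 0.0) for term, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def knn_predict(query, examples, k=3):
    """Label a query vector by majority vote among its k most similar examples."""
    ranked = sorted(examples, key=lambda ex: cosine(query, ex[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]

# Hypothetical labeled vectors (term -> weight).
examples = [
    ({"adhesion": 1.0, "mutant": 1.0}, "VF"),
    ({"virulence": 1.0, "adhesion": 1.0}, "VF"),
    ({"plasmid": 1.0, "cloned": 1.0}, "non-VF"),
    ({"data": 1.0, "log": 1.0}, "non-VF"),
]
query = {"adhesion": 2.0, "listeria": 1.0}
label = knn_predict(query, examples, k=3)
print(label)
```

Because cosine similarity depends only on shared terms, a query that shares no vocabulary with any example has similarity 0 to all of them, one of the limitations of this representation.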
 “Bacterial virulence factors enable a [pathogen] to replicate and disseminate within a
host in part by subverting or eluding host defenses.”¹
 Example assertion sentences about virulence factors
 Mutations in the fimH gene of Salmonella typhimurium result in a non-fimbriate, non-adhesive
phenotype.²
 Unexpectedly, here we find that nonacylated LprG retains TLR2 activity.³
 The autolysin Ami contributes to the adhesion of Listeria.⁴
 Negative examples are randomly selected non-VF assertion sentences from the same set
of publications.
 VF Sentence Set 1 - PATRIC team of biocurators identified 4,696 assertion sentences in
1,127 publications about virulence in 5 genera: Escherichia, Listeria, Mycobacterium,
Salmonella, Shigella
 VF Sentence Set 2 - Second round of curation over initial results yielded 3,716 VF assertion
sentences from 787 publications across 6 genera: Bartonella, Escherichia, Listeria,
Mycobacterium, Salmonella, Shigella
1. A. Cross, “What is a Virulence Factor” Crit Care. 2008; 12(6): 196.
2. Hancox, Yeh et al. 1997
3. Drage, Tsai et al. 2010
4. Milohanic, Jonquieres et al. 2001
 Term Frequency (TF)
tf(t,d) = # of occurrences of t in d
t is a term
d is a document
 Inverse Document Frequency (IDF)
idf(t,D) = log(N / |{d in D : t in d}|)
D is the set of documents
N is the number of documents in D
 TF-IDF = tf(t,d) * idf(t,D)
 TF-IDF is
◦ large when a term occurs frequently in the document but
in few documents of the corpus
◦ small when the term appears in many documents
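
The definitions above can be worked through in a few lines of plain Python. This is a minimal sketch using the unsmoothed formula exactly as written on the slide; library implementations often add smoothing and normalization, so their numbers will usually differ. The first toy sentence is one of the example assertion sentences; the other two are loosely adapted from the non-VF examples.

```python
import math
from collections import Counter

def tf_idf(docs):
    """Return one {term: weight} dict per tokenized document.

    tf(t, d)  = number of occurrences of t in d
    idf(t, D) = log(N / number of documents in D that contain t)
    """
    n_docs = len(docs)
    doc_freq = Counter(term for doc in docs for term in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append(
            {t: tf * math.log(n_docs / doc_freq[t]) for t, tf in counts.items()}
        )
    return weights

docs = [
    "the autolysin ami contributes to the adhesion of listeria".split(),
    "the plasmid was used to target the cesf region".split(),
    "data were log transformed where necessary".split(),
]
weights = tf_idf(docs)
for term in ("autolysin", "the", "to"):
    print(term, round(weights[0][term], 3))
```

A term that appears in only one document ("autolysin") outweighs a term repeated within the document but shared across documents ("the"), and a term found in every document gets weight log(1) = 0.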
 Bag-of-words model
 Ignores structure (syntax) and
meaning (semantics) of sentences
 Representation vector length is the
size of set of unique words in
corpus
 Stemming used to remove
morphological differences
 Each word is assigned an index in
the representation vector, V
 The value V[i] is non-zero if word
appears in sentence represented by
vector
 The non-zero value is a function of
the frequency of the word in the
sentence and the frequency of the
term in the corpus
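
A minimal sketch of the indexing scheme described above. The `crude_stem` function is a hypothetical stand-in for a real stemmer (the slides do not say which one was used), and the vector holds raw counts rather than TF-IDF weights to keep the example short.

```python
def crude_stem(word):
    """Very rough stand-in for a real stemmer: strip a few common suffixes."""
    for suffix in ("ing", "es", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def build_index(sentences):
    """Assign each unique stem a position in the representation vector."""
    index = {}
    for sentence in sentences:
        for word in sentence.lower().split():
            index.setdefault(crude_stem(word), len(index))
    return index

def to_vector(sentence, index):
    """Count stems; positions for words absent from the sentence stay 0."""
    vector = [0] * len(index)
    for word in sentence.lower().split():
        stem = crude_stem(word)
        if stem in index:
            vector[index[stem]] += 1
    return vector

corpus = ["adhesion to host", "adhesion to hosts", "agglutination during conjugation"]
index = build_index(corpus)
v1 = to_vector(corpus[0], index)
v2 = to_vector(corpus[1], index)
print(index)
print(v1, v2)
```

After stemming, "host" and "hosts" share an index, so the two phrases map to the same vector. Word order plays no part in the representation, as the slide notes.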
 Support Vector Machine (SVM) is a
large-margin classifier
 Commonly used in text classification
 Initial results based on VF Sentence
Set 1
Image source: http://en.wikipedia.org/wiki/File:Svm_max_sep_hyperplane_with_margin.png
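
To make "large margin" concrete, the sketch below compares two hyperplanes that both separate the same toy points and measures how far the closest point lies from each. The points, labels, and hyperplane parameters are hypothetical; an SVM solver would search for the separator with the widest such margin rather than have it supplied by hand.

```python
import math

def decision_value(w, b, x):
    """Signed score w.x + b; its sign is the predicted class."""
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def geometric_margin(w, b, points, labels):
    """Smallest signed distance from any point to the hyperplane.

    The value is negative if at least one point is misclassified.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    return min(y * decision_value(w, b, x) / norm for x, y in zip(points, labels))

# Toy 2-D points with labels +1 / -1 (not from the VF data).
points = [(2.0, 2.0), (3.0, 3.0), (0.0, 0.0), (-1.0, 0.0)]
labels = [1, 1, -1, -1]

wide = geometric_margin((1.0, 1.0), -2.0, points, labels)
narrow = geometric_margin((1.0, 0.0), -0.5, points, labels)
print(round(wide, 3), round(narrow, 3))
```

Both hyperplanes classify every point correctly, but the first leaves more room between the classes, which is the property a large-margin classifier optimizes for.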
 Non-VF, Predicted VF:
◦ “Collectively, these data suggest that EPEC 30-5-1(3) translocates reduced levels
of EspB into the host cell.”
◦ “Data were log-transformed to correct for heterogeneity of the variances where
necessary.”
◦ “Subsequently, the kanamycin resistance cassette from pVK4 was cloned into the
PstI site of pMP3, and the resulting plasmid pMP4 was used to target a disruption
in the cesF region of EHEC strain 85-170.”
 VF, Predicted Non-VF:
◦ “Here, it is reported that the pO157-encoded Type V-secreted serine protease
EspP influences the intestinal colonization of calves.”
◦ “Here, we report that intragastric inoculation of a Shiga toxin 2 (Stx2)-producing
E. coli O157:H7 clinical isolate into infant rabbits led to severe diarrhea and
intestinal inflammation but no signs of HUS.”
◦ “The DsbLI system also comprises a functional redox pair”
 Adding more examples is unlikely to
substantially improve results, as the error curve shows
[Figure: error curve, "All"; training error and validation error (y-axis, 0 to 0.5) plotted over 0 to 10,000 on the x-axis]
 8 Alternative Algorithms
 VF Sentence Set 2
 Select 10,000 most important features using
chi-square
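
One way to score features with chi-square is the classic 2x2 test of term presence against class label, sketched below on toy data. This is an assumption about the general technique, not the exact procedure used in the talk; library implementations (for example the chi-square scorer in scikit-learn) work from term counts and may produce different scores. The documents and labels are hypothetical.

```python
def chi_square(term, docs, labels):
    """2x2 chi-square statistic for term presence versus a binary label."""
    pos_with = sum(1 for d, y in zip(docs, labels) if term in d and y == 1)
    neg_with = sum(1 for d, y in zip(docs, labels) if term in d and y == 0)
    pos_without = sum(1 for d, y in zip(docs, labels) if term not in d and y == 1)
    neg_without = sum(1 for d, y in zip(docs, labels) if term not in d and y == 0)
    n = pos_with + neg_with + pos_without + neg_without
    denom = (
        (pos_with + neg_with)
        * (pos_without + neg_without)
        * (pos_with + pos_without)
        * (neg_with + neg_without)
    )
    if denom == 0:
        return 0.0  # term is in every document or in none: no information
    return n * (pos_with * neg_without - neg_with * pos_without) ** 2 / denom

def top_k_terms(docs, labels, k):
    """Return the k terms with the highest chi-square score."""
    vocab = sorted(set().union(*docs))
    return sorted(vocab, key=lambda t: chi_square(t, docs, labels), reverse=True)[:k]

# Hypothetical sentences as term sets; label 1 = VF, 0 = non-VF.
docs = [
    {"adhesion", "gene", "the"},
    {"adhesion", "mutant", "the"},
    {"plasmid", "gene", "the"},
    {"plasmid", "cloned", "the"},
]
labels = [1, 1, 0, 0]
best = top_k_terms(docs, labels, k=2)
print(best)
```

Terms that occur in only one class score highest, while terms spread evenly across classes ("gene") or present everywhere ("the") score zero and would be dropped.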
 Machine learning technique that takes
advantage of unlabeled data
 Unlabeled data helps determine shape
of underlying data distribution
 Added randomly selected, unlabeled
sentences from VF publications
 Trained with 842 labeled and 4,346
unlabeled sentences
 Label Spreading is a semi-supervised
algorithm somewhat resilient to noise
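
The idea behind label spreading can be sketched on a toy problem: build a similarity graph over all points, then repeatedly let each point take on a blend of its neighbors' label scores and its own original label. The code below is a simplified pure-Python version of the commonly described update F <- alpha * S * F + (1 - alpha) * Y; the one-dimensional points and the `gamma` and `alpha` values are hypothetical, and the implementation used in the talk may differ in detail.

```python
import math

def label_spreading(points, labels, n_classes, gamma=0.1, alpha=0.8, n_iter=50):
    """Spread labels over an RBF similarity graph; -1 marks an unlabeled point."""
    n = len(points)
    # RBF affinities with a zero diagonal.
    w = [[0.0 if i == j else math.exp(-gamma * (points[i] - points[j]) ** 2)
          for j in range(n)] for i in range(n)]
    degree = [sum(row) for row in w]
    # Symmetrically normalized affinity matrix S = D^-1/2 W D^-1/2.
    s = [[w[i][j] / math.sqrt(degree[i] * degree[j]) for j in range(n)]
         for i in range(n)]
    # One-hot rows for labeled points, all-zero rows for unlabeled ones.
    y = [[1.0 if labels[i] == c else 0.0 for c in range(n_classes)]
         for i in range(n)]
    f = [row[:] for row in y]
    for _ in range(n_iter):
        f = [[alpha * sum(s[i][j] * f[j][c] for j in range(n))
              + (1 - alpha) * y[i][c]
              for c in range(n_classes)] for i in range(n)]
    return [max(range(n_classes), key=lambda c: f[i][c]) for i in range(n)]

# Two labeled points at the ends, two unlabeled points in between.
points = [0.0, 1.0, 9.0, 10.0]
labels = [0, -1, -1, 1]
predicted = label_spreading(points, labels, n_classes=2)
print(predicted)
```

Each unlabeled point ends up with the label of the labeled point it sits closest to, which is how unlabeled data helps reveal the shape of the underlying distribution.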
 Algorithms have parameters not learned from
data
 SVMs, for example:
◦ C – balances training error and over-fitting
◦ Kernel – function to map data to high-dimensional
space, e.g. linear, polynomial
◦ Gamma – parameter in non-linear kernels, controls how
far influence of a training instance reaches
 Search combination of parameters
 Optimal results with linear kernel and slightly
smaller C than default
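
The effect of gamma is easy to see numerically. The sketch below assumes the standard RBF (Gaussian) kernel, exp(-gamma * squared distance), as the non-linear kernel in question; the points and gamma values are illustrative.

```python
import math

def rbf_kernel(x, z, gamma):
    """RBF kernel: exp(-gamma * squared Euclidean distance between x and z)."""
    sq_dist = sum((xi - zi) ** 2 for xi, zi in zip(x, z))
    return math.exp(-gamma * sq_dist)

x = (0.0, 0.0)
near = (0.5, 0.0)
far = (3.0, 0.0)
for gamma in (0.1, 1.0, 10.0):
    print(gamma, round(rbf_kernel(x, near, gamma), 4), round(rbf_kernel(x, far, gamma), 4))
```

With a small gamma even the distant point still has noticeable similarity to `x`; with a large gamma the similarity falls to nearly zero beyond the immediate neighborhood, so each training instance influences only points very close to it.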
 Process of explicitly modeling relations between variables or explicitly
representing information not already in a representation scheme, for
example:
◦ Replace all numerals with the token NUMBER
◦ Replace gene/protein names with the token GENE_Protein
 Used in text classification problems, e.g. phrase-based learning has
improved some rule-based classifiers.¹
 Rule-based learners may not be generalizable to other domains
 The taxonomic structure of the Unified Medical Language System (UMLS) has been used
to create semantic similarity measures.²
 Most informative features can be detected automatically, e.g. with chi-square
 Manual feature engineering is not a viable option if our goal is
topic-independent support for biocuration
1. DOI:10.1.1.36.9770
2. PMID: 22580178
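
The two substitutions named above can be sketched with a regular expression and a lookup set. The gene/protein list here is a hypothetical hand-made lexicon and the input sentence is invented; maintaining such a list for every topic is exactly the manual effort the slide argues against.

```python
import re

# Hypothetical gene/protein lexicon; a real system would need a curated list
# or a named entity recognizer.
GENE_NAMES = {"fimH", "LprG", "EspP", "EspB"}

# Whole-token integers or decimals, e.g. "12" or "0.05".
NUMBER_RE = re.compile(r"\b\d+(?:\.\d+)?\b")

def generalize(sentence, gene_names=GENE_NAMES):
    """Replace known gene/protein names and numerals with placeholder tokens."""
    tokens = []
    for token in sentence.split():
        core = token.strip(".,;:()")
        if core in gene_names:
            token = token.replace(core, "GENE_Protein")
        tokens.append(token)
    return NUMBER_RE.sub("NUMBER", " ".join(tokens))

result = generalize("Mutations in the fimH gene were found in 12 of 30 isolates.")
print(result)
```

After the substitution, sentences about different genes or different counts share the same features, which is the "more general terms" effect described above.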
 Improve quality of data (quantity not likely helpful)
 Utilize multiple supervised algorithms,
ensemble and non-ensemble
 Use unlabeled data and semi-supervised
techniques
 Feature Selection
 Parameter Tuning
 Feature Engineering
 Given:
◦ High quality data in sufficient quantity
◦ State of the art machine learning algorithms
 How to improve results: Change Representation?
 TF-IDF
◦ Loss of syntactic and
semantic information
◦ No relation between
term index and
meaning
◦ No support for
disambiguation
◦ Feature engineering
extends the vector
representation or
substitutes more general
terms for specific ones, a
crude way to capture
semantic properties
 Ideal
Representation
◦ Capture semantic
similarity of words
◦ Does not require
feature engineering
◦ Minimal pre-processing,
e.g. no
mapping to
ontologies
◦ Improves precision
and recall
 Each word is represented as a vector of weights
 Useful properties
◦ Semantically similar words in close proximity
◦ Methods for capturing phrases, e.g. “Secretion system”
◦ Captures some semantic features
 Trained with
◦ Skip-gram or CBOW algorithms
◦ Text, such as PubMed abstracts and open access papers
T. Mikolov et al. “Efficient Estimation of Word Representations in Vector Space.” 2013. http://arxiv.org/pdf/1301.3781.pdf
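
In the skip-gram setup, the model learns word vectors by predicting the words that surround each center word. The sketch below shows only how the (center, context) training pairs are generated from a token sequence; it assumes a fixed window and omits the subsampling, negative sampling, and the network itself. The example phrase is illustrative.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within `window` positions."""
    pairs = []
    for i, center in enumerate(tokens):
        lo = max(0, i - window)
        hi = min(len(tokens), i + window + 1)
        for j in range(lo, hi):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs

tokens = "type iii secretion system effector".split()
pairs = skipgram_pairs(tokens, window=1)
for pair in pairs:
    print(pair)
```

Words that keep appearing in the same contexts receive similar training signals, which is why semantically related words tend to end up close together in the vector space.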
 Utilize distributed representations in
classification algorithms
 Compare SVM and multi-layered neural
network for classification
 Build on distributed word representation as
basis for shallow semantic parsing and
information extraction
 Apply to other specialty gene sets
 PATRIC Curators: Rebecca
Wattam, Chunhong Mao, David
Abraham, Meredith Wilson,
Yan Zhang
 Resources
◦ PATRIC www.patricbrc.org
◦ Python, NumPy, SciPy, scikit-learn
◦ IPython
◦ Gensim
 Funding
◦ National Institute of Allergy and
Infectious Diseases, National
Institutes of Health
Text Mining for Biocuration of Bacterial Infectious Diseases

  • 2.  Biocuration in bacterial infectious diseases  Existing approaches to biocuration  Goals of current research  Sentence classification for virulence factor (VF) curation  Future Research
  • 3.  Large scale sequencing, transcriptomics, proteomics and metabolomics provide large volumes of data about structure and function  Valuable information about genes, proteins and other biological entities derived from interpretation of data  Publications capture information that researchers extract from data by aggregating, integrating, summarizing and analyzing experiment results and interpreting those results with respect to other published results
  • 4.  Gene annotation ◦ Virulence factors ◦ Antibiotic resistance ◦ Genomic metadata  Experiment Metadata ◦ Transcriptomic metadata ◦ Metabolomic metadata  Literature ◦ Named entity recognition ◦ Metadata tagging
  • 5.  Automated annotation ◦ Example – RAST ◦ Transfer annotations based on similarity ◦ Metabolic reconstruction  Community curation ◦ Example – WikiGenes ◦ Collaborative manual curation  Model building ◦ Example – MetaFlux ◦ Predict missing components of pathways based on FBA models  Dedicated manual curation ◦ Example – PATRIC ◦ Curate entries with statements traceable to literature
  • 6. ◦ In 2009, half of biocurators were using text mining in support of biocuration1 ◦ Common use cases:  Document prioritization  Linking entities and relations to biological resources such as GO or UniProt  Identification of evidence ◦ Identification of evidence  Pattern recognition - genomic location information  Named entity recognition – T4SS components  Event extraction – positive/negative regulation 1. PMID: 23110974
  • 7.  Manual procedures are time consuming and costly  Volume of literature continues to grow  Commonly used search techniques, such as keyword, similarity searching, metadata filtering, etc. can still yield volumes of literature that are difficult to analyze manually  Some success with popular tools but limitations
  • 9.  Potentially brittle methods, e.g. dictionary lookups  Questions of effort required to extend  Named entity recognition does not always disambiguate correctly  Prioritizing documents is still challenging Textpresso Dictionary Entries: Adhesion to host; Adhesion to hosts; Adhesion to other organism during symbiotic interaction; Adhesion to other organism during symbiotic interactions; Adhesion to symbiont; Adhesion to symbionts; Agglutination during conjugation with cellular fusion; Agglutination during conjugation with cellular fusions; Agglutination during conjugation without cellular fusion; Agglutination during conjugation without cellular fusions
  • 10.  Generalized set of biocuration tools to: ◦ Filter and prioritize documents ◦ Identify relevant assertion sentences within documents ◦ Extract entities and events ◦ Require minimal manual intervention  Approach ◦ Address each objective separately ◦ Topic modeling and similarity measures for document classification ◦ Term Frequency-Inverse Document Frequency (TF-IDF) for sentence classification ◦ Shallow semantic parsing for entity and event extraction  Focus of this presentation is TF-IDF for sentence classification and its limitations
  • 11.  3 Key Components ◦ Data ◦ Representation scheme ◦ Algorithms  Data ◦ Positive examples – VF assertion sentences ◦ Negative examples – Randomly selected from same publications  Representation ◦ TF-IDF ◦ Vector space representation ◦ Cosine of vectors measure of similarity  Algorithms ◦ Supervised learning  SVMs  Ridge Classifier  Perceptrons  kNN  SGD Classifier  Naïve Bayes  Random Forest  AdaBoost • Semisupervised Learning • Label Spreading
  • 12.  “Bacterial virulence factors enable a [pathogen] to replicate and disseminate within a host in part by subverting or eluding host defenses.”1  Example assertion sentences about virulence factors  Mutations in the fimH gene of Salmonella typhimurium result in a non-fimbriate, non-adhesive phenotype.2  Unexpectedly, here we find that nonacylated LprG retains TLR2 activity.3  The autolysin Ami contributes to the adhesion of Listeria.4  Negative examples are randomly selected non-VF assertion sentences from the same set of publications.  VF Sentence Set 1 - PATRIC team of biocurators identified 4,696 assertion sentences in 1,127 publications about virulence in 5 genera: Escherichia, Listeria, Mycobacterium, Salmonella, Shigella  VF Sentence Set 2 - Second round of curation over initial results yielded 3,716 VF assertion sentences from 787 publications across 6 genera: Bartonella, Escherichia, Listeria, Mycobacterium, Salmonella, Shigella 1. A. Cross, “What is a Virulence Factor” Crit Care. 2008; 12(6): 196. 2. Hancox, Yeh et al. 1997 3. Drage, Tsai et al. 2010 4. Milohanic, Jonquieres et al. 2001
  • 13.  Term Frequency (TF): tf(t,d) = number of occurrences of t in d, where t is a term and d is a document  Inverse Document Frequency (IDF): idf(t,D) = log(N / |{d in D : t in d}|), where D is the set of documents and N is the number of documents in D  TF-IDF: tfidf(t,d,D) = tf(t,d) * idf(t,D)  TF-IDF is ◦ large when the term is frequent in the document and appears in few documents overall ◦ small when the term appears in many documents
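The definitions on this slide can be sketched directly in plain Python. This is only an illustration of the formulas, not the presenter's code: the function names and the toy corpus are hypothetical, and each "document" is assumed to be an already tokenized sentence.

```python
import math
from collections import Counter


def tf(term, doc_tokens):
    """tf(t, d): raw number of occurrences of the term in one document."""
    return Counter(doc_tokens)[term]


def idf(term, corpus):
    """idf(t, D) = log(N / number of documents in D that contain t)."""
    n_containing = sum(1 for doc in corpus if term in doc)
    if n_containing == 0:
        raise ValueError(f"term {term!r} does not occur in the corpus")
    return math.log(len(corpus) / n_containing)


def tf_idf(term, doc_tokens, corpus):
    """Product of term frequency and inverse document frequency."""
    return tf(term, doc_tokens) * idf(term, corpus)


# Hypothetical toy corpus; tokens are lowercased words.
corpus = [
    ["fimh", "mutation", "reduces", "adhesion"],
    ["ami", "contributes", "to", "adhesion"],
    ["data", "were", "log", "transformed"],
]

rare = tf_idf("fimh", corpus[0], corpus)        # term in 1 of 3 documents
common = tf_idf("adhesion", corpus[0], corpus)  # term in 2 of 3 documents
```

A term confined to one sentence ("fimh") gets a higher weight than one shared across sentences ("adhesion"), which is the behaviour the last bullet describes.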
  • 14.  Bag-of-words model  Ignores structure (syntax) and meaning (semantics) of sentences  Representation vector length is the number of unique words in the corpus  Stemming used to remove morphological differences  Each word is assigned an index in the representation vector, V  The value V[i] is non-zero if the word with index i appears in the sentence represented by the vector  The non-zero value is a function of the frequency of the word in the sentence and the frequency of the word in the corpus
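A minimal sketch of the bag-of-words vector and the cosine similarity used to compare two such vectors follows. It is an assumption-laden simplification: the helper names are hypothetical, stemming is skipped, and raw counts stand in for the TF-IDF weights described above.

```python
import math
from collections import Counter


def build_vocabulary(sentences):
    """Assign every unique token an index in the representation vector."""
    vocab = {}
    for tokens in sentences:
        for tok in tokens:
            vocab.setdefault(tok, len(vocab))
    return vocab


def to_vector(tokens, vocab):
    """Count vector V: V[i] is non-zero only if word i occurs in the sentence."""
    vec = [0.0] * len(vocab)
    for tok, count in Counter(tokens).items():
        if tok in vocab:
            vec[vocab[tok]] = float(count)
    return vec


def cosine(u, v):
    """Cosine of the angle between two vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical tokenized sentences.
s1 = ["ami", "contributes", "adhesion"]
s2 = ["adhesion", "contributes", "ami"]  # same words, different order
s3 = ["data", "log", "transformed"]      # no words in common with s1

vocab = build_vocabulary([s1, s2, s3])
same_words = cosine(to_vector(s1, vocab), to_vector(s2, vocab))
no_overlap = cosine(to_vector(s1, vocab), to_vector(s3, vocab))
```

Because word order is discarded, `s1` and `s2` are indistinguishable under this representation, which illustrates the loss of syntactic information the slide points out.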
  • 15.  Support Vector Machine (SVM) is a large-margin classifier  Commonly used in text classification  Initial results based on VF Sentence Set 1 Image source: http://en.wikipedia.org/wiki/File:Svm_max_sep_hyperplane_with_margin.png
  • 16.  Non-VF, Predicted VF: ◦ “Collectively, these data suggest that EPEC 30-5-1(3) translocates reduced levels of EspB into the host cell.” ◦ “Data were log-transformed to correct for heterogeneity of the variances where necessary.” ◦ “Subsequently, the kanamycin resistance cassette from pVK4 was cloned into the PstI site of pMP3, and the resulting plasmid pMP4 was used to target a disruption in the cesF region of EHEC strain 85-170.”  VF, Predicted Non-VF: ◦ “Here, it is reported that the pO157-encoded Type V-secreted serine protease EspP influences the intestinal colonization of calves.” ◦ “Here, we report that intragastric inoculation of a Shiga toxin 2 (Stx2)-producing E. coli O157:H7 clinical isolate into infant rabbits led to severe diarrhea and intestinal inflammation but no signs of HUS.” ◦ “The DsbLI system also comprises a functional redox pair”
  • 17.  Adding more examples is not likely to substantially improve results, as seen in the error curve [Figure: error curves; axis ranges 0 to 0.5 and 0 to 10,000; legend: All, Training Error, Validation Error]
  • 18.  8 Alternative Algorithms  VF Sentence Set 2  Select 10,000 most important features using chi-square
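Chi-square feature selection ranks each term by how strongly its presence is associated with the class label. The sketch below uses the textbook 2x2 contingency-table statistic in plain Python. That choice is an assumption made for illustration: the presentation does not say which implementation was used, and library versions of chi-square (for example the one in scikit-learn) compute the statistic from term counts rather than presence. The function names and toy data are hypothetical.

```python
def chi_square(term, sentences, labels):
    """2x2 chi-square statistic for term presence versus class label."""
    a = b = c = d = 0
    for tokens, label in zip(sentences, labels):
        present = term in tokens
        if present and label:
            a += 1  # positive sentence containing the term
        elif present:
            b += 1  # negative sentence containing the term
        elif label:
            c += 1  # positive sentence without the term
        else:
            d += 1  # negative sentence without the term
    n = a + b + c + d
    denom = (a + b) * (c + d) * (a + c) * (b + d)
    return n * (a * d - b * c) ** 2 / denom if denom else 0.0


def top_k_features(sentences, labels, k):
    """Return the k terms with the highest chi-square score."""
    vocab = sorted({tok for tokens in sentences for tok in tokens})
    ranked = sorted(vocab, key=lambda t: chi_square(t, sentences, labels),
                    reverse=True)
    return ranked[:k]


# Hypothetical toy data: 1 = VF assertion sentence, 0 = other sentence.
sentences = [
    ["ami", "adhesion"],
    ["fimh", "adhesion"],
    ["data", "log"],
    ["data", "plasmid"],
]
labels = [1, 1, 0, 0]

selected = top_k_features(sentences, labels, k=2)
```

Note that a term scores highly whether it signals the positive class ("adhesion") or the negative class ("data"); both are informative for the classifier.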
  • 19.  Machine learning technique that takes advantage of unlabeled data  Unlabeled data helps determine shape of underlying data distribution  Added randomly selected, unlabeled sentences from VF publications  Trained with 842 labeled and 4346 unlabeled  Label Spreading is a semi-supervised algorithm somewhat resilient to noise
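As a rough illustration of the semi-supervised setup on this slide, the following sketch runs scikit-learn's `LabelSpreading` over TF-IDF vectors, marking unlabeled sentences with `-1` as that class expects. The sentences, the RBF kernel, and the `gamma` and `alpha` values are all assumptions chosen for a tiny toy example; the presentation does not state which settings were used.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.semi_supervised import LabelSpreading

# Hypothetical labeled sentences: 1 = VF assertion, 0 = other.
labeled = [
    ("the autolysin ami contributes to the adhesion of listeria", 1),
    ("mutations in fimh result in a non adhesive phenotype", 1),
    ("data were log transformed to correct for heterogeneity", 0),
    ("the cassette was cloned into the plasmid", 0),
]
# Hypothetical unlabeled sentences drawn from the same publications.
unlabeled = [
    "the protease contributes to adhesion and colonization",
    "the plasmid was cloned and the data were transformed",
]

texts = [text for text, _ in labeled] + unlabeled
# scikit-learn marks unlabeled samples with the label -1.
y = np.array([label for _, label in labeled] + [-1] * len(unlabeled))

# Dense array keeps the example simple; real corpora would be much larger.
X = TfidfVectorizer().fit_transform(texts).toarray()

model = LabelSpreading(kernel="rbf", gamma=1.0, alpha=0.2)
model.fit(X, y)

# Labels inferred for the originally unlabeled sentences.
inferred = model.transduction_[len(labeled):]
```

The `alpha` parameter controls how much a point's label may be revised from its neighbours, which is the source of the noise tolerance mentioned in the last bullet.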
  • 20.  Algorithms have parameters not learned from data  SVMs, for example: ◦ C – balances training error and over-fitting ◦ Kernel – function to map data to high-dimensional space, e.g. linear, polynomial ◦ Gamma – parameter in non-linear kernels, controls how far influence of a training instance reaches  Search combination of parameters  Optimal results with linear kernel and slightly smaller C than default
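A parameter search of the kind described here can be sketched with scikit-learn's `GridSearchCV` over a TF-IDF plus SVM pipeline. The grid values, the toy sentences, and the two-fold cross-validation are assumptions made to keep the example small; they are not the settings from the presentation.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Hypothetical toy data: 1 = VF assertion sentence, 0 = other sentence.
sentences = [
    "the autolysin ami contributes to the adhesion of listeria",
    "mutations in fimh result in a non adhesive phenotype",
    "espp influences intestinal colonization of calves",
    "the toxin is required for virulence in mice",
    "data were log transformed to correct for heterogeneity",
    "the cassette was cloned into the plasmid",
    "cells were grown overnight in broth",
    "samples were centrifuged and stored on ice",
]
labels = [1, 1, 1, 1, 0, 0, 0, 0]

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", SVC()),
])

# Hyperparameters are not learned from data, so each combination is
# evaluated by cross-validation. gamma is ignored by the linear kernel.
param_grid = {
    "svm__kernel": ["linear", "rbf"],
    "svm__C": [0.1, 0.5, 1.0, 10.0],
    "svm__gamma": ["scale", 0.1],
}

search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(sentences, labels)
best = search.best_params_
```

On a realistic corpus, `search.best_params_` and `search.best_score_` would show which kernel and `C` generalize best; with eight toy sentences the winner carries no meaning.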
  • 21.  Process of explicitly modeling relations between variables or explicitly representing information not already in a representation scheme, for example: ◦ Classify all numbers as NUMBER instead of numerals ◦ Replace gene/protein names with term GENE_Protein  Used in text classification problems, e.g. phrase-based learning has improved some rule-based classifiers.1  Rule-based learners may not be generalizable to other domains  Taxonomic structure of Unified Medical Language System (UMLS) used to create semantic similarity measures.2  Most informative features can be detected automatically, e.g. chi-square  Manual feature engineering is not a viable option if our goal is topic-independent support for biocuration 1. DOI:10.1.1.36.9770 2. PMID: 22580178
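The two substitutions named on this slide can be sketched with regular expressions. This is a hypothetical illustration: the `generalize` helper and the hand-supplied set of gene names are assumptions, and a real pipeline would need a named entity recognizer or dictionary to find gene and protein mentions.

```python
import re

# Matches standalone integers and decimals such as "12" or "0.5".
NUMBER_RE = re.compile(r"\b\d+(?:\.\d+)?\b")


def generalize(sentence, gene_names):
    """Replace known gene/protein names and numerals with generic tokens."""
    # Longest names first so a short name never clobbers part of a longer one.
    for name in sorted(gene_names, key=len, reverse=True):
        sentence = re.sub(rf"\b{re.escape(name)}\b", "GENE_Protein", sentence)
    return NUMBER_RE.sub("NUMBER", sentence)


# Hypothetical example sentence and gene list.
example = generalize("EspP was measured in 12 calves at 0.5 mg", {"EspP"})
```

Every such rule has to be written and maintained by hand for each domain, which is why the slide argues that manual feature engineering does not scale to topic-independent biocuration.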
  • 22.  Improve quality of data (quantity not likely helpful)  Utilize multiple supervised algorithms, ensemble and non-ensemble  Use unlabeled data and semi-supervised techniques  Feature Selection  Parameter Tuning  Feature Engineering  Given: ◦ High quality data in sufficient quantity ◦ State of the art machine learning algorithms  How to improve results: Change Representation?
  • 23.  TF-IDF ◦ Loss of syntactic and semantic information ◦ No relation between term index and meaning ◦ No support for disambiguation ◦ Feature engineering extends the vector representation or substitutes more general terms for specific ones – a crude way to capture semantic properties  Ideal Representation ◦ Captures semantic similarity of words ◦ Does not require feature engineering ◦ Minimal preprocessing, e.g. no mapping to ontologies ◦ Improves precision and recall
  • 24.  Words represented as set of weights in vector  Useful properties ◦ Semantically similar words in close proximity ◦ Methods for capturing phrases, e.g. “Secretion system” ◦ Captures some semantic features  Trained with ◦ Skip-gram or CBOW algorithms ◦ Text, such as PubMed abstracts and open access papers T. Mikolov et al. “Efficient Estimation of Word Representations in Vector Space.” 2013. http://arxiv.org/pdf/1301.3781.pdf
  • 25.  Utilize distributed representations in classification algorithms  Compare SVM and multi-layered neural network for classification  Build on distributed word representation as basis for shallow semantic parsing and information extraction  Apply to other specialty gene sets
  • 26.  PATRIC Curators: Rebecca Wattam, Chunhong Mao, David Abraham, Meredith Wilson, Yan Zhang  Resources ◦ PATRIC www.patricbrc.org ◦ Python, NumPy, SciPy, scikit-learn ◦ IPython ◦ Gensim  Funding ◦ National Institute of Allergy and Infectious Diseases, National Institutes of Health

Editor's Notes

  1. 1. – Process used in VF 2. – No idea why this is labeled as a 1 3. Probably from a Methods section, refers to resistance cassette 4.