Mouse-Human Research Classifier

Presented By: Osama Jomaa
Research Adviser: Dr. Iddo Friedberg

Mouse Models in Research
Shares 99% of its genome with humans
Fewer ethical concerns than other mammal models
Inexpensive
Short generation times
Small

The Mouse Trap. The Danger of Using One Lab Animal to Study Every Disease. Daniel Engber.
http://www.slate.com/articles/health_and_science/the_mouse_trap/2011/11/lab_mice_are_they_limiting_our_understanding_of_human_disease_.html. November 16, 2011

Designer Mice for Human Research
Photo taken from “Designer mice for human disease - A close view of Nobel Laureate: Oliver Smithies.” Yau-Sheng Tsai, Pei-Jane Tsai, Man-Jin Jiang, Cherng-Shyang Chang. http://proj.ncku.edu.tw/research/commentary/e/20071116/2.html. December 9, 2014

The Mouse Model Is Not Perfect, Though
Photo taken from: The Mouse Trap. The Danger of Using One Lab Animal to Study Every Disease. Daniel Engber.
http://www.slate.com/articles/health_and_science/the_mouse_trap/2011/11/lab_mice_are_they_limiting_our_understanding_of_human_disease_.html. November 16, 2011

Mouse Correlation with Human in Equivalent Diseases
Photo taken from “Genomic responses in mouse models poorly mimic human inflammatory diseases.” Seok, Warren, et al. Proceedings of the National Academy of Sciences 110, no. 9 (2013): 3507-3512.
Figure axes: rank correlation (R²); percentage of genes changed in the same direction

Proposed Research
What? Classify the Mouse-Human scientific literature in PubMed into different areas of research
How? Citation Networks + MeSH Thesaurus
Why? Identify and study the popular areas of Mouse-Human research

Proposed Research
What? Classify the proteins in the Mouse-Human citation pairs into different biological systems
How? Protein Co-occurrence Networks + Gene Ontology
Why? Investigate the biological systems and proteins for which Mouse is used as a model organism for Human

Agenda
1. PubMed Articles Classification
   1. Collect Mouse and Human Papers
   2. Build a Citation Network
   3. Classify the Cit-Net Using MeSH Thesaurus
   4. Stats Study on MeSH Disease Classification
2. PubMed Proteins Analysis
   1. Collect Human Protein and Annotation Data
   2. Build the Entity Co-occurrence Networks
   3. Classify PCoC Networks Using Gene Ontology
3. Summary


Getting Mouse and Human PubMed IDs
[Diagram: UniProt GOA → Mouse PubMed Identifiers (PMIDs), Human PubMed Identifiers (PMIDs)]
1. Get Mouse & Human papers from UniProt
2. Query PubMed API for the citation list for each article
3. Parse PubMed XML response and get the citation list
.
.
<CitationList>
<PMID> 342342 </PMID>
<PMID> 423545 </PMID>
<PMID> 432598 </PMID>
</CitationList>
.
.
Very few PubMed articles have the citation list in their XML file!
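Step 3 can be sketched as follows. This is a minimal illustration, assuming the response looks like the fragment on the slide; the enclosing `Article` element and the helper name `parse_citation_list` are hypothetical, and the real PubMed schema may differ. An article without a citation list simply yields an empty result, which is the common case noted above.

```python
import xml.etree.ElementTree as ET

# Illustrative fragment mirroring the slide; the <Article> wrapper is an
# assumption added so the snippet is a well-formed XML document.
SAMPLE_XML = """
<Article>
  <CitationList>
    <PMID> 342342 </PMID>
    <PMID> 423545 </PMID>
    <PMID> 432598 </PMID>
  </CitationList>
</Article>
"""

def parse_citation_list(xml_text):
    """Return the PMIDs listed under any CitationList element, or [] if none."""
    root = ET.fromstring(xml_text.strip())
    pmids = []
    for citation_list in root.iter("CitationList"):
        for pmid in citation_list.findall("PMID"):
            if pmid.text and pmid.text.strip():
                pmids.append(int(pmid.text.strip()))
    return pmids

print(parse_citation_list(SAMPLE_XML))
```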

Getting Mouse and Human Citation List from Scopus
1. Get Mouse & Human papers from UniProt
2. Compose an HTTP GET request with the PMIDs
3. Parse Scopus JSON response
.
.
{CitationList: {PMID: 342342},
{PMID: 423545}, {PMID: 432598}}
.
.
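A rough sketch of step 3, assuming the response has been normalized to valid JSON in which `CitationList` holds a list of `{"PMID": ...}` objects. That shape is an assumption made for illustration (the slide's fragment is schematic), and `cited_pmids` is a hypothetical helper, not part of any Scopus client.

```python
import json

# Assumed, valid-JSON rendering of the schematic fragment on the slide.
SAMPLE_JSON = (
    '{"CitationList": [{"PMID": 342342}, {"PMID": 423545}, {"PMID": 432598}]}'
)

def cited_pmids(json_text):
    """Return the cited PMIDs, or [] when the response has no citation list."""
    payload = json.loads(json_text)
    return [entry["PMID"] for entry in payload.get("CitationList", [])]

print(cited_pmids(SAMPLE_JSON))
```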

Building the Citation Network
[Figure: citation network whose nodes are Human (H) and Mouse (M) papers]
Edge types: M → H, H → H, H → M, M → M
Mouse Inter and Intra Citations
[Pie chart. Segments: Mouse-Human Citations, Mouse-Mouse Citations, Mouse-Others Citations. Values shown (order not matched to segments): 62%, 3%, 34%]
Human Inter and Intra Citations
[Pie chart. Segments: Human-Others Citations, Human-Human Citations, Human-Mouse Citations. Values shown (order not matched to segments): 34%, 62%, 4%]
Medical Subject Headings
• Controlled vocabulary to index PubMed articles
• Stored in a DAG-like structure
• 16 top-level concepts at the root
• Includes ~27K concepts (MeSH descriptors) altogether
We used MeSH to group the Mouse and Human papers in the citation network into classes of research

MeSH Structure Example
Digestive System Diseases
Gastrointestinal Diseases
Digestive System Neoplasms
Neoplasms by Site
Neoplasms
Stomach Diseases
Gastrointestinal Neoplasms
Stomach Neoplasms
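Because a term can have more than one parent, walking up from a term can reach several top-level areas. The sketch below illustrates this with the terms on this slide. The parent links in `PARENTS` are assumed from the term names for illustration; the authoritative links come from the MeSH tree numbers, and the function names are hypothetical.

```python
# Child -> parents. These links are an assumption for illustration only.
PARENTS = {
    "Stomach Neoplasms": ["Gastrointestinal Neoplasms", "Stomach Diseases"],
    "Gastrointestinal Neoplasms": ["Digestive System Neoplasms",
                                   "Gastrointestinal Diseases"],
    "Stomach Diseases": ["Gastrointestinal Diseases"],
    "Digestive System Neoplasms": ["Neoplasms by Site",
                                   "Digestive System Diseases"],
    "Gastrointestinal Diseases": ["Digestive System Diseases"],
    "Neoplasms by Site": ["Neoplasms"],
}

def ancestors(term, parents=PARENTS):
    """Return every term reachable by following parent links upward."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen

def top_level(term, parents=PARENTS):
    """Return the root terms of this fragment that lie above `term`."""
    return {t for t in ancestors(term, parents) | {term} if t not in parents}

print(sorted(top_level("Stomach Neoplasms")))
```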

Classifying the Citation Network
[Figure: the citation network of Human (H) and Mouse (M) papers]

To Do: Place in research areas
[Figure: the citation network nodes placed in research areas: Digestive System Diseases, Eye Diseases, Virus Diseases, Immune System Diseases, Cardiovascular Diseases, Skin Diseases]

Number of Mouse and Human Papers in the MeSH Disease Categories

Number of Mouse-Human Citation Pairs in the MeSH Disease Categories
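Both counts can be sketched with a small tally, given a mapping from each paper to its MeSH disease categories. This is an illustrative sketch on toy data; the function names are hypothetical, and the rule that a citation pair counts toward a category only when both papers carry it is an assumption.

```python
from collections import Counter

def papers_per_category(paper_categories):
    """Count papers in each category (a paper may fall in several)."""
    counts = Counter()
    for categories in paper_categories.values():
        counts.update(set(categories))
    return counts

def pairs_per_category(citation_pairs, paper_categories):
    """Count (citing, cited) pairs per category shared by both papers."""
    counts = Counter()
    for citing, cited in citation_pairs:
        shared = (set(paper_categories.get(citing, ()))
                  & set(paper_categories.get(cited, ())))
        counts.update(shared)
    return counts

# Toy data: PMID -> categories, and (citing, cited) pairs.
paper_categories = {
    1: ["Eye Diseases"],
    2: ["Eye Diseases", "Skin Diseases"],
    3: ["Skin Diseases"],
}
citation_pairs = [(1, 2), (1, 3), (3, 2)]

paper_counts = papers_per_category(paper_categories)
pair_counts = pairs_per_category(citation_pairs, paper_categories)
print(dict(paper_counts), dict(pair_counts))
```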

GenBank
1. Get the Human protein sequences and papers
Protein: NP_e342 | PMID: 432432
kicgdkssgihygvitcegckgffrrsqqc
Protein: NP_452u1 | PMID: 483232
Adtltytlglsdgqlplgaspdlpeasacp
…..
2. Group the proteins by their PMID
PMID: 3213414
NP_u4323: sgihygvitcegckgffrrsqqc
NP_i4322: lplgaspdlpeasacfewrwts
NP_w3421: kicgdkssgihygvitceg
PMID: 2346414
NP_ti3423: vitcegckgckgffrrsqqc
NP_q4322f: ygvitcegeasacfewrwts
NP_x342u2: kicgdkssgihygvitceg
3. Intersect the GenBank papers with Scopus citations
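Steps 2 and 3 can be sketched as a group-by followed by a set filter. This is a minimal illustration using the placeholder identifiers from the slide; the record layout `(protein_id, pmid, sequence)` and both function names are assumptions.

```python
from collections import defaultdict

def group_by_pmid(records):
    """Group (protein_id, pmid, sequence) records into {pmid: {id: seq}}."""
    grouped = defaultdict(dict)
    for protein_id, pmid, sequence in records:
        grouped[pmid][protein_id] = sequence
    return dict(grouped)

def keep_cited(grouped, cited_pmids):
    """Keep only the papers that also appear among the cited PMIDs."""
    cited = set(cited_pmids)
    return {pmid: prots for pmid, prots in grouped.items() if pmid in cited}

records = [
    ("NP_u4323", 3213414, "sgihygvitcegckgffrrsqqc"),
    ("NP_i4322", 3213414, "lplgaspdlpeasacfewrwts"),
    ("NP_ti3423", 2346414, "vitcegckgckgffrrsqqc"),
]
grouped = group_by_pmid(records)
# Hypothetical set of cited PMIDs obtained from the citation step.
cited_only = keep_cited(grouped, [3213414, 999])
print(sorted(cited_only))
```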

Removing Redundancies
Use CD-HIT with similarity threshold = 0.9

Gene Ontology
Photo taken from: Gene Ontology Consortium. Ontology Structure. http://geneontology.org/page/ontology-structure. Last accessed December 13, 2014

Gene Ontology Annotation
Example protein: cytochrome c
Cellular Component: mitochondrial matrix
Molecular Function: oxidoreductase activity
Biological Process: oxidative phosphorylation

Getting GO Terms
[Diagram labels: FASTA File, BLAST, DB]
1. Create BLAST query in FASTA format
2. Create BLAST database from the Swiss-Prot Human flat file
3. Run BLAST with e-value = 10^-8
4. Parse the BLAST XML response and get the GO terms for the top hits
NP_u4323: GO1, GO5, GO4
NP_i4322: GO5, GO9
NP_w3421: GO4, GO6
...

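[Figure: protein co-occurrence (PCoC) network. Numbered protein nodes: 20, 14, 1, 19, 12, 18, 24, 7, 8, 4, 6. Paper labels: MP, HP. Legend: Citation Edge, P-P Edge, P-C-P Edge]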
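The two protein edge types can be sketched as follows, under an assumed reading of the legend: a P-P edge links two proteins that occur in the same paper, and a P-C-P edge links a protein in a citing paper to a protein in the paper it cites. The paper keys, protein numbers, and function names below are illustrative only.

```python
from itertools import combinations, product

def pp_edges(paper_proteins):
    """Edges between proteins that co-occur in the same paper."""
    edges = set()
    for proteins in paper_proteins.values():
        edges.update(combinations(sorted(set(proteins)), 2))
    return edges

def pcp_edges(citation_pairs, paper_proteins):
    """Edges from proteins in a citing paper to proteins in the cited paper."""
    edges = set()
    for citing, cited in citation_pairs:
        edges.update(product(paper_proteins.get(citing, ()),
                             paper_proteins.get(cited, ())))
    return edges

# Toy data: one citing paper and one cited paper with their proteins.
paper_proteins = {"MP1": [20, 14], "HP1": [19, 12, 18]}
citation_pairs = [("MP1", "HP1")]

within = pp_edges(paper_proteins)
across = pcp_edges(citation_pairs, paper_proteins)
print(sorted(within), sorted(across))
```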

To Do: Classifying the PCoC Network

To Do: Place in Protein Biological Systems
[Figure: PCoC network nodes placed in groups labeled: lactase activity, serotonin receptor activity, signal sequence binding, signal transducer activity, nucleotide binding, ATP binding]

Summary
• Cit-Net connects citing Mouse papers with cited Human papers in the PubMed database
• MeSH is used to classify the citation network nodes into different classes of research
• PCoC network connects the proteins in the citing Mouse papers with proteins in the cited Human papers
• GO is used to group the P-P and P-C-P network nodes into different classes of MFs, BPs and CCs

Timetable
Jan Feb Mar Apr May
Database Creation and Data Migration
Citation Network Classification
PCoC Networks Building
PCoC Networks Classification
PCoC Networks Analysis
