Systems Immunology -- 2014

Yannick Pouliot, PhD
Biocomputational scientist
Khatri Laboratory
04/09/2014
Databases, Web Services and
Tools For Systems Immunology
Databases for Systems
Immunology

GOALS
Convey understanding of:
1. a set of databases highly relevant to
Systems Immunology
2. the issues and pitfalls associated with
each DB

Systems Immunology, and particularly
the application of meta-analysis, can
reveal testable hypotheses.
But for that to happen, you need lots
of (diverse) …
DATA

Historically, data were typically
available in flat file formats only
Relational databases now used
increasingly

Huge Numbers of Databases
• See the annual database issue of Nucleic Acids Research
to see just how many there are…
• Many need to be licensed ($)
▫ Ingenuity Pathways Analysis (IPA)
 Excellent but pricey
▫ MetaCore
 competitor to IPA
 available from Lane Library
• Many more freely available
▫ E.g., DAVID: similar to IPA and MetaCore
▫ Typically dirtier than commercial products, but
sometimes much more comprehensive

But first: “Free” does not necessarily
mean “easy to use”
(yet another application of the “there
ain’t no free lunch” principle)

Typical Problems with Third Party Data 1: Data
Cleanup
• Third party data almost always requires preprocessing
▫ reshaping input data for reading by R or database upload
▫ substituting offending strings
 single quotes
 trailing spaces
 converting blank fields to nulls
▫ normalizing equivalent strings (“Saline” = “saline”)
▫ semantic normalization
 encoding source terms against controlled nomenclature
 computing against same concept
 enables cross-database queries
▫ reconciling descriptions of data with those in source papers
 is this thing here what they are talking about in the paper?
 missing data
 extraneous data
▫ poorly described protocols
 which *!&!! antibody did the authors actually use?
 software version, parameters used
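The string-level cleanup steps listed above can be sketched in a few lines of Python. This is a minimal illustration, not a general-purpose cleaner: the function names `clean_field` and `normalize_label` are hypothetical, and the rules (plain apostrophes, trimmed whitespace, blanks to nulls, case-insensitive matching) are assumptions about what a given dataset needs.

```python
import re

def clean_field(value):
    """Normalize one raw text field read from a third-party file."""
    if value is None:
        return None
    # Replace typographic single quotes with a plain apostrophe.
    value = value.replace("\u2018", "'").replace("\u2019", "'")
    # Collapse runs of whitespace and drop leading/trailing spaces.
    value = re.sub(r"\s+", " ", value).strip()
    # Blank fields become None (NULL once loaded into a database).
    return value if value else None

def normalize_label(value):
    """Case-insensitive key so that 'Saline' and 'saline' compare equal."""
    cleaned = clean_field(value)
    return cleaned.lower() if cleaned is not None else None
```

When loading cleaned values into a database, parameterized queries are usually a safer way to handle remaining apostrophes than string substitution.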

Typical Problems with Third Party Data 2: Must be
Downloadable
To be useful in Systems Immunology, a database needs to offer
one of the following:
1. downloadable (FTP/SFTP) in text or other form
2. accessible programmatically over the Internet (e.g., Web
service)
 Otherwise, must write a scraping program
 assuming this is acceptable use
Manual Web interfaces don’t cut it…
Familiarity with databasing and programming skills essential
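As one illustration of programmatic access, the sketch below queries GEO DataSets through the NCBI E-utilities `esearch` endpoint. The helper names `build_geo_search_url` and `search_geo` are hypothetical, the search term is only an example, and error handling and NCBI's usage guidelines (rate limits, API keys) are left out.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"

def build_geo_search_url(term, max_records=20):
    """Build an esearch URL against the GEO DataSets database (db=gds)."""
    params = {"db": "gds", "term": term, "retmax": max_records, "retmode": "json"}
    return EUTILS_ESEARCH + "?" + urlencode(params)

def search_geo(term, max_records=20):
    """Return the list of matching record IDs (requires network access)."""
    with urlopen(build_geo_search_url(term, max_records), timeout=30) as response:
        payload = json.load(response)
    return payload["esearchresult"]["idlist"]

# Building the URL needs no network; search_geo() performs the actual request.
example_url = build_geo_search_url("influenza vaccine", max_records=5)
```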

Relational Databases -- Take Your Pick

A Small Sample of DBs Useful in Systems Immunology
• NCBI:
▫ GEO: Gene expression
▫ PubChem: Drug and compound activity data
• DrugBank: Comprehensive info on drugs and their targets
• BioGPS, Expression Atlas: Compendia of gene expression across tissues
• Connectivity Map (CMAP): systematic survey of effects of compounds on
cells
• Comparative Toxicogenomics Database (CTD): effects of compounds on
genes; correlation of compounds with diseases
• Unified Medical Language System (UMLS): concept identification, DB
cross-querying
• ImmPort: Only multi-assay type immunological DB
• Stanford Data Miner (SDM): Human Immune Monitoring Core’s database
• The Cancer Genome Atlas (TCGA): Incredibly wide and deep repository of
human cancer data

Gene Expression
… including
- microarray
- qPCR
- RNA-Seq

GEO
• Vast repository of everything expression
▫ microarray gene expression as well as e.g., RNA-
Seq
▫ lots of disease and drug treatment data in
humans
• Semi-structured data
▫ limits searchability of GEO search engine
▫ minimal standards applied by GEO
 manual curation required

GEO: Example
Goal: Identifying transcripts unique to individual leukocyte cell
types
Process:
1. Curate GEO gene expression datasets for immune cells; store in
MySQL
2. Classify cell types according to Cell Ontology
3. Compute z-score of expression for all genes in each cell type
Tools: RMySQL + shiny + ggplot2
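Step 3 can be read in more than one way. The sketch below assumes one plausible reading: each gene's expression is z-scored across cell types, so a high z-score marks a cell type in which the gene stands out. The gene names and values are made up, and the original work used R rather than Python.

```python
from statistics import mean, stdev

def zscores_by_gene(expression):
    """expression: {gene: {cell_type: value}} -> same shape, holding z-scores."""
    result = {}
    for gene, by_cell in expression.items():
        values = list(by_cell.values())
        mu = mean(values)
        # Sample standard deviation; undefined for a single value.
        sd = stdev(values) if len(values) > 1 else 0.0
        result[gene] = {
            cell: ((value - mu) / sd if sd > 0 else 0.0)
            for cell, value in by_cell.items()
        }
    return result

# Hypothetical toy data: GENE_A is high only in B cells, GENE_B is flat.
toy = {
    "GENE_A": {"B cell": 10.0, "T cell": 2.0, "monocyte": 3.0},
    "GENE_B": {"B cell": 5.0, "T cell": 5.0, "monocyte": 5.0},
}
z = zscores_by_gene(toy)
```

In this toy input, only the B cell entry of GENE_A gets a z-score above 1, while the flat GENE_B scores 0 everywhere.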

PubChem
• Three components:
▫ Compounds
▫ Substances
▫ BioAssay: this is where the action is
• PubChem BioAssay is a repository of bioactivity for
compounds
▫ Very wide range of assays:
 high throughput screening
 in vivo assays
 cell-free assays
• Complex data model (XML)
▫ can be converted to relational, though…

PubChem BioAssay: Example
Approach: Create a model that correlates
bioactivity profiles in screening assays with
pattern of drug adversity
Enables prediction of adversity based on how
a compound behaves in selected screens

DrugBank
• Comprehensive collection of detailed drug data
▫ chemical
▫ pharmacological
▫ pharmaceutical
▫ target
• Contents
▫ 7,680 drug entries
 1,552 FDA-approved small molecule drugs
 55 FDA-approved biotech (antibodies/protein/peptide) drugs
 6,000 experimental drugs
• But …

Even When Data Are Available For Download, Converting Into Desired
Format Can Be Challenging…
 Converting to relational or TSV formats doable but not trivial
Operating directly on XML not recommended…
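A minimal sketch of flattening nested XML into TSV rows with the Python standard library. The element names (`drug`, `id`, `name`, `targets`, `target`, `gene`) and the sample records are hypothetical and far simpler than the real DrugBank schema; the point is only the one-row-per-drug-target-pair reshaping.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Hypothetical, simplified XML; not the actual DrugBank layout.
SAMPLE_XML = """<drugs>
  <drug>
    <id>D0001</id>
    <name>example drug A</name>
    <targets>
      <target><gene>GENE1</gene></target>
      <target><gene>GENE2</gene></target>
    </targets>
  </drug>
  <drug>
    <id>D0002</id>
    <name>example drug B</name>
    <targets/>
  </drug>
</drugs>"""

def drug_target_rows(xml_text):
    """Yield one (drug_id, drug_name, gene) tuple per drug-target pair."""
    root = ET.fromstring(xml_text)
    for drug in root.iter("drug"):
        drug_id = drug.findtext("id")
        name = drug.findtext("name")
        genes = [target.findtext("gene") for target in drug.iter("target")]
        # Keep drugs without targets as a single row with a missing gene.
        for gene in genes or [None]:
            yield (drug_id, name, gene)

def to_tsv(rows):
    """Serialize rows as tab-separated text with a header line."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    writer.writerow(["drug_id", "drug_name", "gene"])
    for row in rows:
        writer.writerow(["" if value is None else value for value in row])
    return buffer.getvalue()

rows = list(drug_target_rows(SAMPLE_XML))
tsv_text = to_tsv(rows)
```

The resulting TSV can be read directly by R or bulk-loaded into a relational table.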

DrugBank: Example
-- Retrieve drugs with their known targets and the best available gene symbol
SELECT distinct
c.`NAME` as drug_name,
case
-- NULL must be tested with IS [NOT] NULL; a comparison "= null" is never true
when (d.GENE_NAME is not null and e.symbol is null) then d.GENE_NAME
when (d.GENE_NAME = '' and e.symbol is not null) then e.symbol
else e.symbol
end as Symbol,
e.GeneID,
c.`DRUGBANK_ID` as drugbank_id,
c.`rxcui`,
c.CAS_NUMBER as cas_number,
d.`NAME` as gene_name
FROM
`target` a
join (`targets` b, drug c, partner d)
on (
a.`TARGETS_FKEY` = b.`TARGETS_PKEY`
and b.`DRUG_FKEY` = c.`DRUG_PKEY`
and a.`PARTNER` = d.`PARTNER_PKEY`
)
-- left join keeps targets that have no match in the gene annotation table
left join
annot_gene.`gene_info_hs` e
on
d.`NAME` = e.`name`
order by
drug_name,
Symbol;
• Retrieve known targets of drugs
• Find as many gene symbols as
possible for targets
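A point worth checking in queries of this kind is how missing symbols are tested: in SQL, a comparison with `= NULL` never evaluates to true, so `IS NULL` (or `COALESCE`) is needed. The sketch below shows the difference using SQLite from the Python standard library as a stand-in for MySQL; the table `target_demo` and its rows are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target_demo (gene_name TEXT, symbol TEXT)")
conn.executemany(
    "INSERT INTO target_demo VALUES (?, ?)",
    [("GENE1", None), ("", "GENE2"), (None, "GENE3"), (None, None)],
)

# "= NULL" matches nothing, even though two rows have a NULL symbol.
equals_null = conn.execute(
    "SELECT COUNT(*) FROM target_demo WHERE symbol = NULL"
).fetchone()[0]

# "IS NULL" finds both rows.
is_null = conn.execute(
    "SELECT COUNT(*) FROM target_demo WHERE symbol IS NULL"
).fetchone()[0]

# Prefer the annotated symbol; fall back to a non-empty gene name.
symbols = [
    row[0]
    for row in conn.execute(
        "SELECT COALESCE(symbol, NULLIF(gene_name, '')) "
        "FROM target_demo ORDER BY rowid"
    )
]
conn.close()
```

`COALESCE` and `NULLIF` are also available in MySQL, so the fallback expression carries over unchanged.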

Connectivity Map
• Contents: collection of microarray gene expression
datasets from panels of cell types treated with
multiple compounds at multiple doses
• Used to find drugs where the expression profiles
match that of a user’s query gene signature
▫ The system computes a similarity metric to quantify
the connection between that gene signature and
reference profiles
• Cells are all tumor cells from NCI-60 set

Connectivity Map: Example 1
select
`a`.`instance_id` AS `instance_id`,
`a`.`probe_name` AS `probe_name`,
`b`.`direction` AS `direction`,
`b`.`msigdb_id` AS `msigdb_id`,
`a`.`rank` AS `rank`,
`c`.`cmap_name` AS `cmap_name`,
`c`.`cell` AS `cell_type`,
`c`.`catalog_name`,
`c`.`catalog_number`,
`c`.`cas_number`,
`c`.`rxcui`,
`c`.`batch_id`,
`c`.`perturbation_scan_id`
from
((`v_instance2probe1` `a`
join `gene_sets` `b`)
join `instances` `c`)
where
((`a`.`probe_id` = `b`.`probe_id`)
and (`a`.`instance_id` = `c`.`instance_id`))
1. Assemble gene expression data and metadata
stored in multiple files into a cohesive DB
schema
2. Retrieve results into an integrated view

Connectivity Map: Example 2
Goal: Find drugs that shift gene expression in the direction opposite
to that observed in Inflammatory Bowel Disease (IBD) vs. normal
tissues; such drugs should decrease symptoms
Method:
1. Characterize the effect of drugs on human gene transcript
levels
2. Characterize the difference in human gene transcript levels
between disease and normal tissue pairs
3. Find drugs that induce the reciprocal signature observed in
disease
 link GEO disease data to Connectivity Map drug data using rxcui

Data Sources
Disease data: GEO
• Assemble MySQL database of 176 gene expression microarray datasets
from GEO
▫ diseased vs. normal tissue pairs
▫ 100 specific diseases manually reviewed and encoded using UMLS identifiers
▫ drug names encoded against UMLS RXCUI
Drug data: Connectivity Map
Gene expression microarray profiles of effects of 164 drugs in:
▫ breast cancer: MCF7 epithelial cell line
▫ prostate cancer: PC3 epithelial cell line
▫ leukemia: HL60
▫ melanoma: SKMEL5
▫ drug names encoded against UMLS RXCUI

Comparative Toxicogenomics Database
• Based on curation of literature of interactions
between
▫ compounds and diseases
▫ compounds and genes
▫ genes and diseases

CTD: Example
Goal: Retrieve genes whose
expression is influenced by
testosterone-related compounds

Unified Medical Language System (UMLS)
… and why you need it
• Provided by the National Library of Medicine
• Inter-relates many controlled nomenclatures
• Assigns single concept identifiers
• enables collapsing of variant expressions into
one concept
• Particularly useful when dealing with drug or
compound names (RXCUI)
• Use it from NCI Metathesaurus
or create a MySQL DB
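The value of collapsing variant expressions into one concept can be sketched with a small lookup table. The concept identifiers below are placeholders, not real UMLS or RxNorm identifiers; in practice the mapping would come from the UMLS tables rather than a hand-written dictionary.

```python
from collections import defaultdict

# Placeholder identifiers; real ones would come from UMLS/RxNorm.
SYNONYM_TO_CONCEPT = {
    "acetaminophen": "CONCEPT_0001",
    "paracetamol": "CONCEPT_0001",
    "tylenol": "CONCEPT_0001",
    "ibuprofen": "CONCEPT_0002",
}

def to_concept(name):
    """Map a drug name as written in a source database to a concept ID."""
    return SYNONYM_TO_CONCEPT.get(name.strip().lower())

def group_by_concept(names):
    """Group source names by concept; unmapped names go under None."""
    groups = defaultdict(list)
    for name in names:
        groups[to_concept(name)].append(name)
    return dict(groups)

groups = group_by_concept(["Paracetamol", "acetaminophen ", "Ibuprofen", "unknown drug"])
```

Once two databases are both encoded against the same identifiers, joining them on the concept ID replaces fragile joins on free-text names.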

UMLS: Example 1
• Developed by National Library of Medicine
 data files and software that bring together multiple
biomedical vocabularies and ontologies to enable
semantic interoperability
▫ repository of terms, definitions and concepts in
biomedicine, complete with cross-referencing
and ontological relationships
• Essential but complex and large
• Requires free license
▫ or use it from NCI Metathesaurus

UMLS: Example 2
“I don’t like these drug names…”
SELECT distinct
a.drug_name,
c.STR as shorter_drug_name,
length(c.STR) as str_length
FROM
pharm_drugbank.`m_drug2gene` a,
kb_umls.`RXNCONSO` b,
kb_umls.`RXNCONSO` c
where
a.drug_name = b.`STR`
and b.RXCUI = c.RXCUI
and not(b.STR = c.STR)
and
length(c.`STR`)<length(b.`STR`)
and not(a.Symbol is null)
order by
a.drug_name,
length(c.STR) asc
First 10 rows…

Immunological data
- ImmPort
- Stanford Data Miner

ImmPort: The King of Immunology
Databases

• Very rich metadata
• Stores data for many
different types of assays
(unusual)
• Uniquely curated and
parsed
• Excellent database
schema
• well documented
on ImmPort site
• sample 

ImmPort: Example 2
SELECT distinct
a.`study_accession`,
i.`name` as fcs_file,
j.`panel`,
j.`number_of_markers`
FROM
kb_immport.`study` a,
kb_immport.`arm_or_cohort` b,
kb_immport.arm_2_subject c,
kb_immport.`subject` d,
kb_immport.`biosample` e,
kb_immport.`biosample_2_expsample` f,
kb_immport.`expsample` g,
kb_immport.`expsample_2_file_info` h,
kb_immport.`file_info` i,
kb_immport.fcs_annotation j
where
a.`study_accession` = b.`study_accession`
and b.`study_accession` = e.`study_accession`
and c.`subject_accession` = d.`subject_accession`
and d.`subject_accession` = e.`subject_accession`
and a.`workspace_id` = b.`workspace_id`
and b.`workspace_id` = e.`workspace_id`
and e.`workspace_id` = g.`workspace_id`
and g.`workspace_id` = i.`workspace_id`
and i.`workspace_id` = j.`workspace_id`
and e.`biosample_accession` = f.`biosample_accession`
and f.`experiment_accession` = g.`experiment_accession`
and g.`experiment_accession` = h.`experiment_accession`
and h.`experiment_accession` = j.`experiment_accession`
and f.`expsample_accession` = g.`expsample_accession`
and g.`expsample_accession` = h.`expsample_accession`
and h.`expsample_accession` = j.`expsample_accession`
and h.`file_info_id` = i.`file_info_id`
and i.file_info_id = j.`file_info_id`
and a.`official_title` regexp 'influenz'
-- [.] matches a literal dot; a bare '.' would match any character
and i.`name` regexp '[.]fcs'
and d.species = 'Homo sapiens'
order by
a.`study_accession`,
j.`panel`
Goal: Retrieve all flow cytometry
files (FCS) and marker panels
associated with studies involving
influenza

ImmPort: Example 3
Goal: Retrieve HAI results for
influenza vaccinees, measured at
day 0 and 28 post-vaccination

Putting it all together:
Meta-analysis of human
influenza vaccination data in
ImmPort to evaluate
changes in immunological
marker frequencies from flow
cytometry data using
automatic gating
Maecker, H., McCoy, J.P. & Nussenblatt, R. Standardizing
immunophenotyping for the human immunology project. Nature Reviews
Immunology 12, 191-200 (2012).
10 studies
370 subjects
~17K FCS files
Question: What changes in marker
frequencies are observed during
influenza immunization?

Stanford Data Miner: The Prince of
Immunology Databases

SDM: Example
Retrieve cell type
frequencies from
CyTOF data
following influenza
immunization

Integrated Disease Repositories

The Cancer Genome Atlas (TCGA)
• Lots of cancers
• Clinical data
▫ Full pathology
▫ Imaging, radiology,
immunohistochemistry
• Genomics: lots!
▫ both tumor and control tissues
▫ genotyping
▫ exome sequencing
▫ miRNA sequencing
▫ RNA-Seq

In Conclusion…
• Huge number of public resources
▫ ultimately integratable
• Scientific power frequently lies in integrating data from
multiple databases
• Data clean-up typically needed
▫ mapping to ontologies or controlled nomenclatures
essential
• Domain-specific curation frequently required to
structure otherwise semi-structured data
▫ e.g., GEO
• All doable given today’s plethora of free/cheap tools
and compute power

Coming To Terms With MySQL
• Widest usage in bioinformatics
• Free (community edition)
• Runs on everything (Linux, Win, Mac)
• Easiest relational DB (short of MS Access)
• Resources
▫ Moes (2005): Beginning MySQL; Wiley
▫ DuBois (2007): MySQL Cookbook; O’Reilly
▫ Dyer (2008): MySQL in a Nutshell; O’Reilly

Key R Packages
▫ RMySQL: accessing relational databases, e.g.,
MySQL
▫ ggplot2: hyper-powerful plotting
▫ RColorBrewer: assign colors to plot objects
automatically, e.g., in ggplot2 plots
▫ plyr and dplyr: easy manipulation of data frames
▫ sqldf: query data frames using SQL
 another easy way to manipulate data frames
▫ shiny: Web-based user interface
 if you want interactive R analysis

Finding Drug Candidates Using Rank-Ordered,
Drug-Disease Anti-Correlation Scores
1. Compute an anti-correlation
score for each drug-disease pair
2. Compute P-values of anti-
correlation scores (significance
testing) using distance between
observed score vs. scores of 100
randomly-generated comparisons
3. Retain correlations that have FDR
values below 0.05
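The three steps above can be sketched as follows. Several choices here are assumptions rather than the method actually used: the anti-correlation score is taken to be the negated Pearson correlation between drug and disease expression changes, the random comparisons are produced by shuffling the drug profile, the P-value is the empirical fraction of random scores at least as large as the observed one, and FDR is computed with the Benjamini-Hochberg procedure.

```python
import math
import random

def pearson(x, y):
    """Pearson correlation of two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / math.sqrt(vx * vy)

def anti_correlation(drug, disease):
    """High when the drug moves genes opposite to the disease."""
    return -pearson(drug, disease)

def permutation_p_value(drug, disease, n_random=100, seed=0):
    """Empirical P-value from n_random shuffled drug profiles."""
    rng = random.Random(seed)
    observed = anti_correlation(drug, disease)
    shuffled = list(drug)
    hits = 0
    for _ in range(n_random):
        rng.shuffle(shuffled)
        if anti_correlation(shuffled, disease) >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from exactly zero.
    return (hits + 1) / (n_random + 1)

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted P-values, in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest P-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

With 100 random comparisons, the smallest achievable P-value under this scheme is 1/101, which limits how many drug-disease pairs can pass an FDR threshold of 0.05 when many pairs are tested.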
