Yannick Pouliot, PhD
Biocomputational scientist
Khatri Laboratory
04/09/2014
Databases, Web Services and
Tools For Systems...
Systems Immunology -- 2014
Databases useful for Systems Immunology

Published in: Health & Medicine
1 Comment
  • Nice deck Yannick, thought you'd like my annual post on NAR's DB issue. There are also many more DBs in the wild that do not get listed in NAR. I'm thinking some kind of capture/tag/release analysis needs to be done to really count them.
    http://scienceblogs.com/digitalbio/2014/01/09/bio-databases-2014/
Systems Immunology -- 2014

  1. Yannick Pouliot, PhD
     Biocomputational scientist
     Khatri Laboratory
     04/09/2014
     Databases, Web Services and Tools For Systems Immunology
     Databases for Systems Immunology
  2. GOALS
     Convey understanding of:
     1. a set of databases highly relevant to Systems Immunology
     2. the issues and pitfalls associated with each DB
  3. Systems Immunology, and particularly the application of meta-analysis, can reveal testable hypotheses.
     But for that to happen, you need lots of (diverse) … DATA
  4. Historically, data were typically available in flat file formats only
     Relational databases now used increasingly
  5. Huge Numbers of Databases
     • See Nucleic Acids Research's yearly database issue to see just how many there are…
     • Many need to be licensed ($)
       ▫ Ingenuity Pathway Analysis (IPA)
          Excellent but pricey
       ▫ MetaCore
          competitor to IPA
          available from Lane Library
     • Many more freely available
       ▫ E.g., DAVID: similar to IPA and MetaCore
       ▫ Typically dirtier than commercial products, but sometimes much more comprehensive
  6. But first: “Free” does not necessarily mean “easy to use”
     (yet another application of the “there ain’t no free lunch” principle)
  7. Typical Problems with Third Party Data 1: Data Cleanup
     • Third party data almost always requires preprocessing
       ▫ reshaping input data for reading by R or database upload
       ▫ substituting offending strings
          single quotes
          ending spaces
          converting spaces to nulls
       ▫ normalizing equivalent strings (“Saline” = “saline”)
       ▫ semantic normalization
          encoding source terms against controlled nomenclature
          computing against same concept
          enables cross-database queries
       ▫ reconciling descriptions of data to that in source papers
          is this thing here what they are talking about in the paper?
          missing data
          extraneous data
       ▫ poorly described protocols
          which *!&!! antibody did the authors actually use?
          software version, parameters used
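The string-level cleanup steps listed on this slide can be sketched in a few lines. This is only an illustration in Python (the deck itself works in R and MySQL). The function names are hypothetical, and the choices to delete single quotes outright and to lowercase every field are assumptions made for the example, not the author's procedure.

```python
import csv
import io


def clean_field(value):
    """Normalize one raw text field; returns None for empty or blank values."""
    if value is None:
        return None
    cleaned = value.strip()             # drop leading and ending spaces
    cleaned = cleaned.replace("'", "")  # assumption: simply delete single quotes
    if cleaned == "":
        return None                     # blank strings become nulls
    return cleaned.lower()              # assumption: "Saline" and "saline" are equivalent


def clean_rows(tsv_text):
    """Parse tab-separated text and clean every field of every row."""
    reader = csv.reader(io.StringIO(tsv_text), delimiter="\t")
    return [[clean_field(field) for field in row] for row in reader]


# Toy input with an ending space, a blank field and a single quote
raw = "Saline \tcontrol\n saline\t \nPBS's\ttreated\n"
rows = clean_rows(raw)
```

Semantic normalization (encoding terms against a controlled nomenclature) is a separate step that this sketch does not attempt.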
  8. Typical Problems with Third Party Data 2: Must be Downloadable
     To be useful in Systems Immunology, a database needs to offer one of the following:
     1. downloadable (FTP/SFTP) in text or other form
     2. accessible programmatically over the Internet (e.g., Web service)
      Otherwise, must write a scraping program
        assuming this is acceptable use
     Manual Web interfaces don’t cut it…
     Familiarity with databasing and programming skills essential
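As one example of programmatic access, NCBI's E-utilities expose an `esearch` endpoint that can search GEO DataSets (`db=gds`). The sketch below, in Python, only builds the request URL; `search_geo` shows what a call could look like but is not executed here. The function names and the search term are hypothetical, and NCBI's usage guidelines (rate limits, API keys) are not handled.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

EUTILS_ESEARCH = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_geo_search_url(term, max_results=20):
    """Build an E-utilities esearch URL against GEO DataSets (db=gds)."""
    params = {"db": "gds", "term": term, "retmax": max_results, "retmode": "json"}
    return EUTILS_ESEARCH + "?" + urlencode(params)


def search_geo(term, max_results=20):
    """Perform the search over the network and return the matching record IDs."""
    with urlopen(build_geo_search_url(term, max_results)) as response:
        payload = json.load(response)
    return payload["esearchresult"]["idlist"]


# Build (but do not send) a query; the search term is just an illustration
url = build_geo_search_url("influenza vaccine")
```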
  9. Relational Databases -- Take Your Pick
  10. A Small Sample of DBs Useful in Systems Immunology
     • NCBI:
       ▫ GEO: Gene expression
       ▫ PubChem: Drug and compound activity data
     • DrugBank: Comprehensive info on drugs and their targets
     • BioGPS, Expression Atlas: Compendia of gene expression across tissues
     • Connectivity Map (CMAP): systematic survey of effects of compounds on cells
     • Comparative Toxicogenomics Database (CTD): effects of compounds on genes; correlation of compounds with diseases
     • Unified Medical Language System (UMLS): concept identification, DB cross-querying
     • ImmPort: Only multi-assay type immunological DB
     • Stanford Data Miner (SDM): Human Immune Monitoring Core’s database
     • The Cancer Genome Atlas (TCGA): Incredibly wide and deep repository of human cancer data
  11. Gene Expression
     … including
     - microarray
     - qPCR
     - RNA-Seq
  12. GEO
     • Vast repository of everything expression
       ▫ microarray gene expression as well as e.g., RNA-Seq
       ▫ lots of disease and drug treatment data in humans
     • Semi-structured data
       ▫ limits searchability of GEO search engine
       ▫ minimal standards applied by GEO
          manual curation required
  13. GEO: Example
     Goal: Identifying transcripts unique to individual leukocyte cell types
     Process:
     1. Curate GEO gene expression datasets for immune cells; store in MySQL
     2. Classify cell types according to Cell Ontology
     3. Compute z-score of expression for all genes in each cell type
     Tools: RMySQL + shiny + ggplot
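Step 3 of the process above can be illustrated with a minimal sketch. It assumes the z-score is taken per gene across cell types, so that a high z-score flags a cell type in which the gene is unusually strongly expressed; the slide does not say which axis was actually used. The deck's own tooling is R; Python is used here only for illustration, and the gene names and values are made-up toy data.

```python
from statistics import mean, pstdev


def zscores_by_gene(expression):
    """expression: {gene: {cell_type: value}} -> same shape, z-scored across cell types."""
    result = {}
    for gene, by_cell in expression.items():
        values = list(by_cell.values())
        mu = mean(values)
        sigma = pstdev(values)  # population standard deviation
        result[gene] = {
            # a gene with identical expression everywhere gets z = 0
            cell: (0.0 if sigma == 0 else (value - mu) / sigma)
            for cell, value in by_cell.items()
        }
    return result


def top_cell_type(zscores, gene):
    """Return the cell type with the highest z-score for this gene."""
    by_cell = zscores[gene]
    return max(by_cell, key=by_cell.get)


# Toy data: GENE_A is specific to B cells, GENE_B is expressed uniformly
toy = {
    "GENE_A": {"B cell": 9.0, "T cell": 2.0, "monocyte": 1.0},
    "GENE_B": {"B cell": 5.0, "T cell": 5.0, "monocyte": 5.0},
}
z = zscores_by_gene(toy)
```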
  14. Drugs, compounds, bioactivity
  15. PubChem
     • Three components:
       ▫ Compounds
       ▫ Substances
       ▫ BioAssay
          this is where the action is
     • PubChem BioAssay is a repository of bioactivity for compounds
       ▫ Very wide range of assays:
          high throughput screening
          in vivo assays
          cell-free assays
     • Complex data model (XML)
       ▫ can be converted to relational, though…
  16. PubChem BioAssay: Example
     Approach: Create a model that correlates bioactivity profiles in screening assays with patterns of drug adversity
     Enables prediction of adversity based on how a compound behaves in selected screens
  17. DrugBank
     • Comprehensive collection of detailed drug data
       ▫ chemical
       ▫ pharmacological
       ▫ pharmaceutical
       ▫ target
     • Contents
       ▫ 7,680 drug entries
          1,552 FDA-approved small molecule drugs
          55 FDA-approved biotech (antibodies/protein/peptide) drugs
          6,000 experimental drugs
     • But …
  18. Even When Data Are Available For Download, Converting Into Desired Format Can Be Challenging…
      Converting to relational or TSV formats doable but not trivial
     Operating directly on XML not recommended…
  19. DrugBank: Example
     SELECT distinct
       c.`NAME` as drug_name,
       -- NULL tests must use IS NULL / IS NOT NULL; "= null" never evaluates to true
       case
         when (d.GENE_NAME is not null) and (e.symbol is null) then d.GENE_NAME
         when (d.GENE_NAME = '') and (e.symbol is not null) then e.symbol
         else e.symbol
       end as Symbol,
       e.GeneID,
       c.`DRUGBANK_ID` as drugbank_id,
       c.`rxcui`,
       c.CAS_NUMBER as cas_number,
       d.`NAME` as gene_name
     FROM `target` a
       join (`targets` b, drug c, partner d)
         on ( a.`TARGETS_FKEY` = b.`TARGETS_PKEY`
              and b.`DRUG_FKEY` = c.`DRUG_PKEY`
              and a.`PARTNER` = d.`PARTNER_PKEY` )
       left join annot_gene.`gene_info_hs` e
         on d.`NAME` = e.`name`
     order by drug_name, Symbol;
     • Retrieve known targets of drugs
     • Find as many gene symbols as possible for targets
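When choosing between two nullable symbol columns, as the CASE expression in this query does, it helps to remember that in SQL a comparison with `= NULL` yields NULL rather than true, so missing values must be tested with `IS NULL`. The sketch below demonstrates this on a toy table using Python's built-in sqlite3 module (MySQL follows the same rule). The table and column names are hypothetical, and `COALESCE` with `NULLIF` is shown as one possible way to pick the first non-empty symbol, not as the author's approach.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target_toy (drugbank_symbol TEXT, ncbi_symbol TEXT)")
conn.executemany(
    "INSERT INTO target_toy VALUES (?, ?)",
    [("PTGS1", None), (None, "PTGS2"), ("", "TNF"), (None, None)],
)

# "= NULL" evaluates to NULL for every row, so no row is counted
equals_null = conn.execute(
    "SELECT COUNT(*) FROM target_toy WHERE ncbi_symbol = NULL"
).fetchone()[0]

# "IS NULL" is the test that actually finds the missing values
is_null = conn.execute(
    "SELECT COUNT(*) FROM target_toy WHERE ncbi_symbol IS NULL"
).fetchone()[0]

# NULLIF turns empty strings into NULL; COALESCE returns the first non-NULL value
symbols = [
    row[0]
    for row in conn.execute(
        "SELECT COALESCE(NULLIF(ncbi_symbol, ''), NULLIF(drugbank_symbol, '')) "
        "FROM target_toy ORDER BY rowid"
    )
]
```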
  20. Connectivity Map
     • Contents: collection of microarray gene expression datasets from panels of cell types treated with multiple compounds at multiple doses
     • Used to find drugs where the expression profiles match that of a user’s query gene signature
       ▫ The system computes a similarity metric to quantify the connection between that gene signature and reference profiles
     • Cells are all tumor cells from NCI-60 set
  21. Connectivity Map: Example 1
     select
       `a`.`instance_id` AS `instance_id`,
       `a`.`probe_name` AS `probe_name`,
       `b`.`direction` AS `direction`,
       `b`.`msigdb_id` AS `msigdb_id`,
       `a`.`rank` AS `rank`,
       `c`.`cmap_name` AS `cmap_name`,
       `c`.`cell` AS `cell_type`,
       `c`.`catalog_name`,
       `c`.`catalog_number`,
       `c`.`cas_number`,
       `c`.`rxcui`,
       `c`.`batch_id`,
       `c`.`perturbation_scan_id`
     from ((`v_instance2probe1` `a`
       join `gene_sets` `b`)
       join `instances` `c`)
     where ((`a`.`probe_id` = `b`.`probe_id`)
       and (`a`.`instance_id` = `c`.`instance_id`))
     1. Assemble gene expression data and metadata stored in multiple files into a cohesive DB schema
     2. Retrieve results into an integrated view
  22. Connectivity Map: Example 2
     Goal: Find drugs that shift gene expression in the direction opposite to what is observed in Inflammatory Bowel Disease (IBD) vs. normal tissues; such drugs should decrease symptoms
     Method:
     1. Characterize the effect of drugs on human gene transcript levels
     2. Characterize the difference in human gene transcript levels between disease and normal tissue pairs
     3. Find drugs that induce the reciprocal signature observed in disease
         link GEO and CM data using rxcui
  23. Data Sources
     Disease data: GEO
     • Assemble MySQL database of 176 gene expression microarray datasets from GEO
       ▫ diseased vs. normal tissue pairs
       ▫ 100 specific diseases manually reviewed and encoded using UMLS identifiers
       ▫ drug names encoded against UMLS RXCUI
     Drug data: Connectivity Map
     Gene expression microarray profiles of effects of 164 drugs in:
       ▫ breast cancer: MCF7 epithelial cell line
       ▫ prostate cancer: PC3 epithelial cell line
       ▫ leukemia: HL60
       ▫ melanoma: SKMEL5
       ▫ drug names encoded against UMLS RXCUI
  24. Comparative Toxicogenomics Database
     • Based on curation of literature of interactions between
       ▫ compounds and diseases
       ▫ compounds and genes
       ▫ genes and diseases
  25. CTD: Example
     Goal: Retrieve genes whose expression is influenced by testosterone-related compounds
  26. Data integration
  27. Unified Medical Language System (UMLS) … and why you need it
     • Provided by the National Library of Medicine
     • Inter-relates many controlled nomenclatures
     • Assigns single concept identifiers
     • Enables collapsing of variant expressions into one concept
     • Particularly useful when dealing with drug or compound names (RXCUI)
     • Use it from NCI Metathesaurus or create a MySQL DB
  28. UMLS: Example 1
     • Developed by National Library of Medicine
        data files and software that bring together multiple biomedical vocabularies and ontologies to enable semantic interoperability
       ▫ repository of terms, definitions and concepts in biomedicine, complete with cross-referencing and ontological relationships
     • Essential but complex and large
     • Requires free license
       ▫ or use it from NCI Metathesaurus
  29. UMLS: Example 2
     “I don’t like these drug names…”
     SELECT distinct
       a.drug_name,
       c.STR as shorter_drug_name,
       length(c.STR) as str_length
     FROM
       pharm_drugbank.`m_drug2gene` a,
       kb_umls.`RXNCONSO` b,
       kb_umls.`RXNCONSO` c
     where a.drug_name = b.`STR`
       and b.RXCUI = c.RXCUI
       and not(b.STR = c.STR)
       and length(c.`STR`) < length(b.`STR`)
       and not(a.Symbol is null)
     order by a.drug_name, length(c.STR) asc
     First 10 rows…
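The idea behind this query, using a shared concept identifier to find alternative (here, shorter) names for the same drug, can be tried on a toy table. The sketch below uses Python's sqlite3 module and a hypothetical `rxnconso_toy` table. The concept IDs `TOY1` and `TOY2` are placeholders rather than real RXCUIs, and the real RXNCONSO table has many more columns.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE rxnconso_toy (rxcui TEXT, str TEXT)")
conn.executemany(
    "INSERT INTO rxnconso_toy VALUES (?, ?)",
    [
        ("TOY1", "acetylsalicylic acid"),
        ("TOY1", "aspirin"),
        ("TOY2", "paracetamol"),
        ("TOY2", "acetaminophen"),
    ],
)


def shortest_synonym(connection, drug_name):
    """Return the shortest name sharing a concept ID with drug_name, or None if unknown."""
    row = connection.execute(
        """
        SELECT c.str
        FROM rxnconso_toy b
        JOIN rxnconso_toy c ON b.rxcui = c.rxcui  -- same concept, any spelling
        WHERE b.str = ?
        ORDER BY LENGTH(c.str), c.str
        LIMIT 1
        """,
        (drug_name,),
    ).fetchone()
    return row[0] if row else None
```

Unlike the query on the slide, this version also considers the input name itself, so a name that is already the shortest is returned unchanged.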
  30. Immunological data
     - ImmPort
     - Stanford Data Miner
  31. ImmPort: The King of Immunology Databases
  32. • Very rich metadata
     • Stores data for many different types of assays (unusual)
     • Uniquely curated and parsed
     • Excellent database schema
     • well documented on ImmPort site
     • sample 
  33. ImmPort: Example 2
     SELECT distinct
       a.`study_accession`,
       i.`name` as fcs_file,
       j.`panel`,
       j.`number_of_markers`
     FROM
       kb_immport.`study` a,
       kb_immport.`arm_or_cohort` b,
       kb_immport.arm_2_subject c,
       kb_immport.`subject` d,
       kb_immport.`biosample` e,
       kb_immport.`biosample_2_expsample` f,
       kb_immport.`expsample` g,
       kb_immport.`expsample_2_file_info` h,
       kb_immport.`file_info` i,
       kb_immport.fcs_annotation j
     where a.`study_accession` = b.`study_accession`
       and b.`study_accession` = e.`study_accession`
       and c.`subject_accession` = d.`subject_accession`
       and d.`subject_accession` = e.`subject_accession`
       and a.`workspace_id` = b.`workspace_id`
       and b.`workspace_id` = e.`workspace_id`
       and e.`workspace_id` = g.`workspace_id`
       and g.`workspace_id` = i.`workspace_id`
       and i.`workspace_id` = j.`workspace_id`
       and e.`biosample_accession` = f.`biosample_accession`
       and f.`experiment_accession` = g.`experiment_accession`
       and g.`experiment_accession` = h.`experiment_accession`
       and h.`experiment_accession` = j.`experiment_accession`
       and f.`expsample_accession` = g.`expsample_accession`
       and g.`expsample_accession` = h.`expsample_accession`
       and h.`expsample_accession` = j.`expsample_accession`
       and h.`file_info_id` = i.`file_info_id`
       and i.file_info_id = j.`file_info_id`
       and a.`official_title` regexp 'influenz'
       -- escape the dot and anchor the pattern so only names ending in ".fcs" match
       and i.`name` regexp '\\.fcs$'
       and d.species = 'Homo sapiens'
     order by a.`study_accession`, j.`panel`
     Goal: Retrieve all flow cytometry files (FCS) and marker panels associated with studies involving influenza
  34. ImmPort: Example 3
     Goal: Retrieve HAI results for influenza vaccinees, measured at day 0 and 28 post-vaccination
  35. Putting it all together: Meta-analysis of human influenza vaccination data in ImmPort to evaluate changes in immunological marker frequencies from flow cytometry data using automatic gating
     Maecker, H., McCoy, J.P. & Nussenblatt, R. Standardizing immunophenotyping for the human immunology project. Nature Reviews Immunology 12, 191-200 (2012).
     10 studies
     370 subjects
     ~17K FCS files
     Question: What changes in marker frequencies are observed during influenza immunization?
  36. Stanford Data Miner: The Prince of Immunology Databases
  37. SDM: Example
     Retrieve cell type frequencies from CyTOF data following influenza immunization
  38. Integrated Disease Repositories
  39. The Cancer Genome Atlas (TCGA)
     • Lots of cancers
     • Clinical data
       ▫ Full pathology
       ▫ Imaging, radiology, immunohistochemistry
     • Genomics: lots!
       ▫ both tumor and control tissues
       ▫ genotyping
       ▫ exome sequencing
       ▫ miRNA sequencing
       ▫ RNA-Seq
  40. In Conclusion…
     • Huge number of public resources
       ▫ ultimately integratable
     • Scientific power frequently lies in integrating data from multiple databases
     • Data clean-up typically needed
       ▫ mapping to ontologies or controlled nomenclatures essential
     • Domain-specific curation frequently required to structure otherwise semi-structured data
       ▫ e.g., GEO
     • All doable given today’s plethora of free/cheap tools and compute power
  41. Questions?
  42. Coming To Terms With MySQL
     • Widest usage in bioinformatics
     • Free (community edition)
     • Runs on everything (Linux, Win, Mac)
     • Easiest relational DB (short of MS Access)
     • Resources
       ▫ Moes (2005): Beginning MySQL; Wiley
       ▫ DuBois (2007): MySQL Cookbook; O’Reilly
       ▫ Dyer (2008): MySQL in a Nutshell; O’Reilly
  43. Key R Packages
     ▫ RMySQL: accessing relational databases, e.g., MySQL
     ▫ ggplot2: hyper-powerful plotting
     ▫ RColorBrewer: assign colors to plot objects automatically, such as plotted ggplot
     ▫ plyr and dplyr: easy manipulation of data frames
     ▫ sqldf: query data frames using SQL
        another easy way to manipulate data frames
     ▫ shiny: Web-based user interface
        if you want interactive R analysis
  44. Finding Drug Candidates Using Rank-Ordered, Drug-Disease Anti-Correlation Scores
     1. Compute an anti-correlation score for each drug-disease pair
     2. Compute P-values of anti-correlation scores (significance testing) using the distance between the observed score and the scores of 100 randomly generated comparisons
     3. Retain correlations that have FDR values below 0.05
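The three steps on this slide can be sketched as follows. The slide does not define the score, the randomization scheme or the FDR procedure, so everything here is an assumption made for illustration: the score is the negated Pearson correlation between a drug profile and a disease profile over the same genes, random comparisons are generated by shuffling the drug profile, and FDR values come from the Benjamini-Hochberg adjustment. Python is used instead of the deck's R tooling.

```python
import random
from statistics import mean


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / (var_x * var_y) ** 0.5


def anti_correlation(drug_profile, disease_profile):
    """Assumed score: high when the drug changes genes opposite to the disease."""
    return -pearson(drug_profile, disease_profile)


def permutation_p_value(drug_profile, disease_profile, n_random=100, seed=0):
    """Fraction of random comparisons scoring at least as high as the observed pair."""
    rng = random.Random(seed)
    observed = anti_correlation(drug_profile, disease_profile)
    shuffled = list(drug_profile)
    hits = 0
    for _ in range(n_random):
        rng.shuffle(shuffled)
        if anti_correlation(shuffled, disease_profile) >= observed:
            hits += 1
    return (hits + 1) / (n_random + 1)  # add-one correction avoids p = 0


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (FDR), returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

With only 100 random comparisons the smallest attainable P-value is 1/101, so how many pairs survive an FDR cutoff of 0.05 depends strongly on how many drug-disease pairs are tested.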