Yannick Pouliot, PhD
Biocomputational scientist
Butte Laboratory
04/04/2012
Databases, Web Services and
Tools For Systems Immunology
What You Need For Systems Immunology
1. A real hypothesis
▫ No fuzzy brain stuff
2. An understanding of statistics and data mining
▫ …one never understands enough statistics…
3. A lot of data, typically from different “levels” of
reality: organismal, molecular/static,
molecular/functional, etc
▫ … and therefore, databases of some sort
4. Software tools and programming expertise
5. Computing power
1: Hypothesis
Developing a Hypothesis Suitable for Data Mining
• Possibly the hardest step
▫ Must have a measurable metric that can be tested
statistically
• A real hypothesis (H1) looks like this:
▫ H1: Drugs with increased frequency of adverse drug
reactions can be identified from patterns of reactivity
in PubChem Bioassays screens.
• Strictly speaking, a statistical test tries to reject the null
hypothesis (H0), which looks like this:
▫ H0: Bioactivity patterns in PubChem Bioassays do not
distinguish drugs with increased frequency of adverse
drug reactions
2: Statistics & Data Mining
Understanding Statistics: Essential
• Not easy; counter-intuitive
• Critical, because with large volumes of data comes the
guarantee that you will always find “something”
▫ … except that it will most likely be purely artifactual
Q: Ever heard of multiple testing correction?
If not, read Bill Noble’s excellent description: Noble, W. S.
How does multiple testing correction work? Nat Biotech
27, 1135-1137 (2009).
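A minimal sketch of why correction matters, assuming a list of p-values from many independent tests. The p-values are made up for illustration, and the function names are ours rather than from any library. It compares uncorrected calls with Bonferroni and Benjamini-Hochberg (false discovery rate) adjustments:

```python
def bonferroni(p_values):
    """Multiply each p-value by the number of tests, capped at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values, smallest p-value first
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Illustrative (made-up) p-values from eight tests
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.060, 0.074, 0.205]
alpha = 0.05

n_raw = sum(p <= alpha for p in p_values)
n_bonferroni = sum(p <= alpha for p in bonferroni(p_values))
n_bh = sum(q <= alpha for q in benjamini_hochberg(p_values))
print(n_raw, n_bonferroni, n_bh)
```

With these numbers, fewer tests survive either correction than pass the raw 0.05 cutoff, which is the point of the slide: some of the raw "hits" are likely artifacts.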
Learning About Statistics
Introductory
• Norman & Streiner (2008): Biostatistics: The Bare Essentials;
Hamilton.
More advanced
• Vittinghoff et al. (2005): Regression Methods in Biostatistics;
Springer.
• Gentleman et al. (2005): Bioinformatics and Computational
Biology Solutions Using R and Bioconductor; Springer.
Advanced
• Doncaster & Davey (2007): Analysis of Variance and
Covariance: How to Choose and Construct Models for the
Life Sciences; Cambridge University Press.
Understanding Data Mining
• Data mining uses statistical techniques + other
techniques that are uniquely “computational”
• Key to Systems Immunology
• Resources:
▫ Excellent introduction, Weka-specific:
▪ Witten & Frank (2005): Data Mining: Practical Machine
Learning Tools and Techniques, Second Edition; Morgan
Kaufmann
▫ Nisbet et al., (2009): Handbook of Statistical Analysis and
Data Mining Applications; Wiley
• Tools: coming up
3: Data
Huge Numbers of Databases
• Many need to be licensed ($)
▫ Ingenuity Pathways Analysis (IPA)
▪ Excellent but pricey
▫ MetaCore
▪ Competitor to IPA
▪ Available from Lane Library
• Many more freely available
▫ DAVID: similar to IPA and MetaCore
▫ Typically dirtier than commercial products, but
sometimes much more comprehensive
▫ Consult Nucleic Acids Research’s yearly database issue
The Bad News
• To be useful in Systems Medicine, databases
need to offer one of the following:
▫ Be downloadable (FTP) in text or other form
▫ Be accessible programmatically over the Internet
(e.g., Web service)
Clicking on Web interfaces just doesn’t cut it…
This means knowing about databases and having
programming skills (more later)
A Small Sample of DBs Crucial to
Systems Immunology
• NCBI: Entrez, GEO, PubMed, Gene, Genome,
RefSeq, dbSNP
• EBI: ArrayExpress, Gene Expression Atlas,
Ensembl
• Mouse Genome Database
• DrugBank
• BioGPS
• HapMap
• STITCH: interactions between compounds and
proteins
• UMLS (Unified Medical Language System)
Unified Medical Language System
• Developed by National Library of Medicine
• Data files and software that bring together
multiple biomedical vocabularies and ontologies
to enable semantic interoperability
▫ repository of terms, definitions and concepts in
biomedicine, complete with cross-referencing and
ontological relationships
• Essential but complex and large
• Requires free license
Immunology Databases
The ImmPort Database: The Only DB of Its Kind in
Immunology
• http://immport.niaid.nih.gov/
• Stores results from huge range of assays
▫ HAI → flow cytometry phenotyping
▫ Genotyping
▫ Sequencing
▫ Gene expression
▫ etc
• Intended to be the primary repository for all NIAID
“center” grants
• Can access pre-publication data if given access by PI
• Caveat: volume of PUBLIC datasets is currently
limited
Stanford’s Human Immune Monitoring
Center (HIMC) Database
• Stanford Data Miner is HIMC’s data mining
database
• Stores many of the assays run by HIMC
• Ask HIMC for access to data from researchers who
use HIMC (will require their consent)
Next Level Up: Relational Databases
Take Your Pick
Why Relational Databases?
• Much faster access to data
• Data are safe
• Completely robust query answers
• Good scaling
• Highly integrative
▫ Cross-database querying: essential!
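A minimal sketch of cross-database querying. Python's built-in sqlite3 module stands in for MySQL so the example is self-contained, and the table names, column names and rows are hypothetical. Two separate databases are attached to one connection and joined in a single query:

```python
import sqlite3

# In-memory SQLite stands in for a MySQL server (assumption for illustration)
conn = sqlite3.connect(":memory:")
# Attach a second, separate database under the schema name "drugs"
conn.execute("ATTACH DATABASE ':memory:' AS drugs")

# Hypothetical tables: subjects live in one database, treatments in the other
conn.execute("CREATE TABLE main.subject (subject_id INTEGER PRIMARY KEY, cohort TEXT)")
conn.execute("CREATE TABLE drugs.treatment (subject_id INTEGER, drug TEXT)")
conn.executemany(
    "INSERT INTO main.subject VALUES (?, ?)",
    [(1, "A"), (2, "A"), (3, "B")],
)
conn.executemany(
    "INSERT INTO drugs.treatment VALUES (?, ?)",
    [(1, "drugX"), (3, "drugX"), (3, "drugY")],
)

# One query spanning both databases: distinct drugs seen per cohort
query = """
SELECT s.cohort, COUNT(DISTINCT t.drug) AS n_drugs
FROM main.subject AS s
JOIN drugs.treatment AS t ON t.subject_id = s.subject_id
GROUP BY s.cohort
ORDER BY s.cohort
"""
rows = conn.execute(query).fetchall()
for cohort, n_drugs in rows:
    print(cohort, n_drugs)
```

In MySQL the same idea is usually written with `database_name.table_name` qualifiers for databases hosted on the same server; no ATTACH step is needed there.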
Recommendation: MySQL
• Nothing magical about MySQL
▫ Widest usage in bioinformatics
▫ Free (community edition)
▫ Runs on everything (Linux, Win, Mac)
▫ Easiest relational DB (short of MS Access)
• Resources
▫ Moes (2005): Beginning MySQL; Wiley
▫ DuBois (2007): MySQL Cookbook; O’Reilly
▫ Dyer (2008): MYSQL in a Nutshell; O’Reilly
4: Software tools and programming
expertise
The Good News
• Free software!
• Free algorithms!
• Pre-coded algorithms (i.e., packages)!
• Very cheap computing power!
The Bad News
• Dunno how to use
• “Not talented”
• “Not enough time”
• Can’t be bothered
▫ e.g., reading the paper describing the software tool
one is relying on
More Good News
• Not that hard
• Lots and lots of good resources
• Read a book, dammit
• Find a buddy
• Use Cloud instances (preconfigured machines)
▫ Can even be free!
“Gateway Drugs” to Programming:
Workflow Systems
• GenePattern
▫ Predominantly oriented toward gene expression
analysis
▫ Public server available
• Galaxy
▫ Predominantly oriented toward sequence (NGS)
analysis
▫ Public server available
• Weka
▫ Easiest way to learn data mining
But Seriously: Why Programming?
• Address small problems that can nail you
• Address bigger problems by standing on the
shoulders of giants
• Flexibility: If you’re doing “real” science, off-the-
shelf software will fail you every time
▫ 80% rule…
Example: Don’t Try This With Excel
• Millions of sequence reads compared against
mouse transcriptome
• Determining number of distinct species and
frequency of members in each
• Summarize using plots for each codon
How it’s done: SQL + R
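A minimal sketch of the counting step, in Python rather than the SQL + R used for the real analysis. It assumes each read has already been assigned a label by the comparison step; the read IDs and labels below are hypothetical:

```python
from collections import Counter

# Hypothetical output of the comparison step: (read_id, assigned_label)
hits = [
    ("r1", "Actb"),
    ("r2", "Actb"),
    ("r3", "Gapdh"),
    ("r4", "Actb"),
    ("r5", "Cd4"),
]


def label_frequencies(hits):
    """Return (number of distinct labels, {label: fraction of reads})."""
    counts = Counter(label for _, label in hits)
    total = sum(counts.values())
    return len(counts), {label: n / total for label, n in counts.items()}


n_distinct, frequencies = label_frequencies(hits)
print(n_distinct, frequencies)
```

The same loop works unchanged on millions of rows streamed from a database cursor, which is where a spreadsheet gives up.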
Another Example…
Languages/Systems You Can’t Do
Without
• SQL
▫ To talk to MySQL
• Perl or Python
▫ To glue things together
• R (“R Project”)
▫ To perform heavy-duty statistical analysis
• Weka
▫ To apply machine learning algorithms
The Inside Scoop of Making
Programming Work for You
• Diagram or write down your process
▫ Don’t just sit down and write code
• Write comments
▫ “I’m doing this because of this special case over here”
• Use meaningful variable names
▫ $c = not good
• Use development tools
▫ RStudio (for writing R code)
▫ Eclipse (for writing in almost any language)
▫ HeidiSQL (SQL browser)
1. download subject <--> group mapping table
2. download drug treatment data for each subject ID
   create two sets of subjects
     ImmunoTreatedSubjects
     NonImmunoTreatedSubjects
3. download gene type data
     ImmunoGenes
     NonImmunoGenes
4. calculate variance of each gene set for each subject
5. create data frame to store (4) --> varGeneSetForEachSubject
6. compute t test to determine whether mean is significantly different
   first test: generate statistic for each individual subject
   --> compare variance of ImmunoGenes vs. variance of NonImmunoGenes for each subject
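Steps 4 to 6 of the outline above can be sketched as follows. This is one possible reading, not the original analysis: the expression values and gene lists are hypothetical, the download steps (1 to 3) are skipped, and the comparison is written as a paired t statistic across subjects using only the standard library (no p-value is computed).

```python
from math import sqrt
from statistics import mean, stdev, variance

# Hypothetical expression levels: subject ID -> {gene: level}
expression = {
    "S1": {"IL2": 2.0, "IFNG": 6.0, "TNF": 10.0, "ACTB": 7.0, "GAPDH": 8.0, "TUBB": 9.0},
    "S2": {"IL2": 3.0, "IFNG": 6.0, "TNF": 9.0, "ACTB": 5.0, "GAPDH": 7.0, "TUBB": 9.0},
    "S3": {"IL2": 1.0, "IFNG": 5.0, "TNF": 9.0, "ACTB": 6.0, "GAPDH": 8.0, "TUBB": 10.0},
    "S4": {"IL2": 4.0, "IFNG": 6.0, "TNF": 8.0, "ACTB": 7.0, "GAPDH": 7.5, "TUBB": 8.0},
}
immuno_genes = ["IL2", "IFNG", "TNF"]          # hypothetical ImmunoGenes
non_immuno_genes = ["ACTB", "GAPDH", "TUBB"]   # hypothetical NonImmunoGenes


def gene_set_variance(levels, gene_set):
    """Step 4: sample variance of one subject's levels over a gene set."""
    return variance([levels[gene] for gene in gene_set])


# Step 5: store the per-subject variances (a dict stands in for a data frame)
var_gene_set_for_each_subject = {
    subject: {
        "immuno": gene_set_variance(levels, immuno_genes),
        "non_immuno": gene_set_variance(levels, non_immuno_genes),
    }
    for subject, levels in expression.items()
}


def paired_t_statistic(first, second):
    """Step 6: t statistic for the mean of paired differences."""
    diffs = [a - b for a, b in zip(first, second)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


subjects = sorted(var_gene_set_for_each_subject)
t_stat = paired_t_statistic(
    [var_gene_set_for_each_subject[s]["immuno"] for s in subjects],
    [var_gene_set_for_each_subject[s]["non_immuno"] for s in subjects],
)
print(round(t_stat, 2))
```

Note how the variable names mirror the written outline, so the code can be checked against the plan line by line.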
Programming Without Programming: NCBI’s
Ebot
• Uses NCBI e-utilities (“Web services”)
▫ Programmatic access to NCBI databases,
including PubMed
▫ VERY useful for data mining
• Ebot codes the particular kind of service you
want to use
• Still, it only gets you so far, but at least the
heaviest lifting has been done (and it is heavy…)
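For a sense of what sits underneath, here is a minimal sketch of calling the E-utilities ESearch service directly from Python. The helper names are ours, and the search term is only an example. Building the URL needs no network access; `esearch_ids` does, and it assumes the JSON layout ESearch returns with `retmode=json`.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL for the given NCBI database and query term."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return EUTILS_BASE + "esearch.fcgi?" + urlencode(params)


def esearch_ids(db, term, retmax=20):
    """Run the search (requires network access) and return matching record IDs."""
    with urlopen(esearch_url(db, term, retmax)) as response:
        payload = json.load(response)
    return payload["esearchresult"]["idlist"]


# Build (but do not send) a PubMed query
url = esearch_url("pubmed", "systems immunology", retmax=5)
print(url)
```

Calling `esearch_ids("pubmed", "systems immunology", retmax=5)` would return up to five PubMed IDs as strings. NCBI asks heavy users to throttle requests, so check its usage guidelines before looping over many queries.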
Heavy Lifting: R
R: Why It Hurts So Good
• The “R Project” (aka R) is the premier Open Source
statistical and data mining language and suite of
libraries.
• Pros
▫ Free, runs on everything
▫ Very flexible statistical computing
▫ Dominant standard in biocomputing
▫ Big user community at Stanford
▫ Many key libraries written at Stanford
• Cons
▫ Non-trivial learning curve
▫ Documentation is of variable quality
Key R Resources
Essentials:
• RStudio
▫ Integrated development environment
→ don’t code R w/o it!
• Crawley (2007): The R Book; Wiley
• Matloff (2011): The Art of R Programming; No
Starch Press
• Teetor (2011): R Cookbook; O’Reilly
• Wickham (2009): ggplot2; Springer
Lighter Lifting: Weka
WEKA Data Mining Suite
• Machine learning/data mining software written
in Java (distributed under the GNU General Public
License)
• Main features:
▫ Comprehensive set of data pre-processing tools,
learning algorithms and evaluation methods
▫ Graphical user interfaces (incl. data visualization)
▫ Environment for comparing learning algorithms
• Heavily referenced in “Data Mining” (Witten &
Frank)
University of Waikato
Perl, Python
• Either is a great language for bioinformatics
• Both run on anything
• Use either to quickly glue systems together, e.g.,
▫ Integrate MySQL and R
▫ Run Web services queries
• Python has more growth potential
▫ Preferable to Perl
5: Computing Power
Why The Cloud Matters For Biologists
• You are purchasing computing power, not
machines
▫ → never outdated
• You can purchase as much/little as you need
▫ You don’t have to run/manage what you don’t use
• Can easily migrate from one machine type to
another (minutes)
• Can add storage in seconds
• Accessible from anywhere
• Easy to share things (e.g., large datasets) with others
Why Own When You Can Rent?
Welcome To the Cloud…
For biomedical computing, Amazon
Cloud is ideal because it provides
highly flexible storage and compute
power sold on a pay-per-use basis
Another Example: PathSeq
• Compare millions of short-read sequences
against all genomic + transcriptomic sequences
for all microbes (!)
Amazon Cloud
“Management Console”
Q: What does working with a Cloud
machine feel like?
A: It’s not materially different than
accessing a machine on our cluster,
except you can do anything you want
Main Services Provided by Amazon Cloud
• Storage
▫ Traditional disk volumes
▫ S3 buckets (“Simple Storage Service”)
• Computing (EC2 – “Elastic Compute Cloud”)
▫ Single machine instances
▫ Clusters of various types
• Machine types
▫ Compute servers
▫ Database servers
▫ Cluster
▫ Specialized architectures
▫ Variety of operating systems (LINUX flavors, Windows)
Costs
• You pay for (almost) everything you do
▫ Data transfers (out)
▫ Storage
▫ CPU cycles (depends on instance type; one
instance is free)
• Can purchase cycles at below average market
price
▫ Can provide access to vast amounts of computing
power at a price you can afford
• Research grants from Amazon
Questions?

Editor's Notes

  • #7 Multiple hypothesis testing corrects for random events that falsely appear significant
  • #29 PPT document created using combination of Perl, MySQL and R
  • #48 Blue=services I’ve used
  • #49 Mention cost calculator: http://calculator.s3.amazonaws.com/calc5.html