2. What You Need For Systems Immunology
1. A real hypothesis
▫ No fuzzy brain stuff
2. An understanding of statistics and data mining
▫ …one never understands enough statistics…
3. A lot of data, typically from different “levels” of
reality: organismal, molecular/static,
molecular/functional, etc
▫ … and therefore, databases of some sort
4. Software tools and programming expertise
5. Computing power
4. Developing a Hypothesis Suitable for Data Mining
• Possibly the hardest step
▫ Must have a measurable metric that can be tested
statistically
• A real hypothesis (H1) looks like this:
▫ H1: Drugs with increased frequency of adverse drug
reactions can be identified from patterns of reactivity
in PubChem Bioassays screens.
• Actually, statistical tests try to reject the null
hypothesis (H0), which looks like this:
▫ H0: Bioactivity patterns in PubChem Bioassays do not
distinguish drugs with increased frequency of adverse
drug reactions
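A minimal sketch of what "testing H0" can look like in practice, using a permutation test on the difference in group means. The test choice, the function name `permutation_p_value`, and the activity numbers are illustrative assumptions, not taken from the slides or from PubChem.

```python
import random
from statistics import mean

def permutation_p_value(group_a, group_b, n_permutations=5000, seed=0):
    """Two-sided permutation test on the difference in group means."""
    rng = random.Random(seed)
    observed = abs(mean(group_a) - mean(group_b))
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    hits = 0
    for _ in range(n_permutations):
        # Under H0 the group labels are exchangeable, so shuffle them.
        rng.shuffle(pooled)
        diff = abs(mean(pooled[:n_a]) - mean(pooled[n_a:]))
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_permutations + 1)

# Hypothetical numbers: fraction of bioassays in which each drug was active.
high_adr_drugs = [0.42, 0.38, 0.51, 0.47, 0.40, 0.45]
low_adr_drugs = [0.21, 0.25, 0.19, 0.30, 0.22, 0.27]

p_value = permutation_p_value(high_adr_drugs, low_adr_drugs)
print(f"p = {p_value:.4f}")
```

A small p-value is evidence against H0; it does not by itself prove H1.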
6. Understanding Statistics: Essential
• Not easy; counter-intuitive
• Critical, because with large volumes of data comes the
guarantee that you will always find “something”
▫ … except that it will most likely be purely artifactual
Q: Ever heard of multiple testing correction?
If not, read Bill Noble’s excellent description: Noble, W. S.
How does multiple testing correction work? Nat Biotech
27, 1135-1137 (2009).
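To make the point concrete, here is a minimal sketch of one common correction, the Benjamini-Hochberg false discovery rate procedure, written in plain Python. The choice of this particular method and the simulated pure-noise p-values are assumptions made for illustration.

```python
import random

def benjamini_hochberg(p_values):
    """Return BH-adjusted p-values (q-values) in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted

# 1000 tests on pure noise: under H0, p-values are uniform on [0, 1].
rng = random.Random(42)
noise_p = [rng.random() for _ in range(1000)]

raw_hits = sum(p < 0.05 for p in noise_p)
bh_hits = sum(q < 0.05 for q in benjamini_hochberg(noise_p))
print(f"'significant' before correction: {raw_hits}, after: {bh_hits}")
```

Without correction, roughly 5% of pure-noise tests look "significant" at 0.05, which is the artifactual "something" the slide warns about.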
7. Learning About Statistics
Introductory
• Norman, Streiner (2008): Biostatistics, the Bare Essentials;
Hamilton.
More advanced
• Vittinghoff et al., (2005): Regression Methods in Biostatistics;
Springer.
• Gentleman et al., (2005): Bioinformatics and Computational
Biology Solutions Using R and Bioconductor; Springer.
Advanced
• Doncaster & Davey (2007): Analysis of Variance and
Covariance: How to Choose and Construct Models for the
Life Sciences; Cambridge University Press.
8. Understanding Data Mining
• Data mining uses statistical techniques + other
techniques that are uniquely “computational”
• Key to Systems Immunology
• Resources:
▫ Excellent introduction, Weka-specific:
Witten & Frank (2005): Data Mining: Practical Machine
Learning Tools and Techniques, Second Edition; Morgan
Kaufmann
▫ Nisbet et al., (2009): Handbook of Statistical Analysis and
Data Mining Applications; Wiley
• Tools: coming up
10. Huge Numbers of Databases
• Many need to be licensed ($)
▫ Ingenuity Pathways Analysis (IPA)
Excellent but pricey
▫ MetaCore
competitor to IPA
available from Lane Library
• Many more freely available
▫ DAVID: similar to IPA and MetaCore
▫ Typically dirtier than commercial products, but
sometimes much more comprehensive
▫ Consult Nucleic Acids Research’s yearly database issue
11. The Bad News
• To be useful in Systems Medicine, databases
need to offer one of the following:
▫ Be downloadable (FTP) in text or other form
▫ Be accessible programmatically over the Internet
(e.g., Web service)
Clicking on Web interfaces just doesn’t cut it…
This means knowing about databases and having
programming skills (more later)
12. A Small Sample of DBs Crucial to
Systems Immunology
• NCBI: Entrez, GEO, PubMed, Gene, Genome,
RefSeq, dbSNP
• EBI: ArrayExpress, Gene Expression Atlas,
Ensembl
• Mouse Genome Database
• DrugBank
• BioGPS
• HapMap
• STITCH: interactions between compounds and
proteins
• UMLS (Unified Medical Language System)
13. Unified Medical Language System
• Developed by National Library of Medicine
• = data files and software that bring together
multiple biomedical vocabularies and ontologies
to enable semantic interoperability.
▫ repository of terms, definitions and concepts in
biomedicine, complete with cross-referencing and
ontological relationships
• Essential but complex and large
• Requires free license
15. The ImmPort Database: The Only DB of Its Kind in
Immunology
• http://immport.niaid.nih.gov/
• Stores results from huge range of assays
▫ HAI
▫ Flow cytometry phenotyping
▫ Genotyping
▫ Sequencing
▫ Gene expression
▫ etc
• Intended to be the primary repository for all NIAID
“center” grants
• Can access pre-publication data if given access by PI
• Caveat: volume of PUBLIC datasets is currently
limited
16. Stanford’s Human Immune Monitoring
Center (HIMC) Database
• Stanford Data Miner is HIMC’s data mining
database
• Stores many of the assays run by HIMC
• Ask HIMC for access to data from researchers who
use HIMC (will require their consent)
18. Why Relational Databases?
• Much faster access to data
• Data are safe
• Completely robust query answers
• Good scaling
• Highly integrative
▫ Cross-database querying: essential!
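A minimal sketch of cross-database querying. SQLite (from the Python standard library) stands in for a full database server so the example runs anywhere; the database name `clinical`, the table names, and all values are made up for illustration.

```python
import sqlite3

# Two separate in-memory databases on one connection: "main" and "clinical".
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS clinical")

conn.execute(
    "CREATE TABLE main.gene_expression (subject_id TEXT, gene TEXT, level REAL)"
)
conn.execute("CREATE TABLE clinical.treatment (subject_id TEXT, drug TEXT)")

# Hypothetical rows.
conn.executemany(
    "INSERT INTO main.gene_expression VALUES (?, ?, ?)",
    [("S1", "IL6", 2.0), ("S2", "IL6", 4.0), ("S3", "IL6", 9.0)],
)
conn.executemany(
    "INSERT INTO clinical.treatment VALUES (?, ?)",
    [("S1", "drugA"), ("S2", "drugA"), ("S3", "placebo")],
)

# One query joins tables that live in different databases.
rows = conn.execute(
    """
    SELECT t.drug, e.gene, AVG(e.level) AS mean_level
    FROM main.gene_expression AS e
    JOIN clinical.treatment AS t ON t.subject_id = e.subject_id
    GROUP BY t.drug, e.gene
    ORDER BY t.drug, e.gene
    """
).fetchall()

for drug, gene, mean_level in rows:
    print(drug, gene, mean_level)
```

The same idea (qualifying each table with its database name in a single query) carries over to server-based relational databases, though the exact syntax differs.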
19. Recommendation: MySQL
• Nothing magical about MySQL
▫ Widest usage in bioinformatics
▫ Free (community edition)
▫ Runs on everything (Linux, Win, Mac)
▫ Easiest relational DB (short of MS Access)
• Resources
▫ Moes (2005): Beginning MySQL; Wiley
▫ DuBois (2007): MySQL Cookbook; O’Reilly
▫ Dyer (2008): MySQL in a Nutshell; O’Reilly
21. The Good News
• Free software!
• Free algorithms!
• Pre-coded algorithms (i.e., packages)!
• Very cheap computing power!
22. The Bad News
• Dunno how to use
• “Not talented”
• “Not enough time”
• Can’t be bothered
▫ e.g., reading the paper describing the software tool
one is relying on
23. More Good News
• Not that hard
• Lots and lots of good resources
• Read a book, dammit
• Find a buddy
• Use Cloud instances (preconfigured machines)
▫ Can even be free!
24. “Gateway Drugs” to Programming:
Workflow Systems
• GenePattern
▫ Predominantly oriented toward gene expression
analysis
▫ Public server available
• Galaxy
▫ Predominantly oriented toward sequence (NGS)
analysis
▫ Public server available
• Weka
▫ Easiest way to learn data mining
25. But Seriously: Why Programming?
• Address small problems that can nail you
• Address bigger problems by standing on the
shoulders of giants
• Flexibility: If you’re doing “real” science, off-the-
shelf software will fail you every time
▫ 80% rule…
26. Example: Don’t Try This With Excel
• Millions of sequence reads compared against the
mouse transcriptome
• Determining the number of distinct species and the
frequency of members in each
• Summarize using plots for each codon
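A minimal sketch of the counting step, assuming a "species" is simply a distinct read sequence. The function name `count_species` and the toy reads are hypothetical; with real data the reads would be streamed from a file rather than held in a list.

```python
from collections import Counter

def count_species(reads):
    """Count how often each distinct sequence occurs in an iterable of reads."""
    return Counter(reads)

# Toy reads standing in for millions of sequences.
reads = ["ATGGCC", "ATGGCC", "ATGGCA", "TTGGCC", "ATGGCC", "ATGGCA"]

species_counts = count_species(reads)
n_species = len(species_counts)
total_reads = sum(species_counts.values())

print(f"{n_species} distinct species in {total_reads} reads")
for sequence, count in species_counts.most_common():
    print(sequence, count, f"{count / total_reads:.2f}")
```

Because `Counter` accepts any iterable, the same function works on a generator that yields one read at a time, which is what keeps millions of reads manageable.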
30. Languages/Systems You Can’t Do
Without
• SQL
▫ To talk to MySQL
• Perl or Python
▫ To glue things together
• R (“R Project”)
▫ To perform heavy-duty statistical analysis
• Weka
▫ To apply machine learning algorithms
31. The Inside Scoop of Making
Programming Work for You
• Diagram or write down your process
▫ Don’t just sit down and write code
• Write comments
▫ “I’m doing this because of this special case over here”
• Use meaningful variable names
▫ $c = not good
• Use development tools
▫ RStudio (for writing R code)
▫ Eclipse (for writing in almost any language)
▫ HeidiSQL (SQL browser)
1. download subject <--> group mapping table
2. download drug treatment data for each subject ID
   create two sets of subjects
      ImmunoTreatedSubjects
      NonImmunoTreatedSubjects
3. download gene type data
      ImmunoGenes
      NonImmunoGenes
4. calculate variance of each gene set for each subject
5. create data frame to store (4) --> varGeneSetForEachSubject
6. compute t test to determine whether mean is significantly different
   first test: generate statistic for each individual subject
   --> compare variance of ImmunoGenes vs. variance of NonImmunoGenes for each subject
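Once the process is written down, the code follows almost line by line. Below is a rough sketch of steps 4 to 6 only, with made-up expression values in place of the downloads in steps 1 to 3. Reading step 6 as a paired t statistic on the per-subject variance differences is one possible interpretation, not necessarily the intended one.

```python
from math import sqrt
from statistics import mean, stdev, variance

# Hypothetical expression values per subject, split by gene set.
expression = {
    "S1": {"ImmunoGenes": [5.0, 9.0, 1.0], "NonImmunoGenes": [4.0, 5.0, 6.0]},
    "S2": {"ImmunoGenes": [2.0, 8.0, 5.0], "NonImmunoGenes": [3.0, 5.0, 4.0]},
    "S3": {"ImmunoGenes": [1.0, 7.0, 4.0], "NonImmunoGenes": [2.0, 6.0, 4.0]},
    "S4": {"ImmunoGenes": [0.0, 10.0, 5.0], "NonImmunoGenes": [5.0, 5.0, 8.0]},
}

# Steps 4-5: variance of each gene set for each subject, stored in one table.
var_gene_set_for_each_subject = {
    subject: {gene_set: variance(values) for gene_set, values in gene_sets.items()}
    for subject, gene_sets in expression.items()
}

# Step 6: per-subject difference in variance, then a paired t statistic.
differences = [
    v["ImmunoGenes"] - v["NonImmunoGenes"]
    for v in var_gene_set_for_each_subject.values()
]
n_subjects = len(differences)
t_statistic = mean(differences) / (stdev(differences) / sqrt(n_subjects))

print(var_gene_set_for_each_subject)
print(f"t = {t_statistic:.3f} with {n_subjects - 1} degrees of freedom")
```

Turning the statistic into a p-value needs the t distribution, which a statistics package (for example `t.test` in R or `scipy.stats.ttest_rel` in Python) provides.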
32. Programming Without Programming: NCBI’s
EBot
• Uses NCBI E-utilities (“Web services”)
▫ Programmatic access to NCBI databases,
including PubMed
▫ VERY useful for data mining
• EBot writes the code for the particular kind of
service you want to use
• It only gets you so far, but at least the
heaviest lifting has been done (and it is heavy…)
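For a sense of what programmatic access looks like underneath, here is a minimal sketch that builds an E-utilities ESearch request for PubMed and pulls the ID list out of a JSON reply. The helper names are hypothetical and the sample response is hand-written with placeholder IDs; no network call is made.

```python
import json
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"

def build_esearch_url(term, db="pubmed", retmax=20):
    """Build an ESearch URL that asks for results in JSON."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"

def extract_ids(response_text):
    """Pull the list of record IDs out of an ESearch JSON response."""
    return json.loads(response_text)["esearchresult"]["idlist"]

url = build_esearch_url("systems immunology")
print(url)

# Hand-written stand-in for a real reply; the IDs are placeholders.
sample_response = '{"esearchresult": {"count": "2", "idlist": ["11111111", "22222222"]}}'
pubmed_ids = extract_ids(sample_response)
print(pubmed_ids)

# To query NCBI for real (respect their usage guidelines):
#   from urllib.request import urlopen
#   pubmed_ids = extract_ids(urlopen(url).read().decode("utf-8"))
```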
34. R: Why It Hurts So Good
• The “R Project” (aka R) is the premier open-source
statistical and data mining language and suite of
libraries.
• Pros
▫ Free, runs on everything
▫ Very flexible statistical computing
▫ Dominant standard in biocomputing
▫ Big user community at Stanford
▫ Many key libraries written at Stanford
• Cons
▫ Non-trivial learning curve
▫ Documentation is of variable quality
35. Key R Resources
Essentials:
• RStudio
▫ Integrated development environment
don’t code R w/o it!
• Crawley (2007): The R Book; Wiley
• Matloff (2011): The Art of R Programming; No
Starch Press
• Teetor (2011): R Cookbook; O’Reilly
• Wickham (2009): ggplot2; Springer
37. WEKA Data Mining Suite
• Machine learning/data mining software written
in Java (distributed under the GNU General Public
License)
• Main features:
▫ Comprehensive set of data pre-processing tools,
learning algorithms and evaluation methods
▫ Graphical user interfaces (incl. data visualization)
▫ Environment for comparing learning algorithms
• Heavily referenced in “Data Mining” (Witten &
Frank)
40. Perl, Python
• Either is a great language for bioinformatics
• Run on anything
• Use either to quickly glue systems together, e.g.,
▫ Integrate MySQL and R
▫ Run Web services queries
• Python has more growth potential
▫ Preferable over Perl
42. Why The Cloud Matters For Biologists
• You are purchasing computing power, not
machines
▫ never outdated
• You can purchase as much/little as you need
▫ You don’t have to run/manage what you don’t use
• Can easily migrate from one machine type to
another (minutes)
• Can add storage in seconds
• Accessible from anywhere
• Easy to share, e.g., (large) datasets with others
43. Why Own When You Can Rent?
Welcome To the Cloud…
44. For biomedical computing, Amazon
Cloud is ideal because it provides
highly flexible storage and compute
power sold on a per-use basis
45. Another Example: PathSeq
• Compare millions of short-read sequences
against all genomic + transcriptomic sequences
for all microbes (!)
Amazon Cloud
“Management Console”
46. Q: What does working with a Cloud
machine feel like?
A: It’s not materially different from
accessing a machine on our cluster,
except you can do anything you want
47. Main Services Provided by Amazon Cloud
• Storage
▫ Traditional disk volumes
▫ S3 buckets (“Simple Storage Service”)
• Computing (EC2 – “Elastic Compute Cloud”)
▫ Single machine instances
▫ Clusters of various types
• Machine types
▫ Compute servers
▫ Database servers
▫ Cluster
▫ Specialized architectures
▫ Variety of operating systems (Linux flavors, Windows)
48. Costs
• You pay for (almost) everything you do
▫ Data transfers (out)
▫ Storage
▫ CPU cycles (depends on instance type; one
instance is free)
• Can purchase cycles at below-average market
price
▫ Can provide access to vast amounts of computing
power at a price you can afford
• Research grants from Amazon