Databases, Web Services and Tools For Systems Immunology
Yannick Pouliot, PhD
Biocomputational scientist
Butte Laboratory
04/04/2012
What You Need For Systems Immunology
1. A real hypothesis
  ▫ No fuzzy brain stuff
2. An understanding of statistics and data mining
  ▫ …one never understands enough statistics…
3. A lot of data, typically from different “levels” of reality: organismal, molecular/static, molecular/functional, etc.
  ▫ …and therefore, databases of some sort
4. Software tools and programming expertise
5. Computing power
Developing a Hypothesis Suitable for Data Mining
• Possibly the hardest step
  ▫ Must have a measurable metric that can be tested statistically
• A real hypothesis (H1) looks like this:
  ▫ H1: Drugs with increased frequency of adverse drug reactions can be identified from patterns of reactivity in PubChem BioAssay screens.
• Strictly speaking, a statistical test tries to reject the null hypothesis (H0), which looks like this:
  ▫ H0: Bioactivity patterns in PubChem BioAssay screens do not distinguish drugs with increased frequency of adverse drug reactions.
Understanding Statistics: Essential
• Not easy; counter-intuitive
• Critical, because with large volumes of data comes the guarantee that you will always find “something”
  ▫ …except that it will most likely be purely artifactual
• Q: ever heard of multiple testing correction? If not, read Bill Noble’s excellent description:
  ▫ Noble, W. S. How does multiple testing correction work? Nat Biotech 27, 1135-1137 (2009).
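To make the multiple testing point concrete, here is a minimal sketch of two common corrections, Bonferroni and Benjamini-Hochberg, written in plain Python. The p-values are made up for illustration; the function names are ours and do not come from any particular package.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply each p by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg (FDR) adjusted p-values, returned in input order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical p-values from four tests (illustration only).
p_values = [0.01, 0.04, 0.03, 0.5]
alpha = 0.05

n_raw = sum(p < alpha for p in p_values)
n_bonferroni = sum(p < alpha for p in bonferroni(p_values))
n_bh = sum(p < alpha for p in benjamini_hochberg(p_values))
print(n_raw, n_bonferroni, n_bh)
```

With these numbers, several tests look "significant" before correction and fewer survive after it, which is the whole point: the more tests you run, the more of your raw hits are likely to be artifacts.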
Learning About Statistics
Introductory
• Norman, Streiner (2008): Biostatistics, the Bare Essentials; Hamilton.
More advanced
• Vittinghoff, Eric (2005): Regression methods in biostatistics; Springer.
• Gentleman et al. (2005): Bioinformatics and Computational Biology Solutions Using R and Bioconductor; Springer.
Advanced
• Doncaster & Davey (2007): Analysis of Variance and Covariance: How to Choose and Construct Models for the Life Sciences; Cambridge University Press.
Understanding Data Mining
• Data mining uses statistical techniques + other techniques that are uniquely “computational”
• Key to Systems Immunology
• Resources:
  ▫ Excellent introduction, Weka-specific: Witten & Frank (2005): Data Mining: Practical Machine Learning Tools and Techniques, Second Edition; Morgan Kaufmann
  ▫ Nisbet et al. (2009): Handbook of Statistical Analysis and Data Mining Applications; Wiley
• Tools: coming up
Huge Numbers of Databases
• Many need to be licensed ($)
  ▫ Ingenuity Pathways Analysis (IPA): excellent but pricey
  ▫ MetaCore: competitor to IPA, available from Lane Library
• Many more freely available
  ▫ DAVID: similar to IPA and MetaCore
  ▫ Typically dirtier than commercial products, but sometimes much more comprehensive
  ▫ Consult Nucleic Acids Research’s yearly database issue
The Bad News
• To be useful in Systems Medicine, databases need to offer one of the following:
  ▫ Be downloadable (FTP) in text or other form
  ▫ Be accessible programmatically over the Internet (e.g., Web service)
• Clicking on Web interfaces just doesn’t cut it…
• This means knowing about databases and having programming skills (more later)
A Small Sample of DBs Crucial to Systems Immunology
• NCBI: Entrez, GEO, PubMed, Gene, Genome, RefSeq, dbSNP
• EBI: ArrayExpress, Gene Expression Atlas, Ensembl
• Mouse Genome Database
• DrugBank
• BioGPS
• HapMap
• STITCH: interactions between compounds and proteins
• UMLS (Unified Medical Language System)
Unified Medical Language System
• Developed by the National Library of Medicine
• A set of data files and software that brings together multiple biomedical vocabularies and ontologies to enable semantic interoperability
  ▫ Repository of terms, definitions and concepts in biomedicine, complete with cross-referencing and ontological relationships
• Essential but complex and large
• Requires free license
The ImmPort Database: The Only DB of Its Kind in Immunology
• http://immport.niaid.nih.gov/
• Stores results from huge range of assays
  ▫ HAI flow cytometry phenotyping
  ▫ Genotyping
  ▫ Sequencing
  ▫ Gene expression
  ▫ etc.
• Intended to be the primary repository for all NIAID “center” grants
• Can access pre-publication data if given access by PI
• Caveat: volume of PUBLIC datasets is currently limited
Stanford’s Human Immune Monitoring Center (HIMC) Database
• Stanford Data Miner is HIMC’s data mining database
• Stores many of the assays run by HIMC
• Ask HIMC for access to data from researchers who use HIMC (will require their consent)
Next Level Up: Relational Databases
Take Your Pick
Why Relational Databases?
• Much faster access to data
• Data are safe
• Completely robust query answers
• Good scaling
• Highly integrative
  ▫ Cross-database querying: essential!
Recommendation: MySQL
• Nothing magical about MySQL
  ▫ Widest usage in bioinformatics
  ▫ Free (community edition)
  ▫ Runs on everything (Linux, Win, Mac)
  ▫ Easiest relational DB (short of MS Access)
• Resources
  ▫ Moes (2005): Beginning MySQL; Wiley
  ▫ DuBois (2007): MySQL Cookbook; O’Reilly
  ▫ Dyer (2008): MySQL in a Nutshell; O’Reilly
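To show what “highly integrative” querying looks like in practice, here is a small sketch of a SQL join that combines a subject table with an expression table. It uses Python’s built-in sqlite3 module as a stand-in for MySQL so that it runs with no server; the table names, column names and values are all invented for illustration. In MySQL, the same kind of JOIN can also span databases by qualifying tables as database.table.

```python
import sqlite3

# In-memory SQLite database standing in for a MySQL server (assumption).
conn = sqlite3.connect(":memory:")

# Two hypothetical tables: who is in which group, and what was measured.
conn.executescript("""
CREATE TABLE subjects (subject_id TEXT PRIMARY KEY, treatment_group TEXT);
CREATE TABLE expression (subject_id TEXT, gene TEXT, level REAL);
INSERT INTO subjects VALUES
    ('S1', 'treated'), ('S2', 'treated'), ('S3', 'control');
INSERT INTO expression VALUES
    ('S1', 'IL6', 8.0), ('S2', 'IL6', 6.0), ('S3', 'IL6', 3.0),
    ('S1', 'TNF', 5.0), ('S2', 'TNF', 7.0), ('S3', 'TNF', 4.0);
""")

# One query joins both tables and summarizes expression per group and gene.
query = """
SELECT s.treatment_group, e.gene, AVG(e.level) AS mean_level
FROM expression AS e
JOIN subjects AS s ON s.subject_id = e.subject_id
GROUP BY s.treatment_group, e.gene
ORDER BY s.treatment_group, e.gene
"""
rows = conn.execute(query).fetchall()
for treatment_group, gene, mean_level in rows:
    print(treatment_group, gene, mean_level)
```

The point of the design is that the grouping and averaging happen inside the database, so the same query keeps working when the tables hold millions of rows.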
The Good News
• Free software!
• Free algorithms!
• Pre-coded algorithms (i.e., packages)!
• Very cheap computing power!
The Bad News
• Dunno how to use
• “Not talented”
• “Not enough time”
• Can’t be bothered
  ▫ e.g., reading the paper describing the software tool one is relying on
More Good News
• Not that hard
• Lots and lots of good resources
• Read a book, dammit
• Find a buddy
• Use Cloud instances (preconfigured machines)
  ▫ Can even be free!
“Gateway Drugs” to Programming: Workflow Systems
• GenePattern
  ▫ Predominantly oriented toward gene expression analysis
  ▫ Public server available
• Galaxy
  ▫ Predominantly oriented toward sequence (NGS) analysis
  ▫ Public server available
• Weka
  ▫ Easiest way to learn data mining
But Seriously: Why Programming?
• Address small problems that can nail you
• Address bigger problems by standing on the shoulders of giants
• Flexibility: If you’re doing “real” science, off-the-shelf software will fail you every time
  ▫ 80% rule…
Example: Don’t Try This With Excel
• Millions of sequence reads compared against mouse transcriptome
• Determining number of distinct species and frequency of members in each
• Summarize using plots for each codon
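As a rough idea of why a few lines of code beat a spreadsheet here, the sketch below tallies how many reads hit each target and converts the tallies to frequencies. It assumes, purely for illustration, that the comparison step produced a tab-separated file of read ID and matched transcript ID, and it treats each distinct transcript ID as one “species”; the tiny in-memory file and the function name are invented.

```python
import io
from collections import Counter


def count_hits(lines):
    """Count reads per target from tab-separated 'read_id<TAB>target_id' lines."""
    counts = Counter()
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        read_id, target_id = line.split("\t")[:2]
        counts[target_id] += 1
    return counts


# Tiny made-up stand-in for a file holding millions of alignment hits.
example = io.StringIO(
    "read1\tActb\n"
    "read2\tGapdh\n"
    "read3\tActb\n"
    "read4\tActb\n"
    "read5\tCd4\n"
)

counts = count_hits(example)
total = sum(counts.values())
frequencies = {target: n / total for target, n in counts.items()}

print(len(counts), "distinct targets")
for target, n in counts.most_common():
    print(target, n, frequencies[target])
```

Because the file is read one line at a time, memory use depends on the number of distinct targets rather than the number of reads.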
Languages/Systems You Can’t Do Without
• SQL
  ▫ To talk to MySQL
• Perl or Python
  ▫ To glue things together
• R (“R Project”)
  ▫ To perform heavy-duty statistical analysis
• Weka
  ▫ To apply machine learning algorithms
The Inside Scoop of Making Programming Work for You
• Diagram or write down your process
  ▫ Don’t just sit down and write code
• Write comments
  ▫ “I’m doing this because of this special case over here”
• Use meaningful variable names
  ▫ $c = not good
• Use development tools
  ▫ RStudio (for writing R code)
  ▫ Eclipse (for writing in almost any language)
  ▫ HeidiSQL (SQL browser)
Example of a written-down process:
1. download subject <--> group mapping table
2. download drug treatment data for each subject ID; create two sets of subjects: ImmunoTreatedSubjects, NonImmunoTreatedSubjects
3. download gene type data: ImmunoGenes, NonImmunoGenes
4. calculate variance of each gene set for each subject
5. create data frame to store (4) --> varGeneSetForEachSubject
6. compute t test to determine whether mean is significantly different
  ▫ first test: generate statistic for each individual subject --> compare variance of ImmunoGenes vs. variance of NonImmunoGenes for each subject
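Steps 4 to 6 of the written-down process could be sketched as follows. This is only an illustration under stated assumptions: the downloads in steps 1 to 3 are replaced by a small invented in-memory table, the gene lists and expression values are made up, a plain dict stands in for the data frame, and the comparison is done as a paired t statistic across subjects (one reasonable reading of step 6, not necessarily the analysis that was actually run).

```python
import math
import statistics

# Hypothetical gene sets (step 3 would download the real ones).
IMMUNO_GENES = ["IL6", "TNF", "IFNG"]
NON_IMMUNO_GENES = ["ACTB", "GAPDH", "ALB"]

# Hypothetical expression values per subject (invented numbers).
expression = {
    "S1": {"IL6": 2.0, "TNF": 6.0, "IFNG": 10.0, "ACTB": 5.0, "GAPDH": 6.0, "ALB": 7.0},
    "S2": {"IL6": 3.0, "TNF": 6.0, "IFNG": 9.0, "ACTB": 5.0, "GAPDH": 6.0, "ALB": 7.0},
    "S3": {"IL6": 4.0, "TNF": 6.0, "IFNG": 8.0, "ACTB": 5.0, "GAPDH": 6.0, "ALB": 7.0},
    "S4": {"IL6": 1.0, "TNF": 6.0, "IFNG": 11.0, "ACTB": 4.0, "GAPDH": 6.0, "ALB": 8.0},
}


def gene_set_variance(profile, genes):
    """Step 4: sample variance of one gene set within one subject."""
    return statistics.variance([profile[g] for g in genes])


# Step 5: store (immuno variance, non-immuno variance) for each subject.
var_gene_set_for_each_subject = {
    subject: (
        gene_set_variance(profile, IMMUNO_GENES),
        gene_set_variance(profile, NON_IMMUNO_GENES),
    )
    for subject, profile in expression.items()
}

# Step 6: paired t statistic on the per-subject variance differences.
differences = [immuno - non for immuno, non in var_gene_set_for_each_subject.values()]
n = len(differences)
t_statistic = statistics.mean(differences) / (statistics.stdev(differences) / math.sqrt(n))
degrees_of_freedom = n - 1

print(var_gene_set_for_each_subject)
print(t_statistic, degrees_of_freedom)
```

Turning the t statistic into a p-value requires the t distribution, which the standard library does not provide; R or a statistics package would handle that last step. Note how the variable names mirror the written plan, which is exactly what makes the code easy to check against it.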
Programming Without Programming: NCBI’s Ebot
• Uses NCBI E-utilities (“Web services”)
  ▫ Programmatic access to NCBI databases, including PubMed
  ▫ VERY useful for data mining
• Ebot codes the particular kind of service you want to use
• Still, it only gets you so far, but at least the heaviest lifting has been done (and it is heavy…)
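For a sense of what a Web service call looks like underneath, here is a minimal Python sketch that builds an E-utilities ESearch request against PubMed. It is not what Ebot generates; the helper names and the search term are ours, and the network call is wrapped in a function that is defined but not run here.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

# Base address of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(term, db="pubmed", retmax=20):
    """Build an ESearch URL asking for up to `retmax` record IDs as JSON."""
    params = {"db": db, "term": term, "retmode": "json", "retmax": retmax}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


def fetch_ids(url):
    """Run the query over the network and return the list of matching IDs."""
    with urlopen(url) as response:
        payload = json.load(response)
    return payload["esearchresult"]["idlist"]


# Hypothetical search term, for illustration only.
url = build_esearch_url("systems immunology[Title]")
print(url)
# ids = fetch_ids(url)  # uncomment to query NCBI (requires network access)
```

Once a query is a URL built by code, it can be run in a loop over hundreds of terms, which is what makes Web services so much more useful for data mining than clicking through a Web interface.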
R: Why It Hurts So Good
• The “R Project” (aka R) is the premier Open Source statistical and data mining language and suite of libraries.
• Pros
  ▫ Free, runs on everything
  ▫ Very flexible statistical computing
  ▫ Dominant standard in biocomputing
  ▫ Big user community at Stanford
  ▫ Many key libraries written at Stanford
• Cons
  ▫ Non-trivial learning curve
  ▫ Documentation is of variable quality
Key R Resources
Three essentials:
• RStudio
  ▫ Integrated development environment: don’t code R w/o it!
• Crawley (2007): The R Book; Wiley
• Matloff (2011): The Art of R Programming; No Starch Press
• Teetor (2011): The R Cookbook; O’Reilly
• Wickham (2009): ggplot2; Springer
WEKA Data Mining Suite
• Machine learning/data mining software written in Java (distributed under the GNU General Public License)
• Main features:
  ▫ Comprehensive set of data pre-processing tools, learning algorithms and evaluation methods
  ▫ Graphical user interfaces (incl. data visualization)
  ▫ Environment for comparing learning algorithms
• Heavily referenced in “Data Mining” (Witten & Frank)
Perl, Python
• Either is a great language for bioinformatics
• Both run on anything
• Use them to quickly glue systems together, e.g.,
  ▫ Integrate MySQL and R
  ▫ Run Web services queries
• Python has more growth potential
  ▫ Preferable over Perl
Why The Cloud Matters For Biologists
• You are purchasing computing power, not machines
  ▫ Never outdated
• You can purchase as much/little as you need
  ▫ You don’t have to run/manage what you don’t use
• Can easily migrate from one machine type to another (minutes)
• Can add storage in seconds
• Accessible from anywhere
• Easy to share, e.g., (large) datasets with others
Why Own When You Can Rent?
Welcome To the Cloud…
For biomedical computing, Amazon Cloud is ideal because it provides highly flexible storage and compute power sold on a pay-per-use basis
Another Example: PathSeq
• Compare millions of short-read sequences against all genomic + transcriptomic sequences for all microbes (!)
Amazon Cloud “Management Console”
Q: What does working with a Cloud machine feel like?
A: It’s not materially different from accessing a machine on our cluster, except you can do anything you want
Main Services Provided by Amazon Cloud
• Storage
  ▫ Traditional disk volumes
  ▫ S3 buckets (“Simple Storage Service”)
• Computing (EC2 – “Elastic Compute Cloud”)
  ▫ Single machine instances
  ▫ Clusters of various types
• Machine types
  ▫ Compute servers
  ▫ Database servers
  ▫ Cluster
  ▫ Specialized architectures
  ▫ Variety of operating systems (Linux flavors, Windows)
Costs
• You pay for (almost) everything you do
  ▫ Data transfers (out)
  ▫ Storage
  ▫ CPU cycles (depends on instance type; one instance is free)
• Can purchase cycles at below-average market price
  ▫ Can provide access to vast amounts of computing power at a price you can afford
• Research grants from Amazon