Yannick Pouliot, PhD
Biocomputational scientist
Butte Laboratory
04/04/2012
Databases, Web Services and
Tools For Systems Immunology
What You Need For Systems Immunology
1. A real hypothesis
▫ No fuzzy brain stuff
2. An understanding of statistics and data mining
▫ …one never understands enough statistics…
3. A lot of data, typically from different “levels” of
reality: organismal, molecular/static,
molecular/functional, etc
▫ … and therefore, databases of some sort
4. Software tools and programming expertise
5. Computing power
1: Hypothesis
Developing a Hypothesis Suitable for Data Mining
• Possibly the hardest step
▫ Must have a measurable metric that can be tested
statistically
• A real hypothesis (H1) looks like this:
▫ H1: Drugs with increased frequency of adverse drug
reactions can be identified from patterns of reactivity
in PubChem Bioassays screens.
• Strictly speaking, statistical tests try to reject the null
hypothesis (H0), which looks like this:
▫ H0: Bioactivity patterns in PubChem Bioassays do not
distinguish drugs with increased frequency of adverse
drug reactions
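To make the H1/H0 distinction concrete, here is a minimal sketch of an exact permutation test in Python. It is only one of many tests that could be used, not the method behind the PubChem analysis. The metric (number of active bioassays per drug), the function name, and all numbers are hypothetical and purely illustrative.

```python
from itertools import combinations
from statistics import mean


def exact_permutation_p_value(group_a, group_b):
    """Two-sided exact permutation test on the difference of group means."""
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    observed = abs(mean(group_a) - mean(group_b))
    total = 0
    at_least_as_extreme = 0
    # Enumerate every way of relabelling n_a of the pooled values as "group A".
    for chosen in combinations(range(len(pooled)), n_a):
        chosen_set = set(chosen)
        relabelled_a = [pooled[i] for i in chosen]
        relabelled_b = [pooled[i] for i in range(len(pooled)) if i not in chosen_set]
        difference = abs(mean(relabelled_a) - mean(relabelled_b))
        total += 1
        # Small tolerance guards against floating-point rounding.
        if difference >= observed - 1e-12:
            at_least_as_extreme += 1
    return at_least_as_extreme / total


# Synthetic numbers for illustration only: active bioassays per drug.
high_adr_drugs = [11, 12, 13, 14, 15]
low_adr_drugs = [1, 2, 3, 4, 5]
p_value = exact_permutation_p_value(high_adr_drugs, low_adr_drugs)
print(f"p = {p_value:.4f}")
```

A small p-value is evidence against H0 (the group labels do not matter). It does not prove H1.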
2: Statistics & Data Mining
Understanding Statistics: Essential
• Not easy; counter-intuitive
• Critical, because with large volumes of data comes the
guarantee that you will always find “something”
▫ … except that it will most likely be purely artifactual
Q: ever heard of multiple testing correction?
If not, read Bill Noble’s excellent description: Noble, W. S.
How does multiple testing correction work? Nat Biotech
27, 1135-1137 (2009).
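As a rough illustration of what multiple testing correction does, the sketch below compares naive thresholding with the Bonferroni and Benjamini-Hochberg procedures. The p-values are made up, and the function names are illustrative rather than taken from any library.

```python
def bonferroni(p_values, alpha=0.05):
    """True where a test survives Bonferroni correction (controls family-wise error)."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]


def benjamini_hochberg(p_values, alpha=0.05):
    """True where a test is significant at false discovery rate alpha."""
    m = len(p_values)
    # Indices of the tests, sorted by ascending p-value.
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k with p_(k) <= (k / m) * alpha.
    largest_k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            largest_k = rank
    # Reject the hypotheses with the largest_k smallest p-values.
    significant = [False] * m
    for idx in order[:largest_k]:
        significant[idx] = True
    return significant


# Made-up p-values from ten hypothetical tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
naive = [p <= 0.05 for p in p_values]
print(sum(naive), sum(bonferroni(p_values)), sum(benjamini_hochberg(p_values)))
```

With these numbers, fewer tests stay significant after either correction than under the naive 0.05 cutoff, which is the point: many of the uncorrected "hits" are likely artifacts.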
Learning About Statistics
Introductory
• Norman, Streiner (2008): Biostatistics, the Bare Essentials;
Hamilton.
More advanced
• Vittinghoff, Eric. (2005): Regression methods in biostatistics;
Springer.
• Gentleman et al., (2005): Bioinformatics and Computational
Biology Solutions Using R and Bioconductor; Springer.
Advanced
• Doncaster & Davey (2007): Analysis of Variance and
Covariance: How to Choose and Construct Models for the
Life Sciences; Cambridge University Press.
Understanding Data Mining
• Data mining uses statistical techniques + other
techniques that are uniquely “computational”
• Key to Systems Immunology
• Resources:
▫ Excellent introduction, Weka-specific:
 Witten & Frank (2005): Data Mining: Practical Machine
Learning Tools and Techniques, Second Edition; Morgan
Kaufmann
▫ Nisbet et al., (2009): Handbook of Statistical Analysis and
Data Mining Applications; Wiley
• Tools: coming up
3: Data
Huge Numbers of Databases
• Many need to be licensed ($)
▫ Ingenuity Pathways Analysis (IPA)
 Excellent but pricey
▫ MetaCore
 competitor to IPA
 available from Lane Library
• Many more freely available
▫ DAVID: similar to IPA and MetaCore
▫ Typically dirtier than commercial products, but
sometimes much more comprehensive
▫ Consult Nucleic Acids Research’s yearly database issue
The Bad News
• To be useful in Systems Medicine, databases
need to offer one of the following:
▫ Be downloadable (FTP) in text or other form
▫ Be accessible programmatically over the Internet
(e.g., Web service)
Clicking on Web interfaces just doesn’t cut it…
This means knowing about databases and having
programming skills (more later)
A Small Sample of DBs Crucial to
Systems Immunology
• NCBI: Entrez, GEO, PubMed, Gene, Genome,
RefSeq, dbSNP
• EBI: ArrayExpress, Gene Expression Atlas,
Ensembl
• Mouse Genome Database
• DrugBank
• BioGPS
• HapMap
• STITCH: interactions between compounds and
proteins
• UMLS (Unified Medical Language System)
Unified Medical Language System
• Developed by National Library of Medicine
• Data files and software that bring together
multiple biomedical vocabularies and ontologies
to enable semantic interoperability
▫ repository of terms, definitions and concepts in
biomedicine, complete with cross-referencing and
ontological relationships
• Essential but complex and large
• Requires free license
Immunology Databases
The ImmPort Database: The Only DB of Its Kind in
Immunology
• http://immport.niaid.nih.gov/
• Stores results from huge range of assays
▫ HAI, flow cytometry phenotyping
▫ Genotyping
▫ Sequencing
▫ Gene expression
▫ etc
• Intended to be the primary repository for all NIAID
“center” grants
• Can access pre-publication data if given access by PI
• Caveat: volume of PUBLIC datasets is currently
limited
Stanford’s Human Immune Monitoring
Center (HIMC) Database
• Stanford Data Miner is HIMC’s data mining
database
• Stores many of the assays run by HIMC
• Ask HIMC for access to data from researchers who
use HIMC (requires their consent)
Next Level Up: Relational Databases
Take Your Pick
Why Relational Databases?
• Much faster access to data
• Data are safe
• Completely robust query answers
• Good scaling
• Highly integrative
▫ Cross-database querying: essential!
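A minimal sketch of what cross-database querying looks like in practice. It uses Python's built-in `sqlite3` module with an attached in-memory database as a stand-in for a MySQL server; the `clinical` database, the table names, and all values are hypothetical.

```python
import sqlite3

# In-memory SQLite database standing in for a MySQL server (assumption).
conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Attach a second, separate database under the hypothetical name "clinical".
cur.execute("ATTACH DATABASE ':memory:' AS clinical")

# One table per database: expression data here, treatment data in "clinical".
cur.execute("CREATE TABLE expression (subject_id TEXT, gene TEXT, level REAL)")
cur.execute("CREATE TABLE clinical.treatment (subject_id TEXT, drug TEXT)")

cur.executemany(
    "INSERT INTO expression VALUES (?, ?, ?)",
    [
        ("S1", "IL6", 2.0), ("S2", "IL6", 4.0), ("S3", "IL6", 9.0),
        ("S1", "TNF", 1.0), ("S2", "TNF", 3.0), ("S3", "TNF", 5.0),
    ],
)
cur.executemany(
    "INSERT INTO clinical.treatment VALUES (?, ?)",
    [("S1", "drugA"), ("S2", "drugA"), ("S3", "placebo")],
)
conn.commit()

# A single query joins tables that live in two different databases.
rows = cur.execute(
    """
    SELECT t.drug, e.gene, AVG(e.level) AS mean_level
    FROM expression AS e
    JOIN clinical.treatment AS t ON t.subject_id = e.subject_id
    GROUP BY t.drug, e.gene
    ORDER BY t.drug, e.gene
    """
).fetchall()

for row in rows:
    print(row)
```

In MySQL the same idea is written by qualifying table names with the database name (`database.table`) in a single query.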
Recommendation: MySQL
• Nothing magical about MySQL
▫ Widest usage in bioinformatics
▫ Free (community edition)
▫ Runs on everything (Linux, Win, Mac)
▫ Easiest relational DB (short of MS Access)
• Resources
▫ Moes (2005): Beginning MySQL; Wiley
▫ DuBois (2007): MySQL Cookbook; O’Reilly
▫ Dyer (2008): MySQL in a Nutshell; O’Reilly
4: Software tools and programming
expertise
The Good News
• Free software!
• Free algorithms!
• Pre-coded algorithms (i.e., packages)!
• Very cheap computing power!
The Bad News
• Dunno how to use
• “Not talented”
• “Not enough time”
• Can’t be bothered
▫ e.g., reading the paper describing the software tool
one is relying on
More Good News
• Not that hard
• Lots and lots of good resources
• Read a book, dammit
• Find a buddy
• Use Cloud instances (preconfigured machines)
▫ Can even be free!
“Gateway Drugs” to Programming:
Workflow Systems
• GenePattern
▫ Predominantly oriented toward gene expression
analysis
▫ Public server available
• Galaxy
▫ Predominantly oriented toward sequence (NGS)
analysis
▫ Public server available
• Weka
▫ Easiest way to learn data mining
But Seriously: Why Programming?
• Address small problems that can nail you
• Address bigger problems by standing on the
shoulders of giants
• Flexibility: If you’re doing “real” science, off-the-
shelf software will fail you every time
▫ 80% rule…
Example: Don’t Try This With Excel
• Millions of sequence reads compared against mouse transcriptome
• Determining number of distinct species and frequency of members in each
• Summarize using plots for each codon
How it’s done: SQL + R
Another Example…
Languages/Systems You Can’t Do
Without
• SQL
▫ To talk to MySQL
• Perl or Python
▫ To glue things together
• R (“R Project”)
▫ To perform heavy-duty statistical analysis
• Weka
▫ To apply machine learning algorithms
The Inside Scoop of Making
Programming Work for You
• Diagram or write down your process
▫ Don’t just sit down and write code
• Write comments
▫ “I’m doing this because of this special case over here”
• Use meaningful variable names
▫ $c = not good
• Use development tools
▫ RStudio (for writing R code)
▫ Eclipse (for writing in almost any language)
▫ HeidiSQL (SQL browser)
1. download subject <--> group mapping table
2. download drug treatment data for each subject ID
   create two sets of subjects
     ImmunoTreatedSubjects
     NonImmunoTreatedSubjects
3. download gene type data
     ImmunoGenes
     NonImmunoGenes
4. calculate variance of each gene set for each subject
5. create data frame to store (4) --> varGeneSetForEachSubject
6. compute t test to determine whether mean is significantly different
   first test: generate statistic for each individual subject
   --> compare variance of ImmunoGenes vs. variance of NonImmunoGenes for
       each subject
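Once the process is written down, steps 4 to 6 translate almost line by line into code. The sketch below is one possible reading of the plan, in Python with only the standard library. The download steps (1 to 3) are replaced by a small synthetic table, the gene lists and expression values are hypothetical, and the comparison is done as a paired t statistic across subjects, which is an assumption about what step 6 intends.

```python
import math
import statistics

# Stand-in for steps 1-3: synthetic expression levels, subject -> gene -> level.
expression = {
    "S1": {"IL6": 5.0, "TNF": 9.0, "IFNG": 1.0, "ACTB": 4.0, "GAPDH": 5.0, "TUBB": 6.0},
    "S2": {"IL6": 2.0, "TNF": 6.0, "IFNG": 10.0, "ACTB": 3.0, "GAPDH": 5.0, "TUBB": 7.0},
    "S3": {"IL6": 1.0, "TNF": 4.0, "IFNG": 7.0, "ACTB": 5.0, "GAPDH": 5.0, "TUBB": 5.0},
}
immuno_genes = ["IL6", "TNF", "IFNG"]          # hypothetical ImmunoGenes
non_immuno_genes = ["ACTB", "GAPDH", "TUBB"]   # hypothetical NonImmunoGenes


def variance_of_gene_set(levels, genes):
    """Sample variance of one subject's expression levels over a gene set."""
    return statistics.variance([levels[gene] for gene in genes])


# Steps 4-5: variance of each gene set for each subject, stored in one structure.
var_gene_set_for_each_subject = {
    subject: {
        "immuno": variance_of_gene_set(levels, immuno_genes),
        "non_immuno": variance_of_gene_set(levels, non_immuno_genes),
    }
    for subject, levels in expression.items()
}

# Step 6 (one reading): paired t statistic on the per-subject differences.
differences = [
    v["immuno"] - v["non_immuno"] for v in var_gene_set_for_each_subject.values()
]
n = len(differences)
t_statistic = statistics.mean(differences) / (statistics.stdev(differences) / math.sqrt(n))
degrees_of_freedom = n - 1
print(f"t = {t_statistic:.3f}, df = {degrees_of_freedom}")
```

Turning the statistic into a p-value requires the t distribution, which the standard library does not provide; in practice this step would be done in R or with a statistics package.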
Programming Without Programming: NCBI’s
Ebot
• Uses NCBI e-utilities (“Web services”)
▫ Programmatic access to NCBI databases,
including PubMed
▫ VERY useful for data mining
• Ebot codes the particular kind of service you
want to use
• Still, it only gets you so far, but at least the
heaviest lifting has been done (and it is heavy…)
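For a sense of what such a Web service call looks like underneath, here is a minimal Python sketch that builds an E-utilities ESearch request for PubMed. It only constructs the URL and does not contact NCBI; the helper name and the search term are illustrative, and this is not the code that Ebot itself generates.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def build_esearch_url(database, term, max_results=20):
    """Build an ESearch URL that asks for record IDs matching a query."""
    params = {
        "db": database,         # which NCBI database to search
        "term": term,           # the query, as typed into the Web interface
        "retmax": max_results,  # maximum number of IDs to return
        "retmode": "json",      # response format
    }
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


url = build_esearch_url("pubmed", "systems immunology")
print(url)
```

The resulting URL can be fetched with any HTTP client (for example `urllib.request.urlopen`); NCBI's usage guidelines on request rates apply.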
Heavy Lifting: R
R: Why It Hurts So Good
• The “R Project” (aka R) is the premier Open Source
statistical and data mining language and suite of
libraries.
• Pros
▫ Free, runs on everything
▫ Very flexible statistical computing
▫ Dominant standard in biocomputing
▫ Big user community at Stanford
▫ Many key libraries written at Stanford
• Cons
▫ Non-trivial learning curve
▫ Documentation is of variable quality
Key R Resources
Three essentials:
• RStudio
▫ Integrated development environment
 don’t code R w/o it!
• Crawley (2007): The R Book; Wiley
• Matloff (2011): The Art of R Programming; No
Starch Press
• Teetor (2011): The R Cookbook; O’Reilly
• Wickham (2009): ggplot2; Springer
Lighter Lifting: Weka
WEKA Data Mining Suite
• Machine learning/data mining software written
in Java (distributed under the GNU General Public
License)
• Main features:
▫ Comprehensive set of data pre-processing tools,
learning algorithms and evaluation methods
▫ Graphical user interfaces (incl. data visualization)
▫ Environment for comparing learning algorithms
• Heavily referenced in “Data Mining” (Witten &
Frank)
University of Waikato
Perl, Python
• Either is a great language for bioinformatics
• Run on anything
• Use it to quickly glue systems together, e.g.,
▫ Integrate MySQL and R together
▫ Run Web services queries
• Python has more growth potential
▫ Preferable over Perl
5: Computing Power
Why The Cloud Matters For Biologists
• You are purchasing computing power, not
machines
▫ Never outdated
• You can purchase as much/little as you need
▫ You don’t have to run/manage what you don’t use
• Can easily migrate from one machine type to
another (minutes)
• Can add storage in seconds
• Accessible from anywhere
• Easy to share things (e.g., large datasets) with others
Why Own When You Can Rent?
Welcome To the Cloud…
For biomedical computing, Amazon
Cloud is ideal because it provides
highly flexible storage and compute
power sold on a pay-per-use basis
Another Example: PathSeq
• Compare millions of short-read sequences
against all genomic + transcriptomic sequences
for all microbes (!)
Amazon Cloud
“Management Console”
Q: What does working with a Cloud
machine feel like?
A: It’s not materially different from
accessing a machine on our cluster,
except you can do anything you want
Main Services Provided by Amazon Cloud
• Storage
▫ Traditional disk volumes
▫ S3 buckets (“Simple Storage Service”)
• Computing (EC2 – “Elastic Compute Cloud”)
▫ Single machine instances
▫ Clusters of various types
• Machine types
▫ Compute servers
▫ Database servers
▫ Cluster
▫ Specialized architectures
▫ Variety of operating systems (LINUX flavors, Windows)
Costs
• You pay for (almost) everything you do
▫ Data transfers (out)
▫ Storage
▫ CPU cycles (depends on instance type; one
instance is free)
• Can purchase cycles at below average market
price
▫ Can provide access to vast amounts of computing
power at a price you can afford
• Research grants from Amazon
Questions?

  • 4. Developing a Hypothesis Suitable for Data Mining • Possibly the hardest step ▫ Must have a measurable metric that can be tested statistically • A real hypothesis (H1) looks like this: ▫ H1: Drugs with increased frequency of adverse drug reactions can be identified from patterns of reactivity in PubChem Bioassays screens. • Actually, statistical tests try to reject the null (H0) hypothesis, which looks like this: ▫ H0: Bioactivity patterns in PubChem Bioassays do not distinguish drugs with increased frequency of adverse drug reactions
  • 5. 2: Statistics & Data Mining
  • 6. Understanding Statistics: Essential • Not easy; counter-intuitive • Critical, because with large volumes of data comes the guarantee that you will always find “something” ▫ … except that it will most likely be purely artifactual Q: ever heard of multiple testing correction? If not, read Bill Noble’s excellent description: Noble, W. S. How does multiple testing correction work? Nat Biotech 27, 1135-1137 (2009).
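To make the multiple testing point on slide 6 concrete, here is a minimal sketch of one common correction, the Benjamini-Hochberg false discovery rate procedure. The slide does not say which correction to use, so the choice of method, the function name `benjamini_hochberg`, and the p-values are all assumptions made for illustration.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values, in the original order."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Made-up raw p-values from eight hypothetical tests.
raw = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
adj = benjamini_hochberg(raw)

naive_hits = [p for p in raw if p <= 0.05]       # before correction
corrected_hits = [p for p in adj if p <= 0.05]   # after correction
print(len(naive_hits), len(corrected_hits))
```

With these made-up numbers, five tests pass an uncorrected 0.05 threshold but only two survive the correction, which is the "you will always find something" effect the slide warns about.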
  • 7. Learning About Statistics Introductory • Norman, Streiner (2008): Biostatistics, the Bare Essentials; Hamilton. More advanced • Vittinghoff, Eric. (2005): Regression methods in biostatistics; Springer. • Gentleman et al., (2005): Bioinformatics and Computational Biology Solutions Using R and Bioconductor; Springer. Advanced Doncaster & Davey (2007): Analysis of Variance and Covariance: How to Choose and Construct Models for the Life Sciences; Cambridge University Press.
  • 8. Understanding Data Mining • Data mining uses statistical techniques + other techniques that are uniquely “computational” • Key to Systems Immunology • Resources: ▫ Excellent introduction, Weka-specific:  Witten & Frank (2005): Data Mining: Practical Machine Learning Tools and Techniques, Second Edition; Morgan Kaufmann ▫ Nisbet et al., (2009): Handbook of Statistical Analysis and Data Mining Applications; Wiley • Tools: coming up
  • 10. Huge Numbers of Databases • Many need to be licensed ($) ▫ Ingenuity Pathways Analysis (IPA)  Excellent but pricey ▫ MetaCore  competitor to IPA  available from Lane Library • Many more freely available ▫ DAVID: similar to IPA and MetaCore ▫ Typically dirtier than commercial products, but sometimes much more comprehensive ▫ Consult Nucleic Acids Research’s yearly database issue
  • 11. The Bad News • To be useful in Systems Medicine, databases need to offer one of the following: ▫ Be downloadable (FTP) in text or other form ▫ Be accessible programmatically over the Internet (e.g., Web service) Clicking on Web interfaces just doesn’t cut it… This means knowing about databases and having programming skills (more later)
  • 12. A Small Sample of DBs Crucial to Systems Immunology • NCBI: Entrez, GEO, PubMed, Gene, Genome, RefSeq, dbSNP • EBI: Array Express, Gene Expression Atlas, ENSEMBL • Mouse Genome Database • DrugBank • BioGPS • HapMap • STITCH: interactions between compounds and proteins • UMLS (Unified Medical Language System)
  • 13. Unified Medical Language System • Developed by National Library of Medicine • = data files and software that bring together multiple biomedical vocabularies and ontologies to enable semantic interoperability. ▫ repository of terms, definitions and concepts in biomedicine, complete with cross-referencing and ontological relationships • Essential but complex and large • Requires free license
  • 15. The ImmPort Database: The Only DB of Its Kind in Immunology • http://immport.niaid.nih.gov/ • Stores results from huge range of assays ▫ HAI  flow cytometry phenotyping ▫ Genotyping ▫ Sequencing ▫ Gene expression ▫ etc • Intended to be the primary repository for all NIAID “center” grants • Can access pre-publication data if given access by PI • Caveat: volume of PUBLIC datasets is currently limited
  • 16. Stanford’s Human Immune Monitoring Center (HIMC) Database • Stanford Data Miner is HIMC’s data mining database • Stores many of the assays run by HIMC • Ask HIMC for access to data from researchers who use HIMC (will require their consent)
  • 17. Next Level Up: Relational Databases Take Your Pick
  • 18. Why Relational Databases? • Much faster access to data • Data are safe • Completely robust query answers • Good scaling • Highly integrative ▫ Cross-database querying: essential!
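The cross-database querying mentioned on slide 18 can be sketched in a few lines. This example uses Python's built-in `sqlite3` module rather than MySQL so that it runs without a server; the database names (`clinical`, `assays`), table layouts, and values are hypothetical. In MySQL the same idea is expressed by qualifying each table with its database name in a single query.

```python
import sqlite3

# One connection, two separate in-memory databases attached under
# hypothetical names: "clinical" and "assays".
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS clinical")
conn.execute("ATTACH DATABASE ':memory:' AS assays")

conn.executescript("""
CREATE TABLE clinical.subjects (subject_id TEXT PRIMARY KEY, treatment TEXT);
CREATE TABLE assays.expression (subject_id TEXT, gene TEXT, level REAL);
INSERT INTO clinical.subjects VALUES ('S1', 'immuno'), ('S2', 'none');
INSERT INTO assays.expression VALUES
  ('S1', 'GENE_A', 2.0), ('S1', 'GENE_B', 4.0),
  ('S2', 'GENE_A', 1.0), ('S2', 'GENE_B', 3.0);
""")

# A single query that joins tables living in two different databases.
rows = conn.execute("""
SELECT s.treatment, AVG(e.level)
FROM clinical.subjects AS s
JOIN assays.expression AS e ON e.subject_id = s.subject_id
GROUP BY s.treatment
ORDER BY s.treatment
""").fetchall()
conn.close()

print(rows)
```

The point of the sketch is the join itself: once data from different sources sit in relational tables, combining them is one declarative query rather than a round of manual file merging.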
  • 19. Recommendation: MySQL • Nothing magical about MySQL ▫ Widest usage in bioinformatics ▫ Free (community edition) ▫ Runs on everything (Linux, Win, Mac) ▫ Easiest relational DB (short of MS Access) • Resources ▫ Moes (2005): Beginning MySQL; Wiley ▫ DuBois (2007): MySQL Cookbook; O’Reilly ▫ Dyer (2008): MYSQL in a Nutshell; O’Reilly
  • 20. 4: Software tools and programming expertise
  • 21. • Free software! • Free algorithms! • Pre-coded algorithms (i.e., packages)! • Very cheap computing power! The Good News
  • 22. The Bad News • Dunno how to use • “Not talented” • “Not enough time” • Can’t be bothered ▫ e.g., reading the paper describing the software tool one is relying on
  • 23. More Good News • Not that hard • Lots and lots of good resources • Read a book, dammit • Find a buddy • Use Cloud instances (preconfigured machines) ▫ Can even be free!
  • 24. “Gateway Drugs” to Programming: Workflow Systems • GenePattern ▫ Predominantly oriented toward gene expression analysis ▫ Public server available • Galaxy ▫ Predominantly oriented toward sequence (NGS) analysis ▫ Public server available • Weka ▫ Easiest way to learn data mining
  • 25. But Seriously: Why Programming? • Address small problems that can nail you • Address bigger problems by standing on the shoulders of giants • Flexibility: If you’re doing “real” science, off-the-shelf software will fail you every time ▫ 80% rule…
  • 26. Example: Don’t Try This With Excel •Millions of sequence reads compared against mouse transcriptome • Determining number of distinct species and frequency of members in each • Summarize using plots for each codon
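As a rough illustration of the counting step on slide 26, the sketch below tallies how many distinct "species" appear among read assignments and how frequent each one is. It assumes the comparison against the transcriptome has already produced one label per read; the labels, the function name `species_frequencies`, and the reading of "species" as distinct matched sequences are assumptions, since the slide does not spell them out.

```python
from collections import Counter


def species_frequencies(assignments):
    """Map each distinct label to its relative frequency among all reads."""
    counts = Counter(assignments)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Hypothetical per-read assignments; a real run would have millions.
hits = ["transcript_1", "transcript_2", "transcript_1",
        "transcript_3", "transcript_1"]
freqs = species_frequencies(hits)

print(len(freqs), "distinct species")
for label, freq in sorted(freqs.items()):
    print(label, round(freq, 2))
```

A `Counter` streams over any iterable, so the same function works on a generator that reads assignments from a file line by line, which is where a spreadsheet stops being practical.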
  • 30. Languages/Systems You Can’t Do Without • SQL ▫ To talk to MySQL • Perl or Python ▫ To glue things together • R (“R Project”) ▫ To perform heavy-duty statistical analysis • Weka ▫ To apply machine learning algorithms
  • 31. The Inside Scoop of Making Programming Work for You • Diagram or write down your process ▫ Don’t just sit down and write code • Write comments ▫ “I’m doing this because of this special case over here” • Use meaningful variable names ▫ $c = not good • Use development tools ▫ RStudio (for writing R code) ▫ Eclipse (for writing in almost any language) ▫ HeidiSQL (SQL browser) Example process outline: 1. Download subject <--> group mapping table. 2. Download drug treatment data for each subject ID; create two sets of subjects: ImmunoTreatedSubjects, NonImmunoTreatedSubjects. 3. Download gene type data: ImmunoGenes, NonImmunoGenes. 4. Calculate variance of each gene set for each subject. 5. Create data frame to store (4) --> varGeneSetForEachSubject. 6. Compute t test to determine whether mean is significantly different; first test: generate statistic for each individual subject --> compare variance of ImmunoGenes vs. variance of NonImmunoGenes for each subject.
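Steps 4 to 6 of the process outline on slide 31 can be sketched as follows. The download steps (1 to 3) are replaced here by a small made-up table, and the comparison is written as a paired t statistic on the per-subject variances, which is one plausible reading of the "first test". Gene names, values, and variable names other than those in the outline are hypothetical, and no p-value is computed because that needs a t distribution from a statistics library.

```python
import math
import statistics

# Made-up expression levels (subject -> gene -> level); stands in for steps 1-3.
expression = {
    "S1": {"IG1": 1.0, "IG2": 5.0, "IG3": 9.0, "NG1": 2.0, "NG2": 3.0, "NG3": 4.0},
    "S2": {"IG1": 2.0, "IG2": 4.0, "IG3": 6.0, "NG1": 1.0, "NG2": 2.0, "NG3": 3.0},
    "S3": {"IG1": 0.0, "IG2": 3.0, "IG3": 6.0, "NG1": 5.0, "NG2": 5.0, "NG3": 5.0},
}
immuno_genes = ["IG1", "IG2", "IG3"]        # hypothetical ImmunoGenes
non_immuno_genes = ["NG1", "NG2", "NG3"]    # hypothetical NonImmunoGenes


def gene_set_variance(levels, genes):
    """Step 4: sample variance of one gene set within one subject."""
    return statistics.variance([levels[g] for g in genes])


# Step 5: store both variances for each subject.
var_gene_set_for_each_subject = {
    subject: (gene_set_variance(levels, immuno_genes),
              gene_set_variance(levels, non_immuno_genes))
    for subject, levels in expression.items()
}

# Step 6: paired t statistic on the per-subject variance differences.
diffs = [v_immuno - v_non
         for v_immuno, v_non in var_gene_set_for_each_subject.values()]
n = len(diffs)
t_stat = statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))

print(var_gene_set_for_each_subject)
print("t statistic:", round(t_stat, 3), "with", n - 1, "degrees of freedom")
```

Writing the outline first pays off here: each numbered step maps to one short, commented piece of code, and the variable names say what they hold.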
  • 32. Programming Without Programming: NCBI’s Ebot • Uses NCBI e-utilities (“Web services”) ▫ Programmatic access to NCBI databases, including PubMed ▫ VERY useful for data mining • Ebot codes the particular kind of service you want to use • Still, it only gets you so far, but at least the heaviest lifting has been done (and it is heavy…)
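For a sense of what the e-utilities mentioned on slide 32 look like from a script, the sketch below builds an ESearch request URL for PubMed. It only constructs the URL and does not contact NCBI; the helper name `esearch_url` and the search term are assumptions made for illustration, and this is not the code that Ebot itself generates.

```python
from urllib.parse import urlencode

# Base address of the NCBI E-utilities.
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(db, term, retmax=20):
    """Build an ESearch URL that asks for up to `retmax` matching record IDs."""
    params = {"db": db, "term": term, "retmax": retmax, "retmode": "json"}
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


url = esearch_url("pubmed", "systems immunology")
print(url)
```

Fetching the URL, for example with `urllib.request.urlopen(url)`, would return the matching PubMed IDs; NCBI's usage guidelines on request rates apply to any script that does so.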
  • 34. R: Why It Hurts So Good • The “R Project” (aka R) is the premier Open Source statistical and data mining language and suite of libraries. • Pros ▫ Free, runs on everything ▫ Very flexible statistical computing ▫ Dominant standard in biocomputing ▫ Big user community at Stanford ▫ Many key libraries written at Stanford • Cons ▫ Non-trivial learning curve ▫ Documentation is of variable quality
  • 35. Key R Resources Three essentials: • RStudio ▫ Integrated development environment  don’t code R w/o it! • Crawley (2007): The R Book; Wiley • Matloff (2011): The Art of R Programming; No Starch Press • Teetor (2011): The R Cookbook; O’Reilly • Wickham (2009): ggplot2; Springer
  • 37. WEKA Data Mining Suite • Machine learning/data mining software written in Java (distributed under the GNU General Public License) • Main features: ▫ Comprehensive set of data pre-processing tools, learning algorithms and evaluation methods ▫ Graphical user interfaces (incl. data visualization) ▫ Environment for comparing learning algorithms • Heavily referenced in “Data Mining” (Witten & Frank)
  • 40. Perl, Python • Either is a great language for bioinformatics • Run on anything • Use it to quickly glue systems together, e.g., ▫ Integrate MySQL and R together ▫ Run Web services queries • Python has more growth potential ▫ Preferable over Perl
  • 42. Why The Cloud Matters For Biologists • You are purchasing computing power, not machines ▫  never outdated • You can purchase as much/little as you need ▫ You don’t have to run/manage what you don’t use • Can easily migrate from one machine type to another (minutes) • Can add storage in seconds • Accessible from anywhere • Easy to share e.g., (large) datasets with others
  • 43. Why Own When You Can Rent? Welcome To the Cloud…
  • 44. For biomedical computing, Amazon Cloud is ideal because it provides highly flexible storage and compute power sold on a pay-per-use basis
  • 45. Another Example: PathSeq • Compare millions of short-read sequences against all genomic + transcriptomic sequences for all microbes (!) Amazon Cloud “Management Console”
  • 46. Q: What does working with a Cloud machine feel like? A: It’s not materially different from accessing a machine on our cluster, except you can do anything you want
  • 47. Main Services Provided by Amazon Cloud • Storage ▫ Traditional disk volumes ▫ S3 buckets (“Simple Storage Service”) • Computing (EC2 – “Elastic Compute Cloud”) ▫ Single machine instances ▫ Clusters of various types • Machine types ▫ Compute servers ▫ Database servers ▫ Cluster ▫ Specialized architectures ▫ Variety of operating systems (Linux flavors, Windows)
  • 48. Costs • You pay for (almost) everything you do ▫ Data transfers (out) ▫ Storage ▫ CPU cycles (depends on instance type; one instance is free) • Can purchase cycles at below average market price ▫ Can provide access to vast amounts of computing power at a price you can afford • Research grants from Amazon

Editor's Notes

  1. Multiple testing correction accounts for random events that falsely appear significant
  2. PPT document created using combination of Perl, MySQL and R
  3. Blue=services I’ve used
  4. Mention cost calculator: http://calculator.s3.amazonaws.com/calc5.html