Yannick Pouliot, PhD
Biocomputational scientist
Butte Laboratory
04/04/2012
Databases, Web Services and
Tools For Systems Immunology
What You Need For Systems Immunology
1. A real hypothesis
▫ No fuzzy brain stuff
2. An understanding of statistics and data mining
▫ …one never understands enough statistics…
3. A lot of data, typically from different “levels” of
reality: organismal, molecular/static,
molecular/functional, etc.
▫ … and therefore, databases of some sort
4. Software tools and programming expertise
5. Computing power
1: Hypothesis
Developing a Hypothesis Suitable for Data Mining
• Possibly the hardest step
▫ Must have a measurable metric that can be tested
statistically
• A real hypothesis (H1) looks like this:
▫ H1: Drugs with increased frequency of adverse drug
reactions can be identified from patterns of reactivity
in PubChem Bioassays screens.
• Strictly speaking, statistical tests try to reject the null
(H0) hypothesis, which looks like this:
▫ H0: Bioactivity patterns in PubChem Bioassays do not
distinguish drugs with increased frequency of adverse
drug reactions
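One way to turn a hypothesis like this into a statistical test is a permutation test. The sketch below is only an illustration: the per-drug "reactivity score" and all of the values are invented, and the function name is hypothetical. It asks how often shuffling the group labels (which is what H0 implies is harmless) produces a difference in means at least as large as the one observed.

```python
import random
from statistics import mean


def permutation_p_value(group_a, group_b, n_permutations=10_000, seed=0):
    """One-sided permutation test for mean(group_a) > mean(group_b).

    Under H0 the group labels are exchangeable, so we shuffle the pooled
    values and count how often the shuffled difference reaches the
    observed one.
    """
    rng = random.Random(seed)
    observed = mean(group_a) - mean(group_b)
    pooled = list(group_a) + list(group_b)
    n_a = len(group_a)
    at_least_as_extreme = 0
    for _ in range(n_permutations):
        rng.shuffle(pooled)
        diff = mean(pooled[:n_a]) - mean(pooled[n_a:])
        # Small tolerance guards against floating-point rounding.
        if diff >= observed - 1e-12:
            at_least_as_extreme += 1
    # Add-one correction so the estimate is never exactly zero.
    return (at_least_as_extreme + 1) / (n_permutations + 1)


# Invented scores, for illustration only (not real PubChem data).
high_adr_drugs = [0.42, 0.38, 0.51, 0.47, 0.35, 0.44]
low_adr_drugs = [0.21, 0.30, 0.18, 0.27, 0.25, 0.33]

p = permutation_p_value(high_adr_drugs, low_adr_drugs)
print(f"permutation p-value: {p:.4f}")
```

A small p-value is evidence against H0; it does not by itself prove H1.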
2: Statistics & Data Mining
Understanding Statistics: Essential
• Not easy; counter-intuitive
• Critical, because with large volumes of data comes the
guarantee that you will always find “something”
▫ … except that it will most likely be purely artifactual
Q: ever heard of multiple testing correction?
If not, read Bill Noble’s excellent description: Noble, W. S.
How does multiple testing correction work? Nat Biotech
27, 1135-1137 (2009).
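To make the idea concrete, here is a minimal sketch of two widely used corrections, Bonferroni and Benjamini-Hochberg. The p-values are invented for illustration and the function names are my own; this is not taken from the paper cited above.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values (controls the false discovery rate)."""
    m = len(p_values)
    # Indices of the p-values, from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, keeping the adjusted values monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Invented raw p-values from eight hypothetical tests.
raw = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
alpha = 0.05

n_raw = sum(p < alpha for p in raw)
n_bonferroni = sum(p < alpha for p in bonferroni(raw))
n_bh = sum(p < alpha for p in benjamini_hochberg(raw))
print(n_raw, n_bonferroni, n_bh)
```

With these made-up numbers, fewer tests stay below 0.05 after either correction than before it, which is the point: uncorrected "hits" in large datasets are often artifacts.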
Learning About Statistics
Introductory
• Norman, Streiner (2008): Biostatistics, the Bare Essentials;
Hamilton.
More advanced
• Vittinghoff, Eric. (2005): Regression methods in biostatistics;
Springer.
• Gentleman et al., (2005): Bioinformatics and Computational
Biology Solutions Using R and Bioconductor; Springer.
Advanced
• Doncaster & Davey (2007): Analysis of Variance and
Covariance: How to Choose and Construct Models for the
Life Sciences; Cambridge University Press.
Understanding Data Mining
• Data mining uses statistical techniques + other
techniques that are uniquely “computational”
• Key to Systems Immunology
• Resources:
▫ Excellent introduction, Weka-specific:
▪ Witten & Frank (2005): Data Mining: Practical Machine
Learning Tools and Techniques, Second Edition; Morgan
Kaufmann
▫ Nisbet et al., (2009): Handbook of Statistical Analysis and
Data Mining Applications; Wiley
• Tools: coming up
3: Data
Huge Numbers of Databases
• Many need to be licensed ($)
▫ Ingenuity Pathways Analysis (IPA)
▪ Excellent but pricey
▫ MetaCore
▪ competitor to IPA
▪ available from Lane Library
• Many more freely available
▫ DAVID: similar to IPA and MetaCore
▫ Typically dirtier than commercial products, but
sometimes much more comprehensive
▫ Consult Nucleic Acids Research’s yearly database issue
The Bad News
• To be useful in Systems Medicine, databases
need to offer one of the following:
▫ Be downloadable (FTP) in text or other form
▫ Be accessible programmatically over the Internet
(e.g., Web service)
Clicking on Web interfaces just doesn’t cut it…
This means knowing about databases and having
programming skills (more later)
A Small Sample of DBs Crucial to
Systems Immunology
• NCBI: Entrez, GEO, PubMed, Gene, Genome,
RefSeq, dbSNP
• EBI: ArrayExpress, Gene Expression Atlas,
Ensembl
• Mouse Genome Database
• DrugBank
• BioGPS
• HapMap
• STITCH: interactions between compounds and
proteins
• UMLS (Unified Medical Language System)
Unified Medical Language System
• Developed by National Library of Medicine
• Data files and software that bring together
multiple biomedical vocabularies and ontologies
to enable semantic interoperability.
▫ repository of terms, definitions and concepts in
biomedicine, complete with cross-referencing and
ontological relationships
• Essential but complex and large
• Requires free license
Immunology Databases
The ImmPort Database: The Only DB of Its Kind in
Immunology
• http://immport.niaid.nih.gov/
• Stores results from huge range of assays
▫ HAI, flow cytometry phenotyping
▫ Genotyping
▫ Sequencing
▫ Gene expression
▫ etc
• Intended to be the primary repository for all NIAID
“center” grants
• Can access pre-publication data if given access by PI
• Caveat: volume of PUBLIC datasets is currently
limited
Stanford’s Human Immune Monitoring
Center (HIMC) Database
• Stanford Data Miner is HIMC’s data mining
database
• Stores many of the assays run by HIMC
• Ask HIMC for access to data from researchers who
use HIMC (will require their consent)
Next Level Up: Relational Databases
Take Your Pick
Why Relational Databases?
• Much faster access to data
• Data are safe
• Completely robust query answers
• Good scaling
• Highly integrative
▫ Cross-database querying: essential!
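A minimal sketch of what cross-database querying looks like. It uses Python's built-in sqlite3 module as a stand-in so that it runs without a server; in MySQL the same idea is written by qualifying tables as database_name.table_name. The table names, column names, and values are all hypothetical.

```python
import sqlite3

# First database: expression measurements (in memory, for illustration).
conn = sqlite3.connect(":memory:")
# Second, separate database attached under the name "annot".
conn.execute("ATTACH DATABASE ':memory:' AS annot")

conn.execute(
    "CREATE TABLE measurement (gene_id TEXT, subject_id TEXT, level REAL)"
)
conn.execute(
    "CREATE TABLE annot.gene (gene_id TEXT PRIMARY KEY, symbol TEXT, is_immune INTEGER)"
)

# Made-up rows; GENE_A etc. are placeholders, not real gene symbols.
conn.executemany(
    "INSERT INTO measurement VALUES (?, ?, ?)",
    [
        ("G1", "S1", 2.0),
        ("G1", "S2", 4.0),
        ("G2", "S1", 10.0),
        ("G2", "S2", 12.0),
        ("G3", "S1", 5.0),
    ],
)
conn.executemany(
    "INSERT INTO annot.gene VALUES (?, ?, ?)",
    [("G1", "GENE_A", 1), ("G2", "GENE_B", 0), ("G3", "GENE_C", 1)],
)

# One query that joins tables living in two different databases.
rows = conn.execute(
    """
    SELECT g.symbol, AVG(m.level) AS mean_level
    FROM measurement AS m
    JOIN annot.gene AS g ON g.gene_id = m.gene_id
    WHERE g.is_immune = 1
    GROUP BY g.symbol
    ORDER BY g.symbol
    """
).fetchall()

for symbol, mean_level in rows:
    print(symbol, mean_level)
```

The join is the integrative step: measurements from one source are filtered by annotations from another, in a single statement.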
Recommendation: MySQL
• Nothing magical about MySQL
▫ Widest usage in bioinformatics
▫ Free (community edition)
▫ Runs on everything (Linux, Win, Mac)
▫ Easiest relational DB (short of MS Access)
• Resources
▫ Moes (2005): Beginning MySQL; Wiley
▫ DuBois (2007): MySQL Cookbook; O’Reilly
▫ Dyer (2008): MySQL in a Nutshell; O’Reilly
4: Software Tools and Programming Expertise
The Good News
• Free software!
• Free algorithms!
• Pre-coded algorithms (i.e., packages)!
• Very cheap computing power!
The Bad News
• Dunno how to use
• “Not talented”
• “Not enough time”
• Can’t be bothered
▫ e.g., reading the paper describing the software tool
one is relying on
More Good News
• Not that hard
• Lots and lots of good resources
• Read a book, dammit
• Find a buddy
• Use Cloud instances (preconfigured machines)
▫ Can even be free!
“Gateway Drugs” to Programming:
Workflow Systems
• GenePattern
▫ Predominantly oriented toward gene expression
analysis
▫ Public server available
• Galaxy
▫ Predominantly oriented toward sequence (NGS)
analysis
▫ Public server available
• Weka
▫ Easiest way to learn data mining
But Seriously: Why Programming?
• Address small problems that can nail you
• Address bigger problems by standing on the
shoulders of giants
• Flexibility: If you’re doing “real” science, off-the-
shelf software will fail you every time
▫ 80% rule…
Example: Don’t Try This With Excel
• Millions of sequence reads compared against mouse transcriptome
• Determining number of distinct species and frequency of members in each
• Summarize using plots for each codon
How it’s done: SQL + R
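The counting step of an analysis like this is a natural fit for SQL. A minimal sketch follows, again using Python's built-in sqlite3 as a stand-in for MySQL; the table layout and the six toy reads are invented, and the plotting step (done in R on the slide) is left out.

```python
import sqlite3

# Toy reads: (read identifier, sequence). Real data would be millions of rows.
reads = [
    ("r1", "ACGT"),
    ("r2", "ACGT"),
    ("r3", "TTGA"),
    ("r4", "ACGT"),
    ("r5", "GGCC"),
    ("r6", "TTGA"),
]

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE read_hit (read_id TEXT PRIMARY KEY, sequence TEXT)")
conn.executemany("INSERT INTO read_hit VALUES (?, ?)", reads)

# Number of distinct sequence species.
n_species = conn.execute(
    "SELECT COUNT(DISTINCT sequence) FROM read_hit"
).fetchone()[0]

# Frequency of members in each species, most abundant first.
frequencies = conn.execute(
    """
    SELECT sequence, COUNT(*) AS n_reads
    FROM read_hit
    GROUP BY sequence
    ORDER BY n_reads DESC, sequence
    """
).fetchall()

print(n_species)
print(frequencies)
```

The database does the grouping, so the same two queries work whether the table holds six rows or many millions; a spreadsheet does not scale that way.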
Another Example…
Languages/Systems You Can’t Do
Without
• SQL
▫ To talk to MySQL
• Perl or Python
▫ To glue things together
• R (“R Project”)
▫ To perform heavy-duty statistical analysis
• Weka
▫ To apply machine learning algorithms
The Inside Scoop of Making
Programming Work for You
• Diagram or write down your process
▫ Don’t just sit down and write code
• Write comments
▫ “I’m doing this because of this special case over here”
• Use meaningful variable names
▫ $c = not good
• Use development tools
▫ RStudio (for writing R code)
▫ Eclipse (for writing in almost any language)
▫ HeidiSQL (SQL browser)
1. download subject <--> group mapping table
2. download drug treatment data for each subject ID
   create two sets of subjects:
     ImmunoTreatedSubjects
     NonImmunoTreatedSubjects
3. download gene type data:
     ImmunoGenes
     NonImmunoGenes
4. calculate variance of each gene set for each subject
5. create data frame to store (4) --> varGeneSetForEachSubject
6. compute t test to determine whether mean is significantly different
   first test: generate statistic for each individual subject
   --> compare variance of ImmunoGenes vs. variance of NonImmunoGenes for each subject
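Steps 4 to 6 of a written plan like this translate almost line for line into code. The sketch below is one possible reading, not the original analysis: it assumes steps 1 to 3 are already done, uses invented expression values, and interprets step 6 as a paired t statistic on the per-subject variances. It stops at the statistic and degrees of freedom; turning those into a p-value needs a t distribution (e.g., from R).

```python
from math import sqrt
from statistics import mean, stdev, variance

# Invented expression values per subject, split by gene set.
expression = {
    "subj1": {"ImmunoGenes": [5.1, 7.9, 3.2, 9.4], "NonImmunoGenes": [6.0, 6.3, 5.8, 6.1]},
    "subj2": {"ImmunoGenes": [4.0, 8.8, 2.5, 7.1], "NonImmunoGenes": [5.5, 6.4, 5.9, 6.6]},
    "subj3": {"ImmunoGenes": [6.2, 9.9, 4.1, 5.0], "NonImmunoGenes": [6.1, 5.7, 6.8, 6.0]},
}


def variance_per_gene_set(expression_by_subject):
    """Steps 4-5: for each subject, the sample variance of each gene set."""
    return {
        subject: {name: variance(values) for name, values in gene_sets.items()}
        for subject, gene_sets in expression_by_subject.items()
    }


def paired_t_statistic(xs, ys):
    """Step 6: paired t statistic and degrees of freedom for xs[i] - ys[i]."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    t = mean(diffs) / (stdev(diffs) / sqrt(n))
    return t, n - 1


var_gene_set_for_each_subject = variance_per_gene_set(expression)
subjects = sorted(var_gene_set_for_each_subject)
immuno = [var_gene_set_for_each_subject[s]["ImmunoGenes"] for s in subjects]
non_immuno = [var_gene_set_for_each_subject[s]["NonImmunoGenes"] for s in subjects]

t_stat, dof = paired_t_statistic(immuno, non_immuno)
print(f"t = {t_stat:.2f}, degrees of freedom = {dof}")
```

Note how the function and variable names mirror the plan, which makes it easy to check the code against what was written down first.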
Programming Without Programming: NCBI’s
Ebot
• Uses NCBI e-utilities (“Web services”)
▫ Programmatic access to NCBI databases,
including PubMed
▫ VERY useful for data mining
• Ebot codes the particular kind of service you
want to use
• Still, it only gets you so far, but at least the
heaviest lifting has been done (and it is heavy…)
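For a sense of what an e-utilities call looks like underneath, here is a minimal hand-written sketch (not Ebot output). It builds an ESearch URL and parses the list of record IDs from the XML reply. The sample reply is abbreviated and uses placeholder IDs so the example runs offline; the function names are my own.

```python
from urllib.parse import urlencode
import xml.etree.ElementTree as ET

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term, db="pubmed", retmax=20):
    """Build an ESearch URL that searches database `db` for `term`."""
    query = urlencode({"db": db, "term": term, "retmax": retmax})
    return f"{EUTILS_BASE}/esearch.fcgi?{query}"


def parse_id_list(xml_text):
    """Extract the record IDs from an ESearch XML reply."""
    root = ET.fromstring(xml_text)
    return [node.text for node in root.findall("./IdList/Id")]


url = esearch_url("systems immunology")
print(url)

# Abbreviated reply with placeholder IDs, standing in for a live response.
sample_reply = (
    "<eSearchResult><Count>2</Count>"
    "<IdList><Id>111</Id><Id>222</Id></IdList>"
    "</eSearchResult>"
)
ids = parse_id_list(sample_reply)
print(ids)
```

To run it against the live service, fetch the URL (for example with urllib.request.urlopen) and pass the decoded body to parse_id_list. Check NCBI's usage guidelines on request rates before looping over many queries.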
Heavy Lifting: R
R: Why It Hurts So Good
• The “R Project” (aka R) is the premier Open Source
statistical and data mining language and suite of
libraries.
• Pros
▫ Free, runs on everything
▫ Very flexible statistical computing
▫ Dominant standard in biocomputing
▫ Big user community at Stanford
▫ Many key libraries written at Stanford
• Cons
▫ Non-trivial learning curve
▫ Documentation is of variable quality
Key R Resources
Essentials:
• RStudio
▫ Integrated development environment
▪ don’t code R w/o it!
• Crawley (2007): The R Book; Wiley
• Matloff (2011): The Art of R Programming; No
Starch Press
• Teetor (2011): The R Cookbook; O’Reilly
• Wickham (2009): ggplot2; Springer
Lighter Lifting: Weka
WEKA Data Mining Suite
• Machine learning/data mining software written
in Java (distributed under the GNU General Public
License)
• Main features:
▫ Comprehensive set of data pre-processing tools,
learning algorithms and evaluation methods
▫ Graphical user interfaces (incl. data visualization)
▫ Environment for comparing learning algorithms
• Heavily referenced in “Data Mining” (Witten &
Frank)
University of Waikato
Perl, Python
• Either is a great language for bioinformatics
• Run on anything
• Use it to quickly glue systems together, e.g.,
▫ Integrate MySQL and R together
▫ Run Web services queries
• Python has more growth potential
▫ Preferable over Perl
5: Computing Power
Why The Cloud Matters For Biologists
• You are purchasing computing power, not
machines
▫ Never outdated
• You can purchase as much/little as you need
▫ You don’t have to run/manage what you don’t use
• Can easily migrate from one machine type to
another (minutes)
• Can add storage in seconds
• Accessible from anywhere
• Easy to share (e.g., large datasets) with others
Why Own When You Can Rent?
Welcome To the Cloud…
For biomedical computing, Amazon
Cloud is ideal because it provides
highly flexible storage and compute
power sold on a per-use basis
Another Example: PathSeq
• Compare millions of short-read sequences
against all genomic + transcriptomic sequences
for all microbes (!)
Amazon Cloud
“Management Console”
Q: What does working with a Cloud
machine feel like?
A: It’s not materially different from
accessing a machine on our cluster,
except you can do anything you want
Main Services Provided by Amazon Cloud
• Storage
▫ Traditional disk volumes
▫ S3 buckets (“Simple Storage Service”)
• Computing (EC2 – “Elastic Compute Cloud”)
▫ Single machine instances
▫ Clusters of various types
• Machine types
▫ Compute servers
▫ Database servers
▫ Cluster
▫ Specialized architectures
▫ Variety of operating systems (Linux flavors, Windows)
Costs
• You pay for (almost) everything you do
▫ Data transfers (out)
▫ Storage
▫ CPU cycles (depends on instance type; one
instance is free)
• Can purchase cycles at below average market
price
▫ Can provide access to vast amounts of computing
power at a price you can afford
• Research grants from Amazon
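The three billed items above add up in a simple way. The sketch below is only arithmetic: every rate in it is a placeholder chosen to make the sums easy to follow, not an actual Amazon price, and the function name is hypothetical. Look up current pricing before budgeting.

```python
def estimate_monthly_cost(
    instance_hours,
    hourly_rate,
    storage_gb,
    storage_rate_per_gb,
    transfer_out_gb,
    transfer_rate_per_gb,
):
    """Sum the three billed items: CPU time, storage, outbound data transfer."""
    compute = instance_hours * hourly_rate
    storage = storage_gb * storage_rate_per_gb
    transfer_out = transfer_out_gb * transfer_rate_per_gb
    return {
        "compute": compute,
        "storage": storage,
        "transfer_out": transfer_out,
        "total": compute + storage + transfer_out,
    }


# Placeholder usage and rates, for illustration only.
estimate = estimate_monthly_cost(
    instance_hours=100,
    hourly_rate=0.5,
    storage_gb=200,
    storage_rate_per_gb=0.25,
    transfer_out_gb=40,
    transfer_rate_per_gb=0.125,
)
print(estimate)
```

Because the compute term scales with hours actually used, shutting instances down when idle is usually the first cost lever to pull.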
Questions?

More Related Content

What's hot

Software Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software EngineeringSoftware Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software Engineering
Tao Xie
 
Collaborative Data Analysis with Taverna Workflows
Collaborative Data Analysis with Taverna WorkflowsCollaborative Data Analysis with Taverna Workflows
Collaborative Data Analysis with Taverna Workflows
Andrea Wiggins
 
Secondary data analysis with digital trace data
Secondary data analysis with digital trace dataSecondary data analysis with digital trace data
Secondary data analysis with digital trace data
Andrea Wiggins
 
Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...
Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...
Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...
Bertram Ludäscher
 
Findability through Traceability - A Realistic Application of Candidate Tr...
Findability through Traceability  - A Realistic Application of Candidate Tr...Findability through Traceability  - A Realistic Application of Candidate Tr...
Findability through Traceability - A Realistic Application of Candidate Tr...
Markus Borg
 
Research software susainability
Research software susainabilityResearch software susainability
Research software susainability
Daniel S. Katz
 
Using electronic laboratory notebooks in the academic life sciences: a group ...
Using electronic laboratory notebooks in the academic life sciences: a group ...Using electronic laboratory notebooks in the academic life sciences: a group ...
Using electronic laboratory notebooks in the academic life sciences: a group ...
SC CTSI at USC and CHLA
 
Large Scale Studies: Malware Needles in a Haystack
Large Scale Studies: Malware Needles in a HaystackLarge Scale Studies: Malware Needles in a Haystack
Large Scale Studies: Malware Needles in a Haystack
Marcus Botacin
 
Lec 1 integrating data science and data analytics in various research thrust
Lec 1 integrating data science and data analytics in various research thrustLec 1 integrating data science and data analytics in various research thrust
Lec 1 integrating data science and data analytics in various research thrust
Menchita Falcutila Dumlao
 
Sybrandt Thesis Proposal Presentation
Sybrandt Thesis Proposal PresentationSybrandt Thesis Proposal Presentation
Sybrandt Thesis Proposal Presentation
Justin Sybrandt, Ph.D.
 
LaGatta and de Garrigues - Splunk for Data Science - .conf2014
LaGatta and de Garrigues - Splunk for Data Science - .conf2014LaGatta and de Garrigues - Splunk for Data Science - .conf2014
LaGatta and de Garrigues - Splunk for Data Science - .conf2014
Tom LaGatta
 
ISEC'18 Tutorial: Research Methodology on Pursuing Impact-Driven Research
ISEC'18 Tutorial: Research Methodology on Pursuing Impact-Driven ResearchISEC'18 Tutorial: Research Methodology on Pursuing Impact-Driven Research
ISEC'18 Tutorial: Research Methodology on Pursuing Impact-Driven Research
Tao Xie
 
Slides chase 2019 connected health conference - thursday 26 september 2019 -...
Slides chase 2019  connected health conference - thursday 26 september 2019 -...Slides chase 2019  connected health conference - thursday 26 september 2019 -...
Slides chase 2019 connected health conference - thursday 26 september 2019 -...
Amélie Gyrard
 
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
Bonnie Hurwitz
 
CV_10/17
CV_10/17CV_10/17
User Expectations in Mobile App Security
User Expectations in Mobile App SecurityUser Expectations in Mobile App Security
User Expectations in Mobile App Security
Tao Xie
 
Reproducible Research with R, The Tidyverse, Notebooks, and Spark
Reproducible Research with R, The Tidyverse, Notebooks, and SparkReproducible Research with R, The Tidyverse, Notebooks, and Spark
Reproducible Research with R, The Tidyverse, Notebooks, and Spark
Adaryl "Bob" Wakefield, MBA
 
Knowledge Beacons
Knowledge BeaconsKnowledge Beacons
Knowledge Beacons
Benjamin Good
 
PE Trojan Detection Based on the Assessment of Static File Features
PE Trojan Detection Based on the Assessment of Static File FeaturesPE Trojan Detection Based on the Assessment of Static File Features
PE Trojan Detection Based on the Assessment of Static File Features
Antiy Labs
 
Mining Software Repositories
Mining Software RepositoriesMining Software Repositories
Mining Software Repositories
Israel Herraiz
 

What's hot (20)

Software Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software EngineeringSoftware Analytics: Data Analytics for Software Engineering
Software Analytics: Data Analytics for Software Engineering
 
Collaborative Data Analysis with Taverna Workflows
Collaborative Data Analysis with Taverna WorkflowsCollaborative Data Analysis with Taverna Workflows
Collaborative Data Analysis with Taverna Workflows
 
Secondary data analysis with digital trace data
Secondary data analysis with digital trace dataSecondary data analysis with digital trace data
Secondary data analysis with digital trace data
 
Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...
Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...
Introducing the Whole Tale Project: Merging Science and Cyberinfrastructure P...
 
Findability through Traceability - A Realistic Application of Candidate Tr...
Findability through Traceability  - A Realistic Application of Candidate Tr...Findability through Traceability  - A Realistic Application of Candidate Tr...
Findability through Traceability - A Realistic Application of Candidate Tr...
 
Research software susainability
Research software susainabilityResearch software susainability
Research software susainability
 
Using electronic laboratory notebooks in the academic life sciences: a group ...
Using electronic laboratory notebooks in the academic life sciences: a group ...Using electronic laboratory notebooks in the academic life sciences: a group ...
Using electronic laboratory notebooks in the academic life sciences: a group ...
 
Large Scale Studies: Malware Needles in a Haystack
Large Scale Studies: Malware Needles in a HaystackLarge Scale Studies: Malware Needles in a Haystack
Large Scale Studies: Malware Needles in a Haystack
 
Lec 1 integrating data science and data analytics in various research thrust
Lec 1 integrating data science and data analytics in various research thrustLec 1 integrating data science and data analytics in various research thrust
Lec 1 integrating data science and data analytics in various research thrust
 
Sybrandt Thesis Proposal Presentation
Sybrandt Thesis Proposal PresentationSybrandt Thesis Proposal Presentation
Sybrandt Thesis Proposal Presentation
 
LaGatta and de Garrigues - Splunk for Data Science - .conf2014
LaGatta and de Garrigues - Splunk for Data Science - .conf2014LaGatta and de Garrigues - Splunk for Data Science - .conf2014
LaGatta and de Garrigues - Splunk for Data Science - .conf2014
 
ISEC'18 Tutorial: Research Methodology on Pursuing Impact-Driven Research
ISEC'18 Tutorial: Research Methodology on Pursuing Impact-Driven ResearchISEC'18 Tutorial: Research Methodology on Pursuing Impact-Driven Research
ISEC'18 Tutorial: Research Methodology on Pursuing Impact-Driven Research
 
Slides chase 2019 connected health conference - thursday 26 september 2019 -...
Slides chase 2019  connected health conference - thursday 26 september 2019 -...Slides chase 2019  connected health conference - thursday 26 september 2019 -...
Slides chase 2019 connected health conference - thursday 26 september 2019 -...
 
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
iMicrobe and iVirus: Extending the iPlant cyberinfrastructure from plants to ...
 
CV_10/17
CV_10/17CV_10/17
CV_10/17
 
User Expectations in Mobile App Security
User Expectations in Mobile App SecurityUser Expectations in Mobile App Security
User Expectations in Mobile App Security
 
Reproducible Research with R, The Tidyverse, Notebooks, and Spark
Reproducible Research with R, The Tidyverse, Notebooks, and SparkReproducible Research with R, The Tidyverse, Notebooks, and Spark
Reproducible Research with R, The Tidyverse, Notebooks, and Spark
 
Knowledge Beacons
Knowledge BeaconsKnowledge Beacons
Knowledge Beacons
 
PE Trojan Detection Based on the Assessment of Static File Features
PE Trojan Detection Based on the Assessment of Static File FeaturesPE Trojan Detection Based on the Assessment of Static File Features
PE Trojan Detection Based on the Assessment of Static File Features
 
Mining Software Repositories
Mining Software RepositoriesMining Software Repositories
Mining Software Repositories
 

Viewers also liked

Introduction
IntroductionIntroduction
Introduction
Inonu12345
 
1. presentación recursos destinados al dmq
1. presentación recursos destinados al dmq1. presentación recursos destinados al dmq
1. presentación recursos destinados al dmq
Presidencia de la República del Ecuador
 
Caperucita Roja Convertida En Internet
Caperucita Roja Convertida En InternetCaperucita Roja Convertida En Internet
Caperucita Roja Convertida En Internet
miyerlandysalamanca
 
Historia del internet
Historia del internetHistoria del internet
Historia del internet
Kendry de Urieles
 
Monstruos en la noche
Monstruos en la nocheMonstruos en la noche
Monstruos en la noche
guest06918f8
 
Ola
OlaOla
Historia del internet
Historia del internetHistoria del internet
Historia del internet
Kendry de Urieles
 
Arantaboutamvsandanime.docx
Arantaboutamvsandanime.docxArantaboutamvsandanime.docx
Arantaboutamvsandanime.docx
JuanID78
 
Chapter4b
Chapter4bChapter4b
Chapter4b
mnikol
 
Chapter4a
Chapter4aChapter4a
Chapter4a
mnikol
 
Creating Fit Families
Creating Fit FamiliesCreating Fit Families
Creating Fit Families
LiftPotential
 
Rockfon Celings steel-finish-pics
Rockfon Celings steel-finish-picsRockfon Celings steel-finish-pics
Rockfon Celings steel-finish-pics
MSS Pvt Ltd
 
Online vs. offline - Studie zum Einkaufsverhalten
Online vs. offline - Studie zum EinkaufsverhaltenOnline vs. offline - Studie zum Einkaufsverhalten
Online vs. offline - Studie zum Einkaufsverhalten
die dialogagenten Agentur Beratung Service GmbH
 
Mapa conceptual de la sociedad de la información
Mapa conceptual de la sociedad de la informaciónMapa conceptual de la sociedad de la información
Mapa conceptual de la sociedad de la información
samiapaternina
 
Muscles of mastication & TMJ Dr.N.Mugunthan
Muscles of mastication & TMJ Dr.N.MugunthanMuscles of mastication & TMJ Dr.N.Mugunthan
Muscles of mastication & TMJ Dr.N.Mugunthan
mgmcri1234
 

Viewers also liked (16)

Introduction
IntroductionIntroduction
Introduction
 
1. presentación recursos destinados al dmq
1. presentación recursos destinados al dmq1. presentación recursos destinados al dmq
1. presentación recursos destinados al dmq
 
Caperucita Roja Convertida En Internet
Caperucita Roja Convertida En InternetCaperucita Roja Convertida En Internet
Caperucita Roja Convertida En Internet
 
Kenth's UNT Trasncript.PDF
Kenth's UNT Trasncript.PDFKenth's UNT Trasncript.PDF
Kenth's UNT Trasncript.PDF
 
Historia del internet
Historia del internetHistoria del internet
Historia del internet
 
Monstruos en la noche
Monstruos en la nocheMonstruos en la noche
Monstruos en la noche
 
Ola
OlaOla
Ola
 
Historia del internet
Historia del internetHistoria del internet
Historia del internet
 
Arantaboutamvsandanime.docx
Arantaboutamvsandanime.docxArantaboutamvsandanime.docx
Arantaboutamvsandanime.docx
 
Chapter4b
Chapter4bChapter4b
Chapter4b
 
Chapter4a
Chapter4aChapter4a
Chapter4a
 
Creating Fit Families
Creating Fit FamiliesCreating Fit Families
Creating Fit Families
 
Rockfon Celings steel-finish-pics
Rockfon Celings steel-finish-picsRockfon Celings steel-finish-pics
Rockfon Celings steel-finish-pics
 
Online vs. offline - Studie zum Einkaufsverhalten
Online vs. offline - Studie zum EinkaufsverhaltenOnline vs. offline - Studie zum Einkaufsverhalten
Online vs. offline - Studie zum Einkaufsverhalten
 
Mapa conceptual de la sociedad de la información
Mapa conceptual de la sociedad de la informaciónMapa conceptual de la sociedad de la información
Mapa conceptual de la sociedad de la información
 
Muscles of mastication & TMJ Dr.N.Mugunthan
Muscles of mastication & TMJ Dr.N.MugunthanMuscles of mastication & TMJ Dr.N.Mugunthan
Muscles of mastication & TMJ Dr.N.Mugunthan
 

Similar to Databases, Web Services and Tools For Systems Immunology

H2O World - Intro to Data Science with Erin Ledell
H2O World - Intro to Data Science with Erin LedellH2O World - Intro to Data Science with Erin Ledell
H2O World - Intro to Data Science with Erin Ledell
Sri Ambati
 
From SQL to Python - A Beginner's Guide to Making the Switch
From SQL to Python - A Beginner's Guide to Making the SwitchFrom SQL to Python - A Beginner's Guide to Making the Switch
From SQL to Python - A Beginner's Guide to Making the Switch
Rachel Berryman
 
Data Science.pptx NEW COURICUUMN IN DATA
Data Science.pptx NEW COURICUUMN IN DATAData Science.pptx NEW COURICUUMN IN DATA
Data Science.pptx NEW COURICUUMN IN DATA
javed75
 
Docker in Open Science Data Analysis Challenges by Bruce Hoff
Docker in Open Science Data Analysis Challenges by Bruce HoffDocker in Open Science Data Analysis Challenges by Bruce Hoff
Docker in Open Science Data Analysis Challenges by Bruce Hoff
Docker, Inc.
 
No Free Lunch: Metadata in the life sciences
No Free Lunch:  Metadata in the life sciencesNo Free Lunch:  Metadata in the life sciences
No Free Lunch: Metadata in the life sciences
Chris Dwan
 
Research Data (and Software) Management at Imperial: (Everything you need to ...
Research Data (and Software) Management at Imperial: (Everything you need to ...Research Data (and Software) Management at Imperial: (Everything you need to ...
Research Data (and Software) Management at Imperial: (Everything you need to ...
Sarah Anna Stewart
 
2014 nicta-reproducibility
2014 nicta-reproducibility2014 nicta-reproducibility
2014 nicta-reproducibility
c.titus.brown
 
Building genomic data cyberinfrastructure with the online database software T...
Building genomic data cyberinfrastructure with the online database software T...Building genomic data cyberinfrastructure with the online database software T...
Building genomic data cyberinfrastructure with the online database software T...
mestato
 
Royal society of chemistry activities to develop a data repository for chemis...
Royal society of chemistry activities to develop a data repository for chemis...Royal society of chemistry activities to develop a data repository for chemis...
Royal society of chemistry activities to develop a data repository for chemis...
Ken Karapetyan
 
Royal society of chemistry activities to develop a data repository for chemis...
Royal society of chemistry activities to develop a data repository for chemis...Royal society of chemistry activities to develop a data repository for chemis...
Royal society of chemistry activities to develop a data repository for chemis...
US Environmental Protection Agency (EPA), Center for Computational Toxicology and Exposure
 
Introduction to Bioinformatics
Introduction to BioinformaticsIntroduction to Bioinformatics
Introduction to Bioinformatics
Leighton Pritchard
 
Reproducible research: theory
Reproducible research: theoryReproducible research: theory
Reproducible research: theory
C. Tobin Magle
 
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ PanoraysQuick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Quick dive into the big data pool without drowning - Demi Ben-Ari @ Panorays
Demi Ben-Ari
 
Elsevier‘s RDM Program: Habits of Effective Data and the Bourne Ulitmatum
Elsevier‘s RDM Program: Habits of Effective Data and the Bourne UlitmatumElsevier‘s RDM Program: Habits of Effective Data and the Bourne Ulitmatum
Elsevier‘s RDM Program: Habits of Effective Data and the Bourne Ulitmatum
Anita de Waard
 
Cshl minseqe 2013_ouellette
Cshl minseqe 2013_ouelletteCshl minseqe 2013_ouellette
Cshl minseqe 2013_ouellette
Functional Genomics Data Society
 
H2O with Erin LeDell at Portland R User Group
H2O with Erin LeDell at Portland R User GroupH2O with Erin LeDell at Portland R User Group
H2O with Erin LeDell at Portland R User Group
Sri Ambati
 
2014 manchester-reproducibility
2014 manchester-reproducibility2014 manchester-reproducibility
2014 manchester-reproducibility
c.titus.brown
 
Software Analytics - Achievements and Challenges
Software Analytics - Achievements and ChallengesSoftware Analytics - Achievements and Challenges
Software Analytics - Achievements and Challenges
Tao Xie
 
Continuum Analytics and Python
Continuum Analytics and PythonContinuum Analytics and Python
Continuum Analytics and Python
Travis Oliphant
 
2015 genome-center
2015 genome-center2015 genome-center
2015 genome-center
c.titus.brown
 

Similar to Databases, Web Services and Tools For Systems Immunology (20)

H2O World - Intro to Data Science with Erin Ledell
H2O World - Intro to Data Science with Erin LedellH2O World - Intro to Data Science with Erin Ledell
H2O World - Intro to Data Science with Erin Ledell
 
From SQL to Python - A Beginner's Guide to Making the Switch
From SQL to Python - A Beginner's Guide to Making the SwitchFrom SQL to Python - A Beginner's Guide to Making the Switch
From SQL to Python - A Beginner's Guide to Making the Switch
 
Data Science.pptx NEW COURICUUMN IN DATA
Data Science.pptx NEW COURICUUMN IN DATAData Science.pptx NEW COURICUUMN IN DATA
Data Science.pptx NEW COURICUUMN IN DATA
 
Docker in Open Science Data Analysis Challenges by Bruce Hoff
Docker in Open Science Data Analysis Challenges by Bruce HoffDocker in Open Science Data Analysis Challenges by Bruce Hoff
Docker in Open Science Data Analysis Challenges by Bruce Hoff
 
No Free Lunch: Metadata in the life sciences
No Free Lunch:  Metadata in the life sciencesNo Free Lunch:  Metadata in the life sciences
No Free Lunch: Metadata in the life sciences
 
Research Data (and Software) Management at Imperial: (Everything you need to ...
Research Data (and Software) Management at Imperial: (Everything you need to ...Research Data (and Software) Management at Imperial: (Everything you need to ...
Databases, Web Services and Tools For Systems Immunology

  • 1. Yannick Pouliot, PhD Biocomputational scientist Butte Laboratory 04/04/2012 Databases, Web Services and Tools For Systems Immunology
  • 2. What You Need For Systems Immunology 1. A real hypothesis ▫ No fuzzy brain stuff 2. An understanding of statistics and data mining ▫ …one never understands enough statistics… 3. A lot of data, typically from different “levels” of reality: organismal, molecular/static, molecular/functional, etc ▫ … and therefore, databases of some sort 4. Software tools and programming expertise 5. Computing power
  • 4. Developing a Hypothesis Suitable for Data Mining • Possibly the hardest step ▫ Must have a measurable metric that can be tested statistically • A real hypothesis (H1) looks like this: ▫ H1: Drugs with increased frequency of adverse drug reactions can be identified from patterns of reactivity in PubChem Bioassays screens. • Actually, statistical tests try to reject the null (H0) hypothesis, which looks like this: ▫ H0: Bioactivity patterns in PubChem Bioassays do not distinguish drugs with increased frequency of adverse drug reactions.
  • 5. 2: Statistics & Data Mining
  • 6. Understanding Statistics: Essential • Not easy; counter-intuitive • Critical, because with large volumes of data comes the guarantee that you will always find “something” ▫ … except that it will most likely be purely artifactual Q: Ever heard of multiple testing correction? If not, read Bill Noble’s excellent description: Noble, W. S. How does multiple testing correction work? Nat Biotech 27, 1135-1137 (2009).
  • 7. Learning About Statistics Introductory • Norman, Streiner (2008): Biostatistics, the Bare Essentials; Hamilton. More advanced • Vittinghoff, Eric. (2005): Regression methods in biostatistics; Springer. • Gentleman et al., (2005): Bioinformatics and Computational Biology Solutions Using R and Bioconductor; Springer. Advanced Doncaster & Davey (2007): Analysis of Variance and Covariance: How to Choose and Construct Models for the Life Sciences; Cambridge University Press.
  • 8. Understanding Data Mining • Data mining uses statistical techniques + other techniques that are uniquely “computational” • Key to Systems Immunology • Resources: ▫ Excellent introduction, Weka-specific: Witten & Frank (2005): Data Mining: Practical Machine Learning Tools and Techniques, Second Edition; Morgan Kaufmann ▫ Nisbet et al., (2009): Handbook of Statistical Analysis and Data Mining Applications; Wiley • Tools: coming up
  • 10. Huge Numbers of Databases • Many need to be licensed ($) ▫ Ingenuity Pathway Analysis (IPA): excellent but pricey ▫ MetaCore: competitor to IPA; available from Lane Library • Many more freely available ▫ DAVID: similar to IPA and MetaCore ▫ Typically dirtier than commercial products, but sometimes much more comprehensive ▫ Consult Nucleic Acids Research’s yearly database issue
  • 11. The Bad News • To be useful in Systems Medicine, databases need to offer at least one of the following: ▫ Be downloadable (FTP) in text or other form ▫ Be accessible programmatically over the Internet (e.g., Web service) • Clicking on Web interfaces just doesn’t cut it… • This means knowing about databases and having programming skills (more later)
  • 12. A Small Sample of DBs Crucial to Systems Immunology • NCBI: Entrez, GEO, PubMed, Gene, Genome, RefSeq, dbSNP • EBI: ArrayExpress, Gene Expression Atlas, Ensembl • Mouse Genome Database • DrugBank • BioGPS • HapMap • STITCH: interactions between compounds and proteins • UMLS (Unified Medical Language System)
  • 13. Unified Medical Language System • Developed by National Library of Medicine • Data files and software that bring together multiple biomedical vocabularies and ontologies to enable semantic interoperability ▫ Repository of terms, definitions and concepts in biomedicine, complete with cross-referencing and ontological relationships • Essential but complex and large • Requires free license
  • 15. The ImmPort Database: The Only DB of Its Kind in Immunology • http://immport.niaid.nih.gov/ • Stores results from huge range of assays ▫ HAI ▫ Flow cytometry phenotyping ▫ Genotyping ▫ Sequencing ▫ Gene expression ▫ etc. • Intended to be the primary repository for all NIAID “center” grants • Can access pre-publication data if given access by PI • Caveat: volume of PUBLIC datasets is currently limited
  • 16. Stanford’s Human Immune Monitoring Center (HIMC) Database • Stanford Data Miner is HIMC’s data mining database • Stores many of the assays run by HIMC • Ask HIMC for access to data from researchers who use HIMC (will require their consent)
  • 17. Next Level Up: Relational Databases Take Your Pick
  • 18. Why Relational Databases? • Much faster access to data • Data are safe • Completely robust query answers • Good scaling • Highly integrative ▫ Cross-database querying: essential!
  • 19. Recommendation: MySQL • Nothing magical about MySQL ▫ Widest usage in bioinformatics ▫ Free (community edition) ▫ Runs on everything (Linux, Win, Mac) ▫ Easiest relational DB (short of MS Access) • Resources ▫ Moes (2005): Beginning MySQL; Wiley ▫ DuBois (2007): MySQL Cookbook; O’Reilly ▫ Dyer (2008): MYSQL in a Nutshell; O’Reilly
  • 20. 4: Software tools and programming expertise
  • 21. The Good News • Free software! • Free algorithms! • Pre-coded algorithms (i.e., packages)! • Very cheap computing power!
  • 22. The Bad News • Dunno how to use • “Not talented” • “Not enough time” • Can’t be bothered ▫ e.g., reading the paper describing the software tool one is relying on
  • 23. More Good News • Not that hard • Lots and lots of good resources • Read a book, dammit • Find a buddy • Use Cloud instances (preconfigured machines) ▫ Can even be free!
  • 24. “Gateway Drugs” to Programming: Workflow Systems • GenePattern ▫ Predominantly oriented toward gene expression analysis ▫ Public server available • Galaxy ▫ Predominantly oriented toward sequence (NGS) analysis ▫ Public server available • Weka ▫ Easiest way to learn data mining
  • 25. But Seriously: Why Programming? • Address small problems that can nail you • Address bigger problems by standing on the shoulders of giants • Flexibility: If you’re doing “real” science, off-the-shelf software will fail you every time ▫ 80% rule…
  • 26. Example: Don’t Try This With Excel • Millions of sequence reads compared against mouse transcriptome • Determining number of distinct species and frequency of members in each • Summarize using plots for each codon
  • 28.
  • 30. Languages/Systems You Can’t Do Without • SQL ▫ To talk to MySQL • Perl or Python ▫ To glue things together • R (“R Project”) ▫ To perform heavy-duty statistical analysis • Weka ▫ To apply machine learning algorithms
  • 31. The Inside Scoop of Making Programming Work for You • Diagram or write down your process ▫ Don’t just sit down and write code • Write comments ▫ “I’m doing this because of this special case over here” • Use meaningful variable names ▫ $c = not good • Use development tools ▫ RStudio (for writing R code) ▫ Eclipse (for writing in almost any language) ▫ HeidiSQL (SQL browser) • Example of a written-down process: 1. Download subject <--> group mapping table 2. Download drug treatment data for each subject ID; create two sets of subjects: ImmunoTreatedSubjects, NonImmunoTreatedSubjects 3. Download gene type data: ImmunoGenes, NonImmunoGenes 4. Calculate variance of each gene set for each subject 5. Create data frame to store (4) --> varGeneSetForEachSubject 6. Compute t test to determine whether mean is significantly different; first test: generate statistic for each individual subject --> compare variance of ImmunoGenes vs. variance of NonImmunoGenes for each subject
  • 32. Programming Without Programming: NCBI’s Ebot • Uses NCBI e-utilities (“Web services”) ▫ Programmatic access to NCBI databases, including PubMed ▫ VERY useful for data mining • Ebot codes the particular kind of service you want to use • Still, it only gets you so far, but at least the heaviest lifting has been done (and it is heavy…)
  • 34. R: Why It Hurts So Good • The “R Project” (aka R) is the premier Open Source statistical and data mining language and suite of libraries. • Pros ▫ Free, runs on everything ▫ Very flexible statistical computing ▫ Dominant standard in biocomputing ▫ Big user community at Stanford ▫ Many key libraries written at Stanford • Cons ▫ Non-trivial learning curve ▫ Documentation is of variable quality
  • 35. Key R Resources Three essentials: • RStudio ▫ Integrated development environment: don’t code R w/o it! • Crawley (2007): The R Book; Wiley • Matloff (2011): The Art of R Programming; No Starch Press • Teetor (2011): The R Cookbook; O’Reilly • Wickham (2009): ggplot2; Springer
  • 37. WEKA Data Mining Suite • Machine learning/data mining software written in Java (distributed under the GNU General Public License) • Main features: ▫ Comprehensive set of data pre-processing tools, learning algorithms and evaluation methods ▫ Graphical user interfaces (incl. data visualization) ▫ Environment for comparing learning algorithms • Heavily referenced in “Data Mining” (Witten & Frank)
  • 40. Perl, Python • Either is a great language for bioinformatics • Run on anything • Use it to quickly glue systems together, e.g., ▫ Integrate MySQL and R together ▫ Run Web services queries • Python has more growth potential ▫ Preferable over Perl
  • 42. Why The Cloud Matters For Biologists • You are purchasing computing power, not machines ▫ Never outdated • You can purchase as much/little as you need ▫ You don’t have to run/manage what you don’t use • Can easily migrate from one machine type to another (minutes) • Can add storage in seconds • Accessible from anywhere • Easy to share, e.g., (large) datasets with others
  • 43. Why Own When You Can Rent? Welcome To the Cloud…
  • 44. For biomedical computing, Amazon Cloud is ideal because it provides highly flexible storage and compute power sold on a pay-per-use basis
  • 45. Another Example: PathSeq • Compare millions of short-read sequences against all genomic + transcriptomic sequences for all microbes (!) Amazon Cloud “Management Console”
  • 46. Q: What does working with a Cloud machine feel like? A: It’s not materially different from accessing a machine on our cluster, except you can do anything you want
  • 47. Main Services Provided by Amazon Cloud • Storage ▫ Traditional disk volumes ▫ S3 buckets (“Simple Storage Service”) • Computing (EC2 – “Elastic Compute Cloud”) ▫ Single machine instances ▫ Clusters of various types • Machine types ▫ Compute servers ▫ Database servers ▫ Cluster ▫ Specialized architectures ▫ Variety of operating systems (Linux flavors, Windows)
  • 48. Costs • You pay for (almost) everything you do ▫ Data transfers (out) ▫ Storage ▫ CPU cycles (depends on instance type; one instance is free) • Can purchase cycles at below average market price ▫ Can provide access to vast amounts of computing power at a price you can afford • Research grants from Amazon
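Steps 4 to 6 of the written-down process on slide 31 can be sketched in Python, one of the glue languages the deck recommends. This is only an illustration under stated assumptions: the expression values are made up, the names `variance_per_gene_set` and `paired_t_statistic` are hypothetical, and reading step 6 as a paired t statistic across subjects is one plausible interpretation of the slide, not necessarily the author's method. It uses only the standard library, so it stops at the t statistic; a p-value would need a statistics package such as R or SciPy.

```python
import math
import statistics

# Hypothetical expression values per subject, already split into the two
# gene sets from step 3 (all numbers are made up for illustration).
expression = {
    "subject_1": {"ImmunoGenes": [2.0, 4.0, 6.0], "NonImmunoGenes": [3.0, 3.5, 4.0]},
    "subject_2": {"ImmunoGenes": [1.0, 5.0, 9.0], "NonImmunoGenes": [2.0, 3.0, 4.0]},
    "subject_3": {"ImmunoGenes": [3.0, 4.0, 5.0], "NonImmunoGenes": [4.0, 4.5, 5.0]},
}


def variance_per_gene_set(data):
    """Steps 4 and 5: sample variance of each gene set for each subject."""
    return {
        subject: {name: statistics.variance(values) for name, values in gene_sets.items()}
        for subject, gene_sets in data.items()
    }


def paired_t_statistic(first, second):
    """Step 6 (assumed reading): paired t statistic and degrees of freedom."""
    differences = [a - b for a, b in zip(first, second)]
    n = len(differences)
    standard_error = statistics.stdev(differences) / math.sqrt(n)
    return statistics.mean(differences) / standard_error, n - 1


var_gene_set_for_each_subject = variance_per_gene_set(expression)
subjects = sorted(var_gene_set_for_each_subject)
immuno = [var_gene_set_for_each_subject[s]["ImmunoGenes"] for s in subjects]
non_immuno = [var_gene_set_for_each_subject[s]["NonImmunoGenes"] for s in subjects]

t_statistic, degrees_of_freedom = paired_t_statistic(immuno, non_immuno)
print(f"t = {t_statistic:.3f} with {degrees_of_freedom} degrees of freedom")
```

Keeping the per-subject variances in one dictionary mirrors the `varGeneSetForEachSubject` data frame named in step 5, so the test in step 6 reads directly from the stored result.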

Editor's Notes

  1. Multiple testing correction adjusts for random events that falsely appear significant when many hypotheses are tested at once
  2. PPT document created using combination of Perl, MySQL and R
  3. Blue=services I’ve used
  4. Mention cost calculator: http://calculator.s3.amazonaws.com/calc5.html
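The first note, on multiple testing, can be made concrete with a small sketch. The slides do not name a specific correction; the Benjamini-Hochberg false discovery rate procedure is used here purely as one common choice, and the function name `benjamini_hochberg` and the p-values are hypothetical.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest p.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, so each adjusted value is never
    # larger than the one ranked above it (monotonicity step).
    for rank in range(m, 0, -1):
        index = order[rank - 1]
        running_min = min(running_min, p_values[index] * m / rank)
        adjusted[index] = running_min
    return adjusted


# Made-up p-values from eight hypothetical tests.
p_values = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205]
adjusted = benjamini_hochberg(p_values)

naive_hits = [p for p in p_values if p < 0.05]
corrected_hits = [p for p, q in zip(p_values, adjusted) if q <= 0.05]
print(len(naive_hits), "raw hits;", len(corrected_hits), "after correction")
```

With these made-up inputs, five of the eight tests pass an uncorrected 0.05 threshold, but only two remain once the correction is applied, which is the kind of artifactual "something" the statistics slide warns about.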