Introduction to
 Bioinformatics
   Stephen Turner, Ph.D.
Bioinformatics Core Director
bioinformatics@virginia.edu

       Slides at bit.ly/intro-bioinfo
Contact
Web:    bioinformatics.virginia.edu
E-mail: bioinformatics@virginia.edu
Blog:   GettingGeneticsDone.com
Twitter: @genetics_blog
Bioinformatics Origins:

Rooted in sequence analysis.

Driven by the need to:
● Collect
● Annotate
● Analyze
Margaret Dayhoff (1925-1983)
● Collected all known protein
  structures & sequences
● Published Atlas in 1965
● Pioneered algorithm development
  for:
     ○ Comparing protein sequences
     ○ Deriving evolutionary history from
       alignments
“In this paper we shall describe a completed
computer program for the IBM 7090, which to
our knowledge is the first successful attempt
at aiding the analysis of the amino acid chain
structure of protein.”
IBM 7090
“There is a tremendous amount of information
regarding evolutionary history and biochemical
   function implicit in each sequence and the
     number of known sequences is growing
  explosively. We feel it is important to collect
  this significant information, correlate it into a
          unified whole and interpret it.”

            M. Dayhoff, February 27, 1967
modified from @drewconway
[Figure: timeline, 1960 to 2010. Computing milestones: ARPAnet invented, Internet invented, WWW invented. Bioinformatics milestones: Dayhoff's Atlas, Sanger sequencing, GenBank, EBI-EMBL, Next-Gen sequencing.]
Definition
From Wikipedia: Bioinformatics is a branch of biological science which deals with the study of methods for storing,
retrieving and analyzing biological data, such as nucleic acid (DNA/RNA) and protein sequence, structure, function,
pathways and genetic interactions. It generates new knowledge that is useful in such fields as drug design and
development of new software tools to create that knowledge. Bioinformatics also deals with algorithms, databases and
information systems, web technologies, artificial intelligence and soft computing, information and computation theory,
structural biology, software engineering, data mining, image processing, modeling and simulation, discrete
mathematics, control and system theory, circuit theory, and statistics.



Our definition: using computer science and
statistics to answer biological questions.
Subdisciplines
●   Sequence alignment (DNA, RNA, Protein)
●   Genome annotation
●   Evolutionary biology / comparative genomics
●   Analysis of gene expression
●   Analysis of gene regulation
●   Genotype-phenotype association
●   Mutation analysis
●   Structural biology
●   Biomarker identification
●   Pathway analysis / "systems biology"
●   Literature analysis / text-mining
Central Dogma
[Figure: DNA → RNA → Protein, annotated with reverse transcription, RNA silencing, prions, post-translational modification, and methylation.]
● DNA provides assembly instructions for proteins
● Protein folding determines molecular function
● Networks of interacting proteins determine tissue/organ function

Analysis types overlaid on the figure:
● DNA variant analysis, gene expression analysis, genome annotation, epigenetics
● miRNA analysis, quantitative MS, proteomics
● Pathway analysis, systems biology, biomarker identification
Sequence alignment, example 1
Outbreak: fever, characteristic skin lesions.




Culture, isolate DNA, sequence (Sanger):
        GTGAGTAATAATAATTCAAAACTGGAATTTGTACCTAATATACAGCTTAAAGAAGACTTAGGAGCTTTTAGCTATAAAGTCCAACTTTCT
        CCTGTAGAAAAAGGTATGGCTCATATCCTTGGTAACTCTATTAGAAGGGTTTTATTATCTTCACTATCAGGTGCATCTATAATTAAAGTA
        AACATCGCTAATGTACTACATGAGTATTCTACTTTAGAAGATGTAAAAGAAGATGTTGTTGAAATTGTTTCTAATTTGAAAAAGGTTGCG
        ATAAAGCTTGATACAGGTATAGATAGACTAGATTTAGAACTATCTGTAAATAAATCAGGTGTAGTTAGCGCTGGAGATTTTAAGACGACT
        CAAGGTGTAGAAATAATAAATAAAGATCAGCCAATAGCTACTTTGACAAACCAAAGAGCATTTAGCTTAACTGCTACAGTGAGTGTAGGT
        AGAAATGTCGGAATACTTTCTGCGATACCAACCGAGCTTGAGAGAGTTGGTGATATAGCTGTAGATGCTGATTTTAATCCTATTAAAAGA
        GTTGCTTTTGAGGTTTTTGATAATGGTGATAGTGAAACTTTAGAAGTATTTGTAAAGACAAATGGTACTATAGAACCACTAGCAGCTGTT
        ACGAAAGCTTTAGAGTATTTCTGTGAGCAAATATCAGTATTTGTATCTCTAAGAGTACCTAGTAATGGTAAAACAGGTGATGTATTAATA
        GATTCTAATATTGATCCTATCCTTCTTAAGCCGATTGATGATTTAGAGCTAACTGTCAGATCATCTAACTGTCTGCGTGCAGAAAACATT
        AAGTATCTTGGTGATTTGGTACAGTATTCTGAATCACAGCTTATGAAGATACCTAACTTAGGTAAGAAATCTCTCAATGAGATCAAACAA
        ATTTTAATAGATAATAACTTGTCTCTAGGTGTCCAAATTGACAATTTTAGAGAGCTAGTTGAAGGAAAATAA
Sequence alignment, example 1
●   BLAST (Basic Local Alignment Search Tool)
●   Go to blast.ncbi.nlm.nih.gov
●   Click "Nucleotide BLAST" (blastn)
●   Under "Choose Search Set", click the
    "Others" button, then search the entire nr/nt
    collection (you don't know what the organism is)
●   Paste in the sequence from the previous slide
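BLAST finds local alignments with a fast seeded heuristic. As a rough illustration of what "local alignment" means, here is a minimal Smith-Waterman sketch: the exact dynamic-programming method that BLAST approximates, not BLAST itself. The function name, the scoring values (match +2, mismatch -1, gap -2), and the "database" sequence are all assumptions chosen for illustration.

```python
def smith_waterman_score(a, b, match=2, mismatch=-1, gap=-2):
    """Return the best local alignment score between sequences a and b."""
    prev = [0] * (len(b) + 1)  # previous row of the scoring matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # A local alignment may restart anywhere, so scores never drop below 0.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


query = "GTGAGTAATAATAATTC"          # first 17 bases of the outbreak sequence
subject = "CCCGTGAGTAATAATAATTCGGG"  # hypothetical database entry containing the query
score = smith_waterman_score(query, subject)
print(score)
```

Because the hypothetical subject contains the query exactly, every query base scores a match. Real searches against nr/nt need BLAST's heuristics because this exact method is far too slow at that scale.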
Sequence alignment, example 2
● Illumina HiSeq 2500:
  ○ 600,000,000,000 bases sequenced in single run.
  ○ 6,000,000,000 x 100-bp (short) reads
● BLAST way too slow.
● BWA: Burrows-Wheeler Aligner (fast)
● Bowtie: fast, memory-efficient (aligns
  25,000,000 35-bp reads per hour per CPU).
● Many others... MAQ, Eland, RMAP, SOAP,
  SHRiMP, BFAST, Mosaik, Novoalign, BLAT,
  GMAP, GSNAP, MOM, QPalma, SeqMap,
  VelociMapper, Stampy, mrFAST, etc.
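The numbers on this slide can be combined into a back-of-the-envelope estimate. This sketch assumes, for simplicity, that Bowtie's quoted rate for 35-bp reads also applies to 100-bp reads, which it would not exactly, so treat the result as an order-of-magnitude figure only.

```python
reads_per_run = 6_000_000_000           # HiSeq 2500 reads per run (from the slide)
read_length_bp = 100
bowtie_reads_per_cpu_hour = 25_000_000  # quoted for 35-bp reads

total_bases = reads_per_run * read_length_bp
cpu_hours = reads_per_run / bowtie_reads_per_cpu_hour

print(total_bases)  # 600000000000
print(cpu_hours)    # 240.0
```

Even with a fast short-read aligner, one run is on the order of hundreds of CPU-hours, which is why alignment is normally spread across many cores.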
Comparative Genomics example
● Go to genome.ucsc.edu
● Search for POLR2A
● Turn on some conservation tracks
Sequence similarity
Evolutionary distance
Genetic Epidemiology
Epidemiology: the study of the patterns,
causes, and effects of health and disease
conditions in defined populations.

Genetic epidemiology: the study of genetic
factors in determining health and disease in
families and populations.
Genetic epidemiology
● Linkage: finding genetic loci that segregate
  with the disease in families.
● Association: finding alleles that co-occur with
  disease in populations.
   ○ Common disease - common variant hypothesis:
     ■ Common variants (e.g. >1-5% frequency in the population)
       contribute to common, complex disease.
   ○ Common disease - rare variant hypothesis:
     ■ Polymorphisms that cause disease are under
       purifying selection, and will thus be rare.
   ○ Really, it's a mix of both
Candidate gene study
  ● Select candidate genes based on:
         ○     Known biology
         ○     Previous linkage/association evidence
         ○     Pathways
         ○     Evidence from model organisms
  ● Genotype variants (SNPs) in those genes
  ● Statistical association




[Figure: individuals grouped by genotype at position rs12345: A/A, A/T, T/T]
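The "statistical association" step can be sketched as a test of whether genotype frequencies differ between cases and controls. The slide does not name a test; a Pearson chi-square test on a 2 x 3 genotype table is one common choice, and the counts below are hypothetical.

```python
import math


def chi_square_2x3(cases, controls):
    """Pearson chi-square test of independence for a 2 x 3 genotype table."""
    col_totals = [c + k for c, k in zip(cases, controls)]
    n_cases, n_controls = sum(cases), sum(controls)
    n = n_cases + n_controls
    chi2 = 0.0
    for row, row_total in ((cases, n_cases), (controls, n_controls)):
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / n
            chi2 += (observed - expected) ** 2 / expected
    # With 2 degrees of freedom, the chi-square survival function is exp(-x/2).
    p_value = math.exp(-chi2 / 2)
    return chi2, p_value


# Hypothetical genotype counts at one SNP, ordered A/A, A/T, T/T.
cases = [30, 50, 20]
controls = [50, 40, 10]
chi2, p = chi_square_2x3(cases, controls)
print(round(chi2, 3), round(p, 4))
```

A small p-value suggests that genotype and disease status are not independent in this sample; it says nothing by itself about causation.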
Genome-wide association study
●   Genotype >500,000 SNPs
●   Statistical test at each one
●   Manhattan plot of results
●   GWAS does not inform:
    ○ Which gene affected
    ○ How gene function perturbed
    ○ How biological function altered
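Running a test at every SNP creates a multiple-testing problem. As a sketch, assuming a Bonferroni correction (one common convention; the slide does not specify one) and hypothetical SNP IDs and p-values, the significance threshold and the height at which it appears on a Manhattan plot, which shows -log10(p) by genomic position, work out as follows.

```python
import math

n_tests = 500_000  # SNPs genotyped (from the slide)
alpha = 0.05       # assumed family-wise error rate

threshold = alpha / n_tests            # Bonferroni-corrected p-value cutoff
threshold_height = -math.log10(threshold)  # y-axis position on a Manhattan plot

# Hypothetical per-SNP p-values.
pvalues = {"rs0001": 3e-9, "rs0002": 0.02, "rs0003": 4e-7}
hits = [snp for snp, p in pvalues.items() if p < threshold]

print(threshold, threshold_height, hits)
```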
Gene expression pre-2008
      PCR             Microarrays
Exercise (Thursday)
●   Download R: r-project.org
●   Download RStudio: rstudio.com
●   Get data: http://people.virginia.edu/~sdt5z/GSE4107_RAW.zip
●   Run code to download BioC packages:
    ○ source("http://bioconductor.org/biocLite.R")
    ○ biocLite()
    ○ biocLite(c("affy", "AnnotationDbi", "hgu133plus2cdf",
      "hgu133plus2.db", "genefilter", "DBI", "annotate",
      "arrayQualityMetrics", "limma", "GOstats",
      "Category", "GO.db", "KEGG.db"))
RNA sequencing (RNA-seq)
● Samples of interest
  ○ Condition 1 (normal colon)
  ○ Condition 2 (colon tumor)
● Isolate RNAs
● Generate cDNA, fragment, size select, add linkers
● Sequence ends
  ○ 100s of millions of paired reads
  ○ 10s of billions of bases of sequence
● Align to genome
● Downstream analysis
Image: www.bioinformatics.ca
RNA-seq advantages
●   No reference necessary
●   Low background (no cross-hybridization)
●   Unlimited dynamic range (FC 9000 Science 320:1344)
●   Direct counting (microarrays: indirect – hybridization)
●   Can characterize full transcriptome
    ○ mRNA and ncRNA (miRNA, lncRNA, snoRNA, etc)
    ○ Differential gene expression
    ○ Differential coding output
    ○ Differential TSS usage
    ○ Differential isoform expression
Isoform level data
Differential splicing & TSS use
RNA-seq challenges
● Library construction
  ○ Size selection (messenger, small)
  ○ Strand specificity?
● Bioinformatic challenges
  ○ Spliced alignment
  ○ Transcript deconvolution
● Statistical Challenges
  ○ Highly variable abundance
  ○ Sample size: never, ever, plan n=1
● Normalization (RPKM)
  ○ Compare features of different lengths
  ○ Compare conditions with different
    sequence depth
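RPKM (reads per kilobase of feature per million mapped reads) addresses both comparisons in the normalization bullet. A minimal sketch, with hypothetical read counts and gene lengths:

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads."""
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


# Hypothetical numbers: a 2 kb gene and a 4 kb gene in a 20M-read library.
short_gene = rpkm(1000, 2000, 20_000_000)
long_gene = rpkm(2000, 4000, 20_000_000)

# The same 2 kb gene in a library sequenced twice as deeply.
deeper_library = rpkm(2000, 2000, 40_000_000)

print(short_gene, long_gene, deeper_library)
```

All three values come out equal: doubling either the feature length or the sequencing depth doubles the raw count, and RPKM divides that back out.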
Common question #1: Depth
● Question: how much sequence do I need?
● Answer: it’s complicated.
● Depends on:
   ○ Size & complexity of transcriptome
   ○ Application: differential gene expression, transcript
     discovery, aberrant splicing, etc.
   ○ Tissue type, RNA quality, library preparation
   ○ Sequencing type: length, single-/paired-end, etc.
● Find publication in your field w/ similar goals.
● Good news: 1 GA or ½ HiSeq lane is
  sufficient for most applications
Common question #2: Sample Size
● Question: How many samples should I
  sequence?
● Oversimplified Answer: At least 3 biological
  replicates per condition.
● Depends on:
   ○   Sequencing depth
   ○   Application
   ○   Goals (prioritization, biomarker discovery, etc.)
   ○   Effect size, desired power, statistical significance
● Find a publication with similar goals
Common question #3: Workflow
● How do I analyze the data?
● No standards!
   ○ Unspliced aligners: BWA, Bowtie, Stampy, SHRiMP
   ○ Spliced aligners: TopHat, MapSplice, SpliceMap, GSNAP, QPALMA
   ○ Reference builds & annotations: UCSC, Entrez, Ensembl
   ○ Assembly: Cufflinks, Scripture, Trinity, G.Mor.Se, Velvet, Trans-ABySS
   ○ Quantification: Cufflinks, RSEM, MISO, ERANGE, NEUMA, ALEXA-seq
   ○ Differential expression: Cuffdiff, DEGseq, DESeq, edgeR, Myrna
● Like the early microarray days: lots of excitement, lots of
  tools, little knowledge of how to integrate tools into a pipeline!
● Benchmarks
   ○ Microarray: Spike-ins (Irizarry)
   ○ RNA-Seq: ???, simulation, ???
Phases of NGS analysis
● Primary
  ○ Conversion of raw machine signal into sequence and qualities
● Secondary
  ○ Alignment of reads to reference genome or transcriptome
  ○ De novo assembly of reads into contigs
● Tertiary
  ○ SNP discovery/genotyping
  ○ Peak discovery/quantification (ChIP, MeDIP)
  ○ Transcript assembly/quantification (RNA-seq)
● Quaternary
  ○ Differential expression
  ○ Enrichment, pathways, correlation, clustering, visualization, etc.
Extra credit (not really): RNA-seq
http://bit.ly/galaxy-rnaseq
● #1: learn to use galaxy: bit.ly/uva-galaxy
● #2: Run through an RNA-seq exercise in 1 hour:
  ○ Read some background material on RNA-seq
  ○ Read the tophat/cufflinks method paper
  ○ Get some data (Illumina BodyMap)
  ○ QC / trim your reads
  ○ Map to hg19 with tophat
  ○ Visualize where reads map
  ○ Assemble with cufflinks
  ○ Differential expression with cuffdiff
How are genes regulated?
●   Transcription factors (ChIP-seq)
●   Micro-RNAs (RNA-seq)
●   Chromatin accessibility (DNase-seq)
●   DNA Methylation (RRBS-seq, MeDIP-seq)
●   RNA processing
●   RNA transport
●   Translation
●   Post-translational modification
Importance of DNA methylation
● Occurs most frequently at CpG sites
● High methylation at promoters ≈ silencing
● Methylation perturbed in cancer
● Methylation associated with many other
  complex diseases: neural, autoimmune,
  response to env.
● Mapping DNA methylation → new disease
  genes & drug targets.
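Because methylation concentrates at CpG sites, a first computational step is often simply locating CpG-rich sequence. The sketch below computes a commonly used observed/expected CpG ratio, (number of CpG × sequence length) / (number of C × number of G). The function name and the example sequences are hypothetical.

```python
def cpg_observed_expected(seq):
    """Observed/expected CpG ratio for a DNA sequence."""
    seq = seq.upper()
    c_count = seq.count("C")
    g_count = seq.count("G")
    cpg_count = seq.count("CG")  # CpG dinucleotides on this strand
    if c_count == 0 or g_count == 0:
        return 0.0
    return cpg_count * len(seq) / (c_count * g_count)


cpg_rich = cpg_observed_expected("CGCGCGCG")  # hypothetical CpG-dense sequence
cpg_poor = cpg_observed_expected("CCCCGGGG")  # same base composition, one CpG

print(cpg_rich, cpg_poor)
```

The two sequences have identical C and G content, so the ratio isolates how often C is actually followed by G.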
DNA Methylation Challenges
● Dynamic and tissue-specific
● DNA → Collection of cells which vary in
  5meC patterns → 5meC pattern is complex.
● Further, uneven distribution of CpG targets
● Multiple classes of methods:
  ○ Bisulfite, sequence-based: Assay methylated target
    sequences across individual DNAs.
  ○ Affinity enrichment, count-based: Assay methylation
    level across many genomic loci.
● Many methods
● Many algorithms
Many methylation methods
Category          Method          Description
Gene expression   RNA-Seq         High-throughput cDNA sequencing
DNA methylation   BS-Seq          Whole-genome bisulfite sequencing
                  RRBS-Seq        Reduced representation bisulfite sequencing
                  BC-Seq          Bisulfite capture sequencing
                  BSPP            Bisulfite specific padlock probes
                  Methyl-Seq      Restriction enzyme based methyl-seq
                  MSCC            Methyl sensitive cut counting
                  HELP-Seq        HpaII fragment enrichment by ligation PCR
                  MCA-Seq         Methylated CpG island amplification
                  MeDIP-Seq       Methylated DNA immunoprecipitation
                  MBP-Seq         Methyl-binding protein sequencing
                  MethylCap-seq   Methylated DNA capture by affinity purification
                  MIRA-Seq        Methylated CpG island recovery assay
Methylation methods:
Features & biases
Methylation: Bioinformatics Resources
Resource                           Purpose                                                            URL
Batman                             MeDIP DNA methylation analysis tool                                http://td-blade.gurdon.cam.ac.uk/software/batman
BDPC                               DNA methylation analysis platform                                  http://biochem.jacobs-university.de/BDPC
BSMAP                              Whole-genome bisulphite sequence mapping                           http://code.google.com/p/bsmap
CpG Analyzer                       Windows-based program for bisulphite DNA                           -
CpGcluster                         CpG island identification                                          http://bioinfo2.ugr.es/CpGcluster
CpGFinder                          Online program for CpG island identification                       http://linux1.softberry.com
CpG Island Explorer                Online program for CpG Island identification                       http://bioinfo.hku.hk/cpgieintro.html
CpG Island Searcher                Online program for CpG Island identification                       http://cpgislands.usc.edu
CpG PatternFinder                  Windows-based program for bisulphite DNA                           -
CpG Promoter                       Large-scale promoter mapping using CpG islands                     http://www.cshl.edu/OTT/html/cpg_promoter.html
CpG ratio and GC content Plotter   Online program for plotting the observed:expected ratio of CpG     http://mwsross.bms.ed.ac.uk/public/cgi-bin/cpg.pl
CpGviewer                          Bisulphite DNA sequencing viewer                                   http://dna.leeds.ac.uk/cpgviewer
CyMATE                             Bisulphite-based analysis of plant genomic DNA                     http://www.gmi.oeaw.ac.at/en/cymate-index/
EMBOSS CpGPlot/ CpGReport          Online program for plotting CpG-rich regions                       http://www.ebi.ac.uk/Tools/emboss/cpgplot/index.html
Epigenomics Roadmap                NIH Epigenomics Roadmap Initiative homepage                        http://nihroadmap.nih.gov/epigenomics
Epinexus                           DNA methylation analysis tools                                     http://epinexus.net/home.html
MEDME                              Software package (using R) for modelling MeDIP experimental data   http://espresso.med.yale.edu/medme
methBLAST                          Similarity search program for bisulphite-modified DNA              http://medgen.ugent.be/methBLAST
MethDB                             Database for DNA methylation data                                  http://www.methdb.de
MethPrimer                         Primer design for bisulphite PCR                                   http://www.urogene.org/methprimer
methPrimerDB                       PCR primers for DNA methylation analysis                           http://medgen.ugent.be/methprimerdb
MethTools                          Bisulphite sequence data analysis tool                             http://www.methdb.de
MethyCancer Database               Database of cancer DNA methylation data                            http://methycancer.psych.ac.cn
Methyl Primer Express              Primer design for bisulphite PCR                                   http://www.appliedbiosystems.com/
Methylumi                          Bioconductor pkg for DNA methylation data from Illumina            http://www.bioconductor.org/packages/bioc/html/
Methylyzer                         Bisulphite DNA sequence visualization tool                         http://ubio.bioinfo.cnio.es/Methylyzer/main/index.html
mPod                               DNA methylation viewer integrated w/ Ensembl genome browser        http://www.compbio.group.cam.ac.uk/Projects/
PubMeth                            Database of DNA methylation literature                             http://www.pubmeth.org
QUMA                               Quantification tool for methylation analysis                       http://quma.cdb.riken.jp
TCGA Data Portal                   Database of TCGA DNA methylation data                              http://cancergenome.nih.gov/dataportal
One gene, one enzyme, one function?




         Zhu X. et al. (2007). Genes & Dev 21:1010-1024.                                                 Jeong, H. et al. (2001) Nature 411:41–42.




 Ptacek, J. et al. (2005) Nature 438:679–684.          Guimera and Amaral. (2005). Nature 433:895-900.      Tong, A.H. et al. (2001). Science 294:2364-2368.
Distribution of disease genes


                        Diseases connected if same
                        gene implicated in both.




                        Genes connected if
                        implicated in the same
                        disorder.

                       Goh et al. (2007). PNAS 104:8685.
Distribution of disease genes

Overlay with PPI data




                              Genes contributing to a common
                              disease interact through protein-
                                    protein interactions.


                            Genes connected if
                            implicated in the same
                            disorder.

                           Goh et al. (2007). PNAS 104:8685.
Distribution of disease genes




Seebacher and Gavin (2011). Cell 144:1000-1001

k = degree = # interaction partners

●   “Essential” genes
    ○ Encode hubs
    ○ Are expressed globally
●   “Non-essential” disease genes
    ○ Do not encode hubs
    ○ Tissue-specific expression
Distribution of disease genes
●   Disease genes lie at the functional periphery of cellular networks (Goh PNAS 2007).
●   Genes contributing to a common disease interact through protein-protein
    interactions (Goh PNAS 2007).
●   Diseasome analysis: a patient is twice as likely to develop another disease if that
    disease shares a gene with the patient's primary disease (Park et al. 2009. The Impact of Cellular
    Networks on Disease Comorbidity. Mol Syst Biol 5:262).
●   miRNA analysis: connecting diseases whose associated genes are regulated by a
    common miRNA yields disease-class segregation. E.g. cancers share similar
    associations at the miRNA level (Lu et al. 2008. An analysis of human microRNA and disease associations.
    PLoS ONE 3:e3420).




                   Nonrandom placement of
             disease genes in interactome!
Distribution of disease genes
       Vidal et al, Cell 2011.
Distribution of disease genes
● Data is cheap and diverse.
  ○ Genetic variation: GWAS, next-gen sequencing
  ○ Gene expression: Microarray, RNA-seq
  ○ Proteomics: Y2H, CoAP/MS
● Cellular components interact in a network
  with other cellular components.
● Disease is the result of an abnormality in
  that network.
● Integrate multiple data types, understand
  network, understand disease.
Pathway Analysis
● You’ve done your microarray/RNA-Seq experiment
  ○ You have a list of genes
  ○ Want to put these into functional context
  ○ What biological processes are perturbed?
  ○ What pathways are being dysregulated?
  ○ Data reduction: hundreds or thousands of genes can be reduced to
    tens of pathways
  ○ Identifying active pathways = more explanatory power
● “Pathway analysis” encompasses many, many
  techniques:
   ○ 1st Generation: Overrepresentation Analysis (E.g. GO ORA)
   ○ 2nd Generation: Functional Class Scoring (e.g. GSEA)
   ○ 3rd Generation (in development): Pathway Topology (E.g. SPIA)
● http://gettinggeneticsdone.com/2012/03/pathway-analysis-for-high-throughput.html
Pathway Analysis: Over-
representation analysis
● Many variations on the same theme:
  statistically evaluate the fraction of genes in
  a particular pathway that show changes in
  expression.
● Algorithm:
   ○ Create input list (e.g. “significant at p<0.05”)
   ○ For each gene set:
     ■ Count number of input genes
     ■ Count number of “background” genes (e.g. all genes on platform).
   ○ Test each pathway for over-representation of input
     genes
● Gene Set: typically gene ontology (GO)
  term.
Pathway analysis: over-
representation analysis
● Ontology = formal representation of a knowledge
  domain.
● Gene ontology = cell biology.
● GO represented by directed acyclic graph (DAG).
  ○ Terms are nodes, relationships are edges.
  ○ Parent terms are more general than their child terms.
  ○ Unlike a simple tree, terms can have multiple parents.
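The DAG structure can be sketched with a small parent map. The term names here are hypothetical placeholders, not real GO terms; the point is that a term with two parents reaches the root by more than one path, so collecting its ancestors requires a graph traversal rather than a single walk up a tree.

```python
def ancestors(term, parents):
    """Return every term reachable from `term` by following parent edges."""
    seen = set()
    stack = [term]
    while stack:
        for parent in parents.get(stack.pop(), ()):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen


# Hypothetical ontology: child term -> list of parent terms.
parents = {
    "root": [],
    "process_A": ["root"],
    "process_B": ["root"],
    "specific_term": ["process_A", "process_B"],  # two parents: not a tree
}

result = ancestors("specific_term", parents)
print(sorted(result))
```

This matters for over-representation analysis because a gene annotated to a specific term is usually counted toward that term's more general ancestors as well.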




    Rhee, S. Y., Wood, V., Dolinski, K., & Draghici, S. (2008). Use and misuse of the gene ontology annotations. Nature Reviews Genetics, 9(7), 509-15.
Pathway analysis:
Over-representation analysis
● Algorithm:
  ○ Create input list (e.g. “significant at p<0.05”)
  ○ For each gene set:
      ■ Count number of input genes
      ■ Count number of “background” genes (e.g. all genes on platform).
  ○ Test each pathway for over-representation of input genes
● Ex: GO “Purine Ribonucleotide Biosynthetic Process”
  ○ 1% of input (significant) genes are annotated with this term.
  ○ 1% of genes on the chip are annotated with this term.
  ○ Not significantly overrepresented.
● Ex: GO “V(D)J Recombination”
  ○ 20% of input (significant) genes are annotated with this term.
  ○ 1% of genes on the chip are annotated with this term.
  ○ Highly significantly over-represented!
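The two examples above can be made concrete with a hypergeometric test, one common way to implement the over-representation step (the slides do not commit to a specific test). The chip size, gene-set size, and input-list size below are hypothetical values chosen to match the 1% and 20% figures.

```python
from math import comb


def ora_p_value(n_background, n_in_set, n_input, n_overlap):
    """P(overlap >= n_overlap) under the hypergeometric distribution."""
    total = comb(n_background, n_input)
    upper = min(n_in_set, n_input)
    tail = sum(
        comb(n_in_set, k) * comb(n_background - n_in_set, n_input - k)
        for k in range(n_overlap, upper + 1)
    )
    return tail / total


# Hypothetical setup: 10,000 genes on the chip, 100 of them (1%) annotated
# with the term, and 200 genes in the significant input list.
n_background, n_in_set, n_input = 10_000, 100, 200

p_purine = ora_p_value(n_background, n_in_set, n_input, 2)   # 1% of input genes
p_vdj = ora_p_value(n_background, n_in_set, n_input, 40)     # 20% of input genes

print(p_purine, p_vdj)
```

With 1% of input genes annotated, the overlap is exactly what chance predicts and the p-value is large; with 20% annotated, the p-value is vanishingly small. In practice the result would still need correction for the number of gene sets tested.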
Pathway analysis
● Pathway analysis gives you more biological
  insight than staring at lists of genes.
● Pathway analysis is complex, and has many
  limitations.
● Pathway analysis is still more an
  exploratory procedure than a pure
  statistical endpoint.
● The best conclusions are made by viewing
  enrichment analysis results through the lens
  of the investigator’s expert biological
  knowledge.
Resources: Online community &
discussion forum
● Seqanswers
  ○   http://SEQanswers.com
  ○   Twitter: @SEQquestions
  ○   Format: Forum
  ○   Li et al. SEQanswers : An open access community
      for collaboratively decoding genomes. Bioinformatics
      (2012).
● BioStar:
  ○   http://biostar.stackexchange.com
  ○   Twitter: @BioStarQuestion
  ○   Format: Q&A
  ○   Parnell et al. BioStar: an online question & answer
      resource for the bioinformatics community. PLoS
      Comp Bio (2011) 7:e1002216.
Resources: further education


    stephenturner.us/p/edu

  Regularly updated, comprehensive list of over 20 in-
  person and free online workshops in bioinformatics,
        programming, statistics, genetics, etc.
Publicly Available Data: NCBI
●   GenBank: http://www.ncbi.nlm.nih.gov/genbank/
     ○ Collection of all publicly available DNA sequences.
     ○ Feb 2013: 150,141,354,858 bases from 162,886,727 sequences.
●   NCBI Genomes: http://www.ncbi.nlm.nih.gov/genome/
     ○ Public repository for sequenced genomes.
     ○ March 2013: 3,005 eukaryotes, 19,125 prokaryotes, 3,570 viruses.
●   NCBI Taxonomy: http://www.ncbi.nlm.nih.gov/taxonomy
     ○ Publicly available classification and nomenclature database for all organisms in the public
         sequences database.
     ○ Phylogenetic lineages for >160,000 organisms (est. ~10% life on the planet)
●   GEO: http://www.ncbi.nlm.nih.gov/geo/
     ○ Public repository of sequence- and array-based gene expression data, free for the taking.
     ○ 900,000+ samples, 3,200+ datasets.
●   dbGaP: http://www.ncbi.nlm.nih.gov/gap
     ○ Public repository for genetic studies.
     ○ 2,500+ datasets, 100,000+ variables.
●   SRA: http://www.ncbi.nlm.nih.gov/sra
     ○ Public repository for raw sequencing data from NGS platforms.
     ○ 3,500,000,000,000,000 bases sequenced.
Publicly Available Data: Databases
●   2013 Nucleic Acids Research Database Issue
     ○ http://nar.oxfordjournals.org/content/41/D1/D1.abstract
     ○ 176 articles describing new/updated molecular biology databases.
●   NAR Molecular Biology Database Collection
     ○ http://www.oxfordjournals.org/nar/database/a/
     ○ 1,512 molecular biology databases
     ○ Categories: DNA/RNA/Protein sequences, structures,
        metabolic/signaling pathways, genes & genomes, human diseases,
        microarray/other gene expression data, proteomics, organelles, plants,
        immunological, cell bio, …
Publicly Available Data: Webservers
● 2012 NAR Web Server Issue
   ○ http://nar.oxfordjournals.org/content/40/W1.toc
   ○ 102 articles/webservers featured
● Bioinformatics Links Directory
   ○ http://bioinformatics.ca/links_directory/
   ○ Includes all the NAR resources above.
   ○ 1,376 tools, 620 databases, 163 other resources
   ○ Topics: computer-related, DNA, education, expression,
      genomics, literature, model organisms, RNA, protein, other
      molecules, sequence comparison, …
Bioinformatics Core Mission:
 help scientists publish their
work and obtain new funding
through service and training.
Services
●   Gene expression: Microarray Analysis
●   Gene expression: RNA-seq Analysis
●   Pathway analysis
●   DNA Variation (GWAS, NGS)
●   DNA Binding / ChIP-Seq
●   DNA Methylation
●   Metagenomics
●   Grant / Manuscript support
●   Custom development (computing & stats)
●   ... etc.
Contact
Web:    bioinformatics.virginia.edu
E-mail: bioinformatics@virginia.edu
Blog:   GettingGeneticsDone.com
Twitter: @genetics_blog

Introduction to Bioinformatics for UVA Cell Bio 8401

  • 1. Introduction to Bioinformatics Stephen Turner, Ph.D. Bioinformatics Core Director bioinformatics@virginia.edu Slides at bit.ly/intro-bioinfo
  • 2. Contact Web: bioinformatics.virginia.edu E-mail: bioinformatics@virginia.edu Blog: GettingGeneticsDone.com Twitter: @genetics_blog
  • 3. Bioinformatics Origins: Rooted in sequence analysis. Driven by the need to: ● Collect ● Annotate ● Analyze
  • 4. Margaret Dayhoff (1925-1983) ● Collected all known protein structures & sequences ● Published Atlas in 1965 ● Pioneered algorithm development for: ○ Comparing protein sequences ○ Deriving evolutionary history from alignments “In this paper we shall describe a completed computer program for the IBM 7090, which to our knowledge is the first successful attempt at aiding the analysis of the amino acid chain structure of protein.”
  • 6. “There is a tremendous amount of information regarding evolutionary history and biochemical function implicit in each sequence and the number of known sequences is growing explosively. We feel it is important to collect this significant information, correlate it into a unified whole and interpret it.” M. Dayhoff, February 27, 1967
  • 8. [Timeline figure, 1960-2010. Computing: ARPAnet invented, Internet invented, WWW. Bioinformatics: Dayhoff Atlas, Sanger Sequencing, GenBank, EMBL-EBI, Next-Gen Sequencing.]
  • 9.
  • 10. Definition From Wikipedia: Bioinformatics is a branch of biological science which deals with the study of methods for storing, retrieving and analyzing biological data, such as nucleic acid (DNA/RNA) and protein sequence, structure, function, pathways and genetic interactions. It generates new knowledge that is useful in such fields as drug design and development of new software tools to create that knowledge. Bioinformatics also deals with algorithms, databases and information systems, web technologies, artificial intelligence and soft computing, information and computation theory, structural biology, software engineering, data mining, image processing, modeling and simulation, discrete mathematics, control and system theory, circuit theory, and statistics. Our definition: using computer science and statistics to answer biological questions.
  • 11. Subdisciplines ● Sequence alignment (DNA, RNA, Protein) ● Genome annotation ● Evolutionary biology / comparative genomics ● Analysis of gene expression ● Analysis of gene regulation ● Genotype-phenotype association ● Mutation analysis ● Structural biology ● Biomarker identification ● Pathway analysis / "systems biology" ● Literature analysis / text-mining
  • 12. Central Dogma Reverse RNA transcription Silencing Prions DNA RNA Protein Post-translational modification Methylation
  • 13. Protein folding determines molecular function DNA provides assembly instructions for proteins Networks of interacting proteins determine tissue/organ function
  • 14. Protein folding determines molecular function DNA variant analysis Gene expression analysis Genome annotation Pathway analysis Epigenetics Systems biology DNA provides assembly Biomarker ID'n instructions for proteins Networks of interacting miRNA analysis proteins determine Quantitative MS tissue/organ function Proteomics
  • 15. Subdisciplines ● Sequence alignment (DNA, RNA, Protein) ● Genome annotation ● Evolutionary biology / comparative genomics ● Analysis of gene expression ● Analysis of gene regulation ● Genotype-phenotype association ● Mutation analysis ● Structural biology ● Biomarker identification ● Pathway analysis / "systems biology" ● Literature analysis / text-mining
  • 16. Sequence alignment, example 1 Outbreak: fever, characteristic skin lesions. Culture, isolate DNA, sequence (sanger): GTGAGTAATAATAATTCAAAACTGGAATTTGTACCTAATATACAGCTTAAAGAAGACTTAGGAGCTTTTAGCTATAAAGTCCAACTTTCT CCTGTAGAAAAAGGTATGGCTCATATCCTTGGTAACTCTATTAGAAGGGTTTTATTATCTTCACTATCAGGTGCATCTATAATTAAAGTA AACATCGCTAATGTACTACATGAGTATTCTACTTTAGAAGATGTAAAAGAAGATGTTGTTGAAATTGTTTCTAATTTGAAAAAGGTTGCG ATAAAGCTTGATACAGGTATAGATAGACTAGATTTAGAACTATCTGTAAATAAATCAGGTGTAGTTAGCGCTGGAGATTTTAAGACGACT CAAGGTGTAGAAATAATAAATAAAGATCAGCCAATAGCTACTTTGACAAACCAAAGAGCATTTAGCTTAACTGCTACAGTGAGTGTAGGT AGAAATGTCGGAATACTTTCTGCGATACCAACCGAGCTTGAGAGAGTTGGTGATATAGCTGTAGATGCTGATTTTAATCCTATTAAAAGA GTTGCTTTTGAGGTTTTTGATAATGGTGATAGTGAAACTTTAGAAGTATTTGTAAAGACAAATGGTACTATAGAACCACTAGCAGCTGTT ACGAAAGCTTTAGAGTATTTCTGTGAGCAAATATCAGTATTTGTATCTCTAAGAGTACCTAGTAATGGTAAAACAGGTGATGTATTAATA GATTCTAATATTGATCCTATCCTTCTTAAGCCGATTGATGATTTAGAGCTAACTGTCAGATCATCTAACTGTCTGCGTGCAGAAAACATT AAGTATCTTGGTGATTTGGTACAGTATTCTGAATCACAGCTTATGAAGATACCTAACTTAGGTAAGAAATCTCTCAATGAGATCAAACAA ATTTTAATAGATAATAACTTGTCTCTAGGTGTCCAAATTGACAATTTTAGAGAGCTAGTTGAAGGAAAATAA
  • 17. Sequence alignment, example 1 ● BLAST (Basic Local Alignment Search Tool) ● Go to blast.ncbi.nlm.nih.gov ● Click "Nucleotide BLAST" (blastn) ● Under "Choose Search Set", click the "Others" button, then search the entire nr/nt collection (you don't know what it is) GTGAGTAATAATAATTCAAAACTGGAATTTGTACCTAATATACAGCTTAAAGAAGACTTAGGAGCTTTTAGCTATAAAGTCCAACTTTCT CCTGTAGAAAAAGGTATGGCTCATATCCTTGGTAACTCTATTAGAAGGGTTTTATTATCTTCACTATCAGGTGCATCTATAATTAAAGTA AACATCGCTAATGTACTACATGAGTATTCTACTTTAGAAGATGTAAAAGAAGATGTTGTTGAAATTGTTTCTAATTTGAAAAAGGTTGCG ATAAAGCTTGATACAGGTATAGATAGACTAGATTTAGAACTATCTGTAAATAAATCAGGTGTAGTTAGCGCTGGAGATTTTAAGACGACT CAAGGTGTAGAAATAATAAATAAAGATCAGCCAATAGCTACTTTGACAAACCAAAGAGCATTTAGCTTAACTGCTACAGTGAGTGTAGGT AGAAATGTCGGAATACTTTCTGCGATACCAACCGAGCTTGAGAGAGTTGGTGATATAGCTGTAGATGCTGATTTTAATCCTATTAAAAGA GTTGCTTTTGAGGTTTTTGATAATGGTGATAGTGAAACTTTAGAAGTATTTGTAAAGACAAATGGTACTATAGAACCACTAGCAGCTGTT ACGAAAGCTTTAGAGTATTTCTGTGAGCAAATATCAGTATTTGTATCTCTAAGAGTACCTAGTAATGGTAAAACAGGTGATGTATTAATA GATTCTAATATTGATCCTATCCTTCTTAAGCCGATTGATGATTTAGAGCTAACTGTCAGATCATCTAACTGTCTGCGTGCAGAAAACATT AAGTATCTTGGTGATTTGGTACAGTATTCTGAATCACAGCTTATGAAGATACCTAACTTAGGTAAGAAATCTCTCAATGAGATCAAACAA ATTTTAATAGATAATAACTTGTCTCTAGGTGTCCAAATTGACAATTTTAGAGAGCTAGTTGAAGGAAAATAA
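BLAST is a fast heuristic for local alignment. The exact dynamic-programming form of local alignment is Smith-Waterman, and a minimal sketch of its scoring step is given here to show what "local alignment" means. This is not BLAST's algorithm, and the scoring values (match, mismatch, gap) are illustrative assumptions rather than BLAST defaults.

```python
def local_alignment_score(a, b, match=2, mismatch=-1, gap=-2):
    """Best Smith-Waterman local alignment score between strings a and b.

    Scoring parameters are hypothetical and chosen only for illustration.
    """
    prev = [0] * (len(b) + 1)  # previous row of the DP matrix
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # Local alignment: a cell never drops below 0, so an alignment
            # can start anywhere in either sequence.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


# First 25 bases of the outbreak sequence from the slide, embedded in a
# made-up longer "database" sequence.
query = "GTGAGTAATAATAATTCAAAACTGG"
subject = "CCCC" + query + "TTTT"
score = local_alignment_score(query, subject)
```

Because the query occurs exactly inside the subject, the best local score is simply the query length times the match score. The full table costs time proportional to the product of the two lengths, which is why heuristics such as BLAST are used against large collections.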
  • 18.
  • 19.
  • 20. Sequence alignment, example 2 ● Illumina HiSeq 2500: ○ 600,000,000,000 bases sequenced in a single run. ○ 6,000,000,000 x 100-bp (short) reads ● BLAST way too slow. ● BWA: Burrows-Wheeler Aligner (fast) ● Bowtie: fast, memory-efficient (aligns 25,000,000 35-bp reads per hour per CPU). ● Many others... MAQ, Eland, RMAP, SOAP, SHRiMP, BFAST, Mosaik, Novoalign, BLAT, GMAP, GSNAP, MOM, QPalma, SeqMap, VelociMapper, Stampy, mrFAST, etc.
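The figures on this slide can be checked with a few lines of arithmetic. The sketch below uses only the slide's numbers, plus two labeled assumptions: that Bowtie's quoted rate for 35-bp reads carries over to 100-bp reads (it probably does not exactly), and a human genome size of roughly 3.1 billion bases for the coverage estimate.

```python
# Numbers taken from the slide
reads_per_run = 6_000_000_000          # 6 billion reads per HiSeq 2500 run
read_length_bp = 100
bases_per_run = reads_per_run * read_length_bp

# Bowtie rate quoted on the slide (for 35-bp reads, per CPU)
bowtie_reads_per_cpu_hour = 25_000_000

# Assumption: the 35-bp rate is applied unchanged to 100-bp reads
cpu_hours = reads_per_run / bowtie_reads_per_cpu_hour

# Assumption: human genome of about 3.1e9 bases (not stated on the slide)
human_genome_bp = 3_100_000_000
fold_coverage = bases_per_run / human_genome_bp

print(f"{bases_per_run:,} bases per run")
print(f"about {cpu_hours:.0f} CPU-hours to align one run with Bowtie")
print(f"about {fold_coverage:.0f}x coverage of a human-sized genome")
```

Even with a fast aligner, one run amounts to hundreds of CPU-hours under these assumptions, which is why alignment is normally spread across many cores.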
  • 21. Subdisciplines ● Sequence alignment (DNA, RNA, Protein) ● Genome annotation ● Evolutionary biology / comparative genomics ● Analysis of gene expression ● Analysis of gene regulation ● Genotype-phenotype association ● Mutation analysis ● Structural biology ● Biomarker identification ● Pathway analysis / "systems biology" ● Literature analysis / text-mining
  • 22. Comparative Genomics example ● Go to genome.ucsc.edu ● Search for POLR2A ● Turn on some conservation tracks
  • 24. Subdisciplines ● Sequence alignment (DNA, RNA, Protein) ● Genome annotation ● Evolutionary biology / comparative genomics ● Analysis of gene expression ● Analysis of gene regulation ● Genotype-phenotype association ● Mutation analysis ● Structural biology ● Biomarker identification ● Pathway analysis / "systems biology" ● Literature analysis / text-mining
  • 25. Genetic Epidemiology Epidemiology: the study of the patterns, causes, and effects of health and disease conditions in defined populations. Genetic epidemiology: the study of genetic factors in determining health and disease in families and populations.
  • 26. Protein folding determines molecular function DNA provides assembly instructions for proteins Networks of interacting proteins determine tissue/organ function
  • 27. Genetic epidemiology ● Linkage: finding genetic loci that segregate with the disease in families. ● Association: finding alleles that co-occur with disease in populations. ○ Common disease - common variant hypothesis: ■ Common variants (e.g. >1-5% in the population) contribute to common, complex disease. ○ Common disease - rare variant hypothesis: ■ Polymorphisms that cause disease are under purifying selection, and will thus be rare. ○ Really, it's a mix of both.
  • 28. Candidate gene study ● Select candidate genes based on: ○ Known biology ○ Previous linkage/association evidence ○ Pathways ○ Evidence from model organisms ● Genotype variants (SNPs) in those genes ● Statistical association Genotype at position rs12345: A/A Genotype at position rs12345: A/T Genotype at position rs12345: T/T
  • 29. Genome-wide association study ● Genotype >500,000 SNPs ● Statistical test at each one ● Manhattan plot of results ● GWAS does not inform: ○ Which gene affected ○ How gene function perturbed ○ How biological function altered
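One common choice for the "statistical test at each one" step is an allelic chi-square test on a 2x2 table of allele counts in cases and controls. The sketch below is one possible test, not necessarily the one used in any particular study; the allele counts are made up, and the Bonferroni threshold simply divides 0.05 by the 500,000 SNPs mentioned on the slide.

```python
import math


def allelic_chi2(case_a, case_b, ctrl_a, ctrl_b):
    """Pearson chi-square (1 df) for a 2x2 allele-count table.

    Returns (chi2, p_value). Assumes every row and column total is non-zero.
    """
    n = case_a + case_b + ctrl_a + ctrl_b
    numerator = n * (case_a * ctrl_b - case_b * ctrl_a) ** 2
    denominator = ((case_a + case_b) * (ctrl_a + ctrl_b)
                   * (case_a + ctrl_a) * (case_b + ctrl_b))
    chi2 = numerator / denominator
    # Upper-tail probability of a chi-square variable with 1 degree of freedom
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value


# Hypothetical counts of alleles A and T at one SNP
chi2, p = allelic_chi2(case_a=600, case_b=400, ctrl_a=500, ctrl_b=500)

n_snps = 500_000                  # from the slide
bonferroni_alpha = 0.05 / n_snps  # 1e-7
genome_wide_significant = p < bonferroni_alpha
```

A Manhattan plot is then just -log10(p) for every SNP plotted against genomic position. In this toy example the SNP looks convincing on its own but does not pass the genome-wide threshold, which is the point of correcting for 500,000 tests.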
  • 30. Subdisciplines ● Sequence alignment (DNA, RNA, Protein) ● Genome annotation ● Evolutionary biology / comparative genomics ● Analysis of gene expression ● Analysis of gene regulation ● Genotype-phenotype association ● Mutation analysis ● Structural biology ● Biomarker identification ● Pathway analysis / "systems biology" ● Literature analysis / text-mining
  • 31. Gene expression pre-2008 PCR Microarrays
  • 32. Exercise (Thursday) ● Download R: r-project.org ● Download RStudio: rstudio.com ● Get data: http://people.virginia.edu/~sdt5z/GSE4107_RAW.zip ● Run code to download BioC packages: ○ source("http://bioconductor.org/biocLite.R") ○ biocLite() ○ biocLite(c("affy", "AnnotationDbi", "hgu133plus2cdf", "hgu133plus2.db", "genefilter", "DBI", "annotate", "arrayQualityMetrics", "limma", "GOstats", "Category", "GO.db", "KEGG.db"))
  • 33. Gene expression pre-2008 PCR Microarrays
  • 34. RNA sequencing (RNA-seq) Isolate RNAs Generate cDNA, fragment, size Samples of interest select, add linkers Condition 1 Condition 2 (normal colon) (colon tumor) Sequence ends Image: www.bioinformatics.ca Align to Genome Downstream analysis 100s of millions of paired reads 10s of billions bases of sequence
  • 35. RNA-seq advantages ● No reference necessary ● Low background (no cross-hybridization) ● Unlimited dynamic range (FC 9000 Science 320:1344) ● Direct counting (microarrays: indirect – hybridization) ● Can characterize full transcriptome ○ mRNA and ncRNA (miRNA, lncRNA, snoRNA, etc) ○ Differential gene expression ○ Differential coding output ○ Differential TSS usage ○ Differential isoform expression
  • 39. RNA-seq challenges ● Library construction ○ Size selection (messenger, small) ○ Strand specificity? ● Bioinformatic challenges ○ Spliced alignment ○ Transcript deconvolution ● Statistical Challenges ○ Highly variable abundance ○ Sample size: never, ever, plan n=1 ● Normalization (RPKM) ○ Compare features of different lengths ○ Compare conditions with different sequence depth
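RPKM (reads per kilobase of feature per million mapped reads) addresses both normalization points on this slide: dividing by feature length makes features of different lengths comparable, and dividing by the total mapped reads makes libraries of different depth comparable. A minimal sketch, with made-up read counts and gene lengths:

```python
def rpkm(read_count, feature_length_bp, total_mapped_reads):
    """Reads per kilobase of feature per million mapped reads."""
    return read_count * 1e9 / (feature_length_bp * total_mapped_reads)


# Hypothetical example: same raw count, different gene lengths
short_gene = rpkm(read_count=500, feature_length_bp=1_000,
                  total_mapped_reads=10_000_000)
long_gene = rpkm(read_count=500, feature_length_bp=2_000,
                 total_mapped_reads=10_000_000)

# Hypothetical example: twice the reads in a library sequenced twice as deep
deeper_library = rpkm(read_count=1_000, feature_length_bp=1_000,
                      total_mapped_reads=20_000_000)
```

The long gene gets half the RPKM of the short gene for the same raw count, and doubling both the count and the library size leaves RPKM unchanged.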
  • 40. Common question #1: Depth ● Question: how much sequence do I need? ● Answer: it’s complicated. ● Depends on: ○ Size & complexity of transcriptome ○ Application: differential gene expression, transcript discovery, aberrant splicing, etc. ○ Tissue type, RNA quality, library preparation ○ Sequencing type: length, single-/paired-end, etc. ● Find publication in your field w/ similar goals. ● Good news: 1 GA or ½ HiSeq lane is sufficient for most applications
  • 41. Common question #2: Sample Size ● Question: How many samples should I sequence? ● Oversimplified Answer: At least 3 biological replicates per condition. ● Depends on: ○ Sequencing depth ○ Application ○ Goals (prioritization, biomarker discovery, etc.) ○ Effect size, desired power, statistical significance ● Find a publication with similar goals
  • 42. Common question #3: Workflow ● How do I analyze the data? ● No standards! ○ Unspliced aligners: BWA, Bowtie, Stampy, SHRiMP ○ Spliced aligners: Tophat, MapSplice, SpliceMap, GSNAP, QPALMA ○ Reference builds & annotations: UCSC, Entrez, Ensembl ○ Assembly: Cufflinks, Scripture, Trinity, G.Mor.Se, Velvet, TransABySS ○ Quantification: Cufflinks, RSEM, MISO, ERANGE, NEUMA, Alexa-Seq ○ Differential expression: Cuffdiff, DegSeq, DESeq, EdgeR, Myrna ● Like early microarray days: lots of excitement, lots of tools, little knowledge of integrating tools in pipeline! ● Benchmarks ● Microarray: Spike-ins (Irizarry) ● RNA-Seq: ???, simulation, ???
  • 43. Phases of NGS analysis ● Primary ○ Conversion of raw machine signal into sequence and qualities ● Secondary ○ Alignment of reads to reference genome or transcriptome ○ De novo assembly of reads into contigs ● Tertiary ○ SNP discovery/genotyping ○ Peak discovery/quantification (ChIP, MeDIP) ○ Transcript assembly/quantification (RNA-seq) ● Quaternary ○ Differential expression ○ Enrichment, pathways, correlation, clustering, visualization, etc.
  • 44. Extra credit (not really): RNA-seq http://bit.ly/galaxy-rnaseq ● #1: learn to use galaxy: bit.ly/uva-galaxy ● #2: Run through an RNA-seq exercise in 1 hour: ○ Read some background material on RNA-seq ○ Read the tophat/cufflinks method paper ○ Get some data (Illumina BodyMap) ○ QC / trim your reads ○ Map to hg19 with tophat ○ Visualize where reads map ○ Assemble with cufflinks ○ Differential expression with cuffdiff
  • 45. Subdisciplines ● Sequence alignment (DNA, RNA, Protein) ● Genome annotation ● Evolutionary biology / comparative genomics ● Analysis of gene expression ● Analysis of gene regulation ● Genotype-phenotype association ● Mutation analysis ● Structural biology ● Biomarker identification ● Pathway analysis / "systems biology" ● Literature analysis / text-mining
  • 46. How are genes regulated? ● Transcription factors (ChIP-seq) ● Micro-RNAs (RNA-seq) ● Chromatin accessibility (DNAse-Seq) ● DNA Methylation (RRBS-seq, MeDIP-seq) ● RNA processing ● RNA transport ● Translation ● Post-translational modification
  • 47. Importance of DNA methylation ● Occurs most frequently at CpG sites ● High methylation at promoters ≈ silencing ● Methylation perturbed in cancer ● Methylation associated with many other complex diseases: neural, autoimmune, response to env. ● Mapping DNA methylation → new disease genes & drug targets.
  • 48. DNA Methylation Challenges ● Dynamic and tissue-specific ● DNA → Collection of cells which vary in 5meC patterns → 5meC pattern is complex. ● Further, uneven distribution of CpG targets ● Multiple classes of methods: ○ Bisulfite, sequence-based: Assay methylated target sequences across individual DNAs. ○ Affinity enrichment, count-based: Assay methylation level across many genomic loci. ● Many methods ● Many algorithms
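Since methylation occurs mostly at CpG sites and those sites are unevenly distributed, a first look at a region is often just counting CpG dinucleotides and comparing the count with what base composition alone would predict. The sketch below uses the commonly used observed/expected ratio, (number of CpG x length) / (number of C x number of G). The example sequences are made up.

```python
def cpg_stats(seq):
    """Return (CpG count, GC fraction, observed/expected CpG ratio) for seq."""
    seq = seq.upper()
    n = len(seq)
    c = seq.count("C")
    g = seq.count("G")
    cpg = seq.count("CG")  # "CG" cannot overlap itself, so count() is exact
    gc_fraction = (c + g) / n if n else 0.0
    # Expected CpG count under independence is (c * g) / n
    obs_exp = (cpg * n) / (c * g) if c and g else 0.0
    return cpg, gc_fraction, obs_exp


cpg_rich = cpg_stats("CGCGCGCG")   # toy CpG-dense sequence
cpg_free = cpg_stats("ATATAT")     # toy sequence with no C or G
```

Applying the same function to sliding windows along a chromosome gives a simple picture of where CpG-dense regions sit.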
  • 49. Many methylation methods Gene RNA-Seq High-throughput cDNA sequencing Expression BS-Seq Whole-genome bisulfite sequencing RRBS-Seq Reduced representation bisulfite sequencing BC-Seq Bisulfite capture sequencing BSPP Bisulfite specific padlock probes Methyl-Seq Restriction enzyme based methyl-seq DNA MSCC Methyl sensitive cut counting Methylation HELP-Seq HpaII fragment enrichment by ligation PCR MCA-Seq Methylated CpG island amplification MeDIP-Seq Methylated DNA immunoprecipitation MBP-Seq Methyl-binding protein sequencing MethylCap-seq Methylated DNA capture by affinity purification MIRA-Seq Methylated CpG island recovery assay
  • 51. Methylation: Bioinformatics Resources Resource Purpose URL Refs Batman MeDIP DNA methylation analysis tool http://td-blade.gurdon.cam.ac.uk/software/batman BDPC DNA methylation analysis platform http://biochem.jacobs-university.de/BDPC BSMAP Whole-genome bisulphite sequence mapping http://code.google.com/p/bsmap CpG Analyzer Windows-based program for bisulphite DNA - CpGcluster CpG island identification http://bioinfo2.ugr.es/CpGcluster CpGFinder Online program for CpG island identification http://linux1.softberry.com CpG Island Explorer Online program for CpG Island identification http://bioinfo.hku.hk/cpgieintro.html CpG Island Searcher Online program for CpG Island identification http://cpgislands.usc.edu CpG PatternFinder Windows-based program for bisulphite DNA - CpG Promoter Large-scale promoter mapping using CpG islands http://www.cshl.edu/OTT/html/cpg_promoter.html CpG ratio and GC content Plotter Online program for plotting the observed:expected ratio of CpG http://mwsross.bms.ed.ac.uk/public/cgi-bin/cpg.pl CpGviewer Bisulphite DNA sequencing viewer http://dna.leeds.ac.uk/cpgviewer CyMATE Bisulphite-based analysis of plant genomic DNA http://www.gmi.oeaw.ac.at/en/cymate-index/ EMBOSS CpGPlot/ CpGReport Online program for plotting CpG-rich regions http://www.ebi.ac.uk/Tools/emboss/cpgplot/index.html Epigenomics Roadmap NIH Epigenomics Roadmap Initiative homepage http://nihroadmap.nih.gov/epigenomics Epinexus DNA methylation analysis tools http://epinexus.net/home.html MEDME Software package (using R) for modelling MeDIP experimental data http://espresso.med.yale.edu/medme methBLAST Similarity search program for bisulphite-modified DNA http://medgen.ugent.be/methBLAST MethDB Database for DNA methylation data http://www.methdb.de MethPrimer Primer design for bisulphite PCR http://www.urogene.org/methprimer methPrimerDB PCR primers for DNA methylation analysis http://medgen.ugent.be/methprimerdb MethTools Bisulphite sequence data analysis tool 
http://www.methdb.de MethyCancer Database Database of cancer DNA methylation data http://methycancer.psych.ac.cn Methyl Primer Express Primer design for bisulphite PCR http://www.appliedbiosystems.com/ Methylumi Bioconductor pkg for DNA methylation data from Illumina http://www.bioconductor.org/packages/bioc/html/ Methylyzer Bisulphite DNA sequence visualization tool http://ubio.bioinfo.cnio.es/Methylyzer/main/index.html mPod DNA methylation viewer integrated w/ Ensembl genome browser http://www.compbio.group.cam.ac.uk/Projects/ PubMeth Database of DNA methylation literature http://www.pubmeth.org QUMA Quantification tool for methylation analysis http://quma.cdb.riken.jp TCGA Data Portal Database of TCGA DNA methylation data http://cancergenome.nih.gov/dataportal
  • 52. Subdisciplines ● Sequence alignment (DNA, RNA, Protein) ● Genome annotation ● Evolutionary biology / comparative genomics ● Analysis of gene expression ● Analysis of gene regulation ● Genotype-phenotype association ● Mutation analysis ● Structural biology ● Biomarker identification ● Pathway analysis / "systems biology" ● Literature analysis / text-mining
  • 53. One gene, one enzyme, one function? Zhu, X. et al. (2007). Genes & Dev 21:1010-1024. Jeong, H. et al. (2001). Nature 411:41-42. Ptacek, J. et al. (2005). Nature 438:679-684. Guimera and Amaral (2005). Nature 433:895-900. Tong, A.H. et al. (2001). Science 294:2364-2368.
  • 54. Distribution of disease genes Diseases connected if same gene implicated in both. Genes connected if implicated in the same disorder. Goh et al. (2007). PNAS 104:8685.
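The two connection rules on this slide amount to projecting a disease-to-gene mapping onto a disease network and a gene network. The following is a minimal sketch of that idea, not the procedure used by Goh et al.; the toy `disease_genes` mapping and the function names `disease_network` and `gene_network` are hypothetical and exist only for illustration.

```python
from itertools import combinations

# Hypothetical toy data: disorder -> set of implicated genes (illustration only).
disease_genes = {
    "disorder_A": {"GENE1", "GENE2"},
    "disorder_B": {"GENE2", "GENE3"},
    "disorder_C": {"GENE4"},
}

def disease_network(d2g):
    """Connect two diseases if at least one gene is implicated in both."""
    edges = set()
    for d1, d2 in combinations(sorted(d2g), 2):
        if d2g[d1] & d2g[d2]:  # non-empty intersection = shared gene
            edges.add((d1, d2))
    return edges

def gene_network(d2g):
    """Connect two genes if they are implicated in the same disorder."""
    edges = set()
    for genes in d2g.values():
        edges.update(combinations(sorted(genes), 2))
    return edges

print(disease_network(disease_genes))
print(gene_network(disease_genes))
```

In this toy data, disorder_C shares no gene with the others, so it stays isolated in the disease network.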
  • 55. Distribution of disease genes Overlay with PPI data Genes contributing to a common disease interact through protein-protein interactions. Genes connected if implicated in the same disorder. Goh et al. (2007). PNAS 104:8685.
  • 56. Distribution of disease genes Seebacher and Gavin (2011). Cell 144:1000-1001. ● “Essential” genes ○ Encode hubs ○ Are expressed globally ● “Non-essential” disease genes ○ Do not encode hubs ○ Tissue-specific expression. Figure legend: k = degree = # interaction partners.
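The degree k used on this slide is simply the number of distinct interaction partners a protein has. A minimal sketch of how it could be computed from an interaction edge list follows; the toy `ppi_edges` list, the function names, and the `min_degree` hub cutoff are all assumptions made for illustration, since the slide does not define a hub threshold.

```python
from collections import defaultdict

# Hypothetical toy protein-protein interaction edge list (illustration only).
ppi_edges = [("P1", "P2"), ("P1", "P3"), ("P1", "P4"), ("P2", "P3"), ("P5", "P1")]

def degrees(edges):
    """Return k for each protein: its number of distinct interaction partners."""
    partners = defaultdict(set)
    for a, b in edges:
        if a != b:  # ignore self-interactions
            partners[a].add(b)
            partners[b].add(a)
    return {protein: len(p) for protein, p in partners.items()}

def hubs(edges, min_degree):
    """Proteins with degree >= min_degree (the cutoff is an arbitrary assumption)."""
    return {p for p, k in degrees(edges).items() if k >= min_degree}

k = degrees(ppi_edges)
print(k)
print(hubs(ppi_edges, min_degree=3))
```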
  • 57. Distribution of disease genes ● Disease genes lie at the functional periphery of cellular networks (Goh PNAS 2007). ● Genes contributing to a common disease interact through protein-protein interactions (Goh PNAS 2007). ● Diseasome analysis: a patient is twice as likely to develop another disease if that disease shares a gene with the patient’s primary disease (Park et al. 2009. The Impact of Cellular Networks on Disease Comorbidity. Mol Syst Biol 5:262). ● miRNA analysis: connecting diseases whose associated genes are regulated by a common miRNA yields disease-class segregation, e.g. cancers share similar associations at the miRNA level (Lu et al. 2008. An analysis of human microRNA and disease associations. PLoS ONE 3:e3420). Nonrandom placement of disease genes in the interactome!
  • 58. Distribution of disease genes Vidal et al, Cell 2011.
  • 59. Distribution of disease genes ● Data is cheap and diverse. ○ Genetic variation: GWAS, next-gen sequencing ○ Gene expression: Microarray, RNA-seq ○ Proteomics: Y2H, CoAP/MS ● Cellular components interact in a network with other cellular components. ● Disease is the result of an abnormality in that network. ● Integrate multiple data types, understand network, understand disease.
  • 60. Pathway Analysis ● You’ve done your microarray/RNA-Seq experiment ○ You have a list of genes ○ Want to put these into functional context ○ What biological processes are perturbed? ○ What pathways are being dysregulated? ○ Data reduction: hundreds or thousands of genes can be reduced to tens of pathways ○ Identifying active pathways = more explanatory power ● “Pathway analysis” encompasses many, many techniques: ○ 1st Generation: Overrepresentation Analysis (e.g. GO ORA) ○ 2nd Generation: Functional Class Scoring (e.g. GSEA) ○ 3rd Generation (in development): Pathway Topology (e.g. SPIA) ● http://gettinggeneticsdone.com/2012/03/pathway-analysis-for-high-throughput.html
  • 61. Pathway Analysis: Over-representation analysis ● Many variations on the same theme: statistically evaluate the fraction of genes in a particular pathway that show changes in expression. ● Algorithm: ○ Create input list (e.g. “significant at p<0.05”) ○ For each gene set: ■ Count number of input genes ■ Count number of “background” genes (e.g. all genes on platform). ○ Test each pathway for over-representation of input genes ● Gene Set: typically gene ontology (GO) term.
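The algorithm above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not any particular tool's implementation: the slide does not name a statistical test, so a one-sided hypergeometric test is assumed here, and the gene identifiers, gene sets, and function names are hypothetical.

```python
from math import comb

def hypergeom_tail(k, n, K, N):
    """P(X >= k): chance of seeing at least k set members among n input genes
    drawn from N background genes, of which K belong to the gene set."""
    upper = min(n, K)
    total = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return total / comb(N, n)

def over_representation(input_genes, background, gene_sets):
    """For each gene set, return (input genes in set, background genes in set, p-value)."""
    background = set(background)
    input_genes = set(input_genes) & background  # restrict to the platform
    results = {}
    for name, members in gene_sets.items():
        members = set(members) & background
        k = len(input_genes & members)  # count of input genes in the set
        K = len(members)                # count of background genes in the set
        p = hypergeom_tail(k, len(input_genes), K, len(background))
        results[name] = (k, K, p)
    return results

# Hypothetical toy data: 20 genes on the platform, 5 called significant.
background = [f"g{i}" for i in range(20)]
input_genes = ["g0", "g1", "g2", "g3", "g4"]
gene_sets = {
    "set_A": ["g0", "g1", "g2", "g3"],     # all four members are in the input list
    "set_B": ["g4", "g10", "g11", "g12"],  # only one member is in the input list
}

results = over_representation(input_genes, background, gene_sets)
for name, (k, K, p) in results.items():
    print(name, k, K, round(p, 4))
```

Because one test is run per gene set, the resulting p-values would normally need a multiple-testing correction before interpretation.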
  • 62. Pathway analysis: Over-representation analysis ● Ontology = formal representation of a knowledge domain. ● Gene ontology = cell biology. ● GO represented by directed acyclic graph (DAG). ○ Terms are nodes, relationships are edges. ○ Parent terms are more general than their child terms. ○ Unlike a simple tree, terms can have multiple parents. Rhee, S. Y., Wood, V., Dolinski, K., & Draghici, S. (2008). Use and misuse of the gene ontology annotations. Nature Reviews Genetics, 9(7), 509-15.
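A small sketch may help make the DAG structure concrete: each term stores a list of parents, and walking those edges upward collects every more-general term. The toy `parents` mapping below uses made-up term names rather than real GO identifiers, and the `ancestors` helper is a hypothetical illustration, not part of any GO library.

```python
# Hypothetical toy ontology: term -> list of parent terms (not real GO identifiers).
parents = {
    "root": [],
    "metabolic process": ["root"],
    "biosynthetic process": ["metabolic process"],
    "nucleotide metabolic process": ["metabolic process"],
    # Two parents: this is what makes the structure a DAG rather than a tree.
    "nucleotide biosynthetic process": [
        "biosynthetic process",
        "nucleotide metabolic process",
    ],
}

def ancestors(term, parents):
    """Return all more-general terms reachable by following parent edges."""
    seen = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:  # a term reachable by two paths is visited once
            seen.add(current)
            stack.extend(parents[current])
    return seen

print(ancestors("nucleotide biosynthetic process", parents))
```

This matters for over-representation analysis because a gene annotated to a specific term is usually treated as annotated to that term's ancestors as well, so parent and child terms are not independent tests.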
  • 63. Pathway analysis: Over-representation analysis ● Algorithm: ○ Create input list (e.g. “significant at p<0.05”) ○ For each gene set: ■ Count number of input genes ■ Count number of “background” genes (e.g. all genes on platform). ○ Test each pathway for over-representation of input genes ● Ex: GO “Purine Ribonucleotide Biosynthetic Process” ○ 1% of input (significant) genes are annotated with this term. ○ 1% of genes on the chip are annotated with this term. ○ Not significantly overrepresented. ● Ex: GO “V(D)J Recombination” ○ 20% of input (significant) genes are annotated with this term. ○ 1% of genes on the chip are annotated with this term. ○ Highly significantly over-represented!
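One common way to formalize the comparison in these two examples is a fold-enrichment ratio together with a one-sided hypergeometric p-value. The slide does not state which test is used, so this is an assumption, and the symbols are introduced here for illustration: N is the number of genes on the chip, K the number of those annotated with the term, n the number of input genes, and k the number of input genes annotated with the term.

```latex
\text{enrichment} = \frac{k/n}{K/N},
\qquad
p = P(X \ge k) = \sum_{i=k}^{\min(n,K)} \frac{\binom{K}{i}\binom{N-K}{n-i}}{\binom{N}{n}}
```

For “Purine Ribonucleotide Biosynthetic Process” the enrichment is 0.01 / 0.01 = 1, i.e. exactly what chance predicts. For “V(D)J Recombination” it is 0.20 / 0.01 = 20. The p-value itself depends on the actual counts N, K, n, and k, which the slide does not give.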
  • 64. Pathway analysis ● Pathway analysis gives you more biological insight than staring at lists of genes. ● Pathway analysis is complex and has many limitations. ● Pathway analysis is still more an exploratory procedure than a pure statistical endpoint. ● The best conclusions are made by viewing enrichment analysis results through the lens of the investigator’s expert biological knowledge.
  • 65. Subdisciplines ● Sequence alignment (DNA, RNA, Protein) ● Genome annotation ● Evolutionary biology / comparative genomics ● Analysis of gene expression ● Analysis of gene regulation ● Genotype-phenotype association ● Mutation analysis ● Structural biology ● Biomarker identification ● Pathway analysis / "systems biology" ● Literature analysis / text-mining
  • 66. Resources: Online community & discussion forum ● SEQanswers: ○ http://SEQanswers.com ○ Twitter: @SEQquestions ○ Format: Forum ○ Li et al. SEQanswers: An open access community for collaboratively decoding genomes. Bioinformatics (2012). ● BioStar: ○ http://biostar.stackexchange.com ○ Twitter: @BioStarQuestion ○ Format: Q&A ○ Parnell et al. BioStar: an online question & answer resource for the bioinformatics community. PLoS Comp Bio (2011) 7:e1002216.
  • 67. Resources: further education stephenturner.us/p/edu Regularly updated, comprehensive list of over 20 in-person and free online workshops in bioinformatics, programming, statistics, genetics, etc.
  • 68. Publicly Available Data: NCBI ● GenBank: http://www.ncbi.nlm.nih.gov/genbank/ ○ Collection of all publicly available DNA sequences. ○ Feb 2013: 150,141,354,858 bases from 162,886,727 sequences. ● NCBI Genomes: http://www.ncbi.nlm.nih.gov/genome/ ○ Public repository for sequenced genomes. ○ March 2013: 3,005 eukaryotes, 19,125 prokaryotes, 3,570 viruses. ● NCBI Taxonomy: http://www.ncbi.nlm.nih.gov/taxonomy ○ Publicly available classification and nomenclature database for all organisms in the public sequence databases. ○ Phylogenetic lineages for >160,000 organisms (est. ~10% of life on the planet) ● GEO: http://www.ncbi.nlm.nih.gov/geo/ ○ Public repository of sequence- and array-based gene expression data, free for the taking. ○ 900,000+ samples, 3,200+ datasets. ● dbGaP: http://www.ncbi.nlm.nih.gov/gap ○ Public repository for genetic studies. ○ 2,500+ datasets, 100,000+ variables. ● SRA: http://www.ncbi.nlm.nih.gov/sra ○ Public repository for raw sequencing data from NGS platforms. ○ 3,500,000,000,000,000 bases sequenced.
  • 69. Publicly Available Data: Databases ● 2013 Nucleic Acids Research Database Issue ○ http://nar.oxfordjournals.org/content/41/D1/D1.abstract ○ 176 articles describing new/updated molecular biology databases. ● NAR Molecular Biology Database Collection ○ http://www.oxfordjournals.org/nar/database/a/ ○ 1,512 molecular biology databases ○ Categories: DNA/RNA/Protein sequences, structures, metabolic/signaling pathways, genes & genomes, human diseases, microarray/other gene expression data, proteomics, organelles, plants, immunological, cell bio, …
  • 70. Publicly Available Data: Webservers ● 2012 NAR Web Server Issue ○ http://nar.oxfordjournals.org/content/40/W1.toc ○ 102 articles/webservers featured ● Bioinformatics Links Directory ○ http://bioinformatics.ca/links_directory/ ○ Includes all the NAR resources above. ○ 1,376 tools, 620 databases, 163 other resources ○ Topics: computer-related, DNA, education, expression, genomics, literature, model organisms, RNA, protein, other molecules, sequence comparison, …
  • 71. Bioinformatics Core Mission: help scientists publish their work and obtain new funding through service and training.
  • 72. Services ● Gene expression: Microarray Analysis ● Gene expression: RNA-seq Analysis ● Pathway analysis ● DNA Variation (GWAS, NGS) ● DNA Binding / ChIP-Seq ● DNA Methylation ● Metagenomics ● Grant / Manuscript support ● Custom development (computing & stats) ● ... etc.
  • 73. Contact Web: bioinformatics.virginia.edu E-mail: bioinformatics@virginia.edu Blog: GettingGeneticsDone.com Twitter: @genetics_blog