SlideShare a Scribd company logo
1 of 42
Open questions in bioinformatics and
computational biology from an evolutionary
and molecular biology perspective
Jason Stajich

Assistant Professor of Bioinformatics & Bioinformaticist
Plant Pathology and Microbiology
University of California, Riverside
Outline

• Where and what is the data?

   • Big data is here, Bigger data is coming.

• Current approaches

• What are the problems?

   • Efficient searching in large data sets

   • Ease of interfacing with data to support genomics research - software, databases, and
     UI development

   • Finding signal in the datasets - statistical and computational methods

   • Managing the data - TBs of data from the sequencing.

• Other more biologically focused research areas.
Sequencing has been a driving force
• DNA sequencing and analysis has been the bread and butter of bioinformatics and
  computational biology for genomics and evolutionary research

• Human Genome project - USD $3B (1991 dollars)

   • Typically sequencing costs have been converging towards $0.01 a nucleotide base -
     inexpensive but still limiting for large-scale questions

   • New sequencing techniques have dropped the costs by orders of magnitude

• Techniques for searching data - There has always been speed vs sensitivity tradeoffs

   • Initially developed with heuristics tuned towards the 103 - 106 sequence sized databases

• Previously software, computational resources required specialized centers - less the case
  now
16 B bases 2007-2008




                    Human genome =
                       3B bases




http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
15 Gb in 5 days

  1-3 Gb / day / machine!
for CURRENT technologies

     ~$8000 for human
     genome number of
          bases

    http://www.politigenomics.com/next-generation-sequencing-informatics
396 B
“...bringing its total to 89
     [GA II machines]”
Cheaper sequencing is leading to democratized genome
science. Everyone* can do it




           *With $USD 500K               http://bit.ly/5zNBk
Tools for Bioinformatics and Comparative Genomics

• Interfacing with the data (flat text and binary files, databases, web resources)

• BioPerl is a toolkit for bioinformatics data processing

   • Parsers for common file formats

   • Visualization of some kinds of genomic data - basis for genome browser

   • Genome Browsers to see genomic context information, important for visualizing high
     density data like 2nd-generation sequencing (RNA-Seq, ChIP-Seq)

• Community web resources for information sharing

   • Social networking challenges to get scientists to share information with attribution
Genome Browser data integration - Gbrowse
           Ncra_OR74A_chrIV_contig7.20
                      300k                                         310k                                           320k                                             330k
       DNA_GCContent
    % gc



       NCBI genes (Broad called)
           NCU04433                                       NCU04430                                                           NCU04426
       sulfate permease II CYS-14                         related to aminopeptidase Y precursor; vacuolar                    related to cyclin-supressing protein kinase
                       NCU04432                                            NCU04429                                                               NCU04425
                           hypothetical protein                            conserved hypothetical protein                                       putative protein
                                       NCU04431                                          NCU04428                                                                  NCU04424
                                     related to endo-1; 3-beta-glucanase                  related to spindle assembly checkpoint protein                           related to regulator of chromatin
                                                                                                              NCU04427
                                                                                                             conserved hypothetical protein
       PASA updated NCBI/Broad genes
           NCU04433        NCU04432
                                  [pasa:asmbl_9429,status:12],[pasa:asmbl_9430,status:12]                   [pasa:asmbl_9440,status:12],[pasa:asmbl_9441,status:12],[pasa:asmbl_9442,status:12]
                                                          [pasa:asmbl_9431,status:12],[pasa:asmbl_9432,status:12]            [pasa:asmbl_9443,status:12],[pasa:asmbl_9444,status:12]
                                                                           [pasa:asmbl_9433,status:12],[pasa:asmbl_9434,status:12],[pasa:asmbl_9435,status:12]
                                                                                          [pasa:asmbl_9436,status:12],[pasa:asmbl_9437,status:12],[pasa:asmbl_9438,status:12],[pasa:asmbl_9439,statu
                                                                                                                                                [pasa:asmbl_9445,status:12],[pasa:asmbl_9446,status:12
                                                                                                                                                                   NCU04424

       Named Genes (Radford laboratory)
           cys-14                   gh16-3
                                   tRNA{phe}-9

       miRNA Solexa histogram
           miRNA




       K4dime ChIP-Seq histogram (SOAP)
           K4dime_Solexa                                                                                                                                                     Gbrowse - Stein et al 2002


                                                                                                                                                                              Stajich et al, unpublished
       K9met3 ChIP-Seq histogram (SOAP)                                                                                                                                    Smith, Freitag, et al unpublished
           K9met3
fungalgenomes.org/genomes
Central dogma of eukaryotic
          biology
                                                  Protein

                                                          mRNA

                                                             pre-mRNA
                                                                 Genomic DNA
                 Exon            Exon              Exon



                        Intron



                                        Intron
           Protein
                                                 splicing to form mRNA

  mRNA   Ribosome
                                   Introns



                        Genomic DNA
Information flow in the cell - Central Dogma

• DNA (4 bases, {A,C,G,T}) transcribed into

• RNA (4 bases, {A,C,G,U}) translated into

• Protein (20 amino acid residues, {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}) by triplets
  (codons) of RNAs
  UCA -> Serine (S)
  AUG -> Methionine (M)

• 3 stop codons (UGA, UAA,UAG) in most species

• As always in Biology, there are exceptions!
  Some species use different stop codons. The codon table (codon -> AA) is not the same for
  all species, the mitochondria has different codon table.
22      0      0:
                                  24      0      0:
                                  26      2      0:=
                                  28      7      3:*
                                  30      7     18:*
                                  32     45     68:===*
                                  34    166    184:========*
                                  36    337    379:================ *



 The Sequence DB Search problem
                                  38    581    626:=========================== *
                                  40    869    873:=======================================*
                                  42   1009   1067:============================================== *
                                  44   1276   1177:=====================================================*====
                                  46   1253   1198:======================================================*==
                                  48   1199   1147:====================================================*==
                                  50   1032   1047:===============================================*
                                  52    949    920:=========================================*==
                                  54    838    786:===================================*===
                                  56    578    657:=========================== *
                                  58    467    539:====================== *
                                  60    393    437:================== *
                                  62    339    350:===============*


              Database
                                  64    276    278:============*
                                  66    214    220:=========*
                                  68    188    173:=======*=
                                  70    140    136:======*
                                  72    131    106:====*=


Query           Sequence          74     88     83:===*
                                  76     71     64:==*=
                                  78     48     50:==*
                Sequence          80     43     39:=*
                                  82     38     30:=*



                           Generates
                                  84     27     24:=*
                Sequence          86     21     18:*


   Pairwise
                                  88     15     14:*
                Sequence          90     17     11:*



                            Scores
                                  92      7      8:*          :=======*    = represents 1 library sequence


                                                                                                                E-value
                                  94     22      7:*          :======*===============
                Sequence

  alignment
                                  96      3      5:*          :=== *
                                  98      8      4:*          :===*====
                Sequence         100      6      3:*          :==*===
                                 102      5      2:*          :=*===
                                 104      9      2:*          :=*=======


  For each
                Sequence         106      4      1:*          :*===
                                 108      5      1:*          :*====
                Sequence         110      4      1:*          :*===
                                 112      4      1:*          :*===



  sequence
                                 114      4      1:*          :*===
                Sequence         116      6      0:=          *======
                                 118      1      0:=          *=
                Sequence        >120     32      0:==         *================================


                Sequence

                Sequence

                Sequence
                                                                                      11
                Sequence
A
     Sequence alignment
     algorithms C
• String matching to find common substrings

                      G (mutations) and
  • Allowing for mismatches
    gaps (insertions/deletions)

• Dynamic ProgrammingG
                     solution

  • Needleman-Wunsch - Global Alignments;
    Smith-Waterman - Local alignments
                      T
    (SSEARCH, GGSEARCH; water, needle)

  • Speedups by looking for common words
    (KTUPLES - 2-7) to seed the alignment start

   • Basically what happens in BLAST, FASTA
 DNA sequences, dynamic programming algorithms are available that can calculate the optimal score in
   • Heursitics typically employ a fixed insertion
 O(MN) steps. This efficiency is achieved by determining the optimal score for each prefix of each string,
 andpenalty and cutoffs prefix byscores so that three paths that can be used to extend an alignment:
      then extending each on low considering the
 (1) DPextending the alignment by one residue in each sequence; (2) by extending the alignment by one
     by matrix is sparse.
                                                                                 Pearson, ISMB 2000
 residue in the first sequence and aligning it with a gap in the second; or (3) extending the alignment by
Table 9: Algorithms for comparing protein and DNA sequences


         value    scoring   gap      time
         Sequence alignment algorithms and
       calculated matrix  penalty  required                             complexity
-   global similarity   arbitrary      penalty/gap       O(n2 )    Needleman and
                                            q                      Wunsch, 1970

    (global) distance     unity      penalty/residue     O(n2 )    Sellers, 1974
                                           rk

    local similarity    ˆ
                        Si j < 0.0       affine           O(n2 )    Smith and Waterman, 1981
                                         q + rk                    Gotoh, 1982

     approx. local      ˆ
                        Si j < 0.0   limited gap size   O(n2 )/K   Lipman and Pearson, 1985
       similarity                         q + rk                   Pearson and Lipman, 1988

       maximum          ˆ
                        Si j < 0.0      multiple        O(n2 )/K   Altshul et al., 1990
     segment score                      segments




lated and unrelated proteins.
                                                                                    Pearson, ISMB 2000
FASTA
time: 2:00 min




                        BLAST
                        time: 20 sec




BLAST
time: 20 sec




Heuristic algorithms for DP
 searching for sequence
         similarities
                    28
                                       Pearson, ISMB 2000
Variants on the alignment theme

• HMMER (v1, v2) - Profile Hidden Markov Model where sequences are compared to a profile
  of a multiple sequence alignment and probability that HMM could have emitted query
  sequence is assessed.

• MUMMER - suffix trees, longest increasing subsequence, and Smith-Waterman DP

• BLAT - big hash tables to find exact matching or patterns (ie 24 out a 28bp word) in O(1)
  and string them together. For nearly identical sequences

• For protein sequence alignments - HMMER3 is a game changer.

   • Speed is now basically as fast as BLAST in some cases. New (today!) HMMER3.0b3 is
     multithread and MPI enabled
The need for speed

• Acceleration of current algorithms

   • Hardware acceleration with GPUs and FPGAs

      • TimeLogic, Paracel (defunct)

   • SIMD acceleration

      • SSEARCH using striped Smith-Waterman on x86 (SSE2)

• Embarrassingly Parallel problem!

   • MPI and PVM enabled to split up searches on a cluster, and recombined the score
     distribution at the end to assess E-value
SSEARCH on par with BLAST!   Farrar, Bioinformatics 2007
Large numbers of genome sequences
• 1000+ Bacterial and Archeal genomes

• 100s of fungal genomes

• 10s of animal and plant genomes

• 10s of other eukaryotes

• Coming online
                                                                   http://bit.ly/2ruuhM



   • 1000 human genome project - http://1000genomes.org

   • 1001 Arabidopsis genomes - http://1001genomes.org

   • 1000 Drosophila genomes - http://dpgp.org            Metagenomics
   • 15,000 vertebrate genome project (propsed)
The coming storm: what
to do with all this data?

• Can use the more sensitive tools on
  larger datasets.

• Datasets are of course getting larger.
  Challenge Moore’s law? Producing
  data faster.

   • Need to get more efficient in how
     the data is processed, organized,
     and accessed

• Not discussing the storage and data
  management challenges
22      0      0:
                                  24      0      0:
                                  26      2      0:=
                                  28      7      3:*
                                  30      7     18:*
                                  32     45     68:===*
                                  34    166    184:========*
                                  36    337    379:================ *



 The Sequence DB Search problem
                                  38    581    626:=========================== *
                                  40    869    873:=======================================*
                                  42   1009   1067:============================================== *
                                  44   1276   1177:=====================================================*====
                                  46   1253   1198:======================================================*==
                                  48   1199   1147:====================================================*==
                                  50   1032   1047:===============================================*



1000Xs
                                  52    949    920:=========================================*==
                                  54    838    786:===================================*===
                                  56    578    657:=========================== *
                                  58    467    539:====================== *
                                  60    393    437:================== *


 more
                                  62    339    350:===============*


              Database
                                  64    276    278:============*
                                  66    214    220:=========*
                                  68    188    173:=======*=
                                  70    140    136:======*
                                  72    131    106:====*=


Query           Sequence          74     88     83:===*
                                  76     71     64:==*=
                                  78     48     50:==*
                Sequence          80     43     39:=*
                                  82     38     30:=*



                           Generates
                                  84     27     24:=*
                Sequence          86     21     18:*


   Pairwise
                                  88     15     14:*
                Sequence          90     17     11:*



                            Scores
                                  92      7      8:*          :=======*    = represents 1 library sequence


                                                                                                                E-value
                                  94     22      7:*          :======*===============
                Sequence

  alignment
                                  96      3      5:*          :=== *
                                  98      8      4:*          :===*====
                Sequence         100      6      3:*          :==*===
                                 102      5      2:*          :=*===
                                 104      9      2:*          :=*=======


  For each
                Sequence         106      4      1:*          :*===
                                 108      5      1:*          :*====
                Sequence         110      4      1:*          :*===
                                 112      4      1:*          :*===



  sequence
                                 114      4      1:*          :*===
                Sequence         116      6      0:=          *======
                                 118      1      0:=          *=
                Sequence        >120     32      0:==         *================================


                Sequence

                Sequence

                Sequence
                                                                                      11
                Sequence
Next Generation Sequencing data challenges
• Where did you come from?                           • Genomic Assembly.
  Map all the reads (10-30M tags of length 25 -        Make contiguous sequence stretches by putting
  120 bp) against the genome                           jigsaw back together (30 - 120bp reads,
  With mismatch/gaps.                                  sometimes paired).
  Soln: Hashing, Suffix trees, Burrows-Wheeler          Soln: Best solutions are graph based -but many
  Transformation. BowTie, MAQ, SOAP, BWA               paths through the data so lots of memory
                                                       required. Velvet, ABYSS
• Find the errors, mutations, insertions or
  deletions, or copy number variation.               • Show me your genes!
  Align reads to the genome, but confidently call       Align transcript pieces (short-reads from RNA-
  the differences between the sample and               Seq) to the genome, but with large gaps for
  reference as real variations.                        introns.
  Soln: Some bayesian approaches that factor           Soln: Allow partial alignments and stitch together
  quality and number of sequences that support it.     those that are paired, extra points for occurring
  Shortcut is heuristic: 3+ reads that agree =         at a splice-site and having multiple independent
  believable. Can condition on number of agreeing      reads that agree. BowTie+TopHat/CuffLinks
  reads. Mosaik, MAQ
Jacob Appelbaum & Donald Knuth Demonstrate
      The Recursive Homeboys Principle
Evolutionary questions

• Gene content changes

• Gene order evolution

• Evolutionary relationships between organisms (branching order)

• Evolutionary Rates
Comparing gene content among species

• Identification of the same genes found in each species

• Use similar searches to find significant ones - pairwise problem is fairly straightforward

• Combining these for multiple species where missing data is allowed (gene loss) or additional
  data (gene duplication) makes the problem harder.

• Typical methods cluster into Orthologous groups by similarity cutoff, but transitive property
  is not always met (A similar to B, B similar to C, not always the case that A and C are
  significant similar or homologous).

   • Variations on theme using Markov Clustering (MCL) have been applied with some
     success.

   • Other approaches include phylogenetic (tree building) for putative clusters to sort out
     relationships, but is also fraught with false positives and computation complexity.
Clustering with MCL   http://micans.org/mcl
Ortholog and paralog finding still active areas

• De novo identification identification of families still has many false positives. Dynamic
  adjustment of parameters based on sequence length, evolutionary rates, etc may help
  refine clusters better.

• As more genomes are produced, how can we re-build ortholog sets incrementally?

• Use HMMs of ortholog sets as searching database to improve sensitivity
Gene order evolution

• Comparing order of the genes in
  different organisms.

• Conservation of order is referred to
  as synteny

• Why is it important? Shared synteny
  suggests between different
  organisms suggests it is the
  ancestral state. If it has been
  preserved over time, maybe a
  functional reason

• Also useful for inferring
  corresponding (orthologous) genes
  in different species
Synteny blocks

• Identifying regions that have shared
  history

• Identify anchors, consider ordering
  of anchors in each organism

• Allow for insertions and deletions

• Find longest common “substring” of
  genes.

• Genes in a block found in different
  species may be orthologous
Phylogenetic Trees

• Represents evolutionary history of
  organism

• Can be used with binary character
  traits or DNA or protein sequences

• Constructed by a calculation of the
  distance between the taxa

• Bayesian and maximum likelihood
  approaches as well as simpler
  distance-only (neighbor-joining) and
  parsimony approaches to represent
  the data.
Tree building future
problems
• How do we assemble very large
  trees?
  In pieces and stitch together?
  All in one go with huge distributed
  systems?

• Very much an active area of research
  Need more efficient algorithms and
  hardware acceleration approaches.

• NSF funded CIPRES have been on
  algorithmic improvements -for large
  number of taxa from Tree of Life
  project data.

• Tree Viz is also important area - how
  do we represent the large amount of
  data? Dynamically and interactively?
Evolutionary rates

• Calculate evolutionary rates as
  amount of change in a sequence as
  compared to others

• If the amount of change is greater in
  some branches suggests interesting
  phenomenon -identify by comparing
  the relative rates.

• Statistical tests for whether rates are
  significantly different. Can apply to
  protein loci and genome alignments.

• Open questions: inferring rates in
  the first place still a hard problem
  and quite slow. Methods that
  efficiently scan whole genome of
  alignments still need work.
Comparing conservation profiles
  Nc          TTCCTTACCAGCAGCCCCGGCCGTCGCCCCGCTCTTTCGACCCACGATGTTGTC---GAT
  Nt          TTCCTTACCAGCAGCCCCGGCCGTCGCCCCGCTCTTTCGACTCACGATGTTGTC---GAT
  Nd          TTCCTGACCAGCAGCCCCGGCCGTCGCCCCGCTCTTTCGACCCACGATGTTCTC---GAT
  Sm          TTCCTTACCAGCAGCCCCGGCCGTCGCCCCGCGCTCTCGACTCACGATGTTGTC---GAC
  Pa          TTCTTGACCAGCAGCCCAGGACGCCATCCAGCACT---GACCCACCTTGATGTCCAAGAT
  Cg          TTCCTCACCAGCAGCCCTGGGCGCCACCCGGCTCTGTCGTCGCA---TGACGTCAAAGAT
              *** * *********** ** ** * ** ** **    * * **   **   **   **

       &'()*+,*)-./)-++-.0123-4'.5)*(,.67
   !"#$%$$           !"#$#$$          !"#!$$$   !"#!!$$   !"#!9$$   !"#!A$$   !"#!J$$   !"#!"$$   !"#!<$$   !"#!B$$
   !"#

?@4/                                                                                                                  #9?

                                                                                                                      9$?
   "$%&'()
       &:;$""99

   #**&"+,-.)/.-0&123)&4#2#&5!"#
       -+EFG.!A##9
       -+EFG.!A##A

   6+7+80-9:4#2#&,/(0;+(
       C,-+-D-+EFG.!A##9H+>->(+D!AI


   4<0);$.7)&=;-07)5->/;?'7;-.7@
       ,=-+>/*2+
                                                                                                                      !
                                                                                                                      $8"
                                                                                                                      $
Figure 3: Subtree ROC curves.




             28

                                Pollard et al, Genome Res, 2009
0-

 _
 _                                                              Primate Conservation by PhastCons



 _
 _                                                         Placental Mammal Conservation by PhastCons



 _
 _                                                             Vertebrate Conservation by PhastCons



_
El
El
El
                                                                Multiz Alignments of 44 Vertebrates
mp
 la
an
us
 et
 er
 ur
by
  w
                                                                                               C      chr18:                                            27174000
                                                                                                       UCSC Genes Based on RefSeq, UniProt, GenBank, CCDS and Comparative Genomics
                                                                                                      DSG1
se                                                                                                                 Vertebrate Multiz Alignment & Conservation (44 Species)
 at                                                                                                      3_             Vertebrate Basewise Conservation by PhyloP
 at
 ig
 el                                                                                                      0-
bit
 r18:      27162280        27162290        27162300       27162310
ka CATCCATAGT TGATCGAGAGGTCACTCCT T TCT TCAT TGTAAGTGGA
 --->
ca                                                                                                      -6 _
UCSC Genes Based on RefSeq, UniProt, GenBank, CCDS and Comparative Genomics
 in
SG1 T S     I   V    D R E V            T    P   F    F    I                                             1_                Vertebrate Conservation by PhastCons
  w
se          Vertebrate Multiz Alignment & Conservation (44 Species)
 5_               Vertebrate Basewise Conservation by PhyloP


           Simultaneously identifying slow and
         rapidly evolving regions in the human                                                                              Pollard et al, Genome Res, 2009

                                       genome
More open areas in computational biology

• Image processing - decoding visual images into computable data
  Microscopy images for phenotypes (trait) analyses:
   How big is that neuron cell? After I apply this drug?

 Patterns of gene expression through staining and visualization - what is the pattern?

• Protein 3-D structure analysis and prediction
  Molecular dynamic simulations to take 1D protein sequence to 2D and 3D sequence
  structure predictions. Predictions are still very poor - using hints of known sub-sequences
  and their folding pattern. {Folding@Home}

• RNA structure prediction
  Identification of the secondary structure of sequences via thermodynamic and base-pairing
  rules. Finding real versus artifact and comparing this to sequence conservation.
Still more

• Gene network reconstruction from time series data
  How do genes function together and are dependent on each other.

• Gene identification and annotation
  Use of computation and experimental data to define the regions of the genome which are
  transcribed into genes

• Gene function prediction
  Using the known annotation of function of some genes can we infer the function of others
  using ontologies of gene function and phylogenetic relatedness of the genes and species
For more information

                                              http://bioinfo.ucr.edu




                     http://lab.stajich.org




http://bioperl.org

More Related Content

What's hot

Characterization in Dvilp 7 gene
Characterization in Dvilp 7 geneCharacterization in Dvilp 7 gene
Characterization in Dvilp 7 geneHunter Kelley
 
BIOL335: Functional genomics
BIOL335: Functional genomicsBIOL335: Functional genomics
BIOL335: Functional genomicsPaul Gardner
 
SfN 2008_Tabata_CruzAguado_Progranulin
SfN 2008_Tabata_CruzAguado_ProgranulinSfN 2008_Tabata_CruzAguado_Progranulin
SfN 2008_Tabata_CruzAguado_ProgranulinPierre Zwiegers
 
Computational Protein Design. 1. Challenges in Protein Engineering
Computational Protein Design. 1. Challenges in Protein EngineeringComputational Protein Design. 1. Challenges in Protein Engineering
Computational Protein Design. 1. Challenges in Protein EngineeringPablo Carbonell
 
Optimization of experimental protocols for cellular lysis
Optimization of experimental protocols for cellular lysisOptimization of experimental protocols for cellular lysis
Optimization of experimental protocols for cellular lysisExpedeon
 

What's hot (7)

Characterization in Dvilp 7 gene
Characterization in Dvilp 7 geneCharacterization in Dvilp 7 gene
Characterization in Dvilp 7 gene
 
Bioinformatica t2-databases
Bioinformatica t2-databasesBioinformatica t2-databases
Bioinformatica t2-databases
 
BIOL335: Functional genomics
BIOL335: Functional genomicsBIOL335: Functional genomics
BIOL335: Functional genomics
 
SfN 2008_Tabata_CruzAguado_Progranulin
SfN 2008_Tabata_CruzAguado_ProgranulinSfN 2008_Tabata_CruzAguado_Progranulin
SfN 2008_Tabata_CruzAguado_Progranulin
 
Computational Protein Design. 1. Challenges in Protein Engineering
Computational Protein Design. 1. Challenges in Protein EngineeringComputational Protein Design. 1. Challenges in Protein Engineering
Computational Protein Design. 1. Challenges in Protein Engineering
 
PurdyPoster2014.1
PurdyPoster2014.1PurdyPoster2014.1
PurdyPoster2014.1
 
Optimization of experimental protocols for cellular lysis
Optimization of experimental protocols for cellular lysisOptimization of experimental protocols for cellular lysis
Optimization of experimental protocols for cellular lysis
 

Viewers also liked

Connecting ultrastructure to molecules
Connecting ultrastructure to moleculesConnecting ultrastructure to molecules
Connecting ultrastructure to moleculesJason Stajich
 
Evolution and exploration of the transcriptional landscape in two filamentous...
Evolution and exploration of the transcriptional landscape in two filamentous...Evolution and exploration of the transcriptional landscape in two filamentous...
Evolution and exploration of the transcriptional landscape in two filamentous...Jason Stajich
 
MOMY FungiDB tutorial
MOMY FungiDB tutorialMOMY FungiDB tutorial
MOMY FungiDB tutorialJason Stajich
 
2011 05-02 fungi db-iccc
2011 05-02 fungi db-iccc2011 05-02 fungi db-iccc
2011 05-02 fungi db-icccJason Stajich
 
Comparative genomics of the fungal kingdom
Comparative genomics of the fungal kingdomComparative genomics of the fungal kingdom
Comparative genomics of the fungal kingdomJason Stajich
 
Fungi DB at Keystone meeting
Fungi DB at Keystone meetingFungi DB at Keystone meeting
Fungi DB at Keystone meetingJason Stajich
 
Stajich phyloseminar 2011 06-29
Stajich phyloseminar 2011 06-29Stajich phyloseminar 2011 06-29
Stajich phyloseminar 2011 06-29Jason Stajich
 
2009 11 09 UCLA Bioinformatics Talk
2009 11 09 UCLA Bioinformatics Talk2009 11 09 UCLA Bioinformatics Talk
2009 11 09 UCLA Bioinformatics TalkJason Stajich
 
Télécommunication par satellite et la technologie vsat
Télécommunication par satellite et la technologie vsatTélécommunication par satellite et la technologie vsat
Télécommunication par satellite et la technologie vsatjosepkap
 

Viewers also liked (9)

Connecting ultrastructure to molecules
Connecting ultrastructure to moleculesConnecting ultrastructure to molecules
Connecting ultrastructure to molecules
 
Evolution and exploration of the transcriptional landscape in two filamentous...
Evolution and exploration of the transcriptional landscape in two filamentous...Evolution and exploration of the transcriptional landscape in two filamentous...
Evolution and exploration of the transcriptional landscape in two filamentous...
 
MOMY FungiDB tutorial
MOMY FungiDB tutorialMOMY FungiDB tutorial
MOMY FungiDB tutorial
 
2011 05-02 fungi db-iccc
2011 05-02 fungi db-iccc2011 05-02 fungi db-iccc
2011 05-02 fungi db-iccc
 
Comparative genomics of the fungal kingdom
Comparative genomics of the fungal kingdomComparative genomics of the fungal kingdom
Comparative genomics of the fungal kingdom
 
Fungi DB at Keystone meeting
Fungi DB at Keystone meetingFungi DB at Keystone meeting
Fungi DB at Keystone meeting
 
Stajich phyloseminar 2011 06-29
Stajich phyloseminar 2011 06-29Stajich phyloseminar 2011 06-29
Stajich phyloseminar 2011 06-29
 
2009 11 09 UCLA Bioinformatics Talk
2009 11 09 UCLA Bioinformatics Talk2009 11 09 UCLA Bioinformatics Talk
2009 11 09 UCLA Bioinformatics Talk
 
Télécommunication par satellite et la technologie vsat
Télécommunication par satellite et la technologie vsatTélécommunication par satellite et la technologie vsat
Télécommunication par satellite et la technologie vsat
 

Similar to 2009 11 16 UCR Comp Sci

Bioinformatics final
Bioinformatics finalBioinformatics final
Bioinformatics finalRainu Rajeev
 
BITS training - UCSC Genome Browser - Part 2
BITS training - UCSC Genome Browser - Part 2BITS training - UCSC Genome Browser - Part 2
BITS training - UCSC Genome Browser - Part 2BITS
 
Rm Resumev5 2 09
Rm Resumev5 2 09Rm Resumev5 2 09
Rm Resumev5 2 09mastror
 
132 gene expression in atherosclerotic plaques
132 gene expression in atherosclerotic plaques132 gene expression in atherosclerotic plaques
132 gene expression in atherosclerotic plaquesSHAPE Society
 
GENE KNOCKOUT
GENE KNOCKOUTGENE KNOCKOUT
GENE KNOCKOUTRANA SAHA
 
2015 Bioc4010 lecture1and2
2015 Bioc4010 lecture1and22015 Bioc4010 lecture1and2
2015 Bioc4010 lecture1and2Dan Gaston
 
DNA Analysis - Basic Research : A Case Study
DNA Analysis - Basic Research : A Case StudyDNA Analysis - Basic Research : A Case Study
DNA Analysis - Basic Research : A Case StudyQIAGEN
 
Gene Sequencing | maxam gilbert sequencing | sanger sequencing
Gene Sequencing | maxam gilbert sequencing | sanger sequencingGene Sequencing | maxam gilbert sequencing | sanger sequencing
Gene Sequencing | maxam gilbert sequencing | sanger sequencingmahimachoudhary0807
 
Genome Editing CRISPR-Cas9
Genome Editing CRISPR-Cas9 Genome Editing CRISPR-Cas9
Genome Editing CRISPR-Cas9 Ek Han Tan
 

Similar to 2009 11 16 UCR Comp Sci (20)

RML NCBI Resources
RML NCBI ResourcesRML NCBI Resources
RML NCBI Resources
 
Bioinformatics final
Bioinformatics finalBioinformatics final
Bioinformatics final
 
BITS training - UCSC Genome Browser - Part 2
BITS training - UCSC Genome Browser - Part 2BITS training - UCSC Genome Browser - Part 2
BITS training - UCSC Genome Browser - Part 2
 
Rm Resumev5 2 09
Rm Resumev5 2 09Rm Resumev5 2 09
Rm Resumev5 2 09
 
20140710 6 c_mason_ercc2.0_workshop
20140710 6 c_mason_ercc2.0_workshop20140710 6 c_mason_ercc2.0_workshop
20140710 6 c_mason_ercc2.0_workshop
 
R.Hernandez072712
R.Hernandez072712R.Hernandez072712
R.Hernandez072712
 
Oral Presentation at NCI
Oral Presentation at NCIOral Presentation at NCI
Oral Presentation at NCI
 
132 gene expression in atherosclerotic plaques
132 gene expression in atherosclerotic plaques132 gene expression in atherosclerotic plaques
132 gene expression in atherosclerotic plaques
 
132 gene expression in atherosclerotic plaques
132 gene expression in atherosclerotic plaques132 gene expression in atherosclerotic plaques
132 gene expression in atherosclerotic plaques
 
Micro array study for gene expression in vp
Micro array study for gene expression in vpMicro array study for gene expression in vp
Micro array study for gene expression in vp
 
ACMG Workshop 2011
ACMG Workshop 2011ACMG Workshop 2011
ACMG Workshop 2011
 
GENE KNOCKOUT
GENE KNOCKOUTGENE KNOCKOUT
GENE KNOCKOUT
 
2015 Bioc4010 lecture1and2
2015 Bioc4010 lecture1and22015 Bioc4010 lecture1and2
2015 Bioc4010 lecture1and2
 
DNA Analysis - Basic Research : A Case Study
DNA Analysis - Basic Research : A Case StudyDNA Analysis - Basic Research : A Case Study
DNA Analysis - Basic Research : A Case Study
 
Crispr/cas9 101
Crispr/cas9 101Crispr/cas9 101
Crispr/cas9 101
 
Protein Nanotube based Probes for Cancer Cell Imaging
Protein Nanotube based Probes for Cancer Cell ImagingProtein Nanotube based Probes for Cancer Cell Imaging
Protein Nanotube based Probes for Cancer Cell Imaging
 
Submitted sequence (strains)
Submitted sequence (strains)Submitted sequence (strains)
Submitted sequence (strains)
 
PAG-2004-Roe
PAG-2004-RoePAG-2004-Roe
PAG-2004-Roe
 
Gene Sequencing | maxam gilbert sequencing | sanger sequencing
Gene Sequencing | maxam gilbert sequencing | sanger sequencingGene Sequencing | maxam gilbert sequencing | sanger sequencing
Gene Sequencing | maxam gilbert sequencing | sanger sequencing
 
Genome Editing CRISPR-Cas9
Genome Editing CRISPR-Cas9 Genome Editing CRISPR-Cas9
Genome Editing CRISPR-Cas9
 

More from Jason Stajich

Connecting ultrastructure to molecules: Studying evolution of molecular comp...
Connecting ultrastructure to molecules: Studying evolution  of molecular comp...Connecting ultrastructure to molecules: Studying evolution  of molecular comp...
Connecting ultrastructure to molecules: Studying evolution of molecular comp...Jason Stajich
 
Comparative genomics of fung: studying adaptation through gene family evolution
Comparative genomics of fung: studying adaptation through gene family evolutionComparative genomics of fung: studying adaptation through gene family evolution
Comparative genomics of fung: studying adaptation through gene family evolutionJason Stajich
 
Bioinformatics and BioPerl
Bioinformatics and BioPerlBioinformatics and BioPerl
Bioinformatics and BioPerlJason Stajich
 
Evolution of gene family size change in fungi
Evolution of gene family size change in fungiEvolution of gene family size change in fungi
Evolution of gene family size change in fungiJason Stajich
 
BioPerl: The evolution of a Bioinformatics Toolkit
BioPerl: The evolution of a Bioinformatics ToolkitBioPerl: The evolution of a Bioinformatics Toolkit
BioPerl: The evolution of a Bioinformatics ToolkitJason Stajich
 
BioPerl: Developming Open Source Software
BioPerl: Developming Open Source SoftwareBioPerl: Developming Open Source Software
BioPerl: Developming Open Source SoftwareJason Stajich
 
Comparative Genomics with GMOD and BioPerl
Comparative Genomics with GMOD and BioPerlComparative Genomics with GMOD and BioPerl
Comparative Genomics with GMOD and BioPerlJason Stajich
 

More from Jason Stajich (7)

Connecting ultrastructure to molecules: Studying evolution of molecular comp...
Connecting ultrastructure to molecules: Studying evolution  of molecular comp...Connecting ultrastructure to molecules: Studying evolution  of molecular comp...
Connecting ultrastructure to molecules: Studying evolution of molecular comp...
 
Comparative genomics of fung: studying adaptation through gene family evolution
Comparative genomics of fung: studying adaptation through gene family evolutionComparative genomics of fung: studying adaptation through gene family evolution
Comparative genomics of fung: studying adaptation through gene family evolution
 
Bioinformatics and BioPerl
Bioinformatics and BioPerlBioinformatics and BioPerl
Bioinformatics and BioPerl
 
Evolution of gene family size change in fungi
Evolution of gene family size change in fungiEvolution of gene family size change in fungi
Evolution of gene family size change in fungi
 
BioPerl: The evolution of a Bioinformatics Toolkit
BioPerl: The evolution of a Bioinformatics ToolkitBioPerl: The evolution of a Bioinformatics Toolkit
BioPerl: The evolution of a Bioinformatics Toolkit
 
BioPerl: Developming Open Source Software
BioPerl: Developming Open Source SoftwareBioPerl: Developming Open Source Software
BioPerl: Developming Open Source Software
 
Comparative Genomics with GMOD and BioPerl
Comparative Genomics with GMOD and BioPerlComparative Genomics with GMOD and BioPerl
Comparative Genomics with GMOD and BioPerl
 

Recently uploaded

Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptxHarnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptxFIDO Alliance
 
Collecting & Temporal Analysis of Behavioral Web Data - Tales From The Inside
Collecting & Temporal Analysis of Behavioral Web Data - Tales From The InsideCollecting & Temporal Analysis of Behavioral Web Data - Tales From The Inside
Collecting & Temporal Analysis of Behavioral Web Data - Tales From The InsideStefan Dietze
 
Intro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptxIntro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptxFIDO Alliance
 
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?Paolo Missier
 
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...ScyllaDB
 
Portal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russePortal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russe中 央社
 
ADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxFIDO Alliance
 
Oauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoftOauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoftshyamraj55
 
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdfHow Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdfFIDO Alliance
 
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...Skynet Technologies
 
Long journey of Ruby Standard library at RubyKaigi 2024
Long journey of Ruby Standard library at RubyKaigi 2024Long journey of Ruby Standard library at RubyKaigi 2024
Long journey of Ruby Standard library at RubyKaigi 2024Hiroshi SHIBATA
 
Where to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdfWhere to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdfFIDO Alliance
 
Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Patrick Viafore
 
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...panagenda
 
Your enemies use GenAI too - staying ahead of fraud with Neo4j
Your enemies use GenAI too - staying ahead of fraud with Neo4jYour enemies use GenAI too - staying ahead of fraud with Neo4j
Your enemies use GenAI too - staying ahead of fraud with Neo4jNeo4j
 
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...FIDO Alliance
 
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...FIDO Alliance
 
The Metaverse: Are We There Yet?
The  Metaverse:    Are   We  There  Yet?The  Metaverse:    Are   We  There  Yet?
The Metaverse: Are We There Yet?Mark Billinghurst
 
Breaking Down the Flutterwave Scandal What You Need to Know.pdf
Breaking Down the Flutterwave Scandal What You Need to Know.pdfBreaking Down the Flutterwave Scandal What You Need to Know.pdf
Breaking Down the Flutterwave Scandal What You Need to Know.pdfUK Journal
 
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPTiSEO AI
 

Recently uploaded (20)

Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptxHarnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
Harnessing Passkeys in the Battle Against AI-Powered Cyber Threats.pptx
 
Collecting & Temporal Analysis of Behavioral Web Data - Tales From The Inside
Collecting & Temporal Analysis of Behavioral Web Data - Tales From The InsideCollecting & Temporal Analysis of Behavioral Web Data - Tales From The Inside
Collecting & Temporal Analysis of Behavioral Web Data - Tales From The Inside
 
Intro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptxIntro to Passkeys and the State of Passwordless.pptx
Intro to Passkeys and the State of Passwordless.pptx
 
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
(Explainable) Data-Centric AI: what are you explaininhg, and to whom?
 
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
Event-Driven Architecture Masterclass: Engineering a Robust, High-performance...
 
Portal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russePortal Kombat : extension du réseau de propagande russe
Portal Kombat : extension du réseau de propagande russe
 
ADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptxADP Passwordless Journey Case Study.pptx
ADP Passwordless Journey Case Study.pptx
 
Oauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoftOauth 2.0 Introduction and Flows with MuleSoft
Oauth 2.0 Introduction and Flows with MuleSoft
 
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdfHow Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
How Red Hat Uses FDO in Device Lifecycle _ Costin and Vitaliy at Red Hat.pdf
 
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...
Human Expert Website Manual WCAG 2.0 2.1 2.2 Audit - Digital Accessibility Au...
 
Long journey of Ruby Standard library at RubyKaigi 2024
Long journey of Ruby Standard library at RubyKaigi 2024Long journey of Ruby Standard library at RubyKaigi 2024
Long journey of Ruby Standard library at RubyKaigi 2024
 
Where to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdfWhere to Learn More About FDO _ Richard at FIDO Alliance.pdf
Where to Learn More About FDO _ Richard at FIDO Alliance.pdf
 
Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024Extensible Python: Robustness through Addition - PyCon 2024
Extensible Python: Robustness through Addition - PyCon 2024
 
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
Easier, Faster, and More Powerful – Alles Neu macht der Mai -Wir durchleuchte...
 
Your enemies use GenAI too - staying ahead of fraud with Neo4j
Your enemies use GenAI too - staying ahead of fraud with Neo4jYour enemies use GenAI too - staying ahead of fraud with Neo4j
Your enemies use GenAI too - staying ahead of fraud with Neo4j
 
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
Secure Zero Touch enabled Edge compute with Dell NativeEdge via FDO _ Brad at...
 
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
Choosing the Right FDO Deployment Model for Your Application _ Geoffrey at In...
 
The Metaverse: Are We There Yet?
The  Metaverse:    Are   We  There  Yet?The  Metaverse:    Are   We  There  Yet?
The Metaverse: Are We There Yet?
 
Breaking Down the Flutterwave Scandal What You Need to Know.pdf
Breaking Down the Flutterwave Scandal What You Need to Know.pdfBreaking Down the Flutterwave Scandal What You Need to Know.pdf
Breaking Down the Flutterwave Scandal What You Need to Know.pdf
 
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
1111 ChatGPT Prompts PDF Free Download - Prompts for ChatGPT
 

2009 11 16 UCR Comp Sci

  • 1. Open questions in bioinformatics and computational biology from an evolutionary and molecular biology perspective Jason Stajich Assistant Professor of Bioinformatics & Bioinformaticist Plant Pathology and Microbiology University of California, Riverside
  • 2. Outline • Where and what is the data? • Big data is here, Bigger data is coming. • Current approaches • What are the problems? • Efficient searching in large data sets • Ease of interfacing with data to support genomics research - software, databases, and UI development • Finding signal in the datasets - statistical and computational methods • Managing the data - TBs of data from the sequencing. • Other more biologically focused research areas.
  • 3. Sequencing has been a driving force • DNA sequencing and analysis has been the bread and butter of bioinformatics and computational biology for genomics and evolutionary research • Human Genome project - USD $3B (1991 dollars) • Typically sequencing costs have been converging towards $0.01 a nucleotide base - inexpensive but still limiting for large-scale questions • New sequencing techniques have dropped the costs by orders of magnitude • Techniques for searching data - There has always been speed vs sensitivity tradeoffs • Initially developed with heuristics tuned towards the 103 - 106 sequence sized databases • Previously software, computational resources required specialized centers - less the case now
  • 4. 16 B bases 2007-2008 Human genome = 3B bases http://www.ncbi.nlm.nih.gov/Genbank/genbankstats.html
  • 5. 15 Gb in 5 days 1-3 Gb / day / machine! for CURRENT technologies ~$8000 for human genome number of bases http://www.politigenomics.com/next-generation-sequencing-informatics
  • 6.
  • 8. “...bringing its total to 89 [GA II machines]”
  • 9. Cheaper sequencing is leading to democratized genome science. Everyone* can do it *With $USD 500K http://bit.ly/5zNBk
  • 10.
  • 11. Tools for Bioinformatics and Comparative Genomics • Interfacing with the data (flat text and binary files, databases, web resources) • BioPerl is a toolkit for bioinformatics data processing • Parsers for common file formats • Visualization of some kinds of genomic data - basis for genome browser • Genome Browsers to see genomic context information, important for visualizing high density data like 2nd-generation sequencing (RNA-Seq, ChIP-Seq) • Community web resources for information sharing • Social networking challenges to get scientists to share information with attribution
  • 12. Genome Browser data integration - Gbrowse Ncra_OR74A_chrIV_contig7.20 300k 310k 320k 330k DNA_GCContent % gc NCBI genes (Broad called) NCU04433 NCU04430 NCU04426 sulfate permease II CYS-14 related to aminopeptidase Y precursor; vacuolar related to cyclin-supressing protein kinase NCU04432 NCU04429 NCU04425 hypothetical protein conserved hypothetical protein putative protein NCU04431 NCU04428 NCU04424 related to endo-1; 3-beta-glucanase related to spindle assembly checkpoint protein related to regulator of chromatin NCU04427 conserved hypothetical protein PASA updated NCBI/Broad genes NCU04433 NCU04432 [pasa:asmbl_9429,status:12],[pasa:asmbl_9430,status:12] [pasa:asmbl_9440,status:12],[pasa:asmbl_9441,status:12],[pasa:asmbl_9442,status:12] [pasa:asmbl_9431,status:12],[pasa:asmbl_9432,status:12] [pasa:asmbl_9443,status:12],[pasa:asmbl_9444,status:12] [pasa:asmbl_9433,status:12],[pasa:asmbl_9434,status:12],[pasa:asmbl_9435,status:12] [pasa:asmbl_9436,status:12],[pasa:asmbl_9437,status:12],[pasa:asmbl_9438,status:12],[pasa:asmbl_9439,statu [pasa:asmbl_9445,status:12],[pasa:asmbl_9446,status:12 NCU04424 Named Genes (Radford laboratory) cys-14 gh16-3 tRNA{phe}-9 miRNA Solexa histogram miRNA K4dime ChIP-Seq histogram (SOAP) K4dime_Solexa Gbrowse - Stein et al 2002 Stajich et al, unpublished K9met3 ChIP-Seq histogram (SOAP) Smith, Freitag, et al unpublished K9met3
  • 14. Central dogma of eukaryotic biology Protein mRNA pre-mRNA Genomic DNA Exon Exon Exon Intron Intron Protein splicing to form mRNA mRNA Ribosome Introns Genomic DNA
  • 15. Information flow in the cell - Central Dogma • DNA (4 bases, {A,C,G,T}) transcribed into • RNA (4 bases, {A,C,G,U}) translated into • Protein (20 amino acid residues, {A,C,D,E,F,G,H,I,K,L,M,N,P,Q,R,S,T,V,W,Y}) by triplets (codons) of RNAs UCA -> Serine (S) AUG -> Methionine (M) • 3 stop codons (UGA, UAA,UAG) in most species • As always in Biology, there are exceptions! Some species use different stop codons. The codon table (codon -> AA) is not the same for all species, the mitochondria has different codon table.
  • 16. 22 0 0: 24 0 0: 26 2 0:= 28 7 3:* 30 7 18:* 32 45 68:===* 34 166 184:========* 36 337 379:================ * The Sequence DB Search problem 38 581 626:=========================== * 40 869 873:=======================================* 42 1009 1067:============================================== * 44 1276 1177:=====================================================*==== 46 1253 1198:======================================================*== 48 1199 1147:====================================================*== 50 1032 1047:===============================================* 52 949 920:=========================================*== 54 838 786:===================================*=== 56 578 657:=========================== * 58 467 539:====================== * 60 393 437:================== * 62 339 350:===============* Database 64 276 278:============* 66 214 220:=========* 68 188 173:=======*= 70 140 136:======* 72 131 106:====*= Query Sequence 74 88 83:===* 76 71 64:==*= 78 48 50:==* Sequence 80 43 39:=* 82 38 30:=* Generates 84 27 24:=* Sequence 86 21 18:* Pairwise 88 15 14:* Sequence 90 17 11:* Scores 92 7 8:* :=======* = represents 1 library sequence E-value 94 22 7:* :======*=============== Sequence alignment 96 3 5:* :=== * 98 8 4:* :===*==== Sequence 100 6 3:* :==*=== 102 5 2:* :=*=== 104 9 2:* :=*======= For each Sequence 106 4 1:* :*=== 108 5 1:* :*==== Sequence 110 4 1:* :*=== 112 4 1:* :*=== sequence 114 4 1:* :*=== Sequence 116 6 0:= *====== 118 1 0:= *= Sequence >120 32 0:== *================================ Sequence Sequence Sequence 11 Sequence
  • 17. A Sequence alignment algorithms C • String matching to find common substrings G (mutations) and • Allowing for mismatches gaps (insertions/deletions) • Dynamic ProgrammingG solution • Needleman-Wunsch - Global Alignments; Smith-Waterman - Local alignments T (SSEARCH, GGSEARCH; water, needle) • Speedups by looking for common words (KTUPLES - 2-7) to seed the alignment start • Basically what happens in BLAST, FASTA DNA sequences, dynamic programming algorithms are available that can calculate the optimal score in • Heursitics typically employ a fixed insertion O(MN) steps. This efficiency is achieved by determining the optimal score for each prefix of each string, andpenalty and cutoffs prefix byscores so that three paths that can be used to extend an alignment: then extending each on low considering the (1) DPextending the alignment by one residue in each sequence; (2) by extending the alignment by one by matrix is sparse. Pearson, ISMB 2000 residue in the first sequence and aligning it with a gap in the second; or (3) extending the alignment by
  • 18. Table 9: Algorithms for comparing protein and DNA sequences value scoring gap time Sequence alignment algorithms and calculated matrix penalty required complexity - global similarity arbitrary penalty/gap O(n2 ) Needleman and q Wunsch, 1970 (global) distance unity penalty/residue O(n2 ) Sellers, 1974 rk local similarity ˆ Si j < 0.0 affine O(n2 ) Smith and Waterman, 1981 q + rk Gotoh, 1982 approx. local ˆ Si j < 0.0 limited gap size O(n2 )/K Lipman and Pearson, 1985 similarity q + rk Pearson and Lipman, 1988 maximum ˆ Si j < 0.0 multiple O(n2 )/K Altshul et al., 1990 segment score segments lated and unrelated proteins. Pearson, ISMB 2000
  • 19. FASTA time: 2:00 min BLAST time: 20 sec BLAST time: 20 sec Heuristic algorithms for DP searching for sequence similarities 28 Pearson, ISMB 2000
  • 20. Variants on the alignment theme • HMMER (v1, v2) - Profile Hidden Markov Model where sequences are compared to a profile of a multiple sequence alignment and probability that HMM could have emitted query sequence is assessed. • MUMMER - suffix trees, longest increasing subsequence, and Smith-Waterman DP • BLAT - big hash tables to find exact matching or patterns (ie 24 out a 28bp word) in O(1) and string them together. For nearly identical sequences • For protein sequence alignments - HMMER3 is a game changer. • Speed is now basically as fast as BLAST in some cases. New (today!) HMMER3.0b3 is multithread and MPI enabled
  • 21. The need for speed • Acceleration of current algorithms • Hardware acceleration with GPUs and FPGAs • TimeLogic, Paracel (defunct) • SIMD acceleration • SSEARCH using striped Smith-Waterman on x86 (SSE2) • Embarrassingly Parallel problem! • MPI and PVM enabled to split up searches on a cluster, and recombined the score distribution at the end to assess E-value
  • 22. SSEARCH on par with BLAST! Farrar, Bioinformatics 2007
  • 23. Large numbers of genome sequences • 1000+ Bacterial and Archeal genomes • 100s of fungal genomes • 10s of animal and plant genomes • 10s of other eukaryotes • Coming online http://bit.ly/2ruuhM • 1000 human genome project - http://1000genomes.org • 1001 Arabidopsis genomes - http://1001genomes.org • 1000 Drosophila genomes - http://dpgp.org Metagenomics • 15,000 vertebrate genome project (propsed)
  • 24. The coming storm: what to do with all this data? • Can use the more sensitive tools on larger datasets. • Datasets are of course getting larger. Challenge Moore’s law? Producing data faster. • Need to get more efficient in how the data is processed, organized, and accessed • Not discussing the storage and data management challenges
  • 25. 22 0 0: 24 0 0: 26 2 0:= 28 7 3:* 30 7 18:* 32 45 68:===* 34 166 184:========* 36 337 379:================ * The Sequence DB Search problem 38 581 626:=========================== * 40 869 873:=======================================* 42 1009 1067:============================================== * 44 1276 1177:=====================================================*==== 46 1253 1198:======================================================*== 48 1199 1147:====================================================*== 50 1032 1047:===============================================* 1000Xs 52 949 920:=========================================*== 54 838 786:===================================*=== 56 578 657:=========================== * 58 467 539:====================== * 60 393 437:================== * more 62 339 350:===============* Database 64 276 278:============* 66 214 220:=========* 68 188 173:=======*= 70 140 136:======* 72 131 106:====*= Query Sequence 74 88 83:===* 76 71 64:==*= 78 48 50:==* Sequence 80 43 39:=* 82 38 30:=* Generates 84 27 24:=* Sequence 86 21 18:* Pairwise 88 15 14:* Sequence 90 17 11:* Scores 92 7 8:* :=======* = represents 1 library sequence E-value 94 22 7:* :======*=============== Sequence alignment 96 3 5:* :=== * 98 8 4:* :===*==== Sequence 100 6 3:* :==*=== 102 5 2:* :=*=== 104 9 2:* :=*======= For each Sequence 106 4 1:* :*=== 108 5 1:* :*==== Sequence 110 4 1:* :*=== 112 4 1:* :*=== sequence 114 4 1:* :*=== Sequence 116 6 0:= *====== 118 1 0:= *= Sequence >120 32 0:== *================================ Sequence Sequence Sequence 11 Sequence
  • 26. Next Generation Sequencing data challenges • Where did you come from? • Genomic Assembly. Map all the reads (10-30M tags of length 25 - Make contiguous sequence stretches by putting 120 bp) against the genome jigsaw back together (30 - 120bp reads, With mismatch/gaps. sometimes paired). Soln: Hashing, Suffix trees, Burrows-Wheeler Soln: Best solutions are graph based -but many Transformation. BowTie, MAQ, SOAP, BWA paths through the data so lots of memory required. Velvet, ABYSS • Find the errors, mutations, insertions or deletions, or copy number variation. • Show me your genes! Align reads to the genome, but confidently call Align transcript pieces (short-reads from RNA- the differences between the sample and Seq) to the genome, but with large gaps for reference as real variations. introns. Soln: Some bayesian approaches that factor Soln: Allow partial alignments and stitch together quality and number of sequences that support it. those that are paired, extra points for occurring Shortcut is heuristic: 3+ reads that agree = at a splice-site and having multiple independent believable. Can condition on number of agreeing reads that agree. BowTie+TopHat/CuffLinks reads. Mosaik, MAQ
  • 27. Jacob Appelbaum & Donald Knuth Demonstrate The Recursive Homeboys Principle
  • 28. Evolutionary questions • Gene content changes • Gene order evolution • Evolutionary relationships between organisms (branching order) • Evolutionary Rates
  • 29. Comparing gene content among species • Identification of the same genes found in each species • Use similar searches to find significant ones - pairwise problem is fairly straightforward • Combining these for multiple species where missing data is allowed (gene loss) or additional data (gene duplication) makes the problem harder. • Typical methods cluster into Orthologous groups by similarity cutoff, but transitive property is not always met (A similar to B, B similar to C, not always the case that A and C are significant similar or homologous). • Variations on theme using Markov Clustering (MCL) have been applied with some success. • Other approaches include phylogenetic (tree building) for putative clusters to sort out relationships, but is also fraught with false positives and computation complexity.
  • 30. Clustering with MCL http://micans.org/mcl
  • 31. Ortholog and paralog finding still active areas • De novo identification identification of families still has many false positives. Dynamic adjustment of parameters based on sequence length, evolutionary rates, etc may help refine clusters better. • As more genomes are produced, how can we re-build ortholog sets incrementally? • Use HMMs of ortholog sets as searching database to improve sensitivity
  • 32. Gene order evolution • Comparing order of the genes in different organisms. • Conservation of order is referred to as synteny • Why is it important? Shared synteny suggests between different organisms suggests it is the ancestral state. If it has been preserved over time, maybe a functional reason • Also useful for inferring corresponding (orthologous) genes in different species
  • 33. Synteny blocks • Identifying regions that have shared history • Identify anchors, consider ordering of anchors in each organism • Allow for insertions and deletions • Find longest common “substring” of genes. • Genes in a block found in different species may be orthologous
  • 34. Phylogenetic Trees • Represents evolutionary history of organism • Can be used with binary character traits or DNA or protein sequences • Constructed by a calculation of the distance between the taxa • Bayesian and maximum likelihood approaches as well as simpler distance-only (neighbor-joining) and parsimony approaches to represent the data.
  • 35. Tree building future problems • How do we assemble very large trees? In pieces and stitch together? All in one go with huge distributed systems? • Very much an active area of research Need more efficient algorithms and hardware acceleration approaches. • NSF funded CIPRES have been on algorithmic improvements -for large number of taxa from Tree of Life project data. • Tree Viz is also important area - how do we represent the large amount of data? Dynamically and interactively?
  • 36. Evolutionary rates • Calculate evolutionary rates as amount of change in a sequence as compared to others • If the amount of change is greater in some branches suggests interesting phenomenon -identify by comparing the relative rates. • Statistical tests for whether rates are significantly different. Can apply to protein loci and genome alignments. • Open questions: inferring rates in the first place still a hard problem and quite slow. Methods that efficiently scan whole genome of alignments still need work.
  • 37. Comparing conservation profiles Nc TTCCTTACCAGCAGCCCCGGCCGTCGCCCCGCTCTTTCGACCCACGATGTTGTC---GAT Nt TTCCTTACCAGCAGCCCCGGCCGTCGCCCCGCTCTTTCGACTCACGATGTTGTC---GAT Nd TTCCTGACCAGCAGCCCCGGCCGTCGCCCCGCTCTTTCGACCCACGATGTTCTC---GAT Sm TTCCTTACCAGCAGCCCCGGCCGTCGCCCCGCGCTCTCGACTCACGATGTTGTC---GAC Pa TTCTTGACCAGCAGCCCAGGACGCCATCCAGCACT---GACCCACCTTGATGTCCAAGAT Cg TTCCTCACCAGCAGCCCTGGGCGCCACCCGGCTCTGTCGTCGCA---TGACGTCAAAGAT *** * *********** ** ** * ** ** ** * * ** ** ** ** &'()*+,*)-./)-++-.0123-4'.5)*(,.67 !"#$%$$ !"#$#$$ !"#!$$$ !"#!!$$ !"#!9$$ !"#!A$$ !"#!J$$ !"#!"$$ !"#!<$$ !"#!B$$ !"# ?@4/ #9? 9$? "$%&'() &:;$""99 #**&"+,-.)/.-0&123)&4#2#&5!"# -+EFG.!A##9 -+EFG.!A##A 6+7+80-9:4#2#&,/(0;+( C,-+-D-+EFG.!A##9H+>->(+D!AI 4<0);$.7)&=;-07)5->/;?'7;-.7@ ,=-+>/*2+ ! $8" $
  • 38. Figure 3: Subtree ROC curves. 28 Pollard et al, Genome Res, 2009
  • 39. 0- _ _ Primate Conservation by PhastCons _ _ Placental Mammal Conservation by PhastCons _ _ Vertebrate Conservation by PhastCons _ El El El Multiz Alignments of 44 Vertebrates mp la an us et er ur by w C chr18: 27174000 UCSC Genes Based on RefSeq, UniProt, GenBank, CCDS and Comparative Genomics DSG1 se Vertebrate Multiz Alignment & Conservation (44 Species) at 3_ Vertebrate Basewise Conservation by PhyloP at ig el 0- bit r18: 27162280 27162290 27162300 27162310 ka CATCCATAGT TGATCGAGAGGTCACTCCT T TCT TCAT TGTAAGTGGA ---> ca -6 _ UCSC Genes Based on RefSeq, UniProt, GenBank, CCDS and Comparative Genomics in SG1 T S I V D R E V T P F F I 1_ Vertebrate Conservation by PhastCons w se Vertebrate Multiz Alignment & Conservation (44 Species) 5_ Vertebrate Basewise Conservation by PhyloP Simultaneously identifying slow and rapidly evolving regions in the human Pollard et al, Genome Res, 2009 genome
  • 40. More open areas in computational biology • Image processing - decoding visual images into computable data Microscopy images for phenotypes (trait) analyses: How big is that neuron cell? After I apply this drug? Patterns of gene expression through staining and visualization - what is the pattern? • Protein 3-D structure analysis and prediction Molecular dynamic simulations to take 1D protein sequence to 2D and 3D sequence structure predictions. Predictions are still very poor - using hints of known sub-sequences and their folding pattern. {Folding@Home} • RNA structure prediction Identification of the secondary structure of sequences via thermodynamic and base-pairing rules. Finding real versus artifact and comparing this to sequence conservation.
  • 41. Still more • Gene network reconstruction from time series data How do genes function together and are dependent on each other. • Gene identification and annotation Use of computation and experimental data to define the regions of the genome which are transcribed into genes • Gene function prediction Using the known annotation of function of some genes can we infer the function of others using ontologies of gene function and phylogenetic relatedness of the genes and species
  • 42. For more information http://bioinfo.ucr.edu http://lab.stajich.org http://bioperl.org

Editor's Notes

  1. Introduction of introns into genes seems inefficient What is their purpose, why are they still there, how are they removed or inserted