Opportunities and Constraints




Palaniappan SP
connectsp2012@gmail.com
Request Note


  I prepared this presentation entirely from internet research, with the intent of sharing it
  as a way of giving back to society. Please send your comments and suggestions to the e-mail
  address above. They would help improve the value and usefulness of this presentation.




18-Nov-12                                                                                      2
Core areas where DNA sequencing is employed


 • Academic research
            •   understanding gene expression/regulation
            •   phylogeny, demography and evolution research
 • Oncology
      • understanding DNA’s role in cancer cells

            •   finding ways to tune gene expression for cancer abatement or prevention
 • Gene therapy
            •   using recombinant DNA to suppress, modify or induce gene expression to
                treat diseases and malfunctions caused by genetic disorders




More areas where DNA sequencing is employed

 • Developing genetically modified organisms through recombinant DNA
    research
            •   salt-tolerant, drought-tolerant and disease-resistant crops
            •   microbes that produce higher yields of therapeutic compounds and proteins
            •   microbes for environmental clean-up
            •   probiotic lactobacilli

 • Clinical diagnosis - diagnosing diseases and infections that correlate with a
    gene sequence, e.g. HIV
 • Forensic analysis - DNA fingerprint profiling to identify crime suspects

 • Pedigree analysis - to establish parental lineage in legal disputes



Agencies engaged in DNA sequencing


 • Non-profit research laboratories in universities and institutes

 • Clinical labs of government and private hospitals

 • Commercial organizations that offer DNA sequencing as a paid service




Databases maintaining DNA sequence data
  In the early period, sequence data were generated and analyzed by only a few research
  institutes, such as the members of the Human Genome Project. As the data grew in
  size and geographic spread, many more databases were created and made available
  to the research community.
  Some examples follow:
      • NCBI - National Center for Biotechnology Information (GenBank)
         http://www.ncbi.nlm.nih.gov/guide/data-software/#databases_
      • EBI - European Bioinformatics Institute (EMBL)
         http://www.ebi.ac.uk/Databases/
      • EMNEW - Index of New EMBL Nucleotides (EBI)
         http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-page+databanks
      • DDBJ - DNA Data Bank of Japan
         http://www.ddbj.nig.ac.jp/
      • According to the Nucleic Acids Research database issue of January 2012, there
         are as many as 1,380 databases.
         http://www.oxfordjournals.org/nar/database/a/
DNA Sequencing capability has grown exponentially

            [Figure: DNA sequences in GenBank, doubling time = 18 months.
             Chart labels: Sequencing Cost, Data Analyzing Cost]




  Source: Bioinformatics Challenges of High-Throughput DNA Sequencing by Stuart M. Brown, Ph.D,
  New York University www.med.nyu.edu/rcr/rcr/course/NexGen-2010.ppt
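
  As a rough illustration of what an 18-month doubling time implies, the arithmetic can be sketched in Python. The function names here are hypothetical and the 18-month figure is simply the one quoted on this slide; the calculation assumes steady exponential growth, which real databases only approximate.

```python
import math

def growth_factor(months, doubling_time_months=18.0):
    """Factor by which the data grows over `months`, assuming steady doubling."""
    return 2.0 ** (months / doubling_time_months)

def months_to_grow(factor, doubling_time_months=18.0):
    """Months needed to grow by `factor`, under the same assumption."""
    return doubling_time_months * math.log2(factor)

if __name__ == "__main__":
    # Ten years of 18-month doublings is roughly a hundredfold increase.
    print(round(growth_factor(120), 1))
    # Time needed for a thousandfold increase, in years.
    print(round(months_to_grow(1000) / 12, 1))
```

  Under these assumptions the data grows about a hundredfold in ten years, which is why storage and analysis capacity become the bottleneck.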
Big Data Of DNA Sequence Is Different From Other Big Data

  • Most databases have built-in search engines with predefined filters to narrow the
     search. These web-based tools also restrict the input and output file formats, so
     integrating customized search tools with the databases still has to be worked out.
  • Too much tool customization limits interoperability across software platforms.
  • Unlike conventional RDBMS setups, there is no development, test or sandbox
     environment where one can freely try out tools, scripts and queries.
  • Data mining is not confined to one or a few data sources. There are more than 1,300
     databases: some are primary (comparable to tables) and some are derived (comparable
     to table views). The same data is likely to be replicated in many databases.
  • Analytics does not look for an exact match to a search string, or to a cluster of strings
     defined by a complex query with multiple joins and unions. Analysis is mostly based on
     the percentage match of a given sequence. Sequence alignment relies on a variety of
     computational approaches, such as dynamic programming, heuristic algorithms and
     probabilistic methods.
  • There are too many tools, search engines, software packages and scripting languages,
     so it is difficult to find or validate a software component framework or technology toolbox.
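
  To make the contrast with exact string matching concrete, here is a minimal sketch of global alignment by dynamic programming (the Needleman-Wunsch approach), followed by a percent-identity calculation. The scoring values (match +1, mismatch -1, gap -1) and the function names are assumptions chosen for illustration; production tools such as BLAST use heuristics and more elaborate scoring.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align strings a and b; return (aligned_a, aligned_b, score)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b)), score[n][m]

def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns where both sequences carry the same base."""
    same = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * same / len(aligned_a)

if __name__ == "__main__":
    top, bottom, s = global_align("GATTACA", "GATACA")
    print(top)
    print(bottom)
    print(s, round(percent_identity(top, bottom), 1))
```

  The table has one cell per pair of positions, so the cost grows with the product of the two sequence lengths. That is the main reason database-scale searches fall back on heuristic methods.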

Using Big Data of Gene Sequence – Examples


  • Identifying gene sequences relevant to a specific biochemical/metabolic
     pathway using transcriptional "fingerprints"
  • Understanding gene regulation exerted by promoters and suppressors
  • Whole-genome sequence analysis for disease control and human healthcare
  • Gene profiling for
     ―   Structural genes (coding for mRNA, rRNA, tRNA)
     ―   Functional genes (promoter, operator and terminator sequences)
     ―   Regulatory genes (coding for the repressor protein that binds to the operator)
     ―   Putative genes that are not evidently associated with any protein produced or
         function performed
     ―   Sequences of interest with reference to SNPs, structural variants (SVs), indels
         and ChIP data


Big Data of Gene Sequence – More Examples

  • Extrinsic gene-finding systems for gene annotation
  • Understanding the genetic basis of multi-drug resistance in superbugs so as
     to develop alternative control measures
  • Targeted drug delivery against pathogens
  • Localized gene therapy for infectious diseases or inherited disorders
  • String mining / sequence mining, itemset mining and association rule mining.
     Data mining can help in two ways: 1) understanding the genetic mechanisms
     behind the regulation and expression of phenotypes, and 2) retrieving genes or
     genetic information that could be converted into a process technology,
     a diagnostic tool or a therapeutic technique.



Big Data Management – A Generic Approach




Analyzing DNA databases – Some Practical Constraints


  • Reading frame alignment. Every sequence can be read in three different
     reading frames on each strand (six if the reverse complement is included),
     and each frame translates into a different derived amino acid sequence.
  • Exons and introns. RNA splicing, and the possibility of alternative splicing,
     make the analysis more complex.
  • Silent mutations, codon redundancy and SNPs. It is difficult to distinguish a
     silent mutation from a sequencing error. Sequencing errors can arise from the
     complexity of sample preparation, sequencing, assembly and analysis of
     sequence data. Such cases can be resolved only by repeat runs.
  • Since DNA is prepared from a large number of cells, the sequence obtained
     is, in effect, an average of the DNA sequences of all sampled cells.
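
  The reading-frame and silent-mutation points above can be illustrated with a short Python sketch. It builds the standard genetic code table, translates a sequence in each forward frame, and shows two codons that differ by one base yet encode the same amino acid. The example sequences and function names are made up for illustration; libraries such as Biopython provide equivalent, better-tested functionality.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G ('*' = stop).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

def translate(seq, frame=0):
    """Translate seq starting at offset `frame` (0, 1 or 2); ignore a trailing partial codon."""
    return "".join(CODON_TABLE[seq[i:i + 3]]
                   for i in range(frame, len(seq) - 2, 3))

def reverse_complement(seq):
    """Return the reverse complement, i.e. the opposite strand read 5' to 3'."""
    pairs = {"A": "T", "T": "A", "G": "C", "C": "G"}
    return "".join(pairs[base] for base in reversed(seq))

def six_frames(seq):
    """Translations of the three forward frames and the three reverse-complement frames."""
    rc = reverse_complement(seq)
    return [translate(seq, f) for f in range(3)] + [translate(rc, f) for f in range(3)]

if __name__ == "__main__":
    dna = "ATGGCCTAA"  # hypothetical example sequence
    for label, protein in zip(["+1", "+2", "+3", "-1", "-2", "-3"], six_frames(dna)):
        print(label, protein)
    # Silent mutation: GCC -> GCT changes the DNA but not the amino acid (both Ala).
    print(translate("ATGGCCTAA") == translate("ATGGCTTAA"))
```

  Because the two sequences in the last line translate identically, a C/T difference at that position cannot be classified from the protein alone. This is why repeat runs are needed to separate a true silent mutation from a sequencing error.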


Analyzing DNA databases – Practical Constraints - contd.
  • The significance of non-coding DNA is not yet fully understood. In many
     eukaryotes, up to 99% of an organism's total genome size is non-coding
     DNA. More than 98% of the human genome does not encode protein
     sequences. A fraction of the non-coding sequence is reported to regulate
     gene expression.
  • Sequence matching is based on statistical analysis, not on exact data
     matching.
  • The reference human genome may not represent the global population.
     However, as more sequence information from different geographic
     regions is added, the reference will become more global.
     The Genome Reference Consortium is an international body that takes
     care of genome curation.
  • The cost of sequencing has come down drastically in a short span of time.
     Even so, the sequencing fee (around $1,000 per individual) is still
     expensive for countries like India.
Software Tools Used For Sequence Analysis

 • A large number of tools are available on the internet
 • The tools are used for
      ―     sequence comparison/alignment
      ―     searching databases and retrieving catalogued reference sequences
      ―     assembling short sequence reads into the complete sequence of a gene
      ―     retrieving sequence information for constructing primers / oligo probes
      ―     translating a nucleic acid sequence into a protein sequence
      ―     back-translating a protein sequence into a nucleic acid sequence
      ―     multiple sequence alignment
 • Scripting languages: Perl, Python, Ruby. BioPerl, Biopython and BioRuby are
     frameworks that can be readily used for data mining.
 • SWIG: a tool that generates scripting-language interfaces for C/C++ code. It improves
     the interoperability of scripts
 • SourceForge, GitHub: commonly used code-hosting platforms with version control, used
     to maintain software and tool versions
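
 As one example of the tasks listed above, assembling short reads rests on finding where the end of one read overlaps the start of another. The sketch below merges two reads on their longest exact suffix-prefix overlap. The read strings, the function names and the minimum overlap of 3 are assumptions for illustration; real assemblers tolerate sequencing errors and work on millions of reads.

```python
def longest_overlap(left, right, min_len=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    for k in range(min(len(left), len(right)), min_len - 1, -1):
        if left.endswith(right[:k]):
            return k
    return 0

def merge_reads(left, right, min_len=3):
    """Merge two reads on their overlap; return None if the overlap is too short."""
    k = longest_overlap(left, right, min_len)
    if k == 0:
        return None
    return left + right[k:]

if __name__ == "__main__":
    # Hypothetical reads sharing the four bases "CCTT".
    print(merge_reads("ATGGCCTT", "CCTTAGGA"))
```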

Some Commonly Used Tools
 •   Ensembl Genome Browser
 •   UCSC Genome Browser
 •   Entrez - Integrated, text-based search and retrieval tool used at NCBI
 •   RSAT - Regulatory Sequence Analysis Tools - tools dedicated to the detection of regulatory signals in
     non-coding sequences
 •   BLAST, FASTA - Tools used for sequence alignment - to compare a query sequence with those available in a
     database
 •   ClustalW - Multiple sequence alignment program
 •   GeneMarker® - A commercial tool for forensic profiling
 •   Seq Anal - A collection of tools to search, align and analyze DNA sequences
 •   Galaxy - Tools that can also be used to search, align and analyze DNA data
 •   Codon Suite - Codon-based sequence analysis
 •   Transeq, Backtranseq - Tools to translate or back-translate between nucleotide and peptide sequences.
 •   ReadSeq: Molecular sequence format converter
 •   FASTLINK - Used to map genes and find the approximate location of disease genes.
 •   DnaSP - A software package for the analysis of nucleotide polymorphism from aligned DNA sequence data
 •   MATCH™ - A tool for searching transcription factor binding sites in DNA sequences
 •   PLINK - A free, open-source whole-genome association analysis toolset, designed to perform a range of
     basic, large-scale analyses
 •   GeneMark™ - Free gene prediction software
 •   Genscan - A widely used ab initio gene predictor
 •   A longer list of gene prediction software is available at
     http://en.wikipedia.org/wiki/List_of_gene_prediction_software


DNA Sequence Data in Big Data Perspective
