DNA Sequence Data in Big Data Perspective


Published in: Education


  1. Opportunities and Constraints. Palaniappan SP, connectsp2012@gmail.com
  2. Request Note: I prepared this presentation entirely from internet research, with the intent of sharing it as a way of giving back to society. Please send your comments and suggestions to the mail ID above; they would help improve the value and benefit of this presentation. (18-Nov-12)
  3. Core areas where DNA sequencing is employed
     • Academic research
         • understanding gene expression/regulation
         • phylogeny, demography and evolution research
     • Oncology
         • understanding DNA's role in cancer cells
         • finding ways to tune gene expression for cancer abatement or prevention
     • Gene therapy
         • using recombinant DNA to suppress, modify or induce gene expression to address diseases and malfunctions caused by genetic disorders
  4. More areas where DNA sequencing is employed
     • Developing genetically modified organisms through recombinant research
         • salt-tolerant, drought-tolerant and disease-resistant cultivable crops
         • microbes producing more therapeutic compounds and proteins
         • microbes for environmental cleaning
         • probiotic lactobacilli
     • Clinical diagnosis: diagnosing gene-sequence-correlated diseases and infections, e.g. HIV
     • Forensic analysis: DNA fingerprint profiling to identify crime suspects
     • Pedigree analysis: establishing parental lineage in legal disputes
  5. Agencies engaged in DNA sequencing
     • Non-profit research laboratories in universities and institutes
     • Clinical labs of government and private hospitals
     • Commercial organizations that perform DNA sequencing for payment
  6. Databases maintaining DNA sequence data
     In the early period, sequence data were generated and analyzed by only a few research institutes, such as the members of the Human Genome Project. As the data grew in size and geographic coverage, many databases were created and made available to the research community. Some examples:
     • NCBI - National Center for Biotechnology Information (GenBank) http://www.ncbi.nlm.nih.gov/guide/data-software/#databases_
     • EBI - European Bioinformatics Institute (EMBL) http://www.ebi.ac.uk/Databases/
     • EMNEW - Index of New EMBL Nucleotides (EBI) http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-page+databanks
     • DDBJ - DNA Data Bank of Japan http://www.ddbj.nig.ac.jp/
     • According to the Nucleic Acids Research database issue of Jan 2012, there are as many as 1380 databases. http://www.oxfordjournals.org/nar/database/a/
  7. DNA sequencing capability has grown exponentially
     [Chart: DNA sequences in GenBank (doubling time = 18 months); sequencing cost; data analysis cost]
     Source: Bioinformatics Challenges of High-Throughput DNA Sequencing by Stuart M. Brown, Ph.D., New York University, www.med.nyu.edu/rcr/rcr/course/NexGen-2010.ppt
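To make the 18-month doubling time concrete, the short sketch below computes the growth factor implied by a fixed doubling time. The function name and the ten-year horizon are illustrative assumptions, not figures from the cited source.

```python
def growth_factor(months: float, doubling_months: float = 18.0) -> float:
    """Factor by which a quantity grows after `months`,
    assuming a constant doubling time (an idealized model)."""
    return 2.0 ** (months / doubling_months)


if __name__ == "__main__":
    # Hypothetical horizon: ten years (120 months) at an 18-month doubling time.
    print(round(growth_factor(120), 1))  # about a hundredfold
```

Under this idealized model, a database that doubles every 18 months grows roughly a hundredfold in ten years.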
  8. Big Data of DNA Sequence Is Different From Other Big Data
     • Most of the databases have built-in search engines with predefined filters to narrow down a search. These web-based tools also restrict the input/output file formats, so integrating customized search tools with the databases still needs to be worked out.
     • Too much tool customization limits interoperability across software platforms.
     • Unlike conventional RDBMS databases, there is no Dev region, Test region or sandbox environment where one can freely test tools, scripts and queries.
     • Data mining is not confined to one or a few data sources. There are more than 1300 databases; some are primary (comparable to tables) and some are derived (comparable to table views). The same data is likely to be replicated in many databases.
     • Analytics does not look for an exact match to a search string, or to a cluster of strings defined by a complex query with multiple joins and unions. Analysis is mostly based on the percentage match to a given sequence. Sequence alignment uses a variety of computational approaches, such as dynamic programming, heuristic algorithms and probabilistic methods.
     • There are too many tools, search engines, software packages and scripting languages, which makes it difficult to find or validate a software component framework or technology toolbox.
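The idea of percentage matching through dynamic programming can be sketched with a small global alignment in the style of Needleman-Wunsch. This is a minimal illustration, not how BLAST or FASTA work internally (those rely on heuristics); the scoring values (match +1, mismatch -1, gap -1) and the function names are assumptions chosen for the example.

```python
def global_align(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two sequences with dynamic programming and
    return the two aligned strings ('-' marks a gap)."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a, aligned_b):
    """Percentage of alignment columns where both sequences carry the same base."""
    if not aligned_a:
        return 0.0
    same = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * same / len(aligned_a)


if __name__ == "__main__":
    x, y = global_align("ACGT", "ACT")
    print(x, y, percent_identity(x, y))  # ACGT AC-T 75.0
```

The result is a similarity score rather than a yes/no match, which is why such queries do not map cleanly onto exact-match lookups with joins and unions.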
  9. Using Big Data of Gene Sequence – Examples
     • Identifying gene sequences relevant to a specific biochemical/metabolic pathway using transcriptional "fingerprints"
     • Understanding gene regulation exerted by promoters and suppressors
     • Whole genome sequence analysis for disease control and human healthcare
     • Gene profiling for
         • structural genes (coding for mRNA, rRNA, tRNA)
         • functional genes (coding for promoter, operator, terminator)
         • regulatory genes (coding for repressor protein that binds to operator)
         • putative genes, which are not evidently associated with any protein produced or function performed
         • sequences of interest with reference to SNPs, SVs, indels, ChIP
  10. Big Data of Gene Sequence – More Examples
      • Extrinsic gene-finding systems for gene annotation
      • Understanding the genetic basis of multi-drug resistance in superbugs, so as to develop alternative control measures
      • Targeted drug delivery against pathogens
      • Localized gene therapy for infectious diseases or inherent disorders
      • String mining / sequence mining, itemset mining, association rule mining. Data mining can help in two ways: 1) understanding the genetic mechanisms that regulate and express phenotypes, and 2) retrieving genes or genetic information that could be converted into a process technology, a diagnostic tool or a therapeutic technique.
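One simple form of the sequence mining mentioned above is counting fixed-length substrings (k-mers) and keeping those that occur in a minimum number of sequences, which mirrors the "support" threshold of itemset mining. The sketch below is illustrative only; the function names, the sample sequences and the thresholds are assumptions.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping substring of length k in one sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def frequent_kmers(seqs, k, min_support):
    """Return the k-mers that occur in at least `min_support` of the sequences."""
    support = Counter()
    for s in seqs:
        # A k-mer counts once per sequence, however often it repeats inside it.
        support.update(set(kmer_counts(s, k)))
    return {kmer for kmer, n in support.items() if n >= min_support}


if __name__ == "__main__":
    reads = ["ACGT", "CGTA", "TTTT"]  # hypothetical toy sequences
    print(frequent_kmers(reads, k=3, min_support=2))  # {'CGT'}
```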
  11. Big Data Management – A Generic Approach
  12. Analyzing DNA databases – Some Practical Constraints
      • Reading frame alignment: every sequence can be read in three different reading frames on each strand (six in all for double-stranded DNA), each of which can be translated into a derived amino acid sequence.
      • Exons and introns: RNA splicing, and the possibility of alternative splicing, make the analysis more complex.
      • Silent mutations, codon redundancy and SNPs: it is difficult to distinguish a silent mutation from a sequencing error. Sequencing errors are possible because of the complexity of sample preparation, sequencing, assembly and analysis of sequence data. Such cases can be resolved only by repeat runs.
      • Since the DNA is prepared from a large number of cells, the sequence we get is ultimately an average of the DNA sequences of all the cells in the sample.
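The reading-frame and codon-redundancy points can be illustrated with a small sketch that translates a sequence in the three forward frames plus the three frames of the reverse complement, and checks whether a codon change is silent. It assumes the standard genetic code and plain A/C/G/T input; the function names and the example sequence are hypothetical.

```python
from itertools import product

# Standard genetic code, laid out in the conventional T, C, A, G codon order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return seq.upper().translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons from position 0; unknown codons become 'X'."""
    seq = seq.upper()
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frames(seq):
    """Translate all six reading frames: +1..+3 forward, -1..-3 reverse strand."""
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


def is_silent(codon_a, codon_b):
    """True if two different codons encode the same amino acid."""
    return codon_a != codon_b and CODON_TABLE[codon_a] == CODON_TABLE[codon_b]


if __name__ == "__main__":
    print(six_frames("ATGGCCTAA")["+1"])  # MA*
    print(is_silent("GCC", "GCT"))        # True: both encode alanine
```

Because GCC and GCT both encode alanine, a single-base difference at that position leaves the protein unchanged, which is why sequence data alone cannot tell a silent mutation from a read error.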
  13. Analyzing DNA databases – Practical Constraints (contd.)
      • The significance of non-coding DNA is yet to be understood. In many eukaryotes, up to 99% of an organism's total genome is non-coding DNA. More than 98% of the human genome does not encode protein sequences. A fraction of the non-coding sequence is reported to regulate gene expression.
      • Sequence matching is based on statistical analysis, not on exact data matching.
      • The reference human genome may not represent the global population. However, as more sequence information from different geographic regions is added, the reference will become more global. The Genome Reference Consortium is an international body that takes care of genome curation.
      • The cost of sequencing has come down drastically in a short span of time. Still, the sequencing fee (around $1000 per individual) is expensive for countries like India.
  14. Software Tools Used For Sequence Analysis
      • A large number of tools are available on the internet
      • The tools are used for
          • sequence comparison/alignment
          • searching databases and retrieving catalogued reference sequences
          • assembling short sequence strings to get the complete sequence of a gene
          • retrieving sequence information for constructing primers / oligo probes
          • converting nucleic acid sequence to protein (amino acid) sequence
          • converting protein (amino acid) sequence to nucleic acid sequence
          • multiple sequence alignment
      • Scripting languages: Perl, Python, Ruby. BioPerl, Biopython and BioRuby are libraries that can be readily used for data mining.
      • SWIG: a tool that generates scripting-language interfaces for C/C++ code, which improves interoperability of scripts
      • SourceForge, GitHub: code hosting platforms commonly used to maintain software and tool versions
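As a toy illustration of assembling short sequence strings, the sketch below joins two reads on their longest suffix-prefix overlap. Real assemblers handle sequencing errors, repeats and millions of reads, none of which is attempted here; the function names, the minimum overlap of 3 and the example reads are assumptions.

```python
def overlap(a, b, min_len=3):
    """Length of the longest suffix of `a` that equals a prefix of `b`,
    or 0 if no overlap of at least `min_len` bases exists."""
    for k in range(min(len(a), len(b)), min_len - 1, -1):
        if a.endswith(b[:k]):
            return k
    return 0


def merge_reads(a, b, min_len=3):
    """Join two reads on their overlap; return None if they do not overlap."""
    k = overlap(a, b, min_len)
    return a + b[k:] if k else None


if __name__ == "__main__":
    print(merge_reads("ATGGCC", "GCCTAA"))  # ATGGCCTAA
```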
  15. Some Commonly Used Tools
      • Ensembl Genome Browser
      • UCSC Genome Browser
      • Entrez - integrated, text-based search and retrieval tool used at NCBI
      • RSAT - Regulatory Sequence Analysis Tools, dedicated to the detection of regulatory signals in non-coding sequences
      • BLAST, FASTA - sequence alignment tools that compare a query sequence with the sequences available in a database
      • ClustalW - multiple sequence alignment program
      • GeneMarker® - a commercial tool for forensic profiling
      • Seq Anal - a collection of tools to search, align and analyze DNA sequences
      • Galaxy - its tools can also be used to search, align and analyze DNA data
      • Codon Suite - codon-based sequence analysis
      • Transeq, Backtranseq - tools to translate or back-translate between nucleotide and peptide sequences
      • ReadSeq - molecular sequence format converter
      • FASTLINK - used to map genes and find the approximate location of disease genes
      • DnaSP - a software package for the analysis of nucleotide polymorphism from aligned DNA sequence data
      • MATCH™ - a tool for searching transcription factor binding sites in DNA sequences
      • PLINK - a free, open-source whole genome association analysis toolset, designed to perform a range of basic, large-scale analyses
      • GeneMark™ - free gene prediction software
      • Genscan - a widely used ab initio gene predictor
      • A longer list of gene prediction software: http://en.wikipedia.org/wiki/List_of_gene_prediction_software