DNA Sequence Data in Big Data Perspective

Opportunities and Constraints

Palaniappan SP
connectsp2012@gmail.com

Request Note

I prepared this presentation entirely from internet research, with the intent of
sharing it as a way of giving back to society. Please send your comments and suggestions to the
mail ID above. They would help improve the value and benefit of this presentation.

18-Nov-12 2

Core areas where DNA sequencing is employed

• Academic research
• understanding gene expression/regulation
• phylogeny, demography and evolution research
• Oncology
• understanding DNA’s role in cancer cells

• finding ways to tune gene expression for cancer abatement or prevention
• Gene therapy
• Using recombinant DNA to suppress / modify / induce gene expression to
address genetic disorder based diseases / malfunction


More areas where DNA sequencing is employed

• Developing Genetically Modified Organisms through recombinant
research
• salt tolerant/drought tolerant/disease resistant cultivable crops
• microbes producing more of therapeutic compounds, proteins
• microbes for environment cleaning
• pro-biotic lactobacilli

• Clinical diagnosis - diagnosing gene-sequence-correlated diseases /
infections e.g. HIV
• Forensic analysis - DNA fingerprint profiling to identify crime suspects

• Pedigree analysis - to establish parental lineage in legal disputes


Agencies engaged in DNA sequencing

• Non-Profit Research Laboratories in Universities, Institutes

• Clinical labs of government and private hospitals

• Commercial organizations engaged in DNA sequencing for payment


Databases maintaining DNA sequence data
In the early period, the data were generated and analyzed only by a few research
institutes, such as members of the Human Genome Project. Later, as the
data grew in size and geographic spread, many more databases were created and made
available to research communities.
Following are some examples:
• NCBI - National Center for Biotechnology Information (GenBank)
http://www.ncbi.nlm.nih.gov/guide/data-software/#databases_
• EBI - European Bioinformatics Institute (EMBL)
http://www.ebi.ac.uk/Databases/
• EMNEW - Index of New EMBL Nucleotides (EBI)
http://srs.ebi.ac.uk/srsbin/cgi-bin/wgetz?-page+databanks
• DDBJ - DNA Data Bank of Japan
http://www.ddbj.nig.ac.jp/
• As per the Nucleic Acids Research database issue dated Jan 2012, there are as many
as 1380 databases
http://www.oxfordjournals.org/nar/database/a/

DNA Sequencing capability has grown exponentially

[Figure: DNA sequences in GenBank over time (doubling time = 18 months); sequencing cost compared with data analyzing cost]

Source: Bioinformatics Challenges of High-Throughput DNA Sequencing by Stuart M. Brown, Ph.D,
New York University www.med.nyu.edu/rcr/rcr/course/NexGen-2010.ppt
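
To make the 18-month doubling time concrete, here is a small arithmetic sketch. It assumes steady exponential growth, N(t) = N0 * 2^(t/18) with t in months, which is a simplification and not a claim about the actual GenBank curve. The function name is hypothetical.

```python
def growth_factor(months: float, doubling_time_months: float = 18.0) -> float:
    """Fold increase after `months`, assuming steady exponential doubling."""
    return 2.0 ** (months / doubling_time_months)


# Illustrative horizons only; these are not measured GenBank figures.
for years in (3, 5, 10):
    print(f"{years:>2} years: about {growth_factor(years * 12):.1f}x the starting size")
```

Under this assumption the data volume quadruples every three years, which is why storage and analysis capacity have to be planned for growth rather than for the current size.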

Big Data Of DNA Sequence Is Different From Other Big Data

• Most databases have built-in search engines with predefined filters to narrow down
the search. These web-based tools also restrict the input/output file formats, so integrating
customized search tools with the databases still needs to be worked out.
• Too much tool customization limits interoperability across software platforms.
• Unlike conventional RDBMS setups, there is no Dev region, Test region or sandbox
environment where one can test tools, scripts and queries freely.
• Data mining is not confined to one or a few data sources. There are more than 1300
databases; some are primary (comparable to tables) and some are derived (comparable to table views). The
same data is likely replicated in many databases.
• Analytics does not look for an exact match of a search string, or of a cluster of strings defined by a
complex query with multiple joins and unions. Analysis is mostly based on the percentage match
to a given sequence. A variety of computational approaches, such as dynamic programming,
heuristic algorithms and probabilistic methods, are used for sequence alignment.
• There are too many tools, search engines, software packages and scripting languages, so it is difficult to find or
validate a software component framework or technology toolbox.
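
The difference between exact string matching and percentage matching can be sketched with a small dynamic-programming alignment. This is a minimal Needleman-Wunsch global alignment with assumed scoring values (match +1, mismatch -1, gap -2), not the scoring scheme of any particular tool. Production tools such as BLAST use heuristics and tuned substitution matrices instead.

```python
def needleman_wunsch(a, b, match=1, mismatch=-1, gap=-2):
    """Globally align two sequences; returns the two gapped strings."""
    n, m = len(a), len(b)
    # score[i][j] = best score for aligning a[:i] with b[:j]
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return "".join(reversed(out_a)), "".join(reversed(out_b))


def percent_identity(aligned_a, aligned_b):
    """Share of alignment columns where both sequences carry the same base."""
    same = sum(x == y and x != "-" for x, y in zip(aligned_a, aligned_b))
    return 100.0 * same / len(aligned_a)


# Hypothetical toy sequences, chosen only for illustration.
top, bottom = needleman_wunsch("ACGT", "ACT")
print(top, bottom, percent_identity(top, bottom))
```

An exact-match query would report no hit for these two strings, whereas the alignment reports a high percentage match with one gap.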


Using Big Data of Gene Sequence – Examples

• Identifying gene sequences relevant to a specific biochemical/metabolic
pathway using transcriptional "fingerprints"
• Understanding gene regulation exerted by promoters and suppressors
• Whole genome sequence analysis for disease control and human healthcare
• Gene profiling for
― Structural genes (coding for mRNA, rRNA, tRNA)
― Functional genes (coding for promoter, operator, terminator)
― Regulatory genes (coding for repressor protein that binds to operator)
― Putative genes, which are not evidently associated with any protein produced or
function performed
― Sequences of interest with reference to SNPs, SVs, indels, ChIP


Big Data of Gene Sequence – More Examples

• Extrinsic gene-finding systems for gene annotation
• Understanding the genetic basis for multi-drug resistance of superbugs so as
to develop alternative control measures
• Targeted drug delivery against pathogens
• Localized gene therapy for infectious diseases or inherited disorders
• String mining / sequence mining, itemset mining, association rule mining
– Data mining can help us in two ways: 1) understanding the genetic mechanisms
of regulation and expression of phenotypes, and 2) retrieving genes or
genetic information that could be converted into a process technology,
a diagnostic tool or a therapeutic technique.
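
One simple form of the sequence mining mentioned above is counting k-mers (substrings of length k) and keeping those that occur in at least a minimum number of sequences, which is the same support idea used in itemset mining. The sketch below uses hypothetical function names and made-up sequences. Real pipelines would also handle ambiguous bases and both strands.

```python
from collections import Counter


def kmer_counts(seq, k):
    """Count every overlapping substring of length k in one sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def frequent_kmers(seqs, k, min_support):
    """Return k-mers present in at least `min_support` of the sequences."""
    support = Counter()
    for s in seqs:
        # set(...) so each sequence contributes at most once per k-mer
        support.update(set(kmer_counts(s, k)))
    return {kmer for kmer, n in support.items() if n >= min_support}


# Made-up sequences for illustration only.
sequences = ["ACGTAC", "TTACGA", "GGACGT"]
print(kmer_counts("ATATAT", 2))
print(frequent_kmers(sequences, k=3, min_support=3))
```

A k-mer shared by many otherwise different sequences is a candidate motif worth checking against annotation, not a finding in itself.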


Big Data Management – A Generic Approach


Analyzing DNA databases – Some Practical Constraints

• Reading frame alignment. Every sequence can be read in three different
reading frames per strand (six across both strands of double-stranded DNA),
each of which could be converted into a derived amino acid sequence.
• Presence of exons and introns: RNA splicing and the possibility of alternative splicing
make the analysis more complex.
• Silent mutations (codon redundancy, SNPs). It is difficult to distinguish a
silent mutation from a sequencing error. Sequencing errors are possible
because of the complexity of sample preparation, sequencing, assembly and
analysis of sequence data. Such cases can be resolved only by
repeat runs.
• Since DNA is prepared from a host of cells, the sequence we get is,
in effect, an average of the DNA sequences from all sample cells.
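
The reading-frame and silent-mutation points can be illustrated with a short sketch. It assumes the standard genetic code and plain A/C/G/T input. The function names and the example sequence are hypothetical, and codons containing other symbols are reported as "X".

```python
from itertools import product

# Standard genetic code, with codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(c): aa
               for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def reverse_complement(seq):
    """Return the opposite strand, read 5' to 3'."""
    return seq.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(seq):
    """Translate complete codons only; '*' marks a stop codon."""
    return "".join(CODON_TABLE.get(seq[i:i + 3], "X")
                   for i in range(0, len(seq) - 2, 3))


def six_frames(seq):
    """Translate the three frames of each strand, keyed '+1'..'+3', '-1'..'-3'."""
    seq = seq.upper()
    frames = {}
    for strand, s in (("+", seq), ("-", reverse_complement(seq))):
        for offset in range(3):
            frames[f"{strand}{offset + 1}"] = translate(s[offset:])
    return frames


def is_synonymous(codon_a, codon_b):
    """True when two codons encode the same amino acid (a silent change)."""
    return CODON_TABLE[codon_a] == CODON_TABLE[codon_b]


# Hypothetical toy sequence for illustration.
for name, protein in six_frames("ATGGCCTAA").items():
    print(name, protein)
print(is_synonymous("GCC", "GCT"), is_synonymous("GCC", "GAC"))
```

Each frame yields a different candidate peptide from the same bases, and a change such as GCC to GCT leaves the amino acid unchanged, which is why a single differing base cannot be classed as a silent mutation or a sequencing error from the sequence alone.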


Analyzing DNA databases – Practical Constraints - contd.
• The significance of non-coding DNA is yet to be fully understood. In many
eukaryotes, up to 99% of an organism's total genome size is non-coding
DNA. More than 98% of the human genome does not encode protein
sequences. A fraction of the non-coding sequence is reported to regulate
gene expression.
• Sequence matching is based on statistical analysis and not on exact data
matching.
• The reference human genome data may not represent the global population.
However, as more sequence information from different
geographic regions is added, the reference will become more global.
The Genome Reference Consortium is an international body that takes
care of genome curation.
• In a short span of time, the cost of sequencing has come down drastically.
Still, the sequencing fee (around $1000 per individual) is
expensive for countries like India.

Software Tools Used For Sequence Analysis

• A large number of tools are available on the internet
• The tools are used for
― sequence comparison/alignment
― searching databases and retrieving catalogued reference sequences
― assembling short sequence strings to get the complete sequence of a gene
― retrieving sequence information for constructing primers / oligo probes
― converting nucleic acid sequence to protein sequence (translation)
― converting protein sequence to nucleic acid sequence (back-translation)
― multiple sequence alignment
• Scripting languages – Perl, Python, Ruby. BioPerl, Biopython and BioRuby are
frameworks that can be readily used for data mining.
• SWIG – a tool to generate scripting-language interfaces for C/C++ code. It improves interoperability
of scripts
• SourceForge, GitHub – commonly used code-hosting platforms for keeping
software and tool versions under version control
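
As an illustration of the kind of scripting these tools support, the sketch below reads FASTA-formatted text and reports GC content using only the Python standard library. It is a minimal stand-in with hypothetical function names and made-up records. Frameworks such as Biopython provide tested parsers (for example `Bio.SeqIO.parse`) that should be preferred for real work.

```python
import io


def read_fasta(handle):
    """Yield (header, sequence) pairs from FASTA-formatted lines."""
    header, chunks = None, []
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        else:
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def gc_content(seq):
    """Fraction of G and C bases; 0.0 for an empty sequence."""
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq) if seq else 0.0


# Made-up records standing in for a downloaded FASTA file.
example = io.StringIO(">seq1 hypothetical record\nATGC\nGGCC\n>seq2\nATAT\n")
for header, seq in read_fasta(example):
    print(header, len(seq), round(gc_content(seq), 2))
```

The same generator works on an open file handle, so a file saved from one of the databases listed earlier can be processed record by record without loading it fully into memory.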


Some Commonly Used Tools
• Ensembl Genome Browser
• UCSC Genome Browser
• Entrez - Integrated, text-based search and retrieval tool used at NCBI
• RSAT - Regulatory Sequence Analysis Tools - tools dedicated to the detection of regulatory signals in non-coding sequences
• BLAST, FASTA - Tools used for sequence alignment - to compare a query sequence with those available in a
database
• ClustalW - Multiple sequence alignment program
• GeneMarker® - A commercial tool for forensic profiling
• Seq Anal - A collection of tools to search, align and analyze DNA sequences
• Galaxy Tools can also be used to search, align and analyze DNA data
• Codon Suite - Codon-based sequence analysis
• Transeq, Backtranseq - Tools to translate or back-translate between nucleotide and peptide sequences.
• ReadSeq: Molecular sequence format converter
• FASTLINK - Used to map genes and find the approximate location of disease genes.
• DnaSP - A software package for the analysis of nucleotide polymorphism from aligned DNA sequence data
• MATCH™ - A tool for searching transcription factor binding sites in DNA sequences
• PLINK is a free, open-source whole genome association analysis toolset, designed to perform a range of
basic, large-scale analyses
• GeneMark™ - Free gene prediction software
• Genscan - A widely used ab initio gene predictor
• A longer list of gene prediction software is available at
http://en.wikipedia.org/wiki/List_of_gene_prediction_software

