This document describes a project using neural networks to classify genes based on their DNA sequences. The project involves preprocessing gene sequence data from E. coli and B. subtilis to create tables of codon and amino acid frequencies. These statistics are split and merged to create training and testing data for a neural network classifier. The neural network uses the frequency statistics as inputs and has two output neurons to classify genes as belonging to one organism or the other. Results are analyzed to evaluate the classification accuracy.
Data Mining-Project Report Gene Classification using Neural Network- Apil Tamang
Data Mining Fall 2013
Project Report
Apil Tamang
Gene Classification using Neural Networks
Introduction
Problem:
Genes play a fundamental role in the life of every living organism. The processes of life are controlled by
proteins that are produced within an organism's cells. Functions such as muscle movement, food
digestion, energy production, waste removal, and the production of antibodies to fight infection are all
controlled by proteins. Fundamental processes such as breathing, heartbeat, growth, and regeneration
all depend on the right kinds of proteins being produced at the right places and times. In other words,
life is sustained by proteins: many different kinds of them!
The synthesis of proteins is controlled by genes. A gene is a stretch of DNA found within an organism's
chromosomes. It consists of a sequence built from four different kinds of DNA bases: (A)denine,
(T)hymine, (G)uanine, and (C)ytosine. Only certain specific regions of the DNA serve as
protein-synthesizing elements. In the simplified view taken here, each gene results in the production of
one kind of protein. The genes for the entire set of proteins available to an organism are found within
its chromosomes. Hence, DNA is also called the code of life.
The entire DNA sequence contains many different kinds of sequences in addition to genes. There are
regions that serve as binding sites for other processes, regions that signal the beginning and end of
gene regions, and regions that serve no known purpose, to name a few. Much of the DNA sequence is
not well understood. For example, it is estimated that only about 2% of the human DNA sequence
serves as genes. Researchers do not know for sure what the purpose of the rest of the DNA is.
In this project, we examine the DNA of simpler organisms (bacteria), where the entire DNA sequence can be
divided into two main categories: the coding (gene) and the non-coding (non-gene) regions. The files
containing sequences for all the known proteins of two organisms, E. coli (Strain MG1655) and
A. Bacillus (abbreviated A. Baci below), were downloaded from the NCBI genome repository. The work is
based on the hypothesis that the coding regions have a certain pattern in their gene statistics, which
makes it possible to distinguish them from the innumerable sequence combinations that can be
constructed from the DNA sequence. We mentioned previously that most of the DNA consists of
non-coding regions. Hence, the first aim is to look at a sample of DNA sequence and tell whether it is a
coding or a non-coding sequence, i.e. a gene or a non-gene region, respectively. We then go a step
further and ask whether the genes of two different organisms show a specific pattern that can be used
to correctly identify which organism a given sample comes from.
Neural Networks:
Neural networks are computer algorithms that can be used to solve many classes of artificial
intelligence problems, ranging from optimization to classification. The major
constituent elements of a neural network are neurons, connections, weights, and transformation
functions. The neurons are modeled after their biological counterparts, which play a central
role in the survival of higher organisms. Biological neurons are specialized cells capable of receiving
electrical signals and transmitting them after some processing. They form interconnections
within the organism and control the movement of virtually every muscle. Neurons
constitute the central nervous system (i.e. the brain and spinal cord) by forming a massive and very
complex web of connections among themselves. Thus, neurons are also the seat of intelligence and
memory in higher organisms.
The neurons in the neural network algorithm have very similar features. They are able to take in an
input and form connections with neighboring neurons. Weights are the strengths of these connections:
each signal that one neuron passes to another during a computation is scaled by the weight of the
connection it travels along. Each neuron processes its weighted inputs via a
mathematical function that can be specified by the user. The output of a network is the overall
collective processing performed by the neurons as they communicate with each other. In this way,
neural networks often provide a black-box problem-solving tool for the end user.
There are many different kinds of neural networks. These networks differ from each
other in the kind of function they use for processing the input, the way their neurons are
interconnected, and the way information is passed around in the system before an end result is
produced. In this class project, we have used a fully connected multilayer perceptron network with the
standard feed-forward, back-propagation learning algorithm. This setup is well suited to classification
and is widely used for this class of problems.
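To make the feed-forward and back-propagation steps concrete, here is a minimal sketch in plain Python. It is not the project's code: the tiny 2-2-2 network, the toy data, the learning rate of 0.5, and all function names are assumptions chosen for illustration. It uses sigmoid neurons and a squared-error loss.

```python
import math
import random


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def forward(w_hidden, w_out, x):
    """Feed-forward pass. Each weight row ends with a bias term."""
    h = [sigmoid(sum(w * v for w, v in zip(row[:-1], x)) + row[-1]) for row in w_hidden]
    y = [sigmoid(sum(w * v for w, v in zip(row[:-1], h)) + row[-1]) for row in w_out]
    return h, y


def train_step(w_hidden, w_out, x, target, lr):
    """One back-propagation update for a single example (squared-error loss)."""
    h, y = forward(w_hidden, w_out, x)
    # Error signal at each output neuron: (y - t) * sigmoid'(z).
    d_out = [(yk - tk) * yk * (1.0 - yk) for yk, tk in zip(y, target)]
    # Propagate the error back to the hidden layer, using the weights before they change.
    d_hid = [hj * (1.0 - hj) * sum(d_out[k] * w_out[k][j] for k in range(len(w_out)))
             for j, hj in enumerate(h)]
    for k, row in enumerate(w_out):
        for j in range(len(h)):
            row[j] -= lr * d_out[k] * h[j]
        row[-1] -= lr * d_out[k]
    for j, row in enumerate(w_hidden):
        for i in range(len(x)):
            row[i] -= lr * d_hid[j] * x[i]
        row[-1] -= lr * d_hid[j]


def mean_squared_error(w_hidden, w_out, data):
    total = 0.0
    for x, t in data:
        _, y = forward(w_hidden, w_out, x)
        total += sum((yk - tk) ** 2 for yk, tk in zip(y, t))
    return total / len(data)


# Hypothetical toy data: two inputs, two classes encoded as one output neuron each.
TOY_DATA = [([0.9, 0.1], [1.0, 0.0]), ([0.8, 0.2], [1.0, 0.0]),
            ([0.1, 0.9], [0.0, 1.0]), ([0.2, 0.8], [0.0, 1.0])]

rng = random.Random(0)
w_hidden = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]
w_out = [[rng.uniform(-0.5, 0.5) for _ in range(3)] for _ in range(2)]

error_before = mean_squared_error(w_hidden, w_out, TOY_DATA)
for _ in range(500):
    for x, t in TOY_DATA:
        train_step(w_hidden, w_out, x, t, lr=0.5)
error_after = mean_squared_error(w_hidden, w_out, TOY_DATA)
```

The error after training should be lower than before; the same two passes scale to the larger networks used in the project.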
Methodology
Data preprocessing:
The building block of proteins is the amino acid. Each amino acid is encoded by a set of three
consecutive DNA bases. This set of three bases is referred to as a codon. Given that there are 4 kinds of
DNA bases, there are 4 x 4 x 4 = 64 possible codons. There are 20 standard amino acids, and three of
the 64 codons act as stop signals rather than coding for an amino acid. Hence, there is a many-to-one
mapping from codons to amino acids.
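The count of 64 codons and the many-to-one mapping can be checked with a short sketch. The mapping shown is only a small excerpt of the standard genetic code (one-letter amino acid symbols), included for illustration; the variable names are assumptions.

```python
from itertools import product

BASES = "ACGT"

# Every ordered triple of bases is a codon: 4 * 4 * 4 = 64 of them.
codons = ["".join(triplet) for triplet in product(BASES, repeat=3)]

# Excerpt of the standard genetic code, for illustration only.
# Four different codons map to alanine (A); methionine (M) and
# tryptophan (W) each have a single codon.
EXAMPLE_CODON_TO_AA = {
    "GCT": "A", "GCC": "A", "GCA": "A", "GCG": "A",
    "ATG": "M",
    "TGG": "W",
}

codons_per_amino_acid = {}
for codon, amino_acid in EXAMPLE_CODON_TO_AA.items():
    codons_per_amino_acid.setdefault(amino_acid, []).append(codon)
```

Here `codons_per_amino_acid["A"]` holds four codons while `"M"` and `"W"` hold one each, which is the many-to-one relationship described above.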
Forming the gene statistics for this project consists of taking a gene sequence and deriving the
normalized frequencies of all its codons and amino acids. This is done for each gene in a file containing
all known gene sequences for the organism. There are two source files for this purpose: the file
containing all the gene sequences as DNA bases, and the file containing all the sequences as amino
acids. These files are used to derive the normalized frequencies for the codons and the amino acids,
respectively. The process is illustrated briefly below:
Organism 1
- Protein 1: ATGGATCCG……
- Protein 2: ATGCGATCG…..
- ...
- Protein N: ATGTTACTG…..
Organism 1 Codon Freq. Table (one row per gene)
Cdn1   Cdn2   ...   Cdn64
0.23   0.12   ...   0.05
0.11   0.17   ...   0.20
...
0.34   0.15   ...   0.16
Organism 1 AA Freq. Table (one row per gene)
AA1    AA2    ...   AA20
0.13   0.15   ...   0.25
0.01   0.21   ...   0.10
...
0.14   0.09   ...   0.25
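One way to compute such rows is sketched below. This is an illustrative reading of the preprocessing, not the project's script: the function names, the column order (alphabetical codons, alphabetical one-letter amino acid symbols), and the handling of incomplete or unknown symbols are all assumptions.

```python
from collections import Counter
from itertools import product

CODONS = ["".join(t) for t in product("ACGT", repeat=3)]   # 64 columns
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"                        # 20 columns


def codon_frequencies(dna):
    """Return 64 normalized codon frequencies for one gene sequence."""
    dna = dna.upper()
    usable = len(dna) - len(dna) % 3            # ignore a trailing partial codon
    counts = Counter(dna[i:i + 3] for i in range(0, usable, 3))
    total = sum(counts[c] for c in CODONS)      # skips codons with unknown bases
    if total == 0:
        return [0.0] * len(CODONS)
    return [counts[c] / total for c in CODONS]


def amino_acid_frequencies(protein):
    """Return 20 normalized amino acid frequencies for one protein sequence."""
    counts = Counter(protein.upper())
    total = sum(counts[a] for a in AMINO_ACIDS)
    if total == 0:
        return [0.0] * len(AMINO_ACIDS)
    return [counts[a] / total for a in AMINO_ACIDS]


# Each gene contributes one row to the corresponding table.
codon_row = codon_frequencies("ATGGATCCG")      # three codons: ATG, GAT, CCG
aa_row = amino_acid_frequencies("MDP")          # the matching three residues
```

Each row sums to 1, so genes of different lengths become directly comparable inputs for the classifier.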
Once the files containing the statistics are obtained, we perform the following steps.
a. Split the statistics file for each organism into two disjoint parts. The split is random.
b. Merge one part of the statistics file from one organism with one part of the statistics file from the
second organism. Ensure the lines from each are randomly distributed in the merged file. Do
the same for the remaining parts of the statistics files.
c. Use one of the merged files as training (80%) and testing (20%) data for the neural network
classifier.
d. Use the other merged file as activating data for the neural network classifier. This is the set of
data that the trained classifier works on to produce classification results. Print the results.
e. Use the results file to analyze the overall accuracy.
Note that the entire set of steps is carried out independently for the codon statistics file and the
amino acid statistics file. The steps outlined above are presented diagrammatically in the
following image.
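Steps a to c can be sketched as follows. The report does not state how large the two parts of each split are, so the equal halves used here are an assumption, as are the function names, the fixed random seed, and the toy rows standing in for frequency statistics.

```python
import random


def split_in_two(rows, rng):
    """Step a: randomly split rows into two disjoint parts (equal halves assumed)."""
    shuffled = list(rows)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]


def merge_and_shuffle(rows_a, rows_b, rng):
    """Step b: merge rows of two organisms, tag each with its class, and shuffle."""
    merged = [(row, 1) for row in rows_a] + [(row, 2) for row in rows_b]
    rng.shuffle(merged)
    return merged


def train_test_split(rows, train_fraction=0.8):
    """Step c: the first 80% of an already shuffled file trains, the rest tests."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


rng = random.Random(42)
# Hypothetical stand-ins for the per-gene frequency rows of each organism.
organism1 = [[0.1 * i, 1.0 - 0.1 * i] for i in range(10)]
organism2 = [[0.05 * i, 1.0 - 0.05 * i] for i in range(10)]

part1_a, part1_b = split_in_two(organism1, rng)
part2_a, part2_b = split_in_two(organism2, rng)
training_pool = merge_and_shuffle(part1_a, part2_a, rng)    # used in step c
activation_set = merge_and_shuffle(part1_b, part2_b, rng)   # used in step d
train_rows, test_rows = train_test_split(training_pool)
```

Because the two parts of each organism's file are disjoint, no gene in the activation set has been seen during training or testing.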
Classifier Setup:
The following describes the structure of the neural network classifier used in this project:
The network shown is a sample multilayer perceptron of the kind used in classification problems. The
red ovals (far left) represent input neurons. These neurons take in the normalized numerical
values of the attributes in the classification problem. There needs to be one input neuron per attribute.
Hence, in this project, 20 input neurons are used with the amino acid statistics. Likewise, 64 input
neurons are used with the codon statistics to represent the 64 different codons.
The green ovals in the middle represent the hidden neuron layer, which performs
the analysis on the input data. The number of hidden neurons can vary. As a rule of thumb, it is
recommended that there be at least as many of them as there are input neurons. Finally, the output
layer is represented by the oval in dark green (far right). One output neuron is required for each class
attribute. However, for a classification problem, it is recommended that one output neuron be used for
each possible class value. Hence, in this project, we have two output neurons: one for each of the
two organisms to which a gene may belong.
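As a rough sense of scale, the number of trainable weights in such networks can be counted directly. The sketch assumes one bias per hidden and output neuron and, for the codon network, a hidden layer of 64 neurons (the minimum suggested by the rule of thumb above); neither detail is stated in the report, and the function name is hypothetical.

```python
def count_weights(n_inputs, n_hidden, n_outputs):
    """Weights in a fully connected input-hidden-output network, biases included."""
    hidden_layer = (n_inputs + 1) * n_hidden     # +1 for each hidden neuron's bias
    output_layer = (n_hidden + 1) * n_outputs    # +1 for each output neuron's bias
    return hidden_layer + output_layer


amino_acid_net = count_weights(n_inputs=20, n_hidden=20, n_outputs=2)   # 462
codon_net = count_weights(n_inputs=64, n_hidden=64, n_outputs=2)        # 4290
```

The codon network has roughly nine times as many weights to fit as the amino acid network.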
The two output neurons in this project are configured to produce values between 0 and
1. Each value represents the probability that the particular tuple belongs to the corresponding class.
Read through the Class2 output, the closer the value is to 0, the more likely the tuple is a gene of
organism 1, and the closer it is to 1, the more likely it is a gene of organism 2. The table of results
looks as follows:
ID    Class1   Class2
---------------------
1     0.32     0.68
2     0.11     0.89
...
N     0.09     0.91
A cutoff value must be chosen to make a prediction. For example, with a cutoff value of 0.7 applied to
the Class2 output, the tuples with IDs 2 and N are assigned to Class 2, and the tuple with ID 1 is
assigned to Class 1.
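The cutoff rule in this worked example can be written as a small function. This sketch mirrors only the example above, in which the cutoff is compared with the Class2 output; the function name is hypothetical and the report's own analysis script may apply the cutoff differently.

```python
def predict_class(class2_score, cutoff):
    """Assign Class 2 when the Class2 output reaches the cutoff, otherwise Class 1."""
    return 2 if class2_score >= cutoff else 1


# (ID, Class1 output, Class2 output) rows from the sample results table.
results = [("1", 0.32, 0.68), ("2", 0.11, 0.89), ("N", 0.09, 0.91)]

predictions = {row_id: predict_class(class2, cutoff=0.7)
               for row_id, _class1, class2 in results}
```

With a cutoff of 0.7, IDs 2 and N land in Class 2 and ID 1 in Class 1; lowering the cutoff to 0.6 would move ID 1 to Class 2 as well.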
Results
Problem 1:
In the first part of the experiment, we chose genes from E. coli (Strain MG1655) to produce the
corresponding statistics file of amino acid frequencies. Recall that this file contains lines of numbers,
each line holding the normalized amino acid frequencies of the gene from which it was
constructed. This set of sequences formed the first class value in the classification problem. We then
used a random number generator to produce random number sequences to represent the second
class value. After splitting and merging the records (lines) from these two
types of sequences, we want to see whether the neural network we developed is able to correctly
classify each sequence into its respective class (type). We present the result below.
Table 1: Analysis of Classification of E. coli Gene Sequence statistics and Random Number Sequence
statistics (Amino Acid Frequency Values)
----------------------------------------------------------------------------------------------------------------
Using Cutoff Value: 0.6000
Yes: Class 1 (E. coli Genes)
No: Class 2 (Random Sequences)
----------------------------------------------------------------------------------------------------------------
                  Predicted Class
Actual Class      Yes       No     Total
Yes              1558       62      1620
No                  0     1600      1600
Total            1558     1662      3220
----------------------------------------------------------------------------------------------------------------
Sensitivity (TP/P) [Percent of true positives identified]: 96.1728 %
Specificity (TN/N) [Percent of true negatives identified]: 100.0000 %
Accuracy (TP+TN)/(P+N): 98.0745 %
Error (FP+FN)/(P+N): 1.9255 %
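The four summary measures follow directly from the confusion-matrix counts. The sketch below uses the Table 1 cell counts (1558, 62, 0, 1600); the function name and the returned dictionary layout are assumptions for illustration.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Summary measures from the four cells of a two-class confusion matrix."""
    positives = tp + fn          # P: all actual "Yes" tuples
    negatives = fp + tn          # N: all actual "No" tuples
    total = positives + negatives
    return {
        "sensitivity": tp / positives,
        "specificity": tn / negatives,
        "accuracy": (tp + tn) / total,
        "error": (fp + fn) / total,
    }


# Cell counts from Table 1: actual Yes row (1558, 62), actual No row (0, 1600).
table1 = confusion_metrics(tp=1558, fn=62, fp=0, tn=1600)
sensitivity_percent = 100.0 * table1["sensitivity"]
```

Because accuracy and error share the denominator P+N, they always sum to 100%, which is a quick consistency check on any reported table.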
Additional setup parameters:
For this problem, the learning momentum was set to 0.5 and the weight decay was set to 0.001. 20
neurons were used in the hidden layer (equal to the number of input neurons). The total mean-square
error at the end of 20 training cycles was 0.0484.
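A common way momentum and weight decay enter a gradient-descent weight update is sketched below. The report does not give the exact update rule or the learning rate used by its neural network software, so this particular formulation, the learning rate of 0.1, and the function name are assumptions.

```python
def momentum_update(weight, gradient, velocity, learning_rate,
                    momentum=0.5, weight_decay=0.001):
    """One weight update with momentum and L2-style weight decay.

    The decay term pulls the weight toward zero; the momentum term reuses
    a fraction of the previous step to smooth the trajectory.
    """
    velocity = momentum * velocity - learning_rate * (gradient + weight_decay * weight)
    return weight + velocity, velocity


# Hypothetical numbers: weight 1.0, error gradient 0.2, no previous step.
new_weight, new_velocity = momentum_update(
    weight=1.0, gradient=0.2, velocity=0.0, learning_rate=0.1)
# First step: velocity = -0.1 * (0.2 + 0.001 * 1.0) = -0.0201, weight = 0.9799
```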
Discussion:
We can see that our neural network is able to separate the two kinds of sequences with very high
accuracy. With the cutoff value set at 0.6, we correctly identified 1558 out of the 1620 sequences that
were derived from E. coli genes. That represents 96.2% of the genes presented to the
classifier. In addition, no random sequence was classified as a sequence derived from E. coli
genes. That is, the specificity is 100%: every random sequence was correctly told apart from the gene
sequences. We did not repeat the analysis with the codon statistics for this problem, since the amino
acid statistics already gave very high accuracy. Next, we present the results of using the classifier on
gene sequences derived from two different organisms, E. coli and A. Baci, both of which are bacterial
species.
Problem 2:
In the next experiment, we use gene sequences from two different organisms. The results of the first
experiment suggest that random number sequences are very easily distinguished from the
sequences obtained from actual gene statistics. Hence, we prepare statistics from two
real organisms to see whether there are any discernible patterns in their genes. We expect the
classifier to yield somewhat poorer results than in the first experiment. This is because,
although the genes are derived from two different organisms, we expect them to share many
similarities, since both sets are genes that actually support biological life. We
present the results on the following pages.
Table 2: Analysis of Classification of A. Baci and E. coli Gene Sequence statistics (Amino Acid Values)
----------------------------------------------------------------------------------------------------------------
Using Cutoff Value: 0.5000
Yes: Class 1 (A. Baci Genes)
No: Class 2 (E. coli Genes)
----------------------------------------------------------------------------------------------------------------
                  Predicted Class
Actual Class      Yes       No     Total
Yes               822      411      1233
No                128     1493      1621
Total             950     1904      2854
----------------------------------------------------------------------------------------------------------------
Sensitivity (TP/P) [Percent of true positives identified]: 66.67 %
Specificity (TN/N) [Percent of true negatives identified]: 92.10 %
Accuracy (TP+TN)/(P+N): 81.11 %
Error (FP+FN)/(P+N): 18.89 %
Table 3: Analysis of Classification of A. Baci and E. Coli Gene Sequence statistics (Amino Acid Values)
----------------------------------------------------------------------------------------------------------------
Using Cutoff Value: 0.7000        Yes: Class 1 (A. Baci Genes)
                                  No:  Class 2 (E. Coli Genes)
----------------------------------------------------------------------------------------------------------------
                    Predicted - Class
Actual_Class        Yes        No         Total
Yes                 361        869        1230
No                  8          1631       1639
Total               369        2500       2870
----------------------------------------------------------------------------------------------------------------
Sensitivity (TP/P) [Percent of true positives identified]: 29.35 %
Specificity (TN/N) [Percent of true negatives identified]: 99.51 %
Accuracy (TP+TN)/(P+N): 69.41 %
Error (FP+FN)/(P+N): 30.56 %
Table 4: Analysis of Classification of A. Baci and E. Coli Gene Sequence statistics (Codon Values)
----------------------------------------------------------------------------------------------------------------
Using Cutoff Value: 0.5000        Yes: Class 1 (A. Baci Genes)
                                  No:  Class 2 (E. Coli Genes)
----------------------------------------------------------------------------------------------------------------
                    Predicted - Class
Actual_Class        Yes        No         Total
Yes                 1181       68         1249
No                  10         1609       1619
Total               1191       1677       2869
----------------------------------------------------------------------------------------------------------------
Sensitivity (TP/P) [Percent of true positives identified]: 94.56 %
Specificity (TN/N) [Percent of true negatives identified]: 99.38 %
Accuracy (TP+TN)/(P+N): 97.25 %
Error (FP+FN)/(P+N): 2.72 %
Table 5: Analysis of Classification of A. Baci and E. Coli Gene Sequence statistics (Codon Values)
----------------------------------------------------------------------------------------------------------------
Using Cutoff Value: 0.7000        Yes: Class 1 (A. Baci Genes)
                                  No:  Class 2 (E. Coli Genes)
----------------------------------------------------------------------------------------------------------------
                    Predicted - Class
Actual_Class        Yes        No         Total
Yes                 1043       181        1224
No                  2          1650       1652
Total               1045       1831       2877
----------------------------------------------------------------------------------------------------------------
Sensitivity (TP/P) [Percent of true positives identified]: 85.21 %
Specificity (TN/N) [Percent of true negatives identified]: 99.88 %
Accuracy (TP+TN)/(P+N): 93.60 %
Error (FP+FN)/(P+N): 6.36 %
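The content of Tables 2 through 5 is summarized in the table below:
Table 6: Analysis of performance of classifier on Gene Sequence Statistics from E. Coli and A. Baci
(Org1: A. Baci, Org2: E. Coli)
----------------------------------------------------------------------------------------------------------------
                                                        Using Amino Acids      Using Codons
                                                        Cutoff At              Cutoff At
                                                        0.5       0.7          0.5       0.7
----------------------------------------------------------------------------------------------------------------
% of Org1 genes Correctly Classified                    66.67     29.35        94.56     85.21
% of Org2 genes Correctly Classified                    92.10     99.51        99.38     99.88
% of gene Sequences Correctly Classified (Accuracy)     81.08     69.41        97.25     93.60
% of gene Sequences Falsely Classified (Error)          18.88     30.56        2.72      6.36
----------------------------------------------------------------------------------------------------------------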
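The percentages reported in Tables 2 through 5 follow directly from the four cells of each confusion
matrix. The sketch below recomputes them for the counts in Table 2; the function name
confusion_metrics is an illustrative assumption and is not taken from the project code. Note that the
two row totals in Table 2 sum to 2854, one fewer than the printed grand total of 2855, so the accuracy
and error recomputed from the cells differ from the printed values in the second decimal place.

```python
def confusion_metrics(tp, fn, fp, tn):
    """Return sensitivity, specificity, accuracy and error as percentages.

    tp, fn: actual "Yes" cases predicted Yes / No
    fp, tn: actual "No" cases predicted Yes / No
    """
    positives = tp + fn          # P: all actual "Yes" cases
    negatives = fp + tn          # N: all actual "No" cases
    total = positives + negatives
    return {
        "sensitivity": 100.0 * tp / positives,
        "specificity": 100.0 * tn / negatives,
        "accuracy": 100.0 * (tp + tn) / total,
        "error": 100.0 * (fp + fn) / total,
    }


# Cell counts copied from Table 2 (amino acid values, cutoff 0.5).
table2 = confusion_metrics(tp=822, fn=411, fp=128, tn=1493)
for name, value in table2.items():
    print(f"{name}: {value:.2f} %")
```

Because accuracy and error are computed over the same total, they always sum to 100% in this sketch.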
Additional setup parameters:
A learning rate (momentum) of 0.0125 was used in this problem. The value of weightdecay was set at
0.001. Both values had to be decreased from those used for the previous problem set in order to get
reasonable performance.
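To illustrate what these parameters control, here is a minimal sketch of one generic gradient-descent
update with momentum and weight decay. It is not the trainer code used in the project, and the exact
update rule of the library may differ. The function name sgd_step is hypothetical, 0.0125 is treated
here as the learning rate, and the momentum value of 0.0 is an assumption because the text does not
give one separately.

```python
def sgd_step(weights, grads, velocity,
             learning_rate=0.0125, momentum=0.0, weight_decay=0.001):
    """Apply one generic update step and return (new_weights, new_velocity).

    Assumed rule: step = momentum * previous_step
                         - learning_rate * (gradient + weight_decay * weight)
    """
    new_weights, new_velocity = [], []
    for w, g, v in zip(weights, grads, velocity):
        # Weight decay adds a pull toward zero that is proportional to the weight.
        step = momentum * v - learning_rate * (g + weight_decay * w)
        new_velocity.append(step)
        new_weights.append(w + step)
    return new_weights, new_velocity


# With a zero gradient, only weight decay acts and the weight shrinks slightly.
w, v = sgd_step(weights=[1.0], grads=[0.0], velocity=[0.0])
print(w, v)
```

A smaller learning rate shortens every step, while a smaller weight decay weakens the pull toward zero.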
Discussion:
Using amino acid statistics with the cutoff set at 0.5, 66.67% of A. Baci genes and 92.10% of E. Coli
genes were correctly classified. With the cutoff set at 0.7, only 29.35% of A. Baci genes are
correctly classified. That means that all the sequences that scored between 0.5 and 0.7 are now
classified as E. Coli (Organism 2). Correspondingly, the accuracy for Organism 2 goes up from 92.10%
to 99.51%. However, this is only because more data points are now assigned to the second class, and it
comes at the expense of a sharp decrease in the accuracy of classifying A. Baci (Organism 1) genes.
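The effect of the cutoff described above can be sketched with a simple thresholding rule. The scores
below are hypothetical values chosen for illustration, not outputs of the trained network, and the
use of ">=" rather than ">" at the boundary is an assumption.

```python
def classify(score, cutoff):
    """Return 1 (Organism 1) if the class-1 score reaches the cutoff, else 2."""
    return 1 if score >= cutoff else 2


# Hypothetical class-1 scores for four genes.
scores = [0.95, 0.65, 0.55, 0.30]

labels_at_05 = [classify(s, 0.5) for s in scores]
labels_at_07 = [classify(s, 0.7) for s in scores]
print("cutoff 0.5:", labels_at_05)
print("cutoff 0.7:", labels_at_07)
```

The two genes scoring 0.65 and 0.55 move from Organism 1 to Organism 2 when the cutoff is raised from
0.5 to 0.7, which is the shift seen between Tables 2 and 3.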
A similar pattern is observed when using codon statistics. In addition, the classification yields
higher accuracy and smaller errors when using the codon tables. This is expected, since using codons
involves generating normalized frequency data for 64 different kinds of codons, whereas using amino
acids involves data for only 20 different kinds of amino acids. Hence, the codon values give the
classifier more information with which to perform the classification.
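The difference between the two feature sets can be made concrete with a short sketch that builds both
vectors from one coding sequence. It assumes the standard genetic code and a simple count-and-divide
normalization; the project's own preprocessing may differ, and all function names are illustrative.
Because several codons map to the same amino acid, the 20 amino acid frequencies can be derived from
the 64 codon frequencies, but not the other way around.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, with codons ordered TTT, TTC, TTA, TTG, TCT, ... ("*" marks stop codons).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]
CODON_TO_AA = dict(zip(CODONS, AMINO))
AMINO_ACIDS = sorted(set(AMINO) - {"*"})


def codon_frequencies(seq):
    """Return 64 normalized codon frequencies, in the order of CODONS."""
    seq = seq.upper()
    counts = dict.fromkeys(CODONS, 0)
    for i in range(0, len(seq) - 2, 3):
        codon = seq[i:i + 3]
        if codon in counts:  # skip codons containing ambiguous bases
            counts[codon] += 1
    total = sum(counts.values())
    return [counts[c] / total if total else 0.0 for c in CODONS]


def amino_acid_frequencies(seq):
    """Return 20 normalized amino acid frequencies, in the order of AMINO_ACIDS."""
    totals = dict.fromkeys(AMINO_ACIDS, 0.0)
    for codon, freq in zip(CODONS, codon_frequencies(seq)):
        aa = CODON_TO_AA[codon]
        if aa != "*":  # stop codons do not encode an amino acid
            totals[aa] += freq
    norm = sum(totals.values())
    return [totals[a] / norm if norm else 0.0 for a in AMINO_ACIDS]


example = "ATGGCTGCCTAA"  # toy sequence: Met, Ala, Ala, stop
codon_vec = codon_frequencies(example)
aa_vec = amino_acid_frequencies(example)
print(len(codon_vec), len(aa_vec))
```

In the toy sequence, GCT and GCC are distinct codon features, but both collapse into the single
alanine feature, which is the information lost when only amino acid statistics are used.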
Summary and Future Work:
In this project, a neural network was used to classify gene sequences based on the frequency
statistics of their building blocks: amino acids and codons. The experiments were conducted
independently for each of these two kinds of constituents. The first experiment involved separating
gene sequences from sequences generated by random numbers. The neural network classifier was easily
able to tell them apart with high accuracy. The second experiment involved classifying gene sequences
from two different organisms. With some changes in the learning parameters and the weightdecay value,
a reasonable degree of accuracy was achieved again. As expected, using codon statistics resulted in
higher accuracy than using amino acid statistics. However, this comes at added computational expense,
since the former uses 64 input neurons versus the 20 input neurons used by the latter.
As future work, this project could be extended to classify additional types of genetic material. In
lower-class organisms, the genome contains primarily two kinds of regions: the coding (gene) and the
non-coding (non-gene) regions. Higher-class organisms contain a far greater variety of regions within
their genome, for example binding regions, regions for splicing, and encoding and passive regions
within a single gene sequence. Each region is characterized by a specific statistical pattern of the
constituent amino acids and corresponding codons. Hence, the method developed in this project could
be used to perform the same kind of statistical classification of these many kinds of regions within
a higher-class organism's genome.
References:
[1] Harold, F. M. (2001). The way of the cell: Molecules, organisms, and the orders of life. (1st ed.).
Oxford University Press.
[2] Zhang, M. Q. (2002). Computational prediction of eukaryotic protein-coding genes. Nature Reviews
Genetics, 3, 698-709. Retrieved from www.cs.odu.edu/~pothen/Courses/CS791/zhang.pdf
[3] Johansen, O. (2008). Gene splice site prediction using artificial neural networks. (Master's thesis).
[4] Turban, E. (2011). Neural networks and data mining. In R. Sharda & D. Delen (Eds.), Business
Intelligence: A Managerial Approach (2nd ed.). Retrieved from
http://www70.homepage.villanova.edu/matthew.liberatore/Mgt2206/turban_online_ch06.pdf
[5] Pybrain: The machine learning library. (11, 12 2009). Retrieved from
http://pybrain.org/docs/index.html