Using Deep Learning to Automatically Learn Feature Representation and
Build a Better Classification Model on Protein Sequential Data
Son Pham, Brian R. King, PhD
Department of Computer Science, Bucknell University, Lewisburg, PA
BACKGROUND
Deep learning has recently become one of the most exciting
directions in machine learning. The technology has achieved
remarkable success in image recognition, face detection, and
audio processing. While most deep learning research focuses
on 2D image recognition, few methods have investigated its
use on strictly 1D sequential data, such as the data found
in biological sequences.
OUR GOAL
This study aims to use deep learning to:
• Understand how each layer of a neural network represents
hierarchical features of one-dimensional sequential data
• Induce a protein sequence classifier that outperforms
existing methods.
SCRATCH 1-D DATABASE
The SCRATCH 1-D protein database is an open-source protein
database from the University of California, Irvine. It
contains over 5,700 proteins and their respective secondary
structures. In this database, each amino acid in a protein is
encoded as one of 20 letters, and its secondary structure is
encoded as Coil (C), α-helix (H), or β-strand (E).
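As a concrete illustration of this encoding (a minimal sketch; the 21-letter alphabet here, with 'O' as the padding symbol, matches the 21 x 15 input layer shown elsewhere on the poster, while the helper names are our own):

```python
# Sketch: one-hot encoding of an amino-acid window, assuming the 20
# standard residue letters plus 'O' for padding (21 symbols total).
import numpy as np

ALPHABET = "ARNDCEQGHILKMFPSTWYVO"  # 20 amino acids + padding symbol

def one_hot(window):
    """Encode a string of residues as a (21, len(window)) binary matrix."""
    mat = np.zeros((len(ALPHABET), len(window)))
    for j, aa in enumerate(window):
        mat[ALPHABET.index(aa), j] = 1.0
    return mat

x = one_hot("OOOTIKVLFVDDHEM")  # a 15-residue window with padding
print(x.shape)                  # (21, 15)
```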
PREPROCESSING
In this problem, we slice each protein into smaller substrings of length 15 using the sliding window technique. Each of these
substrings is labeled with the secondary structure of its middle amino acid. We also randomly sample 100,000 substrings of
length 6 for feature detection.
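The sliding-window step can be sketched as follows (an illustrative implementation, not the authors' code; the choice of 'O' as the padding symbol follows the poster's diagram):

```python
# Sketch of the sliding-window preprocessing described above: each
# 15-residue window is labeled with the secondary structure of its
# center residue. Sequence ends are padded with 'O'.
def windows(sequence, structure, size=15):
    half = size // 2
    padded = "O" * half + sequence + "O" * half  # pad both ends
    return [(padded[i:i + size], structure[i]) for i in range(len(sequence))]

seq = "TIKVLFVDDHEMVRIGIS"
ss  = "CEEEEEECCCHHHHHHHH"
samples = windows(seq, ss)
print(samples[0])  # ('OOOOOOOTIKVLFVD', 'C')
```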
RESULTS
We achieved 62.156% accuracy on protein secondary
structure prediction. The model appears to detect
common features that can be used to distinguish
between different secondary structures.
FUTURE WORK
We plan to work on improving the accuracy of protein
secondary structure prediction, as well as applying
the current deep learning architecture to predicting
protein subcellular localization.
ACKNOWLEDGEMENT
REFERENCES
http://deeplearning.stanford.edu/
http://scratch.proteomics.ics.uci.edu/
Given an amino acid sequence, our goal is to predict the
correct secondary structure (random coil, α-helix, or
β-strand) at as many positions as possible.
Sequence: TIKVLFVDDHEMVRIGIS…
Structure: CEEEEEECCCHHHHHHHH…
Example sequence and its respective structure
FEATURE DETECTION
To detect meaningful features from protein sequences, we use
a sparse auto-encoder, a neural network that learns common
features from a set of data. We feed 100,000 sample
substrings of length 6 into the network to learn 40 common
features from these samples.
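A minimal sketch of the sparse auto-encoder's forward pass and objective, following the standard formulation (squared reconstruction error plus a KL-divergence sparsity penalty on the hidden activations). The dimensions match the poster (21 x 6 = 126 inputs, 40 hidden features), but the hyperparameters rho and beta are illustrative:

```python
# Sparse auto-encoder sketch: encode, reconstruct, and score a batch.
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden = 21 * 6, 40          # one-hot 6-residue window -> 40 features
W1 = rng.normal(0, 0.01, (n_hidden, n_in))
b1 = np.zeros(n_hidden)
W2 = rng.normal(0, 0.01, (n_in, n_hidden))
b2 = np.zeros(n_in)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def loss(X, rho=0.05, beta=3.0):
    """Reconstruction + sparsity loss for a batch X of shape (n, 126)."""
    h = sigmoid(X @ W1.T + b1)        # hidden activations
    x_hat = sigmoid(h @ W2.T + b2)    # reconstruction of the input
    recon = np.mean(np.sum((x_hat - X) ** 2, axis=1)) / 2
    rho_hat = h.mean(axis=0)          # average activation per hidden unit
    kl = np.sum(rho * np.log(rho / rho_hat)
                + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    return recon + beta * kl          # penalize units that fire too often

X = rng.random((8, n_in))             # random stand-in for one-hot samples
print(loss(X))
```

Training would minimize this loss by gradient descent over W1, b1, W2, b2; the sparsity term is what pushes each hidden unit to respond only to a specific, recurring motif.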
[Figure: sparse auto-encoder network architecture (input x, weight matrix W, hidden activations h, reconstruction x̂) alongside a visualization of the learned features, shown as weights over the 20 amino-acid letters at each of the 6 window positions.]
[Figure: deep learning architecture. A sliding window moves over the padded sequence (OOOOOOTIKVLFVDDHEMVRIGISSYLSTQSDIEVVGEGASGKEA…) and its labels (CEEEEEECCCHHHHHHHHHHHHHCCCEEEEEEECHHHCC…). Input layer: 21 x 15; hidden layer: 21 x 6 x 40; convolutional layer: 40 x 10; output layer producing P(y = C | x), P(y = E | x), and P(y = H | x) for random coil, β-strand, and α-helix.]
DEEP LEARNING ARCHITECTURE
After preprocessing, we feed the
substrings into the deep learning model,
a neural network with multiple layers.
Each layer has a different function:
• The first layer receives the 15-length
substring input.
• The second layer detects common
features from 6-length samples.
• The third layer convolves the input
with the learned features to determine
where those features occur.
• The fourth layer uses the resulting
feature map to classify the structure
of the substring as a coil, α-helix, or
β-strand.
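The convolution step above can be sketched as follows (random placeholder weights stand in for the learned features; the 40 x 10 output shape follows from sliding a length-6 detector over a length-15 window, since 15 - 6 + 1 = 10 positions):

```python
# Sketch: convolving a one-hot 21 x 15 window with 40 learned
# 6-residue feature detectors yields a 40 x 10 feature map.
import numpy as np

rng = np.random.default_rng(0)
window = rng.random((21, 15))        # stand-in for a one-hot window
features = rng.random((40, 21, 6))   # stand-in for 40 learned detectors

n_pos = 15 - 6 + 1                   # 10 valid positions
fmap = np.zeros((40, n_pos))
for k in range(40):
    for i in range(n_pos):
        # response of detector k at position i of the window
        fmap[k, i] = np.sum(features[k] * window[:, i:i + 6])
print(fmap.shape)  # (40, 10)
```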
Confusion matrix (rows: true class; columns: predicted class; each row sums to 100%):

       Cpred    Epred    Hpred
  C    66.91%   10.10%   22.99%
  E    27.20%   43.46%   29.35%
  H    21.10%   10.03%   68.87%
From the confusion matrix, we see that the network still
struggles to detect β-strands. This may be because β-strands
often make contacts with other strands that lie far beyond
the span of the substrings we sampled.
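Reading the diagonal of the confusion matrix as per-class recall makes the β-strand weakness explicit (a quick check using the reported values; rows are the true classes C, E, H):

```python
# Per-class recall from the row-normalized confusion matrix above.
import numpy as np

cm = np.array([[66.91, 10.10, 22.99],   # true C
               [27.20, 43.46, 29.35],   # true E
               [21.10, 10.03, 68.87]])  # true H
recall = np.diag(cm)
print(dict(zip("CEH", recall)))  # E (β-strand) recall is the lowest
```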
I would like to thank Professor Brian King for his
expert advice and encouragement throughout this
research. This project would also have been
impossible without the funding support from the
Bucknell University Program for Undergraduate
Research.