1. Using Deep Learning to Automatically Learn Feature Representation and
Build a Better Classification Model on Protein Sequential Data
Son Pham, Brian R. King, PhD
Department of Computer Science, Bucknell University, Lewisburg, PA
BACKGROUND
Deep learning has recently become one of the most
exciting directions in machine learning. The technology
has achieved remarkable success in image recognition,
face detection, and audio feature extraction. However,
while most deep learning research focuses on 2D image
recognition, few methods have investigated its use on
strictly 1D sequential data, such as that found in
biological sequences.
OUR GOAL
This study aims to investigate the use of deep
learning in order to:
• Understand how each layer of a neural network helps
represent hierarchical features of one-dimensional
sequential data
• Induce a protein sequence classifier that can
outperform existing methods
SCRATCH 1-D DATABASE
The SCRATCH 1-D protein database is an open-source protein
database from the University of California, Irvine. It
contains data on over 5,700 proteins and their respective
secondary structures. In this database, each amino acid in a
protein is encoded as one of 20 letters, and its secondary
structure is encoded as either coil (C), α-helix (H),
or β-strand (E).
PREPROCESSING
In this problem, we slice each protein into substrings of length 15 using a sliding-window technique. Each substring is
labeled with the secondary structure of its middle amino acid. We also randomly sample 100,000 substrings of length 6
for feature detection.
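A minimal sketch of the sliding-window step, assuming windows are centered by padding each sequence with O's (7 per side for a 15-length window); the function name and exact padding count are our assumptions:

```python
# Sliding-window preprocessing: one 15-length substring per residue,
# labeled with the secondary structure of the middle (8th) residue.
# Sequences are padded with 'O' so every residue gets a full window.

def sliding_windows(sequence, structure, window=15):
    """Yield (substring, label) pairs, one per residue in the sequence."""
    half = window // 2                          # 7 residues on each side
    padded = "O" * half + sequence + "O" * half
    for i, label in enumerate(structure):
        yield padded[i:i + window], label

seq = "TIKVLFVDDHEMVRIGIS"
ss  = "CEEEEEECCCHHHHHHHH"
pairs = list(sliding_windows(seq, ss))
# The first window is left-padded; its label is the structure of residue 0.
```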
RESULTS
We achieved 62.156% accuracy on protein
secondary structure prediction. The model
appears able to detect common features that
can be used to distinguish between different
secondary structures.
FUTURE WORK
We plan to improve the accuracy of protein
secondary structure prediction and to apply
the current deep learning architecture to predicting
protein subcellular localization.
ACKNOWLEDGEMENT
REFERENCE
http://deeplearning.stanford.edu/
http://scratch.proteomics.ics.uci.edu/
Given the amino acid sequence, our goal is to predict as
many secondary structures correctly as possible.
Sequence:  TIKVLFVDDHEMVRIGIS…
Structure: CEEEEEECCCHHHHHHHH…
Example sequence and its respective structure (C = random coil, H = α-helix, E = β-strand)
FEATURE DETECTION
To detect meaningful features in protein
sequences, we use a sparse auto-encoder: a
neural network that can detect common features
in a set of data. We feed 100,000 sample strings
of length 6 into the network to find 40 common
features across these samples.
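As a sketch of the idea (not our trained model), the forward pass and loss of a sparse auto-encoder can be written as follows. The 21-symbol alphabet, 6-residue windows, and 40 hidden features follow the sizes above; the sparsity target `rho`, weight `beta`, and random initialization are illustrative assumptions:

```python
# Sparse auto-encoder forward pass and loss (illustrative sketch).
# Input: a 6-residue window one-hot encoded over 21 symbols (20 amino
# acids + padding), flattened to 126 dims; 40 hidden features.  The
# sparsity penalty is the KL divergence between a target activation
# rho and the mean hidden activation.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sparse_autoencoder_loss(X, W1, b1, W2, b2, rho=0.05, beta=3.0):
    """Return (reconstruction, total loss) for a batch X of shape (n, 126)."""
    H = sigmoid(X @ W1 + b1)              # hidden features, shape (n, 40)
    X_hat = sigmoid(H @ W2 + b2)          # reconstruction, shape (n, 126)
    recon = 0.5 * np.mean(np.sum((X_hat - X) ** 2, axis=1))
    rho_hat = H.mean(axis=0)              # mean activation per hidden unit
    kl = np.sum(rho * np.log(rho / rho_hat)
                + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    return X_hat, recon + beta * kl

rng = np.random.default_rng(0)
n_in, n_hidden = 21 * 6, 40
W1 = rng.normal(0, 0.01, (n_in, n_hidden))
W2 = rng.normal(0, 0.01, (n_hidden, n_in))
b1, b2 = np.zeros(n_hidden), np.zeros(n_in)
X = rng.integers(0, 2, (8, n_in)).astype(float)   # stand-in one-hot batch
X_hat, loss = sparse_autoencoder_loss(X, W1, b1, W2, b2)
```

Training minimizes this loss by gradient descent; the sparsity term pushes each hidden unit to fire rarely, so the 40 units specialize into distinct sequence features.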
[Figure: sparse auto-encoder network architecture (input x, weight matrix W, hidden layer h, reconstruction x̂) alongside a visualization of the found features, each shown as a 20-amino-acid by 6-position weight map.]
[Figure: deep learning pipeline. A protein padded with O's (e.g. OOOOOOTIKVLFVDDHEMVRIGISSYLSTQSDIEVVGEGASGKEA…, structure CEEEEEECCCHHHHHHHHHHHHHCCCEEEEEEECHHHCC…) is read with a sliding window and randomly sampled. Input layer: 21 x 15; hidden layer: 21 x 6 x 40; convolutional layer: 40 x 10; output layer: 40 x 10, producing P(y = C | x), P(y = E | x), and P(y = H | x) for random coil, β-strand, and α-helix.]
DEEP LEARNING ARCHITECTURE
After preprocessing, we feed the
substrings into the deep learning model,
a neural network that contains
multiple layers. Each layer has a different
function:
• The first layer receives the 15-length
substring input.
• The second layer detects common
features from the 6-length samples.
• The third layer convolves the input with
the found features to determine where
each feature occurs.
• The fourth layer uses the resulting
feature map to classify the structure
of the substring as coil, α-helix, or β-
strand.
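The four layers above can be traced shape by shape in a short sketch. The layer sizes follow the poster; the random weights, softmax output head, and variable names are illustrative assumptions, not the trained model:

```python
# Shape walk-through of the four-layer architecture (illustrative only).
import numpy as np

rng = np.random.default_rng(1)
ALPHA = 21                    # 20 amino acids + padding symbol 'O'
WIN, FEAT_LEN, N_FEAT, N_CLASS = 15, 6, 40, 3

# Layer 1: a one-hot encoded 15-length substring, shape 21 x 15.
x = np.zeros((ALPHA, WIN))
x[rng.integers(0, ALPHA, WIN), np.arange(WIN)] = 1

# Layer 2: 40 learned 6-length features (here: random stand-ins).
W = rng.normal(0, 0.01, (N_FEAT, ALPHA, FEAT_LEN))

# Layer 3: convolve each feature across the 10 valid window positions,
# giving a 40 x 10 feature map that says where each feature occurs.
conv = np.array([[np.sum(W[f] * x[:, p:p + FEAT_LEN])
                  for p in range(WIN - FEAT_LEN + 1)]
                 for f in range(N_FEAT)])

# Layer 4: classify the flattened feature map into C / E / H via softmax.
V = rng.normal(0, 0.01, (N_CLASS, N_FEAT * (WIN - FEAT_LEN + 1)))
logits = V @ conv.ravel()
probs = np.exp(logits - logits.max())
probs /= probs.sum()          # P(y = C | x), P(y = E | x), P(y = H | x)
```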
Confusion matrix (rows: true class, columns: predicted class; each row sums to 100%):

        Cpred     Epred     Hpred
C      66.91%    10.10%    22.99%
E      27.20%    43.46%    29.35%
H      21.10%    10.03%    68.87%
By observing the confusion matrix, we learned that
the network still has problems detecting β-
strands. This may be because β-strands
often form contacts with other strands far
beyond the scope of the substrings that we
sampled.
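A row-normalized confusion matrix of this kind can be computed as follows. This is a sketch on toy labels, not our data; the helper name and dictionary layout are our choices:

```python
# Row-normalized (recall-style) confusion matrix: each row shows how a
# true class was distributed over the predicted classes, in percent.
from collections import Counter

def confusion_matrix(y_true, y_pred, classes=("C", "E", "H")):
    """Return {true_class: {pred_class: percent}} with rows summing to 100."""
    counts = Counter(zip(y_true, y_pred))
    matrix = {}
    for t in classes:
        row_total = sum(counts[(t, p)] for p in classes)
        matrix[t] = {p: 100.0 * counts[(t, p)] / row_total for p in classes}
    return matrix

# Toy example: 8 residues, some mispredicted.
cm = confusion_matrix(list("CCCCEEHH"), list("CCCHEHHH"))
```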
I would like to thank Professor Brian R. King for his
expert advice and encouragement throughout this
research. This project would also have been
impossible without the funding support of the
Bucknell University Program for Undergraduate
Research.