1. DEEP LEARNING FOR
FEATURE SELECTION IN
CASE AND CONTROL
PREDICTION
Ana Sanchez
Graduate Student Project Day
Spring 2015
2. Introduction
• Deep learning is a family of machine learning methods that model data through layered representations.
• Algorithms that learn representations of the data are used to obtain information that might otherwise go undetected.
• In this project, deep learning is used for case and
control prediction in genome wide association
studies, specifically to construct new feature
vectors.
3. Methods
• SNP data is used to implement deep learning.
Currently experimenting with Upper GI Cancer
data.
• The data contains approximately 5000 subjects
(case and control) and was reduced to 10 case
subjects and 10 control subjects.
• There are 491,774 SNP features per subject. The SNPs are condensed into patches of 100 SNPs each.
4. Methods - Patching
• The patches are generated by taking SNPs at indices 0 to 99, then shifting by 50 SNPs and taking the next 100, repeating until the final SNP is reached (e.g., 1st patch 0-99, 2nd patch 50-149, 3rd patch 100-199, etc.).
• A Python program was written to accomplish the patching.
6. Output of Patching
A 196680 x 100 matrix results. The matrix represents 9834 patches for each of the 20 subjects.
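The original patching program is not shown in the slides; a minimal sketch of the sliding-window step (the function name, genotype coding, and toy sizes here are illustrative, not the original code) could look like:

```python
import numpy as np

def make_patches(snps, patch_len=100, stride=50):
    """Slide a window of patch_len SNPs across one subject's
    genotype vector, shifting by stride each step."""
    patches = []
    for start in range(0, len(snps) - patch_len + 1, stride):
        patches.append(snps[start:start + patch_len])
    return np.array(patches)

# Toy example: a 500-SNP vector yields (500 - 100) // 50 + 1 = 9 patches.
# With 491,774 SNPs this formula gives 9834 patches, matching the slides.
subject = np.random.randint(0, 3, size=500)  # SNP genotypes coded 0/1/2
patches = make_patches(subject)
print(patches.shape)  # (9, 100)
```

Stacking the patch arrays of all 20 subjects row-wise would produce the 196680 x 100 matrix described above.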
7. Methods – Deep Learning with K-Means
• To implement deep learning, K-Means clustering is used.
• The K-Means algorithm clusters data by separating the samples into N groups, minimizing the within-cluster sum of squares.
• The number of clusters must be specified in advance.
8. Methods – Deep Learning with K-Means
• For the purpose of this project, scikit-learn was used to implement the K-Means algorithm.
• scikit-learn is open-source software that provides machine learning libraries for Python.
• Using imported libraries such as NumPy, sklearn.cluster, and sklearn.metrics, simple programs were written to handle the complexities of the algorithm.
• For K-Means, the number of clusters was set to K=1000 and K=10000.
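The clustering step described above can be sketched with scikit-learn as follows (toy sizes stand in for the real 196680 x 100 patch matrix and K=1000; variable names are my own):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy stand-in for the 196680 x 100 patch matrix (sizes reduced here).
rng = np.random.default_rng(0)
patches = rng.random((600, 100))

# K-Means: the number of clusters must be given up front.
kmeans = KMeans(n_clusters=10, n_init=10, random_state=0).fit(patches)

labels = kmeans.labels_            # one cluster label per patch
centers = kmeans.cluster_centers_  # K x 100 matrix of centroids
print(labels.shape, centers.shape)  # (600,) (10, 100)
```

The two attributes `labels_` and `cluster_centers_` correspond to the two output files described on the next slide.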
10. Output for K-Means
One file contains a 196680 x 1 matrix: the cluster label assigned to each patch.
A second file contains a 1000 x 100 matrix: the coordinates of the K=1000 cluster centers (one 100-dimensional centroid per cluster).
11. Methods – Constructing a Distance Matrix
• At this point, three files have been created from the original set: Patches, Centers, and Labels.
• The next step is to construct a pairwise distance matrix: a file containing the distance of every patch to every cluster center.
• A Python program was written that takes the patch and center files as input.
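The pairwise distance computation can be sketched with `sklearn.metrics.pairwise_distances` (whether the original program used this function or a hand-rolled loop is not stated in the slides; sizes here are toy stand-ins):

```python
import numpy as np
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
patches = rng.random((600, 100))  # stand-in for the 196680 x 100 patch file
centers = rng.random((10, 100))   # stand-in for the K x 100 centers file

# Euclidean distance from every patch to every cluster center:
# one row per patch, one column per center.
dist = pairwise_distances(patches, centers)
print(dist.shape)  # (600, 10)
```

With the real inputs this produces the 196680 x K matrix described on the next output slide.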
14. Output for Distance Matrix
Screen shot of the word count for the distance matrix output, showing 196680 rows. Dividing the total count of 196,680,000 values by the number of rows gives the number of columns: in this case, 1000. The result is therefore a 196680 x 1000 matrix, i.e., the number of patches by the number of K centers.
15. Methods – Constructing New Feature Vector file
• Now that the Distance Matrix is obtained, the final New Feature Vector file is created.
• Since the matrix is 196680 x K, the sum of each column is taken over every block of 9834 rows. Each 9834-row block represents one patient and becomes one row of the final file.
• A Python program was written that uses the Distance Matrix file as input.
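The per-patient column summation can be sketched as a NumPy reshape-and-sum (toy sizes below; the project's real sizes are 20 patients x 9834 patches x K columns):

```python
import numpy as np

# Toy sizes standing in for n_patients=20, patches_per_patient=9834, K=1000.
n_patients, patches_per_patient, k = 4, 5, 3
dist = np.arange(n_patients * patches_per_patient * k,
                 dtype=float).reshape(-1, k)  # stand-in distance matrix

# Reshape so each patient's block of rows becomes its own axis,
# then sum down that axis: one K-length feature vector per patient.
features = dist.reshape(n_patients, patches_per_patient, k).sum(axis=1)
print(features.shape)  # (4, 3)
```

With the real inputs this yields the 20 x 1000 (or 20 x 10000) matrix described on the next slide.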
17. Output of New Features
A new feature vector matrix results: 20 x 1000.
18. Output of New Features
Screen shot of the word count for the new feature vector output, showing 20 rows. Dividing the total count of 20,000 values by 20 gives the number of columns: in this case, 1000. This represents the original 20 patients with their 1000 new features, obtained from the patching, K-Means, pairwise distance, and summation steps.
19. Encountered Issues throughout Project
• Large files.
• Extra programs needed to be downloaded.
• With scikit-learn, running standard K-Means is extremely time consuming; MiniBatchKMeans is used instead.
• MiniBatchKMeans works on subsets of the input data, greatly reducing computation time while performing only slightly worse.
• Even with MiniBatchKMeans, the computation time for K=100000 was too long, so that setting was not used.
• Runtimes remain long: K=1000 (~30 min), K=10000 (~60 min), K=100000 (>16 hours).
• Even simple scripts run long due to the large data.
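The MiniBatchKMeans substitution mentioned above is a one-line change in scikit-learn (toy sizes and parameter values below are illustrative, not the project's actual settings):

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
patches = rng.random((2000, 100))  # stand-in for the full patch matrix

# MiniBatchKMeans updates the centroids from small random subsets
# (mini-batches) of the data per iteration, trading a little cluster
# quality for a large reduction in computation time.
mbk = MiniBatchKMeans(n_clusters=50, batch_size=256,
                      n_init=3, random_state=0).fit(patches)
print(mbk.cluster_centers_.shape)  # (50, 100)
```

Its `labels_` and `cluster_centers_` attributes have the same shapes as standard KMeans, so the rest of the pipeline is unchanged.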
20. Comparison of Kmeans and
MiniBatchKmeans
Image and comparison taken from the scikit-learn website.
21. Conclusions
• The process was completed twice, for K=1000 and K=10000.
• Using deep learning ideas, a 20 x 491,774 data set was reduced to a 20 x 1000 file and a 20 x 10000 file.
• This significant reduction in file size saves memory.
• Classification algorithms can be implemented on the new files. Prediction accuracy may be higher for the new files than for the raw data.
22. References and Acknowledgments
• Pedregosa et al. Scikit-learn: Machine Learning in Python. JMLR 12, pp. 2825-2830, 2011.
• Michael K. K. Leung, Hui Yuan Xiong, Leo J. Lee, and Brendan J. Frey. Deep learning of the tissue-regulated splicing code. Bioinformatics 2014, 30: i121-i129.
• Rasool Fakoor, Faisal Ladhak, Azade Nazi, Manfred Huber. Using deep learning to enhance cancer diagnosis and classification. Proceedings of the 30th International Conference on Machine Learning, Atlanta, Georgia, USA, 2013. JMLR: W&CP volume 28.
Special thanks to Mohammadreza Esfandiari, Payam, for guidance and assistance throughout the project.