Distributed Monte Carlo Feature Selection (DMCFS) is a method for performing feature selection on large, high-dimensional datasets in a distributed manner. It uses a Monte Carlo sampling approach to select important features. The algorithm can be run across multiple computing nodes and provides constantly updated partial results. It scales nearly linearly as more processors are added. DMCFS is implemented using an actor framework, allowing nodes to be dynamically added and removed. The software is open source and platform independent.
Distributed Monte Carlo Feature Selection Scales Linearly
1. Distributed Monte Carlo Feature Selection
Łukasz Król
Data Mining Group
Faculty of Automatic Control,
Electronics and Computer Science
Silesian University of Technology
2. Classical Structured Big Data Problems
d – number of features
n – number of observations
n >> d
• The number of features is usually much smaller than the number of observations.
• The problem is the scale of the data rather than its structure.
• Observations can often be processed independently of each other.
• In most use cases, the problem is only one of filtering and aggregating the data (MapReduce).
3. High-Dimensional Big Data
d – number of features
n – number of observations
n << d
• The number of features can be a few orders of magnitude higher than the number of observations.
• Most features are not relevant for the problem.
• There are interdependencies between the features, and sets of features from different parts of the dataset often need to be processed together.
• Because of the high dimensionality of the dataset, many features can be correlated with the decision vector and with each other only by chance (False Discoveries).
4. High Throughput Biological Experiments
Scale of dimensionality of different high-throughput experiments:
experiment          observations   features
RNA microarrays     10²–10³        10⁴
SNP microarrays     10²–10³        10⁵–10⁶
CNV microarrays     10²–10³        10⁶
methylation sites   10²–10³        10⁸–10⁹
sequencing data     10²–10³        10⁹
5. Feature Selection
Dimensionality can be reduced using Feature Selection…
…but Feature Selection itself is affected by the feature-to-observation imbalance!
6. Feature Selection
What can the objectives of a Feature Selection application be in a supervised scenario?
• Outputting a set of features that are most useful for training a classifier.
• Outputting a set of features that can be directly analyzed by a Life Scientist.
7. Feature Selection
Basic requirements of a Feature Selection application:
• Is not biased by the dataset.
• Is agnostic of the type of variables and the number of categories.
• Takes into account interactions between variables.
• Takes into account contextual dependencies with the response variable.
• Is not bound to a greedy search path.
• Allows capturing the statistical significance of the selected features.
Requirements for human-readable output:
• Does not transform the feature space.
• Provides information on interdependencies between the features.
• Does not remove weaker alternative signal paths.
8. Monte Carlo Feature Selection
Bioinformatics (2008) 24: 110-117
Advances in Machine Learning II (2010) 263: 371-385
Big Data Analysis: New Algorithms for a New Society (2015) 16: 285-304
9. Distributed MCFS - motivation
• The constant increase in the dimensionality of analyzed problems requires new tools.
• Current software cannot make use of distributed resources.
• Experiment scenarios are becoming harder: fewer significant features are present in microarrays created from blood samples than in those comparing healthy vs. ill tissues.
• An abundance of distributed data analysis frameworks has resulted from the Big Data movement.
13. Monte Carlo Feature Selection
[Figure: OBSERVATIONS × FEATURES data matrix; observation sampling for feature subset j=3, split k=1.]
14. Monte Carlo Feature Selection
[Figure: OBSERVATIONS × FEATURES data matrix; observation sampling for feature subset j=3, split k=2.]
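The sampling scheme sketched on these slides can be expressed roughly as follows. This is an illustrative Python sketch, not the dmLab/DMCFS code; the parameter names s (number of feature subsets), t (splits per subset), m (subset size) and train_frac are assumptions for the illustration:

```python
# Monte Carlo sampling loop of MCFS (illustrative sketch):
# draw s random subsets of m features, and pair each subset with
# t random train/test splits of the observations; a decision tree
# would then be trained on every (subset, split) pair.

import random

def mc_samples(n_features, n_obs, s, t, m, train_frac=0.66, seed=0):
    rng = random.Random(seed)
    for _ in range(s):                       # s feature subsets
        feats = rng.sample(range(n_features), m)
        for _ in range(t):                   # t observation splits each
            obs = list(range(n_obs))
            rng.shuffle(obs)
            cut = int(train_frac * n_obs)
            train, test = obs[:cut], obs[cut:]
            yield feats, train, test         # -> train a tree here
```

Each yielded tuple corresponds to one tree to be trained, so the total number of trees is s·t.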
15. Monte Carlo Feature Selection
Training a decision tree and analyzing its structure and performance.
[Figure: decision tree with nodes labeled feat=1 (IG=0.3, n=10), feat=2 (IG=0.1, n=3), feat=3 (IG=0.2, n=7), feat=1 (IG=0.1, n=5).]
16. Monte Carlo Feature Selection
Training a decision tree and analyzing its structure and performance.
[Figure: the same tree, evaluated on the test split: wAcc=0.75.]
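The per-tree statistics shown here (wAcc, IG, n) feed a relative importance (RI) score: in the MCFS papers each tree contributes (wAcc)^u · Σ IG(node) · (n_node/n_tree)^v, summed over the nodes that split on a given feature. A hedged Python sketch (not the dmLab implementation; u = v = 1 and the root count as n_tree are assumptions):

```python
# Aggregating MCFS-style relative importance (RI) from per-tree node
# statistics.  trees: list of (wAcc, nodes); nodes: list of
# (feature_id, information_gain, n_samples_in_node).

def relative_importance(trees, u=1.0, v=1.0):
    ri = {}
    for w_acc, nodes in trees:
        # assumption: the root node's sample count is the tree's
        # training-set size ("no. of objects in the tree")
        n_tree = max(n for _, _, n in nodes)
        for feat, ig, n in nodes:
            ri[feat] = ri.get(feat, 0.0) + (w_acc ** u) * ig * (n / n_tree) ** v
    return ri
```

With the tree from the slide (wAcc=0.75), feature 1 scores highest because it appears in two nodes.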
18. Monte Carlo Feature Selection
[Figure: top-5 ranked features (1st–5th) after s = 1000, 2000, …, 7000 samples; the ranking stabilizes as s grows.]
Stopping criterion: STABLE RANKING
19. Monte Carlo Feature Selection
[Figure: the same top-5 ranking (1st–5th) per s, alongside the cumulated number of samples.]
Cumulated sample count needed to reach s samples, when rankings are recomputed from scratch (independent) vs. reusing previous samples (incremental):
s      independent   incremental
1000   1000          1000
2000   3000          2000
3000   6000          3000
4000   10000         4000
5000   15000         5000
6000   21000         6000
7000   28000         7000
Stopping criterion: STABLE RANKING
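One way to implement the STABLE RANKING criterion — an assumption about the rule illustrated on these slides, not the dmLab code — is to stop once the ordered top-k features no longer change between consecutive checkpoints:

```python
# Stable-ranking stopping rule (illustrative sketch): compare the
# ordered top-k features of two consecutive score snapshots and
# report whether the ranking has settled.

def ranking_stable(prev_scores, curr_scores, k=5):
    """True if the ordered top-k features are identical in both rankings."""
    def top(scores):
        return sorted(scores, key=scores.get, reverse=True)[:k]
    return top(prev_scores) == top(curr_scores)
```

Note that only the order of the top-k features matters, not the raw score values, so scores may still drift while the ranking is already stable.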
20. DMCFS – basic concepts
• The sampling loop can be executed in parallel.
• Computations can be distributed between hosts.
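Since the Monte Carlo samples are independent, the loop can be fanned out to a worker pool and the partial scores merged as results arrive. An illustrative single-machine sketch using threads (the actual DMCFS distributes across hosts via an actor framework; evaluate_sample is a hypothetical placeholder):

```python
# Parallel execution of the sampling loop: each worker evaluates one
# Monte Carlo sample and returns a partial feature-score dictionary,
# which the main thread merges.

from concurrent.futures import ThreadPoolExecutor

def evaluate_sample(seed):
    # Placeholder for "draw features, split observations, train a tree";
    # here it just attributes a unit score to one of three fake features.
    return {f"f{seed % 3}": 1.0}

def parallel_scores(n_samples, workers=4):
    merged = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(evaluate_sample, range(n_samples)):
            for feat, score in partial.items():
                merged[feat] = merged.get(feat, 0.0) + score
    return merged
```

In a real deployment the workers would be processes on separate hosts rather than threads, but the merge step is the same.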
21. DMCFS – basic concepts
• Communication between threads should be asynchronous (non-blocking) and as minimal as possible.
[Figure: SYNCHRONOUS communication – threads sit IDLE while waiting.]
22. DMCFS – basic concepts
• Communication between threads should be asynchronous (non-blocking) and as minimal as possible.
[Figure: ASYNCHRONOUS communication – threads make continuous PROGRESS.]
23. DMCFS – basic concepts
• Data should not have to be reshuffled by the application (as in MapReduce).
• Data distribution should be in the scope of the infrastructure, not the application.
• If possible, data should be loaded only once.
24. DMCFS – basic concepts
(same requirements as above)
[Figure: NODE 1, NODE 2, NODE 3 reading DATA from a NETWORK FILESYSTEM.]
25. DMCFS – basic concepts
(same requirements as above)
[Figure: NODE 1, NODE 2, NODE 3 backed by a DATABASE – when the problem dataset does not fit into memory.]
26. DMCFS – basic concepts
(same requirements as above)
[Figure: NODE 1, NODE 2, NODE 3 on HDFS+Spark – when feature samples do not fit in memory.]
27. DMCFS – basic concepts
Computations should not be affected by nodes being disconnected:
• Most nodes do not need to be aware of the other nodes or of the size of the cluster.
• Most nodes perform simple, stateless workloads.
• A single point of failure and potential bottleneck exists in the form of the Master Node – it is, however, restorable, and multithreading allows it to remain responsive.
28. DMCFS – basic concepts
Writing and maintaining the application can be greatly facilitated by employing a parallel programming framework – more specifically, an actor framework.
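The actor pattern can be sketched as a mailbox queue drained by a single thread, so senders never block and the actor's state needs no locks. A minimal Python illustration of the pattern (not the Java framework DMCFS actually uses):

```python
# Minimal actor-style worker: each actor owns a mailbox queue and a
# thread that processes messages one at a time.  Senders enqueue
# messages with tell() and continue immediately (non-blocking).

import queue
import threading

class ScoreActor:
    """Accumulates feature scores sent as (feature, score) messages."""

    def __init__(self):
        self.scores = {}
        self._mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def tell(self, message):
        """Non-blocking send: drop the message in the mailbox."""
        self._mailbox.put(message)

    def stop(self):
        """Send a poison pill and wait for the actor to finish."""
        self._mailbox.put(None)
        self._thread.join()

    def _run(self):
        while (msg := self._mailbox.get()) is not None:
            feat, score = msg
            self.scores[feat] = self.scores.get(feat, 0.0) + score
```

Because only the actor's own thread touches self.scores, many sampling workers can tell() it partial results concurrently without any explicit locking.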
42. DMCFS – summary
• Can be run on an arbitrary number of physical machines.
• Allows nodes to be dynamically attached and detached while computations are running.
• Provides constantly updated partial results.
• Scales almost linearly as processors are added.
• Platform-independent.
• Has no dependencies other than Java 1.8.
• Can be extended with new types of feature selectors.
• Can be deployed on a public or private cloud.
• Software (0.1.0) available upon request.
• Creation of an intranet service is ongoing.
43. Acknowledgements
I would like to thank dr. Draminski for providing the latest version of the dmLab software for evaluation, as well as Najla Al-Harbi, Sara Bin Judia, dr. Salma Majid, and dr. Ghazi Alsbeih (Faisal Specialist Hospital & Research Centre, Riyadh 11211, Kingdom of Saudi Arabia), and furthermore Bozena Rolnik (Data Mining Group) for providing the CNV data. Calculations were carried out using the computer cluster Ziemowit (http://www.ziemowit.hpc.polsl.pl) funded by the Silesian BIO-FARMA project No. POIG.02.01.00-00-166/08 in the Computational Biology and Bioinformatics Laboratory of the Biotechnology Centre at the Silesian University of Technology. The work was financially supported by NCN grant HARMONIA UMO-2013/08/M/ST6/00924 (LK).