Distributed Monte Carlo Feature Selection (DMCFS) is a method for performing feature selection on large, high-dimensional datasets in a distributed manner. It uses a Monte Carlo sampling approach to select important features. The algorithm can be run across multiple computing nodes and provides constantly updated partial results. It scales nearly linearly as more processors are added. DMCFS is implemented using an actor framework, allowing nodes to be dynamically added and removed. The software is open source and platform independent.
Distributed Monte Carlo Feature Selection Scales Linearly
1. Distributed Monte Carlo Feature Selection
Łukasz Król
Data Mining Group
Faculty of Automatic Control,
Electronics and Computer Science
Silesian University of Technology
2. Classical Structured Big Data Problems
d – number of features
n – number of observations
n >> d
• The number of features is usually much smaller than the number of observations.
• The problem is the scale of the data rather than its structure.
• Observations can often be processed independently of each other.
• In most use cases, the problem is only one of filtering and aggregating the data (MapReduce).
3. High-Dimensional Big Data
d – number of features
n – number of observations
n << d
• The number of features can be a few orders of magnitude higher than the number of observations.
• Most features are not relevant for the problem.
• There are interdependencies between the features, and sets of features from different parts of the dataset often need to be processed together.
• Because of the high dimensionality of the dataset, many features can be correlated with the decision vector and with each other only by chance (False Discoveries).
4. High Throughput Biological Experiments
Scale of dimensionality of different high-throughput experiments:
experiment          observations   features
RNA microarrays     10²–10³        10⁴
SNP microarrays     10²–10³        10⁵–10⁶
CNV microarrays     10²–10³        10⁶
methylation sites   10²–10³        10⁸–10⁹
sequencing data     10²–10³        10⁹
5. Feature Selection
Dimensionality can be reduced using Feature Selection…
…but Feature Selection itself is affected by the feature-to-observation imbalance!
6. Feature Selection
What can the objectives of a Feature Selection application be in a supervised scenario?
• Outputting a set of features that are most useful for training a classifier.
• Outputting a set of features that can be directly analyzed by a Life Scientist.
7. Feature Selection
Basic requirements of a Feature Selection application:
• Is not biased by the dataset.
• Is agnostic of the type of variables and the number of categories.
• Takes into account interactions between variables.
• Takes into account contextual dependencies with the response variable.
• Is not bound to a greedy search path.
• Allows capturing the statistical significance of the selected features.
Requirements for human-readable output:
• Does not transform the feature space.
• Provides information on interdependencies between the features.
• Does not remove weaker alternative signal paths.
8. Monte Carlo Feature Selection
Bioinformatics (2008) 24: 110-117
Advances in Machine Learning II (2010) 263: 371-385
Big Data Analysis: New Algorithms for a New Society (2015) 16: 285-304
9. Distributed MCFS - motivation
• The constant increase in the dimensionality of analyzed problems requires new tools.
• Current software cannot make use of distributed resources.
• Experiment scenarios are becoming harder: fewer significant features are present in microarrays created from blood samples than in those comparing healthy vs. ill tissues.
• An abundance of distributed data analysis frameworks has resulted from the Big Data movement.
13. Monte Carlo Feature Selection
[Figure: OBSERVATIONS × FEATURES data matrix; observation sampling for feature subset j=3, split k=1.]
14. Monte Carlo Feature Selection
[Figure: OBSERVATIONS × FEATURES data matrix; observation sampling for feature subset j=3, split k=2.]
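The sampling scheme sketched on these slides can be expressed roughly as follows. This is an illustrative Python sketch, not the dmLab/DMCFS code; the parameter names s (number of feature subsets), t (splits per subset), m (subset size) and train_frac are assumptions for the illustration:

```python
# Monte Carlo sampling loop of MCFS (illustrative sketch):
# draw s random subsets of m features, and pair each subset with
# t random train/test splits of the observations; a decision tree
# would then be trained on every (subset, split) pair.

import random

def mc_samples(n_features, n_obs, s, t, m, train_frac=0.66, seed=0):
    rng = random.Random(seed)
    for _ in range(s):                       # s feature subsets
        feats = rng.sample(range(n_features), m)
        for _ in range(t):                   # t observation splits each
            obs = list(range(n_obs))
            rng.shuffle(obs)
            cut = int(train_frac * n_obs)
            train, test = obs[:cut], obs[cut:]
            yield feats, train, test         # -> train a tree here
```

Each yielded tuple corresponds to one tree to be trained, so the total number of trees is s·t.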
15. Monte Carlo Feature Selection
Training a decision tree and analyzing its structure and performance.
[Figure: decision tree with nodes labeled feat=1 (IG=0.3, n=10), feat=2 (IG=0.1, n=3), feat=3 (IG=0.2, n=7), feat=1 (IG=0.1, n=5).]
16. Monte Carlo Feature Selection
Training a decision tree and analyzing its structure and performance.
[Figure: the same tree, evaluated on the test split: wAcc=0.75.]
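The per-tree statistics shown here (wAcc, IG, n) feed a relative importance (RI) score: in the MCFS papers each tree contributes (wAcc)^u · Σ IG(node) · (n_node/n_tree)^v, summed over the nodes that split on a given feature. A hedged Python sketch (not the dmLab implementation; u = v = 1 and the root count as n_tree are assumptions):

```python
# Aggregating MCFS-style relative importance (RI) from per-tree node
# statistics.  trees: list of (wAcc, nodes); nodes: list of
# (feature_id, information_gain, n_samples_in_node).

def relative_importance(trees, u=1.0, v=1.0):
    ri = {}
    for w_acc, nodes in trees:
        # assumption: the root node's sample count is the tree's
        # training-set size ("no. of objects in the tree")
        n_tree = max(n for _, _, n in nodes)
        for feat, ig, n in nodes:
            ri[feat] = ri.get(feat, 0.0) + (w_acc ** u) * ig * (n / n_tree) ** v
    return ri
```

With the tree from the slide (wAcc=0.75), feature 1 scores highest because it appears in two nodes.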
18. Monte Carlo Feature Selection
[Figure: top-5 ranked features (1st–5th) after s = 1000, 2000, …, 7000 samples; the ranking stabilizes as s grows.]
Stopping criterion: STABLE RANKING
19. Monte Carlo Feature Selection
[Figure: the same top-5 ranking (1st–5th) per s, alongside the cumulated number of samples.]
Cumulated sample count needed to reach s samples, when rankings are recomputed from scratch (independent) vs. reusing previous samples (incremental):
s      independent   incremental
1000   1000          1000
2000   3000          2000
3000   6000          3000
4000   10000         4000
5000   15000         5000
6000   21000         6000
7000   28000         7000
Stopping criterion: STABLE RANKING
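One way to implement the STABLE RANKING criterion — an assumption about the rule illustrated on these slides, not the dmLab code — is to stop once the ordered top-k features no longer change between consecutive checkpoints:

```python
# Stable-ranking stopping rule (illustrative sketch): compare the
# ordered top-k features of two consecutive score snapshots and
# report whether the ranking has settled.

def ranking_stable(prev_scores, curr_scores, k=5):
    """True if the ordered top-k features are identical in both rankings."""
    def top(scores):
        return sorted(scores, key=scores.get, reverse=True)[:k]
    return top(prev_scores) == top(curr_scores)
```

Note that only the order of the top-k features matters, not the raw score values, so scores may still drift while the ranking is already stable.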
20. DMCFS – basic concepts
• The sampling loop can be executed in parallel.
• Computations can be distributed between hosts.
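Since the Monte Carlo samples are independent, the loop can be fanned out to a worker pool and the partial scores merged as results arrive. An illustrative single-machine sketch using threads (the actual DMCFS distributes across hosts via an actor framework; evaluate_sample is a hypothetical placeholder):

```python
# Parallel execution of the sampling loop: each worker evaluates one
# Monte Carlo sample and returns a partial feature-score dictionary,
# which the main thread merges.

from concurrent.futures import ThreadPoolExecutor

def evaluate_sample(seed):
    # Placeholder for "draw features, split observations, train a tree";
    # here it just attributes a unit score to one of three fake features.
    return {f"f{seed % 3}": 1.0}

def parallel_scores(n_samples, workers=4):
    merged = {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(evaluate_sample, range(n_samples)):
            for feat, score in partial.items():
                merged[feat] = merged.get(feat, 0.0) + score
    return merged
```

In a real deployment the workers would be processes on separate hosts rather than threads, but the merge step is the same.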
21. DMCFS – basic concepts
• Communication between threads should be asynchronous (non-blocking) and as minimal as possible.
[Figure: SYNCHRONOUS communication – threads sit IDLE while waiting.]
22. DMCFS – basic concepts
• Communication between threads should be asynchronous (non-blocking) and as minimal as possible.
[Figure: ASYNCHRONOUS communication – threads make continuous PROGRESS.]
23. DMCFS – basic concepts
• Data should not have to be reshuffled by the application (as in MapReduce).
• Data distribution should be in the scope of the infrastructure, not the application.
• If possible, data should be loaded only once.
24. DMCFS – basic concepts
(same requirements as above)
[Figure: NODE 1, NODE 2, NODE 3 reading DATA from a NETWORK FILESYSTEM.]
25. DMCFS – basic concepts
(same requirements as above)
[Figure: NODE 1, NODE 2, NODE 3 backed by a DATABASE – when the problem dataset does not fit into memory.]
26. DMCFS – basic concepts
(same requirements as above)
[Figure: NODE 1, NODE 2, NODE 3 on HDFS+Spark – when feature samples do not fit in memory.]
27. DMCFS – basic concepts
Computations should not be affected by nodes being disconnected:
• Most nodes do not need to be aware of the other nodes or of the size of the cluster.
• Most nodes perform simple, stateless workloads.
• A single point of failure and potential bottleneck exists in the form of the Master Node – it is, however, restorable, and multithreading allows it to remain responsive.
28. DMCFS – basic concepts
Writing and maintaining the application can be greatly facilitated by employing a parallel programming framework – more specifically, an actor framework.
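The actor pattern can be sketched as a mailbox queue drained by a single thread, so senders never block and the actor's state needs no locks. A minimal Python illustration of the pattern (not the Java framework DMCFS actually uses):

```python
# Minimal actor-style worker: each actor owns a mailbox queue and a
# thread that processes messages one at a time.  Senders enqueue
# messages with tell() and continue immediately (non-blocking).

import queue
import threading

class ScoreActor:
    """Accumulates feature scores sent as (feature, score) messages."""

    def __init__(self):
        self.scores = {}
        self._mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def tell(self, message):
        """Non-blocking send: drop the message in the mailbox."""
        self._mailbox.put(message)

    def stop(self):
        """Send a poison pill and wait for the actor to finish."""
        self._mailbox.put(None)
        self._thread.join()

    def _run(self):
        while (msg := self._mailbox.get()) is not None:
            feat, score = msg
            self.scores[feat] = self.scores.get(feat, 0.0) + score
```

Because only the actor's own thread touches self.scores, many sampling workers can tell() it partial results concurrently without any explicit locking.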
42. DMCFS – summary
• Can be run on an arbitrary number of physical machines.
• Allows nodes to be dynamically attached and detached while computations are running.
• Provides constantly updated partial results.
• Scales almost linearly as processors are added.
• Platform-independent.
• Has no dependencies other than Java 1.8.
• Can be extended with new types of feature selectors.
• Can be deployed on a public or private cloud.
• Software (0.1.0) available upon request.
• Creation of an intranet service is ongoing.
43. Acknowledgements
I would like to thank dr. Draminski for providing the latest version of the dmLab software for evaluation, as well as Najla Al-Harbi, Sara Bin Judia, dr. Salma Majid, and dr. Ghazi Alsbeih (Faisal Specialist Hospital & Research Centre, Riyadh 11211, Kingdom of Saudi Arabia), and furthermore Bozena Rolnik (Data Mining Group) for providing the CNV data. Calculations were carried out using the computer cluster Ziemowit (http://www.ziemowit.hpc.polsl.pl) funded by the Silesian BIO-FARMA project No. POIG.02.01.00-00-166/08 in the Computational Biology and Bioinformatics Laboratory of the Biotechnology Centre at the Silesian University of Technology. The work was financially supported by NCN grant HARMONIA UMO-2013/08/M/ST6/00924 (LK).