Preprocessing data is one of the most effort-consuming tasks in Machine Learning (ML). In the Big Data context, the models automatically derived from data should be as simple, interpretable and fast as possible, and achieving that requires using the best variables, that is, the best features of the data.
Although several libraries already approach ML tasks in Big Data, that is not yet the case for feature selection (FS) algorithms or for other preprocessing techniques such as discretization, and the existing FS methods do not scale well when dealing with Big Data. In this presentation, we show our efforts and new ideas for parallelizing standard FS methods for their use in Big Data environments.
Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/thu/slot-11.html
Begin at the beginning: Feature selection for Big Data by Amparo Alonso at Big Data Spain 2015
1.
2. BEGIN AT THE BEGINNING:
FEATURE SELECTION FOR BIG
DATA
AMPARO ALONSO-BETANZOS
BIG DATA SPAIN 2015, Madrid
3. BIG DATA HISPANO, 2015 2
Begin at the Beginning
“Begin at the beginning,” the King said, very gravely, “and go on
till you come to the end: then stop.”
4.
The first step: Preprocessing the data
Peter Norvig, Google Research Director
5.
Not everything that counts can be counted, and not everything that can be counted counts.
Equality is not the way
9.
Feature selection: basic flavors
Advantages:
• Independence of the classifier
• Low computational cost
• Fast
• Good generalization ability
Disadvantages:
• No interaction with the classifier
Examples: CFS, Consistency-based, INTERACT, ReliefF, FCBF, InfoGain, mRMR
10.
Basic shapes of filters: they come in several ways
• Subset filters vs. ranker filters
• Univariate vs. multivariate methods
Feature selection techniques do not scale well with Big Data.
11.
Distributed Feature Selection
• Allocating the learning process among several workstations is a natural way of scaling up learning algorithms.
Scaling up FS. Advantages:
– Reduction in execution time
– Resource sharing
– Better performance
12.
Cluster computing
MLlib
Distributed implementation of a FS method
13.
MLlib, why?
It is built on Apache Spark, a fast and general engine for large-scale data processing:
• Runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
• Runs on Hadoop 2 clusters.
• Write applications quickly in Java, Scala, or Python.
15.
Implementing FS based on an IT framework
We implemented a generic FS framework for Big Data based on Information Theory:
• Brown G, Pocock A, Zhao MJ, Luján M (2012) Conditional likelihood maximisation: A unifying framework for information theoretic feature selection. J Mach Learn Res 13:27–66.
The criterion combines three terms: relevance, redundancy and conditional redundancy.
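In Brown's unifying framework, a candidate feature Xk is scored as J(Xk) = I(Xk;Y) − β Σj∈S I(Xk;Xj) + λ Σj∈S I(Xk;Xj|Y): relevance, redundancy and conditional redundancy over the already selected set S. A minimal single-machine sketch over discrete features (plain Python; the function names are ours, and the actual framework computes these quantities as distributed Spark operations):

```python
from collections import Counter
from math import log2

def mutual_info(x, y):
    """I(X;Y) in bits for two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(c / n * log2(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def cond_mutual_info(x, y, z):
    """I(X;Y|Z) = sum over z of p(z) * I(X;Y | Z=z)."""
    n = len(z)
    total = 0.0
    for zv, cnt in Counter(z).items():
        idx = [i for i in range(n) if z[i] == zv]
        total += cnt / n * mutual_info([x[i] for i in idx], [y[i] for i in idx])
    return total

def brown_score(candidate, selected, label, beta=1.0, lam=1.0):
    """J(Xk) = relevance - beta * redundancy + lambda * cond. redundancy."""
    rel = mutual_info(candidate, label)
    red = sum(mutual_info(candidate, xj) for xj in selected)
    cond = sum(cond_mutual_info(candidate, xj, label) for xj in selected)
    return rel - beta * red + lam * cond
```

Particular choices of β and λ recover the classic criteria the framework unifies; for instance, λ = 0 with β = 1/|S| gives an mRMR-style score.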
16.
The long and winding road
Discretization is needed!
Transform numerical attributes into discrete or nominal attributes with a finite number of intervals.
18.
The algorithm: MDLP (Minimum Description Length Principle)
Proposal: complete re-design of the discretization method
– Sort all points in the dataset in a single distributed operation using a Spark primitive.
– Evaluate the boundary points (per feature) in parallel.
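The per-feature evaluation can be sketched as follows (single-machine Python following Fayyad & Irani's MDLP formulation; in the actual proposal, the sort and the evaluation are distributed Spark operations):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a discrete label sequence, in bits."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def boundary_points(values, labels):
    """Candidate cut points: midpoints between consecutive sorted values
    whose surrounding class labels differ."""
    pairs = sorted(zip(values, labels))
    return [(pairs[i][0] + pairs[i + 1][0]) / 2
            for i in range(len(pairs) - 1)
            if pairs[i][0] != pairs[i + 1][0] and pairs[i][1] != pairs[i + 1][1]]

def mdlp_accepts(values, labels, cut):
    """MDLP stopping criterion: accept the cut only if the information
    gain exceeds the description-length cost of encoding the partition."""
    n = len(values)
    left = [l for v, l in zip(values, labels) if v <= cut]
    right = [l for v, l in zip(values, labels) if v > cut]
    ent, ent_l, ent_r = entropy(labels), entropy(left), entropy(right)
    gain = ent - (len(left) / n) * ent_l - (len(right) / n) * ent_r
    k, k_l, k_r = (len(set(s)) for s in (labels, left, right))
    delta = log2(3 ** k - 2) - (k * ent - k_l * ent_l - k_r * ent_r)
    return gain > (log2(n - 1) + delta) / n
```

Recursively applying the best accepted cut to each resulting interval yields the final set of discretization thresholds for the feature.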
20.
Original criteria and their reformulation in the framework
22.
Re-design of the FS framework for Spark
The complexity of the framework is determined by the computation of relevance and redundancy.
Proposal: complete re-design of Brown's framework
– Columnar transformation: the access pattern of most FS methods is feature-wise, and the partitioning scheme of the data is quite influential in Apache Spark.
– Caching variables: relevance is computed and cached once at the start. The marginal and joint proportions derived from these operations are also cached, and this information is replicated.
– Greedy approach: only one feature is selected in each iteration, so the quadratic complexity is reduced to a complexity determined by the number of features selected.
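A toy sketch of the greedy loop with an mRMR-style criterion (plain Python over discrete features; names are ours — the re-designed framework runs the same loop with cached, distributed computations):

```python
from collections import Counter
from math import log2

def mutual_info(x, y):
    """I(X;Y) in bits for two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(c / n * log2(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def mrmr_score(candidate, selected, label):
    """Relevance minus average redundancy with the selected features."""
    red = (sum(mutual_info(candidate, s) for s in selected) / len(selected)
           if selected else 0.0)
    return mutual_info(candidate, label) - red

def greedy_select(features, label, k, score=mrmr_score):
    """Greedy forward selection over {name: column}: one feature is chosen
    per iteration, so the number of criterion evaluations is
    O(k * n_features) rather than quadratic in n_features."""
    chosen, remaining = [], dict(features)
    while remaining and len(chosen) < k:
        cols = [features[c] for c in chosen]
        best = max(remaining, key=lambda name: score(remaining[name], cols, label))
        chosen.append(best)
        del remaining[best]
    return chosen
```

For example, with label [0, 0, 1, 1] and features a = [0, 0, 1, 1], b = [0, 1, 0, 1], c = [1, 1, 0, 0], selecting two features picks 'a' and then the non-redundant 'b' rather than 'c', which duplicates 'a'.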
23.
Scalability results: selection time
DNA dataset
28.
Other attempts
Parallel Implementation of mRMR on
GPU
https://github.com/sramirez/fast-mRMR
Implementation of other FS algorithms:
ReliefF, CFS, SVM-RFE
(working on the scalability studies)
29.
Distributed Feature Selection (DFS)
Data can be located in different sites:
• Different parts of a company
• Different cooperating organizations
• A very large data set can also be distributed over several processors, and the results then combined
DFS goal: to reduce the computational time while maintaining the classification performance.
30.
DFS. Types of partition
By samples
By features
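As a toy illustration of the two partition schemes (function names are ours):

```python
def partition_by_samples(data, n_parts):
    """Horizontal partition: each node receives a slice of the rows
    (all features of a subset of the samples)."""
    return [data[i::n_parts] for i in range(n_parts)]

def partition_by_features(data, n_parts):
    """Vertical partition: each node receives a slice of the columns
    (a subset of the features for every sample)."""
    n_feats = len(data[0])
    cols = [list(range(i, n_feats, n_parts)) for i in range(n_parts)]
    return [[[row[c] for c in chunk] for row in data] for chunk in cols]
```

With two rows of four features and two partitions, the first scheme hands one full row to each node, while the second hands each node two columns of both rows.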
35.
Discretization. How does it work?
Parameters: 50 intervals and 100,000 max candidates per partition.
Classifier: Naive Bayes from MLlib, lambda = 1, iterations = 100.
Hardware: 16 nodes (12 cores per node), 64 GB RAM.
Software: Hadoop 2.5 and Apache Spark 1.2.0.
36.
Feature Selection. Experimental results
DATASETS
Parameters: FS algorithm = mRMR, level of parallelism = 864 partitions.
Classifier: Naive Bayes and SVM (default parameters), from MLlib.
Hardware: 18 nodes (12 cores per node), 64 GB RAM.
Software: Hadoop 2.5 and Apache Spark 1.2.0.
37.
CPU vs. CUDA
• Low number of possible values (< 64)
• High number of possible values (up to 256)
38.
GPU. Real datasets
DATASET     PATTERNS    FEATURES  VALUES
KDDCup99     4,000,000     41      255
Higgs       11,000,000     21      255
45.
Improving the method: a new approach
• Horizontal partitioning of the datasets
  – Partitioning of the data maintaining the class distribution
• Application of the filter to each partition
• Combination of the results
  – Merging procedure: theoretical complexity of the feature subsets
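The horizontal partitioning step can be sketched as follows (a round-robin, class-stratified split; the function name is ours):

```python
from collections import defaultdict

def stratified_partitions(samples, labels, n_parts):
    """Horizontal partitioning that maintains the class distribution:
    deal the samples of each class round-robin over the partitions."""
    parts = [[] for _ in range(n_parts)]
    by_class = defaultdict(list)
    for s, y in zip(samples, labels):
        by_class[y].append((s, y))
    for members in by_class.values():
        for i, item in enumerate(members):
            parts[i % n_parts].append(item)
    return parts
```

Each partition then receives (approximately) the same proportion of every class as the full dataset, so the filter applied to each partition sees an unbiased class distribution.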
46.
The complexity measurement: Fisher discriminant ratio
• Calculate the complexity of each candidate subset of features
• μi, σi² and pi: the mean, variance and proportion of the i-th class
• Independence from the classifier
• Improvement in running time
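A sketch of the measure for a single feature (one common multiclass form of the Fisher discriminant ratio; the function name is ours):

```python
def fisher_ratio(values, labels):
    """Multiclass Fisher discriminant ratio for a single feature:
    between-class scatter over within-class scatter, using the mean (mu_i),
    variance (sigma_i^2) and proportion (p_i) of each class. A higher
    value means the classes separate more easily along this feature."""
    classes = sorted(set(labels))
    n = len(values)
    stats = {}
    for c in classes:
        xs = [v for v, l in zip(values, labels) if l == c]
        mu = sum(xs) / len(xs)
        var = sum((x - mu) ** 2 for x in xs) / len(xs)
        stats[c] = (mu, var, len(xs) / n)
    between = sum(stats[a][2] * stats[b][2] * (stats[a][0] - stats[b][0]) ** 2
                  for a in classes for b in classes if a != b)
    within = sum(var * p for _, var, p in stats.values())
    return between / within if within else float("inf")
```

Because the ratio depends only on per-class statistics of the candidate subset, it can rank merged feature subsets without training any classifier, which is where the independence and runtime gains come from.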
47.
Number of selected features
              Connect4  Isolet  Madelon  Ozone  Spambase  Mnist
Full set          42      617      500     72       57     717
Centralized        7      186       18     20       19      61
Distrib-Comp       8      105        9      8       18      77
56.
Experimental results. Microarray datasets
DNA microarray data is a good candidate for vertically distributed feature selection, since the data needs to be split by features.
This type of data usually has redundant features.
58.
Vertical partition. Complexity measure
Using the complexity measure instead of accuracy.
Number of features and samples per dataset:
          Features  Training   Test  Classes
Isolet        617      6238    1236      26
Madelon       500      1600     800       2
Mnist         717     40000   20000       2
59.
Experimental results. Classification accuracy
61.
Complexity measure. Time
[Figure: running-time plots for Isolet, Madelon and MNIST, with average speedups of 2318.45, 26.13 and 1483.80, respectively. Average: 573.43]
63.
GPU implementation of FS
Parallel computing paradigm: CUDA platform
• NVIDIA GTX 780 Ti
IT-based algorithm: mRMR (minimum Redundancy Maximum Relevance)
– CPU-optimized version (fast-mRMR)
– GPU version
64. BIG DATA HISPANO, 2015 63
GPU compute efficiently image histograms
Accelerate computation of MI in GPU
Image processing (up to 256 values)
Previous ideas
Input
Data
Threads Partial
histograms Final results
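The same pattern can be simulated on the CPU (plain Python standing in for CUDA thread blocks; names are ours): each "thread" builds a private joint histogram over its own slice of the data, the partial histograms are reduced, and MI is computed from the joint counts.

```python
from math import log2

def partial_histograms(x, y, n_threads, n_vals=256):
    """Each 'thread' accumulates a joint histogram over its own slice of
    the data (the GPU pattern of privatized per-block histograms)."""
    parts = []
    for t in range(n_threads):
        h = [[0] * n_vals for _ in range(n_vals)]
        for xi, yi in zip(x[t::n_threads], y[t::n_threads]):
            h[xi][yi] += 1
        parts.append(h)
    return parts

def merge_and_mi(parts):
    """Reduce the partial histograms and compute I(X;Y) from the counts."""
    n_vals = len(parts[0])
    joint = [[sum(p[i][j] for p in parts) for j in range(n_vals)]
             for i in range(n_vals)]
    n = sum(map(sum, joint))
    px = [sum(row) for row in joint]
    py = [sum(joint[i][j] for i in range(n_vals)) for j in range(n_vals)]
    return sum(c / n * log2(c * n / (px[i] * py[j]))
               for i, row in enumerate(joint) for j, c in enumerate(row) if c)
```

Capping features at 256 distinct values keeps every joint histogram at 256 × 256 counters, small enough to privatize per block on the GPU.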
65. BIG DATA HISPANO, 2015 64
Reorder
Variable Discretization : value 8 bits number (max 256 different
values per feature)
Data Access pattern
77.
Challenges: Visualization and Interpretability
78.
References
V. Bolón-Canedo, N. Sánchez-Maroño, A. Alonso-Betanzos. Distributed feature selection: An application to microarray data classification. Applied Soft Computing 30:136-150, 2015. DOI: 10.1016/j.asoc.2015.01.035.
V. Bolón-Canedo, N. Sánchez-Maroño, A. Alonso-Betanzos. Recent advances and emerging challenges of feature selection in the context of Big Data. Knowledge-Based Systems, 2015. DOI: 10.1016/j.knosys.2015.05.014.
V. Bolón-Canedo, N. Sánchez-Maroño, A. Alonso-Betanzos. Feature Selection for High-Dimensional Data. Springer-Verlag, 2015 (in production).