Preprocessing data is one of the most effort-consuming tasks in Machine Learning (ML). In the Big Data context, the models automatically derived from data should be as simple, interpretable and fast as possible, and achieving that requires using the best variables, that is, the best features of the data.
Although several libraries already approach ML tasks in Big Data, that is not yet the case for feature selection (FS) algorithms or for other preprocessing techniques such as discretization, and the existing FS methods do not scale well when dealing with Big Data. In this presentation, we show our efforts and new ideas for parallelizing standard FS methods for their use in Big Data environments.
Session presented at Big Data Spain 2015 Conference
15th Oct 2015
Kinépolis Madrid
http://www.bigdataspain.org
Event promoted by: http://www.paradigmatecnologico.com
Abstract: http://www.bigdataspain.org/program/thu/slot-11.html
Begin at the beginning: Feature selection for Big Data by Amparo Alonso at Big Data Spain 2015
1.
2. BEGIN AT THE BEGINNING:
FEATURE SELECTION FOR BIG
DATA
AMPARO ALONSO-BETANZOS
BIG DATA SPAIN 2015, Madrid
3. BIG DATA HISPANO, 2015 2
Begin at the Beginning
“Begin at the beginning,” the King said, very gravely, “and go on
till you come to the end: then stop.”
4.
The first step: Preprocessing the data
Peter Norvig, Google Research Director
5.
Not everything that counts can be counted, and not everything that can be counted counts.
Equality is not the way
9.
Feature selection: basic flavors
Advantages:
• Independence of the classifier
• Low computational cost
• Fast
• Good generalization ability
Disadvantages:
• No interaction with the classifier
Examples: CFS, Consistency-based, INTERACT, ReliefF, FCBF, InfoGain, mRMR
10.
Basic shapes of filters: they come in several ways
• Subset filters vs. ranker filters
• Univariate vs. multivariate methods
Feature selection techniques do not scale well with Big Data.
11.
Distributed Feature Selection
• Allocating the learning process among several workstations is a natural way of scaling up learning algorithms.
Scaling up FS. Advantages:
– Reduction in execution time
– Resource sharing
– Better performance
12.
Cluster computing
MLlib
Distributed implementation of a FS method
13.
MLlib, why?
It is built on Apache Spark, a fast and general engine for large-scale data processing:
• Runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
• Runs on Hadoop 2 clusters.
• Write applications quickly in Java, Scala, or Python.
15.
Implementing FS based on an IT framework
We implemented a generic FS framework for Big Data based on Information Theory:
• Brown G, Pocock A, Zhao MJ, Luján M (2012) Conditional likelihood maximisation: A unifying framework for information theoretic feature selection. J Mach Learn Res 13:27–66.
The criterion combines three terms: relevance, redundancy and conditional redundancy.
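In Brown's unifying framework, a candidate feature Xk is scored as J(Xk) = I(Xk;Y) − β Σj∈S I(Xk;Xj) + λ Σj∈S I(Xk;Xj|Y): relevance, redundancy and conditional redundancy over the already selected set S. A minimal single-machine sketch over discrete features (plain Python; the function names are ours, and the actual framework computes these quantities as distributed Spark operations):

```python
from collections import Counter
from math import log2

def mutual_info(x, y):
    """I(X;Y) in bits for two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(c / n * log2(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def cond_mutual_info(x, y, z):
    """I(X;Y|Z) = sum over z of p(z) * I(X;Y | Z=z)."""
    n = len(z)
    total = 0.0
    for zv, cnt in Counter(z).items():
        idx = [i for i in range(n) if z[i] == zv]
        total += cnt / n * mutual_info([x[i] for i in idx], [y[i] for i in idx])
    return total

def brown_score(candidate, selected, label, beta=1.0, lam=1.0):
    """J(Xk) = relevance - beta * redundancy + lambda * cond. redundancy."""
    rel = mutual_info(candidate, label)
    red = sum(mutual_info(candidate, xj) for xj in selected)
    cond = sum(cond_mutual_info(candidate, xj, label) for xj in selected)
    return rel - beta * red + lam * cond
```

Particular choices of β and λ recover the classic criteria the framework unifies; for instance, λ = 0 with β = 1/|S| gives an mRMR-style score.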
16.
The long and winding road
Discretization is needed!
Transform numerical attributes into discrete or nominal attributes with a finite number of intervals.
18.
The algorithm: MDLP (Minimum Description Length Principle)
Proposal: complete re-design of the discretization method
– Sort all points in the dataset in a single distributed operation using a Spark primitive.
– Evaluate the boundary points (per feature) in parallel.
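The per-feature evaluation can be sketched as follows (single-machine Python following Fayyad & Irani's MDLP formulation; in the actual proposal, the sort and the evaluation are distributed Spark operations):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Shannon entropy of a discrete label sequence, in bits."""
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def boundary_points(values, labels):
    """Candidate cut points: midpoints between consecutive sorted values
    whose surrounding class labels differ."""
    pairs = sorted(zip(values, labels))
    return [(pairs[i][0] + pairs[i + 1][0]) / 2
            for i in range(len(pairs) - 1)
            if pairs[i][0] != pairs[i + 1][0] and pairs[i][1] != pairs[i + 1][1]]

def mdlp_accepts(values, labels, cut):
    """MDLP stopping criterion: accept the cut only if the information
    gain exceeds the description-length cost of encoding the partition."""
    n = len(values)
    left = [l for v, l in zip(values, labels) if v <= cut]
    right = [l for v, l in zip(values, labels) if v > cut]
    ent, ent_l, ent_r = entropy(labels), entropy(left), entropy(right)
    gain = ent - (len(left) / n) * ent_l - (len(right) / n) * ent_r
    k, k_l, k_r = (len(set(s)) for s in (labels, left, right))
    delta = log2(3 ** k - 2) - (k * ent - k_l * ent_l - k_r * ent_r)
    return gain > (log2(n - 1) + delta) / n
```

Recursively applying the best accepted cut to each resulting interval yields the final set of discretization thresholds for the feature.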
20.
Original criteria and their reformulation in the framework
22.
Re-design of the FS framework for Spark
The complexity of the framework is determined by the computation of relevance and redundancy.
Proposal: complete re-design of Brown's framework
– Columnar transformation: the access pattern of most FS methods is feature-wise, and the partitioning scheme of the data is quite influential in Apache Spark.
– Caching variables: relevance is computed and cached once at the start. The marginal and joint proportions derived from these operations are also cached, and this information is replicated.
– Greedy approach: only one feature is selected in each iteration, so the quadratic complexity is reduced to a complexity determined by the number of features selected.
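A toy sketch of the greedy loop with an mRMR-style criterion (plain Python over discrete features; names are ours — the re-designed framework runs the same loop with cached, distributed computations):

```python
from collections import Counter
from math import log2

def mutual_info(x, y):
    """I(X;Y) in bits for two discrete sequences."""
    n = len(x)
    px, py, pxy = Counter(x), Counter(y), Counter(zip(x, y))
    return sum(c / n * log2(c * n / (px[a] * py[b]))
               for (a, b), c in pxy.items())

def mrmr_score(candidate, selected, label):
    """Relevance minus average redundancy with the selected features."""
    red = (sum(mutual_info(candidate, s) for s in selected) / len(selected)
           if selected else 0.0)
    return mutual_info(candidate, label) - red

def greedy_select(features, label, k, score=mrmr_score):
    """Greedy forward selection over {name: column}: one feature is chosen
    per iteration, so the number of criterion evaluations is
    O(k * n_features) rather than quadratic in n_features."""
    chosen, remaining = [], dict(features)
    while remaining and len(chosen) < k:
        cols = [features[c] for c in chosen]
        best = max(remaining, key=lambda name: score(remaining[name], cols, label))
        chosen.append(best)
        del remaining[best]
    return chosen
```

For example, with label [0, 0, 1, 1] and features a = [0, 0, 1, 1], b = [0, 1, 0, 1], c = [1, 1, 0, 0], selecting two features picks 'a' and then the non-redundant 'b' rather than 'c', which duplicates 'a'.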
23.
Scalability results: selection time
DNA dataset
28.
Other attempts
Parallel Implementation of mRMR on
GPU
https://github.com/sramirez/fast-mRMR
Implementation of other FS algorithms:
ReliefF, CFS, SVM-RFE
(working on the scalability studies)
29.
Distributed Feature Selection (DFS)
Data can be located in different sites:
• Different parts of a company
• Different cooperating organizations
• A very large data set can also be distributed over several processors, and the results then combined
DFS goal: to reduce the computational time while maintaining the classification performance.
30.
DFS. Types of partition
By samples
By features
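As a toy illustration of the two partition schemes (function names are ours):

```python
def partition_by_samples(data, n_parts):
    """Horizontal partition: each node receives a slice of the rows
    (all features of a subset of the samples)."""
    return [data[i::n_parts] for i in range(n_parts)]

def partition_by_features(data, n_parts):
    """Vertical partition: each node receives a slice of the columns
    (a subset of the features for every sample)."""
    n_feats = len(data[0])
    cols = [list(range(i, n_feats, n_parts)) for i in range(n_parts)]
    return [[[row[c] for c in chunk] for row in data] for chunk in cols]
```

With two rows of four features and two partitions, the first scheme hands one full row to each node, while the second hands each node two columns of both rows.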
35.
Discretization. How does it work?
Parameters: 50 intervals and 100,000 max candidates per partition.
Classifier: Naive Bayes from MLlib, lambda = 1, iterations = 100.
Hardware: 16 nodes (12 cores per node), 64 GB RAM.
Software: Hadoop 2.5 and Apache Spark 1.2.0.
36.
Feature Selection. Experimental results
DATASETS
Parameters: FS algorithm = mRMR, level of parallelism = 864 partitions.
Classifier: Naive Bayes and SVM (default parameters), from MLlib.
Hardware: 18 nodes (12 cores per node), 64 GB RAM.
Software: Hadoop 2.5 and Apache Spark 1.2.0.
37.
CPU vs. CUDA
• Low number of possible values (< 64)
• High number of possible values (up to 256)
38.
GPU. Real datasets
DATASET     PATTERNS    FEATURES  VALUES
KDDCup99     4,000,000     41      255
Higgs       11,000,000     21      255
45.
Improving the method: a new approach
• Horizontal partitioning of the datasets
  – Partitioning of the data maintaining the class distribution
• Application of the filter to each partition
• Combination of the results
  – Merging procedure: theoretical complexity of the feature subsets
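The horizontal partitioning step can be sketched as follows (a round-robin, class-stratified split; the function name is ours):

```python
from collections import defaultdict

def stratified_partitions(samples, labels, n_parts):
    """Horizontal partitioning that maintains the class distribution:
    deal the samples of each class round-robin over the partitions."""
    parts = [[] for _ in range(n_parts)]
    by_class = defaultdict(list)
    for s, y in zip(samples, labels):
        by_class[y].append((s, y))
    for members in by_class.values():
        for i, item in enumerate(members):
            parts[i % n_parts].append(item)
    return parts
```

Each partition then receives (approximately) the same proportion of every class as the full dataset, so the filter applied to each partition sees an unbiased class distribution.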
46.
The complexity measurement: Fisher discriminant ratio
• Calculate the complexity of each candidate subset of features
• μi, σi² and pi: the mean, variance and proportion of the i-th class
• Independence from the classifier
• Improvement in running time
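A sketch of the measure for a single feature (one common multiclass form of the Fisher discriminant ratio; the function name is ours):

```python
def fisher_ratio(values, labels):
    """Multiclass Fisher discriminant ratio for a single feature:
    between-class scatter over within-class scatter, using the mean (mu_i),
    variance (sigma_i^2) and proportion (p_i) of each class. A higher
    value means the classes separate more easily along this feature."""
    classes = sorted(set(labels))
    n = len(values)
    stats = {}
    for c in classes:
        xs = [v for v, l in zip(values, labels) if l == c]
        mu = sum(xs) / len(xs)
        var = sum((x - mu) ** 2 for x in xs) / len(xs)
        stats[c] = (mu, var, len(xs) / n)
    between = sum(stats[a][2] * stats[b][2] * (stats[a][0] - stats[b][0]) ** 2
                  for a in classes for b in classes if a != b)
    within = sum(var * p for _, var, p in stats.values())
    return between / within if within else float("inf")
```

Because the ratio depends only on per-class statistics of the candidate subset, it can rank merged feature subsets without training any classifier, which is where the independence and runtime gains come from.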
47.
Number of selected features
              Connect4  Isolet  Madelon  Ozone  Spambase  Mnist
Full set          42      617      500     72       57     717
Centralized        7      186       18     20       19      61
Distrib-Comp       8      105        9      8       18      77
56.
Experimental results. Microarray datasets
DNA microarray data is a good candidate for vertically distributed feature selection, since the data needs to be split by features.
This type of data usually has redundant features.
58.
Vertical partition. Complexity measure
Using the complexity measure instead of accuracy.
Number of features and samples per dataset:
          Features  Training   Test  Classes
Isolet        617      6238    1236      26
Madelon       500      1600     800       2
Mnist         717     40000   20000       2
59.
Experimental results. Classification accuracy
61.
Complexity measure. Time
[Figure: running-time plots for Isolet, Madelon and MNIST, with average speedups of 2318.45, 26.13 and 1483.80, respectively. Average: 573.43]
63.
GPU implementation of FS
Parallel computing paradigm: CUDA platform
• NVIDIA GTX 780 Ti
IT-based algorithm: mRMR (minimum Redundancy Maximum Relevance)
– CPU-optimized version (fast-mRMR)
– GPU version
64. BIG DATA HISPANO, 2015 63
GPU compute efficiently image histograms
Accelerate computation of MI in GPU
Image processing (up to 256 values)
Previous ideas
Input
Data
Threads Partial
histograms Final results
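The same pattern can be simulated on the CPU (plain Python standing in for CUDA thread blocks; names are ours): each "thread" builds a private joint histogram over its own slice of the data, the partial histograms are reduced, and MI is computed from the joint counts.

```python
from math import log2

def partial_histograms(x, y, n_threads, n_vals=256):
    """Each 'thread' accumulates a joint histogram over its own slice of
    the data (the GPU pattern of privatized per-block histograms)."""
    parts = []
    for t in range(n_threads):
        h = [[0] * n_vals for _ in range(n_vals)]
        for xi, yi in zip(x[t::n_threads], y[t::n_threads]):
            h[xi][yi] += 1
        parts.append(h)
    return parts

def merge_and_mi(parts):
    """Reduce the partial histograms and compute I(X;Y) from the counts."""
    n_vals = len(parts[0])
    joint = [[sum(p[i][j] for p in parts) for j in range(n_vals)]
             for i in range(n_vals)]
    n = sum(map(sum, joint))
    px = [sum(row) for row in joint]
    py = [sum(joint[i][j] for i in range(n_vals)) for j in range(n_vals)]
    return sum(c / n * log2(c * n / (px[i] * py[j]))
               for i, row in enumerate(joint) for j, c in enumerate(row) if c)
```

Capping features at 256 distinct values keeps every joint histogram at 256 × 256 counters, small enough to privatize per block on the GPU.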
65. BIG DATA HISPANO, 2015 64
Reorder
Variable Discretization : value 8 bits number (max 256 different
values per feature)
Data Access pattern
77.
Challenges: Visualization and Interpretability
78.
References
V. Bolón-Canedo, N. Sánchez-Maroño, A. Alonso-Betanzos. Distributed feature selection: An application to microarray data classification. Applied Soft Computing 30:136-150, 2015. DOI: 10.1016/j.asoc.2015.01.035.
V. Bolón-Canedo, N. Sánchez-Maroño, A. Alonso-Betanzos. Recent advances and emerging challenges of feature selection in the context of Big Data. Knowledge-Based Systems, 2015. DOI: 10.1016/j.knosys.2015.05.014.
V. Bolón-Canedo, N. Sánchez-Maroño, A. Alonso-Betanzos. Feature Selection for High-Dimensional Data. Springer-Verlag, 2015 (in production).