BEGIN AT THE BEGINNING:
FEATURE SELECTION FOR BIG
DATA
AMPARO ALONSO-BETANZOS
BIG DATA SPAIN 2015, Madrid
BIG DATA HISPANO, 2015 2
Begin at the Beginning
“Begin at the beginning,” the King said, very gravely, “and go on till you come to the end: then stop.”
The first step: Preprocessing the data
Peter Norvig
Google Research
Director
Not everything that counts can be counted, and not everything that can be counted counts.
Equality is not the way
Why Feature reduction?
Arriving at the best features
Feature selection. Benefits
Feature selection: basic flavors
Advantages:
• Independence of classifier
• Low computational cost
• Fast
• Good generalization ability
Disadvantages:
• No interaction with classifier
Examples:
• CFS
• Consistency-based
• INTERACT
• ReliefF
• FCBF
• InfoGain
• mRMR
Basic shapes of filters: In several ways
• Subset filters
• Ranker filters
• Univariate methods
• Multivariate methods
Feature selection techniques do not scale well with Big Data
Scaling up FS
Distributed Feature Selection
• Allocating the learning process among several workstations is a natural way of scaling up learning algorithms.
• Advantages:
– Reduction in execution time
– Resource sharing
– Better performance
Cluster computing
MLlib
Distributed implementation of a FS method
MLlib, why?
• It is built on Apache Spark, a fast and general engine for large-scale data processing.
• Runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk.
• Runs on Hadoop 2 clusters.
• Write applications quickly in Java, Scala, or Python.
MLlib contents
No FS algorithms!!!
Implementing FS based on IT framework
• Implemented a generic FS framework for Big Data based on Information Theory
  – Brown G, Pocock A, Zhao MJ, Lujan M (2012) Conditional likelihood maximisation: A unifying framework for information theoretic feature selection. J Mach Learn Res 13:27–66.
Criterion terms: relevance, redundancy, conditional redundancy
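For reference, the three terms named above combine into a single score for a candidate feature X_k given the already selected set S. The block below writes out the generic criterion of the cited paper (Brown et al., 2012); the weights β and γ follow that paper's notation, and Y denotes the class variable.

```latex
J(X_k) \;=\; \underbrace{I(X_k; Y)}_{\text{relevance}}
\;-\; \beta \underbrace{\sum_{X_j \in S} I(X_j; X_k)}_{\text{redundancy}}
\;+\; \gamma \underbrace{\sum_{X_j \in S} I(X_j; X_k \mid Y)}_{\text{conditional redundancy}}
```

Particular choices of the weights recover known filters: for example, mRMR corresponds to β = 1/|S| and γ = 0.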
The long and winding road
Discretization is needed!!
Transform numerical attributes into discrete or nominal attributes with a finite number of intervals.
Stages in the discretization process
The algorithm: MDLP
• Proposal: complete re-design of the discretization method MDLP (Minimum Description Length Principle)
  – Sort all points in the dataset using a single distributed operation (a Spark primitive).
  – Evaluate boundary points (per feature) in a parallel way.
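To make the cut-point evaluation concrete, here is a minimal single-machine sketch of the entropy-based MDLP test (in the Fayyad and Irani formulation) for one feature. It is not the distributed Spark implementation described in the talk: the function names are hypothetical, and for simplicity it evaluates every midpoint between distinct sorted values rather than only class-boundary points.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_cut(values, labels):
    """Return (cut_point, accepted) for one numeric feature.

    Simplification (assumption): all midpoints between distinct values are
    candidates, not only class-boundary points.
    """
    pairs = sorted(zip(values, labels))        # the "sort all points" step
    xs = [v for v, _ in pairs]
    ys = [y for _, y in pairs]
    n = len(ys)
    ent_s = entropy(ys)
    k = len(set(ys))
    best = None                                # (gain, cut, left, right)
    for i in range(1, n):
        if xs[i] == xs[i - 1]:
            continue                           # no cut between equal values
        left, right = ys[:i], ys[i:]
        gain = ent_s - (len(left) / n) * entropy(left) - (len(right) / n) * entropy(right)
        if best is None or gain > best[0]:
            best = (gain, (xs[i - 1] + xs[i]) / 2, left, right)
    if best is None:
        return None, False
    gain, cut, left, right = best
    k1, k2 = len(set(left)), len(set(right))
    # MDLP stopping criterion: accept the cut only if the gain pays for its cost
    delta = math.log2(3 ** k - 2) - (k * ent_s - k1 * entropy(left) - k2 * entropy(right))
    threshold = (math.log2(n - 1) + delta) / n
    return cut, gain > threshold
```

A full discretizer would apply `best_cut` recursively to the two resulting intervals until no cut is accepted.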
Is it worth it?
Original criteria and reformulation in framework
Re-design of FS framework for Spark
• The complexity of the framework is determined by the computations of relevance and redundancy.
• Proposal: complete re-design of Brown's framework
  – Columnar transformation: the access pattern of most FS methods is feature-wise, and the partitioning scheme of the data is quite influential in Apache Spark.
  – Caching variables: relevance is computed and cached once at the start. The marginal and joint proportions derived from this operation are also cached. This information is replicated.
  – Greedy approach: only one feature is selected in each iteration. The quadratic complexity is reduced to a complexity determined by the number of features selected.
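The caching and greedy ideas can be illustrated on a single machine. The sketch below is an assumption-laden NumPy analogue, not the Spark code from the talk: it stores features column-wise, computes relevance once, and keeps a running redundancy sum per candidate so that each iteration only needs mutual information against the most recently selected feature. It uses the mRMR weighting (redundancy averaged over the selected set); all function names are hypothetical.

```python
import numpy as np


def mutual_information(x, y):
    """Mutual information (bits) between two integer-coded discrete vectors."""
    x = np.asarray(x)
    y = np.asarray(y)
    joint = np.zeros((x.max() + 1, y.max() + 1))
    np.add.at(joint, (x, y), 1)                 # joint counts
    joint /= joint.sum()                        # joint proportions
    px = joint.sum(axis=1, keepdims=True)       # marginal of x
    py = joint.sum(axis=0, keepdims=True)       # marginal of y
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())


def greedy_mrmr(columns, y, n_select):
    """columns: list of 1-D arrays, one per feature (columnar layout)."""
    relevance = [mutual_information(c, y) for c in columns]  # cached once
    redundancy_sum = np.zeros(len(columns))    # running sum of I(X_j; X_k), j selected
    selected = []
    remaining = set(range(len(columns)))
    while remaining and len(selected) < n_select:
        def score(k):
            penalty = redundancy_sum[k] / len(selected) if selected else 0.0
            return relevance[k] - penalty
        best = max(sorted(remaining), key=score)  # ties go to the lowest index
        selected.append(best)
        remaining.remove(best)
        for k in remaining:                      # only one new MI per candidate
            redundancy_sum[k] += mutual_information(columns[best], columns[k])
    return selected
```

With this bookkeeping the number of mutual-information evaluations grows with (features × selected features) rather than with all feature pairs.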
Scalability results: Selection time
DNA dataset
Scalability: Cores
ECBDL14 dataset
Is it useful?
Time for creating the classification model
Spark-Infotheoretic FS
Other attempts
• Parallel implementation of mRMR on GPU: https://github.com/sramirez/fast-mRMR
• Implementation of other FS algorithms: ReliefF, CFS, SVM-RFE (working on the scalability studies)
Distributed Feature Selection (DFS)
Data can be located in different sites:
• Different parts of a company
• Different cooperating organizations
• A very large data set can be distributed over several processors, and the results then combined
DFS goal:
• to reduce the computational time
• while maintaining the classification performance
DFS. Types of partition
By samples
By features
DFS with rankers
NDCG values
Distributed FS by Samples
HORIZONTAL PARTITION
Distributed FS by Features
VERTICAL PARTITION
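The two partition types differ only in the axis along which the data matrix is cut. A minimal NumPy sketch follows, assuming rows are samples and columns are features; the helper names are hypothetical.

```python
import numpy as np


def partition_by_samples(X, n_parts):
    """Horizontal partition: each part keeps all features for a subset of rows."""
    return np.array_split(X, n_parts, axis=0)


def partition_by_features(X, n_parts):
    """Vertical partition: each part keeps all rows for a subset of columns."""
    return np.array_split(X, n_parts, axis=1)


X = np.arange(24).reshape(6, 4)              # 6 samples, 4 features
horizontal = partition_by_samples(X, 2)      # two blocks of 3 samples each
vertical = partition_by_features(X, 2)       # two blocks of 2 features each
```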
Thank You!!!
Discretization. How does it work?
• Parameters: 50 intervals and 100,000 max candidates per partition.
• Classifier: Naive Bayes from MLlib, lambda = 1, iterations = 100.
• Hardware: 16 nodes (12 cores per node), 64 GB RAM.
• Software: Hadoop 2.5 and Apache Spark 1.2.0.
Feature Selection. Experimental results
DATASETS
Parameters: FS algorithm = mRMR, level of parallelism = 864 partitions.
Classifier: Naive Bayes and SVM (default parameters), from MLlib.
Hardware: 18 nodes (12 cores per node), 64 GB RAM.
Software: Hadoop 2.5 and Apache Spark 1.2.0.
Horizontal partitioning: By samples
1. Horizontal partitioning of the datasets
2. Application of the filter to the subsets
3. Combination of the results
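The three steps can be sketched end to end. The version below is an illustration under stated assumptions, not the method evaluated in the talk: the filter is a simple stand-in ranker (absolute Pearson correlation with the class), the partitions are random rather than class-stratified, and the combination rule (keep a feature if at least `min_votes` partitions selected it) is a hypothetical voting scheme chosen for brevity.

```python
import numpy as np


def correlation_filter(X, y, top_k):
    """Stand-in ranker filter (assumption): top_k features by |Pearson r| with y."""
    Xc = X - X.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((Xc ** 2).sum(axis=0) * (yc ** 2).sum())
    corr = np.abs(Xc.T @ yc) / np.where(denom == 0, 1.0, denom)
    return set(np.argsort(-corr)[:top_k].tolist())


def distributed_fs_by_samples(X, y, n_parts, top_k, min_votes, seed=0):
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(y))
    votes = np.zeros(X.shape[1], dtype=int)
    for idx in np.array_split(order, n_parts):                 # 1. partition by samples
        for f in correlation_filter(X[idx], y[idx], top_k):    # 2. filter each subset
            votes[f] += 1
    return [int(f) for f in np.flatnonzero(votes >= min_votes)]  # 3. combine by votes
```

Because each subset is filtered independently, step 2 is the part that can run on separate workers.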
Experimental Framework
FILTERS
• Subset filters:
  – CFS (Correlation-based Feature Selection)
  – INTERACT
  – Consistency-based filter
• Ranker filters:
  – InfoGain
  – ReliefF
CLASSIFIERS
• C4.5
• Naïve Bayes
• IB1
• SVM
Results with C4.5
Improving the method
New approach
• Horizontal partitioning of the datasets
  – Partitioning of the data maintaining class distribution
• Application of the filter to the datasets
• Combination of the results
  – Merging procedure: Theoretical complexity of feature subsets
The complexity measurement
• Calculate the complexity of each candidate subset of features
• Fisher discriminant ratio
μ_i, σ_i² and p_i: mean, variance and proportion of the i-th class
Independence from the classifier
Temporal improvement
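The formula itself did not survive on this slide, so the following is a sketch under an explicit assumption: it uses a common multi-class form of the Fisher discriminant ratio for a single feature, built from exactly the per-class mean, variance and proportion mentioned above. The sum runs over ordered class pairs, and scoring a subset by the maximum ratio over its features is likewise an assumption; both function names are hypothetical.

```python
import numpy as np


def fisher_ratio(x, y):
    """Fisher discriminant ratio of one feature x given class labels y.

    Assumed form: sum_{i != j} p_i p_j (mu_i - mu_j)^2 / sum_i p_i var_i
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y)
    classes = np.unique(y)
    p = np.array([(y == c).mean() for c in classes])      # class proportions
    mu = np.array([x[y == c].mean() for c in classes])    # class means
    var = np.array([x[y == c].var() for c in classes])    # class variances
    num = sum(
        p[i] * p[j] * (mu[i] - mu[j]) ** 2
        for i in range(len(classes))
        for j in range(len(classes))
        if i != j
    )
    den = float((p * var).sum())
    return float("inf") if den == 0 else float(num / den)


def subset_separability(X, y, subset):
    """Assumption: score a feature subset by its most discriminative feature."""
    return max(fisher_ratio(X[:, f], y) for f in subset)
```

A higher ratio means the classes are easier to separate along that feature, so a subset with a higher score is less complex; no classifier needs to be trained to compute it.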
Number of selected features
              Connect-4  Isolet  Madelon  Ozone  Spambase  Mnist
Full set             42     617      500     72        57    717
Centralized           7     186       18     20        19     61
Distrib-Comp          8     105        9      8        18     77
Classification accuracy (II)
CFS, ReliefF
Time. Distributed vs Centralized (maximum)
Runtime. Comparing Distributed approaches (average)
Maximum Run Time (MNIST)
Runtime to obtain threshold of votes
Vertical partitioning: By features
Vertical partitioning
Different approaches tested
Experimental results. Microarray datasets
DNA microarray data is a good candidate for vertically distributed feature selection, since the data needs to be split by features.
This type of data usually has redundant features.
Vertical distribution. Accuracy and time
Vertical partition. Complexity measure
• Using complexity measurement instead of accuracy
          Features  Training   Test  Classes
Isolet         617      6238   1236       26
Madelon        500      1600    800        2
Mnist          717     40000  20000        2
Number of features
Experimental results. Classification accuracy
Complexity measure. Time
Panels: Isolet, Madelon, MNIST
Average speedup values shown: 2318.45, 26.13, 1483.80
Average: 573.4337
GPU implementation
GPU implementation of FS
• Parallel computing paradigm
• CUDA platform
  – NVIDIA 780 GTI
• IT-based algorithm: mRMR (Maximum Relevance Minimum Redundancy)
  – CPU optimized version (fast-mRMR)
  – GPU version
Previous ideas
• GPUs compute image histograms efficiently
• Accelerate the computation of MI on the GPU
• Image processing (up to 256 values)
Diagram: Input data → Threads → Partial histograms → Final results
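The partial-histogram idea can be shown without a GPU. The sketch below is a CPU-side NumPy analogue and not the CUDA code from the talk: each chunk of the input plays the role of one group of threads with its own private joint histogram, the partial histograms are summed into the final one, and mutual information is then computed from that histogram. Function names and the chunking scheme are assumptions.

```python
import numpy as np


def joint_histogram_partial(x, y, n_x, n_y, n_chunks):
    """Joint histogram of two integer-coded vectors, built from partial histograms."""
    partials = []
    for xc, yc in zip(np.array_split(x, n_chunks), np.array_split(y, n_chunks)):
        flat = xc * n_y + yc                        # encode (x, y) as one bin index
        partials.append(np.bincount(flat, minlength=n_x * n_y))
    return np.sum(partials, axis=0).reshape(n_x, n_y)  # merge step


def mutual_information_from_hist(hist):
    """Mutual information (bits) from a joint count histogram."""
    joint = hist / hist.sum()
    px = joint.sum(axis=1, keepdims=True)
    py = joint.sum(axis=0, keepdims=True)
    nz = joint > 0
    return float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())
```

With at most 256 values per feature, a joint histogram has at most 256 × 256 bins, which keeps each partial histogram small.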
Data access pattern
• Reorder
• Variable discretization: each value is an 8-bit number (max 256 different values per feature)
Datasets: performance and scalability
Results. CPU optimization
[Figure: time (s) per dataset; panels for 400 features and 100 features]
Scalability CPU Optimization (I)
[Figure: time (s) vs. number of patterns; time (s) vs. maximum possible values per feature]
Scalability CPU Optimization (II)
[Figure: time (s) vs. number of features]
CPU vs CUDA
• Low number of possible values (< 64)
• High number of possible values (up to 256)
GPU. Real Datasets
DATASET    PATTERNS  FEATURES  VALUES
KDDCup99    4000000        41     255
Higgs      11000000        21     255
Challenges: Millions of Dimensions
Challenges: Scalability
Challenges: Distributed FS
Challenges: Real-time FS
Challenges: Cost-based FS
Challenges: Visualization and Interpretability
References
• V. Bolón-Canedo, N. Sánchez-Maroño, A. Alonso-Betanzos. Distributed feature selection: An application to microarray data classification. Applied Soft Computing 05/2015; 30:136-150. DOI: 10.1016/j.asoc.2015.01.035.
• V. Bolón-Canedo, N. Sánchez-Maroño, A. Alonso-Betanzos. Recent advances and emerging challenges of feature selection in the context of Big Data. Knowledge-Based Systems (2015). DOI: 10.1016/j.knosys.2015.05.014.
• V. Bolón-Canedo, N. Sánchez-Maroño, A. Alonso-Betanzos. Feature Selection for high-dimensional data. Springer-Verlag, 2015 (in production).
Thank You!!!
78
BIG DATA SPAIN 2015, Madrid

More Related Content

Viewers also liked

How to integrate Big Data onto an analytical portal, Big Data benchmarking fo...
How to integrate Big Data onto an analytical portal, Big Data benchmarking fo...How to integrate Big Data onto an analytical portal, Big Data benchmarking fo...
How to integrate Big Data onto an analytical portal, Big Data benchmarking fo...Big Data Spain
 
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015Big Data Spain
 
IAd-learning: A new e-learning platform by José Antonio Omedes at Big Data Sp...
IAd-learning: A new e-learning platform by José Antonio Omedes at Big Data Sp...IAd-learning: A new e-learning platform by José Antonio Omedes at Big Data Sp...
IAd-learning: A new e-learning platform by José Antonio Omedes at Big Data Sp...Big Data Spain
 
Essential ingredients for real time stream processing @Scale by Kartik pParam...
Essential ingredients for real time stream processing @Scale by Kartik pParam...Essential ingredients for real time stream processing @Scale by Kartik pParam...
Essential ingredients for real time stream processing @Scale by Kartik pParam...Big Data Spain
 
Geospatial and bitemporal search in C* with pluggable Lucene index by Andrés ...
Geospatial and bitemporal search in C* with pluggable Lucene index by Andrés ...Geospatial and bitemporal search in C* with pluggable Lucene index by Andrés ...
Geospatial and bitemporal search in C* with pluggable Lucene index by Andrés ...Big Data Spain
 
Analyzing organization e-mails in near real time using hadoop ecosystem tools...
Analyzing organization e-mails in near real time using hadoop ecosystem tools...Analyzing organization e-mails in near real time using hadoop ecosystem tools...
Analyzing organization e-mails in near real time using hadoop ecosystem tools...Big Data Spain
 
Big Data as a game-changer of clinical research strategies by Rafael San Migu...
Big Data as a game-changer of clinical research strategies by Rafael San Migu...Big Data as a game-changer of clinical research strategies by Rafael San Migu...
Big Data as a game-changer of clinical research strategies by Rafael San Migu...Big Data Spain
 
Predicting failures on complex machines by Ion Marqués at Big Data Spain 2015
Predicting failures on complex machines by Ion Marqués at Big Data Spain 2015Predicting failures on complex machines by Ion Marqués at Big Data Spain 2015
Predicting failures on complex machines by Ion Marqués at Big Data Spain 2015Big Data Spain
 
Euclid & Big Data from dark space by Guillermo Buenadicha at Big Data Spain 2015
Euclid & Big Data from dark space by Guillermo Buenadicha at Big Data Spain 2015Euclid & Big Data from dark space by Guillermo Buenadicha at Big Data Spain 2015
Euclid & Big Data from dark space by Guillermo Buenadicha at Big Data Spain 2015Big Data Spain
 
Building graphs to discover information by David Martínez at Big Data Spain 2015
Building graphs to discover information by David Martínez at Big Data Spain 2015Building graphs to discover information by David Martínez at Big Data Spain 2015
Building graphs to discover information by David Martínez at Big Data Spain 2015Big Data Spain
 
Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at ...
 Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at ... Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at ...
Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at ...Big Data Spain
 
Big Data the potential for data to improve service and business management by...
Big Data the potential for data to improve service and business management by...Big Data the potential for data to improve service and business management by...
Big Data the potential for data to improve service and business management by...Big Data Spain
 
Location analytics by Marc Planaguma at Big Data Spain 2014
 Location analytics by Marc Planaguma at Big Data Spain 2014 Location analytics by Marc Planaguma at Big Data Spain 2014
Location analytics by Marc Planaguma at Big Data Spain 2014Big Data Spain
 
The top five questions to ask about NoSQL. JONATHAN ELLIS at Big Data Spain 2012
The top five questions to ask about NoSQL. JONATHAN ELLIS at Big Data Spain 2012The top five questions to ask about NoSQL. JONATHAN ELLIS at Big Data Spain 2012
The top five questions to ask about NoSQL. JONATHAN ELLIS at Big Data Spain 2012Big Data Spain
 
ToroDB: Scaling PostgreSQL like MongoDB by Álvaro Hernández at Big Data Spain...
ToroDB: Scaling PostgreSQL like MongoDB by Álvaro Hernández at Big Data Spain...ToroDB: Scaling PostgreSQL like MongoDB by Álvaro Hernández at Big Data Spain...
ToroDB: Scaling PostgreSQL like MongoDB by Álvaro Hernández at Big Data Spain...Big Data Spain
 
Big Data Web applications for Interactive Hadoop by ENRICO BERTI at Big Data...
 Big Data Web applications for Interactive Hadoop by ENRICO BERTI at Big Data... Big Data Web applications for Interactive Hadoop by ENRICO BERTI at Big Data...
Big Data Web applications for Interactive Hadoop by ENRICO BERTI at Big Data...Big Data Spain
 
Data warehouse modernization programme by TOBY WOOLFE at Big Data Spain 2014
 Data warehouse modernization programme by TOBY WOOLFE at Big Data Spain 2014 Data warehouse modernization programme by TOBY WOOLFE at Big Data Spain 2014
Data warehouse modernization programme by TOBY WOOLFE at Big Data Spain 2014Big Data Spain
 
Intro to the Big Data Spain 2014 conference
Intro to the Big Data Spain 2014 conferenceIntro to the Big Data Spain 2014 conference
Intro to the Big Data Spain 2014 conferenceBig Data Spain
 
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...Big Data Spain
 
Getting the best insights from your data using Apache Metamodel by Alberto Ro...
Getting the best insights from your data using Apache Metamodel by Alberto Ro...Getting the best insights from your data using Apache Metamodel by Alberto Ro...
Getting the best insights from your data using Apache Metamodel by Alberto Ro...Big Data Spain
 

Viewers also liked (20)

How to integrate Big Data onto an analytical portal, Big Data benchmarking fo...
How to integrate Big Data onto an analytical portal, Big Data benchmarking fo...How to integrate Big Data onto an analytical portal, Big Data benchmarking fo...
How to integrate Big Data onto an analytical portal, Big Data benchmarking fo...
 
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
Data Science in 2016: Moving up by Paco Nathan at Big Data Spain 2015
 
IAd-learning: A new e-learning platform by José Antonio Omedes at Big Data Sp...
IAd-learning: A new e-learning platform by José Antonio Omedes at Big Data Sp...IAd-learning: A new e-learning platform by José Antonio Omedes at Big Data Sp...
IAd-learning: A new e-learning platform by José Antonio Omedes at Big Data Sp...
 
Essential ingredients for real time stream processing @Scale by Kartik pParam...
Essential ingredients for real time stream processing @Scale by Kartik pParam...Essential ingredients for real time stream processing @Scale by Kartik pParam...
Essential ingredients for real time stream processing @Scale by Kartik pParam...
 
Geospatial and bitemporal search in C* with pluggable Lucene index by Andrés ...
Geospatial and bitemporal search in C* with pluggable Lucene index by Andrés ...Geospatial and bitemporal search in C* with pluggable Lucene index by Andrés ...
Geospatial and bitemporal search in C* with pluggable Lucene index by Andrés ...
 
Analyzing organization e-mails in near real time using hadoop ecosystem tools...
Analyzing organization e-mails in near real time using hadoop ecosystem tools...Analyzing organization e-mails in near real time using hadoop ecosystem tools...
Analyzing organization e-mails in near real time using hadoop ecosystem tools...
 
Big Data as a game-changer of clinical research strategies by Rafael San Migu...
Big Data as a game-changer of clinical research strategies by Rafael San Migu...Big Data as a game-changer of clinical research strategies by Rafael San Migu...
Big Data as a game-changer of clinical research strategies by Rafael San Migu...
 
Predicting failures on complex machines by Ion Marqués at Big Data Spain 2015
Predicting failures on complex machines by Ion Marqués at Big Data Spain 2015Predicting failures on complex machines by Ion Marqués at Big Data Spain 2015
Predicting failures on complex machines by Ion Marqués at Big Data Spain 2015
 
Euclid & Big Data from dark space by Guillermo Buenadicha at Big Data Spain 2015
Euclid & Big Data from dark space by Guillermo Buenadicha at Big Data Spain 2015Euclid & Big Data from dark space by Guillermo Buenadicha at Big Data Spain 2015
Euclid & Big Data from dark space by Guillermo Buenadicha at Big Data Spain 2015
 
Building graphs to discover information by David Martínez at Big Data Spain 2015
Building graphs to discover information by David Martínez at Big Data Spain 2015Building graphs to discover information by David Martínez at Big Data Spain 2015
Building graphs to discover information by David Martínez at Big Data Spain 2015
 
Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at ...
 Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at ... Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at ...
Dataflows: The abstraction that powers Big Data by Raul Castro Fernandez at ...
 
Big Data the potential for data to improve service and business management by...
Big Data the potential for data to improve service and business management by...Big Data the potential for data to improve service and business management by...
Big Data the potential for data to improve service and business management by...
 
Location analytics by Marc Planaguma at Big Data Spain 2014
 Location analytics by Marc Planaguma at Big Data Spain 2014 Location analytics by Marc Planaguma at Big Data Spain 2014
Location analytics by Marc Planaguma at Big Data Spain 2014
 
The top five questions to ask about NoSQL. JONATHAN ELLIS at Big Data Spain 2012
The top five questions to ask about NoSQL. JONATHAN ELLIS at Big Data Spain 2012The top five questions to ask about NoSQL. JONATHAN ELLIS at Big Data Spain 2012
The top five questions to ask about NoSQL. JONATHAN ELLIS at Big Data Spain 2012
 
ToroDB: Scaling PostgreSQL like MongoDB by Álvaro Hernández at Big Data Spain...
ToroDB: Scaling PostgreSQL like MongoDB by Álvaro Hernández at Big Data Spain...ToroDB: Scaling PostgreSQL like MongoDB by Álvaro Hernández at Big Data Spain...
ToroDB: Scaling PostgreSQL like MongoDB by Álvaro Hernández at Big Data Spain...
 
Big Data Web applications for Interactive Hadoop by ENRICO BERTI at Big Data...
 Big Data Web applications for Interactive Hadoop by ENRICO BERTI at Big Data... Big Data Web applications for Interactive Hadoop by ENRICO BERTI at Big Data...
Big Data Web applications for Interactive Hadoop by ENRICO BERTI at Big Data...
 
Data warehouse modernization programme by TOBY WOOLFE at Big Data Spain 2014
 Data warehouse modernization programme by TOBY WOOLFE at Big Data Spain 2014 Data warehouse modernization programme by TOBY WOOLFE at Big Data Spain 2014
Data warehouse modernization programme by TOBY WOOLFE at Big Data Spain 2014
 
Intro to the Big Data Spain 2014 conference
Intro to the Big Data Spain 2014 conferenceIntro to the Big Data Spain 2014 conference
Intro to the Big Data Spain 2014 conference
 
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
State of Play. Data Science on Hadoop in 2015 by SEAN OWEN at Big Data Spain ...
 
Getting the best insights from your data using Apache Metamodel by Alberto Ro...
Getting the best insights from your data using Apache Metamodel by Alberto Ro...Getting the best insights from your data using Apache Metamodel by Alberto Ro...
Getting the best insights from your data using Apache Metamodel by Alberto Ro...
 

Similar to Begin at the beginning: Feature selection for Big Data by Amparo Alonso at Big Data Spain 2015

The Rise of Engineering-Driven Analytics by Loren Shure
The Rise of Engineering-Driven Analytics by Loren ShureThe Rise of Engineering-Driven Analytics by Loren Shure
The Rise of Engineering-Driven Analytics by Loren ShureBig Data Spain
 
Auto­matic Para­meter Tun­ing for Data­bases and Big Data Sys­tems
Auto­matic Para­meter Tun­ing for Data­bases and Big Data Sys­tems Auto­matic Para­meter Tun­ing for Data­bases and Big Data Sys­tems
Auto­matic Para­meter Tun­ing for Data­bases and Big Data Sys­tems Jiaheng Lu
 
Tuning for Systematic Trading: Talk 2: Deep Learning
Tuning for Systematic Trading: Talk 2: Deep LearningTuning for Systematic Trading: Talk 2: Deep Learning
Tuning for Systematic Trading: Talk 2: Deep LearningSigOpt
 
STARBUCKS Site Selection Analysis drift
STARBUCKS Site Selection Analysis driftSTARBUCKS Site Selection Analysis drift
STARBUCKS Site Selection Analysis driftPark JunPyo
 
Data Science at Scale on MPP databases - Use Cases & Open Source Tools
Data Science at Scale on MPP databases - Use Cases & Open Source ToolsData Science at Scale on MPP databases - Use Cases & Open Source Tools
Data Science at Scale on MPP databases - Use Cases & Open Source ToolsEsther Vasiete
 
(ATS6-PLAT03) What's behind Discngine collections
(ATS6-PLAT03) What's behind Discngine collections(ATS6-PLAT03) What's behind Discngine collections
(ATS6-PLAT03) What's behind Discngine collectionsBIOVIA
 
The Knowledge Graph Conference 2022 - Bo Wu's Presentation
The Knowledge Graph Conference 2022 - Bo Wu's PresentationThe Knowledge Graph Conference 2022 - Bo Wu's Presentation
The Knowledge Graph Conference 2022 - Bo Wu's PresentationKatana Graph
 
Using Bayesian Optimization to Tune Machine Learning Models
Using Bayesian Optimization to Tune Machine Learning ModelsUsing Bayesian Optimization to Tune Machine Learning Models
Using Bayesian Optimization to Tune Machine Learning ModelsScott Clark
 
Using Bayesian Optimization to Tune Machine Learning Models
Using Bayesian Optimization to Tune Machine Learning ModelsUsing Bayesian Optimization to Tune Machine Learning Models
Using Bayesian Optimization to Tune Machine Learning ModelsSigOpt
 
SigOpt for Hedge Funds
SigOpt for Hedge FundsSigOpt for Hedge Funds
SigOpt for Hedge FundsSigOpt
 
RAPIDS cuGraph – Accelerating all your Graph needs
RAPIDS cuGraph – Accelerating all your Graph needsRAPIDS cuGraph – Accelerating all your Graph needs
RAPIDS cuGraph – Accelerating all your Graph needsConnected Data World
 
WBDB 2015 Performance Evaluation of Spark SQL using BigBench
WBDB 2015 Performance Evaluation of Spark SQL using BigBenchWBDB 2015 Performance Evaluation of Spark SQL using BigBench
WBDB 2015 Performance Evaluation of Spark SQL using BigBencht_ivanov
 
8th TUC Meeting - Tim Hegeman (TU Delft). Social Network Benchmark, Analytics...
8th TUC Meeting - Tim Hegeman (TU Delft). Social Network Benchmark, Analytics...8th TUC Meeting - Tim Hegeman (TU Delft). Social Network Benchmark, Analytics...
8th TUC Meeting - Tim Hegeman (TU Delft). Social Network Benchmark, Analytics...LDBC council
 
STREAM-0D: a new vision for Zero-Defect Manufacturing
STREAM-0D: a new vision for Zero-Defect ManufacturingSTREAM-0D: a new vision for Zero-Defect Manufacturing
STREAM-0D: a new vision for Zero-Defect ManufacturingFulvio Bernardini
 
Advanced Optimization for the Enterprise Webinar
Advanced Optimization for the Enterprise WebinarAdvanced Optimization for the Enterprise Webinar
Advanced Optimization for the Enterprise WebinarSigOpt
 
ICLR 2020 Recap
ICLR 2020 RecapICLR 2020 Recap
ICLR 2020 RecapSri Ambati
 
SigOpt for Machine Learning and AI
SigOpt for Machine Learning and AISigOpt for Machine Learning and AI
SigOpt for Machine Learning and AISigOpt
 
Analyzing petabytes of smartmeter data using Cloud Bigtable, Cloud Dataflow, ...
Analyzing petabytes of smartmeter data using Cloud Bigtable, Cloud Dataflow, ...Analyzing petabytes of smartmeter data using Cloud Bigtable, Cloud Dataflow, ...
Analyzing petabytes of smartmeter data using Cloud Bigtable, Cloud Dataflow, ...Edwin Poot
 
BDVe Webinar Series: DataBench – Benchmarking Big Data. Arne Berre. Tue, Oct ...
BDVe Webinar Series: DataBench – Benchmarking Big Data. Arne Berre. Tue, Oct ...BDVe Webinar Series: DataBench – Benchmarking Big Data. Arne Berre. Tue, Oct ...
BDVe Webinar Series: DataBench – Benchmarking Big Data. Arne Berre. Tue, Oct ...Big Data Value Association
 

Similar to Begin at the beginning: Feature selection for Big Data by Amparo Alonso at Big Data Spain 2015 (20)

The Rise of Engineering-Driven Analytics by Loren Shure
The Rise of Engineering-Driven Analytics by Loren ShureThe Rise of Engineering-Driven Analytics by Loren Shure
The Rise of Engineering-Driven Analytics by Loren Shure
 
Auto­matic Para­meter Tun­ing for Data­bases and Big Data Sys­tems
Auto­matic Para­meter Tun­ing for Data­bases and Big Data Sys­tems Auto­matic Para­meter Tun­ing for Data­bases and Big Data Sys­tems
Auto­matic Para­meter Tun­ing for Data­bases and Big Data Sys­tems
 
Tuning for Systematic Trading: Talk 2: Deep Learning
Tuning for Systematic Trading: Talk 2: Deep LearningTuning for Systematic Trading: Talk 2: Deep Learning
Tuning for Systematic Trading: Talk 2: Deep Learning
 
STARBUCKS Site Selection Analysis drift
STARBUCKS Site Selection Analysis driftSTARBUCKS Site Selection Analysis drift
STARBUCKS Site Selection Analysis drift
 
Data Science at Scale on MPP databases - Use Cases & Open Source Tools
Data Science at Scale on MPP databases - Use Cases & Open Source ToolsData Science at Scale on MPP databases - Use Cases & Open Source Tools
Data Science at Scale on MPP databases - Use Cases & Open Source Tools
 
(ATS6-PLAT03) What's behind Discngine collections
(ATS6-PLAT03) What's behind Discngine collections(ATS6-PLAT03) What's behind Discngine collections
(ATS6-PLAT03) What's behind Discngine collections
 
The Knowledge Graph Conference 2022 - Bo Wu's Presentation
The Knowledge Graph Conference 2022 - Bo Wu's PresentationThe Knowledge Graph Conference 2022 - Bo Wu's Presentation
The Knowledge Graph Conference 2022 - Bo Wu's Presentation
 
Using Bayesian Optimization to Tune Machine Learning Models
Using Bayesian Optimization to Tune Machine Learning ModelsUsing Bayesian Optimization to Tune Machine Learning Models
Begin at the beginning: Feature selection for Big Data by Amparo Alonso at Big Data Spain 2015

  • 2. BEGIN AT THE BEGINNING: FEATURE SELECTION FOR BIG DATA AMPARO ALONSO-BETANZOS BIG DATA SPAIN 2015, Madrid
  • 3. BIG DATA HISPANO, 2015 2 Begin at the Beginning. “Begin at the beginning,” the King said, very gravely, “and go on till you come to the end: then stop.”
  • 4. The first step: Preprocessing the data. Peter Norvig, Google Research Director
  • 5. Not everything that counts can be counted, and not everything that can be counted counts. Equality is not the way
  • 6. Why Feature reduction?
  • 7. Arriving at the best features
  • 8. Feature selection. Benefits
  • 9. Feature selection: basic flavors. Advantages: independence of classifier; low computational cost; fast; good generalization ability. Disadvantages: no interaction with classifier. Examples: CFS, Consistency-based, INTERACT, ReliefF, FCBF, InfoGain, mRMR
  • 10. Basic shapes of filters, in several ways: subset filters and ranker filters; univariate methods and multivariate methods. Feature selection techniques do not scale well with Big Data
  • 11. Scaling up FS: Distributed Feature Selection. Allocating the learning process among several workstations is a natural way of scaling up learning algorithms. Advantages: reduction in execution time; resource sharing; better performance
  • 12. Distributed implementation of a FS method: cluster computing, MLlib
  • 13. MLlib, why? It is built on Apache Spark, a fast and general engine for large-scale data processing. Runs programs up to 100x faster than Hadoop MapReduce in memory, or 10x faster on disk. Runs on Hadoop 2 clusters. Write applications quickly in Java, Scala, or Python.
  • 14. MLlib contents: No FS algorithms!!!
  • 15. Implementing FS based on an IT framework. Implemented a generic FS framework for Big Data based on Information Theory: Brown G, Pocock A, Zhao MJ, Lujan M (2012) Conditional likelihood maximisation: A unifying framework for information theoretic feature selection. J Mach Learn Res 13:27–66. Terms of the criterion: relevance, redundancy, conditional redundancy
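The three terms named on this slide correspond to the general criterion of Brown et al. (2012). As a reminder, and with symbols that are not on the slide (S for the set of already selected features, Y for the class, β and γ for the weights of the two redundancy terms), the score of a candidate feature X_k can be written as:

```latex
J(X_k) = \underbrace{I(X_k; Y)}_{\text{relevance}}
       - \beta \sum_{X_j \in S} \underbrace{I(X_j; X_k)}_{\text{redundancy}}
       + \gamma \sum_{X_j \in S} \underbrace{I(X_j; X_k \mid Y)}_{\text{conditional redundancy}}
```

In that paper, β = 1/|S| with γ = 0 gives mRMR, and β = γ = 1/|S| gives JMI, which is how one generic framework can host several filters.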
  • 16. The long and winding road. Discretization is needed!! Transform numerical attributes into discrete or nominal attributes with a finite number of intervals.
  • 17. Stages in the discretization process
  • 18. The algorithm: MDLP. Proposal: complete re-design of the MDLP (Minimum Description Length Principle) discretization method. Sort all points in the dataset with a single distributed operation, using a Spark primitive. Evaluate boundary points (per feature) in a parallel way.
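To make the MDLP step concrete, here is a minimal single-machine sketch of the classical entropy-based discretizer with the MDLP stopping rule (Fayyad and Irani). It is not the distributed Spark re-design described on the slide; the function names are illustrative, and for simplicity it evaluates every cut between distinct sorted values, which is a superset of the boundary points.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def best_cut(values, labels):
    """Return (cut_point, left_labels, right_labels) with minimal weighted
    class entropy, or None if no cut exists. Inputs must be sorted by value."""
    n = len(values)
    best = None
    for i in range(1, n):
        if values[i] == values[i - 1]:
            continue  # a cut can only fall between two distinct values
        left, right = labels[:i], labels[i:]
        weighted = (len(left) * entropy(left) + len(right) * entropy(right)) / n
        if best is None or weighted < best[0]:
            best = (weighted, (values[i - 1] + values[i]) / 2, left, right)
    return None if best is None else best[1:]


def mdlp_accepts(labels, left, right):
    """MDLP criterion: accept the cut only if the information gain pays for
    the cost of encoding the partition."""
    n = len(labels)
    ent, ent_l, ent_r = entropy(labels), entropy(left), entropy(right)
    gain = ent - (len(left) * ent_l + len(right) * ent_r) / n
    k, k_l, k_r = len(set(labels)), len(set(left)), len(set(right))
    delta = math.log2(3 ** k - 2) - (k * ent - k_l * ent_l - k_r * ent_r)
    return gain > (math.log2(n - 1) + delta) / n


def mdlp_cut_points(values, labels):
    """Recursively split one numerical feature; returns the sorted cut points."""
    pairs = sorted(zip(values, labels), key=lambda p: p[0])
    vals = [v for v, _ in pairs]
    labs = [c for _, c in pairs]
    cut = best_cut(vals, labs)
    if cut is None:
        return []
    point, left, right = cut
    if not mdlp_accepts(labs, left, right):
        return []
    i = len(left)
    return mdlp_cut_points(vals[:i], left) + [point] + mdlp_cut_points(vals[i:], right)
```

The initial sort and the per-feature evaluation of candidate cuts are the two parts the slide proposes to distribute; in this sketch both run sequentially on one feature.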
  • 19. Is it worth it?
  • 20. Original criteria and reformulation in the framework
  • 22. Re-design of FS framework for Spark. The complexity of the framework is determined by the computations of relevance and redundancy. Proposal: complete re-design of Brown's framework. Columnar transformation: the access pattern presented by most FS methods is feature-wise, and the partitioning scheme of the data is quite influential in Apache Spark. Caching variables: relevance is computed and cached once at the start; the marginal and joint proportions derived from these operations are also cached, and this information is replicated. Greedy approach: only one feature is selected in each iteration, so the quadratic complexity is transformed into a complexity determined by the number of features selected.
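The three design ideas above (columnar layout, cached relevance, greedy selection) can be sketched on a single machine with NumPy. This is only an illustration under assumptions: it uses the mRMR criterion, expects features already discretized to non-negative integers, and the function names are hypothetical rather than taken from the Spark implementation.

```python
import numpy as np


def mutual_information(x, y):
    """Mutual information (in bits) between two integer-coded discrete vectors."""
    joint = np.zeros((x.max() + 1, y.max() + 1))
    np.add.at(joint, (x, y), 1)              # joint histogram of (x, y)
    joint /= joint.sum()                     # joint proportions
    px = joint.sum(axis=1, keepdims=True)    # marginal proportions of x
    py = joint.sum(axis=0, keepdims=True)    # marginal proportions of y
    nz = joint > 0                           # skip empty cells to avoid log(0)
    return float((joint[nz] * np.log2(joint[nz] / (px @ py)[nz])).sum())


def greedy_mrmr(X, y, n_select):
    """Select n_select feature indices with a greedy mRMR criterion."""
    columns = np.ascontiguousarray(X.T)      # columnar layout: one row per feature
    # Relevance is computed once and reused in every iteration.
    relevance = np.array([mutual_information(c, y) for c in columns])
    redundancy_sum = np.zeros(len(columns))
    selected = [int(np.argmax(relevance))]
    while len(selected) < n_select:
        last = columns[selected[-1]]
        # Only the most recently selected feature requires new MI computations.
        for k in range(len(columns)):
            if k not in selected:
                redundancy_sum[k] += mutual_information(columns[k], last)
        score = relevance - redundancy_sum / len(selected)
        score[selected] = -np.inf            # never pick a feature twice
        selected.append(int(np.argmax(score)))
    return selected
```

Because redundancy is accumulated incrementally, each iteration costs one pass over the remaining features, so the total work grows with the number of selected features rather than with all feature pairs.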
  • 23. Scalability results: Selection time. Dna dataset
  • 24. Scalability: Cores. ECBDL14 dataset
  • 25. Is it useful?
  • 26. Time for creating the classification model
  • 27. Spark-Infotheoretic FS
  • 28. Other attempts. Parallel implementation of mRMR on GPU: https://github.com/sramirez/fast-mRMR. Implementation of other FS algorithms: ReliefF, CFS, SVM-RFE (working on the scalability studies)
  • 29. Distributed Feature Selection (DFS). Data can be located in different sites: different parts of a company; different cooperating organizations. A very large data set can also be distributed across several processors, with the results combined afterwards. DFS goal: to reduce the computational time while maintaining the classification performance
  • 30. DFS. Types of partition: by samples; by features
  • 31. DFS with rankers: NDCG values
  • 32. Distributed FS by Samples: HORIZONTAL PARTITION
  • 33. Distributed FS by Features: VERTICAL PARTITION
  • 34. Thank You!!! BIG DATA SPAIN 2015, Madrid
  • 35. Discretization. How does it work? Parameters: 50 intervals and 100,000 max candidates per partition. Classifier: Naive Bayes from MLlib, lambda = 1, iterations = 100. Hardware: 16 nodes (12 cores per node), 64 GB RAM. Software: Hadoop 2.5 and Apache Spark 1.2.0.
  • 36. Feature Selection. Experimental results. Datasets. Parameters: FS algorithm = mRMR, level of parallelism = 864 partitions. Classifier: Naive Bayes and SVM (default parameters), from MLlib. Hardware: 18 nodes (12 cores per node), 64 GB RAM. Software: Hadoop 2.5 and Apache Spark 1.2.0.
  • 37. CPU vs CUDA. Low number of possible values (< 64); high number of possible values (up to 256)
  • 38. GPU. Real Datasets
        Dataset    Patterns   Features  Values
        KDDCup99   4000000    41        255
        Higgs      11000000   21        255
  • 39. Horizontal partitioning: By samples
  • 40. 1. Horizontal partitioning of the datasets
  • 41. 2. Application of the filter to the subsets
  • 42. 3. Combination of the results
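The three steps above can be sketched as follows. This is a minimal illustration under stated assumptions: the partitions are stratified so that each keeps the class distribution, `mean_gap_filter` is a hypothetical stand-in for a real filter such as CFS or ReliefF, and the combination step is a simple vote count with a `min_votes` threshold, which is one possible reading of the merging procedure rather than the exact method used in the talk.

```python
import numpy as np


def stratified_partitions(y, n_parts, seed=0):
    """Step 1: split sample indices into n_parts subsets that keep the class distribution."""
    rng = np.random.default_rng(seed)
    parts = [[] for _ in range(n_parts)]
    for label in np.unique(y):
        idx = rng.permutation(np.flatnonzero(y == label))
        for p, chunk in enumerate(np.array_split(idx, n_parts)):
            parts[p].extend(chunk.tolist())
    return [np.array(sorted(p)) for p in parts]


def mean_gap_filter(X, y, n_keep):
    """Hypothetical toy filter: keep the features whose class means differ most."""
    means = np.array([X[y == c].mean(axis=0) for c in np.unique(y)])
    spread = means.max(axis=0) - means.min(axis=0)
    return set(np.argsort(spread)[::-1][:n_keep].tolist())


def distributed_selection(X, y, n_parts, n_keep, min_votes):
    """Steps 2 and 3: run the filter on each subset, then keep features with enough votes."""
    votes = np.zeros(X.shape[1], dtype=int)
    for idx in stratified_partitions(y, n_parts):
        for f in mean_gap_filter(X[idx], y[idx], n_keep):
            votes[f] += 1
    return [int(f) for f in np.flatnonzero(votes >= min_votes)]
```

Each call to the filter sees only a fraction of the samples, so the per-partition runs are independent and could be dispatched to different workers.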
  • 43. Experimental Framework. FILTERS. Subset filters: CFS (Correlation-based Feature Selection), INTERACT, Consistency-based filter. Ranker filters: InfoGain, ReliefF. CLASSIFIERS: C4.5, Naïve Bayes, IB1, SVM
  • 44. Results with C4.5
  • 45. New approach: Improving the method. Horizontal partitioning of the datasets: partitioning of the data maintaining class distribution. Application of the filter to the datasets. Combination of the results: merging procedure based on the theoretical complexity of feature subsets
  • 46. The complexity measurement. Calculate the complexity of each candidate subset of features using the Fisher discriminant ratio, where μ_i, σ_i² and p_i are the mean, variance and proportion of the i-th class. Independence from the classifier. Improvement in time
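The formula itself did not survive in the transcript. A commonly used multi-class form of the Fisher discriminant ratio, consistent with the symbols named on the slide, divides the sum over class pairs i ≠ j of p_i p_j (μ_i − μ_j)² by the sum over classes of p_i σ_i². The sketch below computes that form per feature and takes the best feature of a subset; both the exact form and the use of the maximum are assumptions, not confirmed by the slides.

```python
import numpy as np


def fisher_ratio(x, y):
    """Multi-class Fisher discriminant ratio of one feature x given class labels y."""
    classes, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()                          # class proportions p_i
    mu = np.array([x[y == c].mean() for c in classes])  # class means mu_i
    var = np.array([x[y == c].var() for c in classes])  # class variances sigma_i^2
    n = len(classes)
    between = sum(p[i] * p[j] * (mu[i] - mu[j]) ** 2
                  for i in range(n) for j in range(n) if i != j)
    within = float((p * var).sum())
    return float(between / within) if within > 0 else float("inf")


def best_fisher_ratio(X, y, subset):
    """Assumed subset score: the highest per-feature ratio in the subset.
    A higher ratio means the classes are easier to separate (lower complexity)."""
    return max(fisher_ratio(X[:, f], y) for f in subset)
```

Because the ratio needs only class means, variances and proportions, it can be evaluated without training a classifier, which is what makes the merging step independent of the classifier and faster.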
  • 47. Number of selected features
        Dataset       Connect 4  Isolet  Madelon  Ozone  Spambase  Mnist
        Full set      42         617     500      72     57        717
        Centralized   7          186     18       20     19        61
        Distrib-Comp  8          105     9        8      18        77
  • 48. Classification accuracy (II): CFS, ReliefF
  • 49. Time. Distributed vs Centralized (maximum)
  • 50. Runtime. Comparing Distributed approaches (Average)
  • 51. Maximum Run Time (MNIST)
  • 52. Runtime to obtain threshold of votes
  • 53. Vertical partitioning: By features
  • 54. Vertical partitioning
  • 55. Different approaches tested
  • 56. BIG DATA HISPANO, 2015 55 Experimental results. Microarray datasets DNA microarray data is a good candidate for vertical distributed feature selection, since data needs to be split by features This type of data usually have redundant features
  • 57. Vertical distribution. Accuracy and time
  • 58. Vertical partition. Complexity measure
      • Using complexity measurement instead of accuracy
                 Features  Training   Test  Classes
      Isolet          617      6238   1236       26
      Madelon         500      1600    800        2
      Mnist           717     40000  20000        2
      Number of features
  • 59. Experimental results. Classification accuracy
  • 60. Complexity measure. Time
  • 61. Complexity measure. Time
      Average speedup:
      • Isolet: 2318.45
      • Madelon: 26.13
      • MNIST: 1483.80
      Average: 573.4337
  • 62. GPU implementation
  • 63. GPU implementation of FS
      • Parallel computing paradigm
      • CUDA platform
        – NVIDIA 780 GTI
      • IT-based algorithm: mRMR (Maximum Relevance, minimum Redundancy)
        – CPU-optimized version (fast mRMR)
        – GPU version
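For readers unfamiliar with mRMR, the greedy selection it performs can be sketched in a few lines of plain Python. This is not the fast mRMR or GPU code from the talk: it assumes already-discretized features stored one list per feature, uses the difference form of the criterion (relevance minus mean redundancy), and the names `mutual_information` and `mrmr` are hypothetical.

```python
from collections import Counter
from math import log2


def mutual_information(xs, ys):
    """Mutual information (in bits) between two discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum(
        (c / n) * log2(c * n / (px[x] * py[y])) for (x, y), c in pxy.items()
    )


def mrmr(columns, labels, k):
    """Greedily pick k features: high MI with the class, low MI with picks."""
    relevance = [mutual_information(col, labels) for col in columns]
    selected = [max(range(len(columns)), key=relevance.__getitem__)]
    while len(selected) < min(k, len(columns)):

        def score(f):
            redundancy = sum(
                mutual_information(columns[f], columns[s]) for s in selected
            ) / len(selected)
            return relevance[f] - redundancy

        candidates = [f for f in range(len(columns)) if f not in selected]
        selected.append(max(candidates, key=score))
    return selected
```

Almost all of the running time goes into the pairwise mutual information calls, which is the part a GPU version would aim to accelerate.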
  • 64. Previous ideas
      • GPUs compute image histograms efficiently
      • Accelerate the computation of MI (mutual information) on the GPU
      • Image processing (up to 256 values)
      Diagram: Input data → Threads → Partial histograms → Final results
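The partial-histogram idea can be illustrated on the CPU in plain Python, with data chunks standing in for groups of GPU threads. This is a sketch under assumptions, not the CUDA kernel from the talk: the chunking, the function names and the requirement of non-empty integer-coded inputs are all illustrative choices.

```python
from math import log2


def partial_joint_histogram(xs, ys, n_x, n_y):
    """Joint histogram of one chunk of two integer-coded variables."""
    hist = [[0] * n_y for _ in range(n_x)]
    for x, y in zip(xs, ys):
        hist[x][y] += 1
    return hist


def merge_histograms(partials):
    """Add partial histograms cell by cell to get the final histogram."""
    n_x, n_y = len(partials[0]), len(partials[0][0])
    merged = [[0] * n_y for _ in range(n_x)]
    for hist in partials:
        for i in range(n_x):
            for j in range(n_y):
                merged[i][j] += hist[i][j]
    return merged


def chunked_joint_histogram(xs, ys, n_x, n_y, chunk_size):
    """Build the joint histogram from independent per-chunk histograms."""
    partials = [
        partial_joint_histogram(xs[s:s + chunk_size], ys[s:s + chunk_size], n_x, n_y)
        for s in range(0, len(xs), chunk_size)
    ]
    return merge_histograms(partials)


def mi_from_histogram(hist):
    """Mutual information (in bits) from a joint count histogram."""
    n = sum(map(sum, hist))
    row = [sum(r) for r in hist]
    col = [sum(c) for c in zip(*hist)]
    return sum(
        (c / n) * log2(c * n / (row[i] * col[j]))
        for i, r in enumerate(hist)
        for j, c in enumerate(r)
        if c
    )
```

Because each chunk writes only to its own histogram, the chunks can be processed independently and merged at the end, which is what makes the scheme attractive for parallel hardware.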
  • 65. Data access pattern
      • Reorder
      • Variable discretization: each value is an 8-bit number (max. 256 different values per feature)
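One plausible reading of this slide is sketched below: each feature is discretized into at most 256 levels so that a value fits in one byte, and the data is reordered so that all values of a feature are contiguous. Both the equal-width binning and the feature-major layout are assumptions (the slide only says "Reorder" and "8 bits"), and the function names are hypothetical.

```python
def discretize_to_uint8(values, n_bins=256):
    """Equal-width binning of one feature into at most n_bins byte codes."""
    if not 1 <= n_bins <= 256:
        raise ValueError("n_bins must be between 1 and 256 to fit in a byte")
    lo, hi = min(values), max(values)
    if hi == lo:
        return bytes(len(values))  # constant feature: every code is 0
    scale = (n_bins - 1) / (hi - lo)
    return bytes(min(n_bins - 1, int((v - lo) * scale + 0.5)) for v in values)


def to_feature_major(rows):
    """Reorder sample-major rows into one contiguous byte string per feature."""
    return [discretize_to_uint8(column) for column in zip(*rows)]
```

With this layout, the histogram of a feature can be built by scanning a single contiguous byte string.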
  • 66. Datasets: performance and scalability
  • 67. Results. CPU optimization. Plots: Time (s) per dataset, for 400 features and for 100 features
  • 68. Scalability. CPU optimization (I). Plots: Time (s) vs. number of patterns; Time (s) vs. maximum possible values per feature
  • 69. Scalability. CPU optimization (II). Plot: Time (s) vs. number of features
  • 70. CPU vs CUDA. Panels: low number of possible values (< 64); high number of possible values (up to 256)
  • 71. GPU. Real datasets
      DATASET     PATTERNS  FEATURES  VALUES
      KDDCup99     4000000        41     255
      Higgs       11000000        21     255
  • 72. Challenges: Millions of Dimensions
  • 73. Challenges: Scalability
  • 74. Challenges: Distributed FS
  • 75. Challenges: Real-time FS
  • 76. Challenges: Cost-based FS
  • 77. Challenges: Visualization and Interpretability
  • 78. References
      • V. Bolón-Canedo, N. Sánchez-Maroño, A. Alonso-Betanzos. Distributed feature selection: An application to microarray data classification. Applied Soft Computing 05/2015; 30:136-150. DOI:10.1016/j.asoc.2015.01.035.
      • V. Bolón-Canedo, N. Sánchez-Maroño, A. Alonso-Betanzos. Recent advances and emerging challenges of feature selection in the context of Big Data. Knowledge-Based Systems (2015). DOI:10.1016/j.knosys.2015.05.014.
      • V. Bolón-Canedo, N. Sánchez-Maroño, A. Alonso-Betanzos. Feature Selection for high-dimensional data. Springer-Verlag, 2015 (in production).
  • 79. Thank You!!! BIG DATA SPAIN 2015, Madrid