Distributed Monte Carlo Feature Selection
Łukasz Król
Data Mining Group
Faculty of Automatic Control,
Electronics and Computer Science
Silesian University of Technology
Classical Structured Big Data Problems
d – number of features
n – number of observations
n >> d
• Number of features is usually much smaller
than the number of observations.
• The problem is the scale of the data, rather than
its structure.
• Observations can often be processed
independently of each other.
• In most use cases, the problem is only that of
filtering and aggregating the data (MapReduce).
High-Dimensional Big Data
d – number of features
n – number of observations
n << d
• Number of features can be a few orders of magnitude higher than the number of
observations.
• Most features are not relevant for the problem.
• There are interdependencies between the features, and sets of features from
different parts of the dataset often need to be processed together.
• Because of the high dimensionality of the dataset, many features can be correlated
with the decision vector, and with each other, only by chance (False Discoveries).
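The false-discovery point can be illustrated with a small simulation. This is a minimal sketch, not part of the original slides: the helper names, the correlation threshold of 0.5 and the sample sizes are assumptions chosen for illustration. Every feature is pure noise, yet with few observations and many features a number of them still correlate with the decision vector.

```python
import random

def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5

def count_chance_hits(n_obs, n_feat, threshold, seed=0):
    """Count pure-noise features whose |correlation| with a fixed,
    balanced binary decision vector exceeds the threshold."""
    rng = random.Random(seed)
    decision = [i % 2 for i in range(n_obs)]  # labels unrelated to the features
    hits = 0
    for _ in range(n_feat):
        feature = [rng.gauss(0.0, 1.0) for _ in range(n_obs)]
        if abs(pearson(feature, decision)) > threshold:
            hits += 1
    return hits

# Same number of observations, 50 times more features (hypothetical sizes).
few = count_chance_hits(n_obs=20, n_feat=100, threshold=0.5)
many = count_chance_hits(n_obs=20, n_feat=5000, threshold=0.5)
print(few, many)
```

The number of spurious hits grows roughly in proportion to the number of features, which is why a selection method for n << d data needs some way to assess significance.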
High Throughput Biological Experiments
Scale of dimensionality of different high-throughput experiments:
experiment          observations   features
RNA microarrays     10^2-10^3      10^4
SNP microarrays     10^2-10^3      10^5-10^6
CNV microarrays     10^2-10^3      10^6
methylation sites   10^2-10^3      10^8-10^9
sequencing data     10^2-10^3      10^9
Feature Selection
Dimensionality can be reduced using Feature Selection,
but Feature Selection itself is affected
by the feature-to-observation imbalance!
Feature Selection
What can be the objectives of a Feature
Selection application in a supervised scenario?
•Outputting a set of features that are most
useful for training a classifier.
•Outputting a set of features that can be
directly analyzed by a Life Scientist.
Feature Selection
Basic requirements of a Feature Selection Application:
• Is not biased by the dataset.
• Is agnostic to the type of variables and the number of categories.
• Takes into account interactions between variables.
• Takes into account the contextual dependencies with the
response variable.
• Is not bound to a greedy search path.
• Allows the statistical significance of selected
features to be captured.
Requirements for human readable output:
•Does not transform the feature space.
•Provides information on interdependencies between the
features.
•Does not remove weaker alternative signal paths.
Monte Carlo Feature Selection
Bioinformatics (2008) 24: 110-117
Advances in Machine Learning II (2010) 263: 371-385
Big Data Analysis: New Algorithms for a New Society (2015) 16: 285-304
Distributed MCFS - motivation
•Constant increase of dimensionality of analyzed problems
requires new tools.
•Current software cannot make use of distributed
resources.
•Experiment scenarios are becoming harder. Fewer significant
features are present in microarrays created out of blood
samples than in those comparing healthy vs. ill tissues.
•An abundance of distributed data analysis frameworks has
resulted from the Big Data movement.
Monte Carlo Feature Selection
[Figure sequence: data matrix with OBSERVATIONS and FEATURES as its two axes.]
Feature sampling: j=1, j=2, j=3
Observation sampling (shown for j=3): k=1, k=2
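The two nested sampling steps can be sketched as follows. This is an illustrative sketch under assumptions: the slides only show the indices j and k, so the subset size m, the number of splits t, the train fraction and the function name are all hypothetical.

```python
import random

def mcfs_samples(n_features, n_observations, m, s, t, train_fraction=0.66, seed=0):
    """Yield (j, k, feature_subset, train_rows, test_rows) for s random
    feature subsets of size m, each with t random train/test splits."""
    rng = random.Random(seed)
    n_train = int(round(train_fraction * n_observations))
    for j in range(1, s + 1):
        # Feature sampling: m distinct columns drawn without replacement.
        features = sorted(rng.sample(range(n_features), m))
        for k in range(1, t + 1):
            # Observation sampling: a fresh random split of the rows.
            rows = list(range(n_observations))
            rng.shuffle(rows)
            yield j, k, features, sorted(rows[:n_train]), sorted(rows[n_train:])

# Hypothetical sizes: 3 feature subsets, 2 splits each, as in the figure.
samples = list(mcfs_samples(n_features=1000, n_observations=30, m=20, s=3, t=2))
for j, k, features, train, test in samples:
    print(j, k, len(features), len(train), len(test))
```

Each (j, k) pair defines one small classification problem, and the pairs do not depend on each other.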
Monte Carlo Feature Selection
Training a decision tree and analyzing its structure and performance.
Capturing feature interdependency.
[Figure: example decision tree with weighted accuracy wAcc=0.75 and four split nodes:]
feat=1  IG=0.3  n=10
feat=2  IG=0.1  n=3
feat=3  IG=0.2  n=7
feat=1  IG=0.1  n=5
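One way the tree's structure and performance could be turned into per-feature scores is sketched below. The slides do not state the formula, so this is an assumption: it uses a weighting of the form wAcc^u * IG * (n_node / n_tree)^v, summed over every node that splits on the feature, with u = v = 1. The node labelled "fear=2" on the slide is read as feat=2.

```python
from collections import defaultdict

def relative_importance(trees, u=1.0, v=1.0):
    """Accumulate a per-feature score over trees.

    Each tree is (w_acc, nodes); each node is (feature, info_gain, n_node).
    The first node is taken to be the root, so its n is the tree's sample count.
    The exponents u and v are assumed tuning parameters.
    """
    scores = defaultdict(float)
    for w_acc, nodes in trees:
        n_root = nodes[0][2]
        for feature, info_gain, n_node in nodes:
            scores[feature] += (w_acc ** u) * info_gain * (n_node / n_root) ** v
    return dict(scores)

# The tree drawn on the slide: (feature, IG, n) per split node, wAcc = 0.75.
slide_tree = (0.75, [(1, 0.3, 10), (2, 0.1, 3), (3, 0.2, 7), (1, 0.1, 5)])
ri = relative_importance([slide_tree])
for feature in sorted(ri):
    print(feature, round(ri[feature], 4))
```

Under these assumptions feature 1 receives 0.75 * (0.3 * 10/10 + 0.1 * 5/10) = 0.2625, feature 3 receives 0.105 and feature 2 receives 0.0225. Scores from different trees simply add up, which is what makes partial results easy to merge.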
Monte Carlo Feature Selection
Stopping criterion: STABLE RANKING
[Figure: the top features (1st, 2nd, 3rd, 4th, 5th) of the ranking, shown for an increasing number of samples s.]
Cumulated number of samples:
s       independent   incremental
1000    1000          1000
2000    3000          2000
3000    6000          3000
4000    10000         4000
5000    15000         5000
6000    21000         6000
7000    28000         7000
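The cost comparison and a stability check can be sketched as follows. This is a sketch under assumptions: the slides only name the criterion "stable ranking", so the set-overlap distance, the tolerance and the patience parameter are hypothetical stand-ins for whatever measure the actual software uses.

```python
def cumulated_samples(checkpoints, incremental):
    """Total number of samples computed to evaluate every checkpoint."""
    if incremental:
        return checkpoints[-1]   # each run extends the previous one
    return sum(checkpoints)      # each run starts from scratch

def ranking_distance(prev, curr, top_n):
    """Fraction of the previous top-n features missing from the current top-n."""
    return 1.0 - len(set(prev[:top_n]) & set(curr[:top_n])) / top_n

def first_stable_checkpoint(rankings, top_n, tol=0.0, patience=2):
    """Return the index of the first checkpoint at which the top-n set has
    changed by at most tol for `patience` consecutive comparisons."""
    calm = 0
    for i in range(1, len(rankings)):
        if ranking_distance(rankings[i - 1], rankings[i], top_n) <= tol:
            calm += 1
            if calm >= patience:
                return i
        else:
            calm = 0
    return None

checkpoints = [1000, 2000, 3000, 4000, 5000, 6000, 7000]
print(cumulated_samples(checkpoints, incremental=False))  # 28000
print(cumulated_samples(checkpoints, incremental=True))   # 7000
```

Checking the ranking incrementally reaches s = 7000 at a quarter of the cost of seven independent runs, which matches the last row of the table.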
DMCFS – basic concepts
•Sampling loop can be executed in parallel.
•Computations can be distributed between hosts.
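Because the samples are independent, the loop maps naturally onto a worker pool. The sketch below is illustrative only: `score_sample` is a hypothetical stand-in for one iteration (a real worker would train trees), and threads are used just to keep the example self-contained. CPU-bound work in CPython would normally be spread over processes or hosts instead.

```python
import random
from collections import Counter
from concurrent.futures import ThreadPoolExecutor

def score_sample(seed, n_features=50, m=5):
    """Stand-in for one sampling iteration: draw m features and return a
    partial score per feature."""
    rng = random.Random(seed)
    return Counter({f: 1.0 for f in rng.sample(range(n_features), m)})

def run_parallel(n_samples, workers=4):
    """Run independent samples concurrently and sum their partial scores."""
    total = Counter()
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for partial in pool.map(score_sample, range(n_samples)):
            total.update(partial)   # Counter.update adds the values
    return total

totals = run_parallel(n_samples=200)
print(totals.most_common(5))
```

Since every sample is seeded by its own index and the partial scores are merged by addition, the result does not depend on how many workers are used.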
DMCFS – basic concepts
•Communication between threads should be
asynchronous (non-blocking), and as minimal as possible.
[Figure: thread timelines. SYNCHRONOUS: PROGRESS interleaved with IDLE periods. ASYNCHRONOUS: PROGRESS only.]
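Non-blocking communication can be sketched with a shared queue: workers post results and carry on, and a collector consumes them in whatever order they arrive. The message layout and function names are assumptions made for illustration, not the application's actual protocol.

```python
import queue
import threading

def worker(worker_id, n_batches, outbox):
    """Compute batches and post each result without waiting for a reply."""
    for batch in range(n_batches):
        partial = {"worker": worker_id, "batch": batch, "samples": 10}
        outbox.put(partial)          # unbounded queue: put() does not wait for the consumer
    outbox.put(None)                 # sentinel: this worker is finished

def collect(n_workers, inbox):
    """Aggregate partial results as they arrive, in whatever order."""
    finished, samples = 0, 0
    while finished < n_workers:
        message = inbox.get()
        if message is None:
            finished += 1
        else:
            samples += message["samples"]
    return samples

results = queue.Queue()
threads = [threading.Thread(target=worker, args=(i, 5, results)) for i in range(3)]
for th in threads:
    th.start()
total_samples = collect(len(threads), results)
for th in threads:
    th.join()
print(total_samples)  # 150
```

No worker ever waits for another worker or for the collector, so a slow node delays only its own contribution.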
DMCFS – basic concepts
• Data should not have to be reshuffled by the
application (like in MapReduce).
• Data distribution should be in the scope of
infrastructure, not the application.
• If possible, data should be loaded only once.
[Figures: three ways of providing DATA to NODE 1, NODE 2 and NODE 3:]
• NETWORK FILESYSTEM
• DATABASE – when the problem dataset does not fit into memory
• HDFS+Spark – when feature samples do not fit in the memory
DMCFS – basic concepts
Computations should not be affected by nodes being
disconnected:
• Most of the nodes do not need to be aware of
other nodes and size of the cluster.
• Most nodes perform simple stateless workload.
• A single point of failure and potential bottleneck is
present in the form of a Master Node – it is however
restorable, and multithreading allows it to remain
responsive.
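The slides state that the Master Node is restorable but not how. One plausible approach, sketched here purely as an assumption, is to checkpoint the accumulated state to disk so that a restarted master resumes from the last save. Because workers are stateless, nothing else has to be recovered. The class, the file name and the JSON layout are hypothetical.

```python
import json
import os
import tempfile

class MasterState:
    """Accumulated scores plus the number of samples merged so far."""

    def __init__(self, scores=None, samples=0):
        self.scores = dict(scores or {})
        self.samples = samples

    def merge(self, partial, n_samples):
        """Add a worker's partial scores; workers keep no state of their own."""
        for feature, value in partial.items():
            self.scores[feature] = self.scores.get(feature, 0.0) + value
        self.samples += n_samples

    def save(self, path):
        """Write to a temporary file, then rename, so a crash never
        leaves a half-written checkpoint."""
        tmp = path + ".tmp"
        with open(tmp, "w") as fh:
            json.dump({"scores": self.scores, "samples": self.samples}, fh)
        os.replace(tmp, path)

    @classmethod
    def load(cls, path):
        with open(path) as fh:
            data = json.load(fh)
        return cls(data["scores"], data["samples"])

path = os.path.join(tempfile.mkdtemp(), "master.json")
master = MasterState()
master.merge({"f1": 0.5, "f2": 0.25}, n_samples=10)
master.save(path)
restored = MasterState.load(path)   # e.g. after the master process restarts
restored.merge({"f1": 0.5}, n_samples=10)
print(restored.scores, restored.samples)
```

Results merged after the last checkpoint would be lost in a crash, so the save interval trades disk traffic against repeated work.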
DMCFS – basic concepts
Writing and maintaining the application can be greatly
facilitated by employing a parallel programming
framework, more specifically an actor framework.
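The actor idea itself fits in a few lines: each actor owns its state, receives messages through a mailbox and handles them one at a time, so no locks are needed around the state. The sketch below is a toy illustration and not the framework used by the application, which the slides do not name. `Actor` and `RankingAggregator` are hypothetical.

```python
import queue
import threading

_STOP = object()  # private sentinel used to shut an actor down

class Actor:
    """A minimal actor: private state, a mailbox, and one thread that
    handles messages strictly one at a time."""

    def __init__(self):
        self._mailbox = queue.Queue()
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def tell(self, message):
        """Fire-and-forget send; the caller never waits for processing."""
        self._mailbox.put(message)

    def stop(self):
        """Process everything already in the mailbox, then terminate."""
        self._mailbox.put(_STOP)
        self._thread.join()

    def _run(self):
        while True:
            message = self._mailbox.get()
            if message is _STOP:
                return
            self.receive(message)

    def receive(self, message):
        raise NotImplementedError

class RankingAggregator(Actor):
    """Hypothetical actor that sums partial importances sent by samplers."""

    def __init__(self):
        self.scores = {}          # set before the mailbox thread starts
        super().__init__()

    def receive(self, message):
        for feature, value in message.items():
            self.scores[feature] = self.scores.get(feature, 0.0) + value

aggregator = RankingAggregator()
senders = [threading.Thread(target=aggregator.tell, args=({"f1": 1.0, "f2": 0.5},))
           for _ in range(4)]
for th in senders:
    th.start()
for th in senders:
    th.join()
aggregator.stop()
print(aggregator.scores)
```

Four threads send concurrently, yet `scores` is only ever touched by the actor's own thread, which is the property that makes this style easy to maintain.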
DMCFS - architecture
[Figures: one actor diagram per step:]
• Core actors
• Attaching Booster Nodes
• Interfacing with the outside world
• Feature sampling and producing partial RIs
• Comparing Feature Rankings
• Storing historical feature rankings and ranking distances
• Outputting results to client application
Test datasets
dataset                      observations   features
Golub et al. Leukemia data   38             7130
CNV radiosensitivity data    130            2*10^6
Results – nearly linear speedup
Results – comparison with dmLab
Results – high-dimensional dataset
DMCFS – censored survival data
[Figures: features and response; items numbered 1, 2, 3.]
DMCFS – summary
•Can be run on an arbitrary number of physical machines.
•Allows nodes to be attached and detached dynamically while
computations are running.
•Provides constantly updated partial results.
•Scales almost linearly with the number of
available processors.
•Platform-independent.
•Has no dependencies other than Java 1.8.
•Can be extended with new types of feature selectors.
•Can be deployed on public or private cloud.
•Software (0.1.0) available upon request.
•Creation of an intranet service is ongoing.
Acknowledgements
I would like to thank dr. Draminski for providing the latest version of dmLab
software for evaluation, as well as Najla Al-Harbi, Sara Bin Judia, dr. Salma Majid,
dr. Ghazi Alsbeih (Faisal Specialist Hospital & Research Centre, Riyadh 11211,
Kingdom of Saudi Arabia), and furthermore Bozena Rolnik (Data Mining Group) for
providing the CNV data. Calculations were carried out using the computer cluster
Ziemowit (http://www.ziemowit.hpc.polsl.pl) funded by the Silesian BIO-FARMA
project No. POIG.02.01.00-00-166/08 in the Computational Biology and
Bioinformatics Laboratory of the Biotechnology Centre in the Silesian University of
Technology. The work was financially supported by NCN grant HARMONIA
UMO-2013/08/M/ST6/00924 (LK).

Editor's Notes

  1. -combining many types of data together
  2. -relative importance…
  3. -combining many types of data together