SlideShare a Scribd company logo
1 of 29
Download to read offline
An opensource library to implement random forests in genomic contexts




      Oscar González-Recio     Juanjo Bazán       Selma Forni
The Problem
The Problem
Massive amount of information from high troughput genotyping platforms.
The Problem
Massive amount of information from high troughput genotyping platforms.

Need to extract knowledge from large, noisy, redundant, missing and fuzzy data.
The Problem
Massive amount of information from high troughput genotyping platforms.

Need to extract knowledge from large, noisy, redundant, missing and fuzzy data.

Massive amount of information consumes the attention of its recipients.
We need to allocate that attention efciently.
Why Random Forest?
Why Random Forest?
Using Machine Learning techniques we can extract hidden relationships that
exist in these huge volumes of data and do not follow a particular parametric
design.
Why Random Forest?
Using Machine Learning techniques we can extract hidden relationships that
exist in these huge volumes of data and do not follow a particular parametric
design.

Random Forest have desirable statistical properties.
Why Random Forest?
Using Machine Learning techniques we can extract hidden relationships that
exist in these huge volumes of data and do not follow a particular parametric
design.

Random Forest have desirable statistical properties.


Random Forest scales well computationally.
Why Random Forest?
Using Machine Learning techniques we can extract hidden relationships that
exist in these huge volumes of data and do not follow a particular parametric
design.

Random Forest have desirable statistical properties.


Random Forest scales well computationally.


Random Forest performs extremely well in a variety of possible complex
domains (Breiman, 2001; Gonzalez-Recio & Forni, 2011).
The Algorithm
The Algorithm
Ensemble methods:
    - Combination of diferent methods (usually simple models).
    - They have very good predictive ability because use additivity of models performances.
Based on Classifcation And Regression Trees (CART).

Use Randomization and Bagging.

Performs Feature Subset Selection.

Convenient for classifcation problems.

Fast computation.

Simple interpretation of results for human minds.

Previous work in genome-wide prediction (Gonzalez-Recio and Forni, 2011)
The Algorithm
Perform bootstrap on data: Ψ* = (y, X)




Build a CART ( fi (y, X) = ht (x) ) using only mtry proportion of SNPs in each node.



Repeat M times to reduce residuals by a factor of M.


Average estimates c0 = μ ; ci = 1/M
The Algorithm
Classifcation and regression trees:



Let Ψ = (y, X) be a set of data, with
y = vector of phenotypes (response variables)

X = (x1, x2) = matrix of features

y1 x11 x12
… … ...
yi xi1 xi2
… … ...
yn xn1 xn2
The Algorithm
Nimbus library
Nimbus library
Written in Ruby


            Open source programming language

            Syntax focused on simplicity

            Natural to read and easy to write

            www.ruby-lang.org
Nimbus library
How to install:


         > gem install nimbus


Prerequisites:

        Ruby and Rubygems (default library manager) installed in the system
Nimbus library
How to run:


        > nimbus


Confguration:


        Via confg.yml fle
Nimbus library
confg.yml fle:
Nimbus library
Input fles:

                   training




                    testing
Nimbus library
Features, use cases:


    Training of a prediction forest
         Using a training set of individuals
         Nimbus creates a reutilizable forest
         Generalization error are computed for every tree in the forest
         Nimbus calculates SNP importances


    Genomic Prediction of a testing sample
         Using a new trained forest
         Using a custom/reused forest, specif ed via conf g.yml
                                            i           i
Outputs
Random Forest fle


   In standard YAML format
Outputs
Predictions for the training sample
Outputs
Predictions for the testing sample
Outputs
SNP importances
More info:
Nimbus website:
               www.nimbusgem.org
Source code:
               www.github.com/xuanxu/nimbus
Report bugs/request features:
               www.github.com/xuanxu/nimbus/issues
Thank you!
Questions?

More Related Content

Similar to Nimbus

32_Nov07_MachineLear..
32_Nov07_MachineLear..32_Nov07_MachineLear..
32_Nov07_MachineLear..
butest
 
GNA 13552928 deep learning for GAN a.ppt
GNA 13552928 deep learning for GAN a.pptGNA 13552928 deep learning for GAN a.ppt
GNA 13552928 deep learning for GAN a.ppt
ManiMaran230751
 
2013 talk at TGAC, November 4
2013 talk at TGAC, November 42013 talk at TGAC, November 4
2013 talk at TGAC, November 4
c.titus.brown
 

Similar to Nimbus (20)

32_Nov07_MachineLear..
32_Nov07_MachineLear..32_Nov07_MachineLear..
32_Nov07_MachineLear..
 
Classification of chestnuts with feature selection by noise resilient classif...
Classification of chestnuts with feature selection by noise resilient classif...Classification of chestnuts with feature selection by noise resilient classif...
Classification of chestnuts with feature selection by noise resilient classif...
 
GNA 13552928 deep learning for GAN a.ppt
GNA 13552928 deep learning for GAN a.pptGNA 13552928 deep learning for GAN a.ppt
GNA 13552928 deep learning for GAN a.ppt
 
AutoML lectures (ACDL 2019)
AutoML lectures (ACDL 2019)AutoML lectures (ACDL 2019)
AutoML lectures (ACDL 2019)
 
2013 talk at TGAC, November 4
2013 talk at TGAC, November 42013 talk at TGAC, November 4
2013 talk at TGAC, November 4
 
Occurrence Prediction_NLP
Occurrence Prediction_NLPOccurrence Prediction_NLP
Occurrence Prediction_NLP
 
Creating smaller, faster, production-ready mobile machine learning models.
Creating smaller, faster, production-ready mobile machine learning models.Creating smaller, faster, production-ready mobile machine learning models.
Creating smaller, faster, production-ready mobile machine learning models.
 
Deep learning from a novice perspective
Deep learning from a novice perspectiveDeep learning from a novice perspective
Deep learning from a novice perspective
 
presentation.ppt
presentation.pptpresentation.ppt
presentation.ppt
 
Introduction to Deep Learning and neon at Galvanize
Introduction to Deep Learning and neon at GalvanizeIntroduction to Deep Learning and neon at Galvanize
Introduction to Deep Learning and neon at Galvanize
 
Random Forest
Random ForestRandom Forest
Random Forest
 
2014 nci-edrn
2014 nci-edrn2014 nci-edrn
2014 nci-edrn
 
Secrets of Supercomputing
Secrets of SupercomputingSecrets of Supercomputing
Secrets of Supercomputing
 
Deep Learning and Watson Studio
Deep Learning and Watson StudioDeep Learning and Watson Studio
Deep Learning and Watson Studio
 
An Introduction to Random Forest and linear regression algorithms
An Introduction to Random Forest and linear regression algorithmsAn Introduction to Random Forest and linear regression algorithms
An Introduction to Random Forest and linear regression algorithms
 
deepnet-lourentzou.ppt
deepnet-lourentzou.pptdeepnet-lourentzou.ppt
deepnet-lourentzou.ppt
 
13 random forest
13 random forest13 random forest
13 random forest
 
what is Random-Forest-Machine-Learning.pptx
what is Random-Forest-Machine-Learning.pptxwhat is Random-Forest-Machine-Learning.pptx
what is Random-Forest-Machine-Learning.pptx
 
2013 duke-talk
2013 duke-talk2013 duke-talk
2013 duke-talk
 
Deep learning lecture - part 1 (basics, CNN)
Deep learning lecture - part 1 (basics, CNN)Deep learning lecture - part 1 (basics, CNN)
Deep learning lecture - part 1 (basics, CNN)
 

More from Juanjo Bazán

More from Juanjo Bazán (6)

How to manage an open source project
How to manage an open source projectHow to manage an open source project
How to manage an open source project
 
Space software
Space softwareSpace software
Space software
 
Ruby & Ciencia
Ruby & CienciaRuby & Ciencia
Ruby & Ciencia
 
Ruby and Science
Ruby and ScienceRuby and Science
Ruby and Science
 
Contribuir a Rails
Contribuir a RailsContribuir a Rails
Contribuir a Rails
 
Rails para programadores Java
Rails para programadores JavaRails para programadores Java
Rails para programadores Java
 

Recently uploaded

+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
?#DUbAI#??##{{(☎️+971_581248768%)**%*]'#abortion pills for sale in dubai@
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
Joaquim Jorge
 

Recently uploaded (20)

Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live StreamsTop 5 Benefits OF Using Muvi Live Paywall For Live Streams
Top 5 Benefits OF Using Muvi Live Paywall For Live Streams
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
+971581248768>> SAFE AND ORIGINAL ABORTION PILLS FOR SALE IN DUBAI AND ABUDHA...
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Artificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and MythsArtificial Intelligence: Facts and Myths
Artificial Intelligence: Facts and Myths
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 

Nimbus

  • 1. An opensource library to implement random forests in genomic contexts Oscar González-Recio Juanjo Bazán Selma Forni
  • 3. The Problem Massive amount of information from high troughput genotyping platforms.
  • 4. The Problem Massive amount of information from high troughput genotyping platforms. Need to extract knowledge from large, noisy, redundant, missing and fuzzy data.
  • 5. The Problem Massive amount of information from high troughput genotyping platforms. Need to extract knowledge from large, noisy, redundant, missing and fuzzy data. Massive amount of information consumes the attention of its recipients. We need to allocate that attention efciently.
  • 7. Why Random Forest? Using Machine Learning techniques we can extract hidden relationships that exist in these huge volumes of data and do not follow a particular parametric design.
  • 8. Why Random Forest? Using Machine Learning techniques we can extract hidden relationships that exist in these huge volumes of data and do not follow a particular parametric design. Random Forest have desirable statistical properties.
  • 9. Why Random Forest? Using Machine Learning techniques we can extract hidden relationships that exist in these huge volumes of data and do not follow a particular parametric design. Random Forest have desirable statistical properties. Random Forest scales well computationally.
  • 10. Why Random Forest? Using Machine Learning techniques we can extract hidden relationships that exist in these huge volumes of data and do not follow a particular parametric design. Random Forest have desirable statistical properties. Random Forest scales well computationally. Random Forest performs extremely well in a variety of possible complex domains (Breiman, 2001; Gonzalez-Recio & Forni, 2011).
  • 12. The Algorithm Ensemble methods: - Combination of diferent methods (usually simple models). - They have very good predictive ability because use additivity of models performances. Based on Classifcation And Regression Trees (CART). Use Randomization and Bagging. Performs Feature Subset Selection. Convenient for classifcation problems. Fast computation. Simple interpretation of results for human minds. Previous work in genome-wide prediction (Gonzalez-Recio and Forni, 2011)
  • 13. The Algorithm Perform bootstrap on data: Ψ* = (y, X) Build a CART ( fi (y, X) = ht (x) ) using only mtry proportion of SNPs in each node. Repeat M times to reduce residuals by a factor of M. Average estimates c0 = μ ; ci = 1/M
  • 14. The Algorithm Classifcation and regression trees: Let Ψ = (y, X) be a set of data, with y = vector of phenotypes (response variables) X = (x1, x2) = matrix of features y1 x11 x12 … … ... yi xi1 xi2 … … ... yn xn1 xn2
  • 17. Nimbus library Written in Ruby Open source programming language Syntax focused on simplicity Natural to read and easy to write www.ruby-lang.org
  • 18. Nimbus library How to install: > gem install nimbus Prerequisites: Ruby and Rubygems (default library manager) installed in the system
  • 19. Nimbus library How to run: > nimbus Confguration: Via confg.yml fle
  • 21. Nimbus library Input fles: training testing
  • 22. Nimbus library Features, use cases: Training of a prediction forest Using a training set of individuals Nimbus creates a reutilizable forest Generalization error are computed for every tree in the forest Nimbus calculates SNP importances Genomic Prediction of a testing sample Using a new trained forest Using a custom/reused forest, specif ed via conf g.yml i i
  • 23. Outputs Random Forest fle In standard YAML format
  • 24. Outputs Predictions for the training sample
  • 25. Outputs Predictions for the testing sample
  • 27. More info: Nimbus website: www.nimbusgem.org Source code: www.github.com/xuanxu/nimbus Report bugs/request features: www.github.com/xuanxu/nimbus/issues