elephant56 is an open source framework for the development and execution of single and parallel Genetic Algorithms (GAs). It provides high-level functionalities that developers can reuse without worrying about complex internal structures. In particular, it makes it possible to distribute GA computation over a Hadoop MapReduce cluster of multiple computers. In this paper we describe the design and implementation details of the framework, which supports three different models for parallel GAs, namely the global model, the grid model and the island model. Moreover, we provide a complete example of use.
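The island model mentioned above is easy to picture in miniature. The sketch below is not elephant56 code (its actual API is Java on Hadoop); it is a hypothetical, self-contained Python toy showing the idea: independent subpopulations evolve in parallel and periodically exchange their best individuals in a ring, here on the classic OneMax problem. All function and parameter names are invented for illustration.

```python
import random

def one_max(ind):
    # Fitness: number of 1-bits (the classic OneMax toy problem).
    return sum(ind)

def evolve_island(pop, mutation_rate, rng):
    # One generation: binary tournament selection, one-point crossover,
    # and per-bit flip mutation.
    new_pop = []
    while len(new_pop) < len(pop):
        p1 = max(rng.sample(pop, 2), key=one_max)
        p2 = max(rng.sample(pop, 2), key=one_max)
        cut = rng.randrange(1, len(p1))
        child = p1[:cut] + p2[cut:]
        child = [b ^ 1 if rng.random() < mutation_rate else b for b in child]
        new_pop.append(child)
    return new_pop

def island_ga(n_islands=4, pop_size=20, genome_len=32,
              generations=50, migration_every=10, seed=1):
    rng = random.Random(seed)
    islands = [[[rng.randint(0, 1) for _ in range(genome_len)]
                for _ in range(pop_size)] for _ in range(n_islands)]
    for gen in range(1, generations + 1):
        islands = [evolve_island(pop, 1.0 / genome_len, rng) for pop in islands]
        if gen % migration_every == 0:
            # Ring migration: each island receives the best individual of
            # its predecessor, replacing its own worst individual.
            bests = [max(pop, key=one_max) for pop in islands]
            for i, pop in enumerate(islands):
                worst = min(range(len(pop)), key=lambda j: one_max(pop[j]))
                pop[worst] = bests[(i - 1) % n_islands][:]
    return max((ind for pop in islands for ind in pop), key=one_max)

best = island_ga()
print(one_max(best), "of", len(best), "bits set")
```

In elephant56's island model the same structure applies, except each island runs as a separate MapReduce task and migration happens between Hadoop jobs.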
This work was presented at the Evolutionary Computation Software Systems Workshop at GECCO (EvoSoft), July, 2016, Denver, Colorado, United States of America.
Genetic Algorithms (GAs) are a metaheuristic search technique belonging to the class of Evolutionary Algorithms (EAs). They have been proven effective in addressing several problems in many fields, but they also suffer from scalability issues that may prevent their application to real-world problems. Thus, the aim of providing highly scalable GA-based solutions, together with the reduced costs of parallel architectures, motivates the research on Parallel Genetic Algorithms (PGAs). Cloud computing may be a valid option for parallelisation, since there is no need to own the physical hardware, which can be purchased from cloud providers for the desired time, quantity and quality. There are different cloud technologies and approaches employable for this purpose, but they all introduce communication overhead. Thus, one might wonder if, and possibly when, specific approaches, environments and models show better performance than sequential versions in terms of execution time and resource usage. This thesis investigates if and when GAs can scale in the cloud using specific approaches. Firstly, Hadoop MapReduce is exploited by designing and developing an open source framework, i.e., elephant56, that reduces the development effort and speeds up GAs using three parallel models. The performance of the framework is then evaluated through an empirical study. Secondly, software containers and message queues are employed to develop, deploy and execute PGAs in the cloud, and the devised system is evaluated with an empirical study on a commercial cloud provider. Finally, cloud technologies are also explored for the parallelisation of other EAs, designing and developing cCube, a collaborative microservices architecture for machine learning problems.
This PhD thesis in Management & Information Technology was defended on 20 April 2017 at the University of Salerno, Fisciano, Italy.
Dr. Erin LeDell, Machine Learning Scientist, H2O.ai, at MLconf SEA, 5/20/16 (MLconf)
Multi-algorithm Ensemble Learning at Scale: Software, Hardware and Algorithmic Approaches: Multi-algorithm ensemble machine learning methods are often used when the true prediction function is not easily approximated by a single algorithm. The Super Learner algorithm, also known as stacking, combines multiple, typically diverse, base learning algorithms into a single, powerful prediction function through a secondary learning process called metalearning. Although ensemble methods offer superior performance over their singleton counterparts, there is an implicit computational cost to ensembles, as they require training and cross-validating multiple base learning algorithms.
We will demonstrate a variety of software- and hardware-based approaches that lead to more scalable ensemble learning software, including a highly scalable implementation of stacking called “H2O Ensemble”, built on top of the open source, distributed machine learning platform, H2O. H2O Ensemble scales across multi-node clusters and allows the user to create ensembles of deep neural networks, Gradient Boosting Machines, Random Forest, and others. As for algorithm-based approaches, we will present two algorithmic modifications to the original stacking algorithm that further reduce computation time — Subsemble algorithm and the Online Super Learner algorithm. This talk will also include benchmarks of the implementations of these new stacking variants.
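The metalearning step described above can be sketched compactly. The following is not the H2O Ensemble API; it is a minimal, hypothetical Python illustration of stacking: base learners are trained on cross-validation folds, their out-of-fold predictions form the "level-one" data, and a least-squares metalearner combines them. The two base learners (a mean predictor and a 1-nearest-neighbour predictor) and every name here are invented for illustration.

```python
def fit_mean(X, y):
    # Base learner 1: predicts the training-set mean everywhere.
    m = sum(y) / len(y)
    return lambda x: m

def fit_1nn(X, y):
    # Base learner 2: predicts the target of the nearest training point (1-D).
    pairs = list(zip(X, y))
    return lambda x: min(pairs, key=lambda p: abs(p[0] - x))[1]

def stack(X, y, base_fits, n_folds=5):
    # Level-one data: out-of-fold predictions of each base learner.
    n = len(X)
    Z = [[0.0] * len(base_fits) for _ in range(n)]
    for f in range(n_folds):
        test_idx = [i for i in range(n) if i % n_folds == f]
        train_idx = [i for i in range(n) if i % n_folds != f]
        Xtr = [X[i] for i in train_idx]
        ytr = [y[i] for i in train_idx]
        models = [fit(Xtr, ytr) for fit in base_fits]
        for i in test_idx:
            for j, m in enumerate(models):
                Z[i][j] = m(X[i])
    # Metalearner: least-squares weights for the two base learners,
    # solved directly via the 2x2 normal equations.
    a = sum(z[0] * z[0] for z in Z)
    b = sum(z[0] * z[1] for z in Z)
    c = sum(z[1] * z[1] for z in Z)
    d1 = sum(z[0] * yi for z, yi in zip(Z, y))
    d2 = sum(z[1] * yi for z, yi in zip(Z, y))
    det = a * c - b * b
    w = ((c * d1 - b * d2) / det, (a * d2 - b * d1) / det)
    # Refit the base learners on all the data; the ensemble combines them.
    full = [fit(X, y) for fit in base_fits]
    return lambda x: sum(wi * m(x) for wi, m in zip(w, full))
```

On a simple linear dataset the stacked model should weight the 1-NN learner heavily and beat the mean baseline, which is the essence of what the metalearner buys you.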
Applying your Convolutional Neural Networks (Databricks)
Part 3 of the Deep Learning Fundamentals Series, this session starts with a quick primer on activation functions, learning rates, optimizers, and backpropagation. Then it dives deeper into convolutional neural networks, discussing convolutions (including kernels, local connectivity, strides, padding, and activation functions), pooling (or subsampling to reduce the image size), and fully connected layers. The session also provides a high-level overview of some CNN architectures. The demos included in these slides run on Keras with a TensorFlow backend on Databricks.
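The convolution and pooling operations named above can be written out directly. Below is a minimal pure-Python sketch (no framework; function names are invented for illustration), assuming single-channel images represented as lists of lists:

```python
def conv2d(image, kernel, stride=1, padding=0):
    # Valid cross-correlation with optional zero padding; an activation
    # function (e.g. ReLU) would normally be applied to the result.
    if padding:
        w = len(image[0]) + 2 * padding
        zeros = [0.0] * w
        image = ([zeros[:] for _ in range(padding)]
                 + [[0.0] * padding + row + [0.0] * padding for row in image]
                 + [zeros[:] for _ in range(padding)])
    kh, kw = len(kernel), len(kernel[0])
    out = []
    for i in range(0, len(image) - kh + 1, stride):
        row = []
        for j in range(0, len(image[0]) - kw + 1, stride):
            # Local connectivity: each output depends only on a kh x kw patch.
            row.append(sum(image[i + a][j + b] * kernel[a][b]
                           for a in range(kh) for b in range(kw)))
        out.append(row)
    return out

def max_pool(image, size=2, stride=2):
    # Subsampling: keep the maximum of each size x size window.
    out = []
    for i in range(0, len(image) - size + 1, stride):
        out.append([max(image[i + a][j + b]
                        for a in range(size) for b in range(size))
                    for j in range(0, len(image[0]) - size + 1, stride)])
    return out
```

A 4x4 input convolved with a 2x2 kernel at stride 1 yields a 3x3 feature map, and 2x2 max pooling at stride 2 then halves each spatial dimension, exactly the size bookkeeping the slides walk through.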
Sergei Vassilvitskii, Research Scientist, Google, at MLconf NYC, 4/15/16 (MLconf)
Teaching K-Means New Tricks: Over 50 years old, the k-means algorithm remains one of the most popular clustering algorithms. In this talk we’ll cover some recent developments, including better initialization, the notion of coresets, clustering at scale, and clustering with outliers.
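The "better initialization" referred to above is typically k-means++ seeding: each new center is sampled with probability proportional to its squared distance from the nearest center chosen so far. A minimal sketch on 1-D points follows (function names are illustrative, not from any library):

```python
import random

def kmeans_pp_init(points, k, rng):
    # k-means++ seeding: first center uniform at random, then each next
    # center sampled with probability proportional to D(x)^2, the squared
    # distance to the nearest already-chosen center.
    centers = [rng.choice(points)]
    while len(centers) < k:
        d2 = [min((p - c) ** 2 for c in centers) for p in points]
        r = rng.random() * sum(d2)
        acc = 0.0
        for p, w in zip(points, d2):
            acc += w
            if acc >= r:
                centers.append(p)
                break
    return centers

def kmeans(points, k, seed=0, iters=20):
    # Lloyd's algorithm on top of k-means++ seeding.
    rng = random.Random(seed)
    centers = kmeans_pp_init(points, k, rng)
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            clusters[min(range(k), key=lambda i: (p - centers[i]) ** 2)].append(p)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return sorted(centers)

print(kmeans([0.0, 1.0, 2.0, 100.0, 101.0, 102.0], k=2))  # two centers, one per cluster
```

On well-separated data the D^2 weighting makes it overwhelmingly likely that the seeds land in different clusters, which is precisely the failure mode of uniform random seeding that k-means++ fixes.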
How to win data science competitions with Deep Learning (Sri Ambati)
Note: Please download the slides first, otherwise some links won't work!
How to win Kaggle-style data science competitions and influence decisions with R, Deep Learning and H2O's fast algorithms.
We take a few public and Kaggle datasets and build models to win competitions on accuracy and scoring speed.
- Powered by the open source machine learning software H2O.ai. Contributors welcome at: https://github.com/h2oai
- To view videos on H2O open source machine learning software, go to: https://www.youtube.com/user/0xdata
Dmitry will show the audience how to get started with MXNet and build Deep Learning models to classify images, sound and text.
Alex Smola, Professor in the Machine Learning Department, Carnegie Mellon University (MLconf)
Fast, Cheap and Deep – Scaling Machine Learning: Distributed high throughput machine learning is both a challenge and a key enabling technology. Using a Parameter Server template we are able to distribute algorithms efficiently over multiple GPUs and in the cloud. This allows us to design very fast recommender systems, factorization machines, classifiers, and deep networks. This degree of scalability allows us to tackle computationally expensive problems efficiently, yielding excellent results e.g. in visual question answering.
Erin LeDell, Machine Learning Scientist, H2O.ai, at MLconf ATL 2016 (MLconf)
Multi-algorithm Ensemble Learning at Scale: Software, Hardware and Algorithmic Approaches: Multi-algorithm ensemble machine learning methods are often used when the true prediction function is not easily approximated by a single algorithm. The Super Learner algorithm, also known as stacking, combines multiple, typically diverse, base learning algorithms into a single, powerful prediction function through a secondary learning process called metalearning. Although ensemble methods offer superior performance over their singleton counterparts, there is an implicit computational cost to ensembles, as they require training and cross-validating multiple base learning algorithms.
We will demonstrate a variety of software- and hardware-based approaches that lead to more scalable ensemble learning software, including a highly scalable implementation of stacking called “H2O Ensemble”, built on top of the open source, distributed machine learning platform, H2O. H2O Ensemble scales across multi-node clusters and allows the user to create ensembles of deep neural networks, Gradient Boosting Machines, Random Forest, and others. As for algorithm-based approaches, we will present two algorithmic modifications to the original stacking algorithm that further reduce computation time — Subsemble algorithm and the Online Super Learner algorithm. This talk will also include benchmarks of the implementations of these new stacking variants.
Title: "Understanding PyTorch: PyTorch in Image Processing". Github: https://github.com/azarnyx/PyData_Meetup. The Dataset: https://goo.gl/CWmLWD.
The talk was given at the PyData Meetup that took place in Munich on 06.03.2019 at the Data Reply office. The speaker was Dmitrii Azarnykh, a data scientist at Data Reply.
This contains the agenda of the Spark Meetup I organised in Bangalore on Friday, the 23rd of Jan 2014. It also carries the slides for the talk I gave on distributed deep learning over Spark.
A Tale of Three Deep Learning Frameworks: TensorFlow, Keras, & PyTorch with B... (Databricks)
We all know what they say – the bigger the data, the better. But when the data gets really big, how do you mine it and what deep learning framework to use? This talk will survey, with a developer’s perspective, three of the most popular deep learning frameworks—TensorFlow, Keras, and PyTorch—as well as when to use their distributed implementations.
We’ll compare code samples from each framework and discuss their integration with distributed computing engines such as Apache Spark (which can handle massive amounts of data) as well as help you answer questions such as:
As a developer, how do I pick the right deep learning framework?
Do I want to develop my own model or should I employ an existing one?
How do I strike a trade-off between productivity and control through low-level APIs?
What language should I choose?
In this session, we will explore how to build a deep learning application with TensorFlow, Keras, or PyTorch in under 30 minutes. After this session, you will walk away with the confidence to evaluate which framework is best for you.
In KDD2011, Vijay Narayanan (Yahoo!) and Milind Bhandarkar (Greenplum Labs, EMC) conducted a tutorial on "Modeling with Hadoop". This is the second half of the tutorial.
Mining Big Data Streams with Apache SAMOA (Albert Bifet)
In this talk, we present Apache SAMOA, an open-source platform for mining big data streams with Apache Flink, Storm and Samza. Real-time analytics is becoming the fastest and most efficient way to obtain useful knowledge from what is happening now, allowing organizations to react quickly when problems appear or to detect new trends, helping to improve their performance. Apache SAMOA includes algorithms for the most common machine learning tasks, such as classification and clustering. It provides a pluggable architecture that allows it to run on Apache Flink, but also on several other distributed stream processing engines such as Storm and Samza.
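For a flavour of the stream-mining setting SAMOA targets, without any of its actual API, the sketch below runs a simple online perceptron over a stream in test-then-train (prequential) fashion: each arriving instance is first used for evaluation and then for a single incremental update, so the stream is never stored. Everything here is an invented, minimal illustration.

```python
def online_perceptron(stream, lr=0.1):
    # Processes one (features, label) pair at a time; labels are +1 or -1.
    # Prequential evaluation: predict first, then learn from the instance.
    w, b = None, 0.0
    seen = correct = 0
    for x, y in stream:
        if w is None:
            w = [0.0] * len(x)  # lazily sized from the first instance
        score = sum(wi * xi for wi, xi in zip(w, x)) + b
        pred = 1 if score > 0 else -1
        seen += 1
        correct += (pred == y)
        if pred != y:  # mistake-driven update
            w = [wi + lr * y * xi for wi, xi in zip(w, x)]
            b += lr * y
    return w, b, correct / seen
```

In SAMOA the same test-then-train loop is distributed across a stream processing engine (Flink, Storm or Samza) instead of running in a single process.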
Note: Make sure to download the slides to get the high-resolution version!
Also, you can find the webinar recording here (please also download for better quality): https://www.dropbox.com/s/72qi6wjzi61gs3q/H2ODeepLearningArnoCandel052114.mov
Come hear how Deep Learning in H2O is unlocking never before seen performance for prediction!
H2O is a Google-scale open source machine learning engine for R & Big Data. Enterprises can now use all of their data, without sampling, to build intelligent applications. This live webinar introduces Distributed Deep Learning concepts, implementation and results from recent developments. Real-world classification and regression use cases from the eBay text dataset, MNIST handwritten digits and cancer datasets will present the power of this game-changing technology.
Smart Scalable Feature Reduction with Random Forests with Erik Erlandson (Databricks)
Modern datacenters and IoT networks generate a wide variety of telemetry that makes excellent fodder for machine learning algorithms. Combined with feature extraction and expansion techniques such as word2vec or polynomial expansion, these data yield an embarrassment of riches for learning models and the data scientists who train them. However, these extremely rich feature sets come at a cost. High-dimensional feature spaces almost always include many redundant or noisy dimensions. These low-information features waste space and computation, and reduce the quality of learning models by diluting useful features.
In this talk, Erlandson will describe how Random Forest Clustering identifies useful features in data having many low-quality features, and will demonstrate a feature reduction application using Apache Spark to analyze compute infrastructure telemetry data.
Learn the principles of how Random Forest Clustering solves feature reduction problems, and how you can apply Random Forest tools in Apache Spark to improve your model training scalability, the quality of your models, and your understanding of application domains.
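Random-forest-style feature scoring can be approximated in miniature. The sketch below is pure Python with invented names, and it is not Erlandson's Random Forest Clustering method: instead it shows the related mean-decrease-in-impurity idea, scoring each feature by the average Gini impurity decrease of its best single-feature split across bootstrap samples, then keeping the top scorers.

```python
import random

def gini(labels):
    # Gini impurity of a binary label list: 2 * p * (1 - p).
    n = len(labels)
    if n == 0:
        return 0.0
    p = sum(labels) / n
    return 2 * p * (1 - p)

def stump_importance(X, y, feat, rng):
    # Impurity decrease of the best threshold split on one feature,
    # measured on one bootstrap sample (as a single "tree" would see it).
    idx = [rng.randrange(len(y)) for _ in range(len(y))]
    xs = [X[i][feat] for i in idx]
    ys = [y[i] for i in idx]
    base = gini(ys)
    best = 0.0
    for t in sorted(set(xs)):
        left = [l for v, l in zip(xs, ys) if v <= t]
        right = [l for v, l in zip(xs, ys) if v > t]
        weighted = (len(left) * gini(left) + len(right) * gini(right)) / len(ys)
        best = max(best, base - weighted)
    return best

def forest_importances(X, y, n_trees=30, seed=0):
    # Average each feature's best-split impurity decrease over many
    # bootstrap samples; low scorers are candidates for removal.
    rng = random.Random(seed)
    k = len(X[0])
    scores = [0.0] * k
    for _ in range(n_trees):
        for f in range(k):
            scores[f] += stump_importance(X, y, f, rng)
    return [s / n_trees for s in scores]
```

With an informative feature and two noise features, the informative one should dominate the scores, and dropping the rest is the feature reduction step.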
TensorFrames: Google TensorFlow on Apache Spark (Databricks)
Presentation at Bay Area Spark Meetup by Databricks Software Engineer and Spark committer Tim Hunter.
This presentation covers how you can use TensorFrames with TensorFlow for distributed computing on GPUs.
Real-Time Hardware Simulation with Portable Hardware-in-the-Loop (PHIL-Rebooted) (Riley Waite)
SPRITE (Small Projects Rapid Integration and Test Environment) is a NASA project focused on creating open source software and easily accessible hardware for real-time, hardware-in-the-loop simulations. The SPRITE Lab's first iteration of this idea, called PHIL, was a system created with the intention of providing an affordable, modular environment for space flight testing. It functioned as an AC-powered compact PCI enclosure, one cubic foot in size. Although the PHIL project accomplished its goals, PHIL was found to be too impractical. Its execution platform was a heavily modified and outdated ARTEMIS, its user interface an old version of MAESTRO, and its visualization software bdStudio, all of which were derived from the SLS's proprietary HIL software. PHIL-Rebooted, the next iteration of PHIL, was created with the objective of reducing overall cost and open-sourcing the system. The entire system is being developed to run solely as a software package, as opposed to the original PHIL system. Not only is this beneficial in terms of user-friendliness, but it will also dramatically reduce the cost of the system by not requiring the purchase of proprietary hardware. Rather, users will only need an inexpensive, third-party USB-IO interfacing board for their closed-loop simulations, allowing for a variety of boards to be used. PHIL-Rebooted's new execution platform is libSPRITE, an open source, run-time process manager developed by the SPRITE Lab.
Scaling out logistic regression with Spark (Barak Gitsis)
Large-scale multinomial logistic regression with Spark. Contains animated GIFs, an analysis of L-BFGS, real-world Spark configurations, and the SimilarWeb categorization algorithm.
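For intuition about what the optimizer in such a system is fitting, here is a minimal binary logistic regression trained with plain full-batch gradient descent, a simple stand-in for the L-BFGS solver the talk analyses, with no Spark involved. All names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_logistic(X, y, lr=0.5, epochs=200):
    # Full-batch gradient descent on the log-loss. The gradient of the
    # log-loss w.r.t. the weights is sum((sigmoid(w.x + b) - y) * x) / n.
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(y)
    for _ in range(epochs):
        gw = [0.0] * len(w)
        gb = 0.0
        for xi, yi in zip(X, y):
            err = sigmoid(sum(wj * xj for wj, xj in zip(w, xi)) + b) - yi
            for j, xj in enumerate(xi):
                gw[j] += err * xj
            gb += err
        w = [wj - lr * gj / n for wj, gj in zip(w, gw)]
        b -= lr * gb / n
    return w, b
```

L-BFGS reaches the same optimum far faster by approximating curvature from recent gradients, which is why it is the solver of choice at SimilarWeb's scale; the loss function being minimized is identical.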
Spoon is an open-source library that enables you to transform and analyze Java source code. Thanks to a complete and fine-grained Java metamodel, you can read and write the AST built by Spoon. In this talk, you'll see all the core concepts and the API with an example. Then, you'll see how you can integrate this project into yours.
A continuation of the discussion of object-oriented programming and design. We now cover Generic Programming as a way to build flexible and reusable components. We also look at reflection. Then we go over how to design loosely coupled components, using the famous duck simulator as an example.
Code Horror Dude will also drop by for a visit.
Separating Hype from Reality in Deep Learning with Sameer Farooqui (Databricks)
Deep Learning is all the rage these days, but where does the reality of what Deep Learning can do end and the media hype begin? In this talk, I will dispel common myths about Deep Learning that are not necessarily true and help you decide whether you should practically use Deep Learning in your software stack.
I’ll begin with a technical overview of common neural network architectures like CNNs, RNNs, GANs and their common use cases like computer vision, language understanding or unsupervised machine learning. Then I’ll separate the hype from reality around questions like:
• When should you prefer traditional ML systems like scikit-learn or Spark.ML instead of Deep Learning?
• Do you no longer need to do careful feature extraction and standardization if using Deep Learning?
• Do you really need terabytes of data when training neural networks or can you ‘steal’ pre-trained lower layers from public models by using transfer learning?
• How do you decide which activation function (like ReLU, leaky ReLU, ELU, etc) or optimizer (like Momentum, AdaGrad, RMSProp, Adam, etc) to use in your neural network?
• Should you randomly initialize the weights in your network or use more advanced strategies like Xavier or He initialization?
• How easy is it to overfit/overtrain a neural network, and what are the common techniques to avoid overfitting (like l1/l2 regularization, dropout and early stopping)?
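Of the regularization techniques in the last bullet, dropout is the simplest to show concretely. Below is a sketch of inverted dropout (a hypothetical helper, not taken from any framework): units are zeroed with probability p_drop during training and the survivors are scaled by 1/(1 - p_drop), so the expected activation is unchanged and no rescaling is needed at inference time.

```python
import random

def dropout_forward(activations, p_drop, rng, train=True):
    # Inverted dropout on a single layer's activations.
    # During training: drop each unit with probability p_drop and scale
    # survivors by 1/keep so E[output] == input.
    # At inference: the layer is the identity (no scaling needed).
    if not train or p_drop == 0.0:
        return activations
    keep = 1.0 - p_drop
    return [a / keep if rng.random() < keep else 0.0 for a in activations]
```

Because each forward pass samples a different mask, the network cannot rely on any single co-adapted unit, which is the regularizing effect.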
Azure Digital Twins is an Azure IoT service that allows you to create comprehensive models of the physical environment, covering your devices and the environment that surrounds them.
Let's discover what a Twin is and which features this new service offers.
A beginner's guide to annotation processing.
In this talk that I gave at Droidcon Tel Aviv in 2016, I walk you through the process of building a custom annotation processor which mimics some of the behavior you may be familiar with from the popular Android library: Butter Knife.
Clonal Selection Algorithm Parallelization with MPJExpressAyi Purbasari
This paper exploits the parallelism potential on a Clonal Selection Algorithm (CSA) as a parallel metaheuristic algorithm, due the lack of explanation detail of the stages of designing parallel algorithms. To parallelise population-based algorithms, we need to exploit and define their granularity for each stage; do data or functional partition; and choose the communication model. Using a library for a message-passing model, such as MPJExpress, we define appropriate methods to implement process communication. This research results pseudo-code for the two communication message-passing models, using MPJExpress. We implemented this pseudo-codes using Java Language with a dataset from the Travelling Salesman Problem (TSP). The experiments showed that multicommunication model using alltogether method gained better performance that master-slave model that using send-and receive method.
Sharpen existing tools or get a new toolbox? Contemporary cluster initiatives...Orkestra
UIIN Conference, Madrid, 27-29 May 2024
James Wilson, Orkestra and Deusto Business School
Emily Wise, Lund University
Madeline Smith, The Glasgow School of Art
Have you ever wondered how search works while visiting an e-commerce site, internal website, or searching through other types of online resources? Look no further than this informative session on the ways that taxonomies help end-users navigate the internet! Hear from taxonomists and other information professionals who have first-hand experience creating and working with taxonomies that aid in navigation, search, and discovery across a range of disciplines.
0x01 - Newton's Third Law: Static vs. Dynamic AbusersOWASP Beja
f you offer a service on the web, odds are that someone will abuse it. Be it an API, a SaaS, a PaaS, or even a static website, someone somewhere will try to figure out a way to use it to their own needs. In this talk we'll compare measures that are effective against static attackers and how to battle a dynamic attacker who adapts to your counter-measures.
About the Speaker
===============
Diogo Sousa, Engineering Manager @ Canonical
An opinionated individual with an interest in cryptography and its intersection with secure software development.
This presentation by Morris Kleiner (University of Minnesota), was made during the discussion “Competition and Regulation in Professions and Occupations” held at the Working Party No. 2 on Competition and Regulation on 10 June 2024. More papers and presentations on the topic can be found out at oe.cd/crps.
This presentation was uploaded with the author’s consent.
Acorn Recovery: Restore IT infra within minutesIP ServerOne
Introducing Acorn Recovery as a Service, a simple, fast, and secure managed disaster recovery (DRaaS) by IP ServerOne. A DR solution that helps restore your IT infra within minutes.
elephant56: Design and Implementation of a Parallel Genetic Algorithms Framework on Hadoop MapReduce
1. EvoSoft 2016 @ GECCO 2016
Denver, Colorado, USA
July 20-24, 2016
elephant56
Design and Implementation of a
Parallel Genetic Algorithms Framework
on Hadoop MapReduce
PASQUALE SALZA1, FILOMENA FERRUCCI1, FEDERICA SARRO2
1University of Salerno, 2University College London
https://github.com/pasqualesalza/elephant56
2. • An open source framework based on Hadoop
MapReduce
• Allows developers to focus on Genetic
Algorithms definition only
• Manages the interaction with Hadoop
MapReduce
elephant56
https://github.com/pasqualesalza/elephant56
4. Why parallelize?
• Genetic Algorithms are time and resource intensive
• Luckily, they are naturally parallelizable!
5. Hadoop
• Created by Doug Cutting in 2006 while he was at Yahoo!
• Nowadays, it includes YARN (Yet Another Resource Negotiator)
6. Why Hadoop?
• Large clusters of machines
• Scalability and reliability of computation
processes and storage
• Wide diffusion in industries (BigData)
7. HDFS
• Hadoop Distributed File System
• Multiple data nodes
• Files are split into blocks, spread across data nodes
• Blocks replicated multiple times for reliability
and performance
8. MapReduce
• A programming paradigm derived from
functional programming
• Based on two main functions:
• map: handles the parallelization
• reduce: collects and merges the results
9. A Typical Day of a
MapReduce Job
• Example: word count, which counts the occurrences of each word in a text
Input: Deer Bear River | Car Car River | Deer Car Bear
Splitting: "Deer Bear River" goes to mapper 1, "Car Car River" to mapper 2, "Deer Car Bear" to mapper 3
Mapping: (Deer, 1) (Bear, 1) (River, 1) | (Car, 1) (Car, 1) (River, 1) | (Deer, 1) (Car, 1) (Bear, 1)
Shuffling: Bear, (1, 1) | Car, (1, 1, 1) | Deer, (1, 1) | River, (1, 1)
Reducing: Bear, 2 | Car, 3 | Deer, 2 | River, 2
Output: Bear, 2 | Car, 3 | Deer, 2 | River, 2
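The stages above can be sketched in plain Java. This is a minimal single-process simulation of the MapReduce phases, not the Hadoop API; the class and method names are illustrative only:

```java
import java.util.*;
import java.util.stream.*;

public class WordCountSketch {
    // Mapping: emit a (word, 1) pair for each word in a line.
    static List<Map.Entry<String, Integer>> map(String line) {
        return Arrays.stream(line.split("\\s+"))
                .map(word -> Map.entry(word, 1))
                .collect(Collectors.toList());
    }

    // Shuffling: group the emitted values by key.
    static Map<String, List<Integer>> shuffle(List<Map.Entry<String, Integer>> pairs) {
        return pairs.stream().collect(Collectors.groupingBy(
                Map.Entry::getKey,
                Collectors.mapping(Map.Entry::getValue, Collectors.toList())));
    }

    // Reducing: sum the values collected for each key.
    static Map<String, Integer> reduce(Map<String, List<Integer>> grouped) {
        Map<String, Integer> counts = new TreeMap<>();
        grouped.forEach((word, ones) ->
                counts.put(word, ones.stream().mapToInt(Integer::intValue).sum()));
        return counts;
    }

    public static void main(String[] args) {
        // Splitting: each input line would go to a different mapper.
        List<String> input = List.of("Deer Bear River", "Car Car River", "Deer Car Bear");
        List<Map.Entry<String, Integer>> mapped = input.stream()
                .flatMap(line -> map(line).stream())
                .collect(Collectors.toList());
        System.out.println(reduce(shuffle(mapped)));
        // {Bear=2, Car=3, Deer=2, River=2}
    }
}
```

In real Hadoop, the shuffle phase is performed by the framework between the map and reduce tasks; here it is just an in-memory grouping.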
12. Master/slave Model
• The master node manages the population of individuals and assigns
them to slave nodes
• The slaves evaluate only the fitness function
• The master applies other genetic operators
• It is similar to a sequential Genetic Algorithm
[Diagram: a master node coordinating four slave nodes]
13. Master/slave MapReduce
1. The Driver (master node) generates the initial population
2. During each generation, it spreads the individuals across slave nodes
3. The slaves perform the fitness evaluation
4. The Driver applies the other genetic operators
[Diagram: in each generation, the Driver spreads individuals across mappers (Mapper 1 to Mapper n) for the fitness evaluation, then applies elitism, parents selection, crossover, mutation and survival selection]
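The four steps above can be sketched in plain Java, with worker threads standing in for the slave mappers. This is a simulation of the master/slave idea under a OneMax fitness function, not the elephant56 API; all names are illustrative:

```java
import java.util.*;
import java.util.concurrent.*;

public class MasterSlaveSketch {
    // OneMax fitness of a bit-string individual: the number of true bits.
    static int fitness(boolean[] individual) {
        int count = 0;
        for (boolean bit : individual)
            if (bit)
                count++;
        return count;
    }

    // The master submits each individual to a pool of "slaves" (worker
    // threads), which evaluate the fitness in parallel; the master then
    // collects the results in order.
    static int[] evaluateInParallel(List<boolean[]> population, int slaves)
            throws InterruptedException, ExecutionException {
        ExecutorService pool = Executors.newFixedThreadPool(slaves);
        List<Future<Integer>> futures = new ArrayList<>();
        for (boolean[] individual : population)
            futures.add(pool.submit(() -> fitness(individual)));
        int[] values = new int[population.size()];
        for (int i = 0; i < values.length; i++)
            values[i] = futures.get(i).get();
        pool.shutdown();
        return values;
    }

    public static void main(String[] args) throws Exception {
        List<boolean[]> population = List.of(
                new boolean[]{true, true, false},
                new boolean[]{false, false, false},
                new boolean[]{true, true, true});
        System.out.println(Arrays.toString(evaluateInParallel(population, 2)));
        // [2, 0, 3]
    }
}
```

The master keeps all other genetic operators to itself, exactly as in the model: only the (typically expensive) fitness evaluation is farmed out.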
15. Grid Model
• Applies genetic operators independently for each
neighborhood
• Improves the diversification of individuals during evolution
• Reduces the probability of premature convergence to local optima
16. Grid MapReduce
1. The neighborhoods are chosen randomly at each generation
2. The Driver splits the population into equal parts
3. The partitioner sends individuals to their assigned neighborhood
4. The reducers execute other genetic operators
[Diagram: in each generation, mappers (Mapper 1 to Mapper n) perform the fitness evaluation, a shuffle phase sends individuals to their assigned neighborhoods, and reducers (Reducer 1 to Reducer n) apply elitism, parents selection, crossover, mutation and survival selection]
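The random neighborhood assignment of steps 1 and 2 can be sketched as follows. This is an illustrative simulation, not the elephant56 partitioner: the round-robin split after shuffling and all names are assumptions:

```java
import java.util.*;

public class GridPartitionSketch {
    // Randomly partition a population into equally sized neighborhoods,
    // as happens at the beginning of each grid-model generation.
    static List<List<boolean[]>> partition(List<boolean[]> population,
                                           int neighborhoods, long seed) {
        List<boolean[]> shuffled = new ArrayList<>(population);
        Collections.shuffle(shuffled, new Random(seed)); // seeded for reproducibility
        List<List<boolean[]>> result = new ArrayList<>();
        for (int n = 0; n < neighborhoods; n++)
            result.add(new ArrayList<>());
        // Round-robin split: equal parts when the sizes divide evenly.
        for (int i = 0; i < shuffled.size(); i++)
            result.get(i % neighborhoods).add(shuffled.get(i));
        return result;
    }

    public static void main(String[] args) {
        List<boolean[]> population = new ArrayList<>();
        for (int i = 0; i < 8; i++)
            population.add(new boolean[]{i % 2 == 0});
        List<List<boolean[]>> groups = partition(population, 4, 42L);
        System.out.println(groups.size() + " neighborhoods of size " + groups.get(0).size());
        // 4 neighborhoods of size 2
    }
}
```

Reshuffling the neighborhoods every generation is what keeps individuals mixing across the grid, which is the source of the improved diversification mentioned above.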
18. Island Model
• The initial population is split into several groups (islands)
• The Genetic Algorithm is executed “independently” on each of them
• Periodically, individuals are exchanged between islands through migration
• Each island explores a different direction in the search space, while migration enhances diversity
19. Island MapReduce
1. The Driver spreads individuals across islands
2. Consecutive generations are executed on each island
3. Migration occurs periodically, performed by reducers
[Diagram: during each period of generations, mappers (Mapper 1 to Mapper n) run the full cycle of fitness evaluation, elitism, parents selection, crossover, mutation and survival selection on their islands; between periods, reducers (Reducer 1 to Reducer n) perform the migration]
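The migration step can be sketched in plain Java. The ring topology (each island sends its migrants to the next one) and the choice of which individuals migrate are assumptions for illustration, not the elephant56 migration strategy:

```java
import java.util.*;

public class IslandMigrationSketch {
    // Move the first `migrants` individuals of each island to the next
    // island in a ring, returning the new island populations.
    static List<List<String>> migrate(List<List<String>> islands, int migrants) {
        int n = islands.size();
        List<List<String>> next = new ArrayList<>();
        // Each island keeps the individuals that do not migrate.
        for (List<String> island : islands)
            next.add(new ArrayList<>(island.subList(migrants, island.size())));
        // Ring topology: island i sends its migrants to island i + 1.
        for (int i = 0; i < n; i++) {
            List<String> leaving = islands.get(i).subList(0, migrants);
            next.get((i + 1) % n).addAll(leaving);
        }
        return next;
    }

    public static void main(String[] args) {
        List<List<String>> islands = List.of(
                new ArrayList<>(List.of("a1", "a2", "a3")),
                new ArrayList<>(List.of("b1", "b2", "b3")));
        System.out.println(migrate(islands, 1));
        // [[a2, a3, b1], [b2, b3, a1]]
    }
}
```

Because every island keeps its population size constant (it sends and receives the same number of migrants), the overall population size is preserved across periods.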
23. Individual
public class BitStringIndividual extends Individual {
private List<Boolean> bits;
public BitStringIndividual(int size) {
bits = new ArrayList<>(Collections.nCopies(size, false));
}
public void set(int index, boolean value) {
bits.set(index, value);
}
public boolean get(int index) {
return bits.get(index);
}
public int size() {
return bits.size();
}
}
24. Fitness Value
public class IntegerFitnessValue extends FitnessValue {
private int number;
public IntegerFitnessValue(int value) {
number = value;
}
public int get() {
return number;
}
@Override
public int compareTo(FitnessValue other) {
if (other == null)
return 1;
Integer otherInteger = ((IntegerFitnessValue) other).get();
return Integer.compare(number, otherInteger);
}
}
25. Fitness Evaluation
public class OneMaxFitnessEvaluation
extends FitnessEvaluation<BitStringIndividual, IntegerFitnessValue> {
@Override
public IntegerFitnessValue evaluate(
IndividualWrapper<BitStringIndividual, IntegerFitnessValue> wrapper) {
BitStringIndividual individual = wrapper.getIndividual();
int count = 0;
for (int i = 0; i < individual.size(); i++)
if (individual.get(i))
count++;
return new IntegerFitnessValue(count);
}
}
26. Initialization
public class RandomBitStringInitialization
extends Initialization<BitStringIndividual, IntegerFitnessValue> {
public RandomBitStringInitialization(..., Properties userProperties, ...) {
individualSize = userProperties.getInt(INDIVIDUAL_SIZE_PROPERTY);
random = new Random();
}
@Override
public IndividualWrapper<BitStringIndividual, IntegerFitnessValue>
generateNextIndividual(int id) {
BitStringIndividual individual = new BitStringIndividual(individualSize);
for (int i = 0; i < individualSize; i++)
individual.set(i, random.nextInt(2) == 1);
return new IndividualWrapper<>(individual);
}
}
27. Crossover
public class BitStringSinglePointCrossover
extends Crossover<BitStringIndividual, IntegerFitnessValue> {
public BitStringSinglePointCrossover(...) {
random = new Random();
}
@Override
public List<IndividualWrapper<BitStringIndividual, IntegerFitnessValue>>
cross(IndividualWrapper<BitStringIndividual, IntegerFitnessValue> wrapper1,
IndividualWrapper<BitStringIndividual, IntegerFitnessValue> wrapper2, ...) {
BitStringIndividual parent1 = wrapper1.getIndividual();
BitStringIndividual parent2 = wrapper2.getIndividual();
int cutPoint = random.nextInt(parent1.size());
BitStringIndividual child1 = new BitStringIndividual(parent1.size());
BitStringIndividual child2 = new BitStringIndividual(parent1.size());
for (int i = 0; i < cutPoint; i++) {
child1.set(i, parent1.get(i));
child2.set(i, parent2.get(i));
}
for (int i = cutPoint; i < parent1.size(); i++) {
child1.set(i, parent2.get(i));
child2.set(i, parent1.get(i));
}
List<IndividualWrapper<BitStringIndividual, IntegerFitnessValue>> children = new ArrayList<>(2);
children.add(new IndividualWrapper<>(child1));
children.add(new IndividualWrapper<>(child2));
return children;
}
}
28. Mutation
public class BitStringMutation
extends Mutation<BitStringIndividual, IntegerFitnessValue> {
public BitStringMutation(...) {
mutationProbability =
userProperties.getDouble(MUTATION_PROBABILITY_PROPERTY);
random = new Random();
}
@Override
public IndividualWrapper<BitStringIndividual, IntegerFitnessValue> mutate(
IndividualWrapper<BitStringIndividual, IntegerFitnessValue> wrapper) {
BitStringIndividual individual = wrapper.getIndividual();
for (int i = 0; i < individual.size(); i++)
if (random.nextDouble() <= mutationProbability)
individual.set(i, !individual.get(i));
return wrapper;
}
}
29. App
public class App {
public static void main(String[] args) {
GlobalDriver driver = new GlobalDriver();
Properties userProperties = new Properties();
driver.setIndividualClass(BitStringIndividual.class);
driver.setFitnessValueClass(IntegerFitnessValue.class);
driver.setInitializationClass(RandomBitStringInitialization.class);
driver.setInitializationPopulationSize(POPULATION_SIZE);
userProperties.setInt(INDIVIDUAL_SIZE_PROPERTY, INDIVIDUAL_SIZE);
driver.setElitismClass(BestIndividualsElitism.class);
driver.activateElitism(true);
userProperties.setInt(NUMBER_OF_ELITISTS_PROPERTY, NUMBER_OF_ELITISTS);
driver.setParentsSelectionClass(RouletteWheelParentsSelection.class);
driver.setCrossoverClass(BitStringSinglePointCrossover.class);
driver.setSurvivalSelectionClass(RouletteWheelSurvivalSelection.class);
driver.activateSurvivalSelection(true);
driver.setUserProperties(userProperties);
driver.run();
}
}