15-618 Parallel Computer Architecture and Programming 
Meta Machine Learning: Hyperparameter optimization 
https://pbollimp.github.io/Meta-Machine-Learning/ 
Priyatham Bollimpalli (pbollimp) 
Mohit Deep Singh (mohitdes) 
 
Abstract
In this project, we implemented various hyperparameter optimization algorithms, primarily for 
deep neural networks (multi-layer perceptrons). We used OpenMP as the baseline for grid 
search, random search, and a simple evolutionary algorithm. We then implemented three 
variants of evolutionary algorithms of our own using pthreads that not only give up to 9x 
speedups on 16 cores and scale well with increasing numbers of threads, but also explore a 
much larger hyperparameter space and find hyperparameters that give good accuracy much 
faster than the baselines. Even though our experiments are specific to MLPs, we are confident 
that these methods apply to hyperparameter optimization for any machine learning problem. 
We also performed an extensive breakdown of the execution times for our algorithms and 
reasoned about the performance of each. 
 
Background
The task of Hyperparameter search  
In the recent past, deep learning has been incredibly successful at tasks such as image 
classification and tracking, machine translation, and speech recognition. One of the biggest 
challenges faced by machine learning researchers today (especially in deep learning) is 
selecting hyperparameters for their models. A hyperparameter is a parameter whose value is set 
before the learning process starts. In contrast, model parameters are optimized as part of the 
learning process. 
 
Hyperparameters usually have a huge impact on how long a model takes to train and how well 
it ultimately performs. Because they are so often critical to model performance, researchers 
need to find good values for them. In practice today, most researchers manually tweak the 
hyperparameters and train the networks to see which settings give the best results. This process 
is extremely tedious and time consuming, and automating it is called hyperparameter 
optimization. The main approaches that exist today include: 
 
1. Grid Search: This involves exhaustively searching through the space of hyperparameters, 
evaluating how each set of hyperparameters performs in training, and choosing the best 
combination. This approach is extremely heavyweight, since it performs a brute-force 
search over the space of hyperparameters. The advantage is that it parallelizes well, and 
given enough resources (and good intuition), one usually ends up with the best set of 
hyperparameters. 
2. Random Search: This approach randomly samples hyperparameter sets from the search 
space and keeps searching until it reaches the desired accuracy. It is very successful in 
practice, since it does not exhaustively search the entire space and still gets good results. 
3. Bayesian Optimization: This approach uses the hyperparameter sets already tested to 
learn where to sample better sets to evaluate next. Evolutionary search and Gaussian 
processes are the most common Bayesian optimization techniques. The biggest challenge 
is that the optimization step makes these methods inherently sequential and harder to 
parallelize. 
Evolutionary Search Algorithm 
In the artificial intelligence literature, an evolutionary search algorithm is a generic 
population-based metaheuristic optimization algorithm. An evolutionary strategy uses methods 
inspired by evolution, such as mutation, reproduction, recombination, and selection. The 
high-level idea is that, given a population of individuals, we have a fitness function to rank 
which individuals are the best performers. Using this function, we rank the individuals, kill a 
certain fraction of the weakest part of the population, and have the stronger part of the 
population reproduce and mutate to arrive at better solutions. 
 
In our case, individual hyperparameter sets are the candidates in a population, and the 
accuracy of a neural network trained with those hyperparameters is our fitness function. The 
algorithm's inputs are the population size, an initial random population of candidates, and the 
mutation and crossover mechanisms. The output is the candidate with the highest accuracy 
(fitness). Mutations can vary heavily with the application. In our particular case, we randomly 
chose 3 candidates from the pool of the strongest candidates, took the average of their 
hyperparameters, and added some noise to encourage the algorithm to explore more. 
Initial Analysis: Workload and Data Structures 
The main components of our program are the neural network (training and inference), the 
dataset the network is trained on, and the set of hyperparameters we are searching over. In the 
algorithms described above, the computationally expensive part is training and evaluating 
multiple neural networks to get the accuracies of different parameter sets. Parallelizing this 
gives us huge benefits, since we can test the fitness of multiple candidates in parallel. 
There are a few dependencies in this program. First, for evolutionary search algorithms, we 
want to search the hyperparameter space carefully, regularly using past results to restrict the 
future search space and get better results. This requires us to frequently synchronize all parallel 
threads, evaluate the current population, rank it, and mutate it. Furthermore, depending on 
which parameters we are searching and what their values are, we can end up with very skewed 
workloads, where some hyperparameter settings take much longer to train and evaluate than 
others. These algorithms therefore benefit greatly from data parallelism in the first few 
approaches, where we evaluate the fitness of the population in parallel, synchronize, and then 
parallelize the mutations as well. 
The key data structures are the neural networks, which we cannot share (since we need to train 
each network individually through backpropagation to evaluate how well its hyperparameters 
perform), and the array of all candidates of a population (or of multiple populations). The main 
operations on this array are updating accuracies, ranking candidates against each other, killing 
the weakest candidates, and mutating and reproducing the strongest ones to produce new 
candidates. 
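For concreteness, a minimal sketch of what such a candidate array and its ranking/culling operations might look like is shown below. The field names and helpers (Candidate, Population, rank_and_cull) are illustrative, not taken from the actual implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// One candidate hyperparameter set plus its measured fitness (illustrative fields).
struct Candidate {
    int hidden_units;      // example hyperparameters being searched over
    int num_layers;
    double learning_rate;
    double fitness;        // validation accuracy, filled in after training
};

using Population = std::vector<Candidate>;

// Rank candidates by fitness (descending) and discard the weakest k.
void rank_and_cull(Population &pop, std::size_t k) {
    std::sort(pop.begin(), pop.end(),
              [](const Candidate &a, const Candidate &b) {
                  return a.fitness > b.fitness;
              });
    if (k < pop.size()) pop.resize(pop.size() - k);
}
```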
Main Challenge 
 
Hyperparameter search is a very tedious task, especially since we need to train multiple neural 
networks from scratch with different hyperparameter configurations. As networks become more 
complex and the number of hyperparameters grows, the task becomes harder still: the search 
space increases exponentially, and the more applicable methods, such as Bayesian optimization 
techniques, are difficult to parallelize. 
The main challenge in parallelizing these algorithms was not just speeding them up, but also 
striking the right tradeoff between accuracy and how quickly and effectively we can search the 
hyperparameter space. Any Bayesian optimization has an inherently sequential part, where we 
use past results to come up with better estimates of the hyperparameters. We first tried 
parallelizing the algorithms in phases. We then aimed to optimize that sequential part as well, 
trading off the quality of the Bayesian optimization for a faster, albeit slightly less accurate, 
search. 
 
Approach 
In this section, we describe the approaches and techniques we used to parallelize the algorithms. 
Grid Search (GS) 
The high-level algorithm is as follows: 
 
For grid search, we loop through all possible configurations of hyperparameters. In our case, the 
fitness function trains a neural network and measures its accuracy on the validation set. 
Parallelizing grid search was mostly straightforward, other than lines 5 and 6 of the algorithm. 
We created an array that keeps track of the accuracies of all configurations and parallelized the 
fitness computation for each possible configuration. At the end, we take the maximum accuracy 
and select that configuration. Since this does not scale well with the size of the hyperparameter 
space, we broke the parameter space into chunks, computed the fitness of each chunk in 
parallel, and incrementally kept track of the most accurate configuration. 
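A minimal sketch of this parallel grid search is shown below, assuming a hypothetical train_and_evaluate() fitness function and the Candidate type from the earlier sketch; the chunked variant simply calls this on one slice of the grid at a time.

```cpp
#include <omp.h>
#include <cstddef>
#include <vector>

// Hypothetical fitness function: trains an MLP with configuration c and
// returns its accuracy on the validation set.
double train_and_evaluate(const Candidate &c);

// Evaluate every configuration in parallel, then reduce to the best one.
// Dynamic scheduling helps because configurations with large hidden layers
// train much more slowly than small ones.
Candidate grid_search(const std::vector<Candidate> &grid) {
    std::vector<double> acc(grid.size(), 0.0);

    #pragma omp parallel for schedule(dynamic)
    for (long i = 0; i < (long)grid.size(); ++i)
        acc[i] = train_and_evaluate(grid[i]);

    // Sequential reduction: pick the configuration with the highest accuracy.
    std::size_t best = 0;
    for (std::size_t i = 1; i < grid.size(); ++i)
        if (acc[i] > acc[best]) best = i;

    Candidate result = grid[best];
    result.fitness = acc[best];
    return result;
}
```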
Random Search (RS)
The high-level algorithm is as follows:
As can be seen in the algorithm, random search randomly samples hyperparameters from their 
search space and tests them for fitness. It keeps track of the best parameter configuration at 
any given point, and once the fitness exceeds a threshold, it stops searching the parameter 
space. Parallelizing this was not entirely straightforward; our strategy was to create batches of 
threads that randomly sample from the hyperparameter space and report their accuracies to an 
array in global memory. Every few chunks, the master thread reads the best accuracy from this 
array, and if it exceeds the threshold it stops the program; otherwise it keeps spawning new 
blocks of threads to evaluate other random configurations. In practice this is always much faster 
than grid search, and the parallelized version usually explores the space more efficiently and 
reaches the desired accuracy quite quickly. 
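A sketch of this batched parallel random search is given below, again with hypothetical sample_random_candidate() and train_and_evaluate() helpers; the batch size and seeding scheme are illustrative.

```cpp
#include <omp.h>
#include <random>
#include <vector>

// Hypothetical helpers assumed for this sketch: sample_random_candidate()
// draws a configuration uniformly from the search space, and
// train_and_evaluate() returns its validation accuracy.
Candidate sample_random_candidate(std::mt19937 &rng);
double train_and_evaluate(const Candidate &c);

Candidate random_search(double threshold, int batch_size, int max_batches) {
    Candidate best{};          // best configuration seen so far (fitness starts at 0)

    for (int b = 0; b < max_batches && best.fitness < threshold; ++b) {
        std::vector<Candidate> batch(batch_size);

        // Each thread samples and evaluates candidates independently and
        // writes the results into its own slots of the shared batch array.
        #pragma omp parallel
        {
            std::mt19937 rng(1234u * (b + 1) + omp_get_thread_num());
            #pragma omp for schedule(dynamic)
            for (int i = 0; i < batch_size; ++i) {
                batch[i] = sample_random_candidate(rng);
                batch[i].fitness = train_and_evaluate(batch[i]);
            }
        }

        // The master checks the batch and stops once the threshold is hit.
        for (const Candidate &c : batch)
            if (c.fitness > best.fitness) best = c;
    }
    return best;
}
```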
 
Evolutionary Search - Basic Algorithm
 
As can be seen in the algorithm, basic evolutionary search leverages past evaluations of the 
fitness function on explored parameters to optimize over the search space more effectively. 
The idea is the following. We start with a population of candidate hyperparameter sets and 
evaluate all of them. After evaluating them, we rank all candidates in the population by fitness 
and kill/discard the last k (least fit) candidates. We then generate new children using the 
top-performing hyperparameters. Our mutation works as follows: we randomly sample 3 of the 
top-performing hyperparameter sets and take their average, then add uniform noise to these 
averages to encourage the algorithm to explore. 
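The sketch below shows what one such generation might look like, building on the Candidate, Population, and rank_and_cull definitions from the earlier sketch; the noise range is illustrative.

```cpp
#include <cstddef>
#include <random>

// One generation of the basic evolutionary search: rank, cull the weakest k,
// then refill the population by averaging three random survivors and adding
// uniform noise (the mutation rule described in the text).
void evolve_one_generation(Population &pop, std::size_t kill_k,
                           std::mt19937 &rng) {
    rank_and_cull(pop, kill_k);          // sort by fitness, drop the weakest k

    // pick selects only among the survivors (the first pop.size() entries).
    std::uniform_int_distribution<std::size_t> pick(0, pop.size() - 1);
    std::uniform_real_distribution<double> noise(-0.1, 0.1);

    for (std::size_t i = 0; i < kill_k; ++i) {
        const Candidate &a = pop[pick(rng)];
        const Candidate &b = pop[pick(rng)];
        const Candidate &c = pop[pick(rng)];
        Candidate child;
        child.hidden_units  = (a.hidden_units + b.hidden_units + c.hidden_units) / 3;
        child.num_layers    = (a.num_layers + b.num_layers + c.num_layers) / 3;
        child.learning_rate = (a.learning_rate + b.learning_rate + c.learning_rate) / 3.0
                              * (1.0 + noise(rng));   // noise encourages exploration
        child.fitness = 0.0;                          // evaluated in the next generation
        pop.push_back(child);
    }
}
```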
Parallel Evolutionary Search 1 (EV1) 
Our first approach was to parallelize the basic evolutionary search algorithm, starting with the 
evaluation of the fitness functions of the individual candidates. We parallelized this with 
OpenMP, creating threads to divide up the work. We noticed that with certain hyperparameters, 
such as the number of hidden units, workload imbalance was very high, and OpenMP static 
scheduling did not give us good speedups. We therefore switched to dynamic scheduling, and 
despite the remaining sequential code (which ranks the population and mutates it), we got 
decent speedups (discussed in the results section). 
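The core of EV1's parallel fitness evaluation can be sketched as follows, reusing the Population type and the hypothetical train_and_evaluate() from the earlier sketches.

```cpp
#include <omp.h>

// EV1 fitness evaluation sketch: the candidates of one population are trained
// and scored in parallel. schedule(dynamic) absorbs the workload imbalance,
// e.g. candidates with 512 hidden units taking far longer than those with 32.
void evaluate_population(Population &pop) {
    #pragma omp parallel for schedule(dynamic)
    for (long i = 0; i < (long)pop.size(); ++i)
        pop[i].fitness = train_and_evaluate(pop[i]);
}
```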
Evolutionary Search with Islands - Our adaptation of the basic algorithm 
The earlier algorithm does well as long as the dimensionality of the hyperparameter space is not 
too high, but it is not ideal when we search over many hyperparameters. Because of random 
sampling, if it does not start with good enough points, basic evolutionary search can get stuck in 
local optima. It helps to create multiple local evolutionary searches with specific starting points 
and have them communicate their best parameters to each other after a few generations of 
local optimization. We call this variant the "island" approach: we have islands of local 
populations that intermingle every few iterations. 
The basic idea is that we keep n different "islands", each with its own population of candidate 
hyperparameters. Each island separately carries out evolutionary search on its candidate 
population for a few generations; that is, each island evaluates its candidates, ranks them within 
its population, and kills and repopulates the weakest members. After a few iterations, we take 
the top-k candidates from each island, merge them, and apply a global evolution step: the 
merged candidates are sorted, and the weakest in this global population are killed and 
repopulated using the mutations explained in the evolutionary search section. After this, we 
reseed the local populations with these globally mutated seeds, plus some random seeds, so 
that the search explores the high-dimensional space better. 
 
 
We explain how we parallelized this algorithm in the next few sections. 
Parallel Evolutionary Search 2 (EV2) 
We wanted to parallelize the island approach so we could scale better and search dense 
hyperparameter spaces more efficiently. We decided to use a fork-join model with shared 
memory to implement this parallel approach. 
The idea is that the global top candidates are stored in shared memory. The master spawns the 
requisite number of threads, and every thread acts as an island. Every island reads its chunk of 
"seed" candidates from the global top candidates (which the master shuffles after mutation). 
Each thread then runs a certain number of iterations of the evolutionary search algorithm 
locally. After running a few local generations, each island reports its top-k candidates to the 
global array. The master waits for all threads to complete their local "generations" using join. 
Once every thread finishes, the master sorts the reported local winners in the global top array 
and applies a "global evolution" to it. The master then shuffles this global array and spawns 
local threads again to run more iterations of the local evolutionary searches. 
We chose to parallelize over islands rather than over individual candidates, so that it is easier to 
do a global intermingling of the best candidates. 
Parallel Evolutionary Search 3 (EV3) 
Independently of EV2, we also tried a threading model where, instead of using join every time, 
we use a pool of threads with shared variables and mutexes for synchronization between them. 
The idea was to compare the performance of the two and see which would do better. 
 
We used a similar model for splitting the islands between the threads. We have a shared count 
variable, a mutex per thread, and a mutex for the count variable. The master thread creates a 
thread pool of the specified size and sets the count variable to the number of threads. Each 
thread works in an infinite loop: at every iteration it locks its respective mutex, reads the 
requisite global "seed" candidates, and runs its local evolutionary search. When an individual 
thread is done, it decrements the count variable by one and loops back to try grabbing its own 
lock again (where it waits). The master thread waits for the count variable to reach 0; when it 
does, it sorts the global tops array, applies the global mutations, shuffles the array, and resets 
the count variable to the number of threads. The master then unlocks the respective mutex for 
each thread so they can start running again. This runs until either the master thread finds an 
accuracy above the threshold or an upper limit of iterations is reached. Once the stopping 
condition is hit, the master signals the other threads to quit through a global variable. 
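The synchronization structure can be sketched as follows. This is an approximation rather than the exact code: each per-thread mutex is paired with a condition variable so the master/worker hand-off is race-free, and the evolutionary-search work itself is elided.

```cpp
#include <pthread.h>

enum { NUM_THREADS = 16 };

pthread_mutex_t worker_mtx[NUM_THREADS];
pthread_cond_t  worker_cv[NUM_THREADS];
bool            go[NUM_THREADS];        // master sets this to start a round
bool            quit[NUM_THREADS];      // master sets this at the stopping condition

pthread_mutex_t count_mtx = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  count_cv  = PTHREAD_COND_INITIALIZER;
int             pending   = 0;          // workers still running this round

void *ev3_worker(void *arg) {
    long id = (long)arg;                // thread index passed through the void*
    while (true) {
        pthread_mutex_lock(&worker_mtx[id]);
        while (!go[id] && !quit[id])
            pthread_cond_wait(&worker_cv[id], &worker_mtx[id]);
        bool quitting = quit[id];
        go[id] = false;
        pthread_mutex_unlock(&worker_mtx[id]);
        if (quitting) break;

        // ... read this island's seed candidates from the global array and
        //     run a few generations of local evolutionary search ...

        pthread_mutex_lock(&count_mtx);
        if (--pending == 0) pthread_cond_signal(&count_cv);  // last worker wakes the master
        pthread_mutex_unlock(&count_mtx);
    }
    return nullptr;
}

// Master side of one global round: reset the count, release every worker,
// then wait until all of them report completion before sorting, mutating,
// and shuffling the global array.
void ev3_master_round() {
    pthread_mutex_lock(&count_mtx);
    pending = NUM_THREADS;
    pthread_mutex_unlock(&count_mtx);
    for (long i = 0; i < NUM_THREADS; ++i) {
        pthread_mutex_lock(&worker_mtx[i]);
        go[i] = true;
        pthread_cond_signal(&worker_cv[i]);
        pthread_mutex_unlock(&worker_mtx[i]);
    }
    pthread_mutex_lock(&count_mtx);
    while (pending > 0) pthread_cond_wait(&count_cv, &count_mtx);
    pthread_mutex_unlock(&count_mtx);
    // ... sort the global tops array, apply the global mutation, shuffle ...
}

// Pool creation: initialize the per-thread state and spawn the workers once.
void ev3_start_pool(pthread_t tids[NUM_THREADS]) {
    for (long i = 0; i < NUM_THREADS; ++i) {
        pthread_mutex_init(&worker_mtx[i], nullptr);
        pthread_cond_init(&worker_cv[i], nullptr);
        go[i] = quit[i] = false;
        pthread_create(&tids[i], nullptr, ev3_worker, (void *)i);
    }
}
```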
 
The motivation was that pthread forks and joins may have extra overhead, and using mutexes 
might avoid it. As discussed in the results section, this ended up being slower than EV2 for 
smaller numbers of threads, mostly because of the synchronization overhead from the mutexes 
and shared variables. At 16 threads, however, it outperformed EV2, because of EV2's overhead 
of spawning new threads every iteration. 
 
Finally, we wanted to reduce the time the island program spends in its sequential section. The 
approach we took is described below. 
 
Parallel Evolutionary Search 4 (EV4) 
In the last couple of approaches we noticed a few things. First, the workload distribution 
sometimes ends up skewed, causing some threads to sit idle waiting for other threads to catch 
up. Second, the global evolution happens sequentially, which is a definite bottleneck for further 
speedups. 
 
For this, we went back to the drawing board and realized that the whole point of the island 
approach is to have different local evolutionary populations communicate and intermingle, 
encouraging more random exploration and ultimately better evolution. We realized that this 
communication does not need to be exact (or precisely timed), and we could do it 
asynchronously. We built this on top of EV2 but removed the synchronization, so every thread, 
including the master, works exclusively on its local population. We created three shared arrays 
of global populations. The first is the "read array", where the master places the initial "seed" for 
all islands. The second is the "write array", where all threads report the best candidates from 
their populations. The third is a buffer array, used for intermediate computation. 
 
The master first initializes the "read array" and then spawns all other threads. Each thread runs 
the local evolutionary search algorithm until either the master signals it to stop (through a 
shared variable) or it hits a certain number of iterations. The master also runs a local 
evolutionary search, but every few iterations it performs the global sorting and mutation step 
without blocking the other threads: it copies the "write array" results to the buffer, sorts the 
buffer, applies the global evolutionary step to it, and shuffles it. It then swaps the pointers of the 
read array and the buffer, so the threads now read the newly evolved array. This lets all threads 
operate asynchronously; the only drawback is that the intermingling does not happen at exactly 
the same time for all threads. This is acceptable, because the main goal is to exchange 
information between islands, and making this asynchronous did not cost us much accuracy. 
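A sketch of this three-array hand-off is shown below, reusing the earlier Candidate/Population/evolve_one_generation sketches. The names and the atomic pointer swap are illustrative, and reclamation of the old read buffer is deliberately glossed over (a full implementation must ensure a retired buffer is not overwritten while a worker may still be copying from it).

```cpp
#include <algorithm>
#include <atomic>
#include <cstddef>
#include <random>
#include <vector>

// Three shared population arrays: workers copy seeds from the read array and
// publish their winners to the write array; the master periodically snapshots
// the write array into the buffer, evolves it, and swaps the read pointer.
std::vector<Candidate> storage_a, storage_b, storage_c;
std::atomic<std::vector<Candidate> *> read_array{&storage_a};
std::vector<Candidate> *write_array = &storage_b;
std::vector<Candidate> *buffer      = &storage_c;

// Worker side: grab the current seeds without blocking. They may be slightly
// stale, which is the freshness tradeoff accepted in the text.
Population take_seeds() { return *read_array.load(); }

// Master side, run every few of its own local generations, without blocking
// the worker threads.
void master_global_step(std::mt19937 &rng, std::size_t kill_k) {
    *buffer = *write_array;                       // snapshot the reported winners
    evolve_one_generation(*buffer, kill_k, rng);  // sort, cull, and mutate (earlier sketch)
    std::shuffle(buffer->begin(), buffer->end(), rng);
    buffer = read_array.exchange(buffer);         // publish; reuse the old read array next time
}
```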
 
The only problem we faced was that the master thread now had extra work, so it lagged severely 
behind, causing the other threads to work on stale data for much longer. As a fix, we significantly 
reduced the population size of the master's local evolutionary search; this way, the master was 
always either ahead of the other threads (which did us no harm) or at par with them. We kept 
executing the master until either it hit the accuracy threshold or some other thread exited, which 
signaled all other threads to quit via a shared variable. This way, we completely removed 
sequential code from our algorithm. 
 
We did trade off a little freshness in the global evolution, but since the local evolutions still follow 
the process correctly, this did not cause a big drop in accuracy. We just wanted information to 
flow between threads so that the search explores a larger space and does not get stuck in local 
optima, which this model achieves. It gave us further speedups, as shown in the results section. 
Parallel Evolutionary Search 5 (EV5)
For this approach we used the same model as EV4, but we also parallelized the part where the 
master sorts and evolves the global array, using an OpenMP parallel for loop over the evolutions 
carried out by the master thread. 
 
In addition, we padded the three global arrays to cache-line size to prevent false sharing. This is 
our best-performing model and gave us the most speedup. 
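A sketch of the padding and of the parallelized global mutation is given below. The 64-byte line size is an assumption (it holds on the Xeon instances used here), and the struct and function names are illustrative; the mutation rule is the same as in the serial sketch.

```cpp
#include <omp.h>
#include <cstddef>
#include <random>
#include <vector>

// Pad each slot of the shared arrays to a cache-line boundary so that threads
// writing neighbouring slots do not false-share a line.
struct alignas(64) PaddedCandidate {
    Candidate c;     // alignas rounds sizeof(PaddedCandidate) up to 64 bytes
};

// Parallel version of the master's global mutation: each discarded slot is
// refilled by a different OpenMP thread, averaging three surviving candidates
// and adding noise. The survivors are assumed to occupy the front of the
// already-sorted array.
void parallel_global_mutation(std::vector<PaddedCandidate> &global,
                              std::size_t survivors, unsigned base_seed) {
    #pragma omp parallel for schedule(static)
    for (long i = (long)survivors; i < (long)global.size(); ++i) {
        std::mt19937 rng(base_seed + (unsigned)i);      // per-slot RNG, nothing shared
        std::uniform_int_distribution<std::size_t> pick(0, survivors - 1);
        std::uniform_real_distribution<double> noise(-0.1, 0.1);
        const Candidate &a = global[pick(rng)].c;
        const Candidate &b = global[pick(rng)].c;
        const Candidate &d = global[pick(rng)].c;
        Candidate child;
        child.hidden_units  = (a.hidden_units + b.hidden_units + d.hidden_units) / 3;
        child.num_layers    = (a.num_layers + b.num_layers + d.num_layers) / 3;
        child.learning_rate = (a.learning_rate + b.learning_rate + d.learning_rate) / 3.0
                              * (1.0 + noise(rng));
        child.fitness = 0.0;
        global[i].c = child;            // each write lands on its own cache line
    }
}
```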
Experimental Setup
Tiny DNN 
Our initial approach was to use a highly optimized neural network library to run 
hyperparameter search on convolutional neural networks. We wanted to stick with C/C++ for 
our implementation, since it is much easier to parallelize than Python (which has much better 
frameworks for deep learning). 
 
We found a library called tiny-dnn, which provides a high-level abstraction of neural networks. It 
is a highly optimized, header-only library, but it requires C++14, which prevented us from 
running it on the GHC machines. We implemented most of our techniques on top of tiny-dnn, 
but when we ran tests with multiple threads, we saw no performance improvements. We 
hypothesize that tiny-dnn performs its own optimizations with OpenMP and pthreads under the 
hood, and these clashed with our parallelization; the library was consuming most of the 
resources on the limited machines we used for our experiments. We could not profile it because 
we could not run it on GHC (due to C++14) and AWS blocked proper profiling on its instances. At 
this point, we decided to abandon this library and use a simpler DNN (a multi-layer perceptron) 
for training. 
MLP  
We eventually decided to either use an existing simple multi-layer perceptron codebase or 
reimplement our own version of a simple multi-layer perceptron with backpropagation. We 
ended up taking some inspiration and code from [2] and reimplemented parts of it to fit our 
needs. 
 
This meant we could not test our algorithms on very complicated networks (just MLPs), but 
given our resource constraints and how long even a single complicated network takes to train, 
it would have taken considerable time and resources to write the C++ code and train such 
networks to good accuracy. So we scoped the problem down to multi-layer perceptrons, which 
were easier to implement and adapt for hyperparameter search. We still spent a considerable 
amount of time getting a simple MLP working in C++, so we decided to focus the rest of our 
time on parallelizing hyperparameter search rather than building more complicated networks 
such as convolutional networks. 
Datasets 
We primarily tested our results on two datasets. We used the Iris dataset first because it is a 
rather small classification dataset (150 points, 3 labels), which made it very quick to train 
multiple networks in parallel with different hyperparameters to test speedups and results. Once 
we verified our approaches and got considerable speedups with Iris, we switched to the MNIST 
dataset, which is considerably larger (60,000 images). We initially wanted to report all results on 
CIFAR-10, but since we resorted to using MLPs, it was very hard to get any substantial accuracy 
on CIFAR, and that was throwing our hyperparameter search off. 
 
Machine Specifications 
All experiments were run on AWS c5.4xlarge instances, which have Intel Xeon Platinum 
8000-series processors with 16 virtual cores and 32 GB of memory. We also tested on the GHC 
machines, which have eight-core 3.2 GHz Intel Core i7 processors, and we observed similar 
speedups. 
Results and Analysis
Speedups 
Wall Clock Times 
The wall-clock times for the different algorithms are given here. Observations: 
1. GS, RS, and EV1 are faster because the parameter space they search is much smaller than 
for the other algorithms. We deliberately restricted them so we could measure speedups 
without wasting time on large hyperparameter spaces. 
2. EV3 takes the longest, since its synchronization cost is much higher than the other 
techniques'. 
3. For random search, we report the average of three runs; since the algorithm stops when it 
reaches a given accuracy threshold (75% for MNIST and 97% for Iris), its total execution 
time is much lower. 
4. EV4's execution time is low because it has near-zero synchronization cost. 
5. EV5's execution time is the lowest among EV2-EV5, since it benefits the most from 
parallelization while maintaining the same search space. 
Relative Speedups 
 
The relative speedups of the different algorithms on both datasets are given here. Observations: 
1. RS does not scale beyond a point: the time needed to reach a minimum good accuracy is 
already achieved with a certain number of threads (8 in this case), and more threads do 
not improve the execution time once we hit the steady state in the expected number of 
iterations to reach the accuracy threshold. 
2. EV3, EV4, and EV5 scale similarly, since their parallelization technique is similar. EV5 scales 
best, since it has no synchronization overhead. 
3. As the number of threads increases, the overhead of repeatedly creating pthreads grows 
for EV2, so its speedup at the end is not as good as the others'. EV3 creates a pool of 
threads once at the start and uses mutexes to synchronize; it did not achieve perfect 
speedup either, but it did better than EV2. 
4. EV1 does not scale well: contention between threads increases because they operate on 
the same array, leading to false sharing, and there is extra synchronization overhead due 
to some workload imbalance (explained in the next section). Moreover, the search space 
(the amount of work) is small, so more threads do not give ideal speedups. 
5. EV5 scales the best since it has no synchronization (building on EV4); additionally, we 
padded the shared structures to reduce false sharing and parallelized the global evolution 
on top of the parallel parameter search, which gives the best efficiency at the end. 
 
To summarize, the overhead of creating threads, the synchronization cost, the partly sequential 
work inherent in each algorithm (needed to reconcile results and pick the best 
hyperparameters), and other costs (like initial data loading) prevent us from achieving perfect 
speedups. This is shown in more detail for EV2, EV3, and EV4 in the 'Normalized Execution Time 
Breakdown' section. 
Scaling with different parameters  
This section gives the relative change in execution time within each algorithm. The experiments 
use the best execution time achieved with 16 threads for each algorithm. 
Dataset size
The relative execution time (degradation on this scale) when scaling the dataset size is shown 
here. Note that the dataset sizes are increased exponentially, so an exponential slowdown is 
observed (except for RS, since the time to reach the accuracy threshold becomes almost 
constant after a certain amount of data). Since the data is passed through the neural network 
regardless of the algorithm, we should observe the same proportional change in execution 
time: the forward pass, backward pass, and testing (on 10% of the dataset) all scale in exactly 
the same way for all the algorithms, which is why the same trend is observed. 
Number of Initial Hyperparameters
The relative execution time (degradation on this scale) when scaling the hyperparameter search 
size is shown here. There are several interesting observations. Note that the number of 
hyperparameter settings is also changed exponentially. 
1. Grid search is an exhaustive search, so an exponential increase in the hyperparameters 
gives an exponential increase in execution time. 
2. When the space of random hyperparameter sampling is larger, random search tends to 
find a good model and reach its accuracy threshold more quickly, so its runtime actually 
decreases. 
3. The whole idea of evolutionary search is to retain only the best hyperparameters. Even 
though we sample from a large hyperparameter space, we discard the bad candidates and 
only operate on the best ones, so the execution time stays roughly constant, with a slight 
increase due to the higher sampling cost over a larger range. 
Size of Neural Network
The relative execution time (degradation on this scale) when scaling the size of the neural 
network is shown here. As with the dataset-size experiment, the cost is the same across all the 
algorithms: training and testing times increase proportionally, and since the depth of the 
network increases exponentially from 1 to 8, so do the relative execution times. 
Workload distribution between the threads in the algorithms 
The primary difference between EV2 and EV5 is that in EV2 the pthreads are forked and joined 
repeatedly. A major problem with this is the unequal execution times between threads: for 
instance, a hidden layer of 512 units takes much longer to train and test than one of 32 units, 
which leaves several threads idle waiting for the long-running ones. In EV5, since the algorithm 
is modified to remove synchronization, all threads stay busy and still obtain the best result. In 
the figure above, for the 16-thread run, the 'blue' bars from EV5 are better distributed than the 
'orange' ones from EV2. This is also confirmed by the speedup and wall-clock time graphs, in 
which EV5 is shown to be the best. 
 
Note that all the results for GS, RS, and EV1 are reported with OpenMP dynamic scheduling, 
since a similar trend is observed. 
Normalized Execution Time Breakdown
For each algorithm, a detailed normalized execution time breakdown is given below. The 
relative trends and speedups in execution time were already explained above, but this graph 
shows exactly why they occur. 
EV3 has much more synchronization cost associated with it, since the threads stall far more 
often to acquire the mutexes. This cost adds up and grows quickly as the number of threads 
increases. 
EV2, in contrast, has some synchronization cost during the 'join' part of the pthreads, but it is 
almost constant, since the master executes a fixed number of joins (specific to the algorithm). 
For EV4 (and EV5, which is built on top of EV4), the lack of synchronization time makes them 
very efficient. 
The algorithms have a similar proportion of total parallel work (shown in orange) across threads, 
but we see speedups because this work is divided between the threads, and each thread has 
less to do as we increase the thread count. 
Moreover, we can see that the inherently sequential part of the algorithms prevents us from 
achieving perfect speedup. 
Accuracy vs. Time
For this chart, we set a threshold for each algorithm to stop searching the parameter space as 
soon as it hits 75% accuracy (a good accuracy on MNIST with an MLP). We had significantly 
constrained the parameter space for grid search and random search to prevent long runtimes, 
so these numbers are the proportion of time taken to reach the required accuracy relative to 
the total time taken (for an iteration-bounded or exhaustive search for that particular 
algorithm). 
 
As we can see, grid search needs to sweep all parameters before it can check the accuracies, so 
it takes the most time. Random search checks the accuracy threshold after every block of 
parallel execution, and it reaches 75% accuracy in about 20% of the time a longer sweep would 
take, with similar results. 
 
EV4 takes the least time, since it searches asynchronously while still sharing information across 
threads (islands). EV2 and EV3 are very close, and the difference can even be put down to the 
random initialization of the populations, but the point is that they reach a good solution 
significantly faster than grid or random search. EV3 is slightly higher due to its synchronization 
cost (explained in the previous section). 
Credit Distribution
The work was split equally between both partners. We worked together on figuring out the 
approaches, setting up the frameworks, and getting a basic version of the project working. We 
split the implemented algorithms equally and contributed equally to the analysis and the 
report. 
References
[1] https://github.com/tiny-dnn
[2] https://github.com/davidalbertonogueira/MLP/blob/master/README.md
[3] https://deepmind.com/blog/population-based-training-neural-networks/
[4] http://geco.mines.edu/workshop/aug2011/05fri/parallelGA.pdf
[5] https://blog.floydhub.com/guide-to-hyperparameters-search-for-deep-learning-models/
Two images were taken from:
[6] https://www.slideshare.net/alirezaandalib77/evolutionary-algorithms-69187390
[7] https://www.slideshare.net/alirezaandalib77/evolutionary-algorithms-69187390

 
The Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdfThe Benefits and Techniques of Trenchless Pipe Repair.pdf
The Benefits and Techniques of Trenchless Pipe Repair.pdf
 
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptxCFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
CFD Simulation of By-pass Flow in a HRSG module by R&R Consult.pptx
 
Vaccine management system project report documentation..pdf
Vaccine management system project report documentation..pdfVaccine management system project report documentation..pdf
Vaccine management system project report documentation..pdf
 
CME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional ElectiveCME397 Surface Engineering- Professional Elective
CME397 Surface Engineering- Professional Elective
 
ethical hacking in wireless-hacking1.ppt
ethical hacking in wireless-hacking1.pptethical hacking in wireless-hacking1.ppt
ethical hacking in wireless-hacking1.ppt
 
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdf
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdfCOLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdf
COLLEGE BUS MANAGEMENT SYSTEM PROJECT REPORT.pdf
 

Evolutionary Search Algorithm
In the artificial intelligence literature, an evolutionary search algorithm is a generic population-based metaheuristic optimization algorithm. It uses methods inspired by biological evolution, such as mutation, reproduction, recombination and selection. The high-level idea is that, given a population of individuals, a fitness function ranks which individuals perform best. Using this ranking, we discard a fraction of the weakest part of the population and let the stronger part reproduce and mutate to arrive at better solutions.

In our case, the individual hyperparameter sets are the candidates in a population, and the accuracy of a neural network trained with those hyperparameters is our fitness function. The inputs to the algorithm are the population size, an initial random population of candidates, and the mutation and crossover mechanisms; the output is the candidate with the highest accuracy (fitness). The mutations can vary heavily with the application. In our particular case, we randomly chose three candidates from the pool of the strongest candidates, took the average of their hyperparameters, and added some noise to encourage the algorithm to explore more.

Initial Analysis: Workload and Data Structures
The main components of our program are the neural network (training and inference), the dataset the network is trained on, and the set of hyperparameters we are searching over. In the algorithms described above, the computationally expensive part is training and evaluating multiple neural networks to obtain the accuracies of different parameter sets, so parallelizing this step gives the largest benefit: we can test the fitness of multiple candidates in parallel.

There are a few dependencies in this program. First, for evolutionary search, we want to search the hyperparameter space carefully, regularly using past results to restrict the future search space and obtain better results. This requires us to frequently synchronize all parallel threads in order to evaluate the current population, rank it and mutate it. Furthermore, depending on which hyperparameters we search over and what their values are, the workload can be very skewed: some hyperparameter settings take much longer to train and evaluate than others. These algorithms therefore benefit greatly from data parallelism in the first few approaches, where we evaluate the fitness of the population in parallel, synchronize, and then parallelize the mutations as well.

The key data structures are the neural networks, which we could not share (each network has to be trained individually through backpropagation to evaluate how well its hyperparameters perform), and the array of all candidates in a population (or in multiple populations). The main operations on this array are updating the accuracies, ranking the candidates against each other, killing the weakest candidates, and mutating and reproducing the strongest ones to produce new candidates.
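As a rough illustration of the candidate array and the mutation step described above, the sketch below (hypothetical names such as Candidate, NUM_PARAMS and noise_range; this is not our actual code) averages three randomly chosen candidates from the strongest part of the population and perturbs the result with uniform noise. Integer-valued hyperparameters such as the hidden-layer size would additionally need rounding and clamping to their valid range.

#include <cstdlib>
#include <vector>

// Hypothetical candidate: a fixed-length vector of real-valued hyperparameters
// plus the fitness (validation accuracy) of a network trained with them.
const int NUM_PARAMS = 4;
struct Candidate {
    double params[NUM_PARAMS];
    double fitness;
};

// Uniform noise in [-range, range].
static double rand_uniform(double range) {
    return (2.0 * std::rand() / RAND_MAX - 1.0) * range;
}

// Mutation: average three randomly chosen survivors and add noise so the search
// keeps exploring instead of collapsing onto a single point.
Candidate mutate(const std::vector<Candidate>& survivors, double noise_range) {
    Candidate child;
    for (int p = 0; p < NUM_PARAMS; p++) {
        double sum = 0.0;
        for (int k = 0; k < 3; k++)
            sum += survivors[std::rand() % survivors.size()].params[p];
        child.params[p] = sum / 3.0 + rand_uniform(noise_range);
    }
    child.fitness = 0.0;   // filled in later by training a network with child.params
    return child;
}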
Main Challenge
The task of hyperparameter search is very tedious, especially since we need to train multiple neural networks from scratch with different hyperparameter configurations. As networks become more complex and the number of hyperparameters grows, the task becomes harder still: the search space grows exponentially, and the more applicable methods, such as Bayesian optimization techniques, are difficult to parallelize.

The main challenge in parallelizing these algorithms was to speed up not just the algorithms themselves, but also to get the right tradeoff between accuracy and how fast and effectively we can search the hyperparameter space. Any Bayesian optimization has an inherently sequential part, where past results are used to come up with better estimates of the hyperparameters. We first parallelized the algorithms in phases. We then aimed to optimize that sequential part as well, trading off some quality in the Bayesian optimization for a faster, albeit slightly less accurate, search.

Approach
In this section, we describe the approaches and techniques we used to parallelize the algorithms.

Grid Search (GS)
The high-level algorithm is simple: loop through every possible configuration of hyperparameters and evaluate its fitness, which in our case means training a neural network and measuring its accuracy on the validation set. Parallelizing grid search was mostly straightforward, apart from the update of the best configuration found so far. We created an array that keeps track of the accuracies of all configurations and parallelized the fitness calculation for each configuration; at the end, we take the maximum accuracy and select that configuration. Since this does not scale with the size of the hyperparameter space, we broke the parameter space into chunks, compute the fitness of each chunk in parallel, and incrementally keep track of the most accurate configuration.
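A minimal sketch of this chunked parallelization is shown below. Config and evaluate_fitness are hypothetical stand-ins (the stub does not train a network) rather than our actual interfaces.

#include <omp.h>
#include <vector>

// Hypothetical hyperparameter configuration.
struct Config { double learning_rate; int hidden_units; int epochs; };

// Stub: in the real program this trains an MLP with 'c' and returns validation accuracy.
double evaluate_fitness(const Config& c) { return 0.0; }

// Grid search over one chunk of the parameter space: every configuration is
// independent, so the fitness evaluations are a straightforward parallel for.
int grid_search_chunk(const std::vector<Config>& chunk, double& best_accuracy) {
    if (chunk.empty()) { best_accuracy = 0.0; return -1; }
    std::vector<double> accuracy(chunk.size(), 0.0);

    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < (int)chunk.size(); i++)
        accuracy[i] = evaluate_fitness(chunk[i]);

    // Short sequential pass to pick the best configuration in this chunk; the caller
    // keeps track of the best configuration across chunks incrementally.
    int best = 0;
    for (int i = 1; i < (int)chunk.size(); i++)
        if (accuracy[i] > accuracy[best]) best = i;
    best_accuracy = accuracy[best];
    return best;
}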
Random Search (RS)
Random search repeatedly samples hyperparameter configurations from the search space and tests their fitness. It keeps track of the best configuration seen so far, and once the fitness crosses a threshold it stops searching. Parallelizing this was not entirely straightforward; our strategy was to create batches of threads that each randomly sample from the hyperparameter space and report their accuracy to an array in global memory. Every few batches, the master thread reads the best accuracy from this array and, if it is over the threshold, stops the program; otherwise it keeps spawning new blocks of threads to evaluate further random configurations. In practice this is always much faster than grid search, and the parallelized version usually explores the space more efficiently and reaches the desired accuracy quickly.
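A sketch of this batched strategy, reusing the same hypothetical Config and evaluate_fitness stand-ins; sample_random_config and its ranges are made up for illustration.

#include <omp.h>
#include <cstdlib>
#include <vector>

struct Config { double learning_rate; int hidden_units; int epochs; };

// Stub: in the real program this trains an MLP and returns its validation accuracy.
double evaluate_fitness(const Config& c) { return 0.0; }

// Draw one configuration uniformly at random from the search space (illustrative ranges).
Config sample_random_config() {
    Config c;
    c.learning_rate = (std::rand() % 1000 + 1) / 1000.0;
    c.hidden_units  = 8 << (std::rand() % 6);      // 8, 16, ..., 256
    c.epochs        = 10 + std::rand() % 90;
    return c;
}

// Random search: evaluate batches of random configurations in parallel; after every
// batch the master checks the best accuracy so far and stops once it crosses the threshold.
Config random_search(double threshold, int batch_size, int max_batches) {
    Config best{};
    double best_acc = 0.0;
    for (int b = 0; b < max_batches && best_acc < threshold; b++) {
        std::vector<Config> batch(batch_size);
        std::vector<double> acc(batch_size, 0.0);
        for (int i = 0; i < batch_size; i++) batch[i] = sample_random_config();

        #pragma omp parallel for schedule(dynamic)
        for (int i = 0; i < batch_size; i++)
            acc[i] = evaluate_fitness(batch[i]);

        for (int i = 0; i < batch_size; i++)
            if (acc[i] > best_acc) { best_acc = acc[i]; best = batch[i]; }
    }
    return best;
}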
Evolutionary Search - Basic Algorithm
The basic evolutionary search leverages past evaluations of the fitness function on explored parameters to optimize the search space more effectively. The idea is the following: we start with a population of candidate hyperparameter sets and evaluate all of them. We then rank the candidates by their fitness and discard the k least fit. Finally, we generate new children from the best-performing candidates. Our mutation works as described earlier: we randomly sample three of the top-performing candidates, take the average of their hyperparameters, and add uniform noise to the averages to encourage exploration.

Parallel Evolutionary Search 1 (EV1)
Our first approach was to parallelize the basic evolutionary search algorithm, starting with the evaluation of the fitness functions of the individual candidates. We parallelized this using OpenMP, creating threads to divide up the work. We noticed that for certain hyperparameters, such as the number of hidden units, the workload imbalance was very high and OpenMP static scheduling did not give good speedups. We therefore switched to dynamic scheduling and, despite the sequential code that ranks and mutates the population, obtained decent speedups (discussed in the results section).
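The sketch below shows one EV1 generation with the same hypothetical Candidate type and stubbed evaluate_fitness and mutate helpers used earlier: the fitness evaluations use a dynamically scheduled OpenMP loop, while ranking and mutation remain sequential.

#include <omp.h>
#include <algorithm>
#include <vector>

struct Candidate { double params[4]; double fitness; };

// Stubs: the real versions train an MLP / average three survivors plus noise (see above).
double evaluate_fitness(const Candidate& c) { return 0.0; }
Candidate mutate(const std::vector<Candidate>& survivors) { return survivors[0]; }

// One generation of EV1: dynamic scheduling because training time varies strongly
// with hyperparameters such as the hidden-layer size.
void ev1_generation(std::vector<Candidate>& population, int kill_count) {
    #pragma omp parallel for schedule(dynamic)
    for (int i = 0; i < (int)population.size(); i++)
        population[i].fitness = evaluate_fitness(population[i]);

    // Sequential part: rank, discard the weakest kill_count candidates, repopulate.
    std::sort(population.begin(), population.end(),
              [](const Candidate& a, const Candidate& b) { return a.fitness > b.fitness; });
    std::vector<Candidate> survivors(population.begin(), population.end() - kill_count);
    for (int i = (int)population.size() - kill_count; i < (int)population.size(); i++)
        population[i] = mutate(survivors);
}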
Evolutionary Search with Islands - Our adaptation of the basic algorithm
The basic algorithm does well as long as the dimensionality of the hyperparameter space is not too high. When the number of hyperparameters grows, however, it is no longer ideal: because of random sampling, if it does not start from good enough points, a basic evolutionary search gets stuck in local optima. It helps to create multiple local evolutionary searches with different starting points and have them communicate their best parameters to each other after a few generations of local optimization. We call this variant the "island" approach: islands of local populations that intermingle every few iterations.

The basic idea is that we keep n different islands, each with its own population of candidate hyperparameters. Each island separately carries out evolutionary search on its candidate population for a few generations, i.e., it evaluates its candidates, ranks them within its population, and kills and repopulates the weakest members. After a few iterations, we take the top-k candidates from each island, merge them, and apply a global evolution: the merged candidates are sorted, the weakest are killed, and they are repopulated with the mutations described in the evolutionary search section. We then reseed each local population with these globally mutated seeds, plus some random seeds, so that it can explore the high-dimensional space better. The next few sections describe how we parallelized this island scheme; first, a sketch of the global intermingling step itself.
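This is a compact sketch of the global intermingling step, with the same hypothetical Candidate type and a stubbed mutate; the split between survivors and discarded candidates is illustrative.

#include <algorithm>
#include <vector>

struct Candidate { double params[4]; double fitness; };

// Stub: the real version averages three random survivors and adds noise (see earlier sketch).
Candidate mutate(const std::vector<Candidate>& survivors) { return survivors[0]; }

// Global evolution: merge the top-k candidates reported by every island, rank them,
// discard the weakest half, refill by mutation, and return the result, which (after a
// shuffle) reseeds the islands along with some fresh random candidates.
std::vector<Candidate> global_evolution(const std::vector<std::vector<Candidate>>& island_tops) {
    std::vector<Candidate> merged;
    for (const std::vector<Candidate>& tops : island_tops)
        merged.insert(merged.end(), tops.begin(), tops.end());
    if (merged.size() < 2) return merged;

    std::sort(merged.begin(), merged.end(),
              [](const Candidate& a, const Candidate& b) { return a.fitness > b.fitness; });

    size_t keep = merged.size() / 2;     // illustrative: keep the stronger half
    std::vector<Candidate> survivors(merged.begin(), merged.begin() + keep);
    for (size_t i = keep; i < merged.size(); i++)
        merged[i] = mutate(survivors);   // repopulate the weak half
    return merged;
}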
Parallel Evolutionary Search 2 (EV2)
We wanted to parallelize the island approach in order to scale better and search dense hyperparameter spaces more efficiently, and we used a fork-join model with shared memory to implement it. The global top candidates are stored in shared memory. The master spawns the requisite number of threads, and each thread acts as an island: it reads its chunk of "seed" candidates from the global top candidates (which the master shuffles after mutation) and runs a certain number of iterations of the evolutionary search locally. Having run a few generations, each island reports its top-k candidates back to the global array. The master waits for all threads to finish their local generations using join; it then sorts the reported local winners in the global array, applies a "global evolution" to it, shuffles it, and spawns the local threads again for more iterations of local evolutionary search.

We chose to parallelize across islands rather than across individual candidates, since this makes it easier to do a global intermingling of the best candidates.
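The fork-join structure is sketched below with hypothetical types; the local search and the master's global evolution and shuffle are elided into comments.

#include <pthread.h>
#include <vector>

struct Candidate { double params[4]; double fitness; };

// Arguments for one island thread: a pointer to the shared global array of seeds /
// reported winners, and the chunk of that array this island owns.
struct IslandArg {
    std::vector<Candidate>* global_tops;
    int begin, end;
};

void* run_island(void* p) {
    IslandArg* arg = static_cast<IslandArg*>(p);
    // ... read seeds from (*arg->global_tops)[arg->begin .. arg->end), run a few local
    //     generations, and write this island's top candidates back to the same slots ...
    (void)arg;
    return nullptr;
}

void ev2(std::vector<Candidate>& global_tops, int num_islands, int rounds) {
    int chunk = (int)global_tops.size() / num_islands;
    std::vector<pthread_t> tids(num_islands);
    std::vector<IslandArg> args(num_islands);

    for (int r = 0; r < rounds; r++) {
        // Fork: one pthread per island.
        for (int t = 0; t < num_islands; t++) {
            args[t] = IslandArg{&global_tops, t * chunk, (t + 1) * chunk};
            pthread_create(&tids[t], nullptr, run_island, &args[t]);
        }
        // Join: wait for every island to report its local winners.
        for (int t = 0; t < num_islands; t++)
            pthread_join(tids[t], nullptr);

        // Sequential step by the master: sort the reported winners, apply the global
        // evolution (see earlier sketch), and shuffle before the next round of forks.
    }
}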
Parallel Evolutionary Search 3 (EV3)
Independently of EV2, we also implemented a threading model where, instead of forking and joining every iteration, we use a persistent pool of threads and synchronize them with shared variables and mutexes, in order to compare the performance of the two approaches. The islands are split between the threads in the same way as in EV2. We keep a shared count variable, a mutex per thread, and a mutex for the count variable. The master creates a thread pool of the specified size and sets the count variable to the number of threads. Each thread runs an infinite loop: at every iteration it locks its own mutex, reads its global "seed" candidates, and runs its local evolutionary search. When it finishes, it decrements the count variable and loops back to grab its own lock again (where it waits). The master waits for the count to reach zero; it then sorts the global tops array, applies the global mutations, shuffles the array, resets the count to the number of threads, and unlocks each thread's mutex so the threads can start the next round. This runs until either the master finds an accuracy above the threshold or an upper limit of iterations is reached, at which point the master signals the other threads to quit through a global variable.

The motivation was that pthread fork and join might have extra overhead, and reusing threads coordinated with mutexes might avoid it. As discussed in the results section, this ended up being slower than EV2 for smaller thread counts, mostly because of the synchronization overhead of the mutexes and shared variables; for 16 threads, however, it outperformed EV2 because EV2 pays the cost of spawning new threads every iteration.
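A simplified sketch of this round-based coordination follows. For compactness it uses a single shared mutex and condition variable instead of the per-thread mutex handoff we actually used, but the structure is the same: a persistent pool of island threads, a shared counter the workers decrement, and a master that waits for the counter to reach zero before running the global step and releasing the next round. The local search and global evolution are elided into comments.

#include <pthread.h>

pthread_mutex_t pool_lock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t  pool_cv   = PTHREAD_COND_INITIALIZER;
int  pending  = 0;      // islands still working in the current round
int  round_id = 0;      // bumped by the master to release the next round
bool done     = false;  // set by the master when the search should stop

void* island_worker(void*) {
    int seen_round = 0;
    while (true) {
        pthread_mutex_lock(&pool_lock);
        while (round_id == seen_round && !done)
            pthread_cond_wait(&pool_cv, &pool_lock);     // wait for the next round
        if (done) { pthread_mutex_unlock(&pool_lock); return nullptr; }
        seen_round = round_id;
        pthread_mutex_unlock(&pool_lock);

        // ... read global seed candidates, run a few local generations, report winners ...

        pthread_mutex_lock(&pool_lock);
        if (--pending == 0) pthread_cond_broadcast(&pool_cv);  // last island wakes the master
        pthread_mutex_unlock(&pool_lock);
    }
}

// The pool is created once (pthread_create with island_worker for each island) before this runs.
void ev3_master(int num_islands, int rounds) {
    for (int r = 0; r < rounds; r++) {
        pthread_mutex_lock(&pool_lock);
        pending = num_islands;
        round_id++;                                   // release the workers
        pthread_cond_broadcast(&pool_cv);
        while (pending > 0)
            pthread_cond_wait(&pool_cv, &pool_lock);  // wait for all islands to finish
        pthread_mutex_unlock(&pool_lock);

        // ... sort the global winners, apply the global evolution, shuffle (sequential) ...
    }
    pthread_mutex_lock(&pool_lock);
    done = true;                                      // tell the pool to exit
    pthread_cond_broadcast(&pool_cv);
    pthread_mutex_unlock(&pool_lock);
}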
Finally, we wanted to reduce the time the island algorithm spends in its sequential section. The approach we took is described next.

Parallel Evolutionary Search 4 (EV4)
In the previous approaches we noticed a few things. First, the workload distribution is sometimes skewed, which leaves some threads idle while they wait for others to catch up. Second, the global evolution happens sequentially, which is a definite bottleneck for further speedups.

We went back to the drawing board and recalled that the whole point of the island approach is to let different local evolutionary populations communicate and intermingle, encouraging more exploration and eventually better evolution. We realized that this communication does not need to be exact in time, so we can do it asynchronously. We built this on top of EV2 but removed the synchronization: every thread, including the master, works exclusively on its own local population. We created three shared global arrays. The "read array" is where the master places the initial seeds for all islands; the "write array" is where all threads report the best candidates from their populations; the third is a buffer array used for intermediate calculations.

The master first initializes the read array and then spawns the other threads. Each thread runs the local evolutionary search until either the master signals it to stop (via a shared variable) or it hits a fixed number of iterations. The master also runs a local evolutionary search, but every few iterations it performs the global sorting and mutation step without blocking the other threads: it copies the write array into the buffer, sorts the buffer, applies the global evolution to it, shuffles it, and then swaps the pointers of the read array and the buffer so the threads can read the newly evolved array. All threads therefore operate asynchronously; the only drawback is that the intermingling does not happen at exactly the same time for all threads. This is acceptable, because the main goal is simply to exchange information between islands, and making this asynchronous did not cost us much accuracy.

The one problem we faced was that the master now has extra work, so it lagged behind severely, causing the other threads to work on stale data for much longer. As a fix, we significantly reduced the population size of the master's local evolutionary search, so that the master was always either ahead of the other threads (which does no harm) or at par with them. The master keeps executing until it either hits the accuracy threshold or some other thread exits, which signals all threads to quit via a shared variable. In this way, we removed essentially all sequential code from the algorithm.

We did trade off a little freshness in the global evolution, but since the local evolutions still follow the process correctly, this did not cause a big drop in accuracy. The goal was simply to let information flow between threads so that we explore a larger space and avoid getting stuck in local optima, which this model achieves. It also gave us further speedups, as shown in the results section.
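This is a rough sketch of the three shared arrays and the master's non-blocking global step; names and sizes are illustrative. In line with the design above, reads of possibly stale or concurrently updated slots are tolerated rather than fully synchronized, so a production version would need more care than shown here.

#include <algorithm>
#include <atomic>

struct Candidate { double params[4]; double fitness; };

const int GLOBAL_SIZE = 64;                 // illustrative size of the global arrays
Candidate arr_a[GLOBAL_SIZE], arr_b[GLOBAL_SIZE], arr_c[GLOBAL_SIZE];

std::atomic<Candidate*> read_arr(arr_a);    // islands read (possibly stale) seeds from here
Candidate*              write_arr = arr_b;  // islands report winners here, one slot per island
Candidate*              buffer    = arr_c;  // master-only scratch space
std::atomic<bool>       stop(false);        // set when some thread hits the accuracy threshold

// The master's asynchronous global step: no other thread blocks while this runs.
void master_global_step() {
    // Snapshot whatever the islands have reported so far (stale entries are acceptable).
    std::copy(write_arr, write_arr + GLOBAL_SIZE, buffer);

    std::sort(buffer, buffer + GLOBAL_SIZE,
              [](const Candidate& a, const Candidate& b) { return a.fitness > b.fitness; });
    // ... kill the weakest entries, repopulate them by mutation, shuffle (as sketched earlier) ...

    // Publish the new seeds by swapping the read pointer with the buffer. Islands in the
    // middle of an iteration keep using the old seeds until their next read.
    Candidate* old_read = read_arr.exchange(buffer);
    buffer = old_read;
}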
Parallel Evolutionary Search 5 (EV5)
This approach uses the same model as EV4, but we additionally parallelized the part of the code where the master sorts and evolves the global array, using an OpenMP parallel for loop over the evolutions carried out by the master thread. In addition, we padded the three global arrays so that each slot occupies a full cache line, in order to prevent false sharing. This is our best-performing model and gave us the highest speedups.
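A sketch of the two EV5 changes, with illustrative names: a 64-byte cache line is assumed, and mutate_from is a stand-in for the mutation described earlier.

#include <omp.h>

// Pad each slot of the shared global arrays to a full cache line (64 bytes assumed)
// so that threads writing to neighbouring slots do not false-share a cache line.
struct alignas(64) PaddedCandidate {
    double params[4];
    double fitness;
};

// Stub: the real version averages three random survivors and adds noise.
PaddedCandidate mutate_from(const PaddedCandidate* survivors, int num_survivors) {
    (void)num_survivors;
    return survivors[0];
}

// EV5 also parallelizes the master's global evolution: after sorting the merged winners,
// repopulating the discarded slots is independent per slot, so it becomes a parallel for.
void parallel_global_repopulate(PaddedCandidate* merged, int size, int keep) {
    #pragma omp parallel for schedule(dynamic)
    for (int i = keep; i < size; i++)
        merged[i] = mutate_from(merged, keep);   // survivors occupy merged[0 .. keep)
}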
Experimental Setup
Tiny DNN
Our initial approach was to use a highly optimized neural network library to run hyperparameter search on convolutional neural networks. We wanted to stick with C/C++ for our implementation, since it is much easier to parallelize than Python (which has better deep learning frameworks). We found tiny-dnn, a highly optimized library that provides a high-level abstraction of neural networks and can be linked into a project through header-only compilation. The library requires C++14, which prevented us from running it on the GHC machines. We implemented most of our techniques on top of tiny-dnn, but when we ran the tests on multiple threads we saw no performance improvement. We hypothesize that tiny-dnn performs its own optimizations with OpenMP and pthreads under the hood, which clashed with our parallelization and consumed most of the resources on the limited machines we were running our experiments on. We could not profile it, because we could not run it on GHC (due to C++14) and AWS blocked proper profiling on its instances. At this point, we decided to abandon the library and use a simpler network (a multi-layer perceptron) instead.

MLP
We decided to either use an existing simple multi-layer perceptron codebase or reimplement our own MLP with backpropagation. We ended up taking motivation and code from [2] and reimplementing parts of it to fit our needs. This meant we could not test our algorithms on very complicated networks (just MLPs), but given our resource constraints and the time it takes to train even a single complicated network, writing the C++ code and training such networks to good accuracies would have taken considerably more resources and time. We therefore scoped the problem down to multi-layer perceptrons, which were easier to implement and use for hyperparameter search. Even so, we spent a considerable amount of time getting a simple MLP working in C++, so we decided to focus the rest of our time on parallelizing hyperparameter search rather than building more complicated networks such as convolutional networks.

Datasets
We primarily tested on two datasets. We started with the Iris dataset, a small classification dataset (150 points, 3 labels), because it let us train many networks in parallel with different hyperparameters very quickly to verify speedups and results. Once we had verified our approaches and obtained considerable speedups on Iris, we switched to the MNIST dataset, which is considerably larger (60,000 images). We initially wanted to report all results on CIFAR-10, but since we were limited to MLPs it was very hard to reach any substantial accuracy on CIFAR, and that was throwing off our hyperparameter search.

Machine Specifications
All experiments were run on AWS c5.4xlarge instances, which have Intel Xeon Platinum 8000 series processors with 16 virtual cores and 32 GB of memory. We also tested on the GHC machines, which have eight 3.2 GHz Intel Core i7 cores, and observed similar speedups.
Results and Analysis
Speedups
Wall Clock Times
The wall-clock times for the different algorithms are shown here. Observations:
1. GS, RS and EV1 are faster because the parameter space they search is much smaller than that of the other algorithms; we deliberately restricted them so we could measure speedups without wasting time running them on large hyperparameter spaces.
2. EV3 takes the longest, since its synchronization cost is much higher than that of the other techniques.
3. For random search, the average of three runs is reported; since the algorithm stops when it reaches a given accuracy threshold (75% for MNIST and 97% for Iris), its total execution time is much lower.
4. EV4's execution time is low because it has near-zero synchronization cost.
5. EV5's execution time is the lowest among EV2-EV5, since it benefits the most from parallelization while maintaining the same search space.
Relative Speedups
The relative speedups of the different algorithms on both datasets are shown here. Observations:
1. RS stops scaling after a point: the time needed to reach a minimum good accuracy is already achieved with a certain number of threads (8 in this case), and additional threads do not improve execution time because we have reached the steady state in the expected number of iterations needed to hit the accuracy threshold.
2. EV3, EV4 and EV5 scale similarly, since their parallelization techniques are similar. EV5 scales best because it has no synchronization overhead.
3. As the number of threads increases, the overhead of repeatedly creating pthreads grows for EV2, so its speedup at higher thread counts is not as good as the others. EV3 creates a pool of threads once at the start and synchronizes with mutexes; it also did not achieve perfect speedup, but it did better than EV2.
4. EV1 does not scale well because contention between threads increases as they operate on the same array, leading to false sharing and extra synchronization overhead from workload imbalance (explained in the next section). Moreover, its search space (the amount of work) is small, so adding threads does not give ideal speedups.
5. EV5 scales the best since it has no synchronization (being built from EV4); additionally, we padded the shared structures to reduce false sharing, and it contains extra parallelism on top of the parallel parameter search, which gives it the best efficiency at high thread counts.

To summarize, the overhead of creating threads, the synchronization cost, the partly sequential work inherent in each algorithm (needed to reconcile results and pick the best hyperparameters), and other costs such as initial data loading prevent us from achieving perfect speedups. This is shown in more detail for EV2, EV3 and EV4 in the 'Normalized Execution Time Breakdown' section.
Scaling with different parameters
This section shows the relative change in execution time within each algorithm. All experiments use the best execution time achieved with 16 threads for each algorithm.

Dataset size
The relative execution time (degradation, in this scale) as a function of dataset size is shown here. The dataset sizes increase exponentially, so an exponential slowdown is observed (except for RS, where the time to reach the accuracy threshold becomes almost constant beyond a certain amount of data). Since the data is passed through the neural network regardless of which search algorithm is used, we expect the same proportional change in execution time for all of them: the forward pass, backward pass, and testing (on 10% of the dataset) are all done inside the neural network and scale identically, which is why the same trend is observed across algorithms.
Number of Initial Hyperparameters
The relative execution time (degradation, in this scale) as a function of the size of the hyperparameter search space is shown here; the number of hyperparameter settings is again increased exponentially. There are several interesting observations:
1. Grid search is an exhaustive search, so an exponential increase in the number of hyperparameter settings gives an exponential increase in execution time.
2. When the space random search samples from is larger, random search tends to find a good model and reach its accuracy threshold more quickly, so its runtime actually decreases.
3. The whole idea of evolutionary search is to retain only the best hyperparameters. Even though we sample from a larger hyperparameter space, we discard the bad candidates and operate only on the best ones, so the execution time remains roughly constant, with a slight increase due to the higher sampling cost over the larger range.
Size of Neural Network
The relative execution time (degradation, in this scale) as a function of the size of the neural network is shown here. As with dataset size, the cost is incurred equally by all algorithms: training and testing times increase proportionally, and since the depth of the network grows exponentially from 1 to 8 layers, so do the relative execution times.
Workload distribution between the threads in the algorithms
The primary difference between EV2 and EV5 is that in EV2 the pthreads are forked and joined repeatedly. A major problem with this is the unequal execution time across threads: for instance, a hidden layer of 512 units takes far longer to train and test than one with 32 units, which leaves several threads idle while they wait for the long-running ones. In EV5, since the algorithm is modified to remove synchronization, all threads stay busy. In the figure above, for the 16-thread run, the blue bars (EV5) are much more evenly distributed than the orange ones (EV2). This is also consistent with the speedup and wall-clock time graphs, where EV5 is shown to be the best.

Note that all results for GS, RS and EV1 are reported with OpenMP dynamic scheduling, since they show a similar trend.
Normalized Execution Time Breakdown
A detailed normalized execution time breakdown for each algorithm is given below. The relative trends and speedups were already explained above, but this graph shows exactly why they occur.
EV3 has much more synchronization cost associated with it, since its threads stall far more often to acquire mutexes; this time adds up and grows quickly as the number of threads increases.
EV2, in contrast, has some synchronization cost during the pthread joins, but it is almost constant since the master performs them a fixed number of times (specific to the algorithm).
For EV4 (and EV5, which is built on top of EV4), the lack of synchronization time makes them very efficient.
The algorithms have a similar proportion of total parallel work (shown in orange) across threads, but we see speedups because this work is divided between the threads, so each thread has less to do as the number of threads increases.
Finally, the inherent sequential part of each algorithm prevents us from achieving perfect speedup.
Accuracy vs. Time
For this chart, we set a threshold for each algorithm to stop searching the parameter space as soon as it hits 75% accuracy (a good accuracy on MNIST with an MLP). We had significantly constrained the parameter space for grid search and random search to avoid long runtimes, so these numbers are the proportion of time taken to reach the required accuracy relative to the total time taken (for an iteration-bound or exhaustive search with that particular algorithm).

Grid search needs to sweep all parameters before it can check the accuracies, so it takes the most time. Random search checks the accuracy threshold after every block of parallel execution and reaches 75% accuracy in about 20% of the time of the longer sweep, with similar results.

EV4 takes the least time, since it searches asynchronously while still sharing information across threads (islands). EV2 and EV3 are very close, and the difference can even be attributed to the random initialization of the populations; the key point is that they reach a good solution significantly faster than grid or random search. EV3 is slightly slower due to the synchronization cost involved (explained in the previous section).
Credit Distribution
The work was split equally between both partners. We worked together on figuring out the approaches, setting up the frameworks, and getting a basic version of the project working. We divided the algorithms we implemented equally and contributed equally to the analysis and the report.

References
[1] https://github.com/tiny-dnn
[2] https://github.com/davidalbertonogueira/MLP/blob/master/README.md
[3] https://deepmind.com/blog/population-based-training-neural-networks/
[4] http://geco.mines.edu/workshop/aug2011/05fri/parallelGA.pdf
[5] https://blog.floydhub.com/guide-to-hyperparameters-search-for-deep-learning-models/
Two images were taken from:
[6] https://www.slideshare.net/alirezaandalib77/evolutionary-algorithms-69187390
[7] https://www.slideshare.net/alirezaandalib77/evolutionary-algorithms-69187390