Team Backrow proposes using idle computer time at institutions for backend processes like cryptocurrency mining or machine learning tasks. Their system uses a distributed hypervisor architecture with a central supervisor to switch machines between user and research VMs. Simulation results found that a shortest remaining time first plus longest remaining time first scheduling policy completed the most jobs while maintaining equitable distribution of work. The team's system was able to effectively utilize idle computer resources but faced limitations due to the pandemic preventing on-campus testing.
Team Backrow Final Presentation
1. Team Backrow
Team 8: Charlotte Steinichen, Josh
Dierberger, Kenedy Thorne, Phillip Stephens
2. Introduction
The high percentage of idle time on a computer reveals how much of certain
resources is wasted at institutions and other labs. When a resource isn't being
used, its value is essentially depreciating, especially in a time of rapidly
evolving technology. To capitalize on this and help systems be more efficient,
our team proposes using this idle time to run backend processes on the system,
such as cryptocurrency mining, or pulling the machine in to help with
calculations or with other systems' workloads such as rendering or simulation.
Idle machines would now be pulled in to support research tasks during time
that would otherwise be wasted. This helps institutions maximize the value of
their resources and pursue further ventures, while significantly improving
cluster usage and improving the user experience for difficult or frustrating
tasks.
3. Architecture/Setup
For the sake of simplicity and practicality, we designed the system as a
distributed hypervisor architecture with a singleton supervisor.
Each hypervisor switches its machine between a user VM and a research VM.
The supervisor decides which research jobs to run based on which hypervisors
are available.
6. Hypervisor (VM-Admin)
● A hypervisor is needed to switch between running research tasks and providing
a user's desktop environment
● VM-Admin serves as the locally hosted (software) hypervisor on each machine,
switching between these VMs
○ This offers modularity that a native/bare-metal hypervisor can't offer.
● Job overhead ≈ 1-2 seconds
● Context switch from research to user ≈ 10 seconds
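The switching behavior above can be sketched as a small state machine. The class and attribute names here are illustrative assumptions, not the team's actual VM-Admin code; the overhead constants come from the estimates on this slide.

```python
JOB_OVERHEAD_S = 1.5      # ~1-2 s to start a research job
CONTEXT_SWITCH_S = 10.0   # ~10 s to swap the research VM out for the user VM

class VMAdmin:
    """Hypothetical sketch of the per-machine VM-Admin hypervisor."""

    def __init__(self):
        self.active_vm = "user"   # machine starts in the user's desktop
        self.current_job = None

    def start_research_job(self, job_id):
        """Swap in the research VM when the machine goes idle."""
        self.active_vm = "research"
        self.current_job = job_id
        return JOB_OVERHEAD_S      # expected startup overhead in seconds

    def user_returned(self):
        """Pause the job and hand the machine back to the user."""
        paused = self.current_job
        self.current_job = None
        self.active_vm = "user"
        return paused, CONTEXT_SWITCH_S
```

In this sketch the VM-Admin only tracks which VM is live; deciding what to run stays with the Commander, as described on the next slides.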
9. Supervisor (Commander) Setup
● Since the hypervisors are distributed, we separate managing and scheduling
research jobs into a single Commander program.
● The hypervisors don't have a holistic view; the Commander does.
● This also gives us modularity, since the Commander doesn't require
specific hypervisor setups.
● The Commander knows:
○ The time of day (to estimate availability)
○ All pending jobs and their priorities
■ This can be an arbitrary number, or something like how long the job has run so far
○ All running jobs
○ The availability of all VM-Admins, and what VM-Admins are paused/storing specific jobs
○ The saved state of all jobs and the messages they passed/are passing
10. Supervisor (Commander) Setup (cont.)
Some major actions the Commander uses:
● It can deploy a job, either from a fresh specification or from a saved state
(accounting for differences in a specific VM-Admin's setup)
● It can query for and/or delete the saved state of a paused job
● It can pass messages from client jobs
These actions allow it to undertake any form of non-preemptive scheduling.
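A minimal sketch of this bookkeeping, with illustrative (assumed) names for the Commander's state and actions; the real Commander's interfaces may differ:

```python
class Commander:
    """Hypothetical sketch of the singleton Commander's state and actions."""

    def __init__(self):
        self.pending = []          # (priority, job_spec) awaiting deployment
        self.running = {}          # job_id -> VM-Admin currently hosting it
        self.saved_states = {}     # job_id -> serialized state of a paused job
        self.mailboxes = {}        # job_id -> messages relayed between jobs

    def deploy(self, job_id, vm_admin):
        """Deploy from a saved state if one exists, else from the fresh spec."""
        state = self.saved_states.pop(job_id, None)
        self.running[job_id] = vm_admin
        return "resume" if state is not None else "fresh"

    def query_saved_state(self, job_id):
        """Look up the saved state of a paused job, if any."""
        return self.saved_states.get(job_id)

    def delete_saved_state(self, job_id):
        """Discard a paused job's saved state."""
        self.saved_states.pop(job_id, None)

    def pass_message(self, dst_job, msg):
        """Relay a message between client jobs via the Commander."""
        self.mailboxes.setdefault(dst_job, []).append(msg)
```

Because the Commander alone holds this global view, any non-preemptive policy (FIFO, SJF, SRTF-on-save) can be layered on top by choosing which entry of `pending` to deploy next.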
12. MVP Testing-Distributed Sum
● Our MVP was a program that could run a distributed sum
● This program ran with 1 Commander, 2 VM-Admins, and 2 Nodes
Job 1:
  local_sum = 0
  for i = 1 to 500:
    local_sum += i
  send local_sum to Job 2

Job 2:
  local_sum = 0
  for i = 501 to 999 (inclusive):
    local_sum += i
  global_sum = (local_sum received from Job 1) + local_sum
  return global_sum
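The pseudocode above can be run directly; this sketch uses two threads and a queue in place of the Commander's message passing (the threading setup is an assumption for illustration, not how the Nodes actually communicate):

```python
import threading
import queue

channel = queue.Queue()   # stands in for the Commander relaying messages

def job1():
    local_sum = sum(range(1, 501))       # 1..500
    channel.put(local_sum)               # "send local_sum to Job 2"

def job2(result):
    local_sum = sum(range(501, 1000))    # 501..999 inclusive
    result["global_sum"] = channel.get() + local_sum

result = {}
t1 = threading.Thread(target=job1)
t2 = threading.Thread(target=job2, args=(result,))
t1.start(); t2.start()
t1.join(); t2.join()
print(result["global_sum"])   # 499500, i.e. sum(range(1, 1000))
```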
[Diagram: Commander coordinating two VM-Admins, each hosting a Node]
14. Commander: Final View
Finally, the Commander receives the job's 'FINISHED' packet, containing the
job's output. Since this was an echo job, the output is just the initial phrase.
The first job has now been run!
16. Solving scheduling
Because we couldn't test our product in a realistic environment, we built a
simulator to measure the usage gained by scheduling jobs during computer
downtime and to determine the most effective scheduler type.
Simulator assumptions:
● 20 computers
● 300 jobs, 100 “small” (15 min), 100 “medium” (45 min), 100 “large” (3 hour)
● Can set policy to either save state or delete progress when users interrupt
jobs
● Can use FIFO, SJF, or SRTF* as scheduling policies
● It takes 2 minutes to switch from a user VM to a research VM
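A toy version of the core scheduling decision, for one computer and one idle window, illustrates why SJF-style policies complete more jobs than FIFO under these assumptions. The function and the example job list below are illustrative, not the actual simulator:

```python
SWITCH_MIN = 2   # minutes to switch from the user VM to the research VM

def completed_jobs(jobs_min, idle_min, policy="FIFO"):
    """Count jobs that finish inside one idle window (non-preemptive)."""
    order = sorted(jobs_min) if policy == "SJF" else list(jobs_min)
    t = SWITCH_MIN       # pay the VM switch cost once at the window's start
    done = 0
    for dur in order:
        if t + dur > idle_min:
            break        # the next job would be interrupted by the user
        t += dur
        done += 1
    return done

# One "large" job (180 min) at the head of the queue, then small/medium jobs
jobs = [180, 15, 45, 15, 45, 15]
print(completed_jobs(jobs, 120, "FIFO"))  # large job blocks the whole window
print(completed_jobs(jobs, 120, "SJF"))   # short jobs go first, more finish
```

Scaled up to 20 machines, many windows, and save/delete policies on interruption, this is the decision the full simulator repeats to produce the tables that follow.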
17. Scheduler Policy Comparison
                     FIFO+DELETE  FIFO+SAVE  SJF+DELETE  SJF+SAVE  SRTF+SAVE
# jobs complete              202        241         221       269        270
S jobs complete               87         97         100       100        100
M jobs complete               79         90         100       100        100
L jobs complete               36         54          21        69         70
# job interruptions          353        341         215       380        367
% jobs interrupted           67%        56%         18%       34%        32%
# work hours done            189        253         163       307        310
20. Simulator Experiment: Multiple Days
This time, we ran the simulator over 5 weekdays with a pool of 1000 jobs of
each size. The new scheduler switches from SRTF to LRTF at night (1AM - 9AM).
                     FIFO+DELETE  FIFO+SAVE  SJF+DELETE  SJF+SAVE   SRTF  SRTF+LRTF
# jobs complete             1218       1246        2060      2206   2207       2010
S jobs complete              594        600        1000      1000   1000       1000
M jobs complete              433        450        1000      1000   1000        778
L jobs complete              191        196          60       206    207        232
# job interruptions         2112       2108        1897      2053   2037       2023
% jobs interrupted           61%        61%         31%       36%    38%        51%
# work hours done           1046       1075        1180      1618   1621       1529
21. SRTF vs SRTF + LRTF
[Charts: "Number of Jobs Running Per Type" for SRTF/SAVE vs SRTF+LRTF/SAVE;
x-axis: time, y-axis: # of computers]
22. SRTF+LRTF thoughts
SRTF+LRTF forces a more equitable distribution of work and increases the
number of long jobs that complete. As an algorithm, it outperforms FIFO and
all deletion tactics. It is also more realistic than plain SRTF, which looks
attractive for the high number of work hours it completes but ignores all long
jobs for multiple days. In a real-life situation there would be no cap on the
number of "short" jobs in the queue, so long jobs could be starved
indefinitely; this scheduler addresses that. However, it still falls short of
being entirely realistic by neglecting the "medium" length jobs, which only
complete because the simulator runs out of short jobs. A next step might be
giving medium-length jobs priority during medium-traffic times and small jobs
priority during high-traffic times.
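The day/night switch can be sketched as follows; the 1AM-9AM window comes from the slides, while the function names and the remaining-time representation are illustrative assumptions:

```python
def pick_policy(hour):
    """SRTF during the day; LRTF overnight (1AM-9AM) so long jobs get a turn."""
    return "LRTF" if 1 <= hour < 9 else "SRTF"

def next_job(pending_remaining_min, hour):
    """Choose the next job by remaining runtime under the active policy."""
    if not pending_remaining_min:
        return None
    choose = max if pick_policy(hour) == "LRTF" else min
    return choose(pending_remaining_min)

# At 3AM the 180-minute job is picked; at noon the 15-minute job wins.
print(next_job([15, 180, 45], hour=3))    # LRTF window
print(next_job([15, 180, 45], hour=12))   # SRTF window
```

The proposed refinement (medium jobs at medium-traffic times) would just add a third branch keyed on a traffic estimate rather than the clock alone.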
23. Group Limitations
● Coronavirus moved everyone off campus, so we couldn't collect any real data
from the GT libraries/computers.
● Varying schedules and distance forced us onto video chats and other remote
tools, so some communication was lost in translation.
● Integration testing became very difficult: hypervisor testing is hard to
build a CI pipeline for, and coordinating integration tests remotely is a
logistical challenge.
● We didn't have much experience with VM provisioning, and most industry
solutions didn't suit our needs, so developing a solution from scratch was
lengthy.
24. Lessons
● Value of Communication and Teamwork – differing skills and strengths
● Variety of Testing in terms of Data sets & Timings – luxury of real time data vs
simulated data
● Break down of jobs & Delegation of Work – starting small and working up in
complexity
● The Reach & Scalability of the Project – just the beginning
25. Conclusions / Findings
In the end, we found our system ran very smoothly at our given scale. Our
research into different schedulers showed that SRTF with SAVE was most
effective at completing the most jobs with little waste, since it prioritizes
smaller jobs first, although longer-term data might show other schedulers to
be effective as well. Our architectural use of a Commander to switch between
the user VM and our research VM showed promising results, detecting a system's
use and switching to the proper state. Our simulated data revealed an
outstanding amount of time and resources going to waste when a computer wasn't
in use; our system could be utilized for about 90% of that time. Overall, the
system shows promising results and could be very resourceful and beneficial in
its possible use at Georgia Tech. Time is valuable; let's not waste it.
Kenedy - Distributed Computing Enabled by Idle Georgia Tech Computers
Kenedy - testing was typically done in person on the same network, so moving off campus and meshing all the parts together became significantly more complex to ensure everything was being properly tested