A Battlesnake reinforcement learning starter pack
Jonathan Chung
Xavier Raffin
©2020 Amazon Web Services, Inc. or its affiliates, All rights reserved
Battlesnake
Battlesnake reinforcement learning starter pack
Agent Environment
1. One-click deploy module: one-click deployment of an existing model
2. Reinforcement learning module: training your own Battlesnake reinforcement learning model
3. Heuristics module: build custom rules on top of an existing model
Reinforcement learning module
Agent Environment
Reinforcement learning module
Agent Environment
Actions
Rewards
State
Reinforcement learning module
Agent Environment
Actions
Rewards
State
Accelerate, move left, move right
Road image
#Kms driven
Reinforcement learning module
Agent Environment
Actions
Rewards
State
Accelerate, move left, move right
Road image
#Kms driven
Reinforcement learning module
Agent Environment
Actions
Rewards
State
Position of pieces
Position of all pieces
Winning/losing
Reinforcement learning module
Agent Environment
Actions
Rewards
State
Direction of movement
Pos. snakes&food
Reward?
Reinforcement learning module
Direction of movement
Pos. snakes+food
Reward per snake
Multiagent reinforcement learning
environment
snakes
Reinforcement learning module
Training routine for the module (deep Q learning)
for epi in range(episodes):
    state = env.reset()
    while agent.is_alive():
        prob = random.random()
        if prob < eps:
            action = agent.get_random_action()
        else:
            action = agent.get_next_best_action(state)
        next_state, reward = env.step(action)
        memory.append((next_state, state, action, reward))
        state = next_state
    # indentation reconstructed: learning may equally run after every step
    agent.learn(memory)
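The `prob < eps` test in the routine above is an epsilon-greedy choice: with probability `eps` the agent explores at random, otherwise it takes the action with the highest predicted value. The sketch below shows one way this could look, assuming `eps` is annealed linearly as training progresses. The function names, the linear schedule and its default values are illustrative assumptions, not taken from the starter pack.

```python
import random


def epsilon_at(step, eps_start=1.0, eps_end=0.05, decay_steps=1000):
    """Linearly anneal eps from eps_start to eps_end over decay_steps (assumed schedule)."""
    fraction = min(step / decay_steps, 1.0)
    return eps_start + fraction * (eps_end - eps_start)


def choose_action(q_values, eps, rng=random):
    """Return an action index: random with probability eps, otherwise the argmax."""
    if rng.random() < eps:
        return rng.randrange(len(q_values))
    return max(range(len(q_values)), key=lambda i: q_values[i])


# Early in training eps is high, so most actions are random;
# later it is low, so the agent mostly follows its own predictions.
early_action = choose_action([15, 10, 13, 0], eps=epsilon_at(0))
late_action = choose_action([15, 10, 13, 0], eps=epsilon_at(5000))
```

Starting with a high `eps` fills the memory with varied situations before the agent's own predictions are trusted.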
Agent Environment
Actions
Rewards
State
Reinforcement learning module
Training routine for the module (deep Q learning)
for epi in range(episodes):
    state = env.reset()
    while agents.agents_alive() > 1:
        actions = []
        for agent in agents:
            prob = random.random()
            if prob < eps:
                action = agent.get_random_action()
            else:
                action = agent.get_next_best_action(state)
            actions.append(action)
        next_state, reward = env.step(actions)
        memory.append((next_state, state, actions, reward))
        state = next_state
    # indentation reconstructed: learning may equally run after every step
    for agent in agents:
        agent.learn(memory)
Agent
Environment
Actions
Rewards
State
Agent
Agent
Reinforcement learning module
Environment:
Agent
Environment
Actions
Rewards
State
Agent
Agent
Reinforcement learning module
State representation:
the positions of the snakes and food
Agent
Environment
Actions
Rewards
State
Agent
Agent
Reinforcement learning module
State representation:
the positions of the snakes and food
Food position:
-1 -1 -1 -1 -1 -1
-1 0 0 0 0 -1
-1 0 0 0 0 -1
-1 0 0 0 0 -1
-1 0 1 0 0 -1
-1 -1 -1 -1 -1 -1
Target snake position:
-1 -1 -1 -1 -1 -1
-1 0 5 1 1 -1
-1 0 0 0 1 -1
-1 0 0 0 1 -1
-1 0 0 0 0 -1
-1 -1 -1 -1 -1 -1
Other snake positions:
-1 -1 -1 -1 -1 -1
-1 1 0 0 0 -1
-1 5 0 1 0 -1
-1 0 0 5 0 -1
-1 0 0 0 0 -1
-1 -1 -1 -1 -1 -1
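The three grids above can be produced by a small encoder. This is a minimal sketch, assuming snakes are given as lists of (row, col) cells with the head first and coordinates measured on the board without its border. The function name `encode_state` and its arguments are hypothetical and are not the starter pack's API.

```python
def encode_state(board_size, food, me, others, border=True):
    """Return [food, own snake, other snakes] as three square grids.

    Walls are -1 (only when border=True), heads are 5, bodies and food are 1.
    """
    pad = 1 if border else 0
    side = board_size + 2 * pad

    def blank_channel():
        grid = [[0] * side for _ in range(side)]
        if border:
            for i in range(side):
                grid[0][i] = grid[-1][i] = -1
                grid[i][0] = grid[i][-1] = -1
        return grid

    def draw(grid, snake):
        for r, c in snake[1:]:
            grid[r + pad][c + pad] = 1
        head_r, head_c = snake[0]
        grid[head_r + pad][head_c + pad] = 5

    food_ch, me_ch, others_ch = blank_channel(), blank_channel(), blank_channel()
    for r, c in food:
        food_ch[r + pad][c + pad] = 1
    draw(me_ch, me)
    for snake in others:
        draw(others_ch, snake)
    return [food_ch, me_ch, others_ch]


# The 4x4 board from the slide, seen by the agent for the long snake.
state = encode_state(
    board_size=4,
    food=[(3, 1)],
    me=[(0, 1), (0, 2), (0, 3), (1, 3), (2, 3)],
    others=[[(1, 0), (0, 0)], [(2, 2), (1, 2)]],
)
for row in state[1]:
    print(row)
```

Passing `border=False` gives the same channels without the -1 frame, which makes it easy to compare the two representations.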
Agent
Environment
Actions
Rewards
State
Agent
Agent
Reinforcement learning module
State representation:
the positions of the snakes and food
-1 -1 -1 -1 -1 -1
-1 0 0 0 0 -1
-1 0 0 1 0 -1
-1 0 0 5 0 -1
-1 0 0 0 0 -1
-1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1
-1 1 5 1 1 -1
-1 5 0 0 1 -1
-1 0 0 0 1 -1
-1 0 0 0 0 -1
-1 -1 -1 -1 -1 -1
Food position Snake position Other snake positions
Agent
Environment
Actions
Rewards
State
Agent
Agent
-1 -1 -1 -1 -1 -1
-1 0 0 0 0 -1
-1 0 0 0 0 -1
-1 0 0 0 0 -1
-1 0 1 0 0 -1
-1 -1 -1 -1 -1 -1
Reinforcement learning module
State representation:
the positions of the snakes and food
-1 -1 -1 -1 -1 -1
-1 1 0 0 0 -1
-1 5 0 0 0 -1
-1 0 0 0 0 -1
-1 0 0 0 0 -1
-1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1
-1 0 5 1 1 -1
-1 0 0 1 1 -1
-1 0 0 5 1 -1
-1 0 0 0 0 -1
-1 -1 -1 -1 -1 -1
Food position Snake position Other snake positions
Agent
Environment
Actions
Rewards
State
Agent
Agent
-1 -1 -1 -1 -1 -1
-1 0 0 0 0 -1
-1 0 0 0 0 -1
-1 0 0 0 0 -1
-1 0 1 0 0 -1
-1 -1 -1 -1 -1 -1
Reinforcement learning module
Environment: rewards
+1 every turn the snake lives
Agent
Environment
Actions
Rewards
State
Agent
Agent
Reinforcement learning module
Agent
Agent??
Agent
Environment
Actions
Rewards
State
Agent
Agent
Reinforcement learning module
Agent
Neural
network
Agent
Environment
Actions
Rewards
State
Agent
Agent
Reinforcement learning module
Agent
Neural
network
15
10
Predicted total expected
reward
13
0
Reinforcement learning module
Learning
Neural
network
t = 0
t = 1
Neural
network
New reward = 1
48
84
New predicted
total expected reward
15 10
13 0
Predicted total expected reward
Reinforcement learning module
Learning
New predicted total expected reward ≃ neural network (new reward, predicted total expected reward)
t = 0
Predicted total expected rewards
t = 1 New predicted total expected rewards
New reward = 1
Reinforcement learning module
Learning
- Total Expected reward (Q)
Qupdated(st) ← Qold(st) + ⍺ · [reward + ɣ · max Q(st+1) – Qold(st)]
Qt + Reward = 1 → Qt+1
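A worked example may help here. The sketch below applies the standard Q-learning update to a single number: the target is the new reward plus the discounted best prediction for the next state, and the old estimate moves a fraction ⍺ of the way toward it. The function names and the values of ɣ, ⍺ and the next-state predictions are illustrative assumptions. In the deep Q network the same target would drive a gradient step on the network weights rather than a direct assignment.

```python
def td_target(reward, next_q_values, gamma, done=False):
    """Reward plus the discounted best predicted value of the next state."""
    if done:
        # No future reward once the snake is dead.
        return reward
    return reward + gamma * max(next_q_values)


def q_update(q_old, target, alpha):
    """Move the old estimate a fraction alpha of the way toward the target."""
    return q_old + alpha * (target - q_old)


# Old prediction for the chosen move: 15. The snake survives the turn (reward = 1)
# and the assumed predictions for the next state are [14, 9, 12, 0].
target = td_target(reward=1, next_q_values=[14, 9, 12, 0], gamma=0.9)
q_new = q_update(q_old=15, target=target, alpha=0.1)
print(round(target, 2), round(q_new, 2))
```

With these numbers the target is about 13.6, so the estimate drops slightly from 15 to about 14.86.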
Reinforcement learning module
Customization opportunities
• Environment representations
• Neural network design
• Rewards design
-1 -1 -1 -1 -1 -1
-1 1 0 0 0 -1
-1 5 0 1 0 -1
-1 0 0 5 0 -1
-1 0 0 0 0 -1
-1 -1 -1 -1 -1 -1
Reinforcement learning module
Environment representation:
-1 -1 -1 -1 -1 -1
-1 0 5 1 1 -1
-1 0 0 0 1 -1
-1 0 0 0 1 -1
-1 0 0 0 0 -1
-1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1
-1 0 0 0 0 -1
-1 0 0 1 0 -1
-1 0 0 0 0 -1
-1 0 1 0 0 -1
-1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1
-1 1 0 0 0 -1
-1 5 0 1 0 -1
-1 0 0 5 0 -1
-1 0 0 0 0 -1
-1 -1 -1 -1 -1 -1
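One alternative to the full-board grids is a snake-centric view, in which the agent only sees the cells near its own head, so the input no longer depends on the map size. The sketch below crops such a window from one channel. The function name, the `radius` parameter and the choice to fill off-board cells with -1 are assumptions made for illustration.

```python
def snake_centric_view(channel, head, radius, fill=-1):
    """Crop a (2 * radius + 1)-wide square centred on head from a 2D grid.

    Cells that fall outside the grid are set to fill.
    """
    rows, cols = len(channel), len(channel[0])
    head_r, head_c = head
    view = []
    for r in range(head_r - radius, head_r + radius + 1):
        row = []
        for c in range(head_c - radius, head_c + radius + 1):
            inside = 0 <= r < rows and 0 <= c < cols
            row.append(channel[r][c] if inside else fill)
        view.append(row)
    return view


# Channel for the long snake, including the -1 border; its head (5) is at row 1, column 2.
snake_channel = [
    [-1, -1, -1, -1, -1, -1],
    [-1, 0, 5, 1, 1, -1],
    [-1, 0, 0, 0, 1, -1],
    [-1, 0, 0, 0, 1, -1],
    [-1, 0, 0, 0, 0, -1],
    [-1, -1, -1, -1, -1, -1],
]
view = snake_centric_view(snake_channel, head=(1, 2), radius=1)
for row in view:
    print(row)
```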
Reinforcement learning module
Neural network design:
-1 -1 -1 -1 -1 -1
-1 0 5 1 1 -1
-1 0 0 0 1 -1
-1 0 0 0 1 -1
-1 0 0 0 0 -1
-1 -1 -1 -1 -1 -1
-1 -1 -1 -1 -1 -1
-1 0 0 0 0 -1
-1 0 0 1 0 -1
-1 0 0 0 0 -1
-1 0 1 0 0 -1
-1 -1 -1 -1 -1 -1
Snake health
Snake ID
Turn count
Neural
network
Reinforcement learning module
Rewards design
• Surviving another turn
• Eating food
• Starving
• Winning the game
• Losing the game
• Hitting a wall/snake/yourself
• Performing a forbidden move
• Eating another snake
• Forcing another snake to hit your body
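The events listed above can be combined into a configurable reward. This is a minimal sketch, assuming the environment reports which events happened to a snake on a given turn. The event names and function names are hypothetical. The default mirrors the simple setup described in this deck, +1 for every turn survived and 0 for everything else.

```python
EVENTS = (
    "another_turn", "ate_food", "starved", "won", "lost",
    "hit_wall", "hit_other_snake", "hit_self",
    "forbidden_move", "ate_another_snake", "other_snake_hit_body",
)


def make_reward_table(**overrides):
    """Start from +1 per turn survived and apply any custom event rewards."""
    unknown = set(overrides) - set(EVENTS)
    if unknown:
        raise ValueError(f"unknown reward events: {sorted(unknown)}")
    table = {event: 0.0 for event in EVENTS}
    table["another_turn"] = 1.0
    table.update(overrides)
    return table


def turn_reward(events, table):
    """Sum the rewards of all events that happened to one snake this turn."""
    return sum(table[event] for event in events)


# Example design: reward eating, punish hitting a wall.
table = make_reward_table(ate_food=2.0, hit_wall=-10.0)
print(turn_reward(["another_turn", "ate_food"], table))
```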
Reinforcement learning module
Amazon Sagemaker
Reinforcement learning module
Amazon Sagemaker
Reinforcement learning module
Amazon Sagemaker
Agent
Environment
Actions
Rewards
State
Agent
Agent Snake model
Training
Reinforcement learning module
Amazon Sagemaker
• Neural network parameters
• Learning parameters
• Environment configurations
Agent
Environment
Actions
Rewards
State
Agent
Agent Best snake
Hyperparameter optimisation
Reinforcement learning module
Amazon Sagemaker
-1 -1 -1 -1 -1 -1
-1 0 0 0 0 -1
-1 0 0 0 0 -1
-1 0 0 0 0 -1
-1 0 1 0 0 -1
-1 -1 -1 -1 -1 -1
Agent
Environment
Actions
Rewards
State
Agent
Agent
0 0 0 0
0 0 0 0
0 0 0 0
0 1 0 0
Snake with
Border
Snake without
border
Reinforcement learning module
Amazon Sagemaker
Agent
Environment
Actions
Rewards
State
Agent
Agent
Battlesnake reinforcement learning starter pack
Agent Environment
1. One-click deploy module: one-click deployment of an existing model
2. Reinforcement learning module: training your own Battlesnake reinforcement learning model
3. Heuristics module: build custom rules on top of an existing model
Custom rules with the heuristic module
Provides a starting point for you to build upon
Direction of movement
Pos. snakes+food
Reward?
Trained AI snake model
Custom rules with the heuristic module
Provides a starting point for you to build upon
Direction of movement
Pos. snakes+food
Reward?
Trained AI snake model
Custom rules with the heuristic module
Code
Custom rules with the heuristic module
Code
Custom rules with the heuristic module
Situation simulation
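A custom rule can be as small as a safety check applied to the model's output. The sketch below keeps the model's preferred move when it is safe and otherwise falls back to the best-scoring safe move. The function names, the score dictionary and the (row, col) convention are assumptions for illustration and do not reflect the heuristics module's actual interface.

```python
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def safe_moves(head, board_size, occupied):
    """Moves that stay on the board and do not enter an occupied cell."""
    safe = []
    for name, (dr, dc) in MOVES.items():
        r, c = head[0] + dr, head[1] + dc
        on_board = 0 <= r < board_size and 0 <= c < board_size
        if on_board and (r, c) not in occupied:
            safe.append(name)
    return safe


def apply_heuristics(model_scores, head, board_size, occupied):
    """Return the model's best move among the safe ones (or its best move if none is safe)."""
    safe = safe_moves(head, board_size, occupied)
    if not safe:
        return max(model_scores, key=model_scores.get)
    return max(safe, key=lambda move: model_scores[move])


# The model prefers "left", but the snake's head is already in the left-most column.
scores = {"up": 0.1, "down": 0.2, "left": 0.9, "right": 0.5}
move = apply_heuristics(scores, head=(2, 0), board_size=4, occupied={(1, 0)})
print(move)
```

A rule that anticipates another snake's next move could, under the same assumptions, add the cells that snake can reach to `occupied` before calling `apply_heuristics`.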
Battlesnake reinforcement learning starter pack
Agent Environment
Training your own battlesnake
reinforcement learning model
2. Reinforcement learning module
Build custom rules on top of an
existing model
3. Heuristics module
One-click deployment of an
existing model
1. One-click deploy module
One-click deployment with sagemaker
• After you train your own model
• After writing your custom rules
• Using the existing pretrained snake
One-click deployment with sagemaker

Battlesnake AWS ML Meetup Victoria 2020

Editor's Notes

  1. Hello everyone, my name is Jonathan Chung and I’m an applied scientist at AWS. My colleague here is Xavier and he’s a solutions architect at AWS.
  2. Since you are in this talk, I presume everyone knows about Battlesnake, but I’ll give a brief description anyway. Whoever is old enough might remember the snake game on their phone. The snake moves around. When you hit the wall, you die. When you hit yourself, you also die. When you eat some food, you get longer. The aim of the game is to stay alive as long as possible. Battlesnake is an online version of the traditional snake game in which multiple snakes compete, and the winner is the snake that survives the longest.
  3. The main differences in the gameplay are: if your snake hits another snake’s head, the shorter snake dies; and every snake starts with 100 health and loses one health with every move. Eating food replenishes your health, and if your health falls to 0, you die.
  4. We built a Battlesnake starter pack that can be used by all types of developers. Firstly, the one-click deploy module will build a snake for you, deploy it on the cloud, and provide you with a URL. We want to demonstrate how easy it is, so here’s a quick demo of how to get a snake. Let’s go back to the modules. If you are an AI or reinforcement learning enthusiast, or you want to learn how to train your own reinforcement learning algorithm, or you are just too lazy to write your own and you want a starting point, you can use the reinforcement learning module. This module allows you to train your snake and then automatically deploy it after it has trained. Let’s say you don’t want to train a snake, but you want to make use of the existing model and add your own flair to it: you can use the heuristics module. In this module, you can write custom rules on top of existing models that override the commands of the AI.
  5. I’ll talk a bit about the reinforcement learning module and briefly about how reinforcement learning works. I’m not an expert in this field, so please feel free to stop me at any time. In reinforcement learning, you have an agent that interacts with an environment.
  6. Specifically, the agent provides actions, which change the state of the environment, and the environment provides rewards for the actions the agent took. The aim of the agent is to maximise the rewards.
  7. Let me give you some examples. Take a self-driving car. The car is the agent. The agent can, for example, accelerate, decelerate, turn left, turn right, etc., and these are the possible actions. The environment is the road, and the agent views it in the form of images of the road. The reward is the number of km the car has driven. For example, the car sees that the road is curving to the right. Then
  8. So to maximise the number of km driven, the agent (which is the car) will turn right. Let’s say the environment shows a stop sign. You might think the agent should just ignore it and keep going, because the reward is the number of km driven. But reinforcement learning is trained to maximise future rewards, and stopping will maximise the future rewards rather than the immediate rewards.
  9. There are many more examples, for example AlphaGo. The agent decides the position of the piece in the next move. The environment is the position of all the pieces on the board. The reward is simple: winning or losing.
  10. Can we guess what the Battlesnake one will be? OK, so what actions can the snake take? What will the state be? How about the reward? Actually, I didn’t write it because there are many different choices. I’ll get into that later. But Battlesnake is not just one snake.
  11. This is called a multiagent reinforcement learning problem, where there are multiple agents each providing their own actions. Examples of multiagent reinforcement learning problems include StarCraft or AlphaStar, which controls each one of the units separately. Essentially, the configuration of the problem is the same. Each agent provides actions which interact with the environment, and they receive rewards.
  12. So let me explain the pseudocode of how reinforcement learning works in the single-agent case. In this module, we developed an implementation of deep Q learning. It works like a simulation: you make the agent take actions given a state, then you record the action and what happened. Specifically, you define the total number of games you want to simulate in episodes. Then you get the initial state and you play the game until the agent (which is the snake) dies. At each time step, the agent takes an action. At the start, the agent simply chooses an action at random. (I’ll explain this condition later.) Then you apply the action to the environment and you get the resulting state and reward. You keep repeating these steps while storing the next_state, state, action and reward in memory. When you have enough simulated results, you set the agent to learn from the memory, and I’ll explain how later. Once the agent starts to learn how to play, you give it more chances to provide actions through the else statement here. These actions are what the agent thinks are the best actions to take given the state. So as your agent gets better, you get more simulated results. For example, in Battlesnake, this way you can get simulation results of when the snakes are larger, or when there is a scarcity of food, etc.
  13. I’ll explain how the training routine accommodates multiple agents. Firstly, the training loop keeps running only while more than 1 snake is alive. If there is only 1 snake left, that snake has won. Next, instead of performing 1 action, each agent performs its own action. The remaining steps are similar to the 1-agent case.
  14. Let me first explain the environment. We modeled the rules of the Battlesnake engine, which include how the snake moves, eats, grows, and dies, based on the OpenAI Gym. And this is a representation of the snakes. The environment takes in the actions, then moves the snakes accordingly. Afterwards, the rewards and the states are emitted.
  15. Next, I’ll explain the state. Let’s say we have a very small board and here is the food. We also have 3 snakes, 1 orange long one and 2 short ones.
  16. So we represented the state with an image. Each agent represents 1 snake, so if we have 3 snakes, we need 3 agents. The agents are fed specific images. For example, the agent representing the orange snake will be fed an image like this. Similar to an RGB image with 3 channels, the first channel provides the information about the food. The second channel provides information about the orange snake, which is the snake that the agent is representing. The third channel has information about all the other snakes. We also provided a border of -1s to indicate the wall. We found that it’s easier for the algorithm to avoid the walls this way. Also, we put a 5 to represent the head and 1s to represent the body, because the head is more important.
  17. Similarly for the green snake.
  18. And for the blue snake.
  19. So we used a very simple reward function. Every time the snake lives another turn, it is given another reward.
  20. Let’s talk about the agent now. The agent takes the state and figures out which direction to move in. Note that the reward here is only used during the learning process. So how does it figure out which direction to take?
  21. We use a neural network to learn this behavior. The input is an image and the output is of size 4, representing up, down, left, right.
  22. Given the image representation of the environment, in the methodology presented, the neural network is trained to predict the total expected reward of each move. Basically, you want to take the action that gives you the highest total expected reward. For example, given this snake, the expected reward of moving up is 0, because it’ll die immediately. Therefore, the neural network shouldn’t predict this action.
  23. So how does the neural network learn? Suppose you run the neural network once, like just now, and you get the predicted total expected rewards here. Suppose that you went right and you got an actual reward. Then you run the next step with the new state and you get another set of predicted total expected rewards. What is the relationship between the predicted total expected reward here and the new predicted total expected reward here?
  24. Let me try to explain it this way. At t = 0, the total expected reward covers everything into the future. At t = 1, it’s a bit different, because you took one action and you actually know what the reward is. You know that by going right, you got 1 reward.
  25. Formally, the total expected reward is typically denoted as Q. This component was described just now, where the difference between the current and the new predicted Q is related to the reward. The gamma term here is called the discount factor. This is a number between 0 and 1 which kind of accounts for opportunity costs, where future actions provide less reward. A common strategy for neural networks to learn is to incrementally update the weights of the neural network. The alpha term here is the learning rate, which determines how much to incrementally alter the weights of the network. Since you take the actions with the maximum rewards, the network will slowly learn to take the actions with the maximum reward.
  26. So that’s the basics of reinforcement learning. Feel free to ask me questions about it, or I can guide you to some more material. What I did was a very simple application of a deep Q network. There are many opportunities to improve the model I presented.
  27. Firstly, the method of representing the state could change. As you can imagine, this method requires a fixed map size. One possible method is to create a snake-centric representation, which means your snake is only provided with information about what its head is close to. Other possible representations include representing the snakes and food only as coordinates.
  28. The neural network we used was also very simple. An attention-like method was used to incorporate the snake health, ID, and turn count into a convolutional neural network to predict the actions.
  29. But I believe the most could be gained from the rewards design. The gym provides functionality to try to maximise or minimise these rewards, but we really didn’t investigate it too much.
  30. Let me go into a bit of technical detail about the module. We built it with Apache MXNet, and the solution was built with Amazon SageMaker.
  31. SageMaker provides methods to build, train and tune, and deploy the models.
  32. We know that not everyone has a 12 GB GPU at home to train an AI bot. So SageMaker allows you to train your own snake directly on the cloud. This way you can get your own snake model.
  33. As you can imagine, there are many different parameters that you need to decide on, for example how the network is designed, like how deep the network is. The best values for learning parameters such as the discount factor and the learning rate could be investigated. You could also test different methods to represent the state and the rewards.
  34. For example, say you want to investigate whether including a border in the state representation will help. In one run you include the border, and in the second run you remove the border. Then you compare the snakes to see which one works better. For example, using the optimization module, we found that the -1 borders actually make the agents significantly better.
  35. Once your model has been trained, you can deploy the model to the cloud. My colleague Xavier will give you more details about this later.
  36. So for developers who don’t necessarily want to train a new snake but want to make use of this environment and these deployment methods, you can use the heuristics module. Also, we know that you don’t have days and days to train an AI for every single situation out there. You can build custom rules to override the commands of the AI.
  37. Let’s say you are the pink snake in this situation. The AI tells you to go left, which is fine in this situation, but you know that if the blue snake just continues up, you are dead meat. So you can override the AI to go right instead; that way, you won’t die.
  38. Furthermore, we believe that we can streamline the development process this way. The suggested method is to set up the game engine on your own computer, write your code, then upload it to the cloud.
  39. So we decided to make use of the gym we built that simulates the battlesnake engine for you to develop your rules. After you are satisfied with your rules, the code will be automatically packaged and uploaded into the cloud.
  40. The heuristics module also provides a situation simulation component. For example, say you want your snake to be in this exact configuration and to see what it’ll do. You can define this in the gym and then try it out.
  41. Finally, Xavier will describe the deployment process
  42. In fact, our solution supports one-click deployment after you train your own model, after you write your custom rules, or even if you want to use our pretrained snake. The purpose of this is so that you can focus on developing your snake rather than worrying about how to deploy the snake to a web server.