DigiPen Machine Learning
Internship
Summer 2017
Christopher Eicher | Aakash Chotrani | Johann Saumer
Contents
Exploratory research
    Goal
    Work
    Wrap-up
Tic-Tac-Toe
    Goal
    Work
        Agents
        Environment
        State
        Recap
        Unintended consequences: Dangers of Reinforcement Learning
        Example
        Tic-Tac-Toe
    Wrap-up
Dodging Agent
    Goal
    Work Done
    Wrap-up
Frozen Lake
    Goal
    Work Done
    Wrap-up
Cart-Pole
    Goal
    Work Done
        TensorFlow Notes
        Final thoughts
    Wrap-up
Lunar Lander [In progress]
    Goal
    Work Done
    Wrap-up
Resources
    Resources Websites
    Tutorials
    Articles
    White Papers
    Data sets
    Books
Exploratory research
Chris, Aakash, Hans
Goal
To compile resources for future projects and figure out what kind of environment, libraries, and resources
the team will use moving forward.
Work
Decided to use TensorFlow because it is open source, well documented, and has plenty of tutorials on its website and on YouTube. We want to use the GPU version so that we can run experiments faster.
Considered the R programming language because it is commonly used in data science.
Decided to use Python because it is used heavily in machine learning and is so widely used that there are lots of learning materials for it; it is also great for personal development. Had some difficulties managing Python; this turned out to be the result of having multiple versions of Python installed (Python 2 vs. Python 3, 32-bit vs. 64-bit).
Found OpenAI, a non-profit research company with a library called Gym, which was made to fill a need in the reinforcement learning research community for benchmarks and standardized environments. There were some installation problems that took a while to fix, mostly around using the correct versions of Anaconda and CMake.
Decided to use Anaconda to manage our Python libraries, since TensorFlow and OpenAI Gym rely on a lot of libraries and Anaconda makes them easy to manage.
We've begun to compile a list of websites that have relevant tutorials for TensorFlow, Gym, Python, and reinforcement learning in general.
Started compiling a list of white papers related to machine learning, specifically reinforcement learning.
Started taking important notes on the whiteboards and uploading them to Slack. This is a great way to share very specific and technical information with the team.
Compiled a list of websites that host data sets; these will be useful if we get to unsupervised and supervised learning.
Wrap-up
We will continue to update our lists of resources.
Tic-Tac-Toe
Aakash
Goal
The goal of the project was to apply reinforcement learning to play Tic-Tac-Toe instead of using hard-coded rules, and to explore a specific type of reinforcement learning called Q-learning.
Work
Reinforcement learning is very different from supervised learning and unsupervised learning.
Supervised learning interface: fit(X, Y) and predict(X).
Unsupervised learning interface: fit(X) and sometimes transform(X), which turns the input X into a different representation Z.
The interface for reinforcement learning is broader: it is an entire environment (real world or simulated world).
Supervised learning needs data labelled by humans, which is time consuming and costly.
Reinforcement learning needs no hand-labelled data.
Agents
RL agents train in a completely different way, with many references to psychology:
- model animal behavior
- the objective is a goal
AlphaGo's goal is to win Go. The goal of a video game AI is to win the game and achieve the highest score.
Animals/humans: the "selfish gene" (Richard Dawkins). Evolutionary psychologists have said that our genes are selfish and only want to make more of themselves.
Example: why do people want to be rich? Wealth leads to better healthcare and social status, which helps the genes maximize their goal. Wealth has no physical relationship to genes, yet it is a novel solution to the problem.
Environment
The agent gets feedback by interacting with the environment.
State
Humans and AIs alike never sense the entire world/universe at once. We have sensors (sight, sound, touch) which feed signals from the environment to our brain. The measurements we get from these sensors make up a "state".
Tic-tac-toe game: how many states? Each location has 3 possibilities (empty, X, or O) and there are 9 locations on the board, so
#states = 3 * 3 * ... * 3 = 3^9 = 19,683
Recap:
1) Agent
2) Environment
3) State
4) Rewards/punishments: how well or badly the AI is doing; always a real number
5) Actions: a finite set of actions. In a 2D video game: up, down, left, right, jump
Unintended consequences: Dangers of Reinforcement Learning
Example:
Goal: minimize human deaths.
The AI decides that since the number of humans grows exponentially, more people will die in the future, so it is best to destroy everyone now to minimize dying in the future.
SAR triples
(State, Action, Reward)
Notation: (s, a, r)
Timing is important in RL: taking action A(t) in state S(t) leads to state S(t+1).
Notation: (s, a, s')
Tic-Tac-Toe
How would a first-year computer science student program a tic-tac-toe game? By programming all the general rules.
Example: if the board is empty, the first move should be the middle or a corner.
Example: if the opponent has two pieces in a row, block the third position so that they don't win the game.
Example: if we have two pieces in a row, add a third to win the game.
It will look like a bunch of if-else statements, and the agent will only be able to play tic-tac-toe, which goes against the idea of machine learning. We want one algorithm that can play different games, so we need something better: reinforcement learning.
New terms:
Episode: one run of a tic-tac-toe game, until a win, loss, or draw.
Our RL agent will learn over 1,000, 10,000, or 100,000 episodes (it depends on how long the game is and how complicated its states are).
Terminal state: no more actions can be taken; the episode has ended.
How do we reward good behavior and punish bad behavior? Try not to build any prior knowledge into the AI: tell the agent WHAT you want it to achieve, not HOW you want it to be achieved.
Intro to scenarios
Planning scenario:
Suppose there is an exam tomorrow.
Hang out with friends -> feel happy (positive)
Study -> feel bored (negative)
Why study? We don't think only of immediate rewards but of future rewards too. Hence we want to assign a value to the current state that reflects the future as well; call this the "value function".
Credit assignment scenario:
Suppose you got your dream job at a company. What actions did you take in the past so that you are receiving the reward right now?
Delayed rewards
Two directions of thinking about a delayed reward:
Credit assignment: present (receiving the reward) <- because of an action in the past
Planning: present (do the action now) -> to receive a reward in the future
Value function: a measure of the future rewards we might get. The value tells us the future goodness of a state.
Reward vs. value
Value is a measure of future goodness. Example: standing in front of a Goomba puts you in a position to jump on it in the next few states.
Reward is immediate goodness. Example: jumping on a Goomba immediately increases your score.
Reward is the goal, but we can't use reward alone to guide actions because it doesn't tell us anything about future rewards.
V(s) = E[all future rewards | S(t) = s]
where
s = state (the input)
E = expected value, E[X] = the average of X
Finding V(s)
Algorithm:
Step 1) Initialize V(s):
    V(s) = 1 if s is a winning state
    V(s) = 0 if s is a losing state or a draw
    V(s) = 0.5 otherwise
Step 2) Update V(s) after each episode:
    V(s) <- V(s) + alpha * (V(s') - V(s))
Note: the terminal state never gets updated since it has no next state.
Pseudocode:
for t in range(max_iterations):          # loop over episodes
    state_history = play_game()
    s = state_history[0]
    for s_next in state_history[1:]:
        V[s] = V[s] + learning_rate * (V[s_next] - V[s])
        s = s_next
Playing the game
How do we actually play the game? Take random actions? No! We have a value function.
Pseudocode:
maxV = 0
maxA = None
for a, s_next in possible_next_states:
    if V[s_next] > maxV:
        maxV = V[s_next]
        maxA = a
perform action maxA
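Always taking the highest-value move can get stuck on the first decent line of play the agent finds. A common refinement is epsilon-greedy selection; this is not part of the algorithm above, just a hedged sketch, where epsilon, a value dict V, and possible_next_states are assumed names:

import random

def choose_action(V, possible_next_states, epsilon=0.1):
    # With probability epsilon, explore: take a random legal move.
    if random.random() < epsilon:
        return random.choice([a for a, s_next in possible_next_states])
    # Otherwise exploit: pick the move whose resulting state has the highest value.
    best_action, best_value = None, float('-inf')
    for a, s_next in possible_next_states:
        if V.get(s_next, 0.5) > best_value:   # unseen states keep the 0.5 initial value
            best_value = V.get(s_next, 0.5)
            best_action = a
    return best_action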
Wrap-up
By implementing this project I came across different types of learning strategies and how each one differs from the others. I learned about reinforcement learning terms such as agent, environment, and state, why we need to consider future rewards instead of relying only on the current reward, and how to implement a value function for learning. In the future I would like to improve the presentation of the project's output by implementing a better-looking GUI.
Dodging Agent
Hans
Goal:
The goal was to develop a small, discrete problem that could be easily solved with machine learning
so that I could focus on understanding the implementation details.
Work Done:
The problem that I made was confined to a small grid, 2 rows by 3 columns. An agent would inhabit
the top row, and could move between the adjacent spaces. It could also choose to do nothing and
not move. A projectile would be spawned in the bottom row and would move into the top row on the
next update. The agent would then try to dodge the arrow by moving out of the way or not moving
into the arrow’s path.
The first thing I did was conduct some research into the topic of reinforcement learning. I only had a
basic understanding of the topic. I studied some of the general problems involved in every machine
learning problem, such as the trade-off between exploration and exploitation.
I also researched extensively the use of genetic algorithms in reinforcement learning problems. I was very interested in how this approach adapts the ideas of reinforcement learning into its own strategy of learning: the agent is rewarded by giving it a fitness score, and exploration is done through crossover and mutation operations, each controlled by its own rate. As much as this topic intrigued me, I decided it would be easier and more beneficial to start with a more traditional approach to solving reinforcement problems.
The topic of exploration and exploitation seemed crucial to developing a decent learning program, so I took some extra time to explore the various methods and how they impacted the learning. As explained later on, I wanted a method that explores heavily early on, but exploits more often once the program has exhausted its options.
I also researched Markov Decision Processes because random actions without some sort of
probability distribution will lead to poor performance. MDPs are used in these cases as they
incorporate probability into the action/reward system.
After learning a good portion of the material, I got started on my application. I chose to implement it in C++ because it is the language I am most familiar with. I had to program both the problem and the learning system, but since the problem was simple it did not take long to write up the whole thing. I also wrote a driver so that I could run different tests and collect data from them.
The first iteration of the system is as was described in the abstract: an agent trying to dodge a single
projectile. Every timestep, the agent will either explore a random action or perform the most
rewarding action based on a function that determines the explore rate. For this problem I am using
the given explore rate raised to the timestep, so that it diminishes over time. A value of 1 for the
base explore rate results in the agent always choosing a random action. An explore rate of 0 results
in the agent always choosing the best action. If the agent tries to exploit but does not know a single
best action, it will select a random action from the best actions available.
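The project itself was written in C++; the sketch below is just a Python illustration of the diminishing explore rate described above, assuming the action/reward table is a nested dict called q (the names here are placeholders, not the actual code):

import random

def pick_action(q, state, actions, base_explore_rate, timestep):
    # Explore probability: the base explore rate raised to the timestep, so it decays over time.
    explore_prob = base_explore_rate ** timestep
    if random.random() < explore_prob:
        return random.choice(actions)                       # explore: random action
    best = max(q[state][a] for a in actions)                # exploit: highest total reward so far
    best_actions = [a for a in actions if q[state][a] == best]
    return random.choice(best_actions)                      # tie-break randomly among the best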
To determine the best action, I paired every action with the total reward from choosing that action. These rewards were all initialized to 0 to avoid building bias into the system. In a discrete problem it would be simple to input the values that we consider correct; however, the goal is to allow the program to learn, so I did not input these values. The reward gets updated after every timestep, when the system determines whether the agent got hit. If the agent did not get hit, the reward is incremented by 1; if it did get hit, the reward is decremented by 1.
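A minimal sketch of that bookkeeping (again a Python illustration of the C++ logic; q is the same hypothetical table as above):

def update_reward(q, state, action, got_hit):
    # Every (state, action) total starts at 0 to avoid bias, then we
    # add 1 when the agent dodges and subtract 1 when it gets hit.
    q[state][action] += -1 if got_hit else 1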
The agent knew what position it was in and the position of the projectile. It did not know what the next state would be. In this iteration of the problem, that did not present any issues: the agent did not have to know the probability of the next state it would end up in. It was very short-sighted, deciding how to move based only on the current state. This allowed the problem to remain extremely small.
Having a very small, discrete, short-sighted problem is the main reason I chose a method that causes the program to stop exploring after a small number of iterations. I knew that all of the options would be exhausted very quickly while exploring, so the agent would soon be able to start choosing the best option.
I ran a few tests to see whether the agent was learning correctly. I first ran it with an explore factor of 1 to analyze what happens with random inputs. The results appeared as I suspected: about a third of the time the agent got hit. I also ran tests with an explore factor of 0. These results were also as I had suspected. The agent would get hit between 0 and 9 times; if it did get hit it would try a different action and be successful, and every time after that it would perform the successful action. I also ran tests with an explore factor of 0.99 to allow the explore algorithm to work its magic. This allowed the agent to experiment with finding alternative solutions other than the first solution it finds, as it did in the previous test. The data appears very similar; this is due to the fact that it is always training and updating the rates of success, which causes it to tend towards a single solution as well. This test also reveals that, with the current method of exploring, the agent will eventually not get hit at all once it has found a solution.
Wrap-up:
This project was a decent entry point into the topic of machine learning. I was able to implement basic reinforcement learning that allowed an agent to learn how to optimally solve a problem. The first thing I would do to move this project forward is to write up different methods of selecting actions and different methods of exploring, because I believe how these methods are implemented can greatly influence the outcome of the problem.
I would also try to make the problem more robust by allowing additional projectiles to spawn. This adds many more states and overall creates a more interesting problem. The agent would then have to think about how its current action will impact the next action it takes, and it will have to use the probability of advancing to each state in its decision. Currently the probability of a projectile being in a given space is distributed equally, but I would like to investigate the effects of using a different probability distribution for spawning projectiles.
By implementing these changes I would be able to get a much larger grasp of how these parameters
can affect the agent’s capability of reaching an optimal solution.
Frozen Lake
Chris
Goal
To dive head first into machine learning. I found a tutorial that used Q-learning to solve the Frozen Lake problem.
Work Done
Spent some time learning how Q-learning works and understanding a variation of the Bellman equation, to understand intuitively why it works. Refactored, tweaked, and played with some example code from a tutorial that implements Q-learning with a table.
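A minimal sketch of that tabular approach, in the spirit of the tutorial code I worked from (the hyperparameter values here are placeholders, not the tutorial's exact numbers):

import gym
import numpy as np

env = gym.make('FrozenLake-v0')
Q = np.zeros((env.observation_space.n, env.action_space.n))   # one row per state, one column per action
lr, gamma, num_episodes = 0.8, 0.95, 2000

for episode in range(num_episodes):
    s = env.reset()
    done = False
    while not done:
        # Pick the greedy action plus decaying random noise, so early episodes explore more.
        a = np.argmax(Q[s, :] + np.random.randn(env.action_space.n) / (episode + 1))
        s_next, reward, done, _ = env.step(a)
        # Bellman-style update: move Q(s, a) toward reward + gamma * max_a' Q(s', a').
        Q[s, a] += lr * (reward + gamma * np.max(Q[s_next, :]) - Q[s, a])
        s = s_next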
Used Matplotlib to visualize the Q-learning process. This allowed me to see how the agent moved around its environment and made me realize there were some interesting problems with how it was making decisions, like the fact that it doesn't care about the shortest path. It can feel incentivised to move around in circles or make effectively no-op moves because, in the simple implementation, it doesn't get penalized for that.
Learned a lot about how to use Matplotlib. I plan on using what I know to help visualize data for us going forward, so that as we run experiments we can see how the algorithms are working over time.
Refactored and tweaked code that used TensorFlow to do nearly the same thing; it was basically wrapping a network around the table, except it was far less accurate because we couldn't update the table directly and had to go through an optimizer.
Wrap-up
This was a good primer for learning how to use the Bellman equation and TensorFlow. I wanted to tweak its parameters more and use a few more special techniques to see if I could get the success rate higher, but I felt it was important to move on to using TensorFlow to solve more interesting problems with the rest of the team.
Cart-Pole
Aakash
Goal
To use TensorFlow to build a neural net that solves OpenAI Gym’s Cart Pole.
Work Done
Started by exploring OpenAI Gym's classic control environments. Read all the documentation to get started and to install Gym. Initially I had a lot of problems getting OpenAI Gym set up on my machine because there were multiple versions of Python installed, so I had to remove the previous versions.
I had to follow the pip install procedure in the documentation:

git clone https://github.com/openai/gym
cd gym
pip install -e .  # minimal install

I couldn't download all of the Gym packages this way: the previous snippet installs the classic control version of Gym, but if we need all of the Gym packages, such as the Atari games and the Box2D package, we need to run pip install gym[all].
After getting everything set up, I copy-pasted the code snippet from the OpenAI documentation which creates an environment and plays 20 random episodes. The documentation clearly explains what observation, reward, done, and info mean.
import gym

env = gym.make('CartPole-v0')
for i_episode in range(20):
    observation = env.reset()
    for t in range(100):
        env.render()
        print(observation)
        action = env.action_space.sample()   # take a random action
        observation, reward, done, info = env.step(action)
        if done:
            print("Episode finished after {} timesteps".format(t + 1))
            break
There is an OpenAI leaderboard which shows the different algorithms people have uploaded to solve the problem. Most of the solutions used neural networks and were difficult to grasp, so I decided to install TensorFlow and explore its getting-started documentation.
Then I had to follow a bunch of TensorFlow tutorials online on how to install it. There are two different TensorFlow packages: a GPU version and a CPU version. The CPU version is easy to install but slower; the GPU version requires an Nvidia graphics card. I also had to update the graphics card drivers and install the Nvidia CUDA package before installing TensorFlow.
After getting TensorFlow working, I followed its documentation to learn the basic terms and took notes: https://www.tensorflow.org/get_started/get_started
I also explored the MNIST program, which is the "hello world" of TensorFlow.
TensorFlow Notes
TensorFlow provides multiple APIs. The lower-level API is TensorFlow Core. The higher-level APIs are built on top of the lower-level one; they are easier to learn and make repetitive tasks easier to implement.
TENSOR: the central unit of data in TensorFlow; a set of primitive values shaped into an array of any number of dimensions.
RANK of a TENSOR:
3                            # rank 0: a scalar, shape []
[1, 2, 3]                    # rank 1: a vector, shape [3]
[[1, 2, 3], [4, 5, 6]]       # rank 2: a matrix, shape [2, 3]
[[[1, 2, 3]], [[7, 8, 9]]]   # rank 3, shape [2, 1, 3]
TensorFlow Core programs have two sections:
1) Building the computational graph
2) Running the computational graph
Computational graph: a series of TensorFlow operations arranged into a graph of nodes. Each node takes zero or more tensors as input and produces a tensor as output. A constant node takes no input; its output is a value that is stored internally.
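For example, the getting-started guide builds and runs a tiny graph of constant nodes like this (TensorFlow 1.x style, as used at the time):

import tensorflow as tf

node1 = tf.constant(3.0, dtype=tf.float32)   # constant node: no input, stores its value internally
node2 = tf.constant(4.0)
node3 = tf.add(node1, node2)                 # takes two tensors as input, outputs one tensor

sess = tf.Session()                          # running the graph requires a session
print(sess.run([node1, node2]))              # [3.0, 4.0]
print(sess.run(node3))                       # 7.0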
MNIST is like the "hello world" for starting TensorFlow: it consists of a set of labelled images, and our goal is to build a TensorFlow model to predict the labels.
The data is split into three parts:
1) 55,000 data points of training data
2) 10,000 data points of test data
3) 5,000 data points of validation data
The MNIST dataset has two parts:
1) an image of a handwritten digit (X)
2) the corresponding label (Y)
Each image is 28*28 pixels, hence a big array of numbers (28*28 == 784 numbers).
mnist.train.images is a tensor (an n-dimensional array) with a shape of [55000, 784]
The first dimension is an index into the list of images and the second dimension is the index for each pixel in
each image.
Each entry in the tensor is a pixel intensity between 0 and 1, for a particular pixel in a particular image.
Each image in MNIST has a corresponding label, a number between 0 and 9 representing the digit drawn in
the image.
mnist.train.labels is a [55000, 10] array of floats.
Softmax Regression
If you want to assign probabilities to an object being one of several different things, softmax is the thing to use, because softmax gives us a list of values between 0 and 1 that add up to 1. Even later on, when we train more sophisticated models, the final step will be a layer of softmax.
The two steps of softmax regression:
1) Add up the evidence of our input being in certain classes (note: a weighted sum of pixel intensities; if a pixel matches, the weight is positive, otherwise negative).
2) Convert that evidence into probabilities.
Softmax: exponentiate the inputs and then normalize them. Exponentiation means that one more unit of evidence increases the weight given to a hypothesis multiplicatively. Softmax then normalizes these weights so that they add up to one, forming a valid probability distribution.
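To make "exponentiate, then normalize" concrete, here is a small NumPy illustration (my own sketch, not part of the tutorial):

import numpy as np

def softmax(evidence):
    exps = np.exp(evidence - np.max(evidence))   # subtract the max for numerical stability
    return exps / np.sum(exps)                   # normalize so the outputs sum to 1

print(softmax(np.array([2.0, 1.0, 0.1])))        # approximately [0.659 0.242 0.099]

The tutorial then builds a one-layer softmax model in TensorFlow: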
x = tf.placeholder(tf.float32, [None, 784])   # input images, each flattened to 784 pixel values
W = tf.Variable(tf.zeros([784, 10]))          # weights
b = tf.Variable(tf.zeros([10]))               # biases
y = tf.nn.softmax(tf.matmul(x, W) + b)        # predicted probabilities for the 10 digit classes
In machine learning we typically define what it means for a model to be bad. We call this the cost, or the loss, and it represents how far off our model is from our desired outcome. We try to minimize that error, and the smaller the error margin, the better our model is.
CROSS-ENTROPY: measures how inefficient our predictions are at describing the truth.
Using small batches of random data is called stochastic training -- in this case, stochastic gradient descent.
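In the MNIST tutorial, the cross-entropy loss and the stochastic training loop look roughly like the following (TensorFlow 1.x, continuing from the model above; the exact constants may differ):

y_ = tf.placeholder(tf.float32, [None, 10])   # the true one-hot labels
# Cross-entropy between the true distribution y_ and the predicted distribution y.
cross_entropy = tf.reduce_mean(-tf.reduce_sum(y_ * tf.log(y), reduction_indices=[1]))
train_step = tf.train.GradientDescentOptimizer(0.5).minimize(cross_entropy)

sess = tf.Session()
sess.run(tf.global_variables_initializer())
for _ in range(1000):
    batch_xs, batch_ys = mnist.train.next_batch(100)   # a small random batch: stochastic training
    sess.run(train_step, feed_dict={x: batch_xs, y_: batch_ys})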
Final thoughts
It is still difficult to understand softmax regression and cross-entropy. I started learning more about the algorithm, but it required a higher-level understanding of fuzzy logic.
After getting familiar with TensorFlow and the basic concepts of neural networks, I implemented a neural network in my algorithm to solve the Cart-Pole problem.
The algorithm plays 1000 games by taking random actions, moving the cart left or right at each frame. If a particular game scores more than 50, the algorithm saves the observations and the actions the cart took to achieve that score. This data serves as the training data for the neural network. Note: I did not render any games during training because it would be very slow to render 1000 games.
Then I created a model containing 5 layers with 128, 256, 512, 256, and 128 neurons, using softmax regression as the learning algorithm. The training data is then used to train the neural network. After the network is trained, it plays 10 games, and we can see how well it performs by rendering each game. At the end, the algorithm prints the final score and how many times it took each action, that is, how many times it decided to go left or right.
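As a rough sketch of the shape of such a network (written here with TFLearn, which is one way to build it; the layer sizes follow the description above, the other names are placeholders, and the actual code is in the GitHub link below):

import tflearn

def build_model(input_size=4, n_actions=2, learning_rate=1e-3):
    net = tflearn.input_data(shape=[None, input_size])
    for width in [128, 256, 512, 256, 128]:          # the five hidden layers described above
        net = tflearn.fully_connected(net, width, activation='relu')
    net = tflearn.fully_connected(net, n_actions, activation='softmax')
    net = tflearn.regression(net, optimizer='adam', learning_rate=learning_rate,
                             loss='categorical_crossentropy')
    return tflearn.DNN(net)

# model = build_model()
# model.fit(training_observations, training_actions, n_epoch=5, show_metric=True)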
It was very difficult and time consuming to tweak each parameter to find a better output. I tried different learning rates to see how they affected the final result.
In the end I uploaded my solution to open ai gym official website:
https://gym.openai.com/evaluations/eval_HhJcddvETPu16QBl1hjYw
You can take a look at my solution on github:
https://github.com/akuchotrani/MyOpenAIGym/blob/master/CartPoleWithTensorFlow.py
Wrap-up
I still do not fully understand how softmax regression works. I looked at various tutorials online, but it requires a deeper understanding of fuzzy logic and neural networks. I will try to apply the same algorithm to Lunar Lander by changing the environment, the set of actions, and the minimum score requirement.
Lunar Lander [In progress]
Chris, Aakash, Hans
Goal
We want to use TensorFlow to build a neural net that solves OpenAI Gym’s Lunar Lander.
Work Done
Got Lunar Lander and TensorFlow working on everyone's machine. Got a simple neural network running. We set up a GitHub repository so we can collaborate.
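To confirm the environment works on each machine, random rollouts along the same lines as the earlier Cart-Pole snippet are enough (a sketch, not our training code):

import gym

env = gym.make('LunarLander-v2')   # needs the Box2D extras, e.g. pip install gym[box2d]
for episode in range(5):
    observation = env.reset()
    done, total_reward = False, 0.0
    while not done:
        env.render()
        action = env.action_space.sample()   # one of the 4 discrete actions
        observation, reward, done, info = env.step(action)
        total_reward += reward
    print("Episode {} reward: {:.1f}".format(episode, total_reward))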
Wrap-up
Resources
Resources Websites:
R (https://cran.r-project.org/ )
Tensor flow https://www.tensorflow.org/
OpenAI Gym https://gym.openai.com/
OpenAI Universe https://blog.openai.com/universe/
Kaggle - Data science competitions https://www.kaggle.com/
RStudio (https://www.rstudio.com/products/rstudio/download2/ )
Anaconda (https://www.continuum.io/downloads#windows Python 3.6 version)
Keras: The Python Deep Learning library https://keras.io/
Practical Deep Learning For Coders—18 hours of lessons for free http://course.fast.ai/
OpenCV (Open Source Computer Vision Library) http://opencv.org/
Python Programming Tutorials https://pythonprogramming.net/
Python 3.6.2rc1 Documentation https://docs.python.org/3/
TFLearn: Deep learning library featuring a higher-level API for TensorFlow http://tflearn.org/
Matplotlib 1.5.1 documentation https://matplotlib.org/1.5.1/index.html
NumPy http://www.numpy.org/
SciPy https://scipy.org/
Tutorials:
Q-Learning https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-d195264329d0
Pacman and reinforcement learning
https://inst.eecs.berkeley.edu/~cs188/sp12/projects/reinforcement/reinforcement.html
YouTube Python tutorial: https://www.youtube.com/watch?v=kkQlyDMa-h0
PyCharm: https://www.jetbrains.com/pycharm/
Articles:
Artificial neural network https://en.wikipedia.org/wiki/Artificial_neural_network
Reinforcement learning https://en.wikipedia.org/wiki/Reinforcement_learning
Markov decision process https://en.wikipedia.org/wiki/Markov_decision_process
Deep Learning https://en.wikipedia.org/wiki/Deep_learning
Q-learning https://en.wikipedia.org/wiki/Q-learning
Pac-Man:
http://www.ias.tu-darmstadt.de/uploads/Site/EditPublication/Hochlaender_BScThesis_2014.pdf
Demystifying Deep Reinforcement Learning https://www.intelnervana.com/demystifying-deep-reinforcement-learning/
White Papers:
Continuous Control with Deep Reinforcement Learning https://arxiv.org/pdf/1509.02971.pdf
Playing Atari with Deep Reinforcement Learning https://arxiv.org/pdf/1312.5602.pdf
Mining Muscle Use Data for Fatigue Reduction in IndyCar http://www.sloansportsconference.com/wp-content/uploads/2017/02/1622.pdf
Real-Time Decision Making in Motorsports: Analytics for Improving Professional Car Race Strategy https://dspace.mit.edu/bitstream/handle/1721.1/100310/931596281-MIT.pdf?sequence=1
Data sets:
ImageNet http://www.image-net.org/
Titanic https://www.kaggle.com/c/titanic
Quandl Financial, Economic and Alternative Data https://www.quandl.com/
Alphabetical list of free/public domain datasets with text data for use in Natural Language Processing (NLP)
https://github.com/niderhoff/nlp-datasets
List of datasets for machine learning research - Wikipedia
https://en.wikipedia.org/wiki/List_of_datasets_for_machine_learning_research
Books:
AI techniques for game programming by Mat Buckland
Bayesian Reasoning and Machine Learning by David Barber
Learn Python the Hard Way: A Very Simple Introduction to the Terrifyingly Beautiful World of Computers and Code, Third Edition, Video-Enhanced Edition, by Zed A. Shaw http://proquestcombo.safaribooksonline.com/book/programming/python/9780133124316
More Related Content

Similar to Reinforcement Learning

 Towards Reproducible Data Analysis Using Cloud and Container Technologies
 Towards Reproducible Data Analysis Using Cloud and Container Technologies Towards Reproducible Data Analysis Using Cloud and Container Technologies
 Towards Reproducible Data Analysis Using Cloud and Container Technologiesinside-BigData.com
 
Polyrepresentation in Complex (Book) Search Tasks - How can we use what the o...
Polyrepresentation in Complex (Book) Search Tasks - How can we use what the o...Polyrepresentation in Complex (Book) Search Tasks - How can we use what the o...
Polyrepresentation in Complex (Book) Search Tasks - How can we use what the o...Ingo Frommholz
 
Final report 1.0 - Good Practice Report
Final report 1.0 - Good Practice ReportFinal report 1.0 - Good Practice Report
Final report 1.0 - Good Practice ReportMike KEPPELL
 
HJohansen (Publishable)
HJohansen (Publishable)HJohansen (Publishable)
HJohansen (Publishable)Henry Johansen
 
O'Reilly ebook: Operationalizing the Data Lake
O'Reilly ebook: Operationalizing the Data LakeO'Reilly ebook: Operationalizing the Data Lake
O'Reilly ebook: Operationalizing the Data LakeVasu S
 
Dnle final project
Dnle final projectDnle final project
Dnle final projectMatthieu Cisel
 
Microsoft cloud migration and modernization playbook 031819 (1) (2)
Microsoft cloud migration and modernization playbook 031819 (1) (2)Microsoft cloud migration and modernization playbook 031819 (1) (2)
Microsoft cloud migration and modernization playbook 031819 (1) (2)didicadoida
 
Enterprise Ontology and Semantics
Enterprise Ontology and SemanticsEnterprise Ontology and Semantics
Enterprise Ontology and Semanticscurioz
 
EMDT_2
EMDT_2EMDT_2
EMDT_2PMI2011
 
Computational thinking v0.1_13-oct-2020
Computational thinking v0.1_13-oct-2020Computational thinking v0.1_13-oct-2020
Computational thinking v0.1_13-oct-2020Gora Buzz
 
2016 05-20-clariah-wp4
2016 05-20-clariah-wp42016 05-20-clariah-wp4
2016 05-20-clariah-wp4CLARIAH
 
Using Bets, Boards and Missions to Inspire Org-wide Agility
Using Bets, Boards and Missions to Inspire Org-wide AgilityUsing Bets, Boards and Missions to Inspire Org-wide Agility
Using Bets, Boards and Missions to Inspire Org-wide AgilityC4Media
 
Story points considered harmful - or why the future of estimation is really i...
Story points considered harmful - or why the future of estimation is really i...Story points considered harmful - or why the future of estimation is really i...
Story points considered harmful - or why the future of estimation is really i...Vasco Duarte
 
Group Partners approach to solving the right problem
Group Partners approach to solving the right problemGroup Partners approach to solving the right problem
Group Partners approach to solving the right problemHazel Tiffany
 
2009 Approach(N)
2009 Approach(N)2009 Approach(N)
2009 Approach(N)John Caswell
 
A Better Way to Design & Build Immersive E Learning
A Better Way to Design & Build Immersive E LearningA Better Way to Design & Build Immersive E Learning
A Better Way to Design & Build Immersive E Learningnarchambeau
 

Similar to Reinforcement Learning (20)

 Towards Reproducible Data Analysis Using Cloud and Container Technologies
 Towards Reproducible Data Analysis Using Cloud and Container Technologies Towards Reproducible Data Analysis Using Cloud and Container Technologies
 Towards Reproducible Data Analysis Using Cloud and Container Technologies
 
Polyrepresentation in Complex (Book) Search Tasks - How can we use what the o...
Polyrepresentation in Complex (Book) Search Tasks - How can we use what the o...Polyrepresentation in Complex (Book) Search Tasks - How can we use what the o...
Polyrepresentation in Complex (Book) Search Tasks - How can we use what the o...
 
Final report 1.0 - Good Practice Report
Final report 1.0 - Good Practice ReportFinal report 1.0 - Good Practice Report
Final report 1.0 - Good Practice Report
 
HJohansen (Publishable)
HJohansen (Publishable)HJohansen (Publishable)
HJohansen (Publishable)
 
O'Reilly ebook: Operationalizing the Data Lake
O'Reilly ebook: Operationalizing the Data LakeO'Reilly ebook: Operationalizing the Data Lake
O'Reilly ebook: Operationalizing the Data Lake
 
Dnle final project
Dnle final projectDnle final project
Dnle final project
 
Microaccess 2007
Microaccess 2007Microaccess 2007
Microaccess 2007
 
Microsoft cloud migration and modernization playbook 031819 (1) (2)
Microsoft cloud migration and modernization playbook 031819 (1) (2)Microsoft cloud migration and modernization playbook 031819 (1) (2)
Microsoft cloud migration and modernization playbook 031819 (1) (2)
 
Enterprise Ontology and Semantics
Enterprise Ontology and SemanticsEnterprise Ontology and Semantics
Enterprise Ontology and Semantics
 
EMDT_2
EMDT_2EMDT_2
EMDT_2
 
Handbook of e Learning Strategy
Handbook of e Learning StrategyHandbook of e Learning Strategy
Handbook of e Learning Strategy
 
Computational thinking v0.1_13-oct-2020
Computational thinking v0.1_13-oct-2020Computational thinking v0.1_13-oct-2020
Computational thinking v0.1_13-oct-2020
 
DigiSeniors Curriculum - Leaders Guide
DigiSeniors Curriculum - Leaders GuideDigiSeniors Curriculum - Leaders Guide
DigiSeniors Curriculum - Leaders Guide
 
2016 05-20-clariah-wp4
2016 05-20-clariah-wp42016 05-20-clariah-wp4
2016 05-20-clariah-wp4
 
Using Bets, Boards and Missions to Inspire Org-wide Agility
Using Bets, Boards and Missions to Inspire Org-wide AgilityUsing Bets, Boards and Missions to Inspire Org-wide Agility
Using Bets, Boards and Missions to Inspire Org-wide Agility
 
Story points considered harmful - or why the future of estimation is really i...
Story points considered harmful - or why the future of estimation is really i...Story points considered harmful - or why the future of estimation is really i...
Story points considered harmful - or why the future of estimation is really i...
 
Work History Narrative-TraceyJackson
Work History Narrative-TraceyJacksonWork History Narrative-TraceyJackson
Work History Narrative-TraceyJackson
 
Group Partners approach to solving the right problem
Group Partners approach to solving the right problemGroup Partners approach to solving the right problem
Group Partners approach to solving the right problem
 
2009 Approach(N)
2009 Approach(N)2009 Approach(N)
2009 Approach(N)
 
A Better Way to Design & Build Immersive E Learning
A Better Way to Design & Build Immersive E LearningA Better Way to Design & Build Immersive E Learning
A Better Way to Design & Build Immersive E Learning
 

More from Aakash Chotrani

Efficient Backpropagation
Efficient BackpropagationEfficient Backpropagation
Efficient BackpropagationAakash Chotrani
 
What is goap, and why is it not already mainstream
What is goap, and why is it not already mainstreamWhat is goap, and why is it not already mainstream
What is goap, and why is it not already mainstreamAakash Chotrani
 
Deep q learning with lunar lander
Deep q learning with lunar landerDeep q learning with lunar lander
Deep q learning with lunar landerAakash Chotrani
 
Course recommender system
Course recommender systemCourse recommender system
Course recommender systemAakash Chotrani
 
Artificial Intelligence in games
Artificial Intelligence in gamesArtificial Intelligence in games
Artificial Intelligence in gamesAakash Chotrani
 
Simple & Fast Fluids
Simple & Fast FluidsSimple & Fast Fluids
Simple & Fast FluidsAakash Chotrani
 
Supervised Unsupervised and Reinforcement Learning
Supervised Unsupervised and Reinforcement Learning Supervised Unsupervised and Reinforcement Learning
Supervised Unsupervised and Reinforcement Learning Aakash Chotrani
 

More from Aakash Chotrani (7)

Efficient Backpropagation
Efficient BackpropagationEfficient Backpropagation
Efficient Backpropagation
 
What is goap, and why is it not already mainstream
What is goap, and why is it not already mainstreamWhat is goap, and why is it not already mainstream
What is goap, and why is it not already mainstream
 
Deep q learning with lunar lander
Deep q learning with lunar landerDeep q learning with lunar lander
Deep q learning with lunar lander
 
Course recommender system
Course recommender systemCourse recommender system
Course recommender system
 
Artificial Intelligence in games
Artificial Intelligence in gamesArtificial Intelligence in games
Artificial Intelligence in games
 
Simple & Fast Fluids
Simple & Fast FluidsSimple & Fast Fluids
Simple & Fast Fluids
 
Supervised Unsupervised and Reinforcement Learning
Supervised Unsupervised and Reinforcement Learning Supervised Unsupervised and Reinforcement Learning
Supervised Unsupervised and Reinforcement Learning
 

Recently uploaded

BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.PraveenaKalaiselvan1
 
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tantaDashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tantaPraksha3
 
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.aasikanpl
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...SĂ©rgio Sacani
 
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.aasikanpl
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxAleenaTreesaSaji
 
Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Nistarini College, Purulia (W.B) India
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝soniya singh
 
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfAnalytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfSwapnil Therkar
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxAArockiyaNisha
 
Boyles law module in the grade 10 science
Boyles law module in the grade 10 scienceBoyles law module in the grade 10 science
Boyles law module in the grade 10 sciencefloriejanemacaya1
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCEPRINCE C P
 
A relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfA relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfnehabiju2046
 
zoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzohaibmir069
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...jana861314
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptxanandsmhk
 
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfBehavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfSELF-EXPLANATORY
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...RohitNehra6
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxUmerFayaz5
 
Module 4: Mendelian Genetics and Punnett Square
Module 4:  Mendelian Genetics and Punnett SquareModule 4:  Mendelian Genetics and Punnett Square
Module 4: Mendelian Genetics and Punnett SquareIsiahStephanRadaza
 

Recently uploaded (20)

BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
BIOETHICS IN RECOMBINANT DNA TECHNOLOGY.
 
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tantaDashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
Dashanga agada a formulation of Agada tantra dealt in 3 Rd year bams agada tanta
 
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Mayapuri Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
Call Girls in Munirka Delhi 💯Call Us 🔝9953322196🔝 💯Escort.
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptx
 
Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...Bentham & Hooker's Classification. along with the merits and demerits of the ...
Bentham & Hooker's Classification. along with the merits and demerits of the ...
 
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
Call Girls in Munirka Delhi 💯Call Us 🔝8264348440🔝
 
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdfAnalytical Profile of Coleus Forskohlii | Forskolin .pdf
Analytical Profile of Coleus Forskohlii | Forskolin .pdf
 
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptxPhysiochemical properties of nanomaterials and its nanotoxicity.pptx
Physiochemical properties of nanomaterials and its nanotoxicity.pptx
 
Boyles law module in the grade 10 science
Boyles law module in the grade 10 scienceBoyles law module in the grade 10 science
Boyles law module in the grade 10 science
 
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCESTERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
STERILITY TESTING OF PHARMACEUTICALS ppt by DR.C.P.PRINCE
 
A relative description on Sonoporation.pdf
A relative description on Sonoporation.pdfA relative description on Sonoporation.pdf
A relative description on Sonoporation.pdf
 
zoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistanzoogeography of pakistan.pptx fauna of Pakistan
zoogeography of pakistan.pptx fauna of Pakistan
 
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
Traditional Agroforestry System in India- Shifting Cultivation, Taungya, Home...
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdfBehavioral Disorder: Schizophrenia & it's Case Study.pdf
Behavioral Disorder: Schizophrenia & it's Case Study.pdf
 
Biopesticide (2).pptx .This slides helps to know the different types of biop...
Biopesticide (2).pptx  .This slides helps to know the different types of biop...Biopesticide (2).pptx  .This slides helps to know the different types of biop...
Biopesticide (2).pptx .This slides helps to know the different types of biop...
 
Animal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptxAnimal Communication- Auditory and Visual.pptx
Animal Communication- Auditory and Visual.pptx
 
Module 4: Mendelian Genetics and Punnett Square
Module 4:  Mendelian Genetics and Punnett SquareModule 4:  Mendelian Genetics and Punnett Square
Module 4: Mendelian Genetics and Punnett Square
 

Reinforcement Learning

  • 1. DigiPen Machine Learning Internship Summer 2017 Christopher Eicher | Aakash Chotrani | Johann Saumer
  • 2. Contents
Exploratory research ... 4
    Goal ... 4
    Work ... 4
    Wrap-up ... 4
Tic-Tac-Toe ... 4
    Goal ... 4
    Work ... 5
        Agents ... 5
        Environment ... 5
        State ... 5
        Recap ... 5
        Unintended consequences: Dangers of Reinforcement Learning ... 6
        Example ... 6
        Tic-Tac-Toe ... 6
    Wrap-up ... 8
Dodging Agent ... 8
    Goal ... 8
    Work Done ... 8
    Wrap-up ... 10
Frozen Lake ... 10
    Goal ... 10
    Work Done ... 10
    Wrap-up ... 11
Cart-Pole ... 11
    Goal ... 11
    Work Done ... 11
        TensorFlow Notes ... 12
        Final thoughts ... 14
    Wrap Up ... 15
Lunar Lander [In progress] ... 15
    Goal ... 15
  • 3. Work Done ... 15
    Wrap-up ... 15
Resources ... 16
    Resources Websites ... 16
    Tutorials ... 16
    Articles ... 16
    White Papers ... 17
    Data sets ... 17
    Books ... 17
  • 4. Exploratory research
Chris, Aakash, Hans
Goal
To compile resources for future projects and decide which environment, libraries, and resources the team will use moving forward.
Work
Decided to use TensorFlow because it is open source, well documented, and has plenty of tutorials on its website and on YouTube. We want to use the GPU version so that we can run experiments faster.
Considered the R programming language because it is commonly used in data science.
Decided to use Python because it is widely used in machine learning, so there are plenty of learning materials available, and because it is valuable for personal development in general.
Had some difficulties managing Python; this turned out to be the result of having multiple versions of Python installed (Python 2 vs. Python 3, 32-bit vs. 64-bit).
Found OpenAI, a non-profit research company whose Gym library was made to fill the reinforcement learning community's need for benchmarks and standardized environments. There were some installation problems that took a while to fix, mostly around using the correct versions of Anaconda and cmake.
Decided to use Anaconda to manage our Python libraries, since TensorFlow and OpenAI Gym rely on many libraries and Anaconda makes them easy to manage.
Began compiling a list of websites with relevant tutorials for TensorFlow, Gym, Python, and reinforcement learning in general.
Started compiling a list of white papers related to machine learning, specifically reinforcement learning.
Started taking important notes on the whiteboards and uploading them to Slack. This is a great way to share very specific, technical information with the team.
Compiled a list of websites that host data sets; these will be useful if we move on to supervised and unsupervised learning.
Wrap-up
We will continue to update our lists of resources.
Tic-Tac-Toe
Aakash
Goal
The goal of the project was to apply reinforcement learning to solve a game of Tic-Tac-Toe instead of using hard-coded rules, and to explore a specific type of reinforcement learning called Q-Learning.
  • 5. Work
Reinforcement learning is quite different from supervised and unsupervised learning.
Supervised learning interface: def fit(X, Y) and def predict(X)
Unsupervised learning interface: def fit(X) and sometimes def transform(X), which turns the input X into a different representation Z
The interface for reinforcement learning is broader: it is an entire environment (a real or simulated world).
Supervised learning needs data labelled by humans, which is time consuming and costly. Reinforcement learning has no need for hand-labelled data.
Agents:
RL agents train in a completely different way, with many references to psychology:
- they model animal behavior
- the objective is a goal. AlphaGo's goal is to win Go; the goal of a video game AI is to win the game and achieve the highest score.
Animals/humans: the "selfish gene" (Richard Dawkins). Evolutionary psychologists have argued that our genes are selfish and only want to make more of themselves. Example: why do people want to be rich? Wealth leads to better healthcare and social status, which helps the genes maximize their goal. Wealth has no physical relationship to genes, yet it is a novel solution to the problem.
Environment:
The agent gets feedback by interacting with the environment.
State:
Humans and AIs alike never sense the entire world/universe at once. We have sensors (sight, sound, touch) which feed signals from the environment to our brain. The measurements we get from these sensors make up a "state".
Tic-tac-toe: how many states? Each location has 3 possibilities (empty, X, O) and there are 9 locations on the board, so #states = 3 * 3 * ... * 3 = 3^9 = 19,683. (A minimal sketch of encoding a board as one of these states appears after the recap below.)
Recap:
1) Agent
2) Environment
3) State
4) Rewards/punishments: how well or badly the AI is doing; always a real number
5) Actions: a finite set of actions. In a 2D video game: up, down, left, right, jump
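To make the 3^9 state count concrete, here is a minimal sketch (illustrative only, not the project's code) that encodes a board as a single integer in [0, 3^9) by treating it as a base-3 number; the names board_to_state and EMPTY/X/O are mine.

# Illustrative: encode a tic-tac-toe board as a base-3 integer.
EMPTY, X, O = 0, 1, 2          # three possibilities per cell

def board_to_state(board):
    """board is a list of 9 cells, each EMPTY, X, or O; returns an int in [0, 3**9)."""
    state = 0
    for cell in board:
        state = state * 3 + cell
    return state

# 9 cells with 3 values each -> 3**9 = 19683 possible states
assert board_to_state([O] * 9) == 3**9 - 1
print(3**9)  # 19683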
  • 6. Unintended consequences: Dangers of Reinforcement Learning
Example:
Goal: minimize human deaths. The AI reasons that since the number of humans grows exponentially, more people will die in the future, so it is best to destroy everyone now to minimize deaths in the future.
SAR triples:
(State, Action, Reward). Notation: (s, a, r).
Timing is important in RL: S(t), A(t) --------> S(t+1). Notation: (s, a, s').
Tic-Tac-Toe
How would a first-year computer science student program a tic-tac-toe game? By programming all the general rules.
Example: if the board is empty, the first move should be the middle or a corner.
Example: if the opponent has two pieces in a row, block the third position so they don't win the game.
Example: if we have two pieces in a row, add a third to win the game.
It will look like a bunch of if-else statements, and the agent will only ever be able to play tic-tac-toe, which goes against the idea of machine learning. We want one algorithm that can play different games, so we need something better: reinforcement learning.
New terms:
Episode: one run of the tic-tac-toe game, until a win, loss, or draw. Our RL agent will learn over 1,000, 10,000, or 100,000 episodes (it depends on how long the game is and how complicated its states are).
Terminal state: no more actions can be taken; the episode has ended.
How do we reward good behavior and punish bad behavior without building any prior knowledge into the AI? Tell the agent WHAT you want to achieve, not HOW you want it to be achieved.
Intro to scenario:
Planning scenario:
Suppose there is an exam tomorrow.
Hang out with friends --------> feel happy (positive)
Study ------------> feel bored (negative)
Why study? We don't think only of immediate rewards but of future rewards too. Hence we want to assign a value to the current state that reflects the future as well. Call this the "Value Function".
  • 7. Credit/Assignment Scenario:
Suppose you got your dream job at a company. What actions did you take in the past so that you are receiving the reward right now?
Delayed Rewards:
Two directions of thinking about a delayed reward:
Credit assignment: present (receiving the reward) ----------> because of an action in the past
Planning: present (do the action now) ---------> to receive a reward in the future
Value Function: a measure of the future rewards we might get. The value tells us the future goodness of a state.
Reward vs. Value:
Value is a measure of future goodness. Example: standing in front of a Goomba puts you in a position to jump on it in the next few states.
Reward is immediate goodness. Example: jumping on a Goomba immediately increases your score.
Reward is the goal, but we can't use the reward alone to guide actions because it doesn't tell us about future rewards.
V(s) = E[all future rewards | S(t) = s]
where s = state (input) and E = expected value, E[X] = the average of X.
Finding V(s) Algorithm:
Step 1) Initialize V(s):
V(s) = 1 if s is a winning state
V(s) = 0 if s is a losing or drawn state
V(s) = 0.5 otherwise
Step 2) Update V(s) in each episode:
V(s) <- V(s) + alpha * (V(s') - V(s))
Note: a terminal state never gets updated, since it has no next state.
Pseudocode:

# V: dict mapping state -> value (initialized as in step 1); play_game() returns the states visited
for t in range(max_iterations):              # loop over episodes
    state_history = play_game()
    s = state_history[0]
    for s_next in state_history[1:]:
        V[s] = V[s] + learning_rate * (V[s_next] - V[s])
        s = s_next

Playing the game:
  • 8. How do we actually play the game? Take a random action? No! We have a value function.
Pseudocode:

maxV = float('-inf')                         # start below any possible value so an action is always chosen
maxA = None
for a, s_next in possible_next_states:       # (action, resulting state) pairs
    if V[s_next] > maxV:
        maxV = V[s_next]
        maxA = a
perform_action(maxA)

Wrap-up
By implementing this project I came across different types of learning strategies and how each differs from the others. I learned the basic terms of reinforcement learning (agent, environment, state), why we need to consider future rewards instead of relying only on the current reward, and how to implement a value function for learning. In the future I would like to improve the presentation of the project's results by implementing a better-looking GUI.
Dodging Agent
Hans
Goal:
The goal was to develop a small, discrete problem that could be easily solved with machine learning so that I could focus on understanding the implementation details.
Work Done:
The problem that I made was confined to a small grid, 2 rows by 3 columns. An agent inhabits the top row and can move between adjacent spaces; it can also choose to do nothing and not move. A projectile is spawned in the bottom row and moves into the top row on the next update. The agent then tries to dodge the projectile by moving out of the way or by not moving into its path.
The first thing I did was conduct some research into reinforcement learning, since I only had a basic understanding of the topic. I studied some of the general problems involved in every machine learning problem, such as the trade-off between exploration and exploitation. I also researched the use of genetic algorithms in reinforcement learning problems extensively. I was very interested in how this approach transforms reinforcement learning into its own strategy of learning: the agent is rewarded by giving it a fitness score, and exploration is done through crossover and mutation operations, each affected by its own rate. As much as this topic intrigued me, I decided it would be easier and more beneficial to start with a more traditional approach to solving reinforcement learning problems.
  • 9.
The topic of exploration versus exploitation seemed crucial to developing a decent learning program, so I took some extra time to explore the various methods and how they impact learning. As explained later on, I wanted to use a method that is keen to explore early on but exploits more often once the program has exhausted its options. I also researched Markov Decision Processes, because random actions without some sort of probability distribution lead to poor performance; MDPs are used in these cases because they incorporate probability into the action/reward system.
After learning a good portion of the material I got started on my application. I chose to implement it in C++ because it is the language I am most familiar with. I had to program both the problem and the learning system, but since the problem was simple it did not take long to write the whole thing. I also wrote a driver so that I can run different tests and collect data from them.
The first iteration of the system is as described above: an agent trying to dodge a single projectile. Every timestep, the agent either explores a random action or performs the most rewarding action, based on a function that determines the explore rate. For this problem I use the given base explore rate raised to the power of the timestep, so that it diminishes over time. A base explore rate of 1 results in the agent always choosing a random action; an explore rate of 0 results in the agent always choosing the best action. If the agent tries to exploit but there is no single best action, it selects a random action from among the best actions available.
To determine the best action, I paired every action with the total reward from choosing that action. These rewards were all initialized to 0 to avoid bias in the system. In a discrete problem it would be simple to input the values we consider correct, but the goal is to allow the program to learn, so I did not. The reward gets updated after every timestep, when the system determines whether the agent got hit: if the agent did not get hit, the reward is incremented by 1; if it did get hit, the reward is decremented by 1. The agent knew what position it was in and the position of the projectile, but it did not know what the next state would be. In this iteration of the problem this did not cause any issues: the agent did not have to know the probability of the next state it would end up in. It was very short-sighted and decided how to move based only on the current state, which allowed the problem to remain extremely small. Having a very small, discrete, short-sighted problem is the main reason I chose a method that causes the program to stop exploring after a small number of iterations; I knew that all of the options would be exhausted very quickly during exploration, so the agent could then start choosing the best option.
I ran a few tests to check that the agent was learning correctly. I first ran it with an explore factor of 1 to see what happens with purely random actions. The results were as I expected: the agent got hit about a third of the time. I also ran tests with an explore factor of 0. These results were also as I had expected: the agent would get hit between 0 and 9 times, and if it did get hit it would try a different action and be successful.
Every time after that it would perform the successful action. I also ran tests with an explore factor of 0.99 to let the explore algorithm work its magic. This allowed the agent to experiment and find alternative solutions, rather than sticking with the first solution it found as in the previous test. The data looks very similar, because the agent is always training and updating its success rates, which causes it to tend towards a single solution as well. A minimal sketch of this explore-then-exploit selection rule is given below.
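Since the project itself was written in C++, the following is only an illustrative Python sketch of the selection rule described above; the names (choose_action, a rewards table keyed by (state, action), base_explore_rate) are mine, not the project's.

import random

def choose_action(rewards, state, actions, base_explore_rate, timestep):
    """rewards: dict mapping (state, action) -> total reward so far (all start at 0)."""
    # base explore rate raised to the timestep, so exploration decays over time
    explore_rate = base_explore_rate ** timestep if base_explore_rate > 0 else 0.0
    if random.random() < explore_rate:
        return random.choice(actions)                    # explore: random action
    best = max(rewards[(state, a)] for a in actions)     # exploit: highest total reward
    best_actions = [a for a in actions if rewards[(state, a)] == best]
    return random.choice(best_actions)                   # break ties randomly, as described above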
  • 10. This test also reveals that with the current method of exploring, the agent will eventually stop getting hit entirely once it has found a solution.
Wrap-up:
This project was a decent entry point into machine learning. I was able to implement basic reinforcement learning that allowed an agent to learn how to solve a problem optimally. The first thing I would do to move this project forward is write up different methods of selecting actions and different methods of exploring; I believe how these methods are implemented can greatly influence the outcome. I would also try to make the problem more robust by allowing additional projectiles to spawn. This adds many more states and creates a more interesting problem overall: the agent would then have to think about how its current action impacts its next action, and it would have to use the probability of advancing to each state in its choice of action. Currently the probability of a projectile appearing in a space is distributed equally, but I would like to investigate the effects of using a different spawning distribution. By implementing these changes I would get a much better grasp of how these parameters affect the agent's ability to reach an optimal solution.
Frozen Lake
Chris
Goal
To dive head first into machine learning. I found a tutorial that uses Q-Learning to solve the Frozen Lake problem.
Work Done
Spent some time learning how Q-learning works and understanding a variation of the Bellman equation, to understand intuitively why it works.
Refactored, tweaked, and played with example code from a tutorial that implements Q-Learning with a table (a minimal sketch of this approach follows below).
Used Matplotlib to visualize the Q-Learning process. This let me see how the agent moved around its environment and made me realize there were some interesting problems with how it was making decisions, like the fact that it doesn't care about the shortest path. It can be incentivised to move around in circles or make effectively no-op moves, because in the simple implementation it doesn't get penalized for that.
Learned a lot about how to use Matplotlib; I plan to use it to help visualize data for us going forward, so that as we run experiments we can see how the algorithms are working over time.
Refactored and tweaked code that used TensorFlow to do nearly the same thing, basically wrapping around a table, except it was far less accurate because we couldn't update the table directly and had to go through an optimizer.
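For reference, here is a minimal sketch of tabular Q-learning on FrozenLake-v0, in the spirit of the Q-Learning tutorial linked in the Resources section; the hyperparameter values and the noise-based exploration are illustrative, not necessarily what was used in the project.

import gym
import numpy as np

env = gym.make('FrozenLake-v0')
Q = np.zeros((env.observation_space.n, env.action_space.n))   # one row per state, one column per action
learning_rate, discount = 0.8, 0.95                            # illustrative values

for episode in range(2000):
    s = env.reset()
    done = False
    while not done:
        # greedy action plus decaying random noise, so the agent keeps exploring early on
        a = np.argmax(Q[s, :] + np.random.randn(env.action_space.n) / (episode + 1))
        s_next, reward, done, info = env.step(a)
        # Bellman-style update toward reward + discounted best future value
        Q[s, a] += learning_rate * (reward + discount * np.max(Q[s_next, :]) - Q[s, a])
        s = s_next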
  • 11. Wrap-up
This was a good primer for learning how to use the Bellman equation and TensorFlow. I wanted to tweak its parameters more and use a few more special techniques to see if I could push the success rate higher, but I felt it was more important to move on to using TensorFlow to solve more interesting problems with the rest of the team.
Cart-Pole
Aakash
Goal
To use TensorFlow to build a neural net that solves OpenAI Gym's Cart Pole.
Work Done
Started by exploring OpenAI Gym's classic control environments and read the documentation on getting started and installing Gym. Initially I had a lot of problems getting OpenAI Gym set up on my machine because there were multiple versions of Python installed, so I had to remove the previous versions. I followed the pip install procedure from the documentation:

git clone https://github.com/openai/gym
cd gym
pip install -e .  # minimal install

This does not download every Gym package. The snippet above installs the classic control version of Gym; if we need all the Gym packages, such as the Atari games and the Box2D package, we need to run pip install gym[all].
After getting everything set up, I copied the code snippet from the OpenAI documentation which creates an environment and plays 20 random games. The documentation clearly explains what observation, reward, done, and info mean.

import gym

env = gym.make('CartPole-v0')
for i_episode in range(20):
    observation = env.reset()
    for t in range(100):
        env.render()
        print(observation)
        action = env.action_space.sample()              # take a random action
        observation, reward, done, info = env.step(action)
        if done:
  • 12. print("Episode finished after {} timesteps".format(t+1)) break There is open ai leaderboard which shows different algorithms which are uploaded by people to solve the problem. Most of the solutions were using neural network hence it was difficult to grasp. Hence decided to install tensor flow and explore getting started documentation. Then had to follow buch of tensor flow tutorials online on how to install it. We have 2 different types of tensorflow packages gpu version and cpu version. The cpu version is easy to download but is slower. The gpu version requires Nvidia graphics card. Also I had to update all the graphics card driver and had to install Nvidia cuda package before installing tensorflow. After getting tensorflow working followed their documentation to learn the basic terms and took notes: https://www.tensorflow.org/get_started/get_started And explored MNIST program which is hello world for tensor flow. Tensor Flow Notes:- Tensorflow provides multiple APIs. Lower level api -------------------> tensorflow core Higher level api built on top of lower level api and are easier to learn and make repetitive task easier to implement. TENSOR: central unit of data in tensorflow. Set of primitive values into an array of any number of dimensions. RANK of TENSOR:- 3-------------------------------------------> #rank 0, scalar shape[] [1,2,3]-------------------------------------> #rank 1, vector shape[3] [[1,2,3],[4,5,6]]---------------------------> #rank 2, matrix shape[2,3] [[[1,2,3]],[[7,8,9]]]-----------------------> #rank 3, shape[2,1,3] Tensorflow core programs 2 sections:- 1) Building computational graph 2) Running computational graph Computational graph: series of tensorflow operations arrranged into a graph of nodes. Each node takes 0 or more tensors as input-------------> tensor as output
  • 13.
Constant node: no input; the output is a value that is stored internally.
MNIST is the "hello world" for starting with TensorFlow. It consists of a number of labelled images, and our goal is to build a TensorFlow model that predicts the labels.
The data is split into 3 parts:
1) 55,000 data points of training data
2) 10,000 data points of test data
3) 5,000 data points of validation data
The MNIST dataset has 2 parts:
1) An image of a handwritten digit --------- X
2) The corresponding label ------------------ Y
Each image is 28*28 pixels, hence a big array of numbers (28*28 == 784 numbers). mnist.train.images is a tensor (an n-dimensional array) with a shape of [55000, 784]. The first dimension is an index into the list of images and the second dimension is the index of each pixel in an image. Each entry in the tensor is a pixel intensity between 0 and 1, for a particular pixel in a particular image.
Each image in MNIST has a corresponding label, a number between 0 and 9 representing the digit drawn in the image. mnist.train.labels is a [55000, 10] array of floats.
Softmax Regression:
If you want to assign probabilities to an object being one of several different things, softmax is the thing to use, because softmax gives us a list of values between 0 and 1 that add up to 1. Even later on, when we train more sophisticated models, the final step will be a softmax layer.
The 2 steps of softmax regression are:
1) Add up the evidence of our input being in each class (a weighted sum of pixel intensities: if the pixel supports the class the weight is positive, otherwise negative)
2) Convert that evidence into probabilities (a small numeric example follows below)
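To make step 2 concrete, here is a small illustrative computation (not taken from the MNIST tutorial) of softmax over three made-up evidence scores:

import numpy as np

def softmax(evidence):
    # exponentiate, then normalize so the outputs sum to 1
    exps = np.exp(evidence - np.max(evidence))   # subtracting the max keeps it numerically stable
    return exps / exps.sum()

print(softmax(np.array([2.0, 1.0, 0.1])))
# -> roughly [0.66, 0.24, 0.10]: higher evidence gets a larger share of the probability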
  • 14.
Softmax: exponentiate the inputs and then normalize them. The exponentiation means that one more unit of evidence increases the weight given to a hypothesis multiplicatively. Softmax then normalizes these weights so that they add up to one, forming a valid probability distribution.

x = tf.placeholder(tf.float32, [None, 784])
W = tf.Variable(tf.zeros([784, 10]))
b = tf.Variable(tf.zeros([10]))
y = tf.nn.softmax(tf.matmul(x, W) + b)

In machine learning we typically define what it means for a model to be bad. We call this the cost, or the loss, and it represents how far off our model is from the desired outcome. We try to minimize that error; the smaller the error margin, the better the model.
CROSS-ENTROPY: a measure of how inefficient our predictions are at describing the truth.
Using small batches of random data is called stochastic training -- in this case, stochastic gradient descent.
Final thoughts:
It is still difficult for me to understand softmax regression and cross entropy. I started learning more about the algorithm, but it required a higher-level understanding of fuzzy logic.
After getting familiar with TensorFlow and the basic concepts of neural networks, I implemented a neural network in my algorithm to solve the cart pole problem. The algorithm plays 1000 games by taking random actions, moving the cart left or right at each frame. If a particular game scores more than 50, the algorithm saves the observations and the actions the cart took to achieve that score. This data serves as the training data for the neural network. NOTE: I did not render any games during training because rendering 1000 games would be very slow.
Then I created a model which contains 5 layers with 128, 256, 512, 256, and 128 neurons respectively. It uses softmax regression as the learning algorithm (a sketch of a network of this shape is given below).
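The actual code is linked further down; purely as an illustration, a network with that layer structure could be sketched with TFLearn (listed in Resources) roughly as follows, where the input size of 4 (Cart-Pole's observation), the dropout, and the other hyperparameters are assumptions, not the project's exact settings:

import tflearn
from tflearn.layers.core import input_data, fully_connected, dropout
from tflearn.layers.estimator import regression

def build_model():
    net = input_data(shape=[None, 4])                     # 4 observation values per Cart-Pole frame (assumed)
    for width in [128, 256, 512, 256, 128]:               # the 5 layers described above
        net = fully_connected(net, width, activation='relu')
        net = dropout(net, 0.8)
    net = fully_connected(net, 2, activation='softmax')   # probabilities for left / right
    net = regression(net, optimizer='adam', learning_rate=1e-3,
                     loss='categorical_crossentropy')
    return tflearn.DNN(net)

Training would then be a call along the lines of model.fit(observations, actions) on the data saved from the high-scoring random games.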
  • 15.
The training data is then used to train the neural network. After the network is trained, it is used to play 10 games, and we can see how well it performs by rendering each game. At the end, the algorithm prints out the final score and how many times it took each action, i.e. how many times it decided to go left or right.
It was very difficult and time consuming to tweak each parameter to find a better output. I tried different learning rates to see how they affected the final result. In the end I uploaded my solution to the official OpenAI Gym website: https://gym.openai.com/evaluations/eval_HhJcddvETPu16QBl1hjYw
You can take a look at my solution on GitHub: https://github.com/akuchotrani/MyOpenAIGym/blob/master/CartPoleWithTensorFlow.py
Wrap Up
I still do not fully understand how softmax regression works. I looked at various tutorials online, but it requires a higher understanding of fuzzy logic and neural networks. I will try to apply the same algorithm to Lunar Lander by changing the environment, the set of action spaces, and the minimum score requirement.
Lunar Lander [In progress]
Chris, Aakash, Hans
Goal
We want to use TensorFlow to build a neural net that solves OpenAI Gym's Lunar Lander.
Work Done
Got Lunar Lander and TensorFlow working on everyone's machine. Got a simple neural network running. We set up a GitHub repository so we can collaborate.
Wrap-up
  • 16. Resources
Resources Websites:
R https://cran.r-project.org/
TensorFlow https://www.tensorflow.org/
OpenAI Gym https://gym.openai.com/
OpenAI Universe https://blog.openai.com/universe/
Kaggle (data science competitions) https://www.kaggle.com/
RStudio https://www.rstudio.com/products/rstudio/download2/
Anaconda (Python 3.6 version) https://www.continuum.io/downloads#windows
Keras: The Python Deep Learning library https://keras.io/
Practical Deep Learning For Coders (18 hours of lessons for free) http://course.fast.ai/
OpenCV (Open Source Computer Vision Library) http://opencv.org/
Python Programming Tutorials https://pythonprogramming.net/
Python 3.6.2rc1 Documentation https://docs.python.org/3/
TFLearn: Deep learning library featuring a higher-level API for TensorFlow http://tflearn.org/
Matplotlib 1.5.1 documentation https://matplotlib.org/1.5.1/index.html
NumPy http://www.numpy.org/
SciPy https://scipy.org/
Tutorials:
Q-Learning https://medium.com/emergent-future/simple-reinforcement-learning-with-tensorflow-part-0-q-learning-with-tables-and-neural-networks-d195264329d0
Pac-Man and reinforcement learning https://inst.eecs.berkeley.edu/~cs188/sp12/projects/reinforcement/reinforcement.html
YouTube Python tutorial https://www.youtube.com/watch?v=kkQlyDMa-h0
PyCharm https://www.jetbrains.com/pycharm/
Articles:
Artificial neural network https://en.wikipedia.org/wiki/Artificial_neural_network
Reinforcement learning https://en.wikipedia.org/wiki/Reinforcement_learning
Markov decision process https://en.wikipedia.org/wiki/Markov_decision_process
  • 17.
Deep Learning https://en.wikipedia.org/wiki/Deep_learning
Q-learning https://en.wikipedia.org/wiki/Q-learning
Pac-Man http://www.ias.tu-darmstadt.de/uploads/Site/EditPublication/Hochlaender_BScThesis_2014.pdf
Demystifying Deep Reinforcement Learning https://www.intelnervana.com/demystifying-deep-reinforcement-learning/
White Papers:
Continuous Control with Deep Reinforcement Learning https://arxiv.org/pdf/1509.02971.pdf
Playing Atari with Deep Reinforcement Learning https://arxiv.org/pdf/1312.5602.pdf
Mining Muscle Use Data for Fatigue Reduction in IndyCar http://www.sloansportsconference.com/wp-content/uploads/2017/02/1622.pdf
Real-Time Decision Making in Motorsports: Analytics for Improving Professional Car Race Strategy https://dspace.mit.edu/bitstream/handle/1721.1/100310/931596281-MIT.pdf?sequence=1
Data sets:
ImageNet http://www.image-net.org/
Titanic https://www.kaggle.com/c/titanic
Quandl: Financial, Economic and Alternative Data https://www.quandl.com/
Alphabetical list of free/public domain datasets with text data for use in Natural Language Processing (NLP) https://github.com/niderhoff/nlp-datasets
List of datasets for machine learning research (Wikipedia) https://en.wikipedia.org/wiki/List_of_datasets_for_machine_learning_research
Books:
AI Techniques for Game Programming by Mat Buckland
Bayesian Reasoning and Machine Learning by David Barber
Learn Python the Hard Way: A Very Simple Introduction to the Terrifyingly Beautiful World of Computers and Code (Third Edition, Video-Enhanced Edition) by Zed A. Shaw http://proquestcombo.safaribooksonline.com/book/programming/python/9780133124316