Summary of concepts in Reinforcement Learning, including references to research articles by DeepMind, OpenAI and many others. With table of contents and index. All references included in a zip file.
Contents

Introduction
  Other researchers
  Review articles
  Applications in a business environment
Markov Decision Processes
  Solution methods
  Reinforcement Learning
  Delayed reward
Algorithms
  On- and off-policy methods
  Q-learning and SARSA
  Multi-step methods
  Replay
  Expected SARSA and Q(σ)
  Other variants
Hyperparameters
  Hyperparameter optimisation
Reinforcement Learning in R and Python
References
Index
Introduction
Although Reinforcement Learning as a concept is now several decades old, much research has only been done in the last 5-10 years. Most famous are probably the results of DeepMind, a London company that was bought by Google in early 2014. DeepMind turned out to be able to beat the best Go player in the world with its algorithm AlphaGo, whereby the algorithm initially also learned from games played by good Go players, and subsequently improved itself (AlphaGo Zero) by learning from playing many games against itself [20]. Now there is AlphaZero, which in addition to Go can also play chess and shogi at the highest level.
DeepMind is one of the research centres where a lot of research is done on RL, both theoretically and in terms of applications. Well-known researchers from DeepMind are:
• David Silver: wrote many articles as first or co-author (see e.g. the article on AlphaGo) and is widely cited. Also made a series of videos of lectures on RL (slides here).
• Hado van Hasselt, a Dutch PhD (UU, CWI, see [9]), who developed, among other things, Double Q-learning (see Appendix B of [9]).
• Tom Schaul: lead author of the article on Prioritised Experience Replay [6].
The standard work in the field of Reinforcement Learning is the book by Sutton and Barto [1]; this book is in the reference list of almost every RL article. In the more recent versions of this book, the last paragraph of each chapter contains an interesting listing of the authors and their articles that contributed to the development of the chapter's topic; very useful as a cross reference. In this overview, we use this book as the common thread to review the various other topics in the RL literature.
Sutton is more prominent than Barto; among other things, he has an extensive (but slightly messy) website. Many of the current RL researchers have spent some time with the RL & AI group at the University of Alberta in Canada, where Richard Sutton works. Sutton publishes articles continuously, including via ResearchGate.
There is now also a set of courses (a Specialization) on Coursera about Reinforcement Learning. This has not been checked out yet.
Other researchers
As is often the case, the US/North America is leading in the field.
• Andrew Ng: one of the best-known AI researchers, working at Stanford, who is very successful, partly through his popular courses via Coursera (of which he is a founder).
• Naoki Abe: IBM researcher, who stands out for a number of publications on applications of RL in a business environment.
• Georgios Theocharous: researcher at Adobe (previously at M.I.T., Intel and Yahoo) in the field of (applications of) RL and other ML techniques, especially in marketing.
I probably have a bias towards Dutch(-speaking) researchers, but it is striking that there are quite a few of those:
• Harm van Seijen: did a PhD at the UvA with Shimon Whiteson and also worked for 4 years as a postdoc with Sutton. Now heads the Microsoft Research Montréal RL team.
• Pieter Abbeel: a Belgian who, through a PhD at Stanford with Andrew Ng and a research position at OpenAI, eventually became head of the Robot Learning Lab in Berkeley (among others).
• Marco Wiering: supervisor of Hado van Hasselt; his name appears in all kinds of publications. Has apparently not (yet) succumbed to the major players in the AI field, because he still 'just' works in the AI group of the RUG.
Review articles
In addition to the articles on specific RL topics, many of which are referenced in this document, there are also a number of interesting review articles:
• The ResearchGate article [3] by Ahmad Hammoudeh provides a nice insight into what RL is, including a number of applications and many references to literature.
• In September 2018, "Nieuw Archief voor de Wiskunde" (the magazine of the Dutch Royal Mathematical Society) was entirely devoted to AI, and it contained a nice overview article by Marco Wiering [26], also with many references.
• Even more extensive is the overview by Yuxi Li [4], which can safely be called a booklet, with over 40 pages of background, main elements and mechanisms, applications, resources (books, reports, courses and tutorials, conferences, blogs, testbeds, etc.), and that also contains 25 pages of references.
Applications in a business environment
Of course, for a commercial company it is at least as interesting to look for RL applications in a business environment. After all, in such an environment you run into all sorts of specific problems that will bother you less or not at all with the 'school examples': how do you find the best (or a good) algorithm or parameters without immediately learning online with 'live' clients (batch/offline learning, setting up a simulation environment?), how do you connect things to existing systems and procedures, how do you explain to those involved how the algorithm learns and what the final policy entails, etc.? For this reason it is nice to see how others have dealt with those problems. However, the number of published applications of RL in commercial environments is not (yet) that large. It may be that there is still little to report, because it is a relatively new technology and/or because it is difficult to get it working properly and to optimise it. But it may also be that successful applications are not published (in detail) for competitive reasons. Although it is regularly claimed that use is made of RL, details are mostly not given.
Still, a number of articles with RL business applications were found, including three pieces (partly) by Naoki Abe (IBM Research), of which two in the field of marketing ([21] and [22]) and one in the field of debt collection [23]. This last article concerns the collection of tax liabilities by the New York State Department of Taxation and Finance. At the time of the article (2010), it was estimated that this could save $100 million in 3 years, on annual tax revenues of $1 billion. Incidentally, the development of the "engine" for this cost IBM 5 million dollars in research investments (and the State Department another 4 million dollars).
Another interesting article is [15], which describes an application of RL in the medical world. This shows many problems and choices that I also encountered in practice. I took over a number of things, such as target clipping, reward shaping, a custom loss function to punish deviant Q values, and parts of the configuration of the Neural Network (NN).
Two other articles with commercial applications of RL have been found, but have not yet been studied ([24] and [25]).
Markov Decision Processes
Markov Decision Processes or MDPs are decision problems characterised by an Agent (actor) who takes decisions in an Environment, at/in successive times/periods, and gradually receives rewards or incurs costs (positive or negative rewards).
Somewhat more formally defined:
• The Environment finds itself at any time t in one of the possible states s ∈ S
• Every period the Agent performs an action a ∈ A (allowed in state s; this can also be "do nothing")
• The feedback from the Environment to the Agent, following the action taken, is twofold: (r, s'), where r is the (direct) reward and s' the new state of the Environment. This feedback only depends on the old state s and the action taken, not on the previous history of the process (this is the so-called Markov property of the process)
• The entire process can be finite (then a sequence from start to end is called an episode¹) or infinite
• The aim for the Agent is to act in such a way that the total average reward² is maximised
Schematic (from [1]):

Figure 1 Agent acting in an Environment
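In code, Figure 1 corresponds to a simple interaction loop. Below is a minimal Python sketch; the `env` object with `reset`/`step` methods and the `policy` function are assumed interfaces for illustration (in the style of the Gym environments discussed at the end of this document), not a specific library.

```python
# Minimal agent-environment interaction loop (one episode), per Figure 1.
# 'env' and 'policy' are assumed, illustrative objects.

def run_episode(env, policy):
    total_reward = 0.0
    s = env.reset()                     # initial state of the Environment
    done = False
    while not done:
        a = policy(s)                   # Agent picks an action for state s
        s_next, r, done = env.step(a)   # Environment returns reward and new state
        total_reward += r
        s = s_next                      # the feedback (r, s') closes the loop
    return total_reward
```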
The way the Agent tries to maximise the reward is called the policy (denoted by π), which is the central concept in RL. Formally, the policy is the rule that determines the action a to be taken for a given state s: π: S -> A. The policy can be deterministic, i.e. exactly one action a is prescribed for each state s (π(s) = a), or stochastic, i.e. for a given state s the action a is chosen with a certain probability π(a|s), with ∑_a π(a|s) = 1.

¹ When every episode consists of only one period, so it requires only one decision, the problem is called a Multi-armed Bandit (MAB) problem (One-armed Bandit being the nickname for the well-known (fruit) slot machines). The decision required there is which Arm (slot machine) to run for maximum (average) reward over all episodes.
² Averaged over episodes if the process is finite; averaged over periods in the infinite case (in which case a discount factor γ < 1 must be chosen).
The feedback from the Environment to the agent is in principle stochastic, which means that that feedback, consisting of the new state s' and the direct reward r, is determined by probabilities:

P(r, s'|s, a) = probability that, given state s and action a, the next state is s' and the direct reward is r, where of course ∑_{r∈ℝ, s'∈S} P(r, s'|s, a) = 1, ∀ s, a.

NB. It is quite possible that in a specific RL problem the direct reward does not depend on s, s' or a, but only on one or two of this trio.
Solution methods

In order to solve an MDP (i.e. determine the optimal policy³), it matters whether we know the entire operation of the environment. The states S and the possible actions A are in principle always⁴ known, but the probabilities P(r, s'|s, a) do not have to be. A solution method that makes use of the knowledge about the P(·) is called model-based; a method that does not do this is called model-free.
A model-based method known from Operations Research is Dynamic Programming (possibly stochastic). Using the known model of the environment, it can be determined for each state s_{T-1} from the penultimate period what the expected⁵ reward r_T is for each permitted action a_{T-1}, and therefore also what the optimal action a*_{T-1} is. The value of state s_{T-1} is then the reward for that optimal action choice. Then we take another step back and perform the same exercise with the possible states s_{T-2} in the second-to-last period, with the difference that we calculate the sum of the direct reward r_{T-1} + the expected value of the state s_{T-1} we end up in. The action that leads to the highest sum is the optimal action a*_{T-2} from s_{T-2}. This continues until we know for each initial state s_0 what the optimal action a*_0 is and what the associated expected value of [(direct) reward r_1 + the value of state s_1] is. The set of optimal actions {a*_t} forms the optimal policy (usually denoted by π*).
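As a concrete illustration of this backward induction, here is a minimal NumPy sketch for a finite-horizon MDP; the array layout of P and R is an assumption of the example.

```python
import numpy as np

def backward_induction(P, R, T):
    """Finite-horizon dynamic programming (backward induction).
    Assumed layout: P[s, a, s2] = probability of moving from s to s2 under a,
    R[s, a, s2] = direct reward for that transition, horizon of T periods."""
    n_states = P.shape[0]
    V = np.zeros(n_states)                    # value of the terminal states
    policy = np.zeros((T, n_states), dtype=int)
    for t in reversed(range(T)):
        # expected [direct reward + value of the next state] for every (s, a)
        Q = np.einsum("ijk,ijk->ij", P, R + V[np.newaxis, np.newaxis, :])
        policy[t] = np.argmax(Q, axis=1)      # optimal action a*_t per state
        V = Q.max(axis=1)                     # value of each state at time t
    return policy, V
```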
Other known solution methods are Value Iteration (VI) and Policy Iteration (PI), which are discussed in detail in [1].
Reinforcement Learning

In many cases, the model P(r, s'|s, a) is unknown. We then have to rely on other methods, which have been on the rise in recent decades under the collective name Reinforcement Learning (RL).
A characteristic of these methods is that the missing knowledge about the model of the environment is replaced by pieces of experience that are gained while letting the agent operate in the environment. This explains the second part of the name: with every interaction something is "learned" about how the environment works. The first part, Reinforcement, refers to the strengthening (or not) of already acquired knowledge by new experiences: an experience (s, a) => (s', r) that corresponds with previous experiences from (s, a) provides reinforcement of the previously acquired knowledge. Naturally, in case of non-agreement, the previous knowledge is (partly) "extinguished".

³ In the seminal work of Sutton and Barto [1], finding the optimal policy is called solving the control problem. Next to this (or rather before this) a lot of attention is given to the prediction problem, i.e. determining (predicting) the value function V^π(s) and action-value function Q^π(s, a) for a given policy π. These functions give the expected reward from state s (for V), and taking action a (for Q), followed by policy π until the end of the episode (or "infinitely").
⁴ Note that there are also MDPs where the states are only partially observed, the so-called POMDPs (Partially Observable MDPs).
⁵ In a stochastic problem we have an expected reward and multiple possible next states s'; in a completely deterministic problem the reward r and the next state s' are known when s and a are given.
Delayed reward

With many MDPs the direct reward is not representative of how good an action is; it is often just as important that the environment is brought into a better, more valuable state, from which the final reward can be obtained. As an example, in a marketing environment: if by sending a number of messages through the right channels at the right times (which in the first instance only cost money) we can get a client to react, or maybe even make a first purchase, then we have arrived in a much more favourable state, with a higher 'value'. Maybe it would have been cheaper in terms of direct reward to take less expensive (or no) action(s), but then we might not have gotten into that more valuable state.
Direct rewards are often 0 or negative, because the action taken entails costs and does not immediately yield anything. This makes it a tricky business to try and solve an MDP whose model is unknown by using supervised learning techniques. After all, as a target variable we have at most the direct reward available, and this is not necessarily maximal for the best action in the given circumstances. Instead, we can also wait until the end of an episode and then take the final (total) reward as a target variable (this is so-called Monte Carlo learning), but then the big question is to what extent we should assign that reward to each action taken in the episode. Another disadvantage is that we always have to continue until the end of an episode in order to learn something. In addition, for many problems, the number of possible "trajectories" (paths from the beginning to the end of an episode) is so large that it becomes impractical.
The method that sits between these two extremes, and is called Temporal-Difference (TD) learning by Sutton, uses estimates of the value of the next state (along with the direct reward) as a 'temporary' target variable and adjusts the current estimate using these estimates (the so-called 'update'). The word "difference" here refers to the difference between the new estimate (called the target) and the earlier estimate known up to the time of the update. TD learning is therefore learning "a guess from a guess" [2] (slide 7; in the same presentation, see also slides 9-16 for further explanation of multi-step predictions versus one-step methods such as supervised learning).
The different methods for solving multi-step decision problems – all classified by Sutton under RL – are not completely separate from each other, but gradually merge into one another. This is illustrated by Figure 2 below (from [1], p. 157).
The "depth"of the update isshown
here on the vertical axis,with1-step
TD learningat the top andMonte
Carlolearning(entire episodes) atthe
bottom.On the horizontal axisonthe
leftare the methodswhere onlythe
effectof one decision (action) is
viewedperstep(itis"sampled"from
the possible decisions),while onthe
rightare methodswhere the
expectationvalueiscalculatedoverall
possible decisions.Becausethe
expectationvalueassumesknowledge
of the underlyingprobabilitymodel,all
model-freemethods are locatedonthe
leftaxis.
Many otheraspectsplaya role inthe use of these methodsthatcannotbe capturedinthis two-
dimensional picture.Twoof those aspectsshouldbe mentionedhere,namelythe exploration-
exploitation dilemma andfunction approximation.
As soon as various actions have been tried out in a certain state, a (temporary) picture of the value of taking those various actions in that state already arises. In particular, it is known what – up to that point – is the best action. Every time we come back to the same state, we can of course take that best action (we call this exploitation of the acquired knowledge, or also greedy action selection), but then we may not learn a lot⁶. That is why it makes sense to keep exploring and occasionally choose actions that – up to that point – are not optimal, or that have not been tried at all; we then learn whether we can take even better actions than we thought. It may be that this exploration (temporarily) costs money, but potentially there is a better policy in return⁷.
When the number of possible states becomes very large, it becomes impractical to register the value of each state (or state-action combination) separately (in table form). Function approximation offers a solution: the values are approximated by a (value or action-value) function of the available features, which contains a limited number of parameters. In its simplest form, this can be a linear function of the features, but more complex non-linear functions such as a neural network (NN) can also be used. When a deep NN is used (loosely defined as a NN with more than 2 hidden layers), the term Deep Reinforcement Learning is used.
⁶ In a stochastic environment we obtain an increasingly better estimate of the 'expected reward' of the specific action, but we do not learn anything more about the other actions.
⁷ Compare this to the situation when going out to dinner (pre-corona 😉): if you always choose a restaurant that gave you good experiences in the past, you will never try out new restaurants or revisit restaurants that gave you one or two less favourable experiences. At the very least, you might miss a very good restaurant you haven't tried yet.
Figure 2 Schematic overview of RL methods
Algorithms

In order to solve the MDP (to find the optimal policy), we have to approach the actual (action-)value function as closely as possible, bit by bit. This is done systematically by an RL algorithm. A lot of research has been done in the field of RL algorithms and many different "flavours" have emerged. This section provides a brief overview of these flavours.
The optimal policy must comply with the Bellman optimality equation:

Q*(s, a) = ∑_{s'∈S} P(s'|s, a) · [r_{s,a,s'} + γ · max_{a'} Q*(s', a')]

If we have found the function Q* – or approximated it closely enough – then we know the optimal policy, because this is the greedy policy that chooses the action a with maximum Q* value in each state.
On- and off-policy methods

An often mentioned distinction within RL algorithms is that between on-policy and off-policy methods. As mentioned, an RL algorithm searches for the optimal policy; we therefore call this policy the target policy (also known as: estimation policy). However, there is another policy in play, namely the policy that determines which action is taken during the search process, the so-called behaviour policy. When both policies are the same – and thus the policy that is used is itself optimised – we speak of an on-policy algorithm; otherwise the algorithm is off-policy. In order to continue to learn sufficiently from new experiences, the behaviour policy in an on-policy algorithm must continue to explore (otherwise the greedy action is always chosen). This problem is not an issue with an off-policy algorithm: after all, you can explore to your heart's content (in principle even completely at random). However, this does mean that there are fewer guarantees of convergence (i.e. that the optimal policy is approached more and more closely) when using an off-policy algorithm, certainly in combination with function approximation. According to Sutton, the combination of 3 elements in particular is 'dangerous' (he speaks of the Deadly Triad, see section 11.3 in [1]): function approximation, bootstrapping (= use of TD learning, "a guess from a guess") and an off-policy algorithm. Since function approximation is indispensable for larger, complex problems and bootstrapping can greatly increase efficiency, the use of on-policy learning is often the solution. Incidentally, further research is still being done into the Deadly Triad [5], mainly because people want to better understand how it is possible that some algorithms still successfully combine the three elements mentioned.
Off-policy learning can, however, offer great advantages, for example by learning what the optimal policy is from already available previous experiences, which have been acquired by following a different policy (for example by a real-life agent).
When using a so-called ε-greedy behaviour policy, off-policy learning comes very close to on-policy learning: with probability ε (which of course must be small) a random action is chosen, and with probability 1-ε the greedy action is chosen, just as in the target policy.
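A minimal sketch of ε-greedy action selection over a table of Q values (NumPy; all names illustrative):

```python
import numpy as np

def epsilon_greedy(Q, s, epsilon, rng=None):
    """Pick a random action with probability epsilon (exploration),
    otherwise the greedy action argmax_a Q[s, a] (exploitation)."""
    rng = rng or np.random.default_rng()
    n_actions = Q.shape[1]
    if rng.random() < epsilon:
        return int(rng.integers(n_actions))   # explore
    return int(np.argmax(Q[s]))               # exploit
```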
Q-learning and SARSA

The best-known off-policy RL algorithm is Q-learning⁸ and the best-known on-policy algorithm is SARSA⁹. The difference between the two is in the update of the value function, when the new value is formed by the direct reward + the value of the next state in which we end up. In Q-learning, the value of the next state is taken at the – up to that moment – best action in that state (the greedy action), while with SARSA the value is taken at the action which is prescribed by the policy (the behaviour policy and the target policy, because they are the same; note that this policy can be stochastic, i.e. π(a|s) = P(a|s) is a probability distribution and not a deterministic rule π(s) = a).
In formulae:

Q(s, a) ← Q(s, a) + α · (r + γ · max_{a'} Q(s', a') − Q(s, a))   for Q-learning
Q(s, a) ← Q(s, a) + α · (r + γ · Q(s', a') − Q(s, a))   for SARSA, with a' = π(s') the action prescribed by the policy in s'

where α is the learning rate and γ the discount factor.
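In code, the two updates differ only in the bootstrap term. A minimal tabular sketch, with Q a NumPy array indexed by state and action (names illustrative):

```python
def q_learning_update(Q, s, a, r, s_next, alpha, gamma):
    # off-policy: bootstrap on the greedy (max) action in s'
    target = r + gamma * Q[s_next].max()
    Q[s, a] += alpha * (target - Q[s, a])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    # on-policy: bootstrap on the action a' actually chosen by the policy in s'
    target = r + gamma * Q[s_next, a_next]
    Q[s, a] += alpha * (target - Q[s, a])
```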
Multi-step methods

If we look back at Figure 2, we see on the left axis the TD learning methods, with TD(0) at the top¹⁰ and Monte Carlo learning at the bottom. The so-called multi-step methods lie between these two extremes. These methods are characterised by the fact that the target does not consist of the one-step return (i.e. the reward of one step + the value of the next state), but of the n-step return (the consecutive rewards of several (say n) steps + the value of the state we end up in after those n steps). So we still bootstrap (i.e. we use the "preliminary value" of the state after n steps), but "sample" more than one step before bootstrapping. As n increases, we get closer and closer to Monte Carlo learning, where we sample until the end of the episodes (in finite processes). When using n-step methods, the question naturally arises which n is optimal. Because each n has its advantages and disadvantages, the eligibility trace concept was devised (see chapter 12 in [1]). We then calculate the target as the weighted sum of all individual n-step returns (so 1-step, 2-step, etc.), where the 1-step return is given weight (1-λ) and the weight of the return of each subsequent step is a factor λ (0 ≤ λ ≤ 1) smaller: the return of step k is therefore given weight (1-λ)·λ^(k-1). The resulting algorithm is called TD(λ) for short.
Viewed in this way, we use the "forward view" ([1] p. 288). We can also use the "backward view" by looking back from the reached state to which previous events (actions in previous states) contributed¹¹ to that update: the longer ago, the smaller the contribution and thus the update (with factor λ). This seems like a more natural way to look at the updates and also the most logical way to program them.
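The forward-view weights can be made concrete in code. Below is an illustrative sketch that computes the λ-return as the weighted sum of n-step returns, for an episode that ends after the stored rewards; the array layout is an assumption of the example.

```python
def lambda_return(rewards, values, lam, gamma):
    """Forward-view lambda-return from time 0.
    rewards[k] = r_{k+1}; values[k] = current estimate V(s_{k+1}).
    The n-step return gets weight (1-lam)*lam**(n-1); the final
    (Monte Carlo) return gets the remaining weight lam**(N-1),
    so that all weights sum to 1."""
    N = len(rewards)
    G_n = 0.0       # running discounted reward sum of the first n steps
    total = 0.0
    for n in range(1, N + 1):
        G_n += gamma ** (n - 1) * rewards[n - 1]
        if n < N:
            bootstrapped = G_n + gamma ** n * values[n - 1]  # n-step return
            total += (1 - lam) * lam ** (n - 1) * bootstrapped
        else:
            total += lam ** (N - 1) * G_n   # no bootstrap at episode end
    return total
```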
The added value of multi-step methods has been demonstrated in various studies; see, among others, the DeepMind article on the "Rainbow" method [11], in which it is argued that the two most important add-ons (improvements) in RL algorithms are the multi-step mechanism and Prioritised Experience Replay (see next section).
⁸ The name Q-learning is derived from the use of the letter Q for the action-value function Q(s, a).
⁹ SARSA refers to the letters that are used to indicate the current state (s), the action taken in s (a), the reward (r), the new state (s') and the action taken in s' (a'); a' is determined by the target policy: π(s') = a'.
¹⁰ TD(0) and 1-step TD learning are synonymous; the 0 in TD(0) refers to the value of λ in TD(λ), not to the number of steps.
¹¹ Since they contributed, they are eligible for updating, hence the name eligibility trace.
Replay

The difference between on- and off-policy algorithms, and also that between online and offline algorithms, becomes somewhat less clear when we use Experience Replay. Experience Replay (ER) is the (repeated) re-offering of previous experiences (transitions (s, a) -> (r, s')) to the algorithm and performing the updates based on that. A 'stock' of transitions (the replay buffer) is therefore kept on hand, from which transitions are drawn at set times (this can be after each transition, but also less frequently) that are used for the update. It is not necessary for the current transition(s) to be used immediately; these are simply buffered and can be drawn later. Some experiences/transitions are 'more interesting' than others, for example transitions that have a target that deviates strongly from the current Q(s, a) value, and those transitions we want to offer with higher priority for the update, so that the function approximation for that (s, a) combination can become more accurate. This method is known as Prioritised Experience Replay, and it is widely used in current RL research. I have used the method described in [6], including the Importance Sampling correction (this correction ensures that the bias in the expected value(s) that occurs due to the prioritisation is corrected).
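For illustration, a minimal sketch of a prioritised replay buffer along the lines of [6] (proportional prioritisation with importance-sampling weights); a real implementation would use a sum-tree for efficiency, and all names here are illustrative:

```python
import numpy as np

class PrioritisedReplayBuffer:
    def __init__(self, capacity, alpha=0.6, beta=0.4):
        self.capacity, self.alpha, self.beta = capacity, alpha, beta
        self.buffer, self.priorities = [], []

    def add(self, transition, td_error):
        # transitions whose target deviates strongly get a high priority
        if len(self.buffer) >= self.capacity:        # rolling window
            self.buffer.pop(0); self.priorities.pop(0)
        self.buffer.append(transition)
        self.priorities.append((abs(td_error) + 1e-6) ** self.alpha)

    def sample(self, batch_size):
        p = np.array(self.priorities)
        p /= p.sum()
        idx = np.random.choice(len(self.buffer), batch_size, p=p)
        # importance-sampling weights correct the bias caused by prioritisation
        w = (len(self.buffer) * p[idx]) ** (-self.beta)
        w /= w.max()
        return [self.buffer[i] for i in idx], idx, w
```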
Expected SARSA and Q(σ)

In addition to Q-learning and SARSA, we regularly encounter the Expected SARSA algorithm in the literature. In this algorithm, the target does not take the value at a' ~ π(·|s') as with SARSA, nor that at a' = argmax_a{Q(s', a)} as in Q-learning, but the expectation of the Q-value over all possible actions in s':

Q(s, a) ← Q(s, a) + α · [r + γ · ∑_{a'∈A} π(a'|s') · Q(s', a') − Q(s, a)]

This makes the variance of the updates much smaller – certainly with a stochastic policy – than with SARSA, so that convergence occurs faster. Details about this method are given in [7]. The authors of this article present Expected SARSA as an on-policy method, but Sutton indicates in [1] (Remark 6.6 on page 140) that the behaviour and target policies may differ, as with off-policy methods. In his dissertation [9], van Hasselt confirms this, but he calls the off-policy version "General Q-learning".
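A sketch of the Expected SARSA update for a tabular Q and a given (possibly stochastic) policy; the array `pi`, holding the action probabilities π(a|s) per state, is an assumption of the example:

```python
import numpy as np

def expected_sarsa_update(Q, pi, s, a, r, s_next, alpha, gamma):
    # expectation of the Q-value over all possible actions in s'
    expected_q = np.dot(pi[s_next], Q[s_next])
    Q[s, a] += alpha * (r + gamma * expected_q - Q[s, a])
```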
De Asis et al. [10] attempted to unify the different algorithms in a multi-step RL algorithm entitled Q(σ). It would be attractive to program this algorithm and experiment with it, because it combines the other algorithms. It would mean, however, that the number of hyperparameters (see the section Hyperparameter optimisation below) increases even further.
Other variants

To most algorithms extra 'cleverness' can be added, which we call "add-ons" here. The reason for the creation of these add-ons is usually that the algorithms did not converge without them, or were not stable (i.e. they sometimes did, sometimes did not lead to the desired result, or led to different results), or were far too slow. Below we discuss, one by one, a number of add-ons that I have implemented in my research: Double Q-learning, Batch Learning, Residual Learning, Reward shaping, Target clipping and the use of a custom loss function.
Double Q-learning

In Q-learning, the value of the next state is equal to the value Q(s', a*) of the best action a* in that next state. These Q values are approximated by the function approximator, for example an NN. Because it is more likely that an estimate that (by chance) turns out to be too high is the highest value than an estimate that is accurate or too low, there is a risk of overestimation. With Double Q-learning [8] this is overcome by using 2 separate NNs, the so-called Online network and the Target network. The Online network is used for the selection of the best action in the next state, while the Target network provides the value for that best action. The Target network remains unchanged for x iterations, while the Online network is updated in the usual way. After x iterations, the Online network is copied to the Target network.
There are indications [12] that if the rewards are stochastic, Double learning also has added value for the on-policy algorithms SARSA and Expected SARSA.
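A sketch of the Double Q-learning target as described above: the Online network selects the best action in s', the Target network evaluates it. The `online` and `target` objects with a `predict` method are assumed interfaces, not code from [8]:

```python
import numpy as np

def double_q_target(online, target, r, s_next, gamma):
    # selection by the Online network ...
    a_star = int(np.argmax(online.predict(s_next)))
    # ... evaluation by the (periodically copied) Target network
    return r + gamma * target.predict(s_next)[a_star]
```

Every x iterations the Online network's weights would be copied into the Target network, which otherwise stays frozen.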
Batch Learning

If the agent cannot (or may not) interact with the real environment (and therefore cannot choose its own actions), but transition data is available, we can still make an attempt to use Batch learning to find an optimal (or good) policy. This can then be used in an online situation, without changing it online anymore. It is also possible to vary on this by first learning in batches and offering the resulting model as a "starting model" to the online agent, which then learns further. An extensive article about Batch learning is [13].
Residual Learning

Leemon Baird is a big name in the field of MDPs and RL. He is known for, among other things, Baird's counterexample, a counterexample to the proposition that RL algorithms with linear function approximation always converge. As a solution to this, Baird proposes Residual learning, where a mix of the normally used direct gradient (which ensures the speed of learning) and the residual gradient (which ensures convergence) is applied. By taking the right combination, convergence is ensured (due to the residual component) while the speed of learning remains as high as possible (due to the direct component). Details are in the article by Baird [14].
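For linear function approximation Q(s, a) ≈ w·φ(s, a), the mixed update could look as follows. This is a sketch of my reading of [14], not code from the article; `psi` interpolates between the direct gradient (psi = 0) and the residual gradient (psi = 1).

```python
def residual_update(w, phi_sa, phi_next, r, gamma, alpha, psi):
    """One update for linear Q(s, a) = w @ phi(s, a).
    psi = 0: direct (semi-)gradient; psi = 1: pure residual gradient;
    0 < psi < 1: Baird's mix of learning speed and guaranteed convergence."""
    delta = r + gamma * w @ phi_next - w @ phi_sa   # Bellman residual / TD error
    grad = phi_sa - psi * gamma * phi_next          # mixed gradient direction
    return w + alpha * delta * grad
```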
Reward shaping

With many RL problems the agent only receives feedback about the final reward, often the most important component in the total reward, at the end of an episode. It takes a while (many iterations and many repetitions of the same state-action combinations) before the agent learns from that information what a good action is in any specific state. This gave rise to the idea to lend the agent a hand and also give him feedback during the process on whether he is working "in the right direction". This can be done by manipulating the reward function in such a way that the right direction is rewarded and/or the wrong direction is penalised. Of course, the real reward must (ultimately) be independent of this; after all, it is an interim aid, not a real reward. This technique is called reward shaping; see the application in [15]. An example: if, in a debt collection process, an action leads to a response from the debtor, or a (partial) payment, this is positive, even if it does not immediately yield a reward. By nevertheless assigning a value to receiving the response or payment, the agent is encouraged to repeat the behaviour displayed.
Incidentally, the well-known AI expert Andrew Ng et al. have shown [16] that when using so-called potential-based shaping, the original optimal policy is also optimal with respect to the shaped reward. Wiewiora [17] then proved that initialising the Q values in a certain way can achieve exactly the same effect as potential-based shaping.
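In potential-based shaping, the extra reward has the form F(s, s') = γ·Φ(s') − Φ(s) for some potential function Φ over states [16]. A minimal sketch (the potential function itself is problem-specific and assumed here):

```python
def shaped_reward(r, s, s_next, phi, gamma):
    """Potential-based reward shaping: add F(s, s') = gamma*phi(s') - phi(s)
    to the real reward r. By [16], this leaves the optimal policy unchanged.
    'phi' is a problem-specific potential, e.g. higher for a state in which
    the debtor has already responded or made a partial payment."""
    return r + gamma * phi(s_next) - phi(s)
```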
Target clipping

Due to the iterative nature of the RL updates, it can occur that Q values grow out of control and become unrealistically large (or very negative). In that case it can help to limit the calculated targets, by so-called target clipping. See also the next point.
Use of a custom loss function

When using an NN for function approximation, we can choose from a number of loss functions. Such a loss function is used when fitting the NN to the data provided: the smaller the value of the loss function, the better the NN fits the data. The loss function is minimised by adjusting the weights in the NN with a gradient descent algorithm.
We can also define our own (custom) loss function, which makes it possible to steer the algorithm in a certain direction. For example, we can build in a 'penalty' – in the form of an extra loss – when the (realistic) limits for the Q values, i.e. for the output values of the NN, are exceeded. In combination with target clipping (see the previous point), this is an effective way to prevent the algorithm from derailing.
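As an illustration, a sketch of such a custom loss in Keras: mean squared error plus an extra loss when the predicted Q values leave an assumed realistic band [q_min, q_max]; the band and the penalty weight are hypothetical choices, not values from my research.

```python
import tensorflow as tf

def make_penalised_mse(q_min=-10.0, q_max=10.0, penalty_weight=1.0):
    def loss(y_true, y_pred):
        mse = tf.reduce_mean(tf.square(y_true - y_pred))
        # extra loss as soon as the NN output exceeds the realistic Q limits
        penalty = tf.reduce_mean(tf.nn.relu(y_pred - q_max) +
                                 tf.nn.relu(q_min - y_pred))
        return mse + penalty_weight * penalty
    return loss

# usage sketch: model.compile(optimizer="adam", loss=make_penalised_mse())
```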
Hyperparameters

All algorithms and their add-ons mentioned in this text have one or more hyperparameters that influence their operation. We call these hyperparameters because there are also "normal" parameters, for example the rewards and costs of actions (input parameters) and the resulting coefficients in the NN (output parameters). Most algorithms and add-ons are very sensitive to the choice of their hyperparameter values; these can make the difference between a smooth and fast converging or a slow and diverging algorithm.
Hyperparameters that have not yet been explicitly discussed in this document are:

Learning rate (also called: step size)
Usually indicated with α (as in the formulae in this document), sometimes with η: the speed (step size) with which we move in the direction of the new value (target) during an update of the old Q value. When using an NN, this parameter is included in the optimiser selection (see NN parameters below).

Exploration rate
Usually indicated with ε; it gives the fraction of cases in which exploration is applied instead of exploitation.
Rate decay patterns
Both the learning rate and the exploration rate should eventually go to 0, so that in the end there is no more exploration and no more learning. This is a necessary condition for convergence to the optimal policy. The way these rates decrease (speed, shape of the decrease) can also determine the convergence (speed) of the algorithm.
Incidentally, in situations where an ongoing (business) process¹² may change over time, it seems advisable not to stop learning completely. After all, any changes in the process could also change the optimal policy, but this will not be picked up without learning.
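Two common decay shapes, as a sketch (the floor in the second keeps a little exploration/learning alive, in line with the remark above):

```python
def linear_decay(start, end, step, total_steps):
    """Linearly from 'start' down to 'end' over 'total_steps', then constant."""
    frac = min(step / total_steps, 1.0)
    return start + frac * (end - start)

def exponential_decay(start, decay, step, floor=0.01):
    """Multiply by 'decay' each step, but never drop below 'floor'."""
    return max(start * decay ** step, floor)
```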
Discount factor
Usually indicated with γ, as in the formulae in this document. As usual, the discount factor reflects the diminishing value of money over time (lost opportunity cost and inflation). A discount factor below 1, combined with a (possibly large) reward at the end of an episode, could lead to attempts by the agent to keep the episode as short as possible, without lowering the probability of success too much.

¹² As opposed to e.g. a physical environment where the laws of physics apply.
Replay parameters
Several parameters play a role in (Prioritised Experience) Replay:
• Buffer size: how many experiences (transitions) are kept (rolling window)?
• Sample size: how many experiences are used in each replay action?
• Replay frequency: after how many transitions is replay planned¹³?
• Replacement: is sampling with or without replacement?
• Prioritisation Alpha and Beta: determine how strong the prioritisation is; see [6].
• UpdTarget (Freq): how often are the saved targets updated according to the latest NN model?
NN parameters
When using an NN for FA, a gradient descent algorithm is used to minimise the loss function (see above). The choice of this algorithm (and the associated learning rate) can be very decisive for (the speed of) convergence. There is extensive literature on the different optimisation routines, see e.g. [18].
The structure of the NN itself is also governed by hyperparameters, such as the number and the width of the layers, the activation function ('sigmoid', 'ReLU', etc.) and many more.
Feature engineering
This is not actually a parameter, but a set of operations on the original features before they serve as input for the NN, thus "helping the NN". Ideally, no feature engineering is needed at all, and the NN works purely on the raw data. All necessary "operations" (such as non-linear transformations, interactions, etc.) are then learned by the NN. With many-dimensional real problems it is then generally necessary to use very wide (many nodes per layer) and deep (many layers) networks, which therefore also require very long training times and large amounts of data. This is the domain of Deep Learning. An example is an RL algorithm for learning to play computer games with as input only the raw pixels of the computer screen.
There is also a danger in (manual) feature engineering: if the constructed features are less predictive than the original data, the cart is put before the horse.
Hyperparameter optimisation

In order to find the best values of the hyperparameters for an RL algorithm, we can of course use techniques similar to those for other ML algorithms. There is also a lot of literature about optimising hyperparameters, see e.g. [19], ranging from simple grid search to advanced Bayesian optimisation. An interesting option, which is not explored in the mentioned article, is using a Genetic Algorithm (GA).
Assuming we have a performance metric for each run/episode (or average over multiple runs/episodes) of the RL algorithm with a specific hyperparameter setting, we can interpret this metric as the "fitness" of the solution (the "gene" in GA terminology) consisting of the string of hyperparameters. Crossover of two solutions is then easily done by cutting the list of hyperparameters somewhere and combining the two ends crosswise. Mutation of a solution is also easy to program.
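A sketch of the two GA operators on a list of hyperparameters; the fitness evaluation (running the RL algorithm) is left out and all names are illustrative:

```python
import random

def crossover(parent1, parent2):
    """Cut both hyperparameter lists at the same point, recombine crosswise."""
    cut = random.randrange(1, len(parent1))
    return parent1[:cut] + parent2[cut:], parent2[:cut] + parent1[cut:]

def mutate(solution, ranges, p_mut=0.1):
    """With small probability, redraw a hyperparameter from its allowed range."""
    return [random.uniform(*rng) if random.random() < p_mut else value
            for value, rng in zip(solution, ranges)]
```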
Of course, as in grid search, this "triple-AI solution" (namely RL + NN + GA) for hyperparameter optimisation may also be challenging in terms of required computing time or power (although a degree of parallelisation seems feasible).

¹³ This of course also depends on practical circumstances: in a simulated environment any number of transitions can be replayed at any time; in reality it may only be possible to execute replay and update of the model infrequently.
Reinforcement Learning in R and Python

Although there are some efforts to incorporate RL in R code, notably the ReinforcementLearning package by Markus Dumke and the MDPtoolbox, by far most sources for programming RL algorithms (on GitHub, Kaggle, etc.) are found in Python.
RL playgrounds
Much research into Reinforcement Learning techniques is carried out using environments that are easy to simulate, a number of which are now standard benchmarks:
• The "classic environments" such as (windy) gridworld, cart-pole balancing and mountain car (in many articles and books these are used to explain concepts, also in [1])
• Atari games: 50 games of different nature and difficulty. It rarely happens that an algorithm is equally good in all games.
The Gym website of OpenAI (affiliated with Elon Musk, among others) provides access to these and other environments, unfortunately – according to the GitHub website – only via Python (but there appears to be an R interface package anyway; this has not yet been tested).
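A minimal usage sketch of the classic Gym API (the exact `reset`/`step` signatures changed in later Gym/Gymnasium versions; this follows the older interface):

```python
import gym

env = gym.make("CartPole-v1")           # one of the classic benchmark environments
obs = env.reset()
done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()  # random behaviour policy, just to illustrate
    obs, reward, done, info = env.step(action)
    total_reward += reward
print("episode reward:", total_reward)
```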
References

The references below are all contained in this zip file:
RL Literature Review - References.zip

[1] Sutton, R. and Barto, A. (2020). Reinforcement Learning: an Introduction, second edition. MIT Press. Original version is from 1998, but new versions of this book are still created, see http://incompleteideas.net/book/the-book.html.
[2] Sutton, R. (2017). Temporal-Difference Learning, slides and video lecture (http://videolectures.net/deeplearning2017_sutton_td_learning/).
[3] Ahmad Hammoudeh (2018). A Concise Introduction to Reinforcement Learning. ResearchGate.
[4] Yuxi Li (2017). Deep Reinforcement Learning: An Overview. ArXiv preprint arXiv:1701.07274v5.
[5] Van Hasselt, H. et al., DeepMind (2018). Deep Reinforcement Learning and the Deadly Triad. ArXiv preprint arXiv:1812.02648.
[6] Tom Schaul, John Quan, Ioannis Antonoglou and David Silver, Google DeepMind (2016). Prioritized Experience Replay. ArXiv preprint arXiv:1511.05952v4.
[7] Van Seijen, H., van Hasselt, H., Whiteson, S., and Wiering, M. (2009). A theoretical and empirical analysis of Expected Sarsa. In IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, pp. 177-184.
[8] Hado van Hasselt, Arthur Guez and David Silver, Google DeepMind (2015). Deep Reinforcement Learning with Double Q-learning. ArXiv preprint arXiv:1509.06461v3.
[9] Van Hasselt, H. (2011). Insights in Reinforcement Learning: Formal Analysis and Empirical Evaluation of Temporal-difference Learning. SIKS dissertation series number 2011-04.
[10] De Asis, K., Hernandez-Garcia, J. F., Holland, G. Z., and Sutton, R. S. (2017). Multi-step Reinforcement Learning: A Unifying Algorithm. ArXiv preprint arXiv:1703.01327.
[11] Matteo Hessel et al., DeepMind (2017). Rainbow: Combining Improvements in Deep Reinforcement Learning. ArXiv preprint arXiv:1710.02298v1.
[12] Ganger, M., Duryea, E. and Hu, W. (2016). Double Sarsa and Double Expected Sarsa with Shallow and Deep Learning. Journal of Data Analysis and Information Processing, 4, 159-176.
[13] Lange, S., Gabel, T., Riedmiller, M. (2012). Batch Reinforcement Learning. In: Wiering, M., van Otterlo, M. (eds) Reinforcement Learning. Adaptation, Learning, and Optimization, vol 12. Springer, Berlin, Heidelberg.
[14] Baird, L. C. (1995). Residual Algorithms: Reinforcement Learning with Function Approximation. In Prieditis & Russell, eds. Machine Learning: Proceedings of the Twelfth International Conference, 9-12 July, Morgan Kaufmann Publishers, San Francisco, CA.
[15] Raghu, A. et al. (2017). Deep Reinforcement Learning for Sepsis Treatment. ArXiv preprint arXiv:1711.09602v1.
[16] Andrew Ng et al. (1999). Policy invariance under reward transformations: theory and application to reward shaping. In Machine Learning, Proceedings of the Sixteenth International Conference, pp. 278-287. Morgan Kaufmann.
[17] Eric Wiewiora (2003). Potential-based Shaping and Q-value Initialization are Equivalent. Journal of Artificial Intelligence Research, 19, 205-208.
[18] Sebastian Ruder (2017). An overview of gradient descent optimization algorithms. ArXiv preprint arXiv:1609.04747v2.
[19] Hazan, E. (2018). Hyperparameter Optimization: A Spectral Approach. ArXiv preprint arXiv:1706.00764v4.
[20] David Silver et al., Google DeepMind (2016). Mastering the game of Go with deep neural networks and tree search. Nature vol 529.
[21] Edwin Pednault, Naoki Abe et al. (2002). Sequential Cost-Sensitive Decision Making with Reinforcement Learning. IBM Watson Research Centre. SIGKDD'02, Edmonton, Alberta, Canada.
[22] Naoki Abe et al. (2004). Cross Channel Optimized Marketing by Reinforcement Learning. IBM Watson Research Centre. SIGKDD'04.
[23] Naoki Abe et al. (2010). Optimizing Debt Collections Using Constrained Reinforcement Learning. IBM Research. KDD'10, Washington DC.
[24] Georgios Theocharous, Assaf Hallak, Adobe Research (2013). Lifetime Value Marketing using Reinforcement Learning. Paper F21 in RLDM (Multi-disciplinary Conference on Reinforcement Learning and Decision Making) 2013, Princeton, New Jersey, USA.
[25] Yegor Tkachenko (2015). Autonomous CRM Control via CLV Approximation with Deep Reinforcement Learning in Discrete and Continuous Action Space. Stanford University. ArXiv preprint arXiv:1504.01840v1.
[26] Marco Wiering (2018). Reinforcement Learning: from methods to applications. Nieuw Archief voor de Wiskunde (KWG).