Overview of Reinforcement Learning literature
April 2019/February 2021
Chiel Bakkeren
Contents

Contents
Introduction
  Other researchers
  Review articles
  Applications in a business environment
Markov Decision Processes
  Solution methods
  Reinforcement Learning
  Delayed reward
Algorithms
  On- and off-policy methods
  Q-learning and SARSA
  Multi-step methods
  Replay
  Expected SARSA and Q(σ)
  Other variants
Hyperparameters
  Hyperparameter optimisation
Reinforcement Learning in R and Python
References
Index
Introduction
Although Reinforcement Learning as a concept is now several decades old, much of the research has only been done in the last 5-10 years. Most famous are probably the results of DeepMind, a London company that was bought by Google in early 2014. DeepMind turned out to be able to beat the best Go player in the world with its algorithm AlphaGo, whereby the algorithm initially also learned from games played by good Go players, and subsequently improved itself (AlphaGo Zero) by learning from playing many games against itself [20]. Now there is AlphaZero, which in addition to Go can also play chess and shogi at the highest level.
DeepMind is one of the research centres where a lot of research is done on RL, both theoretically and in terms of applications. Well-known researchers from DeepMind are:
• David Silver: wrote many articles as first or co-author (see e.g. the article on AlphaGo) and is widely cited. Also made a series of video lectures on RL (slides here).
• Hado van Hasselt: a Dutch PhD (UU, CWI, see [9]), who developed, among other things, Double Q-learning (see Appendix B).
• Tom Schaul: lead author of the article on Prioritised Experience Replay [6].
The standard work in the field of Reinforcement Learning is the book by Sutton and Barto [1]; this book is in the reference list of almost every RL article. In the more recent versions of this book, the last paragraph of each chapter contains an interesting listing of the authors and articles that contributed to the development of the chapter's topic; very useful as a cross reference. In this overview, we follow the main thread of this book to review the various other topics in the RL literature.
Sutton is more prominent than Barto; he has, among other things, an extensive (but slightly messy) website. Many of the current RL researchers have spent some time with the RL & AI group at the University of Alberta in Canada, where Richard Sutton works. Sutton publishes articles continuously, including via ResearchGate.
There is now also a (set of) courses, a Specialization, on Coursera about Reinforcement Learning. This has not been checked out yet.
Other researchers
As is often the case, the US/North America is leading in the field:
• Andrew Ng: one of the best-known AI researchers, working at Stanford, who is very successful, partly through his popular courses via Coursera (of which he is a founder).
• Naoki Abe: IBM researcher, who stands out for a number of publications on applications of RL in a business environment.
• Georgios Theocharous: researcher at Adobe (previously at M.I.T., Intel and Yahoo) in the field of (applications of) RL and other ML techniques, especially in marketing.
I probably have a bias towards Dutch(-speaking) researchers, but it is striking that there are quite a few of those:
• Harm van Seijen: did a PhD at the UvA with Shimon Whiteson and also worked for 4 years as a postdoc with Sutton. Now heads the Microsoft Research Montréal RL team.
• Pieter Abbeel: a Belgian who, through a PhD at Stanford with Andrew Ng and a research position at OpenAI, eventually became head of the Robot Learning Lab in Berkeley (among others).
• Marco Wiering: supervisor of Hado van Hasselt; his name appears in all kinds of publications. Has apparently not (yet) succumbed to the major players in the AI field, because he still 'just' works in the AI group of the RUG.
Review articles
In addition to the articles on specific RL topics, many of which are referenced in this document, there are also a number of interesting review articles:
• The ResearchGate article [3] by Ahmad Hammoudeh provides a nice insight into what RL is, including a number of applications and many references to literature.
• In September 2018 "Nieuw Archief voor de Wiskunde" (the magazine of the Dutch Royal Mathematical Society) was entirely devoted to AI, and it contained a nice overview article by Marco Wiering [26], also with many references.
• Even more extensive is the overview by Yuxi Li [4], which can safely be called a booklet, with over 40 pages of background, main elements and mechanisms, applications and resources (books, reports, courses and tutorials, conferences, blogs, testbeds, etc.), and 25 pages of references.
Applications in a business environment
Of course, for a commercial company it is at least as interesting to look for RL applications in a business environment. After all, in such an environment you run into all sorts of specific problems that will bother you less or not at all with the 'school examples': how do you find the best (or a good) algorithm or parameters without immediately learning online with 'live' clients (batch/offline learning, setting up a simulation environment?), how do you connect things to existing systems and procedures, how do you explain to those involved how the algorithm learns and what the final policy entails, etc.? For this reason it is nice to see how others have dealt with those problems. However, the number of published applications of RL in commercial environments is not (yet) that large. It may be that there is still little to report, because it is a relatively new technology and/or because it is difficult to get it working properly and to optimise it. But it may also be that successful applications are not published (in detail) for competitive reasons. Although it is regularly claimed that use is made of RL, details are mostly not given.
Still, a number of articles with RL business applications were found, including three pieces (partly) by Naoki Abe (IBM Research), of which two in the field of marketing ([21] and [22]) and one in the field of debt collection [23]. This last article concerns the collection of tax liabilities by the New York State Department of Taxation and Finance. At the time of the article (2010), it was estimated that this could save $100 million in 3 years on annual tax revenues of $1 billion. Incidentally, the development of the "engine" for this cost IBM 5 million dollars in research investments (and the State Department another 4 million dollars).
Another interesting article is [15], which describes an application of RL in the medical world. It shows many problems and choices that I also encountered in practice. I took over a number of things, such as target clipping, reward shaping, a custom loss function to punish deviant Q values, and parts of the configuration of the Neural Network (NN).
Two other articles with commercial applications of RL have been found, but have not yet been studied ([24] and [25]).
Markov Decision Processes
Markov Decision Processes or MDPs are decision problems characterised by an Agent (actor) who takes decisions in an Environment, at/in successive times/periods, and gradually receives rewards or incurs costs (positive or negative rewards).
Somewhat more formally defined:
• The Environment finds itself at any time t in one of the possible states s ∈ S
• Every period the Agent performs an action a ∈ A (allowed in state s; this can also be "do nothing")
• The feedback from the Environment to the Agent, following the action taken, is twofold: (r, s'), where r is the (direct) reward and s' the new state of the Environment. This feedback only depends on the old state s and the action taken, not on the previous history of the process (this is the so-called Markov property of the process)
• The entire process can be finite (then a sequence from start to end is called an episode¹) or infinite
• The aim for the Agent is to act in such a way that the total average reward² is maximised
Schematic (from [1]):
Figure 1: Agent acting in an Environment
The way the Agent tries to maximise the reward is called the policy (denoted by π), which is the central concept in RL. Formally, the policy is the rule that determines the action a to be taken for a given state s: π: S → A. The policy can be deterministic, i.e. exactly one action a is prescribed for each state s (π(s) = a), or stochastic, i.e. for a given state s the action a is chosen with a certain probability (π(a|s), with ∑_a π(a|s) = 1).

¹ When every episode consists of only one period, so it requires only one decision, the problem is called a Multi-armed Bandit (MAB) problem (One-armed Bandit being the nickname for the well-known (fruit) slot machines). The decision required there is which arm (slot machine) to pull for maximum (average) reward over all episodes.
² Averaged over episodes if the process is finite, averaged over periods in the infinite case (in which case a discount factor γ < 1 must be chosen).

The feedback from the Environment to the Agent is in principle stochastic, which means that the feedback, consisting of the new state s' and the direct reward r, is determined by probabilities:
P(r, s'|s, a) = the probability that, given state s and action a, the next state is s' and the direct reward is r, where of course ∑_{r∈ℝ, s'∈S} P(r, s'|s, a) = 1 for all s, a.
NB. It is quite possible that in a specific RL problem the direct reward does not depend on s, s' or a, but only on one or two of this trio.
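To make the notation concrete, below is a minimal sketch of an MDP as a Python data structure: a hypothetical two-state toy problem (not taken from the reviewed literature), with the transition probabilities P(r, s'|s, a) stored per (state, action) pair and one sampled interaction step.

```python
import random

# Hypothetical toy MDP: P maps (state, action) to a list of
# (probability, reward, next_state) triples whose probabilities sum to 1.
P = {
    ("s0", "wait"): [(1.0,  0.0, "s0")],
    ("s0", "act"):  [(0.7, -1.0, "s1"), (0.3, 0.0, "s0")],
    ("s1", "wait"): [(1.0,  0.0, "s1")],
    ("s1", "act"):  [(0.5, 10.0, "s0"), (0.5, -1.0, "s1")],
}

def step(state, action):
    """Sample (reward, next_state) from the Environment's feedback distribution."""
    outcomes = P[(state, action)]
    probs = [p for p, _, _ in outcomes]
    _, reward, next_state = random.choices(outcomes, weights=probs, k=1)[0]
    return reward, next_state

reward, next_state = step("s0", "act")   # one Agent-Environment interaction
```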
Solution methods
In order to solve an MDP (i.e. determine the optimal policy³), it matters whether we know the entire operation of the environment. The states S and the possible actions A are in principle always⁴ known, but the probabilities P(r, s'|s, a) do not have to be. A solution method that makes use of the knowledge about P(·) is called model-based; a method that does not do this is called model-free.
A model-based method known from Operations Research is Dynamic Programming (possibly stochastic). Using the known model of the environment, it can be determined for each state s_{T−1} from the penultimate period what the expected⁵ reward r_T is for each permitted action a_{T−1}, and therefore also what the optimal action a*_{T−1} is. The value of state s_{T−1} is then the reward for that optimal action choice. Then we take another step back and perform the same exercise with the possible states s_{T−2} in the second-to-last period, with the difference that we calculate the sum of the direct reward r_{T−1} plus the expected value of the state s_{T−1} we end up in. The action that leads to the highest sum is the optimal action a*_{T−2} from s_{T−2}. This continues until we know for each initial state s_0 what the optimal action a*_0 is and what the associated expected value of [(direct) reward r_1 + the value of state s_1] is. The set of optimal actions {a*_t} forms the optimal policy (usually denoted by π*).
Other known solution methods are Value iteration (VI) and Policy iteration (PI), which are discussed in detail in [1].

³ In the seminal work of Sutton and Barto [1], finding the optimal policy is called solving the control problem. Next to this (or rather before this) a lot of attention is given to the prediction problem, i.e. determining (predicting) the state-value function V_π(s) and the action-value function Q_π(s, a) for a given policy π. These functions give the expected reward from state s (for V), and additionally taking action a (for Q), followed by policy π until the end of the episode (or "infinitely").
⁴ Note that there are also MDPs where the states are only partially observed, the so-called POMDPs (Partially Observed MDPs).
⁵ In a stochastic problem we have an expected reward and multiple possible next states s'; in a completely deterministic problem the reward r and the next state s' are known when s and a are given.
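As an illustration of the backward-induction idea described above, here is a minimal sketch of finite-horizon Dynamic Programming, assuming a known model `P` in the same (probability, reward, next_state) format as the toy MDP above; the horizon T and the state and action sets are hypothetical inputs.

```python
def backward_induction(states, actions, P, T):
    """Finite-horizon DP: returns values V[t][s] and optimal actions pi[t][s]."""
    V = [{s: 0.0 for s in states} for _ in range(T + 1)]   # V[T] = 0 at the terminal period
    pi = [{} for _ in range(T)]
    for t in reversed(range(T)):                           # step backwards from T-1 to 0
        for s in states:
            best_a, best_q = None, float("-inf")
            for a in actions:
                # expected direct reward plus expected value of the next state
                q = sum(p * (r + V[t + 1][s2]) for p, r, s2 in P[(s, a)])
                if q > best_q:
                    best_a, best_q = a, q
            V[t][s], pi[t][s] = best_q, best_a
    return V, pi
```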
Reinforcement Learning
In many cases, the model P(r, s'|s, a) is unknown. We then have to rely on other methods, which have been on the rise in recent decades under the collective name Reinforcement Learning (RL).
A characteristic of these methods is that the missing knowledge about the model of the environment is replaced by pieces of experience that are gained while letting the agent operate in the environment. This explains the second part of the name: with every interaction something is "learned" about how the environment works. The first part, Reinforcement, refers to the strengthening (or not) of already acquired knowledge by new experiences: an experience (s, a) => (s', r) that corresponds with previous experiences from (s, a) provides reinforcement of the previously acquired knowledge. Naturally, in case of non-agreement, the previous knowledge is (partly) "extinguished".
Delayed reward
With many MDPs the direct reward is not representative of how good an action is; it is often just as important that the environment is brought into a better, more valuable state, from which the final reward can be obtained. As an example, in a marketing environment: if by sending a number of messages through the right channels at the right times (which in the first instance only cost money) we can get a client to react, or maybe even make a first purchase, then we have arrived in a much more favourable state, with a higher 'value'. Maybe it would have been cheaper in terms of direct reward to take less expensive (or no) action(s), but then we might not have gotten into that more valuable state.
Direct rewards are often 0 or negative, because the action taken entails costs and does not immediately yield anything. This makes it a tricky business to try and solve an MDP whose model is unknown by using supervised learning techniques. After all, as a target variable we have at most the direct reward available, and this is not necessarily maximal for the best action in the given circumstances. Instead, we can also wait until the end of an episode and then take the final (total) reward as a target variable (this is so-called Monte Carlo learning), but then the big question is to what extent we should assign that reward to each action taken in the episode. Another disadvantage is that we always have to continue until the end of an episode in order to learn something. In addition, for many problems, the number of possible "trajectories" (paths from the beginning to the end of an episode) is so large that it becomes impractical.
The method that sits between these two extremes, and is called Temporal-Difference (TD) learning by Sutton, uses estimates of the value of the next state (along with the direct reward) as a 'temporary' target variable and adjusts the current estimate using these estimates (the so-called 'update'). The word "difference" here refers to the difference between the new estimate (called the target) and the earlier estimate known up to the time of the update. TD learning is therefore learning "a guess from a guess" [2] (slide 7; in the same presentation, see also slides 9-16 for further explanation of multi-step predictions versus one-step methods such as supervised learning).
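The "guess from a guess" update can be written in a couple of lines. A minimal sketch of a tabular one-step TD(0) update of the state-value estimate V, with a learning rate `alpha` and discount factor `gamma` (hyperparameters discussed later in this document):

```python
def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.95):
    """One-step TD(0): move V(s) towards the target r + gamma * V(s')."""
    target = r + gamma * V[s_next]           # the 'guess' used as temporary target
    V[s] = V[s] + alpha * (target - V[s])    # the temporal-difference update
    return V
```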
The different methods for solving multi-step decision problems – all classified by Sutton under RL – are not completely separate from each other, but gradually merge into one another. This is illustrated by Figure 2 below (from [1], p. 157).
Figure 2: Schematic overview of RL methods
The "depth" of the update is shown here on the vertical axis, with 1-step TD learning at the top and Monte Carlo learning (entire episodes) at the bottom. On the horizontal axis, on the left are the methods where only the effect of one decision (action) is considered per step (it is "sampled" from the possible decisions), while on the right are methods where the expected value is calculated over all possible decisions. Because the expected value assumes knowledge of the underlying probability model, all model-free methods are located on the left axis.
Many other aspects play a role in the use of these methods that cannot be captured in this two-dimensional picture. Two of those aspects should be mentioned here, namely the exploration-exploitation dilemma and function approximation.
As soon as various actions have been tried out in a certain state, a (temporary) picture arises of the value of taking those various actions in that state. In particular, it is known what – up to that point – is the best action. Every time we come back to the same state, we can of course take that best action (we call this exploitation of the acquired knowledge, or greedy action selection), but then we may not learn a lot⁶. That is why it makes sense to keep exploring and occasionally choose actions that – up to that point – are not optimal, or that have not been tried at all; we then learn whether we can take even better actions than we thought. It may be that this exploration (temporarily) costs money, but potentially there is a better policy in return⁷.
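As an illustration of the trade-off just described, a minimal sketch of ε-greedy action selection over a tabular Q (the tabular Q dictionary and the value of ε are assumptions for illustration):

```python
import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore (random action), otherwise exploit (greedy action)."""
    if random.random() < epsilon:
        return random.choice(actions)                     # explore
    return max(actions, key=lambda a: Q[(state, a)])      # exploit: greedy action selection
```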
When the number of possible states becomes very large, it becomes impractical to register the value of each state (or state-action combination) separately (in table form). Function approximation offers a solution: the values are approximated by a (state-value or action-value) function of the available features, which contains a limited number of parameters. In its simplest form, this can be a linear function of the features, but more complex non-linear functions such as a neural network (NN) can also be used. When a deep NN is used (loosely defined as a NN with more than 2 hidden layers), the term Deep Reinforcement Learning is used.
⁶ In a stochastic environment we obtain an increasingly better estimate of the 'expected reward' of the specific action, but we do not learn anything more about the other actions.
⁷ Compare this to the situation when going out to dinner (pre-corona 😉): if you always choose a restaurant that gave you good experiences in the past, you will never try out new restaurants or revisit restaurants that gave you one or two less favourable experiences. At the very least, you might miss a very good restaurant you haven't tried yet.
Algorithms
In order to solve the MDP (to find the optimal policy), we have to approach the actual (state-)value function as closely as possible, bit by bit. This is done systematically by an RL algorithm. A lot of research has been done in the field of RL algorithms and many different "flavours" have emerged. This section provides a brief overview of these flavours.
The optimal policy must comply with the Bellman optimality equation:
Q*(s, a) = ∑_{s'∈S} P(s'|s, a) · [ r_{s,a,s'} + γ · max_{a'} Q*(s', a') ]
If we have found the function Q* – or approximated it closely enough – then we know the optimal policy, because this is the greedy policy that chooses the action a with maximum Q* value in each state.
On- and off-policy methods
An often mentioned distinction within RL algorithms is that between on-policy and off-policy methods. As mentioned, an RL algorithm searches for the optimal policy; we therefore call this policy the target policy (also known as: estimation policy). However, there is another policy in play, namely the policy that determines which action is taken during the search process, the so-called behaviour policy. When both policies are the same – and thus the policy actually used is itself optimised – we speak of an on-policy algorithm; otherwise the algorithm is off-policy. In order to continue to learn sufficiently from new experiences, the behaviour policy in an on-policy algorithm must continue to explore (otherwise the greedy action is always chosen). This problem is not an issue with an off-policy algorithm: after all, you can explore to your heart's content (in principle even completely at random). However, this does mean that there are fewer guarantees for convergence (i.e. that the optimal policy is approached more and more closely) when using an off-policy algorithm, certainly in combination with function approximation. According to Sutton, the combination of 3 elements in particular is 'dangerous' (he speaks of the Deadly Triad, see section 11.3 in [1]): function approximation, bootstrapping (= use of TD learning, "a guess from a guess") and an off-policy algorithm. Since function approximation is indispensable for larger, complex problems and bootstrapping can greatly increase efficiency, the use of on-policy learning is often the solution. Incidentally, further research is still being done into the Deadly Triad [5], mainly because people want to better understand how it is possible that some algorithms still successfully combine the three elements mentioned.
Off-policy learning can, however, offer great advantages, for example by learning what the optimal policy is from already available previous experiences, which have been acquired by following a different policy (for example by a real-life agent).
When using a so-called ε-greedy behaviour policy, off-policy learning comes very close to on-policy learning: with probability ε (which of course must be small) a random action is chosen, and with probability 1−ε the greedy action is chosen, just as in the target policy.
Q-learning and SARSA
The best-known off-policy RL algorithm is Q-learning⁸ and the best-known on-policy algorithm is SARSA⁹. The difference between the two is in the update of the value function, when the new value is formed by the direct reward + the value of the next state in which we end up. In Q-learning, the value of the next state is taken at the – up to that moment – best action in that state (the greedy action), while with SARSA the value is taken at the action prescribed by the policy (the behaviour policy and the target policy, because they are the same; note that this policy can be stochastic, i.e. π(a|s) = P(a|s) is a probability distribution and not a deterministic rule π(s) = a).
In formulae:
Q(s, a) ← Q(s, a) + α · ( r + γ · max_{a'} Q(s', a') − Q(s, a) )   for Q-learning
Q(s, a) ← Q(s, a) + α · ( r + γ · Q(s', a') − Q(s, a) )   for SARSA, with a' the action selected in s' by the policy π
where α is the learning rate and γ the discount factor.
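The difference between the two updates is easy to see in code. A minimal tabular sketch (Q as a dictionary over (state, action) pairs with all entries initialised; an ε-greedy behaviour policy as sketched earlier is assumed):

```python
def q_learning_update(Q, s, a, r, s_next, actions, alpha, gamma):
    """Off-policy: bootstrap on the greedy (max) action in the next state."""
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_update(Q, s, a, r, s_next, a_next, alpha, gamma):
    """On-policy: bootstrap on the action a_next actually chosen by the policy in s'."""
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])
```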
Multi-step methods
If we look back at Figure 2 above, we see on the left axis the TD learning methods, with TD(0) at the top¹⁰ and Monte Carlo learning at the bottom. The so-called multi-step methods lie between these two extremes. These methods are characterised by the fact that the target does not consist of the one-step return (i.e. the reward of one step + the value of the next state), but of the n-step return (the consecutive rewards of several (say n) steps + the value of the state we end up in after those n steps). So we still bootstrap (i.e. we use the "preliminary value" of the state after n steps), but "sample" more than one step before bootstrapping. As n increases, we get closer and closer to Monte Carlo learning, where we sample until the end of the episodes (in finite processes). When using n-step methods, the question naturally arises which n is optimal. Because each n has its advantages and disadvantages, the eligibility trace concept was devised (see chapter 12 in [1]). We then calculate the target as the weighted sum of all individual n-step returns (so 1-step, 2-step, etc.), where the 1-step return is given weight (1−λ) and the weight of the return of each subsequent step is a factor λ (0 ≤ λ ≤ 1) smaller: the return of step k is therefore given weight (1−λ)·λ^(k−1). The resulting algorithm is called TD(λ) for short.
Viewed in this way, we use the "forward view" ([1] p. 288). We can also use the "backward view" by looking back from the reached state to see which previous events (actions in previous states) contributed¹¹ to that update: the longer ago, the smaller the contribution and thus the update (with factor λ). This seems like a more natural way to look at the updates and also the most logical way to program them.
The added value of multi-step methods has been demonstrated in various studies; see, among others, the DeepMind article on the "Rainbow" method [11], in which it is argued that the two most important add-ons (improvements) in RL algorithms are the multi-step mechanism and Prioritised Experience Replay (see next section).
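A sketch of the n-step return and the λ-weighted (forward-view) combination described above; the lists of rewards, the value table V and λ are illustrative placeholders, and the λ-weighting follows the (1−λ)·λ^(k−1) scheme exactly as stated in the text:

```python
def n_step_return(rewards, V, s_after_n, gamma):
    """Sum of n sampled rewards plus the bootstrapped value of the state reached after n steps."""
    G = sum(gamma**k * r for k, r in enumerate(rewards))
    return G + gamma**len(rewards) * V[s_after_n]

def lambda_return(n_step_returns, lam):
    """Forward-view TD(lambda): weight the k-step return by (1 - lambda) * lambda**(k-1)."""
    return sum((1 - lam) * lam**(k - 1) * G_k
               for k, G_k in enumerate(n_step_returns, start=1))
```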
⁸ The name Q-learning is derived from the use of the letter Q for the action-value function Q(s, a).
⁹ SARSA refers to the letters that are used to indicate the current state (s), the action taken in s (a), the reward (r), the new state (s') and the action taken in s' (a'); a' is determined by the target policy: π(s') = a'.
¹⁰ TD(0) and 1-step TD learning are synonymous; the 0 in TD(0) refers to the value of λ in TD(λ), not to the number of steps.
¹¹ Since they contributed, they are eligible for updating, hence the name eligibility trace.
Replay
The difference between on- and off-policy algorithms, and also that between online and offline algorithms, becomes somewhat less clear when we use Experience Replay. Experience Replay (ER) is the (repeated) re-offering of previous experiences (transitions (s, a) → (r, s')) to the algorithm and performing the updates based on that. A 'stock' of transitions (the replay buffer) is therefore kept on hand, from which transitions are drawn at set times (this can be after each transition, but also less frequently) and used for the update. It is not necessary for the current transition(s) to be used immediately; these are simply buffered and can be drawn later. Some experiences/transitions are 'more interesting' than others, for example transitions that have a target that deviates strongly from the current Q(s, a) value, and those transitions we want to offer with higher priority for the update, so that the function approximation for that (s, a) combination can become more accurate. This method is known as Prioritised Experience Replay, and it is widely used in current RL research. I have used the method described in [6], including the Importance Sampling correction (this correction ensures that the bias in the expected value(s) that occurs due to the prioritisation is corrected).
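A minimal sketch of a replay buffer with proportional prioritisation, as a strongly simplified variant of the mechanism in [6]: the sum-tree implementation, the α/β exponents and the importance-sampling correction of the full method are deliberately omitted here.

```python
import random
from collections import deque

class ReplayBuffer:
    """Rolling buffer of transitions, sampled with priority proportional to the TD error."""
    def __init__(self, capacity=10_000):
        self.buffer = deque(maxlen=capacity)        # stores (priority, transition) pairs

    def add(self, transition, td_error, eps=1e-3):
        # larger TD error => higher priority; eps keeps every transition sampleable
        self.buffer.append((abs(td_error) + eps, transition))

    def sample(self, batch_size):
        priorities = [p for p, _ in self.buffer]
        picks = random.choices(self.buffer, weights=priorities, k=batch_size)
        return [t for _, t in picks]
```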
Expected SARSA and Q(σ)
In addition to Q-learning and SARSA, we regularly encounter the Expected SARSA algorithm in the literature. In this algorithm, the target does not take the Q value at a' ~ π(·|s') as with SARSA, nor that at a' = argmax_a Q(s', a) as in Q-learning, but the expectation of the Q value over all possible actions in s':
Q(s, a) ← Q(s, a) + α · [ r + γ · ∑_{a'∈A} π(a'|s') · Q(s', a') − Q(s, a) ]
This makes the variance of the updates much smaller – certainly with a stochastic policy – than with SARSA, so that convergence occurs faster. Details about this method are given in [7]. The authors of this article present Expected SARSA as an on-policy method, but Sutton indicates in [1] (Remark 6.6 on page 140) that the behaviour and target policies may differ, as with off-policy methods. In his dissertation [9], van Hasselt confirms this, but he calls the off-policy version "General Q-learning".
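The Expected SARSA target, written out for a tabular Q and a stochastic policy given as a dictionary of probabilities π(a|s) keyed by (state, action) (an illustrative sketch, not a full algorithm):

```python
def expected_sarsa_target(Q, r, s_next, policy, actions, gamma):
    """Target = r + gamma * expectation of Q(s', a) under the policy's action probabilities."""
    expected_q = sum(policy[(s_next, a)] * Q[(s_next, a)] for a in actions)
    return r + gamma * expected_q
```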
De Asis et al. [10] attempted to unify the different algorithms in a multi-step RL algorithm entitled Q(σ). It would be attractive to program this algorithm and experiment with it, because it combines the other algorithms. It would mean, however, that the number of hyperparameters (see the section Hyperparameter optimisation below) increases even further.
Other variants
To most algorithms extra 'cleverness' can be added, which we call "add-ons" here. The reason for the creation of these add-ons is usually that without them the algorithms did not converge, were not stable (i.e. they sometimes did and sometimes did not lead to the desired result, or led to different results), or were far too slow. Below we discuss, one by one, a number of add-ons that I have implemented in my research: Double Q-learning, Batch Learning, Residual Learning, Reward shaping, Target clipping and the use of a custom loss function.
• Double Q-learning
In Q-learning, the value of the next state is equal to the value Q(s', a*) of the best action a* in that next state. These Q values are approximated by the function approximator, for example an NN. Because it is more likely that an estimate that (by chance) turns out too high is the highest value than that an accurate or too-low estimate is, there is a risk of overestimation. With Double Q-learning [8] this is overcome by using 2 separate NNs, the so-called Online network and the Target network. The Online network is used for the selection of the best action in the next state, while the Target network provides the value for that best action. The Target network remains unchanged for x iterations, while the Online network is updated in the usual way. After x iterations, the Online network is copied to the Target network.
There are indications [12] that if the rewards are stochastic, Double learning also has added value for the on-policy algorithms SARSA and Expected SARSA.
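A sketch of the Double Q-learning target with the two networks decoupled as described above; the `online_net` and `target_net` objects with a `predict` method returning per-action Q values are assumptions, and any Q-network wrapper with an equivalent interface would do.

```python
import numpy as np

def double_q_target(online_net, target_net, r, s_next, gamma):
    """Select the best next action with the Online network, evaluate it with the Target network."""
    q_online = online_net.predict(s_next)      # per-action Q values from the Online network
    a_best = int(np.argmax(q_online))          # action selection
    q_target = target_net.predict(s_next)      # per-action Q values from the Target network
    return r + gamma * q_target[a_best]        # action evaluation
```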
• Batch Learning
If the agent cannot (or may not) interact with the real environment (and therefore cannot choose its own actions), but transition data is available, we can still make an attempt to use Batch learning to find an optimal (or good) policy. This can then be used in an online situation, without changing it online anymore. It is also possible to vary on this by first learning in batches and offering the resulting model as a "starting model" to the online agent, which then learns further. An extensive article about Batch learning is [13].
• Residual Learning
Leemon Baird is a big name in the field of MDPs and RL. He is known for, among other things, Baird's counterexample, a counterexample to the proposition that RL algorithms with linear function approximation always converge. As a solution to this, Baird proposes Residual learning, where a mix of the normally used direct gradient (which ensures the speed of learning) and the residual gradient (which ensures convergence) is applied. By taking the right combination, convergence is ensured (due to the residual component) while the speed of learning remains as high as possible (due to the direct component). Details are in the article by Baird [14].
• Reward shaping
With many RL problems the agent only receives feedback about the final reward, often the most important component of the total reward, at the end of an episode. It takes a while (many iterations and many repetitions of the same state-action combinations) before the agent learns from that information what a good action is in any specific state. This gave rise to the idea to lend the agent a hand and also give it feedback during the process on whether it is working "in the right direction". This can be done by manipulating the reward function in such a way that the right direction is rewarded and/or the wrong direction is penalised. Of course, the real reward must (ultimately) be independent of this; after all, it is an interim assistant, not a real reward. This technique is called reward shaping; see an application in [15]. An example: if, in a debt collection process, an action leads to a response from the debtor or a (partial) payment, this is positive, even if it does not immediately yield a reward. By nevertheless assigning a value to receiving the response or payment, the agent is encouraged to repeat the behaviour displayed.
Incidentally, the well-known AI expert Andrew Ng et al. have shown [16] that when using so-called potential-based shaping, the original optimal policy is also optimal with respect to the shaped reward. Wiewiora [17] then proved that initialising the Q values in a certain way can achieve exactly the same effect as potential-based shaping.
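A sketch of potential-based shaping in the sense of [16]: the shaped reward adds the discounted difference of a potential function Φ over states. The example potential `phi` below, for a collection-like setting, is a hypothetical illustration and not taken from the cited work.

```python
def shaped_reward(r, s, s_next, phi, gamma):
    """Potential-based shaping: r' = r + gamma * phi(s') - phi(s), which leaves the optimal policy unchanged [16]."""
    return r + gamma * phi(s_next) - phi(s)

# Hypothetical potential: reward progress towards a 'responded' or 'paid' state.
phi = lambda s: {"no_contact": 0.0, "responded": 1.0, "paid": 3.0}.get(s, 0.0)
```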
• Target clipping
Due to the iterative nature of the RL updates, it can occur that Q values grow out of control and become unrealistically large (or very negative). In that case it can help to limit the calculated targets, by so-called target clipping. See also the next point.
• Use of a custom loss function
When using an NN for function approximation, we can choose from a number of loss functions. Such a loss function is used when fitting the NN to the data provided: the smaller the value of the loss function, the better the NN fits the data. The loss function is minimised by adjusting the weights in the NN with a gradient descent algorithm.
We can also define our own (custom) loss function, which makes it possible to steer the algorithm in a certain direction. For example, we can build in a 'penalty' – in the form of an extra loss – when the (realistic) limits for the Q values, i.e. for the output values of the NN, are exceeded. In combination with target clipping (see previous point), this is an effective way to prevent the algorithm from derailing.
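A sketch of such a custom loss in Keras/TensorFlow style: the usual squared error plus an extra penalty whenever the predicted Q value leaves a chosen realistic range. The bounds and penalty weight are assumptions for illustration, not values from the research described here.

```python
import tensorflow as tf

Q_MIN, Q_MAX, PENALTY = -100.0, 100.0, 10.0   # assumed realistic Q bounds and penalty weight

def penalised_mse(y_true, y_pred):
    """Squared error plus an extra loss for predicted Q values outside [Q_MIN, Q_MAX]."""
    mse = tf.reduce_mean(tf.square(y_true - y_pred))
    overflow = tf.maximum(y_pred - Q_MAX, 0.0) + tf.maximum(Q_MIN - y_pred, 0.0)
    return mse + PENALTY * tf.reduce_mean(tf.square(overflow))

# model.compile(optimizer="adam", loss=penalised_mse)   # used like any built-in Keras loss
```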
Hyperparameters
All algorithms and their add-ons mentioned in this text have one or more hyperparameters that influence their operation. We call these hyperparameters, because there are also "normal" parameters, for example the rewards and costs of actions (input parameters) and the resulting coefficients in the NN (output parameters). Most algorithms and add-ons are very sensitive to the choice of their hyperparameter values; they can make the difference between a smooth and fast converging or a slow and diverging algorithm.
Hyperparameters that have not yet explicitly been discussed in this document are:
• Learning rate (also called: step size)
Usually indicated with α (as in the formulae in this document), sometimes with η: the speed (step size) with which we move in the direction of the new value (target) during an update of the old Q value. When using an NN, this parameter is included in the optimiser selection (see the NN parameters point below).
• Exploration rate
Usually indicated with ε, it gives the fraction of cases in which exploration is required instead of exploitation.
• Rate decay patterns
Both the learning rate and the exploration rate should eventually go to 0, so that in the end there is no more exploration and no more learning. This is a necessary condition for convergence to the optimal policy. The way (speed, shape of the decrease) these rates decrease can also determine the convergence (speed) of the algorithm.
Incidentally, in situations where an ongoing (business) process¹² may change over time, it seems advisable not to stop learning completely. After all, any changes in the process could also change the optimal policy, but this will not be picked up without learning.
• Discount factor
Usually indicated with γ, as in the formulae in this document. As usual, the discount factor reflects the diminishing value of money over time (lost opportunity cost and inflation). A strong discount (a discount factor well below 1) combined with a (possibly large) reward at the end of an episode could lead to attempts by the agent to keep the episode as short as possible, without lowering the probability of success too much.

¹² As opposed to e.g. a physical environment where the laws of physics apply.
• Replay parameters
Several parameters play a role in (Prioritised Experience) Replay:
- Buffer size: how many experiences (transitions) are kept (rolling window)?
- Sample size: how many experiences are used in each replay action?
- Replay frequency: after how many transitions is replay planned¹³?
- Replacement: is sampling done with or without replacement?
- Prioritisation Alpha and Beta: determine how strong the prioritisation is; see [6].
- UpdTarget (Freq): how often are the saved targets updated according to the latest NN model?
• NN parameters
When using an NN for function approximation, a gradient descent algorithm is used to minimise the loss function (see above). The choice of this algorithm (and the associated learning rate) can be very decisive for (the speed of) convergence. There is extensive literature on the different optimisation routines, see e.g. [18].
The structure of the NN itself is also governed by hyperparameters, such as the number and the width of the layers, the activation function ('sigmoid', 'ReLU', etc.) and many more.
• Feature engineering
This is not actually a parameter, but a set of operations on the original features before they serve as input for the NN, thus "helping the NN". Ideally, no feature engineering is needed at all, and the NN works purely on the raw data. All necessary "operations" (such as non-linear transformations, interactions, etc.) are then learned by the NN. With high-dimensional real problems it is then generally necessary to use very wide (many nodes per layer) and deep (many layers) networks, which therefore also require very long training times and a lot of data. This is the domain of Deep Learning. An example is an RL algorithm for learning to play computer games with only the raw pixels of the computer screen as input.
There is also a danger in (manual) feature engineering: if the constructed features are less predictive than the original data, the cart is put before the horse.
Hyperparameter optimisation
In order to find the best values of the hyperparameters for an RL algorithm, we can of course use similar techniques as for other ML algorithms. There is also a lot of literature about optimising hyperparameters, see e.g. [19], ranging from simple grid search to advanced Bayesian optimisation. An interesting option, which is not explored in the mentioned article, is using a Genetic Algorithm (GA).
Assuming we have a performance metric for each run/episode (or average over multiple runs/episodes) of the RL algorithm with a specific hyperparameter setting, we can interpret this metric as the "fitness" of the solution ("gene" in GA terminology) consisting of the string of hyperparameters. Crossover of two solutions is then easily done by cutting the list of hyperparameters somewhere and combining the two ends crosswise. Mutation of a solution is also easy to program.
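A sketch of the crossover and mutation operators just described, acting on a hyperparameter setting represented as an ordered list of values; the parameter names and ranges in the example are illustrative.

```python
import random

def crossover(parent_a, parent_b):
    """One-point crossover: cut both hyperparameter lists and combine the ends crosswise."""
    cut = random.randint(1, len(parent_a) - 1)
    return parent_a[:cut] + parent_b[cut:], parent_b[:cut] + parent_a[cut:]

def mutate(params, rate=0.1, scale=0.2):
    """Randomly perturb each (numeric) hyperparameter with a small probability."""
    return [p * (1 + random.uniform(-scale, scale)) if random.random() < rate else p
            for p in params]

# Example 'gene': [learning_rate, epsilon, gamma, buffer_size]
child1, child2 = crossover([0.01, 0.1, 0.95, 10_000], [0.001, 0.3, 0.99, 50_000])
```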
Of course, as in grid search, this "triple-AI solution" (namely RL + NN + GA) for hyperparameter optimisation may also be challenging in terms of required computing time or power (although a degree of parallelisation seems feasible).

¹³ This of course also depends on practical circumstances: in a simulated environment any number of transitions can be replayed at any time; in reality it may only be possible to execute replay and update of the model infrequently.
Reinforcement Learning in R and Python
Although there are some efforts to incorporate RL in R code, notably the ReinforcementLearning package by Markus Dumke and the MDPtoolbox, by far most sources for programming RL algorithms (on GitHub, Kaggle, etc.) are found in Python.
RL playgrounds
Much research into Reinforcement Learning techniques is carried out using environments that are easy to simulate, a number of which are now standard benchmarks:
• The "classic environments" such as (windy) gridworld, cart-pole balancing and mountain car (in many articles and books these are used to explain concepts, also in [1])
• Atari games: 50 games of different nature and difficulty. It rarely happens that an algorithm is equally good in all games.
The Gym website of OpenAI (affiliated with Elon Musk, among others) provides access to these and other environments, unfortunately – according to the GitHub website – only via Python (but there appears to be an R interface package anyway; this has not yet been tested).
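For reference, the basic interaction loop of the OpenAI Gym API on one of the classic environments is sketched below. Versions of the API differ slightly; this sketch follows the classic `reset()`/`step()` interface in which `step()` returns (observation, reward, done, info).

```python
import gym

env = gym.make("CartPole-v1")          # classic cart-pole balancing benchmark
obs = env.reset()
done, total_reward = False, 0.0
while not done:
    action = env.action_space.sample()             # random behaviour policy, for illustration only
    obs, reward, done, info = env.step(action)     # one interaction with the Environment
    total_reward += reward
print("Episode return:", total_reward)
```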
References
The references below are all contained in this zip file:
RL Literature Review - References.zip
[1] Sutton, R. and Barto, A. (2020). Reinforcement Learning: an Introduction, second edition. MIT Press. The original version is from 1998, but new versions of this book are still created, see http://incompleteideas.net/book/the-book.html.
[2] Sutton, R. (2017). Temporal-Difference Learning, slides and video lecture (http://videolectures.net/deeplearning2017_sutton_td_learning/).
[3] Ahmad Hammoudeh (2018). A Concise Introduction to Reinforcement Learning. ResearchGate.
[4] Yuxi Li (2017). Deep Reinforcement Learning: An Overview. arXiv preprint arXiv:1701.07274v5.
[5] Van Hasselt, H. et al., DeepMind (2018). Deep Reinforcement Learning and the Deadly Triad. arXiv preprint arXiv:1812.02648.
[6] Tom Schaul, John Quan, Ioannis Antonoglou and David Silver, Google DeepMind (2016). Prioritized Experience Replay. arXiv preprint arXiv:1511.05952v4.
[7] Van Seijen, H., van Hasselt, H., Whiteson, S., and Wiering, M. (2009). A theoretical and empirical analysis of Expected Sarsa. In IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, pp. 177-184.
[8] Hado van Hasselt, Arthur Guez and David Silver, Google DeepMind (2015). Deep Reinforcement Learning with Double Q-learning. arXiv preprint arXiv:1509.06461v3.
[9] Van Hasselt, H. (2011). Insights in Reinforcement Learning: Formal Analysis and Empirical Evaluation of Temporal-difference Learning. SIKS dissertation series number 2011-04.
[10] De Asis, K., Hernandez-Garcia, J. F., Holland, G. Z., and Sutton, R. S. (2017). Multi-step Reinforcement Learning: A Unifying Algorithm. arXiv preprint arXiv:1703.01327.
[11] Matteo Hessel et al., DeepMind (2017). Rainbow: Combining Improvements in Deep Reinforcement Learning. arXiv preprint arXiv:1710.02298v1.
[12] Ganger, M., Duryea, E. and Hu, W. (2016). Double Sarsa and Double Expected Sarsa with Shallow and Deep Learning. Journal of Data Analysis and Information Processing, 4, 159-176.
[13] Lange, S., Gabel, T., Riedmiller, M. (2012). Batch Reinforcement Learning. In: Wiering, M., van Otterlo, M. (eds) Reinforcement Learning. Adaptation, Learning, and Optimization, vol 12. Springer, Berlin, Heidelberg.
[14] Baird, L. C. (1995). Residual Algorithms: Reinforcement Learning with Function Approximation. In Prieditis & Russell (eds), Machine Learning: Proceedings of the Twelfth International Conference, 9-12 July, Morgan Kaufmann Publishers, San Francisco, CA.
[15] Raghu, A. et al. (2017). Deep Reinforcement Learning for Sepsis Treatment. arXiv preprint arXiv:1711.09602v1.
[16] Andrew Ng et al. (1999). Policy invariance under reward transformations: theory and application to reward shaping. In Machine Learning, Proceedings of the Sixteenth International Conference, pp. 278-287. Morgan Kaufmann.
[17] Eric Wiewiora (2003). Potential-based Shaping and Q-value Initialization are equivalent. Journal of Artificial Intelligence Research, 19, 205-208.
[18] Sebastian Ruder (2017). An overview of gradient descent optimization algorithms. arXiv preprint arXiv:1609.04747v2.
[19] Hazan, E. (2018). Hyperparameter Optimization: A Spectral Approach. arXiv preprint arXiv:1706.00764v4.
[20] David Silver et al., Google DeepMind (2016). Mastering the game of Go with deep neural networks and tree search. Nature, vol 529.
[21] Edwin Pednault, Naoki Abe et al. (2002). Sequential Cost-Sensitive Decision Making with Reinforcement Learning. IBM Watson Research Centre. SIGKDD'02, Edmonton, Alberta, Canada.
[22] Naoki Abe et al. (2004). Cross Channel Optimized Marketing by Reinforcement Learning. IBM Watson Research Centre. SIGKDD'04.
[23] Naoki Abe et al. (2010). Optimizing Debt Collections Using Constrained Reinforcement Learning. IBM Research. KDD'10, Washington DC.
[24] Georgios Theocharous, Assaf Hallak, Adobe Research (2013). Lifetime Value Marketing using Reinforcement Learning. Paper F21 in RLDM (Multi-disciplinary Conference on Reinforcement Learning and Decision Making) 2013, Princeton, New Jersey, USA.
[25] Yegor Tkachenko (2015). Autonomous CRM Control via CLV Approximation with Deep Reinforcement Learning in Discrete and Continuous Action Space. Stanford University. arXiv preprint arXiv:1504.01840v1.
[26] Marco Wiering (2018). Reinforcement Learning: from methods to applications. Nieuw Archief voor de Wiskunde (KWG).
Index
Baird's counterexample
Batch Learning
behaviour policy
Bellman optimality equation
custom loss function
Deadly Triad
Deep Learning
Deep Reinforcement Learning
deterministic policy
direct gradient
Double Q-learning
Dynamic Programming
eligibility trace
episode
estimation policy
Expected SARSA
Experience Replay
exploration-exploitation dilemma
Feature engineering
Function approximation
gradient descent algorithm
greedy action selection
Importance sampling
model-based
model-free
Monte Carlo learning
Multi-armed Bandit
Multi-step methods
Neural Network
off-policy
Online network
on-policy
Policy iteration (PI)
POMDP
Prioritised Experience Replay
Q-learning
Reinforcement Learning (RL)
residual gradient
Residual Learning
return
Reward shaping
RL in R and Python
SARSA
stochastic policy
target
Target clipping
Target network
target policy
TD(λ) algorithm
Temporal-Difference (TD) learning
trajectories
transitions
Value iteration (VI)
ε-greedy
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 
Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Saket (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Carero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptxCarero dropshipping via API with DroFx.pptx
Carero dropshipping via API with DroFx.pptx
 
Zuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptxZuja dropshipping via API with DroFx.pptx
Zuja dropshipping via API with DroFx.pptx
 
Schema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdfSchema on read is obsolete. Welcome metaprogramming..pdf
Schema on read is obsolete. Welcome metaprogramming..pdf
 
Generative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and MilvusGenerative AI on Enterprise Cloud with NiFi and Milvus
Generative AI on Enterprise Cloud with NiFi and Milvus
 
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
Digital Advertising Lecture for Advanced Digital & Social Media Strategy at U...
 
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
Best VIP Call Girls Noida Sector 22 Call Me: 8448380779
 
Ravak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptxRavak dropshipping via API with DroFx.pptx
Ravak dropshipping via API with DroFx.pptx
 
BabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptxBabyOno dropshipping via API with DroFx.pptx
BabyOno dropshipping via API with DroFx.pptx
 

Reinforcement Learning Literature review - apr2019/feb2021 (with zip file)

DeepMind is one of the research centres where a lot of research is done on RL, both theoretically and in terms of applications. Well-known researchers from DeepMind are:
• David Silver: wrote many articles as first or co-author (see e.g. the article on AlphaGo) and is widely cited. Also made a series of video lectures on RL (slides here).
• Hado van Hasselt: a Dutch PhD (UU, CWI, see [9]), who developed, among other things, Double Q-learning (see Appendix B).
• Tom Schaul: lead author of the article on Prioritised Experience Replay [6].
The standard work in the field of Reinforcement Learning is the book by Sutton and Barto [1]; this book is in the reference list of almost every RL article. In the more recent versions of this book, the last paragraph of each chapter contains an interesting listing of the authors and articles that contributed to the development of the chapter's topic; very useful as a cross reference. In this overview, we follow the thread of this book to review the various other topics in the RL literature.
Sutton is more prominent than Barto; he has, among other things, an extensive (but slightly messy) website. Many of the current RL researchers have spent some time with the RL & AI group at the University of Alberta in Canada, where Richard Sutton works. Sutton publishes articles continuously, including via ResearchGate.
There is now also a (set of) courses, a Specialization, on Coursera about Reinforcement Learning. This has not been checked out yet.

Other researchers
As is often the case, the US/North America is leading in the field:
• Andrew Ng: one of the best-known AI researchers, working at Stanford, who is very successful, partly through his popular courses via Coursera (of which he is a founder).
• Naoki Abe: IBM researcher, who stands out for a number of publications on applications of RL in a business environment.
• Georgios Theocharous: researcher at Adobe (previously at M.I.T., Intel and Yahoo) in the field of (applications of) RL and other ML techniques, especially in marketing.
I probably have a bias towards Dutch(-speaking) researchers, but it is striking that there are quite a few of them:
• Harm van Seijen: did a PhD at the UvA with Shimon Whiteson and also worked for 4 years as a postdoc with Sutton. Now heads the Microsoft Research Montréal RL team.
• Pieter Abbeel: a Belgian who, via a PhD at Stanford with Andrew Ng and a research position at OpenAI, eventually became head of the Robot Learning Lab in Berkeley (among others).
• Marco Wiering: supervisor of Hado van Hasselt; his name appears in all kinds of publications. Has apparently not (yet) succumbed to the major players in the AI field, because he still 'just' works in the AI group of the RUG.

Review articles
In addition to the articles on specific RL topics, many of which are referenced in this document, there are also a number of interesting review articles:
• The ResearchGate article [3] by Ahmad Hammoudeh provides a nice insight into what RL is, including a number of applications and many references to the literature.
• In September 2018, "Nieuw Archief voor de Wiskunde" (the magazine of the Dutch Royal Mathematical Society) was entirely devoted to AI, and it contained a nice overview article by Marco Wiering [26], also with many references.
• Even more extensive is the overview by Yuxi Li [4], which can safely be called a booklet, with over 40 pages of background, main elements and mechanisms, applications and resources (books, reports, courses and tutorials, conferences, blogs, testbeds, etc.), plus 25 pages of references.

Applications in a business environment
Of course, for a commercial company it is at least as interesting to look for RL applications in a business environment. After all, in such an environment you run into all sorts of specific problems that bother you less, or not at all, with the 'school examples': how do you find the best (or a good) algorithm and parameters without immediately learning online with 'live' clients (batch/offline learning, setting up a simulation environment?), how do you connect things to existing systems and procedures, how do you explain to those involved how the algorithm learns and what the final policy entails, etc.? For this reason it is instructive to see how others have dealt with those problems. However, the number of published applications of RL in commercial environments is not (yet) that large. It may be that there is still little to report, because it is a relatively new technology and/or because it is difficult to get it working properly and to optimise it. But it may also be that successful applications are not published (in detail) for competitive reasons. Although it is regularly claimed that RL is being used, details are mostly not given.
Still, a number of articles with RL business applications were found, including three pieces (partly) by Naoki Abe (IBM Research): two in the field of marketing ([21] and [22]) and one in the field of debt collection [23]. This last article concerns the collection of tax liabilities by the New York State Department of Taxation and Finance. At the time of the article (2010), it was estimated that this could save $100 million over 3 years on annual tax revenues of $1 billion. Incidentally, the development of the "engine" for this cost IBM $5 million in research investments (and the State Department another $4 million).
Another interesting article is [15], which describes an application of RL in the medical world. It shows many problems and choices that I also encountered in practice. I took over a number of things, such as target clipping, reward shaping, a custom loss function to punish deviant Q values, and parts of the configuration of the Neural Network (NN).
Two other articles with commercial applications of RL have been found, but have not yet been studied ([24] and [25]).

Markov Decision Processes
Markov Decision Processes or MDPs are decision problems characterised by an Agent (actor) who takes decisions in an Environment, at/in successive times/periods, and gradually receives rewards or incurs costs (positive or negative rewards). Somewhat more formally defined:
• The Environment finds itself at any time t in one of the possible states s ∈ S.
• Every period the Agent performs an action a ∈ A (allowed in state s; this can also be "do nothing").
• The feedback from the Environment to the Agent, following the action taken, is twofold: (r, s'), where r is the (direct) reward and s' the new state of the Environment. This feedback only depends on the old state s and the action taken, not on the previous history of the process (this is the so-called Markov property of the process).
• The entire process can be finite (then a sequence from start to end is called an episode¹) or infinite.
• The aim for the Agent is to act in such a way that the total average reward² is maximised.

Figure 1: Agent acting in an Environment (schematic, from [1])

The way the Agent tries to maximise the reward is called the policy (denoted by π), which is the central concept in RL. Formally, the policy is the rule that determines the action a to be taken for a given state s: π: S -> A. The policy can be deterministic, i.e. exactly one action a is prescribed for each state s (π(s) = a), or stochastic, i.e. for a given state s the action a is chosen with a certain probability π(a|s), where ∑_a π(a|s) = 1. A toy code example of an Agent interacting with an Environment over one episode is given below.

¹ When every episode consists of only one period, so it requires only one decision, the problem is called a Multi-armed Bandit (MAB) problem (One-armed Bandit being the nickname for the well-known (fruit) slot machines). The decision required there is which arm (slot machine) to pull for maximum (average) reward over all episodes.
² Averaged over episodes if the process is finite, averaged over periods in the infinite case (in which case a discount factor γ < 1 must be chosen).
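To make the Agent-Environment loop and the notion of an episode concrete, here is a minimal sketch in Python of one episode of interaction. The toy environment, its reward structure and the method names are hypothetical illustrations, not taken from any of the referenced articles or libraries.

import random

class CoinFlipEnv:
    """Toy episodic MDP (hypothetical example): the state is the number of heads
    collected so far; the episode ends after 3 periods."""
    def reset(self):
        self.t, self.state = 0, 0
        return self.state

    def step(self, action):
        # action 0 = "do nothing" (reward 0), action 1 = "flip" (costs 0.1, may add a head)
        reward = 0.0
        if action == 1:
            reward = -0.1
            if random.random() < 0.5:
                self.state += 1
        self.t += 1
        done = self.t >= 3
        if done:
            reward += float(self.state)   # final reward: number of heads collected
        return self.state, reward, done   # feedback (s', r) depends only on (s, a): Markov property

def run_episode(env, policy):
    """policy maps state s to action a, i.e. a deterministic pi(s) = a."""
    s, total, done = env.reset(), 0.0, False
    while not done:
        a = policy(s)
        s, r, done = env.step(a)
        total += r
    return total

print(run_episode(CoinFlipEnv(), policy=lambda s: 1))   # always "flip"

The same loop structure applies to any MDP; only the environment and the policy change.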
The feedback from the Environment to the Agent is in principle stochastic, which means that this feedback, consisting of the new state s' and the direct reward r, is determined by probabilities:
P(r, s' | s, a) = probability that, given state s and action a, the next state is s' and the direct reward is r, where of course ∑_{r,s'} P(r, s' | s, a) = 1 for all s and a.
NB: it is quite possible that in a specific RL problem the direct reward does not depend on s, s' or a, but only on one or two of this trio.

Solution methods
In order to solve an MDP (i.e. determine the optimal policy³), it matters whether we know the entire operation of the environment. The states S and the possible actions A are in principle always⁴ known, but the probabilities P(r, s' | s, a) do not have to be. A solution method that makes use of the knowledge about P(·) is called model-based; a method that does not do this is called model-free.
A model-based method known from Operations Research is (possibly stochastic) Dynamic Programming. Using the known model of the environment, it can be determined for each state s_{T-1} in the penultimate period what the expected⁵ reward r_T is for each permitted action a_{T-1}, and therefore also what the optimal action a*_{T-1} is. The value of state s_{T-1} is then the reward for that optimal action choice. Then we take another step back and perform the same exercise for the possible states s_{T-2} in the second-to-last period, with the difference that we now calculate the sum of the direct reward r_{T-1} + the expected value of the state s_{T-1} we end up in. The action that leads to the highest sum is the optimal action a*_{T-2} from s_{T-2}. This continues until we know for each initial state s_0 what the optimal action a*_0 is and what the associated expected value of [(direct) reward r_1 + the value of state s_1] is. The set of optimal actions {a*_t} forms the optimal policy (usually denoted by π*). A code sketch of this backward recursion is given below.
Other well-known solution methods are Value Iteration (VI) and Policy Iteration (PI), which are discussed in detail in [1].

Reinforcement Learning
In many cases, the model P(r, s' | s, a) is unknown. We then have to rely on other methods, which have been on the rise in recent decades under the collective name Reinforcement Learning (RL). A characteristic of these methods is that the missing knowledge about the model of the environment is replaced by pieces of experience that are gained while letting the agent operate in the environment. This explains the second part of the name: with every interaction something is "learned" about how the environment works. The first part, Reinforcement, refers to the strengthening (or not) of already acquired knowledge by new experiences: an experience (s, a) => (s', r) that corresponds with previous experiences from (s, a) reinforces the previously acquired knowledge. Naturally, in case of disagreement, the previous knowledge is (partly) "extinguished".

³ In the seminal work of Sutton and Barto [1], finding the optimal policy is called solving the control problem. Next to this (or rather before this) a lot of attention is given to the prediction problem, i.e. determining (predicting) the value function V(s) and action-value function Q(s, a) for a given policy π. These functions give the expected reward from state s (for V), or from taking action a in state s (for Q), following policy π until the end of the episode (or "infinitely").
⁴ Note that there are also MDPs where the states are only partially observed, the so-called POMDPs (Partially Observable MDPs).
⁵ In a stochastic problem we have an expected reward and multiple possible next states s'; in a completely deterministic problem the reward r and the next state s' are known when s and a are given.
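Returning to the backward recursion described under Solution methods: a minimal sketch of finite-horizon dynamic programming with a known model. The model format (a dictionary mapping (state, action) to a list of (probability, reward, next_state) triples) and the tiny example model are assumptions made only for illustration.

def backward_induction(states, actions, P, T):
    """Finite-horizon DP: step back from period T-1 to 0 using the known model P."""
    V = {s: 0.0 for s in states}          # values of terminal states
    policy = {}
    for t in range(T - 1, -1, -1):
        V_new = {}
        for s in states:
            best_a, best_q = None, float("-inf")
            for a in actions:
                # expected [direct reward r + value of the next state s']
                q = sum(p * (r + V[s_next]) for p, r, s_next in P[(s, a)])
                if q > best_q:
                    best_a, best_q = a, q
            V_new[s] = best_q
            policy[(t, s)] = best_a       # optimal action a*_t in state s
        V = V_new
    return policy, V

# Tiny illustrative model: 2 states, 2 actions.
states, actions = [0, 1], ["stay", "move"]
P = {(0, "stay"): [(1.0, 0.0, 0)], (0, "move"): [(0.8, -1.0, 1), (0.2, 0.0, 0)],
     (1, "stay"): [(1.0, 2.0, 1)], (1, "move"): [(1.0, 0.0, 0)]}
policy, V0 = backward_induction(states, actions, P, T=3)
print(policy, V0)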
Delayed reward
With many MDPs the direct reward is not representative of how good an action is; it is often just as important that the environment is brought into a better, more valuable state, from which the final reward can be obtained. As an example, in a marketing environment: if by sending a number of messages through the right channels at the right times (which at first only costs money) we can get a client to react, or maybe even make a first purchase, then we have arrived in a much more favourable state, with a higher 'value'. Maybe it would have been cheaper in terms of direct reward to take less expensive (or no) action(s), but then we might not have gotten into that more valuable state.
Direct rewards are often 0 or negative, because the action taken entails costs and does not immediately yield anything. This makes it tricky to try and solve an MDP whose model is unknown using supervised learning techniques. After all, as a target variable we have at most the direct reward available, and this is not necessarily maximal for the best action in the given circumstances. Instead, we can also wait until the end of an episode and then take the final (total) reward as a target variable (this is so-called Monte Carlo learning), but then the big question is to what extent we should assign that reward to each action taken in the episode. Another disadvantage is that we always have to continue until the end of an episode in order to learn something. In addition, for many problems the number of possible "trajectories" (paths from the beginning to the end of an episode) is so large that it becomes impractical.
The method that sits between these two extremes, and is called Temporal-Difference (TD) learning by Sutton, uses estimates of the value of the next state (along with the direct reward) as a 'temporary' target variable and adjusts the current estimate using these estimates (the so-called 'update'). The word "difference" here refers to the difference between the new estimate (called the target) and the earlier estimate known up to the time of the update. TD learning is therefore learning "a guess from a guess" [2] (slide 7; in the same presentation, see also slides 9-16 for further explanation of multi-step predictions versus one-step methods such as supervised learning). A minimal code sketch of the one-step TD update follows below.
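A minimal sketch of the one-step TD update for the state-value function V(s) under a fixed policy, as described above; the dictionary representation of V, the variable names and the parameter values are illustrative assumptions.

def td0_update(V, s, r, s_next, alpha=0.1, gamma=0.95, terminal=False):
    """One-step TD(0): the target r + gamma * V(s') is itself an estimate,
    hence learning "a guess from a guess"."""
    target = r + (0.0 if terminal else gamma * V.get(s_next, 0.0))  # temporary target
    td_error = target - V.get(s, 0.0)                               # the "difference"
    V[s] = V.get(s, 0.0) + alpha * td_error                         # move part of the way to the target
    return V[s]

V = {}
td0_update(V, s="start", r=-1.0, s_next="mid")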
The different methods for solving multi-step decision problems – all classified by Sutton under RL – are not completely separate from each other, but gradually merge into one another. This is illustrated by Figure 2 below (from [1], p. 157).

Figure 2: Schematic overview of RL methods

The "depth" of the update is shown on the vertical axis, with 1-step TD learning at the top and Monte Carlo learning (entire episodes) at the bottom. On the horizontal axis, on the left are the methods where only the effect of one decision (action) is viewed per step (it is "sampled" from the possible decisions), while on the right are methods where the expected value is calculated over all possible decisions. Because the expected value assumes knowledge of the underlying probability model, all model-free methods are located on the left axis.
Many other aspects play a role in the use of these methods that cannot be captured in this two-dimensional picture. Two of those aspects should be mentioned here, namely the exploration-exploitation dilemma and function approximation.
As soon as various actions have been tried out in a certain state, a (preliminary) picture of the value of taking those various actions in that state already arises. In particular, it is known what – up to that point – is the best action. Every time we come back to the same state, we can of course take that best action (we call this exploitation of the acquired knowledge, or also greedy action selection), but then we may not learn a lot⁶. That is why it makes sense to keep exploring and occasionally choose actions that – up to that point – are not optimal, or that have not been tried at all; we then learn whether we can take even better actions than we thought. It may be that this exploration (temporarily) costs money, but potentially there is a better policy in return⁷. A sketch of the ε-greedy selection rule, a simple way to balance the two, is given at the end of this section.
When the number of possible states becomes very large, it becomes impractical to register the value of each state (or state-action combination) separately (in table form). Function approximation offers a solution: the values are approximated by a (state-value or action-value) function of the available features, which contains a limited number of parameters. In its simplest form this can be a linear function of the features, but more complex non-linear functions such as a neural network (NN) can also be used. When a deep NN is used (loosely defined as an NN with more than 2 hidden layers), the term Deep Reinforcement Learning is used.

⁶ In a stochastic environment we obtain an increasingly better estimate of the 'expected reward' of the specific action, but we do not learn anything more about the other actions.
⁷ Compare this to the situation when going out to dinner (pre-corona 😉): if you always choose a restaurant that gave you good experiences in the past, you will never try out new restaurants or revisit restaurants that gave you one or two less favourable experiences. At the very least, you might miss a very good restaurant you haven't tried yet.
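The ε-greedy rule discussed further on in this document is the simplest way to trade off exploitation against exploration. A minimal sketch, where the Q-table format (a dictionary keyed on (state, action)) is an assumption for illustration:

import random

def epsilon_greedy(Q, s, actions, epsilon=0.1):
    """With probability epsilon explore (random action), otherwise exploit
    the greedy action, i.e. the action with the highest current Q(s, a)."""
    if random.random() < epsilon:
        return random.choice(actions)                       # explore
    return max(actions, key=lambda a: Q.get((s, a), 0.0))   # exploit (greedy)

epsilon_greedy({}, s=0, actions=[0, 1], epsilon=0.1)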
Algorithms
In order to solve the MDP (to find the optimal policy), we have to approach the actual (action-)value function as closely as possible, bit by bit. This is done systematically by an RL algorithm. A lot of research has been done in the field of RL algorithms and many different "flavours" have emerged. This section provides a brief overview of these flavours.
The optimal policy must comply with the Bellman optimality equation:

Q*(s, a) = ∑_{s'∈S} P(s'|s, a) · [ r(s, a, s') + γ · max_{a'} Q*(s', a') ]

If we have found the function Q* – or approximated it closely enough – then we know the optimal policy, because this is the greedy policy that chooses the action a with maximum Q* value in each state. (When the model P is known, this equation can also be applied directly as an iterative backup; see the sketch at the end of this section.)

On- and off-policy methods
An often mentioned distinction within RL algorithms is that between on-policy and off-policy methods. As mentioned, an RL algorithm searches for the optimal policy; we therefore call this policy the target policy (also known as: estimation policy). However, there is another policy in play, namely the policy that determines which action is taken during the search process, the so-called behaviour policy. When both policies are the same – and thus the policy that is being used is itself optimised – we speak of an on-policy algorithm; otherwise the algorithm is off-policy. In order to continue to learn sufficiently from new experiences, the behaviour policy in an on-policy algorithm must continue to explore (otherwise the greedy action is always chosen). This problem is not an issue with an off-policy algorithm: after all, you can explore to your heart's content (in principle even at random). However, this does mean that there are fewer guarantees for convergence (i.e. that the optimal policy is approached more and more closely) when using an off-policy algorithm, certainly in combination with function approximation. According to Sutton, the combination of three elements in particular is 'dangerous' (he speaks of the Deadly Triad, see section 11.3 in [1]): function approximation, bootstrapping (= use of TD learning, "a guess from a guess") and an off-policy algorithm. Since function approximation is indispensable for larger, complex problems and bootstrapping can greatly increase efficiency, the use of on-policy learning is often the solution. Incidentally, further research is still being done into the Deadly Triad [5], mainly because people want to better understand how it is possible that some algorithms still successfully combine the three elements mentioned.
Off-policy learning can, however, offer great advantages, for example by learning what the optimal policy is from already available previous experiences, which have been acquired by following a different policy (for example by a real-life agent).
When using a so-called ε-greedy behaviour policy, off-policy learning comes very close to on-policy learning: with probability ε (which of course must be small) a random action is chosen, and with probability 1-ε the greedy action is chosen, just as in the target policy.
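As an aside before the model-free algorithms below: when the model P is known, the Bellman optimality equation can be turned into a repeated backup (Q-value iteration). A small sketch, using the same assumed (probability, reward, next_state) model format as the earlier DP sketch; the tolerance and discount values are illustrative.

def q_value_iteration(states, actions, P, gamma=0.9, tol=1e-6):
    """Repeatedly apply the Bellman optimality backup until Q stops changing much."""
    Q = {(s, a): 0.0 for s in states for a in actions}
    while True:
        delta = 0.0
        for s in states:
            for a in actions:
                backup = sum(p * (r + gamma * max(Q[(s2, a2)] for a2 in actions))
                             for p, r, s2 in P[(s, a)])
                delta = max(delta, abs(backup - Q[(s, a)]))
                Q[(s, a)] = backup
        if delta < tol:
            return Q   # the greedy policy w.r.t. this Q is (approximately) optimal

# q_star = q_value_iteration(states, actions, P)   # e.g. reusing the tiny model defined earlier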
Q-learning and SARSA
The best-known off-policy RL algorithm is Q-learning⁸ and the best-known on-policy algorithm is SARSA⁹. The difference between the two is in the update of the value function, where the new value is formed by the direct reward + the value of the next state in which we end up. In Q-learning, the value of the next state is taken at the – up to that moment – best action in that state (the greedy action), while with SARSA the value is taken at the action prescribed by the policy (the behaviour policy and the target policy, because they are the same; note that this policy can be stochastic, i.e. π(a|s) = P(a|s) is a probability distribution and not a deterministic rule π(s) = a). In formulae:

Q(s, a) <- Q(s, a) + α · ( r + γ · max_{a'} Q(s', a') - Q(s, a) )                for Q-learning
Q(s, a) <- Q(s, a) + α · ( r + γ · Q(s', a') - Q(s, a) ),  with a' = π(s')       for SARSA

where α is the learning rate and γ the discount factor. (A tabular code sketch of both updates is given at the end of this section.)

Multi-step methods
If we look back at Figure 2, we see on the left axis the TD learning methods, with TD(0) at the top¹⁰ and Monte Carlo learning at the bottom. The so-called multi-step methods lie between these two extremes. These methods are characterised by the fact that the target does not consist of the one-step return (i.e. the reward of one step + the value of the next state), but of the n-step return (the consecutive rewards of several (say n) steps + the value of the state we end up in after those n steps). So we still bootstrap (i.e. we use the "preliminary value" of the state after n steps), but "sample" more than one step before bootstrapping. As n increases, we get closer and closer to Monte Carlo learning, where we sample until the end of the episode (in finite processes). When using n-step methods, the question naturally arises which n is optimal. Because each n has its advantages and disadvantages, the eligibility trace concept was devised (see chapter 12 in [1]). We then calculate the target as the weighted sum of all individual n-step returns (so 1-step, 2-step, etc.), where the 1-step return is given weight (1-λ) and the weight of the return of each subsequent step is a factor λ (0 ≤ λ ≤ 1) smaller: the return of step k is therefore given weight (1-λ)·λ^(k-1). The resulting algorithm is called TD(λ) for short.
Viewed in this way, we use the "forward view" ([1], p. 288). We can also use the "backward view", by looking back from the reached state to see which previous events (actions in previous states) contributed¹¹ to that update: the longer ago, the smaller the contribution and thus the update (with factor λ). This seems like a more natural way to look at the updates and also the most logical way to program them.
The added value of multi-step methods has been demonstrated in various studies; see, among others, the DeepMind article on the "Rainbow" method [11], in which it is argued that the two most important add-ons (improvements) in RL algorithms are the multi-step mechanism and Prioritised Experience Replay (see next section).

⁸ The name Q-learning is derived from the use of the letter Q for the action-value function Q(s, a).
⁹ SARSA refers to the letters that are used to indicate the current state (s), the action taken in s (a), the reward (r), the new state (s') and the action taken in s' (a'); a' is determined by the target policy: π(s') = a'.
¹⁰ TD(0) and 1-step TD learning are synonymous; the 0 in TD(0) refers to the value of λ in TD(λ), not to the number of steps.
¹¹ Since they contributed, they are eligible for updating, hence the name eligibility trace.
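The two update rules above written out as tabular updates; the defaultdict Q-table and the parameter values are illustrative assumptions.

from collections import defaultdict

Q = defaultdict(float)   # Q[(s, a)], defaults to 0.0

def q_learning_update(s, a, r, s_next, actions, alpha=0.1, gamma=0.95):
    # off-policy: bootstrap on the greedy (max) action in s'
    target = r + gamma * max(Q[(s_next, a2)] for a2 in actions)
    Q[(s, a)] += alpha * (target - Q[(s, a)])

def sarsa_update(s, a, r, s_next, a_next, alpha=0.1, gamma=0.95):
    # on-policy: bootstrap on the action a' actually chosen by the (behaviour = target) policy
    target = r + gamma * Q[(s_next, a_next)]
    Q[(s, a)] += alpha * (target - Q[(s, a)])

The only difference is the bootstrap term in the target: max over actions (Q-learning) versus the action the policy actually takes (SARSA).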
Replay
The difference between on- and off-policy algorithms, and also that between online and offline algorithms, becomes somewhat less clear when we use Experience Replay. Experience Replay (ER) is the (repeated) re-offering of previous experiences (transitions (s, a) -> (r, s')) to the algorithm and performing the updates based on them. A 'stock' of transitions (the replay buffer) is therefore kept on hand, from which transitions are drawn at set times (this can be after each transition, but also less frequently) to be used for the update. It is not necessary for the current transition(s) to be used immediately; these are simply buffered and can be drawn later. Some experiences/transitions are 'more interesting' than others, for example transitions whose target deviates strongly from the current Q(s, a) value, and those transitions we want to offer with higher priority for the update, so that the function approximation for that (s, a) combination can become more accurate. This method is known as Prioritised Experience Replay, and it is widely used in current RL research. I have used the method described in [6], including the importance-sampling correction, which removes the bias in the expected values that is introduced by the prioritisation. A simplified sketch of a prioritised replay buffer is given at the end of this section.

Expected SARSA and Q(σ)
In addition to Q-learning and SARSA, we regularly encounter the Expected SARSA algorithm in the literature. In this algorithm, the target does not take the Q value at a' ~ π(·|s') as with SARSA, nor that at a' = argmax_a Q(s', a) as in Q-learning, but the expectation of the Q value over all possible actions in s':

Q(s, a) <- Q(s, a) + α · [ r + γ · ∑_{a∈A} π(a|s') · Q(s', a) - Q(s, a) ]

This makes the variance of the updates much smaller – certainly with a stochastic policy – than with SARSA, so that convergence occurs faster. Details about this method are given in [7]. The authors of this article present Expected SARSA as an on-policy method, but Sutton indicates in [1] (Remark 6.6 on page 140) that the behaviour and target policies may differ, as with off-policy methods. In his dissertation [9], van Hasselt confirms this, but he calls the off-policy version "General Q-learning".
De Asis et al. [10] attempted to unify the different algorithms in a multi-step RL algorithm entitled Q(σ). It would be attractive to program this algorithm and experiment with it, because it combines the other algorithms. It would mean, however, that the number of hyperparameters (see the section Hyperparameter optimisation below) increases even further.
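A much simplified sketch of proportional Prioritised Experience Replay in the spirit of [6]: priorities proportional to |TD error|^α, sampling probabilities derived from those priorities, and importance-sampling weights with exponent β to correct the resulting bias. The real implementation in [6] uses a sum-tree and an annealed β; this version only shows the mechanics and its parameter values are illustrative.

import numpy as np

class SimplePrioritisedReplay:
    def __init__(self, capacity=10000, alpha=0.6, beta=0.4, eps=1e-3):
        self.capacity, self.alpha, self.beta, self.eps = capacity, alpha, beta, eps
        self.buffer, self.priorities = [], []

    def add(self, transition, td_error):
        # transition = (s, a, r, s_next, done); new items get a priority from their TD error
        if len(self.buffer) >= self.capacity:          # rolling window (buffer size)
            self.buffer.pop(0); self.priorities.pop(0)
        self.buffer.append(transition)
        self.priorities.append((abs(td_error) + self.eps) ** self.alpha)

    def sample(self, batch_size):
        p = np.array(self.priorities)
        probs = p / p.sum()
        idx = np.random.choice(len(self.buffer), size=batch_size, p=probs)
        # importance-sampling weights correct the bias introduced by the prioritisation
        weights = (len(self.buffer) * probs[idx]) ** (-self.beta)
        weights /= weights.max()
        return [self.buffer[i] for i in idx], idx, weights

    def update_priorities(self, idx, td_errors):
        for i, e in zip(idx, td_errors):
            self.priorities[i] = (abs(e) + self.eps) ** self.alpha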
Other variants
To most algorithms extra 'cleverness' can be added, which we call "add-ons" here. The reason for the creation of these add-ons is usually that the algorithms did not converge without them, or were not stable (i.e. they sometimes did and sometimes did not lead to the desired result, or led to different results), or were far too slow. Below we discuss, one by one, a number of add-ons that I have implemented in my research: Double Q-learning, Batch Learning, Residual Learning, Reward shaping, Target clipping and the use of a custom loss function.
• Double Q-learning: In Q-learning, the value of the next state is equal to the value Q(s', a*) of the best action a* in that next state. These Q values are approximated by the function approximator, for example an NN. Because an estimate that (by chance) turns out too high is more likely to be the highest value than an estimate that is accurate or too low, there is a risk of overestimation. With Double Q-learning [8] this is overcome by using two separate NNs, the so-called Online network and the Target network. The Online network is used for the selection of the best action in the next state, while the Target network provides the value for that best action. The Target network remains unchanged for x iterations, while the Online network is updated in the usual way. After x iterations, the Online network is copied to the Target network. There are indications [12] that if the rewards are stochastic, Double learning also has added value for the on-policy algorithms SARSA and Expected SARSA.
• Batch Learning: If the agent cannot (or may not) interact with the real environment (and therefore cannot choose its own actions), but transition data is available, we can still make an attempt to find an optimal (or good) policy using Batch learning. This policy can then be used in an online situation, without changing it online any more. It is also possible to vary on this by first learning in batches and offering the resulting model as a "starting model" to the online agent, which then learns further. An extensive article about Batch learning is [13].
• Residual Learning: Leemon Baird is a big name in the field of MDPs and RL. He is known for, among other things, Baird's counterexample, a counterexample to the proposition that RL algorithms with linear function approximation always converge. As a solution to this, Baird proposes Residual learning, where a mix of the normally used direct gradient (which ensures the speed of learning) and the residual gradient (which ensures convergence) is applied. By taking the right combination, convergence is ensured (due to the residual component) while the speed of learning remains as high as possible (due to the direct component). Details are in the article by Baird [14].
• Reward shaping: With many RL problems the agent only receives feedback about the final reward, often the most important component in the total reward, at the end of an episode. It takes a while (many iterations and many repetitions of the same state-action combinations) before the agent learns from that information what a good action is in any specific state. This gave rise to the idea to lend the agent a hand and also give it feedback during the process on whether it is working "in the right direction". This can be done by manipulating the reward function in such a way that the right direction is rewarded and/or the wrong direction is penalised. Of course, the real reward must (ultimately) be independent of this; after all, it is an interim aid, not a real reward. This technique is called reward shaping; see the application in [15]. An example: if, in a debt collection process, an action leads to a response from the debtor or a (partial) payment, this is positive, even if it does not immediately yield a reward. By nevertheless assigning a value to receiving the response or payment, the agent is encouraged to repeat the behaviour displayed. Incidentally, the well-known AI expert Andrew Ng et al. have shown [16] that when using so-called potential-based shaping, the original optimal policy is also optimal with respect to the shaped reward. Wiewiora [17] then proved that initialising the Q values in a certain way can achieve exactly the same effect as potential-based shaping.
• Target clipping: Due to the iterative nature of the RL updates, it can occur that Q values grow out of control and become unrealistically large (or very negative). In that case it can help to limit the calculated targets, by so-called target clipping. See also the next point, and the code sketch after this list.
• Use of a custom loss function: When using an NN for function approximation, we can choose from a number of loss functions. Such a loss function is used when fitting the NN to the data provided: the smaller the value of the loss function, the better the NN fits the data. The loss function is minimised by adjusting the weights in the NN with a gradient descent algorithm. We can also define our own (custom) loss function, which makes it possible to steer the algorithm in a certain direction. For example, we can build in a 'penalty' – in the form of an extra loss – when the (realistic) limits for the Q values, i.e. for the output values of the NN, are exceeded. In combination with target clipping (see previous point), this is an effective way to prevent the algorithm from derailing.
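A sketch of how the Double Q-learning target is typically computed with the Online and Target networks, combined with the target clipping mentioned above. The `online_q`/`target_q` callables (assumed to return an array of Q values over actions) and the clipping bounds are assumptions for illustration, not the API of any specific library.

import numpy as np

def double_q_target(r, s_next, done, online_q, target_q, gamma=0.95,
                    clip_lo=-100.0, clip_hi=100.0):
    if done:
        target = r
    else:
        a_star = int(np.argmax(online_q(s_next)))      # SELECT the best action with the Online network
        target = r + gamma * target_q(s_next)[a_star]  # EVALUATE that action with the Target network
    return float(np.clip(target, clip_lo, clip_hi))    # target clipping: keep targets in a realistic range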
Hyperparameters
All algorithms and their add-ons mentioned in this text have one or more hyperparameters that influence their operation. We call these hyperparameters, because there are also "normal" parameters, for example the rewards and costs of actions (input parameters) and the resulting coefficients in the NN (output parameters). Most algorithms and add-ons are very sensitive to the choice of their hyperparameter values; these can make the difference between a smooth, fast-converging algorithm and a slow, diverging one.
Hyperparameters that have not yet been explicitly discussed in this document are:
• Learning rate (also called: step size): usually indicated with α (as in the formulae in this document), sometimes with η; the speed (step size) with which we move in the direction of the new value (target) during an update of the old Q value. When using an NN, this parameter is part of the optimiser selection (see the NN parameters point below).
• Exploration rate: usually indicated with ε; it gives the fraction of cases in which exploration is done instead of exploitation.
• Rate decay patterns: both the learning rate and the exploration rate should eventually go to 0, so that in the end there is no more exploration and no more learning. This is a necessary condition for convergence to the optimal policy. The way (speed, shape of the decrease) these rates decrease can also determine the convergence (speed) of the algorithm. Incidentally, in situations where an ongoing (business) process¹² may change over time, it seems advisable not to stop learning completely. After all, any changes in the process could also change the optimal policy, but this will not be picked up without learning.
• Discount factor: usually indicated with γ, as in the formulae in this document. As usual, the discount factor reflects the diminishing value of money over time (lost opportunity cost and inflation). A discount factor below 1, combined with a (possibly large) reward at the end of an episode, gives the agent an incentive to keep the episode as short as possible (the final reward is then discounted less), without lowering the probability of success too much.
• Replay parameters: several parameters play a role in (Prioritised Experience) Replay:
  - Buffer size: how many experiences (transitions) are kept (rolling window)?
  - Sample size: how many experiences are used in each replay action?
  - Replay frequency: after how many transitions is replay planned?¹³
  - Replacement: is sampling done with or without replacement?
  - Prioritisation Alpha and Beta: determine how strong the prioritisation is; see [6].
  - UpdTarget (Freq): how often are the saved targets updated according to the latest NN model?
• NN parameters: when using an NN for function approximation, a gradient descent algorithm is used to minimise the loss function (see above). The choice of this algorithm (and the associated learning rate) can be very decisive for (the speed of) convergence. There is extensive literature on the different optimisation routines, see e.g. [18]. The structure of the NN itself is also governed by hyperparameters, such as the number and width of the layers, the activation function ('sigmoid', 'ReLU', etc.) and many more.
• Feature engineering: this is not actually a parameter, but a set of operations on the original features before they serve as input for the NN, thus "helping the NN". Ideally, no feature engineering is needed at all, and the NN works purely on the raw data; all necessary "operations" (such as non-linear transformations, interactions, etc.) are then learned by the NN. With high-dimensional real problems it is then generally necessary to use very wide (many nodes per layer) and deep (many layers) networks, which therefore also require very long training times and a lot of data. This is the domain of Deep Learning. An example is an RL algorithm for learning to play computer games with only the raw pixels of the computer screen as input. There is also a danger in (manual) feature engineering: if the constructed features are less predictive than the original data, the cart is put before the horse.

Hyperparameter optimisation
In order to find the best values of the hyperparameters for an RL algorithm, we can of course use similar techniques as for other ML algorithms. There is also a lot of literature about optimising hyperparameters, see e.g. [19], ranging from simple grid search to advanced Bayesian optimisation. An interesting option, which is not explored in the mentioned article, is using a Genetic Algorithm (GA).
Assuming we have a performance metric for each run/episode (or an average over multiple runs/episodes) of the RL algorithm with a specific hyperparameter setting, we can interpret this metric as the "fitness" of the solution (the "chromosome" in GA terminology) consisting of the string of hyperparameters. Crossover of two solutions is then easily done by cutting the list of hyperparameters somewhere and combining the two ends crosswise. Mutation of a solution is also easy to program; a sketch of both operations is given below.
Of course, as in grid search, this "triple-AI solution" (namely RL + NN + GA) for hyperparameter optimisation may also be challenging in terms of required computing time or power (although a degree of parallelisation seems feasible).

¹² As opposed to e.g. a physical environment where the laws of physics apply.
¹³ This of course also depends on practical circumstances: in a simulated environment any number of transitions can be replayed at any time; in reality it may only be possible to execute replay and update of the model infrequently.
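A minimal sketch of the crossover and mutation operations described above, for a hyperparameter "chromosome" represented as a plain list. The parameter names, ranges, mutation scheme and selection strategy are illustrative assumptions, not a prescription.

import random

# A hyperparameter setting as a list (a "chromosome"); names/ranges are made up for illustration.
PARAM_RANGES = [("alpha", 0.001, 0.5), ("gamma", 0.8, 1.0),
                ("epsilon", 0.01, 0.3), ("lambda", 0.0, 1.0)]

def random_solution():
    return [random.uniform(lo, hi) for _, lo, hi in PARAM_RANGES]

def crossover(parent1, parent2):
    # cut both lists at the same point and combine the two ends crosswise
    cut = random.randint(1, len(parent1) - 1)
    return parent1[:cut] + parent2[cut:], parent2[:cut] + parent1[cut:]

def mutate(solution, p=0.2):
    # with probability p, redraw a hyperparameter uniformly within its range
    return [random.uniform(lo, hi) if random.random() < p else v
            for v, (_, lo, hi) in zip(solution, PARAM_RANGES)]

def evolve(fitness, pop_size=20, generations=10):
    """fitness(solution) should run the RL algorithm with that setting and return its performance metric."""
    population = [random_solution() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness, reverse=True)       # best solutions first
        parents = population[: pop_size // 2]
        children = []
        while len(children) < pop_size - len(parents):
            c1, c2 = crossover(*random.sample(parents, 2))
            children += [mutate(c1), mutate(c2)]
        population = parents + children[: pop_size - len(parents)]
    return max(population, key=fitness)

Because every fitness evaluation is a full RL run, the population members can be evaluated in parallel, which is the degree of parallelisation referred to above.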
Reinforcement Learning in R and Python
Although there are some efforts to incorporate RL in R code, notably the ReinforcementLearning package by Markus Dumke and the MDPtoolbox, by far most source code for programming RL algorithms (on GitHub, Kaggle, etc.) is found in Python.

RL playgrounds
Much research into Reinforcement Learning techniques is carried out using environments that are easy to simulate, a number of which are now standard benchmarks:
• The "classic environments" such as (windy) gridworld, cart-pole balancing and mountain car (in many articles and books these are used to explain concepts, also in [1]).
• Atari games: 50 games of different nature and difficulty. It rarely happens that an algorithm is equally good in all games.
The Gym website of OpenAI (affiliated with Elon Musk, among others) provides access to these and other environments, unfortunately – according to the GitHub website – only via Python (but there appears to be an R interface package anyway; this has not yet been tested).
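For completeness, a minimal interaction loop with one of the classic Gym environments. This is a sketch against the classic `gym` API; newer `gymnasium` releases changed the reset/step signatures slightly, so small adjustments may be needed.

import gym

env = gym.make("CartPole-v1")           # classic cart-pole balancing benchmark
obs = env.reset()                       # classic gym API: reset() returns the first observation
total_reward, done = 0.0, False
while not done:
    action = env.action_space.sample()              # random behaviour policy, just to show the loop
    obs, reward, done, info = env.step(action)      # feedback (s', r) from the environment
    total_reward += reward
env.close()
print("episode return:", total_reward)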
References
The references below are all contained in this zip file: RL Literature Review - References.zip

[1] Sutton, R. and Barto, A. (2020). Reinforcement Learning: An Introduction, second edition. MIT Press. The original version is from 1998, but new versions of this book are still created; see http://incompleteideas.net/book/the-book.html.
[2] Sutton, R. (2017). Temporal-Difference Learning, slides and video lecture (http://videolectures.net/deeplearning2017_sutton_td_learning/).
[3] Ahmad Hammoudeh (2018). A Concise Introduction to Reinforcement Learning. ResearchGate.
[4] Yuxi Li (2017). Deep Reinforcement Learning: An Overview. ArXiv preprint arXiv:1701.07274v5.
[5] Van Hasselt, H. et al., DeepMind (2018). Deep Reinforcement Learning and the Deadly Triad. ArXiv preprint arXiv:1812.02648.
[6] Tom Schaul, John Quan, Ioannis Antonoglou and David Silver, Google DeepMind (2016). Prioritized Experience Replay. ArXiv preprint arXiv:1511.05952v4.
[7] Van Seijen, H., van Hasselt, H., Whiteson, S., and Wiering, M. (2009). A theoretical and empirical analysis of Expected Sarsa. In IEEE Symposium on Adaptive Dynamic Programming and Reinforcement Learning, pp. 177-184.
[8] Hado van Hasselt, Arthur Guez and David Silver, Google DeepMind (2015). Deep Reinforcement Learning with Double Q-learning. ArXiv preprint arXiv:1509.06461v3.
[9] Van Hasselt, H. (2011). Insights in Reinforcement Learning: Formal Analysis and Empirical Evaluation of Temporal-Difference Learning. SIKS dissertation series number 2011-04.
[10] De Asis, K., Hernandez-Garcia, J. F., Holland, G. Z., and Sutton, R. S. (2017). Multi-step Reinforcement Learning: A Unifying Algorithm. ArXiv preprint arXiv:1703.01327.
[11] Matteo Hessel et al., DeepMind (2017). Rainbow: Combining Improvements in Deep Reinforcement Learning. ArXiv preprint arXiv:1710.02298v1.
[12] Ganger, M., Duryea, E. and Hu, W. (2016). Double Sarsa and Double Expected Sarsa with Shallow and Deep Learning. Journal of Data Analysis and Information Processing, 4, 159-176.
[13] Lange, S., Gabel, T., Riedmiller, M. (2012). Batch Reinforcement Learning. In: Wiering, M., van Otterlo, M. (eds), Reinforcement Learning. Adaptation, Learning, and Optimization, vol 12. Springer, Berlin, Heidelberg.
[14] Baird, L. C. (1995). Residual Algorithms: Reinforcement Learning with Function Approximation. In Prieditis & Russell (eds), Machine Learning: Proceedings of the Twelfth International Conference, 9-12 July, Morgan Kaufmann Publishers, San Francisco, CA.
[15] Raghu, A. et al. (2017). Deep Reinforcement Learning for Sepsis Treatment. ArXiv preprint arXiv:1711.09602v1.
[16] Andrew Ng et al. (1999). Policy invariance under reward transformations: theory and application to reward shaping. In Machine Learning, Proceedings of the Sixteenth International Conference, pp. 278-287. Morgan Kaufmann.
[17] Eric Wiewiora (2003). Potential-based Shaping and Q-value Initialization are Equivalent. Journal of Artificial Intelligence Research, 19, 205-208.
[18] Sebastian Ruder (2017). An overview of gradient descent optimization algorithms. ArXiv preprint arXiv:1609.04747v2.
[19] Hazan, E. (2018). Hyperparameter Optimization: A Spectral Approach. ArXiv preprint arXiv:1706.00764v4.
[20] David Silver et al., Google DeepMind (2016). Mastering the game of Go with deep neural networks and tree search. Nature, vol 529.
[21] Edwin Pednault, Naoki Abe et al. (2002). Sequential Cost-Sensitive Decision Making with Reinforcement Learning. IBM Watson Research Centre. SIGKDD '02, Edmonton, Alberta, Canada.
[22] Naoki Abe et al. (2004). Cross Channel Optimized Marketing by Reinforcement Learning. IBM Watson Research Centre. SIGKDD '04.
[23] Naoki Abe et al. (2010). Optimizing Debt Collections Using Constrained Reinforcement Learning. IBM Research. KDD '10, Washington DC.
[24] Georgios Theocharous, Assaf Hallak, Adobe Research (2013). Lifetime Value Marketing using Reinforcement Learning. Paper F21 in RLDM (Multi-disciplinary Conference on Reinforcement Learning and Decision Making) 2013, Princeton, New Jersey, USA.
[25] Yegor Tkachenko (2015). Autonomous CRM Control via CLV Approximation with Deep Reinforcement Learning in Discrete and Continuous Action Space. Stanford University. ArXiv preprint arXiv:1504.01840v1.
[26] Marco Wiering (2018). Reinforcement Learning: from methods to applications. Nieuw Archief voor de Wiskunde (KWG).
Index
Baird's counterexample, 11
Batch Learning, 11
behaviour policy, 8
Bellman optimality equation, 8
custom loss function, 4, 12
Deadly Triad, 8
Deep Learning, 13
Deep Reinforcement Learning, 7
deterministic policy, 4
direct gradient, 11
Double Q-learning, 10
Dynamic Programming, 5
eligibility trace, 9
episode, 4
estimation policy, 8
Expected SARSA, 10
Experience Replay, 10
exploration-exploitation dilemma, 7
feature engineering, 13
function approximation, 7
gradient descent algorithm, 12
greedy action selection, 7
importance sampling, 10
model-based, 5
model-free, 5
Monte Carlo learning, 6
Multi-armed Bandit, 4
multi-step methods, 9
Neural Network, 4
off-policy, 8
Online network, 11
on-policy, 8
Policy Iteration (PI), 5
POMDP, 5
Prioritised Experience Replay, 10
Q-learning, 9
Reinforcement Learning (RL), 5
residual gradient, 11
Residual Learning, 11
return, 9
reward shaping, 11
RL in R and Python, 14
SARSA, 9
stochastic policy, 4
target, 6
target clipping, 11
Target network, 11
target policy, 8
TD(λ) algorithm, 9
Temporal-Difference (TD) learning, 6
trajectories, 6
transitions, 10
Value Iteration (VI), 5
ε-greedy, 8