The document discusses balancing speed and probability of winning in reinforcement learning problems. It presents a game called Decision Snakes and Ladders in which introducing a step punishment term incentivizes faster termination while still maximizing winning probability. This leads to finding a policy that maximizes the ratio of winning probability to mean episode length. The approach extends to other problems and to trading off different metrics. Future work includes incorporating robustness to policy and state variations.
1. Winning slow, losing fast, and in between.
Reinaldo A Uribe Muriel
Colorado State University. Prof. C. Anderson
Oita University. Prof. K. Shibata
Universidad de Los Andes. Prof. F. Lozano
February 8, 2010
2. It’s all fun and games until someone proves a theorem.
Outline
1 Fun and games
2 A theorem
3 An algorithm
3. A game: Snakes & Ladders
Board: Crawford & Son, Melbourne, 1901. (Source: http://www.naa.gov.au/)
The player advances the number of steps indicated by a die. Landing on a snake's mouth sends the player back to its tail. Landing on a ladder's bottom moves the player forward to its top. Goal: reaching state 100.
4. A game: Snakes & Ladders
Boring! (No skill required, only luck.)
5. Variation: Decision Snakes and Ladders
Sets of "win" and "loss" terminal states. Actions: either "advance" or "retreat," to be decided before throwing the die.
6. Reinforcement Learning: Finding the optimal policy.
"Natural" rewards: ±1 on "win"/"loss", 0 otherwise.
The optimal policy maximizes total expected reward.
Dynamic programming quickly finds the optimal policy.
Probability of winning: pw = 0.97222...
7. Reinforcement Learning: Finding the optimal policy.
But...
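The dynamic-programming claim above can be sketched with value iteration on a toy board. The board below is a made-up miniature (a 6-state line with clipped moves), not the Decision Snakes and Ladders board from the slides; the state count, action names, and transition probabilities are all illustrative assumptions. Rewards are the "natural" ±1 on termination, 0 otherwise.

```python
# Toy gridline game: states 0..5, state 0 is "loss" (-1), state 5 is "win" (+1).
# Action "safe": +1 w.p. 0.9, -1 w.p. 0.1. Action "risky": +2 w.p. 0.6,
# -2 w.p. 0.4 (moves clipped to the board). All numbers are made up.

N = 5
ACTIONS = {"safe": [(0.9, 1), (0.1, -1)], "risky": [(0.6, 2), (0.4, -2)]}

def backup(v, s, action):
    """Expected return of taking `action` in state `s` under values `v`."""
    total = 0.0
    for p, move in ACTIONS[action]:
        s2 = min(max(s + move, 0), N)
        total += p * (1.0 if s2 == N else -1.0 if s2 == 0 else v[s2])
    return total

def value_iteration(tol=1e-12, max_sweeps=10_000):
    v = [0.0] * (N + 1)
    for _ in range(max_sweeps):
        delta = 0.0
        for s in range(1, N):  # non-terminal states
            best = max(backup(v, s, a) for a in ACTIONS)
            delta = max(delta, abs(best - v[s]))
            v[s] = best
        if delta < tol:
            break
    return v

v = value_iteration()
policy = {s: max(ACTIONS, key=lambda a: backup(v, s, a)) for s in range(1, N)}
print([round(x, 3) for x in v[1:N]], policy)
```

Because every policy here terminates with probability 1, the undiscounted iteration converges; the winning probability of the greedy policy can then be read off as pw = (v + 1)/2.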
8. Claim:
It is not always desirable to find the optimal policy for that problem.
9. Hint: mean episode length of the optimal policy, d = 84.58333 steps.
17. A simple, yet powerful idea.
Introduce a step punishment term −rstep so the agent has an incentive to terminate faster.
18. At time t,
r(t) = +1 − rstep on "win", −1 − rstep on "loss", −rstep otherwise.
19. Origin: maze rewards, −1 except on termination.
Problem: rstep = ? (i.e., the cost of staying in the game is usually incommensurable with the terminal rewards.)
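The step-punished reward above can be written down directly as a minimal sketch; the value rstep = 0.01 and the episode length below are arbitrary illustrative numbers, not taken from the slides.

```python
# Shaped per-step reward: terminal +/-1 plus a constant step punishment
# -r_step at every time step.
def shaped_reward(outcome, r_step):
    """outcome: 'win', 'loss', or None for a non-terminal step."""
    base = {"win": 1.0, "loss": -1.0, None: 0.0}[outcome]
    return base - r_step

# Return of a whole episode of length T: the terminal +/-1 minus r_step*T.
# Its expectation over episodes is therefore 2*pw - 1 - r_step*d.
def episode_return(won, T, r_step):
    return (1.0 if won else -1.0) - r_step * T

print(shaped_reward("win", 0.01))      # 0.99
print(episode_return(True, 30, 0.01))  # 0.7
```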
25. Chess: White wins*
Uribe Muriel. Journal of Fabricated Results, Vol. 06, No. 8, 2010
* in 10^8 ply.
26. Such a game visits only about the fifth root of the total number of valid states¹, but, if a ply takes one second, an average game will last three years and two months.
27. Certainly unlikely to be the case, but in fact finding policies of maximum winning probability remains the usual goal in RL.
28. The discount factor γ, used to ensure values are finite, has an effect on episode length, but that effect is unpredictable and suboptimal (for the pw/d problem).
¹ Shannon, 1950.
29. Main result.
For a general ±1-rewarded problem, there exists an r*step for which the value-optimal solution maximizes pw/d, and the value of the initial state is −1:
∃ r*step such that π* = argmax_{π∈Π} v = argmax_{π∈Π} pw/d, and v(s0) = v* = −1.
30. Stating the obvious.
Every policy has a mean episode length d ≥ 1 and a probability of winning 0 ≤ pw ≤ 1.
31. v = 2pw − 1 − rstep·d
32. (Lemma: extensible to vectors using indicator variables.)
33. The proof rests on a solid foundation of duh!
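A quick numeric illustration of the identity v = 2pw − 1 − rstep·d above: it compares the slides' value-optimal policy (pw = 0.97222, d = 84.58333) against a hypothetical faster, riskier policy (pw = 0.9, d = 10, invented for illustration). The value ranking flips once steps carry a cost.

```python
def value(pw, d, r_step):
    # v = 2*pw - 1 - r_step*d
    return 2.0 * pw - 1.0 - r_step * d

slow = (0.97222, 84.58333)  # the slides' value-optimal policy
fast = (0.9, 10.0)          # hypothetical faster, riskier policy (invented)

print(value(*slow, 0.0) > value(*fast, 0.0))    # True: slow wins when steps are free
print(value(*slow, 0.01) > value(*fast, 0.01))  # False: ranking flips at r_step = 0.01
```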
35. Key substitution.
The w−l space:
w = pw/d,  l = (1 − pw)/d
Each policy is represented by a unique point in the w−l plane.
36. The policy cloud is limited by the triangle with vertices (1,0), (0,1), and (0,0).
37. Execution and speed in the w−l space.
Winning probability: pw = w/(w + l). Mean episode length: d = 1/(w + l).
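The substitution inverts cleanly; a short round-trip check using the slides' optimal-policy numbers:

```python
# Round-trip check of the key substitution: w = pw/d, l = (1-pw)/d,
# inverted by pw = w/(w+l) and d = 1/(w+l). Numbers are the slides'
# optimal policy (pw = 0.97222, d = 84.58333).
pw, d = 0.97222, 84.58333
w, l = pw / d, (1.0 - pw) / d
assert abs(w / (w + l) - pw) < 1e-12
assert abs(1.0 / (w + l) - d) < 1e-9
print(w, l)
```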
38. Proof Outline - Value in the w−l space.
v = (w − l − rstep)/(w + l)
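The expression follows by substituting pw = w/(w + l) and d = 1/(w + l) into v = 2pw − 1 − rstep·d:

```latex
v = 2p_w - 1 - r_{\mathrm{step}}\, d
  = \frac{2w}{w+l} - 1 - \frac{r_{\mathrm{step}}}{w+l}
  = \frac{2w - (w+l) - r_{\mathrm{step}}}{w+l}
  = \frac{w - l - r_{\mathrm{step}}}{w+l}
```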
39. So...
Value (for all rstep), mean episode length, and winning probability level sets are lines.
All value level sets intersect at the same point, (rstep, −rstep).
There is a one-to-one relationship between values and slopes.
Optimal policies lie on the convex hull of the policy cloud.
40. And done!
π* = argmax_π pw/d = argmax_π w (since pw/d = pw·(w + l) = w)
(Vertical level sets.) When vt ≈ −1, we're there.
41. Algorithm
Set ε.
Initialize π0.
rstep ← 0
Repeat:
    Find π+ and vπ+ (solve from π0 by any RL method).
    rstep ← r+step
    π0 ← π+
Until |vπ+(s0) + 1| < ε.
42. Algorithm
On termination, π+ ≈ π*.
The r+step update uses a learning rate µ > 0:
r+step = rstep + µ[vπ+(s0) + 1]
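The outer loop above can be sketched under a simplifying assumption: the inner "solve by any RL method" step is replaced by exact enumeration over a small hypothetical policy cloud, with each candidate policy summarized only by its (pw, d) pair. The policy names and numbers are invented for illustration; the µ-update drives v(s0) toward −1, at which point rstep = 2pw/d of the pw/d-optimal policy.

```python
# Hypothetical policy cloud: each policy reduced to (pw, d).
policies = {"slow-safe":  (0.97222, 84.58333),
            "medium":     (0.9, 10.0),
            "fast-risky": (0.5, 2.0)}

def value(pw, d, r_step):
    # v = 2*pw - 1 - r_step*d
    return 2.0 * pw - 1.0 - r_step * d

def find_rstar(policies, mu=0.05, eps=1e-6, max_iter=10_000):
    r_step = 0.0
    for _ in range(max_iter):
        # inner step: value-optimal policy for the current r_step
        # (exact enumeration stands in for the RL solve)
        name, (pw, d) = max(policies.items(),
                            key=lambda kv: value(*kv[1], r_step))
        v0 = value(pw, d, r_step)
        if abs(v0 + 1.0) < eps:
            return name, r_step
        r_step += mu * (v0 + 1.0)  # learning-rate update
    raise RuntimeError("did not converge")

best, r_star = find_rstar(policies)
print(best, round(r_star, 4))  # fast-risky 0.5
```

Here "fast-risky" maximizes pw/d (0.25 versus 0.09 and 0.0115), and the loop settles at rstep = 2·0.5/2 = 0.5, where that policy's value is exactly −1.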
43. Optimal rstep update.
Minimize the interval of rstep uncertainty in the next iteration.
This requires solving a minimax problem: either the root of an 8th-degree polynomial in rstep, or the zero of the difference of two rational functions of order 4 (easy using the secant method).
O(log(1/ε)) complexity.
44. Extensions.
Problems solvable through a similar method:
Convex (linear) tradeoff: π* = argmax_{π∈Π} {α·pw − (1 − α)·d}
Greedy tradeoff: π* = argmax_{π∈Π} (2pw − 1)/d
Arbitrary tradeoffs: π* = argmax_{π∈Π} (α·pw − β)/d
Asymmetric rewards: rwin = a, rloss = −b; a, b ≥ 0
Games with tie outcomes.
Games with multiple win/loss rewards.
45. Harder family of problems
Maximize the probability of having won before n steps / m episodes.
Why? Non-linear level sets / non-convex functions in the w−l space.
46. Outline of future research.
Towards robustness.
Policy variation in tasks with fixed episode length. Inclusion of time as a component of the state space.
47. Defining policy neighbourhoods.
48.
1 Continuous/discrete statewise action neighbourhoods.
2 Discrete policy neighbourhoods for structured tasks.
3 General policy neighbourhoods.
49. Feature-robustness.
50.
1 Value/Speed/Execution neighbourhoods in the w−l space.
2 Robustness as a trading off of features.
51. Can traditional Reinforcement Learning methods still be used to handle the learning?