The document describes a study that analyzed code smells in popular Python projects for reinforcement learning. The study involved:
1) Selecting the 20 most popular Python RL repositories on GitHub.
2) Performing static analysis on the code to detect 8 different code smells based on predefined metrics and thresholds.
3) Finding that certain code smells like long methods and long classes were highly prevalent in the analyzed projects.
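For a sense of what step 2 involves, below is a minimal sketch of a threshold-based detector for one of the smells (Long Method), written against Python's standard ast module. It only illustrates the metrics-and-thresholds approach; it is not the tooling used in the study, and the threshold value is a placeholder.

# Minimal sketch of a threshold-based "Long Method" detector.
# This is not the tool used in the study, and MAX_STATEMENTS is a
# placeholder value, not the threshold the authors applied.
import ast
import sys

MAX_STATEMENTS = 38  # placeholder threshold, for illustration only


def count_statements(func_node):
    """Approximate method length as the number of statements it contains."""
    # ast.walk yields func_node itself, so subtract 1 to count only the body
    return sum(isinstance(n, ast.stmt) for n in ast.walk(func_node)) - 1


def find_long_methods(source, threshold=MAX_STATEMENTS):
    """Yield (name, line, size) for every function longer than the threshold."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            size = count_statements(node)
            if size > threshold:
                yield node.name, node.lineno, size


if __name__ == '__main__':
    path = sys.argv[1]
    with open(path) as handle:
        for name, line, size in find_long_methods(handle.read()):
            print(f'{path}:{line} {name} is a long method ({size} statements)')

Detectors for the other smells follow the same pattern: compute a structural metric per code entity and flag the entities that exceed a predefined threshold.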
[CAIN'23] Prevalence of Code Smells in Reinforcement Learning Projects
1. CAIN 2023
Prevalence of Code Smells in
Reinforcement Learning Projects
NicolĂĄs Cardozo, Ivana Dusparic, Christian Cabrera
Systems and Computing Engineering - Universidad de los Andes, BogotĂĄ - Colombia
Trinity College Dublin - Ireland
University of Cambridge - UK
n.cardozo@uniandes.edu.co, ivana.dusparic@tcd.ie, chc79@cam.ac.uk
@ncardoz
18. Metrics
def update(self, msg_env):
    '''
    Update the state of the agent
    :param msg_env: dict. A message generated by the order matching
    '''
    # check if should update, if it is not a trade
    # if not isinstance(msg_env, type(None)):
    if not msg_env:
        if not self.should_update():
            return None
    # recover basic infos
    inputs = self.env.sense(self)
    state = self.env.agent_states[self]
    s_cmm = self.env.s_main_intrument
    # Update state (position ,volume and if has an order in bid or ask)
    self.state = self.get_intern_state(inputs, state)
    # Select action according to the agent's policy
    s_action = None
    s_action, l_msg = self.take_action(self.state, msg_env)
    s_action2 = s_action
    # Execute action and get reward
    reward = 0.
    self.env.update_order_book(l_msg)
    l_prices_to_print = []
    if len(l_msg) == 0:
        reward += self.env.act(self, None)
        self.b_new_reward = False
    for msg in l_msg:
        if msg['agent_id'] == self.i_id:
            # check if should hedge the position
            self.should_change_stoptime(msg)
            # form log message
            s_action = msg['action']
            s_action2 = s_action
            s_side_msg = msg['order_side'].split()[0]
            s_indic = msg['agressor_indicator']
            s_cmm = msg['instrumento_symbol']
            d_aux = {'A': msg['order_status'],
                     # log just the last 4 digits of the order
                     'I': msg['order_id'] % 10**4,
                     'Q': msg['order_qty'],
                     'C': msg['instrumento_symbol'],
                     'S': s_side_msg,
                     'P': '{:0.2f}'.format(msg['order_price'])}
            l_prices_to_print.append(d_aux)
            # l_prices_to_print.append('{:0.2f}'.format(msg['order_price']))
            if s_indic == 'Agressor' and s_action == 'SELL':
                s_action2 = 'HIT'  # hit the bid
            elif s_indic == 'Agressor' and s_action == 'BUY':
                s_action2 = 'TAKE'  # take the offer
            try:
                # the agent's positions and orders list are update here
                # TODO: The reward really should be collect at this point?
                reward += self.env.act(self, msg)
                self.b_new_reward = False
            except:
                print 'BasicAgent.update(): Message with error at reward:'
                pprint.pprint(msg)
                raise
    # check if should cancel any order due to excess
    l_msg1 = self.could_include_new(s_action)
    self.env.update_order_book(l_msg1)
    for msg in l_msg1:
        if msg['agent_id'] == self.i_id:
            s_indic = msg['agressor_indicator']
            d_aux = {'A': msg['order_status'],
                     'I': msg['order_id'],
                     'C': msg['instrumento_symbol'],
                     'S': msg['order_side'].split()[0],
                     'P': '{:0.2f}'.format(msg['order_price'])}
            l_prices_to_print.append(d_aux)
            try:
                # the agent's positions and orders list are update here
                # there is no meaning in colecting reward here
                self.env.act(self, msg)
            except:
                print 'BasicAgent.update(): Message with error at reward:'
                pprint.pprint(msg)
                raise
    # === DEBUG ====
    # if len(l_msg1) > 0:
    #     print '\n====CANCEL ORDER DUE TO EXCESS======\n'
    #     pprint.pprint(l_msg1)
    # ==============
    # NOTE: I am not sure about that, but at least makes sense... I guess
    # I should have to apply the reward to the action that has generated
    # the trade (when my order was hit, I was in the book before)
    if s_action2 == s_action:
        if s_action == 'BUY':
            s_action = 'BEST_BID'
        elif s_action == 'SELL':
            s_action = 'BEST_OFFER'
    if s_action in ['correction_by_trade', 'crossed_prices']:
        if s_side_msg == 'Buy':
            s_action = 'BEST_BID'
        elif s_side_msg == 'Sell':
            s_action = 'BEST_OFFER'
    # Learn policy based on state, action, reward
    if s_cmm == self.env.s_main_intrument:
        if self.policy_update(self.state, s_action, reward):
            self.k_steps += 1
            self.n_steps += 1
            # print 'new step: {}\n'.format(self.n_steps)
    # calculate the next time that the agent will react
    if not isinstance(msg_env, type(dict)):
        self.next_time = self.env.order_matching.last_date
        f_delta_time = self.f_min_time
        # add additional miliseconds to the next_time to act
        if self.f_min_time > 0.004:
            if np.random.rand() > 0.4:
                i_mult = 1
                if np.random.rand() < 0.5:
                    i_mult = -1
                f_add = min(1., self.f_min_time*100)
                f_add *= np.random.rand()
                f_delta_time += (int(np.ceil(f_add))*i_mult)/1000.
        self.next_time += f_delta_time
        self.last_delta_time = int(f_delta_time * 1000)
    # print agent inputs
    self._pnl_information_update()
    self.log_step(state, inputs, s_action2, l_prices_to_print, reward)
Long method
19. Metrics
Long class
class QLearningAgent(BasicAgent):
    '''
    A representation of an agent that learns using Q-learning with linear
    parametrization and e-greedy exploration described at p.60 ~ p.61 form
    Busoniu at al., 2010. The approximator used is the implementation of tile
    coding, described at Sutton and Barto, 2016 (draft).
    '''
    actions_to_open = [None, 'BEST_BID', 'BEST_OFFER', 'BEST_BOTH']
    actions_to_close_when_short = [None, 'BEST_BID']
    actions_to_close_when_long = [None, 'BEST_OFFER']
    actions_to_stop_when_short = [None, 'BEST_BID', 'BUY']
    actions_to_stop_when_long = [None, 'BEST_OFFER', 'SELL']
    FROZEN_POLICY = False

    def __init__(self, env, i_id, d_normalizers, d_ofi_scale, f_min_time=3600.,
                 f_gamma=0.5, f_alpha=0.5, i_numOfTilings=16, s_decay_fun=None,
                 f_ttoupdate=5., d_initial_pos={}, s_hedging_on='DI1F19',
                 b_hedging=True, b_keep_pos=True):
        '''
        Initialize a QLearningAgent. Save all parameters as attributes
        :param env: Environment Object. The Environment where the agent acts
        :param i_id: integer. Agent id
        :param d_normalizers: dictionary. The maximum range of each feature
        :param f_min_time*: float. Minimum time in seconds to the agent react
        :param f_gamma*: float. weight of delayed versus immediate rewards
        :param f_alpha*: the initial learning rate used
        :param i_numOfTilings*: unmber of tiling desired
        :param s_decay_fun*: string. The exploration factor decay function
        :param f_ttoupdate*. float. time in seconds to choose a diferent action
        '''
        f_aux = f_ttoupdate
        super(QLearningAgent, self).__init__(env, i_id, f_min_time, f_aux,
                                             d_initial_pos=d_initial_pos)
        self.learning = True  # this agent is expected to learn
        self.decayfun = s_decay_fun
        # Initialize any additional variables here
        self.max_pos = 100.
        self.max_disclosed_pos = 10.
        self.orders_lim = 4
        self.order_size = 5
        self.s_agent_name = 'QLearningAgent'
        # control hedging
        obj_aux = risk_model.GreedyHedgeModel
        self.s_hedging_on = s_hedging_on
        self.risk_model = obj_aux(env, s_instrument=s_hedging_on,
                                  s_fairness='closeout')
        self.last_spread = [0.0, 0.0]
        self.f_spread = [0.0, 0.0]
        self.f_gamma = f_gamma
        self.f_alpha = f_alpha
        self.f_epsilon = 1.0
        self.b_hedging = b_hedging
        self.current_open_price = None
        self.current_max_price = -9999.
        self.current_min_price = 9999.
        self.b_keep_pos = b_keep_pos
        # Initialize any additional variables here
        self.f_time_to_buy = 0.
        self.f_time_to_sell = 0.
        self.b_print_always = False
        self.d_normalizers = d_normalizers
        self.d_ofi_scale = d_ofi_scale
        self.numOfTilings = i_numOfTilings
        self.alpha = f_alpha
        i_nTiling = i_numOfTilings
        value_fun = ValueFunction(f_alpha, d_normalizers, i_nTiling)
        self.value_function = value_fun
        self.old_state = None
        self.last_action = None
        self.last_reward = None
        self.disclosed_position = {}
        self.f_stop_time = STOP_MKT_TIME - 1 + 1
        # self.features_names = ['position', 'ofi_new', 'spread_longo',
        #                        'ratio_longo', 'ratio_curto',
        #                        'size_bid_longo', 'size_bid_curto',
        #                        'spread_curto', 'high_low', 'rel_price']
        self.features_names = ['position', 'ofi_new', 'ratio_longo',
                               'ratio_curto', 'spread_longo', 'rel_price']

    def reset_additional_variables(self, testing):
        '''
        Reset the state and the agent's memory about its positions
        :param testing: boolean. If should freeze policy
        '''
        self.risk_model.reset()
        self.f_time_to_buy = 0.
        self.f_time_to_sell = 0.
        self.last_reward = None
        self.current_open_price = None
        self.current_max_price = -9999.
        self.current_min_price = 9999.
        self.spread_position = {}
        self.disclosed_position = {}
        self.env.reward_fun.reset()
        for s_instr in self.env.l_instrument:
            self.disclosed_position[s_instr] = {'qAsk': 0., 'Ask': 0.,
                                                'qBid': 0., 'Bid': 0.}
        if testing:
            self.freeze_policy()

    def additional_actions_when_exec(self, s_instr, s_side, msg):
        '''
        Execute additional action when execute a trade
        :param s_instr: string.
        :param s_side: string.
        :param msg: dictionary. Last trade message
        '''
        # check if the main intrument was traded
        s_main = self.env.s_main_intrument
        if msg['instrumento_symbol'] == s_main:
            self.b_has_traded = True
        # check if it open or close a pos
        f_pos = self.position[s_instr]['qBid']
        f_pos -= self.position[s_instr]['qAsk']
        b_zeroout_buy = f_pos == 0 and s_side == 'ASK'
        b_zeroout_sell = f_pos == 0 and s_side == 'BID'
        b_new_buy = f_pos > 0 and s_side == 'BID'
        b_new_sell = f_pos < 0 and s_side == 'ASK'
        b_close_buy = f_pos > 0 and s_side == 'ASK'
        b_close_sell = f_pos < 0 and s_side == 'BID'
        s_other_side = 'BID'
        # set the time to open position if it just close it
        if b_close_buy or b_zeroout_buy:
            self.f_time_to_buy = self.env.order_matching.f_time + 60.
        elif b_close_sell or b_zeroout_sell:
            self.f_time_to_sell = self.env.order_matching.f_time + 60.
        # print when executed
        f_pnl = self.log_info['pnl']
        s_time = self.env.order_matching.s_time
        s_err = '{}: {} - current position {:0.2f}, PnL: {:0.2f}\n'
        print s_err.format(s_time, s_instr, f_pos, f_pnl)
        # keep a list of the opened positions
        if s_side == 'BID':
            s_other_side = 'ASK'
        if b_zeroout_buy or b_zeroout_sell:
            self.current_open_price = None  # update by risk model
            self.current_max_price = -9999.
            self.current_min_price = 9999.
            self.d_trades[s_instr][s_side] = []
            self.d_trades[s_instr][s_other_side] = []
        elif b_new_buy or b_new_sell:
            if b_new_buy:
                self.risk_model.price_stop_sell = None
            if b_new_sell:
                self.risk_model.price_stop_buy = None
            # log more information
            l_info_to_hold = [msg['order_price'], msg['order_qty'], None]
            if 'last_inputs' in self.log_info:
                l_info_to_hold[2] = self.log_info['last_inputs']['TOB']
            self.d_trades[s_instr][s_side].append(l_info_to_hold)
            self.d_trades[s_instr][s_other_side] = []
        elif b_close_buy or b_close_sell:
            f_qty_to_match = msg['order_qty']
            l_aux = []
            for f_price, f_qty, d_tob in self.d_trades[s_instr][s_other_side]:
                if f_qty_to_match == 0:
                    l_aux.append([f_price, f_qty, d_tob])
                elif f_qty <= f_qty_to_match:
                    f_qty_to_match -= f_qty
                elif f_qty > f_qty_to_match:
                    f_qty -= f_qty_to_match
                    f_qty_to_match = 0
                    l_aux.append([f_price, f_qty, d_tob])
            self.d_trades[s_instr][s_other_side] = l_aux
            if abs(f_qty_to_match) > 0:
                l_info_to_hold = [msg['order_price'], f_qty_to_match, None]
                if 'last_inputs' in self.log_info:
                    l_info_to_hold[2] = self.log_info['last_inputs']['TOB']
                self.d_trades[s_instr][s_side].append(l_info_to_hold)

    def need_to_hedge(self):
        '''
        Return if the agent need to hedge position
        '''
        # ask risk model if should hedge
        if not self.b_hedging:
            return False
        if not self.b_keep_pos:
            if self.env.order_matching.last_date > self.f_stop_time:
                if abs(self.log_info['duration']) > 0.01:
                    self.b_need_hedge = True
                    # print 'need_to_hedge(): HERE'
                    return self.b_need_hedge
        if self.risk_model.should_stop_disclosed(self):
            return True
        if self.risk_model.should_hedge_open_position(self):
            # check if should hedge position
            if abs(self.log_info['duration']) > 1.:
                self.b_need_hedge = True
            return self.b_need_hedge
        return False

    def get_valid_actions_old(self):
        '''
        Return a list of valid actions based on the current position
        '''
        # b_stop = False
        valid_actions = list(self.actions_to_open)
        if not self.risk_model.can_open_position('ASK', self):
            valid_actions = list(self.actions_to_close_when_short)  # copy
            if self.risk_model.should_stop_disclosed(self):
                # b_stop = True
                valid_actions = list(self.actions_to_stop_when_short)
        elif not self.risk_model.can_open_position('BID', self):
            valid_actions = list(self.actions_to_close_when_long)
            if self.risk_model.should_stop_disclosed(self):
                # b_stop = True
                valid_actions = list(self.actions_to_stop_when_long)
        return valid_actions

    def get_valid_actions(self):
        '''
        Return a list of valid actions based on the current position
        '''
        # b_stop = False
        valid_actions = list(self.actions_to_open)
        return valid_actions

    def get_intern_state(self, inputs, state):
        '''
        Return a dcitionary representing the intern state of the agent
        :param inputs: dictionary. what the agent can sense from env
        :param state: dictionary. the current state of the agent
        '''
        d_data = {}
        s_main = self.env.s_main_intrument
        d_data['OFI'] = inputs['qOfi']
        d_data['qBID'] = inputs['qBid']
        d_data['BOOK_RATIO'] = 0.
        d_data['LOG_RET'] = inputs['logret']
        d_rtn = {}
        d_rtn['cluster'] = 0
        d_rtn['Position'] = float(state[s_main]['Position'])
        d_rtn['best_bid'] = state['best_bid']
        d_rtn['best_offer'] = state['best_offer']
        # calculate the current position in the main instrument
        f_pos = self.position[s_main]['qBid']
        f_pos -= self.position[s_main]['qAsk']
        f_pos += self.disclosed_position[s_main]['qBid']
        f_pos -= self.disclosed_position[s_main]['qAsk']
        # calculate the duration exposure
        # f_duration = self.risk_model.portfolio_duration(self.position)
        # measure the OFI index
        f_last_ofi = 0.
        # if self.logged_action:
        #     compare with the last data
        #     if 'to_delta' in self.log_info:
        #         # measure the change in OFI from he last sction taken
        #         for s_key in [s_main]:
        #             i_ofi_now = inputs['OFI'][s_key]
        #             i_ofi_old = self.log_info['to_delta']['OFI'][s_key]
        #             f_aux = i_ofi_now - i_ofi_old
        #             f_last_ofi += f_aux
        f_last_ofi = inputs['dOFI'][s_main]
        # for the list to be used as features
        fun = self.bound_values
        s_lng = self.env.s_main_intrument
        s_crt = self.s_hedging_on
        l_values = [fun(f_pos * 1., 'position'),
                    fun(f_last_ofi, 'ofi_new', s_main),
                    fun(inputs['ratio'][s_lng]['BID'], 'ratio_longo'),
                    fun(inputs['ratio'][s_crt]['BID'], 'ratio_curto'),
                    fun(inputs['spread'][s_lng], 'spread_longo'),
                    # fun(inputs['spread'][s_crt], 'spread_curto'),
                    # fun(inputs['size'][s_lng]['BID'], 'size_bid_longo'),
                    # fun(inputs['size'][s_crt]['BID'], 'size_bid_curto'),
                    # fun(inputs['HighLow'][s_lng], 'high_low'),
                    fun(inputs['reallAll'][s_lng], 'rel_price')]
        d_rtn['features'] = dict(zip(self.features_names, l_values))
        return d_rtn

    def bound_values(self, f_value, s_feature_name, s_cmm=None):
        '''
        Return the value bounded by the maximum and minimum values predicted.
        Also apply nomalizations functions if it is defined and d_normalizers,
        in the FUN key.
        :param f_value: float. value to be bounded
        :param s_feature_name: string. the name of the feature in d_normalizers
        :param s_cmm*: string. Name of the instrument
        '''
        f_max = self.d_normalizers[s_feature_name]['MAX']
        f_min = self.d_normalizers[s_feature_name]['MIN']
        f_value2 = max(f_min, f_value)
        f_value2 = min(f_max, f_value)
        if 'FUN' in self.d_normalizers[s_feature_name]:
            if s_feature_name == 'ofi_new':
                f = self.d_normalizers[s_feature_name]['FUN'](f_value, s_cmm)
                f = max(f_min, f)
                f = min(f_max, f)
                f_value2 = f
            else:
                f_value2 = self.d_normalizers[s_feature_name]['FUN'](f_value2)
        return f_value2

    def get_epsilon_k(self):
        '''
        Get $epsilon_k$ according to the exploration schedule
        '''
        trial = self.env.count_trials - 2  # ?
        if self.decayfun == 'tpower':
            # e = a^t, where 0 < z < 1
            # self.f_epsilon = math.pow(0.9675, trial)  # for 100 trials
            self.f_epsilon = math.pow(0.9333, trial)  # for 50 trials
        elif self.decayfun == 'trig':
            # e = cos(at), where 0 < z < 1
            # self.f_epsilon = math.cos(0.0168 * trial)  # for 100 trials
            self.f_epsilon = math.cos(0.03457 * trial)  # for 50 trials
        else:
            # self.f_epsilon = max(0., 1. - (1./45. * trial))  # for 50 trials
            self.f_epsilon = max(0., 1. - (1./95. * trial))  # for 100 trials
        return self.f_epsilon

    def choose_an_action(self, d_state, valid_actions):
        '''
        Return an action from a list of allowed actions according to the
        agent's policy based on epsilon greedy policy and valueFunction
        :param d_state: dictionary. The inputs to be considered by the agent
        :param valid_actions: list. List of the allowed actions
        '''
        # return a uniform random action with prob $epsilon_k$ (exploration)
        state_ = d_state['features']
        best_Action = random.choice(valid_actions)
        if not self.FROZEN_POLICY:
            if np.random.binomial(1, self.get_epsilon_k()) == 1:
                return best_Action
        # apply: arg max_{u'} ( phi^T (x_k, u') theta_k)
        values = []
        for action in valid_actions:
            values.append(self.value_function.value(state_, action, self))
        # return self.d_value_to_action[argmax(values)]
        return valid_actions[argmax(values)]

    def apply_policy(self, state, action, reward):
        '''
        Learn policy based on state, action, reward. The algo part of "apply
        action u_k" is in the update method from agent frmk as the update just
        occur after one trial, state and reward are at the next step. Return
        True if the policy was updated
        :param state: dictionary. The current state of the agent
        :param action: string. the action selected at this time
        :param reward: integer. the rewards received due to the action
        '''
        # check if there is some state in cache
        state_ = state['features']
        valid_actions = self.get_valid_actions()
        if self.old_state and not self.FROZEN_POLICY:
            # TD Update
            q_values_next = []
            for act in valid_actions:
                # for the vector: $(phi^t (x_{k+1}, u') * theta_k)_{u'}$
                # state here plays the role of next state x_{k+1}. act are u's
                f_value = self.value_function.value(state_, act, self)
                q_values_next.append(f_value)
            # Q-Value TD Target
            # apply: Qhat <- r_{k+1} + y max_u' (phi^T(x_{k+1}, u') theta_k)
            # note that u' is the result of apply u in x. u' is the action that
            # would maximize the estimated Q-value for the state x'
            td_target = self.last_reward + self.f_gamma * np.max(q_values_next)
            # Update the state value function using our target
            # apply: $theta_{k+1} <- alpha_k (Q_ - Qhat) theta(x_k, u_k)$
            # the remain part of the update is inside the method learn
            # use last_action here because it generated the curremt reward
            self.value_function.learn(self.old_state, self.last_action,
                                      td_target, self)
        # save current state, action and reward to use in the next run
        self.old_state = state_  # in the next run it is x_k <- x_{k+1}
        self.last_action = action  # in the next run it is u_k
        self.last_reward = reward  # in the next run it is r_{k+1}
        if action in ['SELL', 'BUY']:
            print '=',
        return True

    def set_qtable(self, s_fname, b_freezy_policy=True):
        '''
        Set up the q-table to be used in testing simulation and freeze policy
        :param s_fname: string. Path to the qtable to be used
        '''
        # freeze policy if it is for test and not for traning
        if b_freezy_policy:
            self.freeze_policy()
        # load qtable and transform in a dictionary
        value_fun = pickle.load(open(s_fname, 'r'))
        self.value_function = value_fun
        # log file used
        s_print = '{}.set_qtable(): Setting up the agent to use'
        s_print = s_print.format(self.s_agent_name)
        s_print += ' the Value Function at {}'.format(s_fname)
        # DEBUG
        logging.info(s_print)

    def stop_on_main(self, l_msg, l_spread):
        '''
        Stop on the main instrument
        :param l_msg: list.
        :param l_spread: list.
        '''
        s_main_action = ''
        if self.risk_model.should_stop_disclosed(self):
            if self.log_info['duration'] < 0.:
                print '=',
                s_main_action = 'SELL'
            if self.log_info['duration'] > 0.:
                print '=',
                s_main_action = 'BUY'
        if self.env.order_matching.last_date > self.f_stop_time:
            if not self.b_keep_pos:
                if self.log_info['duration'] < 0.:
                    print '>',
                    s_main_action = 'SELL'
                if self.log_info['duration'] > 0.:
                    print '>',
                    s_main_action = 'BUY'
        # place orders in the best price will be handle by the spread
        # in the next time the agent updates its orders
        # l_spread_main = self._select_spread(self.state, s_code)
        if s_main_action in ['BUY', 'SELL']:
            self.b_need_hedge = False
            l_msg += self.cancel_all_hedging_orders()
            l_msg += self.translate_action(self.state, s_main_action,
                                           l_spread=l_spread)
            return l_msg
        return []

    def msgs_due_hedge(self):
        '''
        Return messages given that the agent needs to hedge its positions
        '''
        # check if there are reasons to hedge
        l_aux = self.risk_model.get_instruments_to_hedge(self)
        l_msg = []
        if l_aux:
            # print '\nHedging {} ...\n'.format(self.position['DI1F21'])
            s_, l_spread = self._select_spread(self.state, None)
            s_action, s_instr, i_qty = random.choice(l_aux)
            # generate the messages to the environment
            my_book = self.env.get_order_book(s_instr, False)
            row = {}
            row['order_side'] = ''
            row['order_price'] = 0.0
            row['total_qty_order'] = abs(i_qty)
            row['instrumento_symbol'] = s_instr
            row['agent_id'] = self.i_id
            # check if should send mkt orders in the main instrument
            l_rtn = self.stop_on_main(l_msg, l_spread)
            if len(l_rtn) > 0:
                # print 'stop on main'
                s_time = self.env.order_matching.s_time
                print '{}: Stop loss. {}'.format(s_time, l_aux)
                return l_rtn
            # generate trade and the hedge instruments
            s_time = self.env.order_matching.s_time
            print '{}: Stop gain. {}'.format(s_time, l_aux)
            if s_action == 'BUY':
                self.b_need_hedge = False
                row['order_side'] = 'Buy Order'
                row['order_price'] = my_book.best_ask[0]
                l_msg += self.cancel_all_hedging_orders()
                l_msg += translator.translate_trades_to_agent(row, my_book)
                return l_msg
            elif s_action == 'SELL':
                self.b_need_hedge = False
                row['order_side'] = 'Sell Order'
                row['order_price'] = my_book.best_bid[0]
                l_msg += self.cancel_all_hedging_orders()
                l_msg += translator.translate_trades_to_agent(row, my_book)
                return l_msg
            # generate limit order or cancel everything
            elif s_action == 'BEST_BID':
                f_curr_price, i_qty_book = my_book.best_bid
                l_spread = [0., self.f_spread_to_cancel]
            elif s_action == 'BEST_OFFER':
                f_curr_price, i_qty_book = my_book.best_ask
                l_spread = [self.f_spread_to_cancel, 0.]
            if s_action in ['BEST_BID', 'BEST_OFFER']:
                i_order_size = row['total_qty_order']
                l_msg += translator.translate_to_agent(self,
                                                       s_action,
                                                       my_book,
                                                       # worst t/TOB
                                                       l_spread,
                                                       i_qty=i_order_size)
                return l_msg
            else:
                # if there is not need to send any order, so there is no
                # reason to hedge
                self.b_need_hedge = False
                l_msg += self.cancel_all_hedging_orders()
                return l_msg
        self.b_need_hedge = False
        return l_msg

    def cancel_all_hedging_orders(self):
        '''
        Cancel all hedging orders that might be in the books
        '''
        l_aux = []
        for s_instr in self.risk_model.l_hedging_instr:
            my_book = self.env.get_order_book(s_instr, False)
            f_aux = self.f_spread_to_cancel
            l_aux += translator.translate_to_agent(self,
                                                   None,
                                                   my_book,
                                                   # worst t/TOB
                                                   [f_aux, f_aux])
        return l_aux

    def _select_spread(self, t_state, s_code=None):
        '''
        Select the spread to use in a new order. Return the criterium and
        a list od spread
        :param t_state: tuple. The inputs to be considered by the agent
        '''
        l_spread = [0.0, 0.0]
        s_main = self.env.s_main_intrument
        my_book = self.env.get_order_book(s_main, False)
        # check if it is a valid book
        if abs(my_book.best_ask[0] - my_book.best_bid[0]) <= 1e-6:
            return s_code, [0.02, 0.02]
        elif my_book.best_ask[0] - my_book.best_bid[0] <= -0.01:
            return s_code, [0.15, 0.15]
        # check if should stop to trade
        if self.risk_model.b_stop_trading:
            return s_code, [0.04, 0.04]
        # check if it is time to get agressive due to closing market
        if self.env.order_matching.last_date > STOP_MKT_TIME:
            if self.log_info['pos'][s_main] < -0.01:
                return s_code, [0.0, 0.04]
            elif self.log_info['pos'][s_main] > 0.01:
                return s_code, [0.04, 0.0]
            else:
                return s_code, [0.04, 0.04]
        # change spread
        if not self.risk_model.should_open_at_current_price('ASK', self):
            l_spread[1] = 0.01
        elif not self.risk_model.should_open_at_current_price('BID', self):
            l_spread[0] = 0.01
        # if it just have close a position at the specific side
        if self.env.order_matching.f_time < self.f_time_to_buy:
            l_spread[0] = 0.01
        if self.env.order_matching.f_time < self.f_time_to_sell:
            l_spread[1] = 0.01
        # check if can not open positions due to limits
        if not self.risk_model.can_open_position('ASK', self):
            l_spread[1] = 0.02
        elif not self.risk_model.can_open_position('BID', self):
            l_spread[0] = 0.02
        return s_code, l_spread

    def should_print_logs(self, s_question):
        '''
        Return if should print the log based on s_question:
        :param s_question: string. All or 5MIN
        '''
        if self.b_print_always:
            return True
        if s_question == 'ALL':
            return PRINT_ALL
        elif s_question == '5MIN':
            return PRINT_5MIN
        return False

    def set_to_print_always(self):
        '''
        '''
        self.b_print_always = True
20. Metrics
state = [
    (player.x_change == 20 and player.y_change == 0 and
     ((list(map(add, player.position[-1], [20, 0])) in player.position) or
      player.position[-1][0] + 20 >= (game.game_width - 20))) or
    (player.x_change == -20 and player.y_change == 0 and
     ((list(map(add, player.position[-1], [-20, 0])) in player.position) or
      player.position[-1][0] - 20 < 20)) or
    (player.x_change == 0 and player.y_change == -20 and
     ((list(map(add, player.position[-1], [0, -20])) in player.position) or
      player.position[-1][-1] - 20 < 20)) or
    (player.x_change == 0 and player.y_change == 20 and
     ((list(map(add, player.position[-1], [0, 20])) in player.position) or
      player.position[-1][-1] + 20 >= (game.game_height - 20))),  # danger straight

    (player.x_change == 0 and player.y_change == -20 and
     ((list(map(add, player.position[-1], [20, 0])) in player.position) or
      player.position[-1][0] + 20 > (game.game_width - 20))) or
    (player.x_change == 0 and player.y_change == 20 and
     ((list(map(add, player.position[-1], [-20, 0])) in player.position) or
      player.position[-1][0] - 20 < 20)) or
    (player.x_change == -20 and player.y_change == 0 and
     ((list(map(add, player.position[-1], [0, -20])) in player.position) or
      player.position[-1][-1] - 20 < 20)) or
    (player.x_change == 20 and player.y_change == 0 and
     ((list(map(add, player.position[-1], [0, 20])) in player.position) or
      player.position[-1][-1] + 20 >= (game.game_height - 20))),  # danger right

    (player.x_change == 0 and player.y_change == 20 and
     ((list(map(add, player.position[-1], [20, 0])) in player.position) or
      player.position[-1][0] + 20 > (game.game_width - 20))) or
    (player.x_change == 0 and player.y_change == -20 and
     ((list(map(add, player.position[-1], [-20, 0])) in player.position) or
      player.position[-1][0] - 20 < 20)) or
    (player.x_change == 20 and player.y_change == 0 and
     ((list(map(add, player.position[-1], [0, -20])) in player.position) or
      player.position[-1][-1] - 20 < 20)) or
    (player.x_change == -20 and player.y_change == 0 and
     ((list(map(add, player.position[-1], [0, 20])) in player.position) or
      player.position[-1][-1] + 20 >= (game.game_height - 20))),  # danger left

    player.x_change == -20,  # move left
    player.x_change == 20,   # move right
    player.y_change == -20,  # move up
    player.y_change == 20,   # move down
    food.x_food < player.x,  # food left
    food.x_food > player.x,  # food right
    food.y_food < player.y,  # food up
    food.y_food > player.y   # food down
]
Multiply-nested container
State space: 11
depth: 5
29. Conclusion
RL projects contain many code smells: 3.15 on average per file,
and up to 8 per file (or 1 code smell every 27 lines).
Of the top 4 most common code smells, 3 are shared across the 2 data sets
(Multiply-Nested Container, Long Method, Long Parameter List).
State representations are inherently complex.
Functionality is presented as a single code block.
RL algorithms are riddled with learning parameters.
Code smells point to a violation of design principles
(coupling, cohesion, single responsibility)
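The Long Parameter List finding is easy to see in the QLearningAgent constructor above, which takes more than a dozen loose hyperparameters. A hedged sketch of one common remedy, with hypothetical names not taken from any of the analyzed repositories, is to group the learning parameters into a single config object:

# Hypothetical sketch: grouping loose hyperparameters into one config object
# to avoid the Long Parameter List smell. Names are illustrative only, not
# taken from the analyzed projects.
from dataclasses import dataclass


@dataclass
class QLearningConfig:
    gamma: float = 0.5          # discount factor
    alpha: float = 0.5          # initial learning rate
    num_tilings: int = 16       # tile-coding resolution
    min_time: float = 3600.0    # minimum reaction time in seconds
    decay_fun: str = 'linear'   # exploration decay schedule


class QLearningAgent:
    def __init__(self, env, agent_id, config=None):
        self.env = env
        self.agent_id = agent_id
        self.config = config or QLearningConfig()


# usage: only the values that differ from the defaults are spelled out
agent = QLearningAgent(env=None, agent_id=1,
                       config=QLearningConfig(gamma=0.9, num_tilings=8))

Call sites then only name the values they change, and the constructor signature stays stable as new hyperparameters are introduced.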
32. Future perspectives
Specific metrics and code smells for RL:
we need specific metrics, thresholds, and tools
to capture the complexity of RL algorithms.
The complexity of RL can be managed by creating dedicated data structures
or by expressing relations between entities more ergonomically.
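As an illustration of the last point, the multiply-nested Snake state shown earlier could be expressed through a dedicated, named structure instead of a positional list of anonymous booleans. This is a hypothetical sketch, not code from the study or from the analyzed repository:

# Hypothetical sketch of a "dedicated data structure" for an RL state,
# replacing a multiply-nested list of anonymous booleans (as in the Snake
# example above) with named fields. Not taken from the analyzed code.
from typing import NamedTuple


class SnakeState(NamedTuple):
    danger_straight: bool
    danger_right: bool
    danger_left: bool
    moving_left: bool
    moving_right: bool
    moving_up: bool
    moving_down: bool
    food_left: bool
    food_right: bool
    food_up: bool
    food_down: bool

    def as_vector(self):
        """Flatten to the 11-element list the learning code already expects."""
        return [int(flag) for flag in self]


# usage: each condition is computed and named separately, so the nesting
# depth of any single expression stays small
state = SnakeState(
    danger_straight=False, danger_right=False, danger_left=True,
    moving_left=False, moving_right=True, moving_up=False, moving_down=False,
    food_left=False, food_right=True, food_up=True, food_down=False,
)
vector = state.as_vector()  # [0, 0, 1, 0, 1, 0, 0, 0, 1, 1, 0]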