1. Augmenting Decisions of Taxi Drivers through Reinforcement Learning for Improving Revenues
AAAI (Association for the Advancement of Artificial Intelligence), 2017
Tanvi Verma, Pradeep Varakantham, Sarit Kraus, Hoong Chuin Lau
November 3, 2021
Presenter: Kyunghwan Mun
3. Introduction
• Taxis roam around without a customer (cruising)
▪ It is important to reduce cruising time and increase revenue
• Being at the right “location” at the right “time”
• Reinforcement Learning (RL)
▪ Maximizing long-term revenue
• Requires making a sequence of decisions
• e.g., wait for 5 minutes, or move to another zone
• The RL problem is well defined:
• Revenue earned from a customer
• Cost from travelling between locations
• Uncertain customer demand
• Reinforcement Learning captures uncertainty well
• The learning focus of RL can adapt to demand patterns
4. Introduction
• Contributions
▪ Annotation procedure for the trajectory data
▪ Monte Carlo Reinforcement Learning
▪ Iterative abstraction
▪ Evaluation method
• The average revenue earned by the learned policy exceeds the top 10 percentile revenue
• In some time intervals, the agent's performance exceeds the top 1 percentile revenue
• Taxi utilization increases when employing the revenue-maximization objective
5. Related Work
• Taxi Guidance
▪ Pick-up probability to recommend a driving route for profit maximization
▪ Cruising route to vacant taxis such that vacancy time is minimized
▪ Driver’s experience to find parking spots for a cruising taxi
▪ Taxi trajectories to learn traffic patterns and estimate travel time
▪ Locations for taxi drivers by constructing a spatio-temporal profitability map
• Surrounding regions of the driver
• Computing potential profit using historical data
▪ Considers long-term revenue
▪ Any preferences with respect to areas are inherently captured
▪ Relies on past experiences
▪ Taxi trajectory data
6. Related Work
• Reinforcement Learning (RL)
▪ Model-based learning
• Transition probabilities
• Reward function to compute values of states
▪ Model-free learning
• Learns directly from samples of experience obtained from the dataset
• Temporal Difference method
• Monte Carlo method
• Estimate state-action values
7. Related Work
• Deep Reinforcement Learning (DeepRL)
▪ Ideal for environments where tens of millions of learning episodes are available
▪ Inappropriate to apply in the taxi case
• The number of features within the state space is too small
8. Methodology
• Taxi Dataset
▪ A major company in Singapore
▪ Each log entry of the data contains (see the sketch after this list):
• Latitude (GPS)
• Longitude (GPS)
• Taxi ID
• Driver ID
• Taxi Status
• Free (meter off, actively looking for the next passenger)
• Busy (not accepting bookings)
• POB (Passenger On Board)
• Off-line
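As a rough illustration, the log schema listed above could be modelled as follows; the field names and types are assumptions, since the slide only names the attributes:

```python
from dataclasses import dataclass
from enum import Enum

class TaxiStatus(Enum):
    FREE = "free"        # meter off, actively looking for the next passenger
    BUSY = "busy"        # not accepting bookings
    POB = "pob"          # Passenger On Board
    OFFLINE = "offline"

@dataclass
class LogEntry:
    latitude: float      # GPS
    longitude: float     # GPS
    taxi_id: str
    driver_id: str
    status: TaxiStatus
```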
9. Methodology
• Driver Activity Graphs - 1
▪ Cruising trajectory
• “Free” state → “Non-free” state (passenger on board, busy, break, off-line, on call, etc.)
• Cruising trajectories of drivers from the dataset
• Annotating the trajectories with the decisions made
▪ Figure 1.
• Starts at A
• Terminates at E
• B, C, and D are intermediate decision coordinates
▪ Desired path
• The shortest path between A and E
• Evaluate if the driver could have made the decision to go to D at A
• If not, include C in the trajectory and repeat to obtain the final trajectory (sketched below)
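A hedged sketch of one reading of this annotation step: a coordinate is kept as a decision point only when the observed sub-trajectory deviates from a shortest path. shortest_path_equals() is a hypothetical helper, not from the paper:

```python
# Starting from A, find the farthest coordinate the driver could have headed
# to directly (the observed sub-trajectory is a shortest path); anything that
# breaks this property is kept as a genuine decision coordinate.
def annotate_decisions(trajectory, shortest_path_equals):
    decisions = [trajectory[0]]              # always keep the start (A)
    i = 0
    while i < len(trajectory) - 1:
        j = len(trajectory) - 1              # try the terminal (E) first
        while j > i + 1 and not shortest_path_equals(
                trajectory[i], trajectory[j], trajectory[i:j + 1]):
            j -= 1                           # fall back to a nearer point (D, C, ...)
        decisions.append(trajectory[j])
        i = j
    return decisions
```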
10. Methodology
• Driver Activity Graphs – 2
▪ Convert each cruising trajectory into an activity graph
• A directed graph with decision coordinates as nodes
• The distance travelled between two coordinates is the weight of the edge between them
• The terminating node of the activity graph contains information about the revenue earned
• Earned revenue = the fare of the trip − the cost of travel for the trip
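A minimal sketch of the activity-graph structure described above; the class and method names are assumptions:

```python
# Nodes are decision coordinates, edge weights are the distance travelled
# between them, and the terminating node carries the revenue earned
# (fare of the trip minus the cost of travel for the trip).
class ActivityGraph:
    def __init__(self):
        self.edges = {}    # (src, dst) -> distance travelled
        self.revenue = {}  # terminating node -> revenue earned

    def add_leg(self, src, dst, distance):
        self.edges[(src, dst)] = distance

    def terminate(self, node, fare, travel_cost):
        self.revenue[node] = fare - travel_cost

# illustrative usage with the coordinates from Figure 1 (distances made up)
g = ActivityGraph()
g.add_leg("A", "B", 1.2)
g.add_leg("B", "E", 3.5)
g.terminate("E", fare=18.0, travel_cost=4.7)
```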
11. Methodology
• Reinforcement Learning (RL) for Taxi Driver
▪ State is given as follows:
• <day of week, zone, time interval>
• Divide the entire map of Singapore into several zones
• Time intervals: 0–6, 6–9, 9–12, 12–17, 17–20, 20–24 hours
• For n zones, there are n available actions
(stay in the current zone / move to one of the remaining n−1 zones)
▪ Episodes
• “Non-free” state → “Free” state → “Non-free” state (termination)
• Negative reward: the cost of travel between nodes, i.e. a fixed cost per km applied to the weight of the edge
• Positive reward: the fare of the trip − the cost to travel the trip
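A minimal sketch of the state and action representation described on this slide; the names are assumptions:

```python
from typing import NamedTuple

class State(NamedTuple):
    day_of_week: int    # 0 = Monday ... 6 = Sunday
    zone: int           # index of the zone the taxi is currently in
    time_interval: int  # index into INTERVALS below

# interval boundaries (in hours) from the slide
INTERVALS = [(0, 6), (6, 9), (9, 12), (12, 17), (17, 20), (20, 24)]

def actions(n_zones):
    # for n zones there are n actions: action i means "be in zone i next"
    # (stay if i is the current zone, otherwise move there)
    return list(range(n_zones))
```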
12. Methodology
• Reinforcement Learning (RL) for Taxi Driver
▪ (Algorithm 1) Monte Carlo Estimation of Q Values
• Return: the cumulative reward accumulated till the end of the episode
• Q(s, a): the value of the (s, a) pair
• Variable min-count to avoid inaccurate estimated values
• Count(s, a): the total number of training episodes in which (s, a) was visited
• Policy π(s): mapping state s to its optimal action
• S: the set of states
• A: the set of actions
• S_learned: the set of states for which we could learn an optimal policy (a sketch follows this list)
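A minimal sketch of the Monte Carlo estimation summarized above: every-visit averaging of undiscounted returns, with min-count guarding against poorly estimated pairs. Function and variable names are assumptions; this is not the authors' exact Algorithm 1:

```python
from collections import defaultdict

def monte_carlo_q(episodes, min_count):
    """episodes: list of [(state, action, reward), ...] sequences.
    Returns Q estimates, the learned policy pi, and S_learned."""
    q_sum = defaultdict(float)
    count = defaultdict(int)   # Count(s, a)

    for episode in episodes:
        ret = 0.0
        # walk backwards so ret is the cumulative reward till episode end
        for state, action, reward in reversed(episode):
            ret += reward
            q_sum[(state, action)] += ret
            count[(state, action)] += 1

    q = {sa: q_sum[sa] / count[sa] for sa in q_sum}

    policy, s_learned = {}, set()
    for s in {s for s, _ in q}:
        # only trust (s, a) pairs visited at least min_count times
        seen = [(v, a) for (s2, a), v in q.items()
                if s2 == s and count[(s2, a)] >= min_count]
        if seen:
            policy[s] = max(seen)[1]   # pi(s): the optimal action
            s_learned.add(s)
    return q, policy, s_learned
```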
13. Methodology
• Reinforcement Learning (RL) for Taxi Driver
▪ Zone Structure
• Too big zones → increased uncertainty in the outcome of actions
• Too small zones → not enough training data to learn something meaningful
• It is important to balance uncertainty and granularity
▪ Method 1) Static Zones (see the sketch after this list)
• Start with a large number of uniformly distributed zones
• Check how many relevant episodes are present in each zone
• If the number < min-count, merge the zone with a neighbouring zone so that it has sufficient data
(500 zones → 111 zones)
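A hedged sketch of the static-zone merging pass; the rule for choosing which neighbour to merge into is an assumption, since the slide only says under-sampled zones are merged until they have sufficient data:

```python
# A sketch only: zone adjacency after a merge is not updated, and the
# "merge into the neighbour with the most data" rule is an assumption.
def merge_static_zones(zones, episode_count, neighbours, min_count):
    merged_into = {}
    changed = True
    while changed:
        changed = False
        for z in sorted(zones):
            if episode_count[z] >= min_count:
                continue
            nbrs = [n for n in neighbours[z] if n in zones]
            if not nbrs:
                continue  # isolated under-sampled zone: leave as-is
            target = max(nbrs, key=lambda n: episode_count[n])
            episode_count[target] += episode_count.pop(z)
            zones.remove(z)
            merged_into[z] = target
            changed = True
            break  # restart the scan after every merge
    return zones, merged_into
```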
14. Methodology
• Reinforcement Learning (RL) for Taxi Driver
▪ Method 2) Dynamic Zones
• Fix time-interval and day-of-the-week → each zone maps to a unique state and a unique action
• Decide whether certain low-valued zones need to be split into smaller zones
• For split zones, learn Q-values for the new set of zones
• Check if certain zones can be split further
• Decreases the uncertainty in the outcome of the optimal action
• If the smaller zones have adequate data & increase the overall value of the bigger zone
→ split the larger zone into smaller zones
15. Methodology
• (Algorithm 2) Dynamic zoning
▪ Start with four large uniform zones
▪ Split the zones repeatedly until further splitting is not possible
• (Algorithm 3) WorthSplitting(z)
▪ Split the zones using K-Means Clustering
• Size of child zones > min-size
• max_a Q(s1, a) + max_a Q(s2, a) > max_a Q(s, a)
• argmax_a Q(s1, a) ≠ argmax_a Q(s, a) OR argmax_a Q(s2, a) ≠ argmax_a Q(s, a)
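A hedged sketch of the WorthSplitting check as summarized above, using scikit-learn's K-Means for the 2-way split; the helper names and the way Q-values for the candidate children are supplied are assumptions:

```python
import numpy as np
from sklearn.cluster import KMeans

def worth_splitting(coords, q_parent, q_child1, q_child2, min_size):
    """coords: (N, 2) array of episode coordinates inside zone z.
    q_*: dict action -> Q value for the parent state s and the two
    candidate child states s1, s2. Returns (should_split, labels)."""
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(coords)

    # size of both child zones must exceed min-size
    if min(np.sum(labels == 0), np.sum(labels == 1)) < min_size:
        return False, labels

    best = lambda qv: max(qv, key=qv.get)   # argmax_a Q(., a)

    # max_a Q(s1, a) + max_a Q(s2, a) > max_a Q(s, a)
    gains_value = (max(q_child1.values()) + max(q_child2.values())
                   > max(q_parent.values()))
    # at least one child must prefer a different action than the parent
    new_action = (best(q_child1) != best(q_parent)
                  or best(q_child2) != best(q_parent))

    return gains_value and new_action, labels
```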
16. Experiments
• Evaluation Method
▪ Compare (a), (b) and (c)
• Average revenue earned by our learning agent … (a)
• The top percentile revenue of drivers … (b)
• Revenue earned by greedy heuristics typically employed by drivers during cruising … (c)
▪ Simulation of Agent Movements
• Assign the available trips to the agent while considering competition from active drivers
• Trip data and trajectories of all active drivers during a given date and time-interval
• Finding the relevant available trips (non pre-booked trips) that originated from each state
• Revenue earned, duration and distance for each trip
• Assignment probability: p_assign(s_t) = (the number of trips available) / (the number of cruising drivers present in the state at that time)
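In code, the assignment probability is a simple ratio; clamping it to 1 is an assumption for states with more trips than drivers:

```python
def p_assign(trips_available, cruising_drivers):
    # the number of trips available, divided by the number of cruising
    # drivers present in the state at that time
    return min(1.0, trips_available / max(1, cruising_drivers))
```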
17. Experiments
• Evaluation Method
▪ Driver revenue
• It is difficult to estimate the exact cruising distance of our agent
• Apply a cost of travel per cruising minute
• Compute the time duration for which the driver was not hired in the time interval
• A cruising-cost per minute is applied for this duration
• Driver's revenue in a time interval
= fares of all the driver's trips in the interval − cost of travelling all trip distances − cost of all cruising
▪ Heuristic strategy
• Stay in the current zone with probability p_stay = 0.5 and move with the remaining probability
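A minimal sketch of the driver-revenue computation above; the parameter names are assumptions:

```python
def driver_revenue(trip_fares, trip_km, cruising_minutes,
                   cost_per_km, cruising_cost_per_min):
    # fares of all trips in the interval
    # minus the cost of travelling all trip distances
    # minus the cruising cost for the minutes the driver was not hired
    return (sum(trip_fares)
            - cost_per_km * sum(trip_km)
            - cruising_cost_per_min * cruising_minutes)
```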
18. Experiments
• Evaluation Method
▪ Agent revenue
• Compute the agent's revenue for each time interval (see the sketch below)
• Initialize time to the start time of the interval
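A hedged sketch of the per-interval evaluation loop implied by these bullets: with probability p_assign the agent receives a historical trip, otherwise it follows the learned policy and pays cruising cost. The env helpers are assumptions bundling the simulator details:

```python
import random

def simulate_interval(start, end, state, policy, env):
    """env bundles assumed helpers: p_assign(state, t); sample_trip(state, t)
    returning a trip with fare/travel_cost/duration/destination_state; and
    move(state, action) returning (next_state, cruise_cost, minutes)."""
    t, revenue = start, 0.0
    while t < end:
        if random.random() < env.p_assign(state, t):
            trip = env.sample_trip(state, t)          # a real historical trip
            revenue += trip.fare - trip.travel_cost   # positive reward
            t += trip.duration
            state = trip.destination_state
        else:
            action = policy[state]                    # pi(s): stay or move
            state, cruise_cost, minutes = env.move(state, action)
            revenue -= cruise_cost                    # cruising cost
            t += minutes
    return revenue
```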
19. Experiments
• Experimental Results
▪ Evaluation dataset period: 1 month
▪ Average agent revenue vs. average of top percentile revenues earned by drivers
• Compared with the top 10 percentile revenues
▪ Starting states of the agent: top 500 drivers in each time interval
• For a given time interval and day, the agent revenue is averaged over 500,000 executions
(500 different initial states × 1,000 executions)
20. Conclusion and Discussion
• Limitations & Requirements
▪ One single learning agent
→ multiple learning agents
▪ Starting states of the agent: top 500 drivers in each time interval
→ dynamic starting states with multiple learning agents
▪ Simple taxi states → construct more diverse taxi states (K trips, etc.)
▪ Construction of the time intervals
→ divide them based on historical data
▪ Condition under which the episode ends
→ set an end condition for episodes that exceed a specific threshold
(e.g., K trips, cruising distance, waiting time, …) to reduce executions