This document summarizes MOVI, a model-free approach to dynamic fleet management that optimizes taxi dispatching to reduce passenger wait times, and compares it to a baseline receding horizon control (RHC) approach. MOVI uses a deep Q-network trained with the double DQN algorithm to learn dispatch policies in a distributed, model-free manner. Evaluation on real taxi data shows that MOVI reduces rejection rates and wait times compared to RHC, and its faster computation makes it more practical for real-time dispatching. Future work includes handling partial observability and exploring other reinforcement learning frameworks.
INFOCOM 2018 Talk: MOVI
1. MOVI: A Model-Free Approach
to Dynamic Fleet Management
Takuma Oda and Carlee Joe-Wong
Carnegie Mellon University
IEEE INFOCOM, 19 April 2018, Honolulu, HI
2. Vehicle Dispatch Problem
Optimization of taxi dispatch/cruising
Reduce passengers' waiting time
Increase drivers' revenue
3. Challenges
Real-time (on-demand) operation
Complexity: large state space, demand uncertainty, coordination of a large-scale fleet
Prior work: centralized, model-based approach
F. Miao et al., "Taxi Dispatch With Real-Time Sensing Data in Metropolitan Areas: A Receding Horizon Control Approach," IEEE Trans. Autom. Sci. Eng., vol. 13, no. 2, pp. 463–478, Apr. 2016.
Limited modeling of vehicle dynamics
Computationally intractable for real-time applications
Our work: a distributed, model-free approach
5. Assumptions
All rides are requested with an app
Vehicle state information is available in real time
Requests are rejected if no vehicle is available within a fixed range, e.g., 5 km
6. Approach
Baseline: Receding Horizon Control (RHC)
Our approach: Deep Q-network (DQN)

Policy           RHC                          DQN
Formulation      Deterministic optimization   Reinforcement learning
Coordination     Centralized                  Distributed
Model            Model-based                  Model-free
Discretization   Taxi zone                    Grid
7. RHC Approach
Action: the number of vehicles to send to each region in each time slot
Reward: weighted sum of unserved requests and idle cruising cost
Transition model: leftover vehicles + vehicles sent (net dispatch) + taxis dropping off passengers
8. DQN Approach
Action: where each taxi should go in the next timeslot
Reward model: weighted sum of pickup reward and idle cruising cost
Optimal action-value function: the maximum expected return achievable by any policy
Loss function: squared error between a Bellman-backup target value and the current Q estimate
11. DQN Architecture
Fully convolutional network with auxiliary inputs
Inputs: demand and supply heat maps
Outputs: Q-values for each possible move
12. DQN Training
Algorithm: Double DQN with experience replay
Exploration: epsilon-greedy with an activation rate
[Training curves: average loss and average max Q vs. training step]
13. Performance Comparison Over a Week

                   Reject Rate   Wait Time     Idle Cruising Time
Relative to NO     reduced 76%   reduced 34%   increased by 1.3%
Relative to RHC    reduced 20%   reduced 12%   increased by 4%
14. Discussion
DQN outperforms RHC due to real-time dispatch decisions
DQN forward pass: < 100 ms; RHC computation: a few seconds
DQN is more beneficial for drivers
DQN predicts the best action for each individual vehicle
More realistic to implement in the real world
[Figure: average and minimum vehicle utilization rates]
15. Conclusion
Contribution
Demonstrated the benefits of applying a model-free, distributed solution to a large-scale taxi dispatch problem
Future Work
Partially observable environments
Other reinforcement learning frameworks
21. Q-learning
V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, vol. 518, no. 7540, pp. 529–533, Feb. 2015.
Q-learning algorithm with function approximation:
1. Take some action a_t and observe the transition (s_t, a_t, r_t, s_{t+1})
2. Set target values y_t = r_t + γ max_{a'} Q(s_{t+1}, a'; θ)
3. Perform a gradient descent step on (y_t − Q(s_t, a_t; θ))²
22. Problem Definition
[Architecture diagram: vehicle states and past demands feed the Dispatch Center; a data pre-processing module produces features F_t; a demand prediction module outputs X_{t:t+T} and W_{t:t+T}; the RHC/DQN policy engine computes dispatch decisions a_t; a vehicle/passenger matching module handles incoming requests. Other diagram labels: w_{t-1}, I_t.]
In traditional taxi networks, individual drivers look for passengers hailing on the street, relying on their experience and knowledge.
But this can be inefficient if drivers do not know future demand and are not coordinated.
For instance, suppose there are two vacant taxis on the streets that cruise or are dispatched to the same regions, while customers request rides at other locations. In this case, the dispatch decisions were optimal for neither customers nor drivers: one of the drivers has to spend a lot of time cruising.
Modern ride-hailing fleet networks such as Uber and Lyft can track vehicles' GPS locations and passengers' pickup locations in real time.
This data can be used to predict future passenger demand and vehicle mobility patterns, which enables proactive dispatch of vehicles to predicted future pickup locations.
In this way, optimizing taxi dispatch can reduce passengers' waiting time for a ride and increase drivers' revenue.
There are several challenges in this problem.
For an on-demand ride-hailing application, it needs to be solved in real time.
However, the large state space, uncertain customer demand, and the need for coordination across a large-scale fleet make it difficult to solve efficiently.
Most previous work on fleet management addresses this problem with a model-based approach.
A model-based approach first models vehicle dynamics and interactions with passengers, and then optimally solves the dispatch problem given these models.
Although model-based approaches can improve performance, modeling the complex dynamics of a fleet network is inherently limited, and solving the problem for a large-scale fleet in real time tends to be computationally intractable.
In this work, we propose a model-free, distributed approach to tackle these challenges.
The contributions of this work are to:
Design and evaluate a distributed, model-free approach to the taxi dispatch problem
Compare the model-free, distributed approach with a model-based, centralized approach
Demonstrate the effectiveness of the new approach in a realistic simulated environment
Let me define the problem more precisely.
We assume there are an environment and an agent. The environment consists of vehicles and passengers with a mobile app.
The agent takes an action by dispatching, that is, by sending a vacant taxi to another location.
The agent observes each vehicle's location and availability status and all passengers' pickup requests. Using this real-time information, the agent determines proactive dispatching for vacant vehicles.
Since we focus on optimizing proactive dispatching, we incorporate the matching algorithm between passengers and available vehicles into the environment.
The agent's goal is to optimize sequential dispatch decisions so as to maximize the accumulated reward.
We assume that all rides are requested with a mobile app, so the agent gets pickup and drop-off locations in real time.
Vehicle state information is available in real time, including location, occupancy status, and destination.
Requests are rejected if no vehicle is available within a fixed range; we use 5 km in our experiments.
We use the Receding Horizon Control (RHC) approach as our baseline policy.
It is a centralized, model-based approach formulated as a deterministic optimization problem.
For our approach, we present a distributed, model-free method using a popular reinforcement learning framework, the Deep Q-Network (DQN).
The action variable for the baseline is the number of vehicles to send to each region in each time slot, denoted by u_t.
We wish to choose u_t to maximize the reward, defined as a weighted sum of the number of rejected requests and the vehicles' idle cruising time.
The number of vehicles in the next time slot is computed by the transition model sketched below.
The first term corresponds to the leftover vehicles after pickups.
The second term is the net number of vehicles dispatched to the region.
The last two terms represent occupied vehicles dropping off passengers within time slot t+1.
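Since the slide's equation did not survive extraction, here is a hedged reconstruction of the transition model from the description above; the symbol names are ours, not the slide's:

```latex
x_{t+1,i}
  = \underbrace{\bigl(x_{t,i} - p_{t,i}\bigr)}_{\text{leftover vehicles}}
  + \underbrace{\textstyle\sum_{j} u_{t,ji} - \sum_{j} u_{t,ij}}_{\text{net vehicles sent}}
  + \underbrace{d_{t+1,i}}_{\text{drop-offs}}
```

Here x_{t,i} is the number of idle vehicles in region i at time t, p_{t,i} the served pickups, u_{t,ij} the number of vehicles dispatched from region i to region j, and d_{t+1,i} the occupied vehicles dropping off passengers in region i during slot t+1.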
Assuming the future demand is known, we can find the optimal dispatch actions that maximize the accumulated reward over a horizon of T steps.
At every time step, we solve the RHC problem to determine the next T actions, but execute only the current action.
The first constraint ensures that the total number of vehicles dispatched from the i-th region does not exceed the number of idle vehicles there.
The second constraint ensures that we do not dispatch vehicles to regions with travel times that exceed d_t, so that all dispatch movements complete within a time interval.
For simplicity, we assume that the u_{t,ij} are continuous variables; we can then solve the optimization problem efficiently with linear programming methods, as sketched below.
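As a sketch of the resulting program, written in our own notation (the slide's exact constraints are not fully recoverable), the receding-horizon problem solved at each step t has roughly this shape:

```latex
\max_{u_{\tau,ij} \,\ge\, 0}\;
  \sum_{\tau=t}^{t+T}
    \Bigl( -\, w_1 \cdot \text{(unserved requests at } \tau\text{)}
           \;-\; w_2 \cdot \text{(idle cruising cost at } \tau\text{)} \Bigr)
\quad \text{s.t.} \quad
  \sum_{j} u_{\tau,ij} \le x_{\tau,i} \;\; \forall i, \tau,
\qquad
  u_{\tau,ij} = 0 \;\text{ if the travel time from } i \text{ to } j \text{ exceeds one slot}
```

With the u_{τ,ij} relaxed to continuous values, the objective and constraints are all linear, so an off-the-shelf LP solver applies; only the first action u_t is executed before re-solving at t+1.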
The action variable is where each taxi should go in the next timeslot.
Similar to the baseline, we express the reward function for each vehicle as a weighted sum of a pickup reward and an idle cruising cost.
We would like to learn the optimal action-value function, which is defined as the maximum expected return achievable by any policy.
Since the state space is huge, we use a neural network as a function approximator for Q.
For the loss function we use the mean squared error, with a target value computed by a Bellman backup of the current estimate.
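In the standard DQN formulation of Mnih et al. (cited later in this deck), which matches this description, these quantities are:

```latex
Q^{*}(s,a) = \max_{\pi}\, \mathbb{E}\Bigl[\,\textstyle\sum_{k \ge 0} \gamma^{k} r_{t+k} \,\Big|\, s_t = s,\, a_t = a,\, \pi \Bigr]
\qquad
L(\theta) = \mathbb{E}\bigl[\,(y - Q(s,a;\theta))^{2}\,\bigr],
\quad
y = r + \gamma \max_{a'} Q(s',a';\theta^{-})
```

where θ⁻ denotes the parameters of a periodically updated target network and y is the Bellman-backup target value.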
To evaluate the RHC and DQN policies, we designed and implemented MOVI as a taxi fleet simulator.
This diagram shows the MOVI architecture.
A fleet object simulates the states of all vehicles.
In every time step, MOVI generates ride requests based on real trip records and matches each request to a vehicle with a nearest-neighbor algorithm.
Next, the agent observes the current state of the environment, which includes vehicle and request information.
The agent then computes actions using either the RHC or the DQN policy and sends dispatch orders to idle vehicles.
For each dispatch order, MOVI creates an estimated trajectory to the dispatched location by computing the shortest path in the OSM road network graph.
Finally, all vehicles update their states according to their matching and dispatch assignments.
The dispatch policy is a separate module that does not affect the other simulator modules, so we can compare different dispatch policies under the same settings.
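To make the loop concrete, here is a minimal Python sketch of one simulator step; the object names (fleet, policy, and their methods) are our assumptions, not MOVI's actual API:

```python
import networkx as nx  # the OSM road network as a graph

def simulate_step(fleet, requests, policy, road_graph):
    """One MOVI-style time step: match, observe, dispatch, update."""
    # 1. Match each ride request to the nearest available vehicle.
    for req in requests:
        vehicle = fleet.nearest_available(req.pickup_location, max_km=5.0)
        if vehicle is None:
            req.reject()              # no vehicle within the fixed range
        else:
            vehicle.assign(req)
    # 2. The agent observes vehicles and requests, then picks actions.
    state = fleet.observe()
    dispatch_orders = policy.dispatch(state)   # RHC or DQN policy
    # 3. Route each dispatched idle vehicle along the shortest path.
    for vehicle, destination in dispatch_orders:
        route = nx.shortest_path(road_graph, vehicle.node, destination,
                                 weight="travel_time")
        vehicle.follow(route)
    # 4. Advance all vehicle states by one time step.
    fleet.update()
```

The policy object is the only component that differs between runs, mirroring the modular design described above.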
We used NYC taxi trip records for the experiments
This map shows the regions in our experiment and their geographical demand pattern.
The area is roughly 40 km x 40 km.
We trained the DQN and the other machine learning models with one month of data and evaluated the metrics with one week of data.
The temporal demand patterns of the two periods are roughly similar.
We use a fully convolutional neural network with a 15 x 15 output map.
Each grid cell corresponds to the Q-value of a possible move from the center location.
The fully convolutional design enables faster learning and inference due to the absence of fully connected layers.
Inputs: the state of the environment.
As input features, we use demand and supply heat maps surrounding the agent vehicle as the environment state. This makes the input size independent of the service area.
The larger the input heat maps, the more distant demand the agents can see when making decisions, but the more computationally intensive inference becomes.
To keep the input size small, we also use smoothed heat maps so that agents can easily see information from farther away.
Another key design choice is incorporating other agents' destinations into the input. This mitigates the non-stationarity of the environment because an agent can learn its optimal action conditioned on the other agents' current actions.
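As a concrete illustration, here is a minimal PyTorch sketch of such a fully convolutional Q-network; the channel count and layer sizes are our assumptions, not the paper's actual architecture:

```python
import torch
import torch.nn as nn

class DispatchFCN(nn.Module):
    """Fully convolutional Q-network sketch: heat maps in, Q-value map out."""

    def __init__(self, in_channels: int = 4):  # demand/supply map channels (assumed)
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(in_channels, 32, kernel_size=5, padding=2), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 64, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(64, 1, kernel_size=1),  # one Q-value per spatial cell
        )
        # Reduce whatever spatial size the input heat maps have to the
        # 15 x 15 map of candidate moves around the vehicle.
        self.head = nn.AdaptiveAvgPool2d((15, 15))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, channels, H, W) -> (batch, 15, 15) Q-value map
        return self.head(self.body(x)).squeeze(1)
```

Because there are no fully connected layers, the parameter count stays small and inference remains fast regardless of the input map size.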
We trained the DQN with the double DQN algorithm and experience replay.
The network weights and replay memory are shared among all agents.
We customized the epsilon-greedy exploration method by adding an activation rate, which controls the probability of moving versus staying. We found that this contributes to stable and faster training.
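Below is a minimal sketch of the double-DQN target and the activation-rate exploration described here; the tensor shapes and helper names are assumptions, not the paper's code:

```python
import random
import torch

def double_dqn_target(reward, next_state, online_net, target_net, gamma=0.99):
    """Double DQN: the online net picks the action, the target net scores it."""
    with torch.no_grad():
        q_online = online_net(next_state).flatten(1)        # (batch, 15*15)
        best_action = q_online.argmax(dim=1, keepdim=True)  # online net's choice
        q_target = target_net(next_state).flatten(1)
        return reward + gamma * q_target.gather(1, best_action).squeeze(1)

def select_action(q_map, epsilon=0.1, activation_rate=0.8):
    """Epsilon-greedy over moves, gated by an activation rate for move-vs-stay."""
    if random.random() > activation_rate:
        return None                                  # stay: no dispatch this step
    if random.random() < epsilon:
        return random.randrange(q_map.numel())       # explore: random move
    return int(q_map.argmax())                       # exploit: best move
```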
These graphs show the training curves of the average loss and the average max Q during training.
The max Q value starts decreasing after it reaches 100. This can be explained by environmental changes due to increased competition among agents; it also indicates that coordination emerges in a distributed manner.
We ran simulations with the DQN policy, the RHC policy, and a no-dispatch policy, and calculated three metrics for each day of the week: reject rate, passenger wait time, and idle cruising time.
All simulations were run with a 1-minute time step and 8,000 vehicles.
On every day of the week, the DQN policy significantly reduces the reject rate and the wait time compared to no dispatching, while the idle cruising time stays almost the same.
Compared with RHC, DQN reduces the reject rate by 20% and the wait time by 12%.
Although the DQN policy does not make coordinated decisions for idle vehicles, our results show that DQN performs better than RHC.
We attribute this to DQN's faster, distributed dispatch decisions: a DQN forward pass takes less than 100 ms, while the RHC computation takes a few seconds and grows with the number of regions.
To investigate the effect of the on-demand, distributed nature of DQN, we simulated a "batch" version of the DQN policy. The results, plotted as DQN*, show that the batch DQN policy performs almost the same as RHC. This indicates that the faster, on-demand computation of DQN contributes to its rapid adaptation to the environment state.
Another interesting feature of the DQN policy is that it is more beneficial for drivers because it predicts the best action for each individual vehicle. The figure shows the average and minimum utilization rates over all vehicles: DQN achieves a better minimum utilization rate than RHC. Thus, the DQN policy may be more realistic to implement in real-world applications.
The utilization rate is strongly related to driver revenue.
Let me conclude our work.
Regarding our contributions:
There are several possible extensions of our work.