Liyuan Lin received his M.S. degree from the University of Science and Technology of China in 2024. His research mainly focuses on two-sided markets and opaque selling.
Graphical Abstract
The fundamental processes of a community group buying platform.
Abstract
Motivated by the business model called “community group buying” (CGB), which has emerged in China and some countries in Southeast Asia, such as Singapore and Indonesia, we develop algorithms that help CGB platforms match consumers with stage-stations (the pick-up centers under the CGB model). By altering the fundamental design of existing hierarchy algorithms, improvements are achieved: our method is proven to run faster and to be more space efficient. Our algorithms avoid traversal and compress the time complexities of matching a consumer with a stage-station and of updating the storage information to O(logM) and O(MlogG), respectively, where M is the number of stage-stations and G is the number of the platform’s stock-keeping units. Simulation comparisons of our algorithms with the current methods of CGB platforms show that our approaches can effectively reduce delivery costs. An interesting observation from the simulations is worth noting: increasing G may incur higher costs, since it makes inventories more dispersed and the delivery problem more complicated.
Public Summary
We develop algorithms that help community group buying platforms compress the time complexities of matching a consumer with a stage-station and of updating the storage information.
With simulations, we compare our algorithms with the current methods of community group buying platforms and find that our approaches can effectively reduce delivery costs.
With more stock-keeping units, a community group buying platform may bear higher delivery costs.
Multi-agent path finding (MAPF) is a central problem in multi-agent planning, whose goal is to plan paths for multiple agents under the constraint that agents must avoid colliding with obstacles and with each other[1]. MAPF arises in many practical scenarios, such as automated warehouses[2, 3], airplane taxiing[4, 5], and automated guided vehicles (AGVs)[6, 7]. MAPF is challenging because computing optimal solutions is NP-hard[8], and when the size of the environment, the number of agents, and the density of obstacles increase sharply, a large number of potential path conflicts may occur.
Generally, there are two major categories of methods for multi-agent systems: centralized and decentralized. Centralized methods apply a single learner to discover joint solutions (team learning), while decentralized methods use multiple simultaneous learners, usually one per agent (concurrent learning)[9]. The same classification exists for MAPF[1]. In centralized MAPF, all agents’ partial observations are collected to generate collision-free paths for them, while in decentralized MAPF, each agent makes decisions independently according to its own local observation, without needing the states of all other agents. As the number of agents, the world size, and the obstacle density grow, computational complexity becomes a major concern, especially for centralized methods. Therefore, in this paper, we focus on decentralized methods, where effective and efficient communication between agents is crucial. Moreover, the information used for communication needs to be carefully encoded, which requires not only effective encoding of local observations but also an effective combination with historical state information. Therefore, we use convolutional neural networks (CNNs) with residual structures to encode local observations and gated recurrent unit (GRU) cells to integrate historical information.
In the past few decades, decentralized solutions to this problem have attracted extensive interest. Among traditional methods, the reactive approach is an intuitive and widely adopted baseline. Take local repair A* (LRA*)[10], for example, in which each agent searches for its own path to its goal using the A* algorithm, ignoring all other agents except for its current neighbors. Obviously, many conflicts will arise, so handling conflicts effectively is the key task of reactive methods: whenever a collision is about to occur, the agent avoids it by recalculating the remainder of its route while treating the occupied grids as blocked. In addition to reaction-based methods, there are also conflict-based methods (CBS[11], MA-CBS[12], ECBS[13], and ICBS[14]). In general, CBS[11] is a two-level search algorithm: at the low level, it calculates the path of each agent independently, and at the high level, it detects conflicts between agent pairs and resolves each conflict by splitting the current solution into two related subproblems, each of which involves replanning a single agent. By recursively splitting subproblems to resolve conflicts, a search tree is implicitly defined. The high-level search explores this tree in a best-first manner and terminates when a conflict-free leaf is expanded.
In recent years, there has been seminal work using deep architectures to automatically find optimal or suboptimal solutions to planning problems. These approaches range from supervised learning to deep reinforcement learning, and their structures include CNNs[15, 16], LSTMs[17], and GNNs[18, 19]. Among the previous approaches, some works (for example, PRIMAL[15] and PRIMAL2[16]) did not consider communication among agents necessary and made no effort to design communication structures, while others (for example, LSTM[17], GNN[18], and GAT[19]) regarded communication as essential and devoted considerable effort to building communication structures that allow agents to share their state information.
More importantly, these previous works did not distinguish the importance of different pieces of local information, so we do so in order to extract more expressive features. Inspired by MAAC (multi-actor-attention-critic), which uses an attention mechanism in multi-agent reinforcement learning to select relevant information for each agent at every time step, we use an attention mechanism to share neighbors’ information for each agent. Reinforcement learning methods also face the problem of solving long-horizon tasks with sparse rewards[20]; especially when the environment is very large, the training process becomes inefficient. To speed up training, some previous works used imitation learning to overcome the cold-start dilemma in reinforcement learning[15, 16], while curriculum learning over environment settings is also sensible, since easy tasks provide inspiration for hard ones, much as humans explore complex tasks step by step starting from easy ones[16].
The major contributions of this paper are as follows:
(ⅰ) To solve the multi-agent path finding (MAPF) problem, a new deep reinforcement learning model with local attention cooperation, called local attention cooperation reinforcement learning (LACRL), is proposed in this work.
(ⅱ) We build the local observation encoder using a residual attention CNN to extract local observations and use the transformer architecture to build an interaction layer that combines the local observations of agents.
(ⅲ) To overcome the deficiencies of the success rate, we also design a new evaluation index, namely the extra time rate (ETR). The experimental results show that our model is superior to most of the previous models in terms of success rate and extra time rate.
2.
Related works
Classical path planning methods. Generally, MAPF approaches fall into two major categories: coupled (centralized) and decoupled (decentralized). Centralized approaches struggle when the size of the world and the number of agents grow sharply, because the search space becomes enormous. Almost all traditional methods tend to be centralized or coupled, using complete information about all agents and the world to plan paths globally. Among the traditional methods, the most famous and widely used ones are search-based (for example, LRA* and CBS). A*-based algorithms (for example, LRA* and WHCA*) rely on complete observations and use A* to calculate full paths for each agent, and they can work in both a centralized (CBS, WHCA, ORCA) and a decentralized manner[6]. Within A*-based methods, the reactive trick is intuitive and widely adopted, so some works treat reaction-based methods as a separate branch of approaches. Take LRA*, for example, in which each agent searches for its own path to its goal using A*, ignoring all other agents except for its current neighbors. Obviously, many conflicts will arise, so handling conflicts effectively is the key task of reactive methods: whenever a collision is about to occur, the agent avoids it by recalculating the remainder of its route while treating the occupied grid cells as blocked. Besides search-based methods, there is also conflict-based search and its variants (CBS, MA-CBS, ECBS, and ICBS), which plan for each single agent and construct a set of constraints to find optimal or near-optimal solutions without exploring high-dimensional joint spaces. Among conflict-based approaches, CBS is the most original and influential method; it is a two-level search algorithm. At the low level, CBS calculates the path of each agent independently, and at the high level, it detects conflicts between agent pairs and resolves each conflict by splitting the current solution into two related subproblems, each of which involves replanning a single agent. By recursively splitting subproblems to resolve conflicts, a search tree is implicitly defined. The high-level search explores this tree in a best-first manner and terminates when a conflict-free leaf is expanded.
In addition to search-based and conflict-based methods, there are many other techniques for the problem, such as direction maps[21], subgraph structures[22], flow annotation replanning[23], increasing cost tree search (ICTS)[24–26], satisfiability[27–30], integer linear programming[31–33], and answer set programming[34]. In the direction map method[21], a direction map (DM) stores information about the directions in which agents have traveled in each portion of the map. During planning, agents then use this information, with movement that runs counter to the stored pattern incurring additional penalties, thus encouraging agents to move more consistently throughout the environment. In the subgraph structure method[22], a new abstraction is proposed to plan more effectively; its key idea is to divide the map into subgraphs with known structures. These known subgraphs have entry and exit constraints and can be represented compactly, so planning becomes a search in a smaller subgraph configuration space. Once an abstract plan is found, it can be quickly decomposed into a correct (but possibly suboptimal) concrete plan without further search. ICTS[24–26] interleaves two search processes: the first, the high-level search, aims to find the sizes of the agents’ single-agent plans in an optimal solution of the given MAPF problem; the second, the low-level search, accepts a vector of plan sizes and verifies whether a valid solution of those sizes exists for the given MAPF problem. Finally, some methods transform the MAPF problem into a different problem for which good solvers exist, such as satisfiability[27–30], integer linear programming[31–33], and answer set programming[34].
In fact, solving MAPF optimally is NP-hard. Although significant progress has been made in reducing the amount of computation, the scalability of these methods in environments with a large number of potential path conflicts is still poor. To reduce the number of hand-tuned parameters and address sensing uncertainty, some researchers have proposed learning-based methods to solve the planning problem[20].
Learning-based methods. Thanks to the rapid development of deep learning in recent years, learning-based methods are considered a promising direction for path planning. Reinforcement learning has long been a powerful tool for planning problems, and it can solve the path planning problem by letting the agent complete the task through repeated trial and error. ORCA[35] adjusts the agents’ speeds and directions online to avoid collisions. On top of separately planned single-agent paths, recent works have focused on reinforcement-learning-based obstacle avoidance. Refs. [15, 16] proposed a hybrid learning-based method called PRIMAL for multi-agent path finding that integrates imitation learning and multi-agent reinforcement learning; there are also methods that combine reinforcement learning with evolutionary ideas. The approaches in Refs. [15, 16] did not consider inter-robot communication and thus did not exploit the full potential of the decentralized system[20]. In other words, communication is important, especially for decentralized approaches, and it is difficult to estimate the intentions of neighboring decision-making agents without it. Ref. [18] used a CNN to extract adequate features from local observations and a graph neural network (GNN) to transfer these features between all agents; the model is trained offline by imitating an expert algorithm and is then used online for decentralized planning involving only local communication and local observation. Ref. [19] argued that a vanilla GNN relies on a simple message aggregation mechanism, which hinders the agent from prioritizing important information; they therefore extended the previous work from vanilla GNNs to graph attention networks (GAT). Their message-aware graph attention network (MAGAT) is based on a key-query-like mechanism that determines the relative importance of features in the messages received from various neighboring robots. Ref. [17] proposed a decentralized multi-agent collision avoidance algorithm based on deep reinforcement learning, which introduces an LSTM-based strategy that enables the algorithm to use observations of an arbitrary number of other agents, instead of the fixed observation size of previous methods.
These previous works did not distinguish the importance of different pieces of local information, so we do so in order to extract more expressive features for the final decision process. Moreover, the approaches in Refs. [15, 16] have no communication between agents and therefore do not exploit the full potential of the decentralized system. Our design also differs from Refs. [18, 19], which use a graph structure to transfer features between all agents: in our approach, agents exchange information only when they are very close to each other. Furthermore, when the world is large and the number of agents is high, building graphs over and aggregating information from all agents takes considerable time. Therefore, in large-scale environments, local communication saves a lot of time and allows quick responses.
In terms of training methods, we use neither imitation learning[15, 16, 18, 19] nor guidance paths[17, 20]. We rely on the RL framework alone and design all rewards to shape the path exploration process. We believe that although these training tricks can speed up the agents’ exploration of the environment, they can also strongly bias the quality of the learned policy. Although we do not use the above training tricks, we still attempt to improve exploration efficiency: inspired by the curriculum learning used in imitation learning, we divide the training process into several stages, starting from relatively simple tasks and moving to more complex ones.
3.
Local attention cooperation reinforcement learning
3.1
Preliminary
We model MAPF in the Markov decision process (MDP) framework, which converts MAPF into a sequential decision-making problem in which each agent must choose an action at each time t with two goals: reach its goal quickly and make every effort to avoid collisions with other agents.
Environment. Consider a 2-dimensional discrete environment E ⊂ R² of size W×H with cells C ∈ E. Each cell C has three possible states: free, busy (occupied by some agent), and disabled (occupied by an obstacle). There is a set of N_o obstacles, C_o = {o_1, ..., o_{N_o}}, where o_i ∈ E denotes the position of the ith obstacle, and a set of N_A agents A = {A_1, ..., A_{N_A}}, where A_i ∈ E denotes the ith agent, which can travel between free cells; the free cells of the world are C_a = E ∖ O, where O is the set of all obstacles. For the ith agent, there is a unique start position c_i^s and a unique goal position c_i^g, with c_i^s ∈ C and c_i^g ∈ C. Our goal is to find a decentralized planner that plans the ith agent’s motion path from its local observation and its neighboring agents’ states. In addition to quickly finding the scheduling path, it is also important to ensure the safety of the solution, that is, to avoid conflicts between agents as much as possible.
State structure. We assume that every agent observes the environment (W×H) only partially: each agent observes the grid world within a limited field of view (FOV), whose radius is denoted R_FOV and set to 9 in our experiments. As mentioned in Ref. [15], partial observation of the world is an important step toward deploying robots in the real world. Only in closed and small scenarios is a complete map of the environment available (e.g., automated warehouses), in which case agents can be trained with a sufficiently large FOV to fully observe the system. Each agent in the environment perceives its local surroundings, and this perceived information plays an important role in planning the agent’s motion. In detail, at each time step t, agent A_i takes its local observation, which contains the obstacle locations L_i^o, the locations of adjacent agents L_i^a, the goal location of the current agent (if the goal lies within the FOV) L_i^g, and the goals of other agents L_i^{og}. The local observation of agent A_i can be formally expressed as L_i^{ob} = [L_i^o; L_i^a; L_i^g; L_i^{og}].
Additionally, assuming a fixed FOV, the strategy can be extended to any world size, which also helps reduce the input dimension of the neural network and the model complexity[15]. The local state structure of each agent is shown in Fig. 1.
Figure
1.
State structure of each agent (here, for the red agent). Agents are represented as squares, and their goals are represented as solid circles of the same color. For the current (red) agent, its local observation contains four channels: the positions of obstacles, the positions of nearby agents, the goal positions of these nearby agents, and the position of its own goal (if it lies in the FOV). In addition to the current (red) agent’s state, the states of other nearby agents are also required (here, the yellow and blue agents, for example).
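As a concrete illustration of this four-channel observation, the following sketch shows how such a window could be assembled from integer grids; the function name build_local_observation and the per-cell encoding are our own assumptions for illustration, not code from the paper.

```python
import numpy as np

def build_local_observation(obstacle_map, agent_positions, goal_positions,
                            agent_id, r_fov=9):
    """Crop a (2*r_fov+1)-sized window around agent_id and stack the four
    channels described above: obstacles, nearby agents, those agents' goals,
    and the agent's own goal (only if it lies inside the FOV)."""
    h, w = obstacle_map.shape
    size = 2 * r_fov + 1
    obs = np.zeros((4, size, size), dtype=np.float32)
    cy, cx = agent_positions[agent_id]

    def to_local(y, x):
        ly, lx = y - cy + r_fov, x - cx + r_fov
        return (ly, lx) if 0 <= ly < size and 0 <= lx < size else None

    # Channel 0: obstacles inside the FOV.
    for y in range(max(0, cy - r_fov), min(h, cy + r_fov + 1)):
        for x in range(max(0, cx - r_fov), min(w, cx + r_fov + 1)):
            if obstacle_map[y, x]:
                obs[0][to_local(y, x)] = 1.0
    # Channels 1 and 2: nearby agents and their goals (goals only if visible).
    for j, (ay, ax) in enumerate(agent_positions):
        if j == agent_id:
            continue
        loc = to_local(ay, ax)
        if loc is not None:
            obs[1][loc] = 1.0
            goal_loc = to_local(*goal_positions[j])
            if goal_loc is not None:
                obs[2][goal_loc] = 1.0
    # Channel 3: the agent's own goal, only if it is inside the FOV.
    own_goal = to_local(*goal_positions[agent_id])
    if own_goal is not None:
        obs[3][own_goal] = 1.0
    return obs
```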
Communications. Since agents do not know the global positions of the others, communication (or information sharing) between agents is essential to complete the cooperative task. Addressing what information should be sent to whom, and when, is crucial to solving the task effectively[19]. We design a communication block that takes the local observations of the agent and its neighboring agents as input and outputs a tensor condensing the information needed for decision-making. At time t, agent A_i is adjacent to a set of agents denoted N_i^t, and the local observations of agent A_i and its neighbors j ∈ N_i^t can be formally expressed as I_i = {L_j^{ob} | j ∈ N_i^t}, while the output tensor is denoted S_i.
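A minimal way to obtain the neighbor set N_i^t is sketched below; we assume here, since the paper does not spell it out, that "adjacent" means lying within the agent's FOV.

```python
def neighbor_agents(agent_positions, agent_id, r_fov=9):
    """Indices of the agents lying inside agent_id's FOV (Chebyshev distance
    at most r_fov); a simple stand-in for the neighbor set N_i^t."""
    cy, cx = agent_positions[agent_id]
    return [j for j, (y, x) in enumerate(agent_positions)
            if j != agent_id and max(abs(y - cy), abs(x - cx)) <= r_fov]
```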
Action space. We cast the MAPF problem as a sequential classification problem that selects an action from the action space at each time step. In our experiments, we consider a 4-connected grid environment, which means the agent can take 5 discrete actions in the grid world: moving one cell in one of the four cardinal directions or staying still. However, if the target cell is already occupied by another agent or an obstacle, the agent cannot move and stays in its current position.
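A small sketch of these action semantics follows; the action ordering and the helper apply_action are illustrative choices, not the paper's code.

```python
# One possible ordering of the five discrete actions: stay, up, down, left, right.
ACTIONS = {0: (0, 0), 1: (-1, 0), 2: (1, 0), 3: (0, -1), 4: (0, 1)}

def apply_action(pos, action, obstacle_map, occupied):
    """Move one cell in a 4-connected grid; if the target cell is off the grid,
    an obstacle, or occupied by another agent, the agent stays where it is."""
    dy, dx = ACTIONS[action]
    y, x = pos[0] + dy, pos[1] + dx
    h, w = obstacle_map.shape
    if not (0 <= y < h and 0 <= x < w):
        return pos                      # off the grid: stay
    if obstacle_map[y, x] or (y, x) in occupied:
        return pos                      # blocked: stay
    return (y, x)
```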
Reward design. The goal of MAPF is to reach the target position in as few steps as possible while avoiding collisions with obstacles and other agents. Therefore, a step penalty r_step is needed: each step incurs a small penalty that pushes the agent to reach its goal quickly. In addition, a collision penalty r_collision, given when the agent collides with obstacles or other agents, is also important and is slightly larger than the step penalty. To encourage exploration, we penalize waiting slightly more than moving when the agent has not yet reached its goal; a similar training trick is used in Ref. [20]. Since the waiting penalty is slightly larger than the moving penalty, agents tend to oscillate between adjacent cells. To avoid this, we introduce a swing penalty r_swing, applied when an agent returns to the position it came from in the previous step. Finally, when the agent reaches its goal, the goal-reaching reward r_goal is given. The detailed values of these reward components used in our experiments are listed in Table 1.
Table
1.
Reward design. A step penalty r_step pushes the agent to reach its goal quickly. Inspired by Refs. [15, 16, 20], we subdivide the step penalty into a move penalty and a wait penalty; the waiting penalty (e.g., −0.5) is slightly larger than the moving penalty (e.g., −0.3), which encourages the agent to explore the environment. Since the waiting penalty is slightly larger than the moving penalty, we introduce a swing penalty r_swing to prevent the agent from oscillating. In addition, a collision penalty r_collision is given when the agent collides with obstacles or other agents. Finally, when the agent reaches its goal, the goal-reaching reward r_goal is given.
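The reward shaping described above could be sketched as follows; the move/wait penalties are the example values from Table 1, while the collision, swing, and goal values below are placeholders rather than the paper's exact numbers.

```python
R_MOVE, R_WAIT = -0.3, -0.5                      # example values from Table 1
R_COLLISION, R_SWING, R_GOAL = -2.0, -0.5, 20.0  # placeholders, not the paper's numbers

def step_reward(moved, collided, swung, reached_goal):
    """Combine the reward components described above for a single time step."""
    if reached_goal:
        return R_GOAL
    r = R_MOVE if moved else R_WAIT   # waiting is penalized slightly more than moving
    if collided:
        r += R_COLLISION              # collision with an obstacle or another agent
    if swung:
        r += R_SWING                  # returned to the cell it came from last step
    return r
```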
In this section, we explain how agents produce actions from their inputs, present the overall model architecture, and then describe the specific network architecture of each module. The whole model architecture is shown in Fig. 2. First, each agent observes its local environment and passes this observation through a local attention encoder to obtain a representation of its local observation. Then, the agent communicates with its neighboring agents to cooperate; a transformer architecture is used to exchange the local observations, so the communication block can also be called the interaction layer. Finally, the tensor output by the communication block is fed into a decision block, which acts as the learned policy and estimates the optimal action at the current time step.
Figure
2.
Model overview. The input observations consist of the local observation of the target agent to be planned (red agent) and those of other agents near it (blue agent, yellow agent, etc.). Our deep reinforcement learning framework has three components: the observation encoder, the communication block, and the decision block. Finally, the estimated optimal action from the policy network is the output of the model.
Local attention encoder (LA-Encoder). Since the purpose of this module is to extract and encode the local observations, it can also be regarded as the feature layer. In detail, at each time step t, agent A_i takes its local observation L_i^t, which is formally expressed as
L_i^t = [L_i^o; L_i^a; L_i^g; L_i^{og}],
(1)
where L_i^o is the obstacle location information, L_i^a is the location of adjacent agents, L_i^g is the goal location of the current agent (if the goal lies within the FOV), and L_i^{og} contains the goals of other agents.
In addition to the local observation L_i^t, it is important to introduce the local guide direction v_i^t, whose purpose is to point toward the target location, especially when the grid world is large and the target lies outside the FOV. At time t, the cell of agent A_i is C_i^t and the cell of its target is C_i^{goal}; the direction vector for local guidance, v_i^t, is computed as follows:
v_i^t = \frac{C_i^{goal} - C_i^t}{|C_i^{goal} - C_i^t|},
(2)
where C_i^t is the location of agent A_i at time t, C_i^{goal} is the cell of its target location, and |·| denotes the modulus of a vector.
Using the local observation L_i^t and the local guide vector v_i^t, the LA-Encoder outputs an intermediate representation h_i^t:
h_i^t = \mathrm{LA\text{-}Encoder}(L_i^t, v_i^t).
(3)
The whole structure of the LA-Encoder is shown in Fig. 3. The LA-Encoder differs from previous works[6, 15, 16, 18–20, 36], which only designed deep convolutional neural networks to extract features and did not weight the importance of local information; this limited the expressiveness of the extracted features and made the subsequent training of the model more difficult.
Figure
3.
The structure of the LA-Encoder. First, the local observation L_i^t is convolved three times with 1×1 kernels to obtain the K, Q, and V tensors. Then, an attention map is computed on each channel and used as a weight to fully extract the local information of that channel. Finally, the local features extracted by the attention mechanism are combined with the local guide vector v_i^t to obtain the intermediate representation.
As shown in Fig. 3, the local observation L_i^t of agent A_i is fed into a carefully designed layer that uses an attention mechanism to emphasize important information. To apply the attention mechanism, the local observation L_i^t is convolved three times with 1×1 kernels to obtain the K, Q, and V tensors. Note that the attention computation is carried out per channel, because the values of different channels lie in different ranges and have different meanings. The calculated attention map is therefore a group of per-channel attention maps, each used as a weight to fully extract the local information of its channel. Finally, the local features extracted by the attention mechanism are combined with the local guide vector v_i^t to obtain the intermediate representation h_i^t, which contains the local state and target information and is used in the subsequent planning.
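A PyTorch sketch of such a per-channel attention encoder is given below. The layer sizes, the scaling factor, and the way the guide vector is fused are our assumptions for illustration; only the overall structure (1×1 convolutions producing Q, K, and V, attention computed independently per channel, and fusion with v_i^t) follows the description above.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LAEncoder(nn.Module):
    """Per-channel attention over the local observation: 1x1 convolutions give
    Q, K and V, one spatial attention map is computed per channel, and the
    attended features are fused with the guide vector v_i^t."""

    def __init__(self, in_channels=4, fov=19, hidden=128, out_dim=128):
        super().__init__()
        self.q = nn.Conv2d(in_channels, in_channels, kernel_size=1)
        self.k = nn.Conv2d(in_channels, in_channels, kernel_size=1)
        self.v = nn.Conv2d(in_channels, in_channels, kernel_size=1)
        self.fc_obs = nn.Linear(in_channels * fov * fov, hidden)
        self.fc_out = nn.Linear(hidden + 2, out_dim)   # +2 for the guide vector

    def forward(self, local_obs, guide_vec):
        # local_obs: (B, C, H, W); guide_vec: (B, 2), the normalized direction of Eq. (2).
        b, c, h, w = local_obs.shape
        q = self.q(local_obs).view(b, c, h * w)
        k = self.k(local_obs).view(b, c, h * w)
        v = self.v(local_obs).view(b, c, h * w)
        # One (HW x HW) attention map per channel.
        attn = torch.softmax(q.unsqueeze(-1) @ k.unsqueeze(-2) / (h * w) ** 0.5, dim=-1)
        attended = (attn @ v.unsqueeze(-1)).squeeze(-1)          # (B, C, HW)
        feat = F.relu(self.fc_obs(attended.reshape(b, -1)))
        return F.relu(self.fc_out(torch.cat([feat, guide_vec], dim=-1)))
```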
Transformer-based communication block (TC-Block). To enable agents to perceive the states of adjacent agents, we design a local cooperation module based on the transformer architecture, namely the transformer-based communication block (TC-Block). A given agent A_i communicates with its neighbors N_i^t and shares their intermediate representations h_j^t (j ∈ N_i^t). The TC-Block then outputs the comprehensive representation s_i^t, which is fed into the decision block to estimate the optimal action at time t:
s_i^t = \mathrm{TC\text{-}Block}\big(h_i^t, \{h_j^t\}_{j \in N_i^t}\big),
(4)
where h_i^t is the intermediate representation of agent A_i from the LA-Encoder at time t, and N_i^t is the set of agents adjacent to agent A_i at time t.
The structure of the TC-Block is shown in Fig. 4. To combine the local states of the current agent and its neighbors, the TC-Block takes the features obtained from the LA-Encoder as input; these encoded agent features can be regarded as a sequence and form part of the input. In addition, we need to embed the position of each feature tensor, and we use relative coordinates as the position information. For example, if the current agent A_i is in cell C_i^t and a neighbor A_j is in cell C_j^t, then the relative position C_j^t − C_i^t serves as the location information for the position embedding. Inspired by Ref. [37], which uses the transformer for image classification, we embed the position information of the X coordinate and the Y coordinate separately, each with size D/2, half of the position embedding dimension. Then, based on the relative coordinates, we concatenate the X and Y embeddings to obtain the final positional embedding.
Figure
4.
The structure of the transformer-based communication block (TC-Block). The TC-Block takes the features obtained from the LA-Encoder as input and regards them as a sequence. In addition, the position of each feature tensor is embedded, with relative coordinates used as the position information. Inspired by Ref. [37], which uses the transformer for image classification, we embed the position information of the X coordinate and the Y coordinate separately.
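A sketch of such a communication block is shown below, assuming learned X and Y embedding tables of size D/2 indexed by the relative offsets and a standard PyTorch transformer encoder; the layer and head counts are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class TCBlock(nn.Module):
    """Transformer encoder over the features of the current agent and its
    neighbors, with a positional embedding built from relative offsets: two
    learned tables (X and Y), each of size D/2, concatenated."""

    def __init__(self, dim=128, n_heads=4, n_layers=2, r_fov=9):
        super().__init__()
        self.r_fov = r_fov
        self.x_emb = nn.Embedding(2 * r_fov + 1, dim // 2)
        self.y_emb = nn.Embedding(2 * r_fov + 1, dim // 2)
        layer = nn.TransformerEncoderLayer(d_model=dim, nhead=n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)

    def forward(self, features, rel_pos):
        # features: (B, 1 + n_neighbors, D), current agent first;
        # rel_pos:  (B, 1 + n_neighbors, 2) long tensor of offsets in [-r_fov, r_fov].
        idx = rel_pos + self.r_fov
        pos = torch.cat([self.x_emb(idx[..., 0]), self.y_emb(idx[..., 1])], dim=-1)
        out = self.encoder(features + pos)
        return out[:, 0]        # the current agent's slot is taken as s_i^t
```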
Decision block. Based on the comprehensive representation s_i^t from the communication block, the decision block outputs the action policy for agent A_i at time t:
a_i^t = \mathrm{DecisionBlock}(s_i^t).
(5)
We choose SAC[38] to train our model because of its stability and exploration ability; we also compared it with other algorithms (e.g., VPG, TRPO, PPO, DDPG, and TD3), but they did not achieve the same performance as SAC. It is very hard to prove which of the popular DRL algorithms, such as VPG, TRPO, PPO, DDPG, TD3, and SAC, is best, and a full comparison is not the focus of this work; we do not claim that SAC is the best one, only that it appears to be well suited to our environment setting. SAC has a parameterized state value function V_ψ(s_t), a soft Q function Q_θ(s_t, a_t), and a policy function π_ϕ, where {ψ, θ, ϕ} are their parameters. SAC is still based on the actor-critic architecture, in which the value function and the Q function are related and can be regarded as the critic network. The soft value function is trained to minimize the squared residual error:
J_V(\psi) = \mathbb{E}_{s_t \sim \mathcal{D}}\Big[\tfrac{1}{2}\big(V_\psi(s_t) - \mathbb{E}_{a_t \sim \pi_\phi}[Q_\theta(s_t, a_t) - \log \pi_\phi(a_t \mid s_t)]\big)^2\Big],
(6)
where a_t is obtained from the current policy at the current state s_t, \mathcal{D} is the experience replay buffer, and −log π_ϕ(a_t|s_t) is the entropy term of the policy. The soft Q function parameters are trained to minimize the MSE loss between the Q value and the target Q̂ value:
J_Q(\theta) = \mathbb{E}_{(s_t, a_t) \sim \mathcal{D}}\Big[\tfrac{1}{2}\big(Q_\theta(s_t, a_t) - \hat{Q}(s_t, a_t)\big)^2\Big],
(7)
where (s_t, a_t) is sampled from the experience replay buffer \mathcal{D}, and the target Q̂ value is related to the state value V by
\hat{Q}(s_t, a_t) = R(s_t, a_t) + \gamma\, \mathbb{E}_{s_{t+1}}\big[V_{\bar{\psi}}(s_{t+1})\big],
(8)
where V_ψ̄ is the target value network, which reduces sample correlation; its parameter ψ̄ is updated in a smoothed manner:
\bar{\psi} \leftarrow \tau\psi + (1 - \tau)\bar{\psi},
(9)
where τ is the smoothing coefficient.
Finally, the policy parameters ϕ can be learned by directly minimizing the expected KL divergence:
J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D}}\Big[D_{\mathrm{KL}}\Big(\pi_\phi(\cdot \mid s_t) \,\Big\Vert\, \frac{\exp\big(Q_\theta(s_t, \cdot)\big)}{Z_\theta(s_t)}\Big)\Big].
(10)
In SAC[38], the actor network π_ϕ is updated by minimizing this KL divergence. SAC uses the reparameterization trick: a′_t is not taken from the experience replay buffer but is re-predicted by the current policy network. The final loss function is obtained as follows:
J_\pi(\phi) = \mathbb{E}_{s_t \sim \mathcal{D},\, \epsilon_t \sim \mathcal{N}}\big[\log \pi_\phi(a'_t \mid s_t) - Q_\theta(s_t, a'_t)\big],
(11)
where ε_t is noise sampled from a Gaussian distribution[38]. The overall procedure is shown in Algorithm 1.
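For illustration, the losses of Eqs. (6), (7), and (11) and the smoothing update of Eq. (9) can be sketched as follows for the discrete 5-action case, where the expectations over actions are computed exactly from the categorical policy; this is our adaptation for a discrete action space, not necessarily the authors' exact implementation.

```python
import torch
import torch.nn.functional as F

def sac_losses(policy_net, q_net, v_net, v_target_net, batch, gamma=0.95):
    """Losses of Eqs. (6), (7) and (11) for a discrete 5-action policy; the
    expectations over actions are taken exactly under the categorical policy."""
    s, a, r, s_next = batch                     # states, actions (long), rewards, next states

    log_pi = F.log_softmax(policy_net(s), dim=-1)   # (B, 5)
    pi = log_pi.exp()
    q_all = q_net(s)                                # (B, 5): Q_theta(s, .)

    # Eq. (6): V(s) regressed onto E_a[Q(s, a) - log pi(a|s)].
    v_backup = (pi * (q_all - log_pi)).sum(dim=-1).detach()
    v_loss = 0.5 * F.mse_loss(v_net(s).squeeze(-1), v_backup)

    # Eqs. (7)-(8): soft Q loss with the target value network evaluated at s_{t+1}.
    with torch.no_grad():
        q_backup = r + gamma * v_target_net(s_next).squeeze(-1)
    q_taken = q_all.gather(1, a.unsqueeze(-1)).squeeze(-1)
    q_loss = 0.5 * F.mse_loss(q_taken, q_backup)

    # Eq. (11): E_a[log pi(a|s) - Q(s, a)] under the current policy.
    pi_loss = (pi * (log_pi - q_all.detach())).sum(dim=-1).mean()
    return v_loss, q_loss, pi_loss

def soft_update(target_net, net, tau=0.005):
    """Eq. (9): Polyak averaging of the target value network (tau is illustrative)."""
    for p_t, p in zip(target_net.parameters(), net.parameters()):
        p_t.data.mul_(1 - tau).add_(tau * p.data)
```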
4.
Results and discussion
4.1
Experiment setting
We evaluate our approach in a grid-world simulation environment similar to that of Refs. [15, 16]. The size of the square environment is randomly selected at the beginning of each episode to be 20, 50, or 100, and the obstacle density is randomly selected from 0%, 10%, or 30%. Obstacles, agents, and goals are placed uniformly at random throughout the environment, with the caveat that each agent must be able to reach its goal. Following Ref. [15], each agent is initially placed in the same connected region as its goal. The actions of the agents are executed sequentially in a random order at each time step to ensure that they have equal priority.
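An episode generator along these lines could look as follows; this is a sketch under our assumptions (a BFS connectivity check and no de-duplication of sampled start cells), not the authors' environment code.

```python
import numpy as np
from collections import deque

def connected_region(obstacles, start):
    """BFS over the 4-connected free cells reachable from `start`."""
    h, w = obstacles.shape
    seen, queue = {start}, deque([start])
    while queue:
        y, x = queue.popleft()
        for dy, dx in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            ny, nx = y + dy, x + dx
            if 0 <= ny < h and 0 <= nx < w and not obstacles[ny, nx] and (ny, nx) not in seen:
                seen.add((ny, nx))
                queue.append((ny, nx))
    return list(seen)

def sample_episode(rng, sizes=(20, 50, 100), densities=(0.0, 0.1, 0.3), n_agents=8):
    """Random world size and obstacle density, then uniformly placed agents,
    each goal drawn from the connected free region containing its agent."""
    size = int(rng.choice(sizes))
    obstacles = rng.random((size, size)) < rng.choice(densities)
    free = [tuple(c) for c in np.argwhere(~obstacles)]
    starts, goals = [], []
    for _ in range(n_agents):
        start = free[rng.integers(len(free))]
        region = connected_region(obstacles, start)
        starts.append(start)
        goals.append(region[rng.integers(len(region))])
    return obstacles, starts, goals

# Usage: obstacles, starts, goals = sample_episode(np.random.default_rng(0))
```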
Inspired by curriculum learning, which is often used with imitation learning, our training procedure is divided into a few stages and progresses from easier tasks to more difficult ones. In the easy scenes, we initialize a small population of agents and dynamic obstacles and sample goals within a certain distance, so that agents learn a short-range navigation policy. We then increase the number of agents and dynamic obstacles and sample goals over the whole map.
Algorithm 1: Off-policy training of the proposed framework.
1  Initialize the capacity of the replay memory D;
2  Initialize the weights of the observation encoder and the decision block;
3  for each iteration do
4      for each environment step t do
5          for each agent i do
6              get the local observation L_i^t, the guide vector v_i^t, and the nearby agents N_i;
7              h_i^t = LA-Encoder(L_i^t, v_i^t);
8              s_i^t = TC-Block(h_i^t, {h_j^t | j ∈ N_i});
9              a_i^t = DecisionBlock(s_i^t);
10             s_{t+1} ∼ p(s_{t+1} | s_t, a_t);
11             D = D ∪ {(s_t, a_t, r(s_t, a_t), s_{t+1})};
12         end for
13     end for
14     for each gradient step do
15         Sample a minibatch from D;
16         Update the framework with the losses in Eq. (6), Eq. (7) (using the target in Eq. (8)), and Eq. (11);
17         Update the target value network with the smoothing rule in Eq. (9);
18     end for
19 end for
4.2
Training details
We use a discount factor (γ) of 0.95. The episode length depends on the world size: 128 steps in the 20-size world, 256 in the 50-size world, and 512 in the 100-size world. The batch size is 128, so that an integer number of gradient updates can be performed per agent in each episode. We implement the model in PyTorch and train it with RAdam[6] starting from a learning rate of 3×10⁻⁴.
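In code, these settings amount to something like the following; the model below is only a placeholder for the full LACRL network.

```python
import torch
import torch.nn as nn

GAMMA, BATCH_SIZE = 0.95, 128
EPISODE_LENGTH = {20: 128, 50: 256, 100: 512}        # steps per episode by world size

model = nn.Sequential(nn.Linear(128, 5))             # placeholder for the full LACRL network
optimizer = torch.optim.RAdam(model.parameters(), lr=3e-4)
```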
4.3
Metrics
Success rate = n_success / n is the ratio of the number of agents that reach their goals within a certain time limit to the total number of agents, where n_success is the number of agents that completed their travel and reached their goals and n is the total number of agents to be planned.
Extra time rate = (T − T*) / T* is the relative gap between the average travel time over all agents, T, and its lower bound T*, where the lower bound is the time an agent would need for its tour if all other agents were ignored.
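Both metrics are straightforward to compute per instance; averaging the per-agent lower bounds to obtain T* is our reading of the definition above.

```python
def success_rate(reached):
    """Fraction of agents that reached their goals within the time limit."""
    return sum(reached) / len(reached)

def extra_time_rate(travel_times, lower_bounds):
    """(T - T*) / T*, with T the mean realized travel time and T* the mean of
    the per-agent lower bounds computed while ignoring the other agents."""
    t = sum(travel_times) / len(travel_times)
    t_star = sum(lower_bounds) / len(lower_bounds)
    return (t - t_star) / t_star
```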
4.4
Baselines
We introduce two traditional algorithms and three learning-based models:
LRA*[10]: Local repair A* (LRA*) is a simple replanning method in which each agent searches for its own path to its goal while ignoring other agents; when an agent is about to collide, the algorithm recalculates the remainder of its route. LRA* may be an adequate solution for simple environments with few obstacles and few agents, but in more complex environments it suffers from excessive recalculation.
CBS[11]: Conflict-based search (CBS) is a two-level algorithm. At the high level, a search is performed on the conflict tree (CT), a tree based on conflicts between individual agents in which each node represents a set of constraints on the agents’ motion. At the low level, fast single-agent searches are performed to satisfy the constraints imposed by the high-level CT node.
PRIMAL[15, 16]: PRIMAL is a hybrid learning-based method for MAPF that uses both imitation learning (based on an expert algorithm) and multi-agent reinforcement learning. However, PRIMAL does not take inter-agent communication into consideration; as mentioned in Ref. [18], the key to solving the MAPF problem is learning what, how, and when to communicate.
IL_GNN[18, 19]: These methods use a convolutional neural network to extract features from local observations and a graph neural network to communicate these features among agents. However, they rely on imitation learning to train the models, which limits exploration of the environment and makes convergence difficult.
G2RL[36]: This algorithm combines a global guidance path with a model trained in the reinforcement learning framework. G2RL does not consider inter-agent communication and instead adds the guide-path trick to every agent’s state.
4.5
Results
In this section, we discuss the experimental results of our model and the other models in terms of success rate and ETR. To facilitate comparison, we also visualize the results.
Success rate. The most widely used evaluation index in MAPF is the success rate, that is, the proportion of agents that reach their targets within the given number of time steps. The success rate results are reported in Table 2, and for easier comparison we also visualize them in Fig. 5.
Table
2.
Results for success rate. We compare our LACRL with LRA*[10], CBS[11], PRIMAL[15, 16], IL_GNN[18], and G2RL[36] under different environment settings. Values are listed as the mean over 100 instances, and the highest (best) values are highlighted.
Figure
5.
Visual results of the success rate. Each subfigure shows the success rates of our LACRL and the other methods at different obstacle densities; the higher the bar in the histogram, the better the performance of the corresponding model.
In our experiments, we take a search-based method (LRA*) and a conflict-based method (CBS) as the traditional baselines. Compared with these traditional baselines, the success rate of our model is very similar when the grid world is small. As the number of agents, the world scale, and the obstacle density grow, the success rates of the baselines drop markedly, whereas the success rate of our model decreases much more slowly. The traditional methods achieve the best success rate in small-scale worlds because, when the world and the number of agents are small, communication may be unnecessary; moreover, if the world is small and the targets are very close to the agents, the FOV already contains most of the information about the environment, so the decentralized partially observable problem effectively becomes a centralized fully observable one. In that regime our model loses its advantages and does not reach the high success rate of the traditional methods. Fortunately, real application scenarios are rarely that small. Besides the search-based and conflict-based methods, we also consider several learning-based models (PRIMAL, IL_GNN, G2RL). Compared with these previous learning-based models, our model achieves the best success rate in every case. An interesting observation is that, as the problem becomes more complex, the model without a global guide path (PRIMAL) declines quickly, while the models with a global guide path (IL_GNN, G2RL) decline more slowly.
Extra time rate (ETR). From its definition, the success rate only reflects whether the task is completed within the specified time and ignores the quality of the paths found. Therefore, we measure the quality of a model’s policy by the proportion of extra time it requires, that is, the extra time rate, where the lower bound of a path length is given by the global path computed in a static environment ignoring other agents. The closer the path length found by the model is to this global path, the better the solution and the smaller the extra time rate.
The extra time rate results are reported in Table 3 and visualized in Fig. 6 for easier comparison. Our model LACRL achieves the lowest extra time rate among the learning-based methods in Table 3, and in Fig. 6 the red line (our LACRL) is always below the other lines. When the environment is small and the number of agents is small, the gap between the red line and the other lines is small. The first reason is the environment itself: due to its limited size, even a very poor solution cannot produce a very high extra time rate. Second, because the environment and the number of agents are small, it is not difficult for any model to explore the environment, so the differences in extra time rate remain modest. This indicates that as the environment scale and the number of agents increase, the differences in the models’ ability to explore the environment gradually emerge. It can also be seen that the shaded band around LACRL is always the thinnest, that is, its standard deviation is the smallest, which shows that LACRL is relatively stable: in the same environment, the extra time rate of the learned policy does not fluctuate greatly.
Table
3.
Results for extra time rate. Values are listed as “mean / standard deviation” over 100 instances, and the lowest (best) values are highlighted. The lower the mean, the better the model; the smaller the standard deviation, the more stable the model.
Figure
6.
Visual results of the extra time rate. Each subfigure shows the extra time rates of our LACRL and the other methods under different environment settings; in each line graph, the lower the line, the better the performance of the corresponding model.
In addition to comparing with the baseline models, we also conduct ablation experiments that remove or replace some components of our method. In the ablation study, we evaluate the performance of our full LACRL, LACRL with a plain CNN instead of the local attention layer in the LA-Encoder (LACRL w/o att.), and LACRL without the communication block (LACRL w/o comm.).
As shown in Table 4, the results of LACRL with a plain CNN instead of the local attention layer (LACRL w/o att.) and of LACRL without the communication block (LACRL w/o comm.) both degrade relative to the full LACRL, and the degradation is most pronounced when the communication block is removed. We also find that when the environment is small, the influence of the communication block is small, and its influence grows as the environment size increases. When the world is small and simple, no long-horizon decisions are needed and the agent quickly reaches its goal after a little exploration, so the need for communication is limited. However, as the scale of the environment and the number of agents increase, the impact of the communication block on performance becomes greater and greater. In short, the ablation study convincingly shows that both the communication block and the local attention contribute to the MAPF task.
Table
4.
Results of the ablation study. We evaluate the performance of our LACRL, LACRL without the local attention layer in the LA-Encoder (LACRL w/o att.), and LACRL without the communication block (LACRL w/o comm.) under different environment settings.
This paper proposes a new model for multi-agent path finding in partially observable environments, which is formalized as a Dec-POMDP and trained by deep reinforcement learning through repeated trial and error. We build the local observation encoder using a residual attention CNN to extract local observations and use the transformer architecture to build an interaction layer that combines the local observations of agents. To overcome the deficiencies of the success rate, we also design a new evaluation index, namely the extra time rate (ETR). The experimental results show that our model outperforms traditional methods in most cases and outperforms the previous learning-based models in all cases in terms of success rate and extra time rate across various experimental settings. Moreover, the ablation experiments show that the proposed components are indeed effective and that the performance of our model deteriorates without them.
Although the performance of our model in the simulated environment is satisfactory, much work remains before the experiments can be extended to real environments. The discretization and modeling of the real environment, the design of rewards and penalties, and other environment settings need to be adapted to the real environment and the practical problem at hand. Applying the model presented in this paper to real scenarios therefore remains a promising direction, and future experiments can be extended to real scenes to address practical needs.
Acknowledgements
This work was supported by the Fundamental Research Funds for the Central Universities (WK2150110017).
Conflict of interest
The authors declare that they have no conflict of interest.
Acknowledgements
This work was supported by the National Natural Science Foundation of China (71991464, 71921001) and Fundamental Research Funds for the Central Universities, General Program (WK2040000053) and Key Program (YD2040002027).
Conflict of interest
The author declares that he has no conflict of interest.
References
[1]
Yu B, Shan W, Sheu J B, et al. Branch-and-price for a combined order selection and distribution problem in online community group-buying of perishable products. Transportation Research Part B: Methodological, 2022, 158: 341–373. DOI: 10.1016/j.trb.2022.03.001
[2]
Chatterjee S, Peled R, Peres Y, et al. Gravitational allocation to Poisson points. Annals of Mathematics, 2010, 172 (1): 617–671. DOI: 10.4007/annals.2010.172.617
[3]
Holden N, Peres Y, Zhai A. Gravitational allocation on the sphere. Proceedings of the National Academy of Sciences, 2018, 115 (39): 9666–9671. DOI: 10.1073/pnas.1720804115
[4]
Ajtai M, Komlós, J, Tusnády G. On optimal matchings. Combinatorica, 1984, 4: 259–264. DOI: 10.1007/BF02579135
[5]
Talagrand M. Matching random samples in many dimensions. The Annals of Applied Probability, 1992, 2 (4): 846–856. DOI: 10.1214/aoap/1177005578
[6]
Kanoria Y. Dynamic spatial matching. arXiv: 2105.07329, 2021 .
[7]
Hoffman C, Holroyd A E, Peres Y. A stable marriage of Poisson and Lebesgue. The Annals of Probability, 2006, 34 (4): 1241–1272. DOI: 10.1214/009117906000000098
[8]
Markó R, Timár Á. A Poisson allocation of optimal tail. The Annals of Probability, 2016, 44 (2): 1285–1307. DOI: 10.1214/15-AOP1001
[9]
Shor P W. The average-case analysis of some on-line algorithms for bin packing. Combinatorica, 1986, 6: 179–200. DOI: 10.1007/BF02579171
[10]
Shor P W. How to pack better than best fit: tight bounds for average-case online bin packing. In: Proceedings 32nd Annual Symposium of Foundations of Computer Science. San Juan, PR, USA: IEEE, 1991 : 752–759.
[11]
Akbarpour M, Alimohammadi Y, Li S, et al. The value of excess supply in spatial matching markets. arXiv: 2104.03219, 2021 .
[12]
Sun S, Zhang B. Operation strategies for nanostore in community group buying. Omega, 2022, 110: 102636. DOI: 10.1016/j.omega.2022.102636
[13]
Li J, Li B, Shen Y, et al. Study on the steady state of the propagation model of consumers’ perceived service quality in the community group-buying. Journal of Retailing and Consumer Services, 2022, 65: 102882. DOI: 10.1016/j.jretconser.2021.102882
[14]
Li B, Krushinsky D, Reijers H A, et al. The share-a-ride problem: People and parcels sharing taxis. European Journal of Operational Research, 2014, 238 (1): 31–40. DOI: 10.1016/j.ejor.2014.03.003
[15]
Qi W, Li L, Liu S, et al. Shared mobility for last-mile delivery: Design, operational prescriptions, and environmental impact. Manufacturing & Service Operations Management, 2018, 20 (4): 737–751. DOI: 10.1287/msom.2017.0683
[16]
Agussurja L, Cheng S F, Lau H C. A state aggregation approach for stochastic multiperiod last-mile ride-sharing problems. Transportation Science, 2019, 53 (1): 148–166. DOI: 10.1287/trsc.2018.0840
[17]
Cao J, Olvera-Cravioto M, Shen Z J. Last-mile shared delivery: A discrete sequential packing approach. Mathematics of Operations Research, 2020, 45 (4): 1466–1497. DOI: 10.1287/moor.2019.1039
[18]
Mousavi K, Bodur M, Roorda M J. Stochastic last-mile delivery with crowd-shipping and mobile depots. Transportation Science, 2022, 56 (3): 612–630. DOI: 10.1287/trsc.2021.1088
[19]
Lyu G, Teo C P. Last mile innovation: The case of the Locker Alliance network. Manufacturing & Service Operations Management, 2022, 24 (5): 2425–2443. DOI: 10.1287/msom.2021.1000
[20]
Reed S, Campbell A M, Thomas B W. Impact of autonomous vehicle assisted last-mile delivery in urban to rural settings. Transportation Science, 2022, 56 (6): 1530–1548. DOI: 10.1287/trsc.2022.1142