1 Introduction
Reinforcement learning (RL) has recently shown the capability to solve many challenging sequential decision-making problems, ranging from the game of Go (silver2016mastering), poker (brown2018superhuman), and real-time strategy games (vinyals2019grandmaster), to autonomous driving (shalev2016safe) and robotics (kober2013reinforcement). Many RL applications involve the interaction of multiple agents, which are modeled systematically within the framework of multi-agent reinforcement learning (MARL). These success stories have inspired a remarkable line of studies on the theoretical aspects of MARL.
Most of the theoretical efforts in MARL, however, have been devoted to Markov games with special reward structures, such as fully competitive or fully cooperative games. One prevalent setting is MARL in two-player zero-sum Markov games (wei2017online; xie2020learning), where the two agents have exactly opposite objectives. Such prevalence is mostly due to the fundamental computational difficulty of more general scenarios: Finding a Nash equilibrium (NE) is known to be PPAD-complete both for two-player general-sum games (chen2009settling) and for zero-sum games with more than two players (daskalakis2009complexity). Given these daunting impossibility results, convergence to NE in generic games with no special structure seems hopeless in general. As a result, many important problems in the multi-player general-sum setting, which can model broader and more practical interactive behaviors of decision makers, have been left relatively open.
In this paper, we make an initial attempt toward understanding some of the theoretical aspects of MARL in decentralized general-sum Markov games. Given the inherent challenge of computing Nash equilibria, we need to target a slightly weaker solution concept than NE. One reasonable alternative is to find a coarse correlated equilibrium (CCE) (moulin1978strategically; aumann1987correlated) of the game. Unlike NE, a CCE can always be found in polynomial time for general-sum games (papadimitriou2008computing), and due to this tractability, calculating CCE has also been commonly used as an important subroutine toward finding Nash equilibria in two-player zero-sum Markov games (xie2020learning; bai2020near).
Our interest in CCE is mostly motivated by the following folklore result for learning in normal-form games: When the agents independently run no-regret learning algorithms in general-sum normal-form games, the empirical frequency of their plays converges to the set of CCE of the game (hart2000simple; cesa2006prediction). In no-regret learning, each agent independently adapts its policy to minimize its cumulative regret based only on its local information, irrespective of the actions or rewards of the other agents. Well-known examples of no-regret learning algorithms include multiplicative weights update (MWU) (auer1995gambling) and online gradient descent (zinkevich2003online). This folk result hence suggests that CCE is a natural outcome of simple and uncoupled learning dynamics. A natural question to ask is whether a similar result also holds for Markov games. Specifically, in this paper, we ask the following questions: Can we find CCE in general-sum Markov games using decentralized/uncoupled learning dynamics? If so, can we achieve such a result efficiently, by showing an explicit sample complexity upper bound?
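To make the folklore result concrete, here is a minimal, self-contained sketch (not the paper's algorithm): two players independently run multiplicative weights update in a hypothetical 2x2 general-sum game, and the average external regret of both players becomes small, which is exactly the sense in which the empirical joint play approaches the set of CCE. The payoff matrices, step size, and horizon below are illustrative assumptions.

```python
import math

# A minimal sketch of the folklore result (not the paper's algorithm):
# both players run multiplicative weights update (MWU) on a made-up
# 2x2 general-sum game; their average external regret becomes small.
R = [[4.0, 1.0], [5.0, 0.0]]  # row player's rewards
C = [[4.0, 5.0], [1.0, 0.0]]  # column player's rewards

def mwu_play(T=5000, eta=0.05):
    wr, wc = [1.0, 1.0], [1.0, 1.0]        # action weights for each player
    rew_r = rew_c = 0.0                    # realized expected rewards
    cum_r, cum_c = [0.0, 0.0], [0.0, 0.0]  # cumulative reward of each fixed action
    for _ in range(T):
        pr = [w / sum(wr) for w in wr]
        pc = [w / sum(wc) for w in wc]
        # expected reward of each action against the opponent's current mix
        ur = [sum(R[i][j] * pc[j] for j in range(2)) for i in range(2)]
        uc = [sum(C[i][j] * pr[i] for i in range(2)) for j in range(2)]
        rew_r += sum(pr[i] * ur[i] for i in range(2))
        rew_c += sum(pc[j] * uc[j] for j in range(2))
        for i in range(2):
            cum_r[i] += ur[i]
            cum_c[i] += uc[i]
            wr[i] *= math.exp(eta * ur[i] / 5.0)  # rewards rescaled to [0, 1]
            wc[i] *= math.exp(eta * uc[i] / 5.0)
    # average external regret: best fixed action in hindsight vs realized reward
    return (max(cum_r) - rew_r) / T, (max(cum_c) - rew_c) / T
```

The standard MWU regret bound guarantees that both averages vanish as the horizon grows; no coordination between the players is needed, which is the essence of uncoupled learning dynamics.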
Before answering these questions, we would like to remark that decentralized MARL in general-sum games can be highly challenging. (The decentralized setting has been studied under various names in the literature, including individual learning (leslie2005individual), decentralized learning (arslan2016decentralized), agnostic learning (tian2020provably; wei2021last), and independent learning (claus1998dynamics; daskalakis2020independent). It also belongs to a more general category of teams/games with decentralized information structures (ho1980team; nayyar2013common; nayyar2013decentralized).) Specifically, in decentralized learning, each agent only has access to its own local information (the history of states, local rewards, and local actions), and can neither communicate with other agents nor be coordinated by any central controller during learning. In fact, we require that each agent is completely unaware of the underlying structure of the game, or even of the presence of the other agents. In decentralized learning, since both the reward and the transition are affected by the other agents, the environment becomes non-stationary from each agent's own perspective, especially when the agents learn and update their policies simultaneously. Hence, an agent needs to efficiently explore the unknown environment while bearing in mind that the information it gathered a while ago might no longer be accurate. This renders many successful single-agent RL solutions, which assume that the agent is learning in a stationary Markovian environment, inapplicable. Furthermore, compared with RL in two-player zero-sum games, an additional challenge in general-sum games is equilibrium selection. In zero-sum games, all NE have the same value (filar2012competitive), and there is no ambiguity in defining the suboptimality of a policy. In general-sum games, however, multiple equilibria can have different values.
We hence need to first identify which equilibrium to compare with when we measure the performance of a policy. Finally, we remark that we consider in this paper the fully observable setup, where the agents have full access to the state information. This is in contrast to the more general team theory with partially observable information structures (ho1980team; yuksel2013stochastic), such as those modeled by decentralized partially observable Markov decision processes (Dec-POMDPs) (bernstein2002complexity; nayyar2013decentralized), where each agent has only a private partial observation of the state. Learning, or even computing, a NE in the latter case is much more challenging, and we do not target such a more general setup in this paper.

Despite the challenges identified above, we answer both of the aforementioned questions affirmatively, by presenting an algorithm through which the agents can find a CCE in general-sum Markov games efficiently via decentralized learning. Our contributions in this paper are summarized as follows.
Contributions. 1) We study provably efficient exploration in decentralized general-sum Markov games. We propose an algorithm named Optimistic V-learning with Stabilized Online Mirror Descent (V-learning OMD), where V-learning (bai2020near) is a simple variant of Q-learning. In V-learning OMD, each agent independently runs an optimistic V-learning algorithm to explore the unknown environment, while using an online mirror descent procedure for policy updates. After the learning process, a CCE can be extracted by simply letting the agents randomly repeat their previous strategies using a common random seed. 2) We show that if all agents in the game run the V-learning OMD algorithm, they can find an $\epsilon$-approximate coarse correlated equilibrium in at most $\tilde{\mathcal{O}}(H^{6}SA/\epsilon^{2})$ episodes, where $S$ is the number of states, $A$ is the size of the largest action space among the agents, and $H$ is the length of an episode. Our result complements its counterpart in normal-form games that uncoupled no-regret learning dynamics lead to CCE. We further show that our sample complexity is nearly optimal in that it matches all the parameter dependences in the information-theoretic lower bound, except the dependence on the horizon length $H$. 3) As an important building block of our analysis, we conduct a novel investigation of a high-probability regret bound for OMD with a dynamic learning rate and weighted regret, which might be of independent interest. We emphasize that due to its decentralized nature, our algorithm readily generalizes to an arbitrary number of agents without suffering from an exponential dependence on the number of agents. Our work appears to be the first to provide non-asymptotic guarantees for MARL in generic general-sum Markov games with efficient exploration, with the additional appealing feature of being decentralized.
Outline. The rest of the paper is organized as follows: We start with a literature review in Section 2. In Section 3, we introduce the mathematical model of our problem and necessary preliminaries. In Section 4, we present our V-learning OMD algorithm for decentralized learning in general-sum Markov games. A sample complexity analysis of V-learning OMD is given in Section 5. In Section 6, we analyze a specific adversarial multi-armed bandit problem, which plays a central role in our analysis of the V-learning OMD algorithm. Finally, we conclude the paper in Section 7.
2 Related Work
Multi-agent reinforcement learning is generally modeled within the mathematical framework of stochastic games (shapley1953stochastic) (also referred to as Markov games). Given the PPAD-completeness of finding a Nash equilibrium in generic games (daskalakis2009complexity; chen2009settling), convergence to NE has mostly been studied in games with special structures, such as two-player zero-sum games or cooperative games.
In two-player zero-sum Markov games, littman1994markov proposed a Q-learning-based algorithm named minimax-Q, whose asymptotic convergence was later established in littman1996generalized. Recently, various sample-efficient methods have been proposed in this setting (wei2017online; bai2020provable; sidford2020solving; xie2020learning; bai2020near; liu2021sharp; zhao2021provably). Most notably, several works (daskalakis2020independent; tian2020provably; wei2021last) have investigated two-player zero-sum games in a decentralized environment similar to ours, where an agent cannot observe the actions of the other agent.
Another line of research with convergence guarantees is RL in teams or cooperative games. wang2002reinforcement proposed an optimal adaptive learning algorithm that converges to the optimal Nash equilibrium in Markov teams, where the agents share the same objective. More recently, arslan2016decentralized showed that decentralized Q-learning can converge to NE in weakly acyclic games, which cover Markov teams and potential games as important special cases. Later, yongacoglu2019learning further improved upon arslan2016decentralized and achieved convergence to the team-optimal equilibrium. Overall, the aforementioned works have mainly focused on two-player zero-sum games or cooperative games; these results do not carry over in any way to the general-sum games that we consider in this paper.
A few works have considered games beyond the zero-sum or cooperative settings: hu2003nash, littman2001friend, and zehfroosh2020pac established convergence guarantees under the assumption that either a saddle-point equilibrium or a coordination equilibrium exists. greenwald2003correlated bypassed the computation of NE in general-sum games by targeting correlated equilibria instead, but no theoretical convergence result was given. Other approaches for finding NE in general-sum games include minimizing Bellman-like residuals learned from offline/batch data (perolat2017learning), and using a two-timescale algorithm to learn the policy of each player from an optimization perspective (prasad2015two). Nevertheless, none of these works has considered sample-efficient exploration in a decentralized environment, the more challenging objective that we pursue in this paper. More recently, liu2021sharp studied the non-asymptotic properties of learning CCE in general-sum Markov games, but their sample complexity bound scales exponentially in the number of agents as a consequence of using a centralized learning approach.
3 Preliminaries
An $N$-player (episodic) general-sum Markov game is defined by a tuple $(\mathcal{N}, H, \mathcal{S}, \{\mathcal{A}_i\}_{i=1}^{N}, \{r_i\}_{i=1}^{N}, \mathbb{P})$, where (1) $\mathcal{N}=\{1,2,\dots,N\}$ is the set of agents; (2) $H$ is the number of time steps in each episode; (3) $\mathcal{S}$ is the finite state space; (4) $\mathcal{A}_i$ is the finite action space for agent $i\in\mathcal{N}$; (5) $r_{i,h}:\mathcal{S}\times\mathcal{A}\rightarrow[0,1]$ is the reward function for agent $i$ at step $h\in[H]$, where $\mathcal{A}=\mathcal{A}_1\times\dots\times\mathcal{A}_N$ is the joint action space; and (6) $\mathbb{P}_h:\mathcal{S}\times\mathcal{A}\rightarrow\Delta(\mathcal{S})$ is the transition kernel at step $h$. We remark that both the reward function and the state transition function depend on the joint actions of all the agents. We assume for simplicity that the reward functions are deterministic. Our results can be easily generalized to stochastic reward functions.
The agents interact with the unknown environment for $K$ episodes, and we let $T=HK$ be the total number of time steps. At each time step $h\in[H]$ of an episode, the agents observe the state $s_h\in\mathcal{S}$, and take actions $a_{i,h}\in\mathcal{A}_i$, $i\in\mathcal{N}$, simultaneously. We let $a_h=(a_{1,h},\dots,a_{N,h})$. The reward $r_{i,h}(s_h,a_h)$ is then revealed to agent $i$ privately, and the environment transitions to the next state $s_{h+1}\sim\mathbb{P}_h(\cdot\mid s_h,a_h)$. We assume for simplicity that the initial state $s_1$ of each episode is fixed.
As mentioned before, we focus on the decentralized setting: Each agent only observes the states and its own actions and rewards, but not the actions or rewards of the other agents. In fact, each agent is completely oblivious of the existence of other agents, and is also unaware of the number of agents in the game. The agents are also not allowed to communicate with each other. This decentralized information structure requires an agent to learn to make decisions based on only its local information.
For each agent $i$, a (Markov) policy $\pi_i:[H]\times\mathcal{S}\rightarrow\Delta(\mathcal{A}_i)$ is a mapping from the time index and the state space to a distribution over its action space. We denote the space of policies for agent $i$ by $\Pi_i$. The joint policy space is denoted by $\Pi=\Pi_1\times\dots\times\Pi_N$. Each agent seeks to find a policy that maximizes its own cumulative reward. A joint policy $\pi=(\pi_1,\dots,\pi_N)\in\Pi$ induces a probability measure on the sequence of states and joint actions. For a joint policy $\pi$, and for each time step $h\in[H]$, state $s$, and joint action $a$, we define the (state) value function and the state-action value function (Q-function) for agent $i$ as follows:
$$V_{i,h}^{\pi}(s):=\mathbb{E}_{\pi}\Big[\sum_{h'=h}^{H}r_{i,h'}(s_{h'},a_{h'})\;\Big|\;s_h=s\Big],\qquad Q_{i,h}^{\pi}(s,a):=\mathbb{E}_{\pi}\Big[\sum_{h'=h}^{H}r_{i,h'}(s_{h'},a_{h'})\;\Big|\;s_h=s,a_h=a\Big]. \tag{1}$$
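To make the value functions concrete, the following sketch evaluates them by backward induction in a hypothetical two-agent, two-state, horizon-2 game; all rewards, transitions, and policies below are made up for illustration and are not part of the paper's model.

```python
# Hypothetical 2-agent, 2-state, horizon-2 game illustrating Eq. (1):
# V_{i,h}(s) is computed by backward induction over h = H, ..., 1.
H = 2
STATES = [0, 1]
ACTIONS = [0, 1]  # each agent has two actions

def reward(i, s, a1, a2):
    # made-up rewards: agent 0 likes matching actions, agent 1 likes mismatching
    return float(a1 == a2) if i == 0 else float(a1 != a2)

def step(s, a1, a2):
    return (s + a1 + a2) % 2  # made-up deterministic transition

def evaluate(pi1, pi2):
    """pi1[h][s], pi2[h][s]: action distributions of the two agents."""
    V = [[[0.0] * len(STATES) for _ in range(H + 2)] for _ in range(2)]
    for h in range(H, 0, -1):
        for s in STATES:
            for i in range(2):
                v = 0.0
                for a1 in ACTIONS:
                    for a2 in ACTIONS:
                        p = pi1[h][s][a1] * pi2[h][s][a2]
                        v += p * (reward(i, s, a1, a2) + V[i][h + 1][step(s, a1, a2)])
                V[i][h][s] = v
    return V

uniform = [[[0.5, 0.5] for _ in STATES] for _ in range(H + 2)]
V = evaluate(uniform, uniform)  # under uniform play each agent gets 0.5 per step
```

Note how both the rewards and the transition depend on the joint action, which is why a single agent cannot evaluate its own value function without accounting for the others.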
For convenience of notations, we use the superscript $-i$ to denote the set of agents excluding agent $i$, i.e., $\mathcal{N}\setminus\{i\}$. For example, we can rewrite $\pi=(\pi_i,\pi_{-i})$ using this convention.
Definition 1.
(Best response). A policy $\pi_i^{\dagger}$ is a best response to $\pi_{-i}$ for agent $i$ if $V_{i,1}^{\pi_i^{\dagger},\pi_{-i}}(s)=\sup_{\pi_i}V_{i,1}^{\pi_i,\pi_{-i}}(s)$ for any state $s$.
The supremum is taken over all general policies that are not necessarily Markov. (A general policy (bai2020near) of agent $i$ is a set of maps from a random variable $z$ and a history of length $h$ to a distribution over actions in $\mathcal{A}_i$.) When $\pi_{-i}$ is Markov, the environment is stationary from the perspective of agent $i$, and there always exists a Markov best response.

Definition 2.
(Nash equilibrium). A Markov joint policy $\pi$ is a Nash equilibrium (NE) if $\pi_i$ is a best response to $\pi_{-i}$ for all $i\in\mathcal{N}$.
Given the fundamental difficulty of calculating Nash equilibria in general-sum games (daskalakis2009complexity; chen2009settling), in this paper we mainly target a weaker solution concept named coarse correlated equilibrium (moulin1978strategically; aumann1987correlated). CCE allows possible correlations among the agents' policies. Before we formally define CCE, we need to refine the above definitions of policies to reflect such correlations. We define a (non-Markov) correlated policy as a set of maps $\pi=\{\pi_h\}_{h\in[H]}$, where for each $h$, $\pi_h$ maps from a random variable $z$ and a history of length $h$ to a distribution over the joint action space $\mathcal{A}$. We also assume that the agents can access a common source of randomness (e.g., a common random seed) to sample the random variable $z$. One can see that our definition of correlated policies generalizes the usual concept of policies as we defined earlier, since the latter only allows mappings to each agent's individual action space simplex. Following the new notations, we can analogously define the marginals $\pi_i$ and $\pi_{-i}$ of a correlated policy $\pi$. The best responses, value functions, and state-action value functions for agent $i$ can be defined similarly as in Definition 1 and Equation (1). We are now in a position to define the CCE in an episodic Markov game, as follows:
Definition 3.
(Coarse correlated equilibrium). A correlated policy $\pi$ is a coarse correlated equilibrium if for every agent $i$, $\sup_{\pi_i^{\dagger}}V_{i,1}^{\pi_i^{\dagger},\pi_{-i}}(s_1)\leq V_{i,1}^{\pi}(s_1)$.
For illustrative purposes, let us compare the definitions of NE and CCE in a simple normal-form game named Hawk-Dove (with no state transitions). There are two players in this game; both the row player and the column player have the action space {Dove, Hawk}. The reward matrix of the Hawk-Dove game is described in Table 2. There are three Nash equilibria in this game: (Hawk, Dove) and (Dove, Hawk) are two pure-strategy NE, and the profile in which each player plays Hawk with probability 1/2 is a NE in mixed strategies. Table 2 also gives a CCE distribution of the Hawk-Dove game, which assigns equal probabilities to three action pairs: (Dove, Dove), (Dove, Hawk), and (Hawk, Dove). We can see that an NE specifies for each player an independent probability distribution over the player's own action space; in contrast, a CCE is a probability distribution over the joint action space of the players. In this sense, CCE generalizes NE by allowing possible correlations among the strategies of the agents. In our proposed algorithm, such correlation is implicitly achieved by letting the players use a common random seed.
            Dove    Hawk
Dove        4, 4    1, 5
Hawk        5, 1    0, 0

            Dove    Hawk
Dove        1/3     1/3
Hawk        1/3     0

Table 2: Left: reward matrix of the Hawk-Dove game (row player's reward listed first). Right: a CCE distribution over the joint actions.
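The CCE condition for the tabulated distribution can be verified mechanically. The following sketch checks, in exact rational arithmetic, that no player can gain by deviating to a fixed action against the opponent's marginal of the distribution; the encoding (action 0 = Dove, action 1 = Hawk) is our own.

```python
from fractions import Fraction as F

# Reward tables of the Hawk-Dove game (action 0 = Dove, action 1 = Hawk).
R = [[F(4), F(1)], [F(5), F(0)]]           # row player's rewards
C = [[F(4), F(5)], [F(1), F(0)]]           # column player's rewards
p = [[F(1, 3), F(1, 3)], [F(1, 3), F(0)]]  # candidate CCE over joint actions

def is_cce(R, C, p):
    # value of the correlated distribution to each player
    val_r = sum(p[i][j] * R[i][j] for i in range(2) for j in range(2))
    val_c = sum(p[i][j] * C[i][j] for i in range(2) for j in range(2))
    # marginal play of the opponent under the correlated distribution
    col_marg = [p[0][j] + p[1][j] for j in range(2)]
    row_marg = [p[i][0] + p[i][1] for i in range(2)]
    # best fixed-action deviation for each player
    dev_r = max(sum(col_marg[j] * R[i][j] for j in range(2)) for i in range(2))
    dev_c = max(sum(row_marg[i] * C[i][j] for i in range(2)) for j in range(2))
    return dev_r <= val_r and dev_c <= val_c
```

For this distribution, each player's value is 10/3, which also exceeds the mixed-NE value of 5/2, illustrating that correlation can benefit both players simultaneously.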
Unlike NE, it can be shown that a CCE can always be calculated in polynomial time for general-sum games (papadimitriou2008computing). If the inequality in Definition 3 only holds approximately, then we have an approximate notion of coarse correlated equilibrium.
Definition 4.
($\epsilon$-approximate CCE). For any $\epsilon>0$, a correlated policy $\pi$ is an $\epsilon$-approximate coarse correlated equilibrium if for every agent $i$, $\sup_{\pi_i^{\dagger}}V_{i,1}^{\pi_i^{\dagger},\pi_{-i}}(s_1)\leq V_{i,1}^{\pi}(s_1)+\epsilon$.
For notational convenience, in most parts of the paper we illustrate our algorithms and results for the special case of two-player general-sum games, i.e., $N=2$. It is straightforward to extend our results to the general $N$-player games as we defined above. With two players, we use $\mathcal{A}$ and $\mathcal{B}$ to denote the action spaces of players 1 and 2, respectively. Let $A=|\mathcal{A}|$ and $B=|\mathcal{B}|$. We assume that an upper bound on the sizes of the action spaces is common knowledge. We also rewrite the policies as $(\mu,\nu)=(\pi_1,\pi_2)$.
4 The VLearning OMD Algorithm
In this section, we introduce our algorithm, Optimistic V-learning with Stabilized Online Mirror Descent (V-learning OMD), for decentralized multi-agent RL in general-sum Markov games.
V-learning OMD naturally integrates the idea of optimistic V-learning from single-agent RL (jin2018q) with Online Mirror Descent (OMD) (nemirovskij1983problem; zinkevich2003online) from online convex optimization. First, our algorithm uses optimistic V-learning to efficiently explore the unknown environment, as in single-agent RL. Second, each agent selects its actions following a no-regret OMD algorithm in order to achieve a CCE. The intuition of using no-regret learning here is to defend against the unobserved behavior of the opponents, by presuming that the opponents' behavior will impair the reward sequence arbitrarily. Although seemingly conservative, we will show that this suffices to find a CCE. The use of no-regret learning is also reminiscent of the well-known result in normal-form games that if all agents run a no-regret learning algorithm, the empirical frequency of their actions converges to a CCE (cesa2006prediction). These components also make our algorithm decentralized: it can be implemented by each agent individually, using only the local rewards received and the local actions executed, without any communication among the agents.
algocf[!tbp]
The algorithm run by agent 1 (with action space $\mathcal{A}$) is presented in Algorithm LABEL:alg:qomd. The algorithm for agent 2 (or any other agent in the setting with more than two agents) is symmetric, obtained by simply replacing $\mathcal{A}$ with the agent's own action space. We thus omit the index of the agent in the notations for clarity. We use $\theta_h(s,a)$ to denote the probability of taking action $a$ at state $s$ and step $h$, where $\theta_h(s,\cdot)\in\Delta(\mathcal{A})$. At each step $h$ of an episode, the agent first takes an action $a_h$ according to the policy $\theta_h(s_h,\cdot)$ for the current state $s_h$, and observes the reward $r_h$ and the next state $s_{h+1}$. It also counts $n$, the number of times that state $s_h$ has been visited at step $h$, and constructs a bonus term $\beta_n=c\sqrt{H^3\iota/n}$ (where $c$ is some absolute constant and $\iota$ is a log factor to be defined later) that is used to maintain an upper bound on the state value function. The agent then updates the optimistic state value functions by:
$$V_h(s_h)\leftarrow(1-\alpha_n)V_h(s_h)+\alpha_n\big(r_h+V_{h+1}(s_{h+1})+\beta_n\big), \tag{2}$$
where the learning rate is $\alpha_n=\frac{H+1}{H+n}$. This update rule essentially follows the optimistic Q-learning algorithm (jin2018q) in the single-agent scenario, except that instead of estimating the Q-functions, we maintain optimistic estimates of the state value functions. This is because the definition of the Q-function explicitly depends on the joint actions of all the agents, which cannot be observed in a decentralized environment. Such an argument is also consistent with the Optimistic Nash V-learning (bai2020near) and the V-OL (tian2020provably) algorithms for RL in two-player zero-sum games.

Unlike single-agent RL, where the agent takes the action with the largest optimistic Q-value, in a multi-agent environment the agent proceeds more conservatively by running an adversarial bandit algorithm to account for the unobserved effects of the other agents' policy changes. At each step $h$ and each state $s$, we use a variant of online mirror descent with bandit feedback to compute the policy $\theta_h(s,\cdot)$. OMD is an iterative process that computes the current policy by carrying out a simple gradient update in the dual space, where the dual space is defined by a mirror map (or a regularizer) $\Phi$. In our algorithm, we use the standard unnormalized negentropy regularizer $\Phi(\theta)=\sum_{a}\big(\theta(a)\log\theta(a)-\theta(a)\big)$. Given a mirror map $\Phi$, the induced Bregman divergence is defined as $D_{\Phi}(\theta,\theta'):=\Phi(\theta)-\Phi(\theta')-\langle\nabla\Phi(\theta'),\theta-\theta'\rangle$.
Given the (bandit-feedback) loss estimate $\hat{\ell}_t$ at step $h$ and state $s$, the OMD update rule is given by (Line 13 of Algorithm LABEL:alg:qomd):
$$\theta_{t+1}=\operatorname*{argmin}_{\theta\in\Delta(\mathcal{A})}\;\eta_t\langle\theta,\hat{\ell}_t\rangle+D_{\Phi}(\theta,\theta_t),$$
where $\eta_t$ is the learning rate. We remark that OMD itself is a well-developed algorithmic framework with a rich literature. But in our case, to be consistent with the changing learning rate in the V-learning part and the high-probability nature of the sample complexity bounds, we additionally require an OMD algorithm to have (1) a dynamic learning rate and (2) a high-probability regret bound, with respect to (3) a weighted definition of regret. Such a result is absent in the literature as far as we know. Interestingly, incorporating OMD with a dynamic learning rate is an active and challenging sub-area per se: An impossibility result (orabona2018scale) has shown that standard OMD with a dynamic learning rate of order $1/\sqrt{t}$ can incur linear regret when the Bregman divergence is unbounded, which covers our choice of $\Phi$. A stabilization technique (fang2020online) was later introduced to resolve this problem, by replacing the policy at each step with a convex combination of this policy and the initial policy. This stabilization technique is also helpful in our method (Line 14 in Algorithm LABEL:alg:qomd), although the design of the convex combination is a little more involved due to the weighted regret. We provide a more detailed description of the bandit subroutine and an analysis of our OMD algorithm in Section 6.
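The two ingredients just described can be sketched as follows: with the negentropy mirror map, the Bregman divergence reduces to the generalized KL divergence, and one OMD step reduces to a multiplicative-weights update followed by normalization; the stabilization then mixes the iterate with the initial (uniform) policy. The mixing weight $\eta_{t+1}/\eta_t$ below follows the stabilization idea of fang2020online and is illustrative only, not the exact (more involved) choice used in our algorithm.

```python
import math

# With the unnormalized negentropy mirror map Phi(x) = sum_a x_a log x_a - x_a,
# the Bregman divergence D_Phi reduces to the generalized KL divergence, and one
# OMD step reduces to a multiplicative-weights update plus normalization.

def negentropy(x):
    return sum(v * math.log(v) - v for v in x)

def bregman(x, y):
    grad_y = [math.log(v) for v in y]  # gradient of the negentropy at y
    return negentropy(x) - negentropy(y) - sum(g * (a - b) for g, a, b in zip(grad_y, x, y))

def gen_kl(x, y):
    # generalized KL divergence: sum_a x_a log(x_a / y_a) - x_a + y_a
    return sum(a * math.log(a / b) - a + b for a, b in zip(x, y))

def omd_step(theta, loss_est, eta):
    # gradient step in the dual (log) space, then back onto the simplex
    w = [t * math.exp(-eta * l) for t, l in zip(theta, loss_est)]
    z = sum(w)
    return [v / z for v in w]

def stabilized_omd_step(theta, loss_est, eta_t, eta_next):
    # mix the OMD iterate with the uniform initial policy (stabilization)
    K = len(theta)
    mix = eta_next / eta_t  # illustrative mixing weight, <= 1 for decreasing rates
    new = omd_step(theta, loss_est, eta_t)
    return [mix * a + (1.0 - mix) / K for a in new]
```

The mixing step keeps every action probability bounded away from zero, which is what prevents the unbounded Bregman divergence from blowing up the regret under a changing learning rate.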
5 Theoretical Analyses
In this section, we present our main results on the sample complexity upper bound of Vlearning OMD, and characterize the fundamental limits of the problem by providing a lower bound.
We first introduce a few notations to facilitate the analysis. For a given step $h$ of episode $k$, we denote by $s_h^k$ the state that the agents observe at this step. Let $\mu_h^k$ and $\nu_h^k$ be the (interim) strategies at step $h$ of episode $k$ specified by $\theta_h$ in Algorithm LABEL:alg:qomd to agents 1 and 2, respectively. Let $a_h^k$ and $b_h^k$ be the actual actions taken by the two agents. Let $V_h^k(s)$ and $N_h^k(s)$, respectively, be the values of $V_h(s)$ and $N_h(s)$ in Algorithm LABEL:alg:qomd calculated by agent 1 at the beginning of the $k$-th episode. Symmetrically, define $\tilde{V}_h^k(s)$ to be the value of $V_h(s)$ calculated by agent 2, which does not necessarily take the same value as $V_h^k(s)$. For notational convenience, we often suppress the sub/superscripts when there is no possibility of any ambiguity. When the state $s$ is clear from the context, we also sometimes abbreviate $N_h^k(s)$ as $n_h^k$ or even simply as $n$. For a fixed state $s$, let $n=N_h^k(s)$, and suppose that $s$ was visited at episodes $k^1<k^2<\dots<k^n$ at the $h$-th step before the $k$-th episode. If we further define $\alpha_t^0:=\prod_{j=1}^{t}(1-\alpha_j)$ and $\alpha_t^i:=\alpha_i\prod_{j=i+1}^{t}(1-\alpha_j)$, one can show that the update rule in (2) can be equivalently expressed as
$$V_h^k(s)=\alpha_n^0 H+\sum_{i=1}^{n}\alpha_n^i\Big(r_h(s,a_h^{k^i},b_h^{k^i})+V_{h+1}^{k^i}(s_{h+1}^{k^i})+\beta_i\Big). \tag{3}$$
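Under the reconstructed updates, the equivalence between the recursive form (2) and the weighted form (3) can be checked numerically; the learning rate $\alpha_n=(H+1)/(H+n)$ and the arbitrary target values below are the only inputs, and the horizon is a made-up example value.

```python
# Check that unrolling V <- (1 - alpha_n) V + alpha_n * target_n for n = 1..N
# reproduces the weighted form alpha_N^0 * H + sum_i alpha_N^i * target_i.
H = 5

def alpha(n):
    return (H + 1) / (H + n)

def alpha_weights(N):
    # alpha_N^0 = prod_j (1 - alpha_j); alpha_N^i = alpha_i * prod_{j>i} (1 - alpha_j)
    a0 = 1.0
    w = [0.0] * (N + 1)
    for i in range(1, N + 1):
        w[i] = alpha(i)
        for j in range(i + 1, N + 1):
            w[i] *= 1.0 - alpha(j)
        a0 *= 1.0 - alpha(i)
    return a0, w

targets = [0.3, 1.1, 0.7, 2.0, 0.9, 1.5]  # arbitrary r + V_{h+1} + beta values
N = len(targets)

V_rec = float(H)  # optimistic initialization at H
for n, t in enumerate(targets, start=1):
    V_rec = (1.0 - alpha(n)) * V_rec + alpha(n) * t

a0, w = alpha_weights(N)
V_weighted = a0 * H + sum(w[i] * targets[i - 1] for i in range(1, N + 1))
```

Note that $\alpha_1=1$, so the optimistic initialization is completely forgotten after the first visit ($\alpha_n^0=0$ for $n\geq 1$).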
This update rule follows the standard optimistic Q-learning algorithm (jin2018q) in single-agent RL, and has also appeared in RL for two-player zero-sum games (bai2020near). In the following lemma, we recall several properties of $\alpha_t^i$ that are useful in our analysis.
Lemma 1.
(Properties of $\alpha_t^i$, Lemma 4.1 in jin2018q). The following properties hold:

(a) $\frac{1}{\sqrt{t}}\leq\sum_{i=1}^{t}\frac{\alpha_t^i}{\sqrt{i}}\leq\frac{2}{\sqrt{t}}$ for every $t\geq 1$.

(b) $\max_{i\in[t]}\alpha_t^i\leq\frac{2H}{t}$ and $\sum_{i=1}^{t}(\alpha_t^i)^2\leq\frac{2H}{t}$ for every $t\geq 1$.

(c) $\sum_{t=i}^{\infty}\alpha_t^i=1+\frac{1}{H}$ for every $i\geq 1$.

(d) $\alpha_t^0=0$ and $\sum_{i=1}^{t}\alpha_t^i=1$ for every $t\geq 1$; $\alpha_t^0=1$ and $\sum_{i=1}^{t}\alpha_t^i=0$ for $t=0$.
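These weight properties can be spot-checked numerically for a hypothetical horizon $H$; the script below recomputes the weights $\alpha_t^i$ from their definition and asserts the standard bounds from Lemma 4.1 of jin2018q over a range of $t$.

```python
# Spot-check of the standard weight properties (Lemma 4.1 of jin2018q) for a
# made-up horizon H; the weights alpha_t^i are recomputed from their definition.
H = 10

def alpha(j):
    return (H + 1) / (H + j)

def weights(t):
    w = []
    for i in range(1, t + 1):
        v = alpha(i)
        for j in range(i + 1, t + 1):
            v *= 1.0 - alpha(j)
        w.append(v)
    return w

for t in range(1, 151):
    w = weights(t)
    s = sum(w[i - 1] / i ** 0.5 for i in range(1, t + 1))
    assert 1.0 / t ** 0.5 <= s + 1e-12 <= 2.0 / t ** 0.5 + 1e-12  # sqrt bounds
    assert max(w) <= 2.0 * H / t + 1e-12                          # max weight
    assert sum(v * v for v in w) <= 2.0 * H / t + 1e-12           # squared sum
    assert abs(sum(w) - 1.0) < 1e-9                               # a distribution
```

The key qualitative point is that the weights concentrate on the most recent visits (roughly the last $1/H$ fraction), which is what lets the algorithm track the non-stationary strategies of the other agents.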
algocf[!tbp]
Based on the strategy trajectories of the two agents specified by Algorithm LABEL:alg:qomd, we construct an auxiliary pair of correlated policies $\hat{\pi}_h^k=(\hat{\mu}_h^k,\hat{\nu}_h^k)$ for each $(h,k)$. The construction of such correlated policies, largely inspired by the construction of the "certified policies" in bai2020near, is formally defined in Algorithm LABEL:alg:correlated. Such auxiliary correlated policies will play a significant role throughout our analysis, and are closely related to the CCE correlated policy that we will construct later. In words, $\hat{\pi}_h^k$ proceeds as follows: It first observes the current state $s$, and lets $n=N_h^k(s)$. Then, it randomly samples an episode index from $\{k^1,\dots,k^n\}$, the set of episodes in which the state $s$ was previously visited at step $h$ during the execution of the first $k$ episodes of Algorithm LABEL:alg:qomd. Each index $k^i$ has a probability of $\alpha_n^i$ to be selected. It is easy to verify that $\sum_{i=1}^{n}\alpha_n^i=1$, and hence we have specified a well-defined probability distribution over the episode index set. Finally, $\hat{\pi}_h^k$ executes the sampled strategy pair $(\mu_h^{k^i},\nu_h^{k^i})$ at step $h$, and then repeats a similar procedure using $\hat{\pi}_{h+1}^{k^i}$ at step $h+1$, and so on.
From the collection of such auxiliary correlated policies $\{\hat{\pi}_1^k\}_{k\in[K]}$, we finally construct a correlated policy $\pi$, which we will show later is a CCE. A detailed description of the construction of $\pi$ is presented in Algorithm LABEL:alg:certify. By construction, $\pi$ first uniformly samples an index $k$ from $[K]$ using a common random seed, and then proceeds by following the auxiliary correlated policy $\hat{\pi}_1^k$. One can see that the notations we have defined are related through the following equation: $\pi=\frac{1}{K}\sum_{k=1}^{K}\hat{\pi}_1^k$. We also remark that the common random seed used in Algorithms LABEL:alg:correlated and LABEL:alg:certify implicitly plays the role of the "trusted coordinator" typically used in the language of correlated equilibria.
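The index-sampling step of the auxiliary correlated policies can be sketched as follows: given the episodes at which a state was visited, an index is drawn with probability $\alpha_n^i$, and a common random seed guarantees that all agents draw the same index. The function names and the horizon value are our own illustrative choices.

```python
import random

# Sketch of the index-sampling step: given the episodes k^1..k^n at which a
# state was previously visited, index k^i is drawn with probability alpha_n^i.
# A common random seed makes all agents draw the same index.
H = 10

def alpha(j):
    return (H + 1) / (H + j)

def alpha_weights(n):
    # alpha_n^i = alpha_i * prod_{j=i+1}^n (1 - alpha_j)
    w = []
    for i in range(1, n + 1):
        v = alpha(i)
        for j in range(i + 1, n + 1):
            v *= 1.0 - alpha(j)
        w.append(v)
    return w

def sample_episode_index(visit_episodes, seed):
    """visit_episodes = [k^1, ..., k^n]; returns one with probability alpha_n^i."""
    rng = random.Random(seed)  # common seed => identical draws across agents
    return rng.choices(visit_episodes, weights=alpha_weights(len(visit_episodes)))[0]
```

Because every agent holds the same seed and the same visitation counts, this sampling can be carried out without any communication, which is how the correlation device is realized in a decentralized way.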
For notational convenience, we further introduce the operators $[\mathbb{P}_h V](s,a,b):=\mathbb{E}_{s'\sim\mathbb{P}_h(\cdot\mid s,a,b)}V(s')$ for any value function $V$, and $[\mathbb{D}_{\mu\times\nu}Q](s):=\mathbb{E}_{a\sim\mu(s),b\sim\nu(s)}Q(s,a,b)$ for any strategy pair $(\mu,\nu)$ and any state-action value function $Q$. With these notations, for any $(h,s,a,b)$ and for any policy pair $(\mu,\nu)$, the Bellman equations can be rewritten more succinctly as $Q_h^{\mu,\nu}(s,a,b)=r_h(s,a,b)+[\mathbb{P}_h V_{h+1}^{\mu,\nu}](s,a,b)$ and $V_h^{\mu,\nu}(s)=[\mathbb{D}_{\mu_h\times\nu_h}Q_h^{\mu,\nu}](s)$. Using these notations, we define $V_{H+1}^{\dagger,\hat{\pi}^k}(s)=0$ for all $s$, and for each $h\in[H]$, define
$$V_h^{\dagger,\hat{\pi}^k}(s)=\sum_{i=1}^{n}\alpha_n^i\max_{\mu\in\Delta(\mathcal{A})}\Big[\mathbb{D}_{\mu\times\nu_h^{k^i}}\big(r_h+\mathbb{P}_h V_{h+1}^{\dagger,\hat{\pi}^{k^i}}\big)\Big](s),\qquad n=N_h^k(s), \tag{4}$$
which can be considered as the value of agent 1's best response at step $h$ against the auxiliary correlated policy $\hat{\pi}_h^k$. In what follows, we will simply write $\hat{\pi}_h^k$ as $\hat{\pi}^k$ for notational convenience, because the time step is always clear from the subscripts. We can also define $V_h^{\hat{\pi}^k,\dagger}$ analogously.
algocf[!tbp]
Remark 1.
Our definition of the correlated policy $\pi$ is inspired by the "certified policies" (bai2020near) for learning in two-player zero-sum Markov games, but with additional challenges to address: In the zero-sum setting, the Nash equilibrium value is always unique, and the regret with respect to the equilibrium value can easily be defined a priori (by means of the "duality gap"). But in general-sum games, the equilibrium value is not necessarily unique. We hence need to first specify an equilibrium before we are able to define the regret. In our analysis, the equilibrium value we choose is the one associated with the correlated policy $\pi$. In addition, we also emphasize that the correlated policy $\pi$ is only used for analytical purposes; the actual strategies adopted by the agents during the execution of Algorithm LABEL:alg:qomd are still $(\mu_h^k,\nu_h^k)$.
We start with an intermediate result, which states that the optimistic values $V_h^k$ and $\tilde{V}_h^k$ are indeed high-probability upper bounds of $V_h^{\dagger,\hat{\pi}^k}$ and $V_h^{\hat{\pi}^k,\dagger}$, respectively. The proof relies on a delicate investigation of a high-probability regret bound for OMD with a dynamic learning rate, which we will elaborate on in the next section.
Lemma 2.
For any $\delta\in(0,1)$, let $\iota=\log\frac{SAT}{\delta}$. It holds with probability at least $1-\delta$ that $V_h^k(s)\geq V_h^{\dagger,\hat{\pi}^k}(s)$ and $\tilde{V}_h^k(s)\geq V_h^{\hat{\pi}^k,\dagger}(s)$, for all $(s,h,k)$.
Proof.
In the following, we provide a proof for the first inequality. The second inequality can be shown using a similar argument.
Our proof relies on backward induction on $h$. First, the claim holds for $h=H+1$ by the definition of $V_{H+1}^k$. Now, suppose $V_{h+1}^k(s)\geq V_{h+1}^{\dagger,\hat{\pi}^k}(s)$ for all $(s,k)$. By the definition of $V_h^{\dagger,\hat{\pi}^k}$ and the induction hypothesis,
(5) 
Further, define
(6) 
One may observe that, from the perspective of player 1, the quantity defined in (6), which we denote by $R_n$, is the weighted sum of the differences between the actual value that player 1 collected over the first $n$ times that state $s$ was visited, and the value that could have been achieved using the best fixed policy in hindsight. $R_n$ can hence be thought of as the weighted regret of an adversarial bandit problem, which we will formally present and analyze in Section 6: the loss function at round $i$ is induced by the opponent's strategy $\nu_h^{k^i}$, and the weight of the regret at round $i$ is $\alpha_n^i$. With these identifications, $R_n$ takes exactly the form of the weighted regret analyzed in Section 6. Applying the regret bound of this bandit problem (which will be presented in Theorem 2 of Section 6), we obtain the following result with high probability:
(7) 
where in the first step we have used the fact that is increasing and , and the second step is due to Lemma 1.
Finally, let $\mathcal{F}_k$ be the $\sigma$-algebra generated by all the random variables before episode $k$. Then, we can see that the corresponding partial-sum process is a martingale with respect to $\{\mathcal{F}_k\}$. From the Azuma-Hoeffding inequality, it holds with high probability that
(8)  
where $\tilde{\mathcal{O}}(\cdot)$ suppresses logarithmic terms. Finally, combining the results in (5), (6), (7), (8), and applying a union bound, we obtain that
where the second-to-last step is by the definition of the bonus $\beta_i=c\sqrt{H^3\iota/i}$ for some large constant $c$, and Lemma 1. In the last step, we used the formulation of $V_h^k(s)$ in (3). This completes the proof of the induction step. ∎
By construction of the auxiliary correlated policies $\hat{\pi}^k$, we know that for any $(h,k)$, the corresponding value function can be written recursively as follows:
$$V_h^{\hat{\pi}^k}(s)=\sum_{i=1}^{n}\alpha_n^i\Big[\mathbb{D}_{\mu_h^{k^i}\times\nu_h^{k^i}}\big(r_h+\mathbb{P}_h V_{h+1}^{\hat{\pi}^{k^i}}\big)\Big](s),$$
and $V_{H+1}^{\hat{\pi}^k}(s)=0$ for any $s$, where again notice that we have dropped the dependence on the time step in the superscript. The following result shows that, on average, the agents have no incentive to deviate from the correlated policies, up to a regret term that decays at a rate of $\tilde{\mathcal{O}}(1/\sqrt{K})$ in the number of episodes $K$.
Theorem 1.
For any $\delta\in(0,1)$, let $\iota=\log\frac{SAT}{\delta}$. With probability at least $1-\delta$,
Proof.
We provide a proof for the first bound. The second one can be shown using a similar argument. From Lemma 2, it holds that
and so we only need to find an upper bound for the RHS. Define $\delta_h^k:=V_h^k(s_h^k)-V_h^{\hat{\pi}^k}(s_h^k)$. The main idea of the proof is similar to optimistic Q-learning in the single-agent setting (jin2018q): We seek to upper bound $\delta_h^k$ by the next-step quantity $\delta_{h+1}^k$, and then obtain a recursive formula.
Let $\mathcal{F}_k$ be the $\sigma$-algebra generated by all the random variables before the $k$-th episode. It can be seen that the corresponding sum is a martingale with respect to the filtration $\{\mathcal{F}_k\}_{k\geq 0}$. From the Azuma-Hoeffding inequality, it holds with high probability that
for some constant $c$. Recall from the update rule (3) that
Taking the difference of the two leads to
for some constant $c$ that could take different values from line to line, where the last step is due to Lemma 1. Summing over $k\in[K]$, notice that
because there are at most $SH$ pairs of $(s,h)$ that can be visited for the first time. Further,