Multi-agent reinforcement learning
'''Multi-agent reinforcement learning''' (often abbreviated MARL) is a sub-field of reinforcement learning that studies how multiple autonomous agents can learn to make sequential decisions in a shared environment when their interests may align, conflict, or be completely antagonistic. In contrast to single-agent reinforcement learning, where the environment is stationary from the learner's perspective, MARL introduces a dynamic, non-stationary system in which each agent's policy can influence, and be influenced by, the policies of the other agents. This interaction gives rise to phenomena studied in game theory and the theory of repeated games, such as emergent group dynamics, and it has motivated the development of a rich taxonomy of learning paradigms, problem classes, and evaluation metrics unique to the multi-agent setting.
== Definition ==
Multi-agent reinforcement learning can be formally modelled as an extension of the Markov decision process (MDP) to multiple interacting agents. Let
- ''I'' = {1, …, N} be the set of agents,
- ''S'' be the set of environment states,
- ''A_i'' be the action set available to agent ''i'', and
- ''P'' be the transition probability function
[ P_{\vec a}(s, s') = \Pr(s_{t+1} = s' \mid s_t = s, \vec a_t = \vec a) ]
which gives the probability of moving from state s to s' under the joint action vector \vec a = (a_1, …, a_N). Each joint action yields an immediate reward vector R_{\vec a}(s, s') that is a function of the transition and the actions taken.
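The definition above can be made concrete with a toy stochastic (Markov) game. This is purely an illustrative sketch: the state and action counts, the random transition tensor, and all variable names are invented for the example, not taken from the article.

```python
import numpy as np

# Toy two-agent Markov game matching the definition above: a transition
# tensor P indexed by (state, joint action) giving a distribution over
# next states, and a reward tensor R returning one reward per agent.
rng = np.random.default_rng(0)
N_STATES, N_ACTIONS = 3, 2

# P[s, a1, a2] is a probability distribution over next states s'.
P = rng.dirichlet(np.ones(N_STATES), size=(N_STATES, N_ACTIONS, N_ACTIONS))
# R[s, a1, a2] is the reward vector (one component per agent).
R = rng.normal(size=(N_STATES, N_ACTIONS, N_ACTIONS, 2))

def step(s, a1, a2):
    """Sample s' ~ P_a(s, .) and return it with the joint reward."""
    s_next = rng.choice(N_STATES, p=P[s, a1, a2])
    return s_next, R[s, a1, a2]

s, rewards = step(0, 1, 0)
```

In this tabular form the transition function is exactly the P of the definition; each agent sees its own component of the reward vector.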
In environments with perfect information, such as the games of chess and Go, the MDP is fully observable: each agent can infer the complete state from its observations. In contrast, real-world domains such as self-driving cars involve imperfect information, where each agent receives only a partial observation of the true state. In such partially observable settings the underlying model becomes a partially observable stochastic game, and in the fully cooperative variant it is known as a decentralized partially observable Markov decision process (Dec-POMDP).
== Cooperation vs. competition ==
MARL explores a spectrum of inter-agent relationships, ranging from pure competition to pure cooperation, with many hybrid possibilities in between:
In pure competition settings, the agents' reward functions are exactly opposite, and the problem reduces to a zero-sum game. Classical board games like chess and Go exemplify this regime, as do two-player variants of video games such as StarCraft. Because an agent can only increase its own score at the direct expense of its opponent, the problem is stripped of communication and social-dilemma considerations; however, autocurricula can still emerge as each agent continually adapts to the other's improving strategy.
In pure cooperation settings, all agents receive identical rewards, eliminating any incentive for selfish deviation. Cooperative environments are often modelled with games such as Overcooked and are also relevant to real-world robotics tasks like coordinated manipulation. Here the challenge is not conflict but the coordination of many simultaneously optimal policies, which often leads to the emergence of conventions and shared communication protocols.
Mixed-sum settings blend cooperative and competitive incentives. In these games each agent pursues its own distinct goal while still being indirectly affected by the others. For example, in traffic-flow scenarios each autonomous vehicle seeks to minimise its travel time, yet all vehicles share an interest in avoiding collisions. Mixed-sum games can be analysed with classic matrix games such as the prisoner's dilemma, chicken, and stag hunt, as well as more complex sequential social dilemmas and modern video games like Among Us, Diplomacy, and StarCraft II.
=== Pure competition settings ===
Zero-sum games impose a strict opposition of interests, which simplifies analysis but does not eliminate all difficulties. As agents improve through self-play, they generate autocurricula: layers of increasingly sophisticated strategies, each depending on the previous layer. Notable demonstrations of superhuman performance in such games include the Deep Blue and AlphaGo projects, with AlphaGo in particular refining its policies against self-generated opponents.
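The self-play dynamic can be illustrated on the simplest zero-sum game, matching pennies. The sketch below uses multiplicative-weights updates, a standard no-regret rule chosen for the example (the article does not prescribe a specific algorithm); when two such learners play each other, their time-averaged strategies approach the game's mixed Nash equilibrium.

```python
import numpy as np

# Self-play in matching pennies, a zero-sum matrix game. Both players
# follow the multiplicative-weights no-regret rule against each other's
# current strategy. Individual strategies cycle, but the time-averaged
# strategies approach the unique mixed equilibrium (1/2, 1/2).
# Step size, horizon, and starting points are illustrative choices.
A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])      # row player's payoff; column player gets -A

eta, T = 0.05, 20000
x = np.array([0.9, 0.1])         # start deliberately away from equilibrium
y = np.array([0.2, 0.8])
x_sum, y_sum = np.zeros(2), np.zeros(2)

for _ in range(T):
    x_sum += x
    y_sum += y
    gx, gy = A @ y, -(A.T @ x)   # expected payoff of each pure action
    x = x * np.exp(eta * gx)
    x /= x.sum()
    y = y * np.exp(eta * gy)
    y /= y.sum()

x_avg, y_avg = x_sum / T, y_sum / T   # both end up near [0.5, 0.5]
```

The cycling of the instantaneous strategies, while their average converges, is a small-scale version of the autocurriculum effect: each player's current best response keeps shifting as its opponent adapts.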
=== Pure cooperation settings ===
When agents share identical objectives, the learning problem often reduces to finding coordination strategies that maximise a common reward. Such settings are explored through cooperative games like Overcooked, where success depends on synchronised actions, and through real-world robotics tasks where multiple robots must jointly manipulate objects. Because all agents receive the same reward, the learning dynamics can be stabilised by mechanisms such as shared value functions, centralised training with decentralised execution, and intrinsic-motivation schemes that encourage exploration of complementary behaviours.
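As a minimal illustration of centralised training with decentralised execution, consider a one-shot coordination game in which two agents are rewarded only for matching actions. A central critic (here just a table over joint actions, a stand-in for the learned critics used by methods in this family) is trained with exploration, after which each agent executes its own component of the greedy joint action without communicating. All names and hyperparameters are invented for this sketch.

```python
import numpy as np

# Centralised training: one Q-table over *joint* actions, updated from
# shared-reward experience. Decentralised execution: each agent keeps
# only its own index of the greedy joint action.
rng = np.random.default_rng(0)
Q = np.zeros((2, 2))          # Q[a1, a2], the central critic
alpha, eps = 0.2, 0.2

def reward(a1, a2):           # shared reward: 1 iff the actions match
    return 1.0 if a1 == a2 else 0.0

for _ in range(2000):
    if rng.random() < eps:    # exploratory joint action
        a1, a2 = rng.integers(2), rng.integers(2)
    else:                     # greedy joint action
        a1, a2 = np.unravel_index(np.argmax(Q), Q.shape)
    Q[a1, a2] += alpha * (reward(a1, a2) - Q[a1, a2])

# Execution phase: no communication, each agent plays its own slice
# of the jointly greedy action, which lands on a matching pair.
a1_star, a2_star = np.unravel_index(np.argmax(Q), Q.shape)
```

Real CTDE methods replace the joint table with a factored neural critic, since the table grows exponentially in the number of agents.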
=== Mixed-sum settings ===
Mixed-sum interactions are the most prevalent in practice, as they capture the nuanced reality that agents often have both aligned and competing interests. These settings can be analysed using classic matrix games and more sophisticated sequential social dilemmas. Real-world examples include autonomous driving, where each vehicle aims to minimise its own travel time while jointly avoiding collisions, and multi-robot manipulation, where robots must coordinate grasps without directly incentivising assistance. Communication and social-dilemma dynamics frequently arise, prompting research into mechanisms that encourage cooperation without sacrificing individual incentives.
== Social dilemmas ==
The canonical framework for studying the tension between individual and collective incentives in MARL is the social-dilemma literature, which includes the prisoner's dilemma, chicken, and stag hunt games. While traditional game theory focuses on identifying Nash equilibria and analytically determining optimal static strategies, MARL research emphasises how agents can learn such strategies through interaction. Reinforcement-learning algorithms optimise each agent's own reward, and the resulting conflict between personal and group interests is a central research theme. Techniques such as reward shaping, intrinsic-reward augmentation, and constraint-based policies have been proposed to nudge agents toward cooperative outcomes.
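The pull toward defection can be reproduced with two independent learners in the repeated prisoner's dilemma. The stateless Q-learners and the specific payoff matrix below are a standard textbook setup, used here purely as an illustration of the conflict between individual and group reward.

```python
import numpy as np

# Two independent Q-learners in the repeated prisoner's dilemma.
# Each agent keeps one stateless Q-value per action and updates it only
# from its own reward, ignoring the other agent. Because defection
# strictly dominates, both tables come to rank defect above cooperate,
# even though mutual cooperation would pay more.
rng = np.random.default_rng(0)
C, D = 0, 1
payoff = {(C, C): (3, 3), (C, D): (0, 5),   # payoff[a1, a2] -> (r1, r2)
          (D, C): (5, 0), (D, D): (1, 1)}

alpha, eps = 0.1, 0.1
Q = [np.zeros(2), np.zeros(2)]              # one table per agent

def act(q):
    return int(rng.integers(2)) if rng.random() < eps else int(np.argmax(q))

for _ in range(5000):
    a = (act(Q[0]), act(Q[1]))
    r = payoff[a]
    for i in range(2):                      # bandit-style Q update
        Q[i][a[i]] += alpha * (r[i] - Q[i][a[i]])
```

Reward shaping and the other techniques mentioned above amount to modifying `payoff` or the update so that cooperation becomes individually rational.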
=== Sequential social dilemmas ===
Unlike the static 2×2 matrix games of classical game theory, sequential social dilemmas (SSDs) unfold over multiple timesteps, allowing agents to condition their actions on interaction histories. This temporal dimension enables richer dynamics such as retaliation, forgiveness, and reputation building. Research into SSDs seeks to define appropriate extensions of matrix-game concepts and to develop learning algorithms that can sustain cooperation in these richer environments.
=== Autocurricula ===
An autocurriculum is a phenomenon observed in multi-agent reinforcement-learning experiments in which the improvement of one group of agents reshapes the environment in a way that forces the opposing group to adapt, producing a cascade of increasingly sophisticated behaviours. The classic illustration is the Hide and Seek experiment, in which seekers and hiders iteratively discover and counter each other's strategies: hiders build shelters, seekers construct ramps, hiders lock the ramps away, seekers develop "box-surfing" exploits, and so on. Each newly discovered strategy becomes the foundation for the next, creating a layered curriculum of behaviours.
Researchers have likened autocurricula to major evolutionary transitions, such as the emergence of oxygen-producing photosynthesis that reshaped Earth's atmosphere, arguing that the layered acquisition of capabilities mirrors the way biological and cultural complexity builds upon prior advances. Autocurricula have been shown to foster emergent tool use, language-like communication, and even hierarchical social structures in simulated agent populations.
== Applications ==
The theoretical advances in MARL have been translated into a wide array of practical domains:
- Broadband and cellular networks such as 5G, where multiple base stations must coordinate resource allocation.
- Content caching and packet routing, enabling efficient data movement across networked agents.
- Network security, where multiple defenders collaboratively detect and respond to threats.
- Transmit power control in wireless systems, balancing energy consumption against coverage.
- Computation offloading in edge-computing environments, where tasks are distributed among heterogeneous devices.
- Language evolution research, using MARL to simulate the emergence of communication protocols.
- Global health initiatives that coordinate interventions across distributed agents.
- Integrated circuit design, where multiple design modules must be optimised jointly.
- Internet of Things device coordination for smart homes and cities.
- Microgrid and energy management, where distributed generators and loads must balance supply and demand.
- Multi-camera control, enabling coordinated surveillance or cinematography.
- Autonomous vehicles, where fleets of cars negotiate lane changes, merging, and collision avoidance.
- Sports analytics, where multiple players or teams are modelled as interacting agents.
- Traffic control, including ramp metering and adaptive traffic-signal optimisation.
- Unmanned aerial vehicles, coordinating swarms for inspection or delivery tasks.
- Wildlife conservation, where sensor-laden agents monitor ecosystems and coordinate protection measures.
== AI alignment ==
MARL has been adopted as a sandbox for investigating AI alignment, the problem of ensuring that artificial agents act in accordance with human values. By modelling the interaction between a human supervisor and an autonomous agent as a multi-agent game, researchers can explore scenarios where the agent's incentives diverge from the human's, and can experiment with mechanisms (such as reward modification, communication constraints, or intrinsic-motivation penalties) that mitigate potential conflicts. These experiments help identify the critical variables that influence alignment and inform the design of safer learning systems.
== Limitations ==
Despite its successes, MARL faces several fundamental challenges:
- The environment is no longer stationary from any single agent's perspective: transition probabilities and reward functions effectively change as the other agents adapt, violating the stationarity assumption that underlies many RL algorithms.
- Credit assignment becomes ambiguous when multiple agents contribute to an outcome, making it difficult to attribute responsibility for rewards or penalties.
- Non-stationarity can lead to instability, oscillation, or catastrophic forgetting in learning algorithms.
- Scaling to large numbers of agents exacerbates the combinatorial explosion of joint action spaces, limiting the feasibility of tabular or exhaustive methods.
Researchers address these issues with techniques such as decentralised credit assignment, counterfactual multi-agent policy gradients, and hierarchical coordination architectures.
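The joint-action blow-up behind the scaling problem is easy to quantify: with |A| actions per agent, a tabular joint-action value function needs |A|^N entries per state. A quick back-of-the-envelope check (pure arithmetic, not benchmark data):

```python
# Entries per state in a tabular joint-action value function:
# |A| ** N for N agents with |A| actions each.
n_actions = 5
sizes = {n_agents: n_actions ** n_agents for n_agents in (2, 5, 10, 20)}
for n_agents, size in sizes.items():
    print(n_agents, size)
# 20 agents with 5 actions each already need about 9.5e13 entries
# per state, which is why factored and decentralised methods are used.
```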
== Further reading ==
- Stefano V. Albrecht, Filippos Christianos, Lukas Schäfer. ''Multi-Agent Reinforcement Learning: Foundations and Modern Approaches''. MIT Press, 2024. https://www.marl-book.com
- Kaiqing Zhang, Zhuoran Yang, Tamer Başar. "Multi-agent reinforcement learning: A selective overview of theories and algorithms". ''Handbook of Reinforcement Learning and Control'', Studies in Systems, Decision and Control. Springer, 2021. arXiv:2011.00583 [cs.MA].
== References ==
- ^ Stefano V. Albrecht, Filippos Christianos, Lukas SchĂ€fer. MultiâAgent Reinforcement Learning: Foundations and Modern Approaches. MIT Press, 2024. https://www.marl-book.com
- ^ Lowe, Ryan; Wu, Yi (2020). “MultiâAgent ActorâCritic for Mixed CooperativeâCompetitive Environments”. arXiv :1706.02275v4 [cs.LG]
- ^ Baker, Bowen (2020). “Emergent Reciprocity and Team Formation from Randomized Uncertain Social Preferences”. NeurIPS 2020 proceedings. arXiv :2011.05373.
- ^ Hughes, Edward; Leibo, Joel Z.; et al. (2018). “Inequity aversion improves cooperation in intertemporal social dilemmas”. NeurIPS 2018 proceedings. arXiv :1803.08884.
- ^ Jaques, Natasha; Lazaridou, Angeliki; Hughes, Edward; et al. (2019). “Social Influence as Intrinsic Motivation for MultiâAgent Deep Reinforcement Learning”. Proceedings of the 35th International Conference on Machine Learning. arXiv :1810.08647.
- ^ Lazaridou, Angeliki (2017). “MultiâAgent Cooperation and The Emergence of (Natural) Language”. ICLR 2017. arXiv :1612.07182.
- ^ DuéñezâGuzmĂĄn, Edgar; et al. (2021). “Statistical discrimination in learning agents”. arXiv :2110.11404v1 [cs.LG].
- ^ Campbell, Murray; Hoane, A. Joseph Jr.; Hsu, Fengâhsiung (2002). “Deep Blue”. Artificial Intelligence. 134 (1â2). Elsevier: 57â83. doi :10.1016/S0004-3702(01)00129-1. ISSN 0004-3702.
- ^ Clark, Herbert; WilkesâGibbs, Deanna (February 1986). “Referring as a collaborative process”. Cognition. 22 (1): 1â39. doi :10.1016/0010-0277(86)90010-7. PMID 3709088. S2CID 204981390.
- ^ Boutilier, Craig (17 March 1996). “Planning, learning and coordination in multiâagent decision processes”. Proceedings of the 6th Conference on Theoretical Aspects of Rationality and Knowledge : 195â210.
- ^ Stone, Peter; Kaminka, Gal A.; Kraus, Sarit; Rosenschein, Jeffrey S. (July 2010). Ad Hoc Autonomous Agent Teams: Collaboration without PreâCoordination. AAAI 11.
- ^ Foerster, Jakob N.; Song, H. Francis; Hughes, Edward; Burch, Neil; Dunning, Iain; Whiteson, Shimon; Botvinick, Matthew M; Bowling, Michael H. “Bayesian action decoder for deep multiâagent reinforcement learning”. ICML 2019. arXiv :1811.01458.
- ^ Shih, Andy; Sawhney, Arjun; Kondic, Jovana; Ermon, Stefano; Sadigh, Dorsa. “On the Critical Role of Conventions in Adaptive HumanâAI Collaboration”. ICLR 2021. arXiv :2104.02871.
- ^ Bettini, Matteo; Kortvelesy, Ryan; Blumenkamp, Jan; Prorok, Amanda (2022). “VMAS: A Vectorized MultiâAgent Simulator for Collective Robot Learning”. The 16th International Symposium on Distributed Autonomous Robotic Systems. Springer. arXiv :2207.03530.
- ^ ShalevâShwartz, Shai; Shammah, Shaked; Shashua, Amnon (2016). “Safe, MultiâAgent, Reinforcement Learning for Autonomous Driving”. arXiv :1610.03295 [cs.AI].
- ^ Kopparapu, Kavya; DuéñezâGuzmĂĄn, Edgar A.; Matyas, Jayd; Vezhnevets, Alexander Sasha; Agapiou, John P.; McKee, Kevin R.; Everett, Richard; Marecki, Janusz; Leibo, Joel Z.; Graepel, Thore (2022). “Hidden Agenda: a Social Deduction Game with Diverse Learned Equilibria”. arXiv :2201.01816 [cs.AI].
- ^ Bakhtin, Anton; Brown, Noam; et al. (2022). “Human-level play in the game of Diplomacy by combining language models with strategic reasoning”. Science. 378 (6624): 1067–1074. Bibcode:2022Sci...378.1067M. doi:10.1126/science.ade9097. PMID 36413172. S2CID 253759631.
- ^ Samvelyan, Mikayel; Rashid, Tabish; de Witt, Christian Schroeder; Farquhar, Gregory; Nardelli, Nantas; Rudner, Tim G. J.; Hung, Chia-Man; Torr, Philip H. S.; Foerster, Jakob; Whiteson, Shimon (2019). “The StarCraft Multi-Agent Challenge”. arXiv:1902.04043 [cs.LG].
- ^ Ellis, Benjamin; Moalla, Skander; Samvelyan, Mikayel; Sun, Mingfei; Mahajan, Anuj; Foerster, Jakob N.; Whiteson, Shimon (2022). “SMACv2: An Improved Benchmark for Cooperative Multi-Agent Reinforcement Learning”. arXiv:2212.07489 [cs.LG].
- ^ Sandholm, Tuomas W.; Crites, Robert H. (1996). “Multi-agent reinforcement learning in the Iterated Prisoner’s Dilemma”. Biosystems. 37 (1–2): 147–166. Bibcode:1996BiSys..37..147S. doi:10.1016/0303-2647(95)01551-5. PMID 8924633.
- ^ Peysakhovich, Alexander; Lerer, Adam (2018). “Prosocial Learning Agents Solve Generalized Stag Hunts Better than Selfish Ones”. AAMAS 2018. arXiv:1709.02865.
- ^ Dafoe, Allan; Hughes, Edward; Bachrach, Yoram; et al. (2020). “Open Problems in Cooperative AI”. NeurIPS 2020. arXiv:2012.08630.
- ^ Köster, Raphael; Hadfield-Menell, Dylan; Hadfield, Gillian K.; Leibo, Joel Z. (2020). “Silly rules improve the capacity of agents to learn stable enforcement and compliance behaviors”. AAMAS 2020. arXiv:2001.09318.
- ^ Leibo, Joel Z.; Zambaldi, Vinicius; Lanctot, Marc; Marecki, Janusz; Graepel, Thore (2017). “Multi-Agent Reinforcement Learning in Sequential Social Dilemmas”. AAMAS 2017. arXiv:1702.03037.
- ^ Badjatiya, Pinkesh; Sarkar, Mausoom (2020). “Inducing Cooperative behaviour in Sequential-Social dilemmas through Multi-Agent Reinforcement Learning using Status-Quo Loss”. arXiv:2001.05458 [cs.AI].
- ^ Leibo, Joel Z.; Hughes, Edward; et al. (2019). “Autocurricula and the Emergence of Innovation from Social Interaction: A Manifesto for Multi-Agent Intelligence Research”. arXiv:1903.00742v2 [cs.AI].
- ^ Baker, Bowen; et al. (2020). “Emergent Tool Use From Multi-Agent Autocurricula”. ICLR 2020. arXiv:1909.07528.
- ^ Kasting, James F; Siefert, Janet L (2002). “Life and the evolution of earth’s atmosphere”. Science. 296 (5570): 1066–1068. Bibcode:2002Sci...296.1066K. doi:10.1126/science.1071184. PMID 12004117. S2CID 37190778.
- ^ Clark, Gregory (2008). A farewell to alms: a brief economic history of the world. Princeton University Press. ISBN 978-0-691-14128-2.
- ^ Li, Tianxu; Zhu, Kun; Luong, Nguyen Cong; Niyato, Dusit; Wu, Qihui; Zhang, Yang; Chen, Bing (2021). “Applications of Multi-Agent Reinforcement Learning in Future Internet: A Comprehensive Survey”. arXiv:2110.13484 [cs.AI].
- ^ Le, Ngan; Rathour, Vidhiwar Singh; Yamazaki, Kashu; Luu, Khoa; Savvides, Marios (2021). “Deep Reinforcement Learning in Computer Vision: A Comprehensive Survey”. arXiv:2108.11510 [cs.CV].
- ^ Moulin-Frier, Clément; Oudeyer, Pierre-Yves (2020). “Multi-Agent Reinforcement Learning as a Computational Tool for Language Evolution Research: Historical Context and Future Challenges”. arXiv:2002.08878 [cs.MA].
- ^ Killian, Jackson; Xu, Lily; Biswas, Arpita; Verma, Shresth; et al. (2023). Robust Planning over Restless Groups: Engagement Interventions for a Large-Scale Maternal Telehealth Program. AAAI.
- ^ Krishnan, Srivatsan; Jaques, Natasha; Omidshafiei, Shayegan; Zhang, Dan; Gur, Izzeddin; Reddi, Vijay Janapa; Faust, Aleksandra (2022). “Multi-Agent Reinforcement Learning for Microprocessor Design Space Exploration”. arXiv:2211.16385 [cs.AR].
- ^ Li, Yuanzheng; He, Shangyang; Li, Yang; Shi, Yang; Zeng, Zhigang (2023). “Federated Multi-agent Deep Reinforcement Learning Approach via Physics-Informed Reward for Multimicrogrid Energy Management”. IEEE Transactions on Neural Networks and Learning Systems. PP (5): 5902–5914. arXiv:2301.00641. doi:10.1109/TNNLS.2022.3232630. PMID 37018258. S2CID 255372287.
- ^ Ci, Hai; Liu, Mickel; Pan, Xuehai; Zhong, Fangwei; Wang, Yizhou (2023). Proactive Multi-Camera Collaboration for 3D Human Pose Estimation. International Conference on Learning Representations.
- ^ Vinitsky, Eugene; Kreidieh, Aboudy; Le Flem, Luc; Kheterpal, Nishant; Jang, Kathy; Wu, Fangyu; Liaw, Richard; Liang, Eric; Bayen, Alexandre M. (2018). Benchmarks for reinforcement learning in mixed-autonomy traffic (PDF). Conference on Robot Learning.
- ^ Tuyls, Karl; Omidshafiei, Shayegan; Muller, Paul; Wang, Zhe; Connor, Jerome; Hennes, Daniel; Graham, Ian; Spearman, William; Waskett, Tim; Steele, Dafydd; Luc, Pauline; Recasens, Adria; Galashov, Alexandre; Thornton, Gregory; Elie, Romuald; Sprechmann, Pablo; Moreno, Pol; Cao, Kris; Garnelo, Marta; Dutta, Praneet; Valko, Michal; Heess, Nicolas; Back, Trevor; Ahamed, Razia; Bouton, Simon; Beauguerlange, Nathalie; Broshear, Jackson; Graepel, Thore; Hassabis, Demis (2020). “Game Plan: What AI can do for Football, and What Football can do for AI”. arXiv:2011.09192 [cs.AI].
- ^ Chu, Tianshu; Wang, Jie; Codecà, Lara; Li, Zhaojian (2019). “Multi-Agent Deep Reinforcement Learning for Large-scale Traffic Signal Control”. IEEE Transactions on Intelligent Transportation Systems. 21 (3): 1086. arXiv:1903.04527. Bibcode:2020ITITr..21.1086C. doi:10.1109/TITS.2019.2901791.
- ^ Belletti, Francois; Haziza, Daniel; Gomes, Gabriel; Bayen, Alexandre M. (2017). “Expert Level control of Ramp Metering based on Multi-task Deep Reinforcement Learning”. arXiv:1701.08832 [cs.AI].
- ^ Ding, Yahao; Yang, Zhaohui; Pham, Quoc-Viet; Zhang, Zhaoyang; Shikh-Bahaei, Mohammad (2023). “Distributed Machine Learning for UAV Swarms: Computing, Sensing, and Semantics”. arXiv:2301.00912 [cs.LG].
- ^ Xu, Lily; Perrault, Andrew; Fang, Fei; Chen, Haipeng; Tambe, Milind (2021). “Robust Reinforcement Learning Under Minimax Regret for Green Security”. arXiv:2106.08413 [cs.LG].
- ^ Leike, Jan; Martic, Miljan; Krakovna, Victoria; Ortega, Pedro A.; Everitt, Tom; Lefrancq, Andrew; Orseau, Laurent; Legg, Shane (2017). “AI Safety Gridworlds”. arXiv:1711.09883 [cs.AI].
- ^ Hadfield-Menell, Dylan; Dragan, Anca; Abbeel, Pieter; Russell, Stuart (2016). “The Off-Switch Game”. arXiv:1611.08219 [cs.AI].
- ^ Hernandez-Leal, Pablo; Kartal, Bilal; Taylor, Matthew E. (2019-11-01). “A survey and critique of multi-agent deep reinforcement learning”. Autonomous Agents and Multi-Agent Systems. 33 (6): 750–797. arXiv:1810.05587. doi:10.1007/s10458-019-09421-1. ISSN 1573-7454. S2CID 52981002.
== External links ==
- [[Scholia]] has a topic profile for [[Multi-agent reinforcement learning]].