Multi-agent reinforcement learning
'''Multi-agent reinforcement learning''' (often abbreviated MARL) is a sub-field of reinforcement learning that studies how multiple autonomous agents can learn to make sequential decisions in a shared environment when their interests may align, conflict, or be completely antagonistic. In contrast to single-agent reinforcement learning, where the environment is stationary from the learner's perspective, MARL introduces a dynamic, non-stationary system in which each agent's policy can influence, and be influenced by, the policies of the other agents. This interaction gives rise to phenomena studied in game theory and the theory of repeated games, such as emergent group dynamics, and it has motivated the development of a rich taxonomy of learning paradigms, problem classes, and evaluation metrics unique to the multi-agent setting.
== Definition ==
Multi-agent reinforcement learning can be formally modelled as an extension of the Markov decision process (MDP) to multiple interacting agents. Let
- ''I'' = {1, …, N} be the set of agents,
- ''S'' be the set of environment states,
- ''A_i'' be the action set available to agent ''i'', and
- ''P'' be the transition probability function
[ P_{\vec a}(s, s') = \Pr(s_{t+1} = s' \mid s_t = s, \vec a_t = \vec a) ]
which gives the probability of moving from state s to s' under the joint action vector \vec a = (a_1, …, a_N). Each joint action yields an immediate reward vector R_{\vec a}(s, s') that is a function of the transition and the actions taken.
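The definition above can be made concrete with a toy stochastic (Markov) game. This is purely an illustrative sketch: the state and action counts, the random transition tensor, and all variable names are invented for the example, not taken from the article.

```python
import numpy as np

# Toy two-agent Markov game matching the definition above: a transition
# tensor P indexed by (state, joint action) giving a distribution over
# next states, and a reward tensor R returning one reward per agent.
rng = np.random.default_rng(0)
N_STATES, N_ACTIONS = 3, 2

# P[s, a1, a2] is a probability distribution over next states s'.
P = rng.dirichlet(np.ones(N_STATES), size=(N_STATES, N_ACTIONS, N_ACTIONS))
# R[s, a1, a2] is the reward vector (one component per agent).
R = rng.normal(size=(N_STATES, N_ACTIONS, N_ACTIONS, 2))

def step(s, a1, a2):
    """Sample s' ~ P_a(s, .) and return it with the joint reward."""
    s_next = rng.choice(N_STATES, p=P[s, a1, a2])
    return s_next, R[s, a1, a2]

s, rewards = step(0, 1, 0)
```

In this tabular form the transition function is exactly the P of the definition; each agent sees its own component of the reward vector.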
In environments with perfect information, such as the games of chess and Go, the MDP is fully observable: each agent can infer the complete state from its observations. In contrast, real-world domains such as self-driving cars involve imperfect information, where each agent receives only a partial observation of the true state. In such partially observable settings the underlying model becomes a partially observable stochastic game, and in the fully cooperative variant it is known as a decentralized partially observable Markov decision process (Dec-POMDP).
== Cooperation vs. competition ==
MARL explores a spectrum of inter-agent relationships, ranging from pure competition to pure cooperation, with many hybrid possibilities in between:
In pure competition settings, the agents' reward functions are exactly opposite, and the problem reduces to a zero-sum game. Classical board games like chess and Go exemplify this regime, as do two-player variants of video games such as StarCraft. Because an agent can only increase its own score at the direct expense of its opponent, the problem is stripped of communication and social-dilemma considerations; however, autocurricula can still emerge as each agent continually adapts to the other's improving strategy.
In pure cooperation settings, all agents receive identical rewards, eliminating any incentive for selfish deviation. Cooperative environments are often modelled with games such as Overcooked and are also relevant to real-world robotics tasks like coordinated manipulation. Here the challenge is not conflict but the coordination of many simultaneously optimal policies, which often leads to the emergence of conventions and shared communication protocols.
Mixed-sum settings blend cooperative and competitive incentives. In these games each agent pursues its own distinct goal while still being indirectly affected by the others. For example, in traffic-flow scenarios each autonomous vehicle seeks to minimise its travel time, yet all vehicles share an interest in avoiding collisions. Mixed-sum games can be analysed with classic matrix games such as the prisoner's dilemma, chicken, and stag hunt, as well as more complex sequential social dilemmas and modern video games like Among Us, Diplomacy, and StarCraft II.
=== Pure competition settings ===
Zero-sum games impose a strict opposition of interests, which simplifies analysis but does not eliminate all difficulties. As agents improve through self-play, they generate autocurricula: layers of increasingly sophisticated strategies, each depending on the previous layer. Notable demonstrations of superhuman performance in such games include the Deep Blue and AlphaGo projects, with AlphaGo in particular refining its policies against self-generated opponents.
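The self-play dynamic can be illustrated on the simplest zero-sum game, matching pennies. The sketch below uses multiplicative-weights updates, a standard no-regret rule chosen for the example (the article does not prescribe a specific algorithm); when two such learners play each other, their time-averaged strategies approach the game's mixed Nash equilibrium.

```python
import numpy as np

# Self-play in matching pennies, a zero-sum matrix game. Both players
# follow the multiplicative-weights no-regret rule against each other's
# current strategy. Individual strategies cycle, but the time-averaged
# strategies approach the unique mixed equilibrium (1/2, 1/2).
# Step size, horizon, and starting points are illustrative choices.
A = np.array([[1.0, -1.0],
              [-1.0, 1.0]])      # row player's payoff; column player gets -A

eta, T = 0.05, 20000
x = np.array([0.9, 0.1])         # start deliberately away from equilibrium
y = np.array([0.2, 0.8])
x_sum, y_sum = np.zeros(2), np.zeros(2)

for _ in range(T):
    x_sum += x
    y_sum += y
    gx, gy = A @ y, -(A.T @ x)   # expected payoff of each pure action
    x = x * np.exp(eta * gx)
    x /= x.sum()
    y = y * np.exp(eta * gy)
    y /= y.sum()

x_avg, y_avg = x_sum / T, y_sum / T   # both end up near [0.5, 0.5]
```

The cycling of the instantaneous strategies, while their average converges, is a small-scale version of the autocurriculum effect: each player's current best response keeps shifting as its opponent adapts.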
=== Pure cooperation settings ===
When agents share identical objectives, the learning problem often reduces to finding coordination strategies that maximise a common reward. Such settings are explored through cooperative games like Overcooked, where success depends on synchronised actions, and through real-world robotics tasks where multiple robots must jointly manipulate objects. Because all agents receive the same reward, the learning dynamics can be stabilised by mechanisms such as shared value functions, centralised training with decentralised execution, and intrinsic-motivation schemes that encourage exploration of complementary behaviours.
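As a minimal illustration of centralised training with decentralised execution, consider a one-shot coordination game in which two agents are rewarded only for matching actions. A central critic (here just a table over joint actions, a stand-in for the learned critics used by methods in this family) is trained with exploration, after which each agent executes its own component of the greedy joint action without communicating. All names and hyperparameters are invented for this sketch.

```python
import numpy as np

# Centralised training: one Q-table over *joint* actions, updated from
# shared-reward experience. Decentralised execution: each agent keeps
# only its own index of the greedy joint action.
rng = np.random.default_rng(0)
Q = np.zeros((2, 2))          # Q[a1, a2], the central critic
alpha, eps = 0.2, 0.2

def reward(a1, a2):           # shared reward: 1 iff the actions match
    return 1.0 if a1 == a2 else 0.0

for _ in range(2000):
    if rng.random() < eps:    # exploratory joint action
        a1, a2 = rng.integers(2), rng.integers(2)
    else:                     # greedy joint action
        a1, a2 = np.unravel_index(np.argmax(Q), Q.shape)
    Q[a1, a2] += alpha * (reward(a1, a2) - Q[a1, a2])

# Execution phase: no communication, each agent plays its own slice
# of the jointly greedy action, which lands on a matching pair.
a1_star, a2_star = np.unravel_index(np.argmax(Q), Q.shape)
```

Real CTDE methods replace the joint table with a factored neural critic, since the table grows exponentially in the number of agents.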
=== Mixed-sum settings ===
Mixed-sum interactions are the most prevalent in practice, as they capture the nuanced reality that agents often have both aligned and competing interests. These settings can be analysed using classic matrix games and more sophisticated sequential social dilemmas. Real-world examples include autonomous driving, where each vehicle aims to minimise its own travel time while jointly avoiding collisions, and multi-robot manipulation, where robots must coordinate grasps without directly incentivising assistance. Communication and social-dilemma dynamics frequently arise, prompting research into mechanisms that encourage cooperation without sacrificing individual incentives.
== Social dilemmas ==
The canonical framework for studying the tension between individual and collective incentives in MARL is the social-dilemma literature, which includes the prisoner's dilemma, chicken, and stag hunt games. While traditional game theory focuses on identifying Nash equilibria and analytically determining optimal static strategies, MARL research emphasises how agents can learn such strategies through interaction. Reinforcement-learning algorithms optimise each agent's own reward, and the resulting conflict between personal and group interests is a central research theme. Techniques such as reward shaping, intrinsic-reward augmentation, and constraint-based policies have been proposed to nudge agents toward cooperative outcomes.
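The pull toward defection can be reproduced with two independent learners in the repeated prisoner's dilemma. The stateless Q-learners and the specific payoff matrix below are a standard textbook setup, used here purely as an illustration of the conflict between individual and group reward.

```python
import numpy as np

# Two independent Q-learners in the repeated prisoner's dilemma.
# Each agent keeps one stateless Q-value per action and updates it only
# from its own reward, ignoring the other agent. Because defection
# strictly dominates, both tables come to rank defect above cooperate,
# even though mutual cooperation would pay more.
rng = np.random.default_rng(0)
C, D = 0, 1
payoff = {(C, C): (3, 3), (C, D): (0, 5),   # payoff[a1, a2] -> (r1, r2)
          (D, C): (5, 0), (D, D): (1, 1)}

alpha, eps = 0.1, 0.1
Q = [np.zeros(2), np.zeros(2)]              # one table per agent

def act(q):
    return int(rng.integers(2)) if rng.random() < eps else int(np.argmax(q))

for _ in range(5000):
    a = (act(Q[0]), act(Q[1]))
    r = payoff[a]
    for i in range(2):                      # bandit-style Q update
        Q[i][a[i]] += alpha * (r[i] - Q[i][a[i]])
```

Reward shaping and the other techniques mentioned above amount to modifying `payoff` or the update so that cooperation becomes individually rational.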
=== Sequential social dilemmas ===
Unlike the static 2×2 matrix games of classical game theory, sequential social dilemmas (SSDs) unfold over multiple timesteps, allowing agents to condition their actions on interaction histories. This temporal dimension enables richer dynamics such as retaliation, forgiveness, and reputation building. Research into SSDs seeks to define appropriate extensions of matrix-game concepts and to develop learning algorithms that can sustain cooperation in these richer environments.
=== Autocurricula ===
An autocurriculum is a phenomenon observed in multi-agent reinforcement-learning experiments in which the improvement of one group of agents reshapes the environment in a way that forces the opposing group to adapt, producing a cascade of increasingly sophisticated behaviours. The classic illustration is the Hide and Seek experiment, in which seekers and hiders iteratively discover and counter each other's strategies: hiders build shelters, seekers construct ramps, hiders lock the ramps away, seekers develop "box-surfing" exploits, and so on. Each newly discovered strategy becomes the foundation for the next, creating a layered curriculum of behaviours.
Researchers have likened autocurricula to major evolutionary transitions, such as the emergence of oxygen-producing photosynthesis that reshaped Earth's atmosphere, arguing that the layered acquisition of capabilities mirrors the way biological and cultural complexity builds upon prior advances. Autocurricula have been shown to foster emergent tool use, language-like communication, and even hierarchical social structures in simulated agent populations.
== Applications ==
The theoretical advances in MARL have been translated into a wide array of practical domains:
- Broadband and cellular networks such as 5G, where multiple base stations must coordinate resource allocation.
- Content caching and packet routing, enabling efficient data movement across networked agents.
- Network security, where multiple defenders collaboratively detect and respond to threats.
- Transmit power control in wireless systems, balancing energy consumption against coverage.
- Computation offloading in edge-computing environments, where tasks are distributed among heterogeneous devices.
- Language evolution research, using MARL to simulate the emergence of communication protocols.
- Global health initiatives that coordinate interventions across distributed agents.
- Integrated circuit design, where multiple design modules must be optimised jointly.
- Internet of Things device coordination for smart homes and cities.
- Microgrid and energy management, where distributed generators and loads must balance supply and demand.
- Multi-camera control, enabling coordinated surveillance or cinematography.
- Autonomous vehicles, where fleets of cars negotiate lane changes, merging, and collision avoidance.
- Sports analytics, where multiple players or teams are modelled as interacting agents.
- Traffic control, including ramp metering and adaptive traffic-signal optimisation.
- Unmanned aerial vehicles, coordinating swarms for inspection or delivery tasks.
- Wildlife conservation, where sensor-laden agents monitor ecosystems and coordinate protection measures.
== AI alignment ==
MARL has been adopted as a sandbox for investigating AI alignment, the problem of ensuring that artificial agents act in accordance with human values. By modelling the interaction between a human supervisor and an autonomous agent as a multi-agent game, researchers can explore scenarios where the agent's incentives diverge from the human's, and can experiment with mechanisms (such as reward modification, communication constraints, or intrinsic-motivation penalties) that mitigate potential conflicts. These experiments help identify the critical variables that influence alignment and inform the design of safer learning systems.
== Limitations ==
Despite its successes, MARL faces several fundamental challenges:
- The environment is no longer stationary from any single agent's perspective: transition probabilities and reward functions effectively change as the other agents adapt, violating the stationarity assumption that underlies many RL algorithms.
- Credit assignment becomes ambiguous when multiple agents contribute to an outcome, making it difficult to attribute responsibility for rewards or penalties.
- Non-stationarity can lead to instability, oscillation, or catastrophic forgetting in learning algorithms.
- Scaling to large numbers of agents exacerbates the combinatorial explosion of joint action spaces, limiting the feasibility of tabular or exhaustive methods.
Researchers address these issues with techniques such as decentralised credit assignment, counterfactual multi-agent policy gradients, and hierarchical coordination architectures.
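The joint-action blow-up behind the scaling problem is easy to quantify: with |A| actions per agent, a tabular joint-action value function needs |A|^N entries per state. A quick back-of-the-envelope check (pure arithmetic, not benchmark data):

```python
# Entries per state in a tabular joint-action value function:
# |A| ** N for N agents with |A| actions each.
n_actions = 5
sizes = {n_agents: n_actions ** n_agents for n_agents in (2, 5, 10, 20)}
for n_agents, size in sizes.items():
    print(n_agents, size)
# 20 agents with 5 actions each already need about 9.5e13 entries
# per state, which is why factored and decentralised methods are used.
```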
== Further reading ==
- Stefano V. Albrecht, Filippos Christianos, Lukas Schäfer. ''Multi-Agent Reinforcement Learning: Foundations and Modern Approaches''. MIT Press, 2024. https://www.marl-book.com
- Kaiqing Zhang, Zhuoran Yang, Tamer Başar. "Multi-agent reinforcement learning: A selective overview of theories and algorithms". ''Handbook of Reinforcement Learning and Control'', Studies in Systems, Decision and Control. Springer, 2021. arXiv:2011.00583 [cs.MA].
== References ==
- ^ Stefano V. Albrecht, Filippos Christianos, Lukas SchĂ€fer. MultiâAgent Reinforcement Learning: Foundations and Modern Approaches. MIT Press, 2024. https://www.marl-book.com
- ^ Lowe, Ryan; Wu, Yi (2020). “MultiâAgent ActorâCritic for Mixed CooperativeâCompetitive Environments”. arXiv :1706.02275v4 [cs.LG]
- ^ Baker, Bowen (2020). “Emergent Reciprocity and Team Formation from Randomized Uncertain Social Preferences”. NeurIPS 2020 proceedings. arXiv :2011.05373.
- ^ Hughes, Edward; Leibo, Joel Z.; et al. (2018). “Inequity aversion improves cooperation in intertemporal social dilemmas”. NeurIPS 2018 proceedings. arXiv :1803.08884.
- ^ Jaques, Natasha; Lazaridou, Angeliki; Hughes, Edward; et al. (2019). “Social Influence as Intrinsic Motivation for MultiâAgent Deep Reinforcement Learning”. Proceedings of the 35th International Conference on Machine Learning. arXiv :1810.08647.
- ^ Lazaridou, Angeliki (2017). “MultiâAgent Cooperation and The Emergence of (Natural) Language”. ICLR 2017. arXiv :1612.07182.
- ^ DuéñezâGuzmĂĄn, Edgar; et al. (2021). “Statistical discrimination in learning agents”. arXiv :2110.11404v1 [cs.LG].
- ^ Campbell, Murray; Hoane, A. Joseph Jr.; Hsu, Fengâhsiung (2002). “Deep Blue”. Artificial Intelligence. 134 (1â2). Elsevier: 57â83. doi :10.1016/S0004-3702(01)00129-1. ISSN 0004-3702.
- ^ Clark, Herbert; WilkesâGibbs, Deanna (February 1986). “Referring as a collaborative process”. Cognition. 22 (1): 1â39. doi :10.1016/0010-0277(86)90010-7. PMID 3709088. S2CID 204981390.
- ^ Boutilier, Craig (17 March 1996). “Planning, learning and coordination in multiâagent decision processes”. Proceedings of the 6th Conference on Theoretical Aspects of Rationality and Knowledge : 195â210.
- ^ Stone, Peter; Kaminka, Gal A.; Kraus, Sarit; Rosenschein, Jeffrey S. (July 2010). Ad Hoc Autonomous Agent Teams: Collaboration without PreâCoordination. AAAI 11.
- ^ Foerster, Jakob N.; Song, H. Francis; Hughes, Edward; Burch, Neil; Dunning, Iain; Whiteson, Shimon; Botvinick, Matthew M; Bowling, Michael H. “Bayesian action decoder for deep multiâagent reinforcement learning”. ICML 2019. arXiv :1811.01458.
- ^ Shih, Andy; Sawhney, Arjun; Kondic, Jovana; Ermon, Stefano; Sadigh, Dorsa. “On the Critical Role of Conventions in Adaptive HumanâAI Collaboration”. ICLR 2021. arXiv :2104.02871.
- ^ Bettini, Matteo; Kortvelesy, Ryan; Blumenkamp, Jan; Prorok, Amanda (2022). “VMAS: A Vectorized MultiâAgent Simulator for Collective Robot Learning”. The 16th International Symposium on Distributed Autonomous Robotic Systems. Springer. arXiv :2207.03530.
- ^ ShalevâShwartz, Shai; Shammah, Shaked; Shashua, Amnon (2016). “Safe, MultiâAgent, Reinforcement Learning for Autonomous Driving”. arXiv :1610.03295 [cs.AI].
- ^ Kopparapu, Kavya; DuéñezâGuzmĂĄn, Edgar A.; Matyas, Jayd; Vezhnevets, Alexander Sasha; Agapiou, John P.; McKee, Kevin R.; Everett, Richard; Marecki, Janusz; Leibo, Joel Z.; Graepel, Thore (2022). “Hidden Agenda: a Social Deduction Game with Diverse Learned Equilibria”. arXiv :2201.01816 [cs.AI].
- ^ Bakhtin, Anton; Brown, Noam; et al. (2022). “Human-level play in the game of Diplomacy by combining language models with strategic reasoning”. Science. 378 (6624): 1067–1074. Bibcode:2022Sci...378.1067M. doi:10.1126/science.ade9097. PMID 36413172. S2CID 253759631.
- ^ Samvelyan, Mikayel; Rashid, Tabish; de Witt, Christian Schroeder; Farquhar, Gregory; Nardelli, Nantas; Rudner, Tim G. J.; Hung, Chia-Man; Torr, Philip H. S.; Foerster, Jakob; Whiteson, Shimon (2019). “The StarCraft Multi-Agent Challenge”. arXiv:1902.04043 [cs.LG].
- ^ Ellis, Benjamin; Moalla, Skander; Samvelyan, Mikayel; Sun, Mingfei; Mahajan, Anuj; Foerster, Jakob N.; Whiteson, Shimon (2022). “SMACv2: An Improved Benchmark for Cooperative Multi-Agent Reinforcement Learning”. arXiv:2212.07489 [cs.LG].
- ^ Sandholm, Tuomas W.; Crites, Robert H. (1996). “Multi-agent reinforcement learning in the Iterated Prisoner’s Dilemma”. Biosystems. 37 (1–2): 147–166. Bibcode:1996BiSys..37..147S. doi:10.1016/0303-2647(95)01551-5. PMID 8924633.
- ^ Peysakhovich, Alexander; Lerer, Adam (2018). “Prosocial Learning Agents Solve Generalized Stag Hunts Better than Selfish Ones”. AAMAS 2018. arXiv:1709.02865.
- ^ Dafoe, Allan; Hughes, Edward; Bachrach, Yoram; et al. (2020). “Open Problems in Cooperative AI”. NeurIPS 2020. arXiv:2012.08630.
- ^ Köster, Raphael; Hadfield-Menell, Dylan; Hadfield, Gillian K.; Leibo, Joel Z. (2020). “Silly rules improve the capacity of agents to learn stable enforcement and compliance behaviors”. AAMAS 2020. arXiv:2001.09318.
- ^ Leibo, Joel Z.; Zambaldi, Vinicius; Lanctot, Marc; Marecki, Janusz; Graepel, Thore (2017). “Multi-Agent Reinforcement Learning in Sequential Social Dilemmas”. AAMAS 2017. arXiv:1702.03037.
- ^ Badjatiya, Pinkesh; Sarkar, Mausoom (2020). “Inducing Cooperative behaviour in Sequential-Social dilemmas through Multi-Agent Reinforcement Learning using Status-Quo Loss”. arXiv:2001.05458 [cs.AI].
- ^ Leibo, Joel Z.; Hughes, Edward; et al. (2019). “Autocurricula and the Emergence of Innovation from Social Interaction: A Manifesto for Multi-Agent Intelligence Research”. arXiv:1903.00742v2 [cs.AI].
- ^ Baker, Bowen; et al. (2020). “Emergent Tool Use From Multi-Agent Autocurricula”. ICLR 2020. arXiv:1909.07528.
- ^ Kasting, James F; Siefert, Janet L (2002). “Life and the evolution of earth’s atmosphere”. Science. 296 (5570): 1066–1068. Bibcode:2002Sci...296.1066K. doi:10.1126/science.1071184. PMID 12004117. S2CID 37190778.
- ^ Clark, Gregory (2008). A farewell to alms: a brief economic history of the world. Princeton University Press. ISBN 978-0-691-14128-2.
- ^ Li, Tianxu; Zhu, Kun; Luong, Nguyen Cong; Niyato, Dusit; Wu, Qihui; Zhang, Yang; Chen, Bing (2021). “Applications of Multi-Agent Reinforcement Learning in Future Internet: A Comprehensive Survey”. arXiv:2110.13484 [cs.AI].
- ^ Le, Ngan; Rathour, Vidhiwar Singh; Yamazaki, Kashu; Luu, Khoa; Savvides, Marios (2021). “Deep Reinforcement Learning in Computer Vision: A Comprehensive Survey”. arXiv:2108.11510 [cs.CV].
- ^ Moulin-Frier, Clément; Oudeyer, Pierre-Yves (2020). “Multi-Agent Reinforcement Learning as a Computational Tool for Language Evolution Research: Historical Context and Future Challenges”. arXiv:2002.08878 [cs.MA].
- ^ Killian, Jackson; Xu, Lily; Biswas, Arpita; Verma, Shresth; et al. (2023). Robust Planning over Restless Groups: Engagement Interventions for a Large-Scale Maternal Telehealth Program. AAAI.
- ^ Krishnan, Srivatsan; Jaques, Natasha; Omidshafiei, Shayegan; Zhang, Dan; Gur, Izzeddin; Reddi, Vijay Janapa; Faust, Aleksandra (2022). “Multi-Agent Reinforcement Learning for Microprocessor Design Space Exploration”. arXiv:2211.16385 [cs.AR].
- ^ Li, Yuanzheng; He, Shangyang; Li, Yang; Shi, Yang; Zeng, Zhigang (2023). “Federated Multi-agent Deep Reinforcement Learning Approach via Physics-Informed Reward for Multimicrogrid Energy Management”. IEEE Transactions on Neural Networks and Learning Systems. PP (5): 5902–5914. arXiv:2301.00641. doi:10.1109/TNNLS.2022.3232630. PMID 37018258. S2CID 255372287.
- ^ Ci, Hai; Liu, Mickel; Pan, Xuehai; Zhong, Fangwei; Wang, Yizhou (2023). Proactive Multi-Camera Collaboration for 3D Human Pose Estimation. International Conference on Learning Representations.
- ^ Vinitsky, Eugene; Kreidieh, Aboudy; Le Flem, Luc; Kheterpal, Nishant; Jang, Kathy; Wu, Fangyu; Liaw, Richard; Liang, Eric; Bayen, Alexandre M. (2018). Benchmarks for reinforcement learning in mixed-autonomy traffic (PDF). Conference on Robot Learning.
- ^ Tuyls, Karl; Omidshafiei, Shayegan; Muller, Paul; Wang, Zhe; Connor, Jerome; Hennes, Daniel; Graham, Ian; Spearman, William; Waskett, Tim; Steele, Dafydd; Luc, Pauline; Recasens, Adria; Galashov, Alexandre; Thornton, Gregory; Elie, Romuald; Sprechmann, Pablo; Moreno, Pol; Cao, Kris; Garnelo, Marta; Dutta, Praneet; Valko, Michal; Heess, Nicolas; Back, Trevor; Ahamed, Razia; Bouton, Simon; Beauguerlange, Nathalie; Broshear, Jackson; Graepel, Thore; Hassabis, Demis (2020). “Game Plan: What AI can do for Football, and What Football can do for AI”. arXiv:2011.09192 [cs.AI].
- ^ Chu, Tianshu; Wang, Jie; Codecà, Lara; Li, Zhaojian (2019). “Multi-Agent Deep Reinforcement Learning for Large-scale Traffic Signal Control”. IEEE Transactions on Intelligent Transportation Systems. 21 (3): 1086. arXiv:1903.04527. Bibcode:2020ITITr..21.1086C. doi:10.1109/TITS.2019.2901791.
- ^ Belletti, Francois; Haziza, Daniel; Gomes, Gabriel; Bayen, Alexandre M. (2017). “Expert Level control of Ramp Metering based on Multi-task Deep Reinforcement Learning”. arXiv:1701.08832 [cs.AI].
- ^ Ding, Yahao; Yang, Zhaohui; Pham, Quoc-Viet; Zhang, Zhaoyang; Shikh-Bahaei, Mohammad (2023). “Distributed Machine Learning for UAV Swarms: Computing, Sensing, and Semantics”. arXiv:2301.00912 [cs.LG].
- ^ Xu, Lily; Perrault, Andrew; Fang, Fei; Chen, Haipeng; Tambe, Milind (2021). “Robust Reinforcement Learning Under Minimax Regret for Green Security”. arXiv:2106.08413 [cs.LG].
- ^ Leike, Jan; Martic, Miljan; Krakovna, Victoria; Ortega, Pedro A.; Everitt, Tom; Lefrancq, Andrew; Orseau, Laurent; Legg, Shane (2017). “AI Safety Gridworlds”. arXiv:1711.09883 [cs.AI].
- ^ Hadfield-Menell, Dylan; Dragan, Anca; Abbeel, Pieter; Russell, Stuart (2016). “The Off-Switch Game”. arXiv:1611.08219 [cs.AI].
- ^ Hernandez-Leal, Pablo; Kartal, Bilal; Taylor, Matthew E. (2019-11-01). “A survey and critique of multi-agent deep reinforcement learning”. Autonomous Agents and Multi-Agent Systems. 33 (6): 750–797. arXiv:1810.05587. doi:10.1007/s10458-019-09421-1. ISSN 1573-7454. S2CID 52981002.
== External links ==
- [[Scholia]] has a topic profile for [[Multi-agent reinforcement learning]].