
Building artificial intelligence for two-player, zero-sum board games has reached superhuman levels in many domains, from chess to Go. However, creating AI that excels in complex, cooperative strategy games is a significantly harder challenge. In these environments, agents must not only learn good policies for themselves but also anticipate, communicate with, and synchronize with other agents to maximize a shared global reward. This blog post explores the architectural nuances of implementing Multi-Agent Reinforcement Learning (MARL) for cooperative scenarios, bridging the gap between theoretical concepts and practical code.

The Core Challenge: Non-Stationarity

The fundamental hurdle in MARL is that the stationarity assumption behind the Markov Decision Process (MDP) breaks down. In single-agent RL, the environment's transition and reward dynamics stay fixed while the agent updates its policy. In multi-agent systems, every agent is learning and changing its behavior at the same time, so from the perspective of any single agent the environment is effectively non-stationary. A strategy that was optimal moments ago may become suboptimal because the agent's teammates have changed their policies. To address this, developers often employ Centralized Training with Decentralized Execution (CTDE), where agents train with access to global state information (typically through a centralized critic) but execute policies based only on local observations.
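To make the non-stationarity concrete, here is a minimal sketch using a made-up 2x2 cooperative matrix game. The payoff table, the two teammate policies, and the helper names are all hypothetical and chosen only for illustration. It shows that the learner's best action flips once the teammate's policy drifts, even though the game itself never changes.

```python
# Hypothetical 2x2 cooperative matrix game: both agents receive PAYOFF[a][b]
# when the learner plays action a and its teammate plays action b.
PAYOFF = [
    [10.0, 0.0],  # learner plays action 0
    [4.0, 6.0],   # learner plays action 1
]


def expected_returns(teammate_policy):
    """Expected shared reward of each learner action for a fixed teammate policy."""
    return [
        sum(p * r for p, r in zip(teammate_policy, row))
        for row in PAYOFF
    ]


def best_response(teammate_policy):
    """Index of the learner action with the highest expected shared reward."""
    returns = expected_returns(teammate_policy)
    return max(range(len(returns)), key=returns.__getitem__)


# The teammate's policy drifts during training: early on it mostly plays
# action 0, later it mostly plays action 1.
early_teammate = [0.9, 0.1]
late_teammate = [0.2, 0.8]

for label, policy in [("early", early_teammate), ("late", late_teammate)]:
    print(label, expected_returns(policy), "best response:", best_response(policy))
```

From the learner's point of view, the same action yields a different expected reward at different points in training, which is exactly what a single-agent algorithm does not expect.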

Setting Up the Environment

Before diving into training algorithms, we need a robust environment. The PettingZoo library is a widely used standard for multi-agent reinforcement learning, offering a unified interface for various environments, including cooperative Multi-Agent Particle Environment (MPE) scenarios such as "simple_spread" and "simple_speaker_listener". It provides both a turn-based (AEC) API and a Parallel API, and it exposes the observation and action spaces of each agent through a consistent per-agent interface.

Let's look at how to initialize a simple cooperative environment and process the observations for a neural network.

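from pettingzoo.mpe import simple_spread_v3

# Initialize a simple cooperative environment (turn-based AEC API).
# Note: recent PettingZoo releases are moving MPE into the separate `mpe2`
# package, so the import path may need adjusting for your installed version.
env = simple_spread_v3.env()
env.reset(seed=42)

# Example of extracting the observation for one agent.
# observe() returns only the observation array, not an (observation, info) tuple.
agent_name = env.agents[0]
observation = env.observe(agent_name)

# The observation typically contains:
# 1. The agent's own velocity and position
# 2. Positions of the landmarks to cover (relative to self)
# 3. Positions of other agents (relative to self)
# 4. Communication values from other agents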

print(f"Observation shape for {agent_name}: {observation.shape}")
print(f"Available actions: {env.action_space(agent_name)}")
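To feed several agents through one network, their observations usually need to be stacked into a single batch. The following is a small sketch of that step, assuming all agents share the same observation shape. The helper name `stack_observations` and the zero-filled 18-dimensional vectors are placeholders for illustration, not real environment output.

```python
import numpy as np


def stack_observations(obs_by_agent, agent_order):
    """Stack per-agent observation vectors into one (n_agents, obs_dim) float32 array."""
    rows = [np.asarray(obs_by_agent[name], dtype=np.float32) for name in agent_order]
    obs_shapes = {row.shape for row in rows}
    if len(obs_shapes) != 1:
        raise ValueError(f"agents have mismatched observation shapes: {obs_shapes}")
    return np.stack(rows, axis=0)


# Placeholder data standing in for real per-agent observations.
agent_order = ["agent_0", "agent_1", "agent_2"]
obs_by_agent = {name: np.zeros(18) for name in agent_order}

batch = stack_observations(obs_by_agent, agent_order)
print(batch.shape, batch.dtype)
```

Fixing the agent order explicitly keeps row i of the batch tied to the same agent on every step, which matters later when actions are mapped back to agents.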

Architecture Design: Actor-Critic with Shared Weights

For cooperative tasks with homogeneous agents, sharing weights between agents (parameter sharing) is often beneficial. It exploits the symmetry between interchangeable agents and can substantially reduce the number of samples needed for convergence, because every agent's experience updates the same parameters. We can use a shared Actor-Critic network whose input is each agent's local observation, which in simple_spread already includes the relative positions of teammates.

Below is a simplified PyTorch implementation of a shared MLP network with actor and critic heads.

import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiAgentNetwork(nn.Module):
    def __init__(self, obs_dim, act_dim, hidden_dim=64):
        super().__init__()
        
        # Shared feature extractor
        self.shared_layers = nn.Sequential(
            nn.Linear(obs_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU()
        )
        
        # Actor head (policy)
        self.actor_head = nn.Linear(hidden_dim, act_dim)
        
        # Critic head (value function)
        self.critic_head = nn.Linear(hidden_dim, 1)

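    def forward(self, obs):
        features = self.shared_layers(obs)
        # The actor head outputs unnormalized logits; log_softmax converts
        # them into log-probabilities over the discrete actions.
        logits = self.actor_head(features)
        log_probs = F.log_softmax(logits, dim=-1)
        values = self.critic_head(features)
        return log_probs, values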
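As a usage sketch of parameter sharing, the snippet below pushes one observation row per agent through a single network and samples a discrete action for each agent. `SharedActorCritic` is a compact, hypothetical stand-in for the shared network above, and the sizes (3 agents, 18 observation features, 5 actions) are placeholder assumptions rather than values read from an environment.

```python
import torch
import torch.nn as nn
from torch.distributions import Categorical


class SharedActorCritic(nn.Module):
    """Compact stand-in for the shared network: one body, two heads."""

    def __init__(self, obs_dim, act_dim, hidden_dim=64):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, hidden_dim), nn.ReLU())
        self.actor_head = nn.Linear(hidden_dim, act_dim)
        self.critic_head = nn.Linear(hidden_dim, 1)

    def forward(self, obs):
        features = self.body(obs)
        # Returns unnormalized action logits and a state-value estimate.
        return self.actor_head(features), self.critic_head(features)


torch.manual_seed(0)
n_agents, obs_dim, act_dim = 3, 18, 5  # placeholder sizes

net = SharedActorCritic(obs_dim, act_dim)
obs_batch = torch.randn(n_agents, obs_dim)  # one row per agent

# A single forward pass serves every agent because the weights are shared.
with torch.no_grad():
    logits, values = net(obs_batch)

dist = Categorical(logits=logits)
actions = dist.sample()             # one discrete action per agent
log_probs = dist.log_prob(actions)  # needed later for the policy-gradient loss

print(actions.shape, log_probs.shape, values.shape)
```

Because `Categorical` normalizes whatever it receives through `logits=`, the same sampling code works whether the actor head returns raw logits or log-probabilities.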

Training Dynamics: Advantage Estimation

In cooperative games, the reward is typically shared across the team and is often sparse, so a good estimate of the advantage function is critical for stable training. We use Generalized Advantage Estimation (GAE) to trade off bias against variance through the parameter lam. The code snippet below demonstrates how to compute GAE for a single rollout, ensuring that agents learn not just what reward they received, but how much better an action was than the baseline value estimate.

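def compute_gae(rewards, values, next_values, dones, gamma=0.99, lam=0.95):
    """Compute GAE advantages and returns for one rollout of length T.

    rewards, values, dones: 1-D float tensors of length T. This code assumes
    dones[t] == 1.0 means the transition at step t ended the episode.
    next_values: 1-D tensor of length T; only its last entry is used, as the
    bootstrap value for the state that follows the final step.
    """
    advantages = torch.zeros_like(rewards)
    lastgaelam = 0.0
    for t in reversed(range(len(rewards))):
        # Bootstrap from next_values at the end of the rollout,
        # otherwise from the value estimate of the next stored step.
        if t == len(rewards) - 1:
            nextvalues = next_values[t]
        else:
            nextvalues = values[t + 1]
        # Mask out bootstrapping and accumulation across episode boundaries.
        nextnonterminal = 1.0 - dones[t]

        # One-step TD error, then the exponentially weighted GAE recursion.
        delta = rewards[t] + gamma * nextvalues * nextnonterminal - values[t]
        advantages[t] = lastgaelam = delta + gamma * lam * nextnonterminal * lastgaelam

    returns = advantages + values
    return returns, advantages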
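A quick way to sanity-check a GAE implementation is to look at its two limiting cases. The sketch below uses a plain-Python reference, `gae_reference`, which is a hypothetical helper written for this check, on a made-up three-step episode. It assumes the same convention as above, where dones[t] marks that step t ended the episode. With lam = 0 the result should equal the one-step TD errors, and with lam = 1 it should equal the discounted return minus the value baseline.

```python
def gae_reference(rewards, values, bootstrap_value, dones, gamma=0.99, lam=0.95):
    """Plain-Python GAE over one rollout; dones[t] marks that step t ended an episode."""
    advantages = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        next_value = bootstrap_value if t == len(rewards) - 1 else values[t + 1]
        not_done = 1.0 - dones[t]
        delta = rewards[t] + gamma * next_value * not_done - values[t]
        running = delta + gamma * lam * not_done * running
        advantages[t] = running
    return advantages


# Made-up episode: a single reward of 1.0 arrives on the final step.
rewards = [0.0, 0.0, 1.0]
values = [0.5, 0.6, 0.7]
dones = [0.0, 0.0, 1.0]  # the episode ends after the last step
gamma = 0.9

# lam = 0 reduces GAE to the one-step TD error at every step.
td_errors = gae_reference(rewards, values, 0.0, dones, gamma=gamma, lam=0.0)

# lam = 1 reduces GAE to the discounted return minus the value baseline.
mc_advantages = gae_reference(rewards, values, 0.0, dones, gamma=gamma, lam=1.0)

# Discounted returns computed independently for comparison.
discounted_returns = [gamma**2 * 1.0, gamma * 1.0, 1.0]
expected_mc = [g - v for g, v in zip(discounted_returns, values)]

print("TD errors:", td_errors)
print("Monte Carlo advantages:", mc_advantages)
print("Expected:", expected_mc)
```

Values of lam between these two extremes interpolate between the low-variance, higher-bias TD estimate and the high-variance, lower-bias Monte Carlo estimate.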

Conclusion

Implementing Multi-Agent Reinforcement Learning for cooperative strategy games requires a deep understanding of both neural network architectures and environmental dynamics. By leveraging shared policies, robust environment libraries like PettingZoo, and advanced advantage estimation techniques, developers can create AI agents that exhibit emergent cooperative behavior. As the field progresses, we will likely see more sophisticated communication channels and hierarchical structures, allowing agents to solve even more complex strategic challenges. Start small, visualize your agents' trajectories frequently, and iterate on your reward shaping to guide your AI toward true cooperation.