Learning to Settle: Reinforcement Learning in Catan

Anirud Aggarwal, Jinhai Yan, Serena Huang, Rohit Kommuru, Monish Napa, Han Lin
University of Maryland, College Park
Short GIF of RL agents playing a game of Catan on a hex board

We build RL agents that play a simplified version of the board game Catan, starting from a fork of Settlers-RL and adding a custom PettingZoo-based environment, dense rewards, and PPO training to explore both multi-agent and single-agent reinforcement learning.

Abstract

Reinforcement learning has achieved striking results in classic games like Go and Chess, but multi-player board games with rich player interaction remain challenging. In this project we study RL agents for Catan, a four-player, turn-based strategy game where players build roads, settlements, and cities to reach a target victory-point score.

Our original goal was to train multi-agent policies using algorithms from MARLlib in a custom PettingZoo environment. In practice, library assumptions about parallel agents and synchronous play conflicted with Catan’s sequential, turn-based structure. Even after extensive engineering and masking, MARL methods failed to converge.

To obtain quantitative results despite this, we pivoted to a single-agent PPO baseline in a simplified Catan setting: no trading, adjustable victory conditions, and dense reward shaping. The resulting agents learn to shorten games and accumulate more victory points than random baselines, but still struggle to consistently win against evolving opponents. The gap between these results and our original MARL goals highlights both the promise and the difficulty of learning in non-stationary multi-player environments.

Project Overview

We started from the open-source Settlers-RL implementation and refactored it into a cleaner, modular codebase suitable for both single-agent and multi-agent RL experiments. The project has three main pieces:

  • Environment: a custom PettingZoo environment for Catan with adjustable victory conditions and optional PyGame rendering.
  • Training stack: experiments with MARLlib (MAPPO, MAA2C, MADDPG, MATRPO) followed by a pure PyTorch PPO implementation when MARLlib proved incompatible.
  • Infrastructure: training on a Google Cloud VM with an NVIDIA T4 GPU and 32 vCPUs, with many games played in parallel for efficient experience collection.

Environment and Representations

Catan is modeled as a turn-based multi-agent environment using the Agent Environment Cycle (AEC) API from PettingZoo. Each agent can take multiple actions in sequence on a turn (roll dice, move the robber, build, buy cards, end turn, etc.), which we expose through a structured observation and action design.
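
To make the turn structure concrete, the skeleton below shows how such an environment can be expressed against PettingZoo's AEC interface. It is a trimmed-down sketch: class and attribute names, vector sizes, and the turn-ending rule are illustrative placeholders rather than the exact code in our repository.

```python
# Trimmed-down sketch of a turn-based Catan environment on PettingZoo's AEC API.
# Board logic and legality checks are stubbed out; only the control flow is shown
# (one agent acts at a time and may take several actions before ending its turn).
import numpy as np
from gymnasium import spaces
from pettingzoo import AECEnv
from pettingzoo.utils import agent_selector

OBS_DIM = 500   # placeholder size for the flat observation vector
ACT_DIM = 172   # concatenated action heads (see "Action Space" below)


class CatanAECEnv(AECEnv):
    metadata = {"name": "catan_v0"}

    def __init__(self, num_players=4, victory_points_to_win=10):
        super().__init__()
        self.possible_agents = [f"player_{i}" for i in range(num_players)]
        self.victory_points_to_win = victory_points_to_win

    def observation_space(self, agent):
        return spaces.Box(0.0, 1.0, (OBS_DIM,), np.float32)

    def action_space(self, agent):
        return spaces.MultiBinary(ACT_DIM)

    def reset(self, seed=None, options=None):
        self.agents = self.possible_agents[:]
        self.rewards = {a: 0.0 for a in self.agents}
        self._cumulative_rewards = {a: 0.0 for a in self.agents}
        self.terminations = {a: False for a in self.agents}
        self.truncations = {a: False for a in self.agents}
        self.infos = {a: {} for a in self.agents}
        self._agent_selector = agent_selector(self.agents)
        self.agent_selection = self._agent_selector.next()

    def observe(self, agent):
        # Encode board + own state + public opponent info into one flat vector.
        return np.zeros(OBS_DIM, dtype=np.float32)

    def step(self, action):
        agent = self.agent_selection
        self._cumulative_rewards[agent] = 0.0
        # ... decode the action, apply it to the board, compute step rewards ...
        if self._action_ends_turn(action):
            # Only advance to the next player when the current one ends its turn,
            # so an agent can roll, build, and buy cards within a single turn.
            self.agent_selection = self._agent_selector.next()
        self._accumulate_rewards()

    def _action_ends_turn(self, action):
        return bool(action[-1])  # placeholder; the real env inspects the action-type head
```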

Observation Space

Each agent observes a single flat vector with three components:

  • Board: resource type and number for every tile, robber position, roads, settlements, and cities.
  • Self: the agent’s own buildings, resource counts, development-card counts, achievements (Longest Road / Largest Army), and current victory points.
  • Opponents: publicly visible buildings, turn order, and other shared public information (hidden hands and unplayed cards are never observed).

This encoding stays compact but still lets the policy reason about long-term plans such as expanding toward high-yield tiles or blocking key intersections.
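
As a rough illustration, a flat observation of this kind can be assembled by concatenating normalized feature blocks. The field names and normalization constants below are hypothetical stand-ins, not the exact encoding used in the environment.

```python
# Hypothetical sketch of flattening board, self, and opponent features into one
# vector; attribute names and normalization constants are illustrative only.
import numpy as np

N_TILES, N_RESOURCES = 19, 5

def encode_observation(board, me, opponents):
    parts = [
        np.eye(N_RESOURCES + 1)[board.tile_resources].ravel(),  # resource per tile (incl. desert)
        board.tile_numbers / 12.0,                               # dice number per tile
        np.eye(N_TILES)[board.robber_tile],                      # robber position
        board.corner_owners.ravel(),                             # settlements / cities per corner
        board.edge_owners.ravel(),                               # roads per edge
        me.resource_counts / 20.0,                               # own hand
        me.dev_card_counts / 5.0,                                # own development cards
        np.array([me.victory_points / 10.0,
                  float(me.has_longest_road),
                  float(me.has_largest_army)]),
    ]
    for opp in opponents:  # public information only; hidden hands are never encoded
        parts.append(np.array([opp.num_settlements, opp.num_cities,
                               opp.num_roads, opp.visible_victory_points]) / 10.0)
    return np.concatenate(parts).astype(np.float32)
```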

Action Space

We parameterize the action space with seven separate action heads, so the agent can compose complex moves while still sharing context across related decisions:

  • Action Type: one-hot indicator of the high-level action (e.g., roll dice, build, trade, play a development card, end turn).
  • Tile Action: selects a board tile (used for moving the robber) as a one-hot vector over the 19 tiles.
  • Corner Action: chooses a corner for placing a settlement or city, encoded as a one-hot vector over 54 corners.
  • Edge Action: chooses an edge for road placement, represented as a one-hot vector over 73 edges.
  • Development Card Action: selects which development card type to play, using a one-hot encoding.
  • Resource Action: uses two one-hot vectors to specify resource choices for operations like exchanging or discarding.
  • Player Action: picks the target player for interactions such as stealing, encoded as a one-hot vector over opponents.

These heads are concatenated into a 172-dimensional action vector, mirroring the observation representation and letting the network learn coordinated strategies across heads.
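
In PyTorch, one way to realize this is a shared trunk feeding one linear output per head. The tile, corner, and edge sizes below follow the description above; the remaining head sizes are illustrative guesses chosen so the heads sum to 172, and the real architecture may differ.

```python
# Hypothetical multi-head policy: a shared trunk feeds one logit layer per head.
# Tile/corner/edge sizes follow the text; the other sizes are illustrative guesses.
import torch
import torch.nn as nn

HEAD_SIZES = {
    "action_type": 8,   # roll, build, buy/play dev card, trade, move robber, end turn, ...
    "tile": 19,         # robber destination
    "corner": 54,       # settlement / city placement
    "edge": 73,         # road placement
    "dev_card": 5,      # which development card to play
    "resource": 10,     # two 5-way resource choices (give / receive)
    "player": 3,        # which opponent to target, e.g. for stealing
}                       # total: 172 logits

class CatanPolicy(nn.Module):
    def __init__(self, obs_dim, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.heads = nn.ModuleDict({name: nn.Linear(hidden, size)
                                    for name, size in HEAD_SIZES.items()})
        self.value = nn.Linear(hidden, 1)  # critic head for PPO

    def forward(self, obs):
        h = self.trunk(obs)
        logits = {name: head(h) for name, head in self.heads.items()}
        return logits, self.value(h)
```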

Reward Design

Dense, Step-Based Rewards

Sparse “win or lose at the end” rewards make long Catan games difficult to learn from. We instead use a dense reward signal that gives feedback at each step:

  • Big bonus for winning the game; penalty for losing.
  • Positive reward for gaining victory points or upgrading settlements to cities.
  • Smaller rewards for productive actions like playing development cards or moving the robber.
  • Penalties for discarding resources or attempting invalid actions.

In practice, these shaped rewards help the agent discover the game structure faster, while still aligning long-term behavior with the goal of winning.
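
A condensed version of this kind of shaped reward is sketched below; the magnitudes and state fields are placeholders rather than the tuned values and exact bookkeeping from our experiments.

```python
# Illustrative dense step reward; magnitudes and state fields are placeholders,
# not the tuned values or exact bookkeeping used in our experiments.
def step_reward(prev, curr, action_was_valid, game_over, won):
    r = 0.0
    if game_over:
        r += 10.0 if won else -10.0                              # terminal win / loss signal
    r += 1.0 * (curr.victory_points - prev.victory_points)       # VP gains, incl. city upgrades
    r += 0.1 * (curr.dev_cards_played - prev.dev_cards_played)   # productive card play
    r += 0.05 * int(curr.robber_moved_this_step)                 # productive robber moves
    r -= 0.1 * curr.resources_discarded_this_step                # discourage discards
    if not action_was_valid:
        r -= 0.5                                                 # discourage invalid actions
    return r
```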

Experiments and Findings

Multi-Agent RL (Attempted)

We first evaluated MAPPO and other MARL algorithms from MARLlib on simpler tasks (e.g., a multi-particle “simple-spread” environment) and observed stable convergence. But when plugged into Catan, MARLlib’s assumption of parallel, synchronous agents conflicted with our sequential AEC environment. Observations would become stale between turns, leading to illegal actions and non-convergent training even with aggressive masking and hyperparameter tuning.
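
For context, the legality masking we relied on follows the usual pattern of pushing illegal logits to negative infinity before sampling; the generic sketch below shows the idea and is not MARLlib-specific code.

```python
# Generic action-masking sketch: illegal logits are set to -inf so the resulting
# categorical distribution assigns them zero probability. Not MARLlib internals.
import torch

def masked_sample(logits, legal_mask):
    """logits: (batch, n); legal_mask: bool tensor of the same shape,
    with at least one legal action per row."""
    masked = logits.masked_fill(~legal_mask, float("-inf"))
    dist = torch.distributions.Categorical(logits=masked)
    action = dist.sample()
    return action, dist.log_prob(action)
```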

Single-Agent PPO Baseline

To measure learning progress nonetheless, we implemented PPO directly in PyTorch and trained a single agent in two settings: standard 10-point Catan and a faster 4-point version for quicker, curriculum-like experiments. Aggregated metrics across training show:

  • Shorter games and fewer decisions: the agent learns to play more efficiently, reducing both average game length and the number of actions per game.
  • More victory points: average victory points per game increase over time, indicating that the policy is learning productive building patterns.
  • Limited win rates: despite better play, the agent’s overall win fraction improves only modestly against baselines and earlier versions of itself.
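
For reference, the clipped surrogate at the core of this PPO setup looks roughly like the condensed sketch below; a full training loop would add the usual machinery (advantage estimation, minibatching, gradient clipping, parallel experience collection) that we omit here.

```python
# Condensed sketch of the standard clipped PPO objective; advantage estimation,
# minibatching, and other training-loop machinery are omitted.
import torch

def ppo_loss(new_log_probs, old_log_probs, advantages, values, returns,
             entropy, clip_eps=0.2, value_coef=0.5, entropy_coef=0.01):
    ratio = torch.exp(new_log_probs - old_log_probs)         # pi_new / pi_old
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * advantages
    policy_loss = -torch.min(unclipped, clipped).mean()      # clipped surrogate
    value_loss = (values - returns).pow(2).mean()            # critic regression
    entropy_bonus = entropy.mean()                           # keeps the policy exploratory
    return policy_loss + value_coef * value_loss - entropy_coef * entropy_bonus
```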

Training Curves

The plots below mirror Figures 4–6 in the report and summarize how our PPO agent behaves in both 10-point and 4-point Catan.

Training loss curves (entropy, action, value, validation) for 10-point Catan
Figure 4 — Losses (10-point Catan): entropy drops sharply and action loss decreases modestly, while validation loss stays nearly flat, suggesting the policy becomes more decisive without fully generalizing to unseen states.
Evaluation curves for win rate, victory points, game length and decisions per game in 10-point Catan
Figure 5 — Evaluation (10-point Catan): games get shorter and require fewer decisions, and average victory points rise, but the win rate against baselines improves only slightly, suggesting the agent learns more efficient play without reliably converting it into consistent wins against strong opponents.

Evaluation curves for the 4-point Catan experiment against random and past policies
Figure 6 — 4-point Catan: the agent quickly learns to beat random policies, but its performance against earlier versions of itself degrades over time, hinting at overfitting and the need for true multi-agent methods to capture non-stationary opponent behavior.

Why Multi-Agent Matters

The single-agent PPO results suggest that optimizing against a mostly static environment is not enough: Catan’s difficulty lies in evolving opponent strategies and the resulting non-stationary dynamics. Our experiments reinforce that truly strong Catan agents will likely need full multi-agent methods or explicit modeling of other players, not just a stronger single-agent baseline.

Gameplay Demo

The clip below shows an RL agent playing through a full game of Catan in our environment. The board, dice rolls, robber moves, and building decisions are all driven by the learned policy.

A higher-resolution version is also available in the repository as gameplay-rl-agents.mp4.

Future Directions

  • Solidify a stable single-agent baseline with a framework like Stable-Baselines3, including robust masking and clean integration with the AEC environment.
  • Implement turn-based MARL algorithms (e.g., MAPPO-style centralized critic) directly in PyTorch, with curriculum learning from simpler sub-tasks (road placement, opening builds, etc.).
  • Integrate forward search (e.g., Monte Carlo Tree Search) or human-in-the-loop debugging to improve planning and interpretability.
  • Extend the framework to other strategic multiplayer games (e.g., 7 Wonders) to stress-test MARL methods in different synchronous/asynchronous regimes.

Citation

If you use this environment or codebase in your own work, please consider citing our webpage.