In this post, I'll walk through the process of training an AI agent to play Orbito, a strategic board game that combines elements of Connect Four with unique rotation mechanics. We'll use Deep Q-Learning (DQN) with several advanced techniques to create a competent AI player. You can play the game here: Playground
The Game Environment
First, let's understand the game environment. Orbito is played on a 4x4 board where, on each turn, a player takes one of the following actions:
- Moving an opponent's marble
- Placing their own marble
- Using the Orbito button to rotate sections of the board
The goal is to create a line of four marbles of your color, either horizontally, vertically, or diagonally.
```python
from typing import Optional
import gymnasium as gym  # assumed import; the classic `gym` package also provides gym.Env
import numpy as np

class OrbitoEnv(gym.Env):
    def __init__(self, render_mode: Optional[str] = None):
        self.size = 4  # 4x4 board
        self.board = np.zeros((self.size, self.size), dtype=np.int8)
        self.marbles_left = {1: 8, 2: 8}  # each player starts with 8 marbles
        self.orbito_presses = 0
```
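To make the win condition concrete, here is a minimal line-of-four check over the board array. `check_win` is a hypothetical helper sketched for this post, not necessarily how the environment implements it:

```python
import numpy as np

def check_win(board: np.ndarray, player: int) -> bool:
    """Return True if `player` has four marbles in a row, column, or diagonal."""
    mask = (board == player)
    # Any complete row or column of the player's marbles
    if mask.all(axis=1).any() or mask.all(axis=0).any():
        return True
    # Main diagonal and anti-diagonal
    return bool(np.diag(mask).all() or np.diag(np.fliplr(mask)).all())
```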
The Neural Network Architecture
Our DQN agent uses a convolutional neural network to process the board state:
```python
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, board_size: int = 4):
        super().__init__()
        # Process the board state with two small convolutions
        self.conv1 = nn.Conv2d(1, 32, kernel_size=2, stride=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=2, stride=1)
        # Flattened conv output plus 3 additional features feed the dense head
        conv_out_size = 64 * 2 * 2
        self.fc1 = nn.Linear(conv_out_size + 3, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 81)  # 81 possible actions
```
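The forward pass is not shown above. Here is a sketch of one that matches these layer shapes, assuming the three extra inputs are scalar features such as the remaining marble counts and the Orbito press counter (my guess, not confirmed by the post):

```python
import torch
import torch.nn.functional as F

def forward(self, board: torch.Tensor, extra: torch.Tensor) -> torch.Tensor:
    # board: (batch, 1, 4, 4); extra: (batch, 3) auxiliary scalar features (assumed)
    x = F.relu(self.conv1(board))      # -> (batch, 32, 3, 3)
    x = F.relu(self.conv2(x))          # -> (batch, 64, 2, 2)
    x = x.flatten(start_dim=1)         # -> (batch, 256)
    x = torch.cat([x, extra], dim=1)   # append the 3 extra features
    x = F.relu(self.fc1(x))
    x = F.relu(self.fc2(x))
    return self.fc3(x)                 # one Q-value per action (81 total)
```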
Advanced Training Techniques
1. Prioritized Experience Replay
To help the agent learn from important experiences, we implement prioritized experience replay. This gives higher sampling probability to rare actions and important state transitions:
```python
class PrioritizedReplayBuffer:
    def __init__(self, capacity=10000, alpha=0.6, beta=0.4):
        self.capacity = capacity
        self.alpha = alpha  # priority exponent
        self.beta = beta    # importance-sampling exponent
        self.action_counts = {}  # track how often each action has been stored

    def push(self, state, action, reward, next_state, done):
        # Priority based on action rarity: seldom-seen actions get higher priority
        self.action_counts[action] = self.action_counts.get(action, 0) + 1
        total_actions = sum(self.action_counts.values())
        action_frequency = self.action_counts[action] / total_actions
        priority = (1.0 / (action_frequency + 0.1)) ** self.alpha
        # ... the transition and its priority are then stored in the buffer
```
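The training loop below expects `sample(batch_size)` to return a batch together with importance-sampling weights. A minimal sketch, assuming transitions are kept in a `self.buffer` list with a parallel `self.priorities` list (neither is shown in the snippet above):

```python
import numpy as np

def sample(self, batch_size):
    # Sampling probabilities proportional to the stored priorities
    priorities = np.asarray(self.priorities[:len(self.buffer)], dtype=np.float64)
    probs = priorities / priorities.sum()
    indices = np.random.choice(len(self.buffer), batch_size, p=probs)
    # Importance-sampling weights correct for the non-uniform sampling
    weights = (len(self.buffer) * probs[indices]) ** (-self.beta)
    weights /= weights.max()
    experiences = [self.buffer[i] for i in indices]
    return experiences, weights
```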
2. Intrinsic Rewards for Exploration
To encourage exploration of different board states, we implement a state visitation tracker that provides intrinsic rewards for discovering new states:
```python
from collections import defaultdict
import numpy as np

class StateVisitationTracker:
    def __init__(self, decay=0.999):
        self.state_counts = defaultdict(int)
        self.decay = decay

    def hash_state(self, state):
        return hash(state.tobytes())  # the board is tiny, hashing raw bytes is enough

    def get_intrinsic_reward(self, state):
        state_hash = self.hash_state(state)
        count = self.state_counts[state_hash]
        self.state_counts[state_hash] += 1  # record this visit
        # Novel states get the full bonus; familiar ones decay as 1/sqrt(count)
        return 0.5 / np.sqrt(count) if count > 0 else 0.5
```
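A quick illustration of how the bonus shrinks as the same board is revisited (continuing from the class above):

```python
import numpy as np

tracker = StateVisitationTracker()
board = np.zeros((4, 4), dtype=np.int8)
for visit in range(1, 4):
    print(visit, tracker.get_intrinsic_reward(board))
# 1 0.5     (novel state, full bonus)
# 2 0.5     (count was 1 -> 0.5 / sqrt(1))
# 3 ~0.354  (count was 2 -> 0.5 / sqrt(2))
```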
Training Process
The training process combines all these elements:
```python
def train_dqn(args):
    env = OrbitoEnv()
    agent = DQNAgent()
    replay_buffer = PrioritizedReplayBuffer()
    state_tracker = StateVisitationTracker()

    for episode in range(args.episodes):
        state, _ = env.reset()
        done = False
        while not done:
            # Choose among the currently legal moves with an epsilon-greedy policy
            valid_actions = get_valid_actions(env)
            epsilon = max(args.epsilon_min,
                          args.epsilon * (args.epsilon_decay ** episode))
            action = agent.act(state, valid_actions, epsilon)

            # Take the action and observe the environment reward
            next_state, reward, done, _, info = env.step(action)

            # Add an intrinsic bonus for visiting novel board states
            intrinsic_reward = state_tracker.get_intrinsic_reward(next_state)
            combined_reward = reward + args.intrinsic_reward_scale * intrinsic_reward

            # Store the experience and train on a prioritized batch
            replay_buffer.push(state, action, combined_reward, next_state, done)
            if len(replay_buffer) > args.batch_size:
                experiences, weights = replay_buffer.sample(args.batch_size)
                agent.train_prioritized(experiences, weights)

            state = next_state
```
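The agent's `act` method is referenced but not shown. A plausible epsilon-greedy version that masks illegal moves might look like this; `policy_net` and `encode_state` are assumed names, not taken from the repository:

```python
import random
import torch

def act(self, state, valid_actions, epsilon):
    # Explore: pick a random legal move
    if random.random() < epsilon:
        return random.choice(valid_actions)
    # Exploit: pick the legal move with the highest predicted Q-value
    with torch.no_grad():
        # encode_state is a hypothetical helper that builds the network inputs
        q_values = self.policy_net(*self.encode_state(state)).squeeze(0)
    masked = torch.full_like(q_values, float("-inf"))
    masked[valid_actions] = q_values[valid_actions]
    return int(masked.argmax().item())
```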
Training Parameters
We use the following hyperparameters for training:
```bash
python train/prioritized_train_dqn.py \
    --episodes 1000 \
    --target-update 10 \
    --save-interval 100 \
    --render-interval 100 \
    --learning-rate 0.0005 \
    --epsilon-decay 0.997 \
    --intrinsic-reward-scale 0.5 \
    --batch-size 128
```
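For reference, these flags map onto an `argparse` parser roughly as follows; the defaults shown here (including the starting epsilon) are illustrative assumptions rather than values copied from the repository:

```python
import argparse

def parse_args():
    p = argparse.ArgumentParser(description="Train a DQN agent on Orbito")
    p.add_argument("--episodes", type=int, default=1000)
    p.add_argument("--target-update", type=int, default=10,
                   help="episodes between target-network syncs")
    p.add_argument("--save-interval", type=int, default=100)
    p.add_argument("--render-interval", type=int, default=100)
    p.add_argument("--learning-rate", type=float, default=5e-4)
    p.add_argument("--epsilon", type=float, default=1.0)        # assumed default
    p.add_argument("--epsilon-min", type=float, default=0.05)   # assumed default
    p.add_argument("--epsilon-decay", type=float, default=0.997)
    p.add_argument("--intrinsic-reward-scale", type=float, default=0.5)
    p.add_argument("--batch-size", type=int, default=128)
    return p.parse_args()
```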
Results and Visualization
During training, we track various metrics:
- Episode rewards
- Action distribution
- State novelty rewards
- Action type distribution over time
These metrics are visualized and saved periodically, allowing us to monitor the agent's learning progress and behavior patterns.
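The plotting code lives in the repository; a minimal stand-in for the episode-reward curve, assuming rewards are accumulated in a plain list, could look like this:

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_rewards(episode_rewards, window=50, path="rewards.png"):
    rewards = np.asarray(episode_rewards, dtype=float)
    plt.figure()
    plt.plot(rewards, alpha=0.3, label="per episode")
    if len(rewards) >= window:
        # Moving average to make the trend visible through the noise
        smoothed = np.convolve(rewards, np.ones(window) / window, mode="valid")
        plt.plot(range(window - 1, len(rewards)), smoothed, label=f"{window}-episode mean")
    plt.xlabel("Episode")
    plt.ylabel("Reward")
    plt.legend()
    plt.savefig(path)
    plt.close()
```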
Conclusion
Training an AI to play Orbito presents unique challenges due to the game's complex action space and strategic depth. By combining DQN with prioritized experience replay and intrinsic rewards, we create an agent that can learn effective strategies while maintaining a good balance between exploration and exploitation.
The trained agent demonstrates the ability to:
- Make strategic marble placements
- Effectively use opponent marble movement
- Utilize the Orbito rotation mechanic at appropriate times
- Adapt its strategy based on the board state
Future improvements could include:
- Self-play training to develop more advanced strategies
- Curriculum learning to gradually increase opponent difficulty
- Monte Carlo Tree Search (MCTS) integration for better planning
The complete code and training logs are available in the repository. Feel free to experiment with different hyperparameters and training strategies!
Try It Yourself!
Want to challenge the trained AI agent? Head over to the Orbito Playground to play against it! Test your strategic thinking and see if you can outsmart the AI in this engaging board game. The agent uses the techniques described in this blog post, including prioritized experience replay and intrinsic rewards for exploration.
Whether you're interested in AI, game theory, or just looking for a fun challenge, give it a try and let me know how you fare against the AI!