In this post, I'll walk through the process of training an AI agent to play Orbito, a strategic board game that combines elements of Connect Four with unique rotation mechanics. We'll use Deep Q-Learning (DQN) with several advanced techniques to create a competent AI player. You can play the game here: Playground
The Game Environment
First, let's understand the game environment. Orbito is played on a 4x4 board where, on each turn, a player takes one of the following actions:
- Moving an opponent's marble
- Placing their own marble
- Using the Orbito button to rotate sections of the board
The goal is to create a line of four marbles of your color, either horizontally, vertically, or diagonally.
```python
from typing import Optional
import gymnasium as gym  # assumed import; the classic `gym` package also provides gym.Env
import numpy as np

class OrbitoEnv(gym.Env):
    def __init__(self, render_mode: Optional[str] = None):
        self.size = 4  # 4x4 board
        self.board = np.zeros((self.size, self.size), dtype=np.int8)
        self.marbles_left = {1: 8, 2: 8}  # each player starts with 8 marbles
        self.orbito_presses = 0
```
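To make the win condition concrete, here is a minimal line-of-four check over the board array. `check_win` is a hypothetical helper sketched for this post, not necessarily how the environment implements it:

```python
import numpy as np

def check_win(board: np.ndarray, player: int) -> bool:
    """Return True if `player` has four marbles in a row, column, or diagonal."""
    mask = (board == player)
    # Any complete row or column of the player's marbles
    if mask.all(axis=1).any() or mask.all(axis=0).any():
        return True
    # Main diagonal and anti-diagonal
    return bool(np.diag(mask).all() or np.diag(np.fliplr(mask)).all())
```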
The Neural Network Architecture
Our DQN agent uses a convolutional neural network to process the board state:
```python
import torch.nn as nn

class DQN(nn.Module):
    def __init__(self, board_size: int = 4):
        super().__init__()
        # Process the board state with two small convolutions
        self.conv1 = nn.Conv2d(1, 32, kernel_size=2, stride=1)
        self.conv2 = nn.Conv2d(32, 64, kernel_size=2, stride=1)
        # Flattened conv output plus 3 additional features feed the dense head
        conv_out_size = 64 * 2 * 2
        self.fc1 = nn.Linear(conv_out_size + 3, 256)
        self.fc2 = nn.Linear(256, 128)
        self.fc3 = nn.Linear(128, 81)  # 81 possible actions
```
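The forward pass is not shown above. Here is a sketch of one that matches these layer shapes, assuming the three extra inputs are scalar features such as the remaining marble counts and the Orbito press counter (my guess, not confirmed by the post):

```python
import torch
import torch.nn.functional as F

def forward(self, board: torch.Tensor, extra: torch.Tensor) -> torch.Tensor:
    # board: (batch, 1, 4, 4); extra: (batch, 3) auxiliary scalar features (assumed)
    x = F.relu(self.conv1(board))      # -> (batch, 32, 3, 3)
    x = F.relu(self.conv2(x))          # -> (batch, 64, 2, 2)
    x = x.flatten(start_dim=1)         # -> (batch, 256)
    x = torch.cat([x, extra], dim=1)   # append the 3 extra features
    x = F.relu(self.fc1(x))
    x = F.relu(self.fc2(x))
    return self.fc3(x)                 # one Q-value per action (81 total)
```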
Advanced Training Techniques
1. Prioritized Experience Replay
To help the agent learn from important experiences, we implement prioritized experience replay. This gives higher sampling probability to rare actions and important state transitions:
```python
class PrioritizedReplayBuffer:
    def __init__(self, capacity=10000, alpha=0.6, beta=0.4):
        self.capacity = capacity
        self.alpha = alpha  # priority exponent
        self.beta = beta    # importance-sampling exponent
        self.action_counts = {}  # track how often each action has been stored

    def push(self, state, action, reward, next_state, done):
        # Priority based on action rarity: seldom-seen actions get higher priority
        self.action_counts[action] = self.action_counts.get(action, 0) + 1
        total_actions = sum(self.action_counts.values())
        action_frequency = self.action_counts[action] / total_actions
        priority = (1.0 / (action_frequency + 0.1)) ** self.alpha
        # ... the transition and its priority are then stored in the buffer
```
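The training loop below expects `sample(batch_size)` to return a batch together with importance-sampling weights. A minimal sketch, assuming transitions are kept in a `self.buffer` list with a parallel `self.priorities` list (neither is shown in the snippet above):

```python
import numpy as np

def sample(self, batch_size):
    # Sampling probabilities proportional to the stored priorities
    priorities = np.asarray(self.priorities[:len(self.buffer)], dtype=np.float64)
    probs = priorities / priorities.sum()
    indices = np.random.choice(len(self.buffer), batch_size, p=probs)
    # Importance-sampling weights correct for the non-uniform sampling
    weights = (len(self.buffer) * probs[indices]) ** (-self.beta)
    weights /= weights.max()
    experiences = [self.buffer[i] for i in indices]
    return experiences, weights
```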
2. Intrinsic Rewards for Exploration
To encourage exploration of different board states, we implement a state visitation tracker that provides intrinsic rewards for discovering new states:
```python
from collections import defaultdict
import numpy as np

class StateVisitationTracker:
    def __init__(self, decay=0.999):
        self.state_counts = defaultdict(int)
        self.decay = decay

    def hash_state(self, state):
        return hash(state.tobytes())  # the board is tiny, hashing raw bytes is enough

    def get_intrinsic_reward(self, state):
        state_hash = self.hash_state(state)
        count = self.state_counts[state_hash]
        self.state_counts[state_hash] += 1  # record this visit
        # Novel states get the full bonus; familiar ones decay as 1/sqrt(count)
        return 0.5 / np.sqrt(count) if count > 0 else 0.5
```
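A quick illustration of how the bonus shrinks as the same board is revisited (continuing from the class above):

```python
import numpy as np

tracker = StateVisitationTracker()
board = np.zeros((4, 4), dtype=np.int8)
for visit in range(1, 4):
    print(visit, tracker.get_intrinsic_reward(board))
# 1 0.5     (novel state, full bonus)
# 2 0.5     (count was 1 -> 0.5 / sqrt(1))
# 3 ~0.354  (count was 2 -> 0.5 / sqrt(2))
```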
Training Process
The training process combines all these elements:
```python
def train_dqn(args):
    env = OrbitoEnv()
    agent = DQNAgent()
    replay_buffer = PrioritizedReplayBuffer()
    state_tracker = StateVisitationTracker()

    for episode in range(args.episodes):
        state, _ = env.reset()
        done = False
        while not done:
            # Choose among the currently legal moves with an epsilon-greedy policy
            valid_actions = get_valid_actions(env)
            epsilon = max(args.epsilon_min,
                          args.epsilon * (args.epsilon_decay ** episode))
            action = agent.act(state, valid_actions, epsilon)

            # Take the action and observe the environment reward
            next_state, reward, done, _, info = env.step(action)

            # Add an intrinsic bonus for visiting novel board states
            intrinsic_reward = state_tracker.get_intrinsic_reward(next_state)
            combined_reward = reward + args.intrinsic_reward_scale * intrinsic_reward

            # Store the experience and train on a prioritized batch
            replay_buffer.push(state, action, combined_reward, next_state, done)
            if len(replay_buffer) > args.batch_size:
                experiences, weights = replay_buffer.sample(args.batch_size)
                agent.train_prioritized(experiences, weights)

            state = next_state
```
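The agent's `act` method is referenced but not shown. A plausible epsilon-greedy version that masks illegal moves might look like this; `policy_net` and `encode_state` are assumed names, not taken from the repository:

```python
import random
import torch

def act(self, state, valid_actions, epsilon):
    # Explore: pick a random legal move
    if random.random() < epsilon:
        return random.choice(valid_actions)
    # Exploit: pick the legal move with the highest predicted Q-value
    with torch.no_grad():
        # encode_state is a hypothetical helper that builds the network inputs
        q_values = self.policy_net(*self.encode_state(state)).squeeze(0)
    masked = torch.full_like(q_values, float("-inf"))
    masked[valid_actions] = q_values[valid_actions]
    return int(masked.argmax().item())
```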
Training Parameters
We use the following hyperparameters for training:
```bash
python train/prioritized_train_dqn.py \
    --episodes 1000 \
    --target-update 10 \
    --save-interval 100 \
    --render-interval 100 \
    --learning-rate 0.0005 \
    --epsilon-decay 0.997 \
    --intrinsic-reward-scale 0.5 \
    --batch-size 128
```
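For reference, these flags map onto an `argparse` parser roughly as follows; the defaults shown here (including the starting epsilon) are illustrative assumptions rather than values copied from the repository:

```python
import argparse

def parse_args():
    p = argparse.ArgumentParser(description="Train a DQN agent on Orbito")
    p.add_argument("--episodes", type=int, default=1000)
    p.add_argument("--target-update", type=int, default=10,
                   help="episodes between target-network syncs")
    p.add_argument("--save-interval", type=int, default=100)
    p.add_argument("--render-interval", type=int, default=100)
    p.add_argument("--learning-rate", type=float, default=5e-4)
    p.add_argument("--epsilon", type=float, default=1.0)        # assumed default
    p.add_argument("--epsilon-min", type=float, default=0.05)   # assumed default
    p.add_argument("--epsilon-decay", type=float, default=0.997)
    p.add_argument("--intrinsic-reward-scale", type=float, default=0.5)
    p.add_argument("--batch-size", type=int, default=128)
    return p.parse_args()
```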
Results and Visualization
During training, we track various metrics:
- Episode rewards
- Action distribution
- State novelty rewards
- Action type distribution over time
These metrics are visualized and saved periodically, allowing us to monitor the agent's learning progress and behavior patterns.
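The plotting code lives in the repository; a minimal stand-in for the episode-reward curve, assuming rewards are accumulated in a plain list, could look like this:

```python
import matplotlib.pyplot as plt
import numpy as np

def plot_rewards(episode_rewards, window=50, path="rewards.png"):
    rewards = np.asarray(episode_rewards, dtype=float)
    plt.figure()
    plt.plot(rewards, alpha=0.3, label="per episode")
    if len(rewards) >= window:
        # Moving average to make the trend visible through the noise
        smoothed = np.convolve(rewards, np.ones(window) / window, mode="valid")
        plt.plot(range(window - 1, len(rewards)), smoothed, label=f"{window}-episode mean")
    plt.xlabel("Episode")
    plt.ylabel("Reward")
    plt.legend()
    plt.savefig(path)
    plt.close()
```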
Conclusion
Training an AI to play Orbito presents unique challenges due to the game's complex action space and strategic depth. By combining DQN with prioritized experience replay and intrinsic rewards, we create an agent that can learn effective strategies while maintaining a good balance between exploration and exploitation.
The trained agent demonstrates the ability to:
- Make strategic marble placements
- Effectively use opponent marble movement
- Utilize the Orbito rotation mechanic at appropriate times
- Adapt its strategy based on the board state
Future improvements could include:
- Self-play training to develop more advanced strategies
- Curriculum learning to gradually increase opponent difficulty
- Monte Carlo Tree Search (MCTS) integration for better planning
The complete code and training logs are available in the repository. Feel free to experiment with different hyperparameters and training strategies!
Try It Yourself!
Want to challenge the trained AI agent? Head over to the Orbito Playground to play against it! Test your strategic thinking and see if you can outsmart the AI in this engaging board game. The agent uses the techniques described in this blog post, including prioritized experience replay and intrinsic rewards for exploration.
Whether you're interested in AI, game theory, or just looking for a fun challenge, give it a try and let me know how you fare against the AI!