Snake Game Q-Learning Tutorial
Teaching a Snake to Play by Itself
Reinforcement learning is one of the most fascinating areas of machine learning: an agent learns to make decisions by interacting with an environment. In this tutorial, I'll walk you through implementing Q-learning, a foundational reinforcement learning algorithm, to teach an AI agent to play Snake.
What You'll Learn
- How Q-learning works and why it's suitable for games like Snake
- How to represent game states in a way that's useful for learning
- Implementation of epsilon-greedy exploration strategy
- How to build and update a Q-table using the Bellman equation
- Training and evaluating your reinforcement learning agent
Understanding the Snake Game Environment
Before we dive into the code, let's understand our environment:
1. State
We represent the game state with 11 boolean values:
- [0-2]: Danger detection (front, right, left)
- [3-6]: Current direction (left, right, up, down)
- [7-10]: Food location (left, right, up, down)
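As a hedged illustration (the tutorial's code never shows the state builder itself), here is what one such state vector might look like for a snake heading right, with no immediate danger, and food up and to the right. Booleans are written as 0/1, but True/False would work just as well:

# Illustrative state only; indices follow the layout above
state = [0, 0, 0,      # [0-2]  danger: front, right, left -> none
         0, 1, 0, 0,   # [3-6]  direction: currently moving right
         0, 1, 1, 0]   # [7-10] food: to the right and above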
2. Actions
The snake can make three moves:
- FRONT: Continue in the current direction
- RIGHT: Turn right relative to the current direction
- LEFT: Turn left relative to the current direction
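Because actions are relative to the snake's heading, the environment has to translate them into an absolute direction. Here is a minimal sketch of that translation, assuming a clockwise ordering of headings; the actual Direction enum used by the game may differ:

# Hypothetical helper: map a relative action to an absolute heading
HEADINGS = ["up", "right", "down", "left"]  # clockwise order

def next_heading(current, action):
    i = HEADINGS.index(current)
    if action == "RIGHT":   # turn right = step clockwise
        return HEADINGS[(i + 1) % 4]
    if action == "LEFT":    # turn left = step counter-clockwise
        return HEADINGS[(i - 1) % 4]
    return current          # FRONT keeps the current heading

print(next_heading("right", "LEFT"))  # -> "up"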
3. Reward Structure
The game provides rewards when:
- The snake eats food (positive reward)
- The game ends (negative reward)
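The environment supplies the reward through move_function, so its exact magnitudes are not shown in this tutorial. Purely as an assumed illustration of the structure above (the name reward_for and the values +10/-10 are made up):

# Assumed reward values for illustration only; the real environment
# decides these inside move_function
def reward_for(ate_food, game_over):
    if game_over:
        return -10   # dying is penalized
    if ate_food:
        return +10   # eating food is rewarded
    return 0         # ordinary moves are neutral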
Q-Learning: The Theory Behind the Code
Q-learning works by maintaining a table of expected future rewards (Q-values) for each state-action pair. The core algorithm:
- Observe current state
- Choose an action (balancing exploration and exploitation)
- Execute action, observe reward and next state
- Update Q-value using the Bellman equation
- Repeat until learning converges
The Bellman update equation is:
Q(s, a) ← Q(s, a) + α * [R + γ * max(Q(s', a')) - Q(s, a)]
Where:
- α (learning rate): How much to update Q-values based on new information
- γ (discount factor): How much to value future rewards versus immediate ones
- R: The reward received after taking action a in state s
- max(Q(s', a')): The maximum Q-value achievable from the next state s'
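To make the update concrete, here is a small hand-worked example with made-up numbers (α = 0.1 and γ = 0.95 are chosen for readability, not taken from the agent's defaults):

# One hand-computed Q-update with illustrative numbers
alpha, gamma = 0.1, 0.95
current_q   = 2.0        # Q(s, a) before the update
reward      = 10         # R: the snake just ate food
max_next_q  = 3.0        # max over a' of Q(s', a')

target_q = reward + gamma * max_next_q       # 10 + 0.95 * 3.0 = 12.85
new_q = current_q + alpha * (target_q - current_q)
print(new_q)                                 # 2.0 + 0.1 * 10.85 = 3.085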
The Implementation
Step 1: Define the Agent Class
import random
import numpy as np

class SnakeQAgent:
    def __init__(self, state_size=11, action_size=3, learning_rate=0.001,
                 discount_factor=0.95, exploration_rate=1.0,
                 exploration_decay=0.995, min_exploration_rate=0.01):
        self.state_size = state_size
        self.action_size = action_size
        self.learning_rate = learning_rate          # Alpha
        self.discount_factor = discount_factor      # Gamma
        self.exploration_rate = exploration_rate    # Epsilon
        self.exploration_decay = exploration_decay
        self.min_exploration_rate = min_exploration_rate
        # Initialize Q-table as a dictionary for sparse representation
        self.q_table = {}

    def decay_exploration(self):
        """Reduce epsilon after each episode, never dropping below the minimum (used by the training loop below)"""
        self.exploration_rate = max(self.min_exploration_rate,
                                    self.exploration_rate * self.exploration_decay)
We define our agent with parameters that control learning:
- learning_rate: How quickly new information overrides old information
- discount_factor: How much future rewards are valued
- exploration_rate: Probability of taking a random action instead of the best known action
- exploration_decay: How quickly to reduce exploration over time
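As a quick usage sketch (the override values below are purely illustrative, not recommendations), you can construct the agent with its defaults or tweak any hyperparameter:

# Default agent, as used by train_agent later in this tutorial
agent = SnakeQAgent()

# Or experiment, e.g. a larger learning rate and slower exploration decay
agent = SnakeQAgent(learning_rate=0.01, exploration_decay=0.999)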
Step 2: State Representation and Action Selection
def _get_state_key(self, state):
    """Convert state list to a hashable tuple"""
    return tuple(state)

def choose_action(self, state):
    """Choose action using epsilon-greedy policy"""
    if random.uniform(0, 1) < self.exploration_rate:
        # Exploration: choose a random action
        return random.randint(0, self.action_size - 1)
    else:
        # Exploitation: choose best action based on Q-values
        state_key = self._get_state_key(state)
        if state_key not in self.q_table:
            self.q_table[state_key] = np.zeros(self.action_size)
        return np.argmax(self.q_table[state_key])
Here we:
- Convert our state list to a hashable tuple (so it can be used as a dictionary key)
- Implement the epsilon-greedy strategy for exploration:
  - With probability epsilon, choose a random action (exploration)
  - Otherwise, choose the action with the highest Q-value (exploitation)
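One way to build intuition for the epsilon schedule is to compute how it shrinks across episodes. A small sketch using the agent's default decay of 0.995 and floor of 0.01:

# How exploration shrinks over training (defaults: 0.995 decay, 0.01 floor)
epsilon = 1.0
for episode in range(1, 1001):
    epsilon = max(0.01, epsilon * 0.995)
    if episode in (100, 500, 1000):
        print(f"after {episode} episodes: epsilon ~= {epsilon:.3f}")
# after 100 episodes: epsilon ~= 0.606
# after 500 episodes: epsilon ~= 0.082
# after 1000 episodes: epsilon ~= 0.010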
Step 3: Q-Value Updates (Learning)
def update_q_value(self, state, action, reward, next_state, done):
    """Update Q-value for a state-action pair using the Bellman equation"""
    state_key = self._get_state_key(state)
    next_state_key = self._get_state_key(next_state)

    # Initialize if state not in Q-table
    if state_key not in self.q_table:
        self.q_table[state_key] = np.zeros(self.action_size)
    if next_state_key not in self.q_table:
        self.q_table[next_state_key] = np.zeros(self.action_size)

    # Calculate target Q-value using the Bellman equation
    current_q = self.q_table[state_key][action]
    max_next_q = np.max(self.q_table[next_state_key]) if not done else 0
    target_q = reward + self.discount_factor * max_next_q

    # Update Q-value
    self.q_table[state_key][action] += self.learning_rate * (target_q - current_q)
This is the heart of Q-learning. After each action, we:
- Get the current Q-value for the state-action pair
- Calculate the maximum possible Q-value from the next state
- Compute the target Q-value using the Bellman equation
- Update the Q-value with a weighted average of old and new information
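As a hedged sanity check (the state vectors below are arbitrary examples, not real game states), you can call update_q_value directly and watch a single table entry move toward the target:

# Illustrative one-off update
agent = SnakeQAgent(learning_rate=0.1)   # larger alpha so the change is visible
s  = [0] * 11                            # some state
s2 = [0] * 10 + [1]                      # some next state
agent.update_q_value(s, action=1, reward=10, next_state=s2, done=False)
print(agent.q_table[tuple(s)])           # entry for action 1 moved from 0.0 toward the target of 10.0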
Step 4: Training the Agent
async def train_agent(episodes=100, move_function=None, get_state_function=None):
    """Train the agent for a specified number of episodes"""
    agent = SnakeQAgent()
    scores = []
    max_score = 0

    # Initial state
    state = get_state_function()

    for episode in range(episodes):
        done = False
        episode_score = 0
        steps = 0

        # Play one episode until game over
        while not done:
            # Choose action
            action_idx = agent.choose_action(state)
            action = [Direction.FRONT, Direction.RIGHT, Direction.LEFT][action_idx]

            # Take action
            done, score, reward = await move_function(action.value)
            next_state = get_state_function()

            # Update Q-values
            agent.update_q_value(state, action_idx, reward, next_state, done)

            # Update state
            state = next_state
            episode_score = score
            steps += 1

            # If game is over, the game will auto-reset
            if done:
                scores.append(episode_score)
                if episode_score > max_score:
                    max_score = episode_score
                # Get initial state for next episode after auto-reset
                state = get_state_function()

        # Decay exploration rate after each episode
        agent.decay_exploration()

        # Print progress
        if episode % 5 == 0:
            avg_score = np.mean(scores[-5:]) if len(scores) >= 5 else np.mean(scores)
            print(f"Episode: {episode}, Score: {episode_score}, Steps: {steps}, "
                  f"Average Score: {avg_score:.2f}, Max Score: {max_score}, "
                  f"Exploration Rate: {agent.exploration_rate:.4f}")

    # Return the trained agent and its score history for evaluation
    return agent, scores
During training:
- We run through multiple episodes (games)
- For each step in an episode, we:
  - Choose an action using our epsilon-greedy policy
  - Execute the action and observe the reward and next state
  - Update the Q-values
  - Check if the game is over
- After each episode, we:
  - Reduce the exploration rate to encourage exploitation of learned knowledge
  - Track performance metrics like score and steps
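Because train_agent is async and depends on the game's own move and state functions, it has to be awaited from an event loop. The names game.move and game.get_state below are placeholders for whatever your Snake environment actually exposes:

import asyncio

async def main():
    # game.move and game.get_state are placeholders for your environment's API
    agent, scores = await train_agent(
        episodes=200,
        move_function=game.move,
        get_state_function=game.get_state,
    )
    print(f"Trained on {len(scores)} episodes, best score: {max(scores)}")

asyncio.run(main())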
Step 5: Evaluating the Trained Agent
async def play_game(agent, episodes=10, move_function=None, get_state_function=None):
    """Use the trained agent to play the game without exploration"""
    scores = []

    # Get initial state
    state = get_state_function()

    for episode in range(episodes):
        done = False
        episode_score = 0
        steps = 0

        while not done:
            # Choose the best action (no exploration)
            state_key = agent._get_state_key(state)
            if state_key in agent.q_table:
                action_idx = np.argmax(agent.q_table[state_key])
            else:
                # If state not seen during training, use a simple heuristic
                if state[0]:  # Danger ahead
                    action_idx = 1 if not state[1] else 2  # Turn right if it's clear, otherwise turn left
                else:
                    action_idx = 0  # Go forward if no danger ahead

            action = [Direction.FRONT, Direction.RIGHT, Direction.LEFT][action_idx]

            # Take action
            done, score, _ = await move_function(action.value)
            state = get_state_function()
            episode_score = score
            steps += 1

            # If game is over, it will auto-reset
            if done:
                scores.append(episode_score)
                # Get initial state for next episode
                state = get_state_function()

        print(f"Game {episode}: Score = {episode_score}, Steps = {steps}")

    # Return the scores so callers can summarize performance
    return scores
After training, we evaluate our agent by:
- Disabling exploration (no random actions)
- Playing several games
- Using a simple fallback strategy for any states not encountered during training
- Measuring performance to see how well the agent learned
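Evaluation can reuse the same placeholder environment hooks as training. This sketch assumes play_game returns its list of scores and uses the hypothetical game object from the training example:

async def evaluate(agent):
    # game.move / game.get_state are the same placeholder hooks as in training
    scores = await play_game(
        agent,
        episodes=10,
        move_function=game.move,
        get_state_function=game.get_state,
    )
    print(f"Average score over {len(scores)} games: {np.mean(scores):.2f}")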
Key Reinforcement Learning Concepts Applied
1. State Representation
Our 11-element state vector captures essential information about dangers, direction, and food location.
2. Exploration vs. Exploitation
The epsilon-greedy strategy balances:
- Exploration: Trying new actions to discover better strategies
- Exploitation: Using known good strategies to maximize score
3. Temporal Difference Learning
The Q-learning update is a form of temporal difference learning, where we learn from the difference between current estimates and new information.
4. Discounted Future Rewards
The discount factor helps the agent balance immediate rewards against long-term strategy.
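To see the effect numerically (a toy calculation, not part of the original tutorial): with γ = 0.95, a reward earned several steps in the future is worth noticeably less than one earned now:

# A reward of 10 discounted by gamma = 0.95 per step into the future
gamma, reward = 0.95, 10
for steps_away in (0, 5, 20):
    print(steps_away, round(reward * gamma ** steps_away, 2))
# 0 10.0
# 5 7.74
# 20 3.58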
Tips for Improving Performance
- Hyperparameter Tuning: Experiment with different values for learning rate, discount factor, and exploration parameters.
- Reward Shaping: Consider modifying the reward structure to encourage specific behaviors (e.g., give small rewards for moving closer to food; see the sketch after this list).
- State Representation: You could add more information to the state, like the snake's length or a broader view of dangers.
- Function Approximation: For more complex games, consider replacing the Q-table with a neural network (Deep Q-Learning).
- Experience Replay: Store past experiences and randomly sample them for learning to break action correlations.
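For the reward-shaping tip, here is a minimal sketch of the idea, assuming you control the environment's reward calculation and can measure the distance between the snake's head and the food (the function name and bonus values are hypothetical):

# Hypothetical shaped reward: small bonus for closing the distance to the food
def shaped_reward(base_reward, old_distance, new_distance):
    if new_distance < old_distance:
        return base_reward + 1   # moved closer to the food
    elif new_distance > old_distance:
        return base_reward - 1   # moved away from the food
    return base_reward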
Conclusion
Congratulations! You've implemented a Q-learning agent for Snake. Through trial and error, your agent has learned which actions lead to higher rewards in different situations. The beauty of reinforcement learning is that we didn't explicitly program the strategy; the agent discovered it through exploration and feedback.
This same approach can be adapted to many other games and real-world problems. As you continue your reinforcement learning journey, you might explore more advanced techniques like Deep Q-Networks (DQN), Policy Gradients, or Actor-Critic methods.
Happy coding, and may your snake grow ever longer!