Snake Game Q-Learning Tutorial
Teaching a Snake to Play by Itself
Reinforcement learning is one of the most fascinating areas of machine learning: an agent learns to make decisions by interacting with an environment. In this tutorial, I'll walk you through implementing Q-learning, a foundational reinforcement learning algorithm, to teach an AI agent to play Snake.
What You'll Learn
- How Q-learning works and why it's suitable for games like Snake
- How to represent game states in a way that's useful for learning
- Implementation of epsilon-greedy exploration strategy
- How to build and update a Q-table using the Bellman equation
- Training and evaluating your reinforcement learning agent
Understanding the Snake Game Environment
Before we dive into the code, let's understand our environment:
1. State
We represent the game state with 11 boolean values:
- [0-2]: Danger detection (front, right, left)
- [3-6]: Current direction (left, right, up, down)
- [7-10]: Food location (left, right, up, down)
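As a hedged illustration (the tutorial's code never shows the state builder itself), here is what one such state vector might look like for a snake heading right, with no immediate danger, and food up and to the right. Booleans are written as 0/1, but True/False would work just as well:

# Illustrative state only; indices follow the layout above
state = [0, 0, 0,      # [0-2]  danger: front, right, left -> none
         0, 1, 0, 0,   # [3-6]  direction: currently moving right
         0, 1, 1, 0]   # [7-10] food: to the right and above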
2. Actions
The snake can make three moves:
- FRONT: Continue in the current direction
- RIGHT: Turn right relative to the current direction
- LEFT: Turn left relative to the current direction
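Because actions are relative to the snake's heading, the environment has to translate them into an absolute direction. Here is a minimal sketch of that translation, assuming a clockwise ordering of headings; the actual Direction enum used by the game may differ:

# Hypothetical helper: map a relative action to an absolute heading
HEADINGS = ["up", "right", "down", "left"]  # clockwise order

def next_heading(current, action):
    i = HEADINGS.index(current)
    if action == "RIGHT":   # turn right = step clockwise
        return HEADINGS[(i + 1) % 4]
    if action == "LEFT":    # turn left = step counter-clockwise
        return HEADINGS[(i - 1) % 4]
    return current          # FRONT keeps the current heading

print(next_heading("right", "LEFT"))  # -> "up"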
3. Reward Structure
The game provides rewards when:
- The snake eats food (positive reward)
- The game ends (negative reward)
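The environment supplies the reward through move_function, so its exact magnitudes are not shown in this tutorial. Purely as an assumed illustration of the structure above (the name reward_for and the values +10/-10 are made up):

# Assumed reward values for illustration only; the real environment
# decides these inside move_function
def reward_for(ate_food, game_over):
    if game_over:
        return -10   # dying is penalized
    if ate_food:
        return +10   # eating food is rewarded
    return 0         # ordinary moves are neutral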
Q-Learning: The Theory Behind the Code
Q-learning works by maintaining a table of expected future rewards (Q-values) for each state-action pair. The core algorithm:
- Observe current state
- Choose an action (balancing exploration and exploitation)
- Execute action, observe reward and next state
- Update Q-value using the Bellman equation
- Repeat until learning converges
The Bellman update equation is:
Q(s, a) ← Q(s, a) + α * [R + γ * max(Q(s', a')) - Q(s, a)]
Where:
- α (learning rate): How much to update Q-values based on new information
- γ (discount factor): How much to value future rewards versus immediate ones
- R: The reward received after taking action a in state s
- max(Q(s', a')): The maximum Q-value achievable from the next state s'
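To make the update concrete, here is a small hand-worked example with made-up numbers (α = 0.1 and γ = 0.95 are chosen for readability, not taken from the agent's defaults):

# One hand-computed Q-update with illustrative numbers
alpha, gamma = 0.1, 0.95
current_q   = 2.0        # Q(s, a) before the update
reward      = 10         # R: the snake just ate food
max_next_q  = 3.0        # max over a' of Q(s', a')

target_q = reward + gamma * max_next_q       # 10 + 0.95 * 3.0 = 12.85
new_q = current_q + alpha * (target_q - current_q)
print(new_q)                                 # 2.0 + 0.1 * 10.85 = 3.085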
The Implementation
Step 1: Define the Agent Class
import random
import numpy as np

class SnakeQAgent:
    def __init__(self, state_size=11, action_size=3, learning_rate=0.001,
                 discount_factor=0.95, exploration_rate=1.0,
                 exploration_decay=0.995, min_exploration_rate=0.01):
        self.state_size = state_size
        self.action_size = action_size
        self.learning_rate = learning_rate          # Alpha
        self.discount_factor = discount_factor      # Gamma
        self.exploration_rate = exploration_rate    # Epsilon
        self.exploration_decay = exploration_decay
        self.min_exploration_rate = min_exploration_rate
        # Initialize Q-table as a dictionary for sparse representation
        self.q_table = {}

    def decay_exploration(self):
        """Reduce epsilon after each episode, never dropping below the minimum (used by the training loop below)"""
        self.exploration_rate = max(self.min_exploration_rate,
                                    self.exploration_rate * self.exploration_decay)
We define our agent with parameters that control learning:
- learning_rate: How quickly new information overrides old information
- discount_factor: How much future rewards are valued
- exploration_rate: Probability of taking a random action instead of the best known action
- exploration_decay: How quickly to reduce exploration over time
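As a quick usage sketch (the override values below are purely illustrative, not recommendations), you can construct the agent with its defaults or tweak any hyperparameter:

# Default agent, as used by train_agent later in this tutorial
agent = SnakeQAgent()

# Or experiment, e.g. a larger learning rate and slower exploration decay
agent = SnakeQAgent(learning_rate=0.01, exploration_decay=0.999)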
Step 2: State Representation and Action Selection
def _get_state_key(self, state):
    """Convert state list to a hashable tuple"""
    return tuple(state)

def choose_action(self, state):
    """Choose action using epsilon-greedy policy"""
    if random.uniform(0, 1) < self.exploration_rate:
        # Exploration: choose a random action
        return random.randint(0, self.action_size - 1)
    else:
        # Exploitation: choose best action based on Q-values
        state_key = self._get_state_key(state)
        if state_key not in self.q_table:
            self.q_table[state_key] = np.zeros(self.action_size)
        return np.argmax(self.q_table[state_key])
Here we:
- Convert our state list to a hashable tuple (so it can be used as a dictionary key)
- Implement the epsilon-greedy strategy for exploration:
  - With probability epsilon, choose a random action (exploration)
  - Otherwise, choose the action with the highest Q-value (exploitation)
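One way to build intuition for the epsilon schedule is to compute how it shrinks across episodes. A small sketch using the agent's default decay of 0.995 and floor of 0.01:

# How exploration shrinks over training (defaults: 0.995 decay, 0.01 floor)
epsilon = 1.0
for episode in range(1, 1001):
    epsilon = max(0.01, epsilon * 0.995)
    if episode in (100, 500, 1000):
        print(f"after {episode} episodes: epsilon ~= {epsilon:.3f}")
# after 100 episodes: epsilon ~= 0.606
# after 500 episodes: epsilon ~= 0.082
# after 1000 episodes: epsilon ~= 0.010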
Step 3: Q-Value Updates (Learning)
def update_q_value(self, state, action, reward, next_state, done):
    """Update Q-value for a state-action pair using the Bellman equation"""
    state_key = self._get_state_key(state)
    next_state_key = self._get_state_key(next_state)

    # Initialize if state not in Q-table
    if state_key not in self.q_table:
        self.q_table[state_key] = np.zeros(self.action_size)
    if next_state_key not in self.q_table:
        self.q_table[next_state_key] = np.zeros(self.action_size)

    # Calculate target Q-value using the Bellman equation
    current_q = self.q_table[state_key][action]
    max_next_q = np.max(self.q_table[next_state_key]) if not done else 0
    target_q = reward + self.discount_factor * max_next_q

    # Update Q-value
    self.q_table[state_key][action] += self.learning_rate * (target_q - current_q)
This is the heart of Q-learning. After each action, we:
- Get the current Q-value for the state-action pair
- Calculate the maximum possible Q-value from the next state
- Compute the target Q-value using the Bellman equation
- Update the Q-value with a weighted average of old and new information
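As a hedged sanity check (the state vectors below are arbitrary examples, not real game states), you can call update_q_value directly and watch a single table entry move toward the target:

# Illustrative one-off update
agent = SnakeQAgent(learning_rate=0.1)   # larger alpha so the change is visible
s  = [0] * 11                            # some state
s2 = [0] * 10 + [1]                      # some next state
agent.update_q_value(s, action=1, reward=10, next_state=s2, done=False)
print(agent.q_table[tuple(s)])           # entry for action 1 moved from 0.0 toward the target of 10.0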
Step 4: Training the Agent
async def train_agent(episodes=100, move_function=None, get_state_function=None):
    """Train the agent for a specified number of episodes"""
    agent = SnakeQAgent()
    scores = []
    max_score = 0

    # Initial state
    state = get_state_function()

    for episode in range(episodes):
        done = False
        episode_score = 0
        steps = 0

        # Play one episode until game over
        while not done:
            # Choose action
            action_idx = agent.choose_action(state)
            action = [Direction.FRONT, Direction.RIGHT, Direction.LEFT][action_idx]

            # Take action
            done, score, reward = await move_function(action.value)
            next_state = get_state_function()

            # Update Q-values
            agent.update_q_value(state, action_idx, reward, next_state, done)

            # Update state
            state = next_state
            episode_score = score
            steps += 1

            # If game is over, the game will auto-reset
            if done:
                scores.append(episode_score)
                if episode_score > max_score:
                    max_score = episode_score
                # Get initial state for next episode after auto-reset
                state = get_state_function()

        # Decay exploration rate after each episode
        agent.decay_exploration()

        # Print progress
        if episode % 5 == 0:
            avg_score = np.mean(scores[-5:]) if len(scores) >= 5 else np.mean(scores)
            print(f"Episode: {episode}, Score: {episode_score}, Steps: {steps}, "
                  f"Average Score: {avg_score:.2f}, Max Score: {max_score}, "
                  f"Exploration Rate: {agent.exploration_rate:.4f}")

    # Return the trained agent and its score history for evaluation
    return agent, scores
During training:
- We run through multiple episodes (games)
- For each step in an episode, we:
  - Choose an action using our epsilon-greedy policy
  - Execute the action and observe the reward and next state
  - Update the Q-values
  - Check if the game is over
- After each episode, we:
  - Reduce the exploration rate to encourage exploitation of learned knowledge
  - Track performance metrics like score and steps
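Because train_agent is async and depends on the game's own move and state functions, it has to be awaited from an event loop. The names game.move and game.get_state below are placeholders for whatever your Snake environment actually exposes:

import asyncio

async def main():
    # game.move and game.get_state are placeholders for your environment's API
    agent, scores = await train_agent(
        episodes=200,
        move_function=game.move,
        get_state_function=game.get_state,
    )
    print(f"Trained on {len(scores)} episodes, best score: {max(scores)}")

asyncio.run(main())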
Step 5: Evaluating the Trained Agent
async def play_game(agent, episodes=10, move_function=None, get_state_function=None):
    """Use the trained agent to play the game without exploration"""
    scores = []

    # Get initial state
    state = get_state_function()

    for episode in range(episodes):
        done = False
        episode_score = 0
        steps = 0

        while not done:
            # Choose the best action (no exploration)
            state_key = agent._get_state_key(state)
            if state_key in agent.q_table:
                action_idx = np.argmax(agent.q_table[state_key])
            else:
                # If state not seen during training, use a simple heuristic
                if state[0]:  # Danger ahead
                    action_idx = 1 if not state[1] else 2  # Turn right if it's clear, otherwise turn left
                else:
                    action_idx = 0  # Go forward if no danger ahead

            action = [Direction.FRONT, Direction.RIGHT, Direction.LEFT][action_idx]

            # Take action
            done, score, _ = await move_function(action.value)
            state = get_state_function()
            episode_score = score
            steps += 1

            # If game is over, it will auto-reset
            if done:
                scores.append(episode_score)
                # Get initial state for next episode
                state = get_state_function()

        print(f"Game {episode}: Score = {episode_score}, Steps = {steps}")

    # Return the scores so callers can summarize performance
    return scores
After training, we evaluate our agent by:
- Disabling exploration (no random actions)
- Playing several games
- Using a simple fallback strategy for any states not encountered during training
- Measuring performance to see how well the agent learned
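Evaluation can reuse the same placeholder environment hooks as training. This sketch assumes play_game returns its list of scores and uses the hypothetical game object from the training example:

async def evaluate(agent):
    # game.move / game.get_state are the same placeholder hooks as in training
    scores = await play_game(
        agent,
        episodes=10,
        move_function=game.move,
        get_state_function=game.get_state,
    )
    print(f"Average score over {len(scores)} games: {np.mean(scores):.2f}")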
Key Reinforcement Learning Concepts Applied
1. State Representation
Our 11-element state vector captures essential information about dangers, direction, and food location.
2. Exploration vs. Exploitation
The epsilon-greedy strategy balances:
- Exploration: Trying new actions to discover better strategies
- Exploitation: Using known good strategies to maximize score
3. Temporal Difference Learning
The Q-learning update is a form of temporal difference learning, where we learn from the difference between current estimates and new information.
4. Discounted Future Rewards
The discount factor helps the agent balance immediate rewards against long-term strategy.
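To see the effect numerically (a toy calculation, not part of the original tutorial): with γ = 0.95, a reward earned several steps in the future is worth noticeably less than one earned now:

# A reward of 10 discounted by gamma = 0.95 per step into the future
gamma, reward = 0.95, 10
for steps_away in (0, 5, 20):
    print(steps_away, round(reward * gamma ** steps_away, 2))
# 0 10.0
# 5 7.74
# 20 3.58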
Tips for Improving Performance
- Hyperparameter Tuning: Experiment with different values for learning rate, discount factor, and exploration parameters.
- Reward Shaping: Consider modifying the reward structure to encourage specific behaviors (e.g., give small rewards for moving closer to food; see the sketch after this list).
- State Representation: You could add more information to the state, like the snake's length or a broader view of dangers.
- Function Approximation: For more complex games, consider replacing the Q-table with a neural network (Deep Q-Learning).
- Experience Replay: Store past experiences and randomly sample them for learning to break action correlations.
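For the reward-shaping tip, here is a minimal sketch of the idea, assuming you control the environment's reward calculation and can measure the distance between the snake's head and the food (the function name and bonus values are hypothetical):

# Hypothetical shaped reward: small bonus for closing the distance to the food
def shaped_reward(base_reward, old_distance, new_distance):
    if new_distance < old_distance:
        return base_reward + 1   # moved closer to the food
    elif new_distance > old_distance:
        return base_reward - 1   # moved away from the food
    return base_reward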
Conclusion
Congratulations! You've implemented a Q-learning agent for Snake. Through trial and error, your agent has learned which actions lead to higher rewards in different situations. The beauty of reinforcement learning is that we didn't explicitly program the strategy; the agent discovered it through exploration and feedback.
This same approach can be adapted to many other games and real-world problems. As you continue your reinforcement learning journey, you might explore more advanced techniques like Deep Q-Networks (DQN), Policy Gradients, or Actor-Critic methods.
Happy coding, and may your snake grow ever longer!