[Reinforcement learning] Solving a maze with Q-learning in Python

Thanks to hhh5460; the main code in this article is based on his blog post: https://www.cnblogs.com/hhh5460/p/10143579.html

 1. Problem setting

The maze is 8x8, a simple extension of the original 6x6 version.

The entrance is in the upper-left corner, the exit (yellow square) is in the lower-right corner, the red square is the player, and the black squares are obstacles.

(Figure: the 8x8 maze rendered by the tkinter GUI described below.)

 2. Approach

Reinforcement learning studies how an agent learns a policy that maximizes cumulative reward (or achieves a specific goal) while interacting with its environment. The interaction between the agent and the environment is modeled as a Markov decision process.

The reinforcement-learning framework is the standard agent-environment loop: the agent observes a state, takes an action, and the environment returns a reward and the next state.

(Figure: the agent-environment interaction loop.)

Therefore, we need to define three sets:

State set (S): the set of states the agent can be in. Here, a state is the agent's position in the grid: [0, 1, ..., 63], 64 states (positions) in total.

Action set (A): the set of actions the agent can take. Here, the agent can only move up, down, left, or right: ['u', 'd', 'l', 'r'], four actions in total.

Reward set (R): each position has a reward value, and the agent learns the optimal path by maximizing the cumulative reward collected on the way to the exit. A blank cell has reward 0, an obstacle has reward -10, and the exit has reward 10: [0, -10, 0, 0, ..., 10], 64 values in total.
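To make the mapping concrete, here is a small illustrative sketch (my own addition, not part of the original code) that converts a state index into its (row, column) cell and looks up its reward; the trap positions are the same ones used in the complete code below:

# Illustrative sketch only: state index -> (row, column) cell and reward in the 8x8 maze.
WIDTH = 8
rewards = [0] * 64  # blank cells have reward 0
for trap in [1, 7, 9, 12, 15, 17, 19, 25, 27, 33, 35, 38, 46, 53, 54, 57, 61, 62]:
    rewards[trap] = -10  # traps
rewards[63] = 10  # the exit

def describe_state(state):
    '''Return the (row, column) cell and the reward of a state index.'''
    row, col = state // WIDTH, state % WIDTH
    return (row, col), rewards[state]

print(describe_state(0))   # ((0, 0), 0)   entrance
print(describe_state(7))   # ((0, 7), -10) trap in the upper-right corner
print(describe_state(63))  # ((7, 7), 10)  exit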

3. Complete code

import pandas as pd
import random
import time
import pickle
import pathlib
import os
import tkinter as tk

'''
 8x8 maze layout:
---------------------------------------------------------
| entrance | trap |      |      |      |      |      | trap |
---------------------------------------------------------
|      | trap |      |      | trap |      |      | trap |
---------------------------------------------------------
|      | trap |      | trap |      |      |      |      |
---------------------------------------------------------
|      | trap |      | trap |      |      |      |      |
---------------------------------------------------------
|      | trap |      | trap |      |      | trap |      |
---------------------------------------------------------
|      |      |      |      |      |      | trap |      |
---------------------------------------------------------
|      |      |      |      |      | trap | trap |      |
---------------------------------------------------------
|      | trap |      |      |      | trap | trap |  Exit  |
---------------------------------------------------------

'''


class Maze(tk.Tk):
    '''Environment class (GUI)'''
    UNIT = 40  # pixels
    MAZE_H = 8  # grid height
    MAZE_W = 8  # grid width

    def __init__(self):
        '''initialization'''
        super().__init__()
        self.title('maze')
        h = self.MAZE_H * self.UNIT
        w = self.MAZE_W * self.UNIT
        self.geometry('{0}x{1}'.format(w, h))  # Window size (width x height)
        self.canvas = tk.Canvas(self, bg='white', height=h, width=w)
        # Draw grid
        for c in range(0, w, self.UNIT):
            self.canvas.create_line(c, 0, c, h)
        for r in range(0, h, self.UNIT):
            self.canvas.create_line(0, r, w, r)
        # Draw obstacles
        self._draw_rect(1, 0, 'black')
        self._draw_rect(1, 1, 'black')
        self._draw_rect(1, 2, 'black')
        self._draw_rect(1, 3, 'black')
        self._draw_rect(1, 4, 'black')
        self._draw_rect(3, 2, 'black')
        self._draw_rect(3, 3, 'black')
        self._draw_rect(3, 4, 'black')
        self._draw_rect(5, 6, 'black')
        self._draw_rect(5, 7, 'black')
        self._draw_rect(6, 4, 'black')
        self._draw_rect(6, 5, 'black')
        self._draw_rect(6, 6, 'black')
        self._draw_rect(6, 7, 'black')
        self._draw_rect(4, 1, 'black')
        self._draw_rect(1, 7, 'black')
        self._draw_rect(7, 0, 'black')
        self._draw_rect(7, 1, 'black')
        # Draw the exit (reward cell)
        self._draw_rect(7, 7, 'yellow')
        # Draw the player (keep a reference to the rectangle)
        self.rect = self._draw_rect(0, 0, 'red')
        self.canvas.pack()  # Display the canvas

    def _draw_rect(self, x, y, color):
        '''Draw a rectangle; x is the horizontal grid index, y is the vertical grid index'''
        padding = 5  # Inner padding of 5px (like CSS padding)
        coor = [self.UNIT * x + padding, self.UNIT * y + padding, self.UNIT * (x + 1) - padding,
                self.UNIT * (y + 1) - padding]
        return self.canvas.create_rectangle(*coor, fill=color)

    def move_to(self, state, delay=0.01):
        '''Move the player to the cell corresponding to the given state'''
        coor_old = self.canvas.coords(self.rect)  # e.g. [5.0, 5.0, 35.0, 35.0] (upper-left and lower-right coordinates of the first cell)
        x, y = state % 8, state // 8  # column and row indices
        padding = 5  # Inner padding of 5px (like CSS padding)
        coor_new = [self.UNIT * x + padding, self.UNIT * y + padding, self.UNIT * (x + 1) - padding,
                    self.UNIT * (y + 1) - padding]
        dx_pixels, dy_pixels = coor_new[0] - coor_old[0], coor_new[1] - coor_old[1]  # Difference of vertex coordinates in the upper left corner
        self.canvas.move(self.rect, dx_pixels, dy_pixels)
        self.update()  # tkinter built-in update!
        time.sleep(delay)


class Agent(object):
    '''Agent class'''

    def __init__(self, alpha=0.1, gamma=0.9):
        '''initialization'''
        self.states = range(64)  # State set: 0 ~ 63, 64 states in total
        self.actions = list('udlr')  # Action set. Up, down, left and right
        self.rewards = [0, -10, 0, 0, 0, 0, 0, -10,
                        0, -10, 0, 0, -10, 0, 0, -10,
                        0, -10, 0, -10, 0, 0, 0, 0,
                        0, -10, 0, -10, 0, 0, 0, 0,
                        0, -10, 0, -10, 0, 0, -10, 0,
                        0, 0, 0, 0, 0, 0, -10, 0,
                        0, 0, 0, 0, 0, -10, -10, 0,
                        0, -10, 0, 0, 0, -10, -10, 10]  # Reward set: exit reward 10, trap reward -10
        self.hell_states = [1, 7, 9, 12, 15, 17, 19, 25, 27, 33, 35, 38, 46, 53, 54, 57, 61, 62]  # Trap location

        self.alpha = alpha
        self.gamma = gamma

        self.q_table = pd.DataFrame(data=[[0 for _ in self.actions] for _ in self.states],
                                    index=self.states,
                                    columns=self.actions)    # Define Q-table

    def save_policy(self):
        '''Save the Q-table'''
        with open('q_table.pickle', 'wb') as f:
            # Pickle the 'data' dictionary using the highest protocol available.
            pickle.dump(self.q_table, f, pickle.HIGHEST_PROTOCOL)

    def load_policy(self):
        '''Load the Q-table'''
        with open('q_table.pickle', 'rb') as f:
            self.q_table = pickle.load(f)

    def choose_action(self, state, epsilon=0.8):
        '''Select an action for the current state, either randomly (explore) or greedily (exploit), depending on epsilon'''
        # if (random.uniform(0,1) > epsilon) or ((self.q_table.ix[state] == 0).all()):  # explore
        if random.uniform(0, 1) > epsilon:  # explore
            action = random.choice(self.get_valid_actions(state))
        else:
            # action = self.q_table.ix[state].idxmax() # When there are several maximum values, this always picks the first one!
            # action = self.q_table.ix[state].filter(items=self.get_valid_actions(state)).idxmax() # An improvement, but it has the same tie-breaking problem as above
            s = self.q_table.loc[state].filter(items=self.get_valid_actions(state))
            action = random.choice(s[s == s.max()].index)  # Break ties randomly among the maximum Q-values
        return action

    def get_q_values(self, state):
        '''Get the Q-values of all valid actions in the given state'''
        q_values = self.q_table.loc[state, self.get_valid_actions(state)]
        return q_values

    def update_q_value(self, state, action, next_state_reward, next_state_q_values):
        '''Update the Q-value according to the Q-learning (Bellman) update:
           Q(s, a) += alpha * (r + gamma * max_a' Q(s', a') - Q(s, a))'''
        self.q_table.loc[state, action] += self.alpha * (
                    next_state_reward + self.gamma * next_state_q_values.max() - self.q_table.loc[state, action])

    def get_valid_actions(self, state):
        '''
        Get the set of legal actions for the current state
        (old implementation, kept for reference):
        global reward
        valid_actions = reward.ix[state, reward.ix[state] != 0].index
        return valid_actions
        '''
        valid_actions = set(self.actions)
        if state % 8 == 7:  # Last column: remove the action to the right
            valid_actions -= set(['r'])
        if state % 8 == 0:  # First column: remove the action to the left
            valid_actions -= set(['l'])
        if state // 8 == 7:  # Last row: remove the action downwards
            valid_actions -= set(['d'])
        if state // 8 == 0:  # First row: remove the action upwards
            valid_actions -= set(['u'])
        return list(valid_actions)

    def get_next_state(self, state, action):
        '''Return the next state after performing the action in the given state'''
        # u, d, l, r, stay  ->  -8, +8, -1, +1, 0
        if state % 8 != 7 and action == 'r':  # Can move right (+1) except in the last column
            next_state = state + 1
        elif state % 8 != 0 and action == 'l':  # Can move left (-1) except in the first column
            next_state = state - 1
        elif state // 8 != 7 and action == 'd':  # Can move down (+8) except in the last row
            next_state = state + 8
        elif state // 8 != 0 and action == 'u':  # Can move up (-8) except in the first row
            next_state = state - 8
        else:
            next_state = state
        return next_state

    def learn(self, env=None, episode=1000, epsilon=0.8):
        '''Q-learning training loop'''
        print('Agent is learning...')
        for i in range(episode):
            """Start from the leftmost position"""
            current_state = self.states[0]

            if env is not None:  # If an environment is provided, reset it!
                env.move_to(current_state)

            while current_state != self.states[-1]:
                # Randomly (or greedily) select one of the legal actions for the current state
                current_action = self.choose_action(current_state, epsilon)
                '''Execute the current action to get the next state (position)'''
                next_state = self.get_next_state(current_state, current_action)
                next_state_reward = self.rewards[next_state]
                '''Get all Q-values of the next state; the maximum is used in the update'''
                next_state_q_values = self.get_q_values(next_state)
                '''Update the Q-value of the current state-action pair in the Q-table according to the Bellman equation'''
                self.update_q_value(current_state, current_action, next_state_reward, next_state_q_values)
                '''Move to the next state (position)'''
                current_state = next_state

                # if next_state not in self.hell_states: # If it is not a trap, move forward; Otherwise stay in place
                #    current_state = next_state

                if env is not None:  # If the environment is provided, update it!
                    env.move_to(current_state)
            print(i)
        print('\nok')

    def test(self):
        '''Test whether the agent has already learned a usable policy'''
        count = 0
        current_state = self.states[0]
        while current_state != self.states[-1]:
            current_action = self.choose_action(current_state, 1.)  # epsilon = 1: always greedy
            next_state = self.get_next_state(current_state, current_action)
            current_state = next_state
            count += 1

            if count > 64:  # If the agent does not reach the exit within 64 steps
                return False  # the policy is not usable yet

        return True  # The policy is usable

    def play(self, env=None, delay=0.5):
        '''Play the game using the learned policy'''
        assert env is not None, 'Env must not be None!'

        if not self.test():  # If the policy is not yet usable
            if pathlib.Path("q_table.pickle").exists():
                self.load_policy()
            else:
                print("I need to learn before playing this game.")
                self.learn(env, episode=1000, epsilon=0.8)
                self.save_policy()

        print('Agent is playing...')
        current_state = self.states[0]
        env.move_to(current_state, delay)
        while current_state != self.states[-1]:
            current_action = self.choose_action(current_state, 1.)  # epsilon = 1: always greedy
            next_state = self.get_next_state(current_state, current_action)
            current_state = next_state
            env.move_to(current_state, delay)
        print('\nCongratulations, Agent got it!')


if __name__ == '__main__':
    env = Maze()  # environment
    agent = Agent()  # Individual (agent)
    agent.learn(env, episode=100, epsilon=0.8) # Learn first
    agent.save_policy()
    agent.load_policy()
    agent.play(env)  # Play with the learned policy

    # env.after(0, agent.learn, env, 1000, 0.8) # Learn first
    # env.after(0, agent.save_policy) # Save what you have learned
    # env.after(0, agent.load_policy) # Import what you have learned
    # env.after(0, agent.play, env)            # Play again
    # env.mainloop()
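
After training, the learned policy can be inspected directly, because the Q-table is just a pandas DataFrame indexed by state with one column per action. A minimal sketch, assuming the script above has already run and saved q_table.pickle:

import pickle

# Load the Q-table saved by Agent.save_policy() and look at the entrance (state 0).
with open('q_table.pickle', 'rb') as f:
    q_table = pickle.load(f)

print(q_table.loc[0])           # Q-values of 'u', 'd', 'l', 'r' at state 0
print(q_table.loc[0].idxmax())  # the greedy action at the entrance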

 4. Summary

Running the reference article's code directly still raises a few problems.

Problem 1: the code throws an error related to the ix indexer.

Solution: a quick search showed that the ix indexer was removed in pandas 1.0.0, so every ix has to be changed to loc.

Problem 2: while fixing problem 1, one search result wrongly suggested changing ix to iloc, which raised another error.

Solution: change iloc to loc. The cause is the difference between the two indexing functions (label-based vs. position-based); the details are explained at https://blog.csdn.net/weixin_35888365/article/details/113986290
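
A minimal sketch of the difference (my own example, not from the original post): loc indexes by label, while iloc indexes by integer position, so looking up an action by its name only works with loc:

import pandas as pd

# Tiny Q-table: integer states as the index, action names as the columns.
q_table = pd.DataFrame([[0.0] * 4 for _ in range(3)], index=[0, 1, 2], columns=list('udlr'))

print(q_table.loc[1, 'r'])   # label-based lookup: works
# q_table.iloc[1, 'r']       # position-based lookup: raises an error, 'r' is not an integer position
# q_table.ix[1, 'r']         # the old ix indexer was removed in pandas 1.0.0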

Problem 3: when adapting the original 6x6 maze to 8x8, I resized the state set and updated the get_valid_actions function accordingly. At runtime the agent kept running into walls while exploring, passed straight through obstacles, and in the final run it walked past the exit without entering it, turning back and wandering around the map instead.

Solution: I found that, through my own carelessness, I had written a few extra zeros into the reward list, so its length no longer matched the state set, and the obstacle positions had not been updated along with the new map.
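
A simple sanity check (my own addition, assuming the Agent class from the complete code above) catches both mistakes: the reward list must have exactly one entry per state, and every trap and the exit must carry the right reward:

agent = Agent()

assert len(agent.rewards) == len(agent.states), 'one reward value per state'
assert all(agent.rewards[s] == -10 for s in agent.hell_states), 'every trap must have reward -10'
assert agent.rewards[63] == 10, 'the exit must have reward 10'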

Problem 4: during the final run through the maze, the agent moved forward a few steps and then reversed, wandering back and forth (roughly four steps forward, three steps back).

Solution: after some troubleshooting, the cause turned out to be too few training episodes, so the Q-table had not accumulated enough experience. Increasing the number of episodes to 100 or more improved the result noticeably.
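
One way to avoid guessing a good episode count is to keep training in batches until Agent.test() reports a usable greedy policy; a rough sketch, again using only the classes defined above (training without the GUI for speed):

agent = Agent()
while not agent.test():  # keep training until the greedy policy reaches the exit
    agent.learn(env=None, episode=100, epsilon=0.8)
agent.save_policy()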

 

 

 

 

 
