Reinforcement learning - PolicyGradient instance

Above, we introduced the simple random guessing algorithm and the hill climbing algorithm for solving the CartPole problem; both change only the decision-making (action selection) step. However, both methods update the weights randomly. For simple problems with few parameters they may give good results, but for complex problems that require many parameters they are not ideal. This article introduces how to solve the CartPole problem with PolicyGradient.

PolicyGradient instance

The policy-based approach was introduced in the algorithm introduction chapter. It models the policy directly: the policy is represented by a neural network whose output is a probability for each action.
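
As a minimal sketch of what an output probability means here, the NumPy snippet below turns two raw network outputs (logits) into action probabilities with softmax and samples an action from them. The logit values and the helper name softmax are made up for illustration and are not taken from the original code.

```python
import numpy as np

def softmax(logits):
    # Subtract the max for numerical stability; the result is unchanged
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / np.sum(exp)

# Hypothetical raw outputs of a policy network for the two CartPole actions
logits = np.array([0.2, 1.0])
probs = softmax(logits)

# Sample an action index according to the predicted probabilities
rng = np.random.default_rng(0)
action = rng.choice(len(probs), p=probs)
print(probs, action)
```

Sampling, rather than always taking the most probable action, is what keeps the agent exploring while it learns.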

We still use the learning framework above; only the most important step, choose_action, is changed so that the action is predicted by the PolicyGradient model.

First, let's look at the learning process; the main logic is explained in the code comments.

Exploration process

# Learning process, explore 1000 times
for i_episode in range(1000):
    # Reset the environment every time you explore
    observation = env.reset()
    while True:
        if RENDER: env.render()
        # Make decisions based on the policy model
        action = RL.choose_action(observation)
        # Execute the action and return the observation status, reward and other information after the action is executed
        observation_, reward, done, info = env.step(action)
        # Store observations, actions, and rewards. These sequence values need to be used for model learning
        RL.store_transition(observation, action, reward)
        # End of this exploration
        if done:
            ep_rs_sum = sum(RL.ep_rs)
            # Smoothed (exponential moving average) return across explorations
            if 'running_reward' not in globals():
                running_reward = ep_rs_sum
            else:
                running_reward = running_reward * 0.99 + ep_rs_sum * 0.01
            # Start rendering once the smoothed reward exceeds the threshold
            if running_reward > DISPLAY_REWARD_THRESHOLD: RENDER = True
            print("episode:", i_episode, "rewards:", int(running_reward), "RENDER", RENDER)
            # Learn once per exploration, then leave the step loop
            vt = RL.learn()
            break
        # Next step of agent exploration
        observation = observation_
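
The running_reward update in the loop is an exponential moving average of the per-episode return. The sketch below isolates that update in a hypothetical helper, update_running_reward (not part of the original code), and feeds it made-up episode returns to show how slowly the smoothed value follows the raw returns.

```python
def update_running_reward(running_reward, ep_rs_sum, decay=0.99):
    # First episode: no history yet, so start from the raw return
    if running_reward is None:
        return ep_rs_sum
    # Afterwards: keep most of the old value and mix in a little of the new return
    return running_reward * decay + ep_rs_sum * (1 - decay)

running_reward = None
history = []
# Hypothetical episode returns: one short episode followed by two long ones
for ep_rs_sum in [20.0, 120.0, 120.0]:
    running_reward = update_running_reward(running_reward, ep_rs_sum)
    history.append(running_reward)
print(history)
```

Because each new return contributes only 1%, a single lucky episode does not trigger rendering; the threshold is crossed only after the policy performs well consistently.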

Model update process

The most important lines in the loop are action = RL.choose_action(observation), which makes the decision at every step, and vt = RL.learn(), which is called once at the end of each exploration.

RL is PolicyGradient. Let's focus on the PolicyGradient model code and code analysis:

import numpy as np
import tensorflow as tf  # TensorFlow 1.x graph API

class PolicyGradient:
    # output_graph defaulting to False is an assumption; the source does not show the signature
    def __init__(self, n_actions, n_features, learning_rate, reward_decay, output_graph=False):
        # Dimension of action space -- 2
        self.n_actions = n_actions
        # Dimension of state features -- 4
        self.n_features = n_features
        # Learning rate
        self.lr = learning_rate
        # Return decay rate
        self.gamma = reward_decay
        # Observations, actions, and rewards of one exploration
        self.ep_obs, self.ep_as, self.ep_rs = [], [], []
        # Create policy network
        self._build_net()

        self.sess = tf.Session()
        if output_graph:
            # Write the graph so that it can be inspected in TensorBoard
            tf.summary.FileWriter("logs/", self.sess.graph)
        # Initialize the network variables before first use
        self.sess.run(tf.global_variables_initializer())

    def _build_net(self):
        """ Create the policy network """
        # This code uses the TensorFlow 1.x graph API (placeholders, sessions).
        # Under TensorFlow 2.x, import tensorflow.compat.v1 as tf and call
        # tf.disable_v2_behavior() first.
        with tf.name_scope('input'):
            # Observation status -- [B, 4]
            self.tf_obs = tf.placeholder(tf.float32, [None, self.n_features], name="observations")
            # Executed action -- [B,]
            self.tf_acts = tf.placeholder(tf.int32, [None, ], name="actions_num")
            # Cumulative return value -- [B,]
            self.tf_vt = tf.placeholder(tf.float32, [None, ], name="actions_value")

        # Network structure: two fully connected layers.
        # The hidden size and activation are not shown in the source;
        # 10 units with tanh are assumed here.
        layer = tf.layers.dense(
            inputs=self.tf_obs,
            units=10,
            activation=tf.nn.tanh,
            kernel_initializer=tf.random_normal_initializer(mean=0, stddev=0.3))
        # Output layer: one logit per action, no activation
        all_act = tf.layers.dense(
            inputs=layer,
            units=self.n_actions,
            activation=None,
            kernel_initializer=tf.random_normal_initializer(mean=0, stddev=0.3))

        # Use the softmax function to predict the probability of each action
        self.all_act_prob = tf.nn.softmax(all_act, name='act_prob')

        # Define loss function
        with tf.name_scope('loss'):
            # Maximizing the expected return (log_p * R) is equivalent to minimizing
            # -(log_p * R), and TensorFlow optimizers only provide minimize(loss)
            neg_log_prob = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=all_act, labels=self.tf_acts)
            # or, equivalently:
            # neg_log_prob = tf.reduce_sum(-tf.log(self.all_act_prob)*tf.one_hot(self.tf_acts, self.n_actions), axis=1)
            loss = tf.reduce_mean(neg_log_prob * self.tf_vt)

        # Define the training op that updates the parameters
        with tf.name_scope('train'):
            self.train_op = tf.train.AdamOptimizer(self.lr).minimize(loss)

    def choose_action(self, observation, type="random"):
        """ Select an action for the current state, either by sampling from the
        action probability distribution predicted by the policy or by taking its maximum
        :param observation: Current observation
        :param type: "random" to sample from the distribution, anything else to take the most probable action
        :return: Action selected according to the policy
        """
        prob_weights = self.sess.run(self.all_act_prob, feed_dict={self.tf_obs: observation[np.newaxis, :]})
        # Sample according to the predicted probabilities, or take the maximum directly (random mode adds more exploration)
        if type == "random":
            action = np.random.choice(range(prob_weights.shape[1]), p=prob_weights.ravel())
        else:
            action = np.argmax(prob_weights.ravel())
        return action

    def store_transition(self, s, a, r):
        """ Store the state, action and reward of one step of the current round
        :param s: Observation at this step
        :param a: Action taken at this step
        :param r: Reward received at this step
        """
        self.ep_obs.append(s)
        self.ep_as.append(a)
        self.ep_rs.append(r)

    def learn(self):
        """ After each exploration, update the policy network parameters """
        # Calculate the normalized cumulative discounted return of the exploration
        discounted_ep_rs_norm = self._discount_and_norm_rewards()
        # Run the training op to update the parameters
        self.sess.run(self.train_op, feed_dict={
            self.tf_obs: np.vstack(self.ep_obs),
            self.tf_acts: np.array(self.ep_as),
            self.tf_vt: discounted_ep_rs_norm,
        })
        # Clear the episode data and wait for the next exploration and learning
        self.ep_obs, self.ep_as, self.ep_rs = [], [], []
        return discounted_ep_rs_norm

    def _discount_and_norm_rewards(self):
        """ Discount the rewards of the round and normalize them """
        # Use a float array so that the in-place normalization below also works for integer rewards
        discounted_ep_rs = np.zeros_like(self.ep_rs, dtype=np.float64)
        running_add = 0
        # The long-term cumulative reward is computed in reverse order:
        # return at time t = reward at time t + gamma * return at time t + 1
        for t in reversed(range(0, len(self.ep_rs))):
            running_add = running_add * self.gamma + self.ep_rs[t]
            discounted_ep_rs[t] = running_add
        # Normalize to zero mean and unit standard deviation
        discounted_ep_rs -= np.mean(discounted_ep_rs)
        discounted_ep_rs /= np.std(discounted_ep_rs)
        return discounted_ep_rs
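
The reverse-order loop in _discount_and_norm_rewards can be checked by hand on a tiny episode. The sketch below repeats the same computation as a standalone function; the name discount_and_norm, the three-step reward list and gamma = 0.5 are illustrative assumptions, not values from the original code.

```python
import numpy as np

def discount_and_norm(rewards, gamma):
    returns = np.zeros(len(rewards), dtype=np.float64)
    running_add = 0.0
    # Walk backwards so that each step can reuse the return of the step after it
    for t in reversed(range(len(rewards))):
        running_add = running_add * gamma + rewards[t]
        returns[t] = running_add
    # Normalize to zero mean and unit standard deviation
    normalized = (returns - returns.mean()) / returns.std()
    return returns, normalized

returns, normalized = discount_and_norm([1.0, 1.0, 1.0], gamma=0.5)
print(returns)     # discounted returns: 1.75, 1.5, 1.0
print(normalized)
```

In CartPole every step gives the same reward, so earlier steps end up with larger returns. After normalization the early actions get positive weights and the actions just before the pole falls get negative weights.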


Based on the logic above, the whole exploration and learning framework can be summarized as follows:

# Learning process, explore N times
for i_episode in range(N):
    observation = env.reset()
    while True:
        # Decision action (replaceable module)
        action = choose_action(observation)
        observation_, reward, done, _ = env.step(action)
        # Store the exploration sequence
        store_transition(observation, action, reward)
        if done:
            # Model learning (replaceable module)
            vt = learn()
            break
        # The agent explores one step and updates the observation value
        observation = observation_

The most important part is the decision model, which maps the current observation to the action that is expected to maximize the long-term return.

The key parts of the decision model are the network design (this code uses a simple two-layer fully connected network; more complex networks can be designed) and the loss design (the goal is to maximize the long-term return).
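
To make the loss design concrete, here is a NumPy sketch that computes the same quantity as the loss in _build_net, the mean of -log pi(a_t | s_t) * v_t, for a hypothetical batch of three steps. The logits, actions and returns are made-up numbers, and policy_gradient_loss is an illustrative helper, not part of the original class.

```python
import numpy as np

def policy_gradient_loss(logits, actions, returns):
    # Log-softmax over the action dimension, computed in a numerically stable way
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    # Negative log-probability of the action actually taken at each step
    neg_log_prob = -log_probs[np.arange(len(actions)), actions]
    # Weight each step by its return and average over the batch
    return np.mean(neg_log_prob * returns)

logits = np.array([[0.0, 0.0], [1.0, -1.0], [-2.0, 2.0]])  # hypothetical network outputs
actions = np.array([0, 0, 1])                              # actions taken
returns = np.array([1.0, 0.5, -1.0])                       # hypothetical normalized returns
loss = policy_gradient_loss(logits, actions, returns)
print(loss)
```

Minimizing this value raises the probability of actions with positive returns and lowers the probability of actions with negative returns, which is how the update pushes the policy toward higher long-term return.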

In the next article, we will introduce the actor-critic scheme, which combines the policy-based and value-based approaches.

Code reference:


Added by kushaljutta on Sun, 06 Feb 2022 20:10:02 +0200