Above, we introduced the simple random guessing algorithm and the hill climbing algorithm for solving the CartPole problem; both change only the action-decision step. However, those methods change the weights randomly. For simple problems with few parameters they may give good results, but for complex problems that need many parameters they are not ideal. This article introduces how to solve the CartPole problem with PolicyGradient.

## PolicyGradient example

The policy-based approach was introduced in the algorithm overview chapter. It models the policy directly: a neural network represents the policy, and the network outputs a probability for each action.
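As a minimal sketch of what "outputting a probability for each action" means, the NumPy snippet below turns a 4-dimensional observation into probabilities over 2 actions and samples one of them. The single linear layer, the weights `W` and `b`, and the observation values are made up for illustration; they are not the network used later in this article.

```python
import numpy as np

def softmax(logits):
    """Numerically stable softmax over the last axis."""
    shifted = logits - np.max(logits, axis=-1, keepdims=True)
    exp = np.exp(shifted)
    return exp / np.sum(exp, axis=-1, keepdims=True)

def policy_probs(observation, W, b):
    """Hypothetical one-layer policy: logits = observation @ W + b."""
    return softmax(observation @ W + b)

rng = np.random.default_rng(0)
W = rng.normal(0.0, 0.3, size=(4, 2))      # 4 state features -> 2 actions
b = np.full(2, 0.1)
obs = np.array([0.02, -0.01, 0.03, 0.04])  # made-up CartPole-like observation

probs = policy_probs(obs, W, b)            # one probability per action
action = rng.choice(len(probs), p=probs)   # sample an action from the policy
print(probs, action)
```

Sampling from the probabilities keeps some exploration; taking `np.argmax(probs)` instead would always pick the currently most likely action.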

We still use the learning framework above. Only the most important step, `choose_action`, changes: the action is now the one predicted by the PolicyGradient model.

First, let's look at the learning process; the main logic is explained in the code comments.

### Exploration process

    # Learning process: explore 1000 times
    for i_episode in range(1000):
        # Reset the environment at the start of every exploration
        observation = env.reset()
        while True:
            if RENDER:
                env.render()
            # Make a decision based on the policy model
            action = RL.choose_action(observation)
            # Execute the action and get the next observation, the reward
            # and other information
            observation_, reward, done, info = env.step(action)
            # Store the observation, action, and reward.
            # These sequences are needed for model learning
            RL.store_transition(observation, action, reward)
            # End of this exploration
            if done:
                ep_rs_sum = sum(RL.ep_rs)
                if 'running_reward' not in globals():
                    running_reward = ep_rs_sum
                else:
                    # Smoothed cumulative return across explorations
                    running_reward = running_reward * 0.99 + ep_rs_sum * 0.01
                # Start rendering once the reward exceeds the threshold
                if running_reward > DISPLAY_REWARD_THRESHOLD:
                    RENDER = True
                print("episode:", i_episode, "rewards:", int(running_reward), "RENDER", RENDER)
                # Learn once per exploration
                vt = RL.learn()
                break
            # Next step of the agent's exploration
            observation = observation_
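The `running_reward` update in the loop is an exponential moving average of the per-episode return. The sketch below isolates that formula; the helper name `update_running_reward`, the use of `None` instead of the `globals()` check, and the sample returns are choices made here for illustration only.

```python
def update_running_reward(running_reward, episode_return, decay=0.99):
    """Exponential moving average of episode returns.

    The first episode initializes the average; afterwards each new
    episode contributes (1 - decay) of its return.
    """
    if running_reward is None:
        return float(episode_return)
    return running_reward * decay + episode_return * (1.0 - decay)

running = None
history = []
for ep_return in [20.0, 30.0, 200.0]:   # made-up episode returns
    running = update_running_reward(running, ep_return)
    history.append(running)
print(history)  # approximately [20.0, 20.1, 21.899]
```

Because each episode only contributes 1% of its return, a single lucky episode (200 here) barely moves the average, so the rendering threshold is only crossed after consistently good episodes.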

### Model update process

The two key calls in the loop above are `action = RL.choose_action(observation)`, which picks the action at every step, and `vt = RL.learn()`, which runs once at the end of each exploration.

`RL` is an instance of `PolicyGradient`. Let's focus on the PolicyGradient model code and its analysis:

    import numpy as np
    # Assumed import: the code uses the TF 1.x graph API (placeholders, sessions),
    # which TensorFlow 2.x exposes through the compat.v1 module.
    import tensorflow.compat.v1 as tf


    class PolicyGradient:
        def __init__(
                self,
                n_actions,
                n_features,
                learning_rate=0.01,
                reward_decay=0.95,
                output_graph=False,
        ):
            # Dimension of the action space (2 for CartPole)
            self.n_actions = n_actions
            # Dimension of the state features (4 for CartPole)
            self.n_features = n_features
            # Learning rate
            self.lr = learning_rate
            # Return decay (discount) rate
            self.gamma = reward_decay
            # Observations, actions, and rewards of one exploration
            self.ep_obs, self.ep_as, self.ep_rs = [], [], []
            # Create the policy network
            self._build_net()
            self.sess = tf.Session()
            self.sess.run(tf.global_variables_initializer())
            if output_graph:
                tf.summary.FileWriter("logs/", self.sess.graph)

        def _build_net(self):
            """Create the policy network."""
            # Needed to run this 1.x-style code on TensorFlow 2.x
            tf.disable_eager_execution()
            with tf.name_scope('input'):
                # Observations: [B, 4]
                self.tf_obs = tf.placeholder(tf.float32, [None, self.n_features], name="observations")
                # Executed actions: [B,]
                self.tf_acts = tf.placeholder(tf.int32, [None, ], name="actions_num")
                # Cumulative returns: [B,]
                self.tf_vt = tf.placeholder(tf.float32, [None, ], name="actions_value")
            # Network structure: two fully connected layers
            layer = tf.layers.dense(
                inputs=self.tf_obs,
                units=10,
                activation=tf.nn.tanh,
                kernel_initializer=tf.random_normal_initializer(mean=0, stddev=0.3),
                bias_initializer=tf.constant_initializer(0.1),
                name='fc1',
            )
            all_act = tf.layers.dense(
                inputs=layer,
                units=self.n_actions,
                activation=None,
                kernel_initializer=tf.random_normal_initializer(mean=0, stddev=0.3),
                bias_initializer=tf.constant_initializer(0.1),
                name='fc2'
            )
            # Use softmax to predict the probability of each action
            self.all_act_prob = tf.nn.softmax(all_act, name='act_prob')
            # Define the loss function
            with tf.name_scope('loss'):
                # Maximizing the total reward (log_p * R) is equivalent to
                # minimizing -(log_p * R); TensorFlow only offers minimize(loss)
                neg_log_prob = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=all_act, labels=self.tf_acts)
                # or in this way:
                # neg_log_prob = tf.reduce_sum(-tf.log(self.all_act_prob)*tf.one_hot(self.tf_acts, self.n_actions), axis=1)
                loss = tf.reduce_mean(neg_log_prob * self.tf_vt)
            # Define the training op that updates the parameters
            with tf.name_scope('train'):
                self.train_op = tf.train.AdamOptimizer(self.lr).minimize(loss)

        def choose_action(self, observation, type="random"):
            """
            Select an action a_t in state s_t by sampling from the current
            action probability distribution.
            :param observation: Current observation
            :return: Action selected according to the policy
            """
            prob_weights = self.sess.run(self.all_act_prob, feed_dict={self.tf_obs: observation[np.newaxis, :]})
            # Sample according to the predicted probabilities, or take the
            # maximum directly ("random" mode adds more exploration)
            if type == "random":
                action = np.random.choice(range(prob_weights.shape[1]), p=prob_weights.ravel())
            else:
                action = np.argmax(prob_weights.ravel())
            return action

        def store_transition(self, s, a, r):
            """
            Store the state, action, and reward of one step of a round.
            :param s: Observation of the step
            :param a: Action of the step
            :param r: Reward of the step
            """
            self.ep_obs.append(s)
            self.ep_as.append(a)
            self.ep_rs.append(r)

        def learn(self):
            """After each exploration, update the policy network parameters."""
            # Calculate the cumulative discounted return of the exploration
            discounted_ep_rs_norm = self._discount_and_norm_rewards()
            # Run the training op to update the parameters
            self.sess.run(self.train_op, feed_dict={
                self.tf_obs: np.vstack(self.ep_obs),
                self.tf_acts: np.array(self.ep_as),
                self.tf_vt: discounted_ep_rs_norm,
            })
            # Clear the episode data and wait for the next exploration
            self.ep_obs, self.ep_as, self.ep_rs = [], [], []
            return discounted_ep_rs_norm

        def _discount_and_norm_rewards(self):
            """Discount the rewards of a round and normalize them."""
            # Force a float array so the in-place normalization below also
            # works when the rewards are integers
            discounted_ep_rs = np.zeros_like(self.ep_rs, dtype=np.float64)
            running_add = 0
            # The long-term cumulative reward is computed in reverse order:
            # return at time t = reward at time t + gamma * return at time t + 1
            for t in reversed(range(0, len(self.ep_rs))):
                running_add = running_add * self.gamma + self.ep_rs[t]
                discounted_ep_rs[t] = running_add
            # Normalize
            discounted_ep_rs -= np.mean(discounted_ep_rs)
            discounted_ep_rs /= np.std(discounted_ep_rs)
            return discounted_ep_rs
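To see what `_discount_and_norm_rewards` computes, here is a small standalone NumPy sketch on a three-step episode. The function names, the `gamma=0.5` value, and the small `eps` added to the standard deviation are choices made for this illustration and are not part of the class above.

```python
import numpy as np

def discount_rewards(rewards, gamma):
    """Return G_t = r_t + gamma * G_{t+1}, computed from the last step backwards."""
    returns = np.zeros(len(rewards), dtype=np.float64)
    running_add = 0.0
    for t in reversed(range(len(rewards))):
        running_add = running_add * gamma + rewards[t]
        returns[t] = running_add
    return returns

def normalize(x, eps=1e-8):
    """Zero-mean, unit-variance scaling; eps (an assumption) avoids division by zero."""
    return (x - x.mean()) / (x.std() + eps)

rewards = [1.0, 1.0, 1.0]                 # CartPole gives +1 per surviving step
returns = discount_rewards(rewards, gamma=0.5)
# Step 2: 1.0, step 1: 1.0 + 0.5 * 1.0 = 1.5, step 0: 1.0 + 0.5 * 1.5 = 1.75
normalized = normalize(returns)
print(returns, normalized)
```

After normalization the early steps get positive weights and the last steps get negative ones, so actions taken shortly before the pole falls are discouraged and earlier actions are reinforced.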

## Summary

Through the above logic, the framework of the whole exploration and learning process is summarized as follows:

    # Learning process: explore N times
    for i_episode in range(N):
        observation = env.reset()
        while True:
            # Decide on an action (replaceable module)
            action = choose_action(observation)
            observation_, reward, done, _ = env.step(action)
            # Store the explored sequence information
            store_transition(observation, action, reward)
            if done:
                # Model learning (replaceable module)
                vt = learn()
                break
            # The agent explores one step and updates the observation
            observation = observation_
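To make the "replaceable module" idea concrete, the sketch below runs the same loop with a stand-in environment and a random agent. `ToyEnv`, `RandomAgent`, and `run` are hypothetical names invented here; the toy environment simply ends every episode after 5 steps and is not a real Gym environment.

```python
import random

class ToyEnv:
    """Hypothetical stand-in for an environment: every episode lasts 5 steps."""
    def reset(self):
        self.t = 0
        return [0.0, 0.0, 0.0, 0.0]

    def step(self, action):
        self.t += 1
        observation = [float(self.t)] * 4
        done = self.t >= 5
        return observation, 1.0, done, {}

class RandomAgent:
    """Exposes the same three methods as the agent in the loop above."""
    def __init__(self, n_actions, seed=0):
        self.n_actions = n_actions
        self.rng = random.Random(seed)
        self.ep_rs = []

    def choose_action(self, observation):
        # Replaceable decision module: here, a uniformly random action
        return self.rng.randrange(self.n_actions)

    def store_transition(self, s, a, r):
        self.ep_rs.append(r)

    def learn(self):
        # Replaceable learning module: here, only report the episode return
        total = sum(self.ep_rs)
        self.ep_rs = []
        return total

def run(env, agent, n_episodes):
    episode_returns = []
    for _ in range(n_episodes):
        observation = env.reset()
        while True:
            action = agent.choose_action(observation)
            observation_, reward, done, _info = env.step(action)
            agent.store_transition(observation, action, reward)
            if done:
                episode_returns.append(agent.learn())
                break
            observation = observation_
    return episode_returns

episode_returns = run(ToyEnv(), RandomAgent(n_actions=2), n_episodes=3)
print(episode_returns)
```

Any agent that provides `choose_action`, `store_transition`, and `learn` can be dropped into `run` without changing the loop itself.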

The most important component is the decision model, which maps the current observation to the best action to take, so as to maximize the long-term return.

The most important parts of the decision model are the network design (this code uses a simple two-layer fully connected network; more complex networks can be designed) and the loss design (the goal is to maximize the long-term return).
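The loss used in this article, the mean of `-log_p * R`, can be checked by hand with a few lines of NumPy. This is a sketch of the same formula outside TensorFlow; the function names and the sample logits, actions, and returns are made up for illustration.

```python
import numpy as np

def log_softmax(logits):
    """Row-wise log of the softmax, computed in a numerically stable way."""
    shifted = logits - logits.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def policy_gradient_loss(logits, actions, returns):
    """Mean over steps of -log pi(a_t | s_t) * G_t."""
    log_probs = log_softmax(logits)
    chosen = log_probs[np.arange(len(actions)), actions]  # log-prob of the taken action
    return float(np.mean(-chosen * returns))

# The same action (index 0) with a positive return: the loss is lower when
# the network already assigns that action a higher probability.
loss_uniform = policy_gradient_loss(np.array([[0.0, 0.0]]), np.array([0]), np.array([1.0]))
loss_confident = policy_gradient_loss(np.array([[2.0, 0.0]]), np.array([0]), np.array([1.0]))
print(loss_uniform, loss_confident)
```

Minimizing this loss therefore pushes up the probability of actions that were followed by above-average returns and pushes down the probability of actions with negative normalized returns.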

In the next article, we will introduce the Actor-Critic scheme, which combines policy-based and value-based methods.
