1. What is Policy Gradient?
The basic idea of the policy gradient method is to output the probability of each action directly from the current state. How do we produce this output? The simplest way is to use a neural network.
When we feed the current state into a neural network, the network outputs the probability of each action we can take in that state. How should the network be trained so that it eventually converges? Back-propagation is the most common way to train a neural network: we need a loss function whose value we minimize by gradient descent. In reinforcement learning, however, we do not know whether an action is correct; we can only judge how relatively good or bad an action is from the reward it receives. Based on that, we have a very simple idea:
If an action receives a larger reward, we increase its probability of being selected; if an action receives a smaller reward, we decrease its probability of being selected.
Based on this idea, we construct the following loss function: loss = -log(prob) * vt
log(prob) measures how surprising it is that action a was selected in state s: -log(prob) grows larger as the probability gets smaller. vt is the return that action a receives in the current state s, that is, the immediate reward plus the discounted sum of future rewards. This means our policy gradient algorithm must finish a complete episode before updating its parameters, rather than updating after every (s, a, r, s') transition as value-based methods do. If an action with a very small prob receives a large vt, then -log(prob) * vt is large, which means a big surprise. (I picked an action that I rarely choose and found that it earns a good reward, so I have to make a big change to my parameters this time.)
This is the intuitive meaning of -log(prob) * vt. The core idea of Policy Gradient is to update the parameters with two considerations: first, if an action was selected in this episode, raise the probability of selecting it next time; then look at the reward. If the reward is positive, the probability of the action is increased further; if the reward is negative, the probability of the action is reduced.
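To make this concrete, here is a tiny numerical sketch (the probabilities and returns below are made up purely for illustration): an action with a small probability that earns a large return produces a much larger loss term, and therefore a much larger gradient, than a likely action with a small return.

import numpy as np

# Hypothetical numbers, only to illustrate -log(prob) * vt.
prob_rare, vt_rare = 0.05, 4.0        # rarely chosen action, large return
prob_common, vt_common = 0.80, 0.5    # frequently chosen action, small return

loss_rare = -np.log(prob_rare) * vt_rare        # ~11.98 -> big parameter update
loss_common = -np.log(prob_common) * vt_common  # ~0.11  -> small parameter update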
The output of the policy gradient algorithm is the probability of each action, not a Q value.
The code in this article follows this policy gradient process.
Define parameters
First, we define some model parameters:
# self.ep_obs, self.ep_as, self.ep_rs store the states, actions and rewards
# of the current episode, respectively.
self.n_actions = n_actions
self.n_features = n_features
self.lr = learning_rate
self.gamma = reward_decay
self.ep_obs, self.ep_as, self.ep_rs = [], [], []
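For context, these assignments normally sit in the constructor of the agent class. The sketch below assumes a class named PolicyGradient and a _build_net() method holding the placeholder, network, loss and optimizer code shown in the following sections; both names are illustrative, not fixed by the text.

import tensorflow as tf

class PolicyGradient:
    def __init__(self, n_actions, n_features, learning_rate=0.01, reward_decay=0.95):
        self.n_actions = n_actions
        self.n_features = n_features
        self.lr = learning_rate
        self.gamma = reward_decay
        # Buffers for the current episode's states, actions and rewards.
        self.ep_obs, self.ep_as, self.ep_rs = [], [], []

        self._build_net()                     # placeholders, layers, loss, train_op
        self.sess = tf.Session()
        self.sess.run(tf.global_variables_initializer())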
Define model input
The input of the model consists of three parts: observation value, action value and reward value.
with tf.name_scope('inputs'):
    self.tf_obs = tf.placeholder(tf.float32, [None, self.n_features], name='observation')
    self.tf_acts = tf.placeholder(tf.int32, [None, ], name='actions_num')
    self.tf_vt = tf.placeholder(tf.float32, [None, ], name='actions_value')
Building the model
Our model is a two-layer neural network. The input of the network is the state, and the output is a score for each action that can be taken in that state; these scores are normalized by a softmax to obtain the probability of each action.
layer = tf.layers.dense(
    inputs=self.tf_obs,
    units=10,
    activation=tf.nn.tanh,
    kernel_initializer=tf.random_normal_initializer(mean=0, stddev=0.3),
    bias_initializer=tf.constant_initializer(0.1),
    name='fc1'
)
all_act = tf.layers.dense(
    inputs=layer,
    units=self.n_actions,
    activation=None,
    kernel_initializer=tf.random_normal_initializer(mean=0, stddev=0.3),
    bias_initializer=tf.constant_initializer(0.1),
    name='fc2'
)
self.all_act_prob = tf.nn.softmax(all_act, name='act_prob')
Model loss
As mentioned earlier, the loss function of the model is loss = -log(prob) * vt. We could use tf.nn.sparse_softmax_cross_entropy_with_logits to compute the -log(prob) part directly, but to make the calculation process clearer, we use the following method:
with tf.name_scope('loss'):
    neg_log_prob = tf.reduce_sum(
        -tf.log(self.all_act_prob) * tf.one_hot(indices=self.tf_acts, depth=self.n_actions),
        axis=1)
    loss = tf.reduce_mean(neg_log_prob * self.tf_vt)
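For reference, the more direct formulation mentioned above computes the same -log(prob) term straight from the raw logits. This is only a sketch and assumes the all_act logits tensor from the network-building step is still in scope:

# Equivalent loss: the cross-entropy of the chosen actions against the logits
# is exactly -log(prob of the chosen action).
neg_log_prob = tf.nn.sparse_softmax_cross_entropy_with_logits(
    logits=all_act, labels=self.tf_acts)
loss = tf.reduce_mean(neg_log_prob * self.tf_vt)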
We choose the AdamOptimizer to update the parameters:
with tf.name_scope('train'):
    self.train_op = tf.train.AdamOptimizer(self.lr).minimize(loss)
Action Selection
Here we no longer select actions with a greedy strategy. Instead, we sample an action according to the output probabilities, so each action is chosen with its corresponding probability:
def choose_action(self, observation):
    prob_weights = self.sess.run(self.all_act_prob,
                                 feed_dict={self.tf_obs: observation[np.newaxis, :]})
    action = np.random.choice(range(prob_weights.shape[1]), p=prob_weights.ravel())
    return action
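For comparison, a purely greedy agent would always take the most probable action with np.argmax, while the sampling above keeps exploring. A small sketch with a made-up probability vector:

import numpy as np

probs = np.array([0.7, 0.2, 0.1])    # hypothetical output of all_act_prob

greedy_action = np.argmax(probs)                         # always action 0
sampled_action = np.random.choice(len(probs), p=probs)   # action 0 about 70% of the time,
                                                         # action 1 about 20%, action 2 about 10%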
Storing experience
As mentioned earlier, policy gradients only start training after a complete episode, so before an episode ends we store all of the episode's experience: states, actions and rewards.
def store_transition(self, s, a, r):
    self.ep_obs.append(s)
    self.ep_as.append(a)
    self.ep_rs.append(r)
Calculate the discounted rewards
The reward we stored is only the immediate reward for taking action a in state s, while the true return for taking action a in state s is the immediate reward plus the discounted sum of future rewards until the end of the episode.
def _discount_and_norm_rewards(self):
    discounted_ep_rs = np.zeros_like(self.ep_rs)
    running_add = 0
    # Traverse the rewards in reverse order to accumulate the discounted sum.
    for t in reversed(range(0, len(self.ep_rs))):
        running_add = running_add * self.gamma + self.ep_rs[t]
        discounted_ep_rs[t] = running_add
    # Normalize the returns: zero mean and unit variance.
    discounted_ep_rs -= np.mean(discounted_ep_rs)
    discounted_ep_rs /= np.std(discounted_ep_rs)
    return discounted_ep_rs
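A small worked example (made-up rewards, gamma = 0.9) shows what the reversed loop produces and what the final two lines do:

import numpy as np

gamma = 0.9
ep_rs = [1.0, 1.0, 1.0]              # made-up rewards for illustration

discounted = np.zeros_like(ep_rs)
running_add = 0.0
for t in reversed(range(len(ep_rs))):
    running_add = running_add * gamma + ep_rs[t]
    discounted[t] = running_add
# discounted is now [2.71, 1.9, 1.0]: each entry is the reward at that step
# plus 0.9 times the discounted sum of everything that follows it.

normalized = (discounted - discounted.mean()) / discounted.std()
# normalized has zero mean and unit variance, which reduces the variance
# of the gradient estimate.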
Model training
Once all the components above have been defined, we can write the model training function. Note that we do not feed the model the raw rewards we stored, but the discounted and normalized returns computed in the previous step. In addition, we need to empty our experience buffers after each training step.
def learn(self):
    discounted_ep_rs_norm = self._discount_and_norm_rewards()
    self.sess.run(self.train_op, feed_dict={
        self.tf_obs: np.vstack(self.ep_obs),
        self.tf_acts: np.array(self.ep_as),
        self.tf_vt: discounted_ep_rs_norm,
    })
    # Empty the episode buffers after each update.
    self.ep_obs, self.ep_as, self.ep_rs = [], [], []
    return discounted_ep_rs_norm
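To see how these pieces fit together, here is a sketch of a typical training loop. It assumes the methods above are wrapped in a PolicyGradient class and uses the classic OpenAI Gym reset/step API; CartPole-v0 and the hyperparameters are just illustrative choices.

import gym

env = gym.make('CartPole-v0')
RL = PolicyGradient(n_actions=env.action_space.n,
                    n_features=env.observation_space.shape[0],
                    learning_rate=0.02,
                    reward_decay=0.99)

for episode in range(300):
    observation = env.reset()
    while True:
        action = RL.choose_action(observation)
        observation_, reward, done, info = env.step(action)
        RL.store_transition(observation, action, reward)   # collect the whole episode
        if done:
            RL.learn()           # one parameter update per complete episode
            break
        observation = observation_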
That covers all of the model-related code. It should be easy to understand at a glance, so we won't walk through it line by line again.