Policy Gradient for Reinforcement Learning

1. What is a Policy Gradient?

The basic idea of a policy gradient method is to output the probability of each action directly from the current state. How do we produce that output? The simplest way is with a neural network.

When we feed the current state into a neural network, the network outputs the probability of taking each action in that state. How should the network be trained so that it eventually converges? Back-propagation is the most common way to train a neural network: we need a loss function that we minimize by gradient descent. In reinforcement learning, however, we don't know whether an action is correct or not; we can only judge how relatively good or bad an action is from its reward. Based on that, we have a very simple idea:

If an action receives a larger reward, we increase its probability of being selected; if an action receives a smaller reward, we decrease its probability of being selected.

Based on this idea, we construct the following loss function: loss = -log(prob) * vt

In the expression above, log(prob) reflects how surprising it is that action a was selected in state s: -log(prob) grows as the probability gets smaller. vt is the return that action a receives in state s, i.e. the current reward plus the discounted sum of future rewards. This means the policy gradient algorithm must complete a full episode before updating its parameters, instead of updating from every (s,a,r,s') transition as value-based methods do. If an action with a very small prob obtains a large reward, i.e. a large vt, then -log(prob)*vt is large, which is all the more surprising. (I picked an action that I rarely choose and found that it yields a good reward, so I have to change my parameters a lot this time.)

That is the intuitive meaning of -log(prob)*vt. The core idea of Policy Gradient is to update the parameters with two considerations: if an action was selected in this round, make it more likely to be selected in the next round, and then look at the return. If the return is positive, the probability of the action is increased; if the return is negative, the probability of the action is decreased.
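
To connect this intuition with the standard REINFORCE formulation (stated here for reference; the symbols below, such as the parameterized policy \pi_\theta and the episode length T, are not used in the original text, and prob corresponds to \pi_\theta(a_t \mid s_t)):

% Surrogate loss summed over one episode of length T
L(\theta) = -\frac{1}{T} \sum_{t=0}^{T-1} \log \pi_\theta(a_t \mid s_t)\, v_t

% Its gradient is the Monte Carlo policy-gradient estimate, so minimizing
% L(\theta) by gradient descent ascends the expected return J(\theta):
\nabla_\theta J(\theta) \approx \frac{1}{T} \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, v_t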

The output of the policy gradient algorithm is a probability over actions, not a Q value.

The code in this article follows the policy gradient procedure described above.

Define parameters

First, we define some model parameters:

self.ep_obs, self.ep_as and self.ep_rs store the states, actions and rewards of the current episode, respectively.
self.n_actions = n_actions
self.n_features = n_features
self.lr = learning_rate
self.gamma = reward_decay
self.ep_obs,self.ep_as,self.ep_rs = [],[],[]
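
For context, a minimal constructor in which these assignments might live is sketched below. The class name, the _build_net helper and the default hyperparameters are assumptions for illustration, not taken from the original; the Session created here is the self.sess used later in choose_action() and learn().

import tensorflow as tf

class PolicyGradient:
    # Hypothetical constructor sketch; signature and defaults are assumptions
    def __init__(self, n_actions, n_features, learning_rate=0.01, reward_decay=0.95):
        self.n_actions = n_actions        # size of the discrete action space
        self.n_features = n_features      # dimension of an observation
        self.lr = learning_rate           # learning rate for the Adam optimizer
        self.gamma = reward_decay         # discount factor for future rewards
        self.ep_obs, self.ep_as, self.ep_rs = [], [], []  # per-episode buffers

        self._build_net()                 # hypothetical helper wrapping the graph-building code below
        self.sess = tf.Session()
        self.sess.run(tf.global_variables_initializer())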

Define model input

The input of the model consists of three parts: the observations, the actions taken, and the corresponding return values.

with tf.name_scope('inputs'):
    self.tf_obs = tf.placeholder(tf.float32,[None,self.n_features],name='observation')  # batch of states
    self.tf_acts = tf.placeholder(tf.int32,[None,],name='actions_num')                  # actions taken at each step
    self.tf_vt = tf.placeholder(tf.float32,[None,],name='actions_value')                # discounted, normalized returns

Building the model

Our model is a two-layer neural network. The input is the state, and the output is a score for each action that can be taken in that state; a softmax then normalizes these scores into a probability vector over the actions.

# hidden layer: 10 tanh units
layer = tf.layers.dense(
    inputs = self.tf_obs,
    units = 10,
    activation= tf.nn.tanh,
    kernel_initializer=tf.random_normal_initializer(mean=0,stddev=0.3),
    bias_initializer= tf.constant_initializer(0.1),
    name='fc1'
)
# output layer: one raw score (logit) per action
all_act = tf.layers.dense(
    inputs = layer,
    units = self.n_actions,
    activation = None,
    kernel_initializer=tf.random_normal_initializer(mean=0,stddev=0.3),
    bias_initializer = tf.constant_initializer(0.1),
    name='fc2'
)
# softmax converts the logits into action probabilities
self.all_act_prob = tf.nn.softmax(all_act,name='act_prob')

Model loss

As mentioned earlier, the model's loss is loss = -log(prob) * vt. The first part, -log(prob), could be computed directly with tf.nn.sparse_softmax_cross_entropy_with_logits, but to keep the calculation explicit we write it out as follows:

with tf.name_scope('loss'):
    # -log(prob) of the action actually taken, selected with a one-hot mask
    neg_log_prob = tf.reduce_sum(-tf.log(self.all_act_prob) * tf.one_hot(indices=self.tf_acts,depth=self.n_actions),axis=1)
    # weight each step by its discounted return vt and average over the episode
    loss = tf.reduce_mean(neg_log_prob * self.tf_vt)
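
For comparison, the shortcut mentioned above would look roughly like this; it works on the raw logits all_act rather than on the softmax output, but computes the same -log(prob) term (a sketch, equivalent up to numerical details):

with tf.name_scope('loss'):
    # built-in op: softmax over the logits, then -log(prob) of the chosen action
    neg_log_prob = tf.nn.sparse_softmax_cross_entropy_with_logits(logits=all_act, labels=self.tf_acts)
    loss = tf.reduce_mean(neg_log_prob * self.tf_vt)  # weight by the discounted return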

We use the Adam optimizer (AdamOptimizer) to update the parameters:

with tf.name_scope('train'):
    self.train_op = tf.train.AdamOptimizer(self.lr).minimize(loss)

Action Selection

Actions are no longer chosen by a greedy strategy (as in value-based methods); instead, we sample an action according to the probabilities output by the network, so each action is selected with its corresponding probability:

def choose_action(self,observation):
    # run the network on a single observation to get the action probabilities
    prob_weights = self.sess.run(self.all_act_prob,feed_dict={self.tf_obs:observation[np.newaxis,:]})
    # sample an action according to those probabilities
    action = np.random.choice(range(prob_weights.shape[1]),p=prob_weights.ravel())
    return action
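
As a quick illustration of the sampling step (the probability vector here is made up for the example):

import numpy as np

# with probabilities [0.7, 0.2, 0.1] over 3 actions, action 0 is picked roughly
# 70% of the time, while actions 1 and 2 still get explored occasionally
prob_weights = np.array([[0.7, 0.2, 0.1]])
action = np.random.choice(range(prob_weights.shape[1]), p=prob_weights.ravel())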

Storing experience

As mentioned earlier, the policy gradient method only trains after a complete episode, so until an episode ends we store all of the episode's experience: states, actions and rewards.

def store_transition(self,s,a,r):
    self.ep_obs.append(s)
    self.ep_as.append(a)
    self.ep_rs.append(r)

Calculating the discounted reward

The reward we stored is the immediate reward for taking action a in state s, while the true value of taking action a in state s should be the immediate reward plus the discounted sum of all future rewards until the end of the episode.

def _discount_and_norm_rewards(self):
    # use a float array so integer rewards are not truncated during discounting
    discounted_ep_rs = np.zeros_like(self.ep_rs, dtype=np.float64)
    running_add = 0
    # reversed() walks the rewards from the end of the episode backwards,
    # accumulating the discounted sum at each step
    for t in reversed(range(0,len(self.ep_rs))):
        running_add = running_add * self.gamma + self.ep_rs[t]
        discounted_ep_rs[t] = running_add
    # normalize the returns (zero mean, unit variance) to reduce gradient variance
    discounted_ep_rs -= np.mean(discounted_ep_rs)
    discounted_ep_rs /= np.std(discounted_ep_rs)
    return discounted_ep_rs
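
For example, with gamma = 0.9 and per-step rewards [1, 1, 1], the loop above yields approximately [2.71, 1.9, 1.0] before normalization; a small stand-alone check (the numbers are for illustration only):

import numpy as np

rewards, gamma = [1.0, 1.0, 1.0], 0.9
returns, running_add = np.zeros(len(rewards)), 0.0
for t in reversed(range(len(rewards))):
    running_add = running_add * gamma + rewards[t]
    returns[t] = running_add
print(returns)                                      # approximately [2.71, 1.9, 1.0]
print((returns - returns.mean()) / returns.std())   # normalized version fed to the network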

Model training

Once all the components above have been defined, we can write the training function. Note that we do not feed the model the raw stored rewards, but the discounted and normalized returns computed in the previous step. In addition, we empty the experience buffers after each training step.

def learn(self):
    # compute the discounted, normalized returns for the finished episode
    discounted_ep_rs_norm = self._discount_and_norm_rewards()
    # one gradient step over the whole episode
    self.sess.run(self.train_op,feed_dict={
        self.tf_obs:np.vstack(self.ep_obs),
        self.tf_acts:np.array(self.ep_as),
        self.tf_vt:discounted_ep_rs_norm,
    })

    # empty the episode buffers for the next episode
    self.ep_obs,self.ep_as,self.ep_rs = [],[],[]
    return discounted_ep_rs_norm

That covers the code related to the model; it should be easy to follow, so we won't go into further detail. To show how the pieces fit together, a rough training-loop sketch is given below.
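
The environment (gym's CartPole-v0), the hyperparameters and the episode count in this sketch are assumptions for illustration, and it assumes the classic gym API where env.step returns four values; the class methods used are the ones defined above.

import gym

env = gym.make('CartPole-v0')
RL = PolicyGradient(n_actions=env.action_space.n,
                    n_features=env.observation_space.shape[0],
                    learning_rate=0.02,
                    reward_decay=0.99)

for episode in range(300):
    observation = env.reset()
    while True:
        action = RL.choose_action(observation)
        observation_, reward, done, info = env.step(action)
        RL.store_transition(observation, action, reward)   # collect the whole episode
        if done:
            RL.learn()    # one policy-gradient update per episode; buffers are cleared inside
            break
        observation = observation_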
