# 1. Activation function

The idea of activation function comes from the analysis of the working mechanism of neurons in human brain. Neurons are activated above a certain threshold (also known as activation potential). In most cases, the activation function is also intended to limit the output to a small range.

## 1.1 hyperbolic tangent function and Sigmoid function

The following figure shows the tanh and Sigmoid activation functions:

The demo code is as follows:

# 2. Loss function (cost function)

The loss function (cost function) is used to minimize to get the optimal value of each parameter of the model. For example, in order to use the predictor (X) to predict the value of the target (y), it is necessary to obtain the weight value (slope) and the offset amount (y intercept). The method to get the optimal value of slope and Y intercept is to minimize the sum of cost function / loss function / square. For any model, there are many parameters, and the model structure of prediction or classification is also represented by the value of parameters.

You need to calculate the model, and to do that, you need to define a cost function (loss function). The purpose of minimizing loss function is to find the optimal value of each parameter. L1 or L2 are useful loss functions for regression / numerical prediction problems. For classification problems, cross entropy is a very useful loss function. Softmax or Sigmoid cross entropy are very popular loss functions

## 2.1 L1 norm loss function

The L1 norm loss function, also known as the least absolute deviation (LAD), the least absolute error (LAE). In general, it is to minimize the sum (S) of the absolute difference between the target value (Yi) and the estimated value (f(xi)):

## 2.2 L2 norm loss function

L2 norm loss function, also known as least square error (LSE). In general, it is to minimize the sum of squares (S) of the difference between the target value (Yi) and the estimated value (f(xi)):

C is the cost function, x is the sample, y is the actual value, a is the output value, and n is the total number of samples.

For the sake of simplicity, take the example of only one sample,
a=σ(z), z=∑Wj*Xj+b
σ () is the activation function
In this case, the quadratic cost function is:

It can be seen from the above that the gradient change of the quadratic cost function W and b is related to the activation function.

• Let's suppose that our goal of convergence is 1. Point A is 0.82, which is far away from the target, while point A has A larger gradient and A larger weight adjustment. Point B is 0.98, which is close to the target, while point B has A smaller gradient and A smaller weight adjustment. Therefore, it can quickly adjust from point A to point B, and then slowly converge to point 1. This scheme is reasonable.
• Let's suppose that our goal of convergence is 0. Point B is 0.98, which is far away from the target, while point B has A smaller gradient and smaller weight adjustment. Point A is 0.82, which is close to the target, while point A has A larger gradient and larger weight adjustment. Therefore, if it starts from point B, it will take A long time to reach point A at point B, then this scheme is unreasonable.
If the error is relatively large, it means that we are far away from our goal. At this time, the weight adjustment should be relatively large, which we think is more reasonable.

## 2.5 log likelihood cost

Logarithmic relief function is often used as the cost function of softmax regression. If the output layer neuron is sigmoid function, cross entropy cost function can be used. In depth learning, softmax is regarded as the last layer. At this time, the common cost function is the logarithmic release cost function.

The combination of log likelihood cost function and softmax, and the combination of cross entropy and sigmoid function are very similar. The logarithmic release cost function can be reduced to the form of cross entropy cost function in binary classification.

In Tensorflow, use:
TF. NN. Sigmoid? Cross? Entropy? With? Logits() to represent the cross entropy used with sigmoid. TF. NN. Softmax? Cross? Entropy? With? Logits() to represent the cross entropy used with softmax.

# 3. Demo code

Modify 3-2 to simply implement handwritten digit recognition code, and use softmax cross entropy cost function:

```import tensorflow as tf
from tensorflow.examples.tutorials.mnist import input_data

#Size of each batch
batch_size = 50
#How many batches are there
n_batch = mnist.train.num_examples // batch_size

#Define two placeholder s
x = tf.placeholder(tf.float32,[None,784])
y = tf.placeholder(tf.float32,[None,10])

#Create a simple neural network
W = tf.Variable(tf.zeros([784,10]))
b = tf.Variable(tf.zeros([10]))
prediction = tf.nn.softmax(tf.matmul(x,W) + b)

# loss = tf.reduce_mean(tf.square(y-prediction))
#Using softmax cross entropy cost function
loss = tf.reduce_mean(tf.nn.softmax_cross_entropy_with_logits(labels=y,logits=prediction))

#initialize variable
init = tf.global_variables_initializer()

#Results are stored in a Boolean list
#argmax returns the position of the maximum value in one-dimensional tensor
correct_prediction = tf.equal(tf.argmax(y,1),tf.argmax(prediction,1))
#Accuracy rate
accuracy = tf.reduce_mean(tf.cast(correct_prediction,tf.float32))

with tf.Session() as sess:
sess.run(init)
for epoch in range(21):
for batch in range(n_batch):
batch_xs,batch_ys = mnist.train.next_batch(batch_size)
sess.run(train_step,feed_dict = {x:batch_xs,y:batch_ys})

acc = sess.run(accuracy,feed_dict = {x:mnist.test.images,y:mnist.test.labels})
print("Iter"+str(epoch)+",Testing Accuracy"+str(acc))
```

Comparison of operation results

```#The result of quadratic cost function
Iter0,Testing Accuracy0.8703
Iter1,Testing Accuracy0.8876
Iter2,Testing Accuracy0.8964
Iter3,Testing Accuracy0.9018
Iter4,Testing Accuracy0.9047
Iter5,Testing Accuracy0.9069
Iter6,Testing Accuracy0.9094
Iter7,Testing Accuracy0.9108
Iter8,Testing Accuracy0.9121
Iter9,Testing Accuracy0.9135
Iter10,Testing Accuracy0.9145
Iter11,Testing Accuracy0.9155
Iter12,Testing Accuracy0.9166
Iter13,Testing Accuracy0.9176
Iter14,Testing Accuracy0.9176
Iter15,Testing Accuracy0.9183
Iter16,Testing Accuracy0.9186
Iter17,Testing Accuracy0.9192
Iter18,Testing Accuracy0.9195
Iter19,Testing Accuracy0.919
Iter20,Testing Accuracy0.9205
#Result of cross entropy cost function
Iter0,Testing Accuracy0.8944
Iter1,Testing Accuracy0.9054
Iter2,Testing Accuracy0.9099
Iter3,Testing Accuracy0.9134
Iter4,Testing Accuracy0.9148
Iter5,Testing Accuracy0.9167
Iter6,Testing Accuracy0.9202
Iter7,Testing Accuracy0.9207
Iter8,Testing Accuracy0.9211
Iter9,Testing Accuracy0.9214
Iter10,Testing Accuracy0.9214
Iter11,Testing Accuracy0.9222
Iter12,Testing Accuracy0.9228
Iter13,Testing Accuracy0.9237
Iter14,Testing Accuracy0.924
Iter15,Testing Accuracy0.9245
Iter16,Testing Accuracy0.9238
Iter17,Testing Accuracy0.9245
Iter18,Testing Accuracy0.9244
Iter19,Testing Accuracy0.9251
Iter20,Testing Accuracy0.9251
```

It can be seen from the results that when the accuracy reaches 0.909, it needs 6 iterations to use the quadratic cost function, and only 2 iterations to use the softmax cross entropy function. It is obvious that the speed of using the cross entropy cost function will be much faster.
conclusion
So when we use S-type activation function or softmax, we should use cross entropy cost function, which is more efficient.

Published 35 original articles, won praise 7, visited 30000+

Keywords: network Session

Added by implications on Sun, 09 Feb 2020 12:09:55 +0200