Differences between Logistic Regression and Linear Regression
Linear regression (used as a classifier) works reasonably well when the different categories contain similar numbers of training samples. A small difference between categories usually has little effect, but a large difference distorts the learning process.
Category (class) imbalance refers to the situation in a classification task where the numbers of training samples in different categories differ greatly.
For example:
The graph shows the relationship between age and whether a toy is purchased. Linear regression can be used to fit a straight line: label a purchase as 1 and no purchase as 0, then take 0.5 as the threshold for separating the two classes.
From the figure, the age that separates the classes is about 19. When the data points are unbalanced, however, the threshold is easily disturbed, as shown in the following figure:
It can be seen that after the 0-labelled samples shift toward higher ages, the true boundary is still around 19 years old, but the threshold of the fitted line shifts to a higher age; the more negative samples (older people) there are, the more severe the shift. Is that reasonable? In fact, neither 60-year-olds nor 80-year-olds buy toys, so adding more 80-year-olds should not change the probability that people under 20 buy toys. However, because the fitted line originally takes values over (−∞, +∞) while the converted range is [0, 1], the threshold is sensitive to such shifts in the inputs.
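This sensitivity can be reproduced with a quick sketch. The ages and labels below are made up purely for illustration: a least-squares line is fitted to the 0/1 labels, the age where it crosses 0.5 is taken as the threshold, and then a large group of 80-year-old non-buyers is appended to show that the threshold drifts upward even though the young buyers are unchanged.

```python
import numpy as np

# Hypothetical data: buyers are young (label 1), non-buyers are older (label 0)
ages = np.array([5, 8, 10, 12, 15, 17, 21, 25, 30, 35], dtype=float)
buys = np.array([1, 1, 1,  1,  1,  1,  0,  0,  0,  0], dtype=float)

def threshold_of_linear_fit(x, y):
    # Fit y = w*x + b by least squares and return the x where the line crosses 0.5
    w, b = np.polyfit(x, y, 1)
    return (0.5 - b) / w

print(threshold_of_linear_fit(ages, buys))   # roughly 20, near the true boundary

# Append many 80-year-old non-buyers: the young buyers are unchanged,
# but the least-squares line flattens and the 0.5 crossing drifts to a higher age
ages2 = np.concatenate([ages, np.full(30, 80.0)])
buys2 = np.concatenate([buys, np.zeros(30)])
print(threshold_of_linear_fit(ages2, buys2))  # noticeably larger than before
```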
Principles of Logistic Regression
A common alternative is the log-odds (logarithmic probability) function, a "Sigmoid" function that squashes any z value into a y value in (0, 1), with the output changing steeply around z = 0.
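For reference, the Sigmoid function that performs this conversion is usually written as

$$y = \sigma(z) = \frac{1}{1 + e^{-z}}, \qquad z = \theta^{T}x,$$

where z is the output of the underlying linear model.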
From its curve we can see that when z is greater than 0 the function value is greater than 0.5; when z equals 0 the value is exactly 0.5; and when z is less than 0 the value is less than 0.5. If the probability of an example belonging to the positive class is expressed through such a function, the following "unit step function" can be used to decide the class of an example:
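The unit step function referred to here can be written as

$$y = \begin{cases} 0, & z < 0 \\ 0.5, & z = 0 \\ 1, & z > 0 \end{cases}$$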
If z is greater than 0, the example is judged positive; if z is less than 0, it is judged negative; if z equals 0, it can be assigned to either class. Because the Sigmoid function is monotonic and differentiable, and its output lies in (0, 1), it approximates this step function well and is therefore well suited to binary classification.
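As a minimal illustration of this decision rule in numpy (the z values are arbitrary):

```python
import numpy as np

def sigmoid(z):
    # Maps any real z into (0, 1); equals 0.5 exactly at z = 0
    return 1.0 / (1.0 + np.exp(-z))

z = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])    # arbitrary linear-model outputs
probs = sigmoid(z)
labels = (probs >= 0.5).astype(int)          # threshold at 0.5, i.e. at z = 0
print(probs)    # approximately [0.047 0.378 0.5 0.622 0.953]
print(labels)   # [0 0 1 1 1]
```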
Regularization and Model Evaluation Indicators
Regularization
Overfitting can be suppressed by adding a regularization term, a penalty on θ, after the loss function.
L1 Regularization
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[\,y^{(i)}\log h_\theta(x^{(i)}) + \left(1-y^{(i)}\right)\log\left(1-h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{m}\sum_{j=1}^{n}\left|\theta_j\right|$$

The gradient with respect to each parameter is

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\,\mathrm{sgn}(\theta_j)$$
The gradient-descent update therefore becomes
$$\theta := \theta - K'(\theta) - \frac{\lambda}{m}\,\mathrm{sgn}(\theta)$$
K(θ) is the original (unregularized) loss function. Since the sign of the last term is determined only by sgn(θ), every update pushes each parameter toward zero by a fixed amount regardless of how small it already is. As a result, the solution produced by L1 regularization tends to be sparse (the result vector contains more zero entries). (See the illustration: it is equivalent to finding the point where the loss contours first touch the diamond-shaped constraint region.)
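A minimal sketch of one such L1-regularized update, assuming the gradient above, a design matrix X whose first column is all ones, and a hypothetical learning rate alpha:

```python
import numpy as np

def l1_gradient_step(theta, X, y, alpha=0.1, lam=0.5):
    # One gradient-descent step on the L1-regularized cross-entropy loss.
    # X: (m, d) with a leading column of ones, y: (m,) of 0/1 labels, theta: (d,)
    m = X.shape[0]
    h = 1.0 / (1.0 + np.exp(-X.dot(theta)))   # sigmoid predictions h_theta(x)
    grad = X.T.dot(h - y) / m                 # gradient of the unregularized loss
    grad += (lam / m) * np.sign(theta)        # L1 term: (lambda/m) * sgn(theta)
    # In practice the intercept component theta[0] is usually excluded from the penalty.
    return theta - alpha * grad
```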
L2 Regularization
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[\,y^{(i)}\log h_\theta(x^{(i)}) + \left(1-y^{(i)}\right)\log\left(1-h_\theta(x^{(i)})\right)\right] + \frac{\lambda}{2m}\sum_{j=1}^{n}\theta_j^2$$

The gradient with respect to each parameter is

$$\frac{\partial J(\theta)}{\partial \theta_j} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x_j^{(i)} + \frac{\lambda}{m}\,\theta_j$$
The gradient-descent update therefore becomes
$$\theta := \theta - K'(\theta) - \frac{\lambda}{m}\,\theta$$
K(θ) is again the original loss function, and λ controls the strength of the penalty on the parameters. The larger λ is, the smaller and more evenly distributed the parameters of the final result vector tend to be, which prevents any single parameter from having an outsized influence on the whole function. (See the illustration: it is equivalent to finding the point where the loss contours first touch the circular constraint region.)
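In scikit-learn, the two penalties are selected through the penalty and C parameters of LogisticRegression (C is the inverse of the regularization strength, so a smaller C means a stronger penalty); a brief sketch:

```python
from sklearn.linear_model import LogisticRegression

# L2 penalty (the default): coefficients shrink toward small, evenly spread values
clf_l2 = LogisticRegression(penalty='l2', C=1.0)

# L1 penalty: tends to produce sparse coefficients; needs a solver that supports it
clf_l1 = LogisticRegression(penalty='l1', C=1.0, solver='liblinear')
```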
Python implementation
Using sklearn
```python
import pandas as pd
import matplotlib.pyplot as plt

df_X = pd.read_csv('./logistic_x.txt', sep='\ +', header=None, engine='python')  # read the X values
ys = pd.read_csv('./logistic_y.txt', sep='\ +', header=None, engine='python')    # read the y values
ys = ys.astype(int)            # convert ys to an integer type
df_X['label'] = ys[0].values   # attach the y labels to df_X as a 'label' column

# Scatter the points in 2D to visualize how the data is distributed
ax = plt.axes()
df_X.query('label == 0').plot.scatter(x=0, y=1, ax=ax, color='blue')
df_X.query('label == 1').plot.scatter(x=0, y=1, ax=ax, color='red')
plt.show()
```
```python
from __future__ import print_function
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
from sklearn.linear_model import LogisticRegression

df_X = pd.read_csv('./logistic_x.txt', sep='\ +', header=None, engine='python')  # read the X values
ys = pd.read_csv('./logistic_y.txt', sep='\ +', header=None, engine='python')    # read the y values
ys = ys.astype(int)
df_X['label'] = ys[0].values   # attach the y labels to df_X as a 'label' column

# Extract the data used for fitting
Xs = df_X[[0, 1]].values
Xs = np.hstack([np.ones((Xs.shape[0], 1)), Xs])   # prepend a column of ones as the intercept term
"""
Array stacking:
np.hstack - Stack arrays in sequence horizontally (column wise).
np.vstack - Stack arrays in sequence vertically (row wise).
https://docs.scipy.org/doc/numpy/reference/generated/numpy.hstack.html
"""
ys = df_X['label'].values

lr = LogisticRegression(fit_intercept=False)  # the intercept column is already merged into Xs, so no separate intercept is fitted
lr.fit(Xs, ys)                # fit the model
score = lr.score(Xs, ys)      # evaluate on the training data
print("Coefficient: %s" % lr.coef_)
print("Score: %s" % score)

# Plot the data again and use the learned parameters to draw the decision boundary between the two regions
ax = plt.axes()
df_X.query('label == 0').plot.scatter(x=0, y=1, ax=ax, color='blue')
df_X.query('label == 1').plot.scatter(x=0, y=1, ax=ax, color='red')
_xs = np.array([np.min(Xs[:, 1]), np.max(Xs[:, 1])])
_ys = (lr.coef_[0][0] + lr.coef_[0][1] * _xs) / (- lr.coef_[0][2])   # theta0 + theta1*x1 + theta2*x2 = 0
plt.plot(_xs, _ys, lw=1)
plt.show()
```
Logistic regression from scratch (gradient descent)
```python
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np

df_X = pd.read_csv('./logistic_x.txt', sep='\ +', header=None, engine='python')  # read the X values
ys = pd.read_csv('./logistic_y.txt', sep='\ +', header=None, engine='python')    # read the y values
ys = ys.astype(int)            # convert ys to an integer type
df_X['label'] = ys[0].values   # attach the y labels to df_X as a 'label' column

# Extract the data used for fitting
Xs = df_X[[0, 1]].values
Xs = np.hstack([np.ones((Xs.shape[0], 1)), Xs])   # prepend a column of ones as the intercept term
ys = df_X['label'].values


class LGR_GD():
    def __init__(self):
        self.w = None
        self.n_iters = None

    def fit(self, X, y, alpha=0.03, loss=1e-10):
        # alpha is the step size (0.03 by default); convergence is declared when the update is smaller than `loss`
        y = y.reshape(-1, 1)          # reshape y for matrix operations
        [m, d] = np.shape(X)          # number of samples and number of features
        self.w = np.zeros((1, d))     # initialize the parameters to 0
        tol = 1e5
        self.n_iters = 0
        while tol > loss:             # convergence condition
            sigmoid = 1 / (1 + np.exp(-X.dot(self.w.T)))                  # Sigmoid of the current linear output
            theta = self.w + alpha * np.mean(X * (y - sigmoid), axis=0)   # update theta (gradient step on the cross-entropy loss)
            tol = np.sum(np.abs(theta - self.w))                          # size of the update, used as the loss value
            self.w = theta
            self.n_iters += 1         # count the iterations

    def predict(self, X):
        # Predict probabilities for new samples with the fitted parameters
        y_pred = 1 / (1 + np.exp(-X.dot(self.w.T)))
        return y_pred


if __name__ == "__main__":
    lr_gd = LGR_GD()
    lr_gd.fit(Xs, ys)

    # Plot the data and draw the learned decision boundary
    ax = plt.axes()
    df_X.query('label == 0').plot.scatter(x=0, y=1, ax=ax, color='blue')
    df_X.query('label == 1').plot.scatter(x=0, y=1, ax=ax, color='red')
    _xs = np.array([np.min(Xs[:, 1]), np.max(Xs[:, 1])])
    _ys = (lr_gd.w[0][0] + lr_gd.w[0][1] * _xs) / (- lr_gd.w[0][2])   # theta0 + theta1*x1 + theta2*x2 = 0
    plt.plot(_xs, _ys, lw=1)
    plt.show()
```