1, Introduction to Logistic Regression
Logistic regression is a widely used classification algorithm in machine learning. It fits the data to a logistic function (also called the sigmoid function) in order to predict the probability of an event. To explain logistic regression, we have to trace back to linear regression. Most of us already have some understanding of linear regression: for sample points in a multidimensional space, we use a linear combination of the features to fit the distribution and trend of the points in space. (Figure: sample points fitted by a straight line.)
Linear regression predicts continuous values. Another common problem in real life is classification, and the simplest case is the binary yes/no problem. For example, a doctor needs to judge whether a patient is ill, a bank needs to judge whether a person's credit rating qualifies for a credit card, and an e-mail inbox should automatically sort messages into normal mail and spam. In practice, the data we need to learn from is rarely so clean: in the examples above, an outlier that does not follow the overall pattern can suddenly appear, and real-world classification data is usually even messier. It is therefore difficult to build a robust classifier simply by thresholding the output of a linear regression.

Suppose we have some data points and we use a straight line to fit them (the line is called the best-fit line); this fitting process is called regression. The main idea of using logistic regression for classification is to establish a regression formula for the classification boundary based on the existing data. The word "regression" here comes from best fitting, meaning that we look for the best-fitting set of parameters. Training the classifier amounts to finding those best-fit parameters with an optimization algorithm.
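To make this concrete, here is a minimal sketch (not part of the original code; the weight and sample values are made up for illustration) of how logistic regression turns a linear combination of features into a class decision:

import numpy as np

def sigmoid(z):
    # squashes any real number into (0, 1), interpreted as P(class = 1)
    return 1.0 / (1.0 + np.exp(-z))

# hypothetical weights [w0, w1, w2] and a sample [1.0, x1, x2] (1.0 is the intercept term)
weights = np.array([4.0, 0.5, -0.6])
sample  = np.array([1.0, -1.4, 4.7])

prob = sigmoid(np.dot(sample, weights))   # predicted probability of class 1
label = 1 if prob > 0.5 else 0            # thresholding at 0.5 gives the class decision
print(prob, label)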
1. General process of Logistic regression
1) Collect data: any method can be used.
2) Prepare data: since distance calculations are required, the data must be numeric; in addition, a structured data format is best.
3) Analyze data: any method can be used.
4) Train the algorithm: most of the time is spent here; the purpose of training is to find the best regression coefficients for classification.
5) Test the algorithm: once training is complete, classification itself is fast.
6) Use the algorithm: first, input some data and convert it into the corresponding structured numeric values; then, apply the trained regression coefficients to these values with a simple regression calculation to determine which category they belong to; after that, further analysis can be done on the output categories.
2. Characteristics of logistic regression
Advantages: low computational cost, easy to understand and implement.
Disadvantages: it is easy to underfit, and the classification accuracy may not be high.
Applicable data types: numerical data and nominal data.
2, Code Implementation
1. Logistic regression gradient ascent optimization algorithm
from numpy import *

def loadDataSet():
    dataMat = []; labelMat = []
    fr = open('testSet.txt')
    for line in fr.readlines():
        lineArr = line.strip().split()
        dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])  # prepend x0 = 1.0 for the intercept
        labelMat.append(int(lineArr[2]))
    return dataMat, labelMat

def sigmoid(inX):
    return 1.0 / (1 + exp(-inX))

def gradAscent(dataMatIn, classLabels):
    dataMatrix = mat(dataMatIn)              # convert to NumPy matrix
    labelMat = mat(classLabels).transpose()  # convert to NumPy column vector
    m, n = shape(dataMatrix)
    alpha = 0.001      # step size
    maxCycles = 500    # number of iterations
    weights = ones((n, 1))
    for k in range(maxCycles):               # heavy on matrix operations
        h = sigmoid(dataMatrix * weights)    # matrix mult
        error = (labelMat - h)               # vector subtraction
        weights = weights + alpha * dataMatrix.transpose() * error  # matrix mult
    return weights
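A minimal usage sketch, assuming testSet.txt (the file read by loadDataSet above, with two features and a label per line) is in the working directory:

dataArr, labelMat = loadDataSet()
weights = gradAscent(dataArr, labelMat)
print(weights)   # a 3x1 matrix of coefficients [w0; w1; w2]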
2. Plotting the data set and the logistic regression best-fit line
def plotBestFit(weights):
    import matplotlib.pyplot as plt
    dataMat, labelMat = loadDataSet()
    dataArr = array(dataMat)
    n = shape(dataArr)[0]
    xcord1 = []; ycord1 = []
    xcord2 = []; ycord2 = []
    for i in range(n):                        # split points by class for plotting
        if int(labelMat[i]) == 1:
            xcord1.append(dataArr[i, 1]); ycord1.append(dataArr[i, 2])
        else:
            xcord2.append(dataArr[i, 1]); ycord2.append(dataArr[i, 2])
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.scatter(xcord1, ycord1, s=30, c='red', marker='s')
    ax.scatter(xcord2, ycord2, s=30, c='green')
    x = arange(-3.0, 3.0, 0.1)
    y = (-weights[0] - weights[1] * x) / weights[2]   # boundary: w0 + w1*x1 + w2*x2 = 0
    ax.plot(x, y)
    plt.xlabel('X1'); plt.ylabel('X2')
    plt.show()
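Note that gradAscent returns a NumPy matrix, while plotBestFit indexes the weights like a flat array, so it is convenient to convert first; getA() is the standard numpy.matrix method that returns a plain ndarray:

dataArr, labelMat = loadDataSet()
weights = gradAscent(dataArr, labelMat)
plotBestFit(weights.getA())   # getA() converts the 3x1 matrix to an array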
3. Stochastic gradient ascent algorithm
def stocGradAscent0(dataMatrix, classLabels):
    m, n = shape(dataMatrix)
    alpha = 0.01
    weights = ones(n)                  # initialize to all ones
    for i in range(m):                 # one pass over the data, one sample per update
        h = sigmoid(sum(dataMatrix[i] * weights))   # h is a scalar here, not a vector
        error = classLabels[i] - h
        weights = weights + alpha * error * dataMatrix[i]
    return weights
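A usage sketch for the stochastic version; unlike gradAscent it expects a NumPy array rather than a matrix, and it makes only a single pass over the data, so the fit is rougher than the batch version:

dataArr, labelMat = loadDataSet()
weights = stocGradAscent0(array(dataArr), labelMat)   # note: array, not mat
plotBestFit(weights)   # weights is already a flat array, no conversion needed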
4. Improved stochastic gradient ascent algorithm
def stocGradAscent1(dataMatrix, classLabels, numIter=150):
    m, n = shape(dataMatrix)
    weights = ones(n)                  # initialize to all ones
    for j in range(numIter):
        dataIndex = list(range(m))     # list() so that del() below works in Python 3
        for i in range(m):
            # alpha decreases with iteration but never reaches 0 because of the constant
            alpha = 4 / (1.0 + j + i) + 0.0001
            # pick a sample at random, without replacement within this pass
            randIndex = int(random.uniform(0, len(dataIndex)))
            sampleIndex = dataIndex[randIndex]   # index into the remaining samples, not the raw position
            h = sigmoid(sum(dataMatrix[sampleIndex] * weights))
            error = classLabels[sampleIndex] - h
            weights = weights + alpha * error * dataMatrix[sampleIndex]
            del(dataIndex[randIndex])
    return weights
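A sketch of how the improved version might be called; numIter controls how many random-update passes are made over the data (500 here is an illustrative value, the default is 150):

dataArr, labelMat = loadDataSet()
weights = stocGradAscent1(array(dataArr), labelMat, numIter=500)
plotBestFit(weights)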
5. Test algorithm: classify with logistic regression
The experiment uses an audit data set stored as comma-separated text (auditDataTraining.txt for training and auditDataTest.txt for testing); each line contains 26 numeric features followed by a class label. Logistic regression is used to estimate the classification error rate.
def classifyVector(inX, weights):
    prob = sigmoid(sum(inX * weights))
    if prob > 0.5:
        return 1.0
    else:
        return 0.0

def auditTest():
    frTrain = open('auditDataTraining.txt'); frTest = open('auditDataTest.txt')
    trainingSet = []; trainingLabels = []
    for line in frTrain.readlines():
        currLine = line.strip().split(',')
        lineArr = []
        for i in range(26):                    # 26 features per line
            lineArr.append(float(currLine[i]))
        trainingSet.append(lineArr)
        trainingLabels.append(float(currLine[26]))   # label is the last column
    trainWeights = stocGradAscent1(array(trainingSet), trainingLabels, 1000)
    errorCount = 0; numTestVec = 0.0
    for line in frTest.readlines():
        numTestVec += 1.0
        currLine = line.strip().split(',')
        lineArr = []
        for i in range(26):
            lineArr.append(float(currLine[i]))
        if int(classifyVector(array(lineArr), trainWeights)) != int(currLine[26]):
            errorCount += 1
    errorRate = float(errorCount) / numTestVec
    print("the error rate of this test is: %f" % errorRate)
    return errorRate

def multiTest():
    numTests = 10; errorSum = 0.0
    for k in range(numTests):
        errorSum += auditTest()
    print("after %d iterations the average error rate is: %f" % (numTests, errorSum / float(numTests)))
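To reproduce the experiment (assuming the two audit data files are available in the working directory), the whole pipeline can be driven like this:

if __name__ == '__main__':
    multiTest()   # runs auditTest() 10 times and reports the average error rate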
After running multiTest(), the average error rate is very low. It appears that there is little missing data in this data set.