Part 1: Classification prediction based on logistic regression
1. Introduction and application of logistic regression
1.1 Introduction to logistic regression
Although logistic regression (LR) has "regression" in its name, it is actually a classification model and is widely used in many fields. Even though deep learning is currently more popular than such traditional methods, these methods are still widely applied in practice because of their own unique advantages.
The two most prominent strengths of logistic regression are that the model is simple and that it is highly interpretable (a sketch of the underlying logistic function is given after the list below).
Advantages and disadvantages of the logistic regression model:
- Advantages: simple to implement and easy to understand; computationally cheap, fast, and low in storage requirements.
- Disadvantages: prone to underfitting, so the classification accuracy may not be high.
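As a reminder of what the model actually computes, here is a minimal sketch (not part of the original demo; the parameter values are made up purely for illustration) of the logistic (sigmoid) function, which maps the linear score w0 + w·x to a probability between 0 and 1:

```python
import numpy as np

def sigmoid(z):
    """Logistic (sigmoid) function: maps any real number into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# Toy example: the linear score z = w0 + w1*x1 + w2*x2 turned into a probability.
w0, w = -0.5, np.array([1.0, 2.0])   # hypothetical parameters, for illustration only
x = np.array([0.3, 0.4])
z = w0 + np.dot(w, x)                # z = 0.6
print(sigmoid(z))                    # probability of the positive class, about 0.65 here
```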
1.2 Applications of logistic regression
Logistic regression models are widely used in many fields, including machine learning, most areas of medicine, and the social sciences. For example, the Trauma and Injury Severity Score (TRISS) developed by Boyd et al. is widely used to predict mortality in injured patients, and logistic regression is used to analyze and predict the risk of specific diseases (such as diabetes or coronary heart disease) from observed patient characteristics (age, gender, body mass index, blood test results, etc.). Logistic regression models are also used to predict the probability that a given process, system, or product fails. They appear in marketing applications as well, such as predicting a customer's propensity to buy a product or cancel a subscription. In economics, they can be used to predict the probability that a person enters the labor market, and in business, the probability that a homeowner defaults on a mortgage. Conditional random fields, used in natural language processing, are an extension of logistic regression to sequential data.
The logistic regression model is also a basic component of many classification pipelines, for example credit-card anti-fraud systems and click-through-rate (CTR) estimation based on GBDT + LR. Its advantages are that the output naturally falls between 0 and 1 and has a probabilistic meaning, the model is transparent and has a solid probabilistic foundation, and the fitted parameters represent the influence of each feature on the result, which also makes it a good tool for understanding data. At the same time, because it is essentially a linear classifier, it cannot handle more complex patterns in the data. For this reason, logistic regression is often used to establish a baseline for a new task.
Now that we have covered the concept and applications of logistic regression, let's get started!
2. Demo practice
Step 1: library function import
```python
## Basic function library
import numpy as np

## Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns

## Logistic regression model
from sklearn.linear_model import LogisticRegression
```
Step 2: model training
```python
## Demo: LogisticRegression classification

## Construct the dataset
x_features = np.array([[-1, -2], [-2, -1], [-3, -2], [1, 3], [2, 1], [3, 2]])
y_label = np.array([0, 0, 0, 1, 1, 1])

## Call the logistic regression model
lr_clf = LogisticRegression()

## Fit the constructed dataset with the logistic regression model
lr_clf = lr_clf.fit(x_features, y_label)  # the fitted equation is y = w0 + w1*x1 + w2*x2
```
Step 3: view model parameters
```python
## View the weights w of the fitted model
print('the weight of Logistic Regression:', lr_clf.coef_)

## View the intercept w0 of the fitted model
print('the intercept(w0) of Logistic Regression:', lr_clf.intercept_)
```
Step 4: data and model visualization
```python
## Visualize the constructed data points
plt.figure()
plt.scatter(x_features[:, 0], x_features[:, 1], c=y_label, s=50, cmap='viridis')
plt.title('Dataset')
plt.show()
```
```python
# Visualize the decision boundary
plt.figure()
plt.scatter(x_features[:, 0], x_features[:, 1], c=y_label, s=50, cmap='viridis')
plt.title('Dataset')

nx, ny = 200, 100
x_min, x_max = plt.xlim()
y_min, y_max = plt.ylim()
x_grid, y_grid = np.meshgrid(np.linspace(x_min, x_max, nx), np.linspace(y_min, y_max, ny))

z_proba = lr_clf.predict_proba(np.c_[x_grid.ravel(), y_grid.ravel()])
z_proba = z_proba[:, 1].reshape(x_grid.shape)
plt.contour(x_grid, y_grid, z_proba, [0.5], linewidths=2., colors='blue')

plt.show()
```
```python
### Visualize the prediction for new samples

plt.figure()
## new point 1
x_features_new1 = np.array([[0, -1]])
plt.scatter(x_features_new1[:, 0], x_features_new1[:, 1], s=50, cmap='viridis')
plt.annotate('New point 1', xy=(0, -1), xytext=(-2, 0), color='blue',
             arrowprops=dict(arrowstyle='-|>', connectionstyle='arc3', color='red'))

## new point 2
x_features_new2 = np.array([[1, 2]])
plt.scatter(x_features_new2[:, 0], x_features_new2[:, 1], s=50, cmap='viridis')
plt.annotate('New point 2', xy=(1, 2), xytext=(-1.5, 2.5), color='red',
             arrowprops=dict(arrowstyle='-|>', connectionstyle='arc3', color='red'))

## training samples
plt.scatter(x_features[:, 0], x_features[:, 1], c=y_label, s=50, cmap='viridis')
plt.title('Dataset')

# Visualize the decision boundary
plt.contour(x_grid, y_grid, z_proba, [0.5], linewidths=2., colors='blue')

plt.show()
```
Step 5: model prediction
```python
## Use the trained model to predict the classes of the new points
y_label_new1_predict = lr_clf.predict(x_features_new1)
y_label_new2_predict = lr_clf.predict(x_features_new2)
print('The New point 1 predict class:\n', y_label_new1_predict)
print('The New point 2 predict class:\n', y_label_new2_predict)

## Since logistic regression is a probabilistic model (p = p(y=1|x,\theta), introduced earlier),
## we can also obtain the predicted probabilities with the predict_proba function
y_label_new1_predict_proba = lr_clf.predict_proba(x_features_new1)
y_label_new2_predict_proba = lr_clf.predict_proba(x_features_new2)
print('The New point 1 predict Probability of each class:\n', y_label_new1_predict_proba)
print('The New point 2 predict Probability of each class:\n', y_label_new2_predict_proba)
```
We can see that the trained model predicts new point 1 (x_features_new1) as class 0 (on the lower-left side of the decision boundary) and new point 2 (x_features_new2) as class 1 (on the upper-right side of the decision boundary). The blue line in the figure above is the decision boundary of the trained logistic regression model, i.e. the set of points where the predicted probability equals 0.5.
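To make the 0.5 boundary concrete, here is a small sketch (an addition to the demo, assuming the code above has already been run) that recovers the boundary line directly from the fitted parameters: the model outputs probability 0.5 exactly where w0 + w1*x1 + w2*x2 = 0, i.e. x2 = -(w0 + w1*x1) / w2.

```python
## Recover the 0.5 decision boundary analytically from the fitted parameters
## (assumes lr_clf, x_features and y_label from the demo above exist)
w0 = lr_clf.intercept_[0]
w1, w2 = lr_clf.coef_[0]

# On the boundary, w0 + w1*x1 + w2*x2 = 0  =>  x2 = -(w0 + w1*x1) / w2
x1_line = np.linspace(-3, 3, 50)
x2_line = -(w0 + w1 * x1_line) / w2

plt.figure()
plt.scatter(x_features[:, 0], x_features[:, 1], c=y_label, s=50, cmap='viridis')
plt.plot(x1_line, x2_line, 'b--', label='p = 0.5 boundary')
plt.legend()
plt.show()
```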
3. Practice of logistic regression classification based on the iris dataset
This time we use the iris dataset to practice the method. The dataset contains 5 variables: 4 feature variables and 1 target (class) variable, with 150 samples. The target variable is the species of the flower, one of three iris species: Iris setosa, Iris versicolor and Iris virginica. The four features are sepal length (cm), sepal width (cm), petal length (cm) and petal width (cm); these morphological features have historically been used to identify the species.
Step 1: library function import
```python
## Basic function libraries
import numpy as np
import pandas as pd

## Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns
```
Step 2: data reading / loading
```python
## Load the iris dataset from sklearn and convert it to DataFrame format with Pandas
from sklearn.datasets import load_iris

data = load_iris()            # load the dataset
iris_target = data.target     # the labels corresponding to the data
iris_features = pd.DataFrame(data=data.data, columns=data.feature_names)  # convert to DataFrame format with Pandas
```
Step 3: simple view of data information
```python
## Use .info() to view the overall information of the data
iris_features.info()
```
```python
## For a quick look at the data, we can use .head() (first rows) and .tail() (last rows)
iris_features.head()
```
```python
## The corresponding class labels, where 0, 1 and 2 represent 'setosa', 'versicolor' and 'virginica' respectively
iris_target
```
```python
## Use the value_counts function to view the number of samples in each class
pd.Series(iris_target).value_counts()
```
```python
## Some statistical description of the features
iris_features.describe()
```
From the statistical description we can see the range of values of each numerical feature.
Step 4: visual description
```python
## Merge the labels and the feature information
iris_all = iris_features.copy()   # shallow copy to avoid modifying the original data
iris_all['target'] = iris_target

## Scatter plot visualization of feature/label combinations
sns.pairplot(data=iris_all, diag_kind='hist', hue='target')
plt.show()
```
From the figure above we can see that, in 2D, the different feature combinations already separate the different flower species reasonably well.
```python
for col in iris_features.columns:
    sns.boxplot(x='target', y=col, saturation=0.5, palette='pastel', data=iris_all)
    plt.title(col)
    plt.show()
```
Using the box plots we can also see how the distribution of each feature differs across the classes.
```python
# Select the first three features and draw a 3D scatter plot
from mpl_toolkits.mplot3d import Axes3D

fig = plt.figure(figsize=(10, 8))
ax = fig.add_subplot(111, projection='3d')

iris_all_class0 = iris_all[iris_all['target'] == 0].values
iris_all_class1 = iris_all[iris_all['target'] == 1].values
iris_all_class2 = iris_all[iris_all['target'] == 2].values

# 'setosa'(0), 'versicolor'(1), 'virginica'(2)
ax.scatter(iris_all_class0[:, 0], iris_all_class0[:, 1], iris_all_class0[:, 2], label='setosa')
ax.scatter(iris_all_class1[:, 0], iris_all_class1[:, 1], iris_all_class1[:, 2], label='versicolor')
ax.scatter(iris_all_class2[:, 0], iris_all_class2[:, 1], iris_all_class2[:, 2], label='virginica')
plt.legend()
plt.show()
```
Step 5: use the logistic regression model for training and prediction on the binary classification problem
```python
## To evaluate the model correctly, split the data into a training set and a test set:
## train the model on the training set and evaluate it on the test set.
from sklearn.model_selection import train_test_split

## Select the samples of classes 0 and 1 (excluding the samples of class 2)
iris_features_part = iris_features.iloc[:100]
iris_target_part = iris_target[:100]

## 80%/20% train/test split
x_train, x_test, y_train, y_test = train_test_split(iris_features_part, iris_target_part,
                                                    test_size=0.2, random_state=2020)

## Import the logistic regression model from sklearn
from sklearn.linear_model import LogisticRegression

## Define the logistic regression model
clf = LogisticRegression(random_state=0, solver='lbfgs')

# Train the logistic regression model on the training set
clf.fit(x_train, y_train)

## View the weights w of the fitted model
print('the weight of Logistic Regression:', clf.coef_)

## View the intercept w0 of the fitted model
print('the intercept(w0) of Logistic Regression:', clf.intercept_)
```
```python
## Predict on the training set and the test set with the trained model
train_predict = clf.predict(x_train)
test_predict = clf.predict(x_test)
```
```python
from sklearn import metrics

## Evaluate the model with accuracy [the proportion of correctly predicted samples among all predictions]
print('The accuracy of the Logistic Regression is:', metrics.accuracy_score(y_train, train_predict))
print('The accuracy of the Logistic Regression is:', metrics.accuracy_score(y_test, test_predict))

## View the confusion matrix (counts of each combination of true and predicted labels)
confusion_matrix_result = metrics.confusion_matrix(y_test, test_predict)
print('The confusion matrix result:\n', confusion_matrix_result)

# Visualize the result with a heat map
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix_result, annot=True, cmap='Blues')
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()
```
We can see that the accuracy is 1, which means that all samples are predicted correctly.
Step 6: use the logistic regression model for training and prediction on the three-class (multi-class) problem
```python
## 80%/20% train/test split
x_train, x_test, y_train, y_test = train_test_split(iris_features, iris_target,
                                                    test_size=0.2, random_state=2020)
```
The rest of the code is the same as in the binary classification case; a sketch is given below for completeness.
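Here is that sketch, mirroring the binary-class code above (variable names follow the earlier steps; for multi-class problems, predict_proba simply returns one probability per class):

```python
## Define and train the logistic regression model on the three-class training set
clf = LogisticRegression(random_state=0, solver='lbfgs')
clf.fit(x_train, y_train)

## Predict on the training set and the test set with the trained model
train_predict = clf.predict(x_train)
test_predict = clf.predict(x_test)

## For the three-class problem, predict_proba returns one probability per class
test_predict_proba = clf.predict_proba(x_test)
print('The test predict Probability of each class:\n', test_predict_proba)

## Evaluate the model with accuracy
from sklearn import metrics
print('The accuracy of the Logistic Regression is:', metrics.accuracy_score(y_train, train_predict))
print('The accuracy of the Logistic Regression is:', metrics.accuracy_score(y_test, test_predict))
```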
From the results we can see that the accuracy drops for the three-class problem: the accuracy on the test set is 86.67%. This is because the features of 'versicolor' (1) and 'virginica' (2) overlap: the boundary between them is fuzzy (samples near the boundary are mixed, with no clear separation), so some samples of these two classes are misclassified.
Part 2: Classification prediction based on XGBoost
1. Introduction and application of XGBoost
1.1 Introduction to XGBoost
XGBoost is a scalable machine learning system developed in 2016 under the leadership of Tianqi Chen at the University of Washington. Strictly speaking, XGBoost is not a single model but a software package that lets users conveniently solve classification, regression and ranking problems. It implements the gradient boosting decision tree (GBDT) model internally and optimizes many of its algorithms, achieving high accuracy while remaining extremely fast. For quite some time it has been the go-to tool in data mining and machine learning competitions worldwide.
More importantly, XGBoost was designed with careful attention to both systems optimization and machine learning principles. It is no exaggeration to say that the scalability, portability and accuracy provided by XGBoost pushed the computational limits of machine learning: the system runs more than ten times faster on a single machine than the popular solutions of its time and can even handle billions of samples in distributed settings.
Main advantages of XGBoost:
- **Easy to use.** Compared with other machine learning libraries, users can easily use XGBoost and obtain quite good results.
- **Efficient and scalable.** It is fast and effective on large-scale datasets and has modest requirements on hardware resources such as memory.
- **Robust.** Compared with deep learning models, it achieves comparable results without fine-grained tuning.
- XGBoost implements the boosting tree model internally and can handle missing values automatically.
Main disadvantages of XGBoost:
- Compared with deep learning models, it cannot model spatio-temporal structure and cannot capture high-dimensional data such as images, audio and text well.
- When a large amount of training data is available and a suitable deep learning model can be found, the accuracy of deep learning can be far ahead of XGBoost.
1.2 Applications of XGBoost
XGBoost is widely used in machine learning and data mining. According to statistics, 17 of the 29 winning solutions on the Kaggle platform in 2015 used XGBoost; in the 2015 KDD Cup, all of the top ten teams used XGBoost, and ensembling it with other models brought less improvement than simply tuning XGBoost's parameters. These real examples show that XGBoost can achieve very good results on a wide range of problems.
At the same time, XGBoost has been successfully applied to many problems in industry and academia, for example store sales forecasting, high-energy-physics event classification, web text classification, user behavior prediction, motion detection, ad click-through-rate prediction, malware classification, disaster risk prediction, and online course dropout prediction. Although domain-specific data analysis and feature engineering also play an important role in these solutions, XGBoost is consistently at the core of the winning approaches.
2. Practice of XGBoost classification based on weather data set
Dataset: Weather data set
Step 1: function library import
```python
## Basic function libraries
import numpy as np
import pandas as pd

## Plotting libraries
import matplotlib.pyplot as plt
import seaborn as sns
```
This time we use the weather dataset to practice the method. A weather station provides daily historical weather records, and we need to predict whether it will rain tomorrow from these historical data. The test set test.csv used in this example has exactly the same format as train.csv, except that its RainTomorrow column is not given; that is the variable to be predicted.
The dataset contains both numerical features (such as Rainfall, Evaporation, Sunshine, Humidity3pm, Cloud9am and Cloud3pm) and categorical features (such as Location and RainToday), together with the target variable RainTomorrow.
Step 2: data reading / loading
```python
## Read the data with the read_csv function provided by Pandas and convert it to DataFrame format
data = pd.read_csv('train.csv')
```
Step 3: simple view of data information
```python
## Use .info() to view the overall information of the data
data.info()
```
```python
## For a quick look at the data, we can use .head() (first rows) and .tail() (last rows)
data.head()
```
Since there are many variables, only some of them are shown here.
Here we find that the dataset contains NaN values. NaN usually represents a missing value, which may come from errors during data collection or processing. Here we fill the missing values with -1. There are other strategies for handling missing values, such as filling with the median or the mean; if you are interested, see the post [data analysis series] Python data preprocessing summary, where the basic data preprocessing operations are explained in detail. A sketch of median filling is also given after the next code block.
```python
data = data.fillna(-1)
data.tail()
```
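As an alternative to filling with -1 (not used in the rest of this article), here is a sketch of filling the numeric columns with their median, loading a fresh copy of the data so the pipeline above is unaffected:

```python
## Alternative (for illustration only): fill the numeric columns with their median instead of -1
data_median = pd.read_csv('train.csv')
numeric_cols = data_median.select_dtypes(include='number').columns
data_median[numeric_cols] = data_median[numeric_cols].fillna(data_median[numeric_cols].median())
data_median.tail()
```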
```python
## Use the value_counts function to view the number of training-set labels
pd.Series(data['RainTomorrow']).value_counts()
```
```
No     82786
Yes    23858
Name: RainTomorrow, dtype: int64
```
We find that the number of negative samples in the dataset is much larger than the number of positive samples. This common situation is called class imbalance, and in some cases it needs special treatment, for example resampling or data augmentation; a sketch of simple undersampling is shown below.
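For illustration only (the rest of this article keeps the original, imbalanced data), here is a sketch of random undersampling of the majority class using plain pandas:

```python
## Sketch: random undersampling of the majority class (for illustration only;
## the rest of this article keeps the original data)
pos = data[data['RainTomorrow'] == 'Yes']
neg = data[data['RainTomorrow'] == 'No'].sample(n=len(pos), random_state=2020)
data_balanced = pd.concat([pos, neg]).sample(frac=1, random_state=2020)  # shuffle the rows
print(pd.Series(data_balanced['RainTomorrow']).value_counts())
```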
```python
## Some statistical description of the features
data.describe()
```
Step 4: visual description
For convenience, we first separate the numerical features from the non-numerical (categorical) features:
```python
## Numerical features
numerical_features = [x for x in data.columns if data[x].dtype == 'float64']

## Categorical (non-numeric) features; the target column RainTomorrow is handled separately
category_features = [x for x in data.columns if data[x].dtype != 'float64' and x != 'RainTomorrow']

## Scatter plot visualization of three features combined with the label
sns.pairplot(data=data[['Rainfall', 'Evaporation', 'Sunshine'] + ['RainTomorrow']],
             diag_kind='hist', hue='RainTomorrow')
plt.show()
```
From the figure above we can see that, in 2D, the different feature combinations already separate the rain/no-rain days of the following day to some extent; in particular, combinations involving Sunshine are more discriminative than the others.
```python
for col in data[numerical_features].columns:
    if col != 'RainTomorrow':
        sns.boxplot(x='RainTomorrow', y=col, saturation=0.5, palette='pastel', data=data)
        plt.title(col)
        plt.show()
```
Using the box plots we can again see how the feature distributions differ between the two classes; Sunshine, Humidity3pm, Cloud9am and Cloud3pm are clearly discriminative.
```python
tlog = {}
for i in category_features:
    tlog[i] = data[data['RainTomorrow'] == 'Yes'][i].value_counts()

flog = {}
for i in category_features:
    flog[i] = data[data['RainTomorrow'] == 'No'][i].value_counts()
```
```python
plt.figure(figsize=(10, 10))

plt.subplot(1, 2, 1)
plt.title('RainTomorrow')
sns.barplot(x=pd.DataFrame(tlog['Location']).sort_index()['Location'],
            y=pd.DataFrame(tlog['Location']).sort_index().index, color="red")

plt.subplot(1, 2, 2)
plt.title('Not RainTomorrow')
sns.barplot(x=pd.DataFrame(flog['Location']).sort_index()['Location'],
            y=pd.DataFrame(flog['Location']).sort_index().index, color="blue")

plt.show()
```
From the figure above we can see that rainfall differs greatly across locations: rain is clearly more likely in some places than in others.
```python
plt.figure(figsize=(10, 2))

plt.subplot(1, 2, 1)
plt.title('RainTomorrow')
sns.barplot(x=pd.DataFrame(tlog['RainToday'][:2]).sort_index()['RainToday'],
            y=pd.DataFrame(tlog['RainToday'][:2]).sort_index().index, color="red")

plt.subplot(1, 2, 2)
plt.title('Not RainTomorrow')
sns.barplot(x=pd.DataFrame(flog['RainToday'][:2]).sort_index()['RainToday'],
            y=pd.DataFrame(flog['RainToday'][:2]).sort_index().index, color="blue")

plt.show()
```
From the figure above we can see that rain today does not necessarily mean rain tomorrow, but if it does not rain today, it is very likely not to rain tomorrow either.
Step 5: encode the discrete (categorical) variables
Since XGBoost cannot handle string values directly, we need a way to convert string data into numbers. The simplest method is to encode every value of a categorical feature as an integer, e.g. female = 0, male = 1, dog = 2, so that the encoded values are integers in [0, number of categories - 1]. Other methods such as one-hot encoding, sum encoding and leave-one-out encoding can give better results; a sketch of one-hot encoding is shown after the next two code blocks.
```python
## Encode every value of a categorical feature as an integer
def get_mapfunction(x):
    mapp = dict(zip(x.unique().tolist(), range(len(x.unique().tolist()))))
    def mapfunction(y):
        if y in mapp:
            return mapp[y]
        else:
            return -1
    return mapfunction

for i in category_features:
    data[i] = data[i].apply(get_mapfunction(data[i]))
```
```python
## The encoded string feature has become integers
data['Location'].unique()
```
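For comparison, the one-hot encoding mentioned above could be done with pandas.get_dummies. This is only a sketch on two of the raw string columns; it is not used in the rest of this article, which keeps the integer encoding defined above.

```python
## Sketch only: one-hot encoding with pandas (not used below)
raw = pd.read_csv('train.csv').fillna(-1)
onehot_example = pd.get_dummies(raw[['Location', 'RainToday']])
print(onehot_example.shape)   # each category value becomes its own 0/1 column
```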
Step 6: training and prediction with XGBoost
```python
## To evaluate the model correctly, split the data into a training set and a test set:
## train the model on the training set and evaluate it on the test set.
from sklearn.model_selection import train_test_split

## Separate the target from the features
data_target_part = data['RainTomorrow']
data_features_part = data[[x for x in data.columns if x != 'RainTomorrow']]

## 80%/20% train/test split
x_train, x_test, y_train, y_test = train_test_split(data_features_part, data_target_part,
                                                    test_size=0.2, random_state=2020)
```
```python
## Import the XGBoost model
from xgboost.sklearn import XGBClassifier

## Define the XGBoost model
clf = XGBClassifier()

# Train the XGBoost model on the training set
clf.fit(x_train, y_train)
```
```python
## Predict on the training set and the test set with the trained model
train_predict = clf.predict(x_train)
test_predict = clf.predict(x_test)

from sklearn import metrics

## Evaluate the model with accuracy [the proportion of correctly predicted samples among all predictions]
print('The accuracy of XGBoost is:', metrics.accuracy_score(y_train, train_predict))
print('The accuracy of XGBoost is:', metrics.accuracy_score(y_test, test_predict))

## View the confusion matrix (counts of each combination of true and predicted labels)
confusion_matrix_result = metrics.confusion_matrix(y_test, test_predict)
print('The confusion matrix result:\n', confusion_matrix_result)

# Visualize the result with a heat map
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix_result, annot=True, cmap='Blues')
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()
```
We can see that 15759 + 2306 samples are predicted correctly and 2470 + 794 samples are predicted incorrectly.
Step 7: feature selection using XGBoost
```python
sns.barplot(y=data_features_part.columns, x=clf.feature_importances_)
```
From the plot we can see that the humidity at 3 p.m. (Humidity3pm) and whether it rained today (RainToday) are the most important factors in determining whether it rains the next day.
Besides the feature_importances_ attribute, XGBoost also provides the following importance types for evaluating features:
- weight: the number of times a feature is used to split the data across all trees.
- gain: the average gain (reduction of the loss) of the splits that use the feature.
- cover: the average number of samples covered by the splits that use the feature.
- total_gain: the total gain of all splits that use the feature.
- total_cover: the total number of samples covered by all splits that use the feature.
```python
from sklearn.metrics import accuracy_score
from xgboost import plot_importance

def estimate(model, data):
    # sns.barplot(data.columns, model.feature_importances_)
    ax1 = plot_importance(model, importance_type="gain")
    ax1.set_title('gain')
    ax2 = plot_importance(model, importance_type="weight")
    ax2.set_title('weight')
    ax3 = plot_importance(model, importance_type="cover")
    ax3.set_title('cover')
    plt.show()

def classes(data, label, test):
    model = XGBClassifier()
    model.fit(data, label)
    ans = model.predict(test)
    estimate(model, data)
    return ans

ans = classes(x_train, y_train, x_test)
pre = accuracy_score(y_test, ans)
print('acc=', accuracy_score(y_test, ans))
```
These diagrams can also help us better understand other important features.
Step 8: get better results by tuning the parameters
XGBoost has many parameters that strongly affect the model, including but not limited to the following:
- learning_rate: also called eta; the default value is 0.3. It is the step size of each boosting iteration: if it is too large, accuracy suffers; if it is too small, training is slow.
- subsample: default 1. It controls the fraction of samples randomly drawn for each tree; reducing it makes the algorithm more conservative and helps avoid overfitting. The value ranges between 0 and 1.
- colsample_bytree: default 1, usually set to around 0.8. It controls the fraction of columns (features) randomly sampled for each tree.
- max_depth: default 6, usually set between 3 and 10. It is the maximum depth of a tree and is used to control overfitting: the larger max_depth, the more specific the patterns the model learns.
Common methods for tuning model parameters include greedy search, grid search, Bayesian optimization, and so on. Here we use grid search. Its basic idea is exhaustive search: loop over every combination of the candidate parameter values, try each one, and keep the combination that performs best.
```python
## Import the grid-search function from the sklearn library
from sklearn.model_selection import GridSearchCV

## Define the ranges of the parameter values
learning_rate = [0.1, 0.3, 0.6]
subsample = [0.8, 0.9]
colsample_bytree = [0.6, 0.8]
max_depth = [3, 5, 8]

parameters = {'learning_rate': learning_rate,
              'subsample': subsample,
              'colsample_bytree': colsample_bytree,
              'max_depth': max_depth}
model = XGBClassifier(n_estimators=50)

## Perform the grid search
clf = GridSearchCV(model, parameters, cv=3, scoring='accuracy', verbose=1, n_jobs=-1)
clf = clf.fit(x_train, y_train)
```
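Before refitting, we can inspect which parameter combination the grid search selected; presumably the values hard-coded in the next block were obtained this way.

```python
## Inspect the best parameter combination and its cross-validation accuracy
print('The best parameters are:', clf.best_params_)
print('The best cross-validation accuracy is:', clf.best_score_)
```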
```python
## Predict on the training set and test set using the best parameters found

## Define the XGBoost model with these parameters
clf = XGBClassifier(colsample_bytree=0.6, learning_rate=0.3, max_depth=8, subsample=0.9)

# Train the XGBoost model on the training set
clf.fit(x_train, y_train)

train_predict = clf.predict(x_train)
test_predict = clf.predict(x_test)

## Evaluate the model with accuracy [the proportion of correctly predicted samples among all predictions]
print('The accuracy of XGBoost is:', metrics.accuracy_score(y_train, train_predict))
print('The accuracy of XGBoost is:', metrics.accuracy_score(y_test, test_predict))

## View the confusion matrix (counts of each combination of true and predicted labels)
confusion_matrix_result = metrics.confusion_matrix(y_test, test_predict)
print('The confusion matrix result:\n', confusion_matrix_result)

# Visualize the result with a heat map
plt.figure(figsize=(8, 6))
sns.heatmap(confusion_matrix_result, annot=True, cmap='Blues')
plt.xlabel('Predicted labels')
plt.ylabel('True labels')
plt.show()
```
Originally there were 2470 + 794 errors; now there are 2112 + 939 errors, so the tuned parameters improve the accuracy.
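Finally, the task description mentioned a test.csv with the same columns as train.csv except RainTomorrow. Here is a sketch of how the tuned model might be applied to it, assuming it needs the same preprocessing as the training data (fill missing values with -1 and apply the same category-to-integer mappings). The mappings are rebuilt from the raw training file, since the in-memory data has already been encoded.

```python
## Sketch: apply the tuned model to test.csv
## (assumes test.csv has the same columns as train.csv, minus RainTomorrow)
raw_train = pd.read_csv('train.csv').fillna(-1)   # raw strings, used to rebuild the mappings
test_data = pd.read_csv('test.csv').fillna(-1)

for i in category_features:
    if i in test_data.columns:
        # reuse the category-to-integer mapping learned from the raw training data;
        # categories never seen in training map to -1
        test_data[i] = test_data[i].apply(get_mapfunction(raw_train[i]))

test_features = test_data[[x for x in test_data.columns if x != 'RainTomorrow']]
test_predictions = clf.predict(test_features)
print(test_predictions[:10])
```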