Tianchi Competition - Prediction of Industrial Steam Volume

Learning data analysis and machine learning from scratch, I am writing down my contest experience in plain terms. This post uses a variety of machine learning regression algorithms, and also builds a neural network with PyTorch (deep learning) for the regression task.

1. Background introduction

This is an industrial steam volume prediction contest, link: AI Training Camp Computer Vision - Alibaba Cloud Tianchi.

Contest Background

The basic principle of fossil-fuel power generation is that burning fuel heats water into steam, the steam pressure drives a steam turbine, and the turbine in turn drives a generator to produce electricity. In this chain of energy conversions, the key factor for generation efficiency is the boiler's combustion efficiency, i.e. how effectively the burning fuel heats the water into high-temperature, high-pressure steam. The factors involved include the boiler's adjustable parameters, such as fuel feed, primary and secondary air, induced air, return air and feed water; and the boiler's operating conditions, such as bed temperature, bed pressure, furnace temperature, furnace pressure, superheater temperature and so on.

Description of the title

The data were collected from boiler sensors and desensitized (sampling frequency at the minute level); the task is to predict the amount of steam produced based on the boiler's operating conditions.

2. Data Exploration

Download the training and test sets locally, open jupyter notebook, and import the packages.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from scipy import stats

import warnings
warnings.filterwarnings("ignore")
%matplotlib inline

Start importing data.

train_data_file = "data/zhengqi_train.txt"
test_data_file = "data/zhengqi_test.txt"

train_data = pd.read_csv(train_data_file, sep="\t", encoding="utf-8")
test_data = pd.read_csv(test_data_file, sep="\t", encoding="utf-8")

#View the first few rows of data
train_data.head()

The first few rows of train_data look like this:

View the basic information of the training set as follows:

train_data.info()

You can see that:

1. There are 2888 rows in total, indexed 0-2887
2. There are 39 columns in total; excluding the target column, there are 38 feature columns
3. No column has missing values, and all columns are numeric (the quick check below confirms this)
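
As a quick check:

print(train_data.isnull().any().any())          #False means there are no missing values
print(train_data.dtypes.unique())               #Data types present in the training set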

 

Next, call the describe() method to view the summary statistics of the training set.
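
The call is simply the following; its output (count, mean, std, min, quartiles and max for each column) is omitted here:

#Summary statistics for each column
train_data.describe()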

Data visualization

To see the relationships in the data more intuitively, we visualize them.

First, plot box charts of each feature:

column_list = train_data.columns.tolist()[:39]                                #List of column names
fig = plt.figure(figsize=(100, 100), dpi=75)                                  #Specify the width and height of the figure
for i in range(38):
    plt.subplot(7, 6, i+1)                                                    #7 rows x 6 columns of subplots
    sns.boxplot(data=train_data[column_list[i]], orient="v", width=0.5)
    plt.ylabel(column_list[i], fontsize=36)
plt.show()

The results are as follows:

 

Next, draw a histogram and a Q-Q plot for each feature. A Q-Q plot is used to check whether data follow a normal distribution: if the points fall on a straight line, the data are approximately normally distributed.

#Draw histograms and Q-Q graphs of all variables
train_cols = 6
train_rows = len(train_data.columns)
plt.figure(figsize=(4*train_cols, 4*train_rows))

i = 0
for col in train_data.columns:
    i += 1
    ax = plt.subplot(train_rows, train_cols, i)
    sns.distplot(train_data[col], fit=stats.norm)
    
    i += 1
    ax = plt.subplot(train_rows, train_cols, i)
    res = stats.probplot(train_data[col], plot=plt)
    
plt.tight_layout()                                                 #tight_layout automatically adjusts the subgraph parameters to fill the entire image area
plt.show()

The effect is as follows:

Due to space limitation, only part of the picture is shown here.

Next, draw a KDE (kernel density estimate) plot for each feature to show whether the training-set and test-set distributions are consistent. If a feature's distribution in the training set does not match its distribution in the test set, the feature is problematic and needs to be excluded when training the model.

train_cols = 6
train_rows = len(test_data.columns)
plt.figure(figsize=(4*train_cols, 4*train_rows))

for i in range(train_rows):
    ax = plt.subplot(train_rows, train_cols, i+1)
    ax = sns.kdeplot(train_data[column_list[i]], color="Red", shade=True)
    ax = sns.kdeplot(test_data[column_list[i]], color="Blue", shade=True)
    ax.set_xlabel(column_list[i])
    ax.set_ylabel("Frequency")
    ax = ax.legend(["train", "test"])                                             #Legend
plt.show()

The effect is as follows:

Due to space limitations, only part of the plots are shown here. From the plots we can see that the distributions of the features V5, V9, V11, V17, V22 and V28 differ between the training set and the test set, so these features need to be removed when training the model.

Next, we drop those features and calculate the correlation coefficients of the remaining ones.

data_train1 = train_data.drop(["V5", "V9", "V11", "V17", "V22", "V28"], axis=1)
train_corr = data_train1.corr()
train_corr

The effect is as follows:

The raw table of numbers is hard to read. What can we do? Use data visualization, of course, and draw a chart:

fig, ax = plt.subplots(figsize=(20, 16))                            #Canvas size
ax = sns.heatmap(train_corr, vmax=0.8, square=True, annot=True)     #Draw the heatmap

The effect is as follows:

How about that? The heatmap reads much better, but it still contains a lot of numbers, so it is hard to pick out the information you want at a glance. Don't worry: we can keep only the 10 features most correlated with the target variable to further reduce the amount of data.

k = 10                                                                               #Number of features most correlated with the target variable
cols = train_corr.nlargest(k, "target")["target"].index                              #Index of the k most correlated columns

#cm = np.corrcoef(train_data[cols].values.T)
fig, ax = plt.subplots(figsize=(10, 10))                                             #Canvas size
hm = sns.heatmap(train_data[cols].corr(), vmax=0.8, square=True, annot=True)        #Draw the heatmap
plt.show()

The effect is as follows:

Alternatively, we can keep the features whose absolute correlation coefficient with the target is greater than 0.5:

rate = 0.5
top_corr_features = train_corr.index[abs(train_corr["target"]) > rate]
plt.figure(figsize=(10, 10))
hm = sns.heatmap(train_data[top_corr_features].corr(), vmax=0.8, square=True, annot=True)        #Draw the heatmap

The effect is as follows:

3. Modeling

Here I describe how to build a neural network with PyTorch (deep learning) for the regression prediction.

First, import the package:

import pandas as pd
import torch
from torch import nn
from sklearn import preprocessing
import torch.optim as optim
import random
from sklearn.decomposition import PCA

Read data:

#Read data
#Description: No missing values for data
train_data_file = "data/zhengqi_train.txt"
test_data_file = "data/zhengqi_test.txt"
#read
train_data = pd.read_csv(train_data_file, sep="\t", encoding="utf-8")
test_data = pd.read_csv(test_data_file, sep="\t", encoding="utf-8")

After importing the data, we need to normalize it. It is more convenient to concatenate the training and test sets first, normalize them together, and then split them again.

#Merge training and test sets
all_features = pd.concat((train_data.iloc[:,0:-1], test_data.iloc[:,0:]))
#Data normalization
features_columns = all_features.dtypes[all_features.dtypes != "object"].index                   #Numeric columns
#Normalization function
min_max_scaler = preprocessing.MinMaxScaler()
min_max_scaler = min_max_scaler.fit(all_features[features_columns])
#Apply the scaler
all_features_scaler = min_max_scaler.transform(all_features[features_columns])
all_features_scaler = pd.DataFrame(all_features_scaler)
all_features_scaler.columns = features_columns
#Split the normalized data back into training and test sets
n_train = train_data.shape[0]
train_features = torch.tensor(all_features_scaler.iloc[:n_train].values, dtype=torch.float32)
test_features = torch.tensor(all_features_scaler.iloc[n_train:].values, dtype=torch.float32)
train_labels = torch.tensor(train_data.target.values.reshape(-1, 1), dtype=torch.float32)

Do some simple feature engineering, mainly principal component analysis (PCA) for dimensionality reduction:

#PCA processing for feature dimension reduction
pca = PCA(n_components=16)
train_features_16 = pca.fit_transform(train_features)
test_features_16 = pca.transform(test_features)
train_features_16 = torch.tensor(train_features_16, dtype=torch.float32)
test_features_16 = torch.tensor(test_features_16, dtype=torch.float32)

Then split the dataset so that we can evaluate the model.

#Split the dataset
from sklearn.model_selection import train_test_split        #Train/validation split helper

#Split the data: 80% for training, 20% for validation
train_data, test_data, train_target, test_target = train_test_split(train_features_16, train_labels, test_size=0.2, random_state=0)

Now it is time to define our neural network:

#Define Neural Network
class MyNet(nn.Module):
    def __init__(self, in_features, out_features=1):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(in_features, 12),
            nn.GELU(),
            nn.Linear(12, 10),
            nn.GELU(),
            nn.Linear(10, out_features),
        )

    def forward(self, x):
        return self.fc(x)

This handwritten network simply stacks three linear layers. Because it is a regression problem, the network has a single output. The activation function is GELU, which worked better than ReLU here.

When running a neural network, a GPU is much faster than a CPU, so use one if available.
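
The training loop below calls a data_iter helper that yields mini-batches; a minimal sketch of it (the name and signature are taken from the call site, the shuffling behavior is an assumption) is:

#Assumed helper: yield shuffled mini-batches of (features, labels)
def data_iter(features, labels, batch_size):
    num_examples = features.shape[0]
    indices = list(range(num_examples))
    random.shuffle(indices)                                    #Reshuffle the samples every epoch
    for i in range(0, num_examples, batch_size):
        batch_indices = torch.tensor(indices[i:i + batch_size])
        yield features[batch_indices], labels[batch_indices]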

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")        #Use the GPU if available
in_features = train_features_16.shape[1]
#print(in_features)
net = MyNet(in_features).to(device)
loss = nn.MSELoss()
optimizer = optim.Adam(net.parameters(), lr=0.001)

train_net = True
if train_net:
    for epoch in range(230):
        #net.train()
        for X, y in data_iter(train_data, train_target, 210):
            X = X.to(device)
            y = y.to(device)
            X_out = net(X)
            l = loss(X_out, y)

            optimizer.zero_grad()
            l.backward()
            optimizer.step()

        print("epoch: ",epoch, "train_loss: ",l.item())

    #torch.save(net.state_dict(), 'model/model.pth')

Then evaluate the network on the held-out validation set:

#Validation set loss
net.eval()                                                   #Switch to evaluation mode
test_data = test_data.to(device)
test_target = test_target.to(device)
with torch.no_grad():                                        #No gradient tracking needed for evaluation
    predict = net(test_data)
    l = loss(predict, test_target)
print("the loss: ", l.item())

Next, generate the submission file and hand in the results:
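
A minimal sketch of this step, assuming the Tianchi submission format of one prediction per line in a plain text file (the file name is arbitrary):

#Predict on the PCA-transformed test features and write one prediction per line
with torch.no_grad():
    test_pred = net(test_features_16.to(device)).cpu().numpy()
pd.DataFrame(test_pred).to_csv("submission.txt", index=False, header=False)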

Ouch, the score shows I'm still a rookie, and I'm crying.

 

Keywords: Python Machine Learning neural networks Pytorch Deep Learning

Added by estero2002 on Sun, 26 Sep 2021 20:08:07 +0300