# Regression filling

First import the required package

```import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import random
import missingno as mno

import warnings
warnings.filterwarnings('ignore')
```

Then import the data

```data=np.loadtxt('data\Magic.txt')
tmp_columns=list('abcdefghij')
tmp_columns.append('class')
magic=pd.DataFrame(data=data,columns=tmp_columns)
```

10 pieces of data were randomly selected for observation

```magic.sample(10)
``` View the missing values of the dataset

```magic.isnull().sum()
```

We found no missing values We draw the heat map between features and observe the correlation between features ```%matplotlib inline
'''''
It can be seen that a-b,a-c,b-c,d-e,j-a,j-b,j-c
'''
complete_features=magic.loc[:,magic.columns.difference(['class'])]
# Draw thermal diagram
plt.figure(figsize=(10,10))
sns.heatmap(complete_features.corr(),annot=True)
```

Then, according to the correlation between features, we select a, b and c as the columns with missing values

We randomly extract 10% of the data from a, b and c

```prob_missing = 0.1
col_incomplete=['a','b','c']
ind_incomplete=[magic.columns.get_loc(i) for i in col_incomplete]
df_incomplete = magic.copy()
ix = [(row, col) for row in range(magic.shape) for col in ind_incomplete]
for row, col in random.sample(ix, int(round(prob_missing * len(ix)))):
df_incomplete.iat[row, col] = np.nan
# Original feature column
df_complete=magic[col_incomplete]
df_incomplete_copy=df_incomplete.copy()

df_incomplete.isna().sum()
mno.matrix(df_incomplete, figsize = (20, 6))
```

The processed data table is visualized as shown in the following figure ## random imputation

Next, we will perform regression filling on columns a, B and C. I intend to use knn regression model. The features of the training set are all features except those to be predicted. However, since more than one row contains vacancy values, we can't predict directly, so we randomly fill columns a, B and C with non empty values in columns a, B and C as the new three feature a_tmp, b_tmp, c_tmp, when we predict a, we can put b_tmp, c_tmp participates in training together as a feature

```missing_columns=col_incomplete

def random_imputation(df,feature):
num_missing=df[feature].isnull().sum()
observed_values=df.loc[df[feature].notnull(),feature]
df.loc[df[feature].isnull(),feature+'_imp']=np.random.choice(
observed_values,num_missing,replace=True
)
return df

for feature in missing_columns:
df_incomplete[feature+'_imp']=df_incomplete[feature]
df_incomplete=random_imputation(df_incomplete,feature)

mno.matrix(df_incomplete,figsize=[20,6])
```

The data table filled with new features is shown in the following figure ## deterministic regression imputation

Then, we use knn (n_neighbor = 3) model to predict each missing feature and fill in the missing value

```from sklearn.neighbors import KNeighborsRegressor

deter_data=pd.DataFrame(columns=['Det'+name for name in missing_columns])
for feature in missing_columns:
deter_data['Det'+feature]=df_incomplete[feature+'_imp']
para=list(set(df_incomplete.columns)-set(missing_columns)-{feature+'_imp'})
# create model to fit
model=KNeighborsRegressor()
model.fit(X=df_incomplete[para],y=df_incomplete[feature+'_imp'])
deter_data.loc[df_incomplete[feature].isnull(), 'Det'+feature]=model.predict(
df_incomplete[para]
)[df_incomplete[feature].isnull()]

mno.matrix(deter_data,figsize=[20,5])
```

The filled data table is shown in the following figure. It can be found that the data set does not contain vacancy values Next, we look at the distribution histogram and box graph of the original data and the filled data

```sns.set()
fig,axes=plt.subplots(nrows=3,ncols=2)
fig.set_size_inches(8,8)

for index, variable in enumerate(['a','b','c']):
sns.distplot(df_incomplete[variable].dropna(),kde=False,ax=axes[index, 0],color='blue')
sns.distplot(deter_data['Det'+variable],kde=False,ax=axes[index,0],color='red')
sns.boxplot(data=pd.concat([df_incomplete[variable], deter_data['Det'+variable]],axis=1),ax=axes[index,1])
plt.tight_layout()
``` We can find that the feature distribution histogram of the original complete data is higher and narrower than that of the filled feature histogram. In other words, the standard deviation of the feature distribution of the original complete data is smaller than that of the filled feature histogram

The reason for this phenomenon is that we use the regression method to fill in the missing values, which actually fluctuate up and down along the hyperplane of the regression model, which contains some noise

We can also see from the box diagram that the IQ range of the filled data is wider than that of the original data

## stochastic regression imputation

Therefore, in order to solve this problem, we will add some interference terms to the regression filled data, which obey the normal distribution

```random_data=pd.DataFrame(columns=['Ran'+name for name in missing_columns])
for feature in missing_columns:
random_data['Ran'+feature]=df_incomplete[feature+'_imp']
para=list(set(df_incomplete.columns)-set(missing_columns)-{feature+'_imp'})
# create model to fit
model=KNeighborsRegressor()
model.fit(X=df_incomplete[para],y=df_incomplete[feature+'_imp'])
#---
predict=model.predict(df_incomplete[para])
std_error=(predict[df_incomplete[feature].notnull()]
-df_incomplete.loc[df_incomplete[feature].notnull(), feature+'_imp']).std()
random_predict=np.random.normal(size=df_incomplete[feature].shape,
loc=predict,scale=std_error
)
#---
random_data.loc[(df_incomplete[feature].isnull())&(random_predict>0),
'Ran'+feature]=random_predict[(df_incomplete[feature].isnull())
&(random_predict > 0)]
```

Then let's visualize it

```sns.set()
fig,axes=plt.subplots(nrows=3,ncols=2)
fig.set_size_inches(8,8)

for index, variable in enumerate(['a','b','c']):
sns.distplot(df_incomplete[variable].dropna(),kde=False,ax=axes[index, 0],color='blue')
sns.distplot(random_data['Ran'+variable],kde=False,ax=axes[index,0],color='red')
axes[index, 0].set(xlabel=variable+'/'+variable+'_imp')
sns.boxplot(data=pd.concat([df_incomplete[variable], random_data['Ran'+variable]],axis=1),ax=axes[index,1])

plt.tight_layout()
``` We can find that the feature distribution after filling is better than the original distribution, and retains the shape of the original distribution

After filling, the data no longer contains null values ```df_incomplete[missing_columns]=random_data
df_incomplete.drop(columns=['a_imp','b_imp','c_imp'],axis=1,inplace=True)
df_incomplete.isnull().sum()
```

We calculate the mean square error of knn filling

```knn_mse=((df_complete.values-random_data.values)**2).sum()
knn_mse
The result is knn_mse= 32.29607335035754
```

# Cluster filling

Then, we use the K-means method to fill in the missing values
We first construct a dataset without missing columns (a, b, c) for clustering

```df_incomplete=df_incomplete_copy.copy()
df_cluster=df_incomplete[df_incomplete.columns.difference(['a','b','c','class'])]
df_cluster
```

The data sheet is shown in the figure below In order to visualize the clustering of data, I wrote a pca dimension reduction visualization function

```from sklearn.decomposition import PCA
from mpl_toolkits.mplot3d import Axes3D

def plot_pca(num,data,label):
pca=PCA(n_components=num)
X_pca=pca.fit_transform(data)
print(pca.components_)
# Split data
X_failure=np.array([x for i,x in enumerate(X_pca) if label[i]==1.0])
X_healthy=np.array([x for i,x in enumerate(X_pca) if label[i]==2.0])

if num==3:
fig = plt.figure(figsize=[10,15])
ax = Axes3D(fig)
#ax.legend(loc='best')
ax.set_zlabel('Z', fontdict={'size': 15, 'color': 'red'})
ax.set_ylabel('Y', fontdict={'size': 15, 'color': 'red'})
ax.set_xlabel('X', fontdict={'size': 15, 'color': 'red'})
ax.scatter(X_failure[:,0], X_failure[:,1], X_failure[:,2])
ax.scatter(X_healthy[:,0], X_healthy[:,1], X_healthy[:,2])
ax.view_init(elev=50,azim=10)
elif num==2:
plt.figure(figsize=[10,10])
plt.scatter(X_failure[:,0],X_failure[:,1])
plt.scatter(X_healthy[:,0],X_healthy[:,1])
else:
print('i do not want to work.....')
```

Then, in order to find the appropriate number of clusters, we use the elbow method to calculate the optimal number of clusters

```from sklearn.cluster import KMeans
%matplotlib inline
SSE = []  # Store the sum of squares of errors for each result
for k in range(1,9):
estimator=KMeans(n_clusters=k, random_state=9)
estimator.fit(df_cluster)
SSE.append(estimator.inertia_) # estimator.inertia_ Get the sum of clustering criteria
plt.xlabel('k')
plt.ylabel('SSE')
plt.plot(range(1,9),SSE,'o-')
plt.show()

``` We find that the curvature is the largest when k=2, so the number of clusters we choose to cluster is 2

Next, let's visualize the clustering effect

```%matplotlib inline
# Pre clustering
kmeans = KMeans(n_clusters=2, random_state=9)
idxs = kmeans.fit_predict(df_cluster)
# Dimensionality reduction
pca=PCA(n_components=3)
pca.fit(df_cluster)
X_pca=pca.transform(df_cluster)

subX = []
#Traverse the cluster and load the sample points (pixels) into the subX by color
for id in range(len(np.unique(idxs))):
subX.append(np.array([X_pca[i] for i in range(X_pca.shape) if idxs[i] == id]))

fig = plt.figure(figsize=[8,8])
ax = Axes3D(fig)
#ax.legend(loc='best')
ax.set_zlabel('Z', fontdict={'size': 15, 'color': 'red'})
ax.set_ylabel('Y', fontdict={'size': 15, 'color': 'red'})
ax.set_xlabel('X', fontdict={'size': 15, 'color': 'red'})
# ax.view_init(elev=50,azim=10)

for x in range(len(subX)):
newX = subX[x]
# Scatter plot
ax.scatter(newX[:,0], newX[:,1], newX[:,2])

``` We can find that the distribution of data is not completely cluster distribution

```# Calculate the mean of two categories (mean a, mean B, mean C)
df_list=[]
df_data=df_incomplete[df_incomplete.columns.difference(['class'])].values
for id in range(len(np.unique(idxs))):
tmp_cluster_df=pd.DataFrame([df_data[i] for i in range(df_data.shape) if idxs[i]==id]).iloc[:,:3]
df_list.append(tmp_cluster_df.mean().values)
# Cluster data block
cluster_data=pd.DataFrame(columns=['Clu_'+name for name in missing_columns])
# Fill data
for feature in missing_columns:
cluster_data['Clu_'+feature]=df_incomplete[feature]
cluster_data['cluster']=idxs
cluster_data
```

The following figure shows our data set Fill dataset

```for i,feature in enumerate(missing_columns):
cluster_data.loc[(cluster_data['Clu_'+feature].isnull())&
(cluster_data['cluster']==0),'Clu_'+feature]=df_list[i]
for i,feature in enumerate(missing_columns):
cluster_data.loc[(cluster_data['Clu_'+feature].isnull())&
(cluster_data['cluster']==1),'Clu_'+feature]=df_list[i]
cluster_data.drop(['cluster'],axis=1,inplace=True)
cluster_data.isnull().sum()
```

Calculate the mean square error of cluster filling

```cluster_mse=((df_complete.values-cluster_data.values)**2).sum()
cluster_mse
kmeans_mse= 74.81802220522698
```

Visualize the feature distribution after filling

```sns.set()
fig,axes=plt.subplots(nrows=3,ncols=2)
fig.set_size_inches(8,8)

for index, variable in enumerate(missing_columns):
sns.distplot(df_incomplete[variable].dropna(),kde=False,ax=axes[index, 0],color='blue')
sns.distplot(cluster_data['Clu_'+variable],kde=False,ax=axes[index,0],color='red')
axes[index, 0].set(xlabel=variable)
sns.boxplot(data=pd.concat([df_incomplete[variable],cluster_data['Clu_'+variable]],axis=1),ax=axes[index,1])

plt.tight_layout()
``` We can find that the filled feature distribution can not fit the original feature distribution well, and the performance of clustering method is not good

It may also be because the number of clusters is too small

# Autoencoder filling

I'm lazy. I'll fill the pit later

# conclusion

• In this experiment, the clustering method (K-means) does not fill in the missing values well. I guess it is because the data is not well clustered and distributed. The mean square error between the final filled data set and the original data set is 74.81802220522698
• In the regression filling missing values, we can find that the feature distribution histogram of the original complete data is higher and narrower than that of the filled feature histogram. In other words, the standard deviation of the feature distribution of the original complete data is smaller than that of the filled feature histogram.
• The reason for this phenomenon is that we use the regression method to fill in the missing values, which actually fluctuate up and down along the hyperplane of the regression model, which contains some noise
• Therefore, at last, a certain correction term is added to each term filled by regression. These correction terms obey Gaussian distribution. After doing so, the filling effect is obviously better and the mse is relatively small

Keywords: Python

Added by argyrism on Tue, 09 Nov 2021 07:17:26 +0200