Missing value imputation methods (regression filling, cluster filling, etc.)

Regression filling

First, import the required packages

import pandas as pd  
import numpy as np  
import matplotlib.pyplot as plt  
import seaborn as sns  
import random  
import missingno as mno  
  
import warnings  
warnings.filterwarnings('ignore')  

Then import the data

data=np.loadtxt('data\Magic.txt')  
tmp_columns=list('abcdefghij')  
tmp_columns.append('class')  
magic=pd.DataFrame(data=data,columns=tmp_columns)  

Randomly sample 10 rows of data for inspection

magic.sample(10)  


Check the dataset for missing values

magic.isnull().sum()  

We found no missing values

We draw a heatmap of the correlations between features

%matplotlib inline  
'''
It can be seen that the pairs a-b, a-c, b-c, d-e, j-a, j-b, j-c are strongly correlated
'''  
complete_features=magic.loc[:,magic.columns.difference(['class'])]  
# Draw the correlation heatmap
plt.figure(figsize=(10,10))  
sns.heatmap(complete_features.corr(),annot=True)  

Then, based on the correlations between features, we choose a, b, and c as the columns in which missing values will be introduced

We randomly remove 10% of the values in columns a, b, and c

prob_missing = 0.1  
col_incomplete=['a','b','c']  
ind_incomplete=[magic.columns.get_loc(i) for i in col_incomplete]  
df_incomplete = magic.copy()  
ix = [(row, col) for row in range(magic.shape[0]) for col in ind_incomplete]  
for row, col in random.sample(ix, int(round(prob_missing * len(ix)))):  
    df_incomplete.iat[row, col] = np.nan  
# Keep the original (complete) feature columns for later comparison  
df_complete=magic[col_incomplete]  
df_incomplete_copy=df_incomplete.copy()  
  
df_incomplete.isna().sum()  
mno.matrix(df_incomplete, figsize = (20, 6))  

The processed data table is visualized as shown in the following figure

random imputation

Next, we perform regression imputation on columns a, b, and c. I intend to use a KNN regression model, where the training features are all features except the one being predicted. However, since more than one column contains missing values, we cannot predict directly. So we first randomly fill the missing entries of a, b, and c with non-missing values drawn from the same column, producing three new features a_imp, b_imp, and c_imp. Then, when we predict a, we can let b_imp and c_imp participate in the training as features.

missing_columns=col_incomplete  
  
def random_imputation(df,feature):  
    num_missing=df[feature].isnull().sum()  
    observed_values=df.loc[df[feature].notnull(),feature]  
    df.loc[df[feature].isnull(),feature+'_imp']=np.random.choice(  
        observed_values,num_missing,replace=True  
    )  
    return df  
  
for feature in missing_columns:  
    df_incomplete[feature+'_imp']=df_incomplete[feature]  
    df_incomplete=random_imputation(df_incomplete,feature)  
  
mno.matrix(df_incomplete,figsize=[20,6])  

The data table with the newly added imputation columns is shown in the figure below

deterministic regression imputation

Then, we use a KNN regression model (scikit-learn's KNeighborsRegressor with its default parameters) to predict each feature that contains missing values and fill them in

from sklearn.neighbors import KNeighborsRegressor  
  
deter_data=pd.DataFrame(columns=['Det'+name for name in missing_columns])  
for feature in missing_columns:  
    deter_data['Det'+feature]=df_incomplete[feature+'_imp']  
    para=list(set(df_incomplete.columns)-set(missing_columns)-{feature+'_imp'})  
    # create model to fit  
    model=KNeighborsRegressor()  
    model.fit(X=df_incomplete[para],y=df_incomplete[feature+'_imp'])  
    deter_data.loc[df_incomplete[feature].isnull(), 'Det'+feature]=model.predict(  
        df_incomplete[para]  
    )[df_incomplete[feature].isnull()]  
  
mno.matrix(deter_data,figsize=[20,5])  

The filled data table is shown in the figure below. We can see that it no longer contains any missing values

Next, we look at histograms and box plots of the original data and the imputed data

sns.set()  
fig,axes=plt.subplots(nrows=3,ncols=2)  
fig.set_size_inches(8,8)  
  
for index, variable in enumerate(['a','b','c']):  
    sns.distplot(df_incomplete[variable].dropna(),kde=False,ax=axes[index, 0],color='blue')  
    sns.distplot(deter_data['Det'+variable],kde=False,ax=axes[index,0],color='red')  
    sns.boxplot(data=pd.concat([df_incomplete[variable], deter_data['Det'+variable]],axis=1),ax=axes[index,1])  
    plt.tight_layout()  

We can see that the histogram of the original complete data is taller and narrower than the histogram of the imputed features; in other words, the standard deviation of the original feature distribution is smaller than that of the imputed features

The reason for this is that the regression-imputed values fluctuate around the regression model's hyperplane, so they contain some noise

We can also see from the box plots that the interquartile range (IQR) of the imputed data is wider than that of the original data
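
As a quick numeric check of this claim (my own addition, using only the DataFrames defined above), we can compare the standard deviation of the observed values with that of the deterministically imputed columns:

# Compare the spread of observed values with the deterministic imputation
for feature in missing_columns:
    observed_std = df_incomplete[feature].dropna().std()
    imputed_std = deter_data['Det' + feature].std()
    print(feature, 'observed std:', observed_std, '| imputed std:', imputed_std)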

stochastic regression imputation

Therefore, to address this problem, we add normally distributed noise terms to the regression-imputed values

random_data=pd.DataFrame(columns=['Ran'+name for name in missing_columns])  
for feature in missing_columns:  
    random_data['Ran'+feature]=df_incomplete[feature+'_imp']  
    para=list(set(df_incomplete.columns)-set(missing_columns)-{feature+'_imp'})  
    # create model to fit  
    model=KNeighborsRegressor()  
    model.fit(X=df_incomplete[para],y=df_incomplete[feature+'_imp'])  
    #---  
    predict=model.predict(df_incomplete[para])  
    std_error=(predict[df_incomplete[feature].notnull()]  
        -df_incomplete.loc[df_incomplete[feature].notnull(), feature+'_imp']).std()  
    random_predict=np.random.normal(size=df_incomplete[feature].shape[0],  
        loc=predict,scale=std_error  
    )  
    #---  
    random_data.loc[(df_incomplete[feature].isnull())&(random_predict>0),  
        'Ran'+feature]=random_predict[(df_incomplete[feature].isnull())  
        &(random_predict > 0)]  

Then let's visualize it

sns.set()  
fig,axes=plt.subplots(nrows=3,ncols=2)  
fig.set_size_inches(8,8)  
  
for index, variable in enumerate(['a','b','c']):  
    sns.distplot(df_incomplete[variable].dropna(),kde=False,ax=axes[index, 0],color='blue')  
    sns.distplot(random_data['Ran'+variable],kde=False,ax=axes[index,0],color='red')  
    axes[index, 0].set(xlabel=variable+'/'+variable+'_imp')  
    sns.boxplot(data=pd.concat([df_incomplete[variable], random_data['Ran'+variable]],axis=1),ax=axes[index,1])  
  
    plt.tight_layout()  

We can see that the imputed feature distribution now fits the original distribution better and retains its shape
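
As a rough comparison (a sketch of my own, not part of the original analysis), we can print the spread of the observed values next to that of the deterministic and the stochastic (noise-added) imputations:

# Standard deviations: observed vs. deterministic vs. stochastic imputation
for feature in missing_columns:
    print(feature,
          'observed:', df_incomplete[feature].dropna().std(),
          '| deterministic:', deter_data['Det' + feature].std(),
          '| stochastic:', random_data['Ran' + feature].std())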

After filling, the data no longer contains null values

df_incomplete[missing_columns]=random_data.values  # assign by position, since the column names differ  
df_incomplete.drop(columns=['a_imp','b_imp','c_imp'],axis=1,inplace=True)  
df_incomplete.isnull().sum()  

We calculate the total squared error of the KNN-based imputation (the sum of squared differences between the original values and the imputed values)

knn_mse=((df_complete.values-random_data.values)**2).sum()  
knn_mse  
The result is knn_mse= 32.29607335035754
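
As a side note (my own addition), only the cells that were actually set to NaN contribute to this sum, because the observed cells are copied unchanged into random_data. A per-imputed-cell average can be computed like this:

# Restrict the squared error to the cells that were originally missing
missing_mask = df_incomplete_copy[missing_columns].isnull().values
sq_err = (df_complete.values - random_data.values) ** 2
print('total squared error over imputed cells:', sq_err[missing_mask].sum())
print('mean squared error per imputed cell:', sq_err[missing_mask].mean())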

Cluster filling

Then, we use the K-means method to fill in the missing values.
We first construct a dataset without the incomplete columns (a, b, c) and without the class label, and use it for clustering

df_incomplete=df_incomplete_copy.copy()  
df_cluster=df_incomplete[df_incomplete.columns.difference(['a','b','c','class'])]  
df_cluster  

The resulting data table is shown in the figure below

In order to visualize the clustering of the data, I wrote a PCA dimensionality-reduction visualization function

from sklearn.decomposition import PCA  
from mpl_toolkits.mplot3d import Axes3D   
  
def plot_pca(num,data,label):  
    pca=PCA(n_components=num)  
    X_pca=pca.fit_transform(data)  
    print(pca.components_)  
    # Split data  
    X_failure=np.array([x for i,x in enumerate(X_pca) if label[i]==1.0])  
    X_healthy=np.array([x for i,x in enumerate(X_pca) if label[i]==2.0])  
      
    if num==3:  
        fig = plt.figure(figsize=[10,15])  
        ax = Axes3D(fig)     
        #ax.legend(loc='best')  
        ax.set_zlabel('Z', fontdict={'size': 15, 'color': 'red'})  
        ax.set_ylabel('Y', fontdict={'size': 15, 'color': 'red'})  
        ax.set_xlabel('X', fontdict={'size': 15, 'color': 'red'})  
        ax.scatter(X_failure[:,0], X_failure[:,1], X_failure[:,2])  
        ax.scatter(X_healthy[:,0], X_healthy[:,1], X_healthy[:,2])  
        # Adjust viewing angle  
        ax.view_init(elev=50,azim=10)  
    elif num==2:  
        plt.figure(figsize=[10,10])  
        plt.scatter(X_failure[:,0],X_failure[:,1])  
        plt.scatter(X_healthy[:,0],X_healthy[:,1])  
    else:  
        print('i do not want to work.....')  

Then, in order to find an appropriate number of clusters, we use the elbow method

from sklearn.cluster import KMeans  
%matplotlib inline  
SSE = []  # Store the sum of squares of errors for each result  
for k in range(1,9):  
    estimator=KMeans(n_clusters=k, random_state=9)  
    estimator.fit(df_cluster)  
    SSE.append(estimator.inertia_)  # estimator.inertia_ is the within-cluster sum of squared distances  
plt.xlabel('k')  
plt.ylabel('SSE')  
plt.plot(range(1,9),SSE,'o-')  
plt.show()  

We find that the elbow (the largest change in slope) occurs at k=2, so we choose 2 as the number of clusters

Next, let's visualize the clustering effect

%matplotlib inline  
# Pre clustering  
kmeans = KMeans(n_clusters=2, random_state=9)  
idxs = kmeans.fit_predict(df_cluster)  
# Dimensionality reduction  
pca=PCA(n_components=3)  
pca.fit(df_cluster)  
X_pca=pca.transform(df_cluster)  
  
subX = []  
# Traverse the clusters and collect each cluster's sample points into subX  
for id in range(len(np.unique(idxs))):  
    subX.append(np.array([X_pca[i] for i in range(X_pca.shape[0]) if idxs[i] == id]))  
      
fig = plt.figure(figsize=[8,8])  
ax = Axes3D(fig)      
#ax.legend(loc='best')  
ax.set_zlabel('Z', fontdict={'size': 15, 'color': 'red'})  
ax.set_ylabel('Y', fontdict={'size': 15, 'color': 'red'})  
ax.set_xlabel('X', fontdict={'size': 15, 'color': 'red'})  
# ax.view_init(elev=50,azim=10)  
      
for x in range(len(subX)):  
    newX = subX[x]  
    # Scatter plot  
    ax.scatter(newX[:,0], newX[:,1], newX[:,2])  

We can see that the data does not exhibit a clear cluster structure
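
As a rough quantitative check (my own addition, using scikit-learn's silhouette_score; not part of the original post), we can measure how well separated the two clusters are; values close to 0 indicate heavily overlapping clusters:

from sklearn.metrics import silhouette_score

# Evaluate on a random subsample to keep the computation cheap
sil = silhouette_score(df_cluster, idxs, sample_size=5000, random_state=9)
print('silhouette score for k=2:', sil)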

# Calculate the mean of columns a, b, c within each of the two clusters   
df_list=[]  
df_data=df_incomplete[df_incomplete.columns.difference(['class'])].values  
for id in range(len(np.unique(idxs))):  
    tmp_cluster_df=pd.DataFrame([df_data[i] for i in range(df_data.shape[0]) if idxs[i]==id]).iloc[:,:3]  
    df_list.append(tmp_cluster_df.mean().values)  
# Cluster data block  
cluster_data=pd.DataFrame(columns=['Clu_'+name for name in missing_columns])  
# Fill data  
for feature in missing_columns:  
    cluster_data['Clu_'+feature]=df_incomplete[feature]  
cluster_data['cluster']=idxs  
cluster_data  

The following figure shows our data set

Fill the dataset

for i,feature in enumerate(missing_columns):  
    cluster_data.loc[(cluster_data['Clu_'+feature].isnull())&  
        (cluster_data['cluster']==0),'Clu_'+feature]=df_list[0][i]  
for i,feature in enumerate(missing_columns):  
    cluster_data.loc[(cluster_data['Clu_'+feature].isnull())&  
        (cluster_data['cluster']==1),'Clu_'+feature]=df_list[1][i]  
cluster_data.drop(['cluster'],axis=1,inplace=True)  
cluster_data.isnull().sum()  

Calculate the total squared error of the cluster-based imputation

cluster_mse=((df_complete.values-cluster_data.values)**2).sum()  
cluster_mse  
The result is cluster_mse = 74.81802220522698

Visualize the feature distribution after filling

sns.set()  
fig,axes=plt.subplots(nrows=3,ncols=2)  
fig.set_size_inches(8,8)  
  
for index, variable in enumerate(missing_columns):  
    sns.distplot(df_incomplete[variable].dropna(),kde=False,ax=axes[index, 0],color='blue')  
    sns.distplot(cluster_data['Clu_'+variable],kde=False,ax=axes[index,0],color='red')  
    axes[index, 0].set(xlabel=variable)  
    sns.boxplot(data=pd.concat([df_incomplete[variable],cluster_data['Clu_'+variable]],axis=1),ax=axes[index,1])  
  
    plt.tight_layout()  

We can see that the imputed feature distribution does not fit the original feature distribution well; the clustering method does not perform well here

It may also be because the number of clusters is too small
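
As a possible follow-up (a sketch of my own, not run in the original post), we could repeat the per-cluster mean imputation with a larger k, for example k=8, and compare the resulting squared error with the k=2 value above:

# Hypothetical experiment: per-cluster mean imputation with more clusters
k = 8
kmeans_k = KMeans(n_clusters=k, random_state=9)
idxs_k = kmeans_k.fit_predict(df_cluster)

cluster_data_k = pd.DataFrame({'Clu_' + f: df_incomplete[f] for f in missing_columns})
for cid in range(k):
    in_cluster = (idxs_k == cid)
    for f in missing_columns:
        # Mean of the observed values of this feature inside the cluster
        fill_value = df_incomplete.loc[in_cluster, f].mean()
        mask = df_incomplete[f].isnull() & in_cluster
        cluster_data_k.loc[mask, 'Clu_' + f] = fill_value

cluster_mse_k = ((df_complete.values - cluster_data_k.values) ** 2).sum()
cluster_mse_k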

Autoencoder filling

I'm lazy; I'll fill in this section later

Conclusion

  • In this experiment, the clustering method (K-means) does not fill in the missing values well. I guess this is because the data is not well separated into clusters. The total squared error between the final filled data set and the original data set is 74.81802220522698.
  • With regression imputation, we find that the histogram of the original complete data is taller and narrower than that of the imputed features; in other words, the standard deviation of the original feature distribution is smaller than that of the imputed features.
  • The reason for this is that the regression-imputed values fluctuate around the regression model's hyperplane and therefore contain some noise.
  • Therefore, at the end, a correction term drawn from a Gaussian distribution is added to each regression-imputed value. After doing so, the filling effect is clearly better and the squared error is relatively small.
