In this article, we will see how to approach a regression problem and how to improve the accuracy of a machine learning model by using concepts such as feature transformation, feature engineering, clustering, boosting algorithms, and more.
Data science is an iterative process: only through repeated experiments can we arrive at the model and solution that best fits our needs.
Let's work through each of these stages with an example. I have a health insurance dataset (a CSV file) containing customer information such as insurance charges, age, gender, BMI, etc. We must predict the insurance cost based on the other parameters in the dataset. This is a regression problem, because our target variable, charges (the insurance cost), is numeric.
Let's start by loading the dataset and exploring its attributes (EDA - exploratory data analysis).
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

#Load csv into a dataframe
df = pd.read_csv('insurance_data.csv')
df.head(3)

#Get the number of rows and columns
print(f'Dataset size: {df.shape}')
>> (1338, 7)
The dataset has 1338 records and 6 features plus the target, charges. Smoker, sex, and region are categorical variables, while age, BMI, and children are numerical variables.
Handling null / missing values
Let's check the proportion of missing values in the dataset:
df.isnull().sum().sort_values(ascending=False)/df.shape[0]
Age and BMI have a few null values. We will handle this missing data and then start the data analysis. Sklearn's SimpleImputer allows you to replace missing values with the mean / median / most frequent value of the respective column. In this case, I use the median to fill in the null values.
from sklearn.impute import SimpleImputer

#Instantiate SimpleImputer
si = SimpleImputer(missing_values=np.nan, strategy='median')
si.fit(df[['age', 'bmi']])

#Filling missing data with median
df[['age', 'bmi']] = si.transform(df[['age', 'bmi']])
Data visualization
Now our data is clean. We will analyze it through visualizations and plots. A simple Seaborn pairplot can give us a lot of insight!
sns.pairplot(data=df, diag_kind='kde')
What do we see?
- Charges and children are skewed.
- Age is positively correlated with charges.
- BMI follows a normal distribution! 😎
Seaborn's box plots and count plots can be used to reveal the impact of the categorical variables on charges.
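These plots can be produced with something along these lines - a minimal sketch using the df loaded above; the exact figures in the original notebook may differ:

import seaborn as sns
import matplotlib.pyplot as plt

#Box plots: charges against each categorical feature
fig, axes = plt.subplots(2, 3, figsize=(15, 8))
for ax, col in zip(axes[0], ['sex', 'smoker', 'children']):
    sns.boxplot(data=df, x=col, y='charges', ax=ax)

#Count plots: how the customers are distributed across categories
for ax, col in zip(axes[1], ['sex', 'smoker', 'region']):
    sns.countplot(data=df, x=col, ax=ax)

plt.tight_layout()
plt.show()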
From the plots above, we observe the following:
- The numbers of men and women are almost equal; the median charges for men and women are about the same, but the range of charges is wider for men.
- Smokers' insurance charges are considerably higher.
- People with 2-3 children have the highest charges.
- Customers are almost evenly distributed across the four regions, and their charges are roughly the same.
- The percentage of female smokers is lower than that of male smokers.
Therefore, we can conclude that "smoker" has a considerable impact on insurance charges, while gender has the least impact.
Let's create a heat map to understand the correlation between the numeric features (age, BMI and children) and charges.
sns.heatmap(df[['age', 'bmi', 'children', 'charges']].corr(), cmap='Blues', annot=True)
plt.show()
We see a moderate correlation of age and BMI with charges.
Now we will walk through the steps of model preparation and model development one by one.
Feature encoding
In this step, we convert the categorical variables - smoker, sex, and region - into numeric form (0, 1, 2, 3, etc.), because most algorithms cannot handle non-numeric data. This process is called encoding. There are several ways to do it:
- LabelEncoding - represents each categorical value as a number (for example, a column of countries such as Italy, India, US, UK could be encoded as 1, 2, 3, 4).
- OrdinalEncoding - used to represent rank-based categorical values as numbers (for example, 3, 2, 1 for high, medium, and low). A minimal sketch of both of these encoders follows the code block below.
- OneHotEncoding - represents categorical data as binary values - only 0s and 1s. I prefer one-hot encoding over label encoding when there are not many unique values in the categorical column. Here, I use pandas' encoding function (get_dummies) on region and split it into four columns - location_NE, location_SE, location_NW and location_SW. Label encoding could also be used for this column, but one-hot encoding gave me better results.
#One hot encoding of region
region = pd.get_dummies(df.region, prefix='location')
df = pd.concat([df, region], axis=1)
df.drop(columns='region', inplace=True)

#Binary encoding of sex and smoker
df.sex.replace(to_replace=['male', 'female'], value=[1, 0], inplace=True)
df.smoker.replace(to_replace=['yes', 'no'], value=[1, 0], inplace=True)
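For reference, here is a minimal sketch of label and ordinal encoding with scikit-learn. It is not part of the pipeline used in this article, and the 'risk' column is purely illustrative:

from sklearn.preprocessing import LabelEncoder, OrdinalEncoder
import pandas as pd

#Label encoding: each category is mapped to an arbitrary integer
le = LabelEncoder()
print(le.fit_transform(['northeast', 'northwest', 'southeast', 'southwest']))

#Ordinal encoding: the integer order reflects the rank of the categories
oe = OrdinalEncoder(categories=[['low', 'medium', 'high']])
print(oe.fit_transform(pd.DataFrame({'risk': ['low', 'high', 'medium']})))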
Feature selection and scaling
Next, we will choose the features that most affect "charges". I chose all features except "sex", because it has little impact on charges (as concluded from the charts above). These features will form the variable X, and charges will form the variable y. If there are many features, I suggest using scikit-learn's SelectKBest to pick out the top features (see the sketch after the code below).
from sklearn.model_selection import train_test_split

#Feature Selection
y = df.charges.values
X = df[['age', 'bmi', 'smoker', 'children', 'location_northeast', 'location_northwest', 'location_southeast', 'location_southwest']]

#Split data into test and train
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
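If you do have many more features, a SelectKBest sketch along these lines could be used; it assumes X and y defined as above, and k=5 is an arbitrary choice:

from sklearn.feature_selection import SelectKBest, f_regression

#Score each feature against the target and keep the k highest-scoring ones
selector = SelectKBest(score_func=f_regression, k=5)
X_best = selector.fit_transform(X, y)
print(X.columns[selector.get_support()])  #names of the selected features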
Once we have chosen our features, we need to scale the numeric ones - age, BMI, and children - so that all features are on a comparable scale and no single one dominates the others. I use StandardScaler here, which standardizes each feature to zero mean and unit variance.
#Scaling numeric features using sklearn StandardScaler
from sklearn.preprocessing import StandardScaler

numeric = ['age', 'bmi', 'children']
sc = StandardScaler()
X_train[numeric] = sc.fit_transform(X_train[numeric])
X_test[numeric] = sc.transform(X_test[numeric])
Now we are ready to create our first baseline models 😀. We will try linear regression and a decision tree to predict insurance charges.
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

#Create a LinearRegression object
lr = LinearRegression()

#Fit X and y
lr.fit(X_train, y_train)
ypred = lr.predict(X_test)

#Metrics to evaluate your model
r2_score(y_test, ypred), mean_absolute_error(y_test, ypred), np.sqrt(mean_squared_error(y_test, ypred))

dt = DecisionTreeRegressor()
dt.fit(X_train, y_train)
yhat = dt.predict(X_test)
r2_score(y_test, yhat), mean_absolute_error(y_test, yhat), np.sqrt(mean_squared_error(y_test, yhat))
Mean absolute error (MAE) and root mean squared error (RMSE) are the metrics used to evaluate regression models. Our baseline models give an R² score of more than 76%. Of the two, the decision tree gives a better MAE of 2780.
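As a quick reminder of what these metrics measure, here is how they could be computed by hand with NumPy, using y_test and ypred from above; the results are equivalent to the sklearn helpers already used:

import numpy as np

#MAE: average absolute difference between predictions and actual values
mae = np.mean(np.abs(y_test - ypred))

#RMSE: square root of the average squared difference; large errors are penalised more
rmse = np.sqrt(np.mean((y_test - ypred) ** 2))
print(mae, rmse)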
Let's see how to make our model better.
Feature engineering
We can improve the model score by manipulating some of the features in the dataset. After several experiments, I found that the following items improve accuracy:
- Use KMeans to group similar customers into clusters.
- Group the region columns into a 'north' region (northeast and northwest) and a 'south' region (southeast and southwest).
- Convert 'children' into a binary feature 'more_than_1_child', which is 1 if the number of children is greater than 1.
from sklearn.cluster import KMeans

#Cluster customers into two groups based on all features
features = ['age', 'bmi', 'smoker', 'children', 'location_northeast', 'location_northwest', 'location_southeast', 'location_southwest']
kmeans = KMeans(n_clusters=2)
kmeans.fit(df[features])
df['cust_type'] = kmeans.predict(df[features])

#Helper functions to merge the one-hot region columns into north/south flags
def get_north(ne, nw):
    return 1 if (ne == 1 or nw == 1) else 0

def get_south(sw, se):
    return 1 if (sw == 1 or se == 1) else 0

df['location_north'] = df.apply(lambda x: get_north(x['location_northeast'], x['location_northwest']), axis=1)
df['location_south'] = df.apply(lambda x: get_south(x['location_southwest'], x['location_southeast']), axis=1)

#Convert 'children' into a binary feature
df['more_than_1_child'] = df.children.apply(lambda x: 1 if x > 1 else 0)
Feature transformation
From our EDA, we know that the distribution of "charges" (y) is highly skewed, so we will apply scikit-learn's target transformation - QuantileTransformer - to normalize this skew.
from sklearn.compose import TransformedTargetRegressor
from sklearn.preprocessing import QuantileTransformer

X = df[['age', 'bmi', 'smoker', 'more_than_1_child', 'cust_type', 'location_north', 'location_south']]

#Split test and train data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

model = DecisionTreeRegressor()
regr_trans = TransformedTargetRegressor(regressor=model, transformer=QuantileTransformer(output_distribution='normal'))
regr_trans.fit(X_train, y_train)
yhat = regr_trans.predict(X_test)
round(r2_score(y_test, yhat), 3), round(mean_absolute_error(y_test, yhat), 2), round(np.sqrt(mean_squared_error(y_test, yhat)), 2)
>> 0.843, 2189.28, 4931.96
Wow, amazing 84%, MAE has dropped to 2189!
Using ensemble and boosting algorithms
Now we will train ensemble models on these features, based on random forest, gradient boosting, LightGBM, and XGBoost. If you are a beginner and are not yet familiar with boosting and bagging, the sketch below gives a rough idea of the difference.
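Roughly speaking, bagging trains many independent trees on bootstrap samples and averages their predictions, while boosting builds trees sequentially, with each new tree correcting the errors of the ones before it. A minimal sketch with default parameters, assuming the train/test split from above:

from sklearn.ensemble import BaggingRegressor, GradientBoostingRegressor
from sklearn.metrics import r2_score

#Bagging: independent trees on bootstrap samples, predictions averaged
bagger = BaggingRegressor(n_estimators=100)

#Boosting: trees built one after another, each fitting the residual errors of the ensemble so far
booster = GradientBoostingRegressor(n_estimators=100, learning_rate=0.1)

for m in (bagger, booster):
    m.fit(X_train, y_train)
    print(m.__class__.__name__, round(r2_score(y_test, m.predict(X_test)), 3))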
from sklearn.ensemble import RandomForestRegressor

X = df[['age', 'bmi', 'smoker', 'more_than_1_child', 'cust_type', 'location_north', 'location_south']]
model = RandomForestRegressor()

#transforming target variable through quantile transformer
ttr = TransformedTargetRegressor(regressor=model, transformer=QuantileTransformer(output_distribution='normal'))
ttr.fit(X_train, y_train)
yhat = ttr.predict(X_test)
r2_score(y_test, yhat), mean_absolute_error(y_test, yhat), np.sqrt(mean_squared_error(y_test, yhat))
>> 0.8802, 2078, 4312
Yes! Our random forest model performs well - an MAE of 2078 👍. Now, we will try some boosting algorithms, such as gradient boosting, LightGBM, and XGBoost.
from sklearn.ensemble import GradientBoostingRegressor
import lightgbm as lgbm
import xgboost as xg

#generic function to fit model and return metrics for every algorithm
def boost_models(x):
    #transforming target variable through quantile transformer
    regr_trans = TransformedTargetRegressor(regressor=x, transformer=QuantileTransformer(output_distribution='normal'))
    regr_trans.fit(X_train, y_train)
    yhat = regr_trans.predict(X_test)
    algoname = x.__class__.__name__
    return algoname, round(r2_score(y_test, yhat), 3), round(mean_absolute_error(y_test, yhat), 2), round(np.sqrt(mean_squared_error(y_test, yhat)), 2)

algo = [GradientBoostingRegressor(), lgbm.LGBMRegressor(), xg.XGBRFRegressor()]
score = []
for a in algo:
    score.append(boost_models(a))

#Collate all scores in a table
pd.DataFrame(score, columns=['Model', 'Score', 'MAE', 'RMSE'])
Hyperparameter tuning
Let's tune some of the algorithm parameters, such as tree depth, number of estimators, learning rate, etc., and check the model accuracy. Manually trying different combinations of parameter values is time-consuming. Scikit-learn's GridSearchCV automates this process and computes the optimized values of these parameters. I have applied GridSearch to the above three algorithms. Here is the example for XGBoost:
from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [100, 80, 60, 55, 51, 45],
              'max_depth': [7, 8],
              'reg_lambda': [0.26, 0.25, 0.2]}

grid = GridSearchCV(xg.XGBRFRegressor(), param_grid, refit=True, verbose=3, n_jobs=-1)
regr_trans = TransformedTargetRegressor(regressor=grid, transformer=QuantileTransformer(output_distribution='normal'))

#fitting the model for grid search
grid_result = regr_trans.fit(X_train, y_train)
best_params = grid_result.regressor_.best_params_
print(best_params)

#using best params to create and fit model
best_model = xg.XGBRFRegressor(max_depth=best_params["max_depth"], n_estimators=best_params["n_estimators"], reg_lambda=best_params["reg_lambda"])
regr_trans = TransformedTargetRegressor(regressor=best_model, transformer=QuantileTransformer(output_distribution='normal'))
regr_trans.fit(X_train, y_train)
yhat = regr_trans.predict(X_test)

#evaluate metrics
r2_score(y_test, yhat), mean_absolute_error(y_test, yhat), np.sqrt(mean_squared_error(y_test, yhat))
Once we have the optimal values of the parameters, we run all three models again with these values.
This looks much better! We have been able to improve the accuracy further - XGBoost gives a score of 88.6%, with relatively low errors.
The distribution and residual plots confirm that there is a good overlap between the predicted and actual charges. However, a few predictions lie far from the actual values, which inflates our RMSE. We could reduce this by collecting more data points. A sketch of such a residual check follows below.
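This kind of check can be reproduced with a sketch like the one below, assuming regr_trans is the final tuned model from the previous step:

import seaborn as sns
import matplotlib.pyplot as plt

#Compare the distributions of actual and predicted charges
yhat = regr_trans.predict(X_test)
sns.kdeplot(x=y_test, label='actual')
sns.kdeplot(x=yhat, label='predicted')
plt.legend()
plt.show()

#Residual plot: points far from zero are the predictions that inflate the RMSE
residuals = y_test - yhat
plt.scatter(yhat, residuals, alpha=0.5)
plt.axhline(0, color='red')
plt.xlabel('predicted charges')
plt.ylabel('residual')
plt.show()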
Now we are ready to deploy this model to production and test it on unseen data.
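One common way to hand a fitted model over to production is to serialise it; a minimal sketch using joblib, with an illustrative file name:

import joblib

#Persist the fitted model to disk ...
joblib.dump(regr_trans, 'insurance_cost_model.joblib')

#... and load it back in the serving environment to score new data
model = joblib.load('insurance_cost_model.joblib')
print(model.predict(X_test[:5]))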
In short, the key points that improved the accuracy of my model were:
- Creating simple new features
- Transforming the target variable
- Clustering similar data points
- Using boosting algorithms
- Hyperparameter tuning
You can find my notebook here. Not every method will suit your model; choose what works best for you :)
https://github.com/sacharya225/data-expts/tree/master/Health%20Insurance%20Cost%20Prediction
Author: Shwetha Acharya
Deep hub translation group