[second hand car valuation in the second round of Mathorcup cup big data challenge] idea and Python implementation

subject

Question 1: Based on question 2 of the preliminary round, if you need to accurately predict the transaction cycle of vehicles, what method will you adopt for modeling? Please use Annex 4 "store transaction training data" to build the transaction cycle prediction model, predict Annex 5 "store transaction verification data", and save the prediction results in Annex 6 "store transaction model results" file. Be careful not to modify the format. Annex 5 "store transaction verification data" only includes the first 1 to 4 fields of Annex 4 "store transaction training data". All carid and other relevant information in Annex 5 are included in Annex 2 "valuation verification data".

Question 2: in the process of selling vehicles in the store, in addition to accurately predicting the future transaction cycle of vehicles in the store, it is also necessary to effectively manage the inventory (assuming that the site and staff of the store remain unchanged during the evaluation cycle), so as to maximize the sales profit of the store when the cost (vehicle capital occupation cost and parking space occupation cost) is minimized. The price of vehicles is a very important factor affecting the transaction of vehicles. When doing inventory management, stores need to price or adjust the sales price of vehicles according to the situation of vehicles in stock and newly received vehicles. On the one hand, hot-selling vehicles can be traded at a more appropriate price to preserve the profits of stores, and at the same time, slow-moving vehicles should be reduced and promoted to avoid greater losses. Based on this, Assuming that you are the store manager of the store, what you can decide is when to adjust the price of a certain vehicle and how much to adjust, so as to ensure the achievement of the business goal of the store (maximizing the gross profit of the store under the condition of minimizing the cost). Here, the labor cost and other costs of employees are not considered. Please abstract the mathematical model description of the problem, build the store operation model, and give the solution ideas and algorithm steps of the model. Here, it is assumed that the operation goal is evaluated once a month.

According to the answers to questions 1 and 2, improve the preliminary thesis and clarify your ideas, models, methods and results.

1 ideas

1.1 first question

It's a regression problem

Annex 4 is used as the training set and Annex 5 is used as the test set. The LGB regression model is used for regression prediction, and the predicted value is rounded up. It involves the calculation of transaction cycle, which needs attention,... Please download the complete idea

For the feature construction of regression model, there are other feature construction methods besides the feature intersection of baseline I provide below. as follows

reference resources: Method of feature construction

(1) Basic transformation of single variable: X, x ^ 2, sqrt x, log x, scaling

(2) If the distribution of variables is long tailed, use box Cox conversion (log conversion is fast but not necessarily a good choice)

(3) You can also check Residuals or log odd (for linear models) to analyze whether it is strongly nonlinear.

(4) For data with large cardinality and classification variables, it is very useful to create a feature representing the occurrence frequency of each category. Of course, these categories can also be expressed as a ratio or percentage of the total.

(5) For each possible value of the variable, estimate the average of the target variable, and use the result as the feature of creation.

(6) Create a feature with a target variable ratio.

(7) Select the two most important variables, calculate their second-order interaction with each other and with other variables, and put them into the model, and compare the resulting model results with the results of the original linear model.

(8) If you want a smoother solution, you can apply the radial basis function kernel. This is equivalent to applying a smooth transition.

(9) If you think you need Covariates, you can apply polynomial kernels or explicitly add their Covariates.

(10) High cardinality feature: in the preprocessing stage, it is converted into numerical variables through out of fold average.

. . . .

1.2 second question

Title Requirements: the mathematical model of whether to reduce the price, the range of price reduction and the time of price reduction

Simple thinking can also be a regression problem. If it is complex, it is a planning problem. Because there is no data for this problem, if we want to do it by planning, it is pure theoretical mathematical modeling. I give the following ideas and implementation.
. . . Please download the complete idea
Question 2: complete idea and Python implementation code download

2 implementation

2.1 TXT to CSV

import scipy.stats as st
import pandas as pd 
import seaborn as sns
from pylab import mpl 
import numpy as np
import matplotlib.pyplot as plt
from tqdm import tqdm
tqdm.pandas()
import warnings

warnings.filterwarnings('ignore')
# plt.rcParams['font.sans-serif'] = ['STSong']
# mpl.rcParams['font.sans-serif'] = ['STSong'] # Specifies the default font 
mpl.rcParams['axes.unicode_minus'] = False

import csv
import os
import pickle


filepath = "./data/Annex 5: store transaction verification data.txt"
fh = open(r'./data/{}.csv'.format("file5"), "w+", newline='')
writer = csv.writer(fh)
writer.writerow(["carid", "pushDate", "pushPrice", "updatePriceTimeJson"])
with open(filepath, 'r', encoding="utf-8") as f:
    # try:
    res = []
    for line in f.readlines():
        d = [x for x in line.strip().split('\t')]
        res.append(d)
    writer.writerows(res)
f.close()
fh.close()

2.2 data preprocessing

df4 = pd.read_csv('./data/file4.csv')
df5 = pd.read_csv('./data/file5.csv')

(1) Calculate transaction cycle

# Remove unsold samples
df_trans = df4[df4.withdrawDate.notna()]

. . . . Please download the complete code https://mianbaoduo.com/o/bread/YpiXlpZx

train_cols = ['pushDate','pushPrice','transcycle']
df_train = df_trans[train_cols]
test_cols = ['pushDate','pushPrice']
df_test = df5[test_cols]
df_train

import scipy.stats as st
import seaborn as sns 
import matplotlib.pyplot as plt 

plt.figure(figsize=(14, 5))
plt.subplot(122)
plt.title('Normal distribution fitting-Processed', fontsize=20)
sns.distplot(np.log1p(df_train['pushPrice']), kde=False, fit=st.norm)
plt.xlabel('Shelf price', fontsize=20)
plt.subplot(121)
plt.title('Normal distribution fitting-Untreated', fontsize=20)
sns.distplot(df_train['pushPrice'], kde=False, fit=st.norm)
plt.xlabel('Shelf price', fontsize=20)
plt.savefig('img/Fitting of normal distribution of on shelf price.png',dpi=300)

(2) Extracting temporal features

# # Processing time (extraction date)
df_train['pushDate'] = pd.to_datetime(df_train['pushDate'])
df_test['pushDate'] = pd.to_datetime(df_test['pushDate'])
df_train['pushDate_year'] = df_train['pushDate'].dt.year
df_train['pushDate_month'] = df_train['pushDate'].dt.month
df_train['pushDate_day'] = df_train['pushDate'].dt.day

df_test['pushDate_year'] = df_test['pushDate'].dt.year
df_test['pushDate_month'] = df_test['pushDate'].dt.month
df_test['pushDate_day'] = df_test['pushDate'].dt.day

del df_train['pushDate']
del df_test['pushDate']

(3) Conversion of data distribution

df_train['pushPrice'] = np.log1p(df_train['pushPrice'])
df_test['pushPrice'] = np.log1p(df_test['pushPrice'])
df_train.columns

Index(['pushPrice', 'update_price', 'barging_times', 'barging_price', 'transcycle', 'pushDate_year', 'pushDate_month', 'pushDate_day'], dtype='object')

(4) Feature crossover

#Define cross feature statistics
def cross_cat_num(df, num_col, cat_col):
    for f1 in tqdm(cat_col):
        g = df.groupby(f1, as_index=False)
        for f2 in tqdm(num_col):
            feat = g[f2].agg({
                '{}_{}_max'.format(f1, f2): 'max', '{}_{}_min'.format(f1, f2): 'min',
                '{}_{}_median'.format(f1, f2): 'median',
                '{}_{}_sum'.format(f1, f2): 'sum',
                '{}_{}_mad'.format(f1, f2): 'mad',
            })
            df = df.merge(feat, on=f1, how='left')
    return(df)
### Cross with numerical features and category features
cross_num = ['pushPrice']

cross_cat = ['pushDate_year', 'pushDate_month','pushDate_day']
data_train = cross_cat_num(df_train, cross_num, cross_cat)  # First order crossover
data_test = cross_cat_num(df_test, cross_num, cross_cat)  # First order crossover
data_train.shape

(8000, 20)

2.3 model training

(1) Model training

from sklearn import metrics
from sklearn.model_selection import KFold
import lightgbm as lgb
import pandas as pd
from sklearn.model_selection import KFold

import numpy as np
from sklearn.preprocessing import StandardScaler

from sklearn.preprocessing import StandardScaler
train = data_train
test = data_test
train_y = train['transcycle']
del train['transcycle']
scaler = StandardScaler()
train_x = scaler.fit_transform(train)
test_x = scaler.fit_transform(test)

from sklearn import metrics

params = {
    'boosting_type': 'gbdt',
    'objective': 'regression_l1',
    'metric': 'mae',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.9,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': -1,
}

def MAE_metric(y_true, y_pred):
    return metrics.mean_absolute_error(y_true, y_pred)

folds = 5
kfold = KFold(n_splits=folds, shuffle=True, random_state=5421)
preds_lgb = np.zeros(len(test_x))
for fold, (trn_idx, val_idx) in enumerate(kfold.split(train_x, train_y)):
    import lightgbm as lgb
    print('-------fold {}-------'.format(fold))
    x_tra, y_trn, x_val, y_val = train_x[trn_idx], train_y.iloc[trn_idx], train_x[val_idx], train_y.iloc[val_idx]

    train_set = lgb.Dataset(x_tra, y_trn)
    val_set = lgb.Dataset(x_val, y_val)
    # lgb
    lgbmodel =. . . . Please download the complete code
    
    val_pred_xgb = lgbmodel.predict(
        x_val, predict_disable_shape_check=True)
    preds_lgb += lgbmodel.predict(test_x,
                                    predict_disable_shape_check=True) / folds
    val_mae = MAE_metric(y_val, val_pred_xgb)
    print('lgb val_mae {}'.format(val_mae))

-------fold 0------- lgb val_mae 15.108706443185115 -

------fold 1------- lgb val_mae 15.755760771009792

-------fold 2------- lgb val_mae 16.597388380375197

-------fold 3------- lgb val_mae 15.483798531878621

-------fold 4------- lgb val_mae 15.278992579304203

(2) Store the prediction results as TXT

import math
file5 = pd.read_csv('./data/file5.csv')
submit_file  = pd.DataFrame(columns=['id'])
submit_file['id'] = file5['carid']
# Round up
submit_file['transcycle'] = [math.ceil(i) for i in list(preds_lgb)]
submit_file['transcycle'].astype(int)

i = 0
with open('./submit/Annex 6: results of store transaction model.txt','a+', encoding='utf-8') as f:
    for line in submit_file.values:
        if i==0:
            i += 1
            continue
        else:
            i += 1
            f.write((str(line[0])+'\t'+str(line[1])+'\n'))

See the top link for complete ideas and code download

Keywords: Mathematical Modeling

Added by crishna369 on Mon, 07 Mar 2022 21:02:34 +0200

Programming VIP