Background
Advertising fraud is one of the major challenges facing digital marketing. Click fraud wastes advertisers' money and distorts click data. The competition provides roughly 500,000 click records. Note: the organizers simulated the data, hid the meaning of some features, and desensitized the values.
The task is to predict whether a user's click is normal or fraudulent. Click-fraud prediction applies to all kinds of feed ads, banner ads and the Baidu Union advertising platform, helping businesses identify click fraud and reach accurate, real users.
- Competition address: https://aistudio.baidu.com/aistudio/competition/detail/52/0/introduction
- Competition dataset: https://download.csdn.net/download/turkeym4/72338032#
Data and tasks
The competition provides 500,000 training records and 150,000 test records. The goal is to predict whether each record is a fraudulent click.
Field | Type | Description |
---|---|---|
sid | string | Sample id / request session sid |
package | string | Media information, package name (encrypted) |
version | string | Media information, app version |
android_id | string | Media information, external ad slot ID (encrypted) |
media_id | string | Media information, external media ID (encrypted) |
apptype | int | Media information, app category |
timestamp | bigint | Request arrival service time, in ms |
location | int | User geolocation code (accurate to city) |
fea_hash | int | User characteristic code (specific physical meaning is omitted) |
fea1_hash | int | User characteristic code (specific physical meaning is omitted) |
cus_type | int | User characteristic code (specific physical meaning is omitted) |
ntt | int | Network type: 0-unknown, 1-wired, 2-WiFi, 3-cellular, 4-2G, 5-3G, 6-4G |
carrier | string | Carrier used by the device: 0-unknown, 46000-China Mobile, 46001-China Unicom, 46003-China Telecom |
os | string | Operating system, android by default |
osv | string | Operating system version |
lan | string | Device language, Chinese by default |
dev_height | int | Device height |
dev_width | int | Device width |
dev_ppi | int | Screen resolution (ppi) |
label | int | Whether the click is fraudulent |
The label field tells us this is a binary classification task, which can be solved with classical machine learning algorithms or with an MLP.
Solution approach
The solution is divided into two parts:
- Binary classification with machine learning algorithms: LightGBM / XGBoost / CatBoost
- Binary classification with deep learning: MLP / Wide & Deep / DeepFM
The overall modeling scheme is outlined below; see the source code in the Gitee repository for details.
Machine learning
Machine learning here boils down to feature engineering plus well-worn ("ancestral") parameters. To get a first Baseline out quickly, we usually start with LightGBM (LGB), whose biggest strength is being both accurate and fast.
Feature processing
Null value processing
Inspection shows that null values appear only in lan and osv.
```python
# String columns need to be converted to numeric values (LabelEncoder)
object_cols = train.select_dtypes(include='object').columns
# Count missing values per column
temp = train.isnull().sum()
# Columns with missing values: lan, osv
temp[temp > 0]
# Collect the feature columns for analysis
features = train.columns.tolist()
features.remove('label')
print(features)
```
Continuous and categorical values
Next, the continuous and categorical values are analyzed. The conclusion is that osv needs to be transformed, and fea_hash and fea1_hash need a preliminary treatment based on their character length.
```python
for feature in features:
    print(feature, train[feature].nunique())
```
osv processing method
```python
# Handling osv
def trans_osv(osv):
    global result
    osv = str(osv).replace(' ', '').replace('.', '').replace('Android_', '') \
                  .replace('Ten core 20 G_HD', '').replace('Android', '').replace('W', '')
    if osv == 'nan' or osv == 'GIONEE_YNGA':
        result = 810
    elif osv.count('-') > 0:
        result = int(osv.split('-')[0])
    elif osv == 'f073b_changxiang_v01_b1b8_20180915':
        result = 810
    elif osv == '%E6%B1%9F%E7%81%B5OS+50':
        result = 500
    else:
        result = int(osv)
    # Scale short version codes so that all results have three digits
    if result < 10:
        result = result * 100
    elif result < 100:
        result = result * 10
    return int(result)
```
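A few spot checks make the mapping concrete (the input strings are illustrative examples, not values taken from the dataset):

```python
# Illustrative spot checks of trans_osv (example inputs, not dataset values)
print(trans_osv('8.1.0'))        # '8.1.0' -> '810' -> 810
print(trans_osv('6.0'))          # '6.0' -> '60' -> scaled to 600
print(trans_osv(float('nan')))   # missing values fall back to 810
```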
Finally, apply the same transformations to both the training and test sets.
```python
# Feature screening
features = train[col]
# Construct fea_hash_len / fea1_hash_len features
features['fea_hash_len'] = features['fea_hash'].map(lambda x: len(str(x)))
features['fea1_hash_len'] = features['fea1_hash'].map(lambda x: len(str(x)))
# Question: why map very long fea_hash values to 0?
# If fea_hash is very long, set it to 0; otherwise keep its numeric value
features['fea_hash'] = features['fea_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
features['fea1_hash'] = features['fea1_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
features['osv'] = features['osv'].apply(trans_osv)

# The same processing for the test set
test_features = test[col]
test_features['fea_hash_len'] = test_features['fea_hash'].map(lambda x: len(str(x)))
test_features['fea1_hash_len'] = test_features['fea1_hash'].map(lambda x: len(str(x)))
test_features['fea_hash'] = test_features['fea_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
test_features['fea1_hash'] = test_features['fea1_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
test_features['osv'] = test_features['osv'].apply(trans_osv)
```
Modeling
LightGBM with default parameters is used for modeling; the final score is 88.094.
```python
# train['os'].value_counts()
# Train with LightGBM
import lightgbm as lgb

model = lgb.LGBMClassifier()
# Model training
model.fit(features.drop(['timestamp', 'version'], axis=1), train['label'])
result = model.predict(test_features.drop(['timestamp', 'version'], axis=1))
# features['version'].value_counts()

# Build the submission file
res = pd.DataFrame(test['sid'])
res['label'] = result
res.to_csv('./baseline.csv', index=False)
res
```
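The score above comes straight from the leaderboard. Before submitting, it can help to check a local hold-out split; a minimal sketch, assuming `features` and `train` from above and using accuracy as a rough proxy for the leaderboard metric:

```python
# Local hold-out check (sketch): assumes `features` and `train` from above
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import lightgbm as lgb

X = features.drop(['timestamp', 'version'], axis=1)
y = train['label']
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

clf = lgb.LGBMClassifier()
clf.fit(X_tr, y_tr)
print('hold-out accuracy:', accuracy_score(y_val, clf.predict(X_val)))
```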
Optimization direction
The schemes tried so far are listed below. For the version-by-version comparison, see the model scores at the end of this post; see the source code in the Gitee repository for details.
- Transform and use version
- Use timestamp in detail: derive year, month, day, hour, minute, weekday and time-diff features
- Add the difference between osv and version
- Encode and use lan
- Add screen aspect ratio, screen area and pixel-ratio features
- Use hand-tuned ("ancestral") LGB, XGB and other custom-parameter models
- Train the model with 5-fold cross-validation (see the sketch after this list)
- Fuse multiple models trained with 5-fold cross-validation
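A rough sketch of the 5-fold cross-training and fusion idea (parameters are illustrative placeholders, not the hand-tuned set used in the actual versions; assumes `features`, `test_features` and `train` from above):

```python
# 5-fold cross-training with prediction averaging (sketch, illustrative parameters)
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold

X = features.drop(['timestamp', 'version'], axis=1)
X_test = test_features.drop(['timestamp', 'version'], axis=1)
y = train['label']

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
test_prob = np.zeros(len(X_test))

for fold, (tr_idx, val_idx) in enumerate(skf.split(X, y)):
    clf = lgb.LGBMClassifier(n_estimators=1000, learning_rate=0.05)
    clf.fit(X.iloc[tr_idx], y.iloc[tr_idx], eval_set=[(X.iloc[val_idx], y.iloc[val_idx])])
    # Average the per-fold test probabilities as a simple fusion
    test_prob += clf.predict_proba(X_test)[:, 1] / skf.n_splits

pred = (test_prob > 0.5).astype(int)
```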
Deep learning
The deep learning approach uses Baidu PaddlePaddle as the base framework.
Feature processing
Data processing is roughly the same as in the machine learning part, but because a neural network is used, the processed data also needs to be normalized.
```python
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# Data loading
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
test = test.iloc[:, 1:]
train = train.iloc[:, 1:]
train

# Object-type columns: lan, os, osv, version, fea_hash
# Columns with missing values: lan, osv
features = train.columns.tolist()
features.remove('label')
print(features)

for feature in features:
    print(feature, train[feature].nunique())

# Data cleaning of osv
def osv_trans(x):
    x = str(x).replace('Android_', '').replace('Android ', '').replace('W', '')
    if str(x).find('.') > 0:
        temp_index1 = x.find('.')
        if x.find(' ') > 0:
            temp_index2 = x.find(' ')
        else:
            temp_index2 = len(x)
        if x.find('-') > 0:
            temp_index2 = x.find('-')
        result = x[0:temp_index1] + '.' + x[temp_index1 + 1:temp_index2].replace('.', '')
        try:
            return float(result)
        except:
            print(x + '#########')
            return 0
    try:
        return float(x)
    except:
        print(x + '#########')
        return 0

# train['osv'] => LabelEncoder ?
# Fill missing osv values with the mode, then clean
train['osv'].fillna('8.1.0', inplace=True)
train['osv'] = train['osv'].apply(osv_trans)
test['osv'].fillna('8.1.0', inplace=True)
test['osv'] = test['osv'].apply(osv_trans)

# Map lan to integer codes
# train['os'].value_counts()
train['lan'].value_counts()
train['lan'].value_counts().index
lan_map = {'zh-CN': 1, 'zh_CN': 2, 'Zh-CN': 3, 'zh-cn': 4, 'zh_CN_#Hans': 5, 'zh': 6, 'ZH': 7, 'cn': 8,
           'CN': 9, 'zh-HK': 10, 'tw': 11, 'TW': 12, 'zh-TW': 13, 'zh-MO': 14, 'en': 15, 'en-GB': 16,
           'en-US': 17, 'ko': 18, 'ja': 19, 'it': 20, 'mi': 21}
train['lan'] = train['lan'].map(lan_map)
test['lan'] = test['lan'].map(lan_map)
test['lan'].value_counts()

# Set missing lan values to 22
train['lan'].fillna(22, inplace=True)
test['lan'].fillna(22, inplace=True)

# Drop columns that do not enter the model
remove_list = ['os', 'sid']
col = features
for i in remove_list:
    col.remove(i)
col

# Convert the millisecond timestamp to a datetime
from datetime import datetime
train['timestamp'] = train['timestamp'].apply(lambda x: datetime.fromtimestamp(x / 1000))
test['timestamp'] = test['timestamp'].apply(lambda x: datetime.fromtimestamp(x / 1000))
test['timestamp']

# Clean the version field
def version_trans(x):
    if x == 'V3':
        return 3
    if x == 'v1':
        return 1
    if x == 'P_Final_6':
        return 6
    if x == 'V6':
        return 6
    if x == 'GA3':
        return 3
    if x == 'GA2':
        return 2
    if x == 'V2':
        return 2
    if x == '50':
        return 5
    return int(x)

train['version'] = train['version'].apply(version_trans)
test['version'] = test['version'].apply(version_trans)
train['version'] = train['version'].astype('int')
test['version'] = test['version'].astype('int')

# Feature screening
features = train[col]
# Construct fea_hash_len / fea1_hash_len features
features['fea_hash_len'] = features['fea_hash'].map(lambda x: len(str(x)))
features['fea1_hash_len'] = features['fea1_hash'].map(lambda x: len(str(x)))
# Question: why map very long fea_hash values to 0?
# If fea_hash is very long, set it to 0; otherwise keep its numeric value
features['fea_hash'] = features['fea_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
features['fea1_hash'] = features['fea1_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
features

# The same processing for the test set
test_features = test[col]
test_features['fea_hash_len'] = test_features['fea_hash'].map(lambda x: len(str(x)))
test_features['fea1_hash_len'] = test_features['fea1_hash'].map(lambda x: len(str(x)))
test_features['fea_hash'] = test_features['fea_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
test_features['fea1_hash'] = test_features['fea1_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
test_features

# Multi-scale time features extracted from timestamp (training set)
temp = pd.DatetimeIndex(features['timestamp'])
features['year'] = temp.year
features['month'] = temp.month
features['day'] = temp.day
features['week_day'] = temp.weekday  # Day of the week
features['hour'] = temp.hour
features['minute'] = temp.minute
# Time difference (in days) from the earliest request
start_time = features['timestamp'].min()
features['time_diff'] = features['timestamp'] - start_time
features['time_diff'] = features['time_diff'].dt.days + features['time_diff'].dt.seconds / 3600 / 24
features[['timestamp', 'year', 'month', 'day', 'week_day', 'hour', 'minute', 'time_diff']]

# The same time features for the test set, reusing start_time from the training set
temp = pd.DatetimeIndex(test_features['timestamp'])
test_features['year'] = temp.year
test_features['month'] = temp.month
test_features['day'] = temp.day
test_features['week_day'] = temp.weekday
test_features['hour'] = temp.hour
test_features['minute'] = temp.minute
test_features['time_diff'] = test_features['timestamp'] - start_time
test_features['time_diff'] = test_features['time_diff'].dt.days + test_features['time_diff'].dt.seconds / 3600 / 24
test_features['time_diff']

# Screen area feature
features['dev_height'].value_counts()
features['dev_width'].value_counts()
features['dev_area'] = features['dev_height'] * features['dev_width']
test_features['dev_area'] = test_features['dev_height'] * test_features['dev_width']

"""
Idea: dev_ppi and dev_area could be combined into new features, e.g.
features['dev_area'].astype('float') / features['dev_ppi'].astype('float')
"""
# features['ntt'].value_counts()
features['carrier'].value_counts()
features['package'].value_counts()

# version_osv: difference between the OS version (osv) and the app version
features['osv'].value_counts()
features['version_osv'] = features['osv'] - features['version']
test_features['version_osv'] = test_features['osv'] - test_features['version']

# Drop the raw timestamp
features = features.drop(['timestamp'], axis=1)
test_features = test_features.drop(['timestamp'], axis=1)

# Feature normalization
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
features1 = scaler.fit_transform(features)
test_features1 = scaler.transform(test_features)
```
Generate Dataset and Dataloader
```python
import numpy as np
import pandas as pd
import paddle
from paddle import nn
from paddle.io import Dataset, DataLoader

paddle.device.set_device('gpu:0')

# Custom dataset wrapping a feature DataFrame and a label column
class MineDataset(Dataset):
    def __init__(self, X, y):
        super(MineDataset, self).__init__()
        self.num_samples = len(X)
        self.X = X
        self.y = y

    def __getitem__(self, idx):
        return self.X.iloc[idx].values.astype('float32'), np.array(self.y.iloc[idx]).astype('int64')

    def __len__(self):
        return self.num_samples

from sklearn.model_selection import train_test_split

train_x, val_x, train_y, val_y = train_test_split(features1, train['label'], test_size=0.2, random_state=42)
train_x = pd.DataFrame(train_x, columns=features.columns)
val_x = pd.DataFrame(val_x, columns=features.columns)
train_y = pd.DataFrame(train_y, columns=['label'])
val_y = pd.DataFrame(val_y, columns=['label'])

train_dataloader = DataLoader(MineDataset(train_x, train_y), batch_size=1024, shuffle=True, drop_last=True, num_workers=2)
val_dataloader = DataLoader(MineDataset(val_x, val_y), batch_size=1024, shuffle=True, drop_last=True, num_workers=2)
# test_features1 is a numpy array, so wrap it in a DataFrame for MineDataset's .iloc indexing
test_dataloader = DataLoader(MineDataset(pd.DataFrame(test_features1, columns=features.columns),
                                         pd.Series([0 for i in range(len(test_features1))])),
                             batch_size=1024, shuffle=True, drop_last=True, num_workers=2)
```
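A quick way to sanity-check the loaders before training is to pull one mini-batch and look at the shapes:

```python
# Sanity check: one mini-batch from the training loader
for batch_x, batch_y in train_dataloader:
    print(batch_x.shape, batch_y.shape)  # roughly [1024, num_features] and [1024, 1]
    break
```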
Network construction
The first version of the network uses only simple fully connected layers. It is a tower-shaped structure that narrows from 250 units down to 2, and each linear layer is followed by a ReLU activation and a dropout layer.
```python
class ClassifyModel(nn.Layer):
    def __init__(self, features_len):
        super(ClassifyModel, self).__init__()
        self.fc1 = nn.layer.Linear(in_features=features_len, out_features=250)
        self.ac1 = nn.layer.ReLU()
        self.drop1 = nn.layer.Dropout(p=0.02)
        self.fc2 = nn.layer.Linear(in_features=250, out_features=100)
        self.ac2 = nn.layer.ReLU()
        self.drop2 = nn.layer.Dropout(p=0.02)
        self.fc3 = nn.layer.Linear(in_features=100, out_features=50)
        self.ac3 = nn.layer.ReLU()
        self.drop3 = nn.layer.Dropout(p=0.02)
        self.fc4 = nn.layer.Linear(in_features=50, out_features=25)
        self.ac4 = nn.layer.ReLU()
        self.drop4 = nn.layer.Dropout(p=0.02)
        self.fc5 = nn.layer.Linear(in_features=25, out_features=2)
        self.out = nn.layer.Sigmoid()

    def forward(self, input):
        x = self.fc1(input)
        x = self.ac1(x)
        x = self.drop1(x)
        x = self.fc2(x)
        x = self.ac2(x)
        x = self.drop2(x)
        x = self.fc3(x)
        x = self.ac3(x)
        x = self.drop3(x)
        x = self.fc4(x)
        x = self.ac4(x)
        x = self.drop4(x)
        x = self.fc5(x)
        output = self.out(x)
        return output
```
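To confirm the layer sizes before training, the model can be instantiated and inspected with paddle.summary (a quick check only; it uses the same feature count as the training code below):

```python
# Inspect the tower structure; input_size includes the batch dimension
model = ClassifyModel(int(len(features.columns)))
paddle.summary(model, input_size=(1024, len(features.columns)))
```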
Network training
```python
# Initialize the model
model = ClassifyModel(int(len(features.columns)))
# Training mode
model.train()
# Define the optimizer and loss
opt = paddle.optimizer.AdamW(learning_rate=0.001, parameters=model.parameters())
loss_fn = nn.CrossEntropyLoss()

EPOCHS = 10  # Number of outer epochs
for epoch in range(EPOCHS):
    for iter_id, mini_batch in enumerate(train_dataloader):
        x_train = mini_batch[0]
        y_train = mini_batch[1]
        # Forward pass
        y_pred = model(x_train)
        # Compute the loss
        loss = nn.functional.loss.cross_entropy(y_pred, y_train)
        avg_loss = paddle.mean(loss)
        if iter_id % 20 == 0:
            acc = paddle.metric.accuracy(y_pred, y_train)
            print("epoch: {}, iter: {}, loss is: {}, acc is: {}".format(epoch, iter_id, avg_loss.numpy(), acc.numpy()))
        # Back propagation
        avg_loss.backward()
        # Update parameters and clear gradients
        opt.step()
        opt.clear_grad()
```
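The loop above only trains; it never evaluates on the validation split or produces a submission file. A minimal sketch of both steps, assuming `model`, `val_dataloader`, `test_features1` and `test` from above (the output file name is just an example):

```python
# Evaluate on the validation split and predict the test set (sketch)
model.eval()
accs = []
with paddle.no_grad():
    for x_val, y_val in val_dataloader:
        accs.append(float(paddle.metric.accuracy(model(x_val), y_val)))
print('validation accuracy:', sum(accs) / len(accs))

# Test-set prediction in the original row order (the shuffled test loader above is not used here)
test_x = paddle.to_tensor(np.asarray(test_features1, dtype='float32'))
pred = paddle.argmax(model(test_x), axis=1).numpy()
sub = pd.DataFrame({'sid': test['sid'], 'label': pred})
sub.to_csv('./paddle_baseline.csv', index=False)  # example output name
```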
Optimization ideas
Again, for space reasons, the following two schemes are only listed here; refer to the source code in the Gitee repository:
- Wide & Deep with embeddings (a rough sketch follows this list)
- DeepFM, built on FM
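The full implementations live in the repository; purely as an illustration of the Wide & Deep idea, here is a rough Paddle sketch (layer sizes and embedding dimensions are made up, and the repository code may differ):

```python
import paddle
from paddle import nn

class WideDeep(nn.Layer):
    """Sketch only: dense features feed both a linear 'wide' head and, together
    with embeddings of the categorical fields, a small 'deep' MLP."""
    def __init__(self, dense_dim, cat_sizes, emb_dim=8):
        super(WideDeep, self).__init__()
        # One embedding table per categorical field; cat_sizes holds vocabulary sizes
        self.embs = nn.LayerList([nn.Embedding(n, emb_dim) for n in cat_sizes])
        deep_in = dense_dim + emb_dim * len(cat_sizes)
        self.deep = nn.Sequential(
            nn.Linear(deep_in, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
        )
        self.wide = nn.Linear(dense_dim, 2)
        self.head = nn.Linear(64, 2)

    def forward(self, dense_x, cat_x):
        # cat_x: int64 tensor of shape [batch, num_cat_fields]
        emb = paddle.concat([emb(cat_x[:, i]) for i, emb in enumerate(self.embs)], axis=1)
        deep_out = self.head(self.deep(paddle.concat([dense_x, emb], axis=1)))
        # Sum the wide and deep logits; softmax / cross-entropy is applied outside
        return deep_out + self.wide(dense_x)
```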
Model score results of each version
Category | Model | Details | Score |
---|---|---|---|
ML | ML v1 | 1. Preliminary modeling 2. Features excluded from modeling: ['os', 'version', 'lan', 'sid'] 3. Default-parameter LGB | 88.094 |
 | ML v2 | 1. Based on v1 2. Introduce version and apply a simple transformation to timestamp 3. Test default-parameter LGB and XGB | 88.2133 |
 | ML v3 | 1. Based on v2 2. Introduce lan 3. Take the difference between osv and version 4. Hand-tuned LGB parameters | 88.9487 |
 | ML v4 | 1. Based on v3 2. 5-fold LGB 3. 5-fold XGB 4. Ensemble | 89.0293 / 89.0253 / 89.054 |
 | ML v5 | 1. Based on v3 2. Add aspect-ratio, screen-area and pixel-ratio features 3. 5-fold LGB 4. 5-fold XGB 5. Ensemble | 89.1873 / 89.108 / 89.1713 |
Paddle | Paddle v1 | 1. Feature engineering based on ML v3 2. Simple network built with Paddle | Results not uploaded |
 | Paddle v2 | 1. Based on v1 2. Add embedding dictionary creation (in the embedding-analysis notebook) 3. Basic model combined with embeddings | 88.71 |
 | Paddle v3 | 1. Based on v2 2. Add a DeepFM component and merge it | 87.816 |
TensorFlow | TF v1 | 1. Feature engineering based on ML v3 2. Simple network built with TensorFlow | Results not uploaded |
FM | FM v1 | 1. First simple model based on FM | 57.2147 |
Final ranking score