[Baidu AI Studio] MarTech Challenge: Click Anti-Fraud Prediction

Background

Advertising fraud is one of the major challenges facing digital marketing. Click fraud wastes large amounts of advertisers' money and distorts click data. The competition provides about 500,000 click records. Note: the organizers simulated the data, hid the meaning of some features, and desensitized the values.

The task is to predict whether a user's click is normal or fraudulent. Click-fraud prediction applies to all kinds of feed ads, banner ads, and the Baidu Union platform, helping merchants identify click fraud and reach accurate, real users.

  • Competition address: https://aistudio.baidu.com/aistudio/competition/detail/52/0/introduction
  • Competition dataset: https://download.csdn.net/download/turkeym4/72338032#

Data and tasks

The competition provides 500,000 training records and 150,000 test records. The goal is to predict whether each record is a fraudulent click.

field        type    explanation
sid          string  sample ID / request session ID
package      string  media information: package name (encrypted)
version      string  media information: app version
android_id   string  media information: external ad-slot ID (encrypted)
media_id     string  media information: external media ID (encrypted)
apptype      int     media information: app category
timestamp    bigint  time the request reached the server, in ms
location     int     user geolocation code (city level)
fea_hash     int     user feature code (physical meaning withheld)
fea1_hash    int     user feature code (physical meaning withheld)
cus_type     int     user feature code (physical meaning withheld)
ntt          int     network type: 0 unknown, 1 wired, 2 Wi-Fi, 3 cellular, 4 2G, 5 3G, 6 4G
carrier      string  carrier: 0 unknown, 46000 China Mobile, 46001 China Unicom, 46003 China Telecom
os           string  operating system (android by default)
osv          string  operating system version
lan          string  device language (Chinese by default)
dev_height   int     device screen height
dev_width    int     device screen width
dev_ppi      int     screen pixel density (PPI)
label        int     whether the click is fraudulent

The label field tells us this is a binary classification task. It can be solved with classic machine-learning algorithms or with a neural network such as an MLP.

Solution approach

The solution can be divided into two parts:

  • Binary classification with gradient-boosted trees: LightGBM / XGBoost / CatBoost
  • Binary classification with deep learning: MLP / Wide & Deep / DeepFM

The general modeling schemes are outlined below; see the source code in the Gitee repository for details.

Machine learning

Machine learning here boils down to feature engineering plus hand-me-down ("ancestral") hyperparameters. To ship a first baseline quickly, we usually start with LightGBM (LGB), whose biggest strength is being both accurate and fast.

Feature processing

Null-value handling
Inspection shows that null values appear in lan and osv.

# String columns need to be converted to numeric values later (e.g. with a LabelEncoder)
object_cols = train.select_dtypes(include='object').columns

# Count missing values per column
temp = train.isnull().sum()
# Columns with missing values: lan, osv
temp[temp > 0]
# Collect the candidate feature columns (everything except the label)
features = train.columns.tolist()
features.remove('label')
print(features)
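The comment above mentions a LabelEncoder, but the snippet never applies it. A minimal sketch of that step (an assumption about how the repository handles it, fitting each encoder on the union of train and test so unseen test categories do not break transform) could be:

import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Encode every object-typed column with its own LabelEncoder,
# fitted on the combined train + test values
for col_name in object_cols:
    le = LabelEncoder()
    le.fit(pd.concat([train[col_name], test[col_name]]).astype(str))
    train[col_name] = le.transform(train[col_name].astype(str))
    test[col_name] = le.transform(test[col_name].astype(str))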

Continuous vs. categorical values
Next, the continuous and categorical features are examined. The conclusion: osv needs to be transformed, and fea_hash and fea1_hash need preliminary character-length processing.

# Print the cardinality of each feature to separate continuous from categorical
for feature in features:
    print(feature, train[feature].nunique())
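As a hedged illustration (the threshold is an assumption, not the author's exact rule), the printout can be turned into an automatic split:

# Heuristic split by cardinality; 100 is an illustrative threshold
cat_cols = [f for f in features if train[f].nunique() < 100]
num_cols = [f for f in features if f not in cat_cols]
print('categorical:', cat_cols)
print('continuous :', num_cols)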

osv processing

# Clean osv and convert the version string into a three-digit integer code
def trans_osv(osv):
    # Strip whitespace, dots, and known noise tokens
    osv = str(osv).replace(' ','').replace('.','').replace('Android_','').replace('Ten core 20 G_HD','').replace('Android','').replace('W','')
    if osv == 'nan' or osv == 'GIONEE_YNGA':
        result = 810  # fall back to the most common version (8.1.0)
    elif osv.count('-') > 0:
        result = int(osv.split('-')[0])
    elif osv == 'f073b_changxiang_v01_b1b8_20180915':
        result = 810
    elif osv == '%E6%B1%9F%E7%81%B5OS+50':
        result = 500
    else:
        result = int(osv)

    # Normalize to three digits: 8 -> 800, 81 -> 810
    if result < 10:
        result = result * 100
    elif result < 100:
        result = result * 10

    return int(result)

Finally, apply the same transformations to both the training set and the test set.

# Feature selection: col is the list of chosen feature columns
features = train[col].copy()
# Construct the fea_hash_len / fea1_hash_len features
features['fea_hash_len'] = features['fea_hash'].map(lambda x: len(str(x)))
features['fea1_hash_len'] = features['fea1_hash'].map(lambda x: len(str(x)))
# Why map a very long fea_hash to 0?
# If fea_hash is longer than 16 characters it is an opaque hash, so set it to 0;
# otherwise keep its integer value
features['fea_hash'] = features['fea_hash'].map(lambda x: 0 if len(str(x))>16 else int(x))
features['fea1_hash'] = features['fea1_hash'].map(lambda x: 0 if len(str(x))>16 else int(x))
features['osv'] = features['osv'].apply(trans_osv)


test_features = test[col].copy()
# Construct the fea_hash_len / fea1_hash_len features
test_features['fea_hash_len'] = test_features['fea_hash'].map(lambda x: len(str(x)))
test_features['fea1_hash_len'] = test_features['fea1_hash'].map(lambda x: len(str(x)))
# Same rule as above: hashes longer than 16 characters become 0
test_features['fea_hash'] = test_features['fea_hash'].map(lambda x: 0 if len(str(x))>16 else int(x))
test_features['fea1_hash'] = test_features['fea1_hash'].map(lambda x: 0 if len(str(x))>16 else int(x))
test_features['osv'] = test_features['osv'].apply(trans_osv)

Modeling

LightGBM with default parameters is used for modeling; the final score is 88.094.

# Train with LightGBM
import pandas as pd
import lightgbm as lgb

model = lgb.LGBMClassifier()
# Model training (timestamp and version are excluded in this first version)
model.fit(features.drop(['timestamp', 'version'], axis=1), train['label'])
result = model.predict(test_features.drop(['timestamp', 'version'], axis=1))
# Build the submission file
res = pd.DataFrame(test['sid'])
res['label'] = result
res.to_csv('./baseline.csv', index=False)
res
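The baseline above submits without any local check. A quick hold-out validation (a sketch, assuming scikit-learn; the split and metric are illustrative choices, not from the original post) helps estimate the score before uploading:

from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score

X = features.drop(['timestamp', 'version'], axis=1)
# Hold out 20% of the training data for validation
X_tr, X_val, y_tr, y_val = train_test_split(X, train['label'], test_size=0.2, random_state=42)
val_model = lgb.LGBMClassifier()
val_model.fit(X_tr, y_tr)
print('validation F1:', f1_score(y_val, val_model.predict(X_val)))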

Optimization directions

The following optimizations were implemented; see the version-by-version comparison in the model results at the end of this post, and the Gitee repository for the source code.

  1. Convert version to a numeric value and use it
  2. Use timestamp in detail: add year, month, day, hour, minute, weekday, and time-diff features
  3. Add the difference between osv and version
  4. Encode lan and bring it into the model
  5. Add screen ratio, screen area, and pixel-ratio features
  6. Use hand-tuned ("ancestral") LGB, XGB, and other custom-parameter models
  7. Train the model with 5-fold cross-validation (see the sketch after this list)
  8. Fuse multiple models trained with 5-fold cross-validation
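For items 7 and 8, a minimal 5-fold LightGBM sketch (the parameters are illustrative placeholders, not the competition's exact "ancestral" values):

import numpy as np
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold

X = features.drop(['timestamp', 'version'], axis=1)
y = train['label']
X_test = test_features.drop(['timestamp', 'version'], axis=1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2021)
test_prob = np.zeros(len(X_test))
for fold, (trn_idx, val_idx) in enumerate(skf.split(X, y)):
    fold_model = lgb.LGBMClassifier(n_estimators=2000, learning_rate=0.05)
    fold_model.fit(X.iloc[trn_idx], y.iloc[trn_idx],
                   eval_set=[(X.iloc[val_idx], y.iloc[val_idx])],
                   callbacks=[lgb.early_stopping(100, verbose=False)])
    # Average the per-fold test probabilities
    test_prob += fold_model.predict_proba(X_test)[:, 1] / skf.n_splits

res['label'] = (test_prob > 0.5).astype(int)

Blending XGBoost folds the same way, then averaging the probabilities, gives the multi-model fusion of item 8.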

Deep learning

The deep-learning approach uses Baidu's PaddlePaddle as the base framework.

Feature processing

The data-processing module is roughly the same as in the machine-learning pipeline, but because a neural network is used, the processed features must additionally be normalized.

import pandas as pd
import warnings

warnings.filterwarnings('ignore')

# Data loading
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
# Drop the leading index column from both sets
test = test.iloc[:, 1:]
train = train.iloc[:, 1:]
train

# Object-type columns: lan, os, osv, version, fea_hash
# Columns with missing values: lan, osv

# Candidate feature columns: everything except the label
features = train.columns.tolist()
features.remove('label')
print(features)

# Print the cardinality of each feature
for feature in features:
    print(feature, train[feature].nunique())


# Data cleaning for osv: keep "major.minor" as a float
def osv_trans(x):
    # Strip known prefixes and noise
    x = str(x).replace('Android_', '').replace('Android ', '').replace('W', '')
    if str(x).find('.') > 0:
        temp_index1 = x.find('.')
        # The version substring ends at the first space or dash, if any
        if x.find(' ') > 0:
            temp_index2 = x.find(' ')
        else:
            temp_index2 = len(x)

        if x.find('-') > 0:
            temp_index2 = x.find('-')

        # Keep the first dot and drop the rest: '8.1.0' -> '8.10'
        result = x[0:temp_index1] + '.' + x[temp_index1 + 1:temp_index2].replace('.', '')
        try:
            return float(result)
        except ValueError:
            print(x + ' could not be parsed')
            return 0
    try:
        return float(x)
    except ValueError:
        print(x + ' could not be parsed')
        return 0


# Fill missing osv values with the mode ('8.1.0'), then clean
train['osv'].fillna('8.1.0', inplace=True)
train['osv'] = train['osv'].apply(osv_trans)

test['osv'].fillna('8.1.0', inplace=True)
test['osv'] = test['osv'].apply(osv_trans)

# Map the many spellings of the device language to integer codes
train['lan'].value_counts()
lan_map = {'zh-CN': 1, 'zh_CN': 2, 'Zh-CN': 3, 'zh-cn': 4, 'zh_CN_#Hans': 5, 'zh': 6, 'ZH': 7, 'cn': 8, 'CN': 9,
           'zh-HK': 10, 'tw': 11, 'TW': 12, 'zh-TW': 13, 'zh-MO': 14, 'en': 15, 'en-GB': 16, 'en-US': 17, 'ko': 18,
           'ja': 19, 'it': 20, 'mi': 21}
train['lan'] = train['lan'].map(lan_map)
test['lan'] = test['lan'].map(lan_map)
test['lan'].value_counts()

# Values not in the map (and missing values) become NaN; code them as 22
train['lan'].fillna(22, inplace=True)
test['lan'].fillna(22, inplace=True)

# Build the final column list: all features except os and sid
remove_list = ['os', 'sid']
col = features.copy()
for i in remove_list:
    col.remove(i)
col

# Convert timestamp from epoch milliseconds to datetime
# (raw values like 1559892728241 correspond to early June 2019)
from datetime import datetime

train['timestamp'] = train['timestamp'].apply(lambda x: datetime.fromtimestamp(x / 1000))
test['timestamp'] = test['timestamp'].apply(lambda x: datetime.fromtimestamp(x / 1000))
test['timestamp']


# Map irregular version strings to integers
def version_trans(x):
    if x == 'V3':
        return 3
    if x == 'v1':
        return 1
    if x == 'P_Final_6':
        return 6
    if x == 'V6':
        return 6
    if x == 'GA3':
        return 3
    if x == 'GA2':
        return 2
    if x == 'V2':
        return 2
    if x == '50':
        return 5
    return int(x)


train['version'] = train['version'].apply(version_trans)
test['version'] = test['version'].apply(version_trans)
train['version'] = train['version'].astype('int')
test['version'] = test['version'].astype('int')

# Feature selection
features = train[col].copy()
# Construct the fea_hash_len / fea1_hash_len features
features['fea_hash_len'] = features['fea_hash'].map(lambda x: len(str(x)))
features['fea1_hash_len'] = features['fea1_hash'].map(lambda x: len(str(x)))
# If fea_hash is longer than 16 characters it is an opaque hash, so set it to 0;
# otherwise keep its integer value
features['fea_hash'] = features['fea_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
features['fea1_hash'] = features['fea1_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
features

test_features = test[col].copy()
# Same construction for the test set
test_features['fea_hash_len'] = test_features['fea_hash'].map(lambda x: len(str(x)))
test_features['fea1_hash_len'] = test_features['fea1_hash'].map(lambda x: len(str(x)))
test_features['fea_hash'] = test_features['fea_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
test_features['fea1_hash'] = test_features['fea1_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
test_features



# Extract multi-scale time features from the training-set timestamp
temp = pd.DatetimeIndex(features['timestamp'])
features['year'] = temp.year
features['month'] = temp.month
features['day'] = temp.day
features['week_day'] = temp.weekday  # day of the week
features['hour'] = temp.hour
features['minute'] = temp.minute

# Time difference from the earliest timestamp, in (fractional) days
start_time = features['timestamp'].min()
features['time_diff'] = features['timestamp'] - start_time
features['time_diff'] = features['time_diff'].dt.days + features['time_diff'].dt.seconds / 3600 / 24
features[['timestamp', 'year', 'month', 'day', 'week_day', 'hour', 'minute', 'time_diff']]

# Same for the test set, measured against the training-set start time
temp = pd.DatetimeIndex(test_features['timestamp'])
test_features['year'] = temp.year
test_features['month'] = temp.month
test_features['day'] = temp.day
test_features['week_day'] = temp.weekday  # day of the week
test_features['hour'] = temp.hour
test_features['minute'] = temp.minute

test_features['time_diff'] = test_features['timestamp'] - start_time
test_features['time_diff'] = test_features['time_diff'].dt.days + test_features['time_diff'].dt.seconds / 3600 / 24
test_features['time_diff']

# Construct a screen-area feature from height and width
features['dev_area'] = features['dev_height'] * features['dev_width']
test_features['dev_area'] = test_features['dev_height'] * test_features['dev_width']
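# Sketch (an assumption, not in the original code): the optimization list also
# mentions screen-ratio features, which can be built the same way; the +1 in the
# denominator guards against zero widths in the raw data
features['dev_ratio'] = features['dev_height'] / (features['dev_width'] + 1)
test_features['dev_ratio'] = test_features['dev_height'] / (test_features['dev_width'] + 1)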

"""
Thinking: Is it available dev_ppi and dev_area New structural features
features['dev_ppi'].value_counts()
features['dev_area'].astype('float') / features['dev_ppi'].astype('float')
"""
# features['ntt'].value_counts()
features['carrier'].value_counts()
features['package'].value_counts()
# Version - difference between OSV app version and operating system version
features['osv'].value_counts()
features['version_osv'] = features['osv'] - features['version']
test_features['version_osv'] = test_features['osv'] - test_features['version']

# Drop the raw timestamp now that the derived features exist
features = features.drop(['timestamp'], axis=1)
test_features = test_features.drop(['timestamp'], axis=1)

# Feature normalization: fit the scaler on train, apply the same transform to test
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
features1 = scaler.fit_transform(features)
test_features1 = scaler.transform(test_features)

Generate Dataset and Dataloader

import paddle
from paddle import nn
from paddle.io import Dataset, DataLoader
import numpy as np
paddle.device.set_device('gpu:0')

# Custom dataset
class MineDataset(Dataset):
    def __init__(self, X, y):
        super(MineDataset, self).__init__()
        self.num_samples = len(X)
        self.X = X
        self.y = y

    def __getitem__(self, idx):
        return self.X.iloc[idx].values.astype('float32'), np.array(self.y.iloc[idx]).astype('int64')

    def __len__(self):
        return self.num_samples

from sklearn.model_selection import train_test_split

# 80/20 train/validation split; wrap the scaled arrays back into DataFrames
# so MineDataset's .iloc indexing works
train_x, val_x, train_y, val_y = train_test_split(features1, train['label'], test_size=0.2, random_state=42)

train_x = pd.DataFrame(train_x, columns=features.columns)
val_x = pd.DataFrame(val_x, columns=features.columns)
train_y = pd.DataFrame(train_y, columns=['label'])
val_y = pd.DataFrame(val_y, columns=['label'])


train_dataloader = DataLoader(MineDataset(train_x, train_y),
                              batch_size=1024,
                              shuffle=True,
                              drop_last=True,
                              num_workers=2)

val_dataloader = DataLoader(MineDataset(val_x, val_y),
                            batch_size=1024,
                            shuffle=True,
                            drop_last=True,
                            num_workers=2)

# The test loader must keep order and keep every sample: no shuffling, no dropped
# batches, and the scaled numpy array wrapped back into a DataFrame for .iloc
test_x = pd.DataFrame(test_features1, columns=features.columns)
test_y = pd.Series([0 for _ in range(len(test_x))])  # dummy labels
test_dataloader = DataLoader(MineDataset(test_x, test_y),
                             batch_size=1024,
                             shuffle=False,
                             drop_last=False,
                             num_workers=2)

Network construction

The first version of the network uses only simple fully connected layers. The structure is a funnel shape, narrowing from 250 units down to 2, and each linear layer is followed by ReLU and dropout layers.

class ClassifyModel(nn.Layer):

    def __init__(self, features_len):
        super(ClassifyModel, self).__init__()

        self.fc1 = nn.layer.Linear(in_features=features_len, out_features=250)
        self.ac1 = nn.layer.ReLU()
        self.drop1 = nn.layer.Dropout(p=0.02)

        self.fc2 = nn.layer.Linear(in_features=250, out_features=100)
        self.ac2 = nn.layer.ReLU()
        self.drop2 = nn.layer.Dropout(p=0.02)

        self.fc3 = nn.layer.Linear(in_features=100, out_features=50)
        self.ac3 = nn.layer.ReLU()
        self.drop3 = nn.layer.Dropout(p=0.02)

        self.fc4 = nn.layer.Linear(in_features=50, out_features=25)
        self.ac4 = nn.layer.ReLU()
        self.drop4 = nn.layer.Dropout(p=0.02)

        self.fc5 = nn.layer.Linear(in_features=25, out_features=2)
        # Note: paddle's cross_entropy applies softmax internally, so squashing
        # the logits with Sigmoid here is not strictly necessary
        self.out = nn.layer.Sigmoid()

    def forward(self, input):
        x = self.fc1(input)
        x = self.ac1(x)
        x = self.drop1(x)

        x = self.fc2(x)
        x = self.ac2(x)
        x = self.drop2(x)

        x = self.fc3(x)
        x = self.ac3(x)
        x = self.drop3(x)

        x = self.fc4(x)
        x = self.ac4(x)
        x = self.drop4(x)

        x = self.fc5(x)
        output = self.out(x)
        return output

Network training

# Initialize the model
model = ClassifyModel(int(len(features.columns)))
# Switch to training mode
model.train()
# Define optimizer and loss
opt = paddle.optimizer.AdamW(learning_rate=0.001, parameters=model.parameters())
loss_fn = nn.CrossEntropyLoss()

EPOCHS = 10   # number of training epochs
for epoch in range(EPOCHS):
    for iter_id, mini_batch in enumerate(train_dataloader):
        x_train = mini_batch[0]
        y_train = mini_batch[1]
        # Forward pass
        y_pred = model(x_train)
        # Compute loss
        loss = loss_fn(y_pred, y_train)
        avg_loss = paddle.mean(loss)
        # Log every 20 iterations
        if iter_id % 20 == 0:
            acc = paddle.metric.accuracy(y_pred, y_train)
            print("epoch: {}, iter: {}, loss is: {}, acc is: {}".format(epoch, iter_id, avg_loss.numpy(), acc.numpy()))

        # Backward pass
        avg_loss.backward()
        # Update parameters, then clear gradients
        opt.step()
        opt.clear_grad()
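The post stops at training; a minimal inference sketch (an assumption, using the non-shuffled test loader fixed above) to produce a submission could be:

# Switch to eval mode and predict the test set in order
model.eval()
preds = []
with paddle.no_grad():
    for x_batch, _ in test_dataloader:
        logits = model(x_batch)
        preds.append(paddle.argmax(logits, axis=1).numpy())
preds = np.concatenate(preds)

submission = pd.DataFrame({'sid': test['sid'], 'label': preds})
submission.to_csv('./paddle_baseline.csv', index=False)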

Optimization ideas

Similarly, for space reasons, the following two schemes are left to the source code in the Gitee repository:

  1. Wide & Deep with embeddings (a structural sketch follows this list)
  2. DeepFM, built on FM
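As a hedged structural sketch only (not the repository's actual implementation): a Wide & Deep model combines a linear "wide" part with an MLP "deep" part by summing their logits; DeepFM swaps the wide part for an FM layer over feature embeddings.

import paddle
from paddle import nn

class WideDeep(nn.Layer):
    """Minimal Wide & Deep sketch: wide linear logits + deep MLP logits."""
    def __init__(self, num_features):
        super(WideDeep, self).__init__()
        # Wide part: a single linear layer over the raw features
        self.wide = nn.Linear(num_features, 2)
        # Deep part: a small funnel MLP, as in the model above
        self.deep = nn.Sequential(
            nn.Linear(num_features, 250), nn.ReLU(),
            nn.Linear(250, 50), nn.ReLU(),
            nn.Linear(50, 2),
        )

    def forward(self, x):
        # Sum the two logit streams; cross_entropy applies softmax itself
        return self.wide(x) + self.deep(x)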

Model scores by version

ML version 1 (score: 88.094)
  1. Preliminary modeling
  2. Features excluded from modeling: ['os', 'version', 'lan', 'sid']
  3. LGB with default parameters

ML version 2 (score: 88.2133)
  1. Based on version 1
  2. Introduce version and apply a simple timestamp conversion
  3. Test LGB and XGB with default parameters

ML version 3 (score: 88.9487)
  1. Based on version 2
  2. Introduce lan
  3. Take the difference between osv and version
  4. Hand-tuned ("ancestral") LGB parameters

ML version 4 (scores: 89.0293 / 89.0253 / 89.054)
  1. Based on version 3
  2. 5-fold LGB
  3. 5-fold XGB
  4. Ensemble fusion

ML version 5 (scores: 89.1873 / 89.108 / 89.1713)
  1. Based on version 3
  2. Add pixel ratio, pixel size, and pixel-resolution ratio
  3. 5-fold LGB
  4. 5-fold XGB
  5. Ensemble fusion

PaddlePaddle version 1 (result not submitted)
  1. Feature engineering from ML version 3
  2. A simple network built on Paddle

PaddlePaddle version 2 (score: 88.71)
  1. Based on version 1
  2. Add embedding-dictionary creation (in the embedding-analysis notebook)
  3. Hybrid base model built on the embeddings

PaddlePaddle version 3 (score: 87.816)
  1. Based on version 2
  2. Add DeepFM sub-models and merge them

TensorFlow version 1 (result not submitted)
  1. Feature engineering from ML version 3
  2. A simple network built on TensorFlow

FM version 1 (score: 57.2147)
  1. First simple modeling based on an FM model

Final ranking score

(final leaderboard screenshot from the original post)
