Background
Advertising fraud is one of the major challenges facing digital marketing. Click fraud wastes advertisers' money and distorts click data. The competition provides roughly 500,000 click records. Note: the organizers simulated the data, hid the meaning of some features, and desensitized the values.
The task is to predict whether a user's click is normal or fraudulent. Click-fraud prediction applies to all kinds of feed ads, banner ads and the Baidu Union advertising platform, helping businesses identify click fraud and reach accurate, real users.
- Competition address: https://aistudio.baidu.com/aistudio/competition/detail/52/0/introduction
- Competition dataset: https://download.csdn.net/download/turkeym4/72338032#
Data and tasks
The competition provides 500,000 training records and 150,000 test records. The goal is to predict whether each record is a fraudulent click.
Field | Type | Description |
---|---|---|
sid | string | Sample id / request session sid |
package | string | Media information, package name (encrypted) |
version | string | Media information, app version |
android_id | string | Media information, external ad slot ID (encrypted) |
media_id | string | Media information, external media ID (encrypted) |
apptype | int | Media information, app category |
timestamp | bigint | Request arrival service time, in ms |
location | int | User geolocation code (accurate to city) |
fea_hash | int | User characteristic code (specific physical meaning is omitted) |
fea1_hash | int | User characteristic code (specific physical meaning is omitted) |
cus_type | int | User characteristic code (specific physical meaning is omitted) |
ntt | int | Network type: 0-unknown, 1-wired, 2-WiFi, 3-cellular, 4-2G, 5-3G, 6-4G |
carrier | string | Carrier used by the device: 0-unknown, 46000-China Mobile, 46001-China Unicom, 46003-China Telecom |
os | string | Operating system, android by default |
osv | string | Operating system version |
lan | string | Device language, Chinese by default |
dev_height | int | Device height |
dev_width | int | Device width |
dev_ppi | int | Screen resolution (ppi) |
label | int | Whether the click is fraudulent |
The label field tells us this is a binary classification task, which can be solved with classical machine learning algorithms or with an MLP.
Solution approach
The solution is divided into two parts:
- Binary classification with machine learning algorithms: LightGBM / XGBoost / CatBoost
- Binary classification with deep learning: MLP / Wide & Deep / DeepFM
The overall modeling scheme is outlined below; see the source code in the Gitee repository for details.
Machine learning
Machine learning here boils down to feature engineering plus well-worn ("ancestral") parameters. To get a first Baseline out quickly, we usually start with LightGBM (LGB), whose biggest strength is being both accurate and fast.
Feature processing
Null value processing
Inspection shows that null values appear only in lan and osv.
```python
# String columns need to be converted to numeric values (LabelEncoder)
object_cols = train.select_dtypes(include='object').columns
# Count missing values per column
temp = train.isnull().sum()
# Columns with missing values: lan, osv
temp[temp > 0]
# Collect the feature columns for analysis
features = train.columns.tolist()
features.remove('label')
print(features)
```
Continuous and categorical values
Next, the continuous and categorical values are analyzed. The conclusion is that osv needs to be transformed, and fea_hash and fea1_hash need a preliminary treatment based on their character length.
```python
for feature in features:
    print(feature, train[feature].nunique())
```
osv processing method
```python
# Handling osv
def trans_osv(osv):
    global result
    osv = str(osv).replace(' ', '').replace('.', '').replace('Android_', '') \
                  .replace('Ten core 20 G_HD', '').replace('Android', '').replace('W', '')
    if osv == 'nan' or osv == 'GIONEE_YNGA':
        result = 810
    elif osv.count('-') > 0:
        result = int(osv.split('-')[0])
    elif osv == 'f073b_changxiang_v01_b1b8_20180915':
        result = 810
    elif osv == '%E6%B1%9F%E7%81%B5OS+50':
        result = 500
    else:
        result = int(osv)
    # Scale short version codes so that all results have three digits
    if result < 10:
        result = result * 100
    elif result < 100:
        result = result * 10
    return int(result)
```
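A few spot checks make the mapping concrete (the input strings are illustrative examples, not values taken from the dataset):

```python
# Illustrative spot checks of trans_osv (example inputs, not dataset values)
print(trans_osv('8.1.0'))        # '8.1.0' -> '810' -> 810
print(trans_osv('6.0'))          # '6.0' -> '60' -> scaled to 600
print(trans_osv(float('nan')))   # missing values fall back to 810
```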
Finally, apply the same transformations to both the training and test sets.
```python
# Feature screening
features = train[col]
# Construct fea_hash_len / fea1_hash_len features
features['fea_hash_len'] = features['fea_hash'].map(lambda x: len(str(x)))
features['fea1_hash_len'] = features['fea1_hash'].map(lambda x: len(str(x)))
# Question: why map very long fea_hash values to 0?
# If fea_hash is very long, set it to 0; otherwise keep its numeric value
features['fea_hash'] = features['fea_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
features['fea1_hash'] = features['fea1_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
features['osv'] = features['osv'].apply(trans_osv)

# The same processing for the test set
test_features = test[col]
test_features['fea_hash_len'] = test_features['fea_hash'].map(lambda x: len(str(x)))
test_features['fea1_hash_len'] = test_features['fea1_hash'].map(lambda x: len(str(x)))
test_features['fea_hash'] = test_features['fea_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
test_features['fea1_hash'] = test_features['fea1_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
test_features['osv'] = test_features['osv'].apply(trans_osv)
```
Modeling
LightGBM with default parameters is used for modeling; the final score is 88.094.
```python
# train['os'].value_counts()
# Train with LightGBM
import lightgbm as lgb

model = lgb.LGBMClassifier()
# Model training
model.fit(features.drop(['timestamp', 'version'], axis=1), train['label'])
result = model.predict(test_features.drop(['timestamp', 'version'], axis=1))
# features['version'].value_counts()

# Build the submission file
res = pd.DataFrame(test['sid'])
res['label'] = result
res.to_csv('./baseline.csv', index=False)
res
```
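The score above comes straight from the leaderboard. Before submitting, it can help to check a local hold-out split; a minimal sketch, assuming `features` and `train` from above and using accuracy as a rough proxy for the leaderboard metric:

```python
# Local hold-out check (sketch): assumes `features` and `train` from above
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import lightgbm as lgb

X = features.drop(['timestamp', 'version'], axis=1)
y = train['label']
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

clf = lgb.LGBMClassifier()
clf.fit(X_tr, y_tr)
print('hold-out accuracy:', accuracy_score(y_val, clf.predict(X_val)))
```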
Optimization direction
The schemes tried so far are listed below. For the version-by-version comparison, see the model scores at the end of this post; see the source code in the Gitee repository for details.
- Transform and use version
- Use timestamp in detail: derive year, month, day, hour, minute, weekday and time-diff features
- Add the difference between osv and version
- Encode and use lan
- Add screen aspect ratio, screen area and pixel-ratio features
- Use hand-tuned ("ancestral") LGB, XGB and other custom-parameter models
- Train the model with 5-fold cross-validation (see the sketch after this list)
- Fuse multiple models trained with 5-fold cross-validation
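A rough sketch of the 5-fold cross-training and fusion idea (parameters are illustrative placeholders, not the hand-tuned set used in the actual versions; assumes `features`, `test_features` and `train` from above):

```python
# 5-fold cross-training with prediction averaging (sketch, illustrative parameters)
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import StratifiedKFold

X = features.drop(['timestamp', 'version'], axis=1)
X_test = test_features.drop(['timestamp', 'version'], axis=1)
y = train['label']

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
test_prob = np.zeros(len(X_test))

for fold, (tr_idx, val_idx) in enumerate(skf.split(X, y)):
    clf = lgb.LGBMClassifier(n_estimators=1000, learning_rate=0.05)
    clf.fit(X.iloc[tr_idx], y.iloc[tr_idx], eval_set=[(X.iloc[val_idx], y.iloc[val_idx])])
    # Average the per-fold test probabilities as a simple fusion
    test_prob += clf.predict_proba(X_test)[:, 1] / skf.n_splits

pred = (test_prob > 0.5).astype(int)
```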
Deep learning
The deep learning approach uses Baidu PaddlePaddle as the base framework.
Feature processing
Data processing is roughly the same as in the machine learning part, but because a neural network is used, the processed data also needs to be normalized.
```python
import pandas as pd
import warnings
warnings.filterwarnings('ignore')

# Data loading
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
test = test.iloc[:, 1:]
train = train.iloc[:, 1:]
train

# Object-type columns: lan, os, osv, version, fea_hash
# Columns with missing values: lan, osv
features = train.columns.tolist()
features.remove('label')
print(features)

for feature in features:
    print(feature, train[feature].nunique())

# Data cleaning of osv
def osv_trans(x):
    x = str(x).replace('Android_', '').replace('Android ', '').replace('W', '')
    if str(x).find('.') > 0:
        temp_index1 = x.find('.')
        if x.find(' ') > 0:
            temp_index2 = x.find(' ')
        else:
            temp_index2 = len(x)
        if x.find('-') > 0:
            temp_index2 = x.find('-')
        result = x[0:temp_index1] + '.' + x[temp_index1 + 1:temp_index2].replace('.', '')
        try:
            return float(result)
        except:
            print(x + '#########')
            return 0
    try:
        return float(x)
    except:
        print(x + '#########')
        return 0

# train['osv'] => LabelEncoder ?
# Fill missing osv values with the mode, then clean
train['osv'].fillna('8.1.0', inplace=True)
train['osv'] = train['osv'].apply(osv_trans)
test['osv'].fillna('8.1.0', inplace=True)
test['osv'] = test['osv'].apply(osv_trans)

# Map lan to integer codes
# train['os'].value_counts()
train['lan'].value_counts()
train['lan'].value_counts().index
lan_map = {'zh-CN': 1, 'zh_CN': 2, 'Zh-CN': 3, 'zh-cn': 4, 'zh_CN_#Hans': 5, 'zh': 6, 'ZH': 7, 'cn': 8,
           'CN': 9, 'zh-HK': 10, 'tw': 11, 'TW': 12, 'zh-TW': 13, 'zh-MO': 14, 'en': 15, 'en-GB': 16,
           'en-US': 17, 'ko': 18, 'ja': 19, 'it': 20, 'mi': 21}
train['lan'] = train['lan'].map(lan_map)
test['lan'] = test['lan'].map(lan_map)
test['lan'].value_counts()

# Set missing lan values to 22
train['lan'].fillna(22, inplace=True)
test['lan'].fillna(22, inplace=True)

# Drop columns that do not enter the model
remove_list = ['os', 'sid']
col = features
for i in remove_list:
    col.remove(i)
col

# Convert the millisecond timestamp to a datetime
from datetime import datetime
train['timestamp'] = train['timestamp'].apply(lambda x: datetime.fromtimestamp(x / 1000))
test['timestamp'] = test['timestamp'].apply(lambda x: datetime.fromtimestamp(x / 1000))
test['timestamp']

# Clean the version field
def version_trans(x):
    if x == 'V3':
        return 3
    if x == 'v1':
        return 1
    if x == 'P_Final_6':
        return 6
    if x == 'V6':
        return 6
    if x == 'GA3':
        return 3
    if x == 'GA2':
        return 2
    if x == 'V2':
        return 2
    if x == '50':
        return 5
    return int(x)

train['version'] = train['version'].apply(version_trans)
test['version'] = test['version'].apply(version_trans)
train['version'] = train['version'].astype('int')
test['version'] = test['version'].astype('int')

# Feature screening
features = train[col]
# Construct fea_hash_len / fea1_hash_len features
features['fea_hash_len'] = features['fea_hash'].map(lambda x: len(str(x)))
features['fea1_hash_len'] = features['fea1_hash'].map(lambda x: len(str(x)))
# Question: why map very long fea_hash values to 0?
# If fea_hash is very long, set it to 0; otherwise keep its numeric value
features['fea_hash'] = features['fea_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
features['fea1_hash'] = features['fea1_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
features

# The same processing for the test set
test_features = test[col]
test_features['fea_hash_len'] = test_features['fea_hash'].map(lambda x: len(str(x)))
test_features['fea1_hash_len'] = test_features['fea1_hash'].map(lambda x: len(str(x)))
test_features['fea_hash'] = test_features['fea_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
test_features['fea1_hash'] = test_features['fea1_hash'].map(lambda x: 0 if len(str(x)) > 16 else int(x))
test_features

# Multi-scale time features extracted from timestamp (training set)
temp = pd.DatetimeIndex(features['timestamp'])
features['year'] = temp.year
features['month'] = temp.month
features['day'] = temp.day
features['week_day'] = temp.weekday  # Day of the week
features['hour'] = temp.hour
features['minute'] = temp.minute
# Time difference (in days) from the earliest request
start_time = features['timestamp'].min()
features['time_diff'] = features['timestamp'] - start_time
features['time_diff'] = features['time_diff'].dt.days + features['time_diff'].dt.seconds / 3600 / 24
features[['timestamp', 'year', 'month', 'day', 'week_day', 'hour', 'minute', 'time_diff']]

# The same time features for the test set, reusing start_time from the training set
temp = pd.DatetimeIndex(test_features['timestamp'])
test_features['year'] = temp.year
test_features['month'] = temp.month
test_features['day'] = temp.day
test_features['week_day'] = temp.weekday
test_features['hour'] = temp.hour
test_features['minute'] = temp.minute
test_features['time_diff'] = test_features['timestamp'] - start_time
test_features['time_diff'] = test_features['time_diff'].dt.days + test_features['time_diff'].dt.seconds / 3600 / 24
test_features['time_diff']

# Screen area feature
features['dev_height'].value_counts()
features['dev_width'].value_counts()
features['dev_area'] = features['dev_height'] * features['dev_width']
test_features['dev_area'] = test_features['dev_height'] * test_features['dev_width']

"""
Idea: dev_ppi and dev_area could be combined into new features, e.g.
features['dev_area'].astype('float') / features['dev_ppi'].astype('float')
"""
# features['ntt'].value_counts()
features['carrier'].value_counts()
features['package'].value_counts()

# version_osv: difference between the OS version (osv) and the app version
features['osv'].value_counts()
features['version_osv'] = features['osv'] - features['version']
test_features['version_osv'] = test_features['osv'] - test_features['version']

# Drop the raw timestamp
features = features.drop(['timestamp'], axis=1)
test_features = test_features.drop(['timestamp'], axis=1)

# Feature normalization
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
features1 = scaler.fit_transform(features)
test_features1 = scaler.transform(test_features)
```
Generate Dataset and Dataloader
```python
import numpy as np
import pandas as pd
import paddle
from paddle import nn
from paddle.io import Dataset, DataLoader

paddle.device.set_device('gpu:0')

# Custom dataset wrapping a feature DataFrame and a label column
class MineDataset(Dataset):
    def __init__(self, X, y):
        super(MineDataset, self).__init__()
        self.num_samples = len(X)
        self.X = X
        self.y = y

    def __getitem__(self, idx):
        return self.X.iloc[idx].values.astype('float32'), np.array(self.y.iloc[idx]).astype('int64')

    def __len__(self):
        return self.num_samples

from sklearn.model_selection import train_test_split

train_x, val_x, train_y, val_y = train_test_split(features1, train['label'], test_size=0.2, random_state=42)
train_x = pd.DataFrame(train_x, columns=features.columns)
val_x = pd.DataFrame(val_x, columns=features.columns)
train_y = pd.DataFrame(train_y, columns=['label'])
val_y = pd.DataFrame(val_y, columns=['label'])

train_dataloader = DataLoader(MineDataset(train_x, train_y), batch_size=1024, shuffle=True, drop_last=True, num_workers=2)
val_dataloader = DataLoader(MineDataset(val_x, val_y), batch_size=1024, shuffle=True, drop_last=True, num_workers=2)
# test_features1 is a numpy array, so wrap it in a DataFrame for MineDataset's .iloc indexing
test_dataloader = DataLoader(MineDataset(pd.DataFrame(test_features1, columns=features.columns),
                                         pd.Series([0 for i in range(len(test_features1))])),
                             batch_size=1024, shuffle=True, drop_last=True, num_workers=2)
```
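A quick way to sanity-check the loaders before training is to pull one mini-batch and look at the shapes:

```python
# Sanity check: one mini-batch from the training loader
for batch_x, batch_y in train_dataloader:
    print(batch_x.shape, batch_y.shape)  # roughly [1024, num_features] and [1024, 1]
    break
```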
Network construction
The first version of the network uses only simple fully connected layers. It is a tower-shaped structure that narrows from 250 units down to 2, and each linear layer is followed by a ReLU activation and a dropout layer.
```python
class ClassifyModel(nn.Layer):
    def __init__(self, features_len):
        super(ClassifyModel, self).__init__()
        self.fc1 = nn.layer.Linear(in_features=features_len, out_features=250)
        self.ac1 = nn.layer.ReLU()
        self.drop1 = nn.layer.Dropout(p=0.02)
        self.fc2 = nn.layer.Linear(in_features=250, out_features=100)
        self.ac2 = nn.layer.ReLU()
        self.drop2 = nn.layer.Dropout(p=0.02)
        self.fc3 = nn.layer.Linear(in_features=100, out_features=50)
        self.ac3 = nn.layer.ReLU()
        self.drop3 = nn.layer.Dropout(p=0.02)
        self.fc4 = nn.layer.Linear(in_features=50, out_features=25)
        self.ac4 = nn.layer.ReLU()
        self.drop4 = nn.layer.Dropout(p=0.02)
        self.fc5 = nn.layer.Linear(in_features=25, out_features=2)
        self.out = nn.layer.Sigmoid()

    def forward(self, input):
        x = self.fc1(input)
        x = self.ac1(x)
        x = self.drop1(x)
        x = self.fc2(x)
        x = self.ac2(x)
        x = self.drop2(x)
        x = self.fc3(x)
        x = self.ac3(x)
        x = self.drop3(x)
        x = self.fc4(x)
        x = self.ac4(x)
        x = self.drop4(x)
        x = self.fc5(x)
        output = self.out(x)
        return output
```
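To confirm the layer sizes before training, the model can be instantiated and inspected with paddle.summary (a quick check only; it uses the same feature count as the training code below):

```python
# Inspect the tower structure; input_size includes the batch dimension
model = ClassifyModel(int(len(features.columns)))
paddle.summary(model, input_size=(1024, len(features.columns)))
```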
Network training
```python
# Initialize the model
model = ClassifyModel(int(len(features.columns)))
# Training mode
model.train()
# Define the optimizer and loss
opt = paddle.optimizer.AdamW(learning_rate=0.001, parameters=model.parameters())
loss_fn = nn.CrossEntropyLoss()

EPOCHS = 10  # Number of outer epochs
for epoch in range(EPOCHS):
    for iter_id, mini_batch in enumerate(train_dataloader):
        x_train = mini_batch[0]
        y_train = mini_batch[1]
        # Forward pass
        y_pred = model(x_train)
        # Compute the loss
        loss = nn.functional.loss.cross_entropy(y_pred, y_train)
        avg_loss = paddle.mean(loss)
        if iter_id % 20 == 0:
            acc = paddle.metric.accuracy(y_pred, y_train)
            print("epoch: {}, iter: {}, loss is: {}, acc is: {}".format(epoch, iter_id, avg_loss.numpy(), acc.numpy()))
        # Back propagation
        avg_loss.backward()
        # Update parameters and clear gradients
        opt.step()
        opt.clear_grad()
```
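The loop above only trains; it never evaluates on the validation split or produces a submission file. A minimal sketch of both steps, assuming `model`, `val_dataloader`, `test_features1` and `test` from above (the output file name is just an example):

```python
# Evaluate on the validation split and predict the test set (sketch)
model.eval()
accs = []
with paddle.no_grad():
    for x_val, y_val in val_dataloader:
        accs.append(float(paddle.metric.accuracy(model(x_val), y_val)))
print('validation accuracy:', sum(accs) / len(accs))

# Test-set prediction in the original row order (the shuffled test loader above is not used here)
test_x = paddle.to_tensor(np.asarray(test_features1, dtype='float32'))
pred = paddle.argmax(model(test_x), axis=1).numpy()
sub = pd.DataFrame({'sid': test['sid'], 'label': pred})
sub.to_csv('./paddle_baseline.csv', index=False)  # example output name
```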
Optimization ideas
Again, for space reasons, the following two schemes are only listed here; refer to the source code in the Gitee repository:
- Wide & Deep with embeddings (a rough sketch follows this list)
- DeepFM, built on FM
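The full implementations live in the repository; purely as an illustration of the Wide & Deep idea, here is a rough Paddle sketch (layer sizes and embedding dimensions are made up, and the repository code may differ):

```python
import paddle
from paddle import nn

class WideDeep(nn.Layer):
    """Sketch only: dense features feed both a linear 'wide' head and, together
    with embeddings of the categorical fields, a small 'deep' MLP."""
    def __init__(self, dense_dim, cat_sizes, emb_dim=8):
        super(WideDeep, self).__init__()
        # One embedding table per categorical field; cat_sizes holds vocabulary sizes
        self.embs = nn.LayerList([nn.Embedding(n, emb_dim) for n in cat_sizes])
        deep_in = dense_dim + emb_dim * len(cat_sizes)
        self.deep = nn.Sequential(
            nn.Linear(deep_in, 128), nn.ReLU(),
            nn.Linear(128, 64), nn.ReLU(),
        )
        self.wide = nn.Linear(dense_dim, 2)
        self.head = nn.Linear(64, 2)

    def forward(self, dense_x, cat_x):
        # cat_x: int64 tensor of shape [batch, num_cat_fields]
        emb = paddle.concat([emb(cat_x[:, i]) for i, emb in enumerate(self.embs)], axis=1)
        deep_out = self.head(self.deep(paddle.concat([dense_x, emb], axis=1)))
        # Sum the wide and deep logits; softmax / cross-entropy is applied outside
        return deep_out + self.wide(dense_x)
```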
Model score results of each version
Category | Model | Details | Score |
---|---|---|---|
ML | ML v1 | 1. Preliminary modeling 2. Features excluded from modeling: ['os', 'version', 'lan', 'sid'] 3. Default-parameter LGB | 88.094 |
 | ML v2 | 1. Based on v1 2. Introduce version and apply a simple transformation to timestamp 3. Test default-parameter LGB and XGB | 88.2133 |
 | ML v3 | 1. Based on v2 2. Introduce lan 3. Take the difference between osv and version 4. Hand-tuned LGB parameters | 88.9487 |
 | ML v4 | 1. Based on v3 2. 5-fold LGB 3. 5-fold XGB 4. Ensemble | 89.0293 / 89.0253 / 89.054 |
 | ML v5 | 1. Based on v3 2. Add aspect-ratio, screen-area and pixel-ratio features 3. 5-fold LGB 4. 5-fold XGB 5. Ensemble | 89.1873 / 89.108 / 89.1713 |
Paddle | Paddle v1 | 1. Feature engineering based on ML v3 2. Simple network built with Paddle | Results not uploaded |
 | Paddle v2 | 1. Based on v1 2. Add embedding dictionary creation (in the embedding-analysis notebook) 3. Basic model combined with embeddings | 88.71 |
 | Paddle v3 | 1. Based on v2 2. Add a DeepFM component and merge it | 87.816 |
TensorFlow | TF v1 | 1. Feature engineering based on ML v3 2. Simple network built with TensorFlow | Results not uploaded |
FM | FM v1 | 1. First simple model based on FM | 57.2147 |
Final ranking score