[Hands-on] Heartbeat anomaly detection: training a 1D-convolution CNN with Keras and K-fold cross-validation

This time it comes from an AI workshop competition.
The goal is to classify ECG data as normal or abnormal, a two-class problem: normal = 0, abnormal = 1.
After downloading the dataset and opening ptbdb_train.csv, you will find 7000 rows of data. The first 187 columns of each row are the ECG signal, and the last column is the label.
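A quick way to confirm this layout is to load the CSV with pandas and inspect it (a minimal sketch; the file path and the 7000 x 188 shape follow the description above):

import pandas as pd

# Load the training CSV (no header row); path assumed from the description above
df = pd.read_csv('../heartbeat/ptbdb_train.csv', header=None)
print(df.shape)                        # expected (7000, 188): 187 signal values + 1 label
print(df.iloc[:, -1].value_counts())   # label distribution: 0 = normal, 1 = abnormal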

This is a binary classification task. Because the data is a time series with contextual dependence, one-dimensional convolution is a natural choice; an LSTM could also work, but I am less familiar with LSTMs.

It is mainly divided into the following modules:
Data preprocessing
Model building
Model training
Model testing

Data preprocessing

Data preprocessing includes: 1. converting labels to one-hot encoding, and 2. reading the data.

1. Converting labels to one-hot encoding

Each row of data has a label whose value is 0 or 1, and the network we build later has two outputs, so we convert the label to one-hot form.

import numpy as np

# Convert a label into a one-hot vector
def convert2oneHot(index, Lens):
    hot = np.zeros((Lens,))   # vector of length Lens, all zeros
    hot[int(index)] = 1       # set the position given by the label to 1
    return hot
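For example, with two classes a label of 1 becomes the vector [0, 1]:

print(convert2oneHot(0, 2))  # [1. 0.]
print(convert2oneHot(1, 2))  # [0. 1.]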
2. Data reading

The data reading part covers both the training data and the test data; a generator (yield) is used to produce batches.

import math
import numpy as np

# Generate training/validation batches
def train_gen(df, batch_size=20, train=True):
    # the train flag is accepted for the calling convention, but both branches
    # of the original code computed the same number of steps, so it is unused here
    img_list = np.array(df)
    steps = math.ceil(img_list.shape[0] / batch_size)    # number of batches per epoch
    while True:
        for i in range(steps):
            batch_list = img_list[i * batch_size : i * batch_size + batch_size]
            np.random.shuffle(batch_list)                # shuffle rows within the batch
            batch_x = batch_list[:, :-1]                 # first 187 columns: ECG signal
            batch_y = np.array([convert2oneHot(label, 2) for label in batch_list[:, -1]])

            yield batch_x, batch_y

# Generate test data batches
def test_gen(df, batch_size=20):
    img_list = np.array(df)
    steps = math.ceil(len(img_list) / batch_size)    # number of batches needed to cover the test set
    while True:
        for i in range(steps):
            batch_x = img_list[i * batch_size : i * batch_size + batch_size]
            yield batch_x
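As a quick sanity check (a minimal sketch, assuming a DataFrame train_data_df loaded as in the training section below), you can pull one batch from the generator with next() and verify the shapes:

gen = train_gen(train_data_df, batch_size=20, train=True)
batch_x, batch_y = next(gen)
print(batch_x.shape)  # (20, 187): 20 ECG sequences of length 187
print(batch_y.shape)  # (20, 2): one-hot labels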

Model building

The model is built with the Keras framework; Keras's Sequential API makes it very convenient to build a sequential model, so it is used here.
One sample is a time series of length 187.
The following is a rough model built from one-dimensional convolutions; it should be tuned to better fit the data.

from keras.models import Sequential
from keras.layers import Reshape, Conv1D, GlobalAveragePooling1D, Dropout, Dense

TIME_PERIODS = 187

def build_model(input_shape=(TIME_PERIODS,), num_classes=2):
    model = Sequential()
    # Reshape the flat 187-value input into (187, 1): a sequence with one channel
    model.add(Reshape((TIME_PERIODS, 1), input_shape=input_shape))
    model.add(Conv1D(16, 8, strides=2, activation='relu'))
    model.add(Conv1D(16, 8, strides=2, activation='relu', padding="same"))
#     model.add(MaxPooling1D(2))
    model.add(Conv1D(64, 4, strides=2, activation='relu', padding="same"))
    model.add(Conv1D(64, 4, strides=2, activation='relu', padding="same"))
#     model.add(MaxPooling1D(2))
    model.add(Conv1D(256, 4, strides=2, activation='relu', padding="same"))
    model.add(Conv1D(256, 4, strides=2, activation='relu', padding="same"))
#     model.add(MaxPooling1D(2))
    model.add(Conv1D(512, 2, strides=1, activation='relu', padding="same"))
    model.add(Conv1D(512, 2, strides=1, activation='relu', padding="same"))
#     model.add(MaxPooling1D(2))

    model.add(GlobalAveragePooling1D())   # average over the remaining time steps
    model.add(Dropout(0.3))
    model.add(Dense(num_classes, activation='softmax'))
    return model
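A quick instantiation (not part of the original training script) confirms the model builds and ends in a two-unit softmax:

model = build_model()
model.summary()             # prints each layer's output shape
print(model.output_shape)   # (None, 2): one probability per class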

Model training

Once the data is ready and the model is built, you can train it. Here we use sklearn's KFold for cross-validated training.
Training uses 10 folds: each model is trained on 90% of the data and validated on the remaining 10%.

import math
import pandas as pd
import keras
from keras.optimizers import Adam
from sklearn.model_selection import KFold

data_path = ['../heartbeat/ptbdb_train.csv', '../heartbeat/ptbdb_test.csv']

train_data = pd.read_csv(data_path[0], header=None)
train_data_df = pd.DataFrame(train_data)

batch_size = 20
skf = KFold(n_splits=10, random_state=233, shuffle=True)
for fold_idx, (train_idx, val_idx) in enumerate(skf.split(train_data_df)):
    train_data = train_data_df.iloc[train_idx]
    val_data = train_data_df.iloc[val_idx]
    len_train = train_data.shape[0]
    len_val = val_data.shape[0]

    train_iterr = train_gen(train_data, batch_size, True)
    val_iterr = train_gen(val_data, batch_size, False)

    # Save the best model of each fold; after training there will be 10 model files.
    ckpt = keras.callbacks.ModelCheckpoint(
        filepath='best_model_{}.h5'.format(fold_idx),
        monitor='val_loss', save_best_only=True, verbose=1)

    model = build_model()
    # Adam optimizer; you can swap in another optimizer here
    opt = Adam(0.0002)
    model.compile(loss='categorical_crossentropy',
                  optimizer=opt, metrics=['accuracy'])
    print(model.summary())

    model.fit_generator(
        generator=train_iterr,
        steps_per_epoch=len_train // batch_size,
        epochs=100,
        initial_epoch=0,
        validation_data=val_iterr,
        validation_steps=len_val // batch_size,   # 'nb_val_samples' is the old Keras 1 name for this argument
        callbacks=[ckpt],
    )
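Note that in more recent versions of Keras/TensorFlow, fit_generator is deprecated and model.fit accepts generators directly; an equivalent call (same arguments, just the newer entry point) would look like:

model.fit(
    train_iterr,
    steps_per_epoch=len_train // batch_size,
    epochs=100,
    validation_data=val_iterr,
    validation_steps=len_val // batch_size,
    callbacks=[ckpt],
)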

The training process is as follows:

Epoch 00001: val_loss improved from inf to 0.36047, saving model to best_model_0.h5
Epoch 2/100
315/315 [==============================] - 3s 11ms/step - loss: 0.4000 - accuracy: 0.8365 - val_loss: 0.3006 - val_accuracy: 0.8700

Epoch 00002: val_loss improved from 0.36047 to 0.30064, saving model to best_model_0.h5
Epoch 3/100
315/315 [==============================] - 3s 11ms/step - loss: 0.3306 - accuracy: 0.8684 - val_loss: 0.2190 - val_accuracy: 0.8771

Epoch 00003: val_loss improved from 0.30064 to 0.21901, saving model to best_model_0.h5
Epoch 4/100
315/315 [==============================] - 3s 9ms/step - loss: 0.2674 - accuracy: 0.8948 - val_loss: 0.1419 - val_accuracy: 0.9014

Epoch 00004: val_loss improved from 0.21901 to 0.14192, saving model to best_model_0.h5
Epoch 5/100
315/315 [==============================] - 4s 13ms/step - loss: 0.2161 - accuracy: 0.9181 - val_loss: 0.1205 - val_accuracy: 0.9271

Epoch 00005: val_loss improved from 0.14192 to 0.12052, saving model to best_model_0.h5
Epoch 6/100
315/315 [==============================] - 4s 11ms/step - loss: 0.1775 - accuracy: 0.9346 - val_loss: 0.1750 - val_accuracy: 0.9257


Model testing

After training, 10 models are obtained. Each is run on the test set, their predictions are fused by majority voting, and the result is written to a CSV file.

import math
import numpy as np
import pandas as pd
from keras.models import load_model

batch_size = 20
result = np.zeros(shape=(1000,))    # vote accumulator; the test set has 1000 rows
data_path = ['../heartbeat/ptbdb_train.csv', '../heartbeat/ptbdb_test.csv']
test_data = pd.read_csv(data_path[1], header=None)
test_data_df = pd.DataFrame(test_data)
# Test data generator
test_iter = test_gen(test_data_df, batch_size=20)

for i in range(10):
    h5 = './best_model_{}.h5'.format(i)
    model = load_model(h5)
    pres = model.predict_generator(generator=test_iter, steps=math.ceil(1000 / batch_size), verbose=1)
    print('pres.shape is {}'.format(pres.shape))
    ohpres = np.argmax(pres, axis=1)     # predicted class (0 or 1) for each sample
    print('ohpres.shape is {}'.format(ohpres.shape))
    result += ohpres                     # accumulate the votes of the 10 models
    print('result shape is {}'.format(result.shape))

# A sample is labeled 1 only if more than 5 of the 10 models predict 1
result = [1.0 if result[i] > 5 else 0.0 for i in range(len(result))]
df = pd.DataFrame()
df["id"] = np.arange(0, len(result))
df["label"] = result
df.to_csv("submmit.csv", header=None, index=None)
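An alternative fusion strategy (not used above, just a sketch under the same setup) is to average the softmax probabilities of the 10 models and take the argmax, which keeps more information than hard voting:

probs = np.zeros((1000, 2))    # accumulated class probabilities over all models
for i in range(10):
    model = load_model('./best_model_{}.h5'.format(i))
    probs += model.predict_generator(generator=test_iter, steps=math.ceil(1000 / batch_size))
result = np.argmax(probs / 10, axis=1)   # class with the highest average probability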

Game Over!
