This time it's an AI workshop competition.
The goal is to determine whether ECG data is normal or abnormal: a binary classification with two labels, normal = 0 and abnormal = 1.
After downloading the dataset and opening ptbdb_train.csv, you will find 7000 rows of data. The first 187 columns of each row are the ECG signal, and the last column is the label.
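As a quick sanity check, you can load the file and confirm this layout. This is a minimal sketch; the path is an assumption that matches the one used in the training code later in this post.

import pandas as pd

# Assumed path, matching the training code further below
train_data_df = pd.read_csv('../heartbeat/ptbdb_train.csv', header=None)

print(train_data_df.shape)                       # expected: (7000, 188) -> 187 signal columns + 1 label column
print(train_data_df.iloc[:, -1].value_counts())  # how many rows are labeled 0 (normal) vs 1 (abnormal)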
This is a binary classification task. Since the data is a time series with contextual dependencies, it could be modeled either with one-dimensional convolutions or with an LSTM; because I am less familiar with LSTMs, one-dimensional convolution is used here.
It is mainly divided into the following modules:
Data preprocessing
Model building
Model training
Model testing
Data preprocessing
Data preprocessing includes: 1. converting the labels to one-hot encoding, and 2. reading the data.
1. Convert labels to one-hot encoding
Each row of data corresponds to a label with value 0 or 1, and the network we build later has two output units, so the label is converted to one-hot form.
import numpy as np

# Convert a label index into a one-hot vector of length Lens
def convert2oneHot(index, Lens):
    hot = np.zeros((Lens,))
    hot[int(index)] = 1
    return hot
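For example, with two classes the conversion looks like this:

print(convert2oneHot(0, 2))   # [1. 0.]
print(convert2oneHot(1, 2))   # [0. 1.]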
2. Data reading
Data reading covers both the training data and the test data; generators (using yield) are used to produce batches.
import math
import numpy as np

# Training / validation data generator
def train_gen(df, batch_size=20, train=True):
    img_list = np.array(df)
    # Number of batches per epoch (the train flag is kept for the calling code;
    # the step count is computed the same way in both cases)
    steps = math.ceil(img_list.shape[0] / batch_size)
    while True:
        for i in range(steps):
            batch_list = img_list[i * batch_size : i * batch_size + batch_size]
            np.random.shuffle(batch_list)
            batch_x = np.array([file for file in batch_list[:, :-1]])                       # the 187 signal columns
            batch_y = np.array([convert2oneHot(label, 2) for label in batch_list[:, -1]])   # one-hot labels
            yield batch_x, batch_y

# Test data generator (no labels)
def test_gen(df, batch_size=20):
    img_list = np.array(df)
    steps = math.ceil(len(img_list) / batch_size)  # number of batches per epoch
    while True:
        for i in range(steps):
            batch_list = img_list[i * batch_size : i * batch_size + batch_size]
            batch_x = np.array([file for file in batch_list[:, :]])
            print(batch_x.shape)
            yield batch_x
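A quick way to check that the generator produces the expected shapes (a small sketch, assuming train_data_df was loaded as in the sanity check above and a batch size of 20):

gen = train_gen(train_data_df, batch_size=20, train=True)
batch_x, batch_y = next(gen)
print(batch_x.shape)   # (20, 187): 20 samples of 187 time steps each
print(batch_y.shape)   # (20, 2): one-hot labels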
Model building
The model is built with the Keras framework. Keras's Sequential API makes it very convenient to stack layers into a sequential model, so it is used here.
One sample is a time series of length 187.
The following is a simple model built from one-dimensional convolutions; its structure should be tuned further to fit the data better.
from keras.models import Sequential
from keras.layers import Reshape, Conv1D, MaxPooling1D, GlobalAveragePooling1D, Dropout, Dense

TIME_PERIODS = 187

def build_model(input_shape=(TIME_PERIODS,), num_classes=2):
    model = Sequential()
    model.add(Reshape((TIME_PERIODS, 1), input_shape=input_shape))
    model.add(Conv1D(16, 8, strides=2, activation='relu', input_shape=(TIME_PERIODS, 1)))
    model.add(Conv1D(16, 8, strides=2, activation='relu', padding="same"))
    # model.add(MaxPooling1D(2))
    model.add(Conv1D(64, 4, strides=2, activation='relu', padding="same"))
    model.add(Conv1D(64, 4, strides=2, activation='relu', padding="same"))
    # model.add(MaxPooling1D(2))
    model.add(Conv1D(256, 4, strides=2, activation='relu', padding="same"))
    model.add(Conv1D(256, 4, strides=2, activation='relu', padding="same"))
    # model.add(MaxPooling1D(2))
    model.add(Conv1D(512, 2, strides=1, activation='relu', padding="same"))
    model.add(Conv1D(512, 2, strides=1, activation='relu', padding="same"))
    # model.add(MaxPooling1D(2))
    model.add(GlobalAveragePooling1D())
    model.add(Dropout(0.3))
    model.add(Dense(num_classes, activation='softmax'))
    return model
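A quick check that the network builds and ends in a two-way softmax (a minimal sketch):

model = build_model()
print(model.output_shape)    # (None, 2): one softmax probability per class
print(model.count_params())  # total number of parameters in the network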
Model training
Once the data is ready and the model is built, training can begin. Here sklearn's KFold is used for cross-validated training.
Training uses 10 folds, so each model is trained on 90% of the data and validated on the remaining 10%.
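For intuition, KFold simply yields index arrays for each split; a small sketch (the 7000 is the number of training rows noted earlier, and the full training loop follows below):

from sklearn.model_selection import KFold
import numpy as np

skf = KFold(n_splits=10, random_state=233, shuffle=True)
for fold_idx, (train_idx, val_idx) in enumerate(skf.split(np.arange(7000))):
    print(fold_idx, len(train_idx), len(val_idx))   # each fold: 6300 rows for training, 700 for validation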
import pandas as pd
import keras
from keras.optimizers import Adam
from sklearn.model_selection import KFold

data_path = ['../heartbeat/ptbdb_train.csv', '../heartbeat/ptbdb_test.csv']
train_data = pd.read_csv(data_path[0], header=None)
train_data_df = pd.DataFrame(train_data)
batch_size = 20

skf = KFold(n_splits=10, random_state=233, shuffle=True)
for fold_idx, (train_idx, val_idx) in enumerate(skf.split(train_data_df, train_data_df)):
    train_data = train_data_df.iloc[train_idx]
    val_data = train_data_df.iloc[val_idx]
    len_train = train_data.shape[0]
    len_val = val_data.shape[0]
    train_iter = train_gen(train_data, batch_size, True)
    val_iter = train_gen(val_data, batch_size, False)

    # Model checkpoint: after training there will be 10 model files, one per fold
    ckpt = keras.callbacks.ModelCheckpoint(
        filepath='best_model_{}.h5'.format(fold_idx),
        monitor='val_loss',
        save_best_only=True,
        verbose=1)

    model = build_model()
    # Adam optimizer; it can be replaced with another optimizer here
    opt = Adam(0.0002)
    model.compile(loss='categorical_crossentropy', optimizer=opt, metrics=['accuracy'])
    print(model.summary())

    model.fit_generator(
        generator=train_iter,
        steps_per_epoch=len_train // batch_size,
        epochs=100,
        initial_epoch=0,
        validation_data=val_iter,
        validation_steps=len_val // batch_size,
        callbacks=[ckpt],
    )
The training process is as follows:
Epoch 00001: val_loss improved from inf to 0.36047, saving model to best_model_0.h5
Epoch 2/100
315/315 [==============================] - 3s 11ms/step - loss: 0.4000 - accuracy: 0.8365 - val_loss: 0.3006 - val_accuracy: 0.8700
Epoch 00002: val_loss improved from 0.36047 to 0.30064, saving model to best_model_0.h5
Epoch 3/100
315/315 [==============================] - 3s 11ms/step - loss: 0.3306 - accuracy: 0.8684 - val_loss: 0.2190 - val_accuracy: 0.8771
Epoch 00003: val_loss improved from 0.30064 to 0.21901, saving model to best_model_0.h5
Epoch 4/100
315/315 [==============================] - 3s 9ms/step - loss: 0.2674 - accuracy: 0.8948 - val_loss: 0.1419 - val_accuracy: 0.9014
Epoch 00004: val_loss improved from 0.21901 to 0.14192, saving model to best_model_0.h5
Epoch 5/100
315/315 [==============================] - 4s 13ms/step - loss: 0.2161 - accuracy: 0.9181 - val_loss: 0.1205 - val_accuracy: 0.9271
Epoch 00005: val_loss improved from 0.14192 to 0.12052, saving model to best_model_0.h5
Epoch 6/100
315/315 [==============================] - 4s 11ms/step - loss: 0.1775 - accuracy: 0.9346 - val_loss: 0.1750 - val_accuracy: 0.9257
Model testing
After training, 10 models are obtained. Each is run on the test data, the predictions are fused by majority voting, and the results are written to a csv file.
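The fusion rule is simple majority voting over the 10 fold models. A toy example of the rule (the vote values here are hypothetical):

import numpy as np

votes = np.array([1, 1, 0, 1, 1, 1, 0, 1, 0, 1])   # hypothetical 0/1 predictions from the 10 fold models
label = 1.0 if votes.sum() > 5 else 0.0            # abnormal only if more than half (>5) of the models agree
print(label)                                       # 1.0, since 7 of the 10 models predicted abnormal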
import math
import numpy as np
import pandas as pd
from keras.models import load_model

batch_size = 20
result = np.zeros(shape=(1000,))   # the test set has 1000 samples
data_path = ['../heartbeat/ptbdb_train.csv', '../heartbeat/ptbdb_test.csv']
test_data = pd.read_csv(data_path[1], header=None)
test_data_df = pd.DataFrame(test_data)

# Test data generator
test_iter = test_gen(test_data_df, batch_size=20)

for i in range(10):
    h5 = './best_model_{}.h5'.format(i)
    model = load_model(h5)
    pres = model.predict_generator(generator=test_iter,
                                   steps=math.ceil(1000 / batch_size),
                                   verbose=1)
    print('pres.shape is {}'.format(pres.shape))
    ohpres = np.argmax(pres, axis=1)
    print('ohpres.shape is {}'.format(ohpres.shape))
    print(type(ohpres))
    result += ohpres

print('result shape is {}'.format(result.shape))
# Majority vote: a sample is abnormal only if more than 5 of the 10 models say so
result = [1.0 if result[i] > 5 else 0.0 for i in range(len(result))]

df = pd.DataFrame()
df["id"] = np.arange(0, len(ohpres))
df["label"] = result
df.to_csv("submmit.csv", header=None, index=None)
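To double-check the submission before uploading, a small sketch (the filename matches the one written above):

sub = pd.read_csv("submmit.csv", header=None, names=["id", "label"])
print(sub.shape)                    # expected: (1000, 2)
print(sub["label"].value_counts())  # how many test samples were voted normal (0.0) vs abnormal (1.0)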
Game Over!