Image feature vector extraction and application of transfer learning
This article discusses transfer learning in computer vision: the ability to reuse a pre-trained model on data sets other than the one it was originally trained on.
For example:
There are two different data sets, A and B, and our task is to classify the images in each of them.
Conventional practice: train model X on dataset A and model Y on dataset B.
Transfer learning: train model X on dataset A, adapt the trained model, and then continue training X on dataset B.
When it applies:
A deep neural network X that has been trained on a large data set, such as ImageNet, performs well in transfer learning. Reusing its convolution kernels is more worthwhile than training new kernels from scratch.
Types:
Generally speaking, there are two types of transfer learning in deep learning for computer vision:
- Model X is used as a feature extractor, and the extracted features are used as the input to another machine learning algorithm.
- The fully connected (FC) layers of model X are removed and replaced with new FC layers, and the weights are then fine-tuned.
This article focuses on the first type.
Feature extraction using trained CNN
So far, we have treated a convolutional neural network as an end-to-end classifier:
- Feed an image into the network
- Propagate the image forward through the entire network
- Obtain the classification probabilities from the end of the network
However, nothing requires us to pass the image through the whole network. We can stop at any layer, such as an activation or pooling layer, take the values the network has produced at that point, and use them as a feature vector.
If we apply this to every image in a dataset, we obtain one feature vector per image. We can then use these feature vectors to train an off-the-shelf machine learning model, such as a linear SVM, a logistic regression classifier or a random forest.
Note that in this process the convolutional neural network does not perform the classification itself. We use it only as a feature extractor, and the downstream machine learning classifier is responsible for learning the latent patterns in the extracted features.
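To make the idea of stopping partway through the network concrete, here is a minimal NumPy sketch. It is not VGG16 and not the code used later in this article: the layer names, the random 1x1 "convolutions" and the helper `forward` are all made up for illustration. It only shows that the same stack of layers can return either class probabilities or a flattened feature vector, depending on where we stop.

```python
import numpy as np

rng = np.random.default_rng(0)


def make_conv(in_channels, out_channels):
    # Stand-in for a convolution layer: a fixed random 1x1 projection plus ReLU.
    w = rng.standard_normal((in_channels, out_channels))
    return lambda x: np.maximum(x @ w, 0.0)


def max_pool_2x2(x):
    # Halve height and width by taking the max over each 2x2 window.
    n, h, w, c = x.shape
    return x.reshape(n, h // 2, 2, w // 2, 2, c).max(axis=(2, 4))


def make_classifier(in_features, num_classes):
    # Stand-in for the FC head: flatten, project, softmax.
    w = rng.standard_normal((in_features, num_classes))

    def classify(x):
        logits = x.reshape(x.shape[0], -1) @ w
        exp = np.exp(logits - logits.max(axis=1, keepdims=True))
        return exp / exp.sum(axis=1, keepdims=True)

    return classify


# A toy network for 8x8 RGB inputs (hypothetical layer names).
layers = [
    ("conv1", make_conv(3, 8)),
    ("pool1", max_pool_2x2),
    ("conv2", make_conv(8, 16)),
    ("pool2", max_pool_2x2),
    ("classifier", make_classifier(2 * 2 * 16, 3)),
]


def forward(x, stop_at=None):
    # Run the layers in order and return early once `stop_at` has been applied.
    for name, layer in layers:
        x = layer(x)
        if name == stop_at:
            break
    return x


images = rng.random((4, 8, 8, 3))

# End-to-end use: class probabilities from the last layer.
probs = forward(images)

# Feature extraction: stop after the last pooling layer and flatten.
features = forward(images, stop_at="pool2").reshape(images.shape[0], -1)
```

Each row of `features` is one image's feature vector, ready to be handed to a separate classifier.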
Getting to know HDF5
HDF5 is a binary data format created by The HDF Group. It is used to store very large data sets on disk while keeping the data easy to access and manipulate.
Data in HDF5 is stored hierarchically, much like files in a file system:
- Group: data is first defined in a group. A group is like a container. It can hold datasets and other groups.
- Dataset: once a group is defined, datasets can be created inside it. A dataset can be regarded as a multidimensional array whose elements all have the same data type.
HDF5 is written in C, but with the h5py module we can use Python to drive the underlying C API.
The great thing about HDF5 is how easy it makes working with data. We can store a large amount of data in an HDF5 dataset and manipulate it in much the same way as a NumPy array.
When using HDF5 through h5py, you can treat your data as one huge NumPy array. The array may be too large to load into memory, but we can still read and write slices of it through HDF5.
Best of all, the HDF5 format is standardized, which means that data stored in HDF5 can be read by other developers in other languages, such as C, MATLAB and Java.
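As a small illustration of groups, datasets and NumPy-style slicing, the following sketch writes a tiny file to a temporary directory and reads part of it back. The group name `flowers`, the dataset name `features` and the shapes are arbitrary choices for this example, not something the rest of the article depends on.

```python
import os
import tempfile

import h5py
import numpy as np

# Write to a throwaway location so the example leaves no files behind in the project.
path = os.path.join(tempfile.mkdtemp(), "demo.hdf5")

with h5py.File(path, "w") as db:
    # A group is a container; a dataset is a typed multidimensional array inside it.
    group = db.create_group("flowers")
    feats = group.create_dataset("features", (100, 8), dtype="float32")
    # Slice assignment writes only these rows to disk.
    feats[0:10] = np.ones((10, 8), dtype="float32")

with h5py.File(path, "r") as db:
    # Groups and datasets are addressed like paths in a file system.
    feats = db["flowers/features"]
    shape = feats.shape
    # Slicing reads just the requested rows into memory as a NumPy array.
    first_rows = feats[0:10]
    row_sum = float(first_rows.sum())
```

Only the slices that are actually indexed are loaded, which is what makes it practical to work with datasets larger than memory.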
Write data to HDF5
If a worker wants to do well, he must sharpen his tools first.
Before we start our formal work, we need to write a small tool to read and write HDF5 files.
Directory structure:
    ----pyimgsearch
    |    |----__init__.py
    |    |----callbacks
    |    |----inputoutput
    |    |    |----__init__.py
    |    |    |----hdf5datasetwriter.py
    |    |----nn
    |    |----preprocessing
    |    |----utils
    import os

    import h5py


    class HDF5DatasetWriter:
        def __init__(self, dims, outputPath, dataKey="images", bufSize=1000):
            # Refuse to overwrite an existing file
            if os.path.exists(outputPath):
                raise ValueError("The supplied 'outputPath' already exists and "
                                 "cannot be overwritten. Manually delete the file "
                                 "before continuing", outputPath)

            # Open the file and create one dataset for the data
            # and one for the integer class labels
            self.db = h5py.File(outputPath, "w")
            self.data = self.db.create_dataset(dataKey, dims, dtype="float")
            self.labels = self.db.create_dataset("labels", (dims[0],), dtype="int")

            # In-memory buffer and the index of the next row to write
            self.bufsize = bufSize
            self.buffer = {"data": [], "labels": []}
            self.idx = 0

        def add(self, rows, labels):
            # Append to the buffer and flush once it is full
            self.buffer["data"].extend(rows)
            self.buffer["labels"].extend(labels)
            if len(self.buffer["data"]) >= self.bufsize:
                self.flush()

        def flush(self):
            # Write the buffered rows to disk, then reset the buffer
            i = self.idx + len(self.buffer["data"])
            self.data[self.idx:i] = self.buffer["data"]
            self.labels[self.idx:i] = self.buffer["labels"]
            self.idx = i
            self.buffer = {"data": [], "labels": []}

        def storeClassLabels(self, classLabels):
            # Store the class names as variable-length strings
            dt = h5py.special_dtype(vlen=str)
            labelSet = self.db.create_dataset("label_names", (len(classLabels),), dtype=dt)
            labelSet[:] = classLabels

        def close(self):
            # Write out anything still in the buffer before closing the file
            if len(self.buffer["data"]) > 0:
                self.flush()
            self.db.close()
In the above program, we work with the datasets in the HDF5 file through a few methods:
- add: append data and the corresponding labels to the buffer. If the amount of buffered data reaches the buffer size, call flush.
- flush: write the buffered data to the file, then empty the buffer.
- storeClassLabels: write the name of each class to the file as a string.
- close: close the file. If there is still data in the buffer at this point, call flush first to write it out.
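To see when the buffer is actually written out, the following sketch mimics the add/flush/close logic with plain Python lists instead of an HDF5 file. `BufferedWriterSketch`, the 2500 rows and the batch size of 32 are illustrative assumptions; only the control flow mirrors the writer described above.

```python
class BufferedWriterSketch:
    """In-memory stand-in for the writer, used only to trace flushes."""

    def __init__(self, buf_size):
        self.buf_size = buf_size
        self.buffer = []
        self.stored = []       # plays the role of the on-disk dataset
        self.flush_sizes = []  # number of rows written by each flush

    def add(self, rows):
        self.buffer.extend(rows)
        # The check happens after extending, so a flush can exceed buf_size.
        if len(self.buffer) >= self.buf_size:
            self.flush()

    def flush(self):
        self.stored.extend(self.buffer)
        self.flush_sizes.append(len(self.buffer))
        self.buffer = []

    def close(self):
        # Write whatever is left before "closing".
        if self.buffer:
            self.flush()


writer = BufferedWriterSketch(buf_size=1000)
rows = list(range(2500))
for start in range(0, len(rows), 32):
    writer.add(rows[start:start + 32])
writer.close()

print(writer.flush_sizes)  # [1024, 1024, 452]
```

With batches of 32, the buffer first reaches 1000 rows after 32 batches (1024 rows), so each full flush writes 1024 rows, and close writes the remainder. Forgetting to call close would silently drop that last partial buffer.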
Feature extraction
Create a Python file named extract_feature.py and write the following code:
    from tensorflow.keras.applications import VGG16
    from tensorflow.keras.applications import imagenet_utils
    from tensorflow.keras.preprocessing.image import img_to_array
    from tensorflow.keras.preprocessing.image import load_img
    from sklearn.preprocessing import LabelEncoder
    # Adjust this import to match where hdf5datasetwriter.py lives in your project
    from inOutput.hdf5datasetwriter import HDF5DatasetWriter
    from imutils import paths
    import numpy as np
    import progressbar
    import random
    import os

    dataset = "/Users/lingg/Desktop/dataset/Flower17-master/dataset/train"
    output = "/Users/lingg/PycharmProjects/DLstudy/feature/flower-17/hdf5/feature.hdf5"
    batchsize = 32
    bufferSize = 1000
    bs = batchsize

    # Collect and shuffle the image paths; the class name is the parent directory
    print("[INFO] loading images...")
    imagePaths = list(paths.list_images(dataset))
    random.shuffle(imagePaths)
    labels = [p.split(os.path.sep)[-2] for p in imagePaths]
    le = LabelEncoder()
    labels = le.fit_transform(labels)

    # Load VGG16 without its fully connected head
    print("[INFO] loading network...")
    model = VGG16(weights="imagenet", include_top=False)

    # From here on, the name `dataset` refers to the HDF5 writer, not the input directory
    dataset = HDF5DatasetWriter((len(imagePaths), 512 * 7 * 7), output,
                                dataKey="features", bufSize=bufferSize)
    dataset.storeClassLabels(le.classes_)

    widgets = ["Extracting Features: ", progressbar.Percentage(), " ",
               progressbar.Bar(), " ", progressbar.ETA()]
    pbar = progressbar.ProgressBar(maxval=len(imagePaths), widgets=widgets).start()

    for i in np.arange(0, len(imagePaths), bs):
        batchPaths = imagePaths[i:i + bs]
        batchLabels = labels[i:i + bs]
        batchImages = []

        # Load and preprocess each image in the batch
        for (j, imagePath) in enumerate(batchPaths):
            image = load_img(imagePath, target_size=(224, 224))
            image = img_to_array(image)
            image = np.expand_dims(image, axis=0)
            image = imagenet_utils.preprocess_input(image)
            batchImages.append(image)

        # Run the batch through the network and flatten each output to one vector
        batchImages = np.vstack(batchImages)
        features = model.predict(batchImages, batch_size=bs)
        features = features.reshape((features.shape[0], 512 * 7 * 7))

        dataset.add(features, batchLabels)
        pbar.update(i)

    dataset.close()
    pbar.finish()
In this script:
- The variable dataset is the directory where the image dataset is located. We use the flower-17 dataset.
- The variable output is the path of the file in which the extracted features are stored.
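The script takes each image's label from the name of the directory that contains it, then encodes the labels as integers. The sketch below reproduces that step with the standard library only. The example paths are made up, and the sorted-list encoding is a stand-in that mimics how scikit-learn's LabelEncoder numbers classes (in sorted order), not a replacement for it.

```python
import os

# Hypothetical paths following a <root>/<class name>/<file> layout.
image_paths = [
    os.path.join("train", "daisy", "image_0001.jpg"),
    os.path.join("train", "tulip", "image_0002.jpg"),
    os.path.join("train", "daisy", "image_0003.jpg"),
    os.path.join("train", "bluebell", "image_0004.jpg"),
]

# The second-to-last path component is the class directory.
labels = [p.split(os.path.sep)[-2] for p in image_paths]

# Assign integer ids in sorted class order.
classes = sorted(set(labels))
encoded = [classes.index(label) for label in labels]

print(labels)   # ['daisy', 'tulip', 'daisy', 'bluebell']
print(classes)  # ['bluebell', 'daisy', 'tulip']
print(encoded)  # [1, 2, 1, 0]
```

This only works if the dataset directory really is organized as one subdirectory per class.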
Note that the file feature.hdf5 is generated automatically by the program and does not need to be created manually, but its directory must exist beforehand. For example, /Users/lingg/PycharmProjects/DLstudy/feature/flower-17/hdf5/ in this article has to be created in advance, otherwise the program will report that the target path does not exist.
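If you would rather not create the directory by hand, one option is to create it from Python before opening the file. This is a small sketch rather than part of the original script, and it uses a temporary directory as a placeholder for your own output location.

```python
import os
import tempfile

# Placeholder base directory; substitute your own project path.
base = tempfile.mkdtemp()
output = os.path.join(base, "feature", "flower-17", "hdf5", "feature.hdf5")

# Create all missing parent directories; do nothing if they already exist.
os.makedirs(os.path.dirname(output), exist_ok=True)
```

Only the directory is created here. The .hdf5 file itself must still not exist, because the writer refuses to overwrite an existing file.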
In addition, because feature extraction takes a long time, we use the progressbar package to display its progress. You can install it with the following command:
    pip install progressbar
Of course, you can leave it out if you don't want it. It is not required.
After running the program, we can see the feature.hdf5 file in the corresponding directory.
This file holds the features extracted from the flower-17 dataset, which we will use to train the classifier in the following steps.
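The feature length 512 * 7 * 7 used in the script follows from the VGG16 architecture: with the fully connected head removed, a 224x224 input passes through five 2x2 max-pooling layers, and the last convolutional block has 512 channels. The short calculation below works through that arithmetic and also estimates the storage per image, given that the writer creates the dataset with dtype="float" (64-bit floats).

```python
input_size = 224      # height and width fed to the network
num_pool_layers = 5   # each 2x2 max-pool halves height and width
final_channels = 512  # channels in VGG16's last convolutional block

spatial = input_size
for _ in range(num_pool_layers):
    spatial //= 2     # 224 -> 112 -> 56 -> 28 -> 14 -> 7

feature_dim = final_channels * spatial * spatial
bytes_per_image = feature_dim * 8  # 8 bytes per 64-bit float

print(spatial)          # 7
print(feature_dim)      # 25088
print(bytes_per_image)  # 200704
```

At roughly 200 KB per image, the feature file grows quickly with dataset size, which is one reason to keep it on disk in HDF5 rather than in memory.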
Training a classifier on the extracted features
As we all know, training a simple linear classifier directly on raw image pixels does not give good results. What if we train it on the features extracted above?
Let's find out.
Create a file named train_model.py and write the following code:
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import GridSearchCV
    from sklearn.metrics import classification_report
    import pickle
    import h5py
    import numpy as np

    db = "/Users/lingg/PycharmProjects/DLstudy/feature/flower-17/hdf5/feature.hdf5"
    model_path = "/Users/liushanlin/PycharmProjects/DLstudy/model/animals.cpickle"
    jobs = -1

    # Open the feature file; the first 75% of rows are used for training
    # (the image paths were shuffled before extraction)
    db = h5py.File(db, "r")
    i = int(db["labels"].shape[0] * 0.75)

    print("[INFO] tuning hyperparameters...")
    params = {"C": [0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]}
    model = GridSearchCV(LogisticRegression(), params, cv=3, n_jobs=jobs)

    trainX = db["features"][:i]
    trainY = db["labels"][:i]
    testX = db["features"][i:]
    testY = db["labels"][i:]

    # Class names may come back as bytes or str depending on the h5py version
    targets = np.array([t.decode("utf-8") if isinstance(t, bytes) else str(t)
                        for t in db["label_names"][:]])
    print(targets)

    model.fit(trainX, trainY)
    print("[INFO] best hypermeters:{}".format(model.best_params_))

    print("[INFO] evaluating...")
    preds = model.predict(testX)
    print(classification_report(testY, preds, target_names=targets))

    # Serialize the best model found by the grid search
    print("[INFO] saving model...")
    with open(model_path, "wb") as f:
        f.write(pickle.dumps(model.best_estimator_))

    db.close()
In this script:
- The variable db is the path of the feature file produced in the previous step.
- The variable model_path is the path where the trained linear classifier is serialized and stored.
We classify the features with logistic regression and use GridSearchCV, which runs a cross-validated grid search to choose the best hyperparameters. Six candidate values of C are compared: {"C": [0.1, 1.0, 10.0, 100.0, 1000.0, 10000.0]}, and the search reports the one that performs best. For details on its usage, please refer to the official documentation.
A problem you may run into when using GridSearchCV in this code is the following error: 'ascii' codec can't encode characters in position 18-20
Output:
    [INFO] tuning hyperparameters...
    ['bluebell' 'buttercup' 'colts_foot' 'cowslip' 'crocus' 'daffodil' 'daisy'
     'dandelion' 'fritillary' 'iris' 'lily_valley' 'pansy' 'snowdrop'
     'sunflower' 'tigerlily' 'tulip' 'windflower']
    [INFO] best hypermeters:{'C': 1.0}
    [INFO] evaluating...
                  precision    recall  f1-score   support

        bluebell       0.96      0.96      0.96        26
       buttercup       0.93      1.00      0.97        14
      colts_foot       1.00      0.94      0.97        17
         cowslip       0.74      0.88      0.80        16
          crocus       0.74      0.93      0.82        15
        daffodil       0.88      0.94      0.91        16
           daisy       0.94      0.89      0.92        19
       dandelion       0.94      0.88      0.91        17
      fritillary       0.94      0.89      0.91        18
            iris       1.00      0.89      0.94        19
     lily_valley       0.83      0.94      0.88        16
           pansy       1.00      0.88      0.93        16
        snowdrop       0.62      0.81      0.70        16
       sunflower       1.00      1.00      1.00        16
       tigerlily       1.00      1.00      1.00        22
           tulip       1.00      0.58      0.73        19
      windflower       0.94      0.94      0.94        16

        accuracy                           0.90       298
       macro avg       0.91      0.90      0.90       298
    weighted avg       0.92      0.90      0.90       298

    [INFO] saving model...

    Process finished with exit code 0
It can be seen that grid search selected C=1.0 as the best value.
More surprisingly, simple logistic regression reached 90% accuracy on the 298 held-out images, thanks to the features extracted by VGG16.
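The grid search step can also be tried in isolation, without the feature file. The sketch below runs the same kind of search on a small synthetic dataset from scikit-learn's make_classification. The dataset, the shortened list of C values and max_iter=1000 are choices made for this example only, so its results say nothing about the flower-17 numbers above.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic 3-class data standing in for the extracted features.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           n_classes=3, random_state=0)

# Same 75/25 split convention as train_model.py.
split = int(len(X) * 0.75)

# Each candidate C is scored with 3-fold cross-validation on the training part.
params = {"C": [0.1, 1.0, 10.0]}
search = GridSearchCV(LogisticRegression(max_iter=1000), params, cv=3)
search.fit(X[:split], y[:split])

best_c = search.best_params_["C"]
# After fitting, the search object predicts with the best model refit on all training rows.
test_accuracy = search.score(X[split:], y[split:])

print(best_c, test_accuracy)
```

Smaller values of C mean stronger regularization, so the search is effectively choosing how much to constrain the classifier's weights.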