My blog https://blog.justlovesmile.top/
Object detection is an important research direction in computer vision. It aims to detect instances of specific visual object categories in digital images. As one of the fundamental problems of computer vision, object detection is the basis and premise of many other computer vision tasks, such as image caption generation, instance segmentation, and object tracking. When solving such problems, we often need to use our own scripts or annotation tools to generate datasets, and the resulting dataset formats are often diverse. Therefore, to make training easier, most object detection frameworks support several common dataset annotation formats by default, including COCO, Pascal VOC, and YOLO. This article introduces these dataset formats together with the Python conversion scripts I wrote (which generally need to be adapted to your actual situation).
1. COCO
1.1 COCO dataset format
COCO (Common Objects in Context) is a large-scale dataset suitable for object detection, image segmentation, and image captioning tasks. Its annotation format is one of the most commonly used formats. At present, the COCO2017 dataset is widely used. Its official website is COCO - Common Objects in Context (cocodataset.org).
The COCO dataset mainly contains images (jpg, png, etc.) and annotation files (json). The directory layout is as follows (a trailing / denotes a folder):
```
-coco/
  |-train2017/
    |-1.jpg
    |-2.jpg
  |-val2017/
    |-3.jpg
    |-4.jpg
  |-test2017/
    |-5.jpg
    |-6.jpg
  |-annotations/
    |-instances_train2017.json
    |-instances_val2017.json
    |-*.json
```
The train2017 and val2017 folders store the images of the training set and the validation set, while the test2017 folder stores the test set, which may consist of images only or of images and labels that are generally used separately.
The files in the annotations folder are the annotation files. If you have xml files, you usually need to convert them to json format, which looks as follows (for more details, please refer to the official website):
{ "info": info, "images": [image], //list "annotations": [annotation], //list "categories": [category], //list "licenses": [license], //list }
info holds information about the whole dataset, including year, version, description, etc. If you only want to complete a training task, it is not very important, as shown below:
```json
// Not that important for training
info{
    "year": int,
    "version": str,
    "description": str,
    "contributor": str,
    "url": str,
    "date_created": datetime,
}
```
image holds the basic information of an image, including its id, width, height, file name, etc. The id must correspond to the image_id referenced in the annotations below, as shown below:
image{ "id": int, //necessary "width": int, //necessary "height": int, //necessary "file_name": str, //necessary "license": int, "flickr_url": str, "coco_url": str, "date_captured": datetime, }
annotation is the most important part: the annotation information itself, including the annotation id, image id, category id, and so on, as shown below:
annotation{ "id": int, //Label id "image_id": int, //Image id "category_id": int, //Category id "segmentation": RLE or [polygon], //Image segmentation annotation "area": float, //Area "bbox": [x,y,width,height], //Coordinates of the upper left corner of the target box and width and height "iscrowd": 0 or 1, //Is it dense }
category represents the category information, including the parent category, the category id, and the category name, as shown below:
category{ "id": int, //Category serial number "name": str, //Category name "supercategory": str, //Parent category }
license represents the license information of the dataset, including its id, name, and URL, as shown below:
```json
// Not important for training
license{
    "id": int,
    "name": str,
    "url": str,
}
```
Next, let's look at a simple example:
{ "info": {slightly}, "images": [{"id": 1, "file_name": "1.jpg", "height": 334, "width": 500}, {"id": 2, "file_name": "2.jpg", "height": 445, "width": 556}], "annotations": [{"id": 1, "area": 40448, "iscrowd": 0, "image_id": 1, "bbox": [246, 61, 128, 316], "category_id": 3, "segmentation": []}, {"id": 2, "area": 40448, "iscrowd": 0, "image_id": 1, "bbox": [246, 61, 128, 316], "category_id": 2, "segmentation": []}, {"id": 3, "area": 40448, "iscrowd": 0, "image_id": 2, "bbox": [246, 61, 128, 316], "category_id": 1, "segmentation": []}], "categories": [{"supercategory": "none", "id": 1, "name": "liner"},{"supercategory": "none", "id": 2, "name": "containership"},{"supercategory": "none", "id": 3, "name": "bulkcarrier"}], "licenses": [{slightly}] }
1.2 COCO conversion script
The Python conversion script is as follows; the images and the xml annotation files need to be prepared in advance:
```python
# -*- coding: utf-8 -*-
# @Author : justlovesmile
# @Date   : 2021/9/8 15:36
import os, random, json
import shutil as sh
from tqdm.auto import tqdm
import xml.etree.ElementTree as xmlET


def mkdir(path):
    if not os.path.exists(path):
        os.makedirs(path)
        return True
    else:
        print(f"The path ({path}) already exists.")
        return False


def readxml(file):
    tree = xmlET.parse(file)
    # image size fields
    size = tree.find('size')
    width = int(size.find('width').text)
    height = int(size.find('height').text)
    # object fields
    objs = tree.findall('object')
    bndbox = []
    for obj in objs:
        label = obj.find("name").text
        bnd = obj.find("bndbox")
        xmin = int(bnd.find("xmin").text)
        ymin = int(bnd.find("ymin").text)
        xmax = int(bnd.find("xmax").text)
        ymax = int(bnd.find("ymax").text)
        bndbox.append([xmin, ymin, xmax, ymax, label])
    return [[width, height], bndbox]


def tococo(xml_root, image_root, output_root, classes={}, errorId=[], train_percent=0.9):
    # assert
    assert train_percent <= 1 and len(classes) > 0
    # define the root paths
    train_root = os.path.join(output_root, "train2017")
    val_root = os.path.join(output_root, "val2017")
    ann_root = os.path.join(output_root, "annotations")
    # initialize train and val dicts
    train_content = {
        "images": [],       # {"file_name": "09780.jpg", "height": 334, "width": 500, "id": 9780}
        "annotations": [],  # {"area": 40448, "iscrowd": 0, "image_id": 1, "bbox": [246, 61, 128, 316], "category_id": 5, "id": 1, "segmentation": []}
        "categories": []    # {"supercategory": "none", "id": 1, "name": "liner"}
    }
    val_content = {
        "images": [],
        "annotations": [],
        "categories": []
    }
    train_json = 'instances_train2017.json'
    val_json = 'instances_val2017.json'
    # divide the trainset and valset
    images = os.listdir(image_root)
    total_num = len(images)
    train_num = int(total_num * train_percent)
    train_file = sorted(random.sample(images, train_num))
    if mkdir(output_root):
        if mkdir(train_root) and mkdir(val_root) and mkdir(ann_root):
            idx1, idx2, dx1, dx2 = 0, 0, 0, 0
            for file in tqdm(images):
                name = os.path.splitext(os.path.basename(file))[0]
                if name not in errorId:
                    res = readxml(os.path.join(xml_root, name + '.xml'))
                    if file in train_file:
                        idx1 += 1
                        sh.copy(os.path.join(image_root, file), train_root)
                        train_content['images'].append(
                            {"file_name": file, "width": res[0][0], "height": res[0][1], "id": idx1})
                        for b in res[1]:
                            dx1 += 1
                            x, y = b[0], b[1]
                            w, h = b[2] - b[0], b[3] - b[1]
                            train_content['annotations'].append(
                                {"area": w * h, "iscrowd": 0, "image_id": idx1, "bbox": [x, y, w, h],
                                 "category_id": classes[b[4]], "id": dx1, "segmentation": []})
                    else:
                        idx2 += 1
                        sh.copy(os.path.join(image_root, file), val_root)
                        val_content['images'].append(
                            {"file_name": file, "width": res[0][0], "height": res[0][1], "id": idx2})
                        for b in res[1]:
                            dx2 += 1
                            x, y = b[0], b[1]
                            w, h = b[2] - b[0], b[3] - b[1]
                            val_content['annotations'].append(
                                {"area": w * h, "iscrowd": 0, "image_id": idx2, "bbox": [x, y, w, h],
                                 "category_id": classes[b[4]], "id": dx2, "segmentation": []})
            for i, j in classes.items():
                train_content['categories'].append({"supercategory": "none", "id": j, "name": i})
                val_content['categories'].append({"supercategory": "none", "id": j, "name": i})
            with open(os.path.join(ann_root, train_json), 'w') as f:
                json.dump(train_content, f)
            with open(os.path.join(ann_root, val_json), 'w') as f:
                json.dump(val_content, f)
            print("Number of Train Images:", len(os.listdir(train_root)))
            print("Number of Val Images:", len(os.listdir(val_root)))


def test():
    box_root = "E:/MyProject/Dataset/hwtest/annotations"  # xml folder
    image_root = "E:/MyProject/Dataset/hwtest/images"     # image folder
    output_root = "E:/MyProject/Dataset/coco"             # output folder
    classes = {"liner": 0, "bulk carrier": 1, "warship": 2, "sailboat": 3, "canoe": 4,
               "container ship": 5, "fishing boat": 6}    # category dictionary
    errorId = []          # ids of dirty data to skip
    train_percent = 0.9   # proportion of the training set vs. validation set
    tococo(box_root, image_root, output_root, classes=classes, errorId=errorId, train_percent=train_percent)


if __name__ == "__main__":
    test()
```
2. VOC
2.1 VOC dataset format
The VOC (Visual Object Classes) dataset comes from the PASCAL VOC challenge. Its main tasks include Object Classification, Object Detection, Object Segmentation, Human Layout, and Action Classification. Its official website is The PASCAL Visual Object Classes Homepage (ox.ac.uk). The most commonly used versions are VOC2007 and VOC2012.
The VOC dataset mainly contains images (jpg or png, etc.) and annotation files (xml). The directory layout is as follows (a trailing / denotes a folder):
```
-VOC/
  |-JPEGImages/
    |-1.jpg
    |-2.jpg
  |-Annotations/
    |-1.xml
    |-2.xml
  |-ImageSets/
    |-Layout/
      |-*.txt
    |-Main/
      |-train.txt
      |-val.txt
      |-trainval.txt
      |-test.txt
    |-Segmentation/
      |-*.txt
    |-Action/
      |-*.txt
  |-SegmentationClass/
  |-SegmentationObject/
```
For object detection tasks, the most commonly used and necessary folders are JPEGImages, Annotations, and ImageSets/Main; the txt files under Main simply record which images belong to each split.
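Each of these txt files lists one image id per line, i.e. the file name without its extension. For example, a train.txt might look like this (the ids here are only illustrative):

```
000032
000045
000063
```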
The images are stored in JPEGImages, while the xml annotation files are stored in Annotations. The content of an annotation file looks as follows:
```xml
<annotation>
    <folder>VOC</folder>            # image folder
    <filename>000032.jpg</filename> # image file name
    <source>                        # image source
        <database>The VOC Database</database>
        <annotation>PASCAL VOC</annotation>
        <image>flickr</image>
    </source>
    <size>                          # image size information
        <width>500</width>          # image width
        <height>281</height>        # image height
        <depth>3</depth>            # number of image channels
    </size>
    <segmented>0</segmented>        # whether the image is used for segmentation; 0 means no, which does not matter for object detection
    <object>                        # information about one target object
        <name>aeroplane</name>      # class name of the target
        <pose>Frontal</pose>        # shooting angle; generally Unspecified if unknown
        <truncated>0</truncated>    # whether the object is truncated; 0 means complete, not truncated
        <difficult>0</difficult>    # whether the object is difficult to recognize; 0 means not difficult
        <bndbox>                    # bounding box information
            <xmin>104</xmin>        # top-left corner x
            <ymin>78</ymin>         # top-left corner y
            <xmax>375</xmax>        # bottom-right corner x
            <ymax>183</ymax>        # bottom-right corner y
        </bndbox>
    </object>
    <object>                        # information about other objects, omitted here
        ...
    </object>
</annotation>
```
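Such xml files are easy to read with Python's built-in xml.etree.ElementTree module. Below is a minimal sketch (essentially the same logic as the readxml function in the COCO script above) that extracts the image size and all bounding boxes from one annotation file; the file path is only an example.

```python
import xml.etree.ElementTree as ET

# Parse one VOC annotation file (the path is only an example)
root = ET.parse("Annotations/000032.xml").getroot()

# Image size
size = root.find("size")
width, height = int(size.find("width").text), int(size.find("height").text)

# All objects with their class name and corner coordinates
boxes = []
for obj in root.findall("object"):
    name = obj.find("name").text
    bnd = obj.find("bndbox")
    box = [int(bnd.find(tag).text) for tag in ("xmin", "ymin", "xmax", "ymax")]
    boxes.append((name, box))

print(width, height, boxes)
```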
2.2 VOC conversion script
The following script is only applicable when the images and xml files already exist; a script that converts COCO json back to VOC xml still needs to be written separately:
```python
# -*- coding: utf-8 -*-
# @Author : justlovesmile
# @Date   : 2021/9/8 21:01
import os, random
from tqdm.auto import tqdm
import shutil as sh


def mkdir(path):
    if not os.path.exists(path):
        os.mkdir(path)
        return True
    else:
        print(f"The path ({path}) already exists.")
        return False


def tovoc(xmlroot, imgroot, saveroot, errorId=[], classes={}, tvp=1.0, trp=0.9):
    '''
    Parameters:
        xmlroot : folder containing the xml annotation files
        imgroot : folder containing the images
        saveroot: root directory in which the VOC-format dataset is saved
    Function: load the data and save it in VOC format:
        VOC/
            Annotations/
                - **.xml
            JPEGImages/
                - **.jpg
            ImageSets/
                Main/
                    - train.txt
                    - test.txt
                    - val.txt
                    - trainval.txt
    '''
    # assert
    assert len(classes) > 0
    # init paths
    VOC = saveroot
    ann_path = os.path.join(VOC, 'Annotations')
    img_path = os.path.join(VOC, 'JPEGImages')
    set_path = os.path.join(VOC, 'ImageSets')
    txt_path = os.path.join(set_path, 'Main')
    # mkdirs
    if mkdir(VOC):
        if mkdir(ann_path) and mkdir(img_path) and mkdir(set_path):
            mkdir(txt_path)
    images = os.listdir(imgroot)
    list_index = range(len(images))
    # trainval and test split
    trainval_percent = tvp
    train_percent = trp
    val_percent = 1 - train_percent if train_percent < 1 else 0.1
    total_num = len(images)
    trainval_num = int(total_num * trainval_percent)
    train_num = int(trainval_num * train_percent)
    val_num = int(trainval_num * val_percent) if train_percent < 1 else 0
    trainval = random.sample(list_index, trainval_num)
    # sample train and val from the trainval subset, so they never overlap with the test set
    train = random.sample(trainval, train_num)
    val = random.sample(trainval, val_num)
    for i in tqdm(list_index):
        imgfile = images[i]
        img_id = os.path.splitext(os.path.basename(imgfile))[0]
        xmlfile = img_id + ".xml"
        sh.copy(os.path.join(imgroot, imgfile), os.path.join(img_path, imgfile))
        sh.copy(os.path.join(xmlroot, xmlfile), os.path.join(ann_path, xmlfile))
        if img_id not in errorId:
            if i in trainval:
                with open(os.path.join(txt_path, 'trainval.txt'), 'a') as f:
                    f.write(img_id + '\n')
                if i in train:
                    with open(os.path.join(txt_path, 'train.txt'), 'a') as f:
                        f.write(img_id + '\n')
                else:
                    with open(os.path.join(txt_path, 'val.txt'), 'a') as f:
                        f.write(img_id + '\n')
                if train_percent == 1 and i in val:
                    with open(os.path.join(txt_path, 'val.txt'), 'a') as f:
                        f.write(img_id + '\n')
            else:
                with open(os.path.join(txt_path, 'test.txt'), 'a') as f:
                    f.write(img_id + '\n')
    # end
    print("Dataset to VOC format finished!")


def test():
    box_root = "E:/MyProject/Dataset/hwtest/annotations"
    image_root = "E:/MyProject/Dataset/hwtest/images"
    output_root = "E:/MyProject/Dataset/voc"
    classes = {"liner": 0, "bulk carrier": 1, "warship": 2, "sailboat": 3, "canoe": 4,
               "container ship": 5, "fishing boat": 6}
    errorId = []
    train_percent = 0.9
    tovoc(box_root, image_root, output_root, errorId, classes, trp=train_percent)


if __name__ == "__main__":
    test()
```
3. YOLO
3.1 YOLO dataset format
The YOLO dataset format is mainly used to train YOLO models. There are no fixed requirements for the directory structure, because the data paths can be specified in the model's configuration file. The only thing to note is that YOLO annotations store the position of each bounding box in normalized form (normalized here means divided by the image width and height), one object per line, as shown below:
```
{class id} {normalized box center x} {normalized box center y} {normalized box width w} {normalized box height h}
```
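As a quick worked example, take the aeroplane box from the VOC xml above (xmin=104, ymin=78, xmax=375, ymax=183 in a 500x281 image). The following minimal sketch performs the same arithmetic as the xml2yolo function in the script below; the class id 0 is only an assumption for illustration.

```python
# Corner coordinates from the VOC example above, image size 500 x 281
xmin, ymin, xmax, ymax = 104, 78, 375, 183
img_w, img_h = 500, 281

# YOLO format: normalized center x/y and normalized width/height
cx = (xmin + xmax) / 2 / img_w   # 0.479
cy = (ymin + ymax) / 2 / img_h   # ~0.4644
w = (xmax - xmin) / img_w        # 0.542
h = (ymax - ymin) / img_h        # ~0.3737

# class id 0 is an assumption for illustration
print(f"0 {cx:.6f} {cy:.6f} {w:.6f} {h:.6f}")
```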
3.2 YOLO conversion script
The Python conversion script is as follows:
```python
# -*- coding: utf-8 -*-
# @Author : justlovesmile
# @Date   : 2021/9/8 20:28
import os
import random
from tqdm.auto import tqdm
import shutil as sh
try:
    import xml.etree.cElementTree as et
except ImportError:
    import xml.etree.ElementTree as et


def mkdir(path):
    if not os.path.exists(path):
        os.makedirs(path)
        return True
    else:
        print(f"The path ({path}) already exists.")
        return False


def xml2yolo(xmlpath, savepath, classes={}):
    namemap = classes
    # try:
    #     with open('classes_yolo.json', 'r') as f:
    #         namemap = json.load(f)
    # except:
    #     pass
    rt = et.parse(xmlpath).getroot()
    w = int(rt.find("size").find("width").text)
    h = int(rt.find("size").find("height").text)
    with open(savepath, "w") as f:
        for obj in rt.findall("object"):
            name = obj.find("name").text
            xmin = int(obj.find("bndbox").find("xmin").text)
            ymin = int(obj.find("bndbox").find("ymin").text)
            xmax = int(obj.find("bndbox").find("xmax").text)
            ymax = int(obj.find("bndbox").find("ymax").text)
            f.write(
                f"{namemap[name]} {(xmin + xmax) / w / 2.} {(ymin + ymax) / h / 2.} "
                f"{(xmax - xmin) / w} {(ymax - ymin) / h}" + "\n"
            )


def trainval(xmlroot, imgroot, saveroot, errorId=[], classes={}, tvp=1.0, trp=0.9):
    # assert
    assert tvp <= 1.0 and trp <= 1.0 and len(classes) > 0
    # create dirs
    imglabel = ['images', 'labels']
    trainvaltest = ['train', 'val', 'test']
    mkdir(saveroot)
    for r in imglabel:
        mkdir(os.path.join(saveroot, r))
        for s in trainvaltest:
            mkdir(os.path.join(saveroot, r, s))
    # train / val split
    trainval_percent = tvp
    train_percent = trp
    val_percent = 1 - train_percent if train_percent < 1.0 else 0.15
    total_img = os.listdir(imgroot)
    num = len(total_img)
    list_index = range(num)
    tv = int(num * trainval_percent)
    tr = int(tv * train_percent)
    va = int(tv * val_percent)
    trainval = random.sample(list_index, tv)  # trainset and valset
    train = random.sample(trainval, tr)       # trainset
    val = random.sample(trainval, va)         # valset, used only when train_percent == 1
    print(f"trainval_percent:{trainval_percent},train_percent:{train_percent},val_percent:{val_percent}")
    for i in tqdm(list_index):
        name = total_img[i]
        op = os.path.join(imgroot, name)
        file_id = os.path.splitext(os.path.basename(name))[0]
        if file_id not in errorId:
            xmlp = os.path.join(xmlroot, file_id + '.xml')
            if i in trainval:  # trainset and valset
                if i in train:
                    sp = os.path.join(saveroot, "images", "train", name)
                    xml2yolo(xmlp, os.path.join(saveroot, "labels", "train", file_id + '.txt'), classes)
                    sh.copy(op, sp)
                else:
                    sp = os.path.join(saveroot, "images", "val", name)
                    xml2yolo(xmlp, os.path.join(saveroot, "labels", "val", file_id + '.txt'), classes)
                    sh.copy(op, sp)
                if train_percent == 1.0 and i in val:
                    sp = os.path.join(saveroot, "images", "val", name)
                    xml2yolo(xmlp, os.path.join(saveroot, "labels", "val", file_id + '.txt'), classes)
                    sh.copy(op, sp)
            else:  # testset
                sp = os.path.join(saveroot, "images", "test", name)
                xml2yolo(xmlp, os.path.join(saveroot, "labels", "test", file_id + '.txt'), classes)
                sh.copy(op, sp)


def maketxt(dir, saveroot, filename):
    savetxt = os.path.join(saveroot, filename)
    with open(savetxt, 'w') as f:
        for i in tqdm(os.listdir(dir)):
            f.write(os.path.join(dir, i) + '\n')


def toyolo(xmlroot, imgroot, saveroot, errorId=[], classes={}, tvp=1, train_percent=0.9):
    # toyolo main function
    trainval(xmlroot, imgroot, saveroot, errorId, classes, tvp, train_percent)
    maketxt(os.path.join(saveroot, "images", "train"), saveroot, "train.txt")
    maketxt(os.path.join(saveroot, "images", "val"), saveroot, "val.txt")
    maketxt(os.path.join(saveroot, "images", "test"), saveroot, "test.txt")
    print("Dataset to yolo format success.")


def test():
    box_root = "E:/MyProject/Dataset/hwtest/annotations"
    image_root = "E:/MyProject/Dataset/hwtest/images"
    output_root = "E:/MyProject/Dataset/yolo"
    classes = {"liner": 0, "bulk carrier": 1, "warship": 2, "sailboat": 3, "canoe": 4,
               "container ship": 5, "fishing boat": 6}
    errorId = []
    train_percent = 0.9
    toyolo(box_root, image_root, output_root, errorId, classes, train_percent=train_percent)


if __name__ == "__main__":
    test()
```
After running this script, the following structure will be generated in the output folder:
```
-yolo/
  |-images/
    |-train/
      |-1.jpg
      |-2.jpg
    |-test/
      |-3.jpg
      |-4.jpg
    |-val/
      |-5.jpg
      |-6.jpg
  |-labels/
    |-train/
      |-1.txt
      |-2.txt
    |-test/
      |-3.txt
      |-4.txt
    |-val/
      |-5.txt
      |-6.txt
  |-train.txt
  |-test.txt
  |-val.txt
```
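To check that a generated label file matches its image, the normalized values can be mapped back to pixel coordinates. The following is a minimal sketch that simply reverses the normalization above; the label file path and the image size are assumptions for illustration.

```python
# Read one YOLO label file and convert each box back to pixel corner coordinates.
# The file path and the image size are assumptions for illustration.
img_w, img_h = 500, 281

with open("labels/train/1.txt") as f:
    for line in f:
        cls, cx, cy, w, h = line.split()
        # undo the normalization
        cx, cy = float(cx) * img_w, float(cy) * img_h
        w, h = float(w) * img_w, float(h) * img_h
        # center/width/height back to corner coordinates
        xmin, ymin = cx - w / 2, cy - h / 2
        xmax, ymax = cx + w / 2, cy + h / 2
        print(f"class {cls}: ({xmin:.0f}, {ymin:.0f}) - ({xmax:.0f}, {ymax:.0f})")
```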