Real-time Facial Surface Geometry from Monocular Video on Mobile GPUs
https://arxiv.org/pdf/1907.06724.pdf
https://github.com/thepowerfuldeez/facemesh.pytorch
Introduction to face point cloud extraction (FaceMesh)
In 2019, a paper on real-time 3D face point cloud extraction on mobile devices was published; many mobile AR applications have since used it as the underlying algorithm for face detection and 3D point cloud generation. The paper is titled "Real-time Facial Surface Geometry from Monocular Video on Mobile GPUs". A PyTorch implementation is available on GitHub:
https://github.com/thepowerfuldeez/facemesh.pytorch
The pretrained model file (facemesh.pth) is provided and can be opened and inspected with Netron. Screenshots of its input and output are shown below:
The final output is the face point cloud: 468 3D coordinates. The input is a face ROI resized to 192x192. The model can be converted to ONNX format using PyTorch's built-in export support. The conversion script is as follows:
from facemesh import FaceMesh
import torch

# Load the pretrained weights and export to ONNX
net = FaceMesh()
net.load_weights("facemesh.pth")
torch.onnx.export(
    net,
    torch.randn(1, 3, 192, 192, device='cpu'),
    "facemesh.onnx",
    input_names=("image",),
    output_names=("preds", "confs"),
    opset_version=9,
)
This gives us the ONNX version of the model file.
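As a quick sanity check before deployment, the exported file can be validated and run once with ONNX Runtime. This is a minimal sketch, assuming the onnx and onnxruntime packages are installed; the input/output names match those passed to torch.onnx.export above:

import numpy as np
import onnx
import onnxruntime as ort

# Validate the graph structure (raises an exception if the model is malformed)
onnx.checker.check_model(onnx.load("facemesh.onnx"))

# Run one dummy inference and inspect the output shapes
sess = ort.InferenceSession("facemesh.onnx")
dummy = np.random.randn(1, 3, 192, 192).astype(np.float32)
preds, confs = sess.run(["preds", "confs"], {"image": dummy})
print(preds.shape, confs.shape)  # preds holds 468 x 3 = 1404 values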
Deploying FaceMesh inference with OpenVINO
Since version 2020.x, OpenVINO has supported reading ONNX model files directly for model loading and inference; here we take OpenVINO 2021.2 as an example. The basic idea is to first detect faces with one of OpenVINO's own face detection models, crop the face ROI, and then feed the ROI to the FaceMesh model to extract the 468-point 3D face surface point cloud. For face detection we chose the face-detection-0202 model, which is based on MobileNetV2 + SSD. Its input format is:
NCHW = 1x3x384x384
The output format is:
1x1xNx7
The order of channels is BGR
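Each of the N rows in the 1x1xNx7 output describes one detection in the standard OpenVINO SSD layout: [image_id, label, conf, x_min, y_min, x_max, y_max], with the four coordinates normalized to [0, 1]. A minimal decoding sketch (the helper name decode_detection is ours, not part of any API):

# Decode one detection row [image_id, label, conf, x_min, y_min, x_max, y_max]
def decode_detection(row, frame_w, frame_h, conf_threshold=0.75):
    if row[2] < conf_threshold:
        return None
    xmin = int(row[3] * frame_w)
    ymin = int(row[4] * frame_h)
    xmax = int(row[5] * frame_w)
    ymax = int(row[6] * frame_h)
    return xmin, ymin, xmax, ymax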
As can be seen from Figure 2, the input format of the FaceMesh 3D point cloud model is 1x3x192x192, and there are two output layers, preds and confs, where preds is the point cloud data and confs is the confidence. The 1404 values in preds are the 3D coordinates of the 468 points: 468x3 = 1404.
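The relationship between the flat preds tensor and the 468 points is just a reshape; a two-line sketch (preds here is a placeholder for the raw output tensor):

import numpy as np

preds = np.zeros((1, 1404), dtype=np.float32)  # placeholder for the raw "preds" output
points = np.reshape(preds, (-1, 3))            # -> (468, 3): x, y in 192x192 ROI space, plus depth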
Code demonstration: detailed steps and running results

Load the models and obtain their input and output information:
import time

import cv2 as cv
import numpy as np
from openvino.inference_engine import IECore

ie = IECore()

# Paths to the face-detection-0202 model files (adjust to your local setup)
model_xml = "face-detection-0202.xml"
model_bin = "face-detection-0202.bin"

# Load the face detection model
net = ie.read_network(model=model_xml, weights=model_bin)
input_blob = next(iter(net.input_info))
out_blob = next(iter(net.outputs))
# Input format of the face detection model
n, c, h, w = net.input_info[input_blob].input_data.shape
print(n, c, h, w)
exec_net = ie.load_network(network=net, device_name="CPU")

# Load the face 3D point cloud (FaceMesh) model directly from ONNX
face_mesh_onnx = "facemesh.onnx"
mesh_face_net = ie.read_network(model=face_mesh_onnx)
# Input format of the FaceMesh model
em_input_blob = next(iter(mesh_face_net.input_info))
en, ec, eh, ew = mesh_face_net.input_info[em_input_blob].input_data.shape
print(en, ec, eh, ew)
em_exec_net = ie.load_network(network=mesh_face_net, device_name="CPU")
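Before running inference, it can be worth confirming that OpenVINO sees the two FaceMesh output layers under the names used at export time; a small sketch using the mesh_face_net object from above:

# Should print "preds" and "confs" with their shapes
for name, info in mesh_face_net.outputs.items():
    print(name, info.shape)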
Run face detection to obtain the face ROI, then extract the 3D face point cloud data:
# Resize the input frame and run face detection
image = cv.resize(frame, (w, h))
image = image.transpose(2, 0, 1)  # HWC -> CHW
inf_start = time.time()
res = exec_net.infer(inputs={input_blob: [image]})
ih, iw, ic = frame.shape
res = res[out_blob]

# Parse the face detections and obtain the ROI
for obj in res[0][0]:
    if obj[2] > 0.75:
        xmin = int(obj[3] * iw)
        ymin = int(obj[4] * ih)
        xmax = int(obj[5] * iw)
        ymax = int(obj[6] * ih)
        # Clamp the box to the image
        xmin, ymin = max(xmin, 0), max(ymin, 0)
        xmax, ymax = min(xmax, iw - 1), min(ymax, ih - 1)

        # Crop the face ROI and extract the 3D surface point cloud
        roi = frame[ymin:ymax, xmin:xmax, :]
        roi_img = cv.resize(roi, (ew, eh))
        roi_img = np.float32(roi_img) / 127.5
        roi_img = roi_img.transpose(2, 0, 1)
        em_res = em_exec_net.infer(inputs={em_input_blob: [roi_img]})

        # Reshape to 468 3D points and draw them back on the frame
        prob_mesh = em_res["preds"]
        prob_mesh = np.reshape(prob_mesh, (-1, 3))
        cv.rectangle(frame, (xmin, ymin), (xmax, ymax), (0, 255, 255), 2, 8)
        sx, sy = ew / roi.shape[1], eh / roi.shape[0]
        for i in range(prob_mesh.shape[0]):
            x, y = int(prob_mesh[i, 0] / sx), int(prob_mesh[i, 1] / sy)
            cv.circle(frame, (xmin + x, ymin + y), 1, (0, 0, 255), 1)

# Compute the frame rate and display the point cloud result
inf_end = time.time() - inf_start
cv.putText(frame, "infer time(ms): %.3f, FPS: %.2f" % (inf_end * 1000, 1 / inf_end),
           (10, 50), cv.FONT_HERSHEY_SIMPLEX, 1.0, (255, 0, 255), 2, 8)
cv.imshow("Face Detection + 3D mesh", frame)
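The code above processes a single frame; a minimal capture loop that drives it might look like the following (a sketch assuming a webcam at index 0 and that the models are already loaded):

cap = cv.VideoCapture(0)
while True:
    ret, frame = cap.read()
    if not ret:
        break
    # ... run the per-frame face detection + FaceMesh code on `frame` here ...
    if cv.waitKey(1) & 0xFF == 27:  # press ESC to quit
        break
cap.release()
cv.destroyAllWindows()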
The running results are shown below: