Robust Video Matting in PyTorch, TensorFlow, TensorFlow.js, ONNX, CoreML!
Robust Video Matting (RVM)
This is the official GitHub repository for the paper Robust High-Resolution Video Matting with Temporal Guidance. RVM is designed specifically for robust human video matting. Unlike existing neural models, which process each frame as an independent image, RVM uses a recurrent neural network to keep a temporal memory while processing a video stream. RVM can perform real-time, high-resolution matting on any video, reaching 4K at 76 FPS and HD at 104 FPS on an Nvidia GTX 1080 Ti. This research project comes from ByteDance.
All footage shown in the video is available for download and can be used to test the model: Google Drive
- Webpage: run matting on your webcam in the browser and visualize the recurrent states inside the model.
- Colab: convert your own videos with our model.
We recommend the MobileNetV3 model for most use cases. The ResNet50 model is much larger and only slightly better. Our models are available for several frameworks; see the inference documentation for details.
|Framework||Download||Notes|
|PyTorch|| ||Official PyTorch model weights. Docs|
|TorchHub||No manual download required.||The easiest way to use the model in your PyTorch project. Docs|
|TorchScript|| ||If you need to run inference on mobile, consider exporting an int8-quantized model yourself. Docs|
|ONNX|| ||Tested on the CPU and CUDA backends of ONNX Runtime. The provided models use opset 12. Docs, exporter|
|TensorFlow|| ||TensorFlow 2 SavedModel format. Docs|
|TensorFlow.js||rvm_mobilenetv3_tfjs_int8.zip||Runs the model in a web page. Demo, starter code|
|CoreML|| ||CoreML models can only be exported at a fixed resolution; you can export other resolutions yourself. Supports iOS 13+. `s` denotes the downsample ratio. Docs, exporter|
- 1. Install the Python dependencies:
    pip install -r requirements_inference.txt
- 2. Load the model:
    import torch
    from model import MattingNetwork

    model = MattingNetwork('mobilenetv3').eval().cuda()  # Or "resnet50"
    model.load_state_dict(torch.load('rvm_mobilenetv3.pth'))
- 3. If you only need to convert videos, we provide a simple API:
    from inference import convert_video

    convert_video(
        model,                          # The model; it can be on any device (cpu or cuda)
        input_source='input.mp4',       # A video file or a folder containing an image sequence
        output_type='video',            # Either "video" or "png_sequence"
        output_composition='com.mp4',   # File path when exporting video; folder path when exporting a PNG sequence
        output_alpha="pha.mp4",         # [Optional] Output the alpha (transparency) prediction
        output_foreground="fgr.mp4",    # [Optional] Output the foreground prediction
        output_video_mbps=4,            # Video bitrate, used when exporting video
        downsample_ratio=None,          # Downsample ratio; tune it per video, or pass None to choose automatically
        seq_chunk=12,                   # Number of frames processed together for parallelism
    )
- 4. Or write your own inference code:
    import torch
    from torch.utils.data import DataLoader
    from torchvision.transforms import ToTensor
    from inference_utils import VideoReader, VideoWriter

    # `model` is the MattingNetwork loaded in step 2.
    reader = VideoReader('input.mp4', transform=ToTensor())
    writer = VideoWriter('output.mp4', frame_rate=30)

    bgr = torch.tensor([.47, 1, .6]).view(3, 1, 1).cuda()  # Green background
    rec = [None] * 4                                       # Initial recurrent states
    downsample_ratio = 0.25                                # Downsample ratio; adjust it for your video

    with torch.no_grad():
        for src in DataLoader(reader):                     # Input tensor, RGB channels, range 0 ~ 1
            # Pass the states from the previous frame in and collect the updated states
            fgr, pha, *rec = model(src.cuda(), *rec, downsample_ratio)
            com = fgr * pha + bgr * (1 - pha)              # Composite the foreground onto the green background
            writer.write(com)                              # Write the output frame
- 5. The models and the conversion API can also be loaded through TorchHub:
    # Load the model
    model = torch.hub.load("PeterL1n/RobustVideoMatting", "mobilenetv3")  # Or "resnet50"

    # Load the conversion API
    convert_video = torch.hub.load("PeterL1n/RobustVideoMatting", "converter")
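The hand-off of recurrent states in step 4 can be illustrated without the real network. The following is a minimal sketch, not the repository's code: `toy_model` and `composite` are hypothetical stand-ins that work on a single RGB pixel, and the toy states are plain counters, whereas the real model's states are tensors. It only shows how `*rec` passes four states in and collects four updated states back, and how the composite is formed.

```python
from typing import List, Optional, Tuple

Pixel = Tuple[float, float, float]


def toy_model(src: Pixel, r1, r2, r3, r4, downsample_ratio: float):
    """Hypothetical stand-in for the network: returns (fgr, pha, r1, r2, r3, r4).

    Each state is just a frame counter so the hand-off is easy to see.
    """
    states = [0 if r is None else r + 1 for r in (r1, r2, r3, r4)]
    fgr = src   # pretend the foreground color equals the input color
    pha = 0.5   # pretend every pixel is half transparent
    return (fgr, pha, *states)


def composite(fgr: Pixel, pha: float, bgr: Pixel) -> Tuple[float, ...]:
    """Blend foreground over background: fgr * pha + bgr * (1 - pha)."""
    return tuple(f * pha + b * (1 - pha) for f, b in zip(fgr, bgr))


bgr = (0.47, 1.0, 0.6)                  # green background, same values as step 4
rec: List[Optional[int]] = [None] * 4   # initial recurrent states
frames = [(1.0, 0.0, 0.0), (0.0, 0.0, 1.0), (1.0, 1.0, 1.0)]

outputs = []
for src in frames:
    # On the right, *rec unpacks the four states as arguments.
    # On the left, *rec collects the four updated states into a list.
    fgr, pha, *rec = toy_model(src, *rec, 0.25)
    outputs.append(composite(fgr, pha, bgr))

print(rec)
print(outputs[0])
```

After three frames each toy state has advanced twice from its initial value, which shows the states are carried from one frame to the next rather than reset to `None`.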
The inference documentation explains the downsample_ratio parameter, API usage, and advanced usage.
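To make the `seq_chunk` argument from step 3 more concrete, here is a minimal sketch of frame chunking. It assumes that `seq_chunk` means consecutive frames are grouped and processed together in groups of that size; `chunk_frames` is a hypothetical helper written for illustration and is not part of the repository.

```python
from typing import Iterator, List, Sequence


def chunk_frames(frames: Sequence[int], seq_chunk: int) -> Iterator[List[int]]:
    """Yield consecutive groups of at most `seq_chunk` frames."""
    if seq_chunk < 1:
        raise ValueError("seq_chunk must be at least 1")
    for start in range(0, len(frames), seq_chunk):
        yield list(frames[start:start + seq_chunk])


frame_indices = list(range(30))   # a hypothetical 30-frame clip
chunks = list(chunk_frames(frame_indices, seq_chunk=12))
chunk_sizes = [len(c) for c in chunks]

print(chunk_sizes)
```

Under this assumption a larger chunk gives the device more work per call, at the cost of more memory per call, while the last chunk may be shorter than the rest.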
Training and evaluation
Please refer to the training documentation (in English).
Speed is measured with inference_speed_test.py and is provided for reference.
|GPU||dType||HD (1920x1080)||4K (3840x2160)|
|RTX 3090||FP16||172 FPS||154 FPS|
|RTX 2060 Super||FP16||134 FPS||108 FPS|
|GTX 1080 Ti||FP32||104 FPS||74 FPS|
- Note 1: HD uses downsample_ratio=0.25 and 4K uses downsample_ratio=0.125. All tests use batch size 1 and frame chunk 1.
- Note 2: GPUs before the Turing architecture do not support FP16 inference, so the GTX 1080 Ti uses FP32.
- Note 3: We only measure tensor throughput. The provided video conversion script is much slower because it does not use hardware video encoding/decoding and does not transfer tensors on parallel threads. If you are interested in implementing hardware video encoding/decoding in Python, please refer to PyNvCodec.
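The figures in the table and notes can be turned into two quantities that are easier to reason about: the downsampled size implied by each downsample ratio, and the time spent per frame. This is a small arithmetic sketch, assuming the ratio scales width and height equally; the helper names are illustrative and do not come from the repository.

```python
# Rows from the speed table: (GPU, dtype, HD FPS, 4K FPS).
SPEED_TABLE = [
    ("RTX 3090", "FP16", 172, 154),
    ("RTX 2060 Super", "FP16", 134, 108),
    ("GTX 1080 Ti", "FP32", 104, 74),
]

# Resolutions and downsample ratios from Note 1: (width, height, ratio).
SETTINGS = {
    "HD": (1920, 1080, 0.25),
    "4K": (3840, 2160, 0.125),
}


def downsampled_size(width: int, height: int, ratio: float) -> tuple:
    """Size after scaling both sides by `ratio` (assumed uniform scaling)."""
    return round(width * ratio), round(height * ratio)


def frame_time_ms(fps: float) -> float:
    """Average time per frame in milliseconds at a given throughput."""
    return 1000.0 / fps


for name, (w, h, ratio) in SETTINGS.items():
    print(name, downsampled_size(w, h, ratio))

for gpu, dtype, hd_fps, uhd_fps in SPEED_TABLE:
    print(f"{gpu} ({dtype}): HD {frame_time_ms(hd_fps):.1f} ms, "
          f"4K {frame_time_ms(uhd_fps):.1f} ms per frame")
```

With these ratios both settings reduce to the same downsampled size, so the gap between the HD and 4K columns reflects the work done at full resolution rather than at the downsampled one.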