A simple Python tutorial for converting long video and long audio files into text

Originally, I just wanted to convert speech to text, but most of the blogs I found did not cover what I needed. The reasons are as follows:

1. The Baidu interface accepts audio in pcm format, while my own files are in mp3 format.

2. Baidu only accepts audio of up to 60 seconds, while my files are long and need to be cut into segments of at most 60 seconds to meet the requirements of Baidu speech recognition.

3. The result should be written out as an srt file, which is exactly what I wanted. It took me several days to find a blog that covers all of this, so I am sharing it with you. The blog is:

 Using Baidu speech recognition interface to convert speech into text tutorial - various streams - blog Garden
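
To make the 60-second limit concrete, here is a minimal sketch that estimates the duration of a raw pcm file from its size in bytes. It assumes the audio settings used by the ffmpeg command later in this post (16 kHz, 16-bit, mono). The function names are made up for illustration and are not part of any Baidu API.

```python
# Raw PCM has no header, so duration follows directly from the byte count.
# Assumed format: 16000 samples per second, 2 bytes per sample, 1 channel.
BYTES_PER_SECOND = 16000 * 2 * 1


def pcm_duration_seconds(num_bytes: int) -> float:
    """Return the duration in seconds of raw PCM audio of the given size."""
    return num_bytes / BYTES_PER_SECOND


def fits_limit(num_bytes: int, limit_seconds: float = 60.0) -> bool:
    """Return True if the audio is no longer than the given limit."""
    return pcm_duration_seconds(num_bytes) <= limit_seconds
```

For a file on disk, the byte count can be taken from os.path.getsize(path).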

This post focuses on converting mp3 files into the text of an srt file.
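
For readers who have not seen an srt file before, the sketch below builds a single subtitle entry: a sequence number, a time range in HH:MM:SS,mmm form, the text, and a blank line. The helper names srt_timestamp and srt_cue are hypothetical and only illustrate the layout.

```python
def srt_timestamp(msecs: int) -> str:
    """Format a time in milliseconds as HH:MM:SS,mmm."""
    hours, rest = divmod(msecs, 3_600_000)
    minutes, rest = divmod(rest, 60_000)
    seconds, millis = divmod(rest, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def srt_cue(index: int, start_ms: int, end_ms: int, text: str) -> str:
    """Build one srt entry, ending with the blank line that separates entries."""
    time_range = f"{srt_timestamp(start_ms)} --> {srt_timestamp(end_ms)}"
    return f"{index}\n{time_range}\n{text}\n\n"


print(srt_cue(1, 0, 1500, "hello"), end="")
```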

If you are starting from video files, first convert the video files into mp3 files. For how to do that, see my previous post, "Convert video files into voice files using Python" (CSDN blog).
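
As a rough sketch of that step, one option is to call ffmpeg from Python. This is not necessarily the method used in the previous post. It assumes that an ffmpeg executable is on the PATH and that it was built with the libmp3lame encoder; the function names are hypothetical.

```python
import subprocess


def build_extract_command(video_path, mp3_path, ffmpeg="ffmpeg"):
    """Build an ffmpeg command that drops the video stream and encodes the audio as mp3."""
    return [
        ffmpeg,
        "-y",                     # overwrite the output file if it exists
        "-i", video_path,         # input video file
        "-vn",                    # no video in the output
        "-acodec", "libmp3lame",  # mp3 encoder (assumed to be available)
        mp3_path,
    ]


def extract_audio(video_path, mp3_path, ffmpeg="ffmpeg"):
    """Run the command; raises CalledProcessError if ffmpeg fails."""
    subprocess.run(build_extract_command(video_path, mp3_path, ffmpeg), check=True)
```

Passing the command as a list avoids quoting problems when file names contain spaces.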

To obtain Baidu's APPID, KEY_ID, and so on, register through the console on the page below.

After registering, click the voice technology entry on the following page, then create the task and fill in the relevant information. Once that is done you will arrive at the final page, and you can then use the code for speech recognition.

Note: this Baidu interface is not free. You will have to pay for it later on.

The main code is as follows:

import base64
import json
import os
import time
import shutil
import requests

class BaiduVoiceToTxt():
    # Initialization function
    def __init__(self):
        # Location of the pcm file to be cut. This path is fixed by speech-vad-demo and cannot be changed
        self.pcm_path = ".\\speech-vad-demo\\pcm\\16k_1.pcm"
        # Directory the pcm files are written to after cutting. This path is fixed by speech-vad-demo and cannot be changed
        self.output_pcm_path = ".\\speech-vad-demo\\output_pcm\\"

    # The Baidu AI interface only accepts pcm format, so the file format needs to be converted
    # This function converts the mp3 file to be recognized into pcm format and outputs it as .\speech-vad-demo\pcm\16k_1.pcm
    def change_file_format(self,filepath):
        file_name = filepath
        # If .\speech-vad-demo\pcm\16k_1.pcm already exists, delete it first
        if os.path.isfile(f"{self.pcm_path}"):
            os.remove(f"{self.pcm_path}")
        # Call the system command to convert the file into pcm format and output it as .\speech-vad-demo\pcm\16k_1.pcm
        change_file_format_command = f".\\ffmpeg\\bin\\ffmpeg.exe -y  -i {file_name}  -acodec pcm_s16le -f s16le -ac 1 -ar 16000 {self.pcm_path}"
        os.system(change_file_format_command)

    # The Baidu AI interface accepts at most 60 seconds of audio, so the file needs to be cut
    # This function cuts .\speech-vad-demo\pcm\16k_1.pcm
    def devide_video(self):
        # If the output directory .\speech-vad-demo\output_pcm\ already exists, it probably contains files, so empty it first
        # To empty the directory, delete it and then create it again
        if os.path.isdir(f"{self.output_pcm_path}"):
            shutil.rmtree(f"{self.output_pcm_path}")
        os.mkdir(f"{self.output_pcm_path}")
        # vad-demo.exe uses the relative paths .\pcm and .\output_pcm, so switch the current working directory to .\speech-vad-demo first; otherwise vad-demo.exe cannot find the files
        os.chdir(".\\speech-vad-demo\\")
        # Run .\vad-demo.exe directly; by default it cuts .\pcm\16k_1.pcm and writes the result to the .\output_pcm directory
        devide_video_command = ".\\vad-demo.exe"
        os.system(devide_video_command)
        # Switch back to the original working directory
        os.chdir("..\\")

    # This function formats the time taken from the names of the files under .\speech-vad-demo\output_pcm\ as 00:00:00,000
    def format_time(self, msecs):
        # Milliseconds per hour
        hour_msecs = 60 * 60 * 1000
        # Milliseconds per minute
        minute_msecs = 60 * 1000
        # Milliseconds per second
        second_msecs = 1000
        # The time in the file name is in milliseconds. To convert it to whole seconds instead, add 500 for rounding and use // for integer division
        # msecs = (msecs + 500) // 1000
        # hour
        hour = msecs // hour_msecs
        if hour < 10:
            hour = f"0{hour}"
        # Milliseconds left after deducting hours
        hour_left_msecs = msecs % hour_msecs
        # minute
        minute = hour_left_msecs // minute_msecs
        # If it is less than 10 minutes, fill 0 in front of it to form a two digit format
        if minute < 10:
            minute = f"0{minute}"
        # Milliseconds left after minutes
        minute_left_msecs = hour_left_msecs % minute_msecs
        # second
        second = minute_left_msecs // second_msecs
        # If the number of seconds is less than 10 seconds, fill 0 in front of it to make up for the two digit format
        if second < 10:
            second = f"0{second}"
        # Milliseconds left after seconds
        second_left_msecs = minute_left_msecs % second_msecs
        # If it is less than 10 milliseconds or 100 milliseconds, fill 0 in front of it to make up for the three digit format
        if second_left_msecs < 10:
            second_left_msecs = f"00{second_left_msecs}"
        elif second_left_msecs < 100:
            second_left_msecs = f"0{second_left_msecs}"
        # Format as 00:00:00,000 and return
        time_format = f"{hour}:{minute}:{second},{second_left_msecs}"
        return time_format

    # This function requests the access_token for the AI interface
    def get_access_token(self):
        # Set this variable to your own API Key
        client_id = 'f3wT23Otc8jXlDZ4HGtS4jfT'
        # Set this variable to your own Secret Key
        client_secret = 'YPPjW3E0VGPUOfZwhjNGVn7LTu3hwssj'
        auth_url = '' + client_id + '&client_secret=' + client_secret

        response_at = requests.get(auth_url)
        # Read the response result in json format
        json_result = json.loads(response_at.text)
        # Get access_token
        access_token = json_result['access_token']
        return access_token

    # This function converts a single file under .\speech-vad-demo\output_pcm\ from speech to text
    def transfer_voice_to_srt(self,access_token,filepath):
        # Baidu speech recognition interface
        url_voice_ident = ""
        # Interface specification, post data in json format
        headers = {
            'Content-Type': 'application/json'
        }
        # Open the pcm file, read its contents and base64-encode them
        with open(filepath,'rb') as pcm_obj:
            pcm_content_base64 = base64.b64encode(pcm_obj.read())
        # Get pcm file size
        pcm_content_len = os.path.getsize(filepath)

        # For the interface specification, see the description of the request body in the official document. Pay attention to how the cuid and speech parameters are written
        post_data = {
            "format": "pcm",
            "rate": 16000,
            "dev_pid": 1737,
            "channel": 1,
            "token": access_token,
            "cuid": "1111111111",
            "len": pcm_content_len,
            "speech": pcm_content_base64.decode(),
        }
        # Optional proxy settings, only used by the commented-out request below
        proxies = {}
        # Call the interface for speech-to-text conversion
        response = requests.post(url_voice_ident, headers=headers, data=json.dumps(post_data))
        # response = requests.post(url_voice_ident,headers=headers,data=json.dumps(post_data),proxies=proxies)
        return response.text

if __name__ == "__main__":
    # instantiation 
    baidu_voice_to_srt_obj = BaiduVoiceToTxt()
    # The folder where the audio and video files to be converted are stored
    video_dir = ".\\video\\"
    all_video_file =[]
    all_file = os.listdir(video_dir)
    subtitle_format = "{\\fscx75\\fscy75}"
    # Only mp3 files are accepted, because I have not looked into how to convert other formats into pcm that meets the interface requirements
    for filename in all_file:
        if ".mp3" in filename:
            all_video_file.append(filename)
    i = 0
    video_file_num = len(all_video_file)
    print(f"{video_file_num} audio files need to be converted in total. Processing will start soon, please wait...")
    # This outer for loop processes the mp3 files one by one
    for video_file_name in all_video_file:
        i += 1
        print(f"Currently converting {video_file_name} ({i}/{video_file_num})")
        # Write the text converted from the audio to an srt file with the same name in the same directory
        video_file_srt_path = f".\\video\\{video_file_name[:-4]}.srt"
        # Open the srt file in overwrite mode
        video_file_srt_obj = open(video_file_srt_path,'w+')

        filepath = os.path.join(video_dir, video_file_name)
        # Call change_file_format to convert the mp3 file to pcm format
        baidu_voice_to_srt_obj.change_file_format(filepath)
        # Cut the converted pcm file into multiple pcm files of less than 60 seconds each
        baidu_voice_to_srt_obj.devide_video()
        # Get token
        access_token = baidu_voice_to_srt_obj.get_access_token()
        # Get the list of files in the .\speech-vad-demo\output_pcm\ directory
        file_dir = baidu_voice_to_srt_obj.output_pcm_path
        all_pcm_file = os.listdir(file_dir)
        j = 0
        pcm_file_num = len(all_pcm_file)
        print(f"The current file {video_file_name} ({i}/{video_file_num}) was cut into {pcm_file_num} segments, which will be converted to text one by one. Please wait a moment...")
        # This inner loop converts all files in the .\speech-vad-demo\output_pcm\ directory one by one
        for filename in all_pcm_file:
            j += 1
            filepath = os.path.join(file_dir, filename)
            if (os.path.isfile(filepath)):
                # Gets the time on the file name
                time_str = filename[10:-6]
                time_str_dict = time_str.split("-")
                time_start_str = baidu_voice_to_srt_obj.format_time(int(time_str_dict[0]))
                time_end_str = baidu_voice_to_srt_obj.format_time(int(time_str_dict[1]))
                print(f"Currently converting {video_file_name} ({i}/{video_file_num}) - {time_start_str}-{time_end_str} ({j}/{pcm_file_num})")
                response_text = baidu_voice_to_srt_obj.transfer_voice_to_srt(access_token, filepath)
                # Read the returned result as json
                json_result = json.loads(response_text)
                # Write the speech-to-text result into the srt file: sequence number, time range, then text
                video_file_srt_obj.writelines(f"{j}\r\n")
                video_file_srt_obj.writelines(f"{time_start_str} --> {time_end_str}\r\n")
                if json_result['err_no'] == 0:
                    print(f"{time_start_str}-{time_end_str} ({j}/{pcm_file_num}) Conversion succeeded: {json_result['result'][0]}")
                    video_file_srt_obj.writelines(f"{subtitle_format}{json_result['result'][0]}\r\n")
                elif json_result['err_no'] == 3301:
                    print(f"{time_start_str}-{time_end_str} ({j}/{pcm_file_num}) The audio quality is too poor to be recognized")
                    video_file_srt_obj.writelines(f"{subtitle_format}The audio quality is too poor to be recognized\r\n")
                else:
                    print(f"{time_start_str}-{time_end_str} ({j}/{pcm_file_num}) Other errors were encountered during the conversion")
                    video_file_srt_obj.writelines(f"{subtitle_format}Other errors were encountered during the conversion\r\n")
                # A blank line separates srt entries
                video_file_srt_obj.writelines("\r\n")
        # Close the srt file once all segments of this audio file have been written
        video_file_srt_obj.close()
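
The slice filename[10:-6] in the code above only works when the prefix and suffix of the segment file names have exactly the expected lengths. A regular expression is one possible alternative. The sketch below assumes that the segment files are named like 16k_1.pcm_0-12340_A.pcm, with the start and end times in milliseconds. That pattern is inferred from the slicing and is not confirmed by the source, and parse_segment_times is a hypothetical helper.

```python
import re

# Assumed file name pattern: <prefix>_<start_ms>-<end_ms>_<one letter>.pcm
SEGMENT_RE = re.compile(r"_(\d+)-(\d+)_[A-Za-z]\.pcm$")


def parse_segment_times(filename: str) -> tuple:
    """Return (start_ms, end_ms) parsed from a segment file name."""
    match = SEGMENT_RE.search(filename)
    if match is None:
        raise ValueError(f"unexpected segment file name: {filename}")
    return int(match.group(1)), int(match.group(2))


start_ms, end_ms = parse_segment_times("16k_1.pcm_0-12340_A.pcm")
print(start_ms, end_ms)
```

Raising an error on an unexpected name makes a naming mismatch visible immediately, instead of producing wrong timestamps in the srt file.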

Finally, I hope we can all make progress together, and that you succeed in your studies!

Keywords: AI

Added by adsegzy on Wed, 16 Feb 2022 10:50:21 +0200