[Day4] speech recognition (audio to text)

Three solutions for speech recognition:
Original purpose: the task was to recognize the speech in a video and extract it as text. I read through many projects and Chinese notes, although most of them recognized English. The tutorials make the barrier to entry look low, but being able to understand other people's open source code is still a long way from developing my own. I also explored related approaches such as subtitle generation. Rather than rendering subtitles at the bottom of the video, I prefer appending the text to a txt file paragraph by paragraph, where each word or sentence can also be given a timestamp. I eventually realized that finishing all of this in one day was unrealistic, so I collected the most common and easiest-to-understand solutions and code from the Internet: first from video to audio, then from audio to text.

There are three schemes: ① the speech_recognition library with r.recognize_sphinx(audio, language="zh-CN"); ② the Baidu API (similar to iFLYTEK), although it is hard to get much free usage out of Baidu; ③ a solution based on the TIMIT project.

Scheme I:

#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time    : 2022.1.5

# @File    : duizhaozu1.py

import os
import speech_recognition as sr


def file_to_wav(file_path, wav_path, sampling_rate):
    # Remove a previous output file so ffmpeg does not stop to ask about overwriting.
    if os.path.exists(wav_path):
        os.remove(wav_path)
    # Terminal command: convert to mono (-ac 1) at the given sampling rate (-ar).
    # The paths are quoted so that paths containing spaces still work.
    command = 'D:/download/ffmpeg-master-latest-win64-lgpl/bin/ffmpeg.exe -i "{}" -ac 1 -ar {} "{}"'.format(file_path, sampling_rate, wav_path)
    os.system(command)


if __name__ == '__main__':

    file_path = r'C:\Users\PineappleMan\Desktop\ok\DFS.mp4'
    wav_path = r'C:\Users\PineappleMan\Desktop\ok\DFS.wav'
    sampling_rate = 16000
    file_to_wav(file_path, wav_path, sampling_rate)
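    r = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = r.record(source)  # read the whole wav file
    # Offline recognition with CMU Sphinx; this needs the pocketsphinx package,
    # and languages other than en-US need an extra language pack.
    print("Text content:", r.recognize_sphinx(audio, language="zh-CN"))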

Here the input file is a video. If the input is already audio (a wav file), the format conversion code can simply be deleted. The code is adapted from https://download.csdn.net/download/weixin_38693753/13709062
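
The choice between "video" and "already a wav file" can also be made in code instead of by deleting lines. The following is only a sketch: build_ffmpeg_args and ensure_wav are hypothetical helpers that are not part of the original script, and it assumes that ffmpeg is available on the PATH. The command is built as an argument list, so paths with spaces need no quoting.

```python
import os
import subprocess


def build_ffmpeg_args(src, dst, sampling_rate, ffmpeg="ffmpeg"):
    """Return the ffmpeg argument list for a mono conversion (hypothetical helper)."""
    # -y overwrites dst if it already exists, -ac 1 = mono, -ar = sampling rate
    return [ffmpeg, "-y", "-i", src, "-ac", "1", "-ar", str(sampling_rate), dst]


def ensure_wav(src, dst, sampling_rate=16000):
    """Return the path of a wav file, converting src only when it is not a wav."""
    if os.path.splitext(src)[1].lower() == ".wav":
        return src  # already audio: the conversion step is skipped
    # check=True raises CalledProcessError if ffmpeg reports a failure
    subprocess.run(build_ffmpeg_args(src, dst, sampling_rate), check=True)
    return dst
```

Note that this sketch does not resample an existing wav file, so a wav with a different sampling rate or more than one channel would still need a conversion.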

pip install -i https://pypi.tuna.tsinghua.edu.cn/simple SpeechRecognition

Running the code raised an error because pocketsphinx was missing: https://download.csdn.net/download/yuxuwen1234/12195200 Downloading a prebuilt whl is enough. My Python is 3.7, so I downloaded the one in the link, installed it, and the code ran right away. Many fixes are suggested online because the reported error comes from swig, but instead of treating each symptom in turn ("a headache cures the head and a foot cures the foot"), installing the whl solves the problem directly.
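
Before running Scheme I it can help to check whether both packages are importable at all. This is a small sketch using only the standard library; has_module and check_sphinx_ready are hypothetical names chosen for illustration.

```python
import importlib.util


def has_module(name):
    """Return True if a top-level module can be found, without importing it."""
    return importlib.util.find_spec(name) is not None


def check_sphinx_ready():
    """Return the names of missing packages needed by recognize_sphinx."""
    required = ["speech_recognition", "pocketsphinx"]
    return [name for name in required if not has_module(name)]


if __name__ == "__main__":
    missing = check_sphinx_ready()
    if missing:
        print("Missing packages:", ", ".join(missing))
    else:
        print("speech_recognition and pocketsphinx are installed")
```

If pocketsphinx is reported as missing, installing the matching whl as described above is the next step.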

Scheme II:

import base64
import json
import os
import time
import uuid

import requests
import urllib.parse
import urllib.request
from inc import db_config
from inc import rtysdb


class BaiduRest:
    def __init__(self, cu_id, api_key, api_secert):
        self.token_url = "https://openapi.baidu.com/oauth/2.0/token?grant_type=client_credentials&client_id=%s&client_secret=%s"
        self.getvoice_url = "http://tsn.baidu.com/text2audio?tex=%s&lan=zh&cuid=%s&ctp=1&tok=%s"
        self.upvoice_url = 'http://vop.baidu.com/server_api'
        self.cu_id = cu_id
        self.get_token(api_key, api_secert)

    def get_token(self, api_key, api_secert):
        # Exchange the API key and secret for an access token.
        token_url = self.token_url % (api_key, api_secert)
        # In Python 3 urlopen lives in urllib.request; urllib.response has no urlopen.
        r_str = urllib.request.urlopen(token_url).read()
        token_data = json.loads(r_str)
        self.token_str = token_data['access_token']
        return True


    # speech synthesis
    # def text2audio(self, text, filename):
    #     get_url = self.getvoice_url % (urllib.parse.quote(text), self.cu_id, self.token_str)
    #     voice_data = urllib.request.urlopen(get_url).read()
    #     voice_fp = open(filename, 'wb+')
    #     voice_fp.write(voice_data)
    #     voice_fp.close()
    #     return True

    # speech recognition
    def audio2text(self, filename):
        with open(filename, 'rb') as wav_fp:
            voice_data = wav_fp.read()
        data = {
            'format': 'wav',
            'rate': 8000,
            'channel': 1,
            'cuid': self.cu_id,
            'token': self.token_str,
            'len': len(voice_data),
            # b64encode returns bytes in Python 3, so decode it to str for JSON
            'speech': base64.b64encode(voice_data).decode('utf-8'),
        }
        result = requests.post(self.upvoice_url, json=data, headers={'Content-Type': 'application/json'})
        data_result = result.json()
        if data_result['err_msg'] == 'success.':
            return data_result['result'][0]
        else:
            return False


def test_voice(voice_file):
    api_key = "vossGHIgEETS6IMRxBDeahv8"
    api_secert = "3c1fe6a6312f41fa21fa2c394dad5510"
    bdr = BaiduRest("0-57-7B-9F-1F-A1", api_key, api_secert)
    # speech synthesis
    # start = time.time()
    # bdr.text2audio("hello", "out.wav")
    # using = time.time() - start
    # print(using)
    # speech recognition
    # start = time.time()
    result = bdr.audio2text(voice_file)
    # result = bdr.audio2text("weather.pcm")
    # using = time.time() - start
    return result


def get_master_audio(check_status='cut_status'):
    if check_status == 'cut_status':
        sql = "SELECT id,url, time_long,sharps FROM ocenter_recognition WHERE status=0"
    elif check_status == 'finished_status':
        sql = "SELECT id,url, time_long,sharps FROM ocenter_recognition WHERE finished_status=0"
    else:
        return False
    data = rtysdb.select_data(sql, 'more')
    if data:
        return data
    else:
        return False


def go_recognize(master_id):
    section_path = "C:/Users/PineappleMan/Desktop/ok/audio1.wav"
    sql = "SELECT id,rid,url,status FROM ocenter_section WHERE rid=%d AND status=0 order by id asc limit 10" % (
        master_id)
    # print sql
    record = rtysdb.select_data(sql, 'more')
    # print record
    if not record:
        return False
    for rec in record:
        # print section_path+'/'+rec[1]
        voice_file = section_path + '/' + rec[2]
        if not os.path.exists(voice_file):
            continue
        result = test_voice(voice_file)
        print(result)
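        # Debug stop: the script exits after printing the first result, so the
        # database updates below never run. Remove this line to enable them.
        exit(0)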
        if result:
            # rtysdb.update_by_pk('ocenter_section',rec[0],{'content':result,'status':1})
            sql = "update ocenter_section set content='%s', status='%d' where id=%d" % (result, 1, rec[0])  # print sql
            rtysdb.do_exec_sql(sql)
            parent_content = rtysdb.select_data("SELECT id,content FROM ocenter_recognition WHERE id=%d" % (rec[1]))
            # print parent_content
            if parent_content:
                new_content = parent_content[1] + result
                update_content_sql = "update ocenter_recognition set content='%s' where id=%d" % (new_content, rec[1])
                rtysdb.do_exec_sql(update_content_sql)
            else:
                rtysdb.do_exec_sql("update ocenter_section set status='%d' where id=%d" % (1, rec[0]))
            time.sleep(5)
        else:
            rtysdb.do_exec_sql("UPDATE ocenter_recognition SET finished_status=1 WHERE id=%d" % (master_id))


# Convert audio files that Baidu speech cannot recognize
def ffmpeg_convert():
    section_path = "C:/Users/PineappleMan/Desktop/ok/audio1.wav"
    # print section_path
    used_audio = get_master_audio('cut_status')
    # print used_audio
    if used_audio:
        for audio in used_audio:
            audio_path = section_path + '/' + audio[1]
            new_audio = uuid.uuid1()
            convert_name = "Uploads/Convert/convert_" + str(new_audio) + ".wav"
            convert_path = section_path + "/" + convert_name
            # 8 kHz mono wav, matching the 'rate' and 'channel' values sent to Baidu
            command_line = "ffmpeg -i " + audio_path + " -ar 8000 -ac 1 -f wav " + convert_path
            # print command_line
            # os.system blocks until ffmpeg finishes, so the existence check below is reliable
            os.system(command_line)
            if os.path.exists(convert_path):
                ffmpeg_cut(convert_name, audio[3], audio[0])
                sql = "UPDATE ocenter_recognition SET status=1,convert_name='%s' where id=%d" % (convert_name, audio[0])
                rtysdb.do_exec_sql(sql)


# Cut the large audio file into pieces
def ffmpeg_cut(convert_name, sharps, master_id):
    section_path = "C:/Users/PineappleMan/Desktop/ok/audio1.wav"
    if sharps > 0:
        for i in range(0, sharps):
            # Start offset of the i-th 30-second piece as HH:MM:SS.
            # gmtime ignores the local time zone, so no manual hour correction is needed.
            start_time = time.strftime("%H:%M:%S", time.gmtime(i * 30))
            cut_name = section_path + '/' + convert_name
            db_store_name = "Uploads/Section/" + str(uuid.uuid1()) + '-' + str(i + 1) + ".wav"
            section_name = section_path + "/" + db_store_name
            command_line = "ffmpeg.exe -i " + cut_name + " -vn -acodec copy -ss " + start_time + " -t 00:00:30 " + section_name
            # print command_line
            # os.system waits for ffmpeg to finish before the next piece is cut
            os.system(command_line)
            data = {}
            data['rid'] = master_id
            data['url'] = db_store_name
            data['create_time'] = int(time.time())
            data['status'] = 0
            rtysdb.insert_one('ocenter_section', data)

if __name__ == "__main__":
    ffmpeg_convert()
    audio = get_master_audio('finished_status')
    if audio:
        for ad in audio:
            go_recognize(ad[0])
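
The start offsets that ffmpeg_cut passes to -ss can be computed without depending on the local time zone. A minimal sketch, assuming a hypothetical helper segment_starts that is not part of the original project:

```python
import time


def segment_starts(count, seconds_per_piece=30):
    """Return the HH:MM:SS start offsets of `count` consecutive pieces."""
    # gmtime treats the offset as seconds since the epoch in UTC, so the
    # formatted clock time equals the elapsed time (for offsets under 24 hours).
    return [
        time.strftime("%H:%M:%S", time.gmtime(i * seconds_per_piece))
        for i in range(count)
    ]


print(segment_starts(3))  # ['00:00:00', '00:00:30', '00:01:00']
```

The same offsets could also serve as the per-sentence timestamps when the recognized text is written to a txt file.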

The project is based on https://download.csdn.net/download/weixin_38531210/12867107 , but with many changes. Both the official website and these projects call fairly old packages, and some of the modules they use have been renamed in Python 3, so adapting the code took considerable work.
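
One typical change of this kind concerns the request body: in Python 3, base64.b64encode returns bytes, which must be decoded to str before being placed in JSON. The sketch below only assembles the fields used in audio2text above and does not contact Baidu; build_asr_payload and the demo values are hypothetical.

```python
import base64
import json


def build_asr_payload(voice_data, cu_id, token, rate=8000):
    """Assemble the JSON body for the recognition request (hypothetical helper)."""
    return {
        "format": "wav",
        "rate": rate,
        "channel": 1,
        "cuid": cu_id,
        "token": token,
        "len": len(voice_data),  # length of the raw audio in bytes, before encoding
        "speech": base64.b64encode(voice_data).decode("utf-8"),
    }


# Demo with four placeholder bytes instead of real audio
payload = build_asr_payload(b"\x00\x01\x02\x03", "demo-cuid", "demo-token")
print(json.dumps(payload))
```

Keeping the payload construction in its own function makes it possible to check the encoding step without network access or valid credentials.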

Scheme III:

TIMIT is a classic English speech recognition corpus, and related code is not hard to find, so I will not go into detail here. In a couple of days I will annotate the code line by line and post it.

Summary:

In practice I implement one feature a day. This matters especially for someone like me, who spent a long time preparing for the postgraduate entrance examination and forgot a lot of things. Implementing a feature with a clear goal is still very good practice. Keep up the good work.

Keywords: AI

Added by AcidCool19 on Thu, 06 Jan 2022 03:36:23 +0200