Three solutions for speech recognition:
Original goal: I needed to recognize the speech in a video and extract its text content. I read through a lot of projects and Chinese write-ups (though the audio I recognized was English), and most of the tutorials felt fairly entry-level: I could understand other people's open-source code, but that is still some distance from developing something myself. I also explored ideas such as subtitle generation, though instead of rendering subtitles at the bottom of the video, I wanted them appended to a txt file paragraph by paragraph, ideally with a timestamp for every sentence; I still think that would be the best form. Eventually I realized this was too much to finish in a day, so I collected the most common and most understandable solutions and code from the Internet: first convert the video to audio, then transcribe the audio to text.
There are three schemes:
① speech_recognition together with r.recognize_sphinx(audio, language="zh-CN");
② the Baidu speech API (iFLYTEK offers something similar), although squeezing free usage out of Baidu is genuinely hard;
③ a solution built on the TIMIT project.
Scheme I:
#!/usr/bin/env python
# -*- coding: utf-8 -*-
# @Time : 2022.1.5
# @File : duizhaozu1.py
import os

import speech_recognition as sr


def file_to_wav(file_path, wav_path, sampling_rate):
    if os.path.exists(wav_path):
        # delete any previous output file
        os.remove(wav_path)
    # terminal command: convert to mono (-ac 1) at the target sampling rate (-ar)
    command = "D:/download/ffmpeg-master-latest-win64-lgpl/bin/ffmpeg.exe -i {} -ac 1 -ar {} {}".format(
        file_path, sampling_rate, wav_path)
    os.system(command)


if __name__ == '__main__':
    file_path = r'C:\Users\PineappleMan\Desktop\ok\DFS.mp4'
    wav_path = r'C:\Users\PineappleMan\Desktop\ok\DFS.wav'
    sampling_rate = 16000
    file_to_wav(file_path, wav_path, sampling_rate)
    r = sr.Recognizer()
    with sr.AudioFile(wav_path) as source:
        audio = r.record(source)
    print("Text content:", r.recognize_sphinx(audio, language="zh-CN"))
Here the input file is a video. If your input is already audio (a wav file), just delete the format-conversion code. The code references https://download.csdn.net/download/weixin_38693753/13709062
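As a side note, the `os.system` string above breaks if any path contains spaces. The conversion step can also be written with `subprocess` and an argument list, which avoids shell quoting entirely. This is only a sketch, assuming `ffmpeg` is on the PATH rather than at the hard-coded location:

```python
import os
import subprocess


def build_ffmpeg_args(src, dst, sampling_rate):
    # Argument list for a mono (-ac 1) conversion at the target rate (-ar).
    # Passing a list instead of one shell string sidesteps quoting problems.
    return ["ffmpeg", "-y", "-i", src, "-ac", "1", "-ar", str(sampling_rate), dst]


def file_to_wav(src, dst, sampling_rate=16000):
    # -y already tells ffmpeg to overwrite, so the remove is belt-and-braces
    if os.path.exists(dst):
        os.remove(dst)
    subprocess.run(build_ffmpeg_args(src, dst, sampling_rate), check=True)
```

`check=True` raises if ffmpeg exits with an error, instead of failing silently the way `os.system` does.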
pip install -i https://pypi.tuna.tsinghua.edu.cn/simple SpeechRecognition
Note that the PyPI package is named SpeechRecognition even though it is imported as speech_recognition. While running I hit an error: pocketsphinx was missing. See https://download.csdn.net/download/yuxuwen1234/12195200 — in fact, downloading a single whl is enough. My Python is 3.7, so I downloaded the wheel in that link and it ran fine right away. Many solutions online attack the reported swig error instead, but that is treating the symptom rather than the cause; downloading the matching whl directly is the smarter fix.
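Picking the right whl comes down to matching the `cpXY` tag in the wheel's filename (e.g. `cp37` for Python 3.7) to your interpreter. A tiny helper (the name `cp_tag` is my own, for illustration) prints what to look for:

```python
import sys


def cp_tag(version_info=sys.version_info):
    # CPython ABI tag as it appears in wheel filenames, e.g. "cp37" for 3.7
    return "cp%d%d" % (version_info[0], version_info[1])


if __name__ == "__main__":
    # The pocketsphinx wheel you download must carry this tag in its name
    print("look for a wheel tagged", cp_tag())
```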
Scheme II:
import base64
import json
import os
import time
import uuid
import urllib.request

import requests

from inc import db_config
from inc import rtysdb


class BaiduRest:
    def __init__(self, cu_id, api_key, api_secert):
        self.token_url = "https://openapi.baidu.com/oauth/2.0/token?grant_type=client_credentials&client_id=%s&client_secret=%s"
        self.getvoice_url = "http://tsn.baidu.com/text2audio?tex=%s&lan=zh&cuid=%s&ctp=1&tok=%s"
        self.upvoice_url = 'http://vop.baidu.com/server_api'
        self.cu_id = cu_id
        self.get_token(api_key, api_secert)

    def get_token(self, api_key, api_secert):
        token_url = self.token_url % (api_key, api_secert)
        # urllib.request (not urllib.response) provides urlopen in Python 3
        r_str = urllib.request.urlopen(token_url).read()
        token_data = json.loads(r_str)
        self.token_str = token_data['access_token']
        return True

    # speech synthesis
    # def text2audio(self, text, filename):
    #     get_url = self.getvoice_url % (urllib.parse.quote(text), self.cu_id, self.token_str)
    #     voice_data = urllib.request.urlopen(get_url).read()
    #     with open(filename, 'wb+') as voice_fp:
    #         voice_fp.write(voice_data)
    #     return True

    # speech recognition
    def audio2text(self, filename):
        data = {}
        data['format'] = 'wav'
        data['rate'] = 8000
        data['channel'] = 1
        data['cuid'] = self.cu_id
        data['token'] = self.token_str
        with open(filename, 'rb') as wav_fp:
            voice_data = wav_fp.read()
        data['len'] = len(voice_data)
        # b64encode returns bytes in Python 3; decode before JSON serialization
        data['speech'] = base64.b64encode(voice_data).decode('utf-8')
        result = requests.post(self.upvoice_url, json=data,
                               headers={'Content-Type': 'application/json'})
        data_result = result.json()
        if data_result['err_msg'] == 'success.':
            return data_result['result'][0]
        return False


def test_voice(voice_file):
    api_key = "vossGHIgEETS6IMRxBDeahv8"
    api_secert = "3c1fe6a6312f41fa21fa2c394dad5510"
    bdr = BaiduRest("0-57-7B-9F-1F-A1", api_key, api_secert)
    # synthesis example:
    # bdr.text2audio("hello", "out.wav")
    return bdr.audio2text(voice_file)


def get_master_audio(check_status='cut_status'):
    if check_status == 'cut_status':
        sql = "SELECT id,url,time_long,sharps FROM ocenter_recognition WHERE status=0"
    elif check_status == 'finished_status':
        sql = "SELECT id,url,time_long,sharps FROM ocenter_recognition WHERE finished_status=0"
    else:
        return False
    data = rtysdb.select_data(sql, 'more')
    return data if data else False


def go_recognize(master_id):
    section_path = "C:/Users/PineappleMan/Desktop/ok/audio1.wav"
    sql = "SELECT id,rid,url,status FROM ocenter_section WHERE rid=%d AND status=0 order by id asc limit 10" % (master_id)
    record = rtysdb.select_data(sql, 'more')
    if not record:
        return False
    for rec in record:
        voice_file = section_path + '/' + rec[2]
        if not os.path.exists(voice_file):
            continue
        result = test_voice(voice_file)
        print(result)
        exit(0)  # debug: remove this line to process every section
        if result:
            sql = "update ocenter_section set content='%s', status='%d' where id=%d" % (result, 1, rec[0])
            rtysdb.do_exec_sql(sql)
            parent_content = rtysdb.select_data("SELECT id,content FROM ocenter_recognition WHERE id=%d" % (rec[1]))
            if parent_content:
                new_content = parent_content[1] + result
                update_content_sql = "update ocenter_recognition set content='%s' where id=%d" % (new_content, rec[1])
                rtysdb.do_exec_sql(update_content_sql)
            else:
                rtysdb.do_exec_sql("update ocenter_section set content='%s', status='%d' where id=%d" % (result, 1, rec[0]))
        time.sleep(5)
    else:
        # for-else: every pending section was processed, mark the recognition done
        rtysdb.do_exec_sql("UPDATE ocenter_recognition SET finished_status=1 WHERE id=%d" % (master_id))


# Convert audio files that Baidu speech cannot recognize
def ffmpeg_convert():
    section_path = "C:/Users/PineappleMan/Desktop/ok/audio1.wav"
    used_audio = get_master_audio('cut_status')
    if used_audio:
        for audio in used_audio:
            audio_path = section_path + '/' + audio[1]
            new_audio = uuid.uuid1()
            command_line = "ffmpeg -i " + audio_path + " -ar 8000 -ac 1 -f wav " + \
                section_path + "/Uploads/Convert/convert_" + str(new_audio) + ".wav"
            os.popen(command_line)
            if os.path.exists(section_path + "/Uploads/Convert/convert_" + str(new_audio) + ".wav"):
                convert_name = "Uploads/Convert/convert_" + str(new_audio) + ".wav"
                ffmpeg_cut(convert_name, audio[3], audio[0])
                sql = "UPDATE ocenter_recognition SET status=1,convert_name='%s' where id=%d" % (convert_name, audio[0])
                rtysdb.do_exec_sql(sql)


# Cut the large audio file into 30-second pieces
def ffmpeg_cut(convert_name, sharps, master_id):
    section_path = "C:/Users/PineappleMan/Desktop/ok/audio1.wav"
    if sharps > 0:
        for i in range(0, sharps):
            timeArray = time.localtime(i * 30)
            h = time.strftime("%H", timeArray)
            h = int(h) - 8  # undo the UTC+8 offset introduced by localtime
            h = "0" + str(h)
            ms = time.strftime("%M:%S", timeArray)
            start_time = h + ':' + str(ms)
            cut_name = section_path + '/' + convert_name
            db_store_name = "Uploads/Section/" + str(uuid.uuid1()) + '-' + str(i + 1) + ".wav"
            section_name = section_path + "/" + db_store_name
            command_line = "ffmpeg.exe -i " + cut_name + " -vn -acodec copy -ss " + start_time + " -t 00:00:30 " + section_name
            os.popen(command_line)
            data = {}
            data['rid'] = master_id
            data['url'] = db_store_name
            data['create_time'] = int(time.time())
            data['status'] = 0
            rtysdb.insert_one('ocenter_section', data)


if __name__ == "__main__":
    ffmpeg_convert()
    audio = get_master_audio('finished_status')
    if audio:
        for ad in audio:
            go_recognize(ad[0])
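One fragile spot above: ffmpeg_cut builds the -ss start time from time.localtime and subtracts 8 to undo the UTC+8 offset, which only works in that one timezone. A timezone-independent sketch using plain divmod (the helper name seconds_to_timestamp is my own):

```python
def seconds_to_timestamp(total_seconds):
    # Format a duration as HH:MM:SS for ffmpeg's -ss option,
    # without going through the local timezone at all.
    hours, remainder = divmod(int(total_seconds), 3600)
    minutes, seconds = divmod(remainder, 60)
    return "%02d:%02d:%02d" % (hours, minutes, seconds)


# e.g. the i-th 30-second slice starts at seconds_to_timestamp(i * 30)
```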
The project is based on https://download.csdn.net/download/weixin_38531210/12867107 , but with many changes. Whether on the official site or in projects like this one, the packages called are quite old, and some modules were even renamed in Python 3, so the amount of rework was considerable.
Scheme III:
TIMIT is a classic English speech recognition corpus, and the relevant code is not hard to find, so I won't go into detail here. In a couple of days I will annotate that code line by line and post it.
Summary
Honestly, implementing one feature a day is good practice, especially for someone like me who spent a long time preparing for the postgraduate entrance exam and forgot a great deal. Working toward a concrete goal like this really does sharpen your skills. Keep it up.