Notes: audio format conversion with Python

In the previous post, we got a rough idea of how to use the pydub library. Today's goal is to write a crawler that collects song information.

Python's standard library already includes packages for web crawling. The official Chinese documentation is at https://docs.python.org/zh-cn/ ; pick the version that matches your Python (the site is very useful and worth bookmarking if you are learning Python). The official documentation can be dense, though, so it helps to read a few tutorials alongside it.

After some reading, it turns out that a web crawler in Python can use either the traditional urllib library or the more advanced Requests library. I'll go with urllib for now. Its urllib.request module is used to open URLs. The signature is:

urllib.request.urlopen(url, data=None, [timeout, ]*, cafile=None, capath=None, cadefault=False, context=None)

It looks complicated, but every parameter other than url has a default, so we only need to pass url. Searching Baidu Baike (Baidu Encyclopedia) for 烟花易冷 (Fireworks Cool Easily) gives a page with this URL: https://baike.baidu.com/item/%E7%83%9F%E8%8A%B1%E6%98%93%E5%86%B7/211 . Try changing the URL to https://baike.baidu.com/item/七里香 (Qilixiang): the Baike entry for Qilixiang opens successfully, and the URL is automatically rewritten to https://baike.baidu.com/item/七里香/2181450 (so this approach works, nice).
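
As a minimal sketch of the step above, the entry URL can be built by percent-encoding the song name with urllib.parse.quote and then opened with urlopen. The helper names build_url and fetch are made up for this illustration, and the 10-second timeout is an arbitrary choice. Whether Baidu Baike serves the page to a plain urllib client is not something this sketch guarantees.

```python
from urllib import parse, request

BASE_URL = "https://baike.baidu.com/item/"

def build_url(name):
    # Percent-encode the song name (as UTF-8) so it is safe inside a URL path.
    return BASE_URL + parse.quote(name)

def fetch(url, timeout=10):
    # Only url is required by urlopen; the timeout here is an arbitrary choice.
    with request.urlopen(url, timeout=timeout) as res:
        return res.read().decode("utf-8")

# Building the URL needs no network access; fetch(url) would download the page.
url = build_url("七里香")
print(url)
```

Keeping the URL construction separate from the download makes it possible to check the encoded URL without touching the network.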

Looking at the page reveals a new problem: Qilixiang is ambiguous. The default entry is Jay Chou's album Qilixiang, not his song of the same name. Open the page source of the results for Fireworks Cool Easily and for Qilixiang, and compare:

<li class="item">▪<span class="selected">Jay Chou sings songs</span></li>

<li class="item">▪<span class="selected">2004 Jay Chou's music album</span></li>

The two lines differ. Near the second one, the page source also contains the following:

<li class="item">▪<span class="selected">2004 Jay Chou's music album</span></li>
<li class="item">▪<a title="Xi Murong's poetry collection" href='/item/%E4%B8%83%E9%87%8C%E9%A6%99/2181435#Viewpagecontent '> Xi Murong's poetry collection</a></li>
<li class="item">▪<a title="2007 Thai TV series" href='/item/%E4%B8%83%E9%87%8C%E9%A6%99/2181466#Viewpagecontent '> 2007 Thai TV series</a></li>
<li class="item">▪<a title="Chen Shuhua sings songs" href='/item/%E4%B8%83%E9%87%8C%E9%A6%99/2172939#Viewpagecontent '> Chen Shuhua sings songs</a></li>
<li class="item">▪<a title="traditional Chinese medicine" href='/item/%E4%B8%83%E9%87%8C%E9%A6%99/4494994#Viewpagecontent '> traditional Chinese Medicine</a></li>
<li class="item">▪<a title="scenic spot" href='/item/%E4%B8%83%E9%87%8C%E9%A6%99/3518031#Viewpagecontent '> tourist attractions</a></li>
<li class="item">▪<a title="2005 Books published by the Central Compilation publishing house in" href='/item/%E4%B8%83%E9%87%8C%E9%A6%99/20490760#Viewpagecontent '> books published by the Central Compilation and Translation Press in 2005</a></li>
<li class="item">▪<a title="Novel Qi Li Xiang" href='/item/%E4%B8%83%E9%87%8C%E9%A6%99/3922533#Viewpagecontent '> novel Qi Li Xiang</a></li>
<li class="item">▪<a title="Thymus of Rutaceae" href='/item/%E4%B8%83%E9%87%8C%E9%A6%99/4499679#Viewpagecontent '> thyme of Rutaceae</a></li>
<li class="item">▪<a title="Jay Chou sang songs in Taiwan in 2004" href='/item/%E4%B8%83%E9%87%8C%E9%A6%99/12009481#Viewpagecontent '> Jay Chou sang songs in Taiwan in 2004</a></li>
<li class="item">▪<a title="Snacks in Taiwan, China" href='/item/%E4%B8%83%E9%87%8C%E9%A6%99/2181417#ViewPageContent'> Taiwan China snacks </a></li>
<li class="item">▪<a title="Xi Murong creates new poetry" href='/item/%E4%B8%83%E9%87%8C%E9%A6%99/22593324#Viewpagecontent '> Xi Murong creates new poems</a></li>
<li class="item">▪<a title="Dark night literature network novel" href='/item/%E4%B8%83%E9%87%8C%E9%A6%99/22781892#Viewpagecontent '> dark night literature network novel</a></li>
<a href="javascript:;" class="fold-on">Expand all<em class="cmn-icon cmn-icons cmn-icons_arrow-b"></em></a>
<a href="javascript:;" class="fold-off">Put away<em class="cmn-icon cmn-icons cmn-icons_arrow-t"></em></a>

Among these there is an entry titled "Jay Chou sang songs in Taiwan in 2004". Like the selected line on the first page, it contains the keywords "Jay Chou" and "song". Next, keep looking through the page source for the information we need:
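
The check described above can be sketched with two small regular expressions: one reads the currently selected meaning, the other collects the (title, href) pairs of the alternative meanings. This is only an illustration on a shortened sample of the snippet above; the function names are hypothetical, and the English keywords stand in for the Chinese words on the live page.

```python
import re

SAMPLE = """\
<li class="item">▪<span class="selected">2004 Jay Chou's music album</span></li>
<li class="item">▪<a title="Xi Murong's poetry collection" href='/item/%E4%B8%83%E9%87%8C%E9%A6%99/2181435#ViewPageContent'>Xi Murong's poetry collection</a></li>
<li class="item">▪<a title="Jay Chou sang songs in Taiwan in 2004" href='/item/%E4%B8%83%E9%87%8C%E9%A6%99/12009481#ViewPageContent'>Jay Chou sang songs in Taiwan in 2004</a></li>
"""

# title is delimited by double quotes, href by single quotes (as in the snippet).
LINK = re.compile(r"""<a title="([^"]*)" href='([^']*)'>""")
SELECTED = re.compile(r'<span class="selected">([^<]*)</span>')

def selected_label(html):
    # Label of the meaning currently shown, or None if the entry is unambiguous.
    m = SELECTED.search(html)
    return m.group(1) if m else None

def find_song_link(html, keywords=("Jay Chou", "song")):
    # href of the first alternative whose title contains every keyword.
    for title, href in LINK.findall(html):
        if all(k in title for k in keywords):
            return href
    return None

print(selected_label(SAMPLE))
print(find_song_link(SAMPLE))
```

Using character classes such as [^"]* instead of .* keeps each match inside a single attribute, so one long line of HTML cannot swallow several links at once.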

<meta name="description" content="<Fireworks Cool Easily> is a song with lyrics by Fang Wenshan, arranged by Huang Yuxun, and composed and sung by Jay Chou. It is included in Jay Chou's album Cross Era, released on May 18, 2010. In 2011, the song won the "Golden Melody of the Year" award at the 2010 Beijing pop music ceremony.">

<meta name="description" content="<Qilixiang> is a song sung by Jay Chou, with lyrics by Fang Wenshan, music by Jay Chou, and arrangement by Zhong Xingmin. It is included in Jay Chou's album of the same name, <Qilixiang>, released on August 3, 2004. In 2004, the song won three awards (best composition, best producer, and best arrangement) at Hong Kong's TVB8 top ten Golden Songs awards. In 2005, the song won many awards, such as the 27th top ten Chinese Golden Song Award, the excellent popular Chinese song award, and the best song of the year in the 11th global Chinese music list.">
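
As an alternative to matching this tag with a regular expression, the description can be read with the standard library's html.parser. This is a sketch rather than what the script below does; the class and function names are made up for the illustration, and the sample page is a shortened stand-in.

```python
from html.parser import HTMLParser

class DescriptionParser(HTMLParser):
    """Collects the content of the first <meta name="description"> tag."""

    def __init__(self):
        super().__init__()
        self.description = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.description is not None:
            return
        attrs = dict(attrs)
        if attrs.get("name") == "description":
            self.description = attrs.get("content")

def get_description(html):
    parser = DescriptionParser()
    parser.feed(html)
    return parser.description  # None when the page has no description tag

page = ('<html><head><meta name="description" '
        'content="Qilixiang is a song sung by Jay Chou."></head></html>')
print(get_description(page))
```

A parser returns only the attribute value, so there is no surrounding markup to strip afterwards.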

So what is the difference between composing, arranging, and scoring? A Baidu search says:

1. Conceptual difference: composing generally means writing a melody for the lyrics; arranging generally means writing the accompaniment for a song; scoring means writing existing music down as numbered notation, staff notation, and so on.
2. Order: composing comes first, then arranging and scoring.

Well, that was a lot of reading. With that, the preparation is about done.
The table below compares song tag fields on different platforms with the corresponding ffmpeg metadata keys:

Windows                   iTunes (Info tab)        id3v2.3  ffmpeg key    ffmpeg example
Title                     Title                    TIT2     title         -metadata title="vast sea and sky"
Subtitle                  Description (Video tab)  TIT3     TIT3          -metadata TIT3="beyond 20th Anniversary Edition"
Rating                    n/a                      n/a      n/a           n/a
Comments                  Comments                 COMM     n/a           n/a
Contributing artists      Artist                   TPE1     artist        -metadata artist="Huang Jiaju"
Album artist              Album artist             TPE2     album_artist  -metadata album_artist="Josh Groban"
Album                     Album                    TALB     album         -metadata album="Closer"
Year                      Year                     TYER     date          -metadata date="2009"
#                         Track Number             TRCK     track         -metadata track="3/12"
Genre                     Genre                    TCON     genre         -metadata genre="Vocal"
Publisher                 n/a                      TPUB     publisher     -metadata publisher="Heaven Church"
Encoded by                n/a                      TENC     encoded_by    -metadata encoded_by="Joshua"
Author URL                n/a                      WOAR     n/a           n/a
Copyright (non editable)  n/a                      TCOP     copyright     -metadata copyright="℗ lqsoft"
Composers                 n/a                      TCOM     composer      -metadata composer="Joshua"
Conductors                n/a                      TPE3     performer     -metadata performer="Joshua"
Group description         Grouping                 TIT1     TIT1          -metadata TIT1="The Classics"
Mood                      n/a                      n/a      n/a           n/a
Part of set               Disc Number              TPOS     disc          -metadata disc="1/2"
Initial key               n/a                      TKEY     TKEY          -metadata TKEY="G"
Beats-per-minute          BPM                      TBPM     TBPM          -metadata TBPM="120"
Part of a compilation     Part of a compilation    TCMP     n/a           n/a
n/a                       n/a                      TLAN     language      -metadata language="eng"
n/a                       n/a                      TSSE     encoder       -metadata encoder="iTunes v10"

We will mainly use these keys: title (the song name), artist (the singer), album, date (the release date), and composer. That leaves no field for poor Fang Wenshan, the lyricist. On reflection, the lyricist and composer credits usually appear in the lyrics file, while the tags of a music file generally carry little more than the song name and the singer.
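
To show how the chosen keys map onto the ffmpeg command line, here is a small sketch that turns a tag dictionary into repeated -metadata key=value arguments, the form shown in the ffmpeg example column. The function name and the file names are placeholders, and the sketch only builds the argument list; it does not run ffmpeg.

```python
def ffmpeg_tag_command(src, dst, tags):
    # One "-metadata key=value" pair per tag, placed before the output file.
    cmd = ["ffmpeg", "-i", src]
    for key, value in tags.items():
        cmd += ["-metadata", f"{key}={value}"]
    cmd.append(dst)
    return cmd

tags = {
    "title": "Qilixiang",
    "artist": "Jay Chou",
    "album": "Qilixiang",
    "date": "2004",
    "composer": "Jay Chou",
}
cmd = ffmpeg_tag_command("Qilixiang.wav", "Qilixiang.flac", tags)
print(cmd)
```

Because the command is a list rather than one string, it could be handed to subprocess.run as is, with no shell quoting needed for values that contain spaces.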

Because the names of my stored audio files start with a track number:

01.Cowboys are busy.wav
01.Said goodbye.wav

the first step is a script that renames them and exports the song list:

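import os
import re

# pattern[0]: leading track number such as "01."; pattern[1]: the ".wav" suffix
pattern=[r"^[0-9]+\.",r"\.wav$"]
music_dir='E:\\BaiduNetdiskDownload\\Jay Chou'
os.chdir(music_dir)
raw_dir_list=os.listdir(music_dir)
song_names=[]

for file in raw_dir_list:
    if not file.endswith(".wav"):
        continue  # skip song_list.txt and other non-audio files
    new_name=re.sub(pattern[0],"",file)                # "01.Song.wav" -> "Song.wav"
    song_names.append(re.sub(pattern[1],"",new_name))  # "Song.wav" -> "Song"
    if new_name!=file:
        os.rename(file,new_name)

# Export the song list, one name per line
with open("song_list.txt","w") as p:
    for name in song_names:
        p.write(name+"\n")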

The resulting list looks like this (the renamed files on disk still carry the ".wav" suffix):

Qilixiang
doomsday
Dongfeng break
Uncle Joker

Next comes the crawler script:

from urllib import request
from urllib import parse
import re
import os

def getlist(file):
    # Read a text file and return its non-empty lines
    with open(file,"r") as p:
        lines=p.read().split("\n")
    return [line for line in lines if line]

def crawtext(url):
    res=request.urlopen(url)
    text=res.read().decode(encoding='utf-8', errors='strict')
    return text

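def isurl(patternlist,text):
    # Classify the page by its list of meanings:
    #   0 = the selected meaning is already the Jay Chou song
    #   1 = the page has no list of meanings (the entry is unambiguous)
    #   2 = another meaning is selected, so the song link must be looked up
    if not re.search(patternlist[0],text):
        return 1
    if re.search(patternlist[1],text):
        return 0
    return 2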

def gettext(pattern,raw_text):
    # Return the first substring that matches pattern, or False if there is none
    a=re.search(pattern,raw_text)
    return a.group() if a else False

def geturl(pattern,patternlist,raw_text):
    # Find the link line for the song, then strip everything around the
    # path that follows href='/item/ ; the result is appended to baseurl
    a=re.search(pattern,raw_text)
    if not a:
        return False
    tmp=re.sub(patternlist[0],"",a.group())
    return re.sub(patternlist[1],"",tmp)


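baseurl=r"https://baike.baidu.com/item/"
# NOTE: the patterns appear here in translated form. The live pages are in
# Chinese, so "Jay Chou" and "song" have to be written with the Chinese words
# used on the page for these patterns to match.
pattern1=['<li class="item">▪<span class="selected">','<li class="item">▪<span class="selected">.*Jay Chou.*song.*</span></li>']
pattern2='<meta name="description" content=".*">'
pattern3='<li class="item">▪<a title=".*Jay Chou.*song.*>'
pattern4=[".*href='/item/","'>.*"]
music_dir="E:\\BaiduNetdiskDownload\\Jay Chou"
os.chdir(music_dir)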

song_list=getlist("song_list.txt")
text_list=list()
for file in song_list:
    name=re.sub(r"\.wav$","",file)
    url=baseurl+parse.quote(name)
    text=crawtext(url)
    flag=isurl(pattern1,text)
    if  flag==0:
        text_list.append(gettext(pattern2,text))
    elif flag==1:
        text=gettext(pattern2,text)
        if text:
            text_list.append(text)
        else:
            text_list.append(name+" error 1 ")
    else :
        key=geturl(pattern3,pattern4,text)
        if key:
            url=baseurl+key
            text=crawtext(url)
            text_list.append(gettext(pattern2,text))
        else :
            text_list.append(name+" error 2 ")

with open("text.txt","w") as p:
    for entry in text_list:
        # gettext() returns False when no description was found, so convert to str
        p.write(str(entry)+"\n")

There are still some problems, for example three "error 2" entries:

Chrysanthemum terrace error 2
Agreed happiness error 2
Track error 2

A browser search shows that the official title of Jay Chou's song differs slightly from the name in my list ("Agreed happiness"), so that one was my own mistake. "Chrysanthemum terrace" and "Track" are a different story. Their entries are labelled:

Jay Chou sings the ending song of the film "all over the city with golden armour"
Jay Chou sings the theme song of the film "looking for Jay Chou"

Annoyingly, these labels do not contain the keyword "song" that the script looks for (in the original Chinese, "ending song" and "theme song" are separate words that do not include the word for "song"). A few other records were also wrong, either because the entry did not redirect automatically or because the singer is not Jay Chou ("Dedication" is a song Jay Chou wrote for Chen Xiaochun).
The script could clearly be improved, but that is a hassle for just a handful of songs, so I added those entries by hand and corrected the wrong song name. With that, the raw data is downloaded. It looks like this:
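
One way the matching could be loosened is to accept several keywords instead of a single one. The sketch below is an assumption about how that might look, not the script used here. The Chinese keywords and the sample labels are my own reconstructions of the page wording: 歌曲 (song), 片尾曲 (ending song), 主题曲 (theme song), and 周杰伦 (Jay Chou).

```python
# Assumed keywords; the wording on the live page may differ.
SONG_KEYWORDS = ("歌曲", "片尾曲", "主题曲")

def is_jay_chou_song(label, artist="周杰伦", keywords=SONG_KEYWORDS):
    # The label must mention the artist and at least one song-like keyword.
    return artist in label and any(k in label for k in keywords)

labels = [
    "周杰伦演唱电影《满城尽带黄金甲》片尾曲",  # ending song of a film
    "2004年周杰伦音乐专辑",                    # a music album, not a song
    "陈淑桦演唱歌曲",                          # a song by another singer
]
for label in labels:
    print(label, is_jay_chou_song(label))
```

A broader keyword list catches more songs but also raises the chance of picking the wrong entry, so the results would still need a manual check.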

<meta name="description" content="<Qilixiang> is a song sung by Jay Chou, with lyrics by Fang Wenshan, music by Jay Chou, and arrangement by Zhong Xingmin. It is included in Jay Chou's album of the same name, <Qilixiang>, released on August 3, 2004. In 2004, the song won three awards (best composition, best producer, and best arrangement) at Hong Kong's TVB8 top ten Golden Songs awards. In 2005, the song won many awards, such as the 27th top ten Chinese Golden Song Award, the excellent popular Chinese song award, and the best song of the year in the 11th global Chinese music list.">

Next, clean up the data:

import os
import re

def getlist(file):
    # Read a text file and return its non-empty lines
    with open(file,"r") as p:
        lines=p.read().split("\n")
    return [line for line in lines if line]

class SONG:
    title=""
    artist=""
    album=""
    date=""
    composer=""
    def __init__(self,title) :
        self.title=title

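def cuthead(pattern,text):
    # Drop everything up to and including the last match of pattern.
    # Slicing from a.end() also handles a match at the very end of text.
    a=re.search(pattern,text)
    while a:
        text=text[a.end():]
        a=re.search(pattern,text)
    return text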

def search1(pattern,text):
    # Return the text between pattern[0] and pattern[1] (shortest match),
    # or False if the pair is not found
    a=re.search(pattern[0]+".*?"+pattern[1],text)
    if not a:
        return False
    tmp=re.sub(pattern[1],"",a.group())
    return cuthead(pattern[0],tmp)

def search2(pattern,text):
    # Return the first substring that matches pattern, or False if there is none
    a=re.search(pattern,text)
    return a.group() if a else False

def search3(pattern,text):
    pass  # unused placeholder


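music_dir="E:\\BaiduNetdiskDownload\\Jay Chou"
os.chdir(music_dir)
# NOTE: the patterns appear here in translated form. Against the Chinese
# descriptions they have to use the corresponding Chinese words and brackets.
pattern1=["<",">"]               # song title, inside the title brackets
pattern2=["yes","Singing"]       # artist, between "is" and "sung"
pattern3=["song,.",",Included"]  # credits, between "song," and ", included"
pattern4=["Included.*?[0-9]+year[0-9]+month[0-9]+day","[0-9]+year[0-9]+month[0-9]+day"]  # release date
pattern5=["Album<",">"]          # album title, inside the brackets after "album"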


textlist=getlist("text.txt")
li=[]
for  text in textlist:
    title=search1(pattern1,text)
    song=SONG(title)
    song.artist=search1(pattern2,text)
    song.album=search1(pattern5,text)
    song.date=search2(pattern4[1],str(search2(pattern4[0],text)))
    song.composer=search1(pattern3,text)
    li.append(song)

with open("list.txt","w") as p:
    for song in li:
        p.write(str(song.title)+"\t")
        p.write(str(song.artist)+"\t")
        p.write(str(song.album)+"\t")
        p.write(str(song.date)+"\t")
        p.write(str(song.composer)+"\n")

Non-standardized data makes cleaning a tearful job. Even after running the program, I still had to check the output by hand and fix several irregular records. The cleaned result looks like this:

Qilixiang	Jay Chou	Qilixiang	August 3, 2004	Lyrics by Fang Wenshan, music by Jay Chou, arranged by Zhong Xingmin
The end of the world	Jay Chou	Fantasy Plus	December 28, 2001	Lyrics and music by Jay Chou
Dongfeng break	Jay Chou	Ye Huimei	July 31, 2003	Music by Jay Chou, lyrics by Fang Wenshan, arranged by Lin Michael
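
The date field is the most regular part of the description, so it makes a compact example of the cleaning step. This sketch assumes the original Chinese date format (year 年, month 月, day 日) and normalizes it to YYYY-MM-DD. The function name and the sample sentence are illustrative, and unlike the script above it simply takes the first full date it finds.

```python
import re

# Chinese dates look like 2004年8月3日 (year, month, day).
DATE = re.compile(r"([0-9]+)年([0-9]+)月([0-9]+)日")

def release_date(description):
    # Return the first full date as YYYY-MM-DD, or None if there is none.
    m = DATE.search(description)
    if not m:
        return None
    year, month, day = (int(g) for g in m.groups())
    return f"{year:04d}-{month:02d}-{day:02d}"

sample = "收录在周杰伦2004年8月3日发行的同名专辑《七里香》中"
print(release_date(sample))
```

Years mentioned alone, such as the award years, do not match because the pattern requires year, month, and day together.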

The last step is the format conversion and adding the tags:

import os
import pydub

def getlist(file):
    # Read a text file and return its non-empty lines
    with open(file,"r") as p:
        lines=p.read().split("\n")
    return [line for line in lines if line]

music_dir="E:\\BaiduNetdiskDownload\\Jay Chou"
os.chdir(music_dir)
os.makedirs("test",exist_ok=True)  # output folder; no error if it already exists
for line in getlist("list.txt"):
    # Columns (tab-separated): title, artist, album, date, composer
    tmp=line.split("\t")
    song=pydub.AudioSegment.from_wav(tmp[0]+".wav")
    dic={"title":tmp[0],"artist":tmp[1],"album":tmp[2],"date":tmp[3],"composer":tmp[4]}
    # Convert to FLAC and write the tags in the same call
    song.export(os.path.join("test",tmp[0]+".flac"),format="flac",tags=dic)
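
Since several records had to be fixed by hand, it may help to validate each line of list.txt before converting. The sketch below uses hypothetical helper names: it checks that a line has exactly five tab-separated fields and builds the tag dictionary from it. pydub is imported only inside convert_to_flac, so the validation can be tried without pydub or ffmpeg installed.

```python
import os

FIELDS = ("title", "artist", "album", "date", "composer")

def line_to_tags(line):
    # Split one tab-separated record and pair the values with the tag names.
    values = line.split("\t")
    if len(values) != len(FIELDS):
        raise ValueError(f"expected {len(FIELDS)} fields, got {len(values)}: {line!r}")
    return dict(zip(FIELDS, values))

def convert_to_flac(line, out_dir="test"):
    import pydub  # imported lazily; needs pydub and ffmpeg to be installed
    tags = line_to_tags(line)
    song = pydub.AudioSegment.from_wav(tags["title"] + ".wav")
    out_path = os.path.join(out_dir, tags["title"] + ".flac")
    song.export(out_path, format="flac", tags=tags)

record = "Qilixiang\tJay Chou\tQilixiang\t2004-08-03\tJay Chou"
print(line_to_tags(record))
```

A malformed line then fails with a clear message instead of an IndexError halfway through the conversion.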

Looking back over the whole exercise, the format conversion turned out to be the simplest part.

Keywords: Python crawler

Added by pcw on Sun, 06 Mar 2022 07:55:52 +0200