What if the python crawler encounters dynamic encryption? Crawl the content of a review website

The text and pictures of this article come from the Internet, only for learning and communication, and do not have any commercial purpose. If you have any questions, please contact us in time for handling.

The following article is from early rising Python by Liu Zaoqi

Python crawler, data analysis, website development and other case tutorial videos can be viewed online for free

https://space.bilibili.com/523606542

The font mapping between the downloaded font and the encrypted font was given by the merchant a few days ago.

However, there is a problem that the font files of different pages are loaded dynamically. In other words, the mapping relationship you establish on this page cannot be used for another page.

Then there's no solution? In fact, it's not difficult, or the other party still gives a very clear thinking direction, because although the font of each page is dynamically loaded, this dynamic is only for the change of the code after font analysis, and the internal order of the font does not change, that is, as shown in the figure below

In every two pages, only the font code has changed, but the font position order has not changed, so we only need to extract the CSS style in the page before parsing the data of each page, and then locate the font file storage link from the CSS content, and then request the font file corresponding to this page and parse and construct the matching dictionary, The following steps are the same as the previous article.

At the beginning, our goal is to crawl all the business information of the designated food in a city, such as locating Guangzhou and searching Shaxian snacks, and then crawl all the search pages.

The first step is to construct all the URLs. Because the URL of each page has a certain law, this step is very simple. Extract all the pages from the first page and add them to the URL according to the law_ List, and the data is not encrypted

So this part of the code can be written like this

def get_url(url):          headers = {     "Host": "www.dianping.com",     "Referer":f"{url}",     "User-Agent":ua.random,     "Sec-Fetch-Dest": "document",     "Sec-Fetch-Mode": "navigate",     "Sec-Fetch-Site": "none",     "Sec-Fetch-User": "?1",     "Upgrade-Insecure-Requests": "1"     }          r = requests.get(url = url,headers = headers,proxies = get_ip())     soup = BeautifulSoup(r.text)     page_num = int(soup.find_all('a',class_ = 'PageLink')[-1].text)     url_list = [url + f"/p{i+1}" for i in range(page_num)]          return url_list

This part of the code is not difficult to understand. Construct the request - parse the page - extract the number of pages - simulate the URL, where get_ip() must return an ip that can be used. Whether you use a free or paid proxy, it will not be explained in detail here.

After completing the URL, we come to the most critical step, write a function, pass in a page, and return to the text matching Dictionary of the page. The first step is to take down the font, and the following four lines of code can be done

css_url = "http://" + re.search(r's3plus.meituan.net/(.*?)/svgtextcss/(.*?).css', page.text).group(0) # get CSS file css_value = requests.get(css_url).text addr_font = "http:" + re.search(r'address(.*?).woff', css_value).group(0).split(',')[-1][5:] price_font = "http:" + re.search(r'shopNum(.*?).woff', css_value).group(0).split(',')[-1][5:]

Let's take a brief look at this code. After we pass in the page obtained after a request

The first line of code uses a regular expression to extract the css link where the font is located

The second line of code uses requests to request css content

The last two lines of code use regular to extract the URL of the woff font file

If the page you send in is normal, now we have the URL of the font in the address and average price field. Below, you can use requests to download and save the two font files locally. The code is as follows

x = requests.get(addr_font).content
with open('addr.woff','wb+') as f:     f.write(x) x = requests.get(price_font).content with open('price.woff','wb+') as f:     f.write(x)

Now there are two font files in the working directory, and then you can operate according to the font encryption cracking method introduced in the previous article. Therefore, the complete code of this part is as follows:

def get_font(page):     '''     Page after receiving the request     Return to this page url typeface woff Two dictionary files corresponding to the file     '''python     css_url = "http://" + re.search(r's3plus.meituan.net/(.*?)/svgtextcss/(.*?).css', page.text).group(0) # get CSS file # css_value = requests.get(css_url).text     addr_ font = "http:" + re.search(r'address(.*?).woff', css_value).group(0).split(',')[-1][5:]     price_ font = "http:" + re.search(r'shopNum(.*?).woff', css_value). group(0). Split ('', '[- 1] [5:] # download the font and save it to local # x = requests get(addr_font). content     with open('addr.woff','wb+') as f:         f.write(x)     x = requests.get(price_font).content # with # open('price.woff','wb + ') as # F: f.write (x) # parse font # font_addr = TTFont('addr.woff')     font1 = font_addr.getGlyphOrder()[2:]      font1 = [font1[i][-4:] for i in range(len(font1))]          font_price = TTFont('price.woff')     font2 = font_price.getGlyphOrder()[2:]      font2 = [font2[i][-4:] for i in range(len(font2))]          font3 =  ['1', '2', '3', '4', '5', '6', '7', '8', '9', '0', 'store', 'medium', 'beautiful', 'home', 'Museum', 'small', 'car', 'large', 'city', 'public', 'wine', 'line', 'country', 'product', 'hair', 'electricity', 'gold', 'heart', 'industry', 'business',' company ',' super ',' health ',' decoration ',' garden ',' field ',' food ',' have ',' new ',' limit ',' Day ',' surface ',' work '] ',' service ',' sea ',' China ',' water ',' house ',' decoration ',' city ',' music ',' steam ',' fragrance ',' department ',' profit ',' son ',' old ',' art ',' flower ',' Specialty ',' East ',' meat ',' vegetable ',' learning ',' Blessing ',' rice ',' people ',' hundred ',' meal ',' tea ',' service ',' communication ',' taste ',' place ',' Mountain ',' district ',' gate ',' medicine ',' Silver ',' agriculture ',' Dragon ',' stop ',' Shang ', 'an', 'Guang', 'Xin', 'Yi', 'Rong', 'Dong', 'Nan', 'Ju', 'Yuan', 'Xing', 'fresh', 'Ji', 'Shi', 'machine', 'roast', 'Wen', 'Kang', 'Xin', 'fruit', 'Yang', 'Li', 'pot', 'treasure', 'Da', 'land', 'son', 'clothes',' special ',' property ',' West ',' batch ',' Fang ',' state ',' cattle ',' Jia ',' Hua ',' five ',' rice ',' repair ',' love ', 'North', 'raising', 'selling', 'building', 'material', 'three', 'meeting', 'chicken', 'room', 'Red', 'station', 'virtue', 'King', 'light', 'name', 'beauty', 'oil', 'courtyard', 'Hall', 'burning', 'River', 'society', 'combination', 'star', 'goods',' type ',' village ',' self ',' branch ',' fast ',' defecate ',' Day ',' people ',' camp ',' and ',' living ',' children ',' Ming ',' utensils', 'smoke', 'Yu', 'bin', 'Jing', 'Wu', 'Jing', 'Ju', 'Zhuang', 'Shi', 'Shun', 'Lin', 'er', 'County', 'hand', 'Hall', 'pin', 'use', 'good', 'guest', 'fire', 'elegance', 'Sheng', 'body', 'brigade', 'Zhi', 'shoes',' spicy ',' make ',' powder ',' bag ',' building ',' school ',' fish ',' flat ',' color ',' upper ',' bar ',' protect ',' forever ',' 10000 ',' things',  'teach',

Added by [uk]stuff on Tue, 08 Feb 2022 08:07:02 +0200

Programming VIP

What if the python crawler encounters dynamic encryption? Crawl the content of a review website

Popular Keywords