Decrypting the monthly ticket font on the Qidian Chinese Website (with page turning to collect the monthly tickets as a bonus)

I haven't written code in a long time. I suddenly felt the urge to get back into it, so I picked the Qidian Chinese Website as my target (●ˇ∀ˇ●)
Enough talk, on to the code.
Let's first analyze the Qidian page:
https://www.qidian.com/rank/yuepiao/year2022-month01/
As usual, after opening the page, press F12 and click the Network tab, as shown in the figure below

First we need to find the content we want to crawl. Today we will crawl the book titles and their monthly ticket counts.

**Open the request indicated by the arrow and check its Preview tab: the data we are looking for is not there. Check whether it is in the Response tab instead: press Ctrl+F, search for a title such as Stargate, and it turns up there.**


That is how we locate the titles. The code for extracting them is as follows:

import random
import requests
from lxml import etree

# URL of the Qidian monthly ticket ranking list
url = 'https://www.qidian.com/rank/yuepiao/year2022-month01/'
# Request headers
headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.193 Safari/537.36',
    'referer': 'https://www.qidian.com/rank/',
    'cookie': 'e1=%7B%22pid%22%3A%22qd_P_rank_01%22%2C%22eid%22%3A%22qd_C19%22%2C%22l1%22%3A4%7D; e2=%7B%22pid%22%3A%22qd_P_rank_01%22%2C%22eid%22%3A%22%22%2C%22l1%22%3A4%7D; _yep_uuid=fd95b6b7-090e-c6e5-cb8c-b8387e5b29ab; _ga=GA1.1.376581816.1643601078; newstatisticUUID=1643601078_1599172947; _csrfToken=m8mDkhtjc381bOHrIGiYTkE1g3bUzgPZjExmmO9l; _ga_FZMMH98S83=GS1.1.1643601077.1.1.1643601098.0; _ga_PFYW0QLV3P=GS1.1.1643601077.1.1.1643601098.0'
}
# Send the request and parse the response
response = requests.get(url, headers=headers)
response_text = response.text
html_data = etree.HTML(response_text)
# Get the titles through XPath
title_list = html_data.xpath('//h2/a/text()')
print(title_list)

**Run the code and the titles of the novels on the first page are printed (as a list).**

Of course, we also need the monthly ticket counts of these novels.

The page source does not contain the monthly ticket counts as plain digits. Let's first extract the raw values that stand in for them:

import re

# Get the raw (still encrypted) monthly ticket counts with a regular expression
re_data = re.findall(r'</style><span class=".*?">(.*?)</span>', response_text)
print(re_data)

The effect is as follows
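As a rough offline illustration of what this regular expression captures, the sketch below runs it on a hand-written HTML fragment. The fragment, the class name `abCdEfGh` and the `&#...;` entity form are assumptions modeled on the pattern; only the code points 100388, 100389 and 100385 are taken from the output shown later in this article.

```python
import re

# Hand-written fragment (an assumption, not copied from the real page):
# a <style> block followed by a <span> holding numeric character references.
sample_html = (
    '<span><style>@font-face { font-family: abCdEfGh; }</style>'
    '<span class="abCdEfGh">&#100388;&#100389;&#100388;&#100385;&#100385;</span></span>'
)

# Same pattern as in the article: capture the text of the span that
# directly follows a closing </style> tag.
re_data = re.findall(r'</style><span class=".*?">(.*?)</span>', sample_html)
print(re_data)
```

The captured string is a run of character references rather than digits, which is why it cannot be read directly.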

As you can see, this differs from what the web page displays. So what is it? A reasonable guess is that the monthly ticket counts are encrypted with a custom font. To verify this, look at the font declaration in the page source: it carries a src URL.

Worse, this src is dynamic (cue mental breakdown): every time the page is loaded, a new font is generated at random. Compare the font requests in the Network tab below.


The code for obtaining the dynamic font URL is as follows:

# Get the dynamic font URL with a regular expression
font_url = re.findall(r"format\('eot'\); src: url\('(.*?)'\) format\('woff'\)", response_text)[0]
print(font_url)
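To see what this pattern expects, here is a small sketch that applies it to a hand-written `@font-face` rule. The rule, the `example.com` host and the font name are placeholders, and `extract_font_url` is a hypothetical helper, not something from the site or from any library. Unlike the `[0]` indexing above, it returns `None` when nothing matches, which may be easier to debug if the page layout changes.

```python
import re

FONT_URL_PATTERN = r"format\('eot'\); src: url\('(.*?)'\) format\('woff'\)"

def extract_font_url(text):
    """Return the first woff URL matched by the pattern, or None if absent."""
    matches = re.findall(FONT_URL_PATTERN, text)
    return matches[0] if matches else None

# Hand-written rule; host, path and font name are placeholders.
sample_css = (
    "@font-face { font-family: abCdEfGh; "
    "src: url('https://example.com/fonts/abCdEfGh.eot?') format('eot'); "
    "src: url('https://example.com/fonts/abCdEfGh.woff') format('woff'), "
    "url('https://example.com/fonts/abCdEfGh.ttf') format('truetype'); }"
)

print(extract_font_url(sample_css))
print(extract_font_url("no font rule here"))
```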

From here the approach is clear: use the downloaded font file to decrypt the encrypted data in the page source.

from fontTools.ttLib import TTFont

# Send a request to download the font file used for the encryption
font_response = requests.get(font_url, headers=headers)
with open('jiemi.woff', 'wb') as f:
    f.write(font_response.content)
# Parse the downloaded font file
# Create a TTFont object
font_obj = TTFont('jiemi.woff')
# Dump it as plain XML so it can be inspected by hand
font_obj.saveXML('jiemi.xml')

# Get the mapping table (code point -> glyph name)
cmap_dict = font_obj.getBestCmap()
print("Font encryption mapping table", cmap_dict)
# Strip the entity markup and keep only the code points, e.g. &#100196 => 100196
for index, value in enumerate(re_data):
    re_data[index] = re.findall(r'\d+', value)
print("Remove special symbols", re_data)
# Translate the English glyph names in the mapping table into Arabic digits, e.g. {100196: '3'}
dict_e_a = {
    "one": '1', "two": '2', "three": '3', "four": '4', "five": '5',
    "six": '6', "seven": '7', "eight": '8', "nine": '9', "zero": '0'
}

# Walk the mapping table and replace every glyph name that is a key of dict_e_a
for code, name in cmap_dict.items():
    if name in dict_e_a:
        cmap_dict[code] = dict_e_a[name]

print("Relationship mapping table after replacing with numbers", cmap_dict)
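The name-to-digit step can also be written as a dictionary comprehension. The sketch below is one possible variant, not the code used in this article: `to_digit_map` is a hypothetical helper, and `sample_cmap` is a made-up table shaped like the result of `getBestCmap()`. One difference to be aware of: it drops entries whose glyph name is not a digit, whereas the loop above leaves them in place.

```python
# Glyph-name-to-digit table, same content as dict_e_a in the article
DIGIT_NAMES = {
    "zero": "0", "one": "1", "two": "2", "three": "3", "four": "4",
    "five": "5", "six": "6", "seven": "7", "eight": "8", "nine": "9",
}

def to_digit_map(cmap):
    """Return {code point: digit}, skipping glyphs whose name is not a digit."""
    return {code: DIGIT_NAMES[name] for code, name in cmap.items() if name in DIGIT_NAMES}

# Made-up mapping; real code points and names change with every font download.
sample_cmap = {100385: "zero", 100388: "three", 100389: "seven", 100390: "period"}
print(to_digit_map(sample_cmap))
```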

# Turn the ciphertext into plaintext (e.g. 100196 -> "3") by looking up each code point in the digit mapping table
for codes in re_data:  # re_data is now a list of lists: [[...], [...], ...]
    print(codes)  # e.g. ['100388', '100389', '100388', '100385', '100385']
    for index, code in enumerate(codes):  # e.g. '100388'
        key = int(code)
        if key in cmap_dict:
            codes[index] = cmap_dict[key]
print("Number of monthly tickets after parsing", re_data)

# Join the single digits into the complete monthly ticket count
list_ = []
for digits in re_data:
    list_.append(''.join(digits))
print("Final monthly ticket plaintext data list", list_)

# Combine the book titles and the monthly ticket counts into a dictionary {book title: "number of monthly tickets"}
rank_dict = {}
for i in range(len(title_list)):
    rank_dict[title_list[i]] = list_[i]
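The last three steps (strip the markup, look up each code point, join the digits) can be folded into one small function. This is only a sketch under stated assumptions: `decode_counts` is a hypothetical helper, the digit map and raw strings are made up, and the titles are placeholders. Unknown code points are rendered as `?` so that a stale font mapping is visible in the output instead of raising an error.

```python
import re

def decode_counts(raw_values, digit_map):
    """Turn strings like '&#100388;&#100389;' into digit strings via digit_map."""
    decoded = []
    for raw in raw_values:
        codes = re.findall(r"\d+", raw)  # keep only the numeric code points
        decoded.append("".join(digit_map.get(int(code), "?") for code in codes))
    return decoded

# Made-up inputs for illustration only
digit_map = {100385: "0", 100388: "3", 100389: "7"}
raw_values = ["&#100388;&#100389;&#100388;&#100385;&#100385;", "&#100389;&#100385;"]
titles = ["Book A", "Book B"]

counts = decode_counts(raw_values, digit_map)
rank_dict = dict(zip(titles, counts))
print(rank_dict)
```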

One page is not enough, so I also handled page turning. Turning pages is not very difficult; the decryption was the hard part. Compare the URLs of pages 1, 2 and 3:
Page 1: https://www.qidian.com/rank/yuepiao/year2022-month01/
Page 2: https://www.qidian.com/rank/yuepiao/year2022-month01-page2/
Page 3: https://www.qidian.com/rank/yuepiao/year2022-month01-page3/
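The pattern in these three URLs can be captured in a small helper. `page_url` is a hypothetical function, and the `year` and `month` parameters are an assumption that other months follow the same `yearYYYY-monthMM` scheme; only the January 2022 URLs above have been observed here.

```python
def page_url(page, year=2022, month=1):
    """Build the ranking URL for a 1-based page number."""
    base = f"https://www.qidian.com/rank/yuepiao/year{year}-month{month:02d}"
    # Page 1 has no suffix; later pages append -page2, -page3, ...
    return f"{base}/" if page == 1 else f"{base}-page{page}/"

for page in range(1, 4):
    print(page_url(page))
```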
The complete page turning code is as follows

import random
import requests
import time
from lxml import etree
from fontTools.ttLib import TTFont
import re

pages = int(input('Please enter the number of pages to query: '))
for page in range(pages):
    if page == 0:
        # Page 1 of the Qidian monthly ticket ranking has no page suffix
        url = 'https://www.qidian.com/rank/yuepiao/year2022-month01/'
    else:
        # Later pages use the suffix -page2, -page3, ...
        url = f'https://www.qidian.com/rank/yuepiao/year2022-month01-page{page + 1}/'
    # Request header
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.193 Safari/537.36',
        'referer': 'https://www.qidian.com/rank/',
        'cookie': 'e1=%7B%22pid%22%3A%22qd_P_rank_01%22%2C%22eid%22%3A%22qd_C19%22%2C%22l1%22%3A4%7D; e2=%7B%22pid%22%3A%22qd_P_rank_01%22%2C%22eid%22%3A%22%22%2C%22l1%22%3A4%7D; _yep_uuid=fd95b6b7-090e-c6e5-cb8c-b8387e5b29ab; _ga=GA1.1.376581816.1643601078; newstatisticUUID=1643601078_1599172947; _csrfToken=m8mDkhtjc381bOHrIGiYTkE1g3bUzgPZjExmmO9l; _ga_FZMMH98S83=GS1.1.1643601077.1.1.1643601098.0; _ga_PFYW0QLV3P=GS1.1.1643601077.1.1.1643601098.0'
    }
    # Response data
    response = requests.get(url, headers=headers)
    response_text = response.text
    html_data = etree.HTML(response_text)
    # Get the title through xpath
    title_list = html_data.xpath('//h2/a/text()')
    print(title_list)
    # Get the raw (still encrypted) monthly ticket counts with a regular expression
    re_data = re.findall(r'</style><span class=".*?">(.*?)</span>', response_text)
    print(re_data)
    # Get the dynamic font URL with a regular expression
    font_url = re.findall(r"format\('eot'\); src: url\('(.*?)'\) format\('woff'\)", response_text)[0]
    # Send a request to download the font file used for the encryption
    font_response = requests.get(font_url, headers=headers)
    with open('jiemi.woff', 'wb') as f:
        f.write(font_response.content)
    # Parse the downloaded font file
    # Create a TTFont object
    font_obj = TTFont('jiemi.woff')
    # Dump it as plain XML so it can be inspected by hand
    font_obj.saveXML('jiemi.xml')
    # Get the mapping table (code point -> glyph name)
    cmap_dict = font_obj.getBestCmap()
    print("Font encryption mapping table", cmap_dict)
    # Strip the entity markup and keep only the code points, e.g. &#100196 => 100196
    for index, value in enumerate(re_data):
        re_data[index] = re.findall(r'\d+', value)
    print("Remove special symbols", re_data)
    # Translate the English glyph names in the mapping table into Arabic digits, e.g. {100196: '3'}
    dict_e_a = {
        "one": '1', "two": '2', "three": '3', "four": '4', "five": '5',
        "six": '6', "seven": '7', "eight": '8', "nine": '9', "zero": '0'
    }
    # Walk the mapping table and replace every glyph name that is a key of dict_e_a
    for code, name in cmap_dict.items():
        if name in dict_e_a:
            cmap_dict[code] = dict_e_a[name]
    print("Relationship mapping table after replacing with numbers", cmap_dict)
    # Turn the ciphertext into plaintext (e.g. 100196 -> "3") by looking up each code point in the digit mapping table
    for codes in re_data:  # re_data is now a list of lists: [[...], [...], ...]
        print(codes)  # e.g. ['100388', '100389', '100388', '100385', '100385']
        for index, code in enumerate(codes):  # e.g. '100388'
            key = int(code)
            if key in cmap_dict:
                codes[index] = cmap_dict[key]
    print("Number of monthly tickets after parsing", re_data)
    # Join the single digits into the complete monthly ticket count
    list_ = []
    for digits in re_data:
        list_.append(''.join(digits))
    print("Final monthly ticket plaintext data list", list_)
    # Combine the book titles and the monthly ticket counts into a dictionary {book title: "number of monthly tickets"}
    rank_dict = {}
    for i in range(len(title_list)):
        rank_dict[title_list[i]] = list_[i]
    print(f"Final result for page {page + 1}", rank_dict)
    print('-' * 50)
    # Sleep for 1 to 2 seconds between pages to avoid triggering anti-crawling measures
    time.sleep(random.randint(1, 2))

The effect is as follows:


If you liked this article, follow me; more good articles will be published later (●'◡'●)

Keywords: crawler

Added by d3chapma on Tue, 01 Feb 2022 05:07:33 +0200