Python crawler theory: cookie verification. You can't understand where the tricks come from without looking back at the history!

Cookies were originally invented to help the server keep user information in sync across web pages and to save the user's actions, which reduces the pressure on the server.

Before cookies existed, the web was stuck in the same situation as TV: pages could only be served on demand, and a website could not tell who it was communicating with.

Aside: the first-generation password was a universal key.

With cookies, you can actually interact with a web page, and that is what makes a website account possible.

A cookie created by the website you are browsing is called a first-party cookie.

This one is very important. If you refuse to believe it and block first-party cookies anyway,

well, congratulations on returning to the radio age.
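
To make the difference concrete, here is a minimal sketch of a site that cannot recognize a returning visitor without a cookie and can with one. The `handle_request` function, the `visitor_id` cookie name, and the in-memory `known_visitors` store are all hypothetical, invented for illustration; only `http.cookies.SimpleCookie` is a real standard-library class.

```python
from http.cookies import SimpleCookie
import uuid

# Hypothetical in-memory store: visitor id -> number of visits
known_visitors = {}

def handle_request(cookie_header):
    """Simulate a server handling one request.

    cookie_header is the value of the request's Cookie header ('' if none).
    Returns (body, set_cookie), where set_cookie is a Set-Cookie value or None.
    """
    jar = SimpleCookie()
    jar.load(cookie_header)
    morsel = jar.get("visitor_id")
    if morsel is not None and morsel.value in known_visitors:
        known_visitors[morsel.value] += 1
        return f"Welcome back, visit #{known_visitors[morsel.value]}", None
    # No known cookie: the server cannot tell who this is, so it issues an id
    new_id = uuid.uuid4().hex
    known_visitors[new_id] = 1
    return "Hello, stranger", f"visitor_id={new_id}; Path=/"

# First visit: no cookie, so the server hands one out
body1, set_cookie = handle_request("")
# The browser stores the cookie and sends name=value back on the next request
cookie_to_send = set_cookie.split(";")[0]
body2, _ = handle_request(cookie_to_send)
# With first-party cookies blocked, every request looks like a first visit
body3, _ = handle_request("")
print(body1, body2, body3, sep="\n")
```

With the cookie sent back, the second request is recognized; without it, the site is back to broadcasting the same page to everyone.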

The Python requests library has cookie handling enabled by default.

- Check the cookies

import requests
from requests.cookies import RequestsCookieJar

headers = {
    'Host': 'accounts.douban.com',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.90 Safari/537.36',
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3',
    'Accept-Language': 'zh-CN,zh;q=0.9,en;q=0.8',
    'Accept-Encoding': 'gzip, deflate, br',
    'Connection': 'keep-alive'
}
request_url = "https://accounts.douban.com/passport/login"
res = requests.get(request_url, headers=headers)

status_code = res.status_code
res_header = res.headers
res_cookies = res.cookies
cookie1111 = res.cookies.get_dict()                             # Cookie jar converted to a plain dict
cookie2222 = requests.utils.dict_from_cookiejar(res_cookies)    # Same conversion via the utils helper
for cookie in res_cookies:
    print(cookie.name+"\t"+cookie.value)

print("Response status code:", status_code)
print("Response headers:", res_header)
print("response cookies: ", res_cookies)
print("format cookie1111 :", cookie1111)
print("format cookie2222 :", cookie2222)

- The cookies come built in, right here. Plain as day!
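
A bare `requests.get` call only reports the cookies in `res.cookies`; it does not send them on later calls. To carry them forward the way a browser does, `requests.Session` can be used. The sketch below stays offline: it puts a made-up `sessionid` cookie for the placeholder domain `example.com` into the session's jar and only prepares requests without sending them.

```python
import requests

session = requests.Session()
# Assumption: pretend an earlier response from example.com set this cookie
session.cookies.set("sessionid", "abc123", domain="example.com", path="/")

# prepare_request builds the outgoing request, including the Cookie header,
# without touching the network
same_site = session.prepare_request(
    requests.Request("GET", "https://example.com/profile"))
other_site = session.prepare_request(
    requests.Request("GET", "https://example.org/"))

print(same_site.headers.get("Cookie"))   # the cookie is attached for example.com
print(other_site.headers.get("Cookie"))  # nothing is attached for example.org
```

The jar only releases a cookie to the domain it belongs to, which is the same rule that later separates first-party from third-party cookies.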

Next, we introduce the concept of third-party cookies.

Use a clean browser to see the effect.

Clear the browser's cookie records, or simulate a clean state.

Visit a website, csdn.net,

then click the small lock on the left side of the address bar to see this information.

Accessing csdn.net

There are also 40 other cookies. Cookies that belong to domains other than the address you are visiting are called third-party cookies.
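
The first-party versus third-party distinction comes down to comparing each cookie's domain with the site in the address bar. A rough sketch follows; the cookie names and domains are made-up sample data, and the plain suffix comparison is a simplification (real browsers consult the Public Suffix List to decide what counts as the same site).

```python
def is_first_party(cookie_domain, site_domain):
    """Return True if cookie_domain is site_domain or one of its subdomains."""
    cookie_domain = cookie_domain.lstrip(".").lower()
    site_domain = site_domain.lstrip(".").lower()
    return (cookie_domain == site_domain
            or cookie_domain.endswith("." + site_domain))

# Made-up sample: (cookie name, domain that set it)
sample_cookies = [
    ("session", ".csdn.net"),
    ("prefs", "blog.csdn.net"),
    ("tracker_id", ".baidu.com"),
    ("ad_id", ".ads.example.com"),
]

visited_site = "csdn.net"
for name, domain in sample_cookies:
    party = "first-party" if is_first_party(domain, visited_site) else "third-party"
    print(f"{name:12} {domain:18} {party}")
```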

Where do these third-party cookies come from? What role do they play?

When you visit CSDN, the site in turn contacts the baidu.com server.

Let's press F12 to open the browser's developer tools and look at the Network panel.

Take a closer look at how this website loads. Among its sources we can find baidu.com: the site uses code from baidu.com, written into its own pages.
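
What the developer tools show can also be approximated in code: collect the hosts that a page's `src` and `href` attributes point at, and keep those outside the site's own domain. The HTML below is a made-up fragment, not CSDN's real markup, and every URL in it is a placeholder.

```python
from html.parser import HTMLParser
from urllib.parse import urlparse

class ResourceHostCollector(HTMLParser):
    """Collect hostnames referenced by script, img, iframe and link tags."""

    def __init__(self):
        super().__init__()
        self.hosts = set()

    def handle_starttag(self, tag, attrs):
        if tag not in ("script", "img", "iframe", "link"):
            return
        for name, value in attrs:
            if name in ("src", "href") and value:
                host = urlparse(value).hostname
                if host:  # relative URLs have no hostname
                    self.hosts.add(host)

# Made-up page fragment for illustration
sample_html = """
<html><head>
<script src="https://static.csdn.net/js/app.js"></script>
<script src="https://hm.baidu.com/example.js"></script>
<link rel="stylesheet" href="/css/site.css">
</head><body><img src="https://img.example-cdn.com/logo.png"></body></html>
"""

collector = ResourceHostCollector()
collector.feed(sample_html)
site = "csdn.net"
third_party_hosts = sorted(
    h for h in collector.hosts if h != site and not h.endswith("." + site))
print(third_party_hosts)
```

Each host in that list receives its own request from the browser, and each response is free to set cookies for its own domain.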

- So while visiting this website, we are also using a service provided by Baidu. What is this service?

I have to mention another function of cookies.

In addition to binding a user's identity to a web page, cookies can also record browsing history.

This gives advertising providers the opportunity to embed their code modules into many different websites in order to deliver product recommendations.

Third-party cookies silently record your preferences. When you visit other websites, the previously recorded information is read back to make personalized advertising recommendations.
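
This tracking mechanism can be simulated in a few lines. It is a toy model, not any real ad network's code: the `AdProvider` class, the `tid` cookie name, and the site names are all invented. The point is only that one cookie, scoped to the provider's domain, travels along from every site that embeds the provider's code.

```python
import itertools

class AdProvider:
    """Toy third-party service embedded in many sites."""

    def __init__(self):
        self._ids = itertools.count(1)
        self.history = {}  # tracker id -> list of pages seen

    def serve(self, browser_cookies, embedding_page):
        """Handle the request the embedded code makes to the provider.

        browser_cookies is the browser's cookie store for the provider's
        domain; it is the same dict no matter which site embeds the code.
        """
        tid = browser_cookies.get("tid")
        if tid is None:  # first contact: hand out a tracker id
            tid = f"user-{next(self._ids)}"
            browser_cookies["tid"] = tid
        self.history.setdefault(tid, []).append(embedding_page)
        return tid

    def recommend(self, browser_cookies):
        """Pick an ad topic from the last page this tracker id was seen on."""
        pages = self.history.get(browser_cookies.get("tid"), [])
        return f"ads related to {pages[-1]}" if pages else "generic ads"

provider = AdProvider()
cookies_for_provider = {}  # third-party cookies allowed: the store persists

provider.serve(cookies_for_provider, "shop.example/cameras")
provider.serve(cookies_for_provider, "news.example/sports")
print(provider.history)
print(provider.recommend(cookies_for_provider))

# With third-party cookies blocked, the provider never gets its id back
print(provider.recommend({}))
```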

So is there no advertising once third-party cookies are disabled?

This is also the most common situation that crawlers run into.

If you manually simulate disabling third-party cookies, you will find that verification codes (CAPTCHAs) start to appear more often.

In response to this, crawler developers turned to another tool, selenium.
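
One common way to combine the two tools is to let a real browser driven by selenium collect the cookies and then hand them to requests. The helper below is a hedged sketch: `selenium_cookies_to_dict` is a hypothetical name, and the sample list only mimics the shape of what selenium's `driver.get_cookies()` returns (a list of dicts with `name` and `value` keys, among others). The browser part is shown in comments because it needs a locally installed driver.

```python
def selenium_cookies_to_dict(selenium_cookies):
    """Flatten selenium-style cookie dicts into {name: value} for requests."""
    return {c["name"]: c["value"] for c in selenium_cookies}

# With a real browser this list would come from something like:
#   from selenium import webdriver
#   driver = webdriver.Chrome()
#   driver.get("https://accounts.douban.com/passport/login")
#   selenium_cookies = driver.get_cookies()
#   driver.quit()
# Made-up sample in the same shape:
selenium_cookies = [
    {"name": "sessionid", "value": "abc123", "domain": ".example.com", "path": "/"},
    {"name": "theme", "value": "dark", "domain": ".example.com", "path": "/"},
]

cookies_dict = selenium_cookies_to_dict(selenium_cookies)
print(cookies_dict)
# The dict can then be passed along as requests.get(url, cookies=cookies_dict)
```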

In closing

1. Understanding the history helps us pinpoint problems more accurately.

2. Many bloggers only tell you that you need to bring cookies along on the second crawl, and don't mention third-party cookies at all:

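import requests

# Cookie header string copied from the browser, in the form "k1=v1; k2=v2"
cookies = "Copied from the Internet cookie value"

cookies_dict = {}
for item in cookies.split("; "):
    # Split on the first '=' only, because cookie values may contain '='
    name, _, value = item.partition("=")
    cookies_dict[name] = value

# The url is left empty as in the original; fill in the target address
html = requests.get(url='', cookies=cookies_dict)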

3. Keep learning and using more advanced tools.

Keywords: Python crawler

Added by areric on Tue, 14 Dec 2021 22:43:00 +0200