Python crawlers: disguising headers and using proxies with requests and selenium to deal with anti-crawling mechanisms

  • Contents
    • 1. requests sends requests with disguised headers
    • 2. selenium simulates a browser to disguise headers
    • 3. requests sends requests using an IP proxy
    • 4. selenium webdriver uses a proxy IP

When writing crawlers, you will find that some websites set up anti-crawling mechanisms that refuse to respond to non-browser requests, or that frequent crawling within a short time triggers the anti-crawling mechanism and gets your IP blocked, so pages can no longer be fetched. In these cases you need to modify the request headers in the crawler to disguise it as browser access, or send the request through a proxy, so as to bypass the site's anti-crawling mechanism and obtain the correct page.

This article uses Python 3.6, the common request library requests, and the automated testing library selenium, which drives a real browser.
For how to use these two libraries, please refer to the official documentation or another blog post: Method for a Python crawler to obtain the HTML content of a web page and download attachments.

1. requests sends requests with disguised headers

In the headers of a request sent by requests, the User-Agent identifies the request as coming from a Python program, as shown below:

import requests

url = 'https://httpbin.org/headers'

response = requests.get(url)

if response.status_code == 200:
    print(response.text)

The returned result:

{
  "headers": {
    "Accept": "*/*", 
    "Accept-Encoding": "gzip, deflate", 
    "Host": "httpbin.org", 
    "User-Agent": "python-requests/2.20.1"
  }
}

Note: https://httpbin.org is an open-source site for testing web requests. For example, the /headers endpoint above returns the headers of the request it received, and the /ip endpoint used later in this article returns the client's IP address. See its official site for details.

User-Agent: a user agent is an identifier supplied by software acting on behalf of a user. It identifies the browser type and version, operating system and version, browser engine, and other information. For example, the default requests value above, python-requests/2.20.1, names the library and its version, while a browser User-Agent such as the Safari one used below encodes the platform, OS version, engine, and browser version. See the Wikipedia entry User agent for details.

Websites with anti-crawling measures recognize such headers and refuse to return the correct page. In that case you need to disguise the request with a browser's headers.

Open https://httpbin.org/headers in a browser and you will see a page like the one below, showing the browser's own headers.

[Image: the browser's request headers as displayed by https://httpbin.org/headers]

Copy those request headers and pass them to requests.get() to disguise the request as coming from a browser:

import requests

url = 'https://httpbin.org/headers'

myheaders = {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8",
    "Accept-Encoding": "br, gzip, deflate",
    "Accept-Language": "zh-cn",
    "Host": "httpbin.org",
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0.3 Safari/605.1.15"
  }

response = requests.get(url, headers=myheaders)

if response.status_code == 200:
    print(response.text)


The returned result becomes:

{
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8", 
    "Accept-Encoding": "br, gzip, deflate", 
    "Accept-Language": "zh-cn", 
    "Host": "httpbin.org", 
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0.3 Safari/605.1.15"
  }
}

In a real crawler you can rotate User-Agent strings at random to reduce the chance of triggering anti-crawling measures, as sketched below.
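
A minimal sketch of such rotation; the pool below contains only a few illustrative User-Agent strings, and a real crawler would maintain a larger one:

import random
import requests

# An illustrative pool of User-Agent strings; extend it with real ones as needed.
USER_AGENTS = [
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0.3 Safari/605.1.15",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36",
    "Mozilla/5.0 (X11; Linux x86_64; rv:65.0) Gecko/20100101 Firefox/65.0",
]

def get_with_random_ua(url):
    # Pick a different User-Agent for each request.
    headers = {"User-Agent": random.choice(USER_AGENTS)}
    return requests.get(url, headers=headers)

response = get_with_random_ua('https://httpbin.org/headers')
print(response.text)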

2. selenium simulates a browser to disguise headers

Selenium is an automated testing tool that can drive a real browser to visit websites. This article uses selenium 3.14.0, which supports the Chrome and Firefox browsers. To drive a browser you need to download the driver matching your browser version.

Driver download addresses: ChromeDriver at https://chromedriver.chromium.org/downloads, geckodriver (for Firefox) at https://github.com/mozilla/geckodriver/releases.

Use webdriver to visit the headers page with the browser's own default headers:

from selenium import webdriver

url = 'https://httpbin.org/headers'

driver_path = '/path/to/chromedriver'
browser = webdriver.Chrome(executable_path=driver_path)
browser.get(url)

print(browser.page_source)
browser.close()

The returned page source is printed:

<html xmlns="http://www.w3.org/1999/xhtml"><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">{
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Accept-Language": "zh-CN,zh;q=0.9", 
    "Host": "httpbin.org", 
    "Upgrade-Insecure-Requests": "1", 
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36"
  }
}
</pre></body></html>

The browser driver can also disguise its user agent; just pass the setting in when creating the webdriver browser object:

from selenium import webdriver

url = 'https://httpbin.org/headers'

user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0.3 Safari/605.1.15"

driver_path = '/path/to/chromedriver'

opt = webdriver.ChromeOptions()
opt.add_argument('--user-agent=%s' % user_agent)

browser = webdriver.Chrome(executable_path=driver_path, options=opt)
browser.get(url)

print(browser.page_source)
browser.close()

Now the returned User-Agent matches the value that was passed in:

<html xmlns="http://www.w3.org/1999/xhtml"><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">{
  "headers": {
    "Accept": "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8", 
    "Accept-Encoding": "gzip, deflate, br", 
    "Accept-Language": "zh-CN,zh;q=0.9", 
    "Host": "httpbin.org", 
    "Upgrade-Insecure-Requests": "1", 
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0.3 Safari/605.1.15"
  }
}
</pre></body></html>

The Firefox driver is configured differently:

from selenium import webdriver

url = 'https://httpbin.org/headers'

user_agent = "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0.3 Safari/605.1.15"

driver_path = '/path/to/geckodriver'

profile = webdriver.FirefoxProfile()
profile.set_preference("general.useragent.override", user_agent)

browser = webdriver.Firefox(executable_path=driver_path, firefox_profile=profile)
browser.get(url)

print(browser.page_source)
browser.close()
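
Rather than loading a page and reading its HTML, you can also verify the override by querying the browser's JavaScript environment directly. Adding the following line before browser.close() in the snippet above should print the user agent that was passed in:

# Ask the browser which user agent it reports to pages.
print(browser.execute_script("return navigator.userAgent"))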

3. requests sends requests using an IP proxy

When one IP accesses a site too frequently within a short time, the site's anti-crawling mechanism is triggered: the site may demand a verification code or even block the IP from accessing it at all. At that point you need to send requests through a proxy IP in order to obtain the correct page. (Simply pacing your requests also helps, as sketched below.)
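
Before reaching for proxies, it is often enough to slow the crawler down. A minimal sketch of pacing with a random delay between requests (the URL list is just a stand-in):

import random
import time

import requests

urls = ['https://httpbin.org/ip'] * 3  # stand-in for the pages to crawl

for url in urls:
    response = requests.get(url)
    print(response.status_code)
    # Wait 1-3 seconds between requests so as not to hammer the site.
    time.sleep(random.uniform(1, 3))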

Visit https://httpbin.org/ip to check your current IP address:

import requests

url = 'https://httpbin.org/ip'
response = requests.get(url)
print(response.text)

It returns your network's public IP address:

{
  "origin": "202.121.167.254, 202.121.167.254"
}

Using a proxy IP is similar to disguising headers in section 1. The keys of the proxies dict ('http' and 'https') select the proxy by the scheme of the target URL, so both are set here. Suppose we have obtained a proxy at 58.58.213.55:8888; it is used as follows:

import requests

proxy = {
    'http': 'http://58.58.213.55:8888',
    'https': 'http://58.58.213.55:8888'
}
response = requests.get('https://httpbin.org/ip', proxies=proxy)
print(response.text)

The proxy IP is returned:

{
  "origin": "58.58.213.55, 58.58.213.55"
}
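
In practice you would combine the disguised headers from section 1 with the proxy, and guard against slow or dead proxies with a timeout and error handling. A minimal sketch putting the pieces together, reusing the same example proxy:

import requests

url = 'https://httpbin.org/ip'

myheaders = {
    "User-Agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0.3 Safari/605.1.15"
}
proxy = {
    'http': 'http://58.58.213.55:8888',
    'https': 'http://58.58.213.55:8888'
}

try:
    # Fail fast if the proxy is slow or unreachable.
    response = requests.get(url, headers=myheaders, proxies=proxy, timeout=10)
    if response.status_code == 200:
        print(response.text)
except requests.RequestException as e:
    print('request through proxy failed:', e)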

4. selenium webdriver uses a proxy IP

The Chrome driver uses a proxy IP in much the same way as it disguises the user agent:

from selenium import webdriver

url = 'https://httpbin.org/ip'
proxy = '58.58.213.55:8888'
driver_path = '/path/to/chromedriver'
opt = webdriver.ChromeOptions()
opt.add_argument('--proxy-server=' + proxy)

browser = webdriver.Chrome(executable_path=driver_path, options=opt)
browser.get(url)
print(browser.page_source)
browser.close()

The printed result:

<html xmlns="http://www.w3.org/1999/xhtml"><head></head><body><pre style="word-wrap: break-word; white-space: pre-wrap;">{
  "origin": "58.58.213.55, 58.58.213.55"
}
</pre></body></html>
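
Free proxies are often short-lived, so before launching a whole browser through one it can save time to test it with a quick requests call first. A small sketch of such a check (the proxy_is_alive helper is made up for illustration):

import requests

def proxy_is_alive(proxy, test_url='https://httpbin.org/ip', timeout=5):
    # Return True if the proxy answers a simple request within the timeout.
    proxies = {'http': 'http://' + proxy, 'https': 'http://' + proxy}
    try:
        r = requests.get(test_url, proxies=proxies, timeout=timeout)
        return r.status_code == 200
    except requests.RequestException:
        return False

if proxy_is_alive('58.58.213.55:8888'):
    print('proxy OK, launch the browser')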

The Firefox driver's settings are slightly different: if the proxy requires user authentication, you also need to download the browser extension close_proxy_authentication, which cancels the proxy's user-authentication prompt; it can be found through a Google search. This article uses Firefox 65.0.1 (64-bit), and the extension file used is close_proxy_authentication-1.1-sm+tb+fx.xpi.

The code is as follows:

from selenium import webdriver

url = 'https://httpbin.org/ip'
proxy_ip = '58.58.213.55'
proxy_port = 8888
xpi = '/path/to/close_proxy_authentication-1.1-sm+tb+fx.xpi'
driver_path = '/path/to/geckodriver'

profile = webdriver.FirefoxProfile()
profile.set_preference('network.proxy.type', 1)  # 1 = manual proxy configuration
profile.set_preference('network.proxy.http', proxy_ip)         # HTTP proxy host
profile.set_preference('network.proxy.http_port', proxy_port)  # HTTP proxy port
profile.set_preference('network.proxy.ssl', proxy_ip)          # HTTPS proxy host
profile.set_preference('network.proxy.ssl_port', proxy_port)   # HTTPS proxy port
profile.set_preference('network.proxy.no_proxies_on', 'localhost, 127.0.0.1')  # bypass the proxy for local addresses
profile.add_extension(xpi)  # extension that suppresses the proxy authentication dialog

browser = webdriver.Firefox(executable_path=driver_path, firefox_profile=profile)
browser.get(url)
print(browser.page_source)
browser.close()

The proxy IP is printed as well (Firefox renders the JSON through its built-in viewer, hence the extra markup in the page source):

<html platform="mac" class="theme-light" dir="ltr"><head><meta http-equiv="Content-Security-Policy" content="default-src 'none' ; script-src resource:; "><link rel="stylesheet" type="text/css" href="resource://devtools-client-jsonview/css/main.css"></head><body><div id="content"><div id="json">{
  "origin": "58.58.213.55, 58.58.213.55"
}
</div></div><script src="resource://devtools-client-jsonview/lib/require.js" data-main="resource://devtools-client-jsonview/viewer-config.js"></script></body></html>

Reposted from: https://blog.csdn.net/XnCSD/article/details/88615791

Keywords: Python Selenium crawler
