The proxy IP changes on every request, yet the crawler is still blocked
Customer problem
Product used:
Tunnel proxy, dynamic edition (the IP changes on every request).
Problem description:
The target website has an anti-crawler mechanism: the interval between two searches must be at least 25 seconds. After switching to our tunnel proxy, the user was still detected and could only search once every 25 seconds, so they suspected that our tunnel proxy was not actually changing the IP.
Technical support process
0x01 Ruling out a product problem
Following standard procedure, we first ran an access test through the user's tunnel proxy to confirm whether the IP changes on each request.
We added the public IP of our test machine to the whitelist of the user's tunnel proxy order, then used the curl command under Linux/macOS to access cip.cc (a service that echoes the requester's IP) for testing.
curl cip.cc -x tps1xx.kdlapi.com:15818
In the actual test we accessed cip.cc three times through the tunnel proxy; every request succeeded and a different IP was returned each time, which rules out a problem with the tunnel proxy itself. After explaining to the user that the tunnel proxy was not at fault, we kept digging to find the real cause for them.
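The same check can also be scripted. Below is a minimal Python sketch of it; the tunnel address is the one from the curl command above, it assumes the test machine's IP is whitelisted on the order (so no proxy credentials are needed), and the curl-style User-Agent is our addition, on the assumption that cip.cc replies in plain text to curl-like clients:

import requests

proxies = {
    "http": "http://tps1xx.kdlapi.com:15818",
    "https": "http://tps1xx.kdlapi.com:15818",
}

# Three consecutive requests through the tunnel: a correctly rotating
# tunnel should report a different exit IP each time.
for _ in range(3):
    resp = requests.get("http://cip.cc", proxies=proxies,
                        headers={"User-Agent": "curl/7.68.0"}, timeout=10)
    print(resp.text.strip().splitlines()[0])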
0x02 Analyzing the website
We then analyzed the website the user was visiting:
https://www.kquanben.com/modules/article/search.php
It turned out to be an ordinary novel-search page. Searching the keyword "overbearing president" twice in a row produced an error message: sure enough, the interval between two searches may not be less than 25 s.
Opening the Chrome DevTools console to inspect the outgoing traffic, we found that each search sends a POST request with the following headers.
# header
{
    'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
    'Accept-Language': 'zh-CN,zh;q=0.9',
    'Cache-Control': 'no-cache',
    'Connection': 'keep-alive',
    'Host': 'www.kquanben.com',
    'Pragma': 'no-cache',
    'Upgrade-Insecure-Requests': '1',
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/86.0.4240.111 Safari/537.36',
    'sec-ch-ua': '"Google Chrome";v="89", "Chromium";v="89", ";Not A Brand";v="99"',
    'sec-ch-ua-mobile': '?0',
    'Sec-Fetch-Dest': 'document',
    'Sec-Fetch-Mode': 'navigate',
    'Sec-Fetch-Site': 'none',
    'Sec-Fetch-User': '?1',
    # ...
}
The response headers are as follows.
{
    'content-type': 'text/html; charset=utf-8',
    'date': 'Wed, 14 Apr 2021 02:53:09 GMT',
    'location': 'result/?searchid=31435',
    'server': 'nginx',
    'set-cookie': 'alllc111lastsearchtime=1618368789; expires=Thu, 15-Apr-2021 02:53:09 GMT; Max-Age=86400; path=/; secure',
    'strict-transport-security': 'max-age=31536000',
    # ...
}
The Set-Cookie field caught our attention.
The number in alllc111lastsearchtime=1618368789 looks very much like a Unix timestamp. We inferred that on the first search the server returns this Set-Cookie field to record the user's last search time; on the second search the browser sends the cookie back, and if the server judges the interval between the two searches to be less than 25 s, it rejects the request outright. To verify this, we generated the timestamp ourselves, in real time, for each request; sure enough, the search limit disappeared and every request was answered normally. So the site's main anti-crawler measure is this cookie timestamp check, which has nothing to do with the client IP. In other words, the tunnel proxy is not the cause; the problem must be in the user's code.
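As a quick sanity check on that inference, decoding 1618368789 as a Unix timestamp reproduces exactly the time in the response's own date header:

import time

# 1618368789, read as seconds since the Unix epoch, is exactly the
# response's date header (Wed, 14 Apr 2021 02:53:09 GMT), confirming
# that the cookie stores the time of the last search.
print(time.strftime("%a, %d %b %Y %H:%M:%S GMT", time.gmtime(1618368789)))
# Wed, 14 Apr 2021 02:53:09 GMT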
0x03 Checking the user's code
We asked the user for their code; for ease of presentation, only the request-sending logic is kept below.
# coding:utf-8
# kquanben.com ("read the whole book")
import requests


def test(info):
    book_name = info[0]
    book_auth = info[1]
    data = {
        "searchkey": book_name.strip(),
        "searchtype": "articlename",
    }
    s = requests.session()  # a Session is used here
    url = 'https://www.kquanben.com/modules/article/search.php'
    headers = {
        # ellipsis
    }
    tunnel = "tps1xx.kdlapi.com:15818"
    username = "txxxxxxxxxx"
    password = "password"
    proxies = {
        "http": "http://%(user)s:%(pwd)s@%(proxy)s/" % {"user": username, "pwd": password, "proxy": tunnel},
        "https": "http://%(user)s:%(pwd)s@%(proxy)s/" % {"user": username, "pwd": password, "proxy": tunnel},
    }
    response = s.post(url, headers=headers, data=data, proxies=proxies, timeout=12.1)
    # Parse response
The user sends requests through a Session from the Requests[1] library. The documentation shows that Requests sessions persist cookies, and the Session source code shows that when a request is sent through a session, prepare_request() merges the session's stored cookies into the outgoing request.
class Session(SessionRedirectMixin):
    """A Requests session.

    Provides cookie persistence, connection-pooling, and configuration.
    ......
    """

    def prepare_request(self, request):
        """Constructs a :class:`PreparedRequest <PreparedRequest>` for
        transmission and returns it. The :class:`PreparedRequest` has settings
        merged from the :class:`Request <Request>` instance and those of the
        :class:`Session`.

        :param request: :class:`Request` instance to prepare with this
            session's settings.
        :rtype: requests.PreparedRequest
        """
        cookies = request.cookies or {}

        # Bootstrap CookieJar.
        if not isinstance(cookies, cookielib.CookieJar):
            cookies = cookiejar_from_dict(cookies)

        # Merge with session cookies
        merged_cookies = merge_cookies(
            merge_cookies(RequestsCookieJar(), self.cookies), cookies)

        # ......
For more on cookie handling, see the Requests cookie documentation [2].
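To make the behaviour concrete, here is a small demonstration of cookie persistence, using httpbin.org as a neutral test endpoint (this example is ours, not part of the original case):

import requests

s = requests.Session()
# httpbin sets a cookie through this endpoint...
s.get("https://httpbin.org/cookies/set/flavor/chocolate")
# ...and the Session replays it automatically on the next request.
print(s.get("https://httpbin.org/cookies").json())
# {'cookies': {'flavor': 'chocolate'}}

# A bare requests.get() has no jar to replay, so no cookie is sent.
print(requests.get("https://httpbin.org/cookies").json())
# {'cookies': {}}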
Putting this together with the website analysis from the previous step: the user sends requests through a session. On the first request, the target site returns the Set-Cookie field, and Requests automatically stores the cookie for the next request. On the second request, the stored last-search timestamp is sent along, the server sees an interval of less than 25 s, and the request is rejected. The fix is simple: instead of sending requests through a session, call requests.post() directly and supply the timestamp cookie yourself. Part of the code follows:
import requests
import time

url = 'https://www.kquanben.com/modules/article/search.php'
headers = {
    'cookie': 'alllc111lastsearchtime=%s' % int(time.time()),
    # ellipsis
}
data = {
    # ellipsis
}
response = requests.post(url, headers=headers, data=data, timeout=12.1)
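An alternative, if the connection reuse of a Session is still wanted, is to keep the session but clear its cookie jar before each search, so the saved timestamp is never replayed. A sketch of that variation (our suggestion, not the fix the user shipped):

import requests

url = 'https://www.kquanben.com/modules/article/search.php'
data = {'searchkey': 'overbearing president', 'searchtype': 'articlename'}

s = requests.Session()
for _ in range(2):
    s.cookies.clear()  # forget alllc111lastsearchtime before each search
    response = s.post(url, data=data, timeout=12.1)
    # Parse response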
Following our engineer's suggestion, the user quickly modified the code and was able to access the website through the proxy without further trouble.
Conclusion
• requests.Session should be used with care: it persists cookies, and careless use can get you flagged by the target website.
• Not every anti-crawler measure works by restricting IPs; each site calls for specific analysis.