1, Preface
I have a project that needs to crawl the Securities Association website, and the site blocks IPs. So I need automatic IP switching in Scrapy to complete the crawl.
Before this, I tried a third-party library, scrapy-proxys, together with the proxy API of Sesame IP. It did not work, probably because my code was not set up properly. (I will test it again when I get the chance.)
2, Abuyun examples
Abuyun (Abu Cloud) provides official sample code for both plain Python and Scrapy.
Python 3 example
from urllib import request

# Target page to visit
targetUrl = "http://test.abuyun.com/proxy.php"

# Proxy server
proxyHost = "http-dyn.abuyun.com"
proxyPort = "9020"

# Proxy tunnel authentication information
proxyUser = "H01234567890123D"
proxyPass = "0123456789012345"

proxyMeta = "http://%(user)s:%(pass)s@%(host)s:%(port)s" % {
    "host": proxyHost,
    "port": proxyPort,
    "user": proxyUser,
    "pass": proxyPass,
}

proxy_handler = request.ProxyHandler({
    "http": proxyMeta,
    "https": proxyMeta,
})

#auth = request.HTTPBasicAuthHandler()
#opener = request.build_opener(proxy_handler, auth, request.HTTPHandler)
opener = request.build_opener(proxy_handler)

request.install_opener(opener)
resp = request.urlopen(targetUrl).read()

print(resp)
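The `proxyMeta` string in the sample works as long as the credentials contain no characters that have a special meaning in a URL (such as `@`, `:` or `/`). As a minimal sketch, assuming your credentials might contain such characters, the helpers below percent-encode them before building the proxy URL, and return an opener instead of installing one globally. The names `build_proxy_url`, `build_proxy_opener` and `fetch_via_proxy` are made up for this illustration and are not part of Abuyun's sample.

```python
from urllib import request
from urllib.parse import quote


def build_proxy_url(host, port, user, password):
    """Return an http:// proxy URL with percent-encoded credentials."""
    # safe="" makes quote() escape ':', '@' and '/' inside the credentials too
    return "http://%s:%s@%s:%s" % (
        quote(user, safe=""),
        quote(password, safe=""),
        host,
        port,
    )


def build_proxy_opener(proxy_url):
    """Return an opener that sends HTTP and HTTPS requests through proxy_url."""
    handler = request.ProxyHandler({"http": proxy_url, "https": proxy_url})
    return request.build_opener(handler)


def fetch_via_proxy(url, proxy_url, timeout=10):
    """Fetch url through the proxy and return the raw response body."""
    opener = build_proxy_opener(proxy_url)
    with opener.open(url, timeout=timeout) as resp:
        return resp.read()
```

With valid credentials, `fetch_via_proxy("http://test.abuyun.com/proxy.php", build_proxy_url("http-dyn.abuyun.com", "9020", proxyUser, proxyPass))` should behave like the sample, without changing the process-wide default opener.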
That is the plain-Python approach; the Scrapy middleware version follows.
Scrapy middleware
import base64

# Proxy server
proxyServer = "http://http-dyn.abuyun.com:9020"

# Proxy tunnel authentication information
proxyUser = "H01234567890123D"
proxyPass = "0123456789012345"

# for Python2
proxyAuth = "Basic " + base64.b64encode(proxyUser + ":" + proxyPass)

# for Python3
#proxyAuth = "Basic " + base64.urlsafe_b64encode(bytes((proxyUser + ":" + proxyPass), "ascii")).decode("utf8")


class ProxyMiddleware(object):
    def process_request(self, request, spider):
        request.meta["proxy"] = proxyServer
        request.headers["Proxy-Authorization"] = proxyAuth
This class goes in the middlewares file of your Scrapy project.
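The Python 3 line in the sample above uses `urlsafe_b64encode`, while HTTP Basic authentication is defined in terms of the standard Base64 alphabet. The two alphabets differ only in their last two symbols (`+` and `/` versus `-` and `_`), so purely alphanumeric credentials like the sample ones encode identically either way, but a password containing a character such as `?` or `~` may not. The sketch below builds the header value with the standard alphabet; `proxy_auth_header` and `decode_proxy_auth_header` are hypothetical helper names, not part of Scrapy or Abuyun's sample.

```python
import base64


def proxy_auth_header(user, password):
    """Build a Proxy-Authorization header value using the HTTP Basic scheme."""
    credentials = ("%s:%s" % (user, password)).encode("utf-8")
    # Standard Base64 alphabet (b64encode), not the URL-safe variant
    return "Basic " + base64.b64encode(credentials).decode("ascii")


def decode_proxy_auth_header(value):
    """Inverse of proxy_auth_header: return (user, password), for debugging."""
    scheme, _, token = value.partition(" ")
    if scheme != "Basic":
        raise ValueError("not a Basic auth header: %r" % value)
    user, _, password = base64.b64decode(token).decode("utf-8").partition(":")
    return user, password
```

Decoding the header again is a quick way to check, before starting a crawl, that the value you are sending really contains the account and password you expect.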
3, Formal integration
Add a new class to the project's middlewares.py:
import base64

""" Abuyun IP proxy configuration, including account and password """
proxyServer = "http://http-dyn.abuyun.com:9020"
proxyUser = "HWFHQ5YP14Lxxx"
proxyPass = "CB8D0AD56EAxxx"

# for Python3
# HTTP Basic auth uses the standard Base64 alphabet, hence b64encode rather than urlsafe_b64encode
proxyAuth = "Basic " + base64.b64encode(bytes((proxyUser + ":" + proxyPass), "ascii")).decode("utf8")


class ABProxyMiddleware(object):
    """ Abuyun IP proxy middleware """

    def process_request(self, request, spider):
        # Route the request through the proxy tunnel and authenticate to it
        request.meta["proxy"] = proxyServer
        request.headers["Proxy-Authorization"] = proxyAuth
Then enable the middleware in settings.py:
DOWNLOADER_MIDDLEWARES = {
    #'Securities.middlewares.SecuritiesDownloaderMiddleware': None,
    'Securities.middlewares.ABProxyMiddleware': 1,
}
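If you would rather not hardcode the account and password in middlewares.py, one possible variation is to read them from settings.py through Scrapy's `from_crawler` hook. This is only a sketch under assumptions: the class name `SettingsProxyMiddleware` and the setting names `ABUYUN_PROXY_SERVER`, `ABUYUN_PROXY_USER` and `ABUYUN_PROXY_PASS` are invented for this example and are not defined by Scrapy or Abuyun.

```python
import base64


class SettingsProxyMiddleware:
    """Hypothetical variant of ABProxyMiddleware that is configured from settings.py."""

    def __init__(self, server, user, password):
        if not user or not password:
            # Fail early instead of sending an empty or bogus credential
            raise ValueError("proxy user and password must be set")
        self.server = server
        token = base64.b64encode(("%s:%s" % (user, password)).encode("utf-8"))
        self.auth = "Basic " + token.decode("ascii")

    @classmethod
    def from_crawler(cls, crawler):
        # Scrapy calls from_crawler (when it is defined) to build the middleware.
        # The ABUYUN_* names are made-up setting names for this sketch.
        settings = crawler.settings
        return cls(
            server=settings.get("ABUYUN_PROXY_SERVER", "http://http-dyn.abuyun.com:9020"),
            user=settings.get("ABUYUN_PROXY_USER"),
            password=settings.get("ABUYUN_PROXY_PASS"),
        )

    def process_request(self, request, spider):
        # Same behavior as ABProxyMiddleware: set the proxy and its auth header
        request.meta["proxy"] = self.server
        request.headers["Proxy-Authorization"] = self.auth
```

To try it, you would define the three `ABUYUN_*` values in settings.py and register this class in `DOWNLOADER_MIDDLEWARES` in place of `ABProxyMiddleware`.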
4, Precautions
By default, an Abuyun dynamic IP tunnel allows 5 requests per second (you can pay for a higher limit). With the default limit of 5, the crawler has to be throttled, so add the following code to settings.py:
""" Enable speed limit settings """
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 0.2  # Initial download delay
DOWNLOAD_DELAY = 0.2  # Time between requests
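The 0.2 second delay comes from the limit itself: 1 second divided by 5 requests. One thing to keep in mind is that Scrapy by default randomizes the wait to between 0.5 and 1.5 times `DOWNLOAD_DELAY`, so some gaps can be shorter than 0.2 seconds. The fragment below is a sketch of a stricter setup, assuming the 5 requests per second limit and a crawl against a single site; `MAX_REQUESTS_PER_SECOND` is a helper name introduced here, not a Scrapy setting.

```python
# settings.py fragment (sketch): stay at or below the proxy tunnel's rate limit

# Assumed plan limit; helper value for this sketch, not a Scrapy setting
MAX_REQUESTS_PER_SECOND = 5

# Minimum gap between two requests to the same download slot (per domain by default)
DOWNLOAD_DELAY = 1.0 / MAX_REQUESTS_PER_SECOND

# Scrapy normally waits a random 0.5x to 1.5x of DOWNLOAD_DELAY;
# disabling this keeps every gap at the full DOWNLOAD_DELAY
RANDOMIZE_DOWNLOAD_DELAY = False

# AutoThrottle can raise the delay when responses are slow,
# but it does not go below DOWNLOAD_DELAY
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = DOWNLOAD_DELAY
```

Because the delay is applied per domain, a crawl that spreads over several sites through the same tunnel could still exceed the limit in total, and would need a larger delay.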
Of course, if you pay for a higher request rate, you can relax the speed limit accordingly.
That completes the integration of Abuyun's dynamic proxy IPs. Now crawl to your heart's content!