Hello everyone, I'm a sophomore programmer!
Suppose you have a file with 100,000 URLs, and you need to send an HTTP request to each one and print the status code of the result. How would you write the code to finish this as quickly as possible?
Python offers many approaches to concurrent programming: the standard-library options of multithreading with threading, thread pools with concurrent.futures, and coroutines with asyncio, as well as third-party asynchronous libraries such as grequests. Each of them can meet the requirement above. Below, they are implemented one by one; all the code in this article runs as-is, so you can keep it as a reference for future concurrent programming:
Queue + multithreading
Define a queue with a capacity of 400, then start 200 threads. Each thread repeatedly takes a URL from the queue and requests it.
The main thread reads the URLs from the file, puts them into the queue, and then waits for every element in the queue to be taken and processed. The code is as follows:
from threading import Thread
import sys
from queue import Queue

import requests

concurrent = 200  # number of worker threads


def doWork():
    # Each worker loops forever, pulling URLs off the queue.
    while True:
        url = q.get()
        status, url = getStatus(url)
        doSomethingWithResult(status, url)
        q.task_done()


def getStatus(ourl):
    try:
        res = requests.get(ourl)
        return res.status_code, ourl
    except Exception:
        return "error", ourl


def doSomethingWithResult(status, url):
    print(status, url)


# Queue capacity is twice the number of threads (400).
q = Queue(concurrent * 2)
for i in range(concurrent):
    t = Thread(target=doWork)
    t.daemon = True
    t.start()

try:
    for url in open("urllist.txt"):
        q.put(url.strip())
    q.join()  # block until every queued URL has been processed
except KeyboardInterrupt:
    sys.exit(1)
Running it prints a status code and URL pair for each line in the file. Have you picked up a new trick?
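One detail the code above glosses over is that every requests.get call opens a fresh connection. A minimal variant of the same pattern, sketched below, gives each worker its own requests.Session via threading.local() so connections are reused; the helper names, the 5-second timeout, and the worker count are illustrative choices, not part of the original post.

# Sketch: queue + threads with a per-thread requests.Session for connection reuse.
# Assumes the same urllist.txt format; names here are illustrative.
import threading
from queue import Queue
from threading import Thread

import requests

concurrent = 200
thread_local = threading.local()


def get_session() -> requests.Session:
    # Each worker thread lazily creates and then reuses its own Session.
    if not hasattr(thread_local, "session"):
        thread_local.session = requests.Session()
    return thread_local.session


def do_work():
    while True:
        url = q.get()
        try:
            status = get_session().get(url, timeout=5).status_code
        except requests.RequestException:
            status = "error"
        print(status, url)
        q.task_done()


q = Queue(concurrent * 2)
for _ in range(concurrent):
    Thread(target=do_work, daemon=True).start()

with open("urllist.txt") as f:
    for url in f:
        q.put(url.strip())
q.join()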
Thread pool
If you want a thread pool, the higher-level concurrent.futures library is recommended:
import concurrent.futures

import requests

out = []
CONNECTIONS = 100
TIMEOUT = 5

urls = []
with open("urllist.txt") as reader:
    for url in reader:
        urls.append(url.strip())


def load_url(url, timeout):
    ans = requests.get(url, timeout=timeout)
    return ans.status_code


with concurrent.futures.ThreadPoolExecutor(max_workers=CONNECTIONS) as executor:
    future_to_url = (executor.submit(load_url, url, TIMEOUT) for url in urls)
    for future in concurrent.futures.as_completed(future_to_url):
        try:
            data = future.result()
        except Exception as exc:
            data = str(type(exc))
        finally:
            out.append(data)
            print(data)
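One limitation of the snippet above is that it only prints status codes, so you can't tell which URL each code belongs to. A small variant, sketched below under the same assumptions (the same urllist.txt and load_url), uses a dict to map each future back to its URL:

# Sketch: ThreadPoolExecutor with a future -> URL mapping so the URL can be
# printed next to its status code. Values such as CONNECTIONS are illustrative.
import concurrent.futures

import requests

CONNECTIONS = 100
TIMEOUT = 5

with open("urllist.txt") as reader:
    urls = [line.strip() for line in reader]


def load_url(url, timeout):
    return requests.get(url, timeout=timeout).status_code


with concurrent.futures.ThreadPoolExecutor(max_workers=CONNECTIONS) as executor:
    # A dict instead of a generator: each future remembers which URL it came from.
    future_to_url = {executor.submit(load_url, u, TIMEOUT): u for u in urls}
    for future in concurrent.futures.as_completed(future_to_url):
        url = future_to_url[future]
        try:
            print(future.result(), url)
        except Exception as exc:
            print(type(exc).__name__, url)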
Coroutines + aiohttp
Coroutines are also a very common tool for concurrency:
import asyncio

from aiohttp import ClientSession, ClientConnectorError


async def fetch_html(url: str, session: ClientSession, **kwargs) -> tuple:
    try:
        resp = await session.request(method="GET", url=url, **kwargs)
    except ClientConnectorError:
        return (url, 404)
    return (url, resp.status)


async def make_requests(urls: set, **kwargs) -> None:
    async with ClientSession() as session:
        tasks = []
        for url in urls:
            tasks.append(
                fetch_html(url=url, session=session, **kwargs)
            )
        results = await asyncio.gather(*tasks)

    for result in results:
        print(f'{result[1]} - {str(result[0])}')


if __name__ == "__main__":
    import sys
    assert sys.version_info >= (3, 7), "Script requires Python 3.7+."
    with open("urllist.txt") as infile:
        urls = set(map(str.strip, infile))
    asyncio.run(make_requests(urls=urls))
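With 100,000 URLs, asyncio.gather launches every request at once, which can exhaust sockets or file descriptors. A minimal sketch of one way to cap the number of in-flight requests with asyncio.Semaphore follows; the limit of 200 and the fetch/main names are arbitrary choices, not part of the original code.

# Sketch: same asyncio + aiohttp idea, but with a Semaphore so only a bounded
# number of requests is in flight at any moment.
import asyncio

from aiohttp import ClientSession

LIMIT = 200  # arbitrary cap on concurrent requests


async def fetch(url: str, session: ClientSession, sem: asyncio.Semaphore) -> tuple:
    async with sem:  # at most LIMIT coroutines pass this point at once
        try:
            async with session.get(url) as resp:
                return url, resp.status
        except Exception:
            return url, "error"


async def main() -> None:
    with open("urllist.txt") as f:
        urls = [line.strip() for line in f]
    sem = asyncio.Semaphore(LIMIT)
    async with ClientSession() as session:
        results = await asyncio.gather(*(fetch(u, session, sem) for u in urls))
    for url, status in results:
        print(status, url)


if __name__ == "__main__":
    asyncio.run(main())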
grequests[1]
This is a third-party library, currently with about 3.8K stars. It is essentially Requests + Gevent[2], which makes asynchronous HTTP requests easier. Under the hood, Gevent is still based on coroutines.
Install it first:
pip install grequests
It's quite simple to use:
import grequests

urls = []
with open("urllist.txt") as reader:
    for url in reader:
        urls.append(url.strip())

rs = (grequests.get(u) for u in urls)
for result in grequests.map(rs):
    print(result.status_code, result.url)
Note that grequests.map(rs) executes the requests concurrently; running it prints a status code and URL for each request.
You can also add exception handling:
>>> def exception_handler(request, exception):
...     print("Request failed")

>>> reqs = [
...     grequests.get('http://httpbin.org/delay/1', timeout=0.001),
...     grequests.get('http://fakedomain/'),
...     grequests.get('http://httpbin.org/status/500')]
>>> grequests.map(reqs, exception_handler=exception_handler)
Request failed
Request failed
[None, None, <Response [500]>]
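As the interactive example shows, grequests.map returns None for requests that fail, so the earlier loop would crash on result.status_code if any URL is unreachable. grequests.map also accepts a size parameter that bounds the underlying Gevent pool. A minimal sketch combining both follows; the pool size of 100 and the 5-second timeout are arbitrary choices:

# Sketch: bound the Gevent pool with size= and skip failed requests,
# which grequests.map returns as None.
import grequests

with open("urllist.txt") as reader:
    urls = [line.strip() for line in reader]

rs = (grequests.get(u, timeout=5) for u in urls)
# size=100 limits how many requests Gevent runs at once.
for result in grequests.map(rs, size=100):
    if result is None:
        print("error")
    else:
        print(result.status_code, result.url)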
Final words
Today I shared several ways to implement concurrent HTTP requests. Some people say that asynchronous code (coroutines) always performs better than multithreading, but it really depends on the scenario; no single approach fits every case. In my own experiment, also requesting a list of URLs, coroutines slowed down noticeably once concurrency went above 500. So we can't say one is simply better than the other; you have to judge case by case.
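If you want to reproduce that kind of comparison on your own URL list, a tiny timing harness is enough. The sketch below is purely illustrative: run_batch is a placeholder for whichever implementation above you wrap in a function that takes a list of URLs.

# Sketch: a minimal timing harness to compare the approaches on your own data.
import time


def benchmark(run_batch, urls, label):
    # run_batch is any callable that takes a list of URLs and performs all the requests.
    start = time.perf_counter()
    run_batch(urls)
    elapsed = time.perf_counter() - start
    print(f"{label}: {len(urls)} URLs in {elapsed:.2f}s")


if __name__ == "__main__":
    with open("urllist.txt") as f:
        urls = [line.strip() for line in f]
    # Examples (uncomment once the corresponding function exists):
    # benchmark(thread_pool_version, urls, "thread pool")
    # benchmark(lambda u: asyncio.run(make_requests(urls=set(u))), urls, "asyncio + aiohttp")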