Do you know how to quickly send 100,000 HTTP requests in Python?

Hi everyone, I'm Programmer Sophomore!

Suppose you have a file with 100,000 URLs, and you need to send an HTTP request to each URL and print the status code of the result. How do you write the code so that the job finishes as quickly as possible?

Python offers many ways to do concurrent programming: the standard library's threading (multithreading), concurrent.futures (thread pools), and asyncio (coroutines), as well as third-party asynchronous libraries such as grequests. Any of them can handle the task above. Each approach is implemented in code below; all of the code in this article runs as-is, so you can keep it as a reference for your future concurrent programming.
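For comparison, here is a minimal sequential baseline (a sketch, assuming the URLs live in a file named urllist.txt, one per line, as in the examples below). Everything that follows is about doing the same work concurrently instead of one request at a time:

import requests

# Naive sequential baseline: one request at a time, no concurrency.
with open("urllist.txt") as reader:
    for url in reader:
        url = url.strip()
        try:
            print(requests.get(url, timeout=5).status_code, url)
        except requests.RequestException as exc:
            print("error", url, exc)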

Queue + multithreading

Define a queue with a capacity of 400, then start 200 threads. Each thread repeatedly takes a URL from the queue and requests it.

The main thread reads the URLs from the file, puts them into the queue, and then waits until every item in the queue has been taken and processed. The code is as follows:

from threading import Thread
from queue import Queue
import sys

import requests

concurrent = 200

def doWork():
    while True:
        url = q.get()
        status, url = getStatus(url)
        doSomethingWithResult(status, url)
        q.task_done()

def getStatus(ourl):
    try:
        res = requests.get(ourl)
        return res.status_code, ourl
    except requests.RequestException:
        return "error", ourl

def doSomethingWithResult(status, url):
    print(status, url)

# Bounded queue: put() blocks once 400 URLs are waiting, so the file is
# never read much faster than the worker threads can consume it.
q = Queue(concurrent * 2)
for i in range(concurrent):
    t = Thread(target=doWork)
    t.daemon = True  # daemon threads die together with the main thread
    t.start()

try:
    for url in open("urllist.txt"):
        q.put(url.strip())
    q.join()  # block until every queued URL has been processed
except KeyboardInterrupt:
    sys.exit(1)

When run, it prints a status code and URL for each request. Have you picked up a new trick?

Thread pool

If you want a thread pool, the more modern concurrent.futures library is recommended:

import concurrent.futures

import requests

out = []
CONNECTIONS = 100
TIMEOUT = 5

urls = []
with open("urllist.txt") as reader:
    for url in reader:
        urls.append(url.strip())

def load_url(url, timeout):
    ans = requests.get(url, timeout=timeout)
    return ans.status_code

with concurrent.futures.ThreadPoolExecutor(max_workers=CONNECTIONS) as executor:
    future_to_url = (executor.submit(load_url, url, TIMEOUT) for url in urls)
    # as_completed yields each future as soon as it finishes, regardless of submission order
    for future in concurrent.futures.as_completed(future_to_url):
        try:
            data = future.result()
        except Exception as exc:
            data = str(type(exc))
        finally:
            out.append(data)
            print(data)
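If you do not need per-request error handling, a shorter variant uses executor.map. This is a sketch that reuses the load_url, urls, CONNECTIONS, and TIMEOUT names defined above; note that any exception raised inside a worker will propagate when its result is iterated:

# Alternative sketch: executor.map returns results in input order.
with concurrent.futures.ThreadPoolExecutor(max_workers=CONNECTIONS) as executor:
    for status in executor.map(lambda u: load_url(u, TIMEOUT), urls):
        print(status)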

Coroutines + aiohttp

Coroutines (asyncio) are another very common tool for concurrency; here they are combined with the aiohttp client:

import asyncio

from aiohttp import ClientSession, ClientConnectorError

async def fetch_html(url: str, session: ClientSession, **kwargs) -> tuple:
    try:
        resp = await session.request(method="GET", url=url, **kwargs)
    except ClientConnectorError:
        return (url, 404)  # report unreachable hosts as 404
    return (url, resp.status)

async def make_requests(urls: set, **kwargs) -> None:
    async with ClientSession() as session:
        tasks = []
        for url in urls:
            tasks.append(
                fetch_html(url=url, session=session, **kwargs)
            )
        # run all requests concurrently and collect (url, status) pairs
        results = await asyncio.gather(*tasks)

    for result in results:
        print(f'{result[1]} - {str(result[0])}')

if __name__ == "__main__":
    import sys
    assert sys.version_info >= (3, 7), "Script requires Python 3.7+."
    with open("urllist.txt") as infile:
        urls = set(map(str.strip, infile))
    asyncio.run(make_requests(urls=urls))

grequests[1]

This is a third-party library with about 3.8K GitHub stars at the time of writing. It is essentially Requests + Gevent[2], and it makes asynchronous HTTP requests easier. Under the hood, Gevent is also coroutine-based.

Before use:

pip install grequests

It's quite simple to use:

import grequests

urls = []
with open("urllist.txt") as reader:
    for url in reader:
        urls.append(url.strip())

rs = (grequests.get(u) for u in urls)
for result in grequests.map(rs):
    if result is None:  # failed requests come back as None
        continue
    print(result.status_code, result.url)

Note that grequests.map(rs) is the call that actually executes the requests, concurrently.

You can also add exception handling:

>>> def exception_handler(request, exception):
...     print("Request failed")

>>> reqs = [
...     grequests.get('http://httpbin.org/delay/1', timeout=0.001),
...     grequests.get('http://fakedomain/'),
...     grequests.get('http://httpbin.org/status/500')]
>>> grequests.map(reqs, exception_handler=exception_handler)
Request failed
Request failed
[None, None, <Response [500]>]
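grequests.map also accepts a size argument that caps the underlying gevent pool, which is useful if opening 100,000 simultaneous connections is too many. A sketch (the limit of 100 is an arbitrary choice):

rs = (grequests.get(u) for u in urls)
for result in grequests.map(rs, size=100):  # at most 100 requests in flight
    if result is not None:
        print(result.status_code, result.url)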

Last words

Today I shared several ways to make concurrent HTTP requests. Some people say asynchronous I/O (coroutines) always performs better than multithreading, but in practice it depends on the scenario; no single approach fits every case. In my own experiment, also requesting URLs, coroutines slowed down noticeably once the number of concurrent requests exceeded 500. So we cannot say flatly that one is better than the other; it depends on the situation.
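If coroutines do degrade at very high concurrency, one common mitigation (a sketch, not part of the original experiment; the limit of 100 is arbitrary) is to cap the number of in-flight requests with an asyncio.Semaphore:

import asyncio
from aiohttp import ClientSession

async def bounded_requests(urls, limit: int = 100) -> None:
    sem = asyncio.Semaphore(limit)  # at most `limit` requests in flight

    async def fetch(url: str, session: ClientSession):
        async with sem:
            async with session.get(url) as resp:
                return url, resp.status

    async with ClientSession() as session:
        results = await asyncio.gather(*(fetch(u, session) for u in urls))
    for url, status in results:
        print(status, url)

# usage: asyncio.run(bounded_requests(urls))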
