Let's take a look at the live bullet comments (danmaku) on Douyu:
data:image/s3,"s3://crabby-images/7c608/7c608fa5220250f3bd4150f50bab40eebd14a147" alt=""
You can see that the lower right corner is constantly changing.
Polling and WebSocket:
On the web, polling and WebSocket are two ways to deliver "real-time" data updates.
Polling means that the client calls a server interface at a fixed time interval (such as 1 second) to achieve a "real-time" effect. The data appears to update in real time, but there is always a delay of up to one interval, so it is not truly real time. Polling uses the pull model: the client actively pulls data from the server.
WebSocket uses the push model: the server actively pushes data to the client as soon as it is available. This is a true real-time update.
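The pull model described above can be sketched in a few lines. This is a minimal illustration, not a real crawler: `poll` and the counter used as a data source are hypothetical stand-ins for an HTTP request to a server interface.

```python
import time
from typing import Any, Callable, List


def poll(fetch: Callable[[], Any], interval: float, rounds: int) -> List[Any]:
    """Call fetch() `rounds` times, sleeping `interval` seconds between calls."""
    results = []
    for i in range(rounds):
        # The client initiates every exchange; the server never sends on its own.
        results.append(fetch())
        if i < rounds - 1:
            # Anything that changes on the server during this sleep
            # stays invisible until the next pull.
            time.sleep(interval)
    return results


# Hypothetical data source standing in for a real HTTP request.
counter = iter(range(100))
snapshots = poll(lambda: next(counter), interval=0.01, rounds=3)
print(snapshots)
```

The sleep between calls is exactly the delay that makes polling only appear to be real time.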
WebSocket:
WebSocket is a protocol for full duplex communication over a single TCP connection.
It makes the data exchange between the client and the server easier, and allows the server to actively push data to the client.
With the WebSocket API, the browser and the server only need to complete one handshake; after that they hold a persistent connection and can transmit data in both directions.
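To make the handshake more concrete: under RFC 6455 the client sends a random `Sec-WebSocket-Key` header, and the server proves it understood the upgrade request by returning a `Sec-WebSocket-Accept` value derived from that key. The sketch below computes that value with the standard library only; the function name `accept_key` is my own, and the sample key is the one used in the RFC's example.

```python
import base64
import hashlib

# Fixed GUID that RFC 6455 defines for the opening handshake.
WS_GUID = "258EAFA5-E914-47DA-95CA-C5AB0DC85B11"


def accept_key(sec_websocket_key: str) -> str:
    """Return the Sec-WebSocket-Accept value for a given Sec-WebSocket-Key."""
    digest = hashlib.sha1((sec_websocket_key + WS_GUID).encode("ascii")).digest()
    return base64.b64encode(digest).decode("ascii")


# Sample key taken from the example in RFC 6455.
print(accept_key("dGhlIHNhbXBsZSBub25jZQ=="))  # s3pPLMBiTxaQ9kYGzzhZRbK+xOo=
```

Client libraries perform this check internally, so a crawler normally never has to compute it by hand.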
WebSocket benefits:
Less control overhead: the request headers are carried only once, during the handshake; after that, only data is transmitted. Compared with HTTP, where every request carries its own headers, WebSocket saves resources.
Stronger real-time performance: the server can push messages as soon as they are available, so latency is low. Within a single HTTP polling interval, WebSocket can transmit multiple times.
Binary support: WebSocket supports binary frames, so binary data can be sent without first being encoded as text, which makes transmission more economical.
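The overhead claim can be illustrated by building a frame by hand. The sketch below follows the frame layout in RFC 6455 for an unmasked frame (the kind a server sends); `encode_unmasked_frame` is a name of my own, and real client frames additionally carry a 4-byte masking key.

```python
def encode_unmasked_frame(payload: bytes, opcode: int = 0x1) -> bytes:
    """Build a single, final, unmasked WebSocket frame.

    opcode 0x1 is a text frame and 0x2 is a binary frame.
    """
    first = 0x80 | opcode  # FIN bit set, followed by the opcode
    n = len(payload)
    if n < 126:
        header = bytes([first, n])
    elif n < 65536:
        header = bytes([first, 126]) + n.to_bytes(2, "big")
    else:
        header = bytes([first, 127]) + n.to_bytes(8, "big")
    return header + payload


frame = encode_unmasked_frame(b"Hello")
print(frame)                     # b'\x81\x05Hello'
print(len(frame) - len(b"Hello"))  # 2 bytes of framing overhead
```

A short message therefore costs two bytes of framing, compared with the full set of headers that each HTTP polling request has to repeat.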
Case study:
Let's take the real-time data on the Litecoin website http://www.laiteb.com/ as an example.
The WebSocket handshake happens only once, so to observe it in the browser developer tools you need to open the developer tools first, switch to the Network tab, and then load or refresh the page. Only then are the handshake request and the subsequent data transmission captured. Take Chrome as an example:
data:image/s3,"s3://crabby-images/72c29/72c299f20a92a26aff32444850d7d8b5b6880aaa" alt=""
The developer tools provide a filter; the WS option shows only WebSocket connections.
At this point, the request list contains a record named realTime. Left-click it and the developer tools split into two columns, with the details of the request shown on the right:
Unlike an HTTP request, the WebSocket connection address starts with ws or wss, and the status code of a successful connection is 101 rather than 200.
data:image/s3,"s3://crabby-images/b0c89/b0c899342fb52ecf2d3decb824b5009b3cb1008f" alt=""
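The two observations above (the ws/wss scheme and status code 101) can be checked with the standard library. This is a small sketch; `describe_ws_url` is a hypothetical helper, and the URLs are the ones that appear in this article.

```python
from http import HTTPStatus
from urllib.parse import urlsplit


def describe_ws_url(url: str) -> str:
    """Classify a URL by its scheme: ws, wss, or something else."""
    scheme = urlsplit(url).scheme.lower()
    if scheme == "ws":
        return "plain WebSocket"
    if scheme == "wss":
        return "WebSocket over TLS"
    return "not a WebSocket URL"


print(describe_ws_url("wss://api.bbxapp.vip/v1/ifcontract/realTime"))
print(describe_ws_url("http://www.laiteb.com/"))

# 101 is the HTTP status that signals a successful protocol upgrade.
print(HTTPStatus.SWITCHING_PROTOCOLS.value, HTTPStatus.SWITCHING_PROTOCOLS.phrase)
```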
The Headers tab records the Request and Response information, while the Frames tab records the data transmitted by both parties, which is also the data content we need to crawl:
data:image/s3,"s3://crabby-images/a5b40/a5b402c3bb508903709ccfe0493ddbb99796626a" alt=""
In the Frames diagram, the data with the green arrow upward is the data sent by the client to the server, and the data with the orange arrow downward is the data pushed by the server to the client.
As the data sequence shows, the client sends first:
{"action":"ping"}
Then the server pushes messages (and keeps pushing):
{"action":"subscribe","group":"QuoteBin5m:14","success":true,"request":{"action":"subscribe","args":["QuoteBin5m:14"]}}
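Since every frame here is a JSON string, a crawler can tell the message types apart with the `json` module. The sketch below uses the two messages shown above; the helper name `is_subscribe_ack` is hypothetical, and the assumption that `"success": true` marks a confirmed subscription is inferred from this one sample, not from any documentation of the site's API.

```python
import json


def is_subscribe_ack(raw: str) -> bool:
    """Return True if the frame looks like a successful subscribe confirmation."""
    msg = json.loads(raw)
    return msg.get("action") == "subscribe" and msg.get("success") is True


# The two frames quoted in the text above.
ping = '{"action":"ping"}'
ack = ('{"action":"subscribe","group":"QuoteBin5m:14","success":true,'
       '"request":{"action":"subscribe","args":["QuoteBin5m:14"]}}')

print(is_subscribe_ack(ping))
print(is_subscribe_ack(ack))
print(json.loads(ack)["group"])
```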
Therefore, the whole process from initiating handshake to obtaining data is as follows:
data:image/s3,"s3://crabby-images/9ae43/9ae43198375e1a11c63c3c1601e3fb246d583dc3" alt=""
Use the aiowebsocket library to crawl network data:
Python has many libraries for connecting to a WebSocket, but the easy-to-use and stable ones are websocket-client (synchronous), websockets (asynchronous) and aiowebsocket (asynchronous).
You can choose any of the three according to your project's requirements. Here we use the asynchronous client aiowebsocket.
aiowebsocket is an asynchronous WebSocket client that follows the WebSocket specification. It is lighter and faster than other libraries.
Here is the code: (I don't know why the code is in this format. It is provided in the official document =.)
    import asyncio
    import logging
    from datetime import datetime

    from aiowebsocket.converses import AioWebSocket


    async def startup(uri):
        async with AioWebSocket(uri) as aws:
            converse = aws.manipulator
            # The client sends a message to the server
            await converse.send('{"action":"subscribe","args":["QuoteBin5m:14"]}')
            while True:
                # Wait for the next message pushed by the server
                mes = await converse.receive()
                print('{time}-Client receive: {rec}'.format(
                    time=datetime.now().strftime('%Y-%m-%d %H:%M:%S'), rec=mes))


    if __name__ == '__main__':
        remote = 'wss://api.bbxapp.vip/v1/ifcontract/realTime'
        try:
            asyncio.get_event_loop().run_until_complete(startup(remote))
        except KeyboardInterrupt:
            logging.info('Quit.')
After running: (you can see that the data has been coming continuously)
data:image/s3,"s3://crabby-images/6e701/6e70174cbf16cc5b7d30c8b0ec6e6d871eb57da5" alt=""
Let's look at another website, the Jin10 Data Center: https://datacenter.jin10.com/price
data:image/s3,"s3://crabby-images/bb729/bb7298dab51d325dcbdbc70ef164b1a3275f6283" alt=""
The Request URL in its headers starts with wss://. The right side of the page refreshes data frantically, and the protocol used is WebSocket.
data:image/s3,"s3://crabby-images/349f3/349f35271639069661686374ae500eb988ca90b4" alt=""
There are many articles on how to connect to a WebSocket from Python, so I won't elaborate here. Usually, when we find an interface like this, our first instinct is to connect to it directly. On closer inspection, however, the address of the interface changes with each real request, and so do the cookie and key used by later requests, so a direct connection does not seem feasible. That leaves rendering the page. Selenium would work, but let's try a different route and go straight to headless Chrome.
Headless Chrome means running the Chrome browser in headless mode (as a program, without a visible interface). Since it came out, the author of PhantomJS announced that he would not maintain...
Use Docker to run headless Chrome directly:
docker run -d -p 9222:9222 --cap-add=SYS_ADMIN justinribeiro/chrome-headless
We now have a headless Chrome service running. How do we use it? We interact with headless Chrome over WebSocket. Let's go straight to the code:
    import json
    import time

    import requests
    import websocket

    request_id = 0
    target_url = 'https://datacenter.jin10.com/price'


    def get_websocket_connection():
        # This is the address of the machine that started headless Chrome
        r = requests.get('http://10.10.2.42:9222/json')
        if r.status_code != 200:
            raise ValueError("can not get the api, please check if docker is ready")
        conn_api = r.json()[0].get('webSocketDebuggerUrl')
        return websocket.create_connection(conn_api)


    def run_command(conn, method, **kwargs):
        global request_id
        request_id += 1
        command = {'method': method, 'id': request_id, 'params': kwargs}
        conn.send(json.dumps(command))
        # Skip unrelated messages until the reply with our id arrives
        while True:
            msg = json.loads(conn.recv())
            if msg.get('id') == request_id:
                return msg


    def get_element():
        conn = get_websocket_connection()
        run_command(conn, 'Page.navigate', url=target_url)
        time.sleep(5)  # give the page time to load
        js = "var p = document.querySelector('.jin-pricewall_list-item_b').innerText ; p ;"
        for _ in range(20):
            time.sleep(1)
            msg = run_command(conn, 'Runtime.evaluate', expression=js)
            print(msg.get('result')['result']['value'])


    if __name__ == '__main__':
        get_element()
data:image/s3,"s3://crabby-images/e8f0b/e8f0b0a90a61047130c4b762ce3ab9c212a7246b" alt=""
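The key idea in the code above is that each command sent to headless Chrome carries an id, and the reply is the message carrying the same id; any other message that arrives in between is skipped. The sketch below shows that matching logic without a browser. `FakeConnection` and `call` are hypothetical stand-ins of my own, and the canned messages are made up for illustration only.

```python
import json
from collections import deque


class FakeConnection:
    """Hypothetical stand-in for a WebSocket connection that replays canned messages."""

    def __init__(self, incoming):
        self.sent = []
        self._incoming = deque(incoming)

    def send(self, text):
        self.sent.append(text)

    def recv(self):
        return self._incoming.popleft()


def call(conn, request_id, method, **params):
    """Send one command and return the reply whose id matches request_id."""
    conn.send(json.dumps({"id": request_id, "method": method, "params": params}))
    while True:
        msg = json.loads(conn.recv())
        if msg.get("id") == request_id:
            return msg
        # Messages without a matching id (for example events) are ignored.


conn = FakeConnection([
    json.dumps({"method": "Page.frameStartedLoading", "params": {}}),
    json.dumps({"id": 1, "result": {"result": {"type": "string", "value": "1234.5"}}}),
])
reply = call(conn, 1, "Runtime.evaluate", expression="document.title")
print(reply["result"]["result"]["value"])
```

Passing the id explicitly instead of keeping it in a global variable is only a design choice for this sketch; it makes the function easier to test in isolation.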