Twenty lines of Python code to crawl high-quality Weibo videos (a treat)

Brothers, if I don't post a high-quality crawler, nobody even shows up to read. Alas~

So that's the kind of people you all are!!!

You've been following me for so long, so of course you deserve some proper teaching material. After all, if it isn't fun, nobody watches. Today I'll crawl some "little sister" videos for you.

The environment used is Python 3.6 and PyCharm, and you need to install a browser driver (for Chrome or Firefox, matching your browser version). If any of you don't have one, you can message me privately to get it, along with the complete code and a full tutorial.

1. Crawler principles

  • Function: fetch Internet data in bulk (text, images, audio, video)
  • Essence: a repeated cycle of requests and responses
  • The most commonly used request methods are GET and POST; the full set is listed in the table below, and a short request sketch follows it
Method    Description
GET       Requests the page and returns the page content
HEAD      Similar to a GET request, but the response contains no body; used to get only the headers
POST      Mostly used to submit forms or upload files; the data is carried in the request body
PUT       Data sent from the client to the server that replaces the content of the specified document
DELETE    Asks the server to delete the specified page
CONNECT   Uses the server as a relay, letting the server access other web pages on behalf of the client
OPTIONS   Lets the client query the capabilities of the server
TRACE     Echoes the request received by the server, mainly used for testing or diagnosis
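
To make the request/response cycle concrete, here is a minimal sketch using the requests library; httpbin.org is used purely as a stand-in test endpoint and is not part of the original tutorial:

import requests

# GET: parameters travel in the URL query string
resp = requests.get('https://httpbin.org/get', params={'page': 1})
print(resp.status_code)      # 200 on success
print(resp.json()['args'])   # the test server echoes our parameters back

# POST: data travels in the request body
resp = requests.post('https://httpbin.org/post', data={'name': 'value'})
print(resp.json()['form'])   # the test server echoes the form body back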
  • Request headers (sent as key-value pairs) tell the server additional configuration about the request, which the server inspects when handling it. The most important fields include Cookie, Referer, and User-Agent, explained below; a code sketch follows the list.
    • Accept: specifies which content types the client can accept.
    • Accept-Language: specifies which languages the client can accept.
    • Accept-Encoding: specifies which content encodings the client can accept.
    • Host: specifies the host and port of the requested resource, i.e. the location of the original server or gateway of the requested URL. Since HTTP 1.1, every request must include this header.
    • Cookie: often used in the plural, Cookies. Cookies are data a website stores on the client to identify the user for session tracking; their main job is to maintain the current session. For example, after we log in to a site with a username and password, the server keeps the login state in a session. From then on, every refresh or request to another page of the site still shows us as logged in, and that is the work of Cookies: they carry information that identifies our session on the server. Each time the browser requests a page of the site, it attaches the Cookies to the request header; the server uses them to recognize us, sees that we are logged in, and returns the content that is only visible after login.
    • Referer: identifies the page from which the request was sent. The server can use this for things such as source statistics and hotlink protection.
    • User-Agent: UA for short. A special string that lets the server identify the client's operating system and version, and browser and version. Adding this header to a crawler disguises it as a browser; without it, the crawler is easily detected as one.
    • Content-Type: also known as the Internet media type or MIME type. In an HTTP message header it indicates the media type of the content: for example, text/html means HTML, image/gif means a GIF image, and application/json means JSON. For more mappings, see this reference table: http://tool.oschina.net/commons
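
In practice these headers are passed to requests as a plain dict. A minimal sketch; the values are placeholders you would copy from your own browser's developer tools, and httpbin.org is again just a test endpoint that echoes back what it receives:

import requests

headers = {
    # placeholder values -- copy real ones from your browser's developer tools
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',  # disguise as a browser
    'Referer': 'https://weibo.com/',       # the page the request claims to come from
    'Cookie': '',                          # your session cookie keeps you "logged in"
    'Accept-Language': 'en-US,en;q=0.9',
}
resp = requests.get('https://httpbin.org/headers', headers=headers)
print(resp.json()['headers'])  # the test server echoes back the headers we sent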
  • What does the response contain?
    • The response the server returns to the client has three parts: the response status code, the response headers, and the response body.
    • 1. Response status code: indicates how the server responded. For example, 200 means the server responded normally, 404 means the page was not found, and 500 means an internal server error. In a crawler we judge the server's response by the status code: if it is 200, the data came back successfully and we process it further; otherwise we simply ignore the response.
    • 2. Response headers: the metadata the server sends back about the response, such as the content type and any cookies it sets.
    • 3. Response body: the most important part; the response's payload lives here. When requesting a web page, the response body is the HTML code of the page; when requesting a picture, the response body is the binary data of the picture. After a crawler requests a web page, the content to parse is the response body. Clicking Preview in the browser developer tools shows the page source, i.e. the content of the response body, which is the target of our parsing. When writing a crawler, we mainly take the page source or JSON data from the response body and extract the content we need. In short: use an HTTP request library to send a request to the server, get the response, take the content of the response body, and parse it into our data.
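
Putting the three parts together, a crawler typically checks the status code first and then reads the body as text, bytes, or JSON. A minimal sketch, once more using httpbin.org as a stand-in endpoint:

import requests

resp = requests.get('https://httpbin.org/get')
if resp.status_code == 200:                  # 1. status code: 200 means success
    print(resp.headers['Content-Type'])      # 2. response headers
    print(resp.text[:100])                   # 3. response body as text (e.g. HTML)
    data = resp.json()                       #    or parse the body as JSON
    raw = resp.content                       #    or read it as raw bytes (e.g. an image)
else:
    pass                                     # anything other than 200 is simply ignored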

2. Case implementation

  1. Find the target URL
  2. Send the network request
  3. Get the data
  4. Filter the data
  5. Save the data
import os
import requests

# PyCharm tip for turning copied browser headers into a dict:
# 1. Select the copied header lines
# 2. Press Ctrl+R (enable the regex toggle: the asterisk icon, or "Regex" in 2021 versions)
# 3. Enter (.*?): (.*) in the first box
# 4. Enter '$1': '$2', in the second box
# 5. Click REPLACE ALL
headers = {
    'cookie': '',      # fill in your own cookie from the browser's developer tools
    'referer': 'https://weibo.com/tv/channel/4379160563414111/editor',
    'user-agent': '',  # fill in your own User-Agent string
}
data = {
    'data': '{"Component_Channel_Editor":{"cid":"4379160563414111","count":9}}'
}
# 1. Find the target URL
url = 'https://www.weibo.com/tv/api/component?page=/tv/channel/4379160563414111/editor'
# 2. Send the network request and 3. get the data
json_data = requests.post(url=url, headers=headers, data=data).json()
ccs_list = json_data['data']['Component_Channel_Editor']['list']
next_cursor = json_data['data']['Component_Channel_Editor']['next_cursor']
os.makedirs('video', exist_ok=True)  # make sure the output folder exists
for ccs in ccs_list:
    oid = ccs['oid']
    title = ccs['title']
    data_1 = {
        'data': '{"Component_Play_Playinfo":{"oid":"' + oid + '"}}'
    }
    # 1. Find the target URL
    url_1 = 'https://weibo.com/tv/api/component?page=/tv/show/' + oid
    # 2. Send the network request (2.1 build the request headers, 2.2 build the request parameters)
    # 3. Get the data
    json_data_2 = requests.post(url=url_1, headers=headers, data=data_1).json()
    # 4. Filter the data
    dict_urls = json_data_2['data']['Component_Play_Playinfo']['urls']
    video_url = 'https:' + dict_urls[list(dict_urls.keys())[0]]
    print(title + '\t' + video_url)
    # 5. Save the data
    video_data = requests.get(video_url).content
    with open(f'video/{title}.mp4', mode='wb') as f:
        f.write(video_data)
    print(title, 'crawled successfully................')

Brothers, consider this my gift of a girlfriend to you. Go for it~

Keywords: Python, crawler, data mining
