mm131 crawler explained

http://www.mm131.net

The source code address is at the bottom of this article

brief introduction

Part of the programmer's daily eye-care series, the human-figure series, and the tech-geek crawler series.

The function is very simple: it scrapes the sexy (beauty) pictures from the mm131 website. In theory, it can capture all of them.

The idea is also simple: follow the same click path a person would, imitate a human's browsing pattern, and finally save the images to disk.

environment

  • python 3.6
  • requests 2.20.0
  • lxml 4.2.5

usage method

  1. First, you need Python 3. Please install it yourself; this article does not cover the installation steps.
  2. Next, download the files to a local directory. The directory structure is as follows:
mm131
├─ lib
│   └─ parser.py
├─ main.py
├─ README.md
└─ requirements.txt
  3. Then install the dependencies listed in the environment above:
root@localhost:~$ pip install -r requirements.txt
  4. Run the script from the mm131 directory:
root@localhost:~/mm131$ python main.py
  5. You will then see the crawl progress in the output, and the downloaded pictures will appear in the mm131 folder.

Script ideas

The website's anti-scraping strategy

Accessing http://www.mm131.net/xinggan/ directly forces a redirect to the index page, and the page content is set to an empty string.

Looking at the page with F12 (Chrome's developer tools), we found the site's anti-scraping logic, which is built into a JS script file called uaabc:

var mUA = navigator.userAgent.toLowerCase() + ',' + document.referrer.toLowerCase();
var _sa = ['baiduspider', 'baidu.com', 'sogou', 'sogou.com', '360spider', 'so.com', 'bingbot', 'bing.com', 'sm.cn', 'mm131.net'];
var _out = false;
for (var i = 0; i < 10; i++) {
    if (mUA.indexOf(_sa[i]) > 1) {
        _out = true;
        break
    }
}
if (!_out) {
    var _d = new Date();
    var _h = _d.getHours();
    if (_h >= 0 && _h <= 23) {
        var mPlace = escape(lo).toLowerCase();
        if (mPlace == '%u5317%u4eac%u5e02' || mPlace == '%u56db%u5ddd%u7701' || mPlace == '%u6e56%u5317%u7701') {
            top.location.href = 'https://m.mm131.net/index/';
        }
    }
};

The script roughly means: the userAgent or referrer must contain one of ['baiduspider', 'baidu.com', 'sogou', 'sogou.com', '360spider', 'so.com', 'bingbot', 'bing.com', 'sm.cn', 'mm131.net'].
Otherwise, it checks whether your current IP geolocates to Beijing, Sichuan, or Hubei; if it does, you are forcibly redirected back to the index page.

So we only need to make sure the Referer contains mm131.net. See the set_header() method in parser.py for details.

def set_header(self, referer):
    headers = {
        'Pragma': 'no-cache',
        'Accept-Encoding': 'gzip, deflate',
        'Accept-Language': 'zh-CN,zh;q=0.9,ja;q=0.8,en;q=0.7',
        'Cache-Control': 'no-cache',
        'Connection': 'keep-alive',
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/64.0.3282.186 Safari/537.36',
        'Accept': 'image/webp,image/apng,image/*,*/*;q=0.8',
        'Referer': '{}'.format(self.__set_referer(referer)),
    }
    return headers
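
The __set_referer() helper is not shown above. Below is a minimal standalone sketch of what it might do, assuming topic pages live at URLs like http://www.mm131.net/xinggan/<id>.html (that path pattern is an assumption); the only hard requirement imposed by the JS above is that the string contain mm131.net.

# Standalone sketch of a referer builder (the real __set_referer in
# parser.py may differ). The /xinggan/<id>.html path is an assumption;
# the JS check only requires that the Referer contain 'mm131.net'.
def set_referer(topic_id=None, page=None):
    if topic_id is None:
        return 'http://www.mm131.net/'
    if page in (None, 1):
        return 'http://www.mm131.net/xinggan/{}.html'.format(topic_id)
    return 'http://www.mm131.net/xinggan/{}_{}.html'.format(topic_id, page)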

Interface analysis

Get the topic IDs and titles from the list page

After visiting http://www.mm131.net/xinggan/, the IDs and titles of the latest 20 topics are extracted; these are used in the next step to fetch each topic's pictures.
The HTML is located and parsed with lxml's etree, which finally yields the id and title. Details can be found in the _ids_titles() method in parser.py.
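
A minimal, standalone sketch of this extraction step (not the exact code from parser.py); the GBK encoding and the //dd/a locator are assumptions about the page structure and should be adjusted to what F12 actually shows.

# Sketch of extracting (id, title) pairs from the list page with lxml.
# The encoding and the XPath locator are assumptions, not necessarily
# the values used in parser.py.
import re
import requests
from lxml import etree

def fetch_ids_titles(list_url='http://www.mm131.net/xinggan/'):
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)',
        'Referer': 'http://www.mm131.net/',   # must contain mm131.net
    }
    resp = requests.get(list_url, headers=headers)
    resp.encoding = 'gbk'                     # assumed page encoding
    html = etree.HTML(resp.text)
    results = []
    for a in html.xpath('//dd/a'):            # assumed topic-link locator
        href = a.get('href', '')
        title = a.xpath('string(.)').strip()
        m = re.search(r'/(\d+)\.html$', href)
        if m and title:
            results.append((m.group(1), title))
    return results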

Get all picture links for a topic

Because the picture links are very regular, it is easy to splice together all of a topic's picture links. Details can be found in parser.py.
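
A sketch of the splicing idea; the image host and path pattern used here are an assumption (check the real URLs in the browser), not necessarily what parser.py uses.

# Sketch of splicing a topic's image URLs from a regular pattern.
# The host/path pattern is an assumption; adjust to the real URLs.
def build_image_urls(topic_id, picture_count):
    base = 'http://img1.mm131.me/pic/{}/{}.jpg'
    return [base.format(topic_id, n) for n in range(1, picture_count + 1)]

# build_image_urls(5000, 3) ->
# ['http://img1.mm131.me/pic/5000/1.jpg',
#  'http://img1.mm131.me/pic/5000/2.jpg',
#  'http://img1.mm131.me/pic/5000/3.jpg']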

storage

The previous step produced all the picture links. With the links in hand, the next step is storage: you can download serially or in parallel, whichever suits your needs.

  • get_page: downloads all topic pictures of a page serially.
  • get_page_parallel: downloads all topic pictures of a page in parallel (a standalone sketch of both approaches follows below).
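
A standalone sketch of the serial vs. parallel choice (not the actual get_page / get_page_parallel bodies), assuming the image URLs and headers come from the earlier steps.

# Sketch of serial vs. parallel downloading with a thread pool.
# save_image, download_serial and download_parallel are illustrative
# names, not the functions in main.py / parser.py.
import os
from concurrent.futures import ThreadPoolExecutor
import requests

def save_image(url, headers, folder):
    resp = requests.get(url, headers=headers, timeout=10)
    name = url.rsplit('/', 1)[-1]
    with open(os.path.join(folder, name), 'wb') as f:
        f.write(resp.content)

def download_serial(img_urls, headers, folder):
    os.makedirs(folder, exist_ok=True)
    for url in img_urls:
        save_image(url, headers, folder)

def download_parallel(img_urls, headers, folder, workers=8):
    os.makedirs(folder, exist_ok=True)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        for url in img_urls:
            pool.submit(save_image, url, headers, folder)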

Search module

You can search directly for pictures of the person you want and download them. However, due to an internal server error on the official website, this still needs to be worked around.

About me

Since I may not respond promptly, readers who need the data set can follow the official account: Xiao Zhang is not enough to sleep.
Reply with: mm131 crawler code
to get the source code download link, as well as some other surprises.

Keywords: Python crawler
