Programmer's happiness - using Python to crawl pictures of beautiful women from the netbian ("other shore") wallpaper site

When I first entered the mortal world, I did not know its suffering;
when I looked back, I was already a man of sorrows.
Three parts of this glass are wine,
the rest of it is grief.
Let it go, let it go,
yet every night is hard.
May the love and hatred of this world melt into this wine,
may this worldly dust dissolve into the cup,
one drink to wash away all worries!

That's a verse from the poetry book on my pillow. A casual read gave off a strong whiff of late-night sad-posting, which scared me into opening my picture collection for a look. Sure enough, gazing at beautiful women is a pleasant cure.

What! You don't have a collection? Then today I'll teach you how to build one of your own.
Today we'll crawl the netbian wallpaper site. It's a relatively simple site and good practice for beginners (don't hit it too hard, though, or your IP will still get banned).
My development environment: Python 3.8
Operating system: Windows 10
IDE: PyCharm
Packages used: requests, os, re
Website: https://pic.netbian.com/

Open the site and inspect it first. The pictures sit in img tags inside li tags, but those are only thumbnails. For the high-definition image you have to follow the link in the surrounding a tag to the detail page, and crawl the full image from there.
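In other words, the crawl takes two hops: list page, then detail page, then the full-size file. As an overview (each step becomes a function below):

# Crawl flow, step by step:
#   1. get_page(list_url)       -> HTML of the list page
#   2. analysis(html)           -> detail-page links from the <a> tags
#   3. get_page(detail_link)    -> HTML of one detail page
#   4. analysis2(detail_html)   -> (title, full-size image URL) from the <img> tag
#   5. requests.get(img_url).content -> image bytes, written to disk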

First, fetch this page's data:

import os
import re
import requests


def get_page(url, headers):
    response = requests.get(url=url, headers=headers)
    response.encoding = 'gbk'  # this site is gbk-encoded; without this the Chinese text comes back garbled
    if response.status_code == 200:
        return response.text

Note that this site's encoding is gbk; if you leave requests' default encoding in place, the Chinese titles will come back as mojibake.
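If you'd rather not hard-code the encoding, requests can also guess it from the response body; here's a quick sanity check (a sketch, and the guess may vary):

import requests

r = requests.get('https://pic.netbian.com/', headers={'User-Agent': 'Mozilla/5.0'})
print(r.apparent_encoding)        # the library's guess; gbk/GB2312 is expected here
r.encoding = r.apparent_encoding  # alternative to setting 'gbk' by hand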

def main():
    url = 'https://pic.netbian.com/'
    headers = {
        'User-Agent': 'Fiddler/5.0.20204.45441 (.NET 4.8; WinNT 10.0.19042.0; zh-CN; 8xAMD64; Auto Update; Full Instance; Extensions: APITesting, AutoSaveExt, EventLog, FiddlerOrchestraAddon, HostsFile, RulesTab2, SAZClipboardFactory, SimpleFilter, Timeline)'
    }
    html = get_page(url, headers)
    print(html)


if __name__ == '__main__':
    main()

Remember to set up User-Agent camouflage every time you write a crawler.

Open the page and press F12 to inspect it, click the Network tab, click any entry under Name, scroll down on the right until you find the User-Agent, copy it, and wrap it as a dictionary entry. The code above shows the result.

Run it and you get the page's source code.

Since I want every link inside the a tags, I use re here to match them with a regular expression:

def analysis(html):
    pattern = re.compile('<li><a href="(.*?)" title=')  # capture each href that precedes a title attribute
    links = re.findall(pattern, html)
    print(links)
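To see what the pattern captures, here's a toy run against a made-up fragment of the list markup (the href is hypothetical):

import re

sample = '<li><a href="/tupian/12345.html" title="demo picture">...</a></li>'
print(re.findall('<li><a href="(.*?)" title=', sample))
# ['/tupian/12345.html'] -- a relative path, hence the joining step below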

We've got all the links, but the first two matches are uninvited guests that we can slice off by index. There's another problem: these links are relative paths, so we have to join them with the domain ourselves.

Each extracted href is missing the scheme and domain, so we prepend https://pic.netbian.com to the front.

Now the links are spliced into complete URLs and can be opened. Create an empty list and collect them in it, and the first part is done (the completed function is just below). Next, let's write a loop that visits each of these links and pulls the picture information out of them.
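Here is analysis() with the slice and the join in place (the same function shows up again in the full listing at the end):

def analysis(html):
    list_links = []
    pattern = re.compile('<li><a href="(.*?)" title=')
    links = re.findall(pattern, html)
    for i in links[2:]:  # skip the first two matches; they aren't wallpaper pages
        list_links.append('https://pic.netbian.com' + i)  # prepend the domain
    return list_links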

Each detail page can be parsed with the get_page() we wrote earlier.

Then regex-match the parsed page source to pick out the image link and its name, ready for saving later.

These links are also incomplete and need the domain joined on. Before joining, use ''.join() to turn the result into a string, because re.findall() returns a list.
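A quick illustration with a made-up title:

matches = re.findall('<h1>(.*?)</h1>', '<h1>demo title</h1>')
print(matches)           # ['demo title'] -- a list, even for a single match
print(''.join(matches))  # 'demo title'   -- now a plain string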

def analysis2(html):
    name = re.compile('<h1>(.*?)</h1>')  # the picture title lives in the <h1>
    new_name = re.findall(name, html)
    pattern = re.compile('<a href="" id="img"><img src="(.*?)" data')  # the full-size image src
    img = re.findall(pattern, html)
    img = ''.join(img)  # findall returns a list; collapse it to a string
    new_name = ''.join(new_name)
    new_link = 'https://pic.netbian.com' + img  # join the domain onto the relative path
    return new_name, new_link

return here hands back a tuple, so we can use indexing directly to pull out what we want.
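So after the call, index 0 is the name and index 1 is the link; tuple unpacking works just as well:

info = analysis2(html)  # html here is a detail page's source
print(info[0])          # picture title
print(info[1])          # full-size image URL
# equivalently: name, link = analysis2(html)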

For persistent storage you need to create a folder first; otherwise open() will fail when writing to the 'happy/' path.

I create a new folder called happy:

if not os.path.exists('./happy'):
    os.mkdir('./happy')

for link in links:
    html = get_page(link, headers)
    info = analysis2(html)
    img_data = requests.get(url=info[1], headers=headers).content  # raw image bytes
    new_img_name = 'happy/' + info[0] + '.jpg'
    with open(new_img_name, 'wb') as f:  # write in binary mode
        f.write(img_data)
    print(new_img_name, 'Download succeeded')

Write it out with a with block and you're done.
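Two optional hardening touches if you run this for real, both hypothetical additions rather than part of the original code: pause between requests so the site doesn't ban you (remember the warning at the start), and sanitize titles before using them as Windows file names:

import re
import time

def safe_name(name):
    # strip characters Windows forbids in file names (we're on Windows 10 per the setup above)
    return re.sub(r'[\\/:*?"<>|]', '_', name)

# inside the download loop:
#     time.sleep(1)  # roughly one second between requests is polite
#     new_img_name = 'happy/' + safe_name(info[0]) + '.jpg'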


Isn't this happiness!!

The code is fairly simple, so the comments are brief (it's not that I didn't want to write them).

The full code is as follows:

import os
import re
import requests

# create the download folder up front if it doesn't exist yet
if not os.path.exists('./happy'):
    os.mkdir('./happy')


def get_page(url, headers):
    response = requests.get(url=url, headers=headers)
    response.encoding = 'gbk'  # the site is gbk-encoded; avoids garbled Chinese titles
    if response.status_code == 200:
        return response.text


def analysis(html):
    # collect the detail-page links from the list page
    list_links = []
    pattern = re.compile('<li><a href="(.*?)" title=')
    links = re.findall(pattern, html)
    for i in links[2:]:  # the first two matches aren't wallpaper pages
        new_link = 'https://pic.netbian.com' + i
        list_links.append(new_link)
    return list_links


def analysis2(html):
    # extract the title and the full-size image URL from a detail page
    name = re.compile('<h1>(.*?)</h1>')
    new_name = re.findall(name, html)
    pattern = re.compile('<a href="" id="img"><img src="(.*?)" data')
    img = re.findall(pattern, html)
    img = ''.join(img)  # findall returns a list; collapse it to a string
    new_name = ''.join(new_name)
    new_link = 'https://pic.netbian.com' + img
    return new_name, new_link


def main():
    url = 'https://pic.netbian.com/'
    headers = {
        'User-Agent': 'Fiddler/5.0.20204.45441 (.NET 4.8; WinNT 10.0.19042.0; zh-CN; 8xAMD64; Auto Update; Full Instance; Extensions: APITesting, AutoSaveExt, EventLog, FiddlerOrchestraAddon, HostsFile, RulesTab2, SAZClipboardFactory, SimpleFilter, Timeline)'
    }
    html = get_page(url, headers)
    links = analysis(html)
    for link in links:
        html = get_page(link, headers)
        info = analysis2(html)
        img_data = requests.get(url=info[1], headers=headers).content  # raw image bytes
        new_img_name = 'happy/' + info[0] + '.jpg'
        with open(new_img_name, 'wb') as f:
            f.write(img_data)
        print(new_img_name, 'Download succeeded')


if __name__ == '__main__':
    main()

Just fifty-odd lines of code to achieve happiness. Is a man's happiness really this simple!!

Keywords: Python, PyCharm, crawler
