Getting Started with Python Crawlers [7]: Crawling Images from Fengniao, Part Two

Crawling Fengniao Images: Introduction

Let's try something new today: the aiohttp library, which will speed up our crawling.

Installing the module is routine:

pip install aiohttp

Run the command and wait for the installation to finish. If you want to study further, the official documentation is essential: https://aiohttp.readthedocs.io/en/stable/

Now you can start writing code.

The page we're going to crawl this time is:

http://bbs.fengniao.com/forum/forum_101_1_lastpost.html

Open the page and we can easily find the page count.

It has been a long time since a page number was this easy to spot.

Try accessing this page with aiohttp. Importing the module is nothing special, just import it.
If we write the crawler with asyncio + aiohttp asynchronous I/O, note that asynchronous functions must be declared with the async keyword.
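As a minimal illustration of the async keyword (the function name is only for illustration, and asyncio.sleep stands in for a real aiohttp request):

```python
import asyncio

async def fetch_demo(num):
    # stand-in for a real aiohttp request: pause briefly, then return a value
    await asyncio.sleep(0.01)
    return f"page {num}"

loop = asyncio.new_event_loop()
result = loop.run_until_complete(fetch_demo(1))
loop.close()
print(result)  # page 1
```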

Next, let's try to fetch the page source of the address above.

The code first declares a function fetch_img_url with one parameter; you could also hardcode the URL instead.

No hints for the rest; look up the relevant details yourself ('<<)

aiohttp.ClientSession() as session: creates a session object, which we use to open web pages. A session can do many things, such as post, get, put, and so on.

await response.text() in the code waits for the page data to be returned.

asyncio.get_event_loop creates an event loop, and the run_until_complete method schedules the tasks. tasks can be a single coroutine or a list of them.
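For example, with no network involved (asyncio.gather is used here instead of asyncio.wait, because newer Python versions no longer accept bare coroutines in wait):

```python
import asyncio

async def square(n):
    await asyncio.sleep(0)  # yield control to the loop, like real I/O would
    return n * n

async def main():
    # a list of coroutines, all scheduled together
    return await asyncio.gather(*(square(n) for n in range(4)))

loop = asyncio.new_event_loop()
# run_until_complete drives the loop until every scheduled task finishes
results = loop.run_until_complete(main())
loop.close()
print(results)  # [0, 1, 4, 9]
```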

import aiohttp  
import asyncio 

async def fetch_img_url(num):
    url = f'http://bbs.fengniao.com/forum/forum_101_{num}_lastpost.html'  # f-string formatting
    # or simply hardcode it: url = 'http://bbs.fengniao.com/forum/forum_101_1_lastpost.html'
    print(url)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 Core/1.63.6726.400 QQBrowser/10.2.2265.400',
    }

    async with aiohttp.ClientSession() as session:
        # get the slideshow addresses
        async with session.get(url, headers=headers) as response:
            try:
                html = await response.text()   # Get Web Source
                print(html)

            except Exception as e:
                print("Basic Error")
                print(e)

# You can copy this part directly
loop = asyncio.get_event_loop()
tasks = asyncio.ensure_future(fetch_img_url(1))
results = loop.run_until_complete(tasks)

The last part of the code above can also be written as

loop = asyncio.get_event_loop()
tasks =  [fetch_img_url(1)]
results = loop.run_until_complete(asyncio.wait(tasks))

Okay, once you've got the page source, you're not far from the final goal.
Modify the code to fetch 10 pages in one batch.
You only need to modify tasks; run it and you'll see the following results.

tasks = [fetch_img_url(num) for num in range(1, 11)]  # pages 1 through 10

The steps that follow are very similar to the previous blog post: look for patterns.
Open a page at random

http://bbs.fengniao.com/forum/forum_101_4_lastpost.html

Click a picture to enter the detail page, then click a picture on the detail page to reach the slideshow page.

Click again to enter the image slideshow page.

Finally, we find all the image links in the slideshow page's source. So the question is: how do we get from the first link above to the slideshow link?
The source below is what you see by right-clicking View Source on http://bbs.fengniao.com/forum/pic/slide_101_10408464_893854.html.

Continue to analyze ~~~(= = o)

http://bbs.fengniao.com/forum/forum_101_4_lastpost.html
# how do we turn this into the link below?
http://bbs.fengniao.com/forum/pic/slide_101_10408464_89383854.html
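Jumping ahead a little: the two numbers in the slideshow URL come from the hrefs on the list page. A sketch of the transformation, using a made-up href as sample input:

```python
import re

# hypothetical href as it might appear in the list-page source
href = 'href="/forum/10408464_p893854.html'

# pull out the thread id and picture id
m = re.search(r'href="/forum/(\d+?)_p(\d+?)\.html', href)
slide_url = "http://bbs.fengniao.com/forum/pic/slide_101_{0}_{1}.html".format(
    m.group(1), m.group(2))
print(slide_url)
# http://bbs.fengniao.com/forum/pic/slide_101_10408464_893854.html
```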

Going back to the first link, we use the F12 developer tools to inspect a picture.

In the highlighted box we find the numbers we want, so all that's left is to extract them with regular expressions.
The code goes in the ### section below. Note that I used plain regular expressions here; while writing them I found that one pattern could not match everything in a single step, so the matching is split into two steps, as you can see in the details o(o)

  1. Find the picture list <div class="picList">
  2. Extract the two parts of the number we want
import re  # the regular expressions below need this

async def fetch_img_url(num):
    url = f'http://bbs.fengniao.com/forum/forum_101_{num}_lastpost.html'
    print(url)
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36 Core/1.63.6726.400 QQBrowser/10.2.2265.400',
    }

    async with aiohttp.ClientSession() as session:
        # get the slideshow addresses
        async with session.get(url, headers=headers) as response:
            try:
                ###############################################
                url_format = "http://bbs.fengniao.com/forum/pic/slide_101_{0}_{1}.html"
                html = await response.text()   # Get Web Source
                pattern = re.compile(r'<div class="picList">([\s\S]*?)</div>')
                first_match = pattern.findall(html)
                href_pattern = re.compile(r'href="/forum/(\d+?)_p(\d+?)\.html')
                # search each block once and skip blocks with no match
                matches = (href_pattern.search(item) for item in first_match)
                urls = [url_format.format(m.group(1), m.group(2)) for m in matches if m]
                ##############################################

            except Exception as e:
                print("Basic Error")
                print(e)
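The two-step matching can be tested on its own against a minimal, made-up HTML fragment (the real page markup has more attributes, but the shape is the same):

```python
import re

# minimal fake fragment in the shape of the list page
html = '''
<div class="picList"><a href="/forum/10408464_p893854.html">pic</a></div>
<div class="picList"><a href="/forum/10408465_p893855.html">pic</a></div>
'''

url_format = "http://bbs.fengniao.com/forum/pic/slide_101_{0}_{1}.html"
# step 1: isolate each picList block
blocks = re.findall(r'<div class="picList">([\s\S]*?)</div>', html)
# step 2: pull the two numbers out of each block
href_pattern = re.compile(r'href="/forum/(\d+?)_p(\d+?)\.html')
urls = [url_format.format(m.group(1), m.group(2))
        for m in (href_pattern.search(b) for b in blocks) if m]
print(urls)
```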

The code is complete and we have the URLs we want, so let's go on to read each URL's page and match the image links we're after.

import json  # needed to parse the picture list below

async def fetch_img_url(num):
    # copy the code above
    async with aiohttp.ClientSession() as session:
        # get the slideshow addresses
        async with session.get(url, headers=headers) as response:
            try:
                #Copy the code above
                ################################################################
                for img_slider in urls:
                    try:
                        async with session.get(img_slider, headers=headers) as slider:
                            slider_html = await slider.text()   # Get Web Source
                            try:
                                pic_list_pattern = re.compile(r'var picList = \[(.*?)\];')
                                pic_list = "[{}]".format(pic_list_pattern.search(slider_html).group(1))
                                pic_json = json.loads(pic_list)  # The list of pictures is available
                                print(pic_json)
                            except Exception as e:
                                print("Code Debugging Errors")
                                print(pic_list)
                                print("*"*100)
                                print(e)

                    except Exception as e:
                        print("Get Picture List Error")
                        print(img_slider)
                        print(e)
                        continue
                ################################################################

                print("{} operation completed".format(url))
            except Exception as e:
                print("Basic Error")
                print(e)
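The picList extraction can also be tried in isolation on a made-up script snippet (the real page's JSON carries more fields than the downloadPic key shown here, and the image URL below is invented):

```python
import json
import re

# fake excerpt of the slideshow page's inline JavaScript
slider_html = 'var picList = [{"downloadPic":"http://img.example.com/a.jpg"}];'

# grab everything between the square brackets, then re-wrap and parse it
m = re.search(r'var picList = \[(.*?)\];', slider_html)
pic_list = "[{}]".format(m.group(1))
pic_json = json.loads(pic_list)
print(pic_json[0]["downloadPic"])  # http://img.example.com/a.jpg
```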

The image JSON is finally in hand. The last step: download the pictures. One quick run and the images come down ~~~

async def fetch_img_url(num):
    # Code to find above
    async with aiohttp.ClientSession() as session:
        # get the slideshow addresses
        async with session.get(url, headers=headers) as response:
            try:
                # Code to find above
                for img_slider in urls:
                    try:
                        async with session.get(img_slider, headers=headers) as slider:
                            # Code to find above
                            ##########################################################
                            for img in pic_json:
                                try:
                                    img = img["downloadPic"]
                                    async with session.get(img, headers=headers) as img_res:
                                        imgcode = await img_res.read()  # read the image bytes
                                        with open("images/{}".format(img.split('/')[-1]), 'wb') as f:
                                            f.write(imgcode)  # the with block closes the file automatically
                                except Exception as e:
                                    print("Picture download error")
                                    print(e)
                                    continue
                            ###############################################################

                    except Exception as e:
                        print("Get Picture List Error")
                        print(img_slider)
                        print(e)
                        continue
                print("{} operation completed".format(url))
            except Exception as e:
                print("Basic Error")
                print(e)

The pictures will quickly appear in the images folder, which you need to create in advance.
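To avoid having to create the folder by hand, the save step can create it on the fly. A small sketch using pathlib (the function name, folder, and URL here are made up, and a dummy byte string stands in for real image data):

```python
from pathlib import Path

def save_image(data: bytes, img_url: str, folder: str = "images") -> Path:
    # create the target folder if it does not exist yet
    out_dir = Path(folder)
    out_dir.mkdir(parents=True, exist_ok=True)
    # keep the original file name from the URL
    path = out_dir / img_url.split('/')[-1]
    path.write_bytes(data)
    return path

saved = save_image(b"\x89PNG...", "http://img.example.com/a.jpg", folder="images_demo")
print(saved)
```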

tasks can hold up to 1024 coroutines, but I recommend starting around 100 — too much concurrency will overwhelm the target server.
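One common way to cap concurrency is an asyncio.Semaphore. A sketch with a dummy worker standing in for the real fetch coroutine (the names and the limit of 5 are just for the demo):

```python
import asyncio

async def worker(num, sem, state):
    async with sem:                        # blocks if the limit is reached
        state["active"] += 1
        state["peak"] = max(state["peak"], state["active"])
        await asyncio.sleep(0.01)          # stand-in for the real request
        state["active"] -= 1

async def main(limit=5, jobs=20):
    sem = asyncio.Semaphore(limit)         # at most `limit` workers run at once
    state = {"active": 0, "peak": 0}
    await asyncio.gather(*(worker(n, sem, state) for n in range(jobs)))
    return state["peak"]

peak = asyncio.run(main())
print(peak)  # never exceeds 5
```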

Once that's done, add some finishing touches, such as saving to a specified folder, and you're all set.


Added by sqishy on Wed, 24 Jul 2019 21:14:35 +0300