Close to the autumn move, don't you see enough? Crawler to help you get the well-known website XX net multi class Classics (Java, C + +, Pyhton everything)!!!

preface

Blogger I have been wandering in Niuke's facial classics area all year round and summarized the high-frequency examination questions of big factories such as byte, Alibaba, Baidu, Tencent, meituan, etc. but today, I teach you how to crawl facial classics. If you can help your little partners, please give me a lot of support three times a day. Thank you for your disrespect!!!

This time, taking Java facial Sutra as an example, the little partner of the society can climb any facial Sutra of Niuke according to the law


teaching

Enter the Java area and open the console refresh request

It can be found that the response content obtained by sending the URL in the browser is not face-to-face, so where does the face-to-face data come from??? Don't worry, so many requests, let's keep looking!

Sliding down, you can see the request with json. Experience tells me that this is the request

Copy the URL, we go to the browser to request the URL, and we can find that we have obtained the face-to-face data

However, the surface is in JSON format, which can be copied to the online JSON parsing tool for viewing, as shown below

You can see that all posts, namely face-to-face information, are saved under discussPosts under data

However, unlike what I have seen before, this json string does not directly save the URL of the post details page, but we can provide access path discovery rules

You can see that the access path has a 675866, which corresponds to the postId in the json string, and the following parameters can be omitted


antic

It must be that a single page can't satisfy all of you. So if you crawl multiple pages, don't worry. I'll summarize the rules for you. I also hope you can click three times in a row!!!

The same routine, as shown in the figure below, is the face warp JSON string of the C + + area. I don't need to teach you more

Complete code

If you are a novice, you can see my crawler teaching index. After practicing, you can master it well!!!

Open source crawler instance tutorial directory index

import requests
import json
import os
from bs4 import BeautifulSoup

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"
}


path = "./Facial meridian"


symbol_list = ["\\", "/", "<", ":", "*", "?", "<", ">", "|","\"","&amp;"]

#Replace the title characters, because if there are these characters, they cannot be saved to txt
def replaceTitle(title):
    for i in symbol_list:
        if title.find(str(i)) != -1:
            title = title.replace(str(i)," ")

    return title

#Get details page URL
def getDetail_Url(tmp_path,json_str):
    discussPosts = json_str['data']['discussPosts']
    for discussPost in discussPosts:
        title = discussPost['postTitle']
        detail_url = "https://www.nowcoder.com/discuss/" + str(discussPost['postId'])
        detail_response = requests.get(url=detail_url,headers=headers)
        getDetailData(tmp_path,title,detail_response)


#Request details page for data
def getDetailData(tmp_path,title,detail_response):
    detailData = BeautifulSoup(detail_response.text,"html.parser")
    div_data = detailData.find(class_="post-topic-main").find_all("div")
    title = replaceTitle(title)
    print("Saving" + tmp_path + ":" + title)
    with open(tmp_path + "/" + title + ".txt",'w',encoding='utf-8') as f:
        for i in div_data:
            try:
                #If you encounter a tag with class clearfix, you can jump out of the loop, because the required data has been obtained, and the data under the tag is unnecessary
                if i['class'].count('clearfix') != 0:
                    break
            except:
                pass
            tmp_text = i.text.strip('\n').strip()
            f.write(tmp_text + "\n")
    f.close()



# start
if __name__ == '__main__':
    url = "https://www.nowcoder.com/discuss/experience/json?token=&tagId=639&companyId=0&phaseId=0&order=3&query=&page=1"
    for i in range(1,3):
        url = url + str(i)
        if not os.path.exists(path):
            os.mkdir(path)
        tmp_path = path + "/The first" + str(i) + "Page questions"
        if not os.path.exists(tmp_path):
            os.mkdir(tmp_path)
        response = requests.get(url=url,headers=headers)
        json_str = json.loads(response.text)
        getDetail_Url(tmp_path,json_str)

Result display


CSDN exclusive benefits come!!!


Recently, CSDN has an exclusive activity, that is, the following full stack knowledge map of Python. The route planning is very detailed. The size is 870mm x 560mm. Small partners can learn systematically according to the above process. Don't look for a book to learn. They should learn systematically and regularly. Its foundation is the most solid in our business, "Weak foundation, earth shaking and mountains shaking" is particularly obvious.

Finally, if you are interested, you can buy it as appropriate to pave the way for the future!!!


last

I am Skin shrimp , a shrimp lover who loves to share knowledge, we will constantly update blog posts that are beneficial to you in the future. We look forward to your attention!!!

It's not easy to create. If this blog is helpful to you, I hope you can click three times!, Thanks for your support. I'll see you next time~~~

Sharing outline

Interview question column of large factory


Directory index of Java learning route from getting started to entering the grave


Open source crawler instance tutorial directory index

For more wonderful content sharing, please click Hello World (●'◡'●)

Keywords: Python crawler

Added by mpar612 on Sun, 23 Jan 2022 12:08:14 +0200