Crawling 1000 posts from the CET bar with Python: so this is what they are all talking about!

Preface

If you want to learn more about the postgraduate entrance examination, you can either ask a senior student or search online, and Baidu Tieba is a good place to look. With the right tools, valuable information can be pulled quickly out of the mixed bag of content on the web. There are already many tutorials and examples for crawling Baidu Tieba, but Tieba's page rules change quickly, and every crawler has a different purpose and collects different content, so I wrote this tool.

Objective

1. Crawl 1000 posts
2. Determine whether each post is advertising or spam
3. Analyze the emotional tone of the language
4. Generate a word cloud

1. Analysis

1.1 First check the paging rules of the Tieba: as expected, each page holds 50 posts

1.2 The post content is also regular: it sits inside a div with the class threadlist_abs threadlist_abs_onlyline

1.3 Use Baidu AI's content review to judge the content and Baidu AI's emotional analysis to score it, which saves time
1.4 The word cloud could be generated with jieba word segmentation and wordcloud, which was my first plan, but I later found that there are ready-made tools on the web
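
To make step 1.2 concrete, here is a minimal sketch of pulling the post text out of a page. It uses the standard library's html.parser as a stand-in so that it runs without extra packages (the crawler in this post uses BeautifulSoup instead). The class string comes from the crawler code; the names AbstractExtractor and extract_abstracts and the sample markup are hypothetical, not real Tieba output.

```python
from html.parser import HTMLParser

# Class string taken from the crawler code in this post.
ABSTRACT_CLASS = "threadlist_abs threadlist_abs_onlyline"


class AbstractExtractor(HTMLParser):
    """Collect the text of every <div> whose class equals ABSTRACT_CLASS."""

    def __init__(self):
        super().__init__()
        self.abstracts = []  # finished abstracts, one string per post
        self._depth = 0      # greater than 0 while inside a matching div
        self._parts = []     # text fragments of the abstract being read

    def handle_starttag(self, tag, attrs):
        if tag != "div":
            return
        if self._depth:
            self._depth += 1  # a nested div inside an abstract
        elif dict(attrs).get("class") == ABSTRACT_CLASS:
            self._depth = 1

    def handle_endtag(self, tag):
        if tag == "div" and self._depth:
            self._depth -= 1
            if self._depth == 0:
                self.abstracts.append("".join(self._parts).strip())
                self._parts = []

    def handle_data(self, data):
        if self._depth:
            self._parts.append(data)


def extract_abstracts(html):
    """Return the stripped text of every abstract div in html."""
    parser = AbstractExtractor()
    parser.feed(html)
    parser.close()
    return parser.abstracts


# Hypothetical page fragment for illustration only.
SAMPLE = """
<div class="threadlist_title">title</div>
<div class="threadlist_abs threadlist_abs_onlyline"> first post </div>
<div class="threadlist_abs threadlist_abs_onlyline">second <b>post</b></div>
"""

print(extract_abstracts(SAMPLE))  # ['first post', 'second post']
```

If Tieba changes its markup, only ABSTRACT_CLASS should need updating.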

2. Crawling process

2.1 The first small problem is to have the script work out for itself which pages to fetch, given 50 posts per page: if I enter 1000 posts, it should crawl exactly the pages that cover them. A simple calculation does this: the pn offsets run 0, 50, 100, ... up to the requested count
2.2 The crawl code: after fetching each page, call content review and emotional analysis on every post, then write the compliant ones to a file

import time

import requests
from bs4 import BeautifulSoup


def gettbtz(tbname, tznum):
    """Crawl the first tznum posts (an integer multiple of 50) of the bar tbname."""
    tznum = int(tznum)
    emotions = 0
    # Tieba shows 50 posts per page, so pn takes the offsets 0, 50, 100, ...
    for n in range(0, tznum, 50):
        print("Crawling posts " + str(n + 1) + " to " + str(n + 50))
        url = "http://tieba.baidu.com/f?kw=" + tbname + "&ie=utf-8&pn=" + str(n)
        soup = BeautifulSoup(requests.get(url).text, 'lxml')  # fetch and parse one list page
        posts = soup.find_all('div', class_='threadlist_abs threadlist_abs_onlyline')
        for post in posts:
            # Check compliance first, then write the text to the file and add up its emotional index.
            # "Compliance" is the value as it appears in this translated post;
            # the live API may return the Chinese string instead.
            if BDAITEXT(post.text) == "Compliance":
                print("Compliant post found, writing to file: " + post.text)
                with open("resaults.txt", "a+", encoding='utf-8') as f:
                    f.write(str(post.text))  # str() guards against later text-encoding errors
                try:
                    emotions = emotions + BDAIemotion(post.text)
                    print("Current cumulative emotional index: " + str(emotions))
                except Exception:
                    print("Emotional analysis error, skipping")
            else:
                print("Non-compliant post, skipping")
        time.sleep(10)  # gentlemen's agreement: pause 10 seconds between pages
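
The page arithmetic from step 2.1 can be sketched on its own. This is a minimal version that assumes the pn parameter is a zero-based post offset, as the URL in the crawler suggests. The helper name page_offsets is hypothetical.

```python
import math

POSTS_PER_PAGE = 50  # page size observed in step 1.1


def page_offsets(post_count, per_page=POSTS_PER_PAGE):
    """Return the pn offsets needed to cover post_count posts."""
    pages = math.ceil(post_count / per_page)
    return [page * per_page for page in range(pages)]


print(len(page_offsets(1000)))  # 20
print(page_offsets(120))        # [0, 50, 100]
```

Rounding up means a request for 120 posts still fetches the third page, so the crawler may collect slightly more posts than asked for when the count is not a multiple of 50.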

3. Calling the Baidu AI API

3.1 To get a Baidu AK, first register a developer account on the Baidu AI open platform, then create an application to obtain its ID and keys (AK and SK), and finally use them to request the token that the API calls need

import requests

# client_id is the AK from the official website, client_secret is the SK from the official website
host = 'https://aip.baidubce.com/oauth/2.0/token?grant_type=client_credentials&client_id=[AK]&client_secret=[SK]'
response = requests.get(host)
if response:
    print(response.json())
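
The token step can also be written so that the keys are URL-encoded rather than pasted into the string by hand. The sketch below only builds the request URL and reads the token out of a response body; it makes no network call. The helper names and the sample response are hypothetical, and it assumes the token comes back in an access_token field and is passed to later calls as an access_token query parameter.

```python
import json
from urllib.parse import urlencode

TOKEN_ENDPOINT = "https://aip.baidubce.com/oauth/2.0/token"


def build_token_url(api_key, secret_key):
    """Build the token request URL from the AK and SK."""
    query = urlencode({
        "grant_type": "client_credentials",
        "client_id": api_key,
        "client_secret": secret_key,
    })
    return TOKEN_ENDPOINT + "?" + query


def extract_access_token(response_text):
    """Read the token from a response body, failing loudly if it is absent."""
    payload = json.loads(response_text)
    if "access_token" not in payload:
        raise ValueError("no access_token in response: " + response_text)
    return payload["access_token"]


def with_token(api_url, access_token):
    """Append the token to an API URL (assumed query parameter name)."""
    return api_url + "?" + urlencode({"access_token": access_token})


# Hypothetical response body for illustration only.
sample_response = '{"access_token": "hypothetical-token", "expires_in": 2592000}'
token = extract_access_token(sample_response)
print(build_token_url("my-ak", "my-sk"))
print(with_token("https://example.invalid/api", token))
```

URLs built this way are what variables such as BDAItexturl and BDAIemotionurl would hold.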

3.2 Content Auditing API Calls

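import json
import requests

def BDAITEXT(text):
    """Baidu AI text review: return the API's conclusion (compliant or not)."""
    # BDAItexturl is the review endpoint plus access token, defined elsewhere in the script
    content = {"text": text}
    r = requests.post(BDAItexturl, data=content).text  # sent as form data
    if r:
        rback = json.loads(r)
        return rback.get("conclusion")  # None if the API returned an error instead
    return None  # empty response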

3.3 Emotional Analysis API Calls

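import json
import requests

def BDAIemotion(text):
    """Baidu AI emotional analysis: return the positive probability as a number."""
    # BDAIemotionurl is the analysis endpoint plus access token, defined elsewhere in the script
    content = json.dumps({"text": text})  # sent as a JSON string
    r = requests.post(BDAIemotionurl, data=content).text
    if r:
        rback = json.loads(r)
        return rback['items'][0]['positive_prob']
    raise ValueError("empty response from emotional analysis")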
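
The crawler only keeps a running sum of the emotional index, which is hard to interpret on its own. As a rough sketch, the individual scores could be kept in a list and summarized afterwards. The function name summarize_emotions, the 0.5 threshold, and the sample scores are all assumptions for illustration, not values from the actual crawl.

```python
def summarize_emotions(scores, threshold=0.5):
    """Summarize positive_prob scores: count, mean, and share above threshold."""
    if not scores:
        raise ValueError("no scores to summarize")
    positive = sum(1 for score in scores if score > threshold)
    return {
        "count": len(scores),
        "mean": sum(scores) / len(scores),
        "positive_share": positive / len(scores),
    }


# Hypothetical scores, not results from the actual crawl.
sample_scores = [0.75, 0.5, 0.25, 1.0]
print(summarize_emotions(sample_scores))
# {'count': 4, 'mean': 0.625, 'positive_share': 0.5}
```

The mean and the share of positive posts say more than the raw total, which grows with the number of posts regardless of their tone.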

4. Word Cloud Generation

Many online tools accept a large chunk of text, filter it and split it into words as needed, and then let you set colors and styles to generate the word cloud.
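
For readers who prefer to stay in Python, the filtering and counting that sit behind a word cloud can be sketched with the standard library alone. The tokens below are a hypothetical, already segmented sample. In a real run they might come from jieba.lcut, and the resulting frequencies could be handed to wordcloud's WordCloud.generate_from_frequencies; neither call is exercised here. The helper name word_frequencies is an assumption.

```python
from collections import Counter


def word_frequencies(tokens, stopwords=(), min_length=2):
    """Count tokens, dropping stopwords and tokens shorter than min_length."""
    stop = set(stopwords)
    cleaned = (token.strip() for token in tokens)
    kept = [t for t in cleaned if len(t) >= min_length and t not in stop]
    return Counter(kept)


# Hypothetical segmented text, not data from the actual crawl.
tokens = ["数学", "政治", "的", "数学", "专业课", "了", "政治", "数学"]
freq = word_frequencies(tokens, stopwords={"的", "了"})
print(freq.most_common(2))  # [('数学', 3), ('政治', 2)]
```

Dropping single-character tokens is a crude filter that removes most particles, at the cost of losing a few meaningful one-character words.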

5. Information Analysis

Looking at the word cloud, the results speak for themselves: preparing early, rich experience, professional courses, mathematics, politics, choice of college...
From the emotional analysis, most of the emotional indexes are positive, which suggests that a positive attitude is still needed to get through the postgraduate entrance examination.

Run Screenshot

To be improved

1. It should be multi-threaded; as it stands it is too slow
2. It crawls only the posts, not the comments
3. The emotional analysis makes many mistakes
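
For the first point, one possible direction is a thread pool from the standard library. The sketch below is not the crawler from this post: crawl_pages is a hypothetical helper, and fake_fetch stands in for the real network request so that the example runs offline.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_pages(offsets, fetch_page, max_workers=4):
    """Call fetch_page on every offset in a thread pool; results keep input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_page, offsets))


def fake_fetch(offset):
    """Stand-in for the real request: pretend each page holds two posts."""
    return ["post %d" % i for i in range(offset, offset + 2)]


pages = crawl_pages(range(0, 150, 50), fake_fetch)
print(pages)
# [['post 0', 'post 1'], ['post 50', 'post 51'], ['post 100', 'post 101']]
```

Fetching pages in parallel works against the 10-second pause in the original loop, so the worker count should stay small to keep the request rate polite, and the Baidu AI calls may have rate limits of their own.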

"Gossip and gossip are left to the people of cities and towns, you just care about the elegance and calmness of the distant place"

Keywords: Python JSON encoding IE

Added by sapoxgn on Tue, 14 Jan 2020 18:43:20 +0200