Foreword
Anyone who wants to learn more about the postgraduate entrance examination can either ask a senior student or search online, and Baidu Tieba is a good place to look. With the right tools, valuable information can be pulled quickly out of the mixed-quality content on the web. There are already many tutorials and examples for crawling Baidu Tieba, but Tieba's page structure changes quickly, and every crawler has a different purpose and targets different content, hence this tool.
Objective
1. Crawl 1000 posts
2. Determine whether each post is advertising or spam
3. Analyze the sentiment of the text
4. Generate a word cloud
1. Analysis
1.1 First check how Tieba paginates: as expected, each page lists 50 posts.
1.2 The post content is also regular: each post summary sits inside a div with the class threadlist_abs threadlist_abs_onlyline.
1.3 To save time, use Baidu AI's content review API to judge the content and Baidu AI's sentiment analysis API to score the emotion.
1.4 The word cloud can be generated with jieba word segmentation and wordcloud. That was the original plan, but it later turned out that there are ready-made tools on the web.
2. Crawling process
2.1 The first small problem is to have the tool work out the pages by itself: each page holds 50 posts, so when I enter 1000 posts it should know which pages to crawl. A simple calculation is enough.
2.2 The crawling code: after each page is crawled, it calls content review and sentiment analysis, then writes the result to a file.
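The calculation can be sketched as follows. This is a minimal illustration rather than the original formula: it assumes 50 posts per page and that the pn URL parameter is the offset of the first post on a page, as the crawl URL in 2.2 suggests. The names POSTS_PER_PAGE and page_offsets are made up for this example.

```python
import math

POSTS_PER_PAGE = 50  # assumption taken from the observation in 1.1


def page_offsets(post_count, per_page=POSTS_PER_PAGE):
    """Return the pn offsets needed to cover post_count posts."""
    pages = math.ceil(post_count / per_page)
    return [page * per_page for page in range(pages)]


offsets = page_offsets(1000)
print(len(offsets), offsets[0], offsets[-1])  # 20 0 950
```

Rounding up means a request for, say, 120 posts still fetches the third page instead of silently dropping the last 20 posts.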
import time

import requests
from bs4 import BeautifulSoup


def gettbtz(tbname, tznum):
    """Crawl the first tznum posts (an integer multiple of 50) of the forum tbname."""
    tznum = int(tznum)
    emotions = 0
    # Tieba shows 50 posts per page, so the pn offset advances in steps of 50.
    for n in range(0, tznum, 50):
        print("Crawling posts starting at offset " + str(n))
        url = "http://tieba.baidu.com/f?kw=" + tbname + "&ie=utf-8&pn=" + str(n)
        soup = BeautifulSoup(requests.get(url).text, 'lxml')  # crawl action
        posts = soup.find_all('div', class_='threadlist_abs threadlist_abs_onlyline')
        for post in posts:
            # Check compliance first, then add the positive-sentiment score and write the file.
            # The review API reports its conclusion in Chinese: "合规" means compliant.
            if BDAITEXT(post.text) == "合规":
                print("Compliant post found, writing file: " + post.text)
                with open("resaults.txt", "a+", encoding='utf-8') as f:
                    # str() is intentional, to avoid text encoding errors later.
                    f.write(str(post.text))
                try:
                    emotions = emotions + BDAIemotion(post.text)
                    print("Current cumulative emotional index: " + str(emotions))
                except Exception:
                    print("Sentiment analysis error, skipping")
            else:
                print("Post not compliant, skipping")
        time.sleep(10)  # gentlemen's agreement: pause 10 seconds between pages
3. Calling the Baidu AI API
3.1 To obtain the Baidu AK, first register a developer account on the Baidu AI open platform, then create an application and obtain its application ID and keys (AK and SK). The keys are then exchanged for the token used in the API calls, like this:
import requests

# client_id is the AK obtained from the official website; client_secret is the SK obtained from the official website
host = 'https://aip.baidubce.com/oauth/2.0/token?grant_type=client_credentials&client_id=[AK]&client_secret=[SK]'
response = requests.get(host)
if response:
    print(response.json())
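To use the token in later calls, it has to be read out of the response. A minimal sketch, assuming the token endpoint returns JSON with an access_token field on success. The names build_token_url and fetch_access_token are illustrative, and the AK and SK values are placeholders you supply yourself.

```python
from urllib.parse import urlencode

TOKEN_ENDPOINT = "https://aip.baidubce.com/oauth/2.0/token"


def build_token_url(ak, sk):
    """Build the token request URL from the application's AK and SK."""
    query = urlencode({
        "grant_type": "client_credentials",
        "client_id": ak,
        "client_secret": sk,
    })
    return TOKEN_ENDPOINT + "?" + query


def fetch_access_token(ak, sk):
    """Request a token; return None if the response carries no access_token."""
    import requests  # imported here so build_token_url works without the dependency

    response = requests.get(build_token_url(ak, sk), timeout=10)
    return response.json().get("access_token")
```

Building the query with urlencode also escapes any special characters in the keys, which string concatenation does not.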
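3.2 Content review API call
import json

import requests


def BDAITEXT(text):
    """Baidu AI text review: return the API's conclusion (compliant or non-compliant)."""
    # BDAItexturl is a module-level variable that is not shown in this article.
    content = {"text": text}
    r = requests.post(BDAItexturl, content).text
    if r:
        rback = json.loads(r)
        return rback["conclusion"]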
3.3 Sentiment analysis API call
import json

import requests


def BDAIemotion(text):
    """Baidu AI sentiment analysis: return the positive probability as a number."""
    # BDAIemotionurl is a module-level variable that is not shown in this article.
    content = json.dumps({"text": text})
    r = requests.post(BDAIemotionurl, content).text
    if r:
        rback = json.loads(r)
        return rback['items'][0]['positive_prob']
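Because the function above fails with a KeyError when the API reports a problem, it can help to separate response parsing from the request. The sketch below assumes the response layout used above (items[0]['positive_prob']) and that failed calls carry an error_code field. parse_positive_prob is a hypothetical helper, not part of the original code.

```python
import json


def parse_positive_prob(raw):
    """Return positive_prob from a raw JSON response, or None if unavailable."""
    try:
        payload = json.loads(raw)
    except (TypeError, ValueError):
        # Empty body, non-string input, or invalid JSON.
        return None
    if not isinstance(payload, dict) or "error_code" in payload:
        return None
    items = payload.get("items") or []
    if not items:
        return None
    return items[0].get("positive_prob")
```

Returning None instead of raising lets the caller skip a post without a bare try/except around the whole call.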
4. Word Cloud Generation
There are many online tools that take a large chunk of text, filter it and split it into words as needed, and then let you set colors and styles to generate a word cloud.
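For readers who prefer to stay in Python, the jieba and wordcloud route mentioned in 1.4 could look roughly like this. It is a sketch, not what was used for this article: top_words and render_cloud are illustrative names, the stop-word set and font path are placeholders, and rendering Chinese text requires a font file that contains Chinese glyphs.

```python
from collections import Counter


def top_words(tokens, stopwords=frozenset(), min_len=2, limit=100):
    """Count tokens, dropping stop words and very short tokens."""
    kept = [t.strip() for t in tokens]
    kept = [t for t in kept if len(t) >= min_len and t not in stopwords]
    return dict(Counter(kept).most_common(limit))


def render_cloud(text_path, image_path, font_path):
    """Segment the crawled text with jieba and save a word cloud image."""
    import jieba  # third-party: pip install jieba
    from wordcloud import WordCloud  # third-party: pip install wordcloud

    with open(text_path, encoding="utf-8") as f:
        tokens = jieba.lcut(f.read())
    cloud = WordCloud(font_path=font_path, width=800, height=600,
                      background_color="white")
    cloud.generate_from_frequencies(top_words(tokens))
    cloud.to_file(image_path)
```

Filtering single-character tokens removes most particles and punctuation before counting, which keeps the cloud focused on content words.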
5. Information Analysis
Looking at the word cloud, the results speak for themselves: early preparation, rich experience, professional courses, mathematics, politics, choice of college...
From the sentiment point of view, most of the emotional indexes are positive, which suggests that a positive attitude is still needed to get through the postgraduate entrance examination.
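A running total like the one printed by the crawler only becomes comparable once it is divided by the number of scored posts. A small sketch of that summary, assuming each score is the positive_prob value (between 0 and 1) returned by the sentiment API and treating 0.5 as the cut-off for "positive", which is an assumption of this example. summarize_scores is a hypothetical helper.

```python
def summarize_scores(scores, threshold=0.5):
    """Return (mean score, share of scores above threshold)."""
    if not scores:
        return None, None
    mean = sum(scores) / len(scores)
    positive_share = sum(1 for s in scores if s > threshold) / len(scores)
    return mean, positive_share


# Made-up scores, for illustration only.
mean, share = summarize_scores([0.9, 0.7, 0.2, 0.6])
print(f"mean={mean:.2f}, positive share={share:.0%}")
```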
Run Screenshot
To be improved
1. It is too slow and should be multi-threaded
2. It crawls only posts, not comments
3. The sentiment analysis makes many mistakes
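For the first item, pages could be fetched concurrently with a thread pool. This is a rough sketch rather than a tested change to the crawler: fetch_page stands in for whatever function downloads and parses one page, crawl_concurrently is an illustrative name, and a small max_workers keeps the load on the site modest.

```python
from concurrent.futures import ThreadPoolExecutor


def crawl_concurrently(fetch_page, offsets, max_workers=4):
    """Call fetch_page(offset) for every offset in parallel, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return list(pool.map(fetch_page, offsets))


def fake_fetch(offset):
    # Placeholder for a real download: returns a label instead of page content.
    return "page starting at post " + str(offset)


results = crawl_concurrently(fake_fetch, range(0, 200, 50))
print(results)
```

Collecting the results first and writing the file from the main thread avoids several threads appending to the same file at once.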