Python crawler: scraping the hot comments on Wu moufan's Weibo post

In July 2021, the biggest "melon" (hot gossip) was Wu moufan's.

Juicy gossip in the entertainment industry is nothing new, but Wu moufan's melon was especially big!

Here's the story: a college freshman named "Du mouzhu" posted on Weibo that she suffered cold violence (emotional neglect) during her relationship with Wu moufan.

She also alleged that Wu moufan "selected concubines" and "lured" underage girls.

Subsequently, a number of girls who claimed to have been involved with Wu moufan posted chat records to back up the allegations.

Is that really the case? Let's see what 1,000,000+ netizens have to say.

Determine the target

Our target is the netizens' comments under the following Weibo post by Wu moufan.

Let's see what they say.

Requirement analysis

The data we want to obtain: user ID, author name, author's motto, posting time, and post content.

First, press F12 to open the browser's developer tools:

Find our target URL:

https://m.weibo.cn/comments/hotflow?id=4660583661568436&mid=4660583661568436&max_id_type=0

There are also anti-crawling request headers to take care of.

Opening the link in a browser, we find that the response is a dataset in standard JSON format, and all the data we need is inside it.

So the first step is to fetch this JSON dataset.
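For orientation, here is a rough sketch of the response shape. The field names are inferred from the parsing code later in this article; the real response contains many more fields:

    # Rough, non-exhaustive sketch of the JSON response (field names
    # inferred from the parsing code below):
    # {
    #     "ok": 1,
    #     "data": {
    #         "data": [                            # the list of comments
    #             {
    #                 "created_at": "Mon Jul 19 08:06:52 +0800 2021",
    #                 "text": "...",               # comment body
    #                 "user": {
    #                     "id": 6414472816,
    #                     "screen_name": "...",
    #                     "description": "..."
    #                 }
    #             }
    #         ],
    #         "max_id": 27509545759071812          # paging cursor (used later)
    #     }
    # }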

Send request

The goal is clear. Next, the code:

    import requests

    url = 'https://m.weibo.cn/comments/hotflow?id=4660583661568436&mid=4660583661568436&max_id_type=0'
    print('current url:', url)

    headers = {
        'cookie': 'SUB=_2A25NyTOqDeRhGeVG7lAZ9S_PwjiIHXVvMl3irDV6PUJbktB-LVDmkW1NT7e8qozwK1pqWVKX_PsKk5dhdCyPXwW1; SUBP=0033WrSXqPxfM725Ws9jqgMF55529P9D9WFGibRIp_iSfMUfmcr5kb295NHD95Q01h-E1h-pe0.XWs4DqcjLi--fi-2Xi-2Ni--fi-z7iKysi--Ri-8si-zXi--fi-88i-zce7tt; _T_WM=98961943286; MLOGIN=1; WEIBOCN_FROM=1110006030; XSRF-TOKEN=70a1e0; M_WEIBOCN_PARAMS=oid%3D4648381753067388%26luicode%3D20000061%26lfid%3D4648381753067388%26uicode%3D20000061%26fid%3D4648381753067388',
        'referer': 'https://m.weibo.cn/detail/4660583661568436',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4573.0 Safari/537.36'
    }

    resp = requests.get(url, headers=headers).json()
    wb_info = resp['data']['data']  # the list of comments
    print(wb_info)

Parse page

The previous step successfully simulated a browser request and obtained the data.

The next step is to extract our target fields:

    import re

    for item in wb_info:
        user_id = item['user']['id']  # user id

        author = item['user']['screen_name']  # author name

        auth_sign = item['user']['description']  # author's motto

        # created_at looks like 'Mon Jul 19 08:06:52 +0800 2021';
        # keep only the month, day and time parts
        time_parts = str(item['created_at']).split(' ')[1:4]
        rls_time = '-'.join(time_parts)  # posting time

        # keep only the Chinese characters of the comment (drops emoji, links, etc.)
        text = ''.join(re.findall('[\u4e00-\u9fa5]', item['text']))  # post content

        print(user_id, author, auth_sign, rls_time, text)
        
        '''
        6414472816 Three seasons Wang Jing Brush your microblog three nights and two ends Jul-19-08:06:52 rolling
        6556281551 Jin Taiyan disciplined herself Don't talk to me if you don't take a big reward and belittle it every day Jul-19-08:09:16 The copywriter didn't think about it all night, was he surprised
        3320924712 Lionright The people I like are stars and light Jul-19-08:07:06 Vomited
        1966108801 Porridge nineteen Melody is homesick Jul-19-08:09:13 Hurry, hurry, he's hurry
        6300343274 Little mirror Life is one after another  The pursuit of good things Jul-19-08:08:53 Get out of China
        5776571462 Peeled potato stew Pepsi Cola Everything Fanta Mood Sprite Seven up a week Jul-19-08:07:21 Do you believe it? I don't believe it. I'm too happy
        5041911786 Liekkas_w I will always remember the summer wind Jul-19-08:06:50 Go away, you
        2316414853 Are you a fool Justice will prevail Jul-19-08:17:02 All the good-looking ones are in his bed, and all the ugly ones are in control
        5737714811 I like Crispy Shrimp cakes D 🌫 Jul-19-08:11:56 We believe in your tears
        5366720146 The right person at first sight  Jul-19-08:13:52 Even if the criticism is serious, you praise those who believe you and make them vomit in the front row
        5599841983 Did you breathe today National second class football player Jul-19-08:18:08 Get out of China
        3284861767 Watch the Milky Way catch stars I don't sing loud love songs Jul-19-08:07:53 We are
        6446391321 What can you do to me-Kris- My attitude towards all of you depends on your attitude towards Wu Yifan. Jul-19-08:06:56 We've been thinking
        5225883987 Nothing kym What are you looking at? I'm just Susie. Jul-19-08:06:55 trust you
        '''

Data obtained successfully!

Great. Now let's analyze how to turn pages, starting with the URL of each page:

https://m.weibo.cn/comments/hotflow?id=4660583661568436&mid=4660583661568436&max_id_type=0
https://m.weibo.cn/comments/hotflow?id=4660583661568436&mid=4660583661568436&max_id=27509545759071812&max_id_type=0
https://m.weibo.cn/comments/hotflow?id=4660583661568436&mid=4660583661568436&max_id=11396065415043588&max_id_type=0
https://m.weibo.cn/comments/hotflow?id=4660583661568436&mid=4660583661568436&max_id=5160872412930597&max_id_type=0

I believe you can see it at a glance: from the second page onward, the URL carries an additional max_id parameter,

and this max_id changes from page to page.

So the question becomes: how do we get max_id?

The response for the first page contains the max_id for the second page,

the response for the second page contains the max_id for the third page,

and so on, until all the data has been fetched. A minimal sketch of this paging loop is shown below.
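This is only a sketch: it assumes each response carries the next cursor in data.max_id and that a max_id of 0 marks the last page (both inferred from the URLs above, not confirmed); headers is the dict defined in the request section.

    # Paging sketch: each response's data.max_id is assumed to be the
    # cursor for the next page, with 0 meaning "no more pages".
    import requests

    base_url = ('https://m.weibo.cn/comments/hotflow'
                '?id=4660583661568436&mid=4660583661568436&max_id_type=0')

    max_id = None
    for page in range(50):  # fetch up to 50 pages
        url = base_url if max_id is None else f'{base_url}&max_id={max_id}'
        resp = requests.get(url, headers=headers).json()
        for item in resp['data']['data']:
            pass  # parse each comment as in the "Parse page" section
        max_id = resp['data']['max_id']  # cursor for the next page
        if max_id == 0:                  # assumed end-of-data marker
            break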

Then use openpyxl to save the content to an Excel file, as shown below.

    import openpyxl as op

    wb = op.Workbook()             # the workbook
    ws = wb.create_sheet(index=0)  # the worksheet we write into

    # header row
    ws.cell(row=1, column=1, value='user id')
    ws.cell(row=1, column=2, value='Author name')
    ws.cell(row=1, column=3, value="Author's motto")
    ws.cell(row=1, column=4, value='Posting time')
    ws.cell(row=1, column=5, value='Post content')

    count = 2  # data rows start at row 2; increment once per comment
    ws.cell(row=count, column=1, value=user_id)
    ws.cell(row=count, column=2, value=author)
    ws.cell(row=count, column=3, value=auth_sign)
    ws.cell(row=count, column=4, value=rls_time)
    ws.cell(row=count, column=5, value=text)

    wb.save('666.xlsx')
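Note that to save every comment rather than a single row, the cell writes have to sit inside the parsing loop, with the row counter advancing after each comment. A sketch, reusing the variables from the sections above:

    # One Excel row per comment: advance the row counter inside the loop
    count = 2
    for item in wb_info:
        user_id = item['user']['id']
        author = item['user']['screen_name']
        auth_sign = item['user']['description']
        rls_time = '-'.join(str(item['created_at']).split(' ')[1:4])
        text = ''.join(re.findall('[\u4e00-\u9fa5]', item['text']))

        row = [user_id, author, auth_sign, rls_time, text]
        for col, value in enumerate(row, start=1):
            ws.cell(row=count, column=col, value=value)
        count += 1  # the next comment goes on the next row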

Let's fetch 50 pages of data first, as practice.

The resulting Excel file contains the same fields as the sample output shown earlier.

Visual display

    import jieba
    import pandas as pd
    from stylecloud import gen_stylecloud

    rcv_data = pd.read_excel('./666.xlsx')
    exist_col = rcv_data.dropna()  # drop empty rows
    c_title = exist_col['Post content'].tolist()

    # word cloud of the comments
    wordlist = jieba.cut(''.join(c_title))
    result = ' '.join(wordlist)
    pic = 'img1.jpg'
    gen_stylecloud(text=result,
                   icon_name='fab fa-apple',
                   font_path='msyh.ttc',
                   background_color='white',
                   output_name=pic,
                   # Chinese stopwords (the scraped text contains only
                   # Chinese characters, so the stopwords must be Chinese too)
                   custom_stopwords=['你', '我', '的', '是', '在', '吧', '信',
                                     '对', '也', '都', '不', '吗', '就', '这',
                                     '还', '说', '一', '总', '我们'])
    print('Drawing succeeded!')

Keywords: Python crawler

Added by Adam on Sat, 15 Jan 2022 11:58:09 +0200