Encyclopedia person crawler - attribute analysis


Entry analysis

Under the person category of the encyclopedia there are tags for people from many different industries, and the description sections of their entries differ accordingly, so the common fields have to be identified before parsing.

First, exclude the tags "internet celebrities, actors, e-sports figures, behind-the-scenes film and television figures, music figures, idol groups, virtual characters and sports figures".

Then extract the characteristic section names for each of the remaining person tags:

  • Political figures: biography, appointments and removals, events, main contributions, honors, evaluation
  • Business figures: experience, personal life, main contributions, honors, evaluation
  • Historical figures: life story, personal works, main achievements, anecdotes and allusions, historical records, artistic image, family members, evaluation
  • Cultural figures: experience, personal life, personal works, main contributions, award records, evaluation
  • Scientific figures: experience, personal life, research directions, main achievements, honors, social posts, influence, evaluation
  • Educational figures: experience, research directions, main achievements, award records, social service, evaluation
  • Medical figures: experience, research directions, translated works, research achievements, awards, academic posts, areas of expertise, visiting hours
  • Other figures: experience, main contributions, honors, evaluation
  • Other figures - writers: experience, personal life, published works, published books, evaluation, honors
  • Other figures - researchers: introduction, projects undertaken, academic achievements, awards and honors, viewpoints, published works

After observation and analysis, the following information is retained. Combined with the basic-information box of each entry, the contents to parse therefore include (a rough grouping in code follows the list):

  • Chinese name, foreign name, alias
  • Nationality, ethnicity, native place
  • Date of birth, date of death
  • Alma mater, occupation, main achievements
  • Gender, position, degree
  • Experience, personal life, research direction, achievements, awards, honors, posts, influence, evaluation
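
As a rough sketch, these retained fields fall into two groups: values read directly from the entry's basic-information box, and values parsed out of named page sections. The constant names below are illustrative placeholders and are not used by the crawler code that follows.

# Illustrative grouping of the retained fields; the names are placeholders only.
BASIC_INFO_FIELDS = [
    'Chinese name', 'foreign name', 'alias',
    'nationality', 'ethnicity', 'native place',
    'date of birth', 'date of death',
    'alma mater', 'occupation', 'main achievements',
    'gender', 'position', 'degree',
]
SECTION_FIELDS = [
    'experience', 'personal life', 'research direction', 'achievements',
    'awards', 'honors', 'posts', 'influence', 'evaluation',
]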

Page parsing

Because the page layouts vary from entry to entry, parsing them intelligently is the key to collecting encyclopedia data, so I built a generic field extractor.

import re
import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36'
}

# Strip citation markers such as [1] and [2-3] from a paragraph
def filter_label(para):
    for label_num in re.findall(r'\[\d+\]|\[\d+-\d+\]', para):
        para = para.replace(label_num, '')
    return para
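
# Usage sketch (hypothetical input): the bracketed reference markers are removed,
# surrounding whitespace is left as-is:
#   filter_label('Born in 1962 [1] in Beijing. [2-3]')  ->  'Born in 1962  in Beijing. '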

def get_item(url):
    doc = requests.get(url, headers=headers).text
    e = etree.HTML(doc)
    item = {}
    item['baike_url'] = url
    # Cover picture
    item['baike_pic'] = ''.join(e.xpath('//div[@class="summary-pic"]/a/img/@src'))
    #print("cover picture:", item['pic '])

    # Personal profile
    item['baike_desc'] = ''
    for desc in e.xpath('//div[@class="lemma-summary"]/div[@class="para"]'):
        para = ''.join(desc.xpath('.//text()')).replace('\n','')
        para = filter_label(para)
        item['baike_desc'] += para + '\n'
    #print("profile:", item['baike_desc'])

    # basic information (the entry's info box)
    item['baike_basicInfo'] = {}
    dt = e.xpath('//div[@class="basic-info J-basic-info cmn-clearfix"]/dl/dt')
    dd = e.xpath('//div[@class="basic-info J-basic-info cmn-clearfix"]/dl/dd')
    for t,d in zip(dt,dd):
        key = ''.join(''.join(t.xpath('./text()')).split())
        value = ''.join(''.join(d.xpath('.//text()')).split())
        item['baike_basicInfo'][key] = value
    #print("basic information:", item['baike_basicInfo'])

    # Generic section extractor: pull the text of the section whose anchor name is `label`.
    # `label` must match the anchor name exactly as it appears in the page source
    # (on Baidu Baike these are the Chinese section titles).
    def parse_label(label):
        # The closing delimiter in the pattern is an assumption about the page markup
        # (the next section's anchor list); adjust it if the HTML differs.
        re_rule = f'{label}" class="lemma-anchor " >\n(.*?)<div class="anchor-list'
        experiences = re.findall(re_rule, doc, re.S)
        text = ''
        if experiences:
            ex = etree.HTML(''.join(experiences))
            text = filter_label(''.join(ex.xpath('//div[@class="para"]//text()')))
        return text

    # experience
    exper = parse_label('experience')
    if exper:
        exper = exper.split(';')

    # direction
    field = parse_label('direction')
    if not field:
        field = parse_label('field')

    # social / part-time posts
    social = parse_label('part-time job')
    if social:
        social = social.split(';')

    # honor
    awards = parse_label('honor')
    if not awards:
        awards = parse_label('Award')

    item['baike_experience'] = exper
    item['baike_awards'] = awards
    item['baike_social'] = social
    item['baike_field'] = field
    item['baike_results'] = parse_label('achievements')
    item['baike_life'] = parse_label('life')
    item['baike_affect'] = parse_label('influence')
    item['baike_eval'] = parse_label('evaluate')
    return item

Run test

If the people you collect have other recurring sections, such as special topics, popular-science work, or published works, you can add fields with parse_label to parse them as well; a sketch follows.
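
For example, two extra fields could be wired in at the end of get_item(), just before return item. The anchor strings and dictionary keys below are placeholders, not names from the original code; the anchor strings must match the section names exactly as they appear on the target pages.

    # Hypothetical extra fields; replace the anchor strings with the actual section names.
    item['baike_works'] = parse_label('works')
    item['baike_science'] = parse_label('popular science')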

print(get_item('https://baike.baidu.com/item/%E7%8E%8B%E5%85%83%E5%8D%93'))
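
To inspect or store the result, one option (not part of the original post) is to dump the returned dictionary as JSON; ensure_ascii=False keeps non-ASCII text readable:

import json

item = get_item('https://baike.baidu.com/item/%E7%8E%8B%E5%85%83%E5%8D%93')
print(json.dumps(item, ensure_ascii=False, indent=2))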

