Entry analysis
In the encyclopedia, entries for people are tagged by field, and each tag comes with its own set of description sections, so the common fields need to be identified before parsing.

First, exclude the tags for internet celebrities, actors, e-sports figures, behind-the-scenes film and television figures, music figures, celebrity groups, virtual characters and sports figures.
Then extract feature words from the section names under the remaining tags:
- Political figures: biography, appointment and removal, events, main contributions, honors, evaluation of figures
- Corporate figures: character experience, personal life, main contributions, honors, character evaluation
- Historical figures: life, personal works, main achievements, anecdotes and allusions, historical records, artistic images, family members, character evaluation
- Cultural figures: character experience, personal life, personal works, main contributions, award records, character evaluation
- Scientific figures: character experience, personal life, research direction, main achievements, honors, social posts, character influence, character evaluation
- Educational figures: character experience, research direction, main achievements, award records, social service, character evaluation
- Medical figures: character experience, research direction, works and translations, scientific research achievements, awards, academic posts, areas of expertise, consultation hours
- Other figures: character experience, main contributions, honors, character evaluation
- Other figures (writers): character experience, personal life, published works, published books, character evaluation, honors
- Other figures (scientific research): introduction, projects undertaken, academic achievements, awards and honors, views, published works
After comparing these lists, I kept the fields below. Combined with the basic information box on each entry page, the content to parse is:
- Chinese name, foreign name and alias
- Nationality, ethnicity and native place
- Date of birth, date of death
- Alma mater, occupation and main achievements
- Gender, position and degree
- Character experience, personal life, research direction, achievements, awards, honors, posts, influence and evaluation
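One way to arrive at a shortlist like this is to count how many tags contain a section matching each candidate keyword. The sketch below is only an illustration: the `TAG_SECTIONS` data is abbreviated from the tag lists above, and the names `KEYWORDS` and `keyword_coverage` are hypothetical, not part of the original workflow.

```python
from collections import Counter

# Section names per tag, abbreviated from the tag lists above (illustrative subset).
TAG_SECTIONS = {
    "Political figures": ["biography", "events", "main contributions",
                          "honors", "evaluation of figures"],
    "Corporate figures": ["character experience", "personal life",
                          "main contributions", "honors", "character evaluation"],
    "Scientific figures": ["character experience", "personal life", "research direction",
                           "main achievements", "honors", "character influence",
                           "character evaluation"],
    "Educational figures": ["character experience", "research direction",
                            "main achievements", "award records", "character evaluation"],
}

# Candidate feature words (hypothetical choice for this sketch).
KEYWORDS = ["experience", "evaluation", "honors", "achievements",
            "research direction", "influence"]


def keyword_coverage(tag_sections, keywords):
    """Count, for each keyword, how many tags have a section name containing it."""
    coverage = Counter()
    for sections in tag_sections.values():
        for keyword in keywords:
            if any(keyword in section for section in sections):
                coverage[keyword] += 1
    return coverage


coverage = keyword_coverage(TAG_SECTIONS, KEYWORDS)
for keyword, count in coverage.most_common():
    print(f"{keyword}: {count}/{len(TAG_SECTIONS)} tags")
```

Keywords that appear under most tags are good candidates for common fields, and rarer ones can still be kept as optional fields.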
Page parsing
Because page content varies a lot from entry to entry, flexible parsing is the key to collecting encyclopedia data. I built a generic field extractor.
import re

import requests
from lxml import etree

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/93.0.4577.82 Safari/537.36'
}


# Filter citation markers such as [n] and [n-n]
def filter_label(para):
    return re.sub(r'\[\d+\]|\[\d+-\d+\]', '', para)


def get_item(url):
    doc = requests.get(url, headers=headers, timeout=10).text
    e = etree.HTML(doc)
    item = {}
    item['baike_url'] = url

    # Cover picture
    item['baike_pic'] = ''.join(e.xpath('//div[@class="summary-pic"]/a/img/@src'))
    # print("cover picture:", item['baike_pic'])

    # Personal profile
    item['baike_desc'] = ''
    for desc in e.xpath('//div[@class="lemma-summary"]/div[@class="para"]'):
        para = ''.join(desc.xpath('.//text()')).replace('\n', '')
        para = filter_label(para)
        item['baike_desc'] += para + '\n'
    # print("profile:", item['baike_desc'])

    # Basic information: pair each <dt> name with its <dd> value
    item['baike_basicInfo'] = {}
    dt = e.xpath('//div[@class="basic-info J-basic-info cmn-clearfix"]/dl/dt')
    dd = e.xpath('//div[@class="basic-info J-basic-info cmn-clearfix"]/dl/dd')
    for t, d in zip(dt, dd):
        key = ''.join(''.join(t.xpath('./text()')).split())
        value = ''.join(''.join(d.xpath('.//text()')).split())
        item['baike_basicInfo'][key] = value
    # print("basic information:", item['baike_basicInfo'])

    # Universal extractor: collect the paragraphs of the section whose anchor matches label
    def parse_label(label):
        # The pattern as published ended at the lazy group "(.*?)", which always
        # captures an empty string. The lookahead to the next section anchor (or the
        # end of the page) is an assumed end delimiter; adjust it to the live markup.
        re_rule = re.escape(label) + r'" class="lemma-anchor " >\n(.*?)(?=class="lemma-anchor "|\Z)'
        experiences = re.findall(re_rule, doc, re.S)
        labels = []
        if experiences:
            exper = ''.join(experiences)
            ex = etree.HTML(exper) if exper.strip() else None
            if ex is not None:
                labels = filter_label(''.join(ex.xpath('//div[@class="para"]//text()')))
        return labels

    # Labels must match the section anchor names on the page. Baidu Baike headings are
    # in Chinese, so these English strings presumably need replacing with the Chinese
    # heading text before running against live pages.

    # Experience
    exper = parse_label('experience')
    if exper:
        exper = exper.split(';')

    # Research direction, falling back to field
    field = parse_label('direction')
    if not field:
        field = parse_label('field')

    # Part-time posts
    social = parse_label('part-time job')
    if social:
        social = social.split(';')

    # Honors, falling back to awards
    awards = parse_label('honor')
    if not awards:
        awards = parse_label('Award')

    item['baike_experience'] = exper
    item['baike_awards'] = awards
    item['baike_social'] = social
    item['baike_field'] = field
    item['baike_results'] = parse_label('achievements')
    item['baike_life'] = parse_label('life')
    item['baike_affect'] = parse_label('influence')
    item['baike_eval'] = parse_label('evaluate')
    return item
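To see the section-extraction idea without touching the network, here is a minimal sketch that runs on a toy HTML string. It uses only the standard library, and both the `SAMPLE` markup and the helper names `strip_citations` and `section_text` are assumptions made for illustration; real pages are messier and the delimiters may need adjusting.

```python
import re

# Toy markup imitating the anchor-then-paragraphs layout the extractor relies on.
# This is an assumed simplification, not a copy of a real page.
SAMPLE = (
    '<a name="experience" class="lemma-anchor " ></a>\n'
    '<div class="para">Joined the institute in 2001[1];</div>'
    '<div class="para">Promoted to professor in 2010[2-3]</div>'
    '<a name="honor" class="lemma-anchor " ></a>\n'
    '<div class="para">National award, 2015[4]</div>'
)


def strip_citations(text):
    """Remove citation markers such as [1] or [2-3]."""
    return re.sub(r'\[\d+(?:-\d+)?\]', '', text)


def section_text(html, label):
    """Return the paragraph text between the anchor for label and the next anchor."""
    pattern = (re.escape(label)
               + r'" class="lemma-anchor " >(.*?)(?=class="lemma-anchor "|\Z)')
    chunks = re.findall(pattern, html, re.S)
    paras = re.findall(r'<div class="para">(.*?)</div>', ''.join(chunks), re.S)
    return strip_citations(''.join(paras))


print(section_text(SAMPLE, 'experience').split(';'))
print(section_text(SAMPLE, 'honor'))
print(repr(section_text(SAMPLE, 'influence')))  # label not present in SAMPLE
```

The lookahead stops each capture at the next section anchor, so a missing label simply yields an empty string instead of raising an error.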
Run test
If the people you collect have other prominent sections, such as topics, popular science or works, you can add a field for each one by calling parse_label with the matching label inside get_item.
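Adding fields can be kept tidy with a small table of field names and candidate labels, tried in order until one returns text. This is a sketch under assumptions: `parse_first`, `EXTRA_FIELDS` and the field names are hypothetical, and `fake_parse` stands in for parse_label so the example runs without fetching a page.

```python
def parse_first(parse, labels):
    """Return the first non-empty result of parse(label), or '' if none match."""
    for label in labels:
        result = parse(label)
        if result:
            return result
    return ''


# Hypothetical extra fields, each with candidate section labels in priority order.
EXTRA_FIELDS = {
    'baike_works': ['works', 'published works'],
    'baike_topics': ['topics'],
}

# Stand-in for parse_label: looks the label up in a fixed dict instead of a page.
fake_sections = {'published works': 'Book A;Book B'}


def fake_parse(label):
    return fake_sections.get(label, '')


extra = {key: parse_first(fake_parse, labels) for key, labels in EXTRA_FIELDS.items()}
print(extra)
```

Inside get_item, the same loop could pass parse_label in place of `fake_parse` and store each result in item.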

print(get_item('https://baike.baidu.com/item/%E7%8E%8B%E5%85%83%E5%8D%93'))
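Printing the raw dict works, but nested fields are easier to read as indented JSON with non-ASCII text left unescaped. A minimal sketch, assuming a made-up `sample_item` in the shape get_item returns; the helper name `to_pretty_json` and all field values are hypothetical.

```python
import json

# Made-up result in the shape get_item returns; the values are placeholders.
sample_item = {
    'baike_url': 'https://baike.baidu.com/item/%E7%8E%8B%E5%85%83%E5%8D%93',
    'baike_basicInfo': {'中文名': '示例'},
    'baike_experience': ['Joined the institute in 2001', 'Promoted to professor in 2010'],
    'baike_awards': '',
}


def to_pretty_json(item):
    """Serialize an item as indented JSON, keeping Chinese characters readable."""
    return json.dumps(item, ensure_ascii=False, indent=2)


print(to_pretty_json(sample_item))
```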
