Use Python to sort out the character relationships in Dream of Red Mansions

Today we use Python to sort out the character relationships in Dream of Red Mansions.
Don't ask me why it is Dream of Red Mansions rather than Romance of the Three Kingdoms, Water Margin, or Journey to the West. I consider Red Mansions the undisputed peak of classical Chinese fiction, and I accept no rebuttal! It is also one of the few novels I have read over and over, and the one I feel most deeply about.
Okay, enough sentiment. Let's get to the substance.

Data preparation

A TXT file of Dream of Red Mansions
A list of the Twelve Beauties of Jinling plus Jia Baoyu
The list of characters is as follows:

Baoyu nr
Daiyu nr
Baochai nr
Xiangyun nr
Sister Feng nr
Li Wan nr
Yuanchun nr
Yingchun nr
Tanchun nr
Xichun nr
Miaoyu nr
Qiaojie nr
Qin's nr

This list also serves as the custom dictionary for word segmentation. The nr after each name is the part-of-speech tag for a person's name.
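As a rough illustration of how such a file is laid out, the sketch below parses dictionary lines of the form "word tag" with plain Python. The function name parse_userdict and the sample lines are assumptions made for this example, and it assumes each name contains no spaces, as is the case for names written in Chinese characters.

```python
def parse_userdict(lines):
    """Return (word, tag) pairs from 'word tag' dictionary lines."""
    entries = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        word = parts[0]
        # The tag is optional; a missing tag is recorded as None.
        tag = parts[1] if len(parts) > 1 else None
        entries.append((word, tag))
    return entries


# Hypothetical sample lines, including stray whitespace and a blank line.
sample = ["Baoyu nr", " Daiyu nr", "", "Baochai nr"]
entries = parse_userdict(sample)
names = [word for word, _ in entries]
```

Splitting on whitespace rather than on a single space makes the parser tolerant of stray leading spaces in the file.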

Data processing

Read the data and load the dictionary

    import jieba
    import jieba.posseg as pseg
    import pandas as pd

    with open("The Dream of Red Mansion.txt", encoding='gb18030') as f:
        honglou = f.readlines()
    jieba.load_userdict("renwu_forcut")
    # header=None: the dictionary file has no header row
    renwu_data = pd.read_csv("renwu_forcut", header=None)
    mylist = [k[0].split(" ")[0] for k in renwu_data.values.tolist()]

This reads Dream of Red Mansions into the variable honglou, one list entry per line, and loads our custom dictionary into jieba through load_userdict. The variable mylist holds just the names from that dictionary.

Word Segmentation and Extraction of Text

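    tmpNames = []       # names found in each paragraph
    names = {}          # name -> total number of appearances
    relationships = {}  # name -> {other name: co-occurrence count}
    for h in honglou:
        # str.replace returns a new string, so assign the result back
        h = h.replace("Jia Fei", "Yuanchun")
        h = h.replace("Li Gongcai", "Li Wan")
        poss = pseg.cut(h)
        tmpNames.append([])
        for w in poss:
            # keep only two-character person names that are in our list
            if w.flag != 'nr' or len(w.word) != 2 or w.word not in mylist:
                continue
            tmpNames[-1].append(w.word)
            if names.get(w.word) is None:
                names[w.word] = 0
                relationships[w.word] = {}
            names[w.word] += 1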

First, because the text mixes "Jia Fei" with "Yuanchun" and "Li Gongcai" with "Li Wan", we replace the aliases directly so that each character has a single name.
Next, we use the pseg tool provided by jieba for word segmentation; it returns each word together with its part-of-speech tag.
Then we filter: only words tagged as person names that also appear in our custom list are kept.
Each time a character appears, that character's count is incremented by one; the count later determines the size of the character's node in the graph.
The matching names found in each paragraph are saved in the temporary variable tmpNames.
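The alias replacement and per-paragraph counting described above can be sketched without jieba. In this minimal example the token lists are hypothetical stand-ins for the segmenter's output, and the helper names normalize and count_names are assumptions made for illustration, not part of the original code.

```python
from collections import Counter

# Alias -> canonical name, as described in the text.
ALIASES = {"Jia Fei": "Yuanchun", "Li Gongcai": "Li Wan"}


def normalize(paragraph, aliases=ALIASES):
    # str.replace returns a new string, so the result must be reassigned.
    for alias, canonical in aliases.items():
        paragraph = paragraph.replace(alias, canonical)
    return paragraph


def count_names(tokenized_paragraphs, known_names):
    """Count appearances and record which known names occur in each paragraph."""
    counts = Counter()
    per_paragraph = []
    for tokens in tokenized_paragraphs:
        found = [t for t in tokens if t in known_names]
        per_paragraph.append(found)
        counts.update(found)
    return counts, per_paragraph


# Hypothetical pre-tokenized paragraphs.
paragraphs = [["Baoyu", "saw", "Daiyu"], ["Daiyu", "wept"], ["rain"]]
counts, per_paragraph = count_names(paragraphs, {"Baoyu", "Daiyu"})
```

A paragraph with no known names still contributes an empty list, which keeps per_paragraph aligned with the input.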

Handling Personal Relationships

    for name in tmpNames:
        for name1 in name:
            for name2 in name:
                if name1 == name2:
                    continue
                if relationships[name1].get(name2) is None:
                    relationships[name1][name2] = 1
                else:
                    relationships[name1][name2] += 1

We treat characters who appear in the same paragraph as closely related. Each time two characters appear together, the weight of their relationship increases by 1.
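The same co-occurrence counting can be written more compactly with the standard library. This is a sketch under the assumption that the input is a list of name lists, one per paragraph; the function name build_relationships is hypothetical.

```python
from collections import defaultdict
from itertools import permutations


def build_relationships(per_paragraph_names):
    """Count ordered co-occurrences of different names within each paragraph."""
    relationships = defaultdict(lambda: defaultdict(int))
    for names in per_paragraph_names:
        # permutations yields every ordered pair of positions in the list.
        for a, b in permutations(names, 2):
            if a != b:  # a name repeated in a paragraph is not paired with itself
                relationships[a][b] += 1
    return relationships


# Hypothetical input: the names found in three paragraphs.
paragraphs = [["Baoyu", "Daiyu"], ["Baoyu", "Daiyu", "Baochai"], ["Baoyu"]]
rel = build_relationships(paragraphs)
```

Because ordered pairs are counted, every relationship is stored in both directions with the same weight, which matches the nested loops above.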

Save to file

    with open("relationship.csv", "w", encoding='utf-8') as f:
        f.write("Source,Target,Weight\n")
        for name, edges in relationships.items():
            for v, w in edges.items():
                f.write(name + "," + v + "," + str(w) + "\n")

    with open("NameNode.csv", "w", encoding='utf-8') as f:
        f.write("ID,Label,Weight\n")
        for name, times in names.items():
            f.write(name + "," + name + "," + str(times) + "\n")

File 1 (relationship.csv): the character relationship table, containing the source character, the target character, and the number of times the two appear together.
File 2 (NameNode.csv): the character weight table, containing each character's total number of appearances; the more often a character appears, the larger that character's node.
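As an alternative to concatenating strings by hand, the edge table could be written with Python's csv module, which takes care of quoting if a name ever contains a comma. This is only a sketch: write_edges is a hypothetical helper, and an in-memory buffer stands in for the real file.

```python
import csv
import io


def write_edges(relationships, out):
    """Write a Source,Target,Weight table to a file-like object."""
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["Source", "Target", "Weight"])
    for source, edges in relationships.items():
        for target, weight in edges.items():
            writer.writerow([source, target, weight])


# Hypothetical relationship data written to an in-memory buffer.
buffer = io.StringIO()
write_edges({"Baoyu": {"Daiyu": 2}, "Daiyu": {"Baoyu": 2}}, buffer)
csv_text = buffer.getvalue()

# Reading the table back; DictReader returns every field as a string.
rows = list(csv.DictReader(io.StringIO(csv_text)))
```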

Making Relational Charts

Drawing with pyecharts

    import pandas as pd
    from pyecharts import options as opts
    from pyecharts.charts import Graph

    def deal_graph():
        relationship_data = pd.read_csv('relationship.csv')
        namenode_data = pd.read_csv('NameNode.csv')
        relationship_data_list = relationship_data.values.tolist()
        namenode_data_list = namenode_data.values.tolist()

        nodes = []
        for node in namenode_data_list:
            # Baoyu dominates the counts, so shrink his weight first
            if node[0] == "Baoyu":
                node[2] = node[2]/3
            nodes.append({"name": node[0], "symbolSize": node[2]/30})
        links = []
        for link in relationship_data_list:
            links.append({"source": link[0], "target": link[1], "value": link[2]})

        g = (
            Graph()
            .add("", nodes, links, repulsion=8000)
            .set_global_opts(title_opts=opts.TitleOpts(title="Character Relations in the Red Chamber"))
        )
        return g

First, we read the two files as lists.
Because "Baoyu" appears far more often than anyone else, scaling every node uniformly would make the other characters' nodes too small and the graph unattractive, so we shrink Baoyu's count before scaling.
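The scaling step can be isolated into a small function to make the arithmetic explicit. The sketch below reuses the divisors 3 and 30 from the code above; the function name node_sizes, its parameter names, and the appearance counts are made up for illustration.

```python
def node_sizes(counts, dominant="Baoyu", damping=3, scale=30):
    """Map appearance counts to node sizes, shrinking the dominant character first."""
    sizes = {}
    for name, count in counts.items():
        if name == dominant:
            count = count / damping  # shrink the outlier before scaling
        sizes[name] = count / scale
    return sizes


# Hypothetical counts: 3600 / 3 / 30 and 900 / 30.
sizes = node_sizes({"Baoyu": 3600, "Daiyu": 900})
```

With these made-up counts Baoyu's node is only slightly larger than Daiyu's, instead of four times as large.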

The final diagram

All code has been uploaded to Github:
https://github.com/zhouwei713/data_analysis/tree/master/honglou
Finally, I have prepared a more comprehensive dictionary of the characters in Red Mansions, which can be found in the code repository as "renwu_total". Interested readers can try building a relationship graph of the full cast.


Added by Assorro on Tue, 13 Aug 2019 13:22:18 +0300
