Question 3 of Problem C of the 2018 "Teddy Cup" Data Mining Challenge

1. Problem background

Topic: analysis of tourists' impressions of their destinations

Improving the reputation of scenic spots, hotels, and other tourism destinations is work that local cultural and tourism authorities and tourism-related enterprises pay close attention to, since it bears on important matters such as stabilizing the source of tourists, gaining competitive advantages, and attracting visitors to tour and spend. Tourist satisfaction is closely tied to a destination's reputation: the higher the satisfaction, the better the reputation. Therefore, understanding the factors that influence tourist satisfaction at a destination, effectively improving that satisfaction, and ultimately raising the destination's reputation not only secures a stable flow of tourists, but also plays a long-term, positive role in the scientific supervision of tourism enterprises, the optimal allocation of resources, and the sustainable development of the market.

2. Problem solving

Question 3: effectiveness analysis of online review text

Problem statement: for various reasons, online reviews often contain irrelevant content, simple copies with minor modifications, or no useful content at all. This not only hinders tourists from extracting valuable information from the reviews, but also poses challenges for the operation of the various online platforms. Please build a reasonable model, from the perspective of text analysis, to analyze the effectiveness of the online reviews of scenic spots and hotels in Annex 1.

Problem-solving idea: at first this feels like a data-cleaning task, but what is really wanted here is a model for filtering out and deleting spam reviews. For example, when we browse Taobao, the Taobao community automatically blocks some useless reviews for us and surfaces the effective ones to consumers.

It mainly involves text deduplication, which can be done by computing the similarity between texts, e.g. edit-distance deduplication or SimHash deduplication. However, these can also remove legitimately similar expressions, i.e. delete by mistake, so a compare-and-delete approach is recommended. Here we define a similarity comparison function to delete useless reviews and perform a simple deduplication.
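As a small illustration of the edit-distance idea mentioned above, here is a minimal sketch using the standard library's difflib.SequenceMatcher (an approximation rather than a true Levenshtein distance); the function name near_duplicates and the 0.9 threshold are arbitrary choices for illustration, not part of the contest solution:

from difflib import SequenceMatcher

def near_duplicates(texts, threshold=0.9):
    """Return indices of comments that near-duplicate an earlier comment."""
    dupes = set()
    for i in range(len(texts)):
        if i in dupes:
            continue
        for j in range(i + 1, len(texts)):
            # ratio() gives a similarity score in [0, 1] based on matching blocks
            if SequenceMatcher(None, texts[i], texts[j]).ratio() >= threshold:
                dupes.add(j)  # keep the first occurrence, flag the later copy
    return sorted(dupes)

This pairwise comparison is quadratic in the number of comments, which is one reason SimHash-style fingerprinting is mentioned as an alternative for larger datasets.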

Start writing the code by importing the libraries:

import jieba.analyse
import numpy as np  # numpy numerical processing library
import pandas as pd


jingqu = pd.read_excel(r'./data/Annex 1/Scenic spot comments.xlsx')  # Read the Excel file
jiudian = pd.read_excel(r'./data/Annex 1/Hotel Reviews.xlsx')

jingqu = jingqu.set_index('Name of scenic spot')
# Do a rough deduplication first
print('Before deduplication:', len(jingqu['Comment content']))
contents = jingqu.drop_duplicates('Comment content')  # Keep the first occurrence of each duplicate row and drop the later ones
print('After deduplication:', len(contents['Comment content']))

# Extract keywords next
jingqu_IDs = jingqu.index  # after set_index the scenic-spot names live in the index, not in a column
# print(jingqu_IDs)
jingqu_contents = jingqu['Comment content']


# Define the similarity calculation function (Jaccard similarity)
def compute_sim(word1, word2):
    # intersection() returns a set containing the elements common to both sets
    jiaoji = set(word1).intersection(set(word2))
    # union() returns the union of the two sets; duplicate elements appear only once
    bingji = set(word1).union(set(word2))
    return len(jiaoji) / len(bingji)


a = []  # List used to store each comment's keyword list
for jingqu_ID, jingqu_content in zip(jingqu_IDs, jingqu_contents):
    # print(jingqu_ID, jingqu_content)
    # jieba.analyse.extract_tags extracts topic words and returns a list.
    # The first parameter is the text to extract keywords from; the second, topK,
    # is the number of keywords to return, sorted from high to low importance.
    # The third parameter, withWeight, additionally returns each keyword's weight:
    # if set to True, every element is a (keyword, weight) pair.
    # str.join can splice the words in a keyword list with a space between words.
    keyword = jieba.analyse.extract_tags(jingqu_content, 20)  # a list of up to 20 keywords
    a.append(keyword)

matrix_list = []
try:
    # Pairwise comparison: compare every comment's keyword list with every other one
    for i in range(len(a)):
        for j in range(len(a)):
            sim_va = compute_sim(a[i], a[j])
            print(sim_va)
            matrix_list.append(sim_va)
except ZeroDivisionError:
    print("ZeroDivisionError: division by zero")  # both keyword lists were empty
finally:
    print(matrix_list)

values = np.mat(matrix_list)
print(np.mean(values, axis=1))

# arr = np.array(matrix_list)
# arr = arr.reshape((50, 50))  # Reshape into a square similarity matrix (assumes 50 comments)
# arr_mean = np.array(np.mean(arr, axis=1))  # Mean similarity of each comment to all the others
# weizhi = np.where(arr_mean <= 0.021)  # Set the deletion threshold
# for i in range(len(weizhi[0][:])):
#     print('Item %d should be masked or deleted' % weizhi[0][i])

Code parsing:

A similarity calculation function is defined first. intersection() returns a set containing the elements common to both input sets; union() returns the union of the two sets, where duplicate elements appear only once. The ratio of the two sizes is the Jaccard similarity of the two keyword lists.

def compute_sim(word1, word2):
    jiaoji = set(word1).intersection(set(word2))
    bingji = set(word1).union(set(word2))
    return len(jiaoji) / len(bingji)
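As a quick sanity check on toy keyword lists (not from the data): the intersection of ['scenery', 'beautiful', 'ticket'] and ['scenery', 'ticket', 'queue'] has 2 elements and their union has 4, so the similarity is 2 / 4 = 0.5.

print(compute_sim(['scenery', 'beautiful', 'ticket'], ['scenery', 'ticket', 'queue']))  # 0.5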

Then keywords are extracted from each comment:

a = []  # the list used to store keywords
for jingqu_ID, jingqu_content in zip(jingqu_IDs, jingqu_contents):
    # print(jingqu_ID, jingqu_content)
    # jieba.analyse.extract_tags extracts topic words and returns a list.
    # The first parameter is the text to extract keywords from; the second, topK,
    # is the number of keywords to return, sorted from high to low importance.
    # The third parameter, withWeight, additionally returns each keyword's weight:
    # if set to True, every element is a (keyword, weight) pair.
    # str.join can splice the words in a keyword list with a space between words.
    keyword = jieba.analyse.extract_tags(jingqu_content, 20)  # get the keyword list
    a.append(keyword)
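For instance, with withWeight=True each returned element is a (keyword, weight) pair, and str.join splices a plain keyword list with spaces. The sentence below is a made-up review used purely for illustration; the exact weights depend on jieba's built-in IDF corpus:

text = '这个景区风景很美，就是门票有点贵'  # "the scenery is beautiful, but the ticket is a bit expensive"
for word, weight in jieba.analyse.extract_tags(text, topK=3, withWeight=True):
    print(word, weight)  # each element is a (keyword, TF-IDF weight) pair
print(' '.join(jieba.analyse.extract_tags(text, topK=3)))  # keywords spliced with spaces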

Because a division-by-zero exception can occur when running the similarity comparison function (the union is empty only when both keyword lists are empty), a try statement that catches ZeroDivisionError is added.
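An alternative design is to guard against the empty-union case inside the function itself, returning zero similarity instead of raising. This is a sketch of that choice, not what the original code does:

def compute_sim_safe(word1, word2):
    bingji = set(word1).union(set(word2))
    if not bingji:  # both keyword lists are empty; define the similarity as 0
        return 0.0
    return len(set(word1).intersection(set(word2))) / len(bingji)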

matrix_list = []
try:
    # Pairwise comparison: compare every comment's keyword list with every other one
    for i in range(len(a)):
        for j in range(len(a)):
            sim_va = compute_sim(a[i], a[j])
            print(sim_va)
            matrix_list.append(sim_va)
except ZeroDivisionError:
    print("ZeroDivisionError: division by zero")
finally:
    print(matrix_list)

Finally, the similarities are reshaped into a square matrix, each comment's mean similarity to all the others is computed, and a threshold flags comments for deletion:

values = np.mat(matrix_list)
print(np.mean(values, axis=1))

arr = np.array(matrix_list)
arr = arr.reshape((50, 50))  # reshape into a square similarity matrix (assumes 50 comments)
arr_mean = np.array(np.mean(arr, axis=1))  # mean similarity of each comment to all the others
weizhi = np.where(arr_mean <= 0.021)  # set the deletion threshold
for i in range(len(weizhi[0][:])):
    print('Item %d should be masked or deleted' % weizhi[0][i])
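The post stops at printing the indices. A minimal sketch of actually removing the flagged comments, assuming arr_mean holds one value per row of jingqu (the (50, 50) reshape above only holds for exactly 50 comments), could look like this:

keep = arr_mean > 0.021           # True for comments judged effective
valid_comments = jingqu[keep]     # boolean mask keeps only the effective rows
print('Retained:', len(valid_comments), 'of', len(jingqu))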

