Simple data cleaning and preprocessing (source code attached)

The next article will share the content of the toB end product manager. The content of data analysis will be sent on the weekend. As a person in a product post, it's a little outrageous to always send data analysis.

This chapter mainly talks about the data cleaning steps. The process is relatively simple. Interpolation method is not used. You can make do with it

1. Data preliminary missing value processing

a. The retrieval of the first row data source is shown in the figure

b. Count the missing values of columns and rows, as shown in the figure

c. Check and delete the row of the missing values in the three columns of house name, address and house type area, because the missing values in these three columns cannot be filled, and because the surrounding average price and average price have the same meaning, insert the surrounding average price column into the price column, and then delete the row of the missing value in the price column.

c=df.isnull().sum(axis=1) # view missing values

print(c)

d=df.isnull().sum(axis=0)

print(d)

df=df.dropna(subset = ['house name', 'address',' house type area ']) # delete the missing value row of this column

df=df.fillna(axis=1,method='backfill ') # inserts the surrounding average price into the price

df=df.dropna(subset = ['price']) # deletes the missing value of the price

2. Text de duplication

Retrieve the duplicate name of the first column for de duplication

data = df.drop_duplicates(subset = ['house name'], keep='first', inplace=False)

3. Delete the last column

Leave the necessary data. Now the data is not missing or duplicated, as shown in the figure

4. Data classification

Because of the different units of measurement of the house price of the residents, the average price and average price of the surrounding areas are classified into one category, the lowest into one category and the total price into one category. The classification analysis is shown in the figure below

5. Upper and lower limits of data transformation and extraction area

a. First delete the unimportant strings in the data and clear the data redundancy, such as the "average price around", "average price" and UOM of house price. Here, it is necessary to replace "10000" with "0000" as shown in the figure for future analysis. After clearing, the content shown in the figure is obtained and the final cleaned file is obtained. Because the area of the house is a range, it is necessary to separate the upper and lower limits of the area.

b. Write the house area corresponding to the house price into the document

for a in reader:

a=a ['unit area']

b = re.split(': |-|㎡|m²',a)

print(b[1:3], file=convert)

c. Extract the address keyword of the house type and write it into the document:

for c in reader1:

c = c ['address']

print(c[1:3],file=out3)
Finally, the data are shown in the figure to form the average price of fen csv, fen total price csv and fen are the lowest csv three files.

6. Further processing of missing values

Further delete the line where the missing value is located, and finally get the file as shown in the figure

0-1 standardization of data

Select 0-1 standardized processing for the data. The data to be processed include: "house price: yuan / m ²”, "'minimum area: M ²”, "'maximum area: M ²”， The three types of data of the three files shall be standardized, and the core code is as follows:

DF1 ['house price 0-1'] = ((df1 ['house price: yuan / m) ²']) - Df1['house price: yuan / m ²']. Min () / (df1['house price: yuan / m ²']. max() - df1['house price: yuan / m ²']. min())

df1 ['minimum area 0-1'] = ((df1 ['minimum area: m) ²']) - df1 ['minimum area: M ²']. min())/

(df1['minimum area: m) ²']. max() - df1['minimum area: M ²']. min())

DF1 ['maximum area 0-1'] = ((df1 ['maximum area: m) ²']) - DF1 ['maximum area: M ²']. Min () / (df1['maximum area: M ²']. max() - df1['maximum area: M ²']. min())

Source code on the old rule:

#clean
import numpy as np
import pandas as pd
import csv
file=open("C:\\Users\\admin\Desktop\\Tianjin housing data.csv",encoding="utf-8-sig")
out = open("C:\\Users\\admin\Desktop\\Post processing document of Tianjin housing data.csv",'w',encoding='utf-8-sig',newline="")
out1 = open("C:\\Users\\admin\Desktop\\Tianjin housing data processing file 1.csv",'w',encoding='utf-8-sig',newline="")
out2 = open("C:\\Users\\admin\Desktop\\Tianjin housing data processing document 2.csv",'w',encoding='utf-8-sig',newline="")
out3 = open("C:\\Users\\admin\Desktop\\Tianjin housing data processing document 3.csv",'w',encoding='utf-8-sig',newline="")

out4 =open("C:\\Users\\admin\Desktop\\Post processing data\\fen average price.csv",'w',encoding='utf-8-sig',newline="")
out5 =open("C:\\Users\\admin\Desktop\\Post processing data\\fen minimum.csv",'w',encoding='utf-8-sig',newline="")
out6 =open("C:\\Users\\admin\Desktop\\Post processing data\\fen Total price.csv",'w',encoding='utf-8-sig',newline="")

df=pd.read_csv(file)
b=df.info()
a=df.describe()#View detailed description
print(b)
print(a)
c=df.isnull().sum(axis=1)#View missing values
print(c)
d=df.isnull().sum(axis=0)
print(d)
df=df.dropna(subset=['House name','address','House type area'])#Delete the missing value row for this column
df=df.fillna(axis=1,method='backfill')#Insert the surrounding average price into the price
df=df.dropna(subset=['Price'])#Delete missing value for price
df=df.drop_duplicates(subset=['House name'], keep='first', inplace=False)#duplicate removal

df=df.drop(["Peripheral average price"],axis=1)

df1=df[df["Price"].str.contains('average price')]
df1.to_csv(out1)#Output to csv

df2=df[df["Price"].str.contains('minimum')]
df2.to_csv(out2)#Output to csv

df3=df[df["Price"].str.contains('Total price')]
df3.to_csv(out3)#Output to csv

df.to_csv(out)#Output to csv

#Standardization
import numpy as np
import csv
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings("ignore")
out1 =open("C:\\Users\\admin\Desktop\\Post processing data\\final\\Average price data.csv",'r',encoding='utf-8-sig',newline="")
out2 =open("C:\\Users\\admin\Desktop\\Post processing data\\final\\Low price data.csv",'r',encoding='utf-8-sig',newline="")
out3 =open("C:\\Users\\admin\Desktop\\Post processing data\\final\\Total price data.csv",'r',encoding='utf-8-sig',newline="")
#Average price 0-1
df1=pd.read_csv(out1,engine='python')
df1['House price 0-1'] = ((df1['House price: Yuan/m²']) -df1['House price: Yuan/m²'].min())/\
              (df1['House price: Yuan/m²'].max() - df1['House price: Yuan/m²'].min())
df1['Minimum area 0-1'] = ((df1['Minimum area: m²']) -df1['Minimum area: m²'].min())/\
                (df1['Minimum area: m²'].max() - df1['Minimum area: m²'].min())
df1['Maximum area 0-1'] = ((df1['Maximum area: m²']) -df1['Maximum area: m²'].min())/\
                (df1['Maximum area: m²'].max() - df1['Maximum area: m²'].min())
df1.to_csv(r'C:\\Users\\admin\Desktop\\Post processing data\\final\\Average price data 0-1.csv',mode='w',index =False,encoding='utf-8-sig')
#Low price 0-1
df2=pd.read_csv(out2,engine='python')
df2['House price 0-1'] = ((df2['House price: Yuan/m²']) -df2['House price: Yuan/m²'].min())/\
              (df2['House price: Yuan/m²'].max() - df2['House price: Yuan/m²'].min())
df2['Minimum area 0-1'] = ((df2['Minimum area: m²']) -df2['Minimum area: m²'].min())/\
                (df2['Minimum area: m²'].max() - df2['Minimum area: m²'].min())
df2['Maximum area 0-1'] = ((df2['Maximum area: m²']) -df2['Maximum area: m²'].min())/\
                (df2['Maximum area: m²'].max() - df2['Maximum area: m²'].min())
df2.to_csv(r'C:\\Users\\admin\Desktop\\Post processing data\\final\\Low price data 0-1.csv',mode='w',index =False,encoding='utf-8-sig')
#Total price 0-1
df3=pd.read_csv(out3,engine='python')
df3['House price 0-1'] = ((df3['House price: 10000 yuan/set']) -df3['House price: 10000 yuan/set'].min())/\
              (df3['House price: 10000 yuan/set'].max() - df3['House price: 10000 yuan/set'].min())
df3['Minimum area 0-1'] = ((df3['Minimum area: m²']) -df3['Minimum area: m²'].min())/\
                (df3['Minimum area: m²'].max() - df3['Minimum area: m²'].min())
df3['Maximum area 0-1'] = ((df3['Maximum area: m²']) -df3['Maximum area: m²'].min())/\
                (df3['Maximum area: m²'].max() - df3['Maximum area: m²'].min())
df3.to_csv(r'C:\\Users\\admin\Desktop\\Post processing data\\final\\Total price data 0-1.csv',mode='w',index =False,encoding='utf-8-sig')


#Visual legend
#average price

plt.figure(1)
plt.rcParams['font.sans-serif'] = 'Microsoft YaHei'
plt.rcParams['axes.unicode_minus'] = False
plt.figure(num='Title', facecolor='lightgray', figsize=(6, 3))
# Before average price processing
plt.subplot(3, 2, 1)
plt.title('average price-housing price', fontsize=10)
plt.bar(df1['address'], df1['House price: Yuan/m²'], 0.5, color='dodgerblue', label='housing price')
plt.legend()
# Average price 01 after standardization
plt.subplot(3, 2, 2)
plt.title('0-1 Standardization', fontsize=10)
plt.bar(df1['address'], df1['House price 0-1'], 0.2, color='orangered', label='House price 0-1')
plt.legend()
plt.tight_layout()
# Before low price treatment
plt.subplot(3, 2, 3)
plt.title('average price-housing price', fontsize=10)
plt.bar(df2['address'], df2['House price: Yuan/m²'], 0.5, color='dodgerblue', label='housing price')
plt.legend()
# After standardization
plt.subplot(3, 2, 4)
plt.title('0-1 Standardization', fontsize=10)
plt.bar(df2['address'], df2['House price 0-1'], 0.2, color='orangered', label='House price 0-1')
plt.legend()
plt.tight_layout()
# Before total price processing
plt.subplot(3, 2, 5)
plt.title('House price: 10000 yuan/set', fontsize=10)
plt.bar(df3['address'], df3['House price: 10000 yuan/set'], 0.5, color='dodgerblue', label='housing price')
plt.legend()
# Total price 01 after standardization
plt.subplot(3, 2, 6)
plt.title('0-1 Standardization', fontsize=10)
plt.bar(df3['address'], df3['House price: 10000 yuan/set'], 0.2, color='orangered', label='House price 0-1')
plt.legend()
plt.tight_layout()
plt.show()

Keywords: Data Analysis

Added by Jeff4507 on Tue, 28 Dec 2021 10:43:40 +0200

Programming VIP