Hands on data analysis -- data reconstruction

1. Data consolidation

1.1 import basic library

import numpy as np
import pandas as pd

1.2 loading data

#Load all the data in the data folder and observe the relationship between them compared with the previous original data
text_left_up = pd.read_csv("train-left-up.csv")
text_left_down = pd.read_csv("train-left-down.csv")
text_right_up = pd.read_csv("train-right-up.csv")
text_right_down = pd.read_csv("train-right-down.csv")
text = pd.read_csv('train.csv')

#The four data are the overall data, which are divided into four blocks according to the upper left, lower left, upper right and lower right.

1.3 merging data

# Method 1: use concat Method: add data train-left-up.csv and train-right-up.csv Merge horizontally into one table, and # Save this table as result_up use concat Method: add train-left-down and train-right-down Horizontal merge into one # Table and save the table as result_down.  Then put the result above_ Up and result_down is merged vertically into result.
list_up = [text_left_up,text_right_up]
result_up = pd.concat(list_up,axis=1)

result_down = pd.concat(list_down,axis=1)

result = pd.concat([result_up,result_down])

#Method 2: use DataFrame's own methods, join method and append method
resul_up = text_left_up.join(text_right_up)

result_down = text_left_down.join(text_right_down)

result = result_up.append(result_down)

#Method 3: use the merge method of panels and the append method of DataFrame
result_up = pd.merge(text_left_up,text_right_up,left_index=True,right_index=True)

result_down = pd.merge(text_left_down,text_right_down,left_index=True,right_index=True)

result = resul_up.append(result_down)
[thinking] 1. Compare the different and same methods of merge, join and concat.
2. Think about why the append method of DataFrame is required in task 4 and task 5,
3. How to use merge or join to complete task 4 and task 5?

merge function
leftSpliced left DataFrame object
rightSpliced right DataFrame object

The name of the column or index level to join.

Must be found in the left and right DataFrame objects. If not passed and left_index and right_ If the index is False, the intersection of columns in the DataFrame will be inferred as a join key.

left_onThe column (index level) in the DataFrame on the left is used as the key. (column name, index level name, array with length equal to the length of DataFrame.)
left_indexleft_index=True, the row label is used as its connection key. For a DataFrame with MultiIndex (hierarchy), the number of levels must match the number of connection keys in the DataFrame on the right.

(left, right, outer, inner). Default inner.

  inner is the intersection and outer is the union.


: sort the resulting dataframes by the join key in dictionary order.

The default is True, and setting it to False will significantly improve performance in many cases

copyAlways copy data from the passed DataFrame object (default = True)
join function


Rules used to specify data retention after table consolidation (the same as merge)
onThe name of the column or index level to join.


If you encounter a column with the same name but different values in the process of merging with tables, and you want to keep them during merging, you can use suffixes to add suffixes to the duplicate column names of each table
Combat function


series, dataframe or panel sequence lsit  
axisAxes that need to merge links. 0 is lie, 1 is hang, and the default axis=0
joinThe connection mode is inner or outer
mergeHorizontal splicing, database style link merging, you can specify the key of the link
joinTransverse splicing
conbatBoth horizontal and vertical

2. Because append is used, longitudinal splicing can be performed (additional lines can be added).

3. Only merge and join can't work, because both are spliced horizontally.

2. Look at the data from another angle

2.1 changing data to Series type data

#What does this stack function do?
#Change the data from table structure to curly bracket structure, that is, its row index into column index


3. Data aggregation and operation

3.1 group by () usage

#Group and aggregate according to the contents of one or more columns of the DataFrame itself
survived_sex = text['Survived'].groupby(text['Sex']).sum()

df  = text['Fare'].groupby(text['Sex'])

survived_pclass = text['Survived'].groupby(text['Pclass'])

three point two   Usage of agg() function.

text.groupby('Sex').agg({'Fare': 'mean', 'Pclass': 'count'}).rename(columns=                                                                
                         {'Fare': 'mean_fare', 'Pclass': 'count_pclass'})

Keywords: Python Data Analysis Data Mining

Added by Begbie on Fri, 19 Nov 2021 13:29:49 +0200