1. Data consolidation
1.1 import basic library
import numpy as np import pandas as pd
1.2 loading data
#Load all the data in the data folder and observe the relationship between them compared with the previous original data text_left_up = pd.read_csv("train-left-up.csv") text_left_down = pd.read_csv("train-left-down.csv") text_right_up = pd.read_csv("train-right-up.csv") text_right_down = pd.read_csv("train-right-down.csv") text = pd.read_csv('train.csv') #The four data are the overall data, which are divided into four blocks according to the upper left, lower left, upper right and lower right.
1.3 merging data
# Method 1: use concat Method: add data train-left-up.csv and train-right-up.csv Merge horizontally into one table, and # Save this table as result_up use concat Method: add train-left-down and train-right-down Horizontal merge into one # Table and save the table as result_down. Then put the result above_ Up and result_down is merged vertically into result. list_up = [text_left_up,text_right_up] result_up = pd.concat(list_up,axis=1) list_down=[text_left_down,text_right_down] result_down = pd.concat(list_down,axis=1) result = pd.concat([result_up,result_down]) #Method 2: use DataFrame's own methods, join method and append method resul_up = text_left_up.join(text_right_up) result_down = text_left_down.join(text_right_down) result = result_up.append(result_down) #Method 3: use the merge method of panels and the append method of DataFrame result_up = pd.merge(text_left_up,text_right_up,left_index=True,right_index=True) result_down = pd.merge(text_left_down,text_right_down,left_index=True,right_index=True) result = resul_up.append(result_down)
[thinking] 1. Compare the different and same methods of merge, join and concat. 2. Think about why the append method of DataFrame is required in task 4 and task 5, 3. How to use merge or join to complete task 4 and task 5?
parameter | explain |
left | Spliced left DataFrame object |
right | Spliced right DataFrame object |
on | The name of the column or index level to join. Must be found in the left and right DataFrame objects. If not passed and left_index and right_ If the index is False, the intersection of columns in the DataFrame will be inferred as a join key. |
left_on | The column (index level) in the DataFrame on the left is used as the key. (column name, index level name, array with length equal to the length of DataFrame.) |
right_on | ditto |
left_index | left_index=True, the row label is used as its connection key. For a DataFrame with MultiIndex (hierarchy), the number of levels must match the number of connection keys in the DataFrame on the right. |
right_index | ditto |
how | (left, right, outer, inner). Default inner. inner is the intersection and outer is the union. |
sort | : sort the resulting dataframes by the join key in dictionary order. The default is True, and setting it to False will significantly improve performance in many cases |
copy | Always copy data from the passed DataFrame object (default = True) |
parameter | explain |
how | Rules used to specify data retention after table consolidation (the same as merge) |
on | The name of the column or index level to join. |
suffix | If you encounter a column with the same name but different values in the process of merging with tables, and you want to keep them during merging, you can use suffixes to add suffixes to the duplicate column names of each table |
objs | series, dataframe or panel sequence lsit |
axis | Axes that need to merge links. 0 is lie, 1 is hang, and the default axis=0 |
join | The connection mode is inner or outer |
merge | Horizontal splicing, database style link merging, you can specify the key of the link |
join | Transverse splicing |
conbat | Both horizontal and vertical |
2. Because append is used, longitudinal splicing can be performed (additional lines can be added).
3. Only merge and join can't work, because both are spliced horizontally.
2. Look at the data from another angle
2.1 changing data to Series type data
#What does this stack function do? #Change the data from table structure to curly bracket structure, that is, its row index into column index unit_result=text.stack().head(20)
3. Data aggregation and operation
3.1 group by () usage
#Group and aggregate according to the contents of one or more columns of the DataFrame itself survived_sex = text['Survived'].groupby(text['Sex']).sum() df = text['Fare'].groupby(text['Sex']) survived_pclass = text['Survived'].groupby(text['Pclass'])
three point two Usage of agg() function.
text.groupby('Sex').agg({'Fare': 'mean', 'Pclass': 'count'}).rename(columns= {'Fare': 'mean_fare', 'Pclass': 'count_pclass'})