2.4 Data consolidation
2.4.1 Load all the files in the data folder and observe how they relate to one another and to the original data
text_left_up = pd.read_csv("data/train-left-up.csv")
text_left_down = pd.read_csv("data/train-left-down.csv")
text_right_up = pd.read_csv("data/train-right-up.csv")
text_right_down = pd.read_csv("data/train-right-down.csv")
text_left_up.head()
text_left_down.head()
text_right_down.head()
text_right_up.head()
2.4.2 Use the concat method: merge train-left-up.csv and train-right-up.csv horizontally into one table and save the table as result_up
list_up = [text_left_up, text_right_up]
result_up = pd.concat(list_up, axis=1)
result_up.head()
2.4.3 Use the concat method: merge train-left-down and train-right-down horizontally into one table and save the table as result_down. Then merge result_up and result_down vertically into result.
list_down = [text_left_down, text_right_down]
result_down = pd.concat(list_down, axis=1)
result = pd.concat([result_up, result_down])
result.head()
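To make the difference between horizontal and vertical concatenation concrete, here is a minimal sketch on toy data. The four small frames are hypothetical stand-ins for the four CSV quarters, not the real files.

```python
import pandas as pd

# Hypothetical toy frames standing in for the four CSV quarters.
left_up = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]})
right_up = pd.DataFrame({"Sex": ["male", "female"], "Fare": [7.25, 71.28]})
left_down = pd.DataFrame({"PassengerId": [3, 4], "Survived": [1, 1]})
right_down = pd.DataFrame({"Sex": ["female", "female"], "Fare": [7.92, 53.10]})

# axis=1 aligns rows on the index and places the columns side by side.
up = pd.concat([left_up, right_up], axis=1)
down = pd.concat([left_down, right_down], axis=1)

# The default axis=0 stacks rows. The original index labels are kept,
# so the index repeats unless ignore_index=True is passed.
full = pd.concat([up, down])
full_reset = pd.concat([up, down], ignore_index=True)

print(full.shape)
print(list(full.index))
print(list(full_reset.index))
```

Whether to pass ignore_index=True depends on whether the original row labels carry meaning; the code in this section keeps them.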
2.4.4 Use DataFrame's own join and append methods to complete tasks 2.4.2 and 2.4.3
# Note: DataFrame.append was removed in pandas 2.0; on newer versions use pd.concat([result_up, result_down]) instead
result_up = text_left_up.join(text_right_up)
result_down = text_left_down.join(text_right_down)
result = result_up.append(result_down)
result.head()
2.4.5 Use the pandas merge method and DataFrame's append method to complete tasks 2.4.2 and 2.4.3
result_up = pd.merge(text_left_up, text_right_up, left_index=True, right_index=True)
result_down = pd.merge(text_left_down, text_right_down, left_index=True, right_index=True)
result = result_up.append(result_down)
result.head()
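The three approaches used so far (concat with axis=1, join, and merge on the index) should give the same table when both halves share the same index and have no overlapping column names. The sketch below checks this on hypothetical toy frames and uses pd.concat for the vertical step, since DataFrame.append is not available in pandas 2.0 and later.

```python
import pandas as pd

# Hypothetical toy halves of one table, sharing the default 0..n-1 index.
left = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]})
right = pd.DataFrame({"Sex": ["male", "female"], "Fare": [7.25, 71.28]})

by_concat = pd.concat([left, right], axis=1)
by_join = left.join(right)  # joins on the index by default
by_merge = pd.merge(left, right, left_index=True, right_index=True)

# All three should hold the same columns and values.
same = by_concat.equals(by_join) and by_join.equals(by_merge)
print(same)

# Vertical step without DataFrame.append: pd.concat stacks the rows.
stacked = pd.concat([by_join, by_merge])
print(stacked.shape)
```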
2.4.6 Save the completed data as result.csv
result.to_csv('result.csv')
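By default, to_csv also writes the row index, so reading the file back with read_csv produces an extra column named "Unnamed: 0". The sketch below shows this round trip on a hypothetical toy frame, using an in-memory buffer instead of a file on disk, along with two common ways to avoid the extra column.

```python
import io

import pandas as pd

# Hypothetical toy frame; the real data would be the merged result table.
frame = pd.DataFrame({"Sex": ["male", "female"], "Fare": [7.25, 71.28]})

# Default behaviour: the index is written as an unnamed first column.
csv_with_index = frame.to_csv()
reloaded = pd.read_csv(io.StringIO(csv_with_index))
print(list(reloaded.columns))

# Option 1: do not write the index at all.
csv_without_index = frame.to_csv(index=False)
reloaded_clean = pd.read_csv(io.StringIO(csv_without_index))

# Option 2: keep the file as written and restore the index when reading.
reloaded_indexed = pd.read_csv(io.StringIO(csv_with_index), index_col=0)
print(list(reloaded_clean.columns), list(reloaded_indexed.columns))
```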
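2.5 Looking at data from another angle
2.5.1 Turn our data into Series-type data
DataFrame.stack moves the column labels into the innermost level of the row index. For a DataFrame with ordinary single-level columns, the result is a Series whose index is made up of (row label, column name) pairs.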
# Load the complete data
text = pd.read_csv('result.csv')
text.head()
# The code is written here
unit_result = text.stack().head(20)
unit_result.head()
# Save the data as unit_result.csv
unit_result.to_csv('unit_result.csv')
test = pd.read_csv('unit_result.csv')
test.head()
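As a small illustration of what stack does, the sketch below applies it to a hypothetical two-row frame and then reverses it with unstack. The toy values are made up for the example.

```python
import pandas as pd

# Hypothetical two-row frame with two columns.
frame = pd.DataFrame(
    {"Sex": ["male", "female"], "Fare": [7.25, 71.28]},
    index=[0, 1],
)

# stack turns each (row, column) cell into one entry of a Series.
stacked = frame.stack()
print(type(stacked).__name__)
print(stacked.index.nlevels)  # row label plus column name

# Individual cells are addressed by a (row label, column name) pair.
print(stacked.loc[(1, "Fare")])

# unstack moves the inner index level back into columns.
restored = stacked.unstack()
print(restored.shape)
```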
2.6 Data application
2.6.1 Understand the GroupBy mechanism
GroupBy is a powerful aggregation mechanism in pandas for summarizing large amounts of multi-dimensional data. It also provides a flexible apply function, which makes it possible to run arbitrary functions on each group and obtain complex results (in practical business analysis, when the amount of data is not too large, this is the main advantage of pandas over Excel PivotTables). At a conceptual level, without involving specific code, groupby is best understood as three steps: split, apply, combine.
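The three steps can be sketched on a small hypothetical table. The column names mirror the Titanic data, but the values are made up for illustration.

```python
import pandas as pd

# Hypothetical toy data with Titanic-like column names.
toy = pd.DataFrame({
    "Sex": ["male", "female", "male", "female"],
    "Fare": [10.0, 30.0, 20.0, 50.0],
    "Survived": [0, 1, 1, 1],
})

# Split: groupby partitions the rows by the values of the key column.
grouped = toy.groupby("Sex")
for name, part in grouped:
    print(name, len(part))

# Apply + combine with a built-in aggregation: one value per group.
mean_fare = grouped["Fare"].mean()

# Apply + combine with a custom function: fare range within each group.
fare_range = grouped["Fare"].apply(lambda s: s.max() - s.min())

# Several aggregations at once, combined into one table.
summary = grouped.agg(mean_fare=("Fare", "mean"), survivors=("Survived", "sum"))
print(summary)
```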
2.6.2 Calculate the average ticket price for men and women on the Titanic
df = text['Fare'].groupby(text['Sex'])
means = df.mean()
means
2.6.3 Count the number of survivors among men and women on the Titanic
survived_sex = text['Survived'].groupby(text['Sex']).sum()
survived_sex.head()
2.6.4 Calculate the number of survivors in each cabin class
survived_pclass = text['Survived'].groupby(text['Pclass'])
survived_pclass.sum()
2.6.5 Calculate the average ticket price for each age within each ticket class
text.groupby(['Pclass','Age'])['Fare'].mean().head()
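Grouping by two keys returns a Series with a two-level index, one level per key. The sketch below shows this on hypothetical toy values and uses unstack to view the result as a class-by-age table. Note that rows with a missing value in a grouping key are excluded by default.

```python
import pandas as pd

# Hypothetical toy data with Titanic-like column names.
toy = pd.DataFrame({
    "Pclass": [1, 1, 2, 2, 1],
    "Age": [30, 30, 30, 40, 40],
    "Fare": [100.0, 80.0, 20.0, 10.0, 60.0],
})

# One mean per (Pclass, Age) combination that occurs in the data.
mean_fare = toy.groupby(["Pclass", "Age"])["Fare"].mean()
print(mean_fare)

# unstack moves the inner level (Age) into columns: rows are classes.
table = mean_fare.unstack()
print(table)
```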
2.6.6 Merge the results of tasks 2.6.2 and 2.6.3 and save them to sex_fare_survived.csv
result = pd.merge(means, survived_sex, on='Sex')
result
result.to_csv('sex_fare_survived.csv')
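Both inputs to this merge are named Series that share the same group index, so the step can also be written as an index-on-index merge or as a horizontal concat. The following is a sketch on hypothetical toy data, not the real Titanic values.

```python
import pandas as pd

# Hypothetical toy data with Titanic-like column names.
toy = pd.DataFrame({
    "Sex": ["male", "female", "male", "female"],
    "Fare": [10.0, 30.0, 20.0, 50.0],
    "Survived": [0, 1, 1, 1],
})
mean_fare = toy.groupby("Sex")["Fare"].mean()         # Series named "Fare"
survivors = toy.groupby("Sex")["Survived"].sum()      # Series named "Survived"

# pd.merge accepts named Series; here the shared index is the join key.
by_merge = pd.merge(mean_fare, survivors, left_index=True, right_index=True)

# pd.concat with axis=1 aligns on the index and gives the same two columns.
by_concat = pd.concat([mean_fare, survivors], axis=1)

print(by_merge)
print(by_concat)
```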
2.6.7 Get the total number of survivors at each age, find the age with the largest number of survivors, and finally calculate the survival rate for that age (number of survivors at that age / total number of survivors)
# Number of survivors at each age
survived_age = text['Survived'].groupby(text['Age']).sum()
survived_age.head()
# Find the age with the maximum number of survivors
survived_age[survived_age.values == survived_age.max()]
# First calculate the total number of survivors
_sum = text['Survived'].sum()
print("Total number of survivors: " + str(_sum))
# Share of all survivors who were at that age
percent = survived_age.max() / _sum
print("Maximum survival rate: " + str(percent))
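The phrase "number of survivors / total number" can be read in two ways: as the share of all survivors who were at that age, or as the survival rate within that age. The sketch below computes both on hypothetical toy data so the difference is visible; which one the task intends is an assumption left to the reader.

```python
import pandas as pd

# Hypothetical toy data: four passengers aged 22, one aged 30, one aged 40.
toy = pd.DataFrame({
    "Age": [22, 22, 22, 22, 30, 40],
    "Survived": [1, 1, 0, 0, 1, 0],
})

survivors_by_age = toy.groupby("Age")["Survived"].sum()

# idxmax returns the index label (the age) with the most survivors.
top_age = survivors_by_age.idxmax()

# Reading 1: share of all survivors who were at that age.
share_of_survivors = survivors_by_age.max() / toy["Survived"].sum()

# Reading 2: survival rate among passengers of that age.
rate_within_age = toy.loc[toy["Age"] == top_age, "Survived"].mean()

print(top_age, share_of_survivors, rate_within_age)
```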