Hands-on data analysis: Chapter 2 (Part 2) Data reconstruction

2.4 Data consolidation

2.4.1 Load all the files in the data folder and observe how they relate to each other and to the original data

import pandas as pd

# Load the four pieces of the training data
text_left_up = pd.read_csv("data/train-left-up.csv")
text_left_down = pd.read_csv("data/train-left-down.csv")
text_right_up = pd.read_csv("data/train-right-up.csv")
text_right_down = pd.read_csv("data/train-right-down.csv")

# Preview each piece (in a notebook, run each head() in its own cell to see it)
text_left_up.head()
text_left_down.head()
text_right_down.head()
text_right_up.head()

2.4.2 Use the concat method: merge train-left-up.csv and train-right-up.csv horizontally into one table and save it as result_up

list_up = [text_left_up,text_right_up]
result_up = pd.concat(list_up,axis=1)
result_up.head()
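To see what axis=1 does, here is a minimal sketch on two toy frames. The column values are made up for illustration and are not taken from the Titanic files. It shows that pd.concat lines rows up by index label, so the two halves only fit together cleanly when their indexes match.

```python
import pandas as pd

# Toy stand-ins for the left and right halves (hypothetical values)
left = pd.DataFrame({"PassengerId": [1, 2, 3], "Survived": [0, 1, 1]})
right = pd.DataFrame({"Sex": ["male", "female", "female"], "Fare": [7.25, 71.28, 7.92]})

# axis=1 places the frames side by side, matching rows by index label
side_by_side = pd.concat([left, right], axis=1)
print(side_by_side.shape)  # (3, 4)

# If the index labels differ, unmatched rows are padded with NaN
shifted = right.copy()
shifted.index = [1, 2, 3]
misaligned = pd.concat([left, shifted], axis=1)
print(misaligned.shape)  # (4, 4)
```

If the row counts of the two halves look right but the merged table is full of NaN, a mismatched index is the first thing to check.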

2.4.3 Use the concat method: merge train-left-down and train-right-down horizontally into one table and save it as result_down. Then merge result_up and result_down vertically into result.

list_down=[text_left_down,text_right_down]
result_down = pd.concat(list_down,axis=1)
result = pd.concat([result_up,result_down])
result.head()

2.4.4 Use the DataFrame's own join and append methods to complete tasks 2.4.2 and 2.4.3

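# join merges horizontally on the index (a left join by default)
result_up = text_left_up.join(text_right_up)
result_down = text_left_down.join(text_right_down)
# Note: DataFrame.append was removed in pandas 2.0;
# on newer versions use pd.concat([result_up, result_down]) instead
result = result_up.append(result_down)
result.head()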
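DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so the append calls in this section fail on recent versions. The sketch below shows the pd.concat equivalent on two toy frames (the values are hypothetical), and how ignore_index changes the resulting row labels.

```python
import pandas as pd

# Toy stand-ins for result_up and result_down (hypothetical values)
up = pd.DataFrame({"PassengerId": [1, 2], "Fare": [7.25, 71.28]})
down = pd.DataFrame({"PassengerId": [3, 4], "Fare": [7.92, 53.10]})

# Equivalent of up.append(down): rows are stacked and keep their original labels
stacked = pd.concat([up, down])
print(stacked.index.tolist())  # [0, 1, 0, 1]

# ignore_index=True renumbers the rows from 0 to n-1
renumbered = pd.concat([up, down], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2, 3]
```

Whether to renumber is a judgment call: duplicate labels are harmless for the groupby tasks later on, but they make label-based lookups with loc return more than one row.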

2.4.5 Use the pandas merge method and the DataFrame append method to complete tasks 2.4.2 and 2.4.3

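# Merge horizontally by matching on the index of both frames
result_up = pd.merge(text_left_up, text_right_up, left_index=True, right_index=True)
result_down = pd.merge(text_left_down, text_right_down, left_index=True, right_index=True)
# Note: DataFrame.append was removed in pandas 2.0;
# on newer versions use pd.concat([result_up, result_down]) instead
result = result_up.append(result_down)
result.head()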
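The three horizontal approaches used so far (concat with axis=1, join, and merge on the index) give the same table when both halves share an identical index. The following sketch checks this on toy data; the frames and values are assumptions for illustration only.

```python
import pandas as pd

# Toy halves with identical default indexes (hypothetical values)
left = pd.DataFrame({"PassengerId": [1, 2, 3], "Survived": [0, 1, 1]})
right = pd.DataFrame({"Sex": ["male", "female", "female"], "Fare": [7.25, 71.28, 7.92]})

via_concat = pd.concat([left, right], axis=1)
via_join = left.join(right)
via_merge = pd.merge(left, right, left_index=True, right_index=True)

# All three results have the same columns, in the same order
print(via_concat.columns.tolist())
print(via_join.columns.tolist())
print(via_merge.columns.tolist())
```

The methods differ once the indexes stop matching: by default concat keeps the union of labels (outer), join keeps the left frame's labels (left), and merge keeps only the shared labels (inner).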

2.4.6 Save the completed data as result.csv

result.to_csv('result.csv')
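One detail worth knowing before reading result.csv back in: to_csv writes the index as an extra unnamed column by default. The sketch below uses an in-memory buffer and a toy frame (both are assumptions for illustration) to show the difference that index=False makes.

```python
import io
import pandas as pd

# Toy frame standing in for the merged result (hypothetical values)
df = pd.DataFrame({"PassengerId": [1, 2], "Fare": [7.25, 71.28]})

# Default: the index is written as an unnamed first column
with_index = io.StringIO()
df.to_csv(with_index)
with_index.seek(0)
cols_with = pd.read_csv(with_index).columns.tolist()
print(cols_with)  # ['Unnamed: 0', 'PassengerId', 'Fare']

# index=False: only the data columns are written
without_index = io.StringIO()
df.to_csv(without_index, index=False)
without_index.seek(0)
cols_without = pd.read_csv(without_index).columns.tolist()
print(cols_without)  # ['PassengerId', 'Fare']
```

If the extra column is not wanted, either save with index=False or read the file back with index_col=0.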

 

2.5 Looking at the data from another angle

2.5.1 Convert the data into Series-type data

The DataFrame.stack method moves the column labels into the innermost level of the row index. For a DataFrame with ordinary single-level columns, the result is a Series indexed by (row label, column name) pairs.

# Load the complete data
text = pd.read_csv('result.csv')
text.head()
# Stack the columns into a Series and keep the first 20 entries
unit_result = text.stack().head(20)
unit_result.head()
# Save the data as unit_result.csv
unit_result.to_csv('unit_result.csv')
test = pd.read_csv('unit_result.csv')
test.head()
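To make the effect of stack easier to see than on the full Titanic table, here is a small sketch on a two-row frame. The column names mirror the data set, but the values are made up for illustration.

```python
import pandas as pd

# Tiny numeric frame (hypothetical values)
df = pd.DataFrame({"Age": [22.0, 38.0], "Fare": [7.25, 71.28]})

# stack turns each (row, column) cell into one entry of a Series
stacked = df.stack()
print(type(stacked).__name__)  # Series
print(stacked.index.tolist())  # [(0, 'Age'), (0, 'Fare'), (1, 'Age'), (1, 'Fare')]

# A single cell is addressed by its (row label, column name) pair
print(stacked.loc[(1, "Fare")])  # 71.28

# unstack reverses the operation and rebuilds the table
restored = stacked.unstack()
print(restored.shape)  # (2, 2)
```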

2.6 Data application

2.6.1 Understand the GroupBy mechanism

GroupBy is a powerful data aggregation mechanism provided by pandas that can summarize large amounts of multi-dimensional data, much like a pivot table. It also provides a flexible apply function, which makes it possible to run complex functions on each group and obtain complex results (in practical business analysis, when the data set is not very large, this is the biggest advantage of pandas over Excel PivotTables). At an abstract level, without going into specific code, groupby can be understood through three steps: split, apply, combine.
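The three steps can be written out by hand to show what groupby does in one call. This is only an illustrative sketch on toy data (the values are hypothetical), not how pandas implements it internally.

```python
import pandas as pd

# Toy passenger data (hypothetical values)
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# Split: one sub-frame per distinct value of the key
pieces = {key: group for key, group in df.groupby("Sex")}

# Apply: compute a statistic on each piece independently
applied = {key: group["Fare"].mean() for key, group in pieces.items()}

# Combine: gather the per-group results into one Series
manual = pd.Series(applied, name="Fare").sort_index()
print(manual)

# The usual one-liner produces the same numbers
direct = df.groupby("Sex")["Fare"].mean()
print(direct)
```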

2.6.2 Calculate the average ticket price for men and women on the Titanic

df = text['Fare'].groupby(text['Sex'])
means = df.mean()
means

2.6.3 Count the number of survivors among men and women on the Titanic

survived_sex = text['Survived'].groupby(text['Sex']).sum()
survived_sex.head()

2.6.4 Calculate the number of survivors in each cabin class

survived_pclass = text['Survived'].groupby(text['Pclass'])
survived_pclass.sum()

2.6.5 Calculate the average fare for each age within each ticket class

text.groupby(['Pclass','Age'])['Fare'].mean().head()
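Grouping by two keys returns a Series with a two-level index, which can be awkward to read at first. The sketch below, on toy data with made-up values, shows how to look up one group and how to flatten the result back into columns. It also illustrates that rows whose key is missing (the Age column in the Titanic data has gaps) are left out of the groups by default.

```python
import pandas as pd

# Toy data; the last row has a missing age (hypothetical values)
df = pd.DataFrame({
    "Pclass": [1, 1, 3, 3, 3, 1],
    "Age": [38.0, 38.0, 22.0, 26.0, 22.0, float("nan")],
    "Fare": [71.28, 53.10, 7.25, 7.92, 8.05, 30.0],
})

by_class_age = df.groupby(["Pclass", "Age"])["Fare"].mean()
print(by_class_age.index.tolist())  # [(1, 38.0), (3, 22.0), (3, 26.0)]

# Look up a single (class, age) group
fare_class3_age22 = by_class_age.loc[(3, 22.0)]
print(fare_class3_age22)

# Turn the two index levels back into ordinary columns
flat = by_class_age.reset_index()
print(flat.columns.tolist())  # ['Pclass', 'Age', 'Fare']
```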

2.6.6 Merge the results of tasks 2.6.2 and 2.6.3 and save them to sex_fare_survived.csv

result = pd.merge(means,survived_sex,on='Sex')
result


result.to_csv('sex_fare_survived.csv')
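As an alternative to computing two Series and merging them, the same table can be built in a single groupby call with named aggregation. This is a sketch on toy data; the output column names mean_fare and survivors are chosen here for illustration and do not appear in the original tasks.

```python
import pandas as pd

# Toy passenger data (hypothetical values)
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Fare": [7.25, 71.28, 7.92, 8.05],
    "Survived": [0, 1, 1, 0],
})

# One pass over the groups: output_name=(input_column, aggregation)
summary = df.groupby("Sex").agg(
    mean_fare=("Fare", "mean"),
    survivors=("Survived", "sum"),
)
print(summary)
```

This avoids the intermediate variables and gives the columns descriptive names, at the cost of departing from the merge exercise the task asks for.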

2.6.7 Get the total number of survivors at each age, find the age with the largest number of survivors, and calculate the survival rate for that age (number of survivors / total number)

# Number of survivors at each age
survived_age = text['Survived'].groupby(text['Age']).sum()
survived_age.head()

# Find the age(s) with the maximum number of survivors
survived_age[survived_age.values == survived_age.max()]

# Total number of survivors across all ages (the denominator used below)
_sum = text['Survived'].sum()
print("Total survivors: " + str(_sum))

# Share of all survivors who belong to the age with the most survivors
percent = survived_age.max() / _sum
print("Maximum survival rate: " + str(percent))
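The phrase "number of survivors / total number" can be read in two ways. The code above divides by the total number of survivors, which gives that age's share of all survivors. Another reading divides by the number of passengers of that age, which gives the survival rate within the age. The sketch below computes both on toy data (the values are hypothetical) so the two can be compared.

```python
import pandas as pd

# Toy data: age and survival flag per passenger (hypothetical values)
df = pd.DataFrame({
    "Age": [22.0, 22.0, 24.0, 24.0, 24.0, 38.0],
    "Survived": [1, 0, 1, 1, 0, 1],
})

# Per age: number of survivors ("sum") and number of passengers ("count")
by_age = df.groupby("Age")["Survived"].agg(["sum", "count"])

# Age with the most survivors
top_age = by_age["sum"].idxmax()
print(top_age)  # 24.0

# Reading 1: share of all survivors who are of that age
share_of_survivors = by_age.loc[top_age, "sum"] / df["Survived"].sum()
print(share_of_survivors)  # 0.5

# Reading 2: survival rate among passengers of that age
rate_within_age = by_age.loc[top_age, "sum"] / by_age.loc[top_age, "count"]
print(rate_within_age)
```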

 

Keywords: Machine Learning Data Analysis Data Mining

Added by shams on Fri, 17 Dec 2021 21:44:39 +0200