Datawhale study check-in: first entry
Recently I joined a "hands-on data analysis" study check-in activity. I only know a little Python. I have tried to work through "Python for Data Analysis" many times, but never got through it. The good things about this activity are: 1. there is a fairly complete learning path (this helps, because when you are just starting out, having someone tell you what matters most lowers the cost of learning); 2. there are dedicated people to answer questions; 3. there are study partners to cheer each other on. So I am taking advantage of this activity to learn Python data analysis. Let's go!!!
Dataset download: Titanic - Machine Learning from Disaster.
Chapter I
Part I load / adjust / view / save data
1.1.1 loading data
- pd.read_csv()
- pd.read_table()
Both functions can open .csv files. The difference is that pd.read_table() uses a tab as its default separator, so for a .csv file the separator has to be given through the sep="," parameter.
import pandas as pd
data2 = pd.read_csv("Chapter 1/1. Data loading and preliminary observation/train.csv")
data3 = pd.read_table("Chapter 1/1. Data loading and preliminary observation/train.csv", sep=",")
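To check the separator behaviour without the real file, here is a minimal sketch on an in-memory CSV. The three-row sample text is made up for illustration and only stands in for train.csv.

```python
import io
import pandas as pd

# Hypothetical three-row sample standing in for train.csv
csv_text = "PassengerId,Survived,Age\n1,0,22\n2,1,38\n3,1,26\n"

# read_csv uses a comma as its default separator
via_csv = pd.read_csv(io.StringIO(csv_text))

# read_table uses a tab by default, so the comma is passed explicitly
via_table = pd.read_table(io.StringIO(csv_text), sep=",")

# Without sep=",", every line is read into one single column
one_column = pd.read_table(io.StringIO(csv_text))

print(via_csv.equals(via_table))
print(one_column.shape)
```

If the two frames compare equal, the only practical difference between the two functions here is the default separator.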
The two calls that read train.csv above use a relative path.
A path can be relative or absolute. A relative path is resolved against the current working directory, which the os module can view and change:
import os
os.getcwd()      # view the current working directory
os.chdir(path)   # change the current working directory to path
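The effect of the working directory on relative paths can be seen with a small sketch. It assumes a temporary directory and a made-up file name (example.txt), so nothing outside the temp folder is touched.

```python
import os
import tempfile

original_dir = os.getcwd()            # remember where we started

with tempfile.TemporaryDirectory() as tmp:
    os.chdir(tmp)                     # relative paths now resolve against tmp
    with open("example.txt", "w") as f:
        f.write("hello")
    # abspath shows which absolute path the relative name resolved to
    resolved = os.path.abspath("example.txt")
    os.chdir(original_dir)            # move back before the temp dir is removed

print(resolved)
```

Changing back to the original directory at the end keeps later relative paths, such as the train.csv path, working as before.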
1.1.2 adjusting data
- df.rename(): modify column names
Specific use:
Change the column headers to translated names and the index to passenger ID [when the material is in English, translating the column names makes the data more intuitive to work with]
PassengerId => passenger ID
Survived => Survive
Pclass => Passenger class (1st/2nd/3rd class)
Name => Passenger name
Sex => Gender
Age => Age
SibSp => Number of siblings/spouses
Parch => Number of parents and children
Ticket => Ticket information
Fare => Ticket Price
Cabin => passenger cabin
Embarked => Boarding port
data1.rename(columns={"PassengerId": "passenger ID",
                      "Survived": "Survive",
                      "Pclass": "Passenger class (1st/2nd/3rd class)",
                      "Name": "Passenger name",
                      "Sex": "Gender",
                      "Age": "Age",
                      "SibSp": "Number of siblings/spouses",
                      "Parch": "Number of parents and children",
                      "Ticket": "Ticket information",
                      "Fare": "Ticket Price",
                      "Cabin": "passenger cabin",
                      "Embarked": "Boarding port"}, inplace=True)
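A small sketch of how rename behaves, using a made-up two-column frame instead of the Titanic data. The set_index call at the end is only one possible way to make the passenger ID the index; it is an assumption, not something taken from the course material.

```python
import pandas as pd

df = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1]})

# Without inplace=True, rename returns a new DataFrame and leaves df untouched
renamed = df.rename(columns={"PassengerId": "passenger ID", "Survived": "Survive"})

# Keys that match no existing column are ignored by default
unchanged = df.rename(columns={"NoSuchColumn": "x"})

# One possible way to use the renamed ID column as the index
indexed = renamed.set_index("passenger ID")

print(list(renamed.columns))
print(list(df.columns))
print(indexed.index.name)
```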
Delete a column in dataframe
- del df[column_name]
- df.drop(column_name,inplace=True , axis=1)
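drop() can delete rows or columns. With inplace=False (the default) it returns a new DataFrame and leaves the original unchanged, so it can be used to hide columns temporarily rather than delete them.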
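The difference between hiding and deleting can be sketched on a made-up three-column frame (the values are illustrative, not from the data set):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["A", "B"], "Age": [22, 38], "Cabin": ["C85", None]})

# inplace defaults to False: a new DataFrame comes back, df keeps "Cabin"
hidden = df.drop(columns=["Cabin"])
print(list(hidden.columns), list(df.columns))

# inplace=True modifies df itself and returns None
df.drop("Cabin", axis=1, inplace=True)

# del also modifies df in place
del df["Age"]
print(list(df.columns))
```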
1.1.3 viewing data
View
About viewing the overall situation of DataFrame data
- df.describe(): view summary statistics of the numeric columns of the DataFrame
- df.dtypes: view the data type of each column of the DataFrame
- df.info(): get a brief summary of the DataFrame
- df.head(n): view the first n rows of the DataFrame
- df.tail(n): view the last n rows of the DataFrame
- df.isnull(): check whether each value of the DataFrame is missing. It returns True where the value is missing and False everywhere else
About viewing DataFrame data columns
- df.columns: view the column labels of the DataFrame
- df.column_name / df[column_name]: view all items in one column of the DataFrame
- Series.value_counts() / df[column_name].value_counts(): count how often each value occurs in a column of the DataFrame
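The viewing calls listed above can be tried on a small made-up frame. The column names follow the Titanic data, but the values are invented for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female"],
    "Age": [22.0, 38.0, None, 35.0, 4.0],
})

print(df.dtypes)        # data type of each column
df.info()               # brief summary: column names, non-null counts, dtypes
print(df.head(2))       # first two rows
print(df.tail(2))       # last two rows
print(df.describe())    # summary statistics of the numeric column

# isnull() gives a True/False frame; summing counts the missing values per column
missing_per_column = df.isnull().sum()

# value_counts() counts how often each value occurs in one column
sex_counts = df["Sex"].value_counts()

print(missing_per_column)
print(sex_counts)
```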
Sort
- sort:
df.sort_values(): sort by values
df.sort_index(): sort by index
example:
df.sort_index(axis=1)                    # sort the column index in ascending order
df.sort_index(axis=1, ascending=False)   # sort the column index in descending order
df.sort_values(by=["a", "c"], ascending=False)   # sort by two columns at once, both in descending order
- rank:
df.rank(): returns the rank of each value without reordering the data
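A minimal sketch of the difference between sorting and ranking, on a made-up frame whose index and columns are deliberately out of order:

```python
import pandas as pd

df = pd.DataFrame({"b": [2, 1, 2], "a": [30, 10, 20]}, index=[2, 0, 1])

by_row_index = df.sort_index()          # rows ordered by index label
by_col_index = df.sort_index(axis=1)    # columns ordered by name

# Sort by b first; rows with equal b are then ordered by a
by_values = df.sort_values(by=["b", "a"], ascending=False)

# rank() keeps the row order and returns each value's rank instead
ranks = df["a"].rank()

print(by_values)
print(ranks)
```

Note that sort_values and sort_index return new objects here, so df itself keeps its original order.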
Filter
- df[condition]
example:
data1[data1["Age"] < 10]
data1[(10 < data1["Age"]) & (data1["Age"] < 50)]
When a filter result is going to be used as a new data set, add .copy() so that it is independent of the original DataFrame
midage = data1[(10 < data1["Age"]) & (data1["Age"] < 50)].copy()
Because rows have been filtered out, the index now has gaps and needs to be reset
midage.reset_index(inplace=True, drop=True)
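The filter, copy and reset steps can be followed on a made-up four-row frame (names and ages are invented for illustration):

```python
import pandas as pd

df = pd.DataFrame({"Name": ["A", "B", "C", "D"], "Age": [5, 30, 45, 70]})

# Each comparison needs its own parentheses because & binds tighter than < and >
mask = (df["Age"] > 10) & (df["Age"] < 50)
midage = df[mask].copy()
print(list(midage.index))      # the original row labels are kept, so there are gaps

# drop=True discards the old labels instead of keeping them as a new column
midage.reset_index(drop=True, inplace=True)
print(list(midage.index))

# Because of .copy(), this write changes midage only, not df
midage.loc[0, "Age"] = 31
```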
Read
- df.loc: read by index label
- df.iloc: read by integer position
Specific examples:
# loc
# one row, multiple columns
midage.loc[101, ["Pclass", "Sex"]]
# multiple rows, multiple columns
midage.loc[[100, 105, 108], ["Pclass", "Name", "Sex"]]
# iloc
# multiple rows, multiple columns
midage.iloc[[101, 106, 109], 2:5]
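To make the label versus position distinction concrete, here is a sketch on a made-up frame whose index labels (100, 105, 108) differ from the row positions (0, 1, 2):

```python
import pandas as pd

df = pd.DataFrame(
    {"Pclass": [3, 1, 2], "Name": ["A", "B", "C"], "Sex": ["male", "female", "male"]},
    index=[100, 105, 108],
)

by_label = df.loc[105, ["Pclass", "Sex"]]         # row whose index label is 105
by_position = df.iloc[1, [0, 2]]                  # second row, first and third columns

label_slice = df.loc[100:105, "Pclass":"Name"]    # label slices include both ends
position_slice = df.iloc[0:1, 0:2]                # position slices exclude the end

print(by_label)
print(label_slice)
print(position_slice)
```

After reset_index(drop=True) the labels and positions coincide, which is why loc and iloc can appear to behave the same on a freshly reset frame.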
1.1.4 saving data
- df.to_csv(path)
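A sketch of a save-and-reload round trip, assuming a temporary directory and a made-up file name (train_copy.csv). It also shows what the index parameter changes in the saved file.

```python
import os
import tempfile
import pandas as pd

df = pd.DataFrame({"Name": ["A", "B"], "Age": [22, 38]})

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "train_copy.csv")   # hypothetical file name

    # index=False leaves the row index out of the file
    df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

    # By default the row index is written as an extra first column
    df.to_csv(path)
    with_index = pd.read_csv(path)

print(list(reloaded.columns))
print(list(with_index.columns))
```

Reading the second file back with index_col=0 would turn that extra first column into the index again.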
Epilogue
I have now learned the statements covered in the first three lessons, but I am still not comfortable applying them to an actual data set, which is something I need to keep working on in day-to-day study.
(Incomplete; to be updated from time to time.)