Datawhale check-in: part one

Recently I joined a "hands-on data analysis" learning check-in activity. I only know a little Python. I have tried to work through "data analysis using Python" countless times, but never managed to finish it. What makes this activity appealing: 1. There is a fairly complete learning path (this matters most: when you are just starting out, having someone tell you what is important lowers the cost of learning); 2. There are dedicated people to answer questions; 3. There are study partners to cheer each other on. So I am taking advantage of this activity to learn Python data analysis. Let's go!

Dataset download: Titanic - Machine Learning from Disaster.

Chapter I

Part I: loading / adjusting / viewing / saving data

1.1.1 loading data
  • pd.read_csv()
  • pd.read_table()

Both functions can open a .csv file. The difference is that pd.read_table uses a tab as its default separator, so for a .csv file the comma must be passed through the sep="," parameter.

import pandas as pd

# The same file read two ways; read_table needs the comma separator spelled out.
data2 = pd.read_csv("Chapter 1 / 1. Data loading and preliminary observation / train.csv")
data3 = pd.read_table("Chapter 1 / 1. Data loading and preliminary observation / train.csv", sep=",")
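
The separator difference can be checked without the real file. The sketch below is only an illustration: it reads a made-up three-row sample (the csv_text string is an assumption, not the Titanic data) from memory, and shows what read_table does when sep="," is left out.

```python
import io

import pandas as pd

# Hypothetical three-row sample standing in for train.csv.
csv_text = "PassengerId,Survived,Age\n1,0,22\n2,1,38\n3,1,26\n"

# read_csv splits on commas by default.
from_csv = pd.read_csv(io.StringIO(csv_text))

# read_table splits on tabs by default, so the comma is passed explicitly.
from_table = pd.read_table(io.StringIO(csv_text), sep=",")

# Without sep=",", every line ends up in one single column.
one_column = pd.read_table(io.StringIO(csv_text))

print(from_csv.equals(from_table))
print(one_column.shape)
```

If the two frames compare equal and the third has only one column, the separator is the only thing that differs between the two functions here.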

The files above are read using a relative path.

A path can be relative or absolute. A relative path is resolved against the current working directory, which the os module can show and change with these two functions:

import os
os.getcwd()     # view the current working directory
os.chdir(path)  # change the current working directory to path
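
A small sketch of how the working directory affects relative paths. It uses a temporary directory and a file name (example.txt) that are made up for the illustration, and it switches back at the end so nothing else is affected.

```python
import os
import tempfile

start_dir = os.getcwd()  # remember where we started

with tempfile.TemporaryDirectory() as tmp:
    os.chdir(tmp)  # relative paths are now resolved against tmp
    with open("example.txt", "w") as f:
        f.write("hello")
    # The relative name resolves to a file inside the temporary directory.
    resolved = os.path.realpath("example.txt")
    inside_tmp = os.path.dirname(resolved) == os.path.realpath(tmp)
    os.chdir(start_dir)  # go back before the temporary directory is removed

print(inside_tmp)
```
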
1.1.2 adjusting data
  • df.rename: modify column names

Specific use:

Translate the column headers and use the passenger ID as the index [For English-language material, translating the field names helps us get familiar with the data more intuitively]
PassengerId => Passenger ID
Survived => Survived or not
Pclass => Passenger class (1st/2nd/3rd class cabin)
Name => Passenger name
Sex => Gender
Age => Age
SibSp => Number of siblings/spouses
Parch => Number of parents/children
Ticket => Ticket information
Fare => Ticket price
Cabin => Cabin
Embarked => Port of embarkation
data1.rename(columns={"PassengerId": "Passenger ID",
                      "Survived": "Survived or not",
                      "Pclass": "Passenger class (1st/2nd/3rd class cabin)",
                      "Name": "Passenger name",
                      "Sex": "Gender",
                      "Age": "Age",
                      "SibSp": "Number of siblings/spouses",
                      "Parch": "Number of parents/children",
                      "Ticket": "Ticket information",
                      "Fare": "Ticket price",
                      "Cabin": "Cabin",
                      "Embarked": "Port of embarkation"}, inplace=True)

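As a rough illustration of what inplace does, the sketch below renames one column of a made-up two-row frame (the frame and the new labels are assumptions for the example). Without inplace=True, rename returns a new DataFrame and the original keeps its old headers.

```python
import pandas as pd

# Hypothetical two-row frame with three of the Titanic columns.
df = pd.DataFrame({"PassengerId": [1, 2], "Survived": [0, 1], "SibSp": [1, 0]})

# Without inplace=True, rename returns a new DataFrame and leaves df unchanged.
renamed = df.rename(columns={"SibSp": "Siblings/spouses"})
print(list(df.columns))
print(list(renamed.columns))

# Keys that match no column are ignored by default, so typos pass silently.
same = df.rename(columns={"NoSuchColumn": "x"})
print(list(same.columns))
```

Because unmatched keys are ignored, it is worth printing df.columns after a rename to confirm that every header changed as intended.
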
Delete a column from a DataFrame

  • del df[column_name]
  • df.drop(column_name, axis=1, inplace=True)

drop() can delete rows or columns. With inplace=False (the default) it returns a new DataFrame and leaves the original untouched, so it can also be used to hide columns temporarily.
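
The difference between hiding and deleting can be sketched on a small made-up frame (the values are assumptions for illustration only):

```python
import pandas as pd

# Hypothetical two-row frame.
df = pd.DataFrame({"Name": ["A", "B"], "Age": [22, 38], "Fare": [7.25, 71.28]})

# Default inplace=False: the result lacks Fare, but df itself still has it.
hidden = df.drop("Fare", axis=1)
print(list(hidden.columns))
print(list(df.columns))

# del removes the column from df itself.
del df["Age"]
print(list(df.columns))

# axis=0 drops rows by index label instead of columns.
first_row_removed = df.drop(0, axis=0)
print(list(first_row_removed.index))
```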

1.1.3 viewing data
view

Viewing the DataFrame as a whole

  • df.describe(): summary statistics for the numeric columns
  • df.dtypes: the data type of each column
  • df.info(): a brief summary of the DataFrame
  • df.head(n): the first n rows
  • df.tail(n): the last n rows
  • df.isnull(): test each value for missing data; returns True where the value is missing and False otherwise
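
A minimal sketch of these calls on a made-up four-row frame with one missing age (the data is an assumption; the real exercise uses the Titanic training set):

```python
import numpy as np
import pandas as pd

# Hypothetical four-row sample with one missing Age.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, 38.0, np.nan, 35.0],
    "Pclass": [3, 1, 3, 1],
})

summary = df.describe()      # statistics for the numeric columns only
missing = df.isnull().sum()  # True counts as 1, so this counts missing values per column

print(df.dtypes)
df.info()
print(summary)
print(missing)
print(df.head(2))
print(df.tail(1))
```

Summing the result of isnull() is a common way to turn the True/False table into a count of missing values per column.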

Viewing DataFrame columns

  • df.columns: the column labels
  • df.column_name / df[column_name]: all values in one column
  • Series.value_counts() / df[column_name].value_counts(): how many times each value occurs in a column
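
A short sketch of the column-level calls, again on made-up data. The normalize=True option, which is not mentioned above, is shown only as an optional extra that returns proportions instead of counts.

```python
import pandas as pd

# Hypothetical five-row sample.
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "male"],
    "Pclass": [3, 1, 3, 1, 3],
})

print(list(df.columns))

# Attribute access and bracket access return the same column.
same = df["Sex"].equals(df.Sex)

# Occurrences of each value, most frequent first.
counts = df["Sex"].value_counts()
print(counts)

# Proportions instead of counts.
shares = df["Pclass"].value_counts(normalize=True)
print(shares)
```
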
sort
  • sort:
    df.sort_values(): sort values
    df.sort_index(): sort indexes

example:

df.sort_index(axis=1)  # sort the column labels in ascending order
df.sort_index(axis=1, ascending=False)  # sort the column labels in descending order
df.sort_values(by=["a", "c"], ascending=False)  # sort by two columns at once, both descending

  • rank:
    df.rank(): returns the rank of each value without reordering the rows
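
The sorting and ranking calls can be compared on a tiny made-up frame whose index and columns are deliberately out of order (all values are assumptions for illustration):

```python
import pandas as pd

# Hypothetical frame with shuffled index labels and column labels.
df = pd.DataFrame({"c": [3, 1, 2], "a": [10, 30, 20]}, index=[2, 0, 1])

by_rows = df.sort_index()                           # rows ordered by index label
by_cols = df.sort_index(axis=1)                     # columns ordered by label
by_value = df.sort_values(by="a", ascending=False)  # rows ordered by column a, largest first
ranks = df["a"].rank()                              # rank of each value, rows keep their order

# With two sort keys, the second only breaks ties in the first.
ties = pd.DataFrame({"a": [1, 1, 2], "c": [5, 9, 0]})
by_two = ties.sort_values(by=["a", "c"], ascending=False)

print(by_rows, by_cols, by_value, ranks, by_two, sep="\n\n")
```
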
filter
  • df[condition]

example:

data1[data1["Age"]<10]
data1[(10<data1["Age"]) & (data1["Age"]<50)]

When a filter condition is used to build a new DataFrame, add .copy() so that the result is an independent copy

midage=data1[(10<data1["Age"]) & (data1["Age"]<50)].copy()

Because rows have been filtered out, the index now has gaps and needs to be reset

midage.reset_index(inplace=True,drop=True)
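
The filter, copy, and reset steps can be followed on a made-up four-row frame (names and ages are assumptions for illustration):

```python
import pandas as pd

# Hypothetical four-row sample.
data = pd.DataFrame({"Name": ["A", "B", "C", "D"], "Age": [4, 22, 58, 35]})

# Each comparison needs its own parentheses because & binds more tightly than < and >.
mask = (10 < data["Age"]) & (data["Age"] < 50)
midage = data[mask].copy()
index_before = list(midage.index)  # the original labels survive, with gaps

# drop=True discards the old labels instead of keeping them as a new column.
midage.reset_index(drop=True, inplace=True)
index_after = list(midage.index)

# Editing the copy does not touch the original frame.
midage.loc[0, "Age"] = 23

print(index_before, index_after, data.loc[1, "Age"])
```
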
read
  • df.loc: select by index label
  • df.iloc: select by integer position

Specific examples:

loc
# One row, multiple columns
midage.loc[101, ["Pclass", "Sex"]]
# Multiple rows, multiple columns
midage.loc[[100, 105, 108], ["Pclass", "Name", "Sex"]]

iloc
# Multiple rows, multiple columns
midage.iloc[[101, 106, 109], 2:5]
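
On a freshly reset index, labels and positions coincide, which hides the difference between loc and iloc. The sketch below uses a made-up frame whose index labels (10, 20, 30) are deliberately different from the positions (0, 1, 2).

```python
import pandas as pd

# Hypothetical frame; labels 10/20/30 sit at positions 0/1/2.
df = pd.DataFrame(
    {"Pclass": [3, 1, 2], "Name": ["A", "B", "C"], "Sex": ["male", "female", "male"]},
    index=[10, 20, 30],
)

by_label = df.loc[20, ["Pclass", "Sex"]]  # the row labelled 20
by_position = df.iloc[1, [0, 2]]          # the second row, first and third columns

# Label slices include the end label; position slices exclude the end position.
label_slice = df.loc[10:20]
position_slice = df.iloc[0:1]

print(by_label, by_position, sep="\n\n")
print(len(label_slice), len(position_slice))
```
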
1.1.4 saving data
  • df.to_csv(path)
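
A sketch of a save-and-reload round trip. It writes to an in-memory buffer rather than a real path, and the two-row frame is made up for illustration. It also shows what happens to the index: by default to_csv writes it as an extra unnamed column.

```python
import io

import pandas as pd

# Hypothetical two-row frame.
df = pd.DataFrame({"PassengerId": [1, 2], "Fare": [7.25, 71.28]})

# index=False leaves the row index out of the file.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
text = buffer.getvalue()

buffer.seek(0)
restored = pd.read_csv(buffer)

# With the default index=True, the index comes back as an extra column.
with_index = pd.read_csv(io.StringIO(df.to_csv()))

print(text)
print(list(with_index.columns))
```
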
epilogue

I have now learned the statements covered in the first three lessons, but I am still a little weak at applying them to an actual data set. That is something to work on in day-to-day practice.

(incomplete, updated from time to time)

Keywords: Python Data Analysis pandas

Added by heavenly on Sat, 29 Jan 2022 17:12:44 +0200