Python pandas learning notes

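pandas is well suited to data analysis. It does much of what Excel does, but because the analysis is written as code, it is easy to reproduce. To use pandas, first import the library. The examples below also use numpy, and refer to the two libraries by their conventional aliases pd and np.

import numpy as np
import pandas as pd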

Data structures

Series

A Series is a one-dimensional array that can contain any data type. Each element has an index label.

Creation

The Series() constructor creates a Series object. Its data parameter can be a list, a dictionary or a numpy array.

s = pd.Series(np.random.randn(3))
# 0   -0.365480
# 1   -0.745080
# 2   -0.107698
# dtype: float64

By default, the elements are labeled 0, 1, 2 and so on. Pass the index parameter to specify the labels yourself; its length must match the length of the data.

pd.Series(['Alice', 42], index=['name', 'age'])
# name    Alice
# age        42
# dtype: object

The index labels of a Series do not have to be unique.
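
As a small sketch of what repeated labels mean in practice (the values and labels here are made up for illustration), selecting a repeated label returns every matching element, while a unique label returns a single value:

```python
import pandas as pd

# Two elements share the label 'a'.
dup = pd.Series([1, 2, 3], index=['a', 'a', 'b'])

both = dup['a']      # a Series holding both values labeled 'a'
single = dup['b']    # a scalar, because 'b' occurs only once

print(dup.index.is_unique)  # False
print(both.tolist())        # [1, 2]
print(single)               # 3
```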

When the data is a dictionary, the keys of the dictionary become the index.

d = {'a': 1, 'b': 2}
pd.Series(d)
# a    1
# b    2
# dtype: int64

When a dictionary is used as the data, the Series keeps the order of the dictionary keys by default. If the index parameter is passed in, the Series follows the order given in the index, and any label that is not a key of the dictionary gets a missing value (NaN).

pd.Series(d, index=['b', 'a', 'c'])
# b    2.0
# a    1.0
# c    NaN
# dtype: float64

A scalar can also be used as the data. It is repeated so that the Series has as many values as the index has labels.

pd.Series(2, index=['a', 'b', 'c'])
# a    2
# b    2
# c    2
# dtype: int64

You can also specify a name when creating a Series; otherwise its name is None.

pd.Series(np.random.randn(3), name='old')
# 0   -0.038295
# 1    0.551865
# 2    1.103370
# Name: old, dtype: float64

Properties and methods

The index attribute returns the index of a Series, and the keys() method returns the same thing.

s.index   # RangeIndex(start=0, stop=3, step=1)
s.keys()  # same as s.index

Computations on a Series work much like those on an ndarray: arithmetic, slicing and similar operations are all supported (see the usage of numpy for details). If you need the values of a Series without its index, use the array attribute. The to_numpy() method converts the Series into a numpy array, and the values attribute does the same.

s.array
# [-0.36547972911183113, -0.7450803876117753, -0.10769806920146939]
# Length: 3, dtype: float64
s.to_numpy()   # array([-0.36547973, -0.74508039, -0.10769807])
s.values

The mean(), min(), max() and std() methods return the average, minimum, maximum and standard deviation of a Series, respectively.

age = pd.Series([16, 23, 25, 33])
age.mean()  # 24.25
age.min()   # 16
age.max()   # 33
age.std()   # 6.994

When the data contains missing values, these methods skip them by default, because the skipna parameter defaults to True. If skipna is set to False, the missing values are included and the result is itself a missing value.
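
A minimal sketch of how skipna affects a statistic, using made-up values. In current pandas releases skipna defaults to True, so the missing value is skipped unless you ask otherwise:

```python
import numpy as np
import pandas as pd

scores = pd.Series([1.0, np.nan, 3.0])

mean_default = scores.mean()             # NaN is skipped: (1 + 3) / 2
mean_strict = scores.mean(skipna=False)  # NaN is included, so the result is NaN

print(mean_default)  # 2.0
print(mean_strict)   # nan
```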

Unlike numpy, operations between Series are aligned on their index labels, as described below.
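
The following sketch, with two made-up Series, shows what alignment on labels looks like: values are matched by label rather than by position, and labels present in only one operand produce NaN.

```python
import pandas as pd

x = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
y = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

total = x + y   # matched by label, not by position
print(total)
# a     NaN
# b    12.0
# c    23.0
# d     NaN
# dtype: float64
```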

A Series can also be used like a dictionary: get a value through its index label.

s = pd.Series(np.random.randn(3), index=['a', 'b', 'c'])
s['a']

The name attribute returns the name of a Series, and the rename() method returns a renamed copy.

s = pd.Series(np.random.randn(3), name='old')
s.name  # 'old'
s.rename('new')
# 0    1.599583
# 1   -0.281893
# 2    1.441235
# Name: new, dtype: float64

DataFrame

A DataFrame is a two-dimensional labeled data structure, which can be thought of as a table or as a dictionary of Series objects. It can be manipulated by row index and column name.

Creation

The DataFrame() constructor creates a data frame. It accepts a dictionary of one-dimensional numpy arrays, lists, dictionaries or Series, a list of dictionaries, or a two-dimensional numpy array. When creating from a dictionary, you can pass the index parameter to specify the row index.

pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [22, 44]}, index=['007', '009'])
#       Name  Age
# 007  Alice   22
# 009    Bob   44

Create from a list of dictionaries.

pd.DataFrame([{'Name': 'Alice', 'Age': 24}, {'Name': 'Bob', 'Age': 28}])
#     Name  Age
# 0  Alice   24
# 1    Bob   28

Create from a dictionary of Series. You can also pass the columns parameter to specify the column names.

pd.DataFrame(
    {'Name': pd.Series(['Alice', 'Bob'], index=['007', '009']),
     'Age': pd.Series([24, 29], index=['007', '009'])},
    index=['007', '009'],
    columns=['Name', 'Age'],
)
#       Name  Age
# 007  Alice   24
# 009    Bob   29

Create from a numpy array.

pd.DataFrame(np.random.randn(2, 3), index=['first', 'second'], columns=['one', 'two', 'three'])
#              one       two     three
# first  -1.334441 -0.222222 -0.993980
# second  1.015804 -0.273330 -0.625443

Properties and methods

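The index and columns attributes return the row index and the column names of a data frame, respectively.

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [22, 44]}, index=['007', '009'])
df.index    # Index(['007', '009'], dtype='object')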
df.columns  # Index(['Name', 'Age'], dtype='object')

The shape attribute returns the size of the data as a (rows, columns) tuple.

df.shape    # (2, 2)

To get the column names as a plain list, use df.columns.tolist().

The dtypes attribute shows the data type of each column.

df.dtypes

The T attribute returns the transposed data frame.

df.T

The info() method prints a more detailed summary, including the data type and the number of non-missing values of each column.

df.info()

The describe() method quickly computes summary statistics for the numeric columns.

stu = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [22, 44]}, index=['007', '009'])
stu.describe()
#              Age
# count   2.000000
# mean   33.000000
# std    15.556349
# min    22.000000
# 25%    27.500000
# 50%    33.000000
# 75%    38.500000
# max    44.000000

The copy() method can copy a data frame.

df2 = df.copy()

Reading and writing files

CSV files

read_csv() reads CSV files. The first parameter is the location of the file to read. The sep parameter is the separator, which is a comma by default. If keep_default_na is set to False, the default missing-value markers are no longer recognized, so empty fields are read as empty strings rather than as missing values. The na_values parameter specifies additional strings to treat as missing values.

df = pd.read_csv('data/gap.tsv', sep='\t')
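
A small sketch of how na_values and keep_default_na change what is read. The CSV text and the marker 'unknown' are made up for illustration, and io.StringIO stands in for a file on disk:

```python
import io
import pandas as pd

# Hypothetical data: Bob's age is blank, Carol's is the marker 'unknown'.
raw = "name,age\nAlice,22\nBob,\nCarol,unknown\n"

default = pd.read_csv(io.StringIO(raw))
marked = pd.read_csv(io.StringIO(raw), na_values=['unknown'])
literal = pd.read_csv(io.StringIO(raw), keep_default_na=False)

print(default['age'].isnull().sum())  # 1: only the blank field is missing
print(marked['age'].isnull().sum())   # 2: the blank field plus 'unknown'
print(literal['age'].tolist())        # ['22', '', 'unknown']
```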

The to_csv() method saves a data frame as a CSV file. The first parameter is the location and name of the file to save, and the sep parameter is the separator.

df.to_csv('data/gap2.tsv', sep='\t')
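
Note that to_csv() writes the row index as an extra first column unless index=False is passed. The sketch below uses a made-up frame and omits the file path, in which case to_csv() returns the text instead of writing a file:

```python
import io
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [22, 44]})

with_index = df.to_csv(sep='\t')                  # no path given: returns a string
without_index = df.to_csv(sep='\t', index=False)

print(with_index.splitlines()[0].split('\t'))     # ['', 'Name', 'Age']
print(without_index.splitlines()[0].split('\t'))  # ['Name', 'Age']

# Reading the index-free text back gives the original frame.
restored = pd.read_csv(io.StringIO(without_index), sep='\t')
print(restored.equals(df))  # True
```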

Excel files

To save a Series as an Excel file, first convert it into a data frame with the to_frame() method.

df = ser.to_frame()

The to_excel() method writes a data frame to an Excel file; it only needs the location and name of the file to save. The row index is written by default; set the index parameter to False to leave it out.

df.to_excel('new.xlsx')

Data selection

Extract column

To get a column, put its name in square brackets []. Dot notation (df.name) also returns a column.

name_df = df['name']
df.name

If you want to get multiple columns, you need to pass in a list containing multiple column names.

df[['name', 'age']]

Add column

Adding a column uses the same syntax as extracting one: specify the column name and assign the values.

stu['birthday'] = ['19980110', '19981001']

The insert() method inserts a column at a given position, for example inserting Year at position 1 (the second column, since positions start at 0).

df.insert(1, 'Year', [1996, 2000])

The assign() method derives a new column from existing columns. A function can also be passed as its argument; it receives the data frame and returns the new column.

df.assign(new_col=df['Age']+1)
df.assign(new_col=lambda x: x['Age'] + 1)
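
One point worth illustrating is that assign() returns a new data frame and leaves the original untouched. A minimal sketch, where the column name AgeNextYear is made up for the example:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [22, 44]})

older = df.assign(AgeNextYear=lambda x: x['Age'] + 1)

print(list(df.columns))               # ['Name', 'Age']
print(list(older.columns))            # ['Name', 'Age', 'AgeNextYear']
print(older['AgeNextYear'].tolist())  # [23, 45]
```

To keep the new column, assign the result back to a variable, as done with older here.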

Delete column

The pop() method removes a column from the data frame and returns it.

df.pop('Age')
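
The sketch below contrasts pop(), which changes the data frame in place, with drop(), which returns a new data frame. The City column is a made-up addition for the example:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [22, 44], 'City': ['Paris', 'Oslo']})

ages = df.pop('Age')                 # returns the removed column; df itself changes
trimmed = df.drop(columns=['City'])  # returns a new frame; df is left as it was

print(ages.tolist())          # [22, 44]
print(list(df.columns))       # ['Name', 'City']
print(list(trimmed.columns))  # ['Name']
```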

Extract row

The head() and tail() methods return the first 5 rows and the last 5 rows of the data, respectively. Pass a number to choose how many rows are returned.

df.head()  # First 5 rows
df.head(1) # First row
df.tail()  # Last 5 rows
df.tail(1) # Last row

The loc indexer takes a row index label and returns the matching row.

df.loc['007']    # Get the row with row index '007'

The iloc indexer works similarly, but takes an integer position instead of a label.

df.iloc[0]    # Get the first row

Similarly, to get multiple rows, pass in a list. Slicing also works.

df.loc[[0, 2, 3]]
df.iloc[[0, 2, 3]]
df[5:10]
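
One difference between the two indexers is easy to miss when slicing: a loc slice includes both endpoints, while an iloc slice excludes the end, like a Python list. A minimal sketch with a made-up three-row frame:

```python
import pandas as pd

df = pd.DataFrame({'Age': [22, 44, 31]}, index=['007', '009', '011'])

by_label = df.loc['007':'009']   # label slice: both endpoints are included
by_position = df.iloc[0:1]       # position slice: the end is excluded

print(len(by_label))     # 2
print(len(by_position))  # 1
```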

Add row

To add a Series to a data set as a row, first convert it into a one-row data frame.

new_row_df = pd.DataFrame([['n1', 'n2', 'n3', 'n4']], columns=['A', 'B', 'C', 'D'])
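
The one-row data frame can then be attached with pd.concat(). This is a sketch that assumes an existing frame with the same four columns; the values in df are made up:

```python
import pandas as pd

# Assumed existing data with columns A to D.
df = pd.DataFrame([['a1', 'a2', 'a3', 'a4']], columns=['A', 'B', 'C', 'D'])
new_row_df = pd.DataFrame([['n1', 'n2', 'n3', 'n4']], columns=['A', 'B', 'C', 'D'])

# ignore_index=True renumbers the rows 0, 1, ... instead of repeating label 0.
combined = pd.concat([df, new_row_df], ignore_index=True)

print(combined.shape)            # (2, 4)
print(combined.loc[1].tolist())  # ['n1', 'n2', 'n3', 'n4']
```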

Selecting rows and columns together

This also uses loc. Inside the square brackets, a comma separates the row index on the left from the column name on the right.

df.loc[4, 'name']

To extract multiple rows and columns, pass in lists.

df.loc[[2, 4], ['name', 'age']]

With iloc, use integer row and column positions instead of row index labels and column names.

df.iloc[4, 0]
df.iloc[[1, 2], [0, 1]]

Boolean indexing

You can also filter rows with a list of boolean values, one per row.

stu[[True, False]]
#       Name  Age
# 007  Alice   22
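
In practice the boolean values usually come from a condition on a column rather than a hand-written list. A minimal sketch on the same stu data, with thresholds chosen only for illustration:

```python
import pandas as pd

stu = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [22, 44]}, index=['007', '009'])

mask = stu['Age'] > 30   # a boolean Series, one value per row
older = stu[mask]

# Combine conditions with & (and) or | (or); each condition needs parentheses.
both = stu[(stu['Age'] > 20) & (stu['Name'] != 'Bob')]

print(mask.tolist())         # [False, True]
print(older.index.tolist())  # ['009']
print(both.index.tolist())   # ['007']
```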

Concatenating data

Suppose there are two data frames, df1 and df2, with the same column names. The concat() function concatenates the two data frames row-wise.

row_concat = pd.concat([df1, df2])

Older versions of pandas also offered an append() method for adding data. It was deprecated in pandas 1.4 and removed in pandas 2.0, so use concat() instead.

pd.concat([df1, df2])   # was: df1.append(df2)

You can also add a dictionary as a row. As long as its keys match the column names of the data frame, it is concatenated normally once wrapped in a one-row data frame.

data_dict = {'A': 'n1', 'B': 'n2', 'C': 'n3', 'D': 'n4'}
pd.concat([df1, pd.DataFrame([data_dict])], ignore_index=True)   # was: df1.append(data_dict, ignore_index=True)

Processing missing values

Data may contain the missing value np.nan (NaN), which is not equal to anything, including itself. The isnull() function tests whether a value is missing and returns True if it is. The notnull() function tests the opposite and returns False for a missing value.

from numpy import nan   # the NaN alias was removed in NumPy 2.0
pd.isnull(nan)   # True
pd.notnull(nan)  # False
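
Because a missing value is not equal to anything, comparing with == cannot find it. The sketch below, on a made-up Series, shows why isnull() is needed instead:

```python
import numpy as np
import pandas as pd

print(np.nan == np.nan)  # False

s = pd.Series([1.0, np.nan])
by_equality = (s == np.nan).tolist()  # the comparison never matches
by_isnull = s.isnull().tolist()       # isnull() finds the missing value

print(by_equality)  # [False, False]
print(by_isnull)    # [False, True]
```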

Counting missing values

The count() method returns the number of non-missing values in each column.

df.count()   # Number of non-missing values per column

Combined with the numpy library, you can count how many values are missing.

import numpy as np
np.count_nonzero(df['Age'].isnull())   # How many missing values are in the Age column
np.count_nonzero(df.isnull())          # How many missing values are there in the entire dataset
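
As an alternative that stays within pandas, isnull() can be followed by sum(), since True counts as 1. A sketch on made-up data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', None, 'Carol'], 'Age': [22, np.nan, np.nan]})

per_column = df.isnull().sum()  # missing values in each column
total = int(per_column.sum())   # missing values in the whole data set

print(int(per_column['Name']))  # 1
print(int(per_column['Age']))   # 2
print(total)                    # 3
```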

The value_counts() method of a Series counts how often each distinct value occurs in the column. It skips missing values by default; pass dropna=False to include the number of missing values.

df.Age.value_counts(dropna=False).head()

Fill in missing values

The fillna() method fills in missing values: every missing value is replaced by the value you pass in.

df.fillna(0)

The ffill() and bfill() methods fill each missing value with the previous value and with the next value, respectively. With ffill(), a missing value at the very beginning has no previous value, so it stays missing. Older code passes method='ffill' or method='bfill' to fillna() instead, which is deprecated since pandas 2.1. The interpolate() method fills by interpolation.

df.ffill()   # older spelling: df.fillna(method='ffill')
df.interpolate()
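
The three filling strategies can be compared side by side. A minimal sketch on a made-up Series that starts with a missing value:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, 3.0])

forward = s.ffill()       # previous value; the leading NaN has none and stays
backward = s.bfill()      # next value
linear = s.interpolate()  # linear interpolation by default; the leading NaN stays

print(forward.tolist())   # [nan, 1.0, 1.0, 3.0]
print(backward.tolist())  # [1.0, 1.0, 3.0, 3.0]
print(linear.tolist())    # [nan, 1.0, 2.0, 3.0]
```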

The dropna() method deletes rows that contain missing values.

df.dropna()

Data alignment and calculation

Calculations between data frames are aligned on both the row index and the column labels.

df1 = pd.DataFrame(np.random.randn(3, 2), index=['a', 'b', 'c'])
df2 = pd.DataFrame(np.random.randn(4, 3), index=['a', 'b', 'c', 'd'])
df1 + df2
#           0         1   2
# a -0.920320  0.947203 NaN
# b  1.089856 -1.327052 NaN
# c  2.604738 -0.038784 NaN
# d       NaN       NaN NaN
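
If the NaN results for unmatched labels are not wanted, the add() method accepts a fill_value that stands in for a value missing from one of the two frames. A sketch with made-up single-column frames:

```python
import pandas as pd

left = pd.DataFrame({'x': [1, 2]}, index=['a', 'b'])
right = pd.DataFrame({'x': [10, 20]}, index=['b', 'c'])

plain = left + right                    # 'a' and 'c' exist on one side only: NaN
filled = left.add(right, fill_value=0)  # the absent side is treated as 0

print(plain['x'].isna().tolist())  # [True, False, True]
print(filled['x'].tolist())        # [1.0, 12.0, 20.0]
```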

Calculations between a DataFrame and a Series are carried out by broadcasting, and so are calculations with a scalar.

df1 - df1.iloc[0]   # Subtract the first row from each row
#           0         1
# a  0.000000  0.000000
# b  1.431525  0.879529
# c  0.207648 -0.617918

df1 + 100   # Add 100 to each element
#             0           1
# a   99.545553   99.951717
# b  100.977078  100.831246
# c   99.753201   99.333800
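
By default the labels of the Series are matched against the column names, so the Series is applied to every row. To apply a Series down the rows instead, the sub() method takes axis=0. A sketch with a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2], 'y': [10, 20]}, index=['a', 'b'])

row_wise = df - df.iloc[0]          # Series labels matched to the columns
col_wise = df.sub(df['x'], axis=0)  # Series labels matched to the row index

print(row_wise.values.tolist())  # [[0, 0], [1, 10]]
print(col_wise.values.tolist())  # [[0, 9], [0, 18]]
```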

Display

When a data frame has too many rows or columns, the middle ones are shown as an ellipsis; the display can be changed through options. The to_string() method returns a string representation of the whole table, but it does not adapt to the width of the console.

df.to_string()

The display.max_colwidth option sets the maximum display width of each column.

pd.set_option("display.max_colwidth", 100)
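
set_option() changes the setting for the rest of the session. For a temporary change, pandas also provides option_context(), which restores the previous value when the with block ends. A minimal sketch:

```python
import pandas as pd

before = pd.get_option("display.max_colwidth")

with pd.option_context("display.max_colwidth", 100):
    inside = pd.get_option("display.max_colwidth")  # 100 only inside the block

after = pd.get_option("display.max_colwidth")

print(inside)           # 100
print(after == before)  # True
```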

Keywords: Python Data Analysis Data Mining

Added by Simsonite on Sun, 19 Dec 2021 23:39:19 +0200