pandas is well suited to data analysis. It covers much of the same ground as Excel, but an analysis written with pandas is easier to reproduce. To use pandas, first import the library; the examples below also use numpy.
import numpy as np
import pandas as pd
Data structures
Series
A Series is a one-dimensional labeled array that can hold any data type. Each element has an index label.
Creating a Series
The Series() constructor creates a Series object. Its data can be a list, a dictionary or a numpy array.
s = pd.Series(np.random.randn(3))
# 0   -0.365480
# 1   -0.745080
# 2   -0.107698
# dtype: float64
By default the elements are labeled 0, 1, 2 and so on. Pass the index parameter to set the labels; its length must equal the length of the data.
pd.Series(['Alice', 42], index=['name', 'age'])
# name    Alice
# age        42
# dtype: object
The index labels of a Series do not have to be unique.
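As a small illustration of repeated labels (the values and the names dup, both_a and only_b are made up for this sketch), selecting a repeated label returns every matching element:

```python
import pandas as pd

# Two elements share the label 'a'.
dup = pd.Series([1, 2, 3], index=['a', 'a', 'b'])

# Selecting a repeated label returns all matching elements as a Series.
both_a = dup['a']

# Selecting a unique label returns a single value.
only_b = dup['b']

# The index itself reports whether its labels are unique.
print(dup.index.is_unique)
```

Repeated labels are allowed, but some operations that need an unambiguous label may fail on them, so it can be worth checking is_unique first.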
When the data is a dictionary, the dictionary keys become the index.
d = {'a': 1, 'b': 2}
pd.Series(d)
# a    1
# b    2
# dtype: int64
By default the Series keeps the order of the dictionary. If the index parameter is passed, the Series follows the order given there, and a label that is not a key of the dictionary gets NaN.
pd.Series(d, index=['b', 'a', 'c'])
# b    2.0
# a    1.0
# c    NaN
# dtype: float64
A scalar can also be used as the data. It is repeated to match the length of the index.
pd.Series(2, index=['a', 'b', 'c'])
# a    2
# b    2
# c    2
# dtype: int64
You can also pass name when creating a Series; otherwise the name is None.
pd.Series(np.random.randn(3), name='old')
# 0   -0.038295
# 1    0.551865
# 2    1.103370
# Name: old, dtype: float64
Properties and methods
The index attribute returns the index of a Series; the keys() method returns the same thing.
person = pd.Series(['Alice', 42], index=['name', 'age'])
person.index # Index(['name', 'age'], dtype='object')
person.keys() # same result
A Series behaves much like a numpy ndarray: it supports arithmetic operations and slicing (see numpy usage for details). To get the values of a Series without its index, use the array attribute. The to_numpy() method converts the Series into a numpy array; the values attribute does the same.
s.array
# [-0.36547972911183113, -0.7450803876117753, -0.10769806920146939]
# Length: 3, dtype: float64
s.to_numpy() # array([-0.36547973, -0.74508039, -0.10769807])
s.values # same array
The mean(), min(), max() and std() methods return the mean, minimum, maximum and standard deviation of a Series respectively.
age = pd.Series([16, 23, 25, 33])
age.mean() # 24.25
age.min() # 16
age.max() # 33
age.std() # 6.994
In numpy, a missing value in the data makes the result a missing value too. The pandas methods take a skipna parameter, which is True by default, so missing values are ignored in the calculation; with skipna=False the result is NaN.
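A short sketch of how skipna affects mean(), using made-up ages. In current pandas skipna defaults to True, while plain numpy propagates the missing value:

```python
import numpy as np
import pandas as pd

ages = pd.Series([16, 23, np.nan, 33])

# Default: the missing value is skipped, so this is (16 + 23 + 33) / 3.
mean_default = ages.mean()

# With skipna=False the missing value propagates into the result.
mean_strict = ages.mean(skipna=False)

# The same data as a plain numpy array also gives NaN.
numpy_mean = np.mean(ages.to_numpy())

print(mean_default, mean_strict, numpy_mean)
```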
Unlike numpy, operations between Series align on index labels, as described below.
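A minimal sketch of label alignment between two Series, with made-up values. The names x, y, total and positional are only for illustration:

```python
import numpy as np
import pandas as pd

x = pd.Series([1, 2, 3], index=['a', 'b', 'c'])
y = pd.Series([10, 20, 30], index=['b', 'c', 'd'])

# Labels are matched before adding; a label present on one side only gives NaN.
total = x + y

# The same data as plain numpy arrays is added position by position.
positional = x.to_numpy() + y.to_numpy()

print(total)
print(positional)
```

Only the labels b and c exist in both Series, so only those two positions of total hold numbers.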
A Series can also be used like a dictionary: the index label is the key that returns the value.
s = pd.Series(np.random.randn(3), index=['a', 'b', 'c'])
s['a']
The name attribute returns the name of a Series, and the rename() method returns a copy with a new name.
s = pd.Series(np.random.randn(3), name='old')
s.name # 'old'
s.rename('new')
# 0    1.599583
# 1   -0.281893
# 2    1.441235
# Name: new, dtype: float64
DataFrame
A DataFrame is a two-dimensional labeled data structure, which can be thought of as a table or as a dictionary of Series objects. It can be manipulated by row index and column name.
Creating a DataFrame
The DataFrame() constructor creates a DataFrame. It accepts a dictionary of one-dimensional numpy arrays, lists, dictionaries or Series, or a two-dimensional numpy array. When creating from a dictionary, pass the index parameter to specify the row index.
pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [22, 44]}, index=['007', '009'])
#       Name  Age
# 007  Alice   22
# 009    Bob   44
Create from a list of dictionaries.
pd.DataFrame([{'Name': 'Alice', 'Age': 24}, {'Name': 'Bob', 'Age': 28}])
#     Name  Age
# 0  Alice   24
# 1    Bob   28
Create from a dictionary of Series. The columns parameter can also be passed to specify the column names.
pd.DataFrame({'Name': pd.Series(['Alice', 'Bob'], index=['007', '009']),
              'Age': pd.Series([24, 29], index=['007', '009'])},
             index=['007', '009'], columns=['Name', 'Age'])
#       Name  Age
# 007  Alice   24
# 009    Bob   29
Create from a numpy array.
pd.DataFrame(np.random.randn(2, 3), index=['first', 'second'], columns=['one', 'two', 'three'])
#              one       two     three
# first  -1.334441 -0.222222 -0.993980
# second  1.015804 -0.273330 -0.625443
Properties and methods
The index and columns attributes return the row index and the column names of a DataFrame respectively.
df.index # Index(['007', '009'], dtype='object')
df.columns # Index(['Name', 'Age'], dtype='object')
The shape attribute returns the size of the data as a (rows, columns) tuple.
df.shape
The column names can also be returned as a list, for example with df.columns.tolist().
The dtypes attribute returns the data type of each column.
df.dtypes
The T attribute returns the transposed DataFrame.
df.T
The info() method prints a more detailed summary of the DataFrame.
df.info()
The describe() method quickly computes summary statistics.
stu = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [22, 44]}, index=['007', '009'])
stu.describe()
#              Age
# count   2.000000
# mean   33.000000
# std    15.556349
# min    22.000000
# 25%    27.500000
# 50%    33.000000
# 75%    38.500000
# max    44.000000
The copy() method returns a copy of a DataFrame.
df2 = df.copy()
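A brief sketch of why copy() matters, with made-up data: plain assignment only creates a second name for the same object, while copy() creates an independent DataFrame.

```python
import pandas as pd

df = pd.DataFrame({'Age': [22, 44]})

alias = df          # a second name for the same DataFrame
df2 = df.copy()     # an independent copy

# Changing the copy leaves the original untouched.
df2.loc[0, 'Age'] = 99

print(df.loc[0, 'Age'], df2.loc[0, 'Age'])
```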
Reading and writing files
CSV files
read_csv() reads CSV files. The first parameter is the path of the file to read. The sep parameter is the separator, which is a comma by default. Setting keep_default_na to False turns off the default missing-value markers, so an empty field is read as an empty string instead of a missing value. The na_values parameter specifies additional strings to treat as missing values.
df = pd.read_csv('data/gap.tsv', sep='\t')
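The effect of keep_default_na and na_values can be sketched without a file on disk by reading from an in-memory string. The content of text below is hypothetical and stands in for a tab-separated file:

```python
import io
import pandas as pd

# Hypothetical tab-separated content standing in for a file on disk.
text = "name\tcity\nAlice\t\nBob\tunknown\n"

# Default behavior: the empty field becomes NaN, 'unknown' stays a string.
default = pd.read_csv(io.StringIO(text), sep='\t')

# keep_default_na=False: the empty field is read as an empty string.
no_default = pd.read_csv(io.StringIO(text), sep='\t', keep_default_na=False)

# na_values adds 'unknown' to the strings treated as missing.
custom = pd.read_csv(io.StringIO(text), sep='\t', na_values=['unknown'])

print(default)
print(no_default)
print(custom)
```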
The to_csv() method saves a DataFrame as a CSV file. The first parameter is the path and file name to save to, and the sep parameter is the separator.
df.to_csv('data/gap2.tsv', sep='\t')
Excel files
To save a Series as an Excel file, you can first convert it into a DataFrame with the to_frame() method.
df = ser.to_frame()
The to_excel() method writes a DataFrame to an Excel file; only the path and file name are required. Set the index parameter to False to leave out the row index; otherwise the row index is saved as well.
df.to_excel('new.xlsx')
Data selection
Extract column
To get a column, put its name in square brackets []. Attribute access with a dot (.) also returns the column.
name_df = df['name']
df.name
If you want to get multiple columns, you need to pass in a list containing multiple column names.
df[['name', 'age']]
Add column
Adding a column uses the same syntax as selecting one: assign the values to the column name.
stu['birthday'] = ['19980110', '19981001']
The insert() method inserts a column at a given position, for example a Year column at position 1.
df.insert(1, 'Year', [1996, 2000])
The assign() method derives a new column from existing columns and returns a new DataFrame. A function can also be passed as the value.
df.assign(new_col=df['Age'] + 1)
df.assign(new_col=lambda x: x['Age'] + 1)
Delete column
The pop() method removes a column from the DataFrame and returns it.
df.pop('Age')
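The column operations above can be tried together on a small made-up table. The City column and the names older and ages are illustrative only. Note which calls change df in place and which return a new object:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [22, 44]})

# Assignment adds a new column at the end, in place.
df['City'] = ['Paris', 'Oslo']

# insert() adds a new column at position 1, in place.
df.insert(1, 'Year', [1996, 2000])

# assign() leaves df unchanged and returns a new DataFrame.
older = df.assign(AgeNext=lambda x: x['Age'] + 1)

# pop() removes 'Age' from df and returns it as a Series.
ages = df.pop('Age')

print(list(df.columns))
print(list(older.columns))
```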
Extract row
The head() and tail() methods return the first 5 rows and the last 5 rows of the data respectively. Pass a number to set how many rows are returned.
df.head() # First 5 rows
df.head(1) # First row
df.tail() # Last 5 rows
df.tail(1) # Last row
loc selects rows by index label.
df.loc['007'] # Get the row with index label '007'
iloc works the same way but takes the integer position of the row.
df.iloc[0] # Get the first row
To get several rows, pass in a list. Slicing also works.
df.loc[[0, 2, 3]]
df.iloc[[0, 2, 3]]
df[5:10]
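The difference between label-based and position-based selection is easiest to see when the index is not the default 0, 1, 2. A sketch with a made-up three-row table:

```python
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', 'Carol'], 'Age': [22, 44, 31]},
                  index=['007', '009', '011'])

# loc takes index labels.
by_label = df.loc[['007', '011']]

# iloc takes integer positions; these are the same two rows.
by_position = df.iloc[[0, 2]]

# A label slice includes its end label.
label_slice = df.loc['007':'009']

# A position slice excludes its end position.
position_slice = df.iloc[0:1]

print(len(label_slice), len(position_slice))
```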
Add row
To add a Series to a dataset as a new row, first convert it into a DataFrame with a single row.
new_row_df = pd.DataFrame([['n1', 'n2', 'n3', 'n4']], columns=['A', 'B', 'C', 'D'])
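One possible way to go from a Series to a one-row DataFrame and then add it, sketched with made-up values. It assumes the labels of the Series match the column names of the target DataFrame:

```python
import pandas as pd

df = pd.DataFrame([['a1', 'a2', 'a3', 'a4']], columns=['A', 'B', 'C', 'D'])

# A Series whose labels match the column names of df.
row = pd.Series(['n1', 'n2', 'n3', 'n4'], index=['A', 'B', 'C', 'D'])

# to_frame() gives a single column; transposing turns it into a single row.
new_row_df = row.to_frame().T

# concat() stacks the two frames; ignore_index renumbers the rows from 0.
combined = pd.concat([df, new_row_df], ignore_index=True)

print(combined)
```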
Operate on both rows and columns
loc can also select rows and columns at the same time. Inside the square brackets, the part before the comma is the row index and the part after it is the column name.
df.loc[4, 'name']
To extract several rows and columns, pass in lists.
df.loc[[2, 4], ['name', 'age']]
With iloc, use row and column positions instead of row labels and column names.
df.iloc[4, 0]
df.iloc[[1, 2], [0, 1]]
Boolean indexing
A list of boolean values can also be used as an index to filter rows.
stu[[True, False]]
#       Name  Age
# 007  Alice   22
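In practice the boolean values usually come from a comparison on a column rather than a hand-written list. A minimal sketch using the same stu table, with conditions chosen only for illustration:

```python
import pandas as pd

stu = pd.DataFrame({'Name': ['Alice', 'Bob'], 'Age': [22, 44]}, index=['007', '009'])

# A comparison on a column produces one boolean per row.
mask = stu['Age'] > 30

# Rows where the mask is True are kept.
older = stu[mask]

# Conditions are combined with & and |, each wrapped in parentheses.
both = stu[(stu['Age'] > 20) & (stu['Name'] != 'Bob')]

print(older)
print(both)
```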
Concatenating data
Suppose there are two DataFrames, df1 and df2, with the same column names. The concat() function concatenates the two DataFrames by row.
row_concat = pd.concat([df1, df2])
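Older versions of pandas also offered an append() method for adding data. It was deprecated in pandas 1.4 and removed in pandas 2.0, where concat() takes its place.
df1.append(df2) # pandas < 2.0 only
A dictionary can be added as a row as well. As long as its keys match the column names of the DataFrame, it is concatenated normally.
data_dict = {'A': 'n1', 'B': 'n2', 'C': 'n3', 'D': 'n4'}
df1.append(data_dict, ignore_index=True) # pandas < 2.0 only
pd.concat([df1, pd.DataFrame([data_dict])], ignore_index=True) # pandas 2.0 and later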
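Since append() is no longer available in pandas 2.0 and later, a dictionary can be added as a row by wrapping it in a one-row DataFrame and using concat(). A sketch with a made-up df1:

```python
import pandas as pd

df1 = pd.DataFrame({'A': ['a1'], 'B': ['b1'], 'C': ['c1'], 'D': ['d1']})
data_dict = {'A': 'n1', 'B': 'n2', 'C': 'n3', 'D': 'n4'}

# Wrapping the dictionary in a list makes it a single row.
row_df = pd.DataFrame([data_dict])

# ignore_index=True renumbers the rows of the result from 0.
result = pd.concat([df1, row_df], ignore_index=True)

print(result)
```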
Handling missing values
Data may contain the missing value np.nan, which is not equal to anything, not even itself. The isnull() function tests whether a value is missing and returns True if it is. The notnull() function tests whether a value is present and returns False for a missing value.
from numpy import nan
pd.isnull(nan) # True
pd.notnull(nan) # False
Counting missing values
The count() method returns the number of non-missing values in each column.
df.count() # Number of non-missing values per column
The number of missing values can be counted with the help of numpy.
import numpy as np
np.count_nonzero(df['Age'].isnull()) # Number of missing values in the Age column
np.count_nonzero(df.isnull()) # Number of missing values in the entire dataset
The value_counts() method of a Series counts how often each distinct value occurs in the column. It skips missing values by default; pass dropna=False to count the missing values as well.
df.Age.value_counts(dropna=False).head()
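The counting methods can be compared on a small made-up Age column. The sum of isnull() is used here as a pandas-only alternative to np.count_nonzero:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Age': [22, np.nan, 22, 44, np.nan]})

# Number of non-missing values in the column.
non_missing = df['Age'].count()

# True counts as 1, so summing the boolean mask counts the missing values.
missing = df['Age'].isnull().sum()

# With dropna=False the missing values get their own entry in the counts.
counts = df['Age'].value_counts(dropna=False)

print(non_missing, missing)
print(counts)
```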
Fill in missing values
The fillna() method replaces every missing value with the value passed in.
df.fillna(0)
Setting the method parameter to ffill or bfill fills each gap with the previous value or the next value respectively. With ffill, a missing value at the very beginning cannot be filled. The interpolate() method fills by interpolation.
df.fillna(method='ffill') # deprecated since pandas 2.1, where df.ffill() does the same
df.interpolate()
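The three filling strategies can be compared on a short made-up Series. This sketch uses the ffill() and bfill() methods, which are the non-deprecated spelling in recent pandas versions:

```python
import numpy as np
import pandas as pd

s = pd.Series([np.nan, 1.0, np.nan, 3.0, np.nan])

# Forward fill: each gap takes the previous value; the leading gap stays missing.
forward = s.ffill()

# Backward fill: each gap takes the next value; the trailing gap stays missing.
backward = s.bfill()

# Linear interpolation fills the gap between 1.0 and 3.0 with 2.0.
linear = s.interpolate()

print(forward.tolist())
print(backward.tolist())
print(linear.tolist())
```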
The dropna() method deletes rows that contain missing values.
df.dropna()
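dropna() also takes parameters that control which rows count as missing. A sketch with a made-up table, using the how and subset parameters:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Name': ['Alice', 'Bob', None], 'Age': [22, np.nan, np.nan]})

# Default: drop every row that has at least one missing value.
any_missing = df.dropna()

# how='all': drop only rows in which every value is missing.
all_missing = df.dropna(how='all')

# subset: only look at the listed columns when deciding.
by_column = df.dropna(subset=['Name'])

print(len(any_missing), len(all_missing), len(by_column))
```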
Data alignment and calculation
Operations between DataFrames align on both the row index and the column labels.
df1 = pd.DataFrame(np.random.randn(3, 2), index=['a', 'b', 'c'])
df2 = pd.DataFrame(np.random.randn(4, 3), index=['a', 'b', 'c', 'd'])
df1 + df2
#           0         1   2
# a -0.920320  0.947203 NaN
# b  1.089856 -1.327052 NaN
# c  2.604738 -0.038784 NaN
# d       NaN       NaN NaN
Operations between a DataFrame and a Series use broadcasting, and so do operations with a scalar.
df1 - df1.iloc[0] # Subtract the first row from each row
#           0         1
# a  0.000000  0.000000
# b  1.431525  0.879529
# c  0.207648 -0.617918
df1 + 100 # Add 100 to each element
#             0           1
# a   99.545553   99.951717
# b  100.977078  100.831246
# c   99.753201   99.333800
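By default the labels of the Series are matched against the column names and the operation is repeated for every row. To match against the row index instead, the sub() method with axis=0 can be used. A sketch with made-up numbers:

```python
import pandas as pd

df = pd.DataFrame({'x': [1, 2, 3], 'y': [10, 20, 30]}, index=['a', 'b', 'c'])

# The Series labels ('x', 'y') match the column names; applied to every row.
row_wise = df - df.iloc[0]

# axis=0 matches the Series labels ('a', 'b', 'c') against the row index.
col_wise = df.sub(df['x'], axis=0)

print(row_wise)
print(col_wise)
```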
Display
When a DataFrame has too many rows or columns, the middle ones are shown as an ellipsis; display options change this behavior. The to_string() method returns a string representation of the whole table, but it does not adapt to the width of the console.
df.to_string()
The display.max_colwidth option sets the maximum display width of each column.
pd.set_option("display.max_colwidth", 100)
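Options set with set_option() stay in effect for the rest of the session. A sketch of two ways to limit the change, using option_context() for a temporary setting and reset_option() to restore the default (the 100-row table is made up):

```python
import pandas as pd

df = pd.DataFrame({'n': range(100)})

# Inside the with block at most 6 rows are shown; the old value returns on exit.
with pd.option_context('display.max_rows', 6):
    short = repr(df)

# to_string() always renders every row.
full = df.to_string()

# set_option() changes the value until it is reset.
before = pd.get_option('display.max_colwidth')
pd.set_option('display.max_colwidth', 100)
after = pd.get_option('display.max_colwidth')
pd.reset_option('display.max_colwidth')

print(short)
print(len(full.splitlines()))
```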