❤️ 20,000 words summarizing 50 high-frequency pandas operations [illustrated and worth bookmarking] ❤️

Key points first, pay attention!

  • First, this article keeps to the point: it only introduces functions and processing methods that I use frequently in practice.
  • The examples in this article are for demonstration only and generally do not modify the original data. Where code does modify the original data, it is marked (modifies the original data). When using these yourself, always pay attention to whether you have modified the original data; if an error is reported, first check whether your code has changed it.
# Does not modify the original data
df.drop('name', axis=1)
# Modifies the original data
df.drop('name', axis=1, inplace=True)
# Modifies the original data (by reassigning df)
df = df.drop('name', axis=1)
  • pandas is powerful because of its rich variety of data processing functions. They combine with one another flexibly, and pandas interoperates with numpy, matplotlib, scikit-learn, pyspark and many other scientific computing libraries. Mastering it is a necessary step toward handling real projects.
  • Original writing is not easy, and typing out all this code is tiring. If you think the article is good, 💗 remember to like, favorite and share 💗 ~ thanks in advance.

DataFrame creation

Create an empty DataFrame

Create an empty DataFrame with three rows of empty data.

import pandas as pd

df = pd.DataFrame(columns=['name', 'age'], index=[0, 1, 2])

General DataFrame creation method

Only three common methods are introduced here: creating from an array, creating from a dictionary, and creating from an external file. For common file types you can pass the file path directly. An xlsx file may contain multiple sheets, so you need to specify sheet_name.

# Create from an array
df = pd.DataFrame(data=[['Monkey D Luffy', 18],
                        ['Sauron', 19],
                        ['Nami', 18]],
                  columns=['name', 'age'])
# Create from a dictionary
df = pd.DataFrame({'name': ['Monkey D Luffy', 'Sauron', 'Nami'],
                   'age': [18, 19, 18]})
# Create from an external file: csv, xlsx, json, etc.
df = pd.read_csv('XXX.csv')
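For xlsx files specifically, here is a minimal sketch of passing sheet_name (the file and sheet names are placeholders):

df = pd.read_excel('XXX.xlsx', sheet_name='Sheet1')  # Read one specific sheet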


DataFrame storage

Common storage methods (csv, json, excel, pickle)

Generally, you do not need to save the index, because a fresh index is generated automatically when reading.

df.to_csv('test.csv', index=False)  # Ignore index
df.to_excel('test.xlsx', index=False)  # Ignore index
df.to_json('test.json')  # Save as json
df.to_pickle('test.pkl')  # Save in binary format
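For completeness, each file saved above can be read back with the matching reader:

df = pd.read_csv('test.csv')
df = pd.read_excel('test.xlsx')
df = pd.read_json('test.json')
df = pd.read_pickle('test.pkl')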

DataFrame view data information

Display summary information

Before working with a DataFrame, we usually look at its basic information. My personal preference is info(), which shows the row and column structure of the dataset and the number of non-null values in each column.

df.info()

Display descriptive statistics

It gives an intuitive view of the basic statistics of the numeric columns.

df.describe()

Show the first / last n rows

By default, 5 rows are displayed; you can specify the number of rows to show.

df.head(n)  # Specify an integer n to output the first n rows
df.tail(n)  # Specify an integer n to output the last n rows

Display index and column information

Displays basic information about indexes and columns.

df.columns  # Column information
df.index  # Index information

Displays the data type for each column

Displays the name of the column and the corresponding data type.

df.dtypes

Displays the amount of memory used

Displays the amount of memory occupied by each column, in bytes.

df.memory_usage()


Locate a row of data

Important: both loc and iloc specify rows first and then columns, separated by a comma. For example, df.loc[:, :] gets the data of all rows and all columns.

Locate using loc

For example, to locate the row for 'Sauron', any of the following works:

df.loc[1, :]  # loc[index, columns]: row index and column names; returns a Series
df.loc[df['age'] > 18]  # Returns a DataFrame
# Or: df[df['age'] > 18]
# Or: df.loc[df['name'] == 'Sauron']

Locate using iloc

Use iloc to get the data of the second row (the index starts from 0) and all columns.

df.iloc[1, :]  # iloc[index1, index2] row index, column index


Add a row of data

Locate and add using loc

Use loc to locate the row with index = 3 (here, len(df)) and then assign it values (modifies the original data).

df.loc[len(df)] = ['Joba', 3]

Add using append

When adding data with append, you need to specify the column names and values. Any column that is not specified is filled with NaN by default.

df.append({'name': 'Yamaji', 'age': 19}, ignore_index=True)
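Note that append() was deprecated in pandas 1.4 and removed in 2.0. On newer versions, a roughly equivalent sketch with pd.concat() is:

new_row = pd.DataFrame([{'name': 'Yamaji', 'age': 19}])
df = pd.concat([df, new_row], ignore_index=True)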


Delete data

Delete columns based on column names

Use drop to delete a column: specify the axis to delete along and the name/index of the corresponding column/row.

df.drop('name', axis=1)  # Delete a single column
df.drop(['name', 'age'], axis=1)  # Delete multiple columns

Delete rows based on index

Deleting rows is similar to deleting columns above, but here you specify the index.

df.drop(0, axis=0)  # Delete a single row
df.drop([0, 1], axis=0)  # Delete multiple rows

Locate and delete data using loc

First use loc to locate the rows matching a condition, then take their index, and pass that index to drop.

df.drop(df.loc[df['name'] == 'Nami'].index, axis=0)  # Delete the anchored row
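An equivalent and often simpler pattern keeps the rows you want with a boolean mask instead of dropping:

df[df['name'] != 'Nami']  # Keep every row except 'Nami'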

Delete columns using del

Note that del modifies the original data directly, so use it with care.

del df['age']

Delete rows and columns at the same time

drop can also delete rows and columns at the same time. Here we delete the first and second rows together with the age column.

df.drop(columns=['age'], index=[0, 1])

Delete duplicate values

  • When a subset is specified, duplicates are removed using the specified columns as the reference. That is, if two rows have the same value in column a, the second occurrence is deleted and only the first is kept.
  • If no subset is specified, all columns are used as the reference, and a row is only de-duplicated when it is identical to another row in every column.
df.drop_duplicates(subset=['a'], keep='first')
df.drop_duplicates(keep='first')


Screening duplicate values

Sample data

df = pd.DataFrame({'name':['Python',
                        'Python',
                        'Java',
                        'Java',
                        'C'],
                   'count': [2, 2, 6, 8, 10]})

Determine whether a column has duplicate values

Use value_counts() to count how many times each value appears in the column. The result is sorted in descending order by default, so you only need to check whether the count in the first row is 1 to know whether the column has duplicate values.

df['name'].value_counts()

Use drop_duplicates() to delete duplicate values, keeping only the first occurrence, then check whether the result equals the original df with equals(). If it returns False, the column has duplicate values.

df.equals(df.drop_duplicates(subset=['name'], keep='first'))

False

Judge whether there are duplicate rows in the DataFrame

drop_duplicates() is used again to delete duplicate values, keeping only the first occurrence; this time the subset parameter is not set, so all columns are used by default. Check whether the result equals the original df: if it is False, there are duplicate rows.

df.equals(df.drop_duplicates(keep='first'))

False

Count the number of duplicate lines

Note that the count here considers all columns: only rows identical in every column count as duplicates, so the result is 1 (the two 'Python' rows).

len(df) - len(df.drop_duplicates(keep="first"))

1

Show duplicate data rows

First delete duplicate rows, keeping only the first occurrence, to get a set of unique rows. Then use drop_duplicates(keep=False) to drop every row that has any duplicate, which keeps only the rows that were never duplicated. Append the two result sets and de-duplicate again with keep=False: what survives is exactly one copy of each row that was duplicated in df.

df.drop_duplicates(keep="first")\
  .append(df.drop_duplicates(keep=False))\
  .drop_duplicates(keep=False)
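A shorter route to essentially the same result is duplicated(), which marks rows that repeat an earlier row:

df[df.duplicated(keep='first')]  # One copy of each duplicated row (the later occurrences)
df[df.duplicated(keep=False)]  # All occurrences of duplicated rows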

Missing value processing

Find missing values

Missing values are True and non-missing values are False.

df.isnull()

Find non missing values

Non-missing values are True and missing values are False.

df.notnull()

Show rows with missing values

.isnull(): finds missing values, marking each missing position as True.
.T: transposes rows and columns, preparing for the next step, any().
.any(): returns True if at least one value in a sequence is True.

df[df.isnull().T.any()]
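The transpose can also be avoided by passing axis=1 to any(), which checks across each row directly:

df[df.isnull().any(axis=1)]  # Same result without the transpose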

Delete missing values

There are several parameters to pay attention to here:

  • axis: 0 for rows, 1 for columns.
  • how:
    • any: delete the row or column if it contains any NaN.
    • all: delete the row or column only if all of its values are NaN.
  • thresh: the minimum number of non-NaN values required to keep a row or column; anything with fewer non-null values is deleted.
  • subset: the range of data to consider. For example, when deleting rows with missing values, use subset to specify the reference columns; all columns are used by default.
  • inplace: whether to modify the original data.
# Delete a row if it has missing values
df.dropna(axis=0, how='any')  
# Delete a column if it has missing values
df.dropna(axis=1, how='any')  
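A short sketch of thresh and subset; the threshold of 2 and the age column are arbitrary stand-ins for illustration:

df.dropna(thresh=2)  # Keep only rows with at least 2 non-NaN values
df.dropna(subset=['age'])  # Only check the 'age' column when dropping rows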

Fill in missing values

Numeric or string padding

Directly specify the number or string to populate.

df.fillna(0)

Fill with values before / after missing values

  • method='pad' fills with the previous value (the value above it in the column); a missing value in the first row is left unfilled
  • method='bfill' fills with the next value (the value below it in the column); a missing value in the last row is left unfilled
df.fillna(method='pad')
df.fillna(method='bfill')
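Note: the method parameter of fillna() is deprecated in recent pandas versions (2.1+); the dedicated methods do the same job:

df.ffill()  # Forward fill, same as method='pad'
df.bfill()  # Backward fill, same as method='bfill'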

Fill with the mean / median of the column in which the missing value is located

The column's own statistics can also be used for filling, such as the mean, median, max, min, sum, etc.

df.fillna(df.mean())
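In newer pandas versions df.mean() raises an error when the DataFrame contains non-numeric columns, so it is safer to restrict the calculation; the age column below is a stand-in for any numeric column:

df.fillna(df.mean(numeric_only=True))  # Ignore non-numeric columns
df['age'] = df['age'].fillna(df['age'].mean())  # Or fill one column with its own mean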


Column operation

Modify column name

Assigning to df.columns replaces all column names at once. (modifies the original data)
rename() takes a mapping from original names to new names.

df.columns = ['new_name', 'new_age']
df.rename(columns={'name': 'new_name', 'age': 'new_age'})

Modify column type

Use astype to modify the column type.

df['age'].astype(str)

Split columns to get multiple columns

split can only be applied to string columns.

df[['name1', 'name2']] = df['name'].str.split(',', expand=True)

Merge multiple columns into a new column

Similarly, only string columns can be concatenated, which is why the numeric age column is converted with astype(str) first.

df['all'] = df['name'] + '/' + df['age'].astype(str)

Partition numeric columns

For a numeric column, you may need to convert the values into labels according to specified ranges, such as pass/fail for a product or performance grades. You need to specify the numeric column, the bin edges, whether each bin is closed on the left or the right (in the example, right=False gives left-closed, right-open bins), and finally the label names.

import numpy as np

df['evaluate'] = pd.cut(df['achievement'], [0, 60, 80, np.inf],
                        right=False, labels=['fail', 'good', 'excellent'])


Sorting

Index sorting

Sort row index in descending order

df.sort_index(axis=0, ascending=False)

Sort column index in descending order

df.sort_index(axis=1, ascending=False)

Reset index

Reorder the index so that the original index is not retained.

df.reset_index(drop=True)

Value sorting

First sort by name in descending order, then sort the scores within the same name in descending order.

df.sort_values(by=['name', 'achievement'], axis=0, ascending=False)

Create ranking column

rank() is used for ranking. The meanings of the main parameter method are as follows:

method | meaning
average | Default. Values in the same group share the average of their ranks, and the following ranks jump accordingly
min | Values in the same group share the smallest rank in the group, and the following ranks jump
max | Values in the same group share the largest rank in the group, and the following ranks jump
first | Tied values are ranked in the order they appear in the data, so every value gets a distinct rank
dense | Values in the same group share the same rank, with no jumps between ranks

Now rank each row by the score column and create a new ranking column, using method='average' as an example.

df['ranking'] = df['achievement'].rank(method='average', ascending=False)
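For comparison, a dense ranking of the same column; the ranking_dense column name is just an illustrative choice:

df['ranking_dense'] = df['achievement'].rank(method='dense', ascending=False)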


Grouping

Grouping statistics of rows

Now compute statistics of the scores by group: the sum, mean and max respectively.

df.groupby(['name']).sum()
df.groupby(['name']).mean()
df.groupby(['name']).max()


Note: at this point the index is name. To reset the index, use the following method.

df.groupby(['name']).sum().reset_index()

Use different statistical functions for different columns

agg() applies a function to a sequence of values and returns a scalar; several functions can be given per column.
apply() follows a split >>> apply >>> combine process (only a single function can be applied) and can return multidimensional data.

df.groupby(['name']).agg({'achievement':['sum','mean','max']})
df.groupby(['name']).apply(max)
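The heading's literal case, a different function per column, would look like this sketch; the age column is a hypothetical second numeric column:

df.groupby(['name']).agg({'achievement': 'mean', 'age': 'max'})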


DataFrame merge

The merge functions in pandas are mainly merge(), concat() and append(), which are generally used to connect two or more DataFrames. concat() and append() are mostly used to connect DataFrames vertically (concat() can also work horizontally), while merge() connects DataFrames horizontally on key columns.

Comparison of the three:

concat()

  • Connects multiple DataFrames
  • Can set specific keys

append()

  • Connects multiple DataFrames

merge()

  • Joins DataFrames on specified key columns

merge()

If on is specified, the column must exist in both DataFrames at the same time. If it is not specified, the default is the intersection of the two DataFrames' columns; in this example, even without specifying on, the merge would in fact be done on the name column.
Details of the how parameter:

  • inner: take the intersection according to the column specified by on.
  • outer: take the union according to the column specified by on.
  • left: left join on the column specified by on.
  • right: right join on the column specified by on.
pd.merge(df1, df2, on='name', how='inner')
pd.merge(df1, df2, on='name', how='outer')
pd.merge(df1, df2, on='name', how='left')
pd.merge(df1, df2, on='name', how='right')

concat()

concat() can merge multiple DataFrames, either vertically or horizontally depending on the situation. See the examples below.

# Vertical merge, intersection of columns
pd.concat([df1, df2], ignore_index=True, join='inner', axis=0)
# Vertical merge, union of columns
pd.concat([df1, df2], ignore_index=True, join='outer', axis=0)
# Horizontal merge, intersection of indexes
pd.concat([df1, df2], ignore_index=False, join='inner', axis=1)
# Horizontal merge, union of indexes
pd.concat([df1, df2], ignore_index=False, join='outer', axis=1)


In addition, you can specify keys to add the name of each source DataFrame at the index level.

pd.concat([df1, df2], ignore_index=False, join='outer', keys=['df1', 'df2'])

append()

append() is often used for vertical merging of two or more DataFrames (deprecated in newer pandas; see the pd.concat() note above).

df1.append(df2, ignore_index=True)
df1.append([df1, df2], ignore_index=True)


DataFrame time processing

Sample data

(The sample data, shown as a screenshot in the original, is a DataFrame with a datetime column of 2021 timestamps and a numeric count column.)

Convert character string into time series

Sometimes the time column read from a csv or xlsx file comes in as a string (object dtype). It then needs to be converted to the datetime type to make subsequent time handling easier.

pd.to_datetime(df['datetime'])
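Remember to assign the result back. If you know the exact layout of the strings, passing format makes parsing faster and unambiguous; the format string below is an assumption about this data:

df['datetime'] = pd.to_datetime(df['datetime'])
# e.g. pd.to_datetime(df['datetime'], format='%Y-%m-%d %H:%M:%S.%f')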

Index time column

For most time series data, we can use this column as the index to make the most of time-based selection. Here drop=False keeps the datetime column. Note that set_index returns a new DataFrame, so assign the result back.

df = df.set_index('datetime', drop=False)

Get the data of January through the index; the first five rows are shown here.

df.loc['2021-1'].head()

Obtain the data from January to March through the index.

df.loc['2021-1':'2021-3'].info()

Get each attribute of time

Here are the attributes commonly needed in everyday requirements, followed by examples of each.

Common property | Description
date | Get the date
time | Get the time
year | Get the year
month | Get the month
day | Get the day of the month
hour | Get the hour
minute | Get the minute
second | Get the second
dayofyear | The day of the year
weekofyear | The week of the year (newer versions use isocalendar().week)
weekday | The day of the week (Monday is 0)
day_name() | The name of the day of the week
quarter | The quarter of the year
is_leap_year | Whether the year is a leap year

Here, take the date at row 100 as an example; the result of each attribute is shown as a comment.

df['datetime'].dt.date[100]
# datetime.date(2021, 4, 11)
df['datetime'].dt.time[100]
# datetime.time(11, 50, 58, 995000)
df['datetime'].dt.year[100]
# 2021
df['datetime'].dt.month[100]
# 4
df['datetime'].dt.day[100]
# 11
df['datetime'].dt.hour[100]
# 11
df['datetime'].dt.minute[100]
# 50
df['datetime'].dt.second[100]
# 58
df['datetime'].dt.dayofyear[100]
# 101
df['datetime'].dt.isocalendar().week[100]
# 14
df['datetime'].dt.weekday[100]
# 6
df['datetime'].dt.day_name()[100]
# 'Sunday'
df['datetime'].dt.quarter[100]
# 2
df['datetime'].dt.is_leap_year[100]
# False

resample()

Resampling is divided into downsampling and upsampling.

Downsampling means sampling at a lower time frequency than the original series, and it is an aggregation operation. In the example, we get the mean of the count column for each quarter; 'Q' stands for quarter, i.e. sampling by quarter.

df.resample('Q', on='datetime')['count'].mean()


Note: the last output label here is 06-30 (the quarter end), not the 05-31 found in the actual data. This does not affect the calculation.

Upsampling is the opposite of downsampling: it samples at a higher time frequency than the original series, which is equivalent to viewing the data at a finer granularity. However, this usually introduces a large number of null values, so it is rarely used in practice and is not covered here.


DataFrame traversal mode

iterrows() - traverse rows (index, column value sequence)

Iterate by row, getting the row index and the row's values as a Series. It is relatively slow. See the example:

for index, row in df.iterrows():
    print(index)
    print(row)
    print(row['name'])
    print(row['age'])
    break  # Demo only: show just the first row

iteritems() - traversal (column name, value)

Iterate by column, getting the column name and the column's values.

for column, value in df.iteritems():
    print(column)
    print(value)
    break  # Demo only: show just the first column
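Note: iteritems() was deprecated in pandas 1.5 and removed in 2.0; items() is the drop-in replacement:

for column, value in df.items():
    print(column)
    print(value)
    break  # Demo only: show just the first column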

itertuples() - traverse rows (index, column value A, column value B...)

Iterate by row, getting the index and the row's column values packed together in a single tuple; unlike iterrows, info[0] is used to obtain the index.

for info in df.itertuples():
    print(info[0])
    print(info[1])
    print(info[2])
    break  # Demo only: show just the first row
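itertuples() actually yields namedtuples, so attribute access works too; Index is the field holding the row index:

for info in df.itertuples():
    print(info.Index)  # The row index
    print(info.name)   # Columns can be accessed by name
    break  # Demo only: show just the first row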


That's all for this article. If it felt useful, ❤ remember to like, favorite and follow!!! ❤


More practical content will be shared in the future. If you are interested, follow along so you don't miss it~
