[Pandas learning notes 02] - advanced usage of data processing

Author: Huan Hao

Source: Hang Seng LIGHT cloud community

Pandas is a Python library that provides a large number of functions and methods for processing data quickly and easily. This article introduces practical data processing operations in pandas.

Series of articles:

[Pandas learning notes 01] powerful tool set for analyzing structured data

[Pandas learning notes 02] - practical operation of data processing

Overview

Pandas is an open source library built on NumPy. For data processing, it can be thought of as an enhanced version of NumPy. It is used for data mining and data analysis, and it also provides data cleaning functions.

This article mainly introduces the advanced usage of Pandas in data processing, including data merging, grouping and splitting. If you already know SQL, you will pick up the material here very quickly.

Data merging

Data preparation

First, define a DataFrame dataset:

import pandas as pd

df_a = pd.DataFrame(columns=['name', 'rank'], data=[['C', 1], ['java', 2], ['python', 3], ['golang', 4]])
df_b = pd.DataFrame(columns=['name', 'year'], data=[['java', 2020], ['python', 2021], ['golang', 2022]])

DataFrame datasets can be merged with the merge() method, using an inner join, outer join, left join, or right join, as shown in the following examples:

By default, merge() performs an inner join, which keeps only the keys present in both DataFrames. Specify the join type with how and the join column with on.

# Inner join on the name column (how='inner' is also the default)
df_tmp = pd.merge(df_a, df_b, on='name', how='inner')
print(df_tmp)

# ========Print========
     name  rank  year
0    java     2  2020
1  python     3  2021
2  golang     4  2022
# Left join on the name column: every row of df_a is kept, and a missing year becomes NaN
df_tmp = pd.merge(df_a, df_b, on='name', how='left')
print(df_tmp)

# ========Print========
     name  rank    year
0       C     1     NaN
1    java     2  2020.0
2  python     3  2021.0
3  golang     4  2022.0
# Right join on the name column: every row of df_b is kept
df_tmp = pd.merge(df_a, df_b, on='name', how='right')
print(df_tmp)

# ========Print========
     name  rank  year
0    java     2  2020
1  python     3  2021
2  golang     4  2022
# If the two DataFrames have no column name in common, specify the matching columns directly with left_on and right_on
df_c = pd.DataFrame(columns=['name1', 'year'], data=[['java', 2020], ['python1', 2021], ['golang1', 2022]])
df_tmp = pd.merge(df_a, df_c, left_on='name', right_on='name1')
print(df_tmp)

# ========Print========
   name  rank name1  year
0  java     2  java  2020
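The examples above cover the inner, left, and right joins. As a minimal sketch of the remaining outer join, the snippet below reuses df_a and df_b from the data preparation step and passes indicator=True, an option of merge() that the original examples do not use. The names df_outer and only_in_a are chosen here for illustration.

```python
import pandas as pd

df_a = pd.DataFrame(columns=['name', 'rank'], data=[['C', 1], ['java', 2], ['python', 3], ['golang', 4]])
df_b = pd.DataFrame(columns=['name', 'year'], data=[['java', 2020], ['python', 2021], ['golang', 2022]])

# how='outer' keeps every name found in either DataFrame;
# indicator=True adds a _merge column recording where each row came from
# (left_only, right_only, or both)
df_outer = pd.merge(df_a, df_b, on='name', how='outer', indicator=True)
print(df_outer)

# Names that exist only in df_a, i.e. rows without a match in df_b
only_in_a = df_outer.loc[df_outer['_merge'] == 'left_only', 'name'].tolist()
print(only_in_a)
```

The row order of an outer join can differ between pandas versions, so it is safer to filter on the _merge column than to rely on row positions.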

Data grouping

Data preparation

First, define a DataFrame dataset:

import pandas as pd

df_a = pd.DataFrame(columns=['name', 'nums'], data=[['python', 1], ['java', 2], ['python', 3], ['java', 4]])

A DataFrame can be grouped with the groupby() method. After grouping, the groups can be counted, summed, and averaged, as shown in the following examples:

# Get the number of rows in each group
df_tmp = df_a.groupby('name').size()
print(df_tmp)

# ========Print========
name
java      2
python    2
dtype: int64
# Sum the nums column within each group
df_tmp = df_a.groupby('name')['nums'].sum()
print(df_tmp)

# ========Print========
name
java      6
python    4
Name: nums, dtype: int64
# Average the nums column within each group
df_tmp = df_a.groupby('name')['nums'].mean()
print(df_tmp)

# ========Print========
name
java      3.0
python    2.0
Name: nums, dtype: float64
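The separate aggregations above can also be computed in one pass. The following is a small sketch, not part of the original notes, that uses agg() with a list of aggregation names on the same df_a. The names df_stats and df_flat are illustrative.

```python
import pandas as pd

df_a = pd.DataFrame(columns=['name', 'nums'], data=[['python', 1], ['java', 2], ['python', 3], ['java', 4]])

# agg() accepts a list of aggregation names and returns one column per aggregation
df_stats = df_a.groupby('name')['nums'].agg(['sum', 'mean', 'count'])
print(df_stats)

# reset_index() turns the group key back into an ordinary column,
# which is convenient before merging the result with other data
df_flat = df_stats.reset_index()
print(df_flat.columns.tolist())
```

This is roughly what SELECT name, SUM(nums), AVG(nums), COUNT(*) ... GROUP BY name would return in SQL.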

Data splitting

Data preparation

First, define a DataFrame dataset:

import pandas as pd

df_a = pd.DataFrame(columns=['name', 'rank'], data=[['C_no1', 1], ['java_no2', 2], ['python_no3', 3], ['golang', 4]])

The str.split() method can be used to split a column of string data in a DataFrame, as shown in the following examples:

# Data splitting: split the values of a column on a separator. n=1 splits at most once; expand=True returns the split results directly as a DataFrame
df_tmp = df_a['name'].str.split('_', n=1, expand=True)
print(df_tmp)

# ========Print========
        0     1
0       C   no1
1    java   no2
2  python   no3
3  golang  None
# Data splitting: merge the split data back into the original data, matching rows by index
df_tmp = pd.merge(df_a, df_a['name'].str.split('_', n=1, expand=True), how='left', left_index=True, right_index=True)
print(df_tmp)

# ========Print========
         name  rank       0     1
0       C_no1     1       C   no1
1    java_no2     2    java   no2
2  python_no3     3  python   no3
3      golang     4  golang  None
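The merged result above keeps the default column labels 0 and 1. One possible variation, sketched here under the assumption that readable column names are wanted, renames the split columns and attaches them with join(), which aligns on the index. The column names lang and tag are hypothetical and do not appear in the original notes.

```python
import pandas as pd

df_a = pd.DataFrame(columns=['name', 'rank'], data=[['C_no1', 1], ['java_no2', 2], ['python_no3', 3], ['golang', 4]])

# Split at most once on '_' and give the two pieces readable names
# ('lang' and 'tag' are illustrative names)
parts = df_a['name'].str.split('_', n=1, expand=True)
parts.columns = ['lang', 'tag']

# join() aligns rows on the index, so no explicit merge keys are needed
df_named = df_a.join(parts)
print(df_named)
```

Rows without a separator, such as golang, get a missing value in the tag column, which can be detected with pd.isna().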

Data visualization

When processing data with Pandas, we can use the matplotlib library to turn the data into charts, which shows trends and relationships in the data more intuitively.

# The plot() method draws a line chart of the numeric columns (matplotlib must be installed)
df_a = pd.DataFrame(columns=['name', 'rank'], data=[['C_no1', 1], ['java_no2', 2], ['python_no3', 3], ['golang', 4]])
df_a.plot()
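The default plot() call draws the numeric column against the row index. As a rough sketch of a labelled chart, the snippet below draws a bar chart with the name column on the x axis. It assumes matplotlib is installed and selects the non-interactive Agg backend so that it runs without a display; the output file name rank_by_name.png is hypothetical.

```python
import matplotlib
matplotlib.use('Agg')  # assumption: non-interactive backend, no window is opened
import pandas as pd

df_a = pd.DataFrame(columns=['name', 'rank'], data=[['C_no1', 1], ['java_no2', 2], ['python_no3', 3], ['golang', 4]])

# kind='bar' draws one bar per row; x and y select the label and value columns
ax = df_a.plot(kind='bar', x='name', y='rank', legend=False)
ax.set_ylabel('rank')

# Save the chart to a file instead of showing it (file name is illustrative)
ax.figure.savefig('rank_by_name.png')
```

In an interactive session, calling matplotlib.pyplot.show() instead of savefig() displays the chart in a window.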

Summary

This article introduced the advanced operations of the Pandas tool set: merging, grouping, splitting, and plotting. These operations work much like their SQL counterparts in a database (JOIN, GROUP BY), and they cover many everyday data analysis and processing tasks.

Keywords: Python, data analysis, data visualization, pandas

Added by piznac on Wed, 01 Dec 2021 07:50:31 +0200