# A quick summary of the pandas essentials you must know

Hello, I'm jiejie. Today we'll introduce some very basic but essential methods and functions in the pandas library. I hope you gain something from reading it!

# Preparing the dataset

We will generate some random numbers as the dataset to be used later.

```python
import numpy as np
import pandas as pd

index = pd.date_range("1/1/2000", periods=8)

series = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])

df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=["A", "B", "C"])
```

The head() and tail() methods are used to view the first and last rows of the dataset. By default they show 5 rows, but you can pass the number of rows yourself.

```python
series2 = pd.Series(np.random.randn(100))
series2.head()
```

output

```
0    0.733801
1   -0.740149
2   -0.031863
3    2.515542
4    0.615291
dtype: float64
```

Similarly

`series2.tail()`

output

```
95   -0.526625
96   -0.234975
97    0.744299
98    0.434843
99   -0.609003
dtype: float64
```

# Statistical analysis of data

In pandas, the describe() method produces a general statistical summary of the data in the table, for example

`series2.describe()`

output

```
count    100.000000
mean       0.040813
std        1.003012
min       -2.385316
25%       -0.627874
50%       -0.029732
75%        0.733579
max        2.515542
dtype: float64
```

Of course, we can also specify which percentiles appear in the output

`series2.describe(percentiles=[0.05, 0.25, 0.75, 0.95])`

output

```
count    100.000000
mean       0.040813
std        1.003012
min       -2.385316
5%        -1.568183
25%       -0.627874
50%       -0.029732
75%        0.733579
95%        1.560211
max        2.515542
dtype: float64
```

For discrete data, the result given by the describe() method will be much simpler

```python
s = pd.Series(["a", "a", "b", "b", "a", "a", "d", "c", "d", "a"])
s.describe()
```

output

```
count     10
unique     4
top        a
freq       5
dtype: object
```

If the table contains both discrete data and continuous data, by default, describe() will perform statistical analysis on continuous data

```python
df2 = pd.DataFrame({"a": ["Yes", "Yes", "No", "No"], "b": np.random.randn(4)})
df2.describe()
```

output

```
              b
count  4.000000
mean   0.336053
std    1.398306
min   -1.229344
25%   -0.643614
50%    0.461329
75%    1.440995
max    1.650898
```

Of course, we can also force describe() to analyze only the discrete data or only the continuous data

`df2.describe(include=["object"])`

output

```
          a
count     4
unique    2
top     Yes
freq      2
```

Similarly, we can also specify continuous data for statistical analysis

`df2.describe(include=["number"])`

output

```
              b
count  4.000000
mean  -0.593695
std    0.686618
min   -1.538640
25%   -0.818440
50%   -0.459147
75%   -0.234401
max    0.082155
```

If we want the statistical analysis on both kinds of data at once, we can do so

`df2.describe(include="all")`

output

```
          a         b
count     4  4.000000
unique    2       NaN
top     Yes       NaN
freq      2       NaN
mean    NaN  0.292523
std     NaN  1.523908
min     NaN -1.906221
25%     NaN -0.113774
50%     NaN  0.789560
75%     NaN  1.195858
max     NaN  1.497193
```

# Position of maximum / minimum value

The idxmin() and idxmax() methods find the position of the minimum / maximum value in the table and return the index of that value

```python
s1 = pd.Series(np.random.randn(5))
s1.idxmin(), s1.idxmax()
```

output

`(0, 3)`

If it is used on DataFrame, it is as follows

```python
df1 = pd.DataFrame(np.random.randn(5, 3), columns=["A", "B", "C"])
df1.idxmin(axis=0)
```

output

```
A    4
B    2
C    1
dtype: int64
```

Similarly, we change the axis parameter to 1

`df1.idxmin(axis=1)`

output

```
0    C
1    C
2    C
3    B
4    A
dtype: object
```

# value_counts() method

The value_counts() method in pandas is mainly used to count and sort data: it shows how many times each distinct value appears in a given column of the table. Let's start with a simple example

```python
df = pd.DataFrame({'city': ['Beijing', 'Guangzhou', 'Shanghai', 'Shanghai', 'Hangzhou', 'Chengdu', 'Hong Kong', 'Nanjing', 'Beijing', 'Beijing'],
                   'income': [10000, 10000, 5500, 5500, 4000, 50000, 8000, 5000, 5200, 5600],
                   'Age': [50, 43, 34, 40, 25, 25, 45, 32, 25, 25]})
df["city"].value_counts()
```

output

```
Beijing      3
Shanghai     2
Guangzhou    1
Hangzhou     1
Chengdu      1
Hong Kong    1
Nanjing      1
Name: city, dtype: int64
```

As you can see, Beijing appears three times and Shanghai twice, and by default the counts are arranged in descending order. Let's sort the counts for the income column in ascending order instead

`df["income"].value_counts(ascending=True)`

output

```
4000     1
50000    1
8000     1
5000     1
5200     1
5600     1
10000    2
5500     2
Name: income, dtype: int64
```

At the same time, the parameter normalize=True can be used to compute the proportion of each value rather than its raw count

`df['Age'].value_counts(ascending=True, normalize=True)`

output

```
50    0.1
43    0.1
34    0.1
40    0.1
45    0.1
32    0.1
25    0.4
Name: Age, dtype: float64
```
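Since the Age column here is fixed rather than random, the proportions are easy to check by hand — 25 accounts for 4 of the 10 rows:

```python
import pandas as pd

ages = pd.Series([50, 43, 34, 40, 25, 25, 45, 32, 25, 25], name="Age")
props = ages.value_counts(normalize=True)

print(props[25])    # 0.4 — four occurrences out of ten
print(props.sum())  # 1.0 — normalized counts always sum to one
```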

If you found this article useful, don't forget to like, bookmark, and share it. Your support is the strongest motivation for me to keep producing high-quality articles!


Added by nicx on Wed, 29 Dec 2021 19:23:20 +0200