# A quick summary of the pandas essentials you must know

Hello, I'm jiejie. Today we'll introduce some very basic but essential methods and functions in the pandas library. I hope you gain something from reading it!

# Preparing the dataset

We will generate some random numbers as the dataset to be used later.

```python
import numpy as np
import pandas as pd

index = pd.date_range("1/1/2000", periods=8)

series = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])

df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=["A", "B", "C"])
```

The head() and tail() methods are used to view the first and last rows of the dataset. By default they show 5 rows, but you can pass the number of rows yourself.

```python
series2 = pd.Series(np.random.randn(100))
series2.head()
```

output

```
0    0.733801
1   -0.740149
2   -0.031863
3    2.515542
4    0.615291
dtype: float64
```

Similarly

`series2.tail()`

output

```
95   -0.526625
96   -0.234975
97    0.744299
98    0.434843
99   -0.609003
dtype: float64
```

# Statistical analysis of data

In pandas, the describe() method produces a general statistical summary of the data in the table, for example

`series2.describe()`

output

```
count    100.000000
mean       0.040813
std        1.003012
min       -2.385316
25%       -0.627874
50%       -0.029732
75%        0.733579
max        2.515542
dtype: float64
```

Of course, we can also specify which percentiles appear in the output

`series2.describe(percentiles=[0.05, 0.25, 0.75, 0.95])`

output

```
count    100.000000
mean       0.040813
std        1.003012
min       -2.385316
5%        -1.568183
25%       -0.627874
50%       -0.029732
75%        0.733579
95%        1.560211
max        2.515542
dtype: float64
```

For discrete data, the result given by the describe() method will be much simpler

```python
s = pd.Series(["a", "a", "b", "b", "a", "a", "d", "c", "d", "a"])
s.describe()
```

output

```
count     10
unique     4
top        a
freq       5
dtype: object
```

If the table contains both discrete data and continuous data, by default, describe() will perform statistical analysis on continuous data

```python
df2 = pd.DataFrame({"a": ["Yes", "Yes", "No", "No"], "b": np.random.randn(4)})
df2.describe()
```

output

```
              b
count  4.000000
mean   0.336053
std    1.398306
min   -1.229344
25%   -0.643614
50%    0.461329
75%    1.440995
max    1.650898
```

Of course, we can also force describe() to analyze only the discrete data or only the continuous data

`df2.describe(include=["object"])`

output

```
          a
count     4
unique    2
top     Yes
freq      2
```

Similarly, we can also specify continuous data for statistical analysis

`df2.describe(include=["number"])`

output

```
              b
count  4.000000
mean  -0.593695
std    0.686618
min   -1.538640
25%   -0.818440
50%   -0.459147
75%   -0.234401
max    0.082155
```

If we want the statistical analysis on both kinds of data at once, we can do so

`df2.describe(include="all")`

output

```
          a         b
count     4  4.000000
unique    2       NaN
top     Yes       NaN
freq      2       NaN
mean    NaN  0.292523
std     NaN  1.523908
min     NaN -1.906221
25%     NaN -0.113774
50%     NaN  0.789560
75%     NaN  1.195858
max     NaN  1.497193
```

# Position of maximum / minimum value

The idxmin() and idxmax() methods find the position of the minimum / maximum value in the table and return the index of that value

```python
s1 = pd.Series(np.random.randn(5))
s1.idxmin(), s1.idxmax()
```

output

`(0, 3)`

If it is used on DataFrame, it is as follows

```python
df1 = pd.DataFrame(np.random.randn(5, 3), columns=["A", "B", "C"])
df1.idxmin(axis=0)
```

output

```
A    4
B    2
C    1
dtype: int64
```

Similarly, we change the axis parameter to 1

`df1.idxmin(axis=1)`

output

```
0    C
1    C
2    C
3    B
4    A
dtype: object
```

# value_counts() method

The value_counts() method in pandas is mainly used to count and sort data: it shows how many times each distinct value appears in a given column of the table. Let's start with a simple example

```python
df = pd.DataFrame({'city': ['Beijing', 'Guangzhou', 'Shanghai', 'Shanghai', 'Hangzhou', 'Chengdu', 'Hong Kong', 'Nanjing', 'Beijing', 'Beijing'],
                   'income': [10000, 10000, 5500, 5500, 4000, 50000, 8000, 5000, 5200, 5600],
                   'Age': [50, 43, 34, 40, 25, 25, 45, 32, 25, 25]})
df["city"].value_counts()
```

output

```
Beijing      3
Shanghai     2
Guangzhou    1
Hangzhou     1
Chengdu      1
Hong Kong    1
Nanjing      1
Name: city, dtype: int64
```

As you can see, Beijing appears three times and Shanghai twice, and by default the counts are arranged in descending order. Let's sort the counts for the income column in ascending order instead

`df["income"].value_counts(ascending=True)`

output

```
4000     1
50000    1
8000     1
5000     1
5200     1
5600     1
10000    2
5500     2
Name: income, dtype: int64
```

At the same time, the parameter normalize=True can be used to compute the proportion of each value rather than its raw count

`df['Age'].value_counts(ascending=True, normalize=True)`

output

```
50    0.1
43    0.1
34    0.1
40    0.1
45    0.1
32    0.1
25    0.4
Name: Age, dtype: float64
```
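Since the Age column here is fixed rather than random, the proportions are easy to check by hand — 25 accounts for 4 of the 10 rows:

```python
import pandas as pd

ages = pd.Series([50, 43, 34, 40, 25, 25, 45, 32, 25, 25], name="Age")
props = ages.value_counts(normalize=True)

print(props[25])    # 0.4 — four occurrences out of ten
print(props.sum())  # 1.0 — normalized counts always sum to one
```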

If you found this article useful, don't forget to like, bookmark, and share it. Your support is the strongest motivation for me to keep producing high-quality articles!


Added by nicx on Wed, 29 Dec 2021 19:23:20 +0200