Hello, I'm jiejie. Today we'll introduce some basic methods and functions in the pandas library. I hope you'll get something out of it!
We'll start by generating some random data to use throughout this article:

import pandas as pd
import numpy as np

index = pd.date_range("1/1/2000", periods=8)
series = pd.Series(np.random.randn(5), index=["a", "b", "c", "d", "e"])
df = pd.DataFrame(np.random.randn(8, 3), index=index, columns=["A", "B", "C"])
The head() and tail() methods show the first and last rows of a dataset. The default is 5 rows; of course, you can also set the number of rows yourself.
series2 = pd.Series(np.random.randn(100))
series2.head()
output
0    0.733801
1   -0.740149
2   -0.031863
3    2.515542
4    0.615291
dtype: float64
Similarly
series2.tail()
output
95   -0.526625
96   -0.234975
97    0.744299
98    0.434843
99   -0.609003
dtype: float64
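As a quick sketch of passing a custom row count (the series here is made up for illustration):

```python
import pandas as pd

# a small series with predictable values
s = pd.Series(range(10))

# first 3 rows instead of the default 5
print(s.head(3))

# last 2 rows
print(s.tail(2))
```

Both methods return a new object. A negative count also works: head(-2) returns everything except the last two rows.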
Statistical analysis of data
In pandas, the describe() method produces a general statistical summary of the data, for example:
series2.describe()
output
count    100.000000
mean       0.040813
std        1.003012
min       -2.385316
25%       -0.627874
50%       -0.029732
75%        0.733579
max        2.515542
dtype: float64
Of course, we can also specify which percentiles to include in the output
series2.describe(percentiles=[0.05, 0.25, 0.75, 0.95])
output
count    100.000000
mean       0.040813
std        1.003012
min       -2.385316
5%        -1.568183
25%       -0.627874
50%       -0.029732
75%        0.733579
95%        1.560211
max        2.515542
dtype: float64
For discrete data, describe() returns a different, much shorter summary
s = pd.Series(["a", "a", "b", "b", "a", "a", "d", "c", "d", "a"])
s.describe()
output
count     10
unique     4
top        a
freq       5
dtype: object
If the table contains both discrete and continuous data, describe() will by default summarize only the continuous (numeric) columns
df2 = pd.DataFrame({"a": ["Yes", "Yes", "No", "No"], "b": np.random.randn(4)})
df2.describe()
output
              b
count  4.000000
mean   0.336053
std    1.398306
min   -1.229344
25%   -0.643614
50%    0.461329
75%    1.440995
max    1.650898
Of course, we can also explicitly ask describe() to summarize only the discrete data or only the continuous data
df2.describe(include=["object"])
output
          a
count     4
unique    2
top     Yes
freq      2
Similarly, we can also specify continuous data for statistical analysis
df2.describe(include=["number"])
output
              b
count  4.000000
mean  -0.593695
std    0.686618
min   -1.538640
25%   -0.818440
50%   -0.459147
75%   -0.234401
max    0.082155
To summarize both kinds of columns at once, we can do this
df2.describe(include="all")
output
          a         b
count     4  4.000000
unique    2       NaN
top     Yes       NaN
freq      2       NaN
mean    NaN  0.292523
std     NaN  1.523908
min     NaN -1.906221
25%     NaN -0.113774
50%     NaN  0.789560
75%     NaN  1.195858
max     NaN  1.497193
Position of maximum / minimum value
The idxmin() and idxmax() methods locate the minimum / maximum value in the table and return the index label of that value
s1 = pd.Series(np.random.randn(5))
s1.idxmin(), s1.idxmax()
output
(0, 3)
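One detail worth knowing: when the minimum or maximum occurs more than once, idxmin() and idxmax() return the label of the first occurrence. A small sketch with made-up data:

```python
import pandas as pd

# both the min (1) and the max (3) appear twice
s = pd.Series([1, 3, 1, 3], index=["w", "x", "y", "z"])

print(s.idxmin())  # 'w' -- first index label holding the minimum
print(s.idxmax())  # 'x' -- first index label holding the maximum
```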
Used on a DataFrame, it works like this
df1 = pd.DataFrame(np.random.randn(5, 3), columns=["A", "B", "C"])
df1.idxmin(axis=0)
output
A    4
B    2
C    1
dtype: int64
Similarly, setting the axis parameter to 1 works row-wise
df1.idxmin(axis=1)
output
0    C
1    C
2    C
3    B
4    A
dtype: object
value_counts() method
The value_counts() method in pandas is mainly used to count and sort data: it reports how many times each distinct value appears in a given column. Let's start with a simple example
df = pd.DataFrame({
    'city': ['Beijing', 'Guangzhou', 'Shanghai', 'Shanghai', 'Hangzhou', 'Chengdu', 'Hong Kong', 'Nanjing', 'Beijing', 'Beijing'],
    'income': [10000, 10000, 5500, 5500, 4000, 50000, 8000, 5000, 5200, 5600],
    'Age': [50, 43, 34, 40, 25, 25, 45, 32, 25, 25]
})
df["city"].value_counts()
output
Beijing      3
Shanghai     2
Guangzhou    1
Hangzhou     1
Chengdu      1
Hong Kong    1
Nanjing      1
Name: city, dtype: int64
As we can see, Beijing appears three times and Shanghai twice, and the results are sorted in descending order by default. Now let's sort the income column in ascending order
df["income"].value_counts(ascending=True)
output
4000     1
50000    1
8000     1
5000     1
5200     1
5600     1
10000    2
5500     2
Name: income, dtype: int64
At the same time, the normalize=True parameter reports the proportion of each value instead of the raw count
df['Age'].value_counts(ascending=True, normalize=True)
output
50    0.1
43    0.1
34    0.1
40    0.1
45    0.1
32    0.1
25    0.4
Name: Age, dtype: float64
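Note that value_counts() ignores missing values by default; pass dropna=False to count them too. A minimal sketch with made-up data:

```python
import pandas as pd
import numpy as np

s = pd.Series(["a", "a", "b", np.nan])

# NaN is excluded by default: only 'a' and 'b' are counted
print(s.value_counts())

# dropna=False adds a row for NaN
print(s.value_counts(dropna=False))
```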
If you found this article useful, don't forget to like, comment, and share. Your support is the strongest motivation for me to keep producing high-quality articles!