pandas introduction notes

 

Introduction to the Pandas DataFrame

  • Pandas is an open-source Python library for data analysis; it covers data loading, cleaning, transformation, statistical processing, visualization, and more

  • DataFrame and Series are the two most basic data structures in Pandas

  • DataFrame is used for structured data (e.g. SQL tables, Excel sheets)

  • Series holds a single column of data; a DataFrame can also be thought of as a dictionary-like collection of Series objects (see the sketch below)
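
As a minimal sketch of this "dictionary of Series" view (the column names and values below are made up for illustration, not from the course data):

import pandas as pd

# Each key of the dict becomes a column name; each value is a Series holding that column
ages = pd.Series([25, 30, 35])
cities = pd.Series(['Beijing', 'Shanghai', 'Shenzhen'])
people = pd.DataFrame({'age': ages, 'city': cities})
print(people)
print(type(people['age']))  # selecting one column gives back a Series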

conda activate base  # Optional: switch to the target virtual environment
# pip install pandas  # pandas must be installed separately in a newly created virtual environment
cd D:\  # Switch to the target drive (optional)
cd pandas
jupyter notebook  # Start Jupyter Notebook

import pandas as pd  # Import the pandas package
df = pd.read_csv('data/movie.csv')  # Load the movie CSV file
df.head()  # Display the first 5 rows

df = pd.read_csv('data/gapminder.tsv', sep='\t')
# Parameter 1: path of the file to load; parameter 2: sep, the field separator (default ','), here '\t' for tab-separated data
print(df)
type(df)  # The built-in function type() shows the type of the returned object

# Note: attributes are accessed without (), methods are called with ()
df.shape  # [attribute] get the number of rows and columns
df.columns  # [attribute] get the column names
df.dtypes  # [attribute] get the data type of each column
df.info()  # [method] print a concise summary (index, columns, dtypes, memory usage)
country_df = df['country']  # Select a single column with df['column name']
country_df.head()  # Get the first 5 rows
subset = df[['country','continent','year']]  # Select multiple columns by column name
# df[['column name 1', 'column name 2', ...]]: note the two layers of [], which can be read as df[list of column names]
print(subset.tail())  # tail(n) returns the last n rows; with no argument, the last 5 rows by default
print(df.head())  # head(n) returns the first n rows; with no argument, the first 5 rows by default
print(df.loc[0])  # df.loc[label]: pass a row index label to get one or more rows of data

number_of_rows = df.shape[0]  # Get the total number of rows from shape
last_row_index = number_of_rows - 1  # Total rows - 1 gives the index of the last row
print(df.loc[last_row_index])  # Get the last row and print it
print(df.tail(n=1))  # tail() returns 5 rows by default; pass n=1 to show only the last row


# The last row obtained with df.loc and with df.tail has a different type
subset_loc = df.loc[0] 
subset_head = df.head(n=1)
print(type(subset_loc))
print(type(subset_head)) 
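
If a one-row DataFrame is needed from loc, a list of labels can be passed; this is standard pandas behaviour, noted here as an aside:

print(type(df.loc[0]))    # <class 'pandas.core.series.Series'>: a single label gives a Series
print(type(df.loc[[0]]))  # <class 'pandas.core.frame.DataFrame'>: a list of labels gives a one-row DataFrame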

print(df.iloc[0])  # Use iloc to get the first row of data and print it
print(df.iloc[99])  # Use iloc to get the 100th row of data and print it
print(df.iloc[-1])  # Get the last row of data; -1 is the position counted from the end

print(df.iloc[4:7])  # Return the rows in the given positional range; note it is left-closed, right-open
print(df.iloc[:3])  # Return the first three rows; left-closed, right-open
print(df.iloc[-3:])  # Return the last three rows; left-closed, right-open

# Get data from a specified range of rows and columns
# Both loc and iloc can be used to select rows as well as columns
# df.iloc[[row positions], [column positions]]
print(df.iloc[[0],[0]])  # Returns the first row, first column (as a DataFrame)
print(type(df.iloc[[0],[0]]))
print(df.iloc[42,0])  # Returns the value at row position 42, column position 0 (a scalar)
print(type(df.iloc[42,0]))

# df.loc[[row index labels], [column names]]
print(df.loc[[0],['country']])  # Returns the row with label 0, 'country' column (as a DataFrame)
print(type(df.loc[[0],['country']]))
print('='*10)
print(df.loc[42,'country'])
print(type(df.loc[42,'country']))

# Use loc to get one or more columns from the data
# df.loc[:, [column names]]: 2:4 is a row slice; a bare colon means all rows
subset = df.loc[2:4, ['year','pop']]  # 2:4 is a label-based row slice (rows 2 through 4, inclusive)
subset = df.loc[0:2,['year','pop']]
subset = df.loc[:,['year','pop']]  # A colon selects all rows
print(subset.head())

# Use iloc to get one or more columns from the data
# df.iloc[:, [column position 1, column position 2, column position 3, ...]]: column positions separated by commas
subset = df.iloc[:,[2,4,-1]]
print(subset.head())

# df.iloc[:, column position slice]: a slice uses colons to separate start, stop, and step
subset = df.iloc[:,3:6]  # Use slice syntax with iloc to get several columns of data
print(subset.head())

subset = df.iloc[:,0:6:2]  # 0:6:2 selects column positions 0 through 5 with a step of 2
print(subset.head())

# The columns attribute holds the column names; .values returns them as a numpy array
print(df.columns.values)
print(list(df))  # list(df) also gives the column names as a list

# df.columns itself returns an Index;
# convert it to a list with tolist() or list(df.columns)
print(df.columns.tolist())

# Select data by row position and column position
print(df.iloc[[0,99,999],[0,3,5]])
# Format: df.iloc[[row position 1, row position 2, ...], [column position 1, column position 2, ...]]
# Gets columns 1, 4 and 6 of rows 1, 100 and 1000 (selected by position)

# Select data by row index and column name
# Benefits: 1. the code is more readable; 2. avoids fetching the wrong data if the column order changes
# In practice, passing the actual column names is recommended
print(df.loc[[0,99,999],['country','lifeExp','gdpPercap']])

# Select data with a row slice plus column names or column positions
# Slices can be used in the row part of both loc and iloc
print(df.loc[2:6, ['country','lifeExp','gdpPercap']])
# Select by row label and column name

print(df.iloc[2:6, [0,3,5]])  # Select by row position and column position

# Note the difference between the two results
print(df.loc[2:6, ['country','lifeExp','gdpPercap']])
# Select by row label and column name
print(df.iloc[2:6, [0,3,5]])  # Select by row position and column position
# Reason for the difference: label-based slicing is closed on both ends, while position-based slicing is left-closed, right-open
# loc[2:6, ...] slices by label, so the rows with labels 2 through 6 are returned (5 rows)
# iloc[2:6, ...] slices by position, so the rows at positions 2 through 5 are returned (4 rows)
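
A quick check of the stated difference, comparing the number of rows returned (a small verification, not part of the original notes):

print(df.loc[2:6].shape[0])   # 5 rows: labels 2 through 6, inclusive on both ends
print(df.iloc[2:6].shape[0])  # 4 rows: positions 2 through 5, right end excluded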

  • Comparison of common Pandas and Python data types:

Pandas type    Python type    Explanation
object         str            String type
int64          int            Integer type
float64        float          Floating-point type
datetime64     datetime       Date/time type; in Python it must be imported from the datetime module
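
As a brief sketch of how these types appear in practice (column names assume the gapminder data loaded above; the to_datetime line is illustrative only, since gapminder has no date column):

print(df.dtypes)                         # string columns such as country show up as 'object'
print(df['year'].astype(float).dtype)    # astype converts a column, here int64 -> float64
# pd.to_datetime(df['some_date_column']) # hypothetical: parse date strings into datetime64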

Pandas statistical calculations

Grouped aggregation:

  • First, group the data
  • Compute statistics for each group, e.g. mean, count, etc.
  • Then combine the results of all the groups
  • The groupby method of DataFrame can be used to perform this grouping/aggregation
print(df.head(10))
# Pattern: print(df.groupby('grouping column name')['lifeExp'].aggregate_function())
print(df.groupby('year')['lifeExp'].mean())
# Compute the average life expectancy for each year (group by the given column, then take the mean of the specified column)

# df.groupby('year') first creates a grouped object;
# printing this grouped DataFrame only shows its type and a memory address
grouped_year_df = df.groupby('year')
print(type(grouped_year_df))
print(grouped_year_df)

# From the DataFrameGroupBy object produced by the grouping,
# pass in a column name to select the data of interest for further calculation.
# To compute the average life expectancy per year we need the lifeExp column,
# which can be selected from the grouped object in the same way as from a DataFrame
grouped_year_df_lifeExp = grouped_year_df['lifeExp']
print(type(grouped_year_df_lifeExp)) 
print(grouped_year_df_lifeExp)

# Finally, compute the mean of the grouped data
mean_lifeExp_by_year = grouped_year_df_lifeExp.mean()
print(mean_lifeExp_by_year)

# Group by multiple columns, then compute the mean of multiple specified columns
# The example above grouped on a single column and averaged lifeExp;
# grouping by several columns and aggregating several columns works the same way
print(df.groupby(['year', 'continent'])[['lifeExp','gdpPercap']].mean())

The code above groups the data by year and continent, then computes the average life expectancy (lifeExp) and average GDP per capita (gdpPercap) for each group

In the output, year, continent and lifeExp, gdpPercap are not on the same row: the two row indexes, year and continent, form a hierarchical structure. The usage of this composite (multi-level) index is described in detail in later chapters

If you want to remove the year/continent hierarchy, use the reset_index method (reset the row index)

multi_group_var = df.groupby(['year', 'continent'])[['lifeExp','gdpPercap']].mean() 
flat = multi_group_var.reset_index()
print(flat.head(15))
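
As a related sketch, the standard DataFrameGroupBy.agg method (not covered in the original notes) can compute several statistics in one call:

multi_stats = df.groupby(['year', 'continent'])[['lifeExp', 'gdpPercap']].agg(['mean', 'count'])
print(multi_stats.head())  # each selected column gets a 'mean' and a 'count' sub-column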

Grouped frequency counts

In data analysis, a common task is to count frequencies

  • The number of unique values in a Pandas Series can be computed with the nunique method

How many distinct countries and regions appear in each continent?

df.groupby('continent')['country'].nunique()
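
For actual frequency counts (how many rows each continent has, rather than how many distinct countries), Series.value_counts can be used; this is a small addition beyond the original notes:

print(df.groupby('continent')['country'].nunique())  # number of distinct countries per continent
print(df['continent'].value_counts())                # number of rows per continent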

Basic plotting

  • Visualization matters at every step of data analysis. When exploring or cleaning data, it helps reveal trends. For example, we computed the average life expectancy for each year:
global_yearly_life_expectancy = df.groupby('year')['lifeExp'].mean()
print(global_yearly_life_expectancy)

  • You can draw a chart with the plot method and reach conclusions more intuitively from the figure
global_yearly_life_expectancy.plot()
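
A slightly fuller sketch, assuming matplotlib is installed (pandas uses it as its default plotting backend):

import matplotlib.pyplot as plt

global_yearly_life_expectancy.plot()      # line plot: year on the x-axis, mean lifeExp on the y-axis
plt.xlabel('year')                        # label the axes for readability
plt.ylabel('average life expectancy')
plt.show()                                # needed to display the figure outside Jupyter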

Summary

This lesson introduced how to load data with the Pandas DataFrame and how to group and aggregate data

pd.read_csv  # Load a CSV file
df.loc       # Select data from a DataFrame by row/column label
df.iloc      # Select data from a DataFrame by row/column position
df.groupby   # Group data
