Learning notes on python modeling (pandas library functions)

pandas Library (reference from https://www.jianshu.com/p/6acafa350208 

https://blog.csdn.net/zengxiantao1994/article/details/75200110 (there are also many learning videos on the beep station)

pandas provides high-performance and easy-to-use data types and analysis tools, which can be used to process csv, tsv, txt, mysql (relational database table), excel, html and other files

First, introduce

import pandas as pd

1. Read data (reference from https://www.bilibili.com/video/BV1UJ411A7Fs?p=2)

# Read the CSV file named ratings
fpath = ".../ratings.csv"      # Write the path
ratings = pd.read_csv(fpath)   # Read data
ratings.head()                 # View the first few rows of data: index, column name and data body from left to right
ratings.shape                  # Output rows and columns
ratings.columns                # View a list of column names
ratings.index                  # View index columns
ratings.dtypes                 # View the data type of each column

# Read the txt file (split by - instead of comma), and the file is pvuv
fpath = ".../access_pvuv.txt"
pvuv = pd.read_csv(
    sep = "\t"                    # Specify address separator \ t
    header = None                 # Unmarked line
    names = ['pdate','pv','uv']   # Set column naming

# Read excel file
fpath =
pvuv = pd.read(fpath)

# Read MySQL
import pymysql
comn = pymysql.commect(
    host = '',
    user = 'root',
    password = '12345678',
    database = 'test',
    charset = 'utf8'
mysql_page = pd.read_sql("select*from crazyant_pvuv",con = conn)   # The first is an sql statement, and the second is to establish a connection

2. Data structure

There are two data types, Series and DataFrame

Series one-dimensional data, one row or one column

(1) The data list can generate simple series (with or without labels)

(2) Dictionary creation Series

(3) Label index can be queried

DataFrame is a two-dimensional array

(1) Create a dataframe from multiple dictionary sequences

(2) Query series from dataframe

If only one row and one column are queried, PD. Is returned Series

If multiple rows and columns are queried, PD DataFrame

3. Data query

df.loc method

(1) Use the value of a single label

(2) Batch query using value list

(3) Range query using numerical range

(4)!!! Query with conditional expression!!

Returns a Boolean value

(5) Call function query

4. New data column

(1) Direct replication method

The temperature is listed as C, which is required to be removed. Replace c with empty.

df.loc[:,"wencha"] = df["bWendu"] - df["yWendu"]     # Add a new temperature difference column, which is the highest temperature minus the lowest temperature

(2)df.apply method

Pass a function to return series

(3)df.assign method

Without modifying df itself, a new object is returned

(4) Assign values respectively according to conditional selection

5. Data statistics function

(1) Summary statistics (value type)

df. Discriminate() calculates the minimum, maximum and average values

Average: df["column name"]. mean()

Maximum: df["column name"]. max()

Minimum value: df["column name"]. min()

(2) Unique de duplication and number of columns by value (not used for numerical columns, but for enumeration and classification columns)

De duplication: df["column name"]. unique

Count by value: df ["column name"]. value_counts()

Count by value is the statistics of how many times each name appears, which will be output from large to small.

(3) Correlation coefficient and covariance


Covariance matrix: DF cov()

Correlation coefficient matrix: DF corr()

View the correlation coefficient of two columns separately: df ["column name"]. corr(df ["column name"])

6. Pandas's handling of missing values of functions( https://www.bilibili.com/video/BV1UJ411A7Fs?p=7)

isnull and notnull: detect whether it is empty

dropna: delete missing values:

  • axis: delete row or column, {0 or 'index', 1 or 'columns'}, default 0
  • how: if it is equal to any, any value will be deleted if it is empty. If it is equal to all, all values will be deleted if they are empty
  • inplace: if True, modify the current df; otherwise, return a new df

fillna: fill in null values

  • Value: the value used for filling. It can be a single value or a dictionary (key is the column name and value is the value)
  • method: equal to fill, fill forword fill with the previous non empty value; Equal to bfill, backword fill is filled with the last non empty value
  • axis: Fill by row or column, {0 or 'index', 1 or 'columns'}
  • inplace: if True, modify the current df; otherwise, return a new df

7. SettingWithCopyWarning alarm resolution


8. Data sorting


import pandas as pd

# Series sort
# Ascending: the default is True ascending sorting and False descending sorting. inplace: modify original Series

#Sorting of DataFrame
# by: string or list, single column sorting or multi column sorting; ascending: Boolean or list; inplace: modify the original DataFrame

# Multi column sorting
pd.DataFrame.sort_values(by=["Column name"."Column name"])

9. String processing

First get the str attribute of Series and use various string processing functions.

Application field: 1 Use the Series bool classes such as startwith and contain of str to perform conditional queries.

                   2. Chain operation that requires multiple str processing.

                   3. Processing using regular expressions.

10. Understanding of axis parameters( https://www.jianshu.com/p/4f18e8327872)

0/index refers to a row, and aggregation refers to cross rows. 1/columns refers to a column, and aggregation refers to cross columns

import numpy as np
df = pd.DataFrame(
    columns = ['A','B','C','D']
df.drop("A", axis=1)       # Delete column
df.drop(1, axis=0)         # Delete row
df.mean(axis=1)            # If an axis is specified, it will move. It is similar to being traversed by for. Other axes mean unchanged. 1 refers to the column, which is the average number of output columns.
df.sum(axis=0)             # Sum each column of data
df.sum(axis=1)             # Sum each row of data

Note here that after deleting, summing and averaging operations, you need to assign the operation result to the data of another dataframe.

import pandas as pd
import numpy as np
df = pd.DataFrame(
    columns = ['A','B','C','D']

The result is:

11. Purpose of Pandas index

Purpose: more convenient data query; Performance improvement; Automatic data alignment; More powerful data structure support

(1) First five lines of query


(2) improve query performance:

If the index is unique, Pandas will use the hash table to optimize, and the time complexity is O(1)

If it is not unique, it will be queried by dichotomy, and the performance is O(logN)

If it is completely random, the whole table must be scanned for each query, and the query performance is O(N)

(3) Automatic alignment, automatic alignment when the list is added or subtracted

# index auto align data
s1 = pd.Series([1,2,3],index=list("abc"))
s2 = pd.Series([2,3,4],index=list("bcd"))

(4) data structure support

Categorialindex, an Index based on classified data, improves performance.

MultiIndex, a multi-dimensional index, is used for groupby, mostly the results after aggregation, etc.

DatetimeIndex, time type index, powerful date and time method support.

12. Merge syntax (merge)

Quantity alignment of merge

One to one, the associated keywords are unique

One to many, the associated keywords are unique on the left and not unique on the right

Many to many, the associated keywords are not unique, and the number of results is M*N

The difference between the four join s

13. Pandas realizes data merging concat

Application scenario: batch merge Excel with the same format, add rows to DataFrame, and add columns to DataFrame.

That is, merge multiple Pandas objects (DataFrame\Series) into one along an axis (axis=0/1) using a merge method (inner/outer).

# objs refers to the objects to be merged. When axis is 0, it is merged by row, and when axis is 1, it is merged by column. join is alignment, ignore_index refers to whether the original index is ignored.

# append only merges by row, not by column
# other refers to a single dataframe, series,dict or list.

Use join=inner to filter out mismatched columns; Using axis=1 is equivalent to adding a new column

The list parameter of concat is very flexible. It can be series or dataframe.

Reference from https://gitee.com/antpython/ant-learn-pandas/blob/master/14.%20Pandas%E5%AE%9E%E7%8E%B0%E6%95%B0%E6%8D%AE%E7%9A%84%E5%90%88%E5%B9%B6concat.ipynb

Keywords: Python Mathematical Modeling

Added by bache on Sun, 19 Dec 2021 21:27:25 +0200