How to improve computing efficiency in pandas

Preface

Pandas is designed for vectorized operations that process an entire row or column at once. Looping over individual cells, rows, or columns is not what it was built for. When using pandas, you should therefore think in terms of matrix operations that can be highly parallelized.
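
As a minimal sketch of the difference, the example below computes the same result with a row-by-row loop and with a single column expression. The DataFrame and its column names (price, quantity, total) are hypothetical and only serve as an illustration.

```python
import pandas as pd

# Hypothetical sample data; the column names are illustrative only.
df = pd.DataFrame({"price": [10.0, 20.0, 30.0], "quantity": [1, 2, 3]})

# Loop version: visits one row at a time through Python-level indexing.
totals_loop = []
for i in range(len(df)):
    totals_loop.append(df["price"].iloc[i] * df["quantity"].iloc[i])

# Vectorized version: one expression operates on whole columns at once.
df["total"] = df["price"] * df["quantity"]

print(df["total"].tolist())
```

Both versions produce the same values; the vectorized form leaves the iteration to pandas instead of the Python interpreter.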

1, Avoid using for loops

Where possible, retrieve data directly by row or column position instead of stepping through it with a for loop.

1.1 using the for loop

import datetime

import pandas as pd

path = r'E:\Scientific research documents\shiyan\LZQ\LZQ_all_sampledata.csv'

def read_csv(target_csv):
    # Read the CSV file passed in, treating every line as data (no header row)
    target = pd.read_csv(target_csv, header=None, sep=',')
    return target

start_time = datetime.datetime.now()
a = read_csv(path)
# Fetch the first 10000 rows one at a time
for i in range(10000):
    b = a.iloc[i]
end_time = datetime.datetime.now()

# Elapsed time, including the time spent reading the CSV
print(end_time - start_time)

Time: 0:00:02.455211

1.2 using the row number to index directly

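import datetime

import pandas as pd

path = r'E:\Scientific research documents\shiyan\LZQ\LZQ_all_sampledata.csv'

def read_csv(target_csv):
    # Read the CSV file passed in, treating every line as data (no header row)
    target = pd.read_csv(target_csv, header=None, sep=',')
    return target

start_time = datetime.datetime.now()

a = read_csv(path)

# A single positional lookup: fetches only the row at position 10000
b = a.iloc[10000]

end_time = datetime.datetime.now()

# Elapsed time, including the time spent reading the CSV
print(end_time - start_time)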

Time: 0:00:00.464756
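
The comparison above times a loop of single-row lookups against one lookup. When many rows are needed, they can also be fetched in one indexing call. The following is a minimal sketch under the assumption of a small synthetic DataFrame standing in for the CSV, which is not available here.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for the CSV (assumption: numeric data, default integer index).
a = pd.DataFrame(np.arange(60).reshape(20, 3))

# Loop: fetch the first 10 rows one at a time.
rows = [a.iloc[i] for i in range(10)]

# Single indexing call: one positional slice returns the same 10 rows.
first_ten = a.iloc[:10]

# A list of positions also works for rows that are not contiguous.
picked = a.iloc[[0, 5, 19]]

print(first_ten.shape, picked.index.tolist())
```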

2, Improve efficiency with for loop

2.0 how to improve efficiency if the for loop must be used

The simplest and most valuable speed-up is to use pandas's built-in iterrows() function.

In the previous section, the for loop used range() together with iloc, so every iteration performed a separate positional lookup. When looping over a large range of values in Python, a generator is often the more efficient choice.
The pandas iterrows() function is implemented as a generator that yields one row of the DataFrame per iteration. More precisely, iterrows() yields an (index, Series) pair (a tuple) for each row. The loop has the same shape as one written with enumerate() in plain Python, and in the timings below it runs about twice as fast as the range() version.
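
To make the (index, Series) pairs concrete, here is a small sketch. The DataFrame, its column names (x, y), and its index labels (r0, r1) are hypothetical and chosen only for illustration.

```python
import pandas as pd

# Hypothetical data; column names and index labels are for illustration only.
df = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0]}, index=["r0", "r1"])

row_iter = df.iterrows()       # nothing is produced until we ask for a row
index, row = next(row_iter)    # first (index, Series) pair

print(index)                   # index label of the first row
print(type(row).__name__)      # each row arrives as a pandas Series

# Typical use in a loop: unpack the pair and read values by column label.
sums = {idx: r["x"] + r["y"] for idx, r in df.iterrows()}
print(sums)
```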

Generators
A generator function lets you declare a function that behaves like an iterator, so it can be used in a for loop. This keeps the code simple and uses less memory than building a full list and looping over it.

The problem appears when you need to handle a huge list, for example 1 billion floating-point numbers. If you loop over a list, the entire list has to be created in memory first. Not everyone has unlimited RAM to store such things!

A generator creates each element only when it is needed, one at a time. If you have to produce 1 billion floating-point numbers, only one of them is held in memory at any moment. In Python 2, xrange() worked lazily in this way while range() built a full list; in Python 3, range() is itself a lazy sequence that computes its values on demand.

That said, if you want to iterate over the values many times and they are small enough to fit in memory, it can be better to build a list once and loop over it. A generator must be re-created and must recompute its values on every pass, whereas a list is static and its values are already in memory for quick access.
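
The trade-off described in this section can be sketched in plain Python 3. The function name squares and the sizes used are hypothetical choices for the illustration; sys.getsizeof reports only the size of the container object itself, not of the elements it refers to.

```python
import sys

def squares(n):
    """Generator function: yields one square at a time instead of building a list."""
    for i in range(n):
        yield i * i

# The list holds all values at once; the generator holds only its current state.
as_list = [i * i for i in range(100_000)]
as_gen = squares(100_000)
print(sys.getsizeof(as_list) > sys.getsizeof(as_gen))

# A generator is exhausted after one pass; a list can be iterated repeatedly.
total_first = sum(as_gen)
total_second = sum(as_gen)     # nothing left to yield on the second pass

# In Python 3, range() is lazy too: no billion-element list is ever built.
lazy = range(10**9)
print(len(lazy), lazy[5])
```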

2.1 using range

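import datetime

import pandas as pd

path = r'E:\Scientific research documents\shiyan\LZQ\LZQ_all_sampledata.csv'

def read_csv(target_csv):
    # Read the CSV file passed in, treating every line as data (no header row)
    target = pd.read_csv(target_csv, header=None, sep=',')
    return target

start_time = datetime.datetime.now()

a = read_csv(path)

# Visit every row by position: one iloc lookup per iteration
for data_row in range(a.shape[0]):
    b = a.iloc[data_row]

end_time = datetime.datetime.now()

# Elapsed time, including the time spent reading the CSV
print(end_time - start_time)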

Time: 0:00:07.642816

2.2 use iterrows() instead of range

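import datetime

import pandas as pd

path = r'E:\Scientific research documents\shiyan\LZQ\LZQ_all_sampledata.csv'

def read_csv(target_csv):
    # Read the CSV file passed in, treating every line as data (no header row)
    target = pd.read_csv(target_csv, header=None, sep=',')
    return target

start_time = datetime.datetime.now()

a = read_csv(path)

# iterrows() yields an (index, Series) pair for each row
for index, data_row in a.iterrows():
    b = data_row

end_time = datetime.datetime.now()

# Elapsed time, including the time spent reading the CSV
print(end_time - start_time)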

Time: 0:00:03.513161
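
The timings in sections 2.1 and 2.2 include the time needed to read the CSV. The sketch below times only the two loops. It assumes a small synthetic DataFrame in place of the original file, and the helper names (time_it, loop_with_range, loop_with_iterrows) are hypothetical. The numbers it prints depend on the machine and the pandas version.

```python
import time

import numpy as np
import pandas as pd

# Synthetic stand-in for the CSV (assumption: 2,000 rows of floats).
a = pd.DataFrame(np.random.rand(2000, 4))

def time_it(loop_fn):
    """Return (elapsed seconds, rows visited) for one run of loop_fn."""
    start = time.perf_counter()
    visited = loop_fn()
    return time.perf_counter() - start, visited

def loop_with_range():
    count = 0
    for i in range(a.shape[0]):
        _ = a.iloc[i]          # one positional lookup per row
        count += 1
    return count

def loop_with_iterrows():
    count = 0
    for _, _row in a.iterrows():
        count += 1
    return count

t_range, n_range = time_it(loop_with_range)
t_iterrows, n_iterrows = time_it(loop_with_iterrows)
print(f"range + iloc: {t_range:.4f}s, iterrows: {t_iterrows:.4f}s")
```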

3, Use apply

The iterrows() function improves the speed considerably, but we are far from finished. Always remember that when using a library designed for vector operations, there is often a way to do the task most efficiently without any for loop at all.

The pandas function that gives us this capability is apply(). The apply() function takes another function as input and applies it along an axis of the DataFrame (rows or columns). When passing that function, a lambda is usually a convenient way to wrap everything together.
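
A minimal sketch of apply() with a lambda follows. The DataFrame and its column names (width, height, area) are hypothetical and used only for illustration.

```python
import pandas as pd

# Hypothetical data; the column names are illustrative only.
df = pd.DataFrame({"width": [1.0, 2.0, 3.0], "height": [4.0, 5.0, 6.0]})

# axis=1 passes each row to the lambda as a Series.
df["area"] = df.apply(lambda row: row["width"] * row["height"], axis=1)

# axis=0 (the default) passes each column instead.
col_max = df[["width", "height"]].apply(lambda col: col.max())

# For simple arithmetic, a column expression gives the same result without apply().
area_vectorized = df["width"] * df["height"]

print(df["area"].tolist(), col_max.tolist())
```

With axis=1 the function is still called once per row, so for simple arithmetic like this the plain column expression is usually the faster choice.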

Added by miancu on Tue, 08 Feb 2022 02:48:49 +0200