[actual measurement] speed comparison of string search in Python and C + +

Full format link: https://blog.imakiseki.cf/2022/03/07/techdev/python-cpp-string-find-perf-test/

background

Recently, in preparation for an algorithm competition, the language mistakenly chose python, so we had no choice but to start language migration for common scenes. The scene of string search appears from time to time in the algorithm competition. This paper compares the speed of this scene in Python and C + +, a common language in the competition, and provides relevant parameters and running results for others' reference.

parameter

Hardware and operating system

                   -`                    root@<hostname>
                  .o+`                   ------------
                 `ooo/                   OS: Arch Linux ARM aarch64
                `+oooo:                  Host: Raspberry Pi 4 Model B
               `+oooooo:                 Kernel: 5.16.12-1-aarch64-ARCH
               -+oooooo+:                Uptime: 3 hours, 32 mins
             `/:-:++oooo+:               Packages: 378 (pacman)
            `/++++/+++++++:              Shell: zsh 5.8.1
           `/++++++++++++++:             Terminal: /dev/pts/0
          `/+++ooooooooooooo/`           CPU: (4) @ 1.500GHz
         ./ooosssso++osssssso+`          Memory: 102MiB / 7797MiB
        .oossssso-````/ossssss+`
       -osssssso.      :ssssssso.
      :osssssss/        osssso+++.
     /ossssssss/        +ssssooo/-
   `/ossssso+/:-        -:/+osssso+-
  `+sso+:-`                 `.-/+oso:
 `++:.                           `-/+/
 .`                                 `/

Compilation environment and interpretation environment

Python
- Interpreter: Python 3.10.2 (main, Jan 23 2022, 21:20:14) [GCC 10.2.0] on linux
- Interactive environment: IPython 8.0.1
C++
- Compiler: g++ (GCC) 11.2.0
- Compile command: G + + test cpp -Wall -O2 -g -std=c++11 -o test

scene

Two scenarios are set in this measurement: the character distribution of the source string in scenario 1 is generated by the pseudo-random number generator, which represents the average situation of string search; The source string of Scene 2 can be continuously divided into 20000 character segments with a length of 50, of which the 15001st is the pattern string, such as "ab... b" (1 "a", 49 "b"), and the rest of the character segments are such as "ab... c" (1 "a", 48 "b", 1 "c").

project	Scenario 1: Average	Scenario 2: worse case
character set	Lowercase letters	abc
Character distribution	random.choice	Strong regularity
Source string length	1,000,000	1,000,000
Mode string length	1,000	50
Mode string occurrence position	250,000,500,000,750,000	750,000
Number of occurrences of mode string	1	1

test method

In this measurement, Python language uses the built-in type str find() member function, C + + language uses string class respectively find() member function, strstr standard library function and KMP algorithm implemented by users.

Test object	Core code
Python	src.find(pat)
C++ - test.cpp	src.find(pat)
C++ - test_strstr.cpp	strstr(src, pat)
C++ - test_kmp.cpp	KMP(src, pat)

source code

Generate source string and mode string

import random

# Scenario 1:
# Source string
s = "".join(chr(random.choice(range(ord("a"), ord("z") + 1))) for _ in range(1000000))
# List of pattern strings. Each of the three elements corresponds to a pattern string
p = [s[250000:251000], s[500000:501000], s[750000:751000]]

# Scenario 2:
# Pattern string
p = 'a' + 'b' * 49
# Other character segments
_s = "a" + "b" * 48 + "c"
# Source string
s = _s * 15000 + p + _s * 4999

# Stored in a file for easy access by C + + programs
with open('source.in', 'w') as f:
    f.write(s)
with open('pattern.in', 'w') as f:
    f.write(p[0])

Test code

Python

In []: %timeit s.find(p[0])

C++ - test.cpp

#include <chrono>
#include <iostream>
#include <cstring>
#include <fstream>
#define LOOP_COUNT (1000)
using namespace std;
using std::chrono::high_resolution_clock;
using std::chrono::duration_cast;
using std::chrono::duration;
using std::chrono::milliseconds;

double test(string s, string p, size_t* pos_ptr) {
    auto t1 = high_resolution_clock::now();
    *pos_ptr = s.find(p);
    auto t2 = high_resolution_clock::now();
    duration<double, milli> ms_double = t2 - t1;
    return ms_double.count();
}

int main() {
    string s, p;
    size_t pos;
    ifstream srcfile("source.in");
    ifstream patfile("pattern.in");
    srcfile >> s;
    patfile >> p;

    double tot_time = 0;
    for (int i = 0; i < LOOP_COUNT; ++i) {
        tot_time += test(s, p, &pos);
    }

    cout << "Loop count:            " << LOOP_COUNT << endl;
    cout << "Source string length:  " << s.length() << endl;
    cout << "Pattern string length: " << p.length() << endl;
    cout << "Search result:         " << pos << endl;
    cout << "Time:                  " << tot_time / LOOP_COUNT << " ms" << endl;

    return 0;
}

C++ - test_strstr.cpp

#include <chrono>
#include <iostream>
#include <cstring>
#include <fstream>
#define LOOP_COUNT (1000)
using namespace std;
using std::chrono::high_resolution_clock;
using std::chrono::duration_cast;
using std::chrono::duration;
using std::chrono::milliseconds;
char s[1000005], p[1005], *pos=NULL;

double test(char* s, char* p, char** pos_ptr) {
    auto t1 = high_resolution_clock::now();
    *pos_ptr = strstr(s, p);
    auto t2 = high_resolution_clock::now();
    duration<double, milli> ms_double = t2 - t1;
    return ms_double.count();
}

int main() {
    ifstream srcfile("source.in");
    ifstream patfile("pattern.in");
    srcfile >> s;
    patfile >> p;

    double tot_time = 0;
    for (int i = 0; i < LOOP_COUNT; ++i) {
        tot_time += test(s, p, &pos);
    }

    cout << "Loop count:            " << LOOP_COUNT << endl;
    cout << "Source string length:  " << strlen(s) << endl;
    cout << "Pattern string length: " << strlen(p) << endl;
    cout << "Search result:         " << pos - s << endl;
    cout << "Time:                  " << tot_time / LOOP_COUNT << " ms" << endl;

    return 0;
}

C++ - test_kmp.cpp

#include <chrono>
#include <iostream>
#include <cstring>
#include <fstream>
#include <cstdlib>
#define LOOP_COUNT (1000)
using namespace std;
using std::chrono::high_resolution_clock;
using std::chrono::duration_cast;
using std::chrono::duration;
using std::chrono::milliseconds;
int dp[1005];

int KMP(string s, string p) {
    int m = s.length(), n = p.length();
    if (n == 0) return 0;
    if (m < n) return -1;
    memset(dp, 0, sizeof(int) * (n+1));
    for (int i = 1; i < n; ++i) {
        int j = dp[i+1];
        while (j > 0 && p[j] != p[i]) j = dp[j];
        if (j > 0 || p[j] == p[i]) dp[i+1] = j + 1;
    }
    for (int i = 0, j = 0; i < m; ++i)
        if (s[i] == p[j]) { if (++j == n) return i - j + 1; }
        else if (j > 0) {
            j = dp[j];
            --i;
        }
    return -1;
}

double test(string s, string p, int* pos_ptr) {
    auto t1 = high_resolution_clock::now();
    *pos_ptr = KMP(s, p);
    auto t2 = high_resolution_clock::now();
    duration<double, milli> ms_double = t2 - t1;
    return ms_double.count();
}

int main() {
    string s, p;
    int pos;
    ifstream srcfile("source.in");
    ifstream patfile("pattern.in");
    srcfile >> s;
    patfile >> p;

    double tot_time = 0;
    for (int i = 0; i < LOOP_COUNT; ++i) {
        tot_time += test(s, p, &pos);
    }

    cout << "Loop count:            " << LOOP_COUNT << endl;
    cout << "Source string length:  " << s.length() << endl;
    cout << "Pattern string length: " << p.length() << endl;
    cout << "Search result:         " << pos << endl;
    cout << "Time:                  " << tot_time / LOOP_COUNT << " ms" << endl;

    return 0;
}

result

The difference between the average execution time of the command% timeython and the average execution time of the command% timeython. C + + code takes the average time after running each mode string for 1000 times.

Unless otherwise specified, the following times are in microseconds and reserved to the whole digit.

scene	Mode string occurrence position	Python	C++ - test.cpp	C++ - test_strstr.cpp	C++ - test_kmp.cpp
Scenario 1	250,000	105	523	155	2564
Scenario 1	500,000	183	1053	274	3711
Scenario 1	750,000	291	1589	447	4900
Scenario 2	750,000	2630*	618	353	3565

*The original output is "2.63 ms". The average value of IPython's% timeit output retains 3 significant digits. Since this time has exceeded 1 millisecond, the microsecond bit is discarded. The unit here is still microseconds, and the value is recorded as "2630".

limitations

The hardware of the equipment used in this measurement is inferior to the standard configuration machine in the algorithm competition, and the "absolute value" in the measured results has low reference.

summary

According to the results in the above table, under the given environment and relevant parameters, the running time of Python in scenario 1 is about one fifth of that of string::find in C + +, which is close to std:strstr; In scenario 2, the running time of Python increases significantly, but the running time of the first two test methods of C + + is close to or even shorter than that before. In the four tests, the running time of KMP algorithm implemented by C + + users is longer than that of Python under the same conditions.

The fast lookup (. find()) and count (. count()) algorithms for the built-in type str in Python are based on Boyer Moore algorithm and Horspool algorithm The latter is the simplification of the former, while the former and Knuth Morris Pratt algorithm of

For information about the fact that string::find in C + + runs longer than STD:: STR, see Bug 66414 - string::find ten times slower than strstr.

It is noteworthy that the running time of KMP algorithm implemented in C + + is much longer than that in C + + standard library and even Python. This is also similar to the situation that "the running efficiency of self-designed assembly code is lower than that of compiler". A problem of Stack Overflow strstr faster than algorithms? There's someone down there answer As follows:

Why do you think strstr should be slower than all the others? Do you know what algorithm strstr uses? I think it's quite likely that strstr uses a fine-tuned, processor-specific, assembly-coded algorithm of the KMP type or better. In which case you don't stand a chance of out-performing it in C for such small benchmarks.

KMP algorithm is not the fastest of all linear complexity algorithms. In different environments (software and hardware, test data, etc.), KMP, its variants and even other linear complexity algorithms cannot be judged. The compiler takes into account many possible factors in the design, so that there can be relatively better strategies to get results in different environments as far as possible. Therefore, under the condition of ensuring the correct results, it is better to directly use the functions provided in the standard library rather than write them according to the algorithm principle.

At the same time, the actual measurement also confirms the statement that Python is not suitable to achieve high results in the algorithm competition from the perspective of running time.

reference resources

Keywords: Python C++

Added by MBDesktop on Mon, 07 Mar 2022 11:44:22 +0200

Programming VIP