After going deep into the source code of Python interpreter, I finally understood the principle of string persistence

In order to perform well and achieve excellent performance, each programming language needs a lot of compiler level and interpreter level optimization.

Since string is an indispensable part of any programming language, if you have the ability to operate string quickly, you can quickly improve the overall performance.

In this article, we will delve into the internal implementation of Python and learn how Python uses a technology called String Interning to achieve high performance of the interpreter. The purpose of this article is not only to introduce the internal knowledge of python, but also to enable readers to easily browse the source code of Python; Therefore, there will be a lot of code snippets from CPython in this article.

The outline of the full text is as follows:

1. What is "string resident"?

String persistence is an optimization method of compiler / interpreter. It saves space and time of string processing tasks by caching general strings.

Instead of creating a new copy of the string every time, this optimization method keeps only one copy of the string for each appropriate immutable value and references it with a pointer.

The unique copy of each string is called its intern, hence the name String Interning.

Python cat note: String Interning is generally translated as "string resident" or "string retention". In some languages, it may be used to the concept of StringPool (string constant pool), which is actually a different expression of the same mechanism. Intern as a noun means "intern, intern", which can be understood as "resident, resident value".

The method to find the string intern may or may not be exposed as an exposed interface. Modern programming languages, such as Java, Python, PHP, Ruby, Julia, etc., support string persistence to enable their compilers and interpreters to achieve high performance.

2. Why do you want to resident strings?

String persistence improves the speed of string comparison. If there is no dwell, when we want to compare whether the two strings are equal, its time complexity will rise to O(n), that is, we need to check each character in the two strings to determine whether they are equal.

However, if the string is fixed, because the same string will use the same object reference, it is enough to judge whether the two strings are equal by just checking whether the pointers are the same. There is no need to check each character one by one. Since this is a very common operation, it is typically implemented as a pointer equality check, using only a machine instruction with no memory reference at all.

String persistence reduces memory usage. Python avoids flooding the memory with redundant string objects and optimizes the memory footprint by sharing and reusing defined objects through meta design patterns.

3. Python string resident

Like most other modern programming languages, python uses string persistence to improve performance. In Python, we can use the is operator to check whether two objects refer to the same memory object.

If the is operator is True, the two objects will reference the same string, otherwise, the is operator will reference the same string.

>>> 'python' is 'python'
True

We can use this particular operator to determine which strings are resident. In CPython, string persistence is realized through the following functions, which are declared in Unicode object H, defined in Unicode object C.

PyAPI_FUNC(void) PyUnicode_InternInPlace(PyObject **);

To check whether a string is resident, CPython implements a string called pyunicode_ CHECK_ The macros of interconnected are also defined in Unicode object H medium.

This macro indicates that Python maintains a member variable named interned in the pyasiioobject structure, and its value indicates whether the corresponding string is resident.

#define PyUnicode_CHECK_INTERNED(op) \
    (((PyASCIIObject *)(op))->state.interned)

4. Principle of string persistence

In CPython, references to strings are stored, accessed, and managed by a Python dictionary called interconnected. The dictionary is initialized with delay the first time a string is called and holds references to all the resident string objects.

4.1 how does a string reside?

The core function responsible for resident strings is PyUnicode_InternInPlace, which is defined in Unicode object In C, when called, it will create a dictionary interned to accommodate all resident strings, and then register the object in the parameter so that its key and value use the same object reference.

The following function snippet shows how Python implements string persistence.

void
PyUnicode_InternInPlace(PyObject **p)
{
    PyObject *s = *p;

    .........

    // Lazily build the dictionary to hold interned Strings
    if (interned == NULL) {
        interned = PyDict_New();
        if (interned == NULL) {
            PyErr_Clear();
            return;
        }
    }

    PyObject *t;

    // Make an entry to the interned dictionary for the
    // given object
    t = PyDict_SetDefault(interned, s, s);

    .........
    
    // The two references in interned dict (key and value) are
    // not counted by refcnt.
    // unicode_dealloc() and _PyUnicode_ClearInterned() take
    // care of this.
    Py_SET_REFCNT(s, Py_REFCNT(s) - 2);

    // Set the state of the string to be INTERNED
    _PyUnicode_STATE(s).interned = SSTATE_INTERNED_MORTAL;
}

4.2 how to clean up resident strings?

The cleanup function traverses all strings from the interconnected dictionary, adjusts the reference count of these objects, and marks them as NOT_INTERNED to be garbage collected. Once all strings are marked not_ Intersect, the interned dictionary will be cleared and deleted.

This cleanup function is**_ PyUnicode_ Clearinterconnected * *, in Unicode object C.

void
_PyUnicode_ClearInterned(PyThreadState *tstate)
{
    .........

    // Get all the keys to the interned dictionary
    PyObject *keys = PyDict_Keys(interned);

    .........

    // Interned Unicode strings are not forcibly deallocated;
    // rather, we give them their stolen references back
    // and then clear and DECREF the interned dict.

    for (Py_ssize_t i = 0; i < n; i++) {
        PyObject *s = PyList_GET_ITEM(keys, i);

        .........

        switch (PyUnicode_CHECK_INTERNED(s)) {
        case SSTATE_INTERNED_IMMORTAL:
            Py_SET_REFCNT(s, Py_REFCNT(s) + 1);
            break;
        case SSTATE_INTERNED_MORTAL:
            // Restore the two references (key and value) ignored
            // by PyUnicode_InternInPlace().
            Py_SET_REFCNT(s, Py_REFCNT(s) + 2);
            break;
        case SSTATE_NOT_INTERNED:
            /* fall through */
        default:
            Py_UNREACHABLE();
        }

        // marking the string to be NOT_INTERNED
        _PyUnicode_STATE(s).interned = SSTATE_NOT_INTERNED;
    }

    // decreasing the reference to the initialized and
    // access keys object.
    Py_DECREF(keys);

    // clearing the dictionary
    PyDict_Clear(interned);

    // clearing the object interned
    Py_CLEAR(interned);
}

5. Implementation of string resident

Now that we understand the internal principle of string persistence and cleaning, we can find all strings that will be resident in Python.

To do this, all we have to do is look for pyunicode in CPython source code_ Call the interninplace function and view the code near it. Here are some interesting discoveries about string persistence in Python.

5.1 variable, constant and function names

CPython performs string dwell on constants (such as function name, variable name, string literal, etc.).

The following code is from codeobject c. It indicates that when a new PyCode object is created, the interpreter will reside on all compile time constants, names, and literals.

PyCodeObject *
PyCode_NewWithPosOnlyArgs(int argcount, int posonlyargcount, int kwonlyargcount,
                          int nlocals, int stacksize, int flags,
                          PyObject *code, PyObject *consts, PyObject *names,
                          PyObject *varnames, PyObject *freevars, PyObject *cellvars,
                          PyObject *filename, PyObject *name, int firstlineno,
                          PyObject *linetable)
{

    ........

    if (intern_strings(names) < 0) {
        return NULL;
    }

    if (intern_strings(varnames) < 0) {
        return NULL;
    }

    if (intern_strings(freevars) < 0) {
        return NULL;
    }

    if (intern_strings(cellvars) < 0) {
        return NULL;
    }

    if (intern_string_constants(consts, NULL) < 0) {
        return NULL;
    }

    ........

}

5.2 dictionary keys

CPython also hosts the string keys of any dictionary object.

When an element is inserted into the dictionary, the interpreter will string the key of the element. The following code is from dictobject c. Shows the actual behavior.

Interesting place: in pyunicode_ There is a comment where the interninplace function is called. It asks, do we really need to reside all the keys in all dictionaries?

int
PyDict_SetItemString(PyObject *v, const char *key, PyObject *item)
{
    PyObject *kv;
    int err;
    kv = PyUnicode_FromString(key);
    if (kv == NULL)
        return -1;

    // Invoking String Interning on the key
    PyUnicode_InternInPlace(&kv); /* XXX Should we really? */

    err = PyDict_SetItem(v, kv, item);
    Py_DECREF(kv);
    return err;
}

5.3 properties of any object

The properties of an object in Python can be set explicitly through the setattr function, implicitly as part of a class member, or predefined in its data type.

CPython will host all these attribute names for quick lookup. The following is the function pyobject_ The code fragment of setattr, which is defined in the file object C is responsible for setting new properties for Python objects.

int
PyObject_SetAttr(PyObject *v, PyObject *name, PyObject *value)
{

    ........

    PyUnicode_InternInPlace(&name);

    ........
}

5.4 explicitly resident

Python also supports explicit string persistence through the intern function in the sys module.

When this function is called with any string object, the string object will be resident. Here is sysmodule C file, which shows the code snippet in sys_ intern_ String resident procedure in impl function.

static PyObject *
sys_intern_impl(PyObject *module, PyObject *s)
{

    ........

    if (PyUnicode_CheckExact(s)) {
        Py_INCREF(s);
        PyUnicode_InternInPlace(&s);
        return s;
    }

    ........
}

6. Other findings of string persistence

Only compile time strings will be resident. Strings specified during interpretation or compilation will be resident, while dynamically created strings will not.

Python cat note: this rule is worth thinking about. I've stepped on it... There are two knowledge points that I believe 99% of people don't know: the join() method of string creates string dynamically, so the created string will not be resident; The constant folding mechanism also occurs at compile time, so it is sometimes easy to confuse it with string persistence. It is recommended to read the magical use of join () method and the weakness of Intern mechanism

Strings containing ASCII characters and underscores are resident. During compilation, CPython ensures that only regular expressions [a-zA-Z0-9_] match when string literals are resident* Because they are very close to Python identifiers.

About Python technology reserve

It's good to learn Python well, whether in employment or sideline, but to learn python, you still need to have a learning plan. Finally, let's share a full set of Python learning materials to help those who want to learn Python!

1, Python learning routes in all directions

All directions of Python is to sort out the commonly used technical points of Python and form a summary of knowledge points in various fields. Its purpose is that you can find corresponding learning resources according to the above knowledge points to ensure that you learn more comprehensively.

2, Learning software

If a worker wants to do well, he must sharpen his tools first. The commonly used development software for learning Python is here, which saves you a lot of time.

3, Getting started video

When we watch videos, we can't just move our eyes and brain without hands. The more scientific learning method is to use them after understanding. At this time, the hand training project is very suitable.

4, Actual combat cases

Optical theory is useless. We should learn to knock together and practice, so as to apply what we have learned to practice. At this time, we can make some practical cases to learn.

5, Interview materials

We must learn Python in order to find a high paying job. The following interview questions are the latest interview materials from front-line Internet manufacturers such as Alibaba, Tencent and byte, and Alibaba boss has given authoritative answers. After brushing this set of interview materials, I believe everyone can find a satisfactory job.


This complete set of Python learning materials has been uploaded to CSDN. Friends can scan the official authentication QR code of CSDN below on wechat and get it for free [guaranteed to be 100% free]

Keywords: Python Programmer crawler

Added by HK2ALL on Wed, 02 Feb 2022 11:26:59 +0200