Python source code analysis - String objects in Python

1. Preface

The chapter [integer object in Python] explained fixed-length objects in detail. Next, we introduce variable-length objects; the string type is a typical representative of them.

A concept must be introduced here:

There are two types of variable length objects in Python:

  • Mutable variable-length objects - for example, list, whose elements can be added and deleted after creation
  • Immutable variable-length objects - for example, str and tuple, which no longer support adding, deleting, or similar operations after creation
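
The difference between the two kinds can be seen in a short sketch. It is written for Python 3, and the variable names are only illustrative:

```python
# A list is mutable: it can grow in place and stays the same object.
items = [1, 2]
before = id(items)
items.append(3)
same_object = id(items) == before

# A str is immutable: item assignment is rejected with a TypeError.
text = "Hello"
try:
    text[0] = "J"
    str_error = None
except TypeError as exc:
    str_error = exc

# A tuple is immutable as well: it has no append method at all.
point = (1, 2)
tuple_has_append = hasattr(point, "append")

print(same_object, type(str_error).__name__, tuple_has_append)
# True TypeError False
```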

2. A first look at PyStringObject

PyStringObject is the implementation of the string object. It is a variable-length object, but this only means that the length is not fixed in advance when a string object is created. Once the object is created, its length is fixed and cannot be changed.

For example:

test_str = "Hello World"
test_url = "https://www.xtuz.net"

Obviously, test_str and test_url have different lengths. This is because PyStringObject does not restrict the length when a string object is created. After creation, the character data held inside the object is never modified.

We can also see this in the source code:

typedef struct {
    PyObject_VAR_HEAD
    long ob_shash;
    int ob_sstate;
    char ob_sval[1];

    /* Invariants:
     *     ob_sval contains space for 'ob_size+1' elements.
     *     ob_sval[ob_size] == 0.
     *     ob_shash is the hash of the string or -1 if not computed yet.
     *     ob_sstate != 0 iff the string object is in stringobject.c's
     *       'interned' dictionary; in this case the two references
     *       from 'interned' to this object are *not counted* in ob_refcnt.
     */
} PyStringObject;

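ob_size in PyObject_VAR_HEAD (see Python source code analysis - object exploration) records the length of the variable-length object, that is, the number of characters it holds. ob_sval is declared as a one-element char array, but it marks the start of the memory that holds the actual string; as the invariants in the comment say, that memory has room for ob_size+1 bytes and ends with a 0 byte. For test_str above, ob_size is 11.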
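
One way to observe this layout from Python itself is the sketch below. It assumes CPython 3, where the bytes type is the successor of Python 2's PyStringObject; on other interpreters sys.getsizeof may behave differently.

```python
import sys

empty = b""
hello = b"Hello World"

# len() reports the value kept in ob_size.
length = len(hello)

# The character data is stored inline after the header, so each extra
# character makes the object exactly one byte larger.
extra_bytes = sys.getsizeof(hello) - sys.getsizeof(empty)

print(length, extra_bytes)
```

On CPython both numbers are 11: the fixed header is the same for both objects, and only the character data grows.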

ob_shash caches the hash value of the string (-1 if it has not been computed yet). This is very useful for the dict type, where strings are frequently used as keys.
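
A small sketch of why the hash matters for dict lookups, assuming Python 3. The cached field itself is not visible from Python code, so the example only shows the behavior that depends on it:

```python
# Two equal strings built at run time, so they are separate objects.
key_a = "".join(["www", ".xtuz", ".net"])
key_b = "".join(["www", ".xtuz", ".net"])

table = {key_a: "site"}

# Equal strings hash to the same value, so key_b finds the entry
# that was stored under key_a.
same_hash = hash(key_a) == hash(key_b)
found = table[key_b]

print(same_hash, found)
# True site
```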

ob_sstate indicates whether the object has been processed by the intern mechanism. In short, interning means that string objects with the same value are stored only once, in a shared string pool. A shared object must not be changed, which is one reason why strings have to be immutable.

3. PyStringObject creation

From the code point of view, there are many ways to create a PyStringObject:

PyAPI_FUNC(PyObject *) PyString_FromStringAndSize(const char *, Py_ssize_t);
PyAPI_FUNC(PyObject *) PyString_FromString(const char *);
PyAPI_FUNC(PyObject *) PyString_FromFormatV(const char*, va_list)
                                Py_GCC_ATTRIBUTE((format(printf, 1, 0)));
PyAPI_FUNC(PyObject *) PyString_FromFormat(const char*, ...)
                                Py_GCC_ATTRIBUTE((format(printf, 1, 2)));

Among them, the most commonly used is PyString_FromString(const char *).

The code implementation is as follows:

PyObject *
PyString_FromString(const char *str)
{
    register size_t size;
    register PyStringObject *op;

    assert(str != NULL);
    size = strlen(str);
    if (size > PY_SSIZE_T_MAX - PyStringObject_SIZE) {
        PyErr_SetString(PyExc_OverflowError,
            "string is too long for a Python string");
        return NULL;
    }
    if (size == 0 && (op = nullstring) != NULL) {
#ifdef COUNT_ALLOCS
        null_strings++;
#endif
        Py_INCREF(op);
        return (PyObject *)op;
    }
    if (size == 1 && (op = characters[*str & UCHAR_MAX]) != NULL) {
#ifdef COUNT_ALLOCS
        one_strings++;
#endif
        Py_INCREF(op);
        return (PyObject *)op;
    }

    /* Inline PyObject_NewVar */
    op = (PyStringObject *)PyObject_MALLOC(PyStringObject_SIZE + size);
    if (op == NULL)
        return PyErr_NoMemory();
    (void)PyObject_INIT_VAR(op, &PyString_Type, size);
    op->ob_shash = -1;
    op->ob_sstate = SSTATE_NOT_INTERNED;
    Py_MEMCPY(op->ob_sval, str, size+1);
    /* share short strings */
    if (size == 0) {
        PyObject *t = (PyObject *)op;
        PyString_InternInPlace(&t);
        op = (PyStringObject *)t;
        nullstring = op;
        Py_INCREF(op);
    } else if (size == 1) {
        PyObject *t = (PyObject *)op;
        PyString_InternInPlace(&t);
        op = (PyStringObject *)t;
        characters[*str & UCHAR_MAX] = op;
        Py_INCREF(op);
    }
    return (PyObject *) op;
}

In short, there are three main steps:

  1. If the string is too long, set an OverflowError and return a null pointer
  2. Check whether it is an empty string (or a single character) that has already been cached. If so, increase its reference count and return the cached object
  3. Otherwise, allocate memory and copy the string to op->ob_sval

After creation, the memory layout follows the PyStringObject structure shown above.
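
The control flow of PyString_FromString can be mirrored in a toy Python model. This is only a sketch: ToyString, toy_from_string, and TOY_MAX_SIZE are hypothetical names, the size limit is deliberately tiny, and the model raises an exception where the C code returns NULL.

```python
# Hypothetical stand-in for PY_SSIZE_T_MAX - PyStringObject_SIZE,
# kept tiny so the overflow branch is easy to trigger.
TOY_MAX_SIZE = 16


class ToyString:
    """Toy counterpart of PyStringObject (illustration only)."""

    def __init__(self, data):
        self.ob_size = len(data)      # length without the trailing 0
        self.ob_shash = -1            # hash not computed yet
        self.ob_sval = data + b"\0"   # ob_size + 1 bytes


_nullstring = None  # cached empty string, like nullstring in the C code


def toy_from_string(data):
    global _nullstring
    size = len(data)
    # Step 1: reject strings that are too long.
    if size > TOY_MAX_SIZE:
        raise OverflowError("string is too long for a Python string")
    # Step 2: reuse the cached empty string if there is one.
    if size == 0 and _nullstring is not None:
        return _nullstring
    # Step 3: allocate a new object and copy the characters.
    op = ToyString(data)
    if size == 0:
        _nullstring = op
    return op


hello = toy_from_string(b"Hello World")
print(hello.ob_size, len(hello.ob_sval))               # 11 12
print(toy_from_string(b"") is toy_from_string(b""))    # True
```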

4. Character buffer pool

The chapter [integer object in Python] described Python's optimization for small integers. Strings have a similar optimization: an object pool is kept for strings of length 1.

    if (size == 1 && (op = characters[*str & UCHAR_MAX]) != NULL) {
#ifdef COUNT_ALLOCS
        one_strings++;
#endif
        Py_INCREF(op);
        return (PyObject *)op;
    }


    /* share short strings */
    if (size == 0) {
        PyObject *t = (PyObject *)op;
        PyString_InternInPlace(&t);
        op = (PyStringObject *)t;
        nullstring = op;
        Py_INCREF(op);
    } else if (size == 1) {
        PyObject *t = (PyObject *)op;
        PyString_InternInPlace(&t);
        op = (PyStringObject *)t;
        characters[*str & UCHAR_MAX] = op;
        Py_INCREF(op);
    }

Whenever a string of length 1 is created, it is saved in characters. When a string of length 1 is created later and the same character is already in characters, the cached object is returned directly, without calling malloc. This is the purpose of the buffer pool.
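
The pool can be modeled in a few lines of Python. This is a sketch, not CPython's code: ToyChar, toy_single_char, and the allocations counter are hypothetical, and the counter stands in for calls to malloc.

```python
class ToyChar:
    """Toy one-character string object (illustration only)."""

    def __init__(self, byte_value):
        self.value = byte_value


# One slot per possible byte value, all empty at start-up.
characters = [None] * 256
allocations = 0  # counts how often a new object had to be created


def toy_single_char(byte_value):
    """Return the shared one-character object for byte_value."""
    global allocations
    index = byte_value & 0xFF
    cached = characters[index]
    if cached is not None:
        return cached          # pool hit: no new allocation
    allocations += 1
    obj = ToyChar(index)
    characters[index] = obj    # remember it for later requests
    return obj


first = toy_single_char(ord("a"))
second = toy_single_char(ord("a"))
other = toy_single_char(ord("b"))
print(first is second, first is other, allocations)
# True False 2
```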

5. Intern mechanism of string objects

CPython's string implementation uses a technique called intern (string interning) to make strings more efficient.

Let's look at a piece of code:

a='www.xtuz.net'
b='www.xtuz.net'
print(id(a), id(b))
print(a is b)

You can see the following output

(4420449312, 4420449312)
True

The output shows that a and b refer to the same string object here. (The exact result can depend on how the code is run, for example as a script or line by line in an interactive session.) Without any sharing, two strings with the same value would be two different objects. If a program contains a large number of strings with the same value, the system has to allocate memory for each of them separately, which is clearly a waste of resources. To address this, Python introduces the intern mechanism.

intern is a built-in function in Python 2 (in Python 3 it is available as sys.intern). It processes a string through the intern mechanism and returns the resulting string object. All strings with the same value yield the same string object after being interned. This can save memory when processing large amounts of data: the system does not need to allocate memory repeatedly for equal strings, because they can share one object.
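
The effect can be tried out with the sketch below. It assumes CPython 3 and therefore uses sys.intern; the strings are built at run time so that the compiler cannot share them as constants.

```python
import sys

# Two equal strings created separately at run time.
a = "".join(["www", ".xtuz", ".net"])
b = "".join(["www", ".xtuz", ".net"])

print(a == b)   # True: same value
print(a is b)   # False: two separate objects

# After interning, both names refer to the single pooled object.
a_interned = sys.intern(a)
b_interned = sys.intern(b)
print(a_interned is b_interned)   # True
```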

The implementation of intern is simple: it maintains a string pool, which is a dictionary. If an equal string already exists in the pool, the previously created string object is returned instead of a new one. If it is not in the pool yet, a string object is constructed first and then added to the pool, so that it can be found the next time.
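
A dictionary-based pool of this kind can be sketched in Python as follows. The names interned and toy_intern are hypothetical; this models the idea, not the C function PyString_InternInPlace.

```python
# The pool: maps each interned string to itself.
interned = {}


def toy_intern(s):
    """Return the pooled object equal to s, adding s if it is new."""
    # setdefault stores s only when no equal key is present and
    # returns whichever object is in the pool afterwards.
    return interned.setdefault(s, s)


# Two equal strings built at run time, so they are separate objects.
first = "".join(["py", "thon"])
second = "".join(["py", "thon"])

print(toy_intern(first) is first)    # True: first is now in the pool
print(toy_intern(second) is first)   # True: the pooled object is returned
print(len(interned))                 # 1
```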

6. More

From Mr. rabbit's website: https://www.xtuz.net/detail-139.html

Keywords: Python Big Data

Added by baccarak on Sat, 28 Mar 2020 13:01:40 +0200