Go Language Core 36 Lectures (Go Language in Practice and Application, Part 15) -- Learning Notes

37 | strings package and string operations

Go not only has the rune type, which represents a single Unicode character, but also a for statement (in its range form) that can split a string value into Unicode characters.

In addition, the unicode package and its subpackages in the standard library provide many functions and data types that help us parse Unicode characters in all kinds of content.

These program entities are useful, straightforward, and effectively hide some of the complex details of the Unicode encoding specification. I'm not going to talk about them here.

Today we will mainly talk about the strings package in the standard library. This package also uses many program entities from the unicode package and the unicode/utf8 package.

  • One example is the WriteRune method of the strings.Builder type.
  • Another example is the ReadRune method of the strings.Reader type, and so on.
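As a small illustration of the two methods just mentioned, the sketch below writes runes into a Builder and reads them back through a Reader. The helper names buildFromRunes and runeSizes are made up for this example and are not part of the strings package.

```go
package main

import (
	"fmt"
	"strings"
)

// buildFromRunes appends each rune to a Builder and returns the resulting
// string together with the total number of bytes written.
func buildFromRunes(runes []rune) (string, int) {
	var b strings.Builder
	total := 0
	for _, r := range runes {
		n, _ := b.WriteRune(r) // n is the number of bytes used to encode r in UTF-8
		total += n
	}
	return b.String(), total
}

// runeSizes reads s rune by rune and returns the byte size of each rune.
func runeSizes(s string) []int {
	r := strings.NewReader(s)
	var sizes []int
	for {
		_, size, err := r.ReadRune()
		if err != nil { // io.EOF once everything has been read
			break
		}
		sizes = append(sizes, size)
	}
	return sizes
}

func main() {
	s, n := buildFromRunes([]rune{'G', 'o', '语', '言'})
	fmt.Printf("%q takes %d bytes\n", s, n)
	fmt.Println("bytes per rune:", runeSizes(s))
}
```

Both methods work in terms of UTF-8 encoded bytes, which is why the ASCII letters take one byte each while the Chinese characters take three.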

The next question concerns the strings.Builder type. Our question today is: what advantages do values of the strings.Builder type have over string values?

The typical answer here is this.

Values of type strings.Builder (hereinafter referred to as Builder values) have the following three advantages:

  • Existing content cannot be changed, but more content can be appended.
  • The number of memory allocations and content copies is reduced.
  • Content can be reset and values can be reused.

Problem resolution

Let's start with the string type. As we all know, values of the string type are immutable in Go. If we want a different string, we can only slice or concatenate the original string to produce a new one.

  • Slicing can be done with slice expressions;
  • Concatenation can be done with the + operator.

Under the hood, the contents of a string value are stored in a contiguous block of memory. The number of bytes held in this memory is also recorded and serves as the length of the string value.

You can think of the contents of this memory as a byte array, with the string value holding a pointer to the head of that array. Applying a slice expression to a string value is therefore equivalent to slicing its underlying byte array.

In addition, when we concatenate strings, Go copies all the strings being concatenated, one by one, into a new contiguous block of memory that is large enough, and returns a string value holding the corresponding pointer as the result.

Obviously, when a program performs too many string concatenations, it puts a lot of pressure on memory allocation.

Note that although a string value holds a pointer value internally, its type is still a value type. However, because string values are immutable, pointer values also contribute to memory space savings.

More specifically, a string value shares the same byte array at the bottom with all its copies. Since the byte array here will never be changed, it is absolutely safe to do so.
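The sharing described above can be observed directly. The following sketch is for observation only and assumes Go 1.20 or later, where unsafe.StringData is available; the helpers sameStart and startsInside are hypothetical names introduced for this example.

```go
package main

import (
	"fmt"
	"unsafe"
)

// sameStart reports whether the bytes of a and b begin at the same address.
// Both strings must be non-empty.
func sameStart(a, b string) bool {
	return unsafe.StringData(a) == unsafe.StringData(b)
}

// startsInside reports whether the bytes of sub begin exactly offset bytes
// after the beginning of the bytes of s. Both strings must be non-empty.
func startsInside(s, sub string, offset int) bool {
	base := unsafe.Pointer(unsafe.StringData(s))
	return unsafe.Pointer(unsafe.StringData(sub)) == unsafe.Add(base, offset)
}

func main() {
	s := "hello, gophers"
	copied := s  // copying a string value copies only the pointer and the length
	sub := s[7:] // slicing yields a view into the same bytes

	fmt.Println("copy shares bytes: ", sameStart(s, copied))
	fmt.Println("slice shares bytes:", startsInside(s, sub, 7))

	// Concatenation has to copy into new memory, so the result is expected
	// to start at a different address.
	joined := s + "!"
	fmt.Println("concatenation shares bytes:", sameStart(s, joined))
}
```

Ordinary programs rarely need the unsafe package for this; it is used here only to make the pointers visible.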

Compared with string values, the advantage of Builder values shows mainly in string concatenation.
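One rough way to see the difference is to count heap allocations for both approaches. The sketch below uses testing.AllocsPerRun from the standard library; the function names concatWithPlus, concatWithBuilder and measure are made up for this example, and the exact numbers depend on the Go version and the input.

```go
package main

import (
	"fmt"
	"strings"
	"testing"
)

// concatWithPlus joins n copies of piece using the + operator.
func concatWithPlus(piece string, n int) string {
	s := ""
	for i := 0; i < n; i++ {
		s += piece // every step may allocate a new string and copy the old content
	}
	return s
}

// concatWithBuilder joins n copies of piece using a strings.Builder.
func concatWithBuilder(piece string, n int) string {
	var b strings.Builder
	for i := 0; i < n; i++ {
		b.WriteString(piece) // allocates only when the content container must expand
	}
	return b.String()
}

// sink is a package-level variable that keeps the results alive.
var sink string

// measure returns the average number of heap allocations per call for both
// approaches when joining n copies of piece.
func measure(piece string, n int) (plusAllocs, builderAllocs float64) {
	plusAllocs = testing.AllocsPerRun(100, func() { sink = concatWithPlus(piece, n) })
	builderAllocs = testing.AllocsPerRun(100, func() { sink = concatWithBuilder(piece, n) })
	return plusAllocs, builderAllocs
}

func main() {
	plusAllocs, builderAllocs := measure("0123456789", 200)
	fmt.Printf("+ operator: %.0f allocations per run\n", plusAllocs)
	fmt.Printf("Builder:    %.0f allocations per run\n", builderAllocs)
}
```

With inputs like these, the Builder version should need far fewer allocations, because its container grows in large steps rather than once per concatenation.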

The Builder value has a container for hosting content (hereinafter referred to as a content container). It is a slice of byte element type (hereinafter referred to as a byte slice).

Since the underlying array of such byte slices is an array of bytes, we can say that it stores content in the same way as a string value.

In fact, they all hold the pointer value to the underlying byte array through a field of type unsafe.Pointer.

Because of this internal construction, a Builder value also has what it needs to use memory efficiently. Although any element of a byte slice can itself be modified, the Builder value does not allow this; its content can only be appended to or completely reset.

This means that the content already in a Builder value is immutable. Therefore, we can use the methods provided by the Builder value to append more content without worrying about affecting the existing content.

The methods mentioned here are a series of pointer methods that the Builder value has, including Write, WriteByte, WriteRune, and WriteString. We can collectively refer to them as appending methods.

By calling these methods, we can append new content to the end (that is, to the right) of the existing content. In doing so, the Builder value automatically expands its content container if necessary. The automatic expansion strategy here is the same as the expansion strategy of slices.

In other words, we don't necessarily cause expansion when splicing content to Builder values. As long as the content container has enough capacity, expansion will not occur and memory allocation for it will not occur. At the same time, as long as there is no expansion, the existing content in the Builder value will no longer be copied.
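To get a feel for how rarely expansion happens, the following sketch appends single bytes and counts how often the capacity reported by the Cap method changes. The function countExpansions is a hypothetical helper, and the concrete numbers depend on the Go version.

```go
package main

import (
	"fmt"
	"strings"
)

// countExpansions appends n single bytes to a fresh Builder and counts how
// often the capacity of its content container changes along the way.
func countExpansions(n int) (expansions, finalCap int) {
	var b strings.Builder
	lastCap := b.Cap()
	for i := 0; i < n; i++ {
		b.WriteByte('x')
		if c := b.Cap(); c != lastCap {
			expansions++
			lastCap = c
		}
	}
	return expansions, b.Cap()
}

func main() {
	expansions, finalCap := countExpansions(1000)
	fmt.Printf("1000 appends caused %d expansions; final capacity is %d\n",
		expansions, finalCap)
}
```

Most of the 1000 appends find enough spare capacity and therefore neither allocate memory nor copy the existing content.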

Besides relying on the automatic expansion of the Builder value, we can also expand it manually by calling its Grow method. The Grow method, also known as the expansion method, accepts a parameter n of type int, which is the number of bytes of additional capacity to make room for.

If necessary, the Grow method increases the capacity of the content container in the value it belongs to by at least n bytes. More specifically, it creates a byte slice as a new content container whose capacity is twice that of the original container plus n, and then copies all the bytes from the original container into the new one.

var builder1 strings.Builder
// Omit some code.
fmt.Println("Grow the builder ...")
builder1.Grow(10)
fmt.Printf("The length of contents in the builder is %d.\n", builder1.Len())

Of course, the Grow method may also do nothing. The precondition for this is that the unused capacity in the current content container is sufficient, that is, the unused capacity is greater than or equal to n. The preconditions here are similar to those in the Auto-Expansion strategy mentioned earlier.
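Both cases can be checked with the Cap method. In this sketch, growTwice is a made-up helper that calls Grow twice on a fresh Builder and reports the capacity after each call.

```go
package main

import (
	"fmt"
	"strings"
)

// growTwice calls Grow(first) and then Grow(second) on a fresh Builder and
// returns the capacity of the content container after each call.
func growTwice(first, second int) (capAfterFirst, capAfterSecond int) {
	var b strings.Builder
	b.Grow(first)
	capAfterFirst = b.Cap()
	b.Grow(second)
	capAfterSecond = b.Cap()
	return capAfterFirst, capAfterSecond
}

func main() {
	// The unused capacity after Grow(100) already covers 50 more bytes,
	// so the second call has nothing to do.
	c1, c2 := growTwice(100, 50)
	fmt.Println("Grow(100) then Grow(50):  ", c1, c2)

	// Here the unused capacity is too small, so the second call expands.
	c1, c2 = growTwice(10, 1000)
	fmt.Println("Grow(10) then Grow(1000): ", c1, c2)
}
```

Note that Grow panics if n is negative.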

Finally, a Builder value can be reused. By calling its Reset method, we can return the Builder value to its zero state, as if it had never been used.

Once it has been reset, the original content container in the Builder value is simply discarded. After that, the container and all its contents will be marked and reclaimed by the Go garbage collector, provided nothing else (such as a string previously returned by the String method) still refers to them.

fmt.Println("Reset the builder ...")
builder1.Reset()
fmt.Printf("The third output(%d):\n%q\n", builder1.Len(), builder1.String())
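The effect of Reset can be sketched as follows. The helper resetDemo is a hypothetical name; it shows that a string obtained before the reset remains valid and that the Builder starts over from an empty container.

```go
package main

import (
	"fmt"
	"strings"
)

// resetDemo writes to a Builder, resets it, and writes again. It returns the
// string taken before the reset, the length and capacity right after the
// reset, and the string built after the reset.
func resetDemo() (before string, lenAfter, capAfter int, after string) {
	var b strings.Builder
	b.WriteString("hello")
	before = b.String()

	b.Reset() // drops the content container; b is back in its zero state
	lenAfter, capAfter = b.Len(), b.Cap()

	b.WriteString("world") // the Builder can be used again from scratch
	after = b.String()
	return before, lenAfter, capAfter, after
}

func main() {
	before, lenAfter, capAfter, after := resetDemo()
	fmt.Printf("before reset: %q\n", before)
	fmt.Printf("right after reset: len=%d cap=%d\n", lenAfter, capAfter)
	fmt.Printf("after reuse: %q\n", after)
}
```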

package main

import (
	"fmt"
	"strings"
)

func main() {
	// Example 1.
	var builder1 strings.Builder
	builder1.WriteString("A Builder is used to efficiently build a string using Write methods.")
	fmt.Printf("The first output(%d):\n%q\n", builder1.Len(), builder1.String())
	fmt.Println()
	builder1.WriteByte(' ')
	builder1.WriteString("It minimizes memory copying. The zero value is ready to use.")
	builder1.Write([]byte{'\n', '\n'})
	builder1.WriteString("Do not copy a non-zero Builder.")
	fmt.Printf("The second output(%d):\n\"%s\"\n", builder1.Len(), builder1.String())
	fmt.Println()

	// Example 2.
	fmt.Println("Grow the builder ...")
	builder1.Grow(10)
	fmt.Printf("The length of contents in the builder is %d.\n", builder1.Len())
	fmt.Println()

	// Example 3.
	fmt.Println("Reset the builder ...")
	builder1.Reset()
	fmt.Printf("The third output(%d):\n%q\n", builder1.Len(), builder1.String())
}

Knowledge Expansion

Question 1: Are there constraints on using the strings.Builder type?

The answer is yes. The constraints can be summarized as follows:

  • A Builder value can no longer be copied once it has actually been used;
  • Since its content is not entirely immutable, users must resolve operation conflicts and concurrency safety issues themselves.

As soon as we call an appending method or the expansion method of a Builder value, we have actually started using it. Clearly, these methods change the state of the content container in the value they belong to.

Once they have been called, we can no longer copy the value in any way. Otherwise, calling any of the above methods on any copy will cause a panic.

This panic tells us that such use is illegal because the Builder value here is a copy, not the original value. By the way, the ways of copying here include, but are not limited to, passing the value to a function, sending the value through a channel, and assigning the value to a variable.

var builder1 strings.Builder
builder1.Grow(1)
builder3 := builder1
//builder3.Grow(1)//This causes panic.
_ = builder3
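To observe the panic without crashing the program, it can be recovered. In the sketch below, useCopy is a made-up helper that copies a Builder, optionally after using it, and then calls Grow on the copy.

```go
package main

import (
	"fmt"
	"strings"
)

// useCopy copies a Builder and calls Grow on the copy. If useFirst is true,
// the original is written to before it is copied. The function returns the
// value carried by the panic, or nil if no panic occurred.
func useCopy(useFirst bool) (recovered interface{}) {
	defer func() { recovered = recover() }()

	var original strings.Builder
	if useFirst {
		original.WriteString("in use")
	}
	copied := original // the copy itself compiles and runs without complaint
	copied.Grow(1)     // panics only if original had been used before the copy
	return nil
}

func main() {
	fmt.Println("copy of a used Builder:      ", useCopy(true))
	fmt.Println("copy of a zero-value Builder:", useCopy(false))
}
```

The check happens at run time, inside the appending and expansion methods, so the compiler does not reject the copy.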

Although this constraint is very strict, it can be beneficial if we think about it carefully.

It is precisely because a used Builder value can no longer be copied that the content containers (that is, the byte slices) in multiple Builder values are guaranteed never to share the same underlying byte array. This also avoids potential conflicts between multiple Builder values of the same origin when content is appended.

However, although a used Builder value can no longer be copied, a pointer to it can. At any time, we can copy such a pointer value in any way. Note that all such pointer values point to the same Builder value.

f2 := func(bp *strings.Builder) {
	(*bp).Grow(1) // panic is not raised here, but it is not concurrently safe.
	builder4 := *bp
	//builder4.Grow(1)//This causes panic.
	_ = builder4
}
f2(&builder1)

This is exactly where the problem arises: if a Builder value is operated on by multiple parties at the same time, its content is likely to become garbled. This is what we mean by operation conflicts and concurrency safety issues.

A Builder value cannot solve these problems by itself. Therefore, when we share a Builder value by passing around its pointer, we must make sure that all parties use it correctly, in an orderly fashion, and in a concurrency-safe way. The most thorough solution is never to share a Builder value or a pointer to it.
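If sharing really cannot be avoided, one common approach is to guard the Builder with a mutex. The following is only a minimal sketch under that assumption; the lockedBuilder type and the writeConcurrently function are hypothetical and not part of the standard library.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
)

// lockedBuilder guards a strings.Builder with a mutex so that it can be
// shared between goroutines through a pointer.
type lockedBuilder struct {
	mu sync.Mutex
	b  strings.Builder
}

// WriteString appends s while holding the lock.
func (lb *lockedBuilder) WriteString(s string) {
	lb.mu.Lock()
	defer lb.mu.Unlock()
	lb.b.WriteString(s)
}

// String returns the accumulated content while holding the lock.
func (lb *lockedBuilder) String() string {
	lb.mu.Lock()
	defer lb.mu.Unlock()
	return lb.b.String()
}

// writeConcurrently starts the given number of goroutines; each appends
// piece writesPerWorker times to one shared lockedBuilder.
func writeConcurrently(workers, writesPerWorker int, piece string) string {
	var lb lockedBuilder
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			for j := 0; j < writesPerWorker; j++ {
				lb.WriteString(piece)
			}
		}()
	}
	wg.Wait()
	return lb.String()
}

func main() {
	result := writeConcurrently(10, 100, "ab")
	fmt.Println("total length:", len(result))
}
```

The lock keeps each individual write intact, but it does not impose any order between writers; if the order of the pieces matters, the parties still have to coordinate among themselves.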

We can declare a Builder value wherever we need one, or we can declare a Builder value first and then pass copies of it around before it is actually used. Alternatively, we can use it first and pass it on afterwards, as long as we call its Reset method before passing it.

builder1.Reset()
builder5 := builder1
builder5.Grow(1) // panic will not be raised here.

In summary, the constraint on copying Builder values is meaningful and necessary. Although we can still share Builder values in some ways, it is best not to put ourselves in danger; letting each party use its own Builder value is the best solution. For Builder values in the zero state, however, copying causes no problem.

package main

import (
	"strings"
)

func main() {
	// Example 1.
	var builder1 strings.Builder
	builder1.Grow(1)

	f1 := func(b strings.Builder) {
		//b.Grow(1)//This causes panic.
	}
	f1(builder1)

	ch1 := make(chan strings.Builder, 1)
	ch1 <- builder1
	builder2 := <-ch1
	//builder2.Grow(1)//This causes panic.
	_ = builder2

	builder3 := builder1
	//builder3.Grow(1)//This causes panic.
	_ = builder3

	// Example 2.
	f2 := func(bp *strings.Builder) {
		(*bp).Grow(1) // panic is not raised here, but it is not concurrently safe.
		builder4 := *bp
		//builder4.Grow(1)//This causes panic.
		_ = builder4
	}
	f2(&builder1)

	builder1.Reset()
	builder5 := builder1
	builder5.Grow(1) // panic will not be raised here.
}

Question 2: Why does a value of type strings.Reader read strings efficiently?

In contrast to the strings.Builder type, the strings.Reader type exists for reading strings efficiently. Its efficiency shows mainly in its reading mechanism, which encapsulates many best practices for reading content from string values.

A value of type strings.Reader (hereinafter referred to as a Reader value) makes it easy to read the contents of a string. While reading, the Reader value keeps a count of the bytes already read (hereinafter referred to as the read count).

The read count also represents the starting index for the next read. Reader values rely on this count, together with slice expressions on the string value, to read quickly.

In addition, the read count is an important basis for unreading (rolling back a read) and for setting the read position. Although it is part of the internal structure of the Reader value, we can still compute it from the value's Len method and Size method. The code is as follows:

var reader1 strings.Reader
// Omit some code.
readingIndex := reader1.Size() - int64(reader1.Len()) // The calculated read count.

Most of the reading methods of a Reader value update the read count promptly. For example, the ReadByte method adds 1 to the count after a successful read.

Similarly, after a successful read, the ReadRune method adds the number of bytes occupied by the character it read to the count.
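The way these methods move the read count can be traced step by step. In this sketch, readingIndex and trackIndex are made-up helper names; the input string consists of a 1-byte, a 2-byte, and a 3-byte character, and the last step uses UnreadRune to roll back one read.

```go
package main

import (
	"fmt"
	"strings"
)

// readingIndex computes the read count of r from its Size and Len methods.
func readingIndex(r *strings.Reader) int64 {
	return r.Size() - int64(r.Len())
}

// trackIndex performs a fixed sequence of reads on s and records the read
// count at the start and after every step.
func trackIndex(s string) []int64 {
	r := strings.NewReader(s)
	indexes := []int64{readingIndex(r)}

	r.ReadByte() // reads one byte
	indexes = append(indexes, readingIndex(r))

	r.ReadRune() // reads one character, however many bytes it occupies
	indexes = append(indexes, readingIndex(r))

	r.ReadRune()
	indexes = append(indexes, readingIndex(r))

	r.UnreadRune() // rolls back the ReadRune call just made
	indexes = append(indexes, readingIndex(r))

	return indexes
}

func main() {
	// 'a' takes 1 byte, U+00E9 (é) takes 2 bytes, U+65E5 (日) takes 3 bytes.
	fmt.Println(trackIndex("a\u00e9\u65e5")) // [0 1 3 6 3]
}
```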

However, the ReadAt method is an exception. It will neither read based on the read count nor update it after reading. Because of this, this method is free to read anything in the Reader value to which it belongs.
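A short sketch can confirm that ReadAt leaves the read count alone. The helper readAtDemo is a hypothetical name; it first reads one byte so that the count is non-zero, then calls ReadAt and compares the count before and after.

```go
package main

import (
	"fmt"
	"strings"
)

// readAtDemo reads one byte from s to advance the read count, then reads up
// to n bytes at offset off with ReadAt. It returns the chunk read and the
// read count before and after the ReadAt call.
func readAtDemo(s string, off int64, n int) (chunk string, before, after int64) {
	r := strings.NewReader(s)
	r.ReadByte() // advance the read count to 1

	before = r.Size() - int64(r.Len())
	buf := make([]byte, n)
	m, _ := r.ReadAt(buf, off) // reads at off, independent of the read count
	after = r.Size() - int64(r.Len())

	return string(buf[:m]), before, after
}

func main() {
	chunk, before, after := readAtDemo("Hello, Gophers", 7, 7)
	fmt.Printf("chunk=%q before=%d after=%d\n", chunk, before, after)
	// chunk="Gophers" before=1 after=1
}
```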

In addition, the Seek method of a Reader value also updates the value's read count. In fact, the main purpose of the Seek method is to set the starting index for the next read.

Furthermore, if we pass the constant io.SeekCurrent as the second argument to the method, it computes the new count from the current read count and the value of the first parameter, offset.

Since the Seek method returns the new count, we can easily verify this, as follows:

offset2 := int64(17)
expectedIndex := reader1.Size() - int64(reader1.Len()) + offset2
fmt.Printf("Seek with offset %d and whence %d ...\n", offset2, io.SeekCurrent)
readingIndex, _ := reader1.Seek(offset2, io.SeekCurrent)
fmt.Printf("The reading index in reader: %d (returned by Seek)\n", readingIndex)
fmt.Printf("The reading index in reader: %d (computed by me)\n", expectedIndex)
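Besides io.SeekCurrent, the io package also defines io.SeekStart and io.SeekEnd, which make the offset relative to the beginning and the end of the content respectively. The sketch below walks through all three; seekStep and seekDemo are names invented for this example.

```go
package main

import (
	"fmt"
	"io"
	"strings"
)

// seekStep describes one call to Seek.
type seekStep struct {
	offset int64
	whence int
}

// seekDemo applies three Seek calls to a Reader over "0123456789" and
// returns the position reported by each call, plus the byte read afterwards.
func seekDemo() (positions []int64, next byte) {
	r := strings.NewReader("0123456789")
	steps := []seekStep{
		{3, io.SeekStart},   // 3 bytes from the beginning
		{2, io.SeekCurrent}, // 2 bytes forward from the current read count
		{-4, io.SeekEnd},    // 4 bytes before the end
	}
	for _, step := range steps {
		pos, _ := r.Seek(step.offset, step.whence)
		positions = append(positions, pos)
	}
	next, _ = r.ReadByte() // the next read starts at the last position set
	return positions, next
}

func main() {
	positions, next := seekDemo()
	fmt.Println(positions, string(next)) // [3 5 6] 6
}
```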

In summary, the key to an efficient read of a Reader value is its internal read count. The value of the count represents the starting index position for the next read. It can be easily calculated. The Seek method of a Reader value directly sets the read count value in that value.

package main

import (
	"fmt"
	"io"
	"strings"
)

func main() {
	// Example 1.
	reader1 := strings.NewReader(
		"NewReader returns a new Reader reading from s. " +
			"It is similar to bytes.NewBufferString but more efficient and read-only.")
	fmt.Printf("The size of reader: %d\n", reader1.Size())
	fmt.Printf("The reading index in reader: %d\n",
		reader1.Size()-int64(reader1.Len()))

	buf1 := make([]byte, 47)
	n, _ := reader1.Read(buf1)
	fmt.Printf("%d bytes were read. (call Read)\n", n)
	fmt.Printf("The reading index in reader: %d\n",
		reader1.Size()-int64(reader1.Len()))
	fmt.Println()

	// Example 2.
	buf2 := make([]byte, 21)
	offset1 := int64(64)
	n, _ = reader1.ReadAt(buf2, offset1)
	fmt.Printf("%d bytes were read. (call ReadAt, offset: %d)\n", n, offset1)
	fmt.Printf("The reading index in reader: %d\n",
		reader1.Size()-int64(reader1.Len()))
	fmt.Println()

	// Example 3.
	offset2 := int64(17)
	expectedIndex := reader1.Size() - int64(reader1.Len()) + offset2
	fmt.Printf("Seek with offset %d and whence %d ...\n", offset2, io.SeekCurrent)
	readingIndex, _ := reader1.Seek(offset2, io.SeekCurrent)
	fmt.Printf("The reading index in reader: %d (returned by Seek)\n", readingIndex)
	fmt.Printf("The reading index in reader: %d (computed by me)\n", expectedIndex)

	n, _ = reader1.Read(buf2)
	fmt.Printf("%d bytes were read. (call Read)\n", n)
	fmt.Printf("The reading index in reader: %d\n",
		reader1.Size()-int64(reader1.Len()))
}

Summary

Today we focused on two important types in the strings package: Builder and Reader. The former is used to build strings, while the latter is used to read them.

Compared with string values, the advantage of Builder values shows mainly in string concatenation. A Builder value can append more content while keeping the existing content unchanged, and it minimizes the number of memory allocations and content copies during the process.

However, such values are also constrained in their use. A Builder value can no longer be copied after it has actually been used; otherwise, using the copy will trigger a panic. Although this constraint is strict, it brings benefits too: it effectively avoids some operation conflicts. We can get around the constraint by some means, such as passing a pointer to the value, but doing so does more harm than good. The best solution is for each party to declare and use its own Builder value without interfering with the others.

Reader values make it easy to read from a string. Their efficiency shows mainly in their reading mechanism. While reading, a Reader value keeps a count of the bytes already read, also known as the read count.

This count represents the starting index for the next read and is the key to efficient reading. We can use the Len and Size methods of such a value to compute the read count. With it, we have more flexibility when reading strings.

I've only covered these two data types in this article, but that doesn't mean there are only two useful program entities in the strings package. In fact, the strings package also provides a large number of functions. For example:

Count, IndexRune, Map, Replace, SplitN, Trim, and so on.

They are all very easy to use and efficient. You can look at their source code, and you may well gain some insight from it.
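For a quick impression of these functions, here is a small sketch that calls each of them once. The wrapper function examples and the sample inputs are chosen only for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"unicode"
)

// examples calls each of the functions listed above once and returns the
// results formatted as strings.
func examples() []string {
	return []string{
		// Count: number of non-overlapping occurrences of "e".
		fmt.Sprint(strings.Count("cheese", "e")),
		// IndexRune: byte index of the first 'k'.
		fmt.Sprint(strings.IndexRune("chicken", 'k')),
		// Map: applies the mapping function to every character.
		strings.Map(unicode.ToUpper, "gopher"),
		// Replace: replaces only the first 2 occurrences.
		strings.Replace("oink oink oink", "k", "ky", 2),
		// SplitN: splits into at most 2 parts.
		fmt.Sprintf("%q", strings.SplitN("a,b,c,d", ",", 2)),
		// Trim: removes leading and trailing characters found in the cutset.
		strings.Trim("!!hello!!", "!"),
	}
}

func main() {
	for _, result := range examples() {
		fmt.Println(result)
	}
	// 3
	// 4
	// GOPHER
	// oinky oinky oink
	// ["a" "b,c,d"]
	// hello
}
```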

Thinking Question

Today's question is: What interfaces do strings.Builder and strings.Reader implement? Is there any benefit in doing so?

Note Source

https://github.com/MingsonZheng/go-core-demo

This work is licensed under the Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International License.

You are welcome to reprint, use, and republish this article, but you must retain the attribution to the author, Zheng Ziming (including the link: http://www.cnblogs.com/MingsonZheng/ ), and must not use it for commercial purposes. Works derived from this article must be published under the same license.

Added by mcmuney on Wed, 01 Dec 2021 06:38:25 +0200