i/o timeout. I hope you don't step into the pit of this net/http package

The article is continuously updated. You can search "growth of golang Xiaobai" on wechat for the first time, and reply to [tutorial] to get a free video tutorial of golang. This article has been included in GitHub github.com/xiaobaiTech/golangFamil... , there are complete test sites and growth routes for large factory interviews. Welcome Star.


problem

Let's look at a piece of daily code.

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "io/ioutil"
    "net"
    "net/http"
    "time"
)

var tr *http.Transport

func init() {
    tr = &http.Transport{
        MaxIdleConns: 100,
        Dial: func(netw, addr string) (net.Conn, error) {
            conn, err := net.DialTimeout(netw, addr, time.Second*2) //Set connection timeout
            if err != nil {
                return nil, err
            }
            err = conn.SetDeadline(time.Now().Add(time.Second * 3)) //Set sending and receiving data timeout
            if err != nil {
                return nil, err
            }
            return conn, nil
        },
    }
}

func main() {
    for {
        _, err := Get("http://www.baidu.com/")
        if err != nil {
            fmt.Println(err)
            break
        }
    }
}


func Get(url string) ([]byte, error) {
    m := make(map[string]interface{})
    data, err := json.Marshal(m)
    if err != nil {
        return nil, err
    }
    body := bytes.NewReader(data)
    req, _ := http.NewRequest("Get", url, body)
    req.Header.Add("content-type", "application/json")

    client := &http.Client{
        Transport: tr,
    }
    res, err := client.Do(req)
    if res != nil {
        defer res.Body.Close()
    }
    if err != nil {
        return nil, err
    }
    resBody, err := ioutil.ReadAll(res.Body)
    if err != nil {
        return nil, err
    }
    return resBody, nil
}

The simple thing to do is to request circularly http://www.baidu.com/ Then wait for a response.

Looks like there's nothing wrong.

When the code runs, it can send and receive messages normally.

However, if this code runs for a period of time, an i/o timeout error will appear.


This is actually a problem that has been checked recently. It is found that this pit may be easier to step on. I simplified the code here.

The phenomenon in actual production is that when the golang service initiates an HTTP call, although http If transport sets 3s timeout, i/o timeout error will occur occasionally.

However, when viewing the downstream services, it is found that the downstream services have actually returned in 100ms.


Check

It's strange that the client timeout is set to 100ms, and the server timeout is set to 100ms?


There are two possibilities here.

  • Because the log printed by the server is actually only the log printed by the application layer of the server. However, after the client application layer sends data, it also passes through the transport layer, network layer, data link layer and physical layer of the client, and then passes through the physical layer, data link layer, network layer and transport layer of the server to the application layer of the server. It takes 100ms at the application layer of the server, and then return to the original way. The remaining 3s-100ms may be consumed in all layers of the whole process. For example, when the network is bad, TCP at the transport layer tries hard to lose packets and retransmit.

  • There is no problem with the network. The whole sending and receiving process of the client-server link takes about 100ms. The client processing logic problem caused a timeout.


Generally, the problem will not be the problem of the underlying network in most cases. It's right to boldly doubt that it's your own problem. If you don't give up, grab a bag and have a look.

After analysis, shake hands three times from the beginning (where the red box is drawn).

In the end, there is an i/o timeout error (where the blue box is drawn).

The time column is from 7 to 10, with an interval of 3s. And look at the blue box in the lower right corner. It is a Reset connection from port 51169 to port 80.

Port 80 is the port of the server. In other words, the client 3s times out and actively disconnects the link.

But take a closer look at the first line. From the third handshake to the end when the client times out and actively disconnects, there are actually many HTTP requests.

Go back to the way the code sets the timeout.

    tr = &http.Transport{
        MaxIdleConns: 100,
        Dial: func(netw, addr string) (net.Conn, error) {
            conn, err := net.DialTimeout(netw, addr, time.Second*2) //Set connection timeout
            if err != nil {
                return nil, err
            }
            err = conn.SetDeadline(time.Now().Add(time.Second * 3)) //Set sending and receiving data timeout
            if err != nil {
                return nil, err
            }
            return conn, nil
        },
    }

In other words, the 3s timeout here is actually calculated after the connection is established, rather than the timeout calculated from a single call.

Look, the note says

SetDeadline sets the read and write deadlines associated with the connection.



Timeout reason

As we all know, HTTP is an application layer protocol and TCP is used in the transport layer.

Before 1.0, HTTP protocol used short connection by default, and TCP connection will be established every time a request is initiated. Send and receive data. Then disconnect.

TCP connections are handshaked three times at a time. Wave four times each time you disconnect.

In fact, it is not necessary to establish a new connection every time. Just keep the established connection open and reuse the data every time.

As a result, the HTTP protocol has used long connections by default since 1.1. See the previous for specific information This article.

Then the golang standard library is also compatible with this implementation.

Establish a TCP long connection for each domain name by establishing a connection pool, such as http://baidu.com and http://golang.com Just two different domain names.

First visit http://baidu.com A connection will be established when the domain name is used up, and then it will be put into the free connection pool for access next time http://baidu.com The connection will be retrieved from the connection pool for reuse.


A digression: this also explains the previous This article In the last question, why should we emphasize the same domain name: a domain name will establish a connection, and a connection corresponds to a read goroutine and a write goroutine. It is precisely because it is the same domain name that three goroutines will be disclosed in the end. If different domain names are used, 1+2*N processes will be disclosed, and N is the number of domain names.


Assuming that the first request takes 100ms, each request is completed http://baidu.com After that, they are put into the connection pool. Next time, they continue to be reused for 29 times, taking 2900ms.

At the 30th request, the connection has taken 3000ms from the start of establishment to the return of the service, just reaching the set 3s timeout threshold, then the client will report timeout i/o timeout.

Although it only took 100ms for the server at this time, it took a long time to add up the previous 29 times.

That is, just through HTTP Transport has err = conn.setdeadline (time. Now()) Add (time. Second * 3)), and you use a long connection. Even if the server processes faster and the timeout set by the client is longer, your program will report a timeout error at one moment.

Correct posture

The original expectation was to set a timeout for each call, not for the entire connection.

In addition, the reason for the above problem is that the timeout is set for long connections, and long connections will be reused.

Based on these two points, change the code.

package main

import (
    "bytes"
    "encoding/json"
    "fmt"
    "io/ioutil"
    "net/http"
    "time"
)

var tr *http.Transport

func init() {
    tr = &http.Transport{
        MaxIdleConns: 100,
        // The following code was killed
        //Dial: func(netw, addr string) (net.Conn, error) {
        //    conn, err := net.DialTimeout(netw, addr, time.Second*2) / / set the connection timeout
        //    if err != nil {
        //        return nil, err
        //    }
        //    err = conn.SetDeadline(time.Now().Add(time.Second * 3)) / / set the timeout for sending and receiving data
        //    if err != nil {
        //        return nil, err
        //    }
        //    return conn, nil
        //},
    }
}


func Get(url string) ([]byte, error) {
    m := make(map[string]interface{})
    data, err := json.Marshal(m)
    if err != nil {
        return nil, err
    }
    body := bytes.NewReader(data)
    req, _ := http.NewRequest("Get", url, body)
    req.Header.Add("content-type", "application/json")

    client := &http.Client{
        Transport: tr,
        Timeout: 3*time.Second,  // The timeout added here is the timeout of each call
    }
    res, err := client.Do(req) 
    if res != nil {
        defer res.Body.Close()
    }
    if err != nil {
        return nil, err
    }
    resBody, err := ioutil.ReadAll(res.Body)
    if err != nil {
        return nil, err
    }
    return resBody, nil
}

func main() {
    for {
        _, err := Get("http://www.baidu.com/")
        if err != nil {
            fmt.Println(err)
            break
        }
    }
}

Look at the notes and you will find that there are two points to change

  • http. Some timeout settings in transport for connection establishment have been killed.

  • When an HTTP request is initiated, the scenario http Client, add the timeout setting at this time. The timeout here can be understood as the timeout of a single request. You can also see the notes

    Timeout specifies a time limit for requests made by this Client.

Here, the code will be changed and the problems in actual production will be solved.

In the example code, if you take it to run, you will actually make the following mistakes

Get http://www.baidu.com/: EOF

This is because it's called too hard, http://www.baidu.com The active disconnection over there can be understood as a current limiting measure to protect the server. After all, everyone does it like this, and the server will explode...

The solution is very simple. Just add a sleep interval in the middle of each HTTP call.


Here, in fact, the problem has been solved. Next, we will analyze the causes of the problem at the source level. You don't have to look down at the bottom of the article and do the things you're not interested in. (crazy hint, please, this is really important to me!)

Source code analysis

The version of go.7 is used.

Start with a network request.

res, err := client.Do(req)
func (c *Client) Do(req *Request) (*Response, error) {
    return c.do(req)
}

func (c *Client) do(req *Request) {
    // ...
    if resp, didTimeout, err = c.send(req, deadline); err != nil {
    // ...
  }
    // ...  
}  
func send(ireq *Request, rt RoundTripper, deadline time.Time) {
    // ...    
    resp, err = rt.RoundTrip(req)
     // ...  
} 

// Enter the RoundTrip logic from here
/src/net/http/roundtrip.go: 16
func (t *Transport) RoundTrip(req *Request) (*Response, error) {
    return t.roundTrip(req)
}

func (t *Transport) roundTrip(req *Request) (*Response, error) {
    // Try to get an idle connection to initiate an http connection
  pconn, err := t.getConn(treq, cm)
  // ...
}

// Focus on this function. The return is a long connection
func (t *Transport) getConn(treq *transportRequest, cm connectMethod) (*persistConn, error) {
  // Omit a lot of logic and focus on the following two points
    // Return when there is an idle connection
    pc := <-t.getIdleConnCh(cm)

  // No connection created
  pc, err := t.dialConn(ctx, cm)

}

Many of the above codes are actually just to show how this part of the code is tracked down, so that everyone can follow it when looking at the source code.

Finally, there is a getConn method in the above code. When initiating a network request, a network connection will be taken first, and the connection has two sources.

  • If there is an idle connection, take the idle connection

    /src/net/http/tansport.go:810
    func (t *Transport) getIdleConnCh(cm connectMethod) chan *persistConn {
       // Return the chan of idle connection
       ch, ok := t.idleConnCh[key]
         // ...
       return ch
    }
  • If there is no free connection, create a long connection.

/src/net/http/tansport.go:1357
func (t *Transport) dialConn() {
  //...
  conn, err := t.dial(ctx, "tcp", cm.addr())
  // ...
  go pconn.readLoop()
  go pconn.writeLoop()
  // ...
}

When an http request is initiated for the first time, there must be no idle connection at this time, and a new connection will be established. A read goroutine and a write goroutine will be created at the same time.

Note t.dial(ctx, "tcp", cm.addr()) in the above code. If HTTP is set as at the beginning of the article Transport

Dial: func(netw, addr string) (net.Conn, error) {
   conn, err := net.DialTimeout(netw, addr, time.Second*2) //Set connection timeout
   if err != nil {
      return nil, err
   }
   err = conn.SetDeadline(time.Now().Add(time.Second * 3)) //Set sending and receiving data timeout
   if err != nil {
      return nil, err
   }
   return conn, nil
},

Then this will be executed in the following dial

func (t *Transport) dial(ctx context.Context, network, addr string) (net.Conn, error) {
   // ...
  c, err := t.Dial(network, addr)
  // ...
}

The settings called here are timed out and will be executed until

/src/net/net.go
func (c *conn) SetDeadline(t time.Time) error {
    //...
    c.fd.SetDeadline(t)
    //...
}

//...

func setDeadlineImpl(fd *FD, t time.Time, mode int) error {
    // ...
    runtime_pollSetDeadline(fd.pd.runtimeCtx, d, mode)
    return nil
}


//go:linkname poll_runtime_pollSetDeadline internal/poll.runtime_pollSetDeadline
func poll_runtime_pollSetDeadline(pd *pollDesc, d int64, mode int) {
    // ...
  // Set a timer event
  rtf = netpollDeadline
    // And register the event in the timer
  modtimer(&pd.rt, pd.rd, 0, rtf, pd, pd.rseq)
}  

The above source code, in short, is that when the request is called for the first time, a connection will be established, and a timer event will be registered at this time. Assuming that the time is set to 3s, this event will occur after 3s, and then the logic of registering the event will be executed. The registration event is netpollDeadline. Note the netpollDeadline, which will be mentioned later.

The timeout event is set and occurs after 3s. Send and receive data normally in the next period. Business as usual.

Until 3s later, when you read goroutine, you will wait for the network data to return.

/src/net/http/tansport.go:1642
func (pc *persistConn) readLoop() {
    //...
    for alive {
        _, err := pc.br.Peek(1)  // Blocking reading the data returned by the server
    //...
}

Then it's always with the code.

src/bufio/bufio.go: 129
func (b *Reader) Peek(n int) ([]byte, error) {
   // ...
   b.fill() 
   // ...   
}

func (b *Reader) fill() {
    // ...
    n, err := b.rd.Read(b.buf[b.w:])
    // ...
}

/src/net/http/transport.go: 1517
func (pc *persistConn) Read(p []byte) (n int, err error) {
    // ...
    n, err = pc.conn.Read(p)
    // ...
}

// /src/net/net.go: 173
func (c *conn) Read(b []byte) (int, error) {
    // ...
    n, err := c.fd.Read(b)
    // ...
}

func (fd *netFD) Read(p []byte) (n int, err error) {
    n, err = fd.pfd.Read(p)
    // ...
}

/src/internal/poll/fd_unix.go: 
func (fd *FD) Read(p []byte) (int, error) {
    //...
  if err = fd.pd.waitRead(fd.isFile); err == nil {
    continue
  }
    // ...
}

func (pd *pollDesc) waitRead(isFile bool) error {
    return pd.wait('r', isFile)
}

func (pd *pollDesc) wait(mode int, isFile bool) error {
    // ...
  res := runtime_pollWait(pd.runtimeCtx, mode)
    return convertErr(res, isFile)
}

Until it reaches the runtime_pollWait, which can be simply regarded as waiting for the return of data from the server.

//go:linkname poll_runtime_pollWait internal/poll.runtime_pollWait
func poll_runtime_pollWait(pd *pollDesc, mode int) int {

    // 1. If the network returns data normally, it will jump out
  for !netpollblock(pd, int32(mode), false) {
    // 2. Jump out if there is an error
        err = netpollcheckerr(pd, int32(mode))
        if err != 0 {
            return err
        }
    }
    return 0
}

When the whole link follows, it will always wait for data. There are only two waiting results

  • There is data that can be read
  • Error reporting

There are two kinds of error reporting

  • Connection closed
  • overtime
func netpollcheckerr(pd *pollDesc, mode int32) int {
    if pd.closing {
        return 1 // errClosing
    }
    if (mode == 'r' && pd.rd < 0) || (mode == 'w' && pd.wd < 0) {
        return 2 // errTimeout
    }
    return 0
}

The timeout mentioned here refers to the number 2 returned here, which will be converted into ErrTimeout through the following function, and ErrTimeout Error() is actually i/o timeout.

func convertErr(res int, isFile bool) error {
    switch res {
    case 0:
        return nil
    case 1:
        return errClosing(isFile)
    case 2:
        return ErrTimeout // ErrTimeout.Error() is "i/o timeout"
    }
    println("unreachable: ", res)
    panic("unreachable")
}

So here comes the question.... How does the timeout error returned above, that is, how does the condition meet when returning 2?

    if (mode == 'r' && pd.rd < 0) || (mode == 'w' && pd.wd < 0) {
        return 2 // errTimeout
    }

Remember netpoll deadline just mentioned?

This contains the logic executed when the timer 3s reaches the point.

func timerproc(tb *timersBucket) {
    // The timer has reached the set time point. Register the function before triggering
    f(arg, seq) // netpollDeadline was previously registered
}

func netpollDeadline(arg interface{}, seq uintptr) {
    netpolldeadlineimpl(arg.(*pollDesc), seq, true, true)
}

/src/runtime/netpoll.go: 428
func netpolldeadlineimpl(pd *pollDesc, seq uintptr, read, write bool) {
    //...
    if read {
        pd.rd = -1
        rg = netpollunblock(pd, 'r', false)
    }
    //...
}

PD. Is set here Rd = - 1 refers to poller descriptor Read deadline means the read timeout time of the file descriptor of the network poller. We know that everything is a file in linux. The file here actually refers to the socket used in this network communication.

At this time, go back and see that the timeout condition is if (mode = = 'R' & & PD. Rd < 0).

So far. We received an error message of io timeout in our code.

summary

  • Don't use HTTP Timeout is set in transport, which is the timeout of connection, not the timeout of request. Otherwise, an inexplicable io timeout error may occur.

  • The timeout of the request is set in the create client.

If the article is helpful to you, look at the bottom right corner of the article and do something positive energy (click twice) to support it. (crazy hint, please, this is really important to me!)

I'm Xiaobai. I'll see you next time.

Article recommendation:

Stop talking, let's choke in the ocean of knowledge

Attention to the official account: golang

Keywords: Go network

Added by renzaho on Tue, 08 Feb 2022 03:35:45 +0200