net/http: scale rstAvoidanceDelay to reduce test flakiness

As far as I can tell, some flakiness is unavoidable in tests
that race a large client request write against a server's response
when the server doesn't read the full request.
It does not appear to be possible to simultaneously ensure that
well-behaved clients see EOF instead of ECONNRESET and also prevent
misbehaving clients from consuming arbitrary server resources.
(See RFC 7230 §6.6 for more detail.)

Since there doesn't appear to be a way to cleanly eliminate
this source of flakiness, we can instead work around it:
we can allow the test to adjust the hard-coded delay if it
sees a plausibly-related failure, so that the test can retry
with a longer delay.

As a nice side benefit, this also allows the tests to run more quickly
in the typical case: since the test will retry in case of spurious
failures, we can start with an aggressively short delay, and only back
off to a longer one if it is really needed on the specific machine
running the test.

Fixes #57084.
Fixes #51104.
For #58398.

Change-Id: Ia4050679f0777e5eeba7670307a77d93cfce856f
Cq-Include-Trybots: luci.golang.try:gotip-linux-amd64-longtest-race,gotip-linux-amd64-race,gotip-windows-amd64-race
Reviewed-on: https://go-review.googlesource.com/c/go/+/527196
LUCI-TryBot-Result: Go LUCI <golang-scoped@luci-project-accounts.iam.gserviceaccount.com>
Reviewed-by: Damien Neil <dneil@google.com>
Auto-Submit: Bryan Mills <bcmills@google.com>
This commit is contained in:
Bryan C. Mills 2023-09-11 16:17:03 -04:00 committed by Gopher Robot
parent 545e4f38e0
commit 6760f20ef5
4 changed files with 419 additions and 289 deletions

View file

@ -1750,8 +1750,12 @@ func (c *conn) close() {
// and processes its final data before they process the subsequent RST
// from closing a connection with known unread data.
// This RST seems to occur mostly on BSD systems. (And Windows?)
// This timeout is somewhat arbitrary (~latency around the planet).
const rstAvoidanceDelay = 500 * time.Millisecond
// This timeout is somewhat arbitrary (~latency around the planet),
// and may be modified by tests.
//
// TODO(bcmills): This should arguably be a server configuration parameter,
// not a hard-coded value.
var rstAvoidanceDelay = 500 * time.Millisecond
type closeWriter interface {
CloseWrite() error
@ -1770,6 +1774,27 @@ func (c *conn) closeWriteAndWait() {
if tcp, ok := c.rwc.(closeWriter); ok {
tcp.CloseWrite()
}
// When we return from closeWriteAndWait, the caller will fully close the
// connection. If client is still writing to the connection, this will cause
// the write to fail with ECONNRESET or similar. Unfortunately, many TCP
// implementations will also drop unread packets from the client's read buffer
// when a write fails, causing our final response to be truncated away too.
//
// As a result, https://www.rfc-editor.org/rfc/rfc7230#section-6.6 recommends
// that “[t]he server … continues to read from the connection until it
// receives a corresponding close by the client, or until the server is
// reasonably certain that its own TCP stack has received the client's
// acknowledgement of the packet(s) containing the server's last response.”
//
// Unfortunately, we have no straightforward way to be “reasonably certain”
// that we have received the client's ACK, and at any rate we don't want to
// allow a misbehaving client to soak up server connections indefinitely by
// withholding an ACK, nor do we want to go through the complexity or overhead
// of using low-level APIs to figure out when a TCP round-trip has completed.
//
// Instead, we declare that we are “reasonably certain” that we received the
// ACK if maxRSTAvoidanceDelay has elapsed.
time.Sleep(rstAvoidanceDelay)
}