Add an idle timeout for the server #4760
Conversation
A note for reviewers: the main test of tidy functions is in the expiration manager, but the way it's fixed up shows that the CAS stuff is working appropriately (and the way it failed earlier made it clear that both goroutines were exiting immediately). Given that they all now share the same control structure it should be applicable across the changed functions. Another item: 10 minutes could be too long; should we make it 5? What operations take even that long with tidy out of the picture?
My main comment is whether the
Maybe; the problem is picking a number that isn't arbitrary. At what point would you want to cut it off? If Vault is functioning normally, why kill tidy after 30 minutes, leaving things in a not-fully-cleaned-up state?
Yes, if tidy durations have a very large range then context.WithTimeout will not help.
Logging "Tidy operation {x} completed successfully" would be good if we're not doing that already somewhere (I didn't notice it in the diff but maybe it's higher up). Prior to this commit, the API call returning was effectively that message. If there are ever questions about the operation it would be good to grep the logs for started/completed pairs.
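The started/completed pairing suggested above could be sketched as a small wrapper; the function and operation names here are illustrative, not Vault's actual identifiers:

```go
package main

import "fmt"

// logTidy brackets a tidy operation with started/completed log lines so
// operators can grep the server log for matching pairs. The logf sink
// stands in for Vault's structured logger.
func logTidy(name string, logf func(string), work func() error) {
	logf(fmt.Sprintf("tidy operation %q started", name))
	if err := work(); err != nil {
		logf(fmt.Sprintf("tidy operation %q failed: %v", name, err))
		return
	}
	logf(fmt.Sprintf("tidy operation %q completed successfully", name))
}

func main() {
	logTidy("revoked certs",
		func(s string) { fmt.Println(s) }, // print instead of a real logger
		func() error { return nil })       // the tidy work itself
}
```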
command/server.go
@@ -935,7 +935,8 @@ CLUSTER_SYNTHESIS_COMPLETE:
 	}

 	server := &http.Server{
-		Handler: handler,
+		Handler:     handler,
+		IdleTimeout: 10 * time.Minute,
I tried to check that this is having the desired effect but haven't been successful. Using a build with a very low value (5s) and throwing a long sleep into some operation, the connection wasn't closed. I was even able to nc 127.0.0.1 8200 and the connection would stay open indefinitely.

I wrote a little Go server to experiment and found that ReadTimeout will close my nc test at the right time, but IdleTimeout alone won't. I'm not sure what IdleTimeout is really doing. But replacing IdleTimeout with ReadTimeout in Vault didn't close my connection either.
Because tidy operations can be long-running, this also changes all tidy operations to behave the same operationally (kick off the process, get a warning back, log errors to server log) and makes them all run in a goroutine. This could mean a sort of hard stop if Vault gets sealed because the function won't have the read lock. This should generally be okay (running tidy again should pick back up where it left off), but future work could use cleanup funcs to trigger the functions to stop.
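The kick-off-and-return pattern with a CAS guard described above could be sketched as follows; the names and messages are illustrative, not Vault's actual identifiers:

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// tidyRunning is the shared control word the CAS guards.
var tidyRunning uint32

// startTidy kicks off tidy work in a goroutine unless one is already
// running; the caller gets a warning-style message back immediately,
// and errors from the work would go to the server log.
func startTidy(work func()) string {
	// The CAS happens in the caller, so a concurrent second call
	// deterministically sees the flag already set.
	if !atomic.CompareAndSwapUint32(&tidyRunning, 0, 1) {
		return "tidy operation already in progress"
	}
	go func() {
		defer atomic.StoreUint32(&tidyRunning, 0)
		work()
	}()
	return "tidy operation kicked off"
}

func main() {
	done := make(chan struct{})
	fmt.Println(startTidy(func() { <-done })) // first call wins the CAS
	fmt.Println(startTidy(func() {}))         // rejected while the first runs
	close(done)
}
```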
…r server, plus add readheader/read timeout to api server
vault/request_forwarding.go
	// that don't successfully auth to be kicked out quickly.
	// Cluster connections should be reliable so being marginally
	// aggressive here is fine.
	err = tlsConn.SetDeadline(time.Now().Add(10 * time.Second))
Replication also uses this connection and that potentially could be going over a less reliable network. Maybe it should be bumped a little bit?
That's a good point. 30 seconds? If it disconnects it will reconnect again...
Works for me!
…nster because that passes more often than less monster timeouts, but really it's just flaky for reasons that likely have to do with logical.Framework testing stuff because there's no evidence of any problem other than it just not having run in time.
@@ -285,7 +285,7 @@ func prepareDynamoDBTestContainer(t *testing.T) (cleanup func(), retAddress stri
 		t.Fatalf("Failed to connect to docker: %s", err)
 	}

-	resource, err := pool.Run("deangiberson/aws-dynamodb-local", "latest", []string{})
+	resource, err := pool.Run("cnadiminti/dynamodb-local", "latest", []string{})
Was there a reason for this switch? If it is significant, can we have a comment as to why this is better?
This test was failing only on Jeff's laptop, with no issues for other devs or even Travis. The error didn't make a ton of sense either, since basically reading back from the container what was just written didn't work. Unclear what the true root cause was, but we assumed it was Docker-related. Our previous container hadn't been updated by its author in 15 months, while the new one is current and well-used. When we swapped it in, all platforms started passing.
It's merged in from a branch Jim put up, but basically: for some reason on my machine some dynamodb tests were failing in ways that others could not reproduce, and while my container was supposedly up to date, the other container in general has not been kept up to date, whereas the new one has.
	err = tlsConn.SetDeadline(time.Time{})
	if err != nil {
		if c.logger.IsDebug() {
			c.logger.Debug("error setting deadline for cluster connection", "error", err)
Do we want to distinguish this error from the error above?