Vault Crashes OOM #4803
It sounds like you have a huge number of leases, so when Vault loads them as a node becomes active, you OOM. You probably need to increase memory to get past the immediate issue, and then figure out why you have so many leases. (With trace-level logging, lease loading prints a total and a running load count.)
@jefferai - Thanks for your response. The total number of leases we have is around 100. Since I created the ticket we've upgraded to 0.10.2 and changed the log level to TRACE; since the upgrade, the only logs we see are the standard rollback messages.
You could try upgrading to 0.10.3, although I doubt there's anything relevant (but it's a good idea anyway). Depending on what the issue is, based on your log, 0.10.2 may in fact help. We've fixed issues in 0.10.1 and 0.10.2 related to revocation logic. Can you get total counts of items in Consul under sys/expire/id and sys/token/id?
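One way to get those counts is through Consul's HTTP KV API with the `?keys` parameter, which returns a JSON array of key names. A hedged sketch (the agent address and the `vault/` storage prefix are assumptions based on a default local setup, and `countKeys`/`countPrefix` are hypothetical helper names):

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
)

// countKeys parses the JSON array returned by Consul's
// /v1/kv/<prefix>?keys endpoint and returns the number of entries.
func countKeys(body []byte) (int, error) {
	var keys []string
	if err := json.Unmarshal(body, &keys); err != nil {
		return 0, err
	}
	return len(keys), nil
}

// countPrefix fetches the key list for a prefix from a local Consul
// agent. The address is an assumption (default agent on 127.0.0.1:8500).
func countPrefix(prefix string) (int, error) {
	resp, err := http.Get("http://127.0.0.1:8500/v1/kv/" + prefix + "?keys")
	if err != nil {
		return 0, err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return 0, err
	}
	return countKeys(body)
}

func main() {
	// The "vault/" prefix assumes the default Consul storage path.
	for _, p := range []string{"vault/sys/expire/id", "vault/sys/token/id"} {
		n, err := countPrefix(p)
		if err != nil {
			fmt.Println(p, "error:", err)
			continue
		}
		fmt.Println(p, n)
	}
}
```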
Thanks for your suggestions. In fact, I was just reviewing the changes in 0.10.3.
Please let us know if the problem persists under 0.10.2/0.10.3.
Can you get another crash log from 0.10.3? Since the logic has changed significantly it would be good to know the current stack trace. Also, is there a reason you are running with disable_cache set to true? Does this issue go away if you remove that config option?
It's set to
Changed it to "false" after your last comment, then restarted all the nodes. The graph in my previous comment is from the leader with "disable_cache": "false".
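For reference, `disable_cache` is a top-level setting in the Vault server configuration file. A minimal sketch; the storage and listener values below are placeholders for illustration, not the reporter's actual config:

```hcl
# Placeholder config; only disable_cache is the setting under discussion.
disable_cache = false

storage "consul" {
  address = "127.0.0.1:8500"
  path    = "vault/"
}

listener "tcp" {
  address     = "0.0.0.0:8200"
  tls_disable = 1
}
```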
Vault 0.10.2 crash log: |
Could you also post the full vault server logs with any sensitive values redacted? It seems the crash is happening while trying to revoke a lease. It would be helpful to see any lease revocation errors from the logs.
@briankassouf - I've been through the logs on all nodes; here is what I found:
@ars05 Did anything change with your underlying datastore? Vault being unable to decrypt the data can mean the data changed out from under it, perhaps from a bad Consul restore. That being said, it shouldn't cause an OOM. The full logs would be helpful to determine what's being loaded into memory and whether you got into some kind of loop attempting to revoke this lease.
@briankassouf - We took a Consul snapshot of an old cluster, then restored it on the new cluster. Please find the full log here: vault.log
@ars05 I don't see anything suspicious in the log file. What calls are clients making during the time the memory is growing?
Also, were you ever able to try the 0.10.3 docker container to see if it was fixed in that version?
@briankassouf - here is the crash log and memory usage graph from 0.10.3; the log level is set to trace.
@ars05 Are you still experiencing this OOM, and could you describe the access patterns and which auth/secret mounts you are using?
@briankassouf - this issue still persists. When a node becomes active, its memory usage increases until it crashes with fatal error: runtime: out of memory; another node in the cluster then becomes active and goes through the same cycle. In total we have 25 secret engines enabled: 5 AWS, 1 Consul, 5 secrets, 1 SSH, 13 PKI.
@ars05 I just pushed a branch off of 0.10.3 that enables some basic memory profiling: https://github.com/hashicorp/vault/tree/profile-0.10.3 . Are you able to build from source? This version will save memory stats files in
@kalafut - I've built the binary from the profile-0.10.3 branch and deployed it. The configuration is exactly the same as the other nodes, and here is the log at startup:
The node is unsealed but is not active (leader) yet, and I don't get any memory stat files. Is this expected? Does the node need to be active (leader) to produce memory stat files?
@ars05 The profiling will be on regardless of active status. The log message is at the DEBUG level, so you'd need to start the vault binary with
But debug is not required for the profiling to actually occur. Did you check
@kalafut - please find the pprof files attached.
@ars05 Thanks, these were pretty useful. Based on them, I'd like to review the structure of some token data in Consul. Can you please export and provide the list of keys under vault/sys/token/parent, e.g.:
This list will only contain hashed UUID paths like
@kalafut - Thanks, I sent you an email with the list of keys.
We were able to create an abnormal key state (a parent/child cycle) that would result in ever-growing memory use, similar to the profiles you sent earlier. The keys you provided had this problem state as well. A PR (#5335) is up to handle such cycles correctly.
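The problem state can be illustrated with a small sketch. This is hypothetical code, not Vault's actual fix from #5335: given a map of child token to parent token, a walk up the parent chain that never terminates (because the chain revisits an entry) is exactly the kind of cycle that makes a traversal grow without bound unless it is detected:

```go
package main

import "fmt"

// hasCycle walks the parent chain starting at token and reports whether
// the chain revisits an entry (a parent/child cycle). parents maps a
// child token ID to its parent token ID; an empty string means "root".
func hasCycle(parents map[string]string, token string) bool {
	seen := map[string]bool{}
	for cur := token; cur != ""; {
		if seen[cur] {
			return true // the chain looped back on itself
		}
		seen[cur] = true
		cur = parents[cur]
	}
	return false // reached a root token; the chain is well-formed
}

func main() {
	// "a" is "b"'s parent and "b" is "a"'s parent: an abnormal cycle.
	cyclic := map[string]string{"a": "b", "b": "a"}
	fmt.Println(hasCycle(cyclic, "a")) // true

	// A normal chain: c -> b -> a -> root.
	chain := map[string]string{"c": "b", "b": "a"}
	fmt.Println(hasCycle(chain, "c")) // false
}
```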
@ars05 Excellent news!
Describe the bug
When a Vault node in a cluster of 6 with a Consul backend becomes leader, CPU load and memory usage increase until the node crashes because of OOM.
To Reproduce
Steps to reproduce the behavior:
A Vault node becomes leader and crashes with:
fatal error: runtime: out of memory
Expected behavior
The node becomes leader without unbounded memory growth or an OOM crash.
Environment:
Vault Server Version (retrieve with vault status):
Key Value
Seal Type shamir
Sealed false
Total Shares 5
Threshold 3
Version 0.10.1
Cluster Name vault-cluster-
Cluster ID
HA Enabled true
HA Cluster https://
HA Mode active
Docker version 18.03.1-ce, build 9ee9f40
4.14.48-coreos-r2
Vault server configuration file(s):
Additional context
Consul backend: Consul 1.1.0
Crash log: vault_crash.log
Prometheus Monitoring Graph: