Vault Crashes OOM #4803

Closed
rshsmi opened this issue Jun 20, 2018 · 29 comments

@rshsmi

rshsmi commented Jun 20, 2018

Describe the bug
When a Vault node in a cluster of 6 with a Consul backend becomes leader, CPU load and memory usage increase until the node crashes due to OOM.

To Reproduce
Steps to reproduce the behavior:
A Vault node becomes leader and crashes with:
fatal error: runtime: out of memory

Expected behavior
The node remains the active leader without runaway memory growth or an OOM crash.

Environment:

  • Vault Server Version (retrieve with vault status):
    Key             Value
    ---             -----
    Seal Type       shamir
    Sealed          false
    Total Shares    5
    Threshold       3
    Version         0.10.1
    Cluster Name    vault-cluster-
    Cluster ID
    HA Enabled      true
    HA Cluster      https://
    HA Mode         active

  • Server Operating System/Architecture:

Docker version 18.03.1-ce, build 9ee9f40
4.14.48-coreos-r2

Vault server configuration file(s):

{"storage": {"consul": {"address": ":8500", "path": "vault", "token": "${consul_token}" }}, "listener": { "tcp": { "address": "0.0.0.0:443", "tls_cert_file": "/vault/certs/tls.crt", "tls_key_file": "/vault/certs/tls.key" }}, "disable_cache": "true", "disable_mlock": "true", "ui": "true"}

Additional context
Consul backend: Consul 1.1.0
Crash log: vault_crash.log
Prometheus monitoring graph: (screenshot attached)

@jefferai
Member

It sounds like you have a huge number of leases, so when Vault is loading them when a node becomes active, you OOM. Probably you need to increase memory to get past the immediate issue, and then figure out why you have so many leases. (Running trace mode on logs will have the lease loading print out a total and running load count.)
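For example, the trace level can be set at startup with the -log-level flag; the config file path below is only a placeholder for your actual path:

# Start Vault with trace-level logging so the lease-loading totals are printed
vault server -config=/vault/config.json -log-level=trace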

@rshsmi
Author

rshsmi commented Jun 20, 2018

@jefferai - Thanks for your response. The total number of leases we have is around 100. Since I created the ticket we've upgraded to 0.10.2 and changed the log level to TRACE; since the upgrade, the only logs we see are the standard rollback messages.

@jefferai
Member

You could try upgrading to 0.10.3, although I doubt there's anything relevant (but it's a good idea anyways).

Depending on what the issue is, based on your log, 0.10.2 may in fact help. We've fixed issues in 0.10.1 and 0.10.2 related to revocation logic.

Can you get total counts of items in Consul under sys/expire/id and sys/token/id?
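One way to get those counts, sketched here assuming the "vault" storage path from your config, a reachable Consul agent, and jq being available (add an X-Consul-Token header if ACLs are enabled):

# The ?keys query returns a JSON array of key names under the prefix
curl -s "http://<consul address>:8500/v1/kv/vault/sys/expire/id?keys" | jq length
curl -s "http://<consul address>:8500/v1/kv/vault/sys/token/id?keys" | jq length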

@rshsmi
Author

rshsmi commented Jun 20, 2018

Thanks for your suggestions; in fact, I was just reviewing the changes in 0.10.3.
121 under sys/expire/id
59 under sys/token/id

@jefferai
Member

Please let us know if the problem persists under 0.10.2/0.10.3.

@rshsmi
Author

rshsmi commented Jun 20, 2018

The problem persists on 0.10.2.
(memory usage graph: vault-10-3)

@jefferai
Member

Can you get another crash log from 0.10.3? Since the logic has changed significantly it would be good to know the current stack trace.

Also, is there a reason you are running with disable_cache set to true? Does this issue go away if you remove that config option?

@rshsmi
Author

rshsmi commented Jun 20, 2018

I'm waiting for the 0.10.3 Docker image to be available. disable_cache is set to false (the default value).
[DEBUG] expiration: leases collected: num_existing=117
The memory issue still persists.
(screenshot: disable_cache)

@jefferai
Member

{"storage": {"consul": {"address": ":8500", "path": "vault", "token": "${consul_token}" }}, "listener": { "tcp": { "address": "0.0.0.0:443", "tls_cert_file": "/vault/certs/tls.crt", "tls_key_file": "/vault/certs/tls.key" }}, "disable_cache": "true", "disable_mlock": "true", "ui": "true"}

It's set to true...

@rshsmi
Author

rshsmi commented Jun 20, 2018

I changed it to "false" after your last comment, then restarted all the nodes. The graph in my previous comment is from the leader with "disable_cache": "false".

@jefferai jefferai added this to the 0.10.4 milestone Jun 20, 2018
@rshsmi
Author

rshsmi commented Jun 21, 2018

Vault 0.10.2 crash log:
vault-0-10-2.log.txt

@briankassouf
Contributor

Could you also post the full vault server logs with any sensitive values redacted? It seems the crash is happening while trying to revoke a lease. It would be helpful to see any lease revocation errors from the logs.

@rshsmi
Author

rshsmi commented Jun 22, 2018

@briankassouf - I've been through the logs on all nodes; here is what I found:
[ERROR] secrets.system.system_62531041: error retrieving lease: lease_id=/auth/ldap/login/XXXXXX error="failed to read lease entry: decryption failed: cipher: message authentication failed"
If I try to revoke it using the CLI:
vault lease revoke /auth/ldap/login/XXXXXX
it is revoked:
[INFO ] expiration: revoked lease: lease_id=auth/ldap/login/
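As a hedged aside: if the Vault version in use exposes the sys/leases/lookup list endpoint, other lease IDs under the same path can be enumerated to check whether more entries fail to decrypt (the path below is just this example's prefix):

# List lease IDs under the ldap auth login prefix; requires sufficient sys/leases permissions
vault list sys/leases/lookup/auth/ldap/login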

@briankassouf
Contributor

@ars05 Did anything change with your underlying datastore? Vault being unable to decrypt the data can mean the data changed out from under it, perhaps from a bad Consul restore.

That being said, it shouldn't cause an OOM. The full logs would be helpful to determine what's being loaded into memory and whether you got into some kind of loop attempting to revoke this lease.

@rshsmi
Author

rshsmi commented Jun 25, 2018

@briankassouf - We took a Consul snapshot of an old cluster and then restored it on the new cluster. Please find the full log here: vault.log

@briankassouf
Contributor

@ars05 I don't see anything suspicious in the log file. What calls are clients making while the memory is growing?

@briankassouf
Contributor

Also were you ever able to try the 0.10.3 docker container to see if it was fixed in that version?

@rshsmi
Author

rshsmi commented Jul 2, 2018

@briankassouf - here are the crash log and memory usage graph from 0.10.3 - the log level is set to trace.
vault-0-10-3.log
(memory usage graph: vault-0-10-3)

@briankassouf
Contributor

@ars05 Are you still experiencing this OOM, and could you describe the access patterns and what auth/secret mounts you are using?

@briankassouf briankassouf modified the milestones: 0.10.4, 0.10.5 Jul 24, 2018
@rshsmi
Author

rshsmi commented Jul 31, 2018

@briankassouf - this issue still persists. When a node becomes active, memory usage increases until it crashes with fatal error: runtime: out of memory, and then another node in the cluster becomes active and goes through the same cycle.

In total we have 25 secrets engines enabled: 5 AWS, 1 Consul, 5 secret, 1 SSH, 13 PKI.
10-15 SRE engineers access Vault (mostly through the UI); AD/LDAP integration is enabled.
In terms of authentication methods, 3 Kubernetes clusters, LDAP, AppRole, and token (by default) are enabled/configured.

@kalafut
Contributor

kalafut commented Jul 31, 2018

@ars05 I just pushed a branch off of 0.10.3 that enables some basic memory profiling: https://github.com/hashicorp/vault/tree/profile-0.10.3 . Are you able to build from source?

This version will save memory stats files in $TMPDIR/vaultprof/ every 5 minutes. If you can run with this version through the event and send us a zip of the vaultprof folder, we should be able to get a better idea of what is using so much memory.
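Once the files exist, a rough way to inspect one locally is with go tool pprof; the binary and profile paths below are placeholders for wherever your build and $TMPDIR/vaultprof/ files live:

# Print the top memory consumers from a heap profile, non-interactively
go tool pprof -top /path/to/vault /tmp/vaultprof/<timestamp>.pprof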

@jefferai jefferai modified the milestones: 0.10.5, 0.11 Aug 13, 2018
@chrishoffman chrishoffman added this to the 0.11 milestone Aug 13, 2018
@chrishoffman chrishoffman removed this from the 0.11 milestone Aug 25, 2018
@chrishoffman chrishoffman added this to the 0.11.1 milestone Aug 25, 2018
@jefferai jefferai modified the milestones: 0.11.1, 0.11.2 Sep 1, 2018
@rshsmi
Author

rshsmi commented Sep 4, 2018

@kalafut - I've built the binary from the profile-0.10.3 branch and deployed it. The configuration is exactly the same as on the other nodes, and here is the log at startup:

==> Vault server configuration:

             Api Address: https://IP_ADDRESS:443
                     Cgo: disabled
         Cluster Address: https://IP_ADDRESS:444
              Listener 1: tcp (addr: "0.0.0.0:443", cluster address: "0.0.0.0:444", max_request_duration: "1m30s", max_request_size: "33554432", tls: "enabled")
               Log Level: info
                   Mlock: supported: true, enabled: false
                 Storage: consul (HA available)
                 Version: Vault v0.10.4
             Version Sha: 8e427f20e3174039d5a49d7e7ab2566ddacf2ce2+CHANGES

==> Vault server started! Log data will stream in below:

2018-09-04T11:07:40.120Z [WARN ] storage.consul: appending trailing forward slash to path
2018-09-04T11:08:23.834Z [INFO ] core: vault is unsealed
2018-09-04T11:08:23.834Z [INFO ] core: entering standby mode
2018-09-04T11:08:23.834Z [INFO ] core: entering standby mode

The node is unsealed but is not active (leader) yet, and I don't get any memory stat files. Is this expected? Does the node need to be active (leader) to produce memory stat files?

@kalafut
Copy link
Contributor

kalafut commented Sep 4, 2018

@ars05 The profiling will be on regardless of active status. The log message is at the DEBUG level, so you'd need to start the vault binary with -log-level=debug to see a message like this appear in the log stream:

2018-09-04T11:01:22.314-0700 [DEBUG] wrote memory profile: filename=/var/folders/yy/kf718y9x41g40ncf8s29mshh0000gn/T/vaultprof/20180904_180122.pprof

But debug is not required for the profiling to actually occur. Did you check $TMPDIR for a vaultprof folder, and some files therein?
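A minimal way to check both things, assuming the same placeholder config path as earlier and the default $TMPDIR:

# Restart with debug logging so the "wrote memory profile" lines show up
vault server -config=/vault/config.json -log-level=debug

# See whether the profiling build has written any .pprof files yet
ls -lh "${TMPDIR:-/tmp}/vaultprof/"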

@rshsmi
Author

rshsmi commented Sep 7, 2018

@kalafut - please find the pprof files attached.
vaultprof.zip

@kalafut
Contributor

kalafut commented Sep 12, 2018

@ars05 Thanks, these were pretty useful. Based on them, I'd like to review the structure of some token data in Consul. Can you please export and provide the list of keys under vault/sys/token/parent, e.g.:

curl http://<consul address>/v1/kv/vault/sys/token/parent\?keys > out.json

This list will only contain hashed UUID paths like "vault/sys/token/parent/10e62d82c8982d247b64c8b17c44c24e7b6d4d5e/6da14c90b428e2f7c8140c44c14d836d0e0af4e6". If you wish you can email the output to me directly: jkalafut@hashicorp.com

@rshsmi
Author

rshsmi commented Sep 13, 2018

@kalafut - Thanks, I sent you an email with the list of keys.

@kalafut
Contributor

kalafut commented Sep 14, 2018

We were able to create an abnormal key state (a parent/child cycle) that results in ever-growing memory use similar to the profiles you sent earlier. The keys you provided had this problem state as well.

A PR (#5335) is up to handle such cycles correctly.

@rshsmi
Author

rshsmi commented Oct 11, 2018

This issue is resolved.
(memory usage graph: vault-0-11-3)

@kalafut
Contributor

kalafut commented Oct 11, 2018

@ars05 Excellent news!
