Add metrics to identify audit logging failures #2863
Comments
I just had this issue in production this morning: we had a failure in our audit backend (the file was rotated, but the -HUP signal did not happen for some reason). Vault log: 2017/06/19 05:41:59.431861 [ERROR] audit: backend failed to log response: backend=file/ error=write vault_audit.log: bad file descriptor
Yet the health checks kept passing, despite 500 errors on every read/write call to Vault. This seems... wrong somehow :) Ideally, I'd love it if Vault, upon hitting a bad FD, would simply retry the open as if a HUP signal had arrived, especially since audit is crucial to a validly operating Vault system. Also, the /health check should probably at least WARN, if not outright FAIL, when Vault can't write the audit log for whatever reason.
Addresses a pain point from #2863 (comment)
@edjackson-wf Any chance you can tell me if you think https://github.com/hashicorp/vault/pull/3001/files meets your needs? I figured that an incrementing counter is probably the right way to go.
@jefferai Yes, I think that would work for me. I suppose it might be worth considering the case where multiple audit backends are enabled and one fails. Is it worth distinguishing between audit failures that cause requests to fail and those that don't? It's not my use case, but it seems plausible.
@edjackson-wf We can add more specific metrics later if needed, but I'd argue that any time that counter is continually going up, there's a bad situation just waiting to happen, regardless of which backend is currently experiencing the problem. At that point the logs will tell the rest.
@jefferai Fair enough. Thanks a bunch for adding this.
No problem!
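For anyone finding this later, here is a rough sketch of what such a failure counter can look like, using the armon/go-metrics library that Vault's telemetry is built on. The metric name, the types, and the surrounding fan-out code are illustrative assumptions, not the actual change in #3001:

```go
package main

import (
	"errors"
	"fmt"

	metrics "github.com/armon/go-metrics"
)

// auditBackend is a stand-in for a Vault audit backend.
type auditBackend interface {
	LogResponse(entry string) error
}

// brokenFileBackend simulates the "bad file descriptor" failure reported
// above: every write to the audit log fails.
type brokenFileBackend struct{}

func (brokenFileBackend) LogResponse(entry string) error {
	return errors.New("write vault_audit.log: bad file descriptor")
}

// logResponseToAll fans a response entry out to every enabled audit backend
// and bumps a counter for each failure so an operator can alert on it.
func logResponseToAll(backends map[string]auditBackend, entry string) {
	for name, b := range backends {
		if err := b.LogResponse(entry); err != nil {
			// Hypothetical metric name; the counter added in #3001 may be
			// named differently.
			metrics.IncrCounter([]string{"audit", "log_response_failure"}, 1)
			fmt.Printf("audit: backend failed to log response: backend=%s error=%v\n", name, err)
		}
	}
}

func main() {
	backends := map[string]auditBackend{"file/": brokenFileBackend{}}
	logResponseToAll(backends, `{"type":"response"}`)
}
```

Whatever sink is wired into go-metrics then sees a monotonically increasing failure count, which is straightforward to alert on.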
It would be very helpful to be able to alert an operator when there are audit logging failures. Because this information isn't available from the /sys/health endpoint, we need some other means.
I would suggest the addition of some appropriate metrics in the telemetry, so alerting can be done from statsd/statsite.
I don't have strong feelings about exactly what the metrics should be. Being able to monitor 500 response codes would certainly help, or maybe the audit logging backends should provide more specific error metrics.
See also this conversation.
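To sketch how such a counter could reach statsd/statsite for alerting (the address, service name, and metric key below are placeholder assumptions; go-metrics is the library Vault's telemetry already uses, and it ships with statsd and statsite sinks):

```go
package main

import (
	"log"
	"time"

	metrics "github.com/armon/go-metrics"
)

func main() {
	// Placeholder address; point this at your statsd or statsite daemon.
	sink, err := metrics.NewStatsdSink("127.0.0.1:8125")
	if err != nil {
		log.Fatal(err)
	}

	// Register the sink globally so every metrics.IncrCounter call is
	// forwarded to statsd.
	if _, err := metrics.NewGlobal(metrics.DefaultConfig("vault"), sink); err != nil {
		log.Fatal(err)
	}

	// An audit failure counter emitted like this would show up in statsd
	// under a key along the lines of
	// "vault.<hostname>.audit.log_response_failure".
	metrics.IncrCounter([]string{"audit", "log_response_failure"}, 1)

	// Give the sink's background flusher a moment before exiting.
	time.Sleep(200 * time.Millisecond)
}
```

In a real deployment this wiring would come from Vault's telemetry configuration rather than application code; the sketch just shows the path a counter takes from an audit failure to an alertable statsd metric.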