
Sampling needs to be configurable for tracing, even if using custom sampler. Current code results in massive log spam #53012

Open
jbilliau-rcd opened this issue Sep 5, 2024 · 3 comments

Comments


jbilliau-rcd commented Sep 5, 2024

Describe the feature request

We use Dynatrace at our company and are currently testing the new OTel tracing provider configuration with Istio 1.22, since OpenTracing was removed in Envoy 1.30, which is what Istio 1.22 uses. We have it all working, but we see this message nonstop in our logs, to the point that it's very difficult to see actual traffic logs without some sort of filtering:

2024-09-05T18:28:10.787073363Z 2024-09-05T18:28:10.786914Z error envoy tracing external/envoy/source/extensions/tracers/opentelemetry/http_trace_exporter.cc:86 OTLP HTTP exporter received a non-success status code: 503 while exporting the OTLP message thread=23

It looks like Istio hardcodes a sample rate of 100 when using a custom sampler:

var sampling float64
if useCustomSampler {
    // If the TracingProvider has a custom sampler (OTel Sampler)
    // the sampling percentage is set to 100% so all spans arrive at the sampler for its decision.
    sampling = 100
} else if spec.RandomSamplingPercentage != nil {
    sampling = *spec.RandomSamplingPercentage
} else {
    // gracefully fallback to MeshConfig configuration. It will act as an implicit
    // parent configuration during transition period.
    sampling = proxyConfigSamplingValue(proxyCfg)
}
Thus, we have deduced with Dynatrace support that it's just rejecting the majority of the traces due to sampling rate limits on its side.

The comment says "the sampling percentage is set to 100% so all spans arrive at the sampler for its decision", but why restrict it this way? I don't want to sample EVERY call and send EVERY call to Dynatrace just for them to throw away 95% of them; isn't that incredibly wasteful? Not to mention the very point of this issue: my istio-proxy logs are filled nonstop with 503s.

I suggest making the sample rate configurable, just like it is with a non-custom sampler, so consumers can choose what works best for them.
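To make the ask concrete, here's a rough sketch against the snippet above (illustration only; the field names come from that snippet and the exact semantics are not a concrete proposal), showing the behavior I'd like, where a user-supplied percentage still applies even when a custom sampler is configured:

var sampling float64
switch {
case useCustomSampler && spec.RandomSamplingPercentage != nil:
    // Hypothetical behavior requested in this issue: honor the user's percentage
    // as a cap on what reaches the custom sampler, instead of hardcoding 100%.
    sampling = *spec.RandomSamplingPercentage
case useCustomSampler:
    // Current behavior: send everything to the custom sampler.
    sampling = 100
case spec.RandomSamplingPercentage != nil:
    sampling = *spec.RandomSamplingPercentage
default:
    // Fall back to MeshConfig, as today.
    sampling = proxyConfigSamplingValue(proxyCfg)
}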

Affected product area (please put an X in all that apply)

[ ] Ambient
[ ] Docs
[ ] Dual Stack
[ ] Installation
[ ] Networking
[ ] Performance and Scalability
[ X ] Extensions and Telemetry
[ ] Security
[ ] Test and Release
[ ] User Experience
[ ] Developer Infrastructure

Affected features (please put an X in all that apply)

[ ] Multi Cluster
[ ] Virtual Machine
[ ] Multi Control Plane

Additional context

zirain (Member) commented Sep 6, 2024

@joaopgrassi

howardjohn (Member) commented:

Not an expert here, but I think there is some confusion. First, the logs you are seeing (OTLP HTTP exporter received a non-success status code: 503 while exporting the OTLP message) have nothing to do with sampling. They mean we are failing to report to Dynatrace. This is orthogonal to how much we sample -- it seems like your setup is not able to export to Dynatrace at all.

Note that a single one of those calls could be attempting to export thousands of spans at once.

Now, on the 100% sampling: it does not mean 100% of traces will be sampled. It means you are using a custom sampler, which is not (just) percentage based. If we were to set this to something else (say 1%), then we would prefilter to 1% of spans before the custom sampler even sees them. Dynatrace's sampler (and others) expects to see all spans (locally!) so it can decide which ones to sample using more complex algorithms.
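As a rough illustration of why that matters -- this is a generic custom sampler written against the OpenTelemetry Go SDK, not Envoy's actual Dynatrace sampler, which is implemented inside Envoy itself -- a custom sampler is a per-span decision hook, and any prefiltering in the proxy would skew whatever budget or consistency logic it applies:

package main

import (
    "sync"

    sdktrace "go.opentelemetry.io/otel/sdk/trace"
)

// budgetSampler is a toy custom sampler: it keeps roughly the first maxSpans
// spans it sees and drops the rest. It can only make a sensible decision if
// every span reaches ShouldSample; if the proxy prefiltered to 1% beforehand,
// the sampler's view of traffic (and its budget) would be skewed.
type budgetSampler struct {
    mu       sync.Mutex
    seen     int
    maxSpans int
}

func (s *budgetSampler) ShouldSample(p sdktrace.SamplingParameters) sdktrace.SamplingResult {
    s.mu.Lock()
    defer s.mu.Unlock()
    s.seen++
    decision := sdktrace.Drop
    if s.seen <= s.maxSpans {
        decision = sdktrace.RecordAndSample
    }
    return sdktrace.SamplingResult{Decision: decision}
}

func (s *budgetSampler) Description() string { return "budgetSampler" }

func main() {
    // Every span created through this provider passes through ShouldSample
    // before it is exported.
    _ = sdktrace.NewTracerProvider(sdktrace.WithSampler(&budgetSampler{maxSpans: 100}))
}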

jbilliau-rcd (Author) commented Sep 6, 2024

Interesting... ok then, let me take this back to our own monitoring SMEs and Dynatrace and see what they say. Interestingly enough, we do see traces in Dynatrace, so at least SOME of them are getting over there. But we also see these 503s, so it almost seems like it's working sporadically, which makes no sense to me. We are using the exact setup steps from their docs. ¯\_(ツ)_/¯

Thanks @howardjohn, I hope to follow up shortly.
