
Workflows with CosmosDB throwing 412 errors #8004

Open
joshuadmatthews opened this issue Aug 16, 2024 · 7 comments

Labels: kind/bug (Something isn't working), stale (Issues and PRs without response)

Comments

joshuadmatthews commented Aug 16, 2024

What version of Dapr?

1.13.5

Expected Behavior

Workflows should function correctly

Actual Behavior

{"app_id":"my-service-workflow","instance":"my-service-workflow-c89bdf786-xtvjj","level":"warning","msg":"Workflow actor '56631279-491c-422f-8ac3-ef885ad5a448': execution failed with a recoverable error and will be retried later: 'failed to invoke activity actor '56631279-491c-422f-8ac3-ef885ad5a448::1::1' to execute 'GetSomeData': error from internal actor: error saving reminders partition and metadata: transaction failed due to operation 1 which failed with status code 412'","scope":"dapr.wfengine.backend.actors","time":"2024-08-15T20:54:58.986427073Z","type":"log","ver":"1.13.5"}

Steps to Reproduce the Problem

I'm not quite sure. The same workflow works in my DEV and QA environments, but not in my UAT environment, which appears to be identical. I'm looking for help determining what could cause this.

I am running in an Azure AKS cluster: Kubernetes 1.29.7, Dapr 1.13.5-msft.1 in HA mode. The Cosmos collection has the correct partition key, and I can see workflow records in the collection. Only one instance of my workflow app is deployed.
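
For reference, the state store component backing the workflow actors is shaped roughly like the sketch below (component, database, and container names here are placeholders, not the exact values in use); the container's partition key is /partitionKey, as the Dapr Cosmos DB state store requires:

apiVersion: dapr.io/v1alpha1
kind: Component
metadata:
  name: workflowstatestore   # placeholder name
spec:
  type: state.azure.cosmosdb
  version: v1
  metadata:
    - name: url
      value: https://<cosmos-account>.documents.azure.com:443/
    - name: database
      value: dapr
    - name: collection
      value: workflowstate
    - name: masterKey
      value: "<cosmos-primary-key>"   # placeholder; normally referenced from a secret
    - name: actorStateStore
      value: "true"   # marks this store as the actor state store used by workflows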

joshuadmatthews added the kind/bug label Aug 16, 2024

joshuadmatthews (Author) commented

I have seen #7162, but this is all new greenfield work; I have never used those older versions of Dapr.

yaron2 (Member) commented Aug 16, 2024

This error means that two (or more) concurrent operations are trying to mutate the same state in your Cosmos collection, and one was rejected due to record versioning. It is a retriable error that you can safely retry from your code. However, please note that we completely revamped the actor reminder system in Dapr 1.14; it would be great if you could upgrade to 1.14.1 and enable the Scheduler service with the following configuration:

apiVersion: dapr.io/v1alpha1
kind: Configuration
metadata:
  name: featureconfig
spec:
  features:
    - name: SchedulerReminders
      enabled: true

Then apply the configuration to your app with the following annotation: dapr.io/config: featureconfig.
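
For example, on a standard Kubernetes Deployment that annotation goes on the pod template next to the other Dapr annotations (only the relevant fields are shown; the app name below is taken from the log above and is otherwise a placeholder):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service-workflow
spec:
  template:
    metadata:
      annotations:
        dapr.io/enabled: "true"
        dapr.io/app-id: "my-service-workflow"
        dapr.io/config: "featureconfig"   # points the sidecar at the Configuration resource above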

You should not only see these errors resolved but also get improved performance. Notice that old reminder data will not be moved over.

joshuadmatthews (Author) commented

Yes, I am anxiously awaiting 1.14 hitting the AKS Dapr extension; it's not quite available yet. Great job on the work there!

As for retrying this error, I'm not really sure it will help. I'm not seeing intermittent failures; it fails every time on every workflow. I also thought there must be concurrency somewhere, but I can't find it. There is only one instance of this workflow running, and nothing else is touching that state.

yaron2 (Member) commented Aug 16, 2024

Can you reach out to me on Discord? My handle is yaron2. We can debug there and bring the findings back here.

joshuadmatthews (Author) commented

It seems this was due to the Cosmos DB consistency level combined with multiple regions being enabled. Session consistency was the cause. We switched to Bounded Staleness and that resolved it, although there is a cost associated with that, so it would be good if Session consistency could be supported.

One question I have: with the new reminder scheduling, if we delete and recreate our AKS cluster, will reminders now be lost?

yaron2 (Member) commented Aug 21, 2024

Yes, all existing reminder data will be lost.

dapr-bot (Collaborator) commented

This issue has been automatically marked as stale because it has not had activity in the last 60 days. It will be closed in the next 7 days unless it is tagged (pinned, good first issue, help wanted or triaged/resolved) or other activity occurs. Thank you for your contributions.

dapr-bot added the stale label Oct 20, 2024