
Workflows with CosmosDB throwing 412 errors #8004

Open
joshuadmatthews opened this issue Aug 16, 2024 · 7 comments

Labels: kind/bug (Something isn't working), stale (Issues and PRs without response)

Comments

joshuadmatthews commented Aug 16, 2024

What version of Dapr?

1.13.5

Expected Behavior

Workflows should function correctly

Actual Behavior

{"app_id":"my-service-workflow","instance":"my-service-workflow-c89bdf786-xtvjj","level":"warning","msg":"Workflow actor '56631279-491c-422f-8ac3-ef885ad5a448': execution failed with a recoverable error and will be retried later: 'failed to invoke activity actor '56631279-491c-422f-8ac3-ef885ad5a448::1::1' to execute 'GetSomeData': error from internal actor: error saving reminders partition and metadata: transaction failed due to operation 1 which failed with status code 412'","scope":"dapr.wfengine.backend.actors","time":"2024-08-15T20:54:58.986427073Z","type":"log","ver":"1.13.5"}

Steps to Reproduce the Problem

I'm not quite sure. The same workflow works in my DEV and QA environments, but not in my UAT environment, which appears to be identical. I'm looking for help determining what could cause this.

I am running in an Azure AKS cluster: Kubernetes 1.29.7, Dapr 1.13.5-msft.1 in HA mode. The Cosmos collection has the correct partition key, and I can see workflow records in the collection. Only one instance of my workflow app is deployed.
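
For reference, the state store component backing the workflow actors is shaped roughly like the sketch below (component, database, and container names here are placeholders, not the exact values in use); the container's partition key is /partitionKey, as the Dapr Cosmos DB state store requires:

apiVersion: dapr.io/v1alpha1
kind: Component
metadata:
  name: workflowstatestore   # placeholder name
spec:
  type: state.azure.cosmosdb
  version: v1
  metadata:
    - name: url
      value: https://<cosmos-account>.documents.azure.com:443/
    - name: database
      value: dapr
    - name: collection
      value: workflowstate
    - name: masterKey
      value: "<cosmos-primary-key>"   # placeholder; normally referenced from a secret
    - name: actorStateStore
      value: "true"   # marks this store as the actor state store used by workflows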

joshuadmatthews added the kind/bug label Aug 16, 2024

joshuadmatthews (Author) commented

I have seen #7162, but this is all new greenfield work; I have never used those older versions of Dapr.

yaron2 (Member) commented Aug 16, 2024

This error means that two (or more) concurrent operations are trying to mutate the same state in your Cosmos collection, and one was rejected due to record versioning. It is a retriable error that you can safely retry from your code. However, please note that we completely revamped the actor reminder system in Dapr 1.14; it would be great if you could upgrade to 1.14.1 and enable the Scheduler service with the following configuration:

apiVersion: dapr.io/v1alpha1
kind: Configuration
metadata:
  name: featureconfig
spec:
  features:
    - name: SchedulerReminders
      enabled: true

Then apply the configuration to your app with the following annotation: dapr.io/config: featureconfig.
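
For example, on a standard Kubernetes Deployment that annotation goes on the pod template next to the other Dapr annotations (only the relevant fields are shown; the app name below is taken from the log above and is otherwise a placeholder):

apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-service-workflow
spec:
  template:
    metadata:
      annotations:
        dapr.io/enabled: "true"
        dapr.io/app-id: "my-service-workflow"
        dapr.io/config: "featureconfig"   # points the sidecar at the Configuration resource above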

You should not only see these errors resolved but also get improved performance. Notice that old reminder data will not be moved over.

joshuadmatthews (Author) commented

Yes, I am anxiously awaiting 1.14 hitting the AKS Dapr extension; it's not quite available yet. Great job on the work there!

As for retrying this error, I'm not really sure it will help. I'm not seeing intermittent failures; it fails every time on every workflow. I also thought there must be concurrency somewhere, but I can't find it. There is only one instance of this workflow running, and nothing else is touching that state.

yaron2 (Member) commented Aug 16, 2024

Can you reach out to me on Discord? My handle is yaron2. We can debug there and bring the findings back here.

joshuadmatthews (Author) commented

It seems this was due to the Cosmos DB consistency level combined with multiple regions being enabled. Session consistency was the cause. We switched to Bounded Staleness and that resolved it, although there is a cost associated with that, so it would be good if Session consistency could be supported.

One question I have: with the new reminder scheduling, if we delete and recreate our AKS cluster, will reminders now be lost?

yaron2 (Member) commented Aug 21, 2024

Yes, all existing reminder data will be lost.

dapr-bot (Collaborator) commented

This issue has been automatically marked as stale because it has not had activity in the last 60 days. It will be closed in the next 7 days unless it is tagged (pinned, good first issue, help wanted or triaged/resolved) or other activity occurs. Thank you for your contributions.

dapr-bot added the stale label Oct 20, 2024