Unexpected cluster termination

Learn how to troubleshoot a Databricks cluster that stopped unexpectedly.

Written by Adam Pavlacka

Last published at: March 4th, 2022

Sometimes a cluster is terminated unexpectedly, not as a result of a manual termination or a configured automatic termination. A cluster can be terminated for many reasons. Some terminations are initiated by Databricks and others are initiated by the cloud provider. This article describes termination reasons and steps for remediation.

Databricks initiated request limit exceeded

To defend against API abuses, ensure quality of service, and prevent you from accidentally creating too many large clusters, Databricks throttles all cluster up-sizing requests, including cluster creation, starting, and resizing. The throttling uses the token bucket algorithm to limit the total number of nodes that anyone can launch over a defined interval across your Databricks deployment, while allowing burst requests of certain sizes. Requests coming from both the web UI and the APIs are subject to rate limiting. When cluster requests exceed rate limits, the limit-exceeding request fails with a REQUEST_LIMIT_EXCEEDED error.

Solution

If you hit the limit for your legitimate workflow, Databricks recommends that you do the following:

  • Retry your request a few minutes later.
  • Spread out your recurring workflow evenly in the planned time frame. For example, instead of scheduling all of your jobs to run at an hourly boundary, try distributing them at different intervals within the hour.
  • Consider using clusters with a larger node type and smaller number of nodes.
  • Use autoscaling clusters.

If these options don’t work for you, contact Databricks Support to request a limit increase for the core instance.

For other Databricks initiated termination reasons, see Termination Code.

Cloud provider initiated terminations

This article lists common cloud provider related termination reasons and remediation steps.

AWS

Provider limit

Databricks launches a cluster by requesting resources on behalf of your cloud account. Sometimes, these requests fail because they would exceed your cloud account’s resource limits. In AWS, common error codes include:

InstanceLimitExceeded

AWS limits the number of running instances for each node type. Possible solutions include:

  • Request a cluster with fewer nodes.
  • Request a cluster with a different node type.
  • Ask AWS support to increase instance limits.

Client.VolumeLimitExceeded

The cluster creation request exceeded the EBS volume limit. AWS has two types of volume limits: a limit on the total number of EBS volumes, and a limit on the total storage size of EBS volumes. Potential remediation steps:

  • Request a cluster with fewer nodes.
  • Check which of the two limits was exceeded. (AWS trusted advisor shows service limits for free). If the request exceeded the total number of EBS volumes, try reducing the requested number of volumes per node. If the request exceeded the total EBS storage size, try reducing the requested storage size and/or the number of EBS volumes.
  • Ask AWS support to increase EBS volume limits.

RequestLimitExceeded

AWS limits the rate of API requests made for an AWS account. Wait a while before retrying the request.

Provider shutdown

The Spark driver is a single point of failure because it holds all cluster state. If the instance hosting the driver node is shut down, Databricks terminates the cluster. In AWS, common error codes include:

Client.UserInitiatedShutdown

Instance was terminated by a direct request to AWS which did not originate from Databricks. Contact your AWS administrator for more details.

Server.InsufficientInstanceCapacity

AWS could not satisfy the instance request. Wait a while and retry the request. Contact AWS support if the problem persists.

Server.SpotInstanceTermination

Instance was terminated by AWS because the current spot price has exceeded the maximum bid made for this instance. Use an on-demand instance for the driver, choose a different availability zone, or specify a higher spot bid price.

For other shutdown-related error codes, refer to AWS docs.

Delete

Launch failure

AWS

In AWS, common error codes include:

UnauthorizedOperation

Databricks was not authorized to launch the requested instances. Possible reasons include:

  • Your AWS administrator invalidated the AWS access key or IAM role used to launch instances.
  • You are trying to launch a cluster using an IAM role that Databricks does not have permission to use. Contact the AWS administrator who set up the IAM role. For more information, see Secure Access to S3 Buckets Using IAM Roles.

Unsupported with message “EBS-optimized instances are not supported for your requested configuration”

The selected instance type is not available in the selected availability zone (AZ). It does not actually have anything to do with EBS-optimization being enabled. To remediate, you can choose a different instance type or AZ.

AuthFailure.ServiceLinkedRoleCreationNotPermitted

The provided credentials do not have permission to create the service-linked role for EC2 spot instances. The Databricks administrator needs to update the credentials used to launch instances in your account. Instructions and the updated policy can be found AWS Account.

See Error Codes for a complete list of AWS error codes.

Delete

Azure

This termination reason occurs when Azure Databricks fails to acquire virtual machines. The error code and message from the API are propagated to help you troubleshoot the issue.

OperationNotAllowed

You have reached a quota limit, usually number of cores, that your subscription can launch. Request a limit increase in Azure portal. See Azure subscription and service limits, quotas, and constraints.

PublicIPCountLimitReached

You have reached the limit of the public IPs that you can have running. Request a limit increase in Azure Portal.

SkuNotAvailable

The resource SKU you have selected (such as VM size) is not available for the location you have selected. To resolve, see Resolve errors for SKU not available.

ReadOnlyDisabledSubscription

Your subscription was disabled. Follow the steps in Why is my Azure subscription disabled and how do I reactivate it? to reactivate your subscription.

ResourceGroupBeingDeleted

Can occur if someone cancels your Azure Databricks workspace in the Azure portal and you try to create a cluster at the same time. The cluster fails because the resource group is being deleted.

SubscriptionRequestsThrottled

Your subscription is hitting the Azure Resource Manager request limit (see Throttling Resource Manager requests). Typical cause is that another system outside Azure Databricks) making a lot of API calls to Azure. Contact Azure support to identify this system and then reduce the number of API calls.

Delete

Communication lost

Databricks was able to launch the cluster, but lost the connection to the instance hosting the Spark driver.

AWS

Caused by an incorrect networking configuration (for example, changing security group settings for Databricks workers) or a transient AWS networking issue.

Delete

Azure

Caused by the driver virtual machine going down or a networking issue.

Delete