Schedule and orchestrate Workflows

08/06/2024

Databricks Workflows provides a collection of tools that allow you to schedule and orchestrate data processing tasks on Azure Databricks. You use Databricks Workflows to configure Databricks Jobs.

This article introduces concepts related to managing production workloads using Databricks Jobs.

Note

Delta Live Tables provides a declarative syntax for creating data processing pipelines. See What is Delta Live Tables?.

What are Databricks jobs?

A Databricks job allows you to configure tasks to run in a specified compute environment on a specified schedule. Along with Delta Live Tables pipelines, jobs are the primary tool used on Azure Databricks to deploy data processing and ML logic into production.

Jobs can vary in complexity from a single task running a Databricks notebook to thousands of tasks running with conditional logic and dependencies.

How can I configure and run jobs?

You can create and run a job using the Jobs UI, the Databricks CLI, or by invoking the Jobs API. You can repair and re-run a failed or canceled job using the UI or API. You can monitor job run results using the UI, CLI, API, and notifications (for example, email, webhook destination, or Slack notifications).

To learn about using the Databricks CLI, see What is the Databricks CLI?. To learn about using the Jobs API, see the Jobs API.
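
As a rough illustration of invoking the Jobs API from an external tool, the following Python sketch builds (but does not send) a request that starts a run of an existing job. The workspace URL, token, and job ID are placeholders, and the `/api/2.1/jobs/run-now` path is an assumption based on version 2.1 of the Jobs API; check the Jobs API reference for the version your workspace supports.

```python
import json
import urllib.request


def build_run_now_request(workspace_url: str, token: str, job_id: int) -> urllib.request.Request:
    """Build a POST request asking the Jobs API to start a run of an existing job."""
    body = json.dumps({"job_id": job_id}).encode("utf-8")
    return urllib.request.Request(
        url=f"{workspace_url.rstrip('/')}/api/2.1/jobs/run-now",  # assumed API path
        data=body,
        method="POST",
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
    )


# Placeholder values for illustration only.
request = build_run_now_request(
    "https://adb-1234567890123456.7.azuredatabricks.net",
    "<personal-access-token>",
    123,
)
# To actually send the request: urllib.request.urlopen(request)
```

Building the request separately from sending it makes the call easy to inspect and test before it touches a real workspace.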

What is the minimum configuration needed for a job?

All jobs on Azure Databricks require the following:

Source code that contains logic to be run.
A compute resource to run the logic. The compute resource can be serverless compute, classic jobs compute, or all-purpose compute. See Use Azure Databricks compute with your jobs.
A specified schedule for when the job should be run or a manual trigger.
A unique name.
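
The four requirements above can be sketched as a job definition for the Jobs API. This is a minimal sketch, not a complete specification: the field names (`tasks`, `task_key`, `notebook_task`, `existing_cluster_id`, `schedule`, `quartz_cron_expression`, `timezone_id`) are assumed from version 2.1 of the Jobs API, and the job name, notebook path, and cluster ID are hypothetical.

```python
def minimal_job_settings(name, notebook_path, cluster_id, cron, timezone="UTC"):
    """Return a job definition with a name, source code, compute, and a schedule."""
    return {
        "name": name,  # the job name
        "tasks": [
            {
                "task_key": "main",
                # Source code: a notebook containing the logic to run.
                "notebook_task": {"notebook_path": notebook_path},
                # Compute: an existing cluster (hypothetical ID).
                "existing_cluster_id": cluster_id,
            }
        ],
        # Schedule: a Quartz cron expression; omit this key for manual triggering.
        "schedule": {
            "quartz_cron_expression": cron,
            "timezone_id": timezone,
        },
    }


settings = minimal_job_settings(
    name="nightly-orders-refresh",
    notebook_path="/Workspace/Users/someone@example.com/refresh_orders",
    cluster_id="0123-456789-abcdefgh",
    cron="0 0 6 * * ?",  # every day at 06:00
)
```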

Note

If you develop your code in Databricks notebooks, you can use the Schedule button to configure that notebook as a job. See Create and manage scheduled notebook jobs.

What is a task?

A task represents a unit of logic in a job. Tasks can range in complexity and include the following:

A notebook
A JAR
A SQL query
A Delta Live Tables (DLT) pipeline
Another job
Control flow tasks

You can control the execution order of tasks by specifying dependencies between them. You can configure tasks to run in sequence or parallel.

Jobs interact with state information and metadata of tasks, but task scope is isolated. You can use task values to share context between scheduled tasks. See Share information between tasks in an Azure Databricks job.
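
To see how dependencies determine which tasks run in sequence and which can run in parallel, consider the following sketch. It uses Python's standard `graphlib` module to group a hypothetical task graph into "waves": tasks in the same wave have no dependencies on each other. The task names are invented for illustration, and this models only the ordering, not how Azure Databricks schedules tasks internally.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it depends on.
dependencies = {
    "ingest_orders": set(),
    "ingest_customers": set(),
    "join_tables": {"ingest_orders", "ingest_customers"},
    "publish_report": {"join_tables"},
}


def execution_waves(deps):
    """Group tasks into waves; tasks within one wave can run in parallel."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # tasks whose dependencies are all done
        waves.append(ready)
        sorter.done(*ready)
    return waves


waves = execution_waves(dependencies)
for number, wave in enumerate(waves, start=1):
    print(f"wave {number}: {', '.join(wave)}")
```

Here the two ingest tasks can run in parallel, while `join_tables` and `publish_report` run in sequence after them.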

What control flow options are available for jobs?

When you configure jobs and tasks within jobs, you can customize settings that control how the entire job and individual tasks run.

Trigger types

You must specify a trigger type when you configure a job. You can choose from the following trigger types:

You can also choose to manually trigger your job, but this is mostly reserved for specific use cases such as:

You use an external orchestration tool for triggering jobs using REST API calls.
You have a job that runs rarely and requires a human in the loop for validation or for resolving data quality issues.
You are running a workload that only needs to be run once or a few times, such as a migration.

See Trigger jobs when new files arrive.

Retries

The retries setting specifies how many times a particular job or task should be re-run if it fails with an error message. Errors are often transient and are resolved by a restart. Some features on Azure Databricks, such as schema evolution with Structured Streaming, assume that you run jobs with retries so that the environment resets and the workflow can proceed.

An option for configuring retries appears in the UI for supported contexts. These include the following:

You can specify retries for an entire job, meaning the whole job restarts if any task fails.
You can specify retries for a task, in which case the task restarts up to the specified number of times if it encounters an error.

When running in continuous trigger mode, Databricks automatically retries with exponential backoff. See How are failures handled for continuous jobs?.
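
The general idea of retrying with exponential backoff can be sketched in plain Python. This is an illustration of the technique only: the base delay, the doubling factor, and the helper name `run_with_retries` are assumptions, and the sketch does not reproduce the intervals Azure Databricks actually uses.

```python
import time


def run_with_retries(task, max_retries, base_delay_seconds=1.0, sleep=time.sleep):
    """Call task(); on failure, retry up to max_retries times with doubling delays."""
    attempt = 0
    while True:
        try:
            return task()
        except Exception:
            if attempt >= max_retries:
                raise  # retries exhausted: surface the last error
            sleep(base_delay_seconds * (2 ** attempt))  # 1s, 2s, 4s, ...
            attempt += 1


# Hypothetical task that fails twice with a transient error, then succeeds.
calls = {"count": 0}


def flaky_task():
    calls["count"] += 1
    if calls["count"] < 3:
        raise RuntimeError("transient error")
    return "ok"


delays = []  # record the delays instead of actually sleeping
result = run_with_retries(flaky_task, max_retries=3, sleep=delays.append)
```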

Run if conditional tasks

You can use the Run if task type to specify conditionals for later tasks based on the outcome of other tasks. You add tasks to your job and specify upstream-dependent tasks. Based on the status of those tasks, you can configure one or more downstream tasks to run. Jobs support the following dependencies:

All succeeded
At least one succeeded
None failed
All done
At least one failed
All failed

See Run tasks conditionally in an Azure Databricks job.
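
The dependency conditions listed above can be modeled as simple predicates over the final statuses of the upstream tasks. The following sketch assumes three possible statuses (`succeeded`, `failed`, `skipped`) and reads "None failed" as "no upstream task failed and at least one succeeded"; both are assumptions made for illustration, so consult the linked article for the exact semantics.

```python
def should_run(condition, upstream_statuses):
    """Decide whether a downstream task runs once all upstream tasks have finished."""
    succeeded = [status == "succeeded" for status in upstream_statuses]
    failed = [status == "failed" for status in upstream_statuses]
    rules = {
        "All succeeded": all(succeeded),
        "At least one succeeded": any(succeeded),
        # Assumed reading: nothing failed and at least one task actually ran.
        "None failed": not any(failed) and any(succeeded),
        "All done": True,  # evaluated only after every upstream task has finished
        "At least one failed": any(failed),
        "All failed": all(failed),
    }
    return rules[condition]


# Example: one upstream task succeeded and one failed.
statuses = ["succeeded", "failed"]
run_cleanup = should_run("All done", statuses)
run_alert = should_run("At least one failed", statuses)
run_publish = should_run("All succeeded", statuses)
```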

If/else conditional tasks

You can use the If/else task type to specify conditionals based on some value. See Add branching logic to your job with the If/else condition task.

Jobs support taskValues, which you define within your logic and which let you return the results of some computation or state from a task to the jobs environment. You can define If/else conditions against taskValues, job parameters, or dynamic values.

Azure Databricks supports the following operators for conditionals:

==
!=
>
>=
<
<=
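
The comparisons listed above can be sketched with Python's standard `operator` module. The `row_count` value stands in for a hypothetical task value, and the sketch compares plain Python values; it does not model how Azure Databricks converts or compares the two sides of a condition.

```python
import operator

# Map each supported comparison symbol to the equivalent Python function.
OPERATORS = {
    "==": operator.eq,
    "!=": operator.ne,
    ">": operator.gt,
    ">=": operator.ge,
    "<": operator.lt,
    "<=": operator.le,
}


def evaluate_condition(left, op, right):
    """Evaluate `left op right` for one of the supported comparison symbols."""
    return OPERATORS[op](left, right)


# Hypothetical value returned by an upstream task.
row_count = 0

# Choose which branch of the If/else task would run.
branch = "true" if evaluate_condition(row_count, ">", 0) else "false"
```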

Duration threshold

You can specify a duration threshold to either send a warning or stop a task or job if a specified duration is exceeded. Examples of when you might want to configure this setting include the following:

You have tasks that are prone to getting stuck in a hung state.
You need to warn an engineer if an SLA for a workflow is exceeded.
You want to fail a job configured with a large cluster to avoid unexpected costs.
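
The two outcomes of a duration threshold, warning or stopping, can be sketched as a single check against elapsed time. The function and parameter names here are hypothetical and the threshold values are examples only.

```python
def check_duration(elapsed_seconds, warning_seconds=None, timeout_seconds=None):
    """Return 'stop', 'warn', or 'ok' for a run that has been active for elapsed_seconds."""
    if timeout_seconds is not None and elapsed_seconds > timeout_seconds:
        return "stop"  # the run exceeded the hard limit and should be stopped
    if warning_seconds is not None and elapsed_seconds > warning_seconds:
        return "warn"  # the run is slower than expected: notify someone
    return "ok"


# Example: warn after 30 minutes, stop after 2 hours.
status_at_45_minutes = check_duration(45 * 60, warning_seconds=30 * 60, timeout_seconds=2 * 3600)
status_at_3_hours = check_duration(3 * 3600, warning_seconds=30 * 60, timeout_seconds=2 * 3600)
```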

Concurrency

Most jobs use the default concurrency of 1, meaning that only one run of the job can be active at a time. If a previous job run has not completed by the time a new run should be triggered, the new run is skipped.

There are some use cases for increased concurrency, but most workloads do not require altering this setting.
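
The skipping behavior can be illustrated with a small simulation. This is a simplified model under stated assumptions: every run takes the same fixed time, times are in seconds, and the function name and parameters are hypothetical.

```python
def simulate_triggers(trigger_times, run_duration, max_concurrent_runs=1):
    """Return (started, skipped) trigger times for a fixed run duration."""
    active_end_times = []
    started, skipped = [], []
    for trigger in sorted(trigger_times):
        # Drop runs that have already finished by the time of this trigger.
        active_end_times = [end for end in active_end_times if end > trigger]
        if len(active_end_times) < max_concurrent_runs:
            active_end_times.append(trigger + run_duration)
            started.append(trigger)
        else:
            skipped.append(trigger)  # concurrency limit reached: this run is skipped
    return started, skipped


# A job triggered every 60 seconds whose runs take 90 seconds.
started, skipped = simulate_triggers([0, 60, 120], run_duration=90)
started_two, skipped_two = simulate_triggers([0, 60, 120], run_duration=90, max_concurrent_runs=2)
```

With the default concurrency of 1, the trigger at 60 seconds is skipped because the first run is still active. Raising the limit to 2 lets all three runs start.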

How can I monitor jobs?

You can receive notifications when a job or task starts, completes, or fails. You can send notifications to one or more email addresses or system destinations. See Add email and system notifications for job events.
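
As a sketch of how such notifications might appear in a job definition sent to the Jobs API, the following helper builds an email notification fragment. The `on_start`, `on_success`, and `on_failure` keys are assumed from version 2.1 of the Jobs API, and the addresses are placeholders.

```python
def email_notifications(on_start=(), on_success=(), on_failure=()):
    """Build an email notification fragment, omitting events with no recipients."""
    settings = {
        "on_start": list(on_start),
        "on_success": list(on_success),
        "on_failure": list(on_failure),
    }
    return {event: addresses for event, addresses in settings.items() if addresses}


# Notify the on-call alias only when a run fails.
notifications = email_notifications(on_failure=["data-oncall@example.com"])
job_fragment = {"email_notifications": notifications}
```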

System tables include a lakeflow schema where you can view records related to job activity in your account. See Jobs system table reference.

You can also join the jobs system tables with billing tables to monitor the cost of jobs across your account. See Monitor job costs with system tables.

Limitations

The following limitations exist:

A workspace is limited to 1000 concurrent task runs. A 429 Too Many Requests response is returned when you request a run that cannot start immediately.
The number of jobs a workspace can create in an hour is limited to 10000 (includes “runs submit”). This limit also affects jobs created by the REST API and notebook workflows.
A workspace can contain up to 12000 saved jobs.
A job can contain up to 100 tasks.
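
If you generate job definitions programmatically, a pre-flight check against these limits can catch problems before you call the API. The following sketch covers two of the limits above; the constants come from the list, while the function name and its inputs are hypothetical.

```python
MAX_TASKS_PER_JOB = 100
MAX_SAVED_JOBS_PER_WORKSPACE = 12000


def limit_violations(task_count, saved_jobs_in_workspace):
    """Return messages describing which limits a new job would exceed."""
    violations = []
    if task_count > MAX_TASKS_PER_JOB:
        violations.append(
            f"job has {task_count} tasks; the limit is {MAX_TASKS_PER_JOB}"
        )
    if saved_jobs_in_workspace >= MAX_SAVED_JOBS_PER_WORKSPACE:
        violations.append(
            f"workspace already holds {saved_jobs_in_workspace} saved jobs; "
            f"the limit is {MAX_SAVED_JOBS_PER_WORKSPACE}"
        )
    return violations


problems = limit_violations(task_count=150, saved_jobs_in_workspace=500)
```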

Can I manage workflows programmatically?

Databricks provides tools and APIs that allow you to schedule and orchestrate your workflows programmatically, including the following:

For more information about developer tools, see Developer tools and guidance.

Workflow orchestration with Apache Airflow

You can use Apache Airflow to manage and schedule your data workflows. With Airflow, you define your workflow in a Python file, and Airflow manages scheduling and running the workflow. See Orchestrate Azure Databricks jobs with Apache Airflow.

Workflow orchestration with Azure Data Factory

Azure Data Factory (ADF) is a cloud data integration service that lets you compose data storage, movement, and processing services into automated data pipelines. You can use ADF to orchestrate an Azure Databricks job as part of an ADF pipeline.

To learn how to run a job using the ADF Web activity, including how to authenticate to Azure Databricks from ADF, see Leverage Azure Databricks jobs orchestration from Azure Data Factory.

ADF also provides built-in support to run Databricks notebooks, Python scripts, or code packaged in JARs in an ADF pipeline.

To learn how to run a Databricks notebook in an ADF pipeline, see Run a Databricks notebook with the Databricks notebook activity in Azure Data Factory, followed by Transform data by running a Databricks notebook.

To learn how to run a Python script in an ADF pipeline, see Transform data by running a Python activity in Azure Databricks.

To learn how to run code packaged in a JAR in an ADF pipeline, see Transform data by running a JAR activity in Azure Databricks.
