When it comes to building a robust foundation for ETL (Extract, Transform, Load) pipelines, Azure Data Factory (or Azure Synapse Analytics), Azure Batch, and Azure Storage form an indispensable trio. Together, these services enable efficient data movement, transformation, and processing across diverse data sources.
This document provides a comprehensive guide on authenticating to Azure Batch with the Synapse system-assigned managed identity (SAMI) and to Azure Storage with a user-assigned managed identity (UAMI). This enables identity-based connectivity to Storage for data extraction, and it allows Custom activities to run compute-intensive (HPC-style) processing on the extracted data.
The key enabler of these capabilities is the Synapse pipeline. Serving as the primary orchestrator, the Synapse pipeline integrates the various Azure resources in a secure manner. Everything shown here applies equally to Azure Data Factory (ADF), providing a broader scope of data management and transformation.
Through this guide, you will gain insights into leveraging these powerful Azure services to optimize your data processing workflows.
This procedure uses several different services; more details on each are given below. The goal is the following:
Run an ADF/Synapse pipeline that pulls a script located in a Storage Account and executes it on the Batch nodes, using a User-Assigned Managed Identity (UAMI) to authenticate to Storage and the System-Assigned Managed Identity (SAMI) to authenticate to Batch.
In this procedure we will walk step by step through the following actions:
1. In your Synapse Portal, go to Manage -> Credentials -> New, fill in the details, and click Create.
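If you keep your Synapse artifacts in Git, the credential created in Step 1 is stored as JSON. A rough sketch of the resulting definition is shown below; the credential name and the resource ID path are placeholders for your own values.

```json
{
  "name": "uami-credential",
  "properties": {
    "type": "ManagedIdentity",
    "typeProperties": {
      "resourceId": "/subscriptions/<subscription-id>/resourceGroups/<resource-group>/providers/Microsoft.ManagedIdentity/userAssignedIdentities/<uami-name>"
    }
  }
}
```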
2. In your Synapse Portal, go to Manage -> Linked Services -> New -> Azure Blob Storage -> Continue and complete the form:
a. Authentication type: UAMI
b. Azure subscription: choose your subscription
c. Storage Account name: choose the account where the script to be used is stored
d. Credentials: choose the credential created in Step 1
e. Click Create
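The linked service from Step 2 can also be defined as JSON. The sketch below assumes illustrative names ("LS_AzureBlobStorage" for the linked service and "uami-credential" for the Step 1 credential) and a placeholder storage account; the key point is that the credential reference selects the UAMI.

```json
{
  "name": "LS_AzureBlobStorage",
  "properties": {
    "type": "AzureBlobStorage",
    "typeProperties": {
      "serviceEndpoint": "https://<storage-account>.blob.core.windows.net/",
      "credential": {
        "referenceName": "uami-credential",
        "type": "CredentialReference"
      }
    }
  }
}
```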
3. In the Azure Portal, go to your Batch Account -> Keys and copy the Batch account name and account endpoint for use in the next step; also copy the pool name to be used for this example.
4. In your Synapse Portal, go to Manage -> Linked Services -> New -> Azure Batch -> Continue and fill in the information:
a. Authentication method: SAMI (copy the managed identity name to be used later)
b. Account name, Batch URL, and Pool name: paste the values copied in Step 3
c. Storage linked service name: choose the linked service created in Step 2
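For reference, the Azure Batch linked service from Step 4 serializes to JSON along the lines of the sketch below. Names and values are placeholders; when SAMI authentication is selected, no access key appears in the definition, and the nested linked service reference points at the Storage linked service from Step 2 (assumed here to be named "LS_AzureBlobStorage").

```json
{
  "name": "LS_AzureBatch",
  "properties": {
    "type": "AzureBatch",
    "typeProperties": {
      "accountName": "<batch-account-name>",
      "batchUri": "https://<batch-account>.<region>.batch.azure.com",
      "poolName": "<pool-name>",
      "linkedServiceName": {
        "referenceName": "LS_AzureBlobStorage",
        "type": "LinkedServiceReference"
      }
    }
  }
}
```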
5. Publish all your changes
6. In the Azure Portal, go to your Storage Account -> Access Control (IAM).
a. Click Add, then Add role assignment; search for "Storage Blob Data Contributor" and click Next.
b. Choose Managed identity, select your UAMI, click Select, then click Next, Next, and Review + assign.
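If you automate your deployments, the same role assignment can be expressed as an ARM template resource. In this sketch the principal ID is a placeholder for your UAMI's principal ID, and the GUID is the built-in role definition ID for Storage Blob Data Contributor.

```json
{
  "type": "Microsoft.Authorization/roleAssignments",
  "apiVersion": "2022-04-01",
  "name": "[guid(resourceGroup().id, 'uami-blob-data-contributor')]",
  "scope": "[resourceId('Microsoft.Storage/storageAccounts', '<storage-account>')]",
  "properties": {
    "roleDefinitionId": "[subscriptionResourceId('Microsoft.Authorization/roleDefinitions', 'ba92f5b4-2d11-453d-a403-e96b0029c9fe')]",
    "principalId": "<uami-principal-id>",
    "principalType": "ServicePrincipal"
  }
}
```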
7. In the Azure Portal, go to your Batch Account -> Access Control (IAM).
a. Click Add, then Add role assignment.
b. Click the "Privileged administrator roles" tab, choose the Contributor role, and click Next.
c. Choose Managed identity, and under Managed identity look for "Synapse workspace"; choose the same SAMI added in Step 4a, then click Select, Next, Next, and Review + assign.
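The Batch-side assignment can likewise be templated. Here the principal ID placeholder is the Synapse workspace SAMI's principal ID, and the GUID is the built-in role definition ID for Contributor.

```json
{
  "type": "Microsoft.Authorization/roleAssignments",
  "apiVersion": "2022-04-01",
  "name": "[guid(resourceGroup().id, 'synapse-batch-contributor')]",
  "scope": "[resourceId('Microsoft.Batch/batchAccounts', '<batch-account>')]",
  "properties": {
    "roleDefinitionId": "[subscriptionResourceId('Microsoft.Authorization/roleDefinitions', 'b24988ac-6180-42a0-ab88-20f7382dd24c')]",
    "principalId": "<synapse-workspace-sami-principal-id>",
    "principalType": "ServicePrincipal"
  }
}
```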
If your Batch pool does not exist yet, create it in your Batch account first.
8. If you already have a Batch pool created, follow the next steps:
a. In the Azure Portal, go to your Batch Account -> Pools -> choose your pool -> go to Identity.
b. Click Add, choose the necessary UAMI (in this example, the one used by the Synapse linked service for Storage plus another one used for other integrations), and click Add.
Important: if your Batch pool uses multiple UAMIs (for example, to connect to Key Vault or other services), you must first remove the existing one and then add all of them together.
c. Then, scale the pool in and out so the changes are applied to the nodes.
9. In your Synapse Portal, go to Integrate -> Add New Resource -> Pipeline
10. In the Activities panel on the right, expand Batch Service and drag and drop a Custom activity onto the canvas.
11. In the Azure Batch tab of the Custom activity's details, select the Azure Batch linked service created in Step 4 and test the connection (if you receive a connection error, please go to Troubleshooting scenario 1).
12. Then go to the Settings tab and add your script. For this example, we will use a PowerShell script previously uploaded to a Storage blob container, sending the output to a txt file.
a. Command: your script invocation details
b. Resource linked service: the Storage linked service configured in Step 2
c. Browse storage: locate the container where your script was uploaded
d. Publish your changes and run a Debug
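Inside the pipeline JSON, the Custom activity configured above looks roughly like the sketch below. The activity name, command line, linked service names ("LS_AzureBatch", "LS_AzureBlobStorage"), script file name, and container path are illustrative placeholders for the values you chose in the previous steps.

```json
{
  "name": "RunPowerShellScript",
  "type": "Custom",
  "linkedServiceName": {
    "referenceName": "LS_AzureBatch",
    "type": "LinkedServiceReference"
  },
  "typeProperties": {
    "command": "powershell -ExecutionPolicy Bypass -File myscript.ps1",
    "resourceLinkedService": {
      "referenceName": "LS_AzureBlobStorage",
      "type": "LinkedServiceReference"
    },
    "folderPath": "scripts"
  }
}
```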
13. Check the Synapse job logs and outputs:
a. Copy the Activity Run ID.
b. Then, in the Azure Portal, go to your Storage Account -> Containers -> adfjobs -> select the folder named with the activity run ID -> output.
c. Here you will find two files, "stderr.txt" and "stdout.txt"; both contain the errors or outputs of the commands executed during the task.
14. Check the Batch logs and outputs. You can get to the Batch logs in different ways:
a. Via nodes: in the Azure Portal, go to your Batch Account -> Pools -> choose your pool -> Nodes -> then, in the folder details, open the folder for this Synapse execution -> job-x -> look for the activity run ID.
b. Via jobs: in the Azure Portal, go to your Batch Account -> Jobs -> select the job named adfv2-yourPoolName -> click the task whose ID matches the Activity Run ID of the Synapse pipeline copied in the previous step.
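The log locations described above follow a predictable layout in the adfjobs container. As a minimal, unofficial sketch (the function name, storage account, and run ID below are hypothetical), a small helper can build the blob URLs for a given activity run ID:

```python
def adfjobs_log_urls(storage_account: str, activity_run_id: str) -> dict:
    """Build the stdout/stderr blob URLs for a Custom activity run,
    following the adfjobs/<activity-run-id>/output layout described above.
    """
    base = (
        f"https://{storage_account}.blob.core.windows.net/"
        f"adfjobs/{activity_run_id}/output"
    )
    # Both log files live side by side under the output folder.
    return {name: f"{base}/{name}.txt" for name in ("stdout", "stderr")}

# Example with placeholder values:
urls = adfjobs_log_urls("mystorageacct", "0123abcd-0000-0000-0000-000000000000")
print(urls["stdout"])
```

This can be handy when fetching logs programmatically instead of clicking through the portal.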
During this walkthrough we have learned how to connect a Synapse/ADF pipeline to Azure Batch and Azure Storage using managed identities, run a Custom activity that executes a script from a Storage Account on Batch nodes, and locate the resulting logs and outputs.
If you have any questions or feedback, please leave a comment below!