We recently set up Azure DevOps for deploying Data Factory, and I thought I’d share some practical tips on how to do this. There are other great resources out there for this, but I’ll try to put it all together for you here and add a few bits I have found myself. This post describes the first part, which is setting up your environment. The second part covers development, and the third part covers setting up and running deployment.
To set this up I have created three resource groups: dev, test and shared. These all have a KeyVault and a Data Factory. Dev and test will have other services, which could be storage accounts, databases, data lakes, Databricks or similar. One general rule of thumb: use the same environment designation everywhere, like datahelge-dev-rg, datahelgedevdatafactory and datahelgedevkeyvault; don’t mix datahelge-dev-rg, datahelgedevelopmentdatafactory and datahelgedev01keyvault. This applies to all environments. You will see why in part 3…
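As a sketch, the resource groups and consistently named resources could be created with the Azure CLI. All names and the location below are just the example names from above, not requirements, and the `echo` prefix makes the commands a dry run; remove it to actually execute them.

```shell
# Hypothetical names following the consistent "dev" designation.
PREFIX=datahelge
ENV=dev
LOCATION=westeurope

RG="${PREFIX}-${ENV}-rg"
DF="${PREFIX}${ENV}datafactory"
KV="${PREFIX}${ENV}keyvault"

# Dry run: the echo prefix just prints the commands; remove it to execute.
echo az group create --name "$RG" --location "$LOCATION"
echo az keyvault create --name "$KV" --resource-group "$RG" --location "$LOCATION"
echo az datafactory create --name "$DF" --resource-group "$RG" --location "$LOCATION"  # needs: az extension add --name datafactory
```

Repeat with `ENV=test` (and a shared group for the shared resources) to get the full layout, and the deployment scripts in part 3 can then derive every resource name from the environment designation alone.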
Shared Data Factory
The shared Data Factory is there for one use: self-hosted integration runtimes. This is the component you will use to connect to on-premises sources or other sources that restrict access, for example through IP restrictions or other firewall rules. Migrating a self-hosted integration runtime is not supported, but you can share the same integration runtime across different Data Factories. You can find a description of how to do this in this article.
Shared KeyVault
The shared KeyVault is for storing secrets that are not environment specific. An example would be credentials for source systems when you connect to the same source environments regardless of dev/test/production. The advantage of keeping these in a shared KeyVault is that you don’t have to manage and deploy them across different environments. This KeyVault must have access policies for all your Data Factories.
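Granting those access policies can be scripted. A minimal sketch, assuming hypothetical names: each Data Factory’s managed identity (its principal ID can be read with `az datafactory show ... --query identity.principalId`) is given permission to read secrets from the shared vault. The `echo` prefix keeps this a dry run.

```shell
# Hypothetical names; each Data Factory's managed identity gets read access
# to the shared KeyVault. Repeat once per factory (dev, test, production).
SHARED_KV=datahelgesharedkeyvault
DF_PRINCIPAL_ID=00000000-0000-0000-0000-000000000000  # placeholder principal ID

# Dry run: remove the echo prefix to execute.
echo az keyvault set-policy --name "$SHARED_KV" \
  --object-id "$DF_PRINCIPAL_ID" --secret-permissions get list
```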
Development Data Factory
This Data Factory must be set up with versioning in GitHub or Azure DevOps. Changes should be made through branching and pull requests, which we will return to later. When adding connections to your Data Factory you should keep the naming independent of environment, meaning that you should not have a BIStorageAccountDev, just a BIStorageAccount. In dev this points to the dev storage account, but deployment will repoint it to the correct storage account for the target environment. It is easy to replace the connection string, but not to change the linked service name. In the same way, any KeyVault secrets should follow the same pattern, so that environment specific secrets are read from an environment specific KeyVault. In general I would advise utilising Managed Service Identity when possible, for instance when connecting to Data Lakes or Azure SQL. Environment specific resources should only allow access from the Data Factory in the matching environment.
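As an illustration, an environment independent linked service definition might look like the JSON below. BIStorageAccount, EnvironmentKeyVault and the secret name are assumed names; the point is that the connection string is resolved from the environment’s KeyVault at runtime, so deployment only has to repoint the KeyVault reference, never rename the linked service.

```json
{
    "name": "BIStorageAccount",
    "properties": {
        "type": "AzureBlobStorage",
        "typeProperties": {
            "connectionString": {
                "type": "AzureKeyVaultSecret",
                "store": {
                    "referenceName": "EnvironmentKeyVault",
                    "type": "LinkedServiceReference"
                },
                "secretName": "BIStorageAccountKey"
            }
        }
    }
}
```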
Development KeyVault
This KeyVault stores environment specific secrets, such as BIStorageAccountKey. It should therefore only have an access policy for the dev Data Factory. Secrets should not have environment specific names.
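In other words, the same secret name exists in every environment’s vault and only the value differs. A dry-run sketch with the hypothetical names from earlier (`echo` prefix again; the placeholder values would of course be the real keys):

```shell
# Same secret name in every environment's vault; only the value differs.
SECRET_NAME=BIStorageAccountKey

# Dry run: remove the echo prefix to execute.
echo az keyvault secret set --vault-name datahelgedevkeyvault \
  --name "$SECRET_NAME" --value "<dev-storage-account-key>"
echo az keyvault secret set --vault-name datahelgetestkeyvault \
  --name "$SECRET_NAME" --value "<test-storage-account-key>"
```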
Test Data Factory
This Data Factory does not need any versioning with GitHub or Azure DevOps; it is only changed by deployment from development.
Test KeyVault
This KeyVault should have the same secrets as development, but pointing to test resources. So when a connection is migrated from development to test, and therefore points to the test KeyVault, it finds the required secrets under the same names as in development.
Difference between test and production?
There isn’t any when it comes to setup. The only difference would typically be that triggers are activated in production, but not in test.
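If you want to script that difference, the datafactory CLI extension can start triggers as a post-deployment step in production only. All names below are assumptions, and the `echo` prefix keeps it a dry run:

```shell
# Hypothetical names; start a trigger only when deploying to production.
RG=datahelge-prod-rg
DF=datahelgeproddatafactory
TRIGGER=DailyLoadTrigger   # assumed trigger name

# Dry run: remove the echo prefix to execute.
echo az datafactory trigger start --resource-group "$RG" \
  --factory-name "$DF" --name "$TRIGGER"
```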
With everything set up it’s time to start developing!