One of the principles of the GDPR is that data should be kept only as long as it is needed for the purpose it was gathered for. After that it should be deleted or anonymised. Truly anonymising data is hard, so it might be safest to define retention rules and simply delete the data. Today I discovered a way to perform this automatically with Data Factory.
Data Factory recently released a Delete activity (https://docs.microsoft.com/en-us/azure/data-factory/delete-activity) that can delete datasets in Azure Blob storage, Data Lake Storage (Gen1/Gen2), file systems, FTP, SFTP or Amazon S3. My examples below use Data Lake Gen1, but this should work similarly for the other data stores.
Doing this requires you to have a few things:
- A folder dataset for the folder you wish to delete data from. This should be a dataset with a single parameter for the folder name
- A file dataset for the file you wish to investigate. It is like the one above, but has an additional parameter for the file name (a JSON sketch of this dataset follows below)
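To make this concrete, here is a minimal sketch of what the file dataset could look like in Data Factory's JSON format; the folder dataset is the same minus the fileName parameter. The linked service name and the parameter names are just placeholders, not something the UI will generate for you:

```json
{
  "name": "ADLS_xx_GDPR_Delete_File",
  "properties": {
    "description": "Sketch; linked service and parameter names are placeholders",
    "type": "AzureDataLakeStoreFile",
    "linkedServiceName": {
      "referenceName": "MyAdlsGen1LinkedService",
      "type": "LinkedServiceReference"
    },
    "parameters": {
      "folderName": { "type": "string" },
      "fileName": { "type": "string" }
    },
    "typeProperties": {
      "folderPath": { "value": "@dataset().folderName", "type": "Expression" },
      "fileName": { "value": "@dataset().fileName", "type": "Expression" }
    }
  }
}
```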
Next you can build your pipeline:
This pipeline is quite simple:
- The Get Elements in Folder activity uses the ADLS_xx_GDPR_Delete_Folder dataset with the field list "Child items"
- Set Retention Date sets a variable to the retention threshold: `@formatDateTime(addDays(utcnow(), int(pipeline().parameters.retentionDays)), 'yyyyMMdd')`
- The ForEach loop iterates over all child items: `@activity('Get Elements in Folder').output.childItems` (a JSON sketch of the first two activities follows below)
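For reference, the first two activities could be defined roughly as below in the pipeline JSON. I'm assuming a pipeline parameter folderName and a string variable RetentionDate; both names are mine, so adjust to your own:

```json
[
  {
    "name": "Get Elements in Folder",
    "type": "GetMetadata",
    "typeProperties": {
      "dataset": {
        "referenceName": "ADLS_xx_GDPR_Delete_Folder",
        "type": "DatasetReference",
        "parameters": {
          "folderName": { "value": "@pipeline().parameters.folderName", "type": "Expression" }
        }
      },
      "fieldList": [ "childItems" ]
    }
  },
  {
    "name": "Set Retention Date",
    "type": "SetVariable",
    "dependsOn": [
      { "activity": "Get Elements in Folder", "dependencyConditions": [ "Succeeded" ] }
    ],
    "typeProperties": {
      "variableName": "RetentionDate",
      "value": {
        "value": "@formatDateTime(addDays(utcnow(), int(pipeline().parameters.retentionDays)), 'yyyyMMdd')",
        "type": "Expression"
      }
    }
  }
]
```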
Inside the ForEach loop, a Set Variable activity sets the file name (`@item().name`), and a Get Metadata activity uses the ADLS_xx_GDPR_Delete_File dataset to look up the last modified date of the file with that name in the folder. The If Condition then checks whether the last modified date is greater than the retention threshold. If true, a Wait activity waits 1 second (i.e. does nothing); if false, a Delete activity deletes the file.
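Assuming the Get Metadata activity inside the loop is called Get File Metadata (my name, not a given) and returns the lastModified field, the If Condition expression could look like this sketch. Formatting both sides as yyyyMMdd strings makes greater() a simple string comparison that sorts chronologically:

```json
{
  "name": "Check Retention",
  "type": "IfCondition",
  "typeProperties": {
    "expression": {
      "value": "@greater(formatDateTime(activity('Get File Metadata').output.lastModified, 'yyyyMMdd'), variables('RetentionDate'))",
      "type": "Expression"
    }
  }
}
```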
A word of warning from the documentation: deleted files or folders cannot be restored, so be cautious. But this pipeline can help you set up automatic data retention rules at folder level by utilising triggers and parameters.
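As a final sketch, a daily schedule trigger passing in the parameters might look like the following; all names and values here are examples. Note that with the Set Retention Date expression above, a 90-day retention period would be passed as retentionDays = -90, since addDays adds the (negative) number of days to the current date to compute the threshold:

```json
{
  "name": "DailyRetentionTrigger",
  "properties": {
    "type": "ScheduleTrigger",
    "typeProperties": {
      "recurrence": {
        "frequency": "Day",
        "interval": 1,
        "startTime": "2019-03-01T02:00:00Z",
        "timeZone": "UTC"
      }
    },
    "pipelines": [
      {
        "pipelineReference": {
          "referenceName": "GDPR_Delete_Pipeline",
          "type": "PipelineReference"
        },
        "parameters": {
          "folderName": "/raw/customerdata",
          "retentionDays": "-90"
        }
      }
    ]
  }
}
```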