Automating data retention with Data Factory

One of the principles of GDPR is that data should be kept only as long as it is needed for the purpose it was gathered for. After that it should be deleted or anonymised. Truly anonymising data is hard, so it might be safest to define retention rules and then delete the data. Today I discovered a way to do this automatically with Data Factory.

Data Factory recently released a Delete activity (https://docs.microsoft.com/en-us/azure/data-factory/delete-activity) that can delete files or folders in Azure Blob Storage, Data Lake (Gen1/Gen2), File System, FTP, SFTP or Amazon S3. My examples below use Data Lake Gen1, but this should work similarly for the other data stores.

Doing this requires you to have a few things:

  • A folder dataset for the folder you wish to delete data from. This should be a dataset with one parameter for the folder name.
  • A file dataset for the file you wish to inspect. It is the same as the one above, but with an additional parameter for fileName.

Example dataset for folders: just add a Linked Service and a parameter for the folderPath.

Next you can build your pipeline:

The pipeline has two parameters: folderPath and retentionDays. This means the retention period can be set per folder.

This pipeline is quite simple:

  • Get Elements in Folder (a Get Metadata activity) uses the ADLS_xx_GDPR_Delete_Folder dataset with the field list "Child Items".
  • Set Retention Date sets a variable to the retention threshold: @formatDateTime(addDays(utcnow(), int(pipeline().parameters.retentionDays)), 'yyyyMMdd'). Note that addDays adds to the current time, so retentionDays must be passed as a negative number (for example -30) for the threshold to lie in the past.
  • The For Each loop goes through all child items: @activity('Get Elements in Folder').output.childItems
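
To make the sign of retentionDays concrete, here is a minimal Python sketch of what the Set Retention Date expression computes. It is only an illustration, not part of the pipeline: the function name retention_threshold and the fixed date are assumptions made for this example.

```python
from datetime import datetime, timedelta, timezone


def retention_threshold(now: datetime, retention_days: int) -> str:
    """Rough Python equivalent of
    formatDateTime(addDays(utcnow(), retentionDays), 'yyyyMMdd')."""
    return (now + timedelta(days=retention_days)).strftime("%Y%m%d")


# Hypothetical fixed "now" so the result is reproducible.
now = datetime(2019, 3, 15, 12, 0, tzinfo=timezone.utc)

past_threshold = retention_threshold(now, -30)   # negative: 30 days ago
future_threshold = retention_threshold(now, 30)  # positive: 30 days ahead

print(past_threshold)
print(future_threshold)
```

With a positive value the threshold ends up in the future, so every file would count as older than the threshold. That is worth checking before the pipeline is allowed to delete anything.
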
Activities in For Each

Inside the For Each loop, the file name is set from @item().name, and the ADLS_xx_GDPR_Delete_File dataset is used to look up the last modified date of the file with that name in the folder. The If Condition expression then checks whether the last modified date is later than the retention threshold. If true, wait 1 second (do nothing); if false, use a Delete activity to delete the file.
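
The decision made for each file can be sketched in a few lines of Python. This is a simulation of the logic only, assuming both dates are compared as yyyyMMdd strings; the file names, timestamps and the should_delete helper are made up for the example.

```python
from datetime import datetime, timezone


def should_delete(last_modified: datetime, threshold: str) -> bool:
    """Delete unless the last modified date is later than the threshold.

    Both sides are compared as yyyyMMdd strings, which sort chronologically.
    """
    return not (last_modified.strftime("%Y%m%d") > threshold)


threshold = "20190213"  # hypothetical retention threshold

# Hypothetical child items with their last modified timestamps.
child_items = {
    "old_export.csv": datetime(2019, 1, 5, tzinfo=timezone.utc),
    "on_threshold.csv": datetime(2019, 2, 13, tzinfo=timezone.utc),
    "recent_export.csv": datetime(2019, 3, 1, tzinfo=timezone.utc),
}

to_delete = [name for name, modified in child_items.items()
             if should_delete(modified, threshold)]
print(to_delete)
```

Because the condition is "greater than", a file last modified exactly on the threshold date falls into the false branch and is deleted.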

A word of warning from the documentation: deleted files or folders cannot be restored, so be cautious. Still, this pipeline can help you set up automatic data retention rules at folder level by using triggers and parameters.
