Azure Data Factory has native support for copying data from REST APIs. There are some pagination rules supported (listed here), but what to do when the API you are using does not support this?
In my example I do not have any information in the header or body of my requests, but rather use the header to set pages. The call is; https://www.strava.com/api/v3/athlete/activities?page=&per_page=. The easiest way to solve this is with a dataset and pipeline variables.
@concat(‘athlete/activities?per_page=’,dataset().perPage,’&page=’, dataset().page)Relative URL for dataset
When setting this for the dataset, you can have a pipeline parameter or variable with the per page value and simply add this to the copy activity. To set the page number you need to do a bit more. Them simple way to do this would naturally be like this;
So my workaround was to create a new temporary variable that was set to the Activites_page value, and then set the Activites_page based on this. Then place this in an until loop.
But how to stop the until loop when there is no more data? Well, you know how much data there should be, so simply use this as the until expression:
@less(activity(‘Copy Activites JSON’).output.rowsCopied,int(variables(‘Activites_perPage’)))Keep copying until the number of rows in the last copy is less then the number of activites per page
5 comments on “Custom pagination when copying data from REST APIs”
Thanks for putting this info together. I am stuck in this situation as well. I read through the details and am trying to work through to set this up on my side, however, I am stuck in a few places. How do the Activities_page and Activities_pagetmp variables relate to the perPage and page variables set in the beginning? Also, getting an error on the Activities_perPage variable that it doesn’t belong to the current pipeline. Can you please help clarify and provide more details on how you have each of the variables setup and the pipeline itself?
There are three variables in this pipeline:
Activities_perPage – this tells how many activities should be returned per page. APIs typically have some limitation for how many rows they can return per page
Activities_page – this is the page number we are asking for
Activities_pagetmp – this is a temporary variable for setting activities_page. We cannot reference activities page when setting this variable, so here is what I do in an Until loop:
The default value for Activities_pagetmp is 0 (for the pipeline)
1. When reading data I set the Activities variable to Activites_pagetmp + 1 (expression; @string(add(int(variables(‘Activites_pagetmp’)),1)) )
2. Then I use the Activities variable in my call, as a dataset variable
3. Finally I set Activities_pagetmp equal to Activities. Then we can loop back to one.
The Until loop has this expression;
@less(activity(‘Copy Activites JSON’).output.rowsCopied,int(variables(‘Activites_perPage’)))
So this checks the number of rows copied in the last run. if this is less than Activities_perPage this means that we collected the last page.
I am trying to implement the same but it is not working for me. Can you share your pipeline settings?
Does the comment I gave to Chris help you?
Use UNTIL functionality it works