Custom pagination when copying data from REST APIs

Azure Data Factory has native support for copying data from REST APIs. A number of pagination rules are supported (listed here), but what do you do when the API you are using does not support any of them?

In my example there is no paging information in the header or body of the response; instead, the page is set through query parameters in the URL. The call is: https://www.strava.com/api/v3/athlete/activities?page=&per_page=. The easiest way to solve this is with dataset parameters and pipeline variables.

@concat('athlete/activities?per_page=',dataset().perPage,'&page=', dataset().page)

Relative URL for dataset
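Outside Data Factory, the same relative URL is plain string concatenation. A minimal Python sketch, where the per_page and page values are placeholders mirroring the dataset parameters above:

```python
def build_relative_url(per_page: int, page: int) -> str:
    # Mirrors @concat('athlete/activities?per_page=', dataset().perPage,
    #                  '&page=', dataset().page)
    return f"athlete/activities?per_page={per_page}&page={page}"

print(build_relative_url(30, 1))  # athlete/activities?per_page=30&page=1
```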

When setting this for the dataset, you can have a pipeline parameter or variable with the per-page value and simply pass it to the copy activity. Setting the page number takes a bit more work. The simple way to do it would naturally be like this:

But naturally, you can't do that: Data Factory does not allow a Set Variable activity to reference the variable it is setting.

So my workaround was to create a new temporary variable that is set to the Activites_page value, and then set Activites_page based on this. Then place this in an Until loop.

Start Activites_pagetmp at 0; it will be set to 1 for the first run and keep incrementing from there
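The variable dance can be mirrored in Python. A minimal sketch, assuming string-typed pipeline variables (hence the int/str casts, as in the expression @string(add(int(variables('Activites_pagetmp')),1))):

```python
def next_page(pagetmp: str) -> str:
    # Mirrors @string(add(int(variables('Activites_pagetmp')), 1))
    return str(int(pagetmp) + 1)

# Activites_pagetmp starts at "0"; Activites_page is derived from it each iteration.
activites_pagetmp = "0"
activites_page = next_page(activites_pagetmp)  # "1" on the first run
activites_pagetmp = activites_page             # ready for the next iteration
```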

But how to stop the until loop when there is no more data? Well, you know how much data there should be, so simply use this as the until expression:

@less(activity('Copy Activites JSON').output.rowsCopied,int(variables('Activites_perPage')))

Keep copying until the number of rows in the last copy is less than the number of activities per page
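Putting it all together, the Until pattern can be sketched in Python. Here fetch_page is a hypothetical stand-in for the copy activity (it simulates an API holding 70 activities); a real run would call the Strava endpoint with an access token:

```python
def fetch_page(page: int, per_page: int) -> list:
    # Stand-in for the Copy activity: pretend the API holds 70 activities.
    total = 70
    start = (page - 1) * per_page
    return list(range(start, min(start + per_page, total)))

def copy_all_activities(per_page: int = 30) -> list:
    activities = []
    page_tmp = 0                              # Activites_pagetmp starts at 0
    while True:
        page = page_tmp + 1                   # Set Variable: page = pagetmp + 1
        rows = fetch_page(page, per_page)     # Copy activity, page passed as dataset parameter
        activities.extend(rows)
        page_tmp = page                       # Set Variable: pagetmp = page
        if len(rows) < per_page:              # Until expression: rowsCopied < perPage
            break
    return activities

print(len(copy_all_activities()))  # 70
```

Note the stop condition also handles the case where the total is an exact multiple of the page size: the next call simply returns zero rows, which is still less than per_page.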

5 comments on “Custom pagination when copying data from REST APIs”

  1. Hi Helge,

Thanks for putting this info together. I am stuck in this situation as well. I read through the details and am trying to set this up on my side; however, I am stuck in a few places. How do the Activities_page and Activities_pagetmp variables relate to the perPage and page variables set in the beginning? Also, I am getting an error on the Activities_perPage variable that it doesn’t belong to the current pipeline. Can you please help clarify and provide more details on how you have each of the variables set up, and on the pipeline itself?

    Thanks,

    Chris

    1. Hi

      There are three variables in this pipeline:
Activities_perPage – this sets how many activities should be returned per page. APIs typically limit how many rows they can return per call
Activities_page – this is the page number we are asking for
Activities_pagetmp – this is a temporary variable for setting Activities_page. We cannot reference Activities_page when setting this variable, so here is what I do in an Until loop:
      The default value for Activities_pagetmp is 0 (for the pipeline)

1. When reading data I set the Activities_page variable to Activites_pagetmp + 1 (expression: @string(add(int(variables('Activites_pagetmp')),1)) )
2. Then I use the Activities_page variable in my call, as a dataset parameter
3. Finally I set Activities_pagetmp equal to Activities_page. Then we can loop back to step one.

The Until loop has this expression:
@less(activity('Copy Activites JSON').output.rowsCopied,int(variables('Activites_perPage')))
So this checks the number of rows copied in the last run. If this is less than Activities_perPage, it means we collected the last page.

  2. Hello Helge,

Thank you very much, I was searching for a solution like this for a while, and I couldn't have done it without your help.

One last question: after the calls I am getting one file for each call. Am I doing something wrong, or is this the expected result? In that case, what do you recommend to have everything written in one file?

    Thank you!
