REST source and Parquet sink? Be careful!

I’ve been struggling a bit with this blog post lately. It builds on an earlier post where I set up a pipeline that calls the Strava REST API to collect some data. I wanted to process that data with Databricks, but ran into some problems related to how Data Factory treats data in the copy activity.

How I set up the copy activity is described in that post. I simply call the API multiple times and write into a dataset. The dataset is parameterized by page number, so I end up with a series of files. Since Parquet is the most efficient format for Databricks, I wanted to use it for the sink, and that is where the problems started. With the default copy behaviour, abort on the first incompatible row, I got this error:

{
  "errorCode": "2200",
  "message": "Failure happened on 'Sink' side. ErrorCode=UserErrorUnsupportedHierarchicalComplexValue,'Type=Microsoft.DataTransfer.Common.Shared.HybridDeliveryException,Message=The retrieved type of data JArray with value [59.93,10.72] is not supported yet, please either remove the targeted column or enable skip incompatible row to skip them.,Source=Microsoft.DataTransfer.Common,'",
  "failureType": "UserError",
  "target": "Copy Activites Parquet"
}

So there is an array somewhere?

As I cannot remove the targeted column from the API, I enabled skip incompatible rows to see if the job would run. It succeeded, but to test it a bit I ran the pipeline with one copy activity for JSON (with abort on incompatible rows) and one for Parquet (with skip incompatible rows).
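For reference, fault tolerance is a property on the copy activity itself. A rough sketch of the relevant part of the activity JSON, simplified and not a complete definition (the source and sink blocks are trimmed down to their type names):

{
  "name": "Copy Activites Parquet",
  "type": "Copy",
  "typeProperties": {
    "source": { "type": "RestSource" },
    "sink": { "type": "ParquetSink" },
    "enableSkipIncompatibleRow": true
  }
}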

Databricks has a nice feature where I can read in all the files with a simple command like the one below, which I use to create two views, inputParquet and inputJSON:

// Authenticate to ADLS Gen2 with an account key stored in Azure Key Vault
spark.conf.set(
  "fs.azure.account.key.storagedatahelge.dfs.core.windows.net",
  dbutils.secrets.get(scope = "keyvault-datahelge", key = "adlsAccessKey"))

// Read all Parquet files for the date given as a widget parameter
val path = "abfss://strava@storagedatahelge.dfs.core.windows.net/input/activites/Activites-" + dbutils.widgets.get("date") + "-*.parquet"
val inputDf = spark.read.parquet(path)

inputDf.createOrReplaceTempView("inputParquet")

Using Key Vault for authentication, and then reading all files for a given date
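The code above only creates the Parquet view. A minimal sketch of the matching JSON view, assuming the JSON copy activity writes to a parallel folder with the same naming convention (the activites-json folder name is my assumption):

// Read the JSON output of the second copy activity; the folder name is an
// assumption, adjust it to wherever your JSON sink writes
val jsonPath = "abfss://strava@storagedatahelge.dfs.core.windows.net/input/activites-json/Activites-" + dbutils.widgets.get("date") + "-*.json"
val inputJsonDf = spark.read.json(jsonPath)

inputJsonDf.createOrReplaceTempView("inputJSON")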

When I ran a count it turned out I had 2136 rows with JSON and only 2041 rows with Parquet. Naturally I was interested in figuring out what had happened to these rows, so I ran a comparison and found activities that existed in JSON but not in Parquet.
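A sketch of that comparison, assuming both views exist and using id as the activity key (Strava returns an id per activity, but treat the column name as an assumption here):

// Row counts per view
spark.sql("SELECT COUNT(*) FROM inputJSON").show()
spark.sql("SELECT COUNT(*) FROM inputParquet").show()

// Activities present in JSON but missing from Parquet
spark.sql("""
  SELECT j.id
  FROM inputJSON j
  LEFT ANTI JOIN inputParquet p
    ON j.id = p.id
""").show()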

Parquet files store their schema inside the file itself, and here it turned out that the schema was not consistent across my different files! Looking into two of the files (file 22, which had the problem, and file 1) shows this:

Column is string in one file, and long in another
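You can reproduce that check by reading the two files individually and printing the schema Spark infers; a minimal sketch, reusing the path convention from above:

// Base path for individual page files, same convention as the wildcard read
val base = "abfss://strava@storagedatahelge.dfs.core.windows.net/input/activites/Activites-" + dbutils.widgets.get("date") + "-"

// The same column comes back as string in one file and long in the other
spark.read.parquet(base + "1.parquet").printSchema()
spark.read.parquet(base + "22.parquet").printSchema()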

For Parquet files this means that you lose data. Interestingly, the same inconsistency can be observed in the JSON files, but there it is not a problem: Databricks is able to reconcile the types and process the data.

This is only a guess, but it seems like Data Factory does not consider the structure of previously written files when writing from REST APIs. The reason for treating the data as string could be that the column can hold null values in some cases. To me this looks like a bug, but either way I think it is a good principle to land REST data that arrives as JSON in JSON format in your data lake.
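Following that principle, a minimal sketch of the pattern: land the raw response as JSON, let Spark infer one consistent schema across all the files, and write the Parquet yourself from Databricks (the processed output path is hypothetical):

// Read all JSON files at once so Spark infers a single, consistent schema
val raw = spark.read.json(jsonPath) // jsonPath as defined above

// Write Parquet from Databricks instead of from the Data Factory sink
raw.write
  .mode("overwrite")
  .parquet("abfss://strava@storagedatahelge.dfs.core.windows.net/processed/activities/" + dbutils.widgets.get("date"))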

One comment on “REST source and Parquet sink? Be careful!”

  1. Hey! I have a JSON object that has an array within an array:

    {
      "styleclass": {},
      "colorVariations": [
        {
          "colorLabel": "Red",
          "colorID": "1214",
          "products": [
            {
              "productID": "123412",
              "productSize": "44",
              "productTitle": "Red Shoes"
            },
            {
              "productID": "124312",
              "productSize": "46",
              "productTitle": "Red Shoes 46"
            }
          ]
        },
        {
          "colorLabel": "Blue",
          "colorID": "123141",
          "products": [
            {
              "productID": "09123",
              "productSize": "33",
              "productTitle": "Blue Shoes"
            },
            {
              "productID": "1241231",
              "productSize": "34",
              "productTitle": "Blue Shoes 34"
            }
          ]
        }
      ]
    }

    In the copy activity I’m not able to map the data down to the lowest level of iteration. (collectionReference is set at the product level as "colorvariations[0].products". I tried changing it to [*], which gives the same error you have listed, errorCode "2200".)

    Any idea how I can go about it? Thanks a lot!
