A scalable solution for Datawrapper visualizations with Data Factory and Databricks

During the Covid pandemic, we at NHO (The Confederation of Norwegian Enterprise) have conducted surveys among our members, which are Norwegian enterprises. We have asked them how business is going, whether they fear bankruptcy, how their income is developing, and how they view the months ahead. Media and others wanted this data, so we created a website with key figures for Norwegian enterprises, using Datawrapper as the visualization tool. Then one of our county representatives wanted this, and more data, for his county, and we thought all counties should know their key figures. But when you go from one national page to eleven county pages, you need to build something scalable. This post explains what we did.

Datawrapper is a modern service, so naturally they have an API that we could use. We found that using the Datawrapper developer tools to create the figures we wanted was a fast way to build a prototype, so that we actually knew what kind of data we wanted and how we wanted to present it. After some work we ended up with a set of 28 figures, and across all 11 counties in Norway this gave us over 300 visualizations to create and maintain.

We created a configuration file holding the definitions for our figures, with metadata such as key (a unique name we give each figure), the Datawrapper visualization type, the Datawrapper id, name, title, annotation and the name of the source system for the data. This configuration file is stored in our data lake and is accessible to both Databricks and Data Factory. Databricks updates it during processing.
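To make the structure concrete, here is a minimal sketch of what one entry in such a configuration file could look like. The field names and values are illustrative assumptions, not our exact schema.

```python
import json

# Illustrative configuration entries (hypothetical field names and values).
# "datawrapper_id" starts out empty and is filled in by Databricks once the
# figure has been created through the Datawrapper API.
figure_config = [
    {
        "key": "bankruptcy_fear_by_county",      # unique name we give the figure
        "chart_type": "d3-bars",                 # Datawrapper visualization type
        "datawrapper_id": "",                    # set after the figure is created
        "name": "Fear of bankruptcy",
        "title": "Share of enterprises fearing bankruptcy",
        "annotation": "Last updated: {last_updated}",
        "source_system": "member_survey",        # which source system feeds the data
    },
]

with open("figure_config.json", "w") as f:
    json.dump(figure_config, f, indent=2)
```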

Then we created one, and just one, pipeline in Data Factory to do our magic. This pipeline has a parameter for the source system that has updated data, and is executed when new data is ready.
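The pipeline is normally kicked off by an event or a schedule, but it can also be started programmatically. Below is a minimal sketch of starting a run with a source-system parameter using the Azure SDK for Python; the subscription, resource group, factory, pipeline and parameter names are assumptions for illustration.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.datafactory import DataFactoryManagementClient

# Hypothetical names -- substitute your own subscription, resource group,
# factory and pipeline; "sourceSystem" is an assumed parameter name.
SUBSCRIPTION_ID = "<subscription-id>"
RESOURCE_GROUP = "analytics-rg"
FACTORY_NAME = "nho-data-factory"
PIPELINE_NAME = "update-datawrapper-figures"

adf_client = DataFactoryManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

# Kick off one run of the single pipeline, telling it which source system
# has fresh data so it only refreshes the affected figures.
run = adf_client.pipelines.create_run(
    RESOURCE_GROUP,
    FACTORY_NAME,
    PIPELINE_NAME,
    parameters={"sourceSystem": "member_survey"},
)
print(f"Started pipeline run {run.run_id}")
```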

The pipeline works as follows:

  1. It starts with a Databricks notebook that prepares the data for the visualizations. The notebook creates csv files, and by partitioning these files by county you get one file per county (see the first sketch after this list). Our prototype page had already defined the data structure for these csv files, so the job in Databricks was to collect the data and format it properly.
  2. Next we do a lookup in Data Factory to get the metadata for the figures we want, and do one of the following:
    1. If a figure has not been created yet (we have no Datawrapper id), we call the Datawrapper API to create it and append the returned id along with our key. When all missing figures are created, we call another Databricks notebook that updates the metadata configuration with the correct Datawrapper ids, and also creates configuration data (in json format) that we later send to the API to configure the visualizations (see the second sketch after this list).
    2. If a figure needs to be updated (for instance, some visualizations have an intro text or annotation stating when the data was last updated), we run the same Databricks notebook to update the configuration.
  3. We also copy the data for the visualizations from our data lake to an Azure blob storage account. This is set up according to the Datawrapper requirements so that the visualizations can use our data directly.
  4. Finally, all the figures are published and all 11 county sites are updated.
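For step 1, here is a minimal sketch of how a Databricks notebook could write one csv file per county with PySpark. The table name, column names and data lake path are assumptions, not our actual pipeline; the point is that partitioning by county gives one small csv file per county page.

```python
# Databricks notebook sketch (PySpark): write one csv file per county.
# "survey_results", the column names and the output path are hypothetical.
from pyspark.sql import functions as F

survey = spark.table("survey_results").select(
    "county", "week", "share_fearing_bankruptcy"
)

(
    survey
    .coalesce(1)                      # a single writer task...
    .write.mode("overwrite")
    .partitionBy("county")            # ...gives one file per county=<name> folder
    .option("header", True)
    .csv("/mnt/datalake/datawrapper/bankruptcy_fear")
)
```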
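For steps 2 and 4, the sketch below illustrates the kind of HTTP calls involved against the Datawrapper API: creating a chart, configuring it to read external csv data, and publishing it. It is a simplified illustration rather than our actual notebook; the token, chart settings and blob URL are placeholders, and the exact metadata keys for external data should be checked against the Datawrapper API documentation.

```python
import requests

API = "https://api.datawrapper.de/v3"
HEADERS = {"Authorization": "Bearer <datawrapper-api-token>"}  # placeholder token

def create_chart(title, chart_type):
    """Create an empty chart and return its Datawrapper id (step 2.1)."""
    resp = requests.post(f"{API}/charts",
                         headers=HEADERS,
                         json={"title": title, "type": chart_type})
    resp.raise_for_status()
    return resp.json()["id"]

def configure_chart(chart_id, annotation, data_url):
    """Update the annotation and point the chart at external csv data.
    The metadata keys for external data are an assumption to verify
    against the Datawrapper docs."""
    payload = {
        "metadata": {
            "annotate": {"notes": annotation},
            "data": {"upload-method": "external-data", "external-data": data_url},
        }
    }
    resp = requests.patch(f"{API}/charts/{chart_id}", headers=HEADERS, json=payload)
    resp.raise_for_status()

def publish_chart(chart_id):
    """Publish the chart so the county pages show the latest version (step 4)."""
    resp = requests.post(f"{API}/charts/{chart_id}/publish", headers=HEADERS)
    resp.raise_for_status()

# Example usage with placeholder values
chart_id = create_chart("Share of enterprises fearing bankruptcy", "d3-bars")
configure_chart(chart_id,
                "Last updated 1 March 2021",
                "https://<storage-account>.blob.core.windows.net/datawrapper/Oslo.csv")
publish_chart(chart_id)
```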
