I presented a session on SQLSaturday Oslo last weekend and can share some of the key takeaways here. This is by no means a complete comparison of every feature of these techniques, but show some of the strengths and weaknesses and can be a start when considering what tech to use in your Azure Data Factory.
Databricks is based on Apache Spark and provides in memory compute with language support for Scala, R, Python and SQL. Data transformation/engineering can be done in notebooks with statements in different languages. That makes this a flexible technology to include advanced analytics and machine learning as part of the data transformation process. You are also able to run each step of the process in a notebook, so step by step debugging is easy. You will also be able to see this process during job execution, so it is easy to see if your job stops.
Databricks clusters can be configured in a variety of ways, both regarding the number and type of compute nodes. Managing to set the correct cluster is an art form, but you can get quite close as you can set up your cluster to automatically scale within your defined threshold given the workload. It can also be set to automatically terminate when it is inactive for a certain time. When used with ADF the cluster will start up when activities are started. parameters can be sent in and out from ADF. Azure Databricks is closely connected to other Azure services, both Active Directory, KeyVault and data storage options like blob, data lake storage and sql.
The biggest drawback of Databricks in my mind is that you have to write code. Most BI developers are used to more graphical ETL tools like SSIS, Informatica or similar, and it is a learning curve to rather write code. Many will say that poorly written code will be very hard to maintain, but I’ve seen plenty of examples where graphical ETL isn’t easy to follow either.
Mapping Data Flows
One of the many data flows from Microsoft these days providing, for the first time, data transformation capabilities within Data Factory. This is not a U-SQL script or Databricks notebook that is orchestrated from Data Factory, but a tool integrated. This means that you can reuse (many of) the datasets you have defined in Data Factory, while in Databricks you don’t.
Mapping Data Flows runs on top of Databricks, but the cluster is handled for you and you don’t have to write any of that Scala code yourself. Many of the transformations will be familiar if you have worked with SSIS or similar tools, so the learning curve is not as steep as with Databricks. Mapping Data Flow provides nice monitoring features in ADF, but so far only after the job is complete. Sitting and monitoring an activity that runs for 39 minutes with no feedback can challenge my nerves…
By getting a graphical interface with transformations also can have it’s drawbacks when it comes to flexibility. You cannot insert a custom script task with C# or run Python/R. There is also some limitations in file formats for the time being with no support for JSON, but this is coming. There are plenty of templates made available to solve common tasks and you can do things like row operations and automatic creation of target tables in your database (which the copy activity also can do now!)
Keep in mind that this is still in preview, so things can improve. I recommend to follow Mark Kromer for good examples and news on this tech. In my test mapping data flow were slower than databricks and SSIS, but that might change as they come along.
SSIS is the well known ETL tool from Microsoft, and probably needs no further introduction. You can deploy SSIS packages to ADF by setting up a SSIS-Integration Runtime, and this is were your problems start. The SSIS-IR have to run for you to deploy packages (and configure them, if you don’t want to do it manually). Also the IR must be started before you start a SSIS activity. It won’t start automatically and your jobs will fail! I wrote about how to this here. Just remember to stop the IR when you are done.
Another drawback is limited support for cloud data stores. For SQLSatOslo I used Azure Data Lake Storage Gen1 as only this is supported. Compared to other technologies (like databricks and mapping data flows) SSIS is not the first to receive new updates.
The big advantage is that there are lots of developers familiar with the technology. Or if you have plenty of SSIS packages rewriting all of them is maybe not tempting.. But I’m not sure that I would go for managing all of this in ADF. But the performance was quite good.
So, what is best? The answer is as always it depends. For me Databricks is the current front-runner, Mapping Data Flows are promising but not quite there yet and SSIS would be a last resort to run data transformation in the cloud…
Sorry for any typos in this post, and it might be some time before my next post. I broke my (left, luckily) arm in an attempt to play football last weekend…