Comparing Azure ETL tools: Azure Data Factory vs Azure Databricks

If you want to do data integration on the Azure cloud, your two obvious options are Azure Data Factory (ADF) and Azure Databricks (ADB). In this post, I will compare the two services, in the hope of helping anyone who is trying to decide which one to use.

What is Azure Data Factory?

Azure Data Factory is a cloud-based ETL service that enables you to create, schedule, and monitor data integration workflows. It offers a visual interface for creating ETL workflows and supports a wide range of data sources and destinations, including on-premises and cloud-based data stores.

What is Azure Databricks?

Azure Databricks is an analytics service that provides a collaborative environment for building and managing data pipelines. Its data processing engine is based on Apache Spark, and it can work with structured, semi-structured, and unstructured data.

Please note that ADB includes many more features than just ETL. It also offers a full data warehouse (DWH) environment and streaming solutions. In this post, however, I will only cover the ETL part.

Connectors:

Connectors are used to connect the service to external data stores and use them as sources or destinations.

ADF supports around 100 connectors, including all the main databases and file formats.

List of ADF connectors:

https://learn.microsoft.com/en-us/azure/data-factory/connector-overview

ADB supports fewer connectors out of the box, but since it is code-based, you can write your own code to connect to APIs or other external sources.

List of supported ADB external sources:

https://learn.microsoft.com/en-us/azure/databricks/external-data/
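
To make the "write your own code" point more concrete, here is a minimal sketch of a hand-written connector in Python. Everything specific in it is an assumption for illustration: the endpoint URL is hypothetical, and so is the response shape (an "items" list plus a "next_page" number). A real API will have its own pagination and authentication rules.

```python
import json
from typing import Callable, Iterator
from urllib.request import urlopen


def http_get_json(url: str) -> dict:
    """Fetch a URL and decode the JSON body (standard library only)."""
    with urlopen(url, timeout=30) as response:
        return json.loads(response.read().decode("utf-8"))


def fetch_all_records(
    base_url: str,
    get_json: Callable[[str], dict] = http_get_json,
) -> Iterator[dict]:
    """Yield every record from a hypothetical paginated endpoint.

    Assumed response shape (not a real API):
    {"items": [...], "next_page": <int or None>}
    """
    page = 1
    while page is not None:
        payload = get_json(f"{base_url}?page={page}")
        yield from payload["items"]
        # A missing or null "next_page" ends the loop.
        page = payload.get("next_page")
```

The get_json parameter is there so the function can be tested without a network call. In a notebook, the collected rows could then be turned into a DataFrame, for example with spark.createDataFrame, and written to the destination of your choice.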

Support for on-premises data sources:

On-premises data sources are sources of data that reside in a private network, where you would not want to allow incoming connections from cloud services for security reasons.

ADF has a feature called self-hosted integration runtime (SHIR), which you install on a server inside your private network. It only requires outbound firewall rules, not inbound ones, and is therefore more secure. SHIR communicates with ADF in the cloud and enables ADF to connect to an internal data source and move data from it to the cloud and vice versa.

ADB does not have this capability, so to connect ADB to an on-premises data source, you would have to open a connection through the firewall.

Code vs no-code

ADF is a no-code solution: you can develop a complex data integration pipeline without learning or writing code.

ADB is code-based and supports several languages: Python, SQL, Scala, and R.

Though harder to learn, code-based solutions give you much more flexibility in developing integration pipelines, and also enable you to perform other tasks, like writing a connection to an API.

Since you are writing your own code, you can create functions and reuse them. This allows for more collaboration (each developer works on a different piece of code) and saves developer effort (the same code is reused in multiple places).
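
As a small illustration of reuse, the sketch below defines one helper and applies it to two different datasets. The function name, the column names, and the sample rows are all made up for this example, and I use plain Python dictionaries instead of Spark DataFrames to keep it self-contained.

```python
from typing import Iterable, List


def normalize_columns(rows: Iterable[dict]) -> List[dict]:
    """Return rows with trimmed, lower-case, underscore-separated column names."""
    return [
        {key.strip().lower().replace(" ", "_"): value for key, value in row.items()}
        for row in rows
    ]


# The same helper is reused for two unrelated (hypothetical) sources.
customers = normalize_columns([{"Customer ID ": 1, "Full Name": "Ada"}])
orders = normalize_columns([{"Order ID": 10, "Customer ID": 1}])
```

One developer can maintain the helper while others call it from their own notebooks, so a fix to the naming rule is made in one place only.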

Documentation

When developing complex data integration, documenting what you do is very important. It enables other developers to understand your code and work with it and also helps you remember, when returning to your code after some time, what you did and why.

In ADF, you can give objects a description, but other than that, there is no way to add documentation.

In ADB, you develop your code in notebooks, so you can add markdown cells that contain text, links, and even images.

See examples here:

https://grabngoinfo.com/databricks-notebook-markdown-cheat-sheet/
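
To give a rough idea of how documentation and code sit side by side, here is a sketch of a two-cell Python notebook as it looks, to the best of my knowledge, when exported as a source file: markdown cells appear as comment lines starting with "# MAGIC %md", and cells are separated by "# COMMAND ----------". The notebook content itself (the heading, the function, and the field names) is hypothetical.

```python
# Databricks notebook source
# MAGIC %md
# MAGIC ## Order totals
# MAGIC This notebook computes the total value of an order.
# MAGIC Each line item is expected to have a `quantity` and a `unit_price`.

# COMMAND ----------

def order_total(line_items):
    """Sum quantity * unit_price over all line items of one order."""
    return sum(item["quantity"] * item["unit_price"] for item in line_items)
```

Outside Databricks the MAGIC lines are ordinary comments, so the file still runs as regular Python.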

This is not a complete list of all the differences between ADF and ADB, but I think these are the main issues you should consider when deciding which tool to use for your next project.

The best option, in my opinion, is to use the best tool for each task. So you may want to use both ADF and ADB in the same project, using each tool where it has an advantage.