Data flows vs. SQL, Part 1 (Performance)

I’ve been toying with Azure Data Factory lately, and especially with Data flows. It’s a graphical, low-code tool for doing data transformations in a cloud environment in a serverless manner. I got the idea of doing a little comparison between it and good old SQL from both a performance and a billing point of view. My plan was to create a denormalized, a.k.a. wide, table by joining multiple tables together using both Data flows and SQL. I used the AdventureWorks2019 sample database as a data source and combined seven tables of sales-related data into one wide result table.
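As a minimal sketch, the target table could look something like the DDL below. The table name dbo.curated_sp and the column list come from the stored procedure further down; the data types are my assumptions based on the corresponding AdventureWorks2019 source columns, so treat them as illustrative rather than the exact definition used in the test:

-- Assumed shape of the denormalized target table (types guessed from AdventureWorks2019)
CREATE TABLE dbo.curated_sp (
    SalesOrderNumber       nvarchar(25)   NOT NULL,
    OrderDate              datetime       NOT NULL,
    DueDate                datetime       NOT NULL,
    ShipDate               datetime       NULL,
    OrderQty               smallint       NOT NULL,
    UnitPrice              money          NOT NULL,
    UnitPriceDiscount      money          NOT NULL,
    LineTotal              numeric(38, 6) NOT NULL,
    ProductNumber          nvarchar(25)   NOT NULL,
    ProductName            nvarchar(50)   NOT NULL,
    ProductSubcategoryName nvarchar(50)   NOT NULL,
    ProductCategoryName    nvarchar(50)   NOT NULL,
    ProductModelName       nvarchar(50)   NOT NULL,
    AccountNumber          varchar(10)    NOT NULL
);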

The database I provisioned was the slowest and cheapest Azure SQL Database I could find: the Basic tier, which costs a whopping 4 € or so per month.
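As a side note (not part of the original test setup), you can double-check which tier a database is running on straight from the database itself:

-- Returns e.g. 'Basic' for the service objective and edition of the current Azure SQL Database
SELECT DATABASEPROPERTYEX(DB_NAME(), 'ServiceObjective') AS ServiceObjective,
       DATABASEPROPERTYEX(DB_NAME(), 'Edition')          AS Edition;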

The result table contained 121 317 rows, and the stored procedure that populated it was this:

SET ANSI_NULLS ON
GO
SET QUOTED_IDENTIFIER ON
GO
CREATE PROCEDURE [dbo].[uspCurated] 
AS
   SET NOCOUNT ON;
   
TRUNCATE TABLE dbo.curated_sp;

INSERT INTO dbo.curated_sp
SELECT h.SalesOrderNumber, h.OrderDate, h.DueDate, h.ShipDate
      ,d.OrderQty, d.UnitPrice, d.UnitPriceDiscount, d.LineTotal
      ,p.ProductNumber, p.Name AS ProductName
      ,pg.Name AS ProductSubcategoryName
      ,pcg.Name AS ProductCategoryName
      ,pm.Name AS ProductModelName
      ,c.AccountNumber
FROM [Sales].[SalesOrderHeader] h
  INNER JOIN [Sales].[SalesOrderDetail] d ON h.SalesOrderID = d.SalesOrderID
  INNER JOIN [Production].[Product] p ON d.ProductID = p.ProductID
  INNER JOIN [Production].[ProductSubcategory] pg ON p.ProductSubcategoryID = pg.ProductSubcategoryID
  INNER JOIN [Production].[ProductCategory] pcg ON pg.ProductCategoryID = pcg.ProductCategoryID
  INNER JOIN [Production].[ProductModel] pm ON p.ProductModelID = pm.ProductModelID
  INNER JOIN [Sales].[Customer] c ON h.CustomerID = c.CustomerID;
GO
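Running the load and sanity-checking the row count needs nothing more than this (the 121 317 figure is the row count mentioned above):

-- Populate the wide table and verify the expected 121 317 rows landed
EXEC dbo.uspCurated;
SELECT COUNT(*) AS RowsLoaded FROM dbo.curated_sp;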

For Data flows I configured a General purpose runtime with 8 cores. The pipeline I built looks like the one below; note that not all sources are visible in the screenshot, but you get the idea:

I used SalesOrderHeader as a starting point and joined the other tables to it using Join transformations. Before saving the end result back into the database, there is also a Select transformation for filtering out unwanted columns, after which the result set is pushed into the table.

Both constructs (the Data flow and the stored procedure) were placed in their own pipelines, and both pipelines were scheduled to run once an hour for one day, so each ran 24 times in total. Here are the running times in seconds:

Run Id | Data flow duration (sec) | Stored procedure duration (sec)
------ | ------------------------ | -------------------------------
1 | 552 | 84
2 | 373 | 84
3 | 523 | 84
4 | 340 | 86
5 | 372 | 85
6 | 552 | 86
7 | 554 | 84
8 | 582 | 85
9 | 400 | 85
10 | 555 | 85
11 | 371 | 83
12 | 371 | 86
13 | 345 | 84
14 | 550 | 84
15 | 372 | 85
16 | 554 | 84
17 | 341 | 84
18 | 342 | 84
19 | 371 | 84
20 | 372 | 84
21 | 552 | 86
22 | 354 | 86
23 | 371 | 85
24 | 371 | 118
Average | 435 | 92
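To put a number on how consistent each method was, here's a quick T-SQL sketch that computes the standard deviation of the 24 run durations. The values are copied verbatim from the table above; @runs is just a throwaway table variable for the sake of the example:

-- Throwaway table variable holding the per-run durations from the table above
DECLARE @runs TABLE (RunId int, DataFlowSec int, StoredProcSec int);

INSERT INTO @runs (RunId, DataFlowSec, StoredProcSec) VALUES
 (1,552,84),(2,373,84),(3,523,84),(4,340,86),(5,372,85),(6,552,86),
 (7,554,84),(8,582,85),(9,400,85),(10,555,85),(11,371,83),(12,371,86),
 (13,345,84),(14,550,84),(15,372,85),(16,554,84),(17,341,84),(18,342,84),
 (19,371,84),(20,372,84),(21,552,86),(22,354,86),(23,371,85),(24,371,118);

-- Spread of the run durations: the smaller the value, the more consistent the method
SELECT STDEV(DataFlowSec)   AS StdevDataFlowSec,
       STDEV(StoredProcSec) AS StdevStoredProcSec
FROM @runs;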

The stored procedure was both far more consistent and a lot faster compared to the Data flow. Let's dive deeper into the Data flow durations. What I found out was that the warm-up time of the Spark cluster (the technology behind Data flows) can be quite long. Here's a sample of run #1, which took 9 minutes and 12 seconds in total; 5 minutes of that was spent just waiting for the cluster to be ready.

Now that’s a long time! According to the documentation, startup time varies between 4 and 5 minutes. In my test runs, which were done in the Azure West Europe region, the average was approximately 3 minutes. Here are the statistics, but remember that your mileage may vary:

Run Id | Total duration (sec) | Cluster startup time (sec) | Actual runtime (sec, total – startup)
------ | -------------------- | -------------------------- | -------------------------------------
1 | 552 | 298 | 254
2 | 373 | 102 | 271
3 | 523 | 282 | 241
4 | 340 | 96 | 244
5 | 372 | 130 | 242
6 | 552 | 290 | 262
7 | 554 | 303 | 251
8 | 582 | 326 | 256
9 | 400 | 140 | 260
10 | 555 | 284 | 271
11 | 371 | 103 | 268
12 | 371 | 126 | 245
13 | 345 | 103 | 242
14 | 550 | 289 | 261
15 | 372 | 130 | 242
16 | 554 | 291 | 263
17 | 341 | 92 | 249
18 | 342 | 103 | 239
19 | 371 | 103 | 268
20 | 372 | 109 | 263
21 | 552 | 307 | 245
22 | 354 | 103 | 251
23 | 371 | 98 | 273
24 | 371 | 97 | 274
Average | 435 | 179 | 256

Cluster startup times vary quite a lot, from about 1.5 minutes to over 5 minutes. The actual runtime (the time from the cluster being ready to the end of execution) is much more constant. Using SQL we got the whole job done in about 1.5 minutes, but just getting the Spark cluster up and running takes twice that on average!

Conclusions

If you already have your data in a database, then why not utilize its computing power to do the data transformations? I mean, it's old technology, and being old is a good thing in this context: relational databases have been around for more than 40 years, which means they are well understood, well optimized, and backed by a lot of scientific research. Good old relational databases are still a valid solution for handling structured data.

Someone might say the test wasn’t fair, and yes, I can relate to that: the data didn't have to move out of the SQL database and back in when using plain SQL. But then again, should one move it just to do some basic manipulations with a new and shiny low-code tool? I don't think so: SQL is the lingua franca of data engineers, and there's a big pool of developers familiar with it.

Even starting the Spark cluster took more time than getting the whole thing done with SQL. And guess what's fun about this? You also pay for the cluster warm-up time! But more on that, and the overall cost of these two methods, in part 2. Stay tuned!
