Last week dbt Labs decided to change the pricing of their Cloud offering. I already analysed this in week #22.50 of the Data News. In a nutshell, dbt Cloud pricing is per seat, which means you pay for each dbt developer. For a team, it was previously $50/month/dev and it has increased to $100/month/dev, a 100% increase, with a team limit of 8 devs and only one project. To go beyond this limit you'll need the Enterprise pricing, which is opaque like all pricing of this kind.

But this article is not about the pricing, which can be very subjective depending on the context. What is $1,200 a year for dev tooling when you pay a developer more than $150k per year? Yes, it's US-centric, but relevant.

Let's go deeper than this and list the options available today to schedule dbt in production. We will also cover what it means to manage dbt. This article is written as a guide that aims to be exhaustive by listing all the possible solutions, but if you feel I missed something do not hesitate to ping me.

dbt, a small reminder

Everyone (including me) is talking about dbt, but what the heck is dbt? In simple words, dbt Core is a framework that helps you organise all your warehouse transformations. Usage of the framework has grown a lot over the last few years. It's important to say that many of today's usages were not initially designed by Fishtown Analytics.

At first dbt transformations were only SQL queries, but in recent versions, on supported warehouses, it has become possible to add Python transformations. dbt's responsibility is to transform the collection of queries into a usable DAG. The dependencies between the queries are defined by humans (which means they are error-prone) thanks to two handy functions, source and ref. These two functions are called macros because they use Jinja, a Python templating engine. In dbt, macros transform templated code (Jinja + SQL) into SQL, so we can say that we have templated queries.
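To make the DAG idea concrete, here is a minimal sketch of how dependencies can be derived from ref calls. It is not how dbt is implemented: the model names and queries are hypothetical, and a simple regular expression stands in for dbt's real Jinja parsing. It only shows that once each query declares its parents through ref, an execution order falls out of a topological sort.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical models (name -> templated SQL), for illustration only.
MODELS = {
    "stg_orders": "select * from {{ source('shop', 'orders') }}",
    "stg_customers": "select * from {{ source('shop', 'customers') }}",
    "orders_enriched": (
        "select * from {{ ref('stg_orders') }} o "
        "join {{ ref('stg_customers') }} c on o.customer_id = c.id"
    ),
    "revenue": "select sum(amount) from {{ ref('orders_enriched') }}",
}

# Simplified stand-in for dbt's parsing: capture the model name inside ref('...').
REF_PATTERN = re.compile(r"ref\(\s*['\"]([^'\"]+)['\"]\s*\)")


def build_graph(models):
    """Map each model to the set of models it depends on."""
    return {name: set(REF_PATTERN.findall(sql)) for name, sql in models.items()}


def execution_order(models):
    """Return the models in an order where parents always come before children."""
    return list(TopologicalSorter(build_graph(models)).static_order())


order = execution_order(MODELS)
print(order)
```

Sources are left out of the graph here because they point at tables that already exist in the warehouse, so there is nothing for dbt to build before them.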

Everything I have just described can be considered static. If we draw a parallel with software development, this is your codebase. Python and SQL together in the dbt framework are your codebase. You can do development on your codebase. In order to go to production you'll have to manage and schedule dbt.

To manage dbt you will have to answer multiple questions, but dbt management is mainly about how the data team develops on dbt, how the project is validated and deployed, how you get alerted when something goes wrong, and how you monitor it.

In addition to dbt management, you will have to find the place where dbt will be scheduled, i.e. where dbt will run. dbt scheduling is tricky but not really complicated. If you followed what we've just seen, dbt is a SQL query orchestrator. dbt does not run the queries itself: all the queries are sent to the underlying warehouse. This means that, theoretically, dbt does not need a lot of computing power (CPU/RAM), because it is only sending SQL queries sequentially to your data warehouse, which does the work.

Obviously every dbt project has been designed differently, but if we simplify the workflow, every dbt project will need at some point to run one or multiple dbt CLI commands.
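As a rough illustration of what "running dbt CLI commands" means for a scheduler, the sketch below builds and prints a sequence of commands, and only executes them when dry_run is disabled. The helper names, the tag:daily selector and the prod target are assumptions made for the example; the subcommands and the --select and --target flags are standard dbt CLI options.

```python
import shlex
import subprocess


def dbt_command(subcommand, select=None, target=None):
    """Build a dbt CLI invocation as a list of arguments."""
    cmd = ["dbt", subcommand]
    if select:
        cmd += ["--select", select]
    if target:
        cmd += ["--target", target]
    return cmd


def run_steps(steps, dry_run=True):
    """Print each command; execute it only when dry_run is False."""
    for cmd in steps:
        print("$", shlex.join(cmd))
        if not dry_run:
            # check=True stops the sequence as soon as one command fails.
            subprocess.run(cmd, check=True)


# Hypothetical production run: install packages, then build the daily models.
STEPS = [
    dbt_command("deps"),
    dbt_command("build", select="tag:daily", target="prod"),
]

run_steps(STEPS)  # dry run: only prints the commands
```

Whatever scheduler you pick, it will end up doing some variation of this: call the CLI, watch the exit code, and react when a step fails.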

In this guide we will first see how we can manage dbt, i.e. git structures, how to code, the CI/CD and the deployment. Then, in the second part, we will see how to schedule dbt code, i.e. on which server and with which triggers.

💡
This is a big guide, do not hesitate to use the table of contents to jump to the parts that interest you.

How to manage dbt 🧑‍🔧

Data team workshop to setup dbt (credits)

One of dbt's founding principles is to bring software engineering practices to data development work, especially to the SQL development world. To follow this principle we will try to treat the workflow like an engineering project, even if it can sometimes feel over-engineered.

You have to consider development and deployment when managing dbt project(s):

Git repositories considerations

This is the everlasting debate of every software engineering team: monorepo or multirepo? It is tightly linked to another dbt-related question, which is mono-project or multi-project. By default and by design dbt is meant to work with a single project, but when you start to grow or when you want clear domain borders, the single project can quickly reach its limits.

As I said previously, if you're just starting with dbt and you're a small team I still recommend going with one repo and one project. First try to organise the models folder correctly before trying to structure things at a higher level.

By definition, a dbt project here corresponds to the folder generated by the command dbt init, while a repo is a folder that can be larger than this. That's why a repo can contain multiple projects.

The first question you'll probably hit is: how do I put models in different schemas/datasets? This is the first step of project organisation. The solution is to override the generate_schema_name macro.
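The real macro is written in Jinja and lives in your project's macros folder, but its logic is easier to reason about in plain Python. The sketch below mirrors dbt's default behaviour (prefix the custom schema with the target schema) and then one common override pattern, which is an assumption for this example rather than a recommendation from the original guide: use the custom schema name as-is in production and keep the prefix everywhere else. The function names and the prod target name are hypothetical.

```python
from typing import Optional


def default_schema_name(target_schema: str, custom_schema_name: Optional[str] = None) -> str:
    """Mirror of dbt's default generate_schema_name logic."""
    if custom_schema_name is None:
        return target_schema
    return f"{target_schema}_{custom_schema_name.strip()}"


def overridden_schema_name(
    target_schema: str,
    target_name: str,
    custom_schema_name: Optional[str] = None,
) -> str:
    """Assumed override: in prod, use the custom schema without the prefix."""
    if custom_schema_name is None:
        return target_schema
    if target_name == "prod":  # hypothetical target name
        return custom_schema_name.strip()
    return f"{target_schema}_{custom_schema_name.strip()}"


# Default behaviour: the custom schema is appended to the target schema.
print(default_schema_name("analytics", "marketing"))
# Override in prod: models land directly in the custom schema.
print(overridden_schema_name("analytics", "prod", "marketing"))
# Override in dev: each developer keeps an isolated, prefixed schema.
print(overridden_schema_name("dbt_alice", "dev", "marketing"))
```

Keeping the prefix outside production is what stops two developers from overwriting each other's models while they work.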

Then, if you want to go for multiple projects, you may have to decide how the projects interface with each other. Within the dbt toolkit you have 2 solutions:

This is a way to define exposures for downstream Marketing usage

Once project/repo structure has been defined there are still open questions, here are a few:

Development Experience with dbt

First, an important thing to say: I'm a data engineer and I truly think that my main mission in a data team is to empower others through data tools. In the dbt context it means you have to understand how your analytics team is working. I've also noticed over the years that analytics teams are often unable to identify that they are under-equipped or doing something inefficiently. It is your role as a data engineer to identify these issues. It is your role to provide a neat developer experience for every dbt user.

But who are your dbt users?

Now that we have listed a few of the dbt users, let's focus on the development experience, especially for the analytics team (analytics engineers and data analysts). It is super important to provide a smooth experience for these users because they will spend a lot of working hours in the models: the neater the workflow is, the happier people will be.

What are the levers you can act on to provide this great experience: