
Data News — Week 22.38

Data News #22.38 — Hidden gems in dbt artifacts, understand the Snowflake query optimizer, Python untar vulnerability, fast news and ML Friday.

Christophe Blefari
5 min read

Bonjour vous ! Sometimes I'm late. Today I'm writing the first words of the newsletter at 5 PM, which is 8 hours later than usual. Pardon me. In terms of content it has been a huge week for me: I prepared a meetup presentation that I enjoyed giving this Wednesday. It feels good to present stuff in public.

So yeah, let's talk a bit about this presentation.

Find the hidden gem in dbt artifacts

On Wednesday I gave a 30-minute presentation looking for hidden gems in dbt artifacts. The talk was a bit experimental: the idea was to show that everyone can add context to their data infrastructure by leveraging the generated artifacts. In other words, you can use the 4 JSON files dbt generates to create tooling around your dbt project.

The shoemaker's children are the worst shod.

Why not use the data in dbt artifacts to create useful data models and self-improve our data platforms?

By leveraging the 4 JSON files (manifest, run_results, sources, catalog) we could:

  1. Source monitoring like in dbt Cloud
  2. Extend your dbt docs HTML
  3. Send data to your BI tool. We already have Metabase or Preset integrations.
  4. Enforce and visualise your data governance policy. Refuse every merge request if a model owner is not defined, for instance (a minimal sketch follows this list).
  5. dbt observability, monitoring and alerting: have fun with analytics on your analytics.
  6. Create a dbt model time travel viewer. Create an automated changelog process that displays your data model evolutions.
  7. dbt-helper — Your SQL companion
  8. dbt-doctor — It's time to detect issues. Idea: a CLI tool that detects leftover hardcoded FROM references in your models and fails CI when it finds any.
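
As an illustration of point 4, here is a minimal sketch (not the exact tooling from the talk) that reads manifest.json and fails CI when a model has no owner. The `owner` meta key and the artifact path are assumptions; adapt them to your own project conventions.

```python
# check_owners.py — minimal sketch: fail CI when a dbt model has no owner.
# Assumes owners are declared under `meta: {owner: ...}` in each model config
# and that `dbt compile` (or `dbt run`) has produced target/manifest.json.
import json
import sys
from pathlib import Path

MANIFEST_PATH = Path("target/manifest.json")  # default dbt artifact location


def models_without_owner(manifest: dict) -> list[str]:
    missing = []
    for unique_id, node in manifest["nodes"].items():
        if node.get("resource_type") != "model":
            continue
        meta = node.get("config", {}).get("meta", {}) or {}
        if not meta.get("owner"):
            missing.append(unique_id)
    return missing


if __name__ == "__main__":
    manifest = json.loads(MANIFEST_PATH.read_text())
    missing = models_without_owner(manifest)
    if missing:
        print("Models without an owner:")
        for unique_id in missing:
            print(f"  - {unique_id}")
        sys.exit(1)  # non-zero exit code -> the merge request check fails
    print("All models have an owner")
```

Plugged into a merge request pipeline right after `dbt compile`, the non-zero exit code is enough to block the merge.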

I also shared that every data engineer should consider the artifacts as a way to understand their customers. If you manage to collect the artifacts from every environment (local, CI, staging, prod), you have the data to understand how everyone is using the tool. It is especially useful if you have junior analysts lost in the tool: it will surface silent local issues.

Send artifacts from every env to understand how everyone uses dbt.
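
If you want to try this, a small script run at the end of each dbt invocation can ship the artifacts to a central place, tagged with the environment. This is a minimal sketch assuming an S3 bucket and a `DBT_ENV` variable, both made-up names; any object store or warehouse stage would do.

```python
# ship_artifacts.py — minimal sketch: push dbt artifacts to a central bucket,
# tagged by environment, so you can analyse how every env uses dbt.
# The bucket name and the DBT_ENV variable are placeholders, not a convention.
import datetime
import os
from pathlib import Path

import boto3

ARTIFACTS = ["manifest.json", "run_results.json", "sources.json", "catalog.json"]
BUCKET = "my-dbt-artifacts"  # hypothetical bucket
ENV = os.environ.get("DBT_ENV", "local")  # local, ci, staging, prod

s3 = boto3.client("s3")
run_ts = datetime.datetime.utcnow().strftime("%Y%m%dT%H%M%S")

for name in ARTIFACTS:
    path = Path("target") / name
    if not path.exists():  # catalog.json only exists after `dbt docs generate`
        continue
    key = f"env={ENV}/run={run_ts}/{name}"
    s3.upload_file(str(path), BUCKET, key)
    print(f"uploaded {name} -> s3://{BUCKET}/{key}")
```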

🔗 Here are the slides of my presentation.

Closing on dbt

To finish this editorial about dbt, here are 3 other articles I found interesting. While we are living our best life creating dbt projects, the complexity of those projects will only rise in the future. By making data model creation easier, we encourage more data models to be created. So what does it mean when you have more than 700 models written by more than 43 humans? Anna from dbt Labs wrote an introspective post about it.

Adrian also raised the complexity topic on Medium. He argues that with the modern data stack and the all-SQL paradigm we write complex code that risks becoming unmanageable.

Finally, if you want a course on data modeling, Miles from GitLab will run a CoRise course on Data Modeling for the Modern Warehouse. It seems like a good resource to get started with the Kimball methodology.

Understanding the Snowflake query optimizer

❤️ If you had to read only one article this week, it would be this one. I think Teej is doing an awesome job demystifying Snowflake internals, and he has struck once again. It's time to understand how the Snowflake query optimizer works. Even if you don't use Snowflake, I recommend this article.

The job of a query optimizer is to reduce the cost of queries without changing what they do. Optimizers cleverly manipulate the underlying data pipelines of a query to eliminate work, pare down expensive operations, and optimally re-arrange tasks.

In a nutshell, the query optimizer tries to transform a badly written 500-line query into optimized instructions for the database. To run the query, the database will need to load data into memory, and the optimizer will try to find the minimal set of data the engine needs to scan in order to answer as fast as it can.

Once the database knows exactly what to read, the optimizer rewrites the query into a more optimized but logically identical form. It replaces views or functions with their underlying physical objects, drops the unused columns (called column pruning) and pushes down the predicates. Predicate pushdown is the step where the optimizer tries to move all the data filtering (the WHEREs) as early as possible in the query.
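
To make predicate pushdown concrete, here is a toy Python sketch, nothing Snowflake-specific: both plans return the same rows, but the pushed-down plan filters before the join and therefore joins far fewer rows.

```python
# predicate_pushdown.py — toy illustration, nothing Snowflake-specific.
# Both plans return the same result; the "pushed down" plan filters
# the orders before the join, so the join touches far fewer rows.
orders = [{"order_id": i, "customer_id": i % 100,
           "country": "FR" if i % 10 == 0 else "DE"} for i in range(10_000)]
customers = [{"customer_id": i, "name": f"customer_{i}"} for i in range(100)]
by_id = {c["customer_id"]: c for c in customers}

# Plan A: join everything, then filter (what the raw SQL literally says).
joined = [{**o, **by_id[o["customer_id"]]} for o in orders]   # 10,000 joined rows
plan_a = [row for row in joined if row["country"] == "FR"]

# Plan B: predicate pushdown — filter first, then join only what survives.
fr_orders = [o for o in orders if o["country"] == "FR"]       # 1,000 rows
plan_b = [{**o, **by_id[o["customer_id"]]} for o in fr_orders]

assert plan_a == plan_b
print(f"rows joined without pushdown: {len(joined)}, with pushdown: {len(fr_orders)}")
```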

Then it does join optimization, but for that I'll let you read Teej's excellent post.

👉 Read the full article

Inside Snowflake partitions system (credits)

ML Friday 🤖

Fast News ⚡️


See you next week 👻
