
Data News — Week 25.10

Data News 25.10 — Super large edition, all new models releases, events, dbt Core vs. SQLMesh, benchmark your data team, and more.

Christophe Blefari
9 min read
Joy (credits)

Hello here ☀️

I feel ashamed for not posting any Data News for the last 2 months. A lot was going on and I didn't manage to find time every Friday to write the news; I'm so sorry about it.

Hello to all the new subscribers who arrived since January, I want to warmly welcome you ❤️. This is your first Data News ever, enjoy the moment, read whatever you feel curious about, at your own pace.

At the moment, I don't want to promise a return to our regular weekly schedule, but I'm trying as hard as I can to organise my new routines and life as a content creator and company founder.

I've always worked on multiple projects at the same time, but since I started nao, things have changed. There's a truth you only grasp once you've lived it: you think about your company all the damn time.

Events 🪭

While being less present online, I've done a lot of things in real life over the last weeks, and I'll continue to do so in the weeks to come. I was at DuckCon #6 in Amsterdam to talk about yato, the smallest DuckDB SQL orchestrator, and Robin published 3 podcast episodes—in French—that I hope you'll listen to while running this weekend 🤭:

At the end of the month, on March 31st, I'll co-organise the AI Product Day in Paris. We are sold out, but we still have slots for sponsors if you want to help us organise the event and get massive visibility with AI and product teams.

I'm going to Barcelona 🇪🇸 (from March 19 to 22)—I'd love to hang out with data people there. I'll give a talk at French Tech Barcelona on March 20; you can register here. I might plan a day trip to Madrid (?).

The biggest news for a data fan is that I'll also be at Data Council this year as your news reporter on duty 🤓. So if you plan to go, or if you're in San Francisco around April, let's have a coffee.

AI News 🤖

Running after all the model releases (credits)

The pace of change is nothing short of extraordinary. I haven't published in two months, and it feels like two years. Here's a recap.

  • Timeline of the major model releases. When I say major, it's obviously subjective: it's mainly related to the noise they made online.
    • phi-4 — Microsoft continues to release their small open models. However, I've never come across anyone using them.
    • DeepSeek R1 — DeepSeek is a Chinese startup building foundational models. They released v3 previously, then R1, a reasoning model that made OpenAI and other American AI companies panic because DeepSeek claimed major cost reductions in model training. Moreover, DeepSeek's code and models are open source under an MIT license. R1 is built on top of v3 using reinforcement learning combined with chain-of-thought (CoT) to "reason".

      Hugging Face created open-r1, a fully open (for whatever that means) version of R1, in Python, where every step is detailed.

      There is also a good analysis of DeepSeek vs. the world.
    • Mistral Small 3 — A small model that can be used to do CoT, under an Apache license.
    • Google Gemini Flash 2.0 — Multimodal reasoning.
    • Anthropic Claude 3.7 — Claude 3.5 has been by far the most used model for code generation over the last 6 months. 3.7 should be an uplift, but to be honest I feel it's not. They also released Claude Code, an agentic coding tool you can use from the command line to make changes, commit, and fix issues. Simon created the pelican-on-a-bicycle test, which is fairly good for evaluating models.
    • Alibaba Qwen2.5 — Nothing much to say to be honest.
    • OpenAI o3-mini — OpenAI's fast reasoning model series.
    • Grok 3 — Musk fans say it's the best model.
    • OpenAI GPT 4.5 — One month later OpenAI released GPT 4.5, and I feel like a teleshopping presenter. We now have a selector with 6 models in ChatGPT; I'm kinda lost, to be honest. There is also the pelican test.
    • OpenAI deep research — OpenAI's mode to replace McKinsey consultants or PhDs, because why not.
    • Mistral OCR — The promise is crazy; you can have a quick look at their examples. From a PDF or a photo it can extract information in a usable form. It's even "multimodal" because it keeps figures in the output.
  • ChatGPT for macOS can interact with your code (in the IDE) — in the demo it works with Xcode or VS Code and directly changes the files on disk, so the changes appear in your editor.
  • Claude 3.7 plays Pokemon on Twitch — finally something useful.
  • I'm not a Perplexity user, but I see more and more people switching their Google search usage to Perplexity, which announced a new web browser called Comet and a deep research mode. Deep research has been built to generate ideas, summaries, or takeaways.
  • Streaming AI agents: why Kafka and Flink are foundations — A small bridge to the data engineering world.
  • Building effective AI agents — This is a great article from Anthropic if you want to learn how to build AI agents. It explains the flow between the user, the UI, and the LLM well.
  • 📺 Deep dive into LLMs — 1.5M views, and it has been recommended by a lot of people, so it should be good (I haven't watched it). It goes from GPT-2 to DeepSeek R1 and gives a mental model of what an LLM is.
  • RAG stuff
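The agent flow that Anthropic's article describes—model receives the conversation, either answers or requests a tool, tool results get appended back into the context—can be sketched as a simple loop. This is a toy illustration, not Anthropic's implementation: the `fake_llm` stub and the `get_weather` tool are invented for the example; in practice the stub would be a real model API call.

```python
# Toy sketch of a basic agent loop. The "LLM" is a hard-coded stub
# standing in for a real model API; tool and function names are made up.

def fake_llm(messages):
    """Stand-in model call: requests a tool once, then answers."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool": "get_weather", "args": {"city": "Paris"}}
    return {"answer": "It is sunny in Paris."}

# Registry of tools the agent is allowed to call.
TOOLS = {"get_weather": lambda city: f"sunny in {city}"}

def run_agent(user_input, max_steps=5):
    messages = [{"role": "user", "content": user_input}]
    for _ in range(max_steps):
        reply = fake_llm(messages)
        if "answer" in reply:  # the model decided it is done
            return reply["answer"]
        # Execute the requested tool and feed the result back into context.
        result = TOOLS[reply["tool"]](**reply["args"])
        messages.append({"role": "tool", "content": result})
    return "stopped: too many steps"
```

The `max_steps` cap is the one non-obvious design choice: without it, a model that keeps requesting tools would loop forever.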

dbt Core and SQLMesh, wat 🧭

dbt Core has become one of the most used tools across data teams all around the world. Because of its success, companies may feel the dbt fatigue: it happens when your dbt project has been a success and spread widely within the company, leading to A LOT of tables—we call them models, the dbt way.

When you have a lot of tables, dbt projects tend to become less manageable: the CLI becomes slow, the local development experience isn't great, and more and more features are going into the Cloud version. SQLMesh was created to fix dbt Core's issues and to compete with dbt Cloud.

A few weeks ago, dbt Labs acquired SDF—which I had been watching closely for more than a year, see DN#24.07. SDF is a Rust binary that understands dbt projects and speeds up everything, with performance gains of up to 100x. Under the hood, SDF parses the SQL queries, builds syntax trees, and compiles and executes them to find issues before they even hit the data warehouse.
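To give a feel for what "finding issues before they hit the data warehouse" means, here is a toy sketch that compiles a query against an empty in-memory SQLite schema instead of the real warehouse. It is only an illustration of the idea, not how SDF works internally; the table and queries are invented for the example.

```python
import sqlite3

def validate_query(schema_ddl, query):
    """Return None if the query compiles against the schema, else the error message."""
    conn = sqlite3.connect(":memory:")  # empty stand-in for the warehouse
    try:
        conn.executescript(schema_ddl)   # create empty tables matching the schema
        conn.execute(f"EXPLAIN {query}") # compile the query without touching any data
        return None
    except sqlite3.Error as exc:
        return str(exc)
    finally:
        conn.close()

# A hypothetical warehouse table for the example.
schema = "CREATE TABLE orders (id INTEGER, amount REAL, created_at TEXT);"

validate_query(schema, "SELECT id, amount FROM orders")    # valid: returns None
validate_query(schema, "SELECT id, amouunt FROM orders")   # typo: returns an error message
```

The typo in the second query is caught locally, at compile time, instead of failing minutes later inside the warehouse; tools like SDF apply the same principle at the scale of a whole project.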

We will know very soon what this acquisition brings to dbt, and we all hope the best improvements land in the open-source codebase (spoiler: not sure).

On the other side, SQLMesh answered with the acquisition of Quary, a Rust-savvy team that made significant improvements to SQLGlot, the SQL parser underlying SQLMesh.

There is fierce competition between the two companies, and shots are being fired. The SQLMesh team is also organising GROUP BY, their annual conference, in a few days; any resemblance to another event is fortuitous. This week Tobiko also published a benchmark claiming that SQLMesh on top of Databricks delivers 9x cost savings.

Time will tell where this leads, but ultimately it will benefit data professionals as they strive to build the best SQL orchestrators. However, I believe there are still unresolved issues with the developer experience in the age of AI—challenges that I'm actively working to address with nao 🤭.

Will dbt stay on top? (credits)

Fast News ⚡️

🕵️ What if you could become a SQL detective: SQL Noir. It's a fun game to practice your SQL skills by solving mysteries.

Data Economy 💰

Because it's already too long, only headlines.


Sorry for the large edition; I also feel a bit rusty after 2 months of not writing. See you soon folks ❤️.
