Data News — Week 25.02
Data News #25.02 — New conference AI Product Day, what are AI agents, does size matter, and awesome analytics engineering content.
Happy new year ✨. I wish you the best for 2025. There are many ways to start a new year: with new projects, new ideas, new resolutions, or by just keeping the same music playing. I hope you will enjoy 2025.
Data News is here to stay. The format might vary during the year, but here we are for another year. Thank you so much for your support through the years.
Some personal news:
- I will be in Amsterdam for DuckCon on Jan 31, where I'll give a 5-minute talk about yato. If you're also going or live there, reach out so we can chat!
- We announced the AI Product Day, a 1-day conference that will take place in Paris on March 31. It will be a day dedicated to product teams who want to fully exploit the potential of AI. We are looking for sponsors and the ticketing is open. If you're interested, I have a 15% discount code: BLEF_AIProductDay25.
- We published videos from the Forward Data Conference. You can watch the keynote by Hannes, DuckDB co-creator, about Changing Large Tables.
- Over the past four weeks, I took a break from blogging and LinkedIn to focus on building nao. It was refreshing to recharge and kick off the year. There's nothing quite like diving back into the joy of hacking and creating.
Let's jump to the news. Have fun reading: it's a big wrap-up of everything that happened at the end of the year, plus how 2025 started.
AI News 🤖
The current economic uncertainties are affecting the tech and data worlds. Meanwhile, the AI landscape remains unpredictable. AI companies are aiming for the moon—AGI—promising it will arrive once OpenAI develops a system capable of generating at least $100 billion in profits.
- Why AI progress is increasingly invisible.
- It's happening: leaders want AI agents to take the jobs of human employees. NVIDIA's CEO said "IT department of every company is going to be the HR department of AI agents in the future" (cf. Keynote video), and Firecrawl, a tool for turning websites into LLM-ready data, posted a $15K job for AI agents. It's actually a modern Kaggle for agentic AI. In the end it's a mechanism to lower human labor costs because, spoiler, humans will write the code to create these agents.
- Agents — Chip Huyen wrote a very long and detailed guide about AI agents. It covers the necessary tooling, how planning works with agents and how you evaluate them. This is great quality material, to be honest. There is also a Google introduction video about AI agents.
- Don't do RAG, use CAG — A paper about another way to think about information retrieval for AI knowledge tasks: cache-augmented generation. The goal is to use a key-value (KV) cache that eliminates the latency issues traditional RAG might incur.
- Does size matter? — Paper written by Gael Varoquaux (sklearn), Meredith Whittaker (Signal) and Alexandra Sasha Luccioni (HuggingFace) about the negative impact of the bigger-is-better paradigm. It's easy to read (fairly long, ~10 pages) and gives metrics about the performance plateau that we start to see at scale.
- A large international scientific collaboration released The Well: 2 massive datasets, ranging from physics simulations (15TB) to astronomical scientific data (100TB). They aim to produce the same innovation as ImageNet produced for image recognition.
- Model news and tour
- DeepSeek-v3 — It entered the space with a bang. DeepSeek is a model trained by the Chinese company of the same name, which competes directly with OpenAI and the others to build foundational models. They released v3 as open source and it outperforms every other model.
- OpenAI o3 — OpenAI announced their advanced reasoning model called o3, which can tackle large tasks. o3 is kind of a waste of energy when you look at the estimated carbon impact (derived from kWh), based on François Chollet's research on ARC-AGI benchmarks.
- smolagents — HuggingFace released a barebones library for agents. Agents write Python code to call tools and orchestrate other agents.
- ❤️ Transformers laid out — The best article out there to understand Transformers (which are key to understanding LLMs).
- Things we learned about LLMs in 2024 — I discovered Simon's content during the Christmas break and, to be honest, it's some of the best out there. He compiled a list of things we learned in 2024 about LLMs. This is a must-read.
Fast News ⚡️
- O'Reilly 2025 tech trend report — Every year O'Reilly releases a report based on the searches made on their learning platform. From the traffic they get, they draw market trends. A few things to notice:
- Interest in AI grew by 190%, Prompt Engineering by 456%.
- Python and Java still lead the programming language interest, but with a decrease in interest (-5% and -13%), while Rust is gaining traction (+13%). Not sure it's related, though.
- Read the PDF version directly. It's not really digestible.
- The limits of data — The article argues that while data's universality offers significant power, it often sacrifices contextual nuances, leading to oversimplified representations of complex human experiences. (Generated by AI, as the article is too long for me to read at the moment.)
- The hidden cost of over-abstraction in data teams — An interesting take on the layers of abstraction we tend to create when building data platforms. Whether it's dbt macros, CLI wrappers, etc., in the end it often adds more complexity than it removes.
- Designing a declarative data stack: from theory to practice — Related to the previous article, Simon wrote a great piece about the things to keep in mind when building a proprietary DSL for a declarative data stack. Meaning: a YAML configuration system for ingestion and transformations and, now, visualisation with BI-as-code.
- Lessons learned implementing Metrics Trees — An article in the form of an interview with someone who implemented a metrics tree. It's mainly not about the visual representation but about the process of translating business needs into an equation (I mainly see the tree as an equation).
- The evolution of OLAP — What is OLAP in the modern data stack? Since Cube is preaching for its own tooling, the semantic layer is obviously the OLAP layer.
- Hybrid Kimball & OBT data modeling approach — This is maybe the most common setup I've seen over the last 3 years: a combination of star schema with OBT in the marts for ease of consumption.
- Analytics engineering at Netflix — (and part 2). An internal survey of analytics engineering practices at Netflix. They developed a /data command internally that answers questions about everything, and structured the analytics around a foundational data platform with a company-wide analytics data layer that provides time series efficiency metrics across various business use cases.
- The future of data querying with Natural Language — What are all the architecture blocks needed to make natural language querying work with data (esp. when you have a semantic layer)?
- Hard data integration problems — As always, Max describes reality in the best way. He listed the 4 most difficult data integration tasks: from mutable data to IT migrations, everything adds complexity to ingestion systems.
- Materialization of data warehouse layers — What are the considerations behind each materialisation you pick in your data warehouse layers: views, tables, schemas vs. databases, etc.
- The software development lifecycle within a modern data engineering framework — A great deep-dive about a data platform using dltHub, dbt and Dagster.
- The best code is the code you never wrote — Every line of code is a form of debt—a liability that must be maintained and understood. As we move toward a future dominated by AI-generated code, the balance will shift dramatically, making human-written code an increasingly scarce and valuable resource.
- Canva Snowflake journey:
- DuckDB vs. coreutils — A side by side comparison of DuckDB counting features with grep and wc.
- Amazon S3 Tables — A few weeks ago Amazon released S3 Tables, out-of-the-box support for Iceberg within S3. It arrived with a bang, but also with a few concerns.
- Databases in 2024, a year in review — Last year was mainly about licensing issues, Databricks vs. Snowflake, and DuckDB trying to dethrone pandas as the default.
- How Uber is saving 140,000 hours/month using text-to-SQL — Within their tooling they developed QueryGPT, which helps Uber employees find the best stuff according to their needs. They have a table agent that lists the best tables according to the user's intent.
- A tool to make data stack diagrams — Great tool to draw stack diagrams.
- Staff Engineer vs Engineering Manager — One of the best articles on the topic. When it comes to expertise, staff engineers are still rare. How should we structure the career ladder to make it work?
- Claude Sonnet 3.5 within Snowflake Cortex.
- Journey with Apache Flink & Flink CDC.
I still don’t think companies serve you ads based on spying through your microphone.
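A side note on the Metrics Trees item above, since I said I mainly see the tree as an equation. Here is a minimal sketch of what I mean, in plain Python. Everything in it is hypothetical and made up for illustration (the `Metric` class, the metric names and the numbers); it is not taken from the article. Each leaf is an input you measure, and each parent is a formula over its children, so evaluating the root is just evaluating the equation.

```python
import math
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass
class Metric:
    """Hypothetical node of a metrics tree: a measured leaf or a formula over children."""
    name: str
    value: Optional[float] = None  # set on leaves only
    children: list["Metric"] = field(default_factory=list)
    combine: Optional[Callable[[list[float]], float]] = None  # set on parents only

    def evaluate(self) -> float:
        # A leaf simply returns its measured value.
        if not self.children:
            if self.value is None:
                raise ValueError(f"leaf metric {self.name!r} has no value")
            return self.value
        if self.combine is None:
            raise ValueError(f"metric {self.name!r} has children but no formula")
        # A parent applies its formula to the evaluated children.
        return self.combine([child.evaluate() for child in self.children])


# Made-up example: revenue = orders * average_order_value
#                  orders  = visitors * conversion_rate
visitors = Metric("visitors", value=1000.0)
conversion_rate = Metric("conversion_rate", value=0.05)
average_order_value = Metric("average_order_value", value=40.0)

orders = Metric("orders", children=[visitors, conversion_rate], combine=math.prod)
revenue = Metric("revenue", children=[orders, average_order_value], combine=math.prod)

print(revenue.name, "=", revenue.evaluate())
```

Written this way, the drawing is secondary: the business discussion is really about which children and which formula sit under each node.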
Economy 💰
- Boomi acquires Rivery in a $100m deal.
- Snowflake acquires Datavolo.
- Databricks raises $10B towards an IPO.
- Cursor raised $100m in Series B.
I missed you ❤️ — and see you next week.