Lost between ideas (credits)

Hey, new Data News edition. I hope you will enjoy this week's selection after I skipped last week's. I was a bit overwhelmed by the number of tasks on my desk, and I still am. But here we are.

Before jumping into the news, I want to let you know that I have improved the Recommendations page, and the weekly emails with the recommendations should arrive soon. The new page has better mobile support and gives you titles and overviews of links, which are generated by GPT-4.

I'll speak at MDS Fest 2.0 next week, on April 10. MDS Fest is a free, virtual, 5-day conference about Modern Data Stack topics with a lot of awesome speakers, and there are a few talks I can't wait to watch. On my side, I'll talk about Apache Superset and what you can do to build a complete application with it.

AI News 🤖

LlamaIndex slides, examples with Mistral AI — A few slides with a lot of examples of how you can use LlamaIndex with Mistral AI models. I guess there is a video associated with the slides, but I don't have it. It shows a few RAGs, agents and document parsers to retrieve the data you need.
DBRX, a new state-of-the-art open LLM — Databricks has to be an AI company (bragging vs. Snowflake). This week they released a new open model that performs great.
How we built Text-to-SQL at Pinterest — Pinterest open-sourced a tool called Querybook that they use to access Pinterest data every day. To boost usage, they developed a text-to-SQL feature. This article does a great job of explaining how they did it.

Fast News ⚡️

MAD 2024 landscape — The new edition of the Machine learning, AI and Data Landscape is out, with many logos and, obviously, changes since last year with the GenAI hype. I have not analysed the new map yet, but I'll try to do it soon.
Polars on GPU — Polars announced that they are collaborating with RAPIDS to bring GPU performance to Polars and reach another summit. It looks cool to be able to switch Polars engines like this. If you are using Polars, reach out to me; I'm curious to know how people are using it.
Git in Snowflake — Snowflake is getting more and more features month after month, becoming a complete suite of applications directly reachable from SQL in your warehouse. It reminds me of Oracle and I don't like this centralisation, but the future always seems to go the bundling way. Now you can read from a Git repository when creating a procedure in SQL.
DuckDB is the new jq — The author shows how you can manipulate a JSON file with a DuckDB one-liner. I really like this take; it gives a great perspective on DuckDB and how you can use it locally for fast manipulations. And unlike jq, which has a non-trivial syntax, DuckDB is SQL.
Survey about query engines used by companies — Data Council happened recently. It's a US-based conference that I really like because the talks and ideas discussed there often shape what we do in the data industry, at least from what I see in the YouTube videos, as I have never been there. A speaker ran a survey during his keynote about the query engines used by the audience, and Spark is still leading, ahead of BigQuery/Snowflake/Athena.
How we built a 19 PiB logging platform with ClickHouse — ClickHouse is a tech company and you can see it from the blog post. In this article they explain in depth why they chose ClickHouse to monitor their ClickHouse Cloud offering, saving money on their Datadog bill.
❤️ Building open data portals in 2024 — David open-sourced an end-to-end framework to build open data portals. This is awesome (example): you can easily ingest, transform and share data. It looks like yato, but with many more features pieced together to create a local-first data platform.
Spotify, data platform explained — The beginning of a series explaining the Spotify data platform.
A path from data mess to data mesh — 5 key principles you should apply to avoid the data mess.
Semantic layers, a buyers guide — This is an exhaustive comparison between the dbt Cloud metrics offering and Cube. In a nutshell, I'd say that neither technology is mature yet, with a small advantage for Cube for being open.
The data analyst every CEO wants — I really like this blog post from Benoit. He gives practical advice about what you have to focus on if you're working as a data analyst for the C-level of your company.
YAML developers and the declarative data platforms — A good introduction to why declarative languages are perfect for creating data platforms. To be honest, I think this is a topic that can be used to compare good and great data engineers. Creating a declarative data platform is easy, but creating the right level of abstraction that describes reality without creating debt and over-engineered solutions is much harder.
PR review tool for dbt projects — A nice tool creating visual representations comparing 2 dbt artifacts that you can embed in a CI to validate changes before they get merged into production code.
When is the data model finished? — Spoiler: a data model is never finished. You need to depict your company's business and activities; as time goes on, activities grow, and obviously you have to manage this asset over time.
A training set for bike sharing forecasting — Max has created a large dataset of bike sharing providers in ~50 cities around the world. If you want to play with DuckDB and visualisations this is a good start.
Calculating walking isochrones in Python — Cool way to produce Python viz.

See you next week ❤️

Don't forget to check out the new Recommendations page (below is an overview of mine).