Data News — Week 24.15
Data News #24.15 — MDSFest quick recap, LLM news, Airbnb Chronon, AST, Beam YAML, WAP and more.
I hope this Data News finds you well. In today's edition we have a large selection of links that I think you will enjoy.
But first I want to welcome all the new members joining this week after my new episode on DataGen with Robin Conquet. The episode is in French and we talked mainly about the possible end of the modern data stack, a topic I already condensed in a post a few weeks ago (in English).
MDS Fest 🥳
As announced last week, I took part in MDS Fest 2.0 this Thursday. I shared my journey with Apache Superset and why I consider Superset the best open-source alternative when it comes to building BI applications.
Yes, because you should stop building dashboards and build BI apps instead. This is part of the productisation of data, but mainly I think you should consider your BI tools as a way for your users to interact with data and not only to monitor metrics. With the customisation it allows, Superset is the best tool for it.
You can have a look at my slides or watch the replay on YouTube.
A lot of other talks took place at the same conference. Here is a small selection you should check out:
- How to pivot your data team from a service team to a value-generator — Very often data teams struggle to deliver value or to find their real identity. Taylor identifies patterns and gives great advice to help you find it.
- Data contracts: federated data governance — Another talk by Chad about data contracts, always on point in describing the pains around the "data supply chain".
- Deliver reporting in pure SQL with dbt + Evidence — A great showcase of what you can build with Evidence (a BI as code solution).
- Build analytics at Hive.co — The journey Oleg and his team went through to implement a modern data stack. They used RFCs to document where they were heading.
PS: Apache Superset is going 4.0 this week with a lot of new features.
AI News 🤖
- Open-source LLMs for everyone — A great post from the Siemens AI team about open LLM initiatives that bring new uses to the dev workflow. Whether it's code completion or pull request and crash report summarisation, it looks neat.
- Building LLMs for code repair — Replit is an AI-driven workspace for developers; think of a supercharged IDE. They wrote a blog post about what they developed to create LLM-driven fix suggestions on top of LSP (Language Server Protocol), a protocol between your IDE and a server that understands and analyses the code to find errors or highlight it.
- LLM DataGen — A small demo of an LLM based on Gemma that generates JSONL from a given name. It does not work super well, and it would be better if we could specify the column names and types for instance, but it showcases another great usage example of generative algorithms.
- Meta, building an infrastructure for the future — It explains how Meta is partnering with GPU vendors to design new chips and how incredibly hard it is to connect thousands of GPUs in a cluster where everything can fail at any moment.
- Can Gemini 1.5 actually read all the Harry Potter books at once? — A nice Graphviz chart spotted on Twitter that maps all the Harry Potter relationships in a single poster, generated by Gemini from the content of all the books. Obviously Gemini already knows some Hogwarts lore from its training, but this is still impressive. Sadly we don't have the complete prompt / code.
- Speaking of prompts, PromptLayer organised a tournament and they blogged about their favourite prompts of the competition. Once again, speaking to an LLM is like speaking to children: USE CAPITAL LETTERS TO CAPTURE THEIR ATTENTION.
- OpenAI open-sourced a lightweight library to evaluate language models — you can use 7 different evals and check the results on OpenAI or Claude models.
- Mixtral-8x22B is out — a new model from Mistral that does something probably awesome.
Fast News ⚡️
- Last week I forgot to share the 2024 state of analytics engineering by dbt Labs. From a quick look, it depicts well the current trends I also see in my local market, and it shows that in 2024 more than 50% of the time spent by data practitioners goes to maintaining or organising data assets.
- Introducing Beam YAML — Apache Beam is a unified processing framework (meaning it unifies streaming and batch) that runs on many different engines. They now introduce Beam YAML, a way to write pipelines declaratively. It reminds me of Kestra so much.
- How I became an AST convert — I've been an AST convert for a long time, and I'm so happy someone writes about this. AST stands for abstract syntax tree, an abstract representation of code written in a language; this is what SQLGlot (and hence SQLMesh) builds from SQL. Afzal explains in this post what it means, especially in a diff context. You can also listen to Toby's interview on Joe's podcast about SQLMesh and SQL transformations.
- Airbnb open-sources Chronon — A data platform serving AI/ML applications. In Chronon you define sources and groupby (a collection of aggregations on keys), which in the end represent features, and then the platform handles the downstream management.
- BigQuery releases data canvas — This is a large open canvas (like count.co) in which you can write SQL queries, assisted by Gemini, and link them in a DAG fashion.
- Write-Audit-Publish pattern in modern data pipelines — A pattern that deserves to be better known because it can prevent your pipelines from pushing wrong data into your users' tools.
- Using Preset (Superset) to explore the dbt Cloud semantic layer — You can configure a sync between both clouds or use a CLI, then you will be able to explore metrics in your BI tool.
- Book review of DuckDB in Action.
- Efficient CSV parsing — A YouTube talk about the DuckDB CSV parser and what it means to parse unstructured files.
- To conclude this edition, two project walkthroughs:
✨ s/o to Hugo, who runs a weekly data round-up. This week he published before me, so you can also check his great selection of links.
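To make the Write-Audit-Publish pattern mentioned above a bit more concrete, here is a minimal sketch in Python. It is not taken from the linked article: the SQLite database, the `orders` and `orders_staging` tables and the two quality checks are all assumptions I made up for illustration. The idea is simply that a new batch lands in a staging table first, gets checked there, and only replaces the table your users query if the checks pass.

```python
import sqlite3


def write_audit_publish(conn, rows):
    """Return True if the batch was published, False if the audit rejected it.

    Table names and checks are hypothetical, chosen only for this sketch.
    """
    cur = conn.cursor()
    cur.execute("CREATE TABLE IF NOT EXISTS orders (id INTEGER, amount REAL)")
    cur.execute("CREATE TABLE IF NOT EXISTS orders_staging (id INTEGER, amount REAL)")

    # Write: land the batch in a staging table that users never query.
    cur.execute("DELETE FROM orders_staging")
    cur.executemany("INSERT INTO orders_staging VALUES (?, ?)", rows)

    # Audit: count the rows and the rows that break a quality rule.
    total, bad = cur.execute(
        "SELECT COUNT(*), "
        "COALESCE(SUM(id IS NULL OR amount IS NULL OR amount < 0), 0) "
        "FROM orders_staging"
    ).fetchone()
    if total == 0 or bad > 0:
        # Keep the rejected batch in staging for debugging; orders is untouched.
        conn.commit()
        return False

    # Publish: replace the user-facing table in the same transaction.
    cur.execute("DELETE FROM orders")
    cur.execute("INSERT INTO orders SELECT id, amount FROM orders_staging")
    conn.commit()
    return True


conn = sqlite3.connect(":memory:")
first = write_audit_publish(conn, [(1, 10.0), (2, 25.5)])  # clean batch
second = write_audit_publish(conn, [(3, -4.0)])  # negative amount fails the audit
published = conn.execute("SELECT id, amount FROM orders ORDER BY id").fetchall()
```

In a real warehouse the publish step would more likely be a table swap, a partition exchange or a branch merge, but the three stages stay the same.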
See you next week ❤️
blef.fr Newsletter