Data News — Week 23.24
Data News #23.24 — AI Act, testing in dbt, data journey manifesto, SO survey, CDC with Clickhouse and fundraising.
Hello, after the good weather comes the storm. I'm now under the Berlin rain with 20°. When I write in these conditions I feel like a tortured author writing a depressing novel while actually today I'll speak about the AI Act, Python, SQL and data platforms. Casual day at the office finally.
Some personal news: next Monday and Tuesday I'll be at Berlin Buzzwords. If you're there, ping me, it would be a pleasure to meet and hang out together.
There are still seats for the June Airflow Paris Meetup (in French).
AI 🤖
- AI Act 🇪🇺 has been voted through by the European Parliament. Also called GDPR 2.0, the AI Act is meant to regulate the usage of AI in tomorrow's world. It has been widely criticised by lobbyists, companies and developers. I'm not informed enough so I'll wait before giving my opinion on it.
- I-JEPA: The first AI model based on Yann LeCun’s vision for more human-like AI — Meta is in a frenzy to release new models. Yann's vision goes toward AI systems learning and reasoning like animals and humans.
- Deep multi-task learning and real-time personalisation for closeup recommendations — Pinterest still doing deep learning.
- Last week I shared nice QR codes generated with ControlNet. This week someone released a model on HuggingFace to do it, QR Code Conditioned ControlNet (not related to the original Chinese paper), and you can even use the generator web UI.
- DashQL – Complete analysis workflows with SQL — A crazy paper about a new language that mixes SQL with analyses and graphs. It looks sexy but my brain can't read a 9-page PDF without overheating.
- Model and Data Versioning: An Introduction to mlflow and DVC — If you want to understand model versioning this is for you.
Data and Analytics Engineering 🧑‍🔧
- Testing frameworks in dbt — Robbert developed a small framework to do tests in dbt. Mainly, he unit tests macros (the logic) with his framework and tests data with Soda and dbt contracts.
- The data journey manifesto — DataKitchen wrote a manifesto to put principles on the data journey to avoid the mess in production. There are 11 principles and 11 new ideas to create a healthy platform. For instance, you should not trust your data providers, and what worked last week will not work today.
- Why data consumers do not trust your reporting — It is a good illustration of the data journey manifesto. Stakeholders often notice data issues before the data team does, which destroys any confidence they may have in the numbers. Lucas proposes many root causes, and one of them is that data warehouses are mutable. The past often changes, whether because of code or data. This is metrics drift.
- Data Documentation 101: Why? How? For Whom? — Marie wrote best practices for establishing complete and reliable data documentation. The first piece of advice is about the documentation's readers: data team, business users or other stakeholders.
- Change Data Capture (CDC) with PostgreSQL and ClickHouse — This is a nice vendor post about CDC with Kafka as the movement layer (using Debezium). The post explains well the architecture you need to make it work.
- A deep dive into graph analytics — Petrica tries and showcases Memgraph in a long-form post. I'm fond of graph visualisations and analytics, as well as maps.
- Experimenting at Scale, the Spotify Home way — Simple principles to run a good ol' experiment at Spotify scale.
- The ultimate SQL guide — After the last canva on data interviews, here's a canva to learn SQL. It goes from an introduction to databases to writing SQL, and covers simple SELECT as well as advanced concepts. This is neat.
- The power of pre-commit and SQLFluff — SQL is a query language used to retrieve information from data stores, and like any other programming language, you need to enforce checks at all times. This is where you should use pre-commit and SQLFluff.
- Metis: building Airbnb’s next generation data management platform — The new manifesto for every data governance company /S.
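To make the metrics drift idea from the reporting item above a bit more concrete, here is a minimal sketch. It assumes you keep snapshots of a metric as it was reported on different days; the function name, the `tolerance` parameter and the revenue numbers are all hypothetical and only there for illustration, they do not come from Lucas' post.

```python
from typing import Dict, List


def find_metric_drift(
    previous: Dict[str, float],
    current: Dict[str, float],
    tolerance: float = 0.0,
) -> List[str]:
    """Return the past dates whose reported value changed between two snapshots."""
    drifted = []
    for day, old_value in previous.items():
        new_value = current.get(day)
        # A date that disappeared or whose value moved beyond the tolerance has drifted.
        if new_value is None or abs(new_value - old_value) > tolerance:
            drifted.append(day)
    return sorted(drifted)


# Hypothetical daily revenue, as reported last week and as reported today.
last_week = {"2023-06-01": 1000.0, "2023-06-02": 1200.0}
today = {"2023-06-01": 1000.0, "2023-06-02": 1150.0, "2023-06-03": 900.0}

print(find_metric_drift(last_week, today))  # ['2023-06-02']
```

New dates in the latest snapshot are ignored on purpose: only the past that has changed counts as drift here.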
PS: I just split the Fast News to have a smaller one. Fast News contains lighter news and broad articles.
Fast News ⚡️
- Stack Overflow developer survey 2023 — Every year SO sends a survey to developers and it gives a great overview of technology usage across the space. This year ~90k people answered, and they also added a small AI category to measure its impact on dev work.
What we see related to data engineering is mainly: Python and SQL are still shining at the top of technology popularity, around 50% use them. Thanks to the AI hype, Python is the second most desired technology behind JavaScript, which augurs well for the future. They also share salary figures, and data engineering and data science roles are well placed in the ecosystem: the best-paid jobs in Germany after management positions, but less well paid in the US.
- Generating income from open source — Vadim shares how he makes money from all the different open-source projects he has. He shares what works and what does not work. In the post he also shares the journey of the Sidekiq founder, who's making $10m ARR alone.
- You can put spaces in BigQuery column names — The editors of blef.fr (me) have no comment. In fact, yes, you are all crazy?
- Malloy's Near Term Roadmap — I recently shared a Malloy demo, which was awesome. The article shares the recent features and also says something I will never forget: "Malloy aims to be syntactically the same no matter what database contains the data".
- The Astro Cloud IDE — Astronomer released a bunch of Airflow operators to their Cloud IDE (which was released in Dec. but I missed it). I get why companies want us to use their Cloud IDE, but I hate this trend. Leave me alone in my PyCharm.
- Cube announcements: Data Graph and Orchestration API — These are 2 announcements from Cube. I really like following them because they are thought leaders in the semantic layer space. Data Graph creates an entity diagram from the semantic definition, while the Orchestration API offers you an endpoint to launch pre-aggregation jobs from your scheduler.
Data Economy 🤖
- Graphext raises $4.6m in a seed round (its second) to continue developing a data analysis platform built for exploration. The Spanish startup develops a tool where you quickly explore datasets and then build charts or AI models on top of them. Last year they built a graph with Data News links, in which we clearly see the different content categories I share.
- Telmai raises a $5.5m seed round. A new data observability platform enters the space. It looks like they propose the same features as the competition: add your datasources, get automated alerts on data drifts.
- At the same time Masthead raises $1.3m, also as a data observability platform, but done differently. Masthead does not run SQL on your data, which would increase costs, but reads logs and metadata to identify anomalies.
- Informatica acquires Privitar. This consolidation will bring new features to Informatica. As a reminder, Informatica was founded in 1993 and is one of the dinosaurs in the ETL space. Privitar will bring "data security" stuff.
See you next week ❤️