Data News — Week 22.48
Data News #22.48 — Very fast news: Advent of Data debuts, Snowflake sends emails, and dashboards get deprecated.
Hey you, this is an unusual Saturday and I'm terribly late with this newsletter. This week I had a huge amount of work to deal with, and we launched the Advent of Data, your daily spark of data in December. Thanks to everyone who agreed to participate; we've already published the first 3 articles and I can't wait to read everything else the writers are working on.
In a nutshell, the first 3 articles are:
- MLOps isn’t DevOps for ML — Abi pushes back strongly against a thenewstack.io piece claiming that the machine learning field should recruit DevOps practitioners to fill the shortage of people in ML operations.
- Using Airflow the wrong way — An experimental article I wrote where I explore Airflow as a framework rather than an all-in-one scheduler/orchestrator tool. What if we decided to schedule Airflow DAGs from GitLab? (See the sketch after this list.)
- Modern Data Modeling: Start with the End? — Brice wrote about dbt project structure and the foundations of a working data model.
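To give a flavour of the Airflow article without spoiling it: one way to "schedule Airflow DAGs in GitLab" is a GitLab scheduled pipeline that calls Airflow's stable REST API to trigger a DAG run, so GitLab owns the schedule and Airflow only executes. A minimal sketch, assuming Airflow 2 with the REST API and basic auth enabled; the URL, credentials and DAG id are placeholders, an illustration rather than the exact setup from the article:

```python
# Trigger an Airflow DAG run from a GitLab CI job, letting the GitLab
# pipeline schedule decide *when* DAGs run. Placeholders throughout.
import os

import requests

AIRFLOW_URL = os.environ["AIRFLOW_URL"]  # e.g. https://airflow.example.com
DAG_ID = "daily_sales_export"  # hypothetical DAG name

response = requests.post(
    f"{AIRFLOW_URL}/api/v1/dags/{DAG_ID}/dagRuns",
    auth=(os.environ["AIRFLOW_USER"], os.environ["AIRFLOW_PASSWORD"]),
    json={"conf": {"triggered_by": "gitlab-ci"}},  # passed to the DAG run
    timeout=30,
)
response.raise_for_status()
print(f"Triggered: {response.json()['dag_run_id']}")
```

Run it from a `.gitlab-ci.yml` job on a pipeline schedule, and the DAGs themselves stay schedule-less (e.g. `schedule=None`).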
On a side note, we are 200 members away from 2,000, and it would be an awesome gift to reach that number before next year. So if you like the newsletter, maybe recommend it to your co-workers 🙃.
Fast News (very fast this time) ⚡
Because I want to deliver the news as soon as possible after my initial delay and a few IRL adventures—I'm currently stuck between Germany and France—this edition will only be a collection of bullet points with opinions.
- Joe Reis launched his Substack — Joe is the co-author of the great Fundamentals of Data Engineering, and his blog already has 2 articles I deeply recommend: No extra credit for complexity & Groundhog Days. The first can be summed up as "aim for simplicity in every system you build", while the second tries to answer how data can be believable and add value.
- Hey Snowflake, send me an email — Christmas is sometimes the time of year when magic happens. And once again it happened. Felipe showcases how you can send an email from Snowflake. But, let's be honest, I hate the way it has to be done: stored procedures, meh, feels like Oracle. (A small sketch follows at the end of this list.)
- How DoorDash secures data transfer — This is network stuff, but still interesting for a large share of data engineers. DoorDash needed to route traffic from AWS to the on-premise datacenters of their payment vendors.
- The thrill of deprecating dashboards — Last week, while summarizing my data dream team presentation, I said that every data team should clean up their BI tool every 6 months. This week Sarah shares a few tips on how to do it, from dumping BI data into the warehouse to building a stats report before cleaning.
- How data and finance teams can be friends — As weird as it sounds, the data team is sometimes seen as an annoying stakeholder by business teams, often because we are torn between seeking engineering stability and keeping a fast delivery pace to follow growth. This post tries to show how data teams should work with finance teams to avoid this friction.
- How to manage data teams, build a reliable platform & ensure data quality — 20 bullet points split into 4 categories. Anna shares a good checklist for building data team foundations.
- Introduction to Data Modeling — Data modeling is the trendy skill everyone wants to learn today; dbt and the Analytics Engineering wave have put modeling back in the front seat. This article by the Preset team is a good introduction.
- A bot that watched 70,000 hours of Minecraft could unlock AI’s next big thing — Yes, generative models are fun, but have you watched imitation or reinforcement learning at work? It's crazy how impressive these kinds of models are, and I really like that video games serve as the playground for this research.
PS: the same thing is happening in Rocket League, and it's just as impressive.
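Back to the Snowflake email trick: the core of it is the SYSTEM$SEND_EMAIL system stored procedure. A minimal sketch calling it through the Python connector, assuming a notification integration named my_email_int has already been created and the recipient is a verified Snowflake user; connection parameters are placeholders:

```python
# Send an email from Snowflake via SYSTEM$SEND_EMAIL. Assumes an email
# notification integration "my_email_int" exists and the recipient's
# address is verified in Snowflake. Credentials are placeholders.
import snowflake.connector

conn = snowflake.connector.connect(
    account="my_account",
    user="my_user",
    password="my_password",
)
try:
    conn.cursor().execute(
        """
        CALL SYSTEM$SEND_EMAIL(
            'my_email_int',                  -- notification integration
            'someone@example.com',           -- verified recipient(s)
            'Hello from Snowflake',          -- subject
            'This email was sent by a SQL statement.'  -- plain-text body
        )
        """
    )
finally:
    conn.close()
```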
Big tech watch
- Trino at Apple — Getting inspiration from big tech companies has always been a great way to discover patterns. This time Apple engineers shared at Trino Summit how they use Trino (good old Presto, for people who are lost).
- How LinkedIn cleans up unused metadata for its Kafka clusters — Kafka garbage collection: if you like to understand Kafka internals, this post is for you.
- Enabling static analysis of SQL queries at Meta — Shortening the feedback loop when editing data models is probably the biggest challenge data teams are facing today. I'm not afraid to say it: not data contracts, not Rust; the most annoying thing for a data team is the time lost developing data models. This is why great static SQL analysis is a good starting point for cutting down the manual steps. Obviously, Meta being Meta, they redeveloped everything from the ground up. (See the sketch below for a taste with open-source tools.)
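Meta's linters are in-house, but to get a taste of what static SQL analysis looks like, an open-source parser such as sqlglot can reject broken queries and extract table and column dependencies before anything hits the warehouse. A minimal sketch, not Meta's method:

```python
# Static analysis of a SQL model: parse it (catching syntax errors
# without running it) and list the tables and columns it depends on.
import sqlglot
from sqlglot import exp
from sqlglot.errors import ParseError

query = "SELECT user_id, SUM(amount) AS total FROM payments GROUP BY user_id"

try:
    tree = sqlglot.parse_one(query)
except ParseError as e:
    raise SystemExit(f"Query rejected before execution: {e}")

tables = sorted({t.name for t in tree.find_all(exp.Table)})
columns = sorted({c.name for c in tree.find_all(exp.Column)})
print(f"tables={tables} columns={columns}")
```

Plug something like this into CI and a data team catches a whole class of mistakes without waiting for a warehouse round-trip.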
Thank you all and see you next week ❤️.