Data News — Week 22.33
Data News #22.33 — Omni new BI tool, data job market, the best data team, Snowflake stuff, data ingestion best practices, RL at Netflix and fast news.
Hello dear members. For those of you on holiday, I hope you're enjoying it as much as you can. For the others, I feel you, and here's the usual Data News to keep you up to date on this Friday afternoon.
Let's start a conversation this week. What is your biggest problem right now?
Mine is Superset virtual datasets. I'm working on a Superset project and the textarea for writing the SQL queries that build the data model is so small that it's unproductive. Apart from that, I'm surprised by how much Superset can deliver.
Data fundraising 💰
- Omni raised $17.5m Series A — and a seed round in April. The field is getting stacked for the "next generation" of BI tools. Omni is trying to bring a fresh take with founders coming from Looker (VPs of Product) and Stitch/Talend (CTO). With such a line-up it looks promising. Omni wants to combine a neat BI platform with data models auto-generated from one-off queries. I feel Omni is gonna compete with Lightdash, the open-source BI tool built on dbt.
Data job market status
It's well known that the data job market is highly stressed. All companies have engineering and analytics positions open. Last week at the Data Analytics Careers Summit, Dustin revealed data about the data analytics market. This is super interesting. We can see that demand for SQL grew by 27 points since 2020, while Tableau, PowerBI and Excel are each mentioned in a third of job postings.
On the same topic, someone analysed all the jobs posted in the dbt Slack community (~3k) and computed some statistics. We can clearly see that analytics engineering has picked up while data engineering demand has stayed the same.
The best data team
There are many ways to create great data teams. This is the kind of article I'm really into. What if we could create a partnership between the data team and the other teams, enabling data to be more than a support role? This is the Data Business Partnership. The post provides great guidelines for trying this implementation.
We often say that it's a bad idea to look for data unicorns. But what if instead we looked for data heroes? Mikkel wrote another great post about data teams. So yeah, how can you find, activate and retain data heroes? You know, those people who add that little extra to your team.
It's summer, so let's speak about snowflakes
What a boring category title, but since I got 3 articles about Snowflake I wanted to group them here.
Firstly, you can try to do machine learning with Snowflake using the recently released Snowpark. Then it's time to master the query profiler: thanks to Teej you'll be able to get started reading query graphs. And now that you've had fun playing with ML and queries, you should have a look at your Snowflake bills.
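For context, here's a minimal sketch of what the Snowpark for Python workflow can look like: filter a table inside Snowflake, pull the result down as pandas, and train a scikit-learn model locally. The connection parameters, the CUSTOMER_FEATURES table and its IS_LABELED/CHURNED columns are made-up placeholders, not taken from the article.

```python
from sklearn.linear_model import LogisticRegression
from snowflake.snowpark import Session
from snowflake.snowpark.functions import col

# Placeholder connection parameters: fill in your own account details.
session = Session.builder.configs({
    "account": "<account>",
    "user": "<user>",
    "password": "<password>",
    "warehouse": "<warehouse>",
    "database": "<database>",
    "schema": "<schema>",
}).create()

# Hypothetical feature table: the filter is pushed down to Snowflake and
# only the resulting rows come back as a pandas DataFrame.
features = (
    session.table("CUSTOMER_FEATURES")
    .filter(col("IS_LABELED") == True)
    .to_pandas()
)

# Train a plain scikit-learn model on the extracted features.
model = LogisticRegression().fit(
    features.drop(columns=["CHURNED"]),
    features["CHURNED"],
)
```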
7 best practices for data ingestion
Saikat wrote a small wrap-up of the basic best practices everyone should follow when writing data ingestion pipelines. Guess which one is my favourite.
On the same topic, Matt wrote about data backfilling. To me, backfilling is one of the topics that really shows the difference between a data engineer and a great data engineer, probably because backfilling requires experience and patience. It's easy to run a pipeline, but when your pipeline has to recompute or re-ingest data from the last 4 years, the stress it'll put on your systems will be heavy.
I also have to disagree with Matt's post on how to handle backfilling. I've made the mistake in the past of creating dedicated backfilling pipelines, but I think it's a bad idea. If pipelines are idempotent and deterministic you don't need a separate branch; at least that's the Airflow way to do it.
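To illustrate what I mean, here's a minimal Airflow sketch of the idempotent approach (my own assumptions, not Matt's method): each run recomputes exactly one daily partition keyed on its logical date, so a backfill is nothing more than re-running past dates with the regular DAG. The dag_id and the loader are hypothetical.

```python
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator


def load_partition(ds, **_):
    # Hypothetical loader: it recomputes and overwrites only the partition
    # matching the run's logical date (ds). Running it twice for the same
    # date gives the same result, which is what makes plain re-runs a safe
    # backfill mechanism.
    print(f"Reloading partition {ds}")


with DAG(
    dag_id="daily_ingestion",
    start_date=datetime(2018, 1, 1),
    schedule_interval="@daily",
    catchup=True,       # Airflow creates a run for every missed past date
    max_active_runs=4,  # throttle so a backfill doesn't hammer the sources
) as dag:
    PythonOperator(task_id="load_partition", python_callable=load_partition)
```

With that in place, reloading the last 4 years is just `airflow dags backfill -s 2018-01-01 -e 2022-08-01 daily_ingestion` on the regular DAG, no dedicated backfilling pipeline needed.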
Cool findings 🔎
- I discovered the StackExchange data explorer (via Vlad Mihalcea) — Yeah, I might be late to the party. This is a place where you can write SQL to query SE data and get insights.
- command-not-found.com — a website where you can search for any OS package and get the installation command for each distribution.
- dbt-jsonschema — A way to validate your dbt YAML within VSCode.
ML Friday — Reinforcement Learning at Netflix
Reinforcement learning is still a bit mystical to me. Every year when I teach classes I try to give examples, and this one from Netflix is quite interesting. They picked RL models to find optimal recommendations under a constraint on our most limited resource: our time.
More ML: 4 essential steps for building a simulator.
Fast News ⚡️
- GitHub Copilot will (not) take your job — Ronny wrote a good overview of Copilot's capabilities and also spiced it up with what's questionable about Microsoft's strategy.
- Grab explained how they use millions of orders — This is pretty classic stuff: a data pipeline reading from Kafka with fault-tolerance mechanisms, with an OLAP MySQL at the end.
- How Instagram suggests new content — You'll never find me on Instagram, except when I'm looking for a new tattoo; I replaced it with heavy LinkedIn scrolling, which is not better tbh, but don't judge me. In this post the Meta team details how they designed their ML system to recommend posts that feel like your Home Feed. They also mention their internal feature store powered by their IG query language.
- Reliability, Scalability & Maintainability in simple words — A great definition article by Thibault about these three magical words.
- 👍 How to structure a dbt project on GCP [part 1, part 2, part 3] — This is an awesome technical deep-dive on how you can scale and structure a dbt project using the GCP toolkit. It covers everything: data landing, modeling, orchestration and deployment. This is the kind of feedback about dbt we were all waiting for.
- You can make any piece of data look bad if you try — Fun post.
- Concepts and practices to ensure data quality + The best way to ensure that data works — Quality data quality stuff.
- Measuring system performance is hard, so know your limits.
See you next week ❤️