blef.fr

A new week for the Data News. From the feedback I got, I saw that you really liked last week's news. I would like to start this week's edition with our beloved GAFAM, who made the headlines.

If you like this kind of story, do not hesitate to subscribe to get it by email each Friday. Forever free.

GAFAM stories

Earlier this week Facebook had a huge outage impacting all of its services because of a network configuration change that made its DNS servers unreachable. Cloudflare explained what they saw from their side of the network, and I really like the charts.

Twitch (owned by Amazon) had a massive leak this week: 135GB of compressed archives has been distributed on the web. The leak started on 4chan (surprisingly), and the attention was driven by the revenue numbers of all streamers. The archives contain a bit of data, but mostly around 6k project repositories.

Google announced a partnership with Thales in France to create a "sovereign cloud" or, as they say, a "Trusted Cloud" (sorry if the translation sounds ridiculous, it sounds the same in French). This follows Google's partnership announcement with T-Systems in Germany last August and Microsoft's venture with Orange and Capgemini. What does that even mean? Well, I'm not sure anyone knows yet.

This is probably a political announcement targeting the digital transformation of big French companies and public-sector organizations that are willing to dedicate (or waste) money in the next year to move to the Cloud of Confiance.

But is it really a trusted cloud? Is hosting data in our country enough to have a trusted cloud? Don't we also need to master (meaning develop) the software? Can we trust Thales and Orange to run cloud software? I already have my opinion on that matter.

Now, let's jump to traditional data news 🦘

Data fundraising 💰

A new week and a new company entering the data pipeline space. Orchest, a Dutch company, raised $3.5m in a Seed round to "build data pipelines, the easy way". According to their website, that means you can create data pipelines from Python or R notebooks and run jobs in defined environments.
Following up on last week's news, a new Israeli startup called NeuroBlade raised $83m in Series B to accelerate data analytics processing using processing-in-memory (PIM). Looks like a trend is coming from there.
Last but not least, a new open-source data observability tool will come out of stealth soon: Elementary, yet another Israel-based startup. They started with a data lineage tool that parses your Snowflake query logs to build the dependency tree with context.

Me observing the observability space (credits)

Why Machine Learning Engineers are replacing Data Scientists

This article busts myths about ML engineers: how they fit into a data team and which skills are required. Without spoiling anything, it doesn't mean data scientists will disappear, but rather that their daily job will be transformed. To simplify, machine learning engineers are here to bridge the gap between data engineers and data scientists.

Propagating metadata across our data architecture

I was afraid to see Apache Atlas disappear with all these new lineage and metadata tools, but here is a superb article on how QuintoAndar used Atlas to connect all the metadata from their systems. Their platform is based on Kafka, Hive and Airflow, but the approach can be replicated on other systems. They also explain how they integrated this into the CI process.

Herding elephants: Lessons learned from sharding Postgres at Notion 🗒️

If you are a Notion user, you may have felt earlier this year that the product was experiencing slowness. It seems they sped things up thanks to some Postgres sharding. This post is great feedback on the Postgres sharding journey they had. If you are asking yourself when, what and how to shard, this is a must-read.

PS: jump to the conclusion if you want fast takeaways.
PS2: it reminds me of this Devoxx FR talk about Postgres at Leboncoin (vid not included :/).
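
If you have never looked at sharding up close, the routing idea at its core can be sketched in a few lines. This is a minimal illustration under my own assumptions, not Notion's implementation: the shard counts, the partition key format and the function names are all hypothetical.

```python
import hashlib

# Hypothetical sizes chosen for illustration only.
NUM_LOGICAL_SHARDS = 16
NUM_DATABASES = 4


def logical_shard(partition_key: str) -> int:
    """Map a partition key (e.g. a tenant ID) to a stable logical shard."""
    # Use a cryptographic hash rather than Python's hash(), which is salted
    # per process and would route the same key differently across restarts.
    digest = hashlib.sha256(partition_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % NUM_LOGICAL_SHARDS


def physical_database(shard: int) -> int:
    """Map a logical shard to the database that hosts it."""
    shards_per_database = NUM_LOGICAL_SHARDS // NUM_DATABASES
    return shard // shards_per_database


def route(partition_key: str) -> tuple[int, int]:
    """Return (database index, logical shard) for a partition key."""
    shard = logical_shard(partition_key)
    return physical_database(shard), shard


if __name__ == "__main__":
    for key in ("tenant-a", "tenant-b", "tenant-c"):
        database, shard = route(key)
        print(f"{key} -> database {database}, shard {shard}")
```

Having more logical shards than databases is a common choice in this kind of setup: you can later move whole shards to new machines without re-hashing every key.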

The startup data journey

A lot of new startups are starting up (yep, you read that right 🙈) every day. That means many new data stories will come in the next months. If you are still struggling to build your data platform, this post can help you: how to build a Data Lake in your early-stage startup.

On the other hand, if you have a legacy data platform that you want to migrate to something less legacy, this post by Mark Grover (ex-Lyft, Amundsen) about 3 steps for a successful data migration will help you a lot.

I totally agree with the 3 steps, and I would stress that you often need to build dedicated tools when you do a migration.

Startup data journey still unclear (credits)

A/B tests follow-up

The Netflix engineering team continues its A/B test series, this time with the interpretation of results. If you ask yourself how to deal with false positives and false negatives, this article is for you.
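
To make the false positive idea concrete, here is a small simulation of my own (it is not taken from the Netflix article). It runs many A/A tests, where both groups share the same true conversion rate, and counts how often a two-proportion z-test still declares a "significant" difference. The sample sizes, conversion rate and function names are assumptions picked for the example.

```python
import math
import random
from statistics import NormalDist


def two_proportion_p_value(conv_a: int, n_a: int, conv_b: int, n_b: int) -> float:
    """Two-sided p-value of a pooled two-proportion z-test."""
    pooled = (conv_a + conv_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 1.0  # no variation at all, so no evidence of a difference
    z = (conv_b / n_b - conv_a / n_a) / std_err
    return 2 * (1 - NormalDist().cdf(abs(z)))


def false_positive_rate(runs: int = 1000, n: int = 500, rate: float = 0.1,
                        alpha: float = 0.05, seed: int = 7) -> float:
    """Share of A/A tests flagged as significant although nothing changed."""
    rng = random.Random(seed)
    flagged = 0
    for _ in range(runs):
        # Both groups are drawn from the same true conversion rate.
        conv_a = sum(rng.random() < rate for _ in range(n))
        conv_b = sum(rng.random() < rate for _ in range(n))
        if two_proportion_p_value(conv_a, n, conv_b, n) < alpha:
            flagged += 1
    return flagged / runs


if __name__ == "__main__":
    print(f"False positive rate: {false_positive_rate():.3f}")
```

The printed share should land close to the chosen alpha: with a 5% threshold, roughly one test in twenty looks like a win even when nothing changed.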

In France we are just starting with The Trusted Cloud, but we have trusted ad companies for years: Criteo and Teads, to name a few. This week the Teads team describes their A/B test analysis framework. It is a 17-minute read that explains all the angles of their awesome A/B platform leveraging Spark and Athena.

Tired of Airflow?

On Towards Data Science, Madison Schott published "Tired of Airflow? Try this.". Apart from the clickbait title and the fact that she recommends Prefect in place of Airflow, there are some complaints that should be heard by the Airflow community. I've been a huge fan of Airflow for years now, so I may be biased on that topic.

Is it just me or is Airflow ugly and just not that useful in telling your pipeline’s story?

When she says that Airflow is ugly, I agree, but Airflow also comes with a legacy (mostly from Airbnb) that is not mentioned on the fancy new website or in Astronomer's content. I have taught Airflow for the last few years, and onboarding people unfamiliar with it is hard.

Her last point, about the community and the Astronomer support team being "almost inexistent", should be watched closely. Just to add: Airflow also has a community Slack with around 12k members.

To balance and conclude this Airflow category, here is an article written by Ari from Airbyte¹ about pipelines with Airflow: the good, the bad and the ugly. It features a chord chart of Airflow transfers.

To contrast with the previous articles, footage of people enjoying Airflow (credits)

Fast News

Feature Stores: What, Why, Where and How? In last week's edition we saw that feature stores are emerging; this simple post is a reminder of what it really means.
dot-env Python package: a super nice tutorial by Ahmed Besbes on how to handle sensitive data in your Python applications.
Python Lists Are Overrated: other data structures worth looking at.
Streamlit tutorial: Streamlit is, as they say, "The fastest way to build and share data apps"; this tutorial gives you an overview of the product.
Following the Cloudflare R2 announcement from last week, Last Week in AWS shows how much money ($$$) you can save by switching to R2.
Unit test your BigQuery SQL: a proof of concept for unit testing SQL queries with templating. I saw something similar at the Spark Summit Europe in 2015, but I can't find the video online.
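
About that last item, here is a rough sketch of what unit testing a SQL query with templating can look like. It is my own illustration and not necessarily how the linked post does it: I use the standard library's sqlite3 as a local stand-in for BigQuery, and the query, the `$orders_table` placeholder and the helper names are all hypothetical.

```python
import sqlite3
from string import Template

# Query under test; $orders_table is the placeholder swapped out in tests.
REVENUE_BY_COUNTRY = Template(
    """
    SELECT country, SUM(amount) AS revenue
    FROM $orders_table
    GROUP BY country
    ORDER BY country
    """
)


def fixture_as_subquery(rows: list[tuple[str, int]]) -> str:
    """Render literal test rows as an inline subquery."""
    # Plain string formatting is acceptable for hand-written fixtures only;
    # never build SQL this way from untrusted input.
    selects = " UNION ALL ".join(
        f"SELECT '{country}' AS country, {amount} AS amount"
        for country, amount in rows
    )
    return f"({selects}) AS orders"


def run_query(sql: str) -> list[tuple]:
    """Execute the rendered SQL against a throwaway in-memory database."""
    connection = sqlite3.connect(":memory:")
    try:
        return connection.execute(sql).fetchall()
    finally:
        connection.close()


def test_revenue_by_country() -> None:
    rows = [("FR", 10), ("FR", 20), ("NL", 5)]
    sql = REVENUE_BY_COUNTRY.substitute(orders_table=fixture_as_subquery(rows))
    assert run_query(sql) == [("FR", 30), ("NL", 5)]


if __name__ == "__main__":
    test_revenue_by_country()
    print("ok")
```

In production the same template would be rendered with the real table name instead of the fixture, so the tested SQL and the deployed SQL stay identical apart from the data source.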

¹ Another tool competing with Airflow. I say competing because Airbyte's EL and connectors are replacing some Airflow use cases. But maybe for the best?

² it's my first time adding footnotes, if you read this clap your hands