Data News — Week 11
Data News #11: Scribble Data raised and Sarus launched, GAFAM news, MDS ideas, datalake stories, Spotify no longer uses Luigi.
Hi, I'm Christophe and this is the 11th edition of the Data News, my weekly newsletter about data. In the news I share with you the articles I liked this week.
👉 Subscribe to the newsletter to get your weekly digest
Data fundraising 💰
- Scribble Data raised $2.2m in a seed round to provide a cloud-native feature store. They want to help data teams by offering an end-to-end platform to build, understand and consume machine learning models.
- Sarus, a privacy-by-design data platform, launched this week on HN. As they said: "Sarus is a privacy engineering software that lets data scientists work on data without the need to access it". Under the hood they use differential privacy concepts, and it could be a glimpse of the future.
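To give an idea of what "differential privacy concepts" can mean in practice, here is a minimal sketch of the textbook Laplace mechanism applied to a count query. It is not Sarus's implementation; the function names, the `ages` data and the `epsilon` value are all assumptions made for illustration.

```python
import random


def laplace_noise(scale: float, rng: random.Random) -> float:
    """Sample Laplace(0, scale) as the difference of two exponential draws."""
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def private_count(values, predicate, epsilon: float, rng: random.Random) -> float:
    """Return a noisy count of the values matching predicate (hypothetical helper)."""
    true_count = sum(1 for v in values if predicate(v))
    # Adding or removing one person changes a count by at most 1.
    sensitivity = 1.0
    return true_count + laplace_noise(sensitivity / epsilon, rng)


rng = random.Random(11)  # seeded only to make the sketch reproducible
ages = [23, 35, 41, 52, 29, 67, 44, 38]  # made-up data
noisy = private_count(ages, lambda age: age > 40, epsilon=1.0, rng=rng)
print(f"noisy count of people over 40: {noisy:.2f}")
```

A smaller `epsilon` adds more noise and gives stronger privacy. The analyst only ever sees the noisy answer, never the rows themselves.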
News about the GAFAM
Facebook (Meta) has been hit by an $18.6m fine in Ireland for GDPR violations related to 12 data leaks in 2018.
On the other side, Microsoft is facing EU antitrust complaints led by OVHCloud and 2 other companies. They claim that Microsoft favours its own cloud on pricing by making Office 365 more expensive to deploy on competitors' clouds.
Once again, Mikkel continues to give us insights about salaries in tech companies around the world. This time he analysed FAANG salary data from levels.fyi and the results are quite interesting.
Spotify is moving away from Luigi
Every Python data engineer knows Luigi, a Python module that helps you build data pipelines. Luigi is easy to use and focuses only on orchestration: to schedule your jobs you need something else. In 2019 Spotify's data orchestration team decided to move away from Luigi. They hand-picked Flyte, a workflow automation platform and graduated project of the Linux Foundation AI & Data.
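For readers who never touched Luigi, the idea of "orchestration only" can be sketched in a few lines of plain Python. This is a toy stand-in written for this newsletter, not the Luigi API: the `Task` base class, the `build` function and the two example tasks are hypothetical. Each task declares what it requires and where its output goes, and a task whose output already exists is skipped.

```python
import os
import tempfile


class Task:
    """Toy stand-in for a Luigi-style task (hypothetical, not the Luigi API)."""

    def __init__(self, workdir: str):
        self.workdir = workdir

    def requires(self) -> list:
        return []

    def output(self) -> str:
        raise NotImplementedError

    def run(self) -> None:
        raise NotImplementedError

    def complete(self) -> bool:
        # A task is done when its output file exists.
        return os.path.exists(self.output())


def build(task: Task) -> list:
    """Run upstream tasks first; skip any task whose output already exists."""
    executed = []
    if task.complete():
        return executed
    for dependency in task.requires():
        executed.extend(build(dependency))
    task.run()
    executed.append(type(task).__name__)
    return executed


class Extract(Task):
    def output(self) -> str:
        return os.path.join(self.workdir, "raw.txt")

    def run(self) -> None:
        with open(self.output(), "w") as f:
            f.write("3\n1\n2\n")


class Sort(Task):
    def requires(self) -> list:
        return [Extract(self.workdir)]

    def output(self) -> str:
        return os.path.join(self.workdir, "sorted.txt")

    def run(self) -> None:
        with open(self.requires()[0].output()) as f:
            rows = sorted(f.read().split())
        with open(self.output(), "w") as f:
            f.write("\n".join(rows))


workdir = tempfile.mkdtemp()
first_run = build(Sort(workdir))   # runs Extract, then Sort
second_run = build(Sort(workdir))  # nothing to do, outputs already exist
print(first_run, second_run)
```

Notice there is no notion of time anywhere in this sketch: something external (cron, for instance) still has to trigger `build`, which is exactly the "you need something else" part.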
This year the competition will be fierce between orchestration tools.
Modern Data Stack ideas
This week 2 companies shared their Modern Data Stack choices and implementations:
- ManyPets implemented their platform around BigQuery. In terms of ingestion they are mixing Fivetran, Meltano (to get their telephony data) and Airflow. This experience shows once again that SaaS tools are mandatory in every data team but always end up limited for one reason or another, so a combination of multiple tools will always do the trick.
- Whatnot, on the other side, used Snowflake as their main warehouse. They also shared the decisions that made them faster. In particular, they decided to use views instead of tables and to be CDC-first without dimensional modeling in order to support frequent schema changes.
Ancient Data Stack reinvented
The warehouse-first philosophy becoming the default go-to shouldn't make us forget our dear old friend the datalake. Mehdio decoded the modern table formats (Hudi, Iceberg and Delta) and their role in the lake.
The Delta team announced a few days ago a new Delta Sharing release that works with Google Cloud Storage. With a combination of the Delta Sharing Server and its protocol, you'll be able to read your lake data from compatible clients.
Following what Delta brought to the table, LinkedIn wrote about Opal, their mutable data ingestion engine on top of their datalake. This is a great technical deep dive.
PS: I stole the Ancient data stack concept from this tweet.
Why Analytics Sucks
I was bothered that no one had thought to let me know that the launch had occurred [...] I’d never felt more viscerally that my role in the product-building process was that of data vending machine: request in, data out.
Robert from Hyperquery tells the story behind the choices they made to build their product: a Notion-like collaboration tool where data insights merge with documentation. Robert explains that the tools we use to do analytics today "inevitably shaped our processes". And the processes are broken.
Once again, competition will be fierce in last-mile data exploration to create the new way we interact with data.
Saturday ML 🦾
As a good French-born person, when I see a project trying to read the label on a wine bottle with computer vision, I have to share it. So here is a two-part article where Antonin describes how he did it; it also includes an app he deployed for you to try.
The Artefact team decided to share their experience with Facebook Prophet, a Python/R framework to forecast time series. It gives you an overview of the framework on all aspects, especially feature engineering, modeling, interpretability and deployment.
Fast News ⚡️
- Registrations are open for the Airflow Summit on May 23-27
- Making friends with Machine Learning: a YouTube playlist (101 videos as of now) run by Google's Chief Decision Scientist, Cassie Kozyrkov, where she covers all the topics related to machine learning
- Understand process order in a SQL query and why it is important: a small post, but it covers basic notions about SQL queries everyone should at least know
- The rise of the Data Reliability Engineer: this title is a good mix between two founding concepts, and opens a new path in our data journey.
- Understand Ibis and Apache Arrow: two discovery articles to help you understand what Apache Arrow is and how it is positioned compared to Pandas, for instance. Ibis is yet another package to query data in SQL backends; Marlene details Ibis concepts very well.
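About the SQL processing order item above: a query is logically evaluated as FROM, WHERE, GROUP BY, HAVING, SELECT, then ORDER BY, whatever the order you type it in. Here is a small sketch with the standard library `sqlite3` module showing why it matters: WHERE filters rows before aggregation, HAVING filters groups after it. The `orders` table and its rows are made up for the example, not taken from the linked post.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE orders (customer TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO orders VALUES (?, ?)",
    [("ana", 5), ("ana", 50), ("bob", 30), ("bob", 40), ("cy", 100)],
)

# WHERE runs before GROUP BY: small orders are dropped, then totals are computed.
where_first = conn.execute(
    """
    SELECT customer, SUM(amount) AS total
    FROM orders
    WHERE amount >= 30
    GROUP BY customer
    ORDER BY customer
    """
).fetchall()

# HAVING runs after GROUP BY: totals are computed on all orders, then filtered.
having_after = conn.execute(
    """
    SELECT customer, SUM(amount) AS total
    FROM orders
    GROUP BY customer
    HAVING SUM(amount) >= 60
    ORDER BY customer
    """
).fetchall()

print(where_first)   # [('ana', 50), ('bob', 70), ('cy', 100)]
print(having_after)  # [('bob', 70), ('cy', 100)]
```

Same table, similar-looking filters, different answers: ana stays in the first result because her 5 order is removed before summing, and disappears from the second because her total of 55 is below 60.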
💡 Wander Well.
blef.fr Newsletter