Data News — Week 23.12
Data News #23.12 — Mage and Kestra takeaways, OpenAI plugins system and impact on job market, Reddit outage post mortem, etc.
Dear readers, I hope this new edition finds you well. It seems that you really liked the recent editions, which is perfect because they were fun to write. This week, all the articles I found relevant for the newsletter are either AI-related or technical. I really don't know how to deal with the news overflow about the Gen AI landscape. Do you like all the Gen AI hype? 👍 or 👎
Airflow alternatives meetup
The first part of the Airflow alternatives meetup, with Mage and Kestra, took place this Tuesday. It was an awesome online meetup. I really liked the presentations from Mage and Kestra, and even if I was focused on hosting the event, it was great to see 2 other visions of the future of orchestration. Which, to be honest, are not that far from Airflow.
Here are my takeaways from the event:
- Mage and Kestra were both developed with Airflow's flaws in mind, especially deployment complexity, reusability and data sharing between tasks.
- The tagline "Modern replacement for Airflow" on Mage's side makes sense. Out of the box, Mage provides an all-in-one web editor to write data pipelines with a great UX. In small browser text areas you can write Python, SQL or R code and orchestrate these transformations with drag-n-drop. I personally hate developing in the browser, but the promise looks good. In practice, though, Mage and the current Airflow version are almost the same; the only difference is the UX when developing pipelines.
- Tommy, Mage's CEO, said that for the moment they will focus on building the best open-source data pipeline tool. They have enough funding for the next 2 years.
- Facing reality, even if worker management seems easier in Mage, the deployment is not ready to go yet. Either you use a Terraform script that launches elastic containers, or you use Helm, which requires Kubernetes.
- Now Kestra, one of the newest kids on the block. Ludovic, the CTO who presented Kestra at the event, said that he started the development while on an assignment at Leroy Merlin, where people were very unhappy with Airflow. Kestra is a YAML-based data pipeline tool mixed with string templating. The YAML approach lets less technical users write pipelines.
- Kestra's vision is also very open: everything is accessible through APIs, which leads to a variety of usages within a company. Under the hood, Kestra is developed in Java, which is totally different from the other alternatives.
- In the future Kestra could easily look like Mage, YAML being the intermediate step before a drag-n-drop-like UI.
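To make the "declarative pipeline mixed with string templating" idea from the Kestra takeaway more concrete, here is a toy sketch in plain Python. It is not Kestra's syntax or API: the `PIPELINE` structure, the task names and the commands are all hypothetical, and a Python dict stands in for the YAML file. The point is only that the pipeline is described as data and the variables are filled in by a template engine.

```python
from string import Template

# Hypothetical declarative pipeline: plain data, no orchestration code.
# In a YAML-based tool this would live in a .yml file instead of a dict.
PIPELINE = {
    "id": "daily_sales",
    "inputs": {"day": "2023-03-24"},
    "tasks": [
        {"id": "extract", "command": "extract --date $day"},
        {"id": "load", "command": "load --table sales_$day"},
    ],
}


def render(pipeline):
    """Return (task id, rendered command) pairs with the inputs substituted."""
    inputs = pipeline["inputs"]
    return [
        (task["id"], Template(task["command"]).substitute(inputs))
        for task in pipeline["tasks"]
    ]


for task_id, command in render(PIPELINE):
    print(f"{task_id}: {command}")
```

Someone who is not comfortable with Python only has to edit the data part, which is probably why this approach appeals to less technical users.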
It was so fun to organise this event and I'd love to do more live events in the future with blef.fr. Part 2 of the event, with Prefect and Dagster, will take place in 2 weeks, on April 4th. I hope I'll see you there.
Gen AI 🤖
The newsletter is already too big for today, so I'll try to keep it short, especially on this Gen AI topic that is already spammed everywhere.
OpenAI is slowly starting to create a gigantic ecosystem and could become the next GAFA-like company. The non-profit research company manifesto is already far away. OpenAI released a study about the impact of Large Language Models (LLMs) on the job market (sorry, I wanted to read the PDF but my brain is already fried) and announced ChatGPT plugins. In a nutshell, OpenAI has created an AI interface that everyone likes and will add an App Store experience on top of it with plugins. It reminds me of something.
Because OpenAI is not everything, here is some news from the alternative world. Mozilla announced Mozilla.ai, a community-based open-source AI ecosystem. Stanford researchers released Alpaca, a model that behaves similarly to OpenAI's text-davinci-003 but costs a lot less ($600 to train). There is also a list of open alternatives to ChatGPT.
I'd have loved to speak about tools that translate human language to SQL, like sequel.ai or SQL translator, but it would open Pandora's box about self-service analytics, and this is for next week.
Fast News ⚡️
- You broke Reddit: The Pi-day outage — a good retrospective post on the Reddit outage on Pi-day (March 14th, 3/14 in US date format; unlucky for the rest of us, who don't get a Pi-day). Jayme, a staff software engineer, shares that a Kubernetes version upgrade from 1.23 to 1.24 led to the outage. Kubernetes 1.24 introduced a terminology change from master to control-plane, which was the trigger of the issue.
- Apache Arrow releases Arrow nanoarrow — Recently Arrow got a lot of attention because of DuckDB and Pandas 2.0, and that's good. Arrow is a multi-language interface for using in-memory data structures. Nanoarrow is a C library that acts as a simplified interface for application developers, in order to put Arrow everywhere.
- Avoiding data pipeline failures: the importance of backfilling capability — As I've already said in the past, backfilling is one task that separates data engineers from great data engineers. Backfilling has to be thought about at every step of a data pipeline's design and development. This is a small LinkedIn article, but it's a good reminder.
- Datafold's data-diff now integrates dbt — You can now run data diff after a dbt run to compare your models to the production state and get a summarised view of the rows impacted.
- How LinkedIn reduced processing time with Apache Beam — Beam is a distributed processing framework that offers a unified programming model for batch and real-time. The LinkedIn team migrated away from a lambda architecture to a single unified pipeline and cut processing time by 94%.
- How fast is DuckDB really? — George, Fivetran's CEO, ran a benchmark to get metrics on DuckDB performance. The article's conclusion is that a MacBook M1 (and probably M2) can have better performance than a server. I can relate.
- If data lineage is the answer, what is the question? — A good list of use-cases where data lineage would be useful.
- Using CockroachDB to reduce feature store costs by 75% — More and more articles are about cost optimisation; 2023 is the year when data engineers' skills will be used to lower platform costs. And this is a good thing.
- How shadow data teams are creating massive data debt.
- Distributed Machine Learning at Instacart.
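On the backfilling item above: a minimal sketch of what "thinking about backfilling" can look like in code, assuming a pipeline partitioned by day. Everything here is hypothetical (the `warehouse` dict stands in for a real table, `run_partition` for a real task). The key property is that each run overwrites its own partition, so re-running a date range does not duplicate data.

```python
from datetime import date, timedelta


def daily_partitions(start, end):
    """Yield every day from start to end, both included."""
    day = start
    while day <= end:
        yield day
        day += timedelta(days=1)


def run_partition(warehouse, day):
    """Hypothetical task: recompute one day and overwrite its partition."""
    warehouse[day.isoformat()] = f"rows for {day.isoformat()}"


def backfill(warehouse, start, end):
    """Re-run the task for each day in the range; return the days processed."""
    days = list(daily_partitions(start, end))
    for day in days:
        run_partition(warehouse, day)
    return [d.isoformat() for d in days]


warehouse = {}
processed = backfill(warehouse, date(2023, 3, 1), date(2023, 3, 3))
# Re-running an overlapping range overwrites partitions instead of appending.
backfill(warehouse, date(2023, 3, 2), date(2023, 3, 3))
print(processed)
print(len(warehouse))
```

If a task appends instead of overwriting, the second call would silently duplicate two days of data, which is the kind of failure the article warns about.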
Data Economy 💰
- Sifflet raises €12.8m Series A. It started as a data observability tool, but they have since added features like lineage and cataloging. These are often needed to better contextualise alerts, and also to avoid multiplying tools when working with big corporations.
- Hex raises $28m in a Venture Round[1]. Hex is a notebook-based analytics application. Cells are at the center of the analytics: they produce outputs that can be used later in other cells or in visualisations. The visualisations can be organised in a Notion-like document, but with live data. I recently tried Hex, the UX is neat and I think the tool is worth it for production-ready explorations[2]. Here is an example with the MAD data (I made no presentation effort).
- DragonflyDB raises $21m Series A. Dragonfly is a replacement for Redis that claims to outperform it in many ways (throughput, snapshotting speed, scaling). I don't have a lot to say, except that we are heading toward a future with a lot of database choices.
[1] A venture round is when the series has not been specified.
[2] I just coined the term. I mean it's when you do a great exploration and you want to share a professional result with your stakeholders.
See you next week ❤️.