Preparing to leave Paris (credits)

Hello everyone. I hope this email finds you well. Back to the usual Data News. While the Airflow Summit is taking place, I plan to write a takeaway post once I have watched all the replays.

For the next two Fridays I'll be on holiday ☀️, but I'll try to schedule 2 posts in advance. Enjoy the read.

Data fundraising 💰

After a few slow months for fundraising due to the global economic situation, this week is bringing some cash into the data space.

PS: the activity around observability is weird; it feels like companies are raising on these keywords just for the money. It probably says something about current trends. Feel free to correct me if I'm wrong.

Manage your data resources

You know I'm a fan of data organization articles. I really like sharing this kind of article because, as of today, it is super hard to create data teams that work. There are a lot of considerations, and Emily says that you should not run your data team like a product team; you should run it like a company that needs to scale.

She recently changed position and shared some thoughts on the idea that your data team could actually be run like a product team, but not only. She details this "not only" part: the data job is different. In the post she asks really good questions that you should ask yourself to build the best data team.

In parallel, Tristan from dbt Labs took Emily's post as a starting point and elaborated on what you can do to avoid velocity traps for your team. In summary, for Tristan the software engineering job is divided between releasing user features and everything else. Deciding how to allocate your team's time between the two is probably the hardest part. When it comes to data it's a bit different, but not that far off. This is the data hierarchy of needs.

To be honest, I recommend you read the two original posts; I'm not sure my takeaways do them justice 🙈.

As a side note, here is another way to reduce your team's costs: stop useless workloads and services.

Teams (credits)

Lessons learned from running Apache Airflow at Scale

While waiting for other Airflow insights from the Summit, here is a first piece of feedback. Shopify lists the lessons they learned running Airflow 2.2 with more than 10k DAGs. They noticed that multi-tenancy is not perfect: author privileges are too broad and it is hard to split DAG ownership. Below is an extract of the conclusion.

To sum up [their] key takeaways:
• A combination of GCS and NFS allows for both performant and easy to use file management.
• Metadata retention policies can reduce degradation of Airflow performance.
• A centralized metadata repository can be used to track DAG origins and ownership.
• DAG Policies are great for enforcing standards and limitations on jobs.
• Standardized schedule generation can reduce or eliminate bursts in traffic.
• Airflow provides multiple mechanisms for managing resource contention.
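
To make the "standardized schedule generation" point concrete, here is a minimal sketch of one way it could work. It is my assumption, not Shopify's actual implementation: the helper `staggered_cron` and its parameters are hypothetical. The idea is to derive the minute and hour offset of a cron expression from a hash of the DAG id, so that DAGs sharing the same interval do not all start at midnight.

```python
import hashlib


def staggered_cron(dag_id: str, interval_hours: int = 24) -> str:
    """Return a deterministic cron expression for a DAG id (hypothetical helper).

    The same dag_id always maps to the same slot, and different ids are
    spread across the interval instead of all firing at minute 0.
    """
    if interval_hours < 1 or 24 % interval_hours != 0:
        raise ValueError("interval_hours must be a divisor of 24")

    # Stable hash of the DAG id (Python's built-in hash() is salted per process).
    digest = int(hashlib.sha256(dag_id.encode("utf-8")).hexdigest(), 16)
    minute = digest % 60
    hour_offset = (digest // 60) % interval_hours

    if interval_hours == 24:
        # Once a day, at a DAG-specific hour and minute.
        return f"{minute} {hour_offset} * * *"
    if interval_hours == 1:
        # Every hour, at a DAG-specific minute.
        return f"{minute} * * * *"
    # Every N hours, starting from a DAG-specific hour offset.
    return f"{minute} {hour_offset}-23/{interval_hours} * * *"


if __name__ == "__main__":
    for name in ("sales_daily", "marketing_daily", "finance_daily"):
        print(name, "->", staggered_cron(name))
```

The resulting string could then be passed as the schedule of a DAG. Authors only choose an interval, and the exact start time is picked for them.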

How I have set up a cost-effective Modern Data Stack for a charity

In France we have a non-profit association called Data For Good, and their work is awesome. People work for the public good by solving data issues within their capabilities.

This time Marie detailed how she developed a Modern Data Stack for a Paris-based solidarity grocery. To me this is one of the best articles on data platform development. She explains every choice well and proposes a state-of-the-art, cost-effective data stack.

What is Apache Pinot?

In the Apache wine cellar I would like a Pinot.

If you want to understand what Apache Pinot is, this is a great whiteboard YouTube video explaining why Pinot was created and how you can use it today. In summary, Pinot is a realtime distributed OLAP datastore built to answer realtime data analytics needs. The video explains the key concepts in simple terms.

🍇(credits)

Some product news 📰

Opinion post: Can ML be absorbed by the DBMS?

George, the CEO of Fivetran, asks whether machine learning could be a simple database task. Obviously this is already technically feasible, but will we see a major paradigm change like a few years ago, when we said "the data language is SQL"? The future is open.

Fast News ⚡️


1 Monte Carlo's VC, Redpoint, wrote a post praising the vision of Barr Moses (CEO).