Data News — Week 8
Data News #8: a short edition because of the bad news in the world.
I should maybe rename Data News to Saturday Data News (jk). Hey dear members, this week's edition is going to be quick because, as Benn said, there is nothing to add. All my thoughts go to my Ukrainian friends. I know a couple of you are members, so if I can do anything for you, do not hesitate to reach out.
Data fundraising 💰
- dbt Labs™ raised again this week: $222m in a Series D. Snowflake and Databricks (the warehouse enemies) both participated. As Tristan said in a blog post, transparently as always, they want to build the next layer of the modern data stack: the semantic layer. The framework for it is going to be the dbt Server™.
- Following the same semantic layer vision, Hasura raised $100m in a Series C. Hasura lets you add a GraphQL layer on top of your database, and it already works with BigQuery (and Postgres). It is another way to approach the same story.
- Making real-time data platforms a commodity through serverless products is also a big new trend. This week Decodable and Redpanda raised $20m in Series A and $50m in Series B respectively, both aiming to simplify access to real-time data.
PS: I'm joking about the dbt trademark because, along with the fundraising, they changed their SEO meta tags and HTML titles to "dbt™ - Transform data in your warehouse". Their trademark guidelines are funny: we can't say "We use dbt", we have to say “Our software utilizes dbt Core™ functionality.” Good luck with people saying DBT.
Terraform — Best practices and project setup
As a data engineer, I have a folder somewhere with a Terraform boilerplate that I use every time I want to create a new platform. Terraform has many non-obvious ways of doing things, and this post is a good starting point for best practices and project setup.
Spinner: The mass migration to Pinterest’s new workflow platform
The Pinterest team explained how they migrated more than 3000 workflows to Spinner, their 1.10-stable fork of Airflow. This post is a huge resource for anyone who wants to understand how far Airflow can be customized. They also show the dedicated UI they developed to help with the move.
Fast News ⚡️
- To conclude the unbundling of Airflow trend, Ananth from dataengineeringweekly just wrote a bundling article with his thoughts
- How Leroy Merlin managed their cloud data pipelines with Kestra — Kestra is another open-source data orchestration platform
- Training pipeline orchestration with Kubeflow pipelines
- Running Spark Pipelines on EMR using spot instances
- scrubadub is a Python library that removes personally identifiable information from text. It uses spaCy NLP models.
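
To give an idea of what removing personally identifiable information from text looks like, here is a minimal sketch using only the Python standard library. It is not the library's API or its detection logic (which relies on NLP models rather than two regexes): the `scrub` function, the patterns and the `{{EMAIL}}` / `{{PHONE}}` placeholders are all assumptions made up for illustration.

```python
import re

# Hypothetical patterns, for illustration only: real PII detection
# needs much more than two regular expressions.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")


def scrub(text: str) -> str:
    """Replace email addresses and phone-like numbers with placeholders."""
    text = EMAIL_RE.sub("{{EMAIL}}", text)
    text = PHONE_RE.sub("{{PHONE}}", text)
    return text


if __name__ == "__main__":
    message = "Contact jane.doe@example.com or +33 6 12 34 56 78 for details."
    print(scrub(message))
    # prints: Contact {{EMAIL}} or {{PHONE}} for details.
```

A regex approach like this misses names, addresses and anything without a fixed shape, which is where a dedicated library with NLP models becomes useful.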
Love ❤️