Data News — Week 24.34

Data News #24.34 — Forward Data Conference guest speakers, Data Engineering for AI/ML, AI news and a lot of great fast news.

Christophe Blefari
News again.. (credits)

It's been 3 weeks.

Summer continues and I hope this new edition finds you well, having had a great vacation and a nice break before getting back to business in September. Content and articles have been a little slow over the last few weeks, which is to be expected, but I feel things will get back to business as usual soon.

Some personal news: in September things will be changing professionally on my side, as I'll slowly be leaving the freelancing world. More details soon in a two-part article I'm writing about it. I can't wait to tell you more, to be honest. Still, the newsletter will keep the same formula.

Events ✨

As you may know, I'm co-organising the Forward Data Conference on November 25th in Paris. The Forward Data Conference will be a day to shape the future of the data community, where teams can come to learn and grow together. There are still tickets left—we have sold around 60% of them so far.

We have started to announce a few guest speakers for the conference that I can't wait to hear on stage. At the moment we have announced:

  • Joe Reis, best-selling author and data engineer, who will speak about the new art of data modeling
  • Hannes Mühleisen, co-creator of DuckDB
  • Claire Lebarz, Chief Data and AI at Malt, previously at Airbnb
  • Virginie Cornu, Co-founder / CTPO and previously VP Data at Jellysmack

You can use the BLEF_FWD24 promo code to get a 15% discount on your ticket.


Transition to another event. Demetrios Brinkmann is organising on Sep 12—in 18 days—a free online conference called Data Engineering for AI/ML. The agenda is pretty packed and the lineup is full of awesome speakers (Joe and Hannes will be there ☺️). The idea of the conference is to go deeper into the current state of AI/ML and how we do data engineering in 2024 in a way that serves ML and AI teams.

AI News 🤖

The 3 remaining OpenAI co-founders (credits: HBO)

Fast News ⚡️

  • Timeless skills for data engineers and analysts — Benjamin proposes 4 non-technical skills that are useful when doing data work; the first two are "thinking in systems" and "data intuition". With the current resurgence of data modeling, I think both are critical parts of the modeling work and deserve a look.
  • How top data teams are structured — Mikkel had a look at 40 data teams and analysed the way they are structured. As an output we get role ratios that help better understand team composition.
  • Mainframes find new life with AI — Obviously we will never get rid of mainframes, and IBM says clients want to run AI on them. I think it's time to learn machine learning in COBOL.
  • Foursquare modern data platform — A classic modern data stack, but still interesting to see that Foursquare—like many large companies—is going multi-technology, with multiple storage layers, processing engines and notebook technologies.
  • Amazon migration from Spark to Ray — An exabyte-scale migration; in the end Amazon saves a lot of processing time when it comes to file compaction.
  • Design a text-to-SQL solution — An Indian food delivery company, Swiggy, developed an internal text-to-SQL solution that users can interact with via Slack. The article describes their thinking and the challenges they faced while working on it (a toy sketch of the general pattern follows after this list).
  • In-browser WASM Postgres — A company recently ported Postgres to WASM, so you can now run Postgres within your browser without any server; you can test it at postgres.new (desktop only for the moment). You can see more on the pglite website.
  • Why we replaced Airflow in our experiment platform — Eppo reached a job concurrency limit with Airflow (they wanted to run 50k concurrent experiments) and decided to switch to something else. After trying a few different things they decided to develop their own tech.
  • Airflow Summit in San Francisco — Airflow turns 10 this year, and the Airflow Summit takes place in San Francisco on September 10-12. This year also marks the start of Airflow 3.0, which aims to be released in March next year. If we look at the Confluence page, Airflow 3.0 will feature a new web UI, DAG versioning, remote execution (from cloud to on-prem), data assets (a renaming of Datasets, like Dagster's assets) and more. That's a great milestone.
  • Airflow and dbt; in Astronomer — Orchestrating dbt within an orchestrator is one of the most discussed topics among data teams using dbt. It's also a large adoption lever for companies like Dagster, judging from the private discussions I have (the Dagster-dbt integration is top-notch). Astronomer had to go in the same direction, and through Cosmos and a better integration they now propose it as well (the naive baseline is sketched after this list).
  • No, data engineers don't need dbt — Common sense but good reminders. If you do ETL instead of ELT and you don't have a warehouse, dbt might not be the best fit.
  • Sparkle, write Spark pipelines in YAML at Uber — Uber doing Uber things. Still, declarative YAML platforms are the way to go to standardise data work (a toy runner is sketched after this list).
  • ETL development life-cycle with Dataflow — How Netflix also uses YAML-based development for writing Dataflow jobs.
  • StarRocks usage at Pinterest for faster analytics.
  • Table format comparisons — An honest, linear review of 4 formats (the 3 main ones plus Apache Paimon). The first part details how formats manage physical files, the second part is about append-only and incremental tables.
  • Polaris catalog is open-source — Snowflake released the long-awaited Polaris catalog. On this matter, it has been reported that Databricks finally acquired Tabular for $2b while Snowflake tried to get them for $600m.
  • BigTable supports SQL (and this is something) and GCP can run managed Kafka.
  • The revolving doors of BI — I frequently observe companies switching BI tools every 2-3 years, hoping that the latest solution will resolve all their issues. While these migrations often provide short-term relief, they inevitably lead to another dead end. The initial success of the migration is typically due to the fact that, during the transition, companies address their technical debt and apply the lessons learned from previous mistakes. However, without addressing the underlying issues, the cycle is likely to repeat itself.
  • Metrics management at Duolingo — I should go back to learning German in order to improve Duolingo metrics.
  • How we test Airbyte and marketplace connectors — The exhaustive test suite Airbyte put in place to check that connectors behave correctly.
  • Slack summary ELT pipeline — It showcases how you can create an ELT pipeline with dlt, Ibis and Hamilton—which is a Python library to create transformation DAGs (a minimal dlt example is sketched after this list).
  • snowflakecli — a DuckDB-powered command line interface for Snowflake security, governance, operations, and cost optimization.
  • A Postgres extension (pg_duckdb) that brings DuckDB as an analytical engine within Postgres has been backed by Microsoft and sparked a bit of discussion in the community.
  • Genealogy of databases — like a subway map that depicts all database lineages since prehistory.
  • Martin Kleppmann is working on a version 2 of Designing Data-Intensive Applications.
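
A toy sketch of the text-to-SQL pattern from the Swiggy item above: give the model the table schemas and the user's question, and ask it for a single SQL statement. The tables, the helper name and the OpenAI client are assumptions for illustration, not Swiggy's actual implementation.

```python
# Hypothetical text-to-SQL sketch, not Swiggy's implementation: the model sees the
# schema and the question and must answer with a single SQL statement.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SCHEMA_DDL = """
CREATE TABLE orders (order_id BIGINT, city TEXT, amount NUMERIC, ordered_at TIMESTAMP);
CREATE TABLE restaurants (restaurant_id BIGINT, city TEXT, cuisine TEXT);
"""  # illustrative tables only


def generate_sql(question: str) -> str:
    """Translate a natural-language question into SQL, using the schema as context."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # any chat model works here
        temperature=0,
        messages=[
            {
                "role": "system",
                "content": "You write SQL. Given this schema, answer with one SQL "
                           "query and nothing else.\n" + SCHEMA_DDL,
            },
            {"role": "user", "content": question},
        ],
    )
    return response.choices[0].message.content.strip()


print(generate_sql("Total order amount per city over the last 7 days?"))
```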
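On the dbt-in-an-orchestrator topic, here is the naive baseline many teams start from: wrapping dbt CLI calls in Airflow BashOperators. Paths and the DAG name are made up; integrations like Cosmos or Dagster's dbt support exist precisely to replace this single opaque task with model-level visibility.

```python
# Minimal sketch of the naive dbt-in-Airflow pattern: one BashOperator per dbt
# command. Paths and names are made up.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dbt_daily",                      # hypothetical DAG name
    start_date=datetime(2024, 1, 1),
    schedule="@daily",
    catchup=False,
) as dag:
    dbt_run = BashOperator(
        task_id="dbt_run",
        bash_command="dbt run --project-dir /opt/dbt --profiles-dir /opt/dbt",
    )
    dbt_test = BashOperator(
        task_id="dbt_test",
        bash_command="dbt test --project-dir /opt/dbt --profiles-dir /opt/dbt",
    )

    dbt_run >> dbt_test                      # the whole project is one opaque task
```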
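To make the declarative YAML idea from the Sparkle and Netflix items concrete, here is a toy runner that reads a spec and turns it into Spark calls. The spec format, paths and names are assumptions for illustration, not Uber's or Netflix's actual formats.

```python
# Toy runner for a made-up declarative pipeline spec; just the general idea of
# "YAML in, engine calls out", not Sparkle's or Dataflow's real format.
import yaml
from pyspark.sql import SparkSession

SPEC = yaml.safe_load("""
name: trips_by_city
source:
  format: parquet
  path: s3://my-bucket/raw/trips/
transform: |
  SELECT city, count(*) AS trips
  FROM input
  GROUP BY city
sink:
  format: parquet
  path: s3://my-bucket/marts/trips_by_city/
""")

spark = SparkSession.builder.appName(SPEC["name"]).getOrCreate()

df = spark.read.format(SPEC["source"]["format"]).load(SPEC["source"]["path"])
df.createOrReplaceTempView("input")          # the SQL transform reads from "input"

result = spark.sql(SPEC["transform"])
result.write.mode("overwrite") \
    .format(SPEC["sink"]["format"]) \
    .save(SPEC["sink"]["path"])
```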
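And for the Slack summary pipeline, a minimal taste of the extract-and-load part with dlt, loading a couple of hand-written records into DuckDB. The records and names are assumptions; the full article wires Ibis and Hamilton on top for the transformation side.

```python
# Minimal dlt sketch: load a handful of records into a local DuckDB database.
# The records and names are illustrative, not the article's actual Slack source.
import dlt

messages = [
    {"channel": "#data", "user": "alice", "text": "pipeline is green", "ts": "2024-08-20"},
    {"channel": "#data", "user": "bob", "text": "dbt run finished", "ts": "2024-08-21"},
]

pipeline = dlt.pipeline(
    pipeline_name="slack_summary",   # hypothetical name
    destination="duckdb",            # creates slack_summary.duckdb locally by default
    dataset_name="slack",
)

load_info = pipeline.run(messages, table_name="messages")
print(load_info)                     # summary of what was loaded and where
```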
