Data News — Week 24.34
Data News #24.34 — Forward Data Conference guest speakers, Data Engineering for AI/ML, AI news and a lot of great fast news.
It's been 3 weeks.
Summer continues and I hope this new edition finds you well, having had a great vacation and a nice break before getting back to business in September. Content and articles have been a little slow over the last few weeks, which is to be expected, but I feel things are going to get back to business as usual soon.
Some personal news: in September things will change professionally on my side, as I'll be slowly leaving the freelancing world. More details soon in a 2-part article I'm writing about it. I can't wait to tell you more, to be honest. Still, the newsletter is going to keep the same formula.
Events ✨
As you may know, I'm co-organising the Forward Data Conference on November 25th in Paris. The Forward Data Conference will be a day to shape the future of the data community, where teams can come to learn and grow together. There are still tickets left: we have sold around 60% of them.
We have started to announce a few guest speakers for the conference that I can't wait to listen to on stage. So far we have announced:
- Joe Reis, best-selling author and data engineer, who will speak about the new art of data modeling
- Hannes Mühleisen, co-creator of DuckDB
- Claire Lebarz, Chief Data and AI at Malt, who previously worked at Airbnb
- Virginie Cornu, Co-founder / CTPO and previously VP data at Jellysmack
You can use the promo code BLEF_FWD24 to get 15% off your ticket.
On to another event. On Sep 12, in 18 days, Demetrios Brinkmann is organising a free online conference called Data Engineering for AI/ML. The agenda is pretty packed and the lineup is full of awesome speakers (Joe and Hannes will be there ☺️). The idea of the conference is to go deeper into the current state of AI/ML and into how, in 2024, we do data engineering that serves ML and AI teams.
AI News 🤖
- Drama at OpenAI, people leaving (again) — Only 3 of the 11 original co-founders are still at OpenAI. In July, reports said that OpenAI could be on track to make a $5b loss.
- OpenAI structured outputs — You can now force the OpenAI API to return output that conforms to a JSON schema you specify in the call.
- New image generation capabilities — The German lab Black Forest created images that are more realistic than ever with their model Flux.
- How Meta animates AI-generated images at scale.
- Meta's Segment Anything Model (SAM v2) is impressive at identifying anything in images.
- ML in content moderation at Etsy.
- Semantics in the LinkedIn search engine.
- StackOverflow AI survey — a few insights quoted from the survey:
  - 76% of all respondents are using or are planning to use AI tools in their development process this year, an increase from last year (70%)
  - 81% agree increasing productivity is the biggest benefit that developers identify for AI tools.
  - 70% of professional developers do not perceive AI as a threat to their job
- Watermarking Generative AI — I believe in watermarking, and I hope that in the future user-facing apps will detect watermarks and tell users whether content is AI-generated, not AI-generated or unknown.
- The open-source AI definition (v0.0.9) — An attempt to define, in the open, what an AI system is: "an AI system is a machine-based system that, for explicit or implicit objectives, infers, from the input it receives, how to generate outputs such as predictions, content, recommendations, or decisions that can influence physical or virtual environments".
- What nobody tells you about RAG — A long deep dive into RAG.
Fast News ⚡️
- Timeless skills for data engineers and analysts — Benjamin proposes 4 non-technical skills that are useful when doing data work. The first 2 are "thinking in systems" and "data intuition". With the current resurgence of data modeling, I think both are critical parts of the modeling work and deserve attention.
- How top data teams are structured — Mikkel had a look at 40 data teams and analysed the way they are structured. The output is a set of role ratios that helps to better understand team composition.
- Mainframes find new life with AI — Obviously we will never get rid of mainframes, and IBM says clients want to run AI on them. I think it's time to learn machine learning in COBOL.
- Foursquare modern data platform — A classic modern data stack, but it is still interesting to see that Foursquare, like other large companies, is going multi-technology, with multiple storage, processing engine and notebook technologies.
- Amazon migration from Spark to Ray — An exabyte-scale migration. In the end Amazon saves a lot of processing time when it comes to file compaction.
- Design a text-to-SQL solution — An Indian food delivery company called Swiggy developed an internal text-to-SQL solution that users can interact with via Slack. The article describes their thinking and the challenges they faced while working on it.
- In-browser WASM Postgres — Recently a company ported Postgres to WASM, and now you can run Postgres within your browser without any server. You can test it at postgres.new (desktop only for the moment) and read more on the PGlite website.
- Why we replaced Airflow in our experiment platform — Eppo reached a job concurrency limit with Airflow (they wanted to run 50k concurrent experiments) and decided to switch to something else. After trying a few different things they decided to develop their own tech.
- Airflow Summit in San Francisco — Airflow is going to turn 10 this year. The Airflow Summit takes place in San Francisco on September 10-12. This year will also mark the start of Airflow 3.0, which aims to be released in March next year. According to the Confluence page, Airflow 3.0 will feature a new web UI, DAG versioning, remote execution (from cloud to on-prem), data assets (like Dagster, a renaming of Datasets) and more. That's a great milestone.
- Airflow and dbt, in Astronomer — Orchestrating dbt within an orchestrator is one of the most discussed topics among data teams using dbt. From the private discussions I have had, it's also a large adoption lever for companies like Dagster (the Dagster-dbt integration is top-notch). Astronomer has to go in the same direction, and through Cosmos and a better integration they now offer it as well.
- No, data engineers don't need dbt — Common sense, but a good reminder. If you do ETL instead of ELT and you don't have a warehouse, dbt might not be the best fit.
- Sparkle, write Spark pipelines in YAML at Uber — Uber doing uber things. Still, declarative platforms in YAML are the way to go to standardise data work.
- ETL development life-cycle with Dataflow — Netflix also uses YAML to write Dataflow jobs.
- StarRocks usage at Pinterest for faster analytics.
- Table format comparisons — An honest, linear review of 4 formats (the 3 main ones plus Apache Paimon). The first part details how the formats manage physical files; the second part covers append-only and incremental tables.
- Polaris catalog is open-source — Snowflake released the long-expected Polaris catalog. On a related note, it has been reported that Databricks finally acquired Tabular for $2b while Snowflake tried to get them for $600m.
- Bigtable supports SQL (and this is something), and GCP can run managed Kafka.
- The revolving doors of BI — I frequently observe companies switching BI tools every 2-3 years, hoping that the latest solution will resolve all their issues. While these migrations often provide short-term relief, they inevitably lead to another dead end. The initial success of the migration is typically due to the fact that, during the transition, companies address their technical debt and apply the lessons learned from previous mistakes. However, without addressing the underlying issues, the cycle is likely to repeat itself.
- Metrics management at Duolingo — I should go back to learning German in order to improve Duolingo metrics.
- How we test Airbyte and marketplace connectors — The exhaustive test suite Airbyte put in place to check that connectors behave correctly.
- Slack summary ELT pipeline — It showcases how you can create an ELT pipeline with dlt, Ibis and Hamilton—which is a Python library to create transformation DAGs.
- snowflakecli — a DuckDB-powered command line interface for Snowflake security, governance, operations, and cost optimization.
- A Postgres extension (pg_duckdb) that brings DuckDB as an analytical engine within Postgres has been backed by Microsoft and has sparked a bit of discussion in the community.
- Genealogy of databases — like a subway map that depicts all database changes since prehistory.
- Martin Kleppmann is working on a version 2 of Designing Data-Intensive Applications.
Final stuff
- dbt Generic Tests in Sessions Validation at Yelp.
- LETSQL inference for DataFrames — I don't understand the blog post, but it looks cool.