Data News — Week 24.28
Data News #24.28 — Catching up on the news, OpenAI, Claude, kyutai and all the engineering stuff from the last 3 weeks.
Dear members, it's been a few weeks since I last sent you a proper Data News with a collection of links. Here we are.
This week, I attended EuroPython in Prague. Since I spent most of my time at the dltHub booth in the sponsors hall, I didn't attend many talks. However, I did give a few presentations on my SQL orchestration library, yato, which pairs well with dlt. A YouTube video might come out soon.
Additionally, I attended an interesting talk by a Data News reader about the rise of YAML engineers. Matthieu has also written an article about this in the past. I'm so happy to have met a few of you there 😊.
This is a great transition to remind you that I'm co-organising a 1-day conference on Nov 25th in Paris. The Forward Data Conference will be a day to shape the future of the data community, where teams can come to learn and grow together. The Call for Papers (CfP) closes in a few hours, on Sunday at 23:59. So propose a talk. Submissions are welcome in French and English.
AI News 🤖
- OpenAI — Always the biggest news provider, announcements or dramas
- OpenAI was hacked, revealing internal secrets and raising national security concerns — The hacker reached OpenAI’s internal messaging systems early last year, stealing details of how OpenAI's technologies work from employees.
- Microsoft quits OpenAI board — Microsoft said that its seat is no longer needed because the governance has improved; at the same time, it might want to avoid antitrust issues raised by governments around the world. Apple will not join the board either, although it was expected to.
- How good is GPT at coding, really? — A research team evaluated the capabilities of GPT-3.5 in solving LeetCode problems. Although GPT-3.5 might be outdated, the findings are still somewhat relevant. The team discovered that GPT-3.5 performed significantly better on problems that existed before its training cut-off date. However, the model struggled with correcting its own mistakes.
- OpenAI’s acquisition of Rockset, what it means — As I mentioned a few weeks ago, OpenAI bought Rockset, a real-time analytical vector database. Customers have 2 months left to migrate away from the database, which will probably become the core of the OpenAI architecture.
- kyutai released Moshi — Moshi is a "voice-enabled AI". The team at kyutai developed the model audio-first, with an audio language model, which makes the conversation with the AI feel more real (demo at 5:00 min), as it can interrupt you or kind of "think" (meaning predict the next audio segment) while it speaks. Moshi will be part of kyutai's open-source releases and is purely local.
- Claude 3.5 Sonnet — To end the tour of the latest models: in case you missed it, Claude 3.5 came out in June and showed great performance when "reasoning". Claude is able to split the screen and develop a kind of Codepen playground with a React app implementing what you're asking for. There is a demo where Sonnet transformed a research paper into a simulator app about the paper in one prompt.
- pandas-ai — Give a dataframe to pandas-ai and configure a model, then you'll be able to chat with your data to get answers to your questions. Nothing new, I'd say; the only difference is that the API is fairly simple.
- Meta’s approach to machine learning prediction robustness — Principles Meta applies to bring robustness to ML.
- Bringing AI-powered answers and summaries to file previews on the web — Dropbox has developed a feature that generates summaries for every file in your storage. This process involves converting files, regardless of their format, into text. The text is then transformed into embeddings, which make the content easily summarisable and queryable for Q&A purposes.
- Recommendation system using a vector database — How Malt (a freelancer platform) built a recommendation engine using current vector database technologies (with Qdrant).
- DevOps for data science — An open-source and free book covering what data scientists need to know about DevOps. It has been written by someone at Posit (the company behind RStudio). It covers general knowledge about infra + code snippets.
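To make the Dropbox item above more concrete, here is a minimal sketch of the "text to embeddings to Q&A" idea. It is not Dropbox's implementation: the `embed` function is a toy bag-of-words stand-in for a real embedding model, and all names (`embed`, `cosine`, `top_chunk`, the sample chunks) are hypothetical.

```python
import math
import re
from collections import Counter
from typing import List


def embed(text: str) -> Counter:
    """Toy 'embedding': a bag-of-words count vector.

    Assumption: a real pipeline would call an embedding model here.
    """
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[token] * b[token] for token in a if token in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def top_chunk(chunks: List[str], question: str) -> str:
    """Return the chunk whose vector is closest to the question's vector."""
    q = embed(question)
    return max(chunks, key=lambda chunk: cosine(embed(chunk), q))


# Hypothetical file content, already converted to text and split into chunks.
chunks = [
    "The invoice total is 1200 euros and is due in September.",
    "The contract was signed in Prague by both parties.",
    "Shipping takes five business days within Europe.",
]
best = top_chunk(chunks, "Where was the contract signed?")
print(best)
```

In a real system, the retrieved chunk would then be passed to a language model to write the answer or summary; this sketch only covers the retrieval step.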
Fast News ⚡️
- End-to-end data lineage in AWS — AWS announced DataZone to bring lineage to your data assets. From the picture, it can mix datasets (?), Glue tables and jobs while giving you a green/red view of what's up to date. They mention column lineage, but from the picture it looks like they track columns, not proper column-level lineage. UI is AWS-tier.
- A brief history of modern data stack — A few weeks after the modern data stack debate, Ananth from Data Engineering Weekly wrote up his views on it (read my opinion on this). He considers that we are in the post-modern data stack era, with a few points that will be (or already are) implemented everywhere, especially interoperability.
- Apache XTable — XTable is a new layer that provides cross-table interoperability: you don't need to choose only one table format out of Hudi, Delta and Iceberg. It provides abstractions and tools for the translation of lakehouse table format metadata.
- Create dynamic Iceberg tables — Snowflake added support for dynamic tables in the Iceberg format. Dynamic tables are tables based on "real-time data" (or streams, or continuous pipelines). So it means Snowflake can now be used simply as an engine writing continuous tables to blob storage in an open format like Iceberg. The future is coming.
- Creating a file format in Rust — An experiment on what it takes to create a new file format. It's super interesting to understand what's under the hood of the popular stuff we often use.
- Data pipelines and SCDs — Slowly changing dimensions are an important pattern to know in data engineering. Julien wrote a great article about them, explaining the 3 possible forms and the snapshot way. His charts are great. You can also read Timo's detailed post on the Mixpanel blog about why SCDs are the best thing for product analytics.
- How to build a Semantic Layer — A great small guide that gives you the things to consider when going down the Semantic Layer road. Gwen gives a step-by-step method to migrate from marts to a dbt Semantic Layer.
- How dlt uses Apache Arrow — A great post explaining why the next generation of data tooling needs to use Arrow and how it impacts performance. The article then explains how dlt (extract and load) leverages Arrow.
- nanoarrow, a way to technically understand Arrow — Slides about a re-implementation of the Arrow framework (to be honest, it's highly technical without a video).
- DuckDB and dbt — How you can build the transformation layer of a BI application (e.g. a Pokemon dashboard) with DuckDB and dbt.
- DuckDB extension mechanism — DuckDB wants to provide a repository for community extensions. This way the community will be able to extend DuckDB easily, and it will also reduce the minimal size of DuckDB, allowing us to have an even more portable database/engine.
- Don’t lead a data team before reading this — 5 important points you should consider when leading a data team. I really like "The business doesn’t care about how you solve the problem" because it's a good reminder for my technical audience that your role as a data person is to empower others with data, so boring tech is often the best.
- Apache Kafka overview — If you're not familiar with Kafka this is a great overview.
- Spark-connect, what's this — A very detailed post about what spark-connect is and why it will change the way we do Spark. It highlights how it simplifies and enhances the development process, particularly through its compatibility with various languages and the potential it unlocks for creating a data quality process.
- How Discord uses Dagster — 2000 dbt tables, covered by over 12000 dbt tests. Discord uses the dbt <> Dagster integration to power their whole data asset management.
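Since the SCD item above is a pattern rather than a tool, here is a minimal sketch of one commonly described form, a Type 2 dimension, where each change closes the current row and opens a new one. This is an illustration under my own assumptions, not the method from Julien's or Timo's posts; `DimRow`, `apply_scd2` and the `customer_id`/`city` columns are hypothetical.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional


@dataclass
class DimRow:
    customer_id: int
    city: str
    valid_from: date
    valid_to: Optional[date] = None  # None marks the current version


def apply_scd2(history: List[DimRow], customer_id: int, city: str, as_of: date) -> List[DimRow]:
    """Return a new history with an SCD Type 2 update applied."""
    current = next(
        (r for r in history if r.customer_id == customer_id and r.valid_to is None),
        None,
    )
    if current is not None and current.city == city:
        return history  # attribute unchanged: keep history as-is

    updated = []
    for row in history:
        if row is current:
            # Close the previous version instead of overwriting it.
            updated.append(DimRow(row.customer_id, row.city, row.valid_from, as_of))
        else:
            updated.append(row)
    # Open a new current version starting at the change date.
    updated.append(DimRow(customer_id, city, as_of))
    return updated


history = apply_scd2([], 1, "Paris", date(2024, 1, 1))
history = apply_scd2(history, 1, "Prague", date(2024, 7, 8))
history = apply_scd2(history, 1, "Prague", date(2024, 7, 12))  # no change, no new row
for row in history:
    print(row)
```

The point of the pattern is that the "Paris" row is kept with an end date, so past facts can still be joined to the value that was true at the time.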
Stories
- How Canva collects 25 billion events per day — Protobuf + Amazon Kinesis.
- In-memory analytics for Kafka using DuckDB — The author developed kwack, a small utility that allows you to run SQL queries on top of Kafka streams (in-memory).
- 40 AI prompts to boost your marketing team’s creativity — Atlassian's collection of 40 prompts to do marketing stuff. I'm not sure I'm happy to see this on the Atlassian blog.
- A Recap of the Data Engineering Open Forum at Netflix — Videos from the Netflix Data Engineering Open Forum are out on YouTube, and this post is a recap plus takeaways.
- Senior engineer fatigue — As you gain experience and your career progresses, you will start to feel fatigue as an engineer. Senior fatigue is characterised not by a decline in productivity but by a deliberate deceleration. I find the first part, about the paradox of slowing down to speed up, so true that I warmly recommend reading it.
See you next week (probably) ❤️ — I'll take random breaks this summer in order to prepare for the changes coming in my professional and personal life in September.