Data News — Week 24.11
Data News #24.11 — OpenAI CTO, Musk vs. LeCun, Grok open-source?, French report about AI ambition, RAG is hype, and data engineering stuff.
I hope this e-mail finds you well, wherever you are. I'd like to thank you for the excellent comments you sent me last week after the publication of the first version of the Recommendations. This is just the beginning!
This week I've added a subscribe button on the Recommendations page so you can opt in to the weekly recommendation email, sent every Tuesday. You can subscribe on the page starting today, and you'll get emails as soon as I've built the email sending, which I expect to ship at the end of the month.
Second point, I passed 100 stars on GitHub for yato, which is a crazy amount! I'd like to do a bit of user research about yato, so if you're considering using it, please drop me a message.
yato is a small Python library that I've developed; the name stands for yet another transformation orchestrator. With yato you give it a folder of SQL queries and it guesses the DAG and runs the queries in the right order.
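To make the "guess the DAG" idea concrete, here is a minimal sketch of the general technique, not yato's actual implementation. It assumes each query builds a table named after its file, detects dependencies with a naive regex on FROM and JOIN, and orders the queries with a topological sort. The query names, the `infer_dag` and `run_order` helpers, and the regex are all made up for illustration; a real tool would use a proper SQL parser.

```python
import re
from graphlib import TopologicalSorter

# Hypothetical queries, keyed by the table each one builds (as if read from
# files such as orders_clean.sql; the names are invented for this example).
QUERIES = {
    "revenue_by_day": "SELECT day, SUM(amount) AS revenue FROM orders_clean GROUP BY day",
    "orders_clean": "SELECT * FROM raw_orders WHERE amount > 0",
    "top_days": "SELECT * FROM revenue_by_day JOIN calendar USING (day) ORDER BY revenue DESC",
}

# Naive dependency detection (an assumption): any identifier right after FROM or JOIN.
TABLE_REF = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_]*)", re.IGNORECASE)


def infer_dag(queries):
    """Map each query name to the set of other queries it reads from."""
    dag = {}
    for name, sql in queries.items():
        referenced = set(TABLE_REF.findall(sql))
        # Keep only references to tables built by another query in the folder;
        # anything else (raw_orders, calendar) is treated as an existing source.
        dag[name] = {ref for ref in referenced if ref in queries and ref != name}
    return dag


def run_order(queries):
    """Return query names so that every dependency comes before its consumers."""
    return list(TopologicalSorter(infer_dag(queries)).static_order())


if __name__ == "__main__":
    for name in run_order(QUERIES):
        print("would run:", name)
```

Even though the queries are declared out of order, `run_order` places `orders_clean` before `revenue_by_day`, and `revenue_by_day` before `top_days`.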
AI News 🤖
- Mira Murati answers the Wall Street Journal about OpenAI Sora — OpenAI's CTO was asked a few questions about the technology underlying Sora and shared a few insights. For the moment OpenAI considers Sora a research output that might be released later this year. It requires "much much more" compute power than DALL-E to generate a video, and the team still has a lot of questions about its impact on elections and the film industry. Her main message was that "Sora is a tool to extend creativity".
Last point: Mira has been mocked and criticised online because, as CTO, she wasn't able to say which public or licensed data Sora was trained on. When asked whether it included videos from YouTube, Facebook or Instagram, she said "I'm actually not sure about that".
I personally really recommend this interview, which covers a lot of interesting topics in 10 minutes.
- Elon Musk said out loud that xAI will open-source Grok this week. It's Friday and it seems they are later than me when it comes to releasing stuff. It's a good time for a reminder that open-source ≠ open-weights when it comes to AI licensing, although differences in weights licensing are not as important as they seem.
- Databricks invests in Mistral AI — Mistral has successfully positioned itself as the main OpenAI rival by being integrated into all major data platforms (Azure and Snowflake previously).
- A French commission released a 130-page report titled "Our AI: our ambition for France". You can download the French version and a 16-page English summary. The report includes 25 recommendations given by French-speaking AI leaders (Yann LeCun, Arthur Mensch, etc.).
- AI-assisted wars are around the corner — I'm only following the French news, but the government is proudly doubling its budget for "AI defense". From what I know, AI is mainly used as an information companion to find signals in the huge amount of data we generate, creating more efficient agents.
This is related to Paris testing automated video surveillance during the Olympics. The technology behind it is Cityvision.
- Yann LeCun clashed with Elon Musk on Twitter about the future of AI. Musk thinks that AI will be smarter than any single human next year, while LeCun said "No", taking as an example the false self-driving car promise. Moreover, LeCun believes that human information compression capabilities are still so far ahead of AI that AGI is not even close.
- Cognition AI introduced Devin — Devin is presented as the first AI software engineer. It can, unassisted, do software engineering tasks like fixing GitHub issues (13% success rate, where the previous best was ~5%), applying to jobs on Upwork, and training and fine-tuning its own models. I'm speechless.
- Building Meta’s GenAI infrastructure — 2x 24k GPU clusters and it's growing. I like how Meta tries to do stuff out in the open (or at least with some kind of transparency) but the number of GPUs is just disconcerting.
- RAG is the new trend — RAG means retrieval-augmented generation. The term was coined in 2020 (see more) and it lets you ground AI models with facts fetched from external sources.
- There are so many technologies in the RAG space, especially vector databases, that I won't even list them, but obviously the posts all say "ours is the best".
- Croissant: a metadata format for ML-ready datasets — In order to move faster in AI and model building, we need an interoperable and easy-to-use metadata format for ML datasets. This is Croissant. Starting today it is supported by 3 major platforms: Kaggle, HuggingFace and OpenML. Croissant is under MLCommons and you can have a look at the specification.
- The State of competitive machine learning — a study about ML competition platforms. It gives a lot of insight into the market.
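As a rough illustration of the RAG item above, here is a minimal sketch of the retrieve-then-prompt loop. It uses bag-of-words cosine similarity in place of a real embedding model and a vector database, and it stops at building the prompt instead of calling a model. The `DOCUMENTS` list and the `retrieve` and `build_prompt` helpers are hypothetical names for this example only.

```python
import math
import re
from collections import Counter

# Hypothetical mini knowledge base; in practice this would be an external source.
DOCUMENTS = [
    "Croissant is a metadata format for ML-ready datasets.",
    "Pandera is a data validation library for dataframes.",
    "Arroyo is a stream-processing platform built on DataFusion.",
]


def tokenize(text):
    """Lowercase the text and split it into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(question, documents, k=1):
    """Return the k documents most similar to the question."""
    q = Counter(tokenize(question))
    ranked = sorted(documents, key=lambda d: cosine(q, Counter(tokenize(d))), reverse=True)
    return ranked[:k]


def build_prompt(question, documents, k=1):
    """Ground the question with retrieved facts before it is sent to a model."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question, documents, k))
    return f"Answer using only these facts:\n{context}\n\nQuestion: {question}"


if __name__ == "__main__":
    print(build_prompt("What is Pandera?", DOCUMENTS))
```

The generation step is deliberately left out: the string returned by `build_prompt` is what you would pass to whichever model you use.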
Fast News ⚡️
- Since the end of February, BigQuery supports DELETE to delete partitions in a SQL query.
- How I saved $70k a month in BigQuery — Junaid shared a few techniques he used to save a bunch of dollars on his BigQuery bills. This is nothing new, it's mostly common sense, but it always works. In a nutshell: smarter schedules, table optimisations, incremental processing, avoiding views and precomputing.
- Attributing Snowflake cost to whom it belongs — Fernando gives ideas about metadata management to better attribute Snowflake costs. Whether it's a dbt model, a Tableau dashboard or a Metabase question, it has to be tracked to understand what drives your bills.
- Understand how BigQuery inserts, deletes and updates — Once again Vu took the time to deep dive into BigQuery internals, this time to explain how data management is done.
- Pandera, a data validation library for dataframes, now supports Polars.
- PyData is coming to Paris in 2024 — The CFP is open and I submitted a talk there about yato.
- A comparison between Kestra and Airflow — Benoit (who works at Kestra) did a great comparison between the 2 tools, comparing the syntax to write DAGs and the performance in terms of scheduling capacity (tasks per second). Obviously Benoit prefers Kestra, at the expense of writing YAML and running a Java application.
- New Apache Arrow engines — Arrow has become one of the most used libraries when it comes to building in-memory engines, with Arrow doing a lot of the heavy lifting on data operations.
- Apache Arrow DataFusion Comet — a native Spark SQL accelerator. The idea is to improve Spark performance by delegating the work of the Spark executor to Comet. On the same topic there is also Apache Gluten, a plugin aiming to double SparkSQL performance.
- Arroyo, a stream-processing platform, rebuilt their engine using DataFusion.
- Postgres creator launches DBOS, a transactional serverless computing platform — Mike sees DBOS as a cloud-native OS that runs on top of the database in order to rethink application development and deployment.
- Unlocking Kafka's potential: tackling tail latency with eBPF.
Forward thinking
- Dataviz is hierarchical — Malloy, once again, provides an excellent article about a new way to see data visualisations. It's inspirational.
- Coding data pipelines is faster than renting connector catalogs — This is something I've always believed. The devil is in the details, and when it comes to data pipelines there are a lot of details, which often keeps us from buying and leads us to build (or code) instead. Matthaus gives the dlt vision: creating the foundation for developers to build sources in a wink, resulting in a large ecosystem of easily maintainable API datasets.
- Differential storage, a building block for a DuckDB-based data warehouse — This is MotherDuck's vision: creating the next data warehouse on top of DuckDB, leveraging DuckDB's ability to morph between a single machine and a production ecosystem. In the article Joseph explains how MotherDuck extended DuckDB to add time travel and zero-copy snapshots, opening the door to more collaboration and concurrency.
See you next week ❤️ — recommendations for this week have been computed, go check them out.
blef.fr Newsletter