Data News — Week 24.07
Data News #24.07 — OpenAI Sora, Gemini, Boximator, model competition is fierce, new Observable and BI as Code and more stuff.
Hey you, time for the Data News. Because I did not send the news last week, you will get articles from the last 2 weeks. The last few days have also been heavily packed with AI news.
Disclaimer: the 2 events below will be in French.
Before jumping to the news, there are a few events I want to write about. Next Wednesday I will participate in a Data Night Talk, an open discussion about AI & data engineering with other content creators. We will do it online and in person, so tune in. I'm working on a 10-minute light talk about data engines (🦆) and a funny game.
✨ Second, I'm partnering with La Conférence MLOps, a half-day conference on the challenges of industrializing AI. It will take place on March 7 in Paris. The list of speakers includes many important figures from the French data ecosystem, and I'm very excited about it. I may give a talk—but I'm not sure yet.
PS: This is not a paid partnership with nibble, the company behind the conference. They have been a client of mine for 2 years and a huge supporter of the newsletter since day one, and they're good humans. I'm happy to partner with them.
AI News 🤖
The last few days have been stacked in terms of AI news. Have fun getting through everything 😊, it is really cool how fast things are improving.
- OpenAI Sora, generate 1-minute long videos — Yesterday OpenAI released a new generative model that is able to create 1-minute long videos from a text prompt. At the moment it is not public and only in the hands of a limited group of testers, but the first look shows that OpenAI might already be ahead of the competition. You can see a few videos on their landing page as well as the current limitations.
- ByteDance Boximator, create motion on images — Boximator is a friendly method to instruct generative algorithms with boxes. With the boxes you define a motion on an image and the model creates a video out of it. ByteDance is the company behind TikTok. At the end of the page there is a current comparison of generated videos, so you can form your own opinion on how it compares to Sora.
- Meta V-JEPA, fills the void in videos — V-JEPA is not a generative model. With this model you can "fill in the blanks of what’s missing—not in terms of the actual pixels, but rather as a more abstract description in this representation space".
- Sam Altman wants to raise $7 trillion — It was the WSJ news of last week: OpenAI's CEO is seeking 7,000,000,000,000 dollars. Obviously everyone tried to guess why he wants this money, which could come from the UAE government. It is probably to enter the semiconductor industry. I'm happy we correctly use money to save our planet /s.
- Canada partners with NVIDIA to bring more computing power.
- Groq, which speeds up LLM inference — This week I discovered Groq, a company that created the LPU™ inference engine and claims to be the most efficient cloud-based provider (18x faster) in terms of tokens/s.
- Google announced a "new" AI hub in Paris — TechCrunch is mocking the US company over this announcement, which it considers a communication effort, because the 300 members of the new AI hub already work for Google and it's just a new office space.
- Still Google, which announced Gemini 1.5 — Gemini is the new name of Bard (as of last week) and blablabla Gemini is awesome blablabla. It is crazy how outdated Google feels compared to smaller AI companies in terms of the hype or magic they are able to build. More details on Twitter.
- HuggingFace model usage on HuggingChat — HuggingChat is a chat UI from HF that lets you play with whatever model works with it. It shows how fluid the model market is. Mainly it shows that Mistral is currently replacing LLaMa. You can also see the models' "market share" among the major API / cloud vendors in a nice Sankey diagram.
- NVIDIA Chat with RTX — It's a Windows app (~37GB) that runs a GPT model locally so you can chat with your files securely, outside of the cloud. Happy gamers.
- French Finance ministry released an LLM to summarise legislative proposals — Called LLaMandement, it's a fine-tuned LLaMa designed to produce neutral summaries of law proposals to help the government with preliminary notes. All the data used for the fine-tuning is available on GitLab, as is the FastChat command used. Here is the English paper on arXiv.
- Why you need LLMOps — A great post that encapsulates all the words needed to understand what needs to be done when it comes to putting LLMs in production.
Fast News ⚡️
- ❤️ Observable 2.0 — Observable has a special place in my heart. It was created by Mike Bostock, the creator of D3.js, which is my Proust madeleine. Today they announced version 2.0, which is mainly Charts as Code. It goes beyond notebooks and becomes a static site generator for building fast, beautiful data apps, dashboards, and reports. I'm so excited to play with it.
- Introducing universal SQL — I have to talk about Evidence now, which is also a static site generator for building data front ends. They introduce universal SQL as a way to connect to all kinds of data sources and add interaction in the frontend while staying fast. Mainly it means data is exported to Parquet and computed with DuckDB WASM in the browser.
- uv: Python packaging in Rust — I've been using poetry for the last 2 years and I'm quite satisfied with it. Seeing a new kid on the block is good because it renews the ideas and, let's be honest, we will never have a de facto packaging tool.
- sqruff: SQL linter written in Rust — This is a result of the Rust hype: people are now porting more and more tools to Rust for efficiency. And it's for the better.
- Introducing the column explorer in MotherDuck — A cool feature in MotherDuck (DuckDB in the cloud) that adds sparklines and column distributions when looking at a dataset.
- dbt Labs announced a few new things for their dbt Explorer (which is only available in dbt Cloud). In a nutshell they announced column-level lineage, recommendations and semantic layer exports. It is fair to say that the lineage is powered by SQLGlot (and Toby is not happy about it).
- Introducing vector search in BigQuery — RAG is everywhere and BigQuery enters the game.
- Snowflake lowers the cost of tasks from x1.5 to x1.2.
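To make the sparklines and column distributions mentioned above more concrete, here is a minimal sketch in plain Python. It is not how MotherDuck implements its column explorer; the function names `sparkline` and `histogram` and the choice of 8 bins are assumptions made purely for illustration.

```python
# Illustrative sketch only: a tiny text sparkline and a fixed-width histogram.
# The names and the default bin count are hypothetical, not MotherDuck's API.
BARS = "▁▂▃▄▅▆▇█"


def sparkline(values: list[float]) -> str:
    """Map each value to one of 8 bar characters, scaled between min and max."""
    if not values:
        return ""
    low, high = min(values), max(values)
    span = high - low
    if span == 0:
        # A constant column has no variation to show: draw a flat line.
        return BARS[0] * len(values)
    return "".join(BARS[int((v - low) / span * (len(BARS) - 1))] for v in values)


def histogram(values: list[float], bins: int = 8) -> list[int]:
    """Count how many values fall into each of `bins` equal-width buckets."""
    counts = [0] * bins
    if not values:
        return counts
    low, high = min(values), max(values)
    width = (high - low) / bins or 1  # avoid a zero width for constant columns
    for v in values:
        # The maximum value lands exactly on the upper edge: clamp it into the last bin.
        counts[min(int((v - low) / width), bins - 1)] += 1
    return counts


prices = [3.0, 4.5, 4.0, 9.0, 12.5, 11.0, 2.0, 6.5]
print(sparkline(prices))             # one character per row, in row order
print(sparkline(histogram(prices)))  # the distribution of the column as a sparkline
```

Feeding the histogram counts back into `sparkline` gives the compact distribution view you typically see next to a column name.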
Engineering ⚙️
- Eventify everything — This is an ode to event modeling and a different way to think about data modeling. Timo showcases how you can eventify your data model to think differently about your business activity.
- What we learned after running Airflow on Kubernetes for 2 years — Outstanding article with great insights about the journey of running Airflow in production. It breaks down how to handle dynamic DAG generation, multiple DAG repositories, configuration fine-tuning and observability.
- A dataframe is a bad abstraction — The article is too long for me to read right now, but the title is catchy enough for me to put it in the newsletter. If you read it, I'm curious to know what you think about it.
- Back Market’s journey towards data self-service — How to become a data self-service company and which initiatives they tried along the way.
- How to leverage Metabase for efficient self-service analytics? — 3 companies joined forces to share tips about Metabase governance and monitoring. This is a goldmine if you're using Metabase and struggling to understand how people use it.
- Automating data classification for the 21st Century — How Semantic Data Fabric (SDF) is able to statically infer the data types and lineage of a SQL codebase. In a sense SDF is a dbt alternative. I really like this article.
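To illustrate the "eventify" idea from the first item in this list, here is a minimal sketch in plain Python. It is only one possible reading of the idea, not Timo's actual method: the `Event` class, the `eventify_order` helper, the column names and the event names are all hypothetical.

```python
# Illustrative sketch only: turn a wide "orders" row into a stream of events.
# Every name below (Event, eventify_order, columns, event names) is hypothetical.
from collections import Counter
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Event:
    """One row of an event table: which entity did what, and when."""
    entity_id: str
    name: str
    occurred_at: datetime


def eventify_order(order: dict) -> list[Event]:
    """Emit one event per non-null timestamp column, in chronological order."""
    column_to_event = {
        "created_at": "order_created",
        "paid_at": "order_paid",
        "shipped_at": "order_shipped",
    }
    events = [
        Event(order["order_id"], event_name, order[column])
        for column, event_name in column_to_event.items()
        if order.get(column) is not None
    ]
    return sorted(events, key=lambda event: event.occurred_at)


orders = [
    {
        "order_id": "o1",
        "created_at": datetime(2024, 2, 1, 9),
        "paid_at": datetime(2024, 2, 1, 10),
        "shipped_at": datetime(2024, 2, 3, 8),
    },
    {
        "order_id": "o2",
        "created_at": datetime(2024, 2, 2, 14),
        "paid_at": None,
        "shipped_at": None,
    },
]

events = [event for order in orders for event in eventify_order(order)]
event_counts = Counter(event.name for event in events)
print(event_counts)
```

Once the data is in this shape, questions about the business activity (how many orders get paid, how long shipping takes) become simple filters and aggregations over a single event table.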
Food for thought 🍱
- What founders need to know to build a high-performing data team — Centralised, distributed or hybrid data team? This article discusses it.
- Let's elevate the role of analytics.
- Data ownership: A practical guide.
- Is the Modern Data Stack still a useful idea — I'll write more about this later; I guess this is too big to fit my views into a bullet point. The article is about Tristan Handy's views on the future of the modern data stack.
See you next week ❤️.
PS: I'd love to get your feedback about the newsletter, whether you joined recently or have been here since the beginning. I'd also love to understand who you are and what would eventually make you financially support my content creation activity.
blef.fr Newsletter