Data News — Week 24.30
Data News #24.30 — TV shopping for foundational models (OpenAI, Mistral, Meta, Microsoft, HF), newly released BigQuery stuff, and more, obviously.
Dear members, it's Summer Data News, the only news you can consume by the pool, at the beach, or at the office—if you're not so lucky. This week I'm writing from the Baltics, nomading a bit around Eastern and Northern Europe.
I'm pleased to announce that we have successfully closed the CfP for Forward Data Conf. We received nearly 100 submissions, and the program committee is currently reviewing them all. Many thanks to everyone who trusted us and submitted a talk for the conference (especially the DN members!).
We also announced our first guest speaker, Joe Reis. Joe is a great speaker and the author of Fundamentals of Data Engineering, one of the bibles of data engineering, and I can't wait to hear him at Forward Data. He is currently writing his second book, about data modeling.
Forward Data is a 1-day conference I'm co-organising on November 25th in Paris. It will be a day to shape the future of the data community, where teams can come to learn and grow together.
AI News 🤖
Some days, AI News is like a TV shopping show. Over the past two weeks, a few dozen models have been released, and I'd like to introduce them to you.
New models & stuff
- OpenAI — OpenAI keeps trying to lead the charge, releasing models the way Apple releases products.
- GPT-4o mini: advancing cost-efficient intelligence — After GPT-4o, which brought great performance and became the new flagship model (available in the free tier), OpenAI released a smaller version of it: the mini. According to the benchmarks, GPT-4o mini is close to GPT-4o's performance and best in class among small models. Even though OpenAI did not disclose how small it is, a few people claim it's an 8B model.
- Fine-tune GPT-4o mini for free — Until September 23, 2024, GPT-4o mini is free to fine-tune. Each organization gets 2M training tokens per 24-hour period, and any overage is charged at $3.00/1M tokens. Worth trying [docs].
- SearchGPT, new OpenAI product — Yesterday, OpenAI unveiled its latest product, SearchGPT, a prototype AI search application. The system generates answers while citing reliable sources. The announcement coincides with Google Search's recent report of an 11% revenue increase for the last quarter, reaching $64 billion. It shows that search did not disappear with the advent of GPTs.
- Meta — It's crazy how Meta, which suffered an unintentional leak of the LLaMA weights on torrents a year ago, is now the company advocating for open models and leading this part of the ecosystem.
- LLaMA 3.1 is out — The model comes in 3 versions: the largest with 405B parameters and 2 smaller ones (70B and 8B). They even released a 92-page whitepaper explaining how they trained it, its expected performance, and what you can do with it. Our dear friend Mark even wrote an ode to open source, some kind of manifesto. The announcement works as a summary if you want the short version.
- Meta won’t release its multimodal Llama AI model in the EU — It would have been perfect if Meta complied with the rules, but in the end, Meta is Meta. After lobbying, and like a kid crying when punished for misbehaving, they announced they will not release their super multimodal AI in Europe because the regulatory environment is too "unpredictable". A roundabout way of saying they used training data they should not have.
One last point related to tech giants: Apple, Nvidia, and Salesforce have been stealing YouTube subtitles to train foundational models.
- Meta Chameleon — Chameleon is finally available on HuggingFace. It's Meta's Mixed-Modal Early-Fusion Foundation model, which in practice means it can understand and generate both text and images.
- Mistral — The French company is keeping pace with the other giants that are going back to open models.
- MathΣtral — A 7B model for math reasoning and scientific discovery under the Apache license. I'll try it soon for something I'm cooking.
- Codestral Mamba — A 7B model for code generation under the Apache license.
- Mistral Large 2 — Competing directly with the large LLaMA 3.1, Mistral Large 2 has 123B parameters and is currently the closest model to GPT-4o, which still sets the benchmark.
- Microsoft Phi-3 models — Microsoft continues to try hard at the game with its Phi-3 models, available in Azure. But who cares?
- SmolLM - blazingly fast and remarkably powerful — HuggingFace released new state-of-the-art small models (135M, 360M, and 1.7B parameters) trained on an open corpus.
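To put the free GPT-4o mini fine-tuning window in perspective, here is a minimal sketch in plain Python. The only inputs are the numbers quoted above (2M free training tokens per org per 24 hours, $3.00/1M tokens of overage); the assumption that billed tokens scale with the number of epochs is how OpenAI fine-tuning is generally priced, so treat it as an estimate, not their official calculator.

```python
# Estimate a GPT-4o mini fine-tuning bill during the free window.
# Numbers from the announcement: 2M free training tokens per org per
# 24-hour period, overage billed at $3.00 per 1M tokens.
FREE_TOKENS_PER_DAY = 2_000_000
OVERAGE_USD_PER_MTOK = 3.00

def finetune_cost(dataset_tokens: int, epochs: int = 1) -> float:
    """Cost in USD for one day's fine-tuning job during the promo."""
    billed = dataset_tokens * epochs  # assumption: billed tokens = dataset x epochs
    overage = max(0, billed - FREE_TOKENS_PER_DAY)
    return overage / 1_000_000 * OVERAGE_USD_PER_MTOK

# A 1.5M-token dataset for 3 epochs -> 4.5M billed tokens,
# i.e. 2.5M over the free quota.
print(finetune_cost(1_500_000, epochs=3))  # 7.5
```

In other words, anything up to 2M billed tokens a day is free until the deadline, which is plenty for small experiments.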
Articles
Because AI and GenAI are not only about models, a few great articles have been written as well.
- Twitter uses your data to train xAI Grok — Twitter recently added an opt-in to use your X posts, as well as your user interactions, inputs, and results with Grok, for training and fine-tuning purposes. The setting is only available on desktop.
- AI Lab: The secrets to keeping machine learning engineers moving fast — About the AI Lab Meta put in place to preserve ML engineers' velocity, giving them the ability to A/B test models and avoid regressions.
- Building Pinterest Canvas, a text-to-image foundation model — Great article about creating an image generation model for product backgrounds.
- Multimodal RAG pipeline — A notebook explaining how you can index deck slides and build a RAG pipeline on top of them.
- Watermarking Generative AI: ensuring ownership and transparency — This is the future: being able to watermark all generated content to ensure trust and transparency for end consumers.
- Germany (in German, sorry) and Switzerland have both added to their laws some kind of preference for open-source software (it's even required in Switzerland).
- Understand Vector database in Google Sheets — Playful. A Google Sheets template you can copy that explains how a vector database, and its search, works.
- Infinite dataset hub — A generative app that creates datasets for you out of a few words.
Fast News ⚡
Because the fast news is always the best.
- Why it feels impossible to get a data science job — The data science market has become highly competitive in recent years, even more so with everyone rushing to AI and jobs shifting from being great at machine learning to being good at maintaining API orchestration. This article tries to explain why, and what to do about it.
- Snowflake brings seamless PostgreSQL and MySQL — It was announced at the Snowflake Summit: you can now ingest Postgres and MySQL directly from the Snowflake UI, removing the need for any other tool for these sources. The way they did it requires you to run a Docker container, which is kinda meh.
- Query Snowflake Iceberg tables with DuckDB & Spark to save costs — That's what Iceberg tables on Snowflake unlock: the ability to offload compute to DuckDB or Spark to save costs (or to move costs, actually).
- The BigQuery team is on fire and released a lot of cool new stuff:
- Table explorer — an automated way to visually explore table data and create queries based on your selection of table fields.
- Continuous queries — An answer to Snowflake's dynamic tables: continuous queries are SQL statements that run continuously. Google announced that CQs can be used for low-latency tasks.
- On the same topic as CQs, they released the changes function, a SQL function that returns all rows that have changed in a table over a given time range. I think it will unlock a lot of use cases in BigQuery.
- How Mixpanel delivers funnels up to 7x faster than the data warehouse — The Mixpanel team is proud to say that they have better performance than Snowflake.
- You can run ClickHouse functions in DuckDB with the chsql extension, and there is a great post about how DuckDB manages memory.
- Querying 1TB on a laptop with Python dataframes — A benchmark on a laptop with 96 GB of memory using DuckDB, DataFusion, and Polars. Crazy what we can do nowadays.
- Data modeling techniques for the post-modern data stack — A great recap of all the modeling techniques out there (medallion and dimensional).
- Parquet File Format: everything you need to know — How Parquet files are written.
- Encoding and Compression Guide for Parquet String Data Using RAPIDS — For Nvidia geeks.
- Hudi merge-on-read — Iceberg has been all over the place, but there is Hudi as well, and merge-on-read is a great feature, tbh.
- Maestro: Netflix’s workflow orchestrator — Five years later, Netflix has finally open-sourced its orchestrator. Curious to see if it will pick up. It's written in Java and does what other orchestrators already do.
- The Analytics Development Lifecycle — dbt Labs needs to reinvent itself: now that dbt is everywhere, they need to define the next vision, because they have pure gold in their hands. Tristan, CEO of dbt Labs, is aiming for a new acronym, ADLC (Analytics Development Lifecycle), and provides a draft manifesto with user stories of what an AE should be able to do tomorrow.
- Deliver your data as a product, but not as an application.
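To make the BigQuery changes function above concrete, here is a minimal sketch of the kind of query it enables. The dataset and table names are hypothetical, and I'm going from the announcement: CHANGES is a table-valued function taking a table and a time range, returning the changed rows plus _CHANGE_TYPE and _CHANGE_TIMESTAMP pseudo-columns. The Python below only builds and prints the SQL, since actually running it needs a GCP project and a client such as google-cloud-bigquery.

```python
from datetime import datetime, timedelta, timezone

def changes_query(table: str, hours: int = 24) -> str:
    """Build a BigQuery CHANGES query covering the last `hours` hours.

    Sketch only: `table` (e.g. the hypothetical mydataset.orders) must
    have change history enabled for CHANGES to return anything.
    """
    end = datetime.now(timezone.utc)
    start = end - timedelta(hours=hours)
    # SELECT * on a CHANGES scan also surfaces the _CHANGE_TYPE and
    # _CHANGE_TIMESTAMP pseudo-columns describing each row change.
    return (
        f"SELECT *\n"
        f"FROM CHANGES(TABLE {table},\n"
        f"             TIMESTAMP '{start:%Y-%m-%d %H:%M:%S}+00',\n"
        f"             TIMESTAMP '{end:%Y-%m-%d %H:%M:%S}+00')"
    )

print(changes_query("mydataset.orders"))
```

Pair this with a continuous query and you get a cheap change-data-capture loop entirely inside BigQuery.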
Data economy 💰
- Google in negotiations to acquire Wiz in $23 billion deal — actually, no, the deal fell through. Wiz is a cloud security firm.
See you next week ❤️