Data News — Week 24.16
Data News #24.16 — Llama the Third, Mistral's probable $5B valuation, structured Gen AI, principal engineers, counting billions at big data scale, and benchmarks.
Hey, new Friday, new Data News. This week the selection feels smaller than usual, so enjoy the links. I'm a bit late with the Recommendations emails, sorry about that: I got a few new freelance leads I had to prioritise, which changed my schedule a bit. But don't worry, they're coming out soon.
AI News 🤖
When did model releases start getting the same hype as the 2007 iPhone launch? I did not get the memo.
- Meta releases Llama 3 — After last week's new Mistral models, this week ends with new open-source models from Meta. Llama the Third is online. One thing to note is that the model created more hype than Meta AI, the new ChatGPT-like assistant run by Zuck's company as it goes all-in on his AI vision. It shows something has changed now that generative models have reached massive adoption: in my bubble at least, people care more about a new model than about an assistant available across the Meta ecosystem (Insta, WhatsApp, Facebook and more). Until we reach model fatigue, the hype is real.
Personally, I can't comment on the performance of the models; it's like comparing the performance of two cars: as long as I can drive, it's fine. You can try Llama 3 on Modal or HuggingChat—or locally, see the sketch after the summary below.
To go further you can read this excellent analysis on Twitter or the model card—they even give the estimated tCO2eq emitted during the training phase. In a nutshell:
- Llama 3 is available in 8B and 70B; a 400B version is coming once training is completed—and it is approaching GPT-4 performance.
- Llama 3 has a larger tokeniser vocabulary and the context window grew to 8192 input tokens.
- It was trained on a large dataset containing 15T tokens (compared to 2T for Llama 2).
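If you want to poke at Llama 3 locally rather than on Modal or HuggingChat, here's a minimal sketch with Hugging Face transformers. It assumes you've accepted the license on the gated meta-llama repo and have a GPU with enough VRAM; the prompt and generation parameters are just illustrative.

```python
# Minimal sketch: chat with Llama 3 8B Instruct via Hugging Face transformers.
# Assumes license access on the Hub and a GPU with enough VRAM.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B-Instruct"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.bfloat16, device_map="auto"
)

messages = [{"role": "user", "content": "Explain Parquet in one paragraph."}]
# apply_chat_template wraps the messages in Llama 3's special tokens.
input_ids = tokenizer.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)

output = model.generate(input_ids, max_new_tokens=256)
print(tokenizer.decode(output[0][input_ids.shape[-1]:], skip_special_tokens=True))
```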
- Mistral wants to raise again, at a $5B valuation.
- Microsoft VASA-1 — Microsoft published a paper about a model generating talking avatars from an image and an audio clip. This is quite impressive. They did not release the code, so I tried the closest open-source alternative, SadTalker, on myself. The result is a bit creepy, but impressive given the low quality of my inputs.
- Structured generative AI — Oren explains how you can constrain generative algorithms to produce structured outputs (like JSON or SQL—seen as an AST). This is super interesting because it details the important steps of the generative process; see the toy sketch below.
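The core trick, as I understand it: at each decoding step you mask the logits of every token that would violate the target grammar, so the model can only ever sample valid continuations. Here's a toy sketch of the idea—no real model, a fake logits function, and a deliberately tiny "grammar":

```python
# Toy sketch of constrained decoding: mask logits so only grammar-valid
# tokens can be sampled. The "model" and grammar here are deliberately tiny.
import math
import random

VOCAB = ['{', '}', '"key"', ':', '"value"']

def valid_next(generated: list[str]) -> set[str]:
    """A hand-written 'grammar': which tokens may follow the current prefix."""
    if not generated:
        return {'{'}
    return {
        '{': {'"key"'},
        '"key"': {':'},
        ':': {'"value"'},
        '"value"': {'}'},
        '}': set(),  # done
    }[generated[-1]]

def fake_logits(generated: list[str]) -> list[float]:
    """Stand-in for a language model's next-token scores."""
    return [random.uniform(-1, 1) for _ in VOCAB]

def constrained_decode() -> str:
    generated: list[str] = []
    while True:
        allowed = valid_next(generated)
        if not allowed:
            break
        logits = fake_logits(generated)
        # The constraint: set disallowed tokens to -inf before picking.
        masked = [l if tok in allowed else -math.inf
                  for tok, l in zip(VOCAB, logits)]
        generated.append(VOCAB[masked.index(max(masked))])
    return "".join(generated)

print(constrained_decode())  # always valid: {"key":"value"}
```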
- Evaluate anything you want with LLMs — I really like how LLMs can be used for tasks that aren't the ones we first think of. This blog shows how you can use Gen AI to evaluate inputs like translations, returning reasons alongside the scores.
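For flavour, a minimal LLM-as-judge sketch—the rubric, model name and output schema are my own assumptions, not the post's:

```python
# Minimal LLM-as-judge sketch: score a translation and explain why.
# Prompt, model and output schema are illustrative assumptions.
import json
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def evaluate_translation(source: str, translation: str) -> dict:
    response = client.chat.completions.create(
        model="gpt-4-turbo",
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": (
                "You are a translation evaluator. Return JSON with keys "
                "'score' (1-5) and 'reason' (one sentence)."
            )},
            {"role": "user", "content": f"Source: {source}\nTranslation: {translation}"},
        ],
    )
    return json.loads(response.choices[0].message.content)

print(evaluate_translation("Bonjour tout le monde", "Hello everyone"))
```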
- How we build Slack AI to be secure and private — How Slack uses VPCs and Amazon SageMaker to keep your data secure and private.
- OpenAI batches — OpenAI opened a new API endpoint to batch requests.
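The flow, roughly: you upload a JSONL file of requests, create a batch against an endpoint, then poll for the results within a 24-hour window, at a discounted price versus the synchronous API. A minimal sketch (the request contents are illustrative):

```python
# Minimal sketch of the OpenAI Batch API flow: upload a JSONL file of
# requests, create the batch, then poll it. Request contents are illustrative.
from openai import OpenAI

client = OpenAI()

# batch.jsonl: one request per line, e.g.
# {"custom_id": "1", "method": "POST", "url": "/v1/chat/completions",
#  "body": {"model": "gpt-3.5-turbo", "messages": [{"role": "user", "content": "Hi"}]}}
batch_file = client.files.create(file=open("batch.jsonl", "rb"), purpose="batch")

batch = client.batches.create(
    input_file_id=batch_file.id,
    endpoint="/v1/chat/completions",
    completion_window="24h",
)

# Later: poll until done, then download the output file.
status = client.batches.retrieve(batch.id)
if status.status == "completed":
    results = client.files.content(status.output_file_id)
    print(results.text)
```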
Fast News ⚡️
- Principal Engineer — Although staff and principal roles have been on the career ladder for a long time, there are very few articles on what it takes to become one of the greats. This article covers the whole ladder and the mix of hard, soft and business skills needed to reach the top.
- Data pipeline, incremental vs. full load — A comprehensive comparison between the two ingestion modes, with a decision tree about which one to pick; a small sketch of both follows below.
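To make the difference concrete, here's a sketch of the two modes using a high-watermark timestamp—table and column names are made up, and `source`/`target` are assumed to be sqlite3-style DB-API connections:

```python
# Sketch: full load vs. incremental load using a high-watermark timestamp.
# Table/column names are made up; `source` and `target` are DB-API connections.

def full_load(source, target):
    # Re-read everything and replace the target table wholesale.
    rows = source.execute("SELECT * FROM orders").fetchall()
    target.execute("DELETE FROM orders")
    target.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)

def incremental_load(source, target):
    # Only fetch rows newer than the last watermark we loaded.
    (watermark,) = target.execute(
        "SELECT COALESCE(MAX(updated_at), '1970-01-01') FROM orders"
    ).fetchone()
    rows = source.execute(
        "SELECT * FROM orders WHERE updated_at > ?", (watermark,)
    ).fetchall()
    target.executemany("INSERT INTO orders VALUES (?, ?, ?)", rows)
```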
- ❤️ Scaling to count billions — An awesome retrospective of Canva's OLAP architecture for counting marketplace usage, from MySQL to Snowflake + buckets. It nicely decomposes every important part of an OLAP platform: collection, deduplication and aggregation (see the small sketch below).
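As a flavour of the dedup-then-aggregate step, a small DuckDB sketch—the schema is invented and has nothing to do with Canva's actual pipeline:

```python
# Sketch of the dedup-then-aggregate step with DuckDB.
# The schema is invented; this is the shape of the logic, not Canva's pipeline.
import duckdb

con = duckdb.connect()
con.execute("CREATE TABLE events (event_id TEXT, template_id TEXT, used_at TIMESTAMP)")
con.execute("""
    INSERT INTO events VALUES
        ('e1', 't1', '2024-04-01 10:00:00'),
        ('e1', 't1', '2024-04-01 10:00:00'),  -- the same event delivered twice
        ('e2', 't1', '2024-04-02 11:00:00')
""")

print(con.sql("""
    WITH deduped AS (  -- keep exactly one row per event_id
        SELECT * FROM (
            SELECT *, ROW_NUMBER() OVER (PARTITION BY event_id ORDER BY used_at) AS rn
            FROM events
        ) AS t WHERE rn = 1
    )
    SELECT template_id, COUNT(*) AS usage_count
    FROM deduped
    GROUP BY template_id
"""))
```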
- Spark (and Theseus) on GPUs benchmark — A detailed benchmark by Voltron Data about running Spark and Theseus (their GPU data processing engine) workloads on GPUs. It's crazy how much Theseus outperforms Spark. The conclusion reads like a great summary to me:
- For less than 2TBs > use DuckDB, Polars, DataFusion or other Arrow-backed projects.
- Up to 30TBs > Cloud warehouse or Spark
- Over 30TBs > Go Theseus. [Theseus] "prefer[s] to operate when queries exceed 100TBs". 😅
- Polars new benchmarks — Polars released new benchmarks on the TPC-H dataset. Polars and DuckDB are the cool kids, and the benchmarks show you should stop using pandas and switch to Polars for a ~10x performance gain; a quick taste below.
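If you haven't tried the switch yet, the lazy API is where Polars shines. A small sketch of the same aggregation in both libraries (file and column names invented):

```python
# Sketch: the same aggregation in pandas and in Polars' lazy API.
# File and column names are invented.
import pandas as pd
import polars as pl

# pandas: eager, everything is loaded in memory up front.
pdf = pd.read_csv("sales.csv")
result_pd = pdf[pdf["amount"] > 0].groupby("country")["amount"].sum()

# Polars: lazy, the query is optimised (predicate pushdown, projection
# pruning) before any data is read.
result_pl = (
    pl.scan_csv("sales.csv")
    .filter(pl.col("amount") > 0)
    .group_by("country")
    .agg(pl.col("amount").sum())
    .collect()
)
```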
- Hydra: the Postgres data warehouse — Postgres is one of the most used databases; this week I discovered Hydra, an open-source columnar port of Postgres aiming to create an open-source Snowflake. One to watch.
- Neon GA — Neon, another Postgres fork, is generally available. Neon wants to provide a serverless, autoscaling Postgres for devs.
- Clever Cloud offloading 20TB every month — Kestra showcases how one of their clients uses the declarative orchestrator to offload TBs of data every month.
- KubeCon + CloudNativeCon Europe’24 notes — A few notes from the big 2024 Kubernetes gathering.
- Is SQLMesh the dbt Core 2.0? — A great blog answering a great question. SQLMesh is bringing fresh ideas to the SQL transformation landscape. The post covers a lot of topics and explains the conceptual similarities between the two tools.
- gwenwindflower/tbd — A code generator for dbt. Winnie developed a great tool that saves time documenting your dbt projects using Gen AI models.
- Snowflake text embeddings for retrieval.
- Distributed dashboarding with DuckDB WASM — Ramon put into words ideas that have been in my mind for months: distributed dashboarding. I really buy this concept, especially with DuckDB WASM and what it unlocks in terms of autonomy and privacy for users.
- WAP with dbt, Iceberg or Nessie — Julien showcases how you can achieve the Write-Audit-Publish (WAP) pattern with different technologies; a tool-agnostic sketch of the pattern follows below.
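The pattern in three steps, stripped of any specific tool—here sketched with plain SQL on DuckDB, with invented audit checks (Julien's post does it properly with dbt, Iceberg branches or Nessie):

```python
# Sketch of Write-Audit-Publish with plain SQL on DuckDB:
# write to a staging table, audit it, and only then swap it into place.
# The audit checks and table names are invented examples.
import duckdb

con = duckdb.connect("warehouse.db")

# 1. Write: load the new data into a staging table, not the live one.
con.execute(
    "CREATE OR REPLACE TABLE orders_staging AS "
    "SELECT * FROM read_parquet('new_orders.parquet')"
)

# 2. Audit: run checks against staging before anyone can query the data.
(null_keys,) = con.execute(
    "SELECT COUNT(*) FROM orders_staging WHERE order_id IS NULL"
).fetchone()
if null_keys > 0:
    raise ValueError(f"Audit failed: {null_keys} rows with NULL order_id")

# 3. Publish: swap staging into the live table in one transaction.
con.execute("BEGIN TRANSACTION")
con.execute("DROP TABLE IF EXISTS orders")
con.execute("ALTER TABLE orders_staging RENAME TO orders")
con.execute("COMMIT")
```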
- CSV to DB — What if you could open a CSV and re-order or rename the columns directly in the browser, without any backend call—with DuckDB, obviously? Shout-out to Théodore, a Data News subscriber, who developed this. It's a great idea.
See you next week ❤️.