Data News — Week 24.12
Data News #24.12 — My Friday routine, the 01 interpreter, RAG, xAI Grok-1, Apple entering the race, run Spark in BigQuery, Williams F1 using Excel BigData (lol) and more.
It's Friday and it's Data News. I don't go into too much detail about the magic of Data News, but every Friday is the same. At first, I'm like: oh shit, here we go again. And 10 minutes later I'm lost in reading the content and picking too many articles to fit into a thousand-word edition.
Usually the whole process takes me a full Friday. I organise myself as follows:
- During the whole week I scroll LinkedIn (too much) and save posts without reading them. Sometimes I also save stuff on Twitter by liking it. The reason I do this is to avoid context switching. Let's be honest, it works for the DN context, but it doesn't work in my life in general.
- Exploration, Friday morning
  - I read the last 7 days of 2 Twitter lists (MDS, Data voices) and I open interesting stuff in tabs.
  - Then I use Feedly, which is connected to ~500 websites, Reddit and Medium, and I open interesting articles in tabs.
  - Then I open the items saved from LinkedIn.
- Reading and writing, Friday afternoon
  - I read the articles and remove what I find irrelevant (context, values, quality, etc.). I create a first connection between all the links, trying to sort them to get a fluid path between the articles' ideas.
  - I usually go from ~50 links to 25 after the reading part.
  - I write in one go, from top to bottom.
- Publication—Once the Data News is ready, I just click publish, I don't proofread much (sorry for the typos). I think I already spend so much time selecting + writing that I can't be stuck in revision mode for long.
- Post-publication—After the publication I do my homework of promoting my own work (mainly on LinkedIn) and I run a few post-publication scripts for the Explorer / Recommendations. I also watch the click / opening stats and that's all. But I think I could do it better.
The process works well, but as you can see, because I use fresh news, it's just-in-time, which puts pressure on my Fridays. I'd like to have a few articles in stock to remove the pressure of having to write something every week, so I can take selected Fridays off.
❤️ I rarely say it, but if Data News helps you save time you should consider taking a paid subscription (60€/year) to help me cover the blog fees and my writing Fridays.
Just before I jump to the news, I'll speak at the MDS Fest 2.0 on April 10. MDS Fest is a free, virtual, 5-day conference about Modern Data Stack topics with a lot of awesome speakers, and there are a few talks I can't wait to watch. On my side I'll talk about Apache Superset and what you can do to build a complete application with it.
Ok. Now give me the news.
AI News 🤖
- 01 open interpreter — The 01 light is a small device operable with your voice that controls your home computer. I've rarely been amazed by the physical devices AI startups have produced lately, but this one is different. It's a small white sphere that understands what you say and then controls your computer's mouse to execute actions for you, whether you're in front of your computer or elsewhere.
The initiative wants to be open(-source) and they provided the code on GitHub. Actually they trained a "computer LLM". And they are the reason this newsletter was late, because you can build the physical device yourself with Arduino stuff (see the list of materials) and I wanted to do it today, but a piece was not available at the electronics shop 🥲.
Under the hood it uses a big prompt to instruct their LLM because, in the end, fuck you, show me the prompt. And actually it's always fun to read the prompts companies use to do specific tasks. Sometimes it looks like you're speaking to a child: caps lock and repetitions to make the algorithm understand.
- Finally, xAI released Grok-1 in the open — The weights are available via torrent / HF and everything is under the Apache license. The repo was released last Sunday after Musk publicly announced the release last week. I feel bad for the sweaty engineers who worked the whole week on it. I haven't seen a lot of feedback on it since.
- Apple tries to enter the LLM game (while facing fines) — Rumours say they will partner with Google to use Gemini to power iPhone AI features. At the same time, they published a paper about MM1, a family of multimodal models with up to 30B parameters.
- Microsoft hires DeepMind co-founder — Mustafa Suleyman will lead a new organisation called Microsoft AI. Following the announcement, Copilot, Bing, Edge and the GenAI teams will all move to the new organisation. Satya Nadella is going all-in on AI. It's important to say that Mustafa is joining Microsoft from Inflection, an LLM company in which Microsoft invested a year ago.
- OpenAI closing partnerships with major newspapers in Europe — After Axel Springer in Germany, they signed with Le Monde (France) and Prisa Media (Spain), the publisher of El País and El HuffPost. All these partnerships will help OpenAI train GPTs on media corpora to improve the reliability of the answers, and in return give the newspapers a significant source of additional revenue.
- Common Corpus — A HuggingFace dataset collection including public domain texts, newspapers and books in a lot of languages. Terabytes in size.
- Designing RAGs — A super long and detailed article about RAG. It covers the 5 main components: indexing, storing, retrieval, synthesis and evaluation. Let's be honest, it contains all you have to know about this new trend and what the key considerations are.
- Vector DB comparison — A table comparing all the different vector technologies on different axes like the search, models, APIs and technical details.
- Python codebase with best practices to support MLOps — This is a GitHub repository with a lot, I mean a lot, of tools and tips to create a production-grade repository.
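To make the 5 RAG components from the Designing RAGs article a bit more concrete, here is a minimal sketch in plain Python. Everything in it is an assumption for illustration: the bag-of-words `embed` function stands in for a real embedding model, `ToyVectorStore` stands in for a vector database, `synthesize` only builds the prompt without calling any LLM, and `hit_rate` is just one simple way to evaluate retrieval. It is not the approach described in the article.

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy embedding: a bag-of-words term-frequency vector (stand-in for a real model)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-frequency vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


class ToyVectorStore:
    """Storing: keeps (chunk, vector) pairs in memory."""

    def __init__(self):
        self.rows = []

    def index(self, documents):
        """Indexing: here each document is a single chunk, embedded as-is."""
        for doc in documents:
            self.rows.append((doc, embed(doc)))

    def retrieve(self, question, k=2):
        """Retrieval: return the k chunks most similar to the question."""
        q = embed(question)
        ranked = sorted(self.rows, key=lambda row: cosine(q, row[1]), reverse=True)
        return [chunk for chunk, _ in ranked[:k]]


def synthesize(question, chunks):
    """Synthesis: build the prompt an LLM would receive (no model is called here)."""
    context = "\n".join(f"- {c}" for c in chunks)
    return f"Answer using only this context:\n{context}\n\nQuestion: {question}"


def hit_rate(store, labelled_questions, k=2):
    """Evaluation: share of questions whose expected chunk is among the top k."""
    hits = sum(expected in store.retrieve(q, k) for q, expected in labelled_questions)
    return hits / len(labelled_questions)


# Hypothetical mini corpus, only for the demo.
DOCS = [
    "DuckDB can convert a CSV file to Parquet.",
    "dbt unit tests compare expected rows with model output.",
    "BigQuery can run PySpark stored procedures.",
]

store = ToyVectorStore()
store.index(DOCS)
question = "How do I convert CSV to Parquet?"
top_chunks = store.retrieve(question, k=1)
prompt = synthesize(question, top_chunks)
score = hit_rate(store, [(question, DOCS[0]), ("What do dbt unit tests compare?", DOCS[1])], k=1)
```

In a real setup you would swap `embed` for an embedding model and `ToyVectorStore` for one of the vector databases from the comparison table, but the 5 steps keep the same shape.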
Fast News ⚡️
- Run Spark procedures in BigQuery — BigQuery released a way to write PySpark code in the web editor and to run / deploy it from there, creating a new serverless way to create BigQuery assets. This is a nice way to mix SQL and Python code.
- pip install data-stack — This is a title I could have written myself. In this blog post, Julien covers the new Pythonic tooling and how far it can take us in building lightweight programmatic data stacks. He also mentions my baby yato.
- Pretzel notebooks — A new open-source notebook / exploration tool built on top of DuckDB WASM and PRQL. It allows you to chain operations like file upload, SQL, charting, filter, sort, etc. You can explore the demo.
- On the same topic, Hashquery launched — This is a Python framework to create semantic data models.
- Williams F1 used Excel to build their car — F1 parts (thousands of them) were managed in a spreadsheet. These Excel files were unmanageable, which explains why Williams had delays in deliveries. That's not surprising because I think during an F1 season there aren't a lot of breaks you can use to pay down technical debt.
- dbt Core unit testing in v1.8 — dbt Core has implemented unit testing and it's coming soon. When unit testing a model you can give input rows and say what you expect as output rows in the YAML definition. dbt will run and validate the model for you. This is a game changer.
- Awesome DuckDB snippets — A website that collects cool DuckDB snippets. The most popular is a 4-line bash command that you can add to your bashrc to convert a CSV to Parquet.
- The cost of data incidents — Mikkel is one of my favourite authors, he carefully picks his titles so that they resonate deeply with me. He proposes a formula to compute the cost of your data incidents, turning downtime numbers into $.
- 2023 in reading — This is a great side project idea. This is a visualisation of the hours spent by Erin reading books in 2023. Personally I just finished my first book of 2024 😅.
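A quick side note on the dbt unit testing item above: the idea of "give input rows, say which output rows you expect" can be sketched in a few lines of Python with the standard `sqlite3` module. This is only an illustration of the principle, not dbt's implementation or its YAML syntax. The `run_model_unit_test` helper, the `orders` table and the revenue model are all hypothetical names I made up for the example.

```python
import sqlite3


def run_model_unit_test(model_sql, inputs, expected_rows):
    """Load fixture rows into an in-memory database, run the model, compare rows.

    inputs maps a table name to (column names, list of row tuples).
    Returns (passed, actual_rows).
    """
    conn = sqlite3.connect(":memory:")
    try:
        for table, (columns, rows) in inputs.items():
            # Table and column names come from the test author, not from end users.
            conn.execute(f"CREATE TABLE {table} ({', '.join(columns)})")
            placeholders = ", ".join("?" for _ in columns)
            conn.executemany(f"INSERT INTO {table} VALUES ({placeholders})", rows)
        actual = conn.execute(model_sql).fetchall()
    finally:
        conn.close()
    # Compare as sorted lists so that row order does not matter.
    return sorted(actual) == sorted(expected_rows), actual


# Hypothetical model: revenue per customer, ignoring cancelled orders.
MODEL_SQL = """
SELECT customer_id, SUM(amount) AS revenue
FROM orders
WHERE status != 'cancelled'
GROUP BY customer_id
"""

passed, actual = run_model_unit_test(
    MODEL_SQL,
    inputs={
        "orders": (
            ["customer_id", "amount", "status"],
            [(1, 10, "paid"), (1, 5, "cancelled"), (2, 7, "paid")],
        )
    },
    expected_rows=[(1, 10), (2, 7)],
)
```

The nice part of this pattern is that the test never touches real data: the fixtures are tiny, so a failing test points directly at the model logic.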
This newsletter edition is already too long and I have 10 other deep articles that I'll keep for next week ❤️.
Recommendations were computed this Wednesday, go check what the algorithm prepared for you. The email notification feature is almost ready, so opt in on the reco page to get your recommended links by email once it's ready. I know the mobile version of the reco page is buggy, I'll work on it next week as well.