Data News — Week 24.45
Data News #24.45 — dlt Paris meetup and Forward Data Conference approaching soon, SearchGPT, new Mistral API, dbt Coalesce and announcements and more.
It's Data News time. Time really flies, and apart from the bad news from across the Atlantic, all is well on my side. To be honest, I miss you folks. Writing here has been my little thing for the last 3 years, and because I haven't been able to get back to my previous frequency since July, I feel empty every Friday.
I'm back in Paris and, wow, the way I live my life in Paris is so different from Berlin. Paris demands speed at every level. I've only been back 6 weeks and I feel like I never left, even though I was away for the last 2 years. I haven't yet settled back into my routines, as I keep jumping between all the hats I've decided to accumulate over the last few years: content, freelancing, founder and conference organiser.
I don't know when I'll be able to start writing here once a week again, but I'm doing my best to do it as soon as possible.
Enough.
On November 19th, I'm organising the first dlt Paris community meetup with the dltHub folks. The event will take place at 42 and will start at 16h. It will feature:
- Navigating the complexities of enterprise ELT, towards data democracy and cost efficiency
- Me — Towards a simple future (dlt, DuckDB, yato and more)
- dltHub CTO, Marcin Rudolf — A teaser of the upcoming dltHub "Portable Data Lake"
- Lightning community talks (reach out if you want to present something)
It will take place only a few days before the Forward Data Conference, which sold out a few weeks ago. The program is out and the official schedule is coming soon. We are very proud and honoured, along with the organising committee, that around 300 people bought a ticket to the event. I'm sorry for those who weren't able to get one. We've set up a waiting list and we're doing our best to find a way to push the walls.
I would also like to thank the sponsors who are accompanying us on this exciting adventure: Castordoc, Omni and Corail Analytics, Mirakl, SYNQ, nibble, Sparkline, Monte Carlo.
Whether at the dlt community meetup or at Forward, I look forward (no pun intended) to meeting you all.
AI News 🤖
I'm a bit offended: AI news is not what it used to be 🙃. We were used to more exciting news, competition and drama in the space.
- ChatGPT Search — OpenAI finally plugged ChatGPT into the internet and live data. You can now switch on the web logo and ask the model to search the web alongside its training knowledge. When you mix it with the new Canvas UI, ChatGPT looks more and more like Google Search results.
- HubSpot co-founder bought chat.com earlier this year and sold it to OpenAI for shares [via Oliver].
- Claude Computer Use — Like a soufflé, everyone was hyped when Anthropic released Computer Use for Claude, a chat interacting with an operating system, but a few days later it looks like almost everyone has forgotten it, like the 01 interpreter. If you wanna try Computer Use (at your own risk), there is a repo with a Docker image launching a VNC and a Streamlit app, and it works fine.
- New Mistral APIs — a batch API, for batching calls rather than making them synchronously, which lowers the costs by 50%, and a moderation API, which is a 9-category classifier that scores text against each category.
- IBM released lightweight open-foundation models — 2 sets of "small" models: 2B / 8B dense models and mixture-of-experts 1B / 3B models. IBM has proudly shared the datasets they used to train their models.
- Skrub: Less data wrangling, more machine learning — skrub is a preprocessing / feature engineering library for tabular machine learning. The video emphasises something critical: even if we often talk about the impact of training (time, carbon footprint), we tend to forget that inference also matters, all the more so because preprocessing runs at inference time too. So your preprocessing matters.
- Stanford, "Building Large Language Models (LLMs)" — 1h44 of a Stanford class about building LLMs. I did not watch it, but I bet you're gonna learn at least one thing watching it.
- Generative AI’s Act o1 — This Sequoia essay deep dives into the current state of LLM apps and infra, which has stabilised around Microsoft/OpenAI, AWS/Anthropic, Meta and Google/DeepMind. The text touches on fast vs. slow reasoning, System 1 and System 2 thinking, while we wait for the almighty AGI to come and plunge us into a new technology era.
- Apple proved LLMs do not reason (at least mathematically) — Did we actually need Apple for this? Because it's actually by design, not a bug or a feature. Apple researchers published a paper saying: "our work underscores significant limitations in the ability of LLMs to perform genuine mathematical reasoning". No way.
- The future is robot — In recent weeks a lot of new robots were (re-)announced, possibly with new features: humanoid robots, Tesla Optimus, Boston Dynamics. Happy to learn that we really need robots to fix the job market.
Fast News ⚡️
- Russia fines Google $20,000,000,000,000,000,000,000,000,000,000,000 — because a few Russian YouTube channels that were blocked 2 years ago are still blocked.
- Why is Kafka always late? — A great, long article about how Kafka handles time, and which configuration levers you can move to fix potential issues.
- The path to vendor-agnostic data platforms — The Iceberg trend and the competition between Snowflake and Databricks might be scary for the future, so thinking about vendor-agnostic data platforms based on open-source technology is becoming a thing again. dlt + DuckDB is a combo that bridges a lot of gaps.
- Snowflake and Databricks finger pointing like kids — Recently both CEOs played the card of "we are better than them and we are so different". Someone wrote a side-by-side comparison with numbers. Regarding performance, if you're going the Iceberg way, here is a benchmark with Snowflake.
- Analytics-optimized concurrent transactions in DuckDB — Technical writeup about the concurrency concepts at stake within DuckDB and a good Hannes interview.
- DuckDB inside Postgres!!?? — Daniel Beach showcased what pg_duckdb is and how it performs out of the box. He noticed that pg_duckdb is slower than raw Postgres, mainly because the extension does not support indexing yet.
- Data warehousing is dead — Actually the article is saying the opposite: with all the fuss around lakes, really big data warehouses with HUGE DATA are still out there. Funny article.
- Why Microsoft Excel won’t die — A small reminder for me (and my co-founder).
- Python becomes the most popular language on GitHub — Who would have guessed. According to GitHub, Python overtook JavaScript as the #1 language. A lot of metrics are shared in the post, like the 5B contributions made on GitHub in 2024 🤯.
- GitHub announced GitHub Spark — Can we enable anyone to create or adapt software for themselves, using AI and a fully-managed runtime? With a natural language based editor, everyone can spark an application ✨.
- The road ahead: What’s coming in Airflow 3 and beyond? — Keynote from Airflow Summit about Airflow moving towards assets and more.
- Managed Polaris by Snowflake — The open source Iceberg catalog can be managed with the Snowflake UI.
- Fivetran is now in Snowflake marketplace — Tomorrow, the warehouses will ingest their data themselves: Fivetran can be used from the Snowflake marketplace.
- dbt runs in Fivetran becomes a paid feature — you start to pay when you run more than 5000 models a month. In my local community in France I don't know many companies orchestrating dbt this way.
- Ingestion performance comparison — dlt compared its own performance against other tools. Tbh, I'm quite impressed by dlt perf.
- dbt Coalesce 2024 — Sorry, I wasn't able to pull together a large recap of all the Coalesce talks this year, it's still in my todo tho. Gleb wrote about it on LinkedIn; mainly, dbt Explorer + mesh becomes more complete, there is a new visual experience (like Alteryx or Knime) to write models, and Iceberg. All of this illustrates the dbt Labs strategy: targeting large companies (visual editor for non-tech people) and a move towards emancipation from the warehouse with Iceberg.
- dbt: incremental but incomplete — A critique of the new dbt microbatch incremental feature, which works with Postgres, BigQuery, Spark and Snowflake. If I understand it right, microbatches are a way to batch your large incremental queries by a time dimension that you specify; see it as chunking.
- Loading data into Redshift with dbt — lmao it's been ages since I last heard about Redshift, so I wanted to do a shout out.
- dbt-column-lineage-extraction — A Python CLI tool from the Canva team.
- BigQuery's AI-assisted data preparation — Do data preparation in the fresh new BigQuery editor with natural language description of the transformations.
- Enabling and disabling Studio in Looker — Looker Studio will become a feature that you activate / deactivate in your Looker setup.
- ❤️ In the era of GenBI — one of the best posts of this week.
- Lightdash raises $11m — I really like Lightdash, it's a promising mix of a BI tool and a semantic layer, all as code.
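To make the "see it as chunking" remark about dbt microbatch more concrete, here is a minimal Python sketch of the idea: split a time range into fixed windows and process the rows of each window separately. This is not dbt's implementation. The function names, the `event_time` column and the daily batch size are all assumptions made for illustration.

```python
from datetime import datetime, timedelta


def microbatches(start, end, batch_size=timedelta(days=1)):
    """Yield half-open (lower, upper) windows covering [start, end).

    Hypothetical helper: the last window is shortened so it never passes `end`.
    """
    current = start
    while current < end:
        upper = min(current + batch_size, end)
        yield current, upper
        current = upper


def rows_in_window(rows, time_column, window):
    """Keep the rows whose time column falls inside the half-open window."""
    lower, upper = window
    return [row for row in rows if lower <= row[time_column] < upper]


# Made-up events, only here to show how rows are spread across windows.
events = [
    {"id": 1, "event_time": datetime(2024, 11, 1, 9, 0)},
    {"id": 2, "event_time": datetime(2024, 11, 2, 14, 30)},
    {"id": 3, "event_time": datetime(2024, 11, 2, 23, 59)},
]

# Three daily windows: Nov 1, Nov 2 and Nov 3.
windows = list(microbatches(datetime(2024, 11, 1), datetime(2024, 11, 4)))

# Each window could be loaded (or retried) on its own.
batches = [rows_in_window(events, "event_time", w) for w in windows]
```

Because each window is independent, a failed window can be reprocessed without touching the others, which is the main appeal of chunking a large incremental query along a time dimension.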
Food for thought
- Docker for data products.
- Why Agile and Product Management fail with Data & AI?
- Who’s really responsible for team failures?
- Empowering Engineers with AI at Slack.
See you soon ❤️
PS: for our product research with nao we are looking for analytics engineers who work on modeling daily. If you fall into this bucket, answer by saying "hi i'm an analytics engineer" and we will follow up on this 🤗.