Data News — Week 24.09
Data News #24.09 — Mistral AI, Klarna AI customer support agent, extract and load still unsolved
Hello all, this is the Data News. This week's edition might be lighter than usual in terms of comments, as I'm working on a Data News-related project that is taking a bit of my time and will probably lead to a series of articles.
Before I forget: I appeared on The Joe Reis Show. Joe and I chatted about teaching data engineering, why it is hard, and how generative AI will change education forever. It's a 1h podcast, I hope you will enjoy listening to it.
Final reminder: next week La Conférence MLOps takes place in Paris on March 7th. If you want to register, I still have a 40% promo code: mlops-blef-40. I'll give a talk, in French, about how to put machine learning in production at a small scale, a topic related to the Data News project 😬.
AI News 🤖
- Mistral AI announcements
  - Mistral Large, their new flagship model, which outperforms competing models except GPT-4. At the same time, Microsoft closed a partnership with Mistral to make Large available on Azure, their first distribution partner. It has led to a lot of discussion in French politics about Mistral AI being more American than French. With the partnership, Microsoft entered the Series A with a €15m addition, joining a16z.
  - They also released a smaller model called Mistral Small.
  - Le Chat, the conversational interface to interact with Mistral models.
  - Final comment: with these 2 announcements Mistral left the open side to go commercial / closed. It led to conversations where people felt betrayed by Mistral, which built its differentiator (or should I say marketing) on top of open-source/open-weight models. Mistral perdant (Mistral losing).
- GitHub Copilot Enterprise is now generally available — This week I started using GitHub Copilot (not the Enterprise version). And let's be honest, it is a productivity boost, especially when you want to write docstrings and comments. Still, there is an annoying interaction in PyCharm where Copilot takes up too much space. Copilot Enterprise mainly comes with 3 features: it understands your whole org's codebase, offers a chat to ask questions about the codebase, and summarises pull requests.
- Fast, efficient active speaker detection on videos — This is a great introduction to active speaker detection, which means detecting speakers' faces in a video and whether they are actually speaking or not.
- Klarna's AI customer support agent does the equivalent of 700 agents — Klarna developed an AI agent that interacts automatically with customers, driving profit. It has to be put in context.
- Using DuckDB + Ibis for RAG — Handy code snippet to explain why DuckDB is a good solution, bringing the best of both worlds when it comes to RAG.
Extract and load, still unsolved 🤭
I started writing data pipelines in 2014, and moving data from sources to destinations has always been one of the most discussed topics in my data engineering circles. Personally, I'm the kind of guy who likes to build it custom because I don't think an out-of-the-box solution exists. In the end you end up with a composable solution mixing 2 or 3 technologies to extract and load your data into your central storage, ready for transformations.
In 2024 we have more tools than ever to move data from sources to destinations. But the field has taken a new direction.
Until now, solutions were mainly full platforms (often in the cloud) promising to do everything, in search of rebundling the data platform (cf. The unbundling of Airflow). Recently, things have gone a step further: what if extract and load were just a small library layer that integrates with whatever you're doing? For people who read me carefully, this is what I was calling for in using Airflow the wrong way, but the fun way.
Enter the new kids on the block:
- dlt — it stands for data load tool, and it's a Python library installable with pip. It provides a framework to do the extract and load: you define sources and resources, describe the specifics of the resources you want to load (primary keys, write disposition, incremental mode, etc.) and the library does the heavy lifting accordingly.
- PyAirbyte — Airbyte announced their Python library in beta. It currently supports around 250 sources, which is a subset of all Airbyte sources (only the ones written in Python), and it seems it does not support connecting to classic databases. They call a destination a Cache, which is a terrible name. Even if the library is a great idea, I find it sad that the interoperability with Airbyte is not 100%.
Adrian from dlt wrote a small post about PyAirbyte.
- CloudQuery — Written in Go, YAML driven configuration to move data.
- ingestr — ingestr is a CLI tool to copy data between any databases with a single command seamlessly. It's built on top of dlt.
- Sling — Sling is a CLI tool that extracts data from a source storage/database and loads it in a target storage/database. Written in Go.
- Let's not forget Meltano.
We see a pattern here: when we talk about extract and load there are 2 kinds of sources, databases and APIs, and being able to do both correctly is key.
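To make the settings mentioned for dlt a bit more concrete (primary keys, write disposition, incremental mode), here is a small library-free sketch of what they mean. This is not dlt's API: the `load` and `extract_incremental` functions, the sample rows and the field names are all hypothetical, and a real tool handles much more (schemas, state, retries).

```python
from typing import Any, Dict, Iterable, List, Optional

Row = Dict[str, Any]


def load(
    table: List[Row],
    rows: Iterable[Row],
    write_disposition: str = "append",
    primary_key: Optional[str] = None,
) -> List[Row]:
    """Return the new table content after loading `rows` into `table`."""
    rows = list(rows)
    if write_disposition == "replace":
        # Drop what was there and keep only the new rows.
        return rows
    if write_disposition == "append":
        # Keep everything, duplicates included.
        return table + rows
    if write_disposition == "merge":
        if primary_key is None:
            raise ValueError("merge needs a primary key")
        # Upsert on the primary key: the last row seen for a key wins.
        merged = {row[primary_key]: row for row in table}
        for row in rows:
            merged[row[primary_key]] = row
        return list(merged.values())
    raise ValueError(f"unknown write disposition: {write_disposition}")


def extract_incremental(
    source: Iterable[Row], cursor_field: str, last_value: Any
) -> List[Row]:
    """Keep only rows whose cursor is strictly greater than the last loaded value."""
    return [row for row in source if row[cursor_field] > last_value]


# Hypothetical source rows (assumption: ISO dates, so string comparison works).
source = [
    {"id": 1, "name": "Ada", "updated_at": "2024-02-01"},
    {"id": 2, "name": "Bob", "updated_at": "2024-02-20"},
    {"id": 1, "name": "Ada L.", "updated_at": "2024-02-25"},
]
destination = [{"id": 1, "name": "Ada", "updated_at": "2024-02-01"}]

new_rows = extract_incremental(source, "updated_at", last_value="2024-02-01")
destination = load(destination, new_rows, write_disposition="merge", primary_key="id")
print(destination)
```

With a merge on `id`, the updated row replaces the old one and the new row is added, which is the kind of heavy lifting you would otherwise write by hand for every source.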
On the other side of the movement there is a new open-source reverse-ETL technology called Multiwoven/multiwoven. This is built in Ruby (haha). At the moment it can sync to Facebook, Salesforce and Slack.
Fast News ⚡️
- Your first 90 days as a head of data — Handbook and roadmap with pragmatic insights on what to do in your new journey as head of data.
- Career pathways of data engineers — IC, manager, being a data engineer or a data full stack. It covers great topics.
- Google Search filetype:pdf was not working for a moment — The internet panicked and believed Google was continuing its downfall. But it was actually a bug.
- BigQuery time series data — BigQuery will now support time series analyses.
- Snowflake access management — I've already shared Teej's work in the past, but now he has launched a company to solve Snowflake access management, using code. I bet it's going to become the best solution out there for this issue.
- Rill dashboards for ClickHouse — Rill now works with ClickHouse.
- Fast Poetry and pre-commit with GitHub Actions — An efficient and useful GitHub Actions setup to cache Poetry installs in CI.
- BigQuery cost dashboard app — Hashboard developed a dashboard to help you follow your BigQuery costs, and it's free. It uses Hashboard, which is a BI tool. Even if you don't use it, it gives good ideas about what to track.
- Introducing DoorDash’s in-house search engine — Custom searcher built on top of S3.
Tech stuff
- Adidas Lakehouse engine.
- Why Apache Spark RDD is immutable?
- High-order and partially applied functions in Python.
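If the last topic is new to you, here is a tiny sketch of both ideas using only the standard library. The `apply_all` and `scale` functions are made up for the illustration and are not taken from the linked article.

```python
from functools import partial
from typing import Callable, Iterable, List


def apply_all(func: Callable[[int], int], values: Iterable[int]) -> List[int]:
    """Higher-order function: it takes another function as an argument."""
    return [func(value) for value in values]


def scale(value: int, factor: int) -> int:
    """Plain two-argument function that we will partially apply."""
    return value * factor


# Partial application: fix `factor` once and get a one-argument function back.
double = partial(scale, factor=2)

doubled = apply_all(double, [1, 2, 3])
print(doubled)
```

`partial` returns a new callable with `factor` already filled in, so it can be passed to anything that expects a one-argument function.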
See you next week ❤️