Data News — Week 23.02
Data News #23.02 — Switching from pandas to Polars, hiring processes, the new age of machine learning, how query engines work, and the data economy.
Hey. These have been busy weeks, and I'm sorry the Data News is coming out on Saturday again. It's hard to travel by train, work, and write at the same time. Plus, I'm a fast context switcher, so things pile up. Also, a few of you have sent me messages recently that I haven't answered yet; I see you and I haven't forgotten you. Now that I'm back in Berlin it will be easier.
Last week we organised the first Paris Airflow meetup of the year: a round table I moderated with Benoit Pimpaud, Furcy Pin and Marc Lamberti. We talked about the place of Airflow in 2023, the unbundling of Airflow, and the best way to run your Airflow DAGs today.
The discussion was in French and the recording will be released next week. In the meantime you can check my article Using Airflow the wrong way, which summarises the operators vs. containers debate. During the meetup we did not talk about Airflow alternatives; right now Mage is the rising tool that everyone is trying out as an Airflow replacement.
Enjoy the Data News.
Polars—Pandas are freezing
Recently, influencers have been betting that Rust will become the de facto language of data engineering. History repeats itself: we've seen this with Scala, Go, and even Julia at some scale, yet in the end Python and SQL are still here for good. But with Rust the approach is different: the idea is not to replace Python but to replace the underlying native bindings used by Python libraries.
And it makes sense: for instance, ruff is a Python linter built in Rust that claims to be dramatically faster than the usual tooling.
On the data processing side there is Polars, a DataFrame library that could replace pandas. Let's have a quick look at it. In this overview I won't talk about performance, because I haven't had the time to do a proper benchmark (and I've never done one). This is just the experience of a beginner who knows pandas very well.
The installation is pretty straightforward; you can do it with pip. Compared to pandas this is great, because Polars appears to have no dependencies, so the install is much lighter than pandas with its dependency tree.
pip install polars
Regarding imports, the documentation continues to treat me well; it looks like the pandas idioms I already know.
import polars as pl
Then I can do my first CSV import. In this example I load a French railway open dataset about lost-and-found objects in stations.
df = pl.read_csv("lost-objects-stations.csv", sep=";")
Then you can use the same code as in pandas to select data (head, ["col"], etc.). Now I want to try a group by.
df.groupby("Station").agg([pl.count()]).sort("count", reverse=True)
# The same query in pandas
df.groupby("Station")["Date"].count().sort_values(ascending=False)
And lastly (because if I continue, the newsletter is going to be too long for you to read), I tried converting a str Series to datetime.
To be honest, I only tried Polars for 15 minutes and I can already see myself switching to it if I had the guarantee that it is significantly faster. The APIs are quite similar, so I'm far from being lost.
🫠 If, after this small introduction, you want a deeper comparison you can check Modern Polars by Kevin Heavey or a 40-minute YouTube video that explains Polars internals.
Hiring processes
The current state of the data market is weird: we have a lot of layoffs and, at the same time, a lot of companies still looking for data folks. This hiring is often critical for them, yet they struggle. There is a huge gap between the jobs on offer, what candidates are looking for, and what companies are looking for.
This week Teads shared their engineering hiring process. The process is not focused on data specifically, but it is still relevant because it can give ideas to hiring companies or to juniors looking for advice. They have a short four-touchpoint interview process, which looks like a good compromise.
Focusing more on data, Galen wrote about what he looks for in data analyst candidates. One of the most interesting pieces of advice he gives, and one I want to stress, is: spend time mastering the technologies you've chosen. In the current state of data it is easy to lose focus, so listen to him: stop chasing the latest data trends and master what you use daily. I think mastery in one domain transfers easily to others.
I would like to start proposing job offers that I personally validate, following an open checklist. Companies would pay for this service, and it would be a way for me to get something in return for the curation and writing work I do every week.
AI Saturday
- How DoorDash upgraded a heuristic with ML to save thousands of cancelled orders — When running a marketplace, this is a common problem to deal with. DoorDash shares the models they used to replace their intuition.
- 👀 Building a defensible Machine Learning company in the age of foundation models — This article is very thorough; it is probably the best-written piece about current trends in machine learning. A whole ecosystem is shifting from building everything yourself to consuming foundation models and APIs built by others.
Fast News ⚡️
- Announcing dbt-fal adapter — I shared fal months ago when they launched. I'm still on their Discord, and when dbt finally announced Python models support I was a bit sceptical about fal offering the same thing. But dbt's solution is narrowly scoped: Python models only run in the warehouse. With this release you can really mix Python and SQL code.
- How we cut our Databricks costs by 50% — We can always find optimizations in our cloud setup to save costs.
- How to land a job in progressive data — If you want to use your skills to Do Good, take a look at Brittany's post about progressive data.
- Statement of objections issued against Google’s data processing terms — The German competition authority said that Google should be more explicit about how user data is processed to benefit Google's business.
- How query engines work — This is a web book that explains how query engines work. I haven't read it yet, but it looks great.
- Analysis of Confluent buying Immerok — Jesse Anderson analyses last week's news of Confluent (Kafka) buying Immerok (Flink) and what it implies for the competition between Kafka, Flink, and Spark in low-level real-time technologies.
- On Data Contracts, Data Products and Muesli — Another post on data contracts; a bit too long for me to read. Sorry.
- Extracting, converting, and querying data in local files using clickhouse-local — It's awesome how far ClickHouse can go. It looks like a broader alternative to DuckDB, and also a good trend for other warehouses: providing a local experience that lives outside the cloud.
- ByteGraph: A Graph Database for TikTok — ByteGraph is the graph database developed by ByteDance, the company behind TikTok. This article walks you through the key concepts needed to understand it. To be honest, I'm quite impressed by the first line stating that it has been designed to support OLAP, OLSP, and OLTP workloads.
Data Economy 💰
- Metaplane raises $8.4m in seed funding. The claim is bold: Metaplane wants to be the Datadog for data. Operating in the data observability space, it offers the usual set of features: tests, data quality monitoring based on historical data, lineage, and alerts.
- XetHub raises a $7.5m seed round. XetHub brings git to data file management. They support repositories of up to 1TB with git-like commands (checkout, push, commit, pull, etc.). I think XetHub is super useful in data science, where we need to keep the data alongside the models. When you commit a change to a big file, their repository hub summarises the data diffs.
- Generative AI is booming: following all the stories about a possible $10b Microsoft investment in OpenAI, Seek AI raises a $7.5m seed round. Seek AI's promise is a prompt where you can ask your data anything and the AI answers directly on top of the raw data.
See you next week, maybe on Friday ❤️.