Data News — Week 22.49
Data News #22.49 — ChatGPT, Paris Airflow Meetup takeaways, GoCardless data contracts implementation, schema drift, Pathway and Husprey fundraises.
Hello there, this is Christophe, live from the human world. Last week was totally driven by the ChatGPT frenzy: the social networks I follow are spammed with conversation screenshots and hype. On my side I don't know what the future holds for us, but MaaS (Models as a Service) doesn't look bright to me. OpenAI executed it perfectly: they dedicated a gigantic amount of computing power to offer a neat pay-as-you-query experience, like BigQuery. And I bet it will transform our industry as much as BigQuery did. But do we want big companies holding decision power in their own pre-trained models, leaving real data science to the big ones?
I don't want to be alarmist, this is not the tone I have here in the Data News, but do we want a future where the support chat of our home train service or our mobile carrier is, under the hood, run by one of Musk's companies? Ok, it's a caricature, but imagine. I can't wait to see an Excel sheet comparing the average cost per word written between a human and a machine.
🎄 Let's switch topics. It's time for the Advent of Data heads-up. Since last week's edition, 6 new articles have been published in the calendar. Go taste your daily chocolates. In a nutshell, you can now develop an internal pip package for your data team, handle governance, explain to stakeholders what you're doing, send AI models to small devices, understand Rust for data engineering, and learn 3 key geospatial metrics.
Paris Airflow Meetup 🧑‍🔧
On Tuesday I organised the 4th Paris Apache Airflow Meetup, the first one since 2019, and it was awesome: I met a lot of people, and the talks and the venue were great. The goal now is to hold one meetup per month in 2023. For this I'll be looking for speakers and hosts, so if you live in France and want to share something with the French community, reach out to me; I have a lot of ideas.
After a small introduction, the evening started with a presentation by Clément and Steff from the leboncoin data engineering team. They shared the good practices they implemented to scale their Airflow development. To give a figure, at leboncoin 7 teams use Airflow to operate more than 1,000 DAGs. Here is a short takeaway of their presentation in English:
- Stop using custom Operators or Hooks if there is a community one available—this point is particularly relevant if you feel your custom stuff creates tech debt
- Be careful with Airflow Variables: each Variable.get call hits the database and hurts performance. Their replacement was Jinja templating combined with something more traditional in app development: a constants file (see the sketch after this list).
- Use priority_weight; for this they created an enum with 5 human-readable priority levels.
- And lastly: give ownership context to DAGs, develop custom macros for repetitive tasks like generate_s3_url, use the pendulum date library to avoid the pain of managing dates, use cluster policies, and finally write tests. And if you don't know how to write tests, have a look at how Airflow itself is written and copy what they do.
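To make the Variables and priority_weight tips concrete, here is a minimal sketch of a hypothetical DAG (my illustration, not leboncoin's code; the DAG and Variable names are made up):

```python
from enum import IntEnum

import pendulum
from airflow import DAG
from airflow.operators.bash import BashOperator


class Priority(IntEnum):
    # Human-readable priority levels mapped to Airflow's priority_weight.
    CRITICAL = 50
    HIGH = 40
    MEDIUM = 30
    LOW = 20
    BACKGROUND = 10


with DAG(
    dag_id="daily_export",  # hypothetical DAG
    start_date=pendulum.datetime(2022, 12, 1, tz="Europe/Paris"),
    schedule_interval="@daily",
    catchup=False,
):
    BashOperator(
        task_id="export",
        # The Jinja template is rendered at task runtime, so parsing the DAG
        # file triggers no database call, unlike a module-level Variable.get.
        # "export_bucket" is a hypothetical Variable name.
        bash_command="python export.py --bucket {{ var.value.export_bucket }}",
        priority_weight=Priority.CRITICAL,
    )
```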
Then the Qonto data engineering team, with Charles & Charles, shared how they integrated dbt with Airflow. After a small introduction to the classic modern data stack combo (Snowflake, dbt, Tableau, Airflow), Charles presented what dbt is and what the options are to integrate dbt with Airflow.
In a nutshell you have 3 options to do it:
- You use the DbtCloudRunJobOperator, but it requires dbt Cloud
- You use a BashOperator that runs the dbt run command
- You use multiple BashOperators, each running a dbt run --select model command
Qonto decided to go for the last option. Then the other Charles detailed what it means in practice and how they monitor what is happening. Obviously there are a few pros and cons to this approach (a minimal sketch of the setup follows the list):
- cons: the Airflow UI does not like having too many tasks (especially the graph view); in their setup with a KubernetesExecutor it means a lot of cold starts, because each model run means a new pod with a dbt CLI bootstrap; and you have a lot of dependencies to manage
- pros: you are very flexible because you can run one model at a time if you want; incident management is simplified because dbt's shortcomings on this topic are covered by Airflow's standard features; and monitoring can be done at the model level
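Here is a minimal sketch of the third option (my illustration, not Qonto's actual code). The model list is hypothetical and hardcoded; in a real project you would likely generate the tasks and dependencies from dbt's manifest.json artifact:

```python
import pendulum
from airflow import DAG
from airflow.operators.bash import BashOperator

# Hypothetical model names; in practice, generate this list and the
# dependencies between models from dbt's manifest.json artifact.
MODELS = ["stg_payments", "stg_customers", "fct_revenue"]

with DAG(
    dag_id="dbt_models",
    start_date=pendulum.datetime(2022, 12, 1, tz="Europe/Paris"),
    schedule_interval="@daily",
    catchup=False,
):
    tasks = {
        model: BashOperator(
            task_id=f"dbt_run_{model}",
            # One task per model: each model gets its own retries, logs
            # and priority, at the price of many small tasks (and pods).
            bash_command=f"dbt run --select {model}",
        )
        for model in MODELS
    }
    # Example dependency: the fact model runs after the staging models.
    [tasks["stg_payments"], tasks["stg_customers"]] >> tasks["fct_revenue"]
```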
In the end they showcased the Metabase dashboard that helps them understand every dbt run. It is very complete, mixing data from Airflow (with a clever trick: they use XCom to save metadata in the Airflow database, so it can be queried from Metabase) with the dbt artifacts.
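I don't know Qonto's exact implementation, but the XCom trick can be sketched like this: whatever a task returns is stored as an XCom row in Airflow's metadata database, which a BI tool like Metabase can then query alongside the dbt artifacts (task and field names below are hypothetical):

```python
import json

from airflow.decorators import task


@task
def report_run(model: str) -> dict:
    # Assume `dbt run --select <model>` just ran and produced the
    # run_results.json artifact in the target/ directory.
    with open("target/run_results.json") as f:
        run_results = json.load(f)
    result = run_results["results"][0]
    # The returned dict is pushed to XCom, i.e. persisted in Airflow's
    # metadata database, where Metabase can query it.
    return {
        "model": model,
        "status": result["status"],
        "execution_time": result["execution_time"],
    }
```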
PS: shout-out to the people I met there who read the newsletter; your kind words matter and give me a lot of motivation. See you soon ❤️.
Fast News ⚡️
- How to avoid “schema drift” — This article puts a word on the schema drift concept, which is the same as configuration drift (e.g. in Terraform) but for data. It happens, for instance, when you have 2 producers of the same event that don't use the same type for a column (see the toy example after this list). Although the article is a bit vendor-oriented, it is still relevant and will ring a bell for a lot of engineers.
- 7 lessons from GoCardless’ implementation of data contracts — Before the ChatGPT hype, all of LinkedIn was talking about data contracts. Here are the takeaways from GoCardless. To be honest, I should take more space than a bullet point to detail what they are doing; the key learnings are worth reading.
- What I learned in my first 6 months as a director of data science — tl;dr be ready to rumble for the hiring competition.
- Using server sent events to simplify real-time streaming at scale — Interesting discussion about concepts around real-time communication for apps.
- Zero trust with Kafka — Sorry I've read too many articles these days, my brain can't process this one but I like diagrams.
- How to get REALLY good at advanced SQL — It may be interesting for a few of us; this article briefly covers how to grow your SQL expertise.
- Generating Chatbot performance insights using Spark SQL at Helpshift.
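On the schema drift point above, here is a toy example of my own (not from the article): two producers emit the "same" event, one of them drifts to strings, and a naive check catches it.

```python
# Producer A conforms; producer B drifted to stringly-typed fields.
payload_a = {"order_id": 42, "amount": 19.99}
payload_b = {"order_id": "43", "amount": "19.99"}

EXPECTED_SCHEMA = {"order_id": int, "amount": float}

def drifted_fields(payload: dict) -> list:
    # Return the fields whose type no longer matches the expected schema.
    return [
        field
        for field, expected_type in EXPECTED_SCHEMA.items()
        if not isinstance(payload.get(field), expected_type)
    ]

print(drifted_fields(payload_a))  # [] -> conforms
print(drifted_fields(payload_b))  # ['order_id', 'amount'] -> drift detected
```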
Data Fundraising 💰🇫🇷
- Pathway raises a $4.5m pre-seed round. This is an insane amount of money for a pre-seed. Pathway is a French startup, in open beta, providing real-time processing. You pip install their package and then you can transform your tables in Python. Transformations are operations like select, index, filter, join or map.
- Husprey raises a $3m seed round. Husprey provides an alternative to the dashboard world for data analysis with advanced SQL notebooks. They already have a large number of connectors and even integrate with dbt. Husprey is also a French-founded company.
See you next week ❤️.