Data News — News 36
Data News #36 — Nextflow and Tabular fundraising, Career choice DS or AE?, Improve your data communication, ETL writing advice, and more.
Hi, it's me again. This week will feature more technical articles than organizational ones. Since I celebrated my first year of freelancing last week, I would like to share some numbers from this journey so far.
- 🔥 We crossed 300 subscribers for the newsletter in 5 months and the blog got 7k pageviews. I also almost doubled my LinkedIn audience in the same time.
- 💰 I've worked with 5 clients over the year and generated around 140k€.
- 🤩 For the next 6 months, 2 interns will join me to grow my content creation efforts.
I just want to say thank you to everyone who made this possible.
Data fundraising 💰
- Seqera Labs, a Barcelona-based startup founded at the Centre for Genomic Regulation, raised $5.5m in a funding round. They developed Nextflow, an open-source orchestration framework that they say is focused on scientific use cases. To be honest, I had never heard of Nextflow, but it already seems pretty advanced.
- A new data platform is entering the room: Tabular. Tabular is built around Apache Iceberg (an open table format for huge analytic datasets) and they raised an undisclosed amount in a Series A from Andreessen Horowitz (a16z). The 3 founders, Jason, Dan and Ryan, seem to have met at Netflix, the birthplace of Iceberg (where Dan and Ryan co-created it). I can't wait to see what's coming next!
Our Data systems are broken and deteriorating fast
This is a big piece: Medium says it is a 28-minute read. Still, I really enjoyed reading this article, which tries to understand how Big Tech broke everything by trying to monetize our data and then lost control of what it built. Huge credit to Dave Fay.
Career choice — Data Scientist or Analytics Engineer?
Jason Ganz from the dbt Labs community team explained why he chose the Analytics Engineering discipline over Data Science. In 3 major points he describes why analytics will remain the frontend of data and the reasons he pursued this path.
Improve how you communicate with data
This post published in Towards Data Science explains how to find the right balance when adding context to a visualization, between too little and too much.
Migrating Airflow from Amazon EC2 to Kubernetes
The Snowflake team wrote about their Airflow migration from Amazon EC2 to Kubernetes. I really like this kind of myth-busting article: seeing the architecture and the tools behind a black box (yep, Snowflake can act as a black box in terms of data platform) is really cool. Besides, the post describes their whole Kubernetes architecture for Airflow, a must-see for those interested.
dbt setup feedback
It's been almost 6 months since I started using dbt and I'm starting to like it. From what I see, others like it too: this week Jaya wrote his feedback and takeaways on what to look for when you are setting up dbt. I can't wait to try model templates as he describes.
Dask vs. PySpark
If you are interested in Dask, or if you are asking yourself whether it's a viable alternative to PySpark, confessionsofadataguy tried it recently for you. I don't want to spoil the article for you, but the result is not as expected (at least for me).
Time to write some ETL
- If you are still struggling to understand the key differences between ETL and ELT, this is for you.
- If you are writing your first Airflow pipelines but don't know how to start unit testing them, this is for you.
- If you still don't know what Great Expectations can add to your data pipelines, this is for you.
- Following the series on MLOps security, here are 14 principles to secure your data pipelines.
Fast News
- AI Checklist — a small checklist to check each time you want to build a model
- DIY, Air Quality logger — if you like do-it-yourself projects, I suggest this air quality logger: for $30, it is a small project to play with data
- PaddleOCR — PaddlePaddle released PaddleOCR, an OCR tool. The demo on Hugging Face works with Chinese, English, French, German, Korean and Japanese; you can check it out.
- Learn Clustering in Python — "Using Natural Language Processing (NLP) and K-Means to cluster unlabelled text in Python"
- prettymaps — a HUGE tool to turn OSM maps into beautiful artwork.
See you next week ❤️
If you are interested in having a chat with me, do not hesitate to drop me an email at hi@blef.fr or connect on LinkedIn.