Data News — Week 30
Data News #30 — SAS going public, 👩 Data Engineer, Scale data, dysfunctions of Data Engineering and run Airflow in one command
Hi, it's me again 👋, I hope this new digest finds you well. Maybe you are on holiday, maybe not; either way I'm here to give you your weekly glimpse of the data ecosystem.
I got a lot of feedback telling me you enjoyed last week's Airflow Summit takeaways, so in the future I'll continue to cover this kind of large conference for you! Thanks a lot ❤️.
Data fundraising 💰
- Yesterday we got news that SAS, the analytical software company started in 1966 (yep, you read that right, 1966), will go public in 2024. The goal of this change is to offer stock options to employees in order to attract more talent.
- Kili Technology, a Paris-based AI startup, announced its $25m Series A funding. Today the company reports about 40 employees on LinkedIn and has developed an end-to-end AI training platform, with the end goal of getting better AI thanks to better data.
Is it Hard Being a Woman Data Engineer?
I've deliberately chosen to start the Data News with this article because gender equality in data engineering is obviously not here yet. I hope that in the future voices like Sarah Krasnik's will continue to emerge and help us reach this diversity. I'll let you read the whole article to get the answer.
Scale data teams and platforms
Everyone knows it, data is coming. In the future we will all have to manage more and more data every day in our respective companies. This leads to a major scaling issue for everyone. What works for 10 will probably not work perfectly for 100 or 1000.
This means you'll probably need to migrate from one system to another while still maintaining the first one. So here is a good question: how do you modernize while keeping the lights on? You will also need to scale teams, and this article gives you the keys to build your data dream team.
To illustrate data migration, here is an article from LinkedIn Engineering about their largest data migration task, for the recruiter and jobs products.
The dysfunctions of Data Engineering
This is my evergreen content: I think that all data teams today are still dysfunctional. I already covered the topic in a past edition (#27).
This week we have an amazing article on the topic. MrTrustworthy shares with us what the dysfunctions of data engineering are. You'll find a lot of foundational concepts in the article: types of data engineers, inversion of responsibility, driving data rather than being data-driven.
Use Amazon Location from Redshift
Yeah, now come some technical articles. On the AWS blog you can find an example of how to access Amazon Location Service from Redshift in SQL using UDFs and Lambda. As a side note, this article shows how technical AWS is and how complex all its solutions look.
This example is about Redshift, but I think the same thing is easily doable with other warehouses.
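To give an idea of what sits behind such a UDF, here is a minimal sketch of the Lambda side in Python. It is not the code from the AWS article. It assumes the usual Redshift Lambda UDF contract, where the event carries the rows under `arguments` and the function answers with a JSON document containing `success`, `num_records` and `results`. The `lookup_place` helper and its tiny in-memory index are hypothetical placeholders for a real call to a geocoding service.

```python
import json


def lookup_place(address):
    """Hypothetical stand-in for a geocoding call to a location service."""
    # Assumption: placeholder data only; a real handler would query the service here.
    fake_index = {"Paris": (2.3522, 48.8566)}  # (longitude, latitude)
    return fake_index.get(address)


def lambda_handler(event, context):
    """Handle one batch of rows sent by the warehouse for a scalar UDF."""
    try:
        results = []
        # Each entry of "arguments" is the list of UDF arguments for one row.
        for row in event["arguments"]:
            coords = lookup_place(row[0])
            # Return one value per input row, None (SQL NULL) when nothing matches.
            results.append(json.dumps(coords) if coords else None)
        return json.dumps(
            {"success": True, "num_records": len(results), "results": results}
        )
    except Exception as exc:
        # Report the failure to the caller instead of crashing the function.
        return json.dumps({"success": False, "error_msg": str(exc)})
```

The important part is that the handler works on a batch and returns exactly one result per input row, in the same order.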
Understand concepts by comparison
Sometimes it's easier to understand tools or concepts by comparing them. So this week I'm sharing 3 posts:
- When it comes to databases, what is the difference between surrogate and primary keys?
- Compare Relational vs. Document vs. Graph databases, a huge piece in two parts
- Do you always ask yourself how Snowflake's pricing model compares to BigQuery's? Joao Marques tries to give an answer in part 2 of Snowflake vs. BigQuery
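For the first comparison, a small sketch can help. It uses Python's built-in `sqlite3` module and a hypothetical `customers` table (the table, columns and sample rows are made up for illustration). Here `customer_id` is a surrogate key, a generated value with no business meaning that serves as the primary key, while `email` is a natural key protected by a `UNIQUE` constraint.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE customers (
        customer_id INTEGER PRIMARY KEY,  -- surrogate key, generated by the database
        email       TEXT NOT NULL UNIQUE, -- natural key, carries business meaning
        name        TEXT NOT NULL
    )
    """
)

# The surrogate key is never supplied by the application.
conn.execute("INSERT INTO customers (email, name) VALUES (?, ?)", ("ada@example.com", "Ada"))
conn.execute("INSERT INTO customers (email, name) VALUES (?, ?)", ("bob@example.com", "Bob"))

# The business value can change without touching the identifier other tables point to.
conn.execute(
    "UPDATE customers SET email = ? WHERE email = ?",
    ("ada@new.example.com", "ada@example.com"),
)

rows = conn.execute(
    "SELECT customer_id, email FROM customers ORDER BY customer_id"
).fetchall()
print(rows)
```

Ada keeps the same `customer_id` after changing her email, which is the usual argument for referencing the surrogate key from other tables.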
Quality: how to write good batch pipelines or good code comments?
Zach Wilson (a data engineering voice on LinkedIn) has started writing on Medium. His first topic is how to evaluate the quality of a batch data pipeline. His third point, about evaluating the maintenance burden, is my favorite.
The Stack Overflow blog gives you 9 rules for writing good comments in your code. My favorite rule is the third one (again).
Run Airflow in one command
Every Airflow user knows that it's hard to get Airflow running in one command. Airflow needs a bit of setup in order to run: env variables, scheduler & webserver, database, etc. This article gives you an example of how you can write a single CLI command to run a production-like instance locally.
If this is already too much for you, you can also check out Meltano, which aims to simplify Airflow startup.
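As a rough illustration of the single-command idea, here is a small Python launcher sketch. It is not the approach from the article: it swaps in Airflow's built-in `airflow standalone` command, which is meant for local development rather than a production-like setup. The `.airflow` home directory and `dags` folder names are assumptions to adapt to your project.

```python
import os
import subprocess
from pathlib import Path


def build_airflow_launch(project_dir):
    """Return the command and environment for a local Airflow instance."""
    project = Path(project_dir).resolve()
    env = dict(os.environ)
    # Assumption: keep Airflow's config, logs and database inside the project.
    env["AIRFLOW_HOME"] = str(project / ".airflow")
    # Assumption: DAG files live in a "dags" folder at the project root.
    env["AIRFLOW__CORE__DAGS_FOLDER"] = str(project / "dags")
    # Skip the bundled example DAGs to keep the UI clean.
    env["AIRFLOW__CORE__LOAD_EXAMPLES"] = "False"
    # "standalone" initializes the database and starts the Airflow components together.
    command = ["airflow", "standalone"]
    return command, env


def launch(project_dir="."):
    """Start Airflow in the foreground; requires Airflow to be installed."""
    command, env = build_airflow_launch(project_dir)
    subprocess.run(command, env=env, check=True)
```

Calling `launch()` from a script then plays the role of the one command, with all the environment setup kept in one place.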
ML Friday — MLOps & summer upskill
This week I'm sharing two articles published on KDnuggets: MLOps best practices, and a 4-week program to upskill in machine learning with a lot of resources.
—
Thanks for reading. Do not hesitate to subscribe. I've also started a YouTube channel (in French); subscribing there too will help me a lot 🤗.
blef.fr Newsletter