Data News — Week 3
Data News #3 — Coalesce and Prophecy, low-code data tools?, Snowflake supports Iceberg, good data citizenship and YAML for Airflow.
Welcome to the (late) third edition of the Data News this year! I've been busy teaching a Big Data training course at university every Friday, hence the delay.
By the way, here's a fun thing I discovered while preparing my courses: the French railroad company has an open-data platform where you can get datasets about, for example, lost belongings, train delays... I found that very interesting. It would be fun to have it in all countries to see who's late the most.
This week I struggled a bit to find interesting topics, so the news will be much shorter than usual.
Data fundraising 💰
- Coalesce closes $5.92m in seed funding. They plan to re-invent data transformations. Honestly, if you're asking yourself what a dbt clone could look like, we may have found it. Even the name is inspired by them.
- Prophecy, a low-code platform for Spark and Airflow, raised a $25 million Series A round. I may give it a try to see what's under the hood.
- On the other hand, Qlik, the data-analytics giant, has confidentially filed for another IPO. Six years ago they were a public company, and now they are going back to it. I don't know what it means, but I do know that Qlik does not seem to belong to the "Modern Data Stack".
Snowflake adds Iceberg support for external tables
This is good product news to me. The work done last year on new data formats was a huge step forward. Seeing Snowflake support Iceberg as an external table format is a great step towards general adoption.
From a business perspective, they also support Delta Lake to help people migrate from Databricks 🙊.
Are data engineers firefighters?
We data engineers are often called to fix things that we sometimes know nothing about, or that we sometimes broke by accident. And it always happens when we are not ready for it. We do indeed try to "put the fire out", as Kiana said, and she also proposed some common points you should keep in mind when working on your data systems.
Monte Carlo also wrote a guide to bad data for non-engineers. Actually, it is less a guide than a story of what regularly happens. I agree with most of the common issues they list.
According to a recent study by HFS, 75 percent of executives don’t trust their data.
Can we get a study on the number of employees who don't trust the data reading skills of their executives? Maybe it would put the number in perspective 🤷‍♂️.
Good data citizenship doesn’t work
Benn and Mark Grover (Stemma/Amundsen founder) wrote a joint article about good data citizenship and why it's hard to rely on it. The underlying idea is to assume that non-data team members will be good data citizens when it comes to filling in data documentation. Spoiler: it can't work.
In the post they dive deep into why it's hard to rely on it, especially because of the daily volume of data changes. They also look at other systems like Yelp, Wikipedia or Google to understand how they deal with the problem.
As they put it: "Review more, document less", and automate everything.
YAML DAGs for Airflow
🙄. Two developers (sorry, I did not find who's really behind it) released the Typhoon orchestrator, where you write YAML to define your Airflow DAGs. The idea is to keep Airflow tasks as pure Python functions that can be easily tested, while all the glue code is handled by a framework driven from YAML. Here's an example.
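To make the idea concrete, here is a minimal sketch in plain Python. It is not Typhoon's actual syntax or API: the task functions, the `PIPELINE_SPEC` dictionary (standing in for what a parsed YAML file could look like) and the `run_pipeline` helper are all hypothetical names I made up for illustration. It only shows the split between pure, testable functions and a declarative description of how they are wired together.

```python
from typing import Any, Callable, Dict, List

# Pure Python tasks: no orchestrator imports, so they can be unit tested directly.
def extract() -> List[int]:
    return [1, 2, 3]

def double(values: List[int]) -> List[int]:
    return [v * 2 for v in values]

def total(values: List[int]) -> int:
    return sum(values)

# Registry mapping the names used in the spec to the functions above.
TASKS: Dict[str, Callable[..., Any]] = {
    "extract": extract,
    "double": double,
    "total": total,
}

# Hypothetical declarative spec, written as the dict a YAML parser could return.
# This is NOT Typhoon's format, only an illustration of the principle.
PIPELINE_SPEC: Dict[str, Any] = {
    "name": "demo_pipeline",
    "tasks": [
        {"id": "extract", "function": "extract"},
        {"id": "double", "function": "double", "input": "extract"},
        {"id": "total", "function": "total", "input": "double"},
    ],
}

def run_pipeline(spec: Dict[str, Any], registry: Dict[str, Callable[..., Any]]) -> Dict[str, Any]:
    """Run tasks in the listed order, feeding each one its upstream result."""
    results: Dict[str, Any] = {}
    for task in spec["tasks"]:
        func = registry[task["function"]]
        upstream = task.get("input")
        # A task without an "input" key is a source and takes no argument.
        results[task["id"]] = func(results[upstream]) if upstream else func()
    return results

if __name__ == "__main__":
    print(run_pipeline(PIPELINE_SPEC, TASKS))
```

The point of the design is that `double` or `total` can be tested like any other function, while the wiring lives in data that a framework could translate into a real DAG.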
ML Friday 🧠
LinkedIn posted an article about diversity and fairness in their Artificial Intelligence products, which is worth reading in my opinion.
On the other hand, Facebook open-sourced XLS-R, a cross-lingual speech recognition AI model. A Google AI researcher who works on the project wrote a Twitter reply on handling biases during the training process.
Fast News ⚡️
- A very useful SQL tip about optimising the use of the "%" wildcard.
- Staying on the topic of SQL, here's an interesting take on why Google considers SQL a coding language and why you should too.
- Here's a Twitter thread about Privacy Shield negotiations between the US and EU.
- Everyone can do a side project: automating Nike Run Club data analysis with Python, Airflow and Google Data Studio.
Thank you for reading! See you next week.