Data News — Week 48
Data News #48 — Unknown Future, data privacy gaining some VC hype, MDS future, data modeling principles, Superset with Cube and more.
Hello, December has started and soon it will be time for a retrospective. I hope this edition finds you well. I'll start with some self-promotion, but the usual Data News is below 😬.
This week I've published a new YouTube video (🇫🇷, with English captions) where I recreate with D3.js the iconic Joy Division cover (which is originally a representation of pulsar data), and I also used Berkeley Earth data about temperature anomalies to represent the Unknown Future. Do not hesitate to subscribe and to give it a like, it will help me with YouTube discovery.
Data fundraising 💰
- Weld, an all-in-one data platform, raised $4.6m in a seed round. The Danish company (🇩🇰, 🇪🇺) aims to provide ETL, data models, reverse ETL and visualizations directly in one app to rule them all. They also have an "On-demand Data Analyst" included in the pricing (500€ to 5000€/month). Overall, I think they have really good ideas.
- Is privacy becoming more trendy? STRM Privacy raised €3m to bring privacy-by-design to streams. From what I understand, you define schemas in STRM and they ensure the data identified as PII is correctly encrypted before any persistent storage. They also provide a data quality layer where you can validate your data.
- Last but not least, Privacy Dynamics raised $4m in a seed round (after a pre-seed in 2019) and launched yesterday. They want to help engineering teams anonymize data seamlessly and without any hurdles. From the small glimpse we have, they will provide a SaaS to do it, and it seems they integrate directly with Snowflake and dbt — whatever that means.
Embrace PTL rather than ETL
Kudos to Dragos for the term: PTL stands for Push Transform Load. In his post Dragos says that the E in ETL is a problem because it requires a lot of plumbing before you have a minimal platform, so we should consider pushing data rather than extracting it.
Even if the post could be better, this is an interesting view, and it works great when you already have a central bus or an event-driven platform on the product side, but I don't think it fits batch platforms. Do not forget that we used to have to ask IT teams for FTP exports to get data. Do we want to go back to that time? The real issue is the ownership of the source data, and it will only be solved if teams invest time in it.
To conclude on this matter, I propose this post about Change Data Capture (because CDC can help you achieve the PTL pattern), which is very well illustrated.
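If you want to picture the PTL pattern, here is a minimal sketch, assuming a hypothetical collector endpoint on the platform side, of a product service pushing its events instead of waiting to be extracted:

```python
# A minimal PTL sketch: the producing service pushes events to the data
# platform instead of the platform extracting them. The endpoint and the
# event shape are assumptions, not something from Dragos' post.
from datetime import datetime, timezone

import requests

COLLECTOR_URL = "https://data-platform.example.com/events"  # hypothetical

def push_event(event_type: str, payload: dict) -> None:
    """Push a domain event to the platform's ingestion endpoint."""
    event = {
        "type": event_type,
        "emitted_at": datetime.now(timezone.utc).isoformat(),
        "payload": payload,
    }
    response = requests.post(COLLECTOR_URL, json=event, timeout=5)
    response.raise_for_status()

# The product code owns the push, so the data contract lives at the source.
push_event("order_created", {"order_id": 42, "amount_eur": 99.9})
```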
GDPR compliance and account deletion
We just saw that privacy is becoming trendy, so let's continue in that direction: it's time to speak about compliance. The Getaround team detailed how they handled the data retention requirements of the law. Because of the GDPR, the user lifecycle has become a topic of its own, and they cover its different parts.
I think they illustrated very well one of the most important parts of the process: finding the data to delete. The biggest issue with our platforms is that they quickly become messy, with data everywhere, so if you put a process in place do not forget all the hidden places: third parties, backups, version systems, extracts, etc. Everything should go in.
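On the engineering side, one way to avoid forgetting a hidden place is to make every store register a deletion handler. A toy sketch (with hypothetical store names, not Getaround's actual implementation):

```python
# Hypothetical sketch of a GDPR deletion fan-out: every data store
# registers a handler, so a forgotten "hidden place" shows up in code
# review instead of in a compliance audit.
from typing import Callable

DELETION_HANDLERS: dict[str, Callable[[str], None]] = {}

def deletes_from(store: str):
    """Register a deletion handler for one data store."""
    def decorator(func: Callable[[str], None]) -> Callable[[str], None]:
        DELETION_HANDLERS[store] = func
        return func
    return decorator

@deletes_from("warehouse")
def delete_from_warehouse(user_id: str) -> None:
    print(f"DELETE FROM users WHERE user_id = '{user_id}'")  # placeholder SQL

@deletes_from("third_party_crm")
def delete_from_crm(user_id: str) -> None:
    print(f"Calling the CRM deletion API for {user_id}")  # placeholder call

def forget_user(user_id: str) -> None:
    """Fan the deletion out to every registered store."""
    for store, handler in DELETION_HANDLERS.items():
        handler(user_id)
        print(f"{store}: deleted")

forget_user("user-123")
```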
What’s in Store for the Future of the Modern Data Stack?
It's the end of the year, so it's time to guess what will drive next year's data innovations. Personally I only have wishes, which I'll write to Santa soon. But Barr Moses, during an interview with Bob Muglia — ex-CEO of Snowflake — tried to predict the future of the modern data stack.
All the points are truly relevant but my personal favourite is the 5th because I do like graphs. So please give me nice usable knowledge graphs.
Build Superset dashboards on top of Cube
Several weeks ago I presented Cube.js, an API layer between your data storage and your visualization tools. It's time for a deeper understanding of the product: the Cube team wrote a large tutorial on how to build a Superset dashboard with Cube as a "backend". This is super cool because it gives us an overview of the product, and they also intentionally — I guess — provided public credentials to try the product in our own dashboarding tool.
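If you want to poke at a Cube deployment before plugging Superset in, the REST API accepts a JSON query on the /cubejs-api/v1/load endpoint. A minimal sketch below; the host, token and the Orders cube are placeholders, not the tutorial's public credentials:

```python
# Minimal sketch of querying Cube's REST API (/cubejs-api/v1/load).
# The host, token and the Orders cube are assumptions for illustration.
import json

import requests

CUBE_HOST = "https://your-cube-host.example.com"  # hypothetical
API_TOKEN = "YOUR_CUBE_API_TOKEN"                 # hypothetical JWT

query = {
    "measures": ["Orders.count"],
    "dimensions": ["Orders.status"],
    "timeDimensions": [
        {"dimension": "Orders.createdAt", "granularity": "month"}
    ],
}

response = requests.get(
    f"{CUBE_HOST}/cubejs-api/v1/load",
    params={"query": json.dumps(query)},
    headers={"Authorization": API_TOKEN},
)
response.raise_for_status()
print(response.json()["data"])  # the rows Superset would chart
```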
A cloud data warehouse does not mean I can ignore all data modeling principles
This week Adam from Confessions of a Data Guy wrote about his experience with merge operations in Spark; the main takeaway is that you should always partition your data if you want effective — in cost and time — operations. Antonio gave the same advice for BigQuery: partitioning and clustering are your true friends.
So, yeah, ok, cloud data warehouses gave us "unlimited" power, but that does not mean we should be careless when storing our data. All modern warehouses carry a legacy from Hadoop and distributed compute, so don't forget our roots and what it takes.
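To make it concrete, here is a minimal sketch of a partitioned and clustered table in BigQuery; the dataset, table and column names are made up for the example:

```python
# Minimal sketch: creating a partitioned and clustered table in BigQuery.
# Dataset, table and column names are made up for the example.
from google.cloud import bigquery

client = bigquery.Client()  # assumes default project credentials

ddl = """
CREATE TABLE IF NOT EXISTS analytics.events (
  event_ts TIMESTAMP,
  user_id  STRING,
  payload  STRING
)
PARTITION BY DATE(event_ts)  -- date-filtered queries only scan matching partitions
CLUSTER BY user_id           -- co-locates rows for selective filters on user_id
"""
client.query(ddl).result()  # wait for the DDL job to finish
```

With this layout, a query filtering on `DATE(event_ts)` only scans the matching partitions, which is exactly the cost and time win both posts are after.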
Even if it's not the main point, I recently came across this ML CO2 impact calculator. It could help you realize the carbon footprint of your un-optimized GPU models. Faster access to data means less compute time, so partition your data.
🦾 AI Friday
This category comes in and out depending on how fascinated I am by AI posts, even when I only understand 10% of them. This week I propose Algolia's under-the-hood post about A/B testing evaluation for search. When we speak about search, Algolia is a big name, so this post will obviously give you the vocabulary but also ideas on what to do.
Social networks are graphs, everyone knows that. The LinkedIn team wrote about completing a member knowledge graph with Graph Neural Networks. It could also be useful for any company running a marketplace, for instance when it comes to identifying important profile attributes.
Fast News ⚡
- ❤️ Every product will be a data product — The title says it all; this is something we already discussed in last week's edition, so I don't have much to add right now.
- I've already shared DAG factory blog posts, but here is a new one — Abstracting data loading with Airflow DAG factories. I've always been a fan of abstractions when it comes to data loading in Airflow, and even if this post is not innovative it gives a fresh look at the topic (see the small factory sketch after this list).
- Being a data person sometimes means being lost in technical vocabulary. I started development with Django 12 years ago, and when I see an entry-level post helping you understand Django for data projects, I like it.
- Deep dive into Delta Lake via Apache Zeppelin — Nice tutorial to cover basic Delta Lake concepts.
- 👀 ObservableHQ — a freaking awesome JS notebook technology — released SQL cells support. I truly can't wait to test it.
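As promised above, a small Airflow DAG-factory sketch; the table names and loading logic are placeholders, it only shows the pattern of generating one loading DAG per source:

```python
# A minimal DAG-factory sketch: one loading DAG per source table.
# Table names and the loading logic are assumptions for illustration.
from datetime import datetime

from airflow import DAG
from airflow.operators.python import PythonOperator

def load_table(table_name: str, **context) -> None:
    # Placeholder: replace with your actual extract/load logic.
    print(f"Loading {table_name}")

def make_dag(table_name: str) -> DAG:
    dag = DAG(
        dag_id=f"load_{table_name}",
        start_date=datetime(2021, 12, 1),
        schedule_interval="@daily",
        catchup=False,
    )
    PythonOperator(
        task_id=f"load_{table_name}",
        python_callable=load_table,
        op_kwargs={"table_name": table_name},
        dag=dag,
    )
    return dag

# Expose one DAG per table at module level so the Airflow scheduler finds them.
for table in ["users", "orders", "payments"]:
    globals()[f"load_{table}_dag"] = make_dag(table)
```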
See you next week 👋.