Data News — Week 23.08
Data News #23.08 — A bit of infrastructure, analytics dev experience, measure everything and usual fast news.
Dear readers, I hope you had a great week. Each time I look back and see the number of Fridays I've spent reading and writing, I'm still surprised. For the last 2 newsletters I've tried to ask for your paid support. From the number of people who actually paid, I can see that I failed either to word it correctly or to propose a newsletter where you see the value of paying for it.
In any case, I need your honest feedback: what would make you consider paying for the content I create?
This is something I struggle with. I really like writing, I really like this newsletter, I really like the blog, but it takes me one day per week to produce. If I want to continue for years I have to find a way to make it sustainable for me, and if I want to go further in this direction I have to find a model that works. I'm open to all honest feedback.
A bit of infrastructure
This week I've seen a lot of articles that I can put under the infrastructure category, so here we are. The current data state is heavily dependent on infrastructure; whether it's cloud, on-premises or somewhere in between, we need to understand where the data lands and where the code runs.
First, Bucky gave his thoughts about the state of infra in 2023. In a nutshell: JavaScript is the future of everything (we've been saying it for years; you write once and run it everywhere, in the browser and on servers); workflow systems are a key piece of every software architecture, because with a fragmented tooling landscape and tasks that need to run one after the other, we need something to orchestrate them; finally, OLAP databases are evolving into something different, with many more features.
To improve your data infra you should occasionally try to kill your data stack: chaos engineering is a practice that helps discover issues. Monte Carlo also wrote about chaos engineering this week, with a manifesto.
When it comes to data storage, the real-time ecosystem has also changed a lot in the last few years, and a lot of tooling has emerged to ease the burden of managing Kafka clusters. Materialize—a real-time platform—detailed their architecture. But if you want to keep using the underlying tools, here's an overview of Flink's architecture and a few techniques you should know as a Kafka Streams developer.
Finally, Whatnot shared how they migrated their CD processes to ArgoCD, and Pinterest now uses HTTP/3, which I didn't even know existed.
Fast News ⚡️
- The future analytics developer experience — For a few months now we've seen articles complaining about the current analytics development experience. Often they are right. At the moment the best way to develop in your data warehouse is still the query editor of BigQuery / Snowflake / etc.; even if some tools are trying to provide a great experience, as Petr says in the article, we still lack something. I hope it will change.
- Measuring B2B customer satisfaction — The Dashlane team shares how they measure customer satisfaction. I really like the KPI framework they put in place and how it translates into charts.
- Measuring everything — This post is a proposition, and a signal that you should measure absolutely everything to understand what is happening in your product. This goes further than being a data-driven enterprise: you have to put in place a framework that puts data measurement at the heart of every product choice, resulting in increased maturity.
- Stream processing vs real-time OLAP vs streaming database — The data storage + compute field is slowly becoming a mess: a lot of technologies that are so close and yet so far apart at the same time. Hubert tries to untangle the real-time category.
- Data Engineers and Kubernetes — A 101 guide to Kubernetes concepts and why, as a data engineer, you should understand the basic Kubernetes entities.
- Coding patterns in Python — Start Data Engineering is one of the best data-engineering blogs, and this time they propose a few patterns you might need to implement in Python when building data pipelines.
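As a quick illustration (my own sketch, not taken from the article), one pattern that shows up again and again in Python data pipelines is chaining generators, so each extract/transform/load step processes records lazily, one at a time. All function names and the sample data below are hypothetical:

```python
from typing import Iterable, Iterator, List, Dict

def read_rows(raw: Iterable[str]) -> Iterator[Dict]:
    """Extract step: parse raw CSV-like lines into dicts, lazily."""
    for line in raw:
        name, amount = line.split(",")
        yield {"name": name.strip(), "amount": float(amount)}

def filter_valid(rows: Iterator[Dict]) -> Iterator[Dict]:
    """Transform step: drop rows with non-positive amounts."""
    return (row for row in rows if row["amount"] > 0)

def load(rows: Iterator[Dict]) -> List[Dict]:
    """Load step: materialise the result; a real pipeline
    would write to a warehouse instead of returning a list."""
    return list(rows)

raw = ["alice, 10.5", "bob, -3", "carol, 7"]
result = load(filter_valid(read_rows(raw)))
print(result)  # only the two rows with positive amounts remain
```

The nice property of this shape is that nothing is held in memory beyond the current record until the final `load` step, so the same three functions work unchanged whether `raw` is a 3-line list or a multi-gigabyte file handle.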
- Etsy payments data model — Articles are often about technologies and rarely about actual data modeling; this time the Etsy team shared their thinking while remodeling payments. Sadly it's more about transactional improvements and choices than about analytics.
- Shark attacks visualisation — This is a great example of embedded analytics. Louis deployed a version of ToucanToco—a BI tool—on top of Redshift to visualise data about shark attacks. Surprisingly, 3 shark attacks in Italy were deadly; I'll be more careful next time I swim in the Mediterranean.
Data Economy 💰
- Qbeast raises €2.5m seed. It's interesting to see that data lake platforms can still raise money in 2023. Qbeast proposes a different way to organise data to optimise query performance; still, it seems they rely on Spark.
- OpenAI's new strategy — Someone on Twitter reported that OpenAI privately announced a product called Foundry that would let customers run OpenAI models on dedicated capacity with full control over the model configuration and profile.
See you next week ❤️ — small edition today, blank page issues 🫠