Summer Edition (credits)

This is the first article of the Data News Summer Edition: how to build a data platform. I kept this first article as short as possible; details will come in the following ones.

The modern data stack has been criticised a lot: a few say it's dead, others say we are in the post-modern era. Still, the modern data stack as a collection of tools that interact together to serve data to consumers remains relevant. Personally, I think the modern data stack is characterised by a central data storage in which everything happens.

Let's design the most complete modern data stack, or rather the fancy data stack.

In this article we will try to design the fancy data stack for batch usage. A lot of logos and products will be mentioned. This is not a paid article. However, over the years I've met people working at these companies, so I might have a few biases.

As a disclaimer, this may not quite make sense in a corporate context, but since this is my blog, I'll do what I want. Still, the idea of this post is to give you an overview of existing tools and how everything fits together.

💡
If you just want a few articles to read, just go to the bottom of the email.

A few requirements

Source data

When I was looking for data, I wanted a bit of volume, something geographical and no PII. I personally like the NYC Taxi trip data, but sadly it has been used so many times that it takes some of the fun out of it. At the same time the Tour de France was ongoing, and I found a "way" to get Strava data for Tour athletes on Strava. So I thought it was the perfect data to build a data platform.

Mainly there are 3 datasets:

Race data will be partitioned per day, but as the Tour is already over, it will be a bit different from a real-life environment. Still, this is something I keep in mind for future trainings, because I'm convinced that to learn data engineering you need to experience real-life pipelines running every day, morning firefighting included.
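To make the per-day partitioning a bit more concrete, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the `race_date` field, the `raw/race` base path, the Hive-style `race_date=YYYY-MM-DD` layout and the sample rows are all hypothetical, not the actual layout of the datasets.

```python
from collections import defaultdict
from datetime import date


def partition_path(base: str, day: date) -> str:
    """Build a Hive-style partition path for one day (the layout is an assumption)."""
    return f"{base}/race_date={day.isoformat()}"


def partition_by_day(rows, base="raw/race"):
    """Group rows by their race_date, keyed by the partition path they belong to."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[partition_path(base, row["race_date"])].append(row)
    return dict(partitions)


# Made-up rows with placeholder dates, only here to show the grouping
# (this is not real Strava data).
sample_rows = [
    {"race_date": date(2000, 1, 1), "athlete": "rider_a", "distance_km": 182.0},
    {"race_date": date(2000, 1, 1), "athlete": "rider_b", "distance_km": 182.0},
    {"race_date": date(2000, 1, 2), "athlete": "rider_a", "distance_km": 209.0},
]

if __name__ == "__main__":
    for path, rows in sorted(partition_by_day(sample_rows).items()):
        print(path, len(rows))
```

In a real daily pipeline each run would only write the partition of the day it processes, which is what makes reruns and backfills manageable.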

I'll delve into the data in the next article, but I won't detail how I got the data because, you know.... 🏴‍☠️. Actually, it's just a few Python scripts and a bit of F12, but that's not the point of this article.

Source data (Postgres, CSV and Sheets)

The fancy data platform

In order to have a complete data platform we will need to move the data from source to consumption. But what will the consumption look like?

I want to answer multiple use-cases:

To answer these, we will need to ingest data from the multiple sources, then transform and model the data in the chosen data storage, and finally develop consumer apps to answer the business needs.
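The ingest, transform and serve flow can be sketched in a few lines of Python. This is only a toy under stated assumptions: an in-memory SQLite database stands in for the central data storage (it is not one of the tools of the fancy stack), and the table names, columns and the `athlete_totals` model are hypothetical.

```python
import sqlite3


def ingest(conn, table, rows):
    """Load raw rows from one source into its own table in the central storage."""
    # Table names come from our own code here; never interpolate user input like this.
    conn.execute(f"CREATE TABLE IF NOT EXISTS {table} (athlete TEXT, distance_km REAL)")
    conn.executemany(f"INSERT INTO {table} VALUES (?, ?)", rows)


def transform(conn, raw_tables):
    """Model the data inside the storage: union the raw tables, then aggregate."""
    union = " UNION ALL ".join(
        f"SELECT athlete, distance_km FROM {t}" for t in raw_tables
    )
    conn.execute(
        "CREATE TABLE athlete_totals AS "
        f"SELECT athlete, SUM(distance_km) AS total_km FROM ({union}) GROUP BY athlete"
    )


def serve(conn):
    """Expose the modelled table to a consumer (here, a plain list of tuples)."""
    return conn.execute(
        "SELECT athlete, total_km FROM athlete_totals ORDER BY athlete"
    ).fetchall()


def run_pipeline(sources):
    """Run ingest -> transform -> serve for a dict of {raw_table_name: rows}."""
    conn = sqlite3.connect(":memory:")  # stand-in for the real warehouse
    for table, rows in sources.items():
        ingest(conn, table, rows)
    transform(conn, list(sources))
    result = serve(conn)
    conn.close()
    return result


if __name__ == "__main__":
    # Made-up rows for two hypothetical sources.
    print(run_pipeline({
        "raw_postgres": [("rider_a", 100.0)],
        "raw_csv": [("rider_a", 50.5), ("rider_b", 80.0)],
    }))
```

The point of the sketch is the shape: every source lands in the same storage, the modelling happens inside that storage, and consumers only read the modelled tables.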

Let's sketch a first design of our application, with logos. Obviously this is subject to change, either because it's too complicated or because I feel like changing it. Once again this is fiction, so I can afford to change stuff.

Actually, one of my main pieces of advice is that you should never be strict about tech choices, because you can't plan for the unexpected. So do yourself a favour and accept throwing away something that does not work for you.

The fancy data stack

Just for the sake of being open, there are a lot of alternatives and my choices could have been different. Here is what you can also consider if you're building your own platform.

Conclusion

After this design exercise I have mixed feelings. I'm in between. I think this is a fancy stack because I tried to put everything inside, but at the same time I find it quite boring. Like, this is just stuff that works. This is linear: I'll move data from A to B to C in order to use it with D. Actually this is just modern data engineering.

In the following parts of this series you'll follow my adventure in the extraction, the transformation and the serving for analytics and Gen AI usage.

I hope you'll enjoy this Data News Summer Edition.

FAQ and remarks


Small Fast News ⚡️

If you don't care about this, here are a few articles you might want to read by the pool.


See you next week ❤️.

Be kind to me, this is my birthday (credits)