Data News — must-read 2022 articles
A collection of data articles that you should read to remember 2022. Best data articles of 2022.
Hey you, this is the last article of the year and it's gonna be about the articles and trends that made 2022, according to me. You'll see articles that I've already shared during the year.
Once again, thank you everyone for your support this year, and see you next week for the first Data News of 2023. Sorry for the delay, I had blank page syndrome today. Now let's jump to my selection.
ANALYTICS ENGINEERING
Let's be honest: in 2022 Analytics Engineering shaped the data field and was at the centre of a lot of data discussions. Analytics Engineering can be seen as a renaming of BI Engineering, but if we look at it more closely it mainly comes out of the specialisation of data roles. The Analytics Engineer is a specialised role between the Data Engineer and the Data Analyst. Madison had a look at job postings to see which skills companies really want in Analytics Engineers.
Analytics engineers provide clean data sets to end users, modeling data in a way that empowers end users to answer their own questions. [...], an analytics engineer spends their time transforming, testing, deploying, and documenting data. Analytics engineers apply software engineering best practices like version control and continuous integration to the analytics code base.[1]
Analytics Engineering put data modeling back in the spotlight. Preset wrote a gentle introduction to data modeling. In a nutshell, data modeling is the set of techniques we use to structure data in data warehouses. Nowadays we have:
- Dimensional modeling — Introduced in 1996 by Ralph Kimball. We often use the Snowflake Schema or the Star Schema (a special case of the former). Here Snowflake is not the data warehouse technology but the shape of the table relationships, which draw a snowflake.
- Entity modeling — Introduced by Bill Inmon. In this methodology you use 3NF (third normal form) to model your business entities and avoid redundancy. This approach is less flexible than the previous one.
- OBT (One Big Table) — I don't really know who introduced OBT, except that Fivetran mentioned it in 2020. This is often the easiest approach to start with: everything in one table, denormalised.
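To make the difference between a Star Schema and OBT concrete, here is a minimal sketch using Python's built-in `sqlite3` module. The table and column names (`fact_orders`, `dim_customer`, `dim_product`, `obt_orders`) and the sample rows are made up for illustration, and SQLite only stands in for a real warehouse.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    -- Star schema: one fact table pointing at dimension tables.
    CREATE TABLE dim_customer (customer_id INTEGER PRIMARY KEY, country TEXT);
    CREATE TABLE dim_product  (product_id  INTEGER PRIMARY KEY, category TEXT);
    CREATE TABLE fact_orders  (
        order_id    INTEGER PRIMARY KEY,
        customer_id INTEGER REFERENCES dim_customer(customer_id),
        product_id  INTEGER REFERENCES dim_product(product_id),
        amount      REAL
    );
    INSERT INTO dim_customer VALUES (1, 'FR'), (2, 'US');
    INSERT INTO dim_product  VALUES (10, 'books'), (20, 'games');
    INSERT INTO fact_orders  VALUES
        (100, 1, 10, 12.0), (101, 1, 20, 30.0), (102, 2, 10, 8.0);

    -- OBT: the same data, joined once and stored denormalised.
    CREATE TABLE obt_orders AS
    SELECT f.order_id, f.amount, c.country, p.category
    FROM fact_orders f
    JOIN dim_customer c USING (customer_id)
    JOIN dim_product  p USING (product_id);
""")

# With the star schema, the consumer has to know about the join.
star_query = """
    SELECT c.country, SUM(f.amount)
    FROM fact_orders f
    JOIN dim_customer c USING (customer_id)
    GROUP BY c.country
    ORDER BY c.country
"""
# With OBT, the same question is a single-table query.
obt_query = """
    SELECT country, SUM(amount)
    FROM obt_orders
    GROUP BY country
    ORDER BY country
"""

revenue_star = conn.execute(star_query).fetchall()
revenue_obt = conn.execute(obt_query).fetchall()
print(revenue_star)
print(revenue_obt)
```

Both queries return the same revenue per country. The trade-off in this sketch is that the OBT version repeats `country` and `category` on every order row, while the star version keeps them in one place at the cost of a join.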
As a final note, here is a Reddit thread discussing the question "is Kimball's Dimensional Modelling dead in 2022?"
To complete the AE list, here are a few I recommend as the best 2022 analytics engineering articles:
- Factless Fact table — not as absurd as it may sound at first
- Testing: Our assertions vs. reality — Probably the best talk of 2022 about testing. This is a YouTube video.
- Super Tables: The road to building reliable and discoverable data products — LinkedIn's data modeling choices explained, and the introduction of the Super Tables concept.
- Understanding the Snowflake Query Optimizer — To become better at data modeling you need to understand how the underlying warehouse engine works. This article is a good way to understand how Snowflake works.
- Stop using so many CTEs — This is a vendor article that showcases "Chained CTEs" in Hex. Still relevant, because in today's data world CTEs are everywhere and a lot of data transformations are just a bunch of SQL queries, a few hundred lines long, with a lot of CTEs. But CTEs are untestable blocks of code.
- 7 Antifragile Principles for a Successful Data Warehouse — Something to look at to create a healthy data warehouse.
- My guide about managing and scheduling dbt from dev to production.
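On the CTE point above, here is a small sketch of why a CTE is hard to test on its own, and what changes when the same logic gets a name in the database. It uses Python's built-in `sqlite3` module. The `raw_orders` table, the `paid_orders` name and the sample rows are hypothetical, and this is not how the Hex article does it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE raw_orders (order_id INTEGER, status TEXT, amount REAL);
    INSERT INTO raw_orders VALUES
        (1, 'paid', 10.0), (2, 'cancelled', 5.0), (3, 'paid', 7.5);
""")

# As a CTE, `paid_orders` only exists inside this single statement:
# nothing outside the query can select from it to check its rows.
cte_query = """
    WITH paid_orders AS (
        SELECT order_id, amount FROM raw_orders WHERE status = 'paid'
    )
    SELECT SUM(amount) FROM paid_orders
"""
total_from_cte = conn.execute(cte_query).fetchone()[0]

# The same logic as a view is a named object, so a test can query
# the intermediate step directly.
conn.execute("""
    CREATE VIEW paid_orders AS
    SELECT order_id, amount FROM raw_orders WHERE status = 'paid'
""")
paid_ids = [
    row[0]
    for row in conn.execute("SELECT order_id FROM paid_orders ORDER BY order_id")
]
total_from_view = conn.execute("SELECT SUM(amount) FROM paid_orders").fetchone()[0]

print(paid_ids, total_from_cte, total_from_view)
```

In a dbt project the equivalent move would be to extract the CTE into its own model, which is an assumption about your setup and not something the article prescribes.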
DATA TEAMS
3 pieces of content that I feel are relevant without being trendy. These are more long-term ideas that we should keep in mind:
- Building more effective data teams using the JTBD framework — Data teams are still in an in-between state, with no well-established practices when it comes to routines or organisation. The Jobs To Be Done framework can be something to look at.
- Building Modern Data Teams — The most complete resource hub with around 40 articles on how to build data teams, data strategies or think about data work / hiring.
- Emerging Architectures for Modern Data Infrastructure — The updated version of the a16z vision about modern data infrastructure.
ENGINEERING
In no particular order, a few of the best 2022 data engineering articles:
- Introducing Software-Defined Assets — The best article to rethink data pipelines and to treat datasets as assets.
- The rise of the data reliability engineer — A large part of a Data Engineer's daily job is close to an SRE's, without Data Engineers being SREs.
- The best website to visually understand machine learning models.
- Joe Reis blog. He started blogging recently after writing the excellent Fundamentals of Data Engineering book. I often catch myself agreeing with everything he says. If you have to follow someone besides me, I think it should be him.
- The many layers of data lineage — The best metaphor to understand what you can do with data lineage.
- Design Patterns in Machine Learning Code and Systems — Because we need design patterns, even if I disliked the design pattern classes I had back in engineering school.
- 3 tips to take back control of your time.
A GLIMPSE INTO THE FUTURE
This year people talked about a lot of things. Without doing any research, here is what I can remember:
- Data Mesh — The Mesh has been adopted and tried by multiple organisations. What we've seen is that it requires a minimal organisation size to get started, and we have yet to figure out whether the organisational changes are worth it.
- Data contracts — An interface between the data producers and the data consumers. The interface can take multiple forms, and we often summarise it as a schema registry. Very useful in a mesh organisation.
- Semantic Layer / Metric Layer / Headless BI — "Something"[2] between the data warehouse and the BI tool that will probably shape trends next year.
- Unbundling of Airflow — This is the year many Airflow alternatives went public, all with their own vision and great promises. In addition, the one-dag-to-rule-them-all strategy has been challenged, and execution has been moved out to other systems, leaving Airflow like an empty shell. But in the end it'll be back.
- GPT-3 applications — It has the potential to revolutionize industries through automation and augmenting human intelligence, but has also raised concerns about its potential negative impact on employment (this bullet has been generated by ChatGPT).
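To make the data contract idea from the list above a bit more tangible, here is a minimal sketch of a contract expressed as a schema that the producer and the consumer both agree on. Everything in it is hypothetical (the `Field` class, the `ORDERS_CONTRACT` fields, the `violations` helper). A real setup would more likely rely on a schema registry or a validation library.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Field:
    """One column of the contract: its name, expected type and nullability."""
    name: str
    dtype: type
    nullable: bool = False


# Hypothetical contract for an "orders" dataset.
ORDERS_CONTRACT = [
    Field("order_id", int),
    Field("amount", float),
    Field("coupon", str, nullable=True),
]


def violations(record: dict, contract: list[Field]) -> list[str]:
    """Return a human-readable list of contract violations for one record."""
    errors = []
    for field in contract:
        if field.name not in record:
            errors.append(f"missing field: {field.name}")
            continue
        value = record[field.name]
        if value is None:
            if not field.nullable:
                errors.append(f"null not allowed: {field.name}")
        elif not isinstance(value, field.dtype):
            errors.append(
                f"wrong type for {field.name}: "
                f"expected {field.dtype.__name__}, got {type(value).__name__}"
            )
    return errors


good = {"order_id": 1, "amount": 9.5, "coupon": None}
bad = {"order_id": "1", "amount": None}

print(violations(good, ORDERS_CONTRACT))
print(violations(bad, ORDERS_CONTRACT))
```

The producer can run a check like this before publishing and the consumer can run it on read, so a breaking schema change shows up as an explicit violation and not as a silent downstream failure.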
Now that I've said this, I think that 3 technologies will shape data engineering next year:
- Wasm — WebAssembly is a portable compilation target that runs in the browser. In plain words, it means you can run code from your favourite language in a Firefox tab. One example is PyScript, which lets us run Python in HTML. Thanks to Wasm we can tap into decentralised compute power: your stakeholders' laptops.
- DuckDB — A single-node, in-process OLAP database. We have not yet seen its full potential. Here is what I think about DuckDB.
- Dagger — A programmable CI/CD engine that you can run everywhere.
[1] What is an analytics engineer? (Claire Carroll)
[2] The Semantic Layer is more than just "something". To be honest, for the moment I'm being sarcastic about it, because I'm not sure it is really important, at least from what I see in my own French market.