Data News — Week 22.51
Data News #22.51 — Advent of Data wrap-up, how to manage and schedule dbt, welcome new members, buy a data book for Christmas, I command you to hire junior data engineers.
Hey you, if you just subscribed to the Data News yesterday, I wish you a warm welcome ❤️🔥. The Data News is your weekly Friday data curation in which I select for you the most interesting data articles of the past week (according to me). I hope you'll enjoy it ✨.
Christmas is coming, so whether you celebrate it or not, I wish you a great end of the year and a good time with family and/or friends. There will be one last Data News next week with my 10 must-read articles of 2022. In the meantime you can read Prukalpa's 5 must-read data blogs from 2022.
The Advent of Data is also coming to an end tomorrow. It has been an awesome ride; I'm so happy we put together such a great list of content, and I'm so grateful to the 24 creators who accepted the rules and wrote something for this first year. I'll do a wrap-up of the Advent in January to celebrate what we achieved together.
Remember: the Advent of Data was your daily spark of data joy in December. Every day a new data article was published by a data creator.
Guide—manage and schedule dbt
Two days ago I published the most complete guide about dbt management and scheduling. In case you missed it, you have to check it out! Original deep posts exclusive to Data News members are something I want to do more of next year, to bring additional value to this newsletter.
Next year I plan to talk about:
- Data engineering and analytics engineering career paths
- The state of data integration, related to another 2023 project 📚
- We have too many choices: my framework for making decisions
- Something you want me to write about?
Let's go back to dbt. In a nutshell, this guide will give you ideas on how to manage dbt repositories and projects, what you have to think about to provide a top-notch developer experience, and how to host and schedule your dbt code.
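To make the scheduling part concrete, here's a minimal sketch of one common option in this space: running dbt from an Airflow DAG. The project path, profiles directory, target and schedule below are hypothetical placeholders, not the guide's exact setup.

```python
# Minimal sketch: schedule a daily `dbt build` with Airflow's BashOperator.
# Paths, target and schedule are hypothetical placeholders.
from datetime import datetime

from airflow import DAG
from airflow.operators.bash import BashOperator

with DAG(
    dag_id="dbt_daily_build",
    start_date=datetime(2022, 12, 1),
    schedule_interval="0 6 * * *",  # every day at 06:00
    catchup=False,
) as dag:
    dbt_build = BashOperator(
        task_id="dbt_build",
        bash_command=(
            "cd /opt/dbt/my_project && "
            "dbt build --profiles-dir /opt/dbt/profiles --target prod"
        ),
    )
```

The guide goes further than a single BashOperator (developer experience, hosting, repository layout), but this is often where teams start before moving to more granular per-model tasks.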
I'm really proud of the developer experience part of the guide because I think it is still an unresolved part of every dbt project; something is still broken. From the first contact, the local installation, the (web?) IDE, the useless copy-pastes, the code reviews and the tooling to the development environments, there is a lot to say.
As an extension, two great articles about custom dbt setups were written this week. The Monzo team detailed how they created their own framework on top of dbt to keep up with their growth, and Albert from Superside explained how they migrated from dbt Cloud to a custom setup with CI/CD, S3, Docker and Airflow.
PS: a small question, I did not send an email for the guide; would you have wanted to receive one?
Give yourself a book for Christmas 🎁
If you need gift ideas for yourself, I have a few books to propose. The selection is a mix of two things I love: data engineering and visualisation.
Here's the selection 📚:
- Fundamentals of Data Engineering — It rapidly became a bestseller. Joe and Matt wrote a well-structured book that covers all the data engineering topics; I firmly recommend it to everyone from juniors to seniors.
- The Data Warehouse Toolkit, 3rd Edition — With the rapid rise of the analytics engineering role, data modeling came back as the number one priority for a lot of data teams. Dimensional modeling, which is the core of the Kimball method, has been a reference for years.
- Effective Data Storytelling — Data storytelling has been really trendy in recent years, but in a lot of data teams we often lack creativity, context or storytelling because of the dashboard constraint. This book is a must-read if you want to drive action with data.
Obviously more awesome books were released this year, but I only mentioned the ones you should absolutely have.
Fast News ⚡️
- Functional Data Engineering - A Blueprint — Ananth, from Data Engineering Weekly, the best data engineering newsletter, wrote a great follow-up to Maxime's functional data engineering post. In the post he shows how we can apply entity and event schematisation to Lakehouse architecture.
- Tips for hiring junior Data Engineers — MOST. IMPORTANT. POST. OF. 2022. Every data engineer has been a junior once; it's important not to gatekeep others by forgetting we once knew nothing about data engineering. It is our duty as seniors to hire juniors and to help them. I guarantee you it's the most satisfying feeling. Beyond my recommendation, the article is awesome and speaks the truth. Last point: I think the maximum ratio is 3 juniors for 1 senior.
- 💥 BigQuery data lineage — It looks like something huge, but I'm not sure tbh. Soon we will have a data lineage tab in the BigQuery UI. To get it you'll have to activate Data Catalog/Dataplex. This is in public preview.
- Panel discussion about licenses in open source, the relationship with VCs, etc., between Doug Cutting (Hadoop co-founder), Maxime Beauchemin (Airflow & Superset creator) and David Nalley (Apache Foundation president). It gets really geeky about licensing, but a few of you might find it interesting.
- Maybe Snowflake isn’t for you! — Thoughts on Snowflake's expensive pricing, which is reminiscent of Oracle. tl;dr: take back control of your tools to find the holy added value every data person has spoken about.
- Managed transcription with OpenAI whisper and Hugging Face inference endpoints — I don't even understand the first chart of the article but it looks cool.
- Personal Finances with Airflow, Docker, Great Expectations and Metabase — when you're a nerd and you like to extend the data pleasure on Saturday.
- StackOverflow 2022 developer survey — 15% of respondent developers are in data roles (but they can wear multiple hats), and when it comes to technologies, SQL and Python come just after JavaScript and HTML/CSS, which are just everywhere. Last number: Spark is the framework that pays the most nowadays, not a good sign for Spark's future.
- Working with large CSV files in Python from Scratch — A good pattern to optimize your pandas computations by leveraging partitioning; see the sketch after this list.
- Data Reprocessing Pipeline in Asset Management Platform @Netflix (I did not read it, but I want to keep track of it; it looks interesting).
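To illustrate the partitioning idea from the large-CSV article above, here is a minimal sketch of my own (not the article's code) that processes a CSV in chunks with pandas; the file name and columns are hypothetical placeholders.

```python
# Minimal sketch: aggregate a large CSV in chunks instead of loading it at once.
# File name and column names are hypothetical placeholders.
import pandas as pd

totals = {}
for chunk in pd.read_csv("transactions.csv", chunksize=100_000):
    # Aggregate each partition, then merge the partial results.
    partial = chunk.groupby("category")["amount"].sum()
    for category, amount in partial.items():
        totals[category] = totals.get(category, 0.0) + amount

print(pd.Series(totals).sort_values(ascending=False))
```

The point is that memory stays bounded by the chunk size, so the same pattern works whether the file is 1 GB or 100 GB.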
Data Fundraising 💰
- Qualytics raised a $2.5m Seed round. They are newcomers in the already quite crowded data quality space. Qualytics is a small US-based team (8 employees on LinkedIn) proposing a "data firewall" that protects and compares your data to detect drifts, anomalies and historical discrepancies.
See you next week for the last edition of 2022 ❤️. Enjoy the holidays.