Data News — Week 32
Data News #32 — Fundraising and transfers, modern data stack drawback, being a data manager, data quality frameworks, CDC and more
Hi, I hope you're having a good time at work. I know you'll get well-deserved holidays soon! As last week was a special edition, this week I'll feature articles from the last two weeks. Don't forget to give me feedback on the weekly digest; it helps me a lot.
Data fundraising 💰
- Treeverse raised $23m to bring Git-like version control to data lakes. This is the company behind the LakeFS open source project. They aim to create an atomic, versioned data lake.
- Hightouch raised $12.1m in a Series A to bring data back into customer-facing tools. Hightouch is a reverse-ETL tool that provides a way to export SQL results from the warehouse directly to third-party tools.
- Dune Analytics raised $8m in a Series A to provide free analytics for the crypto community. They develop a tool to explore, create and share dashboards on top of cryptocurrency data. I find it interesting because it is one of the first data analytics tools to be vertically specialized.
- DataRobot finally raised $300m in a Series G. I already mentioned it in week 23, when they were seeking fresh capital.
This is not about fundraising but transfers. After Felipe Hoffa, who moved from Google to Snowflake a year ago, Mosha Pasumansky also left Google to become CTO at Firebolt. Firebolt also hired Octavian Zarzu as developer advocate. Octavian was previously sharing daily tips about Snowflake.
As a reminder, Firebolt is a new competitor in the cloud data warehouse space, promising a huge performance uplift and big cost savings.
Data Surveys
Two 2021 surveys were published recently: the first one about the state of production machine learning and the second one about the state of data engineering. Do not hesitate to fill them in to help the community.
The unspoken gerrymandering of the modern data stack
I feature a lot of articles about the MDS and the tools that gravitate around it. But last week Benn wrote his thoughts on the modern data stack and why the explosion of tools and tool categories is not a good evolution of the landscape.
On this matter, if you want to go further, I suggest the transcript of an interview with Tristan Handy about why dbt was started to fill the gap between engineers and analysts.
We the purple people
Following the whole discussion about the MDS and the related creation of the Analytics Engineering discipline, dbt Labs (and more precisely Anna Filippova) wrote about the real need for people who can translate business needs into technical terms: the purple people.
Play in SQL with Reddit comments
Felipe loaded 261GB of Reddit comments into Snowflake and played with them. If you want to see Java UDFs, data loading and recursive queries in action, this post is for you.
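If you have never written a recursive query, here is a minimal sketch of the idea on a comment tree. To be clear, this is not Felipe's code and it does not run on Snowflake: I assume a toy `comments` table with hypothetical `id`, `parent_id` and `body` columns, and I use SQLite through Python's standard library as a stand-in engine.

```python
import sqlite3

# Hypothetical toy data: (id, parent_id, body). This is NOT the Reddit dataset
# from the post, only a tiny tree that shows the shape of the problem.
COMMENTS = [
    (1, None, "top-level comment"),
    (2, 1, "first reply"),
    (3, 2, "reply to the reply"),
    (4, 1, "second reply"),
    (5, None, "another thread"),
]


def build_db():
    """Create an in-memory SQLite database holding the toy comments."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE comments (id INTEGER PRIMARY KEY, parent_id INTEGER, body TEXT)"
    )
    conn.executemany("INSERT INTO comments VALUES (?, ?, ?)", COMMENTS)
    return conn


def thread(conn, root_id):
    """Return (id, depth, body) for root_id and all of its descendants."""
    query = """
        WITH RECURSIVE tree(id, depth, body) AS (
            -- Anchor member: the comment we start from.
            SELECT id, 0, body FROM comments WHERE id = ?
            UNION ALL
            -- Recursive member: children of rows already in the tree.
            SELECT c.id, t.depth + 1, c.body
            FROM comments AS c
            JOIN tree AS t ON c.parent_id = t.id
        )
        SELECT id, depth, body FROM tree ORDER BY id
    """
    return conn.execute(query, (root_id,)).fetchall()


if __name__ == "__main__":
    conn = build_db()
    for comment_id, depth, body in thread(conn, 1):
        print("  " * depth + f"[{comment_id}] {body}")
```

The anchor member picks the starting comment and the recursive member keeps joining children onto what was already found, which is the general pattern behind walking any parent/child hierarchy in SQL.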
Stream Your Database Changes with Change Data Capture
Nice article about CDC, with illustrations of database events and some examples of database triggers that could be used. If you want to go deeper into CDC or implement it, this should be on your reading list before starting.
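To give a feeling of what trigger-based CDC looks like, here is a minimal sketch. It is not taken from the article: the `users` table, the `users_changes` log table and the trigger names are all hypothetical, and I use SQLite through Python's standard library so it runs anywhere. A production setup would more likely read the database log than rely on triggers.

```python
import sqlite3

# Hypothetical schema: a source table, a change log table, and one trigger
# per operation that copies every change into the log.
SCHEMA = """
CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT);

CREATE TABLE users_changes (
    change_id INTEGER PRIMARY KEY AUTOINCREMENT,
    op TEXT,
    user_id INTEGER,
    old_email TEXT,
    new_email TEXT
);

CREATE TRIGGER users_ins AFTER INSERT ON users BEGIN
    INSERT INTO users_changes (op, user_id, old_email, new_email)
    VALUES ('INSERT', NEW.id, NULL, NEW.email);
END;

CREATE TRIGGER users_upd AFTER UPDATE ON users BEGIN
    INSERT INTO users_changes (op, user_id, old_email, new_email)
    VALUES ('UPDATE', NEW.id, OLD.email, NEW.email);
END;

CREATE TRIGGER users_del AFTER DELETE ON users BEGIN
    INSERT INTO users_changes (op, user_id, old_email, new_email)
    VALUES ('DELETE', OLD.id, OLD.email, NULL);
END;
"""


def capture_demo():
    """Run one insert, update and delete, then return the captured changes."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(SCHEMA)
    conn.execute("INSERT INTO users (id, email) VALUES (1, 'a@example.com')")
    conn.execute("UPDATE users SET email = 'b@example.com' WHERE id = 1")
    conn.execute("DELETE FROM users WHERE id = 1")
    return conn.execute(
        "SELECT op, user_id, old_email, new_email "
        "FROM users_changes ORDER BY change_id"
    ).fetchall()


if __name__ == "__main__":
    for change in capture_demo():
        print(change)
```

A downstream consumer would then poll `users_changes` in `change_id` order and replay the events elsewhere. The trade-off is that every write to `users` now pays for an extra insert.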
Top-notch data quality frameworks
In this digest we are lucky because we have two posts from top teams. Airbnb shared their Wall framework to prevent bugs and thereby improve data quality. On the other side, Uber shared how they achieved operational excellence in data quality.
Being a data engineering manager
Tiffany Jachja gave us feedback after her first 3 weeks as a data engineering manager and explained what she set up to map skills, roles and ownership. She also detailed the vision of the data team.
Data engineering guides
This week I share with you one BIG guide and a smaller one. The big one is The Data Engineering Cookbook; the repo has almost 10k stars on GitHub, it's well written and contains all the information you need to start your DE journey.
The second one is a guide to prepare for Data Engineering interviews. I'd add to this guide that you should prepare according to the company's technologies and that Spark and Hadoop aren't that mandatory today.
Fast News
- OpenAI launched a 5-puzzle Python AI contest yesterday (2021-08-12) using Codex (code autocomplete AI) as your teammate. The challenge is now closed.
- GitHub OCTO (the GitHub Office of the CTO), and more specifically Amelia Wattenberger, open sourced a tool called repo-visualizer that lets you create a bubble visualization of a GitHub repo. See the demo post for the wow effect.
- The story of the BInosaurus 🦖
- Part 3 of the dbt with Airflow series: how to use Airflow task groups and a custom DAG parser to improve readability.
Thanks, see you next week ❤️.