Data News — Week 42
Data News #42 — Gitlab and OVH IPOs, Hex fundraising, why so many roles in data? Is Looker dead? Why can't we find enough data engineers?
Big number, week 42. That's introduction enough, so let's jump to the news.
Data fundraising 💰
- It was last Friday, but I'd like to highlight it: Gitlab went public through a warmly welcomed IPO. Gitlab is a big piece of the data ecosystem. They democratized CI/CD concepts for everyone (and for free if you want), and they also have a wonderful distributed data team that could inspire everyone. Discover their public data handbook.
- OVH — a French-based company — also went public last week in Paris. OVH being controversial, I don't have a lot to say except a wish: can we have European-built analytics solutions, please?
- Imagine a world where GDPR — and all the local privacy laws — could be solved with one tool. Maybe with Skyflow, I don't know. In any case, they raised a $45m Series B. Skyflow provides an API on top of a data vault to keep private data secure and compliant.
- In last week's newsletter I mentioned Hex. This week Hex raised a $16m Series A. In their own words, they are "building the first collaborative data workspace". Check out the GIF in the announcement, it looks cool.
Data Organization: why are there so many roles?
Good question. IMHO the reason we have so many roles is obvious: there are too many things to do in data. It's a whole ecosystem. If you want a better answer I suggest reading Furcy Pin's awesome article — my favorite of the week. I totally agree with him: data skills are organized into 4 realms, with blurry frontiers between them.
I only have one guess right now: I think the role sitting between the Analyst and the Scientist is not totally defined yet and will emerge in the coming years.
If you want another condensed perspective, Benedict Neo also tried to define the 5 data roles of 2021.
Is Looker dead? — follow-up
Last week I wrote about Google's partnership with Tableau. Let's go over this week's posts about the situation; the tableau will become clearer.
As Benn wonderfully put it, we need to split Looker into two pieces: LookML — which btw is badly named, there's no ML inside — and Looker the visualization tool. Once the split is done, who cares about Looker as a visualization tool? Even Google doesn't. So what is left for Looker? Is Looker dead?
The second article continues the discussion about the position of "LookML" in the data stack. The current trend is to have a transformation layer (dbt) and a metrics layer (LookML, Transform, etc.), but where does the metrics layer actually belong?
And then meet... Malloy, a new "experimental language for describing data relationships and transformations" developed by the Looker team, with a Google-colored logo. The Github repository shows files that are 2 months old, which means they have had a strategy in the works for a while. I'd bet they want to compete with dbt because the whole industry is shifting towards it, but it's only a guess.
Yeah, actually the tableau is still unclear.
Data Observability 101
I've already written a lot about data observability — the new data category popularized by Monte Carlo. As of today, the best way to be first in your category is still to create it yourself. That being said, if you want to truly understand what's behind this term (and also to get an overview of the tool), you can check out their guide.
What is ClickHouse? How does it compare to Postgres?
If you've ever considered ClickHouse as a warehouse or as a timeseries database, this article is for you. The Timescale team — TimescaleDB is a Postgres extension with timeseries capabilities, open-source or managed — wrote a 33-minute-read post comparing ClickHouse with Postgres. They also describe ClickHouse internals and flaws.
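If you just want a quick feel for the timeseries use case the article discusses, here is a minimal sketch using the clickhouse-driver Python client, assuming a local ClickHouse server; the table, column and host names are invented for the example.

```python
# Minimal sketch: store and query timeseries data in ClickHouse.
# Assumes a local ClickHouse server and `pip install clickhouse-driver`;
# table/column names are made up for illustration.
from datetime import datetime, timedelta
from clickhouse_driver import Client

client = Client(host="localhost")

# MergeTree is ClickHouse's workhorse engine; ORDER BY defines the sort key
# used to physically cluster data on disk (here: per sensor, then time).
client.execute("""
    CREATE TABLE IF NOT EXISTS sensor_metrics (
        ts     DateTime,
        sensor String,
        value  Float64
    ) ENGINE = MergeTree()
    ORDER BY (sensor, ts)
""")

# Batch insert a few points (ClickHouse prefers large batches over single rows).
now = datetime.now()
rows = [(now + timedelta(seconds=i), "sensor_1", float(i)) for i in range(10)]
client.execute("INSERT INTO sensor_metrics (ts, sensor, value) VALUES", rows)

# Typical analytical query: per-minute average for one sensor.
result = client.execute("""
    SELECT toStartOfMinute(ts) AS minute, avg(value)
    FROM sensor_metrics
    WHERE sensor = 'sensor_1'
    GROUP BY minute
    ORDER BY minute
""")
print(result)
```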
Distributed data teams
Mainstream people are calling it data mesh, but I don't like the term. HelloFresh described their journey to the mesh in a nice Medium post. I think everyone can learn from the challenges they faced, because all teams face them, mesh or not.
If, on the other hand, you want to technically implement a more effective distributed data team, you can have a look at event-driven architecture with a central event bus. The only interesting part of this post is that it showcases EventBridge, an Amazon service that does Kafka's job but as a black box — what a good idea AWS.
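For the curious, here is a minimal, hypothetical sketch of what publishing a domain event onto a central EventBridge bus looks like with boto3; the bus name, source and event payload are invented for the example.

```python
# Minimal sketch: one data team publishes a domain event on a shared bus,
# other teams subscribe to it through EventBridge rules.
# Assumes boto3 and AWS credentials; bus/source/detail names are made up.
import json
import boto3

events = boto3.client("events")

response = events.put_events(
    Entries=[
        {
            "EventBusName": "central-data-bus",   # hypothetical shared bus
            "Source": "orders-team",               # producing domain/team
            "DetailType": "OrderCompleted",        # event name consumers filter on
            "Detail": json.dumps({"order_id": "42", "amount_eur": 13.37}),
        }
    ]
)

# Each entry gets an EventId on success, or an ErrorCode if it was rejected.
print(response["Entries"])
```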
Why can’t we find enough Data Engineers?
Daniel Rojas Ugalde tries to answer the $1m question in a short, open post. I know that in my local market the reason is simple: there isn't any decent or advanced data engineering curriculum in public schools, compared to data science for instance. So each year the gap between supply and demand keeps growing.
Fast News ⚡
- Dear Dataflow users: the Netflix team detailed how they manage their deployments
- If you are a junior data engineer without a lot of cool projects to showcase, you can browse drawSQL's database templates and look at some startups' data schemas to get project ideas
- Apache ECharts — for those who, like me, missed it, ECharts is an open-source JavaScript charting library — behind the scenes it uses zrender. Superset is planning to move from NVD3 to ECharts and wrote about the current state of the migration.
- Timetable in Airflow explained — more customizable schedules for your Apache Airflow DAGs
- Julia language overview for data projects
- Another data stack to get inspired by — the Monzo team detailed what their modern data stack looks like
See you next week, and go watch my YouTube video (in French) with some personal anecdotes.