Data News — Week 22.26
Data News #22.26 — Snowplow fundraising, stuff you should know about databases, a rant against dbt, the real data quality and fast news.
Bonjour ! Here is your fresh Data News edition. This is the first day of July and the summer break is about to arrive for many of you. Sadly, data never waits, so I hope you have enough team redundancy to be able to disconnect.
This week I published a new animated YouTube video: data explained to kids — in French but with English subtitles. BTW, if you want to help me make an English version, ping me 👋
Data Fundraising 💰
- Snowplow, a complete suite of web analytics tools, raised a $40m Series B. This is an open-source tool I've known since back in 2014, and their marketing has evolved drastically: from a web analytics tool to a behavioral data platform, AI included. Coincidentally (or not), last week Italy declared Google Analytics illegal, following other European countries.
- Zing Data raised a $2.4m seed round. "Data at your fingertips", as they say. Their idea is to provide a mobile-first BI tool: use your phone to answer data questions and start data discussions. Obviously it works better for events companies.
- Lightbits raised $42m in capital. Through proprietary technology they sell highly available storage. I got into private object storage only recently because of a mission, and I find it super interesting. It's a fiercely competitive market, where everything is about percentage points of savings and so on.
- Opaque Systems raised a $22m Series A. Don't stop at the corporate website, the technology seems promising. Here is what they sell: "Opaque is the first confidential computing platform that enables [...] analytics and machine learning on encrypted data". They are also the creators of MC2, an open-source version of their platform. It's worth a try.
Things you should know about databases
I often share content about database internals; it's material I really enjoy. It's time for you to learn new things you should know about databases. Mahdi wrote a post with great illustrations in which he explains very well how indexes and transactions work. What really happens between BEGIN and COMMIT in your SQL query?
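To make the BEGIN/COMMIT question concrete, here is a minimal sketch using Python's built-in sqlite3 module (the accounts table and the amounts are made up for the example): the two updates inside the transaction become visible together at COMMIT, and a failure in between rolls everything back.

```python
import sqlite3

# Toy transaction: everything between BEGIN and COMMIT is applied atomically.
conn = sqlite3.connect(":memory:", isolation_level=None)  # manage transactions by hand
conn.execute("CREATE TABLE accounts (name TEXT, balance INTEGER)")
conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 0)")

try:
    conn.execute("BEGIN")
    conn.execute("UPDATE accounts SET balance = balance - 40 WHERE name = 'alice'")
    conn.execute("UPDATE accounts SET balance = balance + 40 WHERE name = 'bob'")
    conn.execute("COMMIT")  # both updates become visible at once
except sqlite3.Error:
    conn.execute("ROLLBACK")  # on any error, neither update is applied

print(conn.execute("SELECT * FROM accounts ORDER BY name").fetchall())
# [('alice', 60), ('bob', 40)]
```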
A rant against dbt ref
Complaining about dbt has become a trend. Given the adoption and how happy people are with it, it's normal to see dissonant voices at some point. It's Max's turn to rant against dbt's ref.
I do agree with Max. ref manipulation is a pain point in dbt that breaks the magic, especially when your workflow as an analyst is:
- writing SQL in your Snowflake/BigQuery web UI
- copy/pasting the SQL into your text editor (whatever it is) to add it to git
- finding all the table references to change them to ref
- forgetting one (and getting alerted by the CI/CD)
And you do this every day, for every model you touch. In the end you spend more time playing Where's Wally? with table names than actually writing SQL. OK, I exaggerate a bit, but you get the idea.
On this specific point I think it's possible to develop a browser extension that replaces table names with the right dbt references on the fly, while waiting for changes from the inside. Ping me if you want to build it with me; a rough sketch of the core replacement logic is below.
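To give an idea of what such an extension (or a pre-commit hook) could do, here is a rough Python sketch of the replacement logic. The table-to-model mapping and the names are hypothetical; in a real setup you would build the mapping from dbt's manifest instead of hard-coding it.

```python
import re

# Hypothetical mapping from physical table names (as typed in the warehouse UI)
# to dbt model names.
TABLE_TO_MODEL = {
    "analytics.prod.orders": "stg_orders",
    "analytics.prod.customers": "stg_customers",
}

def replace_tables_with_refs(sql: str) -> str:
    """Swap raw table names for {{ ref('...') }} calls in a SQL string."""
    for table, model in TABLE_TO_MODEL.items():
        sql = re.sub(re.escape(table), f"{{{{ ref('{model}') }}}}", sql, flags=re.IGNORECASE)
    return sql

query = "select * from analytics.prod.orders o join analytics.prod.customers c on o.customer_id = c.id"
print(replace_tables_with_refs(query))
# select * from {{ ref('stg_orders') }} o join {{ ref('stg_customers') }} c on o.customer_id = c.id
```

A real version would obviously need a proper SQL parser to avoid touching strings or comments, but the idea stays the same.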
The data quality no-one is speaking of
This is an intervention.
Thanks to the Modern Data Stack and dbt we have created SQL-driven platforms, and analysts are becoming SQL monkeys. This is not good. Pissing SQL all day long creates monstrosities. Rather than adding extra layers to achieve data quality, build quality from the inside out.
In this post, which I deeply recommend, Petr speaks the truth. Everyone should go back to the root cause of data quality issues: your code complexity. It's time to "tame the complexity".
Databricks Summit
Databricks Summit (called Data + AI Summit) is taking place. As I don't have the time to follow it, here is Simon's feedback on Day 1 and Day 2. In a nutshell, they announced:
- Change Data Feed, a way to track row-level changes in Delta tables (see the small reading sketch after this list).
- Databricks Workflows, an orchestration tool with a matrix view copy/pasted from Airflow.
- Enzyme, a new optimization layer to speed up the ETL process. Bingo.
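For the curious, here is roughly what reading the Change Data Feed looks like from PySpark. This is only a sketch based on Delta Lake's documented options, assuming a Spark session with Delta Lake configured and a hypothetical orders table with delta.enableChangeDataFeed set to true.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the row-level changes recorded since table version 1
# (a startingTimestamp option exists as well).
changes = (
    spark.read.format("delta")
    .option("readChangeFeed", "true")
    .option("startingVersion", 1)
    .table("orders")  # hypothetical table with the change feed enabled
)

# Each row is tagged with _change_type (insert, update_preimage, update_postimage,
# delete) plus _commit_version and _commit_timestamp.
changes.show()
```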
ML Friday 🤖
- Instacart detailed how they developed an internal platform, called Griffin, to answer data science needs. It helps with MLOps and includes a feature marketplace, a workflow manager and a training & inference platform.
- Causal Forecasting at Lyft — I don't get much of this post, but it fills my brain with memories of PHV stuff.
Fast News ⚡️
- envd, a development environment for machine learning — write all your app configuration in Python or R and envd will build everything out of it. Why not.
- sqlglot, a Python SQL parser, transpiler and optimizer — a recent SQL parser that just came out. They claim to be the fastest one written in Python, and it's modular enough to adapt to your specific dialects (a tiny transpile example sits right after this list).
- People-first Data stacks — it's time to add empathy on top of the modern data stack.
- Everything is a funnel, but SQL doesn’t get it — in a lot of businesses we tend to represent things as funnels. Sadly, SQL is not a good language for manipulating funnels.
- Remote development at Slack — it's interesting to see how Slack's tech team has been equipped with remote development environments and how it worked out.
- Building a Real-time discovery stack at Whatnot
- Airflow Survey 2022 — something I missed a few weeks ago. Some numbers about how Airflow is used today: for instance, half of the user base still runs the Celery executor (and a quarter the Local one).
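About sqlglot, here is a tiny, hedged taste of the transpiler part (the query and the dialects are just an example):

```python
import sqlglot

# Translate a query written for one dialect into another.
query = "SELECT DATE_ADD(created_at, INTERVAL 1 DAY) AS d FROM orders"
print(sqlglot.transpile(query, read="mysql", write="duckdb")[0])  # transpile() returns a list of statements
```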
PS: for the first time in a long time I'm not late. See you next week ❤️.