Data News — Week 9
Data News #9 — Streamlit acquisition by Snowflake, hyping myself with Dagster, Kubernetes everywhere, SQL + Jinja is not enough.
I hope you are doing well and that you will enjoy this Data News edition.
Data fundraising 💰
Snowflake has entered into an agreement to acquire Streamlit for $800m (news here and here). Streamlit is a Python package that helps you build data web apps in a wink. It generates HTML pages with interactive inputs inside.
The acquisition was announced at the same time as the Q4 FY22 earnings. Snowflake has now reached $1b in product revenue for 2021, which is massive but still small compared to GCP or AWS (respectively ~$20b and ~$67b), even if I know the comparison is unfair because they are generalist clouds. Compared to Databricks ($800m annual revenue in 2021), Snowflake seems to be winning the direct competition.
This acquisition will certainly enrich Snowflake's offer around their Data Cloud, in order to give people a complete data experience. I bet we will see other acquisitions from their side in the coming months. Snowflake's challenge is to attract companies with more than a data warehouse.
Dagster — Introducing software-defined assets
Dagster and Prefect have been on my radar for months. I have been trying Prefect for the last 3 months (stay tuned), but I have not tried Dagster yet. This week's release introduces software-defined assets. Through Dagster, they propose to make data assets a reality. The idea is to apply declarative programming (as opposed to imperative programming) to data.
With software-defined assets you will be able to declare your different data assets (files, dataframes, tables, ML models, etc.) and handle the whole lifecycle behind them. Each asset is defined with a name, a function and upstream assets.
By modelling your data platform this way, you get your own data lineage and catalog directly out of the asset definitions, because everything you need is defined within your orchestration tool. Or, as they say, in your reconciliation tool. Go check it out, it's worth it.
I can see why there is hype around it.
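To make the idea concrete, here is a minimal sketch in plain Python. It is not Dagster's API: the `asset` decorator, the `ASSETS` registry, `materialize` and `lineage` are hypothetical names made up for illustration. It only shows how declaring each asset with a name, a function and upstream assets is enough to derive the lineage for free.

```python
from typing import Callable, Dict, List, Optional, Sequence, Tuple

# Hypothetical registry (not Dagster's): asset name -> (function, upstream names).
ASSETS: Dict[str, Tuple[Callable, List[str]]] = {}


def asset(name: str, upstream: Sequence[str] = ()):
    """Declare a data asset with a name, a function and upstream assets."""
    def register(fn: Callable) -> Callable:
        ASSETS[name] = (fn, list(upstream))
        return fn
    return register


@asset("raw_orders")
def raw_orders():
    # Toy in-memory data standing in for a file or a table.
    return [{"id": 1, "amount": 20}, {"id": 2, "amount": 30}]


@asset("total_revenue", upstream=["raw_orders"])
def total_revenue(raw_orders):
    return sum(row["amount"] for row in raw_orders)


def materialize(name: str, cache: Optional[dict] = None):
    """Compute an asset after recursively computing its upstream assets."""
    cache = {} if cache is None else cache
    if name not in cache:
        fn, upstream = ASSETS[name]
        cache[name] = fn(*(materialize(u, cache) for u in upstream))
    return cache[name]


def lineage() -> Dict[str, List[str]]:
    """The lineage graph is read straight from the declarations."""
    return {name: upstream for name, (_, upstream) in ASSETS.items()}
```

Calling `materialize("total_revenue")` computes `raw_orders` first, and `lineage()` returns the dependency graph without any extra bookkeeping.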
Data & analytics trends to watch in 2022
Speaking of hype, Taylor shared 10 bets on trends to watch in data for 2022. I really like the way she presented every bet and I hope some of them will come true in 2022.
Field level lineage explained
Engineers from the Monte Carlo team detailed how they built their field-level lineage in order to have a more effective root cause analysis workflow. The post is well illustrated and covers all the aspects of such a tool.
They do not go into low-level details, but they at least explore the use cases and the business requirements. It could help you convince your boss that you need a data lineage layer in your modern data stack.
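As an illustration of why field-level lineage helps root cause analysis, here is a minimal sketch in plain Python. It is not Monte Carlo's implementation: the `FIELD_LINEAGE` mapping, the field names and `upstream_fields` are hypothetical. Given a broken field, it walks the graph to list every upstream field worth inspecting.

```python
from typing import Dict, List, Set

# Hypothetical field-level lineage: each field maps to the fields it is derived from.
FIELD_LINEAGE: Dict[str, List[str]] = {
    "dashboard.revenue": ["mart.orders.amount_eur"],
    "mart.orders.amount_eur": ["raw.orders.amount", "raw.fx_rates.rate"],
    "raw.orders.amount": [],
    "raw.fx_rates.rate": [],
}


def upstream_fields(field: str) -> Set[str]:
    """Return every field that directly or indirectly feeds `field`."""
    seen: Set[str] = set()
    stack = [field]
    while stack:
        current = stack.pop()
        for parent in FIELD_LINEAGE.get(current, []):
            if parent not in seen:
                seen.add(parent)
                stack.append(parent)
    return seen
```

If `dashboard.revenue` looks wrong, `upstream_fields("dashboard.revenue")` narrows the investigation down to three fields instead of whole tables.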
Spark, Kafka, Flink, Kubernetes related stuff
I talk a lot about warehouses and the tooling around them because the modern data stack has taken all the attention recently, but programmatic solutions are still here and will definitely stay for a long time.
Let's start with Kubernetes: the second part of Kubernetes: The Documentary was released last week [part 1, part 2]. It is around 55 minutes of video that are worth your time if you do data in the cloud. It is a must-watch to understand Google's cloud strategy and how today's cloud concepts were born. In a few words: AWS was beating Google and the Mountain View company needed a revolutionary product to compete in the cloud space: Kubernetes. It covers the Docker and container revolution, the open-source philosophy and the Kubernetes story.
Now that Kubernetes allows us to go further, we can for instance run Kafka on top of it like Yelp, or we can try to improve Spark performance when coupled with Kubernetes.
To finish this engineering category, I recently talked about real-time streaming data platforms, which are one of the use cases for Flink. This post is about the evolution of streaming architectures, from Storm to Flink via Spark. It is a good introduction to understand the main differences between them.
Finally, if you are still struggling to write data quality unit tests with PySpark, you can try doing it with Great Expectations. Karen shows a lot of different cases and the post is well documented.
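For readers who have never seen an expectation-style test, here is a minimal sketch in plain Python. It deliberately does not use the Great Expectations or PySpark APIs: `expect_column_values_not_null` is a hypothetical helper and the rows are plain dictionaries instead of a Spark DataFrame. It only shows the shape of the idea: state what the data should look like and get back a result you can assert on.

```python
from typing import Any, Dict, List


def expect_column_values_not_null(rows: List[Dict[str, Any]], column: str) -> dict:
    """Hypothetical expectation: every row must have a non-null value in `column`."""
    failing = [i for i, row in enumerate(rows) if row.get(column) is None]
    return {"success": not failing, "failing_rows": failing}


# Toy dataset standing in for a DataFrame.
users = [
    {"id": 1, "email": "a@example.com"},
    {"id": 2, "email": None},
]

id_check = expect_column_values_not_null(users, "id")
email_check = expect_column_values_not_null(users, "email")
```

In a unit test you would assert on `id_check["success"]`, and `email_check["failing_rows"]` tells you which rows to look at.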
SQL + Jinja is not enough — why we need DataFrames
Furcy's latest post, apart from being long (a 20-minute read), is really great. It is an ode to DataFrames. Furcy says that many operations are easier to write (and run) with DataFrames than with SQL + Jinja.
He then developed a Python library that runs PySpark code using BigQuery as storage, for example to do a pivot operation. Give him some claps on Medium because this is good food for thought.
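To give a feel for why a pivot is awkward in templated SQL, here is a minimal sketch in plain Python. It is not Furcy's library nor the PySpark API: the `pivot` function and the `sales` data are hypothetical. The point is that the output columns depend on the data itself, so a program can discover them and build the result in one go, whereas a SQL template needs them before the query is written.

```python
from typing import Any, Dict, List


def pivot(rows: List[Dict[str, Any]], index: str, columns: str, values: str) -> List[Dict[str, Any]]:
    """Sum `values` per `index`, with one output column per distinct `columns` value."""
    # Step 1: the output columns are discovered from the data.
    pivot_columns = sorted({row[columns] for row in rows})
    # Step 2: aggregate into one line per index value.
    table: Dict[Any, Dict[Any, Any]] = {}
    for row in rows:
        line = table.setdefault(row[index], {c: 0 for c in pivot_columns})
        line[row[columns]] += row[values]
    return [{index: key, **line} for key, line in table.items()]


# Toy data standing in for a table.
sales = [
    {"country": "FR", "year": "2020", "amount": 10},
    {"country": "FR", "year": "2021", "amount": 15},
    {"country": "US", "year": "2021", "amount": 7},
    {"country": "FR", "year": "2021", "amount": 5},
]

pivoted = pivot(sales, index="country", columns="year", values="amount")
```

Here `pivoted` holds one row per country with a `2020` and a `2021` column, and adding a new year to the data adds a column without touching the code.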
Why am I seeing this ad?
Explainable AI (XAI) is one of 2022's major trends in machine learning. The transparency challenges data systems face in order to be compliant and fair in today's world are becoming huge. The LinkedIn team detailed how they built the "Why am I seeing this ad?" panel in their product. It is a good product thinking exercise if you have client-facing ML models.
Fast News ⚡️
- Dremio announced their Dremio Cloud to provide a query engine and a metastore on top of your data storage
- lakeFS rolled out a playground to test their technology — lakeFS is a git-like object storage for your datalake
- Gergely Orosz (The Pragmatic Engineer) wrote an insider piece about Amazon's engineering culture. To read it you'll have to subscribe to his paid newsletter (this is not only about data but I found the topic somewhat important)
- Towards Analytics with Redis — how can you use Redis to create an analytics application. I found this interesting because it could answer some real-time production use-cases teams could have.
- KPIs every Data Team should have, Jesse Anderson's new post. He was one of the first to talk about data team responsibilities, and this post will help you put words on metrics for your data team.
See you next week with probably 2 posts.