Data News — Week 23.42
Data News #23.42 — dbt Mesh and a new dbt alternative, a few fundraises, OpenAI's crazy numbers, Meta banning Python ads, and more.
Hey, this week Coalesce, the dbt Labs annual conference, took place. Over 3 days, people from around the world shared how they use dbt. As usual, I'll write a takeaway post after binge-watching all the keynotes, but that is for next week. Still, dbt Labs' announcements were mainly aimed at dbt Cloud, with great features to drive adoption of the paid product.
They announced dbt Mesh, a product enabling cross-project dependencies for teams with multiple dbt projects. In addition, they released an Explorer view that lets you navigate through all your projects and see models, macros and more directly in one nice graph.
Does this mean that you have to use dbt Cloud to have a multi-project setup? No, you can activate multi-project collaboration with dbt Core. I've written a guide that helps you do it.
📺 On the content side, next week I'll also present the Fancy Data Stack project at the Data Engineering And Machine Learning Summit 2023 organised by Seattle Data Guy. I'll be online on Thursday 26 at 5PM CEST. Add it to your calendar and sign up for the conference; the list of speakers is insane.
Data News is packed this week. Take time to enjoy it: rainy times are coming, so you can see it as a gift 🎁.
Enough dbt, use lea 🥰
Max (the first Data News member 🤗) open-sourced carbonfact/lea this week. lea aims to be a minimalist alternative to dbt by fixing a few flaws that come with dbt. You can even see the traditional Jaffle shop example done in lea.
What are the main differences?
- You configure lea with env variables.
- A lea prepare command creates the database objects that need to exist (dataset, schema, etc.). Schemas are interpreted from the folder structure (with DuckDB).
- lea understands the relationships between views, so you don't need a ref. Jinja templating is still supported, though.
- Tests are added directly in the SQL code, on the targeted column. For instance, if you need to test uniqueness on a column, you add the @UNIQUE decorator. Singular tests are still supported.
- lea generates documentation as Markdown in the workdir.
- Other cool features: lea teardown deletes database objects, lea diff shows table schema differences, and you can write Python models as long as they return a DataFrame.
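To make the "no ref needed" idea more concrete, here is a minimal sketch of how dependencies between views could be inferred straight from the SQL text. This is not lea's actual implementation: the view names, the SQL and the naive regex are all assumptions made for illustration, and a real tool would most likely rely on a proper SQL parser.

```python
import re
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# Hypothetical views keyed by "schema.name"; the SQL is illustrative only.
VIEWS = {
    "staging.customers": "SELECT id AS customer_id, first_name FROM raw.customers",
    "staging.orders": "SELECT id AS order_id, user_id AS customer_id FROM raw.orders",
    "core.customer_orders": """
        SELECT c.customer_id, COUNT(o.order_id) AS n_orders
        FROM staging.customers AS c
        LEFT JOIN staging.orders AS o ON o.customer_id = c.customer_id
        GROUP BY c.customer_id
    """,
}

# Naive pattern: a dotted identifier right after FROM or JOIN.
TABLE_REF = re.compile(
    r"\b(?:FROM|JOIN)\s+([A-Za-z_]\w*\.[A-Za-z_]\w*)", re.IGNORECASE
)


def infer_dependencies(views):
    """Map each view to the set of other known views its SQL reads from."""
    deps = {}
    for name, sql in views.items():
        referenced = set(TABLE_REF.findall(sql))
        # Keep only views we manage; raw.* tables are treated as external sources.
        deps[name] = {ref for ref in referenced if ref in views and ref != name}
    return deps


def execution_order(views):
    """Return the views in an order where dependencies always come first."""
    return list(TopologicalSorter(infer_dependencies(views)).static_order())


deps = infer_dependencies(VIEWS)
order = execution_order(VIEWS)
print(order)
```

In this toy setup the two staging views are always scheduled before core.customer_orders, without any explicit ref call in the SQL.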
Max also wrote a nice post about downstream data issues, which are the main problem driving the data contracts space: Sh*t flows downhill, but not at Carbonfact. You should read it because it gives another perspective on how to fix the problem.
Gen AI 🤖
- Can you run it? — A HuggingFace app that, based on your specs, tells you what you need to run an LLM for inference or training.
- 25 million Creative Commons image dataset released — Fondant, an open-source processing framework, released publicly available images from web crawling with their associated licenses.
- New Vertex AI Feature Store — GCP Vertex AI is the place to do "serverless" AI. It's awesome to see it directly integrated within BigQuery, as it obviously brings simplicity. In public preview.
Fast News ⚡️
- Meta banned a creator for selling Python and Pandas courses — The automated AI filters identified the ads as violating wildlife protection rules. We are therefore thinking with our feet: these algorithms are probably written in Python. Do we still want a future where AI decides for us?
- At the same time, luckily for us, Meta is creating custom silicon for AI.
- Disney's new intelligent robot / toy — The entertainment company showcased a new toy with impressive capabilities, opening the door to a fun future for kids.
- 11 lessons learned managing a platform team within a data mesh — BlaBlaCar, a carpooling company, is well known in France for recently adopting a data mesh organisation. This post gives great insights about the impact on the data platform team.
- The need for an open standard for the semantic layer — Following news about dbt's semantic layer, this post from cube opens the door to defining a standard when it comes to semantics. What should be the main entity type at the center of the semantics: metrics or datasets?
- Why data integration will never be fully solved — Anna covers a few data integration tools and tries to explain why this is such a tricky field, one that can hardly be solved with a single cloud tool.
- Popsink, a real-time ingestion and processing platform, released their self-service offering this week. They are French and they built a great platform on top of Redpanda and Flink, claiming to be 4x cheaper than Fivetran for data replication. This echoes the previous bullet point.
- In the same vein as Popsink, an example of how Fortis Games, a game publisher, developed a real-time platform with the same technologies.
- Rise of the data generalist: smaller teams, bigger impact — You don't need to convince me. In all my experience and the talks I have with people, smaller teams obviously drive bigger impact.
- La stratégie nationale pour l'intelligence artificielle — In French. This is about what France wants to do until 2025 to drive AI adoption.
  - 3500 new students and at least 200 additional theses on AI topics
  - Reach between 10% and 15% of the world market share in embedded AI
  - and more stuff in order to attract foreigners and help companies
Engineering stuff
- Dagster open-sourced their internal data platform — Surprise, they use Dagster as an orchestrator.
- dbt-related stuff
  - To dbt or not to dbt — A few lessons learned while implementing dbt at Intercom.
  - dbt MetricFlow, semantic layer 2.0 — A quick analysis of the new semantic layer vision.
  - Data contracts and schema enforcement with dbt — It comes with dbt Mesh and gives a lot of new metadata over your models to bring more software engineering practices to dbt development.
- Pros and cons of One Big Table data modeling — I really like OBT, it brings a lot of simplicity, especially in the downstream usage, but obviously it has known issues.
- What is data versioning and 3 ways to implement it — A comparison between change data capture (CDC), dimensional modeling and slowly changing dimension (SCD).
- Mage, BigQuery and bundled-up bike trips — A homemade project where Patrick used Montreal's public data from bike-counting sensors.
- Replace Dockerfile with Buildpacks.
Data Economy 💰
- OpenAI is nearing a $90b valuation, with a product that launched in late 2022. It seems OpenAI is doing more than $100m in revenue per month. The numbers are just crazy.
- Lonestar raises an additional $825k in seed funding. Lonestar provides immutable storage to be sent to the moon as a backup service. Yep, on the moon 🌕.
- Aindo raises €6m Series A. Aindo is a synthetic data solution: it provides a platform to generate synthetic data from your real data in order to preserve statistical relevance while removing sensitive information. With synthetic data you can then publicly seek help from the world's data scientists.
- ScyllaDB raises $43M Series C. It's an open-source NoSQL database that is compatible with Apache Cassandra interfaces.
- Pantomath raises $14m Series A. A new data pipeline observability solution enters the game.
- Prophecy raises $35m Series B. This is a drag-n-drop data transformation product that I had never heard of.
See you next week ❤️.