Data News — Week 22.44
Data News #22.44 — Data lake with DuckDB and Dagster, databases time, ML Saturday, autocomplete for analysts, and good data self-service.
Hello data news readers. I hope you had a great week. This is the Saturday edition of the data news; yesterday I had blank-page syndrome. I hope you don't mind.
Before jumping into the news, I have two things to say. First, I've been listed as a best data science newsletter by Hackernoon; if you like this newsletter I'd love to get your vote. Second, I'm organising an Airflow meetup in Paris on the 6th of December and I'm still looking for speakers, probably for 5-minute lightning talks (in French or English).
Have fun.
Build a data lake from scratch with DuckDB and Dagster
I've recently shared a lot of articles about DuckDB in the newsletter. If you missed it, DuckDB is a single-node, in-process OLAP database. In other words, DuckDB runs on a single machine, loads the data in a columnar format in memory (RAM) and applies transformations on it. Natively DuckDB integrates with SQL and Python, which means you can query your data with either one.
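As a quick illustration of that dual interface, here is a minimal sketch (the DataFrame and column names are invented for the example) of querying a Pandas DataFrame with SQL through DuckDB's Python API:

```python
import duckdb
import pandas as pd

# a toy DataFrame, just to have something to query
pageviews = pd.DataFrame({"country": ["FR", "FR", "US"], "views": [120, 80, 340]})

# DuckDB can scan local DataFrames referenced by name inside the SQL text
result = duckdb.query(
    "SELECT country, SUM(views) AS total_views FROM pageviews GROUP BY country"
).df()
print(result)
```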
This database technology got a lot of traction because of how simple it is to install and use. It also means that influencers and bloggers can easily experiment with it to show you how wonderful it is. This article is no exception.
Dagster, on the other hand, is an orchestration tool designed for the cloud and for data orchestration. They popularized the software-defined assets concept, which is a way to define data assets as code. This way the orchestrator knows the data dependencies and can do reactive scheduling rather than CRON-based scheduling.
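To make the concept more concrete, here is a minimal sketch of software-defined assets with Dagster's `@asset` decorator (the asset names and data are invented for the example); the downstream asset declares its dependency simply by taking the upstream asset as a parameter:

```python
import pandas as pd
from dagster import asset

@asset
def raw_events() -> pd.DataFrame:
    # upstream asset: in real life this would pull from an API or a bucket
    return pd.DataFrame({"user": ["a", "b", "a"], "clicks": [3, 5, 2]})

@asset
def clicks_per_user(raw_events: pd.DataFrame) -> pd.DataFrame:
    # downstream asset: Dagster infers the dependency from the parameter name,
    # so it knows clicks_per_user has to be refreshed when raw_events changes
    return raw_events.groupby("user", as_index=False).sum()
```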
So, Pete and Sandy from the Dagster team showcase how you can create an s3 data lake with DuckDB as the query engine on top of it. I really like the article because it shows, in a small amount of code, how you can:
- ingest data from Wikipedia with Pandas
- write a compact end-to-end test for the pipeline, with the test written before the code
- define Dagster data assets
- use DuckDB to read and write s3 assets (see the sketch after this list)
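I have not reproduced their code, but to give an idea of the last point, here is a minimal sketch of querying Parquet files sitting in s3 with DuckDB; the bucket path and credentials are placeholders, and it assumes DuckDB's httpfs extension:

```python
import duckdb

con = duckdb.connect()
# the httpfs extension gives DuckDB s3:// support
con.execute("INSTALL httpfs; LOAD httpfs;")
con.execute("SET s3_region='eu-west-1';")
con.execute("SET s3_access_key_id='<key>'; SET s3_secret_access_key='<secret>';")

# aggregate Parquet files straight from the lake, no warehouse involved
daily_views = con.execute(
    """
    SELECT page, SUM(views) AS views
    FROM read_parquet('s3://my-data-lake/raw/pageviews/*.parquet')
    GROUP BY page
    """
).df()
```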
Obviously what they did is purely experimental, but it gives ideas on how every company could create a lake with a smaller footprint and at a lower price. I mean, BigQuery and Snowflake are also launching on-demand processing, but here with DuckDB you really know what's running, and it's simple enough that you can measure all the costs.
PS: as I have never used DuckDB or Dagster (I plan to soon), all my comments are based on my theoretical understanding of the technologies and on my readings about them.
Databases time
It looks like a special edition about databases, but it is not. Dremio wrote an article explaining how a read query works on Iceberg tables. In a nutshell, a read query first uses the catalog to find the right metadata file, which points to the correct manifest files, which in turn point to the data files to read. In even simpler words, it uses the metadata to narrow the data search: the less data you read, the faster the query.
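To picture the pruning logic, here is a purely hypothetical Python sketch; none of these objects or methods belong to a real Iceberg client, they only mirror the catalog → metadata → manifests → data files chain described above:

```python
def plan_iceberg_scan(catalog, table_name, predicate):
    """Hypothetical scan planning: every step below reads metadata only, never data."""
    # 1. the catalog knows where the table's current metadata file lives
    metadata = catalog.current_metadata(table_name)
    # 2. the metadata file lists manifest files with partition-level statistics,
    #    so whole manifests can be skipped when the predicate cannot match them
    manifests = [m for m in metadata.manifests if predicate.may_match(m.partition_summary)]
    # 3. each manifest lists data files with column min/max stats: prune again
    data_files = [
        f
        for m in manifests
        for f in m.data_files
        if predicate.may_match(f.column_stats)
    ]
    # only the files that survive both pruning passes are actually read
    return data_files
```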
On a more exotic database side, the Redis team wrote a guide of things to consider when doing a database migration, and Mohammad wrote a retrospective on DynamoDB 10 years after its general release.
Playing dataviz tennis for collaboration and fun
This idea is so fun and I'd love to try it in a data team. For content purposes, Georgios and Lee played dataviz tennis. Every dataviz tennis match lasted 8 rounds, with 45 minutes per round, and the person who served picked the dataset. It means player 1 chooses a dataset, works on a viz for 45 minutes and then sends the viz over to player 2, who works on it for 45 minutes, and so on. All of this in R with ggplot2.
I think this is a fun way to collaborate, and we should try it in data teams for some projects. It is an alternative way to do pair programming and it could be done with data pipelines as well.
ML Saturday 🤖
In bulk, here are a few cool articles:
- Anatomy of a Data Science Team — 9 roles that appear in a data science team. This is a bit caricatural, but it depicts well the forces at stake when creating a data science team. As a side note, I've also read the Anaconda survey about the state of data science; in it we can see that data engineers are slightly more dissatisfied with their jobs than data scientists (cf. the picture below).
- How Deezer built an emotional AI — Deezer, a French music streaming app, shows how they adapted their music recommendation engine, Flow, to learn from their users' mood while identifying the mood of songs.
- Improving Instagram notification management with machine learning and causal inference — All this science just to send noisy notifications to make you addicted /s.
- Fooled by statistical significance & How Der Spiegel uses Machine Learning.
Fast News ⚡️
- Bringing autocomplete to Analytics Engineers — Analysts are probably the least equipped team when it comes to productivity. Navigating through thousands of lines of SQL or developing SQL in a web browser is not much fun. Today Deep Channel proposes a workspace to solve this hurdle by adding autocomplete and faster error detection; it integrates mainly with dbt. To be honest, the issue is larger than this and will not be solved by one tool, but still, the idea is great.
PS: on this matter you can still try my dbt-helper extension to extend your BigQuery console.
- Snowflake RBAC Implementation with Permifrost — Managing Snowflake rights can be a huge pain in the ass. You can do it with custom code or Terraform. Here the Yousign team details how they did it in YAML with Permifrost. They manage databases, warehouses, users and roles with configuration. The article also shows what the standard is when it comes to Snowflake rights in an enterprise setting.
- How to make data documentation sexy — Madison has written a lot of content about documenting data knowledge. This time she proposes rules to apply when writing documentation.
- What good data self-serve looks like — For years every data team has wanted to provide self-service to stakeholders in order to reach heaven. In heaven, stakeholders write SQL, are autonomous, and the data team concentrates on delivering value and analyses that justify its high salary costs. But this is only in heaven; most self-service is badly achieved, and data teams end up being a support team trying to create trust in the data. In the article, Nate tries to define good self-service and the key levers to act on.
Data Fundraising 💰
- Dataloop raises $33m Series B. Dataloop is mainly a data labelling platform that focuses on quality. They propose an end-to-end platform to do everything around AI.
- Alation raises $123m Series E. The company, founded in 2012, raised another round to push their data catalog solution further into enterprises. I don't have a lot to say, except that they have too many products for my brain to understand what they really sell.
- Galileo gets $18m Series A. Galileo's package integrates with your Python machine learning stack to add debugging and tracking to your work.
See you next week ❤️ PS: below there should be a survey about how you like the newsletter; please tell me what you think.