Data News — Week 17
Data News #17 — Mozart Data and Deepset fundraising, conferences time, Netflix fact store, data compression in Python, Zach Wilson on Reddit, etc.
Hello friends, you'll find this week's newsletter below. It's a lot of bulleted lists because I had less time than usual to write, so enjoy and see you next week!
Data fundraising 💰
- Mozart Data raised $15m in Series A. Mozart Data provides an all-in-one data platform for modern needs. Snowflake has been picked behind the scenes as the primary warehouse, and customers get a lot of connectors, SQL-based transformations and exports. This kind of cloud data platform has flourished over the last two years, giving companies a faster cold start.
- Deepset announced a $14m Series A and a cloud version of their end-to-end NLP search platform. Deepset has open-sourced haystack, an NLP framework for neural search, question answering and semantic document search. I got an idea with haystack, stay tuned...
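To give an intuition of what semantic document search does, here is a toy sketch — not haystack's actual API, and with bag-of-words counts standing in for the neural embeddings a real system would use: rank documents by vector similarity to the query.

```python
import math
from collections import Counter

def vectorize(text):
    # Toy "embedding": a bag-of-words term-frequency vector.
    # Real neural search would use a transformer model here.
    return Counter(text.lower().split())

def cosine(a, b):
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * \
           math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def search(query, documents):
    # Rank documents by similarity to the query, best match first.
    q = vectorize(query)
    return sorted(documents, key=lambda d: cosine(q, vectorize(d)), reverse=True)

docs = [
    "haystack does question answering over documents",
    "the warehouse stores transformed tables",
]
print(search("question answering", docs)[0])
```

Swap the toy vectorizer for dense model embeddings and you get the "neural" part of neural search.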
Conferences time
Recently a lot of conferences took place. Here are some recordings and topics I liked.
- The Leboncoin team detailed how they went from a bash script to a data mesh — here are the slides. The presentation is totally worth it as they define the state of the art of data platform concepts: idempotency, commutativity, etc.
- The Metrics Store Summit, hosted by Transform, took place this week but we don't have the recordings yet — hey Transform team, if you read this I'm keen to get them.
- Data Council Austin videos are out on YouTube and Dagster's presentation about software-defined assets is available — a must-see to be honest. Sandy goes through the changes the software domain has seen and what the data domain is going through right now.
- At the April Airflow community meetup, Benoit gave a talk based on a great article he wrote a year ago: You don’t need an orchestrator.
- And finally the Kafka Summit, which took place in London. Robin wrote a full recap of the conference on Confluent's blog. I also found Viktor's presentation about testing Kafka with Testcontainers super interesting.
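Idempotency, one of the concepts the Leboncoin talk covers, is easy to sketch: re-running a pipeline for the same partition should overwrite rather than append, so the result is identical no matter how many times you replay it. A minimal illustration with hypothetical names, using a dict as a stand-in for partitioned storage:

```python
def run_pipeline(store, partition_date, rows):
    # Idempotent write: overwrite the whole partition instead of
    # appending, so replaying the same day leaves the store unchanged.
    store[partition_date] = list(rows)
    return store

store = {}
run_pipeline(store, "2022-04-29", [{"user": "a", "clicks": 3}])
run_pipeline(store, "2022-04-29", [{"user": "a", "clicks": 3}])  # replay
assert store["2022-04-29"] == [{"user": "a", "clicks": 3}]  # no duplicates
```

With an append-only write, the replay would have doubled the rows; the overwrite is what makes backfills and retries safe.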
Netflix ML fact store
This is maybe the first time you hear of this concept. Conceptually we commonly call it a feature store. The idea behind it is to store, at any point in time, the value of facts — or machine learning features — about your users. Netflix decided to call it a fact store. This is a fact. But still, their architecture is interesting.
Everything relies on a data store called Axion, which is a mix of Spark, a cache and Iceberg. Obviously they developed their own in-memory key-value cache, called EVCache. Every feature store contains a key-value storage because you need data per ML "entity".
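Axion's internals are Netflix-specific, but the core contract of any fact/feature store can be sketched as a point-in-time key-value lookup per entity. A toy sketch — not Netflix code, all names hypothetical:

```python
class FactStore:
    """Toy fact store: keep timestamped fact values per entity and
    answer "what was the value at time t?" (point-in-time lookup)."""

    def __init__(self):
        # (entity_id, fact_name) -> sorted list of (ts, value)
        self._facts = {}

    def write(self, entity_id, fact_name, ts, value):
        history = self._facts.setdefault((entity_id, fact_name), [])
        history.append((ts, value))
        history.sort(key=lambda pair: pair[0])

    def read(self, entity_id, fact_name, ts):
        # Return the latest value written at or before ts.
        result = None
        for t, v in self._facts.get((entity_id, fact_name), []):
            if t <= ts:
                result = v
            else:
                break
        return result

store = FactStore()
store.write("user_1", "watch_minutes", ts=10, value=42)
store.write("user_1", "watch_minutes", ts=20, value=58)
print(store.read("user_1", "watch_minutes", ts=15))  # prints 42
```

The point-in-time read is the important part: when you train a model on historical data, you must see features as they were back then, not as they are now — otherwise you leak the future into training.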
If you still struggle to understand what feature engineering is, Swapnil tries to unravel the feature engineering mystery.
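In short, feature engineering turns raw events into model inputs. A tiny illustrative example with made-up data — deriving classic per-user features (purchase count, total spend, recency) from a purchase log:

```python
from datetime import date

# Raw events: one row per purchase (hypothetical data).
events = [
    {"user": "a", "day": date(2022, 4, 1), "amount": 30.0},
    {"user": "a", "day": date(2022, 4, 20), "amount": 10.0},
    {"user": "b", "day": date(2022, 3, 5), "amount": 99.0},
]

def build_features(events, as_of):
    # Aggregate raw events into per-user model features.
    features = {}
    for e in events:
        f = features.setdefault(
            e["user"],
            {"n_purchases": 0, "total_spent": 0.0, "days_since_last": None},
        )
        f["n_purchases"] += 1
        f["total_spent"] += e["amount"]
        recency = (as_of - e["day"]).days
        if f["days_since_last"] is None or recency < f["days_since_last"]:
            f["days_since_last"] = recency
    return features

print(build_features(events, as_of=date(2022, 4, 29)))
```

Three raw purchases become two feature rows a model can consume directly; that transformation is the whole craft.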
As a follow-up, a small ML Friday 🤖
- Zalando detailed their machine learning platform, from experimentation to production, while integrating it into their custom internal tools.
- Beyond matrix factorization explains how Yelp mixed methods to do user recommendations.
How to build a lossless data compression and data decompression pipeline
Navigating through compression engines can be hard. To help you understand what it means, Ramses explains the building blocks of a data compression pipeline and gives an example of the bzip2 algorithm written in Python. After reading the post you'll feel like Richard Hendricks.
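If you just want to see the lossless compress/decompress roundtrip before diving into the internals, Python's standard library already ships bzip2:

```python
import bz2

# Highly repetitive input compresses well.
original = b"data news " * 1000

# Compression pipeline: bytes in, smaller bytes out.
compressed = bz2.compress(original)

# Decompression restores the exact original bytes — that is
# what "lossless" means.
restored = bz2.decompress(compressed)

assert restored == original
print(f"{len(original)} bytes -> {len(compressed)} bytes")
```

Ramses's post rebuilds what `bz2.compress` does internally (Burrows-Wheeler transform, move-to-front, Huffman coding); the stdlib call is the shortcut.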
Zach Wilson — Ask me anything on Reddit
Zach Wilson, the LinkedIn influencer, ran an AMA on Reddit about his experience as a data engineer at FAANG, with a great career path evolution from L3 to L6. From the answers we can learn a lot, from what the interview process at Netflix looks like to advice for entry-level people.
Fast News ⚡️
- The Pinterest team optimized their logging data ingestion stack — The ingestion pipelines are composed of a pub/sub system (MemQ) and a log agent (Singer) that handle Pinterest logs at huge scale. This article details performance and partitioning research.
- When not to use neural networks — Neural networks have taken over data science in recent years, but sometimes the "traditional" methods are still better or more relevant; this post could help you decide.
- Data Quality unit tests in PySpark using Great Expectations — This article is more a showcase of what is possible than a real solution, but it's still valid.
- Modern data stack for startups — In summary: blablabla use a data loader like Fivetran blablabla a warehouse and dbt.
- The magical fusion between batch and streaming insights — This article shows how to implement a good old lambda architecture on Azure. But it's time to merge these layers. For real.
- How our growth challenged our data migration process — The Welcome to the Jungle backend team wrote a feedback post about the challenges of renaming a column in their Postgres database. The process is interesting to see for all the grumbling data-migration engineers — like me.
- Facebook doesn't really know where all your data goes, leak suggests — Is it time to dismantle Facebook? As opposed to Facebook, build privacy-first systems where you know where customer data is flowing.
PS: I hope you still have fun with the picture captions.