blef.fr

Hello, as I said this week, data engineers never meet deadlines, and here I am. I'm late for this week's publication, but I have a good reason: I wrote my first original article, about the complicated relationship we data engineers have with deadlines, and I also started posting daily tips on LinkedIn.

I really hope you like my content 😊 We are getting close to 300 subscribers and I want to thank you for that.

Data fundraising 💰

"Databricks raises data lake of cash at monstrous $38bn valuation" — title by blocksandfiles. In this final part of the Databricks story they raised $1.6 billion in their Series H. I previously announced they could go public but it seems the rumor was wrong.
Everyone saw the news: Docker changed its pricing model, raising a lot of discussion within the tech community. The big change is that Docker Desktop will require a paid subscription for companies with more than 250 employees or more than $10m in annual revenue. Following this story, here are 2 Twitter threads from Docker folks to understand what's behind it: the first by Docker's VP of Products on buying vs. DIY OSS, and the second about the hidden work behind Docker Desktop.
Contentsquare, a French-founded analytics company, acquires Hotjar. Contentsquare provides a digital experience platform to get insights from consolidated user journeys. Combined, the two companies will represent 1,000 employees and will be used on at least 1 million websites.

How to increase your data adoption

Imagine a data team with incredible people doing incredible stuff every day, but you struggle to make them shine across the whole company. I'm sure you can picture it. Paige Berry gives you all the keys in Share your data insights to engage your colleagues. If you are looking for a working example of a weekly Slack post, with a checklist of questions you need to answer, this is the post.

On the other hand, if you are a data team leader and you still struggle with adoption after defining the strategy, Adam Votava gives you 3 challenges to overcome.

[Research] Reimagining database querying on unstructured data

The Facebook NLP research team came out with a new concept: Neural Databases. Imagine a database where you could ask a question like: List everyone born before 1980.

And the database would be able to search among rows like the ones below and return Teuvo and Sheryl.

Sarah married John in 2010. 
Teuvo was born in 1912 in Russia.
Sheryl’s mother gave birth to her in 1978. 
Sarah was born in Chicago in 1982.

It seems magical and awesome. If you are interested in what is behind it, you should have a look at the blog post, which also includes a paper and a GitHub repo with code examples.

Facebook NLP research team doing magic (credits)
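To get a feel for why this query is hard, here is a minimal Python sketch of a hand-written baseline. It is not the approach from the Facebook paper: it simply assumes two hypothetical regex patterns (`BIRTH_PATTERNS`), one per phrasing of a birth fact in the four sentences above. Every new phrasing would need a new rule, which is the kind of brittleness a neural database aims to remove.

```python
import re

# The four facts from the example, stored as plain sentences.
FACTS = [
    "Sarah married John in 2010.",
    "Teuvo was born in 1912 in Russia.",
    "Sheryl’s mother gave birth to her in 1978.",
    "Sarah was born in Chicago in 1982.",
]

# Hypothetical hand-written rules (an assumption for illustration):
# one pattern per way of expressing a birth year.
BIRTH_PATTERNS = [
    re.compile(r"^(?P<name>\w+) was born\b.*?\b(?P<year>\d{4})\b"),
    re.compile(
        r"^(?P<name>\w+)['’]s mother gave birth to (?:her|him) in (?P<year>\d{4})\b"
    ),
]


def born_before(facts, year):
    """Return the names of people whose extracted birth year is before `year`."""
    names = []
    for fact in facts:
        for pattern in BIRTH_PATTERNS:
            match = pattern.search(fact)
            if match and int(match.group("year")) < year:
                names.append(match.group("name"))
                break
    return names


print(born_before(FACTS, 1980))  # ['Teuvo', 'Sheryl']
```

The marriage sentence is ignored only because no rule matches it; a sentence such as "Sheryl turned 40 in 2018" would silently be missed too.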

Timeseries anomaly detection with Thirdeye

Cyril from AB Tasty offers us a wonderful article about anomaly detection with Thirdeye. For those asking, Thirdeye is a monitoring tool for timeseries developed at LinkedIn and based on Apache Pinot. If you are looking for ideas on how to architect your data quality pipelines, this one is for you.

As a side note, they also added a BigQuery connector to Thirdeye.

Data Scientists I come in Peace

Over the last few months, we had engineers saying "we don't need Data Scientists, we need Data Engineers", then data scientists saying "we don't need Data Engineers, we need better tools". Ben Rogojan (SeattleDataGuy) responded in a YouTube video to the latest provocative (clickbait?) article.

If I can add my view to the debate, I think data engineers and data scientists bring something unique to a company when they successfully work together. This is not a race or a war.

Incident Commander reporting for duty

I'm in favor of data engineering taking inspiration from software engineering practices. Copying incident response management from Google's SRE practice is one way to start. Glen Willis wrote on TDS about applying the Incident Commander role to the data field.

Pinterest Analytics Platform

If you are interested in building an analytics platform and are curious about Druid, this series of posts is for you. In their latest article, the Pinterest team shares what they learned along the way about Druid optimizations.

If you are a Druid newbie, this recent article describes everything you need to know.

I searched for Druid on Unsplash; those who get it will get it (credit)

ML Saturday — Criteo and LinkedIn

This week I want to share with you Criteo's image-based ML techniques to classify e-commerce products. Unsurprisingly, they chose deep learning techniques to work on their catalog of 25+ billion products. From a tech perspective, they use Spark, Parquet and HDFS to transform and store the data.

The LinkedIn Anti-Abuse AI Team shared their deep learning research on fake account creation, automated profile scraping and spam. The way they represent the activity data to find patterns and then decide what to do is amazing. I really like the article.

Fast News

Data Engineering Podcast — Designing And Building Data Platforms As A Product feat. amazing people from Monte Carlo, Vimeo and Facebook.
SeattleDataGuy wrote about Snorkel.ai and analyzed trends in the data labeling industry. Data Labeling and Annotation Services will soon reach the plateau in the Gartner hype cycle for Data Science and ML.
Mito — A tool to render sheets directly inside your own JupyterLab with advanced features.
BigQuery debug function: I discovered the ERROR function in BigQuery. It can be useful to raise an error when something is not as expected.
GCP Sketchnotes — The Cloud Girl created a lot of sketch resources to understand Google Cloud resources.
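
For the BigQuery ERROR function mentioned above, here is a minimal Python sketch that builds a query guarding a column with `IF(..., ERROR(...))`. The helper `guarded_column`, the table `my_project.my_dataset.orders` and its columns are hypothetical names used only for illustration; the sketch only builds the SQL string and does not run it.

```python
def guarded_column(column: str, condition: str, message: str) -> str:
    """Build a SQL expression returning `column` when `condition` holds,
    and calling BigQuery's ERROR() with `message` otherwise."""
    escaped = message.replace("'", "\\'")  # keep the SQL string literal valid
    return f"IF({condition}, {column}, ERROR('{escaped}'))"


# Hypothetical table and column names (assumptions for illustration).
QUERY = f"""
SELECT
  order_id,
  {guarded_column("amount", "amount >= 0", "Negative amount found in orders")} AS amount
FROM `my_project.my_dataset.orders`
"""

print(QUERY)
```

Run against BigQuery, for example through the official Python client, a query like this should fail as soon as a row violates the condition instead of silently returning bad data.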

See you next week.