Data News — Week 24.03
Data News #24.03 — ChatGPT in classes, Zuckerberg announcements, Bard and awesome news.
Hey, I hope this new edition finds you well. We are deep in the winter, it's time for a comfy Data News to read near the fire 🔥.
This week, on Monday, I started my annual university lecture. It's been 9 years since I started teaching and this year something was different. The students were incredibly calm. Obviously my course is a bit difficult at the beginning because it touches on concepts they are not used to (cloud, data in production, data engineering, etc.), so it's normal that they don't have any questions at first. But still, even during exercises hands stayed down, whereas in previous years students kept asking me for debugging help.
This year something was off.
On Wednesday I finally understood what had changed. It was ChatGPT. The whole class was using ChatGPT: I did a show-of-hands survey and everyone said yes. So now the default go-to is to ask ChatGPT rather than me, and only if ChatGPT does not have the answer might they ask me.
I still don't know how to react to this. I think it does not make sense to ban ChatGPT, just like it was stupid to ban Google Search in my day. But still there is something to do, I need to research and think more about it.
I assume that education will be radically transformed. Both ways. The way students learn will be different, and the way teachers teach will have to be different too. This will force us to bring something to the class that ChatGPT can't: humanity.
AI News 🤖
- OpenAI opened an Elections Program Manager position — The role is to support the "efforts around elections security and integrity for the EMEA region". This year we have the European Parliament election. It clearly shows how OpenAI products are, or might be, used in order to win races. I guess they already have people for the US elections.
- Palantir CEO: U.S. eating everyone's lunch on AI — Let's continue on politics. At the World Economic Forum in Davos, Palantir's CEO said that within 10 years 95% of the world's top tech companies will be American. I don't see any difference with today.
- Meta released MAGNeT — MAGNeT is a text-to-music and text-to-sound model capable of generating high-quality audio samples conditioned on text descriptions. It seems that it works well when generating sound effects.
- Zuckerberg teased 2024 Meta AI strategy — In a selfie video on Facebook / Instagram Zucky explained that Llama 3 is coming and that Meta is building a massive infrastructure of around 600k H100-equivalent NVIDIA GPUs. That represents $27b in raw GPUs alone. But luckily for us Meta will open-source everything they do because they love us so much <3. In exchange we can wear the new Ray-Ban Meta glasses with AI inside to give Meta more training data. We are just watching the world shift and we are all ok with it.
If you want a less salty opinion than mine, Oliver, as always, aced it.
- Prompt engineering with Bard — A recording from a local Google Developers meetup last November. In this talk we discover a few concepts on how to talk to Bard. Mainly you can do it through the API or the UI, and Peter explains that Bard shines in creativity, factuality and reasoning. He does a great job explaining the concept of grounding and why it matters.
Right after, he explains how you can "teach" Bard to reason, using a word-reversal example in which Bard fails. To get it right you have to ask Bard to write and execute code in the background, but to activate the code execution feature "you're at the mercy of the classifier". This is our future, being at the mercy of classifiers.
- Google News is boosting garbage AI-generated articles — This is a paid article. The title speaks for itself.
- Paper, Sleeper agents — A not-so-reassuring paper. Anthropic's research team showed that it is possible to insert backdoors in models and that the backdoors persist despite safety and adversarial training.
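To make the word-reversal example from the Bard talk concrete: reversing a word is hard for a model that works on tokens, but trivial once it is delegated to code. Here is a minimal Python sketch of that task. The function name is mine, and this is not the code Bard generates in the talk.

```python
def reverse_word(word: str) -> str:
    """Return the characters of `word` in reverse order."""
    # Slicing with a step of -1 walks the string from the last
    # character to the first.
    return word[::-1]


print(reverse_word("bard"))
```

Code works on individual characters, while a language model sees chunks of text, which is presumably why running code in the background helps here.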
Fast News ⚡️
- It's time to build — Still a big fan of Benn's content. This week he looks back at why his content shifted from the modern data stack to AI. Then there's all the marketing that goes into selling data tools to "unleash the power of your data". Far from the trends and the spotlight, it's actually time to build tools.
- My thoughts going into a New Year — Tristan, dbt Labs CEO, wrote down his thoughts about 2024. He covers why AI has not yet impacted data jobs and writes about OSS licensing after Snowplow's recent changes, restating that dbt Labs does not need to make this change because they have a solid commercial path.
- EU AI Act — A nice looking summary of what matters in the new EU AI Act, from the risk definitions to the potential fines. Then CastorDoc explains why a data catalog can help you keep an overview of how to be compliant.
- Packaging, one year later: a look back at 2023 in Python packaging — Understanding Python packaging is one of the most important skills to master when you want to enter the Python world. I've seen too many people struggle in their development workflow because they are not used to pip and all. Chris wrote a follow-up to last year's post about the sad state of Python packaging, explaining standards and proposing things for the future.
- How lazy imports accelerate machine learning at Meta — Meta developed Cinder, their own fork of CPython. To speed up model training they switched to Cinder and adopted lazy imports, where a module is only loaded the first time it is actually used.
- Measuring data quality: bringing theory into practice — Mikkel is one of the best when it comes to putting the correct words on data quality issues. You should read this article to clarify these concepts.
- Big O — A practical approach — The Big O notation is something taught at school and super important when programming, especially in data where complexity has to be understood to speed up data transformations. This article gives you what's important. o/
- How DoorDash used a service mesh and saved costs.
- The evolution of a data platform, part 2 — Part 2 of the series on MatHem’s analytical platform. It runs on GCP with BigQuery at the center and events flowing in from Pub/Sub and Dataflow, making great use of the whole Google Cloud toolbox.
- Airflow evolution at Snap — Large teams need multi-tenancy and Snap is one of them. This article shows all the different architectures Snap put in place to deploy Airflow at scale.
- Integrating Airbyte with data orchestrators: Airflow, Dagster and Prefect — An orchestrator comparison and how Airbyte can be used for extract-and-load within each of them.
- Convert your PySpark code to Snowpark code using SnowConvert — Snowflake trying to attract Databricks customers.
- Hey Snowflake, send me a <fancy> HTML email — This is one of the features I'm the most unsure about. Do I want my warehouse to be able to send emails without my global orchestration system being aware of it? Yes, because it's cool to give freedom to users... but no, because as a data engineer I want a platform where flows are controlled. What's your take on this?
If you want to do it with BigQuery, you should take a look at my friend's BigFunctions (see the send_email function).
- bq-lineage-tool — Java code that uses ZetaSQL to build a column-level lineage parser for BigQuery.
- Comparison running dbt-core and dlt-dbt runner on Functions — If you're running dbt-core within Cloud Functions, check this article integrating dbt with dlt to avoid a weird hacky subprocess.
- Build a personal real estate dashboard with dlt and dbt — Another example where you chain dlt (to extract and load data) and dbt (to transform data), this time to build a real estate dashboard and find your dream property in Portugal.
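About the lazy imports article above: the idea is to postpone the actual loading of a module until the first time one of its attributes is touched. Here is a minimal sketch with the standard library's `importlib.util.LazyLoader`. It is not Meta's Cinder implementation, which works at the runtime level, and the `lazy_import` helper name is just for illustration.

```python
import importlib.util
import sys


def lazy_import(name: str):
    """Return a module whose code only runs on first attribute access."""
    # If the module is already loaded there is nothing to defer.
    if name in sys.modules:
        return sys.modules[name]

    spec = importlib.util.find_spec(name)
    if spec is None or spec.loader is None:
        raise ImportError(f"cannot find module {name!r}")

    # Wrap the real loader: exec_module below only prepares the module,
    # the real execution happens on the first attribute lookup.
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    loader.exec_module(module)
    return module


# Nothing heavy happens here...
json_mod = lazy_import("json")
# ...the module body is executed on this first attribute access.
print(json_mod.dumps({"lazy": True}))
```

On a codebase with many heavy imports, deferring the ones that a given run never touches is where the startup gain is expected to come from.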
Data Economy 💰
- Phospho raises €1.7m in pre-seed to build GenAI monitoring applications.
- SKY ENGINE AI raises $7m Series A for a platform that generates synthetic data for deep learning vision algorithms. It lets you create 3D stuff that you can use to train algorithms.
See you next week ❤️.
blef.fr Newsletter