Data News — Week 23.40
Data News #23.40 — OpenAI iPhone?, Python 3.12, chat with BigQuery, save costs and other awesome stuff.
Hey, I'm a bit late once again. I hope this edition finds you well. This is almost a raw edition: I had a large number of links, and I hope you will like this selection.
Gen AI 🤖
- OpenAI’s plan to build the "iPhone of artificial intelligence" — Obviously this is one of the main struggles for OpenAI. To stay in the B2C market for the long run they need more than a chat interface: they need hardware, and they need to enter users' everyday lives. Still not sure we need a new addictive device.
- ❤️ Generative AI exists because of the transformer — A scroll story by the Financial Times explaining what generative AI is. Good for everyone.
- Evaluating LLMs is a minefield — Slides from Princeton University about LLM evaluation and why it's hard to understand how it evolves. You can find the video version on the Princeton website, named "Societal Impact of AI".
- JPMorgan CEO says AI could bring a 3½-day workweek — blablabla, AI is awesome, we want AI everywhere and to pay people less blablabla /s.
- Language Modeling is compression — Paper from DeepMind. The title looks cool. To be honest this is not the first time I've seen LLMs and compression in the same paper, and it opens the door, at least, to fun experiments.
- Decoding speech perception from non-invasive brain recordings — Even crazier. This article describes the state of the art in decoding speech from brain activity. It covers multiple models and shows what we can achieve by just looking at electrical or magnetic recordings of the brain.
Fast News ⚡️
- Microsoft Fabric: should Databricks be worried? — Vantage did a pricing comparison of Microsoft Fabric and Databricks. Generally Fabric pricing is simpler because the product is fresh and new, but for more complex stuff Databricks still shines.
- Introducing Python and Jinja in Cube — Cube, an open-source semantic layer, has released new authoring capabilities: Python, with Jinja in the YAML definitions. It is reminiscent of dbt. You can now write macros to generate YAML.
- Confluent announced Kafka roadmap and Flink as a Cloud service — This is the result of Confluent's acquisition of Immerok. Confluent is still growing but struggles to become a real competitor to Databricks or Snowflake.
- Python 3.12 is out — Every year a new Python minor version is released, and this year it brings a few cool features. Mainly you get a new type parameter syntax for generics, better f-strings (multiline expressions inside the curly brackets and quote reuse) and the per-interpreter GIL, to which Meta is proud to say it contributed.
- Announcing Malloy 4.0 — This is the tool I have to try soon. Malloy is out with a new version and a lot of new features. As a reminder Malloy is a new analytical language meant to generate SQL to query databases.
- 4 tips to save warehouse money — Paul posted on LinkedIn about using search indexes, avoiding rerunning dbt tests when possible, or just deleting tables. Ian also suggested identifying and removing things you don't need anymore.
Tech and data engineering stuff ⚙️
- CRON jobs at Slack scale — Why do you need an orchestrator when you can run CRONs? The Slack engineering team details how they wrapped CRON jobs on top of Kubernetes, with a database table for monitoring.
- Using ML to identify date formats in file names — Dropbox developed a classifier to identify date formats in file names. This is the backbone of a naming convention feature. It gives ideas. Based on DistilRoBERTa, this is something to look at to fix the mess of a data lake.
- Data modeling is dead! Long live data modeling! — Joe Reis's keynote at Big Data London about the topic of his next book: data modeling. Joe covers why data modeling was put aside in recent years and why we need it back today, showcasing a few useful patterns and definitions.
- Chat with BigQuery data — This is a recycling of all the chatbot use cases, once again. It's an example where you can use natural language to access BigQuery data. There is also a walkthrough example on Airbyte with LlamaIndex.
- Creating an Airflow custom hook for API calls — A guide showing you how you can extend Airflow hooks to have a custom way to call APIs.
- Goodbye Spark. Hello Polars + Delta Lake — Spark is under attack. In recent years Spark has powered a lot of data use cases, but the modern data stack and, more recently, DuckDB, Polars and smaller-scale OLAP technologies allow a new way to do data processing.
- Ensuring data quality with Verity — Lyft's definition of data quality and a tour of Verity, the in-house product that addresses data quality in their data platform. This is a must-read and a good showcase of what you can do.
- Database file format optimization: per column dictionary — Mixpanel developed a proprietary columnar database and this article shows what they did to improve compaction and increase performance.
- Exploring the power of graph databases.
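For the per-column dictionary item above, here is a minimal sketch of what dictionary encoding means in general. This is not Mixpanel's implementation, which is proprietary and lives inside their own file format. It only illustrates the idea under simple assumptions: every column holds strings, and each column keeps its own dictionary of distinct values plus one small integer code per row. The names `DictionaryColumn`, `encode_column`, `decode_column` and `encode_table` are hypothetical and made up for this example.

```python
from dataclasses import dataclass


@dataclass
class DictionaryColumn:
    """One column stored as a dictionary of distinct values plus integer codes."""

    dictionary: list[str]  # distinct values, in order of first appearance
    codes: list[int]  # one index into `dictionary` per row


def encode_column(values: list[str]) -> DictionaryColumn:
    """Replace each value by the position of that value in the column dictionary."""
    index_of: dict[str, int] = {}
    codes: list[int] = []
    for value in values:
        if value not in index_of:
            # First time we see this value: give it the next free code.
            index_of[value] = len(index_of)
        codes.append(index_of[value])
    # dicts keep insertion order, so the keys are already sorted by code.
    return DictionaryColumn(dictionary=list(index_of), codes=codes)


def decode_column(column: DictionaryColumn) -> list[str]:
    """Rebuild the original values from the dictionary and the codes."""
    return [column.dictionary[code] for code in column.codes]


def encode_table(table: dict[str, list[str]]) -> dict[str, DictionaryColumn]:
    """Encode every column independently, each with its own dictionary."""
    return {name: encode_column(values) for name, values in table.items()}


# Hypothetical sample data: two low-cardinality columns.
events = {
    "country": ["FR", "US", "FR", "FR", "US"],
    "browser": ["Firefox", "Firefox", "Chrome", "Firefox", "Chrome"],
}
encoded = encode_table(events)
print(encoded["country"].dictionary, encoded["country"].codes)
```

The gain comes from low-cardinality columns: a repeated string is stored once in the dictionary and the rows only carry small integers, which are cheaper to store and to scan.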
Data Economy 💰
- Yahoo spins out Vespa. Vespa is the tech behind Yahoo's search engine; it's a search engine and a vector database. In the current Gen AI times, it looks like a good moment to do it.
- Contentsquare acquires Heap. Heap is a product analytics solution to better understand your acquisition funnel performance.
- Kestra raises $3m Seed funding. Kestra is the new kid in the open-source orchestration space, but it is disrupting the Python status quo because it's written in Java and requires you to write pipelines in a declarative way, in YAML. If you want to know more you can watch this YouTube live I did with Kestra's CTO demonstrating its capabilities.
See you soon ❤️.