Data News — Week 24.37
Data News #24.37 — OpenAI o1 new series, building a low-cost platform with Modal, dlt and dbt, data teams survey, feature store, Ibis without pandas.
Hey you, can you believe it's already September? This year has been flying. It feels like I just blinked, and here we are. In August, I focused mainly on my next big journey. If you follow me on LinkedIn, you might have caught a sneak peek! I'll be making a full announcement next week. I want to take the time to explain my thought process and the ideas behind it. I hope you will like it.
Below is the Data News wrapping up the summer and the first two weeks of September.
AI News 🤖
- OpenAI released 2 new models, OpenAI o1-preview and o1-mini — These models bring changes and a break in the model naming. OpenAI decided to give up on the GPT naming, which means GPT-5 will never be plugged in. The GPT paper was co-authored by 4 people and 3 of them are no longer at OpenAI, so leaving the GPT name behind also marks a change in paradigm.
The o1 series brings more “reasoning”; it looks like a pre-prompt that does a chain of thought on top of what they already did best. Lots of stories about exceptional things the model can do have been published today. For example, the OpenAI system card explains that during a cybersecurity challenge (a CTF) the model was able to understand that the Docker environment was failing (due to infra) and still find the flag.
Here is a YouTube playlist demonstrating o1's capabilities.
As Clem mentioned on Twitter, it's always important to pay attention to words: even if it is called a “reasoning” model, it doesn't think, it processes.
- More news about OpenAI
- New models are already available on Azure; but be careful, Microsoft's open-source Phi-3.5-mini is out.
- Ilya Sutskever, previously Chief Scientist at OpenAI, raised $1b to co-found Safe Superintelligence with a manifesto.
- Alexis Conneau, Her ex-research lead at OpenAI, decided to create a new company and got a lot of Tweet impressions. Former OpenAI members are quite popular when it comes to founding companies.
- Bloomberg reported that OpenAI seeks to raise $11.5b more at a $150b valuation, making it the third private company in terms of valuation [paywall article].
- NEO Beta, a humanoid company backed by OpenAI, released a first video demo. And it's impressive (🙃): the robot is able to hand over a bag to a human!
- We hope the next OpenAI model is not o7. /s
- OpenAI and Anthropic will give their models first to the US government (NIST) to help advance safe and trustworthy AI innovation for all. But they cry that the AI Act threatens innovation when it is voted in Europe.
- NVIDIA released Eagle, a vision-centric multimodal LLM — Look at the example in the GitHub repo: given an image and a user input, the LLM is able to answer things like "Describe the image in detail" or "Which car in the picture is more aerodynamic" based on a drawing.
- Aleph Alpha introduced Pharia-1-LLM — It's a 7B model and the license explicitly targets non-commercial and research use. Aleph Alpha is a German company funded by German VCs (with $500m) that was trying to compete with US companies (like Mistral and OpenAI 🤭) in the model race, but gave up this competition to pivot to an AI-support company for the public sector.
Fast News ⚡️
- EU and China launch cross-border data flow communication mechanism — It's an official statement saying that the EU and China will re-discuss the policy on data transfers out of China for European companies, which are currently difficult.
- Building a cost-effective analytics stack with Modal, dlt, and dbt — A great example of how you can build a small analytics stack in today's world with dlt, dbt and Modal, a serverless platform to run Python stuff. The article contains a lot of code snippets to understand what's under the hood.
- Data teams survey 2024 — Jesse Anderson released the results of his survey about the state of data teams in 2024.
- Daily cache implementation in Python — A highly effective approach for caching when working with large datasets stored in distant buckets is to implement a local cache. It avoids the need to repeatedly download the data.
- Safe composable Python — A good article about function composition and testing in Python and how they fit together.
- What are the key parts of data engineering — A simple way to present the key parts of data engineering.
- Automation strategies for monitoring and self-healing of data pipelines — I like the concept of self-healing pipelines, though I'm not sure it's really idempotent or that it leads to great management of data assets. Still, the article also relates to data contracts, and the problem might be solved another way.
- Medium feature store, how do they store lists — Medium built a feature store powering their recommendation system (which could work better tbh). In this blog post they explain how they decided to store list-type features.
- How UK football relies heavily on data — It's common knowledge that Liverpool FC won multiple titles recently by being data driven. This article shows how data teams are becoming larger and larger in the clubs. In the Premier League top 6, the average data team headcount is 14.
- Spotify data science personas — The data science role is evolving and Spotify proposed multiple personas among their data teams.
- Unlocking insights with high-quality dashboards at scale — A checklist of things you should look at to build high-quality dashboards. They even developed a Dashboard portal to improve dashboard usage and discovery.
- Ibis drops the pandas backend and fully embraces DuckDB — It's a big choice moving forward: Ibis, a multi-backend dataframe library, decided to drop the pandas backend and use DuckDB by default. The article says that DuckDB is way faster and covers the feature gap, and that pandas was mostly annoying because it uses NaN for null values, whereas all the other backends use NULL.
- DuckCon #5 videos — All the videos from the 5th DuckCon in Seattle are on YouTube. I have not had the time to look at them yet, but I think awesome things are waiting for us behind a click.
- Should you be migrating to an Iceberg Lakehouse? — This is an excellent question, and a good starting point for considering whether you should change all your assets to Iceberg.
- Amazon S3 now supports conditional writes — I think it's a great start for table formats.
- The analytics development lifecycle by dbt — Tristan Handy proposed a framework (very Enterprise) to rethink how the analytics workflow should work as of today. The article is 37 minutes long so I did not read it, but I saw the holy DevOps/DataOps infinity sign, which gave me an instant headache. Toby from SQLMesh answered on LinkedIn, drama time.
- Making SQLMesh faster — The road for SQLMesh is crystal clear: they want to be the faster alternative to unmanageable large dbt projects, so they are working on execution speed.
- What's your question?
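On the daily cache item above, the idea fits in a few lines of standard-library Python. This is only a minimal sketch, not the linked article's implementation: `daily_cached`, `fake_download` and the file naming scheme are hypothetical names chosen for illustration, and the download from a distant bucket is replaced by a stand-in function.

```python
import datetime as dt
import tempfile
from pathlib import Path
from typing import Callable, Optional


def daily_cached(
    key: str,
    download: Callable[[], bytes],
    cache_dir: Path,
    today: Optional[dt.date] = None,
) -> bytes:
    """Return the bytes for `key`, calling `download` at most once per day."""
    today = today or dt.date.today()
    # Hypothetical naming scheme: one local file per (day, key) pair.
    safe_name = key.replace("/", "_")
    path = cache_dir / f"{today.isoformat()}_{safe_name}"
    if path.exists():
        # Cache hit: the file was already fetched today.
        return path.read_bytes()
    # Cache miss: fetch once, then store the result locally.
    data = download()
    cache_dir.mkdir(parents=True, exist_ok=True)
    path.write_bytes(data)
    return data


# Stand-in for a slow download from a distant bucket (assumption).
calls = []


def fake_download() -> bytes:
    calls.append(1)
    return b"col_a,col_b\n1,2\n"


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp)
    first = daily_cached("sales.csv", fake_download, cache)
    second = daily_cached("sales.csv", fake_download, cache)  # served from disk
```

The second call reads the local file instead of downloading again. Because the date is part of the file name, the first call on the next day misses the cache and refreshes the data; cleaning up old files is left out of this sketch.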
See you next week ❤️