OpenAI o3 Breakthrough High Score on ARC-AGI-Pub

This is an article that explores the capabilities of OpenAI’s o3 model.

AI reasoning is getting closer and closer to surpassing human capabilities.
But running o3 to solve a single problem is extremely expensive.

Effectively, o3 represents a form of deep learning-guided program search. The model does test-time search over a space of “programs” (in this case, natural language programs – the space of CoTs that describe the steps to solve the task at hand), guided by a deep learning prior (the base LLM). The reason why solving a single ARC-AGI task can end up taking up tens of millions of tokens and cost thousands of dollars is because this search process has to explore an enormous number of paths through program space – including backtracking.

(The term “programs” is used here in the sense of “My mental model for LLMs is that they work as a repository of vector programs”.)

As far as I understand this and its implications, it appears to me that this process is more like a directed brute force that generates many candidates and selects the best one.
There is a thread on HN that discusses this topic.
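
To make the "directed brute force" reading concrete, here is a toy sketch in Python. It is only an illustration under heavy assumptions, not how o3 works internally: the primitive steps, the `PRIOR` weights, and the `guided_search` function are all made up for this example. Real candidates would be natural language CoTs and the prior would be the base LLM. The sketch explores candidate programs in order of how likely the prior thinks they are, checks each one against the example pairs, and keeps the best.

```python
import heapq
import math

# Hypothetical primitive steps. A real system would search over
# natural-language chains of thought, not four arithmetic operations.
PRIMITIVES = {
    "inc": lambda x: x + 1,
    "double": lambda x: x * 2,
    "square": lambda x: x * x,
    "negate": lambda x: -x,
}

# Stand-in for the deep learning prior: a fixed, made-up probability per step.
PRIOR = {"inc": 0.4, "double": 0.3, "square": 0.2, "negate": 0.1}


def run(program, x):
    """Apply each step of the program to x, in order."""
    for step in program:
        x = PRIMITIVES[step](x)
    return x


def score(program, examples):
    """Fraction of (input, output) pairs the program reproduces."""
    return sum(run(program, i) == o for i, o in examples) / len(examples)


def guided_search(examples, max_len=4, budget=10_000):
    """Best-first search: always expand the most probable program so far."""
    frontier = [(0.0, ())]  # entries are (negative log prior, program)
    best, best_score, explored = (), score((), examples), 0
    while frontier and explored < budget:
        cost, program = heapq.heappop(frontier)
        explored += 1
        s = score(program, examples)
        if s > best_score:
            best, best_score = program, s
        if s == 1.0:
            break  # every example is solved, stop searching
        if len(program) < max_len:
            # Extend the program by one step; less likely steps cost more.
            for step, p in PRIOR.items():
                heapq.heappush(frontier, (cost - math.log(p), program + (step,)))
    return best, best_score, explored


if __name__ == "__main__":
    examples = [(1, 4), (2, 6), (3, 8)]  # hidden rule: (x + 1) * 2
    print(guided_search(examples))
```

Even in this tiny setting the number of candidates grows exponentially with program length, which gives a feel for why the real search can burn through millions of tokens. The priority queue also gives a simple form of backtracking: when one branch stops looking promising, the search falls back to the next most probable candidate.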

GPT-4 Summary

OpenAI’s new o3 system has achieved a groundbreaking 75.7% on the ARC-AGI-Pub Semi-Private Evaluation within a $10k compute limit, surpassing previous models. A high-compute version scored 87.5%, marking a major step in AI’s adaptability to novel tasks. This success highlights a shift from scaling existing architectures to innovative mechanisms for generalization and test-time knowledge recombination.

Key achievements:

ARC-AGI performance: o3 shows unprecedented adaptability, unlike previous GPT-family models, which struggled with novel tasks.
Efficiency challenges: The low-compute configuration costs $17–$20 per task, but the high-compute configuration remains expensive and exploratory.
Core innovation: o3 employs natural language program search, generating and executing task-specific solutions during runtime, guided by deep learning priors.

Despite its advancements, o3 is not AGI, as it still fails simple tasks. Upcoming benchmarks, such as ARC-AGI-2 in 2025, aim to further challenge AI systems while encouraging open-source progress.

This breakthrough underscores a qualitative leap in AI research, reshaping paths toward AGI and sparking broader scientific engagement.