<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Transformer on Mia Heidenstedt</title><link>https://heidenstedt.org/tags/transformer/</link><description>Recent content in Transformer on Mia Heidenstedt</description><generator>
Hugo</generator><language>en</language><lastBuildDate>Thu, 16 Apr 2026 07:56:47 +0000</lastBuildDate><atom:link href="https://heidenstedt.org/tags/transformer/index.xml" rel="self" type="application/rss+xml"/><item><title>OpenAI o3 Breakthrough High Score on ARC-AGI-Pub</title><link>https://heidenstedt.org/links/oai-o3-pub-breakthrough/</link><pubDate>Sat, 21 Dec 2024 14:51:53 +0000</pubDate><guid>https://heidenstedt.org/links/oai-o3-pub-breakthrough/</guid><description><![CDATA[<p>
      <em>Best viewed on the <a href="https://heidenstedt.org/links/oai-o3-pub-breakthrough/">original page</a>, where extended functionality like the
    footnote helper is available.</em>
    </p><p><a href="https://arcprize.org/blog/oai-o3-pub-breakthrough">This</a> is an article that explores the capabilities of OpenAI&rsquo;s o3 model. <a href="https://news.ycombinator.com/item?id=42473321">HN</a></p>
<p>AI reasoning is getting closer and closer to surpassing human capabilities.<br>
But running it to solve a problem is extremely expensive.</p>
<blockquote>
<p>Effectively, o3 represents a form of deep learning-guided program search. The model does test-time search over a space of &ldquo;programs&rdquo; (in this case, natural language programs – the space of CoTs that describe the steps to solve the task at hand), guided by a deep learning prior (the base LLM). The reason why solving a single ARC-AGI task can end up taking up tens of millions of tokens and cost thousands of dollars is because this search process has to explore an enormous number of paths through program space – including backtracking.</p>
</blockquote>
<p>(&ldquo;Programs&rdquo; is used here in the sense of &ldquo;My mental model for LLMs is that they work as a repository of vector programs&rdquo;.)</p>
<p>As far as I understand this and its implications, it appears to me that this process is more like a directed brute force that generates many candidates and selects the best one.<br>
There is a <a href="https://news.ycombinator.com/item?id=42479422">thread</a> on HN that discusses this topic.</p>
<h2 id="gpt-4-summary"><a href="https://heidenstedt.org/links/oai-o3-pub-breakthrough/#gpt-4-summary">GPT-4 Summary</a></h2><p>OpenAI&rsquo;s new <strong>o3 system</strong> has achieved a groundbreaking <strong>75.7% on the ARC-AGI-Pub Semi-Private Evaluation</strong> within a $10k compute limit, surpassing previous models. A high-compute version scored <strong>87.5%</strong>, marking a major step in AI&rsquo;s adaptability to novel tasks. This success highlights a shift from scaling existing architectures to innovative mechanisms for generalization and test-time knowledge recombination.</p>
<p>Key achievements:</p>
<ul>
<li><strong>ARC-AGI performance</strong>: o3 shows unprecedented adaptability, unlike previous GPT-family models, which struggled with novel tasks.</li>
<li><strong>Efficiency challenges</strong>: The low-compute configuration costs $17–$20 per task, while the high-compute configuration remains expensive and exploratory.</li>
<li><strong>Core innovation</strong>: o3 employs <strong>natural language program search</strong>, generating and executing task-specific solutions during runtime, guided by deep learning priors.</li>
</ul>
<p>Despite its advancements, o3 is not AGI, as it still fails simple tasks. Upcoming benchmarks, such as <strong>ARC-AGI-2 in 2025</strong>, aim to further challenge AI systems while encouraging open-source progress.</p>
<p>This breakthrough underscores a qualitative leap in AI research, reshaping paths toward AGI and sparking broader scientific engagement.</p>
]]></description><content:encoded><![CDATA[<p>
      <em>Best viewed on the <a href="https://heidenstedt.org/links/oai-o3-pub-breakthrough/">original page</a>, where extended functionality like the
    footnote helper is available.</em>
    </p><p><a href="https://arcprize.org/blog/oai-o3-pub-breakthrough">This</a> is an article that explores the capabilities of OpenAI&rsquo;s o3 model. <a href="https://news.ycombinator.com/item?id=42473321">HN</a></p>
<p>AI reasoning is getting closer and closer to surpassing human capabilities.<br>
But running it to solve a problem is extremely expensive.</p>
<blockquote>
<p>Effectively, o3 represents a form of deep learning-guided program search. The model does test-time search over a space of &ldquo;programs&rdquo; (in this case, natural language programs – the space of CoTs that describe the steps to solve the task at hand), guided by a deep learning prior (the base LLM). The reason why solving a single ARC-AGI task can end up taking up tens of millions of tokens and cost thousands of dollars is because this search process has to explore an enormous number of paths through program space – including backtracking.</p>
</blockquote>
<p>(&ldquo;Programs&rdquo; is used here in the sense of &ldquo;My mental model for LLMs is that they work as a repository of vector programs&rdquo;.)</p>
<p>As far as I understand this and its implications, it appears to me that this process is more like a directed brute force that generates many candidates and selects the best one.<br>
There is a <a href="https://news.ycombinator.com/item?id=42479422">thread</a> on HN that discusses this topic.</p>
<h2 id="gpt-4-summary"><a href="https://heidenstedt.org/links/oai-o3-pub-breakthrough/#gpt-4-summary">GPT-4 Summary</a></h2><p>OpenAI&rsquo;s new <strong>o3 system</strong> has achieved a groundbreaking <strong>75.7% on the ARC-AGI-Pub Semi-Private Evaluation</strong> within a $10k compute limit, surpassing previous models. A high-compute version scored <strong>87.5%</strong>, marking a major step in AI&rsquo;s adaptability to novel tasks. This success highlights a shift from scaling existing architectures to innovative mechanisms for generalization and test-time knowledge recombination.</p>
<p>Key achievements:</p>
<ul>
<li><strong>ARC-AGI performance</strong>: o3 shows unprecedented adaptability, unlike previous GPT-family models, which struggled with novel tasks.</li>
<li><strong>Efficiency challenges</strong>: The low-compute configuration costs $17–$20 per task, while the high-compute configuration remains expensive and exploratory.</li>
<li><strong>Core innovation</strong>: o3 employs <strong>natural language program search</strong>, generating and executing task-specific solutions during runtime, guided by deep learning priors.</li>
</ul>
<p>Despite its advancements, o3 is not AGI, as it still fails simple tasks. Upcoming benchmarks, such as <strong>ARC-AGI-2 in 2025</strong>, aim to further challenge AI systems while encouraging open-source progress.</p>
<p>This breakthrough underscores a qualitative leap in AI research, reshaping paths toward AGI and sparking broader scientific engagement.</p>
]]></content:encoded></item><item><title>An Evolved Universal Transformer Memory</title><link>https://heidenstedt.org/links/an-evolved-universal-transformer-memory/</link><pubDate>Tue, 17 Dec 2024 10:51:53 +0000</pubDate><guid>https://heidenstedt.org/links/an-evolved-universal-transformer-memory/</guid><description><![CDATA[<p>
      <em>Best viewed on the <a href="https://heidenstedt.org/links/an-evolved-universal-transformer-memory/">original page</a>, where extended functionality like the
    footnote helper is available.</em>
    </p><p>On <a href="https://news.ycombinator.com/item?id=42411409">HN</a> I stumbled over <a href="https://sakana.ai/namm/">this</a> very interesting article about a new kind of context memory system that is able to remove &ldquo;unhelpful or redundant details&rdquo;.</p>
<p>Thinking further, I think this would be super helpful for semantic search, which currently does not perform very well because it lacks filters that extract importance. Until now I have tried to counter this problem via summarization through small LLMs, but, as one might guess, that turns out to be not very precise and super expensive. There are other ways one could post-process text with LLMs, but they are not very efficient either.</p>
<p>Paper <a href="https://arxiv.org/abs/2410.13166">https://arxiv.org/abs/2410.13166</a></p>
<h2 id="tldr-by-gpt-4o"><a href="https://heidenstedt.org/links/an-evolved-universal-transformer-memory/#tldr-by-gpt-4o">TLDR by GPT-4o</a></h2><blockquote>
<p>(the article is very good, you might prefer to read it over this TLDR, but here it is anyway)</p>
</blockquote>
<p>Sakana AI introduces <strong>Neural Attention Memory Models (NAMMs)</strong>, a novel memory system for transformers inspired by human selective memory. NAMMs optimize how transformers store and retrieve information, enabling them to <strong>“remember” important tokens and “forget” redundant ones</strong>, significantly improving efficiency and performance, particularly for long-context tasks.</p>
<h3 id="key-technical-highlights"><a href="https://heidenstedt.org/links/an-evolved-universal-transformer-memory/#key-technical-highlights">Key Technical Highlights:</a></h3><ol>
<li><strong>Evolutionary Optimization</strong>: NAMMs use evolutionary algorithms to train a neural classifier that decides which tokens to keep or discard, bypassing non-differentiable challenges.</li>
<li><strong>Execution Steps</strong>:
<ul>
<li>Convert attention sequences into <strong>spectrograms</strong>.</li>
<li>Compress data using an <strong>exponential moving average (EMA)</strong>.</li>
<li>Use a classifier to score and selectively <strong>prune tokens</strong>.</li>
</ul>
</li>
<li><strong>Generalization</strong>:
<ul>
<li>NAMMs can <strong>zero-shot transfer</strong> to other transformers (e.g., vision, reinforcement learning) without retraining.</li>
<li>They adapt to tasks differently—retaining global information in early layers and focusing on local details in later ones.</li>
</ul>
</li>
<li><strong>Performance</strong>: Tested on <strong>LongBench, InfiniteBench</strong>, and their Japanese benchmark <strong>ChouBun</strong>, NAMMs reduce memory usage and outperform prior hand-designed memory strategies (H₂O, L₂).</li>
</ol>
<p>NAMMs demonstrate <strong>cross-domain mastery</strong> and efficiency gains across diverse tasks and input modalities, paving the way for future research into learning transformers directly atop evolved memory systems.</p>
]]></description><content:encoded><![CDATA[<p>
      <em>Best viewed on the <a href="https://heidenstedt.org/links/an-evolved-universal-transformer-memory/">original page</a>, where extended functionality like the
    footnote helper is available.</em>
    </p><p>On <a href="https://news.ycombinator.com/item?id=42411409">HN</a> I stumbled over <a href="https://sakana.ai/namm/">this</a> very interesting article about a new kind of context memory system that is able to remove &ldquo;unhelpful or redundant details&rdquo;.</p>
<p>Thinking further, I think this would be super helpful for semantic search, which currently does not perform very well because it lacks filters that extract importance. Until now I have tried to counter this problem via summarization through small LLMs, but, as one might guess, that turns out to be not very precise and super expensive. There are other ways one could post-process text with LLMs, but they are not very efficient either.</p>
<p>Paper <a href="https://arxiv.org/abs/2410.13166">https://arxiv.org/abs/2410.13166</a></p>
<h2 id="tldr-by-gpt-4o"><a href="https://heidenstedt.org/links/an-evolved-universal-transformer-memory/#tldr-by-gpt-4o">TLDR by GPT-4o</a></h2><blockquote>
<p>(the article is very good, you might prefer to read it over this TLDR, but here it is anyway)</p>
</blockquote>
<p>Sakana AI introduces <strong>Neural Attention Memory Models (NAMMs)</strong>, a novel memory system for transformers inspired by human selective memory. NAMMs optimize how transformers store and retrieve information, enabling them to <strong>“remember” important tokens and “forget” redundant ones</strong>, significantly improving efficiency and performance, particularly for long-context tasks.</p>
<h3 id="key-technical-highlights"><a href="https://heidenstedt.org/links/an-evolved-universal-transformer-memory/#key-technical-highlights">Key Technical Highlights:</a></h3><ol>
<li><strong>Evolutionary Optimization</strong>: NAMMs use evolutionary algorithms to train a neural classifier that decides which tokens to keep or discard, bypassing non-differentiable challenges.</li>
<li><strong>Execution Steps</strong>:
<ul>
<li>Convert attention sequences into <strong>spectrograms</strong>.</li>
<li>Compress data using an <strong>exponential moving average (EMA)</strong>.</li>
<li>Use a classifier to score and selectively <strong>prune tokens</strong>.</li>
</ul>
</li>
<li><strong>Generalization</strong>:
<ul>
<li>NAMMs can <strong>zero-shot transfer</strong> to other transformers (e.g., vision, reinforcement learning) without retraining.</li>
<li>They adapt to tasks differently—retaining global information in early layers and focusing on local details in later ones.</li>
</ul>
</li>
<li><strong>Performance</strong>: Tested on <strong>LongBench, InfiniteBench</strong>, and their Japanese benchmark <strong>ChouBun</strong>, NAMMs reduce memory usage and outperform prior hand-designed memory strategies (H₂O, L₂).</li>
</ol>
<p>NAMMs demonstrate <strong>cross-domain mastery</strong> and efficiency gains across diverse tasks and input modalities, paving the way for future research into learning transformers directly atop evolved memory systems.</p>
]]></content:encoded></item></channel></rss>