<?xml version="1.0" encoding="utf-8" standalone="yes"?><rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:content="http://purl.org/rss/1.0/modules/content/"><channel><title>Benchmark on Mia Heidenstedt</title><link>https://heidenstedt.org/tags/benchmark/</link><description>Recent
content
in Benchmark on Mia Heidenstedt</description><generator>
Hugo</generator><language>en</language><lastBuildDate>Thu, 16 Apr 2026 08:00:02 +0000</lastBuildDate><atom:link href="https://heidenstedt.org/tags/benchmark/index.xml" rel="self" type="application/rss+xml"/><item><title>Hyper Text Compression: Shrinking Wikipedia to 10.7% of its Size</title><link>https://heidenstedt.org/links/ai-powered-text-compression-shrinking-wikipedia-to-107-of-its-size/</link><pubDate>Mon, 11 Aug 2025 12:23:19 +0000</pubDate><guid>https://heidenstedt.org/links/ai-powered-text-compression-shrinking-wikipedia-to-107-of-its-size/</guid><description><![CDATA[<p>
      <em>Best viewed on the <a href="https://heidenstedt.org/links/ai-powered-text-compression-shrinking-wikipedia-to-107-of-its-size/">original page</a>, where extended functionality like the
    footnote helper is available.</em>
    </p><p>This is a super cool leaderboard for lossless text compression via NLP (and yes, that includes AI)! The top solution manages to compress the first GB of the English Wikipedia to a whopping 10.7% of its original size, including the compression program itself!</p>
<p><a href="https://www.mattmahoney.net/dc/text.html">Hyper Text Compression: Shrinking Wikipedia to 10.7% of its Size!</a></p>
<h2 id="automatic-tldr-by-gemini-25-pro"><a href="https://heidenstedt.org/links/ai-powered-text-compression-shrinking-wikipedia-to-107-of-its-size/#automatic-tldr-by-gemini-25-pro">Automatic TLDR by Gemini 2.5 Pro:</a></h2><p>This page describes the <strong>Large Text Compression Benchmark</strong>, an open competition that ranks lossless data compression programs. The primary goal is to encourage research in artificial intelligence (AI) and natural language processing (NLP) by treating text compression as a language modeling problem.</p>
<hr>
<h3 id="benchmark-overview"><a href="https://heidenstedt.org/links/ai-powered-text-compression-shrinking-wikipedia-to-107-of-its-size/#benchmark-overview">Benchmark Overview</a></h3><ul>
<li><strong>Test Data</strong>: The benchmark uses the first $10^9$ bytes (1 GB) of an English Wikipedia XML dump from March 3, 2006, known as <code>enwik9</code>.</li>
<li><strong>Ranking Metric</strong>: Programs are ranked solely by the <strong>total size</strong>, which is the sum of the compressed <code>enwik9</code> file size and the size of the zipped decompressor program. A smaller total size is better (a worked example follows this list).</li>
<li><strong>Secondary Information</strong>: Data such as compression/decompression speed and memory usage are provided for informational purposes but do not influence the rankings.</li>
<li><strong>Goal</strong>: The benchmark&rsquo;s main purpose is not to find the best general-purpose compressor but to push the boundaries of data modeling, a fundamental challenge in both AI and compression.</li>
</ul>
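<p>As a quick worked example of this metric: the top entry&rsquo;s 10.7% means the compressed file and the zipped decompressor together occupy about $0.107 \times 10^9 \approx 1.07 \times 10^8$ bytes, roughly 107 MB out of the original 1 GB.</p>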
<hr>
<h3 id="key-findings-and-algorithms"><a href="https://heidenstedt.org/links/ai-powered-text-compression-shrinking-wikipedia-to-107-of-its-size/#key-findings-and-algorithms">Key Findings and Algorithms</a></h3><p>The results table shows a wide variety of compression programs, ranked from the best compression ratio to the worst. A clear trend emerges from the top-performing entries:</p>
<ul>
<li><strong>Dominance of AI Models</strong>: The highest-ranking compressors, such as <strong>nncp</strong> and <strong>cmix</strong>, utilize sophisticated AI-based algorithms. These include neural network models like <strong>Transformers (Tr)</strong> and <strong>Long Short-Term Memory (LSTM)</strong>, as well as advanced <strong>Context Mixing (CM)</strong> techniques. These methods excel at modeling the complex patterns in natural language text, resulting in superior compression ratios (a rough sketch of the mixing idea follows this list).</li>
<li><strong>Trade-offs</strong>: There is a significant trade-off between compression ratio, speed, and memory. The top-ranked AI-driven compressors are extremely slow and require vast amounts of memory (often many gigabytes) and, in some cases, specialized hardware like GPUs.</li>
<li><strong>Traditional Algorithms</strong>: More conventional algorithms like <strong>Lempel-Ziv (LZ)</strong>, <strong>Burrows-Wheeler Transform (BWT)</strong>, and <strong>Prediction by Partial Match (PPM)</strong> are found further down the list. While they are generally much faster and use less memory, they cannot achieve the same level of compression as the leading AI models on this specific text-based task.</li>
</ul>
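<p>To make the <strong>Context Mixing (CM)</strong> idea concrete: several models each predict the probability of the next bit, and a mixer combines those predictions in the logistic domain, weighting each model by how well it has been predicting recently. The following Go sketch is a minimal, hedged illustration of that scheme (PAQ-style logistic mixing), not the code of any particular compressor:</p>
<pre><code class="language-go">package cmix

import "math"

func squash(x float64) float64 { return 1.0 / (1.0 + math.Exp(-x)) } // logit to probability
func stretch(p float64) float64 { return math.Log(p / (1.0 - p)) }   // probability to logit

// mix combines per-model bit probabilities using learned weights:
// p = squash(sum_i w_i * stretch(p_i)).
func mix(probs, weights []float64) float64 {
	var sum float64
	for i, p := range probs {
		sum += weights[i] * stretch(p)
	}
	return squash(sum)
}

// update nudges the weights toward the models that predicted the
// observed bit well (a simple online gradient step).
func update(probs, weights []float64, mixed float64, bit int, rate float64) {
	err := float64(bit) - mixed
	for i, p := range probs {
		weights[i] += rate * err * stretch(p)
	}
}
</code></pre>
<p>The mixed probability is then fed into an arithmetic coder; the better the prediction, the fewer bits the coder emits.</p>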
<hr>
<h3 id="hutter-prize"><a href="https://heidenstedt.org/links/ai-powered-text-compression-shrinking-wikipedia-to-107-of-its-size/#hutter-prize">Hutter Prize</a></h3><p>The benchmark is closely related to the <strong>Hutter Prize</strong>, which offers prize money for open-source compression improvements on a smaller subset of the data (<code>enwik8</code>, the first $10^8$ bytes). This prize has specific hardware and time constraints, encouraging practical advancements in the field.</p>
]]></description><content:encoded><![CDATA[<p>
      <em>Best viewed on the <a href="https://heidenstedt.org/links/ai-powered-text-compression-shrinking-wikipedia-to-107-of-its-size/">original page</a>, where extended functionality like the
    footnote helper is available.</em>
    </p><p>This is a super cool leaderboard for lossless text compression via NLP (and yes, that includes AI)! The top solution manages to compress the first GB of the English Wikipedia to a whopping 10.7% of its original size, including the compression program itself!</p>
<p><a href="https://www.mattmahoney.net/dc/text.html">Hyper Text Compression: Shrinking Wikipedia to 10.7% of its Size!</a></p>
<h2 id="automatic-tldr-by-gemini-25-pro"><a href="https://heidenstedt.org/links/ai-powered-text-compression-shrinking-wikipedia-to-107-of-its-size/#automatic-tldr-by-gemini-25-pro">Automatic TLDR by Gemini 2.5 Pro:</a></h2><p>This page describes the <strong>Large Text Compression Benchmark</strong>, an open competition that ranks lossless data compression programs. The primary goal is to encourage research in artificial intelligence (AI) and natural language processing (NLP) by treating text compression as a language modeling problem.</p>
<hr>
<h3 id="benchmark-overview"><a href="https://heidenstedt.org/links/ai-powered-text-compression-shrinking-wikipedia-to-107-of-its-size/#benchmark-overview">Benchmark Overview</a></h3><ul>
<li><strong>Test Data</strong>: The benchmark uses the first $10^9$ bytes (1 GB) of an English Wikipedia XML dump from March 3, 2006, known as <code>enwik9</code>.</li>
<li><strong>Ranking Metric</strong>: Programs are ranked solely by the <strong>total size</strong>, which is the sum of the compressed <code>enwik9</code> file size and the size of the zipped decompressor program. A smaller total size is better (a worked example follows this list).</li>
<li><strong>Secondary Information</strong>: Data such as compression/decompression speed and memory usage are provided for informational purposes but do not influence the rankings.</li>
<li><strong>Goal</strong>: The benchmark&rsquo;s main purpose is not to find the best general-purpose compressor but to push the boundaries of data modeling, a fundamental challenge in both AI and compression.</li>
</ul>
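<p>As a quick worked example of this metric: the top entry&rsquo;s 10.7% means the compressed file and the zipped decompressor together occupy about $0.107 \times 10^9 \approx 1.07 \times 10^8$ bytes, roughly 107 MB out of the original 1 GB.</p>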
<hr>
<h3 id="key-findings-and-algorithms"><a href="https://heidenstedt.org/links/ai-powered-text-compression-shrinking-wikipedia-to-107-of-its-size/#key-findings-and-algorithms">Key Findings and Algorithms</a></h3><p>The results table shows a wide variety of compression programs, ranked from the best compression ratio to the worst. A clear trend emerges from the top-performing entries:</p>
<ul>
<li><strong>Dominance of AI Models</strong>: The highest-ranking compressors, such as <strong>nncp</strong> and <strong>cmix</strong>, utilize sophisticated AI-based algorithms. These include neural network models like <strong>Transformers (Tr)</strong> and <strong>Long Short-Term Memory (LSTM)</strong>, as well as advanced <strong>Context Mixing (CM)</strong> techniques. These methods excel at modeling the complex patterns in natural language text, resulting in superior compression ratios (a rough sketch of the mixing idea follows this list).</li>
<li><strong>Trade-offs</strong>: There is a significant trade-off between compression ratio, speed, and memory. The top-ranked AI-driven compressors are extremely slow and require vast amounts of memory (often many gigabytes) and, in some cases, specialized hardware like GPUs.</li>
<li><strong>Traditional Algorithms</strong>: More conventional algorithms like <strong>Lempel-Ziv (LZ)</strong>, <strong>Burrows-Wheeler Transform (BWT)</strong>, and <strong>Prediction by Partial Match (PPM)</strong> are found further down the list. While they are generally much faster and use less memory, they cannot achieve the same level of compression as the leading AI models on this specific text-based task.</li>
</ul>
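<p>To make the <strong>Context Mixing (CM)</strong> idea concrete: several models each predict the probability of the next bit, and a mixer combines those predictions in the logistic domain, weighting each model by how well it has been predicting recently. The following Go sketch is a minimal, hedged illustration of that scheme (PAQ-style logistic mixing), not the code of any particular compressor:</p>
<pre><code class="language-go">package cmix

import "math"

func squash(x float64) float64 { return 1.0 / (1.0 + math.Exp(-x)) } // logit to probability
func stretch(p float64) float64 { return math.Log(p / (1.0 - p)) }   // probability to logit

// mix combines per-model bit probabilities using learned weights:
// p = squash(sum_i w_i * stretch(p_i)).
func mix(probs, weights []float64) float64 {
	var sum float64
	for i, p := range probs {
		sum += weights[i] * stretch(p)
	}
	return squash(sum)
}

// update nudges the weights toward the models that predicted the
// observed bit well (a simple online gradient step).
func update(probs, weights []float64, mixed float64, bit int, rate float64) {
	err := float64(bit) - mixed
	for i, p := range probs {
		weights[i] += rate * err * stretch(p)
	}
}
</code></pre>
<p>The mixed probability is then fed into an arithmetic coder; the better the prediction, the fewer bits the coder emits.</p>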
<hr>
<h3 id="hutter-prize"><a href="https://heidenstedt.org/links/ai-powered-text-compression-shrinking-wikipedia-to-107-of-its-size/#hutter-prize">Hutter Prize</a></h3><p>The benchmark is closely related to the <strong>Hutter Prize</strong>, which offers prize money for open-source compression improvements on a smaller subset of the data (<code>enwik8</code>, the first $10^8$ bytes). This prize has specific hardware and time constraints, encouraging practical advancements in the field.</p>
]]></content:encoded></item><item><title>Releasing: GoQueueBench</title><link>https://heidenstedt.org/posts/2025/releasing-goqueuebench/</link><pubDate>Tue, 25 Mar 2025 15:44:38 +0000</pubDate><guid>https://heidenstedt.org/posts/2025/releasing-goqueuebench/</guid><description><![CDATA[<p>
      <em>Best viewed on the <a href="https://heidenstedt.org/posts/2025/releasing-goqueuebench/">original page</a>, where extended functionality like the
    footnote helper is available.</em>
    </p><p>While working on <a href="https://github.com/i5heu/ouroboros-db">OuroborosDB</a>, I noticed that I needed a very fast queue for a rather unique architectural design decision.<br>
I am trying to build the network module in such a way that I can test its behavior completely deterministically while &ldquo;simulating&rdquo; entire clusters in a single process.</p>
<p>So I built a test prototype of my global network queue with Go&rsquo;s channels and noticed that it was a major performance bottleneck. After writing two different ring buffer queue implementations, it became clear that some queues behave completely differently under different congestion levels and core counts - some so unpredictably that I just did not want to use them in my project.</p>
<p>This prompted me to take a relatively large chunk out of my free time and write a suite that benchmarks the different queue implementations I built under different conditions and scores them based on their performance and predictability.</p>
<p>The result of this work is <a href="https://github.com/i5heu/GoQueueBench">GoQueueBench</a>.</p>
<p>These are the results of the benchmark suite:</p>
<table>
  <thead>
      <tr>
          <th>Implementation</th>
          <th>Overall Score</th>
          <th>Throughput Light Load</th>
          <th>Throughput Heavy Load</th>
          <th>Throughput Average</th>
          <th>Stability Ratio</th>
          <th>Homogeneity Factor</th>
          <th>Uncertainty</th>
          <th>Total Tests</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>VortexQueue</td>
          <td><strong>11341466</strong></td>
          <td>6926449</td>
          <td><strong>5502925</strong></td>
          <td><strong>8776309</strong></td>
          <td><strong>1.15</strong></td>
          <td>0.87</td>
          <td><strong>0.25</strong></td>
          <td>681</td>
      </tr>
      <tr>
          <td>LightningQueue</td>
          <td>9631771</td>
          <td>6638213</td>
          <td>4627690</td>
          <td>6036728</td>
          <td>0.99</td>
          <td><strong>0.95</strong></td>
          <td>0.31</td>
          <td>681</td>
      </tr>
      <tr>
          <td>FastMPMCQueue</td>
          <td>9384067</td>
          <td>6870924</td>
          <td>4598620</td>
          <td>6070151</td>
          <td>0.96</td>
          <td>0.93</td>
          <td>0.28</td>
          <td>681</td>
      </tr>
      <tr>
          <td>OptimizedMPMCQueue</td>
          <td>9105262</td>
          <td>6436385</td>
          <td>4379823</td>
          <td>5838555</td>
          <td>0.97</td>
          <td>0.94</td>
          <td>0.32</td>
          <td>681</td>
      </tr>
      <tr>
          <td>OptimizedMPMCQueueSharded</td>
          <td>8130197</td>
          <td>6369891</td>
          <td>3834140</td>
          <td>6781865</td>
          <td>0.84</td>
          <td>0.88</td>
          <td>0.39</td>
          <td>681</td>
      </tr>
      <tr>
          <td>MultiHeadQueue</td>
          <td>7391203</td>
          <td>4363332</td>
          <td>3492068</td>
          <td>5558849</td>
          <td>1.12</td>
          <td>0.91</td>
          <td>0.36</td>
          <td>681</td>
      </tr>
      <tr>
          <td>BasicMPMCQueue</td>
          <td>5599252</td>
          <td>4370889</td>
          <td>2669612</td>
          <td>3667715</td>
          <td>0.89</td>
          <td>0.93</td>
          <td>0.30</td>
          <td>681</td>
      </tr>
      <tr>
          <td>Golang Buffered Channel</td>
          <td>5312485</td>
          <td>6667828</td>
          <td>2760985</td>
          <td>4312720</td>
          <td>0.54</td>
          <td>0.82</td>
          <td>0.66</td>
          <td>681</td>
      </tr>
      <tr>
          <td>FastMPMCQueueTicket</td>
          <td>3229780</td>
          <td><strong>7705164</strong></td>
          <td>1203924</td>
          <td>5803821</td>
          <td>0.21</td>
          <td>0.64</td>
          <td>1.19</td>
          <td>681</td>
      </tr>
  </tbody>
</table>
<p>Please note that I built the package so that all queues adhere to the same interface and can be swapped out easily; a minimal sketch of what such an interface can look like follows below.</p>
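<p>As a hedged illustration (the names <code>Queue</code>, <code>Enqueue</code>, <code>Dequeue</code>, and <code>drain</code> are assumptions for this sketch, not the real GoQueueBench identifiers), the shared contract in Go could look like this:</p>
<pre><code class="language-go">package queues

// Queue is an illustrative shared contract; the real GoQueueBench
// interface may differ in names and signatures.
type Queue interface {
	Enqueue(item any) bool // false when the queue is full
	Dequeue() (any, bool)  // false when the queue is empty
}

// drain shows why the shared interface matters: benchmark code like
// this depends only on Queue, so VortexQueue, LightningQueue, or a
// plain channel wrapper can be swapped in without further changes.
func drain(q Queue) int {
	n := 0
	for {
		if _, ok := q.Dequeue(); !ok {
			return n
		}
		n++
	}
}
</code></pre>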
]]></description><content:encoded><![CDATA[<p>
      <em>Best viewed on the <a href="https://heidenstedt.org/posts/2025/releasing-goqueuebench/">original page</a>, where extended functionality like the
    footnote helper is available.</em>
    </p><p>While working on <a href="https://github.com/i5heu/ouroboros-db">OuroborosDB</a>, I noticed that I needed a very fast queue for a rather unique architectural design decision.<br>
I am trying to build the network module in such a way that I can test its behavior completely deterministically while &ldquo;simulating&rdquo; entire clusters in a single process.</p>
<p>So I built a test prototype of my global network queue with Go&rsquo;s channels and noticed that it was a major performance bottleneck. After writing two different ring buffer queue implementations, it became clear that some queues behave completely differently under different congestion levels and core counts - some so unpredictably that I just did not want to use them in my project.</p>
<p>This prompted me to take a relatively large chunk out of my free time and write a suite that benchmarks the different queue implementations I built under different conditions and scores them based on their performance and predictability.</p>
<p>The result of this work is <a href="https://github.com/i5heu/GoQueueBench">GoQueueBench</a>.</p>
<p>These are the results of the benchmark suite:</p>
<table>
  <thead>
      <tr>
          <th>Implementation</th>
          <th>Overall Score</th>
          <th>Throughput Light Load</th>
          <th>Throughput Heavy Load</th>
          <th>Throughput Average</th>
          <th>Stability Ratio</th>
          <th>Homogeneity Factor</th>
          <th>Uncertainty</th>
          <th>Total Tests</th>
      </tr>
  </thead>
  <tbody>
      <tr>
          <td>VortexQueue</td>
          <td><strong>11341466</strong></td>
          <td>6926449</td>
          <td><strong>5502925</strong></td>
          <td><strong>8776309</strong></td>
          <td><strong>1.15</strong></td>
          <td>0.87</td>
          <td><strong>0.25</strong></td>
          <td>681</td>
      </tr>
      <tr>
          <td>LightningQueue</td>
          <td>9631771</td>
          <td>6638213</td>
          <td>4627690</td>
          <td>6036728</td>
          <td>0.99</td>
          <td><strong>0.95</strong></td>
          <td>0.31</td>
          <td>681</td>
      </tr>
      <tr>
          <td>FastMPMCQueue</td>
          <td>9384067</td>
          <td>6870924</td>
          <td>4598620</td>
          <td>6070151</td>
          <td>0.96</td>
          <td>0.93</td>
          <td>0.28</td>
          <td>681</td>
      </tr>
      <tr>
          <td>OptimizedMPMCQueue</td>
          <td>9105262</td>
          <td>6436385</td>
          <td>4379823</td>
          <td>5838555</td>
          <td>0.97</td>
          <td>0.94</td>
          <td>0.32</td>
          <td>681</td>
      </tr>
      <tr>
          <td>OptimizedMPMCQueueSharded</td>
          <td>8130197</td>
          <td>6369891</td>
          <td>3834140</td>
          <td>6781865</td>
          <td>0.84</td>
          <td>0.88</td>
          <td>0.39</td>
          <td>681</td>
      </tr>
      <tr>
          <td>MultiHeadQueue</td>
          <td>7391203</td>
          <td>4363332</td>
          <td>3492068</td>
          <td>5558849</td>
          <td>1.12</td>
          <td>0.91</td>
          <td>0.36</td>
          <td>681</td>
      </tr>
      <tr>
          <td>BasicMPMCQueue</td>
          <td>5599252</td>
          <td>4370889</td>
          <td>2669612</td>
          <td>3667715</td>
          <td>0.89</td>
          <td>0.93</td>
          <td>0.30</td>
          <td>681</td>
      </tr>
      <tr>
          <td>Golang Buffered Channel</td>
          <td>5312485</td>
          <td>6667828</td>
          <td>2760985</td>
          <td>4312720</td>
          <td>0.54</td>
          <td>0.82</td>
          <td>0.66</td>
          <td>681</td>
      </tr>
      <tr>
          <td>FastMPMCQueueTicket</td>
          <td>3229780</td>
          <td><strong>7705164</strong></td>
          <td>1203924</td>
          <td>5803821</td>
          <td>0.21</td>
          <td>0.64</td>
          <td>1.19</td>
          <td>681</td>
      </tr>
  </tbody>
</table>
<p>Please note that I built the package so that all queues adhere to the same interface and can be swapped out easily; a minimal sketch of what such an interface can look like follows below.</p>
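<p>As a hedged illustration (the names <code>Queue</code>, <code>Enqueue</code>, <code>Dequeue</code>, and <code>drain</code> are assumptions for this sketch, not the real GoQueueBench identifiers), the shared contract in Go could look like this:</p>
<pre><code class="language-go">package queues

// Queue is an illustrative shared contract; the real GoQueueBench
// interface may differ in names and signatures.
type Queue interface {
	Enqueue(item any) bool // false when the queue is full
	Dequeue() (any, bool)  // false when the queue is empty
}

// drain shows why the shared interface matters: benchmark code like
// this depends only on Queue, so VortexQueue, LightningQueue, or a
// plain channel wrapper can be swapped in without further changes.
func drain(q Queue) int {
	n := 0
	for {
		if _, ok := q.Dequeue(); !ok {
			return n
		}
		n++
	}
}
</code></pre>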
]]></content:encoded></item><item><title>OpenAI o3 Breakthrough High Score on ARC-AGI-Pub</title><link>https://heidenstedt.org/links/oai-o3-pub-breakthrough/</link><pubDate>Sat, 21 Dec 2024 14:51:53 +0000</pubDate><guid>https://heidenstedt.org/links/oai-o3-pub-breakthrough/</guid><description><![CDATA[<p>
      <em>Best viewed on the <a href="https://heidenstedt.org/links/oai-o3-pub-breakthrough/">original page</a>, where extended functionality like the
    footnote helper is available.</em>
    </p><p><a href="https://arcprize.org/blog/oai-o3-pub-breakthrough">This</a> is an article that explores the capabilities of OpenAI&rsquo;s o3 model. <a href="https://news.ycombinator.com/item?id=42473321">HN</a></p>
<p>AI reasoning is getting closer and closer to surpassing human capabilities.<br>
But running it to solve a problem is extremely expensive.</p>
<blockquote>
<p>Effectively, o3 represents a form of deep learning-guided program search. The model does test-time search over a space of &ldquo;programs&rdquo; (in this case, natural language programs – the space of CoTs that describe the steps to solve the task at hand), guided by a deep learning prior (the base LLM). The reason why solving a single ARC-AGI task can end up taking up tens of millions of tokens and cost thousands of dollars is because this search process has to explore an enormous number of paths through program space – including backtracking.</p>
</blockquote>
<p>(&ldquo;programs&rdquo; is defined here by the author&rsquo;s remark: &ldquo;My mental model for LLMs is that they work as a repository of vector programs&rdquo;)</p>
<p>As far as I understand this and its implications, it appears to me that this process works more like a directed brute force that generates many candidates and selects the best one; a minimal sketch of that pattern follows below.<br>
There is a <a href="https://news.ycombinator.com/item?id=42479422">thread</a> on HN that discusses this topic.</p>
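<p>A minimal Go sketch of that &ldquo;generate many candidates, keep the best&rdquo; pattern (purely illustrative; <code>generate</code> and <code>score</code> are hypothetical stand-ins for the LLM sampler and a verifier, and this is not OpenAI&rsquo;s published mechanism):</p>
<pre><code class="language-go">package search

import "math"

// bestOfN samples n candidate "programs" (chains of thought rendered
// as strings) and keeps the one the scorer rates highest. The huge
// token counts and costs come from n being very large and each
// candidate being expensive to generate.
func bestOfN(n int, generate func() string, score func(string) float64) string {
	best := ""
	bestScore := math.Inf(-1)
	for range n { // Go 1.22+ integer range
		cand := generate()
		if s := score(cand); s > bestScore {
			best, bestScore = cand, s
		}
	}
	return best
}
</code></pre>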
<h2 id="gpt-4-summary"><a href="https://heidenstedt.org/links/oai-o3-pub-breakthrough/#gpt-4-summary">GPT-4 Summary</a></h2><p>OpenAI&rsquo;s new <strong>o3 system</strong> has achieved a groundbreaking <strong>75.7% on the ARC-AGI-Pub Semi-Private Evaluation</strong> within a $10k compute limit, surpassing previous models. A high-compute version scored <strong>87.5%</strong>, marking a major step in AI&rsquo;s adaptability to novel tasks. This success highlights a shift from scaling existing architectures to innovative mechanisms for generalization and test-time knowledge recombination.</p>
<p>Key achievements:</p>
<ul>
<li><strong>ARC-AGI performance</strong>: o3 shows unprecedented adaptability, unlike previous GPT-family models, which struggled with novel tasks.</li>
<li><strong>Efficiency challenges</strong>: The low-compute configuration costs $17–$20 per task, but the high-compute configuration remains expensive and exploratory.</li>
<li><strong>Core innovation</strong>: o3 employs <strong>natural language program search</strong>, generating and executing task-specific solutions during runtime, guided by deep learning priors.</li>
</ul>
<p>Despite its advancements, o3 is not AGI, as it still fails simple tasks. Upcoming benchmarks, such as <strong>ARC-AGI-2 in 2025</strong>, aim to further challenge AI systems while encouraging open-source progress.</p>
<p>This breakthrough underscores a qualitative leap in AI research, reshaping paths toward AGI and sparking broader scientific engagement.</p>
]]></description><content:encoded><![CDATA[<p>
      <em>Best viewed on the <a href="https://heidenstedt.org/links/oai-o3-pub-breakthrough/">original page</a>, where extended functionality like the
    footnote helper is available.</em>
    </p><p><a href="https://arcprize.org/blog/oai-o3-pub-breakthrough">This</a> is an article that explores the capabilities of OpenAI&rsquo;s o3 model. <a href="https://news.ycombinator.com/item?id=42473321">HN</a></p>
<p>AI reasoning is getting closer and closer to surpassing human capabilities.<br>
But running it to solve a problem is extremely expensive.</p>
<blockquote>
<p>Effectively, o3 represents a form of deep learning-guided program search. The model does test-time search over a space of &ldquo;programs&rdquo; (in this case, natural language programs – the space of CoTs that describe the steps to solve the task at hand), guided by a deep learning prior (the base LLM). The reason why solving a single ARC-AGI task can end up taking up tens of millions of tokens and cost thousands of dollars is because this search process has to explore an enormous number of paths through program space – including backtracking.</p>
</blockquote>
<p>(&ldquo;programs&rdquo; is defined here by the author&rsquo;s remark: &ldquo;My mental model for LLMs is that they work as a repository of vector programs&rdquo;)</p>
<p>As far as I understand this and its implications, it appears to me that this process works more like a directed brute force that generates many candidates and selects the best one; a minimal sketch of that pattern follows below.<br>
There is a <a href="https://news.ycombinator.com/item?id=42479422">thread</a> on HN that discusses this topic.</p>
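<p>A minimal Go sketch of that &ldquo;generate many candidates, keep the best&rdquo; pattern (purely illustrative; <code>generate</code> and <code>score</code> are hypothetical stand-ins for the LLM sampler and a verifier, and this is not OpenAI&rsquo;s published mechanism):</p>
<pre><code class="language-go">package search

import "math"

// bestOfN samples n candidate "programs" (chains of thought rendered
// as strings) and keeps the one the scorer rates highest. The huge
// token counts and costs come from n being very large and each
// candidate being expensive to generate.
func bestOfN(n int, generate func() string, score func(string) float64) string {
	best := ""
	bestScore := math.Inf(-1)
	for range n { // Go 1.22+ integer range
		cand := generate()
		if s := score(cand); s > bestScore {
			best, bestScore = cand, s
		}
	}
	return best
}
</code></pre>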
<h2 id="gpt-4-summary"><a href="https://heidenstedt.org/links/oai-o3-pub-breakthrough/#gpt-4-summary">GPT-4 Summary</a></h2><p>OpenAI&rsquo;s new <strong>o3 system</strong> has achieved a groundbreaking <strong>75.7% on the ARC-AGI-Pub Semi-Private Evaluation</strong> within a $10k compute limit, surpassing previous models. A high-compute version scored <strong>87.5%</strong>, marking a major step in AI&rsquo;s adaptability to novel tasks. This success highlights a shift from scaling existing architectures to innovative mechanisms for generalization and test-time knowledge recombination.</p>
<p>Key achievements:</p>
<ul>
<li><strong>ARC-AGI performance</strong>: o3 shows unprecedented adaptability, unlike previous GPT-family models, which struggled with novel tasks.</li>
<li><strong>Efficiency challenges</strong>: The low-compute configuration costs $17–$20 per task, but the high-compute configuration remains expensive and exploratory.</li>
<li><strong>Core innovation</strong>: o3 employs <strong>natural language program search</strong>, generating and executing task-specific solutions during runtime, guided by deep learning priors.</li>
</ul>
<p>Despite its advancements, o3 is not AGI, as it still fails simple tasks. Upcoming benchmarks, such as <strong>ARC-AGI-2 in 2025</strong>, aim to further challenge AI systems while encouraging open-source progress.</p>
<p>This breakthrough underscores a qualitative leap in AI research, reshaping paths toward AGI and sparking broader scientific engagement.</p>
]]></content:encoded></item></channel></rss>