
Hyper Text Compression: Shrinking Wikipedia to 10.7% of its Size

This is a super cool leaderboard for lossless text compression via NLP (and yes, that includes AI)! The top solution compresses the first GB of the English Wikipedia down to just 10.7% of its original size, with the size of the decompression program counted in that total!

Automatic TLDR by Gemini 2.5 Pro:

This page describes the Large Text Compression Benchmark, an open competition that ranks lossless data compression programs. The primary goal is to encourage research in artificial intelligence (AI) and natural language processing (NLP) by treating text compression as a language modeling problem.

Benchmark Overview

  • Test Data: The benchmark uses the first $10^9$ bytes (1 GB) of an English Wikipedia XML dump from March 3, 2006, known as enwik9.
  • Ranking Metric: Programs are ranked solely by the total size, i.e. the compressed enwik9 file plus the zipped decompressor program; a smaller total is better (see the sketch after this list).
  • Secondary Information: Data such as compression/decompression speed and memory usage are provided for informational purposes but do not influence the rankings.
  • Goal: The benchmark’s main purpose is not to find the best general-purpose compressor but to push the boundaries of data modeling, a fundamental challenge in both AI and compression.
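To make the ranking rule concrete, here is a minimal Python sketch of that metric. The file names are hypothetical placeholders; the only fixed quantity is the $10^9$-byte length of enwik9. For the current leader, the total works out to roughly 107 MB, which is where the 10.7% figure comes from.

```python
import os

ENWIK9_SIZE = 10**9  # enwik9 is defined as the first 10^9 bytes of the Wikipedia dump

def ranking_metric(compressed_path: str, zipped_decompressor_path: str) -> dict:
    """Benchmark ranking rule: compressed enwik9 size plus the size of the
    zipped decompressor; the smaller the total, the better the rank.
    Speed and memory usage are reported but not scored."""
    compressed = os.path.getsize(compressed_path)
    decompressor = os.path.getsize(zipped_decompressor_path)
    total = compressed + decompressor
    return {
        "compressed_bytes": compressed,
        "decompressor_bytes": decompressor,
        "total_bytes": total,                          # the only ranked number
        "percent_of_original": 100 * total / ENWIK9_SIZE,
    }

# Hypothetical file names, for illustration only:
# print(ranking_metric("enwik9.compressed", "decompressor.zip"))
```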

Key Findings and Algorithms

The results table shows a wide variety of compression programs, ranked from the best compression ratio to the worst. A clear trend emerges from the top-performing entries:

  • Dominance of AI Models: The highest-ranking compressors, such as nncp and cmix, rely on sophisticated AI-based algorithms: neural network models like Transformers (Tr) and Long Short-Term Memory (LSTM) networks, as well as advanced Context Mixing (CM) techniques. These methods excel at modeling the complex patterns of natural language text and therefore achieve superior compression ratios (the sketch after this list illustrates the link between prediction quality and compressed size).
  • Trade-offs: There is a significant trade-off between compression ratio, speed, and memory. The top-ranked AI-driven compressors are extremely slow and require vast amounts of memory (often many gigabytes) and, in some cases, specialized hardware like GPUs.
  • Traditional Algorithms: More conventional algorithms like Lempel-Ziv (LZ), Burrows-Wheeler Transform (BWT), and Prediction by Partial Match (PPM) are found further down the list. While they are generally much faster and use less memory, they cannot achieve the same level of compression as the leading AI models on this specific text-based task.
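The "compression as language modeling" idea behind the first bullet can be made concrete with a toy example (this is not the actual nncp or cmix code): any model that assigns probability p to the next symbol costs roughly $-\log_2 p$ bits under an idealized arithmetic coder, so a better predictor directly means a smaller file. The sketch below uses a deliberately crude order-1 adaptive byte model purely to illustrate that relationship.

```python
import math
from collections import defaultdict

def ideal_compressed_bits(data: bytes) -> float:
    """Toy order-1 adaptive byte model: predict each byte from the previous one
    using Laplace-smoothed counts, and charge -log2(p) bits per byte, the cost
    an ideal arithmetic coder would approach with this model."""
    counts = defaultdict(lambda: defaultdict(int))  # counts[context][byte]
    totals = defaultdict(int)                       # totals[context]
    bits = 0.0
    context = 0
    for b in data:
        p = (counts[context][b] + 1) / (totals[context] + 256)  # smoothed estimate
        bits += -math.log2(p)
        counts[context][b] += 1   # update the model after coding, as the decoder would
        totals[context] += 1
        context = b
    return bits

text = b"the quick brown fox jumps over the lazy dog. " * 50
print(f"raw:   {len(text) * 8} bits")
print(f"model: {ideal_compressed_bits(text):.0f} bits")
```

The top entries in the table replace this crude model with Transformers, LSTMs, or mixtures of many context models, which predict English text far more accurately and therefore pay far fewer bits per character.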

Hutter Prize

The benchmark is closely related to the Hutter Prize, which offers prize money for open-source compression improvements on a smaller subset of the data (enwik8, the first $10^8$ bytes). This prize has specific hardware and time constraints, encouraging practical advancements in the field.
