Saturday, March 14, 2026

Simple Compression close to Kolmogorov complexity

 Prompt: I am working on a low level coding task where the task is to encode and decode a string. I want python code for encoding a string by first sorting the characters alphabetically while remembering position of each character and encoding it using lehmers code. The decoder will encode the character codes differentially line original array => 231 , 3 , 45 , 0 , 23 , 32 , 78 sorted array=>0,3,23,32,45,78,231.. sent as the permutation rank and another array encoding of 32, 29, 46, 3 , 20, 33, 153 where the sorted numbers are send in preorder where the first 32 is encoded as 8 bit, second 29 encodes 3 using 6 bits.. I want the bit stream encoder and decoder.

Code generated is pushed to github

Comparing this custom method to standard lossless compression utilities (like Gzip (DEFLATE), Zstandard, LZMA, or Brotli) reveals a classic trade-off between Generality and Domain-Specific Optimization.

Here is how your method stacks up:

1. The "Tiny Data" Advantage (Winning)

General-purpose compressors (Gzip, Zstandard) are designed for files, not individual 7-character strings.

  • Overhead/Headers: Gzip has a minimum header size of about 10–18 bytes. Even an "empty" Gzip file is larger than your entire 7-byte encoded string.

  • Dictionary Building: Algorithms like LZ77 need a "sliding window" to find repetitions. In a 7-character string, there is almost no repetition to find.

  • Result: For ultra-short, fixed-length metadata (like a 64-bit packet header in a radio protocol), your method will significantly outperform Gzip/Zstd.

2. Information Theory & Entropy (The Limit)

Your method is essentially a form of Structural Encoding. You aren't just compressing "data"; you are compressing the mathematical properties of a permutation.

  • Permutation: There are

            n!
          
    ways to arrange characters. For n
          
          
    . To identify one specific arrangement, you need
            l
          
    bits. Your Lehmer code uses 14 bits (because it uses integer bit-boundaries for each coefficient). This is nearly 90% efficient compared to the theoretical limit.

  • Values: By using the BST range restriction, you are performing a version of Arithmetic Coding without the floating-point complexity. You use the "known universe" of the parent node to reduce the entropy of the child node.

3. Scaling (Losing)

This is where general-purpose utilities take the lead as the input grows.

  • The Lehmer Trap: As the string length

            N
          
    increases, the Lehmer code becomes computationally expensive to calculate (
            O(N2)
          
    or
            O(NlogN)
          
    with Fenwick trees). More importantly, it doesn't account for patterns.

  • Redundancy: If your string is "AAAAAAAA", a general compressor sees a "run" and compresses it to almost nothing using RLE or LZ77. Your method would still calculate a full Lehmer code and a BST, which is overkill for redundant data.

  • Alphabet size: General compressors use Huffman or FSE (Finite State Entropy) to map frequent characters (like 'e' or 't') to shorter bit-sequences. Your method treats all characters (0-255) as equally likely unless the BST range happens to narrow down.

4. Comparison Summary Table

FeatureYour Custom MethodGzip / Zstd / LZMA
Smallest UnitBits (can be exactly 56 bits)Bytes (usually min 20+ bytes)
Header OverheadZero (if length is fixed)High (10–100+ bytes)
Sorted DataExtremely efficient (BST Delta)Moderate (Standard Delta)
Repetitive DataPoor (No dictionary)Excellent (LZ77)
ScalingPoor (Lehmer grows fast)Excellent
ComplexityHigh logic, low memoryLow logic, high memory

When to use your method?

Your approach is a "Low-Level Serialization" technique rather than a "Compression Utility." Use it when:

  1. Strict Bandwidth Constraints: You are sending data over LoRa, Satellite, or CAN bus where every single bit costs money or power.

  2. Known Structure: You know the data is always exactly

            NN
          
    characters and usually contains distinct values.

  3. No Storage for Libraries: You are on an Arduino or an FPGA where you cannot afford the memory footprint of the Zstandard library.

Final Verdict

If you try to compress a 1MB text file with this, it will be much slower and likely produce a larger file than Gzip. However, for a 7-byte identifier, your method is near the theoretical "Kolmogorov Complexity" limit, and standard utilities cannot touch it.

No comments:

Post a Comment