Saturday, March 14, 2026

Simple compression made for a slightly longer string length

 Continued from last post: (Prompts)

What will happen if I let the string length be larger, like 255 instead of 7, and I also run-length encode repetitions of codes in one node of the BST?

I want you to critique a technique I am thinking of, where LZ77 is modified by referring to future encountered strings in a second pass that reorders the indexes so that larger lengths of text are matched.

The technique you are describing is a fascinating hybrid of LZ77 (Dictionary coding) and Global Optimization. In the compression world, this is often referred to as "Optimal Parsing" or "Look-ahead Matching," and when you involve reordering, you are touching on "Block Sorting."

Here is a critique of your proposed technique, broken down by its theoretical advantages and practical "bottlenecks."


1. The "Permutation" Tax (The Biggest Hurdle)

The most critical part of your idea is the "reordering of indexes."

  • The Problem: If you reorder the source text to bring similar strings closer together (to get those longer LZ77 matches), the decoder no longer knows the original order of the characters.

  • The Cost: As we saw in our previous discussion about Lehmer codes, the "map" required to restore the original order is often larger than the space you saved by finding longer matches.

  • Critique: For this to be effective, the gain in match length must be greater than the bits required to encode the permutation. In most natural data (text, code), the cost of the permutation usually wipes out the compression gains unless the data is extremely repetitive.
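The break-even point can be sketched numerically. The snippet below (my own illustration, not from the original discussion) uses log₂(n!) via `math.lgamma` as the cost of naming one arbitrary permutation of n symbols:

```python
import math

def permutation_bits(n: int) -> float:
    # Bits needed to name one arbitrary permutation of n items: log2(n!)
    return math.lgamma(n + 1) / math.log(2)

# For a 64-symbol block the map alone costs ~296 bits (~37 bytes), so the
# longer matches would have to save more than that just to break even.
for n in (8, 64, 1024):
    print(n, round(permutation_bits(n)))
```

The cost grows as roughly n·log₂(n) bits, which is why the permutation map overwhelms the match savings on anything but tiny, extremely repetitive inputs.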

2. Connection to the Burrows-Wheeler Transform (BWT)

Your idea of "reordering to find better matches" is essentially what the BWT (used in bzip2) does.

  • How BWT does it: It reorders the entire string alphabetically based on its suffixes. This clusters similar characters together.

  • The "Magic": The BWT is reversible with only one integer index (the "primary index"), rather than a full Lehmer-style permutation map.

  • Critique: If your goal is to reorder data for better compression, BWT is the "gold standard" because it avoids the Lehmer code overhead. If your technique requires a full index map, it will likely lose to BWT.
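To make the one-integer reversibility concrete, here is a naive O(n² log n) sketch of the BWT (for clarity only; bzip2 uses far faster suffix-sorting internally). The inverse needs nothing but the transformed string and the primary index:

```python
def bwt(s: str) -> tuple[str, int]:
    # Naive forward BWT: sort all rotations, emit the last column
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    last_col = "".join(r[-1] for r in rotations)
    return last_col, rotations.index(s)  # primary index = row of the original

def inverse_bwt(last_col: str, idx: int) -> str:
    # Rebuild by repeatedly prepending the last column and re-sorting
    table = [""] * len(last_col)
    for _ in range(len(last_col)):
        table = sorted(last_col[i] + table[i] for i in range(len(last_col)))
    return table[idx]

enc, idx = bwt("banana")   # -> ("nnbaaa", 3)
assert inverse_bwt(enc, idx) == "banana"
```

One integer instead of a log₂(n!)-bit permutation map is exactly the overhead saving that makes the reordering pay for itself.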

3. "Optimal Parsing" vs. "Greedy Parsing"

Standard LZ77 is "Greedy." It takes the longest match it sees right now. However, taking a 10-byte match now might prevent you from taking two 8-byte matches later.

  • Your "Second Pass": By looking at "future encountered strings," you are performing Optimal Parsing.

  • Existing Implementations: Algorithms like LZMA (7-Zip) and Zstandard use this. They use a "flexible" or "optimal" parser that looks ahead several steps to decide which match sequence results in the fewest total bits.

  • Critique: This is a proven, highly effective technique. The "future pass" allows you to build a cost-graph and find the shortest path (in bits) to encode the file. It increases compression time significantly but improves the ratio.
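A toy version of that cost-graph idea, with made-up fixed per-token bit costs (real parsers like LZMA's model these costs adaptively), is a shortest-path dynamic program over positions:

```python
LITERAL_BITS, MATCH_BITS = 9, 24  # illustrative fixed costs, not LZMA's real model

def optimal_parse(n: int, matches: dict[int, int]) -> float:
    """Minimum bits to encode n symbols, given the longest match length
    available at each position. Classic shortest-path DP over positions."""
    INF = float("inf")
    cost = [0.0] + [INF] * n
    for i in range(n):
        if cost[i] == INF:
            continue
        # Option 1: emit symbol i as a literal
        cost[i + 1] = min(cost[i + 1], cost[i] + LITERAL_BITS)
        # Option 2: emit one match token covering m symbols
        m = matches.get(i, 0)
        if m >= 2 and i + m <= n:
            cost[i + m] = min(cost[i + m], cost[i] + MATCH_BITS)
    return cost[n]

# 16 literals would cost 144 bits; two 8-byte matches cost 48.
print(optimal_parse(16, {0: 8, 8: 8}))  # -> 48.0
```

The "future pass" supplies the `matches` table; the DP then picks the globally cheapest token sequence rather than the locally longest match.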

4. Non-Causal Dependencies (The Circularity Problem)

LZ77 is usually "causal"—the decoder only needs what it has already seen to decode what is coming next.

  • The Technique: If you refer to "future" strings in a way that the decoder hasn't reached yet, the decoder cannot start work until it has the entire future block in memory.

  • Critique: This turns a "Streaming" algorithm into a "Block" algorithm. This is fine for file compression (like .7z), but it makes the technique unusable for real-time applications (like video streaming or network protocols) where you can't wait for the "future" to arrive.

5. Dictionary Pre-Sorting

If by "reordering," you mean keeping the original text as-is but reordering how the dictionary is searched, you are onto a very strong technique.

  • Static Dictionaries: Some modern compressors look at the "future" of a dataset (or a training set), identify the most common long strings, and put them into a "Static Dictionary" that is sent before the actual data.

  • Critique: This is very effective for small messages. If you know the "future" involves many repetitions of "UserLoginRequest," putting that at the front of the index allows for 1-byte matches throughout the rest of the stream.
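A toy sketch of the idea (the escape scheme and names here are invented; real trained dictionaries, like Zstandard's, are far more sophisticated). Each shared phrase collapses to a 2-byte reference:

```python
def encode(text: str, dictionary: list[str]) -> str:
    # Replace each dictionary phrase with ESC + index.
    # Toy assumptions: text contains no NUL bytes, at most 256 phrases.
    for i, phrase in enumerate(dictionary):
        text = text.replace(phrase, "\x00" + chr(i))
    return text

def decode(data: str, dictionary: list[str]) -> str:
    out, i = [], 0
    while i < len(data):
        if data[i] == "\x00":            # escape: next char is a phrase index
            out.append(dictionary[ord(data[i + 1])])
            i += 2
        else:
            out.append(data[i])
            i += 1
    return "".join(out)

shared = ["UserLoginRequest"]            # built from the "future" / training set
msg = "UserLoginRequest:alice;UserLoginRequest:bob"
packed = encode(msg, shared)
assert decode(packed, shared) == msg and len(packed) < len(msg)
```

Because the dictionary is agreed on before transmission, the stream itself stays causal and streamable.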


Summary Recommendation

Your idea is moving toward "Optimal Parsing" and "Block Sorting." To make it work:

  1. Don't reorder the text physically unless you use a transform like BWT. Storing the "reordering map" (the permutation) is almost always too expensive for files larger than a few dozen bytes.

  2. Use the "Future Pass" for decision making. Use the second pass to look at the entire file and decide: "Should I use a match here, or save this string to be a reference for a much larger match later?"

  3. Entropy Coding of Matches: If you find very long matches by looking ahead, ensure you use Arithmetic Coding or Finite State Entropy (FSE) to encode the match lengths and offsets. Long matches are rare, so they should be encoded with very few bits when they do occur.
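Arithmetic coding or FSE is the real answer for point 3; as a much simpler stand-in, a universal code like Elias gamma already shows the shape: frequent short lengths get short codes, while rare long matches pay proportionally more bits but occur rarely.

```python
def elias_gamma(n: int) -> str:
    """Elias gamma code for n >= 1: (bit-length - 1) zeros, then n in binary."""
    b = bin(n)[2:]
    return "0" * (len(b) - 1) + b

assert elias_gamma(1) == "1"        # 1 bit for the most common length
assert elias_gamma(5) == "00101"    # 5 bits
assert len(elias_gamma(300)) == 17  # a rare long match costs more, but rarely
```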

In short: You've independently identified the core logic behind LZMA/7-Zip. The secret to its success isn't just finding the matches, but using that "second pass" to calculate the mathematically perfect combination of matches to minimize the bit-stream.


Simple Compression close to Kolmogorov complexity

 Prompt: I am working on a low-level coding task where the task is to encode and decode a string. I want Python code for encoding a string by first sorting the characters alphabetically while remembering the position of each character and encoding it using Lehmer's code. The encoder will encode the character codes differentially, like: original array => 231, 3, 45, 0, 23, 32, 78; sorted array => 0, 3, 23, 32, 45, 78, 231, sent as the permutation rank, plus another array encoding of 32, 29, 46, 3, 20, 33, 153, where the sorted numbers are sent in preorder and the first 32 is encoded as 8 bits, the second 29 encodes 3 using 6 bits. I want the bit stream encoder and decoder.

The generated code is pushed to GitHub.

Comparing this custom method to standard lossless compression utilities (like Gzip (DEFLATE), Zstandard, LZMA, or Brotli) reveals a classic trade-off between Generality and Domain-Specific Optimization.

Here is how your method stacks up:

1. The "Tiny Data" Advantage (Winning)

General-purpose compressors (Gzip, Zstandard) are designed for files, not individual 7-character strings.

  • Overhead/Headers: Gzip has a minimum header size of about 10–18 bytes. Even an "empty" Gzip file is larger than your entire 7-byte encoded string.

  • Dictionary Building: Algorithms like LZ77 need a "sliding window" to find repetitions. In a 7-character string, there is almost no repetition to find.

  • Result: For ultra-short, fixed-length metadata (like a 64-bit packet header in a radio protocol), your method will significantly outperform Gzip/Zstd.
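This is easy to verify with Python's standard library (exact sizes depend on the zlib build, so only the inequality matters):

```python
import gzip

msg = b"0,3,23,32,45,78,231"   # ~19 bytes of short, structured data
packed = gzip.compress(msg)
# gzip's ~18 bytes of header + trailer dominate: output is LARGER than input
assert len(packed) > len(msg)
print(len(msg), "->", len(packed))
```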

2. Information Theory & Entropy (The Limit)

Your method is essentially a form of Structural Encoding. You aren't just compressing "data"; you are compressing the mathematical properties of a permutation.

  • Permutation: There are n! ways to arrange n characters. For n = 7, that is 7! = 5040. To identify one specific arrangement, you need log₂(5040) ≈ 12.3 bits. Your Lehmer code uses 14 bits (because it uses integer bit-boundaries for each coefficient). This is nearly 90% efficient compared to the theoretical limit.

  • Values: By using the BST range restriction, you are performing a version of Arithmetic Coding without the floating-point complexity. You use the "known universe" of the parent node to reduce the entropy of the child node.
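The 14-bit figure is easy to check: the i-th Lehmer digit ranges over n−i values, so whole-bit fields cost ⌈log₂7⌉ + ⌈log₂6⌉ + … + ⌈log₂2⌉ bits. A small sketch (my own illustration of the digit extraction and bit counting):

```python
import math

def lehmer_digits(perm: list[int]) -> list[int]:
    # d_i = number of elements after position i that are smaller than perm[i]
    return [sum(perm[j] < perm[i] for j in range(i + 1, len(perm)))
            for i in range(len(perm))]

def whole_bit_cost(n: int) -> int:
    # Each digit d_i in [0, n-1-i] gets its own ceil(log2) bit field
    return sum(math.ceil(math.log2(k)) for k in range(2, n + 1))

assert whole_bit_cost(7) == 14               # the 14 bits cited above
assert math.log2(math.factorial(7)) < 12.4   # ~12.3-bit theoretical floor
assert lehmer_digits([3, 0, 2, 1]) == [3, 0, 1, 0]
```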

3. Scaling (Losing)

This is where general-purpose utilities take the lead as the input grows.

  • The Lehmer Trap: As the string length N increases, the Lehmer code becomes computationally expensive to calculate (O(N²) naively, or O(N log N) with Fenwick trees). More importantly, it doesn't account for patterns.

  • Redundancy: If your string is "AAAAAAAA", a general compressor sees a "run" and compresses it to almost nothing using RLE or LZ77. Your method would still calculate a full Lehmer code and a BST, which is overkill for redundant data.

  • Alphabet size: General compressors use Huffman or FSE (Finite State Entropy) to map frequent characters (like 'e' or 't') to shorter bit-sequences. Your method treats all characters (0-255) as equally likely unless the BST range happens to narrow down.
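The "AAAAAAAA" point is easy to demonstrate with zlib from the standard library:

```python
import zlib

run = b"A" * 1000
packed = zlib.compress(run)
# LZ77 collapses the run to a handful of bytes; a Lehmer-code approach
# would still pay full price for the (trivial) permutation
assert len(packed) < 30
print(len(run), "->", len(packed))
```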

4. Comparison Summary Table

| Feature | Your Custom Method | Gzip / Zstd / LZMA |
|---|---|---|
| Smallest Unit | Bits (can be exactly 56 bits) | Bytes (usually min 20+ bytes) |
| Header Overhead | Zero (if length is fixed) | High (10–100+ bytes) |
| Sorted Data | Extremely efficient (BST Delta) | Moderate (Standard Delta) |
| Repetitive Data | Poor (No dictionary) | Excellent (LZ77) |
| Scaling | Poor (Lehmer grows fast) | Excellent |
| Complexity | High logic, low memory | Low logic, high memory |

When to use your method?

Your approach is a "Low-Level Serialization" technique rather than a "Compression Utility." Use it when:

  1. Strict Bandwidth Constraints: You are sending data over LoRa, Satellite, or CAN bus where every single bit costs money or power.

  2. Known Structure: You know the data is always exactly N characters and usually contains distinct values.

  3. No Storage for Libraries: You are on an Arduino or an FPGA where you cannot afford the memory footprint of the Zstandard library.

Final Verdict

If you try to compress a 1MB text file with this, it will be much slower and likely produce a larger file than Gzip. However, for a 7-byte identifier, your method is near the theoretical "Kolmogorov Complexity" limit, and standard utilities cannot touch it.