Saturday, March 14, 2026

Simple compression made for a slightly longer string length

 Continued from last post: (Prompts)

What will happen if I let the string length be larger, say 255 instead of 7, and I also run-length encode repetitions of codes in one node in the BST?

I want you to critique a technique I am thinking of, where LZ77 is modified to refer to future encountered strings in a second pass that reorders the indexes so that longer stretches of text are matched.

The technique you are describing is a fascinating hybrid of LZ77 (Dictionary coding) and Global Optimization. In the compression world, this is often referred to as "Optimal Parsing" or "Look-ahead Matching," and when you involve reordering, you are touching on "Block Sorting."

Here is a critique of your proposed technique, broken down by its theoretical advantages and practical "bottlenecks."


1. The "Permutation" Tax (The Biggest Hurdle)

The most critical part of your idea is the "reordering of indexes."

  • The Problem: If you reorder the source text to bring similar strings closer together (to get those longer LZ77 matches), the decoder no longer knows the original order of the characters.

  • The Cost: As we saw in our previous discussion about Lehmer codes, the "map" required to restore the original order is often larger than the space you saved by finding longer matches.

  • Critique: For this to be effective, the gain in match length must be greater than the bits required to encode the permutation. In most natural data (text, code), the cost of the permutation usually wipes out the compression gains unless the data is extremely repetitive.
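
To put a rough number on that tax: describing an arbitrary reordering of n items takes at least log2(n!) bits, no matter how cleverly (Lehmer codes included) you encode it. A quick illustrative sketch, with a hypothetical helper name:

```python
import math

def permutation_cost_bits(n):
    """Minimum bits needed to describe an arbitrary permutation of n
    items, i.e. log2(n!). Any reordering map pays at least this on
    average, Lehmer codes included."""
    return math.log2(math.factorial(n))

# Even a modest 64-byte block needs roughly 296 bits (~37 bytes) just
# for the map, so the longer matches must save more than that to break even.
print(f"{permutation_cost_bits(64):.0f} bits")
```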

2. Connection to the Burrows-Wheeler Transform (BWT)

Your idea of "reordering to find better matches" is essentially what the BWT (used in bzip2) does.

  • How BWT does it: It reorders the entire string alphabetically based on its suffixes. This clusters similar characters together.

  • The "Magic": The BWT is reversible with only one integer index (the "primary index"), rather than a full Lehmer-style permutation map.

  • Critique: If your goal is to reorder data for better compression, BWT is the "gold standard" because it avoids the Lehmer code overhead. If your technique requires a full index map, it will likely lose to BWT.
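
A minimal, deliberately naive sketch of the BWT round trip (O(n² log n), sorting full rotations rather than using suffix arrays) shows that a single integer is the entire "permutation map":

```python
def bwt(s):
    """Naive Burrows-Wheeler Transform: sort all rotations of s and take
    the last column. Returns (transformed string, primary index), where
    the primary index is the sorted position of the original string."""
    rotations = sorted(s[i:] + s[:i] for i in range(len(s)))
    return "".join(r[-1] for r in rotations), rotations.index(s)

def ibwt(last, idx):
    """Invert the BWT using only the last column and one integer index:
    repeatedly prepend the last column and re-sort to rebuild the sorted
    rotation table, then read off row idx."""
    table = [""] * len(last)
    for _ in range(len(last)):
        table = sorted(last[i] + table[i] for i in range(len(last)))
    return table[idx]

out, idx = bwt("bananabread")
assert ibwt(out, idx) == "bananabread"
```

(Real implementations append a unique sentinel character and build a suffix array instead of sorting rotations, but the reversibility argument is the same.)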

3. "Optimal Parsing" vs. "Greedy Parsing"

Standard LZ77 is "Greedy." It takes the longest match it sees right now. However, taking a 10-byte match now might prevent you from taking two 8-byte matches later.

  • Your "Second Pass": By looking at "future encountered strings," you are performing Optimal Parsing.

  • Existing Implementations: Algorithms like LZMA (7-Zip) and Zstandard use this. They use a "flexible" or "optimal" parser that looks ahead several steps to decide which match sequence results in the fewest total bits.

  • Critique: This is a proven, highly effective technique. The "future pass" allows you to build a cost-graph and find the shortest path (in bits) to encode the file. It increases compression time significantly but improves the ratio.
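
The cost-graph idea can be sketched as a shortest-path dynamic program. This is an illustration, not LZMA's real price model: `optimal_parse` is a hypothetical name, and the flat per-token bit costs stand in for a genuine entropy coder.

```python
def optimal_parse(data, lit_bits=9, match_bits=24, min_match=3, window=1024):
    """Shortest-path ('optimal') parse over a cost graph: cost[i] is the
    fewest bits needed to encode data[:i]. Each position extends either
    by one literal or by a back-reference of any legal length."""
    n = len(data)
    INF = float("inf")
    cost = [INF] * (n + 1)
    cost[0] = 0
    back = [None] * (n + 1)  # back-pointer: (previous position, token)
    for i in range(n):
        if cost[i] == INF:
            continue
        # Edge 1: emit data[i] as a single literal.
        if cost[i] + lit_bits < cost[i + 1]:
            cost[i + 1] = cost[i] + lit_bits
            back[i + 1] = (i, ("lit", data[i]))
        # Edge 2: emit a (offset, length) match against the sliding window.
        for j in range(max(0, i - window), i):
            length = 0
            while i + length < n and data[j + length] == data[i + length]:
                length += 1  # overlapping copies are allowed, as in LZ77
            for l in range(min_match, length + 1):
                if cost[i] + match_bits < cost[i + l]:
                    cost[i + l] = cost[i] + match_bits
                    back[i + l] = (i, ("match", i - j, l))
    tokens, pos = [], n  # walk the back-pointers to recover the parse
    while pos > 0:
        pos, tok = back[pos]
        tokens.append(tok)
    return tokens[::-1], cost[n]

tokens, bits = optimal_parse(b"abcabcabcabc")
# Three literals plus one overlapping 9-byte match, versus
# 108 bits if every byte were emitted as a literal.
```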

4. Non-Causal Dependencies (The Circularity Problem)

LZ77 is usually "causal"—the decoder only needs what it has already seen to decode what is coming next.

  • The Technique: If you refer to "future" strings in a way that the decoder hasn't reached yet, the decoder cannot start work until it has the entire future block in memory.

  • Critique: This turns a "Streaming" algorithm into a "Block" algorithm. This is fine for file compression (like .7z), but it makes the technique unusable for real-time applications (like video streaming or network protocols) where you can't wait for the "future" to arrive.

5. Dictionary Pre-Sorting

If by "reordering," you mean keeping the original text as-is but reordering how the dictionary is searched, you are onto a very strong technique.

  • Static Dictionaries: Some modern compressors look at the "future" of a dataset (or a training set), identify the most common long strings, and put them into a "Static Dictionary" that is sent before the actual data.

  • Critique: This is very effective for small messages. If you know the "future" involves many repetitions of "UserLoginRequest," putting that string in the shared dictionary lets every occurrence collapse into a compact back-reference throughout the rest of the stream.
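
Python's standard-library zlib exposes exactly this via the `zdict` parameter; the dictionary contents and message below are a made-up example:

```python
import zlib

# A "trained" preset dictionary of strings we expect to recur
# (hypothetical contents for illustration).
zdict = b"UserLoginRequest UserLoginResponse session_token"

msg = b'{"type": "UserLoginRequest", "session_token": "abc123"}'

# Compress the same small message with and without the preset dictionary.
plain = zlib.compress(msg)
co = zlib.compressobj(zdict=zdict)
with_dict = co.compress(msg) + co.flush()
assert len(with_dict) < len(plain)

# The decoder must be handed the same dictionary out of band.
dec = zlib.decompressobj(zdict=zdict)
assert dec.decompress(with_dict) == msg
```

Zstandard generalizes this with trained dictionaries (`zstd --train`), which is where the "look at the future of the dataset" step actually happens in practice.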


Summary Recommendation

Your idea is moving toward "Optimal Parsing" and "Block Sorting." To make it work:

  1. Don't reorder the text physically unless you use a transform like BWT. Storing the "reordering map" (the permutation) is almost always too expensive for files larger than a few dozen bytes.

  2. Use the "Future Pass" for decision making. Use the second pass to look at the entire file and decide: "Should I use a match here, or save this string to be a reference for a much larger match later?"

  3. Entropy Coding of Matches: If you find very long matches by looking ahead, encode the match lengths and offsets with Arithmetic Coding or Finite State Entropy (FSE). Match lengths are heavily skewed toward small values, so the common short lengths should get very few bits, while the rare long matches can afford longer codes because each one replaces so much literal data.
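
As a toy illustration of skew-aware length coding (far simpler than FSE or arithmetic coding, but built on the same principle), an Elias gamma code spends few bits on small, common lengths and more bits on rare large ones:

```python
def elias_gamma(n):
    """Elias gamma code for an integer n >= 1: (len(binary)-1) zero bits
    followed by the binary representation. Small, frequent values get
    short codes; large, rare values get proportionally longer ones."""
    b = bin(n)[2:]                 # binary digits without the '0b' prefix
    return "0" * (len(b) - 1) + b

# Length 1 costs a single bit; length 9 costs seven bits.
print(elias_gamma(1), elias_gamma(9))
```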

In short: You've independently identified the core logic behind LZMA/7-Zip. The secret to its success isn't just finding the matches, but using that "second pass" to calculate the mathematically perfect combination of matches to minimize the bit-stream.

