Continued from last post: (Prompts)
1. The "Permutation" Tax (The Biggest Hurdle)
The Problem: If you reorder the source text to bring similar strings closer together (to get those longer LZ77 matches), the decoder no longer knows the original order of the characters.

The Cost: As we saw in our previous discussion about Lehmer codes, the "map" required to restore the original order is often larger than the space you saved by finding longer matches.

Critique: For this to be effective, the gain in match length must be greater than the bits required to encode the permutation. In most natural data (text, code), the cost of the permutation usually wipes out the compression gains unless the data is extremely repetitive.
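To put a rough number on the tax: an arbitrary permutation of n items needs about log2(n!) bits to encode. A quick sketch:

```python
import math

def permutation_cost_bits(n: int) -> float:
    """Bits needed to encode an arbitrary permutation of n items: log2(n!)."""
    return math.lgamma(n + 1) / math.log(2)

# Reordering even a 64-byte block costs roughly 296 bits (~37 bytes) just
# for the map, so the longer matches must save more than that to break even.
cost = permutation_cost_bits(64)
```

The break-even point scales badly: the map cost grows as n log n, while match-length gains from clustering are typically linear at best.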
2. Connection to the Burrows-Wheeler Transform (BWT)
How BWT does it: It reorders the entire string alphabetically based on its suffixes. This clusters similar characters together.

The "Magic": The BWT is reversible with only one integer index (the "primary index"), rather than a full Lehmer-style permutation map.

Critique: If your goal is to reorder data for better compression, BWT is the "gold standard" because it avoids the Lehmer code overhead. If your technique requires a full index map, it will likely lose to BWT.
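A minimal sketch of that single-index reversibility, using the naive rotation sort (a real implementation would build a suffix array instead):

```python
def bwt(s: bytes):
    """Forward BWT: sort all rotations of s, emit the last column plus the
    row index of the original string (the "primary index")."""
    n = len(s)
    rotations = sorted(range(n), key=lambda i: s[i:] + s[:i])
    last = bytes(s[(i - 1) % n] for i in rotations)
    primary = rotations.index(0)
    return last, primary

def ibwt(last: bytes, primary: int) -> bytes:
    """Inverse BWT via the LF-mapping; note it needs only `primary`, not a
    full permutation map, to recover the original order."""
    n = len(last)
    # A stable sort of positions by symbol links each last-column row to its
    # first-column successor.
    lf = sorted(range(n), key=lambda i: (last[i], i))
    out, i = bytearray(), primary
    for _ in range(n):
        i = lf[i]
        out.append(last[i])
    return bytes(out)
```

For `b"banana"` this produces the clustered last column `b"nnbaaa"` with primary index 3, and `ibwt` recovers the input exactly.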
3. "Optimal Parsing" vs. "Greedy Parsing"
Your "Second Pass": By looking at "future encountered strings," you are performing Optimal Parsing.

Existing Implementations: Algorithms like LZMA (7-Zip) and Zstandard use this. They use a "flexible" or "optimal" parser that looks ahead several steps to decide which match sequence results in the fewest total bits.

Critique: This is a proven, highly effective technique. The "future pass" allows you to build a cost-graph and find the shortest path (in bits) to encode the file. It increases compression time significantly but improves the ratio.
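The cost-graph idea can be sketched as a tiny dynamic program. This is a toy, not LZMA's parser: the fixed token costs (9 bits per literal, 24 bits per match) are hypothetical stand-ins for a real entropy model, and matches are restricted to non-overlapping copies from the prefix:

```python
def optimal_parse(data: bytes, min_match=3, literal_bits=9, match_bits=24):
    """Toy optimal parser: cost[i] is the fewest bits needed to encode the
    first i bytes; each step extends by a literal or by a match that also
    occurs earlier in the data. Returns (total_bits, token_lengths)."""
    n = len(data)
    INF = float("inf")
    cost = [INF] * (n + 1)
    step = [0] * (n + 1)          # back-pointer: bytes consumed to reach i
    cost[0] = 0
    for i in range(n):
        if cost[i] == INF:
            continue
        # Option 1: emit a literal.
        if cost[i] + literal_bits < cost[i + 1]:
            cost[i + 1] = cost[i] + literal_bits
            step[i + 1] = 1
        # Option 2: copy any match length whose bytes already appear earlier.
        length = min_match
        while i + length <= n and data[i:i + length] in data[:i]:
            if cost[i] + match_bits < cost[i + length]:
                cost[i + length] = cost[i] + match_bits
                step[i + length] = length
            length += 1
    # Walk the back-pointers to recover the chosen token lengths.
    tokens, i = [], n
    while i > 0:
        tokens.append(step[i])
        i -= step[i]
    return cost[n], tokens[::-1]
```

Because the whole file is costed before any token is emitted, the parser can skip a short greedy match when that sets up a cheaper sequence later, which is exactly the "future pass" decision.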
4. Non-Causal Dependencies (The Circularity Problem)
The Technique: If you refer to "future" strings in a way that the decoder hasn't reached yet, the decoder cannot start work until it has the entire future block in memory.

Critique: This turns a "Streaming" algorithm into a "Block" algorithm. This is fine for file compression (like .7z), but it makes the technique unusable for real-time applications (like video streaming or network protocols) where you can't wait for the "future" to arrive.
5. Dictionary Pre-Sorting
Static Dictionaries: Some modern compressors look at the "future" of a dataset (or a training set), identify the most common long strings, and put them into a "Static Dictionary" that is sent before the actual data.

Critique: This is very effective for small messages. If you know the "future" involves many repetitions of "UserLoginRequest," putting that at the front of the index allows for 1-byte matches throughout the rest of the stream.
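Preset dictionaries are directly supported by zlib's `zdict` parameter (Zstandard has an analogous trained-dictionary mode). A small sketch; the "UserLoginRequest" payload is just an illustrative stand-in:

```python
import zlib

# Hypothetical protocol where "UserLoginRequest" dominates the traffic.
dictionary = b"UserLoginRequest"
message = b'{"type": "UserLoginRequest", "user": "alice"}'

def compress_with_dict(data: bytes, zdict: bytes) -> bytes:
    c = zlib.compressobj(zdict=zdict)
    return c.compress(data) + c.flush()

def decompress_with_dict(data: bytes, zdict: bytes) -> bytes:
    d = zlib.decompressobj(zdict=zdict)
    return d.decompress(data) + d.flush()

with_dict = compress_with_dict(message, dictionary)
without = zlib.compress(message)
# The dictionary-primed stream is shorter: "UserLoginRequest" becomes a
# back-reference into the preset dictionary instead of 16 literal bytes.
```

Both sides must agree on the dictionary out of band, which is why this works best for protocols with many small, structurally similar messages.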
Summary Recommendation
Don't reorder the text physically unless you use a transform like BWT. Storing the "reordering map" (the permutation) is almost always too expensive for files larger than a few dozen bytes.

Use the "Future Pass" for decision making. Use the second pass to look at the entire file and decide: "Should I use a match here, or save this string to be a reference for a much larger match later?"

Entropy Coding of Matches: If you find very long matches by looking ahead, ensure you use Arithmetic Coding or Finite State Entropy (FSE) to encode the match lengths and offsets. Long matches are rare, so they should cost very few bits when they do occur.