More common are methods where the dictionary starts in some predetermined state but the contents change during the encoding process, based on the data that has already been encoded. Both the LZ77 and LZ78 algorithms work on this principle. In LZ77, a data structure called the "sliding window" is used to hold the last N bytes of data processed; this window serves as the dictionary, effectively storing every substring that has appeared in the past N bytes as dictionary entries. Instead of a single index identifying a dictionary entry, two values are needed: the length, indicating the length of the matched text, and the offset (also called the distance), indicating that the match is found in the sliding window starting offset bytes before the current text.
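The length/offset idea can be illustrated with a minimal sketch. It assumes a brute-force search of the window and a simple token format (`("lit", byte)` or `("match", offset, length)`); the function names, the `window_size` and `min_match` parameters, and the token layout are illustrative choices, not part of any particular LZ77 implementation.

```python
def lz77_encode(data: bytes, window_size: int = 4096, min_match: int = 3) -> list:
    """Greedy LZ77 sketch: emit ("lit", byte) or ("match", offset, length) tokens."""
    tokens = []
    i = 0
    while i < len(data):
        best_len, best_off = 0, 0
        start = max(0, i - window_size)
        # Brute-force search of the sliding window for the longest match.
        for j in range(start, i):
            k = 0
            # The match may run past position i (an overlapping copy).
            while i + k < len(data) and data[j + k] == data[i + k]:
                k += 1
            if k > best_len:
                best_len, best_off = k, i - j
        if best_len >= min_match:
            tokens.append(("match", best_off, best_len))
            i += best_len
        else:
            tokens.append(("lit", data[i]))
            i += 1
    return tokens


def lz77_decode(tokens) -> bytes:
    """Rebuild the data by copying `length` bytes from `offset` bytes back."""
    out = bytearray()
    for tok in tokens:
        if tok[0] == "lit":
            out.append(tok[1])
        else:
            _, offset, length = tok
            # Copy byte by byte so that overlapping matches work.
            for _ in range(length):
                out.append(out[-offset])
    return bytes(out)


tokens = lz77_encode(b"aaaaaaa")
print(tokens)  # [('lit', 97), ('match', 1, 6)]
```

Real implementations replace the brute-force search with hash chains or similar structures and pack the tokens into bits; the sketch only shows how a (length, offset) pair refers back into the window.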
LZ78 uses a more explicit dictionary structure. In the form described here (the scheme used by the LZW variant; the original LZ78 starts with an empty dictionary and emits index/character pairs), the dictionary at the beginning of the encoding process only needs to contain entries for the symbols of the alphabet used in the text to be compressed, but the indexes are numbered so as to leave space for many more entries. (For instance, if the input is treated as 8-bit bytes, there will be 256 entries in the dictionary, but the indexes may be nine bits long, leaving space for 256 more entries, or even ten bits long, leaving space for 768 more entries.) At each step of the encoding process, the longest entry in the dictionary that matches the text is found, and its index is written to the output; the combination of that entry and the character that followed it in the text is then added to the dictionary as a new entry.
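The encoding step can be sketched as follows. This is a minimal illustration, assuming the dictionary is pre-loaded with the 256 single-byte entries and new entries are numbered from 256 upward; the function name `lzw_encode` and the `max_bits` parameter are hypothetical, and the indexes are returned as a list of integers instead of being packed into bits.

```python
def lzw_encode(data: bytes, max_bits: int = 12) -> list:
    """Return the list of dictionary indexes for `data` (sketch, no bit packing)."""
    # Assumption: the alphabet is all 256 byte values, indexed 0..255.
    dictionary = {bytes([b]): b for b in range(256)}
    next_code = 256
    limit = 1 << max_bits  # index width bounds the dictionary size
    codes = []
    current = b""
    for byte in data:
        candidate = current + bytes([byte])
        if candidate in dictionary:
            # Keep extending the match while it is still in the dictionary.
            current = candidate
        else:
            # Emit the longest match, then add match + following character.
            codes.append(dictionary[current])
            if next_code < limit:
                dictionary[candidate] = next_code
                next_code += 1
            current = bytes([byte])
    if current:
        codes.append(dictionary[current])
    return codes


print(lzw_encode(b"BANANANANA"))  # [66, 65, 78, 257, 259, 258]
```

Under these assumptions, 66, 65 and 78 are the byte values of B, A and N, while 257, 259 and 258 are the entries created during encoding for AN, ANA and NA.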
The LZ78 decoder receives each index and looks up the corresponding dictionary entry. If it has already decoded a previous entry, it adds that previous entry plus the first character of the current entry to the dictionary. It then outputs the current entry, which becomes the previous entry for the next step. As a result, the decoder's dictionary is always one entry behind the encoder's. One "gotcha" here is that if the encoder sees a sequence of the form STRING STRING CHARACTER, where STRING is currently in the dictionary and CHARACTER is the first character of STRING, it can output an index that is one higher than the decoder's last dictionary entry, because it uses the entry it has only just created. The decoder must detect such an event and output the previous entry plus that entry's first character. The unknown index will always be exactly one higher than the last index in the decoder's dictionary.
Example: The encoder is encoding BANANANANA. After outputting the indexes for B, A, N and AN, the encoder has dictionary entries for BA, AN, NA and ANA, while the decoder has entries only for BA, AN and NA. The encoder can match "ANA", so it sends the index for "ANA" and adds "ANAN" to the dictionary. However, the decoder does not yet have "ANA" in its dictionary. It must infer that this new index stands for the previous entry it received ("AN") plus that entry's first character ("A"). It then outputs "ANA" and adds the previous entry plus the first character of the output ("A" again), which is "ANA" itself, to the dictionary. Decoding can continue from there.
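The decoding rule and the BANANANANA example can be sketched as below. The sketch assumes the dictionary starts with the 256 single-byte entries and that new entries are numbered from 256, so the index sequence for the example is taken to be 66 (B), 65 (A), 78 (N), 257 (AN), 259 (ANA), 258 (NA); the function name `lzw_decode` and the `max_bits` parameter are hypothetical.

```python
def lzw_decode(codes, max_bits: int = 12) -> bytes:
    """Decode a list of dictionary indexes (sketch, no bit unpacking)."""
    # Assumption: indexes 0..255 are the single-byte entries.
    dictionary = {b: bytes([b]) for b in range(256)}
    next_code = 256
    limit = 1 << max_bits
    out = bytearray()
    previous = b""
    for code in codes:
        if code in dictionary:
            entry = dictionary[code]
        elif code == next_code and previous:
            # Special case: the encoder used an entry it had only just
            # created, so rebuild it as previous entry + its first byte.
            entry = previous + previous[:1]
        else:
            raise ValueError(f"invalid index {code}")
        out += entry
        if previous and next_code < limit:
            # Mirror the encoder: previous entry + first byte of this entry.
            dictionary[next_code] = previous + entry[:1]
            next_code += 1
        previous = entry
    return bytes(out)


# Index 259 arrives while the decoder only knows entries up to 258.
print(lzw_decode([66, 65, 78, 257, 259, 258]))  # b'BANANANANA'
```

The `elif` branch is the only place where the decoder has to guess, and the guess is safe because an unknown index can only be the one the encoder created in its immediately preceding step.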
Another dictionary coding scheme is byte pair encoding, where a byte that does not appear in the source text is assigned to represent the most commonly appearing two-byte combination. This can be done repeatedly as long as there are bytes that do not appear in the source text, and bytes that are already representing combinations of other bytes can themselves appear in combinations.
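A minimal sketch of this byte pair scheme is shown below. It assumes the whole input fits in memory and that the substitution table is returned alongside the packed data (a real format would also have to store the table); the function names `bpe_compress` and `bpe_expand` and the stopping rule (stop when no pair occurs at least twice) are illustrative choices.

```python
from collections import Counter


def bpe_compress(data: bytes):
    """Repeatedly replace the most common byte pair with an unused byte."""
    table = []  # list of (replacement_byte, original_pair) in order applied
    while True:
        unused = set(range(256)) - set(data)
        if not unused:
            break  # every byte value already appears in the data
        pairs = Counter(zip(data, data[1:]))
        if not pairs:
            break
        (a, b), count = pairs.most_common(1)[0]
        if count < 2:
            break  # no pair repeats, so a replacement would not help
        new_byte = min(unused)
        data = data.replace(bytes([a, b]), bytes([new_byte]))
        table.append((new_byte, bytes([a, b])))
    return data, table


def bpe_expand(data: bytes, table) -> bytes:
    """Undo the substitutions in reverse order."""
    for new_byte, pair in reversed(table):
        data = data.replace(bytes([new_byte]), pair)
    return data


original = b"aaabdaaabac"
packed, table = bpe_compress(original)
assert bpe_expand(packed, table) == original
print(len(original), len(packed))
```

Because a replacement byte may itself take part in later pairs, the table has to be undone in reverse order when expanding.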