DjVu (pronounced déjà vu) is a computer file format designed primarily to store scanned images, especially those containing text and line drawings. It uses technologies such as separation of the image into text and background/image layers, progressive loading, arithmetic coding, and lossy compression for bitonal (monochrome) images. This allows high-quality, readable images to be stored in a minimum of space, so that they can be made available on the web.
DjVu has been promoted as an alternative to PDF, as it gives smaller files than PDF for most scanned documents. The DjVu developers report that color magazine pages compress to 40–70KB, black and white technical papers compress to 15–40KB, and ancient manuscripts compress to around 100KB; all of these are significantly better than the typical 500KB required for a satisfactory JPEG image. Like PDF, DjVu can contain an OCRed text layer, making it easy to perform cut and paste and text search operations.
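The figures reported above can be turned into rough size ratios against the 500KB JPEG baseline. The following is only an arithmetic sketch: the function name `size_ratio_range` and the dictionary layout are illustrative assumptions, and the numbers are the developers' reported values, not independent measurements.

```python
# Rough arithmetic on the page sizes reported by the DjVu developers.
# All names here are illustrative; only the KB figures come from the text.
JPEG_KB = 500  # typical size quoted for a satisfactory JPEG page

REPORTED_KB = {
    "color magazine page": (40, 70),
    "black-and-white technical paper": (15, 40),
    "ancient manuscript": (100, 100),  # "around 100KB"
}


def size_ratio_range(low_kb, high_kb, baseline_kb=JPEG_KB):
    """Return (smallest, largest) ratio of the baseline size to the DjVu size."""
    return baseline_kb / high_kb, baseline_kb / low_kb


for kind, (low, high) in REPORTED_KB.items():
    worst, best = size_ratio_range(low, high)
    print(f"{kind}: roughly {worst:.1f}x to {best:.1f}x smaller than JPEG")
```

On these figures, the reported advantage ranges from about 5 times smaller for manuscripts to more than 30 times smaller for the best-compressed technical papers.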
DjVu divides a single image into several separate images, then compresses them separately. To create a DjVu file, the initial image is first separated into three images: a background image, a foreground image, and a mask image. The background and foreground images are typically lower-resolution color images (e.g., 100dpi); the mask image is a high-resolution bilevel image (e.g., 300dpi) and is typically where the text is stored. The background and foreground images are then compressed using a wavelet-based compression algorithm named IW44. The mask image is compressed using a method called JB2 (similar to JBIG2). The JB2 encoding method identifies nearly identical shapes on the page, such as multiple occurrences of a particular character in a given font, style, and size. It compresses the bitmap of each unique shape separately, and then encodes the locations where each shape appears on the page. Thus, instead of compressing a letter "e" in a given font multiple times, it compresses the letter "e" once (as a compressed bit image) and then records every place on the page where it occurs.
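The shape-dictionary idea behind JB2 can be illustrated with a minimal sketch. This is not the actual JB2 codec: it assumes the glyphs have already been segmented, it matches only exactly identical bitmaps (JB2 also matches nearly identical ones), and it omits the arithmetic coding of shapes and positions. The names `encode_shapes` and `decode_shapes` are hypothetical.

```python
from typing import Dict, List, Tuple

Bitmap = Tuple[str, ...]            # rows of '0'/'1' characters
Glyph = Tuple[int, int, Bitmap]     # (x, y, bitmap) of one mark on the page
Placement = Tuple[int, int, int]    # (shape index, x, y)


def encode_shapes(glyphs: List[Glyph]) -> Tuple[List[Bitmap], List[Placement]]:
    """Store each distinct bitmap once and record where it occurs.

    Simplification: shapes are merged only when exactly identical.
    """
    index_of: Dict[Bitmap, int] = {}
    dictionary: List[Bitmap] = []
    placements: List[Placement] = []
    for x, y, bitmap in glyphs:
        if bitmap not in index_of:
            index_of[bitmap] = len(dictionary)
            dictionary.append(bitmap)
        placements.append((index_of[bitmap], x, y))
    return dictionary, placements


def decode_shapes(dictionary: List[Bitmap], placements: List[Placement],
                  width: int, height: int) -> List[List[int]]:
    """Rebuild the bilevel mask by stamping each shape at its positions."""
    page = [[0] * width for _ in range(height)]
    for index, x, y in placements:
        for dy, row in enumerate(dictionary[index]):
            for dx, bit in enumerate(row):
                if bit == "1":
                    page[y + dy][x + dx] = 1
    return page


# Toy 3x3 "glyphs": the same E-like shape occurs twice on the page.
E = ("111", "100", "111")
L = ("100", "100", "111")
dictionary, placements = encode_shapes([(0, 0, E), (4, 0, L), (8, 0, E)])
print(len(dictionary), "unique shapes for", len(placements), "placements")
```

In this toy page the repeated shape is stored once and referenced twice, which is the source of the savings on pages of text, where a few dozen shapes account for thousands of marks.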
All this suggests that in the long run vector graphics will become the format of choice for the production of text documents by typesetting. On the other hand, for scanned media the following two options exist:
Roughly speaking, printed media content is a mix of text and graphics. To store the various types of scanned media, both PDF and DjVu employ several codecs. The simplest (and least efficient) way of storing scanned media is to treat both graphics and text as graphics. Historically, this was the first way scanned media was stored in PDF: the JPEG codec was used for color and grayscale images, while bitonal (black-and-white) images used one of the fax codecs, most notably CCITT Group 3 and Group 4. As a result, a typical PDF file required several hundred kilobytes per page. It was around this time that DjVu was proposed. This new file format essentially combined two new codecs with a very simple file structure:
Although Adobe Reader 5.0 was able to render JBIG2-encoded images, an encoder only appeared in Adobe Acrobat 6.0. This, along with other factors, led to the establishment of DjVu as the format of choice for storing scanned documents. The general conclusion is that a DjVu file will be smaller, while a PDF file will be of higher quality (that is, more accurate).
With PDF documents, one can zoom in on vector-based content to an arbitrary depth, or print it at an arbitrarily high resolution, without introducing the quality loss or jaggedness inherent to raster formats. But if a PDF is simply used as a container for non-vector images (such as scans), those images gain nothing from this. Also, a vector format can always be converted to a raster format, usually with irreversible data loss, but the reverse conversion is very difficult.
PDF is most useful when the original source is an electronic document such as a Microsoft Word document or a TeX file. Such documents benefit most from the vector graphics technology that underlies PDF. DjVu files can be marginally smaller, but they only deliver a high-resolution image, possibly enriched with the associated text.
DjVu is very good for image files, and has been optimized especially for scanned text and images. However, PDF could be the better choice if the scanned raster images can be transformed into high-quality vector graphics, for instance by applying optical character recognition to the scanned image, identifying the fonts, and carefully proofreading the resulting file. This procedure is often too time-consuming. Suitable fonts might not be available, or one may want to preserve the original document more exactly, including signatures, marginal comments, paper texture, or other markings. In such cases, DjVu is the better choice.
At present, the most advanced method for compressing scanned bitonal documents seems to be Cartesian Perceptual Compression. Its size/quality ratios are matched by neither DjVu nor PDF. However, this compression format enjoys only limited popularity, since it is a closed file format and codec protected by a US patent.