summaryrefslogtreecommitdiff
path: root/algorithm.doc
diff options
context:
space:
mode:
Diffstat (limited to 'algorithm.doc')
-rw-r--r--algorithm.doc165
1 files changed, 165 insertions, 0 deletions
diff --git a/algorithm.doc b/algorithm.doc
new file mode 100644
index 0000000..c8c3a92
--- /dev/null
+++ b/algorithm.doc
@@ -0,0 +1,165 @@
+1. Algorithm
+
+The deflation algorithm used by zip and gzip is a variation of
+Lempel-Ziv 1977 [LZ77]. It finds duplicated strings in
+the input data. The second occurrence of a string is replaced by a
+pointer to the previous string, in the form of a pair (distance,
+length). Distances are limited to 32K bytes, and lengths are limited
+to 258 bytes. When a string does not occur anywhere in the previous
+32K bytes, it is emitted as a sequence of literal bytes. (In this
+description, 'string' must be taken as an arbitrary sequence of bytes,
+and is not restricted to printable characters.)
+
+Literals or match lengths are compressed with one Huffman tree, and
+match distances are compressed with another tree. The trees are stored
+in a compact form at the start of each block. The blocks can have any
+size (except that the compressed data for one block must fit in
+available memory). A block is terminated when zip determines that it
+would be useful to start another block with fresh trees. (This is
+somewhat similar to compress.)
+
+Duplicated strings are found using a hash table. All input strings of
+length 3 are inserted in the hash table. A hash index is computed for
+the next 3 bytes. If the hash chain for this index is not empty, all
+strings in the chain are compared with the current input string, and
+the longest match is selected.
+
+The hash chains are searched starting with the most recent strings, to
+favor small distances and thus take advantage of the Huffman encoding.
+The hash chains are singly linked. There are no deletions from the
+hash chains, the algorithm simply discards matches that are too old.
+
+To avoid a worst-case situation, very long hash chains are arbitrarily
+truncated at a certain length, determined by a runtime option (zip -1
+to -9). So zip does not always find the longest possible match but
+generally finds a match which is long enough.
+
+zip also defers the selection of matches with a lazy evaluation
+mechanism. After a match of length N has been found, zip searches for a
+longer match at the next input byte. If a longer match is found, the
+previous match is truncated to a length of one (thus producing a single
+literal byte) and the longer match is emitted afterwards. Otherwise,
+the original match is kept, and the next match search is attempted only
+N steps later.
+
+The lazy match evaluation is also subject to a runtime parameter. If
+the current match is long enough, zip reduces the search for a longer
+match, thus speeding up the whole process. If compression ratio is more
+important than speed, zip attempts a complete second search even if
+the first match is already long enough.
+
+The lazy match evaluation is not performed for the fastest compression
+modes (speed options -1 to -3). For these fast modes, new strings
+are inserted in the hash table only when no match was found, or
+when the match is not too long. This degrades the compression ratio
+but saves time since there are both fewer insertions and fewer searches.
+
+
+2. gzip file format
+
+The gzip file format was standardized in Internet RFC 1952 [RFC1952].
+This section briefly describes the format and comments on some
+implementation details.
+
+The pkzip format imposes a lot of overhead in various headers, which
+are useful for an archiver but not necessary when only one file is
+compressed. gzip uses a much simpler structure. Numbers are in little
+endian format, and bit 0 is the least significant bit.
+A gzip file is a sequence of compressed members. Each member has the
+following structure:
+
+2 bytes magic header 0x1f, 0x8b (\037 \213)
+1 byte compression method (0..7 reserved, 8 = deflate)
+1 byte flags
+ bit 0 set: file probably ascii text
+ bit 1 set: header CRC-16 present
+ bit 2 set: extra field present
+ bit 3 set: original file name present
+ bit 4 set: file comment present
+ bit 5,6,7: reserved
+4 bytes file modification time in Unix format
+1 byte extra flags (depend on compression method)
+1 byte operating system on which compression took place
+
+2 bytes optional part number (second part=1)
+2 bytes optional extra field length
+? bytes optional extra field
+? bytes optional original file name, zero terminated
+? bytes optional file comment, zero terminated
+2 bytes optional 16-bit header CRC
+? bytes compressed data
+4 bytes crc32
+4 bytes uncompressed input size modulo 2^32
+
+The format was designed to allow single pass compression without any
+backwards seek, and without a priori knowledge of the uncompressed
+input size or the available size on the output media. If input does
+not come from a regular disk file, the file modification time is set
+to the time at which compression started.
+
+The time stamp is useful mainly when one gzip file is transferred over
+a network. In this case it would not help to keep ownership
+attributes. In the local case, the ownership attributes are preserved
+by gzip when compressing/decompressing the file. A time stamp of zero
+is ignored.
+
+Bit 0 in the flags is only an optional indication, which can be set by
+a small lookahead in the input data. In case of doubt, the flag is
+cleared indicating binary data. For systems which have different
+file formats for ascii text and binary data, the decompressor can
+use the flag to choose the appropriate format.
+
+The extra field, if present, must consist of one or more subfields,
+each with the following format:
+
+ subfield id : 2 bytes
+ subfield size : 2 bytes (little-endian format)
+ subfield data
+
+The subfield id can consist of two letters with some mnemonic value.
+Please send any such id to <gzip@gnu.org>. Ids with a zero second
+byte are reserved for future use. The following ids are defined:
+
+ Ap (0x41, 0x70) : Apollo file type information
+
+The subfield size is the size of the subfield data and does not
+include the id and the size itself. The field 'extra field length' is
+the total size of the extra field, including subfield ids and sizes.
+
+It must be possible to detect the end of the compressed data with any
+compression format, regardless of the actual size of the compressed
+data. If the compressed data cannot fit in one file (in particular for
+diskettes), each part starts with a header as described above, but
+only the last part has the crc32 and uncompressed size. A decompressor
+may prompt for additional data for multi-part compressed files. It is
+desirable but not mandatory that multiple parts be extractable
+independently so that partial data can be recovered if one of the
+parts is damaged. This is possible only if no compression state is
+kept from one part to the other. The compression-type dependent flags
+can indicate this.
+
+If the file being compressed is on a file system with case insensitive
+names, the original name field must be forced to lower case. There is
+no original file name if the data was compressed from standard input.
+
+Compression is always performed, even if the compressed file is
+slightly larger than the original. The worst case expansion is
+a few bytes for the gzip file header, plus 5 bytes every 32K block,
+or an expansion ratio of 0.015% for large files. Note that the actual
+number of used disk blocks almost never increases.
+
+Jean-loup Gailly
+gzip@gnu.org
+
+References:
+
+[LZ77] Ziv J., Lempel A., "A Universal Algorithm for Sequential Data
+Compression", IEEE Transactions on Information Theory, Vol. 23, No. 3,
+May 1977, pp. 337-343.
+
+[RFC1952] Deutsch P., "GZIP file format specification version 4.3",
+Internet RFC 1952, May 1996, <http://www.ietf.org/rfc/rfc1952.txt>.
+
+APPNOTE.TXT documentation file in PKZIP 1.93a (October 1991). This
+version no longer seems to be available online; the latest version is
+in <http://www.pkware.com/documents/casestudies/APPNOTE.TXT>.