ZIP2 file format specification

by John M. Dlugosz
john@dlugosz.com
http://www.dlugosz.com/ZIP2/

Last edited 15 Aug 2000 11:43 PM CDT

ZIP2 file format specification
1 Why
2 Design Goals
3 Overview
3.1 Why ZIP2 files are smaller
3.2 Why ZIP2 files are easier to manipulate
3.3 Multi-File Spanning Archives
3.4 Extensibility
3.5 Scenarios
3.5.1 Extracting a File
3.5.2 Adding a File
3.5.3 Replacing a File
3.5.4 Splitting an Archive
3.5.5 Acting as a Filter
4 Structure
4.1 Non-Chunk Header
4.2 Fill Bytes
4.3 Data Type Definitions
4.4 Chunks
4.4.1 Structure Definition
4.4.2 Chunk Type (name)
4.4.3 Instances
5 Chunk Types
5.1 [ME]
5.2 CMNT, cMNT
5.3 COMP, cOMP
5.4 Comp
5.5 CRYP, cRYP
5.6 Cryp
5.7 DATA, dATA
5.8 Data
5.9 DICT, dICT
5.10 Dict
5.11 DICt
5.12 ENdX
5.13 FaTR, faTR
5.14 ICON
5.15 IDEN
5.16 INDX, iNDX
5.17 Indx
5.18 MaCF, maCF
5.19 Macf
5.20 ROOT
5.21 RTFM
5.22 SECU, sECU
5.23 SHAC, sHAC
5.24 ShAI, shAI
5.25 Shai
5.26 SIGN, sIGN
5.26.1 Security Level Indicator
5.27 SIgn
5.28 TiME, tiME
5.29 TOcN, TOcn
5.30 TREE, tREE
5.31 Tree
5.32 TREe
5.33 UnIX, unIX
5.34 XXXX
5.35 User Defined
6 Built-In Chunk Definitions
6.1 COMP
6.2 CRYP
6.3 DICT
6.4 TREE
7 Notes on Archive Entries
7.1 The Index Structure
7.2 Entry Names
7.3 Subdirectories
8 Compression Methods
8.1 Deflation (method 1)
8.2 UTR-6 (method 2)
8.3 Run-Length, Repeat, Delta (method 3)
8.4 Huffman Encoding (method 4)
8.5 Composition (method 5)
8.6 Burrows-Wheeler Transform (method 6)
9 Encryption
10 Details of Algorithms
10.1 CRC-32
10.2 UTF-8
10.3 Długosz’ Variable-Length Integer Encoding
10.3.1 Introduction
10.3.2 Overview
10.3.3 Benefits and Features
10.3.4 Criticisms and Possible Improvements
10.3.5 Details
11 Conformance
11.1 Baseline 2000
11.2 Robust 2000

1 Why

2 Design Goals
  • spanning archives

Implementations of ZIP managers often leave out all support for multi-file spanning archives. Those that do have it provide very little support. Often all you can do with a split archive is extract, not add or otherwise manipulate. Also, most tools assume that spanning archives will be on removable media.

The lack of support stems from the poor design of spanning in the old ZIP specification.

ZIP2 is designed to make it easy to handle multi-file spanning archives.

  • orthogonal security model

ZIP files have a bit for "encrypted", but the standard ZIP encryption method is easily broken by cryptanalysis. In fact, many programs exist to break password-protected ZIP files, and it takes a fraction of a second to do so.

Good algorithms are available. It is a given that these algorithms must continue to be revised to keep up with advances in cryptanalysis and increases in CPU speed and storage.

Yet it is problematic to use a different algorithm in a classic ZIP file. If you flag it as encrypted, it won’t decode correctly on a standard decompressor. You just can’t do it without violating the standard format.

In ZIP2, all data can be encrypted. This applies to the file data (DATA chunks) as well as other information such as the list of file names and file comments. The encryption method is specified, so new methods can be added without breaking the file format.

  • extensible

Programs can adopt the ZIP2 file format and customize it to hold additional information, while still maintaining compatibility with standard tools.

The standard tools will be backward compatible with older versions, as the specification is updated.

The standard tools can grow to accommodate new things without the need for new code—various data is supplied as otherwise ordinary ZIP2 files, rather than being built into the program. These files can be updated, and used to upgrade any decoder regardless of platform or machine instruction set.

See §3.4, Extensibility.

  • easy to manipulate

See §3.2, Why ZIP2 files are easier to manipulate.

  • well documented
  • better compression

See §3.1, Why ZIP2 files are smaller.

  • non-ASCII characters

Classic ZIP files use 8-bit characters for file names, with no specification as to what character set is used. Presumably, it uses the native character set of whoever created the archive, which might not match the native character set of whoever tries to extract files from it.

Furthermore, file systems (including NTFS and the UDF used for DVDs) can use character sets with more than 256 characters.

All text in ZIP2 uses Unicode.

  • handles large files

A 32-bit length value is quickly becoming a problem. As I write this, inexpensive hard disks are in the 20–30 GB range, and removable media (DVD-RAM) is 2.2 GB. So it’s not out of the question that a user could have a file larger than 4GB in size.

  • Macintosh support

 

3 Overview

3.1 Why ZIP2 files are smaller

ZIP2 files are smaller than the equivalent ZIP files even though both use the same compression algorithm (deflation). Here’s why:

  • Multiple files can be compressed in one batch, rather than starting the compressor over for every file. This is a major improvement for small files. A gzip’ed TAR file contrasted with a ZIP file shows the same benefit.
  • Files can be compressed with a standard "dictionary", giving the compressor a head start. For example, ZIP2 knows a standard dictionary containing all the keywords in C++, so C and C++ source files compress better.
  • The directory, or list of all the file names, is smaller. ZIP2 has subdirectories, while ZIP files store the full name of every file, duplicating the directory name. Also, the whole directory in ZIP2 can be compressed as well.
3.2 Why ZIP2 files are easier to manipulate

3.3 Multi-File Spanning Archives

3.4 Extensibility

The chunk system makes it easy to add previously unknown information to newer versions of the ZIP2 specification. Well-behaved user-defined chunks allow application-specific information to be associated with stored files, using the identical system as the standard items.

 

3.5 Scenarios

3.5.1 Extracting a File

Easy

The user asks the program to extract foo.txt from testfile.zip2.

          The program quickly skims the ZIP2 file to see which chunks are present. One of the chunks is the list of files contained in the archive. Reading that chunk (INDX #0) and consulting the information in it, it determines that foo.txt is number 42. There are no file positions or offsets stored. Rather, everything refers to chunk instances, which can be anywhere in the file.

          The chunk DATA #42 contains the compressed and encrypted data for the file. This chunk is read and decoded, and the file foo.txt written.

Multi-File Archive

3.5.2 Adding a File

Easy

The user asks the program to add bar.txt to the existing archive testfile.zip2.

          The program sees that the directory (INDX #0) happens to be the last in the file. This is especially convenient, and the program proceeds by reading all the information from this chunk, and deleting it, thus shortening the file.

          A new DATA chunk is appended to the file containing the content from bar.txt. It is assigned a number previously unused, DATA #99.

          A record for bar.txt is added to the directory information, and an updated INDX chunk written, once again at the end of the file.

More Difficult

        The user asks the program to add bar.txt to the existing archive testfile.zip2.

        The directory is not at the end of the file, so deleting it leaves a hole in the file. Rather than having to rewrite the entire archive, this deleted area can be re-used.

        The new DATA #99 chunk is split into two parts. Part 1 is sized to exactly fit into the hole left by deleting the INDX #0 chunk. Part 2, the remainder, is appended to the file.

        Finally, the updated INDX #0 chunk is written.

3.5.3 Replacing a File

Shorter

The user asks the program to update bar.txt in the archive testfile.zip2.

The INDX chunk indicates that bar.txt is file number 99. DATA chunk #99 is overwritten in-place, but the new one is shorter than the old one. The unused space is claimed with an XXXX chunk or fill bytes.

          The INDX chunk is updated with the new file length and date. The result may be a different size than the original, due to the use of variable-length number fields. Also, if the INDX chunk is compressed, changing a couple bytes could very well change the compressed size.

The deleted INDX chunk either leaves a hole or sits at the end of the file, as explained in the other cases. Writing the new INDX chunk is handled in exactly the same way as the DATA chunk: it can use an XXXX chunk if shorter, and be split if longer.

Longer

        The user asks the program to update bar.txt in the archive testfile.zip2.

        The INDX chunk indicates that bar.txt is file number 99. DATA chunk #99 is overwritten in-place, but the new one is longer than the old one.

        The DATA #99 chunk is split into two parts. Part 1 is sized to fit in the old location, and Part 2, the remainder, is appended to the file.

        The INDX chunk is updated with the new file size and timestamp, as explained elsewhere.

3.5.4 Splitting an Archive

Suppose a server has a monster-sized archive for people to download. Rather than downloading the whole thing at once, a CGI program lets the user pick any size desired, and the archive is divided into multiple files of at most that size.

        The program needs to split the archive on the fly, with an arbitrary file size.

        Start by positioning the input at the first chunk, just after the non-chunk header.

        Open a new output file, and start by writing a non-chunk header. Subtract the length of the non-chunk header from the threshold length.

        Copy one chunk at a time from input to output. When you reach a chunk that won’t fit in the output without going over the threshold size, either leave it out (to be the first thing in the next file) or split the chunk into two parts. In the former case, the output file will be smaller than the threshold size. In the latter case, write the first part to the output, then start a new output file. The second part of the split chunk is the first chunk written to the new file.

        Repeat until all chunks are copied. The program could also generate chunks that specify which part contains which (regular) chunk, and add them to the last output file. Even if it doesn’t do that for every interesting chunk, it ought to do so for the INDX #0 at least.

        The program can also add IDEN chunks. This is highly recommended, as it will identify the files even if the browser munged the file names when downloading.

        The program will choose not to split a chunk if it’s one that can’t be split, or if the first part of the split chunk would be very small.

3.5.5 Acting as a Filter

When ZIP2 is used as a filter, two issues arise that the format does handle: input of unknown length, and output that is not a seekable file.

An input of unknown length is handled easily because the length of the file is only present in the INDX entry, which may be emitted last.

Of more concern is writing the ZIP2 archive to a transmission channel as it works, rather than buffering the entire job before it can write the DATA chunk. After all, the DATA chunk starts with a length.

The solution is to write multiple DATA chunks. The program can be compressing incoming data and buffering the compressed data for a short time. When the consumer is ready for more output, the compressed data at hand can be wrapped in a dATA chunk, specifying the length thus far. Meanwhile, compression of input continues. When the input is finally ended, and compression completes, a final DATA chunk is emitted. If necessary (that is, you didn’t see the end coming) the final DATA chunk can be zero-length.
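As a rough sketch of that flow, in the style of Listing 1 (write_chunk_part is a hypothetical helper, not defined by this specification, that emits one complete chunk with the given type name, instance, part number, and payload):

    # Flush buffered compressed bytes as "continued" dATA parts while
    # streaming, then close with a final DATA part (possibly empty).
    sub flush_compressed {
        my ($instance, $part, $bytes, $final) = @_;
        my $type = $final ? 'DATA' : 'dATA';    # lower-case 'd' = "continued" flag set
        write_chunk_part($type, $instance, $part, $bytes);   # $bytes may be '' at the end
    }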

4 Structure

A ZIP2 file has the following structure: The bulk of the file is a collection of chunks concatenated together, possibly with fill bytes. Before this is a short non-chunk header, so called because it’s the only part of the file that’s not formatted as chunks.

4.1 Non-Chunk Header

This short segment allows a file to be easily identified as a ZIP2 file, and provides for embedding a ZIP2 data stream in a larger file.

      Signature: 4 bytes containing "ZIP2"

      Usage text: at most 128 bytes (and as little as 0 bytes) of text, intended to explain what to do with this file type, not be a comment about the contents of this specific archive. For example, "See http://www.dlugosz.com/ZIP2/ for information on the ZIP2 file format.".

Post-signature: 11 or 15 bytes containing "\x1a\x04\xfe" followed by a uint32 or uint64 length followed by "ZIP2" again. The first part terminates the text on most systems (^Z or ^D), so the first signature and usage text can be typed (or cat’ed) as a text file. The hex FE makes sure the high bits have not been lost, and is also a byte that is never present in a UTF-8 string. This means that the end of the usage text can be located unambiguously by scanning for a hex FE byte. Then the signature is repeated just to make it less likely that a random text file will accidentally look like a ZIP2 file.

      The 4-byte length may be hex FFFFFFFF which means "unspecified". Otherwise, it is taken to be the length of the body (chunked part) of the data, and anything after that is ignored. This provides for embedding a ZIP2 data stream inside a larger file.

      The length is not a variable-length field to facilitate in-place updates as the file changes.
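Here is a minimal Perl sketch of validating this header and recovering the body length. It assumes a binary-mode filehandle and the 4-byte length form; the uint64 form would be read analogously.

    sub read_header {
        my ($fh) = @_;
        local $/ = "\xFE";                 # the usage text ends at the unambiguous FE byte
        my $head = <$fh>;                  # "ZIP2", usage text, then "\x1a\x04\xfe"
        die "not a ZIP2 file"
            unless defined $head and $head =~ /^ZIP2/ and substr($head, -3) eq "\x1a\x04\xfe";
        read($fh, my $rest, 8) == 8 or die "truncated header";
        my ($length, $sig2) = unpack 'N a4', $rest;   # uint32 length, then "ZIP2" again
        die "bad second signature" unless $sig2 eq 'ZIP2';
        return $length == 0xFFFFFFFF ? undef : $length;   # undef means "unspecified"
    }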

4.2 Fill Bytes

To facilitate deleting or resizing chunks in-place, fill bytes may be used between chunks. A fill byte has the value of 0.

For regions larger than 12 bytes, use an XXXX chunk to mark the unused area in the file. The fill bytes are designed for gaps smaller than the minimum size of a regular chunk (which therefore can’t hold an XXXX chunk).

Since the chunk begins with a length, and a single zero byte is read as a length of zero, this elegantly fits the definition of a degenerate chunk. Code that walks the chunks in a file will easily handle the presence of fill bytes (see Listing 1).

4.3 Data Type Definitions

lpstring: All text fields are encoded as Unicode in UTF-8 format, length-prefixed with a uintV specifying the number of bytes (not characters) that follow.

      For example, "Hello" would be encoded as hex 05 48 6C 6C 6F, and “Długosz” as hex 08 44 C9 82 75 67 6F 73 7A, and the complete text of Moby Dick as hex D2 54 F1 43 41 6C 6C 20 6D 65 20 49 73 68 6D 61 65 6C … 1,201,374 bytes omitted for brevity … 61 67 6F 2E.

      sintV: A signed variation of uintV.

      uint16: A 16-bit unsigned value. It is stored as big-endian, because this is already established as the "network standard", and it’s consistent with the byte ordering of the variable-length values and UTF-8 encoding.

      uint32: A 32-bit unsigned value. It is stored as big-endian, because this is already established as the "network standard", and it’s consistent with the byte ordering of the variable-length values and UTF-8 encoding.

      uint64: A 64-bit unsigned value. It is stored as big-endian, because this is already established as the "network standard", and it’s consistent with the byte ordering of the variable-length values and UTF-8 encoding.

uintV: A variable-length unsigned value. It may represent an integer of any length, and require from 1 to an arbitrary number of bytes. The encoding is detailed in §10.3.

      text: When text is found as the body of a block whose length is already known, it is not necessary to use a length prefix. This is specifically true when the content of a chunk’s Payload is text, as with file comments.

      The text is in UTF-8 format. Line-break characters are code hex 0A, which is the "\n" in C and Perl. This encodes as one byte of UTF-8, while U+2028 or U+2029 would require three.

time64: An 8-byte, big-endian time as used for FILETIME in Windows. It has a 100 nanosecond granularity, the finest of any common system. Times are always stored in Coordinated Universal Time (UTC).

4.4 Chunks

The bulk of the file is made up of chunks. The chunks are simply concatenated together, and each chunk has the same layout so they may be traversed without actually decoding them all.

4.4.1 Structure Definition

Length: uintV. This is the total length, in bytes, of the chunk not counting the length field itself.

        A simple program can traverse the chunks by looking at the length field.

        Listing 1

use Fcntl qw(SEEK_CUR);

sub skim_file ($)
{
    my $file= shift;   # the input file, positioned just past the non-chunk header

    while (!eof($file)) {
        my $length= read_uintV ($file);   # see section 10.3
        if ($length >= 4) {   # otherwise can't have 4 byte type tag
            my $type;
            read ($file, $type, 4);
            print "chunk type: $type, length: $length\n";
            seek ($file, $length-4, SEEK_CUR);   # skip the rest of this chunk
        }
        else {
            seek ($file, $length, SEEK_CUR);   # a degenerate chunk: skip any fill bytes
        }
    }
}

        Note that the logic of reading the Length, then seeking that many bytes to get to the next chunk will naturally handle the fill bytes between chunks. In general, such code that skims a ZIP2 file looking for something in particular should check the length before assuming the chunk contains subsequent fields. If the length is less than 4 bytes, it’s improper to read a Type field within that chunk’s extent! With this logic, fill bytes are simply degenerate chunks.

        Type: 4 byte name field. This indicates the content of the chunk, and is described in detail in §4.4.2 below.

        Instance_number_range: 2 uintV. Starting number and count of how many instances are in this chunk. The chunk instance number is unique to the type, but different types may have the same number without conflict. Chunks are referenced by number, not by position in the file (or even which file they are in, in spanning archives), so this number is very important. The number 0 may be repeated, as it indicates "not applicable" or "don’t care".

        For example, hex 07 01 is simply instance seven. Hex 07 10 means that this chunk holds instances seven through twenty-two, inclusive.

Part: uintV. A chunk’s content may be arbitrarily chopped up into smaller chunks. When this happens, each is given the same number and consecutive parts, starting with 0. For example, DATA chunk #3 may contain compressed content of a file, and it is chopped up into dATA #3 pt. 0 and DATA #3 pt. 1 (and each may be in a different file, for spanning archives).

        Payload_specification: This is data that is not present in every chunk type, so is considered part of the chunk-specific data (content). However, it applies to all instances, so it appears before the Payload and is not repeated in each data packet. It consists of 2 uintV’s, specifying the Compression and the Encryption, respectively.

        Payload: The actual content. Each packet of data (each instance number is associated with one packet), except the last, is prefixed with its length (as a uintV) and the whole thing concatenated together. Fully-structured chunk types omit the length, indicated by the second name flag.

Checksum: uint32. A CRC-32 (see §10.1) check of the entire chunk, including the Length, through the end of the Data. The algorithm is exactly as with classic ZIP files, but the 1’s complement is stored.
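Putting the fields together, here is a hedged sketch of emitting a chunk that holds a single instance. write_uintV is the assumed §10.3 encoder, and $payload is taken to already include any Payload_specification the chunk type calls for.

    use String::CRC32 qw(crc32);

    sub write_chunk {
        my ($fh, $type, $instance, $payload) = @_;
        my $body = $type
                 . write_uintV($instance) . write_uintV(1)   # Instance_number_range: start, count
                 . write_uintV(0)                            # Part 0 (not continued)
                 . $payload;
        my $chunk = write_uintV(length($body) + 4) . $body;  # Length counts the 4-byte Checksum too
        $chunk .= pack 'N', ~crc32($chunk) & 0xFFFFFFFF;     # 1's complement of the classic CRC, per section 10.1
        print {$fh} $chunk;
    }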

4.4.2 Chunk Type (name)

The chunk is named using a system similar to PNG/MNG, in that 4 bytes are chosen so they happen to be mnemonic as ASCII characters, and the 0x20 bit positions are flags, so that flags are visible as lower-case letters.

        The first flag indicates that the chunk is "continued". That is, it is not the last or only part. The last part (which may be the only part) has this flag cleared.

        The second flag indicates that the chunk data is fully structured, and that multiple instances can be concatenated together without including the length of each one. You can always tell where one ends and the next begins.

        The third flag indicates that the chunk is redundant. That is, the information can be calculated or found, but is included for efficiency. For example, a directory of what can be found in one part of a multi-part archive allows the part to stand alone too, but the information is redundant with what’s in the master directory on the last piece of media.

        The fourth flag indicates that this chunk is a pointer to the chunk that might have been present here. It can specify which file (in a spanning set) the chunk can be expected to reside in, point to another file, or indicate a standard definition.
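Since the flags are just the 0x20 bits of the four letters, extracting them is trivial. For example:

    sub name_flags {
        my ($name) = @_;                  # e.g. "dATA"
        return map { (ord(substr($name, $_, 1)) & 0x20) ? 1 : 0 } 0 .. 3;
    }
    my ($continued, $structured, $redundant, $pointer) = name_flags('dATA');
    # gives (1, 0, 0, 0): a continued (split) DATA chunk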

4.4.3 Instances

The chunk instance is a central concept of this file format. Everything refers to some chunk instance, rather than an actual file position.

    Each chunk header indicates not a single instance but a range of instances. Having multiple instances under one header saves overhead for very small chunks, but more importantly allows the entire group to be compressed as one data stream.

    In a range, all the instances’ data are concatenated together. They may be put together along with lengths, or the lengths may not be needed, as the case may be. The resulting combined data is the Payload of this chunk.

    If there is a Payload Specification, this data is further processed before being placed in the chunk’s payload area. First it is compressed, then it is encrypted.

    For small items like file comments, timestamps, and other attributes, being able to compress the whole set as one data stream is very important. Yet, the concept of associating an instance of (whatever) with each DATA instance still holds.

    For DATA chunks, being able to combine and compress several small files as a group can improve compression. It’s the exact same mechanism—an implementation can treat chunks uniformly.

5 Chunk Types

5.1 [ME]

Identifies the creator of the archive, especially for purposes of recognizing the meanings of user-defined chunks. No Payload Specification. Details TBA.

5.2 CMNT, cMNT

Comments for the corresponding files. Each instance is UTF-8 text. Note that the text (if it is plain text—it may be compressed and/or encrypted) isn’t preceded by a length as with text fields in structured data. However, when multiple instances are in the same chunk, the length prefix exists for all strings except the last, because it’s part of the chunk format.

5.3 COMP, cOMP

Instances from hex 40 to hex FFF inclusive shall not be used, as these are reserved for built-in standard definitions.

The Payload data starts with a uintV that indicates which compression algorithm is to be used. It is then followed by parameter and configuration information specific to that algorithm. See §8, Compression Methods.

5.4 Comp

Indicates that the specified instances are in a different file. That is, it can tell the program which file in a multi-file archive contains the instance.

      There is no Payload Specification.

      The Payload data is a single uintV, indicating the file number of a multi-part archive that is expected to contain the corresponding COMP chunk. Note that there is only one value, and it indicates the location of all specified instances defined by this chunk.

5.5 CRYP, cRYP

Information about encryption, TBA.

      The Encryption value in a Payload Specification refers to an instance of CRYP.

See §9, Encryption.

      Every chunk pointing to the same instance of CRYP is encrypted using the same key. If the CRYP instance requires prompting for a pass-phrase, then all uses of this instance refer to the same pass-phrase. Another instance of CRYP that points to the same crypto algorithm and parameters would imply that although the same algorithm is used, the user should be prompted again for a (presumably different) pass-phrase.

If the crypto algorithm and parameters don’t require user input, then identical instances are simply redundant.

5.6 Cryp

Indicates that the specified instances are in a different file. That is, it can tell the program which file in a multi-file archive contains the instance.

      There is no Payload Specification.

      The Payload data is a single uintV, indicating the file number of a multi-part archive that is expected to contain the corresponding CRYP chunk. Note that there is only one value, and it indicates the location of all specified instances defined by this chunk.

      See examples under Data.

5.7 DATA, dATA

Contains data, typically the content of a file in the archive. It may be compressed (usually is), and may be encrypted.

      Each compressed file is stored as one DATA instance.

5.8 Data

Indicates that the specified instances are in a different file. That is, it can tell the program which file in a multi-file archive contains the instance.

      There is no Payload Specification. There is a single uintV, specifying the file which is expected to contain all the specified instances of DATA.

For example, hex 0C 44 61 74 61 01 2A 00 03 39 7D A4 50

length: 0C (12 bytes)
"Data"
Instance: 01, count 2A (instances 1 through 42 inclusive)
part: 00
data: 03 (meaning: look at file 3)
checksum: 397DA450

      means that DATA instances 1 through 42 inclusive can be expected in file 3 of a multi-file archive. Note that the chunk header specifies 42 instances, but there is only one data value. It applies to all instances referred to by the chunk’s header. You cannot group forwarding chunks that forward to different places.

5.9 DICT, dICT

This is data used to "seed" a compressor. A dictionary-based compression engine may be pre-loaded with data that is representative of the content being compressed. If multiple files refer to the same dictionary, you save space.

There are standard dictionaries that are known to the decompressor (see §6.3) and are not included in the file. That really saves space when compressing source code. However, you can include your own dictionary in an archive, using this chunk.

      The Instance number may not be between hex 40 and hex FFF, inclusive, as these are reserved for built-in standard dictionaries.

5.10 Dict

Indicates that the specified instances are in a different file. That is, it can tell the program which file in a multi-file archive contains the instance.

      There is no Payload Specification.

      The Payload data is a single uintV, indicating the file number of a multi-part archive that is expected to contain the corresponding DICT chunk. Note that there is only one value, and it indicates the location of all specified instances defined by this chunk.

5.11 DICt

Specifies an "external" dictionary. There is no Payload Specification. The Payload data is a URL of a file that contains a corresponding DICT instance, or another DICt instance that further redirects the search.

      The URL shall be "URL encoded" 8-bit data. E.g. a space will be represented as "%20".

5.12 ENdX

No Payload Specification, always 4 or 8 bytes of data.

      This chunk, if present, must be last in the file. Upon finding this chunk, no further chunks are read. This allows the parser to stop before the physical end of the file, so the ZIP2 data stream can be embedded in a larger file.

A second use is to find the beginning of the data stream. The data in this chunk is a single uint32 or uint64 value that, when subtracted from the starting byte of this chunk, indicates the starting point of the first byte in the data set (the opening "Z" in the signature). This is intended to support self-extracting archives where the data set is appended to an executable file. The file may be opened, and an ENdX chunk (with a correct CRC) found as the last 19 (or 23) bytes in the file. Its presence is interpreted as a signature, and its payload points to the beginning of the archive.

      The instance number, range, and part number must all be zero. Implementations may rely on this chunk being exactly 19 bytes long if the physical file is less than 4 GB, and exactly 19 or 23 bytes otherwise.

      The value is a fixed size, not a uintV, to facilitate reading it backwards. It’s simpler to always look 19 bytes from the end than it is to try each location looking for a matching pattern.
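A sketch of that backward search, assuming the 19-byte (uint32) form described above; a robust reader would also verify the chunk's CRC before trusting it:

    use Fcntl qw(SEEK_END);

    sub find_archive_start {
        my ($fh) = @_;                          # binary-mode filehandle
        seek($fh, -19, SEEK_END) or return;
        read($fh, my $tail, 19) == 19 or return;
        return unless index($tail, 'ENdX') >= 0;        # signature test
        my $offset = unpack 'N', substr($tail, -8, 4);  # uint32 payload, just before the CRC
        my $chunk_start = tell($fh) - 19;
        return $chunk_start - $offset;          # first byte of the opening "Z" in the signature
    }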

5.13 FaTR, faTR

"FAT Attributes", actually FAT and NTFS style attribute bits. Goes with corresponding DATA instance.

5.14 ICON

A small bitmap, to be displayed by graphical environments to represent this archive.

The icon may be one of several standard icons (such as a squeezed file folder or a setup icon), an actual image, a pointer to an icon that’s in the archive already, or an external location.

All methods of representing the icon are found in the ICON chunk, rather than using ICON, Icon, etc. for different means. Likewise, if several ICON instances are present, they must all be grouped into a single chunk. This is to facilitate a shell or other GUI browser quickly and simply finding the ICON data. A program otherwise unaware of the ZIP2 format can step through the chunks using the lengths, and find the ICON chunk, then stop looking.

There is no Payload Specification. ICON data is never compressed (other than the compression implied by the bitmap’s data format) or encrypted.

The Payload is a single byte specifying the type of record, which is followed by data specific to that type.

Type 0: A standard icon. Data is a single uintV specifying which icon to use. See the file standard_icons.zip2 for actual data—each file in it, named n-x.?ng, is a bitmap for standard icon number n. There may be multiple images (different values of x) for one value of n, and all are the same image at different resolutions and color depths. Pick one.

Type 1: An actual image. Data is a PNG or MNG file for the bitmap. There may be multiple instances of ICON with type 1 data, and all are assumed to be the same logical image. However, each instance may be a different resolution or color depth or feature set (e.g. animation). A browser can choose the one it likes the best.

Type 2: An internal icon. The data is a file name (no length prefix) of a file within this archive. The browser should use the icon it normally would use for that file, possibly extracting it to a temporary location.

Type 3: An external icon. The data is a URL of a ZIP2 file, whose ICON is to be used instead. Presumably that file will have multiple Type 1 ICON records and no others.

5.15 IDEN

This identifies this file, especially when part of a multi-file archive.

The Instance number must be zero. The part number corresponds to the file sequence. There is no Payload Specification. The Payload data contains a revision number (a uint16) and the original file name (text).

      The revision number should be incremented when a program updates the file. This can be used to verify that all files in a multi-part archive are in sync (or that you’ve downloaded a matched set). The value is a fixed size to facilitate in-place updates. The value can simply roll over at 64K, if that ever happens.

Each file in a multi-part split archive will have a copy of this chunk, with part number 0 found in the first file of the set, part number 1 in the second file, etc.

5.16 INDX, iNDX

This is the "directory" of files stored in the archive. Instance 0 is the principal directory. Other indexes are used for subdirectories or alternate directories.

An alternate directory can be used to give a different view of what’s in the archive, exposing different files, or the same files with different names or attributes. In a multi-part archive, each file can contain an index showing what’s in that specific file, in addition to the principal directory, containing everything, found in one of the parts.

The Payload of this chunk contains a prelude, followed by a record for each file.

The prelude is:

Type: byte. 1=main, 2=alternate, 3=subdirectory. A main directory is the principal directory if the instance number is zero, and a per-file portion directory otherwise.

FileSystem: uintV. This specifies the file system of the creator. The decoder can use this information to help interpret the contents. 0=unspecified, 1=FAT, 2=HPFS, 3=NTFS, 4=Macintosh, 5=UNIX (elf type binaries), 6=UNIX (other).

Pragmatically, you can watch out for naming differences, know to expect resource forks for Macintosh, or multiple data streams for NTFS.

Count: uintV. This is how many file records follow.

Base_time: time64. This is the starting value used by the delta times.

Each record is:

Type: byte. 0=file, 1=directory, 2=volume label, 3=alternate index, 4=subdirectory index. High-order bits give common attributes: 0x80=private, 0x40=read-only.

A "directory" type indicates a file entry that is the name of a subdirectory rather than a file. It doesn’t have a DATA Instance, but other corresponding Instances (e.g. FaTR and SECU) may be present if applicable to the directory itself.

This contrasts with the "subdirectory index" type, which points to another INDX Instance. That index’s content is implicitly included into this one, with this entry’s name prepended to each entry in the so-included INDX.

See §7.3 for more on subdirectory usage.

Name: lpstring. This is the name of the file, or the description of the index. Names use a universal (file-system independent) system for indicating subdirectories and other special things. See §7.2, Entry Names, for details.

Data: uintV. Where to find the contents data. For a file, this is the instance number of a DATA chunk. For a subdirectory or volume label, this value is ignored. For an index, this is the instance number of an INDX chunk.

Date: sintV. This is the timestamp, expressed as the number of seconds offset from the previous record’s timestamp (the first record is relative to Base_time). Note that the timestamps are also found (optionally) in TiME chunks. This value should match (to its lower precision) the TiME’s value. If there is a discrepancy, this value takes precedence.

Size: uintV. The uncompressed size of the file.
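As an illustration, a decoder for this payload (after any decompression and decryption) might look like the sketch below. read_uintV_buf and read_sintV_buf are assumed cursor-based variants of the §10.3 helpers; they are not part of this specification.

    sub parse_indx {
        my ($buf) = @_;
        my $pos   = 0;
        my $type  = ord substr($buf, $pos++, 1);        # 1=main, 2=alternate, 3=subdirectory
        my $fs    = read_uintV_buf($buf, \$pos);        # creator's file system
        my $count = read_uintV_buf($buf, \$pos);
        my $base  = unpack 'Q>', substr($buf, $pos, 8); # Base_time, a time64 (needs a 64-bit Perl)
        $pos += 8;
        my $seconds = 0;    # running offset, in seconds, from Base_time
        my @records;
        for (1 .. $count) {
            my %r;
            $r{type} = ord substr($buf, $pos++, 1);
            my $len  = read_uintV_buf($buf, \$pos);     # lpstring: byte count, then UTF-8
            $r{name} = substr($buf, $pos, $len);   $pos += $len;
            $r{data} = read_uintV_buf($buf, \$pos);     # DATA (or INDX) instance number
            $seconds += read_sintV_buf($buf, \$pos);    # Date: delta from the previous record
            $r{time} = $base + $seconds * 10_000_000;   # time64 has 100 ns granularity
            $r{size} = read_uintV_buf($buf, \$pos);
            push @records, \%r;
        }
        return { type => $type, filesystem => $fs, records => \@records };
    }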

5.17 Indx

Indicates that the specified instances are in a different file. That is, it can tell the program which file in a multi-file archive contains the instance.

      There is no Payload Specification. There is always a single uintV of data, specifying the file which is expected to contain all the specified instances of INDX.

5.18 MaCF, maCF

Each instance holds a Macintosh Finder Info record (16 bytes) for the corresponding DATA instance.

      This is native information on the Macintosh computers, specifying the file’s type and creator. If extracted on other platforms, this information can be used to add file extensions or otherwise indicate the file’s "type" on the target system.

5.19 Macf

Indicates that the MaCF chunks are in another file in a multi-part archive.

There is no Payload Specification. There is always a single uintV of data, specifying the file which is expected to contain all the specified instances of MaCF.

5.20 ROOT

An Instance of ROOT corresponds with an Instance of INDX, and supplements it. The value in the Payload of ROOT is an absolute file path that specifies where the files originally came from, or otherwise where they should be extracted to.

When absolute paths are needed (perhaps the archive represents a backup), don’t put absolute paths into the INDX. Instead, the paths in INDX are relative to a starting point, normally specified upon extraction. The value in ROOT is a suggested default for extraction.

The value in ROOT is specified using the actual syntax for the filesystem noted in the corresponding INDX. This allows things that cannot be represented in the abstracted simple tree structure. For example, drive letters and UNC names for DOS-inspired file systems can be used. However, the control codes as used with INDX file names may be used also, so this value could be file-system independent if there is no need for file-system specific features.

In addition, substitutions are made in this string, so it may specify a logical location without knowing its actual name. For example "the temp directory", or "the home directory" or "the system directory".

The substitution codes are specified by one of the words listed below (case sensitive), surrounded by a pair of hex 1E characters (RS). For example (to use C string literal syntax), "\x1eTEMP\x1e\x1cMydir" specifies the Mydir subdirectory in the "temp" directory, whatever that may be.

Here is a list of special places.

home: The "home" directory.

ProgramFiles: The place under which programs are installed. Corresponds to the "Program Files" directory under Windows, and the /usr/bin directory under Unix.

temp: The directory for temporary files.

Windows: (Windows systems) The Windows directory. This is returned by the GetWindowsDirectory API function.

WindowsSystem: (Windows systems) This is returned by the GetSystemDirectory API function.

5.21 RTFM

A comment that applies to the entire archive. Instance 0 should be displayed to the user whenever the file is manipulated (such as a title bar or status bar), and it should be kept short. Instance 1 and higher are displayed when asked, and may be longer. Instance 2 is for a copyright statement. Instance 3 should contain a URL for contact information or home site, and no other characters.

      This may be repeated in each file of a multi-file split archive.

5.22 SECU, sECU

Security information, Win32-style. The ACL information for all the files can become lengthy and redundant. Rather than specifying a fancier indexing scheme to centralize ACL data, this simple representation is simply compressed. Remember, the Payload Specification specifies a compression method, and this chunk can use Deflation with a dictionary with excellent results.

5.23 SHAC, sHAC

This holds an SHA-1 digest (20-byte binary format) for arbitrary chunks in this archive.

      The ShAI chunk type (below) holds information for a decoded file. However, that does not protect other chunks, such as INDX, against corruption or mismatched multi-file sets. This SHAC (SHA-1 for a Chunk) can apply checking to any other part of the archive.

      Possible uses include putting in one file a single SHAC checking all other chunks in a multi-part spanning archive. Or, multiple SHAC instances can be used, each checking different things. A SHAC record in one file might check the previous file in the set. It’s completely flexible, and only real use will tell what forms are most useful for what purposes.

      An Instance contains a 20-byte SHA-1 digest, followed by a set of records nominating which chunks are part of the data to digest. Each record consists of:

      Type: 4 byte name field.

      Instance_number_range: 2 uintV.

Part: uintV.

      These values match the corresponding values in the header of some chunk in the archive (see §4.4.1).

      The nominated chunk must match the definition of some chunk in the archive. That is, if there is a chunk present that’s type INDX, instances 3 through 7, part 0, then you can’t nominate INDX instance 4 only. It has to match—exactly—the actual chunk block in the archive.

      The data input to the SHA-1 algorithm is formed by concatenating the nominated blocks together, in the order in which they are specified. The data includes the entire block header starting with the Length, through the checksum at the end, inclusive.

      A SHAC chunk may not nominate any SHAC chunk.
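Verification is then straightforward. A sketch, where @chunks holds the raw bytes of each nominated chunk (Length through Checksum) in nomination order:

    use Digest::SHA qw(sha1);

    sub verify_shac {
        my ($stored_digest, @chunks) = @_;   # $stored_digest: 20 binary bytes
        return sha1(join '', @chunks) eq $stored_digest;
    }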

5.24 ShAI, shAI

This holds an SHA-1 digest (20-byte binary format) of the corresponding DATA Instance. See http://www.itl.nist.gov/fipspubs/fip180-1.htm for more information on SHA-1.

      This was originally called SHAO, with "O" for "One" because SHA1 is not a legal chunk name. However, the O was confusingly similar to a zero, and it mentally reads as "SHA zero" instead of "SHA O(ne)". So the I for the last character looks like a 1, and is a roman-numeral One. Hopefully this is less confusing.

      Although each chunk contains a CRC-32 checksum, that does not offer full verification of the restored file since a file may be broken into multiple chunks. Just because each chunk was correct doesn’t mean that the parts in a multi-part archive didn’t get mixed up. Also, the chunk-level checksum applies to the encoded data, so won’t catch errors in the encoding or decoding process.

      To address this, and provide a stronger check in the simple cases, this chunk may be used. It holds a 20-byte "digest" of the original file data. Upon restoring a file, this digest may be verified.

      SHA-1 was chosen because MD5 has been compromised. However, the future may bring further improvements in this area, and SHA-1 may be dethroned in favor of something else. So one design idea was to have a single "digest" chunk name (such as DGST) and it would contain a byte specifying which digest method was used, for each Instance. However, this would prevent the instances from being a constant 20 bytes each. Instead, a new chunk name will be introduced when a new digest algorithm is adopted.

      It is recommended that a multi-part spanning archive always contain ShAI records for any file that spans archive parts, and that shAI (a split ShAI) not be split across files.

SHA-1 is only defined on inputs up to 2^64 bits (2^61 bytes). Since the ZIP2 file format has no such limitation on its component files, we need to address this. Compute multiple SHA-1 digests on different parts of the file and then combine them; or hash the whole file down to something smaller and then compute SHA-1 on that? It’s an open issue, but not an urgent one, since a file that large would require millions of hard disks. With exponential growth, this won’t be a problem for 20 years or so. Large data streams that do exist today (scientific data from measurements) ought to use more error checking than a single 20-byte hash, and much smaller blocks with independent error recovery.

5.25 Shai

Indicates that the specified instances are in a different file. That is, it can tell the program which file in a multi-file archive contains the instance.

      There is no Payload Specification. There is always a single uintV of data, specifying the file which is expected to contain all the specified instances of ShAI.

5.26 SIGN, sIGN

This holds a "digital signature" for the original file stored in the corresponding DATA Instance.

      Important open issue: where do we get the keys and the code? There are numerous competing PKI and Certificate systems. PGP code is free and portable, and has a library that’s very good for programmatic use. Has the W3C endorsed or ratified anything yet?

      Details TBA.

      Besides the actual signature, meta data should include the manner in which the signature was made, and a "security level" number.

5.26.1 Security Level Indicator

      The "security level" is a number indicating how secure the signer is, as illustrated below.

The numbers are spaced apart so that additional levels can be added while keeping them in order from least to most safe. However, by defining only a few distinct levels, there will be less confusion about just how to rate something.

      A value of 16 indicates that the ZIP program automatically signed the data, without direct user involvement. Specifically, any automatic script or batch file running on the system (possibly without the user’s knowledge) would produce signed files. This level of signing is not proof against a worm-type virus, as once the user is logged in, any background software can produce signed archives.

      A value of 32 indicates that the user must interact with the system to produce a signature. That might just be a dialog box that says, "OK to sign using default signature?".

      A value of 48 indicates that the passphrase was actually supplied by the user in order to access the private key. A script cannot do that without user interaction.

A value of 64 means that the private key is stored on removable media, such as a CD-R or a smart card, and is only presented to the machine when it is momentarily accessed.

So, what’s to prevent an email worm virus from changing a 16 to a 48? The meta data must be stored inside the signed envelope, so that you cannot change a value without invalidating the signature.

      While too heavy for script-type viruses, a Trojan Horse could include a full-blown ZIP2 creator program that’s been suitably modified to report the level wrong. So, the public keys themselves should be known to be secure to a particular level, offering a cross-check if necessary.

      So if it can be defeated, why report the level at all? Because Windows 2000 has a "public key infrastructure" that seems to provide software running in a particular user’s account access to encrypt/decrypt ability without further prompting, once that user has logged on. If this is used to sign archives, there will be heavy use of level 16, which is useless against malicious scripts. This value was introduced to distinguish this kind of casual but useless signing from a stronger signature.

5.27 SIgn

Indicates that the specified instances are in a different file. That is, it can tell the program which file in a multi-file archive contains the instance.

      There is no Payload Specification. There is always a single uintV of data, specifying the file which is expected to contain all the specified instances of SIGN.

5.28 TiME, tiME

Advanced timestamp information; goes with the corresponding DATA instance number.

      The Payload contains three time64 values, specifying the last-write time, creation time, and last-access time.

5.29 TOcN, TOcn

This is a "Table Of Contents", used to make skimming a ZIP2 file much faster.

Consider a program that skims the entire ZIP2 file, noting the position and header information of each chunk. This is easy to do, but causes a lot of seeks, going through the entire file. Seeking is slow on some media such as CDs, and that is also a prime potential usage of ZIP2 files. Similarly, a remote file can generate many round trips over the network to seek and read a little bit at a time.

      To prevent the need for skimming the entire file, the TOcN chunk may be used. This contains the header information (type and instance) plus the file position of each chunk. When skimming, the program can stop skimming when it sees the TOcN, and simply use the information contained within it instead.

      The TOcN is always instance 0. It defeats the purpose to break it into multiple parts, so there is no tOcN chunk defined.

      The content of the TOcN is a record for each chunk in the file (the TOcN chunk itself may be omitted, as can the TOcn if it precedes the TOcN in the file, but all other chunks shall be listed). Each record contains the Type (4 byte name field), Instance_number_range (2 uintV) and Part (uintV) from the corresponding chunk, and the position in the file (uintV) such that the opening "Z" in the signature is byte zero.

      Now, it would only be useful if the TOcN were found at the beginning of the file. But since its own length would affect the position of everything else, and it will change as the file is updated, that is impossible to arrange unless the file is produced and written as a single creation process. In order to allow files to be updated without being entirely rewritten, the TOcN must be allowed to appear wherever there is room. To this end, the TOcn chunk is used.

      The TOcn chunk points to the TOcN chunk. This can be placed at the beginning of the file (the first chunk is best), and updated in-place when the TOcN is regenerated and relocated.

      The TOcn contains exactly 4 or 8 bytes of data, and is a uint32 or uint64 containing the offset of the TOcN chunk. It is a fixed size to facilitate in-place updates.

5.30 TREE, tREE

This is data used to "seed" a compressor. A Huffman Tree or similar compressor can use this information instead of including it with the compressed data.

      There are standard trees that are known to the decompressor and are not included in the file. However, you can include your own tree in an archive, using this chunk. A single TREE can be useful for compressing all the file comments, for example.

      The Instance number may not be between hex 40 and hex FFF, inclusive, as these are reserved for built-in standard trees.

      There is no Payload Specification. The Payload data is a list of 256 uint16 values. If the data is shorter than 512 bytes, the remaining entries are implicitly hex 8000, a middle value.

      This table is a set of weights, indicating the relative commonality of each byte. This may be used to generate a Huffman tree, but is a general enough specification that it is easily used for other algorithms.

5.31 Tree

Indicates that the specified instances are in a different file. That is, it can tell the program which file in a multi-file archive contains the instance.

      There is no Payload Specification. There is always a single uintV of data, specifying the file which is expected to contain all the specified instances of TREE.

5.32 TREe

Specifies an "external" tree. There is no Payload Specification. The Payload data is a URL of a file that contains a corresponding TREE instance, or another TREe instance that further redirects the search.

      The URL shall be "URL encoded" 8-bit data. E.g. a space will be represented as "%20", and the protocol, such as "http:" shall also be present.

5.33 UnIX, unIX

Advanced attributes, UNIX style. Gives permission bits for owner, group, and others. Goes with the corresponding DATA instance number.

5.34 XXXX

This is a filler used to mark unused space within a file.

      The Instance number, range, and part shall always be zero. There is no Payload Specification.

      The Checksum is unique in this type in that it does not depend on the (ignored) Payload data. Instead, the Checksum is computed as the Length xor’ed with hex 58 58 58 58.

      The data may be left as-is when a chunk is deleted—simply overwrite the 7 bytes after the Length (which is unchanged) with hex 58 58 58 58 00 00 00 and update the Checksum. It is not necessary to read the old data to compute the new Checksum, since it depends only on the Length.
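A minimal sketch of that in-place deletion, assuming a handle opened for update in binary mode and positioned just past the (unchanged) Length field:

    use Fcntl qw(SEEK_CUR);

    sub mark_deleted {
        my ($fh, $length) = @_;
        print {$fh} "XXXX\x00\x00\x00";               # type, instance start, count, part
        seek($fh, $length - 7 - 4, SEEK_CUR);         # leave the ignored payload bytes as-is
        print {$fh} pack 'N', $length ^ 0x58585858;   # Checksum depends only on Length
    }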

      Alternatively, the data can be replaced with all zeros. This takes more time when modifying the data set, but is advantageous because it can take advantage of "sparse file" disk allocation, or further compression during transmission.

To fill areas that are too small for a chunk definition (12 bytes), use fill bytes, as described in §4.2.

5.35 User Defined

A chunk name beginning with "[" (becomes "(" when the "continued" flag is set) indicates data that corresponds to DATA instances. A program that doesn’t know what it is can still keep it associated with the correct file, by understanding this.

A chunk name beginning with "]" (becomes ")" when the "continued" flag is set) implies no such relationship, and programs that don’t know about it can’t manipulate it.

All names beginning with one of these 4 characters and continuing with 3 letters are user-defined, and will never have an official meaning in this specification. Any program generating user-defined chunks is expected to leave a "[ME]" chunk identifying itself, too.

The meaning of the first two letters’ flags must be respected precisely. The meaning of the last two letters’ flags is open to some interpretation, and they ought to be used if they seem to make sense. Flag 2 must be off for "[" (associated with DATA) chunks, because implementations won’t know the implicit size of the individual values.

6 Built-In Chunk Definitions

6.1 COMP

Instance hex 40: Simply stored, no transformation at all.

      Instance hex 41: Deflated (method 1), no dictionary.

      Instance hex 42: UTR-6 (method 2), no parameters applicable.

      Instance hex 43: Huffman encoding (method 4) using the English letter frequency tree (TREE instance hex 40).

      Instance hex 44: Burrows-Wheeler Transform (method 6), no parameters applicable.

6.2 CRYP

Not yet designed. At this point, I’m thinking that CRYP chunks will refer to crypto algorithms and parameters, in an analogous manner as COMP chunks do with compression algorithms.

6.3 DICT

The actual dictionary data is supplied in the standard_chunks.ZIP2 file.

Instance hex 40: No content. This is an empty dictionary that implies no pre-loading takes place.

Instance hex 41: C/C++ source code. All the reserved words and common extensions, in alphabetical order, separated by spaces.

Instance hex 42: Perl source code.

Instance hex 43: Meant for compressing SECU chunks, it contains common ACL entries such as "Everyone".

Instance hex xx: … more to be defined …

6.4 TREE

The actual tree tables are supplied in the standard_chunks.ZIP2 file.

Instance hex 40: English letter frequency data.

7 Notes on Archive Entries

7.1 The Index Structure

7.2 Entry Names

Classic ZIP files use a forward slash character (‘/’) for a directory separator, regardless of platform. This is not good, because the slash is a perfectly legal character in Macintosh file names. In general, any choice of "special" character can clash with some system. So, this Name field uses characters in the C0 control code region as special markers, and provides a way to escape out any character, just in case.

      The control characters are chosen because they have no graphic representation so are unlikely to be used in names, and they encode as one byte in UTF-8.

      All characters in the range hex 00 through hex 1F inclusive (commonly called C0 Control Characters) are reserved. Three of them are defined in this specification, but the rest are reserved for future use. Any control character used in a Name must be escaped out.

      The character hex 1B (ESC) is used to escape out special meanings for characters. The following character is taken as a letter, devoid of special meaning. For example, to include an actual ESC character, use ESC ESC.

      The character hex 1C (FS) is used as the subdirectory separator character.

The character hex 1D (GS) is used as a stream-name separator. The name of a file may be followed by a GS character and the name of a sub-file. For the NTFS file system, this specifies alternate data streams. For Macintosh systems, the stream-name of ‘⌘’ (character hex 2318) specifies the resource fork.
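For illustration, building an entry Name from path components might look like this sketch (make_entry_name is hypothetical, not part of the specification):

    sub make_entry_name {
        my @parts = @_;
        s/([\x00-\x1F])/\x1B$1/g for @parts;   # ESC before any reserved C0 control character
        return join "\x1C", @parts;            # FS separates subdirectory components
    }
    my $name = make_entry_name('src', 'main.cpp');   # "src<FS>main.cpp"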

7.3 Subdirectories

    Files in subdirectories may be stored in separate indices, or qualified and stored in the same index.

    … details and examples

     

8 Compression Methods

We still need a "light duty" algorithm for compressing very short text, such as file comments and file names.

8.1 Deflation (method 1)

The parameter data is a uintV specifying a DICT instance. Note that DICT instance hex 40 is defined as empty, so this is how you use no dictionary.

This is what zlib does. Note that gzip and zip both require oddities—try to avoid that.

The code for zlib (written by Jean-loup Gailly and Mark Adler) is available on a huge number of platforms. It’s well tested, well documented, and actively maintained. It’s thought to be free of patent problems. It’s already standardized as RFC 1950.

      See http://www.cdrom.com/pub/infozip/zlib/zlib_docs.html for details.
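For example, deflation with a preset dictionary through the Compress::Zlib binding might be sketched as below. The dictionary string here is only illustrative; a real encoder would use the payload of the chosen DICT instance (e.g. the C/C++ dictionary, instance hex 41).

    use Compress::Zlib;

    my $dict = 'bool break case char class const continue default do double';
    my $d = deflateInit(-Dictionary => $dict) or die 'deflateInit failed';
    my $source = do { local $/; <STDIN> };   # the data to compress
    my $out = $d->deflate($source);
    $out .= $d->flush();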

8.2 UTR-6 (method 2)

This is a method for representing Unicode data more compactly than UTF-8 when a lot of non-"basic Latin" characters are used. Simply encoding Unicode data as UTF-8 requires a single byte only for ASCII characters, and takes two bytes for the "General Scripts" area, and more bytes otherwise, including CJKV Ideographs. UTR-6 will encode General Scripts as one byte per character through "windowing" the desired character set into a range of 1-byte values.

      For details of the algorithm, see Unicode Technical Report #6 Revision 3.1, "A Standard Compression Scheme for Unicode", located at http://www.unicode.org/unicode/reports/tr6/tr6-3.1.html.

      Note that UTR-6 may be used in conjunction with other compression algorithms. For example, the result of UTR-6 may then be run through a Huffman tree.

    8.3 Run-Length, Repeat, Delta (method 3)

      The general-purpose Deflation method fails to give optimum results in certain cases. This simple method can be applied before deflation to eliminate the pathological cases, and in less severe cases to enhance its performance.

      For example, given a megabyte file containing mostly the letter "M", a standard ZIP file held 2201 bytes of file data. However, that ZIP file was itself zipped down to 208 bytes, a reduction of 91%! Clearly, the first pass did not compress as well as it might have. This simple example illustrates the limitations of a dictionary-based compression system.

      If the same file is run-length compressed as described below, the result is 12 bytes. It is hex 02 84 00 08 01 83 FE 4D 00 02 0D 0A to be exact. (Decoded: 02 starts a type 2 repeat with count 1024 (84 00) of an 8-byte (08) nested stream; the nested stream is a type 1 run (01) of 1022 (83 FE) copies of "M" (4D), then a type 0 verbatim record (00) of the 2 bytes (02) CR LF (0D 0A). That yields 1024 × 1024 bytes, exactly one megabyte.)

      More typically, run-length compression will save room in the dictionary, so deflation can, in general, work better.

      Preliminary experiments indicate that Chromosome 1 of the Human Genome Project is a megabyte smaller when simply Huffman encoded than when deflated. The result can be deflated by another 5%.

      The file in question contains mostly the letters G, A, T, and C. Deflation should be able to seize on the repeating copies of the "conserved sequences", assuming they fit in the sliding window. However, it doesn’t work as well as simply using two bits per character.

      Point is, some kinds of data don’t deflate well.

      A custom encoder could be devised that compresses these data files very well, identifying repeated sequences that would not fit in the deflation "window" and encoding them as type 2 runs. The resulting ZIP2 file could be read with a standard extractor, even though a custom compressor was used.

      This method combines several simple schemes, which may be intermixed within one data stream.

      The data stream consists of a set of runs. Each run starts with a byte specifying its type, and is followed by data specific to that type.

      Type 0 is "verbatim". It is a 0 byte, a uintV length, and then that many data bytes. The decoded output is simply those data bytes unchanged.

      Type 1 is "byte run". It is a 1, a uintV length, and a value byte. The decoded output is length copies of the value byte.

      Type 2 is "repeat". It is a 2, a uintV repeat count, a uintV length, and that many data bytes. The data bytes are decoded, recursively, as runs. The result of that is then repeated count times to produce the output.

      Type 3 is "delta". It is a 3 byte, a uintV length, and that many data bytes. The first data byte is copied to the output unchanged. Each subsequent byte is added to the previous output byte (modulo 256) to form the next output byte.

      Types 4 through 8 are used together for a manual dictionary lookup system. Strings are noted the first time they are encountered, and referenced again some time later. An encoder can perform global analysis of the file, or take advantage of special domain knowledge about the file, to produce repeated strings that are too long or too far apart for Deflation to find. Although a special encoder was used, the standard decoder can be used. For example, a compressor for the Human Genome Data can list the known repeated sequences.

      Type 4 is used to note a string the first time it is seen. It is a 4 byte, followed by a uintV index number that has not been used before (index numbers must be used in order starting with 1; zero has special meaning), a uintV length, and that many data bytes. The data bytes are recursively decoded and copied to the output, and the result is also remembered for later reference. If a Type 4 record appears nested inside another record such that it gets repeated, subsequent instantiations behave as a Type 5: the string is pulled from the memorized table, and the data bytes are skipped rather than decoded again.

      Subsequently, a 5 byte followed by a uintV that is an existing index number will cause the memorized string to be output.

      For fancier use, a 6 byte is a copy-with-changes. A 6 byte is followed by a uintV index number, which refers to a memorized string, followed by a uintV length, followed by that many data bytes. The data bytes are recursively decoded, and the result, not counting the content of nested Type 7 records, must match the length of the memorized string. The content of the decoded data is then treated as a delta to the memorized string: each byte of this record is added (modulo 256) to the corresponding byte of the memorized string, and the sum is output.

      Within a Type 6 record, you may use Type 7 records to insert or remove content. A 7 byte followed by a positive sintV length followed by that many bytes of data is treated as an insertion, which is spliced into the memorized string at that point. The data is recursively decoded, and the result is spliced in at the position of the record, not consuming positions in the Type 6 delta string.

      If a 7 byte is followed by a negative sintV length, the absolute value of that length is deleted from the memorized sequence. The content of the Type 6 delta string will be that many bytes shorter, and should not contain delta bytes for the omitted positions.

      When a memorized string will no longer be needed, it can be deleted from the memorized table by using an 8 byte followed by a uintV index number. This can be used to save memory in the decoder, if desired.
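      To make the record layout concrete, here is a minimal sketch in C of a decoder for types 0, 1, and 3. The read_uintV and emit helpers are hypothetical stand-ins for the real stream and output routines; the recursive types (2, and 4 through 8) are omitted for brevity, and a real extractor would bound the loop by the payload length rather than reading to end of input.

      #include <stdint.h>
      #include <stdio.h>

      extern uint64_t read_uintV(FILE *in);   /* hypothetical uintV reader */
      extern void emit(int byte);             /* hypothetical output sink  */

      /* Decode a stream of runs until end of input. */
      void decode_runs(FILE *in)
      {
          int type;
          while ((type = getc(in)) != EOF) {
              uint64_t i, n;
              switch (type) {
              case 0:                         /* verbatim: n literal bytes */
                  n = read_uintV(in);
                  for (i = 0; i < n; ++i)
                      emit(getc(in));
                  break;
              case 1: {                       /* byte run: n copies of one byte */
                  int value;
                  n = read_uintV(in);
                  value = getc(in);
                  for (i = 0; i < n; ++i)
                      emit(value);
                  break;
              }
              case 3: {                       /* delta: running sum, modulo 256 */
                  unsigned char prev;
                  n = read_uintV(in);
                  prev = (unsigned char)getc(in);
                  emit(prev);
                  for (i = 1; i < n; ++i) {
                      prev = (unsigned char)(prev + getc(in));
                      emit(prev);
                  }
                  break;
              }
              default:                        /* types 2 and 4-8 not shown */
                  return;
              }
          }
      }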

    8.4 Huffman Encoding (method 4)

The parameter data is a uintV naming a TREE instance.

Using data from a TREE chunk, a Huffman tree is created so that the most common letters have the smallest bit sequences. The output data is a set of blocks. Each block is a length (uintV) followed by the result of looking up the bit sequence on each byte of input, with trailing zeros added to make the output a multiple of 8 bits. A single zero byte (looks like a block length of zero) indicates the end of data.

The length is needed because the trailing padding bits could otherwise be confused for more data when decoding. The length is a uintV giving the number of bytes in the unencoded input. That is chosen over the number of bits in the output so it may be emitted before knowing how long the output will be. The decoder will need to stop when the decoded output reaches this expected length, even if there are bits remaining in the encoded data stream.

The wrapping blocks are used because the length is needed. By using multiple blocks, you don’t need to know the total length of the input in advance, which a single length prefix would require. You can compress each buffer-full independently, and output a 0 when you reach the end.

The bytes of encoded data are created from each group of 8 bits, the first bit in the stream being the high bit in the byte.

To form the Huffman tree, start with the TREE data set as leaf nodes. This forms an initial list of 256 nodes.

While there is more than one node in the list, find the two nodes with the smallest weights. Remove them from the list and form a new non-leaf node that points to these two nodes, the zero branch being the node that was closer to the beginning of the list, and the one branch being the other one. The new node’s weight is the sum of the weights of the combined nodes. Add this new node to the end of the list.

The bit string emitted for a byte is defined as the branch choices to get from the root to that byte’s leaf node.
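Here is a minimal sketch in C of that construction, assuming the 256 weights have already been read from the TREE chunk; the node layout and names are illustrative, not mandated by this specification. On equal weights it keeps the node nearer the beginning of the list first, so that node becomes the zero branch as described above.

      #include <stdint.h>

      #define NNODES (256 * 2 - 1)   /* 256 leaves plus at most 255 internal nodes */

      struct node {
          uint64_t weight;
          int zero, one;             /* child indices, or -1 for a leaf */
      };

      /* Build the tree from the 256 weights of a TREE chunk.  Nodes 0
         through 255 are the leaves for byte values 0 through 255; the
         index of the root is returned. */
      int build_tree(const uint64_t weight[256], struct node tree[NNODES])
      {
          int alive[256];            /* the working list of node indices */
          int nalive = 256, count = 256, i, j;

          for (i = 0; i < 256; ++i) {
              tree[i].weight = weight[i];
              tree[i].zero = tree[i].one = -1;
              alive[i] = i;
          }
          while (nalive > 1) {
              /* find the positions of the two smallest weights; strict
                 comparisons keep the earlier node on ties */
              int a = 0, b = 1, lo, hi;
              if (tree[alive[1]].weight < tree[alive[0]].weight) {
                  a = 1;
                  b = 0;
              }
              for (j = 2; j < nalive; ++j) {
                  if (tree[alive[j]].weight < tree[alive[a]].weight) {
                      b = a;
                      a = j;
                  } else if (tree[alive[j]].weight < tree[alive[b]].weight) {
                      b = j;
                  }
              }
              lo = a < b ? a : b;    /* closer to the beginning of the list */
              hi = a < b ? b : a;
              tree[count].weight = tree[alive[lo]].weight + tree[alive[hi]].weight;
              tree[count].zero = alive[lo];
              tree[count].one = alive[hi];
              for (j = hi; j < nalive - 1; ++j)   /* remove both, keeping order */
                  alive[j] = alive[j + 1];
              --nalive;
              for (j = lo; j < nalive - 1; ++j)
                  alive[j] = alive[j + 1];
              --nalive;
              alive[nalive++] = count++;          /* add the new node at the end */
          }
          return alive[0];
      }

The bit string for a byte is then read off by walking from the returned root down to that byte’s leaf (nodes 0 through 255), and a decoder simply follows one branch per input bit until it reaches a leaf.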

    8.5 Composition (method 5)

      This powerful feature allows multiple compression algorithms to be chained together. This means that some compression methods can be designed as preliminary filters that are meant to be applied before deflation.

      The parameter data is a list of steps to compose. Each item in the list is a uintV compression method, a uintV length of the method-specific parameter data, and then that parameter data.

      For example, to specify Run-length compression followed by a Huffman tree using TREE #6, finally fed into deflation with no dictionary, the data would be:

      03 00 04 01 06 01 01 00

      That is: 03 00 selects run-length (method 3) with no parameter data; 04 01 06 selects Huffman (method 4) with one parameter byte naming TREE #6; and 01 01 00 selects deflation (method 1) with one parameter byte naming DICT #0.

    8.6 Burrows-Wheeler Transform (method 6)

This is what libbzip2 does (see http://sourceware.cygnus.com/bzip2/index.html): a complete compression method centered on the Burrows-Wheeler Transform.

The documentation on libbzip2 indicates that some improvements are possible but were omitted to maintain backward compatibility in the bzip file format. However, such improvements could be made for use within the ZIP2 file. This needs to be investigated.

The code for libbzip2 (by Julian Seward) is written in ANSI C and works on a variety of UNIX platforms and Windows. A Macintosh version has been found at http://persephone.cps.unizar.es/~spd/bzip2/. It’s well tested, well documented, and actively maintained. It’s thought to be free of patent problems.

An upcoming version of zlib is supposed to include BWT as a major new feature. Since it’s still in progress, it could be written to learn from libbzip2 and improve upon it. If ready in time, it could be a better candidate for use in this specification.

This algorithm gives better results than deflation (around 5% better) for large files. For small files, deflation works better. When a suitable dictionary is available, small files deflate much better.

  9 Encryption

    ZIP2 features an orthogonal model for encryption, in that any chunk having a Payload Specification can be encrypted.

    As with classic ZIP files, one method of encrypting a file is such that anyone with the right pass-phrase can decrypt it. That is, the session key is generated entirely from the pass-phrase.

    Another way is that the session key is encrypted using one or more public keys, and is targeted to only be read by the holders of the corresponding private keys.

    Although freely-available portable code exists for strong encryption algorithms, and it should be fairly simple to just choose a library, it is probably wise to develop this feature in conjunction with the digital signature capability. That capability is more difficult, in that it needs to work with existing key infrastructures.

  10 Details of Algorithms

    10.1 CRC-32

      This is the same CRC-32 algorithm and polynomial as used in ZIP files.

      The usual code for computing the CRC gets the polarity of the bits the other way around. That is, you have to ones-complement the word to get the right (classic ZIP) answer, and undo that complement on input to the next buffer-full.

      So ZIP2 files represent the result using this more natural bit polarity. That is, a ZIP2 checksum will be bit-for-bit the exact opposite of the ZIP checksum. In C syntax,

      zip2_CRC = ~compute_zip_CRC(data);
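      As a sketch (the table and function names here are illustrative): with the standard table-driven loop for the ZIP polynomial, the register can simply be kept in ZIP2 polarity the whole time, so no complementing is needed between buffers or at the end.

      #include <stdint.h>
      #include <stddef.h>

      extern const uint32_t crc_table[256];  /* the standard ZIP CRC-32 table */

      /* Incremental CRC in ZIP2 polarity.  Start with crc = 0xFFFFFFFF,
         feed each buffer-full in turn, and the running value is what is
         stored in the file.  The classic ZIP checksum would be ~crc. */
      uint32_t zip2_crc_update(uint32_t crc, const unsigned char *p, size_t n)
      {
          while (n--)
              crc = crc_table[(crc ^ *p++) & 0xFF] ^ (crc >> 8);
          return crc;
      }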

       

    10.2 UTF-8

      The UTF-8 Encoding Form of Unicode data is described in the Unicode Standard. The 2.1 book gave a detailed description in an appendix; the 3.0 book gives a hurried description in paragraph D36 in section 3.8.

      This is formally described in RFC 2279, available at ftp://ftp.isi.edu/in-notes/rfc2279.txt or wherever Internet Standards are found.

    10.3 Długosz’ Variable-Length Integer Encoding

      10.3.1 Introduction

        Since the whole idea of the ZIP2 file format is to save room, it is wasteful to store values with more bytes than are usually needed. Also, since the format is designed with the future in mind, what is the upper limit of a value’s size? For example, 32 bits might be a good choice for a length value now, but insufficient in the near future. So it is natural that many integer values will be stored as variable-sized integers. A value will be short when possible, keeping the file efficient, but may be long when necessary.

        My first idea was to simply use UTF-8 encoding. However, it is less efficient for small integers than it could be (the two-byte form only goes up to 2K) and is limited to handling numbers up to 31 bits. A design goal was to handle 64 bits, so this would not do.

        Since I was looking for something different anyway, it became clear that the design considerations for UTF-8 were simply not applicable for my needs. A simpler format would work quite well, and offer other benefits in exchange.

      10.3.2 Overview

        Here is the encoding, in a nutshell.

        A byte with its high bit cleared is a one-byte value, holding a number from 0–127. If the high bit is set, the next highest bit indicates that the value is two bytes long when clear, longer when set.

        This generalizes to a scheme where the first n-1 bits of the encoding are 1, followed by a 0 bit, followed by the value data, for a total length of n bytes.

        Encoded Length    Bits for the Value

              1                   7
              2                  14
              3                  21
              4                  28
              5                  35
              6                  42
              7                  49
              8                  56
              9                  63
             10                  70
              n                 7*n

        So a two-byte value can hold up to 16K, which is sufficient for ordinals most of the time. A 4-byte value can hold up to 256 Meg, which is sufficient for most length values, so lengths are no longer than they would be if variable-length encoding were not used. Full 64-bit values may be represented with 10 bytes, if required. Values may be larger than 64 bits, under this definition.

        This is much more efficient than UTF-8 encoding for these kinds of values. Things where two bytes would be used anyway still take up only 2 bytes; things where four bytes would be used anyway still take up only 4 bytes; yet you have the flexibility of using values that are larger than what was supposed to be typical when the format was drafted.

      10.3.3 Benefits and Features

      10.3.4 Criticisms and Possible Improvements
        The encoding is not unique. For example, the value 42 could be encoded as two bytes hex 80 2A, or as the single byte 2A. If you always use the shorter form, it follows that 80 2A is wasteful as it could be used for something else.

        You could improve the algorithm by not duplicating all shorter values’ meanings. Specifically, a two byte encoding could be biased to encode a value from 128 to 16511, rather than from 0 to 16383, since the latter is redundant as 0–127 is handled just fine by a one byte encoding.

        However, the range extension is less than 0.8%, so it won’t contribute much to keeping the encoded values in the file short. It complicates the encoding logic, in terms of finding the smallest bit length. More seriously, it makes the format harder to explain. Most critically, it makes the code more complex.

      10.3.5 Details

        10.3.5.1 Encoding Algorithm
  1. Start with the value to be encoded, v.
  2. Trim the unnecessary high bits off v. For unsigned values, this is all the leading zeros. For signed values, this is all the leading bits except for one remaining representative sign bit. For example, –42 is 0xFFFFFFD6, and this can be trimmed to 7 bits without losing information: 1010110. When extending to any designed word length, the highest bit is replicated. So it follows that the shortest possible word to represent –42 contains only one such bit. Call the number of necessary bits b. By design, zero-extending (for unsigned values) or sign-extending (for signed values) the b least-significant bits of v back to the original word length, or to any length for that matter, will produce the original value.
  3. Calculate the number of bytes required, n, as b divided by 7, rounding up to the next integer.
  4. Prepare an output area as a sequence of n*8 bits. Set the n-1 most significant bits to 1, followed by a single bit of 0. Extend v to n*7 bits (sign extend for signed values, zero extend for unsigned values) and place these as the least significant part of the output area.
  5. Divide the output area into a stream of bytes, with the most significant bits being in the first byte.
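Here is a minimal sketch in C of these steps for the unsigned (uintV) case; the signed case, with its sign-bit trimming, is omitted. The function name is illustrative, and 'out' must have room for 10 bytes.

      #include <stdint.h>
      #include <stddef.h>

      /* Encode v as a uintV, most significant byte first.
         Returns the number of bytes written (1 through 10). */
      size_t encode_uintV(uint64_t v, unsigned char *out)
      {
          size_t n = 1, i;

          /* steps 2 and 3: the smallest n such that v fits in 7*n bits */
          while (n < 10 && (v >> (7 * n)) != 0)
              ++n;
          /* step 4: lay down the 7*n value bits as the low bits of an
             n-byte big-endian field (the top n bits come out as zero) */
          for (i = 0; i < n; ++i) {
              size_t shift = 8 * (n - 1 - i);
              out[i] = (unsigned char)(shift < 64 ? (v >> shift) & 0xFF : 0);
          }
          /* then set the n-1 leading 1 bits; the 0 bit that follows
             them is already in place */
          for (i = 0; i + 1 < n; ++i)
              out[i / 8] |= (unsigned char)(0x80 >> (i % 8));
          return n;   /* step 5: out[0..n-1] is the encoding */
      }

For example, encode_uintV(42, buf) writes the single byte hex 2A, and encode_uintV(300, buf) writes the two bytes hex 81 2C.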
        10.3.5.2 Decoding Algorithm

Given a byte stream queued up to the first byte in a variable-length integer,

  1. Initialize the size to 0.
  2. While the next byte (peek without consuming it) is hex FF, consume the FF byte and add 7 to size.
  3. Let n be the number of leading 1 bits in the next byte.
  4. Read size+n+1 bytes and place at the right of a buffer. The buffer is the size of the desired integer.
  5. For unsigned values, zero all the bits in the buffer that lie to the left of the value bits; this includes the leading 1 bits and the 0 bit of the prefix. For signed values, instead look at bit n+2 in the byte examined in step 3 (the first value bit, which is the sign bit) and replicate it into all of those bits.
  6. The buffer may now be interpreted as an integer in big-endian format.
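And a matching sketch in C of the decoding steps, again unsigned only, assuming a well-formed stream holding at most 64 value bits; error handling is omitted. A signed (sintV) decoder would additionally sign-extend from the first value bit, as step 5 describes.

      #include <stdint.h>
      #include <stdio.h>

      /* Decode one uintV from 'in', following the steps above. */
      uint64_t decode_uintV(FILE *in)
      {
          int c, n = 0, size = 0, remaining, mask = 0x80;
          uint64_t v;

          while ((c = getc(in)) == 0xFF)   /* step 2: count FF prefix bytes */
              size += 7;
          while (c & mask) {               /* step 3: leading 1 bits of this byte */
              ++n;
              mask >>= 1;
          }
          /* steps 4 and 5: keep only the value bits of this byte... */
          v = (uint64_t)(c & (0xFF >> (n + 1)));
          /* ...then accumulate the remaining bytes, big-endian (step 6) */
          for (remaining = size + n; remaining > 0; --remaining)
              v = (v << 8) | (uint64_t)getc(in);
          return v;
      }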
        10.3.5.3 Illustrative Examples

Unsigned values from 0 through 127 are encoded as a single byte of the same value.

Signed values will have the sign bit (to be replicated) immediately following the first 0 bit.

A value up to 16,383 is encoded as binary 10xxxxxx xxxxxxxx.

A value less than 2^21 is encoded as binary 110xxxxx xxxxxxxx xxxxxxxx.

  11 Conformance

  In order to make sure that any decoder can read any ZIP2 file, here is a set of requirements, options, and assumptions. This goes beyond what is listed in the main part of this specification.

    11.1 Baseline 2000

This details a "correct" file. It is suffixed with the year of definition, so that future revisions can be called "Baseline 2005" or whatever. It is expected that changes will be made to accommodate actual issues found in the field, and that lengths and number sizes may increase with time.

  1. uintV and sintV values will be encoded with the smallest size necessary to contain the value.
  2. uintV and sintV values will not exceed 64 bits of value or 10 bytes of encoding.
  3. The sum of the lengths of all parts of a multi-part chunk will not exceed a 64-bit value.
  4. The (uncompressed) length of a file stored in an archive will not exceed a 64-bit value.
  5. Every non-0 chunk Instance shall be unique. The same instance shall not be repeated in a file, and the same Instance in different files of a multi-part spanning archive shall contain the identical value.
  6. An INDX#1 chunk shall be present. For a multi-part archive, it is not required that this chunk be in a specific file (e.g. the last one).
  7. Every Instance number referred to by other data in the file shall refer to chunks that exist (or are in the reserved range for implicit definitions).
  8. Chunks with reserved Instance numbers shall not be present.
    11.2 Robust 2000

This details the behavior expected from a "robust" (that is, not flaky or buggy) implementation.

  1. The program shall exit gracefully upon encountering an error, rather than crashing or destroying data if fed a bad archive file.
  2. Forwarding chunks (Data, Indx, etc.) tell the program which file in a multi-part spanning archive the original chunk can be found in. This should be taken with a grain of salt. If the chunk is not found where indicated, recover by treating it as if the forwarding chunk were not present at all—look for it. It’s a very real possibility that updates will leave out-of-date forwarding chunks in other files.
  3. If an INDX record refers to a DATA Instance that is missing, corrupted, or dependent on other Instances with such problems, the program shall gracefully refuse to extract just that one file. Other files (with correct definitions) must still work. Likewise for INDX records that refer to other INDX Instances.
  4. An unknown Instance in the range reserved for built-in definitions shall be treated the same as a missing chunk—do without, or fail on just the thing that depends on it.
  5. Secondary information (e.g. TiME, SECU) associated with a DATA Instance that is found to be corrupted should not prevent the program from extracting the file; the program shall behave exactly as if the corrupt chunk were not present.
  6. A program shall not assume nor require that secondary information (e.g. MaCF) be present for a stored file.
  7. When deleting or replacing a file, the program shall delete the corresponding user-defined chunks associated with that file. When replacing a file, the program may update the user-defined chunk if it understands that chunk.
  8. A program that thinks it understands a particular user-defined chunk must check the [ME] chunk to make sure, not go by the name alone.
  9. After manipulating a file in a multi-part archive, a program shall attempt to update chunks in other files, as applicable. This includes forwarding chunks and copies of chunks that were changed. However, the manipulations up to that point must be correct as far as they go, so that if updating the other files fails (e.g. the user doesn’t offer the numbered disk when prompted, or a file is write-protected), the archive file set is still usable. Furthermore, just because one such file fails doesn’t mean the program should stop trying the others.