Spaces:
Running
Running
The .xz File Format | |
=================== | |
Version 1.0.4 (2009-08-27) | |
0. Preface | |
0.1. Notices and Acknowledgements | |
0.2. Getting the Latest Version | |
0.3. Version History | |
1. Conventions | |
1.1. Byte and Its Representation | |
1.2. Multibyte Integers | |
2. Overall Structure of .xz File | |
2.1. Stream | |
2.1.1. Stream Header | |
2.1.1.1. Header Magic Bytes | |
2.1.1.2. Stream Flags | |
2.1.1.3. CRC32 | |
2.1.2. Stream Footer | |
2.1.2.1. CRC32 | |
2.1.2.2. Backward Size | |
2.1.2.3. Stream Flags | |
2.1.2.4. Footer Magic Bytes | |
2.2. Stream Padding | |
3. Block | |
3.1. Block Header | |
3.1.1. Block Header Size | |
3.1.2. Block Flags | |
3.1.3. Compressed Size | |
3.1.4. Uncompressed Size | |
3.1.5. List of Filter Flags | |
3.1.6. Header Padding | |
3.1.7. CRC32 | |
3.2. Compressed Data | |
3.3. Block Padding | |
3.4. Check | |
4. Index | |
4.1. Index Indicator | |
4.2. Number of Records | |
4.3. List of Records | |
4.3.1. Unpadded Size | |
4.3.2. Uncompressed Size | |
4.4. Index Padding | |
4.5. CRC32 | |
5. Filter Chains | |
5.1. Alignment | |
5.2. Security | |
5.3. Filters | |
5.3.1. LZMA2 | |
5.3.2. Branch/Call/Jump Filters for Executables | |
5.3.3. Delta | |
5.3.3.1. Format of the Encoded Output | |
5.4. Custom Filter IDs | |
5.4.1. Reserved Custom Filter ID Ranges | |
6. Cyclic Redundancy Checks | |
7. References | |
0. Preface | |
This document describes the .xz file format (filename suffix | |
".xz", MIME type "application/x-xz"). It is intended that this | |
this format replace the old .lzma format used by LZMA SDK and | |
LZMA Utils. | |
0.1. Notices and Acknowledgements | |
This file format was designed by Lasse Collin | |
<[email protected]> and Igor Pavlov. | |
Special thanks for helping with this document goes to | |
Ville Koskinen. Thanks for helping with this document goes to | |
Mark Adler, H. Peter Anvin, Mikko Pouru, and Lars Wirzenius. | |
This document has been put into the public domain. | |
0.2. Getting the Latest Version | |
The latest official version of this document can be downloaded | |
from <http://tukaani.org/xz/xz-file-format.txt>. | |
Specific versions of this document have a filename | |
xz-file-format-X.Y.Z.txt where X.Y.Z is the version number. | |
For example, the version 1.0.0 of this document is available | |
at <http://tukaani.org/xz/xz-file-format-1.0.0.txt>. | |
0.3. Version History | |
Version Date Description | |
1.0.4 2009-08-27 Language improvements in Sections 1.2, | |
2.1.1.2, 3.1.1, 3.1.2, and 5.3.1 | |
1.0.3 2009-06-05 Spelling fixes in Sections 5.1 and 5.4 | |
1.0.2 2009-06-04 Typo fixes in Sections 4 and 5.3.1 | |
1.0.1 2009-06-01 Typo fix in Section 0.3 and minor | |
clarifications to Sections 2, 2.2, | |
3.3, 4.4, and 5.3.2 | |
1.0.0 2009-01-14 The first official version | |
1. Conventions | |
The key words "MUST", "MUST NOT", "REQUIRED", "SHOULD", | |
"SHOULD NOT", "RECOMMENDED", "MAY", and "OPTIONAL" in this | |
document are to be interpreted as described in [RFC-2119]. | |
Indicating a warning means displaying a message, returning | |
appropriate exit status, or doing something else to let the | |
user know that something worth warning occurred. The operation | |
SHOULD still finish if a warning is indicated. | |
Indicating an error means displaying a message, returning | |
appropriate exit status, or doing something else to let the | |
user know that something prevented successfully finishing the | |
operation. The operation MUST be aborted once an error has | |
been indicated. | |
1.1. Byte and Its Representation | |
In this document, byte is always 8 bits. | |
A "null byte" has all bits unset. That is, the value of a null | |
byte is 0x00. | |
To represent byte blocks, this document uses notation that | |
is similar to the notation used in [RFC-1952]: | |
+-------+ | |
| Foo | One byte. | |
+-------+ | |
+---+---+ | |
| Foo | Two bytes; that is, some of the vertical bars | |
+---+---+ can be missing. | |
+=======+ | |
| Foo | Zero or more bytes. | |
+=======+ | |
In this document, a boxed byte or a byte sequence declared | |
using this notation is called "a field". The example field | |
above would be called "the Foo field" or plain "Foo". | |
If there are many fields, they may be split to multiple lines. | |
This is indicated with an arrow ("--->"): | |
+=====+ | |
| Foo | | |
+=====+ | |
+=====+ | |
---> | Bar | | |
+=====+ | |
The above is equivalent to this: | |
+=====+=====+ | |
| Foo | Bar | | |
+=====+=====+ | |
1.2. Multibyte Integers | |
Multibyte integers of static length, such as CRC values, | |
are stored in little endian byte order (least significant | |
byte first). | |
When smaller values are more likely than bigger values (for | |
example file sizes), multibyte integers are encoded in a | |
variable-length representation: | |
- Numbers in the range [0, 127] are copied as is, and take | |
one byte of space. | |
- Bigger numbers will occupy two or more bytes. All but the | |
last byte of the multibyte representation have the highest | |
(eighth) bit set. | |
For now, the value of the variable-length integers is limited | |
to 63 bits, which limits the encoded size of the integer to | |
nine bytes. These limits may be increased in the future if | |
needed. | |
The following C code illustrates encoding and decoding of | |
variable-length integers. The functions return the number of | |
bytes occupied by the integer (1-9), or zero on error. | |
#include <stddef.h> | |
#include <inttypes.h> | |
size_t | |
encode(uint8_t buf[static 9], uint64_t num) | |
{ | |
if (num > UINT64_MAX / 2) | |
return 0; | |
size_t i = 0; | |
while (num >= 0x80) { | |
buf[i++] = (uint8_t)(num) | 0x80; | |
num >>= 7; | |
} | |
buf[i++] = (uint8_t)(num); | |
return i; | |
} | |
size_t | |
decode(const uint8_t buf[], size_t size_max, uint64_t *num) | |
{ | |
if (size_max == 0) | |
return 0; | |
if (size_max > 9) | |
size_max = 9; | |
*num = buf[0] & 0x7F; | |
size_t i = 0; | |
while (buf[i++] & 0x80) { | |
if (i >= size_max || buf[i] == 0x00) | |
return 0; | |
*num |= (uint64_t)(buf[i] & 0x7F) << (i * 7); | |
} | |
return i; | |
} | |
2. Overall Structure of .xz File | |
A standalone .xz files consist of one or more Streams which may | |
have Stream Padding between or after them: | |
+========+================+========+================+ | |
| Stream | Stream Padding | Stream | Stream Padding | ... | |
+========+================+========+================+ | |
The sizes of Stream and Stream Padding are always multiples | |
of four bytes, thus the size of every valid .xz file MUST be | |
a multiple of four bytes. | |
While a typical file contains only one Stream and no Stream | |
Padding, a decoder handling standalone .xz files SHOULD support | |
files that have more than one Stream or Stream Padding. | |
In contrast to standalone .xz files, when the .xz file format | |
is used as an internal part of some other file format or | |
communication protocol, it usually is expected that the decoder | |
stops after the first Stream, and doesn't look for Stream | |
Padding or possibly other Streams. | |
2.1. Stream | |
+-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+ +=======+ | |
| Stream Header | Block | Block | ... | Block | | |
+-+-+-+-+-+-+-+-+-+-+-+-+=======+=======+ +=======+ | |
+=======+-+-+-+-+-+-+-+-+-+-+-+-+ | |
---> | Index | Stream Footer | | |
+=======+-+-+-+-+-+-+-+-+-+-+-+-+ | |
All the above fields have a size that is a multiple of four. If | |
Stream is used as an internal part of another file format, it | |
is RECOMMENDED to make the Stream start at an offset that is | |
a multiple of four bytes. | |
Stream Header, Index, and Stream Footer are always present in | |
a Stream. The maximum size of the Index field is 16 GiB (2^34). | |
There are zero or more Blocks. The maximum number of Blocks is | |
limited only by the maximum size of the Index field. | |
Total size of a Stream MUST be less than 8 EiB (2^63 bytes). | |
The same limit applies to the total amount of uncompressed | |
data stored in a Stream. | |
If an implementation supports handling .xz files with multiple | |
concatenated Streams, it MAY apply the above limits to the file | |
as a whole instead of limiting per Stream basis. | |
2.1.1. Stream Header | |
+---+---+---+---+---+---+-------+------+--+--+--+--+ | |
| Header Magic Bytes | Stream Flags | CRC32 | | |
+---+---+---+---+---+---+-------+------+--+--+--+--+ | |
2.1.1.1. Header Magic Bytes | |
The first six (6) bytes of the Stream are so called Header | |
Magic Bytes. They can be used to identify the file type. | |
Using a C array and ASCII: | |
const uint8_t HEADER_MAGIC[6] | |
= { 0xFD, '7', 'z', 'X', 'Z', 0x00 }; | |
In plain hexadecimal: | |
FD 37 7A 58 5A 00 | |
Notes: | |
- The first byte (0xFD) was chosen so that the files cannot | |
be erroneously detected as being in .lzma format, in which | |
the first byte is in the range [0x00, 0xE0]. | |
- The sixth byte (0x00) was chosen to prevent applications | |
from misdetecting the file as a text file. | |
If the Header Magic Bytes don't match, the decoder MUST | |
indicate an error. | |
2.1.1.2. Stream Flags | |
The first byte of Stream Flags is always a null byte. In the | |
future, this byte may be used to indicate a new Stream version | |
or other Stream properties. | |
The second byte of Stream Flags is a bit field: | |
Bit(s) Mask Description | |
0-3 0x0F Type of Check (see Section 3.4): | |
ID Size Check name | |
0x00 0 bytes None | |
0x01 4 bytes CRC32 | |
0x02 4 bytes (Reserved) | |
0x03 4 bytes (Reserved) | |
0x04 8 bytes CRC64 | |
0x05 8 bytes (Reserved) | |
0x06 8 bytes (Reserved) | |
0x07 16 bytes (Reserved) | |
0x08 16 bytes (Reserved) | |
0x09 16 bytes (Reserved) | |
0x0A 32 bytes SHA-256 | |
0x0B 32 bytes (Reserved) | |
0x0C 32 bytes (Reserved) | |
0x0D 64 bytes (Reserved) | |
0x0E 64 bytes (Reserved) | |
0x0F 64 bytes (Reserved) | |
4-7 0xF0 Reserved for future use; MUST be zero for now. | |
Implementations SHOULD support at least the Check IDs 0x00 | |
(None) and 0x01 (CRC32). Supporting other Check IDs is | |
OPTIONAL. If an unsupported Check is used, the decoder SHOULD | |
indicate a warning or error. | |
If any reserved bit is set, the decoder MUST indicate an error. | |
It is possible that there is a new field present which the | |
decoder is not aware of, and can thus parse the Stream Header | |
incorrectly. | |
2.1.1.3. CRC32 | |
The CRC32 is calculated from the Stream Flags field. It is | |
stored as an unsigned 32-bit little endian integer. If the | |
calculated value does not match the stored one, the decoder | |
MUST indicate an error. | |
The idea is that Stream Flags would always be two bytes, even | |
if new features are needed. This way old decoders will be able | |
to verify the CRC32 calculated from Stream Flags, and thus | |
distinguish between corrupt files (CRC32 doesn't match) and | |
files that the decoder doesn't support (CRC32 matches but | |
Stream Flags has reserved bits set). | |
2.1.2. Stream Footer | |
+-+-+-+-+---+---+---+---+-------+------+----------+---------+ | |
| CRC32 | Backward Size | Stream Flags | Footer Magic Bytes | | |
+-+-+-+-+---+---+---+---+-------+------+----------+---------+ | |
2.1.2.1. CRC32 | |
The CRC32 is calculated from the Backward Size and Stream Flags | |
fields. It is stored as an unsigned 32-bit little endian | |
integer. If the calculated value does not match the stored one, | |
the decoder MUST indicate an error. | |
The reason to have the CRC32 field before the Backward Size and | |
Stream Flags fields is to keep the four-byte fields aligned to | |
a multiple of four bytes. | |
2.1.2.2. Backward Size | |
Backward Size is stored as a 32-bit little endian integer, | |
which indicates the size of the Index field as multiple of | |
four bytes, minimum value being four bytes: | |
real_backward_size = (stored_backward_size + 1) * 4; | |
If the stored value does not match the real size of the Index | |
field, the decoder MUST indicate an error. | |
Using a fixed-size integer to store Backward Size makes | |
it slightly simpler to parse the Stream Footer when the | |
application needs to parse the Stream backwards. | |
2.1.2.3. Stream Flags | |
This is a copy of the Stream Flags field from the Stream | |
Header. The information stored to Stream Flags is needed | |
when parsing the Stream backwards. The decoder MUST compare | |
the Stream Flags fields in both Stream Header and Stream | |
Footer, and indicate an error if they are not identical. | |
2.1.2.4. Footer Magic Bytes | |
As the last step of the decoding process, the decoder MUST | |
verify the existence of Footer Magic Bytes. If they don't | |
match, an error MUST be indicated. | |
Using a C array and ASCII: | |
const uint8_t FOOTER_MAGIC[2] = { 'Y', 'Z' }; | |
In hexadecimal: | |
59 5A | |
The primary reason to have Footer Magic Bytes is to make | |
it easier to detect incomplete files quickly, without | |
uncompressing. If the file does not end with Footer Magic Bytes | |
(excluding Stream Padding described in Section 2.2), it cannot | |
be undamaged, unless someone has intentionally appended garbage | |
after the end of the Stream. | |
2.2. Stream Padding | |
Only the decoders that support decoding of concatenated Streams | |
MUST support Stream Padding. | |
Stream Padding MUST contain only null bytes. To preserve the | |
four-byte alignment of consecutive Streams, the size of Stream | |
Padding MUST be a multiple of four bytes. Empty Stream Padding | |
is allowed. If these requirements are not met, the decoder MUST | |
indicate an error. | |
Note that non-empty Stream Padding is allowed at the end of the | |
file; there doesn't need to be a new Stream after non-empty | |
Stream Padding. This can be convenient in certain situations | |
[GNU-tar]. | |
The possibility of Stream Padding MUST be taken into account | |
when designing an application that parses Streams backwards, | |
and the application supports concatenated Streams. | |
3. Block | |
+==============+=================+===============+=======+ | |
| Block Header | Compressed Data | Block Padding | Check | | |
+==============+=================+===============+=======+ | |
3.1. Block Header | |
+-------------------+-------------+=================+ | |
| Block Header Size | Block Flags | Compressed Size | | |
+-------------------+-------------+=================+ | |
+===================+======================+ | |
---> | Uncompressed Size | List of Filter Flags | | |
+===================+======================+ | |
+================+--+--+--+--+ | |
---> | Header Padding | CRC32 | | |
+================+--+--+--+--+ | |
3.1.1. Block Header Size | |
This field overlaps with the Index Indicator field (see | |
Section 4.1). | |
This field contains the size of the Block Header field, | |
including the Block Header Size field itself. Valid values are | |
in the range [0x01, 0xFF], which indicate the size of the Block | |
Header as multiples of four bytes, minimum size being eight | |
bytes: | |
real_header_size = (encoded_header_size + 1) * 4; | |
If a Block Header bigger than 1024 bytes is needed in the | |
future, a new field can be added between the Block Header and | |
Compressed Data fields. The presence of this new field would | |
be indicated in the Block Header field. | |
3.1.2. Block Flags | |
The Block Flags field is a bit field: | |
Bit(s) Mask Description | |
0-1 0x03 Number of filters (1-4) | |
2-5 0x3C Reserved for future use; MUST be zero for now. | |
6 0x40 The Compressed Size field is present. | |
7 0x80 The Uncompressed Size field is present. | |
If any reserved bit is set, the decoder MUST indicate an error. | |
It is possible that there is a new field present which the | |
decoder is not aware of, and can thus parse the Block Header | |
incorrectly. | |
3.1.3. Compressed Size | |
This field is present only if the appropriate bit is set in | |
the Block Flags field (see Section 3.1.2). | |
The Compressed Size field contains the size of the Compressed | |
Data field, which MUST be non-zero. Compressed Size is stored | |
using the encoding described in Section 1.2. If the Compressed | |
Size doesn't match the size of the Compressed Data field, the | |
decoder MUST indicate an error. | |
3.1.4. Uncompressed Size | |
This field is present only if the appropriate bit is set in | |
the Block Flags field (see Section 3.1.2). | |
The Uncompressed Size field contains the size of the Block | |
after uncompressing. Uncompressed Size is stored using the | |
encoding described in Section 1.2. If the Uncompressed Size | |
does not match the real uncompressed size, the decoder MUST | |
indicate an error. | |
Storing the Compressed Size and Uncompressed Size fields serves | |
several purposes: | |
- The decoder knows how much memory it needs to allocate | |
for a temporary buffer in multithreaded mode. | |
- Simple error detection: wrong size indicates a broken file. | |
- Seeking forwards to a specific location in streamed mode. | |
It should be noted that the only reliable way to determine | |
the real uncompressed size is to uncompress the Block, | |
because the Block Header and Index fields may contain | |
(intentionally or unintentionally) invalid information. | |
3.1.5. List of Filter Flags | |
+================+================+ +================+ | |
| Filter 0 Flags | Filter 1 Flags | ... | Filter n Flags | | |
+================+================+ +================+ | |
The number of Filter Flags fields is stored in the Block Flags | |
field (see Section 3.1.2). | |
The format of each Filter Flags field is as follows: | |
+===========+====================+===================+ | |
| Filter ID | Size of Properties | Filter Properties | | |
+===========+====================+===================+ | |
Both Filter ID and Size of Properties are stored using the | |
encoding described in Section 1.2. Size of Properties indicates | |
the size of the Filter Properties field as bytes. The list of | |
officially defined Filter IDs and the formats of their Filter | |
Properties are described in Section 5.3. | |
Filter IDs greater than or equal to 0x4000_0000_0000_0000 | |
(2^62) are reserved for implementation-specific internal use. | |
These Filter IDs MUST never be used in List of Filter Flags. | |
3.1.6. Header Padding | |
This field contains as many null byte as it is needed to make | |
the Block Header have the size specified in Block Header Size. | |
If any of the bytes are not null bytes, the decoder MUST | |
indicate an error. It is possible that there is a new field | |
present which the decoder is not aware of, and can thus parse | |
the Block Header incorrectly. | |
3.1.7. CRC32 | |
The CRC32 is calculated over everything in the Block Header | |
field except the CRC32 field itself. It is stored as an | |
unsigned 32-bit little endian integer. If the calculated | |
value does not match the stored one, the decoder MUST indicate | |
an error. | |
By verifying the CRC32 of the Block Header before parsing the | |
actual contents allows the decoder to distinguish between | |
corrupt and unsupported files. | |
3.2. Compressed Data | |
The format of Compressed Data depends on Block Flags and List | |
of Filter Flags. Excluding the descriptions of the simplest | |
filters in Section 5.3, the format of the filter-specific | |
encoded data is out of scope of this document. | |
3.3. Block Padding | |
Block Padding MUST contain 0-3 null bytes to make the size of | |
the Block a multiple of four bytes. This can be needed when | |
the size of Compressed Data is not a multiple of four. If any | |
of the bytes in Block Padding are not null bytes, the decoder | |
MUST indicate an error. | |
3.4. Check | |
The type and size of the Check field depends on which bits | |
are set in the Stream Flags field (see Section 2.1.1.2). | |
The Check, when used, is calculated from the original | |
uncompressed data. If the calculated Check does not match the | |
stored one, the decoder MUST indicate an error. If the selected | |
type of Check is not supported by the decoder, it SHOULD | |
indicate a warning or error. | |
4. Index | |
+-----------------+===================+ | |
| Index Indicator | Number of Records | | |
+-----------------+===================+ | |
+=================+===============+-+-+-+-+ | |
---> | List of Records | Index Padding | CRC32 | | |
+=================+===============+-+-+-+-+ | |
Index serves several purposes. Using it, one can | |
- verify that all Blocks in a Stream have been processed; | |
- find out the uncompressed size of a Stream; and | |
- quickly access the beginning of any Block (random access). | |
4.1. Index Indicator | |
This field overlaps with the Block Header Size field (see | |
Section 3.1.1). The value of Index Indicator is always 0x00. | |
4.2. Number of Records | |
This field indicates how many Records there are in the List | |
of Records field, and thus how many Blocks there are in the | |
Stream. The value is stored using the encoding described in | |
Section 1.2. If the decoder has decoded all the Blocks of the | |
Stream, and then notices that the Number of Records doesn't | |
match the real number of Blocks, the decoder MUST indicate an | |
error. | |
4.3. List of Records | |
List of Records consists of as many Records as indicated by the | |
Number of Records field: | |
+========+========+ | |
| Record | Record | ... | |
+========+========+ | |
Each Record contains information about one Block: | |
+===============+===================+ | |
| Unpadded Size | Uncompressed Size | | |
+===============+===================+ | |
If the decoder has decoded all the Blocks of the Stream, it | |
MUST verify that the contents of the Records match the real | |
Unpadded Size and Uncompressed Size of the respective Blocks. | |
Implementation hint: It is possible to verify the Index with | |
constant memory usage by calculating for example SHA-256 of | |
both the real size values and the List of Records, then | |
comparing the hash values. Implementing this using | |
non-cryptographic hash like CRC32 SHOULD be avoided unless | |
small code size is important. | |
If the decoder supports random-access reading, it MUST verify | |
that Unpadded Size and Uncompressed Size of every completely | |
decoded Block match the sizes stored in the Index. If only | |
partial Block is decoded, the decoder MUST verify that the | |
processed sizes don't exceed the sizes stored in the Index. | |
4.3.1. Unpadded Size | |
This field indicates the size of the Block excluding the Block | |
Padding field. That is, Unpadded Size is the size of the Block | |
Header, Compressed Data, and Check fields. Unpadded Size is | |
stored using the encoding described in Section 1.2. The value | |
MUST never be zero; with the current structure of Blocks, the | |
actual minimum value for Unpadded Size is five. | |
Implementation note: Because the size of the Block Padding | |
field is not included in Unpadded Size, calculating the total | |
size of a Stream or doing random-access reading requires | |
calculating the actual size of the Blocks by rounding Unpadded | |
Sizes up to the next multiple of four. | |
The reason to exclude Block Padding from Unpadded Size is to | |
ease making a raw copy of Compressed Data without Block | |
Padding. This can be useful, for example, if someone wants | |
to convert Streams to some other file format quickly. | |
4.3.2. Uncompressed Size | |
This field indicates the Uncompressed Size of the respective | |
Block as bytes. The value is stored using the encoding | |
described in Section 1.2. | |
4.4. Index Padding | |
This field MUST contain 0-3 null bytes to pad the Index to | |
a multiple of four bytes. If any of the bytes are not null | |
bytes, the decoder MUST indicate an error. | |
4.5. CRC32 | |
The CRC32 is calculated over everything in the Index field | |
except the CRC32 field itself. The CRC32 is stored as an | |
unsigned 32-bit little endian integer. If the calculated | |
value does not match the stored one, the decoder MUST indicate | |
an error. | |
5. Filter Chains | |
The Block Flags field defines how many filters are used. When | |
more than one filter is used, the filters are chained; that is, | |
the output of one filter is the input of another filter. The | |
following figure illustrates the direction of data flow. | |
v Uncompressed Data ^ | |
| Filter 0 | | |
Encoder | Filter 1 | Decoder | |
| Filter n | | |
v Compressed Data ^ | |
5.1. Alignment | |
Alignment of uncompressed input data is usually the job of | |
the application producing the data. For example, to get the | |
best results, an archiver tool should make sure that all | |
PowerPC executable files in the archive stream start at | |
offsets that are multiples of four bytes. | |
Some filters, for example LZMA2, can be configured to take | |
advantage of specified alignment of input data. Note that | |
taking advantage of aligned input can be beneficial also when | |
a filter is not the first filter in the chain. For example, | |
if you compress PowerPC executables, you may want to use the | |
PowerPC filter and chain that with the LZMA2 filter. Because | |
not only the input but also the output alignment of the PowerPC | |
filter is four bytes, it is now beneficial to set LZMA2 | |
settings so that the LZMA2 encoder can take advantage of its | |
four-byte-aligned input data. | |
The output of the last filter in the chain is stored to the | |
Compressed Data field, which is is guaranteed to be aligned | |
to a multiple of four bytes relative to the beginning of the | |
Stream. This can increase | |
- speed, if the filtered data is handled multiple bytes at | |
a time by the filter-specific encoder and decoder, | |
because accessing aligned data in computer memory is | |
usually faster; and | |
- compression ratio, if the output data is later compressed | |
with an external compression tool. | |
5.2. Security | |
If filters would be allowed to be chained freely, it would be | |
possible to create malicious files, that would be very slow to | |
decode. Such files could be used to create denial of service | |
attacks. | |
Slow files could occur when multiple filters are chained: | |
v Compressed input data | |
| Filter 1 decoder (last filter) | |
| Filter 0 decoder (non-last filter) | |
v Uncompressed output data | |
The decoder of the last filter in the chain produces a lot of | |
output from little input. Another filter in the chain takes the | |
output of the last filter, and produces very little output | |
while consuming a lot of input. As a result, a lot of data is | |
moved inside the filter chain, but the filter chain as a whole | |
gets very little work done. | |
To prevent this kind of slow files, there are restrictions on | |
how the filters can be chained. These restrictions MUST be | |
taken into account when designing new filters. | |
The maximum number of filters in the chain has been limited to | |
four, thus there can be at maximum of three non-last filters. | |
Of these three non-last filters, only two are allowed to change | |
the size of the data. | |
The non-last filters, that change the size of the data, MUST | |
have a limit how much the decoder can compress the data: the | |
decoder SHOULD produce at least n bytes of output when the | |
filter is given 2n bytes of input. This limit is not | |
absolute, but significant deviations MUST be avoided. | |
The above limitations guarantee that if the last filter in the | |
chain produces 4n bytes of output, the chain as a whole will | |
produce at least n bytes of output. | |
5.3. Filters | |
5.3.1. LZMA2 | |
LZMA (Lempel-Ziv-Markov chain-Algorithm) is a general-purpose | |
compression algorithm with high compression ratio and fast | |
decompression. LZMA is based on LZ77 and range coding | |
algorithms. | |
LZMA2 is an extension on top of the original LZMA. LZMA2 uses | |
LZMA internally, but adds support for flushing the encoder, | |
uncompressed chunks, eases stateful decoder implementations, | |
and improves support for multithreading. Thus, the plain LZMA | |
will not be supported in this file format. | |
Filter ID: 0x21 | |
Size of Filter Properties: 1 byte | |
Changes size of data: Yes | |
Allow as a non-last filter: No | |
Allow as the last filter: Yes | |
Preferred alignment: | |
Input data: Adjustable to 1/2/4/8/16 byte(s) | |
Output data: 1 byte | |
The format of the one-byte Filter Properties field is as | |
follows: | |
Bits Mask Description | |
0-5 0x3F Dictionary Size | |
6-7 0xC0 Reserved for future use; MUST be zero for now. | |
Dictionary Size is encoded with one-bit mantissa and five-bit | |
exponent. The smallest dictionary size is 4 KiB and the biggest | |
is 4 GiB. | |
Raw value Mantissa Exponent Dictionary size | |
0 2 11 4 KiB | |
1 3 11 6 KiB | |
2 2 12 8 KiB | |
3 3 12 12 KiB | |
4 2 13 16 KiB | |
5 3 13 24 KiB | |
6 2 14 32 KiB | |
... ... ... ... | |
35 3 27 768 MiB | |
36 2 28 1024 MiB | |
37 3 29 1536 MiB | |
38 2 30 2048 MiB | |
39 3 30 3072 MiB | |
40 2 31 4096 MiB - 1 B | |
Instead of having a table in the decoder, the dictionary size | |
can be decoded using the following C code: | |
const uint8_t bits = get_dictionary_flags() & 0x3F; | |
if (bits > 40) | |
return DICTIONARY_TOO_BIG; // Bigger than 4 GiB | |
uint32_t dictionary_size; | |
if (bits == 40) { | |
dictionary_size = UINT32_MAX; | |
} else { | |
dictionary_size = 2 | (bits & 1); | |
dictionary_size <<= bits / 2 + 11; | |
} | |
5.3.2. Branch/Call/Jump Filters for Executables | |
These filters convert relative branch, call, and jump | |
instructions to their absolute counterparts in executable | |
files. This conversion increases redundancy and thus | |
compression ratio. | |
Size of Filter Properties: 0 or 4 bytes | |
Changes size of data: No | |
Allow as a non-last filter: Yes | |
Allow as the last filter: No | |
Below is the list of filters in this category. The alignment | |
is the same for both input and output data. | |
Filter ID Alignment Description | |
0x04 1 byte x86 filter (BCJ) | |
0x05 4 bytes PowerPC (big endian) filter | |
0x06 16 bytes IA64 filter | |
0x07 4 bytes ARM (little endian) filter | |
0x08 2 bytes ARM Thumb (little endian) filter | |
0x09 4 bytes SPARC filter | |
If the size of Filter Properties is four bytes, the Filter | |
Properties field contains the start offset used for address | |
conversions. It is stored as an unsigned 32-bit little endian | |
integer. The start offset MUST be a multiple of the alignment | |
of the filter as listed in the table above; if it isn't, the | |
decoder MUST indicate an error. If the size of Filter | |
Properties is zero, the start offset is zero. | |
Setting the start offset may be useful if an executable has | |
multiple sections, and there are many cross-section calls. | |
Taking advantage of this feature usually requires usage of | |
the Subblock filter, whose design is not complete yet. | |
5.3.3. Delta | |
The Delta filter may increase compression ratio when the value | |
of the next byte correlates with the value of an earlier byte | |
at specified distance. | |
Filter ID: 0x03 | |
Size of Filter Properties: 1 byte | |
Changes size of data: No | |
Allow as a non-last filter: Yes | |
Allow as the last filter: No | |
Preferred alignment: | |
Input data: 1 byte | |
Output data: Same as the original input data | |
The Properties byte indicates the delta distance, which can be | |
1-256 bytes backwards from the current byte: 0x00 indicates | |
distance of 1 byte and 0xFF distance of 256 bytes. | |
5.3.3.1. Format of the Encoded Output | |
The code below illustrates both encoding and decoding with | |
the Delta filter. | |
// Distance is in the range [1, 256]. | |
const unsigned int distance = get_properties_byte() + 1; | |
uint8_t pos = 0; | |
uint8_t delta[256]; | |
memset(delta, 0, sizeof(delta)); | |
while (1) { | |
const int byte = read_byte(); | |
if (byte == EOF) | |
break; | |
uint8_t tmp = delta[(uint8_t)(distance + pos)]; | |
if (is_encoder) { | |
tmp = (uint8_t)(byte) - tmp; | |
delta[pos] = (uint8_t)(byte); | |
} else { | |
tmp = (uint8_t)(byte) + tmp; | |
delta[pos] = tmp; | |
} | |
write_byte(tmp); | |
--pos; | |
} | |
5.4. Custom Filter IDs | |
If a developer wants to use custom Filter IDs, he has two | |
choices. The first choice is to contact Lasse Collin and ask | |
him to allocate a range of IDs for the developer. | |
The second choice is to generate a 40-bit random integer, | |
which the developer can use as his personal Developer ID. | |
To minimize the risk of collisions, Developer ID has to be | |
a randomly generated integer, not manually selected "hex word". | |
The following command, which works on many free operating | |
systems, can be used to generate Developer ID: | |
dd if=/dev/urandom bs=5 count=1 | hexdump | |
The developer can then use his Developer ID to create unique | |
(well, hopefully unique) Filter IDs. | |
Bits Mask Description | |
0-15 0x0000_0000_0000_FFFF Filter ID | |
16-55 0x00FF_FFFF_FFFF_0000 Developer ID | |
56-62 0x3F00_0000_0000_0000 Static prefix: 0x3F | |
The resulting 63-bit integer will use 9 bytes of space when | |
stored using the encoding described in Section 1.2. To get | |
a shorter ID, see the beginning of this Section how to | |
request a custom ID range. | |
5.4.1. Reserved Custom Filter ID Ranges | |
Range Description | |
0x0000_0300 - 0x0000_04FF Reserved to ease .7z compatibility | |
0x0002_0000 - 0x0007_FFFF Reserved to ease .7z compatibility | |
0x0200_0000 - 0x07FF_FFFF Reserved to ease .7z compatibility | |
6. Cyclic Redundancy Checks | |
There are several incompatible variations to calculate CRC32 | |
and CRC64. For simplicity and clarity, complete examples are | |
provided to calculate the checks as they are used in this file | |
format. Implementations MAY use different code as long as it | |
gives identical results. | |
The program below reads data from standard input, calculates | |
the CRC32 and CRC64 values, and prints the calculated values | |
as big endian hexadecimal strings to standard output. | |
#include <stddef.h> | |
#include <inttypes.h> | |
#include <stdio.h> | |
uint32_t crc32_table[256]; | |
uint64_t crc64_table[256]; | |
void | |
init(void) | |
{ | |
static const uint32_t poly32 = UINT32_C(0xEDB88320); | |
static const uint64_t poly64 | |
= UINT64_C(0xC96C5795D7870F42); | |
for (size_t i = 0; i < 256; ++i) { | |
uint32_t crc32 = i; | |
uint64_t crc64 = i; | |
for (size_t j = 0; j < 8; ++j) { | |
if (crc32 & 1) | |
crc32 = (crc32 >> 1) ^ poly32; | |
else | |
crc32 >>= 1; | |
if (crc64 & 1) | |
crc64 = (crc64 >> 1) ^ poly64; | |
else | |
crc64 >>= 1; | |
} | |
crc32_table[i] = crc32; | |
crc64_table[i] = crc64; | |
} | |
} | |
uint32_t | |
crc32(const uint8_t *buf, size_t size, uint32_t crc) | |
{ | |
crc = ~crc; | |
for (size_t i = 0; i < size; ++i) | |
crc = crc32_table[buf[i] ^ (crc & 0xFF)] | |
^ (crc >> 8); | |
return ~crc; | |
} | |
uint64_t | |
crc64(const uint8_t *buf, size_t size, uint64_t crc) | |
{ | |
crc = ~crc; | |
for (size_t i = 0; i < size; ++i) | |
crc = crc64_table[buf[i] ^ (crc & 0xFF)] | |
^ (crc >> 8); | |
return ~crc; | |
} | |
int | |
main() | |
{ | |
init(); | |
uint32_t value32 = 0; | |
uint64_t value64 = 0; | |
uint64_t total_size = 0; | |
uint8_t buf[8192]; | |
while (1) { | |
const size_t buf_size | |
= fread(buf, 1, sizeof(buf), stdin); | |
if (buf_size == 0) | |
break; | |
total_size += buf_size; | |
value32 = crc32(buf, buf_size, value32); | |
value64 = crc64(buf, buf_size, value64); | |
} | |
printf("Bytes: %" PRIu64 "\n", total_size); | |
printf("CRC-32: 0x%08" PRIX32 "\n", value32); | |
printf("CRC-64: 0x%016" PRIX64 "\n", value64); | |
return 0; | |
} | |
7. References | |
LZMA SDK - The original LZMA implementation | |
http://7-zip.org/sdk.html | |
LZMA Utils - LZMA adapted to POSIX-like systems | |
http://tukaani.org/lzma/ | |
XZ Utils - The next generation of LZMA Utils | |
http://tukaani.org/xz/ | |
[RFC-1952] | |
GZIP file format specification version 4.3 | |
http://www.ietf.org/rfc/rfc1952.txt | |
- Notation of byte boxes in section "2.1. Overall conventions" | |
[RFC-2119] | |
Key words for use in RFCs to Indicate Requirement Levels | |
http://www.ietf.org/rfc/rfc2119.txt | |
[GNU-tar] | |
GNU tar 1.21 manual | |
http://www.gnu.org/software/tar/manual/html_node/Blocking-Factor.html | |
- Node 9.4.2 "Blocking Factor", paragraph that begins | |
"gzip will complain about trailing garbage" | |
- Note that this URL points to the latest version of the | |
manual, and may some day not contain the note which is in | |
1.21. For the exact version of the manual, download GNU | |
tar 1.21: ftp://ftp.gnu.org/pub/gnu/tar/tar-1.21.tar.gz | |