Universidad Carlos III de Madrid

Ingeniería de Telecomunicación

Enero-Mayo 2010 / January-May 2010

Projects Section Project: Data Decompressor


The project goal is to build a data decompressor using a simple, lossless, (de)compression algorithm. A lossless compression algorithm takes some data and produces an alternative representation using fewer bits than the original representation. It must be possible to recover the full original representation of the data from the compressed version. We will be using a simplified version of the popular LZ77 algorithm.

Your task is to build a decompression program. This program will take the name of a compressed file as an argument and produce its decompressed version. It must read the compressed file, decompress it and generate an output file as a result. The contents of the output file must be equal to the contents of the original file, previous to compression.

Data Format

Here follows the compressed data format and the procedure to decompress them.

The input (compressed) file contains bytes. Those bytes represent a sequence of symbols, which must be processed by the decompressor in order to recover the original (uncompressed) data.

Symbols

Compressed data is a sequence of symbols coded as bytes. There are two types of symbols:

  • Symbols representing the literal value of a byte. No compression is performed here.
  • Instructions on how to replicate a previous sequence of bytes. Here is where compression is achieved. We use two numbers to identify a sequence of bytes: an offset and a length. The offset indicates how many bytes you have to step back to find the beginning of the sequence (1 means the beginning of the sequence is the last byte codified so far, 2 means the next-to-last byte...). The length is just the length in bytes of the sequence. Suppose, for simplicity, that lengths will not be greater than offsets.

Symbols must be decoded in order and one by one, to recover the original data.

As an example, let us say we read the following sequence of symbols from a compressed file: literal(3), literal(4), literal(5), literal(6), literal(7), literal(8), repetition(offset=4, length=3). Each literal symbol will be decoded as itself, until literal(8), the contents of the uncompressed file should be byte 3, 4, 5, 6, 7 and 8. Byte 8 is the more recent and byte 3 is the oldest. Then the repetition symbol repetition(offset=4, length=3) comes, representing bytes 5, 6 and 7 (step back 4 bytes, and repeat the following 3 bytes from there). The final sequence of uncompressed bytes will be: 3, 4, 5, 6, 7, 8, 5, 6 and 7.

Symbols coded as bytes

As files in a disk are coded as a sequence of bytes, the symbols used to compress bytes have to be encoded as bytes themselves when stored in a file. Therefore, you will need to read the symbols as a sequence of bytes from a file. Each symbol is encoded as follows:

  • Literal symbols representing bytes from 0 to 254: are encoded as themselves. As an example, literal(45) is encoded as byte 45.
  • Literal symbol for byte 255: is encoded as two consecutive 255 bytes.
  • Symbols for sequence repetition: are encoded as 4 bytes: first a 255 byte, then 3 bytes (L, D0 and D1). The L byte is the length of the sequence minus 5 (this is, L=0 means a sequence length of 5 bytes, L=1 means a sequence length of 6 bytes...). L=255 is not allowed, thus, the maximum length allowed for a sequence is 259 (=254+5). D0 and D1 are used for the offset, using the following expression: offset = 1 + D0 + (256*D1). As an example, D0=7 and D1=3 means an offset of 1 + 7 + (256*3) = 776.

As an example, let us say we have the following bytes (compressed data): 1, 2, 3, 4, 5, 255, 255, 6, 7, 254, 253, 255, 0, 5, 0, 255, 5, 9, 0. The corresponding symbols would be:

  1. Byte 1: literal(1).
  2. Byte 2: literal(2).
  3. Byte 3: literal(3).
  4. Byte 4: literal(4).
  5. Byte 5: literal(5).
  6. Bytes 255, 255: literal(255).
  7. Byte 6: literal(6).
  8. Byte 7: literal(7).
  9. Byte 254: literal(254).
  10. Byte 253: literal(253).
  11. Bytes 255, 0, 5, 0: repetition(offset=6, length=5).
  12. Bytes 255, 5, 9, 0: repetition(offset=10, length=10).

Decompressing those symbols would produce the following uncompressed data: 1, 2, 3, 4, 5, 255, 6, 7, 254, 253, 5, 255, 6, 7, 254, 255, 6, 7, 254, 253, 5, 255, 6, 7 and 254.

Checking data for repetitions

The maximum offset we can use is 65536 (1 + 255 + (256 * 255)). Therefore, it makes no sense to look for byte sequence repetitions longer than 65536 bytes back. Thus, the decompressor must only have a buffer of enough size as to store the last 65536 decoded bytes. Initially, before decompressing the first byte, the decompressor must assume that the last 65536 bytes have a value of 0.

Assignment

Write a program that decompresses an existing file and stores the result in a newly created file. There must be a class called Decompressor with a main method. This method will receive two arguments from the command line: the name of the (compressed) input file and the name of the (decompressed) output file. If the output file already exists, the program will overwrite it. This is how a typical invocation would look like:

java Decompressor compressed.dat uncompressed.dat

If there is some input/output error while reading the input file or writing the output file, the program will exit with an IOException. If the input file cannot be decompressed due to bad format or corrupted data (for instance, a byte 255 that starts a multibyte symbol is read but there are no more available bytes in the file), the program will exit using an ad-hoc exception.

Design Tips

A good design of the program will help you to get higher grades. On one hand, the teachers will be glad to read simple, clean and well organized code with a nice object oriented touch. On the other hand, a high quality code will help you in the project exam, where you will be asked to modify your own code for extended functionality in a limited time. Here we propose some tips for submitting quality code. Feel free to ignore them if you think of better designs (the teachers will be glad to grade you higher if your design is good).

Suggested classes

Generally speaking, we advise you to distribute functionality among classes, like this:

  • Reading from the input file. As an example, you can have a method to read a new byte from the file, throwing an exception when the end-of-file is reached.
  • Getting a sequence of symbols from a byte sequence. Have a method to decode and return the next symbol. To do that, you will have to read one or more bytes, depending on the symbol. You can use the suggested class above for that.
  • Getting the uncompressed bytes from a symbol sequence. Have a method to translate each symbol into the correspondent uncompressed byte sequence. You can use the previous class for getting each symbol as needed.
  • Organize and control the whole decompression process, using the previous classes.
  • Symbol representation. As there are two types of symbols, it can be nice to have a main class for symbols and another two classes inheriting from it for each type of symbol.
  • Remembering the last decompressed bytes. Have a buffer big enough for that, you will need it when a repetition symbol is read.
  • Your own exceptions for internal errors relevant to your application.

Decompressed data buffer

Decompressed data can be stored in its own class, using an array of 65536 bytes (byte buffer[65536]). This class can have methods for adding new data and reading old data from an offset and a length.

A circular buffer can be useful here, as the one you saw for the implementation of queues with an array. When adding new data fills the array you should continue adding in the positions at the beginning, overwritting them, as no more than 65536 bytes are needed for decompressing the current symbol.

To help you check that your buffer is working fine, we suggest you to test smaller buffers first (8 or 16 bytes length), as it will be much easier to fill them with accountable data in your tests.

Implementation Details

Java's byte

You can use Java's byte data type for your implementation, but there are some important details you must be aware of.

First, a byte in Java is an 8 bits numbers with sign, meaning it stores numbers between -128 and 127 (inclusively), using two's complement representation. While reading bytes from a file, bytes between 128 and 255 will be stored in Java as negative values. As an example, a 255 byte in the file will be stored in a Java byte as an -1, a 254 will be stored as -2, etc., until 128, which is stored as -128. Bytes between 0 and 127 will be stored in Java bytes as themselves. This means that checking for a 255 on a file will require you to compare with -1 in Java.

This have implications also in the offset calculation (1 + D0 + (256 * D1)). A way to solve this is using Java's bitwise operators, like this:

byte d0 = ...
byte d1 = ...
int offset = 1 + (d0 & 0xFF) + ((d1 & 0xFF) << 8);

Another important detail is when calling methods that has byte as arguments; if you want to pass literal values you can do something like this:

public void method(byte value) {
    (...)
}
(...)
method((byte) 5);

Here the explicit casting to byte is required, as the compiler will think that 5 is an int instead of a byte.

Sample Files

In the following table you will find some examples of compressed files and their uncompressed counterparts. You can use these files to test your decompressor program. Again we suggest you to begin with the smaller files. You can also create your own test files using an hex editor.

Description Compressed Uncompressed
File Size File Size
Only simple literals literals-1.in 6 literals-1.out 6
Only literals including 255 at the beginning literals-2.in 8 literals-2.out 7
Only literals including 255 at the end literals-3.in 8 literals-3.out 7
Only literals including 255 in the middle literals-4.in 8 literals-4.out 7
Only literals including adjacent 255 literals-5.in 10 literals-5.out 8
Only literals 1 literals-6.in 18 literals-6.out 12
Only literals 2 literals-7.in 4 literals-7.out 4
Project example example.in 19 example.out 25
Single repetition 1 repetition-1.in 10 repetition-1.out 12
Double repetition repetition-2.in 14 repetition-2.out 18
Prefetch repetition repetition-3.in 6 repetition-3.out 7
Triple repetition repetition-4.in 18 repetition-4.out 24
Single repetition 2 repetition-5.in 16 repetition-5.out 18
Buffer overflow 1 overflow-1.in 65538 overflow-1.out 65538
Buffer overflow 2 overflow-2.in 196609 overflow-2.out 196609
Buffer overflow + repetition 1 overflow-3.in 65542 overflow-3.out 65543
Buffer overflow + repetition 2 overflow-4.in 65542 overflow-4.out 65543
Incorrect literal incorrect-1.in 3 (error expected) -
Incorrect repetition 1 incorrect-2.in 4 (error expected) -
Incorrect repetition 2 incorrect-3.in 5 (error expected) -
Simple example with all symbols simple.txt.psz 53 simple.txt 58
Don Quijote de La Mancha quijote.txt.psz 2702 quijote.txt 3128
Sample photo (BMP format) photo.bmp.psz 2023506 photo.bmp 2359418

The majority of the files above are binary. It is better to examine them with an hex editor. For your convenience, you can download a ZIP file with all the tests above through this link.

Submission Instructions

You must work in this project in groups of two, except if your teacher told you otherwise. You must submit your group composition to your teacher in time. All group members must belong to the same small group (lab group).

One (and only one) of the members of the group must submit the project, using the submission link that will be available in your AulaGlobal small group web interface. The submission date will be announced as an AulaGlobal notice. Submissions after the due date wil be ignored.

You must submit a single ZIP file. It will only contain your project source code (.java files) and a plain text readme.txt file. You must not submit compiled classes (.class files) or any other kind of file.

The readme.txt file will include the group members' NIA, surnames and names. Optionally, it can also include a brief description (4 paragraphs or less) of your program design. Any relevant fact that your teacher must know about your project must be included here. For instance, if any of the members of your group has not worked at all it the project, or if you are aware of certain bugs in your program that you were not able to fix in time (grades can be improved by showing a good understanding of your bugs and how do you think they can be fixed, actually fixing them will be graded higher, of course).

The name of the ZIP file must be group-XX-YYYYYYYYY-ZZZZZZZZZ.zip, with XX being your small group (lab group) number, YYYYYYYYY the smallest NIA of your project group members and ZZZZZZZZZ the NIA of the other member. For example: group-95-188888888-199999999.zip.

Your project source code must be your own original work, not someone else's. If the teachers detect any kind of plagiarism in your code (from other students outside your project group, third-party academy teachers...) the plagiarism protocol will be called (ask your teacher for an English version, if you need it) and you will be graded with a 0 (failure). The teachers will be using plagiarism checking tools on your submitted code, that will detect common patterns in code between all projects of all groups and courses. These tools will detect plagiarism even if the code is later modified by changing comments, variable and methods names, whitespace changes, switching loop types or moving pieces of code around.