Adventures in Useless Reverse Engineering
Episode 1: LucasArts Classic Adventures
The below investigation was done on a Mac, and uses Boxer (with DOSBox), MAC OS X terminal (for the unix-type commands cat, tail, cmp, and strings), and TextWrangler. A few additional tools are mentioned as they were discovered.
PART I
I recently found my old copy of LucasArts Classic Adventures and used Boxer/DOSBox on the Mac to install and play some classic LucasArts adventures.
For whatever reason, I was curious about how the old installer worked. It seemed to use some .xxx file format which I’d never seen before. Although all of these games could originally be run from disks, this installer only installed to a hard drive, and if all of the archive (*.xxx) files were dumped into a directory on the hard drive, the installer would also run just fine without asking the user to insert disks. The installer will always install to \MANIAC, \ZAK, etc (i.e., based off root directory of destination drive) regardless of where it was run.
The disks contain the installer and archive files for each game:
LucasArts Classic Adventures (IBM) - Seven 3.5” Disks
CLASSIC 1:
INSTALL EXE 24,762 21-09-1992 17:57
MANIAC_A XXX 401,060 26-08-1992 11:36
ZAK____A XXX 633,717 26-08-1992 11:33
LOOM___A XXX 394,088 08-09-1992 12:35
4 file(s) 1,453,627 bytes total
CLASSIC 2:
INDY___B XXX 571,088 28-08-1992 12:00
LOOM___B XXX 874,101 08-09-1992 12:35
2 file(s) 1,445,189 bytes total
CLASSIC 3:
INDY___C XXX 1,454,088 28-08-1992 12:00
1 file(s) 1,454,088 bytes total
CLASSIC 4:
INDY___D XXX 1,454,088 28-08-1992 12:00
1 file(s) 1,454,088 bytes total
CLASSIC 5:
INDY___E XXX 73,054 28-08-1992 12:00
MONKEY_E XXX 1,384,088 08-09-1992 11:41
2 file(s) 1,457,142 bytes total
CLASSIC 6:
MONKEY_F XXX 1,454,088 08-09-1992 11:41
1 file(s) 1,454,088 bytes total
CLASSIC 7:
MONKEY_G XXX 697,641 08-09-1992 11:41
1 file(s) 697,641 bytes total
So the first question is, what are .xxx files? As I mentioned, I’ve never heard it as a file extension. Searching for info on it online didn’t return much. Hmmm. Looking at a hex dump of the .xxx files (using TextWrangler) gives a few hints:
MANIAC_A.XXX
00000000: 4C 46 47 21 9C 1E 06 00 6D 61 6E 69 61 63 5F 41 LFG!....maniac_A
00000010: 2E 78 78 78 00 00 01 00 C9 FA 09 00 46 49 4C 45 .xxx........FILE
00000020: 3C 04 00 00 30 30 2E 4C 46 4C 00 2E 18 2E 79 25 <...00.LFL....y%
00000030: 55 19 C4 07 00 00 02 00 01 00 00 00 00 06 FE F9 U...............
'LFG!’: LucasFilm Games, perhaps? The name of the archive (maniac_a.xxx) is clearly there, as well as the word 'FILE', which must be file marker. After a few unknown bytes it is followed by the name of one of the files in the archive, '00.LFL'.
Here’s another:
MONKEY_E.XXX
00000000: 4C 46 47 21 90 1E 15 00 6D 6F 6E 6B 65 79 5F 43 LFG!....monkey_C
00000010: 2E 78 78 78 00 00 03 00 9D 27 45 00 46 49 4C 45 .xxx.....'E.FILE
00000020: F1 0B 00 00 30 30 30 2E 4C 46 4C 00 EE 35 79 25 ....000.LFL..5y%
00000030: 2B 21 A5 20 00 00 02 00 01 00 00 00 00 06 8A 0C +!. ............
Well, that's interesting. It looks like the archive name is just for reference, since it doesn't match the actual name of the archive file. But again we do have the 'FILE' marker and a filename for one of the files in the archive ('000.LFL').
Searching for info on a 'LFG' or 'LFG!' archive type (online) again yields no results. So, assuming that the '!' is a part of this assumed 'LFG!' archive marker, we can see if the next four bytes before the archive name might be something
obvious. It looks like it could be a length in little endian order (ie, least significant byte first).
MANIAC_A.XXX:
00000004: 9C 1E 06 00 = 0x00061E9C = 401,052
MONKEY_E.XXX:
00000004: 90 1E 15 00 = 0x00151E90 = 1,384,080
Wait a minute! That is the length of the archive file, minus these 8 bytes used for the marker and the length! Also of note, the length we've decoded here is only the length of the particular archive file. (We know this since it matches the MONKEY_E.XXX file length, where the MONKEY archive spans multiple files.)
Okay, so we have a archive type marker, a length, and archive name. It is not clear whether the archive name is a fixed number of bytes or simply null terminated, since all the archive names use the full 8+3 characters and there is a null (0) after in each case.
The next non-zero value is a 1 for Maniac Mansion, and 3 for Monkey Island. Hmmm...the Maniac Mansion files are in a single archive, while Monkey Island uses 3 disks (therefore 3 files). Could this byte indicate the number of disks? Looking further, it is consistent with all the five games. (You can verify this yourself if you don't believe me!) A zero follows this each time. It is not clear if the disk count is 2 bytes or if the zero has another meaning (or is just a delimiter), so we'll ignore that for now.
Next are another four bytes that look suspiciously like a length. Let's check it out:
MANIAC_A.XXX: 00000018: C9 FA 09 00 = 0x0009FAC9 = 654,025
MONKEY_E.XXX: 00000018: 9D 27 45 00 = 0x0045279D = 4,532,125
Now let's take a look at the final install directories:
Contents of folder C:\MANIAC\
...
57 file(s) 654,025 bytes total
Contents of folder C:\MONKEY\
...
10 file(s) 4,532,125 bytes total
Look at that! That number is clearly the total number of uncompressed bytes from the entire archive (spanning all archive files/disks). This must be where the installer finds out how much space is required for the install.
Next is the 'FILE' delimiter, which we already saw. I'm assuming this marks the beginning of info for each file in the archive. And searching for those bytes, we do find that 'FILE' occurs 57 times in the MANIAC archive:
$ strings MANIAC_?.XXX | grep -c FILE
57
This works for the others as well (although multi-disk files need to be summed):
$ strings ZAK____?.XXX | grep -c FILE
62
$ strings LOOM___?.XXX | grep -c FILE
75
$ strings INDY___?.XXX | grep -c FILE
91
$ strings MONKEY_?.XXX | grep -c FILE
10
Okay, what next? Following FILE is yet another number that looks like a 4-byte length:
MANIAC_A.XXX:
0000001C: 3C 04 00 00 = 0x0000043C = 1,084
This should be for file 00.LFL. The decompressed length of this file is:
00 LFL 1,988
Hmm...this doesn't look right. The other option is that this is the *compressed* length. We can find out if this is the case by finding out where the next file starts:
MANIAC_A.XXX
00000020: 3C 04 00 00 30 30 2E 4C 46 4C 00 2E 18 2E 79 25 <...00.LFL....y%
00000460: 46 49 4C 45 BC 33 00 00 30 31 2E 4C 46 4C 00 2E FILE.3..01.LFL..
The next file marker starts at 0x460. The length we are checking ends at 0x24. Subtract and we have 0x43C = 1,084. So this is the length of the data that makes up the file.
Checking another one:
MONKEY_E.XXX
00000020: F1 0B 00 00 30 30 30 2E 4C 46 4C 00 EE 35 79 25 ....000.LFL..5y%
00000C10: 5B 1F 04 FC 03 46 49 4C 45 BE 03 00 00 39 30 31 [....FILE....901
At 0x1C: F1 0B 00 00 = 0x00000BF1 = 3,057
0xC15 - 0x24 = 0xBF1 = 3,057
Confirmed!
After this length is the file name, which appears to be null terminated.
It seems like there should be an uncompressed length. We saw earlier that for the 00.LFL file in MANIAC it should be 1,988, which is 0x7C4. And look:
00000030: 55 19 C4 07 00 00 02 00 01 00 00 00 00 06 FE F9 U...............
There are 7 unknown bytes in between. Hmmm. Well, let's look at another file.
For MONKEY, the first file's final length is:
000 LFL 8,357
This is 0x20A5. And here it is:
00000030: 2B 21 A5 20 00 00 02 00 01 00 00 00 00 06 8A 0C +!. ............
Oddly, there are now only 6 bytes in between the end of the filename and this length. Could the filename field be a fixed length? From looking at multiple instances, it looks like this is the case. There are at least 12 bytes allocated
for the filename (8+’.’+3), which is null terminated if less than 12 bytes. It isn't clear whether it would have a null if all 12 bytes were used (as there is no instance of this) but it would make the implementation cleaner. So there are still one or two bytes unaccounted for; it is unknown whether these are padding or some real used values.
After the uncompressed length, there is this pattern for every file in every archive:
02 00 01 00 00 00 00 06
This must have something to do with the compression. To get more info, we are going to have to look at the installer (install.exe) itself.
PART II
The first two bytes of install.exe indicate that it is indeed an MZ executable file.
install.exe
0000: 4D 5A 2B 00 2E 00 00 00 02 00 20 06 FF FF 7E 08 MZ+....... ...~.
0010: 80 00 00 00 0E 00 8B 05 1C 00 00 00 4C 5A 39 31 ............LZ91
Running 'strings' on install.exe is interesting as well. Here’s a portion:
1) Maniac Mansion
2) Zak McKracken
3) Loom
4) Indiana Jones and the Last Crusade
5) The Secret of Monkey Island
maniac maniac_a.xxx 1
//directory, filename, beginning disk#
zak zak____a.xxx 1
loom loom___a.xxx 1
indy indy___a.xxx 2
monkey monkey_a.xxx 5
It was nice of LucasArts to leave the comment in. Perhaps the number of games (5) is hardcoded, but it looks like
this info controls the install directory name, the install archive name, and the first installation disk number.
If you look back at the names of the archives they do not all start with A. Indy starts with B. Monkey start with E.
So it appears that the ending letter is incremented based on the disk number (a=disk 1, b=disk 2, etc).
The output of ‘strings’ also reveals more strings earlier in the file, but they appear somewhat corrupted. Could they be packed?
Looking back at the first 32 bytes of the file, we see 'LZ91'. LZ could refer to Lempel-Ziv; time to hit the web again. This time, information is available: Searching "LZ91 exe" on Google leads to the "LZEXE home page" (
http://bellard.org/lzexe.html). Nice!
LZEXE.exe is a program from ~1989 that allows executables to be compressed. When launched, it will decompress itself into memory. The program places either a ‘LZ09’ (for versions .90) or ‘LZ91’ (for version .91) tag at offset 1C of the compressed executable. (See
http://www.shikadi.net/moddingwiki/LZEXE.) LucasArts must have used version .91!
There is also a utility to reverse the process; that is, expand the compressed exe file back to its uncompressed form. This utility is UNLZEXE.EXE. Running it on install.exe (in DOSBox) creates a new install.exe that is larger, and a install.olz which is the compressed version.
C:\>UNLZEXE.EXE INSTALL.EXE
Running the expanded version of install.exe seems to work fine at first. But wait, but it looks different. Attempting to instal any of the games results in a rude message about poking yourself in the eye! Plus, it fails. What’s up with that, LucasArts?
Running strings on this new uncompressed version (“strings install.exe”) reveals not only that those corrupted looking strings seen earlier are corrected (compete with the poke yourself references) but also that the original clean strings seen (with the directory and archive info) are gone! Huh?
It look like LucasArts sneakily tacked on some file data (namely, those original strings) on the end of the compressed exe file. When decompressed by the utility, it does not know that there is data after the program data and so this is lost. This can be verified by using the INFOEXE.EXE program distributed with the LZEXE.EXE program:
LucasArts’ compressed install.exe:
Decimal Hex
Length on disk (bytes) 24762 000060BA
Length of code in the EXE (bytes) 23083 00005A2B
There appears to be an extra 1,679 bytes on the end of the file. The UNLZEXE.EXE utility ignores this extra data. And if the install program does not detect this data, it apparently will tell you to poke yourself in the eye. Weird.
Just for further understanding, a few more things were tried. LDEXE.EXE (Version 0.91) was downloaded and run on the decompressed install.exe. (Note that the original LDEXE.EXE is in French.) Now compare the original file and the recompressed file:
C:\>LZEXE.EXE INSTALL.EXE
In terminal, the original installer and the recompressed decompressed version were compared:
$ cmp -b -l install.olz install.exe
cmp: EOF on install.exe
They match, but install.olz is a longer file since it has the extra data. Note that running the recompressed version still doesn’t work. Let’s see if we can fix it. First copy the extra data from the end of the original installer to a text file:
$ tail -c 1679 install.olz > install.txt
Running it again, and wallah! It works! No more poking in the eye! But why? Running strings earlier revealed an “install.txt” string. It didn’t seem like too far a leap to assume that during development install.txt could hold that data that would eventually be tacked on the end of the compressed file. And it does!
To get the file to normal though, we can concatenate it:
$ cat install.txt >> install.exe
$ cmp -b -l install.olz install.exe
$
They match! So install.exe was successfully decompressed, recompressed, and combined with the extra data to match the original file exactly. When install.txt is deleted, it still works.
The text data at the end can be modified, and it will take effect whether in a separate text file or tacked back on to the executable. For instance, the directories, file names, and starting disk can all be changed easily (not that you’d want to).
Okay, let’s look into the decompressed install.exe some more. Here’s a section of output from ‘strings’ that stand out:
PKWARE Data Compression Library(tm)
Copyright 1990-92 PKWARE Inc. All Rights Reserved.
Patent No. 5,051,745
Version 1.03
Ah ha! Finally a clue as to how the actual data in the xxx archive is compressed! A little searching on the net brings up this under the zip standard (with the note that it is seldom implemented):
10 - PKWARE Data Compression Library Imploding
The header bytes (even though there are only 2) match what for each file; and the specification looks straightforward to implement. I think we can safely say this is the compression method used.
After implementing the spec in C, I was able to verify that the files are all using the PKWARE Implode method that was described in the link above. The new implementation extracts all files and are a exact match with the original files. I think that is all there is find out! And I have a new utility to use, “LFGExtract”.
Next time, perhaps, we’ll look at some of the actual SCUMM game files. (I suppose that would be more exciting.)
Afterward
Once I had written the decompressor (which I called "LFGExtract"), it seemed only natural to write a compression utility as well. I had two goals: First, create archives that meet or beat the compression ratios achieved in the LucasArts Classic Adventures archives (.xxx files), and second, have the original install.exe program be able to decompress the archives with no problem.
The implementation for 'imploder' was fairly straightforward and just followed the spec used to implement the decompressor. Basically, for each new encode attempt, start with a length of 2 and look through the dictionary to see if there is a match that can be used. If so, continue to look for increasing length until a match can no longer be found. At this point encode the offset and length into the compressed bitstream. If no match can be found at all, just encode one literal byte. Move to the next byte that hasn't been encoded and repeat the process.
All valid lengths may be used (2 through 518), as well as the minimum and maximum offsets (0 and 4095, mapping to the first and last elements in the dictionary). Oddly, it appears that the archives on the Classic Adventures disks were compressed with limits on these values (for example, offset 0 is not used). I'm not sure if there is a benefit to this.
One optimization is made to the algorithm: before committing to using an offset/length entry, check if encoding the first byte as a literal, and then using a dictionary entry from the byte after that would produce better results (where better results are defined as a shorter bit length used for encoding of both the literal and the new offset/length pair). This helped reduce the compressed size further, and results from this implementation beat whatever was used to produce the original achieved files for the classic games. (First goal met!)
Finally, archives created by this new implode implementation (I called it LFGMake) can be decompressed and extracted by the original install.exe utility. Success!