There are many archive formats out there dedicated to storing and organizing resources, such as ZIP or TAR. However, these formats (by their all-purpose approach) carry much more than what you actually need, often making a distinction between symbolic links, files and raw information, and preserving creation dates, file attributes and other metadata. Using these formats would require you to write a program that complies 100% with the official specifications and trust me, that's a lot of responsibility. So why not come up with your own format instead?
To keep things simple, we're going to avoid usage-driven containers like Half-Life (1998 - Valve) WAD files or Jill of the Jungle (1992 - Epic MegaGames) DMA/SHA files (for your information, these files are dedicated to storing graphics).
What's in the box?!
Usually, resource containers will at least feature a File Allocation Table (FAT), which describes each resource (its name and its size), and the actual data. When I say "usually", I mean there are exceptions out there. For instance, Boppin' (1991 - Accursed Toys) includes a hard-coded list of resource names and offsets, as the accompanying RES file only holds data. GTA III (2001 - DMA Design) and Spelunky (2012 - Mossmouth) both use two files to store resources (one for the data and another for the FAT).
To allow the bare minimum of flexibility, you should always include a FAT along with the data. Obviously, you can add more stuff if you want. For instance, Blood's (1997 - Monolith) RFF files include the creation date of each source file, the number of revisions (how many times the resource has been replaced), and even an encryption system to prevent extraction.
The FAT may sometimes not describe resources, but arbitrary chunks of information the program needs. They are sometimes referred to by indices, sometimes by labels and markers. That's the case of Doom (1993 - iD Software) and Rise of the Triad (1995 - Apogee) WAD files for instance. In WADs, the type of each resource is defined by its position in the FAT and the surrounding opening and closing markers (for instance, all resources located in the FAT between S_START and S_END are sprites). If this sounds overly complicated, it's only because it is... given WAD's ancestor (Wolfenstein-3D's VSWAP.WL6), it's understandable.
Using non-filename labels to describe resources is a perfectly fine thing to do, but it's not as flexible or intuitive, especially if you plan to reuse your code or provide modding capabilities to your game. Indeed, most games named in this article support both packed files and loose files: if a requested file is found in the main directory, it overrides the file contained within the resource container. It's particularly convenient during development. Some games went further by loading multiple containers one after the other, each adding to or overriding the content of previous containers.
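To make the override rule concrete, here is a minimal C sketch of a "last container wins" lookup. The Container structure and the find_container function are hypothetical names made up for this example, and everything lives in memory; a real loader would first check for a loose file on disk, then fall back to a search like this one.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical in-memory view of one loaded container. */
typedef struct {
    const char  *containerName;  /* label used for debugging only */
    const char **resNames;       /* names listed in that container's FAT */
    size_t       numResources;
} Container;

/* Search containers from the last one loaded to the first, so that
 * later containers override earlier ones.
 * Returns the index of the container that provides resName, or -1. */
static int find_container(const Container *loaded, size_t numLoaded,
                          const char *resName)
{
    for (size_t i = numLoaded; i-- > 0; ) {
        for (size_t j = 0; j < loaded[i].numResources; j++) {
            if (strcmp(loaded[i].resNames[j], resName) == 0)
                return (int)i;
        }
    }
    return -1;
}
```

Searching backwards means nothing has to be merged at load time: adding a container is just appending it to the list.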
Low flexibility, easy management (CMP files)
Used by early Apogee games, the CMP format is a strange beast: not only does it lack a header, it also has a fixed FAT of 200 entries of 20 bytes each; here's what each entry in the FAT looks like:
Like many formats from the early 90s, the FAT uses a fixed-length, 12-character field for resName, which is sufficient since DOS only permits 12-character names (8.3, dot included). Because of the fixed-length FAT and the inclusion of a resOffset field, this format is particularly easy to handle: to "remove" a resource, you don't actually have to erase the data and may simply "crush" the record away from the FAT. Also, you could replace an existing entry by appending new data, redirecting the resOffset and correcting the resLength. The downside is that you're limited to "only" 200 entries.
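A minimal sketch in C, assuming the 20 bytes split into a 12-byte name followed by a 4-byte offset and a 4-byte length, both little-endian. The field order, the integer types and the helper names (CmpEntry, cmp_decode_entry, rd32) are assumptions made for illustration, not something lifted from Apogee's code.

```c
#include <stdint.h>
#include <string.h>

#define CMP_MAX_ENTRIES 200  /* fixed number of slots in the FAT */
#define CMP_ENTRY_SIZE  20   /* bytes per slot on disk */
#define CMP_NAME_SIZE   12   /* 8.3 name, may fill the field with no terminator */

typedef struct {
    char     resName[CMP_NAME_SIZE + 1]; /* +1 so it can always be terminated */
    uint32_t resOffset;                  /* where the data starts in the file */
    uint32_t resLength;                  /* size of the data in bytes */
} CmpEntry;

/* Read a 32-bit little-endian value regardless of host byte order. */
static uint32_t rd32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Decode one raw 20-byte record (assumed order: name, offset, length). */
static void cmp_decode_entry(const unsigned char raw[CMP_ENTRY_SIZE],
                             CmpEntry *out)
{
    memcpy(out->resName, raw, CMP_NAME_SIZE);
    out->resName[CMP_NAME_SIZE] = '\0';
    out->resOffset = rd32(raw + 12);
    out->resLength = rd32(raw + 16);
}
```

Decoding field by field, instead of reading the structure straight from disk, avoids surprises with padding and byte order.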
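The "crush" operation could be sketched like this in C, assuming the whole FAT has been loaded in memory as 200 raw records of 20 bytes. Marking a free slot with zeroes is my assumption (the format may use another convention), and cmp_crush_entry is a hypothetical name.

```c
#include <string.h>

#define CMP_MAX_ENTRIES 200
#define CMP_ENTRY_SIZE  20

/* Remove record `index` by shifting the following records up one slot.
 * The data the record pointed to stays in the file as junk.
 * Returns 0 on success, -1 if index is out of range. */
static int cmp_crush_entry(unsigned char *fat, int index)
{
    if (index < 0 || index >= CMP_MAX_ENTRIES)
        return -1;

    unsigned char *slot = fat + (size_t)index * CMP_ENTRY_SIZE;
    size_t tail = (size_t)(CMP_MAX_ENTRIES - 1 - index) * CMP_ENTRY_SIZE;

    /* memmove handles the overlapping source and destination. */
    memmove(slot, slot + CMP_ENTRY_SIZE, tail);

    /* The last slot is now free: zero it (assumed "empty" marker). */
    memset(fat + (size_t)(CMP_MAX_ENTRIES - 1) * CMP_ENTRY_SIZE, 0,
           CMP_ENTRY_SIZE);
    return 0;
}
```

Since every record carries its own resOffset, none of the surviving entries needs to be touched; only the 4000-byte FAT has to be written back.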
Tightly packed junk (GRP files)
Used by most Build Engine-powered games, Ken Silverman's GRP files manage to reduce the size of each entry in the FAT by simply ditching the offset field altogether:
The FAT is located right after the header, and each entry is exactly 16 bytes long:
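As a C sketch, assuming the commonly documented layout: a 16-byte header made of the 12-byte "KenSilverman" signature and a 32-bit little-endian file count, then entries made of a 12-byte name and a 32-bit length. The type and function names (GrpHeader, GrpEntry, grp_decode_header, grp_decode_entry, rd32) are made up for the example.

```c
#include <stdint.h>
#include <string.h>

#define GRP_HEADER_SIZE 16
#define GRP_ENTRY_SIZE  16

typedef struct {
    char     signature[12]; /* "KenSilverman", not null-terminated */
    uint32_t numFiles;      /* number of entries in the FAT */
} GrpHeader;

typedef struct {
    char     resName[13];   /* 12 characters + terminator added on decode */
    uint32_t resLength;     /* size of the resource in bytes */
} GrpEntry;

/* Read a 32-bit little-endian value regardless of host byte order. */
static uint32_t rd32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 0 on success, -1 if the signature does not match. */
static int grp_decode_header(const unsigned char raw[GRP_HEADER_SIZE],
                             GrpHeader *out)
{
    if (memcmp(raw, "KenSilverman", 12) != 0)
        return -1;
    memcpy(out->signature, raw, 12);
    out->numFiles = rd32(raw + 12);
    return 0;
}

static void grp_decode_entry(const unsigned char raw[GRP_ENTRY_SIZE],
                             GrpEntry *out)
{
    memcpy(out->resName, raw, 12);
    out->resName[12] = '\0';
    out->resLength = rd32(raw + 12);
}
```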
Leaving the offset out makes it slightly more cumbersome to find where the requested resource starts. Begin by computing the location of the first resource (header size in bytes + length of the FAT in bytes). Then, add up every resLength you've read up until the point you found the requested resource. It's simple, clean, and while it keeps resources tightly packed, it's also much harder to modify: when adding new resources, not only does the FAT need to be entirely rewritten, but the whole data chunk also has to be moved (since resources are stored contiguously). Thankfully, you don't have to recompute each offset since it's missing from the FAT. It needs to be mentioned because it highlights how inconvenient a variable-length FAT can be when put early in your container.
Little-known fact: Duke Nukem 3D (1996 - 3D Realms) supports loading multiple GRP files at once, which was super convenient for add-ons that included maps, sounds, and tile graphics... usually though, modders would just drop all their files in the main directory and call it a day.
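The offset computation described above could look like this in C. This is a sketch that works on a GRP image already loaded in memory, assuming a 16-byte header ("KenSilverman" plus a 32-bit little-endian count) and 16-byte entries (12-byte zero-padded name plus 32-bit length). The function names are hypothetical, and names are compared exactly, case included.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a 32-bit little-endian value regardless of host byte order. */
static uint32_t rd32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Look up `name` in a GRP image held in memory.
 * Returns 0 and fills offset/length on success,
 * -1 if the name is not found or the image is malformed. */
static int grp_find(const unsigned char *grp, size_t grpSize,
                    const char *name, uint32_t *offset, uint32_t *length)
{
    if (grpSize < 16 || memcmp(grp, "KenSilverman", 12) != 0)
        return -1;

    uint32_t numFiles = rd32(grp + 12);
    if (numFiles > (grpSize - 16) / 16)
        return -1;                       /* the FAT must fit in the image */

    size_t nameLen = strlen(name);
    if (nameLen == 0 || nameLen > 12)
        return -1;

    /* First resource: header size + length of the FAT. */
    size_t pos = 16 + (size_t)numFiles * 16;

    for (uint32_t i = 0; i < numFiles; i++) {
        const unsigned char *entry = grp + 16 + (size_t)i * 16;
        uint32_t len = rd32(entry + 12);

        if (len > grpSize - pos)
            return -1;                   /* resource would run past the end */

        if (memcmp(entry, name, nameLen) == 0 &&
            (nameLen == 12 || entry[nameLen] == 0)) {
            *offset = (uint32_t)pos;
            *length = len;
            return 0;
        }
        pos += len;                      /* skip over this resource's data */
    }
    return -1;
}
```

The running total is the whole trick: the FAT never stores an offset, so the position of a resource is simply the sum of everything that comes before it.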
Moving parts (PAK files)
Quake (1996 - iD Software) shipped with a more flexible approach that made each entry in the FAT much larger, and also allowed the FAT to be located anywhere, like Dark Forces' GOB files (1995 - LucasArts). The header looks like this:
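As a C sketch, assuming the commonly documented 12-byte layout: a 4-byte "PACK" signature followed by two 32-bit little-endian values, named FATOffset and FATLength here to match the text. The PakHeader type and the helper functions are hypothetical names, and rejecting a FATLength that is not a multiple of 64 is a sanity check of my own.

```c
#include <stdint.h>
#include <string.h>

#define PAK_HEADER_SIZE 12
#define PAK_ENTRY_SIZE  64

typedef struct {
    char     signature[4]; /* "PACK", not null-terminated */
    uint32_t FATOffset;    /* where the FAT starts in the file */
    uint32_t FATLength;    /* size of the FAT in bytes, not in entries */
} PakHeader;

/* Read a 32-bit little-endian value regardless of host byte order. */
static uint32_t rd32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns 0 on success, -1 on a bad signature or an odd FAT size. */
static int pak_decode_header(const unsigned char raw[PAK_HEADER_SIZE],
                             PakHeader *out)
{
    if (memcmp(raw, "PACK", 4) != 0)
        return -1;
    memcpy(out->signature, raw, 4);
    out->FATOffset = rd32(raw + 4);
    out->FATLength = rd32(raw + 8);
    if (out->FATLength % PAK_ENTRY_SIZE != 0)
        return -1;
    return 0;
}
```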
Once you have the information needed to locate the FAT, jump to FATOffset and read (FATLength / 64) entries (since each entry in the FAT is 64 bytes long):
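Here is a hedged C sketch of that lookup over a PAK image already loaded in memory. It assumes each 64-byte entry is a 56-byte null-terminated name followed by a 32-bit little-endian resOffset and resLength, and that the header is "PACK" plus the FAT offset and length; pak_find and rd32 are hypothetical names.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Read a 32-bit little-endian value regardless of host byte order. */
static uint32_t rd32(const unsigned char *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Look up `name` in a PAK image held in memory.
 * Returns 0 and fills resOffset/resLength on success, -1 otherwise. */
static int pak_find(const unsigned char *pak, size_t pakSize,
                    const char *name,
                    uint32_t *resOffset, uint32_t *resLength)
{
    if (pakSize < 12 || memcmp(pak, "PACK", 4) != 0)
        return -1;

    uint32_t fatOffset = rd32(pak + 4);
    uint32_t fatLength = rd32(pak + 8);

    /* The FAT can be anywhere, so make sure it lies inside the image. */
    if (fatOffset > pakSize || fatLength > pakSize - fatOffset)
        return -1;

    if (strlen(name) >= 56)
        return -1;                   /* cannot fit in the name field */

    uint32_t numEntries = fatLength / 64;
    for (uint32_t i = 0; i < numEntries; i++) {
        const unsigned char *entry = pak + fatOffset + (size_t)i * 64;

        if (strncmp((const char *)entry, name, 56) == 0) {
            *resOffset = rd32(entry + 56);
            *resLength = rd32(entry + 60);
            return 0;
        }
    }
    return -1;
}
```

Note that the search never cares where the FAT sits: the same code works whether a tool wrote it right after the header or at the very end of the file.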
Having a mobile FAT is both a blessing and a curse. While it lets tools handle the archive the way they want, it also means that a given tool cannot bet on the FAT location. For instance, a tool that assumes the FAT sits at the very end of the container could add a new file by writing the new data where the FAT starts and rewriting the FAT right after it. The FAT would only need one extra entry, and no resOffset would need to be updated. But you can't bet on that, because another tool might have put the FAT at the beginning of the file, in which case inserting an entry would shift the data and invalidate each and every resOffset... unless you copy the FAT, leave the initial one behind as junk, and write your own at the end, which preserves the resOffset fields. If iD wanted the FAT to remain at the end of the file, maybe they should have replaced the FATOffset field with a numEntries field, and forced the program to compute the FAT location as "fileLength - numEntries * 64".
To obtain a specific resource, read each entry in the FAT until you find the one you're looking for, then jump to the specified resOffset location and read resLength bytes of data. PAK managers usually write the FAT at the very end of the container, since it makes it easier to append data later on. This FAT provides far more room than needed to store DOS-compliant 12-character filenames because iD organized their assets in directories (models, sounds and maps are stored in different sub-directories) and thus, extra room was needed. But the question is: when is the field big enough? When Ritual Entertainment used the Quake II engine to develop SiN in 1998, they thought that 56 characters was not enough and upped the field to a whopping 120 characters! That's a lot of wasted bytes.
Quake one-upped Duke Nukem 3D in the modding department. Although archives had to be sequentially named (PAK0.PAK, PAK1.PAK, etc.), each mod had its own directory (with the base game being located in "id1/"). Had iD allowed custom PAK names, using directories might have been redundant. On the other hand, directories allowed separate configuration files, and mods could use multiple containers if they needed to.
Larger filenames (LAB files)
Used for the first time in Outlaws (1997 - LucasArts), LAB files introduced a twist that allows variable-length filenames while keeping every entry in the FAT the same size. The container starts with a 16-byte header:
And right after the header comes the FAT (read numFiles entries, 16 bytes each):
The really cool thing about LAB files is the buffer of strLength bytes following the FAT. The buffer contains each and every resource name, null-terminated. In C, you could reserve enough memory to store the whole buffer, then point to the offset provided by resOffset, and the result would be the resource name. This means that any variable-length string can be obtained. Alternatively, when loading resources, one could search for a matching string instead of parsing the FAT. When the name is retrieved, the position of its first byte can be used to search the FAT. I really like this design.
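Here is what that pointer trick could look like in C. This is a sketch under assumptions: the 16-byte entry is modelled as four 32-bit fields, and the one holding the position of the name inside the string buffer is called nameOffset here to tell it apart from the data offset. The field names, their order and the function names are mine, not taken from the LucasArts code.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical decoded view of one 16-byte FAT entry. */
typedef struct {
    uint32_t nameOffset; /* position of the name in the string buffer */
    uint32_t dataOffset; /* where the resource data starts */
    uint32_t dataLength; /* size of the resource in bytes */
    uint32_t typeId;     /* assumed type tag, unused in this sketch */
} LabEntry;

/* Return a pointer to the entry's name inside the string buffer,
 * or NULL if the offset is out of range or the name is unterminated. */
static const char *lab_name(const char *names, uint32_t strLength,
                            const LabEntry *e)
{
    if (e->nameOffset >= strLength)
        return NULL;
    if (memchr(names + e->nameOffset, '\0',
               strLength - e->nameOffset) == NULL)
        return NULL;
    return names + e->nameOffset;
}

/* Return the index of the entry called `wanted`, or -1. */
static int lab_find(const LabEntry *fat, uint32_t numFiles,
                    const char *names, uint32_t strLength,
                    const char *wanted)
{
    for (uint32_t i = 0; i < numFiles; i++) {
        const char *n = lab_name(names, strLength, &fat[i]);
        if (n != NULL && strcmp(n, wanted) == 0)
            return (int)i;
    }
    return -1;
}
```

No name is ever copied: the buffer is loaded once, and every name is just a pointer into it, whatever its length.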
I learned nothing
You could (well if you want - do as you wish, I'm not your mom) use resource containers to keep all your files in the same place and keep the installation directory tidy. Plus, they are super convenient for modding. The structure will depend on how much work you're willing to put into your tools. Overall, the FAT shouldn't be stored near the beginning of the container unless it is fixed-length. The length of each entry should also be fixed to increase reading speed.
The tools you'll develop to manage your own format will depend on how often you'll modify your files. Appending new data when adding resources, or leaving junk data behind when removing resources, is much faster but will create bloated archives that need some cleanup at some point. Using offsets to find files is faster than having to compute them on your own, but requires more room for each entry and may become tedious to work with if the FAT is located before the data. Likewise, the space reserved for the resource name will depend on how your project is organized.
Welp... I hope this thing gave you brilliant ideas and didn't just waste your time. Stay tooned!