WAVEFORM AUDIO FILE FORMAT (WAV)

The WAVE file format is a subset of Microsoft's RIFF (Resource Interchange File Format) specification for the storage of multimedia resource files. The RIFF structure has been used for various formats (AVI, PAL, RMID, ...), but this document focuses on its use in WAV files specifically.

Only mandatory chunks are explained in this document; it does not explain how the LIST or INFO chunks work, or what the "cue " subchunk should contain. More succinct documents present the "canonical WAVE file format" (I highly recommend Craig Stuart Sapp's page on the topic), and those should be more than enough for most people.

RIFF structure

RIFF (Resource Interchange File Format) is the tagged file structure designed for multimedia resource files. The structure of a RIFF file is similar to the structure of an Electronic Arts IFF file. RIFF is not actually a file format itself (since it does not represent a specific kind of information), but rather an encapsulation scheme.

Chunks

The basic building block of a RIFF file is called a chunk and is defined as follows:

    BYTE  ckID[4];   // Chunk type identifier
    DWORD ckSize;    // Chunk size in bytes (size of ckData)
    BYTE  ckData[];  // Chunk data

Two special types of chunks, identified by a ckID value of "LIST" or "RIFF", may contain nested chunks, or subchunks.

Here is a discussion of the various fields:

ckID
A four-character code that identifies the representation of the chunk data. A program reading a RIFF file can skip over any chunk whose chunk ID it doesn't recognize; it simply skips the number of bytes specified by ckSize plus the pad byte, if present. In the case of WAV files, the ckID of the outermost chunk must be "RIFF".

ckSize
A 32-bit unsigned value identifying the size of ckData. This size value does not include the size of the ckID or ckSize fields or the pad byte at the end of ckData.

ckData
Binary data of fixed or variable size.
The start of ckData is word-aligned with respect to the start of the RIFF file. If ckSize is an odd number of bytes, a pad byte with value zero is written after ckData. Word aligning improves access speed (for chunks resident in memory) and maintains compatibility with EA IFF.

RIFF Forms

A RIFF form is a chunk with a "RIFF" ckID. The first DWORD of ckData in the RIFF chunk is a four-character code identifying the form type of the file. For WAV files, this four-character code is "WAVE".

Following the form-type code is a series of subchunks. Which subchunks are present depends on the form type. They are introduced exactly like chunks: a four-character code followed by a DWORD specifying the subchunk's length in bytes. Note, however, that the subchunk codes specific to the WAVE form are lowercase.

WAVE Format Chunk

For WAV files, there are two mandatory subchunks, identified as "fmt " and "data". Other subchunks such as "cue ", "ltxt" or "file" are not covered.

Subchunk - "fmt "

This subchunk specifies how the data must be interpreted, and MUST always occur before the "data" subchunk. It is defined as follows:

    WORD  wFormatTag;        // Format category.
    WORD  wChannels;         // Number of channels.
    DWORD dwSamplesPerSec;   // Sampling rate.
    DWORD dwAvgBytesPerSec;  // For buffer estimation.
    WORD  wBlockAlign;       // Data block size.
    CHAR  formatSpecific[];  // Format-specific fields.

Here is a discussion of the various fields:

wFormatTag
A number indicating the WAVE format category of the file. The content of the format-specific portion of the "fmt " chunk, and the interpretation of the waveform data, depend on this value. Currently defined WAVE format categories are:

    Value    Name               Format Category
    0x0001   WAVE_FORMAT_PCM    Microsoft Pulse Code Modulation (PCM).
    0x0101   IBM_FORMAT_MULAW   IBM mu-law.
    0x0102   IBM_FORMAT_ALAW    IBM a-law.
    0x0103   IBM_FORMAT_ADPCM   IBM AVC Adaptive Differential Pulse Code Modulation.

wChannels
The number of channels represented in the waveform data, such as 1 for mono or 2 for stereo.

dwSamplesPerSec
The sampling rate (in samples per second) at which each channel should be played.

dwAvgBytesPerSec
The average number of bytes per second at which the waveform data should be transferred. Playback software can estimate the buffer size using this value.

wBlockAlign
The block alignment (in bytes) of the waveform data. Playback software needs to process a multiple of wBlockAlign bytes of data at a time, so the value of wBlockAlign can be used for buffer alignment.

formatSpecific
This field contains more or fewer bytes of information depending on the value found in wFormatTag. If wFormatTag is 0x0001, then the waveform data consists of samples represented in Pulse Code Modulation (PCM) format, and this field contains only a WORD (wBitsPerSample) describing the size in bits of each sample of each channel (if there are multiple channels, the sample size is the same for each channel).

For PCM data, the dwAvgBytesPerSec field of the "fmt " chunk should be equal to the following formula, rounded up to the next whole number:

    dwAvgBytesPerSec = wChannels * dwSamplesPerSec * (wBitsPerSample / 8)

And the wBlockAlign field should be equal to the following formula, rounded up to the next whole number:

    wBlockAlign = wChannels * (wBitsPerSample / 8)

Subchunk - "data"

In a single-channel WAVE file, samples are stored consecutively. For stereo WAVE files, channel 0 represents the left channel, and channel 1 represents the right channel. The speaker position mapping for more than two channels is currently undefined. In multiple-channel WAVE files, samples are interleaved.

Each sample is contained in an integer i. The size of i is the smallest number of bytes required to contain the specified sample size. The least significant byte is stored first. The bits that represent the sample amplitude are stored in the most significant bits of i, and the remaining bits are set to zero.
For example, if the sample size (recorded in wBitsPerSample) is 12 bits, then each sample is stored in a two-byte integer. The least significant four bits of the first (least significant) byte are set to zero.

Canonical Microsoft PCM WAVE File Format

This section describes the structure of a classic Microsoft Pulse Code Modulation (PCM) WAVE file and the value each field MUST have to match that specific file format.

The file starts with the RIFF chunk header, which is defined as follows:

    BYTE  ckID[4];           // Chunk type identifier, MUST be "RIFF".
    DWORD ckSize;            // Chunk size in bytes: length of file (LOF) - 8.

Right after it appears the formType:

    BYTE  formType[4];       // Form type, MUST be "WAVE".

Then appears the "fmt " subchunk, which describes how the audio data must be interpreted. It is defined as follows:

    BYTE  sckID1[4];         // Subchunk id, MUST be "fmt ".
    DWORD sckSize1;          // Subchunk size in bytes, MUST be 16.
    WORD  wFormatTag;        // Format category, MUST be 0x0001.
    WORD  wChannels;         // Number of channels.
    DWORD dwSamplesPerSec;   // Sampling rate.
    DWORD dwAvgBytesPerSec;  // For buffer estimation.
    WORD  wBlockAlign;       // Data block size.
    WORD  wBitsPerSample;    // Bits per sample.

Then appears the "data" subchunk, which is defined as follows:

    BYTE  sckID2[4];         // Subchunk id, MUST be "data".
    DWORD sckSize2;          // Subchunk size in bytes.
    BYTE  sckData[];         // Subchunk data.
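
As an illustration of the layout above, here is a minimal Python sketch that writes and reads a canonical PCM WAVE file held in memory. It is not a reference implementation: the function names build_canonical_wav and parse_canonical_wav and the dictionary keys are hypothetical, all multi-byte integers are assumed to be little-endian (as RIFF stores them), and the block alignment is computed from whole bytes per sample, which is one reading of the rounding rule given earlier. The reader skips subchunks it does not recognize, including the pad byte after an odd-sized subchunk.

```python
import struct


def build_canonical_wav(channels, rate, bits, data):
    """Assemble a canonical PCM WAVE file in memory (hypothetical helper)."""
    # Assumption: each sample occupies a whole number of bytes.
    block_align = channels * ((bits + 7) // 8)
    avg_bytes = rate * block_align
    # "fmt " subchunk: id, size (16), then the six PCM format fields.
    fmt_ck = struct.pack("<4sIHHIIHH", b"fmt ", 16, 0x0001,
                         channels, rate, avg_bytes, block_align, bits)
    # "data" subchunk: a zero pad byte follows odd-sized data,
    # but is not counted in the subchunk size.
    pad = b"\x00" if len(data) & 1 else b""
    data_ck = struct.pack("<4sI", b"data", len(data)) + data + pad
    body = b"WAVE" + fmt_ck + data_ck
    # ckSize covers everything after the 8-byte RIFF header (LOF - 8).
    return struct.pack("<4sI", b"RIFF", len(body)) + body


def parse_canonical_wav(blob):
    """Read the "fmt " and "data" subchunks of a PCM WAVE file."""
    ck_id, ck_size, form_type = struct.unpack_from("<4sI4s", blob, 0)
    if ck_id != b"RIFF" or form_type != b"WAVE":
        raise ValueError("not a RIFF/WAVE file")

    info = None
    offset = 12                       # first subchunk follows formType
    end = min(len(blob), 8 + ck_size)
    while offset + 8 <= end:
        sck_id, sck_size = struct.unpack_from("<4sI", blob, offset)
        start = offset + 8
        if sck_id == b"fmt ":
            if sck_size < 16:
                raise ValueError('"fmt " subchunk too small for PCM')
            tag, channels, rate, avg_bytes, block_align, bits = \
                struct.unpack_from("<HHIIHH", blob, start)
            if tag != 0x0001:
                raise ValueError("only WAVE_FORMAT_PCM is handled here")
            info = {"channels": channels, "rate": rate,
                    "avg_bytes": avg_bytes, "block_align": block_align,
                    "bits": bits}
        elif sck_id == b"data":
            if info is None:
                raise ValueError('"fmt " must occur before "data"')
            info["data"] = blob[start:start + sck_size]
            return info
        # Skip to the next subchunk, stepping over the pad byte if any.
        offset = start + sck_size + (sck_size & 1)
    raise ValueError('no "data" subchunk found')
```

For instance, parse_canonical_wav(build_canonical_wav(2, 44100, 16, samples)) returns the format fields together with the raw sample bytes, which the caller still has to de-interleave per channel.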