ADX is a lossy proprietary audio storage and compression format developed by CRI Middleware specifically for use in video games; it is derived from ADPCM. Its most notable feature is a looping function that has proved useful for background music in the games that have adopted the format, including the Dreamcast and later-generation Sonic the Hedgehog games from SEGA as well as many PlayStation 2 and GameCube games. There is also a sibling format, AHX, which uses a variant of MPEG-2 audio and is intended specifically for voice recordings. A packaging archive, AFS, is also included for bundling individual ADX and AHX files into a single container.
The ADX format's specification is not freely available; however, the most important elements of the structure have been documented in various places on the web. The information here may be incomplete but is sufficient to build a working codec or transcoder. AHX is not covered here, as information about that variant is rare, but a cursory examination with a hex editor reveals a design strikingly similar to ADX 'version 3' minus the looping feature.
As a side note, AFS archive files are a simple variant of a tarball which uses numerical indices to identify the contents rather than names. Source code for an extractor for this format is included in the ADX archive at
|0x0||0x80||0||Copyright Offset||Encoding Type||Block Size||Bits Per Sample||Channel Count||Sample Rate||Total Samples|
|0x10||Version Mark||Unknown||Loop Enabled (v3)||Loop begin sample index (v3)|
|0x20||Loop begin byte index (v3)||Loop Enabled (v4) or Loop end sample index (v3)||Loop begin sample index (v4) or Loop end byte index (v3)||Loop begin byte index (v4)|
|0x30||Loop end sample index (v4)||Loop end byte index (v4)||Unknown|
|???||[CopyrightOffset - 2] -> ASCII String: "(c)CRI"|
|...||[CopyrightOffset + 4] -> Audio Data|
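The first row of the header table can be read with a few big-endian loads. The following is a minimal sketch rather than a reference parser: the struct and function names are hypothetical, and the field widths (16-bit Copyright Offset, 8-bit Encoding Type, Block Size, Bits Per Sample and Channel Count, 32-bit Sample Rate, Total Samples and Version Mark) are assumptions inferred from the 16-byte rows of the table, which does not state them explicitly.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical struct: field order follows the table, widths are assumed. */
typedef struct {
    uint16_t copyright_offset;
    uint8_t  encoding_type;
    uint8_t  block_size;
    uint8_t  bits_per_sample;
    uint8_t  channel_count;
    uint32_t sample_rate;
    uint32_t total_samples;
    uint32_t version_mark;
} adx_header_basic;

/* Read big-endian values one byte at a time (no alignment requirements). */
static uint16_t read_be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t read_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/* Returns 0 on success, -1 if the buffer is too short or does not
   start with the 0x80 0x00 bytes shown in the table. */
static int adx_parse_basic_header(const uint8_t *buf, size_t len,
                                  adx_header_basic *out)
{
    if (len < 0x14 || buf[0] != 0x80 || buf[1] != 0x00)
        return -1;
    out->copyright_offset = read_be16(buf + 0x02);
    out->encoding_type    = buf[0x04];
    out->block_size       = buf[0x05];
    out->bits_per_sample  = buf[0x06];
    out->channel_count    = buf[0x07];
    out->sample_rate      = read_be32(buf + 0x08);
    out->total_samples    = read_be32(buf + 0x0C);
    out->version_mark     = read_be32(buf + 0x10);
    return 0;
}

/* Audio data begins at CopyrightOffset + 4, as noted in the table. */
static size_t adx_audio_data_start(const adx_header_basic *h)
{
    return (size_t)h->copyright_offset + 4;
}
```

The loop fields are left out on purpose, since their position depends on whether the file is version 3 or version 4.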
|Scale||32 4bit samples|
|First sample||Second sample|
The decoding method for a sample is demonstrated below in C99:
#include <stdint.h>
#include <arpa/inet.h> /* ntohs (POSIX; other platforms provide it elsewhere) */
#define SAMPLES_PER_BLOCK 32
#define BYTES_PER_BLOCK 18 /* SAMPLES_PER_BLOCK / 2 + sizeof(uint16_t) */
/* sample_index is a uint_fast32_t incremented every time a sample has been decoded from every channel */
/* current_channel is a uint_fast8_t that holds the index of the channel currently being decoded (i.e. 0 for left, 1 for right) */
/* audio_data_start is a uint_fast16_t byte index of the first byte of audio data in the file (i.e. adx_header->CopyrightOffset + 4) */
/* num_channels is a uint_fast8_t channel count, that is 1 for mono, 2 for stereo, etc. (this is adx_header->ChannelCount verbatim) */
/* raw_data is a uint8_t pointer to the start of where the file is located in memory */
/* previous_sample and second_previous_sample are both int_fast32_t's */
/* --- Get 4 bit sample --- */
data_index = audio_data_start + (sample_index / SAMPLES_PER_BLOCK) * num_channels * BYTES_PER_BLOCK + current_channel * BYTES_PER_BLOCK;
block_scale = ntohs(*(uint16_t*)&raw_data[data_index] ) + 1;
data_index += 2 + sample_index % SAMPLES_PER_BLOCK / 2;
sample_4bit = raw_data[data_index];
if (sample_index % 2) /* If the sample index [starting at 0] is odd then we are decoding a secondary sample */
    sample_4bit &= 0x0F;
else /* Otherwise it is a primary sample */
    sample_4bit >>= 4;
/* --- Decode 4 bit sample --- */
sample = sample_4bit;
if (sample_4bit & 8) sample -= 16; /* Check the 4th bit (the sign), if negative then adjust for larger variable */
sample *= block_scale * volume; /* Scale up the sample and amplify */
sample += previous_sample * 0x7298; /* Incorporate previous sample data */
sample -= second_previous_sample * 0x3350; /* Incorporate previous previous sample data */
sample >>= 14; /* Divide the sample by 16384 */
if (sample > 32767) /* Clamp the sample to the valid range for a 16bit signed sample */
    sample = 32767;
else if (sample < -32768)
    sample = -32768;
second_previous_sample = previous_sample; /* Update the previous samples for the current channel */
previous_sample = sample;
Before processing the sample, it is necessary to acquire the "block scale" and the byte containing the sample within the file. The calculations used here appear more complex than they truly are: a counter and cache variables would be simpler and more efficient in practice, but the full positional calculations are shown for clarity.
The first calculation finds the channel block for the current sample. This involves converting the 'total samples read' counter into a 'number of frames read' counter, then adding the offset of the block for the current channel within the frame. The 'block scale' is located at the start of the block, so it needs to be converted to the host byte order (ntohs is appropriate here, as the value is stored big-endian) and stored for later. The second calculation moves to the byte within the channel block that holds the sample. As the samples are nybbles, not whole bytes, the if statement cuts off the undesired sample and shifts the nybble appropriately into the low 4 bits.
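The positional arithmetic described above can also be split into small helpers, which may be easier to follow than the single expression in the listing. This is an illustrative sketch only; the function names and the ADX_ constants are hypothetical, and the block scale is read byte by byte instead of through ntohs to avoid unaligned 16-bit loads.

```c
#include <stddef.h>
#include <stdint.h>

enum { ADX_SAMPLES_PER_BLOCK = 32, ADX_BYTES_PER_BLOCK = 18 };

/* Byte offset of the block holding sample_index for the given channel.
   A frame is one block per channel, stored back to back. */
static size_t adx_block_offset(size_t audio_data_start, uint32_t sample_index,
                               unsigned num_channels, unsigned channel)
{
    size_t frame = sample_index / ADX_SAMPLES_PER_BLOCK;
    return audio_data_start +
           (frame * num_channels + channel) * ADX_BYTES_PER_BLOCK;
}

/* Big-endian 16-bit scale at the start of the block, plus one
   (the listing in the text adds 1 in the same way). */
static uint32_t adx_block_scale(const uint8_t *block)
{
    return (((uint32_t)block[0] << 8) | block[1]) + 1;
}

/* Raw (unsigned) nybble for sample i of the block, with i in 0..31.
   Even samples sit in the high nybble, odd samples in the low nybble. */
static uint8_t adx_raw_nybble(const uint8_t *block, unsigned i)
{
    uint8_t byte = block[2 + i / 2];
    return (i % 2) ? (uint8_t)(byte & 0x0F) : (uint8_t)(byte >> 4);
}
```

For a stereo file whose audio data starts at byte 0x24, sample 33 of the right channel lies in the block at offset 0x24 + (1 * 2 + 1) * 18 = 90.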
The decoding process first sign-extends the 4-bit signed value into a 32-bit (or larger) variable, as few desktop processors can handle 4-bit numbers directly. The highest bit of the 4-bit value is the sign bit, and the number itself is stored in two's complement. The demonstration code uses a simple trick for sign-extending the value: for example, if sample_4bit is -1 (1111 in binary), which reads as 15 in unsigned arithmetic, subtracting 16 converts the number to -1 again in the larger variable.
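The sign-extension trick can be isolated into a helper to make its behaviour at the boundaries easier to check. The function name is hypothetical; the logic is the same subtract-16 step used in the listing.

```c
#include <stdint.h>

/* Interpret the low 4 bits of nybble as a two's complement value (-8..7). */
static int32_t adx_sign_extend_4bit(uint8_t nybble)
{
    int32_t value = nybble & 0x0F;
    if (value & 8)      /* sign bit set: the value stands for value - 16 */
        value -= 16;
    return value;
}
```

With this helper, 0xF maps to -1, 0x8 to -8 (the most negative value), and 0x7 to 7 (the most positive).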
The next stage is to multiply the sample by the 'block scale', which gives it a sensible amplitude, and then amplify it by a volume. The value used for the volume varies between sources from 0x1000 to 0x4000; going higher than 0x4000 is not recommended, as oversaturating the sound may cause noticeable distortion in common test files. The next two steps incorporate the previous two samples to bring the sample in line with the others. The previous-sample trackers carry across block boundaries, but a separate set of trackers must be kept for each channel; the values start at 0 in the first audio frame of the file. Lastly, the sample is divided by 16384 using a right shift to bring it into the expected signed 16-bit range (-32768 to 32767), then clamped if it still falls outside that range.
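The per-channel trackers and the scale, predict, shift and clamp steps can be gathered into one function. The sketch below is an assumption-laden illustration, not a reference decoder: the struct and function names are hypothetical, and a 64-bit intermediate is used because the product of sample, block scale and volume can exceed 32 bits at the largest scales. It also relies on right-shifting a negative value being an arithmetic shift, which is implementation-defined in C but holds on common compilers.

```c
#include <stdint.h>

/* One tracker set per channel; both values start at 0. */
typedef struct {
    int32_t prev1; /* previous decoded sample */
    int32_t prev2; /* sample before that */
} adx_channel_state;

/* nybble_value is the sign-extended 4-bit sample (-8..7),
   block_scale is the stored scale plus one, volume is 0x1000..0x4000. */
static int16_t adx_decode_sample(adx_channel_state *st, int32_t nybble_value,
                                 int32_t block_scale, int32_t volume)
{
    int64_t sample = (int64_t)nybble_value * block_scale * volume;
    sample += (int64_t)st->prev1 * 0x7298; /* previous sample */
    sample -= (int64_t)st->prev2 * 0x3350; /* second previous sample */
    sample >>= 14;                         /* divide by 16384 */

    if (sample > 32767)                    /* clamp to signed 16-bit */
        sample = 32767;
    else if (sample < -32768)
        sample = -32768;

    st->prev2 = st->prev1;                 /* update this channel's trackers */
    st->prev1 = (int32_t)sample;
    return (int16_t)sample;
}
```

A decoder would keep an array of adx_channel_state, one element per channel, zero it once before the first frame, and never reset it at block boundaries.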
ADX supports a simple encryption scheme which XORs values from a linear congruential random number generator with the block scale values. This method is computationally inexpensive to decrypt (in keeping with ADX's real-time decoding) yet renders the encrypted files unusable without the key. The encryption is active when the Version Mark value in the header is 0x01F40408 (note that the final byte is 0x08 rather than 0x00 as in unencrypted files). As XOR is symmetric, the same method is used to decrypt as to encrypt. The encryption key is a set of three 16-bit values: the multiplier, increment, and start values for the linear congruential generator (its output is reduced to 15 bits, for example by masking with 0x7fff, to keep the values in the range of valid block scales). Typically all ADX files from a single game will use the same key.
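The scheme can be sketched as a generator that is stepped once per block while its current value is XORed into that block's scale. This is a minimal illustration under stated assumptions: the names are hypothetical, the scales are assumed to have been collected into an array in file order, the 15-bit reduction is implemented as a mask with 0x7FFF, and no special handling of silent blocks is modelled.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical container for the three 16-bit key values. */
typedef struct {
    uint16_t start; /* first value produced by the generator */
    uint16_t mult;  /* multiplier */
    uint16_t add;   /* increment */
} adx_key;

/* XOR every block scale with the generator output, in file order.
   Because XOR is symmetric, the same call encrypts and decrypts. */
static void adx_xor_scales(uint16_t *scales, size_t count, adx_key key)
{
    uint16_t x = key.start;
    for (size_t i = 0; i < count; i++) {
        scales[i] ^= x;
        /* Step the generator and keep the result in 15 bits (assumed mask). */
        x = (uint16_t)(((uint32_t)x * key.mult + key.add) & 0x7FFF);
    }
}
```

Calling adx_xor_scales twice with the same key restores the original scales, which is a quick sanity check for any implementation.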
The encryption method is vulnerable to known-plaintext attacks. If an unencrypted version of the same audio is known the random number stream can be easily retrieved and from it the key parameters can be determined, rendering every ADX encrypted with that same key decryptable. The encryption method attempts to make this more difficult by not encrypting silent blocks (with all sample nybbles equal to 0), as their scale is known to be 0.
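The known-plaintext attack starts from the observation that XORing an encrypted scale with the matching unencrypted scale yields the generator output for that block. The sketch below shows that step and a check of a candidate key against the recovered stream. The function names are hypothetical, the 15-bit reduction is assumed to be a mask with 0x7FFF, and the arrays are assumed to contain only blocks that were actually encrypted (silent blocks would need to be filtered out first).

```c
#include <stddef.h>
#include <stdint.h>

/* keystream[i] = encrypted[i] XOR plain[i] for each block scale. */
static void adx_recover_keystream(const uint16_t *encrypted,
                                  const uint16_t *plain,
                                  size_t count, uint16_t *keystream)
{
    for (size_t i = 0; i < count; i++)
        keystream[i] = (uint16_t)(encrypted[i] ^ plain[i]);
}

/* Returns 1 if a generator with the given parameters reproduces the
   recovered keystream exactly, 0 otherwise. */
static int adx_key_matches(const uint16_t *keystream, size_t count,
                           uint16_t start, uint16_t mult, uint16_t add)
{
    uint16_t x = start;
    for (size_t i = 0; i < count; i++) {
        if (keystream[i] != x)
            return 0;
        x = (uint16_t)(((uint32_t)x * mult + add) & 0x7FFF);
    }
    return 1;
}
```

Under these assumptions the start value is simply the first keystream entry, so only the multiplier and increment remain to be searched or solved for.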
Even if the encrypted ADX is the only sample available, it is possible to determine a key by assuming that the scale values of the decrypted ADX must fall within a "low range". This method does not necessarily find the key used to encrypt the file, however. While it can always determine keys that produce an apparently correct output, errors may exist undetected. This is due to the increasingly random distribution of the lower bits of the scale values, which becomes impossible to separate from the randomness added by the encryption.