Speex is a free software speech codec that may be used in VoIP applications and podcasts. Speex claims to be free of any patent restrictions and is licensed under the revised (3-clause) BSD license. It may be used with the Ogg container format or directly transmitted over UDP.
The Speex designers see their project as complementary to the Vorbis general-purpose audio compression project.
Speex is a lossy format, meaning quality is permanently degraded to reduce file size.
Unlike many other speech codecs, Speex is not targeted at cellular telephony but rather at Voice over IP (VoIP) and file-based compression. The design goals were to make a codec optimized for high-quality speech at a low bit rate. To achieve this, the codec supports multiple bit rates as well as ultra-wideband (32 kHz sampling rate), wideband (16 kHz sampling rate) and narrowband (telephone quality, 8 kHz sampling rate) operation. Designing for VoIP instead of cell phone use means that Speex must be robust to lost packets, but not to corrupted ones, since the User Datagram Protocol (UDP) ensures that packets either arrive unaltered or do not arrive at all. All this led to the choice of Code Excited Linear Prediction (CELP) as the encoding technique for Speex. One of the main reasons is that CELP has long proven that it can do the job and scales well to both low bit rates (as evidenced by DoD CELP at 4.8 kbit/s) and high bit rates (as with G.728 at 16 kbit/s).
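As a rough illustration of what "robust to lost packets, but not to corrupted ones" means for a receiver, the sketch below assumes each packet carries a sequence number (as RTP packets do; Speex itself does not define one). The function name `frames_in_order` and the sample payloads are hypothetical. The receiver only has to notice which frames are missing so that the decoder can conceal the gap; it never has to repair damaged data.

```python
from typing import Dict, Iterator, Optional, Tuple


def frames_in_order(
    received: Dict[int, bytes], first_seq: int, last_seq: int
) -> Iterator[Tuple[int, Optional[bytes]]]:
    """Yield (sequence number, payload) for every expected packet.

    A payload of None marks a packet that never arrived, so the caller can
    ask the decoder to conceal the gap instead of decoding real data.
    """
    for seq in range(first_seq, last_seq + 1):
        yield seq, received.get(seq)


# Hypothetical example: packets 2 and 5 were dropped by the network.
received = {1: b"frame-1", 3: b"frame-3", 4: b"frame-4", 6: b"frame-6"}
lost = [seq for seq, payload in frames_in_order(received, 1, 6) if payload is None]
print(lost)
```

Here `lost` contains the sequence numbers 2 and 5, the two frames for which a decoder would have to fill in plausible audio.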
The main characteristics can be summarized as follows:
- Free software/open-source, patent and royalty-free
- Integration of narrowband and wideband in the same bit-stream
- Wide range of bit rates available (from 2 kbit/s to 44 kbit/s)
- Dynamic bit rate switching and Variable bit-rate (VBR)
- Voice Activity Detection (VAD, integrated with VBR)
- Variable complexity
- Ultra-wideband mode at 32 kHz (up to 48 kHz)
- Intensity stereo encoding option
Sampling rate: Speex is mainly designed for three different sampling rates: 8 kHz (the same sampling rate used to transmit telephone calls), 16 kHz, and 32 kHz. These are respectively referred to as narrowband, wideband and ultra-wideband.
Quality: Speex encoding is controlled most of the time by a quality parameter that ranges from 0 to 10. In constant bit-rate (CBR) operation, the quality parameter is an integer, while for variable bit-rate (VBR), the parameter is a real (floating point) number.
Complexity (variable): With Speex, it is possible to vary the complexity allowed for the encoder. This is done by controlling how the search is performed with an integer ranging from 1 to 10, in a way that is similar to the -1 to -9 options of gzip compression utilities. For normal use, the noise level at complexity 1 is between 1 and 2 dB higher than at complexity 10, but the CPU requirements for complexity 10 are about five times higher than for complexity 1. In practice, the best trade-off is between complexity 2 and 4, though higher settings are often useful when encoding non-speech sounds like DTMF tones, or if encoding is not in real-time.
Variable Bit-Rate (VBR): Variable bit-rate (VBR) allows a codec to change its bit rate dynamically to adapt to the "difficulty" of the audio being encoded. In the case of Speex, sounds like vowels and high-energy transients require a higher bit rate to achieve good quality, while fricatives (e.g. s and f sounds) can be coded adequately with fewer bits. For this reason, VBR can achieve a lower bit rate for the same quality, or a better quality for a given bit rate. Despite its advantages, VBR has two main drawbacks: first, by only specifying quality, there is no guarantee about the final average bit-rate. Second, for some real-time applications like voice over IP (VoIP), what counts is the maximum bit-rate, which must be low enough for the communication channel.
Average Bit-Rate (ABR): Average bit-rate solves one of the problems of VBR, as it dynamically adjusts VBR quality in order to meet a specific target bit-rate. Because the quality/bit-rate is adjusted in real-time (open-loop), the global quality will be slightly lower than that obtained by encoding in VBR with exactly the right quality setting to meet the target average bit-rate.
Voice Activity Detection (VAD): When enabled, voice activity detection detects whether the audio being encoded is speech or silence/background noise. VAD is always implicitly activated when encoding in VBR, so the option is only useful in non-VBR operation. In this case, Speex detects non-speech periods and encodes them with just enough bits to reproduce the background noise. This is called "comfort noise generation" (CNG).
Discontinuous Transmission (DTX): Discontinuous transmission is an addition to VAD/VBR operation that makes it possible to stop transmitting completely when the background noise is stationary. In a file, 5 bits are used for each missing frame (corresponding to 250 bit/s).
Perceptual enhancement: Perceptual enhancement is a part of the decoder which, when turned on, tries to reduce (the perception of) the noise produced by the coding/decoding process. In most cases, perceptual enhancement makes the sound further from the original objectively (signal-to-noise ratio), but in the end it still sounds better (subjective improvement).
Algorithmic delay: Every codec introduces a delay in the transmission. For Speex, this delay is equal to the frame size, plus some amount of "look-ahead" required to process each frame. In narrowband operation (8 kHz), the delay is 30 ms, while for wideband (16 kHz), the delay is 34 ms. These values don't account for the CPU time it takes to encode or decode the frames.
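The DTX and delay figures above can be checked with a little arithmetic. The sketch below assumes a 20 ms frame, which the text does not state outright but which is the only frame length that makes 5 bits per frame equal 250 bit/s. The function names are illustrative and are not part of any Speex API.

```python
FRAME_MS = 20  # assumption: 20 ms frames, consistent with 5 bits/frame = 250 bit/s


def bits_per_second(bits_per_frame: int, frame_ms: int = FRAME_MS) -> float:
    """Bit rate produced by sending a fixed number of bits every frame."""
    frames_per_second = 1000 / frame_ms
    return bits_per_frame * frames_per_second


def look_ahead_ms(total_delay_ms: int, frame_ms: int = FRAME_MS) -> int:
    """Look-ahead implied by: algorithmic delay = frame size + look-ahead."""
    return total_delay_ms - frame_ms


dtx_rate = bits_per_second(5)      # DTX marker stored for each missing frame in a file
nb_look_ahead = look_ahead_ms(30)  # narrowband, 8 kHz
wb_look_ahead = look_ahead_ms(34)  # wideband, 16 kHz
print(dtx_rate, nb_look_ahead, wb_look_ahead)
```

Under that assumption the DTX markers cost 250 bit/s, and the stated delays imply a look-ahead of 10 ms in narrowband and 14 ms in wideband.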
Large application base
There is already a large base of applications supporting the Speex codec, from streaming applications like teleconferencing to video games and audio processing applications. Most of these are based on DirectShow on Microsoft Windows, or on OpenH323, for example. There are also plugins for players such as Winamp. KSP Sound Player (from version 2006.0.0.2) and foobar2000 also support Speex.
The media type for Speex is audio/ogg when contained in Ogg, and audio/x-speex when transported through RTP or without a container.
See the plugin and software page on the speex.org site for more details.
Microsoft's Xbox Live uses Speex for the headsets, as announced by Ralph Giles, the Theora codec maintainer, on LugRadio.
The latest Half-Life 1 engine and its mods use the voice_speex.dll codec for in-game VoIP. It is not enabled by default; server administrators must enable it by entering "sv_voicecodec voice_speex" in the server console, either through rcon or at the physical server computer. Speex provides much better quality than the default Miles voice codec.
The United States Army's Land Warrior system, designed by General Dynamics, also uses Speex for VoIP on an EPLRS radio designed by Raytheon.
In Sid Meier's Civilization 4, Speex is used to encode the descriptions of the technologies as read by Leonard Nimoy.
The VoIP program TeamSpeak offers Speex as one of its three available codecs, with bit rates ranging from 3.4 kbit/s to 25.9 kbit/s. Many servers prefer the Speex codec because of its good quality whether a room holds few or many people.
The Rockbox project uses Speex for its voice interface. It can also play Speex files on supported players, such as the Apple iPod or the iRiver H10.
The Vernier LabQuest handheld data acquisition device for science education uses Speex for voice annotations created by students and teachers using either the built-in or an external microphone.
The Flash Player 10 Beta from Adobe includes support for the Speex audio codec.
This article uses material from the Speex Codec Manual, which is copyright © Jean-Marc Valin and licensed under the terms of the GNU Free Documentation License.