We propose the Opus codec based on a linear prediction layer (LP) and an
We propose the Opus codec, based on a linear prediction layer (LP) and an
MDCT-based layer. The main idea behind the proposal is that
the speech low frequencies are usually more efficiently coded using
in speech, low frequencies are usually more efficiently coded using
linear prediction codecs (such as CELP variants), while music and higher speech frequencies
are more efficiently coded in the transform domain (e.g., MDCT). For low
sampling rates, the MDCT layer is not useful and only the LP-based layer is
...
...
@@ -90,15 +90,15 @@ as the sole symbolic representation of the codec.</t>
<t>While the symbolic representation is unambiguous and complete it is not
always the easiest way to understand the codec's operation. For this reason
this document also describes significant parts of the codec in english and
takes the opportunity to explain the rational behind many of the more
this document also describes significant parts of the codec in English and
takes the opportunity to explain the rationale behind many of the more
surprising elements of the design. These descriptions are intended to be
accurate and informative but the limitations of common english sometimes
accurate and informative, but the limitations of common English sometimes
result in ambiguity, so it is intended that the reader will always read
them alongside the symbolic representation. Numerous references to the
implementation are provided for this purpose. The descriptions sometimes
differs in ordering, or through mathematical simplification, from the
reference wherever such deviation made an explanation easier to understand.
differ from the reference in ordering or through mathematical simplification
wherever such deviation made an explanation easier to understand.
For example, the right shift and left shift operations in the reference
implementation are often described using division and multiplication in the text.
In general, the text is focused on the 'what' and 'why' while the symbolic
...
...
@@ -113,7 +113,7 @@ representation most clearly provides the 'how'.
In hybrid mode, each frame is coded first by the LP layer and then by the MDCT
layer. In the current prototype, the cutoff frequency is 8 kHz. In the MDCT
layer, all bands below 8 kHz are discarded, such that there is no coding
redundancy between the two layers. Also both layers use the same instance of
redundancy between the two layers. Also, both layers use the same instance of
the range coder to encode the signal, which ensures that no "padding bits" are
wasted. The hybrid approach makes it easy to support both constant bit-rate
(CBR) and variable bit-rate (VBR) coding. Although the SILK layer used is VBR,
...
...
@@ -152,10 +152,10 @@ There are three possible operating modes for the proposed prototype:
<t>A hybrid (LP+MDCT) mode for full-bandwidth speech at medium bitrates</t>
<t>An MDCT-only mode for very low delay speech transmission as well as music transmission.</t>
</list>
Each of these modes supports a number of difference frame sizes and sampling
Each of these modes supports a number of different frame sizes and sampling
rates. In order to distinguish between the various modes and configurations,
we define a single-byte table-of-contents (TOC) header that can be used in the transport layer
(e.g RTP) to signal this information. The following describes the proposed
(e.g., RTP) to signal this information. The following describes the proposed
TOC byte.
</t>
...
...
@@ -190,9 +190,9 @@ for a total of 16 configurations.
</t>
<t>
There is thus a total of 32 configurations, encoded in 5 bits. On bit is used to signal mono vs stereo, which leaves 2 bits for the number of frames per packets (codes 0 to 3):
There is thus a total of 32 configurations, encoded in 5 bits. One bit is used to signal mono vs stereo, which leaves 2 bits for the number of frames per packet (codes 0 to 3):
<list style="symbols">
<t>0: 1 frames in the packet</t>
<t>0: 1 frame in the packet</t>
<t>1: 2 frames in the packet, each with equal compressed size</t>
<t>2: 2 frames in the packet, with different compressed size</t>
<t>3: arbitrary number of frames in the packet</t>
...
...
@@ -200,7 +200,7 @@ There is thus a total of 32 configurations, encoded in 5 bits. On bit is used to
For code 2, the TOC byte is followed by the length of the first frame, encoded as described below.
For code 3, the TOC byte is followed by a byte encoding the number of frames in the packet, with the MSB indicating VBR. In the VBR case, the byte indicating the number of frames is followed by N-1 frame
lengths encoded as described below. As an additional limit, the audio duration contained
within a packet may not exceed 120 ms.
within a packet MUST NOT exceed 120 ms.
</t>
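The packing rules above can be sketched in Python. This is a minimal sketch, not the reference implementation. The text states only that the TOC byte carries 5 configuration bits, 1 stereo bit, and 2 frame-count bits, so the bit order used here (configuration in the top five bits, then the stereo bit, then the code in the low two bits) is an assumption, as are the function name `parse_toc` and the use of the low 7 bits of the count byte for the number of frames. Shifts and masks are written as division and modulo.

```python
def parse_toc(packet):
    """Split the leading bytes of a packet into TOC fields (assumed bit layout)."""
    toc = packet[0]
    code = toc % 4             # assumed low 2 bits: frames-per-packet code (0 to 3)
    stereo = (toc // 4) % 2    # assumed next bit: 0 = mono, 1 = stereo
    config = toc // 8          # assumed top 5 bits: one of 32 configurations

    vbr = False
    if code == 0:
        frames, lengths_following = 1, 0
    elif code == 1:
        frames, lengths_following = 2, 0   # equal sizes, no explicit length
    elif code == 2:
        frames, lengths_following = 2, 1   # length of the first frame follows
    else:
        count_byte = packet[1]
        vbr = count_byte // 128 == 1       # MSB of the count byte signals VBR
        frames = count_byte % 128          # assumption: low 7 bits hold the count
        lengths_following = frames - 1 if vbr else 0

    return {
        "config": config,
        "stereo": bool(stereo),
        "code": code,
        "frames": frames,
        "vbr": vbr,
        "lengths_following": lengths_following,
    }


# Example: code 3, VBR, three frames, so two explicit frame lengths follow.
info = parse_toc(bytes([3, 128 + 3]))
print(info["frames"], info["vbr"], info["lengths_following"])
```

The sketch does not enforce the 120 ms limit, since the frame duration depends on the configuration table, which is not reproduced here.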
<t>
...
...
@@ -215,7 +215,10 @@ The compressed size of the frames (if needed) is indicated -- usually -- with on
<t>
The maximum size representable is 255*4+255=1275 bytes. For 20 ms frames, that
represents a bit-rate of 510 kb/s, which is really the highest rate anyone would want
to use in stereo mode (beyond that point, lossless codecs would be more appropriate).
to use in stereo mode.
Beyond that point, lossless codecs would be more appropriate.
It is also roughly the maximum useful rate of the MDCT layer, as shortly
thereafter additional bits are no longer able to improve quality.
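As a quick check of the arithmetic above, the following sketch converts the maximum representable frame size into a bit-rate for a given frame duration. The function name `bitrate_kbps` is hypothetical and is used only for illustration.

```python
MAX_FRAME_BYTES = 255 * 4 + 255   # largest representable frame size, in bytes


def bitrate_kbps(frame_bytes, frame_ms):
    """Bit-rate in kb/s for frames of frame_bytes bytes lasting frame_ms ms."""
    # Bits per millisecond is numerically equal to kilobits per second.
    return frame_bytes * 8 / frame_ms


print(MAX_FRAME_BYTES)                     # 1275
print(bitrate_kbps(MAX_FRAME_BYTES, 20))   # 510.0
```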