Skip to content
GitLab
Explore
Sign in
Register
Primary navigation
Search or go to…
Project
Opus
Manage
Activity
Members
Labels
Plan
Issues
Issue boards
Milestones
Wiki
Code
Merge requests
Repository
Branches
Commits
Tags
Repository graph
Compare revisions
Snippets
Build
Pipelines
Jobs
Pipeline schedules
Artifacts
Deploy
Releases
Model registry
Operate
Environments
Monitor
Incidents
Analyze
Value stream analytics
Contributor analytics
CI/CD analytics
Repository analytics
Model experiments
Help
Help
Support
GitLab documentation
Compare GitLab plans
Community forum
Contribute to GitLab
Provide feedback
Terms and privacy
Keyboard shortcuts
?
Snippets
Groups
Projects
Show more breadcrumbs
Marcus Asteborg
Opus
Commits
049dd18a
Commit
049dd18a
authored
14 years ago
by
Gregory Maxwell
Committed by
Jean-Marc Valin
14 years ago
Browse files
Options
Downloads
Patches
Plain Diff
Draft update (allocation
parent
ac768f33
No related branches found
Branches containing commit
No related tags found
Tags containing commit
No related merge requests found
Changes
2
Hide whitespace changes
Inline
Side-by-side
Showing
2 changed files
celt
+1
-1
1 addition, 1 deletion
celt
doc/draft-ietf-codec-opus.xml
+233
-64
233 additions, 64 deletions
doc/draft-ietf-codec-opus.xml
with
234 additions
and
65 deletions
celt
@
7c328c2d
Subproject commit
dd2973dd6c440c3e48d18e9b010bbb2e916f528
6
Subproject commit
7c328c2d7a9bb28b837d935658ec05bfd7746aa
6
This diff is collapsed.
Click to expand it.
doc/draft-ietf-codec-opus.xml
+
233
−
64
View file @
049dd18a
<?xml version=
'
1.0
'
?>
<?xml version=
"
1.0
" encoding="utf-8"
?>
<!DOCTYPE rfc SYSTEM 'rfc2629.dtd'>
<!DOCTYPE rfc SYSTEM 'rfc2629.dtd'>
<?rfc toc="yes" symrefs="yes" ?>
<?rfc toc="yes" symrefs="yes" ?>
...
@@ -58,7 +58,7 @@ transmission over the Internet.
...
@@ -58,7 +58,7 @@ transmission over the Internet.
<section
anchor=
"introduction"
title=
"Introduction"
>
<section
anchor=
"introduction"
title=
"Introduction"
>
<t>
<t>
We propose the Opus codec based on a linear prediction layer (LP) and an
We propose the Opus codec based on a linear prediction layer (LP) and an
MDCT-based
enhancement
layer. The main idea behind the proposal is that
MDCT-based layer. The main idea behind the proposal is that
the speech low frequencies are usually more efficiently coded using
the speech low frequencies are usually more efficiently coded using
linear prediction codecs (such as CELP variants), while the higher frequencies
linear prediction codecs (such as CELP variants), while the higher frequencies
are more efficiently coded in the transform domain (e.g. MDCT). For low
are more efficiently coded in the transform domain (e.g. MDCT). For low
...
@@ -76,6 +76,37 @@ In this proposed prototype, the LP layer is based on the
...
@@ -76,6 +76,37 @@ In this proposed prototype, the LP layer is based on the
</t>
</t>
<t>
This is a work in progress.
</t>
<t>
This is a work in progress.
</t>
<t>
The primary normative part of this specification is provided by the source
code part of the document. The codec contains significant amounts of fixed-point
arithmetic which must be performed exactly, including all rounding considerations,
and so any useful specification must make extensive use of domain-specific
symbolic language to adequately define these operations. Additionally, any
conflict between the symbolic representation and the included reference
implementation must be resolved. For the practical reasons of compatibility and
testability it would be advantageous to give the reference implementation to
have priority in any disagreement. The C language is also one of the most
widely understood human-readable symbolic representations for machine
behavior. For these reasons this RFC utilizes the reference implementation
as the sole symbolic representation of the codec.
</t>
<t>
While the symbolic representation is unambiguous and complete it is not
always the easiest way to understand the codec's operation. For this reason
this document also describes significant parts of the codec in english and
takes the opportunity to explain the rational behind many of the more
surprising elements of the design. These descriptions are intended to be
accurate and informative but the limitations of common english sometimes
result in ambiguity, so it is intended that the reader will always read
them alongside the symbolic representation. Numerous references to the
implementation are provided for this purpose. The descriptions sometimes
differs in ordering, or through mathematical simplification, from the
reference wherever such deviation made an explanation easier to understand.
For example, the right shift and left shift operations in the reference
implementation are often described using division and multiplication in the text.
In general, the text is focused on the 'what' and 'why' while the symbolic
representation most clearly provides the 'how'.
</t>
</section>
</section>
<section
anchor=
"hybrid"
title=
"Opus Codec"
>
<section
anchor=
"hybrid"
title=
"Opus Codec"
>
...
@@ -619,73 +650,211 @@ This is implemented in unquant_energy_finalise() (quant_bands.c).
...
@@ -619,73 +650,211 @@ This is implemented in unquant_energy_finalise() (quant_bands.c).
</section>
<!-- Energy decode -->
</section>
<!-- Energy decode -->
<section
anchor=
"allocation"
title=
"Bit allocation"
>
<section
anchor=
"allocation"
title=
"Bit allocation"
>
<t>
Bit allocation is performed based only on information available to both
<t>
Many codecs transmit significant amounts of side information for
the encoder and decoder. The same calculations are performed in a bit-exact
the purpose of controlling bit allocation within a frame. Often this
manner in both the encoder and decoder to ensure that the result is always
side information controls bit usage indirectly and must be carefully
exactly the same. Any mismatch causes corruption of the decoded output.
selected to achieve the desired rate constraints.
</t>
The allocation is computed by compute_allocation() (rate.c),
which is used in both the encoder and the decoder.
</t>
<t>
The band-energy normalized structure of Opus MDCT mode ensures that a
constant bit allocation for the shape content of a band will result in a
<t>
For a given band, the bit allocation is nearly constant across
roughly constant tone to noise ratio, which provides for fairly consistent
frames that use the same number of bits for Q1, yielding a
perceptual performance. The effectiveness of this approach is the result of
pre-defined signal-to-mask ratio (SMR) for each band. Because the
two factors: The band energy, which is understood to be perceptually
bands each have a width of one Bark, this is equivalent to modeling the
important on its own, is always preserved regardless of the shape precision and because
masking occurring within each critical band, while ignoring inter-band
the constant tone-to-noise ratio implies a constant intra-band noise to masking ratio.
masking and tone-vs-noise characteristics. While this is not an
Intra-band masking is the strongest of the perceptual masking effects. This structure
optimal bit allocation, it provides good results without requiring the
means that the ideal allocation is more consistent from frame to frame than
transmission of any allocation information. Additionally, the encoder
it is for other codecs without an equivalent structure.
</t>
is able to signal alterations to the implicit allocation via
two means: There is an entropy coded trim parameter can be used to tilt the
<t>
Because the bit allocation is used to drive the decoding of the range-coder
allocation to favor low or high frequencies, and there is a boost parameter
stream it MUST be recovered exactly so that identical coding decisions are
which can be used to shift large amounts of additional precision into
made in the encoder and decoder. Any deviation from the reference's resulting
individual bands.
bit allocation will result in corrupted output, though implementers are
</t>
free to implement the procedure in any way which produces identical results.
</t>
<t>
Because all of the information required to decode a frame must be derived
<t>
from that frame alone in order to retain robustness to packet loss the
For every encoded or decoded frame, a target allocation must be computed
overhead of explicitly signaling the allocation would be considerable,
using the projected allocation. In the reference implementation this is
especially for low-latency (small frame size) applications,
performed by compute_allocation() (rate.c).
even though the allocation is relatively static.
</t>
The target computation begins by calculating the available space as the
number of eighth-bits which can be fit in the frame after Q1 is stored according
<t>
For this reason, in the MDCT mode Opus uses a primarily implicit bit
to the range coder (ec_tell_frac()) and reserving one eighth-bit.
allocation. The available bit-stream capacity is known in advance to both
Then the two projected prototype allocations whose sums multiplied by 8 are nearest
the encoder and decoder without additional signaling, ultimately from the
to that value are determined. These two projected prototype allocations are then interpolated
packet sizes expressed by a higher level protocol. Using this information
by finding the highest integer interpolation coefficient in the range 0-63
the codec interpolates an allocation from a hard-coded table.
</t>
such that the sum of the higher prototype times the coefficient divided by
64 plus the sum of the lower prototype multiplied is less than or equal to the
<t>
While the band-energy structure effectively models intra-band masking,
available eighth-bits. During the interpolation a maximum allocation
it ignores the weaker inter-band masking, band-temporal masking, and
in each band is imposed along with a threshold hard minimum allocation for
other less significant perceptual effects. While these effects can
each band.
often be ignored they can become significant for particular samples. One
Starting from the last coded band a binary decision is coded for each
mechanism available to encoders would be to simply increase the overall
band over the minimum threshold to determine if that band should instead
rate for these frames, but this is not possible in a constant rate mode
recieve only the minimum allocation. This process stops at the first
and can be fairly inefficient. As a result three explicitly signaled
non-minimum band, the first band to recieve an explicitly coded boost,
mechanisms are provided to alter the implicit allocation:
</t>
or the first band in the frame, whichever comes first.
The reference implementation performs this step in interp_bits2pulses()
<t>
using a binary search for the interpolation. (rate.c).
<list
style=
"symbols"
>
</t>
<t>
Band boost
</t>
<t>
Allocation trim
</t>
<t>
<t>
band skipping
</t>
Because the computed target will sometimes be somewhat smaller than the
</list>
available space, the excess space is divided by the number of bands, and this amount
is added equally to each band which was not forced to the minimum value.
</t>
<t>
The allocation target is separated into a portion used for fine energy
and a portion used for the Spherical Vector Quantizer (PVQ). The fine energy
quantizer operates in whole-bit steps and is allocated based on an offset
fraction of the total usable space. Excess bits above the maximums are
left unallocated and placed into the rolling balance maintained during
the quantization process.
</t>
</t>
<t>
The first of these mechanisms, band boost, allows an encoder to boost
the allocation in specific bands. The second, allocation trim, works by
biasing the overall allocation towards higher or lower frequency bands. The third, band
skipping, selects which low-precision high frequency bands
will be allocated no shape bits at all.
</t>
<t>
In stereo mode there are also two additional parameters
potentially coded as part of the allocation procedure: a parameter to allow the
selective elimination of allocation for the 'side' in jointly coded bands,
and a flag to deactivate joint coding. These values are not signaled if
they would be meaningless in the overall context of the allocation.
</t>
<t>
Because every signaled adjustment increases overhead and implementation
complexity none were included speculatively: The reference encoder makes use
of all of these mechanisms. While the decision logic in the reference was
found to be effective enough to justify the overhead and complexity further
analysis techniques may be discovered which increase the effectiveness of these
parameters. As with other signaled parameters, encoder is free to choose the
values in any manner but unless a technique is known to deliver superior
perceptual results the methods used by the reference implementation should be
used.
</t>
<t>
The process of allocation consists of the following steps: determining the per-band
maximum allocation vector, decoding the boosts, decoding the tilt, determining
the remaining capacity the frame, searching the mode table for the
entry nearest but not exceeding the available space (subject to the tilt, boosts, band
maximums, and band minimums), linear interpolation, reallocation of
unused bits with concurrent skip decoding, determination of the
fine-energy vs shape split, and final reallocation. This process results
in an shape allocation per-band (in 1/8th bit units), a per-band fine-energy
allocation (in 1 bit per channel units), a set of band priorities for
controlling the use of remaining bits at the end of the frame, and a
remaining balance of unallocated space which is usually zero except
at very high rates.
</t>
<t>
The maximum allocation vector is an approximation of the maximum space
which can be used by each band for a given mode. The value is
approximate because the shape encoding is variable rate (due
to entropy coding of splitting parameters). Setting the maximum too low reduces the
maximum achievable quality in a band while setting it too high
may result in waste: bit-stream capacity available at the end
of the frame which can not be put to any use. The maximums
specified by the codec reflect the average maximum. In the reference
the maximums are provided partially computed form, in order to fit in less
memory, as a static table (XXX cache.caps). Implementations are expected
to simply use the same table data but the procedure for generating
this table is included in rate.c as part of compute_pulse_cache().
</t>
<t>
To convert the values in cache.caps into the actual maximums: First
set nbBands to the maximum number of bands for this mode and stereo to
zero if stereo is not in use and one otherwise. For each band assign N
to the number of MDCT bins covered by the band (for one channel), set LM
to the shift value for the frame size (e.g. 0 for 120, 1 for 240, 3 for 480)
then set i to nbBands*(2*LM+stereo). Then set the maximum for the band to
the i-th index of cache.caps + 64 and multiply by the number of channels
in the current frame (one or two) and by N then divide the result by 4
using truncating integer division. The resulting vector will be called
cap[]. The elements fit in signed 16 bit integers but do not fit in 8 bits.
This procedure is implemented in the reference in the function init_caps() in celt.c.
</t>
<t>
The band boosts are represented by a series of binary symbols which
are coded with very low probability. Each band can potentially be boosted
multiple times, subject to the frame actually having enough room to obey
the boost and having enough room to code the boost symbol. The default
coding cost for a boost starts out at six bits, but subsequent boosts
in a band cost only a single bit and every time a band is boosted the
initial cost is reduced (down to a minimum of two). Since the initial
cost of coding a boost is 6 bits the coding cost of the boost symbols when
completely unused is 0.48 bits/frame for a 21 band mode (21*-log2(1-1/2^6)).
</t>
<t>
To decode the band boosts: First set 'dynalloc_logp' to 6, the initial
amount of storage required to signal a boost in bits, 'total_bits' to the
size of the frame in 8th-bits, 'total_boost' to zero, and 'tell' to the total number
of 8th bits decoded
so far. For each band from the coding start (0 normally, but 17 in hybrid mode)
to the coding end (which changes depending on the signaled bandwidth): Set 'width'
to the number of MDCT bins in this band for all channels. Take the larger of width
and 64, then the minimum of that value and the width times eight and set 'quanta'
to the result. This represents a boost step size of six bits subject to limits
of 1/bit/sample and 1/8th bit/sample. Set 'boost' to zero and 'dynalloc_loop_logp'
to dynalloc_logp. While dynalloc_loop_log (the current worst case symbol cost) in
8th bits plus tell is less than total_bits plus total_boost and boost is less than cap[] for this
band: Decode a bit from the bitstream with a with dynalloc_loop_logp as the cost
of a one, update tell to reflect the current used capacity, if the decoded value
is zero break the loop otherwise add quanta to boost and total_boost, subtract quanta from
total_bits, and set dynalloc_loop_log to 1. When the while loop finishes
boost contains the boost for this band. If boost is non-zero and dynalloc_logp
is greater than 2 decrease dynalloc_logp. Once this process has been
execute on all bands the band boosts have been decoded. This procedure
is implemented around line 2352 of celt.c.
</t>
<t>
At very low rates it's possible that there won't be enough available
space to execute the inner loop even once. In these cases band boost
is not possible but its overhead is completely eliminated. Because of the
high cost of band boost when activated a reasonable encoder should not be
using it at very low rates. The reference implements its dynalloc decision
logic at around 1269 of celt.c
</t>
<t>
The allocation trim is a integer value from 0-10. The default value of
5 indicates no trim. The trim parameter is entropy coded in order to
lower the coding cost of less extreme adjustments. Values lower than
5 bias the allocation towards lower frequencies and values above 5
bias it towards higher frequencies. Like other signaled parameters, signaling
of the trim is gated so that it is not included if there is insufficient space
available in the bitstream. To decode the trim first set
the trim value to 5 then iff the count of decoded 8th bits so far (ec_tell_frac)
plus 48 (6 bits) is less than or equal to the total frame size in 8th
bits minus total_boost (a product of the above band boost procedure) then
decode the trim value using the inverse CDF {127, 126, 124, 119, 109, 87, 41, 19, 9, 4, 2, 0}.
</t>
<t>
Stereo parameters
</t>
<t>
Anti-collapse reservation
</t>
<t>
The allocation computation first begins by setting up some initial conditions.
'total' is set to the available remaining 8th bits, computed by taking the
size of the coded frame times 8 and subtracting ec_tell_frac(). From this value one (8th bit)
is subtracted to assure that the resulting allocation will be conservative. 'anti_collapse_rsv'
is set to 8 (8th bits) iff the frame is a transient, LM is greater than 1, and total is
greater than or equal to (LM+2) * 8. Total is then decremented by anti_collapse_rsv and clamped
to be equal to or greater than zero. 'skip_rsv' is set to 8 (8th bits) if total is greater than
8, otherwise it is zero. Total is then decremented by skip_rsv. This reserves space for the
final skipping flag.
</t>
<t>
If the current frame is stereo intensity_rsv is set to the conservative log2 in 8th bits
of the number of coded bands for this frame (given by the table LOG2_FRAC_TABLE). If
intensity_rsv is greater than total then intensity_rsv is set to zero otherwise total is
decremented by intensity_rsv, and if total is still greater than 8 dual_stereo_rsv is
set to 8 and total is decremented by dual_stereo_rsv.
</t>
<t>
The allocation process then computes a vector representing the hard minimum amounts allocation
any band will receive for shape. This minimum is higher than the technical limit of the PVQ
process, but very low rate allocations produce excessively an sparse spectrum and these bands
are better served by having no allocation at all. For each coded band set thresh[band] to
twenty-four times the number of MDCT bins in the band and divide by 16. If 8 times the number
of channels is greater, use that instead. This sets the minimum allocation to one bit per channel
or 48 128th bits per MDCT bin, whichever is greater. The band size dependent part of this
value is not scaled by the channel count because at the very low rates where this limit is
applicable there will usually be no bits allocated to the side.
</t>
<t>
The previously decoded allocation trim is used to derive a vector of per-band adjustments,
'trim_offsets[]'. For each coded band take the alloc_trim and subtract 5 and LM then multiply
the result by number of channels, the number MDCT bins in the shortest frame size for this mode,
the number remaining bands, 2^LM, and 8. Then divide this value by 64. Finally, if the
number of MDCT bins in the band per channel is only one 8 times the number of channels is subtracted
in order to diminish the allocation by one bit because width 1 bands receive greater benefit
from the coarse energy coding.
</t>
</section>
</section>
<section
anchor=
"PVQ-decoder"
title=
"Shape Decoder"
>
<section
anchor=
"PVQ-decoder"
title=
"Shape Decoder"
>
<t>
<t>
In each band, the normalized
<spanx
style=
"emph"
>
shape
</spanx>
is encoded
In each band, the normalized
<spanx
style=
"emph"
>
shape
</spanx>
is encoded
...
...
This diff is collapsed.
Click to expand it.
Preview
0%
Loading
Try again
or
attach a new file
.
Cancel
You are about to add
0
people
to the discussion. Proceed with caution.
Finish editing this message first!
Save comment
Cancel
Please
register
or
sign in
to comment