<t>The MDCT implementation has no special characteristics. The

input is a windowed signal (after pre-emphasis) of 2*N samples and the output is N

frequency-domain samples. A <spanx style="emph">low-overlap</spanx> window is used to reduce the algorithmic delay.

It is derived from a basic (full overlap) window that is the same as the one used in the Vorbis codec: W(n)=sin(pi/2*[sin(pi/2*(n+.5)/L)]^2). The low-overlap window is created by zero-padding the basic window and inserting ones in the middle, such that the resulting window still satisfies power complementarity. The MDCT is computed in mdct_forward() (mdct.c), which includes the windowing operation and a scaling of 2/N.

</t>

</section>

<sectionanchor="normalization"title="Bands and Normalization">

<t>

The MDCT output is divided into bands that are designed to match the ear's critical bands,

with the exception that each band has to be at least 3 bins wide. For each band, the encoder

computes the energy that will later be encoded. Each band is then normalized by the

square root of the <spanx style="strong">non-quantized</spanx> energy, such that each band now forms a unit vector X.

The energy and the normalization are computed by compute_band_energies().

It is important to quantize the energy with sufficient resolution because

any energy quantization error cannot be compensated for at a later

stage. Regardless of the resolution used for encoding the shape of a band,

it is perceptually important to preserve the energy in each band. CELT uses a

coarse-fine strategy for encoding the energy in the base-2 log domain,

as implemented in quant_bands.c.</t>

<sectionanchor="coarse-energy"title="Coarse energy quantization">

<t>

The coarse quantization of the energy uses a fixed resolution of

6 dB and is the only place where entropy coding is used.

To minimize the bitrate, prediction is applied both in time (using the previous frame)

and in frequency (using the previous bands). The 2-D z-transform of

the prediction filter is: A(z_l, z_b)=(1-a*z_l^-1)*(1-z_b^-1)/(1-b*z_b^-1),

where the subscripts l and b of z denote the frame index and the band index, respectively. The prediction coefficients are

a=0.8 and b=0.7 when not using intra energy and a=b=0 when using intra energy.

The time-domain prediction is based on the final fine quantization of the previous

frame, while the frequency domain (within the current frame) prediction is based

on coarse quantization only (because the fine quantization has not been computed

yet). We approximate the ideal

probability distribution of the prediction error using a Laplace distribution. The

coarse energy quantization is performed by quant_coarse_energy()

(quant_bands.c).

</t>

<t>

The Laplace distribution for each band is defined by a 16-bit (Q15) decay parameter.

Thus, the value 0 has a frequency count of p[0]=2*(16384*(16384-decay)/(16384+decay)). The

values +/- i each have a frequency count p[i] = (p[i-1]*decay)>>14. The value of p[i] is always

rounded down (to avoid exceeding 32768 as the sum of all frequency counts), so it is possible

for the sum to be less than 32768. In that case additional values with a frequency count of 1 are encoded. The signed values corresponding to symbols 0, 1, 2, 3, 4, ...

are [0, +1, -1, +2, -2, ...]. The encoding of the Laplace-distributed values is

implemented in ec_laplace_encode() (laplace.c).

</t>
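<t>
The frequency counts and the symbol ordering described above can be sketched as follows. The sketch assumes truncating integer arithmetic and a decay value below 16384; it is an illustration only and does not reproduce ec_laplace_encode().
</t>
<figure><artwork><![CDATA[
```c
#include <assert.h>

/* Frequency count of magnitude i (i >= 0):
   p[0] = 2*(16384*(16384-decay)/(16384+decay))
   p[i] = (p[i-1]*decay) >> 14                                  */
static int laplace_freq(int i, int decay)
{
    assert(i >= 0 && decay >= 0 && decay < 16384);
    int p = 2 * (16384 * (16384 - decay) / (16384 + decay));
    while (i-- > 0)
        p = (p * decay) >> 14;
    return p;
}

/* Maps symbols 0, 1, 2, 3, 4, ... to 0, +1, -1, +2, -2, ... */
static int symbol_to_value(int symbol)
{
    assert(symbol >= 0);
    return (symbol & 1) ? (symbol + 1) / 2 : -(symbol / 2);
}

/* Sum of all non-zero counts: p[0] plus one copy of p[i] for +i
   and one for -i. */
static int laplace_total(int decay)
{
    int total = laplace_freq(0, decay);
    for (int i = 1; laplace_freq(i, decay) > 0; i++)
        total += 2 * laplace_freq(i, decay);
    return total;
}
```
]]></artwork></figure>
<t>
Since every count is rounded down, laplace_total() never exceeds 32768; the shortfall is what leaves room for the additional values with a frequency count of 1.
</t>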

<!-- FIXME: bit budget consideration -->

</section><!-- coarse energy -->

<sectionanchor="fine-energy"title="Fine energy quantization">

<t>

After the coarse energy quantization and encoding, the bit allocation is computed

(<xreftarget="allocation"></xref>) and the number of bits to use for refining the

energy quantization is determined for each band. Let B_i be the number of fine energy bits

for band i; the refinement is an integer f in the range [0,2^B_i-1]. The mapping between f

and the correction applied to the coarse energy is equal to (f+1/2)/2^B_i - 1/2. Fine

energy quantization is implemented in quant_fine_energy()

(quant_bands.c).

</t>
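<t>
The mapping between f and the correction can be illustrated as follows. The decoder-side formula is taken from the text above; the encoder-side choice of f (truncation followed by clamping) is an assumption made for this sketch, and the function names are illustrative.
</t>
<figure><artwork><![CDATA[
```c
#include <assert.h>

/* Encoder side (assumed): refinement index f in [0, 2^B - 1] for a
   coarse quantization error err expected to lie in [-0.5, 0.5). */
static int fine_index(double err, int B)
{
    assert(B >= 0 && B < 16);
    int levels = 1 << B;
    int f = (int)((err + 0.5) * levels);
    if (f < 0)
        f = 0;
    if (f > levels - 1)
        f = levels - 1;
    return f;
}

/* Decoder side: correction applied to the coarse energy,
   (f + 1/2) / 2^B - 1/2. */
static double fine_correction(int f, int B)
{
    assert(B >= 0 && B < 16 && f >= 0 && f < (1 << B));
    return (f + 0.5) / (1 << B) - 0.5;
}
```
]]></artwork></figure>
<t>
For example, with B_i = 2 the four possible corrections are -0.375, -0.125, 0.125 and 0.375, so the remaining error is at most 1/8 of a coarse step.
</t>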

<t>

If any bits are unused at the end of the encoding process, these bits are used to

increase the resolution of the fine energy encoding in some bands. Priority is given

to the bands for which the allocation (<xref target="allocation"></xref>) was rounded

down. At the same level of priority, lower bands are encoded first. Refinement bits

are added until there are no unused bits. This is implemented in quant_energy_finalise().

</t>

<t>

The best PVQ codeword is encoded as a uniformly-distributed integer value

by encode_pulses() (cwrs.c).

The codeword is converted to a unique index in the same way as specified in

<xreftarget="PVQ"></xref>. The indexing is based on the calculation of V(N,K) (denoted N(L,K) in <xreftarget="PVQ"></xref>), which is the number of possible combinations of K pulses

in N samples. The number of combinations can be computed recursively as

V(N,K) = V(N+1,K) + V(N,K+1) + V(N+1,K+1), with V(N,0) = 1 and V(0,K) = 0, K != 0.

There are many different ways to compute V(N,K), including pre-computed tables and direct

use of the recursive formulation. The reference implementation applies the recursive

formulation one line (or column) at a time to save on memory use,

along with an alternate,

univariate recurrence to initialise an arbitrary line, and direct

polynomial solutions for small N. All of these methods are

equivalent, and have different trade-offs in speed, memory usage, and

code size. Implementations MAY use any methods they like, as long as

they are equivalent to the mathematical definition.

</t>
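<t>
As a minimal sketch of the "one line at a time" approach, the function below evaluates V(N,K) with the recurrence V(N,K) = V(N-1,K) + V(N,K-1) + V(N-1,K-1), V(N,0) = 1 and V(0,K) = 0 for K != 0, keeping a single row in memory. The name pvq_count() and the bound MAX_K are assumptions made for this example; the sketch does not reproduce cwrs.c, and the caller is expected to ensure that the result fits in 32 bits.
</t>
<figure><artwork><![CDATA[
```c
#include <assert.h>
#include <stdint.h>

#define MAX_K 64   /* arbitrary bound for this sketch */

static uint32_t pvq_count(int N, int K)
{
    uint32_t row[MAX_K + 1];
    assert(N >= 0 && K >= 0 && K <= MAX_K);
    row[0] = 1;                       /* V(n,0) = 1 for every n   */
    for (int k = 1; k <= K; k++)
        row[k] = 0;                   /* V(0,k) = 0 for k != 0    */
    for (int n = 1; n <= N; n++) {
        uint32_t diag = row[0];       /* V(n-1,k-1)               */
        for (int k = 1; k <= K; k++) {
            uint32_t up = row[k];     /* V(n-1,k)                 */
            /* row[k-1] already holds V(n,k-1) at this point. */
            row[k] = up + row[k - 1] + diag;
            diag = up;
        }
    }
    return row[K];
}
```
]]></artwork></figure>
<t>
For instance, pvq_count(3, 2) counts the integer vectors of dimension 3 whose absolute values sum to 2: six with a single entry of +/-2 and twelve with two entries of +/-1.
</t>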

<t>

The indexing computations are performed using 32-bit unsigned integers. For large codebooks,

32-bit integers are not sufficient. Instead of using 64-bit integers (or more), the encoding

is made slightly sub-optimal by splitting each band into two equal (or near-equal) vectors of

size (N+1)/2 and N/2, respectively. The number of pulses in the first half, K1, is first encoded as an

integer in the range [0,K]. Then, two codebooks are encoded with V((N+1)/2, K1) and V(N/2, K-K1).

The split operation is performed recursively, in case one (or both) of the split vectors

still requires more than 32 bits. For compatibility reasons, the handling of codebooks of more

than 32 bits MUST be implemented with the splitting method, even if 64-bit arithmetic is available.

</t>
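<t>
The splitting rule can be sketched as follows. The first function decides whether V(N,K) fits in 32 bits using saturating 64-bit row updates (the 64-bit type is used only to detect overflow in this illustration); the second counts how many codebooks a given pulse vector ends up being split into. Both names and the bound MAX_K are assumptions for this example, and the encoding of K1 is not shown.
</t>
<figure><artwork><![CDATA[
```c
#include <assert.h>
#include <stdint.h>

#define MAX_K 128   /* arbitrary bound for this sketch */

/* Returns 1 if V(N,K) fits in an unsigned 32-bit integer. */
static int fits_in_32_bits(int N, int K)
{
    uint64_t row[MAX_K + 1];
    const uint64_t limit = UINT32_MAX;
    assert(N >= 0 && K >= 0 && K <= MAX_K);
    row[0] = 1;
    for (int k = 1; k <= K; k++)
        row[k] = 0;
    for (int n = 1; n <= N; n++) {
        uint64_t diag = row[0];
        for (int k = 1; k <= K; k++) {
            uint64_t up = row[k];
            uint64_t v = up + row[k - 1] + diag;
            row[k] = v > limit ? limit + 1 : v;   /* saturate */
            diag = up;
        }
    }
    return row[K] <= limit;
}

/* Number of codebooks used for the pulse vector y[0..N-1] when every
   codebook larger than 32 bits is split into halves of size (N+1)/2
   and N/2. */
static int count_codebooks(const int *y, int N)
{
    int K = 0;
    for (int i = 0; i < N; i++)
        K += y[i] < 0 ? -y[i] : y[i];
    if (N <= 1 || fits_in_32_bits(N, K))
        return 1;
    int n1 = (N + 1) / 2;
    return count_codebooks(y, n1) + count_codebooks(y + n1, N - n1);
}
```
]]></artwork></figure>
<t>
Because V(N,K) is non-decreasing in both arguments, saturating an intermediate value cannot turn an oversized codebook into one that appears to fit.
</t>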

</section>

</section>

<sectionanchor="stereo"title="Stereo support">

<t>

When encoding a stereo stream, some parameters are shared across the left and right channels, while others are transmitted separately for each channel, or jointly encoded. Only one copy of the flags for the features, transients and pitch (pitch period and gains) is transmitted. The coarse and fine energy parameters are transmitted separately for each channel. Both the coarse energy and fine energy (including the remaining fine bits at the end of the stream) have the left and right bands interleaved in the stream, with the left band encoded first.

</t>

<t>

The main difference between mono and stereo coding is the PVQ coding of the normalized vectors. In stereo mode, a normalized mid-side (M-S) encoding is used. Let L and R be the normalized vectors of a certain band for the left and right channels, respectively. The mid and side vectors are computed as M=L+R and S=L-R and no longer have unit norm.

</t>

<t>

From M and S, an angular parameter theta=2/pi*atan2(||S||, ||M||) is computed. The theta parameter is converted to a Q14 fixed-point parameter itheta, which is quantized on a scale from 0 to 1 with an interval of 2^-qb, where qb = (b-2*(N-1)*(40-log2_frac(N,4)))/(32*(N-1)), b is the number of bits allocated to the band, and log2_frac() is defined in cwrs.c. From here on, the value of itheta MUST be treated in a bit-exact manner since

both the encoder and decoder rely on it to infer the bit allocation.

</t>

<t>

Let m=M/||M|| and s=S/||S||; m and s are separately encoded with the PVQ encoder described in <xref target="pvq"></xref>. The number of bits allocated to m and s depends on the value of itheta. The number of bits allocated to coding m is obtained by:

</t>

<t>where bitexact_cos() is a fixed-point cosine approximation that MUST be bit-exact with the reference implementation

in mathops.h. The spectral folding operation is performed independently for the mid and side vectors.</t>

</section>

<sectionanchor="synthesis"title="Synthesis">

<t>

After all the quantization is completed, the quantized energy is used along with the

quantized normalized band data to resynthesize the MDCT spectrum. The inverse MDCT (<xref target="inverse-mdct"></xref>) and the weighted overlap-add are applied and the signal is stored in the <spanx style="emph">synthesis buffer</spanx> so it can be used for pitch prediction.

The encoder MAY omit this step of the processing if it knows that it will not be using

the pitch predictor for the next few frames. If the de-emphasis filter (<xref target="inverse-mdct"></xref>) is applied to this resynthesized

signal, then the output will be the same (within numerical precision) as the decoder's output.

Each CELT frame can be encoded in a different number of octets, making it possible to vary the bitrate at will. This property can be used to implement source-controlled variable bitrate (VBR). Support for VBR is OPTIONAL for the encoder, but a decoder MUST be prepared to decode a stream that changes its bitrate dynamically. The method used to vary the bitrate in VBR mode is left to the implementor, as long as each frame can be decoded by the reference decoder.