diff --git a/doc/ietf/draft-valin-celt-codec.xml b/doc/ietf/draft-valin-celt-codec.xml index 0288c014af7ff3dfdf5f13dc56cc84deaf24a813..59a5425d6922ffa292d7213006c8137e73c96d04 100644 --- a/doc/ietf/draft-valin-celt-codec.xml +++ b/doc/ietf/draft-valin-celt-codec.xml @@ -343,7 +343,7 @@ CELT uses an entropy coder based upon <xref target="range-coding"></xref>, which is itself a rediscovery of the FIFO arithmetic code introduced by <xref target="coding-thesis"></xref>. It is very similar to arithmetic encoding, except that encoding is done with digits in any base instead of with bits, -so it is faster when using larger bases (e.g.: an octet). All of the +so it is faster when using larger bases (i.e.: an octet). All of the calculations in the range coder must use bit-exact integer arithmetic. </t> @@ -519,7 +519,7 @@ The CELT codec has several optional features that can be switched on or off in e <section anchor="intra" title="Intra-frame energy (I)"> <t> -CELT uses prediction to encode the energy in each frequency band. In order to make frames independent, however, it is possible to disable the part of the prediction that depends on previous frames. This is called <spanx style="emph">intra-frame energy</spanx> and requires around 12 more bits per frame. It is enabled with the <spanx style="emph">I</spanx> bit (Table. <xref target="flags-encoding">flags-encoding</xref>). The use of intra energy is OPTIONAL and the decision method is left to the implementor. The reference code describes one way of deciding which frames would benefit most from having their energy encoded without prediction. The intra_decision() (<xref target="quant_bands.c">quant_bands.c</xref>) function looks for frames where the log-spectral distance between consecutive frames is more than 9 dB. When such a difference is found between two frames, the next frame (not the one for which the difference is detected) is marked encoded with intra energy. 
The one-frame delay is to ensure that when a frame containing a transient event is lost, then the next frame will be decoded without accumulating error from the lost frame. +CELT uses prediction to encode the energy in each frequency band. In order to make frames independent, however, it is possible to disable the part of the prediction that depends on previous frames. This is called <spanx style="emph">intra-frame energy</spanx> and requires around 12 more bits per frame. It is enabled with the <spanx style="emph">I</spanx> bit (Table. <xref target="flags-encoding">flags-encoding</xref>). The use of intra energy is OPTIONAL and the decision method is left to the implementor. The reference code describes one way of deciding which frames would benefit most from having their energy encoded without prediction. The intra_decision() (<xref target="quant_bands.c">quant_bands.c</xref>) function looks for frames where the log-spectral distance between consecutive frames is more than 9 dB. When such a difference is found between two frames, the next frame (not the one for which the difference is detected) is marked to be encoded with intra energy. The one-frame delay ensures that when a frame containing a transient is lost, the next frame is decoded without accumulating error from the lost frame. </t> </section> @@ -708,7 +708,9 @@ all integer codevectors y of N dimensions that satisfy sum(abs(y(j))) = K. <t> In bands where neither pitch nor folding is used, the PVQ is used to encode the unit vector that results from the normalization in -<xref target="normalization"></xref> directly. " In the case where a pitch +<xref target="normalization"></xref> directly. Given a PVQ codevector y, +the unit vector X is obtained as X = y/||y||, where ||.|| denotes the +L2 norm. 
In the case where a pitch prediction or a folding vector p is used, the quantized unit vector X' becomes: </t> <t>X' = p' + g_f * y,</t> @@ -790,11 +792,11 @@ V(N,K) = V(N+1,K) + V(N,K+1) + V(N+1,K+1), with V(N,0) = 1 and V(0,K) = 0, K != There are many different ways to compute V(N,K), including pre-computed tables and direct use of the recursive formulation. The reference implementation applies the recursive formulation one line (or column) at a time to save on memory use, -along with an alternate, -univariate recurrence to initialise an arbitrary line, and direct -polynomial solutions for small N. All of these methods are -equivalent, and have different trade-offs in speed, memory usage, and -code size. Implementations MAY use any methods they like, as long as +along with an alternate, +univariate recurrence to initialise an arbitrary line, and direct +polynomial solutions for small N. All of these methods are +equivalent, and have different trade-offs in speed, memory usage, and +code size. Implementations MAY use any methods they like, as long as they are equivalent to the mathematical definition. </t> @@ -815,7 +817,7 @@ than 32 bits MUST be implemented with the splitting method, even if 64-bit arith <section anchor="stereo" title="Stereo support"> <t> -When encoding a stereo stream, some parameters are shared across the left and right channels, while others are transmitted separately for each channel, or jointly encoded. Only one copy of the flags for the transients and pitch (pitch period and gains) features are transmitted. The coarse and fine energy parameters are transmitted separately for each channel. Both the coarse energy and fine energy (including the remaining fine bits at the end of the stream) have the left and right bands interleaved in the stream, with the left band encoded first. 
+When encoding a stereo stream, some parameters are shared across the left and right channels, while others are transmitted separately for each channel, or jointly encoded. Only one copy of the flags for the features, transients and pitch (pitch period and gains) are transmitted. The coarse and fine energy parameters are transmitted separately for each channel. Both the coarse energy and fine energy (including the remaining fine bits at the end of the stream) have the left and right bands interleaved in the stream, with the left band encoded first. </t> <t> @@ -903,74 +905,74 @@ to the application that a problem has occurred. <section anchor="range-decoder" title="Range Decoder"> <t> The range decoder extracts the symbols and integers encoded using the range encoder in -<xref target="range-encoder"></xref>. The range decoder maintains an internal -state vector composed of the two-tuple (dif,rng), representing the -difference between the high end of the current range and the actual -coded value, and the size of the current range, respectively. Both -dif and rng are 32-bit unsigned integer values. rng is initialized to -2^7. dif is initialized to rng minus the top 7 bits of the first -input octet. Then the range is immediately normalized, using the +<xref target="range-encoder"></xref>. The range decoder maintains an internal +state vector composed of the two-tuple (dif,rng), representing the +difference between the high end of the current range and the actual +coded value, and the size of the current range, respectively. Both +dif and rng are 32-bit unsigned integer values. rng is initialized to +2^7. dif is initialized to rng minus the top 7 bits of the first +input octet. Then the range is immediately normalized, using the procedure described in the following section. </t> <section anchor="decoding-symbols" title="Decoding Symbols"> <t> - Decoding symbols is a two-step process. 
The first step determines - a value fs that lies within the range of some symbol in the current - context. The second step updates the range decoder state with the - three-tuple (fl,fh,ft) corresponding to that symbol, as defined in + Decoding symbols is a two-step process. The first step determines + a value fs that lies within the range of some symbol in the current + context. The second step updates the range decoder state with the + three-tuple (fl,fh,ft) corresponding to that symbol, as defined in <xref target="encoding-symbols"></xref>. </t> <t> The first step is implemented by ec_decode() (<xref target="rangedec.c">rangedec.c</xref>), - and computes fs = ft-min((dif-1)/(rng/ft)+1,ft), where ft is - the sum of the frequency counts in the current context, as described - in <xref target="encoding-symbols"></xref>. The divisions here are exact integer division. + and computes fs = ft-min((dif-1)/(rng/ft)+1,ft), where ft is + the sum of the frequency counts in the current context, as described + in <xref target="encoding-symbols"></xref>. The divisions here are exact integer division. </t> <t> - In the reference implementation, a special version of ec_decode() - called ec_decode_bin() (<xref target="rangeenc.c">rangeenc.c</xref>) is defined using - the parameter ftb instead of ft. It is mathematically equivalent to - calling ec_decode() with ft = (1<<ftb), but avoids one of the - divisions. + In the reference implementation, a special version of ec_decode() + called ec_decode_bin() (<xref target="rangeenc.c">rangeenc.c</xref>) is defined using + the parameter ftb instead of ft. It is mathematically equivalent to + calling ec_decode() with ft = (1<<ftb), but avoids one of the + divisions. </t> <t> - The decoder then identifies the symbol in the current context - corresponding to fs; i.e., the one whose three-tuple (fl,fh,ft) - satisfies fl <= fs < fh. 
This tuple is used to update the decoder - state according to dif = dif - (rng/ft)*(ft-fh), and if fl is greater - than zero, rng = (rng/ft)*(fh-fl), or otherwise rng = rng - (rng/ft)*(ft-fh). After this update, the range is normalized. + The decoder then identifies the symbol in the current context + corresponding to fs; i.e., the one whose three-tuple (fl,fh,ft) + satisfies fl <= fs < fh. This tuple is used to update the decoder + state according to dif = dif - (rng/ft)*(ft-fh), and if fl is greater + than zero, rng = (rng/ft)*(fh-fl), or otherwise rng = rng - (rng/ft)*(ft-fh). After this update, the range is normalized. </t> <t> - To normalize the range, the following process is repeated until - rng > 2^23. First, rng is set to (rng<8)&0xFFFFFFFF. Then the next - 8 bits of input are read into sym, using the remaining bit from the - previous input octet as the high bit of sym, and the top 7 bits of the - next octet for the remaining bits of sym. If no more input octets - remain, zero bits are used instead. Then, dif is set to - (dif<<8)-sym&0xFFFFFFFF (i.e., using wrap-around if the subtraction - overflows a 32-bit register). Finally, if dif is larger than 2^31, - dif is then set to dif - 2^31. This process is carried out by - ec_dec_normalize() (<xref target="rangedec.c">rangedec.c</xref>). + To normalize the range, the following process is repeated until + rng > 2^23. First, rng is set to (rng<<8)&0xFFFFFFFF. Then the next + 8 bits of input are read into sym, using the remaining bit from the + previous input octet as the high bit of sym, and the top 7 bits of the + next octet for the remaining bits of sym. If no more input octets + remain, zero bits are used instead. Then, dif is set to + (dif<<8)-sym&0xFFFFFFFF (i.e., using wrap-around if the subtraction + overflows a 32-bit register). Finally, if dif is larger than 2^31, + dif is then set to dif - 2^31. This process is carried out by + ec_dec_normalize() (<xref target="rangedec.c">rangedec.c</xref>). 
</t> </section> <section anchor="decoding-ints" title="Decoding Uniformly Distributed Integers"> <t> - Functions ec_dec_uint() or ec_dec_bits() are based on ec_decode() and - decode one of N equiprobable symbols, each with a frequency of 1, - where N may be as large as 2^32-1. Because ec_decode() is limited to - a total frequency of 2^16-1, this is done by decoding a series of - symbols in smaller contexts. + Functions ec_dec_uint() or ec_dec_bits() are based on ec_decode() and + decode one of N equiprobable symbols, each with a frequency of 1, + where N may be as large as 2^32-1. Because ec_decode() is limited to + a total frequency of 2^16-1, this is done by decoding a series of + symbols in smaller contexts. </t> <t> - ec_dec_bits() (<xref target="entdec.c">entdec.c</xref>) is defined, like + ec_dec_bits() (<xref target="entdec.c">entdec.c</xref>) is defined, like ec_decode_bin(), to take a single parameter ftb, with ftb < 32. and ftb < 32, and produces an ftb-bit decoded integer value, t, initialized to zero. While ftb is greater than 8, it decodes the next 8 most significant bits of the integer, s = ec_decode_bin(8), updates - the decoder state with the 3-tuple (s,s+1,256), adds those bits to + the decoder state with the 3-tuple (s,s+1,256), adds those bits to the current value of t, t = t<<8 | s, and subtracts 8 from ftb. Then it decodes the remaining bits of the integer, s = ec_decode_bin(ftb), updates the decoder state with the 3 tuple (s,s+1,1<<ftb), and adds @@ -995,15 +997,15 @@ procedure described in the following section. <section anchor="decoder-tell" title="Current Bit Usage"> <t> - The bit allocation routines in CELT need to be able to determine a - conservative upper bound on the number of bits that have been used - to decode from the current frame thus far. This drives allocation - decisions which must match those made in the encoder. 
This is - computed in the reference implementation to fractional bit precision - by the function ec_dec_tell() (<xref target="rangedec.c">rangedec.c</xref>). Like all - operations in the range decoder, it must be implemented in a - bit-exact manner, and must produce exactly the same value returned by - ec_enc_tell() after encoding the same symbols. + The bit allocation routines in CELT need to be able to determine a + conservative upper bound on the number of bits that have been used + to decode from the current frame thus far. This drives allocation + decisions which must match those made in the encoder. This is + computed in the reference implementation to fractional bit precision + by the function ec_dec_tell() (<xref target="rangedec.c">rangedec.c</xref>). Like all + operations in the range decoder, it must be implemented in a + bit-exact manner, and must produce exactly the same value returned by + ec_enc_tell() after encoding the same symbols. </t> </section>
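Reviewer note: the decoder steps described in the "Range Decoder" and "Decoding Symbols" hunks above (initialization, the fs computation, the state update, and normalization) can be sketched as follows. This is a minimal illustration written in Python for brevity, not the reference C code. The class name RangeDecoder and its method names are hypothetical. Two assumptions are made: the shift applied to rng during normalization is taken to be a left shift by 8 bits, matching the handling of dif, and an empty input is treated as if it were all zero bits. It has not been checked for bit-exactness against rangedec.c.

```python
MASK32 = 0xFFFFFFFF  # dif and rng are 32-bit unsigned values


class RangeDecoder:
    """Illustrative sketch of the (dif, rng) decoder state (hypothetical class)."""

    def __init__(self, data):
        self.data = bytes(data)
        # Assumption: an empty input behaves like an all-zero stream.
        first = self.data[0] if self.data else 0
        self.pos = 1
        self.rem = first & 1                 # leftover low bit of the first octet
        self.rng = 1 << 7                    # rng is initialized to 2^7
        self.dif = self.rng - (first >> 1)   # rng minus the top 7 bits
        self.normalize()

    def _next_sym(self):
        # Zero bits are used once the input octets run out.
        octet = self.data[self.pos] if self.pos < len(self.data) else 0
        self.pos += 1
        # High bit: leftover bit of the previous octet; low 7 bits: top of this one.
        sym = (self.rem << 7) | (octet >> 1)
        self.rem = octet & 1
        return sym

    def normalize(self):
        # Repeat until rng > 2^23.
        while self.rng <= (1 << 23):
            self.rng = (self.rng << 8) & MASK32   # assumed to be a left shift
            sym = self._next_sym()
            self.dif = ((self.dif << 8) - sym) & MASK32
            if self.dif > (1 << 31):
                self.dif -= 1 << 31

    def decode(self, ft):
        # First step: fs = ft - min((dif-1)/(rng/ft) + 1, ft), integer division.
        # The mask mirrors unsigned 32-bit wrap-around for dif - 1.
        r = self.rng // ft
        return ft - min(((self.dif - 1) & MASK32) // r + 1, ft)

    def update(self, fl, fh, ft):
        # Second step: narrow the state to the symbol (fl, fh, ft), then normalize.
        r = self.rng // ft
        s = r * (ft - fh)
        self.dif -= s
        self.rng = r * (fh - fl) if fl > 0 else self.rng - s
        self.normalize()
```

With a context of four equiprobable symbols (ft = 4), an all-zero input decodes fs = 0 and an all-ones input decodes fs = 3, which matches the description of dif as the distance from the high end of the range to the coded value. The caller is responsible for mapping fs to the symbol whose (fl,fh,ft) satisfies fl <= fs < fh before calling update().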