@@ -141,73 +141,73 @@ There are three possible operating modes for the proposed prototype:
...
@@ -141,73 +141,73 @@ There are three possible operating modes for the proposed prototype:
</list>
</list>
Each of these modes supports a number of different frame sizes and sampling
Each of these modes supports a number of different frame sizes and sampling
rates. In order to distinguish between the various modes and configurations,
rates. In order to distinguish between the various modes and configurations,
we need to define a simple header that can be used in the transport layer
we define a single-byte table-of-contents (TOC) header that can be used in the transport layer
(e.g., RTP) to signal this information. The following describes the proposed
(e.g., RTP) to signal this information. The following describes the proposed
header.
TOC byte.
</t>
</t>
<t>
<t>
The LP mode supports the following configurations (numbered from 00000...01011 in binary):
The LP mode supports the following configurations (numbered from 0 to 11):
<list style="symbols">
<list style="symbols">
<t>8 kHz: 10, 20, 40, 60 ms (00000...00011)</t>
<t>8 kHz: 10, 20, 40, 60 ms (0..3)</t>
<t>12 kHz: 10, 20, 40, 60 ms (00100...00111)</t>
<t>12 kHz: 10, 20, 40, 60 ms (4..7)</t>
<t>16 kHz: 10, 20, 40, 60 ms (01000...01011)</t>
<t>16 kHz: 10, 20, 40, 60 ms (8..11)</t>
</list>
</list>
for a total of 12 configurations.
for a total of 12 configurations.
</t>
</t>
<t>
<t>
The hybrid mode supports the following configurations (numbered from 01100...01111):
The hybrid mode supports the following configurations (numbered from 12 to 15):
<list style="symbols">
<list style="symbols">
<t>32 kHz: 10, 20 ms (01100...01101)</t>
<t>32 kHz: 10, 20 ms (12..13)</t>
<t>48 kHz: 10, 20 ms (01110...01111)</t>
<t>48 kHz: 10, 20 ms (14..15)</t>
</list>
</list>
for a total of 4 configurations.
for a total of 4 configurations.
</t>
</t>
<t>
<t>
The MDCT-only mode supports the following configurations (numbered from 10000...11111):
The MDCT-only mode supports the following configurations (numbered from 16 to 31):
<list style="symbols">
<list style="symbols">
<t>8 kHz: 2.5, 5, 10, 20 ms (10000...10011)</t>
<t>8 kHz: 2.5, 5, 10, 20 ms (16..19)</t>
<t>16 kHz: 2.5, 5, 10, 20 ms (10100...10111)</t>
<t>16 kHz: 2.5, 5, 10, 20 ms (20..23)</t>
<t>32 kHz: 2.5, 5, 10, 20 ms (11000...11011)</t>
<t>32 kHz: 2.5, 5, 10, 20 ms (24..27)</t>
<t>48 kHz: 2.5, 5, 10, 20 ms (11100...11111)</t>
<t>48 kHz: 2.5, 5, 10, 20 ms (28..31)</t>
</list>
</list>
for a total of 16 configurations.
for a total of 16 configurations.
</t>
</t>
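<t>
The numbering of the three configuration lists above can be sketched as a small lookup. This is only an illustration: the name describe_config is hypothetical, and the sketch assumes that, within each sampling rate, frame sizes are numbered in the order in which they are listed.
</t>

```python
# Hypothetical helper, not part of the draft: maps a configuration
# number (0..31) to (mode, sampling rate in kHz, frame size in ms).
# Assumption: frame sizes are numbered in the order listed in the text.
LP_RATES, LP_FRAMES = (8, 12, 16), (10, 20, 40, 60)
HYBRID_RATES, HYBRID_FRAMES = (32, 48), (10, 20)
MDCT_RATES, MDCT_FRAMES = (8, 16, 32, 48), (2.5, 5, 10, 20)


def describe_config(config):
    """Return (mode, rate_khz, frame_ms) for a 5-bit configuration number."""
    if config not in range(32):
        raise ValueError("configuration number must be in 0..31")
    if config >= 16:
        mode, rates, frames, base = "MDCT-only", MDCT_RATES, MDCT_FRAMES, 16
    elif config >= 12:
        mode, rates, frames, base = "hybrid", HYBRID_RATES, HYBRID_FRAMES, 12
    else:
        mode, rates, frames, base = "LP", LP_RATES, LP_FRAMES, 0
    # Each sampling rate owns a contiguous run of len(frames) numbers.
    rate_index, frame_index = divmod(config - base, len(frames))
    return mode, rates[rate_index], frames[frame_index]
```

<t>
Under these assumptions, describe_config(13) yields the hybrid mode at 32 kHz with 20 ms frames.
</t>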
<t>
<t>
There is thus a total of 32 configurations, so 5 bits are necessary to
There is thus a total of 32 configurations, encoded in 5 bits. One bit is used to signal mono vs stereo, which leaves 2 bits for the number of frames per packet (codes 0 to 3):
indicate the mode, frame size and sampling rate (MFS). This leaves 3 bits for the number of frames per packet (codes 0 to 7):
<list style="symbols">
<list style="symbols">
<t>0-2: 1-3 frames in the packet, each with equal compressed size</t>
<t>0: 1 frame in the packet</t>
<t>3: arbitrary number of frames in the packet, each with equal compressed size (one size needs to be encoded)</t>
<t>1: 2 frames in the packet, each with equal compressed size</t>
<t>4-5: 2-3 frames in the packet, with different compressed sizes, which need to be encoded (except the last one)</t>
<t>2: arbitrary number of frames in the packet, each with equal compressed size</t>
<t>6: arbitrary number of frames in the packet, with different compressed sizes, each of which needs to be encoded</t>
<t>3: arbitrary number of frames in the packet, with different compressed sizes</t>
<t>7: The first frame has this MFS, but others have different MFS. Each compressed size needs to be encoded.</t>
</list>
</list>
When code 7 is used and the last frames of a packet have the same MFS, it is
For codes 2 and 3, the TOC byte is followed by the number of frames in the packet.
allowed to switch to another code for them.
For code 3, the byte indicating the number of frames is followed by N-1 frame
lengths encoded as described below. As an additional limit, the audio duration contained
within a packet may not exceed 120 ms.
</t>
</t>
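<t>
A minimal sketch of reading the single-byte TOC with the 2-bit frame-count codes (0 to 3) might look as follows. The bit layout is an assumption, since the text does not state it: the configuration is taken from the five most significant bits, followed by the stereo bit, with the frame-count code in the two least significant bits. The names parse_toc and within_duration_limit are hypothetical.
</t>

```python
# Hypothetical sketch, not part of the draft.
# ASSUMED bit layout (not specified in the text):
#   bits 7..3 = configuration, bit 2 = stereo flag, bits 1..0 = frame-count code.
MAX_PACKET_MS = 120  # limit on the audio duration in one packet, from the text


def parse_toc(packet):
    """Split the TOC byte and read the frame count that may follow it."""
    toc = packet[0]
    config = toc >> 3
    stereo = (toc >> 2) % 2 == 1
    code = toc % 4
    if code == 0:
        frame_count, offset = 1, 1
    elif code == 1:
        frame_count, offset = 2, 1
    else:
        # Codes 2 and 3: the TOC byte is followed by the number of frames.
        frame_count, offset = packet[1], 2
    return {"config": config, "stereo": stereo, "code": code,
            "frame_count": frame_count, "offset": offset}


def within_duration_limit(frame_ms, frame_count):
    """Check the 120 ms limit on the audio carried by one packet."""
    return MAX_PACKET_MS >= frame_ms * frame_count
```

<t>
For code 3, the returned offset points at the first of the N-1 frame lengths that follow the frame-count byte.
</t>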
<t>
<t>
The compressed size of the frames (if needed) is indicated -- usually -- with one byte, with the following meaning:
The compressed size of the frames (if needed) is indicated -- usually -- with one byte, with the following meaning:
<list style="symbols">
<list style="symbols">
<t>0: No frame (DTX or lost packet)</t>
<t>0: No frame (DTX or lost packet)</t>
<t>1-251: Size of the frame in bytes</t>
<t>1-251: Size of the frame in bytes</t>
<t>252-255: A second byte is needed. The total size is (size[1]*4)+(size[0]%4)+252</t>
<t>252-255: A second byte is needed. The total size is (size[1]*4)+size[0]</t>
</list>
</list>
</t>
</t>
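<t>
The one- or two-byte size field described above can be sketched as an encoder/decoder pair. The function names are hypothetical, and the sketch follows the (size[1]*4)+size[0] form. Because 252 is a multiple of 4, that form gives the same result as (size[1]*4)+(size[0]%4)+252 whenever size[0] is in 252..255.
</t>

```python
# Hypothetical sketch of the frame-size field, not part of the draft.
def decode_frame_size(data, pos=0):
    """Return (size_in_bytes, next_position); size 0 means no frame."""
    first = data[pos]
    if first >= 252:
        # Two-byte form: total size is (size[1]*4) + size[0].
        return data[pos + 1] * 4 + first, pos + 2
    return first, pos + 1


def encode_frame_size(size):
    """Encode a size in 0..1275 using one byte, or two bytes from 252 up."""
    if size not in range(1276):
        raise ValueError("size must be in 0..1275")
    if 252 > size:
        return bytes([size])
    first = 252 + size % 4  # always lands in 252..255
    return bytes([first, (size - first) // 4])
```

<t>
With both bytes set to 255, the decoder returns 255*4+255 = 1275 bytes.
</t>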
<t>
<t>
The maximum size representable is 255*4+3+252=1275 bytes. For 20 ms frames, that
The maximum size representable is 255*4+255=1275 bytes. For 20 ms frames, that
represents a bit-rate of 510 kb/s, which is really the highest rate anyone would want
represents a bit-rate of 510 kb/s, which is really the highest rate anyone would want
to use in stereo mode (beyond that point, lossless codecs would be more appropriate).
to use in stereo mode (beyond that point, lossless codecs would be more appropriate).
</t>
</t>
<section anchor="examples" title="Examples">
<section anchor="examples" title="Examples">
<t>
<t>
Simplest case: one packet
Simplest case: one narrowband mono 20-ms SILK frame