Enhanced spatialization can lead to high-frequency swishiness after mixing down to mono

Enhanced spatialization during encoding can result in high-frequency swishiness after mixdown to mono. The problem becomes evident when the following conditions are true:

The encoder captures far-field (across-the-room) audio in stereo, and
the decoder side mixes the audio down to mono.

This can occur whenever the receiving side has a mono speaker. It is likely in multi-party chats where a mix of devices is used for hands-free communication.

Reproducing the problem:

Capture far-field (across-the-room) speech with stereo mics.
Encode and decode with Opus.
Mix down to mono.
Listen.

Expected result:

Mono audio without audio artifacts.

Observed result:

High-frequency "swishiness" in the mono signal, caused by the encoder reversing the phase of one channel in high-frequency bands.

One (suboptimal) workaround is to use only the left or the right channel as the mono signal instead of mixing the two channels down. However, the mono signal then loses any information carried only by the other channel.

The following patch eliminates the artifacts described above by making the encoder skip the phase reversal it otherwise uses for additional spatialization:

$ git diff
diff --git a/celt/bands.c b/celt/bands.c
index 62f0ee7..f493a81 100644
--- a/celt/bands.c
+++ b/celt/bands.c
@@ -794,7 +794,7 @@ static void compute_theta(struct band_ctx *ctx, struct split_ctx *sctx,
    } else if (stereo) {
       if (encode)
       {
-         inv = itheta > 8192;
+         inv = 0; /* Don't reverse phase: it leads to high-frequency "swishiness" on mixdown to mono. */
          if (inv)
          {
             int j;