Fix OOB read in fixed-point NEON intrinsics.
In its main loop, xcorr_kernel_neon_fixed() read one more sample from y[] than it needed, in order to allow the use of vector loads. Unlike the native asm in celt_pitch_xcorr_arm.s, however, the loop condition did not exit early enough to prevent this from overrunning the end of the array. Additionally, the tail loop always read one value beyond what it needed.
This patch fixes the loop condition on the main loop. This makes the tail section run even for lengths that are a multiple of 8 (e.g., on fully half the multiplies for usages like celt_fir() or celt_iir() with an order of 16, which is common). So rather than try to fix the tail loop, we replace it with a non-looping adaptation of the native asm, which continues to use vector loads as much as possible for the remaining elements (and also does not read ahead past the end of the y[] array).
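As a rough illustration of the loop-condition issue (a simplified model, not the actual Opus code): assume the kernel needs y[0..len+2], loads y[0..3] before the loop, and on each 8-sample iteration loads one 8-wide vector y[j+4..j+11]. The function names and the `strict` flag below are hypothetical and exist only for this sketch. It tracks which indices would be read instead of performing the reads.

```c
#include <assert.h>

/* Highest valid index of y[]: the kernel computes 4 correlation lags
   over len samples, so y[] holds len + 3 elements. */
static int last_valid_y_index(int len)
{
    return len + 2;
}

/* Number of samples consumed by the 8-wide main loop.
   strict == 0 models a "j + 8 <= len" condition,
   strict != 0 models a "j + 8 < len" condition (assumed form of the fix). */
static int main_loop_samples(int len, int strict)
{
    int j = 0;
    assert(len > 0);
    while (strict ? (j + 8 < len) : (j + 8 <= len)) {
        j += 8;
    }
    return j;
}

/* Highest y[] index the modeled main loop would read. */
static int max_y_index_main_loop(int len, int strict)
{
    int j = 0;
    int max_idx = 3; /* initial 4-wide load of y[0..3] */
    assert(len > 0);
    while (strict ? (j + 8 < len) : (j + 8 <= len)) {
        max_idx = j + 11; /* 8-wide load of y[j+4..j+11] */
        j += 8;
    }
    return max_idx;
}
```

In this model, len = 8 with the non-strict condition reads y[11] although the last valid index is 10. With the strict condition the main loop never reads past y[len+2], but for len = 16 it handles only 8 samples and leaves the other 8 to the tail, which is why the speed of the tail section matters.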
Overall slowdown of test_opus_encode on a Raspberry Pi 5 Model B Rev 1.0 is 0.12% with this patch, vs. 0.13% for fixing the existing tail loop instead.