Xiph.Org / rnnoise / Merge requests / !2

Create SIMD-accelerated version of `compute_gru` function

Open Casey Primozic requested to merge Ameobea/rnnoise:optimize into master Jul 19, 2021
  • Overview 1
  • Commits 11
  • Changes 6

I did some CPU profiling of pulseeffects / easyeffects, which includes this library as a dependency for its noise reduction functionality. The profile showed that the `compute_gru` function was the hottest one in the whole application.

The changes in this pull request add a new function, `compute_gru_avx`, which has the same functionality as `compute_gru` but uses SIMD intrinsics to accelerate it dramatically. The main changes compute the sums in the GRU function 8 at a time and use FMAs (fused multiply-adds) to combine each multiplication and addition into a single operation. Because an FMA rounds only once, this also increases accuracy. Processing 8 elements per iteration also reduces the overhead of loop counter checking, which my profiling led me to believe was the most expensive part of the whole function before these changes.
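To illustrate the idea of summing 8 products at a time with FMA, here is a minimal sketch. It is not the code from this merge request: `dot_f32` is a hypothetical helper, and the layout of the real GRU loops is not reproduced. The AVX path is only compiled when the compiler defines `__AVX2__` and `__FMA__` (as GCC and Clang do when those features are enabled); otherwise only the scalar loop runs.

```c
#include <stddef.h>
#if defined(__AVX2__) && defined(__FMA__)
#include <immintrin.h>
#endif

/* Hypothetical helper (not from the merge request): dot product of two
   float arrays of length n. */
float dot_f32(const float *a, const float *b, size_t n)
{
    size_t i = 0;
    float sum = 0.0f;
#if defined(__AVX2__) && defined(__FMA__)
    {
        __m256 acc = _mm256_setzero_ps();   /* 8 partial sums */
        float lanes[8];
        int k;
        for (; i + 8 <= n; i += 8) {
            __m256 va = _mm256_loadu_ps(a + i);
            __m256 vb = _mm256_loadu_ps(b + i);
            /* acc = va * vb + acc, with a single rounding per lane */
            acc = _mm256_fmadd_ps(va, vb, acc);
        }
        /* Reduce the 8 partial sums to one scalar. */
        _mm256_storeu_ps(lanes, acc);
        for (k = 0; k < 8; k++)
            sum += lanes[k];
    }
#endif
    /* Scalar tail (or the whole array when AVX2/FMA is unavailable). */
    for (; i < n; i++)
        sum += a[i] * b[i];
    return sum;
}
```

With 8 lanes per iteration, the loop counter is checked once per 8 multiply-adds instead of once per element, which is the loop-overhead reduction described above.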

Additionally, the weights (which are stored as 8-bit signed integers) are converted to floating point 8 at a time using SIMD for an additional speedup. The DNN-based noise reduction used in Amazon Chime also uses weights quantized to 8 bits, and a technical writeup of that system explains that SIMD instructions are used to run that network very efficiently even on lightweight smartphone CPUs. This change follows the same approach.
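As a rough sketch of converting 8-bit weights 8 at a time, the following widens eight signed bytes to 32-bit integers, converts them to floats, and applies a scale factor. The function name `weights_to_f32` and the `scale` parameter are assumptions made for illustration; the actual merge request may fuse the conversion directly into the multiply-accumulate loop instead.

```c
#include <stddef.h>
#if defined(__AVX2__)
#include <immintrin.h>
#endif

/* Hypothetical helper (not from the merge request): convert n signed
   8-bit weights to floats, multiplying each by `scale`. */
void weights_to_f32(const signed char *w, float *out, size_t n, float scale)
{
    size_t i = 0;
#if defined(__AVX2__)
    {
        const __m256 vscale = _mm256_set1_ps(scale);
        for (; i + 8 <= n; i += 8) {
            /* Load 8 bytes into the low half of a 128-bit register. */
            __m128i bytes = _mm_loadl_epi64((const __m128i *)(w + i));
            /* Sign-extend 8 x int8 to 8 x int32, then convert to float. */
            __m256i ints = _mm256_cvtepi8_epi32(bytes);
            __m256 vals = _mm256_cvtepi32_ps(ints);
            _mm256_storeu_ps(out + i, _mm256_mul_ps(vals, vscale));
        }
    }
#endif
    /* Scalar tail (or the whole array when AVX2 is unavailable). */
    for (; i < n; i++)
        out[i] = (float)w[i] * scale;
}
```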

I also changed the compiler flags in the build configuration: `-O3` is now used instead of `-O2`, which yielded some benefit on my machine, and `-march=native` is passed, which enables the SIMD instructions used. If these CPU features aren't available, the SIMD code is disabled at build time.
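One common way to disable a SIMD path at build time is to key it off the macros the compiler predefines. With `-march=native`, GCC and Clang define `__AVX2__` and `__FMA__` only when the build machine supports those features. The sketch below assumes that approach; the macro `HAVE_AVX_GRU` and the function `gru_impl_name` are hypothetical names, and the merge request's actual detection logic may differ.

```c
#include <stddef.h>

/* Hypothetical macro: 1 when the compiler was allowed to emit AVX2 and
   FMA instructions (e.g. via -march=native on a capable CPU). */
#if defined(__AVX2__) && defined(__FMA__)
#define HAVE_AVX_GRU 1
#else
#define HAVE_AVX_GRU 0
#endif

/* Hypothetical helper: reports which implementation this build would
   select. A real build would call the chosen function instead. */
const char *gru_impl_name(void)
{
#if HAVE_AVX_GRU
    return "compute_gru_avx";
#else
    return "compute_gru";
#endif
}
```

A trade-off of this design is that a binary built with `-march=native` is tied to the build machine's CPU features, so a runtime check may suit distributed binaries better.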

After the optimizations, the `compute_gru_avx` function uses only ~4.4% of the total CPU time, compared to ~19.63% for `compute_gru` before, which is roughly a 4.45x speedup.

Here is a Compiler Explorer link that shows the full assembly produced for the optimized `compute_gru_avx` function: https://c.godbolt.org/z/xzEGxj8ne

In testing on my own machine, pulseeffects with a `librnnoise.so` built from the optimized code seems to work identically to before, with reduced CPU usage for the application.


Let me know if you think this is something that you'd like to get merged into the project. I'm happy to make any changes necessary. There may be a better or different way you'd like to handle the CPU feature detection, and I'd love suggestions on how to handle that.

Edited Jul 19, 2021 by Casey Primozic
Source branch: optimize