
Add QS8_QC8W GEMM/IGEMM microkernels for Wasm Relaxed Unsigned and Signed Dot Product #6505

Merged · 4 commits · Jul 18, 2024

Conversation

fanchenkong1 (Contributor)

This PR is related to issue #6454.

This change adds qs8_qc8w gemm/igemm microkernels for the Wasm relaxed SIMD dot product on signed and unsigned bytes. The new microkernels can take advantage of the AVX-VNNI instructions recently supported in V8 on x64 devices.
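The primitive the new kernels build on can be sketched in scalar C. This is a reference model of my own (not code from the PR), assuming the per-lane semantics of `i32x4.relaxed_dot_i8x16_i7x16_add_s` from the relaxed-simd proposal: each 32-bit lane accumulates four products of a signed int8 and a 7-bit (0..127) byte, with no intermediate saturation, which is what `vpdpbusd` provides on x64:

```c
#include <stdint.h>

// Scalar sketch (an illustration, not the V8 implementation) of the per-lane
// semantics of i32x4.relaxed_dot_i8x16_i7x16_add_s: for each 32-bit lane,
// four products of a signed int8 from `a` and a 7-bit (0..127) value from
// `b` are summed into a 32-bit accumulator with no intermediate saturation.
static void relaxed_dot_i8x16_i7x16_add_ref(
    const int8_t a[16], const uint8_t b[16], int32_t acc[4]) {
  for (int lane = 0; lane < 4; lane++) {
    int32_t sum = 0;
    for (int j = 0; j < 4; j++) {
      sum += (int32_t) a[4 * lane + j] * (int32_t) b[4 * lane + j];
    }
    acc[lane] += sum;
  }
}
```

The "relaxed" part of the instruction is that behavior is implementation-defined when a `b` byte has its high bit set, which is exactly why the PR adds the runtime checks discussed below.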

@fanchenkong1 fanchenkong1 marked this pull request as draft June 6, 2024 01:51
@fanchenkong1 fanchenkong1 marked this pull request as ready for review June 6, 2024 01:58
const v128_t va${M} = wasm_v128_xor(wasm_v128_load(a${M}), vsign_mask);
a${M} += 16;

$for N in range(4):
Contributor:

Ideally, make all loops respect NR=4 and try some different sizes. See how the NEON dot-product kernels handle NR.

Contributor Author:

Thanks for the suggestion! I will take a look at the NEON dot-product kernels.

const v128_t vb${N} = wasm_v128_load((const int8_t*) w + ${N * 16});

$for M in range(MR):
vacc${M}x${N} = wasm_i32x4_relaxed_dot_i8x16_i7x16_add(vb${N}, va${M}, vacc${M}x${N});
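The `wasm_v128_xor` with `vsign_mask` above is the standard signed-to-unsigned bias trick. A minimal scalar sketch of it (my illustration, not PR code; `vsign_mask` is assumed here to hold 0x80 in every byte):

```c
#include <stdint.h>

// Sketch of the sign-mask trick: XOR-ing a signed int8 with 0x80 yields the
// same value reinterpreted as an unsigned byte shifted by +128, which lets
// an unsigned-by-signed dot product (e.g. vpdpbusd) consume signed
// activations.
static uint8_t bias_to_unsigned(int8_t a) {
  return (uint8_t) (a ^ 0x80);  // equals (int) a + 128 for all int8 values
}
```

The +128 offset this adds to every activation contributes a constant per output channel (128 times the column sum of the weights), so it can be compensated ahead of time; the exact compensation here is defined by the PR's weight-packing code.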
Contributor:

If the weights are signed int8, will this overflow if the implementation is vpmaddubsw?

Contributor Author:

A vpmaddubsw implementation will not pass the CheckWAsmUSDOT check, as it has different saturation behavior.
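The saturation difference can be made concrete with a scalar sketch (my illustration of the two x64 lowerings, assuming vpmaddubsw's documented pairwise saturating int16 behavior):

```c
#include <stdint.h>

// Scalar sketch of the saturation difference discussed above. vpmaddubsw
// sums adjacent u8 x s8 products into a SATURATING int16; vpdpbusd sums
// four products straight into an int32 with no intermediate saturation.
static int16_t sat16(int32_t x) {
  if (x > INT16_MAX) return INT16_MAX;
  if (x < INT16_MIN) return INT16_MIN;
  return (int16_t) x;
}

// Pairwise u8 x s8 product sum, vpmaddubsw-style (saturating).
static int16_t maddubs_pair(uint8_t a0, int8_t b0, uint8_t a1, int8_t b1) {
  return sat16((int32_t) a0 * b0 + (int32_t) a1 * b1);
}
```

With a = 255 and b = 127 in both pairs, the exact sum is 64770; vpmaddubsw clamps it to 32767 while a vpdpbusd-style int32 accumulation keeps it exact. A runtime check can feed such inputs and compare outputs to detect which lowering V8 chose.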


const v128_t vmin = wasm_v128_load64_splat(params->wasmsimd.min);
$for M in range(MR):
vacc${M}x0123 = wasm_f32x4_pmax(vacc${M}x0123, vmin);
Contributor:

Note that pmax is 2 instructions on ARM.

Contributor Author:

(I only checked in V8. Please let me know if anything is wrong.)

Yes, pmax is 2 instructions on ARM and 1 instruction on x64; min is 1 instruction on ARM and many instructions on x64. Relaxed min/max is 1 instruction on both x64 and ARM, but has implementation-defined behavior on NaN and +/-0.0.

Since I32x4.dot_i8x16_i7x16_add_s is SDOT on arm64, it will probably go to src/qs8-gemm/MRx4c16-wasmsdot.c.in, and src/qs8-gemm/MRx4c16-wasmusdot.c.in will be executed mainly on x64. Shall we keep wasm_f32x4_pmax here in src/qs8-gemm/MRx4c16-wasmusdot.c.in, or replace it with relaxed max (I'm not very confident that is valid), given that hardly any ARM device will use it?
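For reference, f32x4.pmax has fully specified per-lane semantics (`b < a ? a : b`), and the NaN/±0.0 corner cases are exactly where relaxed max is allowed to differ. A scalar model of the spec'd rule (a sketch, not V8 code):

```c
#include <math.h>

// Scalar model of the Wasm f32x4.pmax per-lane rule: b < a ? a : b.
// The comparison is false for NaN operands and treats -0.0 == 0.0, so the
// second operand "wins" ties; relaxed f32x4.max may pick either lowering.
static float pmax_lane(float a, float b) {
  return (b < a) ? a : b;
}
```

Because a kernel clamping accumulators against a finite vmin/vmax never hits the NaN/±0.0 cases, either instruction gives the same numeric result there; the difference is only cost and the implementation-defined corner cases.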

@@ -217,6 +217,18 @@ static void init_hardware_config(void) {
wasm_v128_xor(xint8_output, wasm_i32x4_const_splat(-128)),
wasm_v128_xor(overflow_output, wasm_i32x4_const(65536, 33024, 33024, 512))));
}
{
// Check out-of-bounds behaviour of VNNI (vpdpbusd) version Relaxed Integer Dot Product with Accumulation.
Contributor:

This works with (i8mm) vsudot on ARM as well?

Contributor Author:

I'm not very familiar with ARM, but I see a description here. It seems that this check may also work with vsudot, since vsudot deals with s8/u8 inputs and applies no saturation?

Contributor:

Oh yes, vsudot would work identically to VNNI if V8 supports it, but it is considered an I8MM instruction. The more common instruction is vsdot, which takes two signed int8 operands. So for ARM (or vnni-int8) we might want a version that does not add 128.
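A signed-by-signed variant would drop the XOR bias entirely. A scalar sketch of what ARM SDOT/vsdot (and, as I understand it, vpdpbssd in AVX-VNNI-INT8) compute per 32-bit lane (my illustration, not PR code):

```c
#include <stdint.h>

// Scalar sketch of a signed-by-signed 4-way dot product per 32-bit lane,
// SDOT-style: no xor-by-0x80 bias is needed because both operands are
// already signed int8, and products accumulate into int32 without
// intermediate saturation.
static int32_t sdot_lane(const int8_t a[4], const int8_t b[4], int32_t acc) {
  for (int j = 0; j < 4; j++) {
    acc += (int32_t) a[j] * (int32_t) b[j];
  }
  return acc;
}
```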

@fbarchard (Contributor):

There is a merge conflict for the internal review. Can you rebase and/or break this into smaller PRs?

@fanchenkong1 (Contributor Author):

> There is a merge conflict for the internal review. Can you rebase and/or break this into smaller PRs?

Thanks for reminding me of the issue! I just rebased the change. If you would prefer smaller PRs to make things clearer, please let me know.

@fanchenkong1 fanchenkong1 force-pushed the wasm-vnni branch 2 times, most recently from a7511a4 to a9b63f7 Compare July 9, 2024 08:07
@fbarchard (Contributor) left a comment:

Looks good! I would normally do a separate PR for the gemm-config, to allow the gemm config to be rolled back while keeping the kernel, and to make the PR itself easier to land... the build files tend to require rebasing.

@fanchenkong1 (Contributor Author):

> Looks good! I would normally do a separate PR for the gemm-config, to allow the gemm config to be rolled back while keeping the kernel, and to make the PR itself easier to land... the build files tend to require rebasing.

Thanks for the suggestion! The gemm-config change is reverted in the latest commit.

@alankelly (Collaborator):

Thanks for this. Improved WASMSIMD performance is desperately needed on Intel.

Can you please send a follow up updating the gemm-config?

@fanchenkong1 (Contributor Author):

> Thanks for this. Improved WASMSIMD performance is desperately needed on Intel.
>
> Can you please send a follow up updating the gemm-config?

Sure, I will create a follow-up PR updating the gemm-config.

#include "xnnpack/gemm.h"
#include "xnnpack/math.h"

#include <xnnpack/gemm.h>
Contributor:

Please change to #include "xnnpack/gemm.h" for xnn headers.

Contributor Author:

Thanks for the comments! I updated them in the latest change.

@copybara-service copybara-service bot merged commit 1d69718 into google:master Jul 18, 2024
20 checks passed