
Add QS8_QC8W GEMM/IGEMM microkernels for Wasm Relaxed Unsigned and Signed Dot Product #6505

Merged · 4 commits · Jul 18, 2024

Conversation

fanchenkong1 (Contributor)

This PR is related to issue #6454.

This change adds qs8_qc8w gemm/igemm microkernels for the Wasm relaxed SIMD dot product on signed and unsigned bytes. The new microkernels can take advantage of the AVX-VNNI instructions recently supported in V8 on x64 devices.
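The primitive the new kernels build on can be sketched in scalar C. This is a reference model of my own (not code from the PR), assuming the per-lane semantics of `i32x4.relaxed_dot_i8x16_i7x16_add_s` from the relaxed-simd proposal: each 32-bit lane accumulates four products of a signed int8 and a 7-bit (0..127) byte, with no intermediate saturation, which is what `vpdpbusd` provides on x64:

```c
#include <stdint.h>

// Scalar sketch (an illustration, not the V8 implementation) of the per-lane
// semantics of i32x4.relaxed_dot_i8x16_i7x16_add_s: for each 32-bit lane,
// four products of a signed int8 from `a` and a 7-bit (0..127) value from
// `b` are summed into a 32-bit accumulator with no intermediate saturation.
static void relaxed_dot_i8x16_i7x16_add_ref(
    const int8_t a[16], const uint8_t b[16], int32_t acc[4]) {
  for (int lane = 0; lane < 4; lane++) {
    int32_t sum = 0;
    for (int j = 0; j < 4; j++) {
      sum += (int32_t) a[4 * lane + j] * (int32_t) b[4 * lane + j];
    }
    acc[lane] += sum;
  }
}
```

The "relaxed" part of the instruction is that behavior is implementation-defined when a `b` byte has its high bit set, which is exactly why the PR adds the runtime checks discussed below.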

@fanchenkong1 fanchenkong1 marked this pull request as draft June 6, 2024 01:51
@fanchenkong1 fanchenkong1 marked this pull request as ready for review June 6, 2024 01:58
const v128_t va${M} = wasm_v128_xor(wasm_v128_load(a${M}), vsign_mask);
a${M} += 16;

$for N in range(4):
Contributor:

Ideally, make all loops respect NR=4 and try some different sizes. See how the NEON dot-product kernels handle NR.

Contributor Author:

Thanks for the suggestion! I will take a look at the NEON dot-product kernels.

const v128_t vb${N} = wasm_v128_load((const int8_t*) w + ${N * 16});

$for M in range(MR):
vacc${M}x${N} = wasm_i32x4_relaxed_dot_i8x16_i7x16_add(vb${N}, va${M}, vacc${M}x${N});
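The `wasm_v128_xor` with `vsign_mask` above is the standard signed-to-unsigned bias trick. A minimal scalar sketch of it (my illustration, not PR code; `vsign_mask` is assumed here to hold 0x80 in every byte):

```c
#include <stdint.h>

// Sketch of the sign-mask trick: XOR-ing a signed int8 with 0x80 yields the
// same value reinterpreted as an unsigned byte shifted by +128, which lets
// an unsigned-by-signed dot product (e.g. vpdpbusd) consume signed
// activations.
static uint8_t bias_to_unsigned(int8_t a) {
  return (uint8_t) (a ^ 0x80);  // equals (int) a + 128 for all int8 values
}
```

The +128 offset this adds to every activation contributes a constant per output channel (128 times the column sum of the weights), so it can be compensated ahead of time; the exact compensation here is defined by the PR's weight-packing code.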
Contributor:

If the weights are signed int8, will this overflow if the implementation is vpmaddubsw?

Contributor Author:

A vpmaddubsw implementation will not pass the CheckWAsmUSDOT check, as it has different saturation behavior.
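The saturation difference can be made concrete with a scalar sketch (my illustration of the two x64 lowerings, assuming vpmaddubsw's documented pairwise saturating int16 behavior):

```c
#include <stdint.h>

// Scalar sketch of the saturation difference discussed above. vpmaddubsw
// sums adjacent u8 x s8 products into a SATURATING int16; vpdpbusd sums
// four products straight into an int32 with no intermediate saturation.
static int16_t sat16(int32_t x) {
  if (x > INT16_MAX) return INT16_MAX;
  if (x < INT16_MIN) return INT16_MIN;
  return (int16_t) x;
}

// Pairwise u8 x s8 product sum, vpmaddubsw-style (saturating).
static int16_t maddubs_pair(uint8_t a0, int8_t b0, uint8_t a1, int8_t b1) {
  return sat16((int32_t) a0 * b0 + (int32_t) a1 * b1);
}
```

With a = 255 and b = 127 in both pairs, the exact sum is 64770; vpmaddubsw clamps it to 32767 while a vpdpbusd-style int32 accumulation keeps it exact. A runtime check can feed such inputs and compare outputs to detect which lowering V8 chose.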


const v128_t vmin = wasm_v128_load64_splat(params->wasmsimd.min);
$for M in range(MR):
vacc${M}x0123 = wasm_f32x4_pmax(vacc${M}x0123, vmin);
Contributor:

Note that pmax is 2 instructions on ARM.

Contributor Author:

(I only checked in V8. Please let me know if anything is wrong.)

Yes, pmax is 2 instructions on ARM and 1 instruction on x64; min is 1 instruction on ARM and many instructions on x64. Relaxed min/max is 1 instruction on both x64 and ARM, but has implementation-defined behavior on NaN and +/-0.0.

Since I32x4.dot_i8x16_i7x16_add_s is SDOT on arm64, it will probably go to src/qs8-gemm/MRx4c16-wasmsdot.c.in, and src/qs8-gemm/MRx4c16-wasmusdot.c.in will be executed mainly on x64. Shall we keep wasm_f32x4_pmax here in src/qs8-gemm/MRx4c16-wasmusdot.c.in, or replace it with relaxed max (I'm not very confident that is valid), given that hardly any ARM device will use it?
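For reference, f32x4.pmax has fully specified per-lane semantics (`b < a ? a : b`), and the NaN/±0.0 corner cases are exactly where relaxed max is allowed to differ. A scalar model of the spec'd rule (a sketch, not V8 code):

```c
#include <math.h>

// Scalar model of the Wasm f32x4.pmax per-lane rule: b < a ? a : b.
// The comparison is false for NaN operands and treats -0.0 == 0.0, so the
// second operand "wins" ties; relaxed f32x4.max may pick either lowering.
static float pmax_lane(float a, float b) {
  return (b < a) ? a : b;
}
```

Because a kernel clamping accumulators against a finite vmin/vmax never hits the NaN/±0.0 cases, either instruction gives the same numeric result there; the difference is only cost and the implementation-defined corner cases.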

@@ -217,6 +217,18 @@ static void init_hardware_config(void) {
wasm_v128_xor(xint8_output, wasm_i32x4_const_splat(-128)),
wasm_v128_xor(overflow_output, wasm_i32x4_const(65536, 33024, 33024, 512))));
}
{
// Check out-of-bounds behaviour of VNNI (vpdpbusd) version Relaxed Integer Dot Product with Accumulation.
Contributor:

This works with (i8mm) vsudot on ARM as well?

Contributor Author:

I'm not very familiar with ARM, but I see a description here. It seems that this check may also work with vsudot, since vsudot deals with s8/u8 inputs and applies no saturation?

Contributor:

Oh yes, vsudot would work identically to VNNI if V8 supports it, but it is considered an I8MM instruction. The more common instruction is vsdot, which takes two signed int8 operands. So for ARM (or vnni-int8) we might want a version that does not add 128.
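A signed-by-signed variant would drop the XOR bias entirely. A scalar sketch of what ARM SDOT/vsdot (and, as I understand it, vpdpbssd in AVX-VNNI-INT8) compute per 32-bit lane (my illustration, not PR code):

```c
#include <stdint.h>

// Scalar sketch of a signed-by-signed 4-way dot product per 32-bit lane,
// SDOT-style: no xor-by-0x80 bias is needed because both operands are
// already signed int8, and products accumulate into int32 without
// intermediate saturation.
static int32_t sdot_lane(const int8_t a[4], const int8_t b[4], int32_t acc) {
  for (int j = 0; j < 4; j++) {
    acc += (int32_t) a[j] * (int32_t) b[j];
  }
  return acc;
}
```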

@fbarchard (Contributor):

There is a merge conflict for the internal review. Can you rebase and/or break this into smaller PRs?

@fanchenkong1 (Contributor Author):

> There is a merge conflict for the internal review. Can you rebase and/or break this into smaller PRs?

Thanks for reminding me of the issue! I just rebased the change. If you would prefer smaller PRs to make things clearer, please let me know.

@fanchenkong1 fanchenkong1 force-pushed the wasm-vnni branch 2 times, most recently from a7511a4 to a9b63f7 Compare July 9, 2024 08:07
@fbarchard (Contributor) left a comment:

Looks good! I would normally do a separate PR for the gemm-config, to allow the gemm config to be rolled back while keeping the kernel, and to make the PR itself easier to land... the build files tend to require rebasing.

@fanchenkong1 (Contributor Author):

> Looks good! I would normally do a separate PR for the gemm-config, to allow the gemm config to be rolled back while keeping the kernel, and to make the PR itself easier to land... the build files tend to require rebasing.

Thanks for the suggestion! The gemm-config change is reverted in the latest commit.

@alankelly (Collaborator):

Thanks for this. Improved WASMSIMD performance is desperately needed on Intel.

Can you please send a follow up updating the gemm-config?

@fanchenkong1 (Contributor Author):

> Thanks for this. Improved WASMSIMD performance is desperately needed on Intel.
>
> Can you please send a follow up updating the gemm-config?

Sure, I will create a follow-up PR updating the gemm-config.

#include "xnnpack/gemm.h"
#include "xnnpack/math.h"

#include <xnnpack/gemm.h>
Contributor:

Please change to #include "xnnpack/gemm.h" for xnn headers.

Contributor Author:

Thanks for the comments! I updated them in the latest change.

@copybara-service copybara-service bot merged commit 1d69718 into google:master Jul 18, 2024
20 checks passed