Solve bank conflict #8

Open
yofufufufu opened this issue May 17, 2024 · 1 comment

@yofufufufu

In my opinion, when data is loaded from global memory into shared memory (i.e. when shared memory is written) with vectorized accesses, the transposition can make threads within a warp write to the same column of shared memory.
For example, thread 0 reads A[0][0] to A[0][3] and thread 1 reads A[0][4] to A[0][7], so thread 0 writes As[0][0] to As[3][0] and thread 1 writes As[4][0] to As[7][0]. For an As tile of size BM (=128) * BK (=8), As[0][0] and As[4][0] are 4 * BM = 512 floats apart, which is a multiple of the 32 banks, so they fall in the same bank and cause a bank conflict.
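The write pattern in this example can be checked on the host with a small sketch. It assumes 32 banks of 4-byte words, a warp of 32 threads, the transposed layout `As[col * BM + row]`, and the thread-to-element mapping implied by the example (`innerRowA = thread / (BK / 4)`, `innerColA = thread % (BK / 4)`). The helper names `bankOf`, `asWriteIndex` and `asWriteConflictDegree` are hypothetical and are not taken from the kernels.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <set>

// Assumptions (not stated in the thread): 32 banks, 4-byte words, 32-thread warps.
constexpr int kNumBanks = 32;
constexpr int kWarpSize = 32;
// Tile sizes taken from the example above.
constexpr int BM = 128;
constexpr int BK = 8;

// Bank that holds the float at a given element index of a shared-memory array.
constexpr int bankOf(int floatIndex) { return floatIndex % kNumBanks; }

// Element index in the transposed As tile that `thread` writes for
// component `comp` (0 = x, 1 = y, 2 = z, 3 = w) of the float4 it loaded.
// Assumed layout: As[col * BM + row].
constexpr int asWriteIndex(int thread, int comp) {
  return ((thread % (BK / 4)) * 4 + comp) * BM + thread / (BK / 4);
}

// Largest number of distinct addresses that map to a single bank when all
// threads of one warp store component `comp`. 1 means conflict-free.
inline int asWriteConflictDegree(int comp) {
  assert(comp >= 0 && comp < 4);
  std::array<std::set<int>, kNumBanks> addressesPerBank;
  for (int t = 0; t < kWarpSize; ++t) {
    const int idx = asWriteIndex(t, comp);
    addressesPerBank[bankOf(idx)].insert(idx);
  }
  int worst = 0;
  for (const auto& addresses : addressesPerBank) {
    worst = std::max(worst, static_cast<int>(addresses.size()));
  }
  return worst;
}
```

Under these assumptions, `asWriteIndex(0, 0)` and `asWriteIndex(1, 0)` map to the same bank, and `asWriteConflictDegree` reports how many distinct addresses per bank a single store instruction touches across the warp.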
So I think bank conflicts will only occur when writing As, not Bs. But in kernels v7 and v8, it seems like you are trying to optimize the writes to Bs:

tmp = reinterpret_cast<float4 *>(&B[innerRowB * N + innerColB * 4])[0];
Bs[innerRowB * (BN + extraCols) + innerColB * 4 + 0] = tmp.x;
Bs[innerRowB * (BN + extraCols) + innerColB * 4 + 1] = tmp.y;
Bs[innerRowB * (BN + extraCols) + innerColB * 4 + 2] = tmp.z;
Bs[innerRowB * (BN + extraCols) + innerColB * 4 + 3] = tmp.w;

Did I understand something wrong?

@yofufufufu
Author

Still looking forward to your reply 😿
