Solve bank conflict #8

yofufufufu · 2024-05-17T12:36:08Z

In my opinion, when loading data from global memory into shared memory (i.e. writing shared memory) with vectorized access, the transposition means that threads within a warp may write to the same column of shared memory.
For example, thread 0 reads A[0][0] to A[0][3] and thread 1 reads A[0][4] to A[0][7]. So thread 0 writes As[0][0] to As[3][0], and thread 1 writes As[4][0] to As[7][0]. For an As of size BM (=128) * BK (=8), As[0][0] and As[4][0] are 4 * 128 = 512 floats apart, which is a multiple of the 32 banks, so they are on the same bank and cause a bank conflict.
So I think bank conflicts will only occur when writing As, not Bs. But in kernels v7 and v8, it seems like you are trying to optimize writing to Bs:

SGEMM_CUDA/src/kernels/8_kernel_bank_extra_col.cuh

Lines 56 to 60 in 60cba6f

    
    tmp = reinterpret_cast<float4 *>(&B[innerRowB * N + innerColB * 4])[0];
    Bs[innerRowB * (BN + extraCols) + innerColB * 4 + 0] = tmp.x;
    Bs[innerRowB * (BN + extraCols) + innerColB * 4 + 1] = tmp.y;
    Bs[innerRowB * (BN + extraCols) + innerColB * 4 + 2] = tmp.z;
    Bs[innerRowB * (BN + extraCols) + innerColB * 4 + 3] = tmp.w;

Did I understand something wrong?
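
The bank arithmetic in this question can be checked on the host with a small sketch like the one below. It is not taken from the repository. It assumes a simplified shared-memory model (32 banks, 4 bytes wide, each scalar store issued as its own warp-wide instruction), BM = BN = 128, BK = 8, and a transposed As index of the form `(innerColA * 4 + j) * BM + innerRowA`, which mirrors the access pattern described above. The helper names `conflict_degree`, `as_store_offsets` and `bs_store_offsets`, as well as `innerRowA` and `innerColA`, are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <map>
#include <set>
#include <vector>

// Assumed model: 32 banks of 4-byte words, so a float at word offset w
// lives in bank w % 32. Real hardware and compiler behavior may differ.
constexpr int kNumBanks = 32;
constexpr int kWarpSize = 32;

// Largest number of distinct word offsets that fall into one bank for a
// single warp-wide store. 1 means conflict-free, n means an n-way conflict.
int conflict_degree(const std::vector<int>& word_offsets) {
    assert(word_offsets.size() <= static_cast<std::size_t>(kWarpSize));
    std::map<int, std::set<int>> per_bank;
    for (int w : word_offsets) {
        per_bank[w % kNumBanks].insert(w);
    }
    int worst = 1;
    for (const auto& entry : per_bank) {
        worst = std::max(worst, static_cast<int>(entry.second.size()));
    }
    return worst;
}

// Word offsets written by the first warp for the j-th scalar store of the
// transposed As write (index formula is an assumption, see the lead-in).
std::vector<int> as_store_offsets(int BM, int BK, int j) {
    std::vector<int> offsets;
    for (int t = 0; t < kWarpSize; ++t) {
        int innerRowA = t / (BK / 4);
        int innerColA = t % (BK / 4);
        offsets.push_back((innerColA * 4 + j) * BM + innerRowA);
    }
    return offsets;
}

// Word offsets written by the first warp for the j-th scalar store to Bs,
// using the index expression from the quoted kernel lines.
std::vector<int> bs_store_offsets(int BN, int extraCols, int j) {
    std::vector<int> offsets;
    for (int t = 0; t < kWarpSize; ++t) {
        int innerRowB = t / (BN / 4);
        int innerColB = t % (BN / 4);
        offsets.push_back(innerRowB * (BN + extraCols) + innerColB * 4 + j);
    }
    return offsets;
}
```

Under these assumptions, `conflict_degree(as_store_offsets(128, 8, 0))` evaluates to 2, which matches the As[0][0] / As[4][0] example above. `conflict_degree(bs_store_offsets(128, extraCols, 0))` evaluates to 4 for both `extraCols = 0` and `extraCols = 5`, because the 32 threads of a warp write within one row at a stride of 4 floats. Whether that matters in practice depends on how the compiler actually emits these stores, which this model does not capture.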


yofufufufu · 2024-06-04T16:25:52Z

Still looking forward to your reply 😿
