WIP - Speed improvements to resize convolution (no vpermps w/ FMA) #2793

JimBobSquarePants · 2024-08-15T06:46:35Z

Prerequisites

I have written a descriptive pull-request title
I have verified that there are no overlapping pull-requests open
I have verified that I am following the existing coding patterns and practice as demonstrated in the repository. These follow strict Stylecop rules 👮.
I have provided test coverage for my change (where applicable)

Description

This is a replacement for #1518 by @Sergio0694 with most of the work based upon his implementation. I've modernized some of the code and added Vector512 support also.

Description

Follow up to #1513. This PR does a couple things:

Switch the resize kernel processing to float

Add an AVX2 vectorized method to normalize the kernel

Vectorize the kernel copy when not using FMA, using Span<T>.CopyTo instead

Remove the permute8x32 when using FMA, by creating a convolution kernel of 4x the size

Resize convolution codegen diff

Before:
vmovsd xmm2, [rax]
vpermps ymm2, ymm1, ymm2
vfmadd231ps ymm0, ymm2, [r8]
After:
vmovupd ymm2, [r8]
vfmadd231ps ymm0, ymm2, [rax] 

Resize tests currently have four failing tests with minor differences while the ResizeKernelMap has 3 single failing tests. Turning off the periodic kernel map fixes the kernel map failing tests so that is somehow related (I have no idea why).

I would like to hopefully get issues fixed and merge this because performance in the Playground Benchmarks looks really, really good so if anyone can spare some time to either provide assistance or point me in the right direction. please let me know.

CC @antonfirsov @saucecontrol

BenchmarkDotNet v0.13.11, Windows 11 (10.0.22631.4037/23H2/2023Update/SunValley3)
11th Gen Intel Core i7-11370H 3.30GHz, 1 CPU, 8 logical and 4 physical cores
.NET SDK 8.0.400
  [Host]   : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI
  ShortRun : .NET 8.0.8 (8.0.824.36612), X64 RyuJIT AVX-512F+CD+BW+DQ+VL+VBMI

Job=ShortRun  IterationCount=5  LaunchCount=1
WarmupCount=5

Method	Mean	Error	StdDev	Ratio	RatioSD	Gen0	Gen1	Gen2	Allocated	Alloc Ratio
'System.Drawing Load, Resize, Save'	397.19 ms	19.614 ms	5.094 ms	1.00	0.00	-	-	-	13.23 KB	1.00
'ImageFlow Load, Resize, Save'	280.24 ms	10.062 ms	2.613 ms	0.71	0.00	500.0000	500.0000	500.0000	4642.6 KB	351.01
'ImageSharp Load, Resize, Save'	116.64 ms	22.234 ms	5.774 ms	0.29	0.02	-	-	-	1312.78 KB	99.25
'ImageSharp TD Load, Resize, Save'	75.36 ms	0.933 ms	0.242 ms	0.19	0.00	166.6667	-	-	1309.11 KB	98.98
'ImageMagick Load, Resize, Save'	404.52 ms	33.183 ms	8.618 ms	1.02	0.03	-	-	-	54.48 KB	4.12
'ImageFree Load, Resize, Save'	236.37 ms	2.954 ms	0.767 ms	0.60	0.01	6000.0000	6000.0000	6000.0000	95.97 KB	7.26
'MagicScaler Load, Resize, Save'	68.02 ms	2.070 ms	0.537 ms	0.17	0.00	-	-	-	45.56 KB	3.44
'SkiaSharp Load, Resize, Save'	137.06 ms	4.841 ms	1.257 ms	0.35	0.00	-	-	-	88.15 KB	6.66
'NetVips Load, Resize, Save'	123.93 ms	2.914 ms	0.757 ms	0.31	0.01	-	-	-	50.95 KB	3.85

saucecontrol · 2024-08-21T18:09:59Z

src/ImageSharp/Common/Helpers/Vector512Utilities.cs

+    public static Vector512<float> MultiplyAddEstimate(Vector512<float> a, Vector512<float> b, Vector512<float> c)
+
+        // Don't actually use FMA as it requires many more instruction to extract the
+        // upper and lower parts of the vector and then recombine them.


What's this about? When inlined, a helper wrapping Avx512F.FusedMultiplyAdd will compile to a single instruction. One of:

Instruction: vfmadd132ps zmm, zmm, zmm vfmadd213ps zmm, zmm, zmm vfmadd231ps zmm, zmm, zmm CPUID Flags: AVX512F

Urgh... I didn't even think of Avx512F.FusedMultiplyAdd. I saw the equivalent method in the runtime PR exposing the functionality and blindly copied.

MultiplyAddEstimate

src/ImageSharp/Common/Helpers/Numerics.cs

src/ImageSharp/Processing/Processors/Transforms/Resize/ResizeKernel.cs

saucecontrol · 2024-08-21T18:56:40Z

Brain's not fully awake yet today, but I'll give the maths a look soon.

…ernel.cs Co-authored-by: Clinton Ingram <[email protected]>

Co-authored-by: Clinton Ingram <[email protected]>

JimBobSquarePants · 2024-08-22T12:25:52Z

Brain's not fully awake yet today, but I'll give the maths a look soon.

Thanks for the review so far. I still haven't figured out what is going on with PeriodicKernelMap. It all looks correct to me.

saucecontrol · 2024-08-23T18:43:51Z

It looks to me like the only differences are due to the change to single precision for kernel normalization and for calculation of the distances passed to the interpolation function. You'll definitely give up some accuracy there, and I'm not sure it's worth it since the kernels only have to be built once per resize. You can see here that @antonfirsov changed the precision to double from the initial implementation some years back.

Since the periodic kernel map relies on each repetition of the kernel weights being exact, I can see how precision loss might lead to some differences when compared with a separate calculation per interval. I've actually never looked at your implementation of the kernel map before, and now my curiosity is piqued because I arrived at something similar myself, but my implementation calculates each kernel window separately, and only replaces that interval with the periodic version if they match exactly. Part of this was due to a lack of confidence in the maths on my part, as I only discovered the periodicity of the kernel weights by observation and kind of intuited my way to a solution.

@antonfirsov would you mind filling in some gaps on the theory behind your periodic kernel map implementation? Did you use some paper or other implementation as inspiration, or did you arrive at it observationally like I did?

JimBobSquarePants added 3 commits August 14, 2024 23:19

Reimplement @Sergio0694 work.

cd1b77a

Add Vector512 support

36fefc6

Use dedicated property

4728b97

JimBobSquarePants added the area:performance label Aug 15, 2024

JimBobSquarePants added this to the v4.0.0 milestone Aug 15, 2024

JimBobSquarePants added 2 commits August 15, 2024 17:24

Update ResizeKernelMap.cs

8c19a97

Don't use FMA for 512

7840665

saucecontrol reviewed Aug 21, 2024

View reviewed changes

src/ImageSharp/Common/Helpers/Numerics.cs Outdated Show resolved Hide resolved

saucecontrol reviewed Aug 21, 2024

View reviewed changes

src/ImageSharp/Processing/Processors/Transforms/Resize/ResizeKernel.cs Outdated Show resolved Hide resolved

JimBobSquarePants and others added 4 commits August 22, 2024 21:20

Update src/ImageSharp/Processing/Processors/Transforms/Resize/ResizeK…

58f6afb

…ernel.cs Co-authored-by: Clinton Ingram <[email protected]>

Update src/ImageSharp/Common/Helpers/Numerics.cs

0594035

Co-authored-by: Clinton Ingram <[email protected]>

Merge branch 'main' into js/resize-map-optimizations

e60dd07

use Avx512F.FusedMultiplyAdd

72813ee

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

WIP - Speed improvements to resize convolution (no vpermps w/ FMA) #2793

WIP - Speed improvements to resize convolution (no vpermps w/ FMA) #2793

JimBobSquarePants commented Aug 15, 2024 •

edited

Loading

saucecontrol Aug 21, 2024

JimBobSquarePants Aug 22, 2024

saucecontrol commented Aug 21, 2024

JimBobSquarePants commented Aug 22, 2024

saucecontrol commented Aug 23, 2024

WIP - Speed improvements to resize convolution (no vpermps w/ FMA) #2793

Are you sure you want to change the base?

WIP - Speed improvements to resize convolution (no vpermps w/ FMA) #2793

Conversation

JimBobSquarePants commented Aug 15, 2024 • edited Loading

Prerequisites

Description

Description

Resize convolution codegen diff

saucecontrol Aug 21, 2024

Choose a reason for hiding this comment

JimBobSquarePants Aug 22, 2024

Choose a reason for hiding this comment

saucecontrol commented Aug 21, 2024

JimBobSquarePants commented Aug 22, 2024

saucecontrol commented Aug 23, 2024

JimBobSquarePants commented Aug 15, 2024 •

edited

Loading