phaeron writes:
Additive blending is a relatively simple, common, and useful operation. It takes two images and adds them together, saturating the result to white. It's useful for some kinds of overlays, as well as incrementally building a final image from components. We'll start with a simple scalar version:
void additiveblend1(unsigned char *dst, const unsigned char *src, int quads) {
int bytes = quads << 2;
do {
unsigned v = (unsigned)src[0] + dst[0];
if (v >= 256)
v = 255;
dst[0] = v;
++src;
++dst;
} while(--bytes);
}
This code works a byte at a time and works equally well for any byte-per-channel format, including RGB24, RGB32, YUY2, and YV12. It runs at a dismal 16 clocks/byte on a 1.6GHz Pentium 4.
Tip: The CPU's time stamp counter, accessible via the RDTSC instruction, is a great way to benchmark small pieces of code. However, the results can be heavily affected by cache effects. For code that is deterministic and quick, running the test twice in a row and taking the second timing can eliminate most cache effects.
Now, let's rewrite it in MMX.
void additiveblend2(unsigned char *dst, const unsigned char *src, int quads) {
__asm {
mov ecx, src
mov edx, dst
mov eax, quads
pxor mm7, mm7
top:
movd mm0, [ecx] ;load four source bytes
movd mm1, [edx] ;load four destination bytes
punpcklbw mm0, mm7 ;unpack source bytes to words
punpcklbw mm1, mm7 ;unpack destination bytes to words
paddw mm0, mm1 ;add words together
packuswb mm0, mm1 ;pack words with saturation
movd [edx], mm0 ;store blended result
add ecx, 4 ;advance source pointer
add edx, 4 ;advance destination pointer
dec eax ;loop back until done
jne top
emms
}
}
This version runs much better on a random 1K sample, at 1.64 clocks/byte. However, we can do better. Unpacking to 16-bit words is generally required in most MMX routines but it's a total waste here, as the CPU has saturating add instructions that work directly on bytes. We can use those instead.
Tip: MMX has a lot of interesting instructions, such as saturating adds and subtracts, clamping packs, and multiply-plus-shift instructions. Learning to use these effectively can improve the performance of your code. However, there are a few omissions in the basic MMX set, including unsigned multiplies, byte shifts, and saturating 32-bit operations, that you should be aware of.
So here's the version using the saturating PADDUSB (packed add with unsigned byte saturation) instruction:
void additiveblend3(unsigned char *dst, const unsigned char *src, int quads) {
__asm {
mov ecx, src
mov edx, dst
mov eax, quads
top:
movd mm0, [ecx]
movd mm1, [edx]
paddusb mm0, mm1
movd [edx], mm0
add ecx, 4
add edx, 4
dec eax
jne top
emms
}
}
This version is still faster at 1.27 clocks/byte and is simpler, too. But why are we still doing 4 bytes at a time, when we could be doing 8?
void additiveblend4(unsigned char *dst, const unsigned char *src, int quads) {
__asm {
mov ecx, src
mov edx, dst
mov eax, quads
pxor mm7, mm7
top:
movq mm0, [ecx]
paddusb mm0, [edx]
movq [edx], mm0
add ecx, 8
add edx, 8
sub eax, 2
jne top
emms
}
}
Although we only doubled the parallism of the loop, this change makes a huge difference and kicks us all the way up to 0.49 clocks/pixel. Now notice that the scalar code is beginning to take a lot of the inner loop -- in fact, half of it. We can eliminate the pointer updates by switching to array notation:
void additiveblend5(unsigned char *dst, const unsigned char *src, int quads) {
__asm {
mov ecx, src
mov edx, dst
mov eax, quads
shl eax, 2
add ecx, eax
add edx, eax
neg eax
top:
movq mm0, [ecx+eax]
paddusb mm0, [edx+eax]
movq [edx+eax], mm0
add eax, 8
jne top
emms
}
}
There is a slight trick here, in that we index negatively from the end of the arrays rather than positively from the start, and count up instead of down. Removing the pointer operations trims execution time by a tiny amount, down to 0.48 clocks/pixel. We're still not done yet, though. Let's unroll the loop. That way, the CPU has less loop overhead instructions to execute, and also relaxes the loop-carried dependency on the loop count (EAX in a loop iteration is dependent on the last loop's EAX, which is dependent on EAX from the loop before that, etc).
void additiveblend6(unsigned char *dst, const unsigned char *src, int quads) {
__asm {
mov ecx, src
mov edx, dst
mov eax, quads
shl eax, 2
add ecx, eax
add edx, eax
neg eax
top:
movq mm0, [ecx+eax]
movq mm1, [ecx+eax+8]
paddusb mm0, [edx+eax]
paddusb mm1, [edx+eax+8]
movq [edx+eax], mm0
movq [edx+eax+8], mm1
add eax, 16
jne top
emms
}
}
This version now runs at 0.42 clocks/pixel, about 38x faster than the original scalar version. That's fast enough, right? Well, yes, except for one problem: it's wrong. Namely, this routine handles 16 bytes at a time instead of 4 bytes at a time, so if the quad count isn't divisible by 4, this routine doesn't work properly. Whoops.
Tip: Faster code is generally less desirable than working code.
void additiveblend7(unsigned char *dst, const unsigned char *src, int quads) {
__asm {
mov ecx, src
mov edx, dst
mov eax, quads
shl eax, 2
lea ecx, [ecx+eax-12]
lea edx, [edx+eax-12]
neg eax
add eax, 12
jc look_for_oddballs
top:
movq mm0, [ecx+eax]
movq mm1, [ecx+eax+8]
paddusb mm0, [edx+eax]
paddusb mm1, [edx+eax+8]
movq [edx+eax], mm0
movq [edx+eax+8], mm1
add eax, 16
jnc top
look_for_oddballs:
sub eax, 12
jnc no_oddballs
oddball_loop:
movd mm0, [ecx+eax+12]
movd mm1, [edx+eax+12]
paddusb mm0, mm1
movd [edx+eax+12], mm0
add eax, 4
jne oddball_loop
no_oddballs:
emms
}
}
Here is the final version. It's probably still not as fast as it could be but for most uses it's fast enough. Unfortunately, the code to handle odd fixups at the end of the line takes more code than the optimized version. It is often the case that handling odd cases will greatly complicate an optimized routine. There is an trick to handle odd widths that unfortunately cannot be used here, but is useful when the destination and sources don't overlap: back off the source and destination pointers and redo some bytes that have already been done. For instance, if a routine handles 8 bytes at a time and is asked to do 21 bytes, do 16 bytes in the fast loop, then do 8 unaligned bytes at the end with bytes 13-15 getting redone. This can let you handle odd widths with very little code.
Omake
Remember that scalar version at the top that ran at 16 clocks/pixel? Well, that was a pretty bad scalar version. Try this one:
void additiveblend1s(unsigned char *dst, const unsigned char *src, int quads) {
unsigned *dst2 = (unsigned *)dst;
const unsigned *src2 = (const unsigned *)src;
do {
unsigned a = src2[0];
unsigned b = dst2[0];
unsigned sats = ((a&b) + (((a^b)&0xfefefefe)>>1)) & 0x80808080;
sats = sats + sats - (sats >> 7);
dst2[0] = (a|sats) + (b&~sats);
++src2;
++dst2;
} while(--quads);
}
This version handles 32 bits at a time, and runs at 3.43 clocks/byte. MMX is still faster here, but there are some cases in which it isn't -- for instance, when I tried writing an HSV filter, an optimized scalar version turned out to be faster than an optimized MMX version. So don't think that MMX is always the best way.