Tutorial :32x32 Multiply and add optimization


I'm working on optimizing an application . I found that i need to optimize an inner loop for improved performance. rgiFilter is a 16 bit arrary.

for (i = 0; i < iLen; i++) {      iPredErr = (I32)*rgiResidue;      rgiFilter = rgiFilterBuf;      rgiPrevVal = rgiPrevValRdBuf + iRecent;      rgiUpdate = rgiUpdateRdBuf + iRecent;        iPred = iScalingOffset;        for (j = 0; j < iOrder_Div_8; j++) {                       iPred += (I32) rgiFilter[0] * rgiPrevVal[0];                    rgiFilter[0] += rgiUpdate[0];                     iPred += (I32) rgiFilter[1] * rgiPrevVal[1];                    rgiFilter[1] += rgiUpdate[1];                     iPred += (I32) rgiFilter[2] * rgiPrevVal[2];                    rgiFilter[2] += rgiUpdate[2];                     iPred += (I32) rgiFilter[3] * rgiPrevVal[3];                    rgiFilter[3] += rgiUpdate[3];                     iPred += (I32) rgiFilter[4] * rgiPrevVal[4];                    rgiFilter[4] += rgiUpdate[4];                     iPred += (I32) rgiFilter[5] * rgiPrevVal[5];                    rgiFilter[5] += rgiUpdate[5];                     iPred += (I32) rgiFilter[6] * rgiPrevVal[6];                    rgiFilter[6] += rgiUpdate[6];                     iPred += (I32) rgiFilter[7] * rgiPrevVal[7];                    rgiFilter[7] += rgiUpdate[7];                        rgiFilter += 8;          rgiPrevVal += 8;                      rgiUpdate += 8;        }  

ode here


Your only bet is to do more than one operation at a time, and that means one of these 3 options:

  1. SSE instructions (SIMD). You process multiple memory locations with a single instructions
  2. Multi-threading (MIMD). This works best if you have more than 1 cpu core. Split your array into multiple, similarly sized strips that are independant of eachother (dependency will increase this option's complexity a lot, to the point of being slower than sequentially calculating everything if you need a lot of locks). Note that the array has to be big enough to offset the extra context switching and synchronization overhead (it's pretty small, but not negligeable). Best for 4 cores or more.
  3. Both at once. If your array is really big, you could gain a lot by combining both.


If rgiFilterBuf, rgiPrevValRdBuf and rgiUpdateRdBuf are function parameters that don't alias, declare them with the restrict qualifier. This will allow the compiler to optimise more aggresively.

As some others have commented, your inner loop looks like it may be a good fit for vector processing instructions (like SSE, if you're on x86). Check your compiler's intrinsics.


I don't think you can do much to optimize it in C. Your compiler might have options to generate SIMD code, but you probably need to just go and write your own SIMD assembly code if performance is critical...


You can replace the inner loop with very few SSE2 intrinsics

see [_mm_madd_epi16][1] to replace the eight

iPred += (I32) rgiFilter[] * rgiPrevVal[];  

and [_mm_add_epi16][2] or _[mm_add_epi32][3] to replace the eight

rgiFilter[] += rgiUpdate[];  

You should see a nice acceleration with that alone.

These intrinsics are specific to Microsoft and Intel Compilers. I am sure equivalents exist for GCC I just havent used them.

EDIT: based on the comments below I would change the following...

If you have mixed types the compiler is not always smart enough to figure it out. I would suggest the following to make it more obvious and give it a better chance at autovectorizing.

  1. declare rgiFilter[] as I32 bit for the purposes of this function. You will pay one copy.
  2. change iPred to iPred[] as I32 also
  3. do the iPred[] summming outside the inner (or even outer) loop

  4. Pack similar instructions in groups of four

    iPred[0] += rgiFilter[0] * rgiPrevVal[0];

    iPred[1] += rgiFilter[1] * rgiPrevVal[1];

    iPred[2] += rgiFilter[2] * rgiPrevVal[2];

    iPred[3] += rgiFilter[3] * rgiPrevVal[3];

    rgiFilter[0] += rgiUpdate[0];

    rgiFilter[1] += rgiUpdate[1];

    rgiFilter[2] += rgiUpdate[2];

    rgiFilter[3] += rgiUpdate[3];

This should be enough for the Intel compiler to figure it out


  1. Ensure that iPred is hold in a register (not read from memory before and not written back into memory after each += operation).
  2. Optimize the memory layout for 1st level cache. Ensure that the 3 arrays to not fight for same cache entries. This depends on CPU architecture and isn't simple at all.


Loop unrolling and vectorizing should left to the compiler.

See Gcc Auto-vectorization


Start out by making sure that the data is layed out linearly in memory so that you get no cache misses. This doesn't seem to be an issue though.

If you can't SSE the operations (and if the compiler fails with it - look at the assembly), try to separate into several different for-loops that are smaller (one for each 0 .. 8). Compilers tend to be able to do better optimizations on loops that perform less amount of operations (except in cases like this where it might be able to do vectorization/SSE).

16 bit integers are more expensive for 32/64 bit architecture to use (unless they have specific 16-bit registers). Try converting it to 32 bits before doing the loop (most 64-bit architectures have 32bit registers as well afaik).


Pretty good code.

At each step, you're basically doing three things, a multiplication and two additions.

The other suggestions are good. Also, I've sometimes found that I get faster code if I separate those activities into different loops, like

  • one loop to do the multiplication and save to a temporary array.

  • one loop to sum that array in iPred.

  • one loop to add rgiUpdate to rgiFilter.

With the unrolling, your loop overhead is negligible, but if the number of different things done inside each loop is minimized, the compiler can sometimes make better use of its registers.


There's lots of optimizations that you can do that involve introducing target specific code. I'll stick mostly with generic stuff, though.

First, if you are going to loop with index limits then you should usually try to loop downward.


for (i = 0; i < iLen; i++) {  


for (i = iLen-1; i <= 0; i--) {  

This can take advantage of the fact that many common processors essentially do a comparison with 0 for the results of any math operation, so you don't have to do an explicit comparison.

This only works, though, if going backwards through the loop has the same results and if the index is signed (though you can sneak around that).

Alternately you could try limiting by pointer math. This might eliminate the need for an explicit index (counter) variable, which could speed things up, especially if registers are in short supply.

for (p = rgiFilter; p <= rgiFilter+8; ) {       iPred += (I32) (*p) + *rgiPreval++;       *p++ += *rgiUpdate++;         ....    }  

This also gets rid of the odd updating at the end of your inner loop. The updating at the end of the loop could confuse the compiler and make it produce worse code. You may also find that the loop unrolling that you did do may produce worse or equally as good results as if you had only two statements in the body of the inner loop. The compiler is likely able to make good decisions about how this loop should be rolled/unrolled. Or you might just want to make sure that the loop is unrolled twice since rgiFilter is an array of 16 bit values and see if the compiler can take advantage of accessing it just twice to accomplish two reads and two writes -- doing one 32 bit load and one 32 bit store.

for (p = rgiFilter; p <= rgiFilter+8; ) {       I16 x = *p;       I16 y = *(p+1); // Hope that the compiler can combine these loads       iPred += (I32) x + *rgiPreval++;       iPred += (I32) y + *rgiPreval++;         *p++ += *rgiUpdate++;       *p++ += *rgiUpdate++; // Hope that the complier can combine these stores         ....    }  

If your compiler and/or target processor supports it you can also try issuing prefetch instructions. For instance gcc has:

__builtin_prefetch (const void * addr)  __builtin_prefetch (const void * addr, int rw)  __builtin_prefetch (const void * addr, int rw, int locality)  

These can be used to tell the compiler that if the target has prefetch instructions it should use them to try to go ahead and get addr into the cache. Optimally these should be issued once per cache line step per array you're working on. The rw argument is to tell the compiler if you want to read or write to address. Locality has to do with if the data needs to stay in cache after you access it. The compiler just tries to do the best it can figure out how to to generate the right instructions for this, but if it can't do what you ask for on a certain target it just does nothing and it doesn't hurt anything.

Also, since the __builtin_ functions are special the normal rules about variable number of arguments don't really apply -- this is a hint to the compiler, not a call to a function.

You should also look into any vector operations your target supports as well as any generic or platform specific functions, builtins, or pragmas that your compiler supports for doing vector operations.

Note:If u also have question or solution just comment us below or mail us on toontricks1994@gmail.com
Next Post »