Visitor II
November 11, 2017
Question

Optimized Multiplies for Cosmic Compiler?

  • November 11, 2017
  • 4 replies
  • 2621 views
Posted on November 11, 2017 at 23:27

Are there optimized basic math functions available for the STM8, such as an 8-bit by 8-bit multiply, an 8-bit by 16-bit multiply, and similar?

My application has a time-critical 8-bit by 16-bit multiply. The Cosmic compiler seems to always default to a 16-bit by 16-bit multiply, which is slower. I wrote an inline assembly macro that runs in about 2/3 the time of the compiler's output, but it was very tedious to write. I would rather not do this for every math function.

Any helpful information would be appreciated.

I include my macro here, in case anyone else finds it useful:

// macro to perform an optimized UI8 by UI16 multiply:
//   uint16_t RESULT_UI16 = (uint8_t)X_UI8 * (uint16_t)Y_UI16;
// Note: All arguments must be declared '@tiny'
// macro assumes that no overflow occurs
// '_asm()' will load an 8 bit argument into reg A or a 16 bit argument into reg X
#define MULT_8x16(X_UI8, Y_UI16, RESULT_UI16) {\
    _asm("LDW Y,X\n SWAPW X\n", (uint16_t)Y_UI16);\
    _asm("MUL X,A\n SWAPW X\n PUSHW X\n LDW X,Y\n", (uint8_t)X_UI8);\
    _asm("MUL X,A\n ADDW X,($1,SP)\n POPW Y\n");\
    _asm("CLRW Y\n LD YL,A\n LDW (Y),X\n", (@tiny uint16_t*)(&RESULT_UI16));\
}

#compiler #math

Note: this post was migrated and contained many threaded conversations, some content may be missing.
    This topic has been closed for replies.


    Visitor II
    November 12, 2017
    Posted on November 12, 2017 at 15:40

I guess there is a C rule for operand rank and promotion when the operand types differ. Casting the variables does not necessarily mean the * operator won't convert them to something else. When things need to be optimized down to the cycle level, it makes sense to drop to assembly, as in this particular case. Check the ANSI math library for a specific function (one not using *), in case it exists; that requires some reading of the compiler documentation.

    Visitor II
    November 13, 2017
    Posted on November 13, 2017 at 13:54

    Yes, C has its rules on types. But the compiler is still free to optimize, as long as the observable behaviour is the same. E.g. 8-bit types will always be promoted to at least 16-bit types by the rules. But compilers will still use 8x8->16 multiplication where they can.

    Philipp

    Visitor II
    November 14, 2017
    Posted on November 14, 2017 at 06:33

In my opinion, the arithmetic in Cosmic CXSTM8 is very high quality. However, for DSP work this may not be enough.

    Visitor II
    November 14, 2017
    Posted on November 14, 2017 at 11:37

I agree that the Cosmic compiler does well in all of the head-to-head comparisons I have seen; but I don't think I would refer to simple multiplication as 'DSP'.

The reason for this post was that I was hoping someone, official or unofficial, had identified common operations that could be sped up, and had created optimized functions or macros to perform them. An application note would be great.

    Visitor II
    November 14, 2017
    Posted on November 14, 2017 at 11:43

I assume that, with any compiler, someone has looked into common operations, especially multiplications, and how to speed them up. Multiplications can be quite time-intensive, and are important both in benchmarks and real-world applications. The fact that 8x16->16 multiplication where neither operand is a constant is not treated as a special case probably means it was not considered particularly common or important. Not even SDCC has such an optimization.

    But if you provide examples from real-world code, where such an optimization matters a lot, requesting the feature from compiler developers might result in it getting implemented.

    Philipp

    Visitor II
    November 14, 2017
    Posted on November 14, 2017 at 16:29

    UPDATE:

This new macro seems to provide the same performance on the Cosmic compiler as the '_asm()' macro in my original post, but is more portable. And I think my original time measurements were off: both of these macros may be as much as 2 times faster than the compiler's standard output.

// macro to perform an optimized UI8 by UI16 multiply:
//   uint16_t RESULT_UI16 = (uint8_t)X_UI8 * (uint16_t)Y_UI16;
// Note: All arguments should be declared '@tiny'
// macro assumes that no overflow occurs
#define MULT_8x16(X_UI8,Y_UI16,RESULT_UI16) {\
    RESULT_UI16 = (((uint8_t)((uint16_t)Y_UI16>>8) * (uint8_t)X_UI8)<<8)\
                + ((uint8_t)Y_UI16 * (uint8_t)X_UI8); \
}

    Visitor II
    November 16, 2017
    Posted on November 16, 2017 at 11:18

    Hello,

this looks like a good optimization to implement: we'll check some details and report back here soon.

    Regards,

    Luca (Cosmic)

    Visitor II
    November 16, 2017
    Posted on November 16, 2017 at 21:20

This optimization will be implemented in the next release of the compiler (no due date yet, probably a couple of months) in the form of a library routine that comes down almost exactly to the C macro mentioned above. This means that for absolute best speed the macro will still be the best solution (because it is inlined), but using it too many times will make the code bigger. Conversely, code that did not need this much speed for this particular multiplication will end up a few bytes bigger than before (though this can be avoided by using casts to force the 16x16 multiplication).

As to why we did not implement this before, Philipp already gave the biggest part of the answer: since this is not the most common kind of operation, and no one asked for it before, we preferred to favor code size rather than speed. That used to be the standard choice for 8-bit micros, but we see it slowly changing toward a more balanced approach between size and speed, so if there are other suggestions for similar improvements, don't hesitate to let us know and we will evaluate them on a case-by-case basis.

    Visitor II
    November 16, 2017
    Posted on November 16, 2017 at 23:04

    That's great, Luca, thank you.

    BTW, do you know of any white papers or app notes about writing faster code with Cosmic Compiler?