I have seen a few discussions about instruction speed and the speed of toggling GPIOs. However, the discussions are not very thorough and leave many questions open, including actual results and their relevance. Since I have to deliver results at a critical stage in a project, I have spent some time doing real measurements, and I am having difficulty working out whether they represent optimum speed or whether some setting is causing slower results than expected.

Test set-up: The testing involves measuring a GPIO output pin. The benchmark reference is the ST presentation, which states that a 12MHz toggle speed is achievable. As well as determining the achievable toggle speed, I also measured the instruction speed from FLASH and SRAM.

1. Setup

Running on an STR912F with the PLL set to 48MHz. No other dividers are activated as far as I am aware. [Verification: the speed when running from the 25MHz oscillator was about half that from the PLL, and various additional dividers decreased the speed accordingly. The dividers were all removed for the measurements below.]

2. Test 1

Raw GPIO toggle speed, based on a sequence of assembler instructions optimised for one instruction per output state change:

    str r2,[r0,#0]      set '1'
    str r1,[r0,#0]      set '0'
    str r2,[r0,#0]      set '1'
    str r3,[r0,#0]      set '0'
    str r4,[r0,#0]      set '1'

The period between '0' and '1' was measured as:
- 185ns when running from FLASH
- 185ns when running from SRAM with no wait states
- 210ns when running from SRAM with wait states

The results were identical with or without buffered peripherals (this is contrary to statements in other postings?). As I understand it, the difference between buffered and unbuffered access is basically a matter of using 0x4800xxxx or 0x5800xxxx addresses.

This gives a toggle frequency of 5.4MHz when counted as edges, or 2.7MHz when measured as the frequency of the generated square wave (it is not clear how the ST value is defined). Assuming that the speed doubles at the maximum 96MHz, this gives 10.8MHz (or 5.4MHz), which is either a little less than the stated 12MHz or a little less than half of it. It seems that the toggle speed is not identical to the instruction execution speed in this case but is limited to some extent by the port access hardware (see the instruction speed measurement in the next point).

3. Instruction speed

To interpret the speed of instruction execution, a small loop was placed between two of the toggles. A variable was incremented in a register, and the resulting loop caused a total of 65 instructions to be executed in Thumb mode (I don't think that the mode, ARM or Thumb, is actually relevant for the instruction speed test). The single instruction execution time was calculated by measuring the time increase between the GPIO changes and dividing it by the total number of instructions.

- Time for 65 instructions when running in FLASH = 10.2us
- Time for 65 instructions when running in SRAM with wait states = 5.33us
- Time for 65 instructions when running in SRAM without wait states = 3.97us

The instruction times are therefore 157ns / 82ns / 61ns or, expressed in instructions per second, 6.4M / 12M / 16.4M.

The results suggest that instruction execution from SRAM can be faster than the GPIO toggle speed, so the port accesses are probably slowing things down. Since the PLL speed was 48MHz, it suggests that about 3 or more clocks are required to execute one instruction.

Now these are the measurement results, and everyone knows that measurement results have to be treated with great care because they may not be accurate. This is the main reason why I want to show them here. I was expecting the instruction rate to be equal to the PLL frequency, but the results deviate by a factor of about 3 or more (depending on where the code is running). The fact that it can be slower is not the point, because this is clear from the way the FLASH and its queue operate. Another way of putting the question is: what am I doing wrong that I don't measure a faster instruction speed? If there is an incorrect chip setting, what is it (or what could it be)? Are the GPIO results accurate (same basic question about settings)? If we assume some measurement inaccuracy, and the actual shortfall of the instruction and GPIO toggle speeds against the ST stated maximum is a factor of 2, where can this halving of speed be coming from?

Many thanks for any serious analysis and suggestions!

Regards
Mark Butcher
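For anyone wanting to repeat the arithmetic with their own scope readings, the conversions used above can be sketched in C roughly as follows. The function names are my own and purely illustrative; the only inputs are the measured times, the instruction count and the clock frequency.

```c
/* Conversions between measured times and speeds. All names are illustrative. */

/* Average time per instruction in ns, from the total time in microseconds. */
double ns_per_instruction(double total_us, unsigned instruction_count)
{
    return total_us * 1000.0 / (double)instruction_count;
}

/* Millions of instructions per second for the same measurement. */
double mips(double total_us, unsigned instruction_count)
{
    return (double)instruction_count / total_us;
}

/* CPU clocks consumed per instruction at the given clock frequency in MHz. */
double clocks_per_instruction(double total_us, unsigned instruction_count,
                              double clock_mhz)
{
    return ns_per_instruction(total_us, instruction_count) * clock_mhz / 1000.0;
}

/* Edges per second (in millions) for a given edge-to-edge time in ns. */
double edge_rate_mhz(double edge_to_edge_ns)
{
    return 1000.0 / edge_to_edge_ns;
}

/* Square wave frequency in MHz: one full period spans two edge intervals. */
double square_wave_mhz(double edge_to_edge_ns)
{
    return 1000.0 / (2.0 * edge_to_edge_ns);
}
```

With the values above, clocks_per_instruction(3.97, 65, 48.0) comes out just under 3 for zero wait state SRAM, and the FLASH case (10.2us) comes out at about 7.5 clocks per instruction.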
Thanks for the tip. I looked around for this setting and found it in the start-up assembler file. It is NOT activated, so I will change this and repeat the tests.

I also see that the start-up code is activating the wait states in SRAM. Do you know whether and when this is necessary? If I go to 96MHz, will the wait states then be necessary or are they superfluous?

Also, do you know why one would want to disable buffered operation by default? Are there risks or power consumption increases that cause ST to default it to off in the start-up code?

I will update the report once I have re-measured.

regards
Mark
I have an update after testing with buffered mode enabled.

Buffered mode not enabled

Port toggling:
- 185ns when running from FLASH
- 185ns when running from SRAM with no wait states
- 210ns when running from SRAM with wait states

Instruction timing:
- Time for 65 instructions when running in FLASH = 10.2us
- Time for 65 instructions when running in SRAM with wait states = 5.33us
- Time for 65 instructions when running in SRAM without wait states = 3.97us

The instruction times are therefore 157ns / 82ns / 61ns or, expressed in instructions per second, 6.4M / 12M / 16.4M.

Buffered mode enabled

Port toggling:
- 168ns when running from FLASH (GPIO accesses in buffered space)
- 132ns when running from buffered SRAM space with no wait states
- 130ns when running from non-buffered SRAM space with no wait states
- 126ns when running from D-TCM SRAM space

Instruction timing:
- Time for 65 instructions when running in FLASH = 10.7us
- Time for 65 instructions when running in buffered SRAM space with wait states = 4.79us
- Time for 65 instructions when running in non-buffered SRAM space without wait states = 5.39us
- Time for 65 instructions when running in D-TCM SRAM space without wait states = 3.89us

This gives the best GPIO and instruction performance when running in D-TCM SRAM space and using buffered GPIO access. However, the relationships are still not clear. Can anyone shed light on exactly what is going on?

The present best settings therefore achieve GPIO toggling in 126ns and about 17M instructions per second at 48MHz.

Testing at 96MHz does not work at the moment. The PLL locks, but as soon as the PLL is selected as the clock, FLASH memory accesses no longer seem to be correct and the code crashes. Any ideas?

Regards
Mark
Another interesting result. The 65 instruction test was a small loop:

    register volatile int x = 0;
    while (x < 10) x++;

Now I have straightened out the loop:

    register volatile int x = 0;
    x++; x++; x++; x++; x++; etc.

Now I am measuring the time for 58 Thumb instructions:
- From FLASH: 7.26us, 8M instructions per second at 48MHz
- From SRAM: 2.8us, 20M instructions per second at 48MHz (zero wait states)

This again shows quite a large difference between operation from FLASH and from SRAM (is the factor of 2.5 expected, or could it indicate a problem with settings somewhere?).

The performance out of SRAM is a bit better now without the loop, but should I not be expecting more? Is there an explanation for this?

Regards
Mark
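The two test variants can be sketched in C roughly as follows. This is only an illustration of the measurement method, not the exact code that was timed. STR9_TARGET is a hypothetical build flag and TEST_PIN a hypothetical name; 0x4800C004 is the P6.0 address quoted further down the thread, and on a PC build a plain variable stands in for the port so that the sketch compiles and runs anywhere.

```c
#include <stdint.h>

#ifdef STR9_TARGET                      /* hypothetical build flag */
#define TEST_PIN (*(volatile uint32_t *)0x4800C004u)
#else
static volatile uint32_t fake_port;     /* host stand-in for the GPIO register */
#define TEST_PIN fake_port
#endif

/* Variant 1: counted loop. The compare and branch instructions are part of
   the time measured between the two edges. */
int looped_increment(void)
{
    volatile int x = 0;
    TEST_PIN = 1;                       /* rising edge: start of measurement */
    while (x < 10) x++;
    TEST_PIN = 0;                       /* falling edge: end of measurement */
    return x;
}

/* Variant 2: the same ten increments written out, so no loop overhead
   and no branches between the two edges. */
int straight_line_increment(void)
{
    volatile int x = 0;
    TEST_PIN = 1;
    x++; x++; x++; x++; x++;
    x++; x++; x++; x++; x++;
    TEST_PIN = 0;
    return x;
}
```

The instruction count to divide by has to be taken from the compiler's listing for each variant, since it depends on how the compiler treats the volatile variable.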
I have an improvement from enabling the PFQBC, which was being disabled in the ST start-up file 91x_init.s:

    ; --- Enable 96K RAM
    LDR R0, = SCRO_AHB_UNB
    LDR R1, = 0x0196          <--- sets SRAM wait states and disables PFQBC
    STR R1, [R0]

Now the straight-line instruction performance in FLASH has improved to the same as in SRAM: 20MIPS at 48MHz.

This is better and shows that the problems are probably still set-up related. Since in this case it is the ST standard start-up code that disables it, this must in fact be quite a common problem for beginners(?)

There must be some more secret bits to set and/or clear to get the device to operate as fast as originally expected... where can they be hiding? I wonder how many times I have already re-read the user's manual? Does anyone know more?

BTW, clock / PLL register set-ups:
- 00020000 = SCU_CLKCNTR
- 000bc019 = SCU_PLLCONF (with the value 0xac019 it locks to 96MHz but the program crashes; it cannot read correctly from FLASH? However, I could previously run at 96MHz before playing around with other stuff...)

The chips are marked with 610; I think that this is Rev. D.

Regards
Mark
"A variable was incremented in a register and the resulting loop caused a total of 65 instructions in Thumb mode to be executed (I don't think that mode (ARM or Thumb) is actually relevant for the instruction speed test)."

I disagree. My tests showed that at 96 MHz the Thumb code is faster than ARM code by 42%. Hand-crafted assembly code. No compiler magic.

"If we assume some measurement inaccuracy and the actual factor between clock and instruction and GPIO toggle to ST stated maximum is a factor of 2, where can this half speed reduction be coming from???"

Those who know aren't telling, and those who don't know resort to guessing. Based on the public domain opinions scattered all over the Internet, plus doing my homework, my guess is that ST's marketing and engineering are disconnected, and that you and I are "early adopters" (i.e. volunteer beta testers).

What does your local ST FAE have to say about your tests?
Just FYI, I am in the same boat as Mark at this point.
The fastest I can toggle the GPIO is at 124 ns between edges. Here is my loop (toggling P6.0):

    while (1) {
        *(U32*)(0x4800C004) = 0x00000000;
        *(U32*)(0x4800C004) = 0x00000001;
        *(U32*)(0x4800C004) = 0x00000000;
        *(U32*)(0x4800C004) = 0x00000001;
    }

The C optimizer translates this into 4 STR and 1 B, so this is well optimized.

At 48MHz, 124 ns translates into 6 cycles. I would have expected one STR to consume either 1 cycle or 3 cycles, but not 6.

In the STR91x library, you can modify 91x_init.s and enable this define to get the buffering to work:

    #define BUFFERED_Mode ; Work on Buffered mode, when enabling this define

All other clocks are 1:1 with MCLK.

Also, I can't run at 96MHz. It just crashes when MCLK switches to the PLL.

-Mark 2
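One thing worth checking when repeating this test: the register access should be volatile-qualified, because otherwise an optimizer is allowed to merge repeated stores to the same address. Whether the U32 type above already includes volatile is not known from the post. A minimal sketch under that caution follows; P6_0_REG, toggle_burst and STR9_TARGET are hypothetical names, the address is the one from the loop above, and a plain variable stands in for the port on a PC build.

```c
#include <stdint.h>

#ifdef STR9_TARGET                      /* hypothetical build flag */
#define P6_0_REG (*(volatile uint32_t *)0x4800C004u)
#else
static volatile uint32_t fake_p6_0;     /* host stand-in for the GPIO register */
#define P6_0_REG fake_p6_0
#endif

/* Generate 'pairs' low/high pulses. Because the register is volatile,
   the compiler must emit every store, in order. */
void toggle_burst(unsigned pairs)
{
    while (pairs--) {
        P6_0_REG = 0x00000000u;
        P6_0_REG = 0x00000001u;
    }
}
```

For timing, the stores still need to be written out several times per loop pass as in the loop above, so that the branch does not sit between every pair of edges.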
I have been communicating with a couple of others who are experimenting. At present it looks as though a 6MHz toggle rate is probably the best that can be achieved (the 12M specified in ST presentations seems to really mean that edges can be generated at a 12M rate, resulting in a 6MHz square wave).

Two other pieces of info are important:
- 96 MHz is the maximum for the CPU and 48 MHz is the maximum for the APB (PCM = MCLK/2).
- Above 75M the flash needs 2 wait states, which are programmed using the FMI configuration command (see the FLASH users manual and not the device data sheet). (I didn't actually manage to get this working on a first attempt, but maybe that was because a clock was still out of spec. somewhere.)

It also seems that various clocking strategies/buffering modes are better for certain jobs, so it depends a bit on what is being optimised for. One piece of info that I received is interesting: it may be a good strategy to clock at 73MHz so that no dividers are used and no extra wait states are needed (I am not exactly sure where 73MHz comes from, perhaps it is experimental?), and the overall performance may then be best.

I haven't actually been able to do any more detailed work because the software application was getting way behind schedule. With the 20MIPS I can limp by for the moment. Once there is breathing space I will try to sort out the PLL rate, which should be adequate for the present work.

Please tell if you have something more.

Regards
Mark
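The limits mentioned above can be written down as simple checks for trying out candidate clock rates. This is only a sketch of the rules as reported in this thread (96MHz CPU maximum, 48MHz APB maximum, 2 flash wait states above 75MHz). The function names are my own, and it does not touch any registers.

```c
#include <stdbool.h>
#include <stdint.h>

#define CPU_MAX_HZ         96000000u   /* reported CPU maximum */
#define APB_MAX_HZ         48000000u   /* reported APB maximum */
#define FLASH_2WS_ABOVE_HZ 75000000u   /* above this the flash needs 2 wait states */

/* True if the CPU clock is within the reported maximum. */
bool cpu_clock_within_limit(uint32_t cpu_hz)
{
    return cpu_hz <= CPU_MAX_HZ;
}

/* True if the APB clock is within the reported maximum. */
bool apb_clock_within_limit(uint32_t apb_hz)
{
    return apb_hz <= APB_MAX_HZ;
}

/* True if the flash needs 2 wait states at this CPU clock. */
bool flash_needs_two_wait_states(uint32_t cpu_hz)
{
    return cpu_hz > FLASH_2WS_ABOVE_HZ;
}

/* A square wave needs two edges per period, so its frequency is half
   the edge rate. */
uint32_t square_wave_hz(uint32_t edges_per_second)
{
    return edges_per_second / 2u;
}
```

For example, flash_needs_two_wait_states() is true for 96MHz but false for 73MHz, and square_wave_hz(12000000) gives the 6MHz figure above.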