@vbk22398 wrote:
My superior wants me to do it Bare Metal in register level, but I feel it is overwhelming as there are lots of registers and bit fields to be concerned about.
Why does your superior want that? For performance reasons? Or because all code has to be written in-house, without any third-party code?
STM32CubeMX can generate code with either the HAL or the LL drivers.
LL stands for Low-Layer: it is basically just macros and inline functions that access the registers directly, while HAL wraps the peripherals in ordinary functions. In STM32CubeMX you can select per peripheral whether you want to use HAL or LL.
LL is less portable and harder to use than HAL, but I would still call it bare metal.
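To make the difference concrete, here is a rough sketch that compiles on a PC. Everything in it is a hypothetical stand-in (`mock_gpio_t`, `ll_style_set_pin`, `hal_style_write_pin`), not the actual ST code; as far as I know the real counterparts in the STM32Cube drivers are `LL_GPIO_SetOutputPin` and `HAL_GPIO_WritePin`. The only hardware detail assumed is that the lower half of BSRR sets pins and the upper half resets them.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, cut-down register layout so the sketch builds on a PC.
   On a real part the vendor device header supplies the register struct. */
typedef struct {
    volatile uint32_t ODR;   /* output data register */
    volatile uint32_t BSRR;  /* bit set/reset register */
} mock_gpio_t;

#define MOCK_PIN_5 (1u << 5)

/* LL style: an inline wrapper that boils down to a single register store. */
static inline void ll_style_set_pin(mock_gpio_t *port, uint32_t pin_mask)
{
    port->BSRR = pin_mask;
}

typedef enum { PIN_RESET = 0, PIN_SET = 1 } pin_state_t;

/* HAL style: an ordinary function that checks its arguments and decides
   at run time what to write. */
void hal_style_write_pin(mock_gpio_t *port, uint32_t pin_mask, pin_state_t state)
{
    assert(port != NULL);
    assert((pin_mask & 0xFFFF0000u) == 0u);  /* only 16 pins per port */

    if (state == PIN_SET) {
        port->BSRR = pin_mask;        /* lower 16 bits set pins */
    } else {
        port->BSRR = pin_mask << 16;  /* upper 16 bits reset pins */
    }
}
```

Both variants end up writing the same register. The HAL-style call just does a bit more work around the write, which is usually only worth removing if profiling shows it matters.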
My suggestion is to first get your code to work and then one-by-one rewrite the provided functions only if needed.
@vbk22398 wrote:
Also I don't know how to find "the things which have the biggest impact on speed."
Profiling: measure the speed. One way is to set an IO pin before calling a function and clear it afterwards. You can then use a logic analyzer or an oscilloscope to measure how long the pin stays high, which is the duration of the function. Using different IO pins for different functions gives you a nice visual overview of the timing. You can also use a timer to measure the duration of functions.
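A minimal sketch of both measurement methods, written so it compiles on a PC. The GPIO output register and the timer counter are mocked with plain variables (`mock_gpio_out`, `mock_timer_cnt`), and `profile_call` and `fake_work` are names I made up for illustration. On the target you would replace the three small helpers with the real pin set/clear writes and a read of a free-running timer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for hardware, so the sketch runs on a PC. */
static volatile uint32_t mock_gpio_out;   /* one bit per debug pin */
static volatile uint32_t mock_timer_cnt;  /* free-running tick counter */

static void debug_pin_set(unsigned pin)   { mock_gpio_out = mock_gpio_out | (1u << pin); }
static void debug_pin_clear(unsigned pin) { mock_gpio_out = mock_gpio_out & ~(1u << pin); }
static uint32_t timer_now(void)           { return mock_timer_cnt; }

typedef void (*work_fn)(void);

/* Runs fn with the debug pin held high (visible on a scope or logic
   analyzer) and returns the elapsed timer ticks. */
uint32_t profile_call(work_fn fn, unsigned pin)
{
    assert(fn != NULL);
    assert(pin < 32u);

    debug_pin_set(pin);
    uint32_t start = timer_now();
    fn();
    /* Unsigned subtraction stays correct across one counter wrap-around. */
    uint32_t elapsed = timer_now() - start;
    debug_pin_clear(pin);
    return elapsed;
}

/* Example workload for the mock: pretends to take 250 timer ticks. */
void fake_work(void)
{
    mock_timer_cnt = mock_timer_cnt + 250u;
}
```

Calling `profile_call(fake_work, 3)` would hold debug pin 3 high while `fake_work` runs and return the tick count. Giving each function of interest its own pin produces the side-by-side timing picture on the analyzer.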
Generally you want to avoid busy waiting for things like peripherals. Example:
A UART sends "Hello world!" at 9600 baud with 8 data bits, 1 stop bit and no parity. Each byte takes 10 bit times on the wire (start bit, 8 data bits, stop bit), so the 12 characters take 120 / 9600 s = 12.5 milliseconds. The UART usually reports that it is done while the last byte is still being sent, so it can report completion a little sooner. If the send function waits for the UART to finish before returning, the function takes about 12.5 milliseconds. But you can also check whether sending is done with a separate function, and do other things in the meantime.
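The arithmetic and the "start now, check later" idea can be sketched as follows. This is PC-compilable illustration code, not a driver: `mock_uart_t` is a made-up stand-in for the transmit data register and its "transmit register empty" flag, and `uart_tx_time_us`, `uart_send_start` and `uart_send_poll` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Time on the wire in microseconds: every byte costs one start bit plus
   its data, parity and stop bits. */
uint32_t uart_tx_time_us(size_t num_bytes, uint32_t baud, unsigned data_bits,
                         unsigned parity_bits, unsigned stop_bits)
{
    if (baud == 0u) {
        return 0u;
    }
    uint64_t bits = (uint64_t)num_bytes *
                    (1u + data_bits + parity_bits + stop_bits);
    return (uint32_t)(bits * 1000000u / baud);
}

/* Hypothetical stand-in for the UART hardware. */
typedef struct {
    volatile uint8_t tx_data;   /* transmit data register */
    volatile bool    tx_empty;  /* set by "hardware" when it can take a byte */
} mock_uart_t;

typedef struct {
    mock_uart_t   *uart;
    const uint8_t *buf;
    size_t         len;
    size_t         pos;
} uart_tx_job_t;

/* Registers the buffer and returns immediately, without waiting. */
void uart_send_start(uart_tx_job_t *job, mock_uart_t *uart,
                     const uint8_t *data, size_t len)
{
    assert(job != NULL && uart != NULL && data != NULL);
    job->uart = uart;
    job->buf  = data;
    job->len  = len;
    job->pos  = 0;
}

/* Call this regularly from the main loop. It hands over at most one byte
   per call and never blocks. Returns true once every byte has been handed
   to the UART, which is slightly before the last byte has left the wire. */
bool uart_send_poll(uart_tx_job_t *job)
{
    assert(job != NULL);
    if (job->pos < job->len && job->uart->tx_empty) {
        job->uart->tx_empty = false;
        job->uart->tx_data  = job->buf[job->pos++];
    }
    return job->pos == job->len;
}
```

With the numbers from the example, `uart_tx_time_us(12, 9600, 8, 0, 1)` gives 12500 microseconds. In a real project the polling would normally be replaced by the transmit interrupt or DMA, but the structure is the same: the send call returns right away and a separate check tells you when it has finished.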