ESP32-S3 has (a few) SIMD instructions

Digital Author May 6, 2024

Intro

Espressif Systems released their ESP32-S3 SoC a few years ago, but only recently have they released more documentation and support for its full capabilities. Without any changes to your code, the S3 runs about 15% faster than older ESP32 CPUs at the same clock speed. It also has a 'hidden' feature that is more complicated to use, but may be worth the effort if you need more speed. This article is aimed at programmers who are already familiar with SIMD instructions on other platforms.

I've been optimizing code with SIMD for more than 15 years on Intel, Arm and DSPs (even Cadence's), so when I heard that the S3 had SIMD instructions, I immediately went looking for documentation. When the S3 became available to buy, there was only a promise of documentation and support. In the 2+ years since then, not much has changed. At the end of 2023, Espressif released a document describing the new instructions:

S3 Technical Reference Manual

The document has a good level of detail and only a few errors, but what is still missing are examples and documentation on how to use the instructions in your own code. Conspicuously absent are directions for using the assembler and linking the result to your C/C++ code. That isn't entirely the fault of Espressif. The Xtensa processor comes from Cadence, and for some reason they like to keep everything under NDA, even information which would help people use their processors. I find it hard to understand why the instruction set needs to be kept secret; a CPU vendor should make it as easy as possible for engineers to use their CPUs. The 'trade secrets' are in the hardware design, not in the instruction set. I've worked with Cadence's DSPs before, so I'm familiar with their way of doing things. Their Vision DSPs have a rich and powerful instruction set. Unfortunately, the S3 has a very minimal set of SIMD instructions, probably due to cost and silicon area limits.

Since the SIMD 'Processor Extension' is treated as a coprocessor, instructions from the main instruction set need to be mixed in with the SIMD code. Here's the document for the main LX7 instructions:

Xtensa ISA

The programmer model includes 16 general purpose/address registers (a0-a15), 8 128-bit wide SIMD registers (q0-q7), and two special accumulator registers for multiply/accumulate operations. The memory bus is documented as 128 bits wide, so it's definitely advantageous to read and write memory at the native width. There are also some instructions to manipulate GPIO bits.

How I got started with S3 SIMD

I spent the better part of a day experimenting with these instructions until I got working code. I started with a search on GitHub for any public repos containing the one instruction needed for any S3 SIMD project: load (ee.vld.128). A few hits popped up in Espressif's esp-dsp project. A lot of their ESP32-S3 code is closed source, but a few functions pulled back the veil on how to use the instructions in my own projects. Here is the code I used as a starting point:

https://github.com/espressif/esp-dsp/blob/master/modules/fft/fixed/dsps_fft2r_sc16_aes3.S

I tried putting multiple functions into a single .S file, but that doesn't seem to work, so each function gets its own file. My first use case for these instructions is to optimize the color conversion step of my JPEG decoder. The YCbCr->RGB step takes a significant amount of time and is a good fit for SIMD optimization.
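For readers who want a scalar baseline to compare a SIMD version against, here is a minimal sketch of the YCbCr->RGB565 step in plain C. It is not the JPEGDEC code: the function names are hypothetical, and the constants are the standard JFIF coefficients scaled by 256 and rounded, which is only one of several reasonable fixed-point choices.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Clamp an intermediate result to the 0..255 range of an 8-bit channel. */
static uint8_t clamp_u8(int32_t v)
{
    if (v < 0) return 0;
    if (v > 255) return 255;
    return (uint8_t)v;
}

/* Convert one JFIF YCbCr sample to an RGB565 pixel.
 * Hypothetical helper; coefficients are (value * 256) rounded:
 * 1.402 -> 359, 0.344136 -> 88, 0.714136 -> 183, 1.772 -> 454. */
uint16_t ycbcr_to_rgb565(uint8_t y, uint8_t cb, uint8_t cr)
{
    int32_t cb0 = (int32_t)cb - 128;
    int32_t cr0 = (int32_t)cr - 128;

    /* Divide by 256 rather than shifting so negative values stay
     * well defined in ISO C. */
    int32_t r = y + (359 * cr0) / 256;
    int32_t g = y - (88 * cb0 + 183 * cr0) / 256;
    int32_t b = y + (454 * cb0) / 256;

    uint8_t r8 = clamp_u8(r), g8 = clamp_u8(g), b8 = clamp_u8(b);

    /* Pack 5 bits of red, 6 of green, 5 of blue. */
    return (uint16_t)(((r8 >> 3) << 11) | ((g8 >> 2) << 5) | (b8 >> 3));
}

/* Convert a row of samples; all four buffers hold `count` entries. */
void ycbcr_row_to_rgb565(const uint8_t *y, const uint8_t *cb,
                         const uint8_t *cr, uint16_t *out, size_t count)
{
    assert(y && cb && cr && out);
    for (size_t i = 0; i < count; i++) {
        out[i] = ycbcr_to_rgb565(y[i], cb[i], cr[i]);
    }
}
```

The per-pixel work is a handful of multiplies, adds, clamps and shifts with no dependency between pixels, which is the kind of loop that maps naturally onto 16-bit SIMD lanes.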

The Instruction Set

I've written SIMD code for pixel color conversion multiple times in multiple SIMD instruction sets, and what struck me with the S3 was how little I had to work with. One of the main sticking points is that even though the instruction encodings are 24 bits each, there are no bits reserved for a shift amount. There are specific shift instructions, and even they do not have the shift amount encoded. The multiply instructions can shift right after the multiply, but the shift amount must first be loaded into the SAR (shift amount register). This requires 2 extra instructions (and potentially extra pipeline cycles) to accomplish. The broader problems that make it harder to get work done are that the instructions are not orthogonal across data sizes and are missing a lot of things needed to write efficient code. For example, the shift left and right instructions only operate on 32-bit values and only do arithmetic shifting (carrying the sign bit), not logical. To generate RGB565 output, I need to shift 16-bit values. My workaround was to multiply by 1 and set the SAR for right shifting, and to multiply by a power of 2 and set the SAR to 0 for left shifting.
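The multiply-as-shift workaround can be modelled in scalar C to check the arithmetic before writing any assembly. The sketch below is only a model under stated assumptions: it assumes each 16-bit lane is multiplied into a 32-bit product, shifted right by the SAR value, and truncated to its low 16 bits. The function names are hypothetical and the exact hardware behavior should be confirmed against the Technical Reference Manual.

```c
#include <assert.h>
#include <stdint.h>

#define LANES 8  /* eight 16-bit lanes fit in one 128-bit register */

/* One lane of the modelled multiply: 32-bit product, shifted right by
 * `sar`, low 16 bits kept. Assumed model, not a spec of the instruction. */
int16_t mul_shift_lane(int16_t a, int16_t b, unsigned sar)
{
    assert(sar < 32);
    int32_t product = (int32_t)a * (int32_t)b;
    /* >> on a negative int32_t is arithmetic on GCC and Clang
     * (implementation-defined in ISO C). */
    int32_t shifted = product >> sar;
    return (int16_t)(uint16_t)((uint32_t)shifted & 0xFFFFu);
}

/* Right shift every lane by n: multiply by 1 with the shift amount n. */
void shift_right_s16(const int16_t *in, unsigned n, int16_t *out)
{
    for (int i = 0; i < LANES; i++) {
        out[i] = mul_shift_lane(in[i], 1, n);
    }
}

/* Left shift every lane by n (0..14): multiply by 2^n, shift amount 0. */
void shift_left_s16(const int16_t *in, unsigned n, int16_t *out)
{
    assert(n < 15);
    int16_t factor = (int16_t)(1 << n);
    for (int i = 0; i < LANES; i++) {
        out[i] = mul_shift_lane(in[i], factor, 0);
    }
}
```

Note that the right shift in this model is still arithmetic, so negative lane values keep their sign; for RGB565 packing the inputs are clamped to 0..255 first, so the distinction does not matter there.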

Here's a quick list of what I consider essential SIMD features that are missing on the S3:

– Shift right logical

– Shift 8 or 16-bit values

– Add or multiply with widening

– Right shift with narrowing

– Add and subtract without saturation

– Unaligned reads and writes

Also missing (nice-to-haves) that other Cadence DSPs have:

– Scatter / gather writes/reads

– Horizontal vector operations

– Rearrange vector element order

– Floating point support

– Predicated operations (operate on selected elements only)

The gotcha that had me running in circles for a while is the memory alignment restriction. In my JPEG decoder's internal data structure, some of the elements are aligned on 4-byte boundaries. S3 SIMD load and store instructions can at best access memory on 8-byte boundaries, but ideally need everything on 16-byte boundaries. I fixed this by inserting a "long double" in front of the items that need 16-byte alignment 😀.
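A more explicit way to request the same alignment, for compilers that support C11, is `alignas`. The sketch below is an alternative to the padding trick rather than what the decoder actually does; the struct, its field names and the helper are hypothetical.

```c
#include <assert.h>
#include <stdalign.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical decoder state: a few small fields followed by a buffer
 * that is meant to be read and written with 128-bit loads and stores. */
typedef struct {
    uint8_t flags[4];                 /* would leave the next member at offset 4 */
    alignas(16) int16_t samples[64];  /* forced onto a 16-byte boundary */
} decoder_state_t;

/* Fail the build if a later edit to the struct breaks the alignment. */
static_assert(offsetof(decoder_state_t, samples) % 16 == 0,
              "samples must start on a 16-byte boundary");

/* Runtime check, useful for pointers that come from elsewhere. */
int is_aligned16(const void *p)
{
    return ((uintptr_t)p % 16u) == 0;
}

/* A statically allocated instance inherits the struct's 16-byte alignment. */
decoder_state_t example_state;
```

The member alignment only holds if the struct itself is placed on a 16-byte boundary. Static and stack instances get that from the compiler, while a heap allocation is only as aligned as the allocator guarantees, so a runtime check like `is_aligned16` is worth keeping in debug builds.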

Where to go from here…

I plan to add some optimized functions to my various imaging libraries where appropriate and see where it takes me. I did an initial test with my JPEGDEC library and saw a nearly 40% speedup (14ms -> 10ms) by using the S3 SIMD instructions for the color conversion step. I plan to publish this code after I've had time to fully flesh it out and test it. Good luck with your use of S3 SIMD…