Arm Developer
Neon Intrinsics in Rust
June 17, 2023 - At the time of writing, all the Neon intrinsics in Armv8.0-A are implemented and stabilized, and the intrinsics in FEAT_RDM are also stable. The AES, SHA1, and SHA2 intrinsics have been proposed and agreed for stabilization, and in a few months' time they will also be available in the stable compiler. The exception is any intrinsic that works with bfloat16 or f16, owing to a lack of support for those types in Rust.
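To see what the stabilized aarch64 intrinsics look like in practice, here is a minimal sketch; the `add4` helper and its scalar fallback are illustrative, not from the blog post:

```rust
// Minimal sketch: add two groups of four f32 values with Neon intrinsics
// on aarch64, with a scalar fallback so the function compiles on any target.
fn add4(a: [f32; 4], b: [f32; 4]) -> [f32; 4] {
    #[cfg(target_arch = "aarch64")]
    {
        use std::arch::aarch64::{vaddq_f32, vld1q_f32, vst1q_f32};
        // Safety: Neon is a mandatory feature on aarch64 targets, so the
        // intrinsics are always available there.
        unsafe {
            let vc = vaddq_f32(vld1q_f32(a.as_ptr()), vld1q_f32(b.as_ptr()));
            let mut out = [0.0f32; 4];
            vst1q_f32(out.as_mut_ptr(), vc);
            return out;
        }
    }
    #[cfg(not(target_arch = "aarch64"))]
    {
        [a[0] + b[0], a[1] + b[1], a[2] + b[2], a[3] + b[3]]
    }
}

fn main() {
    let r = add4([1.0, 2.0, 3.0, 4.0], [10.0, 20.0, 30.0, 40.0]);
    assert_eq!(r, [11.0, 22.0, 33.0, 44.0]);
    println!("{:?}", r);
}
```

Because Neon is mandatory on aarch64, no runtime feature check is needed before calling these intrinsics there.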
Discussions

32-bit ARM NEON intrinsics are unsound due to subnormal flushing
This is the ARM NEON version of #114479. Example by @beetrees, to be compiled with --target armv7-unknown-linux-gnueabihf -O -Ctarget-feature=+neon: #![feature(stdarch_arm_neon_intrinsics)] use std...
github.com
September 2, 2024
Implement all ARM NEON intrinsics
Steps for implementing an intrinsic: Select an intrinsic below Review coresimd/arm/neon.rs and coresimd/aarch64/neon.rs Consult ARM official documentation about your intrinsic Consult godbolt for h...
github.com
October 24, 2017
SIMD/NEON in Stable Rust?
I'm working on a project and I would like to work with NEON/ARM SIMD intrinsics in stable rust project. Most of the SIMD libraries in Rust are currently nightly only or don't support NEON. Is my best shot writing a C library and calling that from Rust?
users.rust-lang.org
June 27, 2020
SIMD instructions with Rust on Android - Rust Zürisee June 2023

I understood the initial part, but it's going to be extremely difficult figuring out instructions for different machines

r/rust
June 22, 2023
Rust Programming Language
Rust on ARMv7l with no NEON support - help - The Rust Programming Language Forum
May 31, 2016 - I suspect that this is because all the binaries available for Rust on armv7 (rustup-init, rust nightly, stable, etc.) are compiled with NEON support, but the CPU on the server does not feature NEON support. root@cloudsearcher:~# file rustup-init rustup-init: EL...
GitHub
32-bit ARM NEON intrinsics are unsound due to subnormal flushing · Issue #129880 · rust-lang/rust
September 2, 2024 - This is the ARM NEON version of #114479. Example by @beetrees, to be compiled with --target armv7-unknown-linux-gnueabihf -O -Ctarget-feature=+neon: #![feature(stdarch_arm_neon_intrinsics)] use std::arch::arm::{float32x2_t, vadd_f32}; us...
GitHub
Implement all ARM NEON intrinsics · Issue #148 · rust-lang/stdarch
October 24, 2017 - Steps for implementing an intrinsic: Select an intrinsic below Review coresimd/arm/neon.rs and coresimd/aarch64/neon.rs Consult ARM official documentation about your intrinsic Consult godbolt for how the intrinsic should be codegen'd, us...
Author   gnzlbg
Rust
simd::arm::neon - Rust
API documentation for the Rust `neon` mod in crate `simd`.
Arm Learning
Write SIMD code on Arm using Rust: Arm SIMD on Rust
Now you can create an equivalent program in Rust using the std::simd approach. Shown below is the same program modified to use std::simd. Replace the functions in average2.rs with the following and save the updated contents in a file named average3.rs:

#[inline(never)]
fn average_vec(c: &mut [f32], a: &[f32], b: &[f32]) {
    #[cfg(target_arch = "aarch64")]
    {
        return unsafe { average_vec_asimd(c, a, b) };
    }
}

#[cfg(target_arch = "aarch64")]
#[target_feature(enable = "neon")]
unsafe fn average_vec_asimd(c: &mut [f32], a: &[f32], b: &[f32]) {
    let half = f32x4::splat(0.5_f32);
    for i in (0..c.len()).step_by(4) {
        let va: f32x4 = f32x4::from_slice(&a[i..i + 4]);
        let vb: f32x4 = f32x4::from_slice(&b[i..i + 4]);
        let vc: f32x4 = (va + vb) * half;
        vc.copy_to_slice(&mut c[i..i + 4]);
    }
}
Rust Programming Language
SIMD/NEON in Stable Rust? - help - The Rust Programming Language Forum
June 27, 2020 - I'm working on a project and I would like to work with NEON/ARM SIMD intrinsics in stable rust project. Most of the SIMD libraries in Rust are currently nightly only or don't support NEON. Is my best shot writing a C lib…
HackMD
Hello, SIMD! - HackMD
October 5, 2020 - ## References - [General Notes][jubilee-notes] - Arm: [Arm Neon], [Neon Programmer's Guide], [SVE vs SVE2] - [Intel Intrinsics Guide] [jubilee-notes]: https://hackmd.io/-LaVJuO2SuS53uGX-D76tA [SVE vs SVE2]: https://developer.arm.com/tools-and-software/server-and-hpc/compile/arm-instruction-emulator/resources/tutorials/sve/sve-vs-sve2/single-page [Arm Neon]: https://developer.arm.com/architectures/instruction-sets/simd-isas/neon [Neon Programmer's Guide]: https://static.docs.arm.com/den0018/a/DEN0018A_neon_programmers_guide_en.pdf?_ga=2.112843328.535197283.1547875098-60705264.1529324001 [Intel Intrinsics Guide]: https://software.intel.com/sites/landingpage/IntrinsicsGuide/ ## General Thoughts ### SIMD nomenclature We want standardized names for things as we expose these capabilities to Rustaceans.
The Rust Programming Language
core::arch::arm - Rust
Non-target_arch=arm64ec and neon and fp16 · Floating-point absolute value Arm’s documentation · vabs_f32 · Experimental · neon · Floating-point absolute value Arm’s documentation · vabs_s8 · Experimental · neon · Absolute value (wrapping). Arm’s documentation ·
GitHub
simd/src/arm/neon.rs at master · hsivonen/simd
A crate that exposes some SIMD functionality on nightly Rust; to be obsoleted by stdsimd - hsivonen/simd
Author   hsivonen
Gendignoux
Testing SIMD instructions on ARM with Rust on Android | Blog | Guillaume Endignoux
January 5, 2023 - So for example, Rust’s HashMap doesn’t use any SIMD on ARM (see rust-lang/hashbrown/269). Additionally, on a given CPU architecture the performance can vary from one CPU model to the next. So in any case: benchmark, measure and profile your code! The second aspect is whether dynamic feature detection (and its overhead) matters in practice. As we’ve learned, all Android devices running on ARM64 support NEON, with the feature enabled at compile time.
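The dynamic-feature-detection point above can be sketched with the stable std::arch macro; `neon_available` is an illustrative helper name, not from the blog post. On aarch64 the check always reports true in practice, since Neon is mandatory there, which is why compile-time enabling suffices on ARM64 Android:

```rust
// Sketch of runtime Neon detection. The stable
// is_aarch64_feature_detected! macro answers the question on aarch64;
// 32-bit ARM detection is still nightly-only, so this sketch
// conservatively reports false on every other architecture.
fn neon_available() -> bool {
    #[cfg(target_arch = "aarch64")]
    {
        return std::arch::is_aarch64_feature_detected!("neon");
    }
    #[cfg(not(target_arch = "aarch64"))]
    {
        false
    }
}

fn main() {
    println!("neon available: {}", neon_available());
}
```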
The Rust Programming Language
core::arch::aarch64 - Rust
CRC32 single round checksum for bytes (32 bits). Arm’s documentation · vaba_s8 · neon · Absolute difference and accumulate (64-bit) Arm’s documentation · vaba_s16 · neon · Absolute difference and accumulate (64-bit) Arm’s documentation · vaba_s32 · neon ·
GitHub
Add arm neon vector types and fixes ARM tests by gnzlbg · Pull Request #384 · rust-lang/stdarch
ARMv6 and older do not support neon at all, so for ARM neon is only allowed if the v7 feature is enabled (probably should check v8 here as well but that's not supported by rustc yet?).
Author   gnzlbg
Top answer

I've shown this to a colleague who's come up with better intrinsics code than I would. Here's his suggestion. It hasn't been compiled, so some pseudo-code pieces still need finishing off, but something along the lines of the below should work and be much faster:

let needle_vector = vdupq_n_u8(b'\n'); // the byte we are searching for (new-lines here)
let mut line_counter = 0;
for chunk in buffer.chunks_exact(32) { // Read 32 bytes at a time; chunks_exact avoids a short tail chunk
    unsafe {
        let src1 = vld1q_u8(chunk.as_ptr());
        let src2 = vld1q_u8(chunk.as_ptr().add(16));
        let out1 = vceqq_u8(src1, needle_vector);
        let out2 = vceqq_u8(src2, needle_vector);
        // We slot these next to each other in the same vector.
        // In this case the bottom 64 bits of the vector tell you
        // whether there are any needle values inside the first vector,
        // and the top 64 bits tell you whether there are any needle
        // values in the second vector.
        let combined = vpmaxq_u8(out1, out2);
        // Another pairwise max compresses this information into
        // a single 64-bit value, where the bottom 32 bits tell us about
        // src1 and the top 32 bits about src2.
        let combined = vpmaxq_u8(combined, combined);
        let remapped = vreinterpretq_u64_u8(combined);
        let val = vgetq_lane_u64(remapped, 0);
        if val == 0 {
            // Most chunks won't have a new-line: no match in either
            // vector, so adjust the offset and continue.
            continue;
        }
        if val & 0xFFFF_FFFF != 0 {
            // There must be a match in src1; use the code below in a function.
        }
        if val & 0xFFFF_FFFF_0000_0000 != 0 {
            // There must be a match in src2; use the code below in a function.
        }
    }
}

Now that we know which vector to look in, we need to find the index within that vector. As an example, let's assume matchvec is the vector we found above (so either out1 or out2).

To find the first index:

// We create a mask of repeating 0xf00f chunks. When we fill an entire
// vector with it we get a pattern where every byte is 0xf0 or 0x0f.
// We'll use this to find the index of the matches.
let mask = unsafe { vreinterpretq_u8_u16(vdupq_n_u16(0xf00f)) };

// We first clear the bits we don't want, which leaves each adjacent pair
// of 8-bit entries with 4 bits of free space, alternating.
let masked = vandq_u8(matchvec, mask);
// Which means when we do a pairwise addition we are sure that no overflow
// will ever happen. The entries slot next to each other, and a non-zero
// nibble indicates a match. We've also compressed the values into the
// lower 64 bits again.
let compressed = vpaddq_u8(masked, masked);
let val = vgetq_lane_u64(vreinterpretq_u64_u8(compressed), 0);
// val now holds one 4-bit nibble per input byte, with lane 0 in the low
// bits, so count trailing zeros to locate the first match. (If Rust has
// not kept the value in a general-purpose register, a bit-reverse plus
// vclz on the SIMD side is the usual alternative.)
let pos = val.trailing_zeros() as usize;
// Shift pos right by 2 to turn the bit position into the actual index.
let pos = pos >> 2;

pos will now contain the index of the first needle value.

If you were processing out2, remember to add 16 to the result.
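The nibble arithmetic can be checked without Neon hardware. This portable sketch assumes (as on little-endian aarch64) that lane 0 of the compressed mask lands in the low bits of val, so the first match is the lowest set nibble, found by counting trailing zeros; `first_match_index` is an illustrative name, not from the answer:

```rust
// Portable model of the compressed match mask: each 4-bit nibble of `val`
// is 0xf iff the corresponding input byte matched the needle. The lowest
// set bit therefore sits at bit 4*index, so trailing_zeros() >> 2
// recovers the index of the first matching byte.
fn first_match_index(val: u64) -> Option<usize> {
    if val == 0 {
        return None; // no match anywhere in this vector
    }
    Some((val.trailing_zeros() >> 2) as usize)
}

fn main() {
    // Matches at byte indices 0 and 5 -> nibbles 0 and 5 set.
    assert_eq!(first_match_index(0x00f0_000f), Some(0));
    // Match at byte index 2 only.
    assert_eq!(first_match_index(0x0000_0f00), Some(2));
    assert_eq!(first_match_index(0), None);
}
```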

To find all the indices, we can run through the bitmask without using clz; this way we avoid the repeated register-file transfers.

// Set masked and compressed as above.
let masked = vandq_u8(matchvec, mask);
let compressed = vpaddq_u8(masked, masked);
let mut val = vgetq_lane_u64(vreinterpretq_u64_u8(compressed), 0);
let mut idx = current_offset;
while val != 0 {
    if val & 0xf != 0 {
        // Entry found at idx.
    }
    idx += 1;
    val >>= 4;
}
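The nibble-walking loop can likewise be modelled portably for testing; `match_indices` is an illustrative helper, not from the answer:

```rust
// Walk a compressed match mask nibble by nibble: every non-zero 4-bit
// group marks a matching byte at the corresponding index, offset by
// `base` (the position of the current 16-byte chunk in the buffer).
fn match_indices(mut val: u64, base: usize) -> Vec<usize> {
    let mut found = Vec::new();
    let mut idx = base;
    while val != 0 {
        if val & 0xf != 0 {
            found.push(idx); // entry found at idx
        }
        idx += 1;
        val >>= 4;
    }
    found
}

fn main() {
    // Nibbles 0 and 5 set -> matches at byte indices 0 and 5.
    assert_eq!(match_indices(0x00f0_000f, 0), vec![0, 5]);
    // Same mask for the second vector: add the chunk offset of 16.
    assert_eq!(match_indices(0x00f0_000f, 16), vec![16, 21]);
    assert_eq!(match_indices(0, 0), Vec::<usize>::new());
}
```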
Reddit
r/rust on Reddit: Exploring RustFFT's SIMD Architecture
January 6, 2021 - I think ARM has Neon as an alternative to the AVX instructions in x86.
Reddit
r/rust on Reddit: Exploring SIMD Instructions in Rust on a MacBook M2
July 9, 2024 -

Recently I delved into the world of SIMD (Single Instruction, Multiple Data) instructions in Rust, leveraging NEON intrinsics on my MacBook M2 with ARM architecture. SIMD allows parallel processing by performing the same operation on multiple data points simultaneously, theoretically speeding up tasks that are parallelizable.

ARM Intrinsics

What I Did?

I experimented with two functions to explore the impact of SIMD:

  • Array Addition: Using SIMD to add elements of two arrays.

#[target_feature(enable = "neon")]
unsafe fn add_arrays_simd(a: &[f32], b: &[f32], c: &mut [f32]) {
    // NEON intrinsics for ARM architecture
    use core::arch::aarch64::*;

    let chunks = a.len() / 4;
    for i in 0..chunks {
        // Load 4 elements from each array into a NEON register
        let a_chunk = vld1q_f32(a.as_ptr().add(i * 4));
        let b_chunk = vld1q_f32(b.as_ptr().add(i * 4));
        let c_chunk = vaddq_f32(a_chunk, b_chunk);
        // Store the result back to memory
        vst1q_f32(c.as_mut_ptr().add(i * 4), c_chunk);
    }

    // Handle the remaining elements that do not fit into a 128-bit register
    for i in chunks * 4..a.len() {
        c[i] = a[i] + b[i];
    }
}
  • Matrix Multiplication: Using SIMD to perform matrix multiplication.

#[target_feature(enable = "neon")]
unsafe fn multiply_matrices_simd(a: &[f32], b: &[f32], c: &mut [f32], n: usize) {
    // NEON intrinsics for ARM architecture
    use core::arch::aarch64::*;
    for i in 0..n {
        for j in 0..n {
            // Initialize a register to hold the sum
            let mut sum = vdupq_n_f32(0.0);

            for k in (0..n).step_by(4) {
                // Load 4 elements from matrix A into a NEON register
                let a_vec = vld1q_f32(a.as_ptr().add(i * n + k));
                // Use the macro to load the column vector from matrix B
                let b_vec = load_column_vector!(b, n, j, k);

                // Intrinsic to perform (a * b) + c
                sum = vfmaq_f32(sum, a_vec, b_vec);
            }
            // Horizontal add the elements in the sum register
            let result = vaddvq_f32(sum);
            // Store the result in the output matrix
            *c.get_unchecked_mut(i * n + j) = result;
        }
    }
}

Performance Observations

Array Addition: I benchmarked array addition on various array sizes. Surprisingly, the SIMD implementation was slower than the normal implementation. This might be due to the overhead of loading data into SIMD registers and the relatively small benefit from parallel processing for this task. For example, with an input size of 100,000, SIMD was about 6 times slower than normal addition. Even at the best case for SIMD, it was still 1.1 times slower.

Matrix Multiplication: Here, I observed a noticeable improvement in performance. For instance, with an input size of 16, SIMD was about 3 times faster than the normal implementation. Even with larger input sizes, SIMD consistently performed better, showing up to a 63% reduction in time compared to the normal method. Matrix multiplication involves a lot of repetitive operations that can be efficiently parallelized with SIMD, making it a perfect candidate for SIMD optimization.

Comment if you have any insights or questions about SIMD instructions in Rust!

GitHub: https://github.com/amSiddiqui/Rust-SIMD-performance