Recently I delved into the world of SIMD (Single Instruction, Multiple Data) instructions in Rust, leveraging NEON intrinsics on my MacBook M2 with ARM architecture. SIMD allows parallel processing by performing the same operation on multiple data points simultaneously, theoretically speeding up tasks that are parallelizable.
ARM Intrinsics
What I Did
I experimented with two functions to explore the impact of SIMD:
- Array Addition: Using SIMD to add elements of two arrays.
```rust
#[target_feature(enable = "neon")]
unsafe fn add_arrays_simd(a: &[f32], b: &[f32], c: &mut [f32]) {
    // NEON intrinsics for ARM architecture
    use core::arch::aarch64::*;

    let chunks = a.len() / 4;
    for i in 0..chunks {
        // Load 4 elements from each array into a NEON register
        let a_chunk = vld1q_f32(a.as_ptr().add(i * 4));
        let b_chunk = vld1q_f32(b.as_ptr().add(i * 4));
        let c_chunk = vaddq_f32(a_chunk, b_chunk);
        // Store the result back to memory
        vst1q_f32(c.as_mut_ptr().add(i * 4), c_chunk);
    }
    // Handle the remaining elements that do not fit into a 128-bit register
    for i in chunks * 4..a.len() {
        c[i] = a[i] + b[i];
    }
}
```

- Matrix Multiplication: Using SIMD to perform matrix multiplication.
```rust
#[target_feature(enable = "neon")]
unsafe fn multiply_matrices_simd(a: &[f32], b: &[f32], c: &mut [f32], n: usize) {
    // NEON intrinsics for ARM architecture
    use core::arch::aarch64::*;

    for i in 0..n {
        for j in 0..n {
            // Initialize a register to hold the sum
            let mut sum = vdupq_n_f32(0.0);
            for k in (0..n).step_by(4) {
                // Load 4 elements of row i of matrix A into a NEON register
                let a_vec = vld1q_f32(a.as_ptr().add(i * n + k));
                // Use the macro to load the column vector from matrix B
                let b_vec = load_column_vector!(b, n, j, k);
                // Fused multiply-add: sum + (a_vec * b_vec)
                sum = vfmaq_f32(sum, a_vec, b_vec);
            }
            // Horizontally add the four lanes of the sum register
            let result = vaddvq_f32(sum);
            // Store the result in the output matrix
            *c.get_unchecked_mut(i * n + j) = result;
        }
    }
}
```

Performance Observations
Array Addition: I benchmarked array addition across various array sizes. Surprisingly, the SIMD implementation was slower than the normal one. This is likely because array addition is memory-bound: each element is loaded once, added once, and stored once, so loads and stores dominate and explicit SIMD arithmetic has little to gain. For an input size of 100,000, SIMD was about 6 times slower than normal addition; even in its best case it was still 1.1 times slower.
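One likely factor worth noting: for a loop this simple, LLVM's auto-vectorizer usually emits SIMD instructions on its own in release builds, so the "normal" baseline is probably not truly scalar. A minimal baseline for comparison (the function name is illustrative, not from the benchmark repo); the iterator form helps the optimizer elide bounds checks and vectorize:

```rust
// Scalar baseline for element-wise addition. For a straightforward loop like
// this, LLVM will often auto-vectorize at opt-level 3, which is one plausible
// reason the hand-written NEON version showed no speedup.
fn add_arrays_scalar(a: &[f32], b: &[f32], c: &mut [f32]) {
    // Zipping the slices lets the compiler prove the lengths match
    for ((ci, ai), bi) in c.iter_mut().zip(a).zip(b) {
        *ci = ai + bi;
    }
}

fn main() {
    let a = [1.0_f32, 2.0, 3.0, 4.0, 5.0];
    let b = [10.0_f32, 20.0, 30.0, 40.0, 50.0];
    let mut c = [0.0_f32; 5];
    add_arrays_scalar(&a, &b, &mut c);
    println!("{:?}", c); // [11.0, 22.0, 33.0, 44.0, 55.0]
}
```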
Matrix Multiplication: Here, I observed a noticeable improvement in performance. For instance, with an input size of 16, SIMD was about 3 times faster than the normal implementation. Even with larger input sizes, SIMD consistently performed better, showing up to a 63% reduction in time compared to the normal method. Matrix multiplication involves a lot of repetitive operations that can be efficiently parallelized with SIMD, making it a perfect candidate for SIMD optimization.
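A practical note on calling these functions: `#[target_feature]` functions are `unsafe` to call, so the usual pattern is a safe wrapper that checks for NEON at runtime and falls back to scalar code elsewhere. A sketch of that pattern (the wrapper and fallback are my illustration, not code from the benchmark repo):

```rust
// NEON path, compiled only on aarch64 targets.
#[cfg(target_arch = "aarch64")]
#[target_feature(enable = "neon")]
unsafe fn add_arrays_simd(a: &[f32], b: &[f32], c: &mut [f32]) {
    use core::arch::aarch64::*;
    let chunks = a.len() / 4;
    for i in 0..chunks {
        let a_chunk = vld1q_f32(a.as_ptr().add(i * 4));
        let b_chunk = vld1q_f32(b.as_ptr().add(i * 4));
        vst1q_f32(c.as_mut_ptr().add(i * 4), vaddq_f32(a_chunk, b_chunk));
    }
    for i in chunks * 4..a.len() {
        c[i] = a[i] + b[i];
    }
}

// Safe wrapper: dispatch to NEON when available, scalar otherwise.
fn add_arrays(a: &[f32], b: &[f32], c: &mut [f32]) {
    assert!(a.len() == b.len() && a.len() == c.len());
    #[cfg(target_arch = "aarch64")]
    {
        if std::arch::is_aarch64_feature_detected!("neon") {
            // Safety: NEON support was just verified at runtime.
            return unsafe { add_arrays_simd(a, b, c) };
        }
    }
    // Portable scalar fallback for non-NEON targets
    for i in 0..a.len() {
        c[i] = a[i] + b[i];
    }
}

fn main() {
    let a = vec![1.0_f32; 9];
    let b = vec![2.0_f32; 9];
    let mut c = vec![0.0_f32; 9];
    add_arrays(&a, &b, &mut c);
    assert!(c.iter().all(|&x| x == 3.0));
    println!("ok");
}
```

The odd length (9) also exercises the scalar remainder loop after the 128-bit chunks.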
Comment if you have any insights or questions about SIMD instructions in Rust!
GitHub: https://github.com/amSiddiqui/Rust-SIMD-performance