I wish it were a cleverer answer, but I'm afraid it's simply that MATLAB has been heavily optimized for |for| loops over the years, and the same optimizations have not been applied to |arrayfun|. In this case by far the most important optimization will be multithreading, so perhaps you have 32 virtual cores to work with. It's tempting to just convert the arrayfun implementation to a for loop internally but, as others have implied, the devil is in the detail, since it would take some effort to get identical, backwards-compatible behaviour. Best just to think of |arrayfun| as syntactic sugar, to be used only in situations that are not performance-critical. Or for the GPU, where it has a special implementation. Answer from Joss Knight on mathworks.com
🌐
MathWorks
mathworks.com › matlabcentral › answers › 324130-is-arrayfun-faster-much-more-than-for-loop
is 'arrayfun' faster much more than 'for' loop? - MATLAB Answers - MATLAB Central
February 9, 2017 - It's kind of sad because writing a single command arrayfun is so much more appealing than writing out the full for loop. I guess that's the sacrifice that we make for performance ;) ... https://www.mathworks.com/matlabcentral/answers/324130-is-arrayfun-faster-much-more-than-for-loop#comment_1265555
Discussions

arrays - arrayfun can be significantly slower than an explicit loop in matlab. Why? - Stack Overflow
On my machine (Matlab 2011b on Linux Mint 12), the output of this test is: Elapsed time is 1.020689 seconds. Elapsed time is 9.248388 seconds. What the?!? arrayfun, while admittedly a cleaner looking solution, is an order of magnitude slower. What is going on here? Further, I did a similar style of test for cellfun and found it to be about 3 times slower than an explicit loop... More on stackoverflow.com
🌐 stackoverflow.com
array/cellfun vs. for loop
array/cellfun vs. for loop. Learn more about speed, arrayfun, cellfun, for loop, optimization MATLAB More on mathworks.com
🌐 mathworks.com
1
5
June 28, 2012
Investigating the speed of arrayfun vs alternatives

More on reddit.com
🌐 r/matlab
6
10
November 10, 2016
Comparing arrayfun and for loop
Comparing arrayfun and for loop. Learn more about loops, performance More on mathworks.com
🌐 mathworks.com
1
3
April 21, 2013
Top answer
1 of 2
102

You can get the idea by running other versions of your code. Consider explicitly writing out the computations, instead of using a function in your loop:

tic
Soln3 = ones(T, N);
for t = 1:T
    for n = 1:N
        Soln3(t, n) = 3*x(t, n)^2 + 2*x(t, n) - 1;
    end
end
toc

Time to compute on my computer:

Soln1  1.158446 seconds.
Soln2  10.392475 seconds.
Soln3  0.239023 seconds.
Oli    0.010672 seconds.

Now, while the fully 'vectorized' solution is clearly the fastest, you can see that defining a function to be called for every x entry is a huge overhead. Just explicitly writing out the computation gave us a factor-of-5 speedup. I guess this shows that MATLAB's JIT compiler does not support inline functions. According to the answer by gnovice there, it is actually better to write a normal function rather than an anonymous one. Try it.

Next step - remove (vectorize) the inner loop:

tic
Soln4 = ones(T, N);
for t = 1:T
    Soln4(t, :) = 3*x(t, :).^2 + 2*x(t, :) - 1;
end
toc

Soln4  0.053926 seconds.

Another factor 5 speedup: there is something in those statements saying you should avoid loops in MATLAB... Or is there really? Have a look at this then

tic
Soln5 = ones(T, N);
for n = 1:N
    Soln5(:, n) = 3*x(:, n).^2 + 2*x(:, n) - 1;
end
toc

Soln5   0.013875 seconds.

Much closer to the 'fully' vectorized version. Matlab stores matrices column-wise. You should always (when possible) structure your computations to be vectorized 'column-wise'.
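
To make the column-wise storage point concrete, here is a minimal sketch on a tiny matrix. The variable names and sizes are illustrative assumptions, not part of the original test. It shows that linear indices run down each column first, which is why a whole column is contiguous in memory while a row is strided.

```matlab
% Sketch: MATLAB stores matrices in column-major order.
A = reshape(1:6, 2, 3);          % A = [1 3 5; 2 4 6]

col = A(:, 2);                   % one column: elements 3 and 4, adjacent in memory
row = A(1, :);                   % one row: elements 1, 3 and 5, two positions apart
linearOrder = A(:).';            % memory order of the elements: 1 2 3 4 5 6

% Linear index 4 corresponds to row 2, column 2.
[r, c] = ind2sub(size(A), 4);
fprintf('linear index 4 -> (%d, %d)\n', r, c);
```

This is consistent with Soln5 (whole columns per iteration) beating Soln4 (whole rows per iteration) in the timings above.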

We can go back to Soln3 now. The loop order there is 'row-wise'. Let's change it:

tic
Soln6 = ones(T, N);
for n = 1:N
    for t = 1:T
        Soln6(t, n) = 3*x(t, n)^2 + 2*x(t, n) - 1;
    end
end
toc

Soln6  0.201661 seconds.

Better, but still very bad. Single loop - good. Double loop - bad. I guess MATLAB did some decent work on improving the performance of loops, but the loop overhead is still there. If you had some heavier work inside, you would not notice it. But since this computation is memory-bandwidth bound, you do see the loop overhead. And you will see the overhead of calling Func1 there even more clearly.

So what's up with arrayfun? No function inlining there either, so a lot of overhead. But why so much worse than a double nested loop? Actually, the topic of using cellfun/arrayfun has been extensively discussed many times (e.g. here, here, here and here). These functions are simply slow; you cannot use them for such fine-grained computations. You can use them for code brevity and fancy conversions between cells and arrays. But the function needs to be heavier than what you wrote:

tic
Soln7 = arrayfun(@(a)(3*x(:,a).^2 + 2*x(:,a) - 1), 1:N, 'UniformOutput', false);
toc

Soln7  0.016786 seconds.

Note that Soln7 is a cell array now; sometimes that is useful. Code performance is quite good now, and if you need a cell array as output, you do not need to convert your matrix after you have used the fully vectorized solution.
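
If a plain matrix is wanted after all, the cell output can be concatenated back. The following is a small sketch under assumed sizes (T = 4 and N = 3 are illustrative, not the sizes used in the timings). It checks the result against the fully vectorized expression.

```matlab
% Sketch: per-column arrayfun returning a cell array, then back to a matrix.
T = 4; N = 3;                    % tiny illustrative sizes (assumption)
x = randn(T, N);

% One cell per column, each holding a T-by-1 vector.
C = arrayfun(@(a) 3*x(:,a).^2 + 2*x(:,a) - 1, 1:N, 'UniformOutput', false);

M = [C{:}];                      % horizontal concatenation gives a T-by-N matrix
Mvec = 3*x.^2 + 2*x - 1;         % fully vectorized reference
maxDiff = max(abs(M(:) - Mvec(:)));
fprintf('max difference to vectorized result: %g\n', maxDiff);
```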

So why is arrayfun slower than a simple loop structure? Unfortunately, it is impossible for us to say for sure, since there is no source code available. You can only guess that since arrayfun is a general-purpose function, which handles all kinds of different data structures and arguments, it is not necessarily very fast in simple cases that you can directly express as loop nests. We cannot know where the overhead comes from. Could the overhead be avoided by a better implementation? Maybe not. But unfortunately the only thing we can do is study the performance to identify the cases in which it works well and those in which it doesn't.

Update: Since the execution time of this test is short, I have now added a loop around the tests to get reliable results:

for i=1:1000
   % compute
end

Some times given below:

Soln5   8.192912 seconds.
Soln7  13.419675 seconds.
Oli     8.089113 seconds.

You see that the arrayfun is still bad, but at least not three orders of magnitude worse than the vectorized solution. On the other hand, a single loop with column-wise computations is as fast as the fully vectorized version... That was all done on a single CPU. Results for Soln5 and Soln7 do not change if I switch to 2 cores: in Soln5 I would have to use a parfor to get it parallelized. Forget about speedup... Soln7 does not run in parallel because arrayfun does not run in parallel. Oli's vectorized version, on the other hand:

Oli  5.508085 seconds.
2 of 2
-8

That's because

x = randn(T, N);

is not a gpuArray.

All you need to do is

x = randn(T, N,'gpuArray');
🌐
MathWorks
mathworks.com › matlabcentral › answers › 144344-in-my-code-arrayfun-slower-than-for-loop
In my code, arrayfun slower than for loop - MATLAB Answers - MATLAB Central
With the exception of being on a GPU, arrayfun will most likely often be slower than a for-loop and harder to read. It's just a less flexible more complex for-loop. Personally, I'd recommend against it at all cost unless you're targeting a GPU.
🌐
Reddit
reddit.com › r/matlab › investigating the speed of arrayfun vs alternatives
r/matlab on Reddit: Investigating the speed of arrayfun vs alternatives
November 10, 2016 -

I have been writing some matlab code and I've found myself using arrayfun to elegantly perform certain operations without a for loop. (similar to list comprehension in Python).

However, I started to think about the speed of it, so I made a contrived example of squaring a large matrix (my actual uses are a bit more fancy). See the code below if you're interested.

The final times were:

method               time (s)
array_fun            5.1231
array_fun_fixed      5.0522
pre_loop             0.0343
non_pre_loop         0.3276
pre_loop_fun         0.5520
pre_loop_fun_each    71.9136
non_pre_loop_fun     0.8565
direct               9.1100e-04

I expected the direct would be the fastest. And I expected the preallocated loops to be faster than the dynamically sized ones. I also expected the one where it creates the function each time to be the slowest.

And, I kind of expected for an anonymous function to be slower than direct manipulation. But what really surprised me was (a) the HUGE slowdown of arrayfun and (b) that it was (much) slower than a loop calling the same function. It seems as though fixing the function does not matter, but arrayfun still underperforms a loop.

If anything, I would expect arrayfun to be on par with a pre-allocated loop. Assuming UniformOutput is set to true (the default), Matlab knows the final size of the returned array. Furthermore, it should just be looping over it in a pretty clear manner.
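
For reference, a short sketch of what the UniformOutput option changes; the 3-by-4 input is an illustrative assumption. With the default (true), arrayfun returns an ordinary array the same size as its input, so the output size is known up front. With false, it returns a cell array of the same size.

```matlab
% Sketch: effect of 'UniformOutput' on the return type of arrayfun.
R = rand(3, 4);                                      % small illustrative input

A = arrayfun(@(n) n^2, R);                           % default: 3-by-4 double array
C = arrayfun(@(n) n^2, R, 'UniformOutput', false);   % 3-by-4 cell array of scalars

fprintf('class(A) = %s, class(C) = %s\n', class(A), class(C));
```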

Any thoughts or insights into the overhead and methods of this?

Thanks

Code

The script I ran:

clear
R = rand(1000);

% Arrayfun with anonymous function
clear BLA
tic
    BLA = arrayfun(@(n) n^2,R);
T.array_fun = toc;

clear BLA
tic
    FUN = @(n) n^2;
    BLA = arrayfun(FUN,R);
T.array_fun_fixed = toc;

% Preallocated loop
clear BLA
tic
    BLA = zeros(size(R));
    for ii = 1:size(R,1)
        for jj = 1:size(R,2)
            BLA(ii,jj) = R(ii,jj)^2;
        end
    end 
T.pre_loop = toc;

% Non-Preallocated loop
clear BLA
tic
    for ii = 1:size(R,1)
        for jj = 1:size(R,2)
            BLA(ii,jj) = R(ii,jj)^2;
        end
    end 
T.non_pre_loop = toc;

% Preallocated loop with anonymous function
clear BLA
tic
    fun = @(n) n^2;
    BLA = zeros(size(R));
    for ii = 1:size(R,1)
        for jj = 1:size(R,2)
            BLA(ii,jj) = fun(R(ii,jj));
        end
    end 
T.pre_loop_fun = toc;

% Preallocated loop with anonymous function EACH TIME
clear BLA
tic
    BLA = zeros(size(R));
    for ii = 1:size(R,1)
        for jj = 1:size(R,2)
            fun = @(n) n^2;
            BLA(ii,jj) = fun(R(ii,jj));
        end
    end 
T.pre_loop_fun_each = toc;

% Non-Preallocated loop with anonymous function
clear BLA
tic
    fun = @(n) n^2;
    for ii = 1:size(R,1)
        for jj = 1:size(R,2)
            BLA(ii,jj) = fun(R(ii,jj));
        end
    end 
T.non_pre_loop_fun = toc;

% Direct
clear BLA
tic
    BLA = R.^2;
T.direct = toc;
Top answer
1 of 3
5

Looking around on StackOverflow, it seems like this is recognized behavior. But that doesn't mean there aren't cases where arrayFun can perform better.

On my machine, my times are roughly twice yours for your code. But consider this example instead:

x = gpuArray(rand(100,100,3));
tic
    z = arrayfun(@(x) x^2, x);
toc

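% Note: length(x) is 100 for this 100x100x3 array, so z starts with 100 elements
% and grows inside the loop, which runs over all numel(x) = 30000 elements.
z = gpuArray(zeros(length(x),1));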
tic
    for i=1:length(x(:))
        z(i) = x(i).^2;
    end
toc

Elapsed time is 0.011370 seconds.
Elapsed time is 17.763806 seconds.

Arrayfun's assumptions that allow for parallel code lead to a huge jump in performance in that use case (mostly because the for loop approach isn't great for running on the GPU). Not sure if it's even a fair comparison, frankly.

On Matlab Central, one poster suggests avoiding arrayfun if you're not planning on using a GPU: https://www.mathworks.com/matlabcentral/answers/144344-in-my-code-arrayfun-slower-than-for-loop

With the exception of being on a GPU, arrayfun will most likely often be slower than a for-loop and harder to read. It's just a less flexible more complex for-loop.

Personally, I'd recommend against it at all cost unless you're targeting a GPU.

Obviously, my for loop on a GPU is much slower, and the GPU arrayfun call is roughly the same speed as your "direct" method on my machine. For more complex examples, I'd expect the GPU approach to really show its stuff.

2 of 3
2

i enjoy these types of experiments. some comments on your benchmarking methodology.

  1. i would either use matlab's built-in 'timeit' function or take an average of many iterations of the same thing. several results appear in the noise of a standard deviation.

  2. since you're benchmarking, you want to ensure matlab's not getting in the way by trying to be helpful. it does all sorts of voodoo behind the scene to anticipate and optimize execution. that is, use 'clear all' between each test.

  3. goes without saying that your benchmarking is at the mercy of the os, so the more you can do to reduce interrupts the better.

last, i don't think 'arrayfun' is intended for the usage you demonstrated. matrix operations make no sense for arrayfun. but imagine you had an array of 1s and 0s and you wanted to test whether a given 1 was surrounded by 0s. that's the kind of thing 'arrayfun' would make easier, albeit not necessarily faster.
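
A minimal sketch of the first suggestion, assuming a smaller 200-by-200 input than the original script so that it runs quickly. timeit handles the repetition for expressions that fit in a function handle. A for loop cannot be written inside an anonymous function, so here it is timed with tic/toc over several repetitions and the median is taken. The names nRep and tLoopRuns are illustrative.

```matlab
% Sketch: more robust timing of arrayfun, a preallocated loop, and the direct form.
R = rand(200);                   % smaller than the original 1000-by-1000 (assumption)
fun = @(n) n^2;

% timeit runs the handle several times and returns a typical time in seconds.
tArrayfun = timeit(@() arrayfun(fun, R));
tDirect   = timeit(@() R.^2);

% Time the explicit loop by hand and take the median over repetitions.
nRep = 20;
tLoopRuns = zeros(nRep, 1);
for rep = 1:nRep
    tStart = tic;
    B = zeros(size(R));          % preallocate the output
    for k = 1:numel(R)
        B(k) = R(k)^2;
    end
    tLoopRuns(rep) = toc(tStart);
end
tLoop = median(tLoopRuns);

fprintf('arrayfun: %.5f s, loop: %.5f s, direct: %.6f s\n', ...
        tArrayfun, tLoop, tDirect);
```

Putting the loop into its own function file and passing a handle to timeit would be the more uniform option; the hand-timed loop is used here only to keep the sketch in one script.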

🌐
MathWorks
mathworks.com › matlabcentral › answers › 72983-comparing-arrayfun-and-for-loop
Comparing arrayfun and for loop - MATLAB Answers - MATLAB Central
April 21, 2013 - Only if the purpose of avoiding the loop is speed-up. If compact syntax is the goal, that's what arrayfun is meant for. Sign in to comment. Sign in to answer this question. MATLAB Language Fundamentals Loops and Conditional Statements
🌐
MathWorks
mathworks.com › matlabcentral › answers › 873328-speeding-up-using-cellfunction-and-arrayfun-versus-for-loop
speeding up using cellfunction and arrayfun versus for-loop - MATLAB Answers - MATLAB Central
July 6, 2021 - So a comparison of cellfun, arrayfun or loops is not really smart yet. But it is expected, that loops are faster: cellfun and arrayfun are mex functions, which have to call the Matlab level for each element.
Top answer
1 of 3
3

I don't think your loops are equivalent. It seems you're squaring every element in an array with your CPU implementation, but performing some sort of count for arrayfun.

Regardless, I think the explanation you're looking for is as follows:

When run on the GPU, your code can be functionally decomposed -- into each array cell in this case -- and squared separately. This is okay because for a given i, the value of [cell_i]^2 doesn't depend on any of the other values in other cells. What most likely happens is that the array gets decomposed into S buffers, where S is the number of stream processing units your GPU has. Each unit then computes the square of the data in each cell of its buffer. The result is copied back to the original array and the result is returned to count.

Now don't worry, if you're counting things, as it seems array_fun is actually doing, a similar thing is happening. The algorithm most likely partitions the array off into similar buffers and, instead of squaring each cell, adds the values together. You can think of the result of this first step as a smaller array to which the same process can be applied recursively to count the new sums.

2 of 3
2

As per the reference page here http://www.mathworks.co.uk/help/toolbox/distcomp/arrayfun.html, "the MATLAB function passed in for evaluation is compiled for the GPU, and then executed on the GPU". In the explicit for loop version, each operation is executed separately on the GPU, and this incurs overhead - the arrayfun version is one single GPU kernel invocation.
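
A hedged sketch of the single-kernel case described here. It assumes Parallel Computing Toolbox and a supported GPU are available and falls back to a plain CPU computation otherwise, so the script still runs. The variable names are illustrative; gputimeit is the toolbox's timing helper for GPU code.

```matlab
% Sketch: element-wise arrayfun on a gpuArray, with a CPU fallback.
hasGpu = false;
try
    hasGpu = gpuDeviceCount > 0; % errors if Parallel Computing Toolbox is missing
catch
    % no toolbox available: keep hasGpu = false
end

xHost = rand(100, 100, 3);
if hasGpu
    x = gpuArray(xHost);                      % copy the data to the GPU
    z = arrayfun(@(v) v^2, x);                % element-wise function run on the GPU
    zHost = gather(z);                        % copy the result back to host memory
    tGpu = gputimeit(@() arrayfun(@(v) v^2, x));
    fprintf('GPU arrayfun: %.6f s\n', tGpu);
else
    zHost = xHost.^2;                         % CPU fallback (assumption for portability)
    fprintf('No GPU found, used the CPU instead.\n');
end
```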

🌐
Aalto
math.aalto.fi › ~apiola › matlab › opas › TUUTOR11 › html › speedup.html
Speedup tricks
In the last example, we used cellfun() function but there is a similar function arrayfun() that applies a function to every element of an array. When other vectorization techniques fail, this can be a better alternative than looping over every element yourself, but is not always much faster.
🌐
Matlab Scripts
matlabscripts.com › matlab-arrayfun
Unlocking Matlab Arrayfun: Your Key to Efficient Coding
January 6, 2025 - The core idea of `arrayfun` is enabling a functional programming style, wherein functions are treated as first-class objects. ... Code Readability: It often makes the code much more concise and expressive compared to for-loops.
🌐
MathWorks
mathworks.com › matlabcentral › answers › 718275-use-arrayfun-and-cellfun-to-avoid-for-loops
Use arrayfun and cellfun to avoid for-loops - MATLAB Answers - MATLAB Central
January 16, 2021 - Use arrayfun and cellfun to avoid for-loops. Learn more about cellfun, arrayfun, avoid loops, for loop, performance, runtime optimizing MATLAB
🌐
Narkive
comp.soft-sys.matlab.narkive.com › 8aCSCjbM › is-arrayfun-ever-faster-than-looping
Is arrayfun ever faster than looping?
Should I change all my arrayfun operations to loops? Yes, it is normal. Change if speed is your priority. See similar topic on CELLFUN: http://www.mathworks.com/matlabcentral/newsreader/view_thread/253815 Arrayfun and cellfun essentially are for-loops with extra overhead.
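
As a rough illustration of that last remark, the sketch below writes out the loop that an arrayfun call with uniform output corresponds to. This is an equivalence of results only, not a description of how arrayfun is implemented internally. The input size and names are illustrative assumptions.

```matlab
% Sketch: an arrayfun call and an explicit loop that produces the same result.
R = rand(5, 4);
fun = @(n) n^2;

A = arrayfun(fun, R);            % one call per element, result shaped like R

B = zeros(size(R));              % preallocate the output
for k = 1:numel(R)               % linear indexing visits elements column by column
    B(k) = fun(R(k));
end

fprintf('results identical: %d\n', isequal(A, B));
```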