Answer from Translunar on Stack Overflow
Top answer
1 of 6
43

NumPy uses a highly-optimized, carefully-tuned BLAS method for matrix multiplication (see also: ATLAS). The specific function in this case is GEMM (for general matrix multiplication). You can look up the original by searching for dgemm.f (it's in Netlib).

The optimization, by the way, goes beyond compiler optimizations. Above, Philip mentioned Coppersmith–Winograd. If I remember correctly, this is the algorithm which is used for most cases of matrix multiplication in ATLAS (though a commenter notes it could be Strassen's algorithm).

In other words, your matmult algorithm is the trivial implementation. There are faster ways to do the same thing.
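
For concreteness, the "trivial implementation" in question is the textbook triple loop; a sketch of its general shape (not the asker's exact code):

/* Textbook O(n*m*p) matrix multiplication: C[n x p] = A[n x m] * B[m x p]. */
void matmult_naive(const double *A, const double *B, double *C,
                   int n, int m, int p)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < p; j++) {
            double sum = 0.0;
            for (int k = 0; k < m; k++) {
                sum += A[i * m + k] * B[k * p + j];  /* offsets recomputed every step */
            }
            C[i * p + j] = sum;
        }
    }
}

A tuned GEMM does the same arithmetic but blocks, vectorizes, and reorders it for the memory hierarchy.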

2 of 6
31

I'm not too familiar with Numpy, but the source is on Github. Part of the dot-product code is implemented in https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/arraytypes.c.src, which I assume is translated into a specific C implementation for each datatype. For example:

/**begin repeat
 *
 * #name = BYTE, UBYTE, SHORT, USHORT, INT, UINT,
 * LONG, ULONG, LONGLONG, ULONGLONG,
 * FLOAT, DOUBLE, LONGDOUBLE,
 * DATETIME, TIMEDELTA#
 * #type = npy_byte, npy_ubyte, npy_short, npy_ushort, npy_int, npy_uint,
 * npy_long, npy_ulong, npy_longlong, npy_ulonglong,
 * npy_float, npy_double, npy_longdouble,
 * npy_datetime, npy_timedelta#
 * #out = npy_long, npy_ulong, npy_long, npy_ulong, npy_long, npy_ulong,
 * npy_long, npy_ulong, npy_longlong, npy_ulonglong,
 * npy_float, npy_double, npy_longdouble,
 * npy_datetime, npy_timedelta#
 */
static void
@name@_dot(char *ip1, npy_intp is1, char *ip2, npy_intp is2, char *op, npy_intp n,
           void *NPY_UNUSED(ignore))
{
    @out@ tmp = (@out@)0;
    npy_intp i;

    for (i = 0; i < n; i++, ip1 += is1, ip2 += is2) {
        tmp += (@out@)(*((@type@ *)ip1)) *
               (@out@)(*((@type@ *)ip2));
    }
    *((@type@ *)op) = (@type@) tmp;
}
/**end repeat**/

This appears to compute one-dimensional dot products, i.e. on vectors. In my few minutes of Github browsing I was unable to find the source for matrices, but it's possible that it uses one call to FLOAT_dot for each element in the result matrix. That means the loop in this function corresponds to your inner-most loop.

One difference between them is that the "stride" -- the difference between successive elements in the inputs -- is explicitly computed once before calling the function. In your case the stride is not precomputed: the offset of each input is recomputed on every iteration, e.g. a[i * n + k]. I would have expected a good compiler to optimise that away to something similar to the Numpy stride, but perhaps it can't prove that the step is constant (or it's not being optimised).
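
A hedged sketch (mine, not NumPy's actual matrix code) of how a full product could be assembled from one call to the 1-D kernel per output element, with both strides computed once up front exactly as described:

#include <stddef.h>

/* The FLOAT expansion of the template above, written out concretely. */
static void
FLOAT_dot(char *ip1, ptrdiff_t is1, char *ip2, ptrdiff_t is2, char *op,
          ptrdiff_t n, void *ignore)
{
    float tmp = 0.0f;
    for (ptrdiff_t i = 0; i < n; i++, ip1 += is1, ip2 += is2) {
        tmp += *(float *)ip1 * *(float *)ip2;
    }
    (void)ignore;
    *(float *)op = tmp;
}

/* Hypothetical caller: C[n x p] = A[n x m] * B[m x p], row-major.
 * Along a row of A, consecutive elements are adjacent (stride = sizeof(float));
 * down a column of B, each step skips a whole row (stride = p * sizeof(float)). */
void matmul_via_dot(float *A, float *B, float *C,
                    ptrdiff_t n, ptrdiff_t m, ptrdiff_t p)
{
    ptrdiff_t is1 = sizeof(float);        /* computed once, not per element */
    ptrdiff_t is2 = p * sizeof(float);
    for (ptrdiff_t i = 0; i < n; i++) {
        for (ptrdiff_t j = 0; j < p; j++) {
            FLOAT_dot((char *)(A + i * m), is1,
                      (char *)(B + j), is2,
                      (char *)(C + i * p + j), m, NULL);
        }
    }
}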

Numpy may also be doing something smart with cache effects in the higher-level code that calls this function. A common trick is to think about whether each row is contiguous, or each column, and to try to iterate over the contiguous parts first. It seems difficult to be perfectly optimal: for each dot product, one input matrix must be traversed by rows and the other by columns (unless they happen to be stored in different major order). But it can at least do that for the result elements.
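
The classic form of that trick (a generic illustration, not NumPy's code) is reordering the three loops so that the innermost accesses are contiguous for both B and C:

/* i-k-j loop order: in the naive i-j-k order, B is walked down a column
 * (one cache miss per element for row-major storage).  Swapping the two
 * inner loops makes every innermost access contiguous.
 * Assumes C is zero-initialized. */
void matmul_ikj(const double *A, const double *B, double *C, int n)
{
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < n; k++) {
            double a = A[i * n + k];              /* loaded once, reused n times */
            for (int j = 0; j < n; j++) {
                C[i * n + j] += a * B[k * n + j]; /* contiguous in both B and C */
            }
        }
    }
}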

Numpy also contains code to choose the implementation of certain operations, including "dot", from different basic implementations. For instance, it can use a BLAS library. From the discussion above it sounds like CBLAS is used. This was translated from Fortran into C. I think the implementation used in your test would be the one found here: http://www.netlib.org/clapack/cblas/sdot.c.

Note that this program was written by a machine for another machine to read. But you can see at the bottom that it's using an unrolled loop to process 5 elements at a time:

for (i = mp1; i <= *n; i += 5) {
    stemp = stemp + SX(i) * SY(i) + SX(i + 1) * SY(i + 1) + SX(i + 2) *
            SY(i + 2) + SX(i + 3) * SY(i + 3) + SX(i + 4) * SY(i + 4);
}

This unrolling factor is likely to have been picked after profiling several choices. But one theoretical advantage of it is that more arithmetical operations are performed between each branch point, so the compiler and CPU have more freedom in scheduling them to get as much instruction pipelining as possible.
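
For completeness, the surrounding structure, paraphrased into 0-based C (not the verbatim sdot.c): a scalar loop first takes care of the n mod 5 leftover elements, and only then does the unrolled loop run, five elements at a time:

/* Paraphrase of the reference-BLAS sdot pattern. */
float sdot_unrolled(const float *sx, const float *sy, int n)
{
    float stemp = 0.0f;
    int m = n % 5;
    for (int i = 0; i < m; i++)        /* cleanup loop for the remainder */
        stemp += sx[i] * sy[i];
    for (int i = m; i < n; i += 5)     /* main loop, unrolled by 5 */
        stemp += sx[i] * sy[i] + sx[i + 1] * sy[i + 1] + sx[i + 2] * sy[i + 2]
               + sx[i + 3] * sy[i + 3] + sx[i + 4] * sy[i + 4];
    return stemp;
}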

🌐
Reddit
reddit.com › r/localllama › beating numpy's matrix multiplication in 150 lines of c code
r/LocalLLaMA on Reddit: Beating NumPy's matrix multiplication in 150 lines of C code
July 1, 2024 -

TL;DR This blog post is the result of my attempt to implement high-performance matrix multiplication on CPU while keeping the code simple, portable and scalable. The implementation follows the BLIS design, works for arbitrary matrix sizes, and, when fine-tuned for an AMD Ryzen 7700 (8 cores), outperforms NumPy (=OpenBLAS), achieving over 1 TFLOPS of peak performance across a wide range of matrix sizes.

By efficiently parallelizing the code with just 3 lines of OpenMP directives, it’s both scalable and easy to understand. Throughout this tutorial, we'll implement matrix multiplication from scratch, learning how to optimize and parallelize C code using matrix multiplication as an example. This is my first time writing a blog post. If you enjoy it, please subscribe and share it! I would be happy to hear feedback from all of you.

This is the first part of my planned two-part blog series. In the second part, we will learn how to optimize matrix multiplication on GPUs. Stay tuned!

Tutorial: https://salykova.github.io/matmul-cpu
Github repo: matmul.c
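
The "3 lines of OpenMP" claim maps onto a standard pattern. This is not the blog's actual code, just a sketch of the shape it describes (compile with -fopenmp), where a single directive over the outer tile loops spreads independent output blocks across cores:

enum { BLOCK = 64 };  /* tile size; would be tuned per CPU cache */

/* Cache-blocked multiply, C[n x n] += A[n x n] * B[n x n], C zero-initialized.
 * Each (ii, jj) output tile is owned by one thread, so there are no races. */
void matmul_blocked_omp(const double *A, const double *B, double *C, int n)
{
    #pragma omp parallel for collapse(2)
    for (int ii = 0; ii < n; ii += BLOCK)
        for (int jj = 0; jj < n; jj += BLOCK)
            for (int kk = 0; kk < n; kk += BLOCK)
                for (int i = ii; i < ii + BLOCK && i < n; i++)
                    for (int k = kk; k < kk + BLOCK && k < n; k++)
                        for (int j = jj; j < jj + BLOCK && j < n; j++)
                            C[i * n + j] += A[i * n + k] * B[k * n + j];
}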

Discussions

Complexity of matrix inversion in numpy - Computational Science Stack Exchange
You should probably note that, ...ob/master/numpy/linalg/umath_linalg.c.src) the inv routine attempts to call the dgetrf function from your system LAPACK package, which then performs an LU decomposition of your original matrix. This is morally equivalent to Gaussian elimination, but can be tuned to a slightly lower complexity by using faster matrix multiplication algorithms ... More on scicomp.stackexchange.com
🌐 scicomp.stackexchange.com
February 11, 2016
Matrix multiplication algorithm time complexity - Stack Overflow
I came up with this algorithm for matrix multiplication. I read somewhere that matrix multiplication has a time complexity of O(n^2). I think my algorithm will give O(n^3). I don't know how to calc... More on stackoverflow.com
🌐 stackoverflow.com
python - What is the time complexity of numpy.linalg.det? - Stack Overflow
For more information about the ... Thus, in practice, the complexity of the matrix multiplication is between O(n^2.81) and O(n^3) regarding the target BLAS implementation (which is dependent of your platform and your configuration of Numpy).... More on stackoverflow.com
🌐 stackoverflow.com
algorithms - What is the complexity of multiplying a matrix by a scalar? - Computer Science Stack Exchange
I would like to know the complexity of multiplying a matrix of $n\times m$ size by a scalar $\alpha$? In fact, I have a directed graph $G=(V,E)$ represented by an incidence matrix $M$. I would lik... More on cs.stackexchange.com
🌐 cs.stackexchange.com
April 5, 2015
🌐
Benjaminjohnston
benjaminjohnston.com.au › matmul
Benjamin Johnston - Faster Matrix Multiplications in Numpy
Matrix multiplications in NumPy are reasonably fast without the need for optimization. However, if every second counts, it is possible to significantly improve performance (even without a GPU). Below are a collection of small tricks that can help with large (~4000x4000) matrix multiplications.

🌐
Scipy
proceedings.scipy.org › articles › gerudo-f2bc6f59-002
A Modified Strassen Algorithm to Accelerate Numpy Large Matrix Multiplication with Integer Entries - SciPy Proceedings
June 1, 2023 - The algorithm uses Strassen’s algorithm for several divide and conquer steps before crossing over to a regular Numpy matrix multiplication. For large matrices the method is 8 to 30 times faster than calling Numpy.matmul or Numpy.dot to multiply the matrices directly.
🌐
Wikipedia
en.wikipedia.org › wiki › Computational_complexity_of_matrix_multiplication
Computational complexity of matrix multiplication - Wikipedia
January 12, 2026 - This means that, treating the input n×n matrices as block 2 × 2 matrices, the task of multiplying two n×n matrices can be reduced to seven subproblems of multiplying two n/2×n/2 matrices. Applying this recursively gives an algorithm needing ... Unlike algorithms with faster asymptotic complexity, Strassen's algorithm is used in practice. The numerical stability is reduced compared to the naive algorithm, but it is faster in cases where n > 100 or so and appears in several libraries, such as BLAS. Fast matrix multiplication algorithms cannot achieve component-wise stability, but some can be shown to exhibit norm-wise stability.
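
For reference, the recursion described in that snippet is exactly what yields Strassen's exponent: seven half-size subproblems plus quadratic combination work,

$$T(n) = 7\,T(n/2) + O(n^2) \quad\Longrightarrow\quad T(n) = O\big(n^{\log_2 7}\big) \approx O(n^{2.807}).$$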
Top answer
1 of 2
25

(This is getting too long for comments...)

I'll assume you actually need to compute an inverse in your algorithm.1 First, it is important to note that these alternative algorithms are not actually claimed to be faster, just that they have better asymptotic complexity (meaning the required number of elementary operations grows more slowly). In fact, in practice these are actually (much) slower than the standard approach (for a given $n$), for the following reasons:

  1. The $O$-notation hides a constant in front of the power of $n$, which can be astronomically large -- so large that $c\,n^3$ can be much smaller than $C\,n^{2.376}$ for any $n$ that can be handled by any computer in the foreseeable future. (This is the case for the Coppersmith–Winograd algorithm, for example.)

  2. The complexity assumes that every (arithmetical) operation takes the same time -- but this is far from true in actual practice: multiplying a bunch of numbers by the same number is much faster than multiplying the same number of different numbers. This is due to the fact that the major bottleneck in current computing is getting the data into cache, not the actual arithmetical operations on that data. So an algorithm which can be rearranged to have the first situation (called cache-aware) will be much faster than one where this is not possible. (This is the case for the Strassen algorithm, for example.)

Also, numerical stability is at least as important as performance; and here, again, the standard approach usually wins.

For this reason, the standard high-performance libraries (BLAS/LAPACK, which Numpy calls when you ask it to compute an inverse) usually only implement this approach. Of course, there are Numpy implementations of, e.g., Strassen's algorithm out there, but an algorithm hand-tuned at assembly level will soundly beat an algorithm written in a high-level language for any reasonable matrix size.


1 But I'd be remiss if I didn't point out that this is very rarely really necessary: anytime you need to compute a product $A^{-1}b$, you should instead solve the linear system $Ax=b$ (e.g., using numpy.linalg.solve) and use $x$ instead -- this is much more stable, and can be done (depending on the structure of the matrix $A$) much faster. If you need to use $A^{-1}$ multiple times, you can precompute a factorization of $A$ (which is usually the most expensive part of the solve) and reuse that later.
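
To put rough numbers on that footnote (standard flop counts, not taken from the answer): for a dense $n\times n$ system,

$$\text{solve } Ax=b \text{ via LU}:\ \tfrac{2}{3}n^3 + O(n^2) \text{ flops}, \qquad \text{form } A^{-1} \text{ and multiply}:\ \approx 2n^3 + O(n^2) \text{ flops},$$

so the explicit inverse does roughly three times the work, before even counting its stability disadvantages.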

2 of 2
7

You should probably note that, buried deep inside the numpy source code (see https://github.com/numpy/numpy/blob/master/numpy/linalg/umath_linalg.c.src), the inv routine attempts to call the dgetrf function from your system LAPACK package, which then performs an LU decomposition of your original matrix. This is morally equivalent to Gaussian elimination, but can be tuned to a slightly lower complexity by using faster matrix multiplication algorithms in a high-performance BLAS.
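
The answer mentions dgetrf; completing an inversion additionally applies dgetri to the LU factors. A minimal sketch of that route through the LAPACKE C interface (an illustration assuming LAPACKE is installed, not NumPy's actual code path):

#include <lapacke.h>
#include <stdio.h>

int main(void) {
    double a[4] = { 4.0, 3.0,     /* 2x2 row-major matrix */
                    6.0, 3.0 };
    lapack_int ipiv[2];

    /* LU decomposition with partial pivoting: A = P*L*U */
    lapack_int info = LAPACKE_dgetrf(LAPACK_ROW_MAJOR, 2, 2, a, 2, ipiv);
    if (info != 0) { fprintf(stderr, "dgetrf failed: %d\n", (int)info); return 1; }

    /* Invert A in place from its LU factors */
    info = LAPACKE_dgetri(LAPACK_ROW_MAJOR, 2, a, 2, ipiv);
    if (info != 0) { fprintf(stderr, "dgetri failed: %d\n", (int)info); return 1; }

    printf("inv(A) = [[%g, %g], [%g, %g]]\n", a[0], a[1], a[2], a[3]);
    return 0;
}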

If you follow this route, be warned that forcing the entire library chain to use the new library rather than the system one that came with your distribution is fairly complex. One alternative on modern computer systems is to look at parallelized methods using packages like scaLAPACK or (in the Python world) petsc4py. However, these are typically happier being used as iterative solvers for linear algebra systems than applied to direct methods, and PETSc in particular targets sparse systems more than dense ones.

🌐
Towards Data Science
towardsdatascience.com › home › latest › understanding deepmind matrix multiplication
Understanding DeepMind matrix multiplication | Towards Data Science
March 5, 2025 - This algorithm is applied to block ... we saw that for square matrices of size 4096 the standard numpy matmult takes about 454.37 +/- 6.27 s, while Strassen takes 31.57 +/- 1.01, which is a difference of about one order of ...
🌐
Verve AI
vervecopilot.com › interview-questions › why-is-understanding-matrix-multiplication-python-crucial-for-your-interview-success
Why Is Understanding Matrix Multiplication Python Crucial For Your Interview Success
September 11, 2025 - This manual approach to matrix multiplication python has a time complexity of $O(n^3)$ for square matrices of size $n \times n$, which is a common discussion point for algorithmic efficiency in interviews [^3]. While manual implementation is vital for understanding, Python’s scientific computing libraries, especially NumPy...
🌐
Wikipedia
en.wikipedia.org › wiki › Matrix_multiplication_algorithm
Matrix multiplication algorithm - Wikipedia
1 month ago - As of September 2025, the best bound on the asymptotic complexity of a matrix multiplication algorithm is O(n^2.371339) time, given by Alman, Duan, Williams, Xu, Xu, and Zhou.
🌐
GeeksforGeeks
geeksforgeeks.org › matrix-multiplication-in-numpy
Matrix Multiplication in NumPy - GeeksforGeeks
3 min read Multiplication of two Matrices in Single line using Numpy in Python · Matrix multiplication is an operation that takes two matrices as input and produces a single matrix by multiplying the rows of the first matrix with the columns of the ...
Published September 2, 2020
🌐
Cloudploys
survey.cloudploys.com › dogs-for-oalg0 › dob---numpy-matrix-multiplication-time-complexity---7xsli.html
Numpy matrix multiplication time complexity
Furthermore, our NumPy solution involves both Python-stack recursions and the allocation of many temporary arrays, which adds significant computation time. Nov 04, 2019 · Matrix multiplication: If you are multiplying two matrices, (n, p) and (p, m) then the general complexity of this is O(nmp), ...
🌐
Siboehm
siboehm.com › articles › 22 › Fast-MMM-on-CPU
Fast Multidimensional Matrix Multiplication on CPU from Scratch
August 14, 2022 - This is incredibly fast, considering this boils down to 18 FLOPs / core / cycle, with a cycle taking a third of a nanosecond. Numpy does this using a highly optimized BLAS implementation. BLAS is short for Basic Linear Algebra Subprograms.
🌐
Ashishmalik
ashishmalik.in › post › faster_multiplication_of_matrix
Optimizing Matrix Multiplication | - Ashish Malik
Time Complexity: $O(n^3)$ (but highly optimized with BLAS/LAPACK) Description: NumPy is the de facto standard for numerical computing in Python. Its matmul (or @ operator) function is a wrapper around highly optimized C and Fortran libraries ...
🌐
GeeksforGeeks
geeksforgeeks.org › python › multiply-matrices-of-complex-numbers-using-numpy-in-python
Multiply Matrices of Complex Numbers using NumPy in Python - GeeksforGeeks
September 30, 2025 - A Complex Number is any number ... of two complex numbers can be done using the formula below: $(a+ib) \times (x+iy) = ax + i^2by + i(bx+ay) = (ax-by) + i(bx+ay)$. NumPy provides the vdot() method that ...
🌐
Baeldung
baeldung.com › home › core concepts › math and logic › matrix multiplication algorithm time complexity
Matrix Multiplication Algorithm Time Complexity | Baeldung on Computer Science
March 18, 2024 - The naive matrix multiplication algorithm contains three nested loops. For each iteration of the outer loop, the total number of runs in the inner loops would be equivalent to the length of the matrix. Here, integer operations take O(1) time. In general, if the length of the matrix is n, the total time complexity would be O(n^3).
🌐
Codegive
codegive.com › blog › numpy_matrix_multiplication_algorithm.php
Numpy matrix multiplication algorithm
The efficiency of the NumPy matrix multiplication algorithm is critical for several reasons: Performance: Manual matrix multiplication in Python using nested loops is notoriously slow, with a time complexity of O(mnp).
Top answer
1 of 1
3

TL;DR: it is between O(n^2.81) and O(n^3), depending on the target BLAS implementation.

Indeed, Numpy uses an LU decomposition (in log space). The actual implementation can be found here. It indeed uses the sgetrf/dgetrf primitive of LAPACK. Multiple libraries provide such primitives; the most famous is the NetLib one, though it is not the fastest. The Intel MKL is an example of a library providing a fast implementation. Fast LU decomposition algorithms use tiling so that most of the work is done internally by matrix multiplication. They do that because matrix multiplication is one of the most heavily optimized routines in linear algebra libraries (for example, the MKL, BLIS, and OpenBLAS generally manage to reach nearly optimal performance on modern processors). More generally, the complexity of the LU decomposition is the same as that of matrix multiplication.
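
To make the "in log space" remark concrete: once dgetrf has computed $PA = LU$ (with $L$ unit lower triangular and $s$ row swaps), the determinant falls out of the diagonal of $U$:

$$\det A = (-1)^s \prod_i u_{ii}, \qquad \log\lvert\det A\rvert = \sum_i \log\lvert u_{ii}\rvert.$$

Summing logarithms instead of multiplying the $u_{ii}$ directly is what numpy.linalg.slogdet does to avoid overflow and underflow for large matrices.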

The complexity of the naive square matrix multiplication is O(n^3). Faster algorithms exist, like Strassen's (running in ~O(n^2.81) time), which is often used for big matrices. The Coppersmith–Winograd algorithm achieves a significantly better complexity (~O(n^2.38)), but no linear algebra library actually uses it since it is a galactic algorithm: put shortly, such an algorithm is theoretically asymptotically better than the others, but the hidden constant makes it impractical for any real-world use. For more information about the complexity of matrix multiplication, please read this article. Thus, in practice, the complexity of the matrix multiplication is between O(n^2.81) and O(n^3), depending on the target BLAS implementation (which in turn depends on your platform and your configuration of Numpy).

Top answer
1 of 2
2

In general, if you have an $n\times m$ matrix $A=(a_{i,j})$ with $1\le i\le n$ and $1\le j\le m$ then there will be $nm$ entries in the array. To multiply $A$ by a scalar $c$ you multiply each element by c, which (assuming multiplication can be done in constant time) will take $nm$ multiplications. That's not the way to go, though, if $A$ is the adjacency matrix of a directed graph.

For your problem, if $A$ is the adjacency matrix of a digraph $G$ on $n$ vertices, then $A$ will be an $n\times n$ matrix $(a_{i, j})$ with $a_{i,j}=1$ when there is a directed edge $e=(i, j)$ from vertex $i$ to vertex $j$ and $a_{i,j}=0$ otherwise. You correctly note that the reversed graph $G'$ will have an adjacency matrix $B=(b_{i,j})$ which is the transpose of $A$, so we'll have $b_{i, j} = a_{j, i}$, since $G'$ will have an edge from vertex $v_i$ to vertex $v_j$ if and only if $G$ has an edge from $v_j$ to $v_i$. To get the transpose of $A$, then, we'll swap the $a_{i, j}\leftrightarrow a_{j, i}$ entries of $A$. There will be $n(n-1)/2$ swaps needed, since we swap every element above the main diagonal with its corresponding entry below the main diagonal.

Note that multiplying $A$ by $-1$ won't work, since the entries of an adjacency matrix must be only $1$ or $0$.
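
A minimal sketch (mine, not the answer's) of the swap-based in-place transpose just described, performing exactly $n(n-1)/2$ swaps:

#include <stddef.h>

/* In-place transpose of a square row-major matrix: swap each entry above
 * the main diagonal with its mirror below it. */
void transpose_inplace(int *a, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {  /* strictly above the diagonal */
            int tmp = a[i * n + j];
            a[i * n + j] = a[j * n + i];
            a[j * n + i] = tmp;
        }
    }
}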

2 of 2
1

Of course, as previously noted by other users, you cannot transpose a matrix by a simple multiplication by a scalar. Transposition is an operation of its own.

The complexity of matrix operations depends heavily on the way the matrices are actually represented in your algorithm. While the most obvious representations have a cost of $O(nm)$ (or $O(n^2)$ for a square matrix) for scalar multiplication or for transposition, other representations can be chosen, depending on the algorithm they are used in, with a lower cost.

The complexity of multiplying a matrix by a scalar $\alpha$ in the usual way does imply multiplying each of its $n\times m$ elements by $\alpha$, and hence has a cost $O(nm)$. However, in general, the cost depends on the way you actually represent matrices, and it is conceivable to use representations where the cost might be constant, if the matrix is represented up to scalar multiplication together with a scalar factor. The computational cost can also be lower in the case of a sparse matrix representation that allows iterating only on the useful (or non-zero) elements.

Without going into details, matrix transposition can also be done in constant time, i.e., complexity $O(1)$, if you choose to represent a matrix as a pair composed of an appropriate memory structure of elements, and an indexation function for accessing them, given the row and column number. Then transposition can be achieved in constant time simply by changing the indexation function.

Note: this is a matter of representation, not of model of computation (the expression "computational model" being ambiguous). But I am not sure what is intended by Tom van der Zanden's comment.
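
To illustrate the representation idea (a hypothetical sketch; the names are mine): store the elements once, and fold both a transposition flag and a pending scalar factor into the indexation function, making both operations $O(1)$:

#include <stddef.h>

typedef struct {
    const double *data;   /* row-major storage; never moved or copied */
    size_t rows, cols;    /* physical dimensions of data */
    int transposed;       /* selects which indexation function applies */
    double scale;         /* pending scalar factor */
} Mat;

/* The indexation function: every element access goes through here. */
double mat_get(const Mat *m, size_t i, size_t j)
{
    return m->scale * (m->transposed ? m->data[j * m->cols + i]
                                     : m->data[i * m->cols + j]);
}

void mat_transpose(Mat *m) { m->transposed = !m->transposed; }  /* O(1) */
void mat_scale(Mat *m, double a) { m->scale *= a; }             /* O(1) */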

🌐
Studytonight
studytonight.com › numpy › numpy-matrix-multiplication
NumPy Matrix Multiplication - Studytonight
Matrix multiplication in NumPy relies on what is commonly known as vectorization. The main goal of vectorization is to reduce the use of for loops for carrying out such operations; when explicit for loops are skipped, the overall execution time of the code is reduced.