Answer from Translunar on Stack Overflow
Top answer
1 of 6
43

NumPy uses a highly-optimized, carefully-tuned BLAS method for matrix multiplication (see also: ATLAS). The specific function in this case is GEMM (for general matrix multiplication). You can look up the original by searching for dgemm.f (it's in Netlib).

The optimization, by the way, goes beyond compiler optimizations. Above, Philip mentioned Coppersmith–Winograd. If I remember correctly, this is the algorithm which is used for most cases of matrix multiplication in ATLAS (though a commenter notes it could be Strassen's algorithm).

In other words, your matmult algorithm is the trivial implementation. There are faster ways to do the same thing.
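
For concreteness, the "trivial implementation" in question is the textbook triple loop; a sketch of its general shape (not the asker's exact code):

/* Textbook O(n*m*p) matrix multiplication: C[n x p] = A[n x m] * B[m x p]. */
void matmult_naive(const double *A, const double *B, double *C,
                   int n, int m, int p)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < p; j++) {
            double sum = 0.0;
            for (int k = 0; k < m; k++) {
                sum += A[i * m + k] * B[k * p + j];  /* offsets recomputed every step */
            }
            C[i * p + j] = sum;
        }
    }
}

A tuned GEMM does the same arithmetic but blocks, vectorizes, and reorders it for the memory hierarchy.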

2 of 6
31

I'm not too familiar with Numpy, but the source is on Github. Part of the dot-product code is implemented in https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/arraytypes.c.src, which I assume is translated into a specific C implementation for each datatype. For example:

/**begin repeat
 *
 * #name = BYTE, UBYTE, SHORT, USHORT, INT, UINT,
 * LONG, ULONG, LONGLONG, ULONGLONG,
 * FLOAT, DOUBLE, LONGDOUBLE,
 * DATETIME, TIMEDELTA#
 * #type = npy_byte, npy_ubyte, npy_short, npy_ushort, npy_int, npy_uint,
 * npy_long, npy_ulong, npy_longlong, npy_ulonglong,
 * npy_float, npy_double, npy_longdouble,
 * npy_datetime, npy_timedelta#
 * #out = npy_long, npy_ulong, npy_long, npy_ulong, npy_long, npy_ulong,
 * npy_long, npy_ulong, npy_longlong, npy_ulonglong,
 * npy_float, npy_double, npy_longdouble,
 * npy_datetime, npy_timedelta#
 */
static void
@name@_dot(char *ip1, npy_intp is1, char *ip2, npy_intp is2, char *op, npy_intp n,
           void *NPY_UNUSED(ignore))
{
    @out@ tmp = (@out@)0;
    npy_intp i;

    for (i = 0; i < n; i++, ip1 += is1, ip2 += is2) {
        tmp += (@out@)(*((@type@ *)ip1)) *
               (@out@)(*((@type@ *)ip2));
    }
    *((@type@ *)op) = (@type@) tmp;
}
/**end repeat**/

This appears to compute one-dimensional dot products, i.e. on vectors. In my few minutes of Github browsing I was unable to find the source for matrices, but it's possible that it uses one call to FLOAT_dot for each element in the result matrix. That means the loop in this function corresponds to your inner-most loop.

One difference between them is that the "stride" -- the difference between successive elements in the inputs -- is explicitly computed once before calling the function. In your case the stride is not precomputed: the offset of each input is recomputed on every iteration, e.g. a[i * n + k]. I would have expected a good compiler to optimise that away to something similar to the Numpy stride, but perhaps it can't prove that the step is constant (or it's not being optimised).
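
A hedged sketch (mine, not NumPy's actual matrix code) of how a full product could be assembled from one call to the 1-D kernel per output element, with both strides computed once up front exactly as described:

#include <stddef.h>

/* The FLOAT expansion of the template above, written out concretely. */
static void
FLOAT_dot(char *ip1, ptrdiff_t is1, char *ip2, ptrdiff_t is2, char *op,
          ptrdiff_t n, void *ignore)
{
    float tmp = 0.0f;
    for (ptrdiff_t i = 0; i < n; i++, ip1 += is1, ip2 += is2) {
        tmp += *(float *)ip1 * *(float *)ip2;
    }
    (void)ignore;
    *(float *)op = tmp;
}

/* Hypothetical caller: C[n x p] = A[n x m] * B[m x p], row-major.
 * Along a row of A, consecutive elements are adjacent (stride = sizeof(float));
 * down a column of B, each step skips a whole row (stride = p * sizeof(float)). */
void matmul_via_dot(float *A, float *B, float *C,
                    ptrdiff_t n, ptrdiff_t m, ptrdiff_t p)
{
    ptrdiff_t is1 = sizeof(float);        /* computed once, not per element */
    ptrdiff_t is2 = p * sizeof(float);
    for (ptrdiff_t i = 0; i < n; i++) {
        for (ptrdiff_t j = 0; j < p; j++) {
            FLOAT_dot((char *)(A + i * m), is1,
                      (char *)(B + j), is2,
                      (char *)(C + i * p + j), m, NULL);
        }
    }
}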

Numpy may also be doing something smart with cache effects in the higher-level code that calls this function. A common trick is to think about whether each row is contiguous, or each column, and to try to iterate over the contiguous parts first. It seems difficult to be perfectly optimal: for each dot product, one input matrix must be traversed by rows and the other by columns (unless they happen to be stored in different major order). But it can at least do that for the result elements.
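
The classic form of that trick (a generic illustration, not NumPy's code) is reordering the three loops so that the innermost accesses are contiguous for both B and C:

/* i-k-j loop order: in the naive i-j-k order, B is walked down a column
 * (one cache miss per element for row-major storage).  Swapping the two
 * inner loops makes every innermost access contiguous.
 * Assumes C is zero-initialized. */
void matmul_ikj(const double *A, const double *B, double *C, int n)
{
    for (int i = 0; i < n; i++) {
        for (int k = 0; k < n; k++) {
            double a = A[i * n + k];              /* loaded once, reused n times */
            for (int j = 0; j < n; j++) {
                C[i * n + j] += a * B[k * n + j]; /* contiguous in both B and C */
            }
        }
    }
}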

Numpy also contains code to choose the implementation of certain operations, including "dot", from different basic implementations. For instance, it can use a BLAS library. From the discussion above it sounds like CBLAS is used. This was translated from Fortran into C. I think the implementation used in your test would be the one found here: http://www.netlib.org/clapack/cblas/sdot.c.

Note that this program was written by a machine for another machine to read. But you can see at the bottom that it's using an unrolled loop to process 5 elements at a time:

for (i = mp1; i <= *n; i += 5) {
    stemp = stemp + SX(i) * SY(i) + SX(i + 1) * SY(i + 1) + SX(i + 2) *
            SY(i + 2) + SX(i + 3) * SY(i + 3) + SX(i + 4) * SY(i + 4);
}

This unrolling factor is likely to have been picked after profiling several choices. But one theoretical advantage of it is that more arithmetical operations are performed between each branch point, so the compiler and CPU have more freedom in scheduling them to get as much instruction pipelining as possible.
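
For completeness, the surrounding structure, paraphrased into 0-based C (not the verbatim sdot.c): a scalar loop first takes care of the n mod 5 leftover elements, and only then does the unrolled loop run, five elements at a time:

/* Paraphrase of the reference-BLAS sdot pattern. */
float sdot_unrolled(const float *sx, const float *sy, int n)
{
    float stemp = 0.0f;
    int m = n % 5;
    for (int i = 0; i < m; i++)        /* cleanup loop for the remainder */
        stemp += sx[i] * sy[i];
    for (int i = m; i < n; i += 5)     /* main loop, unrolled by 5 */
        stemp += sx[i] * sy[i] + sx[i + 1] * sy[i + 1] + sx[i + 2] * sy[i + 2]
               + sx[i + 3] * sy[i + 3] + sx[i + 4] * sy[i + 4];
    return stemp;
}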

🌐
Reddit
reddit.com › r/localllama › beating numpy's matrix multiplication in 150 lines of c code
r/LocalLLaMA on Reddit: Beating NumPy's matrix multiplication in 150 lines of C code
July 1, 2024 -

TL;DR This blog post is the result of my attempt to implement high-performance matrix multiplication on CPU while keeping the code simple, portable and scalable. The implementation follows the BLIS design, works for arbitrary matrix sizes, and, when fine-tuned for an AMD Ryzen 7700 (8 cores), outperforms NumPy (=OpenBLAS), achieving over 1 TFLOPS of peak performance across a wide range of matrix sizes.

By efficiently parallelizing the code with just 3 lines of OpenMP directives, it’s both scalable and easy to understand. Throughout this tutorial, we'll implement matrix multiplication from scratch, learning how to optimize and parallelize C code using matrix multiplication as an example. This is my first time writing a blog post. If you enjoy it, please subscribe and share it! I would be happy to hear feedback from all of you.

This is the first part of my planned two-part blog series. In the second part, we will learn how to optimize matrix multiplication on GPUs. Stay tuned!

Tutorial: https://salykova.github.io/matmul-cpu
Github repo: matmul.c
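
The "3 lines of OpenMP" claim maps onto a standard pattern. This is not the blog's actual code, just a sketch of the shape it describes (compile with -fopenmp), where a single directive over the outer tile loops spreads independent output blocks across cores:

enum { BLOCK = 64 };  /* tile size; would be tuned per CPU cache */

/* Cache-blocked multiply, C[n x n] += A[n x n] * B[n x n], C zero-initialized.
 * Each (ii, jj) output tile is owned by one thread, so there are no races. */
void matmul_blocked_omp(const double *A, const double *B, double *C, int n)
{
    #pragma omp parallel for collapse(2)
    for (int ii = 0; ii < n; ii += BLOCK)
        for (int jj = 0; jj < n; jj += BLOCK)
            for (int kk = 0; kk < n; kk += BLOCK)
                for (int i = ii; i < ii + BLOCK && i < n; i++)
                    for (int k = kk; k < kk + BLOCK && k < n; k++)
                        for (int j = jj; j < jj + BLOCK && j < n; j++)
                            C[i * n + j] += A[i * n + k] * B[k * n + j];
}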

Discussions

Complexity of matrix inversion in numpy - Computational Science Stack Exchange
You should probably note that, ...ob/master/numpy/linalg/umath_linalg.c.src) the inv routine attempts to call the dgetrf function from your system LAPACK package, which then performs an LU decomposition of your original matrix. This is morally equivalent to Gaussian elimination, but can be tuned to a slightly lower complexity by using faster matrix multiplication algorithms ... More on scicomp.stackexchange.com
🌐 scicomp.stackexchange.com
February 11, 2016
Matrix multiplication algorithm time complexity - Stack Overflow
I came up with this algorithm for matrix multiplication. I read somewhere that matrix multiplication has a time complexity of O(n^2). I think my algorithm will give O(n^3). I don't know how to calc... More on stackoverflow.com
🌐 stackoverflow.com
python - What is the time complexity of numpy.linalg.det? - Stack Overflow
For more information about the ... Thus, in practice, the complexity of the matrix multiplication is between O(n^2.81) and O(n^3) regarding the target BLAS implementation (which is dependent of your platform and your configuration of Numpy).... More on stackoverflow.com
🌐 stackoverflow.com
algorithms - What is the complexity of multiplying a matrix by a scalar? - Computer Science Stack Exchange
I would like to know the complexity of multiplying a matrix of $n\times m$ size by a scalar $\alpha$? In fact, I have a directed graph $G=(V,E)$ represented by an incidence matrix $M$. I would lik... More on cs.stackexchange.com
🌐 cs.stackexchange.com
April 5, 2015
🌐
Benjaminjohnston
benjaminjohnston.com.au › matmul
Benjamin Johnston - Faster Matrix Multiplications in Numpy
Matrix multiplications in NumPy are reasonably fast without the need for optimization. However, if every second counts, it is possible to significantly improve performance (even without a GPU). Below are a collection of small tricks that can help with large (~4000x4000) matrix multiplications.

🌐
Scipy
proceedings.scipy.org › articles › gerudo-f2bc6f59-002
A Modified Strassen Algorithm to Accelerate Numpy Large Matrix Multiplication with Integer Entries - SciPy Proceedings
June 1, 2023 - The algorithm uses Strassen’s algorithm for several divide and conquer steps before crossing over to a regular Numpy matrix multiplication. For large matrices the method is 8 to 30 times faster than calling Numpy.matmul or Numpy.dot to multiply the matrices directly.
🌐
Wikipedia
en.wikipedia.org › wiki › Computational_complexity_of_matrix_multiplication
Computational complexity of matrix multiplication - Wikipedia
January 12, 2026 - This means that, treating the input n×n matrices as block 2 × 2 matrices, the task of multiplying two n×n matrices can be reduced to seven subproblems of multiplying two n/2×n/2 matrices. Applying this recursively gives an algorithm needing ... Unlike algorithms with faster asymptotic complexity, Strassen's algorithm is used in practice. The numerical stability is reduced compared to the naive algorithm, but it is faster in cases where n > 100 or so and appears in several libraries, such as BLAS. Fast matrix multiplication algorithms cannot achieve component-wise stability, but some can be shown to exhibit norm-wise stability.
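
For reference, the recursion described in that snippet is exactly what yields Strassen's exponent: seven half-size subproblems plus quadratic combination work,

$$T(n) = 7\,T(n/2) + O(n^2) \quad\Longrightarrow\quad T(n) = O\big(n^{\log_2 7}\big) \approx O(n^{2.807}).$$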
Top answer
1 of 2
25

(This is getting too long for comments...)

I'll assume you actually need to compute an inverse in your algorithm.1 First, it is important to note that these alternative algorithms are not actually claimed to be faster, just that they have better asymptotic complexity (meaning the required number of elementary operations grows more slowly). In fact, in practice these are actually (much) slower than the standard approach (for a given $n$), for the following reasons:

  1. The $O$-notation hides a constant in front of the power of $n$, which can be astronomically large -- so large that $c\,n^3$ can be much smaller than $C\,n^{2.376}$ for any $n$ that can be handled by any computer in the foreseeable future. (This is the case for the Coppersmith–Winograd algorithm, for example.)

  2. The complexity assumes that every (arithmetical) operation takes the same time -- but this is far from true in actual practice: multiplying a bunch of numbers by the same number is much faster than multiplying the same number of different numbers. This is due to the fact that the major bottleneck in current computing is getting the data into cache, not the actual arithmetical operations on that data. So an algorithm which can be rearranged to have the first situation (called cache-aware) will be much faster than one where this is not possible. (This is the case for the Strassen algorithm, for example.)

Also, numerical stability is at least as important as performance; and here, again, the standard approach usually wins.

For this reason, the standard high-performance libraries (BLAS/LAPACK, which Numpy calls when you ask it to compute an inverse) usually only implement this approach. Of course, there are Numpy implementations of, e.g., Strassen's algorithm out there, but an algorithm hand-tuned at assembly level will soundly beat an algorithm written in a high-level language for any reasonable matrix size.


1 But I'd be remiss if I didn't point out that this is very rarely really necessary: anytime you need to compute a product $A^{-1}b$, you should instead solve the linear system $Ax=b$ (e.g., using numpy.linalg.solve) and use $x$ instead -- this is much more stable, and can be done (depending on the structure of the matrix $A$) much faster. If you need to use $A^{-1}$ multiple times, you can precompute a factorization of $A$ (which is usually the most expensive part of the solve) and reuse that later.
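
To put rough numbers on that footnote (standard flop counts, not taken from the answer): for a dense $n\times n$ system,

$$\text{solve } Ax=b \text{ via LU}:\ \tfrac{2}{3}n^3 + O(n^2) \text{ flops}, \qquad \text{form } A^{-1} \text{ and multiply}:\ \approx 2n^3 + O(n^2) \text{ flops},$$

so the explicit inverse does roughly three times the work, before even counting its stability disadvantages.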

2 of 2
7

You should probably note that, buried deep inside the numpy source code (see https://github.com/numpy/numpy/blob/master/numpy/linalg/umath_linalg.c.src), the inv routine attempts to call the dgetrf function from your system LAPACK package, which then performs an LU decomposition of your original matrix. This is morally equivalent to Gaussian elimination, but can be tuned to a slightly lower complexity by using faster matrix multiplication algorithms in a high-performance BLAS.
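
The answer mentions dgetrf; completing an inversion additionally applies dgetri to the LU factors. A minimal sketch of that route through the LAPACKE C interface (an illustration assuming LAPACKE is installed, not NumPy's actual code path):

#include <lapacke.h>
#include <stdio.h>

int main(void) {
    double a[4] = { 4.0, 3.0,     /* 2x2 row-major matrix */
                    6.0, 3.0 };
    lapack_int ipiv[2];

    /* LU decomposition with partial pivoting: A = P*L*U */
    lapack_int info = LAPACKE_dgetrf(LAPACK_ROW_MAJOR, 2, 2, a, 2, ipiv);
    if (info != 0) { fprintf(stderr, "dgetrf failed: %d\n", (int)info); return 1; }

    /* Invert A in place from its LU factors */
    info = LAPACKE_dgetri(LAPACK_ROW_MAJOR, 2, a, 2, ipiv);
    if (info != 0) { fprintf(stderr, "dgetri failed: %d\n", (int)info); return 1; }

    printf("inv(A) = [[%g, %g], [%g, %g]]\n", a[0], a[1], a[2], a[3]);
    return 0;
}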

If you follow this route, be warned that forcing the entire library chain to use the new library rather than the system one that came with your distribution is fairly complex. One alternative on modern computer systems is to look at parallelized methods using packages like scaLAPACK or (in the Python world) petsc4py. However, these are typically happier being used as iterative solvers for linear algebra systems than applied to direct methods, and PETSc in particular targets sparse systems more than dense ones.

🌐
Towards Data Science
towardsdatascience.com › home › latest › understanding deepmind matrix multiplication
Understanding DeepMind matrix multiplication | Towards Data Science
March 5, 2025 - This algorithm is applied to block ... we saw that for square matrices of size 4096 the standard numpy matmult takes about 454.37 +/- 6.27 s, while Strassen takes 31.57 +/- 1.01, which is a difference of about one order of ...
🌐
Verve AI
vervecopilot.com › interview-questions › why-is-understanding-matrix-multiplication-python-crucial-for-your-interview-success
Why Is Understanding Matrix Multiplication Python Crucial For Your Interview Success
September 11, 2025 - This manual approach to matrix multiplication python has a time complexity of $O(n^3)$ for square matrices of size $n \times n$, which is a common discussion point for algorithmic efficiency in interviews [^3]. While manual implementation is vital for understanding, Python’s scientific computing libraries, especially NumPy...
🌐
Wikipedia
en.wikipedia.org › wiki › Matrix_multiplication_algorithm
Matrix multiplication algorithm - Wikipedia
1 month ago - As of September 2025, the best bound on the asymptotic complexity of a matrix multiplication algorithm is O(n^2.371339) time, given by Alman, Duan, Williams, Xu, Xu, and Zhou.
🌐
GeeksforGeeks
geeksforgeeks.org › matrix-multiplication-in-numpy
Matrix Multiplication in NumPy - GeeksforGeeks
3 min read Multiplication of two Matrices in Single line using Numpy in Python · Matrix multiplication is an operation that takes two matrices as input and produces a single matrix by multiplying the rows of the first matrix with the columns of the ...
Published September 2, 2020
🌐
Cloudploys
survey.cloudploys.com › dogs-for-oalg0 › dob---numpy-matrix-multiplication-time-complexity---7xsli.html
Numpy matrix multiplication time complexity
Furthermore, our NumPy solution involves both Python-stack recursions and the allocation of many temporary arrays, which adds significant computation time. Nov 04, 2019 · Matrix multiplication: If you are multiplying two matrices, (n, p) and (p, m) then the general complexity of this is O(nmp), ...
🌐
Siboehm
siboehm.com › articles › 22 › Fast-MMM-on-CPU
Fast Multidimensional Matrix Multiplication on CPU from Scratch
August 14, 2022 - This is incredibly fast, considering this boils down to 18 FLOPs / core / cycle, with a cycle taking a third of a nanosecond. Numpy does this using a highly optimized BLAS implementation. BLAS is short for Basic Linear Algebra Subprograms.
🌐
Ashishmalik
ashishmalik.in › post › faster_multiplication_of_matrix
Optimizing Matrix Multiplication | - Ashish Malik
Time Complexity: $O(n^3)$ (but highly optimized with BLAS/LAPACK) Description: NumPy is the de facto standard for numerical computing in Python. Its matmul (or @ operator) function is a wrapper around highly optimized C and Fortran libraries ...
🌐
GeeksforGeeks
geeksforgeeks.org › python › multiply-matrices-of-complex-numbers-using-numpy-in-python
Multiply Matrices of Complex Numbers using NumPy in Python - GeeksforGeeks
September 30, 2025 - A Complex Number is any number ... of two complex numbers can be done using the formula below: $(a+ib) \times (x+iy) = ax + i^2by + i(bx+ay) = (ax-by) + i(bx+ay)$. NumPy provides the vdot() method that ...
🌐
Baeldung
baeldung.com › home › core concepts › math and logic › matrix multiplication algorithm time complexity
Matrix Multiplication Algorithm Time Complexity | Baeldung on Computer Science
March 18, 2024 - The naive matrix multiplication algorithm contains three nested loops. For each iteration of the outer loop, the total number of runs in the inner loops would be equivalent to the length of the matrix. Here, integer operations take O(1) time. In general, if the length of the matrix is n, the total time complexity would be O(n^3).
🌐
Codegive
codegive.com › blog › numpy_matrix_multiplication_algorithm.php
Numpy matrix multiplication algorithm
The efficiency of the NumPy matrix multiplication algorithm is critical for several reasons: Performance: Manual matrix multiplication in Python using nested loops is notoriously slow, with a time complexity of O(mnp).
Top answer
1 of 1
3

TL;DR: it is between O(n^2.81) and O(n^3), depending on the target BLAS implementation.

Indeed, Numpy uses an LU decomposition (in log space). The actual implementation can be found here. It indeed uses the sgetrf/dgetrf primitive of LAPACK. Multiple libraries provide such primitives; the most famous is the NetLib one, though it is not the fastest. The Intel MKL is an example of a library providing a fast implementation. Fast LU decomposition algorithms use tiling so that most of the work is done internally by matrix multiplication. They do that because matrix multiplication is one of the most heavily optimized routines in linear algebra libraries (for example, the MKL, BLIS, and OpenBLAS generally manage to reach nearly optimal performance on modern processors). More generally, the complexity of the LU decomposition is the same as that of matrix multiplication.
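
To make the "in log space" remark concrete: once dgetrf has computed $PA = LU$ (with $L$ unit lower triangular and $s$ row swaps), the determinant falls out of the diagonal of $U$:

$$\det A = (-1)^s \prod_i u_{ii}, \qquad \log\lvert\det A\rvert = \sum_i \log\lvert u_{ii}\rvert.$$

Summing logarithms instead of multiplying the $u_{ii}$ directly is what numpy.linalg.slogdet does to avoid overflow and underflow for large matrices.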

The complexity of the naive square matrix multiplication is O(n^3). Faster algorithms exist, like Strassen's (running in ~O(n^2.81) time), which is often used for big matrices. The Coppersmith–Winograd algorithm achieves a significantly better complexity (~O(n^2.38)), but no linear algebra library actually uses it since it is a galactic algorithm: put shortly, such an algorithm is theoretically asymptotically better than the others, but the hidden constant makes it impractical for any real-world use. For more information about the complexity of matrix multiplication, please read this article. Thus, in practice, the complexity of the matrix multiplication is between O(n^2.81) and O(n^3), depending on the target BLAS implementation (which in turn depends on your platform and your configuration of Numpy).

Top answer
1 of 2
2

In general, if you have an $n\times m$ matrix $A=(a_{i,j})$ with $1\le i\le n$ and $1\le j\le m$ then there will be $nm$ entries in the array. To multiply $A$ by a scalar $c$ you multiply each element by c, which (assuming multiplication can be done in constant time) will take $nm$ multiplications. That's not the way to go, though, if $A$ is the adjacency matrix of a directed graph.

For your problem, if $A$ is the adjacency matrix of a digraph $G$ on $n$ vertices, then $A$ will be an $n\times n$ matrix $(a_{i, j})$ with $a_{i,j}=1$ when there is a directed edge $e=(i, j)$ from vertex $i$ to vertex $j$ and $a_{i,j}=0$ otherwise. You correctly note that the reversed graph $G'$ will have an adjacency matrix $B=(b_{i,j})$ which is the transpose of $A$, so we'll have $b_{i, j} = a_{j, i}$, since $G'$ will have an edge from vertex $v_i$ to vertex $v_j$ if and only if $G$ has an edge from $v_j$ to $v_i$. To get the transpose of $A$, then, we'll swap the $a_{i, j}\leftrightarrow a_{j, i}$ entries of $A$. There will be $n(n-1)/2$ swaps needed, since we swap every element above the main diagonal with its corresponding entry below the main diagonal.

Note that multiplying $A$ by $-1$ won't work, since the entries of an adjacency matrix must be only $1$ or $0$.
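
A minimal sketch (mine, not the answer's) of the swap-based in-place transpose just described, performing exactly $n(n-1)/2$ swaps:

#include <stddef.h>

/* In-place transpose of a square row-major matrix: swap each entry above
 * the main diagonal with its mirror below it. */
void transpose_inplace(int *a, size_t n)
{
    for (size_t i = 0; i < n; i++) {
        for (size_t j = i + 1; j < n; j++) {  /* strictly above the diagonal */
            int tmp = a[i * n + j];
            a[i * n + j] = a[j * n + i];
            a[j * n + i] = tmp;
        }
    }
}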

2 of 2
1

Of course, as previously noted by other users, you cannot transpose a matrix by a simple multiplication by a scalar. Transposition is an operation of its own.

The complexity of matrix operations depends heavily on the way the matrices are actually represented in your algorithm. While the most obvious representations have a cost of $O(nm)$ (or $O(n^2)$ for a square matrix) for scalar multiplication or for transposition, other representations can be chosen, depending on the algorithm they are used in, with a lower cost.

The complexity of multiplying a matrix by a scalar $\alpha$ in the usual way does imply multiplying each of its $n\times m$ elements by $\alpha$, and hence has a cost $O(nm)$. However, in general, the cost depends on the way you actually represent matrices, and it is conceivable to use representations where the cost might be constant, if the matrix is represented up to scalar multiplication together with a scalar factor. The computational cost can also be lower in the case of a sparse matrix representation that allows iterating only on the useful (or non-zero) elements.

Without going into details, matrix transposition can also be done in constant time, i.e., complexity $O(1)$, if you choose to represent a matrix as a pair composed of an appropriate memory structure of elements, and an indexation function for accessing them, given the row and column number. Then transposition can be achieved in constant time simply by changing the indexation function.

Note: this is a matter of representation, not of model of computation (the expression "computational model" being ambiguous). But I am not sure what is intended by Tom van der Zanden's comment.
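
To illustrate the representation idea (a hypothetical sketch; the names are mine): store the elements once, and fold both a transposition flag and a pending scalar factor into the indexation function, making both operations $O(1)$:

#include <stddef.h>

typedef struct {
    const double *data;   /* row-major storage; never moved or copied */
    size_t rows, cols;    /* physical dimensions of data */
    int transposed;       /* selects which indexation function applies */
    double scale;         /* pending scalar factor */
} Mat;

/* The indexation function: every element access goes through here. */
double mat_get(const Mat *m, size_t i, size_t j)
{
    return m->scale * (m->transposed ? m->data[j * m->cols + i]
                                     : m->data[i * m->cols + j]);
}

void mat_transpose(Mat *m) { m->transposed = !m->transposed; }  /* O(1) */
void mat_scale(Mat *m, double a) { m->scale *= a; }             /* O(1) */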

🌐
Studytonight
studytonight.com › numpy › numpy-matrix-multiplication
NumPy Matrix Multiplication - Studytonight
Matrix multiplication in NumPy relies on what is commonly known as vectorization. The main goal of vectorization is to reduce the use of for loops for carrying out such operations; when explicit for loops are skipped, the overall execution time of the code is reduced.