NumPy uses a highly optimized, carefully tuned BLAS method for matrix multiplication (see also: ATLAS). The specific function in this case is GEMM (for general matrix multiplication). You can look up the original by searching for dgemm.f (it's in Netlib).
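From C, that routine is reached through the CBLAS interface. Here is a minimal sketch of the call for square matrices (an illustration, assuming a BLAS library is installed and linked, e.g. -lopenblas):

#include <cblas.h>

/* One GEMM call computes C = alpha*A*B + beta*C.
 * Here: C = A*B for n x n row-major matrices. */
void matmult_blas(const double *a, const double *b, double *c, int n) {
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                n, n, n,        /* M, N, K       */
                1.0, a, n,      /* alpha, A, lda */
                b, n,           /* B, ldb        */
                0.0, c, n);     /* beta, C, ldc  */
}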
The optimization, by the way, goes beyond compiler optimizations. Above, Philip mentioned Coppersmith–Winograd. If I remember correctly, this is the algorithm which is used for most cases of matrix multiplication in ATLAS (though a commenter notes it could be Strassen's algorithm).
In other words, your matmult algorithm is the trivial implementation. There are faster ways to do the same thing.
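For reference, the "trivial implementation" is the textbook triple loop; a minimal sketch in C (row-major storage, square matrices, illustrative names):

/* Naive O(n^3) matrix multiplication: the "trivial implementation"
 * referred to above. */
void matmult_naive(const double *a, const double *b, double *c, int n) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;
            for (int k = 0; k < n; k++) {
                sum += a[i * n + k] * b[k * n + j];
            }
            c[i * n + j] = sum;
        }
    }
}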
I'm not too familiar with Numpy, but the source is on Github. Part of dot products are implemented in https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/arraytypes.c.src, which I'm assuming is translated into specific C implementations for each datatype. For example:
/**begin repeat
 *
 * #name = BYTE, UBYTE, SHORT, USHORT, INT, UINT,
 *         LONG, ULONG, LONGLONG, ULONGLONG,
 *         FLOAT, DOUBLE, LONGDOUBLE,
 *         DATETIME, TIMEDELTA#
 * #type = npy_byte, npy_ubyte, npy_short, npy_ushort, npy_int, npy_uint,
 *         npy_long, npy_ulong, npy_longlong, npy_ulonglong,
 *         npy_float, npy_double, npy_longdouble,
 *         npy_datetime, npy_timedelta#
 * #out = npy_long, npy_ulong, npy_long, npy_ulong, npy_long, npy_ulong,
 *        npy_long, npy_ulong, npy_longlong, npy_ulonglong,
 *        npy_float, npy_double, npy_longdouble,
 *        npy_datetime, npy_timedelta#
 */
static void
@name@_dot(char *ip1, npy_intp is1, char *ip2, npy_intp is2, char *op, npy_intp n,
           void *NPY_UNUSED(ignore))
{
    @out@ tmp = (@out@)0;
    npy_intp i;

    for (i = 0; i < n; i++, ip1 += is1, ip2 += is2) {
        tmp += (@out@)(*((@type@ *)ip1)) *
               (@out@)(*((@type@ *)ip2));
    }
    *((@type@ *)op) = (@type@) tmp;
}
/**end repeat**/
This appears to compute one-dimensional dot products, i.e. on vectors. In my few minutes of GitHub browsing I was unable to find the source for matrices, but it's possible that it uses one call to FLOAT_dot for each element in the result matrix. That means the loop in this function corresponds to your innermost loop.
One difference between them is that the "stride" -- the difference between successive elements in the inputs -- is explicitly computed once before calling the function. In your case there is no stride, and the offset of each input is computed each time, e.g. a[i * n + k]. I would have expected a good compiler to optimise that away to something similar to the Numpy stride, but perhaps it can't prove that the step is a constant (or it's not being optimised).
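A sketch of the contrast (illustrative names, not NumPy's code): the first version recomputes two offsets on every iteration, while the second hoists both strides out of the loop the way @name@_dot does.

/* Index-arithmetic version: the offsets i * n + k and k * n + j are
 * recomputed each step (unless the compiler strength-reduces them). */
double dot_indexed(const double *a, const double *b, int i, int j, int n) {
    double tmp = 0.0;
    for (int k = 0; k < n; k++) {
        tmp += a[i * n + k] * b[k * n + j];
    }
    return tmp;
}

/* Stride version, in the style of @name@_dot above: both step sizes are
 * computed once by the caller and the pointers just advance by them.
 * For a row of a and a column of b in row-major storage, the caller
 * passes strides of sizeof(double) and n * sizeof(double) respectively. */
double dot_strided(const char *ip1, long is1, const char *ip2, long is2, long n) {
    double tmp = 0.0;
    for (long i = 0; i < n; i++, ip1 += is1, ip2 += is2) {
        tmp += *(const double *)ip1 * *(const double *)ip2;
    }
    return tmp;
}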
Numpy may also be doing something smart with cache effects in the higher-level code that calls this function. A common trick is to think about whether each row is contiguous, or each column, and to iterate over each contiguous part first. It seems difficult to be perfectly optimal: for each dot product, one input matrix must be traversed by rows and the other by columns (unless they happen to be stored in different major order). But it can at least do that for the result elements.
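As a sketch of that idea (again, not NumPy's actual code): merely swapping the two inner loops of the naive version makes the innermost loop walk both b and c contiguously, and reuses one element of a across a whole row.

/* Cache-friendlier loop order (i-k-j): the innermost loop walks both
 * b and c row-wise, i.e. contiguously in row-major storage. */
void matmult_ikj(const double *a, const double *b, double *c, int n) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) c[i * n + j] = 0.0;
        for (int k = 0; k < n; k++) {
            double aik = a[i * n + k];   /* reused across the whole j loop */
            for (int j = 0; j < n; j++) {
                c[i * n + j] += aik * b[k * n + j];
            }
        }
    }
}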
Numpy also contains code to choose the implementation of certain operations, including "dot", from different basic implementations. For instance, it can use a BLAS library. From the discussion above it sounds like CBLAS is used. This was translated from Fortran into C. I think the implementation used in your test would be the one found here: http://www.netlib.org/clapack/cblas/sdot.c.
Note that this program was written by a machine for another machine to read. But you can see at the bottom that it's using an unrolled loop to process 5 elements at a time:
for (i = mp1; i <= *n; i += 5) {
    stemp = stemp + SX(i) * SY(i) + SX(i + 1) * SY(i + 1) + SX(i + 2) *
            SY(i + 2) + SX(i + 3) * SY(i + 3) + SX(i + 4) * SY(i + 4);
}
This unrolling factor is likely to have been picked after profiling several candidates. But one theoretical advantage of it is that more arithmetical operations are done between each branch point, giving the compiler and CPU more choice about how to schedule them to get as much instruction pipelining as possible.
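A hand-written equivalent of the same trick looks like this (a sketch in modern C, not the f2c output itself): the n % 5 leftover elements are handled first, then the main loop processes five at a time.

/* Dot product unrolled by a factor of 5, in the spirit of sdot.c:
 * handle the n % 5 remainder first, then do 5 multiply-adds per
 * iteration so loop overhead is amortized over more arithmetic. */
double dot_unrolled5(const double *x, const double *y, int n) {
    double s = 0.0;
    int m = n % 5;
    int i;
    for (i = 0; i < m; i++) {            /* remainder loop */
        s += x[i] * y[i];
    }
    for (; i < n; i += 5) {              /* main unrolled loop */
        s += x[i]     * y[i]
           + x[i + 1] * y[i + 1]
           + x[i + 2] * y[i + 2]
           + x[i + 3] * y[i + 3]
           + x[i + 4] * y[i + 4];
    }
    return s;
}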
TL;DR This blog post is the result of my attempt to implement high-performance matrix multiplication on CPU while keeping the code simple, portable and scalable. The implementation follows the BLIS design, works for arbitrary matrix sizes, and, when fine-tuned for an AMD Ryzen 7700 (8 cores), outperforms NumPy (=OpenBLAS), achieving over 1 TFLOPS of peak performance across a wide range of matrix sizes.
The code is efficiently parallelized with just 3 lines of OpenMP directives, which keeps it both scalable and easy to understand. Throughout this tutorial, we'll implement matrix multiplication from scratch, learning how to optimize and parallelize C code using matrix multiplication as an example. This is my first time writing a blog post. If you enjoy it, please subscribe and share it! I would be happy to hear feedback from all of you.
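The author's actual directives are in the linked repo; as a generic illustration of how little OpenMP it takes, a single pragma parallelizes the outermost loop of a straightforward kernel (compile with -fopenmp):

#include <omp.h>

/* One directive splits the outer loop over rows of c across threads;
 * each thread writes a disjoint set of rows, so no synchronization is
 * needed. A generic illustration, not the tuned kernel from matmul.c. */
void matmult_parallel(const double *a, const double *b, double *c, int n) {
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) c[i * n + j] = 0.0;
        for (int k = 0; k < n; k++) {
            double aik = a[i * n + k];
            for (int j = 0; j < n; j++) {
                c[i * n + j] += aik * b[k * n + j];
            }
        }
    }
}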
This is the first part of my planned two-part blog series. In the second part, we will learn how to optimize matrix multiplication on GPUs. Stay tuned!
Tutorial: https://salykova.github.io/matmul-cpu
Github repo: matmul.c
Complexity of matrix inversion in numpy - Computational Science Stack Exchange
Matrix multiplication algorithm time complexity - Stack Overflow
python - What is the time complexity of numpy.linalg.det? - Stack Overflow
algorithms - What is the complexity of multiplying a matrix by a scalar? - Computer Science Stack Exchange
(This is getting too long for comments...)
I'll assume you actually need to compute an inverse in your algorithm.¹ First, it is important to note that these alternative algorithms are not actually claimed to be faster, just that they have better asymptotic complexity (meaning the required number of elementary operations grows more slowly). In fact, in practice these are actually (much) slower than the standard approach (for a given $n$), for the following reasons:

The $\mathcal{O}$-notation hides a constant in front of the power of $n$, which can be astronomically large -- so large that $C_1 n^3$ can be much smaller than $C_2 n^{2.376}$ for any $n$ that can be handled by any computer in the foreseeable future. (This is the case for the Coppersmith–Winograd algorithm, for example.)
The complexity assumes that every (arithmetical) operation takes the same time -- but this is far from true in actual practice: multiplying a bunch of numbers by the same number is much faster than multiplying the same number of different numbers. This is due to the fact that the major bottleneck in current computing is getting the data into cache, not the actual arithmetical operations on that data. So an algorithm which can be rearranged to have the first situation (called cache-aware) will be much faster than one where this is not possible. (This is the case for the Strassen algorithm, for example.) A blocked-loop sketch of this idea follows below.
Also, numerical stability is at least as important as performance; and here, again, the standard approach usually wins.
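To make the cache-aware point concrete, here is the blocked-loop sketch promised above: a generic tiled multiplication (not any particular library's kernel) in which each tile is reused many times while it is still hot in cache. The tile size BS is a tuning assumption, and n is assumed to be a multiple of BS to keep the sketch short.

#define BS 64   /* tile size: a tuning assumption, not a universal constant */

/* Blocked matrix multiplication: operate on BS x BS tiles so each tile
 * of a and b is reused many times while it still resides in cache. */
void matmult_blocked(const double *a, const double *b, double *c, int n) {
    for (int i = 0; i < n * n; i++) c[i] = 0.0;
    for (int ii = 0; ii < n; ii += BS)
        for (int kk = 0; kk < n; kk += BS)
            for (int jj = 0; jj < n; jj += BS)
                for (int i = ii; i < ii + BS; i++)
                    for (int k = kk; k < kk + BS; k++) {
                        double aik = a[i * n + k];
                        for (int j = jj; j < jj + BS; j++)
                            c[i * n + j] += aik * b[k * n + j];
                    }
}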
For this reason, the standard high-performance libraries (BLAS/LAPACK, which Numpy calls when you ask it to compute an inverse) usually only implement this approach. Of course, there are Numpy implementations of, e.g., Strassen's algorithm out there, but an $\mathcal{O}(n^3)$ algorithm hand-tuned at assembly level will soundly beat an $\mathcal{O}(n^{2.8})$ algorithm written in a high-level language for any reasonable matrix size.
¹ But I'd be remiss if I didn't point out that this is very rarely really necessary: anytime you need to compute a product $A^{-1}b$, you should instead solve the linear system $Ax=b$ (e.g., using numpy.linalg.solve) and use the solution $x$.

You should probably note that, buried deep inside the numpy source code (see https://github.com/numpy/numpy/blob/master/numpy/linalg/umath_linalg.c.src) the inv routine attempts to call the dgetrf function from your system LAPACK package, which then performs an LU decomposition of your original matrix. This is morally equivalent to Gaussian elimination, but can be tuned to a slightly lower complexity by using faster matrix multiplication algorithms in a high-performance BLAS.
If you follow this route, you should be warned that forcing the entire library chain to use the new library rather than the system one which came with your distribution is fairly complex. One alternative on modern computer systems is to look at parallelized methods using packages like ScaLAPACK or (in the Python world) petsc4py. However, these are typically happier being used as iterative solvers for linear algebra systems than applied to direct methods, and PETSc in particular targets sparse systems more than dense ones.
Using linear algebra, there exist algorithms that achieve better complexity than the naive O(n^3). Strassen's algorithm achieves a complexity of O(n^2.807) by reducing the number of multiplications required for each 2x2 sub-matrix from 8 to 7, as shown below.
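For reference, these are Strassen's seven products for a 2x2 block partition, and how the result blocks are recombined:

$$
\begin{aligned}
M_1 &= (A_{11}+A_{22})(B_{11}+B_{22}) & M_5 &= (A_{11}+A_{12})B_{22}\\
M_2 &= (A_{21}+A_{22})B_{11} & M_6 &= (A_{21}-A_{11})(B_{11}+B_{12})\\
M_3 &= A_{11}(B_{12}-B_{22}) & M_7 &= (A_{12}-A_{22})(B_{21}+B_{22})\\
M_4 &= A_{22}(B_{21}-B_{11}) &&
\end{aligned}
$$

$$
C_{11}=M_1+M_4-M_5+M_7,\quad C_{12}=M_3+M_5,\quad C_{21}=M_2+M_4,\quad C_{22}=M_1-M_2+M_3+M_6.
$$

Applied recursively to the blocks, seven multiplications instead of eight give the $\mathcal{O}(n^{\log_2 7}) \approx \mathcal{O}(n^{2.807})$ bound.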
The asymptotically fastest such algorithms are of the Coppersmith–Winograd type, with a complexity of about O(n^2.373). Unless the matrix is huge, these algorithms do not result in a vast difference in computation time. In practice, it is easier and faster to use parallel algorithms for matrix multiplication.
The naive algorithm, which is what you've got once you correct it as noted in comments, is O(n^3).
There do exist algorithms that reduce this somewhat, but you're not likely to find an O(n^2) implementation. I believe the question of the most efficient implementation is still open.
See this Wikipedia article on Matrix Multiplication for more information.
In general, if you have an $n\times m$ matrix $A=(a_{i,j})$ with $1\le i\le n$ and $1\le j\le m$ then there will be $nm$ entries in the array. To multiply $A$ by a scalar $c$ you multiply each element by c, which (assuming multiplication can be done in constant time) will take $nm$ multiplications. That's not the way to go, though, if $A$ is the adjacency matrix of a directed graph.
For your problem, if $A$ is the adjacency matrix of a digraph $G$ on $n$ vertices, then $A$ will be an $n\times n$ matrix $(a_{i, j})$ with $a_{i,j}=1$ when there is a directed edge $e=(i, j)$ from vertex $i$ to vertex $j$ and $a_{i,j}=0$ otherwise. You correctly note that the reversed graph $G'$ will have an adjacency matrix $B=(b_{i,j})$ which is the transpose of $A$, so we'll have $b_{i, j} = a_{j, i}$, since $G'$ will have an edge from vertex $v_i$ to vertex $v_j$ if and only if $G$ has an edge from $v_j$ to $v_i$. To get the transpose of $A$, then, we'll swap $a_{i, j}\leftrightarrow a_{j, i}$ entries of $A$. There will be $n(n-1)/2$ swaps needed, since we swap every element above the main diagonal with its corresponding entry below the main diagonal.
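A short C sketch of exactly that swap (illustrative names; the inner loop runs only over $j > i$, so exactly $n(n-1)/2$ swaps are made):

#include <stdio.h>

/* In-place transpose of an n x n adjacency matrix: swap each
 * above-diagonal entry with its mirror below the diagonal. */
static void transpose_in_place(int *a, int n) {
    for (int i = 0; i < n; i++) {
        for (int j = i + 1; j < n; j++) {
            int tmp = a[i * n + j];     /* a[i][j] <-> a[j][i] */
            a[i * n + j] = a[j * n + i];
            a[j * n + i] = tmp;
        }
    }
}

int main(void) {
    /* Adjacency matrix of a 3-vertex digraph with edges 0->1, 1->2, 2->0. */
    int g[9] = {0, 1, 0,
                0, 0, 1,
                1, 0, 0};
    transpose_in_place(g, 3);
    /* g now represents the reversed graph: 1->0, 2->1, 0->2. */
    for (int i = 0; i < 9; i++) printf("%d%c", g[i], (i % 3 == 2) ? '\n' : ' ');
    return 0;
}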
Note that multiplying $A$ by $-1$ won't work, since the entries of an adjacency matrix must be only $1$ or $0$.
Of course, as previously noted by other users, you cannot transpose a matrix by a simple multiplication by a scalar. Transposition is an operation of its own.
The complexity of matrix operations depends on the way the matrices are represented.
The cost of matrix operations is very dependent on the way matrices are actually represented in your algorithm. While the most obvious representations can have a cost $O(nm)$ (or $O(n^2)$ for a square matrix) for scalar multiplication or for transposition, other representations can be chosen, depending on the algorithm it is used for, with a lower cost.
The complexity of multiplying a matrix by a scalar $\alpha$ in the usual way does imply multiplying each of its $n\times m$ elements by $\alpha$, and hence has a cost $O(nm)$. However, in general, the cost depends on the way you actually represent matrices, and it is conceivable to use representations where the cost might be constant, if the matrix is represented up to scalar multiplication together with a scalar factor. The computational cost can also be lower in the case of a sparse matrix representation that allows iterating only on the useful (or non-zero) elements.
Without going into details, matrix transposition can also be done in constant time, i.e., complexity $O(1)$, if you choose to represent a matrix as a pair composed of an appropriate memory structure of elements, and an indexation function for accessing them, given the row and column number. Then transposition can be achieved in constant time simply by changing the indexation function.
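As a concrete sketch of both ideas, the stored scalar factor and the swappable indexation function, here is one such representation in C. All names are illustrative, not from any library.

#include <stdio.h>

typedef struct {
    double *data;               /* backing storage, never copied          */
    double  scale;              /* every logical element is scale * data  */
    int     rows, cols;
    int     rstride, cstride;   /* indexation: element (i,j) lives at
                                   data[i * rstride + j * cstride]        */
} Mat;

static double mat_get(const Mat *m, int i, int j) {
    return m->scale * m->data[i * m->rstride + j * m->cstride];
}

/* O(1) scalar multiplication: only the stored factor changes. */
static void mat_scale(Mat *m, double alpha) { m->scale *= alpha; }

/* O(1) transposition: swap the strides (and dimensions), i.e. change
 * the indexation function; no element is moved. */
static void mat_transpose(Mat *m) {
    int t = m->rstride; m->rstride = m->cstride; m->cstride = t;
    t = m->rows; m->rows = m->cols; m->cols = t;
}

int main(void) {
    double buf[4] = {1, 2, 3, 4};           /* 2x2 row-major */
    Mat m = {buf, 1.0, 2, 2, 2, 1};
    mat_scale(&m, 3.0);                     /* no element is touched */
    mat_transpose(&m);                      /* no element is moved   */
    printf("%g\n", mat_get(&m, 0, 1));      /* 3 * buf[2] = 9        */
    return 0;
}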
Note: this is a matter of representation, not of model of computation (the expression "computational model" being ambiguous). But I am not sure what is intended by Tom van der Zanden's comment.