Indexing individual array elements
This is the main type of code where Cython can really help you. Indexing an individual element (e.g. an = a[n]) can be a fairly slow operation in Python: partly because Python is not a hugely quick language, so running Python code many times within a loop is slow, and partly because, although the array is stored as a tightly-packed array of C floats, the indexing operation must return a Python object. Indexing a Numpy array therefore allocates a new Python object each time.
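To make the allocation cost concrete, here is a plain-Python illustration (ordinary NumPy, no Cython): each scalar index hands back a freshly created Python object wrapping the C value.

```python
import numpy as np

a = np.zeros(3)
# Indexing a single element does not return the raw C double:
# it allocates a new Python object (a NumPy scalar) wrapping it.
x = a[0]
assert type(x) is np.float64
# Two separate indexing operations allocate two distinct objects.
assert a[0] is not a[0]
```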
In Cython you can declare the arrays as typed memoryviews, or as np.ndarray. (Typed memoryviews are the more modern approach and you should usually prefer them.) Doing so lets you access the tightly-packed C array directly and retrieve the C value, without creating a Python object.
The directives cython.boundscheck and cython.wraparound can give worthwhile further speed-ups to indexing (but remember they remove useful safety features, so think before using them).
vs vectorization
A lot of the time a loop over a Numpy array can be written as a vectorized operation - something that acts on the whole array at once. It is usually a good idea to write Python+Numpy code like this. If you have multiple chained vectorized operations, it is sometimes worthwhile to write it explicitly as a Cython loop to avoid allocating intermediate arrays.
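As a plain-NumPy illustration (no Cython involved), a vectorized expression and its explicit loop compute the same thing; the chained vectorized form allocates an intermediate array at each step:

```python
import numpy as np

a = np.arange(5.0)

# Chained vectorized operations: each step (sin, *, +) allocates an
# intermediate array before the next operation runs.
v = np.sin(a) * 2.0 + 1.0

# The equivalent explicit loop touches each element once, with no
# intermediate arrays - this is the form Cython can compile to fast C.
w = np.empty_like(a)
for n in range(a.shape[0]):
    w[n] = np.sin(a[n]) * 2.0 + 1.0

assert np.allclose(v, w)
```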
Alternatively, the little-known Pythran backend for Cython can convert a set of vectorized Numpy operations into optimized C++ code.
Indexing array slices
This isn't a problem in Cython, but it typically isn't something that will get you significant speed-ups on its own.
Calling Numpy functions
e.g. last = np.sin(an)
These require a Python call, so Cython typically cannot accelerate them - it has no visibility into the contents of the Numpy function.
However, here the operation is on a single value, and not on a Numpy array. In this case we can use sin from the C standard library, which will be significantly quicker than a Python function call. You'd do from libc.math cimport sin and call sin rather than np.sin.
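In plain Python the same idea can be seen with the standard library's math.sin, which (like libc's sin) operates on a single C double and agrees with np.sin on scalars:

```python
import math
import numpy as np

x = 0.5
# For a single value, the C-library sin avoids NumPy's ufunc dispatch
# overhead and produces the same result.
assert abs(math.sin(x) - np.sin(x)) < 1e-15
```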
Numba is an alternative Python accelerator that has better visibility into Numpy functions and can often optimize them without changes.
Array allocations
e.g. transformed_a = np.zeros_like(a).
This is just a Numpy function call and thus Cython has no ability to speed it up. If it's just an intermediate value and not returned to Python, then you might consider a fixed-size C array on the stack:
cdef double transformed_a[10] # note - you must know the size at compile-time
or by allocating them with the C malloc function (remember to free it). Or by using Cython's cython.view.array (which is still a Python object, but can be a little quicker).
Whole-array arithmetic
e.g. transformed_a * b, which multiplies transformed_a and b element by element.
Cython doesn't help you here - it's just a disguised function call (although Pythran+Cython may have some benefits). For large arrays this kind of operation is pretty efficient in Numpy so don't overthink it.
Note that whole-array operations aren't defined for Cython typed memoryviews, so you need to do np.asarray(memview) to get them back to Numpy arrays. This typically doesn't need a copy and is quick.
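As a rough plain-Python stand-in for a Cython typed memoryview, the built-in memoryview shows that np.asarray wraps the existing buffer rather than copying it:

```python
import numpy as np

a = np.arange(4.0)
mv = memoryview(a)    # stand-in for a Cython typed memoryview
b = np.asarray(mv)    # back to a full ndarray with whole-array operations

# No copy was made: both objects share the same underlying buffer.
assert np.shares_memory(a, b)
b[0] = 99.0
assert a[0] == 99.0
```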
For some operations like this, you can use BLAS and LAPACK functions (which are fast C implementations of array and matrix operations). Scipy provides a Cython interface to them (https://docs.scipy.org/doc/scipy/reference/linalg.cython_blas.html). They're a little more complex to use than the natural Python code.
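Scipy also exposes Python-level BLAS wrappers (scipy.linalg.blas), which can be handy for checking a result before moving to the cython_blas interface; this sketch assumes scipy is installed:

```python
import numpy as np
from scipy.linalg.blas import ddot  # double-precision dot product from BLAS

x = np.array([1.0, 2.0, 3.0])
y = np.array([4.0, 5.0, 6.0])
# The BLAS routine agrees with the ordinary NumPy call.
assert ddot(x, y) == np.dot(x, y)
```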
The illustrative example
Just for completeness, I'd write it something like:
import numpy as np
from libc.math cimport sin
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def some_func(double[::1] a, b):
    cdef double[::1] transformed_a = np.zeros_like(a)
    cdef double last = 0
    cdef double an, delta
    cdef Py_ssize_t n

    for n in range(1, a.shape[0]):
        an = a[n]
        if an > 0:
            delta = an - a[n-1]
            transformed_a[n] = delta*last
        else:
            last = sin(an)
    return np.asarray(transformed_a) * b
which is a little over 10x faster.
cython -a is helpful here - it produces an annotated HTML file that shows which lines contain a lot of interaction with Python.
With slight modification, version 3 becomes twice as fast:
@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
def process2(np.ndarray[DTYPE_t, ndim=2] array):
    cdef unsigned int rows = array.shape[0]
    cdef unsigned int cols = array.shape[1]
    cdef unsigned int row, col, row2
    # must start from zeros, since the loop accumulates into out
    cdef np.ndarray[DTYPE_t, ndim=2] out = np.zeros((rows, cols))
    for row in range(rows):
        for row2 in range(rows):
            for col in range(cols):
                out[row, col] += array[row2, col] - array[row, col]
    return out
The bottleneck in your calculation is memory access. Your input array is C ordered, which means that moving along the last axis makes the smallest jump in memory. Therefore your inner loop should be along axis 1, not axis 0. Making this change cuts the run time in half.
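The strides of a C-ordered array make this concrete: stepping along the last axis moves by one item, while stepping along the first axis jumps a whole row (illustrative sizes):

```python
import numpy as np

a = np.zeros((1000, 1000))   # C-ordered (row-major) by default
# Moving one step along axis 1 advances 8 bytes (one float64);
# moving one step along axis 0 jumps 8000 bytes (a whole row).
assert a.strides == (8000, 8)
# So the cache-friendly inner loop is the one over the last axis.
```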
If you need to use this function on small input arrays then you can reduce the allocation overhead by using np.empty instead of np.ones (only where every element is assigned before being read; the accumulating loop above needs zero-initialized memory). To reduce the overhead further, use PyArray_EMPTY from the numpy C API.
If you use this function on very large input arrays (more than 2**31 elements) then the integers used for indexing (and in the range function) will overflow. To be safe use:
cdef Py_ssize_t rows = array.shape[0]
cdef Py_ssize_t cols = array.shape[1]
cdef Py_ssize_t row, col, row2
instead of
cdef unsigned int rows = array.shape[0]
cdef unsigned int cols = array.shape[1]
cdef unsigned int row, col, row2
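The wraparound can be demonstrated in plain NumPy with a 32-bit unsigned array (Py_ssize_t, by contrast, is 64 bits on 64-bit platforms):

```python
import numpy as np

# A 32-bit unsigned index silently wraps past 2**32 - 1, so index
# arithmetic on very large arrays can overflow without any error.
i = np.array([2**32 - 1], dtype=np.uint32)
assert (i + 1)[0] == 0
```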
Timing:
In [2]: a = np.random.rand(10000, 10)
In [3]: timeit process(a)
1 loops, best of 3: 3.53 s per loop
In [4]: timeit process2(a)
1 loops, best of 3: 1.84 s per loop
where process is your version 3.
As mentioned in the other answers, version 2 is essentially the same as version 1, since cython is unable to dig into the array access operator in order to optimise it. There are two reasons for this.
First, there is a certain amount of overhead in each call to a numpy function, as compared to optimised C code. However, this overhead becomes less significant if each operation deals with large arrays.
Second, there is the creation of intermediate arrays. This is clearer if you consider a more complex operation such as
out[row, :] = A[row, :] + B[row, :]*C[row, :]. In this case a whole array B*C must be created in memory, then added to A. This means that the CPU cache is being thrashed, as data is being read from and written to memory rather than being kept in the CPU and used straight away. Importantly, this problem becomes worse if you are dealing with large arrays.
Particularly since you state that your real code is more complex than your example, and it shows a much greater speedup, I suspect that the second reason is likely to be the main factor in your case.
As an aside, if your calculations are sufficiently simple, you can overcome this effect by using numexpr, although of course cython is useful in many more situations so it may be the better approach for you.
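The intermediate-array effect can be sketched in plain NumPy: a one-shot expression allocates temporaries, while the out= forms reuse preallocated buffers (numexpr performs this kind of fusion automatically):

```python
import numpy as np

rng = np.random.default_rng(0)
A, B, C = (rng.random((4, 5)) for _ in range(3))

# One-shot expression: NumPy allocates a temporary for B*C and
# another for A + temporary before the result is stored.
out1 = A + B * C

# Using out= arguments reuses preallocated buffers, avoiding the
# temporaries (at the cost of less readable code).
tmp = np.empty_like(A)
out2 = np.empty_like(A)
np.multiply(B, C, out=tmp)
np.add(A, tmp, out=out2)

assert np.allclose(out1, out2)
```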
You must initialize the numpy C API by calling import_array().
Add this line to your cython file:
cnp.import_array()
And as pointed out by @user4815162342 and @DavidW in the comments, you must call Py_Initialize() and Py_Finalize() in main().
Thank you for your help first. I got some useful information, even though it did not directly solve my problem.
Following others' advice, rather than calling the print_me function from the .so file, I decided to call it directly from C. This is what I did.
# print_me.pyx
import numpy as np
cimport numpy as np
np.import_array()

cdef public char* print_me(f):
    cdef int[2][4] ll = [[1, 2, 3, 4], [5,6,7,8]]
    cdef np.ndarray[np.int_t, ndim=2] nll = np.zeros((4, 6), dtype=np.int)
    print nll
    nll += 1
    print nll
    return f + str(ll[1][0])
This is my .c file
// main.c
#include <python2.7/Python.h>
#include "print_me.h"

int main()
{
    // initialize python
    Py_Initialize();
    PyObject* filename = PyString_FromString("hello");
    // initialize the Cython-generated module before calling into it
    initprint_me();
    // call python-oriented function
    printf("%s\n", print_me(filename));
    // finalize python
    Py_Finalize();
    return 0;
}
I compiled then as follows
# to generate print_me.c and print_me.h
cython print_me.pyx
# to build main.c and print_me.c into main.o and print_me.o
cc -c main.c print_me.c -I/usr/include/python2.7 -I/usr/lib64/python2.7/site-packages/numpy/core/include
# to link .o files
cc -lpython2.7 -ldl main.o print_me.o -o main
# execute main
./main
This produces the following output:
[[0 0 0 0 0 0]
[0 0 0 0 0 0]
[0 0 0 0 0 0]
[0 0 0 0 0 0]]
[[1 1 1 1 1 1]
[1 1 1 1 1 1]
[1 1 1 1 1 1]
[1 1 1 1 1 1]]
hello5
Thank you for all of your help again!! :)