Indexing individual array elements

This is the main type of code where Cython can really help you. Indexing an individual element (e.g. an = a[n]) can be a fairly slow operation in Python: partly because Python is not a hugely quick language, so running Python code many times within a loop is slow, and partly because the array is stored as a tightly-packed array of C floats while the indexing operation needs to return a Python object. Indexing a Numpy array therefore requires a new Python object to be allocated.

In Cython you can declare the arrays as typed memoryviews or as np.ndarray. (Typed memoryviews are the more modern approach and you should usually prefer them.) Doing so allows you to access the tightly-packed C array directly and retrieve the C value without creating a Python object.

The directives cython.boundscheck and cython.wraparound can be very worthwhile for further speed-ups to indexing (but remember they do remove useful features - bounds checking and negative indexing - so think before using them).
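As a minimal sketch of what this looks like (illustrative names, not code from the original question; compile with cythonize):

```cython
# illustrative sketch - compile as Cython
cimport cython

@cython.boundscheck(False)  # skip per-access bounds checks
@cython.wraparound(False)   # give up negative-index support inside this function
def total(double[::1] a):
    cdef double s = 0.0
    cdef Py_ssize_t i
    for i in range(a.shape[0]):
        s += a[i]           # direct C read from the buffer, no Python object created
    return s
```

Each a[i] here compiles to a plain C array access, which is the whole point of typing the argument as a memoryview.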

vs vectorization

A lot of the time a loop over a Numpy array can be written as a vectorized operation - something that acts on the whole array at once. It is usually a good idea to write Python+Numpy code like this. If you have multiple chained vectorized operations, it is sometimes worthwhile to write it explicitly as a Cython loop to avoid allocating intermediate arrays.
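To make the trade-off concrete, here is a small pure-NumPy illustration (the explicit Python loop below stands in for what a compiled Cython loop would do; uncompiled it is of course much slower):

```python
import numpy as np

a = np.linspace(0.0, 1.0, 10_000)

# Chained vectorized form: every step allocates a full intermediate array
# (a * 2.0, then + 1.0, then the sqrt result).
chained = np.sqrt(a * 2.0 + 1.0)

# The same computation fused into one pass over the data, allocating only
# the output. Compiled by Cython (with typed variables) this runs at C speed.
fused = np.empty_like(a)
for i in range(a.shape[0]):
    fused[i] = (a[i] * 2.0 + 1.0) ** 0.5
```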

Alternatively, the little-known Pythran backend for Cython can convert a set of vectorized Numpy operations into optimized C++ code.

Indexing array slices

This isn't a problem in Cython, but it typically isn't something that will get you significant speed-ups on its own.

Calling Numpy functions

e.g. last = np.sin(an)

These require a Python call, so Cython typically cannot accelerate them - it has no visibility into the contents of the Numpy function.

However, here the operation is on a single value, and not on a Numpy array. In this case we can use sin from the C standard library, which will be significantly quicker than a Python function call. You'd do from libc.math cimport sin and call sin rather than np.sin.
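A sketch of that substitution (the function name is illustrative):

```cython
# illustrative sketch - compile as Cython
from libc.math cimport sin   # the C library's sin, not np.sin

def sin_of(double x):
    return sin(x)   # a plain C function call, no NumPy ufunc dispatch
```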

Numba is an alternative Python accelerator that has better visibility into Numpy functions and can often optimize them without code changes.

Array allocations

e.g. transformed_a = np.zeros_like(a).

This is just a Numpy function call and thus Cython has no ability to speed it up. If it's just an intermediate value and not returned to Python, then you might consider a fixed-size C array on the stack:

cdef double transformed_a[10]  # note - you must know the size at compile-time

or allocating the array with the C malloc function (remember to free it), or using Cython's cython.view.array (which is still a Python object, but can be a little quicker).
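A malloc-based version might look roughly like this (an illustrative sketch, with error handling kept minimal):

```cython
# illustrative sketch - compile as Cython
from libc.stdlib cimport malloc, free

def doubled_sum(double[::1] a):
    cdef Py_ssize_t i, n = a.shape[0]
    cdef double* buf = <double*>malloc(n * sizeof(double))
    cdef double s = 0.0
    if buf == NULL:
        raise MemoryError()
    try:
        for i in range(n):
            buf[i] = 2.0 * a[i]   # the intermediate lives only in C memory
            s += buf[i]
        return s
    finally:
        free(buf)                 # always release the malloc'd buffer
```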

Whole-array arithmetic

e.g. transformed_a * b, which multiplies transformed_a and b element by element.

Cython doesn't help you here - it's just a disguised function call (although Pythran+Cython may have some benefits). For large arrays this kind of operation is pretty efficient in Numpy so don't overthink it.

Note that whole-array operations aren't defined for Cython typed memoryviews, so you need to do np.asarray(memview) to get them back to Numpy arrays. This typically doesn't need a copy and is quick.
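You can check the zero-copy behaviour from plain Python, using a built-in memoryview as a stand-in for a Cython typed memoryview (both expose the buffer protocol):

```python
import numpy as np

a = np.arange(5.0)
m = memoryview(a)     # stand-in for a Cython typed memoryview
back = np.asarray(m)  # wraps the existing buffer rather than copying it

print(np.shares_memory(a, back))  # → True
```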

For some operations like this, you can use BLAS and LAPACK functions (which are fast C implementations of array and matrix operations). Scipy provides a Cython interface for them (https://docs.scipy.org/doc/scipy/reference/linalg.cython_blas.html). They're a little more complex to use than the natural Python code.
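For instance, a dot product through that interface looks roughly like this (a sketch; note the Fortran-style pointer arguments, and it assumes non-empty contiguous double vectors):

```cython
# illustrative sketch - compile as Cython against SciPy
from scipy.linalg.cython_blas cimport ddot

def blas_dot(double[::1] x, double[::1] y):
    cdef int n = <int>x.shape[0]
    cdef int inc = 1                       # stride 1: contiguous vectors
    return ddot(&n, &x[0], &inc, &y[0], &inc)
```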

The illustrative example

Just for completeness, I'd write it something like:

import numpy as np
from libc.math cimport sin
cimport cython

@cython.boundscheck(False)
@cython.wraparound(False)
def some_func(double[::1] a, b):
    cdef double[::1] transformed_a = np.zeros_like(a)
    cdef double last = 0
    cdef double an, delta
    cdef Py_ssize_t n
    for n in range(1, a.shape[0]):
        an = a[n]
        if an > 0:
            delta = an - a[n-1]
            transformed_a[n] = delta*last
        else:
            last = sin(an)
    return np.asarray(transformed_a) * b

which is a little over 10x faster.

cython -a is helpful here - it produces an annotated HTML file that shows which lines contain a lot of interaction with Python.

Answer from DavidW on Stack Overflow

Top answer (1 of 5, 53 votes)

With slight modification, version 3 becomes twice as fast:

@cython.boundscheck(False)
@cython.wraparound(False)
@cython.nonecheck(False)
def process2(np.ndarray[DTYPE_t, ndim=2] array):

    cdef unsigned int rows = array.shape[0]
    cdef unsigned int cols = array.shape[1]
    cdef unsigned int row, col, row2
    cdef np.ndarray[DTYPE_t, ndim=2] out = np.zeros((rows, cols))  # zeros, not empty: out is accumulated into with +=

    for row in range(rows):
        for row2 in range(rows):
            for col in range(cols):
                out[row, col] += array[row2, col] - array[row, col]

    return out

The bottleneck in your calculation is memory access. Your input array is C ordered, which means that moving along the last axis makes the smallest jump in memory. Therefore your inner loop should be along axis 1, not axis 0. Making this change cuts the run time in half.
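The layout driving this is visible from the strides (a quick check in plain NumPy):

```python
import numpy as np

a = np.zeros((1000, 1000))  # float64, C ordered by default

# Stepping along the last axis moves 8 bytes in memory; along the first,
# 8000 bytes - so the inner loop should run along the last axis.
print(a.strides)  # → (8000, 8)
```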

If you need to use this function on small input arrays then you can reduce the overhead by using np.empty instead of np.ones. To reduce the overhead further use PyArray_EMPTY from the numpy C API.

If you use this function on very large input arrays (more than 2**31 elements) then the integers used for indexing (and in the range function) will overflow. To be safe, use:

cdef Py_ssize_t rows = array.shape[0]
cdef Py_ssize_t cols = array.shape[1]
cdef Py_ssize_t row, col, row2

instead of

cdef unsigned int rows = array.shape[0]
cdef unsigned int cols = array.shape[1]
cdef unsigned int row, col, row2

Timing:

In [2]: a = np.random.rand(10000, 10)
In [3]: timeit process(a)
1 loops, best of 3: 3.53 s per loop
In [4]: timeit process2(a)
1 loops, best of 3: 1.84 s per loop

where process is your version 3.

Answer 2 of 5 (36 votes)

As mentioned in the other answers, version 2 is essentially the same as version 1, since Cython is unable to dig into the array access operator in order to optimise it. There are two reasons for this:

  • First, there is a certain amount of overhead in each call to a Numpy function, as compared to optimised C code. However, this overhead becomes less significant if each operation deals with large arrays.

  • Second, there is the creation of intermediate arrays. This is clearer if you consider a more complex operation such as out[row, :] = A[row, :] + B[row, :]*C[row, :]. In this case a whole array B*C must be created in memory, then added to A. This means that the CPU cache is being thrashed, as data is being read from and written to memory rather than being kept in the CPU and used straight away. Importantly, this problem becomes worse if you are dealing with large arrays.

Particularly since you state that your real code is more complex than your example, and it shows a much greater speedup, I suspect that the second reason is likely to be the main factor in your case.
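One pure-NumPy way to see, and partly avoid, the intermediate-array cost is the out= argument of the ufuncs, which reuses a single scratch buffer instead of allocating B[row, :]*C[row, :] afresh on every iteration (shapes here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
A, B, C = (rng.random((4, 100)) for _ in range(3))

out = np.empty_like(A)
tmp = np.empty(A.shape[1])  # one reusable scratch row
for row in range(A.shape[0]):
    np.multiply(B[row, :], C[row, :], out=tmp)  # B*C without a fresh allocation
    np.add(A[row, :], tmp, out=out[row, :])
```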

As an aside, if your calculations are sufficiently simple, you can overcome this effect by using numexpr, although of course cython is useful in many more situations so it may be the better approach for you.

Top answer (1 of 2, 5 votes)

You must initialize the numpy C API by calling import_array().

Add this line to your cython file:

cnp.import_array()

And as pointed out by @user4815162342 and @DavidW in the comments, you must call Py_Initialize() and Py_Finalize() in main().

Answer 2 of 2 (0 votes)

Thank you for your help. I gathered some useful information, even though it did not directly solve my problem.

Following the others' advice, rather than calling the print_me function from the .so file, I decided to call it directly from C. This is what I did.

# print_me.pyx
import numpy as np
cimport numpy as np

np.import_array()

cdef public char* print_me(f):
    cdef int[2][4] ll = [[1, 2, 3, 4], [5,6,7,8]]
    cdef np.ndarray[np.int_t, ndim=2] nll = np.zeros((4, 6), dtype=np.int)
    print nll
    nll += 1
    print nll
    return f + str(ll[1][0])

This is my .c file

// main.c
#include <python2.7/Python.h>
#include "print_me.h"

int main()
{
    // initialize python
    Py_Initialize();
    PyObject* filename = PyString_FromString("hello");
    initprint_me();   // module init function Cython generates for print_me.pyx

    // call python-oriented function
    printf("%s\n", print_me(filename));

    // finalize python
    Py_Finalize();
    return 0;
}

I compiled them as follows:

# to generate print_me.c and print_me.h
cython print_me.pyx

# to build main.c and print_me.c into main.o and print_me.o
cc -c main.c print_me.c -I/usr/include/python2.7 -I/usr/lib64/python2.7/site-packages/numpy/core/include

# to link .o files
cc -lpython2.7 -ldl main.o print_me.o -o main

# execute main
./main

This produces the following output:

[[0 0 0 0 0 0]
 [0 0 0 0 0 0]
 [0 0 0 0 0 0]
 [0 0 0 0 0 0]]
[[1 1 1 1 1 1]
 [1 1 1 1 1 1]
 [1 1 1 1 1 1]
 [1 1 1 1 1 1]]
hello5

Thank you for all of your help again!! :)
