As there was a recent discussion on Python's speed, here is a collection of good articles discussing why Python is slow in terms of raw executed CPU instructions, and why making it fast poses extra challenges.
- Pycon talk: Anthony Shaw - Why Python is slow
- Pycon talk: Mark Shannon - How we are making CPython faster
- Python 3.13 will ship with --enable-jit, --disable-gil
- Python performance: it's not just the interpreter
- Cinder: Instagram's performance-oriented Python fork
Also remember that raw CPU speed rarely matters: many workloads are IO-bound or network-bound, or performance is simply not a concern... or, put differently, Python trades some software development cost for increased hardware cost. In these cases, Python extensions and specialised libraries can do the heavy lifting outside the interpreter (PyArrow, Polars, Pandas, Numba, etc.).
- performance - How to speed up Python execution? - Stack Overflow
- performance - What tools or approaches are available to speed up code written in Python? - Computational Science Stack Exchange
- News: faster CPython, JIT and 3.14
- Performance hacks for faster Python code
To answer your last question first, if you have a problem with performance, then it's worth it. That's the only criterion, really.
As for how:
If your algorithm is slow because it's computationally expensive, consider rewriting it as a C extension, or use Cython, which will let you write fast extensions in a Python-esque language. Also, PyPy is getting faster and faster and may just be able to run your code without modification.
If the code is not computationally expensive per iteration, but it loops a huge number of times, it may be possible to break it down with the multiprocessing module, so it gets done in parallel.
Lastly, if this is some kind of basic data splatting task, consider using a fast data store. All the major relational databases are optimised up the wazoo, and you may find that your task can be sped up simply by getting the database to do it for you. You may even be able to shape it to fit a Redis store, which can aggregate big data sets brilliantly.
The only real way to know would be to profile and measure. Your code could be doing anything. "doSomething" might be a time.sleep(10) in which case, forking off 10000000 processes would make the whole program run in approximately 10 seconds (ignoring the forking overhead and resulting slowdowns).
Use http://docs.python.org/library/profile.html and check to see where the bottlenecks are, then see if you can optimise the "fully optimised" program with better coding. If it's already fast enough, stop.
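A quick sketch of what that looks like with the stdlib profiler (`doSomething` here is just a dummy workload echoing the example above):

```python
import cProfile
import io
import pstats

def doSomething(n):
    # Dummy workload standing in for whatever your program actually does
    return sum(i * i for i in range(n))

profiler = cProfile.Profile()
profiler.enable()
total = sum(doSomething(1000) for _ in range(100))
profiler.disable()

# Dump the 5 most expensive entries, sorted by cumulative time
stream = io.StringIO()
pstats.Stats(profiler, stream=stream).sort_stats("cumulative").print_stats(5)
print(stream.getvalue())
```

The `cumulative` column tells you which call trees dominate; optimise those first and re-measure.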
Then, depending on whether it's CPU or I/O bound and the hardware you have, you might want to try multiprocessing or threading. You can also try distributing to multiple machines and doing some map/reduce kind of thing if your problem can be broken down.
I'm going to break up my answer into three parts: profiling, speeding up the Python code via C, and speeding up Python via Python. It is my view that Python has some of the best tools for looking at your code's performance and then drilling down to the actual bottlenecks. Speeding up code without profiling is about like trying to kill a deer with an Uzi.
If you are really only interested in mat-vec products, I would recommend scipy.sparse.
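To illustrate (the matrix contents here are arbitrary): a CSR-format sparse matrix from scipy.sparse makes mat-vec products cheap by only storing and touching the nonzero entries.

```python
import numpy as np
from scipy import sparse

# Build a small sparse matrix in COO form, then convert to CSR,
# which is the layout optimised for matrix-vector products
rows = np.array([0, 1, 2, 2])
cols = np.array([0, 2, 1, 2])
vals = np.array([1.0, 2.0, 3.0, 4.0])
A = sparse.coo_matrix((vals, (rows, cols)), shape=(3, 3)).tocsr()

x = np.array([1.0, 1.0, 1.0])
y = A @ x  # sparse mat-vec: O(number of nonzeros), not O(n*n)
print(y)
```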
Python tools for profiling
profile and cProfile modules: These modules will give you your standard run-time analysis and function call stack. It is pretty nice to save their statistics, and using the pstats module you can look at the data in a number of ways.
kernprof: this tool puts together many routines for doing things like line-by-line code timing.
memory_profiler: this tool produces a line-by-line memory footprint of your code.
IPython timers: The timeit function is quite nice for seeing the differences in functions in a quick interactive way.
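Outside IPython, the stdlib timeit module gives you the same quick comparisons (the statements being compared here are just an example):

```python
import timeit

setup = "data = list(range(1000))"

# Manual accumulation loop vs. the C-implemented builtin
t_loop = timeit.timeit("total = 0\nfor x in data: total += x",
                       setup=setup, number=1000)
t_builtin = timeit.timeit("sum(data)", setup=setup, number=1000)

print(f"loop: {t_loop:.4f}s  sum(): {t_builtin:.4f}s")
```

In IPython the equivalent is just `%timeit sum(data)` at the prompt, which also picks a sensible repeat count for you.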
Speeding up Python
Cython: Cython is the quickest way to take a few functions in Python and get faster code. You can annotate the function with Cython's variant of Python and it generates C code. This is very maintainable and can also link to other hand-written C/C++/Fortran code quite easily. It is by far the preferred tool today.
ctypes: ctypes will allow you to write your functions in C, compile them to a shared library, and then wrap them quickly from Python. It handles the pain of converting between Python objects and C types, and it releases the GIL while the C function runs.
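A minimal sketch of the ctypes pattern, assuming a Unix-like system (rather than compiling our own C, this borrows `sqrt` from the system math library; the library name resolution is platform-dependent):

```python
import ctypes
import ctypes.util

# Locate and load the C math library; on Linux this resolves to libm.so.*
libm = ctypes.CDLL(ctypes.util.find_library("m"))

# Declare the C signature: double sqrt(double).
# Without this, ctypes would default to int arguments/returns.
libm.sqrt.restype = ctypes.c_double
libm.sqrt.argtypes = [ctypes.c_double]

print(libm.sqrt(2.0))
```

Wrapping your own compiled `.so`/`.dll` works the same way: `CDLL("./mylib.so")`, then declare `argtypes`/`restype` for each function you call.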
Other approaches exist for writing your code in C but they are all somewhat more for taking a C/C++ library and wrapping it in Python.
Python-only approaches
If you want to stay mostly inside Python, my advice is to figure out what data you are using and pick the correct data types for implementing your algorithms. It has been my experience that you will usually get much farther by optimizing your data structures than by any low-level C hack. For example:
numpy: a contiguous array type, very fast for strided operations on arrays
numexpr: a numpy array expression optimizer. It allows for multithreading numpy array expressions and also gets rid of the numerous temporaries numpy makes because of restrictions of the Python interpreter.
blist: a b-tree implementation of a list, very fast for inserting, indexing, and moving the internal nodes of a list
pandas: data frames (or tables) with very fast analytics on top of the arrays.
pytables: fast structured hierarchical tables (like hdf5), especially good for out of core calculations and queries to large data.
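To make the data-structure point concrete with the first item on that list (array size chosen arbitrarily): the same arithmetic expressed as one numpy operation runs as a single C loop over a contiguous buffer, instead of a Python-level loop of boxed objects.

```python
import numpy as np

a = np.arange(1_000_000, dtype=np.float64)

# Vectorized: a * a and .sum() each dispatch one C loop over the buffer
vec = (a * a).sum()

# Pure-Python equivalent of the same reduction -- typically around
# two orders of magnitude slower on arrays this size:
#   slow = sum(x * x for x in a)

print(vec)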
First of all, if there is a C or Fortran implementation available (MATLAB MEX function?), why don't you write a Python wrapper?
If you want your own implementation, not only a wrapper, I would strongly suggest using the numpy module for the linear algebra. Make sure it is linked against an optimized BLAS (ATLAS, GotoBLAS, uBLAS, Intel MKL, ...), and use Cython or weave. Read this Performance Python article for a good introduction and benchmark. The different implementations in this article are available for download here, courtesy of Travis Oliphant (numpy guru).
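You can check which BLAS/LAPACK backend your numpy build is linked against, and confirm that large matrix products go through it (the matrix sizes below are arbitrary):

```python
import numpy as np

# Prints the BLAS/LAPACK libraries this numpy build was compiled against
np.show_config()

# A matmul of this size is dispatched to the BLAS gemm routine,
# not executed as a Python-level triple loop
rng = np.random.default_rng(0)
A = rng.random((500, 500))
B = rng.random((500, 500))
C = A @ B
print(C.shape)
```

If `show_config()` reports a generic or reference BLAS, switching to an optimized one (e.g. via your distribution's OpenBLAS or MKL numpy packages) is often the single biggest linear-algebra speedup available.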
Good luck.