🌐
Phoronix
phoronix.com › review › python-311-performance
Python 3.11 Performance Benchmarks Show Huge Improvement - Phoronix
As shown by benchmarking back to Python 3.8, there isn't normally too much variation in the performance department between CPython releases. But with Python 3.11 it's a big change for increasing the performance and making this de facto Python implementation more competitive to the likes of ...
🌐
Python documentation
docs.python.org › 3 › whatsnew › 3.11.html
What's New In Python 3.11 — Python 3.14.3 documentation
January 30, 2026 - In Python 3.11, the frame struct was reorganized to allow performance optimizations.
Discussions

Python 3.11 is faster than 3.8
    zig: Emulated 600 frames in 0.24s (2521fps)
    rs:  Emulated 600 frames in 0.37s (1626fps)
    cpp: Emulated 600 frames in 0.40s (1508fps)
    nim: Emulated 600 frames in 0.44s (1367fps)
    go:  Emulated 600 frames in 1.75s (342fps)
    php: Emulated 600 frames in 23.74s (25fps)
    py:  Emulated 600 frames in 26.16s ...
🌐 news.ycombinator.com
October 30, 2022
performance - Python 3.11 worse optimized than 3.10? - Stack Overflow
The older version executes in 187ms, Python 3.11 needs about 17000ms. Does 3.10 realize that only the first 5 chars of a are needed, whereas 3.11 executes the whole loop? I confirmed this performance difference on godbolt.
🌐 stackoverflow.com
Why does this specific code run faster in Python 3.11? - Stack Overflow
When I tried to run with multiple python versions I am seeing a drastic performance difference.
    C:\Users\Username\Desktop>py -3.10 benchmark.py
    16.79652149998583
    C:\Users\Username\Desktop>py -3.11 benchmark.py
    10.92280820000451
🌐 stackoverflow.com
Python 3.11 vs 3.10 performance

    12.4 ms
    6.35 ms: 1.96x faster

That's 1.96x as fast. Unless 1x faster means the exact same speed, and 0.5x faster is actually half the speed.

This is one of my biggest pet peeves in benchmarks.
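The arithmetic behind the complaint, as a quick sketch (using the 12.4 ms and 6.35 ms figures quoted above; the published 1.96x presumably comes from the unrounded timings):

```python
# "x as fast" vs "x faster", using the two timings quoted above
# (12.4 ms on 3.10 vs 6.35 ms on 3.11).
old_ms, new_ms = 12.4, 6.35

ratio = old_ms / new_ms   # "1.95x as fast"
faster_by = ratio - 1     # "0.95x faster", i.e. about 95% faster

print(f"{ratio:.2f}x as fast, which is {faster_by:.0%} faster")
# prints: 1.95x as fast, which is 95% faster
```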

🌐 r/programming
July 6, 2022
🌐
Reddit
reddit.com › r/python › how python 3.11 became so fast!!!
r/Python on Reddit: How Python 3.11 became so fast!!!
January 16, 2023 -

With Python 3.11 out, it's making quite some noise in Python circles: it has become almost 2x faster than its predecessor. But what's new in this version of Python?

New data structure: Removing the exception stack from the frame saves a large amount of memory, which is reused (and stays cache-friendly) when allocating newly created Python frame objects.

Specialized adaptive interpreter:

Each instruction is in one of two states:

  • General, with a warm-up counter: when the counter reaches zero, the instruction is specialized. (Until then it performs a general lookup.)

  • Specialized, with a miss counter: when the counter reaches zero, the instruction is de-optimized. (While specialized it looks up particular values or types of values.)

Specialized bytecode: specialization determines how memory is read (the access pattern) when a particular instruction runs. The same data can be reached in multiple ways; specialization optimizes the memory reads for that particular instruction.
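The two-state scheme above can be illustrated with a toy model in plain Python. This is an illustration only, not CPython's actual machinery; the class name, thresholds, and string labels are invented for the sketch:

```python
# Toy model of the adaptive interpreter's two instruction states:
# "general" with a warm-up counter that triggers specialization, and
# "specialized" with a miss counter that triggers de-optimization.
class AdaptiveInstruction:
    WARMUP = 2   # hypothetical thresholds, much simpler than CPython's
    MISSES = 2

    def __init__(self):
        self.state = "general"
        self.counter = self.WARMUP
        self.seen_type = None

    def execute(self, operand):
        if self.state == "general":
            self.counter -= 1
            if self.counter == 0:           # warm-up exhausted: specialize
                self.state = "specialized"
                self.seen_type = type(operand)
                self.counter = self.MISSES
            return f"general: {operand!r}"
        # Specialized path: a cheap type check guards the fast path.
        if type(operand) is self.seen_type:
            return f"specialized({self.seen_type.__name__}): {operand!r}"
        self.counter -= 1
        if self.counter == 0:               # too many misses: de-optimize
            self.state = "general"
            self.counter = self.WARMUP
        return f"miss: {operand!r}"

instr = AdaptiveInstruction()
for op in [1, 2, 3, "a", "b", 4]:
    print(instr.execute(op))
```

After two int operands the instruction specializes on int; two string operands then de-optimize it back to the general state.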

Read the full article here: https://medium.com/aiguys/how-python-3-11-is-becoming-faster-b2455c1bc555

🌐
Towards Data Science
towardsdatascience.com › home › latest › running faster than ever before
Running Faster than Ever Before | Towards Data Science
March 5, 2025 - However, with the ever increasing volumes of data on our hands, wouldn’t it be great to complete computations faster? The upcoming Python 3.11 release is highly anticipated for the expected 10–60% boost in performance in comparison to the ...
🌐
Medium
medium.com › aiguys › how-python-3-11-is-becoming-faster-b2455c1bc555
How Python 3.11 became so fast | AIGuys
November 7, 2022 - Python 3.11 became 2x faster than its predecessor. Architecture change of Python 3.11. Faster Python. Python speed comparison with other languages.
🌐
Kracekumar
kracekumar.com › post › micro-benchmark-python-311
Python 3.11 micro-benchmark · Technical Ramblings
Python 3.11 is faster than Python 3.10 by 2.89%. The median execution times: Python 3.9 - 11.46s, Python 3.10 - 11.35s, Python 3.11 - 11.13s. Since most of the code was making network calls, it's surprising to see a small performance improvement in Python 3.11.
🌐
Trendblog
trendblog.net › home › coding › python 3.11 performance benchmark show huge improvement
Python 3.11 Performance Benchmark Show Huge Improvement - Trendblog.net
December 8, 2025 - As shown by benchmarking back to Python 3.8, there isn’t normally too much variation in the performance department between CPython releases. But with Python 3.11 it’s a big change for increasing the performance and making this de facto Python implementation more competitive to the likes of Pyston and PyPy.
🌐
Andy Pearce
andy-pearce.com › blog › posts › 2022 › Dec › whats-new-in-python-311-performance-improvements
What’s New in Python 3.11 - Performance Improvements
December 16, 2022 - These changes are the work of the Faster CPython project, and they claim the changes in Python 3.11 make it around 25% faster on average, or anything from 10-60% depending on specific use-cases.
🌐
Towards Data Science
towardsdatascience.com › home › latest › python 3.11 is indeed faster than 3.10
Python 3.11 Is Indeed Faster Than 3.10 | Towards Data Science
March 5, 2025 - Python 3.11 took only 21 seconds to sort while the 3.10 counterpart took 39 seconds. An interesting performance challenge is how fast our program reads and writes information on the disk.
🌐
Europython
ep2022.europython.eu › session › how-we-are-making-python-3-11-faster
How we are making Python 3.11 faster - Mark Shannon - EuroPython 2022 | July 11th-17th 2022 | Dublin Ireland & Remote
August 14, 2022 - Python 3.11 is between 10% and 60% faster than Python 3.10, depending on the application. We have achieved this in a fully portable way by making the interpreter adapt to the program being run, and by streamlining key data structures.
🌐
Jott
jott.live › markdown › py3.11_vs_3.8
Python 3.11 is much faster than 3.8
([Python source code](https://...
    -0.169077842
    python3.11 sim.py 10000000  31.92s user 0.05s system 99% cpu 31.976 total
Python 3.11 took only **31.98 seconds**! That's 3x faster! ...
🌐
Medium
medium.com › @hieutrantrung.it › pythons-performance-revolution-how-3-11-made-speed-a-priority-4cdeee59c349
Python’s Performance Revolution: How 3.11+ Made Speed a Priority | by Trung Hiếu Trần | Medium
January 5, 2026 - ... Result: Up to 10–20% improvement in function-heavy code. ... While not strictly a performance feature, Python 3.11’s improved error messages help developers debug faster — an indirect but real performance win.
🌐
Lewoniewski
en.lewoniewski.info › 2023 › python-3-10-vs-python-3-11-performance-testing
Python 3.10 vs Python 3.11 – performance testing
October 17, 2023 - The result shows that Python 3.11 has the best performance results over Python 3.10 in the following tests: deltablue (1.63x faster), logging_silent (1.43x faster), richards (1.40x faster).
Top answer
1 of 2

TL;DR: you should not use such a loop in performance-critical code; use ''.join instead. The inefficient execution appears to be related to a regression in bytecode generation in CPython 3.11 (and missing optimizations during the evaluation of the binary add operation on Unicode strings).


General guidelines

This is an antipattern. You should not write such code if you want it to be fast. This is described in PEP 8:

Code should be written in a way that does not disadvantage other implementations of Python (PyPy, Jython, IronPython, Cython, Psyco, and such).
For example, do not rely on CPython’s efficient implementation of in-place string concatenation for statements in the form a += b or a = a + b. This optimization is fragile even in CPython (it only works for some types) and isn’t present at all in implementations that don’t use refcounting. In performance sensitive parts of the library, the ''.join() form should be used instead. This will ensure that concatenation occurs in linear time across various implementations.

Indeed, other implementations like PyPy do not perform efficient in-place string concatenation. A new, bigger string is created at every iteration (since strings are immutable, the previous one may still be referenced, and PyPy uses a garbage collector rather than reference counting). This results in a quadratic runtime, as opposed to a linear runtime in CPython (at least in past implementations).
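A minimal benchmark of the two forms from the PEP 8 advice above (the string length and repeat count are arbitrary choices for the sketch):

```python
# Compare += concatenation in a loop against the ''.join form.
import timeit

N = 10_000

def concat_plus():
    a = ""
    for _ in range(N):
        a += "a"        # relies on CPython's fragile in-place optimization
    return a

def concat_join():
    return "".join("a" for _ in range(N))   # portable linear time

assert concat_plus() == concat_join()
print("+=  :", timeit.timeit(concat_plus, number=50))
print("join:", timeit.timeit(concat_join, number=50))
```

On CPython the += form may look competitive thanks to the in-place optimization; on implementations without refcounting (or on a build hit by the 3.11.0 regression discussed below), the gap becomes dramatic.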


Deep Analysis

I can reproduce the problem on Windows 10 between the embedded (64-bit x86-64) version of CPython 3.10.8 and that of 3.11.0:

Timings:
 - CPython 3.10.8:    146.4 ms
 - CPython 3.11.0:  15186.8 ms

It turns out the code has not particularly changed between CPython 3.10 and 3.11 when it comes to Unicode string appending. See for example PyUnicode_Append: 3.10 and 3.11.

A low-level profiling analysis shows that nearly all the time is spent in one unnamed function called from another unnamed function, which is itself called by PyUnicode_Concat (also left unmodified between CPython 3.10.8 and 3.11.0). This slow unnamed function contains a fairly small set of assembly instructions, and nearly all the time is spent in one unique x86-64 instruction: rep movsb byte ptr [rdi], byte ptr [rsi]. This instruction copies a buffer pointed to by the rsi register into a buffer pointed to by the rdi register (the processor copies rcx bytes from the source buffer to the destination buffer, decrementing rcx for each byte until it reaches 0).

This shows that the unnamed function is actually memcpy from the standard MSVC C runtime (i.e. the CRT), which appears to be called by _copy_characters, itself called by _PyUnicode_FastCopyCharacters from PyUnicode_Concat (all these functions still belong to the same file). Again, these CPython functions are left unmodified between CPython 3.10.8 and 3.11.0. The non-negligible time spent in malloc/free (about 0.3 seconds) indicates that many new string objects are created, certainly at least one per iteration, matching the call to PyUnicode_New in the code of PyUnicode_Concat. All of this indicates that a new, bigger string is created and copied on every iteration, as described above.

The call to PyUnicode_Concat is thus certainly the root of the performance issue here, and I think CPython 3.10.8 is faster because it calls PyUnicode_Append instead. Both calls are performed directly by the main interpreter evaluation loop, and this loop is driven by the generated bytecode.

It turns out that the generated bytecode differs between the two versions, and this is the root of the performance issue. Indeed, CPython 3.10 generates an INPLACE_ADD bytecode instruction while CPython 3.11 generates a BINARY_OP instruction. Here is the bytecode for the loop in the two versions:

CPython 3.10 loop:

        >>   28 FOR_ITER                 6 (to 42)
             30 STORE_NAME               4 (_)
  6          32 LOAD_NAME                1 (a)
             34 LOAD_CONST               2 ('a')
             36 INPLACE_ADD                             <----------
             38 STORE_NAME               1 (a)
             40 JUMP_ABSOLUTE           14 (to 28)

CPython 3.11 loop:

        >>   66 FOR_ITER                 7 (to 82)
             68 STORE_NAME               4 (_)
  6          70 LOAD_NAME                1 (a)
             72 LOAD_CONST               2 ('a')
             74 BINARY_OP               13 (+=)         <----------
             78 STORE_NAME               1 (a)
             80 JUMP_BACKWARD            8 (to 66)
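The listings above can be reproduced with the dis module; which opcode appears depends on the interpreter running the snippet:

```python
# Disassemble the loop from the question: CPython 3.10 emits INPLACE_ADD
# for the `a += 'a'` line, while CPython 3.11+ emits BINARY_OP.
import dis

src = "a = ''\nfor _ in range(10000):\n    a += 'a'\n"
ops = {ins.opname for ins in dis.get_instructions(src)}
print(sorted(ops))
```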

This change appears to come from this issue. The code of the main interpreter loop (see ceval.c) differs between the two CPython versions. Here is the code executed by each version:

        // In CPython 3.10.8
        case TARGET(INPLACE_ADD): {
            PyObject *right = POP();
            PyObject *left = TOP();
            PyObject *sum;
            if (PyUnicode_CheckExact(left) && PyUnicode_CheckExact(right)) {
                sum = unicode_concatenate(tstate, left, right, f, next_instr); // <-----
                /* unicode_concatenate consumed the ref to left */
            }
            else {
                sum = PyNumber_InPlaceAdd(left, right);
                Py_DECREF(left);
            }
            Py_DECREF(right);
            SET_TOP(sum);
            if (sum == NULL)
                goto error;
            DISPATCH();
        }

//----------------------------------------------------------------------------

        // In CPython 3.11.0
        TARGET(BINARY_OP_ADD_UNICODE) {
            assert(cframe.use_tracing == 0);
            PyObject *left = SECOND();
            PyObject *right = TOP();
            DEOPT_IF(!PyUnicode_CheckExact(left), BINARY_OP);
            DEOPT_IF(Py_TYPE(right) != Py_TYPE(left), BINARY_OP);
            STAT_INC(BINARY_OP, hit);
            PyObject *res = PyUnicode_Concat(left, right); // <-----
            STACK_SHRINK(1);
            SET_TOP(res);
            _Py_DECREF_SPECIALIZED(left, _PyUnicode_ExactDealloc);
            _Py_DECREF_SPECIALIZED(right, _PyUnicode_ExactDealloc);
            if (TOP() == NULL) {
                goto error;
            }
            JUMPBY(INLINE_CACHE_ENTRIES_BINARY_OP);
            DISPATCH();
        }

Note that unicode_concatenate calls PyUnicode_Append (and does some reference-counting checks beforehand). In the end, CPython 3.10.8 calls PyUnicode_Append, which is fast (in-place), while CPython 3.11.0 calls PyUnicode_Concat, which is slow (out-of-place). It clearly looks like a regression to me.

People in the comments reported having no performance issue on Linux. However, experimental tests show that a BINARY_OP instruction is also generated on Linux, and so far I cannot find any Linux-specific optimization regarding string concatenation. Thus, the difference between the platforms is pretty surprising.


Update: towards a fix

I have opened an issue about this, available here. One should note that putting the code in a function is significantly faster because the variable is local (as pointed out by @Dennis in the comments).
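A minimal sketch of that workaround (loop size and repeat count are arbitrary): the same loop is timed once inside a function and once as top-level statements.

```python
# Moving the loop into a function makes `a` a local variable, which lets
# CPython take the fast in-place append path (single known reference).
import timeit

def build(n):
    a = ""
    for _ in range(n):
        a += "a"
    return a

# Same loop executed as module-level statements via a source string.
module_level = "a = ''\nfor _ in range(10000):\n    a += 'a'\n"

print("in a function  :", timeit.timeit(lambda: build(10000), number=20))
print("at module level:", timeit.timeit(module_level, number=20))
```

On an affected 3.11.0 build the module-level timing blows up while the function stays fast; on unaffected builds the two are close.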


Related posts:

  • How slow is Python's string concatenation vs. str.join?
  • Python string 'join' is faster (?) than '+', but what's wrong here?
  • Python string concatenation in for-loop in-place?
2 of 2

As mentioned in the other answer, this is indeed a regression, but it will NOT be fixed in Python 3.12. From the GitHub issue:

We aren't implementing a register VM, so the performance in 3.12+ will be like 3.11. Moving the iteration into a function will restore the n ln(n) performance.

🌐
Medium
medium.com › homelane-tech › how-python-became-60-faster-in-version-3-11-3d0d110fb87e
How Python Became 60% Faster in Version 3.11! | by Puneet Gupta | homelane-tech | Medium
November 28, 2022 - So, every frame object in python’s internal stack now consumes 160 bytes of memory compared to 240 bytes of the earlier version. There are similar improvements done across the core python codebase which resulted in better performance and speed in ...
🌐
GitHub
github.com › mandiant › capa › issues › 1846
evaluate python 3.8 vs python 3.11 performance for standalone builds · Issue #1846 · mandiant/capa
we've been using the oldest supported python build in order to support the widest array of operating systems (esp. linux with glibc). but, if newer pythons are faster, then maybe it makes sense to also build a standalone binary that makes use of the latest optimizations.
Top answer
1 of 1

There's a big section in the "what's new" page labeled "faster runtime". It looks like the most likely cause of the speedup here is PEP 659, which is a first step towards JIT optimization (perhaps not quite JIT compilation, but definitely JIT optimization).

Particularly, the lookup and call for len and str now bypass a lot of dynamic machinery in the overwhelmingly common case where the built-ins aren't shadowed or overridden. The global and builtin dict lookups to resolve the name get skipped in a fast path, and the underlying C routines for len and str are called directly, instead of going through the general-purpose function call handling.

You wanted source references, so here's one. The str call will get specialized in specialize_class_call:

    if (tp->tp_flags & Py_TPFLAGS_IMMUTABLETYPE) {
        if (nargs == 1 && kwnames == NULL && oparg == 1) {
            if (tp == &PyUnicode_Type) {
                _Py_SET_OPCODE(*instr, PRECALL_NO_KW_STR_1);
                return 0;
            }

where it detects that the call is a call to the str builtin with 1 positional argument and no keywords, and replaces the corresponding PRECALL opcode with PRECALL_NO_KW_STR_1. The handling for the PRECALL_NO_KW_STR_1 opcode in the bytecode evaluation loop looks like this:

        TARGET(PRECALL_NO_KW_STR_1) {
            assert(call_shape.kwnames == NULL);
            assert(cframe.use_tracing == 0);
            assert(oparg == 1);
            DEOPT_IF(is_method(stack_pointer, 1), PRECALL);
            PyObject *callable = PEEK(2);
            DEOPT_IF(callable != (PyObject *)&PyUnicode_Type, PRECALL);
            STAT_INC(PRECALL, hit);
            SKIP_CALL();
            PyObject *arg = TOP();
            PyObject *res = PyObject_Str(arg);
            Py_DECREF(arg);
            Py_DECREF(&PyUnicode_Type);
            STACK_SHRINK(2);
            SET_TOP(res);
            if (res == NULL) {
                goto error;
            }
            CHECK_EVAL_BREAKER();
            DISPATCH();
        }

which consists mostly of a bunch of safety prechecks and reference fiddling wrapped around a call to PyObject_Str, the C routine for calling str on an object.
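One way to watch this machinery from Python, assuming CPython 3.11+ (a sketch; the exact opcode names shown vary between 3.11 and later versions):

```python
# After a function has run enough times to pass the adaptive warm-up,
# dis.dis(..., adaptive=True) (added in 3.11) shows the quickened opcodes
# the specializing interpreter chose at run time.
import dis
import sys

def stringify(x):
    return str(x)

for i in range(64):  # warm up past the adaptive threshold
    stringify(i)

if sys.version_info >= (3, 11):
    dis.dis(stringify, adaptive=True)
else:
    dis.dis(stringify)  # older versions: plain, unspecialized bytecode
```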

Python 3.11 includes many other performance enhancements besides the above, including optimizations to stack frame creation, method lookup, common arithmetic operations, interpreter startup, and more. Most code should run much faster now, barring things like I/O-bound workloads and code that spends most of its time in C library code (like NumPy).