Creating nested functions can be a source of reference count contention in the free-threaded build: many threads concurrently creating a nested function (say, a small `square` closure) causes reference count contention on the `func_code`, `func_globals`, and `func_builtins` fields of `PyFunctionObject` (cpython/Include/cpython/funcobject.h, lines 11 to 16 at 21d2a9a):

```c
PyObject *PREFIX ## code;        /* A code object, the __code__ attribute */ \
```
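The contrived example from the original issue is not preserved in this extract. A minimal sketch of the pattern it describes might look like the following (the names `make_square` and `worker` are invented for illustration):

```python
# Hedged sketch, not the issue's original snippet: many threads concurrently
# create a nested `square` function. Every call to make_square() builds a new
# function object, which increfs/decrefs the same shared code object, globals
# dict, and builtins dict -- so those reference counts become contended.
from concurrent.futures import ThreadPoolExecutor

def make_square():
    def square(x):
        return x * x
    return square

def worker(iterations):
    for _ in range(iterations):
        make_square()

with ThreadPoolExecutor(max_workers=8) as pool:
    futures = [pool.submit(worker, 1_000) for _ in range(8)]
    for fut in futures:
        fut.result()
```

Each call returns a distinct function object, but all of them share one code object, which is where the contention concentrates.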
The code object and builtins and globals dictionaries are already configured to support deferred reference counting, but the references in PyFunctionObject are not _PyStackRefs -- they are normal "counted" references.
Note that this is an issue for nested functions (closures), but not module-level functions or methods because those are typically created infrequently.
I outline a few possible ways to address this below. My preference is for variant 2a.
Option 1: Use deferred reference counting in PyFunctionObject
Variant 1a: Use _PyStackRef in PyFunctionObject
Instead of PyObject *func_code we have _PyStackRef func_code. We use this strategy effectively in a number of other places, including the frame object and generators.
The downside of this approach is that the fields of PyFunctionObject are exposed in public headers (cpython/funcobject.h), even though they are not documented. Changing the type of func_code, func_globals, and func_builtins risks breaking backwards compatibility with some C API extensions.
Variant 1b: Use PyObject* and new bitfield
Instead of using _PyStackRef, we can keep the fields as PyObject * and store whether each field uses a deferred reference in a separate bitfield. This was the approach I took in the nogil-3.9 fork.
This has fewer compatibility issues than using _PyStackRef, but there are still compatibility hazards. It would not be safe for extensions to change func_code/func_globals/func_builtins with something like Py_SETREF(func->func_globals, new_globals) because the reference counting semantics are different.
Option 2: Use per-thread reference counting
We already use per-thread reference counting for the references from instances to their types (i.e., ob_type), if the type is a heap type. Storing the reference counts per-thread avoids most of the reference count contention on the object. This also avoids the compatibility issues with option 1 because you can use a per-thread incref with a normal Py_DECREF -- the only risk is performance, not correctness.
The challenge with this approach is that we need some quick and reliable way to index the per-thread reference count array. For heap types, we added a new field unique_id in the free-threaded build. We can do something similar for code objects, but the globals and builtins are just "normal" dictionaries and I don't think we want to add a new field for every dictionary.
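As a rough illustration of the indexing scheme, here is a pure-Python analogy (not the actual C implementation; `TrackedObject`, `PerThreadRefcounts`, and `incref` are invented names): each thread keeps its own count table indexed by the object's unique id, so no shared counter is contended.

```python
# Pure-Python analogy of per-thread reference counting keyed by a unique_id
# assigned at object creation. In C this would be an array indexed by
# unique_id; a dict stands in for it here.
import itertools
import threading

_next_id = itertools.count()

class TrackedObject:
    """Stand-in for an object that gets a unique_id at creation time."""
    def __init__(self):
        self.unique_id = next(_next_id)

class PerThreadRefcounts(threading.local):
    # threading.local re-runs __init__ in each thread on first access,
    # giving every thread its own independent counts mapping.
    def __init__(self):
        self.counts = {}  # unique_id -> this thread's local count

local_refs = PerThreadRefcounts()

def incref(obj):
    # No lock needed: each thread writes only to its own mapping. The true
    # reference count is the sum of all threads' local counts.
    local_refs.counts[obj.unique_id] = local_refs.counts.get(obj.unique_id, 0) + 1
```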
Variant 2a: Allocate space for identifier when creating globals and builtins.
When we create globals and builtins dictionaries, we allocate space for an extra Py_ssize_t unique id at the end, after the PyDictObject. The type would still just be PyDict_Type, so Python and the rest of the C API would just see a normal dictionary. We can identify these special dictionaries using a bit in ob_gc_bits or by stealing another bit from ma_version_tag.
If the globals or builtins dictionaries are replaced by user-defined dictionaries, things would still work; they just might have scaling bottlenecks.
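For instance, a function's globals can be swapped for a plain user-created dict via types.FunctionType; such a dictionary would lack the extra id slot and fall back to ordinary reference counting (correct, just potentially contended):

```python
# A function whose globals mapping is a user-created dict rather than a
# module's globals. Under variant 2a such a dict would have no unique-id
# slot, so it would use normal (potentially contended) reference counting.
import types

def double(x):
    return x * FACTOR  # FACTOR is resolved through the function's globals

custom_globals = {"FACTOR": 2, "__builtins__": __builtins__}
f = types.FunctionType(double.__code__, custom_globals, "double")
```

Calling `f(5)` returns 10 because FACTOR is found in `custom_globals`, while calling `double(5)` directly raises NameError (FACTOR is not in the module's globals).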
Variant 2b: Use a hash table for per-thread references
We can use a hash table to map PyObject* to their per-thread reference counts. This is less efficient than having a unique index into the per-thread reference count array, but avoids the need for an extra field.
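A pure-Python analogy of the hash-table variant (invented names; in C the key would be the PyObject* address, mimicked here with id()):

```python
# Per-thread hash table mapping an object's address to this thread's
# reference count. This avoids adding a field to every dictionary, at the
# cost of a hash lookup on every incref instead of a direct array index.
import threading

class HashedPerThreadRefs(threading.local):
    def __init__(self):
        self.table = {}  # id(obj) -> this thread's reference count

refs = HashedPerThreadRefs()

def incref(obj):
    key = id(obj)  # stands in for the PyObject* address
    refs.table[key] = refs.table.get(key, 0) + 1
```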