pytorch dataloader shared memory

How to share data among DataLoader processes to save memory

discuss.pytorch.org › t › how-to-share-data-among-dataloader-processes-to-save-memory › 108772

If you are lazily loading the data (which is the common use case, if you are dealing with large datasets), the memory overhead from the copies might be small in comparison to the overall memory usage in the script. That being said, you could try to use shared arrays as described here instead. Answer from ptrblck on discuss.pytorch.org

PyTorch Forums

discuss.pytorch.org › memory format

How to share data among DataLoader processes to save memory - Memory Format - PyTorch Forums

Top answer

2 of 6

See here, @ptrblck: "Communicating with Dataloader workers". I looked everywhere and worked on this problem for a few days before solving it. Basically, data structures instantiated using the multiprocessing package do not behave reliably. For example, things break pr…

Stack Overflow

stackoverflow.com › questions › 64173515 › can-we-share-memory-between-workers-in-a-pytorch-dataloader

Can we share memory between workers in a Pytorch DataLoader? - Stack Overflow

Top answer

1 of 2

The Pytorch documentation explicitly mentions this issue with DataLoader duplicating the underlying dataset (at least on Windows and macOS as I understand).

In general, you should not eagerly load your whole dataset into memory because of this issue. The dataset should be lazily loaded, i.e. samples should only be loaded when they are accessed in the __getitem__ method.

If your whole dataset is stored on disk as a monolithic tensor, you could split it into individual samples and save them into a folder, for instance.

You could then define your dataset as:

import torch
from torch.utils.data import Dataset, DataLoader
from glob import glob
from os.path import abspath


class MyDataset(Dataset):
  def __init__(self, folder: str):
    # Retrieve all tensor file names (sorted so indices are stable across runs)
    folder = abspath(folder)
    self.files = sorted(glob(f"{folder}/*.pt"))

  def __getitem__(self, index: int):
    # Load tensors on demand
    return torch.load(self.files[index])

  def __len__(self) -> int:
    return len(self.files)

Another solution is to memory-map the dataset. This is what HuggingFace does for huge datasets (take a look here). Memory mapping avoids loading the whole dataset into RAM and also allows it to be shared by multiple processes without copies.
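As a rough illustration of the memory-mapping idea, the sketch below assumes the samples are fixed-shape float32 arrays stored back to back in one raw binary file and read through numpy.memmap. It is not HuggingFace's implementation. The names `MemmapDataset`, `write_samples` and `demo`, as well as the file layout, are hypothetical. The class only follows the `__len__`/`__getitem__` protocol; subclassing torch.utils.data.Dataset is left out to keep the sketch small.

```python
import os
import tempfile

import numpy as np


def write_samples(path: str, samples: np.ndarray) -> None:
    """Write an (N, ...) array to disk as raw float32 bytes (hypothetical layout)."""
    samples.astype(np.float32).tofile(path)


class MemmapDataset:
    """Map-style dataset over a raw float32 file of fixed-shape samples."""

    def __init__(self, path: str, sample_shape: tuple):
        self.path = path
        self.sample_shape = tuple(sample_shape)
        item_bytes = int(np.prod(self.sample_shape)) * np.dtype(np.float32).itemsize
        self.num_samples = os.path.getsize(path) // item_bytes
        self._data = None  # opened lazily, so each process creates its own mapping

    def _array(self) -> np.memmap:
        if self._data is None:
            self._data = np.memmap(
                self.path,
                dtype=np.float32,
                mode="r",  # read-only mapping; pages are loaded on access
                shape=(self.num_samples, *self.sample_shape),
            )
        return self._data

    def __len__(self) -> int:
        return self.num_samples

    def __getitem__(self, index: int) -> np.ndarray:
        # Copy only the requested sample out of the mapping.
        return np.array(self._array()[index])


def demo() -> list:
    """Write four 3-element samples to a temporary file and read one back."""
    path = os.path.join(tempfile.mkdtemp(), "samples.bin")
    write_samples(path, np.arange(12, dtype=np.float32).reshape(4, 3))
    dataset = MemmapDataset(path, (3,))
    return [len(dataset), dataset[2].tolist()]
```

Opening the mapping on first access rather than in `__init__` keeps the object cheap to pickle, which matters when workers are started with spawn.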

2 of 2

Ray may be an interesting option for you. Check out ray training datasets!

Additionally, you could also use

data_id = ray.put(data)

to dump your data, and

data = ray.get(data_id)

to load the same files without copying them between functions.

Pytorch Dataloader Memory Leak

Hi, I noticed that while training a PyTorch model the subprocesses that are started by the dataloader workers are accumulating memory over time while loading new batches and it seems this memory is never released, ultimately resulting in a “dataloader worker does not have sufficient shared memory” ... More on discuss.pytorch.org

discuss.pytorch.org

August 30, 2023

Yuxin's Blog

ppwwyyxx.com › blog › 2022 › Demystify-RAM-Usage-in-Multiprocess-DataLoader

Demystify RAM Usage in Multi-Process Data Loaders - Yuxin's Blog

December 24, 2022 - The essence of the solution is to let all processes share memory through a single torch.Tensor object, which needs to be moved to Linux shared memory by PyTorch's custom pickling routine.
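The packing idea behind that approach can be sketched with the standard library alone: instead of a list holding many small Python objects, each with its own reference count, every item is pickled into one contiguous byte buffer with an offset table. The class name `PackedList` is hypothetical, and the step the post describes (moving the buffer into a torch.Tensor placed in shared memory) is not shown here.

```python
import pickle
from array import array
from typing import Any, Iterable


class PackedList:
    """Read-only sequence that stores all items in a single bytes buffer."""

    def __init__(self, items: Iterable[Any]):
        chunks = [pickle.dumps(item, protocol=pickle.HIGHEST_PROTOCOL) for item in items]
        # offsets[i] .. offsets[i + 1] delimit the pickled bytes of item i.
        self._offsets = array("Q", [0])
        for chunk in chunks:
            self._offsets.append(self._offsets[-1] + len(chunk))
        self._buffer = b"".join(chunks)

    def __len__(self) -> int:
        return len(self._offsets) - 1

    def __getitem__(self, index: int) -> Any:
        if not 0 <= index < len(self):
            raise IndexError(index)
        start, end = self._offsets[index], self._offsets[index + 1]
        # Only the requested item is deserialized into a fresh Python object.
        return pickle.loads(self._buffer[start:end])
```

A dataset could hold its metadata (file names, labels) in such a structure and unpickle one record per `__getitem__` call, at the cost of a small deserialization step per access.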

Stack Overflow

stackoverflow.com › questions › 60542153 › pytorch-dataset-and-shared-memory

python multiprocessing - Pytorch dataset and shared memory? - Stack Overflow

Top answer

1 of 1

The answer depends on your OS and settings. If you are using Linux with the default process start method, you don't have to worry about duplicates or process communication, because worker processes share memory! This is efficiently implemented as Inter Process Communication (IPC) through shared memory (some more details here). For Windows, things are more complicated. From the documentation:

Since workers rely on Python multiprocessing, worker launch behavior is different on Windows compared to Unix.

On Unix, fork() is the default multiprocessing start method. Using fork(), child workers typically can access the dataset and Python argument functions directly through the cloned address space.

On Windows, spawn() is the default multiprocessing start method. Using spawn(), another interpreter is launched which runs your main script, followed by the internal worker function that receives the dataset, collate_fn and other arguments through pickle serialization.

This means that your dynamically cached Dataset members would be automatically shared between all processes on Linux. That's great! However, on Windows, processes will not have received copies of them (they only received the Dataset upon spawning), so you should use a process communication scheme, e.g. through multiprocessing Pipe, Queue or Manager (preferred for broadcasting to multiple processes, but you would have to convert tensors to lists). This is not very efficient, and rather bothersome to implement.

Nevertheless, there is another method: memory mapping (memmapping). This means that your objects will be written to virtual memory, and again all processes will have access to them, while a respective "shadow copy" of these objects will at some point be flushed to your hard drive and exist there (it can be placed in a /tmp directory). You can use memory mapping with the mmap module, in which case your objects will have to be serialized as a binary file, or you can use numpy.memmap. You can find more details here.
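A minimal sketch of the mmap-module route, assuming each sample is a fixed-size record of four little-endian float32 values. The record layout and the names `write_records`, `read_record` and `demo` are hypothetical, chosen only for illustration.

```python
import mmap
import os
import struct
import tempfile

# Hypothetical layout: four little-endian float32 values per sample.
RECORD = struct.Struct("<4f")


def write_records(path: str, rows) -> None:
    """Serialize the samples into one binary file, one fixed-size record each."""
    with open(path, "wb") as f:
        for row in rows:
            f.write(RECORD.pack(*row))


def read_record(path: str, index: int) -> tuple:
    """Map the file read-only and decode a single record by byte offset."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
            return RECORD.unpack_from(mapped, index * RECORD.size)


def demo() -> tuple:
    path = os.path.join(tempfile.mkdtemp(), "records.bin")
    write_records(path, [(0.0, 1.0, 2.0, 3.0), (4.0, 5.0, 6.0, 7.0)])
    return read_record(path, 1)
```

In a real dataset the mapping would more likely be opened once per worker and kept open, rather than reopened on every read as in this sketch.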

PyTorch Forums

discuss.pytorch.org › t › how-to-share-memory-for-dataloader-when-using-multiprocess › 20790

How to share memory for Dataloader when using multiprocess? - PyTorch Forums

July 9, 2018 - I wrap my data with Dataset, then use Dataloader for enumerate. But because of copy-on-write mechanism, my memory goes so high out of expected. My problem can be simplified as following: class DataIter(Dataset): def __init__(self): self.data = range(90317731) def __len__(self): return ...

Latentwalk

latentwalk.io › 2023 › 08 › 19 › torch-shmem

PyTorch DataLoaders and Shared Memory · Walking in the Latent Space

August 19, 2023 - Unlike pipes, once a shared memory region is mapped, the kernel is not involved with data transfers which means bytes can be copied more efficiently. So when a tensor is put inside the data queue by a worker process, PyTorch creates a new shared memory region and places tensor data in it.
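The general mechanism (write bytes into a named region once, then let a consumer attach to it by name) can be sketched with Python's multiprocessing.shared_memory module. This is an analogy under stated assumptions, not PyTorch's internal code path. The function name `roundtrip_through_shared_memory` is hypothetical, and both handles live in one process here only to keep the example short.

```python
import struct
from multiprocessing import shared_memory


def roundtrip_through_shared_memory(values) -> list:
    """Write doubles into a shared region and read them back via a second handle."""
    fmt = f"<{len(values)}d"
    producer = shared_memory.SharedMemory(create=True, size=struct.calcsize(fmt))
    try:
        struct.pack_into(fmt, producer.buf, 0, *values)
        # A consumer only needs the region's name to attach to the same bytes;
        # in practice the name would be sent to another process.
        consumer = shared_memory.SharedMemory(name=producer.name)
        try:
            result = list(struct.unpack_from(fmt, consumer.buf, 0))
        finally:
            consumer.close()
    finally:
        producer.close()
        producer.unlink()  # release the region once no one needs it
    return result
```

Only the short region name has to travel through a pipe or queue; the payload itself stays in the mapped region.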

AWS

docs.aws.amazon.com › codeguru › detector-library › python › pytorch-data-loader-with-multiple-workers

Pytorch data loader with multiple workers | Amazon Q, Detector Library

Using DataLoader with num_workers greater than 0 can cause increased memory consumption over time when iterating over native Python objects such as list or dict. Pytorch uses multiprocessing in this scenario placing the data in shared memory. However, reference counting triggers copy-on-writes ...
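The effect of reference counting can be observed directly in CPython: taking one more reference to an object changes its reference count, which is stored in the object itself, so even a pure read touches the memory page that holds it. The sketch below shows this, along with one common way to reduce the number of Python objects, namely packing integers into a single array. The function names are hypothetical.

```python
import sys
from array import array


def refcount_delta_on_read(obj) -> int:
    """Return how much obj's reference count changes when one more name refers to it."""
    before = sys.getrefcount(obj)
    alias = obj  # no data is modified, yet in CPython the object's header is updated
    after = sys.getrefcount(obj)
    del alias
    return after - before


def pack_ints(values) -> array:
    """Store integers as raw 8-byte values: one Python object instead of one per element."""
    return array("q", values)
```

With a packed array (or a NumPy array or tensor), workers read raw values from a single buffer, so there are far fewer per-object headers for reference counting to write to.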

PyTorch Forums

discuss.pytorch.org › distributed

Shared memory with torch.multiprocessing - distributed - PyTorch Forums

July 4, 2020 - On top of that, I use multiple num_workers in my dataloader, so having a simple Python list as a cache would mean multiple caches, which eats up a lot of memory. The natural solution is to use shared memory. And this is how I use it: in the launch process, do if __name__ == '__main__...

PyTorch Forums

discuss.pytorch.org › data

Pytorch Dataloader Memory Leak - data - PyTorch Forums

Top answer

1 of 8

Shared memory shouldn’t be used if no multiprocessing is needed in the DataLoaders. Are you manually sharing tensors somewhere in your code?

2 of 8

Could you check if you are running into this issue?

PyTorch Forums

discuss.pytorch.org › data

Dataset size and limited shared memory - data - PyTorch Forums

January 26, 2023 - The training cannot start because I obtain the following message: RuntimeError: DataLoader worker (pid 12945) is killed by signal: Bus error. It is possible that dataloader's workers are out of shared memory. Please try to raise your shared memory limit. I’m new to PyTorch and Colab and I’m ...
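Before raising the limit, it can help to check how much shared memory is actually available. On Linux, POSIX shared memory is typically backed by a tmpfs mounted at /dev/shm; that path is an assumption about the environment, and the helper name `shared_memory_usage` is hypothetical. In Docker containers the size of this mount is commonly controlled with the `--shm-size` option of `docker run`.

```python
import os
import shutil


def shared_memory_usage(path: str = "/dev/shm"):
    """Return total/used/free bytes of the shared-memory mount, or None if it is absent."""
    if not os.path.isdir(path):
        return None
    usage = shutil.disk_usage(path)
    return {"total": usage.total, "used": usage.used, "free": usage.free}
```

If the free space is small compared with the size of the batches the workers produce, lowering num_workers or the batch size is a possible alternative to enlarging the mount.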

Kaggle

kaggle.com › product-feedback › 72606

increase pytorch shared memory | Kaggle

I'm trying out pytorch 1.0 and fastai 1.0 https://www.kaggle.com/dromosys/human-protein-fastai-v3 I get RuntimeError: DataLoader worker (pid 173) is killed...

GitHub

github.com › jotaf98 › shareddataset

GitHub - jotaf98/shareddataset: A PyTorch Dataset that caches samples in shared memory, accessible globally to all processes · GitHub

# the worker processes of a DataLoader all share the same memory. # use persistent workers to ensure the SharedDataset is not deallocated # between epochs.

Starred by 25 users

Forked by 2 users

Languages Python

Stack Overflow

stackoverflow.com › questions › 73613484 › python-multi-processing-with-shared-memory-and-pytorch-data-loader-runtimeerro

python multi processing with shared memory and pytorch data loader - RuntimeError:use CUDA with multiprocessing you must use the 'spawn' start method - Stack Overflow

I am trying to implement a program with producer and consumer classes. The producer class reads a numpy array (an image) and puts it in shared memory, and the consumer class reads the numpy array data from shared memory and applies a PyTorch inference model to it.
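The start method named in that error message can be selected explicitly through a multiprocessing context, without changing the global default. The sketch below uses only the standard library; the helper name `spawn_context` is hypothetical. DataLoader also accepts a `multiprocessing_context` argument, shown here only in a comment because it requires PyTorch.

```python
import multiprocessing as mp


def spawn_context():
    """Return a context whose Process, Queue, etc. use the 'spawn' start method."""
    return mp.get_context("spawn")


# With PyTorch installed, the same choice can be passed to a DataLoader, e.g.:
#   DataLoader(dataset, num_workers=2, multiprocessing_context="spawn")
```

Processes created from such a context start a fresh interpreter, so everything handed to them (including the dataset) must be picklable.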

GitHub

github.com › pytorch › pytorch › issues › 5040

Give a better error when we run out of shared memory, instead of "RuntimeError: DataLoader worker (pid 13) is killed by signal: Bus error." · Issue #5040 · pytorch/pytorch

February 5, 2018 - When I set num_workers=1 or other value greater than 0 in torch.utils.data.DataLoader, I get this error. The detail of the error: Traceback (most recent call last): File "/opt/project/train.py", line 150, in <module> dataset_sizes=dataset_sizes) File "/opt/project/train.py", line 51, in train_model outputs = model(inputs) File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/modules/module.py", line 357, in __call__ result = self.forward(*input, **kwargs) File "/opt/conda/envs/pytorch-py3.6/lib/python3.6/site-packages/torch/nn/parallel/data_parallel.py", line 64, in forward

Author pytorch

reddit.com › r/pytorch › dataloader and multiprocessing

r/pytorch on Reddit: Dataloader and multiprocessing

July 30, 2019 - Hi everyone! I'm working on image classification, and I have a project where we wrote the data loading part ourselves. The code can load and preprocess images for the next batch on different threads (using an output Tensor in shared memory for efficiency) while the current batch is being processed by the GPU.

But I want to implement a more complex data sampling scheme so I need something like the pytorch dataloader.

Is there a way to keep the efficiency of the old design (load next batch during inference and backprop, as few Tensors as possible) while using DataLoader?

I tried implementing something using DataLoader, but it was very inefficient, especially the execution of collate_fn.

Any advice on efficient dataloading that could be interesting?

Top answer

1 of 1

torch.utils.data.DataLoader already supports multiprocessing. What did you set num_workers and pin_memory to?

GitHub

github.com › PyTorchLightning › pytorch-lightning › issues › 2352

Shared memory leak with large dataset and num_workers > 0 · Issue #2352 · Lightning-AI/pytorch-lightning

June 25, 2020 - When I use num_workers > 0 in DataLoader I obviously use shared memory through PyTorch multiprocessing.

Author Lightning-AI

Eventual

eventual.ai › blog › pytorch-data-loader

Using PyTorch DataLoaders to Streamline Multimodal Data

October 22, 2025 - If you have a very large in-memory Dataset and spawn multiple workers, you would run out of RAM because each worker replicates that data. There are ways to work around this: Use shared memory constructs or memory-mapped files

GitHub

gist.github.com › pzelasko › cda0d8d7f4de880e2f59e4ed5e3b346a

Disable shared memory in PyTorch dataloader · GitHub

Disable shared memory in PyTorch dataloader. GitHub Gist: instantly share code, notes, and snippets.

Grokipedia

grokipedia.com › shared memory leak in pytorch dataloader

Shared memory leak in PyTorch DataLoader — Grokipedia

March 19, 2026 - The shared memory leak in PyTorch's DataLoader refers to behaviors that can lead to excessive memory consumption during long-running training loops when using multiple worker processes (num_workers > 0) with multiprocessing enabled.
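To tell steady growth from a one-off allocation, the resident memory of each worker process can be sampled over time. The sketch below reads the VmRSS field from /proc/<pid>/status, which exists on Linux only, and returns None elsewhere. The helper name `resident_set_kib` is hypothetical.

```python
import os


def resident_set_kib(pid=None):
    """Return the resident set size of a process in KiB, or None if unavailable."""
    pid = os.getpid() if pid is None else pid
    try:
        with open(f"/proc/{pid}/status") as status:
            for line in status:
                if line.startswith("VmRSS:"):
                    return int(line.split()[1])  # the field is reported in kB
    except OSError:
        return None
    return None
```

Note that RSS counts pages shared between processes once per process, so summing it over all workers can overstate the true total.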