The PyTorch documentation explicitly mentions this issue: DataLoader worker processes duplicate the underlying dataset (at least on Windows and macOS, as I understand it).
In general, you should not eagerly load your whole dataset into memory because of this issue. The dataset should be lazily loaded, i.e. samples should only be loaded when they are accessed in the __getitem__ method.
If your whole dataset is stored on disk as a monolithic tensor, you could split it into individual samples and save them into a folder, for instance.
You could then define your dataset as:
import torch
from torch.utils.data import Dataset, DataLoader
from glob import glob
from os.path import abspath

class MyDataset(Dataset):
    def __init__(self, folder: str):
        # Retrieve all tensor file names (sorted for a deterministic order)
        folder = abspath(folder)
        self.files = sorted(glob(f"{folder}/*.pt"))

    def __getitem__(self, index: int):
        # Load tensors on demand
        return torch.load(self.files[index])

    def __len__(self) -> int:
        return len(self.files)
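To produce the per-sample files in the first place, something like the following might work. This is only a sketch: the helper `split_to_folder`, the `sample_000000.pt` naming scheme, and the toy tensor are assumptions made for illustration, and `FolderDataset` simply repeats the lazy-loading idea of the dataset class above so that the snippet runs on its own.

```python
import tempfile
from glob import glob
from os.path import join

import torch
from torch.utils.data import DataLoader, Dataset


class FolderDataset(Dataset):
    """One .pt file per sample, loaded lazily in __getitem__."""

    def __init__(self, folder: str):
        # Sorting keeps the sample order stable across runs
        self.files = sorted(glob(join(folder, "*.pt")))

    def __getitem__(self, index: int):
        return torch.load(self.files[index])

    def __len__(self) -> int:
        return len(self.files)


def split_to_folder(data: torch.Tensor, folder: str) -> None:
    """Save each row of `data` as its own file (hypothetical naming scheme)."""
    for i, sample in enumerate(data):
        # clone() so each file stores only this sample, not the whole
        # storage that the row is a view of
        torch.save(sample.clone(), join(folder, f"sample_{i:06d}.pt"))


# Toy stand-in for the monolithic tensor: 10 samples of 4 features each
folder = tempfile.mkdtemp()
data = torch.arange(40, dtype=torch.float32).reshape(10, 4)
split_to_folder(data, folder)

dataset = FolderDataset(folder)
# num_workers=0 keeps the sketch runnable as a plain script
loader = DataLoader(dataset, batch_size=4, shuffle=False, num_workers=0)
batches = list(loader)
```

With `num_workers > 0`, each worker only receives the list of file names rather than the data itself, which is the point of the lazy design. On Windows and macOS the loader code should then sit under an `if __name__ == "__main__":` guard.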
Another solution is to memory-map the dataset. This is what HuggingFace does for huge datasets. Memory mapping avoids loading the whole dataset into RAM and also allows it to be shared across multiple processes without copies.
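As a rough illustration of the memory-mapping idea, here is a minimal sketch that uses `numpy.memmap` on a raw binary file. This is not the mechanism HuggingFace uses (their datasets are backed by Apache Arrow files); the class name `MemmapDataset`, the file name `data.bin`, and the fixed shape and dtype are assumptions chosen for the example.

```python
import os
import tempfile

import numpy as np
import torch
from torch.utils.data import Dataset


class MemmapDataset(Dataset):
    """Reads samples from a raw binary file through a memory map."""

    def __init__(self, path: str, shape: tuple, dtype=np.float32):
        self.path, self.shape, self.dtype = path, shape, dtype
        # Opened lazily so that each worker process maps the file itself
        self._data = None

    def _array(self) -> np.memmap:
        if self._data is None:
            self._data = np.memmap(
                self.path, dtype=self.dtype, mode="r", shape=self.shape
            )
        return self._data

    def __getitem__(self, index: int) -> torch.Tensor:
        # np.array(...) copies only this one sample out of the mapped file
        return torch.from_numpy(np.array(self._array()[index]))

    def __len__(self) -> int:
        return self.shape[0]


# Write a toy dataset (10 samples of 4 floats) to disk once
path = os.path.join(tempfile.mkdtemp(), "data.bin")
source = np.arange(40, dtype=np.float32).reshape(10, 4)
writer = np.memmap(path, dtype=np.float32, mode="w+", shape=source.shape)
writer[:] = source
writer.flush()
del writer

mm_dataset = MemmapDataset(path, shape=(10, 4))
sample = mm_dataset[3]
```

Only the pages that are actually touched get read from disk, and the operating system's page cache can be shared between processes that map the same file, so workers do not each hold a private copy of the full array.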
Ray may be an interesting option for you. Check out Ray training datasets!
Additionally, you could use
data_id = ray.put(data)
to store your data in Ray's shared object store, and
data = ray.get(data_id)
to retrieve the same data in other functions without copying it between them.
Hi everyone! I am working on image classification, and I have a project where we wrote the data loading part ourselves. The code can load and preprocess images for the next batch on a different thread (using an output Tensor in shared memory for efficiency) while the current batch is being processed by the GPU.
But I want to implement a more complex data sampling scheme, so I need something like the PyTorch DataLoader.
Is there a way to keep the efficiency of the old design (load next batch during inference and backprop, as few Tensors as possible) while using DataLoader?
I tried implementing something using DataLoader, but it was very inefficient, especially the execution of collate_fn.
Any advice on efficient data loading would be appreciated.