With a tf.data pipeline, there are several spots where you can parallelize. Depending on how your data are stored and read, you can parallelize reading. You can also parallelize augmentation, and you can prefetch data as you train, so your GPU (or other hardware) is never hungry for data.
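For example, if your data are stored as sharded TFRecord files, reading can be parallelized with interleave. A minimal sketch, assuming TFRecord storage (the file pattern and the parse_example function are placeholders, not part of the answer below):

files = tf.data.Dataset.list_files("data/train-*.tfrecord")
dataset = files.interleave(
    tf.data.TFRecordDataset,
    cycle_length=4,                       # read four files concurrently
    num_parallel_calls=tf.data.AUTOTUNE,
)
# parse_example would decode each serialized record into tensors.
dataset = dataset.map(parse_example, num_parallel_calls=tf.data.AUTOTUNE)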

In the code below, I demonstrate how to parallelize augmentation and add prefetching.

import numpy as np
import tensorflow as tf

x_shape = (32, 32, 3)
y_shape = ()  # A single item (not array).
classes = 10

# This is tf.data.experimental.AUTOTUNE in older tensorflow.
AUTOTUNE = tf.data.AUTOTUNE

def generator_fn(n_samples):
    """Return a function that takes no arguments and returns a generator."""
    def generator():
        for i in range(n_samples):
            # Synthesize an image and a class label.
            x = np.random.random_sample(x_shape).astype(np.float32)
            y = np.random.randint(0, classes, size=y_shape, dtype=np.int32)
            yield x, y
    return generator

def augment(x, y):
    return x * tf.random.normal(shape=x_shape), y

samples = 10
batch_size = 5
epochs = 2

# Create dataset.
gen = generator_fn(n_samples=samples)
dataset = tf.data.Dataset.from_generator(
    generator=gen, 
    output_types=(np.float32, np.int32), 
    output_shapes=(x_shape, y_shape)
)
# Parallelize the augmentation.
dataset = dataset.map(
    augment, 
    num_parallel_calls=AUTOTUNE,
    # Order does not matter.
    deterministic=False
)
dataset = dataset.batch(batch_size, drop_remainder=True)
# Prefetch some batches.
dataset = dataset.prefetch(AUTOTUNE)

# Prepare model.
model = tf.keras.applications.VGG16(weights=None, input_shape=x_shape, classes=classes)
model.compile(optimizer="adam", loss="sparse_categorical_crossentropy")

# Train. Do not specify batch size because the dataset takes care of that.
model.fit(dataset, epochs=epochs)
Answer from jkr on Stack Overflow
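Note: in newer TensorFlow (2.4+), the output_types and output_shapes arguments to from_generator are deprecated in favor of output_signature. A sketch of the equivalent call for the dataset above:

dataset = tf.data.Dataset.from_generator(
    generator=gen,
    output_signature=(
        tf.TensorSpec(shape=x_shape, dtype=tf.float32),
        tf.TensorSpec(shape=y_shape, dtype=tf.int32),
    )
)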
Top answer (1 of 3), from a related question:
Custom Image Data Generator

Load directory data into a dataframe for the CustomDataGenerator

import os
import pandas as pd

def data_to_df(data_dir, subset=None, validation_split=None):
    """Walk the class subfolders of data_dir into a (filenames, labels) dataframe."""
    filenames = []
    labels = []

    # Map each class-folder name to an integer label.
    class_names = sorted(os.listdir(data_dir))
    name_to_idx = {name: idx for idx, name in enumerate(class_names)}

    for dataset in class_names:
        img_list = os.listdir(os.path.join(data_dir, dataset))
        label = name_to_idx[dataset]

        for image in img_list:
            filenames.append(os.path.join(data_dir, dataset, image))
            labels.append(label)

    df = pd.DataFrame({"filenames": filenames, "labels": labels})

    if subset == "train":
        # Shuffle before splitting so the split is not ordered by class.
        df = df.sample(frac=1).reset_index(drop=True)
        split_index = int(len(df) * validation_split)
        train_df = df[split_index:]
        val_df = df[:split_index]
        return train_df, val_df

    return df

train_df, val_df = data_to_df(train_dir, subset="train", validation_split=0.2)

Custom Data Generator


import math

import numpy as np
import tensorflow as tf
from PIL import Image
# Assumption: use the preprocess_input that matches your backbone.
from tensorflow.keras.applications.vgg16 import preprocess_input

class CustomDataGenerator(tf.keras.utils.Sequence):

    ''' Custom DataGenerator to load img

    Arguments:
        data_frame = pandas dataframe with "filenames" and "labels" columns
        batch_size = number of samples per batch
        img_shape = image shape in (h, w, d) format
        augmentation = whether to apply data augmentation, to make the model robust to overfitting
        num_classes = number of output classes

    Output:
        img: numpy array of images
        label: output labels for the batch
    '''

    def __init__(self, data_frame, batch_size=10, img_shape=None, augmentation=True, num_classes=None):
        self.data_frame = data_frame.reset_index(drop=True)
        self.train_len = len(data_frame)
        self.batch_size = batch_size
        self.img_shape = img_shape
        self.augmentation = augmentation
        self.num_classes = num_classes
        print(f"Found {self.data_frame.shape[0]} images belonging to {self.num_classes} classes")

    def __len__(self):
        ''' return total number of batches '''
        return math.ceil(self.train_len / self.batch_size)

    def on_epoch_end(self):
        ''' shuffle data after every epoch '''
        self.data_frame = self.data_frame.sample(frac=1).reset_index(drop=True)

    def __data_augmentation(self, img):
        ''' apply some data augmentation '''
        # Explicit axes for an (h, w, d) image; the Keras defaults assume channels-first.
        img = tf.keras.preprocessing.image.random_shift(
            img, 0.2, 0.3, row_axis=0, col_axis=1, channel_axis=2)
        img = tf.image.random_flip_left_right(img)
        img = tf.image.random_flip_up_down(img)
        return img

    def __get_image(self, file_id):
        """ open the image at file_id, resize, augment, and preprocess """
        # PIL's resize interpolates; np.resize would just tile/crop the raw buffer.
        img = Image.open(file_id).convert("RGB").resize(self.img_shape[1::-1])
        img = np.asarray(img, dtype=np.float32)
        if self.augmentation:
            img = self.__data_augmentation(img)
        img = preprocess_input(img)
        return img

    def __get_label(self, label_id):
        """ uncomment the line below to convert labels to one-hot format """
        # label_id = tf.keras.utils.to_categorical(label_id, self.num_classes)
        return label_id

    def __getitem__(self, idx):
        batch_x = self.data_frame["filenames"].iloc[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_y = self.data_frame["labels"].iloc[idx * self.batch_size:(idx + 1) * self.batch_size]
        # read your data here using the batch lists, batch_x and batch_y
        x = [self.__get_image(file_id) for file_id in batch_x]
        y = [self.__get_label(label_id) for label_id in batch_y]

        return tf.convert_to_tensor(x), tf.convert_to_tensor(y)
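A minimal usage sketch of the two pieces together (the directory, shapes, class count, and model are hypothetical):

# Hypothetical paths and hyperparameters, just to show the wiring.
train_df, val_df = data_to_df("data/train", subset="train", validation_split=0.2)
train_gen = CustomDataGenerator(train_df, batch_size=32, img_shape=(224, 224, 3), num_classes=10)
val_gen = CustomDataGenerator(val_df, batch_size=32, img_shape=(224, 224, 3),
                              augmentation=False, num_classes=10)
# model is any compiled tf.keras model with matching input/output shapes.
model.fit(train_gen, validation_data=val_gen, epochs=10)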
Answer 2 of 3:

You can use libraries like albumentations and imgaug; both are good, but I have heard there are issues with random seeds in albumentations. Here's an example with imgaug, taken from the documentation here:

import imgaug.augmenters as iaa

seq = iaa.Sequential([
    iaa.Dropout([0.05, 0.2]),      # drop 5% or 20% of all pixels
    iaa.Sharpen((0.0, 1.0)),       # sharpen the image
    iaa.Affine(rotate=(-45, 45)),  # rotate by -45 to 45 degrees (affects segmaps)
    iaa.ElasticTransformation(alpha=50, sigma=5)  # apply water effect (affects segmaps)
], random_order=True)

# Augment images and segmaps.
# image, segmap, and input_data come from your own loading code.
images_aug = []
segmaps_aug = []
for _ in range(len(input_data)):
    images_aug_i, segmaps_aug_i = seq(image=image, segmentation_maps=segmap)
    images_aug.append(images_aug_i)
    segmaps_aug.append(segmaps_aug_i)

You are going the right way with the custom generator. In __getitem__, make a batch using batch_x = self.files[idx * batch_size:(idx + 1) * batch_size], and the same for batch_y, then augment them using X, y = self.__data_generation(batch_x, batch_y), which loads the images (using any library you like; I prefer opencv) and returns the augmented pairs (plus any other manipulation).

Your __getitem__ will then return the tuple (X, y), as the sketch below shows.
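A minimal sketch of that pattern, wiring the seq pipeline above into a Sequence (the file list, label list, and opencv loading are assumptions; all images are assumed to share one size):

import math

import cv2
import numpy as np
import tensorflow as tf

class AugmentedSequence(tf.keras.utils.Sequence):
    def __init__(self, files, labels, batch_size, seq):
        self.files = files            # list of image paths
        self.labels = labels          # labels aligned with files
        self.batch_size = batch_size
        self.seq = seq                # the imgaug pipeline defined above

    def __len__(self):
        return math.ceil(len(self.files) / self.batch_size)

    def __data_generation(self, batch_x, batch_y):
        # Load with opencv, then augment the whole batch in one call.
        imgs = [cv2.imread(f) for f in batch_x]
        imgs = self.seq(images=imgs)
        return np.stack(imgs), np.array(batch_y)

    def __getitem__(self, idx):
        batch_x = self.files[idx * self.batch_size:(idx + 1) * self.batch_size]
        batch_y = self.labels[idx * self.batch_size:(idx + 1) * self.batch_size]
        return self.__data_generation(batch_x, batch_y)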

r/tensorflow on Reddit: [HELP] Creating/Understanding tensorflow generators (August 8, 2022)

Hi!

I'm working with a large dataset and need to feed the data to my model with a generator.

The problem I'm facing is that the size of the batches I'm feeding my model usually varies, which means that I can't hardcode any value for the batch size, and I believe this is the reason for my errors.

My batch sizes vary because I balance the classes exactly evenly, since this seems to have been the best route for my CNN to actually improve. The way I balance the data is simply to remove data until the classes are even, which means that the amount of data removed varies from batch to batch.

Currently I have this code:

def data_generator():
    x_train, y_train = get_data(DATA_FOLDER)

    x_train, y_train = balance_data(x_train, y_train)

    print(x_train.shape, y_train.shape)

    yield np.array(x_train), np.array(y_train)

where the returned data has the following shapes (so the data itself is correct):

(2326, 3095) (2326, 1)

Then I run:

generator = data_generator()

model.fit(generator, epochs=EPOCHS)

And I get the following err:

Epoch 1/20
1/1 [==============================] - 1s 814ms/step - loss: 0.6931 - accuracy: 0.5000
Epoch 2/20
WARNING:tensorflow:Your input ran out of data; interrupting training. Make sure that your dataset or generator can generate at least `steps_per_epoch * epochs` batches (in this case, 20 batches). You may need to use the repeat() function when building your dataset.

What should I do to resolve the err?

What should the output of the generator be? Should it only return one datapoint for each iteration?

Thanks for any help!

Top answer (1 of 2):
Tensorflow generators need to be infinite generators, which is what leads to your problem. A Python generator with yield will stop and raise an error once you have looped over the dataset. It is a bit tricky to code a proper generator for TensorFlow, but fortunately you don't need to.

I would advise you to create a tf dataset (if the dataset fits in memory) using the tf.data.Dataset API: https://www.tensorflow.org/api_docs/python/tf/data/Dataset . In particular, use the from_tensor_slices method. You can take a look at this tutorial for a concrete example: https://www.tensorflow.org/tutorials/load_data/numpy

If your dataset needs to be created using a generator, you can try the from_generator method (https://www.tensorflow.org/guide/data#consuming_python_generators), although I don't advise using it, as it may lead to some unexpected problems if you don't handle your generator properly.

Either way, for better results you still want your data to be wrapped in a tf dataset object, as it will handle shuffling and batching properly. If the data does not fit in memory, I would advise you to export your dataset in the TFRecord format; if you are interested I can give you some more information about that.
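A minimal sketch of the from_tensor_slices route, reusing the question's get_data and balance_data helpers (DATA_FOLDER, EPOCHS, and model are the question's own names):

import numpy as np
import tensorflow as tf

# Balance once, up front, instead of inside a generator.
x_train, y_train = get_data(DATA_FOLDER)
x_train, y_train = balance_data(x_train, y_train)

dataset = tf.data.Dataset.from_tensor_slices(
    (np.asarray(x_train, dtype=np.float32), np.asarray(y_train, dtype=np.int32))
)
dataset = dataset.shuffle(buffer_size=len(x_train)).batch(32)

# No steps_per_epoch needed: the dataset restarts cleanly every epoch.
model.fit(dataset, epochs=EPOCHS)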
Answer 2 of 2:
Well, prepare the data beforehand so that you don't have to "remove" things at the worst possible time. There is a reason to hard-code a batch size and not touch it. Forget about your function, really! Take a look at https://stanford.edu/~shervine/blog/keras-how-to-generate-data-on-the-fly