Answering my own question after some investigation:

  • warm_start=True with sequential calls to .fit() should not be used for incremental learning on new datasets with potential concept drift. It simply uses the previously fitted model's parameters to initialize the new fit, and those parameters will likely be overwritten if the new data is sufficiently different (i.e., the signals differ). After a few mini-batches of large enough sample size (whole datasets, in my case), the overall performance converges to exactly that of simply re-initializing the model. My guess is that this option is primarily meant to reduce training time when re-fitting the same dataset, or when there is no significant concept drift in the new data.
  • partial_fit, on the other hand, does have an effect and can be used for incremental learning (especially for datasets too large to fit into memory, fed in as mini-batches). However, on datasets with potential concept drift or high noise, it performed worse than disregarding past observations and simply fitting each new dataset from scratch.
  • For SGDClassifier, calling partial_fit repeatedly does make a difference: each call continues optimization from the current parameters.

Edit (2022)
This post/answer has gotten a lot more views than expected, and I thought I'd expand a bit on my previous answer.

Let's say your ML model is a very simple linear regression, y = w·x + b, where w and b are the weights and biases, and x the input/features.

And let's say that you've trained the model and obtained some estimates of w and b on some initial dataset D1. Now you've obtained another dataset, D2.

Using warm_start=True and .fit() simply uses the previous estimates of w and b as the initialization for the parameters to be optimized on D2. This can reduce the training time, especially if D1 and D2 are assumed to be generated by the same underlying data-generating process ("more-or-less constant" in the docs).

On the other hand, partial_fit is for incrementally updating the parameters. So if you trained the model on D1 and then called partial_fit on D2, this would be conceptually similar to training a fresh model on the combined dataset (D1 and D2).
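As a rough sketch of the warm_start behaviour on the regression example above (synthetic data; the parameter values are illustrative and exact epoch counts will vary):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.5, -2.0, 0.5]) + 1.0   # noiseless linear target

model = SGDRegressor(warm_start=True, tol=1e-3, random_state=0)
model.fit(X, y)                 # cold start: random initialization
first_epochs = model.n_iter_

model.fit(X, y)                 # warm start: begins at the learned weights
print(first_epochs, model.n_iter_)   # the second fit tends to stop earlier
```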

The distinction can be pretty subtle, so here's an example. Let's say you are training a classifier on the Iris dataset. Now suppose you went out and collected more data on the flowers. If you think there is concept drift (the flowers have evolved and are slightly different), or perhaps you only care about the new data, then using warm_start lets you train the model on the new data faster than training from scratch with a random initialization.

On the other hand, let's say you are building a music recommendation system where users provide feedback on whether the recommendation was good or not. Then you can use partial_fit to incrementally update the model as the live data comes in.

Answer from Adam on Stack Exchange
scikit-learn
scikit-learn.org › stable › modules › generated › sklearn.linear_model.SGDClassifier.html
SGDClassifier — scikit-learn 1.8.0 documentation

2 of 4
8

Just to add another, hopefully clarifying example: You may have fitted 100 trees in a random forest model and you want to add 10 more. Then you can achieve this by setting estimator.set_params(n_estimators=110, warm_start=True) and calling the fit method of the already fitted estimator. It typically would not make sense to fit the first 100 trees on one part of the data and the next 10 trees on a different part. Warm start doesn't change the first 100 trees.

Similarly for GradientBoostingClassifier you can add more boosted trees using warm_start. You wouldn't want an additional boosted tree to be fitted on a different mini-batch. This would result in a chaotic learning process.
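A sketch of the "add 10 more trees" pattern described above, using warm_start on an already-fitted RandomForestClassifier (the dataset here is just for illustration):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)

forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)                         # fits the first 100 trees

forest.set_params(n_estimators=110, warm_start=True)
forest.fit(X, y)                         # fits only the 10 new trees

print(len(forest.estimators_))           # 110
```

The same pattern applies to GradientBoostingClassifier via its n_estimators parameter.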

Discussions

python - Sklearn SGDClassifier partial fit - Stack Overflow
I'm trying to use SGD to classify a large dataset. As the data is too large to fit into memory, I'd like to use the partial_fit method to train the classifier. I have selected a sample of the datas... More on stackoverflow.com

Difference between sklearn warm_start and partial_fit for online learning using SGDRegressor?
I tried partial_fit on SGDClassifier and it's not actually learning incrementally. We have to give all the class labels before the initial fit(). You can see a sample example here: https://ideone.com/qtGpnY. I am not sure about warm_start. IMO, warm_start just says to reuse the previous solution, but the actual addition of new samples to the already trained model has to be performed by partial_fit(), isn't it? Did you make a prototype to test them? More on reddit.com
r/learnmachinelearning
February 21, 2018

python - What is the difference between partial fit and warm start? - Stack Overflow
Therefore, when fitting new data to your model, partial_fit will only correct the model one step towards the new data, but with fit and warm_start it will act as if you would combine your old data and your new data together and fit the model once until convergence. ... from sklearn.linear_model ... More on stackoverflow.com

incremental or partial_fit
I have a very large dataset, a csv file of 900GB. Can I use auto-sklearn with it using partial_fit or any incremental training? I know that sk-learn has many partial_fit methods, are any of them expos... More on github.com
July 5, 2019
scikit-learn
scikit-learn.org › 0.15 › modules › scaling_strategies.html
6. Strategies to scale computationally: bigger data — scikit-learn 0.15-git documentation
Currently the preferred way to do this is to use the so-called hashing trick as implemented by sklearn.feature_extraction.FeatureHasher for datasets with categorical variables represented as list of Python dicts or sklearn.feature_extraction.text.HashingVectorizer for text documents. Finally, for 3. we have a number of options inside scikit-learn. Although all algorithms cannot learn incrementally (i.e. without seeing all the instances at once), all estimators implementing the partial_fit API are candidates.
Medium
medium.com › @megha.natarajan › unpacking-batch-fitting-in-machine-learning-leveraging-partial-fit-in-scikit-learn-and-pytorch-cbe0142d0535
Unpacking Batch Fitting in Machine Learning: Leveraging partial_fit in Scikit-learn and PyTorch Dynamics | by Megha Natarajan | Medium
October 30, 2023 - Visually comparing the workings of partial fitting (batch fitting) with traditional fitting (non-batch fitting) will provide a clearer understanding of how the learning process evolves over time in these two scenarios. Let us create some synthetic data and a simple classifier for this purpose. We will plot decision boundaries at various stages of the training process to see how the model learns under both strategies. import numpy as np import matplotlib.pyplot as plt from sklearn...
GitHub
github.com › scikit-learn › scikit-learn › discussions › 26453
Is Partial_fit() function in MLPClassifer an incremental learning method, or just a kind of fine-tuning? · scikit-learn/scikit-learn · Discussion #26453
May 28, 2023 - An example is when I use it to train a MLP for time series data, if I only use 20% data for training and use partial_fit() to fit each 10% data in every next time interval, I can get higher performance than the origin MLP training by 20% data.
Author: scikit-learn
Tom's Blog
tomaugspurger.net › posts › scalable machine learning (part 2): partial fit
Scalable Machine Learning (Part 2): Partial Fit | Tom's Blog
September 15, 2017 - Instead, I’ve put together a small wrapper that will use scikit-learn’s SGDClassifier.partial_fit to fit the model out-of-core (but not in parallel). from daskml.preprocessing import StandardScaler from daskml.linear_model import BigSGDClassifier # The wrapper from dask.diagnostics import ResourceProfiler, Profiler, ProgressBar from sklearn.pipeline import make_pipeline
Reddit
reddit.com › r/learnmachinelearning › difference between sklearn warm_start and partial_fit for online learning using sgdregressor?
r/learnmachinelearning on Reddit: Difference between sklearn warm_start and partial_fit for online learning using SGDRegressor?
February 21, 2018 -

I am working to implement a time series forecasting model using walk-forward analysis (meteorological data). The model needs to assimilate new observations and re-train without taking too long computationally. I have found some models in sklearn which allow incremental learning, such as SGDRegressor and PassiveAggressiveRegressor. My first thought is to try SGDR with learning_rate='constant', eta0=0.1, shuffle=False. (Very open to any suggestions about this approach.)

In order to update the model without running fit on all the training data, I see two relevant things in the documentation.
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html

warm_start = True; When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution.

partial_fit(); Fit linear model with Stochastic Gradient Descent using subset of training data/target values.

They sound the same to me and I haven't found any further explanation about their use. Should I use them together, like pass warm_start=True when building the model and then .partial_fit() for each training chunk? I would greatly appreciate some help. Thanks!

Find elsewhere
GeeksforGeeks
geeksforgeeks.org › machine learning › fitting-data-in-chunks-vs-fitting-all-at-once-in-scikit-learn
Fitting Data in Chunks vs. Fitting All at Once in scikit-learn - GeeksforGeeks
July 23, 2025 - This example demonstrates how to ... the partial_fit method on chunks of data. The traditional approach is to fit the entire dataset at once using methods like fit. This approach is straightforward and often leads to a more stable model since all data is available for training at the same time. Let's see a example that illustrates fitting a model using the entire dataset at once. ... from sklearn.feature_e...
Top answer
1 of 4
15

I don't know about the Passive Aggressor, but at least when using the SGDRegressor, partial_fit will only fit for 1 epoch, whereas fit will fit for multiple epochs (until the loss converges or max_iter is reached). Therefore, when fitting new data to your model, partial_fit will only correct the model one step towards the new data, but with fit and warm_start it will act as if you would combine your old data and your new data together and fit the model once until convergence.

Example:

from sklearn.linear_model import SGDRegressor
import numpy as np

np.random.seed(0)
X = np.linspace(-1, 1, num=50).reshape(-1, 1)
Y = (X * 1.5 + 2).reshape(50,)

modelFit = SGDRegressor(learning_rate="adaptive", eta0=0.01, random_state=0, verbose=1,
                     shuffle=True, max_iter=2000, tol=1e-3, warm_start=True)
modelPartialFit = SGDRegressor(learning_rate="adaptive", eta0=0.01, random_state=0, verbose=1,
                     shuffle=True, max_iter=2000, tol=1e-3, warm_start=False)
# first fit some data
modelFit.fit(X, Y)
modelPartialFit.fit(X, Y)
# for both: Convergence after 50 epochs, Norm: 1.46, NNZs: 1, Bias: 2.000027, T: 2500, Avg. loss: 0.000237
print(modelFit.coef_, modelPartialFit.coef_) # for both: [1.46303288]

# now fit new data (zeros)
newX = X
newY = 0 * Y

# fits only for 1 epoch, Norm: 1.23, NNZs: 1, Bias: 1.208630, T: 50, Avg. loss: 1.595492:
modelPartialFit.partial_fit(newX, newY)

# Convergence after 49 epochs, Norm: 0.04, NNZs: 1, Bias: 0.000077, T: 2450, Avg. loss: 0.000313:
modelFit.fit(newX, newY)

print(modelFit.coef_, modelPartialFit.coef_) # [0.04245779] vs. [1.22919864]
newX = np.reshape([2], (-1, 1))
print(modelFit.predict(newX), modelPartialFit.predict(newX)) # [0.08499296] vs. [3.66702685]
2 of 4
7

If warm_start=False, each subsequent call to .fit() (after an initial call to .fit() or partial_fit()) re-initialises the model's trainable parameters before fitting. If warm_start=True, each subsequent call to .fit() retains the values of the model's trainable parameters from the previous run and uses those as the initialisation. Regardless of the value of warm_start, every call to partial_fit() retains the previous run's parameters and continues from them.

Example using MLPRegressor:

import sklearn.neural_network
import numpy as np

np.random.seed(0)
x = np.linspace(-1, 1, num=50).reshape(-1, 1)
y = (x * 1.5 + 2).reshape(50,)
cold_model = sklearn.neural_network.MLPRegressor(hidden_layer_sizes=(), warm_start=False, max_iter=1)
warm_model = sklearn.neural_network.MLPRegressor(hidden_layer_sizes=(), warm_start=True, max_iter=1)

cold_model.fit(x, y)
print(cold_model.coefs_, cold_model.intercepts_)
#[array([[0.17009494]])] [array([0.74643783])]
cold_model.fit(x, y)
print(cold_model.coefs_, cold_model.intercepts_)
#[array([[-0.60819342]])] [array([-1.21256186])]
#after the second run of .fit(), values are completely different,
#because the cold model re-initialised its parameters before the second run

warm_model.fit(x, y)
print(warm_model.coefs_, warm_model.intercepts_)
#[array([[-1.39815616]])] [array([1.651504])]
warm_model.fit(x, y)
print(warm_model.coefs_, warm_model.intercepts_)
#[array([[-1.39715616]])] [array([1.652504])]
#this time with the warm model, params change relatively little, as params were
#not re-initialised during the second call to .fit()

cold_model.partial_fit(x, y)
print(cold_model.coefs_, cold_model.intercepts_)
#[array([[-0.60719343]])] [array([-1.21156187])]
cold_model.partial_fit(x, y)
print(cold_model.coefs_, cold_model.intercepts_)
#[array([[-0.60619347]])] [array([-1.21056189])]
#with partial_fit(), params barely change even for the cold model,
#as no re-initialisation occurs

warm_model.partial_fit(x, y)
print(warm_model.coefs_, warm_model.intercepts_)
#[array([[-1.39615617]])] [array([1.65350392])]
warm_model.partial_fit(x, y)
print(warm_model.coefs_, warm_model.intercepts_)
#[array([[-1.39515619]])] [array([1.65450372])]
#and of course the same goes for the warm model
scikit-learn
scikit-learn.org › stable › modules › generated › sklearn.linear_model.SGDRegressor.html
SGDRegressor — scikit-learn 1.8.0 documentation
scikit-learn
scikit-learn.org › stable › modules › generated › sklearn.preprocessing.StandardScaler.html
StandardScaler — scikit-learn 1.8.0 documentation
Will be reset on new calls to fit, but increments across partial_fit calls. ... Equivalent function without the estimator API. ... Further removes the linear correlation across features with ‘whiten=True’. ... NaNs are treated as missing values: disregarded in fit, and maintained in transform. We use a biased estimator for the standard deviation, equivalent to numpy.std(x, ddof=0). Note that the choice of ddof is unlikely to affect model performance. ... >>> from sklearn.preprocessing import StandardScaler >>> data = [[0, 0], [0, 0], [1, 1], [1, 1]] >>> scaler = StandardScaler() >>> print(scaler.fit(data)) StandardScaler() >>> print(scaler.mean_) [0.5 0.5] >>> print(scaler.transform(data)) [[-1.
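The behaviour that snippet describes (statistics that increment across partial_fit calls) can be sketched as follows, using a toy dataset: fitting StandardScaler chunk by chunk matches fitting all the data at once.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

data = np.array([[0, 0], [0, 0], [1, 1], [1, 1]], dtype=float)

full = StandardScaler().fit(data)     # one pass over all the data

incremental = StandardScaler()
incremental.partial_fit(data[:2])     # first chunk
incremental.partial_fit(data[2:])     # second chunk: stats accumulate

print(full.mean_, incremental.mean_)  # both [0.5 0.5]
```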
scikit-learn
scikit-learn.org › 1.5 › modules › generated › sklearn.linear_model.SGDClassifier.html
scikit-learn.org - sklearn.linear_model.SGDClassifier
Repeatedly calling fit or partial_fit when warm_start is True can result in a different solution than when calling fit a single time because of the way the data is shuffled. If a dynamic learning rate is used, the learning rate is adapted depending on the number of samples already seen.
GitHub
github.com › automl › auto-sklearn › issues › 696
incremental or partial_fit · Issue #696 · automl/auto-sklearn
July 5, 2019 - I have very large dataset a csv file of 900GB. Can I use auto-sklean with it using partial_fit or any incremental training? I know that sk-learn have many partial_fit methods, are any of them exposed in autosklearn.classification.AutoSkl...
Author: bytearchive
scikit-learn
scikit-learn.org › dev › auto_examples › cluster › plot_dict_face_patches.html
Online learning of a dictionary of parts of faces — scikit-learn 1.8.dev0 documentation
The verbose setting on the MiniBatchKMeans enables us to see that some clusters are reassigned during the successive calls to partial-fit. This is because the number of patches that they represent has become too low, and it is better to choose a random new cluster. # Authors: The scikit-learn developers # SPDX-License-Identifier: BSD-3-Clause · from sklearn import datasets faces = datasets.fetch_olivetti_faces()
Calmcode
calmcode.io › labs › scikit-partial
Pipeline components that support partial_fit.
The main Pipeline in scikit-learn, however, does not support this .partial_fit API. Which is why we made a variant that does in scikit-partial. To get started with this new pipeline you'll first need to install it: ... Once installed you can use it to train models in multiple batches. The code below gives an example of this. import pandas as pd from sklearn.linear_model import SGDClassifier from sklearn.feature_extraction.text import HashingVectorizer from skpartial.pipeline import make_partial_pipeline # First, load some data.
MNE Forum
mne.discourse.group › mailing list archive (read-only)
GeneralizingEstimator with incremental learning / .partial_fit - Mailing List Archive (read-only) - MNE Forum
August 5, 2020 - External Email - Use Caution Hi! I would need to try decoding with incremental learning (EEG data). I was planning to use logistic regression by means of the SGDClassifier . I would then need to call .partial_fit to make my estimator learn on each of my training sets.
scikit-learn
scikit-learn.org › 1.5 › modules › generated › sklearn.decomposition.IncrementalPCA.html
IncrementalPCA — scikit-learn 1.5.2 documentation
Metadata routing for check_input parameter in partial_fit. ... The updated object. ... Apply dimensionality reduction to X. X is projected on the first principal components previously extracted from a training set, using minibatches of size batch_size if X is sparse. ... New data, where n_samples is the number of samples and n_features is the number of features. ... Projection of X in the first principal components. ... >>> import numpy as np >>> from sklearn.decomposition import IncrementalPCA >>> X = np.array([[-1, -1], [-2, -1], [-3, -2], ...