Answering my own question after some investigation:
warm_start=True and calling .fit() sequentially should not be used for incremental learning on new datasets with potential concept drift. It simply uses the previously fitted model's parameters to initialize a new fit, and that initialization will likely be overwritten if the new data is sufficiently different (i.e. the signals differ). After a few mini-batches with a large enough sample size (datasets, in my case), the overall performance converges to exactly that of simply re-initializing the model. My guess is that this method is primarily useful for reducing training time when re-fitting the same dataset, or when there is no significant concept drift in the new data.
partial_fit, on the other hand, does have an effect and can be used for incremental learning (especially for datasets too large to fit into memory, fed in mini-batches). However, on datasets with potential concept drift or high noise, it performs worse than disregarding past observations and simply fitting on each dataset without any incremental learning.
For SGDClassifier, calling partial_fit repeatedly makes a difference.
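To illustrate the first point, here is a minimal sketch (the data and signal are made up for illustration): a model warm-started from a fit on old data and a model fitted from scratch end up at essentially the same solution for the new data, because the warm-start initialization is overwritten once the optimizer converges.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
X_old = rng.uniform(-1, 1, size=(200, 1))
y_old = 1.5 * X_old[:, 0] + 2.0          # old "signal"
X_new = rng.uniform(-1, 1, size=(200, 1))
y_new = -1.5 * X_new[:, 0] - 2.0         # new, very different signal

# warm-started model: first fit on the old data, then refit on the new data
warm = SGDRegressor(learning_rate="adaptive", eta0=0.01, max_iter=5000,
                    tol=1e-4, warm_start=True, random_state=0)
warm.fit(X_old, y_old)
warm.fit(X_new, y_new)   # the old parameters only serve as an initialization

# fresh model: fitted on the new data only, from a cold start
fresh = SGDRegressor(learning_rate="adaptive", eta0=0.01, max_iter=5000,
                     tol=1e-4, random_state=0)
fresh.fit(X_new, y_new)

# both converge to (approximately) the same optimum for the new data
print(warm.coef_, fresh.coef_)
```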
Edit (2022)
This post/answer has gotten a lot more views than expected, and I thought I'd expand a bit on my previous answer.
Let's say your ML model is a very simple linear regression, y = Wx + b, where W and b are the weights and biases, and x is the input/features. And let's say that you've trained the model so that you've obtained some estimates Ŵ, b̂ on some initial dataset D1. Now you've obtained another dataset D2.
Using warm_start=True and .fit() simply uses Ŵ, b̂ as an initialization for the parameters to be optimized on D2. This can reduce the training time, especially if the datasets D1 and D2 are assumed to be generated from the same underlying data generating process ("more-or-less constant" in the docs).
On the other hand, partial_fit is for incrementally updating the parameters. So if you trained the model on D1 and then called partial_fit on D2, this would be conceptually similar to training a fresh model on the combined dataset D1 ∪ D2.
The distinction can be pretty subtle so here's an example. Let's say you are training a classifier on the Iris dataset. Now suppose you went out and collected more data on the flowers. If you think there is concept drift (the flowers have evolved and are slightly different) or perhaps only care about the new data, then using warm_start lets you train the model on the new data faster than training from scratch with random initialization.
On the other hand, let's say you are building a music recommendation system where users provide feedback on whether the recommendation was good or not. Then you can use partial_fit to incrementally update the model as the live data comes in.
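The recommendation-style workflow above can be sketched with SGDClassifier (the features and feedback labels here are made up for illustration); note that all class labels must be declared on the first call to partial_fit:

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.default_rng(42)
model = SGDClassifier(random_state=0)
classes = np.array([0, 1])  # every class must be declared on the first call

# simulate user feedback arriving in small batches over time
for _ in range(10):
    X_batch = rng.normal(size=(32, 4))                         # hypothetical features
    y_batch = (X_batch[:, 0] + X_batch[:, 1] > 0).astype(int)  # hypothetical feedback
    model.partial_fit(X_batch, y_batch, classes=classes)

# the model is usable at any point while data keeps streaming in
X_test = rng.normal(size=(100, 4))
y_test = (X_test[:, 0] + X_test[:, 1] > 0).astype(int)
print(model.score(X_test, y_test))
```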
Just to add another, hopefully clarifying example: You may have fitted 100 trees in a random forest model and you want to add 10 more. Then you can achieve this by setting estimator.set_params(n_estimators=110, warm_start=True) and calling the fit method of the already fitted estimator. It typically would not make sense to fit the first 100 trees on one part of the data and the next 10 trees on a different part. Warm start doesn't change the first 100 trees.
Similarly for GradientBoostingClassifier you can add more boosted trees using warm_start. You wouldn't want an additional boosted tree to be fitted on a different mini-batch. This would result in a chaotic learning process.
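A minimal sketch of the tree-growing pattern described above (the dataset is just a toy example):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

X, y = make_classification(n_samples=300, random_state=0)

# fit 100 trees first
forest = RandomForestClassifier(n_estimators=100, random_state=0)
forest.fit(X, y)
first_tree = forest.estimators_[0]

# grow 10 additional trees; the existing 100 are kept as-is
forest.set_params(n_estimators=110, warm_start=True)
forest.fit(X, y)

print(len(forest.estimators_))              # 110
print(forest.estimators_[0] is first_tree)  # True: the old trees are untouched
```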
Difference between sklearn warm_start and partial_fit for online learning using SGDRegressor?
I am working to implement a time series forecasting model using walk-forward analysis (meteorological data). The model needs to assimilate new observations and re-train without taking too long computationally. I have found some models in sklearn which allow incremental learning, such as SGDRegressor and PassiveAggressiveRegressor. My first thought is to try SGDRegressor with learning_rate='constant', eta0=0.1, shuffle=False. (Very open to any suggestions about this approach.)
In order to update the model without running fit on all the training data, I see two relevant things in the documentation.
http://scikit-learn.org/stable/modules/generated/sklearn.linear_model.SGDRegressor.html
warm_start = True; When set to True, reuse the solution of the previous call to fit as initialization, otherwise, just erase the previous solution.
partial_fit(); Fit linear model with Stochastic Gradient Descent using subset of training data/target values.
They sound the same to me and I haven't found any further explanation about their use. Should I use them together, like pass warm_start=True when building the model and then .partial_fit() for each training chunk? I would greatly appreciate some help. Thanks!
I don't know about the Passive Aggressor, but at least when using the SGDRegressor, partial_fit will only fit for 1 epoch, whereas fit will fit for multiple epochs (until the loss converges or max_iter
is reached). Therefore, when fitting new data to your model, partial_fit will only correct the model one step towards the new data, but with fit and warm_start it will act as if you would combine your old data and your new data together and fit the model once until convergence.
Example:
from sklearn.linear_model import SGDRegressor
import numpy as np
np.random.seed(0)
X = np.linspace(-1, 1, num=50).reshape(-1, 1)
Y = (X * 1.5 + 2).reshape(50,)
modelFit = SGDRegressor(learning_rate="adaptive", eta0=0.01, random_state=0, verbose=1,
shuffle=True, max_iter=2000, tol=1e-3, warm_start=True)
modelPartialFit = SGDRegressor(learning_rate="adaptive", eta0=0.01, random_state=0, verbose=1,
shuffle=True, max_iter=2000, tol=1e-3, warm_start=False)
# first fit some data
modelFit.fit(X, Y)
modelPartialFit.fit(X, Y)
# for both: Convergence after 50 epochs, Norm: 1.46, NNZs: 1, Bias: 2.000027, T: 2500, Avg. loss: 0.000237
print(modelFit.coef_, modelPartialFit.coef_) # for both: [1.46303288]
# now fit new data (zeros)
newX = X
newY = 0 * Y
# fits only for 1 epoch, Norm: 1.23, NNZs: 1, Bias: 1.208630, T: 50, Avg. loss: 1.595492:
modelPartialFit.partial_fit(newX, newY)
# Convergence after 49 epochs, Norm: 0.04, NNZs: 1, Bias: 0.000077, T: 2450, Avg. loss: 0.000313:
modelFit.fit(newX, newY)
print(modelFit.coef_, modelPartialFit.coef_) # [0.04245779] vs. [1.22919864]
newX = np.reshape([2], (-1, 1))
print(modelFit.predict(newX), modelPartialFit.predict(newX)) # [0.08499296] vs. [3.66702685]
If warm_start = False, each subsequent call to .fit() (after an initial call to .fit() or partial_fit()) re-initialises the model's trainable parameters before training. If warm_start = True, each subsequent call to .fit() retains the values of the model's trainable parameters from the previous run and uses those as the initialisation.
Regardless of the value of warm_start, each call to partial_fit() will retain the previous run's model parameters and use those initially.
Example using MLPRegressor:
import sklearn.neural_network
import numpy as np
np.random.seed(0)
x = np.linspace(-1, 1, num=50).reshape(-1, 1)
y = (x * 1.5 + 2).reshape(50,)
cold_model = sklearn.neural_network.MLPRegressor(hidden_layer_sizes=(), warm_start=False, max_iter=1)
warm_model = sklearn.neural_network.MLPRegressor(hidden_layer_sizes=(), warm_start=True, max_iter=1)
cold_model.fit(x,y)
print(cold_model.coefs_, cold_model.intercepts_)
#[array([[0.17009494]])] [array([0.74643783])]
cold_model.fit(x,y)
print(cold_model.coefs_, cold_model.intercepts_)
#[array([[-0.60819342]])] [array([-1.21256186])]
#after second run of .fit(), values are completely different
#because they were re-initialised before doing the second run for the cold model
warm_model.fit(x,y)
print(warm_model.coefs_, warm_model.intercepts_)
#[array([[-1.39815616]])] [array([1.651504])]
warm_model.fit(x,y)
print(warm_model.coefs_, warm_model.intercepts_)
#[array([[-1.39715616]])] [array([1.652504])]
#this time with the warm model, params change relatively little, as params were
#not re-initialised during second call to .fit()
cold_model.partial_fit(x,y)
print(cold_model.coefs_, cold_model.intercepts_)
#[array([[-0.60719343]])] [array([-1.21156187])]
cold_model.partial_fit(x,y)
print(cold_model.coefs_, cold_model.intercepts_)
#[array([[-0.60619347]])] [array([-1.21056189])]
#with partial_fit(), params barely change even for cold model,
#as no re-initialisation occurs
warm_model.partial_fit(x,y)
print(warm_model.coefs_, warm_model.intercepts_)
#[array([[-1.39615617]])] [array([1.65350392])]
warm_model.partial_fit(x,y)
print(warm_model.coefs_, warm_model.intercepts_)
#[array([[-1.39515619]])] [array([1.65450372])]
#and of course the same goes for the warm model
fit() (with the default warm_start=False) always initializes the parameters as for a new estimator, and trains the model on the dataset passed to fit().
partial_fit(), on the other hand, works on top of the existing parameters and tries to improve the existing weights with the new dataset passed to partial_fit().
It is always a good idea to save the model to persistent storage (say, a pickle file) for later use or further training.
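For example, a partially trained model can be pickled and the incremental training resumed later (a sketch; the data is a toy example, and the bytes could equally be written to a file):

```python
import pickle
import numpy as np
from sklearn.linear_model import SGDRegressor

X = np.linspace(-1, 1, 50).reshape(-1, 1)
y = 1.5 * X[:, 0] + 2.0

model = SGDRegressor(random_state=0)
model.partial_fit(X, y)      # one incremental pass

# persist the partially trained model (in-memory here; a file works the same way)
blob = pickle.dumps(model)

# later: restore it and continue incremental training from where it left off
restored = pickle.loads(blob)
restored.partial_fit(X, y)
print(restored.coef_, restored.intercept_)
```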
I don't think there is a "correct way" between those options. Both will fit your data, but one does it in a single call (fit) while the other lets you fit portions of your data (partial_fit).
In most cases, users divide their huge dataset into smaller chunks and feed these chunks in sequence to partial_fit; the call to partial_fit on the final chunk completes the fit.
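The chunked workflow might look like the sketch below (the data and the known linear signal are made up; the only assumption is that every chunk shares the same feature schema):

```python
import numpy as np
from sklearn.linear_model import SGDRegressor

rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.3   # known linear signal

model = SGDRegressor(random_state=0)
chunk_size = 1_000

# feed the dataset to partial_fit one chunk at a time,
# so the full dataset never needs to sit in memory at once
for start in range(0, len(X), chunk_size):
    stop = start + chunk_size
    model.partial_fit(X[start:stop], y[start:stop])

print(model.coef_)  # should approach [1.0, -2.0, 0.5]
```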