An SVM (support-vector machine) is a special kind of linear model. From a theoretical view, training one is a convex optimization problem, and we can get the global optimum in polynomial time. There are many different optimization approaches.

In the past, people used general quadratic programming (QP) solvers. Nowadays, specialized approaches like SMO and others are used.

sklearn's specialized SVM optimizers are based on liblinear and libsvm. There are many documents and research papers if you are interested in the algorithms.

Keep in mind that SVC (libsvm) and LinearSVC (liblinear) make different assumptions about the optimization problem, which results in different performance on the same task (with a linear kernel, LinearSVC is in general much more efficient than SVC, but some tasks, e.g. anything needing a non-linear kernel, can't be tackled by LinearSVC).
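
As a rough sketch of that difference (a toy problem, assuming a recent sklearn; exact scores depend on the data): both estimators fit a linear classifier here, but note they optimize slightly different objectives, so the learned coefficients need not match exactly.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.svm import SVC, LinearSVC

# Toy, mostly linearly separable problem
X, y = make_classification(n_samples=500, n_features=10,
                           class_sep=2.0, random_state=0)

# libsvm-based: supports kernels, but scales roughly quadratically in n_samples
svc = SVC(kernel='linear').fit(X, y)

# liblinear-based: linear models only, but scales much better
lin_svc = LinearSVC(dual=False).fit(X, y)

print(svc.score(X, y), lin_svc.score(X, y))
```

On a dataset this small both finish instantly; the scaling gap only shows up once n_samples grows into the tens of thousands (see the timing experiment further below).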

SGD is an optimizer based on Stochastic Gradient Descent (a general optimization method!) which can tackle many different convex optimization problems. Actually, it's more or less the same method used in all those deep-learning approaches, so people use it in the non-convex setting too, throwing away the theoretical guarantees.

sklearn says: "Stochastic Gradient Descent (SGD) is a simple yet very efficient approach to discriminative learning of linear classifiers under convex loss functions." It's actually even more versatile, but here it's enough to note that it subsumes (some) SVMs, logistic regression and others.

Now SGD-based optimization is very different from QP and the others. Take QP, for example: there are no hyper-parameters to tune. This is a bit simplified, as there can be tuning, but it's not needed to guarantee convergence and performance! (The theory of QP solvers, e.g. the interior-point method, is much more robust.)

SGD-based optimizers (and first-order methods in general) are very hard to tune, and they need tuning! Learning rates, and learning-rate schedules in general, are the parameters to look at, as convergence depends on them (in theory and in practice)!
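
A small illustration of that sensitivity (a sketch on synthetic data; the specific eta0 values are arbitrary choices): the same model on the same data, with only the constant learning rate varied.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=1000, n_features=20,
                           class_sep=2.0, random_state=0)

# Same model, same data -- only the learning rate differs
scores = {}
for eta0 in (1e-4, 1e-2, 1.0, 100.0):
    clf = SGDClassifier(learning_rate='constant', eta0=eta0,
                        max_iter=1000, tol=1e-3, random_state=0)
    scores[eta0] = clf.fit(X, y).score(X, y)
print(scores)
```

In practice you would rarely fix a constant rate like this; sklearn's default 'optimal' schedule is a more sensible starting point, but even then the regularization strength alpha interacts with the schedule and usually needs a search.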

It's a very complex topic, but here are some simplified rules:

  • Specialized SVM methods

    • scale worse with the number of samples
    • do not need hyper-parameter tuning
  • SGD-based methods

    • scale better for huge data in general
    • need hyper-parameter tuning
    • solve only a subset of the tasks approachable by the above (no kernel methods!)
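
The "scales better for huge data" point is partly because SGD supports out-of-core learning via partial_fit: you can stream minibatches and never hold the full dataset in memory. A minimal sketch (synthetic minibatches with a simple linear ground truth):

```python
import numpy as np
from sklearn.linear_model import SGDClassifier

rng = np.random.RandomState(0)
clf = SGDClassifier(loss='hinge', random_state=0)
classes = np.array([0, 1])  # must be declared up front for partial_fit

# Stream the data in minibatches -- the full set never sits in memory
for _ in range(100):
    X_batch = rng.normal(size=(200, 10))
    y_batch = (X_batch[:, 0] + X_batch[:, 1] > 0).astype(int)
    clf.partial_fit(X_batch, y_batch, classes=classes)
```

SVC and LinearSVC have no equivalent: they need the whole training set at once.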

My opinion: use the (easier to use) LinearSVC as long as it's working, given your time budget!

Just to make it clear: I highly recommend grabbing some dataset (e.g. from within sklearn) and doing some comparisons between those candidates. The need for parameter tuning is not a theoretical problem! You will see non-optimal (objective / loss) results in the SGD case quite easily!
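
One way to run such a comparison (a sketch using the breast-cancer dataset bundled with sklearn; the grid over alpha is a deliberately small, arbitrary choice):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score
from sklearn.svm import LinearSVC

X, y = load_breast_cancer(return_X_y=True)

# LinearSVC: reasonable out of the box
svc_score = cross_val_score(LinearSVC(dual=False), X, y, cv=5).mean()

# SGD: untuned defaults vs. a small grid over the regularization strength
sgd_default = cross_val_score(SGDClassifier(random_state=0), X, y, cv=5).mean()
grid = GridSearchCV(SGDClassifier(random_state=0),
                    {'alpha': [1e-5, 1e-4, 1e-3, 1e-2, 1e-1]}, cv=5)
sgd_tuned = grid.fit(X, y).best_score_

print(svc_score, sgd_default, sgd_tuned)
```

The gap between the default and the tuned SGD scores is exactly the "need for tuning" mentioned above; LinearSVC gets its score without any search.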

And always remember: Stochastic Gradient Descent is sensitive to feature scaling (see the docs). This is more or less a consequence of first-order methods.
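
The standard remedy is to put a StandardScaler in front of the SGD model, as sklearn's docs recommend (a sketch on the breast-cancer dataset, whose raw features have wildly different scales):

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

X, y = load_breast_cancer(return_X_y=True)  # features on very different scales

# Raw features vs. standardized (zero mean, unit variance) features
raw = cross_val_score(SGDClassifier(random_state=0), X, y, cv=5).mean()
scaled = cross_val_score(
    make_pipeline(StandardScaler(), SGDClassifier(random_state=0)), X, y, cv=5
).mean()
print(raw, scaled)
```

Using a pipeline also ensures the scaler is fit on each training fold only, avoiding leakage into the CV estimate.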

Answer from sascha on Stack Overflow

SVC (SVM) uses kernel-based optimisation: the input data is implicitly transformed into a higher-dimensional space, which makes it possible to identify more complex boundaries between classes. SVC can perform both linear and non-linear classification.

SVC can perform linear classification by setting the kernel parameter to 'linear': svc = SVC(kernel='linear')

SVC can perform non-linear classification by setting the kernel parameter to 'poly' or 'rbf' (the default): svc = SVC(kernel='poly') or svc = SVC(kernel='rbf')

SGDClassifier uses stochastic gradient descent, where the optimal coefficients are found by an iterative process. SGDClassifier can perform only linear classification.

SGDClassifier fits a linear SVC (SVM) model when the loss parameter is set to 'hinge' (which is the default), i.e. SGDClassifier(loss='hinge')
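
To see that equivalence in practice (a sketch on standardized synthetic data; the two optimizers solve slightly different objectives, so expect very similar but not identical predictions):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.svm import LinearSVC

X, y = make_classification(n_samples=1000, n_features=5,
                           class_sep=2.0, random_state=0)
X = StandardScaler().fit_transform(X)

# Same model family (linear SVM), two different optimizers
sgd = SGDClassifier(loss='hinge', random_state=0).fit(X, y)
lin = LinearSVC(dual=False).fit(X, y)

# Fraction of points on which the two classifiers agree
agreement = (sgd.predict(X) == lin.predict(X)).mean()
print(agreement)
```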


When SVMs and SGD can't be combined

SVMs are often used in combination with the kernel trick, which enables classification of non-linearly separable data. This answer explains why you wouldn't use stochastic gradient descent to solve a kernelised SVM: https://stats.stackexchange.com/questions/215524/is-gradient-descent-possible-for-kernelized-svms-if-so-why-do-people-use-quadr

Linear SVMs

If we stick to linear SVMs, then we can run an experiment using sklearn, as it provides wrappers over libsvm (SVC) and liblinear (LinearSVC), and also offers the SGDClassifier. I recommend reading the documentation of libsvm and liblinear to understand what is happening under the hood.

Comparison on example dataset

Below is a comparison of computational performance and accuracy over a randomly generated dataset (which may not be representative of your problem). You should alter the problem to fit your requirements.

import time
import numpy as np
import matplotlib.pyplot as plt

from sklearn.svm import SVC, LinearSVC
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Randomly generated dataset
# Linear function + noise
np.random.seed(0)
X = np.random.normal(size=(50000, 10))
coefs = np.random.normal(size=10)
epsilon = np.random.normal(size=50000)
y = (X @ coefs + epsilon) > 0

# Classifiers to compare
algos = {
    'LibSVM': {
        'model': SVC(),
        'max_n': 4000,
        'time': [],
        'error': []
    },
    'LibLinear': {
        'model': LinearSVC(dual=False),
        'max_n': np.inf,
        'time': [],
        'error': []
    },
    'SGD': {
        'model': SGDClassifier(max_iter=1000, tol=1e-3),
        'max_n': np.inf,
        'time': [],
        'error': []
    }
}

splits = list(range(100, 1000, 100)) + \
         list(range(1500, 5000, 500)) + \
         list(range(6000, 50000, 1000))
for i in splits:
    X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                        test_size=1-i/50000,
                                                        random_state=0)
    for k, v in algos.items():
        if i < v['max_n']:
            model = v['model']
            t0 = time.time()
            model.fit(X_train, y_train)
            t1 = time.time()
            v['time'].append(t1 - t0)
            preds = model.predict(X_test)
            e = (preds != y_test).sum() / len(y_test)
            v['error'].append(e)

Plotting the results, we see that the traditional libsvm solver cannot be used on large n, while the liblinear and SGD implementations scale well computationally.

plt.figure()
for k, v in algos.items():
    plt.plot(splits[:len(v['time'])], v['time'], label='{} time'.format(k))
plt.legend()
plt.semilogx()
plt.title('Time comparison')
plt.show()

Plotting the error, we see that SGD is worse than LibSVM for the same training set, but if you have a large training set this becomes a minor point. The liblinear algorithm performs best on this dataset:

plt.figure()
for k, v in algos.items():
    plt.plot(splits[:len(v['error'])], v['error'], label='{} error'.format(k))
plt.legend()
plt.semilogx()
plt.title('Error comparison')
plt.xlabel('Number of training examples')
plt.ylabel('Error')
plt.show()
