🌐
scikit-learn
scikit-learn.org › stable › modules › sgd.html
1.5. Stochastic Gradient Descent — scikit-learn 1.8.0 documentation
The implementation of SGD is influenced by the Stochastic Gradient SVM of [7]. Similar to SvmSGD, the weight vector is represented as the product of a scalar and a vector which allows an efficient weight update in the case of \(L_2\) regularization. In the case of sparse input X, the intercept is updated with a smaller learning rate (multiplied by 0.01) to account for the fact that it is updated more frequently.
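The scalar-times-vector representation mentioned in that snippet can be sketched as follows (an illustration of the idea, not scikit-learn's actual code): storing $w = \text{scale} \cdot v$ turns the $L_2$ shrink step $w \leftarrow (1 - \eta\alpha)\,w$ into an O(1) scalar multiply instead of an O(dim) vector scan.

```python
import numpy as np

class ScaledWeights:
    """Weight vector stored as w = scale * v, so L2 shrinkage is O(1)."""

    def __init__(self, dim):
        self.v = np.zeros(dim)
        self.scale = 1.0

    def l2_shrink(self, eta, alpha):
        # w <- (1 - eta * alpha) * w, touching only the scalar
        self.scale *= 1.0 - eta * alpha

    def add(self, eta, x):
        # w <- w + eta * x; since w = scale * v, update v by (eta/scale) * x
        self.v += (eta / self.scale) * x

    def dot(self, x):
        return self.scale * np.dot(self.v, x)
```

One caveat of this trick: repeated shrinkage drives `scale` toward zero, so practical implementations occasionally fold `scale` back into `v` to avoid numerical underflow.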
🌐
University of Utah
users.cs.utah.edu › ~zhe › pdf › lec-19-2-svm-sgd-upload.pdf pdf
1 Support Vector Machines: Training with Stochastic Gradient Descent
Outline: Training SVM by optimization · 1. Review of convex functions and gradient descent · 2. Stochastic gradient descent · 3. Gradient descent vs stochastic gradient descent · 4. Sub-derivatives of the hinge loss · 5. Stochastic sub-gradient descent for SVM ·
🌐
Towards Data Science
towardsdatascience.com › home › latest › stochastic gradient descent implementation for softsvm
Stochastic gradient descent implementation for SoftSVM | Towards Data Science
March 5, 2025 - This is a quick tutorial on how to implement the Stochastic Gradient Descent (SGD) optimization method for SoftSVM on MATLAB to find a linear classifier with minimal empirical loss.
🌐
Springer
link.springer.com › home › machine learning and knowledge discovery in databases › conference paper
The Stochastic Gradient Descent for the Primal L1-SVM Optimization Revisited | SpringerLink
We reconsider the stochastic (sub)gradient approach to the unconstrained primal L1-SVM optimization. We observe that if the learning rate is inversely proportional to the number of steps, i.e., the number of times any training pattern is presented to the algorithm,...
🌐
Svivek
svivek.com › teaching › machine-learning › lectures › slides › svm › svm-sgd.pdf pdf
Training with Stochastic Gradient Descent
Outline: Training SVM by optimization · 1. Review of convex functions and gradient descent · 2. Stochastic gradient descent · 3. Gradient descent vs stochastic gradient descent · 4. Sub-derivatives of the hinge loss · 5. Stochastic sub-gradient descent for SVM ·
🌐
Svivek
svivek.com › teaching › machine-learning › fall2018 › slides › svm › svm-sgd.pdf pdf
Machine Learning Support Vector Machines: Training with
Outline: Training SVM by optimization · 1. Review of convex functions and gradient descent · 2. Stochastic gradient descent · 3. Gradient descent vs stochastic gradient descent · 4. Sub-derivatives of the hinge loss · 5. Stochastic sub-gradient descent for SVM ·
🌐
scikit-learn
scikit-learn.org › stable › modules › generated › sklearn.linear_model.SGDClassifier.html
SGDClassifier — scikit-learn 1.8.0 documentation
This estimator implements regularized linear models with stochastic gradient descent (SGD) learning: the gradient of the loss is estimated one sample at a time and the model is updated along the way with a decreasing strength ...
🌐
MaviccPRP@web.studio
maviccprp.github.io › a-support-vector-machine-in-just-a-few-lines-of-python-code
A Support Vector Machine in just a few Lines of Python Code
April 3, 2017 - As for the perceptron, we use python 3 and numpy. The SVM will learn using the stochastic gradient descent algorithm (SGD). SGD minimizes a function by following the gradients of the cost function.
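In the same spirit as that post, a minimal hinge-loss SGD trainer in numpy might look like the sketch below (using a Pegasos-style $1/t$ step size; this is an illustration, not the post's actual code):

```python
import numpy as np

def svm_sgd(X, y, lam=0.01, epochs=100, seed=0):
    """Minimal linear SVM trained by SGD on the regularized hinge loss.
    Labels y must be +/-1. Illustrative sketch, not tuned for speed."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)              # decreasing 1/t step size
            if y[i] * np.dot(w, X[i]) < 1:     # hinge loss is active
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
            else:                              # only the L2 term contributes
                w = (1 - eta * lam) * w
    return w

# Toy usage: two linearly separable clusters
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = svm_sgd(X, y)
```

On each sample the update follows the subgradient of $\frac{\lambda}{2}\|w\|^2 + \max(0, 1 - y_i w^t x_i)$: the shrink factor $(1 - \eta\lambda)$ comes from the regularizer, and the $\eta\, y_i x_i$ term fires only when the margin is violated.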
Top answer
1 of 3
18

Set $\mathbf w = \phi(\mathbf x)\cdot \mathbf u$ so that $\mathbf w^t \phi(\mathbf x)=\mathbf u^t \cdot \mathbf K$ and $\mathbf w^t\mathbf w = \mathbf u^t\mathbf K\mathbf u$, with $\mathbf K = \phi(\mathbf x)^t\phi(\mathbf x)$, where $\phi(\mathbf x)$ is a mapping of the original input matrix, $\mathbf x$. This allows one to solve the SVM through the primal formulation. Using your notation for the loss:

$$J(\mathbf{w}, b) = C {\displaystyle \sum\limits_{i=1}^{m} max\left(0, 1 - y^{(i)} (\mathbf{u}^t \cdot \mathbf{K}^{(i)} + b)\right)} + \dfrac{1}{2} \mathbf{u}^t \cdot \mathbf{K} \cdot \mathbf{u}$$

$\mathbf{K}$ is an $m \times m$ matrix, and $\mathbf{u}$ is an $m \times 1$ vector. Neither is infinite.
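A minimal sketch of this primal, kernelized formulation with subgradient descent on $\mathbf{u}$ and $b$ (illustrative only, not an optimized solver) could look like:

```python
import numpy as np

def primal_kernel_svm(K, y, C=1.0, lr=0.01, iters=500):
    """Subgradient descent on J(u, b) = C * sum_i max(0, 1 - y_i (u'K_i + b))
    + 0.5 * u'Ku.  K is the m x m kernel matrix, y holds +/-1 labels."""
    m = K.shape[0]
    u, b = np.zeros(m), 0.0
    for _ in range(iters):
        f = K @ u + b                      # decision values u' K_i + b
        viol = y * f < 1                   # samples with active hinge loss
        # subgradient wrt u: K u (regularizer) minus C * sum of y_i K_i
        # over the margin violators
        u -= lr * (K @ u - C * (K[:, viol] @ y[viol]))
        b -= lr * (-C * y[viol].sum())
    return u, b

# Toy usage with the simplest kernel, K = X X^T (linear)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
K = X @ X.T
u, b = primal_kernel_svm(K, y)
```

Any positive semidefinite kernel matrix can be passed in place of the linear one; only $\mathbf{K}$ changes, never the solver.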

Indeed, the dual is usually faster to solve, but the primal has its advantages as well, such as approximate solutions (which are not guaranteed in the dual formulation).


Why the dual became so much more prominent is not obvious at all; Chapelle [1] remarks:

The historical reasons for which most of the research in the last decade has been about dual optimization are unclear. We believe that it is because SVMs were first introduced in their hard margin formulation [Boser et al., 1992], for which a dual optimization (because of the constraints) seems more natural. In general, however, soft margin SVMs should be preferred, even if the training data are separable: the decision boundary is more robust because more training points are taken into account [Chapelle et al., 2000]


Chapelle (2007) argues that the time complexity of both primal and dual optimization is $\mathcal{O}\left(nn_{sv} + n_{sv}^3\right)$, with a worst case of $\mathcal{O}\left(n^3\right)$. Note, however, that the analysis covers quadratic and smoothed approximations of the hinge loss rather than the hinge loss proper, since the latter is not differentiable and so cannot be used with Newton's method.


[1] Chapelle, O. (2007). Training a support vector machine in the primal. Neural computation, 19(5), 1155-1178.

2 of 3
6

If we apply a transformation $\phi$ to all input vectors ($\mathbf{x}^{(i)}$), we get the following cost function:

$J(\mathbf{w}, b) = C {\displaystyle \sum\limits_{i=1}^{m} max\left(0, 1 - y^{(i)} (\mathbf{w}^t \cdot \phi(\mathbf{x}^{(i)}) + b)\right)} \quad + \quad \dfrac{1}{2} \mathbf{w}^t \cdot \mathbf{w}$

The kernel trick replaces $\phi(\mathbf{u})^t \cdot \phi(\mathbf{v})$ by $K(\mathbf{u}, \mathbf{v})$. Since the weight vector $\mathbf{w}$ is not transformed, the kernel trick cannot be applied to the cost function above.

The cost function above corresponds to the primal form of the SVM objective:

$\underset{\mathbf{w}, b, \mathbf{\zeta}}\min{C \sum\limits_{i=1}^m{\zeta^{(i)}} + \dfrac{1}{2}\mathbf{w}^t \cdot \mathbf{w}}$

subject to $y^{(i)}(\mathbf{w}^t \cdot \phi(\mathbf{x}^{(i)}) + b) \ge 1 - \zeta^{(i)}$ and $\zeta^{(i)} \ge 0$ for $i=1, \cdots, m$

The dual form is:

$\underset{\mathbf{\alpha}}\min{\dfrac{1}{2}\mathbf{\alpha}^t \cdot \mathbf{Q} \cdot \mathbf{\alpha} - \mathbf{1}^t \cdot \mathbf{\alpha}}$

subject to $\mathbf{y}^t \cdot \mathbf{\alpha} = 0$ and $0 \le \alpha_i \le C$ for $i = 1, 2, \cdots, m$

where $\mathbf{1}$ is a vector full of 1s and $\mathbf{Q}$ is an $m \times m$ matrix with elements $Q_{ij} = y^{(i)} y^{(j)} \phi(\mathbf{x}^{(i)})^t \cdot \phi(\mathbf{x}^{(j)})$.

Now we can use the kernel trick by computing $Q_{ij}$ like so:

$Q_{ij} = y^{(i)} y^{(j)} K(\mathbf{x}^{(i)}, \mathbf{x}^{(j)})$

So the kernel trick can only be used on the dual form of the SVM problem (plus some other algorithms such as logistic regression).
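As a concrete sketch, $\mathbf{Q}$ can be assembled directly from the formula above; the Gaussian RBF kernel used here is just an assumed example choice:

```python
import numpy as np

def rbf(u, v, gamma=0.5):
    # Gaussian RBF kernel: K(u, v) = exp(-gamma * ||u - v||^2)
    return np.exp(-gamma * np.sum((u - v) ** 2))

def build_Q(X, y, kernel=rbf):
    # Q_ij = y_i y_j K(x_i, x_j), exactly the formula above
    m = len(y)
    return np.array([[y[i] * y[j] * kernel(X[i], X[j])
                      for j in range(m)] for i in range(m)])

X = np.array([[0.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, -1.0])
Q = build_Q(X, y)   # symmetric; unit diagonal for an RBF kernel
```

Note that $\phi$ never appears: the kernel function is evaluated on the raw inputs, which is the whole point of the trick.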

Now you can use off-the-shelf Quadratic Programming libraries to solve this problem, or use Lagrange multipliers to get an unconstrained function (the dual cost function), then search for a minimum using Gradient Descent or any other optimization technique. One of the most efficient approaches seems to be the SMO algorithm, implemented in the libsvm library (for kernelized SVMs).
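As one illustration of the gradient-based route: if the bias term $b$ is dropped, the equality constraint $\mathbf{y}^t \cdot \mathbf{\alpha} = 0$ disappears and only the box constraints remain, so plain projected gradient descent suffices (a sketch under that simplification, not a production solver like SMO):

```python
import numpy as np

def dual_svm_no_bias(Q, C=1.0, lr=0.001, iters=2000):
    """Projected gradient descent on min 0.5 a'Qa - 1'a s.t. 0 <= a <= C.
    The bias b is dropped, so the constraint y'a = 0 is gone; sketch only."""
    alpha = np.zeros(Q.shape[0])
    for _ in range(iters):
        grad = Q @ alpha - 1.0                       # gradient of the dual
        alpha = np.clip(alpha - lr * grad, 0.0, C)   # project onto the box
    return alpha

# Toy usage with a linear kernel: Q_ij = y_i y_j x_i . x_j
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
Q = (y[:, None] * y[None, :]) * (X @ X.T)
alpha = dual_svm_no_bias(Q)
scores = (X @ X.T) @ (alpha * y)  # decision values sum_i a_i y_i K(x_i, x)
```

With the bias kept, the projection step would also have to enforce $\mathbf{y}^t \cdot \mathbf{\alpha} = 0$, which is why SMO's pairwise updates (which preserve that constraint exactly) are the standard choice.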

Find elsewhere
🌐
GitHub
github.com › MarRist › SVM-with-Stochastic-Gradient-Descent
GitHub - MarRist/SVM-with-Stochastic-Gradient-Descent: This repository contains projects that were written for Machine Learning course at University of Toronto
This is an implementation of Stochastic Gradient Descent with momentum β and learning rate α. The implemented algorithm is then used to approximately optimize the SVM objective.
Starred by 2 users
Forked by 2 users
Languages   Python 100.0%
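The update rule that repository describes (velocity with momentum β, step size α) can be sketched generically; this is the standard momentum-SGD form and an assumption about the repo's internals, not its actual code:

```python
import numpy as np

def sgd_momentum(grad_fn, w0, alpha=0.01, beta=0.9, steps=300):
    """Generic SGD with momentum: v <- beta*v + grad(w); w <- w - alpha*v.
    In the SVM setting, grad_fn would be a (stochastic) subgradient of
    the regularized hinge-loss objective."""
    w = w0.copy()
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v + grad_fn(w)   # accumulate velocity
        w = w - alpha * v           # step along the smoothed direction
    return w

# Toy usage on a smooth stand-in objective f(w) = ||w - 3||^2
w = sgd_momentum(lambda w: 2.0 * (w - 3.0), np.array([0.0]))
```

The velocity term damps the oscillation of raw SGD steps, which is why momentum is commonly paired with the noisy per-sample gradients of an SVM objective.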
🌐
arXiv
arxiv.org › abs › 1304.6383
[1304.6383] The Stochastic Gradient Descent for the Primal L1-SVM Optimization Revisited
January 25, 2014 - We reconsider the stochastic (sub)gradient approach to the unconstrained primal L1-SVM optimization. We observe that if the learning rate is inversely proportional to the number of steps, i.e., the number of times any training pattern is presented to the algorithm, the update rule may be transformed into the one of the classical perceptron with margin in which the margin threshold increases linearly with the number of steps.
🌐
MIT CSAIL
people.csail.mit.edu › dsontag › courses › ml16 › slides › lecture5.pdf pdf
Support vector machines (SVMs) Lecture 5 David Sontag
Soft margin SVM · Subgradient (for non-differentiable functions) · (Sub)gradient descent of the SVM objective
🌐
JMLR
jmlr.org › papers › v13 › wang12b.html
Breaking the Curse of Kernelization: Budgeted Stochastic Gradient Descent for Large-Scale SVM Training
Online algorithms that process one example at a time are advantageous when dealing with very large data or with data streams. Stochastic Gradient Descent (SGD) is such an algorithm and it is an attractive choice for online Support Vector Machine (SVM) training due to its simplicity and ...
🌐
Kaggle
kaggle.com › code › residentmario › support-vector-machines-and-stoch-gradient-descent
Support vector machines and stoch gradient descent
🌐
arXiv
arxiv.org › abs › 1905.01219
[1905.01219] Performance Optimization on Model Synchronization in Parallel Stochastic Gradient Descent Based SVM
May 3, 2019 - In this research, we identify the bottlenecks in model synchronization in parallel stochastic gradient descent (PSGD)-based SVM algorithm with respect to the training model synchronization frequency (MSF). Our research shows that by optimizing the MSF in the data sets that we used, a reduction ...
🌐
scikit-learn
scikit-learn.org › 1.5 › modules › sgd.html
1.5. Stochastic Gradient Descent — scikit-learn 1.5.2 documentation
The implementation of SGD is influenced by the Stochastic Gradient SVM of [7]. Similar to SvmSGD, the weight vector is represented as the product of a scalar and a vector which allows an efficient weight update in the case of L2 regularization. In the case of sparse input X, the intercept is updated with a smaller learning rate (multiplied by 0.01) to account for the fact that it is updated more frequently.
🌐
scikit-learn
scikit-learn.org › stable › auto_examples › linear_model › plot_sgdocsvm_vs_ocsvm.html
One-Class SVM versus One-Class SVM using Stochastic Gradient Descent — scikit-learn 1.8.0 documentation
This example shows how to approximate the solution of sklearn.svm.OneClassSVM in the case of an RBF kernel with sklearn.linear_model.SGDOneClassSVM, a Stochastic Gradient Descent (SGD) version of the One-Class SVM.
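The key idea behind that example is approximating the RBF kernel with an explicit finite-dimensional feature map, so a linear SGD model can stand in for a kernelized one. A generic sketch of that idea using random Fourier features (an illustration, not the scikit-learn example's actual code, which relies on a kernel-approximation transformer):

```python
import numpy as np

def random_fourier_features(X, gamma=0.5, D=2000, seed=0):
    """Random Fourier feature map z(x) such that z(x) . z(y) approximates
    the RBF kernel exp(-gamma * ||x - y||^2).  With an explicit map like
    this, any linear SGD model can approximate a kernelized one."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, D))  # spectral samples
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)                # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
Z = random_fourier_features(X)
# Z @ Z.T now approximates the exact RBF Gram matrix of X
```

The approximation error shrinks as $1/\sqrt{D}$, so the number of random features trades accuracy against the cost of the linear model that consumes them.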
🌐
GitHub
github.com › tpeng › svmsgd
GitHub - tpeng/svmsgd: A svm solver with stochastic gradient descent
A svm solver with stochastic gradient descent. Contribute to tpeng/svmsgd development by creating an account on GitHub.
Author   tpeng
🌐
GitHub
github.com › joaofaro › SVMSGD
GitHub - joaofaro/SVMSGD: Stochastic Gradient Descent SVM classifier
This repository is meant to provide an easy-to-use implementation of the SVM classifier using the Stochastic Gradient Descent. This approach followed the one presented in Bottou, Léon. "Large-scale machine learning with stochastic gradient descent." Proceedings of COMPSTAT'2010.
Starred by 25 users
Forked by 2 users
Languages   C++ 93.7% | Makefile 6.3%
🌐
Wiley Online Library
onlinelibrary.wiley.com › doi › abs › 10.1002 › cpe.6292
Stochastic gradient descent‐based support vector machines training optimization on Big Data and HPC frameworks - Abeykoon - 2022 - Concurrency and Computation: Practice and Experience - Wiley Online Library
March 30, 2021 - With the increasing amount of research data nowadays, understanding how to do efficient training is more important than ever. This article discusses the performance optimizations and benchmarks related to providing high-performance support for SVM training. In this research, we have focused on a highly scalable gradient descent-based approach to implementing the core SVM algorithm.