🌐
scikit-learn
scikit-learn.org › stable › modules › sgd.html
1.5. Stochastic Gradient Descent — scikit-learn 1.8.0 documentation
The implementation of SGD is influenced by the Stochastic Gradient SVM of [7]. Similar to SvmSGD, the weight vector is represented as the product of a scalar and a vector which allows an efficient weight update in the case of \(L_2\) regularization. In the case of sparse input X, the intercept is updated with a smaller learning rate (multiplied by 0.01) to account for the fact that it is updated more frequently.
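The scalar-times-vector representation mentioned in that snippet can be sketched as follows (an illustration of the idea, not scikit-learn's actual code): storing $w = \text{scale} \cdot v$ turns the $L_2$ shrink step $w \leftarrow (1 - \eta\alpha)\,w$ into an O(1) scalar multiply instead of an O(dim) vector scan.

```python
import numpy as np

class ScaledWeights:
    """Weight vector stored as w = scale * v, so L2 shrinkage is O(1)."""

    def __init__(self, dim):
        self.v = np.zeros(dim)
        self.scale = 1.0

    def l2_shrink(self, eta, alpha):
        # w <- (1 - eta * alpha) * w, touching only the scalar
        self.scale *= 1.0 - eta * alpha

    def add(self, eta, x):
        # w <- w + eta * x; since w = scale * v, update v by (eta/scale) * x
        self.v += (eta / self.scale) * x

    def dot(self, x):
        return self.scale * np.dot(self.v, x)
```

One caveat of this trick: repeated shrinkage drives `scale` toward zero, so practical implementations occasionally fold `scale` back into `v` to avoid numerical underflow.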
🌐
University of Utah
users.cs.utah.edu › ~zhe › pdf › lec-19-2-svm-sgd-upload.pdf pdf
1 Support Vector Machines: Training with Stochastic Gradient Descent
Outline: Training SVM by optimization · 1. Review of convex functions and gradient descent · 2. Stochastic gradient descent · 3. Gradient descent vs stochastic gradient descent · 4. Sub-derivatives of the hinge loss · 5. Stochastic sub-gradient descent for SVM ·
🌐
Towards Data Science
towardsdatascience.com › home › latest › stochastic gradient descent implementation for softsvm
Stochastic gradient descent implementation for SoftSVM | Towards Data Science
March 5, 2025 - This is a quick tutorial on how to implement the Stochastic Gradient Descent (SGD) optimization method for SoftSVM on MATLAB to find a linear classifier with minimal empirical loss.
🌐
Springer
link.springer.com › home › machine learning and knowledge discovery in databases › conference paper
The Stochastic Gradient Descent for the Primal L1-SVM Optimization Revisited | SpringerLink
We reconsider the stochastic (sub)gradient approach to the unconstrained primal L1-SVM optimization. We observe that if the learning rate is inversely proportional to the number of steps, i.e., the number of times any training pattern is presented to the algorithm,...
🌐
Svivek
svivek.com › teaching › machine-learning › lectures › slides › svm › svm-sgd.pdf pdf
Training with Stochastic Gradient Descent
Outline: Training SVM by optimization · 1. Review of convex functions and gradient descent · 2. Stochastic gradient descent · 3. Gradient descent vs stochastic gradient descent · 4. Sub-derivatives of the hinge loss · 5. Stochastic sub-gradient descent for SVM ·
🌐
Svivek
svivek.com › teaching › machine-learning › fall2018 › slides › svm › svm-sgd.pdf pdf
Machine Learning Support Vector Machines: Training with
Outline: Training SVM by optimization · 1. Review of convex functions and gradient descent · 2. Stochastic gradient descent · 3. Gradient descent vs stochastic gradient descent · 4. Sub-derivatives of the hinge loss · 5. Stochastic sub-gradient descent for SVM ·
🌐
scikit-learn
scikit-learn.org › stable › modules › generated › sklearn.linear_model.SGDClassifier.html
SGDClassifier — scikit-learn 1.8.0 documentation
This estimator implements regularized linear models with stochastic gradient descent (SGD) learning: the gradient of the loss is estimated one sample at a time and the model is updated along the way with a decreasing strength ...
🌐
MaviccPRP@web.studio
maviccprp.github.io › a-support-vector-machine-in-just-a-few-lines-of-python-code
A Support Vector Machine in just a few Lines of Python Code
April 3, 2017 - As for the perceptron, we use python 3 and numpy. The SVM will learn using the stochastic gradient descent algorithm (SGD). SGD minimizes a function by following the gradients of the cost function.
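In the same spirit as that post, a minimal hinge-loss SGD trainer in numpy might look like the sketch below (using a Pegasos-style $1/t$ step size; this is an illustration, not the post's actual code):

```python
import numpy as np

def svm_sgd(X, y, lam=0.01, epochs=100, seed=0):
    """Minimal linear SVM trained by SGD on the regularized hinge loss.
    Labels y must be +/-1. Illustrative sketch, not tuned for speed."""
    rng = np.random.default_rng(seed)
    n, d = X.shape
    w = np.zeros(d)
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = 1.0 / (lam * t)              # decreasing 1/t step size
            if y[i] * np.dot(w, X[i]) < 1:     # hinge loss is active
                w = (1 - eta * lam) * w + eta * y[i] * X[i]
            else:                              # only the L2 term contributes
                w = (1 - eta * lam) * w
    return w

# Toy usage: two linearly separable clusters
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w = svm_sgd(X, y)
```

On each sample the update follows the subgradient of $\frac{\lambda}{2}\|w\|^2 + \max(0, 1 - y_i w^t x_i)$: the shrink factor $(1 - \eta\lambda)$ comes from the regularizer, and the $\eta\, y_i x_i$ term fires only when the margin is violated.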
Top answer
1 of 3
18

Set $\mathbf w = \phi(\mathbf x)\cdot \mathbf u$ so that $\mathbf w^t \phi(\mathbf x)=\mathbf u^t \cdot \mathbf K$ and $\mathbf w^t\mathbf w = \mathbf u^t\mathbf K\mathbf u$, with $\mathbf K = \phi(\mathbf x)^t\phi(\mathbf x)$, where $\phi(\mathbf x)$ is a mapping of the original input matrix, $\mathbf x$. This allows one to solve the SVM through the primal formulation. Using your notation for the loss:

$$J(\mathbf{w}, b) = C {\displaystyle \sum\limits_{i=1}^{m} max\left(0, 1 - y^{(i)} (\mathbf{u}^t \cdot \mathbf{K}^{(i)} + b)\right)} + \dfrac{1}{2} \mathbf{u}^t \cdot \mathbf{K} \cdot \mathbf{u}$$

$\mathbf{K}$ is an $m \times m$ matrix, and $\mathbf{u}$ is an $m \times 1$ vector. Neither is infinite.
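A minimal sketch of this primal, kernelized formulation with subgradient descent on $\mathbf{u}$ and $b$ (illustrative only, not an optimized solver) could look like:

```python
import numpy as np

def primal_kernel_svm(K, y, C=1.0, lr=0.01, iters=500):
    """Subgradient descent on J(u, b) = C * sum_i max(0, 1 - y_i (u'K_i + b))
    + 0.5 * u'Ku.  K is the m x m kernel matrix, y holds +/-1 labels."""
    m = K.shape[0]
    u, b = np.zeros(m), 0.0
    for _ in range(iters):
        f = K @ u + b                      # decision values u' K_i + b
        viol = y * f < 1                   # samples with active hinge loss
        # subgradient wrt u: K u (regularizer) minus C * sum of y_i K_i
        # over the margin violators
        u -= lr * (K @ u - C * (K[:, viol] @ y[viol]))
        b -= lr * (-C * y[viol].sum())
    return u, b

# Toy usage with the simplest kernel, K = X X^T (linear)
X = np.array([[2.0, 2.0], [3.0, 3.0], [-2.0, -2.0], [-3.0, -3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
K = X @ X.T
u, b = primal_kernel_svm(K, y)
```

Any positive semidefinite kernel matrix can be passed in place of the linear one; only $\mathbf{K}$ changes, never the solver.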

Indeed, the dual is usually faster to solve, but the primal has its advantages as well, such as approximate solutions (which are not guaranteed in the dual formulation).


Why the dual became so much more prominent is not obvious at all; Chapelle [1] remarks:

The historical reasons for which most of the research in the last decade has been about dual optimization are unclear. We believe that it is because SVMs were first introduced in their hard margin formulation [Boser et al., 1992], for which a dual optimization (because of the constraints) seems more natural. In general, however, soft margin SVMs should be preferred, even if the training data are separable: the decision boundary is more robust because more training points are taken into account [Chapelle et al., 2000]


Chapelle (2007) argues that the time complexity of both primal and dual optimization is $\mathcal{O}\left(nn_{sv} + n_{sv}^3\right)$, with a worst case of $\mathcal{O}\left(n^3\right)$. Note, however, that the analysis covers quadratic and smoothed approximations of the hinge loss rather than the hinge loss proper, since the latter is not differentiable and so cannot be used with Newton's method.


[1] Chapelle, O. (2007). Training a support vector machine in the primal. Neural computation, 19(5), 1155-1178.

2 of 3
6

If we apply a transformation $\phi$ to all input vectors ($\mathbf{x}^{(i)}$), we get the following cost function:

$J(\mathbf{w}, b) = C {\displaystyle \sum\limits_{i=1}^{m} max\left(0, 1 - y^{(i)} (\mathbf{w}^t \cdot \phi(\mathbf{x}^{(i)}) + b)\right)} \quad + \quad \dfrac{1}{2} \mathbf{w}^t \cdot \mathbf{w}$

The kernel trick replaces $\phi(\mathbf{u})^t \cdot \phi(\mathbf{v})$ by $K(\mathbf{u}, \mathbf{v})$. Since the weight vector $\mathbf{w}$ is not transformed, the kernel trick cannot be applied to the cost function above.

The cost function above corresponds to the primal form of the SVM objective:

$\underset{\mathbf{w}, b, \mathbf{\zeta}}\min{C \sum\limits_{i=1}^m{\zeta^{(i)}} + \dfrac{1}{2}\mathbf{w}^t \cdot \mathbf{w}}$

subject to $y^{(i)}(\mathbf{w}^t \cdot \phi(\mathbf{x}^{(i)}) + b) \ge 1 - \zeta^{(i)}$ and $\zeta^{(i)} \ge 0$ for $i=1, \cdots, m$

The dual form is:

$\underset{\mathbf{\alpha}}\min{\dfrac{1}{2}\mathbf{\alpha}^t \cdot \mathbf{Q} \cdot \mathbf{\alpha} - \mathbf{1}^t \cdot \mathbf{\alpha}}$

subject to $\mathbf{y}^t \cdot \mathbf{\alpha} = 0$ and $0 \le \alpha_i \le C$ for $i = 1, 2, \cdots, m$

where $\mathbf{1}$ is a vector full of 1s and $\mathbf{Q}$ is an $m \times m$ matrix with elements $Q_{ij} = y^{(i)} y^{(j)} \phi(\mathbf{x}^{(i)})^t \cdot \phi(\mathbf{x}^{(j)})$.

Now we can use the kernel trick by computing $Q_{ij}$ like so:

$Q_{ij} = y^{(i)} y^{(j)} K(\mathbf{x}^{(i)}, \mathbf{x}^{(j)})$

So the kernel trick can only be used on the dual form of the SVM problem (plus some other algorithms such as logistic regression).
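As a concrete sketch, $\mathbf{Q}$ can be assembled directly from the formula above; the Gaussian RBF kernel used here is just an assumed example choice:

```python
import numpy as np

def rbf(u, v, gamma=0.5):
    # Gaussian RBF kernel: K(u, v) = exp(-gamma * ||u - v||^2)
    return np.exp(-gamma * np.sum((u - v) ** 2))

def build_Q(X, y, kernel=rbf):
    # Q_ij = y_i y_j K(x_i, x_j), exactly the formula above
    m = len(y)
    return np.array([[y[i] * y[j] * kernel(X[i], X[j])
                      for j in range(m)] for i in range(m)])

X = np.array([[0.0, 0.0], [1.0, 0.0]])
y = np.array([1.0, -1.0])
Q = build_Q(X, y)   # symmetric; unit diagonal for an RBF kernel
```

Note that $\phi$ never appears: the kernel function is evaluated on the raw inputs, which is the whole point of the trick.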

Now you can use off-the-shelf Quadratic Programming libraries to solve this problem, or use Lagrange multipliers to get an unconstrained function (the dual cost function), then search for a minimum using Gradient Descent or any other optimization technique. One of the most efficient approaches seems to be the SMO algorithm, implemented in the libsvm library (for kernelized SVMs).
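As one illustration of the gradient-based route: if the bias term $b$ is dropped, the equality constraint $\mathbf{y}^t \cdot \mathbf{\alpha} = 0$ disappears and only the box constraints remain, so plain projected gradient descent suffices (a sketch under that simplification, not a production solver like SMO):

```python
import numpy as np

def dual_svm_no_bias(Q, C=1.0, lr=0.001, iters=2000):
    """Projected gradient descent on min 0.5 a'Qa - 1'a s.t. 0 <= a <= C.
    The bias b is dropped, so the constraint y'a = 0 is gone; sketch only."""
    alpha = np.zeros(Q.shape[0])
    for _ in range(iters):
        grad = Q @ alpha - 1.0                       # gradient of the dual
        alpha = np.clip(alpha - lr * grad, 0.0, C)   # project onto the box
    return alpha

# Toy usage with a linear kernel: Q_ij = y_i y_j x_i . x_j
X = np.array([[1.0, 2.0], [2.0, 1.0], [-1.0, -2.0], [-2.0, -1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
Q = (y[:, None] * y[None, :]) * (X @ X.T)
alpha = dual_svm_no_bias(Q)
scores = (X @ X.T) @ (alpha * y)  # decision values sum_i a_i y_i K(x_i, x)
```

With the bias kept, the projection step would also have to enforce $\mathbf{y}^t \cdot \mathbf{\alpha} = 0$, which is why SMO's pairwise updates (which preserve that constraint exactly) are the standard choice.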

Find elsewhere
🌐
GitHub
github.com › MarRist › SVM-with-Stochastic-Gradient-Descent
GitHub - MarRist/SVM-with-Stochastic-Gradient-Descent: This repository contains projects that were written for Machine Learning course at University of Toronto
This is an implementation of Stochastic Gradient Descent with momentum β and learning rate α. The implemented algorithm is then used to approximately optimize the SVM objective.
Starred by 2 users
Forked by 2 users
Languages   Python 100.0%
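The update rule that repository describes (velocity with momentum β, step size α) can be sketched generically; this is the standard momentum-SGD form and an assumption about the repo's internals, not its actual code:

```python
import numpy as np

def sgd_momentum(grad_fn, w0, alpha=0.01, beta=0.9, steps=300):
    """Generic SGD with momentum: v <- beta*v + grad(w); w <- w - alpha*v.
    In the SVM setting, grad_fn would be a (stochastic) subgradient of
    the regularized hinge-loss objective."""
    w = w0.copy()
    v = np.zeros_like(w)
    for _ in range(steps):
        v = beta * v + grad_fn(w)   # accumulate velocity
        w = w - alpha * v           # step along the smoothed direction
    return w

# Toy usage on a smooth stand-in objective f(w) = ||w - 3||^2
w = sgd_momentum(lambda w: 2.0 * (w - 3.0), np.array([0.0]))
```

The velocity term damps the oscillation of raw SGD steps, which is why momentum is commonly paired with the noisy per-sample gradients of an SVM objective.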
🌐
arXiv
arxiv.org › abs › 1304.6383
[1304.6383] The Stochastic Gradient Descent for the Primal L1-SVM Optimization Revisited
January 25, 2014 - We reconsider the stochastic (sub)gradient approach to the unconstrained primal L1-SVM optimization. We observe that if the learning rate is inversely proportional to the number of steps, i.e., the number of times any training pattern is presented to the algorithm, the update rule may be transformed into the one of the classical perceptron with margin in which the margin threshold increases linearly with the number of steps.
🌐
MIT CSAIL
people.csail.mit.edu › dsontag › courses › ml16 › slides › lecture5.pdf pdf
Support vector machines (SVMs) Lecture 5 David Sontag
Soft margin SVM · Subgradient (for non-differentiable functions) · (Sub)gradient descent of the SVM objective
🌐
JMLR
jmlr.org › papers › v13 › wang12b.html
Breaking the Curse of Kernelization: Budgeted Stochastic Gradient Descent for Large-Scale SVM Training
Online algorithms that process one example at a time are advantageous when dealing with very large data or with data streams. Stochastic Gradient Descent (SGD) is such an algorithm and it is an attractive choice for online Support Vector Machine (SVM) training due to its simplicity and ...
🌐
Kaggle
kaggle.com › code › residentmario › support-vector-machines-and-stoch-gradient-descent
Support vector machines and stoch gradient descent
🌐
arXiv
arxiv.org › abs › 1905.01219
[1905.01219] Performance Optimization on Model Synchronization in Parallel Stochastic Gradient Descent Based SVM
May 3, 2019 - In this research, we identify the bottlenecks in model synchronization in parallel stochastic gradient descent (PSGD)-based SVM algorithm with respect to the training model synchronization frequency (MSF). Our research shows that by optimizing the MSF in the data sets that we used, a reduction ...
🌐
scikit-learn
scikit-learn.org › 1.5 › modules › sgd.html
1.5. Stochastic Gradient Descent — scikit-learn 1.5.2 documentation
The implementation of SGD is influenced by the Stochastic Gradient SVM of [7]. Similar to SvmSGD, the weight vector is represented as the product of a scalar and a vector which allows an efficient weight update in the case of L2 regularization. In the case of sparse input X, the intercept is updated with a smaller learning rate (multiplied by 0.01) to account for the fact that it is updated more frequently.
🌐
scikit-learn
scikit-learn.org › stable › auto_examples › linear_model › plot_sgdocsvm_vs_ocsvm.html
One-Class SVM versus One-Class SVM using Stochastic Gradient Descent — scikit-learn 1.8.0 documentation
This example shows how to approximate the solution of sklearn.svm.OneClassSVM in the case of an RBF kernel with sklearn.linear_model.SGDOneClassSVM, a Stochastic Gradient Descent (SGD) version of the One-Class SVM.
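The key idea behind that example is approximating the RBF kernel with an explicit finite-dimensional feature map, so a linear SGD model can stand in for a kernelized one. A generic sketch of that idea using random Fourier features (an illustration, not the scikit-learn example's actual code, which relies on a kernel-approximation transformer):

```python
import numpy as np

def random_fourier_features(X, gamma=0.5, D=2000, seed=0):
    """Random Fourier feature map z(x) such that z(x) . z(y) approximates
    the RBF kernel exp(-gamma * ||x - y||^2).  With an explicit map like
    this, any linear SGD model can approximate a kernelized one."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(scale=np.sqrt(2.0 * gamma), size=(d, D))  # spectral samples
    b = rng.uniform(0.0, 2.0 * np.pi, size=D)                # random phases
    return np.sqrt(2.0 / D) * np.cos(X @ W + b)

X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])
Z = random_fourier_features(X)
# Z @ Z.T now approximates the exact RBF Gram matrix of X
```

The approximation error shrinks as $1/\sqrt{D}$, so the number of random features trades accuracy against the cost of the linear model that consumes them.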
🌐
GitHub
github.com › tpeng › svmsgd
GitHub - tpeng/svmsgd: A svm solver with stochastic gradient descent
A svm solver with stochastic gradient descent. Contribute to tpeng/svmsgd development by creating an account on GitHub.
Author   tpeng
🌐
GitHub
github.com › joaofaro › SVMSGD
GitHub - joaofaro/SVMSGD: Stochastic Gradient Descent SVM classifier
This repository is meant to provide an easy-to-use implementation of the SVM classifier using the Stochastic Gradient Descent. This approach followed the one presented in Bottou, Léon. "Large-scale machine learning with stochastic gradient descent." Proceedings of COMPSTAT'2010.
Starred by 25 users
Forked by 2 users
Languages   C++ 93.7% | Makefile 6.3%
🌐
Wiley Online Library
onlinelibrary.wiley.com › doi › abs › 10.1002 › cpe.6292
Stochastic gradient descent‐based support vector machines training optimization on Big Data and HPC frameworks - Abeykoon - 2022 - Concurrency and Computation: Practice and Experience - Wiley Online Library
March 30, 2021 - With the increasing amount of research data nowadays, understanding how to do efficient training is more important than ever. This article discusses the performance optimizations and benchmarks related to providing high-performance support for SVM training. In this research, we have focused on a highly scalable gradient descent-based approach to implementing the core SVM algorithm.