🌐
Springer
link.springer.com › home › mathematical programming › article
Pegasos: primal estimated sub-gradient solver for SVM | Mathematical Programming | Springer Nature Link
October 16, 2010 - We describe and analyze a simple and effective stochastic sub-gradient descent algorithm for solving the optimization problem cast by Support Vector Machines (SVM). We prove that the number of iterations required to obtain a solution of accuracy \({\epsilon}\) is \({\tilde{O}(1 / \epsilon)}\), where each iteration operates on a single training example.
🌐
MIT CSAIL
people.csail.mit.edu › dsontag › courses › ml16 › slides › lecture5.pdf pdf
Support vector machines (SVMs) Lecture 5 David Sontag
Soft margin SVM · Subgradient · (for non-differentiable functions) · (Sub)gradient descent of SVM objective
🌐
University of Utah
users.cs.utah.edu › ~zhe › pdf › lec-19-2-svm-sgd-upload.pdf pdf
1 Support Vector Machines: Training with Stochastic Gradient Descent
Outline: Training SVM by optimization · 1. Review of convex functions and gradient descent · 2. Stochastic gradient descent · 3. Gradient descent vs stochastic gradient descent · 4. Sub-derivatives of the hinge loss · 5. Stochastic sub-gradient descent for SVM ·
🌐
Davidrosenberg
davidrosenberg.github.io › mlcourse › Archive › 2018 › Lectures › 03c.subgradient-descent.pdf pdf
Subgradient Descent David S. Rosenberg New York University February 7, 2018
Gradient Descent on SVM Objective? If we blindly apply gradient descent from a random starting point · seems unlikely that we’ll hit a point where the gradient is undefined. Still, doesn’t mean that gradient descent will work if objective not differentiable! Theory of subgradients and subgradient descent will clear up any uncertainty.
🌐
GitHub
github.com › josiahw › SimpleSVM
GitHub - josiahw/SimpleSVM: A simple sub-gradient SVM implementation in python and numpy
A simple sub-gradient SVM implementation in python and numpy - josiahw/SimpleSVM
Author   josiahw
🌐
TTIC
home.ttic.edu › ~nati › Publications › PegasosMPB.pdf pdf
Mathematical Programming manuscript No. (will be inserted by the editor)
31. Shalev-Shwartz, S., Singer, Y., Srebro, N.: Pegasos: Primal Estimated sub-GrAdient SOlver for SVM.
🌐
MIT CSAIL
people.csail.mit.edu › dsontag › courses › ml16 › slides › notes_convexity16.pdf pdf
Gradient, Subgradient and how they may affect your grade(ient)
February 7, 2016 - The subgradient of a convex function f at w0 is formally defined as all vectors v such that ... Notice that the SVM objective contains an average over data points. We can approximate · this average by looking at a single data point at a time. This serves as the basis for stochastic · gradient descent methods.
🌐
Svivek
svivek.com › teaching › lectures › slides › svm › svm-sgd.pdf pdf
Training with Stochastic Gradient Descent
Outline: Training SVM by optimization · 1. Review of convex functions and gradient descent · 2. Stochastic gradient descent · 3. Gradient descent vs stochastic gradient descent · 4. Sub-derivatives of the hinge loss · 5. Stochastic sub-gradient descent for SVM ·
🌐
Hebrew University of Jerusalem
cs.huji.ac.il › ~shais › papers › ShalevSiSrCo10.pdf pdf
the optima of the SVM objectives are 2.88%, 14.9%, 0.457% and 0.57%. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM
🌐
arXiv
arxiv.org › abs › 1812.02261
[1812.02261] GADGET SVM: A Gossip-bAseD sub-GradiEnT Solver for Linear SVMs
December 5, 2018 - In this paper, we present a distributed algorithm for learning linear Support Vector Machines in the primal form for binary classification called Gossip-bAseD sub-GradiEnT (GADGET) SVM.
Top answer
1 of 3
12

Let's start with basics. The so-called gradient is just the ordinary derivative, that is, the slope. For example, the slope of the linear function $y=kx+b$ equals $k$, so its gradient w.r.t. $x$ equals $k$. If $x$ and $k$ are vectors rather than numbers, then the gradient is also a vector.

Another piece of good news is that the gradient is a linear operator: you can add functions and multiply by constants before or after differentiation, and it makes no difference.

Now take the definition of the SVM loss function for a single $i$-th observation. It is

$\mathrm{loss} = \max(0, \mathrm{something} - w_y*x)$

where $\mathrm{something}=wx+\Delta$. Thus, the loss equals $\mathrm{something}-w_y*x$ if the latter is non-negative, and $0$ otherwise.

In the first (non-negative) case, the loss $\mathrm{something}-w_y*x$ is linear in $w_y$, so the gradient is just the slope of this function of $w_y$, that is, $-x$.

In the second (negative) case the loss $0$ is constant, so its derivative is also $0$.

To write both cases in one equation, we use the indicator function $I(x)$, which equals $1$ if $x$ is true, and $0$ otherwise. With this function, we can write

$\mathrm{derivative} = I(\mathrm{something} - w_y*x > 0) * (-x)$

If $\mathrm{something} - w_y*x > 0$, the first multiplier equals 1, and the gradient equals $-x$. Otherwise, the first multiplier equals 0, and so does the gradient. So I just rewrote the two cases in a single line.

Now let's turn from a single $i$-th observation to the whole loss. The loss is the sum of individual losses. Thus, because differentiation is linear, the gradient of a sum equals the sum of gradients, so we can write

$\text{total derivative} = \sum_i I(\mathrm{something} - w_y*x_i > 0) * (-x_i)$

Now, move the $-$ multiplier from $x_i$ to the beginning of the formula, and you will get your expression.
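A quick Python (NumPy) sketch of the two cases, with made-up vectors for illustration (the names `hinge_term_grad`, `w_j`, `w_y` and the numbers are assumptions for this example, not from the question):

```python
import numpy as np

def hinge_term_grad(w_j, w_y, x, delta=1.0):
    """Subgradient of max(0, w_j*x - w_y*x + delta) w.r.t. w_y.

    Implements the single-line form above: I(margin violated) * (-x).
    """
    something = w_j @ x + delta          # "something" = w_j*x + Delta
    return (something - w_y @ x > 0) * (-x)

x   = np.array([1.0, 2.0])
w_j = np.array([0.5, 0.5])

# Case 1: margin violated, so the gradient is -x.
g_violated  = hinge_term_grad(w_j, np.array([0.1, 0.1]), x)

# Case 2: margin satisfied (w_y*x is large), so the gradient is 0.
g_satisfied = hinge_term_grad(w_j, np.array([10.0, 10.0]), x)
```

Note that the indicator multiplication `(condition) * (-x)` is exactly the single-line rewrite of the two cases described above.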

2 of 3
1

David has provided a good answer. But I would point out that the sum() in David's answer:

total_derivative = sum(I(something - w_y*x[i] > 0) * (-x[i]))

is different from the one in Nikhil's original question:

$$ \def\w{{\mathbf w}} \nabla_{\w_{y_i}} L_i=-\left[\sum_{j\ne y_i} \mathbf{I}( \w_j^T x_i - \w_{y_i}^T x_i + \Delta >0) \right] x_i $$ The above equation is still the gradient due to the i-th observation, but for the weight of the ground truth class, i.e. $w_{y_i}$. There is the summation $\sum_{j \ne y_i}$, because $w_{y_i}$ is in every term of the SVM loss $L_i$:

$$ \def\w{{\mathbf w}} L_i = \sum_{j \ne y_i} \max (0, \w_j^T x_i - \w_{y_i}^T x_i + \Delta) $$ For every non-zero term, i.e. $w^T_j x_i - w^T_{y_i} x_i + \Delta > 0$, you would obtain the gradient $-x_i$. In total, the gradient $\nabla_{w_{y_i}} L_i$ is $\text{numOfNonZeroTerms} \times (-x_i)$, same as the equation above.

Gradients of individual observations $\nabla L_i$ (computed above) are then averaged to obtain the gradient of the batch of observations $\nabla L$.
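The counting argument can be checked numerically; here is a hedged Python sketch with illustrative weights (the function name and data are assumptions, not from the question):

```python
import numpy as np

def grad_true_class(W, x, y, delta=1.0):
    """Gradient of L_i = sum_{j != y} max(0, w_j.x - w_y.x + delta)
    w.r.t. the true-class weights w_y: numOfNonZeroTerms * (-x)."""
    scores = W @ x
    margins = scores - scores[y] + delta
    margins[y] = 0.0                      # j == y is excluded from the sum
    return np.sum(margins > 0) * (-x)

W = np.array([[1.0, 0.0],   # class 0 (the true class here)
              [0.0, 1.0],   # class 1
              [0.5, 0.5]])  # class 2
x = np.array([2.0, 1.0])
g = grad_true_class(W, x, 0)              # one violating class, so g = -x
```

With these numbers only class 2 violates the margin, so the count is 1 and the gradient is $1 \times (-x_i)$.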

🌐
arXiv
arxiv.org › abs › 2206.09311
[2206.09311] Primal Estimated Subgradient Solver for SVM for Imbalanced Classification
November 9, 2023 - Abstract page for arXiv paper 2206.09311: Primal Estimated Subgradient Solver for SVM for Imbalanced Classification
🌐
ResearchGate
researchgate.net › publication › 220589628_Pegasos_Primal_estimated_sub-gradient_solver_for_SVM
Pegasos: Primal estimated sub-gradient solver for SVM | Request PDF
August 5, 2025 - We describe and analyze a simple and effective stochastic sub-gradient descent algorithm for solving the optimization problem cast by Support Vector Machines (SVM).
Top answer
1 of 2
1

The objective function for the Primal Kernel SVM (No bias term):

$$ \arg \min_{\boldsymbol{\beta}} \frac{\lambda}{2} \boldsymbol{\beta}^{\top} \boldsymbol{K} \boldsymbol{\beta} + \frac{1}{n} \sum_{i = 1}^{n} \max \left\{ 0, 1 - {y}_{i} \boldsymbol{k}_{i}^{\top} \boldsymbol{\beta} \right\} $$

The simplest formulation of the Kernel Pegasos algorithm, in my opinion, is in Shai Shalev-Shwartz, Shai Ben-David - Understanding Machine Learning: From Theory to Algorithms, a book by the same first author as the paper Pegasos: Primal Estimated Sub GrAdient SOlver for SVM.

For the Kernel SVM, the book gives the algorithm on page 223 (SGD for Solving Soft SVM with Kernels).

My implementation (Kernel SVM):

function PegasosKernelSVM!( mX :: Matrix{T}, hKₖ :: Function, vY :: Vector{T}, λ :: T, vαₜ :: Vector{T}, vβₜ :: Vector{T} ) where {T <: AbstractFloat}
    # Following Shai Shalev Shwartz, Shai Ben David - Understanding Machine Learning: From Theory to Algorithms.
    # See page 223 - SGD for Solving Soft SVM with Kernels.
    # `hKₖ(vZ, ii)` - Returns the dot product of the `ii` -th data sample with `vZ`.

    numSamples    = length(vY);
    dataDim       = size(mX, 1);
    numIterations = size(mX, 2);

    # First Iteration
    ii = 1;
    tt = ii;
    ηₜ = inv(λ * tt);

    vαₜ .= ηₜ .* vβₜ;

    kk = rand(1:numSamples);
    yₖ = vY[kk];

    valSum  = yₖ * hKₖ(vαₜ, kk);
    vβₜ[kk] += (valSum < one(T)) * yₖ;

    @views mX[:, ii] = vαₜ;

    for ii in 2:numIterations

        tt = ii;
        ηₜ = inv(λ * tt);

        vαₜ .= ηₜ .* vβₜ;

        kk = rand(1:numSamples);
        yₖ = vY[kk];

        valSum  = yₖ * hKₖ(vαₜ, kk);
        vβₜ[kk] += (valSum < one(T)) * yₖ;

        @views mX[:, ii] .= inv(tt) .* ((T(ii - 1) .* mX[:, ii - 1]) .+ vαₜ);
    end

    return mX;

end

Remarks:

  • The algorithm assumes no bias / intercept term. It may be adapted to include one, yet it will lose its speed.
  • The code calculates the whole path of the estimation of the parameter $\boldsymbol{\beta}$ (which is vαₜ in the code).

I verified the implementation against a DCP solver on a non-separable synthetic data set.

The prediction, for a given new sample $\boldsymbol{z}$, is given by:

$$ \boldsymbol{z} \to \operatorname{sign} \left( \sum_{i} {\beta}_{i} k \left( \boldsymbol{z}, \boldsymbol{x}_{i} \right) \right) $$

Where $\boldsymbol{x}_{i}$ is the $i$ -th data sample and $k \left( \cdot, \cdot \right)$ is the kernel function.
It is different from the calculation for the Dual Variable as seen in Prediction of a Sample Given the Optimal Dual Variable of Kernel SVM.
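For illustration only, the prediction rule can be sketched in Python (NumPy) with a Gaussian kernel; the kernel choice, names and data are assumptions for this example, not from the answer:

```python
import numpy as np

def rbf(u, v, gamma=1.0):
    """Gaussian (RBF) kernel: k(u, v) = exp(-gamma * ||u - v||^2)."""
    return np.exp(-gamma * np.sum((u - v) ** 2))

def predict(z, X, beta, kernel=rbf):
    """sign( sum_i beta_i * k(z, x_i) ) -- the prediction rule above."""
    return np.sign(sum(b * kernel(z, x) for b, x in zip(beta, X)))

X    = np.array([[0.0, 0.0], [1.0, 1.0]])  # training samples x_i
beta = np.array([1.0, -1.0])               # coefficients from Pegasos
```

The kernel function is the only place the data enters, which is what lets the algorithm avoid computing any feature map explicitly.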


The code is available on my StackExchange Code GitHub Repository (Look at the CrossValidated\Q215733 folder).

2 of 2
0

The objective function for the Primal Linear SVM (No bias term):

$$ \arg \min_{\boldsymbol{w}} \frac{\lambda}{2} {\left\| \boldsymbol{w} \right\|}_{2}^{2} + \frac{1}{n} \sum_{k = 1}^{n} \max \left\{ 0, 1 - {y}_{k} \boldsymbol{x}_{k}^{\top} \boldsymbol{w} \right\} $$

There is a simpler formulation of the algorithm, in my opinion, in Shai Shalev-Shwartz, Shai Ben-David - Understanding Machine Learning: From Theory to Algorithms, a book by the same first author as the paper Pegasos: Primal Estimated Sub GrAdient SOlver for SVM.

For the Linear SVM, the book gives the algorithm on page 213 (SGD for Solving Soft SVM).

My implementation (Linear SVM):

using LinearAlgebra: dot; # `dot` is used in the margin check below

function PegasosSVM!( mW :: Matrix{T}, hXₖ! :: Function, vY :: Vector{T}, λ :: T, vWₜ :: Vector{T}, vθₜ :: Vector{T}, vXₜ :: Vector{T} ) where {T <: AbstractFloat}
    # Following Shai Shalev Shwartz, Shai Ben David - Understanding Machine Learning: From Theory to Algorithms.
    # See page 213 - SGD for Solving Soft SVM.
    # `hXₖ!(vZ, ii)` - Returns the `ii` -th data sample in `vZ`.

    numSamples    = length(vY);
    dataDim       = size(mW, 1);
    numIterations = size(mW, 2);

    # First iteration
    tt = 1;
    ηₜ = inv(λ * tt);

    vWₜ .= ηₜ .* vθₜ;

    kk = rand(1:numSamples);
    yₖ = vY[kk];

    hXₖ!(vXₜ, kk);

    valSum = yₖ * dot(vWₜ, vXₜ);
    if valSum < one(T)
        vθₜ .+= yₖ .* vXₜ;
    end

    @views mW[:, 1] = vWₜ;


    for ii in 2:numIterations

        tt = ii;
        ηₜ = inv(λ * tt);

        vWₜ .= ηₜ .* vθₜ;

        kk = rand(1:numSamples);
        yₖ = vY[kk];

        hXₖ!(vXₜ, kk);

        valSum = yₖ * dot(vWₜ, vXₜ);
        if valSum < one(T)
            vθₜ .+= yₖ .* vXₜ;
        end

        @views mW[:, ii] .= inv(tt) .* ((T(ii - 1) .* mW[:, ii - 1]) .+ vWₜ);
    end

    return mW;

end

Remarks:

  • The algorithm assumes no bias / intercept term. It may be adapted to include the bias term, yet it will become much slower (the objective goes from strongly convex to merely convex).
  • The code calculates the whole path of the estimation of the parameter $\boldsymbol{w}$.
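For readers without Julia, the same update loop can be sketched in Python (NumPy); this is a minimal sketch under the answer's assumptions (no bias term, averaged iterates), with illustrative data:

```python
import numpy as np

def pegasos_linear(X, y, lam, num_iters, seed=0):
    """Minimal linear Pegasos (no bias term) for the objective above.

    w_t = theta / (lam * t), where theta accumulates y_k * x_k over
    margin-violating samples -- mirroring vθₜ in the Julia code.
    Returns the running average of the iterates.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    w_avg = np.zeros(d)
    for t in range(1, num_iters + 1):
        w = theta / (lam * t)
        k = rng.integers(n)              # pick one random sample
        if y[k] * (X[k] @ w) < 1.0:      # hinge sub-gradient step
            theta += y[k] * X[k]
        w_avg += (w - w_avg) / t         # running average of iterates
    return w_avg

# Toy separable data: the sign of the first coordinate decides the label.
X = np.array([[2.0, 0.0], [-2.0, 0.0]])
y = np.array([1.0, -1.0])
w = pegasos_linear(X, y, lam=0.1, num_iters=500)
```

On this toy data the averaged iterate separates both samples correctly, with the second coordinate staying exactly zero.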

I verified the implementation against a DCP solver on a non-separable synthetic data set.


The code is available on my StackExchange Code GitHub Repository (Look at the CrossValidated\Q215733 folder).

🌐
Monishver11
monishver11.github.io › blog › 2025 › subgradient
Subgradient and Subgradient Descent | Monishver Chandrasekaran
Subgradients are a powerful tool for dealing with non-differentiable convex functions, and subgradient descent provides a straightforward yet effective way to optimize such functions. While slower than gradient descent, subgradient descent shines in scenarios where gradients are unavailable.
Top answer
1 of 3
18

Set $\mathbf w = \phi(\mathbf x)\cdot \mathbf u$ so that $\mathbf w^t \phi(\mathbf x)=\mathbf u^t \cdot \mathbf K$ and $\mathbf w^t\mathbf w = \mathbf u^t\mathbf K\mathbf u$, with $\mathbf K = \phi(\mathbf x)^t\phi(\mathbf x)$, where $\phi(x)$ is a mapping of the original input matrix, $\mathbf x$. This allows one to solve the SVM through the primal formulation. Using your notation for the loss:

$$J(\mathbf{w}, b) = C {\displaystyle \sum\limits_{i=1}^{m} max\left(0, 1 - y^{(i)} (\mathbf{u}^t \cdot \mathbf{K}^{(i)} + b)\right)} + \dfrac{1}{2} \mathbf{u}^t \cdot \mathbf{K} \cdot \mathbf{u}$$

$ \mathbf{K}$ is a $m \times m$ matrix, and $\mathbf{u}$ is a $m \times 1$ matrix. Neither is infinite.

Indeed, the dual is usually faster to solve, but the primal has its advantages as well, such as approximate solutions (which are not guaranteed in the dual formulation).


Now, why the dual is so much more prominent isn't obvious at all: [1]

The historical reasons for which most of the research in the last decade has been about dual optimization are unclear. We believe that it is because SVMs were first introduced in their hard margin formulation [Boser et al., 1992], for which a dual optimization (because of the constraints) seems more natural. In general, however, soft margin SVMs should be preferred, even if the training data are separable: the decision boundary is more robust because more training points are taken into account [Chapelle et al., 2000]


Chapelle (2007) argues the time complexity of both primal and dual optimization is $\mathcal{O}\left(nn_{sv} + n_{sv}^3\right)$, worst case being $\mathcal{O}\left(n^3\right)$, but they analyzed quadratic and approximate hinge losses rather than the proper hinge loss, which is not differentiable and thus cannot be used with Newton's method.


[1] Chapelle, O. (2007). Training a support vector machine in the primal. Neural computation, 19(5), 1155-1178.

2 of 3
6

If we apply a transformation $\phi$ to all input vectors ($\mathbf{x}^{(i)}$), we get the following cost function:

$J(\mathbf{w}, b) = C {\displaystyle \sum\limits_{i=1}^{m} max\left(0, 1 - y^{(i)} (\mathbf{w}^t \cdot \phi(\mathbf{x}^{(i)}) + b)\right)} \quad + \quad \dfrac{1}{2} \mathbf{w}^t \cdot \mathbf{w}$

The kernel trick replaces $\phi(\mathbf{u})^t \cdot \phi(\mathbf{v})$ by $K(\mathbf{u}, \mathbf{v})$. Since the weight vector $\mathbf{w}$ is not transformed, the kernel trick cannot be applied to the cost function above.

The cost function above corresponds to the primal form of the SVM objective:

$\underset{\mathbf{w}, b, \mathbf{\zeta}}\min{C \sum\limits_{i=1}^m{\zeta^{(i)}} + \dfrac{1}{2}\mathbf{w}^t \cdot \mathbf{w}}$

subject to $y^{(i)}(\mathbf{w}^t \cdot \phi(\mathbf{x}^{(i)}) + b) \ge 1 - \zeta^{(i)}$ and $\zeta^{(i)} \ge 0$ for $i=1, \cdots, m$

The dual form is:

$\underset{\mathbf{\alpha}}\min{\dfrac{1}{2}\mathbf{\alpha}^t \cdot \mathbf{Q} \cdot \mathbf{\alpha} - \mathbf{1}^t \cdot \mathbf{\alpha}}$

subject to $\mathbf{y}^t \cdot \mathbf{\alpha} = 0$ and $0 \le \alpha_i \le C$ for $i = 1, 2, \cdots, m$

where $\mathbf{1}$ is a vector full of 1s and $\mathbf{Q}$ is an $m \times m$ matrix with elements $Q_{ij} = y^{(i)} y^{(j)} \phi(\mathbf{x}^{(i)})^t \cdot \phi(\mathbf{x}^{(j)})$.

Now we can use the kernel trick by computing $Q_{ij}$ like so:

$Q_{ij} = y^{(i)} y^{(j)} K(\mathbf{x}^{(i)}, \mathbf{x}^{(j)})$
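As a quick sketch (Python, with an illustrative linear kernel), $\mathbf{Q}$ can be built from $K(\cdot, \cdot)$ without ever evaluating $\phi$:

```python
import numpy as np

def dual_Q(X, y, kernel):
    """Build Q with Q_ij = y_i * y_j * K(x_i, x_j), as above.

    Only the kernel is called; the feature map phi never appears.
    """
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    return (y[:, None] * y[None, :]) * K   # outer product of labels times K

X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
Q = dual_Q(X, y, kernel=lambda u, v: u @ v)  # linear kernel K(u,v) = u.v
```

For these orthogonal samples the off-diagonal entries vanish and $\mathbf{Q}$ is the identity; any other kernel (RBF, polynomial) can be dropped in without changing the rest.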

So the kernel trick can only be used on the dual form of the SVM problem (plus some other algorithms such as logistic regression).

Now you can use off-the-shelf Quadratic Programming libraries to solve this problem, or use Lagrange multipliers to get an unconstrained function (the dual cost function), then search for a minimum using Gradient Descent or any other optimization technique. One of the most efficient approaches seems to be the SMO algorithm implemented in the libsvm library (for kernelized SVMs).