You are gravely misunderstanding SVM. The sub-gradient descent algorithm for SVM is a method for solving the optimization problem that underlies SVM.

An SVM is always a linear classifier, but by using kernels it can operate in a higher-dimensional feature space. As a result, the separating hyperplane (linear!) computed in feature space (via the kernel) appears non-linear in the input space.

Effectively, you are solving a non-linear classification task, but you do so by projecting into a higher-dimensional feature space in which the task can be solved by a linear classifier.
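To make this concrete, here is a minimal Python sketch (my own toy example, not from the answer) of how an explicit feature map turns a non-linearly-separable problem into a linearly separable one:

```python
import numpy as np

# XOR-like data: NOT linearly separable in the 2-D input space.
X = np.array([[1.0, 1.0], [-1.0, -1.0], [1.0, -1.0], [-1.0, 1.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])

# Explicit feature map corresponding to (part of) a degree-2 polynomial
# kernel: phi(x) = (x1, x2, x1*x2). In this 3-D feature space the data
# IS linearly separable.
def phi(x):
    return np.array([x[0], x[1], x[0] * x[1]])

# A linear separator in feature space; its preimage in input space is
# the non-linear boundary x1 * x2 = 0.
w = np.array([0.0, 0.0, 1.0])
preds = np.sign([phi(x) @ w for x in X])
print(preds)  # agrees with y on all four points
```

A kernelized SVM does the same thing implicitly, without ever computing `phi` explicitly.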

Please read:

Wikipedia on SVM

Please watch:

A great lecture about SVM

Answer from Nikolas Rieble on Stack Exchange
Top answer, 1 of 3 (score 12)

Let's start with the basics. The so-called gradient is just the ordinary derivative, that is, the slope. For example, the slope of the linear function $y=kx+b$ equals $k$, so its gradient w.r.t. $x$ equals $k$. If $x$ and $k$ are not numbers but vectors, then the gradient is also a vector.

Another piece of good news is that the gradient is a linear operator. This means you can add functions and multiply them by constants before or after differentiation; it makes no difference.

Now take the definition of the SVM loss function for a single observation $i$. It is

$\mathrm{loss} = \mathrm{max}(0, \mathrm{something} - w_y*x)$

where $\mathrm{something}=wx+\Delta$. Thus, the loss equals $\mathrm{something}-w_y*x$ if the latter is non-negative, and $0$ otherwise.

In the first (non-negative) case the loss $\mathrm{something}-w_y*x$ is linear in $w_y$, so the gradient is just the slope of this function of $w_y$, that is, $-x$.

In the second (negative) case the loss $0$ is constant, so its derivative is also $0$.

To write all these cases in one equation, we introduce a function (called the indicator) $I(x)$, which equals $1$ if $x$ is true, and $0$ otherwise. With this function, we can write

$\mathrm{derivative} = I(\mathrm{something} - w_y*x > 0) * (-x)$

If $\mathrm{something} - w_y*x > 0$, the first multiplier equals $1$, and the gradient equals $-x$. Otherwise, the first multiplier equals $0$, and so does the gradient. So I have just rewritten the two cases in a single line.

Now let's turn from a single observation $i$ to the whole loss. The loss is the sum of the individual losses. Because differentiation is linear, the gradient of a sum equals the sum of the gradients, so we can write

$\text{total derivative} = \sum_i I(\mathrm{something}_i - w_y x_i > 0) \cdot (-x_i)$

Now move the $-$ sign from $x_i$ to the front of the formula, and you will get your expression.
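As a numerical sanity check, the indicator formula above can be coded directly and compared against a finite difference (a Python sketch; the toy vectors are hypothetical):

```python
import numpy as np

# Sub-gradient of one hinge term max(0, something - w_y.x) w.r.t. w_y,
# using the indicator formula above. The toy vectors below are hypothetical.
def hinge_subgrad_wy(w_j, w_y, x, delta=1.0):
    margin = w_j @ x - w_y @ x + delta     # "something" - w_y.x
    return float(margin > 0) * (-x)        # I(margin > 0) * (-x)

x   = np.array([2.0, -1.0])
w_j = np.array([0.5, 0.5])
w_y = np.array([0.1, 0.2])
g = hinge_subgrad_wy(w_j, w_y, x)

# Finite-difference check (valid here since this point is differentiable):
eps = 1e-6
for k in range(2):
    w_plus = w_y.copy()
    w_plus[k] += eps
    loss = max(0.0, w_j @ x - w_y @ x + 1.0)
    loss_plus = max(0.0, w_j @ x - w_plus @ x + 1.0)
    assert abs((loss_plus - loss) / eps - g[k]) < 1e-4
print(g)  # [-2.  1.]
```

At the kink (margin exactly $0$) the derivative is undefined and any value in $[-x, 0]$ is a valid sub-gradient; the formula simply picks $0$ there.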

2 of 3 (score 1)

David has provided a good answer. But I would point out that the sum() in David's answer:

total_derivative = sum(I(something - w_y*x[i] > 0) * (-x[i]))

is different from the one in Nikhil's original question:

$$ \def\w{{\mathbf w}} \nabla_{\w_{y_i}} L_i=-\left[\sum_{j\ne y_i} \mathbf{I}( \w_j^T x_i - \w_{y_i}^T x_i + \Delta >0) \right] x_i $$ The above equation is still the gradient due to the i-th observation, but for the weight of the ground truth class, i.e. $w_{y_i}$. There is the summation $\sum_{j \ne y_i}$, because $w_{y_i}$ is in every term of the SVM loss $L_i$:

$$ \def\w{{\mathbf w}} L_i = \sum_{j \ne y_i} \max (0, \w_j^T x_i - \w_{y_i}^T x_i + \Delta) $$ For every non-zero term, i.e. $w^T_j x_i - w^T_{y_i} x_i + \Delta > 0$, you obtain the gradient $-x_i$. In total, the gradient $\nabla_{w_{y_i}} L_i$ is $(\text{number of non-zero terms}) \times (-x_i)$, the same as in the equation above.

The gradients of the individual observations $\nabla L_i$ (computed above) are then averaged to obtain the gradient of the batch of observations, $\nabla L$.
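The counting argument above can be sketched in Python (the toy weight matrix and sample are hypothetical):

```python
import numpy as np

# Gradient of the multi-class hinge loss L_i w.r.t. w_{y_i}:
# it equals -(number of classes j != y_i violating the margin) * x_i,
# exactly the counting argument above. Toy values are hypothetical.
def grad_wyi(W, x, y, delta=1.0):
    scores = W @ x                       # one score per class
    margins = scores - scores[y] + delta
    active = margins > 0                 # I(w_j.x - w_{y_i}.x + delta > 0)
    active[y] = False                    # the sum excludes j == y_i
    return -active.sum() * x

W = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [0.5, 0.5]])
x = np.array([1.0, 2.0])
y = 1                                    # ground-truth class index
print(grad_wyi(W, x, y))  # one violating class here, so the gradient is -x
```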

Top answer, 1 of 2 (score 1)

The objective function for the Primal Kernel SVM (No bias term):

$$ \arg \min_{\boldsymbol{\beta}} \frac{\lambda}{2} \boldsymbol{\beta}^{\top} \boldsymbol{K} \boldsymbol{\beta} + \frac{1}{n} \sum_{i = 1}^{n} \max \left\{ 0, 1 - {y}_{i} \boldsymbol{k}_{i}^{\top} \boldsymbol{\beta} \right\} $$
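As a quick illustration, this objective can be evaluated directly from a precomputed Gram matrix (a minimal Python sketch; the function name and toy inputs are mine, not from the answer):

```python
import numpy as np

# Evaluate the primal kernel SVM objective above for a given beta.
# K is the (n x n) Gram matrix, y the labels, lam the regularizer.
def kernel_svm_objective(beta, K, y, lam):
    reg = 0.5 * lam * beta @ K @ beta                 # (lam/2) beta' K beta
    hinge = np.maximum(0.0, 1.0 - y * (K @ beta)).mean()
    return reg + hinge

# Hypothetical toy check: with K = I this reduces to the linear objective.
K = np.eye(2)
y = np.array([1.0, -1.0])
beta = np.array([1.0, -1.0])
val = kernel_svm_objective(beta, K, y, lam=1.0)
print(val)  # 1.0: regularizer contributes 1.0, hinge loss is zero
```

Such a helper is handy for verifying a Pegasos run against a reference solver, since both should drive this value toward the same minimum.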

The simplest formulation of the Kernel Pegasos algorithm, in my opinion, is in Shai Shalev-Shwartz, Shai Ben-David - Understanding Machine Learning: From Theory to Algorithms, a book by one of the authors of the paper Pegasos: Primal Estimated sub-GrAdient SOlver for SVM.

For the Kernel SVM (Page 223):

My implementation (Kernel SVM):

function PegasosKernelSVM!( mX :: Matrix{T}, hKₖ :: Function, vY :: Vector{T}, λ :: T, vαₜ :: Vector{T}, vβₜ :: Vector{T} ) where {T <: AbstractFloat}
    # Following Shai Shalev Shwartz, Shai Ben David - Understanding Machine Learning: From Theory to Algorithms.
    # See page 223 - SGD for Solving Soft SVM with Kernels.
    # `hKₖ(vZ, ii)` - Returns the dot product of the `ii` -th data sample with `vZ`.

    numSamples    = length(vY);
    dataDim       = size(mX, 1);
    numIterations = size(mX, 2);

    # First Iteration
    ii = 1;
    tt = ii;
    ηₜ = inv(λ * tt);

    vαₜ .= ηₜ .* vβₜ;

    kk = rand(1:numSamples);
    yₖ = vY[kk];

    valSum  = yₖ * hKₖ(vαₜ, kk);
    vβₜ[kk] += (valSum < one(T)) * yₖ;

    @views mX[:, ii] = vαₜ;

    for ii in 2:numIterations

        tt = ii;
        ηₜ = inv(λ * tt);

        vαₜ .= ηₜ .* vβₜ;

        kk = rand(1:numSamples);
        yₖ = vY[kk];

        # @views valSum = yₖ * dot(mK[:, kk], vαₜ);
        valSum  = yₖ * hKₖ(vαₜ, kk);
        vβₜ[kk] += (valSum < one(T)) * yₖ;

        @views mX[:, ii] .= inv(tt) .* ((T(ii - 1) .* mX[:, ii - 1]) .+ vαₜ);
    end

    return mX;

end

Remarks:

  • The algorithm assumes no bias / intercept term. It may be adapted to include one, yet it will lose its speed.
  • The code calculates the whole path of the estimation of the parameter $\boldsymbol{\beta}$ (which is vαₜ in the code).

I verified the implementation against a DCP solver on a non-separable synthetic data set.

The prediction, for a given new sample $\boldsymbol{z}$ is given by:

$$ \boldsymbol{z} \to \operatorname{sign} \left( \sum_{i} {\beta}_{i} k \left( \boldsymbol{z}, \boldsymbol{x}_{i} \right) \right) $$

Where $\boldsymbol{x}_{i}$ is the $i$ -th data sample and $k \left( \cdot, \cdot \right)$ is the kernel function.
It is different from the calculation for the Dual Variable as seen in Prediction of a Sample Given the Optimal Dual Variable of Kernel SVM.
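For illustration, the prediction rule might be sketched in Python as follows (the RBF kernel choice and the toy data are assumptions, not from the answer):

```python
import numpy as np

# Prediction rule above: z -> sign( sum_i beta_i * k(z, x_i) ).
def rbf(u, v, gamma=1.0):
    return np.exp(-gamma * np.sum((u - v) ** 2))

def predict(z, X, beta, kernel=rbf):
    return np.sign(sum(b * kernel(z, x) for b, x in zip(beta, X)))

X = np.array([[0.0], [3.0]])    # training samples x_i (hypothetical)
beta = np.array([1.0, -1.0])    # primal kernel coefficients beta_i
print(predict(np.array([0.1]), X, beta))  # close to x_1, so +1.0
```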


The code is available on my StackExchange Code GitHub Repository (Look at the CrossValidated\Q215733 folder).

2 of 2 (score 0)

The objective function for the Primal Linear SVM (No bias term):

$$ \arg \min_{\boldsymbol{w}} \frac{\lambda}{2} {\left\| \boldsymbol{w} \right\|}_{2}^{2} + \frac{1}{n} \sum_{k = 1}^{n} \max \left\{ 0, 1 - {y}_{k} \boldsymbol{x}_{k}^{\top} \boldsymbol{w} \right\} $$
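To make the sub-gradient step concrete, here is a minimal Python sketch of the Pegasos update on this objective (variable names and the toy usage are hypothetical; the Julia code further down computes the full averaged path):

```python
import numpy as np

rng = np.random.default_rng(0)

# Minimal Pegasos sketch for the linear objective above:
# theta accumulates y_k * x_k over margin violations, and
# w_t = theta / (lam * t) is the current iterate.
def pegasos(X, y, lam, T):
    n, d = X.shape
    theta = np.zeros(d)
    w_avg = np.zeros(d)
    for t in range(1, T + 1):
        w = theta / (lam * t)
        k = rng.integers(n)               # pick a random sample
        if y[k] * (X[k] @ w) < 1.0:       # margin violated: sub-gradient step
            theta += y[k] * X[k]
        w_avg += (w - w_avg) / t          # running average of the iterates
    return w_avg

# Hypothetical separable toy data:
X = np.array([[2.0], [-2.0]])
y = np.array([1.0, -1.0])
w = pegasos(X, y, lam=0.1, T=200)
print(np.sign(X @ w))  # separates the two samples
```

Note the step size $\eta_t = 1/(\lambda t)$ comes from the strong convexity of the regularized objective; this is what gives Pegasos its $\tilde{O}(1/\epsilon)$ iteration bound.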

In my opinion, the simplest formulation of the algorithm is in Shai Shalev-Shwartz, Shai Ben-David - Understanding Machine Learning: From Theory to Algorithms, a book by one of the authors of the paper Pegasos: Primal Estimated sub-GrAdient SOlver for SVM.

For the Linear SVM:

My implementation (Linear SVM):

function PegasosSVM!( mW :: Matrix{T}, hXₖ! :: Function, vY :: Vector{T}, λ :: T, vWₜ :: Vector{T}, vθₜ :: Vector{T}, vXₜ :: Vector{T} ) where {T <: AbstractFloat}
    # Following Shai Shalev Shwartz, Shai Ben David - Understanding Machine Learning: From Theory to Algorithms.
    # See page 213 - SGD for Solving Soft SVM.
    # `hXₖ!(vZ, ii)` - Returns the `ii` -th data sample in `vZ`.

    numSamples    = length(vY);
    dataDim       = size(mW, 1);
    numIterations = size(mW, 2);

    # First iteration
    tt = 1;
    ηₜ = inv(λ * tt);

    vWₜ .= ηₜ .* vθₜ;

    kk = rand(1:numSamples);
    yₖ = vY[kk];

    hXₖ!(vXₜ, kk);

    valSum = yₖ * dot(vWₜ, vXₜ);
    if valSum < one(T)
        vθₜ .+= yₖ .* vXₜ;
    end

    @views mW[:, 1] = vWₜ;


    for ii in 2:numIterations

        # tt = ii - 1;
        tt = ii;
        ηₜ = inv(λ * tt);

        vWₜ .= ηₜ .* vθₜ;

        kk = rand(1:numSamples);
        yₖ = vY[kk];

        hXₖ!(vXₜ, kk);

        valSum = yₖ * dot(vWₜ, vXₜ);
        if valSum < one(T)
            vθₜ .+= yₖ .* vXₜ;
        end

        @views mW[:, ii] .= inv(tt) .* ((T(ii - 1) .* mW[:, ii - 1]) .+ vWₜ);
    end

    return mW;

end

Remarks:

  • The algorithm assumes no bias / intercept term. It may be adapted to include the bias term, yet it will become much slower (the objective goes from strongly convex to merely convex).
  • The code calculates the whole path of the estimation of the parameter $\boldsymbol{w}$.

I verified the implementation against a DCP solver on a non-separable synthetic data set.


The code is available on my StackExchange Code GitHub Repository (Look at the CrossValidated\Q215733 folder).
