🌐
Springer
link.springer.com › home › mathematical programming › article
Pegasos: primal estimated sub-gradient solver for SVM | Mathematical Programming | Springer Nature Link
October 16, 2010 - We describe and analyze a simple and effective stochastic sub-gradient descent algorithm for solving the optimization problem cast by Support Vector Machines (SVM). We prove that the number of iterations required to obtain a solution of accuracy \({\epsilon}\) is \({\tilde{O}(1 / \epsilon)}\), where each iteration operates on a single training example.
🌐
MIT CSAIL
people.csail.mit.edu › dsontag › courses › ml16 › slides › lecture5.pdf pdf
Support vector machines (SVMs) Lecture 5 David Sontag
Soft margin SVM · Subgradient · (for non-differentiable functions) · (Sub)gradient descent of SVM objective
🌐
University of Utah
users.cs.utah.edu › ~zhe › pdf › lec-19-2-svm-sgd-upload.pdf pdf
1 Support Vector Machines: Training with Stochastic Gradient Descent
Outline: Training SVM by optimization · 1. Review of convex functions and gradient descent · 2. Stochastic gradient descent · 3. Gradient descent vs stochastic gradient descent · 4. Sub-derivatives of the hinge loss · 5. Stochastic sub-gradient descent for SVM ·
🌐
Davidrosenberg
davidrosenberg.github.io › mlcourse › Archive › 2018 › Lectures › 03c.subgradient-descent.pdf pdf
Subgradient Descent David S. Rosenberg New York University February 7, 2018
Gradient Descent on SVM Objective? If we blindly apply gradient descent from a random starting point · seems unlikely that we’ll hit a point where the gradient is undefined. Still, doesn’t mean that gradient descent will work if objective not differentiable! Theory of subgradients and subgradient descent will clear up any uncertainty.
🌐
GitHub
github.com › josiahw › SimpleSVM
GitHub - josiahw/SimpleSVM: A simple sub-gradient SVM implementation in python and numpy
A simple sub-gradient SVM implementation in python and numpy - josiahw/SimpleSVM
Author   josiahw
🌐
TTIC
home.ttic.edu › ~nati › Publications › PegasosMPB.pdf pdf
Mathematical Programming manuscript No. (will be inserted by the editor)
31. Shalev-Shwartz, S., Singer, Y., Srebro, N.: Pegasos: Primal Estimated sub-GrAdient SOlver for SVM.
🌐
MIT CSAIL
people.csail.mit.edu › dsontag › courses › ml16 › slides › notes_convexity16.pdf pdf
Gradient, Subgradient and how they may affect your grade(ient)
February 7, 2016 - The subgradient of a convex function f at w0 is formally defined as all vectors v such that ... Notice that the SVM objective contains an average over data points. We can approximate · this average by looking at a single data point at a time. This serves as the basis for stochastic · gradient descent methods.
🌐
Svivek
svivek.com › teaching › lectures › slides › svm › svm-sgd.pdf pdf
Training with Stochastic Gradient Descent
Outline: Training SVM by optimization · 1. Review of convex functions and gradient descent · 2. Stochastic gradient descent · 3. Gradient descent vs stochastic gradient descent · 4. Sub-derivatives of the hinge loss · 5. Stochastic sub-gradient descent for SVM ·
🌐
Hebrew University of Jerusalem
cs.huji.ac.il › ~shais › papers › ShalevSiSrCo10.pdf pdf
the optima of the SVM objectives are 2.88%, 14.9%, 0.457% and 0.57%. Pegasos: Primal Estimated sub-GrAdient SOlver for SVM
🌐
arXiv
arxiv.org › abs › 1812.02261
[1812.02261] GADGET SVM: A Gossip-bAseD sub-GradiEnT Solver for Linear SVMs
December 5, 2018 - In this paper, we present a distributed algorithm for learning linear Support Vector Machines in the primal form for binary classification called Gossip-bAseD sub-GradiEnT (GADGET) SVM.
Top answer
1 of 3
12

Let's start with basics. The so-called gradient is just the ordinary derivative, that is, the slope. For example, the slope of the linear function $y=kx+b$ equals $k$, so its gradient w.r.t. $x$ equals $k$. If $x$ and $k$ are vectors rather than numbers, then the gradient is also a vector.

Another piece of good news is that the gradient is a linear operator: you can add functions and multiply by constants before or after differentiation, and it makes no difference.

Now take the definition of the SVM loss function for a single $i$-th observation. It is

$\mathrm{loss} = \max(0, \mathrm{something} - w_y*x)$

where $\mathrm{something}=wx+\Delta$. Thus, the loss equals $\mathrm{something}-w_y*x$ if the latter is non-negative, and $0$ otherwise.

In the first (non-negative) case, the loss $\mathrm{something}-w_y*x$ is linear in $w_y$, so the gradient is just the slope of this function of $w_y$, that is, $-x$.

In the second (negative) case the loss $0$ is constant, so its derivative is also $0$.

To write both cases in one equation, we use the indicator function $I(x)$, which equals $1$ if $x$ is true, and $0$ otherwise. With this function, we can write

$\mathrm{derivative} = I(\mathrm{something} - w_y*x > 0) * (-x)$

If $\mathrm{something} - w_y*x > 0$, the first multiplier equals 1, and the gradient equals $-x$. Otherwise, the first multiplier equals 0, and so does the gradient. So I just rewrote the two cases in a single line.

Now let's turn from a single $i$-th observation to the whole loss. The loss is the sum of individual losses. Thus, because differentiation is linear, the gradient of a sum equals the sum of gradients, so we can write

$\text{total derivative} = \sum_i I(\mathrm{something} - w_y*x_i > 0) * (-x_i)$

Now, move the $-$ multiplier from $x_i$ to the beginning of the formula, and you will get your expression.
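A quick Python (NumPy) sketch of the two cases, with made-up vectors for illustration (the names `hinge_term_grad`, `w_j`, `w_y` and the numbers are assumptions for this example, not from the question):

```python
import numpy as np

def hinge_term_grad(w_j, w_y, x, delta=1.0):
    """Subgradient of max(0, w_j*x - w_y*x + delta) w.r.t. w_y.

    Implements the single-line form above: I(margin violated) * (-x).
    """
    something = w_j @ x + delta          # "something" = w_j*x + Delta
    return (something - w_y @ x > 0) * (-x)

x   = np.array([1.0, 2.0])
w_j = np.array([0.5, 0.5])

# Case 1: margin violated, so the gradient is -x.
g_violated  = hinge_term_grad(w_j, np.array([0.1, 0.1]), x)

# Case 2: margin satisfied (w_y*x is large), so the gradient is 0.
g_satisfied = hinge_term_grad(w_j, np.array([10.0, 10.0]), x)
```

Note that the indicator multiplication `(condition) * (-x)` is exactly the single-line rewrite of the two cases described above.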

2 of 3
1

David has provided a good answer. But I would point out that the sum() in David's answer:

total_derivative = sum(I(something - w_y*x[i] > 0) * (-x[i]))

is different from the one in Nikhil's original question:

$$ \def\w{{\mathbf w}} \nabla_{\w_{y_i}} L_i=-\left[\sum_{j\ne y_i} \mathbf{I}( \w_j^T x_i - \w_{y_i}^T x_i + \Delta >0) \right] x_i $$ The above equation is still the gradient due to the i-th observation, but for the weight of the ground truth class, i.e. $w_{y_i}$. There is the summation $\sum_{j \ne y_i}$, because $w_{y_i}$ is in every term of the SVM loss $L_i$:

$$ \def\w{{\mathbf w}} L_i = \sum_{j \ne y_i} \max (0, \w_j^T x_i - \w_{y_i}^T x_i + \Delta) $$ For every non-zero term, i.e. $w^T_j x_i - w^T_{y_i} x_i + \Delta > 0$, you would obtain the gradient $-x_i$. In total, the gradient $\nabla_{w_{y_i}} L_i$ is $\text{numOfNonZeroTerms} \times (-x_i)$, same as the equation above.

Gradients of individual observations $\nabla L_i$ (computed above) are then averaged to obtain the gradient of the batch of observations $\nabla L$.
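The counting argument can be checked numerically; here is a hedged Python sketch with illustrative weights (the function name and data are assumptions, not from the question):

```python
import numpy as np

def grad_true_class(W, x, y, delta=1.0):
    """Gradient of L_i = sum_{j != y} max(0, w_j.x - w_y.x + delta)
    w.r.t. the true-class weights w_y: numOfNonZeroTerms * (-x)."""
    scores = W @ x
    margins = scores - scores[y] + delta
    margins[y] = 0.0                      # j == y is excluded from the sum
    return np.sum(margins > 0) * (-x)

W = np.array([[1.0, 0.0],   # class 0 (the true class here)
              [0.0, 1.0],   # class 1
              [0.5, 0.5]])  # class 2
x = np.array([2.0, 1.0])
g = grad_true_class(W, x, 0)              # one violating class, so g = -x
```

With these numbers only class 2 violates the margin, so the count is 1 and the gradient is $1 \times (-x_i)$.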

🌐
arXiv
arxiv.org › abs › 2206.09311
[2206.09311] Primal Estimated Subgradient Solver for SVM for Imbalanced Classification
November 9, 2023 - Abstract page for arXiv paper 2206.09311: Primal Estimated Subgradient Solver for SVM for Imbalanced Classification
🌐
ResearchGate
researchgate.net › publication › 220589628_Pegasos_Primal_estimated_sub-gradient_solver_for_SVM
Pegasos: Primal estimated sub-gradient solver for SVM | Request PDF
August 5, 2025 - We describe and analyze a simple and effective stochastic sub-gradient descent algorithm for solving the optimization problem cast by Support Vector Machines (SVM).
Top answer
1 of 2
1

The objective function for the Primal Kernel SVM (No bias term):

$$ \arg \min_{\boldsymbol{\beta}} \frac{\lambda}{2} \boldsymbol{\beta}^{\top} \boldsymbol{K} \boldsymbol{\beta} + \frac{1}{n} \sum_{i = 1}^{n} \max \left\{ 0, 1 - {y}_{i} \boldsymbol{k}_{i}^{\top} \boldsymbol{\beta} \right\} $$

The simplest formulation of the Kernel Pegasos algorithm, in my opinion, is in Shai Shalev-Shwartz, Shai Ben-David - Understanding Machine Learning: From Theory to Algorithms, a book by the same first author as the paper Pegasos: Primal Estimated Sub GrAdient SOlver for SVM.

For the Kernel SVM, the book gives the algorithm on page 223 (SGD for Solving Soft SVM with Kernels).

My implementation (Kernel SVM):

function PegasosKernelSVM!( mX :: Matrix{T}, hKₖ :: Function, vY :: Vector{T}, λ :: T, vαₜ :: Vector{T}, vβₜ :: Vector{T} ) where {T <: AbstractFloat}
    # Following Shai Shalev Shwartz, Shai Ben David - Understanding Machine Learning: From Theory to Algorithms.
    # See page 223 - SGD for Solving Soft SVM with Kernels.
    # `hKₖ(vZ, ii)` - Returns the dot product of the `ii` -th data sample with `vZ`.

    numSamples    = length(vY);
    dataDim       = size(mX, 1);
    numIterations = size(mX, 2);

    # First Iteration
    ii = 1;
    tt = ii;
    ηₜ = inv(λ * tt);

    vαₜ .= ηₜ .* vβₜ;

    kk = rand(1:numSamples);
    yₖ = vY[kk];

    valSum  = yₖ * hKₖ(vαₜ, kk);
    vβₜ[kk] += (valSum < one(T)) * yₖ;

    @views mX[:, ii] = vαₜ;

    for ii in 2:numIterations

        tt = ii;
        ηₜ = inv(λ * tt);

        vαₜ .= ηₜ .* vβₜ;

        kk = rand(1:numSamples);
        yₖ = vY[kk];

        valSum  = yₖ * hKₖ(vαₜ, kk);
        vβₜ[kk] += (valSum < one(T)) * yₖ;

        @views mX[:, ii] .= inv(tt) .* ((T(ii - 1) .* mX[:, ii - 1]) .+ vαₜ);
    end

    return mX;

end

Remarks:

  • The algorithm assumes no bias / intercept term. It may be adapted to include one, yet it will lose its speed.
  • The code calculates the whole path of the estimation of the parameter $\boldsymbol{\beta}$ (which is vαₜ in the code).

I verified the implementation against a DCP solver on a non-separable synthetic data set.

The prediction, for a given new sample $\boldsymbol{z}$, is given by:

$$ \boldsymbol{z} \to \operatorname{sign} \left( \sum_{i} {\beta}_{i} k \left( \boldsymbol{z}, \boldsymbol{x}_{i} \right) \right) $$

Where $\boldsymbol{x}_{i}$ is the $i$ -th data sample and $k \left( \cdot, \cdot \right)$ is the kernel function.
It is different from the calculation for the Dual Variable as seen in Prediction of a Sample Given the Optimal Dual Variable of Kernel SVM.
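For illustration only, the prediction rule can be sketched in Python (NumPy) with a Gaussian kernel; the kernel choice, names and data are assumptions for this example, not from the answer:

```python
import numpy as np

def rbf(u, v, gamma=1.0):
    """Gaussian (RBF) kernel: k(u, v) = exp(-gamma * ||u - v||^2)."""
    return np.exp(-gamma * np.sum((u - v) ** 2))

def predict(z, X, beta, kernel=rbf):
    """sign( sum_i beta_i * k(z, x_i) ) -- the prediction rule above."""
    return np.sign(sum(b * kernel(z, x) for b, x in zip(beta, X)))

X    = np.array([[0.0, 0.0], [1.0, 1.0]])  # training samples x_i
beta = np.array([1.0, -1.0])               # coefficients from Pegasos
```

The kernel function is the only place the data enters, which is what lets the algorithm avoid computing any feature map explicitly.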


The code is available on my StackExchange Code GitHub Repository (Look at the CrossValidated\Q215733 folder).

2 of 2
0

The objective function for the Primal Linear SVM (No bias term):

$$ \arg \min_{\boldsymbol{w}} \frac{\lambda}{2} {\left\| \boldsymbol{w} \right\|}_{2}^{2} + \frac{1}{n} \sum_{k = 1}^{n} \max \left\{ 0, 1 - {y}_{k} \boldsymbol{x}_{k}^{\top} \boldsymbol{w} \right\} $$

There is a simpler formulation of the algorithm, in my opinion, in Shai Shalev-Shwartz, Shai Ben-David - Understanding Machine Learning: From Theory to Algorithms, a book by the same first author as the paper Pegasos: Primal Estimated Sub GrAdient SOlver for SVM.

For the Linear SVM, the book gives the algorithm on page 213 (SGD for Solving Soft SVM).

My implementation (Linear SVM):

using LinearAlgebra: dot; # `dot` is used in the margin check below

function PegasosSVM!( mW :: Matrix{T}, hXₖ! :: Function, vY :: Vector{T}, λ :: T, vWₜ :: Vector{T}, vθₜ :: Vector{T}, vXₜ :: Vector{T} ) where {T <: AbstractFloat}
    # Following Shai Shalev Shwartz, Shai Ben David - Understanding Machine Learning: From Theory to Algorithms.
    # See page 213 - SGD for Solving Soft SVM.
    # `hXₖ!(vZ, ii)` - Returns the `ii` -th data sample in `vZ`.

    numSamples    = length(vY);
    dataDim       = size(mW, 1);
    numIterations = size(mW, 2);

    # First iteration
    tt = 1;
    ηₜ = inv(λ * tt);

    vWₜ .= ηₜ .* vθₜ;

    kk = rand(1:numSamples);
    yₖ = vY[kk];

    hXₖ!(vXₜ, kk);

    valSum = yₖ * dot(vWₜ, vXₜ);
    if valSum < one(T)
        vθₜ .+= yₖ .* vXₜ;
    end

    @views mW[:, 1] = vWₜ;


    for ii in 2:numIterations

        tt = ii;
        ηₜ = inv(λ * tt);

        vWₜ .= ηₜ .* vθₜ;

        kk = rand(1:numSamples);
        yₖ = vY[kk];

        hXₖ!(vXₜ, kk);

        valSum = yₖ * dot(vWₜ, vXₜ);
        if valSum < one(T)
            vθₜ .+= yₖ .* vXₜ;
        end

        @views mW[:, ii] .= inv(tt) .* ((T(ii - 1) .* mW[:, ii - 1]) .+ vWₜ);
    end

    return mW;

end

Remarks:

  • The algorithm assumes no bias / intercept term. It may be adapted to include the bias term, yet it will become much slower (the objective goes from strongly convex to merely convex).
  • The code calculates the whole path of the estimation of the parameter $\boldsymbol{w}$.
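For readers without Julia, the same update loop can be sketched in Python (NumPy); this is a minimal sketch under the answer's assumptions (no bias term, averaged iterates), with illustrative data:

```python
import numpy as np

def pegasos_linear(X, y, lam, num_iters, seed=0):
    """Minimal linear Pegasos (no bias term) for the objective above.

    w_t = theta / (lam * t), where theta accumulates y_k * x_k over
    margin-violating samples -- mirroring vθₜ in the Julia code.
    Returns the running average of the iterates.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    theta = np.zeros(d)
    w_avg = np.zeros(d)
    for t in range(1, num_iters + 1):
        w = theta / (lam * t)
        k = rng.integers(n)              # pick one random sample
        if y[k] * (X[k] @ w) < 1.0:      # hinge sub-gradient step
            theta += y[k] * X[k]
        w_avg += (w - w_avg) / t         # running average of iterates
    return w_avg

# Toy separable data: the sign of the first coordinate decides the label.
X = np.array([[2.0, 0.0], [-2.0, 0.0]])
y = np.array([1.0, -1.0])
w = pegasos_linear(X, y, lam=0.1, num_iters=500)
```

On this toy data the averaged iterate separates both samples correctly, with the second coordinate staying exactly zero.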

I verified the implementation against a DCP solver on a non-separable synthetic data set.


The code is available on my StackExchange Code GitHub Repository (Look at the CrossValidated\Q215733 folder).

🌐
Monishver11
monishver11.github.io › blog › 2025 › subgradient
Subgradient and Subgradient Descent | Monishver Chandrasekaran
Subgradients are a powerful tool for dealing with non-differentiable convex functions, and subgradient descent provides a straightforward yet effective way to optimize such functions. While slower than gradient descent, subgradient descent shines in scenarios where gradients are unavailable.
Top answer
1 of 3
18

Set $\mathbf w = \phi(\mathbf x)\cdot \mathbf u$ so that $\mathbf w^t \phi(\mathbf x)=\mathbf u^t \cdot \mathbf K$ and $\mathbf w^t\mathbf w = \mathbf u^t\mathbf K\mathbf u$, with $\mathbf K = \phi(\mathbf x)^t\phi(\mathbf x)$, where $\phi(x)$ is a mapping of the original input matrix, $\mathbf x$. This allows one to solve the SVM through the primal formulation. Using your notation for the loss:

$$J(\mathbf{w}, b) = C {\displaystyle \sum\limits_{i=1}^{m} max\left(0, 1 - y^{(i)} (\mathbf{u}^t \cdot \mathbf{K}^{(i)} + b)\right)} + \dfrac{1}{2} \mathbf{u}^t \cdot \mathbf{K} \cdot \mathbf{u}$$

$ \mathbf{K}$ is a $m \times m$ matrix, and $\mathbf{u}$ is a $m \times 1$ matrix. Neither is infinite.

Indeed, the dual is usually faster to solve, but the primal has its advantages as well, such as approximate solutions (which are not guaranteed in the dual formulation).


Now, why the dual is so much more prominent isn't obvious at all: [1]

The historical reasons for which most of the research in the last decade has been about dual optimization are unclear. We believe that it is because SVMs were first introduced in their hard margin formulation [Boser et al., 1992], for which a dual optimization (because of the constraints) seems more natural. In general, however, soft margin SVMs should be preferred, even if the training data are separable: the decision boundary is more robust because more training points are taken into account [Chapelle et al., 2000]


Chapelle (2007) argues the time complexity of both primal and dual optimization is $\mathcal{O}\left(nn_{sv} + n_{sv}^3\right)$, worst case being $\mathcal{O}\left(n^3\right)$, but they analyzed quadratic and approximate hinge losses rather than the proper hinge loss, which is not differentiable and thus cannot be used with Newton's method.


[1] Chapelle, O. (2007). Training a support vector machine in the primal. Neural computation, 19(5), 1155-1178.

2 of 3
6

If we apply a transformation $\phi$ to all input vectors ($\mathbf{x}^{(i)}$), we get the following cost function:

$J(\mathbf{w}, b) = C {\displaystyle \sum\limits_{i=1}^{m} max\left(0, 1 - y^{(i)} (\mathbf{w}^t \cdot \phi(\mathbf{x}^{(i)}) + b)\right)} \quad + \quad \dfrac{1}{2} \mathbf{w}^t \cdot \mathbf{w}$

The kernel trick replaces $\phi(\mathbf{u})^t \cdot \phi(\mathbf{v})$ by $K(\mathbf{u}, \mathbf{v})$. Since the weight vector $\mathbf{w}$ is not transformed, the kernel trick cannot be applied to the cost function above.

The cost function above corresponds to the primal form of the SVM objective:

$\underset{\mathbf{w}, b, \mathbf{\zeta}}\min{C \sum\limits_{i=1}^m{\zeta^{(i)}} + \dfrac{1}{2}\mathbf{w}^t \cdot \mathbf{w}}$

subject to $y^{(i)}(\mathbf{w}^t \cdot \phi(\mathbf{x}^{(i)}) + b) \ge 1 - \zeta^{(i)}$ and $\zeta^{(i)} \ge 0$ for $i=1, \cdots, m$

The dual form is:

$\underset{\mathbf{\alpha}}\min{\dfrac{1}{2}\mathbf{\alpha}^t \cdot \mathbf{Q} \cdot \mathbf{\alpha} - \mathbf{1}^t \cdot \mathbf{\alpha}}$

subject to $\mathbf{y}^t \cdot \mathbf{\alpha} = 0$ and $0 \le \alpha_i \le C$ for $i = 1, 2, \cdots, m$

where $\mathbf{1}$ is a vector full of 1s and $\mathbf{Q}$ is an $m \times m$ matrix with elements $Q_{ij} = y^{(i)} y^{(j)} \phi(\mathbf{x}^{(i)})^t \cdot \phi(\mathbf{x}^{(j)})$.

Now we can use the kernel trick by computing $Q_{ij}$ like so:

$Q_{ij} = y^{(i)} y^{(j)} K(\mathbf{x}^{(i)}, \mathbf{x}^{(j)})$
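As a quick sketch (Python, with an illustrative linear kernel), $\mathbf{Q}$ can be built from $K(\cdot, \cdot)$ without ever evaluating $\phi$:

```python
import numpy as np

def dual_Q(X, y, kernel):
    """Build Q with Q_ij = y_i * y_j * K(x_i, x_j), as above.

    Only the kernel is called; the feature map phi never appears.
    """
    K = np.array([[kernel(xi, xj) for xj in X] for xi in X])
    return (y[:, None] * y[None, :]) * K   # outer product of labels times K

X = np.array([[1.0, 0.0], [0.0, 1.0]])
y = np.array([1.0, -1.0])
Q = dual_Q(X, y, kernel=lambda u, v: u @ v)  # linear kernel K(u,v) = u.v
```

For these orthogonal samples the off-diagonal entries vanish and $\mathbf{Q}$ is the identity; any other kernel (RBF, polynomial) can be dropped in without changing the rest.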

So the kernel trick can only be used on the dual form of the SVM problem (plus some other algorithms such as logistic regression).

Now you can use off-the-shelf Quadratic Programming libraries to solve this problem, or use Lagrange multipliers to get an unconstrained function (the dual cost function), then search for a minimum using Gradient Descent or any other optimization technique. One of the most efficient approaches seems to be the SMO algorithm implemented in the libsvm library (for kernelized SVMs).