As for scipy.optimize, you are misusing its optimization methods. Both Newton-CG and BFGS assume your cost function is smooth, which is not the case here. If you use a robust gradient-free method like Nelder-Mead, you will converge to the right point in most cases (I have tried it).
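To illustrate, here is a minimal sketch using a toy non-smooth cost (a stand-in for your hinge-loss cost, not your actual code) where Nelder-Mead still lands at the kink:

```python
import numpy as np
from scipy import optimize

# Toy non-smooth cost with a kink at the optimum [0.5, 0.5],
# standing in for the hinge-loss cost in the question
def cost(theta):
    return np.sum(np.abs(theta - 0.5))

res = optimize.minimize(cost, np.zeros(2), method='Nelder-Mead')
print(res.x)  # close to [0.5, 0.5]
```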

Your problem can, in theory, be solved by gradient descent, but only if you adapt it to a non-smooth function. Currently, your algorithm approaches the optimum quickly but then starts jumping around instead of converging, due to a large learning rate combined with a sharp change in gradient where the maximum in the cost function switches from 0 to positive.

You can calm these oscillations down by decreasing the learning rate each time the cost fails to decrease relative to the previous iteration:

def train(self):

    #----------Optimize using scipy.optimize----------
    if self.method=='optimize':
        opt=optimize.minimize(self.costFunc,self.theta,args=(self.xdata,self.ydata),\
                jac=self.jac,method='BFGS')
        self.theta=opt.x

    #---------Optimize using Gradient descent---------
    elif self.method=='GD':
        costs=[]
        lr=self.lr

        for ii in range(self.n_iter):
            dj=self.jac(self.theta,self.xdata,self.ydata)
            old_theta = self.theta.copy()
            self.theta=self.theta-lr*dj
            cii=self.costFunc(self.theta,self.xdata,self.ydata)

            # if cost goes up, decrease learning rate and restore theta
            if len(costs) > 0 and cii > costs[-1]:
                lr *= 0.9
                self.theta = old_theta
                cii = costs[-1]  # record the cost of the restored theta
            costs.append(cii)

        self.costs=numpy.array(costs)

    return self

This small amendment to your code results in much better convergence:

and in parameters that are pretty close to the optimum, like [0.50110433 0.50076661] or [0.50092616 0.5007394 ].

In modern applications (like neural networks) this adaptation of the learning rate is built into advanced gradient descent algorithms like Adam, which continually track the running mean and variance of the gradient.

Update. This second part of the answer concerns the second version of the code.

About Adam. You get an exploding vt because of the line vt=vt/(1-beta2**t). You should apply the bias correction only to the value of vt used to compute the gradient step, not to the value carried over to the next iteration, like here:

...
mt=beta1*mt_1+(1-beta1)*dj       # first-moment estimate
vt=beta2*vt_1+(1-beta2)*dj**2    # second-moment estimate
mt_temp=mt/(1-beta1**t)          # bias-corrected, used for the step only
vt_temp=vt/(1-beta2**t)
old_theta=self.theta
self.theta=self.theta-lr*mt_temp/(numpy.sqrt(vt_temp)+epsilon)
mt_1=mt                          # carry the *uncorrected* moments forward
vt_1=vt
...
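For reference, here is a minimal self-contained Adam loop with the correction applied that way. The hyperparameter values are the usual defaults, and the toy subgradient at the bottom is an assumption for illustration, not your jac:

```python
import numpy as np

def adam_minimize(jac, theta, lr=0.01, beta1=0.9, beta2=0.999,
                  epsilon=1e-8, n_iter=2000):
    """Minimal Adam loop: bias-corrected estimates are used only for
    the step; the raw moments carry over to the next iteration."""
    mt_1 = np.zeros_like(theta)
    vt_1 = np.zeros_like(theta)
    for t in range(1, n_iter + 1):
        dj = jac(theta)
        mt = beta1 * mt_1 + (1 - beta1) * dj
        vt = beta2 * vt_1 + (1 - beta2) * dj**2
        mt_temp = mt / (1 - beta1**t)   # bias correction, step only
        vt_temp = vt / (1 - beta2**t)
        theta = theta - lr * mt_temp / (np.sqrt(vt_temp) + epsilon)
        mt_1 = mt                       # uncorrected moments carried over
        vt_1 = vt
    return theta

# Toy subgradient of the non-smooth cost sum(|theta - 0.5|)
theta_opt = adam_minimize(lambda th: np.sign(th - 0.5), np.zeros(2))
print(theta_opt)  # approaches [0.5, 0.5]
```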

About instability. Both the Nelder-Mead method and gradient descent depend on the initial parameter values; that's the sad truth. You can try to improve convergence by running more GD iterations and decaying the learning rate in a smarter way, or by decreasing the tolerance parameters xatol and fatol for the Nelder-Mead method.
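For example (a sketch with the same toy cost as above; xatol and fatol default to 1e-4 in scipy's Nelder-Mead, so tightening them makes the simplex shrink further before stopping):

```python
import numpy as np
from scipy import optimize

# Toy non-smooth cost standing in for the real one
cost = lambda th: np.sum(np.abs(th - 0.5))
res = optimize.minimize(cost, np.zeros(2), method='Nelder-Mead',
                        options={'xatol': 1e-8, 'fatol': 1e-8,
                                 'maxiter': 10000})
print(res.x)
```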

However, even if you achieve perfect convergence (parameter values like [ 1.81818459 -1.81817712 -4.09093887] in your case), you still have a problem. Convergence can be roughly checked by the following code:

print(mysvm.costFunc(numpy.concatenate([mysvm.theta, [mysvm.b]]), mysvm.xdata, mysvm.ydata))
print(mysvm.costFunc(numpy.concatenate([mysvm.theta, [mysvm.b+1e-3]]), mysvm.xdata, mysvm.ydata))
print(mysvm.costFunc(numpy.concatenate([mysvm.theta, [mysvm.b-1e-3]]), mysvm.xdata, mysvm.ydata))
print(mysvm.costFunc(numpy.concatenate([mysvm.theta-1e-3, [mysvm.b]]), mysvm.xdata, mysvm.ydata))
print(mysvm.costFunc(numpy.concatenate([mysvm.theta+1e-3, [mysvm.b]]), mysvm.xdata, mysvm.ydata))

which results in

6.7323592305075515
6.7335116664996
6.733895813394582
6.745819882839341
6.741974212439457

Your cost increases if you perturb theta or the intercept in either direction, so the solution is locally optimal. But then sklearn's solution is not optimal (from the point of view of mysvm), because the code

print(mysvm.costFunc(numpy.concatenate([clf.coef_[0], clf.intercept_]), mysvm.xdata, mysvm.ydata))

prints 40.31527145374271! This means you have reached a local minimum, but sklearn's SVM has minimized something different.

And if you read the documentation of sklearn, you can find what's wrong: they minimize sum(errors) * C + 0.5 * penalty, while you minimize mean(errors) * C + 0.5 * penalty! This is the most probable cause of the discrepancy.
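The two objectives differ only in how C scales the data term: with n samples, the mean-based cost with parameter C equals the sum-based cost with C/n. A toy numeric check (the arrays are made-up stand-ins, not your data):

```python
import numpy as np

rng = np.random.default_rng(0)
errors = rng.random(50)   # stand-in for per-sample hinge losses
penalty = 1.3             # stand-in for ||w||**2
C, n = 10.0, len(errors)

mean_cost = errors.mean() * C + 0.5 * penalty
sum_cost = errors.sum() * (C / n) + 0.5 * penalty
print(np.isclose(mean_cost, sum_cost))  # True
```

Consequently, fitting your mean-based cost with C multiplied by the number of samples should target the same optimum as sklearn's sum-based objective.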

Answer from David Dale on Stack Overflow