As for scipy.optimize, you are misusing its optimization methods. Both Newton-CG and BFGS assume your cost function is smooth, which is not the case here. If you use a robust gradient-free method like Nelder-Mead, you will converge to the right point in most cases (I have tried it).
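As a sketch of what I mean (the hinge-style cost and the data here are made up by me to imitate your setup, not your actual code), the only change needed is the `method` argument:

```python
import numpy
from scipy import optimize

# Toy non-smooth cost: the max(0, .) terms make the gradient
# discontinuous, which violates the BFGS/Newton-CG assumptions.
def cost(theta, x, y):
    margins = 1 - y * (x @ theta)
    return numpy.mean(numpy.maximum(0, margins)) + 0.5 * theta @ theta

rng = numpy.random.default_rng(0)
x = rng.normal(size=(200, 2))
y = numpy.sign(x @ numpy.array([1.0, -1.0]) + 1e-9)

# Nelder-Mead is gradient-free, so it does not care about smoothness:
opt = optimize.minimize(cost, x0=numpy.zeros(2), args=(x, y),
                        method='Nelder-Mead')
print(opt.x)
```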
Your problem can in principle be solved by gradient descent, but only if you adapt it to a non-smooth function. Currently, your algorithm approaches the optimum quickly but then starts jumping around instead of converging, because of a large learning rate combined with a sharp change in the gradient where the maximum in the cost function switches from 0 to positive:

You can damp these oscillations by decreasing the learning rate each time the cost fails to decrease relative to the previous iteration:
def train(self):
    #----------Optimize using scipy.optimize----------
    if self.method == 'optimize':
        opt = optimize.minimize(self.costFunc, self.theta,
                                args=(self.xdata, self.ydata),
                                jac=self.jac, method='BFGS')
        self.theta = opt.x
    #---------Optimize using Gradient descent---------
    elif self.method == 'GD':
        costs = []
        lr = self.lr
        for ii in range(self.n_iter):
            dj = self.jac(self.theta, self.xdata, self.ydata)
            old_theta = self.theta.copy()
            self.theta = self.theta - lr*dj
            cii = self.costFunc(self.theta, self.xdata, self.ydata)
            # if the cost goes up, decrease the learning rate and restore theta
            if len(costs) > 0 and cii > costs[-1]:
                lr *= 0.9
                self.theta = old_theta
            costs.append(cii)
        self.costs = numpy.array(costs)
    return self
This small amendment to your code results in much better convergence:

and in resulting parameters that are pretty close to the optimal ones, like [0.50110433 0.50076661] or [0.50092616 0.5007394 ].
In modern applications (like neural networks) this adaptation of the learning rate is built into advanced gradient descent algorithms like Adam, which continuously track running estimates of the mean and variance of the gradient.
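For reference, here is a minimal sketch of the standard Adam update as a standalone function (the function name `adam_step` and the toy problem are mine), applied to the trivial cost f(theta) = theta²:

```python
import numpy

def adam_step(theta, grad, mt, vt, t, lr=0.001,
              beta1=0.9, beta2=0.999, epsilon=1e-8):
    """One Adam step: carry the raw moments mt, vt across iterations
    and apply the bias correction only to temporary copies."""
    mt = beta1 * mt + (1 - beta1) * grad      # running mean of the gradient
    vt = beta2 * vt + (1 - beta2) * grad**2   # running (uncentered) variance
    mt_hat = mt / (1 - beta1**t)              # bias-corrected, step only
    vt_hat = vt / (1 - beta2**t)
    theta = theta - lr * mt_hat / (numpy.sqrt(vt_hat) + epsilon)
    return theta, mt, vt

# Minimize f(theta) = theta**2, whose gradient is 2*theta:
theta, mt, vt = numpy.array([1.0]), 0.0, 0.0
for t in range(1, 2001):
    theta, mt, vt = adam_step(theta, 2 * theta, mt, vt, t, lr=0.01)
print(theta)
```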
Update. This second part of the answer concerns the second version of the code.
About Adam. You got an exploding vt because of the line vt=vt/(1-beta2**t). You should apply the bias correction only to the value of vt used to calculate the gradient step, not to the value that is carried over to the next iteration. Like here:
...
mt = beta1*mt_1 + (1-beta1)*dj
vt = beta2*vt_1 + (1-beta2)*dj**2
mt_temp = mt/(1-beta1**t)    # bias-corrected copies, used only for the step
vt_temp = vt/(1-beta2**t)
old_theta = self.theta
self.theta = self.theta - lr*mt_temp/(numpy.sqrt(vt_temp)+epsilon)
mt_1 = mt                    # the raw (uncorrected) moments carry over
vt_1 = vt
...
About instability. Both the Nelder-Mead method and gradient descent depend on the initial values of the parameters; that's the sad truth. You can try to improve convergence by running more iterations of GD and decaying the learning rate in a smarter way, or by decreasing the tolerance parameters xatol and fatol of the Nelder-Mead method.
However, even if you achieve perfect convergence (parameter values like [ 1.81818459 -1.81817712 -4.09093887] in your case), problems remain. Convergence can be roughly checked by the following code:
print(mysvm.costFunc(numpy.concatenate([mysvm.theta, [mysvm.b]]), mysvm.xdata, mysvm.ydata))
print(mysvm.costFunc(numpy.concatenate([mysvm.theta, [mysvm.b+1e-3]]), mysvm.xdata, mysvm.ydata))
print(mysvm.costFunc(numpy.concatenate([mysvm.theta, [mysvm.b-1e-3]]), mysvm.xdata, mysvm.ydata))
print(mysvm.costFunc(numpy.concatenate([mysvm.theta-1e-3, [mysvm.b]]), mysvm.xdata, mysvm.ydata))
print(mysvm.costFunc(numpy.concatenate([mysvm.theta+1e-3, [mysvm.b]]), mysvm.xdata, mysvm.ydata))
which results in
6.7323592305075515
6.7335116664996
6.733895813394582
6.745819882839341
6.741974212439457
Your cost increases if you change theta or the intercept in either direction, so your solution is indeed a local optimum. But then the sklearn solution is not optimal (from the point of view of mysvm), because the code
print(mysvm.costFunc(numpy.concatenate([clf.coef_[0], clf.intercept_]), mysvm.xdata, mysvm.ydata))
prints 40.31527145374271! This means you have reached a local minimum, but sklearn's SVM has minimized something different.
And if you read the sklearn documentation, you can find what's wrong: it minimizes sum(errors) * C + 0.5 * penalty, while you minimize mean(errors) * C + 0.5 * penalty! This is the most probable cause of the discrepancy.
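The two data terms differ by exactly the sample size n, so minimizing mean(errors) * C is equivalent to minimizing sum(errors) * (C/n). A quick numeric check with made-up per-sample errors:

```python
import numpy

errors = numpy.array([0.0, 0.3, 1.2, 0.0, 0.5])  # made-up hinge losses, n = 5
C, penalty = 1.0, 0.8

sklearn_style = numpy.sum(errors) * C + 0.5 * penalty   # sum(errors) * C + ...
mysvm_style = numpy.mean(errors) * C + 0.5 * penalty    # mean(errors) * C + ...
print(sklearn_style, mysvm_style)  # data terms differ by a factor of n = 5
```

So to reproduce the sklearn optimum with your cost, you would have to rescale C by the number of samples.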