Videos
Set $\mathbf w = \phi(\mathbf x)\cdot \mathbf u$ so that $\mathbf w^t \phi(\mathbf x)=\mathbf u^t \cdot \mathbf K$ and $\mathbf w^t\mathbf w = \mathbf u^t\mathbf K\mathbf u$, with $\mathbf K = \phi(\mathbf x)^t\phi(\mathbf x)$, where $\phi(x)$ is a mapping of the original input matrix, $\mathbf x$. This allows one to solve the SVM through the primal formulation. Using your notation for the loss:
$$J(\mathbf{w}, b) = C {\displaystyle \sum\limits_{i=1}^{m} max\left(0, 1 - y^{(i)} (\mathbf{u}^t \cdot \mathbf{K}^{(i)} + b)\right)} + \dfrac{1}{2} \mathbf{u}^t \cdot \mathbf{K} \cdot \mathbf{u}$$
$ \mathbf{K}$ is a $m \times m$ matrix, and $\mathbf{u}$ is a $m \times 1$ matrix. Neither is infinite.
Indeed, the dual is usually faster to solve, but the primal has it's advantages as well, such as approximate solutions (which are not guaranteed in the dual formulation).
Now, why is the dual so much more prominent isn't obvious at all: [1]
The historical reasons for which most of the research in the last decade has been about dual optimization are unclear. We believe that it is because SVMs were first introduced in their hard margin formulation [Boser et al., 1992], for which a dual optimization (because of the constraints) seems more natural. In general, however, soft margin SVMs should be preferred, even if the training data are separable: the decision boundary is more robust because more training points are taken into account [Chapelle et al., 2000]
Chapelle (2007) argues the time complexity of both primal and dual optimization is $\mathcal{O}\left(nn_{sv} + n_{sv}^3\right)$, worst case being $\mathcal{O}\left(n^3\right)$, but they analyzed quadratic and approximate hinge losses, so not a proper hinge loss, as it's not differentiable to be used with Newton's method.
[1] Chapelle, O. (2007). Training a support vector machine in the primal. Neural computation, 19(5), 1155-1178.
If we apply a transformation $\phi$ to all input weight vectors ($\mathbf{x}^{(i)}$), we get the following cost function:
$J(\mathbf{w}, b) = C {\displaystyle \sum\limits_{i=1}^{m} max\left(0, 1 - y^{(i)} (\mathbf{w}^t \cdot \phi(\mathbf{x}^{(i)}) + b)\right)} \quad + \quad \dfrac{1}{2} \mathbf{w}^t \cdot \mathbf{w}$
The kernel trick replaces $\phi(\mathbf{u})^t \cdot \phi(\mathbf{v})$ by $K(\mathbf{u}, \mathbf{v})$. Since the weight vector $\mathbf{w}$ is not transformed, the kernel trick cannot be applied to the cost function above.
The cost function above corresponds to the primal form of the SVM objective:
$\underset{\mathbf{w}, b, \mathbf{\zeta}}\min{C \sum\limits_{i=1}^m{\zeta^{(i)}} + \dfrac{1}{2}\mathbf{w}^t \cdot \mathbf{w}}$
subject to $y^{(i)}(\mathbf{w}^t \cdot \phi(\mathbf{x}^{(i)}) + b) \ge 1 - \zeta^{(i)})$ and $\zeta^{(i)} \ge 0$ for $i=1, \cdots, m$
The dual form is:
$\underset{\mathbf{\alpha}}\min{\dfrac{1}{2}\mathbf{\alpha}^t \cdot \mathbf{Q} \cdot \mathbf{\alpha} - \mathbf{1}^t \cdot \mathbf{\alpha}}$
subject to $\mathbf{y}^t \cdot \mathbf{\alpha} = 0$ and $0 \le \alpha_i \le C$ for $i = 1, 2, \cdots, m$
where $\mathbf{1}$ is a vector full of 1s and $\mathbf{Q}$ is an $m \times m$ matrix with elements $Q_{ij} = y^{(i)} y^{(j)} \phi(\mathbf{x}^{(i)})^t \cdot \phi(\mathbf{x}^{(j)})$.
Now we can use the kernel trick by computing $Q_{ij}$ like so:
$Q_{ij} = y^{(i)} y^{(j)} K(\mathbf{x}^{(i)}, \mathbf{x}^{(j)})$
So the kernel trick can only be used on the dual form of the SVM problem (plus some other algorithms such as logistic regression).
Now you can use off-the-shelf Quadratic Programming libraries to solve this problem, or use Lagrangian multipliers to get an unconstrained function (the dual cost function), then search for a minimum using Gradient Descent or any other optimization technique. One of the most efficient approach seems to be the SMO algorithm implemented by the libsvm library (for kernelized SVM).