Here I will try to provide some intuition for SVCs. The code used to generate the figures is at the end.
Linear SVCs

Suppose you have a batch of 2-pixel "images", and each image is labelled either A or B. Here's a sample of 5 images:

Each pixel is considered a feature. In this example, the two pixels define a 2D feature space that we can visualise along an x and y axis. When you have an 8x8 image, for example, you have a 64-dimensional feature space.
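For instance, with sklearn's built-in digits dataset (just a quick illustration, not part of the figures here), each 8x8 image flattens into a 64-dimensional feature vector:

from sklearn.datasets import load_digits
digits = load_digits()
print(digits.images.shape)  # (1797, 8, 8): 8x8 pixel images
print(digits.data.shape)    # (1797, 64): the same images as 64-dimensional feature vectors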
For each image (there are 500), the pixel values (intensities) are its coordinates in this feature space. Plotting the samples in pixel 1 - pixel 2 space, coloured by their label, yields this representation:

Linear SVCs work by drawing a line in feature space that separates the two classes. The best line is the one with the widest margin. The entire area above the decision line is considered class A, and the entire area below it class B. When a new sample arrives, the SVC assesses where it falls relative to the decision line: above the line means it predicts 'A', below means 'B'.
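Here's a minimal sketch of that behaviour (the clusters and test points are made up for illustration, roughly matching the figure above):

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
#Class A clustered around (2, 2), class B around (-2, -5)
X_toy = np.r_[rng.normal((2, 2), 1, (50, 2)), rng.normal((-2, -5), 1, (50, 2))]
y_toy = np.array(['A'] * 50 + ['B'] * 50)
svc = SVC(kernel='linear').fit(X_toy, y_toy)
print(svc.predict([[2, 3], [-2, -6]]))  # ['A' 'B']
#decision_function gives the signed distance to the line: positive values
#fall on the svc.classes_[1] ('B') side, negative on the 'A' side
print(svc.decision_function([[2, 3], [-2, -6]]))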

The margin extends on either side of the decision line until it first hits a sample. The samples that limit or define the margin are called support vectors:

In this case the best margin is defined by 3 samples. In practice, you usually want your margin (and therefore the SVC's decision line, or solution) to be defined by more samples, i.e. you allow more margin violations. This is controlled by the regularisation parameter C. Smaller values of C let more samples influence the margin (more regularised, potentially underfitting); larger values of C let only a few samples become support vectors, so the margin might end up depending on just a few noisy points (less regularised, potentially overfitting). The optimal C for your dataset is a matter of hyperparameter tuning.
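As a rough sketch of the effect (assuming the X and y arrays synthesised in the code section at the end), you can watch the number of support vectors shrink as C grows:

from sklearn.svm import SVC

for C in [0.01, 1, 100]:
    svc = SVC(kernel='linear', C=C).fit(X, y)
    #n_support_ holds the number of support vectors per class
    print(f'C={C}: {svc.n_support_.sum()} support vectors')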
Non-linear SVCs and the kernel trick

The kernel trick is used when you can't linearly separate the classes using the features you have, such as in this example:

In such cases, you can try creating more complex features, such as these quadratic ones: pixel1 x pixel2, (pixel1)^2, (pixel2)^2 (similar to kernel='poly', degree=2). In this richer feature space, you have a better chance of finding a linear separation of the classes. The kernel trick is an efficient way of finding that linear separation in the higher-dimensional space: it saves you explicitly computing the new features, while still letting you search that space for a solution. When you bring the linear solution back into the original pixel1-pixel2 space, it becomes non-linear, because it was only linear in the higher-dimensional space where it was found. This results in complex decision boundaries:

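To see why this saves work, note that the degree-2 polynomial kernel value (taking gamma=1 and coef0=0 for simplicity; sklearn's general form is (gamma * x.z + coef0)^degree) equals an ordinary dot product in the quadratic feature space, so the SVC can work in that space without ever constructing it. A minimal numerical check:

import numpy as np

x = np.array([1.0, 2.0])
z = np.array([3.0, 0.5])
#Kernel trick: evaluate the degree-2 polynomial kernel directly in 2D
k = (x @ z) ** 2
#Explicit route: map both points into the quadratic feature space first
def phi(v):
    #feature map implied by (x.z)^2: (v1^2, sqrt(2)*v1*v2, v2^2)
    return np.array([v[0]**2, np.sqrt(2) * v[0] * v[1], v[1]**2])
print(k, phi(x) @ phi(z))  # both print 16.0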
I tried to replicate this result manually, i.e. without using the kernel trick. I combined the features to create the quadratic features described above (pixel1 times pixel2, etc.), then ran them through a linear SVC so that it would find a linear separation of my quadratic feature set. I get a similar result (green) to the one above (red):

SVCs are binary classifiers, like the one shown above. When you apply an SVC to more than 2 classes, sklearn trains several binary SVCs (one for each pair of classes) and determines the final result by one-vs-one voting (the decision_function_shape='ovr' default only reshapes the reported decision values; the underlying strategy is still one-vs-one).
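A quick sketch of that behaviour (the 4-class data here is made up for illustration): with 4 classes, SVC trains 4*(4-1)/2 = 6 pairwise classifiers, visible in the shape of the one-vs-one decision values:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
centres = [(0, 0), (6, 0), (0, 6), (6, 6)]
X4 = np.vstack([rng.normal(c, 1, (25, 2)) for c in centres])
y4 = np.repeat(['A', 'B', 'C', 'D'], 25)
svc = SVC(kernel='linear', decision_function_shape='ovo').fit(X4, y4)
print(svc.decision_function(X4[:1]).shape)  # (1, 6): one column per class pair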
Code used to generate the plots
import numpy as np
import pandas as pd
from matplotlib import pyplot as plt
#
#Synthesise 2-pixel images. Each image belongs to A or B.
#
np.random.seed(0)  #any fixed seed, for reproducibility
synth = np.random.randn(4, 250)
X_df = pd.DataFrame({'pixel1': np.c_[ synth[[0]] + 2, synth[[1]] - 2 ].ravel(),
                     'pixel2': np.c_[ synth[[2]] + 2, synth[[3]] - 5 ].ravel(),
                     'label': ['A'] * 250 + ['B'] * 250})
X_df.index.name = 'sample'
X, y = X_df.drop(columns='label').values, X_df.label.values
#
# View some samples
#
f, axs = plt.subplots(1, 5)
for ax in axs.flatten():
    i = np.random.randint(0, 500)
    ax.imshow(X_df.loc[i, ['pixel1', 'pixel2']].infer_objects().values[None, :],
              cmap='binary', vmin=-5, vmax=5)
    ax.set_title(X_df.label[i])
    ax.set_xticks([])
    ax.set_yticks([])
#
# View samples in feature space (pixel1-pixel2 space)
#
f, ax = plt.subplots(figsize=(6, 3))
for label in X_df.label.unique():
    X_df.loc[X_df.label == label].plot(
        kind='scatter', x='pixel1', y='pixel2',
        c='red' if label == 'A' else 'blue', label=label, ax=ax
    )
ax_lims = ax.axis()
#
# Fit linear SVC
#
from sklearn.svm import SVC
svc = SVC(kernel='linear').fit(X, y)
#Plot decision boundary
(w0, w1), b = svc.coef_.ravel().tolist(), svc.intercept_[0]
x0 = np.array([-10, 10])
x1 = (-w0*x0 - b) / w1  #decision line: w0*x0 + w1*x1 + b = 0
ax.plot(x0, x1, 'k', label='decision boundary', linewidth=3)
ax.plot(x0, x1 + 1/w1, 'k:', label='margin', linewidth=1)  #w.x + b = +1
ax.plot(x0, x1 - 1/w1, 'k:', linewidth=1)                  #w.x + b = -1
ax.axis(ax_lims)
#Highlight support vectors
ax.scatter(
    svc.support_vectors_[:, 0], svc.support_vectors_[:, 1],
    marker='s', edgecolor='k', facecolor='none', s=70, label='support vectors'
)
ax.legend()
Code example for non-linear SVMs/kernel trick:
#
# Kernel trick demo
#
np.random.seed(1)
#Synthesise a dataset that's not linearly separable
synth = np.random.randn(4, 250)
X_df = pd.DataFrame({'pixel1': np.c_[ synth[[0]] + 2, synth[[1], :125] - 2, synth[[1], 125:] + 8 ].ravel(),
                     'pixel2': np.c_[ synth[[2]] + 2, synth[[3], :125] - 5, synth[[3], 125:] + 8 ].ravel(),
                     'label': ['A'] * 250 + ['B'] * 250})
X_df.index.name = 'sample'
X, y = X_df.drop(columns='label').values, X_df.label.values
#Plot the data
f, ax = plt.subplots(figsize=(6, 3))
for label in X_df.label.unique():
    X_df.loc[X_df.label == label].plot(
        kind='scatter', x='pixel1', y='pixel2',
        c='red' if label == 'A' else 'blue', label=label, ax=ax
    )
ax_lims = ax.axis()
#
#SVC(kernel='poly') uses the kernel trick
#
svc = SVC(kernel='poly', degree=2, C=1e6).fit(X, y)
#Evaluate the classifier over a grid covering the plot
xx, yy = np.meshgrid(np.linspace(-10, 10),
                     np.linspace(-10, 10))
X_grid = np.c_[xx.ravel(), yy.ravel()]
preds = np.float32(svc.predict(X_grid) == 'B').reshape(xx.shape)  #A -> 0, B -> 1
ax.contour(xx, yy, preds, cmap='Reds')
#
# Manual equivalent of the SVC above
# but without using the kernel trick
#
#Create a richer feature space from the existing features
#We'll opt for a quadratic polynomial feature space, like the SVC above
X2_df = X_df.copy()
X2_df['pixel1_squared'] = X2_df.pixel1 ** 2
X2_df['pixel2_squared'] = X2_df.pixel2 ** 2
X2_df['pixel1_pixel2'] = X2_df.pixel1 * X2_df.pixel2
X2 = X2_df.drop(columns='label').values
#Find a linear separation in this space
svc2 = SVC(kernel='linear', C=1e6).fit(X2, y)
#Plot the decision boundary
X2_grid = np.c_[X_grid, X_grid[:, 0]**2, X_grid[:, 1]**2, X_grid[:, 0]*X_grid[:, 1]]
preds2 = np.float32(svc2.predict(X2_grid) == 'B').reshape(xx.shape)
ax.contour(xx, yy, preds2, cmap='Greens')