Honestly, there's probably not going to be anything faster than np.inner or np.dot. If you find the intermediate variable annoying, you can always wrap the call in a lambda:
sqeuclidean = lambda x: np.inner(x, x)
np.inner and np.dot leverage BLAS routines, and will almost certainly be faster than standard elementwise multiplication followed by summation.
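For concreteness, here is a minimal sketch (the array names are just illustrative) showing that the lambda applied to the difference vector agrees with the elementwise form:

import numpy as np

# the lambda from above: squared Euclidean norm via a BLAS-backed product
sqeuclidean = lambda x: np.inner(x, x)

a = np.random.randn(1000000)
b = np.random.randn(1000000)

d = a - b
# both expressions compute sum((a_i - b_i)**2); np.inner avoids
# materialising the intermediate (d ** 2) array
assert np.isclose(sqeuclidean(d), ((a - b) ** 2).sum())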
In [1]: %%timeit -n 1 -r 100 a, b = np.random.randn(2, 1000000)
   ....: ((a - b) ** 2).sum()
   ....:
The slowest run took 36.13 times longer than the fastest. This could mean that an intermediate result is being cached.
1 loops, best of 100: 6.45 ms per loop

In [2]: %%timeit -n 1 -r 100 a, b = np.random.randn(2, 1000000)
   ....: np.linalg.norm(a - b, ord=2) ** 2
   ....:
1 loops, best of 100: 2.74 ms per loop

In [3]: %%timeit -n 1 -r 100 a, b = np.random.randn(2, 1000000)
   ....: sqeuclidean(a - b)
   ....:
1 loops, best of 100: 2.64 ms per loop
np.linalg.norm(..., ord=2) uses np.dot internally, and gives very similar performance to using np.inner directly.
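If you want to convince yourself that the three formulations agree numerically, a quick check (variable names here are arbitrary) is:

import numpy as np

d = np.random.randn(1000)
x1 = np.inner(d, d)
x2 = np.dot(d, d)
x3 = np.linalg.norm(d, ord=2) ** 2

# all three compute the squared Euclidean norm of d
assert np.allclose([x1, x2], x3)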
I don't know if the performance is any good, but (a**2).sum() calculates the right value and has the non-repeated argument you want. You can replace a with any complicated expression without binding it to a variable; just remember to use parentheses as necessary, since ** binds more tightly than most other operators: ((a-b)**2).sum()
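As a quick illustration of why both pairs of parentheses matter (array names are arbitrary):

import numpy as np

a = np.random.randn(5)
b = np.random.randn(5)

# ** binds more tightly than -, and attribute access binds more tightly
# than **, so both sets of parentheses are needed:
d2 = ((a - b) ** 2).sum()

# without the inner parentheses, b would be squared before the subtraction;
# without the outer ones, .sum() would no longer apply to the squared array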