Honestly there's probably not going to be anything faster than np.inner or np.dot. If you find intermediate variables annoying, you could always create a lambda function:
sqeuclidean = lambda x: np.inner(x, x)
np.inner and np.dot leverage BLAS routines, and will almost certainly be faster than standard elementwise multiplication followed by summation.
In [1]: %%timeit -n 1 -r 100 a, b = np.random.randn(2, 1000000)
((a - b) ** 2).sum()
....:
The slowest run took 36.13 times longer than the fastest. This could mean that an intermediate result is being cached
1 loops, best of 100: 6.45 ms per loop
In [2]: %%timeit -n 1 -r 100 a, b = np.random.randn(2, 1000000)
np.linalg.norm(a - b, ord=2) ** 2
....:
1 loops, best of 100: 2.74 ms per loop
In [3]: %%timeit -n 1 -r 100 a, b = np.random.randn(2, 1000000)
sqeuclidean(a - b)
....:
1 loops, best of 100: 2.64 ms per loop
np.linalg.norm(..., ord=2) uses np.dot internally, and gives very similar performance to using np.inner directly.
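If you want to reproduce the comparison outside IPython, here is a minimal, self-contained sketch using the timeit module; the best-of-100 single runs mirror %%timeit -n 1 -r 100, and the array size is the same arbitrary one used above:

import timeit
import numpy as np

# Squared Euclidean norm via a BLAS-backed inner product.
sqeuclidean = lambda x: np.inner(x, x)

a, b = np.random.randn(2, 1000000)

# Each candidate computes the squared Euclidean distance between a and b.
candidates = {
    "elementwise + sum": lambda: ((a - b) ** 2).sum(),
    "norm ** 2":         lambda: np.linalg.norm(a - b, ord=2) ** 2,
    "np.inner":          lambda: sqeuclidean(a - b),
}

for name, fn in candidates.items():
    # Best of 100 single runs, as in the %%timeit cells above.
    best = min(timeit.repeat(fn, number=1, repeat=100))
    print(f"{name:20s} {best * 1e3:.2f} ms")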
I don't know if the performance is any good, but (a**2).sum() calculates the right value and has the non-repeated argument you want. You can replace a with any complicated expression without binding it to a variable; just remember to use parentheses where necessary, since ** binds more tightly than most other operators: ((a - b) ** 2).sum()
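A quick check with small throwaway arrays that the parenthesized form computes the squared distance, and that dropping the parentheses changes the meaning because ** outranks -:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

# With parentheses: sum of squared differences, i.e. the squared distance.
print(((a - b) ** 2).sum())   # 27.0

# Without them, ** binds first, so this is a - (b ** 2), summed.
print((a - b ** 2).sum())     # -71.0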
Use numpy.linalg.norm:
dist = numpy.linalg.norm(a-b)
This works because the Euclidean distance is the l2 norm, and the default value of the ord parameter in numpy.linalg.norm is 2.
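As a small sanity check (the array values here are just for illustration), the default norm matches the l2 formula written out by hand:

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

dist = np.linalg.norm(a - b)              # default ord=2
explicit = np.sqrt(((a - b) ** 2).sum())  # l2 norm written out

print(dist, explicit)             # both ~5.196
print(np.isclose(dist, explicit)) # True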
For more theory, see Introduction to Data Mining.
Use scipy.spatial.distance.euclidean:
from scipy.spatial import distance
a = (1, 2, 3)
b = (4, 5, 6)
dst = distance.euclidean(a, b)
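If it's the squared distance you're after, as in the first answer above, scipy also exposes sqeuclidean alongside euclidean:

from scipy.spatial import distance

a = (1, 2, 3)
b = (4, 5, 6)

dst = distance.euclidean(a, b)   # ~5.196
sq = distance.sqeuclidean(a, b)  # 27.0, the squared distance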
For numpy 1.9+
Note that, as perimosocordiae shows, as of NumPy version 1.9, np.linalg.norm(x, axis=1) is the fastest way to compute the L2-norm.
For numpy < 1.9
If you are computing an L2-norm, you could compute it directly (using the axis=-1 argument to sum along rows):
np.sum(np.abs(x)**2,axis=-1)**(1./2)
Lp-norms can be computed similarly of course.
It is considerably faster than np.apply_along_axis, though perhaps not as convenient:
In [48]: %timeit np.apply_along_axis(np.linalg.norm, 1, x)
1000 loops, best of 3: 208 us per loop
In [49]: %timeit np.sum(np.abs(x)**2,axis=-1)**(1./2)
100000 loops, best of 3: 18.3 us per loop
Other ord forms of norm can be computed directly too (with similar speedups):
In [55]: %timeit np.apply_along_axis(lambda row:np.linalg.norm(row,ord=1), 1, x)
1000 loops, best of 3: 203 us per loop
In [54]: %timeit np.sum(abs(x), axis=-1)
100000 loops, best of 3: 10.9 us per loop
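Following the same sum-then-root pattern, a general row-wise Lp-norm might look like the sketch below; p is a free parameter, and p=2 and p=1 reproduce the two expressions timed above:

import numpy as np

def lp_norm(x, p, axis=-1):
    """Row-wise Lp-norm via the sum-then-root pattern."""
    return np.sum(np.abs(x) ** p, axis=axis) ** (1.0 / p)

x = np.random.random((500, 500))

# p=2 matches the explicit L2 expression used in the timings above.
print(np.allclose(lp_norm(x, 2), np.sum(np.abs(x)**2, axis=-1)**(1./2)))  # True

# p=1 matches the ord=1 computation.
print(np.allclose(lp_norm(x, 1), np.sum(np.abs(x), axis=-1)))             # True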
Resurrecting an old question due to a numpy update. As of the 1.9 release, numpy.linalg.norm now accepts an axis argument. [code, documentation]
This is the new fastest method in town:
In [10]: x = np.random.random((500,500))
In [11]: %timeit np.apply_along_axis(np.linalg.norm, 1, x)
10 loops, best of 3: 21 ms per loop
In [12]: %timeit np.sum(np.abs(x)**2,axis=-1)**(1./2)
100 loops, best of 3: 2.6 ms per loop
In [13]: %timeit np.linalg.norm(x, axis=1)
1000 loops, best of 3: 1.4 ms per loop
And to prove it's calculating the same thing:
In [14]: np.allclose(np.linalg.norm(x, axis=1), np.sum(np.abs(x)**2,axis=-1)**(1./2))
Out[14]: True
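To tie this back to distances: with the axis argument, per-row Euclidean distances between two sets of points become a one-liner, with no Python-level loop (the shapes here are just illustrative):

import numpy as np

# Two sets of 500 points in 3-D.
p = np.random.random((500, 3))
q = np.random.random((500, 3))

# Euclidean distance between corresponding rows.
dists = np.linalg.norm(p - q, axis=1)

print(dists.shape)  # (500,)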