A norm is a function that takes a vector as an input and returns a scalar value that can be interpreted as the "size", "length" or "magnitude" of that vector. More formally, norms are defined as having the following mathematical properties:
- They scale multiplicatively, i.e. Norm(a·v) = |a|·Norm(v) for any scalar a
- They satisfy the triangle inequality, i.e. Norm(u + v) ≤ Norm(u) + Norm(v)
- The norm of a vector is zero if and only if it is the zero vector, i.e. Norm(v) = 0 ⟺ v = 0
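These three properties are easy to spot-check numerically for the Euclidean norm. A minimal sketch using NumPy (the random vectors and the scalar here are just illustrative choices):

```python
import numpy as np

# Illustrative random vectors and scalar
rng = np.random.default_rng(0)
u = rng.standard_normal(5)
v = rng.standard_normal(5)
a = -3.7

# Multiplicative scaling: Norm(a*v) == |a| * Norm(v)
assert np.isclose(np.linalg.norm(a * v), abs(a) * np.linalg.norm(v))

# Triangle inequality: Norm(u + v) <= Norm(u) + Norm(v)
assert np.linalg.norm(u + v) <= np.linalg.norm(u) + np.linalg.norm(v)

# Definiteness: only the zero vector has norm zero
assert np.linalg.norm(np.zeros(5)) == 0.0
assert np.linalg.norm(v) > 0
```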
The Euclidean norm (also known as the L² norm) is just one of many different norms - there is also the max norm, the Manhattan norm, etc. The L² norm of a single vector is equivalent to the Euclidean distance from that point to the origin, and the L² norm of the difference between two vectors is equivalent to the Euclidean distance between the two points.
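`np.linalg.norm` exposes all three of these norms through its `ord` parameter. A quick sketch with a hand-picked vector:

```python
import numpy as np

v = np.array([3.0, -4.0])

# Euclidean (L2) norm: sqrt(3^2 + 4^2) = 5
print(np.linalg.norm(v))              # 5.0
# Manhattan (L1) norm: |3| + |-4| = 7
print(np.linalg.norm(v, ord=1))       # 7.0
# Max (L-infinity) norm: max(|3|, |-4|) = 4
print(np.linalg.norm(v, ord=np.inf))  # 4.0

# L2 norm of a difference = Euclidean distance between the two points
p, q = np.array([1.0, 2.0]), np.array([4.0, 6.0])
print(np.linalg.norm(p - q))          # 5.0
```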
As @nobar's answer says, np.linalg.norm(x - y, ord=2) (or just np.linalg.norm(x - y)) will give you the Euclidean distance between the vectors x and y.
Since you want to compute the Euclidean distance between a[1, :] and every other row in a, you could do this a lot faster by eliminating the for loop and broadcasting over the rows of a:
dist = np.linalg.norm(a[1:2] - a, axis=1)
It's also easy to compute the Euclidean distance yourself using broadcasting:
dist = np.sqrt(((a[1:2] - a) ** 2).sum(1))
The fastest method is probably scipy.spatial.distance.cdist:
from scipy.spatial.distance import cdist
dist = cdist(a[1:2], a)[0]
Some timings for a (1000, 1000) array:
a = np.random.randn(1000, 1000)
%timeit np.linalg.norm(a[1:2] - a, axis=1)
# 100 loops, best of 3: 5.43 ms per loop
%timeit np.sqrt(((a[1:2] - a) ** 2).sum(1))
# 100 loops, best of 3: 5.5 ms per loop
%timeit cdist(a[1:2], a)[0]
# 1000 loops, best of 3: 1.38 ms per loop
# check that all 3 methods return the same result
d1 = np.linalg.norm(a[1:2] - a, axis=1)
d2 = np.sqrt(((a[1:2] - a) ** 2).sum(1))
d3 = cdist(a[1:2], a)[0]
assert np.allclose(d1, d2) and np.allclose(d1, d3)
The concept of a "norm" is a generalized idea in mathematics which, when applied to vectors (or vector differences), broadly represents some measure of length. There are various approaches to computing a norm; the one corresponding to Euclidean distance is called the "2-norm", and is based on applying an exponent of 2 (the "square") to each component and, after summing, applying an exponent of 1/2 (the "square root").
It's a bit cryptic in the docs, but you get the Euclidean distance between two vectors by setting the parameter ord=2.
sum(abs(x)**ord)**(1./ord)
becomes sqrt(sum(x**2)).
Note: as pointed out by @Holt, the default value is ord=None, which is documented to compute the "2-norm" for vectors. This is, therefore, equivalent to ord=2 (Euclidean distance).
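A short sketch verifying that the documented formula, the explicit `ord=2`, and the default all agree (the test vector is just an example):

```python
import numpy as np

x = np.array([1.0, -2.0, 2.0])

# The documented formula with ord=2: sum(abs(x)**2)**(1/2) = sqrt(sum(x**2))
manual = np.sum(np.abs(x) ** 2) ** (1.0 / 2)
print(manual)                       # 3.0
print(np.linalg.norm(x, ord=2))     # 3.0
print(np.linalg.norm(x))            # 3.0 (default ord=None is the 2-norm for vectors)
```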
OK, let's see if this helps you. Suppose you have two functions $f,g:[a,b]\to \mathbb{R}$. If someone asks you what the distance between $f(x)$ and $g(x)$ is, it is easy: you would say $|f(x)-g(x)|$. But if I ask what the distance between $f$ and $g$ is, this question is kind of absurd. However, I can ask: what is the distance between $f$ and $g$ on average? Then it is $$ \dfrac{1}{b-a}\int_a^b |f(x)-g(x)|\,dx=\dfrac{||f-g||_1}{b-a} $$ which gives the $L^1$-norm. But this is just one of many different ways you can do the averaging: another way is related to the integral $$ \left[\int_a^b|f(x)-g(x)|^p\, dx\right]^{1/p}:=||f-g||_{p} $$ which is the $L^p$-norm in general.
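The averaged $L^1$ distance is easy to approximate numerically. A sketch, assuming the illustrative choice $f(x)=x^2$, $g(x)=x$ on $[0,1]$ (where the exact integral is $\int_0^1 (x - x^2)\,dx = 1/6$):

```python
import numpy as np

# Illustrative example: f(x) = x^2, g(x) = x on [a, b] = [0, 1]
a, b = 0.0, 1.0
x = np.linspace(a, b, 100_001)
dx = x[1] - x[0]
f, g = x**2, x

# L1 norm of f - g: integral of |f - g| over [a, b] (simple Riemann sum)
l1 = np.sum(np.abs(f - g)) * dx
print(l1)              # ~ 1/6

# Average distance = L1 norm / (b - a)
print(l1 / (b - a))
```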
Let us investigate the norm of $f(x)=x$ on $[0,1]$ for different $L^p$ norms. I suggest you draw the graphs of $x^{p}$ for a few $p$ to see how a higher $p$ makes $x^{p}$ flatter near the origin, and how the integral therefore favors the vicinity of $x=1$ more and more as $p$ grows. $$ ||x||_p=\left[\int_0^1 x^{p}\,dx\right]^{1/p}=\frac{1}{(p+1)^{1/p}} $$ The $L^p$ norm is smaller than the $L^m$ norm if $m>p$, because the behavior near more points is downplayed in $m$ in comparison to $p$. So depending on what you want to capture in your averaging and how you want to define "the distance" between functions, you use different $L^p$ norms.
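The closed form $(p+1)^{-1/p}$ can be checked against a direct numerical integration, and tabulating it for a few $p$ makes the monotone growth visible (the grid size is just an illustrative choice):

```python
import numpy as np

x = np.linspace(0.0, 1.0, 200_001)
dx = x[1] - x[0]

# Compare the closed form (p+1)^(-1/p) against a Riemann-sum approximation
for p in [1, 2, 4, 10]:
    closed_form = (p + 1) ** (-1.0 / p)
    numeric = (np.sum(x**p) * dx) ** (1.0 / p)
    print(p, round(closed_form, 4), round(numeric, 4))
# The printed values grow with p, as the text claims
```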
This also motivates why the $L^\infty$ norm is nothing but the essential supremum of $|f|$; i.e., you filter out everything other than the highest values of $|f(x)|$ as you let $p\to \infty$.
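This limit can also be watched numerically. A sketch with the illustrative choice $f(x) = 4x(1-x)$ on $[0,1]$, whose supremum is $1$ at $x=1/2$:

```python
import numpy as np

x = np.linspace(0.0, 1.0, 100_001)
dx = x[1] - x[0]
f = 4 * x * (1 - x)          # peaks at 1.0 at x = 0.5

# The L^p norm creeps toward sup|f| = 1.0 as p grows
for p in [1, 10, 100, 1000]:
    lp = (np.sum(np.abs(f) ** p) * dx) ** (1.0 / p)
    print(p, round(lp, 4))
```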
There are several good answers here, one accepted. Nevertheless I'm surprised not to see the $L^2$ norm described as the infinite dimensional analogue of Euclidean distance.
In the plane, the length of the vector $(x,y)$ - that is, the distance between $(x,y)$ and the origin - is $\sqrt{x^2 + y^2}$. In $n$-space it's the square root of the sum of the squares of the components.
Now think of a function as a vector with infinitely many components (its value at each point in the domain) and replace summation by integration to get the $L^2$ norm of a function.
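A sketch of this sum-to-integral passage: sample a function on a fine grid, take the ordinary vector 2-norm of the samples, and rescale by $\sqrt{dx}$ so the sum approximates the integral (the function $\sin(\pi x)$ is just an illustrative choice, with $\int_0^1 \sin^2(\pi x)\,dx = 1/2$):

```python
import numpy as np

# Discretize f(x) = sin(pi x) on [0, 1]: a function as a "vector" of samples
n = 100_000
x = np.linspace(0.0, 1.0, n, endpoint=False) + 0.5 / n   # midpoints
f = np.sin(np.pi * x)

# Vector 2-norm of the samples, rescaled by sqrt(dx), approximates the L2 norm
dx = 1.0 / n
l2 = np.sqrt(np.sum(f**2) * dx)
print(l2)   # -> sqrt(1/2) ~ 0.7071
```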
Finally, tacking on to the end of the last sentence of @levap's answer:
... the $L^2$ norm has the advantage that it comes from an inner product and so all the techniques from inner product spaces (orthogonal projections, etc) can be applied when we use the $L^2$ norm.
Well, while it is true that under some notation $\nabla_x \|x\|^2_2=2x$, I think it is a bad habit as it doesn't go naturally with matrix calculus. Usually, given $f:\mathbb R^n\rightarrow\mathbb R^m$, its derivative matrix $Df$ is an $m\times n$ matrix. The $ij$ entry is the derivative of $f_i$ w.r.t. $x_j$. So in this case we expect $D(\|x\|^2_2)$ to be a $1\times n$ row vector. Namely, $D(\|x\|^2_2)=2x^T$.
The chain rule only works naturally in this setting when we are using derivative matrices. It says that (under some simplification) $D(f\circ g)(x)= Df(g(x))\cdot Dg(x)$, where the multiplication is matrix multiplication. Under this setting we see the following, using the well-known formula $D(Ax)=A$ (which makes sense: the linear approximation of a linear function is the function itself): $$D_x(\|z-Zx\|_2^2)=2(z-Zx)^T D(z-Zx)=-2(z^T-x^TZ^T)Z$$ This is a row vector, of course. I am guessing your book equates it to zero; in that case, to get a nicer notation you may transpose both sides and get $Z^T(z-Zx)=0$. Another approach is to use the fact that the squared norm is just $(Zx-z)^T(Zx-z)$, expand, and then there is no need for the chain rule.
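The row-vector formula $D_x(\|z-Zx\|_2^2) = -2(z-Zx)^T Z$ can be sanity-checked against central finite differences (the random $Z$, $z$, $x$ below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)
Z = rng.standard_normal((5, 3))
z = rng.standard_normal(5)
x = rng.standard_normal(3)

def f(x):
    return np.sum((z - Z @ x) ** 2)   # ||z - Zx||_2^2

# Analytic derivative (row vector): -2 (z - Zx)^T Z
analytic = -2 * (z - Z @ x) @ Z

# Central finite differences, one coordinate at a time
eps = 1e-6
numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(3)])

print(np.allclose(analytic, numeric, atol=1e-5))   # True
```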
Let's write the squared norm in this form and use $y$ instead of $z$ to avoid confusion
\begin{equation} \begin{split} f & = \|y - Zx\|^2_2 \\ & = (y - Zx)^T(y - Zx) \\ & = (y^T - x^TZ^T)(y - Zx) \\ & = y^Ty - y^TZx - x^TZ^Ty + x^TZ^TZx \\ df & = d(y^Ty) - d(y^TZx) - d(x^TZ^Ty) + d(x^TZ^TZx) \end{split} \end{equation}
Now we will work out each term separately,
It is clear that $\frac{d(y^Ty)}{dx} =0$, since $y^Ty$ does not depend on $x$, so there is no need to develop it further
For the 2nd term $d(y^TZx)$,
\begin{equation} \begin{split} d(y^TZx) & = (dy^T)Zx + y^T(dZ)x + y^TZ(dx) = y^TZ(dx) \\ \frac{d(y^TZx)}{dx} & = y^TZ \\ \end{split} \end{equation}
(the first two terms vanish because $y$ and $Z$ are constants, so $dy = 0$ and $dZ = 0$)
For the 3rd term $d(x^TZ^Ty)$, we will use the fact that a scalar equals its own transpose,
$$ d(y^TZx) = d(y^TZx)^T = d(x^TZ^Ty) $$
so
$$ \frac{d(x^TZ^Ty)}{dx} = y^TZ $$
For the last term $d(x^TZ^TZx)$, we apply the product rule, differentiating each occurrence of $x$ in turn, so
\begin{equation} \begin{split} d(x^TZ^TZx) & = (x^TZ^TZ)dx + (x^TZ^TZ)dx \\ \frac{d(x^TZ^TZx)}{dx} &= x^T(Z^TZ + Z^TZ) \\ &= 2x^T(Z^TZ) \\ \end{split} \end{equation}
Note: depending on your preferred layout convention, the derivative could be written either as
$$\frac{d(x^TZ^TZx)}{dx} = x^T(Z^TZ + (Z^TZ)^T) = 2 x^TZ^TZ$$
or
$$\frac{d(x^TZ^TZx)}{dx} = 2Z^TZx$$
Finally, putting all terms together, we obtain
$$ \frac{d(\|y - Zx\|^2_2)}{dx} = 2x^T(Z^TZ) - 2y^TZ = 2(x^TZ^T -y^T)Z$$
Using the other convention, we obtain
$$ \frac{d(\|y - Zx\|^2_2)}{dx} = 2Z^TZx - 2Z^Ty = 2Z^T(Zx -y)$$
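As a final check, the two conventions are transposes of the same object, and both agree with a finite-difference derivative (random $Z$, $y$, $x$ below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(2)
Z = rng.standard_normal((6, 4))
y = rng.standard_normal(6)
x = rng.standard_normal(4)

def f(x):
    return np.sum((y - Z @ x) ** 2)   # ||y - Zx||_2^2

# Gradient-layout result: 2 Z^T (Zx - y)
grad = 2 * Z.T @ (Z @ x - y)
# Row-layout result: 2 (x^T Z^T - y^T) Z -- the transpose of the same thing
row = 2 * ((Z @ x - y) @ Z)
assert np.allclose(grad, row)          # identical as 1-D arrays

# Central finite-difference check
eps = 1e-6
numeric = np.array([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                    for e in np.eye(4)])
print(np.allclose(grad, numeric, atol=1e-5))   # True
```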