OK let's see if this helps you. Suppose you have two functions $f$ and $g$. If someone asks you what is the distance between $f(x)$ and $g(x)$, it is easy: you would say $|f(x)-g(x)|$. But if I ask what is the distance between $f$ and $g$, this question is kind of absurd. But I can ask what is the distance between $f$ and $g$ on average? Then it is
$$\int |f(x)-g(x)|\,dx,$$
which gives the $L^1$-norm. But this is just one of the many different ways you can do the averaging: another way would be related to the integral
$$\left(\int |f(x)-g(x)|^p\,dx\right)^{1/p},$$
which is the $L^p$-norm in general.
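To make the averaging idea concrete, here is a small numerical sketch (the two functions are just made-up examples) that approximates the $L^p$ distance between functions on $[0,1]$ with a Riemann sum:

```python
import numpy as np

# Two arbitrary example functions on [0, 1] (any integrable f, g would do).
def f(x):
    return np.sin(2 * np.pi * x)

def g(x):
    return x

def lp_distance(f, g, p, n=100_000):
    """Approximate (integral of |f - g|^p over [0, 1])^(1/p) by a midpoint Riemann sum."""
    x = (np.arange(n) + 0.5) / n        # midpoints of n equal subintervals
    return np.mean(np.abs(f(x) - g(x)) ** p) ** (1 / p)

print(lp_distance(f, g, p=1))   # the plain average distance: the L^1 case
print(lp_distance(f, g, p=2))   # a different way of averaging: the L^2 case
```

The two printed numbers differ: each $p$ really is a different way of averaging the pointwise distances.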
Let us investigate the norm of $f(x)=x$ in $[0,1]$ for different $L^p$ norms. I suggest you draw the graphs of $x^p$ for a few $p$ to see how higher $p$ makes $x^p$ flatter near the origin and how the integral therefore favors the vicinity of $x=1$ more and more as $p$ becomes bigger.
The $L^p$ norm is smaller than the $L^q$ norm if $p<q$ because the behavior near more points is downplayed in $L^q$ in comparison to $L^p$. So depending on what you want to capture in your averaging and how you want to define `the distance' between functions, you utilize different $L^p$ norms.
This also motivates why the $L^\infty$ norm is nothing but the essential supremum of $|f|$; i.e. you filter everything out other than the highest values of $|f|$ as you let $p \to \infty$.
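This behavior is easy to check numerically. For the example function $f(x)=x$ on $[0,1]$, the $L^p$ norm has the closed form $(1/(p+1))^{1/p}$; a quick sketch shows the norms growing with $p$ and creeping toward the essential supremum, $1$:

```python
# For f(x) = x on [0, 1], the L^p norm is (1/(p+1))**(1/p); as p grows,
# the norm increases monotonically toward the essential supremum, 1.
for p in [1, 2, 4, 10, 100]:
    norm_p = (1 / (p + 1)) ** (1 / p)
    print(p, norm_p)
```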
There are several good answers here, one accepted. Nevertheless I'm surprised not to see the $L^2$ norm described as the infinite-dimensional analogue of Euclidean distance.
In the plane, the length of the vector $(x,y)$ - that is, the distance between $(x,y)$ and the origin - is $\sqrt{x^2+y^2}$. In $n$-space it's the square root of the sum of the squares of the components.
Now think of a function as a vector with infinitely many components (its value at each point in the domain) and replace summation by integration to get the $L^2$ norm of a function.
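The sum-to-integral passage can be seen numerically: sample a function at $n$ points, take the ordinary Euclidean norm of the sample vector, and rescale; as $n$ grows this converges to the $L^2$ norm. A sketch with the example $f(x)=x$, whose $L^2$ norm on $[0,1]$ is $1/\sqrt{3}$:

```python
import numpy as np

# Treat f(x) = x on [0, 1] as a vector of n samples; the scaled Euclidean
# norm sqrt(sum(v_i^2) / n) approaches the L^2 norm, 1/sqrt(3) ≈ 0.5774.
for n in [10, 1_000, 100_000]:
    v = (np.arange(n) + 0.5) / n        # sample values of f at midpoints
    print(n, np.sqrt(np.sum(v ** 2) / n))
```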
Finally, to tack on at the end of the last sentence of @levap's answer:
... the $L^2$ norm has the advantage that it comes from an inner product, and so all the techniques from inner product spaces (orthogonal projections, etc.) can be applied when we use the $L^2$ norm.
In your use case, the actual norm of the vector does not matter since you are only concerned about the dominant eigenvalue. The only reason to normalize during the iteration is to keep the numbers from growing exponentially. You scale the vector however you want to prevent numeric overflow.
A key concept about eigenvectors and eigenvalues is that the set of vectors corresponding to an eigenvalue form a linear subspace. This is a consequence of multiplication by a matrix being a linear map. In particular, any scalar multiple of an eigenvector is also an eigenvector for the same eigenvalue.
The Wikipedia article Power method mentions the use of the Rayleigh quotient to compute an approximation to the dominant eigenvalue. For real vectors and matrices it is given by the value
$$\frac{x^T A x}{x^T x}.$$
There are probably good reasons for the use of this formula. Of course, if $x$ is normalized so that $\|x\|_2=1$, then you can simplify that to $x^T A x$.
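Putting the pieces together, here is a minimal sketch of the power method with 2-norm normalization and the Rayleigh quotient (the matrix is a made-up example; since $\|x\|_2=1$ after rescaling, the quotient reduces to $x^TAx$):

```python
import numpy as np

A = np.array([[2.0, 1.0],       # example symmetric matrix with dominant
              [1.0, 3.0]])      # eigenvalue (5 + sqrt(5)) / 2 ≈ 3.618
x = np.array([1.0, 0.0])        # any start not orthogonal to the
                                # dominant eigenvector works

for _ in range(100):
    x = A @ x
    x = x / np.linalg.norm(x)   # rescale only to avoid overflow;
                                # the direction is all that matters

dominant = x @ A @ x            # Rayleigh quotient, with x^T x = 1
print(dominant)
```

Any other scaling (say, dividing by the largest entry) would serve equally well, since only the direction of the iterate carries information.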
Tell me which of these looks more like a ball, and I'll tell you which one is the better norm:

More seriously, the two-norm is given by an inner product, and this has far-reaching consequences: orthogonality, projections, complemented subspaces, orthonormal bases, etc., etc., etc. - features that we see in Hilbert spaces.
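As one concrete illustration (a sketch, with a made-up example function), the $L^2$ inner product $\langle f,g\rangle=\int_0^1 fg\,dx$ gives orthogonal projections; e.g. the best constant approximation of a function in the two-norm is its mean:

```python
import numpy as np

n = 100_000
x = (np.arange(n) + 0.5) / n            # midpoint grid on [0, 1]

def inner(u, v):
    """Approximate the L^2 inner product <u, v> = integral of u*v over [0, 1]."""
    return np.mean(u * v)

f_vals = x ** 2                          # example function f(x) = x^2
g_vals = np.ones(n)                      # project onto the constant function 1

# Orthogonal projection: proj_g(f) = (<f, g> / <g, g>) g.
coeff = inner(f_vals, g_vals) / inner(g_vals, g_vals)
print(coeff)                             # the mean of x^2 on [0, 1], i.e. ~1/3
```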
The first point is proven as follows.
From the SVD $A = UDV^T$ we can see that the eigenvalues of $A^TA = VD^2V^T$ are just the squared ones from $A$ (assuming $A$ is symmetric, so that its singular values are the absolute values of its eigenvalues). At the same time the columns of $V$ are the eigenvectors of $A^TA$. So, exploiting the orthogonality of the eigenvectors,
$$\|Ax\|_2^2 = \|UDV^Tx\|_2^2 = \|D\left(V^Tx\right)\|_2^2 = \|De_{\lambda}\|_2^2\,\|x\|_2^2 = \lambda\|x\|_2^2,$$
where $x$ is an eigenvector of $A^TA$ for the eigenvalue $\lambda$, $e_{\lambda}$ is the corresponding standard basis vector (so $V^Tx = \|x\|_2\,e_{\lambda}$), and $De_{\lambda} = \sqrt{\lambda}\,e_{\lambda}$.
The proof is based on this property of the matrix 2-norm:
$$\|A\|_{2}={\sqrt {\lambda _{\max }(A^{T}A)}}, \qquad \|A^{T}A\|_{2}={\sqrt {\lambda _{\max }(A^{T}AA^{T}A)}}.$$
The same reasoning as in the first point says that the eigenvalues of $A^{T}AA^{T}A$ are just those of $A^{T}A$ squared. So
$$\|A\|_{2}={\sqrt {\lambda _{\max }(A^{T}A)}}$$
$$\|A^{T}A\|_{2}={\sqrt {\lambda _{\max }(A^{T}AA^{T}A)}} = {\sqrt {\lambda _{\max }^2(A^{T}A)}} = \vert\lambda _{\max }(A^{T}A)\vert = \lambda_{\max}(A^TA)=\max(\lambda_{\min}^2(A),\lambda_{\max}^2(A)) \ge \|A\|_{2}^2$$
Edit
The inequality follows after noting that $\lambda_{\max}(A^TA)=\max(\lambda_{\min}^2(A),\lambda_{\max}^2(A))$, because the eigenvalues of $A$ could be negative.
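These identities are easy to sanity-check numerically, e.g. with NumPy on a random symmetric matrix (symmetry keeps the eigenvalues of $A$ real, as the argument assumes):

```python
import numpy as np

rng = np.random.default_rng(1)
B = rng.standard_normal((4, 4))
A = (B + B.T) / 2                        # random symmetric test matrix

norm_A = np.linalg.norm(A, 2)            # matrix 2-norm (largest singular value)
lam_max = np.max(np.linalg.eigvalsh(A.T @ A))
norm_AtA = np.linalg.norm(A.T @ A, 2)

print(norm_A, np.sqrt(lam_max))          # ||A||_2 = sqrt(lambda_max(A^T A))
print(norm_AtA, norm_A ** 2)             # ||A^T A||_2 = ||A||_2 ^ 2
```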
For the problem $(2)$, we have the following:
By definition of the 2-norm of a matrix, $\|A\|_2=\max_{\|x\|_2=1}\|Ax\|_2$, where $x \in \mathbb{R}^n$, $A$ is an $m \times n$ matrix and $Ax \in \mathbb{R}^m$.
Also, by the SVD, $A=U\Sigma V^T$, where $U$ ($m \times n$) and $V$ ($n \times n$) have orthonormal columns, and $\Sigma=\begin{bmatrix} \sigma_0 & 0 & \ldots & 0\\ 0 & \sigma_1 & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & \sigma_{n-1} \end{bmatrix}$ is the $n \times n$ diagonal matrix with $\sigma_0 \geq \sigma_1 \geq \ldots \geq \sigma_{n-1} \geq 0$, the $\sigma_i$ being the singular values of the matrix $A$.
Now, $A^TA=V\Sigma U^TU\Sigma V^T=V\Sigma^2 V^T$, since $U, V$ have orthonormal columns, i.e., $U^TU=V^TV=I$.
Also, unitary matrices preserve norms, i.e., $\|Ux\|_2=\|x\|_2$ for a unitary matrix $U$.
Hence, by definition of the 2-norm,
$\begin{array}{lll} \|A^TA\|_2&=&\max_{\|x\|_2=1}\|A^TAx\|_2 \\ &=&\max_{\|x\|_2=1}\|V\Sigma^2V^Tx\|_2 \\ &=&\max_{\|x\|_2=1}\|\Sigma^2V^Tx\|_2 \text{ (since $V$ is unitary, hence norm-preserving)} \\ &=&\max_{\|y\|_2=1}\|\Sigma^2y\|_2 \text{ (substituting $y=V^Tx$, with $\|y\|_2=\|x\|_2$)} \\ &=&\max_{\|y\|_2=1}\sqrt{\sum_{i=0}^{n-1}\sigma_i^4y_i^2} \\ &\leq& \sqrt{\sum_{i=0}^{n-1}\sigma_0^4y_i^2} \text{ (since $\sigma_0 \geq \sigma_i$, $\forall{i}$)}\\ &=& \sigma_0^2\sqrt{\sum_{i=0}^{n-1}y_i^2} \\ &=& \sigma_0^2 \text{ (since $\|y\|_2=1$)} \end{array}$
Also, for the specific choice $y=e_0=\begin{bmatrix}1\\0\\\vdots\\0\end{bmatrix} \in \mathbb{R}^n$,
$\begin{array}{lll} \|A^TA\|_2&=&\max_{\|y\|_2=1}\|\Sigma^2y\|_2 \\ &\geq& \|\Sigma^2e_0\|_2 \\ &=& \sigma_0^2 \end{array}$
Hence, combining the two bounds above, $\|A^TA\|_2=\sigma_0^2$.
Also, by the same argument applied to $\|Ax\|_2=\|\Sigma V^Tx\|_2$, we have $\|A\|_2=\sigma_0$, the largest singular value.
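The conclusion can be verified numerically for a rectangular matrix as well, comparing the 2-norm against the singular values from NumPy's SVD:

```python
import numpy as np

rng = np.random.default_rng(2)
A = rng.standard_normal((5, 3))              # example rectangular matrix

sigma = np.linalg.svd(A, compute_uv=False)   # singular values, descending

print(np.linalg.norm(A, 2), sigma[0])             # ||A||_2     = sigma_0
print(np.linalg.norm(A.T @ A, 2), sigma[0] ** 2)  # ||A^T A||_2 = sigma_0^2
```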