OK, let's see if this helps you. Suppose you have two functions $f,g:[a,b]\to \mathbb{R}$. If someone asks you what the distance between $f(x)$ and $g(x)$ is, that is easy: you would say $|f(x)-g(x)|$. But if I ask what the distance between $f$ and $g$ is, the question sounds kind of absurd at first. What I can ask is: what is the distance between $f$ and $g$ on average? That is $$ \dfrac{1}{b-a}\int_a^b |f(x)-g(x)|\,dx=\dfrac{||f-g||_1}{b-a}, $$ which (up to the constant factor $1/(b-a)$) gives the $L^1$-norm. But this is just one of many ways you can do the averaging. Another way is based on the integral $$ ||f-g||_{p}:=\left[\int_a^b|f(x)-g(x)|^p \,dx\right]^{1/p}, $$ which is the $L^p$-norm in general.
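As a quick numerical sketch of that definition (the helper name `lp_distance` and the midpoint-rule discretization are my own illustration choices, not anything from the answer above):

```python
import math

def lp_distance(f, g, a, b, p, n=100_000):
    """Approximate ||f - g||_p on [a, b] with a midpoint Riemann sum."""
    h = (b - a) / n
    total = sum(abs(f(a + (i + 0.5) * h) - g(a + (i + 0.5) * h)) ** p
                for i in range(n))
    return (total * h) ** (1 / p)

# Average (p = 1) distance between sin and cos on [0, pi],
# i.e. ||sin - cos||_1 divided by the interval length:
avg_dist = lp_distance(math.sin, math.cos, 0, math.pi, p=1) / math.pi
```

The same function gives any other $L^p$ distance by changing `p`.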
Let us investigate the norm of $f(x)=x$ on $[0,1]$ for different $L^p$ norms. I suggest you draw the graphs of $x^{p}$ for a few values of $p$ to see how a higher $p$ makes $x^{p}$ flatter near the origin, so that the integral favors the vicinity of $x=1$ more and more as $p$ becomes bigger: $$ ||x||_p=\left[\int_0^1 x^{p}\,dx\right]^{1/p}=\frac{1}{(p+1)^{1/p}}. $$ The $L^p$ norm is smaller than the $L^m$ norm when $m>p$, because the points where the function is small are downplayed more strongly by the exponent $m$ than by $p$. So depending on what you want to capture in your averaging, and how you want to define "the distance" between functions, you utilize different $L^p$ norms.
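A small check of this closed form (the helper name `x_norm` is made up for the illustration):

```python
def x_norm(p):
    """Closed form of ||x||_p on [0, 1]: 1 / (p + 1)^(1/p)."""
    return (p + 1) ** (-1.0 / p)

# The norm increases with p: the higher exponent favors the values near x = 1.
norms = [x_norm(p) for p in (1, 2, 5, 10)]
```

For instance $\|x\|_1 = 1/2$ and $\|x\|_2 = 1/\sqrt{3}$, and the sequence keeps climbing.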
This also motivates why the $L^\infty$ norm is nothing but the essential supremum of $f$: as you let $p\to \infty$, everything other than the highest values of $|f(x)|$ gets filtered out.
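The same closed form illustrates this limit numerically; this is only a sketch of the claim, not a proof:

```python
# ||x||_p = (p + 1)^(-1/p) on [0, 1] climbs toward 1, the supremum of x,
# as p grows: the averaging filters out everything but the highest values.
vals = [(p + 1) ** (-1.0 / p) for p in (1, 10, 100, 1000, 100_000)]
```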
There are several good answers here, one accepted. Nevertheless I'm surprised not to see the $L^2$ norm described as the infinite dimensional analogue of Euclidean distance.
In the plane, the length of the vector $(x,y)$, that is, the distance between $(x,y)$ and the origin, is $\sqrt{x^2 + y^2}$. In $n$-space it is the square root of the sum of the squares of the components.
Now think of a function as a vector with infinitely many components (its value at each point in the domain) and replace summation by integration to get the $L^2$ norm of a function.
Finally, tack this on to the end of the last sentence of @levap's answer:
... the $L^2$ norm has the advantage that it comes from an inner product and so all the techniques from inner product spaces (orthogonal projections, etc) can be applied when we use the $L^2$ norm.
The first point is proven as follows.
From the SVD $A = UDV^T$ we can see that the eigenvalues of $A^TA = VD^2V^T$ are just the squared singular values of $A$ (the squared diagonal entries of $D$). At the same time the columns of $V$ are the eigenvectors of $A^TA$. So, exploiting the orthonormality of the eigenvectors, take an eigenvector $x$ of $A^TA$ with eigenvalue $\lambda$, so that $V^Tx = \|x\|_2\, e_{\lambda}$ with $e_{\lambda}$ the standard basis vector picking out that eigenvalue; then
$$\|Ax\|_2^2 = \|UDV^Tx\|_2^2 = \|D(V^Tx)\|_2^2 = \|De_{\lambda}\|_2^2\,\|x\|_2^2 = \lambda\|x\|_2^2.$$
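A quick numerical sanity check of this identity, assuming NumPy is available (the random matrix and seed are arbitrary choices):

```python
import numpy as np

# For an eigenvector v of A^T A with eigenvalue lam: ||A v||^2 = lam ||v||^2.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 3))

lam, V = np.linalg.eigh(A.T @ A)     # columns of V: orthonormal eigenvectors
ratios = [np.linalg.norm(A @ v) ** 2 for v in V.T]   # each ||v||_2 = 1 here
```

Each entry of `ratios` should match the corresponding eigenvalue in `lam`.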
The proof is based on the following property of the matrix 2-norm:
$$\|A\|_{2}=\sqrt{\lambda_{\max}(A^{T}A)},\qquad \|A^{T}A\|_{2}=\sqrt{\lambda_{\max}(A^{T}AA^{T}A)}.$$
The same reasoning as in the first point shows that the eigenvalues of $A^{T}AA^{T}A=(A^{T}A)^2$ are just the eigenvalues of $A^{T}A$ squared. So $$\|A\|_{2}=\sqrt{\lambda_{\max}(A^{T}A)}$$ and $$\|A^{T}A\|_{2}=\sqrt{\lambda_{\max}(A^{T}AA^{T}A)}=\sqrt{\lambda_{\max}^2(A^{T}A)}=\left|\lambda_{\max}(A^{T}A)\right|=\lambda_{\max}(A^TA)=\max(\lambda_{\min}^2(A),\lambda_{\max}^2(A)) \ge \|A\|_{2}^2.$$
Edit
The last step is only an inequality because $\lambda_{\max}(A^TA)=\max(\lambda_{\min}^2(A),\lambda_{\max}^2(A))$ and the eigenvalues of $A$ could be negative.
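The relation $\|A^TA\|_2=\|A\|_2^2$ can be spot-checked numerically, assuming NumPy (the random matrix is arbitrary):

```python
import numpy as np

# Spot-check ||A^T A||_2 = ||A||_2^2 on a random square matrix.
rng = np.random.default_rng(1)
A = rng.standard_normal((5, 5))
lhs = np.linalg.norm(A.T @ A, 2)   # spectral norm of A^T A
rhs = np.linalg.norm(A, 2) ** 2    # squared spectral norm of A
```

Here `np.linalg.norm(..., 2)` on a matrix returns the largest singular value, i.e. the spectral norm.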
For problem $(2)$ we have the following:
By definition of the 2-norm of a matrix, $\|A\|_2=\max_{\|x\|_2=1}\|Ax\|_2$, where $x \in \mathbb{R}^n$, $A$ is an $m \times n$ matrix, and $Ax \in \mathbb{R}^m$.
Also, by the (reduced) SVD, $A=U\Sigma V^T$, where $U$ is $m \times n$ with orthonormal columns, $V$ is an $n \times n$ unitary (orthogonal) matrix, and $$\Sigma=\begin{bmatrix} \sigma_0 & 0 & \ldots & 0\\ 0 & \sigma_1 & \ldots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \ldots & \sigma_{n-1} \end{bmatrix}$$ is an $n \times n$ diagonal matrix with $\sigma_0 \geq \sigma_1 \geq \ldots \geq \sigma_{n-1} > 0$, the $\sigma_i$ being the singular values of the matrix $A$.
Now, $A^TA=V\Sigma U^TU\Sigma V^T=V\Sigma^2 V^T$, since $U^TU=V^TV=I$ ($U$ has orthonormal columns and $V$ is unitary).
Also, unitary matrices preserve norms, i.e., $\|Ux\|_2=\|x\|_2$ for a unitary matrix $U$.
Hence, by definition of the 2-norm,
$\begin{array}{rcl} \|A^TA\|_2&=&\max_{\|x\|_2=1}\|A^TAx\|_2 \\ &=&\max_{\|x\|_2=1}\|V\Sigma^2V^Tx\|_2 \\ &=&\max_{\|x\|_2=1}\|\Sigma^2V^Tx\|_2 \quad\text{(since $V$ is unitary, hence norm-preserving)} \\ &=&\max_{\|x\|_2=1}\|\Sigma^2x\|_2 \quad\text{(since $V^T$ is unitary, $V^Tx$ ranges over the unit sphere as $x$ does)} \\ &=&\max_{\|x\|_2=1}\left(\sum_{i=0}^{n-1}\sigma_i^4x_i^2\right)^{1/2} \\ &\leq& \left(\sum_{i=0}^{n-1}\sigma_0^4x_i^2\right)^{1/2} \quad\text{(since $\sigma_0 \geq \sigma_i$, $\forall{i}$)}\\ &=& \sigma_0^2\left(\sum_{i=0}^{n-1}x_i^2\right)^{1/2} \\ &=& \sigma_0^2 \quad\text{(since $\|x\|_2=1$)} \end{array}$
Also, for the specific choice $x=e_0=\begin{bmatrix}1\\0\\\vdots\\0\end{bmatrix} \in \mathbb{R}^n$,
$\begin{array}{rcl} \|A^TA\|_2&=&\max_{\|x\|_2=1}\|A^TAx\|_2 \\ &=&\max_{\|x\|_2=1}\|\Sigma^2x\|_2 \\ &\geq& \|\Sigma^2e_0\|_2 \\ &=& \sigma_0^2 \end{array}$
Hence, combining the two bounds above, $\|A^TA\|_2=\sigma_0^2$.
Also, from here we have $\|A\|_2=\sigma_0$, the largest singular value.
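Both conclusions can be spot-checked with NumPy's SVD (the random test matrix and seed are my own choices):

```python
import numpy as np

# Spot-check ||A||_2 = sigma_0 and ||A^T A||_2 = sigma_0^2 on a random matrix.
rng = np.random.default_rng(2)
A = rng.standard_normal((6, 4))
sigma = np.linalg.svd(A, compute_uv=False)   # singular values, descending
spec_A = np.linalg.norm(A, 2)                # spectral norm of A
spec_AtA = np.linalg.norm(A.T @ A, 2)        # spectral norm of A^T A
```

`spec_A` should equal the largest singular value `sigma[0]`, and `spec_AtA` its square.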