## Norms and Condition Numbers

We have had a few occasions to talk about norms of vectors. I want to take a look at them, and to also look at norms of matrices, and then at condition numbers of matrices.

## Vector Norms

Let me jump right in. Here’s a 2-dimensional vector:

we get the Euclidean length of the vector – it’s the length of the hypotenuse of a 3-4-5 right triangle.

That’s just the length of a vector. Ho hum.

Well, it’s one possible definition of length… and we use the word norm for the general idea. In fact, we could have asked for

and gotten the same answer.

How we did get that answer? We took the square root of the sum of squares of the components of v.

One generalization is to take the pth root of the sum of pth powers… these are usually called p-norms, or lp norms.

(p can be any real number not less than 1. I’m working in a finite dimensional vector space, but the concept works fine for a countably infinite vector space – i.e. sequences and infinite series – and for uncountably infinite vector spaces – i.e. spaces of functions and integrals. In function spaces these are called Lp norms, with a capital L.)

Here, take p = 3. First I ask Mathematica nicely…

(I started with v… took the absolute values of its components… cubed each absolute value… added them up… took the cube root… asked for a decimal representation.)

p = 1 is a very common choice. (This is the l1, pronounced el-one, norm used in compressed sensing.)

OK… that’s the sum of the absolute values of the components… as it should be: take the 1-root of the sum of the first powers of (the absolute values of) the components.

Geometrically, that amounts to getting from one end of the vector to the other by traveling along the axes… and that’s why the 1-norm is often called the taxi-cab, city-block, or Manhattan norm.

There’s another very frequently used case: take the limit as p goes to infinity.

It picked out the largest single component (in absolute value).

Let’s see that numerically, for p = 1, 2, 3, 4, 9, 16, 25:

We see that as p gets larger, the norm gets closer to 4, which is the largest component (in absolute value).

So, out of all possible p-norms, I will single out the 1-, 2-, and $\infty$-norms of vectors as being of particular interest.

Having shown them to you, let me give you the definition.

The absolute value is positive definite, homogeneous, and satisfies the triangle inequality:

for real numbers a, x, y.

We say that a real-valued function $\parallel \cdot\parallel\$ on a vector space is a vector norm if it satisfies the three conditions

for vectors x and y, and real numbers a.

It can be shown that p-norms satisfy those conditions for p >= 1, and in the limit as p -> infinity.

## Matrix Norms

Next, let’s look at norms of matrices. If nothing else, we could treat an mxn matrix as a vector of length mn. We will see that we don’t want to, in general, although we will do it for the vector 2-norm.

Here’s a matrix – like our typical data matrices it is taller than wide:

Write it as a vector… and ask for the 1-, 2-, and $\infty$-norms of the vector a:

Now let’s ask Mathematica for the corresponding norms of the matrix A:

They’re completely different. What the heck is going on?

## The Submultiplicative Property

It turns out that we would like matrix norms to satisfy an additional condition, which I will call the submultiplicative property. Given two matrices A and B, we want the norm of the product to be bounded by the product of the norms:

$\parallel A B\parallel \le \parallel A\parallel \parallel B\parallel\$ .

Why?

Let me give you an example.

One of most important matrices you will ever see is the matrix exponential. (You can find “matrix exponential” in the Tag Cloud.) By analogy with the Taylor series for the exponential function of the real variable x

we can write the corresponding infinite series for any square matrix A – oops, let me use X, since A is a numerical matrix at the moment:

Because X is square, we may compute powers of it, just multiplying it by itself…

But how do know we end up with a matrix? Well, we start with a finite polynomial with terms up to X^n…

Finally, we take the limit as n goes to infinity of the RHS: it is precisely $Exp(\parallel X\parallel\$ ), i.e. the exponential of the real number $\parallel X\parallel \$ .

So we have shown that the infinite series of matrices for exp(X) converges if the norm of X is finite. This is a good thing.

Anyway, that's what the submultiplicative property gets us, the matrix exponential and other matrix functions defined by power series.

Let's consider another special case. Any vector v can be considered a column matrix, so the submultiplicative property takes the form

$\parallel A v\parallel \le \parallel A\parallel \parallel v\parallel \$

and that in turn says that

$\parallel A\parallel \ge \parallel A v\parallel / \parallel v\parallel \$.

(I will refer to that as “the inequality”.)

That is, the submultiplicative property implies that the norm of A is not less than $\parallel A v\parallel / \parallel v\parallel\$ , for all vectors v. To put that another way, if this inequality does not hold, then the matrix norm does not satisfy the submultiplicative property. We do not know, however, that if the inequality does hold, that the submulticative property holds.

To put it yet another way, the inequality is clearly necessary, but it is not at all clear that it is sufficient.

You might be a little bothered by all those $\parallel \cdot \parallel \$ symbols. Yes, $\parallel v \parallel\$ and the $\parallel A \parallel \$ will turn out to be different norms. And in principle, the two vector norms $\parallel A v \parallel \$ and $\parallel v \parallel \$ could have been different, too. It is not difficult to make the notation reflect that the norms are, or might be, different, but I'm not going to do that.

## The Operator Norm

The key is that given vector norms for $\parallel A v \parallel \$, i.e. for a vector in the range; and for $\parallel v\parallel\$ , i.e. for a vector in the domain, if we were to define the matrix norm $\parallel A \parallel \$ by

$\parallel A\parallel = \text{max} \parallel A v\parallel / \parallel v\parallel\$ , taken over all vectors v,

then we might have a shot at the submultiplicative property. Note that the RHS is the ratio of two vector norms.

It turns out, this succeeds. That definition is called the operator norm… and it satisfies the submultiplicative property.

Oh, we can also show that instead of taking the maximum over all vectors v, it suffices (by homogeneity) to take the maximum over all unit vectors. You will almost certainly see it written that way.

Why did I take the time to describe the submultiplicative property? Because back in my schooldays I was just handed the operator norm without any justification.

So, given our 1-, 2-, and $\infty$– vector norms, we are going to define 1-, 2-, and $\infty$– matrix norms as the corresponding operator norms.

Well, as I do often, I'm just going to hand you the answers. I think they're pretty slick.

The matrix 1-norm – defined as the maximum of $\parallel A v\parallel / \parallel v\parallel \$ using the vector 1-norms – turns out to be the maximum of the column sums of (the absolute values of) the entries. It is often called the column-sum norm.

The matrix $\infty$-norm turns out to be the maximum of the row sums of (the absolute values of) the entries, and it is often called the row-sum norm.

The matrix 2-norm turns out to be the largest singular value of the matrix. Yee ha! You just knew that a singular value would show up somewhere!

Let's confirm that. Here's A again… and its $\infty$-norm

Here are the three row sums of the absolute values…

… and the maximum row sum is the $\infty$-norm 9.

Here's the 1-norm of A…

Here are the two column sums of the absolute values:

… and the maximum row sum is the 1-norm 15.

Here's the 2-norm of A…

And there it is: the largest singular value is the 2-norm.

So that's why we got different answers for the norms of the matrix A and the norms of the corresponding vector a: Mathematica uses the operator norms for the matrices, and the vector norms for the vector.

Just to be clear: recall the vector a:

## The Frobenius Matrix-Norm

We're not quite done. There is another matrix norm that satisfies the submultiplicative property: the square root of the sum of squares. It's called the Frobenius norm, and Mathematica knows it. This should be the same whether we use the matrix A or the vector a, but I have to specify different names: the matrix A and “Frobenius”, or the vector a and “2”.

Let me be clear. We get the Frobenius matrix-norm of an m by n matrix by treating the matrix as a vector of length mn, and computing the vector 2-norm. We get the matrix 2-norm of a matrix as the operator norm derived from the vector 2-norm.

It can be confusing. The Frobenius norm and the matrix 2-norm are different, but they both derive from the vector 2-norm – just in different ways. On the other hand, the Frobenius norm of a matrix is the 2-norm of the vector created by flattening the matrix.

There is another characterization of the Frobenius norm. In the Singular Value Decomposition

A = u w v',

the matrices u and v are orthogonal, i.e. they are rotations. It should not be surprising that $\parallel A\parallel = \parallel w\parallel\$ for the Frobenius norm — but the Frobenius norm of w is the square root of the sum of squares of its entries… and its entries are precisely the singular values of A.

So the Frobenius norm of A is the square root of the sum of squares of its singular values. The matrix 2-norm is just one of the singular values; the Frobenius norm involves all of them.

Now, I don’t know whether any vector p-norm other than 2 satisfies the submultiplicative property. It’s true for p = 2, and false for $p = \infty\$, and probably false for all p greater than 2.

Here's another example, a square matrix so I can multiply it by itself.

What we really want are the norms of the corresponding vectors:

What I wonder is: $\parallel b^2\parallel < \parallel b\parallel^2\$?

Yes, for the vector 1-norm.

No, for the vector $\infty$-norm.

Equality, for the vector 2-norm. That's OK.

Perhaps it's true for 1 <= p <= 2. I don't know. But if it is, then why don't we use the original 1-norm for matrices, i.e. just add up all (the absolute values of) the entries?

## Summary of Vector and Matrix Norms

Let's stay with what we have:

• the 1-, 2-, and $\infty$-norms for vectors
• the 1-, 2-, and $\infty$-norms for matrices, which turn out to be the
• column-sum norm
• row-sum norm
• maximum singular value
• and the Frobenius norm for matrices, which is equivalently
• the square root of the sum of squares
• the square root of the sum of squares of the singular values

## Condition Number of a Matrix

Now let's talk about the condition number of an invertible square matrix. And I am going to use only the 2-norm, but the definition of condition number applies to any matrix norm.

Given the form of A…

… of rank no more than 2, we know that the 2×2 matrix A'A might be invertible, while the 3×3 matrix AA' cannot be. (A’ is the transpose of A.)

Here's A'A… and its 2-norm… which is also its largest singular value:

By the way, I keep using singular values, even though A'A is square. The singular values of A’A are the eigenvalues of A’A:

What about the inverse of A'A? Here is the inverse matrix… and its norm… and its singular values.

Ho hum.

Question. Is there any relationship between the singular values of A'A and the singular values of its inverse?

Hell yes. They are inverses. In different orders, of course, but they are inverses. Here are the singular values of A’A… their inverses… and the singular values of $(A^T A)^{-1}\$.

For studying the invertibility of matrices, we end up looking at the product of the norms. I'll talk about why after we've done some computations.

The condition number, $\chi(X)\$, of an invertible matrix X is the product of the norms of X and its inverse:

$\chi(X) = \parallel X\parallel \parallel X^{-1}\parallel$

and if we use the 2-norm, then the condition number of a matrix is the largest singular value of X times the largest singular value of $X^{-1}\$… but that's the largest singular value of X times the inverse of the smallest singular value of X… and that is just the largest singular value of X divided by the smallest singular value of X.

To find the condition number of a matrix, using the matrix 2-norm, find the list of singular values, and take the ratio of largest to smallest.

That’s fine if we have an invertible matrix X. What if X has the special form X = A’A?

Question. Is there any relationship between the singular values of A'A and the singular values of A?

Hell yes! The singular values (eigenvalues) of A'A are the squares of the singular values of A. Here are the singular values of A… here are their squares… and here are the singular values of A'A…

This means that given our matrix A and its singular values, we could compute the condition number of A'A by squaring the ratio of the largest to smallest singular values of A:

It also means that we could define the condition number of any matrix as the ratio of largest to smallest singular values, even for a non-square matrix – which cannot possibly have an inverse at all, never mind a norm for its inverse.

As it happens, I've been using the ratio of singular values, rather than the squared ratio, when I've been looking at multicollinearity in subsets of data. I’ve really been looking at a condition number of the design matrix X rather than of X’X.

## The Hilbert Matrix

Let's talk about why we look at the condition number. Here's what I think is the most famous ill-conditioned matrix in all of mathematics. It's called the Hilbert matrix, and Mathematica knows about it. Here’s a Wikipedia link.

The (i,j) entry is $\frac{1}{i+j-1}\$. Here’s the 5×5 Hilbert matrix:

Let's get the inverse of h, and then see if $h^{-1}\ h = 1\$:

The entries in the inverse matrix are quite large, quite a bit larger than the entries in h. But the computations are perfect – because Mathematica worked with rational numbers instead of machine precision.

Let's compute the condition number for that matrix, using the 2-norm, so the condition number is the ratio of largest to smallest singular values.

Here's the singular value list… the ratio of smallest to largest… and the base-10 logarithm of the condition number:

The condition number is up in the millions. More importantly, its base-10 logarithm is about 6.

Let's what happens if instead of having exact rational numbers, we have numbers to eight places.

h now looks like this:

If multiply h by its inverse, we get

I’m not sure exactly what they mean, for example, by 0. x 10^-2, but I assume they’re saying the number looks like 0 so long as we round off to .01 . Or something like that.

Now drop to seven digits:

and if we ask for its inverse, Mathematica complains:

Six digits will be the same, qualitatively. But drop to 5 digits, and we're looking at horror.

Here's the product of the alleged inverse times h…

The (4,4) entry of the product is 0 — instead of 1.

Yikes.

And that's what the condition number is telling us. We're going to lose precision when we compute the inverse of the matrix h – more than 5 digits in this case – and if we only had 5 digits to start with, we will lose all of them.

As a rule of thumb, then, the log base 10 of the condition number of a matrix tells us how many digits we might lose when we compute the inverse. It's not guaranteed to be that bad, but it could be.

## Summary

So. Whenever I compute the condition number of a data matrix X by taking the ratio of its largest to smallest singular values, I would want to square that number to assess how inaccurate the inverse of X'X might be. Now, since Mathematica appears to be working in double precision (16 places), we're probably not in much danger even if the condition number is in the millions.

In fact, that's what we've seen. The condition number for design matrix X for the raw hald data was… so the condition number for X'X was… with logarithm…

i.e. a condition number in the millions, and a log of more than 7. It’s worse than the Hilbert matrix. But we had no trouble inverting the matrix. We didn’t work with 7 digits only.

(Strictly speaking, we had no trouble doing the regression. I have, in addition, checked that if I explicitly compute X'X and ask for its inverse, Mathematica will not complain.)

So.

We generalized from the Euclidean length of a vector to the p-norms of a vector. We focused on the 1-, 2-, and $\infty$-norms of a vector.

We defined the operator norm of a matrix based on any norm of a vector, in order to guarantee that the matrix norm would satisfy the submultiplicative property. We speak of the operator p-norm derived from a vector p-norm, so we have 1-, 2-, and $\infty$-norms of a matrix.

I have declared without proof that the matrix 1-norm is the column-sum norm, the matrix $\infty$-norm is the row-sum norm, and the matrix 2-norm is the largest singular value of the matrix.

I have also introduced the Frobenius norm of a matrix, the square root of the sum of squares of the entries of a matrix, because it satisfies the submultiplicative property, although as far as I know, it is not an operator norm.

Then we defined the condition number of an invertible matrix as the product of the norm of the matrix and the norm of its inverse. We saw that, for the 2-norm, the condition number was the ratio of largest to smallest singular values.

That let us define the condition number of any matrix as the ratio of largest to smallest singular values. Not only do we have a condition number for a square matrix of the form X'X, but also for X itself.

Finally, we looked at the Hilbert matrix, to see what the condition number means in practice.

I will close by saying that the norm is one of three marvelous generalizations, the other two being an inner product and a metric.

For finite-dimensional vectors, we have the dot product (the sum of the products of corresponding components); we get the norm of a vector as the square root of its dot product with itself; we get the distance between two points as the norm of the vector between the points.

An inner product generalizes the dot product. Some norms can be derived from an inner product – but of the p-norms, only the 2-norm comes from an inner product. Both of these concepts require a vector space.

A metric generalizes the definition of distance between two points. It can be defined on arbitrary point sets, and gives us what is called a metric space. It does not require a vector space, but it can be defined on one – and we do that almost all the time once we have a norm.

Metric spaces are a wonderful stepping stone from the real line R to n-dimensional space and then to topological spaces.

If you hand me a topological space, I will ask if the topology can be derived from a metric. If you hand me a vector space with a metric on it, I will ask if the metric can be derived from a norm. And if you hand me a norm, I will ask if it can be derived from an inner product.

Enough. More than enough, if this is new to you.