I’ve just finished watching three hours of lectures on something called compressed sensing. I had fully expected to put out a post showing how to compute time from position on an elliptical orbit. But I really want to show you a key piece of mathematics. This was utterly new to me.

I want to show you, via a small example, that minimizing the L1 norm often returns sparse solutions.

Hey, what?

Imagine you have a one megapixel photograph. More likely, the raw file created by your digital camera is 8 to 15 megapixels… but if you save it as JPEG, you may compress it by a factor of – what? – 20? (I’m too busy to work it out.)

The key is that there is some representation in which the photograph has a sparse representation… the data can be represented with good accuracy by fewer numbers than we started with. We see something similar when we represent a sound recording by its Fourier coefficients, or computer tomography by a Radon transform.

Let me get to the simple example.

This will be a simulation. I am going to solve the linear system

Ax = b

for a very special x. In fact, I’m going to start with A and x, compute b, and then show you how to recover x.

What’s the big deal? We know how to solve systems of linear equations – well, if the matrix A is square and invertible.

We also know how to “solve” systems of linear equations if the matrix A has more rows than columns – that’s what we’re doing with ordinary least squares. Yes, we change our notation, and write

.

There is no exact solution, but we get as close as possible, by minimizing e.

Now, what if the matrix A has more columns than rows? The linear system looks like this:

We have more unknowns than equations; there are infinite number of solutions.

But what I’m interested in is the specific case when the true solution x is **sparse**, by which I mean that it has many components which are zero.

Let me offer another example. Suppose we are trying to find out what genetics contributes to some hereditary ailment… for some large number of genes, we need to know which alleles are present… we may be investigating 1 million alleles, we may have only 10,000 different patients, and we expect that fewer than 100 of the alleles are significant.

In that case, we have a data matrix with 1 million columns, 10,000 rows, and we expect to find an x vector of length 10,000 but with fewer than 100 non-zero components.

Not a problem. We can do it. I’m impressed.

I want an example small enough to display. The lecture used 500 columns, 250 rows, and about 75 nonzero components in x. I will construct the matrix A randomly, as the lecturer did, but with 5 rows and 10 columns. Then I will construct x with 3 nonzero components, and I will compute b.

Then I will show you how to recover x from A and b.

Okay, we need some tools. SeedRandom, RandomReal, and RandomInteger. Well, let me use them first and then discuss them.

SeedRandom[1234]

This initializes the pseudo-random number generator to the same starting point every time… my answers will not change as I rerun this notebook.

Each of the entries in A has come from a normal distribution with mean zero and variance 1.

Let me display its transpose – because it fits on the page!

I emphasize that that is the transpose, with 5 columns: A has 10 columns and 5 rows. Next, let me specify which three components of x are nonzero. I ask for three random integers between 1 and 10.

Let me denote this constructed x by the symbol… and let me initialize all 10 of its values to zero:

=ConstantArray[0,10]

= {0,0,0,0,0,0,0,0,0,0}

Then, let me set components 3, 5, 9 to random numbers – normally distributed with mean zero and variance 1:

This means that is…

{0, 0, 0.319184, 0, 1.65857, 0, 0, 0, -1.0439, 0}

Finally, compute b:

= {1.40704, -1.61999, -0.686037, 0.808871, -3.32831}.

We have just guaranteed that for our A and b.

Given only A and b, could we find that x, i.e. ?

How? We now have 5 equations in 10 unknowns. We should expect an infinite number of solutions.

Let x be unknown:

Clear[x,X]

X=Array[x,10]

= {x[1], x[2], x[3], x[4], x[5], x[6], x[7], x[8], x[9], x[10]}

If I ask Mathematica® to solve the equations

A.x = b

I get solutions for the first five components in terms of the last five components; I do have an infinity of solutions:

It would be particularly easy to set the last 5 components to zero, to get one specific solution from that general one:

a0 = {0.420035, 0.905288, 1.22993, 0.410167, 2.09934, 0, 0, 0, 0, 0}

I should point out that we could have asked Mathematica to find any random solution:

Interesting: he set the first 5 components to zero instead of the last 5. Let me save this one, too.

a3=X/.s3

= {{0, 0, 0, 0, 0, 3.06254, -0.123846, 0.43392, -3.65655, -0.907229}}

Is there some way to pick out sparse solutions?

Well, we could look for a least squares solution. We can’t very well minimize Ax – b… it’s always zero. I suppose we could look for the smallest x. But smallest in what sense?

Let’s find a solution with minimum 2-norm:

That command said to minimize the** 2-norm of x** (i.e. Norm[X,2]), which is the square root of the sum of squares of the components of x, subject to the constraint that Ax = b, i.e. x is a solution. The answer not only gave us rules for the x[i], but the value of the 2-norm (1.40578).

We could write (and I saved a2 for this)

a2=X/.s2[[2]]

= {0.272937, 0.223415, 0.520461, -0.281362, 0.540451, -0.501829, 0.149906, 0.387262, -0.875282, -0.138842}

Now, the **1-norm of a vector** is the sum of the absolute values of its components.

Let’s try minimizing the 1-norm of X subject to the constraint that X is a solution of Ax = b.

OK, we see many very small numbers; set them to zero…

That’s what I wanted to show you: **the L1 norm finds sparse solutions**. It found the special x we started with.

We might as well look at the 2-norms of all four solutions (NOTE that they start with a0):

The smallest is for a1, as it should be.

As I said, that’s the math I wanted to show you: we were able to pick out what’s called a 3-sparse solution (because it had only 3 nonzero components) out of the infinity of possible solutions.

But it was the 1-norm not the 2-norm, not the least=squares solution, that worked.

Now for a lot of remarks and resources.

Finding a solution with exactly 3 nonzero components clearly depended upon there being such a solution. True, we didn’t need to know that there were three rather than four or five or whatever; but we seem to have found the sparsest solution. We will want to know how to deal with noise, and with components that are small rather than zero.

In addition, it was apparently important that the matrix A was white noise; more generally, it needs to be “incoherent” with respect to x. (That is, generating A from a normal distribution was more than convenient.)

It also turns out that minimizing the 1-norm is a surrogate for minimizing what is called the 0-norm. It is defined as the number of nonzero components. I could protest that it isn’t really the 0-norm (see this Wikipedia article) – partly because it isn’t a norm, mathematically; and partly because there is already a 0-norm – but the fact remains the literature for this subject uses something called the 0-norm, and I don’t think the name will change.

Anyway, the point of this example was that minimizing the 1-norm apparently found that x which had the fewest number of nonzero components; that is, it apparently minimized the 0-norm as well as the 1-norm.

Oh, other people and I sometimes write L1 for the 1-norm of a vector, but it should be used for the 1-norm of a function; we would normally use a script lower-case l instead. Not a big deal. You will also see “Ell-1″.

The lectures I watched were by Emmanuel Candes of Caltech. The first one is here; there are links under this movie for the next two. You might note and remember that the very beginning of lecture three sketches the geometry, so you might want to look at that early on, rather than take it as it comes.

Resources for compressed sensing – that’s what the whole subject is called in signal processing –are at Rice University. Let me mention that they list print tutorials at the front, then in a much later section entitled “talks”, there are also “online talks”.

There are a whole lot of fascinating applications and potential applications for this little idea, that the 1-norm finds sparse solutions. But I think I’ll let you chase them down.

Oh, okay. In addition to the ones you can learn about from Candes, in signal processing and statistics, I am particularly interested in dictionaries of basis functions and non-parametric estimation. And variable selection, too!

I’d like to close with one story from Candes’ lectures. He was working long-distance with a professor at the University of Wisconsin on either computer tomography or MRI. The images must have been simulations… because every time Candes sent back his solution, it was absolutely correct. (So it couldn’t have been a matter of a better approximation.)

Finally, the recipient called him up and said he wasn’t a scientist… he had to be cheating… and he wanted the phone number for Candes’ superior.

Candes sent him the code (so he could try it for himself, and see that it got the right answers without cheating?). Anyway, that sufficed.

I think this use of the L1 norm is marvelous.

My thanks to the reader who aimed me at Candes’ lectures and provided the Rice University link.

March 29, 2011 at 2:39 am

Rip,

You and your readers might also be interested in the following page:

Compressed Sensing: The Big Picture

http://sites.google.com/site/igorcarron2/cs

Cheers,

Igor.

March 31, 2011 at 7:42 am

Thanks, Igor… that’s a pretty rich page, and I’ve already found a few things that interest me.

In particular, however, I want to recommend Terry Tao’s slides which mention the 12-coin problem:

https://docs.google.com/viewer?url=http%3A%2F%2Fterrytao.files.wordpress.com%2F2009%2F08%2Fcompressed-sensing1.pdf

April 1, 2011 at 6:09 am

Wow, that is really neat! I’ve never seen that before either!

April 1, 2011 at 1:50 pm

This is really cool! I hope to learning more stuff about compressed sensing.

Thank you, Rip!

April 4, 2011 at 5:30 pm

Hi James,

I just realized it was you who asked for stepwise code… it’s been a while, but maybe the April 4 post in OLS will help.

rip

April 11, 2011 at 10:40 pm

Who can introduce details of the L1 norm? Can be an article or report.

April 12, 2011 at 3:39 pm

hi Ahn,

The L1 norm of a vector is the sum of the absolute values of its components.

April 14, 2011 at 9:12 am

Thank alot, rip!

April 14, 2011 at 9:14 am

I’m finding out about magic L1, you can share about it?

April 16, 2011 at 7:22 am

Hi Anh,

I’m not sure what you mean. First, Magic L1 is the name that Candes and Romberg use for their MATLAB code for solving convex linear programming problems — i.e. for minimizing the L1 norm.

Second, you can find a few other entries by searching on “L1 magic”, but “compressed sensing” seems to be a more common term nowadays. I’ll admit I think of it as “L1 magic”, because it’s getting an answer via the L1 norm to a problem that “should have” required a sort of L0 norm: minimum number of components. Strictly speaking, compressed sensing pushes the idea one step further: if you’re not going to use all the data, then don’t collect it all in the first place. That is, L1 magic means — to me — the finding of a sparse solution rather than the collection of less data.

Third, I would indeed love to talk more about finding sparse solutions. But I don’t understand much more than I have put in this post. I need time to learn.

Finally, I’ll remind you that there are some links in the post, and there is a link in one of these comments.

Good luck.

Rip

April 29, 2011 at 10:48 am

Thanks, RIP!

I feel your posts is very useful for what I needed.

” convex linear programming — i.e. for minimizing the L1 norm”

You can introduction to L1 Norm?

General and detailed!

May 2, 2011 at 4:18 pm

Hi Anh,

You clearly want more information, but I have no idea what I can give you beyond all the references in my post and these comments.

Keep looking for what you need. Maybe someday I’ll happen to provide something you can use… but I don’t know what that will be, or when.

May 4, 2011 at 6:50 am

Thanks! very much :)

August 24, 2012 at 4:57 am

awesome post at an understandable level–thanks for writing! just one thing though–while candes had a brief stint at caltech, he was at stanford before that and returned a few years ago :)

August 24, 2012 at 7:41 am

Hi Sanjay,

Thanks for the info – since Stanford is right close by, I can keep my eyes open for lectures by him!

rip

January 23, 2013 at 10:07 pm

Brilliant for beginners ! Thanks

March 19, 2013 at 1:59 pm

Hi, I am trying the same method you mentioned above with A=20 x 5 but my results are not so good as the one you gave. Am I doing something wrong? (I am kinda new in these stuff sorry)

March 23, 2013 at 8:38 am

So what’s the first thing that disagrees with my post?