Keep digging.

In my last post, we saw, or rather took it on faith, that Stochastic Gradient Descent (SGD) is much faster in getting to a reasonable solution compared to Gradient Descent (GD). Instead of investigating the theoretical underpinnings, I took the easier way out and just directed the reader to the excellent discussion in this paper. Furthermore, that post was more focused on the experiments I ran with CRF++, and a detailed theoretical discussion would have been a little out of place there. This is an attempt to rectify that, to go through the derivations myself, to get a better understanding of the concepts in the process.

To recap, the supervised learning problem we are trying to solve is this: We are given a set of $n$ i.i.d. data-points $(x_1, y_1), \ldots, (x_n, y_n)$ drawn from an unknown distribution $P(x, y)$, where $x$ is a vector and $y$ is a scalar, and we are trying to learn a function $f$ such that we can correctly predict $y$ given a new instance $x$. How do we determine whether we got a good $f$ or not? One way to gauge that is to assign some penalty for wrong predictions: If we have a cost function $\ell(\hat{y}, y)$ which is a measure of how much penalty we pay for mis-predicting $y$ as $\hat{y} = f(x)$, then one criterion would be to find an $f$ that leads to the least expected value of that penalty. Formally, we want to find an $f$ such that the following quantity, the expected risk, is minimized:

$$ E(f) = \mathbb{E}\big[\ell(f(x), y)\big] = \int \ell(f(x), y)\, dP(x, y) $$

Well, that sounds topping, except that we are bogged down by two limitations:

  1. We can’t possibly investigate all the functions in the world to come up with the one that minimizes this quantity.
  2. We don’t know the distribution $P(x, y)$; all we have are the $n$ samples drawn from it.

So we do what we can - we approximate. We create a pool of candidate functions/function-families to investigate (call that set $\mathcal{F}$), and we approximate the expectation with the sample average $E_n(f) = \frac{1}{n}\sum_{i=1}^{n}\ell(f(x_i), y_i)$, called the empirical risk, so that our objective takes the following form:

$$ f_n = \arg\min_{f \in \mathcal{F}} E_n(f) = \arg\min_{f \in \mathcal{F}} \frac{1}{n}\sum_{i=1}^{n}\ell(f(x_i), y_i) $$

One of the most common ways to generate the family of functions is to take up a family of the form $\mathcal{F} = \{f_w : w \in \mathbb{R}^d\}$, in which each function is parametrized by some weight vector $w$, e.g. the family of linear functions $f_w(x) = w^\top x$. Assuming our function belongs to such a family, the problem then reduces to just finding the optimum $w$ that minimizes $E_n(f_w)$, i.e.

$$ w^* = \arg\min_{w} E_n(f_w) = \arg\min_{w} \frac{1}{n}\sum_{i=1}^{n}\ell(f_w(x_i), y_i) $$
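
To make these definitions concrete, here is a minimal sketch in Python; the squared-error loss, the linear family and the toy data are assumptions of mine, purely for illustration:

```python
import numpy as np

def loss(y_hat, y):
    # Squared-error cost: the penalty we pay for predicting y_hat when the truth is y.
    return 0.5 * (y_hat - y) ** 2

def f(w, x):
    # One member of the parametrized family: a linear predictor f_w(x) = w . x
    return x @ w

def empirical_risk(w, X, Y):
    # E_n(f_w): the average loss over the n samples we actually have, our computable
    # stand-in for the (unknowable) expected risk E(f_w).
    return np.mean([loss(f(w, x), y) for x, y in zip(X, Y)])

# Toy data standing in for n i.i.d. draws from the unknown distribution P(x, y).
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)

print(empirical_risk(np.zeros(3), X, Y))                  # a poor w pays a large average penalty
print(empirical_risk(np.array([1.0, -2.0, 0.5]), X, Y))   # a good w pays a small one
```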

As we have already seen, gradient descent (GD) starts with an initial estimate for $w$, iterates over the complete data and updates the current estimate on the basis of the gradient of the loss function:

$$ w_{t+1} = w_t - \eta\,\frac{1}{n}\sum_{i=1}^{n}\nabla_w \ell(f_{w_t}(x_i), y_i) $$

where $\eta$ is the learning rate. Stochastic gradient descent (SGD), on the other hand, updates the current estimate based on a single randomly picked data point $(x_t, y_t)$ from the dataset on each iteration (instead of going over the whole dataset before making an update like GD):

$$ w_{t+1} = w_t - \eta_t\,\nabla_w \ell(f_{w_t}(x_t), y_t) $$
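
Here is roughly what the two update rules look like in code, under the same assumptions as above (squared loss, linear $f_w$; the constant learning rates and iteration counts are arbitrary illustrative choices):

```python
import numpy as np

def gradient(w, x, y):
    # Gradient of the squared loss 0.5 * (w.x - y)^2 w.r.t. w, for a single sample (x, y).
    return (x @ w - y) * x

def gd(X, Y, eta=0.1, iters=100):
    # Gradient descent: every update touches the *full* dataset.
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        full_grad = sum(gradient(w, x, y) for x, y in zip(X, Y)) / len(X)
        w -= eta * full_grad
    return w

def sgd(X, Y, eta=0.01, iters=100, seed=1):
    # Stochastic gradient descent: every update touches one randomly chosen sample.
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        i = rng.integers(len(X))
        w -= eta * gradient(w, X[i], Y[i])
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))
Y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=1000)
print(gd(X, Y), sgd(X, Y))   # both head towards the same minimizer; each SGD step is n times cheaper
```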

So, intuitively, GD calculates the proper gradient (with all the data available to it) and descends in that direction, whereas SGD only approximates the original gradient based on a single data point. So, it stands to reason that GD should lead to a faster convergence, making the best possible movement towards the optimum on each iteration, whereas SGD will be making noisy, sub-optimal movements, hiccupping around, resulting in a gnarly progress. And theory does support this conjecture: Under some regularity assumptions, when the initial estimate is close enough to $w^*$, and the learning rate $\eta$ is reasonable, GD takes $O\!\left(\log\frac{1}{\rho}\right)$ iterations to get to a $w$ that’s in a $\rho$-neighborhood of $w^*$ (the $w$ corresponding to $f_n$), i.e. $E_n(f_w) - E_n(f_{w^*}) < \rho$. SGD, on the other hand, takes $O\!\left(\frac{1}{\rho}\right)$ iterations to achieve the same. Remember that each iteration of SGD is $n$ times faster than that of GD, as GD has to go through all $n$ datapoints in an iteration while SGD looks at only one. So, while SGD takes $O\!\left(\frac{1}{\rho}\right)$ time to reach the optimum, GD does so in $O\!\left(n\log\frac{1}{\rho}\right)$ time.

|                                | GD                       | SGD               |
|--------------------------------|--------------------------|-------------------|
| Time per iteration             | $n$                      | $1$               |
| Iterations to accuracy $\rho$  | $\log\frac{1}{\rho}$     | $\frac{1}{\rho}$  |
| Time to accuracy $\rho$        | $n\log\frac{1}{\rho}$    | $\frac{1}{\rho}$  |

To put these numbers into perspective, if we have a million data points ($n = 10^6$) and demand a very high accuracy $\rho$ of the optimizer, GD would reach the optimum ~10 times faster than SGD. But guess what… SGD almost always performs better with large datasets in practice[1], reaching an acceptable solution much faster than GD! So what exactly is happening here? Well, the catch is, in practice, we don’t always need $w^*$, the optimal $w$ defined beforehand (in fact it might even be a bad idea to use $w^*$ as it might not lead to good generalization); all we need is a $w$ that’s in the vicinity of $w^*$. And SGD is extremely good at getting to a $w$ close to $w^*$; alas, once there, it bumps around a lot before reaching $w^*$. GD, on the other hand, takes a longer time than SGD to reach the vicinity of $w^*$, but races to the optimum once it gets there. The question now is, can we get a theoretically sound explanation for this behavior?
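
These numbers come straight from the table; the tiny sketch below just plugs them in. The value of $\rho$ is an assumption of mine, and the base of the logarithm (and hence the exact constant) is hidden inside the asymptotic notation, so read the ratio as an order-of-magnitude statement:

```python
import math

n   = 1_000_000   # number of data points
rho = 1e-8        # accuracy we ask of the optimizer (an assumed, illustrative value)

gd_time  = n * math.log(1 / rho)   # time/iteration (n)  x  iterations (log 1/rho)
sgd_time = 1 / rho                 # time/iteration (1)  x  iterations (1/rho)

print(f"GD : {gd_time:.1e}")                  # ~1.8e7
print(f"SGD: {sgd_time:.1e}")                 # 1.0e8
print(f"SGD / GD: {sgd_time / gd_time:.1f}")  # ~5 with natural logs; same ballpark as the ~10x above
```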

Yes, we can (duh – why else would I be writing this)! We just have to begin at the beginning. Remember what our original goal was - to find an $f^*$ which minimizes $E(f)$, the real expectation of the loss $\ell$. But limited by the inability to explore an unbounded function space with inexhaustible datapoints for endless time, we took recourse to approximations, to minimize $E_n$ in lieu of $E$, on a parametrized set of functions. Let’s go through our approximations, and try to get an estimate of how much error we introduced as we made each approximation. We started out to find $f^* = \arg\min_f E(f)$. Constraining the function space we could explore to $\mathcal{F}$ meant we could only hope to find $f^*_{\mathcal{F}} = \arg\min_{f \in \mathcal{F}} E(f)$. But we don’t have access to $E$ - only an approximation of it through $E_n$. So all we can do is get $f_n = \arg\min_{f \in \mathcal{F}} E_n(f)$. Now optimization takes time: So we would be happy to get within a $\rho$-neighborhood of $f_n$, the optimal $f$ in $\mathcal{F}$ w.r.t. the empirical risk, instead of getting to its exact value, so that we ultimately arrive at an $\tilde{f}_n$ such that $E_n(\tilde{f}_n) < E_n(f_n) + \rho$. The error we accumulate through all these approximations, by finding $\tilde{f}_n$ instead of $f^*$, is (using linearity of expectation):

$$ \mathcal{E} = \mathbb{E}\big[E(\tilde{f}_n) - E(f^*)\big] = \underbrace{\mathbb{E}\big[E(f^*_{\mathcal{F}}) - E(f^*)\big]}_{\mathcal{E}_{app}} + \underbrace{\mathbb{E}\big[E(f_n) - E(f^*_{\mathcal{F}})\big]}_{\mathcal{E}_{est}} + \underbrace{\mathbb{E}\big[E(\tilde{f}_n) - E(f_n)\big]}_{\mathcal{E}_{opt}} $$

  • $\mathcal{E}_{app} = \mathbb{E}\big[E(f^*_{\mathcal{F}}) - E(f^*)\big]$ is the approximation error, sometimes also called the bias. This arises because we can only explore a limited number of function-families, those belonging to $\mathcal{F}$. If we increase the coverage of $\mathcal{F}$, we will increase the chance of finding a better function to approximate $f^*$, leading to a lower error.
  • $\mathcal{E}_{est} = \mathbb{E}\big[E(f_n) - E(f^*_{\mathcal{F}})\big]$ is the estimation error, sometimes also called the variance. This arises because we can’t explore an infinite number of samples from the distribution. There are two ways we can reduce this error:
    1. Use simpler function families, which have less expressive capability and fewer parameters, so that we can get a good fit even without a lot of data¹
    2. Get more data, so that $E_n$ becomes a better approximation of $E$.
  • $\mathcal{E}_{opt} = \mathbb{E}\big[E(\tilde{f}_n) - E(f_n)\big]$ is the optimization error, which can of course be reduced if we run more iterations of the optimization routine.
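
For a more tangible feel of the three terms, here is a rough numerical sketch. Everything in it - the toy distribution, the affine family standing in for $\mathcal{F}$, the deliberately under-optimized $\tilde{f}_n$, and the use of a large held-out sample as a stand-in for the true expectation - is an assumption made purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy world: x ~ Uniform(0, 2) and y = sin(pi*x) + noise.  Under squared
# loss the best possible predictor f* is the conditional mean, i.e. f*(x) = sin(pi*x).
def sample(n):
    x = rng.uniform(0, 2, size=n)
    return x, np.sin(np.pi * x) + 0.1 * rng.normal(size=n)

x_test, y_test = sample(500_000)        # big fresh sample: Monte-Carlo stand-in for E(.)

def risk(predict):
    return np.mean(0.5 * (predict(x_test) - y_test) ** 2)

# Restricted family F: affine predictors f_w(x) = w1*x + w0 (clearly unable to match a sine).
fit = lambda x, y: np.polyfit(x, y, 1)  # exact minimizer of the empirical risk over F

x_train, y_train = sample(200)          # the n samples we actually get to see
w_F = fit(*sample(2_000_000))           # proxy for f*_F, fitted on "practically unlimited" data
w_n = fit(x_train, y_train)             # f_n : exact minimizer of E_n
w_tilde = np.zeros(2)                   # f~_n: only a few sloppy gradient steps on E_n
for _ in range(5):
    resid = np.polyval(w_tilde, x_train) - y_train
    w_tilde -= 0.05 * np.array([np.mean(resid * x_train), np.mean(resid)])

E_star = risk(lambda x: np.sin(np.pi * x))
E_F    = risk(lambda x: np.polyval(w_F, x))
E_n    = risk(lambda x: np.polyval(w_n, x))
E_opt  = risk(lambda x: np.polyval(w_tilde, x))

print("approximation error:", E_F - E_star)   # price of restricting ourselves to affine f
print("estimation error   :", E_n - E_F)      # price of having only 200 samples
print("optimization error :", E_opt - E_n)    # price of stopping the optimizer early
```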

Now we know that $\mathcal{E}_{opt} \sim \rho$, since $E_n(\tilde{f}_n) < E_n(f_n) + \rho$ by construction. Assuming the VC dimension of $\mathcal{F}$ is bounded by some finite $d$, and $n \gg d$, which is the case when we are dealing with big datasets, we have from learning theory [2, 3, 4]:

$$ \mathcal{E}_{est} = O\!\left(\left(\frac{\log n}{n}\right)^{\alpha}\right) $$

so that:

$$ \mathcal{E} = \mathcal{E}_{app} + \mathcal{E}_{est} + \mathcal{E}_{opt} \;\sim\; \mathcal{E}_{app} + \left(\frac{\log n}{n}\right)^{\alpha} + \rho $$

with $\alpha \in \left[\frac{1}{2}, 1\right]$, where the final equivalence ($\sim$) is used because it gives a more realistic view of the asymptotic behavior as compared to the pessimistic bound of the inequality[1].
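
Just to get a feel for what this says, the following sketch traces how the estimation term shrinks with $n$ at the two ends of the $\alpha$ range (the constants hidden by the $O(\cdot)$ are ignored, so only the trend is meaningful):

```python
import math

# How fast the estimation-error term (log n / n)^alpha shrinks as we add data, at the
# pessimistic (alpha = 1/2) and optimistic (alpha = 1) ends of the range.
for n in (10**3, 10**4, 10**5, 10**6):
    term = lambda alpha: (math.log(n) / n) ** alpha
    print(f"n = {n:>7}:  alpha=1/2 -> {term(0.5):.4f}    alpha=1 -> {term(1.0):.6f}")
```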

Now let’s step back for a moment and recap what we are doing. We saw that due to certain limitations, we have to resort to a few approximations, and we have to make do with a sub-optimal function $\tilde{f}_n$ instead of $f^*$, which leads to an error $\mathcal{E}$ in our estimate. Our goal is to minimize this error, and we can do that in the following ways:

  1. Get a bigger $\mathcal{F}$, explore more functions, reduce $\mathcal{E}_{app}$. The catch: it takes more time to explore more functions
  2. Process more samples, use a bigger $n$, reduce $\mathcal{E}_{est}$. The catch: it takes more time to operate on more samples
  3. Use a smaller $\rho$, reduce $\mathcal{E}_{opt}$. The catch: it takes more iterations, and consequently more time, to get closer to $f_n$

In a large scale learning task, time is of the essence, and since all the 3 quantities - $\mathcal{F}$, $n$, and $\rho$ - are variables we can manipulate, and since each one of them has a strong influence on the running time, we are faced with a trade-off involving the three quantities, or, by proxy, a trade-off on which component of the error we want to reduce by what degree. Now, in the limiting case, we will have $\mathcal{E}_{app} \sim \mathcal{E}_{est} \sim \mathcal{E}_{opt} \sim \varepsilon$. Why is that so? Because the convergence rate of $\mathcal{E}$ is limited by the convergence rate of its slowest term, and it won’t make sense to spend resources making another term decrease faster. For example, if $\mathcal{E}_{est}$ is the one bumming around, and the other two are cruising forward with abandon, then we would be better off making $\mathcal{E}_{est}$ smaller at the expense of a smaller $\mathcal{F}$ and a larger $\rho$, to put $\mathcal{E}_{est}$ on steroids and help it catch up to its fervent compatriots. Since $\mathcal{E}_{opt} \sim \rho$, $\mathcal{E}_{est} \sim \left(\frac{\log n}{n}\right)^{\alpha}$, and $\mathcal{E}_{opt} \sim \mathcal{E}_{est} \sim \varepsilon$, we have $\rho \sim \varepsilon$, i.e. the accuracy we demand of the optimizer should be of the same order as the error we are targeting. Also,

$$ \left(\frac{\log n}{n}\right)^{\alpha} \sim \varepsilon $$

so that $n \sim \frac{1}{\varepsilon^{1/\alpha}}\log\frac{1}{\varepsilon}$. Now, from the table, GD takes

$$ n\log\frac{1}{\rho} \;\sim\; \frac{1}{\varepsilon^{1/\alpha}}\log\frac{1}{\varepsilon}\cdot\log\frac{1}{\varepsilon} \;=\; \frac{1}{\varepsilon^{1/\alpha}}\log^{2}\frac{1}{\varepsilon} $$

time to reach accuracy $\rho$, whereas SGD does it in $\frac{1}{\rho} \sim \frac{1}{\varepsilon}$ time. Since in the asymptotic case² these are also the times needed to drive the excess error down to $\varepsilon$, GD takes $\frac{1}{\varepsilon^{1/\alpha}}\log^{2}\frac{1}{\varepsilon}$ time to error $\varepsilon$, compared to $\frac{1}{\varepsilon}$ for SGD. Substituting our earlier value of $n = 10^6$ and assuming $\alpha = 1$ (which puts $\varepsilon \sim \frac{\log n}{n}$ around $10^{-5}$), SGD would reach the optimum ~100 times faster than GD! So even though SGD, with its languid traipsing to the optimal $w^*$, sucks in comparison to GD as an optimizer, it needs far less time to get to a pre-defined expected risk $\varepsilon$, which is what we are usually looking for in practice. Isn’t that interesting!
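
For what it’s worth, that ~100x figure is easy to sanity-check numerically (again ignoring the constants hidden by the asymptotic notation, so the exact number should be taken loosely):

```python
import math

# Time to reach excess error eps in the asymptotic regime, taking alpha = 1 so that
# eps ~ log(n) / n.  With the earlier n = 10^6 this puts eps around 1.4e-5.
n   = 1_000_000
eps = math.log(n) / n

gd_time  = (1 / eps) * math.log(1 / eps) ** 2   # ~ (1/eps) * log^2(1/eps)
sgd_time = 1 / eps                              # ~ 1/eps

print(f"eps ~ {eps:.1e}")
print(f"GD/SGD time ratio ~ {gd_time / sgd_time:.0f}")   # ~125, i.e. roughly the ~100x above
```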

REFERENCES

  1. Bottou, L. Stochastic Gradient Descent Tricks. In: Neural Networks: Tricks of the Trade (2nd ed.), Springer, 2012.

¹ Notice that reducing the expressivity of a function space would lead to an increase in the bias or approximation error: the proverbial bias-variance trade-off.

² Because, in the asymptotic regime, $\rho \sim \varepsilon$ and $n \sim \frac{1}{\varepsilon^{1/\alpha}}\log\frac{1}{\varepsilon}$. Even if we remain conservative and say $n \sim \frac{1}{\varepsilon^{1/\alpha}}$, the conclusion still holds.