# What is the intuition behind Hoeffding's inequality

## concentration

In this teaching unit, elementary concepts of the calculus of probability are used. If you don't feel too confident about this, please first take a look at the relevant teaching unit.

Author: Martin Krejca

When we analyze random processes, it is often not possible to assign an exact probability to every elementary event. Often this is because the probability space is infinite. In addition, it is often not mentioned explicitly because, on the one hand, random variables are used and, on the other hand, it is intuitively clear what you are talking about. In such cases you are often satisfied with less.

### The pitfalls of the expected value

Usually one is interested in the expectation of a particular random variable that models the process that one is investigating. On the one hand, the expected value is of course a nice measure, since it describes the average behavior of a random process and thus provides compact information about a probability space. On the other hand, the expected value is only of limited help, as the following two examples are intended to illustrate.

• We consider a fair roll of the dice (i.e. a uniformly distributed random variable over \ (\ {1, \ ldots, 6 \} \)). The expectation of such a throw is \ (3 {,} 5 \). Here it becomes clear how much information is lost due to the expected value: The value itself is not a valid result of a throw and, moreover, all results are equally likely (and not more likely for the \ (3 \) or the \ (4 \)).
• Now we consider a random variable \ (X \) over \ (\ {5, 995 \} \) with \ (P (X = 5) = \ tfrac {1} {2} \) and \ (P (X = 995 ) = \ tfrac {1} {2} \). Here the expected value is \ (500 \) and the problems that arose when the dice were thrown only get more serious: Now the only two values ​​that the random variable can assume are still very far from the expected value. So what use does it still have?

In order to answer this question, one must first realize that the expected value - as we have already briefly stated above - is always only a single number and can therefore represent much less information than the underlying probability space (which can potentially be infinite) . The expected value is just the average of an arbitrary weighting (according to the corresponding random variable) of the elementary probabilities. You may or may not be interested in that.

If you now know the expected value of a random variable and observe a result, you can at least get some additional information: If the observed value is greater than the expected value, there must also be smaller values ​​(and vice versa). However, it is unclear whether there are many much smaller values ​​with less probability or whether there are a few not so much smaller values ​​with greater probability.

A really useful property of an expected value is to be finite. To do this, consider the following example:

• We toss a fair coin so many times in a row that it is upside down for the first time. Theoretically, heads never have to come, because with each new throw there is a probability of \ (\ tfrac {1} {2} \) to get tails. However, the greater the number of attempts, this probability tends towards \ (0 \). Stochastically, the result of the coin toss is a geometrically distributed random variable with a probability of success \ (\ tfrac {1} {2} \). The expected value is \ (2 \). That means that you will almost certainly throw your head at some point.

The fact that the expected value in the above example is actually finite is due to the fact that the probability of only throwing a number goes down exponentially and thus strongly counteracts the linear increase in the number of unsuccessful attempts. But how strong does this counterforce actually have to be? To understand that, let's just look at a few infinite probability spaces. Finite probability spaces always have a finite expectation, since the real numbers are closed under addition and multiplication.

• A random variable \ (X \) is given, which can take values ​​in \ (\ mathbf {N} \ smallsetminus \ {0, 1 \} \). Furthermore we define for all natural numbers \ (i \ geq 2 \) that \ (P (X = i) = \ tfrac {1} {i} \). So the probabilities only decrease linearly. In order to be able to speak of a probability space at all, they also have to add up to \ (1 \). However, this sum corresponds to the harmonic series (except for the first term). So it diverges. Accordingly, no such probability space can exist.
• Let us now look again at a random variable \ (X \), which can assume the same values ​​as before. This time we determine for \ (i \ geq 3 \) that \ (P (X = i) = \ tfrac {1} {i ^ 2} \). For \ (i = 2 \) we simply cheat \ (P (X = 2) = 1 - \ sum_ {i = 3} ^ {\ infty} \ tfrac {1} {i ^ 2} \). The infinite sum obviously converges this time. So the quadratic descent gave us a probability space. If we now want to calculate the expected value of \ (X \), then we (almost) come back to the sum from the previous example. The expected value is therefore infinite.
This little game can now be continued forever: Let us assume that the probabilities such as \ (\ tfrac {1} {i ^ 3} \) fall. Then the expected value is finite, but the variance is still infinite. So probabilities that fall like a polynomial are not that nice. The geometric distribution from the coin flip example does not have this problem. It has a finite expectation and a finite variance.

### Deviations from the expected value

With the exception of the last two examples, we have always discussed the deviation of a random variable from its expected value. That was of course intentional, because many intuitively imagine the expected value as the most representative value of a random variable (although this value does not even have to be assumed!). This implicitly goes hand in hand with the fact that the expected value is the most probable. It is not so; at least not always. The dice roll serves as a good example again, because with it all outcomes are equally likely.

However, it is also the case that there are actually many processes that focus heavily on their expected value. This means that it is relatively unlikely that the result of such a process will deviate significantly from the expected value. This is the case, for example, with the geometric distribution (our coin tossing example): With a probability of \ (\ tfrac {3} {4} \) you will hit the head once after a maximum of two tosses.

Another example of a lumped distribution is the binomial distribution. It has in common with the geometric distribution that it describes processes that are carried out several times. This also corresponds more to the intuition behind the expected value: It is not a representative value for just one experiment, but for many (independent) experiments.

It is important that there is not always a strong concentration when several experiments are carried out one after the other, and that there can also be individual experiments that are highly concentrated, such as the normal or exponential distribution. The information that is relevant to us is simply the probability of a deviation that is relevant to us. Interestingly, this information is much more meaningful than just the expected value. And usually you are not interested in anything more than that. If it is clear that a process is very unlikely to deviate from a specific value (often the expected value), then the exact distribution is actually no longer important.

### The Markov Inequality

The first inequality we'll look at is one-way, but it's very general. The only requirement that we make of the random variable that we want to stochastically limit is that it must not be negative. Not more.

• Let \ (X \) be a nonnegative random variable and let \ (i \ in \ mathbf {R} ^ + \! \). Then \ [P (X \ geq i) \ leq \ frac {\ mathrm {E} (X)} {i} \ applies. \]

The proof is very simple and the informative value does not seem too great at first. Only when \ (i \ geq \ mathrm {E} (X) \) holds, one arrives at something meaningful. One advantage, however, is that \ (i \) can be any (positive) value; not just the expected value! That makes the inequality very flexible.

It is also important to see that although the deviation only falls linearly, it potentially covers an infinite number of cases. This is in contrast to the last two examples in the first section. There we considered \ (P (X = i) \) and not, as here, \ (P (X \ geq i) \).

Let us now concentrate on multiples of \ (\ mathrm {E} (X) \). With the help of the Markov inequality, we can now see that an increasing deviation from the expected value (upwards) becomes increasingly unlikely. Unfortunately, we do not receive any information about deviations downwards.1

In the following we consider inequalities that limit the deviation of a random variable in both directions. Interestingly, they all build on the Markov inequality.

### The Chebyshev inequality

A very simple approach to restrict a deviation in both directions from the Markov inequality is to use the random variable \ (\ big (X - \ mathrm {E} (X) \ big) ^ 2 \) instead of \ (X \) consider. The expectation of the new random variable is simply the variance of \ (X \) and the following statement emerges:

• Let \ (X \) be any random variable and let \ (a \ in \ mathbf {R} ^ + \! \). Then \ [P \ Big (\ big (X - \ mathrm {E} (X) \ big) ^ 2 \ geq a ^ 2 \ Big) = P \ Big (\ big \ vert X - \ mathrm {E} (X) \ big \ vert \ geq a \ Big) \ leq \ frac {\ mathrm {Var} (X)} {a ^ 2} \. \]

Although we only use the Markov inequality in principle, we now make a statement about arbitrary random variables. The trick is simply to cleverly transform the random variable.2

Fortunately, the insight that the Chebyshev inequality gives us is very intuitive: the greater the deviation of a random variable from its expected value, the less likely it will occur. But does that already correspond to a concentration?

### What does concentrated actually mean?

We saw earlier how to estimate the probability that a random variable will deviate from its expected value. Normally, however, one does not speak of concentration at the same time. When to call something concentrated is really up to you; however, there is usually consensus in the analysis of randomized algorithms.

A very important parameter of a randomized algorithm is its running time. Most of all, one is interested in the expected running time.3 Now, of course, one would like to have efficient algorithms, i.e. ones whose (expected) running time is polynomial. If you have such an efficient algorithm, it will be executed several times in its lifespan. A polynomial number of attempts is realistic. It would therefore be desirable if it almost never deviated significantly from its expected duration during this time.

An event that has a superpolynomial low probability4 possesses, almost certainly does not appear in many polynomial experiments. This simply follows from a Union Bound. For this reason it is often said that a random variable is concentrated around a value if the probability of deviating from it is at most superpolynomially small. In the following, however, we speak of concentration simply when we can limit a deviation in both directions.

### The Chernoff Barriers

In this section we get a little more specific. This means that we are restricting the random variables we are looking at more than before. But we also get much stronger results.

In contrast to the two previous inequalities, we are looking at two inequalities here, not just one. In addition, in many cases one can actually use these two inequalities to show concentration around the expected value. So let's take a look at the actual statements.

• Let \ (X_1, \ ldots, X_n \) i. i. d. bernoulli distribute random variables with probability of success \ (p \). Then applies
• on the one hand for every \ (\ delta \ in \ mathbf {R} ^ + \) \ [P \ bigg (\ sum_ {i = 1} ^ n X_i \ geq (1 + \ delta) np \ bigg) \ leq \ mathrm {e} ^ {- \ frac {\ min \ {\ delta, \ delta ^ 2 \}} {3} pn} \]
• and on the other hand for each \ (\ delta \ in (0, 1) \) \ [P \ bigg (\ sum_ {i = 1} ^ n X_i \ leq (1 - \ delta) np \ bigg) \ leq \ mathrm { e} ^ {- \ frac {\ delta ^ 2} {2} pn} \. \]

This describes the intuition that repeated execution leads to a strong concentration, just as it was already mentioned at the beginning. The sum of the \ (X_i \) is a binomial distribution random variable and \ (np \) is its expected value. In principle, the Chernoff limits cover only one distribution. But this occurs quite often.

Another important point is that the inequalities listed here only limit a relative deviation from the expected value. There are also Chernoff bounds for absolute deviations. But then they don't look so nice anymore.

### The Hoeffding inequality

We now consider the last inequality in this unit, we become a bit more general again, but we continue to look at \ (n \) independent random processes.

In the previous section we saw that the Chernoff bounds give good concentrations. Unfortunately, the random variables have to be Bernoulli distributed for this; so they can only have two values.5 In addition, the random variables must all be distributed identically. These two conditions are thrown overboard in the Hoeffding inequality. In addition, the absolute deviation is limited again.

• Let \ (X_1, \ ldots, X_n \) be independent random variables, so that for every \ (i \ in \ {1, \ ldots, n \} \) always \ (a_i \ leq X_i - \ mathrm {E} (X_i) \ leq b_i \) holds, where \ (a_i, b_i \ in \ mathbf {R} \) are. Then for each \ (c \ in \ mathbf {R} ^ + \) \ [P \ bigg (\ sum_ {i = 1} ^ n \ big (X_i - \ mathrm {E} (X_i) \ big) \ geq c \ bigg) \ leq \ mathrm {e} ^ {- \ frac {2c ^ 2} {\ sum_ {i = 1} ^ n (b_i - a_i) ^ 2}} \. \]

This one inequality initially apparently only restricts an upward deviation from the expected value. But if we choose \ (X'_i = - X_i \), then we see that the same limit also applies downwards.6 We can safely replace the \ (\ geq c \) with a \ (\ leq -c \). With a union bound over these two cases, we also get a good estimate of how likely it is to deviate from the expected value at all.

Finally, an important observation from the Hoeffding inequality should be mentioned: The random processes that are considered must be limited. In fact, you have to state how much each process can deviate from the expected value. The only additional information you get afterwards is how likely a deviation will occur. So one can consider the Hoeffding unequal, for example, only for a single \ (X \), i.e. \ (n = 1 \). Often one is interested in the asymptotic behavior of \ (n \). What the inequality then says is that concentrated independent processes in the sum are also concentrated or become even more concentrated.

### What did we learn?

• Expected values ​​are good, concentrations are better
• The sum of independent random processes is often highly concentrated
• Transforming random variables can help you get the results you want

1Although \ (X \) is nonnegative, a lower bound would make sense in order to really speak of a concentration.↩

2This is a very popular trick in stochastics: if something doesn't fit, it is just made to fit. Often you just have to cleverly transform your process in order to save yourself a lot of arithmetic work

3 Here it is already useful to know that the expected running time is finite. Because that means that the algorithm terminates almost certainly! ↩

4These are probabilities \ (p \) for which for every \ (c \ in \ mathbf {R} ^ + \) it holds that \ (p \ in \ mathrm {o} (n ^ {- c}) \) . Actually one should speak of rational functions with numerator \ (1 \). But that would be very laborious. Exponentially low probabilities fall into this category

5In the case of Bernoulli-distributed random variables, one usually means those which can only take on the two values ​​\ (0 \) and \ (1 \). If there are two other values, you can transform your random variable so that it is Bernoulli distributed in the conventional sense. This transformation can then also be put into the Chernoff bounds

6What then changes is only that \ (a_i \) and \ (b_i \) are exchanged and their signs change. Since only the difference is decisive in the inequality and this is also squared, the end result does not change