# Lecture 2: Review of Probability Theory

## Preview text

ECE 830 Fall 2011 Statistical Signal Processing instructor: R. Nowak
Lecture 2: Review of Probability Theory
Probabilistic models will be used throughout the course to represent noise, errors, and uncertainty in signal processing problems. This lecture reviews the basic notation, terminology and concepts of probability theory.
1 Basic Probability
Probability theory begins with three basic components. The set of all possible outcomes, denoted Ω. The collection of all sets of outcomes (events), denoted A. And a probability measure P. Speciﬁcation of the triple (Ω, A, P) deﬁnes the probability space which models a real-world measurement or experimental process. Example 1
Ω = {all outcomes of the roll of a die} = {1, 2, 3, 4, 5, 6}
A = {all possible sets of outcomes} = {{1}, . . . , {6}, {1, 2}, . . . , {5, 6}, . . . , {1, 2, 3, 4, 5, 6}}
P = probability of all sets/events What is the probability of a given ω ∈ Ω, say ω = 3? What is the probability of the event ω ∈ {1, 2, 3}?
The basic physical model here is that all six outcomes are equally probable.
1.1 Probability Measures
Probability measures must satisfy the following properties: 1. P(A) ≥ 0 , ∀A ∈ A 2. P (Ω) = 1 3. if A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B)
Show that the last condition also implies that in general P(A ∪ B) ≤ P(A) + P(B), an inequality sometimes called the union bound.
1.2 Conditional Probability
Consider two events A, B ∈ A. The (conditional) probability that A occurs given B occurs is P (A|B) := P(A ∩ B) P(B)
1

Lecture 2: Review of Probability Theory

2

Example 2

Ω = {1, 2, 3, 4, 5, 6} A = {1, 2} B = {2, 3}

Recall that we say that A “occurs” if the outcome is either 1 or 2. Now suppose you are told that B occurs. Then it seems more probable that A has also occurred.
The probability that A occurs, without knowledge of whether B has occurred, is 1/3. That is, P(A) = 1/3. From the formula above it is an easy calculation to see that

P (A|B) = P(A ∩ B) = P({2}) = 1/6 = 1/2

P(B)

P({2, 3}) 1/3

So we see that knowledge of B provides important information about A. Does this make sense with your intuition?

1.3 Independence
Two events A and B are said to be independent if P (A|B) = P(A). In otherwords, B provides no information about whether A has occurred.
Example 3 Supose we have two dice. Then
Ω = {all pairs of outcomes of the roll two dice} = {(1, 1), (1, 2), . . . , (6, 6)}
Let A = {1st die is 1} and B = {2nd die is 1}. P(A) = 1/6. P (A|B) = P(P{(({11,1}))}) = 11//366 = 1/6. As we know, the value of one die does not inﬂuence the other. The two outcomes are independent.

1.4 Bayes Rule

Again consider two events A and B. Recall the conditional probability P (A|B). Bayes rule is a formula for the “inverse” conditional probability, P (B|A). Bayes’ Rule is

P (B|A) = P (A|B) P(B) P(A)
It is easy to verify by recalling the deﬁnition of conditional probability above. This inversion formula will play a central role in signal estimation problems later in the course.

Example 4 Geneticists have determined that a particular genetic defect is related to a certain disease. Many people with the disease also have this defect, but there are disease-free people with the defect. The geneticists have found that 0.01% of the general population has the disease and that the defect is present in 50% of these cases. They also know that 0.1% of the population has the defect. What is the probability that a person with the defect has the disease?
This is a simple application of Bayes’ Rule. We are interested in two events: ‘defect’ and ‘disease’. Here is what the geneticists know: P(disease) = 0.0001, P (defect|disease) = 0.5, P(defect) = 0.001. We have everything we need to apply Bayes’ Rule:

P (disease|defect) = P (defect|disease) P(disease) = 0.5 × 0.0001 = 0.05

P(defect)

0.001

In other words, if a person has the defect then the chance they have the disease is 5%. In the general population, on the other hand, the chance that a person has the disease is 0.01%. So this “genetic marker” is quite informative.

Lecture 2: Review of Probability Theory

3

2 Random Variables
Often in applications the underlying probability space is not explicitly identiﬁed. Rather we work with random variables which are mappings from Ω to more concrete spaces such as Rn.
Example 5 A real-valued random variable is a mapping X : Ω → R, which means that for every ω ∈ Ω there is a corresponding value X(ω) ∈ R. Random variables can also be vectors, for example a mapping X : Ω → Rn. Since P speciﬁes the probability of every subset of Ω, it also induces probabilities on events expressed in terms of X. For example, if X is a real-valued scalar random variable, then this is an event:
{X ≥ 0} ≡ {ω : X(ω) ≥ 0}
Therefore we can compute the probability of this event:
P(X ≥ 0) = P({ω : X(ω) ≥ 0})
More generally, for any set A we may consider the event {X ∈ A} and its probability P(X ∈ A).

2.1 Cumulative Distributions and Densities
For real-valued scalar random variables it is common to consider the probabilities P(X ≤ x) as a function of x. This function is called the cumulative distribution function of X, and it is often denoted by FX (x). The name can be understood by observing that the diﬀerence between FX (x1) and FX (x2) for any x2 > x1 is equal to the probability that X takes a value in the interval (x1, x2]; i.e.,

FX (x2) − FX (x1) = P(X ≤ x2) − P(X ≤ x1) = P(x1 < X ≤ x2)

The equality above follows by noting that {X ≤ x2} = {X ≤ x1} ∪ {x1 < X ≤ x2} and these two sets are disjoint. Therefore by the basic property of probabilities we have that

P(X ≤ x2) = P(X ≤ x1) + P(x1 < X ≤ x2).

For continuous random variables we can consider the limit of the diﬀerence as x2 → x1. Assume that

the

limit

lim∆→0

FX (x+∆)−FX (x) ∆

:= pX (x) exists, in other words FX

is diﬀerentiable at x.

Since Fx is a

monotonic increasing function, pX (x) ≥ 0. If Fx is diﬀerentiable everywhere, then by the Fundamental

Theorem of Calculus we have x

FX (x) =

pX (x) dx

−∞

Note also that limx→∞ FX (x) = −∞∞ pX (x) = 1, since with probability 1 the variable X takes on a ﬁnite value. The function pX (x) is called the probability density function (pdf) of X. Observe that P(x1 < X ≤ x2) = xx12 pX (x) dx, which explains the term “density.”

Example 6 Recall the test statistic from the communication problem in Lecture 1

n

n

t=

sixi

=

θ

s

2 2

+

si i

i=1

i=1

If the errors are noise-like then we expect z :=

n i=1

si

i

≈ 0.

Moreover

as

n

increases

it

is

reasonable

to

suppose that the approximation becomes more and more accurate since the individual errors may be randomly

positive

or

negative
2

and

tend

to

cancel

each

other

out.

A

reasonable

probability

den√sity

model

for

z

is

then

p(z) := √ 1

e

1 2

z

/n,

a

Gaussian

density

with

zero

mean

and

standard

deviation

1/

n. As n gets larger the

2πn

probability mass is more concentrated about 0. Since t is just a shifted version of z (shifted by the constant

θ s 22), the density for t is

p(t) = √ 1

e

1 2

(t

−θ

s

22 )2 /n

2πn

Lecture 2: Review of Probability Theory

4

s

2 2

when

θ

=

+1

and

s

2 2

when

θ

=

−1,

so

our

test

of

whether

t

was larger or less than 0 is reasonable. Assuming this model we can easily calculate the probabilities of an

incorrect decisions about which bit was sent; i.e., P(X < 0 | θ = +1) and P(X > 0 | θ = −1).

The general form of a Gaussian (or “Normal”) density with mean µ and standard deviation σ is

√1

e

1 2

(

z

µ)2

2

.

Shorthand notation for this distribution is N (µ, σ2) and shorthand for indicating

2πσ2

that a random variable X has this distribution is X ∼ N (µ, σ2). So for our test statistic we could write

t ∼ N (θ

s

2 2

,

n−

1

)

.

The

“standard

normal”

distribution

is

N (0, 1).

Scalar random variables that take values in a discrete set, such as {1, 2, . . . }, do not have a density function. Instead they have a probability mass function (pmf). A pmf has the same interpretation as the density. If X takes values in a set {x1, x2, . . . } (which may be ﬁnite or inﬁnite), then the pmf of X is given by
pX (x) = P(X = xi) 1x=xi ,
i
where 1x=xi is the indicator function that takes a value 1 if x = xi and 0 otherwise. Note that pX (x) = P(X = x).

Example 7 One of the most common discrete distributions is the binomial distribution. Suppose you toss

a coin n times and count the number of heads. This number is a random variable X taking a value between

0 and n, and of course it will depend on p, the probability of observing a head in a single toss. The binomial

distribution gives the probability of observing k heads in the n trials, for k = 0, . . . , n, and has the following

form:

n
pX (x) =

n pk(1 − p)n−k 1x=k

k

k=1

In other words, P(X = k) = nk pk(1 − p)n−k, where nk = k!(nn−! k)! is the binomial coeﬃcient. It is easy to understand this formula. There are nk sequences of heads and tails that have exactly k heads, and the probability of each such sequence is pk(1 − p)n−k. Our shorthand notation for this case is X ∼ Bi(n, p).

To simplify the notation, often we will denote probability functions by F (x) or p(x), dropping the subscript and using the choice of argument variable to indicate the underlying random variable.

3 Multivariate Distributions

In many signal processing problems we will need to consider vector-valued random variables. Such a random

variable consists of many “variates” (the individual elements making up the vector which can be viewed as

scalar random variables). Multivariate distributions describe the joint probability relationships among the

variates. The most common multivariate distribution, and the one that will be used most often in the course,

is the multivariate Gaussian (or Normal) distribution.

Suppose X : Ω → Rd; that is X is a d-dimensional vector of scalar random variables. If the individual

component random variables are independently distributed, then we have a trivial joint distribution which

is just the product of the univariate distributions of each component. For example, if X = [X1 , X2]T and

X1 and X2 are independent and have densities, then the joint density of X is just pX (x) = pX1 (x1)pX2 (x2).

Things are more interesting if the component random variables are not independent, in which case the

multivariate distribution does not factor in this simple way. In fact, the joint distribution factorizes into

univariate densities if and only if the component random variables are independent.

The multivariate Gaussian density is one model that can represent dependent random variables and it

has the following form

pX (x) =

1 exp − 1 (x − µ)T Σ−1(x − µ)

(2π)d|Σ|

2

Lecture 2: Review of Probability Theory

5

where the argument x ∈ Rd, µ ∈ Rd is the mean vector of the density, and Σ ∈ Sd+ is the covariance matrix, where Sd+ denotes the cone of all positive semi-deﬁnite symmetric matrices (i.e., all symmetric matrices with non-negative eigenvalues). The term |Σ| is the determinant of Σ. We will use X ∼ N (µ, Σ) as shorthand for “X is multivariate Gaussian distributed with mean µ and covariance Σ.” It is easy to see that when d = 1 this reduces to the scalar Gaussian density function.
If Σ is a diagonal matrix (i.e., all oﬀ-diagonal entries are zero), then the component random variables are uncorrelated. Moreover, it is easy to veriﬁy that in that case the multivariate Gaussian density factorizes into univariate component densities, which means that uncorrelated Gaussian random variables are also independent. This is a special characteristic of the Gaussian density and in general being uncorrelated does not imply independence.

Figure 1: Multivariate Gaussian Distributions. The top row show the Gaussian density (left) and its contour plot (right) with mean [0 0]T and covariance [1 0; 0 1]. The second row is the same except that the covariance is [1 0.5; 0.5 1]; positive correlation. The third row is the same except that the covariance is [1 − 0.5; −0.5 1]; negative correlation.
Example 8 Suppose that we have a linear system driven by a Gaussian white noise process. That is, a sequence of independent N (0, 1) variables, denoted X1, . . . , Xn is the input to the linear system. The output of the system can be expressed as Y = H X, where H is a n × n matrix describing the action of the system on the input sequence X = [X1, . . . , Xn]T . The joint distribution of X is N (0, I), where I denotes the n × n identity matrix. The joint distribution of the output Y is N (0, HHT ). The output is correlated due to the action of the system.
Example 9 Imagine we have a sensor network monitoring a manufacturing facility. The network consists of n nodes and its function is to monitor for failures in the system. Let X = [X1, . . . , Xn]T denote the set of

Lecture 2: Review of Probability Theory

6

scalar measurements produced by the network and suppose that it can modeled as a realization of a N (µ, Σ) random vector for a particular choice of µ and Σ. Furthermore, assume that the set B ⊂ Rn denotes the set of all vectors indicative of failures. Then the probability of failure is given by P(X ∈ B), which can be calculated (numerically) by integrating the N (µ, Σ) over the set B.
It is sometimes necessary to consider conditional probabilities and expectations. Let X and Y be random variables. Suppose we observe Y = y. If X and Y are dependent, then knowledge that Y = y tells us something about X. The conditional probability of X given Y = y is deﬁned according to the deﬁnition of conditional probability. If the random variables have a joint density or mass function, then the conditional density or mass function is deﬁned as
p(x, y) p(x|y) =
p(y)
where p(y) = p(x, y) dx and x is the variable and y is ﬁxed. Suppose we were interested in the probability that X ∈ A, for some set A, given Y = y. Then we have P (X ∈ A | Y = y) = A p(x|y) dx.
4 Expectation
In addition to computing the probabilities that random variables fall into certain sets it is also useful to compute the average or expected value of random variables. If a random variable X has density pX (x), then the expectation of f (X), where f is any function of X, is
E[f (X)] = f (x) pX (x) dx
If the random variable is discrete, then the expectation is
E[f (X)] = f (xi) P(X = xi)
i
Here are some special cases.
mean: µ = E[X]
variance: σ2 = E[(X − E[X])2]
probability: P(X ∈ A) = E[1{X∈A}]
characteristic function: φ(ω) := E[exp(−iωX)]
The characteristic function of X is the Fourier transform of its density. If X is a d-dimensional random variable, X = [X1, . . . , Xd]T , then the mean is
 µ1  E[X] =  ... 
µd
and the covariance matrix is Σ = E[(X − µ)(X − µ)T ], where Σi,j = E[(Xi − µi)(Xj − µj)], for i, j = 1 . . . , d. Note that if Y = AX, where A is an m × n matrix, then the random vector Y has mean E[AX] = Aµ
and covariance E[(AX − Aµ)(AX − Aµ)T )] = AΣAT . If X is multivariate Gaussian distributed, then so is Y (which can be shown using the characteristic function). In that case Y ∼ N (Aµ, AΣAT ) (recall Example 8). This is a special property of the Gaussian distribution (generally linear transformations will alter the distributional characterstics).
We also sometimes need to use conditional expectations. For example, suppose X and Y are two dependent random variables. If we observe Y = y, then the expectation of f (X) may diﬀer from E[f (X)] (the

Lecture 2: Review of Probability Theory

7

unconditional expectation). The conditional expectation is denoted by E[f (X) | Y = y]. If X and Y have a joint density, then the conditional expectation is computed using the conditional density as follows:
E[f (X) | Y = y] = f (x) p(x|y) dx
If X and Y are independent random variables, then E[XY ] = E[X] E[Y ], a simple consequence of the fact that the joint density or mass function factorizes.

5 Convergence of Sums of Independent Random Variables

The most important form of statistic considered in this course is a sum of independent random variables.

Example 10 A biologist is studying the new artiﬁcial lifeform called synthia. She is interested to see if

the synthia cells can survive in cold conditions. To test synthia’s hardiness, the biologist will conduct n

independent experiments. She has grown n cell cultures under ideal conditions and then exposed each to

cold conditions. The number of cells in each culture is measured before and after spending one day in cold

conditions. The fraction of cells surviving the cold is recorded. Let x1, . . . , xn denote the recorded fractions.

The average p := n1

n i=1

xi

is

an

estimator

of

the

survival

probability.

Understanding behavior of sums of independent random variables is extremely important. For instance,

the biologist in the example above would like to know that the estimator is reasonably accurate. Let

X1, . . . , Xn be independent and identically distributed random variables with variance σ2 < ∞ and consider

the average µ := n1

n i=1

Xi.

First

note

that

E[µ]

=

E[X ].

An

easy

calculation

shows

that

the

variance

of

µ

is σ2/n. So the average has the same mean value as the random variables and the variance is reduced by a

factor of n. Lower variance means less uncertainty. So it is possible to reduce uncertainty by averaging. The

more we average, the less the uncertainty (assuming, as we are, that the random variables are independent,

which implies they are uncorrelated).

The argument above quantiﬁes the eﬀect of averaging on the variance, but often we would like to say

more about the distribution of the average. The Central Limit Theorem is a classic result showing that the

probability distribution of the average of n independent and identically distributed random variables with

mean µ and variance σ2 < ∞ tends to a Gaussian distribution with mean µ and variance σ2/n, regardless

of the form of the distribution of the variables. By ‘tends to’ we mean in the limit as n tends to inﬁnity.

In many applications we would like to say something more about the distributional characteristics for

ﬁnite values of n. One approach is to calculate the distribution of the average explicitly. Recall that if

the random variables have a density pX , then the density of the sum

n i=1

Xi

is

the

n-fold

convolution

of

the density pX with itself (again this hinges on the assumption that the random variables are independent;

it is easy to see by considering the characteristic function of the sum and recalling that multiplication of

Fourier transforms is equivalent to convolution in the inverse domain). However, this exact calculation can be

sometimes diﬃcult or impossible, if for instance we don’t know the density pX , and so sometimes probability

bounds are more useful.

Let Z be a non-negative random variable and take t > 0. Then

E[Z] ≥ E[Z 1Z≥t] ≥ E[t 1Z≥t] = t P(Z ≥ t)
The result P(Z ≥ t) ≤ E[Z]/t is called Markov’s Inequality. Now we can use this to get a bound on the probability ‘tails’ of Z. Let t > 0
P(|Z − E[Z]| ≥ t) = P ((Z − E[Z])2 ≥ t2) E[(Z − E[Z])2]
≤ t2 Var(Z)
= t2 ,

Lecture 2: Review of Probability Theory

8

where Var(Z) denotes the variance of Z. This inequality is known as Chebyshev’s Inequality. If we apply

this to the average µ = n1

n i=1

Xi,

then

we

have

σ2 P(|µ − µ| ≥ t) ≤ nt2
where µ and σ2 are the mean and variance of the random variables {Xi}. This shows that not only is the variance reduced by averaging, but the tails of the distribution (probability of observing values a distance of more than t from the mean) are smaller.
The tail bound given by Chebyshev’s Inequality is loose, and much tighter bounds are possible under slightly stronger assumptions. In particular, if the random variables {Xi} are bounded or sub-Gaussian (meaning the tails of the probability distribution decay at least as fast as Gaussian tails), then the tails of the average converge exponential fast in n. The simplest result of this form is for bounded random variables.

Theorem 1 (Hoeﬀding’s Inequality). Let X1, X2, ..., Xn be independent bounded random variables such that

Xi ∈ [ai, bi] with probability 1. Let Sn =

n i=1

Xi.

Then

for

any

t

>

0,

we

have

2t2

P(|Sn − E[Sn]| ≥ t) ≤ 2 e Pni=1(bi−ai)2

If the random variables {Xi} are binary-valued, then this result is usually referred to as the Chernoﬀ Bound.

The proof of Hoeﬀding’s Inequality, which relies on a clever generalization of Markov’s inequality and some

elementary concepts from convex analysis, is given in the next section.

Now suppose that the random variables in the average µ = n1 Xi ≤ b. Let c = (b − a)2. Then Hoeﬀding’s Inequality implies

n i=1

Xi

are

bounded

according

to

a

P(|µ − µ| ≥ t)

2

e−

2nt2 c

(1)

In other words, the tails of the distribution of the average are tending to zero at an exponential rate in n, much faster than indicated by Chebeyshev’s Inequality. Similar exponential tail bounds hold for averages of iid sub-Gaussian variables. Using tail bounds like these we can prove the so-called laws of large numbers.

Theorem 2 (Weak Law of Large Numbers). Let X1, X2, ..., Xn be iid random variables with E[|Xi|] < ∞.

Then n1

n i=1

Xi

converges

in

probability

to

E[Xi].

Theorem 3 (Strong Law of Large Numbers). Let X1, X2, ..., Xn be iid random variables with E[|Xi|] < ∞.

Then n1

n i=1

Xi

converges

almost

surely

E[Xi].

Example 11 Let us revisit the synthia experiments. The biologist has collected n observations, x1, . . . , xn,

each corresponding to the fraction of cells that survived in a given experiment. Her estimator of the survival

rate is n1

n i=1

xi.

How conﬁdent can she be that this is an accurate estimator of the true survival rate?

Let us model her observations as realizations of n iid random variables X1, . . . , Xn with mean p and deﬁne

p = n1

n i=1

Xi.

We

say

that

her

estimator

is

probability

approximately

correct

with

non-negative

parameters

( , δ) if

P(|p − p| > ) ≤ δ

The random variables are bounded between 0 and 1 and so the value of c in (1) above is equal to 1. For

desired accuracy > 0 and conﬁdence 1 − δ, how many experiments will be suﬃcient? From (1) we equate

δ

=

2

exp(−2n

2)

which

yields

n

1 22

log(2/δ).

Note

that

this

requires

no

knowledge

of

the

distribution

of

the

{Xi}

apart

from

the

fact

that

they

are

bounded.

The

result

can

be

summarized

as

follows.

If

n

1 22

log(2/δ),

then the probability that her estimate is oﬀ the mark by more than is less than δ.

Lecture 2: Review of Probability Theory

9

6 Proof of Hoeﬀding’s Inequality

Let X be any random variable and s > 0. Note that P(X ≥ t) = P(esX ≥ est) ≤ e−stE[esX ] , by using Markov’s inequality, and noting that esx is a non-negative monotone increasing function. For clever choices

of s this can be quite a good bound.

Let’s look now at

n i=1

Xi

− E[Xi].

Then

n
P( Xi − E[Xi] ≥ t) ≤ e−stE es(Pni=1 Xi−E[Xi])

i=1

= e−stE

n
es(Xi −E[Xi ])

i=1 n
= e−st E es(Xi−E[Xi]) ,

i=1

where the last step follows from the independence of the Xi’s. To complete the proof we need to ﬁnd a good bound for E es(Xi−E[Xi]) .

Figure 2: Convexity of exponential function.

Lemma 1 Let Z be a r.v. such that E[Z] = 0 and a ≤ Z ≤ b with probability one. Then

E esZ

≤ e . s2(b−a)2 8

This upper bound is derived as follows. By the convexity of the exponential function (see Fig. 2),

Thus,

esz ≤ z − a esb + b − z esa, for a ≤ z ≤ b .

b−a

b−a

E[esZ ] ≤ E Z − a esb + E b − Z esa

b−a

b−a

= b esa − a esb , since E[Z] = 0

b−a

b−a

= (1 − λ + λes(b−a))e−λs(b−a) , where λ = −a b−a

Lecture 2: Review of Probability Theory

10

Now let u = s(b − a) and deﬁne φ(u) ≡ −λu + log(1 − λ + λeu) ,

so that

E[esZ ] ≤ (1 − λ + λes(b−a))e−λs(b−a) = eφ(u) .

We want to ﬁnd a good upper-bound on eφ(u). Let’s express φ(u) as its Taylor series with remainder:

u2 φ(u) = φ(0) + uφ (0) + φ (v) for some v ∈ [0, u] .
2

λeu φ (u) = −λ + 1 − λ + λeu ⇒ φ (0) = 0

λeu

λ2e2u

φ (u) = 1 − λ + λeu − (1 − λ + λeu)2

λeu

λeu

= 1 − λ + λeu (1 − 1 − λ + λeu )

= ρ(1 − ρ) ,

where ρ = 1−λλ+euλeu . Now note that ρ(1 − ρ) ≤ 1/4, for any value of ρ (the maximum is attained when

ρ = 1/2, therefore φ (u) ≤ 1/4. So ﬁnally we have φ(u) ≤ u2 = s2(b−a)2 , and therefore

8

8

E[esZ ]

e s2(b−a)2 8

.

Now, we can apply this upper bound to derive Hoeﬀding’s inequality.

n
P(Sn − E[Sn] ≥ t) ≤ e−st E[es(Xi−E[Xi])]

i=1

≤ e e n
−st

s2 (bi−ai)2 8

i=1

=

e−stes2

Pn
i=1

(bi −ai )2 8

−2t2

=

e

Pn i=1

(bi

−ai

)2

by choosing s =

4t ni=1(bi − ai)2

The same result applies to the r.v.’s −X1, . . . , −Xn, and combining these two results yields the claim of the theorem.