# Lecture 2: Review of Probability Theory


ECE 830 Fall 2011 Statistical Signal Processing instructor: R. Nowak


Probabilistic models will be used throughout the course to represent noise, errors, and uncertainty in signal processing problems. This lecture reviews the basic notation, terminology and concepts of probability theory.

## 1 Basic Probability

Probability theory begins with three basic components: the set of all possible outcomes, denoted Ω; the collection of all sets of outcomes (events), denoted A; and a probability measure P. Specification of the triple (Ω, A, P) defines the probability space which models a real-world measurement or experimental process.

Example 1

Ω = {all outcomes of the roll of a die} = {1, 2, 3, 4, 5, 6}

A = {all possible sets of outcomes} = {{1}, . . . , {6}, {1, 2}, . . . , {5, 6}, . . . , {1, 2, 3, 4, 5, 6}}

P = probability of all sets/events

What is the probability of a given ω ∈ Ω, say ω = 3? What is the probability of the event ω ∈ {1, 2, 3}?

The basic physical model here is that all six outcomes are equally probable.
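As a quick sanity check, this equally-likely model can be coded directly; the `prob` helper below is an illustrative sketch, not part of the lecture.

```python
from fractions import Fraction

# Equally probable outcomes of a die roll: P(E) = |E| / |Omega| for any event E.
omega = {1, 2, 3, 4, 5, 6}

def prob(event):
    # Probability measure induced by equally likely outcomes.
    return Fraction(len(event & omega), len(omega))

print(prob({3}))        # P(omega = 3) = 1/6
print(prob({1, 2, 3}))  # P(omega in {1, 2, 3}) = 1/2
```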

### 1.1 Probability Measures

Probability measures must satisfy the following properties:

1. P(A) ≥ 0, ∀A ∈ A
2. P(Ω) = 1
3. If A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B)

Show that the last condition also implies that in general P(A ∪ B) ≤ P(A) + P(B), an inequality sometimes called the union bound.
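For the die space the union bound can be checked exhaustively; the enumeration below is an illustrative sketch, not from the lecture.

```python
from fractions import Fraction
from itertools import combinations

# Equally likely outcomes of a die roll; P(E) = |E| / |Omega|.
omega = {1, 2, 3, 4, 5, 6}

def prob(event):
    return Fraction(len(event & omega), len(omega))

# Exhaustively check P(A ∪ B) <= P(A) + P(B) over all 64 x 64 event pairs;
# equality holds exactly when A and B are disjoint.
events = [set(c) for r in range(7) for c in combinations(sorted(omega), r)]
for a in events:
    for b in events:
        assert prob(a | b) <= prob(a) + prob(b)
        if not (a & b):
            assert prob(a | b) == prob(a) + prob(b)
print("union bound holds for all", len(events) ** 2, "pairs")
```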

### 1.2 Conditional Probability

Consider two events A, B ∈ A. The (conditional) probability that A occurs given that B occurs is

P(A|B) := P(A ∩ B) / P(B)


Example 2

Ω = {1, 2, 3, 4, 5, 6}, A = {1, 2}, B = {2, 3}

Recall that we say that A “occurs” if the outcome is either 1 or 2. Now suppose you are told that B occurs. Then it seems more probable that A has also occurred.

The probability that A occurs, without knowledge of whether B has occurred, is 1/3. That is, P(A) = 1/3. From the formula above it is an easy calculation to see that

P(A|B) = P(A ∩ B) / P(B) = P({2}) / P({2, 3}) = (1/6) / (1/3) = 1/2

So we see that knowledge of B provides important information about A. Does this make sense with your intuition?

### 1.3 Independence

Two events A and B are said to be independent if P(A|B) = P(A). In other words, B provides no information about whether A has occurred.

Example 3 Suppose we have two dice. Then

Ω = {all pairs of outcomes of the roll of two dice} = {(1, 1), (1, 2), . . . , (6, 6)}

Let A = {1st die is 1} and B = {2nd die is 1}. P(A) = 1/6. P(A|B) = P({(1, 1)})/P(B) = (1/36)/(1/6) = 1/6. As we know, the value of one die does not influence the other. The two outcomes are independent.
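A short enumeration of the 36 equally likely pairs confirms the claim; this is an illustrative sketch, not lecture code.

```python
from fractions import Fraction
from itertools import product

# Enumerate all 36 equally likely outcomes of rolling two dice.
omega = set(product(range(1, 7), repeat=2))

def prob(event):
    return Fraction(len(event & omega), len(omega))

A = {w for w in omega if w[0] == 1}  # 1st die is 1
B = {w for w in omega if w[1] == 1}  # 2nd die is 1

p_a = prob(A)
p_a_given_b = prob(A & B) / prob(B)
print(p_a, p_a_given_b)  # both 1/6, so A and B are independent
```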

### 1.4 Bayes Rule

Again consider two events A and B. Recall the conditional probability P (A|B). Bayes rule is a formula for the “inverse” conditional probability, P (B|A). Bayes’ Rule is

P(B|A) = P(A|B) P(B) / P(A)

It is easy to verify this by recalling the definition of conditional probability above. This inversion formula will play a central role in signal estimation problems later in the course.

Example 4 Geneticists have determined that a particular genetic defect is related to a certain disease. Many people with the disease also have this defect, but there are disease-free people with the defect. The geneticists have found that 0.01% of the general population has the disease and that the defect is present in 50% of these cases. They also know that 0.1% of the population has the defect. What is the probability that a person with the defect has the disease?

This is a simple application of Bayes’ Rule. We are interested in two events: ‘defect’ and ‘disease’. Here is what the geneticists know: P(disease) = 0.0001, P (defect|disease) = 0.5, P(defect) = 0.001. We have everything we need to apply Bayes’ Rule:

P(disease|defect) = P(defect|disease) P(disease) / P(defect) = (0.5 × 0.0001) / 0.001 = 0.05

In other words, if a person has the defect then the chance they have the disease is 5%. In the general population, on the other hand, the chance that a person has the disease is 0.01%. So this “genetic marker” is quite informative.
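The computation can be replicated in a few lines; this is just a sketch of the arithmetic above.

```python
# Bayes' Rule applied to the genetics example from the text.
p_disease = 0.0001           # P(disease)
p_defect_g_disease = 0.5     # P(defect | disease)
p_defect = 0.001             # P(defect)

p_disease_g_defect = p_defect_g_disease * p_disease / p_defect
print(p_disease_g_defect)  # 0.05, i.e. a 5% chance
```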


## 2 Random Variables

Often in applications the underlying probability space is not explicitly identified. Rather, we work with random variables, which are mappings from Ω to more concrete spaces such as Rn.

Example 5 A real-valued random variable is a mapping X : Ω → R, which means that for every ω ∈ Ω there is a corresponding value X(ω) ∈ R. Random variables can also be vectors, for example a mapping X : Ω → Rn. Since P speciﬁes the probability of every subset of Ω, it also induces probabilities on events expressed in terms of X. For example, if X is a real-valued scalar random variable, then this is an event:

{X ≥ 0} ≡ {ω : X(ω) ≥ 0}

Therefore we can compute the probability of this event:

P(X ≥ 0) = P({ω : X(ω) ≥ 0})

More generally, for any set A we may consider the event {X ∈ A} and its probability P(X ∈ A).

### 2.1 Cumulative Distributions and Densities

For real-valued scalar random variables it is common to consider the probabilities P(X ≤ x) as a function of x. This function is called the cumulative distribution function of X, and it is often denoted by FX (x). The name can be understood by observing that the diﬀerence between FX (x1) and FX (x2) for any x2 > x1 is equal to the probability that X takes a value in the interval (x1, x2]; i.e.,

FX (x2) − FX (x1) = P(X ≤ x2) − P(X ≤ x1) = P(x1 < X ≤ x2)

The equality above follows by noting that {X ≤ x2} = {X ≤ x1} ∪ {x1 < X ≤ x2} and these two sets are disjoint. Therefore by the basic property of probabilities we have that

P(X ≤ x2) = P(X ≤ x1) + P(x1 < X ≤ x2).

For continuous random variables we can consider the limit of the difference as x2 → x1. Assume that the limit lim∆→0 [FX(x + ∆) − FX(x)]/∆ := pX(x) exists; in other words, FX is differentiable at x. Since FX is a monotonic increasing function, pX(x) ≥ 0. If FX is differentiable everywhere, then by the Fundamental Theorem of Calculus we have

FX(x) = ∫_{−∞}^{x} pX(u) du

Note also that limx→∞ FX(x) = ∫_{−∞}^{∞} pX(x) dx = 1, since with probability 1 the variable X takes on a finite value. The function pX(x) is called the probability density function (pdf) of X. Observe that P(x1 < X ≤ x2) = ∫_{x1}^{x2} pX(x) dx, which explains the term "density."
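For a concrete instance, the standard normal density and distribution function (the latter available in closed form via the error function) illustrate the relation P(x1 < X ≤ x2) = FX(x2) − FX(x1). The numerical check below is an illustrative sketch, not part of the lecture.

```python
import math

# Standard normal pdf and cdf (cdf via the error function).
def phi(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2 * math.pi)

def Phi(x):
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

# P(x1 < X <= x2) = F(x2) - F(x1); check against a midpoint Riemann sum
# of the density over (x1, x2].
x1, x2 = -1.0, 1.0
exact = Phi(x2) - Phi(x1)
n = 10000
h = (x2 - x1) / n
riemann = sum(phi(x1 + (k + 0.5) * h) for k in range(n)) * h
print(exact, riemann)  # both approximately 0.6827
```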

Example 6 Recall the test statistic from the communication problem in Lecture 1

t = Σ_{i=1}^{n} s_i x_i = θ‖s‖²₂ + Σ_{i=1}^{n} s_i ε_i

If the errors are noise-like then we expect z := Σ_{i=1}^{n} s_i ε_i ≈ 0. Moreover, as n increases it is reasonable to suppose that the approximation becomes more and more accurate, since the individual errors may be randomly positive or negative and tend to cancel each other out. A reasonable probability density model for z is then

p(z) := √(n/(2π)) e^{−n z²/2},

a Gaussian density with zero mean and standard deviation 1/√n. As n gets larger the probability mass is more concentrated about 0. Since t is just a shifted version of z (shifted by the constant θ‖s‖²₂), the density for t is

p(t) = √(n/(2π)) e^{−n (t − θ‖s‖²₂)²/2}


This density is peaked about ‖s‖²₂ when θ = +1 and about −‖s‖²₂ when θ = −1, so our test of whether t was larger or less than 0 is reasonable. Assuming this model we can easily calculate the probabilities of an incorrect decision about which bit was sent; i.e., P(t < 0 | θ = +1) and P(t > 0 | θ = −1).
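These error probabilities reduce to evaluations of the Gaussian cdf: if t ∼ N(µ, σ²) with µ > 0, then P(t < 0) = Φ(−µ/σ). The sketch below uses illustrative values for µ and σ; they are assumptions, not values taken from Lecture 1.

```python
import math

def Phi(x):
    # Standard normal cdf via the error function.
    return 0.5 * (1 + math.erf(x / math.sqrt(2)))

def error_prob(mu, sigma):
    # P(t < 0) when t ~ N(mu, sigma^2); by symmetry this also equals
    # P(t > 0) when t ~ N(-mu, sigma^2).
    return Phi(-mu / sigma)

# Hypothetical numbers: signal mean 1.0, noise standard deviation 0.5.
print(error_prob(1.0, 0.5))  # approximately 0.0228
```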

The general form of a Gaussian (or "Normal") density with mean µ and standard deviation σ is

(1/√(2πσ²)) e^{−(z − µ)²/(2σ²)}.

Shorthand notation for this distribution is N(µ, σ²) and shorthand for indicating that a random variable X has this distribution is X ∼ N(µ, σ²). So for our test statistic we could write t ∼ N(θ‖s‖²₂, n^{−1}). The "standard normal" distribution is N(0, 1).

Scalar random variables that take values in a discrete set, such as {1, 2, . . . }, do not have a density function. Instead they have a probability mass function (pmf). A pmf has the same interpretation as the density. If X takes values in a set {x1, x2, . . . } (which may be finite or infinite), then the pmf of X is given by

pX(x) = Σ_i P(X = x_i) 1_{x=x_i},

where 1_{x=x_i} is the indicator function that takes a value 1 if x = x_i and 0 otherwise. Note that pX(x) = P(X = x).

Example 7 One of the most common discrete distributions is the binomial distribution. Suppose you toss a coin n times and count the number of heads. This number is a random variable X taking a value between 0 and n, and of course it will depend on p, the probability of observing a head in a single toss. The binomial distribution gives the probability of observing k heads in the n trials, for k = 0, . . . , n, and has the following form:

pX(x) = Σ_{k=0}^{n} (n choose k) p^k (1 − p)^{n−k} 1_{x=k}

In other words, P(X = k) = (n choose k) p^k (1 − p)^{n−k}, where (n choose k) = n!/(k!(n − k)!) is the binomial coefficient. It is easy to understand this formula. There are (n choose k) sequences of heads and tails that have exactly k heads, and the probability of each such sequence is p^k (1 − p)^{n−k}. Our shorthand notation for this case is X ∼ Bi(n, p).
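The pmf can be computed directly with `math.comb`; the sketch below (illustrative, not from the lecture) checks that it sums to 1 for a small example.

```python
import math

def binom_pmf(k, n, p):
    # P(X = k) for X ~ Bi(n, p).
    return math.comb(n, k) * p**k * (1 - p)**(n - k)

# Fair coin, n = 4 tosses: the pmf sums to 1 and P(X = 2) = 6/16.
pmf = [binom_pmf(k, 4, 0.5) for k in range(5)]
print(pmf[2])    # 0.375
print(sum(pmf))  # 1.0
```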

To simplify the notation, often we will denote probability functions by F (x) or p(x), dropping the subscript and using the choice of argument variable to indicate the underlying random variable.

## 3 Multivariate Distributions

In many signal processing problems we will need to consider vector-valued random variables. Such a random variable consists of many "variates" (the individual elements making up the vector, which can be viewed as scalar random variables). Multivariate distributions describe the joint probability relationships among the variates. The most common multivariate distribution, and the one that will be used most often in the course, is the multivariate Gaussian (or Normal) distribution.

Suppose X : Ω → Rd; that is, X is a d-dimensional vector of scalar random variables. If the individual component random variables are independently distributed, then we have a trivial joint distribution which is just the product of the univariate distributions of each component. For example, if X = [X1, X2]T and X1 and X2 are independent and have densities, then the joint density of X is just pX(x) = pX1(x1) pX2(x2). Things are more interesting if the component random variables are not independent, in which case the multivariate distribution does not factor in this simple way. In fact, the joint distribution factorizes into univariate densities if and only if the component random variables are independent.

The multivariate Gaussian density is one model that can represent dependent random variables, and it has the following form:

pX(x) = (1/√((2π)^d |Σ|)) exp(−½ (x − µ)^T Σ^{−1} (x − µ))


where the argument x ∈ Rd, µ ∈ Rd is the mean vector of the density, and Σ ∈ Sd+ is the covariance matrix, where Sd+ denotes the cone of all positive semi-deﬁnite symmetric matrices (i.e., all symmetric matrices with non-negative eigenvalues). The term |Σ| is the determinant of Σ. We will use X ∼ N (µ, Σ) as shorthand for “X is multivariate Gaussian distributed with mean µ and covariance Σ.” It is easy to see that when d = 1 this reduces to the scalar Gaussian density function.

If Σ is a diagonal matrix (i.e., all off-diagonal entries are zero), then the component random variables are uncorrelated. Moreover, it is easy to verify that in that case the multivariate Gaussian density factorizes into univariate component densities, which means that uncorrelated Gaussian random variables are also independent. This is a special characteristic of the Gaussian density; in general being uncorrelated does not imply independence.

Figure 1: Multivariate Gaussian Distributions. The top row shows the Gaussian density (left) and its contour plot (right) with mean [0 0]T and covariance [1 0; 0 1]. The second row is the same except that the covariance is [1 0.5; 0.5 1]; positive correlation. The third row is the same except that the covariance is [1 −0.5; −0.5 1]; negative correlation.

Example 8 Suppose that we have a linear system driven by a Gaussian white noise process. That is, a sequence of independent N (0, 1) variables, denoted X1, . . . , Xn, is the input to the linear system. The output of the system can be expressed as Y = H X, where H is an n × n matrix describing the action of the system on the input sequence X = [X1, . . . , Xn]T . The joint distribution of X is N (0, I), where I denotes the n × n identity matrix. The joint distribution of the output Y is N (0, HHT ). The output is correlated due to the action of the system.

Example 9 Imagine we have a sensor network monitoring a manufacturing facility. The network consists of n nodes and its function is to monitor for failures in the system. Let X = [X1, . . . , Xn]T denote the set of


scalar measurements produced by the network and suppose that it can be modeled as a realization of a N (µ, Σ) random vector for a particular choice of µ and Σ. Furthermore, assume that the set B ⊂ Rn denotes the set of all vectors indicative of failures. Then the probability of failure is given by P(X ∈ B), which can be calculated (numerically) by integrating the N (µ, Σ) density over the set B.

It is sometimes necessary to consider conditional probabilities and expectations. Let X and Y be random variables. Suppose we observe Y = y. If X and Y are dependent, then knowledge that Y = y tells us something about X. The conditional probability of X given Y = y is deﬁned according to the deﬁnition of conditional probability. If the random variables have a joint density or mass function, then the conditional density or mass function is deﬁned as

p(x|y) = p(x, y) / p(y)

where p(y) = ∫ p(x, y) dx, x is the variable, and y is fixed. Suppose we were interested in the probability that X ∈ A, for some set A, given Y = y. Then we have P(X ∈ A | Y = y) = ∫_A p(x|y) dx.

## 4 Expectation

In addition to computing the probabilities that random variables fall into certain sets it is also useful to compute the average or expected value of random variables. If a random variable X has density pX (x), then the expectation of f (X), where f is any function of X, is

E[f(X)] = ∫ f(x) pX(x) dx

If the random variable is discrete, then the expectation is

E[f(X)] = Σ_i f(x_i) P(X = x_i)

Here are some special cases.

mean: µ = E[X]

variance: σ² = E[(X − E[X])²]

probability: P(X ∈ A) = E[1{X∈A}]

characteristic function: φ(ω) := E[exp(−iωX)]
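Each of these special cases is a sum over the pmf; the sketch below evaluates the first three exactly for a fair die (an assumed example, not from the lecture).

```python
from fractions import Fraction

# Expectations for a fair die, computed directly from the pmf.
support = range(1, 7)
pmf = {x: Fraction(1, 6) for x in support}

def E(f):
    # E[f(X)] = sum_i f(x_i) P(X = x_i)
    return sum(f(x) * pmf[x] for x in support)

mean = E(lambda x: x)                      # 7/2
var = E(lambda x: (x - mean) ** 2)         # 35/12
p_event = E(lambda x: 1 if x >= 5 else 0)  # P(X >= 5) as E of an indicator
print(mean, var, p_event)
```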

The characteristic function of X is the Fourier transform of its density. If X is a d-dimensional random variable, X = [X1, . . . , Xd]T, then the mean is

E[X] = [µ1, . . . , µd]^T

and the covariance matrix is Σ = E[(X − µ)(X − µ)^T], where Σi,j = E[(Xi − µi)(Xj − µj)], for i, j = 1, . . . , d. Note that if Y = AX, where A is an m × d matrix, then the random vector Y has mean E[AX] = Aµ and covariance E[(AX − Aµ)(AX − Aµ)^T] = AΣA^T. If X is multivariate Gaussian distributed, then so is Y (which can be shown using the characteristic function). In that case Y ∼ N(Aµ, AΣA^T) (recall Example 8). This is a special property of the Gaussian distribution (generally linear transformations will alter the distributional characteristics).
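The identities E[AX] = Aµ and Cov(AX) = AΣA^T can be checked on a small example; the plain-list matrix helpers below are illustrative assumptions, not lecture code.

```python
# Mean and covariance under a linear map Y = A X, with 2x2 matrices:
# E[Y] = A mu and Cov(Y) = A Sigma A^T.
def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(row) for row in zip(*A)]

A = [[1.0, 1.0],
     [0.0, 2.0]]
mu = [[1.0], [2.0]]                  # mean vector (column)
Sigma = [[1.0, 0.0], [0.0, 1.0]]     # identity covariance: white input

mean_Y = matmul(A, mu)                           # A mu
cov_Y = matmul(matmul(A, Sigma), transpose(A))   # A Sigma A^T
print(mean_Y)  # [[3.0], [4.0]]
print(cov_Y)   # [[2.0, 2.0], [2.0, 4.0]]
```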

We also sometimes need to use conditional expectations. For example, suppose X and Y are two dependent random variables. If we observe Y = y, then the expectation of f(X) may differ from E[f(X)] (the unconditional expectation). The conditional expectation is denoted by E[f(X) | Y = y]. If X and Y have a joint density, then the conditional expectation is computed using the conditional density as follows:

E[f(X) | Y = y] = ∫ f(x) p(x|y) dx

If X and Y are independent random variables, then E[XY ] = E[X] E[Y ], a simple consequence of the fact that the joint density or mass function factorizes.

## 5 Convergence of Sums of Independent Random Variables

The most important form of statistic considered in this course is a sum of independent random variables.

Example 10 A biologist is studying the new artificial lifeform called synthia. She is interested to see if the synthia cells can survive in cold conditions. To test synthia's hardiness, the biologist will conduct n independent experiments. She has grown n cell cultures under ideal conditions and then exposed each to cold conditions. The number of cells in each culture is measured before and after spending one day in cold conditions. The fraction of cells surviving the cold is recorded. Let x1, . . . , xn denote the recorded fractions. The average p̂ := (1/n) Σ_{i=1}^{n} xi is an estimator of the survival probability.

Understanding the behavior of sums of independent random variables is extremely important. For instance, the biologist in the example above would like to know that the estimator is reasonably accurate. Let X1, . . . , Xn be independent and identically distributed random variables with variance σ² < ∞ and consider the average µ̂ := (1/n) Σ_{i=1}^{n} Xi. First note that E[µ̂] = E[X]. An easy calculation shows that the variance of µ̂ is σ²/n. So the average has the same mean value as the random variables and the variance is reduced by a factor of n. Lower variance means less uncertainty. So it is possible to reduce uncertainty by averaging. The more we average, the less the uncertainty (assuming, as we are, that the random variables are independent, which implies they are uncorrelated).
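The factor-of-n variance reduction can be verified exactly by enumeration; the sketch below averages n = 2 fair die rolls (an assumed example, not from the lecture).

```python
from fractions import Fraction
from itertools import product

# n = 2 iid fair die rolls: the average keeps the mean and halves the variance.
faces = range(1, 7)
mean1 = Fraction(sum(faces), 6)                            # 7/2
var1 = sum((Fraction(x) - mean1) ** 2 for x in faces) / 6  # 35/12

avgs = [Fraction(a + b, 2) for a, b in product(faces, repeat=2)]
mean2 = sum(avgs) / len(avgs)
var2 = sum((a - mean2) ** 2 for a in avgs) / len(avgs)

assert mean2 == mean1    # same mean as a single roll
assert var2 == var1 / 2  # variance reduced by the factor n = 2
print(mean2, var2)
```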

The argument above quantifies the effect of averaging on the variance, but often we would like to say more about the distribution of the average. The Central Limit Theorem is a classic result showing that the probability distribution of the average of n independent and identically distributed random variables with mean µ and variance σ² < ∞ tends to a Gaussian distribution with mean µ and variance σ²/n, regardless of the form of the distribution of the variables. By 'tends to' we mean in the limit as n tends to infinity.

In many applications we would like to say something more about the distributional characteristics for finite values of n. One approach is to calculate the distribution of the average explicitly. Recall that if the random variables have a density pX, then the density of the sum Σ_{i=1}^{n} Xi is the n-fold convolution of the density pX with itself (again this hinges on the assumption that the random variables are independent; it is easy to see by considering the characteristic function of the sum and recalling that multiplication of Fourier transforms is equivalent to convolution in the inverse domain). However, this exact calculation can sometimes be difficult or impossible, if for instance we don't know the density pX, and so sometimes probability bounds are more useful.

Let Z be a non-negative random variable and take t > 0. Then

E[Z] ≥ E[Z 1Z≥t] ≥ E[t 1Z≥t] = t P(Z ≥ t)

The result P(Z ≥ t) ≤ E[Z]/t is called Markov’s Inequality. Now we can use this to get a bound on the probability ‘tails’ of Z. Let t > 0

P(|Z − E[Z]| ≥ t) = P((Z − E[Z])² ≥ t²) ≤ E[(Z − E[Z])²] / t² = Var(Z) / t²,


where Var(Z) denotes the variance of Z. This inequality is known as Chebyshev's Inequality. If we apply this to the average µ̂ = (1/n) Σ_{i=1}^{n} Xi, then we have

P(|µ̂ − µ| ≥ t) ≤ σ² / (n t²)

where µ and σ2 are the mean and variance of the random variables {Xi}. This shows that not only is the variance reduced by averaging, but the tails of the distribution (probability of observing values a distance of more than t from the mean) are smaller.

The tail bound given by Chebyshev's Inequality is loose, and much tighter bounds are possible under slightly stronger assumptions. In particular, if the random variables {Xi} are bounded or sub-Gaussian (meaning the tails of the probability distribution decay at least as fast as Gaussian tails), then the tails of the average converge exponentially fast in n. The simplest result of this form is for bounded random variables.

Theorem 1 (Hoeffding's Inequality). Let X1, X2, . . . , Xn be independent bounded random variables such that Xi ∈ [ai, bi] with probability 1. Let Sn = Σ_{i=1}^{n} Xi. Then for any t > 0, we have

P(|Sn − E[Sn]| ≥ t) ≤ 2 exp(−2t² / Σ_{i=1}^{n} (bi − ai)²)

If the random variables {Xi} are binary-valued, then this result is usually referred to as the Chernoff Bound. The proof of Hoeffding's Inequality, which relies on a clever generalization of Markov's inequality and some elementary concepts from convex analysis, is given in the next section.

Now suppose that the random variables in the average µ̂ = (1/n) Σ_{i=1}^{n} Xi are bounded according to a ≤ Xi ≤ b. Let c = (b − a)². Then Hoeffding's Inequality implies

P(|µ̂ − µ| ≥ t) ≤ 2 e^{−2nt²/c}   (1)

In other words, the tails of the distribution of the average are tending to zero at an exponential rate in n, much faster than indicated by Chebyshev's Inequality. Similar exponential tail bounds hold for averages of iid sub-Gaussian variables. Using tail bounds like these we can prove the so-called laws of large numbers.
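The gap between the two bounds is easy to see numerically; the values of n, t, and the variance below are illustrative assumptions (variance at most 1/4 for variables bounded in [0, 1]).

```python
import math

# Tail bounds on P(|average of n iid [0,1] variables - mu| >= t).
def chebyshev_bound(n, t, var=0.25):
    # Chebyshev: uses only the variance; decays like 1/n.
    return min(1.0, var / (n * t * t))

def hoeffding_bound(n, t, c=1.0):
    # Hoeffding with c = (b - a)^2 = 1 for [0,1]; decays exponentially in n.
    return min(1.0, 2 * math.exp(-2 * n * t * t / c))

n, t = 1000, 0.05
print(chebyshev_bound(n, t))  # 0.1
print(hoeffding_bound(n, t))  # approximately 0.0135
```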

Theorem 2 (Weak Law of Large Numbers). Let X1, X2, . . . , Xn be iid random variables with E[|Xi|] < ∞. Then (1/n) Σ_{i=1}^{n} Xi converges in probability to E[Xi].

Theorem 3 (Strong Law of Large Numbers). Let X1, X2, . . . , Xn be iid random variables with E[|Xi|] < ∞. Then (1/n) Σ_{i=1}^{n} Xi converges almost surely to E[Xi].

Example 11 Let us revisit the synthia experiments. The biologist has collected n observations, x1, . . . , xn, each corresponding to the fraction of cells that survived in a given experiment. Her estimator of the survival rate is (1/n) Σ_{i=1}^{n} xi. How confident can she be that this is an accurate estimator of the true survival rate? Let us model her observations as realizations of n iid random variables X1, . . . , Xn with mean p and define p̂ = (1/n) Σ_{i=1}^{n} Xi. We say that her estimator is probably approximately correct with non-negative parameters (ε, δ) if

P(|p̂ − p| > ε) ≤ δ

The random variables are bounded between 0 and 1 and so the value of c in (1) above is equal to 1. For desired accuracy ε > 0 and confidence 1 − δ, how many experiments will be sufficient? From (1) we equate δ = 2 exp(−2nε²), which yields n ≥ (1/(2ε²)) log(2/δ). Note that this requires no knowledge of the distribution of the {Xi} apart from the fact that they are bounded. The result can be summarized as follows: if n ≥ (1/(2ε²)) log(2/δ), then the probability that her estimate is off the mark by more than ε is less than δ.
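The sample-size formula n ≥ log(2/δ)/(2ε²) is easy to evaluate; `required_n` below is a hypothetical helper name, not from the lecture.

```python
import math

def required_n(eps, delta):
    # Smallest n with 2 exp(-2 n eps^2) <= delta,
    # i.e. n >= log(2/delta) / (2 eps^2).
    return math.ceil(math.log(2 / delta) / (2 * eps * eps))

# Accuracy 0.1 with confidence 95% (delta = 0.05) requires 185 experiments.
print(required_n(0.1, 0.05))  # 185
```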


## 6 Proof of Hoeffding's Inequality

Let X be any random variable and s > 0. Note that P(X ≥ t) = P(e^{sX} ≥ e^{st}) ≤ e^{−st} E[e^{sX}], by Markov's inequality and the fact that e^{sx} is a non-negative monotone increasing function. For clever choices of s this can be quite a good bound.

Let’s look now at

n i=1

Xi

− E[Xi].

Then

n

P( Xi − E[Xi] ≥ t) ≤ e−stE es(Pni=1 Xi−E[Xi])

i=1

= e−stE

n

es(Xi −E[Xi ])

i=1 n

= e−st E es(Xi−E[Xi]) ,

i=1

where the last step follows from the independence of the Xi’s. To complete the proof we need to ﬁnd a good bound for E es(Xi−E[Xi]) .

Figure 2: Convexity of exponential function.

Lemma 1 Let Z be a r.v. such that E[Z] = 0 and a ≤ Z ≤ b with probability one. Then

E[e^{sZ}] ≤ e^{s²(b−a)²/8}.

This upper bound is derived as follows. By the convexity of the exponential function (see Fig. 2),

e^{sz} ≤ ((z − a)/(b − a)) e^{sb} + ((b − z)/(b − a)) e^{sa}, for a ≤ z ≤ b.

Thus,

E[e^{sZ}] ≤ E[(Z − a)/(b − a)] e^{sb} + E[(b − Z)/(b − a)] e^{sa}
= (b/(b − a)) e^{sa} − (a/(b − a)) e^{sb}, since E[Z] = 0
= (1 − λ + λ e^{s(b−a)}) e^{−λs(b−a)}, where λ = −a/(b − a)


Now let u = s(b − a) and define

φ(u) ≡ −λu + log(1 − λ + λe^u),

so that

E[e^{sZ}] ≤ (1 − λ + λe^{s(b−a)}) e^{−λs(b−a)} = e^{φ(u)}.

We want to find a good upper bound on e^{φ(u)}. Let's express φ(u) as its Taylor series with remainder:

φ(u) = φ(0) + uφ′(0) + (u²/2) φ″(v) for some v ∈ [0, u].

φ′(u) = −λ + λe^u/(1 − λ + λe^u) ⇒ φ′(0) = 0

φ″(u) = λe^u/(1 − λ + λe^u) − λ²e^{2u}/(1 − λ + λe^u)²
= (λe^u/(1 − λ + λe^u)) (1 − λe^u/(1 − λ + λe^u))
= ρ(1 − ρ),

where ρ = λe^u/(1 − λ + λe^u). Now note that ρ(1 − ρ) ≤ 1/4 for any value of ρ (the maximum is attained when ρ = 1/2), therefore φ″(u) ≤ 1/4. Since φ(0) = 0 and φ′(0) = 0, we finally have φ(u) ≤ u²/8 = s²(b − a)²/8, and therefore

E[e^{sZ}] ≤ e^{s²(b−a)²/8}.

Now, we can apply this upper bound to derive Hoeffding's inequality.

P(Sn − E[Sn] ≥ t) ≤ e^{−st} Π_{i=1}^{n} E[e^{s(Xi − E[Xi])}]
≤ e^{−st} Π_{i=1}^{n} e^{s²(bi−ai)²/8}
= e^{−st} e^{s² Σ_{i=1}^{n} (bi−ai)²/8}
= e^{−2t² / Σ_{i=1}^{n} (bi−ai)²}, by choosing s = 4t / Σ_{i=1}^{n} (bi − ai)²

The same result applies to the r.v.'s −X1, . . . , −Xn, and combining these two results yields the claim of the theorem.
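As a numerical sanity check (not part of the proof), the bound can be compared with the exact binomial tail for fair coin flips, where each Xi ∈ [0, 1] and Σ(bi − ai)² = n.

```python
import math

def exact_tail(n, t):
    # P(|S_n - n/2| >= t) for S_n = number of heads in n fair flips,
    # computed exactly from the binomial pmf.
    return sum(math.comb(n, k) * 0.5**n
               for k in range(n + 1) if abs(k - n / 2) >= t)

def hoeffding(n, t):
    # Hoeffding bound with sum (b_i - a_i)^2 = n.
    return 2 * math.exp(-2 * t * t / n)

n = 100
for t in [5, 10, 20, 30]:
    # The exact tail never exceeds the bound, as the theorem guarantees.
    assert exact_tail(n, t) <= hoeffding(n, t)
print("Hoeffding bound holds for all tested t")
```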

Lecture 2: Review of Probability Theory

Probabilistic models will be used throughout the course to represent noise, errors, and uncertainty in signal processing problems. This lecture reviews the basic notation, terminology and concepts of probability theory.

1 Basic Probability

Probability theory begins with three basic components. The set of all possible outcomes, denoted Ω. The collection of all sets of outcomes (events), denoted A. And a probability measure P. Speciﬁcation of the triple (Ω, A, P) deﬁnes the probability space which models a real-world measurement or experimental process. Example 1

Ω = {all outcomes of the roll of a die} = {1, 2, 3, 4, 5, 6}

A = {all possible sets of outcomes} = {{1}, . . . , {6}, {1, 2}, . . . , {5, 6}, . . . , {1, 2, 3, 4, 5, 6}}

P = probability of all sets/events What is the probability of a given ω ∈ Ω, say ω = 3? What is the probability of the event ω ∈ {1, 2, 3}?

The basic physical model here is that all six outcomes are equally probable.

1.1 Probability Measures

Probability measures must satisfy the following properties: 1. P(A) ≥ 0 , ∀A ∈ A 2. P (Ω) = 1 3. if A ∩ B = ∅, then P(A ∪ B) = P(A) + P(B)

Show that the last condition also implies that in general P(A ∪ B) ≤ P(A) + P(B), an inequality sometimes called the union bound.

1.2 Conditional Probability

Consider two events A, B ∈ A. The (conditional) probability that A occurs given B occurs is P (A|B) := P(A ∩ B) P(B)

1

Lecture 2: Review of Probability Theory

2

Example 2

Ω = {1, 2, 3, 4, 5, 6} A = {1, 2} B = {2, 3}

Recall that we say that A “occurs” if the outcome is either 1 or 2. Now suppose you are told that B occurs. Then it seems more probable that A has also occurred.

The probability that A occurs, without knowledge of whether B has occurred, is 1/3. That is, P(A) = 1/3. From the formula above it is an easy calculation to see that

P (A|B) = P(A ∩ B) = P({2}) = 1/6 = 1/2

P(B)

P({2, 3}) 1/3

So we see that knowledge of B provides important information about A. Does this make sense with your intuition?

1.3 Independence

Two events A and B are said to be independent if P (A|B) = P(A). In otherwords, B provides no information about whether A has occurred.

Example 3 Supose we have two dice. Then

Ω = {all pairs of outcomes of the roll two dice} = {(1, 1), (1, 2), . . . , (6, 6)}

Let A = {1st die is 1} and B = {2nd die is 1}. P(A) = 1/6. P (A|B) = P(P{(({11,1}))}) = 11//366 = 1/6. As we know, the value of one die does not inﬂuence the other. The two outcomes are independent.

1.4 Bayes Rule

Again consider two events A and B. Recall the conditional probability P (A|B). Bayes rule is a formula for the “inverse” conditional probability, P (B|A). Bayes’ Rule is

P (B|A) = P (A|B) P(B) P(A)

It is easy to verify by recalling the deﬁnition of conditional probability above. This inversion formula will play a central role in signal estimation problems later in the course.

Example 4 Geneticists have determined that a particular genetic defect is related to a certain disease. Many people with the disease also have this defect, but there are disease-free people with the defect. The geneticists have found that 0.01% of the general population has the disease and that the defect is present in 50% of these cases. They also know that 0.1% of the population has the defect. What is the probability that a person with the defect has the disease?

This is a simple application of Bayes’ Rule. We are interested in two events: ‘defect’ and ‘disease’. Here is what the geneticists know: P(disease) = 0.0001, P (defect|disease) = 0.5, P(defect) = 0.001. We have everything we need to apply Bayes’ Rule:

P (disease|defect) = P (defect|disease) P(disease) = 0.5 × 0.0001 = 0.05

P(defect)

0.001

In other words, if a person has the defect then the chance they have the disease is 5%. In the general population, on the other hand, the chance that a person has the disease is 0.01%. So this “genetic marker” is quite informative.

Lecture 2: Review of Probability Theory

3

2 Random Variables

Often in applications the underlying probability space is not explicitly identiﬁed. Rather we work with random variables which are mappings from Ω to more concrete spaces such as Rn.

Example 5 A real-valued random variable is a mapping X : Ω → R, which means that for every ω ∈ Ω there is a corresponding value X(ω) ∈ R. Random variables can also be vectors, for example a mapping X : Ω → Rn. Since P speciﬁes the probability of every subset of Ω, it also induces probabilities on events expressed in terms of X. For example, if X is a real-valued scalar random variable, then this is an event:

{X ≥ 0} ≡ {ω : X(ω) ≥ 0}

Therefore we can compute the probability of this event:

P(X ≥ 0) = P({ω : X(ω) ≥ 0})

More generally, for any set A we may consider the event {X ∈ A} and its probability P(X ∈ A).

2.1 Cumulative Distributions and Densities

For real-valued scalar random variables it is common to consider the probabilities P(X ≤ x) as a function of x. This function is called the cumulative distribution function of X, and it is often denoted by FX (x). The name can be understood by observing that the diﬀerence between FX (x1) and FX (x2) for any x2 > x1 is equal to the probability that X takes a value in the interval (x1, x2]; i.e.,

FX (x2) − FX (x1) = P(X ≤ x2) − P(X ≤ x1) = P(x1 < X ≤ x2)

The equality above follows by noting that {X ≤ x2} = {X ≤ x1} ∪ {x1 < X ≤ x2} and these two sets are disjoint. Therefore by the basic property of probabilities we have that

P(X ≤ x2) = P(X ≤ x1) + P(x1 < X ≤ x2).

For continuous random variables we can consider the limit of the diﬀerence as x2 → x1. Assume that

the

limit

lim∆→0

FX (x+∆)−FX (x) ∆

:= pX (x) exists, in other words FX

is diﬀerentiable at x.

Since Fx is a

monotonic increasing function, pX (x) ≥ 0. If Fx is diﬀerentiable everywhere, then by the Fundamental

Theorem of Calculus we have x

FX (x) =

pX (x) dx

−∞

Note also that limx→∞ FX (x) = −∞∞ pX (x) = 1, since with probability 1 the variable X takes on a ﬁnite value. The function pX (x) is called the probability density function (pdf) of X. Observe that P(x1 < X ≤ x2) = xx12 pX (x) dx, which explains the term “density.”

Example 6 Recall the test statistic from the communication problem in Lecture 1

n

n

t=

sixi

=

θ

s

2 2

+

si i

i=1

i=1

If the errors are noise-like then we expect z :=

n i=1

si

i

≈ 0.

Moreover

as

n

increases

it

is

reasonable

to

suppose that the approximation becomes more and more accurate since the individual errors may be randomly

positive

or

negative

2

and

tend

to

cancel

each

other

out.

A

reasonable

probability

den√sity

model

for

z

is

then

p(z) := √ 1

e

−

1 2

z

/n,

a

Gaussian

density

with

zero

mean

and

standard

deviation

1/

n. As n gets larger the

2πn

probability mass is more concentrated about 0. Since t is just a shifted version of z (shifted by the constant

θ s 22), the density for t is

p(t) = √ 1

e

−

1 2

(t

−θ

s

22 )2 /n

2πn


This density is peaked about ‖s‖_2^2 when θ = +1 and about −‖s‖_2^2 when θ = −1, so our test of whether t was greater or less than 0 is reasonable. Assuming this model, we can easily calculate the probabilities of an incorrect decision about which bit was sent; i.e., P(t < 0 | θ = +1) and P(t > 0 | θ = −1).

The general form of a Gaussian (or “Normal”) density with mean µ and standard deviation σ is

p(z) = (1/√(2πσ²)) e^{−(z − µ)²/(2σ²)}.

Shorthand notation for this distribution is N(µ, σ²), and shorthand for indicating that a random variable X has this distribution is X ∼ N(µ, σ²). So for our test statistic we could write t ∼ N(θ‖s‖_2^2, n). The “standard normal” distribution is N(0, 1).
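For instance, the error probability P(t < 0 | θ = +1) can be evaluated with the Gaussian cdf (a sketch, assuming the hypothetical signal s_i = 1 for all i, so that ‖s‖_2^2 = n):

```python
import math

def gaussian_cdf(x, mu, sigma):
    # P(X <= x) for X ~ N(mu, sigma^2), expressed via the error function
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))

# Hypothetical signal with s_i = 1 for all i, so ||s||_2^2 = n
n = 25
mean_shift = float(n)      # theta * ||s||_2^2 with theta = +1
sigma = math.sqrt(n)       # the model above has variance n

# Probability of deciding the wrong bit when theta = +1: P(t < 0)
p_error = gaussian_cdf(0.0, mean_shift, sigma)
```

With these illustrative numbers the error probability is tiny, reflecting the growing separation between the two peaks.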

Scalar random variables that take values in a discrete set, such as {1, 2, . . .}, do not have a density function. Instead they have a probability mass function (pmf). A pmf has the same interpretation as the density. If X takes values in a set {x1, x2, . . .} (which may be finite or infinite), then the pmf of X is given by

p_X(x) = ∑_i P(X = x_i) 1_{x = x_i},

where 1_{x = x_i} is the indicator function that takes the value 1 if x = x_i and 0 otherwise. Note that p_X(x) = P(X = x).

Example 7 One of the most common discrete distributions is the binomial distribution. Suppose you toss a coin n times and count the number of heads. This number is a random variable X taking a value between 0 and n, and of course it will depend on p, the probability of observing a head in a single toss. The binomial distribution gives the probability of observing k heads in the n trials, for k = 0, . . . , n, and has the following form:

p_X(x) = ∑_{k=0}^{n} (n choose k) p^k (1 − p)^{n−k} 1_{x = k}

In other words, P(X = k) = (n choose k) p^k (1 − p)^{n−k}, where (n choose k) = n!/(k!(n − k)!) is the binomial coefficient. It is easy to understand this formula: there are (n choose k) sequences of heads and tails that have exactly k heads, and the probability of each such sequence is p^k (1 − p)^{n−k}. Our shorthand notation for this case is X ∼ Bi(n, p).
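A minimal sketch of this pmf (using Python's math.comb for the binomial coefficient) confirms that the probabilities sum to 1:

```python
from math import comb

def binomial_pmf(k, n, p):
    # P(X = k) for X ~ Bi(n, p)
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)

# Illustrative parameters: n = 10 tosses, head probability p = 0.3
n, p = 10, 0.3
pmf = [binomial_pmf(k, n, p) for k in range(n + 1)]
total = sum(pmf)  # the pmf sums to 1 over k = 0, ..., n
```

For example, binomial_pmf(0, 10, 0.3) is just (1 − p)^n, the probability of seeing no heads at all.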

To simplify the notation, we will often denote probability functions by F(x) or p(x), dropping the subscript and using the choice of argument variable to indicate the underlying random variable.

3 Multivariate Distributions

In many signal processing problems we will need to consider vector-valued random variables. Such a random

variable consists of many “variates” (the individual elements making up the vector which can be viewed as

scalar random variables). Multivariate distributions describe the joint probability relationships among the

variates. The most common multivariate distribution, and the one that will be used most often in the course,

is the multivariate Gaussian (or Normal) distribution.

Suppose X : Ω → Rd; that is X is a d-dimensional vector of scalar random variables. If the individual

component random variables are independently distributed, then we have a trivial joint distribution which

is just the product of the univariate distributions of each component. For example, if X = [X1 , X2]T and

X1 and X2 are independent and have densities, then the joint density of X is just pX (x) = pX1 (x1)pX2 (x2).

Things are more interesting if the component random variables are not independent, in which case the

multivariate distribution does not factor in this simple way. In fact, the joint distribution factorizes into

univariate densities if and only if the component random variables are independent.

The multivariate Gaussian density is one model that can represent dependent random variables, and it has the following form:

p_X(x) = (1/√((2π)^d |Σ|)) exp( −(1/2) (x − µ)^T Σ^{−1} (x − µ) )


where the argument x ∈ R^d, µ ∈ R^d is the mean vector of the density, and Σ ∈ S^d_+ is the covariance matrix, where S^d_+ denotes the cone of all positive semi-definite symmetric matrices (i.e., all symmetric matrices with non-negative eigenvalues); for the density to be well defined, Σ must in fact be positive definite, so that Σ^{−1} exists. The term |Σ| is the determinant of Σ. We will use X ∼ N(µ, Σ) as shorthand for “X is multivariate Gaussian distributed with mean µ and covariance Σ.” It is easy to see that when d = 1 this reduces to the scalar Gaussian density function.

If Σ is a diagonal matrix (i.e., all off-diagonal entries are zero), then the component random variables are uncorrelated. Moreover, it is easy to verify that in that case the multivariate Gaussian density factorizes into univariate component densities, which means that uncorrelated Gaussian random variables are also independent. This is a special characteristic of the Gaussian density; in general, being uncorrelated does not imply independence.

Figure 1: Multivariate Gaussian distributions. The top row shows the Gaussian density (left) and its contour plot (right) with mean [0 0]^T and covariance [1 0; 0 1]. The second row is the same except that the covariance is [1 0.5; 0.5 1] (positive correlation). The third row is the same except that the covariance is [1 −0.5; −0.5 1] (negative correlation).

Example 8 Suppose that we have a linear system driven by a Gaussian white noise process. That is, a sequence of independent N(0, 1) variables, denoted X1, . . . , Xn, is the input to the linear system. The output of the system can be expressed as Y = HX, where H is an n × n matrix describing the action of the system on the input sequence X = [X1, . . . , Xn]^T. The joint distribution of X is N(0, I), where I denotes the n × n identity matrix. The joint distribution of the output Y is N(0, HH^T). The output is correlated due to the action of the system.
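This can be checked by simulation (a sketch, with a hypothetical 2 × 2 system matrix H chosen only for illustration): the empirical covariance of Y = HX should approach HH^T.

```python
import random

random.seed(0)

# A fixed 2x2 system matrix H (hypothetical values, for illustration only)
H = [[1.0, 0.0],
     [0.5, 1.0]]

def apply(M, x):
    # matrix-vector product M x
    return [sum(M[i][j] * x[j] for j in range(len(x))) for i in range(len(M))]

# Draw realizations of X ~ N(0, I) and pass them through the system: Y = H X
N = 200_000
ys = [apply(H, [random.gauss(0.0, 1.0), random.gauss(0.0, 1.0)]) for _ in range(N)]

# The empirical covariance of Y should approach H H^T = [[1, 0.5], [0.5, 1.25]]
cov = [[sum(y[i] * y[j] for y in ys) / N for j in range(2)] for i in range(2)]
```

The off-diagonal entry of the empirical covariance is clearly nonzero: the system has introduced correlation between the two output components.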

Example 9 Imagine we have a sensor network monitoring a manufacturing facility. The network consists of n nodes and its function is to monitor for failures in the system. Let X = [X1, . . . , Xn]^T denote the set of scalar measurements produced by the network, and suppose that it can be modeled as a realization of a N(µ, Σ) random vector for a particular choice of µ and Σ. Furthermore, assume that the set B ⊂ R^n denotes the set of all vectors indicative of failures. Then the probability of failure is given by P(X ∈ B), which can be calculated (numerically) by integrating the N(µ, Σ) density over the set B.

It is sometimes necessary to consider conditional probabilities and expectations. Let X and Y be random variables. Suppose we observe Y = y. If X and Y are dependent, then knowledge that Y = y tells us something about X. The conditional probability of X given Y = y is defined according to the definition of conditional probability. If the random variables have a joint density or mass function, then the conditional density or mass function is defined as

p(x|y) = p(x, y) / p(y)

where p(y) = ∫ p(x, y) dx, x is the variable, and y is fixed. Suppose we were interested in the probability that X ∈ A, for some set A, given Y = y. Then we have P(X ∈ A | Y = y) = ∫_A p(x|y) dx.

4 Expectation

In addition to computing the probabilities that random variables fall into certain sets, it is also useful to compute the average or expected value of random variables. If a random variable X has density p_X(x), then the expectation of f(X), where f is any function of X, is

E[f(X)] = ∫ f(x) p_X(x) dx

If the random variable is discrete, then the expectation is

E[f(X)] = ∑_i f(x_i) P(X = x_i)

Here are some special cases.

mean: µ = E[X]

variance: σ2 = E[(X − E[X])2]

probability: P(X ∈ A) = E[1{X∈A}]

characteristic function: φ(ω) := E[exp(−iωX)]
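Each of the first three special cases is just an expectation, so each can be approximated by a sample average (a minimal Monte Carlo sketch, assuming the illustrative choice X ∼ N(2, 3²)):

```python
import random

random.seed(1)

# Sample-average approximations of the special cases above,
# for X ~ N(2, 3^2) (hypothetical parameters, for illustration only)
N = 200_000
xs = [random.gauss(2.0, 3.0) for _ in range(N)]

mean = sum(xs) / N                           # estimates E[X] = 2
var = sum((x - mean) ** 2 for x in xs) / N   # estimates E[(X - E[X])^2] = 9
prob = sum(1 for x in xs if x > 2.0) / N     # estimates P(X > 2) = E[1_{X > 2}] = 1/2
```

Note how the probability is computed as the expectation of an indicator, exactly as in the list above.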

The characteristic function of X is the Fourier transform of its density. If X is a d-dimensional random variable, X = [X1, . . . , Xd]^T, then the mean is the vector

E[X] = [µ1, . . . , µd]^T

and the covariance matrix is Σ = E[(X − µ)(X − µ)^T], where Σ_{i,j} = E[(X_i − µ_i)(X_j − µ_j)], for i, j = 1, . . . , d. Note that if Y = AX, where A is an m × d matrix, then the random vector Y has mean E[AX] = Aµ and covariance E[(AX − Aµ)(AX − Aµ)^T] = AΣA^T. If X is multivariate Gaussian distributed, then so is Y (which can be shown using the characteristic function). In that case Y ∼ N(Aµ, AΣA^T) (recall Example 8). This is a special property of the Gaussian distribution (generally, linear transformations alter the distributional characteristics).

We also sometimes need to use conditional expectations. For example, suppose X and Y are two dependent random variables. If we observe Y = y, then the expectation of f(X) may differ from E[f(X)] (the unconditional expectation). The conditional expectation is denoted by E[f(X) | Y = y]. If X and Y have a joint density, then the conditional expectation is computed using the conditional density as follows:

E[f(X) | Y = y] = ∫ f(x) p(x|y) dx

If X and Y are independent random variables, then E[XY ] = E[X] E[Y ], a simple consequence of the fact that the joint density or mass function factorizes.

5 Convergence of Sums of Independent Random Variables

The most important form of statistic considered in this course is a sum of independent random variables.

Example 10 A biologist is studying the new artificial lifeform called synthia. She is interested to see if the synthia cells can survive in cold conditions. To test synthia's hardiness, the biologist will conduct n independent experiments. She has grown n cell cultures under ideal conditions and then exposed each to cold conditions. The number of cells in each culture is measured before and after spending one day in cold conditions. The fraction of cells surviving the cold is recorded. Let x1, . . . , xn denote the recorded fractions. The average p̂ := (1/n) ∑_{i=1}^{n} x_i is an estimator of the survival probability.

Understanding the behavior of sums of independent random variables is extremely important. For instance, the biologist in the example above would like to know that the estimator is reasonably accurate. Let X1, . . . , Xn be independent and identically distributed random variables with variance σ² < ∞, and consider the average µ̂ := (1/n) ∑_{i=1}^{n} X_i. First note that E[µ̂] = E[X_i]. An easy calculation shows that the variance of µ̂ is σ²/n. So the average has the same mean value as the random variables, and the variance is reduced by a factor of n. Lower variance means less uncertainty, so it is possible to reduce uncertainty by averaging. The more we average, the less the uncertainty (assuming, as we are, that the random variables are independent, which implies they are uncorrelated).
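The variance-reduction claim is easy to verify by simulation (a sketch using Uniform(0, 1) draws, whose variance is 1/12, chosen only for illustration):

```python
import random

random.seed(2)

# Empirical check that averaging n iid draws divides the variance by n.
# Uniform(0, 1) has mean 1/2 and variance 1/12.
def average_of(n):
    return sum(random.random() for _ in range(n)) / n

n, trials = 16, 50_000
avgs = [average_of(n) for _ in range(trials)]
m = sum(avgs) / trials
v = sum((a - m) ** 2 for a in avgs) / trials   # should be near (1/12)/16
```

The empirical variance of the averages sits close to (1/12)/n, while their mean stays at 1/2.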

The argument above quantifies the effect of averaging on the variance, but often we would like to say more about the distribution of the average. The Central Limit Theorem is a classic result showing that the probability distribution of the average of n independent and identically distributed random variables with mean µ and variance σ² < ∞ tends to a Gaussian distribution with mean µ and variance σ²/n, regardless of the form of the distribution of the variables. By ‘tends to’ we mean in the limit as n tends to infinity.

In many applications we would like to say something more about the distributional characteristics for finite values of n. One approach is to calculate the distribution of the average explicitly. Recall that if the random variables have a density p_X, then the density of the sum ∑_{i=1}^{n} X_i is the n-fold convolution of the density p_X with itself (again this hinges on the assumption that the random variables are independent; it is easy to see by considering the characteristic function of the sum and recalling that multiplication of Fourier transforms is equivalent to convolution in the inverse domain). However, this exact calculation can sometimes be difficult or impossible, if for instance we don't know the density p_X, and so probability bounds are often more useful.

Let Z be a non-negative random variable and take t > 0. Then

E[Z] ≥ E[Z 1_{Z ≥ t}] ≥ E[t 1_{Z ≥ t}] = t P(Z ≥ t)

The result P(Z ≥ t) ≤ E[Z]/t is called Markov's Inequality. Now we can use this to get a bound on the probability ‘tails’ of Z. Let t > 0:

P(|Z − E[Z]| ≥ t) = P((Z − E[Z])² ≥ t²)
                  ≤ E[(Z − E[Z])²] / t²
                  = Var(Z) / t²,


where Var(Z) denotes the variance of Z. This inequality is known as Chebyshev's Inequality. If we apply this to the average µ̂ = (1/n) ∑_{i=1}^{n} X_i, then we have

P(|µ̂ − µ| ≥ t) ≤ σ²/(nt²)

where µ and σ² are the mean and variance of the random variables {X_i}. This shows that not only is the variance reduced by averaging, but the tails of the distribution (the probability of observing values a distance of more than t from the mean) are smaller.
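The bound can be compared with an empirical tail probability (a sketch, again using Uniform(0, 1) draws so that σ² = 1/12); Chebyshev's Inequality holds but is typically far from tight:

```python
import random

random.seed(3)

# Chebyshev's bound sigma^2/(n t^2) versus the empirical tail probability
# of the average of n Uniform(0, 1) draws (mu = 1/2, sigma^2 = 1/12).
n, t, trials = 30, 0.1, 50_000
sigma2 = 1.0 / 12.0

def average_of(n):
    return sum(random.random() for _ in range(n)) / n

tail = sum(1 for _ in range(trials) if abs(average_of(n) - 0.5) >= t) / trials
bound = sigma2 / (n * t * t)   # about 0.28; the true tail is far smaller
```

The empirical tail comes out several times smaller than the bound, foreshadowing the sharper exponential bounds below.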

The tail bound given by Chebyshev's Inequality is loose, and much tighter bounds are possible under slightly stronger assumptions. In particular, if the random variables {X_i} are bounded or sub-Gaussian (meaning the tails of the probability distribution decay at least as fast as Gaussian tails), then the tails of the average converge exponentially fast in n. The simplest result of this form is for bounded random variables.

Theorem 1 (Hoeffding's Inequality). Let X1, X2, . . . , Xn be independent bounded random variables such that X_i ∈ [a_i, b_i] with probability 1. Let S_n = ∑_{i=1}^{n} X_i. Then for any t > 0, we have

P(|S_n − E[S_n]| ≥ t) ≤ 2 e^{−2t²/∑_{i=1}^{n}(b_i − a_i)²}

If the random variables {X_i} are binary-valued, then this result is usually referred to as the Chernoff Bound. The proof of Hoeffding's Inequality, which relies on a clever generalization of Markov's inequality and some elementary concepts from convex analysis, is given in the next section.

Now suppose that the random variables in the average µ̂ = (1/n) ∑_{i=1}^{n} X_i are bounded according to a ≤ X_i ≤ b. Let c = (b − a)². Then Hoeffding's Inequality implies

P(|µ̂ − µ| ≥ t) ≤ 2 e^{−2nt²/c}    (1)

In other words, the tails of the distribution of the average tend to zero at an exponential rate in n, much faster than indicated by Chebyshev's Inequality. Similar exponential tail bounds hold for averages of iid sub-Gaussian variables. Using tail bounds like these we can prove the so-called laws of large numbers.
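A quick numerical comparison (a sketch for averages of Uniform(0, 1) variables, where c = 1 and σ² = 1/12) shows the exponential bound overtaking Chebyshev's as n grows:

```python
import math

# Chebyshev vs. Hoeffding tail bounds for the average of n Uniform(0, 1)
# draws, where a = 0, b = 1 (so c = 1) and sigma^2 = 1/12.
def chebyshev_bound(n, t, sigma2=1.0 / 12.0):
    return sigma2 / (n * t * t)

def hoeffding_bound(n, t, c=1.0):
    return 2.0 * math.exp(-2.0 * n * t * t / c)

n, t = 2000, 0.05
cheb = chebyshev_bound(n, t)   # decays like 1/n
hoef = hoeffding_bound(n, t)   # decays exponentially in n
```

For small n Chebyshev's bound can actually be the smaller of the two; the exponential rate wins once n is moderately large.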

Theorem 2 (Weak Law of Large Numbers). Let X1, X2, . . . , Xn be iid random variables with E[|X_i|] < ∞. Then (1/n) ∑_{i=1}^{n} X_i converges in probability to E[X_i].

Theorem 3 (Strong Law of Large Numbers). Let X1, X2, . . . , Xn be iid random variables with E[|X_i|] < ∞. Then (1/n) ∑_{i=1}^{n} X_i converges almost surely to E[X_i].

Example 11 Let us revisit the synthia experiments. The biologist has collected n observations, x1, . . . , xn, each corresponding to the fraction of cells that survived in a given experiment. Her estimator of the survival rate is (1/n) ∑_{i=1}^{n} x_i. How confident can she be that this is an accurate estimator of the true survival rate? Let us model her observations as realizations of n iid random variables X1, . . . , Xn with mean p, and define p̂ = (1/n) ∑_{i=1}^{n} X_i. We say that her estimator is probably approximately correct with non-negative parameters (ε, δ) if

P(|p̂ − p| > ε) ≤ δ

The random variables are bounded between 0 and 1, so the value of c in (1) above is equal to 1. For a desired accuracy ε > 0 and confidence 1 − δ, how many experiments will be sufficient? From (1) we equate δ = 2 exp(−2nε²), which yields n ≥ (1/(2ε²)) log(2/δ). Note that this requires no knowledge of the distribution of the {X_i} apart from the fact that they are bounded. The result can be summarized as follows: if n ≥ (1/(2ε²)) log(2/δ), then the probability that her estimate is off the mark by more than ε is less than δ.
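The sample-size rule can be sketched in a few lines (the function name is ours, not from the notes):

```python
import math

# Sufficient number of experiments for the (eps, delta) guarantee:
# n >= log(2/delta) / (2 eps^2), from Hoeffding's bound (1) with c = 1.
def sample_size(eps, delta):
    return math.ceil(math.log(2.0 / delta) / (2.0 * eps ** 2))

n_needed = sample_size(0.1, 0.05)   # accuracy 0.1 with 95% confidence
```

For accuracy ε = 0.1 and confidence 95%, on the order of a couple hundred experiments suffice; halving ε quadruples the requirement.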


6 Proof of Hoeffding's Inequality

Let X be any random variable and s > 0. Note that P(X ≥ t) = P(e^{sX} ≥ e^{st}) ≤ e^{−st} E[e^{sX}], by Markov's inequality, noting that e^{sx} is a non-negative monotone increasing function. For clever choices of s this can be quite a good bound.

Let's look now at ∑_{i=1}^{n} (X_i − E[X_i]). Then

P( ∑_{i=1}^{n} (X_i − E[X_i]) ≥ t ) ≤ e^{−st} E[ e^{s ∑_{i=1}^{n} (X_i − E[X_i])} ]
                                    = e^{−st} E[ ∏_{i=1}^{n} e^{s(X_i − E[X_i])} ]
                                    = e^{−st} ∏_{i=1}^{n} E[ e^{s(X_i − E[X_i])} ],

where the last step follows from the independence of the X_i's. To complete the proof we need to find a good bound for E[ e^{s(X_i − E[X_i])} ].

Figure 2: Convexity of exponential function.

Lemma 1 Let Z be a r.v. such that E[Z] = 0 and a ≤ Z ≤ b with probability one. Then

E[e^{sZ}] ≤ e^{s²(b − a)²/8}.

This upper bound is derived as follows. By the convexity of the exponential function (see Fig. 2),

e^{sz} ≤ ((z − a)/(b − a)) e^{sb} + ((b − z)/(b − a)) e^{sa}, for a ≤ z ≤ b.

Thus,

E[e^{sZ}] ≤ E[(Z − a)/(b − a)] e^{sb} + E[(b − Z)/(b − a)] e^{sa}
          = (b/(b − a)) e^{sa} − (a/(b − a)) e^{sb}, since E[Z] = 0
          = (1 − λ + λe^{s(b−a)}) e^{−λs(b−a)}, where λ = −a/(b − a)


Now let u = s(b − a) and define

φ(u) := −λu + log(1 − λ + λe^u),

so that

E[e^{sZ}] ≤ (1 − λ + λe^{s(b−a)}) e^{−λs(b−a)} = e^{φ(u)}.

We want to find a good upper bound on e^{φ(u)}. Let's express φ(u) as its Taylor series with remainder:

φ(u) = φ(0) + uφ′(0) + (u²/2)φ″(v) for some v ∈ [0, u].

Note that φ(0) = 0, and

φ′(u) = −λ + λe^u/(1 − λ + λe^u)  ⇒  φ′(0) = 0

φ″(u) = λe^u/(1 − λ + λe^u) − λ²e^{2u}/(1 − λ + λe^u)²
      = (λe^u/(1 − λ + λe^u)) (1 − λe^u/(1 − λ + λe^u))
      = ρ(1 − ρ),

where ρ = λe^u/(1 − λ + λe^u). Now note that ρ(1 − ρ) ≤ 1/4 for any value of ρ (the maximum is attained at ρ = 1/2), therefore φ″(u) ≤ 1/4. So finally we have

φ(u) ≤ u²/8 = s²(b − a)²/8,

and therefore E[e^{sZ}] ≤ e^{s²(b − a)²/8}.

Now we can apply this upper bound to derive Hoeffding's Inequality:

P(S_n − E[S_n] ≥ t) ≤ e^{−st} ∏_{i=1}^{n} E[e^{s(X_i − E[X_i])}]
                    ≤ e^{−st} ∏_{i=1}^{n} e^{s²(b_i − a_i)²/8}
                    = e^{−st} e^{s² ∑_{i=1}^{n} (b_i − a_i)²/8}
                    = e^{−2t²/∑_{i=1}^{n}(b_i − a_i)²},

where the last equality follows by choosing s = 4t/∑_{i=1}^{n}(b_i − a_i)².

The same result applies to the r.v.'s −X1, . . . , −Xn, and combining these two results yields the claim of the theorem.
