# POL571 Lecture Notes: Random Variables and Probability


POL 571: Random Variables and Probability Distributions
Kosuke Imai, Department of Politics, Princeton University
February 22, 2006
## 1 Random Variables and Distribution Functions
Often, we are more interested in the consequences of an experiment than in the experiment itself. For example, a gambler is more interested in how much they win or lose than in the games they play. Formally, a random variable is a function which maps the sample space into R or a subset of R.
Definition 1 A random variable is a function X : Ω → R satisfying A(x) = {ω ∈ Ω : X(ω) ≤ x} ∈ F for all x ∈ R. Such a function is said to be F-measurable. After an experiment is done, the outcome ω ∈ Ω is revealed, and the random variable X(ω) takes some value in R. For example, in a coin toss experiment, you may assign the value of 1 to a head and 0 to a tail. The reason for the technical requirement will become clear when we define the distribution function of a random variable, which describes how likely X is to take a value no larger than a particular one.
Definition 2 The (cumulative) distribution function of a random variable X is the function F : R → [0, 1] given by F(x) = P(A(x)) where A(x) = {ω ∈ Ω : X(ω) ≤ x}, or equivalently F(x) = P(X ≤ x). We sometimes write FX(x) to emphasize that this function is defined for the random variable X. The (cumulative) distribution function is also often called the "CDF."
Definition 3 The two random variables, X and Y, are said to be identically distributed if P(X ∈ A) = P(Y ∈ A) for any A ∈ F, or equivalently FX(x) = FY(x) for any x. Finally, a distribution function has the following important properties.
Theorem 1 (Distribution Function I) Let x and y be real numbers. A distribution function F(x) of a random variable X satisfies the following properties.

1. lim_{x→−∞} F(x) = 0 and lim_{x→∞} F(x) = 1.

2. F is increasing, i.e., if x < y, then F(x) ≤ F(y).

3. F is right-continuous, i.e., lim_{x↓c} F(x) = F(c) for any c ∈ R.

Similarly, one can also prove the following additional properties.

Theorem 2 (Distribution Function II) A distribution function F(x) of a random variable X satisfies the following properties.
1. P (X > x) = 1 − F (x).

2. P (x < X ≤ y) = F (y) − F (x).

3. P(X = x) = F(x) − lim_{y↑x} F(y).

Let's look at some examples of random variables and their distribution functions.

Example 1

1. Bernoulli distribution. In a coin toss experiment, a Bernoulli random variable can be defined as X(head) = 1 and X(tail) = 0. What is the distribution function?

2. Geometric distribution. This random variable represents the number of failures in a sequence of Bernoulli trials before the first success occurs.

3. Logistic distribution. The distribution function of a logistic random variable is given by

F (x)

=

1 1+e−x

.

Conﬁrm

that

this

satisﬁes

Theorem

1.
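The course exercises use R, but Theorem 1 can also be checked numerically for the logistic CDF above. The Python sketch below (the function name `logistic_cdf` is my own) verifies the limit, monotonicity, and continuity properties on a grid of points.

```python
import math

def logistic_cdf(x):
    # F(x) = 1 / (1 + e^(-x))
    return 1.0 / (1.0 + math.exp(-x))

# Property 1: limits at -infinity and +infinity
assert logistic_cdf(-50.0) < 1e-10
assert logistic_cdf(50.0) > 1.0 - 1e-10

# Property 2: (weakly) increasing on a grid of points
grid = [x / 10.0 for x in range(-100, 101)]
values = [logistic_cdf(x) for x in grid]
assert all(a <= b for a, b in zip(values, values[1:]))

# Property 3: continuity (hence right-continuity) at an arbitrary point
assert abs(logistic_cdf(1e-9) - logistic_cdf(0.0)) < 1e-8
print("logistic CDF satisfies Theorem 1 numerically")
```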

Random variables can be classified into two classes based on their distribution functions.

Definition 4 Let X be a random variable.

1. X is said to be discrete if its distribution function is a step function.

2. X is said to be continuous if its distribution function is a continuous function.
From the materials we learned in POL 502, you should be able to show that the distribution function of a uniform random variable as well as that of a logistic random variable is continuous.
If a uniform random number generator is available (e.g., runif() in R), one can simulate a continuous random variable using the inverse of its distribution function. This is called the inverse CDF method where CDF stands for the cumulative distribution function.

Theorem 3 (Inverse CDF Method) Let F be a distribution function. If U is a uniform random variable on [0, 1], then the distribution function of the random variable F^{−1}(U) is given by F, where F^{−1}(u) = inf{x : F(x) ≥ u} for 0 ≤ u ≤ 1 is called the generalized inverse of F (or the quantile function of X).
The reason for this cumbersome definition of F^{−1} is that the distribution function is in general not one-to-one. However, for a continuous distribution with a strictly increasing distribution function, F^{−1} equals the ordinary inverse function.

Example 2 Write R functions that simulate a random variable from the following distributions via the inverse CDF method.
1. Bernoulli distribution. Its CDF is given in Example 1.
2. Logistic distribution. Its CDF is given in Example 1.
3. Exponential distribution. The distribution function is given by F(x) = 1 − e^{−λx} for x ≥ 0.
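Example 2 asks for R functions built on runif(); the Python sketch below illustrates the same inverse CDF idea (all function names are my own, and `random.random()` plays the role of runif()). For the logistic, F^{−1}(u) = log(u/(1 − u)); for the exponential, F^{−1}(u) = −log(1 − u)/λ; for the Bernoulli, the generalized inverse of the step CDF jumps to 1 exactly when u > 1 − p.

```python
import math
import random

random.seed(0)

def rlogistic():
    # F(x) = 1 / (1 + e^(-x))  =>  F^(-1)(u) = log(u / (1 - u))
    u = random.random()
    return math.log(u / (1.0 - u))

def rexponential(lam):
    # F(x) = 1 - e^(-lam * x)  =>  F^(-1)(u) = -log(1 - u) / lam
    u = random.random()
    return -math.log(1.0 - u) / lam

def rbernoulli(p):
    # Step CDF: F(0) = 1 - p and F(1) = 1, so the generalized
    # inverse returns 1 exactly when u > 1 - p
    return 1 if random.random() > 1.0 - p else 0

sample = [rexponential(2.0) for _ in range(100_000)]
print(sum(sample) / len(sample))  # close to the theoretical mean 1 / lam = 0.5
```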
Of course, the uniform random variable is a theoretical construct, and only "pseudo-random" numbers are available to us: a deterministic sequence of values, initialized by a random seed, that mimics a sequence of random numbers. Von Neumann once said, "There is no such thing as a random number – there are only methods of producing random numbers."


## 2 Probability Density and Mass Functions

While the distribution function defines the distribution of a random variable, we are often interested in the likelihood of a random variable taking a particular value. This is given by the probability density and mass functions for continuous and discrete random variables, respectively.
Definition 5 Let X be a random variable and x ∈ R.

1. If X is discrete, then it has the probability mass function f : R → [0, 1] defined by f(x) = P(X = x).

2. If X is continuous, then it has the probability density function f : R → [0, ∞), which satisfies

   F(x) = ∫_{−∞}^{x} f(t) dt,

where F(x) is the distribution function of X.
We may write fX(x) to stress that the probability function is for the random variable X. Although the mass function corresponds to a probability, the density function does not. In particular, the latter is not bounded by 1. First, let's consider discrete random variables. The following theorem is a corollary of Theorems 1 and 2.

Theorem 4 (Probability Mass Function) Let X be a discrete random variable and 𝒳 be the set of all possible values X can take. Then, its probability mass function f(x) and distribution function F(x) have the following relationships:

   F(x) = Σ_{t∈𝒳: t≤x} f(t),   f(x) = F(x) − lim_{t↑x} F(t),   and   Σ_{t∈𝒳} f(t) = 1.

We consider commonly used discrete random variables and their probability mass functions.

Example 3

1. Binomial distribution. The sum of n independent and identically distributed Bernoulli random variables with probability of success p is a Binomial random variable, whose probability mass function is f(x) = (n choose x) p^{x} (1 − p)^{n−x}, for x = 0, 1, . . . , n.

2. Bernoulli distribution. This is a special case of the Binomial distribution with n = 1. The probability mass function is f(x) = p^{x} (1 − p)^{1−x} for x = 0, 1.

3. Negative binomial distribution. A negative binomial random variable represents the number of failures required before the r-th success occurs in a sequence of Bernoulli trials. The probability mass function is given by

   f(x) = (r + x − 1 choose x) p^{r} (1 − p)^{x}, for x = 0, 1, 2, . . .

4. Geometric distribution. This is a special case of the negative binomial distribution with r = 1. The probability mass function is given by f(x) = p(1 − p)^{x} for x = 0, 1, 2, . . .


5. Poisson distribution. A Poisson random variable X with parameter λ > 0 has the following probability mass function,

   f(x) = (λ^{x} / x!) e^{−λ}, for x = 0, 1, 2, . . .
There is an interesting relationship between Poisson and Binomial distributions.

Theorem 5 (Poisson approximation to Binomial) If n is large and p is small, the Poisson probability mass function with λ = np can approximate the Binomial probability mass function.
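Theorem 5 is easy to check numerically. The Python sketch below (with hypothetical values n = 1000 and p = 0.003, chosen only for illustration) compares the Binomial pmf with the Poisson pmf with λ = np, pointwise over the values where the mass concentrates.

```python
import math

n, p = 1000, 0.003   # large n, small p (hypothetical values)
lam = n * p          # matching Poisson parameter, lam = n * p

def binom_pmf(x):
    # f(x) = C(n, x) p^x (1 - p)^(n - x)
    return math.comb(n, x) * p**x * (1.0 - p) ** (n - x)

def pois_pmf(x):
    # f(x) = (lam^x / x!) e^(-lam)
    return lam**x * math.exp(-lam) / math.factorial(x)

# The two pmfs should be close pointwise
max_diff = max(abs(binom_pmf(x) - pois_pmf(x)) for x in range(21))
print(max_diff)
```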
For continuous distributions, the probability density function has the following properties.

Theorem 6 (Probability Density Function) Let X be a continuous random variable.

1. Its probability density function f(x) has the following properties,

   P(X = x) = 0,   P(a ≤ X ≤ b) = ∫_{a}^{b} f(x) dx,   and   ∫_{−∞}^{∞} f(x) dx = 1.

2. If the distribution function F(x) is differentiable at x, then f(x) = F′(x).

Now, we look at some examples of continuous random variables.

Example 4

1. Logistic distribution. What is the probability density function of the logistic distribution?

2. Gamma distribution. A Gamma random variable takes non-negative values and has the following density function with the parameters α > 0 (shape parameter) and β > 0 (rate parameter, the inverse of the scale),

   f(x) = β^{α} x^{α−1} e^{−βx} / Γ(α),

where Γ(α) = ∫_{0}^{∞} x^{α−1} e^{−x} dx is called the Gamma function.

3. Exponential distribution. This is a special case of the Gamma distribution with α = 1, i.e., f(x) = β e^{−βx}. This distribution has the "memoryless" property: P(X > s + t | X > s) = P(X > t) for all s, t ≥ 0.

4. χ² distribution. This is another special case of the Gamma distribution with α = ν/2 and β = 1/2, where ν is called the degrees of freedom parameter.

5. Beta distribution. A Beta random variable takes values in [0, 1] and has the following density function with the parameters α, β > 0,

   f(x) = (Γ(α + β) / (Γ(α)Γ(β))) x^{α−1} (1 − x)^{β−1},

where B(α, β) = Γ(α)Γ(β)/Γ(α + β) = ∫_{0}^{1} x^{α−1} (1 − x)^{β−1} dx is called the Beta function.
6. Uniform distribution. This is a special case of a Beta random variable with α = β = 1, so the density function is f(x) = 1 for x ∈ [0, 1]. More generally, a uniform random variable X takes values in a closed interval [a, b], with the density function f(x) = 1/(b − a).


7. Normal (Gaussian) distribution. The Normal distribution has two parameters, mean µ and variance σ²,

   f(x) = (1/√(2πσ²)) exp(−(x − µ)²/(2σ²)).

If µ = 0 and σ² = 1, then it is called the standard Normal distribution.

We now consider the “truncation” of a probability distribution where some values cannot be observed and hence are eliminated from the sample space.

Theorem 7 (Truncated Distribution) Let X be a discrete (continuous) random variable, and denote its distribution function and probability mass (density) function by F(x) and f(x), respectively. If the distribution is truncated so that only the values in 𝒳 are observed, then the probability mass (density) function of the truncated random variable is given by,

   g(x) = 1{x ∈ 𝒳} f(x) / P(X ∈ 𝒳).
Let’s consider some examples of truncated distributions.

Example 5
1. 0-truncated Poisson distribution. What is the probability mass function of the 0-truncated Poisson distribution where 0 cannot be observed?
2. Right-truncated Normal distribution. What is the probability density function of the right-truncated Normal distribution where any value that is greater than t cannot be observed?
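For the first case, Theorem 7 gives g(x) = f(x)/(1 − e^{−λ}) for x = 1, 2, . . ., since P(X > 0) = 1 − e^{−λ}. A Python sketch (function names are my own) confirms that this truncated pmf sums to one; the right-truncated Normal case is analogous with the Normal CDF in the denominator.

```python
import math

def pois_pmf(x, lam):
    # untruncated Poisson pmf: f(x) = (lam^x / x!) e^(-lam)
    return lam**x * math.exp(-lam) / math.factorial(x)

def zero_trunc_pois_pmf(x, lam):
    # g(x) = 1{x >= 1} f(x) / P(X >= 1) = f(x) / (1 - e^(-lam))
    if x < 1:
        return 0.0
    return pois_pmf(x, lam) / (1.0 - math.exp(-lam))

lam = 2.0
# The truncated pmf must sum to one over its support {1, 2, ...}
total = sum(zero_trunc_pois_pmf(x, lam) for x in range(1, 50))
print(total)  # numerically one
```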
Once we have a random sample from a distribution, we can define the "empirical counterpart" of the distribution function.

Definition 6 Let X be a random variable with the distribution function F(x), and let x1, x2, . . . , xn be a random sample from the distribution. The empirical (cumulative) distribution function is defined by

   F̂(x) = (1/n) Σ_{i=1}^{n} 1{x_i ≤ x}.
Example 6 Plot the empirical distribution function using a random sample you generated for the logistic distribution. Compare it with the true distribution function.
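Example 6 asks for an R plot; as a non-graphical Python sketch of the same comparison (function names are my own), one can measure the gap between the empirical and true logistic distribution functions at a few points, which should be small for a large sample.

```python
import bisect
import math
import random

random.seed(1)

def logistic_cdf(x):
    return 1.0 / (1.0 + math.exp(-x))

def rlogistic():
    # inverse CDF method: F^(-1)(u) = log(u / (1 - u))
    u = random.random()
    return math.log(u / (1.0 - u))

n = 50_000
sample = sorted(rlogistic() for _ in range(n))

def ecdf(x):
    # F_hat(x) = #{x_i <= x} / n, computed by binary search on the sorted sample
    return bisect.bisect_right(sample, x) / n

# The empirical CDF should track the true CDF closely
points = [-4.0, -2.0, -1.0, 0.0, 1.0, 2.0, 4.0]
sup_diff = max(abs(ecdf(x) - logistic_cdf(x)) for x in points)
print(sup_diff)
```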
Given a random sample from a probability distribution, we can also come up with a reasonable guess (i.e., an estimate, the statistical term about which we will be learning soon) of the underlying probability density function. The oldest and most widely used method is the histogram.

Definition 7 Let x1, x2, . . . , xn be a random sample from a distribution whose probability density function is f(x). Given an origin x0 and a bin width h, we define the bins of the histogram to be the intervals [x0 + mh, x0 + (m + 1)h) for positive and negative integers m. Then, the histogram is defined by,

   f̂(x) = (1/(nh)) Σ_{i=1}^{n} 1{x_i ∈ the same bin as x}.

Example 7 In R, obtain a random sample from a distribution of your choice (use a continuous distribution). Compare the histogram with the true probability density function. Let the bin width of the histogram vary to see how sensitive the graph is to this variable.
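Example 7 asks for an R comparison; a minimal Python sketch (using an exponential sample and hypothetical choices x0 = 0 and h = 0.1) evaluates the histogram estimator of Definition 7 at one point and compares it with the true density there.

```python
import math
import random

random.seed(2)

lam = 1.0
n = 100_000
sample = [random.expovariate(lam) for _ in range(n)]

x0, h = 0.0, 0.1   # origin and bin width (hypothetical choices)

def hist_estimate(x):
    # f_hat(x) = (1 / (n h)) #{x_i in the same bin as x}
    m = math.floor((x - x0) / h)            # index of the bin containing x
    lo, hi = x0 + m * h, x0 + (m + 1) * h   # bin [x0 + m h, x0 + (m + 1) h)
    return sum(lo <= xi < hi for xi in sample) / (n * h)

x = 0.55
true_density = lam * math.exp(-lam * x)     # exponential density f(x) = lam e^(-lam x)
print(hist_estimate(x), true_density)
```

Re-running with a larger or smaller h shows the usual trade-off: wide bins oversmooth, narrow bins make the estimate noisy.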


## 3 Random Vectors and Joint Distributions

So far, we have considered a single random variable. However, the results can be readily extended to a random vector, a vector of multiple random variables. For example, we can think about an experiment where we throw two dice instead of one each time. Then, we need to define the joint probability (mass or density) function for a random vector. For simplicity, we only consider bivariate distributions, but the same principle applies to multivariate distributions in general.

Definition 8 Let X and Y be random variables. The joint distribution function of X and Y is F : R² → [0, 1] defined by

   F(x, y) = P(X ≤ x, Y ≤ y).

1. If (X, Y) is a discrete random vector, the joint probability mass function f : R² → R is defined by f(x, y) = P(X = x, Y = y).

2. If (X, Y) is a continuous random vector, the joint probability density function f : R² → R is defined by

   F(x, y) = ∫_{−∞}^{y} ∫_{−∞}^{x} f(s, t) ds dt.

If F(x, y) is differentiable at (x, y), then we have f(x, y) = ∂²F(x, y)/∂x∂y.

Note that for a continuous distribution, we have

   P(a ≤ X ≤ b, c ≤ Y ≤ d) = ∫_{c}^{d} ∫_{a}^{b} f(x, y) dx dy.

Let’s look at a couple of examples.

Example 8
1. Multinomial distribution. A k-dimensional multinomial random vector X = (X1, . . . , Xk) has the following probability mass function,

   f(x1, x2, . . . , xk) = (n choose x1, x2, . . . , xk) p1^{x1} p2^{x2} · · · pk^{xk},

where p = (p1, p2, . . . , pk) is a k-dimensional vector of probabilities with Σ_{i=1}^{k} p_i = 1 and n = Σ_{i=1}^{k} x_i. A special case of this distribution is the Binomial distribution.

2. Multivariate Normal distribution. A k-dimensional multivariate Normal random vector X = (X1, . . . , Xk) has the following density function,

   f(x) = (2π)^{−k/2} |Σ|^{−1/2} exp(−(1/2) (x − µ)ᵀ Σ^{−1} (x − µ)),

where µ is a k-dimensional mean vector and Σ is a k × k positive definite covariance matrix.

One can easily obtain the marginal distribution function from the joint distribution function.


Theorem 8 (Marginal and Joint Distribution Functions) Let X and Y be random variables and F(x, y) be their joint distribution function. If FX(x) and FY(y) are the marginal distribution functions of X and Y, respectively, then

   FX(x) = lim_{y→∞} F(x, y), and FY(y) = lim_{x→∞} F(x, y).

We now define the independence of random variables.

Definition 9 Two random variables, X and Y, are said to be independent if

   F(x, y) = FX(x) FY(y) for all x, y ∈ R,

where F is the joint distribution function, and FX and FY are the marginal distribution functions of X and Y, respectively.
It is also possible to obtain the marginal probability mass (density) function from the joint probability mass (density) function. If we have the joint density function of the two random variables X and Y , we can obtain the marginal density function of X by “integrating out” Y .

Theorem 9 (Marginal and Joint Density (Mass) Functions) Let X and Y be random variables where fX(x) and fY(y) are the marginal probability mass (density) functions, and let f(x, y) be their joint probability mass (density) function. 𝒳 = {x : fX(x) > 0} and 𝒴 = {y : fY(y) > 0} are called the supports of the marginal distributions of X and Y, respectively.

1. If both X and Y are discrete, then

   fX(x) = Σ_{y∈𝒴} f(x, y), and fY(y) = Σ_{x∈𝒳} f(x, y).

2. If both X and Y are continuous, then

   fX(x) = ∫_{y∈𝒴} f(x, y) dy, and fY(y) = ∫_{x∈𝒳} f(x, y) dx.
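In the discrete case, marginalizing amounts to summing the joint pmf over the other variable. A minimal Python sketch with a toy joint pmf (the table and function names below are my own invention, not from the notes):

```python
# A toy joint pmf for a discrete random vector (X, Y), stored as a
# dictionary mapping (x, y) to f(x, y); the numbers are made up
joint = {
    (0, 0): 0.10, (0, 1): 0.20,
    (1, 0): 0.30, (1, 1): 0.40,
}

def marginal_x(x):
    # f_X(x) = sum over y in the support of f(x, y)
    return sum(p for (xv, yv), p in joint.items() if xv == x)

def marginal_y(y):
    # f_Y(y) = sum over x in the support of f(x, y)
    return sum(p for (xv, yv), p in joint.items() if yv == y)

print(marginal_x(0), marginal_x(1))  # the marginal pmf of X
```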

In addition to joint and marginal distributions, conditional distributions are often of interest.

Definition 10 Let X and Y be random variables with the marginal probability mass (density) functions, fX(x) and fY(y), and the joint probability mass (density) function, f(x, y). The conditional mass (density) functions of X given Y and of Y given X are defined by

   f(x | y) = f(x, y) / fY(y), and f(y | x) = f(x, y) / fX(x),

respectively, wherever the denominators are positive.

Now, we show that independence can be checked by decomposing the joint density (mass) function.

Theorem 10 (Independence) Let X and Y be random variables with the joint probability mass (density) function, f (x, y). X and Y are independent if and only if there exist functions g(x) and h(y) such that
f (x, y) = g(x)h(y),
for every x ∈ R and y ∈ R.


From this theorem, it is immediate that if random variables X and Y are independent, then f (x | y) = fX (x) and f (y | x) = fY (y). Let’s consider a couple of examples.

Example 9
1. Multinomial and Binomial distributions. Let (X1, X2, . . . , Xk) be a multinomial random vector. Show that the marginal distribution of Xi for any i ∈ {1, . . . , k} is a Binomial distribution. Also, show that (X1, X2, . . . , Xk−1) conditional on Xk = xk follows a multinomial distribution.
2. Bivariate Normal distribution. Rewrite the bivariate Normal density function using the means µ1, µ2, variances σ1², σ2², and the correlation ρ. Write the joint density function as the product of the marginal and conditional density functions. Confirm that the correlation is zero if and only if the two random variables are independent.
Many distributions can be derived hierarchically by combining conditional and marginal distributions. Here is one important example.

Example 10 Student t distribution. A Student t random variable, X, with ν degrees of freedom, location µ, and scale parameter σ² can be simulated in the following manner,

   X | ν, µ, σ², Z ∼ Normal(µ, σ²ν/Z),
   Z | ν ∼ χ²_ν.

Use this fact to show that the density function of the Student t distribution is equal to,

   f(x | µ, σ²) = (Γ((ν + 1)/2) / (Γ(ν/2) √(νπσ²))) (1 + (x − µ)²/(νσ²))^{−(ν+1)/2}.
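As a numerical sanity check on the density above, it should integrate to one. A Python sketch (with hypothetical parameter values ν = 5, µ = 1, σ² = 2, and a trapezoid rule over a wide interval; the function name is my own):

```python
import math

def t_density(x, nu, mu, sigma2):
    # Student t density with nu degrees of freedom, location mu, scale sigma2:
    # f(x) = Gamma((nu+1)/2) / (Gamma(nu/2) sqrt(nu pi sigma2))
    #        * (1 + (x - mu)^2 / (nu sigma2))^(-(nu+1)/2)
    c = math.gamma((nu + 1) / 2) / (math.gamma(nu / 2) * math.sqrt(nu * math.pi * sigma2))
    return c * (1 + (x - mu) ** 2 / (nu * sigma2)) ** (-(nu + 1) / 2)

nu, mu, sigma2 = 5.0, 1.0, 2.0
a, b, k = mu - 200.0, mu + 200.0, 200_000
w = (b - a) / k

# Composite trapezoid rule: sum interior points at full weight,
# endpoints at half weight; the result should be close to one
total = sum(t_density(a + i * w, nu, mu, sigma2) for i in range(k + 1)) * w
total -= 0.5 * w * (t_density(a, nu, mu, sigma2) + t_density(b, nu, mu, sigma2))
print(total)
```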
