# Importance sampling algorithms for Bayesian networks

Download Importance sampling algorithms for Bayesian networks

## Preview text

MCM: 2749

+ Model

ARTICLE IN PRESS

pp. 1–19 (col. ﬁg: NIL)

Mathematical and Computer Modelling xx (xxxx) xxx–xxx

www.elsevier.com/locate/mcm

UNC ORREC TED PR OOF

Importance sampling algorithms for Bayesian networks: Principles and performance

Changhe Yuana, Marek J. Druzdzelb,∗

a Decision Systems Laboratory, Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA 15260, United States b Decision Systems Laboratory, School of Information Sciences and Intelligent Systems Program, University of Pittsburgh,

Pittsburgh, PA 15260, United States

Received 10 March 2005; accepted 4 May 2005

1 2

Abstract

3

Precision achieved by stochastic sampling algorithms for Bayesian networks typically deteriorates in the face of extremely 4

unlikely evidence. In addressing this problem, importance sampling algorithms seem to be most successful. We discuss the 5

principles underlying the importance sampling algorithms in Bayesian networks. After that, we describe Evidence Pre-propagation 6

Importance Sampling (EPIS-BN), an importance sampling algorithm that computes an importance function using two techniques: 7

loopy belief propagation [K. Murphy, Y. Weiss, M. Jordan, Loopy belief propagation for approximate inference: An empirical 8

study, in: Proceedings of the Fifteenth Annual Conference on Uncertainty in Artiﬁcial Intelligence, UAI-99, San Francisco, CA, 9

Morgan Kaufmann Publishers, 1999, pp. 467–475; Y. Weiss, Correctness of local probability propagation in graphical models with 10

loops, Neural Computation 12 (1) (2000) 1–41] and -cutoff heuristic [J. Cheng, M.J. Druzdzel, BN-AIS: An adaptive importance 11

sampling algorithm for evidential reasoning in large Bayesian networks, Journal of Artiﬁcial Intelligence Research 13 (2000) 12

155–188]. We test the performance of EPIS-BN on three large real Bayesian networks and observe that on all three networks it 13

outperforms AIS-BN [J. Cheng, M.J. Druzdzel, BN-AIS: An adaptive importance sampling algorithm for evidential reasoning 14

in large Bayesian networks, Journal of Artiﬁcial Intelligence Research 13 (2000) 155–188], the current state-of-the-art algorithm, 15

while avoiding its costly learning stage. We also compare importance sampling to Gibbs sampling and discuss the role of the 16

-cutoff heuristic in importance sampling for Bayesian networks.

17

c 2005 Elsevier Ltd. All rights reserved.

18

Keywords: Importance sampling; Importance function; EPIS-BN; Evidence pre-propogation; -cutoff

19

1. Introduction

20

Bayesian networks (BNs) [4] are a powerful modelling tool especially suitable for problems involving uncertainty. 21 They offer a compact, intuitive, and efﬁcient graphical representation of uncertain relationships among variables in a 22 domain and have proven their value in many disciplines over the last decade. Bayesian networks have been applied 23 successfully to a variety of decision problems, including medical diagnosis, prognosis, therapy planning, machine 24

∗ Corresponding author. Tel.: +1 412 624 9432; fax: +1 412 624 2788. E-mail addresses: [email protected] (C.Yuan), [email protected] (M.J. Druzdzel).

0895-7177/$ - see front matter c 2005 Elsevier Ltd. All rights reserved. doi:10.1016/j.mcm.2005.05.020

MCM: 2749

ARTICLE IN PRESS

2

C. Yuan, M.J. Druzdzel / Mathematical and Computer Modelling xx (xxxx) xxx–xxx

UNC ORREC TED PR OOF

Fig. 1. An example Bayesian network for HIV infection.

1 diagnosis, user interfaces, natural language interpretation, planning, vision, robotics, data mining, fraud detection, 2 and many others. Some examples of real-world applications are described in a special issue of Communication of 3 ACM, on practical applications of decision-theoretical methods in AI, Vol. 38, No. 3, March 1995.

4 1.1. Introduction to Bayesian networks

5 Bayesian networks are directed acyclic graphs (DAGs) in which nodes represent random variables and arcs

6 represent direct probabilistic dependences among them. A Bayesian network encodes the joint probability distribution 7 over a set of variables {X1, . . . , Xn}, where n is ﬁnite, and decomposes it into a product of conditional probability 8 distributions over each variable given its parents in the graph. In the case of nodes with no parents, prior probability is 9 used. The joint probability distribution over {X1, . . . , Xn} can be obtained by taking the product of all of these prior 10 and conditional probability distributions:

n

11

Pr(X1, . . . , Xn) = Pr(Xi |P A(Xi )),

(1)

i =1

12 where P A(Xi ) denotes the parent nodes of Xi . Fig. 1 shows a highly simpliﬁed example Bayesian network that 13 models the causes of HIV virus infection and AIDS.1 The variables in this model are: HIV infection (H ), sexual 14 Intercourse (I ), the use of blood Transfusion (T ), Needle sharing (N ), Mosquito bite (M), and AIDS ( A). For the sake 15 of simplicity, we assumed that each of these variables is binary. For example, H has two outcomes, denoted h and 16 h, representing “HIV infection present” and “HIV infection absent”, respectively. A directed arc between N and H 17 denotes the fact that whether or not an individual shares needles will impact the likelihood of them contracting the 18 HIV virus. Similarly, an arc from H to A denotes that HIV infection inﬂuences the likelihood of developing AIDS. 19 Lack of directed arcs is also a way of expressing knowledge, notably assertions of (conditional) independence. For 20 instance, lack of directed arcs between N , I , T , and A encodes the knowledge that needle sharing, sexual intercourse, 21 and blood transfusion can inﬂuence the chance of developing AIDS, A, only indirectly through an HIV infection, H . 22 These causal assertions can be translated into statements of conditional independence: A is independent of N , I , and 23 T given H . In mathematical notation,

24

Pr(A|H ) = Pr(A|H, N ) = Pr(A|H, I ) = Pr(A|H, T ) = Pr(A|H, N , I, T ).

25 Structural independences, i.e., independences that are expressed by the structure of the network, are captured by 26 the so-called Markov condition, which states that a node (here A) is independent of its non-descendants (here N , I , 27 and T ) given its parents (here H ). 28 Similarly, the absence of arc I → N means that the individual’s decision to engage in a sexual intercourse will not 29 inﬂuence their chances of sharing needles. The absence of any links between mosquito bite M and the remainder of 30 the variables means that M is independent of the other variables. In fact, M would typically be considered irrelevant 31 to the problem of HIV infection, and we added it to the model only for the sake of presentation. 32 These independence properties imply that

33

Pr(N , I, T, M, H, A) = Pr(N ) Pr(I ) Pr(T ) Pr(M) Pr(H |N , I, T ) Pr(A|H ),

1 The network is a modiﬁed version of the network presented in [5].

MCM: 2749

ARTICLE IN PRESS

C. Yuan, M.J. Druzdzel / Mathematical and Computer Modelling xx (xxxx) xxx–xxx

3

i.e., that the joint probability distribution over the graph nodes can be factored into the product of the conditional 1 probabilities of each node, given its parents in the graph. Please note that this expression is just an instance of Eq. (1). 2

1.2. Inference and complexity

3

UNC ORREC TED PR OOF

The assignment of values to observed variables is usually called evidence. The most important type of reasoning 4

in a probabilistic system based on Bayesian networks is known as belief updating, which amounts to computing 5

the probability distribution over the variables of interest, given the evidence. In the example model of Fig. 1, the 6

variable of interest could be H and the focus of computation could be the posterior probability distribution over 7

H , given the observed values of N , I , and T , i.e., Pr(H |N = n, I = i, T = t). We use a lower case letter 8

to denote the state of a variable. Another type of reasoning focuses on computing the most probable explanation, 9

i.e., the most probable instantiation, of the variables of interest, given the evidence. In the example model of 10

Fig. 1, we may be interested to know the most likely instantiation of I ,N , and T , given the observed value of 11

A, i.e., max{I,N,T } Pr(I, N , T | A = a). Several ingenious exact algorithms have been proposed to address these 12

reasoning tasks, including variable elimination [6], clustering algorithm [7], belief propagation for polytree [4], cutset 13

conditioning [4], and symbolic probabilistic inference (SPI) [8]. However, it has been shown that exact inference in 14

Bayesian networks is NP-hard [9]. With practical models reaching the size of thousands of variables, exact inference 15

in Bayesian networks is apparently infeasible. Although approximate inference to any desired precision has been 16

shown to be NP-hard as well [10], for very complex networks it is the only alternative that will produce any result at 17

all. Therefore, many approximate inference algorithms have been proposed. Some of them are actually approximate 18

versions of exact algorithms, including bounded conditioning [11], localized partial evaluation [12], incremental 19

SPI [13], probabilistic partial evaluation [14], and mini-bucket elimination [15]. Other approximate algorithms are 20

inherently approximate methods, including loopy belief propagation [1], variational methods [16], search based 21

algorithms [17], and stochastic sampling algorithms. Stochastic sampling algorithms are actually a large family that 22

contains many instances. Some of these are probabilistic logic sampling [18], likelihood weighting [19,20], backward 23

sampling [21], importance sampling [20], AIS-BN algorithm [3], IS algorithm [22], and IS T algorithm [23]. A 24

subclass of stochastic sampling methods, called Markov Chain Monte Carlo (MCMC) methods, includes Gibbs 25

sampling, Metropolis sampling, and Hybrid Monte Carlo sampling [24–26]. The problem of approximate inference 26

algorithms is often that they provide no guarantee regarding the quality of their results. However, the family of 27

stochastic sampling algorithms is an exception, because theoretically they will converge to the exact solutions if 28

based on sufﬁciently many samples. Furthermore, among all stochastic sampling algorithms, importance sampling- 29

based algorithms seem to provide the best performance, because many excellent methods have been proposed for 30

calculating good importance functions, whose quality is critical to the results. We will review the existing importance 31

sampling algorithms for Bayesian networks in Section 2.

32

The outline of the paper is as follows. In Section 2, we give a general introduction to importance sampling and 33

the existing importance sampling algorithms for Bayesian networks. In Section 3, we discuss the Evidence Pre- 34

propagation Importance Sampling algorithm for Bayesian Networks (EPIS-BN) algorithm, which uses the loopy 35

belief propagation algorithm to calculate an importance function. Finally, in Section 4, we describe the results of 36

experimental tests of the EPIS-BN algorithm on several large real Bayesian networks.

37

2. Importance sampling

38

In this section, we ﬁrst introduce the basic theory of importance sampling, and then review the existing importance 39

sampling algorithms for Bayesian networks.

40

2.1. Theory of importance sampling

41

We start with the theoretical roots of importance sampling. Let f (X ) be a function of n variables X = (X1, . . . , Xn) 42

over the domain Ω ⊂ Rn. Consider the problem of estimating the multiple integral

43

V = f (X )dX.

Ω

(2) 44

UNC ORREC TED PR OOF

MCM: 2749

ARTICLE IN PRESS

4

C. Yuan, M.J. Druzdzel / Mathematical and Computer Modelling xx (xxxx) xxx–xxx

1 We assume that the domain of integration of f (X ) is bounded, i.e., that V exists. Importance sampling approaches 2 this problem by estimating

3 V = f (X ) I (X )dX, (3) Ω I(X)

4 where I (X ), which is called the importance function, is a probability density function such that I (X ) > 0 for any 5 X ⊂ Ω . One practical requirement of I (X ) is that it should be easy to sample from. In order to estimate the integral, 6 we generate samples X1, X2, . . . , X N from I (X ) and use the generated values in the sample-mean formula

1 N f (Xi)

7

Vˆ =

.

(4)

N i 1 I (Xi)

=

8 The estimator in Eq. (4) almost surely converges as follows:

9

Vˆ → V.

(5)

10 under the following weak assumptions [27]:

11 Assumption 1. f (X ) is proportional to a proper probability density function deﬁned on Ω .

12 Assumption 2. {Xi }i∞=1 is a sequence of i.i.d. random samples, the common distribution having a probability density 13 function I (X ).

14 Assumption 3. The support of I (X ) includes Ω .

15 Assumption 4. V exists and is ﬁnite.

16 Importance sampling assigns more weight to regions where f (X ) > I (X ) and less weight to regions where 17 f (X ) < I (X ) to correctly estimate V. We do not have much control over what is required in Assumptions 1, 2 and 4, 18 because they are either the inherent properties of the problem at hand or the characteristic of Monte Carlo simulation. 19 We only have the freedom to choose which importance function to use, as long as it satisﬁes Assumption 3. In the 20 context of Bayesian netoworks, since Ω is compact, we can easily devise an importance function that satisﬁes the 21 assumption. 22 Rubinstein [28] proves that if f (X ) > 0, the optimal importance function is

23 I (X ) = f (X ) , (6) V

24 which is actually the posterior distribution. The concept of the optimal importance function does not seem to be useful, 25 because ﬁnding V is equivalent to ﬁnding the posterior distribution, which is the problem that we are facing. However, 26 it suggests that, if we ﬁnd instead a function that is close enough to the optimal importance function, we can still 27 expect good convergence rate.

28 2.2. Importance sampling in Bayesian networks

29 Importance sampling has become the basis for several state-of-the-art stochastic sampling-based inference 30 algorithms for Bayesian networks. These algorithms inherit the characteristic that their accuracy largely depends 31 on the quality of the importance functions that they manage to get. The theoretical convergence rate is in the order of 32 √1m , where m is the number of samples, for essentially all Monte Carlo methods. Therefore, the further the importance 33 function is from the posterior distribution, the more samples it needs to converge. The number of samples needed 34 increases at least at a quadratic speed. Hence, given a ﬁxed number of samples, any effort to make the importance 35 function closer to the posterior distribution will directly inﬂuence the precision of sampling algorithms. To achieve 36 a given precision, a good importance function can save us lots of samples. This is best represented graphically in 37 Fig. 2. Obviously, there is a tradeoff between the quality of the importance function and the amount of effort spent on

MCM: 2749

ARTICLE IN PRESS

C. Yuan, M.J. Druzdzel / Mathematical and Computer Modelling xx (xxxx) xxx–xxx

5

UNC ORREC TED PR OOF

Fig. 2. Importance sampling: the tradeoff between the quality of importance function and the amount of effort spent getting the function.

devising it. In this section, we review some existing importance sampling algorithms for Bayesian networks. Based 1

on the different methods that they use to get the importance function, we classify them into three families.

2

The ﬁrst family uses the prior distribution of a Bayesian network as the importance function. Since they spend 3

no effort in trying to get a good importance function, they typically need more time to converge. Probabilistic 4

logic sampling [18] and likelihood weighting [19,20] both belong to this category. When there is no evidence, these 5

two algorithms reduce to the same algorithm. Their difference becomes evident when evidence is introduced. The 6

probabilistic logic sampling instantiates all the nodes in a Bayesian network by sampling from the prior distribution 7

and discards all samples that are not compatible with the evidence. Obviously, the logic sampling is very inefﬁcient 8

when the evidence is unlikely. On the contrary, the likelihood weighting only instantiates the nodes without evidence 9

and associates each sample with the weight

10

w = P(xi |P A(xi )).

xi ∈E

(7) 11

By making use of all samples, likelihood weighting improves the sampling efﬁciency. However, when the evidence 12

is unlikely, most of the sample weights are small, and only several samples with very large weights dominate the 13

sample. In such cases, the variance of the sample weights can be huge and, hence, the algorithm is still inefﬁcient.

14

The second family resorts to learning methods to learn an importance function. Self-importance 15

sampling (SIS) [20], adaptive importance sampling [29], and AIS-BN [3] all belong to this family. The SIS tries to 16

revise the prior distribution periodically using samples in order to make the sampling distribution gradually approach 17

the posterior distribution. The adaptive importance sampling parameterizes the importance function using a set of 18

parameters Θ, and devises several updating rules based on gradient descent to adapt the current sampling distribution 19

into an importance function. The AIS-BN algorithm learns an importance function starting from a modiﬁed prior. It 20

modiﬁes the prior using two heuristics: (1) initializing the probability distributions of parents of evidence nodes to 21

the uniform distribution; and (2) adjusting very small probabilities in the conditional probability tables composing 22

the importance function to higher values. After that, the AIS-BN algorithm draws some samples and estimates an 23

importance function that approaches the optimal importance function.

24

The third family directly computes an importance function in the light of both the prior distribution and the 25

evidence. The backward sampling [21], IS [22], and annealed importance sampling [30] algorithms all belong to 26

this category. The backward sampling modiﬁes the prior distribution so that it allows for generating samples from 27

evidence nodes in the direction that is opposite to the topological order of nodes in the network. The main idea of the 28

IS algorithm originates from the variable elimination algorithm [6]. A full variable elimination algorithm is an exact 29

algorithm that looks for optimal solutions so, instead, the IS algorithm uses an approximate version of the variable 30

UNC ORREC TED PR OOF

MCM: 2749

ARTICLE IN PRESS

6

C. Yuan, M.J. Druzdzel / Mathematical and Computer Modelling xx (xxxx) xxx–xxx

1 elimination algorithm to compute an importance function. The idea is to set a limit on the size of potentials built 2 when eliminating variables. Whenever the size of a potential exceeds the limit, the approximate method will create an 3 approximate version for it. The annealed importance sampling algorithm starts by sampling from the prior distribution. 4 However, instead of directly assigning weights to the samples, the algorithm sets up a series of distributions, with the 5 last one being the posterior distribution. By annealing each sample using Markov chains deﬁned by the series of 6 distributions, the algorithm tries to get a set of samples that are generated from a distribution that is close to the 7 posterior distribution. The main drawback of this algorithm is that it needs to sample many times in order to get just 8 one sample. The EPIS-BN algorithm that we propose in this paper also belongs to this family. Theoretically, we 9 can apply any approximation technique to obtain the importance function. For instance, it is possible that variational 10 approximation methods [16] can be applied to this end. 11 The AIS-BN algorithm is the current state-of-the-art importance sampling algorithm for Bayesian networks. 12 Empirical results showed that the AIS-BN algorithm achieved over two orders of magnitude improvement in 13 convergence over likelihood weighting and self-importance sampling algorithms. The other algorithms that we 14 reviewed in this section typically report moderate improvements over likelihood weighting algorithm. Therefore, 15 we will mainly compare our proposed algorithm against the AIS-BN algorithm in the later experiments. We also 16 compare our results against Gibbs sampling, an algorithm from the MCMC family.

17 3. EPIS-BN: Evidence pre-propagation importance sampling algorithm

18 In predictive inference, since both evidence and soft evidence are in the roots of the network, stochastic forward 19 sampling algorithms, such as the probabilistic logic sampling [18], can easily reach high precision. However, in 20 diagnostic reasoning, especially when the evidence is extremely unlikely, sampling algorithms can exhibit a mismatch 21 between the sampling distribution and the posterior distribution. In such cases, most samples may be incompatible 22 with the evidence and be useless. Some stochastic sampling algorithms, such as likelihood weighting and importance 23 sampling, try to make use of all the samples by assigning weights for them. But, in practice, most of the weights turn 24 out to be too small to be effective. Backward sampling [21] tries to deal with this problem by sampling backwards 25 from the evidence nodes, but it may fail to consider the soft evidence present in the roots [23]. Whichever sampling 26 order is chosen, a good importance function has to take into account the information that is present ahead in the 27 network. If we do sampling in the topological order of the network, we need an importance function that will match 28 the information from the evidence nodes. In this section, we propose a new importance sampling algorithm, which we 29 call Evidence Pre-propagation Importance Sampling algorithm for Bayesian Networks (EPIS-BN). In this algorithm, 30 we ﬁrst use loopy belief propagation to compute an approximation of the optimal importance function, and then apply 31 -cutoff heuristic to cut off small probabilities in the importance function.

32 3.1. Loopy belief propagation

33 The goal of the belief propagation algorithm [4] is to ﬁnd the posterior beliefs of each node X , i.e., BEL(x) = 34 P(X = x|E), where E denotes the set of evidence nodes. In a polytree, any node X d-separates E into two subsets 35 E+, the evidence connected to the parents of X , and E−, the evidence connected to the children of X . Given the 36 state of X , the two subsets are independent. Therefore, node X can collect messages separately from them in order to 37 compute its posterior beliefs. The message from E+ is deﬁned as

38

π(X ) = P(x|E+)

(8)

39 and the message from E− is deﬁned as

40

λ(x) = P(E−|x).

(9)

41 π(X ) and λ(x) messages can be decomposed into more detailed messages between neighboring nodes as follows:

42 λ(t)(x ) = λX (x ) λ(Ytj)(x ) (10)

j

MCM: 2749

ARTICLE IN PRESS

C. Yuan, M.J. Druzdzel / Mathematical and Computer Modelling xx (xxxx) xxx–xxx

7

UNC ORREC TED PR OOF

and

1

π (t)(x) = P(X = x|U = u) πX(t)(uk ),

u

k

(11) 2

where λX (x) is a message that a node sends to itself [31]. The message that X sends to its parent Ui is given by:

3

λ(Xt+1)(ui ) = α λ(t)(x )

P(x|u) πX(t)(uk )

x

uk :k=i

k=i

(12) 4

and the message that X sends to its child Y j is

5

πY(tj+1)(x ) = απ (t)(x )λX (x ) λ(Ytk)(uk ).

k=i

(13) 6

After a node X receives all the messages, it can compute its posterior marginal distribution over X (belief) by

7

BEL(x) = αλ(x)π(X ),

(14) 8

where α is a normalizing constant. When this algorithm is applied to a polytree, the leaves and roots of the network 9

can send out their messages immediately. The evidence nodes can send out their messages as well. By propagating 10

these messages, eventually all messages will be sent. The algorithm terminates with correct beliefs. With slight 11

modiﬁcations, we can apply the belief propagation algorithm to networks with loops. The resulting algorithm is 12

called loopy belief propagation [1,2]. We start by initializing the messages that all evidence nodes send to themselves 13

to be vectors of a 1 for observed state and 0s for other states. All other messages are vectors of 1s. Then, in parallel, 14

all of the nodes recompute their new outgoing messages based on the incoming messages from the last iteration. By 15

running the propagation for a number of iterations (say, equal to the length of the diameter of the network), we can 16 assess convergence by checking if any belief changes by more than a small threshold (say, 10−3). In general, loopy 17

belief propagation will not give the correct posteriors for networks with loops. However, extensive investigations on 18

the performance of loopy belief propagation that are performed recently report surprisingly accurate results [1,2,32, 19

33]. As of now, more thorough understanding of why the results are so good has yet to be developed. For our purpose 20

of getting an approximate importance function, we need not wait until loopy belief propagation converges, so whether 21

or not loopy belief propagation converges to the correct posteriors is not critical.

22

3.2. Importance function in the EPIS-BN algorithm

23

Let X = {X1, X2, . . . , Xn} be the set of variables in a Bayesian network, P A(Xi ) be the parents of Xi , and E be the 24

set of evidence variables. Based on the theoretical considerations in Section 2, we know that the optimal importance 25

function is

26

ρ(X\E) = P(X|E).

(15) 27

Although we know the mathematical expression for the optimal importance function, it is difﬁcult to obtain the 28

function exactly. In our algorithm, we use the following importance function:

29

n

ρ(X\E) = P(Xi |P A(Xi ), E),

i =1

where each P(Xi |PA(Xi ), E) is deﬁned as an importance conditional probability table (ICPT) [3].

(16) 30

31

Deﬁnition 1. An importance conditional probability table (ICPT) of a node Xi is a table of posterior probabilities 32

P(Xi |PA(Xi ), E) conditional on the evidence and indexed by its immediate predecessors, PA(Xi ).

33

This importance function only partially considers the effect of all the evidence on every node. As Cheng and 34

Druzdzel [3] point out, when the posterior structure of the network changes dramatically as the result of observed 35

evidence, this importance function may perform poorly. However, our empirical results show that it is a good 36

approximation to the optimal importance function.

37

MCM: 2749

ARTICLE IN PRESS

8

C. Yuan, M.J. Druzdzel / Mathematical and Computer Modelling xx (xxxx) xxx–xxx

UNC ORREC TED PR OOF

1 The AIS-BN [3] algorithm adopts a long learning step to learn approximations of these ICPTs and, hence, the 2 importance function. The following theorem shows that in polytrees we can calculate them directly.

3 Theorem 1. Let Xi be a variable in a polytree, and E be the set of evidence. The exact ICPT P(Xi |P A(Xi ), E) for 4 Xi is

5

α(P A(Xi ))P(Xi |P A(Xi ))λ(Xi ),

(17)

6 where α(P A(Xi )) is a normalizing constant dependent on P A(Xi ).

7 Proof. Let E = E+ ∪ E−, where E+ is the evidence connected to the parents of Xi , and E− is the evidence connected 8 to the children of Xi , then

9

P(Xi |P A(Xi ), E) = P(Xi |P A(Xi ), E+, E−)

10

= P(Xi |P A(Xi ), E−)

P(E− | Xi , P A(Xi ))P(Xi |P A(Xi ))

11

=

P(E− | P A(Xi ))

12

= α(P A(Xi ))λ(Xi )P(Xi |P A(Xi )).

13 If a node has no descendant with evidence, its ICPT is identical to its CPT. This property is also pointed out

14 in [3] (Theorem 2). 15 In networks with loops, getting the exact λ messages for all variables is equivalent to calculating the exact solutions, 16 which is an NP-hard problem. However, because our goal is to obtain a good and not necessarily optimal importance 17 function, we can satisfy it by calculating approximations of the λ messages. Given the surprisingly good performance 18 of loopy belief propagation, we believe that this can also provide us with good approximations of the λ messages.

19 3.3. The -cutoff heuristic

20 Another heuristic method that we use in EPIS-BN is -cutoff [3], i.e., setting some threshold and replacing

21 any smaller probability in the network by . At the same time, we compensate for this change by subtracting it from

22 the largest probability in the same conditional probability distribution. This method is originally used in AIS-BN

23 to speed up its importance function learning step [3]. However, we ﬁnd that it is even more suitable for a different

24 purpose, which is to try to make the importance function possess heavy tails. As noted by Geweke [27], the tails of

25 the importance function should not decay faster than the tails of the posterior distribution. Otherwise, the convergence

26 rate will be slow. We show in a related paper [34], which is currently under review, that it is also true in the context of

27 Bayesian networks. We review the main results here.

28 Let f be the joint probability distribution of a Bayesian network. Druzdzel [35] shows that f approximately follows

29 the lognormal distribution. Therefore, we can look at any importance sampling algorithm for Bayesian networks

30 as using one lognormal distribution as the importance function to compute the expectation of another lognormal

31

distribution. Let

f (X ) be the original density of a Bayesian network and let

f (ln X )

∝

N (µ f

,

σ

2 f

)

.

We

assume

32 that we cannot sample from f (X ) but we can only evaluate it at any point. Let the importance function be I (X ),

33

which

satisﬁes

I (ln

X)

∝

N

(µ

I

,

σ

2 I

)

.

After

a

simple

calculation,

we

obtain

the

variance

of

the

importance

sampling

34 estimator as

f 2(X)

2

35

VarI (X)(w(X )) = I (X ) dX − f (X )dX

µI −µ f 2

σI 2 σf

σf

2 σI

2 1

σ−

36

=

ef

− 1,

2 σσIf 2 − 1

(18)

37 where w(X ) = If ((XX)) . The necessary condition for the variance in Eq. (18) to exist is that 2( σσIf )2 − 1 > 0, which 38 means that the variance of I (ln X ) should at least be greater than half of the variance of f (ln X ). Note that | µIσ−fµ f |

MCM: 2749

ARTICLE IN PRESS

C. Yuan, M.J. Druzdzel / Mathematical and Computer Modelling xx (xxxx) xxx–xxx

9

UNC ORREC TED PR OOF

Fig. 3. A plot of σσIf against the variance when using the importance function I (ln X ) ∝ N (µI , σI2) with different µI s to integrate the density f (ln X ) ∝ N (µ f , σ 2f ). The legend shows different values of µIσ−fµ f .

can be looked on as the standardized distance between µI and µ f with regard to f (ln X ). We plot the variance against 1 σσIf for different values of | µIσ−fµ f | in Fig. 3. 2

We observe that, as the tails of the importance function become slimmer, the variance increases rapidly and 3

suddenly goes to inﬁnity. However, when the tails become heavier, the variance increases slowly. In practice, we 4

usually have no clue about the real shape of f (X ). Therefore, after we have applied loopy belief propagation to 5

calculate an importance function, the function will not be a precise estimate of the optimal importance function and 6

is likely to possess slim tails. To prevent this situation from happening, we apply the -cutoff heuristic to adjust the 7

small probabilities in our ICPTs. The tails in the context of Bayesian networks can be interpreted as the states with 8

extremely small probabilities and extremely large probabilities, located in the tails of the approximate lognormal 9

distributions of the Bayesian networks.

10

The optimal threshold value is highly network dependent. Furthermore, if the calculated importance function is 11

close to the ideal importance function, we may depart from this ideal by applying -cutoff.

12

3.4. The EPIS-BN algorithm

13

The basic EPIS-BN algorithm is outlined in Fig. 4. There are three main stages in the algorithm. The ﬁrst stage 14

includes Steps 1–2, which initialize the parameters. The second stage, including Steps 3–6, applies loopy belief 15

propagation and -cutoff to calculate an importance function. The last stage, Step 7, does the actual importance 16

sampling.

17

The parameter m, the number of samples, is a matter of a network-independent tradeoff between precision and 18

time. More samples will lead to a better precision. However, the optimal values of the propagation length d and 19

the threshold value for the -cutoff are highly network dependent. We will recommend some values based on our 20

empirical results in Section 4.2.

21

4. Experimental results

22

To test the performance of the EPIS-BN algorithm, we applied it to three large real Bayesian networks: 23 ANDES [36], CPCS [37], and PATHFINDER [38], and compared our results to those of AIS-BN, the current state- 24 of-the-art importance sampling algorithm, and those of Gibbs sampling, a representative of the MCMC methods, 25 which are believed to perform well in Bayesian networks. A reviewer on the earlier version of this paper suggested 26

MCM: 2749

ARTICLE IN PRESS

10

C. Yuan, M.J. Druzdzel / Mathematical and Computer Modelling xx (xxxx) xxx–xxx

UNC ORREC TED PR OOF

Fig. 4. The evidence pre-propagation importance sampling algorithm for bayesian networks (EPIS-BN).

1 comparing our algorithm to an MCMC algorithm. The ANDES network [36] consists of 233 nodes. This network 2 has a large depth and high connectivity and it was shown to be difﬁcult for the AIS-BN algorithm [3]. The CPCS 3 network [37] that we used has 179 nodes, which is a subset of the full CPCS network created by Max Henrion 4 and Malcolm Pradhan. The PATHFINDER network [38] contains 135 nodes. This section presents the results of our 5 experiments. We implemented our algorithm in C++ and performed our tests on a 2.5 GHz Pentium IV Windows XP 6 computer with 1GB memory Windows XP.

7 4.1. Experimental method

8 To compare the accuracy of sampling algorithms, we compare their departure from the exact solutions, which

9 we calculate using the clustering algorithm [7]. The distance metric that we use is Hellinger’s distance [39]. 10 Hellinger’s distance between two distributions f 1 and f 2, which have probabilities P1(xi j ) and P2(xi j ) for state 11 j ( j = 1, 2, . . . , ni ) of node i respectively, such that Xi ∈ E, is deﬁned as:

ni

{ P1(xi j ) − P2(xi j )}2

12

H (F1, F2) = Xi ∈N\E j=1

,

(19)

ni

Xi ∈N\E

13 where N is the set of all nodes in the network, E is the set of evidence nodes, and ni is the number of states for node i. 14 Hellinger’s distance weights small absolute probability differences near 0 much more heavily than similar 15 probability differences near 1. In many cases, Hellinger’s distance provides results that are equivalent to 16 Kullback–Leibler divergence. However, a major advantage of Hellinger’s distance is that it can handle zero 17 probabilities, which are common in Bayesian networks. Cheng and Druzdzel [3] used mean square error (MSE) 18 in their experiments. The main drawback of MSE is that it assigns equal distance for the same absolute probability

+ Model

ARTICLE IN PRESS

pp. 1–19 (col. ﬁg: NIL)

Mathematical and Computer Modelling xx (xxxx) xxx–xxx

www.elsevier.com/locate/mcm

UNC ORREC TED PR OOF

Importance sampling algorithms for Bayesian networks: Principles and performance

Changhe Yuana, Marek J. Druzdzelb,∗

a Decision Systems Laboratory, Intelligent Systems Program, University of Pittsburgh, Pittsburgh, PA 15260, United States b Decision Systems Laboratory, School of Information Sciences and Intelligent Systems Program, University of Pittsburgh,

Pittsburgh, PA 15260, United States

Received 10 March 2005; accepted 4 May 2005

1 2

Abstract

3

Precision achieved by stochastic sampling algorithms for Bayesian networks typically deteriorates in the face of extremely 4

unlikely evidence. In addressing this problem, importance sampling algorithms seem to be most successful. We discuss the 5

principles underlying the importance sampling algorithms in Bayesian networks. After that, we describe Evidence Pre-propagation 6

Importance Sampling (EPIS-BN), an importance sampling algorithm that computes an importance function using two techniques: 7

loopy belief propagation [K. Murphy, Y. Weiss, M. Jordan, Loopy belief propagation for approximate inference: An empirical 8

study, in: Proceedings of the Fifteenth Annual Conference on Uncertainty in Artiﬁcial Intelligence, UAI-99, San Francisco, CA, 9

Morgan Kaufmann Publishers, 1999, pp. 467–475; Y. Weiss, Correctness of local probability propagation in graphical models with 10

loops, Neural Computation 12 (1) (2000) 1–41] and -cutoff heuristic [J. Cheng, M.J. Druzdzel, BN-AIS: An adaptive importance 11

sampling algorithm for evidential reasoning in large Bayesian networks, Journal of Artiﬁcial Intelligence Research 13 (2000) 12

155–188]. We test the performance of EPIS-BN on three large real Bayesian networks and observe that on all three networks it 13

outperforms AIS-BN [J. Cheng, M.J. Druzdzel, BN-AIS: An adaptive importance sampling algorithm for evidential reasoning 14

in large Bayesian networks, Journal of Artiﬁcial Intelligence Research 13 (2000) 155–188], the current state-of-the-art algorithm, 15

while avoiding its costly learning stage. We also compare importance sampling to Gibbs sampling and discuss the role of the 16

-cutoff heuristic in importance sampling for Bayesian networks.

17

c 2005 Elsevier Ltd. All rights reserved.

18

Keywords: Importance sampling; Importance function; EPIS-BN; Evidence pre-propogation; -cutoff

19

1. Introduction

20

Bayesian networks (BNs) [4] are a powerful modelling tool especially suitable for problems involving uncertainty. 21 They offer a compact, intuitive, and efﬁcient graphical representation of uncertain relationships among variables in a 22 domain and have proven their value in many disciplines over the last decade. Bayesian networks have been applied 23 successfully to a variety of decision problems, including medical diagnosis, prognosis, therapy planning, machine 24

∗ Corresponding author. Tel.: +1 412 624 9432; fax: +1 412 624 2788. E-mail addresses: [email protected] (C.Yuan), [email protected] (M.J. Druzdzel).

0895-7177/$ - see front matter c 2005 Elsevier Ltd. All rights reserved. doi:10.1016/j.mcm.2005.05.020

MCM: 2749

ARTICLE IN PRESS

2

C. Yuan, M.J. Druzdzel / Mathematical and Computer Modelling xx (xxxx) xxx–xxx

UNC ORREC TED PR OOF

Fig. 1. An example Bayesian network for HIV infection.

1 diagnosis, user interfaces, natural language interpretation, planning, vision, robotics, data mining, fraud detection, 2 and many others. Some examples of real-world applications are described in a special issue of Communication of 3 ACM, on practical applications of decision-theoretical methods in AI, Vol. 38, No. 3, March 1995.

4 1.1. Introduction to Bayesian networks

5 Bayesian networks are directed acyclic graphs (DAGs) in which nodes represent random variables and arcs

6 represent direct probabilistic dependences among them. A Bayesian network encodes the joint probability distribution 7 over a set of variables {X1, . . . , Xn}, where n is ﬁnite, and decomposes it into a product of conditional probability 8 distributions over each variable given its parents in the graph. In the case of nodes with no parents, prior probability is 9 used. The joint probability distribution over {X1, . . . , Xn} can be obtained by taking the product of all of these prior 10 and conditional probability distributions:

n

11

Pr(X1, . . . , Xn) = Pr(Xi |P A(Xi )),

(1)

i =1

12 where P A(Xi ) denotes the parent nodes of Xi . Fig. 1 shows a highly simpliﬁed example Bayesian network that 13 models the causes of HIV virus infection and AIDS.1 The variables in this model are: HIV infection (H ), sexual 14 Intercourse (I ), the use of blood Transfusion (T ), Needle sharing (N ), Mosquito bite (M), and AIDS ( A). For the sake 15 of simplicity, we assumed that each of these variables is binary. For example, H has two outcomes, denoted h and 16 h, representing “HIV infection present” and “HIV infection absent”, respectively. A directed arc between N and H 17 denotes the fact that whether or not an individual shares needles will impact the likelihood of them contracting the 18 HIV virus. Similarly, an arc from H to A denotes that HIV infection inﬂuences the likelihood of developing AIDS. 19 Lack of directed arcs is also a way of expressing knowledge, notably assertions of (conditional) independence. For 20 instance, lack of directed arcs between N , I , T , and A encodes the knowledge that needle sharing, sexual intercourse, 21 and blood transfusion can inﬂuence the chance of developing AIDS, A, only indirectly through an HIV infection, H . 22 These causal assertions can be translated into statements of conditional independence: A is independent of N , I , and 23 T given H . In mathematical notation,

24

Pr(A|H ) = Pr(A|H, N ) = Pr(A|H, I ) = Pr(A|H, T ) = Pr(A|H, N , I, T ).

25 Structural independences, i.e., independences that are expressed by the structure of the network, are captured by 26 the so-called Markov condition, which states that a node (here A) is independent of its non-descendants (here N , I , 27 and T ) given its parents (here H ). 28 Similarly, the absence of arc I → N means that the individual’s decision to engage in a sexual intercourse will not 29 inﬂuence their chances of sharing needles. The absence of any links between mosquito bite M and the remainder of 30 the variables means that M is independent of the other variables. In fact, M would typically be considered irrelevant 31 to the problem of HIV infection, and we added it to the model only for the sake of presentation. 32 These independence properties imply that

33

Pr(N , I, T, M, H, A) = Pr(N ) Pr(I ) Pr(T ) Pr(M) Pr(H |N , I, T ) Pr(A|H ),

1 The network is a modiﬁed version of the network presented in [5].

MCM: 2749

ARTICLE IN PRESS

C. Yuan, M.J. Druzdzel / Mathematical and Computer Modelling xx (xxxx) xxx–xxx

3

i.e., that the joint probability distribution over the graph nodes can be factored into the product of the conditional 1 probabilities of each node, given its parents in the graph. Please note that this expression is just an instance of Eq. (1). 2

1.2. Inference and complexity

3

UNC ORREC TED PR OOF

The assignment of values to observed variables is usually called evidence. The most important type of reasoning 4

in a probabilistic system based on Bayesian networks is known as belief updating, which amounts to computing 5

the probability distribution over the variables of interest, given the evidence. In the example model of Fig. 1, the 6

variable of interest could be H and the focus of computation could be the posterior probability distribution over 7

H , given the observed values of N , I , and T , i.e., Pr(H |N = n, I = i, T = t). We use a lower case letter 8

to denote the state of a variable. Another type of reasoning focuses on computing the most probable explanation, 9

i.e., the most probable instantiation, of the variables of interest, given the evidence. In the example model of 10

Fig. 1, we may be interested to know the most likely instantiation of I ,N , and T , given the observed value of 11

A, i.e., max{I,N,T } Pr(I, N , T | A = a). Several ingenious exact algorithms have been proposed to address these 12

reasoning tasks, including variable elimination [6], clustering algorithm [7], belief propagation for polytree [4], cutset 13

conditioning [4], and symbolic probabilistic inference (SPI) [8]. However, it has been shown that exact inference in 14

Bayesian networks is NP-hard [9]. With practical models reaching the size of thousands of variables, exact inference 15

in Bayesian networks is apparently infeasible. Although approximate inference to any desired precision has been 16

shown to be NP-hard as well [10], for very complex networks it is the only alternative that will produce any result at 17

all. Therefore, many approximate inference algorithms have been proposed. Some of them are actually approximate 18

versions of exact algorithms, including bounded conditioning [11], localized partial evaluation [12], incremental 19

SPI [13], probabilistic partial evaluation [14], and mini-bucket elimination [15]. Other approximate algorithms are 20

inherently approximate methods, including loopy belief propagation [1], variational methods [16], search based 21

algorithms [17], and stochastic sampling algorithms. Stochastic sampling algorithms are actually a large family that 22

contains many instances. Some of these are probabilistic logic sampling [18], likelihood weighting [19,20], backward 23

sampling [21], importance sampling [20], AIS-BN algorithm [3], IS algorithm [22], and IS T algorithm [23]. A 24

subclass of stochastic sampling methods, called Markov Chain Monte Carlo (MCMC) methods, includes Gibbs 25

sampling, Metropolis sampling, and Hybrid Monte Carlo sampling [24–26]. The problem of approximate inference 26

algorithms is often that they provide no guarantee regarding the quality of their results. However, the family of 27

stochastic sampling algorithms is an exception, because theoretically they will converge to the exact solutions if 28

based on sufﬁciently many samples. Furthermore, among all stochastic sampling algorithms, importance sampling- 29

based algorithms seem to provide the best performance, because many excellent methods have been proposed for 30

calculating good importance functions, whose quality is critical to the results. We will review the existing importance 31

sampling algorithms for Bayesian networks in Section 2.

32

The outline of the paper is as follows. In Section 2, we give a general introduction to importance sampling and 33

the existing importance sampling algorithms for Bayesian networks. In Section 3, we discuss the Evidence Pre- 34

propagation Importance Sampling algorithm for Bayesian Networks (EPIS-BN) algorithm, which uses the loopy 35

belief propagation algorithm to calculate an importance function. Finally, in Section 4, we describe the results of 36

experimental tests of the EPIS-BN algorithm on several large real Bayesian networks.

37

2. Importance sampling

38

In this section, we ﬁrst introduce the basic theory of importance sampling, and then review the existing importance 39

sampling algorithms for Bayesian networks.

40

2.1. Theory of importance sampling

41

We start with the theoretical roots of importance sampling. Let f (X ) be a function of n variables X = (X1, . . . , Xn) 42

over the domain Ω ⊂ Rn. Consider the problem of estimating the multiple integral

43

V = f (X )dX.

Ω

(2) 44

UNC ORREC TED PR OOF

MCM: 2749

ARTICLE IN PRESS

4

C. Yuan, M.J. Druzdzel / Mathematical and Computer Modelling xx (xxxx) xxx–xxx

1 We assume that the domain of integration of f (X ) is bounded, i.e., that V exists. Importance sampling approaches 2 this problem by estimating

3 V = f (X ) I (X )dX, (3) Ω I(X)

4 where I (X ), which is called the importance function, is a probability density function such that I (X ) > 0 for any 5 X ⊂ Ω . One practical requirement of I (X ) is that it should be easy to sample from. In order to estimate the integral, 6 we generate samples X1, X2, . . . , X N from I (X ) and use the generated values in the sample-mean formula

1 N f (Xi)

7

Vˆ =

.

(4)

N i 1 I (Xi)

=

8 The estimator in Eq. (4) almost surely converges as follows:

9

Vˆ → V.

(5)

10 under the following weak assumptions [27]:

11 Assumption 1. f (X ) is proportional to a proper probability density function deﬁned on Ω .

12 Assumption 2. {Xi }i∞=1 is a sequence of i.i.d. random samples, the common distribution having a probability density 13 function I (X ).

14 Assumption 3. The support of I (X ) includes Ω .

15 Assumption 4. V exists and is ﬁnite.

16 Importance sampling assigns more weight to regions where f (X ) > I (X ) and less weight to regions where 17 f (X ) < I (X ) to correctly estimate V. We do not have much control over what is required in Assumptions 1, 2 and 4, 18 because they are either the inherent properties of the problem at hand or the characteristic of Monte Carlo simulation. 19 We only have the freedom to choose which importance function to use, as long as it satisﬁes Assumption 3. In the 20 context of Bayesian netoworks, since Ω is compact, we can easily devise an importance function that satisﬁes the 21 assumption. 22 Rubinstein [28] proves that if f (X ) > 0, the optimal importance function is

23 I (X ) = f (X ) , (6) V

24 which is actually the posterior distribution. The concept of the optimal importance function does not seem to be useful, 25 because ﬁnding V is equivalent to ﬁnding the posterior distribution, which is the problem that we are facing. However, 26 it suggests that, if we ﬁnd instead a function that is close enough to the optimal importance function, we can still 27 expect good convergence rate.

28 2.2. Importance sampling in Bayesian networks

29 Importance sampling has become the basis for several state-of-the-art stochastic sampling-based inference 30 algorithms for Bayesian networks. These algorithms inherit the characteristic that their accuracy largely depends 31 on the quality of the importance functions that they manage to get. The theoretical convergence rate is in the order of 32 √1m , where m is the number of samples, for essentially all Monte Carlo methods. Therefore, the further the importance 33 function is from the posterior distribution, the more samples it needs to converge. The number of samples needed 34 increases at least at a quadratic speed. Hence, given a ﬁxed number of samples, any effort to make the importance 35 function closer to the posterior distribution will directly inﬂuence the precision of sampling algorithms. To achieve 36 a given precision, a good importance function can save us lots of samples. This is best represented graphically in 37 Fig. 2. Obviously, there is a tradeoff between the quality of the importance function and the amount of effort spent on

MCM: 2749

ARTICLE IN PRESS

C. Yuan, M.J. Druzdzel / Mathematical and Computer Modelling xx (xxxx) xxx–xxx

5

UNC ORREC TED PR OOF

Fig. 2. Importance sampling: the tradeoff between the quality of importance function and the amount of effort spent getting the function.

devising it. In this section, we review some existing importance sampling algorithms for Bayesian networks. Based 1

on the different methods that they use to get the importance function, we classify them into three families.

2

The ﬁrst family uses the prior distribution of a Bayesian network as the importance function. Since they spend 3

no effort in trying to get a good importance function, they typically need more time to converge. Probabilistic 4

logic sampling [18] and likelihood weighting [19,20] both belong to this category. When there is no evidence, these 5

two algorithms reduce to the same algorithm. Their difference becomes evident when evidence is introduced. The 6

probabilistic logic sampling instantiates all the nodes in a Bayesian network by sampling from the prior distribution 7

and discards all samples that are not compatible with the evidence. Obviously, the logic sampling is very inefﬁcient 8

when the evidence is unlikely. On the contrary, the likelihood weighting only instantiates the nodes without evidence 9

and associates each sample with the weight

10

w = P(xi |P A(xi )).

xi ∈E

(7) 11

By making use of all samples, likelihood weighting improves the sampling efﬁciency. However, when the evidence 12

is unlikely, most of the sample weights are small, and only several samples with very large weights dominate the 13

sample. In such cases, the variance of the sample weights can be huge and, hence, the algorithm is still inefﬁcient.

14

The second family resorts to learning methods to learn an importance function. Self-importance 15

sampling (SIS) [20], adaptive importance sampling [29], and AIS-BN [3] all belong to this family. The SIS tries to 16

revise the prior distribution periodically using samples in order to make the sampling distribution gradually approach 17

the posterior distribution. The adaptive importance sampling parameterizes the importance function using a set of 18

parameters Θ, and devises several updating rules based on gradient descent to adapt the current sampling distribution 19

into an importance function. The AIS-BN algorithm learns an importance function starting from a modiﬁed prior. It 20

modiﬁes the prior using two heuristics: (1) initializing the probability distributions of parents of evidence nodes to 21

the uniform distribution; and (2) adjusting very small probabilities in the conditional probability tables composing 22

the importance function to higher values. After that, the AIS-BN algorithm draws some samples and estimates an 23

importance function that approaches the optimal importance function.

24

The third family directly computes an importance function in the light of both the prior distribution and the 25

evidence. The backward sampling [21], IS [22], and annealed importance sampling [30] algorithms all belong to 26

this category. The backward sampling modiﬁes the prior distribution so that it allows for generating samples from 27

evidence nodes in the direction that is opposite to the topological order of nodes in the network. The main idea of the 28

IS algorithm originates from the variable elimination algorithm [6]. A full variable elimination algorithm is an exact 29

algorithm that looks for optimal solutions so, instead, the IS algorithm uses an approximate version of the variable 30

UNC ORREC TED PR OOF

MCM: 2749

ARTICLE IN PRESS

6

C. Yuan, M.J. Druzdzel / Mathematical and Computer Modelling xx (xxxx) xxx–xxx

1 elimination algorithm to compute an importance function. The idea is to set a limit on the size of potentials built 2 when eliminating variables. Whenever the size of a potential exceeds the limit, the approximate method will create an 3 approximate version for it. The annealed importance sampling algorithm starts by sampling from the prior distribution. 4 However, instead of directly assigning weights to the samples, the algorithm sets up a series of distributions, with the 5 last one being the posterior distribution. By annealing each sample using Markov chains deﬁned by the series of 6 distributions, the algorithm tries to get a set of samples that are generated from a distribution that is close to the 7 posterior distribution. The main drawback of this algorithm is that it needs to sample many times in order to get just 8 one sample. The EPIS-BN algorithm that we propose in this paper also belongs to this family. Theoretically, we 9 can apply any approximation technique to obtain the importance function. For instance, it is possible that variational 10 approximation methods [16] can be applied to this end. 11 The AIS-BN algorithm is the current state-of-the-art importance sampling algorithm for Bayesian networks. 12 Empirical results showed that the AIS-BN algorithm achieved over two orders of magnitude improvement in 13 convergence over likelihood weighting and self-importance sampling algorithms. The other algorithms that we 14 reviewed in this section typically report moderate improvements over likelihood weighting algorithm. Therefore, 15 we will mainly compare our proposed algorithm against the AIS-BN algorithm in the later experiments. We also 16 compare our results against Gibbs sampling, an algorithm from the MCMC family.

17 3. EPIS-BN: Evidence pre-propagation importance sampling algorithm

18 In predictive inference, since both evidence and soft evidence are in the roots of the network, stochastic forward 19 sampling algorithms, such as the probabilistic logic sampling [18], can easily reach high precision. However, in 20 diagnostic reasoning, especially when the evidence is extremely unlikely, sampling algorithms can exhibit a mismatch 21 between the sampling distribution and the posterior distribution. In such cases, most samples may be incompatible 22 with the evidence and be useless. Some stochastic sampling algorithms, such as likelihood weighting and importance 23 sampling, try to make use of all the samples by assigning weights for them. But, in practice, most of the weights turn 24 out to be too small to be effective. Backward sampling [21] tries to deal with this problem by sampling backwards 25 from the evidence nodes, but it may fail to consider the soft evidence present in the roots [23]. Whichever sampling 26 order is chosen, a good importance function has to take into account the information that is present ahead in the 27 network. If we do sampling in the topological order of the network, we need an importance function that will match 28 the information from the evidence nodes. In this section, we propose a new importance sampling algorithm, which we 29 call Evidence Pre-propagation Importance Sampling algorithm for Bayesian Networks (EPIS-BN). In this algorithm, 30 we ﬁrst use loopy belief propagation to compute an approximation of the optimal importance function, and then apply 31 -cutoff heuristic to cut off small probabilities in the importance function.

32 3.1. Loopy belief propagation

33 The goal of the belief propagation algorithm [4] is to ﬁnd the posterior beliefs of each node X , i.e., BEL(x) = 34 P(X = x|E), where E denotes the set of evidence nodes. In a polytree, any node X d-separates E into two subsets 35 E+, the evidence connected to the parents of X , and E−, the evidence connected to the children of X . Given the 36 state of X , the two subsets are independent. Therefore, node X can collect messages separately from them in order to 37 compute its posterior beliefs. The message from E+ is deﬁned as

38

π(X ) = P(x|E+)

(8)

39 and the message from E− is deﬁned as

40

λ(x) = P(E−|x).

(9)

41 π(X ) and λ(x) messages can be decomposed into more detailed messages between neighboring nodes as follows:

42 λ(t)(x ) = λX (x ) λ(Ytj)(x ) (10)

j

MCM: 2749

ARTICLE IN PRESS

C. Yuan, M.J. Druzdzel / Mathematical and Computer Modelling xx (xxxx) xxx–xxx

7

UNC ORREC TED PR OOF

and

1

π (t)(x) = P(X = x|U = u) πX(t)(uk ),

u

k

(11) 2

where λX (x) is a message that a node sends to itself [31]. The message that X sends to its parent Ui is given by:

3

λ(Xt+1)(ui ) = α λ(t)(x )

P(x|u) πX(t)(uk )

x

uk :k=i

k=i

(12) 4

and the message that X sends to its child Y j is

5

πY(tj+1)(x ) = απ (t)(x )λX (x ) λ(Ytk)(uk ).

k=i

(13) 6

After a node X receives all the messages, it can compute its posterior marginal distribution over X (belief) by

7

BEL(x) = αλ(x)π(X ),

(14) 8

where α is a normalizing constant. When this algorithm is applied to a polytree, the leaves and roots of the network 9

can send out their messages immediately. The evidence nodes can send out their messages as well. By propagating 10

these messages, eventually all messages will be sent. The algorithm terminates with correct beliefs. With slight 11

modiﬁcations, we can apply the belief propagation algorithm to networks with loops. The resulting algorithm is 12

called loopy belief propagation [1,2]. We start by initializing the messages that all evidence nodes send to themselves 13

to be vectors of a 1 for observed state and 0s for other states. All other messages are vectors of 1s. Then, in parallel, 14

all of the nodes recompute their new outgoing messages based on the incoming messages from the last iteration. By 15

running the propagation for a number of iterations (say, equal to the length of the diameter of the network), we can 16 assess convergence by checking if any belief changes by more than a small threshold (say, 10−3). In general, loopy 17

belief propagation will not give the correct posteriors for networks with loops. However, extensive investigations on 18

the performance of loopy belief propagation that are performed recently report surprisingly accurate results [1,2,32, 19

33]. As of now, more thorough understanding of why the results are so good has yet to be developed. For our purpose 20

of getting an approximate importance function, we need not wait until loopy belief propagation converges, so whether 21

or not loopy belief propagation converges to the correct posteriors is not critical.

22

3.2. Importance function in the EPIS-BN algorithm

23

Let X = {X1, X2, . . . , Xn} be the set of variables in a Bayesian network, P A(Xi ) be the parents of Xi , and E be the 24

set of evidence variables. Based on the theoretical considerations in Section 2, we know that the optimal importance 25

function is

26

ρ(X\E) = P(X|E).

(15) 27

Although we know the mathematical expression for the optimal importance function, it is difﬁcult to obtain the 28

function exactly. In our algorithm, we use the following importance function:

29

n

ρ(X\E) = P(Xi |P A(Xi ), E),

i =1

where each P(Xi |PA(Xi ), E) is deﬁned as an importance conditional probability table (ICPT) [3].

(16) 30

31

Deﬁnition 1. An importance conditional probability table (ICPT) of a node Xi is a table of posterior probabilities 32

P(Xi |PA(Xi ), E) conditional on the evidence and indexed by its immediate predecessors, PA(Xi ).

33

This importance function only partially considers the effect of all the evidence on every node. As Cheng and 34

Druzdzel [3] point out, when the posterior structure of the network changes dramatically as the result of observed 35

evidence, this importance function may perform poorly. However, our empirical results show that it is a good 36

approximation to the optimal importance function.

37

MCM: 2749

ARTICLE IN PRESS

8

C. Yuan, M.J. Druzdzel / Mathematical and Computer Modelling xx (xxxx) xxx–xxx

UNC ORREC TED PR OOF

1 The AIS-BN [3] algorithm adopts a long learning step to learn approximations of these ICPTs and, hence, the 2 importance function. The following theorem shows that in polytrees we can calculate them directly.

3 Theorem 1. Let Xi be a variable in a polytree, and E be the set of evidence. The exact ICPT P(Xi |P A(Xi ), E) for 4 Xi is

5

α(P A(Xi ))P(Xi |P A(Xi ))λ(Xi ),

(17)

6 where α(P A(Xi )) is a normalizing constant dependent on P A(Xi ).

7 Proof. Let E = E+ ∪ E−, where E+ is the evidence connected to the parents of Xi , and E− is the evidence connected 8 to the children of Xi , then

9

P(Xi |P A(Xi ), E) = P(Xi |P A(Xi ), E+, E−)

10

= P(Xi |P A(Xi ), E−)

P(E− | Xi , P A(Xi ))P(Xi |P A(Xi ))

11

=

P(E− | P A(Xi ))

12

= α(P A(Xi ))λ(Xi )P(Xi |P A(Xi )).

13 If a node has no descendant with evidence, its ICPT is identical to its CPT. This property is also pointed out

14 in [3] (Theorem 2). 15 In networks with loops, getting the exact λ messages for all variables is equivalent to calculating the exact solutions, 16 which is an NP-hard problem. However, because our goal is to obtain a good and not necessarily optimal importance 17 function, we can satisfy it by calculating approximations of the λ messages. Given the surprisingly good performance 18 of loopy belief propagation, we believe that this can also provide us with good approximations of the λ messages.

19 3.3. The -cutoff heuristic

20 Another heuristic method that we use in EPIS-BN is -cutoff [3], i.e., setting some threshold and replacing

21 any smaller probability in the network by . At the same time, we compensate for this change by subtracting it from

22 the largest probability in the same conditional probability distribution. This method is originally used in AIS-BN

23 to speed up its importance function learning step [3]. However, we ﬁnd that it is even more suitable for a different

24 purpose, which is to try to make the importance function possess heavy tails. As noted by Geweke [27], the tails of

25 the importance function should not decay faster than the tails of the posterior distribution. Otherwise, the convergence

26 rate will be slow. We show in a related paper [34], which is currently under review, that it is also true in the context of

27 Bayesian networks. We review the main results here.

28 Let f be the joint probability distribution of a Bayesian network. Druzdzel [35] shows that f approximately follows

29 the lognormal distribution. Therefore, we can look at any importance sampling algorithm for Bayesian networks

30 as using one lognormal distribution as the importance function to compute the expectation of another lognormal

31

distribution. Let

f (X ) be the original density of a Bayesian network and let

f (ln X )

∝

N (µ f

,

σ

2 f

)

.

We

assume

32 that we cannot sample from f (X ) but we can only evaluate it at any point. Let the importance function be I (X ),

33

which

satisﬁes

I (ln

X)

∝

N

(µ

I

,

σ

2 I

)

.

After

a

simple

calculation,

we

obtain

the

variance

of

the

importance

sampling

34 estimator as

f 2(X)

2

35

VarI (X)(w(X )) = I (X ) dX − f (X )dX

µI −µ f 2

σI 2 σf

σf

2 σI

2 1

σ−

36

=

ef

− 1,

2 σσIf 2 − 1

(18)

37 where w(X ) = If ((XX)) . The necessary condition for the variance in Eq. (18) to exist is that 2( σσIf )2 − 1 > 0, which 38 means that the variance of I (ln X ) should at least be greater than half of the variance of f (ln X ). Note that | µIσ−fµ f |

MCM: 2749

ARTICLE IN PRESS

C. Yuan, M.J. Druzdzel / Mathematical and Computer Modelling xx (xxxx) xxx–xxx

9

UNC ORREC TED PR OOF

Fig. 3. A plot of σσIf against the variance when using the importance function I (ln X ) ∝ N (µI , σI2) with different µI s to integrate the density f (ln X ) ∝ N (µ f , σ 2f ). The legend shows different values of µIσ−fµ f .

can be looked on as the standardized distance between µI and µ f with regard to f (ln X ). We plot the variance against 1 σσIf for different values of | µIσ−fµ f | in Fig. 3. 2

We observe that, as the tails of the importance function become slimmer, the variance increases rapidly and 3

suddenly goes to inﬁnity. However, when the tails become heavier, the variance increases slowly. In practice, we 4

usually have no clue about the real shape of f (X ). Therefore, after we have applied loopy belief propagation to 5

calculate an importance function, the function will not be a precise estimate of the optimal importance function and 6

is likely to possess slim tails. To prevent this situation from happening, we apply the -cutoff heuristic to adjust the 7

small probabilities in our ICPTs. The tails in the context of Bayesian networks can be interpreted as the states with 8

extremely small probabilities and extremely large probabilities, located in the tails of the approximate lognormal 9

distributions of the Bayesian networks.

10

The optimal threshold value is highly network dependent. Furthermore, if the calculated importance function is 11

close to the ideal importance function, we may depart from this ideal by applying -cutoff.

12

3.4. The EPIS-BN algorithm

13

The basic EPIS-BN algorithm is outlined in Fig. 4. There are three main stages in the algorithm. The ﬁrst stage 14

includes Steps 1–2, which initialize the parameters. The second stage, including Steps 3–6, applies loopy belief 15

propagation and -cutoff to calculate an importance function. The last stage, Step 7, does the actual importance 16

sampling.

17

The parameter m, the number of samples, is a matter of a network-independent tradeoff between precision and 18

time. More samples will lead to a better precision. However, the optimal values of the propagation length d and 19

the threshold value for the -cutoff are highly network dependent. We will recommend some values based on our 20

empirical results in Section 4.2.

21

4. Experimental results

22

To test the performance of the EPIS-BN algorithm, we applied it to three large real Bayesian networks: 23 ANDES [36], CPCS [37], and PATHFINDER [38], and compared our results to those of AIS-BN, the current state- 24 of-the-art importance sampling algorithm, and those of Gibbs sampling, a representative of the MCMC methods, 25 which are believed to perform well in Bayesian networks. A reviewer on the earlier version of this paper suggested 26

MCM: 2749

ARTICLE IN PRESS

10

C. Yuan, M.J. Druzdzel / Mathematical and Computer Modelling xx (xxxx) xxx–xxx

UNC ORREC TED PR OOF

Fig. 4. The evidence pre-propagation importance sampling algorithm for bayesian networks (EPIS-BN).

1 comparing our algorithm to an MCMC algorithm. The ANDES network [36] consists of 233 nodes. This network 2 has a large depth and high connectivity and it was shown to be difﬁcult for the AIS-BN algorithm [3]. The CPCS 3 network [37] that we used has 179 nodes, which is a subset of the full CPCS network created by Max Henrion 4 and Malcolm Pradhan. The PATHFINDER network [38] contains 135 nodes. This section presents the results of our 5 experiments. We implemented our algorithm in C++ and performed our tests on a 2.5 GHz Pentium IV Windows XP 6 computer with 1GB memory Windows XP.

7 4.1. Experimental method

8 To compare the accuracy of sampling algorithms, we compare their departure from the exact solutions, which

9 we calculate using the clustering algorithm [7]. The distance metric that we use is Hellinger’s distance [39]. 10 Hellinger’s distance between two distributions f 1 and f 2, which have probabilities P1(xi j ) and P2(xi j ) for state 11 j ( j = 1, 2, . . . , ni ) of node i respectively, such that Xi ∈ E, is deﬁned as:

ni

{ P1(xi j ) − P2(xi j )}2

12

H (F1, F2) = Xi ∈N\E j=1

,

(19)

ni

Xi ∈N\E

13 where N is the set of all nodes in the network, E is the set of evidence nodes, and ni is the number of states for node i. 14 Hellinger’s distance weights small absolute probability differences near 0 much more heavily than similar 15 probability differences near 1. In many cases, Hellinger’s distance provides results that are equivalent to 16 Kullback–Leibler divergence. However, a major advantage of Hellinger’s distance is that it can handle zero 17 probabilities, which are common in Bayesian networks. Cheng and Druzdzel [3] used mean square error (MSE) 18 in their experiments. The main drawback of MSE is that it assigns equal distance for the same absolute probability

## Categories

## You my also like

### Bayesian Based Neural Network Model for Solar Photovoltaic

362.9 KB24.9K9K### Bayesian Statistics (a very brief introduction)

3.9 MB4.6K1.5K### Climbing the Hills of Compiled Credal Networks

1.3 MB6.5K2.2K### Network Theory III: Bayesian Networks, Information and Entropy

241.6 KB20026### Active Learning: Theory And Applications

3.1 MB91.8K34K### Naive Bayes Models for Probability Estimation

129.2 KB46.1K17.5K### Market Analysis and Trading Strategies with Bayesian Networks

877.8 KB33.6K16.1K### Defeasible Bayesian Reasoning

724.3 KB94.8K46.5K### Statistical Science The Interplay of Bayesian and Frequentist

347.4 KB35.7K9.6K