# Notes On Nonparametric Regression Estimation

James L. Powell, Department of Economics, University of California, Berkeley

## The Nadaraya-Watson Kernel Regression Estimator

Suppose that $z_i \equiv (y_i, x_i')'$ is a $(p+1)$-dimensional random vector that is jointly continuously distributed, with $y_i$ being a scalar random variable. Denoting the joint density function of $z_i$ as $f_{y,x}(y,x)$, the conditional mean $g(x)$ of $y_i$ given $x_i = x$ (assuming it exists) is given by

$$
g(x) \equiv E[y_i \mid x_i = x]
= \frac{\int y\, f_{y,x}(y,x)\,dy}{\int f_{y,x}(y,x)\,dy}
= \frac{\int y\, f_{y,x}(y,x)\,dy}{f_x(x)},
$$

where $f_x(x)$ is the marginal density function of $x_i$. If $\hat f_{y,x}(y,x)$ is the kernel density estimator of $f_{y,x}(y,x)$, i.e.,

$$
\hat f_{y,x}(y,x) = \frac{1}{n}\sum_{i=1}^{n}\frac{1}{h^{p+1}}\,\tilde K\!\left(\frac{y-y_i}{h},\frac{x-x_i}{h}\right)
$$

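for some $(p+1)$-dimensional kernel function $\tilde K(v,u)$ satisfying $\int \tilde K(v,u)\,dv\,du = 1$, then an analogue estimator for $g(x) = E[y_i \mid x_i = x]$ would substitute the kernel estimator $\hat f_{y,x}$ for $f_{y,x}$ in the expression for $g(x)$. Further assuming that the first "moment" of $\tilde K$ is zero,

$$
\int \begin{pmatrix} v \\ u \end{pmatrix} \tilde K(v,u)\,dv\,du = 0
$$

(which could be ensured by choosing a $\tilde K$ that is symmetric about zero with bounded support), this analogue estimator for $g(x)$ can be simplified to

$$
\hat g(x) = \frac{\int y\, \hat f_{y,x}(y,x)\,dy}{\int \hat f_{y,x}(y,x)\,dy}
= \frac{\dfrac{1}{nh^{p}}\displaystyle\sum_{i=1}^{n} K\!\left(\dfrac{x-x_i}{h}\right) y_i}{\dfrac{1}{nh^{p}}\displaystyle\sum_{i=1}^{n} K\!\left(\dfrac{x-x_i}{h}\right)},
$$

where

$$
K(u) \equiv \int \tilde K(v,u)\,dv.
$$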


The estimator $\hat g(x)$, known as the Nadaraya-Watson kernel regression estimator, can be written as a weighted average

$$
\hat g(x) \equiv \sum_{i=1}^{n} w_{in}\, y_i,
$$

where

$$
w_{in} \equiv \frac{K\!\left(\dfrac{x-x_i}{h}\right)}{\displaystyle\sum_{j=1}^{n} K\!\left(\dfrac{x-x_j}{h}\right)}
$$

has $\sum_i w_{in} = 1$. Since $K(u) \to 0$ as $\|u\| \to \infty$ (because $K$ is integrable), it follows that $w_{in} \to 0$ for fixed $h$ as $\|x - x_i\| \to \infty$, and also that $w_{in} \to 0$ for fixed $\|x - x_i\|$ as $h \to 0$; hence $\hat g(x)$ is a "locally-weighted average" of the dependent variable $y_i$, with increasing weight put on observations with values of $x_i$ that are close to the target value $x$ as $n \to \infty$.
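The weighted-average form translates almost directly into code. The following is a minimal NumPy sketch, not part of the original notes: the product Gaussian kernel, the function names `gaussian_kernel`, `nw_weights` and `nw_estimate`, and the simulated data are all assumptions chosen for illustration.

```python
import numpy as np

def gaussian_kernel(u):
    """Product Gaussian kernel K(u) for rows u of shape (n, p); integrates to 1."""
    p = u.shape[1]
    return np.exp(-0.5 * np.sum(u**2, axis=1)) / (2.0 * np.pi) ** (p / 2.0)

def nw_weights(x0, X, h, kernel=gaussian_kernel):
    """Weights w_in = K((x0 - x_i)/h) / sum_j K((x0 - x_j)/h)."""
    k = kernel((x0 - X) / h)
    return k / k.sum()

def nw_estimate(x0, X, y, h, kernel=gaussian_kernel):
    """Nadaraya-Watson estimate g_hat(x0) = sum_i w_in * y_i."""
    return float(nw_weights(x0, X, h, kernel) @ y)

# Simulated data (an assumption for illustration): y = sin(x) + noise, p = 1.
rng = np.random.default_rng(0)
X = rng.uniform(-2.0, 2.0, size=(500, 1))
y = np.sin(X[:, 0]) + 0.1 * rng.standard_normal(500)

x0 = np.array([0.5])
w = nw_weights(x0, X, h=0.2)          # nonnegative weights that sum to one
g_hat = nw_estimate(x0, X, y, h=0.2)  # local weighted average of y near x0
```

Because the normalizing factor $1/(nh^p)$ cancels between numerator and denominator, the sketch never computes it; only the kernel values and their sum are needed.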

For the special case of $p = 1$ (i.e., one regressor) and $K(u) = \mathbf{1}\{|u| \le 1/2\}$ (the density of a $\text{Uniform}(-1/2, 1/2)$ variate), the kernel regression estimator $\hat g(x)$ takes the form

$$
\hat g(x) = \frac{\sum_{i=1}^{n} \mathbf{1}\{x - h/2 \le x_i \le x + h/2\}\; y_i}{\sum_{i=1}^{n} \mathbf{1}\{x - h/2 \le x_i \le x + h/2\}},
$$

an average of $y_i$ values with corresponding $x_i$ values within $h/2$ of $x$. This estimator is sometimes called the "regressogram," in analogy with the histogram estimator of a density function at $x$.
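With the uniform kernel the estimator is just a windowed mean, which a few lines of NumPy make concrete. This is an illustrative sketch: the function name `regressogram`, the toy data, and the choice to return NaN for an empty window are assumptions, not part of the notes.

```python
import numpy as np

def regressogram(x0, x, y, h):
    """Average of y_i over observations with x0 - h/2 <= x_i <= x0 + h/2 (p = 1)."""
    in_window = np.abs(x - x0) <= h / 2.0
    if not in_window.any():
        return float("nan")  # the ratio is 0/0 when no x_i falls in the window
    return float(y[in_window].mean())

# Toy data (hypothetical): only the first three points lie within h/2 = 0.15 of x0 = 0.1.
x = np.array([0.0, 0.1, 0.2, 0.5, 0.9])
y = np.array([1.0, 2.0, 3.0, 10.0, 20.0])
g_hat = regressogram(0.1, x, y, h=0.3)  # mean of 1, 2, 3
```

The empty-window case shows a practical drawback of bounded-support kernels: the denominator can be exactly zero in sparse regions of the regressor space.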

Derivation of the conditions for consistency of $\hat g(x)$, and of its rate of convergence to $g(x)$, follows the analogous derivations for the kernel density estimator. Indeed, $\hat g(x)$ can be written as

$$
\hat g(x) = \frac{\hat t(x)}{\hat f(x)},
$$

where $\hat t(x) \equiv \frac{1}{nh^p}\sum_{i=1}^{n} K\!\left(\frac{x-x_i}{h}\right) y_i$ is the numerator and $\hat f(x)$ is the usual kernel density estimator of the marginal density of $x_i$. The conditions for consistency of the denominator of $\hat g(x)$ (i.e., $h \to 0$ and $nh^p \to \infty$ as $n \to \infty$) have already been established, and it is easy to show the same conditions imply that

$$
\hat t(x) \to_p t(x) \equiv g(x) f(x).
$$

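The bias and variance of the numerator $\hat t(x)$ are also straightforward extensions of the corresponding formulae for the kernel density estimator $\hat f(x)$; here

$$
\begin{aligned}
E[\hat t(x)] &= E\left[\frac{1}{nh^p}\sum_{i=1}^{n} K\!\left(\frac{x-x_i}{h}\right) y_i\right] \\
&= E\left[\frac{1}{nh^p}\sum_{i=1}^{n} K\!\left(\frac{x-x_i}{h}\right) g(x_i)\right] \\
&= \int \frac{1}{h^p}\, K\!\left(\frac{x-z}{h}\right) g(z) f(z)\,dz \\
&= \int K(u)\, g(x-hu)\, f(x-hu)\,du,
\end{aligned}
$$

which is the same formula as for the expectation of $\hat f(x)$ with "$g(x)f(x)$" replacing "$f(x)$" throughout.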

Assuming the product $g(x)f(x)$ is twice continuously differentiable, etc., the same Taylor's series expansion as for the bias of $\hat f(x)$ yields the bias of $\hat t(x)$ as

$$
E[\hat t(x)] - g(x)f(x) = \frac{h^2}{2}\,\operatorname{tr}\left\{\frac{\partial^2 [g(x)f(x)]}{\partial x\,\partial x'} \int u u' K(u)\,du\right\} + o(h^2) = O(h^2).
$$

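And the variance of $\hat t(x)$ is

$$
\begin{aligned}
Var(\hat t(x)) &= Var\left(\frac{1}{n}\sum_{i=1}^{n}\frac{1}{h^p} K\!\left(\frac{x-x_i}{h}\right) y_i\right) \\
&= \frac{1}{n} E\left[\left(\frac{1}{h^p} K\!\left(\frac{x-x_i}{h}\right) y_i\right)^{2}\right] - \frac{1}{n}\left(E[\hat t(x)]\right)^2 \\
&= \frac{1}{n}\int \frac{1}{h^{2p}}\left[K\!\left(\frac{x-z}{h}\right)\right]^{2} [\sigma^2(z) + g(z)^2]\, f(z)\,dz - \frac{1}{n}\left(E[\hat t(x)]\right)^2 \\
&= \frac{1}{nh^p}\int [K(u)]^2\, [\sigma^2(x-hu) + g(x-hu)^2]\, f(x-hu)\,du - \frac{1}{n}\left(E[\hat t(x)]\right)^2 \\
&= \frac{1}{nh^p}\,[\sigma^2(x) + g(x)^2]\, f(x) \int [K(u)]^2\,du + o\!\left(\frac{1}{nh^p}\right),
\end{aligned}
$$

where $\sigma^2(x) \equiv Var[y_i \mid x_i = x]$. So, as for the kernel density estimator, the MSE of the numerator of $\hat g(x)$ is of order $[O(h^2)]^2 + O(1/(nh^p))$, and the optimal bandwidth $h^*$ has

$$
h^* = O\!\left(\left(\frac{1}{n}\right)^{1/(p+4)}\right),
$$

just like $\hat f(x)$. A "delta method" argument then implies that this yields the best rate of convergence of the ratio $\hat g(x) = \hat t(x)/\hat f(x)$ to the true value $g(x)$.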
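The bandwidth rate follows from a one-line minimization, sketched here as a worked step. The constants $A$ and $B$ are shorthand introduced for this illustration, collecting the leading squared-bias and variance coefficients of $\hat t(x)$, with $A > 0$ assumed:

```latex
\begin{aligned}
\mathrm{MSE}(h) &\approx A h^{4} + \frac{B}{n h^{p}},
\qquad
A \equiv \frac{1}{4}\left[\operatorname{tr}\left\{\frac{\partial^{2}[g(x)f(x)]}{\partial x\,\partial x'}\int uu'K(u)\,du\right\}\right]^{2},
\quad
B \equiv [\sigma^{2}(x)+g(x)^{2}]\,f(x)\int [K(u)]^{2}\,du, \\
\frac{d\,\mathrm{MSE}}{dh} &= 4Ah^{3} - \frac{pB}{n h^{p+1}} = 0
\;\;\Longrightarrow\;\;
h^{*} = \left(\frac{pB}{4A}\right)^{1/(p+4)} n^{-1/(p+4)}, \\
\mathrm{MSE}(h^{*}) &= O\!\left(n^{-4/(p+4)}\right).
\end{aligned}
```

The exponent shows the curse of dimensionality: the best attainable MSE rate $n^{-4/(p+4)}$ slows as the number of regressors $p$ grows.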

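Derivation of the asymptotic distribution of $\hat g(x)$ uses that "delta method" argument. First, the Liapunov condition can be verified for the triangular array

$$
z_{in} \equiv \frac{1}{h^p}\, K\!\left(\frac{x-x_i}{h}\right)(\lambda_1 + \lambda_2 y_i),
$$

where $\lambda_1$ and $\lambda_2$ are arbitrary constants, leading to the same requirement as for $\hat f(x)$ (namely, $nh^p \to \infty$ as $h \to 0$ and $n \to \infty$) for the sample average $\bar z_n \equiv \frac{1}{n}\sum_{i=1}^{n} z_{in}$ to be asymptotically normal, with

$$
\begin{aligned}
\sqrt{nh^p}\,(\bar z_n - E[\bar z_n]) &= \sqrt{nh^p}\left[\lambda_1(\hat f(x) - E[\hat f(x)]) + \lambda_2(\hat t(x) - E[\hat t(x)])\right] \\
&\to_d N\!\left(0,\; \left[\lambda_1^2 + 2\lambda_1\lambda_2\, g(x) + \lambda_2^2\left(g(x)^2 + \sigma^2(x)\right)\right] f(x) \int [K(u)]^2\,du\right). \qquad (**)
\end{aligned}
$$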

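The Cramér-Wald device then implies that the numerator $\hat t(x)$ and denominator $\hat f(x)$ are jointly asymptotically normal, and the usual delta method approximation

$$
\begin{aligned}
\sqrt{nh^p}\left(\hat g(x) - \frac{E[\hat t(x)]}{E[\hat f(x)]}\right)
&= \sqrt{nh^p}\;\frac{E[\hat f(x)]\,(\hat t(x) - E[\hat t(x)]) - E[\hat t(x)]\,(\hat f(x) - E[\hat f(x)])}{\hat f(x)\,E[\hat f(x)]} \\
&= \sqrt{nh^p}\;\frac{(\hat t(x) - E[\hat t(x)]) - g(x)\,(\hat f(x) - E[\hat f(x)])}{f(x)} \\
&\quad + o_p\!\left(\sqrt{nh^p}\,(\hat t(x) - E[\hat t(x)])\right) + o_p\!\left(\sqrt{nh^p}\,(\hat f(x) - E[\hat f(x)])\right)
\end{aligned}
$$

yields

$$
\sqrt{nh^p}\left(\hat g(x) - \frac{E[\hat t(x)]}{E[\hat f(x)]}\right) \to_d N\!\left(0,\; \frac{\sigma^2(x)}{f(x)} \int [K(u)]^2\,du\right)
$$

after (**) is applied with $\lambda_1 = -g(x)/f(x)$ and $\lambda_2 = 1/f(x)$.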

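When the bandwidth tends to zero at the optimal rate,

$$
h_n = c\left(\frac{1}{n}\right)^{1/(p+4)},
$$

then the asymptotic distribution of $\hat g(x)$ is biased when centered at the true value $g(x)$,

$$
\sqrt{nh^p}\,(\hat g(x) - g(x)) \to_d N\!\left(\delta(x),\; \frac{\sigma^2(x)}{f(x)} \int [K(u)]^2\,du\right),
$$

where now

$$
\begin{aligned}
\delta(x) &\equiv \lim \frac{\sqrt{nh^p}\left[(E[\hat t(x)] - t(x)) - g(x)\,(E[\hat f(x)] - f(x))\right]}{f(x)} \\
&= \frac{c^{(p+4)/2}}{2 f(x)}\,\operatorname{tr}\left\{\left[\frac{\partial^2 [g(x)f(x)]}{\partial x\,\partial x'} - g(x)\,\frac{\partial^2 f(x)}{\partial x\,\partial x'}\right] \int u u' K(u)\,du\right\}.
\end{aligned}
$$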
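For intuition, the asymptotic bias term (written $\delta(x)$ here) can be expanded in the scalar case $p = 1$. The shorthand $\kappa_2 \equiv \int u^2 K(u)\,du$ is introduced only for this sketch, and the product rule gives $(gf)'' - g f'' = g'' f + 2 g' f'$:

```latex
\begin{aligned}
\delta(x) &= \frac{c^{5/2}}{2 f(x)}\left[\big(g(x)f(x)\big)'' - g(x) f''(x)\right]\kappa_2 \\
&= \frac{c^{5/2}}{2 f(x)}\left[g''(x) f(x) + 2 g'(x) f'(x)\right]\kappa_2 \\
&= c^{5/2}\,\kappa_2\left[\frac{g''(x)}{2} + \frac{g'(x)\, f'(x)}{f(x)}\right].
\end{aligned}
```

The first term reflects curvature of the regression function; the second arises because the regressor density is not flat near $x$, so observations on one side of $x$ receive more total weight than those on the other.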

And if the bandwidth tends to zero faster than the optimal rate, i.e., "undersmoothing" is assumed, so that

$$
h = o\!\left(\left(\frac{1}{n}\right)^{1/(p+4)}\right),
$$

then

$$
\lim \frac{\sqrt{nh^p}\left[(E[\hat t(x)] - t(x)) - g(x)\,(E[\hat f(x)] - f(x))\right]}{f(x)} = 0,
$$

and the bias term vanishes from the asymptotic distribution,

$$
\sqrt{nh^p}\,(\hat g(x) - g(x)) \to_d N\!\left(0,\; \frac{\sigma^2(x)}{f(x)} \int [K(u)]^2\,du\right),
$$

as for the kernel density estimator $\hat f(x)$.
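Under undersmoothing, the centered limit distribution suggests a pointwise confidence interval of the form $\hat g(x) \pm z\,[\hat\sigma^2(x)\int K^2 / (nh\hat f(x))]^{1/2}$ for $p = 1$. The sketch below is one way this might be computed and is not taken from the notes: the Gaussian kernel, the plug-in estimator `sigma2_hat` (a kernel-weighted average of squared residuals), the bandwidth $h = n^{-1/3}$, the function name, and the simulated data are all assumptions made for illustration.

```python
import numpy as np

def nw_confidence_interval(x0, x, y, h, z=1.96):
    """Pointwise interval for g(x0), p = 1, Gaussian kernel, assuming undersmoothing."""
    n = x.shape[0]
    k = np.exp(-0.5 * ((x0 - x) / h) ** 2) / np.sqrt(2.0 * np.pi)
    f_hat = k.sum() / (n * h)                  # kernel density estimate of f(x0)
    w = k / k.sum()                            # Nadaraya-Watson weights
    g_hat = w @ y                              # estimate of g(x0)
    sigma2_hat = w @ (y - g_hat) ** 2          # plug-in estimate of sigma^2(x0) (an assumption)
    roughness = 1.0 / (2.0 * np.sqrt(np.pi))   # integral of K(u)^2 for the Gaussian kernel
    se = np.sqrt(sigma2_hat * roughness / (n * h * f_hat))
    return float(g_hat - z * se), float(g_hat), float(g_hat + z * se)

# Simulated data (hypothetical): y = sin(x) + noise.
rng = np.random.default_rng(1)
n = 2000
x = rng.uniform(-2.0, 2.0, size=n)
y = np.sin(x) + 0.2 * rng.standard_normal(n)
h = n ** (-1.0 / 3.0)  # shrinks faster than n^(-1/5), i.e. undersmoothing for p = 1
lo, g_hat, hi = nw_confidence_interval(0.5, x, y, h)
```

If the bandwidth instead shrinks at the optimal rate, an interval built this way is centered incorrectly, because the bias is then of the same order as the standard error.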

## Discrete Regressors

## Some Other Nonparametric Regression Methods

## Cross-Validation
