Notes On Nonparametric Regression Estimation
James L. Powell, Department of Economics, University of California, Berkeley
The Nadaraya-Watson Kernel Regression Estimator

Suppose that $z_i \equiv (y_i, x_i')'$ is a $(p+1)$-dimensional random vector that is jointly continuously distributed, with $y_i$ being a scalar random variable. Denoting the joint density function of $z_i$ as $f_{y,x}(y,x)$, the conditional mean $g(x)$ of $y_i$ given $x_i = x$ (assuming it exists) is given by
$$g(x) \equiv E[y_i \mid x_i = x] = \frac{\int y\, f_{y,x}(y,x)\, dy}{\int f_{y,x}(y,x)\, dy} = \frac{\int y\, f_{y,x}(y,x)\, dy}{f_x(x)},$$
where $f_x(x)$ is the marginal density function of $x_i$. If $\hat f_{y,x}(y,x)$ is the kernel density estimator of $f_{y,x}(y,x)$, i.e.,
$$\hat f_{y,x}(y,x) = \frac{1}{n}\sum_{i=1}^n \frac{1}{h^{p+1}}\, \tilde K\!\left(\frac{y-y_i}{h}, \frac{x-x_i}{h}\right)$$
for some $(p+1)$-dimensional kernel function $\tilde K(v,u)$ satisfying $\int \tilde K(v,u)\, dv\, du = 1$, then an analogue estimator for $g(x) = E[y_i \mid x_i = x]$ would substitute the kernel estimator $\hat f_{y,x}$ for $f_{y,x}$ in the expression for $g(x)$. Further assuming that the first "moment" of $\tilde K$ is zero,
$$\int \begin{pmatrix} v \\ u \end{pmatrix} \tilde K(v,u)\, dv\, du = 0$$
(which could be ensured by choosing a $\tilde K$ that is symmetric about zero with bounded support), this analogue estimator for $g(x)$ can be simplified to
$$\hat g(x) = \frac{\int y\, \hat f_{y,x}(y,x)\, dy}{\int \hat f_{y,x}(y,x)\, dy} = \frac{\frac{1}{nh^p}\sum_{i=1}^n K\!\left(\frac{x-x_i}{h}\right) y_i}{\frac{1}{nh^p}\sum_{i=1}^n K\!\left(\frac{x-x_i}{h}\right)},$$
where
$$K(u) \equiv \int \tilde K(v,u)\, dv.$$
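To see how the simplification works, substitute $y = y_i + hv$ in the numerator:

$$\int y\, \hat f_{y,x}(y,x)\, dy = \frac{1}{n}\sum_{i=1}^n \frac{1}{h^p} \int (y_i + hv)\, \tilde K\!\left(v, \frac{x-x_i}{h}\right) dv = \frac{1}{nh^p}\sum_{i=1}^n \left[ y_i\, K\!\left(\frac{x-x_i}{h}\right) + h \int v\, \tilde K\!\left(v, \frac{x-x_i}{h}\right) dv \right],$$

and the second term in the brackets drops out when $\tilde K$ is symmetric about zero in its first argument; the same substitution in the denominator gives $\int \hat f_{y,x}(y,x)\, dy = \frac{1}{nh^p}\sum_{i=1}^n K\!\left(\frac{x-x_i}{h}\right)$.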
The estimator $\hat g(x)$, known as the Nadaraya-Watson kernel regression estimator, can be written as a weighted average
$$\hat g(x) = \sum_i w_{in}\, y_i,$$
where
$$w_{in} \equiv \frac{K\!\left(\frac{x-x_i}{h}\right)}{\sum_{j=1}^n K\!\left(\frac{x-x_j}{h}\right)}$$
has $\sum_i w_{in} = 1$. Since $K(u) \to 0$ as $\|u\| \to \infty$ (because $K$ is integrable), it follows that $w_{in} \to 0$ for fixed $h$ as $\|x - x_i\| \to \infty$, and also that $w_{in} \to 0$ for fixed $\|x - x_i\|$ as $h \to 0$; hence $\hat g(x)$ is a "locally-weighted average" of the dependent variable $y_i$, with increasing weight put on observations with values of $x_i$ that are close to the target value $x$ as $n \to \infty$.
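As a concrete illustration, here is a minimal Python sketch of this weighted-average form; the Gaussian kernel choice and the simulated data are illustrative assumptions, not taken from the notes.

```python
import numpy as np

def nadaraya_watson(x0, X, y, h, kernel=None):
    """Nadaraya-Watson estimate of E[y | x = x0].

    X : (n, p) array of regressors, y : (n,) responses, h : bandwidth.
    The default Gaussian product kernel is an illustrative choice.
    """
    if kernel is None:
        kernel = lambda u: np.exp(-0.5 * np.sum(u**2, axis=1))  # Gaussian shape; constant cancels
    u = (x0 - X) / h            # (x - x_i)/h for each observation
    w = kernel(u)               # unnormalized weights K((x - x_i)/h)
    w = w / w.sum()             # w_in, summing to one
    return np.dot(w, y)         # locally weighted average of the y_i

# Simulated example (hypothetical data-generating process)
rng = np.random.default_rng(0)
X = rng.uniform(-2, 2, size=(500, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.standard_normal(500)
print(nadaraya_watson(np.array([1.0]), X, y, h=0.3))  # estimate of g(1) = sin(1)
```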
For the special case of $p = 1$ (i.e., one regressor) and $K(u) = 1\{|u| \le 1/2\}$ (the density of a $Uniform(-1/2, 1/2)$ variate), the kernel regression estimator $\hat g(x)$ takes the form

$$\hat g(x) = \frac{\sum_{i=1}^n 1\{x - h/2 \le x_i \le x + h/2\}\, y_i}{\sum_{i=1}^n 1\{x - h/2 \le x_i \le x + h/2\}},$$
an average of $y_i$ values with corresponding $x_i$ values within $h/2$ of $x$. This estimator is sometimes called the "regressogram," in analogy with the histogram estimator of a density function at $x$.
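Continuing the sketch above, passing the uniform kernel $1\{|u| \le 1/2\}$ to the same function reproduces the regressogram (the bandwidth below is again an illustrative choice).

```python
# Uniform kernel 1{|u| <= 1/2}: the weights w_in are equal for all x_i in the
# window [x - h/2, x + h/2], so the estimate is a plain average of those y_i.
uniform = lambda u: (np.abs(u[:, 0]) <= 0.5).astype(float)
print(nadaraya_watson(np.array([1.0]), X, y, h=0.5, kernel=uniform))
```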
Derivation of the conditions for consistency of $\hat g(x)$, and of its rate of convergence to $g(x)$, follows the analogous derivations for the kernel density estimator. Indeed, $\hat g(x)$ can be written as the ratio
$$\hat g(x) = \frac{\hat t(x)}{\hat f(x)},$$
where $\hat t(x) \equiv \frac{1}{nh^p}\sum_{i=1}^n K\!\left(\frac{x-x_i}{h}\right) y_i$ is the numerator above and $\hat f(x)$ is the usual kernel density estimator of the marginal density of $x_i$; so the conditions for consistency of the denominator of $\hat g(x)$ (i.e., $h \to 0$ and $nh^p \to \infty$ as $n \to \infty$) have already been established, and it is easy to show the same conditions imply that

$$\hat t(x) \to_p t(x) \equiv g(x) f(x).$$
The bias and variance of the numerator $\hat t(x)$ are also straightforward extensions of the corresponding formulae for the kernel density estimator $\hat f(x)$; here
"
#
E[t^(x)] = E n1hp Xn K x h xi yi
"
i=1
#
1 Xn x xi
= E nhp K h
g(xi)
Z
i=1
1 xz = hp K h g(x)f (z)dz
Z
= K(u)g(x hu)f (x hu)du;
which is the same formula as for the expectation of $\hat f(x)$ with "$g(x)f(x)$" replacing "$f(x)$" throughout. Assuming the product $g(x)f(x)$ is twice continuously differentiable, etc., the same Taylor's series expansion as for the bias of $\hat f(x)$ yields the bias of $\hat t(x)$ as
$$E[\hat t(x)] - g(x) f(x) = \frac{h^2}{2}\, \mathrm{tr}\!\left\{\frac{\partial^2 g(x) f(x)}{\partial x\, \partial x'} \int u u'\, K(u)\, du\right\} + o(h^2) = O(h^2).$$
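For example, in the scalar case $p = 1$ this trace form reduces to

$$E[\hat t(x)] - g(x) f(x) = \frac{h^2}{2}\, \big(g(x) f(x)\big)'' \int u^2 K(u)\, du + o(h^2).$$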
And the variance of $\hat t(x)$ is

$$\begin{aligned}
\mathrm{Var}(\hat t(x)) &= \mathrm{Var}\!\left(\frac{1}{n}\sum_{i=1}^n \frac{1}{h^p}\, K\!\left(\frac{x-x_i}{h}\right) y_i\right) \\
&= \frac{1}{n}\, E\!\left[\left(\frac{1}{h^p}\, K\!\left(\frac{x-x_i}{h}\right) y_i\right)^2\right] - \frac{1}{n}\left(E[\hat t(x)]\right)^2 \\
&= \frac{1}{n}\int \frac{1}{h^{2p}}\left[K\!\left(\frac{x-z}{h}\right)\right]^2 \left[\sigma^2(z) + g(z)^2\right] f(z)\, dz - \frac{1}{n}\left(E[\hat t(x)]\right)^2 \\
&= \frac{1}{nh^p}\int \left[K(u)\right]^2 \left[\sigma^2(x-hu) + g(x-hu)^2\right] f(x-hu)\, du - \frac{1}{n}\left(E[\hat t(x)]\right)^2 \\
&= \frac{1}{nh^p}\left[\sigma^2(x) + g(x)^2\right] f(x) \int \left[K(u)\right]^2 du + o\!\left(\frac{1}{nh^p}\right),
\end{aligned}$$
where $\sigma^2(x) \equiv \mathrm{Var}[y_i \mid x_i = x]$. So, as for the kernel density estimator, the MSE of the numerator of $\hat g(x)$ is of order $[O(h^2)]^2 + O(1/nh^p)$, and the optimal bandwidth $h$ has

$$h = O\!\left(n^{-1/(p+4)}\right),$$

just like $\hat f(x)$. A "delta method" argument then implies that this yields the best rate of convergence of the ratio $\hat g(x) = \hat t(x)/\hat f(x)$ to the true value $g(x)$.
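As a quick numerical illustration of this rate (the proportionality constant $c$ is a placeholder; its optimal value is not derived here):

```python
def rate_optimal_bandwidth(n, p, c=1.0):
    """Bandwidth shrinking at the MSE-optimal rate n**(-1/(p+4)); c is a placeholder constant."""
    return c * n ** (-1.0 / (p + 4))

for n in (100, 1_000, 10_000):
    print(n, rate_optimal_bandwidth(n, p=1))  # decays like n**(-1/5) when p = 1
```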
Derivation of the asymptotic distribution of $\hat g(x)$ uses that "delta method" argument. First, the Liapunov condition can be verified for the triangular array

$$z_{in} \equiv \frac{1}{h^p}\, K\!\left(\frac{x-x_i}{h}\right)\left(\lambda_1 + \lambda_2\, y_i\right),$$
where $\lambda_1$ and $\lambda_2$ are arbitrary constants, leading to the same requirement as for $\hat f(x)$ (namely, $nh^p \to \infty$ as $h \to 0$ and $n \to \infty$) for $\bar z_n$ to be asymptotically normal, with

$$\begin{aligned}
\sqrt{nh^p}\,(\bar z_n - E[\bar z_n]) &= \sqrt{nh^p}\left[\lambda_1\left(\hat f(x) - E[\hat f(x)]\right) + \lambda_2\left(\hat t(x) - E[\hat t(x)]\right)\right] \\
&\to_d N\!\left(0,\ \left[\lambda_1^2 + 2\lambda_1\lambda_2\, g(x) + \lambda_2^2\left(g(x)^2 + \sigma^2(x)\right)\right] f(x) \int [K(u)]^2\, du\right). \qquad (**)
\end{aligned}$$
The Cramér-Wold device then implies that the numerator $\hat t(x)$ and denominator $\hat f(x)$ are jointly asymptotically normal, and the usual delta method approximation
$$\begin{aligned}
\sqrt{nh^p}\left(\hat g(x) - E[\hat t(x)]/E[\hat f(x)]\right) &= \sqrt{nh^p}\, \frac{E[\hat f(x)]\left(\hat t(x) - E[\hat t(x)]\right) - E[\hat t(x)]\left(\hat f(x) - E[\hat f(x)]\right)}{\hat f(x)\, E[\hat f(x)]} \\
&= \frac{\sqrt{nh^p}\left[\left(\hat t(x) - E[\hat t(x)]\right) - g(x)\left(\hat f(x) - E[\hat f(x)]\right)\right]}{f(x)} \\
&\quad + o_p\!\left(\sqrt{nh^p}\left(\hat t(x) - E[\hat t(x)]\right)\right) + o_p\!\left(\sqrt{nh^p}\left(\hat f(x) - E[\hat f(x)]\right)\right)
\end{aligned}$$
yields
$$\sqrt{nh^p}\left(\hat g(x) - E[\hat t(x)]/E[\hat f(x)]\right) \to_d N\!\left(0,\ \frac{\sigma^2(x)}{f(x)}\int [K(u)]^2\, du\right)$$
after (**) is applied with $\lambda_1 = -g(x)/f(x)$ and $\lambda_2 = 1/f(x)$.
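Plugging these values into the variance in (**) verifies the result:

$$\left[\frac{g(x)^2}{f(x)^2} - \frac{2\, g(x)^2}{f(x)^2} + \frac{g(x)^2 + \sigma^2(x)}{f(x)^2}\right] f(x) \int [K(u)]^2\, du = \frac{\sigma^2(x)}{f(x)}\int [K(u)]^2\, du.$$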
When the bandwidth tends to zero at the optimal rate,
$$h_n = c\, n^{-1/(p+4)},$$
then the asymptotic distribution of $\hat g(x)$ is biased when centered at the true value $g(x)$:
$$\sqrt{nh^p}\left(\hat g(x) - g(x)\right) \to_d N\!\left(\beta(x),\ \frac{\sigma^2(x)}{f(x)}\int [K(u)]^2\, du\right),$$
where now
$$\beta(x) \equiv \lim_{n\to\infty} \frac{\sqrt{nh^p}\left[\left(E[\hat t(x)] - t(x)\right) - g(x)\left(E[\hat f(x)] - f(x)\right)\right]}{f(x)} = \frac{c^{(p+4)/2}}{2 f(x)}\, \mathrm{tr}\!\left\{\left[\frac{\partial^2 g(x) f(x)}{\partial x\, \partial x'} - g(x)\, \frac{\partial^2 f(x)}{\partial x\, \partial x'}\right]\int u u'\, K(u)\, du\right\}.$$
And if the bandwidth tends to zero faster than the optimal rate, i.e., "undersmoothing" is assumed, so that

$$h = o\!\left(n^{-1/(p+4)}\right),$$
then

$$\lim_{n\to\infty} \frac{\sqrt{nh^p}\left[\left(E[\hat t(x)] - t(x)\right) - g(x)\left(E[\hat f(x)] - f(x)\right)\right]}{f(x)} = 0,$$
and the bias term vanishes from the asymptotic distribution,
$$\sqrt{nh^p}\left(\hat g(x) - g(x)\right) \to_d N\!\left(0,\ \frac{\sigma^2(x)}{f(x)}\int [K(u)]^2\, du\right),$$
as for the kernel density estimator $\hat f(x)$.
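As an illustration of how this limiting distribution could be used in practice, here is a minimal Python sketch of a pointwise confidence interval under undersmoothing, with simple plug-in estimates of $\sigma^2(x)$ and $f(x)$; the Gaussian kernel, the plug-in choices, and the simulated data are assumptions made for the example, not prescriptions from the notes.

```python
import numpy as np

def nw_confint(x0, X, y, h):
    """Approximate 95% pointwise CI for g(x0), based on
    sqrt(n h^p)(g_hat(x0) - g(x0)) -> N(0, sigma^2(x0)/f(x0) * int K^2)
    under undersmoothing, with a Gaussian product kernel (illustrative plug-in choices)."""
    n, p = X.shape
    u = (x0 - X) / h
    k = np.exp(-0.5 * np.sum(u**2, axis=1)) / (2 * np.pi) ** (p / 2)  # K((x0 - x_i)/h)
    f_hat = k.sum() / (n * h**p)                        # kernel estimate of f(x0)
    g_hat = np.dot(k, y) / k.sum()                      # Nadaraya-Watson estimate of g(x0)
    sigma2_hat = np.dot(k, (y - g_hat) ** 2) / k.sum()  # locally weighted residual variance
    int_K2 = (4 * np.pi) ** (-p / 2)                    # int K(u)^2 du for the Gaussian kernel
    se = np.sqrt(sigma2_hat * int_K2 / (f_hat * n * h**p))
    return g_hat - 1.96 * se, g_hat + 1.96 * se

rng = np.random.default_rng(1)
X = rng.uniform(-2, 2, size=(2000, 1))
y = np.sin(X[:, 0]) + 0.3 * rng.standard_normal(2000)
h = 2000 ** (-1.0 / 4)  # faster than n**(-1/5), i.e. undersmoothing for p = 1
print(nw_confint(np.array([1.0]), X, y, h))
```

Because the bandwidth shrinks faster than the optimal rate, no bias correction is needed to center the interval at $\hat g(x)$.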
Discrete Regressors
Some Other Nonparametric Regression Methods
Cross-Validation