Econ 551: Lecture Note 7
Asymptotic Efficiency of Maximum Likelihood
The Hajek Representation Theorem

1. Background: For simplicity, we will assume a ``cross sectional framework'' where we have IID observations $\{x_1, \ldots, x_N\}$ from a ``true density'' $f(x|\theta^*)$, where $\theta^*$ is an interior point of a compact parameter space $\Theta \subset R^K$ and $f(x|\theta)$ satisfies standard regularity conditions given, e.g., in White (1982): $\theta^*$ is the unique maximizer of $E\{\log f(\tilde x|\theta)\}$, $\log f(x|\theta)$ is twice continuously differentiable in $\theta$, and the absolute value of each element of the hessian of $\log f(x|\theta)$ is bounded by an integrable function of $x$. It would be straightforward to generalize the results presented here to time-series or heterogeneous INID observations, but at the cost of extra assumptions and extra notational complexity. The key maintained assumption needed for the results below is the hypothesis of correct specification: i.e. that $f(x|\theta^*)$ is the true data generating process for $\{x_1, \ldots, x_N\}$.

2. Motivation: In previous lectures I derived the Cramer-Rao inequality and showed that the inverse of the information matrix equals the Cramer-Rao lower bound. While this bound holds for all $N$ (where $N$ is the number of observations), the Cramer-Rao inequality can generally only be used to prove efficiency of unbiased estimators, by showing that their variance equals the Cramer-Rao lower bound. However, as we noted in Econ 551, most estimators (including most maximum likelihood estimators) are biased in finite samples. Although the Cramer-Rao inequality holds for biased as well as unbiased estimators, in general it is impossible to evaluate the lower bound for biased estimators since it involves the bias term, which is generally unknown. The Cramer-Rao lower bound becomes more useful in large samples for consistent estimators, since consistency implies that the estimators are asymptotically unbiased. In particular, we showed that maximum likelihood estimators are consistent and asymptotically normal under weak regularity conditions. In addition, if the parametric model is correctly specified, the asymptotic covariance matrix of the maximum likelihood estimator equals the inverse of the information matrix. This suggests that maximum likelihood estimators are asymptotically efficient.
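A small Monte Carlo sketch can make the last two claims concrete. The model below (an Exponential density) is an illustrative assumption of mine, not an example from the note: for $f(x|\theta) = \theta e^{-\theta x}$ the information is $I(\theta) = 1/\theta^2$, so if the MLE $\hat\theta = 1/\bar x$ is asymptotically efficient, $\sqrt{N}(\hat\theta - \theta^*)$ should have variance close to $I(\theta^*)^{-1} = (\theta^*)^2$.

```python
import numpy as np

# Monte Carlo sketch (illustrative Exponential(theta) model assumed here,
# not from the note): f(x|theta) = theta*exp(-theta*x) has Fisher
# information I(theta) = 1/theta^2, so the asymptotic variance of
# sqrt(N)*(theta_hat - theta*) for the MLE theta_hat = 1/x_bar should be
# near I(theta*)^{-1} = theta*^2.
rng = np.random.default_rng(0)
theta_star, N, R = 2.0, 2000, 5000
x = rng.exponential(scale=1.0 / theta_star, size=(R, N))  # R replications
theta_hat = 1.0 / x.mean(axis=1)                          # MLE in each one
scaled = np.sqrt(N) * (theta_hat - theta_star)
print(scaled.var())  # close to theta_star**2 = 4 for large N and R
```

Note that $\hat\theta = 1/\bar x$ is biased in finite samples (its mean is $N\theta^*/(N-1)$), yet the scaled error still attains the Cramer-Rao variance asymptotically, which is exactly the point of the motivation above.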

3. The problem of ``superefficiency'': In Econ 551 we discussed the ``superefficiency'' counterexamples of Stein and Hodges, which showed that one can construct estimators that do better than maximum likelihood (in the sense of having a smaller asymptotic variance than the ML estimator) if the true parameter $\theta^*$ is equal to certain isolated points of the parameter space. A lot of effort and ingenuity has gone into finding ways to rule out these superefficient counterexamples. Statisticians realized that the superefficient estimators were irregular in the sense that if the true parameter were arbitrarily close but not equal to a point of superefficiency, the superefficient estimator would do worse than the maximum likelihood estimator. In other words, the asymptotic distribution of superefficient estimators can be adversely affected by small perturbations in the true parameter $\theta^*$. Thus, we can rule out the superefficient counterexamples by restricting our attention to regular estimators, i.e. those for which the asymptotic distribution is invariant to small perturbations in the value of the true parameter $\theta^*$. This is formalized in the following definition.
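The Hodges phenomenon is easy to reproduce by simulation. The setup below is an illustrative assumption of mine (IID $N(\theta,1)$ data), not taken from the note: Hodges' estimator equals the sample mean $\bar x$ unless $|\bar x| \le N^{-1/4}$, in which case it is shrunk to 0. At $\theta = 0$ it beats the MLE, but at a nearby local alternative $\theta = 3/\sqrt{N}$ it does strictly worse.

```python
import numpy as np

# Simulation sketch of Hodges' superefficient estimator (illustrative
# N(theta, 1) setup assumed here, not from the note).  The estimator
# shrinks the sample mean to 0 whenever |x_bar| <= N^(-1/4).  Its scaled
# MSE beats the MLE's at theta = 0 but is worse at theta = 3/sqrt(N).
rng = np.random.default_rng(1)
N, R = 10_000, 4000

def scaled_mse(theta):
    x = rng.normal(theta, 1.0, size=(R, N))
    mle = x.mean(axis=1)                                   # MLE = x_bar
    hodges = np.where(np.abs(mle) > N ** -0.25, mle, 0.0)  # shrink to 0
    return (N * (mle - theta) ** 2).mean(), (N * (hodges - theta) ** 2).mean()

mse_mle_0, mse_hodges_0 = scaled_mse(0.0)              # Hodges wins at 0
mse_mle_loc, mse_hodges_loc = scaled_mse(3.0 / np.sqrt(N))  # ... loses nearby
print(mse_hodges_0, mse_mle_0)
print(mse_hodges_loc, mse_mle_loc)
```

The threshold $N^{-1/4}$ shrinks so slowly relative to the $N^{-1/2}$ sampling noise that, at the local alternative, $\bar x$ is still inside the shrinkage region with probability near one, so the estimator sits at 0 and pays scaled squared error $N\theta^2 = 9$ while the MLE pays about 1: precisely the irregularity discussed above.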

Definition: An estimator $\hat\theta_N$ is a regular estimator of a $K \times 1$ parameter vector $\theta^*$ if the following conditions hold:

1.
$\sqrt{N}(\hat\theta_N - \theta^*) \Rightarrow W(\theta^*)$, where $W(\theta^*)$ is some random vector depending on $\theta^*$.

2.
Consider a sequence of local alternatives to $\theta^*$ of the form $\theta_N = \theta^* + \delta/\sqrt{N}$. Suppose the observations $\{x_1, \ldots, x_N\}$ form a triangular array, i.e. for each $N$ they are IID draws from $f(x|\theta_N)$. Then we have

$$ \sqrt{N}(\hat\theta_N - \theta_N) \Rightarrow W(\theta^*) \qquad (1) $$

independent of $\delta$, and this convergence is uniform for $|\delta| \le C$ for any constant $C > 0$.

Comment: It is not hard to show that all of the ``counterexamples'' to the efficiency of maximum likelihood estimation, including the Stein estimator and the superefficient estimator of Hodges, are ``irregular'' in the sense of failing to satisfy the condition in equation (1): the asymptotic distribution of these estimators depends on the vector $\delta$, i.e. it depends on the ``direction of approach'' of the sequence of local alternatives $\{\theta_N\}$ as they converge to $\theta^*$. By choosing ``poor directions of approach'' we can show that there are sequences of local alternatives for which the superefficient estimators do worse than maximum likelihood. Of course, for our definition to make sense we need to show that the maximum likelihood estimator is regular.

Lemma: The maximum likelihood estimator $\hat\theta_N$ is a regular estimator of $\theta^*$.

Proof: (sketch) Expanding the first order condition for $\hat\theta_N$ about the true parameter vector $\theta_N = \theta^* + \delta/\sqrt{N}$ and solving for $\sqrt{N}(\hat\theta_N - \theta_N)$ we get:

$$ \sqrt{N}(\hat\theta_N - \theta_N) = -H_N(\bar\theta_N)^{-1}\, \frac{1}{\sqrt{N}} \sum_{i=1}^N \frac{\partial}{\partial\theta}\log f(x_i|\theta_N) \qquad (2) $$

where $H_N(\theta) = \frac{1}{N}\sum_{i=1}^N \frac{\partial^2}{\partial\theta\,\partial\theta'}\log f(x_i|\theta)$ is the average of the hessians of the log-likelihood terms and $\bar\theta_N$ is a point on the line segment between $\theta_N$ and $\hat\theta_N$. Applying the Lindeberg-Levy Central Limit Theorem we can show that

$$ \frac{1}{\sqrt{N}} \sum_{i=1}^N \frac{\partial}{\partial\theta}\log f(x_i|\theta_N) \Rightarrow N(0, I(\theta^*)) \qquad (3) $$

and using the fact that the average hessian of the log-likelihood converges uniformly to $-I(\theta^*)$ and $\bar\theta_N$ converges with probability 1 to $\theta^*$, we have that

$$ \sqrt{N}(\hat\theta_N - \theta_N) \Rightarrow N(0, I(\theta^*)^{-1}) \qquad (4) $$

independent of $\delta$, i.e. the ML estimator $\hat\theta_N$ is a regular estimator of $\theta^*$.
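The regularity of the MLE can also be seen numerically. The model below is again an illustrative Exponential assumption of mine: data are drawn from the local alternative $\theta_N = \theta^* + \delta/\sqrt{N}$, and the distribution of $\sqrt{N}(\hat\theta_N - \theta_N)$ is essentially the same whether $\delta = 0$ or $\delta = 2$, with variance near $I(\theta^*)^{-1} = (\theta^*)^2$ in both cases.

```python
import numpy as np

# Monte Carlo sketch of regularity (illustrative Exponential(theta) model
# assumed here, not from the note): draw from the local alternative
# theta_N = theta* + delta/sqrt(N) and check that the distribution of
# sqrt(N)*(theta_hat - theta_N) does not depend on delta, with variance
# near I(theta*)^{-1} = theta*^2 in each case.
rng = np.random.default_rng(2)
theta_star, N, R = 2.0, 4000, 5000

def scaled_errors(delta):
    theta_N = theta_star + delta / np.sqrt(N)
    x = rng.exponential(scale=1.0 / theta_N, size=(R, N))
    theta_hat = 1.0 / x.mean(axis=1)          # MLE under the alternative
    return np.sqrt(N) * (theta_hat - theta_N)

v0 = scaled_errors(0.0).var()
v2 = scaled_errors(2.0).var()
print(v0, v2)   # both near theta_star**2 = 4
```

Contrast this with the Hodges estimator simulated earlier, whose centered, scaled error changes drastically between $\delta = 0$ and nearby $\delta \ne 0$.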

4. Log-Likelihood Ratios and Local Asymptotic Normality. Hajek, LeCam and others realized that the asymptotic properties of maximum likelihood estimators are a consequence of a property they termed local asymptotic normality: the log-likelihood ratio converges in distribution to a particular normal random variable. To state this property formally, it is helpful to consider the following sequence of local alternatives $\{\theta_N\}$ to $\theta^*$ formed as follows:

$$ \theta_N = \theta^* + \frac{I(\theta^*)^{-1/2}\,\delta}{\sqrt{N}} \qquad (5) $$

where $I(\theta^*) = -E\left\{\frac{\partial^2}{\partial\theta\,\partial\theta'}\log f(\tilde x|\theta^*)\right\}$ is the information matrix.

Definition: The parametric model $f(x|\theta)$ is said to have the local asymptotic normality (LAN) property at $\theta^*$ iff for any $\delta \in R^K$ we have:

$$ \lambda_N \Rightarrow \delta' Z - \frac{1}{2}\,\delta'\delta \qquad (6) $$

where $\lambda_N \equiv \log[L_N(\theta_N)/L_N(\theta^*)]$,

$$ L_N(\theta) = \prod_{i=1}^N f(x_i|\theta) \qquad (7) $$

is the likelihood for $N$ observations, $\{\theta_N\}$ is the sequence of local alternatives given in equation (5), and $Z \sim N(0,I)$ where $I$ is the $K \times K$ identity matrix.

Lemma: If $f(x|\theta)$ satisfies standard regularity conditions (e.g. White 1982), then the parametric model $f(x|\theta)$ has the LAN property at $\theta^*$.

Proof: Expanding $\log L_N(\theta_N)$ in a second-order Taylor series about $\theta^*$ we get:

$$ \log L_N(\theta_N) = \log L_N(\theta^*) + (\theta_N - \theta^*)' \sum_{i=1}^N \frac{\partial}{\partial\theta}\log f(x_i|\theta^*) + \frac{1}{2}(\theta_N - \theta^*)' \left[\sum_{i=1}^N \frac{\partial^2}{\partial\theta\,\partial\theta'}\log f(x_i|\bar\theta_N)\right] (\theta_N - \theta^*) \qquad (8) $$

where $\bar\theta_N$ is on the line segment between $\theta^*$ and $\theta_N$. Substituting $\theta_N - \theta^* = I(\theta^*)^{-1/2}\delta/\sqrt{N}$ from equation (5), we can rewrite the log-likelihood ratio as

$$ \lambda_N = \delta' I(\theta^*)^{-1/2}\, \frac{1}{\sqrt{N}}\sum_{i=1}^N \frac{\partial}{\partial\theta}\log f(x_i|\theta^*) + \frac{1}{2}\,\delta' I(\theta^*)^{-1/2} H_N(\bar\theta_N)\, I(\theta^*)^{-1/2}\,\delta \qquad (9) $$

where $H_N(\bar\theta_N)$ is the average of the hessians of the log-likelihood terms. Using the Central Limit Theorem it is easy to show that the second term on the right hand side of equation (8) converges in distribution to $\delta' Z$, where $Z \sim N(0,I)$ and $I$ is the $K \times K$ identity matrix. Using the uniform law of large numbers, which gives $H_N(\bar\theta_N) \to -I(\theta^*)$ with probability 1, it is easy to show the third term on the right hand side of equation (8) converges with probability 1 to $-\frac{1}{2}\,\delta'\delta$.
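The LAN limit can be checked numerically. The model below is an illustrative $N(\theta,1)$ assumption of mine, not from the note: there $I(\theta) = 1$, the local alternative is $\theta_N = \theta^* + \delta/\sqrt{N}$, and the log-likelihood ratio $\lambda_N$ is (in this Gaussian case, exactly) distributed as $N(-\delta^2/2,\ \delta^2)$ under $\theta^*$, matching the limit $\delta' Z - \frac{1}{2}\delta'\delta$ with $Z \sim N(0,1)$.

```python
import numpy as np

# Numerical check of the LAN limit (illustrative N(theta, 1) model assumed
# here, not from the note).  With I(theta) = 1 and
# theta_N = theta* + delta/sqrt(N), the log-likelihood ratio
# lambda_N = log[L_N(theta_N)/L_N(theta*)] is exactly
# N(-delta^2/2, delta^2) under theta*.
rng = np.random.default_rng(3)
theta_star, delta, N, R = 0.5, 1.5, 1000, 20_000
x = rng.normal(theta_star, 1.0, size=(R, N))
theta_N = theta_star + delta / np.sqrt(N)
lam = (0.5 * (x - theta_star) ** 2 - 0.5 * (x - theta_N) ** 2).sum(axis=1)
print(lam.mean(), lam.var())  # near -delta^2/2 = -1.125 and delta^2 = 2.25
```

The mean $-\delta^2/2$ is no accident: it is exactly the amount needed for $E\{\exp(\lambda_N)\} = 1$ in the limit, i.e. for the local alternatives to remain mutually contiguous with $\theta^*$, which is what the change-of-measure step in the Hajek proof below exploits.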

Hajek Representation Theorem: Suppose the parametric model $f(x|\theta)$ has the LAN property at $\theta^*$. Then if $\hat\theta_N$ is any regular estimator of $\theta^*$ we have:

$$ \sqrt{N}\, I(\theta^*)^{1/2}(\hat\theta_N - \theta^*) \Rightarrow W = Z + V \sim N(0,I) \star P_V \qquad (10) $$

where $Z \sim N(0,I)$ ($I$ is the $K \times K$ identity matrix), $P_V$ denotes the distribution of the random vector $V$, and $\star$ denotes convolution, i.e. the random vectors $Z$ and $V$ are independently distributed.

Proof: (sketch) Since $\hat\theta_N$ is assumed to be a regular estimator of $\theta^*$, its asymptotic distribution does not depend on the direction of approach of a sequence of local alternatives $\{\theta_N\}$:

$$ \sqrt{N}\, I(\theta^*)^{1/2}(\hat\theta_N - \theta_N) \Rightarrow W \qquad (11) $$

where $W$ is some $K \times 1$ random vector that is independent of the direction of approach of $\theta_N$ to $\theta^*$ as indexed by $\delta$. Let $\psi$ denote the characteristic function of $W$, i.e.

$$ \psi(t) = E\{\exp(i\, t' W)\} \qquad (12) $$

where $i = \sqrt{-1}$ and $t' W$ denotes the inner product of the vectors $t$ and $W$ in $R^K$. Since convergence in distribution implies pointwise convergence of characteristic functions we have:

$$ \lim_{N \to \infty} E_N\{\exp(i\, t'\, \sqrt{N}\, I(\theta^*)^{1/2}(\hat\theta_N - \theta_N))\} = \psi(t) \qquad (13) $$

for each $t \in R^K$, where $E_N$ denotes the expectation with respect to the underlying random variables $\{x_1, \ldots, x_N\}$, which are IID draws from the density $f(x|\theta_N)$. Now we can write

$$ \psi(t) = e^{-|t|^2/2}\left[ e^{|t|^2/2}\, \psi(t) \right] \qquad (14) $$

Notice that tex2html_wrap_inline448 is the CF of a N(0,I) random vector. If we can show that tex2html_wrap_inline452 is also a CF, then by the Inversion theorem for characteristic functions it follows that tex2html_wrap_inline452 is the CF for some random vector tex2html_wrap_inline456 and tex2html_wrap_inline458 , so the Hajek Representation Theorem (the result in equation (10)) will follow as a special case when tex2html_wrap_inline460 . To show that tex2html_wrap_inline452 is a CF we appeal to the

Lévy-Cramer Continuity Theorem: If a sequence of random vectors $\{Y_N\}$ has characteristic functions $\{\psi_N(t)\}$ that converge pointwise to some function $g(t)$, and if $g(t)$ is continuous at $t = 0$, then $g(t)$ is the characteristic function of some random vector $Y$ and $Y_N \Rightarrow Y$.
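A deterministic illustration of the theorem's premise may help (the example is an assumption of mine, not from the note): for IID Uniform$(-\sqrt{3}, \sqrt{3})$ summands (mean 0, variance 1), the CF of $\sqrt{N}\,\bar x$ can be written in closed form, and it converges pointwise to $e^{-t^2/2}$, which is continuous at $t = 0$; the theorem then identifies the limit as the CF of a $N(0,1)$ random variable.

```python
import numpy as np

# Deterministic illustration of the continuity theorem's premise
# (example assumed here, not from the note).  For IID
# Uniform(-sqrt(3), sqrt(3)) summands with variance 1, the CF of
# sqrt(N)*x_bar is [sin(sqrt(3)t/sqrt(N)) / (sqrt(3)t/sqrt(N))]^N,
# which converges pointwise to exp(-t^2/2), the N(0,1) CF.
t = np.linspace(0.1, 3.0, 30)
for N in (10, 100, 10_000):
    u = np.sqrt(3.0) * t / np.sqrt(N)
    cf_N = (np.sin(u) / u) ** N          # CF of the standardized mean
    err = np.max(np.abs(cf_N - np.exp(-t ** 2 / 2)))
    print(N, err)                        # err shrinks as N grows
```

This is, of course, just the CLT proved through characteristic functions; the proof below applies the same machinery to the sequence whose limit CF is $\phi(t)$.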

Since $\psi$ is a CF, Bochner's Theorem (a theorem providing necessary and sufficient conditions for a function to be the CF of some random vector) guarantees that it is continuous at $t = 0$, so the product $\phi(t) = e^{|t|^2/2}\,\psi(t)$ is also continuous at $t = 0$. So to complete the proof of the Hajek Representation Theorem we need to show that there is a sequence of characteristic functions converging pointwise to $\phi(t)$. Let $Y_N = \sqrt{N}\, I(\theta^*)^{1/2}(\hat\theta_N - \theta^*)$. Then we can write

$$ E_N\{\exp(i\, t'\, \sqrt{N}\, I(\theta^*)^{1/2}(\hat\theta_N - \theta_N))\} = E_N\{\exp(i\, t'(Y_N - \delta))\} = e^{-i t'\delta}\, E_*\{\exp(i\, t' Y_N + \lambda_N)\} \qquad (15) $$

using $\sqrt{N}\, I(\theta^*)^{1/2}(\hat\theta_N - \theta_N) = Y_N - \delta$ and the change of measure $E_N\{g\} = E_*\{g\, \exp(\lambda_N)\}$, where $E_*$ denotes expectation with respect to IID draws from $f(x|\theta^*)$ and $\lambda_N = \log[L_N(\theta_N)/L_N(\theta^*)]$. Now here is the heuristic part of the proof (for a fully rigorous proof that justifies this step see Ibragimov, I.A. and Khas'minskii, R.Z. (1981) Statistical Estimation: Asymptotic Theory, Springer Verlag). By the LAN condition in equation (6) we have for large $N$:

$$ \lambda_N \approx \delta' Z_N - \frac{1}{2}\,\delta'\delta \qquad (16) $$

where $Z_N \equiv I(\theta^*)^{-1/2}\, \frac{1}{\sqrt{N}} \sum_{i=1}^N \frac{\partial}{\partial\theta}\log f(x_i|\theta^*) \Rightarrow Z \sim N(0,I)$. So substituting equation (16) into equation (15), and recalling that the left hand side of equation (15) converges to $\psi(t)$, we get:

$$ \psi(t) = \lim_{N\to\infty} e^{-i t'\delta - \frac{1}{2}\delta'\delta}\, E_*\{\exp(i\, t' Y_N + \delta' Z_N)\} \qquad (17) $$

This implies that

$$ e^{i t'\delta + \frac{1}{2}\delta'\delta}\, \psi(t) = \lim_{N\to\infty} E_*\{\exp(i\, t' Y_N + \delta' Z_N)\} \qquad (18) $$

Now let $\delta = -it$ (this can be justified by a bit of complex analysis -- Montel's Theorem). Then we have:

$$ e^{|t|^2/2}\, \psi(t) = \lim_{N\to\infty} E_*\{\exp(i\, t'(Y_N - Z_N))\} \qquad (19) $$

This equation states that the characteristic function of the random vector $Y_N - Z_N$ converges pointwise to the function $\phi(t) = e^{|t|^2/2}\,\psi(t)$. It follows from the Lévy-Cramer Continuity Theorem that this function is a characteristic function for some random vector $V$. In summary we have:

$$ Y_N - Z_N \Rightarrow V, \qquad E\{\exp(i\, t' V)\} = \phi(t) = e^{|t|^2/2}\,\psi(t), \qquad Z_N \Rightarrow Z \sim N(0,I) \qquad (20) $$

It follows that

$$ \psi(t) = e^{-|t|^2/2}\, \phi(t) = E\{\exp(i\, t' Z)\}\; E\{\exp(i\, t' V)\} \qquad (21) $$

so by the Lévy-Cramer continuity theorem and the inversion theorem for characteristic functions we have:

$$ \sqrt{N}\, I(\theta^*)^{1/2}(\hat\theta_N - \theta^*) \Rightarrow W = Z + V \qquad (22) $$

where $Z \sim N(0,I)$ is independent of $V$, which is the convolution representation in equation (10).
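A quick numerical sketch of what the representation implies (the noise distribution below is an arbitrary illustrative assumption of mine): with $Z \sim N(0,I)$ independent of $V$, the covariance of $W = Z + V$ is $I + \mathrm{Cov}(V)$, which dominates the identity in the positive semidefinite sense. So no regular estimator can have a smaller asymptotic covariance than the bound attained by maximum likelihood, which corresponds to the degenerate case $V = 0$.

```python
import numpy as np

# Monte Carlo sketch of the convolution representation W = Z + V
# (illustrative Laplace noise assumed here, not from the note): with
# Z ~ N(0, I) independent of V, Cov(W) = I + Cov(V) >= I in the positive
# semidefinite sense; the MLE attains the bound with V = 0.
rng = np.random.default_rng(4)
R = 200_000
Z = rng.normal(size=(R, 2))
V = rng.laplace(scale=0.7, size=(R, 2))   # any independent noise works
W = Z + V
excess = np.cov(W.T) - np.eye(2)          # estimates Cov(V)
print(np.linalg.eigvalsh(excess))         # eigenvalues near (0.98, 0.98)
```

Here $\mathrm{Cov}(V) = 2(0.7)^2 I = 0.98\, I$, so the estimated "excess" covariance over the efficiency bound has both eigenvalues near 0.98 and, in particular, is positive semidefinite.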





John Rust
Mon Apr 21 15:48:19 CDT 1997