Solutions to Problem Set 0
Economics 551, Yale University
Woocheol Kim and John Rust
Question 1
We are asked to compute maximum likelihood estimates of the parameter
vector in the model given by:
where we are instructed to believe that
are IID
draws from
and
. Actually, you were misled: the errors
are heteroscedastic, so conditional on
we have
where
. So you will be estimating a
misspecified model, and later in Econ 551 we will discuss
test statistics which are capable of
detecting this misspecification. In the meantime your job is to calculate the
MLEs of the parameters,
. The first step is to write down the likelihood function
for the data,
.
In general ``brute force'' maximization of
may not
be a good idea: it might be better to try a ``divide and
conquer'' strategy. Note that the
joint density of (y,x),
, is a product of a conditional
likelihood of y given x,
, times the marginal
density of x,
. It is easy to see that this
factorization or separability in the joint likelihood
enables us to compute the MLEs for the
parameters
and the
parameters independently. It also implies a
block diagonality property which enables us to show
that the asymptotic
distributions of these parameters are independent.
where denotes the conditional likelihood of the
's
given the
's,
denotes the marginal likelihood
of the
's,
and
.
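To make the factorization explicit, here is the generic form of the argument (the symbols $\theta$, $\sigma^2$ for the conditional-likelihood parameters and $\mu$, $v^2$ for the marginal-density parameters are our own labels, since the original notation is not reproduced here):
\[
f(y_i,x_i) \;=\; f(y_i \mid x_i;\theta,\sigma^2)\, f(x_i;\mu,v^2),
\qquad
\log L \;=\; \sum_{i=1}^N \log f(y_i \mid x_i;\theta,\sigma^2)
          \;+\; \sum_{i=1}^N \log f(x_i;\mu,v^2),
\]
so the two sums can be maximized separately over $(\theta,\sigma^2)$ and $(\mu,v^2)$.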
The separability of parameters in the third and fourth equations allows us
to break the estimation problem into two subproblems, which
ordinarily makes the programming considerably
easier and computations considerably faster:
with FOCs
Note that there is further separability in this first
subproblem: the FOC for is the same as the FOC for
nonlinear least squares (NLS) estimation of
in equation (2) above,
ignoring
since it doesn't affect the solution for
. Once we have computed the NLS estimate of
, we use the second equation to compute
as
the sample variance of the estimated NLS residuals.
with FOCs:
respectively. Using the attached Gauss code
nlreg.gpr
for computing NLS estimates (the sum of
squared errors, derivatives and hessian are coded in the
eval_nls.g
procedure) we are able to numerically solve for a
vector that sets the FOC for
given in
equation (2) above
to zero. There are closed-form expressions for
the MLEs for the remaining parameters:
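As a rough computational illustration (a sketch in Python rather than the authors' Gauss code nlreg.gpr/eval_nls.g), the following carries out the same divide-and-conquer MLE. The regression function f below is a hypothetical placeholder, since the actual specification is not reproduced here; data1.asc is assumed to contain the (y,x) pairs in two columns, and the marginal density of x is assumed normal so that its MLEs are the sample mean and (uncorrected) sample variance.

import numpy as np
from scipy.optimize import least_squares

# Load the (y, x) sample; data1.asc is assumed to hold y in column 1, x in column 2.
data = np.loadtxt("data1.asc")
y, x = data[:, 0], data[:, 1]

# Hypothetical regression function f(x, theta); the actual functional form
# used in the problem set is not shown here.
def f(theta, x):
    return np.exp(theta[0] + theta[1] * x + theta[2] * x ** 2)

# Subproblem 1a: NLS for theta (same FOC as the conditional MLE of theta).
nls = least_squares(lambda th: y - f(th, x), x0=np.zeros(3))
theta_hat = nls.x

# Subproblem 1b: the MLE of sigma^2 is the mean of the squared NLS residuals.
resid = y - f(theta_hat, x)
sigma2_hat = np.mean(resid ** 2)

# Subproblem 2: closed-form MLEs of the marginal-density parameters of x
# (sample mean and uncorrected sample variance under an assumed normal marginal).
mu_hat = np.mean(x)
v2_hat = np.mean((x - mu_hat) ** 2)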
Figures 1 and 2 below plot the true and estimated regression
functions for this problem. Figure 1 plots the data points also: we
see that both the true and estimated regression functions are generally
quite close to each other and both go through the middle of the
``data cloud''. However, we see that the MLE gives a substantially
downward-biased estimate of and this causes the estimated
regression function to diverge substantially from the true regression
function at extreme high and low values of x, say |x| > 4.
However, since there
are very few high or low values of x in the sample, the NLS and MLE
criteria barely ``penalize'' this divergence: the MLE sets
since it helps fit the data around x=0, where
most of the data points are. Figure 2 provides a blow-up of Figure
1 without the data points to show you how the estimated regression
function diverges from the truth near x=0. Overall, despite the
misspecification of the heteroscedasticity, the MLE and NLS estimators
seem to do a pretty good job of uncovering the true regression function,
at least for those x's where we have sufficient data. Note that the
data plot indicates heteroscedasticity, since the variance of the data
around the regression function is bigger in the middle of the graph
(near x=0) than at large positive or negative values of x.
Since the true model used to generate data1.asc
has heteroscedastic and not homoscedastic error terms
as assumed here, it is easy to
show that the MLE
of the misspecified model
converges to the
expectation of
.
We can calculate
as follows:
We leave it to you to show that even though the model is
misspecified, the MLE converges to
as
. Indeed in this
case we find that
, which happens to be
almost exactly equal to the ``true'' value.
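In symbols (with $\sigma^2(x)$ denoting the true conditional variance and $f(x)$ the marginal density of $x$, our own notation), the pseudo-true value toward which the misspecified variance estimate converges is
\[
\sigma^{*2} \;=\; E\bigl[\sigma^2(X)\bigr] \;=\; \int \sigma^2(x)\, f(x)\, dx .
\]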
Recall that the asymptotic distribution of the standardized MLE is given by:
where is the Hessian and
is the information matrix (both evaluated at
). However the likelihood
is misspecified in this case (due to heteroscedasticity), and
it is easy to verify that the equality of
does not hold,
so the correct asymptotic
covariance matrix is given by the White ``misspecification consistent''
formula in equation (4) rather than by the inverse of
the information matrix
For example, consider the
block of
, or
:
We see that a sufficient condition for the two expressions above to
equal each other is for all x, i.e. for the
model to be homoscedastic as you were asked to assume. The failure
of this equality can be a basis for a specification test statistic
that can detect model misspecification, which we will discuss in more
detail below.
Note that despite the misspecification, the separability property implies
that the Hessian is a block diagonal matrix:
where .
It is also easy to verify that, despite the misspecification
due to heteroscedasticity, the information matrix
is still block diagonal. Block
diagonality of
and
implies that the covariance matrix of
is block diagonal. One can
see further block diagonality between
and
and
between
and
. This block diagonality is not
just a consequence
of the separability in both the marginal and conditional likelihood
in the parameters describing the mean or conditional mean (
and
, respectively) and the variance parameters (
and
, respectively). You should verify through direct calculation
that this block diagonality
is a result of the symmetry of the normal distribution, which
implies that
, where the conditional
distribution of
given X is
,
and similarly, the distribution of
,
which implies that
for any positive odd integer k.
If the normal distribution were not symmetric the block diagonality
property wouldn't hold.
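To see concretely how the odd-moment property delivers this finer block diagonality, consider the cross term between the conditional-mean and variance parameters under the homoscedastic normal quasi-likelihood (the notation $\varepsilon = y - f(x,\theta)$ and $f_\theta$ for the gradient of the regression function is our own):
\[
E\!\left[\frac{\partial \log f(y\mid x)}{\partial\theta}\,
         \frac{\partial \log f(y\mid x)}{\partial\sigma^2}\right]
= E\!\left[\frac{\varepsilon f_\theta(x,\theta)}{\sigma^2}\cdot
           \frac{\varepsilon^2-\sigma^2}{2\sigma^4}\right]
= E\!\left[\frac{f_\theta(x,\theta)}{2\sigma^6}
           \Bigl(E[\varepsilon^3\mid x]-\sigma^2\,E[\varepsilon\mid x]\Bigr)\right]
= 0,
\]
because the odd conditional moments $E[\varepsilon\mid x]$ and $E[\varepsilon^3\mid x]$ are zero when the conditional distribution of $\varepsilon$ given $x$ is symmetric about zero.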
Using the block diagonality property, it is easy to compute
the standard errors of the full parameter vector, .
The covariance matrix of
is given by 1/N times
the upper
block of
and it is easy
to verify that this is the same as the covariance matrix for the
nonlinear least squares estimator for
(which is also
numerically identical to the MLE) that is output from
the nlreg.gpr program. By working with equation (4), you
can show that the
estimated variance of
is given by
where
is the sample
analog of the fourth central moment of
:
There is a similar formula for the estimated variance of
. We have
, so
the estimated standard error of
is 0.036. The estimated
standard error of
is
.
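For reference, one way such a formula typically works out (our own derivation from the sandwich formula, with $\hat\varepsilon_i$ the fitted residuals and $\hat\mu_4$ the sample fourth central moment) is
\[
\widehat{\mathrm{Var}}(\hat\sigma^2)\;=\;\frac{\hat\mu_4-\hat\sigma^4}{N},
\qquad
\hat\mu_4=\frac1N\sum_{i=1}^N\hat\varepsilon_i^4,
\]
which is the $(\sigma^2,\sigma^2)$ element of $H^{-1}BH^{-1}/N$ with $H=-1/(2\hat\sigma^4)$ and $B=(\hat\mu_4-\hat\sigma^4)/(4\hat\sigma^8)$; under a correctly specified homoscedastic normal model $\mu_4=3\sigma^4$, and the expression reduces to $2\hat\sigma^4/N$.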
Question 2. If the model is correctly specified, we know that the information equality will hold, which implies that:
We compare White's misspecification consistent estimate to the inverse of the information matrix:
where all the estimates are computed from the results of eval.g.
Following White (``Maximum Likelihood Estimation
of Misspecified Models'' Econometrica 1982)
we can construct a formal hypothesis test statistic using the
difference between the estimates of the
upper diagonal elements of and the corresponding
elements of
. This statistic should be small
if the null hypothesis of correct specification is true (since
in that case), and large if the
model is misspecified (since it is not necessarily
true that
if the model is
misspecified). The large difference
between the two different estimates of the covariance matrix for
suggests that
and
are different, and hence that the model is misspecified. However
we did not compute the actual test statistic to see at
what level of significance the
null would be rejected (i.e. to compute the
marginal significance level of the test statistic)
since we didn't expect you to know about
this particular specification test statistic at this stage of the
course.
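To give a concrete sense of what this comparison involves numerically, here is a Python sketch (not the authors' eval.g output) that forms both covariance estimates from per-observation scores and the average Hessian; score_fn and hess_fn are hypothetical user-supplied functions for whatever parametric model is being checked.

import numpy as np

def compare_covariances(score_fn, hess_fn, params, y, x):
    # score_fn(params, y, x) -> (N, K) matrix of per-observation scores
    # hess_fn(params, y, x)  -> (K, K) average Hessian of the log-likelihood
    # Both are hypothetical callables supplied by the user.
    n = len(y)
    s = score_fn(params, y, x)
    H = hess_fn(params, y, x)
    B = s.T @ s / n                    # outer-product-of-gradients estimate
    inv_info = np.linalg.inv(B) / n    # covariance if the information equality -H = B holds
    Hinv = np.linalg.inv(H)
    sandwich = Hinv @ B @ Hinv / n     # White misspecification-consistent covariance
    return inv_info, sandwich

A large discrepancy between corresponding diagonal elements of the two returned matrices is informal evidence of misspecification, in the spirit of White (1982).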
Question 3. If the true model were log-linear, i.e. if
where , then it is easy to see
that the
are conditionally lognormally distributed
so it is valid to take a log transformation and estimate
by OLS:
.
It would not be difficult to show that the OLS estimates of the
log-linearized model are actually the maximum likelihood estimates
of the original lognormal specification (make sure you understand
this by writing down the lognormal likelihood function and verifying
what we just said is true)! However the error
term for the specification of Model I in question 1 is additive and
not multiplicative, and so the log transformation is generally not
valid. Indeed, there is a positive probability of observing negative
realizations of , something that has zero probability under the
lognormal specification. Thus, to even do OLS one must screen out
all
pairs where
is negative, something that is
generally not a good idea. When one does OLS on this nonrandomly
selected subsample, it is not surprising that the estimates are
highly biased:
We can show analytically why the OLS estimates for Model II will be inconsistent when the true model is Model I with additive normal errors rather than multiplicative lognormal errors. After screening out negative y's, the OLS estimator solves:
As , we can show that the right hand side of the above
equation converges uniformly to
In general the that minimizes the expression above is not
the same as the
that minimizes
which is the true parameter value when the conditional mean function
is correctly specified.
Therefore, we conclude that the probability limits for
's
from the two different models (Model I and Model II) are not the same.
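In our own notation (with $f(x,\theta)$ the conditional mean of Model I and the log-linear Model II fit by OLS on the screened subsample with $y>0$; this is a sketch of the argument rather than the exact expressions in the original solutions), the contrast between the two limiting criteria is
\[
\hat\theta_{II} \;\xrightarrow{\;p\;}\;
 \arg\min_\theta\; E\Bigl[\mathbf{1}\{y>0\}\bigl(\log y-\log f(x,\theta)\bigr)^2\Bigr],
\qquad
\theta^{*} \;=\;
 \arg\min_\theta\; E\Bigl[\bigl(y-f(x,\theta)\bigr)^2\Bigr],
\]
and in general the two minimizers differ when the errors are additive rather than multiplicative.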
Question 4
Figure 3 below plots the squared residuals
from the
MLE/NLS estimation results in Question 1. The figure also plots
the results of the following nonlinear regression:
where as before and
is a
conformable
parameter vector. We see substantial
evidence of heteroscedasticity, confirming our earlier visual
impression from looking at the data in Figure 1. The estimated
conditional variance function looks concave and symmetric with respect to the
y-axis, almost like a normal density. Table 3
summarizes the estimation results. According to Table 3, the regression
coefficient for the quadratic term is significant and seems to
dominate the form of the heteroscedasticity plotted in Figure 3.
Figure 4 provides a blow-up, plotting the estimated and true conditional
variance functions.
Some students may have used simple OLS, estimating a specification like
rather than the
exponential specification in equation (7).
This is also OK since we weren't specific about what type
of tool to use to check for heteroscedasticity. The only
disadvantage of the OLS specification in (8) relative to the exponential
specification in (7) is that the linear specification does not guarantee that
the fitted conditional variance is positive
for all x. However,
we find that even with the linear specification most
of the predicted
values are indeed positive.
Figure 4 compares the predicted values of
using both
specifications, and we can see that the negative predicted values
of
occur at the extreme high and low values of
the observed x's.
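Here is a rough Python sketch of the two variance regressions just described (not the authors' Gauss code); resid denotes the NLS residuals from Question 1, and the constant-linear-quadratic regressors are our placeholder for whatever terms enter equations (7) and (8).

import numpy as np
from scipy.optimize import least_squares

def variance_regressions(resid, x):
    # Regress squared residuals on x two ways: exponential NLS and linear OLS.
    e2 = resid ** 2
    X = np.column_stack([np.ones_like(x), x, x ** 2])   # placeholder regressors

    # Exponential specification: E[e^2 | x] = exp(X @ g); fitted values are
    # positive by construction.
    nls = least_squares(lambda g: e2 - np.exp(X @ g), x0=np.zeros(X.shape[1]))
    gamma_exp = nls.x

    # Linear specification: E[e^2 | x] = X @ g; fitted values can go negative
    # at extreme values of x.
    gamma_ols, *_ = np.linalg.lstsq(X, e2, rcond=None)

    return gamma_exp, np.exp(X @ gamma_exp), gamma_ols, X @ gamma_ols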
Question 5. Now we consider full information maximum likelihood (FIML) estimation of Model III, which is the same as Model I, but with an exponential specification for conditional heteroscedasticity. Thus the joint density for (y,x) is given by:
where , and
.
We want to consider simultaneous or ``full information
maximum likelihood'' (FIML) estimation of the parameter vector
. The log-likelihood function
is given by:
where and
The gradients of
with respect to
are:
The gradients for with
respect to
are the same as
in Question 1.
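Written out in our own notation, with the exponential variance specification $\sigma^2(x;\gamma)=\exp(x_\gamma'\gamma)$ (where $x_\gamma$ stands for whatever regressors enter the variance equation, a detail not reproduced here), the conditional part of the Model III log-likelihood for observation $i$ and its score with respect to $\gamma$ are
\[
\ell_i(\theta,\gamma)= -\tfrac12\log(2\pi)-\tfrac12\,x_{\gamma i}'\gamma
   -\frac{\bigl(y_i-f(x_i,\theta)\bigr)^2}{2\exp(x_{\gamma i}'\gamma)},
\qquad
\frac{\partial\ell_i}{\partial\gamma}
  =\frac{x_{\gamma i}}{2}\left(\frac{\bigl(y_i-f(x_i,\theta)\bigr)^2}{\exp(x_{\gamma i}'\gamma)}-1\right).
\]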
The hessian matrix for
with respect to
is given by:
It is easy to verify (using the law of iterated expectations)
that when ,
the expectation of
is zero,
i.e. we have block diagonality between the
and
parameters (assuming the model is correctly specified). Similarly
one can verify that the
block of the information
matrix
is zero. This implies that the asymptotic
covariance between the maximum likelihood
estimates
and
is zero, so they
are asymptotically independently distributed. This independence suggests
the following 2-step procedure to obtain initial consistent estimates
of
: 1) estimate
by NLS (see
attached Gauss code
eval_nls.g
and shell program
nlreg.gpr
), 2) use the
estimated squared residuals
to estimate the
parameters by NLS using the exponential specification in
equation (7). We did this using the same eval_nls.g
procedure we used for step 1, with a slight modification of
nlreg.gpr to substitute
instead
of
as the dependent variable in the regression.
However, we can do even better than this. We can add a third step:
weighted NLS or feasible generalized least squares (FGLS) estimation of
using the estimated conditional variance
from step 2 as weights.
The procedure
eval_fgls.g
provides the code to do the FGLS estimation.
Due to the block diagonality property, it is not hard to show
that the FGLS estimates of
have the same asymptotic
distribution as the MLE: i.e. FGLS is asymptotically efficient
in this case. To see this, note that the gradient and hessian of
with respect to
are the same as the
gradient and hessian for the following FGLS criterion function:
We know that the block diagonality property implies that, as
long as is any consistent estimator of
,
a solution
to the FOC
is asymptotically efficient. But since this is also the FOC
for the FGLS estimator (10), it follows that
is also an asymptotically efficient estimator, i.e. it attains not
only the Chamberlain efficiency bound for conditional moment restrictions,
but the Cramer-Rao lower bound as well. It is not hard to show that
these two bounds coincide in this case: make sure
you understand this by verifying the equality yourself.
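The following Python sketch illustrates the weighted-NLS/FGLS step (a stand-in for eval_fgls.g, not a reproduction of it); the regression function and the exponential variance specification are the same hypothetical placeholders used in the earlier sketches.

import numpy as np
from scipy.optimize import least_squares

def fgls_step(y, x, theta_start, gamma_hat):
    # Hypothetical regression function (placeholder for the actual specification).
    def f(theta, x):
        return np.exp(theta[0] + theta[1] * x + theta[2] * x ** 2)

    Xg = np.column_stack([np.ones_like(x), x, x ** 2])  # placeholder variance regressors
    sigma2_hat = np.exp(Xg @ gamma_hat)                 # step-2 conditional variance estimates

    # Weighted NLS: minimize sum_i (y_i - f(x_i, theta))^2 / sigma2_hat_i
    # by scaling each residual with 1/sqrt(sigma2_hat_i).
    w = 1.0 / np.sqrt(sigma2_hat)
    res = least_squares(lambda th: w * (y - f(th, x)), x0=theta_start)
    return res.x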
It is not apparent that the obtained from the 4th step
of our estimation procedure, which regresses the squared residuals
from the FGLS estimation in step 3 on
, will be
asymptotically efficient, since the first-order condition for
from maximizing
with respect to
does not appear to be the same as the FOC for
from the
nonlinear regression in step 4 of our suggested estimation procedure.
So to get fully efficient estimates, we can use
as starting values for direct
FIML estimation of the full parameter vector
using
. The procedure
eval_fiml.g
and the
shell program
mle.gpr
implement full maximum likelihood estimation
of Model III (note we have also provided the
procedure
hesschk.g
to allow you to compare numerical
and analytically calculated values of the hessian matrix, verifying
that the analytic formulas for the hessian matrix given above are
correct). This full model is rather delicate and we were unable to
get it to converge starting from
. However we
had no problems with convergence starting from
. Table 4 below
compares the FGLS and MLE estimates.
We see that the FIML and FGLS estimates of are very close
to each other and the standard errors are nearly identical, as we
would expect from the theoretical result shown above that the
FGLS estimator of
is asymptotically efficient. There
are more significant differences in the FIML and FGLS estimates
of
. In particular, the standard errors of the FGLS estimates
are significantly larger than those of the FIML estimates of
, which
suggests that the FGLS estimates are not asymptotically efficient.
Students should be able to verify that this is the case by
deriving analytic formulas for the asymptotic covariance matrix
for the MLE and FGLS estimators of
.
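As a sketch of what the direct FIML step amounts to numerically (again a Python stand-in for eval_fiml.g and mle.gpr, with the same placeholder functional forms as above), one can maximize the conditional log-likelihood of Model III over $(\theta,\gamma)$ starting from the FGLS estimates; the marginal-density parameters of x can be handled separately because of the separability noted in Question 1.

import numpy as np
from scipy.optimize import minimize

def fiml_model3(y, x, theta_start, gamma_start):
    Xg = np.column_stack([np.ones_like(x), x, x ** 2])  # placeholder variance regressors
    k = len(theta_start)

    def f(theta, x):
        # Hypothetical regression function (placeholder for the actual specification).
        return np.exp(theta[0] + theta[1] * x + theta[2] * x ** 2)

    def negloglik(p):
        theta, gamma = p[:k], p[k:]
        resid = y - f(theta, x)
        log_s2 = Xg @ gamma                 # log of the conditional variance
        # Negative conditional log-likelihood, dropping the constant (1/2)log(2*pi).
        return 0.5 * np.sum(log_s2 + resid ** 2 / np.exp(log_s2))

    start = np.concatenate([theta_start, gamma_start])
    out = minimize(negloglik, start, method="BFGS")
    return out.x[:k], out.x[k:], out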
Question 6. The
Nadaraya-Watson estimator is used to provide a nonparametric estimate of
where and
is defined to be a Gaussian density
function. For the choice of a bandwidth parameter,
a rule of thumb is
used:
with
The Gauss program
that computed these estimates is
kernel.gpr.
You should
experiment with different bandwidths: when h is much less than the
automatically chosen value of h=0.34 the fitted regression tends to be too
wiggly, and when h is substantially greater than h=0.34 it tends to be too
smooth, so in this case the automatically chosen bandwidth seems like the
best compromise.
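For concreteness, here is a compact Python version of the Nadaraya-Watson estimator with a Gaussian kernel (a stand-in for kernel.gpr); the Silverman-type constant 1.06 in the rule-of-thumb bandwidth is our assumption, since the exact rule used in the problem set is not reproduced here.

import numpy as np

def nw_regression(x, y, grid, h=None):
    # Nadaraya-Watson kernel regression of y on x, evaluated at the points in 'grid'.
    if h is None:
        # Rule-of-thumb bandwidth (assumed Silverman-type constant).
        h = 1.06 * np.std(x) * len(x) ** (-1 / 5)
    u = (x[None, :] - grid[:, None]) / h
    K = np.exp(-0.5 * u ** 2)          # Gaussian kernel weights (normalizing constant cancels)
    return (K @ y) / K.sum(axis=1), h

# Example call: ghat, h = nw_regression(x, y, np.linspace(x.min(), x.max(), 200))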
There are more sophisticated ways to choose bandwidths such
as ``least squares cross validation'' described in class.
The simple rule given above does almost as well and is much
simpler to compute. The kernel.gpr
program also includes two types of
polynomial series estimation, ordinary polynomials and nonlinear
least squares using the exponential specification considered
in problem 4. Figure 5 and
Figure 6 compare the different estimators
for the data in data1.asc and data2.asc, respectively.
We can see that all of
the estimators give similar answers in the region of the x-space
where most of the data points lie, i.e. around x=0.
Questions 7 and 8
To explore the form of heteroscedasticity, we repeat the procedures
described above, but using
the squared residuals from Question 6 as dependent
variable. Figure 7, which plots
true and estimated conditional variance functions
,
shows that, similar to the results on estimation of the
conditional expectation in part 6, all
of the different parametric and nonparametric
estimation methods
give similar answers in the region of the x-space
where most of the data points lie, i.e. around x=0.
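In code terms (mirroring the illustrative kernel-regression sketch given under Question 6, with resid denoting the residuals from the Question 6 conditional-mean fit), the nonparametric conditional variance estimate is simply a Nadaraya-Watson regression of the squared residuals on x:

import numpy as np

def nw_conditional_variance(x, resid, grid, h):
    # Kernel (Nadaraya-Watson) regression of squared residuals on x.
    u = (x[None, :] - grid[:, None]) / h
    K = np.exp(-0.5 * u ** 2)                   # Gaussian kernel weights
    return (K @ (resid ** 2)) / K.sum(axis=1)   # estimated conditional variance on 'grid'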
Note particularly that the parametric exponential specification of
and the nonparametric kernel estimate of
are quite close to each other except at extreme
values of x. The same story also emerges in Figure 8,
which plots the estimated squared residuals from the first
stage NLS estimates of the exponential specification of the
conditional expectation function. We didn't plot estimates
of
from the exponential specification,
since we were unable to get the
nlreg.gpr program to converge, even when we started
the estimation from the true values of
. This
problem is probably a result of the large values chosen
for the
components,
,
especially under the polynomial specification at extreme
values of x where the computer runs into underflow or
overflow problems. With some ``playing around'' one might be able
to coax nlreg.gpr into convergence, but here is a case
where the parameter estimates
from the exponential specification seem rather fragile and
non-robust.