Solutions to Problem Set 1

Economics 551, Yale University

Professor John Rust

Question 1 The choice probability for the general case is derived under the assumption that alternative $i \in \{0,1\}$ yields utility $u_i(x,\theta)+\epsilon_i$ and the errors are jointly normally distributed,

\[ (\epsilon_0,\epsilon_1) \sim N(\mu,\Sigma), \]

where

\[ \mu \;=\; \begin{pmatrix} \mu_0 \\ \mu_1 \end{pmatrix}, \qquad \Sigma \;=\; \begin{pmatrix} \sigma_0^2 & \sigma_{01} \\ \sigma_{01} & \sigma_1^2 \end{pmatrix}. \]

The choice probability is given by:

\begin{eqnarray*}
P(1|x,\theta) &=& \Pr\bigl\{\, u_1(x,\theta)+\epsilon_1 \;\ge\; u_0(x,\theta)+\epsilon_0 \,\bigr\} \\
&=& \Pr\left\{\, \frac{\epsilon_0-\epsilon_1-(\mu_0-\mu_1)}{\sqrt{\sigma_0^2+\sigma_1^2-2\sigma_{01}}} \;\le\; \frac{u_1(x,\theta)-u_0(x,\theta)-(\mu_0-\mu_1)}{\sqrt{\sigma_0^2+\sigma_1^2-2\sigma_{01}}} \,\right\} \\
&=& \Phi\left( \frac{u_1(x,\theta)-u_0(x,\theta)+\mu_1-\mu_0}{\sqrt{\sigma_0^2+\sigma_1^2-2\sigma_{01}}} \right), \qquad (1)
\end{eqnarray*}

where we use the fact that $\epsilon_0-\epsilon_1 \sim N(\mu_0-\mu_1,\; \sigma_0^2+\sigma_1^2-2\sigma_{01})$, so we standardized $\epsilon_0-\epsilon_1$ by subtracting $\mu_0-\mu_1$ from both sides of the inequality in the probability in the second line of the above equation and divided both sides by its standard deviation, allowing us to use the standard normal CDF $\Phi$ in the last line. Note that when $\mu_0=\mu_1$, $\sigma_0^2=\sigma_1^2=1/2$ and $\sigma_{01}=0$, this equation reduces to

\[ P(1|x,\theta) \;=\; \Phi\bigl( u_1(x,\theta)-u_0(x,\theta) \bigr). \qquad (2) \]

We note that we must impose identifying normalizations on the $\mu$ and $\Sigma$ parameters of this model, since there are infinitely many different combinations of the 5 free parameters in $\mu$ and $\Sigma$ (namely $\mu_0,\mu_1,\sigma_0^2,\sigma_{01},\sigma_1^2$) that yield the same conditional probability in equation (1), and are thus observationally equivalent. For example, let $\mu_0=\mu_1=0$ and let $\sigma_0^2$, $\sigma_{01}$ and $\sigma_1^2$ be any parameters satisfying 1) $\Sigma$ is positive semidefinite, and 2) $\sigma_0^2+\sigma_1^2-2\sigma_{01}=1$ (one example is $\sigma_0^2=\sigma_1^2=1$ and $\sigma_{01}=1/2$). This model has the same choice probability as the model in equation (2), where $\mu_0=\mu_1=0$, $\sigma_0^2=\sigma_1^2=1/2$ and $\sigma_{01}=0$. Therefore we need to impose arbitrary identifying normalizations in order to estimate the model. One common normalization is $\mu_0=\mu_1=0$, $\sigma_0^2=\sigma_1^2=1/2$ and $\sigma_{01}=0$. An alternative identifying normalization is $\mu_0=\mu_1=0$ and $\sigma_0^2=\sigma_1^2=1$, with $\sigma_{01}$ a free parameter to be estimated. However, whether it is possible to identify the covariance term $\sigma_{01}$ depends on the specification of the utilities $u_i(x,\theta)$, $i=0,1$. We will generally need to impose additional identifying normalizations to estimate the parameters of the utilities $u_i(x,\theta)$, $i=0,1$. For example, if the utility function is linear in parameters, i.e. $u_i(x,\theta)=x'\beta_i$, $i=0,1$ with $\theta=(\beta_0,\beta_1)$, then it is easy to see that without further restrictions it is not possible to identify $(\beta_0,\beta_1)$ and $\sigma_{01}$ simultaneously, even with the normalization $\sigma_0^2=\sigma_1^2=1$. To see this, note that for the linear in parameters specification we have:

\[ P(1|x,\theta) \;=\; \Phi\!\left( \frac{x'(\beta_1-\beta_0)}{\sqrt{2-2\sigma_{01}}} \right). \]

It should be clear that any combinations of $(\beta_0,\beta_1,\sigma_{01})$ such that

\[ \frac{\beta_1-\beta_0}{\sqrt{2-2\sigma_{01}}} \;=\; \gamma \]

for a fixed vector $\gamma$ are observationally equivalent, and that there are infinitely many such combinations. Therefore we must impose a further normalization on the $\beta$ coefficients. A typical normalization is $\beta_0=0$, possibly together with setting one of the components of $\beta_1$ to 1. Since we are free to choose different normalizations, when interpreting the estimation results from the probit model we need to keep the underlying normalization in mind. For the rest of this problem set we will use the normalization $\sigma_0^2=\sigma_1^2=1/2$, $\sigma_{01}=0$ and $\beta_0=0$. Under this normalization the choice probability is given by $P(1|x,\theta)=\Phi(x'\beta_1)$, so the estimated value of $\beta_1$ is interpreted as the impact of an additional unit of $x$ on the incremental utility of choosing alternative 1, i.e.

\[ x'\beta_1 \;=\; u_1(x,\theta)-u_0(x,\theta). \]
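As a quick numerical check of this observational equivalence (a Python sketch, not part of the original GAUSS programs; the utility difference $u_1-u_0 = 0.5x$ and all parameter values are hypothetical), the two parameterizations discussed above produce identical choice probabilities:

    import numpy as np
    from scipy.stats import norm

    def choice_prob(du, mu0, mu1, s0sq, s1sq, s01):
        """Equation (1): P(1|x) when (eps0, eps1) is bivariate normal."""
        sd = np.sqrt(s0sq + s1sq - 2.0 * s01)
        return norm.cdf((du + mu1 - mu0) / sd)

    x = np.linspace(-3.0, 3.0, 7)
    du = 0.5 * x                                  # hypothetical u_1(x) - u_0(x)
    p_a = choice_prob(du, 0, 0, 1.0, 1.0, 0.5)    # sigma_0^2 = sigma_1^2 = 1, sigma_01 = 1/2
    p_b = choice_prob(du, 0, 0, 0.5, 0.5, 0.0)    # sigma_0^2 = sigma_1^2 = 1/2, sigma_01 = 0
    print(np.allclose(p_a, p_b))                  # True: the two models are observationally equivalent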

Question 2 See the answers to questions 7 and 8 of the 1997 Econ 551 problem set 3.

Question 3 The ``true model'' used to generate the data in data3.asc was a probit model. Table 1 below presents the true $\beta$ coefficients and the logit and probit estimates of these values, estimated with the shell program estimate.gpr using two procedures, log_mle.g and prb_mle.g, which compute the log-likelihood, gradients and hessians for the logit and probit specifications, respectively. Both log-likelihoods have the following general form:

\begin{displaymath} L_N(\theta) \;=\; \frac{1}{N}\sum_{i=1}^N \Bigl\{\, y_i \log P(1|x_i,\theta) \;+\; (1-y_i)\log\bigl(1-P(1|x_i,\theta)\bigr) \Bigr\}, \end{displaymath}

\[ P(1|x,\theta) \;=\; \Phi(x'\beta) \ \ \mbox{(probit)}, \qquad P(1|x,\theta) \;=\; \frac{e^{x'\beta}}{1+e^{x'\beta}} \ \ \mbox{(logit)}. \]
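For concreteness, here is a minimal Python sketch of this estimation problem, assuming a simulated design in the spirit of data3.asc with hypothetical true parameters; it is an analog of, not a transcription of, log_mle.g and prb_mle.g:

    import numpy as np
    from scipy.stats import norm
    from scipy.optimize import minimize

    def neg_loglik(theta, X, y, cdf):
        """Average negative log-likelihood for a binary choice model."""
        p = np.clip(cdf(X @ theta), 1e-12, 1.0 - 1e-12)
        return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

    logit_cdf = lambda v: 1.0 / (1.0 + np.exp(-v))

    # Hypothetical data: the probit model is the true data generating process.
    rng = np.random.default_rng(0)
    N = 3000
    X = np.column_stack([np.ones(N), rng.standard_normal(N)])
    theta_true = np.array([0.2, 0.8])             # hypothetical true parameters
    y = (rng.uniform(size=N) < norm.cdf(X @ theta_true)).astype(float)

    for name, cdf in [("logit", logit_cdf), ("probit", norm.cdf)]:
        res = minimize(neg_loglik, np.zeros(2), args=(X, y, cdf), method="BFGS")
        print(name, res.x)

The logit slope estimate comes out larger than the probit one by the familiar scale factor, since the logistic error has a larger standard deviation than the standard normal.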

Notice that the logit parameter estimates are ``further'' from the true parameters than the probit estimates. One ``metric'' for measuring this distance is the Wald test statistic $W$ for the hypothesis that the estimated logit parameters equal the true parameters:

\[ W \;=\; N\,\bigl(\hat\theta-\theta^*\bigr)'\,\hat\Omega^{-1}\,\bigl(\hat\theta-\theta^*\bigr), \]

where $\hat\Omega$ is the estimated misspecification-consistent (``sandwich'') covariance matrix for $\hat\theta$:

\[ \hat\Omega \;=\; \hat H^{-1}\,\hat I\,\hat H^{-1}, \]

where $\hat H$ and $\hat I$ are the sample analogs of the hessian and information matrix of the log-likelihood, respectively. The Wald statistic for the misspecified logit model lies far out in the right tail of its null distribution, corresponding to a marginal significance level of essentially zero given that under the null hypothesis $W \sim \chi^2_4$, a Chi-squared random variable with 4 degrees of freedom. The Wald test statistic for the hypothesis that the estimated probit parameters equal the true values corresponds to a marginal significance level of 0.594. Thus we can clearly reject the hypothesis that the logit model is correctly specified, but we do not reject the hypothesis that the probit model is correctly specified. However our ability to compute this statistic requires prior knowledge of the true parameters $\theta^*$. Of course in nearly all ``real'' applications we do not know $\theta^*$, so this type of Wald test is infeasible.
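In this problem the true $\theta^*$ is known (we generated the data), so the statistic can be computed directly. A sketch of the computation, assuming one has the per-observation score vectors and the sample hessian at the estimate (the helper below is hypothetical, not part of estimate.gpr):

    import numpy as np
    from scipy.stats import chi2

    def wald_sandwich(theta_hat, theta_star, scores, hessian):
        """Wald statistic using the misspecification-consistent (sandwich) covariance.
        scores:  N x K matrix of per-observation scores at theta_hat
        hessian: K x K sample average hessian of the log-likelihood at theta_hat
        (the sign of the hessian cancels in the sandwich)"""
        N = scores.shape[0]
        I_hat = scores.T @ scores / N                  # sample information matrix
        H_inv = np.linalg.inv(hessian)
        omega = H_inv @ I_hat @ H_inv                  # covariance of sqrt(N)(theta_hat - theta*)
        diff = theta_hat - theta_star
        W = N * diff @ np.linalg.solve(omega, diff)
        return W, chi2.sf(W, df=len(theta_hat))        # statistic and marginal significance level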

Later in Econ 551 we will consider general specification tests, such as White's (1982) Econometrica information matrix test statistic (which is not necessarily a consistent test), or Bierens' (1990) Econometrica specification test of functional form (which is a consistent test). These allow us to test whether the parametric model $P(d|x,\theta)$ is correctly specified (i.e. whether there exists a $\theta^*$ such that $P(d|x,\theta^*)=P^*(d|x)$, where $P^*(d|x)$ is the true conditional choice probability) without any prior knowledge of $\theta^*$ or, indeed, without any prior information about what the true model really is.

However the estimation results suggest that the power of these ``omnibus'' specification test statistics may be low, even with samples as large as N=3000. To see how hard it might be to test this hypothesis, consider figure 1. For example, comparing the sample hessian $\hat H$ and the sample information matrix $\hat I$ (which the information matrix equality requires to agree, up to sign, when the model is correctly specified), we find that they are very close in both the probit and logit specifications. Tables 2 and 3 present the estimated values of $\hat H$ and $\hat I$ for the probit and logit specifications, respectively.

[Table 2: estimated hessian $\hat H$ and information matrix $\hat I$ for the probit specification.]

[Table 3: estimated hessian $\hat H$ and information matrix $\hat I$ for the logit specification.]

Figure 1 plots the true conditional choice probability $P(1|x,\theta^*)$, i.e. the probit model evaluated at the true parameters and at the (sorted) $x$ values in the data file data3.asc, the estimated probit and logit models, and the logit model evaluated at the true parameter values $\theta^*$. We see that even though the estimated parameter values $\hat\theta$ for the logit and probit models are significantly different from each other, the estimated choice probabilities are nearly identical for each $x$ in the sample. Indeed the estimated logit and probit choice probabilities are visually virtually indistinguishable. Maximum likelihood is doing its best (in the presence of noise) to fit the true choice probability $P(1|x,\theta^*)$, and we see that both the logit and probit models are sufficiently flexible functional forms that we can approximate the data about equally well with either specification. As a result the maximized value of the log-likelihood is almost identical for the probit and logit specifications. Recalling the discussion of neural networks in our presentation of nonparametric estimation methods, both the logit and probit models can be regarded as simplified neural networks with a single hidden unit and the logistic and normal CDFs as ``squashing functions.'' Given that neural networks can approximate a wide variety of functions, it isn't so surprising that the logit and probit choice probabilities can approximate each other very well, with each yielding virtually the same overall fit. Thus, one can imagine it would be very hard for an omnibus specification test statistic to discern which of these models is the true model generating the data.

Figure 1 also plots the predicted logit choice probabilities that result from evaluating the logit model at the true parameter values $\theta^*$. We can see that in this case the choice probabilities of the logit model are quite different from the choice probabilities of the true probit model. However the logit maximum likelihood estimates are not converging to $\theta^*$ when the model is misspecified. Instead, the misspecified maximum likelihood estimator is converging to the parameter vector $\theta^{**}$ which minimizes the Kullback-Leibler distance between the chosen parametric specification and the true choice probability:

\begin{eqnarray*}
\theta^{**} &=& \mathop{\rm argmin}_{\theta} \int \left[ P^*(1|x)\,\log\frac{P^*(1|x)}{P(1|x,\theta)} \;+\; \bigl(1-P^*(1|x)\bigr)\,\log\frac{1-P^*(1|x)}{1-P(1|x,\theta)} \right] \phi(x)\,dx,
\end{eqnarray*}

where $\phi(x)$ is the standard normal density, the marginal density of the $x$ variables used to generate the data in this problem. Given the flexibility of the logit specification, we find that $P(1|x,\theta^{**})$ is almost identical to the true probit specification $P(1|x,\theta^*)$, even though $\theta^{**}$ and $\theta^*$ are fairly different parameter vectors.
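The pseudo-true value can be computed directly by minimizing this expression numerically. The sketch below does so for a logit approximation to a probit, with hypothetical true parameters, using Gauss-Hermite quadrature for the integral against $\phi(x)$:

    import numpy as np
    from scipy.stats import norm
    from scipy.optimize import minimize

    def kl_distance(theta, theta_star):
        """KL distance from the true probit(theta_star) to logit(theta), E over x ~ N(0,1)."""
        x, w = np.polynomial.hermite_e.hermegauss(40)
        w = w / np.sqrt(2.0 * np.pi)                  # weights for E[f(x)], x ~ N(0,1)
        p_true = norm.cdf(theta_star[0] + theta_star[1] * x)
        p_fit = 1.0 / (1.0 + np.exp(-(theta[0] + theta[1] * x)))
        kl = (p_true * np.log(p_true / p_fit)
              + (1 - p_true) * np.log((1 - p_true) / (1 - p_fit)))
        return w @ kl

    theta_star = np.array([0.2, 0.8])                 # hypothetical true probit parameters
    res = minimize(kl_distance, theta_star, args=(theta_star,))
    print("pseudo-true logit parameters:", res.x)

The minimizer has a noticeably larger slope than $\theta^*$, the familiar rescaling between logit and probit coefficients, while the implied choice probabilities remain close.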

Question 4 We can also use nonlinear least squares (NLS) to consistently estimate $\theta^*$, assuming the specification of the choice probability is correct, since by definition of the conditional choice probability we have:

\[ E\{y|x\} \;=\; 1\cdot P(1|x,\theta^*) \;+\; 0\cdot\bigl(1-P(1|x,\theta^*)\bigr) \;=\; P(1|x,\theta^*). \]

Thus, $P(1|x,\theta^*)$ is the true conditional expectation function, so even though the dependent variable $y$ only takes on the values $\{0,1\}$ we still have a valid regression equation:

\[ y \;=\; P(1|x,\theta^*) + \epsilon, \]

where the error term also takes on only two possible values, $1-P(1|x,\theta^*)$ and $-P(1|x,\theta^*)$, but satisfies $E\{\epsilon|x\}=0$ by construction. By the general uniform consistency arguments presented in class, it is easy to show that the nonlinear least squares estimator $\hat\theta$ defined by:

\begin{displaymath} \hat\theta \;=\; \mathop{\rm argmin}_{\theta}\; \frac{1}{N}\sum_{i=1}^N \bigl( y_i - P(1|x_i,\theta) \bigr)^2 \end{displaymath}

will be a consistent estimator of $\theta^*$ if the model is correctly specified, i.e. if $P^*(1|x)=\Phi(x'\theta^*)$, where $\Phi$ is the standard normal CDF. But if the choice probability is misspecified, then with probability 1 we have $\hat\theta \to \theta^{**}$, where $\theta^{**}$ is given by:

\[ \theta^{**} \;=\; \mathop{\rm argmin}_{\theta} \int \bigl( P^*(1|x) - P(1|x,\theta) \bigr)^2\, \phi(x)\,dx. \]
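A minimal sketch of the NLS estimator for the probit specification, under the same hypothetical design as before (an analog of prb_nls.g, not the original GAUSS code):

    import numpy as np
    from scipy.stats import norm
    from scipy.optimize import least_squares

    def nls_probit(X, y):
        """Nonlinear least squares for the probit choice probability."""
        resid = lambda theta: y - norm.cdf(X @ theta)
        return least_squares(resid, np.zeros(X.shape[1])).x

    # usage with the simulated (X, y) from the MLE sketch above:
    # theta_nls = nls_probit(X, y)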

Question 5 Table 4 presents the NLS estimates of $\beta$ for the logit and probit specifications of $P^*(1|x)$, computed with the estimation program estimate.gpr and the procedures log_nls.g and prb_nls.g, respectively.

[Table 4: NLS estimates of $\beta$ for the logit and probit specifications.]

Comparing Tables 1 and 4 we see that the MLE and NLS estimates of $\theta^*$ are virtually identical for each specification. The standard errors are virtually the same in this case as well. In general the NLS estimator is less efficient than the MLE, since the latter attains the Cramer-Rao lower bound when the model is correctly specified. In this case the MLE and NLS estimates happen to be amazingly close to each other, and the estimated standard errors of the NLS estimates are actually minutely smaller than those of the MLE estimates (compare the standard errors for $\beta_1$ in Tables 1 and 4). This anomaly is probably not due to a programming error on my part (running the gradient and hessian check options in estimate.gpr reveals that the analytic formulas I programmed match the numerical values quite closely), but rather to a combination of roundoff error and estimation noise. Although the Cramer-Rao lower bound holds asymptotically, it need not hold in finite samples for the sample analog estimates of the covariance matrix, which can be potentially quite noisy estimates of the asymptotic covariance matrix. It is straightforward to show that the NLS estimator has asymptotic covariance matrix $\Omega_{nls}$ given by:

\begin{displaymath} \Omega_{nls} \;=\; A^{-1}\, B\, A^{-1}, \end{displaymath}

where

\[ A \;=\; E\bigl\{ \nabla_\theta P(1|x,\theta^*)\,\nabla_\theta P(1|x,\theta^*)' \bigr\} \]

and

\[ B \;=\; E\bigl\{ \sigma^2(x)\,\nabla_\theta P(1|x,\theta^*)\,\nabla_\theta P(1|x,\theta^*)' \bigr\}, \qquad \sigma^2(x) \;=\; P(1|x,\theta^*)\bigl(1-P(1|x,\theta^*)\bigr), \]

whereas for the correctly specified probit model the Cramer-Rao lower bound, $\Omega_{mle}$, is given by

\begin{displaymath} \Omega_{mle} \;=\; \left( E\left\{ \frac{ \nabla_\theta P(1|x,\theta^*)\,\nabla_\theta P(1|x,\theta^*)' }{ P(1|x,\theta^*)\bigl(1-P(1|x,\theta^*)\bigr) } \right\} \right)^{-1}. \end{displaymath}

Thus we have $\Omega_{nls} \ge \Omega_{mle}$ in the positive semidefinite sense, with equality only if the model is homoscedastic, i.e. when $\sigma^2(x)$ is the same for all $x$, which implies that $P(1|x,\theta^*)$ is a constant for all $x$, which is almost never the case in any ``interesting'' application. We conclude that the MLE of $\theta^*$ is necessarily (weakly) more efficient than the NLS estimator, and the only reason the NLS has slightly smaller estimated standard errors in this example is round-off and estimation error. For other sample sizes, say N=500, we do find that the estimated standard deviations of the MLE are smaller than those of the NLS estimator: at N=500 the estimated standard error of the NLS estimator of the slope coefficient is somewhat larger than that of the corresponding MLE. Thus, while we do see an efficiency gain to doing maximum likelihood, it is far from overwhelming in this particular case.
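To make this comparison concrete, one can evaluate $\Omega_{nls}$ and $\Omega_{mle}$ by quadrature for a probit with a constant and a single regressor; the parameter values below are hypothetical:

    import numpy as np
    from scipy.stats import norm

    theta_star = np.array([0.2, 0.8])                 # hypothetical true probit parameters
    x, w = np.polynomial.hermite_e.hermegauss(40)
    w = w / np.sqrt(2.0 * np.pi)                      # weights for E[f(x)], x ~ N(0,1)

    Xg = np.column_stack([np.ones_like(x), x])
    v = Xg @ theta_star
    p = norm.cdf(v)
    grad = norm.pdf(v)[:, None] * Xg                  # gradient of P(1|x,theta) in theta
    s2 = p * (1 - p)                                  # sigma^2(x)

    A = grad.T @ (w[:, None] * grad)
    B = grad.T @ ((w * s2)[:, None] * grad)
    omega_nls = np.linalg.inv(A) @ B @ np.linalg.inv(A)
    omega_mle = np.linalg.inv(grad.T @ ((w / s2)[:, None] * grad))
    print(np.linalg.eigvalsh(omega_nls - omega_mle))  # nonnegative: NLS is weakly less efficient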

Question 6 It is easy to see that the errors $\epsilon$ in the regression formulation of the binary choice model, $y = P(1|x,\theta^*) + \epsilon$, are heteroscedastic, with conditional variance $\sigma^2(x)$ given by:

\[ \sigma^2(x) \;=\; P(1|x,\theta^*)\bigl(1-P(1|x,\theta^*)\bigr). \]

(To see this, note that the conditional variances of $\epsilon$ and $y$ given $x$ are the same, and the latter is a Bernoulli random variable that takes on the value 1 with probability $P(1|x,\theta^*)$. As is well known, a Bernoulli random variable has variance $p(1-p)$.) Thus, we have a case where the heteroscedasticity has a known functional form, and we can make use of it to compute feasible generalized least squares (FGLS) estimates of $\theta^*$. In the first stage we compute the NLS estimates of $\theta^*$; call them $\hat\theta_{nls}$. Using these estimates we compute the estimated conditional variance $\hat\sigma^2(x)$, given by the formula above but with the first stage NLS estimates $\hat\theta_{nls}$ in place of $\theta^*$. Then in the second stage we compute the FGLS estimates $\hat\theta_{fgls}$ as the solution to the following weighted least squares problem:

\begin{displaymath} \hat\theta_{fgls} \;=\; \mathop{\rm argmin}_{\theta}\; \frac{1}{N}\sum_{i=1}^N \frac{\bigl( y_i - P(1|x_i,\theta) \bigr)^2}{\hat\sigma^2(x_i)}. \end{displaymath}

The FGLS estimates of $\theta^*$, computed by log_fgls.g and prb_fgls.g in the logit and probit cases, respectively, are virtually identical to the NLS estimates of $\theta^*$, which are in turn virtually identical to the maximum likelihood estimates for the logit and probit specifications presented in Table 1, so I didn't bother to present them here.
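A sketch of the two-stage FGLS procedure in Python (an analog of prb_fgls.g under the same hypothetical design; not the original GAUSS code):

    import numpy as np
    from scipy.stats import norm
    from scipy.optimize import least_squares

    def fgls_probit(X, y):
        """Two-stage FGLS for the probit model.
        Stage 1: unweighted NLS; Stage 2: reweight by estimated Bernoulli variances."""
        theta_nls = least_squares(lambda t: y - norm.cdf(X @ t),
                                  np.zeros(X.shape[1])).x
        p_hat = np.clip(norm.cdf(X @ theta_nls), 1e-6, 1.0 - 1e-6)
        sd_hat = np.sqrt(p_hat * (1 - p_hat))         # estimated sigma(x_i)
        theta_fgls = least_squares(lambda t: (y - norm.cdf(X @ t)) / sd_hat,
                                   theta_nls).x
        return theta_nls, theta_fgls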

Should we conclude from this that there isn't much heteroscedasticity in this problem? Figure 2 plots $\hat\sigma^2(x)$ for this problem, and we see that there is indeed substantial heteroscedasticity, with fairly large variation in the effective weighting of the observations. However, by plotting the relative contributions of the terms in the weighted and unweighted sums of squared residuals, you will find that, except for a small number of observations with the lowest values of $x$, to which the FGLS estimator assigns very high weights, the relative sizes of the vast majority of the squared residuals entering the FGLS and NLS objectives are very similar. This explains why the FGLS and NLS estimators are not very different even though there appears to be substantial heteroscedasticity in this problem.

Question 7 The FGLS estimator is asymptotically equivalent to the maximum likelihood estimator, a result suggested by the fact that the likelihood function and the weighted sum of squared residuals have the same first order conditions:

\begin{eqnarray*}
0 &=& \frac{1}{N}\sum_{i=1}^N \left[ \frac{y_i}{P(1|x_i,\theta)} - \frac{1-y_i}{1-P(1|x_i,\theta)} \right] \nabla_\theta P(1|x_i,\theta) \\
&=& \frac{1}{N}\sum_{i=1}^N \frac{\bigl( y_i - P(1|x_i,\theta) \bigr)\,\nabla_\theta P(1|x_i,\theta)}{P(1|x_i,\theta)\bigl(1-P(1|x_i,\theta)\bigr)} \\
&=& \frac{1}{N}\sum_{i=1}^N \frac{\bigl( y_i - P(1|x_i,\theta) \bigr)\,\nabla_\theta P(1|x_i,\theta)}{\sigma^2(x_i,\theta)},
\end{eqnarray*}

where $\sigma^2(x,\theta) = P(1|x,\theta)(1-P(1|x,\theta))$. Actually, the first order conditions are only identical for the continuously updated version of the FGLS estimator, where instead of using a first stage NLS estimate $\hat\theta_{nls}$ to make an estimated correction for heteroscedasticity, we continually update our estimate of the heteroscedasticity as $\theta$ changes, so the same $\theta$ appears in the numerator and denominator terms in the third equation above, whereas in the FGLS estimator $\hat\theta_{nls}$ appears in the denominator terms. However, recalling the logic of the ``Amemiya correction,'' we need to consider whether it is necessary to account for the estimation noise in the first stage estimates $\hat\theta_{nls}$ in deriving the asymptotic distribution of the FGLS estimator $\hat\theta_{fgls}$. It will turn out that there is a form of ``block diagonality'' here which enables the FGLS estimator to be ``adaptive,'' in the sense that the asymptotic distribution of $\hat\theta_{fgls}$ does not depend on whether we use the noisy first stage NLS estimates to compute a noisy estimate $\hat\sigma^2(x)$ of the conditional variance to use as weights, or use the true conditional variance $\sigma^2(x)$.

Before we show this, we first show that if we did use the true conditional variance as the weights in the FGLS estimator, it would be as efficient as maximum likelihood: i.e. the FGLS estimator attains the Cramer-Rao lower bound. To see this, do a Taylor-series expansion of the first order condition for the FGLS estimator about $\theta^*$:

\begin{eqnarray*}
0 &=& \sqrt{N}\,G_N(\hat\theta_{fgls}) \\
&=& \sqrt{N}\,G_N(\theta^*) \;+\; \nabla_\theta G_N(\bar\theta)\,\sqrt{N}\bigl(\hat\theta_{fgls}-\theta^*\bigr),
\end{eqnarray*}

where $\bar\theta$ is on the line segment between $\hat\theta_{fgls}$ and $\theta^*$ and

\begin{displaymath} G_N(\theta) \;=\; \frac{1}{N}\sum_{i=1}^N \frac{\bigl( y_i - P(1|x_i,\theta) \bigr)\,\nabla_\theta P(1|x_i,\theta)}{\sigma^2(x_i)}. \end{displaymath}

By the uniform Strong Law of Large Numbers, we have that $\nabla_\theta G_N(\theta) \to \nabla_\theta G(\theta)$ uniformly with probability 1, where

\begin{displaymath} \nabla_\theta G(\theta) \;=\; E\left\{ \frac{ -\nabla_\theta P(1|x,\theta)\,\nabla_\theta P(1|x,\theta)' \;+\; \bigl( y - P(1|x,\theta) \bigr)\,\nabla^2_\theta P(1|x,\theta) }{ \sigma^2(x) } \right\}. \end{displaymath}

Since $\bar\theta \to \theta^*$ with probability 1, it follows that $\nabla_\theta G_N(\bar\theta) \to \nabla_\theta G(\theta^*)$ with probability 1. Using the law of iterated expectations we can show that the second term in the above expectation is zero when $\theta=\theta^*$, so that

\begin{displaymath} \nabla_\theta G(\theta^*) \;=\; -E\left\{ \frac{ \nabla_\theta P(1|x,\theta^*)\,\nabla_\theta P(1|x,\theta^*)' }{ \sigma^2(x) } \right\} \;=\; -I(\theta^*). \end{displaymath}

The Central Limit Theorem implies that

\begin{displaymath} \sqrt{N}\,G_N(\theta^*) \;\Longrightarrow\; N(0,\Lambda), \end{displaymath}

where it is easy to see that $\Lambda = I(\theta^*)$, since the conditional variance of $y - P(1|x,\theta^*)$ given $x$ is $\sigma^2(x)$. Combining the results above, we see that the asymptotic distribution of the FGLS estimator is given by:

\begin{displaymath} \sqrt{N}\bigl( \hat\theta_{fgls} - \theta^* \bigr) \;\Longrightarrow\; N\bigl( 0,\; I(\theta^*)^{-1} \bigr). \end{displaymath}

Thus the asymptotic covariance matrix of the FGLS estimator is the inverse of the information matrix (compare the Cramer-Rao lower bound $\Omega_{mle}$ above), so it is asymptotically efficient.

Now we need to show that if we computed the FGLS estimator using the inverse of the estimated conditional variance $\hat\sigma^2(x)$ as weights instead of the true conditional variance, the asymptotic distribution is still the same as that given above. We do this using the same logic as in the general derivation of the ``Amemiya correction,'' Taylor expanding the FGLS first order condition in both arguments, $\hat\theta_{nls}$ and $\hat\theta_{fgls}$, about their common limiting value $\theta^*$. That is, if we define the function $G_N(\theta_1,\theta_2)$ by

\begin{displaymath} G_N(\theta_1,\theta_2) \;=\; \frac{1}{N}\sum_{i=1}^N \frac{\bigl( y_i - P(1|x_i,\theta_2) \bigr)\,\nabla_\theta P(1|x_i,\theta_2)}{ P(1|x_i,\theta_1)\bigl(1-P(1|x_i,\theta_1)\bigr) }, \end{displaymath}

then we have the following joint Taylor series expansion for $G_N(\hat\theta_{nls},\hat\theta_{fgls})$ about $(\theta^*,\theta^*)$:

\begin{eqnarray*}
0 &=& \sqrt{N}\,G_N(\hat\theta_{nls},\hat\theta_{fgls}) \\
&=& \sqrt{N}\,G_N(\theta^*,\theta^*) \;+\; \nabla_{\theta_1} G_N(\bar\theta_1,\bar\theta_2)\,\sqrt{N}\bigl(\hat\theta_{nls}-\theta^*\bigr) \;+\; \nabla_{\theta_2} G_N(\bar\theta_1,\bar\theta_2)\,\sqrt{N}\bigl(\hat\theta_{fgls}-\theta^*\bigr).
\end{eqnarray*}

We know that the NLS estimator is asymptotically normal, so $\sqrt{N}(\hat\theta_{nls}-\theta^*) = O_p(1)$, i.e. it is bounded in probability. Thus, the FGLS estimator that uses the estimated conditional variance as weights will have the same asymptotic distribution as the (infeasible) GLS estimator that uses the true conditional variance as weights if we can show that with probability 1 we have:

\begin{eqnarray*}
\lim_{N\to\infty} \nabla_{\theta_1} G_N(\bar\theta_1,\bar\theta_2) \;=\; \nabla_{\theta_1} G(\theta^*,\theta^*) \;=\; 0.
\end{eqnarray*}

But this follows from the USLLN and the consistency of $\hat\theta_{nls}$ and $\hat\theta_{fgls}$: each term of $\nabla_{\theta_1} G_N$ is proportional to $y_i - P(1|x_i,\theta_2)$, and since $E\{y - P(1|x,\theta^*)\,|\,x\} = 0$, the limiting expectation vanishes at $(\theta^*,\theta^*)$ by the law of iterated expectations.
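A small Monte Carlo experiment (hypothetical design and parameters) illustrates this adaptivity: the FGLS estimator using estimated weights has essentially the same sampling spread as the infeasible GLS estimator that uses the true $\sigma^2(x)$:

    import numpy as np
    from scipy.stats import norm
    from scipy.optimize import least_squares

    rng = np.random.default_rng(1)
    theta_star = np.array([0.2, 0.8])                 # hypothetical true probit parameters
    N, reps = 500, 200
    feasible, infeasible = [], []

    for _ in range(reps):
        X = np.column_stack([np.ones(N), rng.standard_normal(N)])
        p = norm.cdf(X @ theta_star)
        y = (rng.uniform(size=N) < p).astype(float)
        sd_true = np.sqrt(p * (1 - p))                # true sigma(x_i)
        t1 = least_squares(lambda t: y - norm.cdf(X @ t), np.zeros(2)).x
        ph = np.clip(norm.cdf(X @ t1), 1e-6, 1.0 - 1e-6)
        sd_hat = np.sqrt(ph * (1 - ph))               # estimated sigma(x_i)
        feasible.append(least_squares(lambda t: (y - norm.cdf(X @ t)) / sd_hat, t1).x)
        infeasible.append(least_squares(lambda t: (y - norm.cdf(X @ t)) / sd_true, t1).x)

    print(np.std(feasible, axis=0))                   # the two spreads are nearly identical:
    print(np.std(infeasible, axis=0))                 # estimating the weights costs nothing asymptotically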

Question 8 Figure 3 presents a comparison of the true choice probability and nonparametric estimates of this probability using both kernel and series estimators from the program kernel.gpr.

The series estimator seems to provide a better estimate of the true choice probability than the kernel estimator in this case. The series estimator is just the predicted value $\hat y$ from a simple OLS regression of the $y_i$ on a constant and the first 3 powers of $x$:

\[ \hat P(1|x) \;=\; \hat\alpha_0 + \hat\alpha_1 x + \hat\alpha_2 x^2 + \hat\alpha_3 x^3, \]

and the kernel estimator is the standard Nadaraya-Watson estimator

\[ \hat P(1|x) \;=\; \frac{ \sum_{i=1}^N y_i\, K\bigl( (x-x_i)/h \bigr) }{ \sum_{i=1}^N K\bigl( (x-x_i)/h \bigr) }, \]

where $K$ is a Gaussian density function. For the choice of the bandwidth parameter $h$, a rule of thumb of the form $h = c\,\hat\sigma_x N^{-1/5}$ is used, where $\hat\sigma_x$ is the sample standard deviation of $x$; the $N^{-1/5}$ rate is the one that balances squared bias against variance for a second-order kernel. The series estimator is much faster to compute than the kernel estimator, since the above summations must be carried out for each of the N=3000 observations in the sample in order to plot the estimated choice probability at each observation. Comparing the fit of the parametric and nonparametric models in figures 1 and 3, we see that even though the logit and probit models are ``parametric,'' they have sufficient flexibility to provide a better fit than either the kernel or series estimators. This conclusion is obviously specific to this example, where the true conditional choice probability was generated by a probit model, and, as we saw from figure 1, one can adjust the parameters to make the predicted probabilities of the logit and probit models quite close to each other.
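Minimal sketches of both nonparametric estimators (the bandwidth constant in the final comment is a common choice, assumed here rather than taken from kernel.gpr):

    import numpy as np

    def series_fit(x, y, degree=3):
        """Series estimator: OLS of y on a constant and the first `degree` powers of x."""
        X = np.vander(x, degree + 1, increasing=True)     # columns [1, x, x^2, x^3]
        coef, *_ = np.linalg.lstsq(X, y, rcond=None)
        return X @ coef

    def nadaraya_watson(x, y, h):
        """Nadaraya-Watson estimator with a Gaussian kernel, evaluated at each x_i."""
        u = (x[:, None] - x[None, :]) / h                 # N x N scaled distances
        K = np.exp(-0.5 * u**2)                           # Gaussian kernel (constants cancel in the ratio)
        return (K @ y) / K.sum(axis=1)

    # hypothetical bandwidth from a rule of thumb of the form c * std(x) * N^(-1/5):
    # h = 1.06 * x.std() * len(x) ** (-0.2)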

Figure 4 plots the estimated choice probabilities produced by both the probit and logit maximum likelihood estimates and the kernel and series nonparametric estimates. We see that, except for the ``hump'' in the kernel estimate, all the estimates are very close to each other. It would appear to be quite difficult to say which estimate is the ``correct'' one: instead we conclude that 4 different ways of estimating the conditional choice probability give approximately the same results.



