
Spring 2001 John Rust
Economics 551b 37 Hillhouse, Rm. 27

SOLUTIONS TO FINAL EXAM

April 27, 2001

Part I: 15 minutes, 15 points. Answer all questions below:

1. Suppose $\{\tilde X_1,\ldots,\tilde X_N\}$ are IID draws from a $N(\mu,\sigma^2)$ distribution (i.e. a normal distribution with mean $\mu$ and variance $\sigma^2$). Consider the estimator $\hat\theta_N$ defined by:

\begin{displaymath}\hat\theta_N = \left( {1 \over N} \sum_{i=1}^N \tilde
X_i \right)^2 \end{displaymath} (1)

Which of the following statements are true and which are false?

To answer this, note that the sample mean $\overline X_N$ of the N IID observations $\{\tilde X_1,\ldots,\tilde X_N\}$ is distributed as $N(\mu,\sigma^2/N)$. Then $\hat\theta_N$ is the square of $\overline X_N$ and is thus proportional to a non-central $\chi^2$ random variable. Its expectation is

\begin{displaymath}E\{\hat\theta_N\} = \left( \mu^2 + {\sigma^2 \over N}\right)
\end{displaymath} (2)

and its variance is

\begin{displaymath}\mbox{var}(\hat\theta_N) = E\{ \hat\theta^2_N\} -
[E\{\hat\theta_N\}]^2 = \left( { 2 \sigma^4 \over N^2} + { 4 \mu^2 \sigma^2
\over N}\right). \end{displaymath} (3)

Thus it is clear that $\hat\theta_N$ converges in probability to $\mu^2$ and is an upward biased estimator of $\mu^2$. These conclusions would follow even if the $X_i$'s were not normally distributed: since $\hat\theta_N$ is a continuous function ($x^2$) of the sample mean $\overline X_N$, the continuous mapping theorem implies that $\hat\theta_N$ converges in probability to $\mu^2$. Also, since the function $x^2$ is convex, Jensen's inequality can be used to show that

\begin{displaymath}E\{\hat\theta_N\}= E\{ [\overline X_N]^2\} > [E\{\overline X_N\}]^2 =
\mu^2.
\end{displaymath} (4)
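The bias formula above is easy to check by simulation. Below is a small illustrative Monte Carlo sketch (not part of the original solutions) comparing the average of $\hat\theta_N$ across replications with $\mu^2 + \sigma^2/N$; the values of $\mu$, $\sigma$, N and the number of replications are arbitrary choices.

# Illustrative Monte Carlo check (not part of the exam): simulate the estimator
# theta_hat_N = (sample mean)^2 for N(mu, sigma^2) data and compare its average
# across replications with the bias formula mu^2 + sigma^2/N derived above.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, N, reps = 2.0, 3.0, 50, 200_000

draws = rng.normal(mu, sigma, size=(reps, N))
theta_hat = draws.mean(axis=1) ** 2          # (X_bar)^2 in each replication

print("simulated E[theta_hat]:         ", theta_hat.mean())
print("theoretical mu^2 + sigma^2/N:   ", mu**2 + sigma**2 / N)
print("target mu^2 (probability limit):", mu**2)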

From these results the following true and false answers should now be obvious:

A.
$\hat\theta_N$ is a consistent estimator of $\sigma^2$. False.

B.
$\hat\theta_N$ is an unbiased estimator of $\sigma^2$. False.

C.
$\hat\theta_N$ is a consistent estimator of $\mu$. False.

D.
$\hat\theta_N$ is an unbiased estimator of $\mu$. False.

E.
$\hat\theta_N$ is a consistent estimator of $\mu^2$. True.

F.
$\hat\theta_N$ is an unbiased estimator of $\mu^2$. False.

G.
$\hat\theta_N$ is an upward biased estimator of $\mu^2$. True.

H.
$\hat\theta_N$ is a downward biased estimator of $\mu^2$. False.

2. Consider estimation of the linear model

\begin{displaymath}y=X\beta + \epsilon \end{displaymath} (5)

based on N IID observations $\{y_i,X_i\}$ where $X_i$ is a $K \times 1$ vector of independent variables and $y_i$ is a scalar dependent variable. Mark each of the following statements as true or false:

A.
The Gauss-Markov Theorem proves that the ordinary least squares (OLS) estimator is BLUE (Best Linear Unbiased Estimator). True.

B.
The Gauss-Markov Theorem requires that the error term in the regression $\epsilon$ be normally distributed with mean 0 and variance $\sigma^2$. False.

C.
The Gauss-Markov Theorem does not apply if the true regression function does not equal $X\beta$, i.e. if $E\{y\vert X\} \ne X\beta$. True.

D.
The Gauss-Markov Theorem does not apply if there is heteroscedasticity. True.

E.
The Gauss-Markov Theorem does not apply if the error term has a non-normal distribution. False.

F.
The maximum likelihood estimator of $\beta$ is more efficient than the OLS estimator of $\beta$. True.

G.
The OLS estimator of $\beta$ will be unbiased only if the error terms are distributed independently of X and have mean 0. False.

H.
The maximum likelihood estimator of $\beta$ is the same as OLS only in the case where $\epsilon$ is normally distributed. True.

I.
The OLS estimator will be a consistent estimator of $\beta$ even if the error term $\epsilon$ is not normal and even if there is heteroscedasticity. True.

J.
The OLS estimator of the asymptotic covariance matrix for $\beta$, $\hat\sigma^2 (X'X/N)^{-1}$ (where $\hat\sigma^2$ is the sample variance of the estimated residuals $\hat\epsilon_i=y_i-X_i\hat\beta$) is a consistent estimator regardless of whether $\epsilon$ is normally distributed or not. True.

K.
The OLS estimator of the asymptotic covariance matrix for $\beta$, $\hat\sigma^2 (X'X/N)^{-1}$ (where $\hat\sigma^2$ is the sample variance of the estimated residuals $\hat\epsilon_i=y_i-X_i\hat\beta$) is a consistent estimator regardless of whether there is heteroscedasticity in $\epsilon$. False.

L.
If the distribution of $\epsilon$ is double exponential, i.e. if $f(\epsilon)=\exp\{ -\vert\epsilon\vert/\sigma\}/(2\sigma)$, the maximum likelihood estimator of $\beta$ is the Least Absolute Deviations estimator and it is asymptotically efficient relative to the OLS estimator. True.

M.
The OLS estimator cannot be used if the regression function is misspecified, i.e. if the true regression function $E\{y\vert X\} \ne X\beta$. False.

N.
The OLS estimator will be inconsistent if $\epsilon$ and X are correlated. True.

O.
The OLS estimator will be inconsistent if the dependent variable y is truncated, i.e. if the dependent variable is actually determined by the relation

\begin{displaymath}y= \max[0, X\beta + \epsilon] \end{displaymath} (6)

True.

P.
The OLS estimator is inconsistent if $\epsilon$ has a Cauchy distribution, i.e. if the density of $\epsilon$ is given by

\begin{displaymath}f(\epsilon)={ 1 \over \pi (1+ \epsilon^2)} \end{displaymath} (7)

True.

Q.
The 2-stage least squares estimator is a better estimator than the OLS estimator because it has two stages and is therefore twice as efficient. False.

R.
If the set of instrumental variables W and the set of regressors X in the linear model coincide, then the 2-stage least squares estimator of $\beta$ is the same as the OLS estimator of $\beta$. True.

Part II: 30 minutes, 30 points. Answer 2 of the following 6 questions below.

QUESTION 1 (Probability question) Suppose $\tilde Z$ is a $K \times 1$ random vector with a multivariate N(0,I) distribution, i.e. $E\{\tilde Z\}=0$ where 0 is a $K \times 1$ vector of zeros and $E\{\tilde Z \tilde Z'\}=I$ where I is the $K \times K$ identity matrix. Let M be a $K \times K$ idempotent matrix, i.e. a matrix that satisfies

\begin{displaymath}M^2 = M \cdot M = M \end{displaymath} (8)

Show that

\begin{displaymath}\tilde Z' M \tilde Z \sim \chi^2(J) \end{displaymath} (9)

where $\chi^2(J)$ denotes a chi-squared random variable with J degrees of freedom and $J= \mbox{rank}(M)$. Hint: Use the fact that M has a singular value decomposition, i.e.

M = X D X' (10)

where X' X = I and D is a diagonal matrix whose diagonal elements are equal to either 1 or 0.

Answer: Let X be the $K \times K$ orthonormal matrix in the singular value decomposition of the idempotent matrix M. Since X'X=I, it follows that $\tilde W \equiv X'\tilde Z$ is N(0,I). Thus, $\tilde Z'M\tilde Z$ can be rewritten as $(X'\tilde Z)' D (X' \tilde Z)$. Since D is a diagonal matrix with J 1's and K - J 0's on its main diagonal, it follows that $(X'\tilde Z)' D (X' \tilde Z)= \tilde W' D \tilde W$ is algebraically the sum of squares of J IID N(0,1) random variables and thus has a $\chi^2(J)$ distribution. That is, assuming without loss of generality that the first J elements of the diagonal of D are 1's and the remaining K-J elements are 0's, we have

\begin{displaymath}\tilde Z' M \tilde Z = (X'\tilde Z)' D (X' \tilde Z) = \tilde W' D
\tilde W = \tilde W_1^2 + \cdots + \tilde W_J^2. \end{displaymath} (11)

Since the $\{\tilde W_j\}$ are IID N(0,1)'s, the result follows.
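A quick numerical sanity check of this result (not part of the original solutions) is sketched below: an idempotent matrix of rank J is built as a projection matrix, and the simulated quadratic forms $\tilde Z' M \tilde Z$ are compared with the mean and variance of a $\chi^2(J)$ variable. The dimensions and seed are arbitrary choices.

# Illustrative check (not part of the exam): for an idempotent matrix M of rank J,
# Z'MZ with Z ~ N(0, I_K) should match a chi-squared(J) distribution. Here M is
# built as a projection matrix X(X'X)^{-1}X' for an arbitrary K x J matrix X.
import numpy as np

rng = np.random.default_rng(1)
K, J, reps = 8, 3, 100_000

X = rng.normal(size=(K, J))
M = X @ np.linalg.inv(X.T @ X) @ X.T         # idempotent projection, rank J

Z = rng.normal(size=(reps, K))
quad = np.einsum('ri,ij,rj->r', Z, M, Z)     # Z'MZ for each replication

# A chi-squared(J) variable has mean J and variance 2J.
print("simulated mean, variance:   ", quad.mean(), quad.var())
print("chi-squared(J) mean, variance:", J, 2 * J)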

QUESTION 2 (Markov Processes)

A.
(10%) Are Markov processes of any use in econometrics? Describe some examples of how Markov processes are used in econometrics, such as providing models of serially dependent data, serving as a framework for establishing convergence of estimators and proving laws of large numbers, central limit theorems, etc., and serving as a computational tool for doing simulations.

B.
(10%) What is a random walk? Is a random walk always a Markov process? If not, provide a counter-example.

C.
(40%) What is the ergodic or invariant distribution of a Markov process? Do all Markov processes have invariant distributions? If not, provide a counterexample of a Markov process that doesn't have an invariant distribution. Can a Markov process have more than 1 invariant distribution? If so, give an example.
D.
(40%) Consider the discrete Markov process $\{X_{t}\}=\{1,2,3\}$ with transition probability

\begin{eqnarray*}
P\{X_{t+1}=1\vert X_{t}=1\}&=&\frac{1}{2} \quad
P\{X_{t+1}=2\vert X_{t}=1\}=\frac{1}{3} \quad
P\{X_{t+1}=3\vert X_{t}=1\}=\frac{1}{6} \\
P\{X_{t+1}=1\vert X_{t}=2\}&=&\frac{3}{4} \quad
P\{X_{t+1}=3\vert X_{t}=2\}=\frac{1}{4} \quad
P\{X_{t+1}=2\vert X_{t}=3\}=1
\end{eqnarray*}


Does this process have an invariant distribution? If so, find all of them.

ANSWERS:

A.
Markov processes play a major role in econometrics, since they provide one of the simplest yet most general frameworks for modeling temporal dependence. Markov processes are used extensively in time series econometrics, since there are laws of large numbers and central limit theorems that apply to very general classes of Markov processes that satisfy a ``geometric ergodicity'' condition. Markov processes are also used extensively in Gibbs Sampling, which is a technique for simulating draws from a posterior distribution in econometric models where the posterior has no convenient analytical solution.

B.
A random walk $\{X_t\}$ is a special type of Markov process that is represented as

\begin{displaymath}X_t = X_{t-1} + \epsilon_t,
\end{displaymath} (12)

where $\{\epsilon_t\}$ is an IID process that is independent of past values of $\{X_t\}$. If $E\{\epsilon_t\} > 0$ the random walk has positive drift and if $E\{\epsilon_t\} < 0$ it has negative drift. A random walk is always a Markov process, since $X_{t-1}$ is a sufficient statistic for determining the probability distribution of $X_t$, and previous values $\{X_{t-2},X_{t-3},\ldots\}$ are irrelevant. If F is the CDF for $\epsilon_t$, then the Markov transition probability for $\{X_t\}$ is given by

\begin{displaymath}\mbox{Pr}\{ X_t \le x' \vert X_{t-1}=x\} = F(x'-x). \end{displaymath} (13)

C.
If a Markov process has a transition probability P(x'|x), then its invariant distribution $\Pi$ is defined by

 \begin{displaymath}\Pi(x') = \int P(x'\vert x)\Pi(dx).
\end{displaymath} (14)

What this equation says is that if $X_t \sim \Pi$ (i.e. $X_t$ is distributed according to the probability distribution $\Pi$), then $X_{t+1}$ is also distributed according to this same probability distribution. Not all Markov processes have invariant distributions. A random walk does not have an invariant distribution, i.e. there is no solution to equation (14) above. To see this, note in particular that due to the independence between $X_{t-1}$ and $\epsilon_t$ we have

\begin{displaymath}\mbox{var}(X_t) = \mbox{var}(X_{t-1}+ \epsilon_t) = \mbox{var}(X_{t-1})
+ \mbox{var}(\epsilon_t) > \mbox{var}(X_{t-1}),\end{displaymath} (15)

so that regardless of what distribution $X_{t-1}$ has, it is impossible for $X_t$ to have this same distribution (assuming $\mbox{var}(\epsilon_t) > 0$).

D.
The transition probability matrix P for this process is given by the following $3 \times 3$ matrix

\begin{displaymath}P= \left[ \begin{array}{ccc}
1/2 & 1/3& 1/6 \\
3/4 & 0 & 1/4 \\
0 & 1 & 0 \end{array} \right] \end{displaymath} (16)

The invariant probability is the solution $\Pi$ to the $3 \times 3$ system of equations

\begin{displaymath}\Pi = \Pi P \end{displaymath} (17)

We can write this out as

\begin{eqnarray*}
\pi_1 &=& {1 \over 2} \pi_1 + { 3 \over 4} \pi_2 + 0 \pi_3 \\
\pi_2 &=& {1 \over 3} \pi_1 + 0 \pi_2 + \pi_3 \\
\pi_3 &=& {1 \over 6} \pi_1 + {1 \over 4} \pi_2 + 0 \pi_3
\end{eqnarray*} (18)

You can verify that the unique solution to the above system of equations that is a probability distribution (i.e. with $\pi_1+\pi_2+\pi_3=1$) is $(\pi_1,\pi_2,\pi_3)=(1/2,1/3,1/6)$, i.e. the unique invariant distribution is the same as the first row of P.
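The calculation can also be confirmed numerically; the sketch below (not part of the original solutions) recovers the invariant distribution as the normalized left eigenvector of P associated with the unit eigenvalue.

# Illustrative check (not part of the exam): solve pi = pi P for the 3-state chain
# above by taking the left eigenvector of P with eigenvalue 1 and normalizing it
# to sum to one; it should reproduce (1/2, 1/3, 1/6).
import numpy as np

P = np.array([[1/2, 1/3, 1/6],
              [3/4, 0.0, 1/4],
              [0.0, 1.0, 0.0]])

eigvals, eigvecs = np.linalg.eig(P.T)        # left eigenvectors of P
idx = np.argmin(np.abs(eigvals - 1.0))       # eigenvalue closest to 1
pi = np.real(eigvecs[:, idx])
pi = pi / pi.sum()

print("invariant distribution:", pi)         # approximately [0.5, 0.3333, 0.1667]
print("check pi P:           ", pi @ P)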

QUESTION 3 (Consistency of M-estimator) Consider an M-estimator defined by:


\begin{displaymath}\widehat{\theta }_{N}=\arg \max_{\theta \in \Theta }Q_{N}(\theta ).
\end{displaymath}

Suppose the following two conditions are given:

(i) (Identification) For all $\varepsilon >0$


\begin{displaymath}Q(\theta ^{*})>\sup_{\theta \notin B(\theta ^{*},\varepsilon )}Q(\theta
)
\end{displaymath}

where $B(\theta ^{*},\varepsilon )=\{\theta \in R^k \big\vert \Vert\theta-\theta^*\Vert
< \epsilon\}$.

(ii) (Uniform Convergence)


\begin{displaymath}\sup_{\theta \in \Theta }\left\vert Q_{N}(\theta )-Q(\theta )\right\vert
\stackrel{p%
}{\rightarrow }0.
\end{displaymath}

Prove consistency of the estimator by showing


\begin{displaymath}P\left( \widehat{\theta }_{N}\notin B(\theta ^{*},\varepsilon )\right)
\rightarrow 0.
\end{displaymath}

ANSWER: Uniform convergence in probability can be stated formally as follows: for any $\delta > 0$ we have

\begin{displaymath}\lim_{N \to \infty} \mbox{Pr}\left\{ \sup_{\theta \in \Theta }\left\vert
Q_{N}(\theta )-Q(\theta )\right\vert < \delta \right\} = 1.
\end{displaymath} (19)

Now, given any $\varepsilon >0$, define $\delta$ by

 \begin{displaymath}\delta \equiv Q(\theta^*) - \sup_{\theta \notin B(\theta
^{*},\varepsilon )}Q(\theta
)
\end{displaymath} (20)

The identification assumption implies that $\delta > 0$. Now, we want to show that for any $\varepsilon >0$ we have

\begin{displaymath}\lim_{N \to \infty} \mbox{Pr}\left\{ \widehat{\theta }_{N}\notin
B(\theta ^{*},\varepsilon )\right\} = 0.
\end{displaymath} (21)

Notice if $\widehat{\theta }_{N}\notin
B(\theta ^{*},\varepsilon )$ then we have

\begin{displaymath}Q_N(\theta^*) - \sup_{\theta \notin B(\theta
^{*},\varepsilon )}Q_N(\theta) \le 0. \end{displaymath} (22)

So it is sufficient to show that uniform convergence implies that

\begin{displaymath}\lim_{N \to \infty} \mbox{Pr}\left\{
Q_N(\theta^*) - \sup_{\theta \notin B(\theta^{*},\varepsilon )}Q_N(\theta) \le 0\right\} = 0. \end{displaymath} (23)

Using the $\delta$ defined in equation (20) and the definition of uniform convergence in probability in equation (19), we have

\begin{displaymath}\lim_{N \to \infty} \mbox{Pr}\left\{ \sup_{\theta \in \Theta }\left\vert
Q_{N}(\theta )-Q(\theta )\right\vert < \delta/3 \right\} = 1.
\end{displaymath} (24)

Thus, for N sufficiently large, the following inequalities will hold with probability arbitrarily close to 1,
\begin{eqnarray*}
Q_N(\theta^*) &>& Q(\theta^*) - \delta/3 \\
\sup_{\theta \notin B(\theta^{*},\varepsilon )}Q_N(\theta) &<& \sup_{\theta \notin B(\theta^{*},\varepsilon )}Q(\theta) + \delta/3
\end{eqnarray*} (25)

Combining the above inequalities, it follows that the following inequality will hold with probability arbitrarily close to 1 for N sufficiently large:

\begin{displaymath}Q_N(\theta^*) - \sup_{\theta \notin B(\theta^{*},\varepsilon )}Q_N(\theta) >
Q(\theta^*) - \sup_{\theta \notin B(\theta^{*},\varepsilon )}Q(\theta) - {2 \delta \over 3} = { \delta
\over 3}.
\end{displaymath} (26)

This implies that

\begin{displaymath}\lim_{N \to \infty} \mbox{Pr}\left\{
Q_N(\theta^*) - \sup_{\theta \notin B(\theta^{*},\varepsilon )}Q_N(\theta) > { \delta \over 3} \right\} = 1. \end{displaymath} (27)

Since $\delta > 0$, this implies that

\begin{displaymath}\lim_{N \to \infty} \mbox{Pr}\left\{
Q_N(\theta^*) - \sup_{\theta \notin B(\theta^{*},\varepsilon )}Q_N(\theta) \le 0\right\} = 0. \end{displaymath} (28)

Since the event that $\widehat{\theta }_{N}\notin B(\theta ^{*},\varepsilon )$ is a subset of the event that $Q_N(\theta^*) - \sup_{\theta \notin B(\theta^{*},\varepsilon )}Q_N(\theta) \le 0$, it follows that the limit in equation (21) holds, i.e. $\hat\theta_N$ is a consistent estimator of $\theta^*$.

QUESTION 4 (Time series question) Suppose $\{X_t\}$ is an ARMA(p,q) process, i.e.

\begin{displaymath}A(L)X_t = B(L)\epsilon_t \end{displaymath}

where A(L) is a $q^{\hbox{\rm th}}$ order lag-polynomial

\begin{displaymath}A(L) = \alpha_0 + \alpha_1 L + \alpha_2 L^2 + \cdots + \alpha_q L^q
\end{displaymath}

and B(L) is a $p^{\hbox{\rm th}}$ order lag-polynomial

\begin{displaymath}B(L) = \beta_0 + \beta_1 L + \beta_2 L^2 + \cdots + \beta_p L^p \end{displaymath}

and the lag operator $L^k$ is defined by

\begin{displaymath}L^k X_t = X_{t-k} \end{displaymath}

and $\{\epsilon_t\}$ is a white-noise process: $E\{\epsilon_t\}=0$ and $\mbox{cov}(\epsilon_t,\epsilon_s)=0$ if $t\ne s$, $=\sigma^2$ if t=s.

A.
(30%) Write down the autocovariance and spectral density functions for this process.
B.
(30%) Show that if p = 0 an autoregression of Xt on q lags of itself provides a consistent estimate of $(\alpha_0/\sigma,
\ldots,\alpha_q/\sigma)$. Is the autoregression still consistent if p > 0?
C.
(40%) Assume that a central limit theorem holds, i.e. that normalized sums of $\{X_t\}$ converge in distribution to a normal random variable. Write down an expression for the variance of the limiting normal distribution.

ANSWERS

A.
The answer to this question is very complicated if you attempt to proceed via direct calculation (although it can be done), but it is much easier if you use the concept of a z-transform and the covariance generating function G(z) of the scalar process $\{X_t\}$. The answer is that the spectral density function $f(\lambda)$ for the $\{X_t\}$ process is given by

\begin{displaymath}f(\lambda) = \sigma^2 { \vert B(e^{-i \lambda})\vert^2 \over
\vert A(e^{- i \lambda})\vert^2 }
\end{displaymath} (29)

provided the characteristic polynomial A(z)=0 has no roots on the unit circle. The autocovariances of the $\{X_t\}$ process can then be derived from the spectral density via the formula

\begin{displaymath}\mbox{cov}(X_t,X_{t+k}) =
\gamma_k = {1 \over 2 \pi} \int_{-\pi}^\pi f(\lambda)
e^{i\lambda
k} d\lambda.
\end{displaymath} (30)

Answering this question presumes a basic familiarity with Fourier transform technology. I repeat the basics of this below.

Given a sequence of real numbers $\{\psi_k\}$ where k ranges from $-\infty,\ldots,\infty$ the z-transform G(z) is defined by

\begin{displaymath}G(z) = \sum_{k=-\infty}^\infty \psi_k z^k
\end{displaymath} (31)

where z is a complex variable satisfying $r^{-1} < \vert z \vert < r$ for some r > 1. The autocovariance generating function is then just the z-transform of the autocovariance sequence $\{\gamma_k\}$:

\begin{displaymath}G(z) \equiv \sum_{k=-\infty}^\infty \gamma_k z^k \end{displaymath} (32)

where $\gamma_k = \mbox{cov}(X_t,X_{t+k})=E\{X_t X_{t+k}\}$. Thus, if we can find a representation for G(z), we can pick off the autocovariance $\gamma_k$ as the coefficient of $z^k$ in the power series representation for G(z). Alternatively we can define the spectral density $f(\lambda)$ for the $\{X_t\}$ process by

\begin{displaymath}f(\lambda) = \sum_{k=-\infty}^\infty \gamma_k e^{-i \lambda k}
\end{displaymath} (33)

where $i = \sqrt{-1}$. Note that by the standard properties of Fourier series, we can recover the autocovariance $\gamma_k$ by the formula:

 \begin{displaymath}\gamma_k = {1 \over 2 \pi} \int_{-\pi}^\pi f(\lambda) e^{i\lambda
k} d\lambda.
\end{displaymath} (34)

This is due to the fact that the sequence of complex valued functions $\{e^{i\lambda k}\}$ mapping $[-\pi,\pi]$ to the unit circle in the complex plane is an orthogonal sequence under the inner product for complex-valued functions mapping $[-\pi,\pi] \to C$ defined by:

\begin{displaymath}\langle f,g\rangle = \int_{-\pi}^\pi f(\lambda)\overline g(\lambda)d\lambda,
\end{displaymath} (35)

where $\overline g(\lambda)$ is the complex conjugate of $g(\lambda)$. Since the complex conjugate of $e^{i \lambda k}$ is $e^{-i \lambda k}$we have

\begin{displaymath}\langle e^{i\lambda j},e^{i\lambda k}\rangle \equiv \int_{-\pi}^\pi e^{i
\lambda j} e^{-i \lambda k} d\lambda = \int_{-\pi}^\pi e^{i
\lambda (j-k)}d\lambda.
\end{displaymath} (36)

Clearly if j=k then we have

\begin{displaymath}\langle e^{i\lambda j},e^{i\lambda k}\rangle = \int_{-\pi}^\pi d\lambda
= 2 \pi,
\end{displaymath} (37)

but if $j\ne k$ we have, using the identity $e^{i\lambda} = \cos(\lambda) + i \sin(\lambda)$,

\begin{displaymath}\langle e^{i\lambda j},e^{i\lambda k}\rangle = \int_{-\pi}^\pi
\cos(\lambda (j-k)) + i \sin(\lambda (j-k))d\lambda = 0.
\end{displaymath} (38)

Since $\sin(k\lambda)$ and $\cos(k\lambda)$ are periodic functions for any non-zero integer k, their integrals over the interval $[-\pi,\pi]$ are zero. Thus, since $\{e^{i\lambda k}\}$ is an orthogonal family, $\gamma_k$ is essentially the $k^{\mbox{th}}$ regression coefficient if we ``regress'' the spectral density function against the sequence of orthogonal basis functions $\{e^{i\lambda k}\}$,

\begin{displaymath}\langle f(\lambda),e^{-i \lambda k}\rangle = \int_{-\pi}^\pi
\sum_{j=-\infty}^\infty \gamma_j e^{-i \lambda j} e^{i \lambda k} d
\lambda = 2 \pi \gamma_k.
\end{displaymath} (39)

Solving the above equation for $\gamma_k$ results in the Fourier inversion formula in equation (34) above. Note also that the spectral density is related to the covariance generating function by the identity

\begin{displaymath}f(\lambda) = G(e^{-i\lambda}),
\end{displaymath} (40)

so the problem reduces to finding an expression for the covariance generating function for an $\mbox{ARMA}(p,q)$ process. Assume that the characteristic polynomial A(z) has no roots on the unit circle, i.e. there is no complex number z with $\vert z\vert^2 = z \overline z = 1$ such that A(z)=0. In this case it can be shown (see Theorem 3.1.3 of Brockwell and Davis, 1991) that the ARMA process $\{X_t\}$ has an infinite moving average representation:

 \begin{displaymath}X_t = \sum_{j=-\infty}^\infty \psi_j \epsilon_{t-j},
\end{displaymath} (41)

where $\psi_j$ is the $j^{\mbox{th}}$ coefficient in the power series representation of the z-transform $\Psi(z)$ given by

\begin{displaymath}\Psi(z) = B(z)A(z)^{-1}.
\end{displaymath} (42)

The autocovariances of the infinite MA process (41) can be derived as follows:

\begin{displaymath}\gamma_k = \mbox{cov}(X_{t+k},X_t) = \sigma^2 \sum_{j=-\infty}^\infty
\psi_j \psi_{j+\vert k\vert}
\end{displaymath} (43)

Thus, the autocovariance generating function is given by

\begin{eqnarray*}
G(z) &=& \sum_{k=-\infty}^\infty \gamma_k z^k \\
&=& \sigma^2 \sum_{k=-\infty}^\infty \sum_{j=-\infty}^\infty \psi_j \psi_{j+\vert k\vert} z^k \\
&=& \sigma^2 \left[ \sum_{j=-\infty}^\infty \psi_j^2 + \sum_{k=1}^\infty \sum_{j=-\infty}^\infty \psi_j \psi_{j+k}(z^k+z^{-k}) \right] \\
&=& \sigma^2 \left( \sum_{j=-\infty}^\infty \psi_j z^j\right) \left( \sum_{k=-\infty}^\infty \psi_k z^{-k}\right) \\
&=& \sigma^2 \Psi(z) \Psi(z^{-1}).
\end{eqnarray*} (44)

However using the fact that $\Psi(z)=B(z)A(z)^{-1}$ it follows that

\begin{displaymath}G(z) = \sigma^2 { B(z) B(z^{-1}) \over A(z) A(z^{-1}) }
\end{displaymath} (45)

Substituting $z=e^{-i\lambda}$ we obtain

\begin{displaymath}f(\lambda) = G(e^{-i \lambda}) = \sigma^2 { B(e^{-i\lambda}) B(e^{i\lambda})
\over A(e^{-i\lambda}) A(e^{i\lambda})} = \sigma^2 {\vert B(e^{-i\lambda})\vert^2
\over \vert A(e^{-i\lambda})\vert^2}
\end{displaymath} (46)

since $A(e^{-i\lambda})= \overline A(e^{i\lambda})$ and thus $A(e^{i \lambda}) A(e^{-i\lambda}) = \vert A(e^{-i\lambda})\vert^2$ and similarly for B.
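As an illustrative numerical check (not part of the original solutions), the sketch below picks an arbitrary ARMA example, $(1 - 0.5L)X_t = (1 + 0.4L)\epsilon_t$ with $\sigma^2 = 1$, computes $\gamma_0$ and $\gamma_1$ from the spectral density via the inversion formula (30), and compares them with sample autocovariances from a long simulated series.

# Illustrative check (not part of the exam): for the ARMA process
# (1 - 0.5 L) X_t = (1 + 0.4 L) eps_t with sigma^2 = 1 (an arbitrary example,
# i.e. alpha = (1, -0.5) and beta = (1, 0.4) in the notation above), compare the
# autocovariances implied by f(lambda) = |B(e^{-i lambda})|^2 / |A(e^{-i lambda})|^2
# through the inversion formula (30) with sample autocovariances from a simulation.
import numpy as np

rng = np.random.default_rng(2)
T = 200_000
eps = rng.normal(size=T + 1)

# simulate X_t = 0.5 X_{t-1} + eps_t + 0.4 eps_{t-1}
x = np.zeros(T)
for t in range(1, T):
    x[t] = 0.5 * x[t - 1] + eps[t + 1] + 0.4 * eps[t]

# spectral density on a grid and Fourier inversion for gamma_0, gamma_1
lam = np.linspace(-np.pi, np.pi, 20_001)
A = 1.0 - 0.5 * np.exp(-1j * lam)
B = 1.0 + 0.4 * np.exp(-1j * lam)
f = np.abs(B) ** 2 / np.abs(A) ** 2
for k in range(2):
    gamma_k = (f * np.cos(k * lam)).mean()   # (1/2pi) integral of f e^{i lam k}
    sample_k = np.mean(x[: T - k] * x[k:])
    print(f"gamma_{k}: spectral {gamma_k:.3f}  sample {sample_k:.3f}")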

B.
When p = 0, B(L) reduces to the constant $\beta_0$ and we can write the ARMA representation for $\{X_t\}$ in autoregressive form:

\begin{displaymath}X_t = -{ \alpha_1 \over \alpha_0}X_{t-1} - \cdots - { \alpha_q \over
\alpha_0} X_{t-q} + {\beta_0 \over \alpha_0} \varepsilon_t
\end{displaymath} (47)

Since $\{\varepsilon_t\}$ is serially uncorrelated, and since $X_{t-j}$ depends only on lagged values $(\varepsilon_{t-j},\varepsilon_{t-j-1},\ldots)$, it follows that $\mbox{cov}(\varepsilon_t,X_{t-j})=0$, so the coefficients $\alpha_j/\alpha_0$ (up to sign) and the error variance $\beta_0^2 \sigma^2/\alpha_0^2$ in the above equation can be consistently estimated by OLS. We cannot identify all the parameters unless we make an identifying normalization on the variance of the white noise process such as $\sigma^2=1$, or normalize $\beta_0=1$. Suppose we make the latter normalization. Then the variance of the estimated residuals provides a consistent estimator of $\sigma^2/\alpha_0^2$, and dividing the estimated regression coefficient for the $j^{\mbox{th}}$ lag of $X_t$ in the above autoregression by the square root of the estimated variance of the residuals provides a consistent estimator of $\alpha_j/\sigma$ (up to sign). If p > 0, the error term $B(L)\varepsilon_t/\alpha_0$ is serially correlated and correlated with the lagged $X_t$'s that appear as regressors, so the autoregression is no longer consistent.

C.
Since $E\{X_t\}=0$, under suitable mixing conditions a central limit theorem will hold, i.e.

\begin{displaymath}{ 1 \over \sqrt{T}} {\sum_{t=1}^T X_t} \phantom{,}_{\Longrightarrow \atop d}\, N(0,\Omega)
\end{displaymath} (48)

where $\Omega$ is the long run variance given by

\begin{displaymath}\Omega = \sum_{j=-\infty}^\infty \gamma_j
\end{displaymath} (49)

where $\gamma_j=\mbox{cov}(X_t,X_{t+j})$ is the autocovariance at lag j, which can be derived from the spectral density function computed in part A; in fact, $\Omega = f(0) = \sigma^2 B(1)^2/A(1)^2$.

QUESTION 5 (Empirical question) Assume that shoppers always choose only a single brand of canned tuna fish from the available selection of K alternative brands of tuna fish each time they go shopping at a supermarket. Assume initially that the (true) probability that the decision-maker chooses brand k is the same for everybody and is given by $\theta^*_k$, $k=1,\ldots,K$. Marketing researchers would like to learn more about these choice probabilities, $\theta^*=(\theta^*_1,\ldots,\theta^*_K)$, and spend a great deal of money sampling shoppers' actual tuna fish choices in order to try to estimate these probabilities. Suppose the Chicken of the Sea Tuna company undertook a survey of N shoppers and, for each shopper shopping at a particular supermarket with a fixed set of K brands of tuna fish, recorded the brand $b_i$ chosen by shopper i, $i=1,\ldots,N$. Thus, $b_1=2$ denotes the observation that consumer 1 chose tuna brand 2, and $b_4=K$ denotes the observation that consumer 4 chose tuna brand K, etc.

A.
(10%) Without doing any estimation, are there any general restrictions that you can place on the $K \times 1$ parameter vector $\theta^*$?

Answer: we must have $\theta^*_j \ge 0$ and $\sum_{j=1}^K \theta^*_j =1$.

B.
(10%) Is it reasonable to suppose that $\theta^*_k$ is the same for everyone? Describe several factors that could lead different people to have different probabilities of purchasing different brands of tuna. If you were a consultant to Chicken of the Sea, what additional data would you recommend that they collect in order to better predict the probabilities that consumers buy various brands of tuna? Describe how you would use this data once it was collected.

Answer: no, it is quite unreasonable to assume that everyone has the same purchase probability. People of different ages, income levels, ethnic backgrounds and so forth are likely to have different tastes for tuna. Also, Chicken of the Sea is just one of many different brands of tuna, and the prices of the competing brands and observed characteristics of the competing brands (such as whether the tuna is packed in oil or water, the consistency of the tuna, and other characteristics) affect the probability a given consumer will choose Chicken of the Sea. Let the vector of observed characteristics for the K brands be given by the $L \times K$ matrix $Z=(Z_1,\ldots,Z_K)$ (i.e. there are L observed characteristics for each of the K different brands). Let the characteristics of the $j^{\mbox{th}}$ household be denoted by the $M \times 1$ vector $X_j$. Then a model that reflects observed heterogeneity and the competing brand characteristics would result in the following general form of the conditional probability that household j will choose brand k from the set of competing tuna brands offered in the store at time of purchase, $\mbox{Pr}(k\vert X_j,Z)$. An example of a model of consumer choice behavior that results in a specific functional form for $\mbox{Pr}(k\vert X_j,Z)$ is the multinomial logit model. This model is derived from utility maximization, where the utility of choosing brand k is given by $u(X_j,Z_k,\theta)+\epsilon_k$, and where $(\epsilon_1,\ldots,\epsilon_K)$ are unobserved factors affecting household j's decision, assumed to have a Type I extreme value distribution. In this case, the implied formula for $\mbox{Pr}(k\vert X_j,Z)$ is given by

\begin{displaymath}\mbox{Pr}(k\vert X_j,Z,\theta,\sigma) = { \exp\{ u(X_j,Z_k,\theta)/\sigma \} \over
\sum_{k'=1}^K \exp\{ u(X_j,Z_{k'},\theta)/\sigma \} }
\end{displaymath} (50)

where $\sigma$ is the scale parameter in the marginal distribution of $\epsilon_k$. Thus, given data $(X_1,\ldots,X_N)$ on the characteristics of N consumers, and their choices of tuna $(d_1,\ldots,d_N)$ and the observed characteristics Z, we could estimate the parameter vector $\theta$ by maximum likelihood using the log-likelihood function $L_N(\theta)$ given by

\begin{displaymath}L_N(\theta) = {1 \over N} \sum_{j=1}^N \log\left(
\mbox{Pr}(d_j\vert X_j,Z,\theta,\sigma) \right).
\end{displaymath} (51)

The estimated model could then be used to predict how the probabilities of purchasing different brands of tuna (and the predicted aggregate market shares) change in response to changes in prices or observed characteristics of the different brands of tuna.

C.
(20%) Using the observations $\{b_1,\ldots,b_N\}$ on the observed brand choices of the sample of N shoppers, write down an estimator for $\theta^*$ (under the assumption that the ``true'' brand choice probabilities $\theta^*$ are the same for everyone). Is your estimator unbiased?

Answer: In the simpler case where there are no consumer characteristics $X_j$ or product attributes Z, the choice probability can be represented by a single parameter per brand, $\mbox{Pr}(k\vert X_j,Z,\theta)=\theta_k$. These $\theta_k$ are also the market shares since everyone is homogeneous. The market share for brand k can be estimated in this sample as the fraction of the N people who choose brand k,

\begin{displaymath}s_k = {1 \over N} \sum_{i=1}^N I\{b_i = k\}
\end{displaymath} (52)

Thus if $s_k$ is the observed market share for product k, then we can estimate $\theta_k$ by $\hat\theta_k=s_k$. This estimator is unbiased, as shown in the answer to part D below.

D.
(20%) What is the maximum likelihood estimator of $\theta^*$? Is the maximum likelihood estimator unbiased?

Answer: The (normalized log) likelihood function in this case can be written as

\begin{displaymath}L_N(\theta) = {1 \over N} \sum_{j=1}^N \sum_{k=1}^K I\{ b_j = k\}
\log(\theta_k).
\end{displaymath} (53)

subject to the constraint that $1 = \theta_1 + \cdots + \theta_K$. Introducing a Lagrange multiplier $\lambda$ for this constraint, the Lagrangian for the likelihood function is

\begin{displaymath}{\cal L}(\theta,\lambda) = {1 \over N} \sum_{j=1}^N \sum_{k=1}^K I\{ b_j = k\}
\log(\theta_k) + \lambda (1 - \sum_{k=1}^K \theta_k)
\end{displaymath} (54)

The first order conditions are

\begin{displaymath}{1 \over N} \sum_{i=1}^N { I \{ b_i = k\} \over \theta_k} - \lambda =
0.\end{displaymath} (55)

Solving this for $\hat\theta_k$ and substituting this into the constraint, we can solve for $\lambda$, obtaining $\hat\lambda=1$. The resulting estimator is the same as the intuitive market share estimator given above, i.e.

\begin{displaymath}\hat\theta_k = s_k \end{displaymath} (56)

If the data $\{b_1,\ldots,b_N\}$ are really IID and the ``representative consumer'' model is really correct, then $\hat\theta_k$ is an unbiased estimator of $\theta_k^*$ since

\begin{displaymath}E\{\hat\theta_k\} = {1 \over N} \sum_{i=1}^N E\left\{ I\{ b_i = k\}\right\}
= \theta^*_k \end{displaymath} (57)

since the random variable $I\{b_i = k\}$ is a Bernoulli random variable which equals 1 with probability $\theta^*_k$ and 0 with probability $1-\theta^*_k$.
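Below is a small simulation sketch (not part of the original solutions) illustrating that the market-share estimator $s_k$, which coincides with the MLE derived above, is centered on $\theta^*$; the true probabilities, sample size and number of replications are arbitrary choices.

# Illustrative check (not part of the exam): simulate N IID brand choices from a
# multinomial with true probabilities theta_star and verify that the market-share
# estimator s_k (which is also the MLE derived above) is centered on theta_star.
import numpy as np

rng = np.random.default_rng(3)
theta_star = np.array([0.5, 0.3, 0.2])       # arbitrary true choice probabilities
K, N, reps = len(theta_star), 100, 50_000

counts = rng.multinomial(N, theta_star, size=reps)   # n_k for each replication
shares = counts / N                                  # s_k = n_k / N

print("average of s_k over replications:", shares.mean(axis=0))
print("true theta_star:                 ", theta_star)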

E.
(40%) Suppose Chicken of the Sea Tuna company also collected data on the prices $\{p_1,\ldots,p_K\}$ that the supermarket charged for each of the K different brands of tuna fish. Suppose someone proposed that the probability of buying brand j was a function of the prices of all the various brands of tuna, $\theta^*_j(p_1,\ldots,p_K)$, given by:

\begin{displaymath}\theta^*_j(p_1,\ldots,p_K)= { \exp\left\{ \beta_j + \alpha p_j \right\}
\over \sum_{k=1}^K \exp\left\{ \beta_k + \alpha p_k \right\} }
\end{displaymath}

Describe in general terms how to estimate the parameters $(\alpha,\beta_1,\ldots,\beta_K)$. If $\alpha >0$, does an increase in $p_j$ decrease or increase the probability that a shopper would buy brand j?

Answer: This was already discussed in the answer to part B. The model is a special case of the more general multinomial logit model discussed there. In this case the implicit utility function depends only on a single characteristic of brand k, namely its price $p_k$; the other characteristics of the brand are implicitly captured in the brand-specific dummy variable $\beta_k$. Since no consumer-level characteristics enter the model, the utility function is given by

\begin{displaymath}u(X_j,Z_k,\theta) = \beta_k + \alpha p_k
\end{displaymath} (58)

where $\theta=(\beta_1,\ldots,\beta_K,\alpha)$. If $\alpha >0$ then the utility of brand k increases in the price of brand k, an economically counter-intuitive result. This suggests that the probability of purchasing brand k is an increasing function of $p_k$, and this can be verified by computing

\begin{displaymath}{ \partial \mbox{Pr} \over \partial p_k}
(k\vert p_1,\ldots,p_K,\theta) = \alpha \mbox{Pr}(k\vert p_1,\ldots,p_K,\theta) [1 -
\mbox{Pr}(k\vert p_1,\ldots,p_K,\theta)] > 0.
\end{displaymath} (59)
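The sign of this derivative can be checked numerically; the sketch below (not part of the original solutions) compares a finite-difference derivative of the logit choice probability with the analytic expression in (59), using arbitrary parameter values.

# Illustrative check (not part of the exam): for the logit choice probabilities
# theta_j(p) = exp(beta_j + alpha p_j) / sum_k exp(beta_k + alpha p_k), compare a
# finite-difference derivative with respect to p_j against the analytic expression
# alpha * theta_j * (1 - theta_j) from equation (59). Parameter values are arbitrary.
import numpy as np

def choice_probs(beta, alpha, p):
    u = beta + alpha * p
    e = np.exp(u - u.max())                  # subtract max for numerical stability
    return e / e.sum()

beta = np.array([0.2, -0.1, 0.5])
alpha, p = 0.8, np.array([1.0, 1.2, 0.9])
j, h = 0, 1e-6

probs = choice_probs(beta, alpha, p)
p_up = p.copy()
p_up[j] += h
fd = (choice_probs(beta, alpha, p_up)[j] - probs[j]) / h

print("finite difference:      ", fd)
print("alpha * P_j * (1 - P_j):", alpha * probs[j] * (1 - probs[j]))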

QUESTION 6 (Regression question) Let $(y_t,x_t)$ be IID observations from a regression model

\begin{displaymath}y_t = \beta x_t + \epsilon_t \end{displaymath}

where $y_t$, $x_t$, and $\epsilon_t$ are all scalars. Suppose that $\epsilon_t$ is normally distributed with $E\{\epsilon_t\vert x_t\}=0$, but $\mbox{var}(\epsilon_t\vert x_t)=\sigma^2 \vert x_t\vert^\theta$. Consider the following two estimators for $\beta^*$:

\begin{displaymath}\hat\beta^1_T = { \sum_{t=1}^T y_t \over \sum_{t=1}^T x_t } \end{displaymath}


\begin{displaymath}\hat\beta^2_T = { \sum_{t=1}^T x_t y_t \over \sum_{t=1}^T x^2_t } \end{displaymath}

A.
(20%) Are these two estimators consistent estimators of $\beta^*$? Which estimator is more efficient when: 1) if we know a priori that $\theta^*=0$, and 2) we don't know $\theta^*$? Explain your reasoning for full credit.

Answer: Both estimators are consistent estimators of $\beta^*$. To see this, note that by dividing the numerator and denominator of $\hat\beta^1_T$ by T and applying the Law of Large Numbers we obtain

\begin{displaymath}\hat\beta^1_T\phantom{,}_{\longrightarrow \atop p} { E\{y\} \over
E\{x\}} = { \beta^* E\{x\} \over E\{x\} } = \beta^*.
\end{displaymath} (60)

The second estimator is the OLS estimator, and it is also a consistent estimator of $\beta^*$:

\begin{displaymath}\hat\beta^2_T\phantom{,}_{\longrightarrow \atop p} { E\{xy\} \over
E\{x^2\}} = { \beta^* E\{x^2\} \over E\{x^2\} } = \beta^*. \end{displaymath} (61)

When $\theta^*=0$ the Gauss-Markov Theorem applies and the OLS estimator is the best linear unbiased estimator of $\beta^*$. It is also the maximum likelihood estimator when the errors are normally distributed, and so is asymptotically efficient in the class of all (potentially nonlinear) regular estimators of $\beta^*$. We can derive the asymptotic efficiency of $\hat\beta^2_T$relative to $\hat\beta^1_T$ through a simple application of the central limit theorem. We have

\begin{displaymath}\sqrt{T}(\hat\beta^1_T - \beta^*) = { { 1 \over \sqrt{T}} \sum_{t=1}^T \epsilon_t
\over {1 \over T} \sum_{t=1}^T x_t } \phantom{,}_{\Longrightarrow \atop d}
{ \tilde Z \over E\{x\}} \sim N\left(0,{\sigma^2 \over E\{x\}^2} \right).
\end{displaymath} (62)

where $\tilde Z \sim N(0,\sigma^2)$. Similarly, the asymptotic distribution of the OLS estimator $\hat\beta^2_T$ is given by

\begin{displaymath}\sqrt{T}(\hat\beta^2_T - \beta^*) = { { 1 \over \sqrt{T}} \sum_{t=1}^T x_t \epsilon_t
\over {1 \over T} \sum_{t=1}^T x^2_t } \phantom{,}_{\Longrightarrow \atop d}
{ \tilde W \over E\{x^2\}} \sim N\left(0,{\sigma^2 \over E\{x^2\}} \right).
\end{displaymath} (63)

where $\tilde W \sim N(0,\sigma^2 E\{x^2\})$. If the variance of $\tilde
x$ is positive we have
\begin{eqnarray*}
\mbox{var}(\tilde x) &=& E\{x^2\} - E\{x\}^2 > 0 \\
&\Longrightarrow& E\{x^2\} > E\{x\}^2.
\end{eqnarray*} (64)

This implies that the asymptotic variance of $\hat\beta^1_T$ is greater than the asymptotic variance of $\hat\beta^2_T$.

In the case where we don't know $\theta$ we can repeat the calculations given above, but the asymptotic distributions of the two estimators will depend on the unknown parameter $\theta^*$. In particular, when $\theta^* \ne 0$ the unconditional variance of $\epsilon_t$ is given by

\begin{displaymath}\mbox{var}(\epsilon_t) = E \{ \mbox{var}(\epsilon_t\vert x_t) \} = \sigma^2
E\{\vert x\vert^{\theta^*}\}.
\end{displaymath} (65)

This implies that

\begin{displaymath}\sqrt{T}(\hat\beta^1_T - \beta^*) = { { 1 \over \sqrt{T}} \sum_{t=1}^T \epsilon_t
\over {1 \over T} \sum_{t=1}^T x_t } \phantom{,}_{\Longrightarrow \atop d}
N\left(0, {\sigma^2 E\{\vert x\vert^{\theta^*}\} \over [E\{x\}]^2} \right).
\end{displaymath} (66)

since with heteroscedasticity, the random variable $\tilde Z$, the asymptotic distribution of $1/\sqrt{T} \sum_{t=1}^T \epsilon_t$, is $N(0,\sigma^2
E\{\vert x\vert^{\theta^*}\})$ instead of $N(0,\sigma^2)$. Similarly we have

\begin{displaymath}\sqrt{T}(\hat\beta^2_T - \beta^*) = { { 1 \over \sqrt{T}} \sum_{t=1}^T x_t \epsilon_t
\over {1 \over T} \sum_{t=1}^T x^2_t } \phantom{,}_{\Longrightarrow \atop d}
N\left(0, {\sigma^2 E\{ x^2 \vert x\vert^{\theta^*}\} \over [E\{x^2\}]^2} \right).
\end{displaymath} (67)

In this case, which of the two estimators $\hat\beta^1_T$ or $\hat\beta^2_T$ is more efficient depends on the value of $\theta^*$.

B.
(20%) Write down an asymptotically optimal estimator for $\beta^*$ if we know the value of $\theta^*$ a priori.

Answer: If we know $\theta^*$ we can do maximum likelihood using the conditional density of y given x given by

\begin{displaymath}f(y\vert x,\beta,\theta^*) = { 1 \over \sqrt{2 \pi} \sigma \vert x\vert^{\theta^*/2}}
\exp\left\{ - { (y - x \beta)^2 \over 2 \sigma^2 \vert x\vert^{\theta^*}}\right\}.
\end{displaymath} (68)

The maximum likelihood estimator in this case can be easily shown to be a form of weighted least squares:

 \begin{displaymath}\hat\beta_T = \mathop{\it argmin}_{\beta \in R} \sum_{t=1}^T { (y_t - x_t \beta)^2
\over \vert x_t\vert^{\theta^*}}.
\end{displaymath} (69)
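As an illustration (not part of the original solutions), the weighted least squares estimator in (69) has the closed form $\hat\beta_T = \sum_t (x_t y_t/\vert x_t\vert^{\theta^*}) / \sum_t (x_t^2/\vert x_t\vert^{\theta^*})$, sketched below on simulated data with arbitrary parameter values.

# Minimal sketch (not part of the exam) of the weighted least squares estimator in
# equation (69) when theta_star is known: the minimizer has the closed form
# beta_hat = sum(x y / |x|^theta) / sum(x^2 / |x|^theta). Data are simulated under
# arbitrary parameter values to illustrate the computation.
import numpy as np

rng = np.random.default_rng(4)
T, beta_star, sigma, theta_star = 5_000, 1.5, 1.0, 1.0

x = rng.normal(2.0, 1.0, size=T)
eps = rng.normal(0.0, sigma * np.abs(x) ** (theta_star / 2))   # var = sigma^2 |x|^theta
y = beta_star * x + eps

w = 1.0 / np.abs(x) ** theta_star            # GLS weights
beta_wls = np.sum(w * x * y) / np.sum(w * x * x)
beta_ols = np.sum(x * y) / np.sum(x * x)

print("weighted LS estimate:", beta_wls)
print("OLS estimate:        ", beta_ols)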

C.
(20%) Write down an asymptotically optimal estimator for $(\beta^*,\theta^*)$ if we don't know the value of $\theta^*$ a priori.

Answer: If we don't know $\theta^*$ a priori we can still use the likelihood function given in part B to estimate $(\beta,\theta)$ jointly. The maximum likelihood estimator for $\beta$ can also be cast as a weighted least squares estimator, but in the case where $\theta^*$ is not known we replace $\theta^*$ in formula (69) by $\theta(\beta)$, where this is the unique solution to

\begin{displaymath}\sum^T_{t=1} \log(\vert x_t\vert) = \sum_{t=1}^T { (y_t- x_t\beta)^2 \log (\vert x_t\vert)
\over \vert x_t\vert^\theta }.
\end{displaymath} (70)

The maximum likelihood estimator for $\theta$ is then given by $\theta(\hat\beta_T)$ where $\hat\beta_T$ is the weighted least squares estimator given above.

D.
(20%) Describe the feasible GLS estimator for $(\beta^*,\theta^*)$. Is the feasible GLS estimator asymptotically efficient?

Answer: The feasible GLS estimator is based on an initial inefficient estimator $\hat\beta_T$ of $\beta^*$ which is used to construct estimated residuals $\hat\epsilon_t = (y_t - x_t \hat\beta_T)$ and from these an estimator for $\theta^*$. If we could observe the true residuals we could estimate $\theta^*$ via the following nonlinear regression of $\epsilon_t^2$ on $x_t$

\begin{displaymath}\epsilon^2_t = \sigma^2 \vert x_t\vert^{\theta^*} + u_t
\end{displaymath} (71)

where $E\{u_t\vert x_t\}=0$. This suggests that it should be possible to estimate $\theta^*$ using the estimated residuals $\{\hat\epsilon_t\}$ as follows

\begin{displaymath}\hat\theta_T = \mathop{\it argmin}_{\theta \in R, \sigma^2 > 0}
\sum_{t=1}^T (\hat\epsilon_t^2 - \sigma^2 \vert x_t\vert^\theta )^2.
\end{displaymath} (72)

It can be shown that if the initial estimator $\hat\beta_T$ is $\sqrt{T}$-consistent, then the nonlinear least squares estimator for $\theta^*$ given above will also be $\sqrt{T}$-consistent, and that the following three step, feasible GLS estimator for $\beta^*$ will be asymptotically efficient:

\begin{displaymath}\hat\beta^{\mbox{f}}_T = \mathop{\it argmin}_{\beta \in R}
\sum_{t=1}^T { (y_t - x_t \beta)^2 \over \vert x_t\vert^{\hat\theta_T}}.
\end{displaymath} (73)
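A minimal sketch of the three-step feasible GLS procedure just described (not part of the original solutions) is given below; it uses scipy.optimize.minimize for the nonlinear least squares step (72), and all parameter values are arbitrary.

# Minimal sketch (not part of the exam) of the three-step feasible GLS procedure
# described above: (1) OLS for a preliminary beta, (2) nonlinear least squares of the
# squared residuals on sigma^2 |x|^theta to estimate (sigma^2, theta) as in (72),
# (3) weighted least squares with weights |x|^{-theta_hat} as in (73).
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(5)
T, beta_star, sigma_star, theta_star = 10_000, 1.5, 0.8, 1.2
x = rng.normal(2.0, 1.0, size=T)
y = beta_star * x + rng.normal(0.0, sigma_star * np.abs(x) ** (theta_star / 2))

# Step 1: preliminary (inefficient but consistent) OLS estimate
beta_ols = np.sum(x * y) / np.sum(x * x)
resid2 = (y - beta_ols * x) ** 2

# Step 2: nonlinear least squares for (sigma^2, theta)
def nls_objective(params):
    sigma2, theta = params
    return np.sum((resid2 - sigma2 * np.abs(x) ** theta) ** 2)

sigma2_hat, theta_hat = minimize(nls_objective, x0=[1.0, 0.0], method="Nelder-Mead").x

# Step 3: weighted least squares using the estimated theta
w = 1.0 / np.abs(x) ** theta_hat
beta_fgls = np.sum(w * x * y) / np.sum(w * x * x)

print("OLS:", beta_ols, " feasible GLS:", beta_fgls, " theta_hat:", theta_hat)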

E.
(20%) How would your answers to parts A to D change if you didn't know the distribution of $\epsilon_t$ was normal?

Answer: The answer to part A is unchanged. However, if we don't know the form of the conditional distribution of $\epsilon_t$ given $x_t$, we can't write down a likelihood function that will determine the asymptotically optimal estimator for $\beta^*$, regardless of whether we know $\theta^*$ or not. Thus, there is no immediate answer to parts B and C. In part D we can still use the same feasible GLS estimator, and while it is possible to show that this is asymptotically efficient relative to OLS, it is not clear that it is asymptotically optimal. There is a possibility of doing adaptive estimation, i.e. of using a first stage inefficient estimator of $\beta^*$ to construct estimated residuals $\hat\epsilon_t$ and then using these estimated residuals to estimate the conditional density $f(\epsilon\vert x)$ non-parametrically. Using this nonparametric distribution we could then do maximum likelihood. Unfortunately the known results for this sort of adaptive estimation procedure require that the error term $\epsilon_t$ be independent of $x_t$. However, if there is heteroscedasticity, then $\epsilon_t$ will not be independent of $x_t$ and adaptive estimation may not be possible. In this case the most efficient possible estimator can be ascertained by deriving the semi-parametric efficiency bound for the parameter of interest $\beta$, where the conditional density $f(\epsilon\vert x)$ is treated as a non-parametric ``nuisance parameter''. However, this goes far beyond what I expected students to write in the answer to this exam.

Part III (60 minutes, 55 points). Do 1 out of the 4 questions below.

QUESTION 1 (Hypothesis testing) Consider the GMM estimator with IID data, i.e. the observations $\{y_i,x_i\}$ are independent and identically distributed, using the moment condition $H(\theta) = E\{h(\tilde y,\tilde x,\theta)\}$, where h is a $J \times 1$ vector of moment conditions and $\theta$ is a $K \times 1$ vector of parameters to be estimated. Assume that the moment conditions are correctly specified, i.e. assume there is a unique $\theta^*$ such that $H(\theta^*)=0$. Show that in the overidentified case (J > K) the minimized value of the GMM criterion function is asymptotically $\chi^2$ with J-K degrees of freedom:

 \begin{displaymath}N H_N(\hat\theta_N)' [\hat\Omega_N]^{-1}
H_N(\hat\theta_N)_{\Longrightarrow \atop d} \chi^2(J-K),
\end{displaymath} (74)

where HN is a $J \times 1$ vector of moment conditions, $\theta$ is a $K \times 1$ vector of parameters, $\chi^2(J-K)$ is a Chi-squared random variable with J-K degrees of freedom,

\begin{displaymath}\hat\theta_N = \mathop{\it argmin}_{\theta\in\Theta} H_N(\theta)' [\hat\Omega_N]^{-1} H_N(\theta), \end{displaymath}


\begin{displaymath}H_N(\theta) = {1 \over N} \sum_{i=1}^N h(y_i,x_i,\theta), \end{displaymath}

and $\hat\Omega_N$ is a consistent estimator of $\Omega$ given by

\begin{displaymath}\Omega = E\{ h(\tilde y,\tilde x,\theta^*)h(\tilde y,\tilde x,\theta^*)'\}. \end{displaymath}

Hint: Use Taylor series expansions to provide a formula for $\sqrt N (\hat\theta_N - \theta^*) $ from the first order condition for $\hat\theta_N$

 \begin{displaymath}\nabla H_N(\hat\theta_N)' \hat\Omega^{-1}_N H_N(\hat\theta_N) = 0
\end{displaymath} (75)

and a Taylor series expansion of $H_N(\hat\theta_N)$ about $\theta^*$

 \begin{displaymath}H_N(\hat\theta_N) = H_N(\theta^*) + \nabla H_N(\tilde
\theta_N)(\hat\theta_N-\theta^*)
\end{displaymath} (76)

where

\begin{displaymath}\nabla H_N(\theta) \equiv {1 \over N} \sum_{i=1}^N {\partial h \over
\partial \theta}(y_i,x_i,\theta)\end{displaymath} (77)

is the $(J \times K)$ matrix of partial derivatives of the moment conditions $H_N(\theta)$ with respect to $\theta$, and $\tilde \theta_N$ is a vector each of whose elements lies on the line segment joining the corresponding components of $\hat\theta_N$ and $\theta^*$. Use the above two equations to derive the following formula for $H_N(\hat\theta_N)$

 \begin{displaymath}H_N(\hat\theta_N) = M_N H_N(\theta^*)
\end{displaymath} (78)

where

\begin{displaymath}M_N= \left[ I - \nabla H_N(\hat\theta_N)[\nabla H_N(\hat\theta_N)'
\hat\Omega^{-1}_N \nabla H_N(\hat\theta_N)]^{-1} \nabla
H_N(\hat\theta_N)' \hat\Omega^{-1}_N \right].
\end{displaymath} (79)

Show that with probability 1 we have $M_N \to M$ where M is a $(J \times J)$ idempotent matrix. Then use this result and the Central Limit Theorem to show that

\begin{displaymath}\sqrt{N} H_N(\theta^*)_{\Longrightarrow \atop d}
N(0,\Omega), \end{displaymath} (80)

and, using the probability result from Question 1 of Part II, show that the minimized value of the GMM criterion function does indeed converge in distribution to a $\chi^2(J-K)$ random variable as claimed in equation (74).

ANSWER: The hint provides most of the answer. Plugging the Taylor series expansion for $H_N(\hat\theta_N)$ given in equation (76) into the GMM first order condition given in equation (75) and solving for $(\hat\theta_N-\theta^*)$ we obtain

\begin{displaymath}\hat\theta_N - \theta^* = -\left[ \nabla
H_N(\hat\theta_N)'\hat\Omega^{-1}_N \nabla H_N(\hat\theta_N)\right]^{-1} \nabla H_N(\hat\theta_N)' \hat\Omega^{-1}_N
H_N(\theta^*).
\end{displaymath} (81)

Substituting the above expression for $\hat\theta_N - \theta^*$ back into the Taylor series expansion for $H_N(\hat\theta_N)$ in equation (76), we obtain the representation for $H_N(\hat\theta_N)$ given in equations (78) and (79). Now we can write the optimized value of the GMM objective function as
 
\begin{eqnarray*}
H_N(\hat\theta_N)' \hat\Omega^{-1}_N H_N(\hat\theta_N) &=& H_N(\theta^*)' M_N' \hat\Omega_N^{-1} M_N H_N(\theta^*) \\
&=& H_N(\theta^*)' \Omega^{-1/2} \Omega^{1/2} M_N' \hat\Omega^{-1/2}_N
\hat\Omega^{-1/2}_N M_N \Omega^{1/2} \Omega^{-1/2} H_N(\theta^*)
\end{eqnarray*} (82)

Now, since $\Omega=E\{h(\tilde y,\tilde x,\theta^*)h(\tilde y,\tilde
x,\theta^*)'\}$, it follows from the Central Limit Theorem that

\begin{displaymath}\sqrt N H_N(\theta^*)\phantom{\,}_{\Longrightarrow \atop d} N(0,\Omega).
\end{displaymath} (83)

so that

\begin{displaymath}\sqrt N \Omega^{-1/2} H_N(\theta^*)\phantom{\,}_{\Longrightarrow \atop
d} N(0,I),
\end{displaymath} (84)

where I is the $J \times J$ identity matrix. Now consider the matrix in the middle of the expansion of the quadratic form in equation (82). We have

\begin{displaymath}\hat\Omega^{-1/2}_N M_N \Omega^{1/2} \phantom{\,}_{\longrightarrow
\atop p} \Omega^{-1/2} M \Omega^{1/2} \equiv Q,
\end{displaymath} (85)

where

\begin{displaymath}Q = \left[I - \Omega^{-1/2} \nabla H(\theta^*)[\nabla H(\theta^*)'\Omega^{-1}\nabla
H(\theta^*)]^{-1} \nabla H(\theta^*)' \Omega^{-1/2}\right],
\end{displaymath} (86)

and where

\begin{displaymath}M = \left[I - \nabla H(\theta^*)[\nabla H(\theta^*)'\Omega^{-1}\nabla
H(\theta^*)]^{-1} \nabla H(\theta^*)' \Omega^{-1}\right],
\end{displaymath} (87)

and where $\nabla H(\theta^*) = E\{ \partial h(\tilde y,\tilde x,\theta^*)/\partial \theta'\}$. It is straightforward to verify that the matrix Q in equation (86) is symmetric and idempotent. Thus, we have

\begin{displaymath}N H_N(\hat\theta_N)' \hat\Omega^{-1}_N
H_N(\hat\theta_N)\phantom{,}_{\Longrightarrow \atop d} [Q \tilde Z]' [Q
\tilde Z] = \tilde Z' Q \tilde Z,
\end{displaymath} (88)

where $\tilde Z \sim N(0,I)$. By the probability result in Question 1 of Part II, it follows that $\tilde Z' Q \tilde Z \sim
\chi^2(\mbox{rank}(Q))$. However we have $\mbox{rank}(Q) = \mbox{rank}(M)$, and $\mbox{rank}(M) \le J - K$ due to the fact that

\begin{displaymath}M \nabla H(\theta^*) = 0, \end{displaymath} (89)

where 0 denotes a $J \times K$ matrix of zeros, as can be verified by multiplying equation (87) on the right by $\nabla H(\theta^*)$. However, since Q = I - R where R is given by

\begin{displaymath}R= \Omega^{-1/2} \nabla H(\theta^*)[\nabla H(\theta^*)'\Omega^{-1}\nabla
H(\theta^*)]^{-1} \nabla H(\theta^*)' \Omega^{-1/2}
\end{displaymath} (90)

and $\mbox{rank}(R) \le K$, it follows that $\mbox{rank}(Q) \ge J- K$. Combining these two inequalities we have $\mbox{rank}(Q) = J-K$, and we conclude that we have established the result that

\begin{displaymath}N H_N(\hat\theta_N)' \hat\Omega^{-1}_N
H_N(\hat\theta_N)\phantom{,}_{\Longrightarrow \atop d} \chi^2(J-K).
\end{displaymath} (91)

QUESTION 2 (Consistency of Bayesian posterior) Consider a Bayesian who observes IID data $(X_1,\ldots,X_N)$, where $f(x\vert\theta)$ is the likelihood for a single observation and $p(\theta)$ is the prior density over an unknown finite-dimensional parameter $\theta \in R^K$.

A.
(10%) Use Bayes Rule to derive a formula for the posterior density of $\theta$ given $(X_1,\ldots,X_N)$.

Answer: The posterior is given by

\begin{displaymath}f(\theta\vert X_1,\ldots,X_N) =
{ \prod_{i=1}^N f(X_i\vert\theta)p(\theta) \over \int \prod_{i=1}^N
f(X_i\vert\theta)p(\theta)d\theta}. \end{displaymath} (92)

B.
(20%) Let $P(\theta \in A\vert X_1,\ldots,X_N)$ be the posterior probability that $\theta$ is in some set $A \subset \Theta$ given the first N observations. Show that this posterior probability satisfies the Law of Iterated Expectations:

\begin{displaymath}E\left\{ P(\theta \in A\vert X_1,\ldots,X_{N+1})\big\vert X_1,\ldots,X_N\right\}
= P(\theta \in A\vert X_1,\ldots,X_N). \end{displaymath}

Answer: The formula for the posterior probability that $\theta \in A$ given $(X_1,\ldots,X_N)$ is just the expectation of the indicator function $I\{\theta \in A\}$ with respect to the posterior density for $\theta$ given above. That is,

\begin{displaymath}P(\theta \in A\vert X_1,\ldots, X_N) = { \int I\{\theta \in A\}
\prod_{i=1}^N f(X_i\vert\theta)p(\theta)d\theta \over
\int \prod_{i=1}^N f(X_i\vert\theta)p(\theta)d\theta}. \end{displaymath} (93)

Similarly, we have

\begin{displaymath}P(\theta \in A\vert X_1,\ldots, X_N,X_{N+1}) = { \int I\{\theta \in A\}
\prod_{i=1}^{N+1} f(X_i\vert\theta)p(\theta)d\theta \over
\int \prod_{i=1}^{N+1} f(X_i\vert\theta)p(\theta)d\theta}. \end{displaymath} (94)

Now, to compute the conditional expectation $ E\left\{ P(\theta \in
A\vert X_1,\ldots,X_{N+1})\big\vert X_1,\ldots,X_N\right\}$ we note that the appropriate density to use is our posterior belief about XN+1 given $(X_1,\ldots,X_N)$. This conditional density can be derived using the posterior for $\theta$
 
\begin{eqnarray*}
f(X_{N+1}\vert X_1,\ldots,X_N) &=& \int f(X_{N+1}\vert\theta)f(\theta\vert X_1,\ldots,X_N)d\theta \\
&=& { \int \prod_{i=1}^{N+1} f(X_i\vert\theta)p(\theta)d\theta \over
\int \prod_{i=1}^N f(X_i\vert\theta)p(\theta)d\theta }.
\end{eqnarray*} (95)

Thus, $ E\left\{ P(\theta \in
A\vert X_1,\ldots,X_{N+1})\big\vert X_1,\ldots,X_N\right\}$ is given by

\begin{displaymath}\int_{X_{N+1}} { \int_\theta I\{ \theta \in A\} \prod_{i=1}^{N+1}
f(X_i\vert\theta)p(\theta)d\theta \over \int \prod_{i=1}^{N+1}
f(X_i\vert\theta)p(\theta)d\theta}\,
f(X_{N+1}\vert X_1,\ldots,X_N)\,dX_{N+1}.
\end{displaymath} (96)

Using the formula for $f(X_{N+1}\vert X_1,\ldots,X_N)$ given in equation (95) we get
\begin{eqnarray*}
\lefteqn{ \int_{X_{N+1}} { \int_\theta I\{ \theta \in A\} \prod_{i=1}^{N+1}
f(X_i\vert\theta)p(\theta)d\theta \over \int \prod_{i=1}^{N+1}
f(X_i\vert\theta)p(\theta)d\theta}\, f(X_{N+1}\vert X_1,\ldots,X_N)\,dX_{N+1} } \\
&=& \int_{X_{N+1}} { \int_\theta I\{ \theta \in A\} \prod_{i=1}^{N+1}
f(X_i\vert\theta)p(\theta)d\theta \over \int \prod_{i=1}^N
f(X_i\vert\theta)p(\theta)d\theta }\,dX_{N+1} \\
&=& \int_{X_{N+1}} { \int_\theta I\{ \theta \in A\} f(X_{N+1}\vert\theta)
\prod_{i=1}^{N} f(X_i\vert\theta)p(\theta)d\theta \over \int \prod_{i=1}^N
f(X_i\vert\theta)p(\theta)d\theta }\,dX_{N+1} \\
&=& \int_\theta { \left[ \int_{X_{N+1}} f(X_{N+1}\vert\theta)dX_{N+1} \right]
I\{\theta \in A\} \prod_{i=1}^{N} f(X_i\vert\theta)p(\theta)d\theta \over
\int \prod_{i=1}^N f(X_i\vert\theta)p(\theta)d\theta } \\
&=& \int_\theta { I\{ \theta \in A\} \prod_{i=1}^{N}
f(X_i\vert\theta)p(\theta)d\theta \over
\int \prod_{i=1}^N f(X_i\vert\theta)p(\theta)d\theta } \\
&=& P(\theta \in A\vert X_1,\ldots,X_N).
\end{eqnarray*} (97)

C.
(20%) A martingale is a stochastic process $\{\tilde Z_t\}$ that satisfies $E\left\{\tilde Z_{t+1}\vert{\cal I}_t\right\}=\tilde Z_t$, where ${\cal I}_t$ denotes the information set at time t and includes knowledge of all past $Z_t$'s up to time t, ${\cal I}_t \supset (\tilde Z_1,\ldots,\tilde Z_t)$. Use the result in part B to show that the process $\{\tilde Z_t\}$ where $\tilde Z_t = P(\theta \in A\vert X_1,\ldots,X_t)$ is a martingale. (We are interested in martingales because the Martingale Convergence Theorem can be used to show that if $\theta$ is finite-dimensional, then the posterior distribution converges with probability 1 to a point mass on the true value of $\theta$ generating the observations $\{X_i\}$. But you don't have to know anything about this to answer this question.)

The Law of Iterated Expectations argument above is the proof that the $\{Z_t\}$ process, $Z_t \equiv P(\theta \in A\vert X_1,\ldots,X_t)$, is a martingale. That is, if we let ${\cal I}_t = (X_1,\ldots,X_t)$, then we have

\begin{displaymath}E\{Z_{t+1}\vert{\cal I}_t\} =
E\{ P(\theta \in A\vert X_1,\ldots,X_{t+1}) \vert X_1,\ldots,X_t\}.
\end{displaymath} (98)

The Law of Iterated Expectations result above establishes that

\begin{displaymath}E\{ P(\theta \in A\vert X_1,\ldots,X_{t+1}) \vert X_1,\ldots,X_t\} =
P(\theta \in A\vert X_1,\ldots,X_t),
\end{displaymath} (99)

from which we conclude that the posterior probability process is a martingale.

D.
(50%) Suppose that $\theta$ is restricted to the K-dimensional simplex, $\theta=(\theta_1,\ldots,\theta_K)$ with $\theta_i\in(0,1)$, $i=1,\ldots,K$, $1=\sum_{i=1}^K \theta_i$, and that the distribution of $X_i$ given $\theta$ is multinomial with parameter $\theta$, i.e.

\begin{displaymath}Pr\{X_i = k\} = \theta_k, \quad k=1,\ldots,K.\end{displaymath}

Suppose the prior distribution over $\theta$, $p(\theta)$ is Dirichlet with parameter $\alpha$:

\begin{displaymath}p(\theta) = { \Gamma(\alpha_1+\cdots + \alpha_K) \over
\Gamma(\alpha_1) \cdots \Gamma(\alpha_K) } \theta_1^{\alpha_1-1} \cdots
\theta_K^{\alpha_K-1} \end{displaymath}

where both $\theta_i > 0$ and $\alpha_i > 0$, $i=1,\ldots,K$. Compute the posterior distribution and show 1) that the posterior is also Dirichlet (i.e. the Dirichlet is a conjugate family), and 2) directly that as $N \to \infty$ the posterior distribution converges to a point mass on the true parameter $\theta$ generating the data.

Answer: The Dirichlet-Multinomial combination is a conjugate family of distributions. That is, if the prior distribution is Dirichlet with prior hyperparameters $(\alpha_1,\ldots,\alpha_K)$ and the data are generated by a multinomial with K mutually exclusive outcomes, then the posterior distribution after observing N IID draws from the multinomial is also Dirichlet with parameter $(\alpha_1 + n_1,\ldots,\alpha_K+n_K)$ where

\begin{displaymath}n_k = \sum_{i=1}^N I\{X_i = k\} \end{displaymath} (100)

By the Law of Large Numbers we have that

\begin{displaymath}{ n_k \over N} = { 1\over N} \sum_{i=1}^N I\{X_i = k\}
\,_{\longrightarrow \atop p} E\{ I\{X_i=k\}\} = \theta^*_k.
\end{displaymath} (101)

We prove the consistency of the posterior by showing that for any $\theta \ne \theta^*$ we have with probability 1

\begin{displaymath}\lim_{N \to \infty} \log\left({ p(\theta^*\vert X_1,\ldots,X_N) \over
p(\theta\vert X_1,\ldots,X_N)} \right) = \infty.
\end{displaymath} (102)

This implies that the limiting posterior puts infinitely more weight on the event that $\theta=\theta^*$ than on any other possible value for $\theta$. Dividing by N and taking limits we have

\begin{displaymath}\lim_{N \to \infty} { 1\over N} \log\left({ p(\theta^*\vert X_1,\ldots,X_N) \over
p(\theta\vert X_1,\ldots,X_N)} \right) = \sum_{k=1}^K
\theta^*_k \left[ \log(\theta^*_k) - \log(\theta_k)\right].
\end{displaymath} (103)

However by the Information Inequality we have

\begin{displaymath}\sum_{k=1}^K
\theta^*_k \left[ \log(\theta^*_k) - \log(\theta_k)\right] > 0.
\end{displaymath} (104)

This result implies that with probability 1

\begin{displaymath}\lim_{N \to \infty} \log\left({ p(\theta^*\vert X_1,\ldots,X_...
...a\vert X_1,\ldots,X_N)} \right)\right] \longrightarrow \infty, \end{displaymath} (105)

since the term in brackets converges with probability 1 to a strictly positive quantity.

Another way to see the result is to note that if the $K \times 1$ vector $\tilde \theta$ has a Dirichlet distribution with parameter $(\alpha_1,\ldots,\alpha_K)$ then

\begin{displaymath}E\{\tilde \theta_j\} = { \alpha_j \over \sum_{k=1}^K \alpha_k},
\end{displaymath} (106)

and

\begin{displaymath}\mbox{var}(\tilde \theta_j) = { \alpha_j \left(\sum_{k=1}^K \alpha_k -
\alpha_j\right) \over \left(\sum_{k=1}^K \alpha_k\right)^2 \left( \sum_{k=1}^K
\alpha_k +1\right)}.
\end{displaymath} (107)

Since the posterior distribution is Dirichlet with parameter $(\alpha_1 + n_1,\ldots,\alpha_K+n_K)$, we can divide the numerator and denominator of the expression for $E\{\theta_j\vert X_1,\ldots,X_N\}$ by N and use the Law of Large Numbers to show that in the limit with probability 1 we have

\begin{displaymath}E\{\tilde \theta_j\vert X_1,\ldots,X_N\} = { \alpha_j + n_j \over
\sum_{k=1}^K (\alpha_k + n_k)} \,_{\longrightarrow \atop p} { \theta^*_j \over \sum_{k=1}^K \theta^*_k} =
\theta^*_j.
\end{displaymath} (108)

Via a similar calculation, we can show that the conditional variance $\mbox{var}(\tilde \theta_j\vert X_1,\ldots,X_N)$ converges to zero since we have
\begin{eqnarray*}
\mbox{var}(\tilde \theta_j\vert X_1,\ldots,X_N) & = & { (\alpha_j+n_j) \left(\sum_{k=1}^K (\alpha_k+n_k) -
(\alpha_j+n_j)\right) \over \left(\sum_{k=1}^K (\alpha_k+n_k)\right)^2 \left(
\sum_{k=1}^K (\alpha_k+n_k) +1 \right)} \\
& = & { (\alpha_j/N+n_j/N) \left(\sum_{k=1}^K (\alpha_k/N+n_k/N) -
(\alpha_j/N+n_j/N)\right) \over \left(\sum_{k=1}^K
(\alpha_k/N+n_k/N)\right)^2 \left(
\sum_{k=1}^K (\alpha_k+n_k) +1 \right)}
\end{eqnarray*} (109)

and the numerator of the latter expression converges with probability 1 to $\theta^*_j (1-\theta^*_j)$ but the denominator converges to $+\infty$ with probability 1.
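The following small simulation (my sketch, not part of the original exam) illustrates both parts of the answer: the conjugate update $\alpha_k \mapsto \alpha_k + n_k$ and the collapse of the posterior mean (106) and variance (107) onto the true parameter as N grows. The particular $\theta^*$, the prior hyperparameters, and the use of numpy are illustrative assumptions.

import numpy as np

# Illustrative simulation of the Dirichlet-multinomial conjugate update and of
# posterior concentration; theta_star and alpha are arbitrary choices.
rng = np.random.default_rng(0)
theta_star = np.array([0.2, 0.5, 0.3])          # true multinomial parameter
alpha = np.array([1.0, 1.0, 1.0])               # Dirichlet prior hyperparameters

for N in [10, 100, 10_000]:
    counts = rng.multinomial(N, theta_star)     # n_k = sum_i I{X_i = k}
    alpha_post = alpha + counts                 # posterior is Dirichlet(alpha + n)
    a0 = alpha_post.sum()
    post_mean = alpha_post / a0                 # posterior version of equation (106)
    post_var = alpha_post * (a0 - alpha_post) / (a0**2 * (a0 + 1))  # equation (107)
    print(N, post_mean.round(3), post_var.max())  # mean -> theta_star, variance -> 0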

QUESTION 3 Consider the random utility model:

\begin{displaymath}\tilde u_d = v_d + \tilde \epsilon_d, \quad d=1,\ldots,D \end{displaymath} (110)

where $\tilde u_d$ is a decision-maker's payoff or utility for selecting alternative d from a set containing D possible alternatives (we assume that the individual only chooses one item). The term vd is known as the deterministic or strict utility from alternative d and the error term $\tilde \epsilon_d$ is the random component of utility. In empirical applications vd is often specified as

\begin{displaymath}v_d = X_d\beta \end{displaymath} (111)

where Xd is a vector of observed covariates and $\beta$ is a vector of coefficients, to be estimated, that determine the agent's utility. The interpretation is that Xd represents a vector of characteristics of the decision-maker and alternative d that are observable by the econometrician, and $\epsilon_d$ represents characteristics of the agent and alternative d that affect the utility of choosing alternative d but are unobserved by the econometrician. Define the agent's decision rule $\delta(\epsilon_1,\ldots,\epsilon_D)$ by:

\begin{displaymath}\delta(\epsilon) = \mbox{\it argmax\/}_{d=1,\ldots,D} \left[ v_d +
\tilde \epsilon_d\right] \end{displaymath} (112)

i.e. $\delta(\epsilon)$ is the optimal choice for an agent whose unobserved utility components are $\epsilon=(\epsilon_1,\ldots,\epsilon_D)$. Then the agent's choice probability $P\{d\vert X\}$ is given by:

\begin{displaymath}P\left\{ d \vert X\right\} = \int I\{ d = \delta(\epsilon)\}
f(\epsilon\vert X)d\epsilon \end{displaymath} (113)

where $X=(X_1,\ldots,X_D)$ is the vector of observed characteristics of the agent and the D alternatives, $f(\epsilon\vert X)$ is the conditional density function of the random components of utility given the values of observed components X, and $I\{\delta(\epsilon)=d\}$ is the indicator function given by $I\{\delta(\epsilon)=d\} =1$ if $\delta(\epsilon)=d$ and 0 otherwise. Note that the integral above is actually a multivariate integral over the D components of $\epsilon=(\epsilon_1,\ldots,\epsilon_D)$, and simply represents the probability that the values of the vector of unobserved utilities $\epsilon$ lead the agent to choose alternative d.

Definition: The Social Surplus Function $U(v_1,\ldots,v_D,X)$ is given by:

\begin{displaymath}U(v_1,\ldots,v_D,X) = E\left\{ \max_{d=1,\ldots,D}[ v_d + \tilde\epsilon_d] \vert X \right\}
= \int_{\epsilon_1}\cdots\int_{\epsilon_D} \max_{d=1,\ldots,D}[ v_d + \epsilon_d]
f(\epsilon_1,\ldots,\epsilon_D\vert X)d\epsilon_1 \cdots
d\epsilon_D \end{displaymath} (114)

The Social Surplus function is the expected maximized utility of the agent.

A.
(50%) Prove the Williams-Daly-Zachary Theorem:

\begin{displaymath}{\partial U \over \partial v_d}(v_1,\ldots,v_D,X) = P\{d \vert X\} \end{displaymath} (115)

and discuss its relationship to Roy's Identity.

Hint: Interchange the differentiation and expectation operations when computing $\partial U/\partial v_d$:

\begin{eqnarray*}{\partial U \over \partial v_d}(v_1,\ldots,v_D,X) & =
& {\partial \over \partial v_d} \int_{\epsilon_1}\cdots\int_{\epsilon_D} \max_{d'=1,\ldots,D}[v_{d'} +
\epsilon_{d'}]f(\epsilon_1,\ldots,\epsilon_D\vert X)d\epsilon_1 \cdots
d\epsilon_D \end{eqnarray*}


and show that

\begin{displaymath}{\partial \over \partial v_d} \max_{d=1,\ldots,D}[v_d +
\epsilon_d] = I\{d = \delta(\epsilon)\}. \end{displaymath}

Answer: The hint gives away most of the answer. We simply appeal to the Lebesgue Dominated Convergence Theorem to justify the interchange of integration and differentiation operators. As long as the distribution of the $\{\epsilon_d\}$'s has a density, the derivative

 \begin{displaymath}\partial/\partial v_d \max_{d=1,\ldots,D}[v_d +
\epsilon_d] = I\{d = \delta(\epsilon)\}
\end{displaymath} (116)

exists almost everywhere with respect to this density and is bounded by 1, so that the Lebesgue Dominated Convergence Theorem applies. It is easy to see why the partial derivative of $\max_{d=1,\ldots,D}[v_d + \epsilon_d]$ equals the indicator function $I\{d = \delta(\epsilon)\}$: if this indicator equals 1 then alternative d yields the highest utility and we have

\begin{displaymath}v_d + \epsilon_d > v_{d'} + \epsilon_{d'} \quad \forall d' \ne d \end{displaymath}

Thus, $v_d + \epsilon_d = \max_{d'=1,\ldots,D}[v_{d'} +
\epsilon_{d'}]$ and we have $\partial /\partial v_d
\max_{d=1,\ldots,D}[v_d +
\epsilon_d] =1 $ when $I\{d = \delta(\epsilon)\}=1$. However when $I\{d = \delta(\epsilon)\}=0$, then alternative d is not the utility maximizing choice, so that $\max_{d'=1,\ldots,D}[v_{d'} +
\epsilon_{d'}] > v_d + \epsilon_d$. It follows that we have $\partial /\partial v_d
\max_{d=1,\ldots,D}[v_d +
\epsilon_d] =0 $ when $I\{d = \delta(\epsilon)\}=0$ so that the identity claimed in (116) holds with probability 1, and so via the Lebesgue Dominated Convergence Theorem we have
\begin{eqnarray*}
{\partial U \over \partial v_d}(v_1,\ldots,v_D,X) & = & { \partial \over \partial v_d}
\int_{\epsilon_1}\cdots\int_{\epsilon_D} \max_{d'=1,\ldots,D}[v_{d'} +
\epsilon_{d'}]f(\epsilon_1,\ldots,\epsilon_D\vert X)d\epsilon_1 \cdots d\epsilon_D \\
& = & \int_{\epsilon_1}\cdots\int_{\epsilon_D}
{ \partial \over \partial v_d} \max_{d'=1,\ldots,D}[v_{d'} +
\epsilon_{d'}]f(\epsilon_1,\ldots,\epsilon_D\vert X)d\epsilon_1 \cdots d\epsilon_D \\
& = & \int_{\epsilon_1}\cdots\int_{\epsilon_D}
I\{d = \delta(\epsilon)\}f(\epsilon_1,\ldots,\epsilon_D\vert X)d\epsilon_1 \cdots d\epsilon_D \\
& = & P\{d\vert X\}.
\end{eqnarray*} (117)
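A Monte Carlo check of the Williams-Daly-Zachary identity may help fix ideas; the sketch below (my illustration, not part of the original solution) compares a finite-difference derivative of a simulated Social Surplus function with a simulated choice frequency. The utilities v, the standard normal error distribution, and the step size h are illustrative assumptions; the identity itself holds for any error density.

import numpy as np

# Monte Carlo check of dU/dv_d = P{d|X}: compare a central finite difference of
# the simulated Social Surplus with a simulated choice frequency, using common
# random draws. v and the normal errors are arbitrary illustrative choices.
rng = np.random.default_rng(1)
v = np.array([0.5, 0.0, -0.3])                   # deterministic utilities v_d
eps = rng.standard_normal((2_000_000, 3))        # unobserved utility components
d, h = 0, 1e-3                                   # perturbed alternative and step size

def surplus(v):
    # U(v) = E max_d [v_d + eps_d], approximated with common random draws
    return np.max(v + eps, axis=1).mean()

v_plus, v_minus = v.copy(), v.copy()
v_plus[d] += h
v_minus[d] -= h
dU_dv = (surplus(v_plus) - surplus(v_minus)) / (2 * h)   # finite-difference dU/dv_d
P_d = (np.argmax(v + eps, axis=1) == d).mean()           # simulated P{d|X}
print(dU_dv, P_d)                                        # the two should be close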

B.
(50%) Consider the special case of the random utility model when $\epsilon=(\epsilon_1,\ldots,\epsilon_D)$has a multivariate (Type I) extreme value distribution:

 \begin{displaymath}f(\epsilon\vert X) = \prod_{d=1}^D \exp\{-\epsilon_d\}\exp\left\{-\exp\{-\epsilon_d\}\right\}.
\end{displaymath} (118)

Show that the conditional choice probability $P\{d\vert X\}$ is given by the multinomial logit formula:

\begin{displaymath}P\{d\vert X\} = { \exp\{ v_d/\sigma\} \over \sum_{d'=1}^D \exp\{
v_{d'}/\sigma\} }.
\end{displaymath} (119)

Hint 1: Use the Williams-Daly-Zachary Theorem, showing that in the case of the extreme value distribution (118) the Social Surplus function is given by

 \begin{displaymath}U(v_1,\ldots,v_D,X)= \sigma\gamma+ \sigma\log\left[
\sum_{d=1}^D \exp\{ v_d/\sigma\} \right].
\end{displaymath} (120)

where $\gamma = .577216 \ldots$ is Euler's constant.

Hint 2: To derive equation (120) show that the extreme value family is max-stable: i.e. if $(\epsilon_1,\ldots,\epsilon_D)$ are IID extreme value random variables, then $\max_d \{\epsilon_d\}$ also has an extreme value distribution. Also use the fact that the expectation of a single extreme value random variable with location parameter $\alpha$ and scale parameter $\sigma$ is given by:

 \begin{displaymath}E\{\tilde \epsilon\} = \int_{-\infty}^{+\infty} {\epsilon \over \sigma}
\exp\left\{ - {(\epsilon - \alpha) \over \sigma} \right\}
\exp\left\{-\exp\left\{ - {(\epsilon - \alpha) \over \sigma} \right\}\right\}d\epsilon =
\alpha + \sigma\gamma,
\end{displaymath} (121)

and the CDF is given by

 \begin{displaymath}F(x\vert\alpha,\sigma) = P\{\tilde\epsilon \le x\vert\alpha,\sigma\} =
\exp\left\{ - \exp\left\{ {-(x-\alpha) \over \sigma}\right\}\right\}.
\end{displaymath} (122)

Hint 3: Let $(\epsilon_1,\ldots,\epsilon_D)$ be INID (independent, non-identically distributed) extreme value random variables with location parameters $(\alpha_1,\ldots,\alpha_D)$ and common scale parameter $\sigma$. Show that this family is max-stable by proving that $\max(\epsilon_1,\ldots,\epsilon_D)$ is an extreme value random variable with scale parameter $\sigma$ and location parameter

 \begin{displaymath}\alpha = \sigma \log\left[ \sum_{d=1}^D \exp\{ \alpha_d/\sigma\}
\right]
\end{displaymath} (123)

Answer: Once again, the hints are virtually the entire answer to the problem. By hint 1, if the Social Surplus function is given by equation (120) then by the Williams-Daly-Zachary Theorem we have

\begin{displaymath}P\{d\vert X\} = { \partial \over \partial v_d} \left[ \sigma\gamma + \sigma\log\left[
\sum_{d'=1}^D \exp\{ v_{d'}/\sigma\} \right] \right] = { \exp\{ v_d/\sigma\} \over \sum_{d'=1}^D \exp\{ v_{d'}/\sigma\} }.
\end{displaymath} (124)

Now to show that the Social Surplus function has the form given in equation (120), we use the fact that if the $\{\epsilon_d\}$ are independent random variables, we have the following formula for the probability distribution of the random variable $\max_{d=1,\ldots,D}[v_d + \epsilon_d]$:

\begin{displaymath}\mbox{Pr}\left\{ \max_{d=1,\ldots,D}[v_d + \epsilon_d] \le x\right\} =
\prod_{d=1}^D \mbox{Pr}\left\{ v_d + \epsilon_d \le x\right\}.
\end{displaymath} (125)

Now, let $\epsilon_d$ have a Type I extreme value distribution with location parameter $\alpha_d=0$ and scale parameter $\sigma > 0$. Then it is easy to see that $v_d + \epsilon_d$ is also a Type I extreme value random variate with location parameter vd and scale parameter $\sigma$. That is, the family of independent Type I extreme value distributions is max-stable. Plugging the formula for the Type I extreme value distribution from equation (122) into the formula for the CDF of $\max_{d=1,\ldots,D}[v_d + \epsilon_d]$ given above, we find that

\begin{displaymath}\mbox{Pr}\left\{ \max_{d=1,\ldots,D}[v_d + \epsilon_d] \le x\right\} =
\exp\left\{ - \exp\left\{ {-(x-\alpha) \over \sigma}\right\}\right\},
\end{displaymath} (126)

where the location parameter $\alpha$ is given by the log-sum formula in equation (123). The form of the Social Surplus Function in equation (120) then follows from the formula for the expectation of an extreme value random variate in equation (121), and formula (123) for the location parameter of the maximum of a collection of independent Type I extreme value random variables, i.e.

\begin{displaymath}U(v_1,\ldots,v_D,X) \equiv E\left\{ \max_{d=1,\ldots,D}[v_d +
\tilde\epsilon_d] \vert X\right\} = \sigma\gamma+ \sigma\log\left[
\sum_{d=1}^D \exp\{ v_d/\sigma\} \right].
\end{displaymath} (127)
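The following simulation sketch (my illustration, not from the original solutions) can be used to verify both the multinomial logit formula (119) and the log-sum form of the Social Surplus function (120): with IID extreme value draws of scale $\sigma$, simulated choice frequencies and the simulated expected maximum should match the closed forms. The values of v and $\sigma$ are illustrative assumptions.

import numpy as np

# Simulation check of the logit formula (119) and the log-sum formula (120)
# under IID Gumbel (Type I extreme value) errors with scale sigma.
rng = np.random.default_rng(2)
v = np.array([1.0, 0.0, -0.5])                                   # illustrative utilities
sigma = 1.5                                                      # illustrative scale
eps = rng.gumbel(loc=0.0, scale=sigma, size=(2_000_000, 3))
u = v + eps

logit = np.exp(v / sigma) / np.exp(v / sigma).sum()              # equation (119)
freq = np.bincount(np.argmax(u, axis=1), minlength=3) / len(u)   # simulated frequencies

gamma = 0.5772156649015329                                       # Euler's constant
logsum = sigma * gamma + sigma * np.log(np.exp(v / sigma).sum()) # equation (120)
print(logit.round(4), freq.round(4))
print(logsum, u.max(axis=1).mean())                              # should agree closely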

QUESTION 4 (Latent Variable Models) The Binary Probit Model can be viewed as a simple type of latent variable model. There is an underlying linear regression model

 \begin{displaymath}\tilde z = X\beta^* + \epsilon
\end{displaymath} (128)

but where the dependent variable $\tilde z$ is latent, i.e. it is not observed by the econometrician. Instead we observe the dependent variable y given by

\begin{displaymath}y = \left\{ \begin{array}{ll} 1 & \mbox{if} \quad \tilde z
> 0 \\
0 & \mbox{if} \quad \tilde z \le 0 \end{array} \right. \end{displaymath} (129)

1.
(5%) Assume that the error term $\epsilon \sim N(0,\sigma^2)$. Show that the scale of $\beta^*$ and the parameter $\sigma^2$ are not separately identified, and therefore without loss of generality we can normalize $\sigma^2=1$ and interpret the estimated $\beta$ coefficients as being the true coefficients $\beta^*$ divided by $\sigma$:

\begin{displaymath}\beta = { \beta^* \over \sigma}. \end{displaymath} (130)

Answer: Notice that if $\lambda > 0$ is an arbitrary positive constant and we divide both sides of equation (128) by $\lambda$, the probability distribution of the observed dependent variable does not change, since we have

\begin{displaymath}\tilde z > 0 \Longleftrightarrow {\tilde z \over
\lambda} > 0.\end{displaymath} (131)

Thus the model with latent variable $\tilde z/\lambda$ is observationally equivalent to the model with the latent variable $\tilde z$. Normalizing the variance of $\epsilon$ to 1 is equivalent to dividing $\tilde z$ by the standard deviation $\sigma$ of the underlying ``true'' $\epsilon$ variable, so that our estimates of $\beta$ should be interpreted as estimates of $\beta^*/\sigma$.
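A short simulation makes the observational equivalence concrete; the sketch below (my illustration, with arbitrary values of x, $\beta^*$, $\sigma$ and $\lambda$) shows that rescaling $(\beta^*,\sigma)$ by a common positive constant leaves the distribution of y unchanged, and that the implied choice probability is $\Phi(X\beta^*/\sigma)$.

import numpy as np
from scipy.stats import norm

# Illustration (hypothetical values) that only beta*/sigma is identified:
# dividing (beta*, sigma) by lambda leaves the distribution of y unchanged.
rng = np.random.default_rng(7)
x, beta_star, sigma, lam = 1.3, 0.8, 2.0, 5.0
eps = sigma * rng.standard_normal(1_000_000)

p_original = (x * beta_star + eps > 0).mean()                    # model (128)
p_rescaled = (x * beta_star / lam + eps / lam > 0).mean()        # rescaled by lambda
print(p_original, p_rescaled, norm.cdf(x * beta_star / sigma))   # all three agree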

2.
(10%) Derive the conditional probability $\mbox{Pr}\{y=1\vert X\}$ in terms of X, $\beta$, and the standard normal CDF $\Phi$, and use this probability to write down the likelihood function for N IID observations of pairs $\{(y_i,X_i)\}, i=1,\ldots,N$.

Answer: We have

\begin{displaymath}\mbox{Pr}\{ y=1\vert X,\beta^*\} = \mbox{Pr}\{ \tilde z > 0 \vert X\}
= \mbox{Pr}\{ X\beta^* + \epsilon > 0\} = \mbox{Pr}\{ -\epsilon <
X\beta^*\} = \Phi(X\beta^*),\end{displaymath} (132)

where $\Phi$ is the CDF of a N(0,1) random variable, and we used the fact that if $\epsilon \sim N(0,1)$ then $-\epsilon \sim N(0,1)$. Using this formula, the likelihood for N observations $\{y_i,X_i\}$ is given by

\begin{displaymath}L(\beta) = \prod_{i=1}^N [\Phi(X_i\beta)]^{y_i}
[1-\Phi(X_i\beta)]^{(1-y_i)}. \end{displaymath} (133)
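For concreteness, a minimal sketch of maximizing this likelihood on simulated data follows; it is my illustration rather than part of the original solutions, assumes numpy/scipy are available, and uses arbitrary values for N and the true coefficients. The log-likelihood is coded with norm.logcdf for numerical stability, using $\log(1-\Phi(x)) = \log\Phi(-x)$.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Sketch of probit MLE based on the likelihood (133); simulated data with
# hypothetical true coefficients.
rng = np.random.default_rng(3)
N, beta_true = 5_000, np.array([0.5, -1.0])
X = np.column_stack([np.ones(N), rng.standard_normal(N)])        # includes a constant
y = (X @ beta_true + rng.standard_normal(N) > 0).astype(float)   # y = 1{z > 0}

def neg_loglik(beta):
    xb = X @ beta
    return -(y * norm.logcdf(xb) + (1 - y) * norm.logcdf(-xb)).sum()

beta_mle = minimize(neg_loglik, x0=np.zeros(2), method="BFGS").x
print(beta_mle)                                  # close to beta_true for large N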

3.
(20%) Show that $\beta$ can be consistently estimated by nonlinear least squares by writing down the least squares problem and sketching a proof for its consistency.

Answer: We observe that y satisfies the following nonlinear regression equation:

 \begin{displaymath}y = \Phi(X\beta^*) + \xi,
\end{displaymath} (134)

where $E\{\xi\vert X\}=0$. To see this, note that conditional on X the residual $\xi$ takes on two possible values. If y=1, which occurs with probability $\Phi(X\beta^*)$, then $\xi=1-\Phi(X\beta^*)$. If y=0, which occurs with probability $1-\Phi(X\beta^*)$, then $\xi=-\Phi(X\beta^*)$. Thus the conditional expectation is given by

\begin{displaymath}E\{\xi\vert X\} = [1-\Phi(X\beta^*)]\Phi(X\beta^*) - \Phi(X\beta^*)
[1-\Phi(X\beta^*)] = 0.
\end{displaymath} (135)

Thus, since the conditional expectation of y is given by the parametric function $\Phi(X\beta^*)$ it follows from the general results on the consistency of nonlinear least squares that the nonlinear least squares estimator

\begin{displaymath}\hat\beta^n_N = \mathop{\it argmin}_{\beta \in R^k} \sum_{i=1}^N [y_i -
\Phi(X_i\beta)]^2
\end{displaymath} (136)

will be a consistent estimator of $\beta^*$.
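A corresponding sketch of the nonlinear least squares estimator (136) on the same kind of simulated data is given below; again this is my illustration with hypothetical data-generating values, not part of the original solutions.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Sketch of the nonlinear least squares estimator (136) on simulated data.
rng = np.random.default_rng(4)
N, beta_true = 5_000, np.array([0.5, -1.0])
X = np.column_stack([np.ones(N), rng.standard_normal(N)])
y = (X @ beta_true + rng.standard_normal(N) > 0).astype(float)

def ssr(beta):
    # sum of squared residuals [y_i - Phi(X_i beta)]^2
    return np.sum((y - norm.cdf(X @ beta)) ** 2)

beta_nls = minimize(ssr, x0=np.zeros(2), method="BFGS").x
print(beta_nls)          # consistent for beta_true, though less efficient than the MLE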

4.
(20%) Derive the asymptotic distribution of the maximum likelihood estimator by providing an analytical formula for the asymptotic covariance matrix of the MLE estimator $\hat\beta_N$.

Hint: This is the inverse of the information matrix ${\cal I}$. Derive a formula for ${\cal I}$ in terms of $\Phi$, X and $\beta$ and possibly other terms.

Answer: We know that if the model is correctly specified and basic regularity conditions hold, then the maximum likelihood estimator $\hat\beta^m_N$ is consistent and asymptotically normally distributed with

\begin{displaymath}\sqrt{N} [\hat \beta^m_N -
\beta^*]\phantom{,}_{\Longrightarrow \atop d} N(0,{\cal I}^{-1}),
\end{displaymath} (137)

where ${\cal I}$ is the Information Matrix given by

 \begin{displaymath}{\cal I} = E\{ {\partial \over \partial \beta}
\log f(y\vert X,\beta^*) { \partial \over \partial \beta'} \log f(y\vert X,\beta^*)\}.
\end{displaymath} (138)

In the case of the probit model we have

\begin{displaymath}\log f(y\vert X,\beta^*) = y \log(\Phi(X\beta)) + (1-y)\log(1-\Phi(X\beta)),
\end{displaymath} (139)

and so we have

\begin{displaymath}{\partial \over \partial \beta} \log f(y\vert X,\beta^*) = { y \phi(X\beta) X \over
\Phi(X\beta)} - { (1-y) \phi(X\beta) X \over (1 - \Phi(X\beta)) }
\end{displaymath} (140)

where

\begin{displaymath}\phi(X\beta) = \Phi'(X\beta) = {1 \over \sqrt{2 \pi}} \exp\{
-(X\beta)^2/2\}.
\end{displaymath} (141)

Using this formula it is not hard to see that
\begin{eqnarray*}
{\cal I} & = & E\left\{ \left[ { 1\over \Phi(X\beta^*)} + { 1 \over
[1-\Phi(X\beta^*)]}\right] \phi^2(X\beta^*) XX' \right\} \\
& = & E\left\{ { \phi^2(X\beta^*) X X'\over
\Phi(X\beta^*)[1-\Phi(X\beta^*)]} \right\}.
\end{eqnarray*} (142)

5.
(20%) Derive the asymptotic distribution of the nonlinear least squares estimator and compare it to the maximum likelihood estimator. Is the nonlinear least squares estimator asymptotically inefficient?

Answer: The first order condition for the nonlinear least squares estimator $\hat\beta_N$ is given by:

\begin{displaymath}0 = {1\over N} \sum_{i=1}^N [y_i - \Phi(X_i\hat\beta_N)]
\phi(X_i\hat\beta_N) X_i.
\end{displaymath} (143)

Expanding this first order condition in a Taylor series about $\beta^*$ we obtain
\begin{eqnarray*}
0 & = & {1 \over N} \sum_{i=1}^N [y_i - \Phi(X_i\beta^*)] \phi(X_i\beta^*) X_i \\
& & - \left[ {1\over N}\sum_{i=1}^N \phi^2(X_i\tilde \beta_N)X_i X_i'
- {1\over N}\sum_{i=1}^N [y_i - \Phi(X_i\tilde \beta_N)]\phi'(X_i\tilde \beta_N)
X_i X_i'\right] (\hat\beta_N - \beta^*),
\end{eqnarray*} (144)

where $\tilde \beta_N$ is a vector each of whose coordinates lies on the line segment joining the corresponding coordinates of $\hat\beta_N$ and $\beta^*$. Solving the above equation for $\sqrt{N}(\hat\beta_N-\beta^*)$ we obtain
 
\begin{eqnarray*}
\sqrt{N}(\hat\beta_N-\beta^*) & = & \left[ {1\over N}\sum_{i=1}^N
\phi^2(X_i\tilde \beta_N)X_i X_i' - {1\over N}\sum_{i=1}^N [y_i - \Phi(X_i\tilde \beta_N)]\phi'(X_i\tilde \beta_N)
X_i X_i'\right]^{-1} \\
& & \times \left[{1 \over \sqrt{N}} \sum_{i=1}^N [y_i -
\Phi(X_i\beta^*)] \phi(X_i\beta^*) X_i\right].
\end{eqnarray*} (145)

Applying the Central Limit Theorem to the second term in brackets in the above equation we have

\begin{displaymath}{1 \over \sqrt{N}} \sum_{i=1}^N \left[y_i -
\Phi(X_i\beta^*)\right] \phi(X_i\beta^*) X_i
\phantom{,}_{\Longrightarrow \atop d} N(0,\Omega),\end{displaymath} (146)

where $\Omega$ is given by
\begin{eqnarray*}
\Omega & = & E\left\{ \left[ [1-\Phi(X\beta^*)]^2 \Phi(X\beta^*) +
[\Phi(X\beta^*)]^2 [1-\Phi(X\beta^*)] \right] \phi^2(X\beta^*) X
X'\right\} \\
& = & E\left\{ \Phi(X\beta^*)[1- \Phi(X\beta^*)]\phi^2(X\beta^*) X X' \right\}.
\end{eqnarray*} (147)

Appealing to the uniform strong law of large numbers, we can show that the other term in equation (145) converges to the following limiting value with probability 1:

\begin{displaymath}\left[ {1\over N}\sum_{i=1}^N
\phi^2(X_i\tilde \beta_N)X_i X_i' - {1\over N}\sum_{i=1}^N [y_i - \Phi(X_i\tilde \beta_N)]\phi'(X_i\tilde \beta_N)
X_i X_i'\right] \longrightarrow \Sigma \end{displaymath} (148)

where

\begin{displaymath}\Sigma = E\left\{ \phi^2(X\beta^*) X X'\right\}.
\end{displaymath} (149)

It follows that the asymptotic distribution of the nonlinear least squares estimator is given by

\begin{displaymath}\sqrt{N}[\hat\beta_N - \beta^*] \phantom{,}_{\Longrightarrow \atop d}
N(0, \Sigma^{-1} \Omega \Sigma^{-1}).
\end{displaymath} (150)

Since the maximum likelihood estimator is an asymptotically efficient estimator and the nonlinear least squares estimator is a potentially inefficient estimator, we have

\begin{displaymath}{\cal I}^{-1} \le \Sigma^{-1} \Omega \Sigma^{-1}.
\end{displaymath} (151)

To see when the inequality is strict, compare ${\cal I}$ and $\Sigma \Omega^{-1} \Sigma$ directly. Consider first the special case where the distribution of X is degenerate, with only one possible X vector (take X to be scalar so that the inverses are well defined). Then we have

\begin{displaymath}{\cal I}= { \phi^2(X\beta^*) XX' \over
\Phi(X\beta^*)[1-\Phi(X\beta^*)]},
\end{displaymath} (152)

and, since in this case $\Sigma = \phi^2(X\beta^*)XX'$ and $\Omega = \Phi(X\beta^*)[1-\Phi(X\beta^*)]\phi^2(X\beta^*)XX'$, we also have

\begin{displaymath}\Sigma \Omega^{-1} \Sigma = {\phi^2(X\beta^*) X X' \over
\Phi(X\beta^*)[1-\Phi(X\beta^*)] }.
\end{displaymath} (153)

Thus ${\cal I} = \Sigma \Omega^{-1} \Sigma$: with a degenerate X the nonlinear least squares estimator is asymptotically equivalent to the maximum likelihood estimator. More generally, writing $a(X) = \phi^2(X\beta^*)$ and $p(X) = \Phi(X\beta^*)[1-\Phi(X\beta^*)]$, we have ${\cal I} = E\{ [a(X)/p(X)] XX'\}$, $\Sigma = E\{ a(X) XX'\}$ and $\Omega = E\{ a(X) p(X) XX'\}$, and a matrix Cauchy-Schwarz argument shows that ${\cal I} \ge \Sigma \Omega^{-1} \Sigma$, with equality only when the conditional variance weight $\Phi(X\beta^*)[1-\Phi(X\beta^*)]$ is constant with probability 1 over the support of X. Whenever this weight varies with X the inequality is strict, so that the (unweighted) nonlinear least squares estimator will generally be strictly asymptotically inefficient in comparison to the maximum likelihood estimator.
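The comparison can also be illustrated numerically; the sketch below (my illustration, for a single-regressor probit with illustrative choices of $\beta^*$ and the X distribution) computes ${\cal I}^{-1}$ and $\Sigma^{-1}\Omega\Sigma^{-1}$ from simulated moments, showing strict inefficiency of nonlinear least squares when X varies and equality when X is degenerate.

import numpy as np
from scipy.stats import norm

# Compare the asymptotic variances 1/I (MLE) and Omega/Sigma^2 (unweighted NLS)
# for a single-regressor probit with beta* = 1; X distributions are illustrative.
def avars(x, beta=1.0):
    xb = x * beta
    phi2 = norm.pdf(xb) ** 2
    p = norm.cdf(xb) * (1 - norm.cdf(xb))        # Phi(1 - Phi)
    I = np.mean(phi2 * x**2 / p)                 # information
    Sigma = np.mean(phi2 * x**2)
    Omega = np.mean(p * phi2 * x**2)             # variance of the NLS score term
    return 1 / I, Omega / Sigma**2               # MLE vs NLS asymptotic variances

rng = np.random.default_rng(5)
print(avars(rng.standard_normal(1_000_000)))     # non-degenerate X: NLS strictly worse
print(avars(np.full(10, 0.7)))                   # degenerate X: the two coincide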

6.
(25%) Show that the nonlinear least squares estimator of $\beta$ is subject to heteroscedasticity by deriving an explicit formula for the conditional variance of the error term in the nonlinear regression formulation of the estimation problem. Can you form a more efficient estimator by correcting for this heteroscedasticity in a two-stage feasible GLS procedure (i.e. in stage one computing an initial consistent, but inefficient, estimator of $\beta$ by ordinary nonlinear least squares, and in stage two using this initial consistent estimator to correct for the heteroscedasticity, taking the stage-two estimator of $\beta$ as the feasible GLS estimator)? If so, is this feasible GLS procedure asymptotically efficient? If you believe so, provide a sketch of the derivation of the asymptotic distribution of the feasible GLS estimator. Otherwise provide a counterexample or a sketch of an argument why you believe the feasible GLS procedure is asymptotically inefficient relative to the maximum likelihood estimator.

Answer: There is heteroscedasticity in the nonlinear regression formulation of the probit estimation problem in (134) since we have

 \begin{displaymath}\mbox{var}(\xi\vert X) = E\{\xi^2\vert X\} = [1-\Phi(X\beta^*)]^2\Phi(X\beta^*) +
[\Phi(X\beta^*)]^2 [1-\Phi(X\beta^*)] = \Phi(X\beta^*)[1-\Phi(X\beta^*)].
\end{displaymath} (156)

Now suppose we carry out a first stage nonlinear least squares estimation to obtain an initial $\sqrt{N}$-consistent estimator $\hat\beta_N$, and then use it to construct the second stage weighted nonlinear least squares problem:

\begin{displaymath}\hat\beta_N^g= \mathop{\it argmin}_{\beta \in R^k} {1 \over N}
\sum_{i=1}^N { [y_i - \Phi(X_i\beta)]^2 \over [1-\Phi(X_i\hat\beta_N)]^2\Phi(X_i\hat\beta_N) +
[\Phi(X_i\hat\beta_N)]^2 [1-\Phi(X_i\hat\beta_N)]}.
\end{displaymath} (157)

It turns out that this two-stage feasible GLS estimator has the same asymptotic distribution as maximum likelihood, i.e. it is an asymptotically efficient estimator. It is easiest to see this result by assuming first that we know the exact form of the heteroscedasticity, i.e. in the denominator of the second stage we weight the observations by the inverse of the exact conditional variance given in equation (156). Then repeating the Taylor series expansion argument that we used to derive the asymptotic distribution of the unweighted nonlinear least squares estimator, it is not difficult to show that
 
\begin{eqnarray*}
\sqrt{N}(\hat\beta^g_N-\beta^*) & = & \left[ {1\over N}\sum_{i=1}^N
{ \phi^2(X_i\tilde \beta_N)X_i X_i' - [y_i - \Phi(X_i\tilde \beta_N)]\phi'(X_i\tilde
\beta_N) X_i X_i' \over E\{\xi_i^2\vert X_i\}}
\right]^{-1} \\
& & \times \left[{1 \over \sqrt{N}} \sum_{i=1}^N { [y_i -
\Phi(X_i\beta^*)] \phi(X_i\beta^*) X_i \over E\{\xi_i^2\vert X_i\}}\right].
\end{eqnarray*} (158)

Once again, appealing to the Central Limit Theorem, we can show that the second term in equation (158) converges in distribution to

\begin{displaymath}{1 \over \sqrt{N}} \sum_{i=1}^N { [y_i -
\Phi(X_i\beta^*)] \phi(X_i\beta^*) X_i \over E\{\xi_i^2\vert X_i\}}\phantom{,}_{\Longrightarrow \atop d} N(0,\Omega),
\end{displaymath} (159)

where in the GLS case $\Omega$ is given by
\begin{eqnarray*}
\Omega & = & E\left\{ { \phi^2(X\beta^*) X X' \over
\left[ [1-\Phi(X\beta^*)]^2 \Phi(X\beta^*) + [\Phi(X\beta^*)]^2
[1-\Phi(X\beta^*)] \right]} \right\} \\
& = & E\left\{ { \phi^2(X\beta^*) X X' \over \Phi(X\beta^*) [1 -
\Phi(X\beta^*)] } \right\} \\
& = & {\cal I}.
\end{eqnarray*} (160)

Similarly, we can show that the other term in equation (158) converges with probability 1 to the matrix $\Sigma$,

\begin{displaymath}\left[ {1\over N}\sum_{i=1}^N
{ \phi^2(X_i\tilde \beta_N)X_i X_i' - [y_i - \Phi(X_i\tilde \beta_N)]\phi'(X_i\tilde \beta_N)
X_i X_i' \over E\{\xi_i^2\vert X_i\}}
\right] \longrightarrow \Sigma
\end{displaymath} (161)

where we also have $\Sigma = {\cal I}$. Thus, the GLS estimator converges in distribution to

\begin{displaymath}\sqrt{N}[\hat\beta^g_N - \beta^*]\phantom{,}_{\Longrightarrow
\atop d} N(0,\Sigma^{-1} \Omega \Sigma^{-1}) = N(0,{\cal I}^{-1}),
\end{displaymath} (162)

so the GLS estimator is asymptotically efficient. To show that the feasible GLS estimator (i.e. the one using the estimated conditional variance as weights instead of weighting by the true conditional variance) has this same distribution is a rather tedious exercise in the properties of uniform convergence and will be omitted.
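A sketch of the two-stage procedure on simulated data is given below (my illustration, not part of the original solutions): stage one is unweighted nonlinear least squares, and stage two reweights by the estimated conditional variance from (156), clipped away from zero for numerical stability. Variable names and data-generating values are hypothetical.

import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

# Sketch of the two-stage feasible GLS estimator (157) on simulated data.
rng = np.random.default_rng(6)
N, beta_true = 5_000, np.array([0.5, -1.0])
X = np.column_stack([np.ones(N), rng.standard_normal(N)])
y = (X @ beta_true + rng.standard_normal(N) > 0).astype(float)

def wssr(beta, w):
    # weighted sum of squared residuals
    return np.sum((y - norm.cdf(X @ beta)) ** 2 / w)

b1 = minimize(wssr, np.zeros(2), args=(np.ones(N),), method="BFGS").x       # stage 1
p_hat = np.clip(norm.cdf(X @ b1), 1e-6, 1 - 1e-6)
b_fgls = minimize(wssr, b1, args=(p_hat * (1 - p_hat),), method="BFGS").x   # stage 2
print(b1, b_fgls)    # the feasible GLS estimator is asymptotically as efficient as the MLE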

Final Comment: I note that the GMM efficiency bound for the conditional moment restriction

\begin{displaymath}H(\beta^*\vert X) = E\{ h(\tilde y,\tilde X,\beta^*)\vert\tilde X=X\} = 0
\end{displaymath} (163)

coincides with ${\cal I}^{-1}$ when $h(\tilde y,\tilde X,\beta)=\tilde y - \Phi(\tilde X\beta)$. To see this, recall that the GMM bound for conditional moment restrictions is given by

\begin{displaymath}\left[ E\left\{ \nabla H(\beta^*\vert X) \Omega^{-1}(X) \nabla
H(\beta^*\vert X)'\right\} \right]^{-1},
\end{displaymath} (164)

where

\begin{displaymath}\Omega(X) = E\{h(\tilde y,\tilde X,\beta^*)h(\tilde y,\tilde
X,\beta^*)'\vert\tilde X=X\}
\end{displaymath} (165)

and

\begin{displaymath}\nabla H(\beta^*\vert X) = E\left\{ {\partial \over \partial
\beta} h(\tilde y,\tilde X,\beta^*)
\vert\tilde X=X\right\}.
\end{displaymath} (166)

In the case where $h(\tilde y,\tilde X,\beta)=\tilde y - \Phi(\tilde X\beta)$ we have

\begin{displaymath}\Omega(X)= E\{\xi^2\vert X\} = [1-\Phi(X\beta^*)]^2\Phi(X\beta^*) +
[\Phi(X\beta^*)]^2 [1-\Phi(X\beta^*)],
\end{displaymath} (167)

that is, $\Omega(X)$ is just the conditional heteroscedasticity of the residuals in the nonlinear regression formulation of the probit problem. Also, we have

\begin{displaymath}\nabla H(\beta^*\vert X) = -\phi(X\beta^*) X.
\end{displaymath} (168)

Plugging these into the matrix inside the expectation in the GMM bound we have

\begin{displaymath}\nabla H(\beta^*\vert X) \Omega^{-1}(X) \nabla
H(\beta^*\vert X)' = { \phi^2(X\beta^*) X X' \over \Phi(X \beta^*) [ 1 -
\Phi(X\beta^*)] }.
\end{displaymath} (169)

Taking expectations with respect to X and comparing to the formula for the information matrix in equation (138) we see that

\begin{displaymath}E\left\{ \nabla H(\beta^*\vert X) \Omega^{-1}(X) \nabla
H(\beta^*\vert X)'\right\} = {\cal I}.
\end{displaymath} (170)

Since the GMM bound is the inverse of this matrix, it equals the inverse of the information matrix, ${\cal I}^{-1}$, and hence is the same as the (asymptotic) Cramér-Rao lower bound.



 
John Rust
2001-05-01