Spring 2001 John Rust
Economics 551b 37 Hillhouse, Rm. 27

FINAL EXAM

April 27, 2001

INSTRUCTIONS: Do all Parts I, II, and III below. You are required to answer all questions in Part I, 2 out of 6 questions in Part II, and 1 out of 4 questions in Part III. The total for the final exam is 100 points. Part I should take about 15 minutes and is worth 15 points. Part II should take about 30 minutes and is worth 30 points. Part III should take about 60 minutes and is worth 55 points. You have 3 hours for the exam, but my expectation is that almost all students will complete it in two hours.

Part I: 15 minutes, 15 points. Answer all questions below:

1. Suppose $\{\tilde X_1,\ldots,\tilde X_N\}$ are IID draws from a $N(\mu,\sigma^2)$ distribution (i.e. a normal distribution with mean $\mu$ and variance $\sigma^2$). Consider the estimator $\hat\theta_N$ defined by:

\begin{displaymath}\hat\theta_N = \left( {1 \over N} \sum_{i=1}^N \tilde
X_i \right)^2 \end{displaymath} (1)

Which of the following statements are true and which are false?

A.
$\hat\theta_N$ is a consistent estimator of $\sigma^2$.

B.
$\hat\theta_N$ is an unbiased estimator of $\sigma^2$.

C.
$\hat\theta_N$ is a consistent estimator of $\mu$.

D.
$\hat\theta_N$ is an unbiased estimator of $\mu$.

E.
$\hat\theta_N$ is a consistent estimator of $\mu^2$.

F.
$\hat\theta_N$ is an unbiased estimator of $\mu^2$.

G.
$\hat\theta_N$ is an upward biased estimator of $\mu^2$.

H.
$\hat\theta_N$ is a downward biased estimator of $\mu^2$.
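For intuition, here is a minimal Monte Carlo sketch in Python of the sampling behavior of the estimator $\hat\theta_N$ in equation (1) as N grows (illustrative only; the values $\mu=2$, $\sigma=3$ and the number of replications are hypothetical):

\begin{verbatim}
# Monte Carlo sketch of theta_hat_N = (sample mean)^2 for hypothetical
# values mu = 2, sigma = 3; prints the simulated mean of theta_hat_N
# against mu^2 for several sample sizes N.
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, reps = 2.0, 3.0, 20_000
for N in (10, 100, 1_000):
    draws = rng.normal(mu, sigma, size=(reps, N))
    theta_hat = draws.mean(axis=1) ** 2          # (sample mean)^2
    print(f"N={N:5d}  mean of theta_hat = {theta_hat.mean():7.3f}  mu^2 = {mu**2}")
\end{verbatim}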

2. Consider estimation of the linear model

\begin{displaymath}y=X\beta + \epsilon \end{displaymath} (2)

based on N IID observations $\{y_i,X_i\}$ where $X_i$ is a $K \times 1$ vector of independent variables and $y_i$ is a $1 \times 1$ scalar dependent variable. Mark each of the following statements as true or false:

A.
The Gauss-Markov Theorem proves that the ordinary least squares (OLS) estimator is BLUE (Best Linear Unbiased Estimator).

B.
The Gauss-Markov Theorem requires that the error term in the regression $\epsilon$ be normally distributed with mean 0 and variance $\sigma^2$.

C.
The Gauss-Markov Theorem does not apply if the true regression function does not equal $X\beta$, i.e. if $E\{y\vert X\} \ne X\beta$.

D.
The Gauss-Markov Theorem does not apply if there is heteroscedasticity.

E.
The Gauss-Markov Theorem does not apply if the error term has a non-normal distribution.

F.
The maximum likelihood estimator of $\beta$ is more efficient than the OLS estimator of $\beta$.

G.
The OLS estimator of $\beta$ will be unbiased only if the error terms are distributed independently of X and have mean 0.

H.
The maximum likelihood estimator of $\beta$ is the same as OLS only in the case where $\epsilon$ is normally distributed.

I.
The OLS estimator will be a consistent estimator of $\beta$ even if the error term $\epsilon$ is not normal and even if there is heteroscedasticity.

J.
The OLS estimator of the asymptotic covariance matrix for $\beta$, $\hat\sigma^2 (X'X/N)^{-1}$ (where $\hat\sigma^2$ is the sample variance of the estimated residuals $\hat\epsilon_i=y_i-X_i\hat\beta$) is a consistent estimator regardless of whether $\epsilon$ is normally distributed or not.

K.
The OLS estimator of the asymptotic covariance matrix for $\beta$, $\hat\sigma^2 (X'X/N)^{-1}$ (where $\hat\sigma^2$ is the sample variance of the estimated residuals $\hat\epsilon_i=y_i-X_i\hat\beta$) is a consistent estimator regardless of whether there is heteroscedasticity in $\epsilon$.

L.
If the distribution of $\epsilon$ is double exponential, i.e. if $f(\epsilon)=\exp\{ -\vert\epsilon\vert/\sigma\}/(2\sigma)$, the maximum likelihood estimator of $\beta$ is the Least Absolute Deviations estimator and it is asymptotically efficient relative to the OLS estimator.

M.
The OLS estimator cannot be used if the regression function is misspecified, i.e. if the true regression function $E\{y\vert X\} \ne X\beta$.

N.
The OLS estimator will be inconsistent if $\epsilon$ and X are correlated.

O.
The OLS estimator will be inconsistent if the dependent variable y is truncated, i.e. if the dependent variable is actually determined by the relation

\begin{displaymath}y= \max[0, X\beta + \epsilon] \end{displaymath} (3)

P.
The OLS estimator is inconsistent if $\epsilon$ has a Cauchy distribution, i.e. if the density of $\epsilon$ is given by

\begin{displaymath}f(\epsilon)={ 1 \over \pi (1+ \epsilon^2)} \end{displaymath} (4)

Q.
The 2-stage least squares estimator is a better estimator than the OLS estimator because it has two stages and is therefore twice as efficient.

R.
If the set of instrumental variables W and the set of regressors X in the linear model coincide, then the 2-stage least squares estimator of $\beta$ is the same as the OLS estimator of $\beta$.
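A quick numerical check of statement R can be coded in a few lines; this is an illustrative sketch on simulated data (the design below is hypothetical), comparing OLS with 2-stage least squares when the instrument matrix W equals the regressor matrix X:

\begin{verbatim}
# Checks numerically that when W = X the 2SLS estimator
# (X'P_W X)^{-1} X'P_W y collapses to the OLS estimator (X'X)^{-1} X'y.
import numpy as np

rng = np.random.default_rng(1)
N, K = 500, 3
X = np.column_stack([np.ones(N), rng.normal(size=(N, K - 1))])
beta_true = np.array([1.0, -0.5, 2.0])            # hypothetical coefficients
y = X @ beta_true + rng.normal(size=N)

beta_ols = np.linalg.solve(X.T @ X, X.T @ y)

W = X                                             # instruments = regressors
P_W = W @ np.linalg.solve(W.T @ W, W.T)           # projection onto columns of W
beta_2sls = np.linalg.solve(X.T @ P_W @ X, X.T @ P_W @ y)

print(np.allclose(beta_ols, beta_2sls))           # True
\end{verbatim}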

Part II: 30 minutes, 30 points. Answer 2 of the 6 questions below.

QUESTION 1 (Probability question) Suppose $\tilde Z$ is a $K \times 1$ random vector with a multivariate N(0,I) distribution, i.e. $E\{\tilde Z\}=0$ where 0 is a $K \times 1$ vector of zeros and $E\{\tilde Z \tilde Z'\}=I$ where I is the $K \times K$ identity matrix. Let M be a $K \times K$ idempotent matrix, i.e. a matrix that satisfies

\begin{displaymath}M^2 = M \cdot M = M \end{displaymath} (5)

Show that

\begin{displaymath}\tilde Z' M \tilde Z \sim \chi^2(J) \end{displaymath} (6)

where $\chi^2(J)$ denotes a chi-squared random variable with J degrees of freedom and $J= \mbox{rank}(M)$. Hint: Use the fact that M has a singular value decomposition, i.e.

\begin{displaymath}M = X D X' \end{displaymath} (7)

where X' X = I and D is a diagonal matrix whose diagonal elements are equal to either 1 or 0.
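A Monte Carlo sketch in Python (illustrative only; the dimensions K, J and the construction of M below are hypothetical) that checks the claim numerically: for a rank-J idempotent M, the quadratic form $\tilde Z' M \tilde Z$ should have mean J and variance 2J, matching a $\chi^2(J)$ random variable.

\begin{verbatim}
# Builds a K x K idempotent matrix M of rank J (a projection onto J
# arbitrary directions) and simulates Z'MZ for Z ~ N(0, I).
import numpy as np

rng = np.random.default_rng(2)
K, J, reps = 6, 3, 200_000
A = rng.normal(size=(K, J))
M = A @ np.linalg.solve(A.T @ A, A.T)      # M = A(A'A)^{-1}A': M^2 = M, rank J

Z = rng.normal(size=(reps, K))
q = np.einsum('ij,jk,ik->i', Z, M, Z)      # quadratic forms Z'MZ
print(q.mean(), q.var())                   # approximately J and 2J
\end{verbatim}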

QUESTION 2 (Markov Processes)

A.
(10%) Are Markov processes of any use in econometrics? Describe some examples of how Markov processes are used in econometrics, such as providing models of serially dependent data, serving as a framework for establishing convergence of estimators and proving laws of large numbers, central limit theorems, etc., and serving as a computational tool for doing simulations.

B.
(10%) What is a random walk? Is a random walk always a Markov process? If not, provide a counter-example.

C.
(40%) What is the ergodic or invariant distribution of a Markov process? Do all Markov processes have invariant distributions? If not, provide a counterexample of a Markov process that doesn't have an invariant distribution. Can a Markov process have more than 1 invariant distribution? If so, give an example.
D.
(40%) Consider the discrete Markov process $\{X_{t}\}$ with state space $\{1,2,3\}$ and transition probabilities

\begin{eqnarray*}
P\{X_{t+1}=1\vert X_{t}=1\}&=&\frac{1}{2}, \quad P\{X_{t+1}=2\vert X_{t}=1\}=\ldots \\
P\{X_{t+1}=3\vert X_{t}=2\}&=&\frac{1}{4}, \quad P\{X_{t+1}=2\vert X_{t}=3\}=1.
\end{eqnarray*}


Does this process have an invariant distribution? If so, find all of them.
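For part D, an illustrative Python sketch of how one can find invariant distributions numerically; the transition matrix P below is a hypothetical example, not the one specified in the question:

\begin{verbatim}
# An invariant distribution pi satisfies pi'P = pi', i.e. pi is a left
# eigenvector of P for eigenvalue 1, normalized to sum to one.
import numpy as np

P = np.array([[0.50, 0.25, 0.25],    # hypothetical transition probabilities:
              [0.50, 0.25, 0.25],    # row i gives the distribution of
              [0.00, 1.00, 0.00]])   # X_{t+1} given X_t = i

vals, vecs = np.linalg.eig(P.T)                    # left eigenvectors of P
pi = np.real(vecs[:, np.argmin(np.abs(vals - 1.0))])
pi = pi / pi.sum()
print(pi, pi @ P)                                  # pi and pi P coincide
\end{verbatim}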

QUESTION 3 (Consistency of M-estimator) Consider an M-estimator defined by:


\begin{displaymath}\widehat{\theta }_{N}=\arg \max_{\theta \in \Theta }Q_{N}(\theta ).
\end{displaymath}

Suppose the following two conditions hold:

(i) (Identification) For all $\varepsilon >0$


\begin{displaymath}Q(\theta ^{*})>\sup_{\theta \notin B(\theta ^{*},\varepsilon )}Q(\theta )
\end{displaymath}

where $B(\theta ^{*},\varepsilon )=\{\theta \in R^k \big\vert \Vert\theta-\theta^*\Vert
< \varepsilon\}$.

(ii) (Uniform Convergence)


\begin{displaymath}\sup_{\theta \in \Theta }\left\vert Q_{N}(\theta )-Q(\theta )\right\vert
\stackrel{p}{\rightarrow }0.
\end{displaymath}

Prove consistency of the estimator by showing that

\begin{displaymath}P\left( \widehat{\theta }_{N}\notin B(\theta ^{*},\varepsilon )\right)
\rightarrow 0 \quad \mbox{as} \quad N \rightarrow \infty.
\end{displaymath}

QUESTION 4 (Time series question) Suppose $\{X_t\}$ is an ARMA(p,q) process, i.e.

\begin{displaymath}A(L)X_t = B(L)\epsilon_t \end{displaymath}

where A(L) is a $p^{\hbox{\rm th}}$ order lag-polynomial

\begin{displaymath}A(L) = \alpha_0 + \alpha_1 L + \alpha_2 L^2 + \cdots + \alpha_p L^p
\end{displaymath}

and B(L) is a $q^{\hbox{\rm th}}$ order lag-polynomial

\begin{displaymath}B(L) = \beta_0 + \beta_1 L + \beta_2 L^2 + \cdots + \beta_q L^q \end{displaymath}

and the lag-operator $L^k$ is defined by

\begin{displaymath}L^k X_t = X_{t-k} \end{displaymath}

and $\{\epsilon_t\}$ is a white-noise process: $E\{\epsilon_t\}=0$, $\mbox{cov}(\epsilon_t,\epsilon_s)=0$ if $t\ne s$, and $\mbox{var}(\epsilon_t)=\sigma^2$.

A.
(30%) Write down the autocovariance and spectral density functions for this process.
B.
(30%) Show that if q = 0 an autoregression of $X_t$ on p lags of itself provides a consistent estimate of $(\alpha_0/\sigma,
\ldots,\alpha_p/\sigma)$. Is the autoregression still consistent if q > 0?
C.
(40%) Assume that a central limit theorem holds, i.e. that normalized sums of $\{X_t\}$ converge in distribution to a normal random variable. Write down an expression for the variance of the limiting normal distribution.
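As a numerical companion to part A, here is an illustrative Python sketch (the lag-polynomial coefficients and $\sigma^2$ below are hypothetical) that evaluates the ARMA spectral density implied by the convention $A(L)X_t = B(L)\epsilon_t$ above, namely $f(\omega) = (\sigma^2/2\pi)\,\vert B(e^{-i\omega})\vert^2 / \vert A(e^{-i\omega})\vert^2$:

\begin{verbatim}
# Evaluates the ARMA spectral density on a grid of frequencies for
# hypothetical polynomials A(L) = 1 - 0.5L and B(L) = 1 + 0.3L.
import numpy as np

alpha = np.array([1.0, -0.5])          # coefficients of A(L), lowest order first
beta = np.array([1.0, 0.3])            # coefficients of B(L), lowest order first
sigma2 = 1.0

w = np.linspace(0.0, np.pi, 5)
z = np.exp(-1j * w)                                # e^{-i w}
A = np.polyval(alpha[::-1], z)                     # A(e^{-i w})
B = np.polyval(beta[::-1], z)                      # B(e^{-i w})
f = sigma2 / (2 * np.pi) * np.abs(B) ** 2 / np.abs(A) ** 2
print(np.column_stack([w, f]))
\end{verbatim}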

QUESTION 5 (Empirical question) Assume that shoppers always choose only a single brand of canned tuna fish from the available selection of K alternative brands of tuna fish each time they go shopping at a supermarket. Assume initially that the (true) probability that the decision-maker chooses brand k is the same for everybody and is given by $\theta^*_k$, $k=1,\ldots,K$. Marketing researchers would like to learn more about these choice probabilities, $\theta^*=(\theta^*_1,\ldots,\theta^*_K)$, and spend a great deal of money sampling shoppers' actual tuna fish choices in order to try to estimate these probabilities. Suppose the Chicken of the Sea Tuna company undertook a survey of N shoppers and, for each shopper shopping at a particular supermarket with a fixed set of K brands of tuna fish, recorded the brand $b_i$ chosen by shopper i, $i=1,\ldots,N$. Thus, $b_1=2$ denotes the observation that consumer 1 chose tuna brand 2, $b_4=K$ denotes the observation that consumer 4 chose tuna brand K, etc.

A.
(10%) Without doing any estimation, are there any general restrictions that you can place on the $K \times 1$ parameter vector $\theta^*$?

B.
(10%) Is it reasonable to suppose that $\theta^*_k$ is the same for everyone? Describe several factors that could lead different people to have different probabilities of purchasing different brands of tuna. If you were a consultant to Chicken of the Sea, what additional data would you recommend that they collect in order to better predict the probabilities that consumers buy various brands of tuna? Describe how you would use this data once it was collected.

C.
(20%) Using the observations $\{b_1,\ldots,b_N\}$ on the observed brand choices of the sample of N shoppers, write down an estimator for $\theta^*$ (under the assumption that the ``true'' brand choice probabilities $\theta^*$ are the same for everyone). Is your estimator unbiased?

D.
(20%) What is the maximum likelihood estimator of $\theta^*$? Is the maximum likelihood estimator unbiased?

E.
(40%) Suppose Chicken of the Sea Tuna company also collected data on the prices $\{p_1,\ldots,p_K\}$ that the supermarket charged for each of the K different brands of tuna fish. Suppose someone proposed that the probability of buying brand j was a function of the prices of all the various brands of tuna, $\theta^*_j(p_1,\ldots,p_K)$, given by:

\begin{displaymath}\theta^*_j(p_1,\ldots,p_K)= { \exp\left\{ \beta_j + \alpha p_j \right\}
\over \sum_{k=1}^K \exp\left\{ \beta_k + \alpha p_k \right\} }
\end{displaymath}

Describe in general terms how to estimate the parameters $(\alpha,\beta_1,\ldots,\beta_K)$. If $\alpha >0$, does an increase in $p_j$ decrease or increase the probability that a shopper would buy brand j?
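For part E, an illustrative maximum likelihood sketch in Python on simulated data (the sample size, the price variation across shoppers, and the true parameter values below are hypothetical; some price variation is assumed so that $\alpha$ and the $\beta_k$'s are separately identified, and $\beta_K$ is normalized to zero since only differences of the $\beta$'s matter):

\begin{verbatim}
# Simulates logit brand choices and recovers (alpha, beta_1, ..., beta_{K-1})
# by maximizing the multinomial logit log-likelihood.
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
K, N = 4, 5_000
alpha_true = -1.0
beta_true = np.array([0.5, 0.2, -0.3, 0.0])        # beta_K normalized to 0

prices = rng.uniform(1.0, 3.0, size=(N, K))        # prices faced by each shopper
util = beta_true + alpha_true * prices
p = np.exp(util)
p /= p.sum(axis=1, keepdims=True)
b = np.array([rng.choice(K, p=pi) for pi in p])    # observed brand choices

def neg_loglik(params):
    alpha, beta = params[0], np.append(params[1:], 0.0)
    v = beta + alpha * prices
    logp = v - np.log(np.exp(v).sum(axis=1, keepdims=True))
    return -logp[np.arange(N), b].sum()

res = minimize(neg_loglik, x0=np.zeros(K), method="BFGS")
print(res.x)       # approximately (alpha_true, beta_1, beta_2, beta_3)
\end{verbatim}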

QUESTION 6 (Regression question) Let $(y_t,x_t)$ be IID observations from a regression model

\begin{displaymath}y_t = \beta^* x_t + \epsilon_t \end{displaymath}

where $y_t$, $x_t$, and $\epsilon_t$ are all scalars. Suppose that $\epsilon_t$ is normally distributed with $E\{\epsilon_t\vert x_t\}=0$, but $\mbox{var}(\epsilon_t\vert x_t)=\sigma^2 \vert x_t\vert^{\theta^*}$. Consider the following two estimators for $\beta^*$:

\begin{displaymath}\hat\beta^1_T = { \sum_{t=1}^T y_t \over \sum_{t=1}^T x_t } \end{displaymath}


\begin{displaymath}\hat\beta^2_T = { \sum_{t=1}^T x_t y_t \over \sum_{t=1}^T x^2_t } \end{displaymath}

A.
(20%) Are these two estimators consistent estimators of $\beta^*$? Which estimator is more efficient 1) if we know a priori that $\theta^*=0$, and 2) if we don't know $\theta^*$? Explain your reasoning for full credit.

B.
(20%) Write down an asymptotically optimal estimator for $\beta^*$ if we know the value of $\theta^*$ a priori.

C.
(20%) Write down an asymptotically optimal estimator for $(\beta^*,\theta^*)$ if we don't know the value of $\theta^*$ a priori.

D.
(20%) Describe the feasible GLS estimator for $(\beta^*,\theta^*)$. Is the feasible GLS estimator asymptotically efficient?

E.
(20%) How would your answers to parts A to D change if you didn't know the distribution of $\epsilon_t$ was normal?
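A Monte Carlo sketch in Python (illustrative only; the parameter values, the design for $x_t$, and the number of replications are hypothetical) comparing the two estimators above with the infeasible GLS estimator that weights by the true skedastic function $\sigma^2\vert x_t\vert^{\theta^*}$:

\begin{verbatim}
# Compares beta^1 = sum(y)/sum(x), beta^2 = sum(xy)/sum(x^2), and weighted
# least squares with weights 1/|x|^theta under var(eps|x) = sigma^2 |x|^theta.
import numpy as np

rng = np.random.default_rng(4)
beta, sigma2, theta = 1.0, 1.0, 2.0
T, reps = 200, 20_000

x = rng.uniform(0.5, 2.0, size=(reps, T))
eps = rng.normal(size=(reps, T)) * np.sqrt(sigma2 * np.abs(x) ** theta)
y = beta * x + eps

b1 = y.sum(axis=1) / x.sum(axis=1)
b2 = (x * y).sum(axis=1) / (x ** 2).sum(axis=1)
w = 1.0 / np.abs(x) ** theta
b_gls = (w * x * y).sum(axis=1) / (w * x ** 2).sum(axis=1)

for name, b in [("beta1", b1), ("beta2", b2), ("gls", b_gls)]:
    print(name, b.mean().round(4), b.std().round(4))
\end{verbatim}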

Part III (60 minutes, 55 points). Do 1 out of the 4 questions below.

QUESTION 1 (Hypothesis testing) Consider the GMM estimator with IID data, i.e. the observations $\{y_i,x_i\}$ are independent and identically distributed, using the moment condition $H(\theta) = E\{h(\tilde y,\tilde x,\theta)\}$, where h is a $J \times 1$ vector of moment conditions and $\theta$ is a $K \times 1$ vector of parameters to be estimated. Assume that the moment conditions are correctly specified, i.e. assume there is a unique $\theta^*$ such that $H(\theta^*)=0$. Show that in the overidentified case (J > K) the minimized value of the GMM criterion function is asymptotically $\chi^2$ with J-K degrees of freedom:

\begin{displaymath}N H_N(\hat\theta_N)' [\hat\Omega_N]^{-1} H_N(\hat\theta_N) \stackrel{d}{\Longrightarrow} \chi^2(J-K),
\end{displaymath} (8)

where $H_N$ is a $J \times 1$ vector of moment conditions, $\theta$ is a $K \times 1$ vector of parameters, $\chi^2(J-K)$ is a Chi-squared random variable with J-K degrees of freedom,

\begin{displaymath}\hat\theta_N = \mathop{\it argmin}_{\theta\in\Theta} H_N(\theta)' [\hat\Omega_N]^{-1} H_N(\theta), \end{displaymath}


\begin{displaymath}H_N(\theta) = {1 \over N} \sum_{i=1}^N h(y_i,x_i,\theta), \end{displaymath}

and $\hat\Omega_N$ is a consistent estimator of $\Omega$ given by

\begin{displaymath}\Omega = E\{ h(\tilde y,\tilde x,\theta^*)h(\tilde y,\tilde x,\theta^*)'\}. \end{displaymath}

Hint: Use Taylor series expansions to provide a formula for $\sqrt N (\hat\theta_N - \theta^*) $ from the first order condition for $\hat\theta_N$

\begin{displaymath}\nabla H_N(\hat\theta_N)' \hat\Omega^{-1}_N H_N(\hat\theta_N) = 0
\end{displaymath} (9)

and a Taylor series expansion of $H_N(\hat\theta_N)$ about $\theta^*$

\begin{displaymath}H_N(\hat\theta_N) = H_N(\theta^*) + \nabla H_N(\tilde
\theta_N)(\hat\theta_N-\theta^*) \end{displaymath} (10)

where

\begin{displaymath}\nabla H_N(\theta) \equiv {1 \over N} \sum_{i=1}^N {\partial h \over
\partial \theta}(y_i,x_i,\theta)\end{displaymath} (11)

is the $(J \times K)$ matrix of partial derivatives of the moment conditions $H_N(\theta)$ with respect to $\theta$, and $\tilde \theta_N$ is a vector each of whose elements lies on the line segment joining the corresponding components of $\hat\theta_N$ and $\theta^*$. Use the above two equations to derive the following formula for $H_N(\hat\theta_N)$:

\begin{displaymath}H_N(\hat\theta_N) = M_N H_N(\theta^*) \end{displaymath} (12)

where

\begin{displaymath}M_N= \left[ I - \nabla H_N(\hat\theta_N)[\nabla H_N(\hat\theta_N)'
\hat\Omega^{-1}_N \nabla H_N(\hat\theta_N)]^{-1} \nabla
H_N(\hat\theta_N)' \hat\Omega^{-1}_N \right]. \end{displaymath} (13)

Show that with probability 1 we have $M_N \to M$, where M is a $(J \times J)$ idempotent matrix. Then, using this result and the Central Limit Theorem, show that

\begin{displaymath}\sqrt{N} H_N(\theta^*) \stackrel{d}{\Longrightarrow}
N(0,\Omega), \end{displaymath} (14)

and, using the probability result from Question 1 of Part II, show that the minimized value of the GMM criterion function does indeed converge in distribution to the $\chi^2(J-K)$ random variable as claimed in equation (8).
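A Monte Carlo sketch in Python of the overidentification result (illustrative only; the moment conditions below, estimating the mean of $N(\mu,\sigma_0^2)$ data with $\sigma_0$ treated as known, are a hypothetical example with J = 2 and K = 1, so J - K = 1):

\begin{verbatim}
# Two-step GMM with moments h = (x - mu, (x - mu)^2 - sigma0^2); the minimized
# criterion N * H_N' Omega_hat^{-1} H_N should behave like a chi-squared(1)
# variable, i.e. have mean about 1 and variance about 2 across replications.
import numpy as np
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(5)
mu_true, sigma0, N, reps = 0.0, 1.0, 500, 2_000
stats = []
for _ in range(reps):
    x = rng.normal(mu_true, sigma0, size=N)

    def H(mu):                         # 2 x 1 vector of sample moments
        return np.array([(x - mu).mean(), ((x - mu) ** 2).mean() - sigma0 ** 2])

    mu1 = minimize_scalar(lambda m: H(m) @ H(m)).x       # step 1: identity weights
    h = np.column_stack([x - mu1, (x - mu1) ** 2 - sigma0 ** 2])
    Omega_inv = np.linalg.inv(h.T @ h / N)               # estimate of Omega^{-1}

    crit = lambda m: H(m) @ Omega_inv @ H(m)             # step 2: efficient weights
    mu2 = minimize_scalar(crit).x
    stats.append(N * crit(mu2))

stats = np.array(stats)
print(stats.mean(), stats.var())       # approximately 1 and 2
\end{verbatim}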

QUESTION 2 (Consistency of Bayesian posterior) Consider a Bayesian who observes IID data $(X_1,\ldots,X_N)$, where $f(x\vert\theta)$ is the likelihood for a single observation and $p(\theta)$ is the prior density over an unknown finite-dimensional parameter $\theta \in R^K$.

A.
(10%) Use Bayes Rule to derive a formula for the posterior density of $\theta$ given $(X_1,\ldots,X_N)$.

B.
(20%) Let $P(\theta \in A\vert X_1,\ldots,X_N)$ be the posterior probability that $\theta$ is in some set $A \subset \Theta$ given the first N observations. Show that this posterior probability satisfies the law of iterated expectations:

\begin{displaymath}E\left\{ P(\theta \in A\vert X_1,\ldots,X_{N+1})\big\vert X_1,\ldots,X_N\right\}
= P(\theta \in A\vert X_1,\ldots,X_N). \end{displaymath}

C.
(20%) A martingale is a stochastic process $\{\tilde Z_t\}$ that satisfies $E\left\{\tilde Z_{t+1}\vert{\cal I}_t\right\}=\tilde Z_t$, where ${\cal I}_t$ denotes the information set at time t and includes knowledge of all past $Z_t$'s up to time t, ${\cal I}_t \supset (\tilde Z_1,\ldots,\tilde Z_t)$. Use the result in part B to show that the process $\{\tilde Z_t\}$ where $\tilde Z_t = P(\theta \in A\vert X_1,\ldots,X_t)$ is a martingale. (We are interested in martingales because the Martingale Convergence Theorem can be used to show that if $\theta$ is finite-dimensional, then the posterior distribution converges with probability 1 to a point mass on the true value of $\theta$ generating the observations $\{X_i\}$. But you don't have to know anything about this to answer this question.)

D.
(50%) Suppose that $\theta$ is restricted to the K-dimensional simplex, $\theta=(\theta_1,\ldots,\theta_K)$ with $\theta_i\in(0,1)$, $i=1,\ldots,K$, $\sum_{i=1}^K \theta_i=1$, and the distribution of $X_i$ given $\theta$ is multinomial with parameter $\theta$, i.e.

\begin{displaymath}Pr\{X_i = k\} = \theta_k, \quad k=1,\ldots,K.\end{displaymath}

Suppose the prior distribution over $\theta$, $p(\theta)$ is Dirichlet with parameter $\alpha$:

\begin{displaymath}p(\theta) = { \Gamma(\alpha_1+\cdots + \alpha_K) \over
\Gamma(\alpha_1) \cdots \Gamma(\alpha_K) } \theta_1^{\alpha_1-1} \cdots
\theta_K^{\alpha_K-1} \end{displaymath}

where both $\theta_i > 0$ and $\alpha_i > 0$, $i=1,\ldots,K$. Compute the posterior distribution and show 1) that the posterior is also Dirichlet (i.e. the Dirichlet is a conjugate family), and 2) directly that as $N \to \infty$ the posterior distribution converges to a point mass on the true parameter $\theta$ generating the data.
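An illustrative Python sketch of the conjugate updating in part D (the prior parameters and the true $\theta$ below are hypothetical): with a Dirichlet($\alpha$) prior and multinomial counts, the posterior is Dirichlet($\alpha$ + counts), and its mean approaches the true $\theta$ while its variance shrinks as N grows.

\begin{verbatim}
# Dirichlet-multinomial conjugate updating: posterior parameter = prior
# parameter + observed category counts.
import numpy as np

rng = np.random.default_rng(6)
theta_true = np.array([0.2, 0.5, 0.3])   # hypothetical true multinomial parameter
alpha = np.array([1.0, 1.0, 1.0])        # hypothetical Dirichlet prior parameter

for N in (10, 1_000, 100_000):
    counts = rng.multinomial(N, theta_true)
    alpha_post = alpha + counts                       # conjugate update
    post_mean = alpha_post / alpha_post.sum()
    post_var = post_mean * (1 - post_mean) / (alpha_post.sum() + 1)
    print(N, post_mean.round(4), post_var.max())
\end{verbatim}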

QUESTION 3 Consider the random utility model:

\begin{displaymath}\tilde u_d = v_d + \tilde \epsilon_d, \quad d=1,\ldots,D \end{displaymath} (15)

where $\tilde u_d$ is a decision-maker's payoff or utility for selecting alternative d from a set containing D possible alternatives (we assume that the individual only chooses one item). The term $v_d$ is known as the deterministic or strict utility from alternative d and the error term $\tilde \epsilon_d$ is the random component of utility. In empirical applications $v_d$ is often specified as

\begin{displaymath}v_d = X_d\beta \end{displaymath} (16)

where $X_d$ is a vector of observed covariates and $\beta$ is a vector of coefficients, to be estimated, that determine the agent's utility. The interpretation is that $X_d$ represents a vector of characteristics of the decision-maker and alternative d that are observable by the econometrician, while $\epsilon_d$ represents characteristics of the agent and alternative d that affect the utility of choosing alternative d but are unobserved by the econometrician. Define the agent's decision rule $\delta(\epsilon_1,\ldots,\epsilon_D)$ by:

\begin{displaymath}\delta(\epsilon) = \mbox{\it argmax\/}_{d=1,\ldots,D} \left[ v_d +
\tilde \epsilon_d\right] \end{displaymath} (17)

i.e. $\delta(\epsilon)$ is the optimal choice for an agent whose unobserved utility components are $\epsilon=(\epsilon_1,\ldots,\epsilon_D)$. Then the agent's choice probability $P\{d\vert X\}$ is given by:

\begin{displaymath}P\left\{ d \vert X\right\} = \int I\{ d = \delta(\epsilon)\}
f(\epsilon\vert X)d\epsilon \end{displaymath} (18)

where $X=(X_1,\ldots,X_D)$ is the vector of observed characteristics of the agent and the D alternatives, $f(\epsilon\vert X)$ is the conditional density function of the random components of utility given the values of the observed components X, and $I\{\delta(\epsilon)=d\}$ is the indicator function given by $I\{\delta(\epsilon)=d\} =1$ if $\delta(\epsilon)=d$ and 0 otherwise. Note that the integral above is actually a multivariate integral over the D components of $\epsilon=(\epsilon_1,\ldots,\epsilon_D)$, and simply represents the probability that the values of the vector of unobserved utilities $\epsilon$ lead the agent to choose alternative d.

Definition: The Social Surplus Function $U(v_1,\ldots,v_D,X)$ is given by:

\begin{displaymath}U(v_1,\ldots,v_D,X) = E\left\{ \max_{d=1,\ldots,D}[ v_d + \tilde\epsilon_d]\right\}
= \int \max_{d=1,\ldots,D}[ v_d + \epsilon_d]
f(\epsilon_1,\ldots,\epsilon_D\vert X)d\epsilon_1 \cdots
d\epsilon_D \end{displaymath} (19)

The Social Surplus function is the expected maximized utility of the agent.

A.
(50%) Prove the Williams-Daly-Zachary Theorem:

\begin{displaymath}{\partial U(v_1,\ldots,v_D,X) \over \partial v_d} = P\{d \vert X\} \end{displaymath} (20)

and discuss its relationship to Roy's Identity.

Hint: Interchange the differentiation and expectation operations when computing $\partial U/\partial v_d$:

\begin{eqnarray*}{\partial U(v_1,\ldots,v_D,X) \over \partial v_d} & =
& {\partial \over \partial v_d} \int \max_{d'=1,\ldots,D}[v_{d'} + \epsilon_{d'}]
f(\epsilon_1,\ldots,\epsilon_D\vert X)d\epsilon_1 \cdots d\epsilon_D \\
& = & \int {\partial \over \partial v_d} \max_{d'=1,\ldots,D}[v_{d'} + \epsilon_{d'}]
f(\epsilon_1,\ldots,\epsilon_D\vert X)d\epsilon_1 \cdots d\epsilon_D \end{eqnarray*}


and show that

\begin{displaymath}{\partial \over \partial v_d} \max_{d'=1,\ldots,D}[v_{d'} +
\epsilon_{d'}] = I\{d = \delta(\epsilon)\}. \end{displaymath}

B.
(50%) Consider the special case of the random utility model when $\epsilon=(\epsilon_1,\ldots,\epsilon_D)$ has a multivariate (Type I) extreme value distribution:

 \begin{displaymath}f(\epsilon\vert X) = \prod_{d=1}^D \exp\{-\epsilon_d\}\exp\left\{-\exp\{-\epsilon_d\}\right\}.
\end{displaymath} (21)

Show that the conditional choice probability $P\{d\vert X\}$ is given by the multinomial logit formula:

\begin{displaymath}P\{d\vert X\} = { \exp\{ v_d\} \over \sum_{d'=1}^D \exp\{ v_{d'}\} }.
\end{displaymath} (22)

Hint 1: Use the Williams-Daly-Zachary Theorem, showing that in the case of the extreme value distribution (21) the Social Surplus function is given by

 \begin{displaymath}U(v_1,\ldots,v_D,X)= \gamma+ \log\left[ \sum_{d=1}^D \exp\{ v_d\} \right].
\end{displaymath} (23)

where $\gamma = .577216 \ldots$ is Euler's constant.

Hint 2: To derive equation (23) show that the extreme value family is max-stable: i.e. if $(\epsilon_1,\ldots,\epsilon_D)$ are IID extreme value random variables, then $\max_d \{\epsilon_d\}$ also has an extreme value distribution. Also use the fact that the expectation of a single extreme value random variable with location parameter $\alpha$ and scale parameter $\sigma$ is given by:

\begin{displaymath}E\{\tilde \epsilon\} = \int_{-\infty}^{+\infty} {\epsilon \over \sigma}
\exp\left\{ {-(\epsilon-\alpha) \over \sigma}\right\}
\exp\left\{-\exp\left\{ {-(\epsilon-\alpha) \over \sigma}\right\}\right\}d\epsilon =
\alpha + \sigma\gamma, \end{displaymath} (24)

and the CDF is given by

\begin{displaymath}F(x\vert\alpha,\sigma) = P\{\tilde\epsilon \le x\vert\alpha,\sigma\} =
\exp\left\{ - \exp\left\{ {-(x-\alpha) \over \sigma}\right\}\right\}.
\end{displaymath} (25)

Hint 3: Let $(\epsilon_1,\ldots,\epsilon_D)$ be INID (independent, non-identically distributed) extreme value random variables with location parameters $(\alpha_1,\ldots,\alpha_D)$ and common scale parameter $\sigma$. Show that this family is max-stable by proving that $\max(\epsilon_1,\ldots,\epsilon_D)$ is an extreme value random variable with scale parameter $\sigma$ and location parameter

\begin{displaymath}\alpha = \sigma \log\left[ \sum_{d=1}^D \exp\{ \alpha_d/\sigma\}
\right] \end{displaymath} (26)
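A Monte Carlo sketch in Python of the result in part B (illustrative only; the strict utilities $v_d$ and the number of replications are hypothetical): drawing IID standard Type I extreme value errors, taking the utility-maximizing alternative, and comparing the simulated choice frequencies with the multinomial logit formula (22).

\begin{verbatim}
# Simulated argmax of v_d + Gumbel noise versus exp(v_d)/sum_d' exp(v_d').
import numpy as np

rng = np.random.default_rng(7)
v = np.array([1.0, 0.0, -0.5, 0.5])                 # hypothetical v_1, ..., v_D
reps = 200_000

eps = rng.gumbel(loc=0.0, scale=1.0, size=(reps, v.size))
choices = np.argmax(v + eps, axis=1)                # delta(eps)
freq = np.bincount(choices, minlength=v.size) / reps

logit = np.exp(v) / np.exp(v).sum()
print(np.column_stack([freq, logit]))               # two columns nearly identical
\end{verbatim}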

QUESTION 4 (Latent Variable Models) The Binary Probit Model can be viewed as a simple type of latent variable model. There is an underlying linear regression model

\begin{displaymath}\tilde y = X\beta^* + \epsilon \end{displaymath} (27)

but where the dependent variable $\tilde y$ is latent, i.e. it is not observed by the econometrician. Instead we observe the dependent variable y given by

\begin{displaymath}y = \left\{ \begin{array}{ll} 1 & \mbox{if} \quad \tilde y
> 0 \\
0 & \mbox{if} \quad \tilde y \le 0 \end{array} \right. \end{displaymath} (28)

1.
(5%) Assume that the error term $\epsilon \sim N(0,\sigma^2)$. Show that the scale of $\beta^*$ and the parameter $\sigma^2$ are not simultaneously identified, so that without loss of generality we can normalize $\sigma^2=1$ and interpret the estimated $\beta$ coefficients as being the true coefficients $\beta^*$ divided by $\sigma$:

\begin{displaymath}\beta = { \beta^* \over \sigma}. \end{displaymath} (29)

2.
(10%) Derive the conditional probability $\mbox{Pr}\{y=1\vert X\}$ in terms of X, $\beta$ and the standard normal CDF $\Phi$, and use this probability to write down the likelihood function for N IID observations of pairs $\{(y_i,X_i)\}, i=1,\ldots,N$.

3.
(20%) Show that $\beta$ can be consistently estimated by nonlinear least squares by writing down the least squares problem and sketching a proof for its consistency.

4.
(20%) Derive the asymptotic distribution of the maximum likelihood estimator by providing an analytical formula for the asymptotic covariance matrix of the MLE $\hat\beta_N$. (Hint: This is the inverse of the information matrix ${\cal I}$. Derive a formula for ${\cal I}$ in terms of $\Phi$, X and $\beta$ and possibly other terms.)

5.
(20%) Derive the asymptotic distribution of the nonlinear least squares estimator and compare it to the maximum likelihood estimator. Is the nonlinear least squares estimator asymptotically inefficient?

6.
(25%) Show that the nonlinear least squares estimator of $\beta$ is subject to heteroscedasticity by deriving an explicit formula for the conditional variance of the error term in the nonlinear regression formulation of the estimation problem. Can you form a more efficient estimator by correcting for this heteroscedasticity in a two-stage feasible GLS procedure (i.e. in stage one computing an initial consistent but inefficient estimator of $\beta$ by ordinary nonlinear least squares, and in stage two using this initial consistent estimator to correct for the heteroscedasticity, taking the stage-two estimator of $\beta$ as the feasible GLS estimator)? If so, is this feasible GLS procedure asymptotically efficient? If you believe so, provide a sketch of the derivation of the asymptotic distribution of the feasible GLS estimator. Otherwise provide a counterexample or a sketch of an argument why you believe the feasible GLS procedure is asymptotically inefficient relative to the maximum likelihood estimator.
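An illustrative Python sketch for parts 2-5 on simulated data (the design, sample size, and true coefficients below are hypothetical): it fits the binary probit model both by maximum likelihood and by nonlinear least squares on $E\{y\vert X\}=\Phi(X\beta)$, so the two estimators can be compared numerically.

\begin{verbatim}
# Probit MLE versus nonlinear least squares on the same simulated sample.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(8)
N = 5_000
X = np.column_stack([np.ones(N), rng.normal(size=N)])
beta_true = np.array([0.5, -1.0])
y = (X @ beta_true + rng.normal(size=N) > 0).astype(float)   # latent-variable rule

def neg_loglik(beta):
    p = np.clip(norm.cdf(X @ beta), 1e-12, 1 - 1e-12)
    return -(y * np.log(p) + (1 - y) * np.log(1 - p)).sum()

def ssr(beta):                          # nonlinear least squares criterion
    return ((y - norm.cdf(X @ beta)) ** 2).sum()

beta_mle = minimize(neg_loglik, x0=np.zeros(2), method="BFGS").x
beta_nls = minimize(ssr, x0=np.zeros(2), method="BFGS").x
print(beta_mle, beta_nls)               # both close to beta_true
\end{verbatim}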



 