Introductory Econometrics

Jeffrey M. Wooldridge

Chapter 9

More on Specification and Data Issues

Chapter Questions

Problem 1

(i) Apply RESET from equation (9.3) to the model estimated in Computer Exercise C5 in Chapter 7. Is there evidence of functional form misspecification in the equation?
(ii) Compute a heteroskedasticity-robust form of RESET. Does your conclusion from part (i) change?
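
As a rough illustration of part (i), the RESET F statistic can be obtained by comparing the original regression with one augmented by the squared and cubed fitted values. The sketch below uses only the Python standard library and synthetic data (not the Chapter 7 data set); the helper names `solve`, `ols`, and `reset_f` are made up for this example.

```python
import random

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def ols(X, y):
    """Return (coefficients, fitted values, SSR) for OLS of y on the columns of X."""
    k = len(X[0])
    xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
    xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
    beta = solve(xtx, xty)
    fitted = [sum(b * v for b, v in zip(beta, row)) for row in X]
    ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))
    return beta, fitted, ssr

def reset_f(X, y):
    """RESET F statistic: add squared and cubed fitted values and test them jointly."""
    n, k = len(X), len(X[0])               # k counts the intercept column
    _, fitted, ssr_r = ols(X, y)           # restricted (original) model
    X_aug = [row + [f ** 2, f ** 3] for row, f in zip(X, fitted)]
    _, _, ssr_ur = ols(X_aug, y)           # unrestricted (augmented) model
    q = 2                                  # two added terms, so two restrictions
    return ((ssr_r - ssr_ur) / q) / (ssr_ur / (n - k - q))

# Synthetic data: one outcome is truly linear in x, the other has an omitted quadratic.
random.seed(0)
x = [random.uniform(0, 2) for _ in range(200)]
y_lin = [1 + 0.5 * xi + random.gauss(0, 0.3) for xi in x]
y_quad = [1 + 0.5 * xi + 0.8 * xi ** 2 + random.gauss(0, 0.3) for xi in x]
X = [[1.0, xi] for xi in x]

f_lin = reset_f(X, y_lin)
f_quad = reset_f(X, y_quad)
print(f_lin, f_quad)
```

The statistic is compared with an F distribution with 2 and n - k - 2 degrees of freedom, where k here counts the intercept. Part (ii) calls for a heteroskedasticity-robust version of the test, which this sketch does not attempt.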

Problem 2

Use the data set WAGE2 for this exercise.
(i) Use the variable KWW (the "knowledge of the world of work" test score) as a proxy for ability
in place of IQ in Example 9.3. What is the estimated return to education in this case?
(ii) Now, use IQ and KWW together as proxy variables. What happens to the estimated return to
education?
(iii) In part (ii), are IQ and KWW individually significant? Are they jointly significant?

Problem 3

Use the data from JTRAIN for this exercise.
(i) Consider the simple regression model
$$\log(\mathit{scrap}) = \beta_{0} + \beta_{1} \mathit{grant} + u$$
where scrap is the firm scrap rate and grant is a dummy variable indicating whether a firm
received a job training grant. Can you think of some reasons why the unobserved factors in $u$
might be correlated with grant?
(ii) Estimate the simple regression model using the data for 1988. (You should have 54
observations.) Does receiving a job training grant significantly lower a firm's scrap rate?
(iii) Now, add as an explanatory variable $\log(\mathit{scrap}_{87})$. How does this change the estimated effect of grant? Interpret the coefficient on grant. Is it statistically significant at the 5% level against the one-sided alternative $\mathrm{H}_{1}: \beta_{\mathit{grant}} < 0$?
(iv) Test the null hypothesis that the parameter on $\log(\mathit{scrap}_{87})$ is one against the two-sided alternative. Report the $p$-value for the test.
(v) Repeat parts (iii) and (iv), using heteroskedasticity-robust standard errors, and briefly discuss
any notable differences.
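
The tests in parts (iii) and (iv) both rest on a t statistic of the form (estimate minus hypothesized value) divided by the standard error. The sketch below shows that arithmetic with a made-up estimate and standard error (not results from JTRAIN). The p-values use the standard normal as a large-sample stand-in for the t distribution, which is only a rough approximation with 54 observations; the function names are hypothetical.

```python
from statistics import NormalDist

def t_statistic(estimate, std_error, null_value=0.0):
    """t statistic for H0: beta = null_value."""
    return (estimate - null_value) / std_error

def approx_p_value(t, alternative="two-sided"):
    """p-value from the standard normal (an approximation to the t distribution)."""
    z = NormalDist()
    if alternative == "two-sided":
        return 2 * (1 - z.cdf(abs(t)))
    if alternative == "less":
        return z.cdf(t)
    raise ValueError("alternative must be 'two-sided' or 'less'")

# Hypothetical numbers for illustration only.
t = t_statistic(estimate=0.80, std_error=0.10, null_value=1.0)
p_two_sided = approx_p_value(t)                      # H1: beta != 1
p_one_sided = approx_p_value(t, alternative="less")  # H1: beta < 1
print(t, p_two_sided, p_one_sided)
```

For part (v), the same formula applies with the heteroskedasticity-robust standard error in the denominator.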

Problem 4

Use the data for the year 1990 in INFMRT for this exercise.
(i) Reestimate equation (9.43), but now include a dummy variable for the observation on the
District of Columbia (called DC). Interpret the coefficient on DC and comment on its size and
significance.
(ii) Compare the estimates and standard errors from part (i) with those from equation (9.44). What
do you conclude about including a dummy variable for a single observation?

Problem 5

Use the data in RDCHEM to further examine the effects of outliers on OLS estimates and to see how
LAD is less sensitive to outliers. The model is
$$\mathit{rdintens} = \beta_{0} + \beta_{1} \mathit{sales} + \beta_{2} \mathit{sales}^{2} + \beta_{3} \mathit{profmarg} + u$$
where you should first change sales to be in billions of dollars to make the estimates easier to
interpret.
(i) Estimate the above equation by OLS, both with and without the firm having annual sales of
almost \$40 billion. Discuss any notable differences in the estimated coefficients.
(ii) Estimate the same equation by LAD, again with and without the largest firm. Discuss any
important differences in estimated coefficients.
(iii) Based on your findings in (i) and (ii), would you say OLS or LAD is more resilient to outliers?
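
The contrast between the two estimators is easiest to see in the intercept-only special case, where OLS reduces to the sample mean and LAD to the sample median. The sketch below uses invented numbers (not RDCHEM) with one extreme value; it illustrates the mechanism only and does not stand in for the regressions requested above.

```python
from statistics import mean, median

# Synthetic values, made up for illustration; the last one plays the outlier.
values = [2.1, 2.4, 2.6, 2.9, 3.0, 3.3, 3.5, 40.0]

def ols_intercept(y):
    """Intercept-only OLS: minimizing squared residuals gives the sample mean."""
    return mean(y)

def lad_intercept(y):
    """Intercept-only LAD: minimizing absolute residuals gives a sample median."""
    return median(y)

with_outlier = (ols_intercept(values), lad_intercept(values))
without_outlier = (ols_intercept(values[:-1]), lad_intercept(values[:-1]))

# How far each estimate moves when the extreme value is dropped.
ols_shift = abs(with_outlier[0] - without_outlier[0])
lad_shift = abs(with_outlier[1] - without_outlier[1])
print(ols_shift, lad_shift)
```

With regressors, LAD has no closed form and is usually computed by a quantile regression routine in a statistics package.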

Problem 6

Redo Example 4.10 by dropping schools where teacher benefits are less than 1% of salary.
(i) How many observations are lost?
(ii) Does dropping these observations have any important effects on the estimated tradeoff?

Problem 7

Use the data in LOANAPP for this exercise.
(i) How many observations have $\mathit{obrat} > 40$, that is, other debt obligations more than 40% of total income?
(ii) Reestimate the model in part (iii) of Computer Exercise C8, excluding observations with $\mathit{obrat} > 40$. What happens to the estimate and $t$ statistic on white?
(iii) Does it appear that the estimate of $\beta_{\mathit{white}}$ is overly sensitive to the sample used?

Problem 8

Use the data in TWOYEAR for this exercise.
(i) The variable stotal is a standardized test variable, which can act as a proxy variable for
unobserved ability. Find the sample mean and standard deviation of stotal.
(ii) Run simple regressions of jc and univ on stotal. Are both college education variables
statistically related to stotal? Explain.
(iii) Add stotal to equation (4.17) and test the hypothesis that the returns to two- and four-year
colleges are the same against the alternative that the return to four-year colleges is greater. How
do your findings compare with those from Section 4-4?
(iv) Add $\mathit{stotal}^{2}$ to the equation estimated in part (iii). Does a quadratic in the test score variable
seem necessary?
(v) Add the interaction terms $\mathit{stotal} \cdot \mathit{jc}$ and $\mathit{stotal} \cdot \mathit{univ}$ to the equation from part (iii). Are these terms
jointly significant?
(vi) What would be your final model that controls for ability through the use of stotal? Justify your
answer.

Problem 9

In this exercise, you are to compare OLS and LAD estimates of the effects of 401(k) plan eligibility on
net financial assets. The model is
$$\mathit{nettfa} = \beta_{0} + \beta_{1} \mathit{inc} + \beta_{2} \mathit{inc}^{2} + \beta_{3} \mathit{age} + \beta_{4} \mathit{age}^{2} + \beta_{5} \mathit{male} + \beta_{6} \mathit{e401k} + u$$
(i) Use the data in 401KSUBS to estimate the equation by OLS and report the results in the usual form. Interpret the coefficient on $\mathit{e401k}$.
(ii) Use the OLS residuals to test for heteroskedasticity using the Breusch-Pagan test. Is $u$
independent of the explanatory variables?
(iii) Estimate the equation by LAD and report the results in the same form as for OLS. Interpret the
LAD estimate of $\beta_{6}$ .
(iv) Reconcile your findings from parts (i) and (iii).

Problem 10

You need to use two data sets for this exercise, JTRAIN2 and JTRAIN3. The former is the outcome of
a job training experiment. The file JTRAIN3 contains observational data, where individuals themselves
largely determine whether they participate in job training. The data sets cover the same time period.
(i) In the data set JTRAIN2, what fraction of the men received job training? What is the fraction in
JTRAIN3? Why do you think there is such a big difference?
(ii) Using JTRAIN2, run a simple regression of $\mathit{re78}$ on train. What is the estimated effect of
participating in job training on real earnings?
(iii) Now add as controls to the regression in part (ii) the variables re74, re75, educ, age, black,
and hisp. Does the estimated effect of job training on $\mathit{re78}$ change much? How come? (Hint:
Remember that these are experimental data.)
(iv) Do the regressions in parts (ii) and (iii) using the data in JTRAIN3, reporting only the estimated
coefficients on train, along with their $t$ statistics. What is the effect now of controlling for the
extra factors, and why?
(v) Define $\mathit{avgre} = (\mathit{re74} + \mathit{re75})/2$. Find the sample averages, standard deviations, and minimum and maximum values in the two data sets. Are these data sets representative of the same
populations in 1978?
(vi) Almost 96% of men in the data set JTRAIN2 have avgre less than \$10,000. Using only these men, run the regression and report the training estimate and its $t$ statistic. Run the same regression for JTRAIN3, using only men with $\mathit{avgre} \leq 10$. For the subsample of low-income men, how do the estimated training effects compare across the experimental and nonexperimental data sets?
(vii) Now use each data set to run the simple regression $\mathit{re78}$ on train, but only for men who were unemployed in 1974 and 1975. How do the training estimates compare now?
(viii) Using your findings from the previous regressions, discuss the potential importance of having
comparable populations underlying comparisons of experimental and nonexperimental estimates.
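
The data preparation in parts (v) through (vii) amounts to building avgre, summarizing it, and filtering to subsamples. A minimal sketch follows, using a handful of invented records (earnings in thousands of dollars, not drawn from JTRAIN2 or JTRAIN3). Treating zero earnings in both years as "unemployed in 1974 and 1975" is an assumption made here for illustration.

```python
from statistics import mean, stdev

# Hypothetical earnings records in thousands of dollars.
men = [
    {"re74": 0.0, "re75": 0.0},
    {"re74": 2.0, "re75": 4.0},
    {"re74": 12.0, "re75": 14.0},
    {"re74": 0.0, "re75": 3.0},
    {"re74": 25.0, "re75": 31.0},
]

# Part (v): average of real earnings in 1974 and 1975.
for m in men:
    m["avgre"] = (m["re74"] + m["re75"]) / 2

def summarize(values):
    """Sample average, standard deviation, minimum, and maximum."""
    return {"mean": mean(values), "sd": stdev(values),
            "min": min(values), "max": max(values)}

avgre_summary = summarize([m["avgre"] for m in men])

# Part (vi): low-income subsample (avgre of at most 10, i.e. $10,000).
low_income = [m for m in men if m["avgre"] <= 10]

# Part (vii): assumption, zero earnings in both years proxies for unemployment.
unemployed_both = [m for m in men if m["re74"] == 0 and m["re75"] == 0]
print(avgre_summary, len(low_income), len(unemployed_both))
```

The same steps would be run separately on each data set before comparing the regression estimates.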

Problem 11

Use the data in MURDER only for the year 1993 for this question, although you will need to first
obtain the lagged murder rate, say $\mathit{mrdrte}_{-1}$.
(i) Run the regression of mrdrte on exec, unem. What are the coefficient and $t$ statistic on exec?
Does this regression provide any evidence for a deterrent effect of capital punishment?
(ii) How many executions are reported for Texas during 1993? (Actually, this is the sum of
executions for the current and past two years.) How does this compare with the other states?
Add a dummy variable for Texas to the regression in part (i). Is its $t$ statistic unusually large? From this, does it appear Texas is an "outlier"?
(iii) To the regression in part (i) add the lagged murder rate. What happens to $\hat{\beta}_{\text {exec}}$ and its statistical significance?
(iv) For the regression in part (iii), does it appear Texas is an outlier? What is the effect on $\hat{\beta}_{\text {exec}}$ from dropping Texas from the regression?

Problem 12

Use the data in ELEM94_95 to answer this question. See also Computer Exercise C10 in Chapter 4.
(i) Using all of the data, run the regression lavgsal on bs, lenrol, lstaff, and lunch. Report the coefficient on $\mathit{bs}$ along with its usual and heteroskedasticity-robust standard errors. What do you
conclude about the economic and statistical significance of $\hat{\beta}_{\mathit{bs}}$?
(ii) Now drop the four observations with $\mathit{bs} > .5$, that is, where average benefits are (supposedly)
more than 50% of average salary. What is the coefficient on $\mathit{bs}$? Is it statistically significant
using the heteroskedasticity-robust standard error?
(iii) Verify that the four observations with $\mathit{bs} > .5$ are 68, 1,127, 1,508, and 1,670. Define four
dummy variables for each of these observations. (You might call them $d68$, $d1127$, $d1508$,
and $d1670$.) Add these to the regression from part (i) and verify that the OLS coefficients
and standard errors on the other variables are identical to those in part (ii). Which of the four
dummies has a $t$ statistic statistically different from zero at the 5% level?
(iv) Verify that, in this data set, the data point with the largest studentized residual (largest $t$ statistic on the dummy variable) in part (iii) has a large influence on the OLS estimates. (That is, run
OLS using all observations except the one with the large studentized residual.) Does dropping,
in turn, each of the other observations with $\mathit{bs} > .5$ have important effects?
(v) What do you conclude about the sensitivity of OLS to a single observation, even with a large
sample size?
(vi) Verify that the LAD estimator is not sensitive to the inclusion of the observation identified in
part (iii).

Problem 13

Use the data in CEOSAL2 to answer this question.
(i) Estimate the model
$$\mathit{lsalary} = \beta_{0} + \beta_{1} \mathit{lsales} + \beta_{2} \mathit{lmktval} + \beta_{3} \mathit{ceoten} + \beta_{4} \mathit{ceoten}^{2} + u$$
by OLS using all of the observations, where lsalary, lsales, and lmktval are all natural
logarithms. Report the results in the usual form with the usual OLS standard errors. (You may
verify that the heteroskedasticity-robust standard errors are similar.)
(ii) In the regression from part (i) obtain the studentized residuals; call these $\mathit{str}_{i}$. How many
studentized residuals are above 1.96 in absolute value? If the studentized residuals were
independent draws from a standard normal distribution, about how many would you expect to
be above two in absolute value with 177 draws?
(iii) Reestimate the equation in part (i) by OLS using only the observations with $|\mathit{str}_{i}| \leq 1.96$. How do the coefficients compare with those in part (i)?
(iv) Estimate the equation in part (i) by LAD, using all of the data. Is the estimate of $\beta_{1}$ closer to the OLS estimate using the full sample or the restricted sample? What about for $\beta_{3}$?
(v) Evaluate the following statement: "Dropping outliers based on extreme values of studentized
residuals makes the resulting OLS estimates closer to the LAD estimates on the full sample."
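
For reference, a studentized residual can be computed without rerunning the regression once per observation. The sketch below does this for a simple regression on synthetic data (not CEOSAL2), using the leave-one-out form that matches the t statistic on an observation-specific dummy. It also evaluates the normal-theory benchmark count for 177 draws at the 1.96 cutoff. The function name and the data are invented for this example.

```python
from math import sqrt
from statistics import NormalDist, mean

def studentized_residuals(x, y):
    """Externally studentized residuals for the simple regression of y on x."""
    n = len(x)
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    ssr = sum(e ** 2 for e in resid)
    out = []
    for xi, e in zip(x, resid):
        h = 1 / n + (xi - xbar) ** 2 / sxx            # leverage of this observation
        s2_del = (ssr - e ** 2 / (1 - h)) / (n - 3)   # error variance with it left out
        out.append(e / sqrt(s2_del * (1 - h)))
    return out

# Synthetic data: a linear relationship with small noise and one shifted point.
x = list(range(1, 11))
noise = [0.1, -0.2, 0.15, -0.05, 0.0, 0.2, -0.1, 0.05, -0.15, 0.1]
y = [2 + 0.5 * xi + ni for xi, ni in zip(x, noise)]
y[6] += 5.0                                           # the artificial outlier

str_i = studentized_residuals(x, y)
n_large = sum(abs(s) > 1.96 for s in str_i)
largest = max(range(len(str_i)), key=lambda i: abs(str_i[i]))

# Benchmark: expected number of |z| > 1.96 among 177 independent N(0, 1) draws.
expected_large = 177 * 2 * (1 - NormalDist().cdf(1.96))
print(n_large, largest, expected_large)
```

With several regressors the leverage comes from the diagonal of the hat matrix, which regression software typically reports alongside the studentized residuals.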

Problem 14

Use the data in ECONMATH to answer this question. The population model is
$$\mathit{score} = \beta_{0} + \beta_{1} \mathit{act} + u$$
(i) For how many students is the ACT score missing? What is the fraction of the sample?
Define a new variable, actmiss, which equals one if act is missing, and zero otherwise.
(ii) Create a new variable, say act0, which is the act score when act is reported and zero when act is
missing. Find the average of act0 and compare it with the average for act.
(iii) Run the simple regression of score on act using only the complete cases. What do you obtain
for the slope coefficient and its heteroskedasticity-robust standard error?
(iv) Run the simple regression of score on $\mathit{act0}$ using all of the cases. Compare the slope coefficient with that in part (iii) and comment.
(v) Now use all of the cases and run the regression
$$\mathit{score}_{i} \text{ on } \mathit{act0}_{i}, \mathit{actmiss}_{i}.$$
What is the slope estimate on $\mathit{act0}_{i}$? How does it compare with the answers in parts (iii)
and (iv)?
(vi) Comparing regressions (iii) and (v), does using all cases and adding the missing data indicator
improve estimation of $\beta_{1}$?
(vii) If you add the variable colgpa to the regressions in parts (iii) and (v), does this change your
answer to part (vi)?
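
The variable construction in parts (i) and (ii) and the complete-case regression in part (iii) can be sketched as follows. The scores are invented (not ECONMATH), `None` marks a missing ACT score, and the helper `simple_ols_robust` is a hypothetical name. The robust standard error shown is the basic (HC0) form for a simple regression; statistical packages often apply a degrees-of-freedom adjustment, so their numbers can differ slightly.

```python
from math import sqrt
from statistics import mean

# Hypothetical data for illustration only.
act = [24, None, 19, 30, None, 27, 22, 21]
score = [78, 65, 60, 90, 70, 85, 72, 66]

# Parts (i) and (ii): missing-data indicator and zero-filled ACT score.
actmiss = [1 if a is None else 0 for a in act]
act0 = [0 if a is None else a for a in act]
frac_missing = sum(actmiss) / len(act)
mean_act = mean([a for a in act if a is not None])
mean_act0 = mean(act0)

def simple_ols_robust(x, y):
    """Slope of y on x and its heteroskedasticity-robust (HC0) standard error."""
    xbar, ybar = mean(x), mean(y)
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / sxx
    intercept = ybar - slope * xbar
    resid = [yi - intercept - slope * xi for xi, yi in zip(x, y)]
    robust_se = sqrt(sum((xi - xbar) ** 2 * e ** 2 for xi, e in zip(x, resid))) / sxx
    return slope, robust_se

# Part (iii): simple regression using only the complete cases.
complete = [(a, s) for a, s in zip(act, score) if a is not None]
slope_cc, se_cc = simple_ols_robust([a for a, _ in complete],
                                    [s for _, s in complete])
print(frac_missing, mean_act, mean_act0, slope_cc, se_cc)
```

The regressions in parts (iv) and (v) use all cases, with `act0` alone and then with `act0` and `actmiss` together.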
