Statistics: Unlocking the Power of Data

Robin H. Lock, Patti Frazer Lock, Kari Lock Morgan

Chapter 9

Inference for Regression

Section 1

Inference for Slope and Correlation

Problem 1

In Exercises 9.1 to 9.4, use the computer output (from different computer packages) to estimate the intercept $\beta_{0}$ and the slope $\beta_{1}$, and to give the equation for the least squares line for the sample. Assume the response variable is $Y$ in each case.
$$
\begin{aligned}
&\text { The regression equation is } Y=29.3+4.30 \mathrm{X}\\
&\begin{array}{lrrrr}
\text { Predictor } & \text { Coef } & \text { SE Coef } & \text { T } & \text { P } \\
\text { Constant } & 29.266 & 6.324 & 4.63 & 0.000 \\
\text { X } & 4.2969 & 0.6473 & 6.64 & 0.000
\end{array}
\end{aligned}
$$

Sheryl Ezze

Numerade Educator

Problem 2

Use the computer output (from different computer packages) to estimate the intercept $\beta_{0}$ and the slope $\beta_{1}$, and to give the equation for the least squares line for the sample. Assume the response variable is $Y$ in each case.
$$
\begin{aligned}
&\text { The regression equation is }\\
&Y=808-3.66 \mathrm{~A}\\
&\begin{array}{lrrrr}
\text { Predictor } & \text { Coef } & \text { SE Coef } & \text { T } & \text { P } \\
\text { Constant } & 807.79 & 87.78 & 9.20 & 0.000 \\
\text { A } & -3.659 & 1.199 & -3.05 & 0.006
\end{array}
\end{aligned}
$$

Sheryl Ezze

Numerade Educator

Problem 3

Use the computer output (from different computer packages) to estimate the intercept $\beta_{0}$ and the slope $\beta_{1}$, and to give the equation for the least squares line for the sample. Assume the response variable is $Y$ in each case.
$$
\begin{array}{lrrrr}
\text { Coefficients: } & \text { Estimate } & \text { Std.Error } & \mathrm{t} \text { value } & \operatorname{Pr}(>|\mathrm{t}|) \\
\text { (Intercept) } & 77.44 & 14.43 & 5.37 & 0.000 \\
\text { Score } & -15.904 & 5.721 & -2.78 & 0.012
\end{array}
$$

Sheryl Ezze

Numerade Educator

Problem 4

Use the computer output (from different computer packages) to estimate the intercept $\beta_{0}$ and the slope $\beta_{1}$, and to give the equation for the least squares line for the sample. Assume the response variable is $Y$ in each case.
$$
\begin{array}{lrrrr}
\text { Coefficients: } & \text { Estimate } & \text { Std.Error } & \mathrm{t} \text { value } & \mathrm{Pr}(>|\mathrm{t}|) \\
\text { (Intercept) } & 7.277 & 1.167 & 6.24 & 0.000 \\
\text { Dose } & -0.3560 & 0.2007 & -1.77 & 0.087
\end{array}
$$

Sheryl Ezze

Numerade Educator

Problem 5

Exercises 9.5 to 9.8 show some computer output for fitting simple linear models. State the value of the sample slope for each model and give the null and alternative hypotheses for testing if the slope in the population is different from zero. Identify the p-value and use it (and a $5 \%$ significance level) to make a clear conclusion about the effectiveness of the model.
$$
\begin{aligned}
&\text { The regression equation is } \mathrm{Y}=89.4-8.20 \mathrm{X}\\
&\begin{array}{lrrrr}
\text { Predictor } & \text { Coef } & \text { SE Coef } & \mathrm{T} & \mathrm{P} \\
\text { Constant } & 89.406 & 4.535 & 19.71 & 0.000 \\
\mathrm{X} & -8.1952 & 0.9563 & -8.57 & 0.000
\end{array}
\end{aligned}
$$

Sheryl Ezze

Numerade Educator

Problem 6

The computer output below shows a fitted simple linear model. State the value of the sample slope and give the null and alternative hypotheses for testing if the slope in the population is different from zero. Identify the p-value and use it (and a $5 \%$ significance level) to make a clear conclusion about the effectiveness of the model.
$$
\begin{aligned}
&\text { The regression equation is } \mathrm{Y}=82.3-0.0241 \mathrm{X}\\
&\begin{array}{lrrrr}
\text { Predictor } & \text { Coef } & \text { SE Coef } & \mathrm{T} & \mathrm{P} \\
\text { Constant } & 82.29 & 11.80 & 6.97 & 0.000 \\
\mathrm{X} & -0.02413 & 0.02018 & -1.20 & 0.245
\end{array}
\end{aligned}
$$

Sheryl Ezze

Numerade Educator

Problem 7

The computer output below shows a fitted simple linear model. State the value of the sample slope and give the null and alternative hypotheses for testing if the slope in the population is different from zero. Identify the p-value and use it (and a $5 \%$ significance level) to make a clear conclusion about the effectiveness of the model.
$$
\begin{array}{lrrrr}
\text { Coefficients: } & \text { Estimate } & \text { Std.Error } & \mathrm{t} \text { value } & \operatorname{Pr}(>|\mathrm{t}|) \\
\text { (Intercept) } & 7.277 & 1.167 & 6.24 & 0.000 \\
\text { Dose } & -0.3560 & 0.2007 & -1.77 & 0.087
\end{array}
$$

Sheryl Ezze

Numerade Educator

Problem 8

The computer output below shows a fitted simple linear model. State the value of the sample slope and give the null and alternative hypotheses for testing if the slope in the population is different from zero. Identify the p-value and use it (and a $5 \%$ significance level) to make a clear conclusion about the effectiveness of the model.
$$
\begin{array}{lrrrr}
\text { Coefficients: } & \text { Estimate } & \text { Std.Error } & \mathrm{t} \text { value } & \operatorname{Pr}(>|\mathrm{t}|) \\
\text { (Intercept) } & 807.79 & 87.78 & 9.20 & 0.000 \\
\mathrm{~A} & -3.659 & 1.199 & -3.05 & 0.006
\end{array}
$$

Sheryl Ezze

Numerade Educator
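As a rough check on the outputs in Problems 5 to 8, the sketch below recomputes each slope's t-statistic as the estimate divided by its standard error and applies the $5 \%$ rule to the reported p-value. The hypotheses are $H_{0}: \beta_{1}=0$ versus $H_{a}: \beta_{1} \neq 0$. The helper names `slope_t` and `decide` are illustrative choices, not part of any statistics package.

```python
# Slope rows copied from the outputs in Problems 5 to 8:
# (label, estimate, standard error, reported t, reported p-value)
rows = [
    ("Problem 5, X", -8.1952, 0.9563, -8.57, 0.000),
    ("Problem 6, X", -0.02413, 0.02018, -1.20, 0.245),
    ("Problem 7, Dose", -0.3560, 0.2007, -1.77, 0.087),
    ("Problem 8, A", -3.659, 1.199, -3.05, 0.006),
]

ALPHA = 0.05  # significance level named in the exercises


def slope_t(estimate, std_error):
    """t-statistic for H0: beta1 = 0 versus Ha: beta1 != 0."""
    return estimate / std_error


def decide(p_value, alpha=ALPHA):
    """Return the test decision at the given significance level."""
    return "reject H0" if p_value < alpha else "do not reject H0"


for label, est, se, t_reported, p in rows:
    t = slope_t(est, se)
    print(f"{label}: t = {t:.2f} (reported {t_reported}), p = {p}: {decide(p)}")
```

Rejecting $H_{0}$ here means the predictor appears to be effective in the model. Failing to reject means the sample does not give convincing evidence of a linear relationship.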

Problem 9

In Exercises 9.9 and 9.10, find and interpret a $95 \%$ confidence interval for the slope of the model indicated.
The model given by the output in Exercise 9.5, with $n=24$.

Sheryl Ezze

Numerade Educator

Problem 10

Find and interpret a $95 \%$ confidence interval for the slope of the model indicated.
The model given by the output in Exercise 9.7, with $n=30$.

Sheryl Ezze

Numerade Educator
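A minimal sketch of the interval calculation for Problems 9 and 10, using the form $b_{1} \pm t^{*} \cdot SE$ with $n-2$ degrees of freedom. The slope and standard error come from the outputs in Exercises 9.5 and 9.7. The critical values $t^{*}$ are assumed from a standard t-table (they are not given in the exercises), and `slope_ci` is a hypothetical helper name.

```python
def slope_ci(estimate, std_error, t_star):
    """Interval estimate +/- t_star * std_error for the population slope."""
    margin = t_star * std_error
    return estimate - margin, estimate + margin


# t* for 95% confidence with df = n - 2, taken from a standard t-table
# (assumed values, not part of the exercise text).
T_STAR_DF22 = 2.074  # Problem 9: n = 24
T_STAR_DF28 = 2.048  # Problem 10: n = 30

ci_9 = slope_ci(-8.1952, 0.9563, T_STAR_DF22)
ci_10 = slope_ci(-0.3560, 0.2007, T_STAR_DF28)

print(f"Problem 9:  ({ci_9[0]:.3f}, {ci_9[1]:.3f})")
print(f"Problem 10: ({ci_10[0]:.3f}, {ci_10[1]:.3f})")
```

An interval that contains zero, as the second one does, is consistent with a slope test that fails to reject $H_{0}: \beta_{1}=0$ at the $5 \%$ level.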

Problem 11

In Exercises 9.11 to 9.14, test the correlation, as indicated. Show all details of the test.
Test for a positive correlation; $r=0.35 ; n=30$.

Sheryl Ezze

Numerade Educator

Problem 12

Test the correlation, as indicated. Show all details of the test.
Test for evidence of a linear association; $r=0.28 ; n=10$.

Sheryl Ezze

Numerade Educator

Problem 13

Test the correlation, as indicated. Show all details of the test.
Test for evidence of a linear association; $r=0.28 ; n=100$.

Sheryl Ezze

Numerade Educator

Problem 14

Test the correlation, as indicated. Show all details of the test.
Test for a negative correlation; $r=-0.41$; $n=18$.

Sheryl Ezze

Numerade Educator
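The correlation tests in Problems 11 to 14 all use the statistic $t=r \sqrt{n-2} / \sqrt{1-r^{2}}$ with $n-2$ degrees of freedom. The sketch below computes it with the Python standard library only. Because the standard library has no t-distribution, the tail area is approximated by Simpson's rule on the t density, which is a choice made for this sketch rather than the textbook's method. All function names are hypothetical.

```python
import math


def corr_t(r, n):
    """t-statistic for H0: rho = 0, with n - 2 degrees of freedom."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


def t_pdf(x, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    log_c = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
             - 0.5 * math.log(df * math.pi))
    return math.exp(log_c - (df + 1) / 2 * math.log1p(x * x / df))


def t_upper_tail(t, df, steps=2000):
    """P(T > t), using Simpson's rule on the density from 0 to |t|."""
    a = abs(t)
    h = a / steps  # steps must be even for Simpson's rule
    total = t_pdf(0.0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area = total * h / 3   # P(0 < T < |t|)
    tail = 0.5 - area      # P(T > |t|)
    return tail if t >= 0 else 1 - tail


def corr_p_value(r, n, alternative):
    """p-value for a 'greater', 'less', or 'two-sided' alternative."""
    t, df = corr_t(r, n), n - 2
    if alternative == "greater":
        return t_upper_tail(t, df)
    if alternative == "less":
        return t_upper_tail(-t, df)  # P(T < t) = P(T > -t) by symmetry
    return 2 * t_upper_tail(abs(t), df)


cases = [
    ("Problem 11", 0.35, 30, "greater"),
    ("Problem 12", 0.28, 10, "two-sided"),
    ("Problem 13", 0.28, 100, "two-sided"),
    ("Problem 14", -0.41, 18, "less"),
]
for label, r, n, alt in cases:
    print(f"{label}: t = {corr_t(r, n):.3f}, df = {n - 2}, "
          f"p-value ({alt}) = {corr_p_value(r, n, alt):.4f}")
```

Problems 12 and 13 share the same $r$ but differ in $n$, so comparing their p-values shows how sample size affects the strength of evidence.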

Problem 15

Student Survey: Correlation Matrix. A correlation matrix allows us to see lots of correlations at once, between many pairs of variables. A correlation matrix for several variables (Exercise, TV, Height, Weight, and GPA) in the StudentSurvey dataset is given. For any pair of variables (indicated by the row and the column), we are given two values: the correlation as the top number and the p-value for a two-tail test of the correlation right beneath it.
(a) Which two variables are most strongly positively correlated? What is the correlation? What is the p-value? What does a positive correlation mean in this situation?
(b) Which two variables are most strongly negatively correlated? What is the correlation? What is the p-value? What does a negative correlation mean in this situation?
(c) At a $5 \%$ significance level, list any pairs of variables for which there is not convincing evidence of a linear association.

Sheryl Ezze

Numerade Educator

Problem 16

The dataset NBAPlayers2015 is introduced on page 91 and contains information on many variables for players in the NBA (National Basketball Association) during the 2014-2015 season. The dataset includes information for all players who averaged more than 24 minutes per game $(n=182)$ and 25 variables, including Age, Points (number of points for the season per game), FTPct (free throw shooting percentage), Rebounds (number of rebounds for the season), and Steals (number of steals for the season). A correlation matrix for these five variables is shown. A correlation matrix allows us to see lots of correlations at once, between many pairs of variables. For any pair of variables (indicated by the row and the column), we are given two values: the correlation as the top number and the p-value for a two-tail test of the correlation right beneath it.
(a) Which two variables are most strongly positively correlated? What is the correlation? What is the p-value? What does a positive correlation mean in this situation?
(b) Which two variables are most strongly negatively correlated? What is the correlation? What is the p-value? What does a negative correlation mean in this situation?
(c) At a $5 \%$ significance level, list any pairs of variables for which there is not convincing evidence of a linear association.

Sheryl Ezze

Numerade Educator

Problem 17

Verbal SAT as a Predictor of GPA. A scatterplot with regression line is shown in Figure 9.7 for a regression model using Verbal SAT score, VerbalSAT, to predict grade point average in college, GPA, using the data in StudentSurvey. We also show computer output below of the regression analysis.
(a) Use the scatterplot to determine whether we should have any significant concerns about the conditions being met for using a linear model with these data.
(b) Use the fitted model to predict the GPA of a person with a score on the Verbal SAT exam of 650.
(c) What is the estimated slope in this regression model? Interpret the slope in context.
(d) What is the test statistic for a test of the slope? What is the p-value? What is the conclusion of the test, in context?
(e) What is $R^{2}$? Interpret it in context.

Sheryl Ezze

Numerade Educator

Problem 18

Data 4.1 on page 258 introduces a study that examines the effect of light at night on weight gain in mice. In the full study of 27 mice over a four-week period, the mice who had a light on at night gained significantly more weight than the mice with darkness at night, despite eating the same number of calories and exercising the same amount. Researchers noticed that the mice with light at night ate a greater percentage of their calories during the day (when mice are supposed to be sleeping). The computer output shown below allows us to examine the relationship between percent of calories eaten during the day, DayPct, and body mass gain in grams, BMGain. A scatterplot with regression line is shown in Figure 9.8.
(a) Use the scatterplot to determine whether we should have any strong concerns about the conditions being met for using a linear model with these data.
(b) What is the correlation between these two variables? What is the p-value from a test of the correlation? What is the conclusion of the test, in context?
(c) What is the least squares line to predict body mass gain from percent daytime consumption? What gain is predicted for a mouse that eats $50 \%$ of its calories during the day (DayPct $=50$)?
(d) What is the estimated slope for this regression model? Interpret the slope in context.
(e) What is the p-value for a test of the slope? What is the conclusion of the test, in context?
(f) What is the relationship between the p-value of the correlation test and the p-value of the slope test?
(g) What is $R^{2}$ for this linear model? Interpret it in context.
(h) Verify that the correlation squared gives the coefficient of determination $R^{2}$.

Sheryl Ezze

Numerade Educator

Problem 19

A recent study in Great Britain$^{4}$ examines the relationship between the number of friends an individual has on Facebook and grey matter density in the areas of the brain associated with social perception and associative memory. The data are available in the dataset FacebookFriends and the relevant variables are GMdensity (normalized $z$-scores of grey matter density in the relevant regions) and FBfriends (the number of friends on Facebook). The study included 40 students at City University London. A scatterplot of the data is shown in Figure 9.9 and computer output for both correlation and regression is shown below.
(a) Use the scatterplot to determine whether any of the study participants had grey matter density scores more than two standard deviations from the mean. (Hint: The grey matter density scores used in the scatterplot are $z$-scores!) If so, in each case, indicate if the grey matter density score is above or below the mean and estimate the number of Facebook friends for the individual.
(b) Use the scatterplot to determine whether we should have any significant concerns about the conditions being met for using a linear model with these data.
(c) What is the correlation between these two variables? What is the p-value from a test of the correlation? What is the conclusion of the test, in context?
(d) What is the least squares line to predict the number of Facebook friends based on the normalized grey matter density score? What number of Facebook friends is predicted for a person whose normalized score is $0$? Whose normalized score is $+1$? Whose normalized score is $-1$?
(e) What is the p-value for a test of the slope? Compare it to the p-value for the test of correlation.
(f) What is $R^{2}$ for this linear model? Interpret it in context.

Lucas Finney

Numerade Educator

Problem 20

In Exercise 9.19, we give computer output for a regression line to predict the number of Facebook friends a student will have, based on a normalized score of the grey matter density in the areas of the brain associated with social perception and associative memory. Data for the sample of $n=40$ students are stored in FacebookFriends.
(a) What is the slope in this regression analysis? What is the standard error for the slope?
(b) Use the information from part (a) to calculate the test statistic to test the slope to determine whether GMdensity is an effective predictor of FBfriends. Give the hypotheses for the test, find the p-value, and make a conclusion. Show your work. Verify the values of the test statistic and the p-value using the computer output in Exercise 9.19.
(c) Use the information from part (a) to find and interpret a $95 \%$ confidence interval for the slope.

Sheryl Ezze

Numerade Educator

Problem 21

The FloridaLakes dataset, introduced in Data 2.4, includes data on 53 lakes in Florida. Two of the variables recorded are pH (acidity of the lake water) and AvgMercury (average mercury level for a sample of fish from each lake). We wish to use the pH of the lake water (which is easy to measure) to predict average mercury levels in fish, which is harder to measure. A scatterplot of the data is shown in Figure 2.49(a) on page 109 and we see that the conditions for fitting a linear model are reasonably met. Computer output for the regression analysis is shown below.
(a) Use the fitted model to predict the average mercury level in fish for a lake with a pH of 6.0.
(b) What is the slope in the model? Interpret the slope in context.
(c) What is the test statistic for a test of the slope? What is the p-value? What is the conclusion of the test, in context?
(d) Compute and interpret a $95 \%$ confidence interval for the slope.
(e) What is $R^{2}$? Interpret it in context.

Sheryl Ezze

Numerade Educator

Problem 22

The FloridaLakes dataset, introduced in Data 2.4, includes data on 53 lakes in Florida. Figure 9.10 shows a scatterplot of Alkalinity (concentration of calcium carbonate in $\mathrm{mg}/\mathrm{L}$) and AvgMercury (average mercury level for a sample of fish from each lake). Explain, using the conditions for a linear model, why we might hesitate to fit a linear model to these data to use Alkalinity to predict average mercury levels in fish.

Sheryl Ezze

Numerade Educator

Problem 23

Hantavirus is carried by wild rodents and causes severe lung disease in humans. A study$^{5}$ on the California Channel Islands found that increased prevalence of the virus was linked with greater precipitation. The study adds, "Precipitation accounted for $79 \%$ of the variation in prevalence."
(a) What notation or terminology do we use for the value $79 \%$ in this context?
(b) What is the response variable? What is the explanatory variable?
(c) What is the correlation between the two variables?

Sheryl Ezze

Numerade Educator
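For part (c) of Problem 23, a sketch of how the quoted percentage might be connected to a correlation. It assumes a simple linear model with precipitation as the single predictor, so that $R^{2}=r^{2}$, and it takes the sign as positive because higher prevalence was linked with greater precipitation:

$$
R^{2}=0.79 \quad \Longrightarrow \quad r=+\sqrt{0.79} \approx 0.89
$$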

Problem 24

Teams in the National Football League (NFL) in the US play four pre-season games each year before the regular season starts. Do teams that do well in the pre-season tend to also do well in the regular season? We are interested in whether there is a positive linear association between the number of wins in the pre-season and the number of wins in the regular season for teams in the NFL.
(a) What are the null and alternative hypotheses for this test?
(b) The correlation between these two variables for the 32 NFL teams over the 10-year period from 2005 to 2014 is 0.067. Use this sample (with $n=320$) to calculate the appropriate test statistic and determine the p-value for the test.
(c) State the conclusion in context, using a $5 \%$ significance level.
(d) When an NFL team goes undefeated in the pre-season, should the fans expect lots of wins in the regular season?

Sheryl Ezze

Numerade Educator
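For part (b) of Problem 24, the test statistic can be sketched with the usual formula for a test of correlation, with $n-2=318$ degrees of freedom:

$$
t=\frac{r \sqrt{n-2}}{\sqrt{1-r^{2}}}=\frac{0.067 \sqrt{318}}{\sqrt{1-0.067^{2}}} \approx 1.20
$$

For the one-sided alternative of a positive correlation, the upper-tail area beyond $t \approx 1.20$ is roughly 0.12, which is well above 0.05.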

Problem 25

The Honeybee dataset introduced in Exercise 2.218 on page 135 shows an estimated number of honeybee colonies in the United States for the years 1995 through 2012 (18 years). The correlation between year and number of colonies from these data is $r=-0.41$.
(a) Treating these as a sample of years, do we have significant evidence that the number of honeybee colonies is linearly related to year? Give the t-statistic and the p-value, as well as a conclusion in context.
(b) What percent of the variability in number of honeybee colonies can be explained by year in these data?

Sheryl Ezze

Numerade Educator
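A sketch of the two calculations behind Problem 25, using $r=-0.41$ and $n=18$ from the problem statement (so $n-2=16$ degrees of freedom):

$$
t=\frac{r \sqrt{n-2}}{\sqrt{1-r^{2}}}=\frac{-0.41 \sqrt{16}}{\sqrt{1-(-0.41)^{2}}} \approx-1.80, \qquad R^{2}=(-0.41)^{2} \approx 0.168
$$

With 16 degrees of freedom, $|t| \approx 1.80$ falls between the two-tailed critical values for the $10 \%$ and $5 \%$ levels (about 1.746 and 2.120, assumed from a standard t-table), so the two-tailed p-value lies between 0.05 and 0.10.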

Problem 26

The dataset HomesForSaleCA contains a random sample of 30 houses for sale in California. We are interested in whether there is a positive association between the number of bathrooms and number of bedrooms in each house.
(a) What are the null and alternative hypotheses for testing the correlation?
(b) Find the correlation in the sample.
(c) Calculate (or use technology to find) the appropriate test statistic, and determine the p-value.
(d) State the conclusion in context.

Foster Wisusik

Numerade Educator

Problem 27

A random sample of 50 countries is stored in the dataset SampCountries. Two variables in the dataset are life expectancy (LifeExpectancy) and percentage of government expenditure spent on health care (Health) for each country. We are interested in whether or not the percent spent on health care can be used to effectively predict life expectancy.
(a) What are the cases in this model?
(b) Create a scatterplot with regression line and use it to determine whether we should have any serious concerns about the conditions being met for using a linear model with these data.
(c) Run the simple linear regression, and report and interpret the slope.
(d) Find and interpret a $95 \%$ confidence interval for the slope.
(e) Is the percentage of government expenditure on health care a significant predictor of life expectancy?
(f) The population slope (for all countries) is 0.467. Is this captured in your $95 \%$ CI from part (d)?
(g) Find and interpret $R^{2}$ for this linear model.

Jameson Kuper

Numerade Educator

Problem 28

A common (and hotly debated) saying among sports fans is "Defense wins championships." Is offensive scoring ability or defensive stinginess a better indicator of a team's success? To investigate this question we'll use data from the 2015-2016 National Basketball Association (NBA) regular season. The data$^{6}$ stored in NBAStandings2016 include each team's record (wins, losses, and winning percentage) along with the average number of points the team scored per game (PtsFor) and average number of points scored against them (PtsAgainst).
(a) Examine scatterplots for predicting WinPct using PtsFor and predicting WinPct using PtsAgainst. In each case, discuss whether conditions for fitting a linear model appear to be met.
(b) Fit a model to predict winning percentage (WinPct) using offensive ability (PtsFor). Write down the prediction equation and comment on whether PtsFor is an effective predictor.
(c) Repeat the process of part (b) using PtsAgainst as the predictor.
(d) Compare and interpret $R^{2}$ for both models.
(e) The Golden State Warriors set an NBA record by winning 73 games in the regular season and only losing 9 (WinPct $=0.890$). They scored an average of 114.9 points per game while giving up an average of 104.1 points against. Find the predicted winning percentage for the Warriors using each of the models in (b) and (c).
(f) Overall, does one of the predictors, PtsFor or PtsAgainst, appear to be more effective at explaining winning percentages for NBA teams? Give some justification for your answer.

Shu Naito

Numerade Educator

Problem 29

Use the dataset AllCountries to examine the correlation between birth rate and life expectancy across countries of the world.
(a) Plot the data. Do birth rate and life expectancy appear to be linearly associated?
(b) From this dataset, can we conclude that the population correlation between birth rate and life expectancy is different from zero?
(c) Explain why inference is not necessary to answer part (b).
(d) For every percent increase in birth rate, how much does the predicted life expectancy of a country change?
(e) From this dataset, can we conclude that lowering the birth rate of a country will increase its life expectancy? Why or why not?

James Kiss

Numerade Educator