THE SCENARIO: You are the General Manager of a Major League baseball team. As the trading deadline approaches, you must decide whether to trade for power or pitching. The first step is to gather some inferential statistics about the number of wins achieved by Major League teams. You also wish to determine if the number of runs a team scores is useful for predicting the number of games it wins in a season.
THE DATA: The runs scored and wins of 20 randomly selected Major League teams from the past decade are contained in the file: BASEBALL.xlsx. For linear regression, the X-variable is the number of runs scored, and the Y-variable is the number of wins.
INSTRUCTIONS: Answer all the questions below. All calculations must be performed with Excel or PHStat. Attach Excel or PHStat output where indicated. You will receive zero credit for any answer lacking the required Excel or PHStat output.
ROUND OFF ALL CALCULATIONS TO AT LEAST FOUR DECIMAL PLACES. Highlight the cells with output where decimal places need setting. Then use the "Increase Decimal" tool on Excel's Home/Number menu.
1. Find the mean and standard deviation of the sample WINS:
PASTE EXCEL DESCRIPTIVE STATISTICS BELOW.
Sample Mean: ____________
Sample Standard deviation: _____________
2. Assume that the population is normally distributed, but the population standard deviation is not known.
Use your sample data to find a 95% confidence interval for the true mean number of wins for all Major League teams:
PASTE PHSTAT OUTPUT BELOW:
State the margin of error of the confidence interval: ______________
How can you increase the precision of this confidence interval, without changing the sample size?
3. Assume that the population standard deviation is 10 and assume the population is approximately normally distributed. Find the sample size that would be required to determine a 95% confidence interval if we want to be within 3 games of the true mean. That is, we want the margin of error, e, to not exceed 3 games.
PASTE PHSTAT OUTPUT BELOW
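OPTIONAL CROSS-CHECK (does not replace the required PHStat output): The sample-size question above rests on the formula n = (Z * sigma / e)^2, rounded up to the next whole number. The sketch below is one way to check the arithmetic outside Excel. It assumes Python 3.8 or later; the function name sample_size_for_mean is made up for this illustration and is not part of PHStat.

```python
import math
from statistics import NormalDist


def sample_size_for_mean(sigma, e, confidence=0.95):
    """Smallest n such that Z * sigma / sqrt(n) does not exceed e.

    Assumes the population standard deviation sigma is known and the
    population is approximately normal (hypothetical helper, not PHStat).
    """
    # Two-tailed critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Always round UP: rounding down would let the margin exceed e.
    return math.ceil((z * sigma / e) ** 2)


# Inputs taken from the question: sigma = 10 games, e = 3 games.
n_required = sample_size_for_mean(sigma=10, e=3)
print(n_required)
```

The result should agree with the "Sample Size Needed" line of the PHStat output.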
4. Using your sample data, conduct a hypothesis test at the alpha = 0.05 significance level. You may assume that the population standard deviation is not known and that the population is approximately normally distributed.
Is there sufficient evidence to conclude that the mean number of wins for all teams is more than 75?
PASTE PHSTAT OUTPUT BELOW:
Using the critical value approach: State the conclusion of the hypothesis test and the reason for the conclusion.
Using the p-value approach: State the conclusion of the hypothesis test and the reason for the conclusion.
Assume that the null hypothesis is true. What is the probability of obtaining a test statistic equal to or more extreme than the one generated by the hypothesis test?
Will the conclusion of the hypothesis test be different if alpha is changed to 0.10 (while all other inputs remain the same)? Explain the reason why or why not.
5. Suppose it is known that 12 out of the 20 teams in the sample had a season winning percentage no better than 0.500.
(a) Find a 95% confidence interval for the true proportion of all teams that had a season winning percentage no better than 0.500.
PASTE PHSTAT OUTPUT BELOW:
(b) What is your opinion of the precision of this confidence interval? Give a reason for your answer.
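OPTIONAL CROSS-CHECK (does not replace the required PHStat output): The interval for a proportion is p +/- Z * sqrt(p(1 - p)/n), where p is the sample proportion. The sketch below applies that formula to the 12-out-of-20 figure given in the question. It assumes Python 3.8 or later and the usual normal approximation; the function name proportion_ci is made up for this illustration.

```python
import math
from statistics import NormalDist


def proportion_ci(successes, n, confidence=0.95):
    """Normal-approximation confidence interval for a proportion.

    Returns (lower, upper, margin_of_error). Hypothetical helper,
    not a PHStat function.
    """
    p_hat = successes / n
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    margin = z * math.sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - margin, p_hat + margin, margin


# Inputs taken from the question: 12 of the 20 sampled teams.
lower, upper, margin = proportion_ci(12, 20)
print(f"{lower:.4f} to {upper:.4f} (margin of error {margin:.4f})")
```

Comparing the width of the interval with the sample proportion itself is one way to think about part (b).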
6. Assuming that we have no way to estimate the population proportion, find the sample size that would be required to determine a 95% confidence interval for the true proportion of all teams that have a season winning percentage better than 0.500. We want to be within 0.10 of the true population proportion, that is, we want the margin of error, e, to not exceed 0.10.
PASTE PHSTAT OUTPUT BELOW
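OPTIONAL CROSS-CHECK (does not replace the required PHStat output): When no estimate of the population proportion is available, the conservative choice is p = 0.5, because p(1 - p) is largest there. The required sample size is then n = Z^2 * p(1 - p) / e^2, rounded up. The sketch below assumes Python 3.8 or later; the function name sample_size_for_proportion is made up for this illustration.

```python
import math
from statistics import NormalDist


def sample_size_for_proportion(e, p=0.5, confidence=0.95):
    """Smallest n such that Z * sqrt(p(1 - p)/n) does not exceed e.

    p = 0.5 is the conservative default when no estimate is available.
    Hypothetical helper, not a PHStat function.
    """
    z = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    # Round up so the margin of error cannot exceed e.
    return math.ceil(z ** 2 * p * (1 - p) / e ** 2)


# Input taken from the question: margin of error e = 0.10.
n_required_prop = sample_size_for_proportion(e=0.10)
print(n_required_prop)
```

Any other value of p gives a smaller n, which is why 0.5 is the safe assumption.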
LINEAR REGRESSION – Use the sample to complete this section. Remember, the X variable is NUMBER OF RUNS SCORED, and the Y variable is NUMBER OF WINS.
7. PASTE A SCATTER PLOT BELOW:
8. Perform the regression analysis using PHSTAT and PASTE THE PRINTOUT BELOW:
9. Interpreting the regression output.
PLEASE NOTE: You must give answers that are specific to this regression model. For example, do not say that the regression equation is: Ŷ = b0 + b1X; that is the generic regression equation. You need to write the equation that expresses the relationship between RUNS SCORED and WINS. Be equally specific in your other answers.
i. State the regression equation: ____________________________________
ii. Explain the exact meaning of the slope of the regression equation:
iii. Explain the exact meaning of the y-intercept of the regression equation:
iv. State the standard error of the estimate, and explain its exact meaning:
v. State the coefficient of determination, and explain its exact meaning:
vi. Predict the number of wins for a team that scores 670 runs (round off to the nearest integer). _______________
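OPTIONAL ILLUSTRATION (does not replace the required PHStat output): The quantities asked for above (slope, y-intercept, standard error of the estimate, coefficient of determination, and a prediction) all come from the same least-squares calculation. The sketch below shows that calculation in Python. The runs and wins values in it are PLACEHOLDERS invented for this illustration; they are NOT the BASEBALL.xlsx data, so the numbers it prints are not answers to this question. The function name fit_line is also made up.

```python
import math


def fit_line(x, y):
    """Simple least-squares fit of y on x.

    Returns (b0, b1, s_yx, r_square):
      b0       y-intercept
      b1       slope
      s_yx     standard error of the estimate, sqrt(SSE / (n - 2))
      r_square coefficient of determination, 1 - SSE / SST
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    ssx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    b1 = sxy / ssx
    b0 = y_bar - b1 * x_bar
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    sst = sum((yi - y_bar) ** 2 for yi in y)
    return b0, b1, math.sqrt(sse / (n - 2)), 1 - sse / sst


# PLACEHOLDER values only; replace with the columns from BASEBALL.xlsx.
runs = [600, 650, 700, 750, 800]
wins = [70, 74, 83, 85, 93]

b0, b1, s_yx, r_square = fit_line(runs, wins)
predicted_wins = round(b0 + b1 * 670)  # prediction at X = 670 runs
print(f"wins = {b0:.4f} + {b1:.4f} * runs")
print(f"S_YX = {s_yx:.4f}, R square = {r_square:.4f}")
print(f"predicted wins at 670 runs: {predicted_wins}")
```

With the real data, b1 reads as the estimated change in wins per additional run scored, and s_yx is in units of wins.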
10. Using the PHStat regression printout from Question 8, test the null hypothesis that there is no linear relationship between X and Y. Test at the alpha = 0.05 significance level.
i. State the null hypothesis: _______________________
ii. State the alternate hypothesis: ___________________
iii. Test result and reason for test result __________________
iv. Assume that the null hypothesis is true. What is the probability of obtaining a test statistic equal to or more extreme than the one shown in the regression output?
11.
(a) PASTE RESIDUAL PLOT BELOW:
(b) From the residual plot, do you think that the two regression assumptions listed below are satisfied? Give the reason for your conclusion.
Linearity: ___________________________________
Reason: ____________________________________
Equal Variance: ______________________________
Reason: ____________________________________
12.
(a) PASTE A NORMAL PROBABILITY PLOT OF RESIDUALS BELOW:
(b) From the normal probability plot, do you think the normality assumption for regression is satisfied? Give the reason for your conclusion.
13. Determine 95% confidence and prediction intervals for X = 670.
PASTE PHSTAT OUTPUT BELOW:
14. Typically, the first assessment of how well a regression model predicts is based on R square (the coefficient of determination). The higher the R square, the more of the variation in observed Y-values is explained by the variation in observed X-values.
Suppose you want to find out if there's a model that is a better predictor of wins than runs scored. You ask your staff to come up with alternate models. It turns out that when the X variable is the number of gallons of beer sold during a game, wins are predicted with an R Square of 0.7750.
Would you stop using the runs scored/wins model and use the beer sold/wins model instead? Why or why not?