Question

1
(a) Load the California housing dataset provided in sklearn.datasets, and
construct a random 70/30 train-test split. Set the random seed to a number of your
choice to make the split reproducible. What is the value of d here?
(b) Train a random forest of 100 decision trees using default hyperparameters.
Report the training and test MSEs. What is the value of m used?
(c) Write code to compute the pairwise (Pearson) correlations between the
test set predictions of all pairs of distinct trees. Report the average of all these
pairwise correlations.
You can retrieve all the trees in a RandomForestRegressor object using the estimators_ attribute.
(d) Repeat (b) and (c) for m = 1 to d. Produce a table containing the training
and test MSEs, and the average correlations for all m values. In addition, plot the
training and test MSEs against m in a single figure, and plot the average correlation
against m in another figure.
(e) Describe how the average correlation changes as m increases. Explain the
observed pattern.
(f) A data scientist claims that we should choose m such that the average
correlation is smallest, because it gives us maximum reduction in the variance, thus
maximum reduction in the expected prediction error. True or false? Justify your
answer.
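
A minimal sketch of how parts (a) to (d) might be set up is given below. It is one possible approach under stated assumptions, not a reference solution. It assumes that d is the number of input features, that m corresponds to scikit-learn's max_features parameter, and that a regressor is intended because the question asks for MSEs. The function names load_split, average_tree_correlation, evaluate and sweep are hypothetical helpers introduced only for illustration.

```python
import numpy as np
from sklearn.datasets import fetch_california_housing
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split


def load_split(seed=0):
    """Part (a): load the data and make a reproducible 70/30 split."""
    X, y = fetch_california_housing(return_X_y=True)  # downloads on first use
    # Returns X_train, X_test, y_train, y_test.
    return train_test_split(X, y, test_size=0.3, random_state=seed)


def average_tree_correlation(forest, X_test):
    """Part (c): mean Pearson correlation over all pairs of distinct trees."""
    # One row per tree, one column per test point.
    preds = np.array([tree.predict(X_test) for tree in forest.estimators_])
    corr = np.corrcoef(preds)                 # (n_trees, n_trees) matrix
    upper = np.triu_indices_from(corr, k=1)   # each distinct pair exactly once
    return float(np.mean(corr[upper]))


def evaluate(X_train, X_test, y_train, y_test, m=None, seed=0):
    """Parts (b) and (c) for one m; m=None keeps the library default."""
    params = {"n_estimators": 100, "random_state": seed}
    if m is not None:
        # Assumption: m maps to max_features (features tried per split).
        params["max_features"] = m
    forest = RandomForestRegressor(**params).fit(X_train, y_train)
    return {
        "m": forest.max_features,
        "train_mse": mean_squared_error(y_train, forest.predict(X_train)),
        "test_mse": mean_squared_error(y_test, forest.predict(X_test)),
        "avg_corr": average_tree_correlation(forest, X_test),
    }


def sweep(X_train, X_test, y_train, y_test, seed=0):
    """Part (d): one result row per m = 1, ..., d."""
    d = X_train.shape[1]  # assumption: d is the number of feature columns
    return [evaluate(X_train, X_test, y_train, y_test, m, seed)
            for m in range(1, d + 1)]
```

Calling sweep(*load_split(seed)) with a seed of your choice would give one dictionary per m, which can be turned into the table for part (d) and passed to a plotting library for the two figures. The interpretation asked for in parts (e) and (f) is left to the reader.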


Added by Craig F.
