a) Load the California housing dataset provided in sklearn.datasets and construct a random 70/30 train-test split. Set the random seed to a number of your choice to make the split reproducible. What is the value of d here?
b) Train a random forest of 100 decision trees using default hyperparameters. Report the training and test MSEs. What is the value of m used?
c) Write code to compute the pairwise (Pearson) correlations between the test set predictions of all pairs of distinct trees. Report the average of all these pairwise correlations. You can retrieve all the trees in a RandomForestRegressor object using the estimators_ attribute.
d) Repeat (b) and (c) for m = 1 to d. Produce a table containing the training and test MSEs, and the average correlations for all m values. In addition, plot the training and test MSEs against m in a single figure, and plot the average correlation against m in another figure.
e) Describe how the average correlation changes as m increases. Explain the observed pattern.
f) A data scientist claims that we should choose m such that the average correlation is smallest, because this gives the maximum reduction in variance and thus the maximum reduction in the expected prediction error. True or false? Justify your answer.
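
For part (c), the averaging step can be sketched as below. This is only one possible approach, not a reference solution. The function name average_pairwise_correlation and its input layout (one row of test set predictions per tree) are assumptions made for illustration, and the sketch assumes no tree produces constant predictions, since the Pearson correlation is undefined in that case.

```python
import numpy as np


def average_pairwise_correlation(per_tree_predictions):
    """Mean Pearson correlation over all pairs of distinct trees.

    per_tree_predictions: array-like of shape (n_trees, n_samples), where
    row i holds the test set predictions of tree i (hypothetical layout).
    """
    preds = np.asarray(per_tree_predictions, dtype=float)
    n_trees = preds.shape[0]
    if n_trees < 2:
        raise ValueError("need at least two trees to form a pair")

    # np.corrcoef treats each row as one variable, so corr[i, j] is the
    # Pearson correlation between the predictions of tree i and tree j.
    corr = np.corrcoef(preds)

    # Keep only the strict upper triangle: each unordered pair of distinct
    # trees is counted once and the diagonal (self-correlation) is excluded.
    rows, cols = np.triu_indices(n_trees, k=1)
    return float(corr[rows, cols].mean())
```

With a fitted RandomForestRegressor called forest and a test feature array X_test (both hypothetical names), the input matrix could be built as np.stack([tree.predict(X_test) for tree in forest.estimators_]).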