In RapidMiner: Predicting Housing Median Prices. The file BostonHousing.csv contains information on over 500 census tracts in Boston, where for each tract multiple attributes are recorded. The last column (CAT.MEDV) was derived from the median value (MEDV) such that it takes the value 1 if MEDV > 30 and 0 otherwise. Consider the goal of predicting the MEDV of a tract, given the information in the
first 12 columns.
Partition the data into training (60%) and holdout (40%) sets.
a) Perform a k-NN prediction with all 12 predictors (ignore the CAT.MEDV attribute), trying
values of k from 1 to 10 with RapidMiner's Optimize Parameters (Grid) operator. Use nested
10-fold cross-validation on the training set within this operator. Make sure to normalize the
data with the Normalize operator, and use the Group Models operator so that the
normalization from the training data is used for validation as well. What is the best k? What
does it mean?
b) Predict the MEDV for a tract with the following information, using the best k:
(Hint: Create a new .csv file with the above data. The easiest way to do that is to make a
copy of the csv data file provided, delete the data (keep column headers) and any extra
attributes and enter the above data. Then import this data in RapidMiner.)
c) If we used the above k-NN algorithm to score the training data, what would be the error of
the training set?
d) Report the error rate (averaged) from the 10-fold cross-validation for the best k. Why is the
validation data error overly optimistic compared with the error rate when applying this k-NN
predictor to new data?
e) Report the error rate for the holdout set using the optimized k-NN predictor. Compare this
with the validation error rate found earlier and comment.
f) If the purpose is to predict MEDV for several thousands of new tracts, what would be the
disadvantage of using k-NN prediction? List the operations that the algorithm goes through
in order to produce each prediction.
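
The operations that the final question asks about can be sketched outside RapidMiner. The block below is a minimal brute-force illustration in plain Python, not the RapidMiner process itself. It assumes z-score normalization and Euclidean distance, and it runs on synthetic two-predictor data rather than BostonHousing.csv. The names fit_zscore, apply_zscore, knn_predict and rmse are hypothetical helpers introduced only for this sketch.

```python
import math
import random


def fit_zscore(rows):
    """Return per-column (mean, std) computed from the training rows only."""
    stats = []
    for col in zip(*rows):
        mean = sum(col) / len(col)
        var = sum((v - mean) ** 2 for v in col) / len(col)
        stats.append((mean, math.sqrt(var) or 1.0))  # guard against zero spread
    return stats


def apply_zscore(row, stats):
    """Normalize one record with the training-set statistics."""
    return [(v - mean) / std for v, (mean, std) in zip(row, stats)]


def knn_predict(train_x, train_y, stats, new_row, k):
    """Predict a numeric target for one new record by brute force."""
    # 1. Normalize the new record with the statistics learned on training data.
    q = apply_zscore(new_row, stats)
    # 2. Compute the distance from the new record to every training record.
    dists = [(math.dist(q, x), y) for x, y in zip(train_x, train_y)]
    # 3. Sort all distances and keep the k nearest neighbours.
    nearest = sorted(dists, key=lambda d: d[0])[:k]
    # 4. Average the neighbours' target values.
    return sum(y for _, y in nearest) / k


# Synthetic stand-in data (assumption): two predictors and a noisy numeric target.
rng = random.Random(0)
raw_x = [[rng.uniform(0, 10), rng.uniform(100, 700)] for _ in range(300)]
raw_y = [3.0 * a - 0.01 * b + rng.gauss(0, 0.5) for a, b in raw_x]

# 60% training / 40% holdout; normalization is fitted on the training part only.
split = int(0.6 * len(raw_x))
stats = fit_zscore(raw_x[:split])
train_x = [apply_zscore(r, stats) for r in raw_x[:split]]
train_y = raw_y[:split]


def rmse(rows, targets, k):
    """Root-mean-squared error of k-NN predictions for raw (unnormalized) rows."""
    errs = [(knn_predict(train_x, train_y, stats, r, k) - t) ** 2
            for r, t in zip(rows, targets)]
    return math.sqrt(sum(errs) / len(errs))


train_rmse_k1 = rmse(raw_x[:split], train_y, 1)      # each record is its own neighbour
holdout_rmse_k5 = rmse(raw_x[split:], raw_y[split:], 5)
```

Steps 2 and 3 are repeated against the full training set for every record that is scored, because k-NN stores the data and builds no compact model. Scoring the training records with k = 1 in this sketch also shows why training error says little about accuracy on new data: each record's nearest neighbour is the record itself.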