You have been given access to a large movie rating dataset
containing about 5M records with fields such as Movie Name, Average
Movie Rating, Genre, Number of Reviewers, Date of Release, and a few
other numeric columns. You plan to build a data mining model that
predicts the average movie rating from the other columns. Which is
the best approach to building the model?
Randomly sample a few thousand records and
explore whether you can predict the rating with reasonable accuracy,
dropping features that don't improve the predictive
accuracy.
Use the entire dataset to build a regression model
that predicts the average movie rating by regressing against the
remaining columns, dropping features that don't improve the
predictive accuracy.
None of the above
Drop the movie titles and genres because they are
unstructured data, and build a regression model on the entire
dataset using only the numeric columns.
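
To make the sampling option concrete, here is a minimal sketch of what "sample a few thousand records and drop features that don't help" could look like. It is an illustration under stated assumptions, not a prescribed solution: the data is synthetic, the column names `rating`, `num_reviewers` and `noise_col` are hypothetical, the helper `screen_features` is invented for this example, and features are screened one at a time with a single-predictor least-squares fit, which is a simplification of fitting one model on all columns.

```python
import random

def fit_line(xs, ys):
    """Ordinary least squares for y = a + b*x with a single predictor."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    b = sxy / sxx if sxx else 0.0
    return my - b * mx, b

def r_squared(xs, ys, a, b):
    """Share of variance in ys explained by the fitted line a + b*x."""
    my = sum(ys) / len(ys)
    ss_tot = sum((y - my) ** 2 for y in ys)
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    return 1.0 - ss_res / ss_tot if ss_tot else 0.0

def screen_features(records, target, features, n_sample=2000, seed=0):
    """Hypothetical helper: score each feature on a small random sample."""
    rng = random.Random(seed)
    # rng.sample returns rows in random order, so slicing gives a random split.
    sample = rng.sample(records, min(n_sample, len(records)))
    cut = int(0.8 * len(sample))
    train, test = sample[:cut], sample[cut:]
    scores = {}
    for f in features:
        a, b = fit_line([r[f] for r in train], [r[target] for r in train])
        # Score on held-out rows so a useless feature is not flattered.
        scores[f] = r_squared(
            [r[f] for r in test], [r[target] for r in test], a, b
        )
    return scores

# Synthetic stand-in for the movie dataset (assumption: only for illustration).
data_rng = random.Random(42)
records = []
for _ in range(50_000):
    reviewers = data_rng.uniform(0, 1000)
    records.append({
        "rating": 2.0 + 0.003 * reviewers + data_rng.gauss(0, 0.3),
        "num_reviewers": reviewers,           # informative by construction
        "noise_col": data_rng.uniform(0, 1),  # unrelated to the rating
    })

scores = screen_features(records, "rating", ["num_reviewers", "noise_col"])
# The 0.05 cut-off is an arbitrary illustrative threshold.
kept = [f for f, s in scores.items() if s > 0.05]
print(scores)
print("features kept:", kept)
```

The point of the sketch is only that a sample of a few thousand rows is cheap to iterate on; whether the findings carry over to all 5M records still has to be checked on the real data.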