So far we have simply fixed the number of mixture components $k$, but this too is a parameter that we must estimate from data. How does the log-likelihood of the data vary as a function of $k$, assuming we avoid locally optimal solutions?
A mixture with more components can always reproduce a mixture with fewer, so the best achievable log-likelihood never decreases as the number of components $k$ grows; judged by likelihood alone, the largest model would always win. To compensate, we need a selection criterion that penalizes the number of parameters used in the model. The Bayesian information criterion (BIC) is one such criterion for model selection. It captures the tradeoff between the log-likelihood of the data and the number of parameters that the model uses. The BIC of a model is defined as:
$$\mathrm{BIC} = \ell - \frac{p}{2} \log n$$
where $\ell$ is the log-likelihood of the data under the current model (the highest log-likelihood we can achieve by adjusting the parameters of the model), $p$ is the number of free parameters, and $n$ is the number of data points. This score rewards a larger log-likelihood but penalizes the number of parameters in the model. When selecting among candidate models, we prefer the one with the highest BIC.
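
The selection rule can be illustrated with a minimal sketch. It assumes the "higher is better" form of the score, that is, log-likelihood minus half the parameter count times the log of the number of data points, and it assumes a Gaussian mixture with full covariance matrices when counting free parameters; the text does not fix the mixture family, so that count is an assumption. The function names (`bic`, `gmm_free_params`, `select_k`) and the log-likelihood values are hypothetical, chosen only for illustration. Note that some libraries report BIC with the opposite sign and scale, in which case lower is better.

```python
import math


def bic(log_likelihood, num_params, num_points):
    """BIC in the 'higher is better' form: l - (p / 2) * log(n)."""
    return log_likelihood - 0.5 * num_params * math.log(num_points)


def gmm_free_params(k, dim):
    """Free parameters of a k-component Gaussian mixture.

    Assumes full covariance matrices (an assumption of this sketch):
    k - 1 mixing weights, k * dim mean entries, and
    k * dim * (dim + 1) / 2 covariance entries.
    """
    weights = k - 1
    means = k * dim
    covariances = k * dim * (dim + 1) // 2
    return weights + means + covariances


def select_k(log_likelihoods, dim, num_points):
    """Return the k with the highest BIC, plus all scores.

    log_likelihoods maps each candidate k to the maximized
    log-likelihood obtained for that k (e.g. from EM).
    """
    scores = {
        k: bic(ll, gmm_free_params(k, dim), num_points)
        for k, ll in log_likelihoods.items()
    }
    best_k = max(scores, key=scores.get)
    return best_k, scores


# Hypothetical maximized log-likelihoods for k = 1..4 (made-up numbers,
# not results from any real dataset): they increase with k, as expected.
fits = {1: -1250.0, 2: -1100.0, 3: -1090.0, 4: -1085.0}
best_k, scores = select_k(fits, dim=2, num_points=500)
print(best_k, scores)
```

With these made-up values, the raw log-likelihood is largest at four components, but the gains beyond two components are smaller than the added penalty, so the BIC selects two.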