Zipf's law of word distribution states the following: Take a large corpus of text, count the frequency of every word in the corpus, and then rank these frequencies in decreasing order. Let $f_I$ be the $I$th largest frequency in this list; that is, $f_1$ is the frequency of the most common word (usually "the"), $f_2$ is the frequency of the second most common word, and so on. Zipf's law states that $f_I$ is approximately equal to $\alpha / I$ for some constant $\alpha$. The law tends to be highly accurate except for very small and very large values of $I$.
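As a minimal sketch of the ranking step described above, the following Python fragment counts words and returns the list $f_1, f_2, \ldots$ in decreasing order. The function name `ranked_frequencies` and the tokenizing pattern (runs of letters and apostrophes) are illustrative assumptions, not part of the statement of the law; a different definition of "word" will give somewhat different counts.

```python
import re
from collections import Counter


def ranked_frequencies(text):
    """Return word counts sorted in decreasing order: [f_1, f_2, ...].

    Assumption: a "word" is a maximal run of letters or apostrophes,
    after lowercasing. Other tokenization rules are equally defensible.
    """
    words = re.findall(r"[a-z']+", text.lower())
    return sorted(Counter(words).values(), reverse=True)


# Tiny illustration only; far too short to exhibit Zipf's law.
sample = "the cat and the dog and the bird"
freqs = ranked_frequencies(sample)
for rank, f in enumerate(freqs, start=1):
    print(rank, f)
```

Here `freqs[0]` plays the role of $f_1$, `freqs[1]` of $f_2$, and so on.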
Choose a corpus of at least 20,000 words of online text, and verify Zipf's law experimentally. Define an error measure and find the value of $\alpha$ where Zipf's law best matches your experimental data. Create a log-log graph plotting $f_I$ vs. $I$ and $\alpha / I$ vs. $I$. (On a log-log graph, the function $\alpha / I$ is a straight line.) In carrying out the experiment, be sure to eliminate any formatting tokens (e.g., HTML tags) and normalize upper and lower case.
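One possible way to set up the experiment is sketched below in Python. It is only one choice among many: the error measure used here is the sum of squared differences between $\log f_I$ and $\log(\alpha / I)$, for which the best $\alpha$ has a closed form (the exponential of the mean of $\log f_I + \log I$). The function names, the crude regular expression for stripping tags, and the use of matplotlib for the plot are all assumptions made for illustration.

```python
import math
import re
from collections import Counter


def tokenize(raw):
    """Strip formatting tokens and normalize case.

    Assumption: anything between '<' and '>' is a tag. This is crude and
    will not handle scripts, comments, or entities in real HTML.
    """
    no_tags = re.sub(r"<[^>]+>", " ", raw)
    return re.findall(r"[a-z]+", no_tags.lower())


def ranked_frequencies(words):
    """Return word counts in decreasing order: [f_1, f_2, ...]."""
    return sorted(Counter(words).values(), reverse=True)


def log_error(freqs, alpha):
    """Sum over ranks I of (log f_I - log(alpha / I))^2."""
    return sum(
        (math.log(f) - math.log(alpha / i)) ** 2
        for i, f in enumerate(freqs, start=1)
    )


def fit_alpha(freqs):
    """Return the alpha that minimizes log_error.

    Setting the derivative with respect to log(alpha) to zero gives
    log(alpha) = mean over I of (log f_I + log I).
    """
    logs = [math.log(f) + math.log(i) for i, f in enumerate(freqs, start=1)]
    return math.exp(sum(logs) / len(logs))


def plot_loglog(freqs, alpha):
    """Plot observed f_I and alpha / I against I on log-log axes."""
    import matplotlib.pyplot as plt  # imported lazily; only needed for plotting

    ranks = list(range(1, len(freqs) + 1))
    plt.loglog(ranks, freqs, ".", label="observed $f_I$")
    plt.loglog(ranks, [alpha / i for i in ranks], "-", label=r"$\alpha / I$")
    plt.xlabel("rank $I$")
    plt.ylabel("frequency")
    plt.legend()
    plt.show()
```

Typical use would be `freqs = ranked_frequencies(tokenize(raw_text))`, then `alpha = fit_alpha(freqs)` and `plot_loglog(freqs, alpha)`. Because the law is less accurate at very small and very large $I$, it may be worth fitting on a middle range of ranks and comparing the result. An error measure on the raw frequencies instead of their logarithms would weight the most common words much more heavily and would generally give a different $\alpha$.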