In general, to guarantee the convergence of Q-Learning to optimal Q-values...
It is necessary that every state-action pair is visited infinitely often.
It is necessary that the learning rate α (the weight given to new samples) is decreased to 0 over time.
It is necessary that the discount γ is less than 0.5.
It is necessary that actions get chosen according to argmax_a Q(s, a).
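
One way to probe the conditions listed above is a small tabular experiment. The sketch below is an illustration under stated assumptions, not part of the original question: the two-state MDP, the names `TRANSITIONS` and `q_learning`, the discount of 0.9, and the n^(-0.6) learning-rate schedule are all hypothetical choices. It acts uniformly at random rather than greedily, so every state-action pair keeps being visited, and it decays the learning rate separately for each pair.

```python
import random

# Hypothetical two-state, two-action deterministic MDP (not from the source).
# TRANSITIONS[state][action] = (next_state, reward)
TRANSITIONS = {
    0: {0: (0, 0.0), 1: (1, 0.0)},
    1: {0: (1, 1.0), 1: (0, 0.0)},
}
GAMMA = 0.9  # assumed discount; chosen well above 0.5 on purpose


def q_learning(num_steps=200_000, seed=0):
    """Tabular Q-learning with a random behavior policy and decaying alpha."""
    rng = random.Random(seed)
    q = {(s, a): 0.0 for s in TRANSITIONS for a in TRANSITIONS[s]}
    visits = {pair: 0 for pair in q}
    state = 0
    for _ in range(num_steps):
        # Uniformly random behavior policy: every pair keeps being visited.
        action = rng.choice((0, 1))
        next_state, reward = TRANSITIONS[state][action]
        visits[(state, action)] += 1
        # Per-pair learning rate n^(-0.6): it decays to 0, but slowly
        # enough that its sum over visits still diverges.
        alpha = visits[(state, action)] ** -0.6
        # Off-policy target: bootstrap from the best next action.
        target = reward + GAMMA * max(q[(next_state, a)] for a in (0, 1))
        q[(state, action)] += alpha * (target - q[(state, action)])
        state = next_state
    return q


if __name__ == "__main__":
    for pair, value in sorted(q_learning().items()):
        print(pair, round(value, 2))
```

For this toy MDP the optimal values can be worked out by hand: Q*(1, 0) = 10, Q*(0, 1) = 9, and Q*(0, 0) = Q*(1, 1) = 8.1. The learned table should end up close to these. Swapping in a different behavior policy, learning-rate schedule, or discount is a quick way to see which of the listed conditions actually matter.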