Given data {(x_i, y_i) ∈ ℝ^d × {±1} : i ∈ [1, n]}, logistic regression is another popular classification method in machine learning, which amounts to the following minimization problem:
min_{w ∈ ℝ^d} {f(w) := 1/n ∑_{i=1}^n log(1 + e^{-y_i⟨w, x_i⟩}) + λ/2 ||w||^2},
where λ > 0 is a regularization parameter.
Work out the gradient function ∇f(w) and the Hessian function ∇^2f(w). Show that its gradient function ∇f(w) is Lipschitz continuous with constant L ≤ 1/n ∑_{i=1}^n ||x_i||^2 + λ, i.e.
||∇f(w) - ∇f(w')|| ≤ (1/n ∑_{i=1}^n ||x_i||^2 + λ) ||w - w'||, ∀ w, w' ∈ ℝ^d.
(Recall that in Homework 3 we showed that the objective function f of logistic regression is convex.)
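A numerical sanity check can help when working through this part. The sketch below assumes the standard closed forms ∇f(w) = -1/n ∑_{i=1}^n σ(-y_i⟨w, x_i⟩) y_i x_i + λw and ∇^2f(w) = 1/n ∑_{i=1}^n s_i(1 - s_i) x_i x_i^T + λI, where σ(z) = 1/(1 + e^{-z}) and s_i = σ(y_i⟨w, x_i⟩). All function names (`objective`, `gradient`, `hessian`, `lipschitz_bound`, `max_gradient_ratio`, `make_data`) and the synthetic Gaussian data are illustrative assumptions, not part of the assignment. It is a check, not a proof.

```python
import math
import random


def dot(u, v):
    """Euclidean inner product <u, v>."""
    return sum(a * b for a, b in zip(u, v))


def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + e^{-z})."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def objective(w, X, y, lam):
    """f(w) = 1/n sum_i log(1 + exp(-y_i <w, x_i>)) + lam/2 ||w||^2."""
    total = 0.0
    for x_i, y_i in zip(X, y):
        m = y_i * dot(w, x_i)
        # log(1 + e^{-m}) evaluated without overflow for large |m|
        total += max(-m, 0.0) + math.log1p(math.exp(-abs(m)))
    return total / len(X) + 0.5 * lam * dot(w, w)


def gradient(w, X, y, lam):
    """Assumed closed form: -1/n sum_i sigmoid(-y_i <w, x_i>) y_i x_i + lam w."""
    n = len(X)
    g = [lam * w_j for w_j in w]
    for x_i, y_i in zip(X, y):
        c = -y_i * sigmoid(-y_i * dot(w, x_i)) / n
        g = [g_j + c * x_j for g_j, x_j in zip(g, x_i)]
    return g


def hessian(w, X, y, lam):
    """Assumed closed form: 1/n sum_i s_i (1 - s_i) x_i x_i^T + lam I."""
    n, d = len(X), len(w)
    H = [[lam if j == k else 0.0 for k in range(d)] for j in range(d)]
    for x_i, y_i in zip(X, y):
        s = sigmoid(y_i * dot(w, x_i))
        c = s * (1.0 - s) / n
        for j in range(d):
            for k in range(d):
                H[j][k] += c * x_i[j] * x_i[k]
    return H


def lipschitz_bound(X, lam):
    """The constant from the problem statement: 1/n sum_i ||x_i||^2 + lam."""
    return sum(dot(x, x) for x in X) / len(X) + lam


def make_data(n, d, seed=0):
    """Hypothetical synthetic data: Gaussian features, random +/-1 labels."""
    rng = random.Random(seed)
    X = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(n)]
    y = [rng.choice([-1, 1]) for _ in range(n)]
    return X, y


def max_gradient_ratio(X, y, lam, trials=100, seed=0):
    """Largest observed ||grad f(w) - grad f(v)|| / ||w - v|| over random pairs."""
    rng = random.Random(seed)
    d, worst = len(X[0]), 0.0
    for _ in range(trials):
        w = [rng.gauss(0.0, 2.0) for _ in range(d)]
        v = [rng.gauss(0.0, 2.0) for _ in range(d)]
        diff_w = [a - b for a, b in zip(w, v)]
        dist = math.sqrt(dot(diff_w, diff_w))
        if dist == 0.0:
            continue
        gw, gv = gradient(w, X, y, lam), gradient(v, X, y, lam)
        diff_g = [a - b for a, b in zip(gw, gv)]
        worst = max(worst, math.sqrt(dot(diff_g, diff_g)) / dist)
    return worst
```

Comparing `max_gradient_ratio(X, y, lam)` against `lipschitz_bound(X, lam)` on a few data sets should never show the ratio exceeding the bound. Since s_i(1 - s_i) ≤ 1/4, the observed ratio is usually well below it.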
Write down pseudo-code for stochastic gradient descent (SGD) applied to logistic regression.
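One possible shape for the algorithm, written as runnable Python rather than pseudo-code, is sketched below. It assumes the per-example loss f_i(w) = log(1 + e^{-y_i⟨w, x_i⟩}) + λ/2 ||w||^2, whose gradient is an unbiased estimate of ∇f(w) when i is drawn uniformly from the n examples. The function name `sgd_logistic`, the zero initialization, and the `step_size` callable are illustrative choices, not prescribed by the problem.

```python
import math
import random


def dot(u, v):
    """Euclidean inner product <u, v>."""
    return sum(a * b for a, b in zip(u, v))


def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + e^{-z})."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def sgd_logistic(X, y, lam, step_size, T, seed=0):
    """Run T SGD updates on regularized logistic regression, starting from w = 0.

    step_size is a function t -> eta_t (t = 1, ..., T); returns the final iterate.
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for t in range(1, T + 1):
        i = rng.randrange(n)  # sample one example uniformly at random
        x_i, y_i = X[i], y[i]
        # Stochastic gradient: -sigmoid(-y_i <w, x_i>) y_i x_i + lam w
        c = -y_i * sigmoid(-y_i * dot(w, x_i))
        eta = step_size(t)
        w = [w_j - eta * (c * x_j + lam * w_j) for w_j, x_j in zip(w, x_i)]
    return w
```

For example, `sgd_logistic(X, y, lam=0.1, step_size=lambda t: 1.0 / (0.1 * t), T=1000)` runs the loop with a decaying step size of the form 1/(λt).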
(Bonus question) Consider the average of the iterates, i.e. w̄_T = 1/T ∑_{t=1}^T w_t, where {w_t : t ∈ [1, T]} is generated by SGD with step sizes η_t = 1/(λt). Prove the following convergence rate for SGD in logistic regression:
E(f(w̄_T)) - min_{w ∈ ℝ^d} f(w) = O(log T / T).
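To experiment with the averaged iterate before attempting the proof, a minimal sketch follows. It assumes the indexing convention w_1 = 0 and w_{t+1} = w_t - η_t g_t with η_t = 1/(λt), where g_t is the stochastic gradient at a uniformly sampled example. The name `averaged_sgd` and the running-mean update are illustrative assumptions.

```python
import math
import random


def dot(u, v):
    """Euclidean inner product <u, v>."""
    return sum(a * b for a, b in zip(u, v))


def sigmoid(z):
    """Numerically stable logistic function 1 / (1 + e^{-z})."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)


def averaged_sgd(X, y, lam, T, seed=0):
    """SGD with eta_t = 1/(lam t) from w_1 = 0 (requires T >= 1).

    Returns (w_T, average of w_1, ..., w_T).
    """
    rng = random.Random(seed)
    n, d = len(X), len(X[0])
    w = [0.0] * d    # w_1
    avg = [0.0] * d
    for t in range(1, T + 1):
        # Fold w_t into the running mean: avg_t = avg_{t-1} + (w_t - avg_{t-1}) / t
        avg = [a + (w_j - a) / t for a, w_j in zip(avg, w)]
        if t == T:
            break
        i = rng.randrange(n)  # uniform index in {0, ..., n-1}
        x_i, y_i = X[i], y[i]
        eta = 1.0 / (lam * t)
        c = -y_i * sigmoid(-y_i * dot(w, x_i))
        # w_{t+1} = w_t - eta_t (c x_i + lam w_t) = (1 - 1/t) w_t - (c / (lam t)) x_i
        w = [w_j - eta * (c * x_j + lam * w_j) for w_j, x_j in zip(w, x_i)]
    return w, avg
```

With this step size each update is a convex combination of w_t and a vector of norm at most max_i ||x_i|| / λ, so every iterate (and hence the average) stays in the ball of that radius. A bound of this kind on the iterates, and through it on the stochastic gradients, is typically the ingredient needed in O(log T / T) arguments for strongly convex objectives.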