10 Generalized Linear Regression
In the problems of this section we write
$$x^T\beta = \beta_0 + \sum_{i=1}^p \beta_i x_i$$
Problem 10.2.
In binary logistic regression, the probability mass function of Y | X = x is
$$P(y|x; \beta) = \sigma(yx^T\beta),$$
where $\sigma(x) = \frac{1}{1+e^{-x}}$ and $y \in \{-1,1\}$. A training set is
$$D_{tr} = \{(x_1, y_1),..., (x_n, y_n)\}$$
where $y_i \in \{-1, 1\}$. The likelihood function of $\beta$ is
$$L(\beta) \stackrel{def}{=} \prod_{i=1}^n P(y_i|x_i; \beta)$$
a) The negative log-likelihood is
$$-l(\beta) \stackrel{def}{=} -\ln L(\beta).$$
Show that
$$-l(\beta) = \sum_{i=1}^n \ln[1 + e^{-y_i x_i^T \beta}].$$
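One possible route for a), given as a sketch rather than a full solution: the logarithm turns the product in $L(\beta)$ into a sum, and $-\ln \sigma(z) = \ln[1 + e^{-z}]$ follows directly from $\sigma(z) = \frac{1}{1+e^{-z}}$. Here $z$ is only a placeholder for $y_i x_i^T\beta$.
$$-\ln L(\beta) = -\sum_{i=1}^n \ln \sigma(y_i x_i^T\beta) = \sum_{i=1}^n \ln[1 + e^{-y_i x_i^T\beta}]$$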
b) With $\beta = (\beta_0, \beta_1, \beta_2,..., \beta_p)$ and $x_i^T\beta = \beta_0 + \sum_{j=1}^p \beta_j x_{ij}$, where $x_{ij}$ is the $j$th component of $x_i$, show that
$$\frac{\partial}{\partial \beta_0} \ln[1 + e^{-y_i x_i^T \beta}] = -y_i (1 - P(y_i|x_i; \beta))$$
c) With $\beta = (\beta_0, \beta_1, \beta_2,..., \beta_p)$ and $x_i^T\beta = \beta_0 + \sum_{j=1}^p \beta_j x_{ij}$, show for $k = 1,..., p$ that
$$\frac{\partial}{\partial \beta_k} \ln[1 + e^{-y_i x_i^T \beta}] = -y_i x_{ik} (1 - P(y_i|x_i; \beta))$$
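The formulas in b) and c) can be sanity-checked numerically before attempting the proof. The sketch below compares them with central finite differences at one random point. It assumes NumPy, and that each $x_i$ is stored with a leading 1 so that the intercept $\beta_0$ is handled like any other coefficient; the names `loss_i` and `grad_i` are illustrative, not from the text.

```python
import numpy as np

def sigma(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def loss_i(beta, x, y):
    """Single-sample loss ln[1 + exp(-y * x^T beta)]; x carries a leading 1."""
    return np.log1p(np.exp(-y * (x @ beta)))

def grad_i(beta, x, y):
    """Claimed gradient from b) and c): -y * x * (1 - sigma(y * x^T beta))."""
    return -y * x * (1.0 - sigma(y * (x @ beta)))

rng = np.random.default_rng(0)
p = 3
beta = rng.normal(size=p + 1)
x = np.concatenate(([1.0], rng.normal(size=p)))  # leading 1 for the intercept
y = -1.0

# Central finite differences, one coordinate of beta at a time.
eps = 1e-6
numeric = np.array([
    (loss_i(beta + eps * e, x, y) - loss_i(beta - eps * e, x, y)) / (2 * eps)
    for e in np.eye(p + 1)
])
max_abs_diff = np.max(np.abs(numeric - grad_i(beta, x, y)))
print(max_abs_diff)
```

A small value of `max_abs_diff` supports the formulas but does not replace the derivation asked for.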
d) One wants to find the maximum likelihood estimate $\hat{\beta}_{ML}$ of $\beta$ using $D_{tr}$. We have
$$\hat{\beta}_{ML} = \arg\min_{\beta} [-l(\beta)].$$
We assume that $D_{tr}$ is such that a unique $\hat{\beta}_{ML}$ exists. Give a stochastic gradient descent
algorithm for $\hat{\beta}_{ML}$.
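A minimal sketch of one possible answer to d) follows, using the single-sample gradients from b) and c). Several choices here are assumptions rather than part of the problem statement: the function name `sgd_logistic`, the zero initial value, the decreasing step size $\eta_t = \eta_0/\sqrt{t}$, the fixed number of epochs, and the synthetic data used to exercise it.

```python
import numpy as np

def sigma(z):
    """Logistic function 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def add_intercept(X):
    """Prepend a column of ones so that beta[0] plays the role of beta_0."""
    return np.hstack([np.ones((X.shape[0], 1)), X])

def neg_log_lik(beta, X, y):
    """-l(beta) = sum_i ln[1 + exp(-y_i x_i^T beta)]."""
    return np.sum(np.log1p(np.exp(-y * (add_intercept(X) @ beta))))

def sgd_logistic(X, y, step=0.1, epochs=200, seed=0):
    """SGD on -l(beta), one randomly ordered sample per update."""
    rng = np.random.default_rng(seed)
    Xa = add_intercept(X)
    n = Xa.shape[0]
    beta = np.zeros(Xa.shape[1])   # assumed starting point
    t = 0
    for _ in range(epochs):
        for i in rng.permutation(n):
            t += 1
            eta = step / np.sqrt(t)   # assumed decreasing step size
            # Single-sample gradient: -y_i x_i (1 - P(y_i | x_i; beta)).
            g = -y[i] * Xa[i] * (1.0 - sigma(y[i] * (Xa[i] @ beta)))
            beta = beta - eta * g
    return beta

# Synthetic, non-separable data with labels in {-1, +1} (illustration only).
rng = np.random.default_rng(1)
n = 200
X = rng.normal(size=(n, 2))
beta_true = np.array([0.5, 1.0, -2.0])
prob_pos = sigma(add_intercept(X) @ beta_true)
y = np.where(rng.random(n) < prob_pos, 1.0, -1.0)

beta_hat = sgd_logistic(X, y)
print(beta_hat, neg_log_lik(beta_hat, X, y))
```

Visiting the samples in a fresh random permutation each epoch is one common variant; drawing an index uniformly at random at every step is another, and either fits the problem as stated.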