Computing the closed-form solution in part (2) can be a computational burden for a large number of users or movies. A remedy is to use an iterative optimization algorithm such as Stochastic Gradient Descent (SGD), where you sample movies when updating the latent features of users and sample users when updating the latent features of movies. Derive the update rules for \( u_i^{(t)} \) and \( v_j^{(t)} \) at the \( t \)-th iteration of SGD. Show the detailed steps.
Added by Larry W.
Step 1
To derive the update rules for the user latent features \( u_i \) and the movie latent features \( v_j \) using Stochastic Gradient Descent (SGD), we will follow these steps: …
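The objective from parts (1) and (2) is not reproduced on this page, so what follows is only a sketch under an assumed objective: the common regularized squared-error loss \( J = \tfrac{1}{2} \sum_{(i,j)} (r_{ij} - u_i^\top v_j)^2 + \tfrac{\lambda}{2} \left( \sum_i \|u_i\|^2 + \sum_j \|v_j\|^2 \right) \), summed over observed ratings \( r_{ij} \). For one sampled rating with error \( e_{ij} = r_{ij} - u_i^{(t)\top} v_j^{(t)} \), the per-sample gradient with respect to \( u_i \) is \( -e_{ij} v_j^{(t)} + \lambda u_i^{(t)} \) (applying the regularizer once per sample, a common simplification), which gives \( u_i^{(t+1)} = u_i^{(t)} + \eta \left( e_{ij} v_j^{(t)} - \lambda u_i^{(t)} \right) \) and, symmetrically, \( v_j^{(t+1)} = v_j^{(t)} + \eta \left( e_{ij} u_i^{(t)} - \lambda v_j^{(t)} \right) \). The names `sgd_step`, `train`, `eta`, and `lam` below are hypothetical and chosen only for illustration.

```python
import random


def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def sgd_step(u, v, r, eta, lam):
    """One SGD update for a single observed rating r (assumed objective)."""
    e = r - dot(u, v)  # error computed from the old u and old v
    # Both updates use the values from iteration t, not the freshly updated ones.
    u_new = [ui + eta * (e * vi - lam * ui) for ui, vi in zip(u, v)]
    v_new = [vi + eta * (e * ui - lam * vi) for ui, vi in zip(u, v)]
    return u_new, v_new


def train(ratings, n_users, n_movies, k=2, eta=0.05, lam=0.01,
          epochs=200, seed=0):
    """Run SGD over (user, movie, rating) triples sampled uniformly."""
    rng = random.Random(seed)
    U = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_users)]
    V = [[rng.uniform(-0.1, 0.1) for _ in range(k)] for _ in range(n_movies)]
    for _ in range(epochs * len(ratings)):
        i, j, r = rng.choice(ratings)  # sample one observed rating
        U[i], V[j] = sgd_step(U[i], V[j], r, eta, lam)
    return U, V


def squared_error(ratings, U, V):
    """Sum of squared prediction errors over the observed ratings."""
    return sum((r - dot(U[i], V[j])) ** 2 for i, j, r in ratings)
```

Each step touches only one user vector and one movie vector, so its cost depends on the latent dimension `k` rather than on the number of users or movies, which is the point of replacing the closed-form solve.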
Recommended Videos
Question: Principled method for learning the step size in gradient descent. In class, we discussed that when we perform gradient descent to minimize the target function \( J(w) \), the step size \( a(k) \) at each iteration is a crucial hyperparameter. We further said that we can determine \( a(k) \) experimentally through cross-validation. There is actually a principled way of computing the optimal \( a(k) \) in each iteration, and we are going to derive the expression for it.
(a) (0.5 point) According to the Taylor series expansion, for a twice-differentiable function \( f(x) \) around a point \( x_0 \), we have: \( f(x) \approx f(x_0) + \nabla f(x_0)^\top (x - x_0) + \tfrac{1}{2} (x - x_0)^\top H (x - x_0) \), where \( \nabla f(x_0) \) is the gradient vector and \( H \) is the Hessian matrix of \( f \) evaluated at \( x_0 \). Let \( w(k) \) be the value of \( w \) at the \( k \)-th iteration of gradient descent. Show that the second-order Taylor expansion of the target function \( J(w) \) around \( w(k) \) is: \( J(w) \approx J(w(k)) + \nabla J(w(k))^\top (w - w(k)) + \tfrac{1}{2} (w - w(k))^\top H (w - w(k)) \).
(b) (1 point) Show that the above expression of \( J(w) \) evaluated at \( w(k+1) \) (i.e. at the \( (k+1) \)-th gradient descent iteration) can be written as: \( J(w(k+1)) \approx J(w(k)) - a(k) \, \nabla J(w(k))^\top \nabla J(w(k)) + \tfrac{1}{2} a(k)^2 \, \nabla J(w(k))^\top H \, \nabla J(w(k)) \). Tip: Take into account the gradient descent update rule \( w(k+1) = w(k) - a(k) \nabla J(w(k)) \).
(c) (1 point) Show that differentiating the above expression with respect to \( a(k) \) and setting the derivative to zero results in: \( a(k) = \dfrac{\nabla J(w(k))^\top \nabla J(w(k))}{\nabla J(w(k))^\top H \, \nabla J(w(k))} \). This expression gives the closed-form solution for the optimal \( a(k) \) in each iteration (i.e. the \( a(k) \) that minimizes the second-order approximation of the target function for the next iteration).
(d) (Bonus) What is the cost of computing \( a(k) \) at each iteration using the above expression?
Sri K.
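As a small illustration of the step-size question above, here is a sketch that applies the ratio of \( \nabla J^\top \nabla J \) to \( \nabla J^\top H \nabla J \) to a quadratic target \( J(w) = \tfrac{1}{2} w^\top A w - b^\top w \), for which the second-order Taylor expansion is exact and the Hessian is the constant matrix \( A \). The quadratic target, the matrix `A`, the vector `b`, and the function names are assumptions made for the example and do not come from the question.

```python
def dot(a, b):
    """Inner product of two equal-length lists."""
    return sum(x * y for x, y in zip(a, b))


def matvec(A, x):
    """Matrix-vector product for a matrix stored as a list of rows."""
    return [dot(row, x) for row in A]


def J(A, b, w):
    """Quadratic target J(w) = 0.5 * w^T A w - b^T w."""
    return 0.5 * dot(w, matvec(A, w)) - dot(b, w)


def exact_step_gd(A, b, w, iters=50):
    """Gradient descent where each step size is g^T g / (g^T H g), H = A."""
    for _ in range(iters):
        g = [gi - bi for gi, bi in zip(matvec(A, w), b)]  # gradient A w - b
        gg = dot(g, g)
        if gg < 1e-24:  # gradient is numerically zero, stop
            break
        # The Hessian-vector product dominates the cost: O(n^2) for dense A.
        step = gg / dot(g, matvec(A, g))
        w = [wi - step * gi for wi, gi in zip(w, g)]
    return w
```

For a symmetric positive definite `A`, the iterates approach the solution of \( A w = b \); for a general target the Hessian changes with \( w \) and would have to be re-evaluated at every iteration.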
Given the function \( f(x) = x^2 \cos(x) + \sin(x) - x \), derive the variable step size \( \gamma \) that will ensure faster convergence of the gradient descent method and compare the results with (3) i) constant step size ii) decaying step size using various values of \( \alpha_0 \) and \( k \) iii) Bold driver algorithm
Akash M.
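For orientation only, the bold driver heuristic mentioned in the question above can be sketched as follows: accept a gradient step and enlarge the step size when the function value decreases, otherwise reject the step and shrink the step size. The growth factor 1.1, the shrink factor 0.5, the initial step, and the starting point are assumptions, not values given in the question, and this sketch does not derive the variable step size \( \gamma \) that the question asks for.

```python
import math


def f(x):
    """Target from the question: x^2 cos(x) + sin(x) - x."""
    return x * x * math.cos(x) + math.sin(x) - x


def df(x):
    """Derivative: 2x cos(x) - x^2 sin(x) + cos(x) - 1."""
    return 2 * x * math.cos(x) - x * x * math.sin(x) + math.cos(x) - 1


def bold_driver(x, step=0.1, grow=1.1, shrink=0.5, iters=200):
    """Gradient descent with the bold driver step-size heuristic."""
    fx = f(x)
    for _ in range(iters):
        x_new = x - step * df(x)
        f_new = f(x_new)
        if f_new < fx:
            # Improvement: accept the step and be slightly bolder next time.
            x, fx = x_new, f_new
            step *= grow
        else:
            # No improvement: reject the step and cut the step size.
            step *= shrink
    return x, fx
```

Because a step is accepted only when it lowers \( f \), the returned value never exceeds the starting value; which local minimum is reached depends on the starting point, since this \( f \) is not convex.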
Which of the following statement(s) is/are true for Gradient Descent (GD) and Stochastic Gradient Descent (SGD)?
1. In GD and SGD, you update a set of parameters in an iterative manner to minimize the error function.
2. In SGD, you have to run through all the samples in your training set for a single update of a parameter in each iteration.
3. In GD, you either use the entire data or a subset of training data to update a parameter in each iteration.
a. 1 and 2 b. 2 and 3 c. Only 2 d. Only 3 e. 1, 2 and 3 f. Only 1
Madhur L.