Appendix
Some supplemental information for the topics we have covered.
Convexity
Convex Function
A function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is convex if it satisfies the following.

$$f(\alpha x + (1-\alpha)y) \leq \alpha f(x) + (1-\alpha)f(y), \quad \forall \alpha \in (0,1),\ x,y \in \mathbb{R}^n$$
If $f$ is differentiable, then $f$ is convex if and only if for all $x$, $y$,

$$f(y) \geq f(x) + \nabla f(x)^{T}(y-x)$$
$f$ is strictly convex if and only if for all $x \neq y$,

$$f(y) > f(x) + \nabla f(x)^{T}(y-x)$$
$f$ is $\mu$-strongly convex if and only if there exists a $\mu > 0$ such that for all $x$, $y$,

$$f(y) \geq f(x) + \nabla f(x)^{T}(y-x) + \frac{\mu}{2} \|x-y\|^2$$
If the objective function $f$ is convex, then any local minimum is a global minimum.
If the objective function $f$ is strictly convex, then the minimum, if one exists, is unique.
If the objective function $f$ is strongly convex, then the minimum exists and is unique.
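These inequalities are easy to sanity-check numerically. Below is a minimal sketch (assuming NumPy; the quadratic $f(x) = \|x\|^2$ and its constant $\mu = 2$ are illustrative choices, not taken from the text above) that verifies the first-order and strong convexity conditions at random point pairs.

```python
import numpy as np

rng = np.random.default_rng(0)

# f(x) = ||x||^2 has gradient 2x and is 2-strongly convex (mu = 2).
def f(x):
    return np.dot(x, x)

def grad_f(x):
    return 2 * x

mu = 2.0
for _ in range(1000):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    lower = f(x) + grad_f(x) @ (y - x)
    # First-order condition: f(y) >= f(x) + grad_f(x)^T (y - x).
    assert f(y) >= lower - 1e-9
    # Strong convexity adds (mu/2) ||x - y||^2 to the bound; for this
    # quadratic the strengthened inequality holds with equality.
    assert f(y) >= lower + (mu / 2) * np.dot(x - y, x - y) - 1e-9
print("convexity conditions hold at 1000 random point pairs")
```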
Lipschitz Continuity
For the following convex optimization problem,
$$\min_x f(x)$$
$f$ has a Lipschitz-continuous gradient if there exists $L > 0$ such that

$$\| \nabla f(x) - \nabla f(y) \| \leq L \|x-y\|, \quad \forall x,y \in \mathbb{R}^n$$
If $f$ is twice differentiable such that $\| \nabla^2 f(x) \| \leq L$ for all $x$ and some $L > 0$, then $f$ has a Lipschitz-continuous gradient with constant $L$.
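For the least squares objective used later in this appendix, the Hessian is the constant matrix $A^T A$, so its spectral norm gives a valid $L$. A minimal sketch, assuming NumPy and randomly generated placeholder data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
A = rng.standard_normal((n, p))   # data matrix (placeholder data)
b = rng.standard_normal(n)        # response vector

# For f(theta) = (1/2)||A theta - b||^2, the Hessian is A^T A, so the
# gradient's Lipschitz constant is its spectral norm (largest eigenvalue).
L = np.linalg.norm(A.T @ A, 2)

grad = lambda theta: A.T @ (A @ theta - b)
x, y = rng.standard_normal(p), rng.standard_normal(p)
# Sanity check: ||grad(x) - grad(y)|| <= L ||x - y|| at a random pair.
assert np.linalg.norm(grad(x) - grad(y)) <= L * np.linalg.norm(x - y) + 1e-9
print(f"Lipschitz constant L = {L:.3f}")
```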
Consistency
Informally, an estimator is consistent if, as we use more and more data to estimate a parameter, the estimate converges (in probability) to the parameter's true value.
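A quick illustration (assuming NumPy; the Normal distribution and its parameters are arbitrary choices for the demo): the sample mean is a consistent estimator of the distribution mean, so its error shrinks as the sample size grows.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean = 3.0

# The sample mean converges to the true mean as n grows (consistency).
for n in [10, 1_000, 100_000]:
    samples = rng.normal(loc=true_mean, scale=2.0, size=n)
    print(f"n = {n:>7,}: estimate = {samples.mean():.4f}")
```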
Gradient Calculation
For simplicity, we use the least squares objective function $f$ (the factor of $\frac{1}{2}$ is included so that the gradient below carries no constant factor):

$$f(\boldsymbol{\theta}) = \frac{1}{2}\|A\boldsymbol{\theta}-\boldsymbol{b}\|^{2}$$
Recall that our data has $p$ features and $n$ total samples, so our matrix $A$ has dimensions $n \times p$. Our response vector $\boldsymbol{b}$ (the labels) has one entry per sample, so its dimensions are $n \times 1$. Lastly, our parameter vector $\boldsymbol{\theta}$ has one entry per feature, so its dimensions are $p \times 1$.
Full gradient calculation
$$\nabla f\left(\boldsymbol{\theta}\right)=\begin{bmatrix}
a_{1,1} & a_{1,2} & \cdots & a_{1,p-1} & a_{1,p}\\
a_{2,1} & a_{2,2} & \cdots & a_{2,p-1} & a_{2,p}\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
a_{n,1} & a_{n,2} & \cdots & a_{n,p-1} & a_{n,p}
\end{bmatrix}^{T}
\left(\begin{bmatrix}
a_{1,1} & a_{1,2} & \cdots & a_{1,p-1} & a_{1,p}\\
a_{2,1} & a_{2,2} & \cdots & a_{2,p-1} & a_{2,p}\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
a_{n,1} & a_{n,2} & \cdots & a_{n,p-1} & a_{n,p}
\end{bmatrix}
\begin{bmatrix}
\theta_{1}\\
\theta_{2}\\
\vdots\\
\theta_{p}
\end{bmatrix}-\begin{bmatrix}
b_{1}\\
b_{2}\\
\vdots\\
b_{n}
\end{bmatrix}\right)=\begin{bmatrix}
\nabla_{\text{full}}f\left(\boldsymbol{\theta}\right)_{1}\\
\nabla_{\text{full}}f\left(\boldsymbol{\theta}\right)_{2}\\
\vdots\\
\nabla_{\text{full}}f\left(\boldsymbol{\theta}\right)_{p}
\end{bmatrix}$$
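In code, this whole display collapses to one line. A minimal NumPy sketch (the data here is randomly generated for illustration):

```python
import numpy as np

def full_gradient(A, b, theta):
    """Gradient of f(theta) = (1/2)||A theta - b||^2, i.e. A^T (A theta - b)."""
    return A.T @ (A @ theta - b)

rng = np.random.default_rng(0)
n, p = 100, 5
A = rng.standard_normal((n, p))    # n samples, p features
b = rng.standard_normal(n)         # one response per sample
theta = np.zeros(p)                # one parameter per feature

print(full_gradient(A, b, theta))  # a p-dimensional gradient vector
```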
Stochastic gradient calculation
Select a random index $i$ from $\{1, \ldots, n\}$. The selected row of $A$ and entry of $\boldsymbol{b}$ are highlighted in green below; note that the full parameter vector $\boldsymbol{\theta}$ is still used.
$$\nabla f\left(\boldsymbol{\theta}\right)=\begin{bmatrix}
a_{1,1} & a_{1,2} & \cdots & a_{1,p-1} & a_{1,p}\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
\color{green}a_{i,1} & \color{green}a_{i,2} & \color{green}\cdots & \color{green}a_{i,p-1} & \color{green}a_{i,p}\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
a_{n,1} & a_{n,2} & \cdots & a_{n,p-1} & a_{n,p}
\end{bmatrix}^{T}
\left(\begin{bmatrix}
a_{1,1} & a_{1,2} & \cdots & a_{1,p-1} & a_{1,p}\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
\color{green}a_{i,1} & \color{green}a_{i,2} & \color{green}\cdots & \color{green}a_{i,p-1} & \color{green}a_{i,p}\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
a_{n,1} & a_{n,2} & \cdots & a_{n,p-1} & a_{n,p}
\end{bmatrix}
\begin{bmatrix}
\theta_{1}\\
\theta_{2}\\
\vdots\\
\theta_{p}
\end{bmatrix}-\begin{bmatrix}
b_{1}\\
\vdots\\
\color{green}b_{i}\\
\vdots\\
b_{n}
\end{bmatrix}\right)$$
Compute the stochastic gradient (an approximation of the full gradient) from the selected row alone.
$$\nabla f_{i}\left(\boldsymbol{\theta}\right)=\begin{bmatrix}
a_{i,1} & a_{i,2} & \cdots & a_{i,p-1} & a_{i,p}
\end{bmatrix}^{T}
\left(\begin{bmatrix}
a_{i,1} & a_{i,2} & \cdots & a_{i,p-1} & a_{i,p}
\end{bmatrix}
\begin{bmatrix}
\theta_{1}\\
\theta_{2}\\
\vdots\\
\theta_{p}
\end{bmatrix}-\begin{bmatrix}
b_{i}
\end{bmatrix}\right)=\begin{bmatrix}
\nabla_{\text{stochastic}}f_{i}\left(\boldsymbol{\theta}\right)_{1}\\
\nabla_{\text{stochastic}}f_{i}\left(\boldsymbol{\theta}\right)_{2}\\
\vdots\\
\nabla_{\text{stochastic}}f_{i}\left(\boldsymbol{\theta}\right)_{p}
\end{bmatrix}$$
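A minimal NumPy sketch of one stochastic gradient evaluation (function and variable names are our own):

```python
import numpy as np

def stochastic_gradient(A, b, theta, rng):
    """Gradient of f_i(theta) = (1/2)(a_i . theta - b_i)^2 for a random row i."""
    i = rng.integers(A.shape[0])       # pick a random sample index
    a_i = A[i]                         # the selected row of A
    return a_i * (a_i @ theta - b[i])  # p-dimensional, like the full gradient

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))
b = rng.standard_normal(100)
print(stochastic_gradient(A, b, np.zeros(5), rng))
```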
Batch gradient calculation
We first select a subset of indices $I_k \subset \{1, \ldots, n\}$ where $|I_k| = b \ll n$. In this example we select $i$, $j$, $k$ (highlighted in green below).
$$\nabla f\left(\boldsymbol{\theta}\right)=\begin{bmatrix}
a_{1,1} & a_{1,2} & \cdots & a_{1,p-1} & a_{1,p}\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
\color{green}a_{i,1} & \color{green}a_{i,2} & \color{green}\cdots & \color{green}a_{i,p-1} & \color{green}a_{i,p}\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
\color{green}a_{j,1} & \color{green}a_{j,2} & \color{green}\cdots & \color{green}a_{j,p-1} & \color{green}a_{j,p}\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
\color{green}a_{k,1} & \color{green}a_{k,2} & \color{green}\cdots & \color{green}a_{k,p-1} & \color{green}a_{k,p}\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
a_{n,1} & a_{n,2} & \cdots & a_{n,p-1} & a_{n,p}
\end{bmatrix}^{T}
\left(\begin{bmatrix}
a_{1,1} & a_{1,2} & \cdots & a_{1,p-1} & a_{1,p}\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
\color{green}a_{i,1} & \color{green}a_{i,2} & \color{green}\cdots & \color{green}a_{i,p-1} & \color{green}a_{i,p}\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
\color{green}a_{j,1} & \color{green}a_{j,2} & \color{green}\cdots & \color{green}a_{j,p-1} & \color{green}a_{j,p}\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
\color{green}a_{k,1} & \color{green}a_{k,2} & \color{green}\cdots & \color{green}a_{k,p-1} & \color{green}a_{k,p}\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
a_{n,1} & a_{n,2} & \cdots & a_{n,p-1} & a_{n,p}
\end{bmatrix}
\begin{bmatrix}
\theta_{1}\\
\theta_{2}\\
\vdots\\
\theta_{p}
\end{bmatrix}-\begin{bmatrix}
b_{1}\\
\vdots\\
\color{green}b_{i}\\
\vdots\\
\color{green}b_{j}\\
\vdots\\
\color{green}b_{k}\\
\vdots\\
b_{n}
\end{bmatrix}\right)$$
We compute the batch gradient.
$$\nabla f_{\text{batch}}\left(\boldsymbol{\theta}\right)=\frac{1}{b}\begin{bmatrix}
a_{i,1} & a_{i,2} & \cdots & a_{i,p-1} & a_{i,p} \\
a_{j,1} & a_{j,2} & \cdots & a_{j,p-1} & a_{j,p} \\
a_{k,1} & a_{k,2} & \cdots & a_{k,p-1} & a_{k,p}
\end{bmatrix}^{T}
\left(\begin{bmatrix}
a_{i,1} & a_{i,2} & \cdots & a_{i,p-1} & a_{i,p} \\
a_{j,1} & a_{j,2} & \cdots & a_{j,p-1} & a_{j,p} \\
a_{k,1} & a_{k,2} & \cdots & a_{k,p-1} & a_{k,p}
\end{bmatrix}
\begin{bmatrix}
\theta_{1}\\
\theta_{2}\\
\vdots\\
\theta_{p}
\end{bmatrix}-\begin{bmatrix}
b_{i} \\
b_{j} \\
b_{k}
\end{bmatrix}\right)=\begin{bmatrix}
\nabla_{\text{avg}}f_{\text{batch}}\left(\boldsymbol{\theta}\right)_{1}\\
\nabla_{\text{avg}}f_{\text{batch}}\left(\boldsymbol{\theta}\right)_{2}\\
\vdots\\
\nabla_{\text{avg}}f_{\text{batch}}\left(\boldsymbol{\theta}\right)_{p}
\end{bmatrix}$$
This results in the average of the selected samples' gradients; the $\frac{1}{b}$ factor converts the sum over the batch into an average.
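A minimal NumPy sketch (names are our own; we call the response vector `b_vec` to avoid clashing with the batch size $b$):

```python
import numpy as np

def batch_gradient(A, b_vec, theta, batch_size, rng):
    """Average gradient over a random mini-batch of batch_size rows."""
    idx = rng.choice(A.shape[0], size=batch_size, replace=False)
    A_batch, b_batch = A[idx], b_vec[idx]
    # Same formula as the full gradient, restricted to the batch,
    # then divided by the batch size to form an average.
    return A_batch.T @ (A_batch @ theta - b_batch) / batch_size

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))
b_vec = rng.standard_normal(100)
print(batch_gradient(A, b_vec, np.zeros(5), batch_size=3, rng=rng))
```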