Appendix
Some supplemental information for the topics we have covered.
Convexity
Convex Function
A function $f: \mathbb{R}^n \rightarrow \mathbb{R}$ is convex if it satisfies the following.

$$f(\alpha x + (1-\alpha)y) \leq \alpha f(x) + (1-\alpha)f(y), \quad \forall \alpha \in (0,1),\ x,y \in \mathbb{R}^n$$
If $f$ is differentiable, then $f$ is convex if and only if for all $x$, $y$,

$$f(y) \geq f(x) + \nabla f(x)^{T}(y-x)$$
$f$ is strictly convex if and only if for all $x \neq y$,

$$f(y) > f(x) + \nabla f(x)^{T}(y-x)$$
$f$ is $\mu$-strongly convex if and only if there exists a $\mu > 0$ such that for all $x$, $y$,

$$f(y) \geq f(x) + \nabla f(x)^{T}(y-x) + \frac{\mu}{2} \|x-y\|^2$$
If the objective function $f$ is convex, then any local minimum is a global minimum.
If the objective function $f$ is strictly convex, then the minimum, if one exists, is unique.
If the objective function $f$ is strongly convex, then the minimum exists and is unique.
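These inequalities are easy to sanity-check numerically. Below is a minimal sketch (assuming NumPy; the quadratic $f(x) = \|x\|^2$ and its constant $\mu = 2$ are illustrative choices, not taken from the text above) that verifies the first-order and strong convexity conditions at random point pairs.

```python
import numpy as np

rng = np.random.default_rng(0)

# f(x) = ||x||^2 has gradient 2x and is 2-strongly convex (mu = 2).
def f(x):
    return np.dot(x, x)

def grad_f(x):
    return 2 * x

mu = 2.0
for _ in range(1000):
    x, y = rng.standard_normal(5), rng.standard_normal(5)
    lower = f(x) + grad_f(x) @ (y - x)
    # First-order condition: f(y) >= f(x) + grad_f(x)^T (y - x).
    assert f(y) >= lower - 1e-9
    # Strong convexity adds (mu/2) ||x - y||^2 to the bound; for this
    # quadratic the strengthened inequality holds with equality.
    assert f(y) >= lower + (mu / 2) * np.dot(x - y, x - y) - 1e-9
print("convexity conditions hold at 1000 random point pairs")
```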
Lipschitz Continuity
For the following convex optimization problem,
$$\min_x f(x)$$
$f$ has a Lipschitz-continuous gradient if there exists $L > 0$ such that

$$\| \nabla f(x) - \nabla f(y) \| \leq L \|x-y\|, \quad \forall x,y \in \mathbb{R}^n$$
If $f$ is twice differentiable such that $\| \nabla^2 f(x) \| \leq L$ for all $x$ and some $L > 0$, then $f$ has a Lipschitz-continuous gradient with constant $L$.
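For the least squares objective used later in this appendix, the Hessian is the constant matrix $A^T A$, so its spectral norm gives a valid $L$. A minimal sketch, assuming NumPy and randomly generated placeholder data:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 5
A = rng.standard_normal((n, p))   # data matrix (placeholder data)
b = rng.standard_normal(n)        # response vector

# For f(theta) = (1/2)||A theta - b||^2, the Hessian is A^T A, so the
# gradient's Lipschitz constant is its spectral norm (largest eigenvalue).
L = np.linalg.norm(A.T @ A, 2)

grad = lambda theta: A.T @ (A @ theta - b)
x, y = rng.standard_normal(p), rng.standard_normal(p)
# Sanity check: ||grad(x) - grad(y)|| <= L ||x - y|| at a random pair.
assert np.linalg.norm(grad(x) - grad(y)) <= L * np.linalg.norm(x - y) + 1e-9
print(f"Lipschitz constant L = {L:.3f}")
```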
Consistency
Informally, an estimator is consistent if, as we use more and more data to estimate a parameter, the estimate converges (in probability) to the parameter's true value.
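A quick illustration (assuming NumPy; the Normal distribution and its parameters are arbitrary choices for the demo): the sample mean is a consistent estimator of the distribution mean, so its error shrinks as the sample size grows.

```python
import numpy as np

rng = np.random.default_rng(0)
true_mean = 3.0

# The sample mean converges to the true mean as n grows (consistency).
for n in [10, 1_000, 100_000]:
    samples = rng.normal(loc=true_mean, scale=2.0, size=n)
    print(f"n = {n:>7,}: estimate = {samples.mean():.4f}")
```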
Gradient Calculation
For simplicity, we use the least squares objective function $f$ (the factor of $\frac{1}{2}$ is included so that the gradient below carries no constant factor):

$$f(\boldsymbol{\theta}) = \frac{1}{2}\|A\boldsymbol{\theta}-\boldsymbol{b}\|^{2}$$
Recall that our data has $p$ features and $n$ total samples, so our matrix $A$ has dimensions $n \times p$. Our response vector $\boldsymbol{b}$ (the labels) has one entry per sample, so its dimensions are $n \times 1$. Lastly, our parameter vector $\boldsymbol{\theta}$ has one entry per feature, so its dimensions are $p \times 1$.
Full gradient calculation
$$\nabla f\left(\boldsymbol{\theta}\right)=\begin{bmatrix}
a_{1,1} & a_{1,2} & \cdots & a_{1,p-1} & a_{1,p}\\
a_{2,1} & a_{2,2} & \cdots & a_{2,p-1} & a_{2,p}\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
a_{n,1} & a_{n,2} & \cdots & a_{n,p-1} & a_{n,p}
\end{bmatrix}^{T}
\left(\begin{bmatrix}
a_{1,1} & a_{1,2} & \cdots & a_{1,p-1} & a_{1,p}\\
a_{2,1} & a_{2,2} & \cdots & a_{2,p-1} & a_{2,p}\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
a_{n,1} & a_{n,2} & \cdots & a_{n,p-1} & a_{n,p}
\end{bmatrix}
\begin{bmatrix}
\theta_{1}\\
\theta_{2}\\
\vdots\\
\theta_{p}
\end{bmatrix}-\begin{bmatrix}
b_{1}\\
b_{2}\\
\vdots\\
b_{n}
\end{bmatrix}\right)=\begin{bmatrix}
\nabla_{\text{full}}f\left(\boldsymbol{\theta}\right)_{1}\\
\nabla_{\text{full}}f\left(\boldsymbol{\theta}\right)_{2}\\
\vdots\\
\nabla_{\text{full}}f\left(\boldsymbol{\theta}\right)_{p}
\end{bmatrix}$$
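In code, this whole display collapses to one line. A minimal NumPy sketch (the data here is randomly generated for illustration):

```python
import numpy as np

def full_gradient(A, b, theta):
    """Gradient of f(theta) = (1/2)||A theta - b||^2, i.e. A^T (A theta - b)."""
    return A.T @ (A @ theta - b)

rng = np.random.default_rng(0)
n, p = 100, 5
A = rng.standard_normal((n, p))    # n samples, p features
b = rng.standard_normal(n)         # one response per sample
theta = np.zeros(p)                # one parameter per feature

print(full_gradient(A, b, theta))  # a p-dimensional gradient vector
```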
Stochastic gradient calculation
Select a random index $i$ from $\{1, \ldots, n\}$. The selected row of $A$ and entry of $\boldsymbol{b}$ are highlighted in green below; note that the full parameter vector $\boldsymbol{\theta}$ is still used.
$$\nabla f\left(\boldsymbol{\theta}\right)=\begin{bmatrix}
a_{1,1} & a_{1,2} & \cdots & a_{1,p-1} & a_{1,p}\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
\color{green}a_{i,1} & \color{green}a_{i,2} & \color{green}\cdots & \color{green}a_{i,p-1} & \color{green}a_{i,p}\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
a_{n,1} & a_{n,2} & \cdots & a_{n,p-1} & a_{n,p}
\end{bmatrix}^{T}
\left(\begin{bmatrix}
a_{1,1} & a_{1,2} & \cdots & a_{1,p-1} & a_{1,p}\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
\color{green}a_{i,1} & \color{green}a_{i,2} & \color{green}\cdots & \color{green}a_{i,p-1} & \color{green}a_{i,p}\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
a_{n,1} & a_{n,2} & \cdots & a_{n,p-1} & a_{n,p}
\end{bmatrix}
\begin{bmatrix}
\theta_{1}\\
\theta_{2}\\
\vdots\\
\theta_{p}
\end{bmatrix}-\begin{bmatrix}
b_{1}\\
\vdots\\
\color{green}b_{i}\\
\vdots\\
b_{n}
\end{bmatrix}\right)$$
Compute the stochastic gradient (an approximation of the full gradient) from the selected row alone.
$$\nabla f_{i}\left(\boldsymbol{\theta}\right)=\begin{bmatrix}
a_{i,1} & a_{i,2} & \cdots & a_{i,p-1} & a_{i,p}
\end{bmatrix}^{T}
\left(\begin{bmatrix}
a_{i,1} & a_{i,2} & \cdots & a_{i,p-1} & a_{i,p}
\end{bmatrix}
\begin{bmatrix}
\theta_{1}\\
\theta_{2}\\
\vdots\\
\theta_{p}
\end{bmatrix}-\begin{bmatrix}
b_{i}
\end{bmatrix}\right)=\begin{bmatrix}
\nabla_{\text{stochastic}}f_{i}\left(\boldsymbol{\theta}\right)_{1}\\
\nabla_{\text{stochastic}}f_{i}\left(\boldsymbol{\theta}\right)_{2}\\
\vdots\\
\nabla_{\text{stochastic}}f_{i}\left(\boldsymbol{\theta}\right)_{p}
\end{bmatrix}$$
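A minimal NumPy sketch of one stochastic gradient evaluation (function and variable names are our own):

```python
import numpy as np

def stochastic_gradient(A, b, theta, rng):
    """Gradient of f_i(theta) = (1/2)(a_i . theta - b_i)^2 for a random row i."""
    i = rng.integers(A.shape[0])       # pick a random sample index
    a_i = A[i]                         # the selected row of A
    return a_i * (a_i @ theta - b[i])  # p-dimensional, like the full gradient

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))
b = rng.standard_normal(100)
print(stochastic_gradient(A, b, np.zeros(5), rng))
```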
Batch gradient calculation
We first select a subset of indices $I_k \subset \{1, \ldots, n\}$ where $|I_k| = b \ll n$. In this example we select $i$, $j$, $k$ (highlighted in green below).
$$\nabla f\left(\boldsymbol{\theta}\right)=\begin{bmatrix}
a_{1,1} & a_{1,2} & \cdots & a_{1,p-1} & a_{1,p}\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
\color{green}a_{i,1} & \color{green}a_{i,2} & \color{green}\cdots & \color{green}a_{i,p-1} & \color{green}a_{i,p}\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
\color{green}a_{j,1} & \color{green}a_{j,2} & \color{green}\cdots & \color{green}a_{j,p-1} & \color{green}a_{j,p}\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
\color{green}a_{k,1} & \color{green}a_{k,2} & \color{green}\cdots & \color{green}a_{k,p-1} & \color{green}a_{k,p}\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
a_{n,1} & a_{n,2} & \cdots & a_{n,p-1} & a_{n,p}
\end{bmatrix}^{T}
\left(\begin{bmatrix}
a_{1,1} & a_{1,2} & \cdots & a_{1,p-1} & a_{1,p}\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
\color{green}a_{i,1} & \color{green}a_{i,2} & \color{green}\cdots & \color{green}a_{i,p-1} & \color{green}a_{i,p}\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
\color{green}a_{j,1} & \color{green}a_{j,2} & \color{green}\cdots & \color{green}a_{j,p-1} & \color{green}a_{j,p}\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
\color{green}a_{k,1} & \color{green}a_{k,2} & \color{green}\cdots & \color{green}a_{k,p-1} & \color{green}a_{k,p}\\
\vdots & \vdots & \ddots & \vdots & \vdots\\
a_{n,1} & a_{n,2} & \cdots & a_{n,p-1} & a_{n,p}
\end{bmatrix}
\begin{bmatrix}
\theta_{1}\\
\theta_{2}\\
\vdots\\
\theta_{p}
\end{bmatrix}-\begin{bmatrix}
b_{1}\\
\vdots\\
\color{green}b_{i}\\
\vdots\\
\color{green}b_{j}\\
\vdots\\
\color{green}b_{k}\\
\vdots\\
b_{n}
\end{bmatrix}\right)$$
We compute the batch gradient.
$$\nabla f_{\text{batch}}\left(\boldsymbol{\theta}\right)=\frac{1}{b}\begin{bmatrix}
a_{i,1} & a_{i,2} & \cdots & a_{i,p-1} & a_{i,p} \\
a_{j,1} & a_{j,2} & \cdots & a_{j,p-1} & a_{j,p} \\
a_{k,1} & a_{k,2} & \cdots & a_{k,p-1} & a_{k,p}
\end{bmatrix}^{T}
\left(\begin{bmatrix}
a_{i,1} & a_{i,2} & \cdots & a_{i,p-1} & a_{i,p} \\
a_{j,1} & a_{j,2} & \cdots & a_{j,p-1} & a_{j,p} \\
a_{k,1} & a_{k,2} & \cdots & a_{k,p-1} & a_{k,p}
\end{bmatrix}
\begin{bmatrix}
\theta_{1}\\
\theta_{2}\\
\vdots\\
\theta_{p}
\end{bmatrix}-\begin{bmatrix}
b_{i} \\
b_{j} \\
b_{k}
\end{bmatrix}\right)=\begin{bmatrix}
\nabla_{\text{avg}}f_{\text{batch}}\left(\boldsymbol{\theta}\right)_{1}\\
\nabla_{\text{avg}}f_{\text{batch}}\left(\boldsymbol{\theta}\right)_{2}\\
\vdots\\
\nabla_{\text{avg}}f_{\text{batch}}\left(\boldsymbol{\theta}\right)_{p}
\end{bmatrix}$$
This results in the average of the selected samples' gradients; the $\frac{1}{b}$ factor converts the sum over the batch into an average.
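A minimal NumPy sketch (names are our own; we call the response vector `b_vec` to avoid clashing with the batch size $b$):

```python
import numpy as np

def batch_gradient(A, b_vec, theta, batch_size, rng):
    """Average gradient over a random mini-batch of batch_size rows."""
    idx = rng.choice(A.shape[0], size=batch_size, replace=False)
    A_batch, b_batch = A[idx], b_vec[idx]
    # Same formula as the full gradient, restricted to the batch,
    # then divided by the batch size to form an average.
    return A_batch.T @ (A_batch @ theta - b_batch) / batch_size

rng = np.random.default_rng(0)
A = rng.standard_normal((100, 5))
b_vec = rng.standard_normal(100)
print(batch_gradient(A, b_vec, np.zeros(5), batch_size=3, rng=rng))
```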