# **Lab: Back-propagation and gradient descent**

In this lab, we are going to implement the gradient descent algorithm for a simple feed-forward neural network from scratch.

We consider a neural network with two inputs, two hidden layers (each with two hidden nodes), and one output. We use the ReLU activation function and assume no biases at the nodes. Mathematically, the network can be represented as:

$$
h = \sigma \left( W^T x\right), ~~~ k = \sigma ( U^T h), ~~~ z = V^T k
$$

where $\sigma$ denotes the ReLU activation and

$~~~~~~~~W =
\begin{bmatrix}
w_{11} & w_{12} \\
w_{21} & w_{22}
\end{bmatrix},~
$
$U =
\begin{bmatrix}
u_{11} & u_{12} \\
u_{21} & u_{22}
\end{bmatrix},
$
$V = \begin{bmatrix}
v_{1}  \\
v_{2}
\end{bmatrix}$

are the weights of the network. We will denote the output $z$ of this network as $f(\theta, x)$.

For a data point $(x, y)$, the loss is defined as
$$
L(\theta, x, y) = [f(\theta, x) - y]^2.
$$

## **Part 1: Define the forward map**

**Write a Python function to compute the ReLU activation function.**
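One possible sketch of a scalar ReLU is shown here. The name `relu` and the choice to work on plain Python floats (rather than arrays) are assumptions made for illustration.

```python
def relu(a):
    """ReLU activation: returns a when a is positive, and 0 otherwise."""
    return max(0.0, a)
```

For example, `relu(-2.0)` gives `0.0` and `relu(3.5)` gives `3.5`.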

**Write a Python function to compute the forward map of the network.**

This function takes as input

*   $\theta = [w_{11}, w_{12}, w_{21}, w_{22}, u_{11}, u_{12}, u_{21}, u_{22}, v_{1}, v_{2}] \in \mathbb{R}^{10}$
*   $x \in \mathbb{R}^2$, $y \in \mathbb{R}$

and outputs

$$
[h_1, h_2, k_1, k_2, z, L]
$$
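A minimal sketch of the forward map, assuming $\theta$ is passed as a flat sequence of ten numbers in the order listed above and $x$ as a pair. The function names `relu` and `forward` are hypothetical. Because the model uses $W^T x$, column $j$ of $W$ feeds hidden node $h_j$, so $h_1$ depends on $w_{11}$ and $w_{21}$.

```python
def relu(a):
    """ReLU activation on a scalar."""
    return max(0.0, a)


def forward(theta, x, y):
    """Return [h1, h2, k1, k2, z, L] for one data point (x, y)."""
    w11, w12, w21, w22, u11, u12, u21, u22, v1, v2 = theta
    x1, x2 = x

    # First hidden layer: h = relu(W^T x)
    h1 = relu(w11 * x1 + w21 * x2)
    h2 = relu(w12 * x1 + w22 * x2)

    # Second hidden layer: k = relu(U^T h)
    k1 = relu(u11 * h1 + u21 * h2)
    k2 = relu(u12 * h1 + u22 * h2)

    # Linear output and squared-error loss
    z = v1 * k1 + v2 * k2
    L = (z - y) ** 2
    return [h1, h2, k1, k2, z, L]
```

Unpacking $\theta$ into named scalars keeps the code close to the notation of the lab; a vectorized version with 2x2 arrays would work equally well.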

**Sanity check:** Pick $\theta$ randomly in $[-1, 1]^{10}$, $x = \begin{bmatrix}
-1 \\
1
\end{bmatrix}$, and $y = 60$.

Compute and print out $L(\theta, x, y)$.
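The sanity check could be run along these lines. This is a sketch under assumptions: `relu` and `forward` are hypothetical helper names, $\theta$ is drawn uniformly from $[-1, 1]^{10}$ with the standard `random` module, and the seed is fixed only so that the run is repeatable.

```python
import random


def relu(a):
    return max(0.0, a)


def forward(theta, x, y):
    """Return [h1, h2, k1, k2, z, L] for one data point (x, y)."""
    w11, w12, w21, w22, u11, u12, u21, u22, v1, v2 = theta
    x1, x2 = x
    h1 = relu(w11 * x1 + w21 * x2)
    h2 = relu(w12 * x1 + w22 * x2)
    k1 = relu(u11 * h1 + u21 * h2)
    k2 = relu(u12 * h1 + u22 * h2)
    z = v1 * k1 + v2 * k2
    return [h1, h2, k1, k2, z, (z - y) ** 2]


random.seed(0)  # assumption: fixed seed, chosen only for repeatability
theta = [random.uniform(-1.0, 1.0) for _ in range(10)]
x = [-1.0, 1.0]
y = 60.0

h1, h2, k1, k2, z, L = forward(theta, x, y)
print("L(theta, x, y) =", L)
```

Since the weights are small and the target is $60$, the printed loss should typically be of the order of $60^2$. If both nodes of a layer are inactive, $z = 0$ and the loss is exactly $3600$.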

## **Part 2: Back-propagation**

**Write a Python function to compute the derivative of the ReLU activation function.**
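A possible sketch of the derivative follows. ReLU is not differentiable at $0$, so the value returned there is a convention; this sketch assumes $0$. The name `relu_derivative` is hypothetical.

```python
def relu_derivative(a):
    """Derivative of ReLU: 1 for a > 0, 0 otherwise (0 at a = 0 by convention)."""
    return 1.0 if a > 0 else 0.0
```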

**Write a Python function to compute the derivative of $L$ with respect to all parameters of the network.**

This function takes as input

*   $\theta = [w_{11}, w_{12}, w_{21}, w_{22}, u_{11}, u_{12}, u_{21}, u_{22}, v_{1}, v_{2}] \in \mathbb{R}^{10}$
*   $x \in \mathbb{R}^2$, $y \in \mathbb{R}$

and outputs $\nabla_{\theta}{L}$.
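The gradient can be obtained by applying the chain rule from the output back to the input. The sketch below is one way to do this, not the only one. It introduces the pre-activations $a = W^T x$ and $b = U^T h$ (names not used elsewhere in the lab), and the function names are hypothetical. The gradient is returned in the same order as $\theta$.

```python
def relu(a):
    return max(0.0, a)


def relu_derivative(a):
    return 1.0 if a > 0 else 0.0


def gradient(theta, x, y):
    """Return the gradient of L(theta, x, y) with respect to theta."""
    w11, w12, w21, w22, u11, u12, u21, u22, v1, v2 = theta
    x1, x2 = x

    # Forward pass, keeping the pre-activations a and b
    a1 = w11 * x1 + w21 * x2
    a2 = w12 * x1 + w22 * x2
    h1, h2 = relu(a1), relu(a2)
    b1 = u11 * h1 + u21 * h2
    b2 = u12 * h1 + u22 * h2
    k1, k2 = relu(b1), relu(b2)
    z = v1 * k1 + v2 * k2

    # Backward pass: dL/dz, then layer by layer
    dz = 2.0 * (z - y)
    dv1, dv2 = dz * k1, dz * k2

    db1 = dz * v1 * relu_derivative(b1)
    db2 = dz * v2 * relu_derivative(b2)
    du11, du12 = db1 * h1, db2 * h1
    du21, du22 = db1 * h2, db2 * h2

    da1 = (u11 * db1 + u12 * db2) * relu_derivative(a1)
    da2 = (u21 * db1 + u22 * db2) * relu_derivative(a2)
    dw11, dw12 = da1 * x1, da2 * x1
    dw21, dw22 = da1 * x2, da2 * x2

    return [dw11, dw12, dw21, dw22, du11, du12, du21, du22, dv1, dv2]
```

A useful check is to compare each entry with a central finite difference of $L$, which should agree closely as long as no pre-activation is near $0$.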

## **Part 3: Gradient descent**

Consider the dataset of two data points:
*   $x_1 = \begin{bmatrix}
-1 \\
1
\end{bmatrix}$, and $y_1 = 60$.
*   $x_2 = \begin{bmatrix}
-1 \\
0.5
\end{bmatrix}$, and $y_2 = 20$.

Define
$$
J(\theta) = L(\theta, x_1, y_1) + L(\theta, x_2, y_2).
$$

Implement the following procedure:

*   Start at a random value of $\theta$.
*   Perform 100 steps of gradient descent on the objective function $J(\theta)$ with learning rate $\rho = 0.005$, using the update $\theta \leftarrow \theta - \rho \nabla_{\theta} J(\theta)$.
*   Plot the value of the objective function over the steps.
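The procedure above can be sketched as follows. All function names are hypothetical, the random start is drawn from $[-1, 1]^{10}$ with a fixed seed (an assumption, for repeatability only), and plotting assumes that `matplotlib` is installed. The history holds 101 values: the objective at the start and after each of the 100 steps.

```python
import random

DATA = [([-1.0, 1.0], 60.0), ([-1.0, 0.5], 20.0)]  # (x_i, y_i) pairs from the lab


def relu(a):
    return max(0.0, a)


def relu_derivative(a):
    return 1.0 if a > 0 else 0.0


def loss_and_gradient(theta, x, y):
    """Return L(theta, x, y) and its gradient, ordered like theta."""
    w11, w12, w21, w22, u11, u12, u21, u22, v1, v2 = theta
    x1, x2 = x
    a1, a2 = w11 * x1 + w21 * x2, w12 * x1 + w22 * x2
    h1, h2 = relu(a1), relu(a2)
    b1, b2 = u11 * h1 + u21 * h2, u12 * h1 + u22 * h2
    k1, k2 = relu(b1), relu(b2)
    z = v1 * k1 + v2 * k2
    dz = 2.0 * (z - y)
    db1 = dz * v1 * relu_derivative(b1)
    db2 = dz * v2 * relu_derivative(b2)
    da1 = (u11 * db1 + u12 * db2) * relu_derivative(a1)
    da2 = (u21 * db1 + u22 * db2) * relu_derivative(a2)
    grad = [da1 * x1, da2 * x1, da1 * x2, da2 * x2,   # W entries
            db1 * h1, db2 * h1, db1 * h2, db2 * h2,   # U entries
            dz * k1, dz * k2]                         # V entries
    return (z - y) * (z - y), grad


def objective_and_gradient(theta, data):
    """Return J(theta) and its gradient, summed over all data points."""
    J, grad_J = 0.0, [0.0] * len(theta)
    for x, y in data:
        L, g = loss_and_gradient(theta, x, y)
        J += L
        grad_J = [gj + gi for gj, gi in zip(grad_J, g)]
    return J, grad_J


def gradient_descent(theta, data, steps=100, lr=0.005):
    """Run plain gradient descent; return the final theta and the J history."""
    theta = list(theta)
    history = []
    for _ in range(steps):
        J, grad_J = objective_and_gradient(theta, data)
        history.append(J)
        theta = [t - lr * g for t, g in zip(theta, grad_J)]
    history.append(objective_and_gradient(theta, data)[0])
    return theta, history


def plot_objective(history):
    """Plot J over the steps (matplotlib is imported here and assumed installed)."""
    import matplotlib.pyplot as plt
    plt.plot(range(len(history)), history)
    plt.xlabel("step")
    plt.ylabel("J(theta)")
    plt.show()


random.seed(0)  # assumption: fixed seed, chosen only for repeatability
theta0 = [random.uniform(-1.0, 1.0) for _ in range(10)]
theta_final, history = gradient_descent(theta0, DATA)
print("J at start:", history[0], "| J after 100 steps:", history[-1])
```

In a notebook, calling `plot_objective(history)` draws the curve. The outcome depends on the random start: if the ReLU units are inactive for both data points the gradient is zero and the curve stays flat, and an unlucky start can also make the objective grow at this learning rate, so trying several seeds is worthwhile.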

