2019-02-01

deeplearning.ai / Deep learning & NN

23 minutes read (About 3383 words)

Shallow Neural Networks-2

本文为 Andrew Ng 深度学习课程第一部分神经网络和深度学习的笔记，对应第三周浅层神经网络的相关课程及相关作业。

Why do you need non-linear activation function

为什么神经网络需要非线性的激活函数？不能使用线性的激活函数，比如 $g(z) = z$ 吗？

假设我们使用线性的激活函数 $g(z) = z$ ，那么有：

$a^{[1]} = z^{[1]} = W^{[1]}x + b^{[1]}$ $a^{[2]} = z^{[2]} = W^{[2]}a^{[1]} + b^{[2]}$

把 $a^{[1]}$ 带入，则有

$a^{[2]} = W^{[2]}(W^{[1]}x + b^{[1]}) + b^{[2]} = (W^{[2]}W{[1]})x + (W^{[2]}b^{[1]}+b{[2]}) = W'x+b'$

我们可以得到，$a^{[2]}$ 仍是输入 $x$ 的线性组合。也就是说，使用线性函数的神经网络仅仅只是把输入 $x$ 线性组合再输出。即便是包含许多隐藏层的神经网络，如果使用的是线性的激活函数，不管多少层，得到的输出依然是 $x$ 的线性组合，也就意味着隐藏层根本没有什么作用。所以，隐藏层激活函数必须是非线性的，否则将失去意义。

只有一个地方你可能会使用线性激活函数，在机器学习的回归问题中，$y$ 是一个实数，比如你像预测房地产的价格，那么 $y$ 是一个实数，而不是像二分类问题那样要么 $0$ 要么 $1$ ，这种情况下，在输出层你可能会使用线性激活函数，但隐藏层不会使用线性激活函数。

Derivatives of activation functions

在反向传播的过程中，我们需要计算激活函数的导数，那么我们来看一下上述这些激活函数的导数。

Sigmoid 函数 $g(z) = \frac{1}{1+e^{-z}}$

$g'(z) = g(z)(1-g(z)) = a(1-a)$

tanh 函数

$g(z) = \frac{e^{z}-e^{-z}}{e^{z}+e^{-z}}$ $g'(z) = 1-(g(z))^{2} = 1-a^2$

ReLU 函数 $g(z) = max(0,z)$

$g'(z) = \begin{cases} 0,\quad z < 0 \\ 1,\quad z \geq 0 \end{cases}$

Leaky ReLU 函数 $g(z) = max(0.01z,z)$

$g'(z) = \begin{cases} 0.01,\quad z < 0 \\\ 1,\quad \quad z \geq 0 \end{cases}$

Gradient descent for neural networks

好了，有了以上的铺垫，我们终于可以实现在单隐藏层神经网络上的梯度下降算法了。

由于是单隐藏层神经网络，那么我们有参数 $W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}$ 。我们一般用 $n_x = n^{[0]}$ 来表示输入层特征的个数，用 $n^{[1]}$ 表示隐藏层节点个数，用 $n^{[2]}=1$ 表示输出层节点个数。其中，$W^{[1]}$ 的维度为 $(n^{[1]}, n^{[0]})$ ，$b^{[1]}$ 的维度为 $(n^{[1]}, 1)$ ，$W^{[2]}$ 的维度为 $(n^{[2]}, n^{[1]})$ ，$b^{[2]}$ 的维度为 $(n^{[2]}, 1)$ 。

假设我们在做二元分类，那么 Cost function 为：

$J(W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]}) = \frac{1}{m}\sum_{i=1}^{m}L(\hat{y}, y)$

整个训练神经网络的过程为：

Repeat{

initialize parameters $W^{[1]}, b^{[1]}, W^{[2]}, b^{[2]} $

compute predicts $\hat{y}^{(i)}, i \in [1,m]$

compute $dW^{[1]}, db^{[1]}, W^{[2]}, b^{[2]}$

update $W^{[1]} = W^{[1]} - \alpha dW^{[1]}, b^{[1]} = b^{[1]} - \alpha db^{[1]} …$

}

在训练神经网络时，随机初始化参数很重要，并不是单纯的全部初始化为 $0$ ，我们将在后续详细讨论。

在之前，我们讨论了如果计算 predict (预测值) ，以及如何向量化实现整个过程，所以现在的关键在于，如何计算这些偏导项 $dW^{[1]}, db^{[1]}$ …

神经网络正向传播的过程为：

$Z^{[1]} = W^{[1]}X + b^{[1]}$ $A^{[1]} = g(Z^{[1]})$ $Z^{[2]} = W^{[2]}A^{[1]} + b^{[2]}$ $A^{[2]} = g(Z^{[2]}) = \sigma(Z^{[2]})$

反向传播的过程为：

$dZ^{[2]} = A^{[2]}-Y$ $dW^{[2]} = \frac{1}{m}dZ^{[2]}A^{[1]T}$ $db^{[2]} = \frac{1}{m}np.sum(dZ^{[2]},axis=1,keepdims=True)$

这个过程其实和之前的 Logistic Regression 很相似。需要注意的是，这里$db^{[2]}$ 的计算直接用了 python 语句，np.sum函数的参数 axis 代表在哪个维度上求和，keepdims为了保持 $db^{[2]}$ 的形状为 $(n^{[2]},1)$ 而不是奇怪的 $(n^{[2]}, )$

$dZ^{[2]} = W^{[2]T}dZ^{[2]} * g^{[1]'}(Z^{[1]})$ $dW^{[1]} = \frac{1}{m}dZ^{[1]}X^{T}$ $db^{[1]}=\frac{1}{m}np.sum(dZ^{[1]},axis = 1,keepdims = True)$

需要注意的是 $dZ^{[2]} = W^{[2]T}dZ^{[2]} *g^{[1]’}(Z^{[1]})$ 这里是对应元素相乘，$W^{[2]T}dZ^{[2]}$ 的形状为 $(n^{[1]}, m)$ ，$g’^{[1]}(Z^{[1]})$ 的形状也为 $(n^{[1]}, m)$ 。

关于详细的推导过程，会在接下来的部分详细说明。

Backpropagation intuition

先回顾一下之前我们在实现 Logistic Regression 时是如何推导的，可以参考之前的博客 Logistic Regression Gradient Descent.

由于现在多了一层隐藏层，整个反向传播过程会更复杂一点。

首先，先考虑单个样本的情况，先画出计算图，如下图：

根据导数链式法则，可以计算出：

$dz^{[2]} = \frac{\partial L}{\partial z^{[2]}} = \frac{\partial L}{\partial a^{[2]}} \cdot \frac{\partial a^{[2]}}{\partial z^{[2]}}= a^{[2]}-y$ $dW^{[2]}= \frac{\partial L}{\partial W^{[2]}} = \frac{\partial L}{\partial z^{[2]}} \cdot \frac{\partial z^{[2]}}{\partial W^{[2]}} = dz^{[2]} \cdot \frac{\partial z^{[2]}}{\partial W^{[2]}} = dz^{[2]}a^{[1]T}$

注意，这里 $dW^{[2]}$ 的形状为 $(n^{[2]}, n^{[1]})$ ， $n^{[1]}$ 为隐藏层节点个数，而 $a^{[1]}$ 形状为 $(n^{[1]}, n^{[2]})$ ，故这里需要 $a^{[1]}$ 转置，即 $a^{[1]T}$ 。

$db^{[2]}= \frac{\partial L}{\partial b^{[2]}} = \frac{\partial L}{\partial z^{[2]}} \cdot \frac{\partial z^{[2]}}{\partial b^{[2]}} = dz^{[2]} \cdot \frac{\partial z^{[2]}}{\partial b^{[2]}} = dz^{[2]}$ $dz^{[1]} = \frac{\partial L}{\partial z^{[1]}} = \frac{\partial L}{\partial a^{[1]}} \cdot \frac{\partial a^{[1]}}{\partial z^{[1]}} = \frac{\partial L}{\partial z^{[2]}} \cdot \frac{\partial z^{[2]}}{\partial a^{[1]}} \cdot \frac{\partial a^{[1]}}{\partial z^{[1]}} = dz^{[2]} \cdot \frac{\partial z^{[2]}}{\partial a^{[1]}} \cdot \frac{\partial a^{[1]}}{\partial z^{[1]}}$

即，

$dz^{[1]} = W^{[2]T}dz^{[2]} * g^{[1]'}(z^{[1]})$

同样，这里的 $W^{[2]T}$ 也需要转置，具体为什么，我也不是很明白，但可以从矩阵乘法的规则来判断其形状是否正确。需要注意的是，这里的乘法 $*$ 为对应元素相乘，而不是一般意义上的矩阵乘法。在实现过程中，你必须要确保矩阵的形状相匹配，这里 $W^{[2]T}$ 形状为 $(n^{[1]}, n^{[2]})$ ，$dz^{[2]}$ 的形状为 $(n^{[2]}, 1)$ ，$z^{[1]}$ 的形状为 $(n^{[1]}, 1)$

$dW^{[1]} = dz^{[1]} \cdot \frac{\partial z^{[1]}}{\partial W^{[1]}} = dz^{[1]}x^{T} = dz^{[1]}a^{[0]T}$ $db^{[1]} = dz^{[1]} \cdot \frac{\partial z^{[1]}}{\partial b^{[1]}} = dz^{[1]}$

到这里为止，我们就推导完了反向传播的6个公式。接下来，我们需要将其推广到 $m$ 个训练样本的向量化实现上，得到结果如下，：

Random Initialization

神经网络中的参数 $W$ 是不能和 Logistic Regression 那样全部初始化为 $0$ 的，我们来分析一下原因。

假设我们有这样一个神经网络，如下图所示：

假设，我们将 $W^{[1]}, W^{[2]}$ 都初始化为零矩阵，那么经过正向传播以后，我们会得到 $a^{[1]}_1 = a^{[1]}_{2}$ ，那么根据对称性，在反向传播后会有 $dz^{[1]}_{1} = dz^{[1]}_{2}$ ， $dW^{[1]}_{1}=dW^{[1]}_{2}$ ，无论你执行多少次梯度下降算法，隐藏层的每个节点都在做相同的操作。这样的话，最后我们获得的 $W^{[1]}, W^{[2]}$ 每行元素都相同，也就是说所有隐藏层中的节点都可以用一个节点来代替，多余的节点没有任何意义，这不是我们想要的。另外，参数 $b$ 可以全部初始化为 $0$ ，不会发生上面提到的问题。

所以，我们必须随机初始化所有的参数。python 语句如下：

W1 = np.random.randn((2,2))*0.01
b1 = np.zero((2,1))
W1 = np.random.randn((1,2))*0.01
b1 = 0

你可能为由疑问，为什么这里要 $*0.01$ ，为什么是 $0.01$ 而不是其他的数字？事实上，我们倾向于把矩阵初始化为非常非常小的随机值。因为如果你用 tanh 函数或者 sigmoid 函数作为激活函数， $W$ 比较小，那正向传播后得到的 $z$ 也会比较小，经过激活函数后所得到的 $a$ 也会接近于 $0$ ，而在 $0$ 附近，激活函数的斜率比较大，能大大地提高梯度下降算法的更新速度，即学习的速度。如果你使用的激活函数为 ReLU 或是 Leaky ReLU 则没有这个问题。

有时候，会有比 $0.01$ 更好用的常数，但如果你只是训练一个单隐层神经网络，或是一个相对较浅的神经网络，没有太多隐藏层，使用 $0.01$ 没有太大问题。但是当你训练一个很深的神经网络时，你可能需要尝试一下别的常数，关于常数的详细内容会在后续部分提到。

Homework

# Package imports
import numpy as np
import matplotlib.pyplot as plt
from testCases_v2 import *
import sklearn
import sklearn.datasets
import sklearn.linear_model
from planar_utils import plot_decision_boundary, sigmoid, load_planar_dataset, load_extra_datasets


def layer_sizes(X, Y):
    """
    Arguments:
    X -- input dataset of shape (input size, number of examples)
    Y -- labels of shape (output size, number of examples)

    Returns:
    n_x -- the size of the input layer
    n_h -- the size of the hidden layer
    n_y -- the size of the output layer
    """
    n_x = X.shape[0]  # size of input layer
    n_h = 4
    n_y = Y.shape[0]  # size of output layer
    return (n_x, n_h, n_y)


def initialize_parameters(n_x, n_h, n_y):
    """
    Argument:
    n_x -- size of the input layer
    n_h -- size of the hidden layer
    n_y -- size of the output layer

    Returns:
    params -- python dictionary containing your parameters:
                    W1 -- weight matrix of shape (n_h, n_x)
                    b1 -- bias vector of shape (n_h, 1)
                    W2 -- weight matrix of shape (n_y, n_h)
                    b2 -- bias vector of shape (n_y, 1)
    """

    np.random.seed(2)  # we set up a seed so that your output matches ours although the initialization is random.

    W1 = np.random.randn(n_h, n_x) * 0.01
    b1 = np.zeros((n_h, 1))
    W2 = np.random.randn(n_y, n_h) * 0.01
    b2 = np.zeros((n_y, 1))

    assert (W1.shape == (n_h, n_x))
    assert (b1.shape == (n_h, 1))
    assert (W2.shape == (n_y, n_h))
    assert (b2.shape == (n_y, 1))

    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2}

    return parameters


def forward_propagation(X, parameters):
    """
    Argument:
    X -- input data of size (n_x, m)
    parameters -- python dictionary containing your parameters (output of initialization function)

    Returns:
    A2 -- The sigmoid output of the second activation
    cache -- a dictionary containing "Z1", "A1", "Z2" and "A2"
    """
    # Retrieve each parameter from the dictionary "parameters"
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]

    # Implement Forward Propagation to calculate A2 (probabilities)
    Z1 = np.dot(W1, X) + b1
    A1 = np.tanh(Z1)
    Z2 = np.dot(W2, A1) + b2
    A2 = sigmoid(Z2)

    assert (A2.shape == (1, X.shape[1]))

    cache = {"Z1": Z1,
             "A1": A1,
             "Z2": Z2,
             "A2": A2}

    return A2, cache


def compute_cost(A2, Y, parameters):
    """
    Computes the cross-entropy cost given in equation (13)

    Arguments:
    A2 -- The sigmoid output of the second activation, of shape (1, number of examples)
    Y -- "true" labels vector of shape (1, number of examples)
    parameters -- python dictionary containing your parameters W1, b1, W2 and b2

    Returns:
    cost -- cross-entropy cost given equation (13)
    """

    m = Y.shape[1]  # number of example

    # Compute the cross-entropy cost
    logprobs = np.multiply(np.log(A2), Y) + np.multiply(np.log(1 - A2), 1 - Y)
    cost = -np.sum(logprobs) / m

    cost = np.squeeze(cost)  # makes sure cost is the dimension we expect.
    # E.g., turns [[17]] into 17
    assert (isinstance(cost, float))

    return cost


def backward_propagation(parameters, cache, X, Y):
    """
    Implement the backward propagation using the instructions above.

    Arguments:
    parameters -- python dictionary containing our parameters
    cache -- a dictionary containing "Z1", "A1", "Z2" and "A2".
    X -- input data of shape (2, number of examples)
    Y -- "true" labels vector of shape (1, number of examples)

    Returns:
    grads -- python dictionary containing your gradients with respect to different parameters
    """
    m = X.shape[1]

    # First, retrieve W1 and W2 from the dictionary "parameters".
    W1 = parameters["W1"]
    W2 = parameters["W2"]

    # Retrieve also A1 and A2 from dictionary "cache".
    A1 = cache["A1"]
    A2 = cache["A2"]

    # Backward propagation: calculate dW1, db1, dW2, db2.
    dZ2 = A2 - Y
    dW2 = np.dot(dZ2, A1.T) / m
    db2 = np.sum(dZ2, axis=1, keepdims=True) / m
    dZ1 = np.dot(W2.T, dZ2) * (1 - np.power(A1, 2))
    dW1 = np.dot(dZ1, X.T) / m
    db1 = np.sum(dZ1, axis=1, keepdims=True) / m

    grads = {"dW1": dW1,
             "db1": db1,
             "dW2": dW2,
             "db2": db2}

    return grads


def update_parameters(parameters, grads, learning_rate=1.2):
    """
    Updates parameters using the gradient descent update rule given above

    Arguments:
    parameters -- python dictionary containing your parameters
    grads -- python dictionary containing your gradients

    Returns:
    parameters -- python dictionary containing your updated parameters
    """
    # Retrieve each parameter from the dictionary "parameters"
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]

    # Retrieve each gradient from the dictionary "grads"
    dW1 = grads["dW1"]
    db1 = grads["db1"]
    dW2 = grads["dW2"]
    db2 = grads["db2"]

    # Update rule for each parameter
    W1 = W1 - learning_rate * dW1
    b1 = b1 - learning_rate * db1
    W2 = W2 - learning_rate * dW2
    b2 = b2 - learning_rate * db2

    parameters = {"W1": W1,
                  "b1": b1,
                  "W2": W2,
                  "b2": b2}

    return parameters


def nn_model(X, Y, n_h, num_iterations=10000, print_cost=False):
    """
    Arguments:
    X -- dataset of shape (2, number of examples)
    Y -- labels of shape (1, number of examples)
    n_h -- size of the hidden layer
    num_iterations -- Number of iterations in gradient descent loop
    print_cost -- if True, print the cost every 1000 iterations

    Returns:
    parameters -- parameters learnt by the model. They can then be used to predict.
    """

    np.random.seed(3)
    n_x = layer_sizes(X, Y)[0]
    n_y = layer_sizes(X, Y)[2]

    # Initialize parameters, then retrieve W1, b1, W2, b2. Inputs: "n_x, n_h, n_y". Outputs = "W1, b1, W2, b2, parameters".
    parameters = initialize_parameters(n_x, n_h, n_y)
    W1 = parameters["W1"]
    b1 = parameters["b1"]
    W2 = parameters["W2"]
    b2 = parameters["b2"]

    # Loop (gradient descent)

    for i in range(0, num_iterations):

        # Forward propagation. Inputs: "X, parameters". Outputs: "A2, cache".
        A2, cache = forward_propagation(X, parameters)

        # Cost function. Inputs: "A2, Y, parameters". Outputs: "cost".
        cost = compute_cost(A2, Y, parameters)

        # Backpropagation. Inputs: "parameters, cache, X, Y". Outputs: "grads".
        grads = backward_propagation(parameters, cache, X, Y)

        # Gradient descent parameter update. Inputs: "parameters, grads". Outputs: "parameters".
        parameters = update_parameters(parameters, grads)

        # Print the cost every 1000 iterations
        if print_cost and i % 1000 == 0:
            print("Cost after iteration %i: %f" % (i, cost))

    return parameters


def predict(parameters, X):
    """
    Using the learned parameters, predicts a class for each example in X

    Arguments:
    parameters -- python dictionary containing your parameters
    X -- input data of size (n_x, m)

    Returns
    predictions -- vector of predictions of our model (red: 0 / blue: 1)
    """

    # Computes probabilities using forward propagation, and classifies to 0/1 using 0.5 as the threshold.
    A2, cache = forward_propagation(X, parameters)
    predictions = (A2 > 0.5)

    return predictions


def main():
    # random seed
    np.random.seed(1)  # set a seed so that the results are consistent

    # Load data
    X, Y = load_planar_dataset()

    # Visualize the data
    # 原来的代码会报错，同样planar_utils.py 21行也需要修改
    plt.scatter(X[0, :], X[1, :], c=Y.reshape(X[0, :].shape), cmap=plt.cm.Spectral)
    # plt.show()

    # shape of dataset
    shape_X = X.shape
    shape_Y = Y.shape
    m = X.shape[1]  # training set size

    # Build a model with a n_h-dimensional hidden layer
    parameters = nn_model(X, Y, n_h=4, num_iterations=10000, print_cost=True)

    # Plot the decision boundary
    plot_decision_boundary(lambda x: predict(parameters, x.T), X, Y)
    plt.title("Decision Boundary for hidden layer size " + str(4))
    plt.show()

    # Print accuracy
    predictions = predict(parameters, X)
    print('Accuracy: %d' % float(
        (np.dot(Y, predictions.T) + np.dot(1 - Y, 1 - predictions.T)) / float(Y.size) * 100) + '%')

    # # Train the logistic regression classifier
    # clf = sklearn.linear_model.LogisticRegressionCV()
    # clf.fit(X.T, Y.T)
    #
    # # Plot the decision boundary for logistic regression
    # plot_decision_boundary(lambda x: clf.predict(x), X, Y)
    # plt.title("Logistic Regression")
    # plt.show()
    #
    # # Print accuracy
    # LR_predictions = clf.predict(X.T)
    # print('Accuracy of logistic regression: %d ' % float(
    #     (np.dot(Y, LR_predictions) + np.dot(1 - Y, 1 - LR_predictions)) / float(Y.size) * 100) +
    #       '% ' + "(percentage of correctly labelled datapoints)")


main()

# Deep Learning

Shallow Neural Networks-2

Why do you need non-linear activation function

Derivatives of activation functions

Gradient descent for neural networks

Backpropagation intuition

Random Initialization

Homework

Catalogue

Categories

Tag Cloud

Recent

Archives

Tags

Your browser is out-of-date!