# [repost] Stanford Machine Learning - Lecture 5: Neural Networks Learning

original: http://blog.csdn.net/abcjennifer/article/details/7758797

===============================

（一）、Cost function

（二）、Backpropagation algorithm

（三）、Backpropagation intuition

（四）、Implementation note: Unrolling parameters

（五）、Gradient checking

（六）、Random initialization

（七）、Putting it together

===============================

（一）、Cost function

The cost measures the distance between the hypothesis and the true value, summed over every sample and every output class; the regularization term then adds the sum of squares of all parameters, with the bias terms excluded.
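Written out in full, this is the standard multi-class regularized cost from the lecture ($m$ examples, $K$ output classes, $L$ layers with $s_l$ units in layer $l$):

$J(\Theta) = -\frac{1}{m}\sum_{i=1}^{m}\sum_{k=1}^{K}\left[y_k^{(i)}\log (h_\Theta(x^{(i)}))_k + (1-y_k^{(i)})\log (1-(h_\Theta(x^{(i)}))_k)\right] + \frac{\lambda}{2m}\sum_{l=1}^{L-1}\sum_{i=1}^{s_l}\sum_{j=1}^{s_{l+1}}(\Theta_{ji}^{(l)})^2$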

===============================

（二）、Backpropagation algorithm

$E = \frac{1}{2}\sum_{i}{(y_i-a_i)^2}$

$\Delta W \propto -\frac{\partial E}{\partial W}$

$\frac{\partial E}{\partial \Theta_{k-1}} = \frac{\partial E}{\partial a_k}\cdot \frac{\partial a_k}{\partial z_k} \cdot \frac{\partial z_k}{\partial \Theta _{k-1}}$

$\frac{\partial E}{\partial a_k} = a_k-y \\ \frac{\partial a_k}{\partial z_k} = \frac{\partial g(z_k)}{\partial z_k} = \frac{e^{-z_k}}{(1+e^{-z_k})^2} = a_k(1-a_k)\\ \frac{\partial z_k}{\partial \Theta _{k-1}} = a_{k-1}$

$\Delta\Theta_{k} = \xi (y-a_k)a_k(1-a_k)a_{k-1}$

$\delta_{k} = (y-a_k)a_k(1-a_k)$

$\Delta\Theta_k = \xi \delta_k \cdot a_{k-1}$

\begin{align*} \frac{\partial E}{\partial a_{j}} &= \sum_{k} \frac{\partial E}{\partial a_k}\cdot \frac{\partial a_k}{\partial z_k}\cdot \frac{\partial z_k}{\partial a_{j}}\\ &= \sum_{k}(a_k-y)\cdot a_k(1-a_k)\cdot \Theta_{j} \end{align*}

P.S. The reason the last step is written with += rather than a direct assignment is that Δ is treated as an accumulator matrix: each training example's contribution is added into the corresponding positions.
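The chain-rule recursion above can be sketched in NumPy (a minimal one-hidden-layer illustration with sigmoid units and squared-error cost; all shapes and names are illustrative, and the course itself uses Octave):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Tiny 2-3-1 network with sigmoid units; bias terms omitted to keep the sketch short.
rng = np.random.default_rng(0)
Theta1 = rng.uniform(-0.1, 0.1, size=(3, 2))   # input -> hidden weights
Theta2 = rng.uniform(-0.1, 0.1, size=(1, 3))   # hidden -> output weights

x = np.array([[0.5], [0.2]])   # one training example (column vector)
y = np.array([[1.0]])          # its label

# Forward propagation.
a1 = sigmoid(Theta1 @ x)       # hidden activations a_{k-1}
a2 = sigmoid(Theta2 @ a1)      # output activation a_k

# Backpropagation for E = 0.5 * (y - a2)^2:
# at the output, delta_k = dE/dz_k = (a_k - y) * a_k * (1 - a_k);
# for the hidden layer it is propagated backwards through Theta2.
delta2 = (a2 - y) * a2 * (1 - a2)
delta1 = (Theta2.T @ delta2) * a1 * (1 - a1)

# Gradients: dE/dTheta = delta * (previous layer's activations)^T.
grad_Theta2 = delta2 @ a1.T
grad_Theta1 = delta1 @ x.T
```

Note the sign convention: the text folds the minus into δ = (y−a)a(1−a) so that ΔΘ = ξ δ a is an ascent-free update, while the code above computes δ = (a−y)a(1−a) = ∂E/∂z, so the corresponding step is `Theta -= xi * grad_Theta`.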

===============================

（三）、Backpropagation intuition

$\mathrm{cost}(i)=y^{(i)}\log h_\theta(x^{(i)})+(1-y^{(i)})\log(1-h_\theta(x^{(i)}))$

===============================

（四）、Implementation note: Unrolling parameters

```matlab
function [jVal, gradient] = costFunction(theta)
optTheta = fminunc(@costFunction, initialTheta, options)
```
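The same unrolling trick, sketched in NumPy terms with hypothetical shapes: flatten each Θ matrix into one long vector for the optimizer, then reshape it back inside the cost function.

```python
import numpy as np

# Two weight matrices with hypothetical shapes (3x2 and 1x3).
Theta1 = np.arange(6.0).reshape(3, 2)
Theta2 = np.arange(3.0).reshape(1, 3)

# Unroll: flatten both matrices into a single parameter vector.
thetaVec = np.concatenate([Theta1.ravel(), Theta2.ravel()])

# Inside the cost function: reshape the vector back into matrices.
T1 = thetaVec[:6].reshape(3, 2)
T2 = thetaVec[6:9].reshape(1, 3)
```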

===============================

（五）、Gradient checking

Summary: a few points to note:

- In back propagation, compute the derivative D of J(θ) with respect to θ and unroll it into a vector (DVec).

- Compute an approximate gradient numerically, and check whether the two give the same (or very close) results.

- (This point is very important) Then turn gradient checking off and use only back propagation to train the network (otherwise training will be very, very slow).
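The check itself can be sketched with a two-sided finite difference (a minimal NumPy version; the toy cost and the name `gradApprox` are illustrative):

```python
import numpy as np

def numerical_gradient(J, theta, eps=1e-4):
    """Two-sided finite-difference estimate of dJ/dtheta, one component at a time."""
    grad = np.zeros_like(theta)
    for i in range(theta.size):
        tp = theta.copy(); tp[i] += eps
        tm = theta.copy(); tm[i] -= eps
        grad[i] = (J(tp) - J(tm)) / (2 * eps)
    return grad

# Toy check: J(theta) = sum(theta^2) has exact gradient 2*theta.
theta = np.array([1.0, -2.0, 3.0])
gradApprox = numerical_gradient(lambda t: float((t ** 2).sum()), theta)
```

In practice `gradApprox` is compared element-wise against the DVec produced by back propagation; this loop costs one full cost evaluation per parameter per side, which is why checking must be disabled before training.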

===============================

（六）、Random Initialization

If all parameters are initialized to the same value, all of your hidden units compute exactly the same function of the input, which is a highly redundant representation: every computation within a layer collapses to a single one, and anything interesting the extra units could learn is lost. To break this symmetry, pick each parameter randomly from the range [-ε, ε]:
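A minimal sketch of such an initialization in NumPy (the function name and the default ε = 0.12 are illustrative; the lecture only requires values in [-ε, ε]):

```python
import numpy as np

def rand_initialize_weights(L_in, L_out, epsilon_init=0.12):
    """Weights drawn uniformly from [-eps, eps]; the +1 column is for the bias unit."""
    return np.random.uniform(-epsilon_init, epsilon_init, size=(L_out, L_in + 1))

# e.g. a layer mapping 400 inputs to 25 hidden units:
Theta1 = rand_initialize_weights(400, 25)
```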

===============================

（七）、Putting it together

1. Choosing a network architecture

- No. of input units: dimension of the features
- No. of output units: number of classes
- Reasonable default: 1 hidden layer; if more than 1 hidden layer, use the same number of hidden units in every layer (usually the more the better)

2. Training a neural network

① Randomly initialize the weights
② Implement forward propagation to get hθ(x^(i)) for any x^(i)
③ Implement code to compute the cost function J(θ)
④ Implement backprop to compute the partial derivatives
⑤ Use gradient checking to compare the backprop derivatives against numerical estimates, then disable checking
⑥ Use gradient descent or an advanced optimization method (e.g. fminunc) together with backprop to minimize J(θ)
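The steps above can be put together in a short NumPy sketch (a toy one-hidden-layer network trained by plain gradient descent on made-up data; every name, shape, and constant is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda z: 1.0 / (1.0 + np.exp(-z))

# Toy data: 4 examples, 2 features, 1 output unit.
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
Y = np.array([[0.], [1.], [1.], [0.]])
m = X.shape[0]

def cost(A2):
    # Cross-entropy J(theta), without regularization.
    return float(-(Y * np.log(A2) + (1 - Y) * np.log(1 - A2)).sum() / m)

# Step 1: randomly initialize weights in [-eps, eps] to break symmetry.
eps = 0.5
Theta1 = rng.uniform(-eps, eps, (4, 3))   # 2 inputs + bias -> 4 hidden units
Theta2 = rng.uniform(-eps, eps, (1, 5))   # 4 hidden + bias -> 1 output unit

history = []
for _ in range(2000):
    # Step 2: forward propagation to get h_theta(x) for every x.
    A1 = sigmoid(np.hstack([np.ones((m, 1)), X]) @ Theta1.T)
    A2 = sigmoid(np.hstack([np.ones((m, 1)), A1]) @ Theta2.T)
    # Step 3: cost.
    history.append(cost(A2))
    # Step 4: backprop partial derivatives.
    d2 = A2 - Y                                # output-layer delta
    d1 = (d2 @ Theta2)[:, 1:] * A1 * (1 - A1)  # hidden-layer delta (bias column dropped)
    G2 = d2.T @ np.hstack([np.ones((m, 1)), A1]) / m
    G1 = d1.T @ np.hstack([np.ones((m, 1)), X]) / m
    # Step 6: a plain gradient-descent step (fminunc would replace this loop).
    Theta2 -= 1.0 * G2
    Theta1 -= 1.0 * G1
```

Step 5 (gradient checking) is omitted from the loop deliberately: it would be run once, outside training, to compare G1/G2 against numerical estimates.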
