- Gradient descent over entire *network* weight vector
- Easily generalized to arbitrary directed graphs
- Will find a local, not necessarily global error minimum
- In practice, often works well (can run multiple times)
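
The "can run multiple times" remark can be illustrated with a small sketch. It is not a network: it assumes a hypothetical one-dimensional error surface `E(w) = w^4 - 3w^2 + w`, which has one local and one global minimum, and it restarts plain gradient descent from several random initial weights, keeping the best result. The names `error`, `gradient`, `descend`, and `best_of_restarts`, the step size, and the seed are all illustrative assumptions.

```python
import random

def error(w):
    # Hypothetical non-convex error surface with two minima.
    return w**4 - 3 * w**2 + w

def gradient(w):
    # Derivative of error(w) with respect to w.
    return 4 * w**3 - 6 * w + 1

def descend(w, eta=0.01, steps=2000):
    # Plain gradient descent: converges to whichever minimum
    # lies in the basin of the starting point.
    for _ in range(steps):
        w -= eta * gradient(w)
    return w

def best_of_restarts(n_restarts=10, seed=0):
    # Run descent from several random starts and keep the lowest error.
    rng = random.Random(seed)
    results = [descend(rng.uniform(-2.0, 2.0)) for _ in range(n_restarts)]
    return min(results, key=error)

best_w = best_of_restarts()
print(best_w, error(best_w))
```

A single run may stop in the shallower minimum near `w = 1.13`; taking the best of several runs makes it much more likely that the deeper minimum near `w = -1.30` is found, though it is still not guaranteed.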

- Often include weight *momentum*
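
As a minimal sketch of weight momentum, the update below adds a fraction of the previous weight change to the current gradient step: `delta(n) = -eta * dE/dw + alpha * delta(n-1)`. The function names, the values of `eta` (learning rate) and `alpha` (momentum coefficient), and the quadratic test error `E(w) = (w - 3)^2` are assumptions chosen for illustration, not taken from the source.

```python
def momentum_step(w, grad, prev_delta, eta=0.1, alpha=0.9):
    # New change = gradient step plus a fraction of the previous change.
    delta = -eta * grad + alpha * prev_delta
    return w + delta, delta

def minimize(grad_fn, w0, steps=500, eta=0.1, alpha=0.9):
    # Repeatedly apply the momentum update, starting with no prior change.
    w, delta = w0, 0.0
    for _ in range(steps):
        w, delta = momentum_step(w, grad_fn(w), delta, eta, alpha)
    return w

# Illustrative error E(w) = (w - 3)^2, whose gradient is 2 * (w - 3).
w_final = minimize(lambda w: 2.0 * (w - 3.0), w0=0.0)
print(w_final)
```

With `alpha = 0` this reduces to plain gradient descent. A nonzero `alpha` lets the update keep moving through flat regions and small bumps in the error surface, at the cost of possible overshoot.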

- Minimizes error over *training* examples
- Will it generalize well to subsequent examples?

- Training can take thousands of iterations: slow!
- Using network after training is very fast