Russell and Norvig, Chapter 19: Learning in Neural and Belief Networks

19.1 How the Brain Works
  - neuron [Fig19.1, p565]
  - axon-dendrite connections between neurons (10 to 10^5 per neuron) are called synapses
  - synapses propagate electrochemical signals
  - number, placement and strength of connections change over time
    - basis for learning ?
  - conclude that the brain causes the mind (the alternative is "mysticism")

19.2 Neural Networks
  - unit [Fig19.4, p568]
  - input and output units
  - activation functions [Fig19.5, p569]
  - network structures
    - feed-forward
      - perceptron
      - multilayer network [Fig19.7, p572]
        - hidden units
        - a single, sufficiently large hidden layer can approximate any continuous function
    - recurrent (brain)
  - optimal structure ?

19.3 Perceptrons [Fig19.8, p574]
  - learns linearly separable functions [Fig19.9, p575]
  - algorithm [Fig19.11, p577]
    - update: Wj = Wj + alpha * Ij * Err
      - Wj = weight on link from input Ij
      - alpha = learning rate (0.1 - 0.3)
      - Err = desired output minus observed output
  - symbolic inputs
    - local encoding
      - one input unit per attribute
      - each attribute value assigned a real value (0.0 - 1.0)
    - distributed encoding
      - one unit for each attribute/value pair
  - comparison of perceptron to decision tree [Fig19.12, p578]

19.4 Multilayer Feed-Forward Networks
  - backpropagation
    - generic network [Fig19.13, p579]
    - hidden to output
      - Wji = Wji + alpha * a_j * delta_i
      - delta_i = Err_i * g'(in_i)
    - input to hidden
      - Wkj = Wkj + alpha * Ik * delta_j
      - delta_j = g'(in_j) * sum_i (Wji * delta_i)
    - algorithm [Fig19.14, p581]
  - gradient descent search in weight space [Fig19.16, p582]
  - see comparison to belief networks at end

19.5 Applications of Neural Networks
  - pronunciation of English words (NETtalk)
    - input: 7-character window over text, 29 inputs per character (26 letters + 3 for punctuation)
    - hidden: 80 units
    - output: phonemes
    - 95% accuracy on training set, 78% on test set
  - handwritten character recognition (zip codes)
    - input: 16x16 pixels for a digit
    - 3 hidden layers: 768, 192, 30 units (not fully connected)
    - output: 10 units, one per digit
    - 99% accuracy on test set
    - hardware version in use
  - driving
    - autonomous land vehicle in a neural network (ALVINN)
    - input: 30x32 pixel image
    - hidden: 5 units, fully connected
    - output: 20 units, one per possible steering direction
    - has driven distances up to 90 miles at speeds up to 70 mph
    - extensive learning needed for each road type/condition

19.6 Bayesian Methods for Learning Belief Networks
  - Bayesian learning
    - use hypotheses as intermediary between data and predictions
    - given data D and competing hypotheses Hi predicting X
      - P(X|D) = sum_i P(X|Hi) * P(Hi|D) is the best prediction (a distribution)
      - but computing P(Hi|D) for all Hi is usually intractable
    - find Hi maximizing P(Hi|D)
      - maximum a posteriori (MAP) hypothesis
      - P(X|D) ~ P(X|Hmap) * P(Hmap|D)
    - applying Bayes' rule
      - P(Hi|D) = P(D|Hi) * P(Hi) / P(D)
      - P(D) is fixed, so maximize P(D|Hi) * P(Hi)
      - choose P(Hi) as uniform or with an Ockham bias
      - uniform: maximize P(D|Hi), the maximum likelihood (ML) hypothesis
  - belief network learning
    - types
      - known structure, fully observable
        - learn CPTs
      - unknown structure, fully observable
        - learn network topology (model-based)
        - then reduces to the known-structure problem
      - known structure, hidden variables
        - similar to neural network learning
      - unknown structure, hidden variables (?)
    - adaptive probabilistic networks
      - given ∂P(D)/∂CPT, use gradient descent (NN) learning
      - (∂P(D_j)/∂w_i) / P(D_j) = P(x_i, u_i | D_j) / w_i
      - where w_i is a CPT entry, w_i = P(x_i | u_i), and the D_j are the examples
  - belief networks vs neural networks
    - BN representation semantically meaningful
    - BN representation local, NN distributed
    - continuous variables handled better by NN, but no relations
    - BNs can be automatically created
    - NNs have fast inference, but training can be slow
      - each epoch takes O(|examples| * |weights|) time
      - exponential number of epochs possible
    - BNs have slow inference, but more expressive representation
    - no specific inputs and outputs in BNs
    - easier to include prior knowledge into BNs
    - NNs more sensitive to noise
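
Example (perceptron update rule, section 19.3): the update Wj = Wj + alpha * Ij * Err can be sketched in Python as below. This is only an illustrative sketch, not the algorithm from Fig19.11 itself. Assumptions not in the notes: a step activation that fires when the weighted sum is >= 0, a bias weight on a constant input of 1 in place of an explicit threshold, alpha = 0.25 (inside the 0.1 - 0.3 range above), and the AND function as training data. The names step, predict and train_perceptron are hypothetical.

```python
def step(x):
    """Step activation: output 1 when the weighted sum is non-negative."""
    return 1 if x >= 0 else 0


def predict(weights, inputs):
    """weights[0] is the bias weight (constant input 1, an assumption)."""
    total = weights[0] + sum(w * i for w, i in zip(weights[1:], inputs))
    return step(total)


def train_perceptron(examples, alpha=0.25, max_epochs=100):
    """Repeat Wj = Wj + alpha * Ij * Err until an epoch has no errors."""
    n_inputs = len(examples[0][0])
    weights = [0.0] * (n_inputs + 1)
    for _ in range(max_epochs):
        mistakes = 0
        for inputs, target in examples:
            err = target - predict(weights, inputs)  # desired - observed
            if err != 0:
                mistakes += 1
                weights[0] += alpha * 1 * err        # bias input is 1
                for j, i_j in enumerate(inputs):
                    weights[j + 1] += alpha * i_j * err
        if mistakes == 0:
            break
    return weights


# AND is linearly separable, so the rule converges on it.
AND_EXAMPLES = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights = train_perceptron(AND_EXAMPLES)
predictions = [predict(weights, inputs) for inputs, _ in AND_EXAMPLES]
```

Running the same loop on XOR, which is not linearly separable, would use up max_epochs without reaching an error-free epoch.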
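
Example (Bayesian, MAP and ML prediction, section 19.6): the sketch below works through P(Hi|D) = P(D|Hi) * P(Hi) / P(D) and P(X|D) = sum_i P(X|Hi) * P(Hi|D) on a toy problem. Everything specific here is a hypothetical illustration and not from the chapter: three hypotheses about a coin's probability of heads, a prior favoring the fair coin, and data D of three heads. Exact fractions are used so the arithmetic can be checked by hand.

```python
from fractions import Fraction

# Hypothetical hypotheses Hi: the coin lands heads with probability theta_i.
THETAS = [Fraction(1, 4), Fraction(1, 2), Fraction(3, 4)]
# Hypothetical prior P(Hi) that favors the fair coin.
PRIORS = [Fraction(1, 10), Fraction(8, 10), Fraction(1, 10)]


def likelihood(theta, data):
    """P(D|Hi) for independent flips; data is a string of 'H' and 'T'."""
    p = Fraction(1)
    for flip in data:
        p *= theta if flip == "H" else 1 - theta
    return p


def posterior(thetas, priors, data):
    """P(Hi|D) = P(D|Hi) * P(Hi) / P(D); P(D) is the normalizing sum."""
    joint = [likelihood(t, data) * p for t, p in zip(thetas, priors)]
    p_data = sum(joint)
    return [j / p_data for j in joint]


def bayes_predict(thetas, post):
    """P(X=heads|D) = sum_i P(X=heads|Hi) * P(Hi|D)."""
    return sum(t * p for t, p in zip(thetas, post))


data = "HHH"
post = posterior(THETAS, PRIORS, data)

# MAP maximizes P(Hi|D); ML maximizes P(D|Hi), i.e. MAP under a uniform prior.
i_map = max(range(len(THETAS)), key=lambda i: post[i])
i_ml = max(range(len(THETAS)), key=lambda i: likelihood(THETAS[i], data))

p_heads_bayes = bayes_predict(THETAS, post)  # full Bayesian prediction
p_heads_map = THETAS[i_map]                  # prediction of Hmap alone
```

With these numbers the posterior is (1/92, 64/92, 27/92), so the MAP hypothesis is the fair coin while the ML hypothesis is the 3/4 coin, and the full Bayesian prediction for heads is 105/184, between the two.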