learning as optimization (gradient descent)
    cost function: squared error
    sigmoid rather than step function, for differentiability
    gradient gives delta rule
    batch vs. online update

multilayer perceptron
    more powerful than perceptron
    XOR problem
    Kolmogorov's theorem

backpropagation learning
    forward pass
    backward pass (to propagate error)
    learning update
        \Delta W^l_{ij} = \eta\delta_i^l x_j^{l-1}
        \Delta b_i^l = \eta\delta_i^l

LeNet
    convolution
        local receptive fields
        weight-sharing
    subsampling
    http://www.research.att.com/~yann/ocr/

hierarchical visual system?
    more complex features
    more invariance

selectivity/invariance as AND/OR
    AND: convolution
    OR: subsampling

limitations to our understanding
    computation
        too many synaptic weights and neurons
        many engineers hate this type of solution
    learning
        simple rules create the machine

why should constraints help?
    intuition: use of prior information
    reducing number of parameters is good

hidden units
    emergent properties (not specified in learning)

Robinson quote

----------------------------------------------------------------------

antecedents of backprop in psychology
    stimulus-response associations (Thorndike, Pavlov)
    another tradition which emphasizes
    William James on the stream of consciousness (1892)

synapses <----> associations
physiology      psychology

problems
    storage: synaptic strengths -> neural activity
    retrieval: neural activity -> synaptic strengths

Hebb (1949)
    Let us assume that the persistence or repetition of a
    reverberatory activity (or "trace") tends to induce lasting
    cellular changes that add to its stability. . . . When an axon
    of cell A is near enough to excite a cell B and repeatedly or
    persistently takes part in firing it, some growth process or
    metabolic change takes place in one or both cells such that A's
    efficiency, as one of the cells firing B, is increased.

Hebb's solution:
    storage: Hebbian learning
    retrieval: reverberatory activity

----------------------------------------------------------------------

content-addressable memory
    novel pattern triggers recall of a stored memory
    s_i = \sgn(\sum_j w_{ij}s_j)

case of one pattern $\xi_i$
    w_{ij}=\xi_i\xi_j
    (reversed pattern is also stored)

case of many patterns
    superposition
    Hebbian learning
    interference

energy function
    multistability and attractors
    visualize with energy landscape
    retrieval: flow to a minimum
    storage: lowering energy

----------------------------------------------------------------------

attractors in the brain?
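
sketch: content-addressable memory in code
    The storage rule w_{ij}=\xi_i\xi_j (superposed over patterns) and the
    retrieval rule s_i = \sgn(\sum_j w_{ij}s_j) from the notes above can be
    tried out in a few lines of NumPy. This is only a minimal sketch under
    stated assumptions, not a reference implementation: the function names
    (store, recall, energy), the two 8-unit patterns, the zeroed diagonal
    (no self-connections), the one-unit-at-a-time update order, the
    convention sgn(0) = +1, and the energy E = -1/2 \sum_{ij} w_{ij}s_i s_j
    are all choices made here for illustration.

```python
import numpy as np

def store(patterns):
    """Hebbian storage: superpose the outer products xi_i * xi_j of all patterns."""
    patterns = np.asarray(patterns, dtype=float)
    w = patterns.T @ patterns      # sum over patterns of xi_i * xi_j
    np.fill_diagonal(w, 0.0)       # assumption: no self-connections
    return w

def energy(w, s):
    """Assumed energy function: E = -1/2 * sum_ij w_ij s_i s_j."""
    s = np.asarray(s, dtype=float)
    return -0.5 * s @ w @ s

def recall(w, s, max_sweeps=20):
    """Retrieval: update one unit at a time with s_i = sgn(sum_j w_ij s_j)."""
    s = np.array(s, dtype=float)   # copy, so the probe is left untouched
    for _ in range(max_sweeps):
        changed = False
        for i in range(len(s)):
            new_value = 1.0 if w[i] @ s >= 0 else -1.0   # assumption: sgn(0) = +1
            if new_value != s[i]:
                s[i] = new_value
                changed = True
        if not changed:            # reached a fixed point (an attractor)
            break
    return s

# Two illustrative, mutually orthogonal patterns (hypothetical data).
xi1 = np.array([1, 1, 1, 1, -1, -1, -1, -1], dtype=float)
xi2 = np.array([1, -1, 1, -1, 1, -1, 1, -1], dtype=float)
w = store([xi1, xi2])

# A "novel" pattern: xi1 with its first unit flipped.
probe = xi1.copy()
probe[0] = -probe[0]

recalled = recall(w, probe)
print("recalled equals xi1:", np.array_equal(recalled, xi1))
print("energy of probe, energy of recalled state:", energy(w, probe), energy(w, recalled))
```

    With these two patterns the corrupted probe flows back to xi1 and the
    energy decreases along the way, which is the "retrieval: flow to a
    minimum" picture. Starting from -xi1 instead leaves the state
    unchanged, illustrating that the reversed pattern is also stored.
    With many non-orthogonal patterns, interference between them can
    break recall.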