Russell and Norvig, Chapter 24: Perception

24.1 Introduction

- perception initiated through sensors
  - vision and speech recognition
- S = f(W)
  - S = sensory stimulus
  - W = world (including agent)
  - f = way in which world generates sensory stimuli
- want W = f^{-1}(S)
  - full inverse is problematic: the stimulus is incomplete and ambiguous, and recovering all of W is unnecessary
- uses for vision
  - object recognition
  - manipulation
  - navigation

24.2 Image Formation

- vision converts scene to 2D image
- pinhole camera [Fig24.1, p726]
  - inverted image
  - perspective projection
    - parallel-line vanishing point
  - scaled orthographic projection
    - assume objects of negligible depth
    - perspective approximated as scaling
- lens system
- pixels
  - vision camera: 512x512
  - human eye: 120M rods
- photometry
  - light reflected either
    - diffusely: absorbed and then re-emitted in all directions
    - specularly: reflected off surface in a particular direction
- spectrophotometry
  - image intensity I(x,y,t) at location (x,y) at time t
  - multidimensional based on different light wavelengths
  - typically assume three-dimensional for red-green-blue colors

24.3 Image-Processing Operations for Early Vision

- edge detection [include stuff from vision file]*
  - intensity profile I of an edge [Fig24.7, p732]
  - differentiate and look for peaks
    - but noise generates spurious peaks
  - smooth using convolution and look for peaks
- convolution
  - h is the convolution of two functions f and g (denoted h = f * g) if
    - h(x) = int(-inf,+inf) f(u) g(x-u) du (continuous, 1D)
    - h(x) = sum(u=-inf,+inf) f(u) g(x-u) (discrete, 1D)
      - f and g usually zero in many places of an image
    - h(x,y) = int(-inf,+inf) int(-inf,+inf) f(u,v) g(x-u,y-v) du dv (continuous, 2D)
    - h(x,y) = sum(u=-inf,+inf) sum(v=-inf,+inf) f(u,v) g(x-u,y-v) (discrete, 2D)
- Gaussian smoothing
  - convolve with G_sigma(x)
  - G_sigma(x) = [1/(sqrt(2*pi)*sigma)] * e^(-x^2/(2*sigma^2))
  - smoothing increases with sigma
  - note: f * g' = (f * g)'
  - so convolve I with the derivative of the Gaussian function G_sigma(x)
  - G_sigma'(x) = [-x/(sqrt(2*pi)*sigma^3)] * e^(-x^2/(2*sigma^2))
    - show curve (~sin(x))
- 1D edge detection
  - R = I * G_sigma'
  - find absolute value of R
  - edges at points where ||R|| > T_n for some threshold T_n
    - ||R|| = |R| normalized between 0 and 1
- 2D edge detection (vertical edges)
  - R_v(x,y) = I(x,y) * f_v(x,y)
    - where f_v(x,y) = G_sigma'(x) G_sigma(y)
    - smooths in y direction; differentiates (to find peaks) and smooths in x direction
  - find absolute value of R_v(x,y)
  - edges at points where ||R_v||(x,y) > T_n for some threshold T_n
- 2D edge detection (arbitrary orientation)
  - R_v(x,y) = I(x,y) * f_v(x,y), R_h(x,y) = I(x,y) * f_h(x,y)
    - where f_h(x,y) = G_sigma'(y) G_sigma(x)
  - R(x,y) = R_v(x,y)^2 + R_h(x,y)^2
  - find absolute value of R(x,y)
  - edges at points where ||R||(x,y) > T_n for some threshold T_n
  - edge orientation
    - tan(theta) = R_v(x,y) / R_h(x,y)
- ^^^ Canny edge detector ^^^
  - choose sigma and T_n
- edge extension [see vision file]*
  - combine neighboring edge pixels with same orientation

Hough Transform [see vision file]*

24.4 Extracting 3D Information Using Vision

- three aspects
  - segmenting image into objects
  - find location and orientation (pose) of objects relative to observer
  - find shape of objects
- need more than edge detection
  - undetected edges near object boundaries
  - edges due to noise
- motion [Fig24.8, p736, Fig24.9, p737]
  - rate of motion (optical flow) provides info about distance
  - watch a particular set of pixels over time
  - can also be cast as agent moving in stationary world
    - compute time until collision with objects
- binocular stereopsis [Fig24.12, p739]
  - given two images taken from different views (two eyes)
  - eyes separated by baseline b
  - horizontal disparity H of a point in the two images
  - depth of point Z = b/H
- texture gradients
  - surface distance
    - texture primitives shrink with distance from camera
  - surface orientation
    - normal vectors to texture primitives
  - texture identification helps shape determination
    - texture changes -> shape boundaries
  - texture identification helps object recognition
    - compare to known object textures
- shading
  - determine surface normals to aid shape determination
- contour
  - lines are either [Fig24.20, p747]
    - limbs: line of sight is tangent to the object's surface (<<- or ->>)
    - edges: surface normal discontinuity
      - concave (-), convex (+), or occluding (<-- or -->)
  - Huffman-Clowes label set [Fig24.23, p748]
  - trihedral objects
    - vertices composed of the intersection of 3 planes
  - Waltz algorithm for labeling line drawings
    - constraint satisfaction problem (CSP)
    - each line gets one label

24.5 Using Vision for Manipulation and Navigation

- example from highway driving
  - specialized vision system
  - can ignore much of the image (e.g., grass next to road)

24.6 Object Representation and Recognition

- given
  - scene consisting of one or more objects from a collection of known objects O1,...,On
  - image of scene taken from unknown position and orientation
- determine
  - which of O1,...,On are present in scene
  - position and orientation (pose) of each object
- representation
  - polyhedral approximations (solids, no curves)
    - general, but lengthy when accuracy required
  - generalized cylinders
- matching
  - alignment method
    - object represented by m points u_1,...,u_m
    - object in scene rotated by R and translated by t
    - corresponding n image points p_1,...,p_n
    - p_i = Pi(R u_i + t), or
    - p_i = Q u_i
    - can find Q given three u-p point correspondences
    - keep trying triplets of points until good Q found
      - a good Q maps remaining points well
    - expensive for many possible objects
    - stapler example [Fig24.29, p755]
  - projective invariants
    - geometric invariants
      - object values unchanged by pose in image
      - e.g., cross-ratio [Fig24.30, p756]
        - a ratio of ratios of segment lengths among four collinear points does not change from scene to image
    - used to consider only those library objects having the same value for the invariant
    - models using invariants can be acquired directly from image

24.7 Speech Recognition

See file rn24.7.
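
Sketch for 24.3 (1D edge detection): the recipe R = I * G_sigma', take |R|, normalize to [0, 1], and keep points where ||R|| > T_n can be tried out on a synthetic intensity profile. This is a minimal NumPy sketch, not the textbook's code. The function names, the 4-sigma kernel truncation, the edge-value padding at the borders, and the flatness tolerance are all assumptions made for illustration.

```python
import numpy as np


def gaussian_derivative_kernel(sigma, radius=None):
    """Sample G_sigma'(x) = -x / (sqrt(2*pi) * sigma^3) * exp(-x^2 / (2*sigma^2))."""
    if radius is None:
        radius = int(np.ceil(4 * sigma))  # assumption: truncate the kernel at 4 sigma
    x = np.arange(-radius, radius + 1, dtype=float)
    return -x / (np.sqrt(2 * np.pi) * sigma**3) * np.exp(-x**2 / (2 * sigma**2))


def detect_edges_1d(intensity, sigma=2.0, threshold=0.5):
    """Return indices where the normalized response ||R|| exceeds threshold T_n."""
    intensity = np.asarray(intensity, dtype=float)
    kernel = gaussian_derivative_kernel(sigma)
    radius = len(kernel) // 2

    # Assumption: repeat the border values so the image boundary itself
    # does not look like a step edge (zero padding would create one).
    padded = np.pad(intensity, radius, mode="edge")

    # R = I * G_sigma' (discrete convolution; "valid" restores the input length).
    response = np.convolve(padded, kernel, mode="valid")

    magnitude = np.abs(response)
    peak = magnitude.max()
    if peak < 1e-9:  # assumption: treat a numerically flat response as "no edges"
        return np.array([], dtype=int)

    normalized = magnitude / peak  # ||R||, scaled into [0, 1]
    return np.flatnonzero(normalized > threshold)


if __name__ == "__main__":
    # Synthetic profile: dark region, then a step up to bright at index 50.
    profile = np.concatenate([np.zeros(50), np.ones(50)])
    print(detect_edges_1d(profile, sigma=2.0, threshold=0.5))
```

Plain thresholding returns a band of indices around each edge rather than a single pixel. Raising T_n narrows the band and raising sigma widens it, which is the sigma and T_n trade-off noted for the Canny edge detector.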