Russell and Norvig, Section 24.7: Speech Recognition

- mapping from digitally-encoded acoustic signal to string of words
  - diagnosis problem
- speech understanding also includes
  - what speech sounds were uttered
  - choosing words based on intention
  - choosing meaning of words based on intention
  - natural language
- speech sounds
  - human language limited to 40 or 50 sounds
    - called phones
    - 49 phones for English [Fig24.32, p758]
  - need to identify phones based on features of acoustic signal
    - e.g., frequency or amplitude
- phones to words
  - define each word's pronunciation as a sequence of phones
  - phones to words -> lookup
  - homophones: two words with same sound (e.g., hey, hay)
  - one word with two pronunciations (e.g., Caribbean)
  - segmentation: separation between words
    - fluent language has little silence
- signal processing [Fig24.33, p759]
  - analog speech signal (energy) to digital
    - sampling rate
      - 8-16 kHz (8000-16000 times per second)
    - quantization factor
      - precision of sample
      - 8 to 12 bits
    - 8000 samples/sec * 8 bits/sample = 64000 bits/sec = 8000 bytes/sec
      - ~0.5 MBytes / minute
  - decomposition into frames [refer to Fig24.33, p759]
    - frame is a group of samples
      - 10 msecs = 80 samples at 8 kHz
    - size chosen based on features to be detected
    - overlap to prevent loss on boundaries
    - search frame for features
      - e.g., frequency changes or sudden silence
  - vector quantization
    - divide n-dimensional feature space into 256 regions
    - assign frame to one of these regions (need 1 byte/frame)
    - for 10 msec frame: 80 bytes -> 1 byte
    - one minute of speech: 0.5 MBytes -> 6 KBytes + overlap
    - quantization (regions) chosen to minimize information loss
  - speaker environment
    - accents, vocal tracts
    - amount of background noise
    - filter out for general-purpose speech recognition
    - but needed for speaker identification
- speech recognition model
  - reasoning with uncertainty
    - hardware sensitivity (signal -?-> sample)
    - words -?-> phones -?-> signal
    - speaker dependencies
  - P(words|signal) = P(words) * P(signal|words) / P(signal)
    - P(signal) = normalizing constant (ignored)
    - P(words) = language model
      - preferences for combinations of words
      - "ate the gun" < "ate the bun"
    - P(signal|words) = acoustic model
      - preferences for word/phone combinations
      - [t][ow][m][ey][t][ow] for tomato, not [aa]
- language model: P(words)
  - preferences for
    - words in context: "ate the bun" > "ate the gun"
    - word ordering: "ate the bun" > "bun the ate"
  - assign probability to each possible string (infinite)
  - Probabilistic Context Free Grammar (PCFG)
    - assigns probability to each rewrite rule
    - but, ignores context
  - given string of words w1...wn
    - P(w1...wn) = P(w1) * P(w2|w1) * P(w3|w1,w2) * ... = Pi(i=1,n) P(wi|w1...w_i-1)
  - simplify using bigram model
    - probability of wi depends only on previous word w_i-1
    - P(w1...wn) = P(w1) * P(w2|w1) * P(w3|w2) * ... = Pi(i=1,n) P(wi|w_i-1)
    - estimate P(wi|w_i-1) based on training corpus
      - P(w2|w1) = (#times w2 follows w1) / (#times w1 occurs)
      - e.g., training corpus = Chapter 24 [Fig24.34, p761]
  - trigram model P(wi|w_i-1,w_i-2)
    - more difficult to estimate from training corpus
    - weighted sum of trigram, bigram and unigram (word frequency) models
      - P(w1...wn) = Pi(i=1,n) [c1*P(wi) + c2*P(wi|w_i-1) + c3*P(wi|w_i-1,w_i-2)]
      - where c1 + c2 + c3 = 1
  - bigram and trigram
    - capture some local contextual information
    - (e.g., subject-verb agreement)
- acoustic model: P(signal|words)
  - word --> sequence of phones --> acoustic signal --> vector quantization
  - phonetic variations due to
    - pronunciation different by dialect
    - coarticulation (slurring of phones)
  - pronunciations as Markov models [Fig24.35, p763]
    - states represent phones
    - transitions represent succession with some probability
      - if only one successor, probability = 1
    - one Markov model per word
    - P(phones|word) = product of corresponding transition probabilities
  - still need P(signal|phone)
  - Hidden Markov Model (HMM)
    - example for [m] phone [Fig24.36, p764]
    - each state has multiple outputs with associated probabilities
      - outputs represent vector quantization values
    - transitions can be loops
      - permits iteration of VQ values for slow speakers
    - hidden because we don't know which state (Onset, Mid, End) produced which output (C1-C7)
    - so, pronunciation MM is actually an HMM
    - other phones have similar HMMs
    - given VQ values, compute P(VQ values|phone)
      - P(VQ values|phone) = sum over state paths of P(state transition path) * product of P(VQ value|state)
      - e.g., VQ values [C3,C5,C6] (only one path, Onset->Mid->End, can produce them)
        - P([C3,C5,C6]|[m]) = P(Onset->Mid) * P(Mid->End) * P(End->Final) * P([C3]|Onset) * P([C5]|Mid) * P([C6]|End) = (0.7)(0.1)(0.6) * (0.3)(0.1)(0.5) = 0.00063
    - most phones last 5-10 frames (frame = 10 msecs)
      - duplicate VQ values
      - e.g., P([C3,C3,C5,C5,C6,C6]|[m]) = P(Onset->Onset) * ... * P([C3]|Onset) * P([C3]|Onset) * ... + P(Onset->Mid) * ... * P([C3]|Onset) * P([C3]|Mid) * ...
    - acquiring HMM probabilities
      - from data (see last section)
- putting the models together
  - want P(word|signal) ~ P(word) * P(signal|word)
  - given
    - P(word_i|word_i-1): language bigram model
    - P(phones|word): word pronunciation HMM
    - P(signal|phone): phone HMM
  - one big HMM
    - bigram model
      - each state a word
      - every word has transition arc to every other word
    - replace word state with word pronunciation model
      - states are now phones
    - replace each phone state with phone HMM
      - states are now VQ value distributions
  - using big HMM
    - can assign probabilities to each possible word string
    - infeasible, too many word strings
- Viterbi algorithm
  - given HMM and VQ value sequence [C1,...,Cn]
  - outputs most probable path and its probability
  - in general, would consider all paths
  - but Markov Property (MP) allows ignoring less probable paths
    - MP: most probable path for the rest of any sequence depends only on where it starts, not any part of path before that
  - e.g., [C1,C3,C4,C6] [Fig24.37, p766]
    - ovals labeled with state and probability of path ending in state
    - transition P1;P2 means
      - probability of making this transition is P1
      - probability of outputting VQ value given transition is P2
    - find all paths that can output C1
      - Onset
    - find all paths from Onset generating C3
      - Onset, Onset
      - Onset, Mid
    - find all paths from above generating C4
      - Onset, Onset, Mid: prob = 0.022 (forget this path)
      - Onset, Mid, Mid: prob = 0.0441 (choose this one)
      - Onset, Mid, End
    - continue until final state reached
    - trace backwards along bold arcs to extract path
  - complexity O(bMn)
    - M = number of states in big HMM
    - b = branching factor from states (at most M)
    - n = length of VQ value sequence
    - compare to O(M^n) without Markov property
- training the model
  - learn HMM probabilities from training set of [signal,words] pairs
- current systems 80-98% accurate depending on
  - hardware
  - size of vocabulary
  - amount of pause between utterances
  - strength of language model
  - variations of speakers
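
Worked check of the signal-processing figures (8 kHz sampling, 8-bit samples, 10 msec frames, one byte per vector-quantized frame). This is only a rough sketch: the constant and function names are illustrative, not from the text, and frame overlap is ignored.

```python
# Data-rate arithmetic for the signal-processing figures quoted in the notes.
SAMPLE_RATE_HZ = 8000   # 8 kHz sampling rate
BITS_PER_SAMPLE = 8     # quantization factor
FRAME_MS = 10           # frame length in milliseconds


def raw_bytes_per_second(sample_rate_hz=SAMPLE_RATE_HZ, bits_per_sample=BITS_PER_SAMPLE):
    """Bytes per second of the digitized signal before any compression."""
    return sample_rate_hz * bits_per_sample // 8


def samples_per_frame(sample_rate_hz=SAMPLE_RATE_HZ, frame_ms=FRAME_MS):
    """Number of samples grouped into one frame."""
    return sample_rate_hz * frame_ms // 1000


def vq_bytes_per_minute(frame_ms=FRAME_MS):
    """Bytes per minute after vector quantization into 256 regions,
    i.e. one byte per frame, assuming non-overlapping frames."""
    frames_per_second = 1000 // frame_ms
    return frames_per_second * 60


if __name__ == "__main__":
    print("raw bytes/sec:", raw_bytes_per_second())
    print("raw bytes/min:", raw_bytes_per_second() * 60)
    print("samples/frame:", samples_per_frame())
    print("VQ bytes/min: ", vq_bytes_per_minute())
```

With the default values this gives 8000 bytes/sec (480000 bytes, about 0.5 MBytes, per minute), 80 samples per frame, and 6000 bytes per minute after vector quantization.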
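
Sketch of the language model estimates: bigram probabilities counted from a training corpus, the bigram probability of a word string, and the weighted sum of unigram, bigram and trigram models. The toy corpus, the function names and the weights (0.1, 0.3, 0.6) are assumptions made for illustration; the notes only require c1 + c2 + c3 = 1.

```python
from collections import Counter


def train_ngrams(tokens):
    """Count unigrams, bigrams and trigrams in a list of word tokens."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))
    return unigrams, bigrams, trigrams


def p_unigram(w, unigrams):
    """Word frequency: P(w)."""
    total = sum(unigrams.values())
    return unigrams[w] / total if total else 0.0


def p_bigram(w, prev, unigrams, bigrams):
    """P(w|prev) = (#times w follows prev) / (#times prev occurs)."""
    return bigrams[(prev, w)] / unigrams[prev] if unigrams[prev] else 0.0


def p_trigram(w, prev2, prev1, bigrams, trigrams):
    """P(w|prev2, prev1), where prev2 prev1 w appear in that order."""
    context = bigrams[(prev2, prev1)]
    return trigrams[(prev2, prev1, w)] / context if context else 0.0


def sentence_prob_bigram(words, unigrams, bigrams):
    """P(w1...wn) = P(w1) * P(w2|w1) * P(w3|w2) * ..."""
    prob = p_unigram(words[0], unigrams)
    for prev, w in zip(words, words[1:]):
        prob *= p_bigram(w, prev, unigrams, bigrams)
    return prob


def p_interpolated(w, prev2, prev1, counts, weights=(0.1, 0.3, 0.6)):
    """c1*P(w) + c2*P(w|prev1) + c3*P(w|prev2, prev1); weights are hypothetical."""
    unigrams, bigrams, trigrams = counts
    c1, c2, c3 = weights
    return (c1 * p_unigram(w, unigrams)
            + c2 * p_bigram(w, prev1, unigrams, bigrams)
            + c3 * p_trigram(w, prev2, prev1, bigrams, trigrams))


if __name__ == "__main__":
    corpus = "ate the bun ate the bun ate the gun".split()  # toy corpus
    counts = train_ngrams(corpus)
    uni, bi, tri = counts
    print(p_bigram("bun", "the", uni, bi), p_bigram("gun", "the", uni, bi))
    print(sentence_prob_bigram(["ate", "the", "bun"], uni, bi))
    print(sentence_prob_bigram(["bun", "the", "ate"], uni, bi))
```

On the toy corpus, "ate the bun" scores higher than both "ate the gun" and "bun the ate". The unigram term in the weighted sum keeps a word from getting probability zero just because its bigram or trigram context was never seen.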
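
Sketch of P(VQ values|phone) for the [m] phone HMM, computed by brute-force enumeration of state paths. The forward-arc probabilities (0.7, 0.1, 0.6) and the outputs P(C3|Onset)=0.3, P(C5|Mid)=0.1, P(C6|End)=0.5 come from the worked example in the notes. The self-loop probabilities (taken as 1 minus the forward arc) and P(C3|Mid)=0.2 are assumptions, chosen to be consistent with the other numbers quoted in the notes. Function names are illustrative.

```python
from itertools import product

# Transition probabilities. Forward arcs are from the notes; self-loops are ASSUMED.
TRANS = {
    "Onset": {"Onset": 0.3, "Mid": 0.7},
    "Mid":   {"Mid": 0.9, "End": 0.1},
    "End":   {"End": 0.4, "Final": 0.6},
}

# Output probabilities needed for this example. P(C3|Mid) is ASSUMED.
EMIT = {
    "Onset": {"C3": 0.3},
    "Mid":   {"C3": 0.2, "C5": 0.1},
    "End":   {"C6": 0.5},
}


def path_probability(path, outputs):
    """Probability of following one state path (starting in Onset, leaving to
    Final after the last output) while emitting the given VQ values."""
    if path[0] != "Onset":
        return 0.0
    prob = 1.0
    for i, (state, out) in enumerate(zip(path, outputs)):
        prob *= EMIT[state].get(out, 0.0)
        next_state = path[i + 1] if i + 1 < len(path) else "Final"
        prob *= TRANS[state].get(next_state, 0.0)
    return prob


def sequence_probability(outputs):
    """Sum over every state path of the same length as the output sequence.
    Exponential in the sequence length, so only suitable for tiny examples."""
    states = list(TRANS)
    return sum(path_probability(path, outputs)
               for path in product(states, repeat=len(outputs)))


if __name__ == "__main__":
    print(path_probability(["Onset", "Mid", "End"], ["C3", "C5", "C6"]))
    print(sequence_probability(["C3", "C3", "C5", "C5", "C6", "C6"]))
```

For [C3,C5,C6] only the path Onset, Mid, End has nonzero probability, which reproduces the 0.00063 figure. For [C3,C3,C5,C5,C6,C6] two paths contribute, because the second C3 can come from either Onset or Mid; these are the two terms of the sum in the notes.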
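
Minimal sketch of the Viterbi algorithm on the [C1,C3,C4,C6] example. Only some of the HMM's probabilities appear in these notes; the rest of the tables below (self-loops and several output probabilities) are assumptions, chosen so that the intermediate values 0.022 and 0.0441 from the example come out as stated. Output probabilities are attached to the state being entered, which here is equivalent to the P1;P2 labels on transitions.

```python
# Transition and output tables for the [m] phone HMM.
# Values not quoted in the notes are ASSUMED (see the lead-in).
TRANS = {
    "Onset": {"Onset": 0.3, "Mid": 0.7},
    "Mid":   {"Mid": 0.9, "End": 0.1},
    "End":   {"End": 0.4, "Final": 0.6},
}
EMIT = {
    "Onset": {"C1": 0.5, "C2": 0.2, "C3": 0.3},
    "Mid":   {"C3": 0.2, "C4": 0.7, "C5": 0.1},
    "End":   {"C4": 0.1, "C6": 0.5, "C7": 0.4},
}


def viterbi(outputs, trans=TRANS, emit=EMIT, start="Onset", final="Final"):
    """Return (probability, path) of the most probable state path that starts
    in `start`, emits `outputs`, and then moves to `final`."""
    # best[state] = (probability, path) of the best path ending in that state
    best = {start: (emit[start].get(outputs[0], 0.0), [start])}
    for out in outputs[1:]:
        step = {}
        for state, (prob, path) in best.items():
            for succ, p_trans in trans[state].items():
                if succ == final:
                    continue  # the final state emits nothing
                cand = prob * p_trans * emit[succ].get(out, 0.0)
                # Markov property: keep only the best path into each state
                if cand > step.get(succ, (0.0, None))[0]:
                    step[succ] = (cand, path + [succ])
        best = step
    finished = [(prob * trans[state].get(final, 0.0), path)
                for state, (prob, path) in best.items()]
    return max(finished, key=lambda item: item[0], default=(0.0, []))


if __name__ == "__main__":
    prob, path = viterbi(["C1", "C3", "C4", "C6"])
    print(path, prob)
```

At the C4 step the sketch keeps Onset, Mid, Mid (0.0441) and discards Onset, Onset, Mid (about 0.022), as in the example. Each step looks at every kept state and each of its successors, which is where the O(bMn) bound comes from.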