Russell and Norvig, Chapter 23: Practical Natural Language Processing

23.1 Practical Applications
  - focused on a particular domain and task
  - machine translation
    - 20K-100K words in lexicon
    - 100-10K grammar rules
  - database access
  - information retrieval
    - document search based on key words
  - text categorization
    - news story search based on subject
  - data extraction
    - fill in the slots of the best-matching frame
  - others
    - email filter
    - internet query server

23.2 Efficient Parsing
  - three improvements
    - chart parsing
      - record successful phrase parses in a chart
      - avoid duplicate effort on backtracking
    - combined top-down and bottom-up approaches
    - packed forest of parse trees instead of all possible trees
  - chart parsing [Fig23.1, p698]
    - for an n-word string, n+1 vertices numbered 0 to n
    - words lie between vertices
    - chart entry (edge): [start, end, grammar rule]
      - dot (*) in grammar rule marks progress of parse
      - e.g., [0, 2, S --> NP * VP]
        - first two words comprise an NP of an S
    - entries in the chart are never removed
    - entries added by
      - initializer (e.g., [0, 0, S' --> * S])
      - predictor (e.g., [0, 0, S --> * NP VP])
      - scanner (e.g., [0, 1, NP --> Pronoun *])
      - completer (e.g., [0, 1, S --> NP * VP]) (i.e., extender)
    - algorithm
      - nondeterministic [Fig23.2, p699]
      - deterministic [Fig23.6, p702]
      - e.g., [Fig23.3-4, p700]
  - extracting parses
    - remember children trees when Completer combines edges
    - avoid exponential number of parses by embedding choice points directly in parse
      - e.g., [S [S {NP1 | NP2} ] and [S {NP3 | NP4} ] ]
      - this is called a packed forest

23.3 Scaling Up the Lexicon
  - from string of characters to words
  - four steps
    - tokenization
      - extract word strings and punctuation
      - easy in English (hyphenation vs. dash)
    - morphological analysis
      - describe word in terms of prefixes, suffixes and root forms
    - dictionary lookup of word meaning
    - error analysis for unknown words
      - guess part of speech from word form (e.g., -ed --> past tense verb)
      - capitalization implies proper noun
      - special formats (e.g., dates, times)
      - spelling correction
  - WordNet
    - ~100K lexicon in public domain
    - sample in [Fig23.7, p705]
    - missing
      - frequency information (e.g., likelihood of usage)
      - semantic restrictions (e.g., typical direct objects)

23.4 Scaling Up the Grammar
  - nominal compounds and apposition
    - e.g., "the wumpus world simulator"
    - grammar rule: Noun --> Noun Noun
    - semantics
      - Noun(lambda(y) exists(x) sem1(x) & sem2(y) & Related(x,y)) --> Noun(sem1) Noun(sem2)
      - e.g., Wumpus(w) & World(wld) & Related(w,wld) & Simulator(s) & Related(wld,s)
  - adjective phrases
    - e.g., "the smelly wumpus"
    - grammar rule: Noun --> Adjective Noun
    - semantics
      - Noun(lambda(x) sem1(x) & sem2(x)) --> Adjective(sem1) Noun(sem2)
      - e.g., exists(x) smelly(x) & wumpus(x)
    - not foolproof: "red herring", "real leather", "fake gun"
  - determiners
    - e.g., "three pits"
    - grammar rules (q = quantifier)
      - Det(q) --> Article(q) | Number(q)
      - NP([q(x) noun(x)]) --> Det(q) Noun(noun)
    - e.g., [3(x) pit(x)]
      - logical form?
  - noun phrases revisited (person and number)
    - NP(case,Person(3),number,[q(x) sem(x)]) --> Det(number,q) Noun(number,sem)
      - nouns are always third person
      - pronouns may be first, second or third person
    - S(rel(obj)) --> NP(Subject,person,number,obj) VP(person,number,rel)
      - subject/verb agreement
  - clausal complements
    - e.g., "I believe the wumpus is dead"
    - grammar rules
      - VP(subcat) --> VP([S|subcat]) S
      - VP(subcat) --> VP([VP|subcat]) VP
      - Verb([S]) --> "believe"
      - Verb([VP]) --> "want"
  - relative clauses
    - e.g., "the wumpus that I saw" (gap)
    - grammar rules
      - NP(gap) --> NP(gap) RelClause
      - RelClause --> Pronoun(Relative) S(Gap(NP))
      - NP(Gap(NP)) --> ""
  - questions
    - yes/no ("did you see the wumpus")
    - gapped ("where did you see the wumpus")
    - grammar rules
      - S --> Question
      - Question --> SubjInvert
      - SubjInvert --> Aux NP VP
      - Question --> Pronoun(Interrogative) SubjInvert(Gap(NP))
  - handling ungrammatical strings
    - e.g., "wumpus dead arrow"
    - insert possible fillers
    - then, disambiguate

23.5 Ambiguity
  - reasoning about evidence under uncertainty
    - belief networks
  - syntactic evidence
    - e.g., "agent 1 asked agent 2 to kill wumpus 1 during the last action"
    - "during..." modifies "kill", not "asked"
  - lexical evidence
    - "the agent shot the wumpus"
    - preference of a verb for one subcategorization
    - "shot" dead more likely than snap "shot"
  - semantic evidence
    - "the wumpus liked the agent"
    - wumpi only eat agents, "liked the taste of"
  - metonymy
    - one object used to stand for another
    - "Chrysler announced a new model"
  - metaphor
    - one literal meaning suggests another via analogy
    - "the student massaged the code"

23.6 Discourse Understanding
  - model of speaker's intentions
  - categorize sentences into one of the intentions
  - instantiate instances of intentions (e.g., frames)
  - apply coherence relations to disambiguate
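
Sketch for 23.2 (chart parsing): the initializer, predictor, scanner and completer steps can be illustrated with a small recognizer. This is a minimal sketch under stated assumptions, not the textbook's algorithm from Fig23.2 or Fig23.6. The toy GRAMMAR and LEXICON, the Edge tuple, and the function names chart_parse and recognizes are all hypothetical, and the grammar is assumed to contain no empty rules. It only decides whether a parse exists; it does not build the packed forest.

```python
from collections import namedtuple

# Edge [start, end, lhs --> found * remaining]; the dot sits between
# `found` and `remaining`. (Hypothetical representation for illustration.)
Edge = namedtuple("Edge", "start end lhs found remaining")

# Toy grammar and lexicon, assumed for this sketch only.
GRAMMAR = {
    "S": [["NP", "VP"]],
    "NP": [["Pronoun"], ["Article", "Noun"]],
    "VP": [["Verb"], ["Verb", "NP"]],
}
LEXICON = {
    "I": {"Pronoun"},
    "the": {"Article"},
    "wumpus": {"Noun"},
    "see": {"Verb"},
}


def chart_parse(words, grammar=GRAMMAR, lexicon=LEXICON, start="S"):
    """Return the chart: chart[j] lists the edges that end at vertex j."""
    n = len(words)
    chart = [[] for _ in range(n + 1)]  # n words give n+1 vertices, 0..n

    def add(edge):
        # Entries are only ever added, never removed; skip duplicates.
        if edge not in chart[edge.end]:
            chart[edge.end].append(edge)

    # Initializer: [0, 0, S' --> * S]
    add(Edge(0, 0, "S'", (), (start,)))

    for j in range(n + 1):
        i = 0
        while i < len(chart[j]):  # chart[j] may grow while it is processed
            edge = chart[j][i]
            i += 1
            if edge.remaining:
                nxt = edge.remaining[0]
                if nxt in grammar:
                    # Predictor: add [j, j, nxt --> * rhs] for each rule.
                    for rhs in grammar[nxt]:
                        add(Edge(j, j, nxt, (), tuple(rhs)))
                elif j < n and nxt in lexicon.get(words[j], ()):
                    # Scanner: the next word has the expected category,
                    # so move the dot over it and extend the edge to j+1.
                    add(Edge(edge.start, j + 1, edge.lhs,
                             edge.found + (nxt,), edge.remaining[1:]))
            else:
                # Completer: edge is finished, so extend every older edge
                # that was waiting for edge.lhs at vertex edge.start.
                for older in list(chart[edge.start]):
                    if older.remaining and older.remaining[0] == edge.lhs:
                        add(Edge(older.start, j, older.lhs,
                                 older.found + (edge.lhs,),
                                 older.remaining[1:]))
    return chart


def recognizes(words, start="S"):
    """True if the chart holds the finished edge [0, n, S' --> S *]."""
    chart = chart_parse(words, start=start)
    return Edge(0, len(words), "S'", (start,), ()) in chart[len(words)]
```

With this toy grammar, recognizes("I see the wumpus".split()) returns True and recognizes("I see the".split()) returns False. After the first word is processed, chart[1] contains the edge [0, 1, S --> NP * VP], matching the completer example in 23.2.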