Parsing Algorithms for Uncertain Input
The University of Groningen is a research university with a global outlook, deeply rooted in Groningen, the Netherlands. Quality has had top priority for four hundred years, and with success: the University is currently in or around the top 100 on several influential ranking lists.
The Computational Linguistics group is part of the Center for Language and Cognition Groningen. The group focuses on natural language processing by computers, from theoretical, experimental, and applied perspectives. Areas of interest are parsing, wide-coverage grammars (especially for Dutch), machine translation, machine learning, and dialectometry. Strong ties to researchers specializing in semantics and descriptive linguistics have developed from common interests in mathematical linguistics and corpora.
Project: Parsing Algorithms for Uncertain Input
Gertjan van Noord, Rob van der Goot
The automated analysis of natural language is an important ingredient for future applications that require the ability to understand natural language. For carefully edited texts, current algorithms obtain good results. However, for user-generated content such as tweets and contributions to Internet fora, these methods are not adequate, for a variety of reasons including spelling mistakes, grammatical mistakes, unusual tokenization, partial utterances, and interruptions. Likewise, the analysis of spoken language faces enormous challenges. One important respect in which current methods break down is that they take the input very literally: disfluencies, small mistakes, or unexpected interruptions in the input often lead to serious problems. In contrast, humans understand such utterances without difficulty and are often not even aware of a spelling or grammatical mistake in the input.
We propose to study a model of language analysis in which the purpose of the parser is to provide the analysis of the `intended' utterance, which is obviously closely related to the observed input but might be slightly different. The relation between the observed sentence and the intended sentence is modeled by a kernel function on pairs of input strings. Such a kernel function accounts for different kinds of noise: it might model errors such as disfluencies, false starts, word swaps, etc. More concretely, the kernel function can be thought of as a weighted finite-state transducer, mapping an observed input to a weighted finite-state automaton that represents a probability distribution over possible intended inputs. The parser is then supposed to pick the best parse out of the set of parses of all possible inputs, taking the various probabilities into account. Note that there is an obvious similarity with parsing word graphs (word lattices) output by a speech recognizer, as well as with some earlier techniques for parsing ill-formed input; the proposed model combines and generalizes these ideas. The study will focus on questions of the following types: can we efficiently compute such an analysis (taking into account a variety of possible formalizations), and what types of disfluencies, noise, and mistakes in the input can be effectively modeled in this approach?
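To make the idea concrete, here is a minimal toy sketch of the pipeline described above. All names, candidate corrections, probabilities, and the stub "grammar" are invented for exposition; a real system would use a weighted finite-state transducer over the full vocabulary and an actual parser in place of the lookup table. The kernel maps each observed token to weighted intended candidates, the candidates form a small lattice, and the best intended sentence is the one maximizing kernel probability times parse probability:

```python
from itertools import product

# Toy noise kernel (illustrative, not learned): each observed token maps
# to weighted candidate "intended" tokens.
NOISE_KERNEL = {
    "teh": [("the", 0.9), ("tea", 0.1)],
    "slepes": [("sleeps", 0.8), ("slopes", 0.2)],
}

# Stub for the parser: probability the grammar assigns to an intended
# sentence. A real implementation would run a probabilistic parser here.
GRAMMAR_SCORE = {
    ("the", "cat", "sleeps"): 0.7,
    ("the", "cat", "slopes"): 0.1,
    ("tea", "cat", "sleeps"): 0.05,
}

def best_intended(observed):
    """Return the intended sentence maximizing kernel prob * parse prob."""
    # Each position holds the lattice alternatives for one observed token;
    # unknown tokens are assumed to be intended as-is.
    alternatives = [NOISE_KERNEL.get(tok, [(tok, 1.0)]) for tok in observed]
    best, best_score = None, 0.0
    for combo in product(*alternatives):  # enumerate all lattice paths
        tokens = tuple(tok for tok, _ in combo)
        kernel_prob = 1.0
        for _, p in combo:
            kernel_prob *= p
        score = kernel_prob * GRAMMAR_SCORE.get(tokens, 0.0)
        if score > best_score:
            best, best_score = tokens, score
    return best, best_score

print(best_intended(["teh", "cat", "slepes"]))
```

Note that this sketch enumerates every lattice path explicitly, which is exponential in sentence length; the efficiency question posed in the project is precisely whether the transducer and the parser can instead be composed so that the best intended analysis is found without such enumeration.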