ShortTalk: Dictation Made Rewarding

Related Work

The design and the efficiency of editors were intensely studied in the '70s. For example, the classic work “The Psychology of Human-Computer Interaction” by Card, Moran, and Newell (1983) focuses mostly on editing. It provides a wealth of data relating to human cognition and motor ability, along with rigorous models of cognitive tasks. The GOMS model in particular is interesting since it ties operational efficiency to the tactical, cognitive processes necessary to express keyboard commands.

Curiously, there is apparently no speech recognition research that follows up on this work. Research in dictation systems has focused on usability for simple dictation tasks, such as in medical applications, and on the error correction problem—a fundamental issue for which commercial systems have no good solutions. Such questions about basic usability are very important, but they are orthogonal to the aims of ShortTalk.

There is a great deal of work in the area of using natural language for commanding small devices such as PDAs. This work is of little relevance to editing, since it assumes relatively simple tasks that are carried out after little or no training. Editing is a professional activity and a process so complicated that it requires substantial training. Thus our approach does not contradict established research in speech user interfaces. Even so, the simplicity of spoken interfaces for operating common appliances is sometimes assumed to be obviously generalizable to operating computers through the use of natural language. There is no published indication, to the knowledge of the author, that such a paradigm is, or ever will be, workable.

There is overwhelming and undeniable evidence that the opposite paradigm holds if we look at our ability to produce symbols not necessarily tied to language. Humans are well served by concise, highly abstract notations based on a universal, small vocabulary of signs. In fact, hundreds of millions of trained users of personal computers are conditioned to the spontaneous and unconscious issuing of sequences of such signs. For example, when the user comes upon the thought that “data base” should have been “database”, the sequence of signs for “go back over the word”, “delete the character before the cursor”, and “go forward over the next word” is expressed immediately and seemingly without further conscious thinking—through finger movements.

The hypothesis that natural language—by virtue of being natural—is immediately natural to use is untenable according to the study by Karl, Pettey, and Shneiderman, “Speech Activated versus Mouse-Activated Commands for Word Processing Applications: An Empirical Evaluation.” They showed that when users are asked to use a simple set of natural language commands, such as “page up”, instead of the corresponding keys, task performance may be severely affected in non-trivial editing situations. Although the authors note that mouse activation of commands is slower than speech activation according to their empirical results, they advance the somewhat surprising hypothesis that speech itself interferes with thinking: the use of speech for commands adversely affects short-term memory.

We believe that there is a much simpler explanation: the lack of training with the small vocabulary of voice commands is the source of the cognitive load. The situation would have been no different, we hypothesize, had the keys of the keyboard been rearranged. A more compelling conclusion from these results is that the naturalness of natural language is an illusion: it becomes natural to use the two words “page up” for the specific effect of moving the page on the screen upwards only after an amount of training that makes their use a trained reflex. Thus, the words “page up”, when spoken to a computer, constitute a new meaning, one that is not initially wired in the brain. The fallacy is that the naturalness of “page up,” as usually understood, confuses the obvious mnemonic quality of these two words with the question of whether they, with their specific contextual meaning, are “wired” to be used spontaneously.

It appears that the editing problem has been seriously addressed only by the people who are most likely to find a solution for it: injured programmers and computer scientists, who have both the motivation and the skills to work on one.

The essay “Speech Command and Control” approaches user interface design from a philosophical standpoint very similar to our own. The writer, Kim Patch, is a reporter and editor who has used dictation technology for almost ten years. She argues that the key to success is the steady assimilation of a specialized jargon. Moreover, it is an illusion to think that just because humans are good at spoken language, the use of speech recognition is an immediately natural activity. But since the human repertoire of words is much greater than that of keyboard symbols, speech should in theory be much better suited to the task of editing than a keyboard.

Kim Patch has made her macros publicly available; they probably constitute the most comprehensive published collection of spoken commands for general computer use. The commands have evolved over several years and constitute a succinct, constructed language that identifies important editing concepts, many of which have no immediate analogue in natural language.

There are many similarities between ShortTalk constructs and Kim Patch's language. For example, “Another 3 Up 2” combines “Another Line” with three return keys and two up arrows. Here, “Another Line” is a command that inserts a newline character after the end of the current line; replacing “Line” with “3” makes the command insert three newline characters instead. The command “Up 2” moves the cursor up two lines. The use of “Another Line” illustrates a fundamental trait shared with ShortTalk: Kim Patch has evolved an ontology of editing concepts that is far more sophisticated than that found in any commercially available product. (ShortTalk does not directly represent the concept of “Another Line”; the equivalent phrase would be “ex line loon”, which combines the atomic concept of moving the cursor to the end of the line with the concept of inserting a newline.)
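To make the composite reading concrete, the following minimal Python sketch—ours, not Kim Patch's actual macro code—models a toy line-based buffer in which “Another 3 Up 2” decomposes into the two primitive operations described above; the class and method names are invented for this illustration.

```python
# Toy model of a line-based buffer; names are hypothetical and chosen
# only to show how "Another 3 Up 2" decomposes into primitives.

class Buffer:
    def __init__(self, lines):
        self.lines = list(lines)
        self.row = 0  # index of the line the cursor is on

    def another(self, n=1):
        """'Another <n>': insert n empty lines after the current line
        and leave the cursor on the last of them."""
        for i in range(n):
            self.lines.insert(self.row + 1 + i, "")
        self.row += n

    def up(self, n=1):
        """'Up <n>': move the cursor up n lines."""
        self.row = max(0, self.row - n)

buf = Buffer(["first line", "second line"])
buf.another(3)  # "Another 3": three new lines below the current line
buf.up(2)       # "Up 2": back up two lines
print(buf.lines, buf.row)  # ['first line', '', '', '', 'second line'] 1
```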

Where ShortTalk and Kim Patch's language differ fundamentally is in their approach to disambiguation. By construction, ShortTalk commands may be embedded continuously within dictation. In contrast, the language of Kim Patch may be used only with pauses before and after commands, but as the previous composite example shows, it is by design sometimes not necessary to insert pauses between commands.

For the application of program editing, Alain Desilets has built a voice-controlled system called VoiceGrip (“VoiceGrip: a tool for programming-by-voice,” International Journal of Speech Technology, 4, 2001). VoiceGrip makes it possible to enter common identifiers such as “SysPtr” by saying “system pointer”, a very useful technique that would complement ShortTalk.
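The following minimal Python sketch—not Desilets' actual algorithm—suggests how such a mapping from spoken phrases to identifiers could work, assuming a small, hypothetical abbreviation dictionary and a camel-case naming convention.

```python
# Hypothetical abbreviation table; VoiceGrip's real dictionaries and
# matching rules are more elaborate than this sketch.
ABBREVIATIONS = {"system": "Sys", "pointer": "Ptr", "buffer": "Buf"}

def phrase_to_identifier(phrase: str) -> str:
    """Map a spoken phrase such as 'system pointer' to an identifier
    such as 'SysPtr' by abbreviating and concatenating each word."""
    return "".join(ABBREVIATIONS.get(word, word.capitalize())
                   for word in phrase.lower().split())

print(phrase_to_identifier("system pointer"))  # -> SysPtr
```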

Alain Desilets also proposes the use of natural language for entering program code, especially for idiomatic constructs. It is our belief that the cognitive overhead of remembering such a system of specialized natural language may not stand in a reasonable relationship to the frequency of the situations that can be tackled this way. Moreover, the approach leaves unsolved the issue of handling the myriad small editing situations that occur in practice, such as those we mentioned in the audio demos. In some sense, this approach based on natural language understanding is the diametric opposite of the principle we advocate: that humans, through their intellectual superiority, are better served by a precise, well-defined, universal, but entirely combinatorial, tool. In addition, it remains an open question whether a natural language approach, even if made as universal as ShortTalk, would in fact entail less training: somehow the user must learn the concepts, syntax, and limits of the language—something that may be much more difficult given the variability and vagueness of natural language. The VoiceGrip project is continuing as the VoiceCode Programming by Voice Toolbox.

A related project is the Emacs Voice Commander by Hans van Dam. It proposes to formulate a spoken equivalent of every command of the GNU Emacs text editor, a tool used by many professional programmers. Emacs possesses a large number of commands, but they are not constructed to follow strict principles of orthogonality. The reason for this is quite simple: there is no easy way to map an orthogonal command language to keys unless two or three keystrokes are used for each command. Moreover, these keystrokes would probably have to involve modifier keys. In practice, it is much more important that common functions be mapped to simple key combinations. For that reason, there was never an incentive to construct an editing language as systematic as ShortTalk. For example, there is no command in Emacs that positions the cursor at the first following character that is not white space. In ShortTalk, this is expressed as “ex stretch”, where “ex” means “go to the end of” and “stretch” means a stretch of white space.
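As an illustration of these semantics—a sketch of ours, not ShortTalk's implementation—the effect of “ex stretch” can be expressed in Python over a plain string and an integer cursor position as follows.

```python
def ex_stretch(text: str, cursor: int) -> int:
    """Return the position just past the stretch of white space at the
    cursor, i.e. the first following character that is not white space.
    Simplified: assumes the cursor already sits on the white space."""
    while cursor < len(text) and text[cursor].isspace():
        cursor += 1
    return cursor

line = "if (done)     return;"
print(ex_stretch(line, 9))  # 14: the 'r' of 'return' after the spaces
```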

In an attempt to enhance general speech user interfaces, the Universal Speech Interface project proposes that users be taught a limited set of skills—the use of a handful of keywords and interaction patterns—in order to use speech recognition across a range of applications. Such a framework provides an analog of the “look-and-feel” of graphical user interfaces. Thus, the absence of natural language helps a user understand the limits of the system, while providing efficient spoken query mechanisms. Additionally, the response of the system will be fast and reliable, since less intelligence is required on the part of the computer. The Universal Speech Interface does not use constructed words, since there is no issue of distinguishing dictation from commands. The project also differs from our approach in that the training involved is minimal compared to the effort of learning to manipulate a text editing system.