ShortTalk: Dictation Made Rewarding

Introduction

If the killer app is alluring enough, the learning curve will often take care of itself.

With the recent arrival of microprocessors operating in the GHz range, speech recognition is becoming an efficient means of writing, as long as no editing is involved. But in most situations, people who write on the computer rely heavily on editing: text is produced in a chaotic process in which sentences, words, and even individual characters are deleted, modified, or moved around. Additionally, technical writing and programming demand entering text that is not easily pronounced: symbols, program identifiers, and markup are prominent examples. Dictation system vendors have emphasized natural language commands as the natural way of using a computer. Let us look at three reasons why natural language does not work well for editing.

Verbosity of natural language

No one using a dictation system should be forced to pronounce cumbersome utterances like “go to the beginning of the line” or “exclamation mark exclamation mark exclamation mark” to accomplish editing work that is trivially carried out with the keyboard. To be practical, editing by voice must be a fluent activity that carries a high information rate. As the phrases above illustrate, current dictation systems are deficient in this regard, because natural language is verbose for describing common editing tasks. This is one important reason behind the limited appeal of speech recognition as a keyboard replacement.

Poverty of natural language

No culture of speaking in front of the computer screen has established how the syntax of natural language maps onto the intricacies of moving text around (with the exception of swear words for the action “undo”). Humans do have experience using natural language for ordering airline tickets. Consequently, it is a sensible challenge to try to build spoken, interactive systems in which a human agent is replaced by a computer. But, as everybody knows, it is very difficult to use natural language to convey editing operations to a person sitting at a computer without a good amount of gesturing, pointing, corrections, and retractions. Thus, for all its richness, natural language is paradoxically an impoverished interface for editing.

Ambiguity of natural language

If a user dictates “select a good restaurant” to a commercial dictation system, those four words will not appear in the text. The problem is that “select” is a command. In current systems, if just a slight pause occurs before certain fragments of natural language, then the utterance is interpreted as a command, not as dictation. Consequently, commands and dictation cannot be fluently interspersed. In practice, it is very unnatural to force pauses between commands, and almost impossible to remember not to pause before dictation that may be interpreted as a command. So, the use of natural language for commands is inherently flawed because of the resulting ambiguity.
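The pause heuristic described above can be made concrete with a minimal sketch. Everything in it is an assumption made for illustration: the command word list, the 0.25-second threshold, and the function name `interpret` are hypothetical and are not taken from any actual dictation product.

```python
# Hypothetical command words; real dictation products use their own, larger lists.
COMMAND_WORDS = {"select", "delete", "undo"}

# Assumed threshold: a preceding silence at least this long marks a possible command.
PAUSE_THRESHOLD_S = 0.25


def interpret(utterance: str, pause_before_s: float) -> str:
    """Classify an utterance as 'command' or 'dictation' using the pause heuristic."""
    words = utterance.lower().split()
    starts_with_command = bool(words) and words[0] in COMMAND_WORDS
    if starts_with_command and pause_before_s >= PAUSE_THRESHOLD_S:
        return "command"
    return "dictation"


# The same four words are treated differently depending only on timing.
hesitant = interpret("select a good restaurant", pause_before_s=0.4)
fluent = interpret("select a good restaurant", pause_before_s=0.05)
```

Under these assumptions the hesitant utterance is classified as a command and the fluent one as dictation, although the speaker intended the same thing both times. The outcome depends on timing rather than on the words, which is the ambiguity the text describes.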

The macro trap

A major selling point for the professional, high-margin versions of dictation systems is the macro facility that allows users to define their own speech commands. Although superficially compelling, the presumption that a user is served well by complementing the built-in command language with new constructs is seriously flawed. The command language should be complete from the outset. A user should not be engaged in the construction of a command language, which is a monumental task. Many users wind up adding hundreds of commands, which become inconsistent, difficult to remember, and never quite up to the job anyway. Too many editing situations remain difficult to tackle.

If the natural language technology promoted by dictation system vendors were complete, there would be little need for command extensions. With the keyboard, most professional users get along without defining keyboard macros. Thus, the emphasis put on the macro facilities is a strong indication that the natural language command facilities are fundamentally inadequate.

ShortTalk

ShortTalk solves the problems above in a way that is, in essence, completely non-revolutionary: by acknowledging the superiority of the human mind over the computer and the mind's willingness to absorb symbols and language. The ShortTalk philosophy is completely utilitarian: the computer is a specialized tool for getting work done, and the human is bound to face a learning situation no matter what. Therefore, the goal is to make the tool universal and as efficient as possible through the careful choice of concepts and syntax. This efficiency will be the principal motivation for learning to use the command language.

Hundreds of millions of people have been trained according to another manifestation of the same principle: the keyboard is a tool that represents a couple of hundred symbols that can be quickly and unconsciously combined by the trained user for superior efficiency. The symbols include letters, but also many command keys that encode a variety of editing concepts (CTRL-V for pasting the clipboard content, CTRL-SHIFT-left-arrow for extending the selection to the start of the current word, etc.). The success of the keyboard proves that the human mind possesses combinatorial skills that allow intents to be effortlessly expressed by stringing together mechanically activated symbols.

Thus, ShortTalk is a spoken adaptation of the proven ability of the human mind to unconsciously combine symbols from a limited vocabulary in order to solve editing tasks. But ShortTalk is much more powerful, since a few dozen editing concepts can be combined in thousands of different ways. Consequently, most editing can be accomplished faster through ShortTalk than through the keyboard. This distinguishes ShortTalk markedly from current commercial offerings.
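The combinatorial claim can be checked with simple arithmetic. The sketch below assumes a toy grammar in which every command has the form action, direction, count, unit. The grammar and all the words in it are hypothetical, chosen only to illustrate the counting argument; they are not ShortTalk's actual vocabulary or syntax.

```python
from itertools import product

# Hypothetical vocabulary for illustration only; not ShortTalk's actual words.
ACTIONS = ["go", "select", "delete", "copy", "cut", "capitalize"]
DIRECTIONS = ["before", "after", "next", "previous"]
COUNTS = ["one", "two", "three", "four", "five", "six", "seven", "eight", "nine"]
UNITS = ["char", "word", "line", "sentence", "paragraph", "page"]


def all_commands():
    """Enumerate every command of the assumed form: action direction count unit."""
    return [" ".join(parts) for parts in product(ACTIONS, DIRECTIONS, COUNTS, UNITS)]


# Number of distinct words the user has to learn.
vocabulary_size = len(ACTIONS) + len(DIRECTIONS) + len(COUNTS) + len(UNITS)

# Number of distinct commands those words can express.
command_count = len(all_commands())
```

In this toy grammar, 25 words yield 6 × 4 × 9 × 6 = 1296 distinct commands, such as “delete next three word”. The vocabulary grows as a sum while the set of expressible commands grows as a product, which is why a small, carefully chosen set of concepts can cover a large range of editing tasks.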