ShortTalk: Dictation Made Rewarding

Executive Summary

ShortTalk is a new method for composing text by speech. This spoken command language is carefully designed to be rewarding to use right from the beginning. In contrast to the so-called “natural language technology” of available dictation systems, ShortTalk commands can be fluently interspersed with dictation. There are no cumbersome phrases like “go to the beginning of the line.” Instead, ShortTalk codifies natural and universal editing concepts that can be combined in command phrases, typically consisting of only two syllables.

For example, “ghin line”—with “ghin” as in “beGINning”—is an unambiguous spoken command for moving the cursor to the beginning of the line. It is a rewarding phrase, because it is faster to say “ghin line” than to find and press the home key on the keyboard.

With almost no application- or user-specific vocabulary, ShortTalk works for e-mails and structured text such as XML or source code. Analytical and empirical studies indicate that ShortTalk, combined with occasional pointing, may be faster than conventional editing using keyboard and mouse.

The technology holds the promise of making tablet-based computers attractive for text entry, since only a few keys are needed to complement speech input that takes advantage of ShortTalk.

A one-minute video demo is available.




Keywords

Dictation system, speech recognition, user interface, speech interface, spoken command language, editing by voice, stenophonic principle, entropy of command languages, tool use and language acquisition.

Introduction

If the killer app is alluring enough, the learning curve will often take care of itself.

With the recent arrival of microprocessors operating in the GHz range, speech recognition is becoming an efficient means of writing—as long as no editing is involved. But in most situations, people who write on the computer rely heavily on editing: text is produced in a chaotic process, where sentences, words, and even individual characters are deleted, modified, or moved around. Additionally, technical writing and programming demand entering text that is not easily pronounced: symbols, program identifiers, and markup are prominent examples. Dictation system vendors have emphasized natural language commands as the natural way of using a computer. Let us look at the three reasons natural language does not work well.

Verbosity of natural language

No one using a dictation system should be forced to enounce cumbersome utterances like “go to the beginning of the line” or “exclamation mark exclamation mark exclamation mark” to accomplish editing work that is trivially carried out by keyboard. To be practical, editing by voice must be a fluent activity that carries a high information rate. As the phrases above illustrate, current dictation systems are deficient in this regard, because natural language is verbose for describing common editing tasks. This is one important reason behind the limited appeal of speech recognition as a keyboard replacement.

Poverty of natural language

There is no culture of speaking in front of the computer screen that has established bindings between the syntax of natural language and the intricacies of moving text around (with the exception of swear words for the action “undo”). Humans do have experience using natural language for ordering airline tickets. Consequently, it is a sensible challenge to try to build spoken, interactive systems where a human agent is replaced by a computer. But, as everybody knows, it is very difficult to use natural language to convey editing operations to a person sitting at a computer without a good amount of gesturing, pointing, corrections, and retractions. Thus, for all its richness, natural language is paradoxically an impoverished interface for editing.

Ambiguity of natural language

If a user dictates “select a good restaurant” to a commercial dictation system, those four words will not appear in the text. The problem is that “select” is a command. In current systems, if just a slight pause occurs before certain fragments of natural language, the utterance is interpreted as a command, not as dictation. Consequently, commands and dictation cannot be fluently interspersed. In practice, it is very unnatural to force pauses between commands, and almost impossible to remember not to pause before dictation that may be interpreted as a command. So, the use of natural language for commands is inherently flawed because of the resulting ambiguity.

The macro trap

A major selling point for the professional, high-margin versions of dictation systems is the macro facility that allows users to define their own speech commands. Although superficially compelling, the presumption that a user is well served by complementing the built-in command language with new constructs is seriously flawed. The command language should be complete from the outset. A user should not be engaged in the construction of a command language, which is a monumental task. Many users wind up adding hundreds of commands, which become inconsistent, difficult to remember, and never quite up to the job anyway. Too many editing situations remain difficult to tackle.

If the natural language technology promoted by dictation system vendors were complete, there would be little need for command extensions. With the keyboard, most professional users get along without defining keyboard macros. Thus, the emphasis put on macro facilities is a strong indication that the natural language command facilities are fundamentally inadequate.

ShortTalk

ShortTalk solves the problems above in a way that is in essence completely non-revolutionary, namely by acknowledging the superiority of the human mind over the computer and its willingness to absorb symbols and language. The ShortTalk philosophy is completely utilitarian: the computer is a specialized tool for getting work done, and the human is bound to face a learning situation no matter what. Therefore, the goal is to make the tool universal and as efficient as possible through careful choice of concepts and syntax. This efficiency is the principal motivation for learning the command language.

Hundreds of millions of people have been trained according to another manifestation of the same principle: the keyboard is a tool that represents a couple of hundred symbols that can be quickly and unconsciously combined by the trained user for superior efficiency. The symbols include letters, but also many command keys that encode a variety of editing concepts (CTRL-V for pasting the clipboard content, CTRL-SHIFT-left-arrow for placing the cursor to the left of the current word, etc.). The success of the keyboard proves that the human mind possesses combinatorial skills that allow intents to be effortlessly expressed through the stringing together of mechanically activated symbols.

Thus, ShortTalk is a spoken adaptation of the proven ability of the human mind to unconsciously combine symbols from a limited vocabulary in order to solve editing tasks. But ShortTalk is much more powerful, since a few dozen editing concepts can be combined in thousands of different ways. Consequently, most editing can be accomplished faster through ShortTalk than through the keyboard. This distinguishes ShortTalk markedly from current commercial offerings.

Tutorial

ShortTalk is a collection of editing concepts that can be strung together in “phrases.” A phrase is usually made of one or two, sometimes three, concepts. Some concepts may stand alone; others may occur only as part of phrases. The resulting language is probably not much more complex than that which can be learned by some non-human primates, who compose phrases made out of signs for food, objects, and simple actions. We call ShortTalk an editing language, although it is not a language in the sense that ordinary people or linguists use the term. In fact, we have taken the opposite, “primate” proto-language point of view, since 1) there is staggering evidence that humans are masters of sequencing formal symbols, as when they tap away on their keyboards, play instruments, etc., and 2) there is no reason to believe that this ability does not translate into the spoken domain, just as you can type digits at about the same speed you can enounce them. And, evidently, humans can sequence words in real languages that are infinitely more complex than proto-languages.

It's about learning and learning is about rewards

Using your voice to control your computer effectively is a matter of training. This holds whatever the syntactic clothing of the utterances, be it mnemonic words, stilted natural language, or diffuse and hard-to-delineate “real” natural language. The greatest motivation is obviously usability: efficiency, systematism, and simplicity. These three factors are argued below as we introduce the main concepts of ShortTalk.

The forward and backward distinction

Editing actions are very often relative to position. For example, search or identification of nearby words or lines is always backwards or forwards relative to the current position. Thus direction is a primary piece of knowledge implicit in our cognition about editing situations. It would be a waste of effort not to represent direction systematically in editing concepts. The ShortTalk solution is extremely simple and terse: the vowel denotes direction. For example, “go aift hello” means place the cursor after the occurrence of “hello” following the current position; and “go ooft hello” means place the cursor after the occurrence of “hello” preceding the cursor. So, the distinction is: “oo” means backward and “ai” means forward. (Both are short vowels; after all, this is ShortTalk.) This system applies to any concept where direction makes sense.

Actions that may stand alone

Pressing the space bar is “spooce”, for half the syllabic effort of saying “space bar.” The same applies to “loon” for return, usually called “new line” in natural language systems. It saves the user the offense of having dictation such as “the new line is that” misinterpreted. (And should the subject exceptionally be a certain aquatic bird, the user may use the phrasing “l-rall loon” to type loon.) Keys like the up and left arrows have similar mnemonic names: go up becomes “goop”, that is, “go oop” (the “oo” sound for backwards motion alters the vowel of “up”), and left arrow becomes “gloof” for “go left” in a similar manner. Then we derive “graif” for “go right” (not “grait”, because it sounds the same as “great”). For a step down towards the netherworld, “go nether” becomes “gnaith”. This part, which concerns the mapping of common symbols, is the most foreign part of ShortTalk, although it is conceptually trivial. It is about finding effective, unambiguous names for common and indispensable control keys. Fortunately, there are not that many of them.

Beginning/end

How do you go to the beginning or end of something you are in? That's easy: “ghin” for “beginning” and “ex” for exit or ending. Thus, to go to the beginning of the word, say “ghin word.” To go to the end of the paragraph, say “ex para.”

Numbers

Scottish “ane” is for one, “twain” for two, “traio” for three, “fairn” for four, and “faif” for five. It stops here because it seems that the eye can quickly identify only four or five items. Commercial systems offer commands like “move cursor down 17 lines”. In ShortTalk, you would say “line faif”, then “goink” to repeat the last command, and then one more “goink” to end up very close to the destination. Then you'll be able to see immediately that “line twain” will bring you to where you want to be. The point is that you never wanted to know that the precise count is 17 in the first place. ShortTalk is about getting your work done, not about implementing voice commands that may be “natural” but barely usable.

Often a number is used where direction makes sense: if “line faif” means “go down five lines,” does “line foof” mean “go up five lines”? Yes! So, now we have ten useful and efficient numerals that eliminate the offensive guesswork about the meaning of “to”, “2”, “two”,... The numerals are also crucial to the disambiguation of commands from dictation. In fact, they can never appear by themselves. That is why “line twain” can be embedded in continuous dictation. And, by the way, “Mark Twain” still comes out as “Mark Twain”, because there is no ShortTalk concept named “mark.”

Characters, words, lines, paragraphs,...

Structural concepts for various kinds of pieces of text are all there: “char” (as in charcoal) for characters, “word” for words, “line” for lines, and “para” for paragraphs. A word with hyphens is an “eed” (for identifier). A “ting” is a thing: any stretch of characters that are not spaces (useful for e-mail addresses). A “tier” is the line without the newline character, an “inner” is everything inside quotes or parentheses, a “term” is a quotation or the whole parenthesized expression, and a “senten” is a sentence. There are a couple more such structural concepts, and together they cover most imaginable characterizations of pieces of text, whether in technical writing or programming. Now combine them with numerals and you already have a very powerful set of tools for just moving the cursor.

For example, “ting ane” skips over all whitespace to put the cursor at the first visible character (letter, parenthesis, whatever) after the cursor. The command “word twoon” puts the cursor at the second word before the current word. Now we have ten numerals times ten concepts for moving the cursor locally. Note that the effort is pretty minimal: the ai/oo principle, the numerals “ane”, “twain”, ..., “faif”, and ten mostly obvious and known terms for pieces of text yield a hundred commands. The human affinity for combining symbols means that the utility of this little grammar grows exponentially over time: hesitancy is soon replaced by “automatic” utterances that reflect your intentions. Contrast this situation with natural language technology, where you will struggle with questions such as “is it ‘move right’ or ‘go right’?” and with persistent misrecognitions of your intentions as to whether you meant dictation or commands (because of the forced pauses that must be inserted between commands).
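
To make the combinatorics concrete, here is a minimal sketch in Python (our own illustration, not the EmacsListen implementation) of how structural concepts and direction-carrying numerals compose into motion commands. Only numeral spellings attested in this article appear in the table.

    UNITS = {"char", "word", "line", "para", "eed", "ting",
             "tier", "inner", "term", "senten"}

    # The ai/oo principle: the vowel of the numeral encodes direction.
    # Backward forms not spelled out in the article are omitted.
    NUMERALS = {
        "ane": (1, "forward"), "twain": (2, "forward"), "traio": (3, "forward"),
        "fairn": (4, "forward"), "faif": (5, "forward"),
        "oon": (1, "backward"), "twoon": (2, "backward"),
        "truo": (3, "backward"), "foof": (5, "backward"),
    }

    def parse_motion(phrase):
        """Parse a two-word motion phrase such as 'line faif' or 'word twoon'."""
        unit, numeral = phrase.split()
        if unit not in UNITS or numeral not in NUMERALS:
            raise ValueError("not a ShortTalk motion phrase: " + phrase)
        count, direction = NUMERALS[numeral]
        return {"unit": unit, "count": count, "direction": direction}

    print(parse_motion("line faif"))   # five lines forward (down)
    print(parse_motion("word twoon"))  # two words backward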

Common places

When you say “this paragraph” to a commercial system, does it refer to the paragraph where the mouse pointer is or where the text cursor is? ShortTalk rejects such ambiguity for human reasons: no user should tolerate the whims and moods of programmers who try to interpret natural language. So again there is a simple system at work: “hare” is “here,” for where the cursor is, and “tair” is “there,” for where the pointer is. But ShortTalk has even more useful positional concepts not made available in most editors. The reason for the relative poverty of editing by keys is simple: there are not enough keys on the keyboard to express such concepts conveniently, even if they are latent in our perception of editing.

For example, ShortTalk keeps track of where the cursor was before the last cursor excursion. So, if you begin moving the cursor around after typing something, this position, called “mairk,” marks the end of what you typed even though the cursor is no longer there. Mairk is denoted visually (by a brown square). This position is really useful—indeed, it is a part of our “where was I?” reasoning about editing. To go to the mairk, say “gairk” for “go to mairk.” To insert a space at mairk, say “spooce lairk”, and to capitalize the word at mairk, say “caip lairk.” The concept of mairk is borrowed from the Emacs text editor; in Emacs, however, the mark is expressed in only a couple of composite commands that are bound to seemingly random keys.

Another essential concept is that of the last position where something changed: it is often the position at the start of the last inserted text. Often you forget a space at that place, or maybe the capitalization is wrong. This position is called “loost.” Naturally, one goes to “loost”, which is marked green, by simply saying “goost.”

Actions

Actions have concise mnemonics: to capitalize is “caip,” to uppercase is “aipper”, to fix spacing and capitalization (after e.g. “.”) is “fix”, to simply insert a space is “spooce,” etc. When editing, we combine actions and places as the situation calls for. For example, after we have dictated “we helped it going” into existing text and the screen now displays “the most we had.we helped it going|”, with the cursor “|” at the end, we say “fix loost” to repair the “we” right after the period. This operation does not move the cursor.

Compare this to reaching for the mouse, moving it to locate the period, clicking, finding the keyboard again to insert spaces, deleting the wrongly cased letter, inserting the uppercased one, and then reaching for the mouse again to reposition the cursor... This example indicates why ShortTalk is much faster than traditional mechanical interfaces in many common situations.

Grabbing and smacking

ShortTalk integrates mouse and cursor positions in commands that greatly amplify the power of a pointing device. For example, the command “grab ting” copies the e-mail address at the mouse pointer to where the cursor is. Thus, to insert an e-mail address in the middle of the text, you can say “please write to grab ting as soon as possible” (without any pauses) while your hand at the same time pushes the mouse so that it is placed somewhere over the e-mail address.

In order to delete something, you use the concept “smack.” So, “smack senten” deletes the sentence where the cursor is. And “smack senten tair” deletes the sentence where the mouse pointer is. If in addition you want to move the cursor to where the deletion happens, you say “smack senten gook”. To delete something while copying it to the clipboard, you use “rem” for “remove.” So, “rem twoon” removes the word at the cursor and the one preceding it. We just illustrated another ShortTalk principle: concepts can be omitted as long as the resulting phrase is not something that is part of natural language. There are always appropriate defaults. In this case, “rem twoon” means “rem eed twoon.”
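
The defaulting rule can be pictured with a small sketch (a hedged illustration: the article states only the “rem”-to-“eed” default, so the table below contains just that one entry):

    UNITS = {"char", "word", "line", "para", "eed", "ting",
             "tier", "inner", "term", "senten"}
    DEFAULT_UNITS = {"rem": "eed"}  # only the default actually stated above

    def expand(phrase):
        """Fill in the default unit when a phrase omits it."""
        words = phrase.split()
        action, rest = words[0], words[1:]
        if rest and rest[0] not in UNITS and action in DEFAULT_UNITS:
            rest.insert(0, DEFAULT_UNITS[action])
        return " ".join([action] + rest)

    print(expand("rem twoon"))     # -> 'rem eed twoon'
    print(expand("smack senten"))  # unit given explicitly; unchanged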

Searching for stuff

Again the principles are very simple: “baif” or “boof” identify the position before words to look for, and “aift” or “ooft” identify positions after. The vowels determine the search direction. So, above we might also have said “fix boof we” to fix the problem around the period. Of course, if we just wanted to insert a white space at this position, we would put the “spooce” word together with “boof we”: “spooce boof we” does the job. Generally, you can fix capitalization and spacing issues in a second or two in this way without using mouse or keyboard. Because it is so much more efficient, these commands become ingrained quickly.
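
In the same illustrative spirit (our sketch, not the actual ShortTalk grammar), each search designator decomposes into a side and a direction:

    # baif/boof identify a position before the target, aift/ooft after it;
    # the vowel again gives the direction of the search.
    DESIGNATORS = {
        "baif": ("before", "forward"), "boof": ("before", "backward"),
        "aift": ("after", "forward"), "ooft": ("after", "backward"),
    }

    def parse_search_command(phrase):
        """Parse e.g. 'spooce boof we': perform an action at a searched spot."""
        action, designator, *target = phrase.split()
        side, direction = DESIGNATORS[designator]
        return {"action": action, "side": side,
                "direction": direction, "target": " ".join(target)}

    print(parse_search_command("spooce boof we"))
    print(parse_search_command("fix boof we helped"))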

Symbols

You do not need to learn the shorthand for symbols, but some will be so convenient that you may long for them. For example, the ShortTalk name for “!” is “clam” (as in “exCLAMation mark”). So, “clam traio” inserts three exclamation marks. (Instead of “exclamation mark, exclamation mark, exclamation mark”.)

What more is there?

We have already covered all the essential aspects of ShortTalk. There is more, of course: window manipulation, insertion of markup, and formatting commands. In the sidebar, you'll find a link to a complete overview of the ShortTalk syntax.

Audio Demos

There are myriads of editing situations that must be solvable in an efficient manner. We give a few samples below. You may notice that ShortTalk is probably faster, even much faster, than your current editor for solving most of these problems.

Fix spacing and capitalization before a “.”—don't lose cursor

A common situation: everything we said to the speech recognizer was taken down correctly, except for a little spacing or capitalization problem.

Before

the most we had.we helped it going|

After

the most we had. We helped it going|

ShortTalk solution (1.0s)

fix boof we helped

Explanation

The “fix” action carries out this common operation at the place indicated by the “boof” search designator: before the nearest preceding occurrence of “we helped”. The action does not change the cursor position.

Delete an errant “%”—don't lose cursor

A common problem: something close to the cursor needs to be deleted, but we don't want to lose the current cursor position.

Before

|The most...We helped% it going

After

|The most...We helped it going

ShortTalk solution (1.5s)

smack sorch per ane

Explanation

The “smack” operator deletes the text identified by the search designator “sorch”, and “per ane” means “one percent sign”.

Concatenate letters

ShortTalk uses both positive and negative numerals to identify text around the cursor.

Before

with A T & T|. The company

After

with AT&T|. The company

ShortTalk solution (.9s)

speece truo

Explanation

“speece truo” applies the no-space operator to the three identifiers preceding the cursor.

Add some space around a “+”-sign

Lots of idiomatic uses of the backspace, arrow, and spacebar keys can be accomplished as quickly in ShortTalk as by keyboard.

Before

z = x+y|

After

z = x + |y

ShortTalk solution (2.4s)

gloof, spooce, gloof twain, spooce, gairk

Explanation

“gloof” means “press the left arrow” and “spooce” means “press the spacebar”. The modifier “twain” means do it twice. “gairk” takes the cursor to mairk, which is the text anchor cast at the beginning of the last movement command.

Capitalize inside a word and put the word in quotation marks

Nitty-gritty manipulation of a word involving capitalization in the middle is easily accomplished.

Before

<body class=myclass>...|

After

<body class="|myClass">

ShortTalk solution (2.9s)

go boof class, caip hare, choose word, quote pair

Explanation

“go boof class” positions the cursor before “class”, “caip hare” capitalizes, “choose word” selects (highlights) the whole word around cursor, and “quote pair” introduces quotation marks around the selection.

Insert a “!” after “by far”, add “em” markup, and put in parentheses

ShortTalk supports sophisticated markup editing in XML.

Before

|The most challenging method by far ...

After

The most challenging method (|<em>by far!</em>)...

ShortTalk solution (7.2s)

go aift by far, stroop, clam ane, choose ting twoon, snex e. m., choose term, par pair

Explanation

“go aift by far” positions the cursor after “by far”—the word “stroop” is a neutral word that delimits the search string; “clam ane” inserts the exclamation mark; “choose ting twoon” selects the two contiguous pieces of characters before the cursor; “snex e. m.” is an XML-specific command that inserts the em-tag around the selected region; “choose term” selects the tagged region (element); and “par pair” inserts parentheses around the selected region.

Delete the modifier of the sentence

For much editing, explicitly identifying text by speaking the whole range is slow. ShortTalk offers an arsenal of structural identification concepts, along with short commands for skipping to an individual symbol.

Before

After two years, he left for Paris. |

After

He left for Paris.

ShortTalk solution (1.4s)

skoop cam, reese senten

Explanation

“skoop cam” skips backwards until before the first comma; and “reese senten” deletes backwards to the beginning of the sentence.

Fetch a program identifier in a declaration and insert

A text-editing idiom in programming is to reuse an identifier or even a subexpression.

Before

int StrangeFunc(int * myPtr) {
		    int *t, *myStrgPtr; 
		    ...
		    *myPtr = *|

After

int StrangeFunc(int * myPtr) {
		    int *t, *myStrgPtr; 
		    ...
		    *myPtr = * myStrgPtr|

ShortTalk solution (3.0s)

go ooft strange, skaip line, word oon, push lairk

Explanation

“go ooft strange” positions the cursor somewhere in the first line of the function definition; “skaip line” positions the cursor at the beginning of the next line; “word oon” positions the cursor at the first letter of “myStrgPtr” by going backwards; and “push lairk” pushes the identifier “myStrgPtr” to where the excursion began. In practice, an even faster alternative is to simply point the mouse at the desired identifier and then issue the ShortTalk command
“grab eed”,
which inserts the identifier under the mouse pointer at the cursor position.

Video Demos

The video (ISDN/Cable bandwidth: Microsoft WMF version or RealMedia version) shows a common scenario of writing a letter that needs only a little editing. (Low bandwidth versions are: Microsoft WMF version or RealMedia version.)

The video makes two important points. First, it shows how modern speech recognizers will quickly convert spoken words into text. (We have ignored the problem of speech recognition errors in this video. In practice, somewhere from 2% to 10% of words are not recognized correctly.) Second, the video shows how ShortTalk makes editing the text very fast. In the demo, we use ShortTalk for moving the cursor around, capitalization, insertion of lines, insertion of punctuation, moving text, and deleting text. Often we carry out quick operations in a two- or three-syllable command without losing the cursor position. That is convenient in the frequent situations where some little error has to be corrected in the vicinity of the cursor. The video also shows how ShortTalk commands can be issued one at a time or as a quick series of utterances interspersed with dictation. Stringing commands together in spurts is inherent to how the human mind works—but impossible to do in commercial systems without the forced use of highly unnatural pauses between individual commands.

Even for the simple editing and punctuation we employ in this demo, we would have spent almost three times the vocal effort had we done it with, say, NaturallySpeaking's command language.

Acknowledgments: Brian Roark gratuitously donated the use of his voice for these demos.

Experimental Evidence

How does it work in practice? How frequent are ShortTalk commands? How many of them are used?

There is only one way of answering: record all activity of a user over some period of time. Such a transcript is available in a large file (5.3Mb), which shows my own activity over a period. Dictated text has been replaced by “x”s for reasons of privacy.

The transcript reveals that, on average, one ShortTalk command is issued per two words dictated, and that approximately 500 different commands are used (out of about 76,000 commands issued). [Here, we have not counted the use of spelling using alpha-bravo words as commands; also, different search strings are not considered important.] If this intensive use of commands is typical, then it provides a striking explanation for the inefficiency of current dictation systems: since on average a ShortTalk command corresponds to several words in a natural language system, a dictation system user would mainly be speaking commands if the same effects were to be obtained. In contrast, a ShortTalk command is on average less than two syllables, which is the effort added per two words of dictation. These numbers support our claim, we believe, that the clumsiness and vagueness of natural language make it a markedly bad choice as a carrier of editing intentions from human to machine.
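
A back-of-envelope reading of these numbers, in the spirit of the “entropy of command languages” mentioned in the keywords (the uniform-distribution assumption is ours, so the per-command figure is only an upper bound):

    import math

    distinct_commands = 500   # distinct commands observed in the transcript
    avg_syllables = 2         # "less than two syllables" per command

    bits_per_command = math.log2(distinct_commands)       # about 9 bits
    bits_per_syllable = bits_per_command / avg_syllables  # about 4.5 bits

    print(round(bits_per_command, 1), "bits per command")
    print(round(bits_per_syllable, 1), "bits per syllable")
    # A natural-language phrasing of the same command typically needs
    # several words (often five or more syllables), so its per-syllable
    # rate is less than half of this.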

Finally, we mention that our logs show that a substantial number of keystrokes, almost all spread over some 20 keys, are necessary to accomplish repetitive tasks efficiently. In my case, I am using a foot keyboard, which is part of a foot rest. A compelling alternative—an obvious one—is a reduced, low-impact keyboard designed to complement speech recognition. It would consist of large, nicely separated control keys for use by hand. The same idea applies to the tablet PC.

Related Work

The design and the efficiency of editors were intensely studied in the '70s. For example, the classic work “The Psychology of Human-Computer Interaction” by Card, Moran, and Newell (1983) focuses mostly on editing. It provides a wealth of data relating to human cognition and motor ability, along with rigorous models of cognitive tasks. The GOMS model in particular is interesting since it ties operational efficiency to the tactical, cognitive processes necessary to express keyboard commands.

Curiously, there is apparently no work related to speech recognition that follows up on this work. Research in dictation systems has focused on usability for simple dictation tasks, such as in medical applications, and the error correction problem—a fundamental issue that commercial systems have no good solutions for. Such questions about basic usability are very important, but they are orthogonal to the aims of ShortTalk.

There is a great amount of work in the area of using natural language for commanding small devices, like PDAs. This work is of little relevance to editing, since it assumes relatively simple tasks that are carried out after little or no training. Editing is a professional activity and a process so complicated that it requires substantial training. Thus our approach does not contradict established research in speech user interfaces. Even so, the simplicity of spoken interfaces for operating common appliances is sometimes assumed to be obviously generalizable to operating computers through the use of natural language. There is no published indication, to the knowledge of the author, that such a paradigm is, or ever will be, workable.

There is overwhelming and undeniable evidence that the opposite paradigm holds if we look at our ability to produce symbols not necessarily tied to language. Humans are served well by concise, highly abstract notations based on a universal, small vocabulary of signs. In fact, hundreds of millions of trained users of personal computers are conditioned to the spontaneous and unconscious issuing of sequences of such signs. For example, when the user comes upon the thought that “data base” should have been “database”, the sequence of signs for “go back over the word”, “delete the character before the cursor”, and “go forward over the next word” is expressed immediately and seemingly without further conscious thinking—through finger movements.

The hypothesis that natural language—by virtue of being natural—is immediately natural to use is untenable according to the study of Karl, Pettey, and Shneiderman, “Speech Activated versus Mouse-Activated Commands for Word Processing Applications: An Empirical Evaluation.” They showed that when users are asked to use a simple set of natural language commands such as “page up” instead of using the corresponding keys, task performance may be severely affected for non-trivial editing situations. Although the authors note that mouse activation of commands is slower than speech activation according to their empirical results, they draw the somewhat surprising hypothesis that speech itself interferes with thinking: the use of speech for commands adversely affects short-term memory.

We believe that there is a much simpler explanation: the lack of training with the small vocabulary of voice commands is the source of the cognitive load. The situation would have been no different, we hypothesize, had the keys of the keyboard been rearranged. A more compelling conclusion from these results is that the naturalness of natural language is an illusion: it becomes natural to use the two words “page up” for the specific effect of moving the page on the screen upwards only after an amount of training that makes their use a trained reflex. Thus, the words “page up”, when spoken to a computer, constitute a new meaning, which is not initially wired into the brain. The fallacy is that the usual understanding of the naturalness of “page up” confuses the obvious mnemonic quality of these two words with whether they, with their specific contextual meaning, are “wired” to be used spontaneously.

It appears that the editing problem has been seriously addressed only by the people who are most likely to find a solution for it: injured programmers or computer scientists, who have both the motivation and the skills to work on a solution.

The essay “Speech Command and Control” approaches user interface design from a philosophical standpoint very similar to our own. The writer, Kim Patch, is a reporter and editor who has used dictation system technology for almost ten years. She argues that the key to success is the steady assimilation of a specialized jargon. Moreover, it is an illusion to think that just because humans are good at spoken language, the use of speech recognition is an immediately natural activity. But since the human repertoire of words is much greater than that of keyboard symbols, speech should in theory be much better suited than a keyboard to the task of editing.

Kim Patch has published her macros, which probably constitute the most comprehensive published collection of spoken commands for general computer use. The commands have evolved over several years, and constitute a succinct, constructed language that identifies important editing concepts, many of which have no immediate analogy in natural language.

There are many similarities between ShortTalk constructs and Kim Patch's language. For example, “Another 3 Up 2” combines “Another Line” with three return keys and two up arrows. Here, “Another Line” is a command that inserts a newline character after the end of the current line; replacing “Line” with “3” makes the command insert three newline characters instead. The command “Up 2” moves the cursor up two lines. The use of “Another Line” illustrates a fundamental trait shared with ShortTalk: Kim Patch has evolved an ontology of editing concepts that is far more sophisticated than that found in any commercially available product. (ShortTalk does not directly represent the concept of “Another Line”; the equivalent phrase would be “ex line loon”, which combines the atomic concept of moving the cursor to the end of the line with the concept of inserting a newline.)

Where ShortTalk and Kim Patch's language differ fundamentally is in their approach to disambiguation. By construction, ShortTalk commands may be embedded continuously within dictation. In contrast, the language of Kim Patch may be used only with pauses before and after commands, but as the previous composite example shows, it is by design sometimes not necessary to insert pauses between commands.

For the application of program editing, Alain Desilets has made a voice-controlled system called VoiceGrip (“VoiceGrip: a tool for programming-by-voice,” International Journal of Speech Technology, 4, 2001). VoiceGrip makes it possible to enter common identifiers such as “SysPtr” by saying “system pointer”, a very useful technique that would complement ShortTalk.

Alain Desilets also proposes the use of natural language for entering program code, especially for idiomatic constructs. It is our belief that the cognitive overhead of remembering such a system of specialized natural language may not be in a reasonable relationship to the frequency of the situations that can be tackled this way. Moreover, the approach leaves unsolved the issue of handling the myriads of little editing situations that occur in practice, such as those we mentioned in the audio demos. In some sense, this approach based on natural language understanding is the diametric opposite of the principle we advocate: that humans, through their intellectual superiority, are better served by a precise, well-defined, universal, but entirely combinatorial, tool. In addition, it remains an open question whether a natural language approach, even if made as universal as ShortTalk, would in fact entail less training: somehow the user must learn the concepts, syntax, and limits of the language—something that may be much more difficult given the variability and vagueness of natural language. The VoiceGrip project is continuing as the VoiceCode Programming by Voice Toolbox.

A related project is the Emacs Voice Commander by Hans van Dam. It proposes to formulate a spoken equivalent of every command for the GNU Emacs text editor, a tool used by many professional programmers. Emacs possesses a large number of commands, but they are not constructed to follow strict principles of orthogonality. The reason for this is quite simple: there is no easy way to map an orthogonal command language to keys unless two or three keystrokes are used for each command. Moreover, these keystrokes would probably have to involve modifier keys. In practice, it is much more important that common functions are mapped to simple key combinations. For that reason, there was never an incentive to construct an editing language as systematic as ShortTalk. For example, there is no command in Emacs that positions the cursor at the first following character that is not a white space. In ShortTalk, this is expressed as “ex stretch”, where “ex” means “go to the end of” and “stretch” means a stretch of white space.
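
For concreteness, here is a toy model of the “ex stretch” behavior just described, written against a plain string rather than an Emacs buffer (our illustration, not EmacsListen code):

    def ex_stretch(text, cursor):
        """Return the index of the first non-whitespace character at or
        after the cursor, i.e. the end of a stretch of white space."""
        while cursor < len(text) and text[cursor].isspace():
            cursor += 1
        return cursor

    buffer = "foo   bar"
    print(ex_stretch(buffer, 3))  # 6, the index of 'b' in 'bar'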

In an attempt to enhance general speech user interfaces, the Universal Speech Interface project proposes that users must be taught a limited set of skills, namely the use of a handful of keywords and interaction patterns, to use speech recognition across a range of applications. Such a framework provides an analog of the “look-and-feel” of graphic user interfaces. Thus, the lack of natural language helps a user understand the limits of the system, while providing efficient spoken query mechanisms. Additionally, the response of the system will be fast and reliable, since less intelligence is required on the part of the computer. The Universal Speech Interface does not use constructed words, since there is no issue of distinguishing dictation from commands. The project also differs from our approach in that the training involved is minimal compared to the effort of learning to manipulate a text editing system.

Conclusion

Our work shows that editing by voice can be made significantly more efficient than is possible with current dictation systems. In fact, analyses of editing situations and empirical measurements indicate that editing by speech has the potential to beat keyboard and mouse in efficiency.

Our assumption that editing by speech demands a substantial learning effort is contrary to conventional wisdom about the role of speech recognition. Editing is so complicated that, in our opinion, there is no innately natural user interface for it. The rational approach is to let efficiency, the amount of editing information that can be transmitted per second, drive the development of a spoken interface. For the user, efficiency is the strongest motivation for learning the complex tool that any unfamiliar command language is. And we have argued that natural language, being verbose, ambiguous, and impoverished for the task, may be a poor underpinning for such a tool (even if it could be understood by a very intelligent machine).

Our perspective and results demonstrate that the natural match between human and machine may be the one that recognizes the superiority of the human mind over the computational capabilities of machines. Consequently, the potential of speech recognition is dramatically amplified by abandoning the use of natural language for commands. (A statement that does not in any way contradict the importance or usefulness of natural language understanding for help systems and for interactive applications.)

The design of the keyboard in the 19th century was not derailed by false analogies with “natural” human activities. But speech recognition for editing may have been fundamentally misunderstood thanks to the tantalizing, but for this purpose fruitless, idea that computers may understand human language. Our perspective also brings to the fore some general issues about human cognition and linguistic performance, several of which are addressed in the FAQ below.

FAQ

Is ShortTalk easy to learn?

There are no studies of the acquisition of command languages sophisticated enough to replace the keyboard. ShortTalk rewards the beginner through its superior efficiency—several times that of the natural language technology of commercially available systems. Most command names are whimsical and easy to remember. Our hypothesis is that the strong reinforcement provided by ShortTalk makes it easier for the brain to adapt to the communicative significance of the command language syntax. In other words, the alternative of using natural language to carry the chosen editing concepts, with the ensuing syntactic verbosity, blandness of command phrases, and mode confusion, likely results in a less learnable command language.

Will ShortTalk be valuable for input on tablets?

Yes. With ShortTalk, a few keys will still be necessary for some repetitive tasks that cannot be accomplished effectively by speech recognition. But in the main, editing is very efficiently accomplished by speech alone, complemented with some pointing.

I always felt that human-computer interfaces were about adapting the computer to the way the human works, not the other way around. So your approach, which rejects natural language, must be misguided?

This argument is instinctively put forward by many people, including researchers in human-computer interaction. Indeed, it is a valid one for many applications. But applied to the activity of editing, the argument speciously assumes that our language is inherently so meaningful that it is an effective substitute for skills acquired by adaptation. Sadly, there are no such miracles, and natural language may in fact be a barrier to skill development because of its inefficiency and vagueness. The use of the keyboard requires extensive training—and this adaptation is unavoidable. There is no inherently “natural” keyboard design requiring no training, just as there is no natural way of creating alphabets and writing systems. Why would the complex task of editing not require significant human adaptation whatever the means of communication, be it typing Control-C for “copy current selection into clipboard”, or equivalently saying “<pause> copy that <pause>” (natural language technology), or saying “copy tat” with no pauses (ShortTalk)?

That natural language would be well suited to the task of editing is understandable, but wishful, thinking—given the verbosity and vagueness of commands in natural language. And by the way, how effective is natural language when you sit next to somebody who is editing text on a computer? Do you get your editing suggestions across succinctly, fluently, and precisely? Or do you stumble for words, say “no, no, not there”, gesture, and point?

ShortTalk is in fact aimed precisely at the way humans like to work: with as little effort as possible!

Still, ShortTalk sounds weird; there must be easier ways to edit by speech?

I strongly believe that any usable command language must be constructed according to the principles of ShortTalk. Such a language is characterized by ease in the following sense: it solves almost any editing situation in very few words. A more verbose language would be ineffective, and most users would resist learning an ineffective tool. And, yes, ShortTalk sounds weird. But it should; otherwise it would not solve the mode problem (separating dictation from commands). ShortTalk allows the user to fluently mix dictation and commands—commercial systems with their natural language approach do not.

Does ShortTalk rely on the use of writing macros?

Professional-grade dictation systems offer programming facilities known as macros. The ShortTalk philosophy is to offer a complete solution from the outset, where the user is not forced to develop patches for an inherently insufficient command and control system. However, whenever an editing situation calls for the repetition of a sequence of commands, ShortTalk allows for easy recording and playback. (EmacsListen itself offers a context-free grammar format that allows s-expressions to be bound to syntactic categories of the command grammar.)

Is ShortTalk available?

Yes. Carnegie Mellon University has accepted a donation from AT&T Labs, which comprises ShortTalk and the EmacsListen prototype. However, the current implementation only works with GNU Emacs, a text editor for professional programmers and other professionals.

Could ShortTalk be connected to other speech engines?

Yes, that should be relatively straightforward.

Why did it take six years to develop ShortTalk?

It was not obvious to me that using speech recognition for editing was even a feasible task. In fact, I believed the opposite for the first four years. I did not know that editing by voice could become a fluent, automatic activity once a systematic conceptual framework had been formulated.

ShortTalk is a renegade approach that ignores established research in speech user interfaces. Doesn't it deserve universal rejection and condemnation?

Virtually all research in spoken computer interfaces concerns non-expert applications: call processing, dialogue systems, and multimodal interfaces for portable devices. The use of natural language is essential in these areas (although divergent views have been proposed such as the Universal Speech Interface, promoted by Roni Rosenfeld and his collaborators at CMU). ShortTalk addresses an entirely different scenario and is therefore not at odds with most established research. Editing is a complex domain that innately requires considerable skill and training. Our philosophy and results probably have no bearing on traditional speech user interfaces, and vice versa.

The idea of using syllables to encode concepts is a weak one. Are single syllables not more difficult to recognize than polysyllabic words?

There are between 15,000 and 30,000 different syllables in English. By using unusual syllables, or even foreign syllables, that are phonetically distinct from common ones, superior accuracy can be achieved. Not only are words like “sorch” for “search” easily distinguished from real words; they are also easy to remember, as any four-year-old knows from listening to the enticingly strange but meaningful universe of Dr. Seuss.

Speech recognition in the office will never make it because of privacy concerns.

This is a real issue. Standard cubicle environments are not conducive to the use of speech recognition for dictation of sensitive documents. For people affected by cumulative trauma disorders, the employer should, in my opinion, be obliged to offer a private office or a better insulated cubicle.

Interestingly, the use of ShortTalk itself presents much less of a problem: very little information about the document is revealed through the spoken commands. And, since much keyboard work, such as programming, mostly involves editing and repetitive tasks, the use of ShortTalk may still be a significant part of reducing the strain of using a computer.

Talking to your computer all day will ruin your voice?

The use of dictation systems has been associated with voice strain according to anecdotal evidence. Early dictation system users complained about the strain of the disjointed speech that resulted from the need to separate each word from the next by a small pause. The informal consensus seems to be that modern systems, which transcribe continuous speech, are less stressful. For the command and control part, modern dictation systems still require pauses, a deficiency that ShortTalk has solved. A CNN article, “Is voice recognition dangerous for your health?”, discusses the problem.

Acknowledgments

Ten years of mostly frustrated, frivolous, and fruitless experimentation with alternative input technology led to the design of ShortTalk and other input technologies of mine. I am extraordinarily grateful to my employers, the University of Aarhus, Denmark, and AT&T Labs, for having supported my special needs throughout the ten years I have spent recovering from a typing injury. Indeed, these needs went well beyond the conventional trials of mounds of “ergonomic” input devices; they included secretarial typing assistance, expensive early dictation software, and, most of all, time to write the software that ultimately fulfilled the extraordinary potential of speech recognition technology.

In particular, I am grateful to Julia Hirshberg and my boss Michael Merritt for strongly supporting the development of an improved user interface for dictation systems around 1996, not long after I joined AT&T. Erik Ostrom expertly programmed most of this first interface in Emacs Lisp; I used it for over three years. This was the first version of EmacsListen. Around the same time, Thomas Rene Nielsen published his demacs macros for Emacs, which helped promote some of the ideas that emerged in discussions we had at Aarhus.

In 2000, I attended the VoiceCode design meeting arranged by Alain Desilets, Jonathan Epstein, and Eric S. Johansson. This event was most inspiring, because it clearly demonstrated the enormous gulf between the capabilities of commercial dictation systems and the needs of professional users. Around the same time, Barry Jaspan published his VR-mode, which enables NaturallySpeaking, the continuous dictation system, to communicate with Emacs. (Barry has generously allowed VR-mode to be disseminated under a BSD-type license, which has enabled it to be included in the EmacsListen distribution.)

Thus energized, and further spurred on by David Jeschke, I embarked on a minor rewrite of Erik's code that would enable me to use EmacsListen with continuous dictation thanks to Barry's excellent code.

As I started writing the new code, I got sucked into a vortex of feature creep and perfection that led to substantial revisions of the ontology, functionality, phonology, and grammar of the editing language that I now call ShortTalk. Over the next couple of years, I plan on gradually admitting to my boss how distracting this work was.

Update, September 2004: I owe thanks to Joe Sommer, Mehryar Mohri, and Alex Rudnicky for their patient efforts in enabling the software to find a home. I should also thank the many people who have written to me and encouraged me to release the software.


By Nils Klarlund.
Copyright © 2004 Carnegie Mellon University.