(1) Telepresence Project
Department of Computer Science
University of Toronto
Toronto, Ontario
Canada M5S 1A4
{anujg, jer, willy}@dgp.toronto.edu
+1-416-978-0703
(2) Hitachi Research Laboratory
Hitachi Limited
7-1-1, Omika-cho
Hitachi, Ibaraki, Japan
tanikosi@hrl.hitachi.co.jp
+81-294-52-7524
(3) Alias|Wavefront
110 Richmond Street
Toronto, Ontario
Canada M5C 1P1
bbuxton@aw.sgi.com
+1-416-362-8558
To alleviate this problem, we developed the Audio/Video Server Attendant (AVSA), which provides remote attendees with control over our media space, using speech as input.
Hence, if the appliance is a television, the applications discussed are video on demand (VOD) or home shopping. If the appliance is a computer, the applications are things like e-mail, hypertext, and the World Wide Web. And if a telephone is the appliance? It is mostly taken for granted and seldom enters into the conversation. Few, if any, general models exist that rise above the individual appliances and that both contain and stimulate the discourse.
Our frustration with this situation is fueled by our collective experience over the past eight years of actually living with these technologies, first at Rank Xerox EuroPARC [2], and more recently within the context of the Ontario Telepresence Project at the University of Toronto [9]. This experience has had two main effects. First, it has convinced us that there really are powerful and valuable applications that can emerge from this convergence, and second, that some of the most interesting applications are not obvious and will not emerge from the prevalent superficial investigation of potential. One successful result is Xerox's Portholes application [5] and its successor developed at the University of Toronto [1][8].
What we have developed could be called a "video server." But it is not a video server in the sense normally meant. For example, it is not designed to provide a virtual video store on the end of a wire. Rather, it is a voice-activated server that facilitates browsing through electronic hyperdocuments, supports human-human communication, and supports video (actually demo) on demand. As shown in Figure 1, these are reminiscent of the computer-based World Wide Web, the telephone, and the television, respectively. Yet together, they present something quite unlike anything in the current mind-set; something that helps us stretch in our thinking about these issues. We call the server the Audio/Video Server Attendant (AVSA).
Our hope is that the work described not only provides examples of a new class of application and of how media can be combined, but also demonstrates an approach to unveiling other such applications.
Previously, when a codec called an iiif site like the Toronto Telepresence Project, what the caller got -- at best -- was a connection to the server and a view out of the window camera. With the AVSA, the intent is to present callers with a graphical menu of options, overlaid on the view from the window camera. The AVSA incorporates a simple speech recognition system that enables incoming callers to navigate through the menu options. The services available ideally encompass connecting to individuals (including the physical receptionist), messaging services, and on-line services such as demo-on-demand or product/service information.
Once the connection has been established, the AVSA can then be used to mediate transactions during the conversation. For example, if I begin our conversation looking at you, but subsequently want to look at a document on your desk (real or virtual), the same technology that enabled me to connect to you in the first place also supports this gaze redirection.
We now sketch out the main parts of the system by way of overview, then present the technology and experience in more detail.
Reliability, however, may be hampered by inconsistent background noise, variability in speakers' utterances, and room acoustics. This problem can be reduced significantly by limiting the size of the vocabulary. With such a restricted vocabulary, we can obtain reliable speaker-independent interaction, thereby providing universal access to the system. The vocabulary we have chosen consists of the digits "1" through "5" for variable menu selections, and the words "disconnect," "page up," "page down," "show menu," "previous menu," and "hide" for the generic actions.
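To make this concrete, the following minimal sketch (ours, for illustration; it is not the AVSA's actual code) shows how such a restricted vocabulary might be dispatched. The action tokens are hypothetical; the point is that anything outside the fixed vocabulary is simply rejected, which is what keeps recognition speaker-independent:

    # Minimal sketch of dispatching the AVSA's restricted vocabulary.
    # Action tokens are hypothetical; out-of-vocabulary input is ignored.
    DIGITS = {"1": 1, "2": 2, "3": 3, "4": 4, "5": 5}
    GENERIC = {"disconnect", "page up", "page down",
               "show menu", "previous menu", "hide"}

    def dispatch(utterance):
        word = utterance.strip().lower()
        if word in DIGITS:
            return ("select", DIGITS[word])   # variable menu selection
        if word in GENERIC:
            return ("generic", word)          # generic action
        return ("rejected", word)             # not in the vocabulary

    assert dispatch("Page Up") == ("generic", "page up")
    assert dispatch("3") == ("select", 3)
    assert dispatch("hello") == ("rejected", "hello")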
While speech serves as a natural input mechanism from human to computer, there are two problems with respect to its use for output from computer to human (i.e., to provide a menu of services to visitors): spoken menus are presented serially, so stepping through the options takes time, and they are transient, so visitors must hold the options in memory. Our solution is to present the menu visually, overlaid on the outgoing video channel.
The effect of video overlay is much like credits in a movie. Menus too large to fit on a single screen are distributed over multiple pages, with next/previous page selections available as menu items.
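A sketch of this pagination logic follows; the page size of five matches the digit vocabulary but is our assumption rather than a documented value:

    # Sketch of overlay menu pagination; PAGE_SIZE of five matches the
    # digit vocabulary but is assumed, not documented.
    PAGE_SIZE = 5

    class PagedMenu:
        def __init__(self, options):
            self.pages = [options[i:i + PAGE_SIZE]
                          for i in range(0, len(options), PAGE_SIZE)]
            self.page = 0

        def page_down(self):
            self.page = min(self.page + 1, len(self.pages) - 1)

        def page_up(self):
            self.page = max(self.page - 1, 0)

        def render(self):
            # Lines to be overlaid on the outgoing video, like film credits.
            return ["%d. %s" % (i + 1, opt)
                    for i, opt in enumerate(self.pages[self.page])]

    menu = PagedMenu(["call a person", "view demos", "leave a message",
                      "change seat", "control VCR", "site information"])
    menu.page_down()
    print(menu.render())   # -> ['1. site information']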
Furthermore, with video, we can overcome the shortcomings of time and memory mentioned above: the entire menu is displayed simultaneously, so options can be identified at a glance, and it remains continuously visible, so nothing need be held in memory (Figure 2).
The AVSA is a PC-based application that lives at the called site (the University of Toronto). As such, it talks to the main iiif server just like any other client. That is to say, it is a free-standing, independent module, technically consistent with the main desktop clients.
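The iiif wire protocol is not described in this paper; the following sketch assumes a simple line-oriented text exchange, purely to illustrate the AVSA's position as one more free-standing client of the server:

    # Sketch only: the real iiif protocol, host, and port are not
    # documented here, so a line-oriented text exchange is assumed.
    import socket

    class IiifClient:
        """The AVSA connects to the iiif server like any other client."""
        def __init__(self, host, port):
            self.sock = socket.create_connection((host, port))
            self.reply = self.sock.makefile("r")

        def request(self, command):
            self.sock.sendall((command + "\n").encode("ascii"))
            return self.reply.readline().strip()   # one-line reply

    # avsa = IiifClient("iiif.dgp.toronto.edu", 9000)  # hypothetical address
    # avsa.request("connect front-camera")             # hypothetical command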
Initially, the AVSA provided an electronic seat changing service, offering remote attendees the ability to move between the front and back of the room, as dictated by their social roles in a videoconference [6]. This served as a proof of concept for the system and allowed us to obtain preliminary user feedback. As expected, we found that the presentation of menu options through video was highly effective.
More functionality and generality were added as the system progressed. The next stage introduced the ability to contact different nodes of the local media space, thereby solving the problem of contacting me at my desk. With respect to human-human communication, our system was now delivering the capabilities of the telephone. The most recent stage of the AVSA added the ability for remote visitors to obtain information about our media space and view video demos on demand (DOD). This component represents the television aspect of our system. Figure 5 depicts the current status of the system.
Some functions, such as connecting to a particular node, are relatively time-consuming. For these, users wanted some means by which they could interrupt or cancel a command, especially one invoked inadvertently. Furthermore, some functions, such as VCR recording, have potentially destructive consequences. Many users wanted a confirmation process for these actions. We are presently experimenting with both of these features.
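One way to sketch the confirmation behaviour under experiment is as a small state machine; the command name "record" and the confirming word "yes" are our assumptions, not documented AVSA vocabulary:

    # Sketch of confirming destructive commands before execution.
    DESTRUCTIVE = {"record"}   # assumed example of a destructive command

    class Confirmer:
        def __init__(self):
            self.pending = None

        def handle(self, word):
            if self.pending is not None:
                cmd, self.pending = self.pending, None
                # Anything other than an explicit "yes" cancels the command.
                return ("execute" if word == "yes" else "cancelled", cmd)
            if word in DESTRUCTIVE:
                self.pending = word
                return ("confirm?", word)
            return ("execute", word)

    c = Confirmer()
    assert c.handle("record") == ("confirm?", "record")
    assert c.handle("yes") == ("execute", "record")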
One solution is to periodically display a message indicating the required command. However, this would require forgetful users to wait until the message appears before interacting with the media space. Alternatively, we could display the message in response to a distress utterance such as "help." This type of solution is consistent with a reactive environment [4]. Another solution is to display the message until the command has been used a certain number of times, after which we assume that the user has learnt it. We expect that some combination of these solutions is desirable.
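A sketch combining the last two policies follows; the threshold of three uses is an assumption, not a measured value:

    # Sketch of the reminder policies: respond to "help", and keep the
    # reminder on screen until the command has been used LEARNED_AFTER
    # times. The threshold is assumed.
    LEARNED_AFTER = 3

    class HelpPolicy:
        def __init__(self, command):
            self.command = command
            self.uses = 0

        def overlay(self, utterance):
            if utterance == self.command:
                self.uses += 1
            if utterance == "help" or self.uses < LEARNED_AFTER:
                return 'Say "%s"' % self.command   # reminder text to overlay
            return None                            # assume the command is learnt

    p = HelpPolicy("show menu")
    for _ in range(3):
        p.overlay("show menu")
    assert p.overlay("page up") is None
    assert p.overlay("help") == 'Say "show menu"'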
We are investigating several possible solutions to these problems. One is to section off an area of the screen that displays commands in a translucent or outlined box. Another is to display the menu in the center of the screen in a larger font, so that visitors know they are talking to the AVSA. Our current method assumes that when the menu is displayed, visitors are talking to the AVSA, and otherwise, to the media space.
A related issue is the disruption caused to meetings while the remote attendee issues audio commands to the system. One simple way to minimize this is for the system to temporarily mute the audio channel after the invocation of any menu option. This way, the AVSA can continue to receive speech commands without their disturbing other attendees. We are also considering the use of gesture as a visual clutch that would turn audio on or off as appropriate.
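A sketch of the temporary mute follows; the two-second duration and the channel interface are assumptions made for illustration:

    # Sketch of muting the outgoing audio briefly after a menu option is
    # invoked; MUTE_SECONDS and the AudioChannel interface are assumed.
    import threading

    MUTE_SECONDS = 2.0

    class AudioChannel:               # stand-in for the media-space channel
        def mute(self): print("audio muted")
        def unmute(self): print("audio restored")

    def on_menu_invocation(channel):
        channel.mute()
        # Speech commands still reach the AVSA; only the conference
        # audio is silenced until the timer restores it.
        threading.Timer(MUTE_SECONDS, channel.unmute).start()

    on_menu_invocation(AudioChannel())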
Users also commented that they would like a visual distinction between generic and variable options. For example, generic options could be displayed in a small capitalized font, since they will be remembered from previous use, whereas variable options could be displayed in a larger font with brighter color.
The first two concerns are addressed by video feedback. The speech recognition software indicates its interpretation attempts, and the AVSA displays an appropriate text message whenever a valid command is recognized. As for the third level of feedback, most commands produce readily perceptible changes in either the audio or the video channel as they are executed. For those commands that cannot otherwise be monitored by the visitor, such as rewinding a video tape, we plan to report the progress of their execution whenever possible.
These microphones complicate the speech recognition task, as they are as sensitive to background noise as they are to the human voice. We could individually fine-tune the audio component of the AVSA to a multitude of environments, but this would require our system to undergo manual environment training, making it environmentally dependent. As an alternative to combating these problems, we are investigating the use of audio filtering techniques that will automatically adapt to an environment.
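The paper does not name a specific technique; one plausible sketch is an exponentially tracked noise floor used as a speech gate, where the technique and the alpha and margin parameters are all our assumptions:

    # Sketch of one adaptive filtering idea: track the background energy
    # with an exponential average and pass only frames well above it.
    def make_gate(alpha=0.05, margin=4.0):
        floor = [None]                      # estimated background energy

        def is_speech(frame_energy):
            if floor[0] is None:
                floor[0] = frame_energy     # seed from the first frame
                return False
            if frame_energy < margin * floor[0]:
                # Quiet frame: adapt the background estimate.
                floor[0] = (1 - alpha) * floor[0] + alpha * frame_energy
                return False
            return True                     # loud enough to be speech

        return is_speech

    gate = make_gate()
    energies = [1.0, 1.1, 0.9, 1.0, 9.0, 1.0]   # toy frame energies
    print([gate(e) for e in energies])          # -> only 9.0 passes as speech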
One such direction is the addition of a video mail function. This would permit remote attendees to leave A/V messages for people in the local media space who are either busy or unavailable.
The AVSA currently allows visitors to control services, such as changing seats and controlling the VCR, within a videoconference room. We are adding a new dimension of control whereby any node can provide services to a remote visitor while connected. These services can be added and deleted easily as attributes of the local node.
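A sketch of services as mutable attributes of a node follows; the service names and handlers are illustrative only:

    # Sketch of per-node services as a mutable registry.
    class Node:
        def __init__(self, name):
            self.name = name
            self.services = {}

        def add_service(self, label, handler):
            self.services[label] = handler

        def delete_service(self, label):
            self.services.pop(label, None)

        def menu(self):
            # What the AVSA would overlay for a connected visitor.
            return sorted(self.services)

    room = Node("videoconference room")
    room.add_service("change seat", lambda: "seat changed")
    room.add_service("control VCR", lambda: "VCR panel shown")
    print(room.menu())   # -> ['change seat', 'control VCR']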
So far, we have only used audio as an input mechanism for remote control of a media space. As noted earlier, another input medium, video, is available. Consequently, we will be investigating the use of gesture as a natural input mechanism to allow further control of devices. The head-tracking system [3], which uses the visitor's head position to control a motorized camera in our media space, is a working example of the use of gesture in this manner.
We are also augmenting our head-tracking system to use speech as a means of labelling various camera positions ("hot-spots") and returning to them later with a simple, user-specified command.
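A sketch of such speech-labelled hot-spots follows; the Camera class is a stand-in for the real motorized camera, and the pan/tilt representation is assumed:

    # Sketch of labelling camera positions by voice and returning to them.
    class Camera:                      # stand-in for the motorized camera
        def __init__(self):
            self.pan_tilt = (0.0, 0.0)

        def position(self):
            return self.pan_tilt

        def move_to(self, pan_tilt):
            self.pan_tilt = pan_tilt

    class HotSpots:
        def __init__(self, camera):
            self.camera = camera
            self.spots = {}

        def label(self, name):         # e.g. after "label whiteboard"
            self.spots[name] = self.camera.position()

        def goto(self, name):          # e.g. after "go to whiteboard"
            if name in self.spots:
                self.camera.move_to(self.spots[name])

    cam = Camera()
    spots = HotSpots(cam)
    cam.move_to((30.0, -5.0))
    spots.label("whiteboard")
    cam.move_to((0.0, 0.0))
    spots.goto("whiteboard")
    assert cam.position() == (30.0, -5.0)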
The AVSA represents a rare example of integrating services typically seen on only one of the computer-centric, telephone-centric, or television-centric appliances. From the computer world of the WWW, we see navigable and retrievable hyperdocuments. From the telephone world, we see the ability to support synchronous communication (conversations/phone calls), as well as automated attendants. From the TV world, we see interactive video and video on demand -- in a form that goes well beyond watching "Top Gun" from home whenever you want.
Finally, we have a concrete demonstration of a practical commercial application for a small-scale video server for business: a server that can be delivered with today's technologies, and grow with new emerging technologies and services.
In the long run, it is perhaps this last point that is most important. It seems that much, or even most, of the development of the Information Highway is following the traditional MIS Big Bang approach. That is, customers are being told by the technology providers, "We are going to give you this really great system, all at once, someday soon. Trust me." Our position is that this approach has not worked in the past, and cannot work in the future.
What we believe is required, and what the work described in this paper demonstrates, is that another approach has far more promise, and is far less expensive. This is an approach that involves iterative human-centered design. For example, the development and testing of the AVSA, grounded as it was in an applied context, has provided a range of insights into the architecture of small-scale video servers; insights that would never emerge in developing a full-blown ATM super-server that can distribute 500 movies simultaneously over a network that might someday exist.
We believe that the human potential of these new technologies is immense. We are concerned, therefore, to see that this potential is met. Our hope is that the work described in this paper makes some small contribution to its realization.
This research has resulted from the Ontario Telepresence Project. Support has come from the Government of Ontario, the Information Technology Research Centre of Ontario, the Telecommunications Research Institute of Ontario, the Natural Sciences and Engineering Research Council of Canada, British Telecom, Xerox PARC, Bell Canada, Alias|Wavefront, Sun Microsystems, Hewlett Packard, Hitachi Corp., the Arnott Design Group and Adcom Electronics. This support is gratefully acknowledged.
2. Buxton, W. and Moran, T., EuroPARC's Integrated Interactive Intermedia Facility (iiif): Early Experience. In S. Gibbs & A.A. Verrijn-Stuart (Eds.), Multi-user Interfaces and Applications, Proceedings of the IFIP WG 8.4 Conference on Multi-user Interfaces and Applications, Heraklion, Crete. Amsterdam: Elsevier Science Publishers B.V. (North-Holland), pages 11-34, 1990.
3. Cooperstock, J., Tanikoshi, K., and Buxton, W., Turning Your Video Monitor into a Virtual Window. Proc. of IEEE PACRIM, Pacific Rim Conference on Communications, Computers, Visualization and Signal Processing 1995, Victoria, May 1995.
4. Cooperstock, J., Tanikoshi, K., Beirne, G., Narine, T., and Buxton, W., Evolution of a Reactive Environment. Proceedings of Human Factors in Computing Systems 1995 (CHI'95), (Denver, May 7-11), ACM Press. pp. 170-177.
5. Dourish, P., and Bly, S., Portholes: Supporting Awareness in a Distributed Work Group. Proceedings of Human Factors in Computing Systems 1992 (CHI'92), (Monterey, California), pp. 541-547.
6. Gujar, A., Daya, S., Cooperstock, J., Tanikoshi, K., and Buxton, W., Talking Your Way Around a Conference: A Speech Interface for Remote Equipment Control. CASCON'95 CD-ROM Proceedings, Toronto, Ontario, Canada.
7. Mantei, M., Baecker, R., Sellen, A., Buxton, W., Milligan, T., and Wellman, B., Experiences in the Use of a Media Space. Proceedings of CHI'91, ACM Conference on Human Factors in Computing Systems, pages 203-208. Reprinted in D. Marca & G. Bock (Eds.), 1992, Groupware: Software for Computer-Supported Collaborative Work. Los Alamitos, CA: IEEE Computer Society Press, pages 372-377.
8. Narine, T., Leganchuk, A., Mantei, M., and Buxton, W., Collaboration Awareness and its Use to Consolidate a Disperse Group. Paper submitted for publication in Proceedings of Human Factors in Computing Systems 1996 (CHI'96).
9. Riesenbach, R., The Ontario Telepresence Project. Human Factors in Computing Systems 1994 Conference Companion (CHI'94), (Boston, Massachusetts), pp. 173-174.
10. Weiser, M. (1993). Some Computer Science Issues in Ubiquitous Computing. Communications of the ACM, 36(7), 75-83.
11. Yamaashi, K., Cooperstock, J., Narine, T., and Buxton, W., Beating the Limitations of Camera-Monitor Mediated Telepresence with Extra Eyes. Paper submitted for publication in Proceedings of Human Factors in Computing Systems 1996 (CHI'96).
Figure 1: Combining technology metaphors. The menu (computer) seen by a visitor to the media space. The visitor is given the options of calling a person (telephone), or viewing demos (television). Note that the image quality of this photograph has been degraded significantly due to the loss of color.
Figure 2: Simultaneous and continuous display. This example shows that the video channel can be used to reduce the time required to identify options and the memory load placed on visitors -- eliminating the shortcomings of the telephone. Note that the image quality of this photograph has been degraded significantly due to the loss of color.
Figure 3: Interface Functionality. A videoconference requires a subset of the functionality provided by a human or computer interface. We are therefore developing the AVSA as a small-scale model, constantly expanded as new technology emerges, that makes use of only the functionality that is required.
Figure 4: Configuration of the AVSA system.
Figure 5: AVSA status. Dotted lines indicate future work.
Figure 6: Feedback in the top section of the display indicates that the AVSA is responding to a "Page Up" command. Note that the image quality of this photograph has been degraded significantly due to the loss of color.