(1) Telepresence Project
(2) Hitachi Research Laboratory, Hitachi Limited, 7-1-1 Omika-cho, Ibaraki, Japan. +81-294-52-7524, tanikosi@hrl.hitachi.co.jp
(3) Alias | Wavefront, 110 Richmond Street, Toronto, ON M5C 1P1. +1-416-362-8558, bbuxton@aw.sgi.com
When visitors contact a media space without pre-arranging their calls, they receive, at best, a preset view of the environment and, at worst, no view at all. In either case, the visitor is stranded. To avoid falling into this "hole" of media space communication, visitors must pre-arrange their calls with a local attendee through, for example, e-mail. Both the visitor and the local attendee must then connect to the codec at the agreed time in order to videoconference. The Audio Video Server Attendant overcomes this limitation by giving visitors the ability to navigate independently through a media space.
Our solution to this problem, the Audio Video Server Attendant (AVSA), is an automated attendant that informs visitors of the available commands and processes these commands as they are received. The attendant allows visitors to see whether individuals are present and, if so, to contact them directly. The AVSA caters to the lowest common denominator of user capability, including visitors without computers. Hence, a visitor needs only the traditional videoconferencing equipment already in place to support human-human communication, namely a camera, a microphone and a monitor. Using these devices, the AVSA takes input through speech and provides output through video overlay, as shown in Figure 1.
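In outline, such an attendant runs a simple loop: recognize an utterance, match it against the current menu, and update the video overlay accordingly. The following minimal sketch illustrates this control flow; the recognizer, overlay and menu shown here (recognize_utterance, render_overlay, MENU) are hypothetical stand-ins for the AVSA's speech and video interfaces, not its actual implementation.

```python
# Minimal sketch of an automated-attendant command loop.
# recognize_utterance() and render_overlay() are hypothetical
# stand-ins for the speech-input and video-overlay interfaces.

MENU = {
    "directory": "list the people who are currently present",
    "call": "connect to the selected person",
    "goodbye": "end the session",
}

def recognize_utterance():
    """Stand-in for a speech recognizer; here we just read typed input."""
    return input("visitor says> ").strip().lower()

def render_overlay(lines):
    """Stand-in for drawing text over the outgoing video signal."""
    print("\n".join("[overlay] " + line for line in lines))

def attendant_session():
    while True:
        # The menu stays on screen until the visitor acts (cf. Figure 1).
        render_overlay(f'say "{cmd}" to {desc}' for cmd, desc in MENU.items())
        command = recognize_utterance()
        if command == "goodbye":
            render_overlay(["session ended"])
            break
        elif command in MENU:
            render_overlay(["executing: " + MENU[command]])
        else:
            render_overlay(["unrecognized option; please try again"])

if __name__ == "__main__":
    attendant_session()
```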
Automated attendants such as those found in telephony are typically controlled by the touch tones generated from the telephone keypad. This control mechanism, however, does not transfer to the codec setting: codecs may not have keypads, and they certainly lack the standardized touch tones of telephony. Regardless, the visitor should not have to bear the cognitive burden of using a keypad, or worse still, a computer, assuming they even have one. Speech, by contrast, is a common denominator in all videoconference communication, and we therefore use it to provide a natural interaction mechanism far better suited to the menu selection task. This choice is supported by previous research showing that speech is the preferred input modality when the task involves short, interactive communication [6].
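Because recognizer output is rarely an exact match for a menu label, a practical attendant would match the transcript against its small option vocabulary rather than demand an exact string. The fragment below sketches one plausible approach under that assumption, scoring each option by word overlap; the scoring rule and the 0.5 acceptance threshold are illustrative, not values from the AVSA.

```python
# Sketch: map a (possibly noisy) speech transcript to a menu option
# by word overlap. The 0.5 acceptance threshold is an assumption.

def match_option(transcript, options):
    words = set(transcript.lower().split())
    best, best_score = None, 0.0
    for option in options:
        option_words = set(option.lower().split())
        score = len(words & option_words) / len(option_words)
        if score > best_score:
            best, best_score = option, score
    return best if best_score >= 0.5 else None

options = ["vcr control", "pip control", "head tracking", "seat changing"]
print(match_option("the vcr control please", options))      # -> "vcr control"
print(match_option("completely unrelated words", options))  # -> None
```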
Two further problems of automated attendants stem from their use of audio as the sole medium of interaction: the time required to listen to the entire list of options, and the cognitive load of remembering each option and its associated action. Both problems can be solved by exploiting the extra communication channel afforded by video. By presenting the menu of available options graphically, rather than sequentially by voice, all options are visible simultaneously and instantaneously, as shown in Figure 1. Furthermore, the options are displayed continuously, remaining on screen until an action is initiated. The shortcomings of presenting options through the audio channel are thus eliminated by transferring the burden to the better suited visual channel.
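The persistent menu amounts to compositing the option text onto every outgoing frame. A minimal sketch of that idea follows, modeling frames as 2-D character arrays as a stand-in for real pixel buffers and graphics-overlay hardware:

```python
# Sketch: keep a menu composited onto every outgoing frame.
# Frames are modeled as lists of character rows; a real system
# would draw into the video signal with graphics-overlay hardware.

def composite_menu(frame, menu_lines, top=1, left=2):
    """Return a copy of `frame` with `menu_lines` drawn over it."""
    out = [row[:] for row in frame]
    for i, text in enumerate(menu_lines):
        for j, ch in enumerate(text):
            out[top + i][left + j] = ch
    return out

blank_frame = [[" "] * 40 for _ in range(6)]  # toy 40x6 "frame"
menu = ['say "directory"', 'say "goodbye"']
for row in composite_menu(blank_frame, menu):
    print("".join(row))
```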
So far, we have used only audio as an input mechanism for remote control of a media space. We are also investigating the use of video, specifically gesture, as a natural input mechanism for tasks such as device control. A working example is the head tracking system [3], which uses the visitor's head position to control a motorized camera.
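As a rough illustration of how tracked head position could drive such a camera, consider a proportional mapping from the head's horizontal offset in the image to a pan command. The gain and pan limits below are illustrative assumptions, not parameters of the system described in [3].

```python
# Sketch: proportional mapping from tracked head position to camera pan.
# The gain and pan range are illustrative; a real controller would also
# smooth the tracking signal and drive tilt and zoom.

PAN_GAIN = 0.8    # degrees of pan per pixel of head offset
PAN_LIMIT = 30.0  # assumed mechanical pan range, +/- degrees

def pan_command(head_x, frame_center_x=160.0):
    """Map the head's x-position (pixels) to a clamped pan angle (degrees)."""
    angle = PAN_GAIN * (head_x - frame_center_x)
    return max(-PAN_LIMIT, min(PAN_LIMIT, angle))

for head_x in (160, 200, 40):
    print(f"head at x={head_x:3d} -> pan {pan_command(head_x):+.1f} deg")
```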
2. Buxton, W. and Moran, T. EuroPARC's Integrated Interactive Intermedia Facility (IIIF): Early Experience. In S. Gibbs and A. A. Verrijn-Stuart (Eds.), Multi-User Interfaces and Applications, Proceedings of the IFIP WG 8.4 Conference on Multi-User Interfaces and Applications, Heraklion, Crete. Amsterdam: Elsevier Science Publishers B.V. (North-Holland), 1990, pp. 11-34.
3. Gaver, W., Smets, G., and Overbeeke, K. A Virtual Window on Media Space. Proceedings of Human Factors in Computing Systems 1995 (CHI'95) (Denver, May 7-11), ACM Press, pp. 257-264.
4. Kelly, P. H., Katkere, A., Kuramura, D. Y., Moezzi, S., Chatterjee, S., and Jain, R. An Architecture for Multiple Perspective Interactive Video. Proceedings of ACM Multimedia 1993, pp. 201-212.
5. Mantei, M., Baecker, R., Sellen, A., Buxton, W., Milligan, T., and Wellman, B. Experiences in the Use of a Media Space. Proceedings of CHI'91, ACM Conference on Human Factors in Computing Systems, pp. 203-208. Reprinted in D. Marca and G. Bock (Eds.), Groupware: Software for Computer-Supported Cooperative Work. Los Alamitos, CA: IEEE Computer Society Press, 1992, pp. 372-377.
6. Martin, G. L. The Utility of Speech Input in User-Computer Interfaces. International Journal of Man-Machine Studies, Vol. 30, 1989, pp. 355-375.
7. Schmandt, C. Phoneshell: The Telephone as Computer Terminal. Proceedings of ACM Multimedia 1993, pp. 373-382.
8. Yankelovich, N., Levow, G., and Marx, M. Designing SpeechActs: Issues in Speech User Interfaces. Proceedings of Human Factors in Computing Systems 1995 (CHI'95) (Denver, May 7-11), ACM Press, pp. 369-376.
Figure 1: The initial AVSA menu offers a selection of people with whom the user can visit. Selections are made by uttering the desired option enclosed in quotes.
Figure 2: While connected, visitors can control equipment ("vcr control" and "pip control") and change their electronic view ("head tracking" and "seat changing").