CHAPTER 3 Stage 1: System Considerations

This chapter describes the first stage in the development of the AVSA: defining the basic look and feel of the interface. To accomplish this we first present five qualities that all elements of the system should adhere to. We then consider all of the options available to the interface and, based on the qualities, choose what technology to use for presentation, input and feedback. Finally, we present the general architecture of the system.

3.1 The Qualities

From our observations and input from various users we identified five qualities that the AVSA should have. First, the interface should cater to the lowest common denominator (LCD). That is to say, the system should impose as few new equipment requirements and demand as few new skills as possible. The system should be functional without adding extra computer or videoconferencing equipment to the orphan CODEC, and it should be usable by anyone with aspirations to communicate information.

The second quality is that the system should be as ubiquitous as possible[27]. In our context this means the user should be able to use the system from any node without being attached to any special equipment.

The third quality is that the system should be self-explanatory. This will minimize the training required for new users to learn the system. Note that a self-explanatory system is also one that allows the user to interact with it as naturally as possible.

The fourth quality is that the system should work seamlessly in concert with a "privileged" system. That is to say that the AVSA should augment, rather than degrade, service of those with full IIIF technology.

The final quality is that the system should be non-intrusive. We know we will require an interface of some sort to provide the control and make the system as self-explanatory as possible. However, we also know that an intrusive interface will interfere with the visitor's concentration, reminding them that they are not actually a part of the media space and thus detracting from their sense of presence in the media space. The issue here is one of contrast. The interface should offer just the right amount of instruction so that the user has a choice of blocking out the system, paying passive attention to the system or concentrating on using the system.

The next section describes the options that were available for each component of the interface and, based on the qualities, why we chose the options we did.

3.2 Interface

Without a good interface the system will not be usable or even be used. Therefore, we placed special importance on considering our options for this aspect of the AVSA. We broke the interface into three components: presentation, input and feedback. Presentation refers to informing the visitor of what control they have over the media space. Input concerns how the visitor actually commands the system to perform an action. Feedback refers to how the visitor knows if the requested action was performed and what the results of the action were.

3.2.1 Presentation

The first consideration was how to advise the visitors of what control they have over the media space. There are only two output devices at the videoconference room: the speaker and the monitor. Therefore, the two options for presentation are speech prompts and visual prompts.

Speech Prompts

One approach would have been to provide prompts to the visitor through the audio channel in the form of speech. For example, when the visitor connects to the media space they could be presented with a computer-generated audio message like "do action A to connect to Bill, do action B to connect to John, do action C to connect to Tracy...". This is comparable to the very popular automated attendant style of interface used by millions of people every day. It is a very natural way of receiving information.

Unfortunately, speech output requires sequential processing and is thus time dependent. The problems with using it are threefold. The first is that speech output can be intrusive. By its nature, speech requires a significant amount of attention to understand. In the case of the AVSA, electronic visitors will be faced with situations wherein they are attempting to listen to members of the media space while also accessing the control information. It is very difficult to block out one stream of speech and pay attention to another, especially if both are coming from the same source (i.e. the speaker of the remote node). It is next to impossible to pay partial attention to both.

The second problem with using speech for the output component of the interface is that every option adds listening time: if each option takes time y to speak, a list of x options takes roughly x times y to hear in full. Research and experience tell us that long lists presented in this manner frustrate the user.

A third problem is also related to the time it takes to speak a list of options with long descriptions. While such a list is being read, the user must mentally store the options that best suit their needs. Once the list has been completed, the user searches through this mental store and picks the best of the possible options. This process places an undesirable cognitive load on the user, which can be distracting and thus diminishes the experience of visiting the media space.

It is quite clear from our discussion and other research[2][12] that using speech prompts is not the correct route to take for this component. If we were using only the telephone we would have no option and our interface would be compromised. Fortunately, the orphan CODEC has access to audio and video output. Therefore, our options expand to include the visual channel. Our analysis of why speech is not suitable leads us to a suitable option.

Visual Prompts

The underlying cause for rejecting speech prompts is their sequential nature. Visual prompts, on the other hand, are parallel in nature. One can quickly scan a visual image and acquire vast amounts of information. The question is, how would a visual system fare in the context of the AVSA? To answer this question we look at two features that visual displays offer. These features address the second and third problems that would be associated with a speech-prompt-driven interface. Therefore, the solution will involve providing visual prompts indicating what options are available. These prompts may be in the form of text or some appropriate graphical metaphor. We know, however, that we do not want whatever visual prompt we use to significantly interfere with the other visual tasks that a visitor is involved in.

The discussion provided so far motivates four plausible solutions:

The first option is to provide an extra monitor on which options could be shown (Figure 8). This option does not conform to the LCD quality.

The second possible solution is to provide a windowing system where the prompts and view from the media space are displayed on one monitor but in different areas of the display (Figure 9). This option eliminates the LCD problem of the first option, but by using this split screen method we would be reducing the screen "real-estate" available to either of the two tasks.

The third possibility is to provide the one monitor associated with the node of the orphan CODEC with the capability of switching what it displays. However, by requiring the visitor to switch video signals between two sources, this third option introduces discontinuities in the visitor's experience and thus makes the interface intrusive. Even though the audio connection can remain intact and the visual connection is only temporarily broken, it is still unacceptable. For example, if a visitor wants an explanation (from a member of the media space) as to what an option will do, they would want to be able to see both the person they are talking to and the interface at the same time. In any case, designers should strive to reduce discontinuities so as to minimize the visibility, and thus the intrusiveness, of an interface as much as possible.

The fourth approach, and the one taken, is to use video overlay technology. This technology allows one to combine computer-generated images or video signals with other computer-generated images or video signals[16][17]. By using a pre-defined key (chroma or luma keying), parts of any image can be filtered out, much like the overlay used to display information during a televised sports event or the credits and titles on a film.

In terms of the AVSA, this technology can be used to allow a computer-generated graphic of options to be "overlaid" on top of the video signal from the media space (Figure 10).
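The keying idea can be illustrated in software. The following is a minimal sketch of luma keying only, assuming frames are held as nested lists of RGB tuples; the function names and the threshold value are hypothetical and do not describe the overlay equipment actually used by the AVSA.

```python
# Illustrative sketch only: the names and the threshold are assumptions,
# not the AVSA's actual overlay hardware or software.
from typing import List, Tuple

Pixel = Tuple[int, int, int]   # (R, G, B), each component 0-255
Frame = List[List[Pixel]]      # rows of pixels


def luma(pixel: Pixel) -> float:
    """Approximate brightness of a pixel using the ITU-R BT.601 weights."""
    r, g, b = pixel
    return 0.299 * r + 0.587 * g + 0.114 * b


def luma_key_overlay(video: Frame, graphic: Frame, threshold: float = 16.0) -> Frame:
    """Composite `graphic` over `video`.

    Graphic pixels darker than `threshold` are treated as the key: they are
    filtered out so that the video pixel underneath shows through.
    """
    same_shape = len(video) == len(graphic) and all(
        len(v_row) == len(g_row) for v_row, g_row in zip(video, graphic)
    )
    if not same_shape:
        raise ValueError("frames must have identical dimensions")
    return [
        [g if luma(g) >= threshold else v for v, g in zip(v_row, g_row)]
        for v_row, g_row in zip(video, graphic)
    ]
```

In this sketch a black background in the graphic acts as the key, so only the bright option labels would cover the view of the media space. Chroma keying follows the same pattern with a colour test in place of the brightness test.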

The important parts of the graphic (i.e. the parts that indicate the options) will be shown to the visitor while the non-valuable sections can be filtered out. By using this technology we are exploiting a third feature of video:

Take an X-Windows interface for example. If two windows A and B are present and one wants to concentrate on window A, one simply has to turn one's head or shift one's eyes. Compare this to a speech-prompted interface. There is no equivalent action. You cannot close one ear or shut out audio from any one stream. It is possible to concentrate more on one audio stream to reduce the impact of the other, especially if they are coming from different locations, but requiring a visitor to do this for long periods of time would be irritating, stressful and tiring.

3.2.2 Input

The second aspect that needed to be considered in designing the interface was providing some mechanism by which the visitor can indicate which option they wished to choose. For the interface to be non-intrusive, and thus usable, it was clear that we had to use an accurate form of input. Unfortunately, the most accurate ways to provide input (i.e. a keyboard press, a touch screen, pointing device, etc.) would require extra equipment to be placed at the videoconferencing site. This does not conform to the limited equipment design constraint. Therefore, these forms of input are not practical options.

The only input receptors available at traditional videoconference rooms are:

Touch-tone keypads

With today's technology, touch-tone keypads would offer the most accurate interaction when compared to the two other receptors. However, CODECs do not necessarily have keypads and they certainly do not have the standardized touch tones of telephony. Consequently, by using this input device we would not be adhering to the LCD quality. In addition, it is not a very natural way of interacting and thus would be intrusive.

Camera

The second receptor, the camera, would require the AVSA to be equipped with a gesture recognition algorithm. For our purposes, accurate use of gesture recognition would impose a steep learning curve on the visitor. This would not conform to the self-explanatory quality. The accuracy problem would be compounded because accuracy of gesture recognition algorithms is highly influenced by the environment.

Another problem with using the camera stems from the fact that there is only one camera at the videoconferencing room. Members of the media space will see whatever the AVSA sees, namely the gestures. The gestures seen by the members of the media space will, most likely, seem irrelevant and thus be distracting. Depending on culture, they may even be inappropriate. One way around this problem would be to temporarily discontinue the video seen by the media space member. This would solve the original problem, but introduce disturbing discontinuities for the media space members. Both problems would violate our seamless interaction quality.

A final problem with using the camera is that since the camera's field of view is limited, the user would be restricted to a certain area within the videoconference room. This would violate the ubiquitous quality.

Microphone

The third input receptor available at a traditional videoconference room, and the one chosen, is the microphone. Since most people have the ability to talk, using speech would allow adherence to the LCD quality. It is ubiquitous because the input device travels with the person and the input receptor is always present at an orphan CODEC. It can be made non-intrusive by placing an omnidirectional microphone at the orphan CODEC. Finally, since speech is a very common way to communicate, the interface can be made self-explanatory. One potential problem is that the audio commands may disturb members of the media space.

Ideally, we wanted to equip the AVSA with a speaker-independent, continuous-speech automatic speech recognizer (ASR). The reality, however, is that today's ASR technology is not very accurate, especially in our context, where the receptor is an omnidirectional microphone and the noise levels of the environment are uncontrollable. Despite this problem, and based on our evaluation of the possible input mechanisms, speech is still the best option.

We decided that we could sacrifice the flexibility of the ASR in order to increase the accuracy. As a result, we decided to use a small vocabulary discrete-utterance recognizer.
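To make the trade-off concrete, the following is a minimal sketch of how a small, closed vocabulary lets the system reject uncertain input instead of acting on a misrecognition. It assumes a recognizer that returns one score per vocabulary word; the function name, the score scale and both thresholds are hypothetical and are not taken from the recognizer used in the AVSA.

```python
# Illustrative sketch only: the per-word scores, thresholds and vocabulary
# are assumptions, not the behaviour of any particular recognizer.
from typing import Dict, Optional


def pick_command(
    scores: Dict[str, float],
    min_score: float = 0.6,
    min_margin: float = 0.15,
) -> Optional[str]:
    """Return the best-matching vocabulary word, or None to reject the utterance.

    An utterance is accepted only if the best word scores at least `min_score`
    and beats the runner-up by at least `min_margin`.
    """
    if not scores:
        return None
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    best_word, best_score = ranked[0]
    runner_up_score = ranked[1][1] if len(ranked) > 1 else 0.0
    if best_score < min_score or best_score - runner_up_score < min_margin:
        return None
    return best_word
```

With only a handful of words to tell apart, a rejection rule of this kind can afford to be strict: an ambiguous utterance is ignored and the visitor simply repeats the command.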

3.2.3 Feedback

The final consideration in designing the interface is indicating to the visitor what their input accomplished. For the same reasons outlined in Section 3.2.1, it was clear that feedback should be provided in the same form as the presentation of the system: as a visual stimulus.

3.3 Architecture

Having defined the basic look and feel of the interface, the next task is to define the basic look of the system architecturally. The two details we consider are the location of the AVSA and its internal structure.

3.3.1 Location of AVSA

It was quite clear from the design constraint that the AVSA must be located at the media space. Figure 11 shows how the AVSA should be connected to the media space.

3.3.2 The AVSA structure

The AVSA itself is divided into three layers (Figure 12):

The interface layer

The interface layer allows a person to pass and receive information to and from the AVSA. Section 3.2 provides a basic description of how it will work.

The communication layer

The communication layer allows the AVSA to pass and receive information to and from the media space. The AVSA will use the Ethernet communication channels already used by the media space TP client to communicate control information. The AVSA will also be connected to the A/V network so that the AVSA can send the visitor the composite image containing the interface overlaid on the view of the media space.

The translator layer

The translator layer is the link between the interface and communication layers. In terms of the AVSA, the translator has two jobs. First, it parses data coming from the IIIF server and stores this data in appropriate structures. Second, it constructs commands to be sent to the IIIF server based on a user's requests.
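The two jobs of the translator can be sketched as follows. This is an illustration only: the "name:state" status format, the "ACTION target" command format, and the names Member, parse_status and build_command are all hypothetical, since the actual IIIF server protocol is not described here.

```python
# Illustrative sketch only: the message formats below are invented for this
# example and are NOT the real IIIF server protocol.
from dataclasses import dataclass
from typing import Dict


@dataclass
class Member:
    """One media space member as reported by the (hypothetical) status message."""
    name: str
    available: bool


def parse_status(text: str) -> Dict[str, Member]:
    """Job 1: parse server data (assumed 'name:state' lines) into structures."""
    members: Dict[str, Member] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue  # skip blank lines
        name, _, state = line.partition(":")
        name = name.strip()
        members[name] = Member(name, state.strip() == "available")
    return members


def build_command(action: str, target: str, members: Dict[str, Member]) -> str:
    """Job 2: turn a user's request into an (assumed) 'ACTION target' command."""
    if target not in members:
        raise KeyError(f"unknown member: {target}")
    return f"{action.upper()} {target}"
```

For example, parsing the text "Bill:available" followed by "John:busy" on separate lines yields two Member records, and a request to connect to Bill would be rendered as the string "CONNECT Bill" before being handed to the communication layer.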

3.4 Summary

In this chapter we introduced the five qualities with which we want to design the system. We then outlined the basic look and feel of the system based on these qualities. The AVSA will receive input as speech commands from a person at the videoconferencing room. It will prompt the user and provide feedback through the video channel by overlaying the text on the media space image.

We identified the location of the AVSA in relation to the media space and videoconferencing room.

Finally, we laid out the basic architecture of the AVSA in terms of three layers: the interface, communication and translator layers.