Our workshop aims to develop speech and multimodal interaction into a well-established area of study within HCI, one that leverages current engineering advances in ASR, NLP, TTS, multimodal/gesture recognition, and brain-computer interfaces. In return, advances in HCI can contribute to creating NLP and ASR algorithms that are informed by, and better address, the usability challenges of speech and multimodal interfaces.
We also aim to increase cohesion among research currently dispersed across many areas, including HCI, wearable design, ASR, NLP, BCI complementing speech, EMG interaction, and eye-gaze input. Our hope is that by focusing and challenging the research community on multi-input modalities for wearables, we will energize the CHI and engineering communities to push the boundaries of what is possible with wearable, mobile, and pervasive computing, while also advancing each of the respective fields. For example, the recent significant breakthroughs in deep neural networks are largely confined to audio-only features, while there is a significant opportunity to incorporate other features and context (such as multimodal input for wearables) into this framework. We anticipate this can only be accomplished through closer collaboration between the speech and HCI communities.
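As a concrete, if simplified, illustration of the kind of fusion we have in mind, the sketch below concatenates per-frame audio features with motion (IMU) features from a wearable device and feeds them to a small feed-forward classifier. The feature dimensions, class labels, and network shape are illustrative assumptions, not a prescription from any particular system.

```python
import torch
import torch.nn as nn

# Illustrative dimensions (assumptions, not drawn from any specific system):
# 40-dim log-mel audio features per frame, 6-dim IMU (accelerometer + gyroscope)
# features from a wrist-worn device, and 4 hypothetical intent classes.
AUDIO_DIM, IMU_DIM, NUM_CLASSES = 40, 6, 4

class EarlyFusionClassifier(nn.Module):
    """Concatenate audio and motion features, then classify with a small MLP."""

    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(AUDIO_DIM + IMU_DIM, 64),
            nn.ReLU(),
            nn.Linear(64, NUM_CLASSES),
        )

    def forward(self, audio_feats: torch.Tensor, imu_feats: torch.Tensor) -> torch.Tensor:
        fused = torch.cat([audio_feats, imu_feats], dim=-1)  # early fusion by concatenation
        return self.net(fused)

# Usage with random stand-in features for a batch of 8 frames.
model = EarlyFusionClassifier()
logits = model(torch.randn(8, AUDIO_DIM), torch.randn(8, IMU_DIM))
print(logits.shape)  # torch.Size([8, 4])
```

Even this toy setup makes the design question visible: whether to fuse modalities early at the feature level, as here, or late at the decision level is exactly the kind of trade-off that benefits from joint speech and HCI expertise.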
Our ultimate goal is to cross-pollinate ideas across the activities and priorities of different disciplines. With its unique format and reach, a CHI workshop offers the opportunity to strengthen future approaches and unify practices moving forward. The CHI community can host researchers from other disciplines with the shared goal of advancing multimodal interaction design for wearable, mobile, and pervasive computing.
What are the important challenges in using speech as a “mainstream” modality? Speech is increasingly present in commercial applications – can we characterize which other applications speech is well suited to, or where it has the greatest potential to help?
What interaction opportunities are presented by the rapidly evolving mobile, wearable, and pervasive computing areas? How and how much does multimodal processing increase robustness over speech alone, and in what contexts?
Can speech and multimodal interaction increase the usability and robustness of interfaces and improve user experience beyond input/output?
What can the CHI community learn from Automatic Speech Recognition (ASR), Text-to-Speech Synthesis (TTS), and Natural Language Processing (NLP) research, and in turn, how can it help these communities improve the user acceptance of such technologies? For example, what should we be asking them to extract from speech besides words/segments? How can work in context and discourse understanding or dialogue management shape research in speech and multimodal UIs? And can we bridge the divide between the evaluation methods used in HCI and the AI-style batch evaluations used in speech processing?
How can UI designers make better use of the acoustic-prosodic information in speech beyond word recognition itself, for example for emotion recognition or for identifying users' cognitive states? (A minimal sketch of extracting such prosodic cues follows this list of questions.)
What are the usability challenges of synthetic speech? How can expressiveness and naturalness be incorporated into interface design guidelines, particularly in mobile or wearable contexts where text-to-speech could play a significant role in users' experiences? And how can this be generalized to designing usable UIs for mobile and pervasive (in-car, in-home) applications that rely on multimedia response generation?
What are the opportunities and challenges for speech and multimodal interaction with regard to the spontaneous access to information afforded by wearable and mobile devices? And can such modalities facilitate access in a secure and personal manner, especially since mobile and wearable interfaces raise significant privacy concerns?
What are the implications for the design of speech and multimodal interaction presented by new contexts for wearable use, including hands-busy, cognitively demanding situations and perhaps even unconscious and unintentional use (in the case of body-worn sensors)? Wearables may have form factors that verge on being ‘invisible’ or inaccessible to direct touch. Such reliance on sensors requires clearer conceptual analyses of how to combine active input modes with passive sensors to deliver optimal functionality and ease of use. And what role can understanding users' context (hands, eyes busy) play in selecting the best modality for such interactions or in predicting user needs?
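To make the acoustic-prosodic question above more concrete, the following minimal sketch extracts two common prosodic cues, fundamental frequency (pitch) and frame-level energy, from a recording using the librosa library. The file path and summary statistics are placeholders chosen for illustration; a real interface would, of course, need far richer modeling to infer emotion or cognitive state.

```python
import numpy as np
import librosa

# Placeholder path; any mono speech recording will do for this illustration.
audio_path = "utterance.wav"
y, sr = librosa.load(audio_path, sr=None)

# Fundamental frequency (pitch) contour via probabilistic YIN.
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C6"), sr=sr
)

# Frame-level energy via root-mean-square amplitude.
rms = librosa.feature.rms(y=y)[0]

# Simple summary statistics that a UI might treat as coarse prosodic cues,
# e.g. to notice rising intonation or raised vocal effort.
print("mean pitch (Hz):", np.nanmean(f0))
print("pitch range (Hz):", np.nanmax(f0) - np.nanmin(f0))
print("mean energy:", rms.mean())
```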