How Automatic Speech Recognition (ASR) and Speech Synthesis (or Text-To-Speech, TTS) work, and why these are such computationally difficult problems
Where ASR and TTS are used in current commercial interactive applications
The usability issues surrounding speech-based interaction systems, particularly in mobile and pervasive computing
The challenges in enabling speech as a modality for mobile interaction
The current state of the art in ASR and TTS research
The differences between commercial ASR systems' accuracy claims and the needs of mobile interactive applications
The difficulties in evaluating the quality of TTS systems, particularly from a usability and user perspective
The opportunities for HCI researchers to enhance systems' interactivity by enabling speech
A new sub-topic, developed for the presentation at CHI 2015, is interactive speech-based applications centered on language translation, language learning support, and interaction across multiple languages; this will be updated and expanded for the CHI 2017 tutorial. Recent advances in Deep Neural Networks have dramatically improved the accuracy of speech recognition systems; however, these gains require powerful computational resources that are not available to all developers, and we will engage the audience in an analysis of the implications of this dependency for the design of interactive systems. Additionally, even the most capable computation servers continue to struggle when acoustic, language, or interaction conditions are adverse, resulting in large variations in speech processing accuracy. This is particularly relevant for home-based smart personal assistants such as the Amazon Echo, where unexpected interaction contexts (e.g. loud music) can negatively affect performance and thus user experience. The 2017 course materials will include further discussion and analysis of such examples.
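To make this trade-off concrete for discussion, the minimal sketch below (assuming the open-source Python SpeechRecognition package and an illustrative audio file named utterance.wav, neither of which is a course requirement) contrasts a server-hosted recogniser, typically backed by large DNN models, with a lightweight on-device one, and highlights the failure paths an interactive application must handle:

# Illustrative sketch: contrasting a cloud-hosted recogniser with a local one.
# Assumes the SpeechRecognition package (pip install SpeechRecognition) and,
# for the offline path, pocketsphinx; "utterance.wav" is a placeholder file.
import speech_recognition as sr

recognizer = sr.Recognizer()

with sr.AudioFile("utterance.wav") as source:
    audio = recognizer.record(source)  # read the whole utterance

# Server-side recognition: typically more accurate, but dependent on
# network access and on a remote provider's computational resources.
try:
    print("cloud:", recognizer.recognize_google(audio))
except sr.UnknownValueError:
    print("cloud: speech was unintelligible")   # acoustic/language failure
except sr.RequestError as e:
    print("cloud: service unavailable:", e)     # network/quota failure

# On-device recognition (CMU Sphinx): no network needed, but generally
# lower accuracy, especially in adverse acoustic conditions.
try:
    print("local:", recognizer.recognize_sphinx(audio))
except sr.UnknownValueError:
    print("local: speech was unintelligible")
except sr.RequestError as e:
    print("local: recogniser missing:", e)      # e.g. pocketsphinx not installed

The point of interest for the course is not this particular API but the error paths: each exception corresponds to a breakdown (unintelligible speech, unavailable service) that the interaction design must absorb gracefully.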
The course includes three interactive, hands-on activities. The first activity will engage participants in proposing design alternatives for the error-handling interaction of a smartphone's voice-based search assistant, based on an empirical assessment of the types of ASR errors exhibited (e.g. acoustic, language, semantic). For the second activity, participants will evaluate the quality of the synthetic speech output typically employed in mobile speech interfaces, and propose alternative evaluation methods that better reflect the mobile user experience. NEW FOR 2017: The third activity will center on uncovering the speech processing errors of a home-based personal assistant and designing interactions that maintain a positive user experience in the face of unexpected variations in speech processing accuracy.
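As one possible starting point for the first activity's empirical assessment, the sketch below computes word error rate (WER) via word-level edit distance; the transcripts shown are purely illustrative, and classifying individual errors as acoustic, language, or semantic would still be done by inspecting the substitutions, insertions, and deletions themselves.

# Minimal word error rate (WER) sketch: edit distance over words.
# WER = (substitutions + deletions + insertions) / number of reference words.
def wer(reference: str, hypothesis: str) -> float:
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = minimum edits turning the first i reference words
    # into the first j hypothesis words.
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution or match
    return d[len(ref)][len(hyp)] / max(len(ref), 1)

# Illustrative transcripts only (not drawn from the course materials):
print(wer("call my mother", "call my brother"))  # 1 substitution / 3 words = 0.33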