S3: Speech, Script and Scene driven Head and Eye Animation (SIGGRAPH 2024)

YIFANG PAN, RISHABH AGRAWAL, KARAN SINGH

JALI Research Inc., University of Toronto

Conversational audio and a tagged transcript are aligned and diarized into separate streams. Speaker gaze during segments of speech is predicted as focused on or averted from a conversation partner (a). A 3D scene context defines a dynamic saliency map (b), which refines the predicted gaze transitions into a set of 3D gaze trajectories (c). Speech audio generates rhythmic head motion (d), which is combined with other gestures to produce head+eye motion satisfying the gaze trajectories (e).
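To give intuition for step (c), here is a minimal Python sketch of refining a per-segment focus/aversion prediction into a concrete 3D gaze target using a scene saliency map. All names, the distance penalty, and the scoring rule are assumptions for illustration, not the paper's implementation:

```python
import numpy as np

def pick_gaze_target(averted, partner_pos, salient_points, saliency_weights):
    """Return a 3D point to look at for one speech segment.

    salient_points   : (N, 3) array of candidate scene points
    saliency_weights : (N,) array of dynamic saliency scores
    """
    if not averted:
        # Focused gaze: look at the conversation partner.
        return partner_pos
    # Averted gaze: penalize points near the partner so the gaze
    # actually moves away, then pick the most salient remaining point.
    dist = np.linalg.norm(salient_points - partner_pos, axis=1)
    score = saliency_weights * np.clip(dist, 0.0, 1.0)
    return salient_points[np.argmax(score)]

# Hypothetical usage: three salient scene points, gaze currently averted.
partner = np.array([0.0, 1.7, 1.0])
points = np.array([[0.5, 1.5, 0.8], [-1.0, 1.6, 2.0], [0.0, 0.9, 0.5]])
weights = np.array([0.4, 0.9, 0.2])
print(pick_gaze_target(averted=True, partner_pos=partner,
                       salient_points=points, saliency_weights=weights))
```

A real system would evaluate this per segment over time, since the saliency map is dynamic and the set of candidate points changes with the scene.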

Abstract

We present \(S^3\), a novel approach to generating expressive, animator-centric 3D head and eye animation of characters in conversation. Given speech audio, a Directorial script and a cinematographic 3D scene as input, we automatically output the animated 3D rotation of each character’s head and eyes. \(S^3\) distills animation and psycho-linguistic insights into a novel modular framework for conversational gaze, capturing: audio-driven rhythmic head motion; narrative script-driven emblematic head and eye gestures; and gaze trajectories computed from audio-driven gaze focus/aversion and 3D visual scene salience. Our evaluation is four-fold: we quantitatively validate our algorithm against ground truth data and baseline alternatives; we conduct a perceptual study showing our results to compare favourably to prior art; we present examples of animator control and critique of \(S^3\) output; and we present a large number of compelling and varied animations of conversational gaze.
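For intuition on how a single gaze trajectory can drive both head and eye rotation, below is a minimal sketch of one common animation convention: the head carries a fixed fraction of the look-at angles and the eyes cover the remainder. The split ratio, axis conventions, and function names are assumptions for illustration, not \(S^3\)'s actual method:

```python
import numpy as np

def lookat_angles(eye_pos, target):
    """Yaw/pitch (radians) of the direction from eye_pos to target.

    Assumes y is up and +z is the character's forward axis.
    """
    d = target - eye_pos
    yaw = np.arctan2(d[0], d[2])                  # rotation about the up axis
    pitch = np.arctan2(d[1], np.hypot(d[0], d[2]))
    return yaw, pitch

def split_head_eye(eye_pos, target, head_share=0.6):
    """Split a look-at rotation between head and eyes (0.6 is an assumed split)."""
    yaw, pitch = lookat_angles(eye_pos, target)
    head = (head_share * yaw, head_share * pitch)
    eyes = ((1.0 - head_share) * yaw, (1.0 - head_share) * pitch)
    return head, eyes

# Hypothetical usage: character at eye height 1.7 m looking at a point ahead.
head, eyes = split_head_eye(np.array([0.0, 1.7, 0.0]),
                            np.array([1.0, 1.5, 2.0]))
print("head (yaw, pitch):", head)
print("eyes (yaw, pitch):", eyes)
```

In practice the head share varies with gesture and speech rhythm, which is why the framework layers audio-driven head motion and emblematic gestures on top of the gaze trajectories rather than using a fixed split.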

Code (coming soon) · Dataset (coming soon) · Paper · SIGGRAPH Presentation

Result videos: Heat · Truman · Gaga · Jimmy

Supplemental Video