Despite the abundance of edited short-form audio content on the internet, original unedited recordings are typically unavailable. To kickstart research in automatic editing and retargeting of short-form audio content, we collected a dataset of unedited audio story recordings on Amazon Mechanical Turk.
Audio time stretching alters an audio signal’s playback speed and duration, and is commonly used in video and audio editing when editors want to fit longer material into a designated time slot. Though widely used on short-form platforms like TikTok and YouTube Shorts, audio time stretching usually introduces artifacts. These artifacts may go unnoticed for a minimal speed-up but become more prominent as the degree of stretching increases: words become less intelligible as phonemes are clustered together. Though intuitively understandable, the relationship between speed-up and the naturalness and intelligibility of speech is unclear, let alone in the context of social media audio stories. Therefore, we conducted a listening study to investigate the speed-up threshold beyond which listeners perceive a degradation of naturalness and intelligibility.
In our study on Amazon Mechanical Turk, we observed a steady, monotonic decrease in naturalness ratings as the speed-up factor increased. There is a slight but noticeable dip at 110% speed-up, indicating that the Turkers start to perceive a reduction in naturalness compared to unedited samples. The dip at 120% speed-up is slight but statistically significant (p-value < 0.001, compared with 125%), so 120% may be considered a soft limit for speed-up without significantly impacting the naturalness of the recording.
We formulate automatic shortening as a combinatorial optimization problem: selecting the optimal combination of sentences that complies with the length constraints. Our algorithm first transcribes the recording and segments it into sentences. It then selects the optimal subset of sentences from the original recording by maximizing the total sentence score subject to the length constraints. We designed the sentence score function to consider both a sentence’s duration and its relevance, measured in the sentence embedding space, to a summary of the audio story obtained with neural abstractive summarization. We use dynamic programming for efficient optimization, which runs in real time. Once the optimal selection is obtained, ROPE synthesizes the final audio output by cropping and concatenating the selected sentences. We also apply an audio enhancement technology to increase sound quality.
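The selection step described above can be read as a 0/1 knapsack problem. The following is a minimal sketch of that reading, not the implementation used in our system. It assumes that per-sentence scores have already been computed (the score function itself is not reproduced here), that there is a single upper bound on total duration, and that durations are discretized on a fixed grid. The function name `select_sentences` and the `step` parameter are hypothetical and introduced only for illustration.

```python
import math
from typing import List, Sequence


def select_sentences(
    durations: Sequence[float],
    scores: Sequence[float],
    budget: float,
    step: float = 0.1,
) -> List[int]:
    """Pick sentence indices maximizing total score within a duration budget.

    Illustrative 0/1 knapsack via dynamic programming. Durations and budget
    are in seconds; `step` is an assumed discretization grid (hypothetical).
    """
    n = len(durations)
    # Round sentence durations up and the budget down, so the selected
    # sentences never exceed the real budget after discretization.
    weights = [math.ceil(d / step - 1e-9) for d in durations]
    cap = math.floor(budget / step + 1e-9)

    # best[i][c]: best total score using the first i sentences within capacity c.
    best = [[0.0] * (cap + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        w, s = weights[i - 1], scores[i - 1]
        for c in range(cap + 1):
            best[i][c] = best[i - 1][c]  # option 1: skip sentence i-1
            if w <= c and best[i - 1][c - w] + s > best[i][c]:
                best[i][c] = best[i - 1][c - w] + s  # option 2: keep it

    # Backtrack to recover which sentences were kept, in original order.
    chosen: List[int] = []
    c = cap
    for i in range(n, 0, -1):
        if best[i][c] != best[i - 1][c]:
            chosen.append(i - 1)
            c -= weights[i - 1]
    chosen.reverse()
    return chosen


if __name__ == "__main__":
    # Toy example: four sentences, 9-second budget, 1-second grid.
    keep = select_sentences([4.0, 3.0, 5.0, 2.0], [0.9, 0.4, 0.8, 0.3],
                            budget=9.0, step=1.0)
    print(keep)  # [0, 2]
```

Because the returned indices stay in their original order, the selected sentences can be cropped and concatenated directly. The table has one row per sentence and one column per grid step of the budget, so the cost grows linearly in both, which is consistent with fast selection for short recordings.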