SenseTime researcher showcases advances in AI face syncing with audio

Digital manipulation of faces can be used to spoof biometric systems or sow misinformation, but the technology to coordinate facial movements with a speaker’s voice is also in demand in several fields, as Yuxin Wang, a researcher at SenseTime, explained at a conference held by the European Association for Biometrics (EAB).

His presentation on “Talking Faces: Audio-to-Video Face Generation” was part of EAB’s workshop on Digital Face Manipulation and Detection, held this week for members of the organization.

Digital technologies have been used since the 1990s to generate synthetic videos of people talking, for applications such as virtual assistants, teleconferencing, film and video game dubbing, and digital twins.

The output of talking head generation should show “significantly more head movement” than the source material used in audio-based face reconstruction, Wang says.

Wang reviewed modeling techniques that measure relationships between head movement and vocalization for talking face generation. Data extracted from an audio representation is used to ensure that the mouth movement and speaker expressions in the video accurately and consistently match the sound.

He described a pair of pipelines for doing this: one passes representations from audio and image encoders through a single decoder, while the other uses a regression model to map audio features to an intermediate representation, such as facial landmarks, and renders the video from that intermediate feature. Wang also explained image refinement and background compositing in post-processing.
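As a rough illustration of the first of these pipelines, the sketch below wires an audio encoder and an image encoder into a shared decoder that emits one frame per audio window. It is a minimal toy in PyTorch; the module names, layer sizes and 64-pixel output are assumptions made for brevity and do not describe SenseTime’s actual models.

```python
# Minimal sketch of the encoder-decoder pipeline described above: an audio encoder and an
# image encoder produce feature vectors, and a single decoder turns them into a video frame.
# All module names, sizes, and shapes here are illustrative assumptions, not SenseTime's models.
import torch
import torch.nn as nn

class AudioEncoder(nn.Module):
    """Encodes a short window of mel-spectrogram audio into a feature vector."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim),
        )
    def forward(self, mel):          # mel: (B, 1, n_mels, time)
        return self.net(mel)         # (B, feat_dim)

class ImageEncoder(nn.Module):
    """Encodes a reference face image (identity/pose) into a feature vector."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 32, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(32, 64, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(64, feat_dim),
        )
    def forward(self, img):          # img: (B, 3, H, W)
        return self.net(img)         # (B, feat_dim)

class FrameDecoder(nn.Module):
    """Decodes the fused audio + identity features into one output frame."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.fc = nn.Linear(feat_dim * 2, 128 * 8 * 8)
        self.up = nn.Sequential(
            nn.ConvTranspose2d(128, 64, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(64, 32, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(32, 3, 4, stride=2, padding=1), nn.Tanh(),
        )
    def forward(self, audio_feat, image_feat):
        x = self.fc(torch.cat([audio_feat, image_feat], dim=1)).view(-1, 128, 8, 8)
        return self.up(x)            # (B, 3, 64, 64) synthesized frame

# One forward pass with dummy data; a full video repeats this frame by frame over the audio.
audio_enc, image_enc, decoder = AudioEncoder(), ImageEncoder(), FrameDecoder()
mel_window = torch.randn(1, 1, 80, 16)     # audio features for one video frame
reference  = torch.randn(1, 3, 64, 64)     # still image supplying the identity
frame = decoder(audio_enc(mel_window), image_enc(reference))
print(frame.shape)                         # torch.Size([1, 3, 64, 64])
```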

The talk then discussed the methods and datasets used in the generation of 2D and 3D faces.

Wang also described the various metrics that have been developed to evaluate image quality, synchronization between the audio signal and the speaker’s lips, identity preservation, and blinking.
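The talk did not specify which metrics or models SenseTime uses, but one common way to score identity preservation is the cosine similarity between face-recognition embeddings of the source subject and the generated frames. The sketch below illustrates that idea; embed_face() is a hypothetical stand-in for any pretrained recognition network.

```python
# Sketch of an identity-preservation score: mean cosine similarity between the embedding of
# the reference face and the embeddings of generated frames. embed_face() is a placeholder,
# not a real library call; in practice it would be a pretrained face recognition model.
import numpy as np

def embed_face(image: np.ndarray) -> np.ndarray:
    """Placeholder: return an identity embedding for a face crop (assumption for the sketch)."""
    rng = np.random.default_rng(int(image.sum()) % (2**32))
    return rng.standard_normal(512)

def identity_preservation(reference: np.ndarray, generated_frames: list[np.ndarray]) -> float:
    """Mean cosine similarity between the reference identity and each generated frame."""
    ref = embed_face(reference)
    ref = ref / np.linalg.norm(ref)
    scores = []
    for frame in generated_frames:
        emb = embed_face(frame)
        emb = emb / np.linalg.norm(emb)
        scores.append(float(ref @ emb))
    return float(np.mean(scores))

# Toy usage with random arrays standing in for real face crops.
reference = np.random.rand(112, 112, 3)
frames = [np.random.rand(112, 112, 3) for _ in range(5)]
print(f"identity preservation score: {identity_preservation(reference, frames):.3f}")
```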

The remaining challenges in generating talking faces range from exercising fine control over facial features like the eyes and teeth, head movements and emotions, to generalizing identity and body. Then there are considerations around forgery detection and social responsibility.

As an example of the first challenge, Wang notes that blinking is related to the mechanics of speech and thought processes, but those relationships are not yet well understood. Blinking can be generated from target images or Gaussian noise. Some models associate eye movement with overall facial expression, but this method is still in its early stages of development.
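To make the noise-driven option concrete, here is a small toy in the same spirit: blink timing and duration are drawn from normal distributions and converted into a per-frame eye-openness curve. The distributions and values are arbitrary assumptions for illustration; actual systems learn these relationships from data.

```python
# Toy illustration of driving blinks from Gaussian noise rather than from audio or a target
# image: blink intervals and durations are sampled from normal distributions and turned into
# a per-frame eye-openness signal. The parameters below are arbitrary assumptions.
import numpy as np

def blink_signal(num_frames: int, fps: float = 25.0, seed: int = 0) -> np.ndarray:
    """Return eye openness in [0, 1] for each frame, with noise-driven blink timing."""
    rng = np.random.default_rng(seed)
    openness = np.ones(num_frames)
    t = 0.0
    while True:
        # Gap until the next blink (~4 s on average) and its length (~0.3 s), both noisy.
        t += max(0.5, rng.normal(loc=4.0, scale=1.5))
        duration = max(0.1, rng.normal(loc=0.3, scale=0.05))
        start = int(t * fps)
        if start >= num_frames:
            break
        end = min(num_frames, start + int(duration * fps))
        # Simple triangular close/open profile over the blink frames.
        n = end - start
        profile = 1.0 - np.minimum(np.arange(n), np.arange(n)[::-1]) / max(1, n // 2)
        openness[start:end] = np.clip(profile, 0.0, 1.0)
    return openness

signal = blink_signal(num_frames=250)          # 10 seconds of video at 25 fps
print(f"{(signal < 0.5).sum()} frames with eyes mostly closed out of {signal.size}")
```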

According to the SenseTime researcher, larger and more diverse datasets could help improve generation.

Manipulated video detection was briefly considered, and deepfake detection was the subject of several other presentations during the event.

Wang expects talking face generation technology to improve in the near future, with practical applications developing alongside it.

Article topics

biometrics | biometric research | deepfakes | EAB | European Association for Biometrics | research and development | SenseTime | spoofing | synthetic data
