Publication
Speech is a rich biometric signal that contains information about the identity, gender and emotional state of the speaker. In this work, we explore its potential to generate face images of a speaker by conditioning a Generative Adversarial Network (GAN) on raw speech input. We propose a deep neural network that is trained from scratch in an end-to-end fashion, generating a face directly from the raw speech waveform without any additional identity information (e.g. a reference image or one-hot encoding). Our model is trained in a self-supervised fashion by exploiting the audio and visual signals that are naturally aligned in videos. To train from video data, we collected a novel dataset for this work, consisting of high-quality videos of ten YouTubers with notable expressiveness in both the speech and visual signals.
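To make the idea concrete, below is a minimal sketch of a generator conditioned directly on a raw speech waveform. This is not the paper's exact architecture: the layer sizes, the 128-dimensional embedding, the 16 kHz input and the 64x64 output resolution are all illustrative assumptions, and the discriminator and adversarial training loop are omitted.

```python
# Minimal sketch (PyTorch 0.4-style): a GAN generator conditioned on raw speech.
# All hyperparameters below are illustrative assumptions, not the paper's values.
import torch
import torch.nn as nn


class SpeechEncoder(nn.Module):
    """Maps a raw waveform (batch, 1, samples) to a compact embedding."""
    def __init__(self, emb_dim=128):
        super(SpeechEncoder, self).__init__()
        self.conv = nn.Sequential(
            nn.Conv1d(1, 32, kernel_size=64, stride=4), nn.LeakyReLU(0.2),
            nn.Conv1d(32, 64, kernel_size=32, stride=4), nn.LeakyReLU(0.2),
            nn.Conv1d(64, 128, kernel_size=16, stride=4), nn.LeakyReLU(0.2),
        )
        self.fc = nn.Linear(128, emb_dim)

    def forward(self, wav):
        h = self.conv(wav)   # (batch, 128, frames)
        h = h.mean(dim=2)    # global average pooling over time
        return self.fc(h)    # (batch, emb_dim)


class Generator(nn.Module):
    """Decodes the speech embedding into a 64x64 RGB face image."""
    def __init__(self, emb_dim=128):
        super(Generator, self).__init__()
        self.net = nn.Sequential(
            nn.ConvTranspose2d(emb_dim, 256, 4, 1, 0), nn.ReLU(),  # 4x4
            nn.ConvTranspose2d(256, 128, 4, 2, 1), nn.ReLU(),      # 8x8
            nn.ConvTranspose2d(128, 64, 4, 2, 1), nn.ReLU(),       # 16x16
            nn.ConvTranspose2d(64, 32, 4, 2, 1), nn.ReLU(),        # 32x32
            nn.ConvTranspose2d(32, 3, 4, 2, 1), nn.Tanh(),         # 64x64
        )

    def forward(self, emb):
        # Reshape the embedding to a 1x1 spatial map before upsampling.
        return self.net(emb.unsqueeze(2).unsqueeze(3))


# Example: one second of 16 kHz speech -> one face image.
wav = torch.randn(1, 1, 16000)
face = Generator()(SpeechEncoder()(wav))
print(face.size())  # (1, 3, 64, 64)
```

The key point the sketch illustrates is that no identity label enters the generator: the face is produced from the speech embedding alone, so the identity information must be recovered from the waveform itself.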
Model
Results

Presentation
Code
This project was developed with Python 2.7 and PyTorch 0.4.0. To download and install PyTorch, please follow the official guide. You can also fork or download the project from [here](https://github.com/miqueltubau/Wav2Pix.git).
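Getting a local copy might look like this; the `pip` pin below is an assumption about one way to obtain PyTorch 0.4.0, and the official PyTorch guide remains the supported install path:

```bash
# Clone the repository
git clone https://github.com/miqueltubau/Wav2Pix.git
cd Wav2Pix

# Assumed install; prefer the platform-specific instructions
# from the official PyTorch guide.
pip install torch==0.4.0
```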
Acknowledgements
We especially want to thank our technical support team:
Amanda Duarte's PhD grant is funded by the “la Caixa” Foundation through the MSCA actions in the Horizon 2020 Framework Programme of the European Commission.
We gratefully acknowledge the support of NVIDIA Corporation with the donation of the GeForce GTX Titan Z and Titan X GPUs used in this work.
The Image Processing Group at the UPC is an SGR17 Consolidated Research Group recognized by the Government of Catalonia (Generalitat de Catalunya) through its AGAUR office.
This work has been developed in the framework of projects TEC2015-69266-P and TEC2016-75976-R, financed by the Spanish Ministerio de Economía y Competitividad and the European Regional Development Fund (ERDF).