[1] A. van den Oord et al., Neural Discrete Representation Learning, NeurIPS 2017
[2] L. Liu et al., Bridging Discrete and Backpropagation: Straight-Through and Beyond, NeurIPS 2023
[3] P. Esser et al., Taming Transformers for High-Resolution Image Synthesis, CVPR 2021
[4] V. Iashin et al., Taming Visually Guided Sound Generation, BMVC 2021
[5] N. Zeghidour et al., SoundStream: An End-to-End Neural Audio Codec, arXiv:2107.03312 2021
[6] C. Wang et al., Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers, arXiv:2301.02111 2023
[7] A. Ramesh et al., Zero-Shot Text-to-Image Generation, ICML 2021
[8] S. Gu et al., Vector Quantized Diffusion Model for Text-to-Image Synthesis, CVPR 2022
[9] R. Kumar et al., High-Fidelity Audio Compression with Improved RVQGAN, NeurIPS 2023
[10] J. Kong et al., HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis, NeurIPS 2020
[11] S. Lee et al., BigVGAN: A Universal Neural Vocoder with Large-scale Training, ICLR 2023
[12] J. Copet et al., Simple and Controllable Music Generation, NeurIPS 2023
[13] A. Défossez et al., High Fidelity Neural Audio Compression, arXiv:2210.13438 2022
[14] S. Han et al., The Interface for Symbolic Music Loop Generation Conditioned on Musical Metadata, NeurIPS Workshop on ML4CD 2023