Synth vox

The printable version is no longer supported and may have rendering errors. Please update your browser bookmarks and please use the default browser print function instead.

Things and Stuff Wiki - An organically evolving personal wiki knowledge base. An on-the-fly taxonomy containing a patchwork trail of topic outlines, descriptions, notes, stubs and breadcrumbs, with links to sites, systems, software, manuals, organisations, people, articles, guides, slides, papers, books, comments, videos, screencasts, webcasts, scratchpads and more. Content is orientated towards mostly free/libre/open, mostly Linux. Quality and age varies drastically. Sometimes old things are first, sometimes last. Use the Table of Contents menu to navigate long pages. Zoom in if text is too small. Dead link? Wayback Machine. I probably need to fix the theme CSS after an update. See also libreav.org. Chat to msg me (not checking tho atm). e

Resources

Weather

Synthesis

http://en.wikipedia.org/wiki/Speech_synthesis

to sort/categorise

https://forums.homeseer.com/showthread.php?t=175012

https://archive.org/details/flexibleformants00lalw - Flexible formant synthesizer : a tool for improving speech production quality
https://archive.org/details/flexiblehighqual00hsie - A flexible and high quality articulatory speech synthesizer

https://github.com/guan-yuan/Awesome-Singing-Voice-Synthesis-and-Singing-Voice-Conversion - A paper and project list about the cutting edge Speech Synthesis, Text-to-Speech (TTS), Singing Voice Synthesis (SVS), Voice Conversion (VC), Singing Voice Conversion (SVC), and related interesting works (such as Music Synthesis, Automatic Music Transcription, Automatic MOS Prediction, SSL-based ASR, ...etc).

Voder

https://en.wikipedia.org/wiki/Voder - the first attempt to electronically synthesize human speech by breaking it down into its acoustic components. It was invented by Homer Dudley in 1937–1938 and developed on his earlier work on the vocoder. The quality of the speech was limited; however, it demonstrated the synthesis of the human voice, which became one component of the vocoder used in voice communications for security and to save bandwidth.

Voder Speech Synthesizer (1939) - an early attempt at speech synthesis developed by Bell Telephone Laboratory for the 1939-40 New York World's Fair. Controlled by hand, the operator manually forms each syllable using complex button sequences and it would take about a year of practice to able to produce fluid speech. Helen Harper was one of the first people to figure out how to operate the Voder effectively, and was the live demonstrator at the World's Fair. She later went on to form a year-long course instructing women to use the Voder. Of 100 applied students, only 20 were able to graduate and match Harper's skills

https://github.com/gmoe/voder - The Voder was an early attempt at speech synthesis developed by Bell Telephone Laboratory for the 1939 New York World's Fair. Controlled by hand, the operator manually forms each syllable using complex button sequences, and it would take about a year of practice to able to produce fluid speech. This website allows you to put yourself in the shoes of the few women capable of operating the Voder.

SAM

https://en.wikipedia.org/wiki/Software_Automatic_Mouth - or SAM, is a speech synthesis program developed and sold by Don’t Ask Software. The program was released for the Apple II, Lisa, Atari 8-bit family, and Commodore 64.

SAM: Software Automatic Mouth - WASM

rsynth

rsynth - Text-to-Speech.
- https://github.com/rhdunn/rsynth - fork
- YouTube: Virtualized dragging brake equipment detector using rsynth

Festival

Festival - or The Festival Speech Synthesis System, offers a general framework for building speech synthesis systems as well as including examples of various modules. As a whole it offers full text to speech through a number APIs: from shell level, though a Scheme command interpreter, as a C++ library, from Java, and an Emacs interface. Festival is multi-lingual (currently English (British and American), and Spanish) though English is the most advanced. Other groups release new languages for the system.

https://vocaloid.fandom.com/wiki/Festival_Speech_Synthesis_System

Festvox - aims to make the building of new synthetic voices more systemic and better documented, making it possible for anyone to build a new voice.

Rocaloid

Rocaloid - a free, open-source singing voice synthesis system. Its ultimate goal is to fast synthesize natural, flexible and multi-lingual vocal parts. Like other vocal synthesizing software, after installing the vocal database, inputting lyrics and pitch, you can synthesize attractive vocal parts. What’s more, Rocaloid highlights on providing you more controllable parameters which enabling to take control of exquisite dimensions of the synthesized voice and export with better quality. By using a fully constructed Rocaloid Database, you can synthesize singing voice in any phonetic-based languages.
- https://github.com/Rocaloid - dead?

Festvox

Festvox - aims to make the building of new synthetic voices more systemic and better documented, making it possible for anyone to build a new voice. Specifically we offer: Documentation, including scripts explaining the background and specifics for building new voices for speech synthesis in new and supported languages. Example speech databases to help building new voices. Links, demos and a repository for new voices. This work is firmly grounded within Edinburgh University's Festival Speech Synthesis System and Carnegie Mellon University's small footprint Flite synthesis engine.

MaryTTS

MaryTTS - an open-source, multilingual Text-to-Speech Synthesis platform written in Java. It was originally developed as a collaborative project of DFKI’s Language Technology Lab and the Institute of Phonetics at Saarland University. It is now maintained by the Multimodal Speech Processing Group in the Cluster of Excellence MMCI and DFKI.

https://github.com/marytts/marytts-txt2wav - An example project demonstrating use of MaryTTS in a deliberately standalone application

https://github.com/synesthesiam/marytts-txt2wav - Command-line utility for text to speech with MaryTTS

eSpeak

eSpeak - a compact open source software speech synthesizer for English and other languages, for Linux and Windows. eSpeak uses a "formant synthesis" method. This allows many languages to be provided in a small size. The speech is clear, and can be used at high speeds, but is not as natural or smooth as larger synthesizers which are based on human speech recordings.

https://github.com/divVerent/ecantorix - a singing synthesis frontend for espeak. It works by using espeak to generate raw speech samples, then adjusting their pitch and length and finally creating a LMMS project file referencing the samples in sync to the input file.

OpenSource SpeechSynth

http://web.media.mit.edu/~stefanm/osss/

MBROLA

http://tcts.fpms.ac.be/synthesis/ - The MBROLA Project
- https://github.com/numediart/MBROLA
- https://en.wikipedia.org/wiki/MBROLA

Assistive Context-Aware Toolkit

Assistive Context-Aware Toolkit (ACAT) - an open source platform developed at Intel Labs to enable people with motor neuron diseases and other disabilities to have full access to the capabilities and applications of their computers through very constrained interfaces suitable for their condition. More specifically, ACAT enables users to easily communicate with others through keyboard simulation, word prediction and speech synthesis. Users can perform a range of tasks such as editing, managing documents, navigating the Web and accessing emails. ACAT was originally developed by researchers at Intel Labs for Professor Stephen Hawking, through a very iterative design process over the course of three years.
- http://blogs.msdn.com/b/cdndevs/archive/2015/08/14/intel-just-open-sourced-stephen-hawking-s-speech-system-and-it-s-a-net-4-5-winforms-app.aspx [1]

Praat

Praat - doing phonetics by computer

Gnuspeech

gnuspeech - makes it easy to produce high quality computer speech output, design new language databases, and create controlled speech stimuli for psychophysical experiments. gnuspeechsa is a cross-platform module of gnuspeech that allows command line, or application-based speech output. The software has been released as two tarballs that are available in the project Downloads area of http://savannah.gnu.org/projects/gnuspeech. [2]

Project Merlin

Project Merlin - A truly free virtual singer, no matter how you want to send all kinds of ideas. [3]
- https://github.com/ProjectMeilin - not fully open yet?
- http://www.cstr.ed.ac.uk/projects/merlin/
- http://ml.cs.yamanashi.ac.jp/world/english
- YouTube: UTAU】Honeyworks ママ ver.acoustic 【徵音梅林cover】
- YouTube: 【徴音梅林】Umbrella カバー

UTAU

https://en.wikipedia.org/wiki/Utau - a Japanese singing synthesizer application created by Ameya/Ayame. This program is similar to the Vocaloid software, with the difference that it is shareware instead of being released under third party licensing

https://github.com/stakira/OpenUtau - Open singing synthesis platform / Open source UTAU successor

https://github.com/topics/utau

Sinsy

Sinsy - HMM-based Singing Voice Synthesis System
- http://sinsy.sourceforge.net
- https://github.com/zyamusic/sinsy

Sinsy - Singing Voice Synthesizer - how to

https://github.com/hyperzlib/Sinsy-Remix - The HMM-Based Singing Voice Syntheis System Remix "Sinsy-r"

https://github.com/YuzukiTsuru/SingSyn - The HMM-Based Singing Voice Synthesis System "SingSyn" Base On Sinsy

https://github.com/mathigatti/midi2voice - Python script that relies on the sinsy.jp website from the Nagoya Institute of Technology which implements a HMM-based Singing Voice Synthesis System.

UTSU

UTSU - A cross-platform vocal synth frontend | UtaForum.net

qtau

https://notabug.org/isengaara/qtau - Qt based UTAU clone also supports FESTIVAL and MBROLA voices

cadencii

https://github.com/cadencii/cadencii - simple musical score editor for singing synthesis: VOCALOID, VOCALOID2, UTAU, STRAIGHT with UTAU, and AquesTone are available as synthesizer.

Mozilla TTS

https://github.com/mozilla/TTS - Deep learning for Text to Speech

CMU Flite

CMU Flite - a small, fast run-time open source text to speech synthesis engine developed at CMU and primarily designed for small embedded machines and/or large servers. Flite is designed as an alternative text to speech synthesis engine to Festival for voices built using the FestVox suite of voice building tools.
- https://github.com/festvox/flite

mesing

https://github.com/usdivad/mesing

Adobe VoCo

IPOX

IPOX - an experimental, all-prosodic speech synthesizer, developed many years ago by Arthur Dirksen and John Coleman. It is still available for downloading, and was designed to run on a 486 PC running Windows 3.1 or higher, with a 16-bit Windows-compatible sound card, such as the Soundblaster 16. It still seems to run on e.g. XP, but I haven't tried it on Vista.

NPSS

Neural Parametric Singing Synthesizer

https://github.com/seaniezhao/torch_npss - pytorch implementation of Neural Parametric Singing Synthesizer 歌声合成

A Neural Parametric Singing Synthesizer

Pink Trombone

Pink Trombone - Bare-handed procedural speech synthesis, version 1.1, March 2017, by Neil Thapen
- https://github.com/giuliomoro/pink-trombone

Klatter

https://github.com/fundamental/klatter - a bare bones formant synthesizer based upon the description given in the 1979 paper "Software For a Cascade/Parallel Formant Synthesizer" by Dennis Klatt. This program was not designed for interactive use, though there is code for some minimal midi control. In it's current state, it is enough of a curiosity that it will be preserved, though it may not see much if any use.

Tacotron 2

https://github.com/Rayhane-mamah/Tacotron-2 - DeepMind's Tacotron-2 Tensorflow implementation

https://github.com/NVIDIA/tacotron2 - Tacotron 2 - PyTorch implementation with faster-than-realtime inference

SqueezeWave

https://github.com/tianrengao/SqueezeWave - Automatic speech synthesis is a challenging task that is becoming increasingly important as edge devices begin to interact with users through speech. Typical text-to-speech pipelines include a vocoder, which translates intermediate audio representations into an audio waveform. Most existing vocoders are difficult to parallelize since each generated sample is conditioned on previous samples. WaveGlow is a flow-based feed-forward alternative to these auto-regressive models (Prenger et al., 2019). However, while WaveGlow can be easily parallelized, the model is too expensive for real-time speech synthesis on the edge. This paper presents SqueezeWave, a family of lightweight vocoders based on WaveGlow that can generate audio of similar quality to WaveGlow with 61x - 214x fewer MACs.

WaveGlow

https://github.com/NVIDIA/waveglow - In our recent paper, we propose WaveGlow: a flow-based network capable of generating high quality speech from mel-spectrograms. WaveGlow combines insights from Glow and WaveNet in order to provide fast, efficient and high-quality audio synthesis, without the need for auto-regression. WaveGlow is implemented using only a single network, trained using only a single cost function: maximizing the likelihood of the training data, which makes the training procedure simple and stable.

STT

https://github.com/coqui-ai/STT -a deep learning toolkit for Speech-to-Text, battle-tested in research and production

Real-Time-Voice-Cloning

https://github.com/CorentinJ/Real-Time-Voice-Cloning - Clone a voice in 5 seconds to generate arbitrary speech in real-time

rapping-neural-network

https://github.com/robbiebarrat/rapping-neural-network - Rap song writing recurrent neural network trained on Kanye West's entire discography

yukarin

https://github.com/Hiroshiba/yukarin - This repository is refactoring the training code for the first stage model of Bcome Yukarin: Convert your voice to favorite voice.

leesampler

https://github.com/YuzukiTsuru/lessampler - a Singing Voice Synthesizer

VoiceOfFaust

https://github.com/magnetophon/VoiceOfFaust - Turn your voice into a synthesizer!

tomomibot

https://github.com/adzialocha/tomomibot - Artificial intelligence bot for live voice improvisation.

Nanceloid

https://github.com/MegaLoler/Nanceloid - a vocal synth using digital waveguides n stuff lol

Parakeet

https://github.com/PaddlePaddle/Parakeet - PAddle PARAllel text-to-speech toolKIT (supporting WaveFlow, ClariNet, WaveNet, Deep Voice 3, Transformer TTS and FastSpeech)

PaddleSpeech

https://github.com/PaddlePaddle/PaddleSpeech - A Speech Toolkit based on PaddlePaddle.

Flowtron

https://github.com/NVIDIA/flowtron - an Autoregressive Flow-based Network for Text-to-Mel-spectrogram Synthesis

TransformerTTS

https://github.com/as-ideas/TransformerTTS - Implementation of a non-autoregressive Transformer based neural network for text to speech.

TensorflowTTS

https://github.com/dathudeptrai/TensorflowTTS - TensorflowTTS: Real-Time State-of-the-art Speech Synthesis for Tensorflow 2

AutoSpeech

https://github.com/VITA-Group/AutoSpeech

HiFi-GAN

https://github.com/george-roussos/hifi-gan - a GAN-based model capable of generating high fidelity speech efficiently.

https://github.com/rhasspy/hifi-gan -Version of Hi-Fi GAN designed to work with; tacotron2-train, glow-tts-train

https://github.com/rhasspy/tacotron2-train - An implementation of Tacotron2 designed to work with Gruut

Wave-U-net-TF2

https://github.com/satvik-venkatesh/Wave-U-net-TF2 -implements the Wave-U-net architecture in TensorFlow 2

larynx

https://github.com/rhasspy/larynx - End to end text to speech system using gruut and onnx

FastSpeech2

https://github.com/ming024/FastSpeech2 - An implementation of Microsoft's "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech"

TensorVox

https://github.com/ZDisket/TensorVox - an application designed to enable user-friendly and lightweight neural speech synthesis in the desktop, aimed at increasing accessibility to such technology.Powered by TensorflowTTS, it is written in pure C++/Qt, using the Tensorflow C API for interacting with the models. This way, we can perform inference without having to install gigabytes worth of pip libraries, just a 100MB DLL.

Coqui TTS

Coqui TTS - an open source embedded (offline, on-device) speech-to-text engine which can run in real time on devices ranging from a Raspberry Pi 4 to high power GPU servers.Coqui STT has APIs for numerous languages (Python, C/C++, Java, JavaScript, .NET...), is supported on many platforms (Linux, macOS, Windows, ARM...), and is available on GitHub.
- https://github.com/coqui-ai/TTS - a deep learning toolkit for Text-to-Speech, battle-tested in research and production

VITS

https://github.com/jaywalnut310/vits - VITS: Conditional Variational Autoencoder with Adversarial Learning for End-to-End Text-to-Speech

FreeVC

https://github.com/OlaWod/FreeVC - Towards High-Quality Text-Free One-Shot Voice Conversion. In this paper, we adopt the end-to-end framework of VITS for high-quality waveform reconstruction, and propose strategies for clean content information extraction without text annotation. We disentangle content information by imposing an information bottleneck to WavLM features, and propose the spectrogram-resize based data augmentation to improve the purity of extracted content information.

phonemizer

https://github.com/bootphon/phonemizer

espeak-phonemizer

https://github.com/rhasspy/espeak-phonemizer - Uses ctypes and libespeak-ng to transform test into IPA phonemes

nnsvs

https://github.com/r9y9/nnsvs - Neural network-based singing voice synthesis library for research.

unagan

https://github.com/ciaua/unagan - contains the code and samples for our paper "Unconditional Audio Generation with GAN and Cycle Regularization", accepted by INTERSPEECH 2020. The goal is to unconditionally generate singing voices, speech, and instrument sounds with GAN.The model is implemented with PyTorch.

vocshape

https://github.com/PaulBatchelor/vocshape - a very simple proof-of-concept musical instrument for Android that aims to demonstrate the sculptability of a simple articulatory synthesis physical model for vocal synthesis.

YouTube: vocshape demo3

lexconvert

https://github.com/ssb22/lexconvert - Convert phoneme codes and lexicon formats for English speech synths

HiFiSinger

https://github.com/CODEJIN/HiFiSinger - This code is an unofficial implementation of HiFiSinger. The algorithm is based on the following papers:Chen, J., Tan, X., Luan, J., Qin, T., & Liu, T. Y. (2020). HiFiSinger: Towards High-Fidelity Neural Singing Voice Synthesis. arXiv preprint arXiv:2009.01776.Ren, Y., Ruan, Y., Tan, X., Qin, T., Zhao, S., Zhao, Z., & Liu, T. Y. (2019). Fastspeech: Fast, robust and controllable text to speech. Advances in Neural Information Processing Systems, 32, 3171-3180.Yamamoto, R., Song, E., & Kim, J. M. (2020, May). Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram. In ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) (pp. 6199-6203). IEEE.

PortaSpeech

https://github.com/keonlee9420/PortaSpeech - PyTorch Implementation of PortaSpeech: Portable and High-Quality Generative Text-to-Speech

KaraSinger

https://github.com/jerrygood0703/KaraSinger - SCORE-FREE SINGING VOICE SYNTHESIS WITH VQ-VAE USING MEL-SPECTROGRAMS Demo

hifigan

https://github.com/bshall/hifigan - An 16kHz implementation of HiFi-GAN for soft-vc.

VOICEVOX

VOICEVOX - 無料で使える中品質なテキスト読み上げソフトウェア] - 無料で使える中品質なテキスト読み上げソフトウェア
- https://github.com/VOICEVOX/voicevox_core

Comprehensive-Transformer-TTS

https://github.com/keonlee9420/Comprehensive-Transformer-TTS - A Non-Autoregressive Transformer based TTS, supporting a family of SOTA transformers with supervised and unsupervised duration modelings.

DiffGAN-TTS

https://github.com/keonlee9420/DiffGAN-TTS - PyTorch Implementation of DiffGAN-TTS: High-Fidelity and Efficient Text-to-Speech with Denoising Diffusion GANs

DiffSinger

https://github.com/MoonInTheRiver/DiffSinger - DiffSinger: Singing Voice Synthesis via Shallow Diffusion Mechanism (SVS & TTS); AAAI 2022; Official code

tortoise-tts

https://github.com/neonbjb/tortoise-tts - A multi-voice TTS system trained with an emphasis on quality

https://github.com/manmay-nakhashi/tortoise-tts-fastest - Faster Tortoise inference then Tortoise Fast Fork

TTS

https://github.com/coqui-ai/TTS - a deep learning toolkit for Text-to-Speech, battle-tested in research and production

VALL-E

VALL-E - Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers - We introduce a language modeling approach for text to speech synthesis (TTS). Specifically, we train a neural codec language model (called VALL-E) using discrete codes derived from an off-the-shelf neural audio codec model, and regard TTS as a conditional language modeling task rather than continuous signal regression as in previous work. During the pre-training stage, we scale up the TTS training data to 60K hours of English speech which is hundreds of times larger than existing systems. VALL-E emerges in-context learning capabilities and can be used to synthesize high-quality personalized speech with only a 3-second enrolled recording of an unseen speaker as an acoustic prompt. Experiment results show that VALL-E significantly outperforms the state-of-the-art zero-shot TTS system in terms of speech naturalness and speaker similarity. In addition, we find VALL-E could preserve the speaker's emotion and acoustic environment of the acoustic prompt in synthesis.
- https://github.com/microsoft/unilm/tree/master/valle

RobotVoice

https://github.com/polvanrijn/RobotVoice - Verified Set of TTS Voices for Common Robots

Piper

Piper - a fast and local text to speech system.
- https://github.com/rhasspy/piper

Natural Speech 2 - Pytorch

https://github.com/lucidrains/naturalspeech2-pytorch - Implementation of Natural Speech 2, Zero-shot Speech and Singing Synthesizer, in Pytorch

WG-WaveNet

Audio samples from "WG-WaveNet: Real-Time High-Fidelity Speech Synthesis without GPU" - Abstract: In this paper, we propose WG-WaveNet, a fast, lightweight, and high-quality waveform generation model. WG-WaveNet is composed of a compact flow-based model and a post-filter. The two components are jointly trained by maximizing the likelihood of the training data and optimizing loss functions on the frequency domains. As we design a flow-based model that is heavily compressed, the proposed model requires much less computational resources compared to other waveform generation models during both training and inference time; even though the model is highly compressed, the post-filter maintains the quality of generated waveform. Our PyTorch implementation can be trained using less than 8 GB GPU memory and generates audio samples at a rate of more than 5000 kHz on an NVIDIA 1080Ti GPU. Furthermore, even if synthesizing on a CPU, we show that the proposed method is capable of generating 44.1 kHz speech waveform 1.2 times faster than real-time. Experiments also show that the quality of generated audio is comparable to those of other methods.
- https://github.com/BogiHsu/WG-WaveNet

Voicebox

https://github.com/lucidrains/voicebox-pytorch - Implementation of Voicebox, new SOTA Text-to-Speech model from MetaAI, in Pytorch. In this work, we will use rotary embeddings. The authors seem unaware that ALiBi cannot be straightforwardly used for bidirectional models.

EmotiVoice

https://github.com/netease-youdao/EmotiVoice - a powerful and modern open-source text-to-speech engine. EmotiVoice speaks both English and Chinese, and with over 2000 different voices. The most prominent feature is emotional synthesis, allowing you to create speech with a wide range of emotions, including happy, excited, sad, angry and others. An easy-to-use web interface is provided. There is also a scripting interface for batch generation of results.

UniSpeech

https://github.com/microsoft/UniSpeech - Large Scale Self-Supervised Learning for Speech

Laughter

Laughter synthesis using Pseudo Phonetic Tokens

Audio Samples of the paper "Laughter Synthesis using Pseudo Phonetic Tokens with a Large-scale In-the-wild Laughter Corpus" - We present a large-scale in-the-wild Japanese laughter corpus and a laughter synthesis method. Previous work on laughter synthesis lacks not only data but also proper ways to represent laughter. To solve these problems, we first propose an in-the-wild corpus comprising 3.5 hours of laughter, which is to our best knowledge the largest laughter corpus designed for laughter synthesis. We then propose pseudo phonetic tokens (PPTs) to represent laughter by a sequence of discrete tokens, which are obtained by training a clustering model on features extracted from laughter by a pretrained self-supervised model. Laughter can then be synthesized by feeding PPTs into a text-to-speech system. We further show PPTs can be used to train a language model for unconditional laughter generation. Results of comprehensive subjective and objective evaluations demonstrate that the proposed method significantly outperforms a baseline method, and can generate natural laughter unconditionally.
- https://github.com/Aria-K-Alethia/laughter-synthesis

Conversion

crank

https://github.com/k2kobayashi/crank - Non-parallel voice conversion based on vector-quantized variational autoencoder

MelGAN-VC

https://github.com/marcoppasini/MelGAN-VC - : Voice Conversion and Audio Style Transfer on arbitrarily long samples using Spectrograms

World

https://github.com/mmorise/World - free software for high-quality speech analysis, manipulation and synthesis. It can estimate Fundamental frequency (F0), aperiodicity and spectral envelope and also generate the speech like input speech with only estimated parameters.

Scyclone

https://github.com/Miralan/Scyclone - (Voice Conversion)

Shallow WaveNet Vocoder

https://github.com/patrickltobing/shallow-wavenet - Shallow WaveNet Vocoder with Laplacian Distribution using Multiple Samples Output based on Linear Prediction / with Softmax Output

speech-resynthesis

https://github.com/facebookresearch/speech-resynthesis - Implementation of the method described in the Speech Resynthesis from Discrete Disentangled Self-Supervised Representations.

SoftVC VITS Singing Voice Conversion

https://github.com/svc-develop-team/so-vits-svc - differs fundamentally from VITS, as it focuses on Singing Voice Conversion (SVC) rather than Text-to-Speech (TTS). In this project, TTS functionality is not supported, and VITS is incapable of performing SVC tasks. It's important to note that the models used in these two projects are not interchangeable or universally applicable.

Retrieval-based-Voice-Conversion

https://github.com/w-okada/voice-changer

https://github.com/RVC-Project/Retrieval-based-Voice-Conversion-WebUI - An easy-to-use Voice Conversion framework based on VITS.

Source-Filter HiFi-GAN / SiFi-GAN

-2210.15533- Source-Filter HiFi-GAN: Fast and Pitch Controllable High-Fidelity Neural Vocoder - Our previous work, the unified source-filter GAN (uSFGAN) vocoder, introduced a novel architecture based on the source-filter theory into the parallel waveform generative adversarial network to achieve high voice quality and pitch controllability. However, the high temporal resolution inputs result in high computation costs. Although the HiFi-GAN vocoder achieves fast high-fidelity voice generation thanks to the efficient upsampling-based generator architecture, the pitch controllability is severely limited. To realize a fast and pitch-controllable high-fidelity neural vocoder, we introduce the source-filter theory into HiFi-GAN by hierarchically conditioning the resonance filtering network on a well-estimated source excitation information. According to the experimental results, our proposed method outperforms HiFi-GAN and uSFGAN on a singing voice generation in voice quality and synthesis speed on a single CPU. Furthermore, unlike the uSFGAN vocoder, the proposed method can be easily adopted/integrated in real-time applications and end-to-end systems.

https://github.com/vtuber-plan/hifi-gan - An High-resolution implementation of HiFi-GAN Vocoder for Voice Conversion.

https://github.com/chomeyama/SiFiGAN - This repo provides official PyTorch implementation of SiFi-GAN, a fast and pitch controllable high-fidelity neural vocoder.