Speech

Things and Stuff Wiki - An organically evolving personal wiki knowledge base. An on-the-fly taxonomy containing a patchwork trail of topic outlines, descriptions, notes, stubs and breadcrumbs, with links to sites, systems, software, manuals, organisations, people, articles, guides, slides, papers, books, comments, videos, screencasts, webcasts, scratchpads and more. Content is orientated towards mostly free/libre/open, mostly Linux. Quality and age varies drastically. Sometimes old things are first, sometimes last.

General

https://github.com/jim-schwoebel/voicebook

https://old.reddit.com/r/speechtech

https://github.com/codeforequity-at/botium-speech-processing - a unified, developer-friendly API to the best available Speech-To-Text and Text-To-Speech services. [1]

https://github.com/gooofy/py-nltools - A collection of basic python modules for spoken natural language processing

https://github.com/facebookresearch/seamless_communication - our foundational all-in-one Massively Multilingual and Multimodal Machine Translation model delivering high-quality translation for speech and text in nearly 100 languages. SeamlessM4T models support the tasks of: Speech-to-speech translation (S2ST), Speech-to-text translation (S2TT), Text-to-speech translation (T2ST), Text-to-text translation (T2TT), Automatic speech recognition (ASR)

Analysis

ESPS

http://www.speech.cs.cmu.edu/comp.speech/Section1/Labs/esps.html

ESPS - Entropic Signal Processing System, is a package of UNIX-like commands and programming libraries for speech signal processing. As a commercial product of Entropic Research Laboratory, Inc., it became extremely widely used in phonetics and speech technology research laboratories in the 1990s, in view of the wide range of functions it offered, such as get_f0 (for fundamental frequency estimation), formant (for formant frequency measurement), the xwaves graphical user interface, and many other commands and utilities. Following the acquisition of Entropic by Microsoft in 1999, Microsoft and AT&T licensed ESPS to the Centre for Speech Technology at KTH, Sweden, so that a final legacy version of the ESPS source code could continue to be made available to speech researchers. At KTH, code from the ESPS library (such as get_f0) was incorporated by Kåre Sjölander and Jonas Beskow into the Wavesurfer speech analysis tool. This is a very good alternative way to use many ESPS functions if you want a graphical user interface rather than scripting.

https://github.com/jeremysalwen/ESPS - This archive contains source files from the ESPS toolkit.

Speech Research Tools

https://sourceforge.net/projects/speechresearch - Software for speech research. It includes programs and libraries for signal processing, along with general purpose scientific libraries. Most of the code is in Python, with C/C++ supporting code. Also, contains code releases corresponding to publishe

HTK Speech Recognition Toolkit

HTK Speech Recognition Toolkit - a portable toolkit for building and manipulating hidden Markov models. HTK is primarily used for speech recognition research although it has been used for numerous other applications including research into speech synthesis, character recognition and DNA sequencing. HTK is in use at hundreds of sites worldwide. HTK consists of a set of library modules and tools available in C source form. The tools provide sophisticated facilities for speech analysis, HMM training, testing and results analysis. The software supports HMMs using both continuous density mixture Gaussians and discrete distributions and can be used to build complex HMM systems. The HTK release contains extensive documentation and examples.
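To make the discrete-distribution HMM case concrete, here is a minimal sketch of the forward algorithm, which computes the likelihood of an observation sequence under an HMM. It is not HTK code and does not use HTK's file formats or tools; the function name `forward_likelihood` and the toy two-state model are assumptions made purely for illustration.

```python
def forward_likelihood(obs, start, trans, emit):
    """Likelihood of an observation sequence under a discrete HMM.

    obs   : list of observation symbol indices
    start : start[s] is the probability of starting in state s
    trans : trans[p][s] is the probability of moving from state p to state s
    emit  : emit[s][o] is the probability that state s emits symbol o
    """
    n_states = len(start)
    # Initialisation: probability of starting in each state and emitting obs[0].
    alpha = [start[s] * emit[s][obs[0]] for s in range(n_states)]
    # Recursion: sum over all predecessor states, then emit the next symbol.
    for o in obs[1:]:
        alpha = [
            sum(alpha[p] * trans[p][s] for p in range(n_states)) * emit[s][o]
            for s in range(n_states)
        ]
    # Termination: sum over all final states.
    return sum(alpha)


# Hypothetical toy model: 2 states, 2 observation symbols.
START = [0.6, 0.4]
TRANS = [[0.7, 0.3], [0.4, 0.6]]
EMIT = [[0.9, 0.1], [0.2, 0.8]]

print(forward_likelihood([0, 1, 0], START, TRANS, EMIT))
```

Real systems work with log probabilities (or scaling) to avoid underflow on long utterances; the plain products here are only suitable for short toy sequences.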

speechrate

speechrate - software for the analysis of speech. Below you will find a script that automatically detects syllable nuclei in order to measure speech rate without the need of a transcription. Peaks in intensity (dB) that are preceded and followed by dips in intensity are considered as potential syllable nuclei. The script subsequently discards peaks that are not voiced. On this page you find an example of how the script works.
- Praat: doing Phonetics by Computer
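The peak-picking idea described above can be sketched in a few lines. This is a simplified illustration, not the Praat script itself: the function name `syllable_nuclei`, the 2 dB `min_dip_db` threshold, and the per-frame `voiced` flags are assumptions made for the example, and a real intensity contour and voicing decision would come from an analysis tool such as Praat.

```python
def syllable_nuclei(intensity_db, voiced, min_dip_db=2.0):
    """Return frame indices of likely syllable nuclei.

    intensity_db : per-frame intensity values in dB
    voiced       : per-frame booleans, True where the frame is voiced
    min_dip_db   : how far intensity must drop on both sides of a peak
                   (hypothetical default, chosen for illustration)
    """
    last = len(intensity_db) - 1
    # Candidate peaks: frames louder than the previous frame and at least
    # as loud as the next one.
    peaks = [
        i for i in range(1, last)
        if intensity_db[i - 1] < intensity_db[i] >= intensity_db[i + 1]
    ]
    nuclei = []
    for k, i in enumerate(peaks):
        left_start = peaks[k - 1] if k > 0 else 0
        right_end = peaks[k + 1] if k + 1 < len(peaks) else last
        # Deepest dip between this peak and its neighbouring peaks.
        left_dip = min(intensity_db[left_start:i + 1])
        right_dip = min(intensity_db[i:right_end + 1])
        preceded = intensity_db[i] - left_dip >= min_dip_db
        followed = intensity_db[i] - right_dip >= min_dip_db
        # Keep only peaks with dips on both sides that are also voiced.
        if preceded and followed and voiced[i]:
            nuclei.append(i)
    return nuclei


def speech_rate(nuclei, duration_seconds):
    """Syllables per second, given detected nuclei and total duration."""
    return len(nuclei) / duration_seconds
```

For example, with `intensity_db = [50, 60, 52, 61, 59.5, 62, 50, 58, 49]` and the last two frames unvoiced, the peak at frame 3 is rejected because the following dip is only 1.5 dB, and the peak at frame 7 is rejected because it is unvoiced.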

HAT

Higgins Annotation Tool - can be used to transcribe and annotate speech with one or more audio tracks (such as dialogue). Windows.

SHIRO

https://github.com/Sleepwalking/SHIRO - Phoneme-to-speech alignment toolkit based on liblrhsmm

WARP-Q

https://github.com/wjassim/WARP-Q - code is to run the WARP-Q speech quality metric. WARP-Q (Quality Prediction For Generative Neural Speech Codecs) is an objective, full-reference metric for perceived speech quality. It uses a subsequence dynamic time warping (SDTW) algorithm as a similarity between a reference (original) and a test (degraded) speech signal to produce a raw quality score. It is designed to predict quality scores for speech signals processed by low bit rate speech coders.
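As a rough illustration of subsequence dynamic time warping, the sketch below finds the lowest-cost alignment of a short query anywhere inside a longer reference. It is not the WARP-Q implementation: WARP-Q works on spectral features of speech, whereas this example uses plain numbers with an absolute-difference cost, and the function name `subsequence_dtw` is an assumption made for the example.

```python
def subsequence_dtw(query, reference):
    """Align `query` to the best-matching subsequence of `reference`.

    Returns (cost, end_index), where end_index is the position in
    `reference` at which the best alignment ends.
    """
    n, m = len(query), len(reference)
    inf = float("inf")
    acc = [[inf] * m for _ in range(n)]
    # Free start: the alignment may begin at any reference position,
    # which is what distinguishes subsequence DTW from ordinary DTW.
    for j in range(m):
        acc[0][j] = abs(query[0] - reference[j])
    for i in range(1, n):
        acc[i][0] = acc[i - 1][0] + abs(query[i] - reference[0])
        for j in range(1, m):
            step = min(acc[i - 1][j], acc[i][j - 1], acc[i - 1][j - 1])
            acc[i][j] = abs(query[i] - reference[j]) + step
    # Free end: take the cheapest cell in the last row.
    end = min(range(m), key=lambda j: acc[n - 1][j])
    return acc[n - 1][end], end


cost, end = subsequence_dtw([1, 2, 3], [0, 0, 1, 2, 3, 0])
print(cost, end)
```

A lower accumulated cost means the query was found with less distortion, which is the intuition behind using the alignment cost as a raw similarity score.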

phonemes2ids

https://github.com/rhasspy/phonemes2ids - Flexible tool for assigning integer ids to phonemes

libllsm2

https://github.com/Sleepwalking/libllsm2 - Low Level Speech Model (version 2.1) for high quality speech analysis-synthesis

Recognition

https://en.wikipedia.org/wiki/Speech_recognition - an interdisciplinary subfield of computational linguistics that develops methodologies and technologies that enable the recognition and translation of spoken language into text by computers. It is also known as automatic speech recognition (ASR), computer speech recognition or speech to text (STT). It incorporates knowledge and research in the linguistics, computer science, and electrical engineering fields.

https://en.wikipedia.org/wiki/Speech_recognition_software_for_Linux

https://github.com/Picovoice/speech-to-text-benchmark - a minimalist and extensible framework for benchmarking different speech-to-text engines. It has been developed and tested on Ubuntu using Python3. [2]
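Speech-to-text benchmarks commonly report word error rate (WER): the word-level edit distance between the engine's output and a reference transcript, divided by the number of reference words. The sketch below is a generic illustration and not code from the benchmark repository; the function name `word_error_rate` and the simple whitespace tokenisation are assumptions.

```python
def word_error_rate(reference, hypothesis):
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")
    # Levenshtein distance over words, kept to one row at a time.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = cur[j - 1] + 1
            cur[j] = min(substitution, deletion, insertion)
        prev = cur
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat on mat", "the cat sit on the mat"))
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.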

CMUSphinx

CMUSphinx Open Source Speech Recognition
- http://cmusphinx.sourceforge.net/wiki
- https://github.com/cmusphinx/sphinxbase

https://github.com/cmusphinx/pocketsphinx - a lightweight speech recognition engine, specifically tuned for handheld and mobile devices, though it works equally well on the desktop
- https://github.com/syl22-00/pocketsphinx.js - Speech recognition in JavaScript and WebAssembly

pocketVox

https://github.com/benoitfragit/pocketVox - both an application and a library written in C by Benoit Franquet. It uses Pocketsphinx to do voice recognition. The voice recognition is done offline and doesn't need an Internet connection. Its aim is to provide an efficient way to integrate voice recognition on the Linux desktop. More particularly, its main goal is to give the visually impaired (as I am) a powerful tool to control their desktop.
- http://cmusphinx.sourceforge.net/2014/11/pocketvox-is-listening-you/

Audiogrep

Audiogrep - transcribes audio files and then creates "audio supercuts" based on search phrases. It uses CMU Pocketsphinx for speech-to-text and pydub to stitch things together. [3]
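The search step behind such a supercut can be sketched as follows, assuming the transcription stage has already produced word-level timings. This is not Audiogrep's code: the function name `find_phrase_segments` and the `(word, start, end)` tuple layout are hypothetical, and the actual cutting and joining of audio (which Audiogrep does with pydub) is left out.

```python
def find_phrase_segments(words, phrase):
    """Find every occurrence of `phrase` in a timed transcript.

    words  : list of (word, start_seconds, end_seconds) tuples
    phrase : search phrase, matched case-insensitively on whole words
    Returns a list of (start_seconds, end_seconds) for each match.
    """
    target = phrase.lower().split()
    if not target:
        return []
    tokens = [word.lower() for word, _, _ in words]
    segments = []
    for i in range(len(tokens) - len(target) + 1):
        if tokens[i:i + len(target)] == target:
            first = words[i]
            last = words[i + len(target) - 1]
            # The clip runs from the first word's start to the last word's end.
            segments.append((first[1], last[2]))
    return segments


TRANSCRIPT = [
    ("we", 0.0, 0.2), ("the", 0.2, 0.3), ("people", 0.3, 0.8),
    ("of", 0.8, 0.9), ("the", 0.9, 1.0), ("people", 1.0, 1.5),
]
print(find_phrase_segments(TRANSCRIPT, "the people"))
```

Each returned time range could then be sliced out of the source audio and the slices concatenated to form the supercut.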

Speech Signal Processing Toolkit

Speech Signal Processing Toolkit (SPTK) - a suite of speech signal processing tools for UNIX environments, e.g., LPC analysis, PARCOR analysis, LSP analysis, PARCOR synthesis filter, LSP synthesis filter, vector quantization techniques, and other extended versions of them. This software is released under the Modified BSD license. SPTK was developed and has been used in the research group of Prof. Satoshi Imai (he has retired) and Prof. Takao Kobayashi (currently he is with Interdisciplinary Graduate School of Science and Engineering, Tokyo Institute of Technology) at P&I laboratory, Tokyo Institute of Technology. A sub-set of tools was chosen and arranged for distribution by Prof. Keiichi Tokuda (currently he is with Department of Computer Science and Engineering, Nagoya Institute of Technology) as a coordinator, in cooperation with other collaborators (see "Acknowledgments" and "Who we are" in README). The original source codes have been written by many people who took part in activities of the research group. Most of the original source code in this distribution was written by Takao Kobayashi (graph, data processing, FFT, sampling rate conversion, etc.), Keiichi Tokuda (speech analysis, speech synthesis, etc.), and Kazuhito Koishida (LSP, vector quantization, etc.).
- https://github.com/r9y9/SPTK

julius

https://github.com/julius-speech/julius - Open-Source Large Vocabulary Continuous Speech Recognition Engine

Services

https://cloud.google.com/speech/ [4]

to sort

YouTube: Emily Shea - "Perl Out Loud"

Kaldi

Kaldi - a toolkit for speech recognition, intended for use by speech recognition researchers and professionals. [5]
- https://github.com/kaldi-asr/kaldi

https://github.com/alumae/kaldi-gstreamer-server

https://github.com/alphacep/kaldi-android-demo

https://github.com/alphacep/vosk-api - Kaldi API for Android, Python and Node

https://github.com/daanzu/kaldi-active-grammar - Python Kaldi speech recognition with grammars that can be set active/inactive dynamically at decode-time

k2

https://github.com/k2-fsa/k2 - The vision of k2 is to be able to seamlessly integrate Finite State Automaton (FSA) and Finite State Transducer (FST) algorithms into autograd-based machine learning toolkits like PyTorch and TensorFlow. For speech recognition applications, this should make it easy to interpolate and combine various training objectives such as cross-entropy, CTC and MMI and to jointly optimize a speech recognition system with multiple decoding passes including lattice rescoring and confidence estimation. We hope k2 will have many other applications as well.

lhotse

https://github.com/lhotse-speech/lhotse - a Python library aiming to make speech and audio data preparation flexible and accessible to a wider community. Alongside k2, it is a part of the next generation Kaldi speech processing library.

icefall

https://github.com/k2-fsa/icefall - icefall project contains speech-related recipes for various datasets using k2-fsa and lhotse.

RASR

RWTH ASR - The RWTH Aachen University Speech Recognition System, a software package containing a speech recognition decoder together with tools for the development of acoustic models, for use in speech recognition systems. It has been developed by the Human Language Technology and Pattern Recognition Group at the RWTH Aachen University since 2001. Speech recognition systems developed using this framework have been applied successfully in several international research projects and corresponding evaluations.

ESPnet

https://github.com/espnet/espnet - an end-to-end speech processing toolkit that mainly focuses on end-to-end speech recognition and end-to-end text-to-speech. ESPnet uses Chainer and PyTorch as its main deep learning engines, and also follows Kaldi-style data processing, feature extraction/format, and recipes to provide a complete setup for speech recognition and other speech processing experiments.

RETURNN

RETURNN - extensible training framework for universal recurrent neural networks, is a Theano/TensorFlow-based implementation of modern recurrent neural network architectures. It is optimized for fast and reliable training of recurrent neural networks in a multi-GPU environment.
- https://github.com/rwth-i6/returnn

Zamia

https://github.com/gooofy/zamia-speech/ - Open tools and data for cloudless automatic speech recognition

Snowboy

Snowboy Hotword Detection - DNN based hotword and wake word detection toolkit
- https://github.com/kitt-ai/snowboy

voice2json

voice2json - a collection of command-line tools for offline speech/intent recognition on Linux. It is free, open source, and supports 16 languages.

sak

https://github.com/fredvs/sak - Speecher Assistive Kit. With sak, your application becomes assistive directly, without changing anything in your code. sak uses the PortAudio and eSpeak Open Source Libraries.

Respeaker

https://github.com/respeaker - To build voice interface objects

https://github.com/respeaker/get_started_with_respeaker - This is the wiki of ReSpeaker Core V2, ReSpeaker Core and ReSpeaker Mic Array.

https://github.com/respeaker/respeakerd - server application for the microphone array solutions of SEEED, based on librespeaker which combines the audio front-end processing algorithms.

https://github.com/respeaker/respeaker_python_library - To build voice enabled objects/applications with Python and ReSpeaker

https://github.com/respeaker/seeed-voicecard - 2 Mic Hat, 4 Mic Array, 6-Mic Circular Array Kit, and 4-Mic Linear Array Kit for Raspberry Pi

https://github.com/HinTak/seeed-voicecard - an enhancement fork with the explicit aim of supporting current shipping Raspbian/Ubuntu kernels without requiring downgrading.

speech_recognition

https://github.com/Uberi/speech_recognition - Library for performing speech recognition, with support for several engines and APIs, online and offline.

Neural net

WaveNet

DeepMind: WaveNet: A Generative Model for Raw Audio

to sort

Sequence Modeling with CTC - A visual guide to Connectionist Temporal Classification, an algorithm used to train deep neural networks in speech recognition, handwriting recognition and other sequence problems. [6]
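A core piece of CTC is the rule that maps a frame-by-frame alignment to an output sequence: merge consecutive repeated symbols, then drop the blank symbol. The sketch below shows that collapsing rule together with a simple greedy (best-path) decoder. The function names and the choice of `-` as the blank symbol are assumptions for the example, and greedy decoding is only an approximation to finding the most probable output.

```python
def ctc_collapse(alignment, blank="-"):
    """Merge repeated symbols, then remove blanks."""
    output = []
    previous = None
    for symbol in alignment:
        # A symbol is emitted only when it differs from the previous frame
        # and is not the blank; a blank between two identical symbols
        # therefore keeps them as two separate outputs.
        if symbol != previous and symbol != blank:
            output.append(symbol)
        previous = symbol
    return "".join(output)


def ctc_greedy_decode(frame_probs, alphabet, blank="-"):
    """Pick the most probable symbol per frame, then collapse.

    frame_probs : list of per-frame probability lists, one entry per symbol
    alphabet    : symbols in the same order as the probability entries
    """
    best_path = [
        alphabet[max(range(len(probs)), key=probs.__getitem__)]
        for probs in frame_probs
    ]
    return ctc_collapse(best_path, blank)


print(ctc_collapse("hh-e-ll-lo-"))
```

The blank between the two runs of `l` in the example alignment is what allows the double letter in the output to survive the merge step.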

Audio Adversarial Examples - [7]

NICO toolkit

NICO toolkit - a general purpose toolkit for constructing artificial neural networks and training with the back-propagation learning algorithm, mainly intended for, and originally developed for, speech recognition applications. The network topology is very flexible. Units are organized in groups and the group is a hierarchical structure, so groups can have sub-groups or other objects as members. This makes it easy to specify multi-layer networks with arbitrary connection structure and to build modular networks.

DeepSpeech

https://github.com/mozilla/DeepSpeech - an open source Speech-To-Text engine, using a model trained by machine learning techniques based on Baidu's Deep Speech research paper. Project DeepSpeech uses Google's TensorFlow to make the implementation easier.

wav2letter

Online speech recognition with wav2letter@anywhere - a fast, open source speech processing toolkit from the Speech team at Facebook AI Research built to facilitate research in end-to-end models for speech recognition. It is written entirely in C++ and uses the ArrayFire tensor library and the flashlight machine learning library for maximum efficiency.
- https://github.com/facebookresearch/wav2letter [8]

https://github.com/talonvoice/wav2train - automatically align transcribed audio and generate a wav2letter training corpus

Neurst

https://github.com/bytedance/neurst - aims at easily building and training end-to-end speech translation models, with a careful design for extensibility and scalability. We believe this design can make it easier for NLP researchers to get started. In addition, NeurST allows researchers to train custom models for translation, summarization and so on.

TSTNN

https://github.com/key2miao/TSTNN - transformer based neural network for speech enhancement in time domain

SpeechBrain

https://github.com/speechbrain/speechbrain - A PyTorch-based Speech Toolkit

SpeechSplit

https://github.com/auspicious3000/SpeechSplit - provides a PyTorch implementation of SpeechSplit, which enables more detailed speaking style conversion by disentangling speech into content, timbre, rhythm and pitch.

WeNet

https://github.com/wenet-e2e/wenet - Production First and Production Ready End-to-End Speech Recognition Toolkit

phonemes2ids

https://github.com/rhasspy/phonemes2ids - Useful for text to speech or speech to text applications where text is phonemized, converted to an integer vector, and then used to train a network.
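The phonemes-to-integer-vector step can be illustrated with a small sketch. This does not use the phonemes2ids API; the helper names `build_phoneme_map` and `phonemes_to_ids`, the pad symbol `_`, and the word separator `#` are all assumptions chosen for the example.

```python
def build_phoneme_map(phonemes, pad="_"):
    """Assign a stable integer id to each phoneme; id 0 is reserved for padding."""
    id_map = {pad: 0}
    for phoneme in phonemes:
        if phoneme not in id_map:
            id_map[phoneme] = len(id_map)
    return id_map


def phonemes_to_ids(word_phonemes, id_map, separator=None):
    """Flatten a list of words (each a list of phonemes) into integer ids.

    If `separator` is given, its id is inserted between words.
    """
    ids = []
    for index, word in enumerate(word_phonemes):
        if separator is not None and index > 0:
            ids.append(id_map[separator])
        ids.extend(id_map[phoneme] for phoneme in word)
    return ids


# Hypothetical phoneme inventory and a phonemized "hello world".
ID_MAP = build_phoneme_map(["#", "HH", "AH", "L", "OW", "W", "ER", "D"])
SENTENCE = [["HH", "AH", "L", "OW"], ["W", "ER", "L", "D"]]
print(phonemes_to_ids(SENTENCE, ID_MAP, separator="#"))
```

The resulting integer vector is the kind of input a text-to-speech or speech-to-text network would consume after an embedding layer.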

PASE

https://github.com/santi-pdp/pase - This repository contains the official implementations of PASE and PASE+. These are speech waveform encoders trained in a self-supervised manner with the so-called worker/minion framework. A PASE model can be used as a speech feature extractor or to pre-train an encoder for our desired end-task, like speech classification such as in ASR, speaker recognition, or emotion recognition, or speech generation such as in voice conversion or TTS.

PyTorch-Voice-Conversion

https://github.com/KimythAnly/PyTorch-Voice-Conversion - Implementations of Voice Conversion models; AdaIN-VC, AGAIN-VC, AutoVC, VQVC+

ConST

https://github.com/ReneeYe/ConST - code for paper "Cross-modal Contrastive Learning for Speech Translation" (NAACL 2022)

Whisper.cpp

https://github.com/ggerganov/whisper.cpp - Port of OpenAI's Whisper model in C/C++

Transcriber

https://github.com/davabase/transcriber_app - A real-time speech to text transcription app built with Flet and OpenAI Whisper.

ExVo

https://github.com/Aria-K-Alethia/ExVo - Official implementation of the paper "Exploring the Effectiveness of Self-supervised Learning and Classifier Chains in Emotion Recognition of Nonverbal Vocalizations" accepted by the Few-Shot track of the ICML Expressive Vocalizations (ExVo) Competition 2022. Abstract: We present an emotion recognition system for nonverbal vocalizations (NVs) submitted to the ExVo Few-Shot track of the ICML Expressive Vocalizations Competition 2022. The proposed method uses self-supervised learning (SSL) models to extract features from NVs and uses a classifier chain to model the label dependency between emotions. Experimental results demonstrate that the proposed method can significantly improve the performance of this task compared to several baseline methods. Our proposed method was scored by mean concordance correlation coefficient (CCC) in both the validation set and the test set, and outperformed the best baseline method in the validation set.

insanely-fast-whisper

https://github.com/Vaibhavs10/insanely-fast-whisper - Powered by Transformers, Optimum & flash-attn. TL;DR - Transcribe 150 minutes (2.5 hours) of audio in less than 98 seconds with OpenAI's Whisper Large v3. Blazingly fast transcription is now a reality!

Transcription

Transcribe!

Transcribe! - an assistant for people who want to work out a piece of music from a recording, in order to write it out, or play it themselves, or both. It doesn't do the transcribing for you, but it is essentially a specialised player program which is optimised for the purpose of transcription. It has many transcription-specific features not found on conventional music players. It is also used by many people for play-along practice. It can change pitch and speed instantly, and you can store and recall any number of named loops. So you can practice in all keys, and you can speed up as well as slow down. There is some advice about play-along practice in Transcribe!'s help, under the heading "Various Topics". And it is also used for speech transcription. With its support for foot pedals and its superior slowed-down sound quality, it is an excellent choice for this purpose. There is some advice about speech transcription in Transcribe!'s help, under the heading "Various Topics".

live-transcribe-speech-engine

https://github.com/google/live-transcribe-speech-engine - an Android application that provides real-time captioning for people who are deaf or hard of hearing. This repository contains the Android client libraries for communicating with Google's Cloud Speech API that are used in Live Transcribe.

Live-Subtitle

https://github.com/botbahlul/Live-Subtitle - an Android app that can recognize VLC live audio/video streaming (using the free Android Developers Speech Recognition API), then translate it (using the Android ML Kit Translate API) and display it as a live caption / live subtitle

MT3

https://github.com/magenta/mt3 - a multi-instrument automatic music transcription model that uses the T5X framework. This is not an officially supported Google product.

Pop2Piano

Pop2Piano
- https://github.com/sweetcocoa/pop2piano

auto-subtitle

https://github.com/m1guelpf/auto-subtitle - This repository uses ffmpeg and OpenAI's Whisper to automatically generate and overlay subtitles on any video.

subtitle-chan

subtitle-chan - Live speech transcription and translation in your browser. This is a live demo showing how to use subtitle-chan.
- https://github.com/ae9is/subtitle-chan

LiveCaptions

https://github.com/abb128/LiveCaptions - an application that provides live captioning for the Linux desktop. [9]

Video Subtitle Generator

https://github.com/navneeeth/live-subtitle-generator - Video Subtitle Generator is a project that combines the capabilities of PyAudio, opencv-python, OpenAI's Whisper API, and Deepgram API to generate subtitles for videos both uploaded and live.

bbb-live-subtitles

https://github.com/uhh-lt/bbb-live-subtitles - a plugin for automatic subtitling in BigBlueButton (BBB), an open source web conferencing system. bbb-live-subtitles will run real time automatic speech recognition (ASR) and will generate subtitle captions on-the-fly. No cloud services are used for ASR, instead we use our own speech recognition models that can be run locally. This ensures that no privacy issues arise. There are pre-trained German and English models available and ready to use (English needs PyKaldi > 0.2.0 and Python 3.8).

liveSubs

https://github.com/cxhawk/liveSubs - Wanna add subtitles and lower third easily in your live stream? This tool is great to be used with a video switcher like the ATEM Minis. Make sure this source is configured to be chroma keyed with the background color in settings.

pyTranscriber

https://github.com/raryelcostasouza/pyTranscriber - can be used to generate automatic transcription / automatic subtitles for audio/video files through a friendly graphical user interface.

Voice control

Programming by Voice: Becoming a Computer Whisperer

Talon

Talon - Powerful hands-free input
- https://github.com/talonvoice

Simon

Simon - an open source speech recognition program that can replace your mouse and keyboard. The system is designed to be as flexible as possible and will work with any language or dialect.

Blather

Blather is a speech recognizer that will run commands when a user speaks preset sentences.
- Intro to Blather: Speech Recognition for Linux

Xvoice

Xvoice - enables continuous speech dictation and speech control of most X applications. To convert users' speech into text it uses the IBM ViaVoice speech recognition engine, which is distributed separately (see below). When in dictation mode Xvoice passes this text directly to the currently focused X application. When in command mode, Xvoice matches the speech with predefined, user-modifiable key sequences or commands. For instance "list" would match "ls -l" when commanding the console, so that when the user says "list", "ls -l" will be sent to the console as if the user had typed it.
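The two modes described above amount to either passing recognised text through unchanged or looking it up in a command table. The sketch below illustrates that dispatch only; it is not Xvoice code, does not talk to X or to a recogniser, and the names `CONSOLE_COMMANDS` and `handle_utterance` as well as the second table entry are hypothetical.

```python
# Hypothetical command table for a console context: spoken phrase -> keystrokes.
CONSOLE_COMMANDS = {
    "list": "ls -l\n",
    "go up": "cd ..\n",
}


def handle_utterance(text, mode, commands):
    """Return the text that would be sent to the focused application.

    In dictation mode the recognised text is passed through as-is.
    In command mode the text is matched against the command table and
    None is returned when no command matches.
    """
    if mode == "dictation":
        return text
    if mode == "command":
        return commands.get(text.strip().lower())
    raise ValueError(f"unknown mode: {mode!r}")


print(repr(handle_utterance("list", "command", CONSOLE_COMMANDS)))
print(repr(handle_utterance("list", "dictation", CONSOLE_COMMANDS)))
```

Keeping a separate table per application context is one way such a tool could make the same spoken word produce different key sequences in different windows.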

CVoiceControl

CVoiceControl - a speech recognition system that enables a user to connect spoken commands to unix commands. It automagically detects speech input from a microphone, performs recognition on this input and - in case of successful recognition - executes the associated unix command.

Perlbox Voice

Perlbox Voice - a voice-enabled application to bring your desktop under your command. With a single word, you can start your web browser, your favorite editor or whatever you want. Perlbox Voice now has desktop plugins, which will allow you to control desktop functions. You can switch virtual desktops, invoke the desktop menu, switch wallpaper or lock the screen. This is command, and it's right on the tip of your tongue! Start with the tutorial, which has a lot of screenshots.

Voximp

https://github.com/arrdem/Voicely - A voice command interface for Linux, and a fork of the Voximp project

Other

http://catb.org/jargon/html/Y/yak-shaving.html

https://github.com/Benjaminnechicattu/Image-Editing-Using-Voice-Commands

Numen

Numen - voice control letting you have full control of a Linux machine without needing to type. It empowers people who otherwise couldn't use their computers, and helps more avoid hand strain. Numen is free (libre) software and runs locally. [10]
- https://git.sr.ht/~geb/numen

Enhancement

unified2021

https://github.com/nay0648/unified2021 - A UNIFIED SPEECH ENHANCEMENT FRONT-END FOR ONLINE DEREVERBERATION, ACOUSTIC ECHO CANCELLATION, AND SOURCE SEPARATION. Dereverberation (DR), acoustic echo cancellation (AEC), and blind source separation (BSS) are the three most important submodules in speech enhancement front-end. In traditional systems, the three submodules work independently in a sequential manner, each submodule has its own signal model, objective function, and optimization policy. Although this architecture has high flexibility, the speech enhancement performance is restricted, since each submodule's optimum cannot guarantee the entire system's global optimum. In this paper, a unified signal model is derived to combine DR, AEC, and BSS together, and the online auxiliary-function based independent component/vector analysis (Aux-ICA/IVA) technique is used to solve the problem. The proposed approach has unified objective function and optimization policy, the performance improvement is verified by simulated experiments.

DCS-Net

https://github.com/jackhwalters/DCS-Net - Implementation of paper "DCS-Net: Deep Complex Subtractive Neural Network for Monaural Speech Enhancement"

DNN-based source separation

https://github.com/tky823/DNN-based_source_separation - A PyTorch implementation of DNN-based source separation.

BSSD

https://github.com/rrbluke/BSSD - contains python/tensorflow code to reproduce the experiments presented in our paper Blind Speech Separation and Dereverberation using Neural Beamforming.

NRES

https://github.com/rrbluke/NRES - Neural Residual Echo Suppressor

Ultimate Vocal Remover GUI

https://github.com/Anjok07/ultimatevocalremovergui - This application uses state-of-the-art source separation models to remove vocals from audio files. UVR's core developers trained all of the models provided in this package (except for the Demucs v3 and v4 4-stem models).

Resemble Enhance

https://github.com/resemble-ai/resemble-enhance - an AI-powered tool that aims to improve the overall quality of speech by performing denoising and enhancement. It consists of two modules: a denoiser, which separates speech from a noisy audio, and an enhancer, which further boosts the perceptual audio quality by restoring audio distortions and extending the audio bandwidth. The two models are trained on high-quality 44.1kHz speech data that guarantees the enhancement of your speech with high quality.