{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 4.3.2 - Advanced - Vocalization\n",
"\n",
"COMET Team
*Siena Serikawa, Irene Berezin*\n",
"\n",
"## Outline\n",
"\n",
"------------------------------------------------------------------------\n",
"\n",
"Text-to-speech tools have improved dramatically in the last few years,\n",
"and are now capable of producing fluent, human-like vocalization of\n",
"text. In the context of research, there are cases where researchers have\n",
"to explain the instruction in the same manner to each participant to\n",
"avoid any potential impact on research outcomes. This notebook aims to\n",
"show learners how to create vocalizations of text-based content using\n",
"modern, open-source tools that can handle non-standard or technical\n",
"English words (jargon).\n",
"\n",
"## Prerequisites\n",
"\n",
"------------------------------------------------------------------------\n",
"\n",
"1. Introduction to Jupyter Notebooks\n",
"2. Some familiarity programming in Python\n",
"\n",
"## Learning Outcomes\n",
"\n",
"------------------------------------------------------------------------\n",
"\n",
"In the notebook, you will\n",
"\n",
"1. Familiarize yourself with Text-to-Speech models in Python.\n",
"2. Understand the models and structures behind them in the context of\n",
" research and synthesize speech from texts using open-source tools.\n",
"\n",
"## Sources\n",
"\n",
"------------------------------------------------------------------------\n",
"\n",
"This notebook is based on the following sources, which we highly\n",
"recommend you explore if you’re interested in furthering your knowledge\n",
"and application of Text-to-Speech Tools. These sources provide\n",
"additional insights and tutorials that can enhance your understanding\n",
"and skills in this area.\n",
"\n",
"- gTTS (Library using Google Text-to-Speech)\n",
" - [gTTS PyPI Documentation](https://pypi.org/project/gTTS/)\n",
" - [gTTS Documentation](https://gtts.readthedocs.io/en/latest/)\n",
" - [GitHub for gTTS](https://github.com/pndurette/gTTS)\n",
"- mozilla TTS (Open-source API)\n",
" - [GitHub for mozilla\n",
" TTS](https://github.com/mozilla/TTS?referral=top-free-text-to-speech-tools-apis-and-open-source-models)\n",
" - [mozilla TTS Documentation and\n",
" Tutorials](https://github.com/mozilla/TTS/wiki/TTS-Notebooks-and-Tutorials)\n",
"- Microsoft Azure Text-to-Speech (Cloud-based)\n",
" - [Microsoft Azure AI Speech Service\n",
" Documentation](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/)\n",
" - [Microsoft Azure Text-to-Speech\n",
" Overview](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/text-to-speech)\n",
" - [Microsoft Azure Text-to-Speech Quick\n",
" Start](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/get-started-text-to-speech?tabs=linux%2Cterminal&pivots=programming-language-python)\n",
"\n",
"## 0. Preparation\n",
"\n",
 (As a reference">
"> For troubleshooting, this section also draws on this [YouTube\n",
"> video](https://www.youtube.com/watch?v=MYRgWwis1Jk).\n",
"\n",
"------------------------------------------------------------------------\n",
"\n",
"Before we dive into the Text-to-Speech Tools, it’s always a good\n",
"practice to create a virtual environment whenever you start a new\n",
"project. In this preparation section, we will go through the steps to\n",
"create a virtual environment.\n",
"\n",
"First, create a new folder called `tts`\n",
"\n",
" mkdir tts\n",
" cd tts/\n",
"\n",
"Within the folder, create a virtual environment and activate it\n",
"\n",
" !python3 -m venv .\n",
" !source bin/activate\n",
"\n",
"That’s it!\n",
"\n",
"## 1. Python Library Tool: gTTS\n",
"\n",
"> This section is based on the [gTTS PyPI\n",
"> Documentation](https://pypi.org/project/gTTS/), [gTTS\n",
"> Documentation](https://gtts.readthedocs.io/en/latest/), and [GitHub\n",
"> for gTTS](https://github.com/pndurette/gTTS).\n",
"\n",
"### Part 1: Introducing gTTS\n",
"\n",
"------------------------------------------------------------------------\n",
"\n",
"In this section, we’re introducing `gTTS`, a Python library and CLI tool\n",
"that generate speech from texts using google Text-to-Speech. Their\n",
"customizable speech-specific sentence tokenizer allows users to process\n",
"long passages while maintaining proper intonation, abbreviations,\n",
"decimals, and other nuances effectively. Their customizable text\n",
"pre-processors can also implement adjustments such as correcting\n",
"pronunciation errors. It offers ***5 languages with different local\n",
"accents***, which we will play around with in Part 3! For more\n",
"information, take a look at the [language\n",
"list](https://gtts.readthedocs.io/en/latest/module.html).\n",
"\n",
"### Part 2: Installing gTTS\n",
"\n",
"------------------------------------------------------------------------\n",
"\n",
"Before we begin, make sure you have the supported version of Python,\n",
"which is **Python \\> 3.7**!\n",
"\n",
"First thing first, install the `gTTS` from PyPI\n",
"\n",
" !pip install gTTS\n",
"\n",
"### Part 3: Let’s Generate Speech from Text!\n",
"\n",
"------------------------------------------------------------------------\n",
"\n",
"We can now generate speech from texts using gTTS! Import `gTTS` from the\n",
"library we just installed.\n",
"\n",
" from gtts import gTTS\n",
"\n",
"Let’s let them say “Hello world!” in default English, which is British\n",
"English, and save the output as an mp3 file:\n",
"\n",
" tts = gTTS('Hello World!', lang='en')\n",
" tts.save('Hello_World.mp3')\n",
"\n",
"Download the file on your computer, and play it!\n",
"\n",
"Then, let’s try our favourite accent, Canadian! In this case, code:\n",
"\n",
" tts = gTTS('Hello World', lang='en', tld='ca')\n",
" tts.save('Hello_World.mp3')\n",
"\n",
"Again, download the file and open it on your computer to listen.\n",
"\n",
"## 2. Open-Source Free API: mozilla TTS\n",
"\n",
"> This section is based on [GitHub for mozilla\n",
"> TTS](https://github.com/mozilla/TTS?referral=top-free-text-to-speech-tools-apis-and-open-source-models)\n",
"> and [mozilla TTS Documentation and\n",
"> Tutorials](https://github.com/mozilla/TTS/wiki/TTS-Notebooks-and-Tutorials).\n",
"\n",
"> (As a reference, I also relied on the [YouTube\n",
"> video](https://www.youtube.com/watch?v=MYRgWwis1Jk) and [GitHub\n",
"> discussion](https://github.com/coqui-ai/TTS/discussions/3477) for\n",
"> troubleshooting. Please delete it if needed.)\n",
"\n",
"### Part 1: Introducing mozilla TTS\n",
"\n",
"------------------------------------------------------------------------\n",
"\n",
"In this section, we’re introducing mozilla TTS, a free, open-source API,\n",
"to synthesize speech. Before we begin, take a look at the [sample\n",
"page](https://erogol.com/ddc-samples/) and explore what mozilla TTS can\n",
"do. It allows us to generate natural-sounding voices from basic texts as\n",
"well as complex utterances. mozilla TTS has been applied to ***20+\n",
"languages*** due to its fast, easy, and efficient model training!\n",
"\n",
"### Part 2: Installing mozilla TTS\n",
"\n",
"------------------------------------------------------------------------\n",
"\n",
"We’re using their pre-trained models to synthesize speech from texts in\n",
"Python. mozilla TTS supports **Python \\>=3.6, \\<3.9**, so make sure you\n",
"installed the correct version!\n",
"\n",
"Before we install the TTS tool, let’s install or update some basic\n",
"modules and tools\n",
"\n",
" !pip3 install setuptools wheel pip -U\n",
"\n",
"Finally, we install mozilla TTS from PyPI and import OS. It may take a\n",
"while…But be patient!\n",
"\n",
" !pip install TTS\n",
" import os\n",
" os.environ[\"COQUI_TOS_AGREED\"] = \"1\"\n",
"\n",
"Now we have installed the TTS tool! Let’s take a look at the list of\n",
"pre-trained models\n",
"\n",
" !tts --list_models\n",
"\n",
"As you can see, there are **70 pre-trained models and 17 vocoders** in\n",
"mozilla TTS. We will see how they are different at the end of the next\n",
"section. For more details about their pre-trained models, see\n",
"[documentation](https://github.com/mozilla/TTS/wiki/Released-Models).\n",
"\n",
"### Part 3: Let’s Generate Speech from Text!\n",
"\n",
"------------------------------------------------------------------------\n",
"\n",
"You’re ready to synthesize speech from texts using their pre-trained\n",
"models! Run a `model_name` and a `vocoder_name` from the list we took a\n",
"look in the previous part.\n",
"\n",
"You can copy and paste the names of the models and vocoders you want to\n",
"use from the list as arguments for the command below:\n",
"\n",
" !tts --model_name \"//\" \\ \n",
" --vocoder_name \"///\"\n",
"\n",
"For learning purposes, let’s proceed with the first one for both TTS\n",
"model and vocoder for now. In this case, the code will be something like\n",
"this:\n",
"\n",
" !tts --model_name \"tts_models/multilingual/multi-dataset/xtts_v2\" \\\n",
" --vocoder_name \"vocoder_models/universal/libri-tts/wavegrad\"\n",
"\n",
"Then, install `espeak-ng` package, a text-to-speech software that\n",
"converts texts into spoken words in a lot of languages.\n",
"\n",
" !sudo apt-get install espeak-ng\n",
"\n",
"Lastly, type something (let’s say your name) in the code below and let\n",
"it speak!\n",
"\n",
" !tts --text \"type your name here\"\n",
"\n",
"A wav. file will appear in your folder. Click and open it!\n",
"\n",
"### Part 4: Different Pre-Trained Models and Vocoders\n",
"\n",
"------------------------------------------------------------------------\n",
"\n",
"In the previous section, we used `xtts_v2 model` (for more details, see\n",
"[documentation](https://docs.coqui.ai/en/latest/models/xtts.html)).\n",
"These models make some differences in the generated speech. In this\n",
"part, we will take a look at the differences between models.\n",
"\n",
"We use the same code as the previous section, but with a different\n",
"model, `fast_pitch`.\n",
"\n",
" !tts --model_name \"tts_models/en/ljspeech/fast_pitch\" \\\n",
" --vocoder_name \"vocoder_models/universal/libri-tts/wavegrad\"\n",
"\n",
"Then, type whatever you want to generate a speech from.\n",
"\n",
" !tts --text \"type whatever you like!\"\n",
"\n",
"Does it make any difference? Try other TTS models on your own!\n",
"\n",
"## 3. Cloud-Based AI: Microsoft Azure Text-to-Speech\n",
"\n",
"> This section is based on [Microsoft Azure AI Speech Service\n",
"> Documentation](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/),\n",
"> [Microsoft Azure Text-to-Speech\n",
"> Overview](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/text-to-speech),\n",
"> and [Microsoft Azure Text-to-Speech Quick\n",
"> Start](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/get-started-text-to-speech?tabs=linux%2Cterminal&pivots=programming-language-python).\n",
"\n",
"### Part 1: Introducing Microsoft Azure Text-to-Speech\n",
"\n",
"------------------------------------------------------------------------\n",
"\n",
"In this section, we’re introducing our third Text-to-Speech option,\n",
"Microsoft Azure Text-to-Speech, a cloud-based AI tool that generates\n",
"speech from texts. Their prebuilt neural voice allows us to generate\n",
"highly natural, human-like speech from texts. It is **free up to 0.5\n",
"million characters per month**. Check out the [Voice\n",
"Gallery](https://speech.microsoft.com/portal/voicegallery) and find your\n",
"favourite voice! For the pricing option details, see [Azure AI Speech\n",
"Pricing](https://azure.microsoft.com/ja-jp/pricing/details/cognitive-services/speech-services/).\n",
"For the list of supported languages, see the\n",
"[table](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support?tabs=tts).\n",
"\n",
"### Part 2: Installing Microsoft Azure Text-to-Speech\n",
"\n",
 This section relies">
"> This section relies heavily on the [Microsoft\n",
"> Tutorial](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/get-started-text-to-speech?tabs=linux%2Cterminal&pivots=programming-language-python),\n",
"> which will help you resolve any problems you may encounter\n",
"> during the process.\n",
"\n",
"------------------------------------------------------------------------\n",
"\n",
"Before we begin, make sure to [create an Azure\n",
"account](https://azure.microsoft.com/free/cognitive-services) for free,\n",
"[create a Speech\n",
"resource](https://portal.azure.com/#create/Microsoft.CognitiveServicesSpeechServices)\n",
"in your portal, and [get the\n",
"keys](https://learn.microsoft.com/en-us/azure/ai-services/multi-service-resource?pivots=azportal#get-the-keys-for-your-resource)\n",
"for your resource. Microsoft Azure Text-to-Speech requires **Python \\>=\n",
"3.7**, so double check if you have the supported version before you\n",
"start the installation process.\n",
"\n",
"Install the Microsoft Azure Speech SDK\n",
"\n",
" !pip install azure-cognitiveservices-speech\n",
"\n",
"Then, replace the speech region and speech key with your actual values\n",
"below:\n",
"\n",
" SPEECH_REGION = \"type your speech region\"\n",
" SPEECH_KEY = \"type your speech key\"\n",
"\n",
"### Part 3: Let’s Generate Speech from Text!\n",
"\n",
"------------------------------------------------------------------------\n",
"\n",
"Now you’re ready to generate speech from texts! Execute the following\n",
"request, and a mp3 file should appear in your folder. As an example, the\n",
"mp3 file speaks “Hello World!” but you can always replace it with\n",
"whatever you want!\n",
"\n",
" import requests\n",
"\n",
" url = f\"https://{SPEECH_REGION}.tts.speech.microsoft.com/cognitiveservices/v1\"\n",
" headers = {\n",
" \"Ocp-Apim-Subscription-Key\": SPEECH_KEY,\n",
" \"Content-Type\": \"application/ssml+xml\",\n",
" \"X-Microsoft-OutputFormat\": \"audio-16khz-128kbitrate-mono-mp3\",\n",
" \"User-Agent\": \"python-requests\"\n",
" }\n",
" \n",
" data = '''\n",
" \n",
" \n",
" Hello World!\n",
" \n",
" \n",
" '''\n",
"\n",
" response = requests.post(url, headers=headers, data=data.encode('utf-8'))\n",
"\n",
" if response.status_code == 200:\n",
" with open('output.mp3', 'wb') as f:\n",
" f.write(response.content)\n",
" print(\"Audio file saved as output.mp3\")\n",
"\n",
" else:\n",
" print(\"Error:\", response.status_code, response.text) \n",
"\n",
"Although we used Ava’s voice who speaks American English\n",
"(`en-US-AvaMultilingualNeural`) in this example, you can always change\n",
"the speech systhesis language by replacing `en-US-AvaMultilingualNeural`\n",
"with another [supported languages and\n",
"voices](https://learn.microsoft.com/en-us/azure/ai-services/speech-service/language-support?tabs=tts#prebuilt-neural-voices)\n",
"(check out [Voice\n",
"Gallery](https://speech.microsoft.com/portal/voicegallery) to listen to\n",
"demo voices). Reminder: make sure to use **Neural voice**, which is free\n",
"up to 0.5 million character!\n",
"\n",
"Note that all neural voices are not just fluent in their own language\n",
"but are multilingual and can speak English! If you select a voice that\n",
"is not English for English texts, they will generate speech in English\n",
"but with an accent from their own language.\n",
"\n",
"Let’s try `es-ES-ElviraNeural`, Elvira who speaks Spanish, and type an\n",
"input text “I’m so excited to use Text-to-Speech Tools in my own\n",
"research!”.\n",
"\n",
" import requests\n",
"\n",
" url = f\"https://{SPEECH_REGION}.tts.speech.microsoft.com/cognitiveservices/v1\"\n",
" headers = {\n",
" \"Ocp-Apim-Subscription-Key\": SPEECH_KEY,\n",
" \"Content-Type\": \"application/ssml+xml\",\n",
" \"X-Microsoft-OutputFormat\": \"audio-16khz-128kbitrate-mono-mp3\",\n",
" \"User-Agent\": \"python-requests\"\n",
" }\n",
" \n",
" data = '''\n",
" \n",
" \n",
" I'm so excited to use Text-to-Speech Tools in my own research!\n",
" \n",
" \n",
" '''\n",
"\n",
" response = requests.post(url, headers=headers, data=data.encode('utf-8'))\n",
"\n",
" if response.status_code == 200:\n",
" with open('output.mp3', 'wb') as f:\n",
" f.write(response.content)\n",
" print(\"Audio file saved as output.mp3\")\n",
"\n",
" else:\n",
" print(\"Error:\", response.status_code, response.text) \n",
" \n",
"\n",
"## 4: Application of TTS in Real-World Research Context\n",
"\n",
"------------------------------------------------------------------------\n",
"\n",
"Congratulations! You can use 3 different Text-to-Speech Tools to\n",
"synthesize speech from texts. In the real-world research contexts,\n",
"however, you may want to repeat the process without coding the same\n",
"thing again and again. Here’s a great tip to generate speech from texts\n",
"without copying and pasting the same thing repeatedly, by using a text\n",
"box. Take Microsoft Azure Text-to-Speech as an example, execute the\n",
"following request, and a text box should appear at the bottom of this\n",
"cell. Type whatever words or sentences you want it to speak! As always,\n",
"an mp3 file will be created under your file.\n",
"\n",
" import requests\n",
"\n",
" url = f\"https://{SPEECH_REGION}.tts.speech.microsoft.com/cognitiveservices/v1\"\n",
" headers = {\n",
" \"Ocp-Apim-Subscription-Key\": SPEECH_KEY,\n",
" \"Content-Type\": \"application/ssml+xml\",\n",
" \"X-Microsoft-OutputFormat\": \"audio-16khz-128kbitrate-mono-mp3\",\n",
" \"User-Agent\": \"python-requests\"\n",
" }\n",
"\n",
" text = ''\n",
" while True:\n",
" text = input('Enter text: ')\n",
" if text == 'quit':\n",
" break\n",
" \n",
" data = f'''\n",
" \n",
" \n",
" {text}\n",
" \n",
" \n",
" '''\n",
"\n",
" response = requests.post(url, headers=headers, data=data.encode('utf-8'))\n",
"\n",
" if response.status_code == 200:\n",
" with open('output.mp3', 'wb') as f:\n",
" f.write(response.content)\n",
" print(\"Audio file saved as output.mp3\")\n",
"\n",
" else:\n",
" print(\"Error:\", response.status_code, response.text) \n",
"\n",
"Download the mp3 file just created in your file on your computer and\n",
"open it! Type `quit` in the box when you want it to stop running."
],
"id": "e89f230c-db36-4452-a272-f1072137b354"
}
],
"nbformat": 4,
"nbformat_minor": 5,
"metadata": {
"kernelspec": {
"name": "python3",
"display_name": "Python 3 (ipykernel)",
"language": "python3"
}
}
}