{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# 4.3.1 - Advanced - Transcription\n", "\n", "COMET Team
*Irene Berezin*\n", "\n", "------------------------------------------------------------------------\n", "\n", "## Prerequisites\n", "\n", "- Have installed Anaconda Navigator and Python on your computer\n", "\n", "## Learning outcomes\n", "\n", "- Understand the basic mechanics behind audio transcription\n", "- Be familiar with the various elements of Whisper audio transcription\n", "- Be able to transcribe and diarize short-form and long-form audio\n", "\n", "------------------------------------------------------------------------\n", "\n", "## 1. Introduction\n", "\n", "### 1.1 What is audio transcription?\n", "\n", "Audio transcription is the language processing task of converting audio\n", "files containing human speech into text using a computer. This task often\n", "includes *diarization*, the process of distinguishing and labeling the\n", "various speakers in the audio file. Applications of audio transcription\n", "include multi-lingual captions on online videos, real-time online meeting\n", "transcription, and much more.\n", "\n", "*Automatic speech recognition (ASR) systems* are interfaces that use\n", "machine learning/artificial intelligence to process speech audio files\n", "into text. In this module, we will be using the open-source ASR system\n", "*Whisper* by OpenAI to transcribe and diarize various audio files.\n", "\n", "### 1.2 What is *Whisper*?\n", "\n", "Whisper is an ASR model for transcription and speech recognition designed\n", "by OpenAI. Whisper stands out from its predecessors because it was\n", "trained on roughly 680 thousand hours of labeled audio transcription\n", "data, significantly more than the models that came before it; thus,\n", "Whisper exhibits much higher accuracy when tested against audio data\n", "outside of its training set compared to older models such as\n", "*Wav2Vec*$^{[1]}$.\n", "\n", "#### 1.2.1 How does Whisper work?\n", "\n", "Whisper, like audio transcription models in general, works by converting\n", "raw audio data into a spectrogram, specifically a *log-mel spectrogram*,\n", "which plots time on the x-axis, the mel scale (a logarithmic transform of\n", "the Hertz frequency) on the y-axis, and colors the data with respect to\n", "the amplitude of the audio at each mel frequency and point in time.\n", "\n", "The spectrogram is passed to the encoder, a stack of transformer blocks\n", "that process the audio, extracting features and relationships between\n", "different parts of the recording. The processed information from the\n", "encoder is passed to the decoder, another stack of transformer blocks\n", "that generates the output sequence, predicting the corresponding text\n", "caption token by token. The tokens themselves are produced by a\n", "tokenizer, which maps text to lexical tokens: strings with assigned\n", "meaning that the language model can read$^{[2]}$.\n", "\n", "*(Figure: transformer encoder-decoder architecture)*\n", "\n",
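"To make the spectrogram step concrete, the short sketch below computes and\n", "plots a log-mel spectrogram with `librosa` (installed later, in section\n", "2.2.4) and `matplotlib` (which you may need to install separately with\n", "`pip install matplotlib`). The file path is a placeholder (swap in any\n", "audio file you have on hand), and 80 mel bands is simply a common choice,\n", "not a requirement. A frequently used convention maps a frequency $f$ in\n", "Hertz to mels via $m = 2595 \\log_{10}(1 + f/700)$.\n", "\n", "``` python\n", "import librosa\n", "import librosa.display  # plotting helpers\n", "import matplotlib.pyplot as plt\n", "import numpy as np\n", "\n", "# Placeholder path: replace with any audio file on your computer\n", "y, sr = librosa.load(\"audio samples/example.wav\", sr=16000)\n", "\n", "# Compute a mel spectrogram, then convert the power values to decibels (a log scale)\n", "mel = librosa.feature.melspectrogram(y=y, sr=sr, n_mels=80)\n", "log_mel = librosa.power_to_db(mel, ref=np.max)\n", "\n", "# Plot: time on the x-axis, mel frequency on the y-axis, colour = amplitude in dB\n", "fig, ax = plt.subplots()\n", "img = librosa.display.specshow(log_mel, sr=sr, x_axis=\"time\", y_axis=\"mel\", ax=ax)\n", "fig.colorbar(img, ax=ax, format=\"%+2.0f dB\")\n", "ax.set_title(\"Log-mel spectrogram\")\n", "plt.show()\n", "```\n", "\n",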
"#### 1.2.2 Optimizing Whisper transcription\n", "\n", "Alongside Whisper, there exist many libraries that aim to optimize the\n", "current Whisper model by increasing transcription speed and accuracy.\n", "Some examples include:\n", "\n", "**[Distil-Whisper](https://github.com/huggingface/distil-whisper)**: a\n", "smaller, optimized version of Whisper created by *HuggingFace* using\n", "knowledge distillation. The Distil-Whisper model claims to be 6x faster,\n", "50% smaller, and within a 1% word error rate relative to the original\n", "Whisper model $^{[3]}$.\n", "\n", "> Pros: CPU-compatible, significantly faster compared to OpenAI’s Whisper\n", "> model.\n", ">\n", "> Cons: Only supports English-speech to English-text transcription.\n", ">\n", "> This is the model that we will be using in this notebook, due to its\n", "> relevance and compatibility with our use cases for audio\n", "> transcription. However, if you have a relatively powerful computer and\n", "> feel up for the challenge, consider following along with one of the\n", "> alternatives listed below.\n", "\n", "**[Whisper-Jax](https://github.com/sanchit-gandhi/whisper-jax)**:\n", "Another optimized version of Whisper built on the *Transformers*\n", "library. Whisper-Jax claims to be 70x faster than the original Whisper\n", "model $^{[4]}$.\n", "\n", "> Pros: CPU-compatible, significantly faster compared to OpenAI’s Whisper\n", "> model.\n", ">\n", "> Cons: Optimized for GPU/TPU usage.\n", "\n", "**[Insanely-Fast-Whisper](https://github.com/Vaibhavs10/insanely-fast-whisper)**:\n", "A command-line interface that greatly speeds up Whisper performance and\n", "claims to be able to transcribe 150 minutes of audio in less than 98\n", "seconds $^{[5]}$.\n", "\n", "> Pros: One of the fastest versions of Whisper available today.\n", ">\n", "> Cons: Only works on NVIDIA GPUs.\n", "\n", "------------------------------------------------------------------------\n", "\n", "## 2. Installations\n", "\n", "### 2.1 Activating conda environment & downloading Jupyter Lab\n", "\n", "(If you’ve already done this, please move on to section 2.2)\n", "\n", "#### 2.1.1 Setting up and activating a conda environment\n", "\n", "An *environment* is a repository of packages that you have installed on\n", "your computer. It acts similarly to a virtual machine, keeping the\n", "packages needed for one project separate from other projects, to avoid\n", "version conflicts, cluttering, etc.\n", "\n", "Let’s start by opening up the conda command prompt:\n", "\n", "1. On Windows 11, press the Windows icon at the bottom of the screen.\n", "2. Press *“all apps”*, and open the `anaconda3 (64bit)` folder.\n", "3. Left-click on `anaconda prompt`, select `more`, and press\n", "   `run as administrator`. This will open the command prompt window.\n", "4. Let’s create a new environment and call it *“whisper”*. In the command\n", "   prompt, copy-and-paste the following line of code:\n", "   `conda create -n whisper`.\n", "5. Let’s activate our new environment. Once your new environment is\n", "   created, type `conda activate whisper`.\n", "\n", "We’ve successfully created and activated our environment.\n", "\n", "#### 2.1.2 Installing and opening Jupyter Lab\n", "\n", "To install Jupyter, type in the following line of code:\n", "`conda install jupyter`. Once Jupyter is finished installing, simply\n", "type `jupyter lab` in the command prompt. This will open up Jupyter\n", "locally in your default browser.\n", "\n", "Note: these steps only need to be done once on each computer. The\n", "next time you wish to open Jupyter locally, you only need to activate\n", "your conda environment and type in “jupyter lab” in the conda prompt.\n", "\n", "Warning: Make sure not to close the anaconda prompt while Jupyter\n", "is running. Doing so will cause Jupyter to lose connection and may\n", "result in you losing unsaved work.\n", "\n", "### 2.2 Installing Whisper\n", "\n", "#### 2.2.1 Installing PyTorch\n", "\n", "Let’s start off by installing *PyTorch*, a machine learning library based\n", "on the Torch framework, on which Whisper is built.\n",
"To install PyTorch, open the conda prompt as an administrator, ensure that\n", "you are in the `whisper` environment that we created, and type in the\n", "following line of code:\n", "\n", "`conda install pytorch torchvision torchaudio cpuonly -c pytorch`\n", "\n", "If, for some reason, the installation does not work, you can also\n", "install PyTorch through pip:\n", "\n", "`pip3 install torch torchvision torchaudio`\n", "\n", "Note: This installation is CPU-only. If you have an NVIDIA GPU and would\n", "like to run Whisper on your GPU, download CUDA and follow the GPU setup\n", "instructions on the PyTorch website.\n", "\n",
\n", "\n", "\"pytorch\n", "\n", "
\n", "\n", "Note that the installation may take a few minutes to complete, and that\n", "the conda prompt will ask you to confirm installation by pressing ‘y’.\n", "If the end result looks like this, you’ve installed Pytorch correctly.\n", "\n", "#### 2.2.2 Installing Transformers\n", "\n", "*Transformers* is a python package developped by *HuggingFace* which\n", "allows for easier downloading and training of natural langauge\n", "processing models, such as Whisper. The transformers library simplifies\n", "the audio transcription process by converting our audio files into text\n", "tokens for transcription without redundant code $^{[6]}$.\n", "\n", "We can download the transformers library using the following line of\n", "code in our conda prompt. We’ll also install the `Datasets` library to\n", "in case you’d like to use additional short-form audio:\n", "\n", "`pip install --upgrade pip`\n", "\n", "`pip install --upgrade transformers accelerate datasets[audio]`\n", "\n", "#### 2.2.3 Installing Whisper\n", "\n", "We can now install whisper. To do so, type the following line of code\n", "into the conda command prompt: `pip install -U openai-whisper`.\n", "\n", "Additionally, we’ll need to install the command-line tool *FFmpeg*, a\n", "open source software that helps with audio and video processing. We can\n", "do so by running the following line of code in conda prompt:\n", "`conda install conda-forge::ffmpeg`.\n", "\n", "#### 2.2.4 Installing Librosa and Soundfile\n", "\n", "Lastly, we’ll need to install librosa and soundfile, python packages for\n", "music and video analysis, which will allow us to preprocess our audio\n", "recordings before transcribing them. To do this, enter\n", "`pip install librosa soundfile` in the conda command prompt.\n", "\n", "------------------------------------------------------------------------\n", "\n", "## 3. Loading audio and preprocessing\n", "\n", "### 3.1 Loading audio samples\n", "\n", "It’s always a good idea to at least partially listen to the audio we\n", "wish to transcribe, to make sure that the audio file has no issues.\n", "\n", "Lets start off by loading some of the audio samples provided in this\n", "module. We’ll do this using the `IPython` library, which should already\n", "be installed on your device. If the code fails to run, run the following\n", "line of code in the conda prompt: `pip install ipython`.\n", "\n", "``` python\n", "import IPython\n", "IPython.display.Audio(\"audio samples/mixkit-cartoon-kitty-begging-meow-92.wav\")\n", "\n", "# warning: Turn down your volume as the audio may be loud!\n", "```\n", "\n", "Here is another example, this time from a longer [ColdFusion Youtube\n", "video](https://www.youtube.com/watch?v=a32RLgqNfGs).\n", "\n", "``` python\n", "import IPython\n", "IPython.display.Audio(\"audio samples/The Boeing Scandal Just Got A LOT Worse.mp3\")\n", "\n", "# warning: Turn down your volume as the audio may be loud!\n", "```\n", "\n", "### 3.2 Preprocessing audio\n", "\n", "A *sampling rate* is the number of samples per second (or per other\n", "unit) taken from a continuous signal (the actual audio) to make a\n", "discrete signal (the audio recording)$^{[7]}$. It’s important to note\n", "that Whisper transcription is designed to work on **16kHz audio\n", "samples**. 
"Since not all audio is recorded at 16 kHz, we need to check the sampling\n", "rate of our audio file and, if it is not 16 kHz, resample the audio to\n", "the correct sampling rate using the librosa library.\n", "\n", "Let’s start off by checking the sampling rate of the kitty audio sample:\n", "\n", "``` python\n", "import librosa\n", "import soundfile\n", "\n", "# Load the audio file\n", "audio_file_path = \"audio samples/mixkit-cartoon-kitty-begging-meow-92.wav\"\n", "y, sr = librosa.load(audio_file_path, sr=None) # Load the audio file and get the original sampling rate\n", "\n", "print(\"Sampling rate:\", sr)\n", "```\n", "\n", "We see that the sampling rate of this audio sample is well above the\n", "16 kHz sampling rate that Whisper requires. Thus, we need to convert it\n", "to the proper sampling rate of 16 kHz. We’ll do this using the `librosa`\n", "package.\n", "\n", "``` python\n", "import librosa\n", "import soundfile as sf\n", "\n", "# Load the audio file at its original sampling rate\n", "audio_file_path = \"audio samples/mixkit-cartoon-kitty-begging-meow-92.wav\"\n", "y, sr = librosa.load(audio_file_path, sr=None)\n", "\n", "# Resample the audio to 16 kHz\n", "y_resampled = librosa.resample(y, orig_sr=sr, target_sr=16000) # Resample the audio to a sampling rate of 16 kHz\n", "\n", "# Save the resampled audio to a new file\n", "output_file_path = \"audio samples/mixkit-cartoon-kitty-begging-meow-92_resamples.wav\" # Path to save the resampled audio file\n", "sf.write(output_file_path, y_resampled, 16000) # Save the resampled audio to a WAV file\n", "```\n", "\n", "``` python\n", "audio_file_path = \"audio samples/mixkit-cartoon-kitty-begging-meow-92_resamples.wav\"\n", "y, sr = librosa.load(audio_file_path, sr=None) # Load the audio file and get the new sampling rate\n", "\n", "print(\"Sampling rate:\", sr)\n", "```\n", "\n", "------------------------------------------------------------------------\n", "\n", "This also works on MP3 files, such as the parliamentary debate recording\n", "that we’ll transcribe later on:\n", "\n", "``` python\n", "audio_file_path = \"audio samples/House debates CPC motion of non-confidence against Trudeau's carbon tax CANADIAN POLITICS.mp3\" # Replace this with the path to your audio file\n", "y, sr = librosa.load(audio_file_path, sr=None) # Load the audio file and get the original sampling rate\n", "\n", "print(\"Sampling rate:\", sr)\n", "```\n", "\n", "``` python\n", "# Load the audio file at its original sampling rate\n", "audio_file_path = \"audio samples/House debates CPC motion of non-confidence against Trudeau's carbon tax CANADIAN POLITICS.mp3\"\n", "y, sr = librosa.load(audio_file_path, sr=None)\n", "\n", "# Resample the audio to 16 kHz\n", "y_resampled = librosa.resample(y, orig_sr=sr, target_sr=16000) # Resample the audio to a sampling rate of 16 kHz\n", "\n", "# Save the resampled audio to a new file\n", "output_file_path = \"audio samples/House_debates_CPC_motion_of_non-confidence_against_Trudeau's_carbon_tax_CANADIAN POLITICS_resampled.mp3\" # Path to save the resampled audio file\n", "sf.write(output_file_path, y_resampled, 16000) # Save the resampled audio to an mp3 file\n", "```\n", "\n", "``` python\n", "audio_file_path = \"audio samples/House_debates_CPC_motion_of_non-confidence_against_Trudeau's_carbon_tax_CANADIAN POLITICS_resampled.mp3\"\n", "y, sr = librosa.load(audio_file_path, sr=None) # Load the audio file and get the new sampling rate\n", "\n", "print(\"Sampling rate:\", sr)\n", "```\n", "\n",
sr)\n", "```\n", "\n", "------------------------------------------------------------------------\n", "\n", "## 4. Transcribing single-speaker audio\n", "\n", "We can now begin testing out audio transcription. There are two\n", "important distinctions to keep in mind when transcribing audio:\n", "\n", "1. Short-form versus long form audio - whisper is trained on 30-second\n", " audio clips, and will thus cut off audio longer than 30 seconds. We\n", " can overcome this by *chuncking* our audio sample into multiple\n", " audio samples, and then stitching them back together after the\n", " transcription process.\n", "2. Single-speaker versus multi-speaker audio: Audio with a single\n", " speaker is easier to transcribe compared to audio with multiple\n", " speakers. The segmentation of speech into individual speakers, known\n", " as ***diarization***, requires a slightly different approach to\n", " transcription and will be covered in section 5.\n", "\n", "### 4.1 Transcribing short form audio\n", "\n", "Let’s begin transcribing our first audio sample. We’ll be using a\n", "trimmed 25 second audio sample from the Wall Street Journal. As\n", "mentioned, it’s always a good idea to start off by listening to our\n", "audio sample before transcribing it.\n", "\n", "``` python\n", "import IPython\n", "IPython.display.Audio(\"audio samples/WSJ-23andme_resampled.wav\")\n", "```\n", "\n", "#### 4.1.1 Preprocessing\n", "\n", "You’ll notice that the file in question is an mp4 file rather than a\n", "mp3/wav file, meaning that the original file contains both audio and\n", "video. This isn’t an issue as we can convert it to a mp3/wav file during\n", "the preprocessing step.\n", "\n", "``` python\n", "audio_file_path = \"audio samples\\WSJ-23andme.mp4\" \n", "y, sr = librosa.load(audio_file_path, sr=None) # Load the audio file and get the original sampling rate\n", "\n", "print(\"Sampling rate:\", sr)\n", "```\n", "\n", "``` python\n", "import soundfile as sf\n", "\n", "# Load the audio file\n", "audio_file_path = \"audio samples/WSJ-23andme.mp4\" \n", "y, sr = librosa.load(audio_file_path, sr=44100) # Load the audio file with the current sampling rate\n", "\n", "# Resample the audio to 16 kHz\n", "y_resampled = librosa.resample(y, orig_sr=44100, target_sr=16000) # Resample the audio to a sampling rate of 16 kHz\n", "\n", "# Save the resampled audio to a new file\n", "output_file_path = \"audio samples/WSJ-23andme_resampled.wav\" # Path to save the resampled audio file\n", "sf.write(output_file_path, y_resampled, 16000) # Save the resampled audio to a WAV file\n", "```\n", "\n", "``` python\n", "audio_file_path = \"audio samples/WSJ-23andme_resampled.wav\" \n", "y, sr = librosa.load(audio_file_path, sr=None) # Load the audio file and get the original sampling rate\n", "\n", "print(\"Sampling rate:\", sr)\n", "```\n", "\n", "#### 4.1.2 Transcribing\n", "\n", "We’re now ready for our first transcription. The transcription process\n", "using distill-whisper is divided into the following steps:\n", "\n", "1. **Model specifications:** We start with initializing a PyTorch model\n", " for transcription, selecting either GPU or CPU based on\n", " availability. 
"   We then specify the pre-trained model we wish to use, with options\n", "   for optimizing memory usage and ensuring safety in tensor operations.\n", "\n", "``` python\n", "import torch\n", "from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline\n", "\n", "device = \"cuda:0\" if torch.cuda.is_available() else \"cpu\" # If you have CUDA, this will run the transcription process on your GPU. If not, it will default to your CPU.\n", "\n", "torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32 # specifying GPU/CPU parameters for PyTorch\n", "\n", "model_id = \"distil-whisper/distil-large-v3\" # specifies the model ID, in this case we are using distil-large-v3\n", "# you can replace the model with any model you want that is compatible\n", "\n", "model = AutoModelForSpeechSeq2Seq.from_pretrained(\n", "    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True # specifying CPU parameters and model; you can change low_cpu_mem_usage to `False` for faster transcription\n", ")\n", "model.to(device)\n", "\n", "processor = AutoProcessor.from_pretrained(model_id) # specifying processor\n", "```\n", "\n", "2. **Pipeline:** We then create a pipeline for automatic speech\n", "   recognition using the specified model, tokenizer, and feature\n", "   extractor, utilizing the specified torch data type and device for\n", "   computation.\n", "\n", "A ***pipeline*** is a series of interconnected steps or processes\n", "designed to accomplish a specific task efficiently. The HuggingFace\n", "audio transcription pipeline is structured to take raw audio inputs and\n", "convert them into transcriptions using automatic speech recognition. You\n", "can read more about the pipeline used in this tutorial\n", "[here](https://huggingface.co/docs/transformers/main/en/pipeline_tutorial).\n", "\n", "``` python\n", "pipe = pipeline(\n", "    \"automatic-speech-recognition\", # specifies what we want our model to do\n", "    model=model,\n", "    tokenizer=processor.tokenizer,\n", "    feature_extractor=processor.feature_extractor,\n", "    max_new_tokens=128,\n", "    torch_dtype=torch_dtype,\n", "    device=device,\n", ")\n", "```\n", "\n", "3. **Transcription:** Finally, we *pipe* our audio sample into our\n", "   pipeline, and generate an output.\n", "\n", "Note that steps 1 and 2 will only need to be run once in a given\n", "notebook, unless you need to change the model specifications at a later\n", "point.\n", "\n", "``` python\n", "result = pipe(\"audio samples/WSJ-23andme_resampled.wav\")\n", "print(result[\"text\"])\n", "```\n", "\n", "You’ll notice that the transcription has minor mistakes, namely\n", "transcribing “23andMe” incorrectly. One limitation of automated\n", "transcription is that ASR models are trained on a finite vocabulary, and\n", "thus struggle to transcribe uncommon or out-of-vocabulary words\n", "accurately. Therefore, it’s always a good idea to proofread the\n", "generated output.\n", "\n", "### 4.2 Transcribing long-form audio\n", "\n", "Realistically, most audio you’ll be working with is longer than 30\n", "seconds. However, the Whisper ASR model is inherently built on 30-second\n", "samples: any audio shorter than 30 seconds is padded to bring it to 30\n", "seconds, and any audio longer than 30 seconds is cut off at the\n", "30-second mark.\n",
"To overcome this, we can modify our code to allow for long-form audio\n", "transcription by “chunking” our audio into 30-second segments,\n", "transcribing each segment individually, and then “stitching” the\n", "resulting text back together to form the complete transcription.\n", "\n", "You can learn more about long-form audio transcription on HuggingFace\n", "[here](https://huggingface.co/blog/asr-chunking).\n", "\n",
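"To build some intuition for chunking, here is a toy sketch in plain Python\n", "(the helper `chunk_windows` is defined here purely for illustration, not\n", "taken from the pipeline’s internals). It prints the overlapping windows a\n", "hypothetical 90-second recording could be cut into, assuming a 25-second\n", "chunk length and strides of 4 and 2 seconds, the same values we’ll pass to\n", "the pipeline below. Roughly speaking, the real pipeline works on waveform\n", "samples and trims the strided (overlapping) regions from each chunk’s\n", "prediction before stitching the text back together.\n", "\n", "``` python\n", "def chunk_windows(total_s, chunk_s=25.0, stride_left_s=4.0, stride_right_s=2.0):\n", "    \"\"\"Toy illustration: return (start, end) times, in seconds, of overlapping chunks.\"\"\"\n", "    step = chunk_s - stride_left_s - stride_right_s  # how far each new window advances\n", "    windows, start = [], 0.0\n", "    while True:\n", "        end = min(start + chunk_s, total_s)\n", "        windows.append((start, end))\n", "        if end >= total_s:\n", "            break\n", "        start += step\n", "    return windows\n", "\n", "# Example: a hypothetical 90-second recording\n", "for start, end in chunk_windows(90.0):\n", "    print(f\"chunk from {start:5.1f}s to {end:5.1f}s\")\n", "```\n", "\n",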
\n", "\n", "\"segmentation\"\n", "\n", "
\n", "\n", "We will modify our code by adding the following lines of code to our\n", "pipeline: `chunk_length_s=25,` `batch_size=16,` and\n", "`stride_length_s=(4, 2)`. The `chunk_length_s` argument specifies the\n", "length of each individual chunk to be cut from our audio sample. The\n", "`stride_length_s` argument specifies the length of each *stride*, the\n", "overlapping section between individual chunks. The `batch_size` argument\n", "specifies how many chunks whisper should process at once.\n", "\n", "``` python\n", "pipe = pipeline( \n", " \"automatic-speech-recognition\", #specifies what we want our model to do\n", " model=model,\n", " tokenizer=processor.tokenizer,\n", " feature_extractor=processor.feature_extractor,\n", " max_new_tokens=128,\n", " chunk_length_s=25,\n", " stride_length_s=(4, 2),\n", " batch_size=16,\n", " torch_dtype=torch_dtype,\n", " device=device,\n", ")\n", "```\n", "\n", "Let’s test out our long-form transcription model on the parliamentary\n", "debate sample we saw earlier.\n", "\n", "``` python\n", "result = pipe(\"audio samples/House_debates_CPC_motion_of_non-confidence_against_Trudeau's_carbon_tax_CANADIAN POLITICS_resampled.mp3\" )\n", "print(result[\"text\"])\n", "```\n", "\n", "## 5. Transcribing multi-speaker audio\n", "\n", "***Speech diarization*** is the process of partitioning audio containing\n", "human speech into segments according to the identity of each speaker\n", "$^{[7]}$.\n", "\n", "
\n", "\n", "\"speakers\"\n", "\n", "
\n", "\n", "Most audio contains more than one speaker. Thus, diarization can be a\n", "useful tool for determining who is speaking, and when. Whisper, on its\n", "own, does not support speaker diarization. For this reason, we’ll be\n", "combining a number of tools to allow us to diarize audio output. Namely,\n", "we’ll be using *pyannote*, a open-source toolkit for speaker diarization\n", "in python $^{[8]}$, and *pyannote-whisper*, a python library that\n", "extends pyannote diarization to whisper ASR $^{[9]}$.\n", "\n", "### 5.1 Installations\n", "\n", "*Pyannote* is built on a number of libraries that require huggingface\n", "access tokens to access. Therefore, the first thing we’ll need to do is\n", "create an account on [huggingface](https://huggingface.co/) and create\n", "our own personal access token.\n", "\n", "1. Go to https://huggingface.co/join and create an account.\n", "2. Navigate to **settings** by pressing the circular button at the top\n", " right of the screen.\n", "\n", "
\n", "\n", "\"speakers\"\n", "\n", "
\n", "\n", "1. Nagivate to the left-hand side of the screen and press **Access\n", " Tokens**.\n", "\n", "
\n", "\n", "\"speakers\"\n", "\n", "
\n", "\n", "1. Press **New token**, and create a new token. Make sure the token\n", " type is a *read* token. Then, **copy your token**.\n", "\n", "
\n", "\n", "\"speakers\"\n", "\"speakers\"\n", "\n", "
\n", "\n", "1. Head over to https://huggingface.co/pyannote/segmentation-3.0 and\n", " accept the user license. Make sure you do this while logged in.\n", "2. Head over to https://huggingface.co/pyannote/speaker-diarization-3.1\n", " and accept the user license. Make sure you do this while logged in.\n", "3. Lastly, we’ll need to install pyannote. Head over to your conda\n", " navigator in administrator mode, activate your environment, and\n", " enter `pip install pyannote.audio`.\n", "\n", "You should now be all set!\n", "\n", "Warning: When running the cell below, you may get a\n", "warning stating that you must accept the user agreements for a few other\n", "libraries. Please accept the user agreements for those libraries as well\n", "(they will be linked in the error message) and re-run the cell below.\n", "\n", "### 5.2 Diarization\n", "\n", "Let’s transcribe and diarize the CBC interview we played earlier. The\n", "first thing we must do is import the pipeline from pyannote and\n", "authenticate ourselves.\n", "\n", "``` python\n", "from pyannote.audio import Pipeline\n", "pipeline = Pipeline.from_pretrained(\n", " \"pyannote/speaker-diarization-3.1\",\n", " use_auth_token=\"INSERT_TOKEN_HERE\") #replace this with your authentication token!\n", "```\n", "\n", "We’ll also need to specify the number of speakers in the audio. If you\n", "are unsure about the number of speakers, you can enter `none` for one or\n", "all of the categories below. The pipeline segment below *diarizes* our\n", "audio into the individual speakers, without transcribing it.\n", "\n", "``` python\n", "who_speaks_when = pipeline(\"audio samples\\House_debates_CPC_motion_of_non-confidence_against_Trudeau's_carbon_tax_CANADIAN_POLITICS_resampled.mp3\",\n", " num_speakers=2, \n", " min_speakers=2, \n", " max_speakers=2) #since this code is diarizing the entire audio file, it may take a while to run!\n", "```\n", "\n", "We can take a look at the contents of our audio by running the result of\n", "the pipeline. We can see that there are two speakers in this interview.\n", "\n", "``` python\n", "who_speaks_when\n", "```\n", "\n", "Now that we’ve diarized our audio sample, let’s transcribe it using\n", "whisper. We’ll also add timestamps and speaker identifiers. The OpenAI\n", "whisper model is better suited for diarization, therefore, we’ll be\n", "working with the whisper-small model rather than the distill-whisper\n", "model.\n", "\n", "``` python\n", "# load OpenAI Whisper ASR\n", "import whisper\n", "\n", "# choose among \"tiny\", \"base\", \"small\", \"medium\", \"large\"\n", "# see https://github.com/openai/whisper/\n", "model = whisper.load_model(\"small\") \n", "```\n", "\n", "``` python\n", "audio_file = \"audio samples\\House_debates_CPC_motion_of_non-confidence_against_Trudeau's_carbon_tax_CANADIAN_POLITICS_resampled.mp3\"\n", "\n", "from pyannote.audio import Audio\n", "\n", "for segment, _, speaker in who_speaks_when.itertracks(yield_label=True): #iterating over segments of the audio file and creating speaker labels\n", " waveform, sample_rate = audio.crop(audio_file, segment) # extract the waveform data and sampling rate\n", " text = model.transcribe(waveform.squeeze().numpy())[\"text\"] # transcribes the speech in the segment into text\n", " print(f\"{segment.start:06.1f}s {segment.end:06.1f}s {speaker}: {text}\") #formats start and end times\n", "```\n", "\n", "As we can see, the individual speakers have successfully been segmented.\n", "However, the resulting output is a bit untidy. 
"Let’s clean it up by assigning names to our two speakers, fixing the\n", "timestamps, and adding vertical spacing between each speaker.\n", "\n", "``` python\n", "from pyannote.audio import Audio\n", "\n", "audio = Audio(sample_rate=16000, mono=\"downmix\") # helper used to crop waveform segments out of the audio file\n", "\n", "def rename_speaker(speaker):\n", "    if speaker == \"SPEAKER_00\":\n", "        return \"Mike\"\n", "    elif speaker == \"SPEAKER_01\":\n", "        return \"Todd\"\n", "    # Add more elif conditions if you are using a different audio with more speakers\n", "    else:\n", "        return speaker\n", "\n", "# Function to format output for each speaker\n", "def format_speaker_output(segment, speaker, text):\n", "    start_minutes, start_seconds = divmod(segment.start, 60)\n", "    formatted_output = f\"{int(start_minutes):02d}:{start_seconds:04.1f} - {rename_speaker(speaker)}: {text}\"\n", "    return formatted_output\n", "\n", "audio_file = \"audio samples/House_debates_CPC_motion_of_non-confidence_against_Trudeau's_carbon_tax_CANADIAN POLITICS_resampled.mp3\"\n", "# feel free to try any other audio file\n", "\n", "\n", "for segment, _, speaker in who_speaks_when.itertracks(yield_label=True):\n", "    waveform, sample_rate = audio.crop(audio_file, segment)\n", "    text = model.transcribe(waveform.squeeze().numpy())[\"text\"]\n", "    formatted_output = format_speaker_output(segment, speaker, text)\n", "    print(formatted_output)\n", "    print()\n", "```\n", "\n", "As we can see, the transcription isn’t perfect: Whisper often struggles\n", "with last names that were not included in its training data. For this\n", "reason, it’s important to proofread the resulting output. If you’d like\n", "to try out different audio, add your audio to the `audio samples` folder\n", "and repeat the process using the correct file path. Additionally, if\n", "you’d like to use a different model, a full list can be found\n", "[here](https://github.com/openai/whisper).\n", "\n", "### 5.2.1 (Optional) Converting ASR transcription output into a PDF document\n", "\n", "The sample code below uses the reportlab library to automatically\n", "convert the generated output into a PDF, and serves as an example of an\n", "application for ASR transcription. As mentioned, proofreading is still\n", "very important, as transcriptions may have errors. Feel free to edit the\n", "formatting to your liking.\n",
"Make sure to install *reportlab* using `pip install reportlab` if it is\n", "not installed already.\n", "\n", "``` python\n", "from pyannote.audio import Audio\n", "from reportlab.lib.pagesizes import letter\n", "from reportlab.lib.units import inch\n", "from reportlab.platypus import SimpleDocTemplate, Paragraph, Spacer\n", "from reportlab.lib.styles import getSampleStyleSheet, ParagraphStyle\n", "\n", "def generate_pdf(formatted_output_list, output_file=\"output_transcription_whisper.pdf\"):\n", "    doc = SimpleDocTemplate(output_file, pagesize=letter)\n", "    styles = getSampleStyleSheet()\n", "\n", "    custom_style = ParagraphStyle(\"CustomStyle\", parent=styles[\"Normal\"], fontName=\"Times-Roman\", spaceAfter=8) # PDF formatting specifications\n", "\n", "    content = [] # empty list used to store paragraphs and spacers\n", "\n", "    for formatted_output in formatted_output_list:\n", "        content.append(Paragraph(formatted_output, custom_style))\n", "        content.append(Spacer(1, 0.2 * inch))\n", "    # for-loop used to iterate over each formatted output string; creates and appends new paragraphs to the `content` list.\n", "\n", "    doc.build(content) # generates the PDF\n", "\n", "def rename_speaker(speaker):\n", "    if speaker == \"SPEAKER_00\":\n", "        return \"Mike\"\n", "    elif speaker == \"SPEAKER_01\":\n", "        return \"Todd\"\n", "    # Add more elif conditions if you are using a different audio with more speakers\n", "    else:\n", "        return speaker\n", "\n", "formatted_output_list = []\n", "\n", "# Function to format output for each speaker\n", "def format_speaker_output(segment, speaker, text):\n", "    start_minutes, start_seconds = divmod(segment.start, 60)\n", "    formatted_output = f\"{int(start_minutes):02d}:{start_seconds:04.1f} - {rename_speaker(speaker)}: {text}\"\n", "    return formatted_output\n", "\n", "audio = Audio(sample_rate=16000, mono=\"downmix\") # helper used to crop waveform segments out of the audio file\n", "\n", "audio_file = \"audio samples/House_debates_CPC_motion_of_non-confidence_against_Trudeau's_carbon_tax_CANADIAN POLITICS_resampled.mp3\"\n", "# feel free to try any other audio file\n", "\n", "\n", "for segment, _, speaker in who_speaks_when.itertracks(yield_label=True):\n", "    waveform, sample_rate = audio.crop(audio_file, segment)\n", "    text = model.transcribe(waveform.squeeze().numpy())[\"text\"]\n", "    formatted_output = format_speaker_output(segment, speaker, text)\n", "    formatted_output_list.append(formatted_output)\n", "\n", "generate_pdf(formatted_output_list)\n", "```\n", "\n", "## 6. Resources\n", "\n", "### 6.1 In-text citations\n", "\n", "1. https://huggingface.co/learn/audio-course/en/chapter5/asr_models\n", "2. https://en.wikipedia.org/wiki/Whisper\\_(speech_recognition_system)\n", "3. https://github.com/huggingface/distil-whisper\n", "4. https://github.com/sanchit-gandhi/whisper-jax\n", "5. https://github.com/Vaibhavs10/insanely-fast-whisper\n", "6. https://huggingface.co/docs/transformers/en/index\n", "7. https://en.wikipedia.org/wiki/Sampling\\_(signal_processing)\n", "8. https://github.com/pyannote/pyannote-audio\n", "9. https://github.com/yinruiqing/pyannote-whisper\n", "\n", "### 6.2 Audio sources\n", "\n", "1. https://mixkit.co/free-sound-effects/cat/\n", "2. https://www.youtube.com/watch?v=a32RLgqNfGs\n", "3. https://www.youtube.com/watch?v=9x6IN_zOvoQ&t=11s\n", "4. https://www.youtube.com/watch?v=gkxxtP9F6FY\n", "\n", "### 6.3 Additional resources\n", "\n", "1. 
https://www.youtube.com/watch?v=wjZofJX0v4M&t=193s" ], "id": "e1c0acf5-4fdd-40fe-9651-739d8fe0dd13" } ], "nbformat": 4, "nbformat_minor": 5, "metadata": { "kernelspec": { "name": "python3", "display_name": "Python 3 (ipykernel)", "language": "python", "path": "/usr/local/share/jupyter/kernels/python3" }, "language_info": { "name": "python", "codemirror_mode": { "name": "ipython", "version": "3" }, "file_extension": ".py", "mimetype": "text/x-python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.12.3" } } }