{
"cells": [
{
"cell_type": "markdown",
"metadata": {},
"source": [
"# 4.3.1 - Advanced - Transcription\n",
"\n",
"COMET Team
*Irene Berezin*\n",
"\n",
"------------------------------------------------------------------------\n",
"\n",
"## Prerequisites\n",
"\n",
"- Have installed Anaconda Navigator and Python on your computer\n",
"\n",
"## Learning outcomes\n",
"\n",
"- Understand the basic mechanics behild audio transcription\n",
"- Be familiar with the various elements of Whisper audio transcription\n",
"- Be able to transcribe and diarize short-form and long-form audio\n",
"\n",
"------------------------------------------------------------------------\n",
"\n",
"## 1. Introduction\n",
"\n",
"\n",
"\n",
"### 1.1 What is audio transcription?\n",
"\n",
"Audio transcription is the language processing task of converting audio\n",
"files containing human speech into text using a computer. This task most\n",
"often includes the process of *diarization*, the process of\n",
"distinguishing and labeling the various speakers in the audio file.\n",
"Application of audio transcription include multi-lingual captions on\n",
"online videos, real-time online meeting transcription, and much more.\n",
"\n",
"*Automatic speech recognition (ASR) systems* are interfaces that use\n",
"machine learning/artificial intelligence to process speech audio files\n",
"into text. In this module, we will be using the open-source ASR system\n",
"*Whisper* by OpenAI to transcribe and diarize various audio files.\n",
"\n",
"### 1.2 What is *Whisper*?\n",
"\n",
"Whisper is a ASR model for transcription and speech recognition designed\n",
"by OpenAI. Whisper stands out from it’s predecessors due to it being\n",
"trained on roughly 680 thousand hours of labeled audio transcription\n",
"data, signfificantly more than the models that came before it; thus,\n",
"Whisper exhibits a much higher accuracy when tested against audio data\n",
"outside of it’s training set compared to older models such as\n",
"*Wav2Vec*$^{[1]}$.\n",
"\n",
"#### 1.2.1 How does Whisper work?\n",
"\n",
"Whisper, and audio transcription models in general, work by converting\n",
"raw audio data into a spectrogram, specifically a *Log-Mel spectrogam*,\n",
"which plots the time on the x-axis, the mels scale (a logarithmic form\n",
"of the Hertz frequency) on the y-axis, and colors the data with respect\n",
"to the amplitude of the audio data at each mel frequency and point in\n",
"time.\n",
"\n",
"The mel-spectogram is then ran though a tokenizer, which converts the\n",
"individual words in the spectrogram into lexical tokens- strings with\n",
"assigned meaning that can be read by the language model. The encoder is\n",
"a stack of transformer blocks that process the tokens, extracting\n",
"features and relationships between different parts of the audio. The\n",
"processed information from the encoder is passed to the decoder, another\n",
"stack of transformer blocks that generate an output sequence (predicting\n",
"the corresponding text captions word by word)$^{[2]}$.\n",
"\n",
"
\n",
"\n",
"#### 1.2.2 Optimizing Whisper transcription\n",
"\n",
"Alongside whisper, there exist many libraries that aim to optimize the\n",
"current whisper model by increasing transcription speed and accuracy.\n",
"Some examples include:\n",
"\n",
"**[Distil-Whisper](https://github.com/huggingface/distil-whisper)**: a\n",
"smaller, optimized version of whisper created by *HuggingFace* using\n",
"knowledge distillation. The Distil-Whisper model claims to be 6x faster,\n",
"50% smaller, and within a 1% word error rate relative to the original\n",
"whisper model $^{[3]}$. \\> Pros: CPU-compatible, significantly faster\n",
"compared to OpenAI’s Whisper model.\n",
"\n",
"> Cons: Only supports english-speech to english-text transcription.\n",
">\n",
"> This is the model that we will be using in this notebook, due to it’s\n",
"> relevance and compatability with our use cases for audio\n",
"> transcription. However, if you have a relatively powerful computer and\n",
"> feel up for the challenge, consider following along with one of the\n",
"> alternatives listed below.\n",
"\n",
"**[Whisper-Jax](https://github.com/sanchit-gandhi/whisper-jax)**:\n",
"Another optimized version of whisper built on the *Transformers*\n",
"library. Whisper-Jax claims to be 70x faster than the original Whisper\n",
"model $^{[4]}$. \\> Pros: CPU-compatible, significantly faster compared\n",
"to OpenAI’s Whisper model.\n",
"\n",
"> Cons: Optimized for GPU/TPU usage.\n",
"\n",
"**[Insanely-Fast-Whisper](https://github.com/Vaibhavs10/insanely-fast-whisper)**:\n",
"A command-line interface that greatly speeds up whisper performance and\n",
"claims to be able to trascribe 150 minutes of audio in less than 98\n",
"seconds $^{[5]}$.. \\> Pros: One of the fastest versions of whisper\n",
"available today.\n",
"\n",
"> Cons: Only works on NVIDIA GPUs.\n",
"\n",
"------------------------------------------------------------------------\n",
"\n",
"## 2. Installations\n",
"\n",
"### 2.1 Activating conda environment & downloading Jupyter Lab\n",
"\n",
"(If you’ve already done this, please move on to section 2.2)\n",
"\n",
"#### 2.1.1 Setting up and activating a conda envornment\n",
"\n",
"An *environment* is a repository of packages that you have installed on\n",
"your computer. It acts similar to a virtual machine, keeping the\n",
"packages needed for one project seperate from other projects, to avoid\n",
"version conflicts, cluttering, etc.\n",
"\n",
"Let’s start by opening up the conda command prompt. 1) On windows 11,\n",
"press the windows icon at the bottom of the screen. 2) Press *“all\n",
"apps”*, and open the `anaconda3 (64bit)` folder. 3) Left-click on\n",
"`anaconda prompt`, select `more`, and press `run as administrator`. This\n",
"will open the command prompt window. 4) Lets create a new environment\n",
"and call it *“whisper”*. In the command prompt, copy-and-paste the\n",
"following line of code: `conda create -n whisper`. 5) Let’s activate our\n",
"new environment. Once your new environment is created, type\n",
"`conda activate whisper`.\n",
"\n",
"We’ve successfully created and activated our environment.\n",
"\n",
"#### 2.1.2 Installing and opening Jupyter lab\n",
"\n",
"To install jupyter, type in the following line of code:\n",
"`conda install jupyter`. Once jupyter is finished installing, simply\n",
"type `jupyter lab` in the command prompt. This will open up jupyter\n",
"locally in your default browser.\n",
"\n",
"Note: these steps only need to be done once on each computer. The\n",
"next time you wish to open jupyter locally, you only need to activate\n",
"your conda environment and type in “jupyter lab” in the conda prompt.\n",
"\n",
"Warning: Make sure not to close the anaconda prompt while jupyter\n",
"is running. Doing so will cause Jupyter to lose connection and may\n",
"result in you losing unsaved work.\n",
"\n",
"### 2.2 Installaling Whisper\n",
"\n",
"#### 2.2.1 Installing Pytorch\n",
"\n",
"Lets start off by installing *PyTorch*, a machine learning library based\n",
"on the Torch frame work, on which Whisper is built on. To install\n",
"pytorch, open the conda prompt as an administrator, ensure that you are\n",
"in the `whisper` enviornment that we created, and type in the following\n",
"line of code:\n",
"\n",
"`conda install pytorch torchvision torchaudio cpuonly -c pytorch`\n",
"\n",
"If, for some reason, the installation does not work, you can also\n",
"install pytorch through pip:\n",
"\n",
"`pip3 install torch torchvision torchaudio`\n",
"\n",
"Note: This installation is CPU only. If you have a NVIDIA GPU and would\n",
"like to run whisper on your GPU, download\n",
"CUDA and\n",
"follow the PyTorch GPU setup\n",
"here.\n",
"\n",
"