The Future of Automatic Speech Recognition - Whisper


Whisper is an automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. The size and diversity of this dataset make the system more robust to accents, background noise, and technical language. Beyond transcribing speech in multiple languages, it can also translate speech from those languages into English. To promote useful applications and encourage further research on robust speech processing, the models and inference code have been open-sourced.

The Whisper architecture is a simple end-to-end approach, implemented as an encoder-decoder Transformer. Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed into an encoder. A decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.
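To make the chunking step concrete, here is a small arithmetic sketch in Python. It is a simplification, not Whisper's actual code: the 16 kHz sample rate and the hop of 160 samples are not stated in this post (they are the values I understand the released models to use, so treat them as assumptions), and the helper name `split_into_chunks` is made up for illustration.

```python
SAMPLE_RATE = 16_000   # assumed: samples per second after resampling
CHUNK_SECONDS = 30     # from the text: audio is processed in 30-second chunks
HOP_LENGTH = 160       # assumed: samples between consecutive spectrogram frames

SAMPLES_PER_CHUNK = SAMPLE_RATE * CHUNK_SECONDS      # 480,000 samples
FRAMES_PER_CHUNK = SAMPLES_PER_CHUNK // HOP_LENGTH   # 3,000 spectrogram frames


def split_into_chunks(samples):
    """Split a list of samples into 30-second chunks, zero-padding the last one."""
    chunks = []
    for start in range(0, len(samples), SAMPLES_PER_CHUNK):
        chunk = samples[start:start + SAMPLES_PER_CHUNK]
        # Pad a short final chunk with silence so every chunk has the same length.
        chunk = chunk + [0.0] * (SAMPLES_PER_CHUNK - len(chunk))
        chunks.append(chunk)
    return chunks
```

Under these assumptions, a 70-second recording becomes three chunks of 480,000 samples each, and each chunk is turned into a spectrogram with 3,000 time frames before it reaches the encoder.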

Let's get our hands dirty!

How to use Whisper from the command line

To use the Whisper command line interface (CLI) on your local machine, you first need Python, PyTorch, and FFmpeg installed on your system. Here's how you can get started:

  • Install Python from the official Python website, following the installation instructions provided there.
  • Next, install PyTorch. The PyTorch website generates the pip install command that corresponds to your system's compute platform.
  • Finally, install FFmpeg by following the installation instructions on the FFmpeg website.
  • Alternatively, if you're a Windows user, you can install the Chocolatey package manager and then install FFmpeg with the following command:
  choco install ffmpeg
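Before moving on, it can help to confirm that all three prerequisites are actually visible from your Python environment. The script below is a minimal sketch using only the standard library; the function name `check_prerequisites` is my own, and the minimum Python version in it is an assumption, so check the Whisper README for the versions it currently supports.

```python
import importlib.util
import shutil
import sys


def check_prerequisites():
    """Return a dict mapping each prerequisite name to True (found) or False."""
    return {
        # Assumed minimum version; adjust to whatever the Whisper README states.
        "python": sys.version_info >= (3, 8),
        # find_spec returns None when the package cannot be imported.
        "torch": importlib.util.find_spec("torch") is not None,
        # shutil.which returns None when the executable is not on PATH.
        "ffmpeg": shutil.which("ffmpeg") is not None,
    }


if __name__ == "__main__":
    for name, ok in check_prerequisites().items():
        print(f"{name}: {'found' if ok else 'MISSING'}")
```

If any line reports MISSING, revisit the corresponding step above before installing Whisper.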

Now that you have Python, PyTorch, and FFmpeg installed, you're ready to install Whisper itself. Simply run the following command to install the latest version of the package:

  pip install -U openai-whisper

Assuming that you have all your audio files in a single folder, here's how you can use the Whisper command line interface to transcribe them:

  • First, navigate to the folder that contains all your audio files.
  • Open the command prompt in that folder.
  • To transcribe a specific audio file, simply run the following command:

Note: The first run takes longer because Whisper has to download the model before it can transcribe.

  whisper "file name" 

If you don't pass a --model argument, Whisper uses its default model. Which model that is depends on the installed version, and the help output lists it. To pick a model yourself (tiny, base, small, medium, or large), specify it with the --model argument. For example, to use the medium model, run the following command:

  whisper "file name" --model medium

If you already know the language of the recording, you can specify it with the --language argument so that Whisper does not have to detect it:

  whisper "file name" --language English
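If your folder holds many audio files, typing one command per file gets tedious. The following is a minimal Python sketch that builds the same CLI commands shown above for every audio file in a folder and runs them one after another. The helper names and the list of file extensions are my own illustrative choices; only the whisper command and its --model and --language arguments come from this post.

```python
import subprocess
from pathlib import Path

# Illustrative list of extensions; extend it to match your own files.
AUDIO_SUFFIXES = {".mp3", ".wav", ".m4a", ".flac"}


def build_commands(folder, model=None, language=None):
    """Build one whisper CLI command (as an argument list) per audio file."""
    commands = []
    for path in sorted(Path(folder).iterdir()):
        if path.suffix.lower() not in AUDIO_SUFFIXES:
            continue  # skip anything that does not look like audio
        cmd = ["whisper", str(path)]
        if model:
            cmd += ["--model", model]
        if language:
            cmd += ["--language", language]
        commands.append(cmd)
    return commands


def transcribe_folder(folder, model=None, language=None):
    """Run the commands one by one; raises if any whisper call fails."""
    for cmd in build_commands(folder, model, language):
        subprocess.run(cmd, check=True)
```

For example, `transcribe_folder(".", model="medium", language="English")` would transcribe every matching file in the current folder. Passing the arguments as a list avoids quoting problems with file names that contain spaces.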

To see all the available options and commands for Whisper, you can run the following command:

  whisper --help

Once the file is transcribed, Whisper writes the transcript in several formats: json, srt, tsv, txt, and vtt.
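The JSON output is the most convenient one to process further. Here is a minimal Python sketch that prints each transcribed segment with its start and end time. It assumes the JSON contains a "segments" list whose entries have "start", "end" (in seconds), and "text" fields, which is how I understand Whisper's output to be laid out; the sample data in the code is made up for illustration and is not real Whisper output.

```python
import json


def format_timestamp(seconds):
    """Format a time in seconds as HH:MM:SS.mmm."""
    millis = round(seconds * 1000)
    hours, millis = divmod(millis, 3_600_000)
    minutes, millis = divmod(millis, 60_000)
    secs, millis = divmod(millis, 1000)
    return f"{hours:02d}:{minutes:02d}:{secs:02d}.{millis:03d}"


def segment_lines(result):
    """Return one 'start --> end  text' line per segment in the parsed JSON."""
    return [
        f"{format_timestamp(seg['start'])} --> {format_timestamp(seg['end'])}  {seg['text'].strip()}"
        for seg in result.get("segments", [])
    ]


# Made-up sample shaped like the assumed layout, not real Whisper output.
sample = json.loads(
    '{"text": " Hello world.", "language": "en", '
    '"segments": [{"start": 0.0, "end": 2.5, "text": " Hello world."}]}'
)
print("\n".join(segment_lines(sample)))
```

To use it on a real transcript, load the generated file with `json.load` instead of the inline sample.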

You can find the original audio and the transcribed files at the links below.

Audio file: Audio File

Transcribed text file: Transcribed text

Transcribed JSON file: Transcribed JSON

Why whisper when you can use Whisper? It's like having a language interpreter for your inner thoughts - minus the awkward misunderstandings.

Stay tuned for the next post where we'll explore how to integrate Whisper into your Python programs. With Whisper's powerful automatic speech recognition capabilities, the possibilities are endless!
