Introduction
Whisper is a cutting-edge automatic speech recognition (ASR) system trained on 680,000 hours of multilingual and multitask supervised data collected from the web. The size and diversity of this dataset give the model improved robustness to accents, background noise, and technical language. Beyond transcribing speech in multiple languages, the system can also translate those languages into English. To promote useful applications and encourage further research on robust speech processing, the models and inference code have been open-sourced.

The Whisper architecture is a simple end-to-end approach, implemented as an encoder-decoder Transformer. Input audio is split into 30-second chunks, converted into a log-Mel spectrogram, and then passed into an encoder. A decoder is trained to predict the corresponding text caption, intermixed with special tokens that direct the single model to perform tasks such as language identification, phrase-level timestamps, multilingual speech transcription, and to-English speech translation.
Let’s get our hands dirty!
How to use Whisper with the command line
If you want to use the Whisper command line interface (CLI) on your local machine, there are a few prerequisites you need to install first. Specifically, you'll need Python, PyTorch, and FFmpeg on your system. Here's how you can get started:

- To install Python, head over to the official Python website and follow the installation instructions provided.
- Next, you'll need to install PyTorch. Visit the PyTorch website and obtain the pip install command that corresponds to your system's compute platform (see the example after this list).
- Finally, you'll need to install FFmpeg. Visit the FFmpeg website and follow the installation instructions provided.
- Alternatively, if you're a Windows user, you can download and install the Chocolatey package manager. Once installed, use the following command to install FFmpeg on your system:
choco install ffmpeg
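For reference, the pip command that the PyTorch site generates usually looks something like the one below. Treat this as a sketch: the exact package list and index URL depend on your OS and compute platform, so copy the command the site gives you rather than this one.
pip3 install torch torchvision torchaudio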
Now that you have Python, PyTorch, and FFmpeg installed, you're ready to install Whisper itself. Simply run the following command to install the latest version of the package:
pip install -U openai-whisper
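If you want to double-check that the package installed correctly, pip can report its version and install location:
pip show openai-whisper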
Assuming that you have all your audio files in a single folder, here's how you can use the Whisper command line interface to transcribe them:
- First, navigate to the folder that contains all your audio files.
- Open the command prompt in that folder.
- To transcribe a specific audio file, simply run the following command:
Note: The first run will take a minute, as Whisper has to download the base model.
whisper "file name"
Note that by default, Whisper will use the base model to transcribe the audio file. If you want to use a different model (tiny, base, small, medium, large), you can specify it with the --model argument. For example, to use the medium model, you would run the following command:
whisper "file name" --model medium
You can also specify the language manually, if it is known:
whisper "file name" --language English

To see all the available options and commands for Whisper, you can run the following command:
whisper --help
Once the file is transcribed, Whisper will produce output files in several formats: JSON, SRT, TSV, TXT, and VTT.
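If you only need one of those formats, the --output_format flag restricts what gets written. For example, to produce just an SRT subtitle file:
whisper "file name" --output_format srt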

Find the original audio and the transcribed files in the links below:
- Audio file
- Transcribed text file
- Transcribed JSON file
Why whisper when you can use Whisper? It's like having a language interpreter for your inner thoughts – minus the awkward misunderstandings.
Stay tuned for the next post, where we'll explore how to integrate Whisper into your Python programs. With Whisper's powerful automatic speech recognition capabilities, the possibilities are endless!