This project provides a FastAPI-based application for Automatic Speech Recognition (ASR) of Hindi audio using a pre-trained NVIDIA NeMo model. The model is optimized for inference using ONNX.
- ASR Endpoint: Exposes a `/transcribe` POST endpoint that accepts `.wav` audio files (a minimal sketch of such an endpoint appears after this list).
- NVIDIA NeMo Model: Utilizes the `stt_hi_conformer_ctc_medium` model from NVIDIA NeMo.
- ONNX Optimization: The NeMo model is loaded and run via ONNX for efficient inference.
- Audio Preprocessing: Handles audio files, including resampling to 16kHz and clipping to a maximum duration (currently 10 seconds).
- Concurrent Request Handling: Designed to efficiently process multiple simultaneous transcription requests.
- Containerized: The application is packaged as a Docker image for easy deployment.
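For orientation, here is a minimal sketch of what such a `/transcribe` endpoint can look like. This is an illustration only, not the project's actual `app.py`; `load_model`, `preprocess`, and `run_inference` are hypothetical placeholders for the app's preprocessing and ONNX inference logic.

```python
# Illustrative sketch of a /transcribe endpoint (not the project's app.py).
from fastapi import FastAPI, File, UploadFile

app = FastAPI()

# In the real application the ONNX model is loaded once at startup rather than
# per request; the helpers referenced below are hypothetical placeholders.
# model = load_model("nemo_models/stt_hi_conformer_ctc_medium.onnx")

@app.post("/transcribe")
async def transcribe(audio_file: UploadFile = File(...)):
    audio_bytes = await audio_file.read()
    # samples, note = preprocess(audio_bytes)   # resample to 16 kHz, clip to 10 s
    # text = run_inference(model, samples)      # ONNX inference + CTC decoding
    text, note = "...", None
    return {"transcription": text, "note": note}
```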
See the application in action: a short animated GIF and a 25-second YouTube video demonstrate the application's functionality, along with screenshots of the ASR App Home Page and the ASR App Result Page.
- Docker Desktop (or Docker Engine) installed on your system.
-   Navigate to the project directory:

    ```bash
    cd /path/to/your/nemo-asr-app
    ```

    (Replace `/path/to/your/nemo-asr-app` with the actual path where your `Dockerfile`, `app.py`, `requirements.txt`, and `nemo_models` directory are located.)

-   Build the Docker image:

    ```bash
    docker build -t hindi-asr-app .
    ```

    This command builds the Docker image named `hindi-asr-app`. The build process might take some time as it downloads dependencies and models.
    Note: The Docker build process will automatically download the required models:

    - NeMo model: `stt_hi_conformer_ctc_medium.nemo`
    - ONNX model: `stt_hi_conformer_ctc_medium.onnx`

    These will be stored within the `nemo_models` directory inside the container.
-   Run the Docker container:

    ```bash
    docker run -p 8000:8000 hindi-asr-app
    ```

    This command starts the FastAPI application inside a Docker container and maps port `8000` from the container to port `8000` on your host machine.

    You should see output similar to this, indicating the FastAPI server has started:

    ```
    INFO:     Started server process [1]
    ...
    INFO:     Application startup complete.
    INFO:     Uvicorn running on http://0.0.0.0:8000 (Press CTRL+C to quit)
    ```
Once the Docker container is running, you can test the `/transcribe` endpoint. You'll need a sample `.wav` audio file (e.g., named `hindi.wav` or `hindi2.wav`) for testing. Make sure it's a 16kHz WAV file for optimal performance.
-   Create a sample audio file: For example, record a short Hindi phrase and save it as `hindi.wav`. Ensure it's a WAV file.

-   Send the `curl` request: Open a new terminal window (keep the Docker container running in the first terminal) and execute the following `curl` command:

    ```bash
    curl -X POST http://localhost:8000/transcribe -F 'audio_file=@hindi.wav'
    ```

    You should receive a JSON response similar to this:

    ```json
    {"transcription":"बंगाल की खाडी के शीर्ष तट से एक सौ अस्सी किलोमीटर दूर हुगली नदी के बाएं किनारे पर स्थित कोलकाता पश्चिम बंगाल की राजधानी है","note":"Input audio was 10.58s long. It has been clipped to the first 10.0s for processing as per requirements."}
    ```
Another example:

```bash
curl -X POST http://localhost:8000/transcribe -F 'audio_file=@hindi3.wav'
```

Sample response:

```json
{"transcription":"बंगाल की खाडी के शीर्ष तट से एक अस्ठी किलोमीटर दूर हुगली नदी के बाएं किनारे"}
```
-   Slim Base Image: Using `python:3.10-slim-buster` as the base image for the Dockerfile helps in creating a smaller final image size, which is beneficial for deployment and faster downloads.

-   Layer Caching: The Dockerfile is structured to take advantage of Docker's layer caching. Dependencies like `ffmpeg`, `sox`, `g++`, `torch`, `numpy`, and `Cython` are installed in earlier layers. This means that if `requirements.txt` or the application code changes, these heavier layers don't need to be rebuilt, speeding up subsequent builds.

-   ONNX for Inference: While NeMo models can be used directly, converting and loading them as ONNX offers potential performance benefits (lower latency, reduced memory footprint) for production inference, as ONNX runtimes are highly optimized. A short export/inference sketch appears after this list.

-   Audio Preprocessing: The application includes logic to resample audio to 16kHz and clip it to a maximum of 10 seconds. This is crucial for consistency with the model's training data and to prevent excessively long processing times for very long audio inputs, aligning with the requirement of handling 5-10 second clips (see the preprocessing sketch after this list).

-   Error Handling (Basic): The application handles cases where `soundfile` might fail to read the audio directly (e.g., if the WAV header is malformed or it's not a standard PCM WAV), falling back to `pydub` and `ffmpeg` for conversion. This increases robustness; the preprocessing sketch after this list also illustrates this fallback.
-   Asynchronous Request Handling: For this CPU-bound, ONNX-optimized ASR model, the core inference operation is synchronous. However, by leveraging FastAPI with Uvicorn (which runs an `asyncio` event loop and can be configured with multiple worker processes), the application can handle concurrent requests efficiently: while a synchronous inference is running for one request, other incoming requests are not blocked and can be served by available Uvicorn workers. This setup is well suited to the stated requirement of serving multiple concurrent, CPU-bound transcription requests for short (5-10 second) audio clips, balancing implementation simplicity with robust concurrent handling at the current scale. A small worker-configuration sketch appears after this list.

-   Model and Tokenizer Loading: The NeMo model is loaded once at application startup, which avoids repeated loading overhead for each request and improves responsiveness.
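As a companion to the ONNX point above, here is a rough sketch of how the export and an ONNX Runtime session could be set up. It assumes NeMo's built-in `export()` is used and that inference runs on CPU; the exact input/output tensor names, feature extraction, and CTC decoding depend on the exported graph and are not taken from this project's code.

```python
# Sketch: export the NeMo model to ONNX once, then load it with ONNX Runtime.
# Assumes nemo_toolkit[asr] and onnxruntime are installed.
import nemo.collections.asr as nemo_asr
import onnxruntime as ort

# One-time export (e.g. during the Docker build).
model = nemo_asr.models.ASRModel.from_pretrained("stt_hi_conformer_ctc_medium")
model.export("nemo_models/stt_hi_conformer_ctc_medium.onnx")

# At serving time, create a CPU inference session from the exported graph.
session = ort.InferenceSession(
    "nemo_models/stt_hi_conformer_ctc_medium.onnx",
    providers=["CPUExecutionProvider"],
)
print([inp.name for inp in session.get_inputs()])  # inspect expected inputs
```

Feeding the session requires the same log-mel features and CTC decoding that the NeMo model uses; those steps are omitted here for brevity.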
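The audio handling described under "Audio Preprocessing" and "Error Handling (Basic)" could be sketched roughly as follows. The function name, the naive linear-interpolation resampling, and the exact thresholds are illustrative rather than copied from `app.py`.

```python
# Sketch: load a WAV robustly, convert to mono 16 kHz, and clip to 10 seconds.
# soundfile is tried first; pydub (which uses ffmpeg) is the fallback decoder.
import io

import numpy as np
import soundfile as sf
from pydub import AudioSegment

TARGET_SR = 16000     # model expects 16 kHz input
MAX_SECONDS = 10.0    # clip longer inputs to the first 10 s

def load_and_preprocess(audio_bytes: bytes) -> np.ndarray:
    try:
        samples, sr = sf.read(io.BytesIO(audio_bytes), dtype="float32")
    except RuntimeError:
        # Fallback: let pydub/ffmpeg decode non-standard or malformed WAVs.
        seg = AudioSegment.from_file(io.BytesIO(audio_bytes))
        seg = seg.set_frame_rate(TARGET_SR).set_channels(1).set_sample_width(2)
        samples = np.array(seg.get_array_of_samples(), dtype=np.float32) / 32768.0
        sr = TARGET_SR
    if samples.ndim > 1:      # mix stereo down to mono
        samples = samples.mean(axis=1)
    if sr != TARGET_SR:       # naive resample; librosa/torchaudio also work
        new_len = int(len(samples) * TARGET_SR / sr)
        idx = np.linspace(0, len(samples) - 1, new_len)
        samples = np.interp(idx, np.arange(len(samples)), samples).astype(np.float32)
    return samples[: int(MAX_SECONDS * TARGET_SR)]
```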
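Finally, for the concurrency point, the number of Uvicorn worker processes is the main knob for handling simultaneous CPU-bound requests. A minimal launch sketch, assuming the FastAPI instance is named `app` in `app.py` (the worker count here is an example value, not the project's actual setting):

```python
# Sketch: run the app with several Uvicorn worker processes so one synchronous
# inference does not block other incoming requests.
import uvicorn

if __name__ == "__main__":
    uvicorn.run("app:app", host="0.0.0.0", port=8000, workers=4)
```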