Finetune the Parakeet TDT 0.6B v2 model for a language other than English, say Hindi #13810
deepanshu-yadav started this conversation in Show and tell
After days of going through the documentation, I finally have a way to finetune this model to a new language; I chose Hindi.
You can look at how to prepare the datasets, the SentencePiece tokenizer, the training script, and the training configuration to help you get started.
Here is the notebook you can run on Kaggle: https://github.com/deepanshu-yadav/Hindi_GramVani_Finetune/blob/main/finetuning-parakeet-on-hindi-dataset.ipynb
The code for preparing the manifests is here: https://github.com/deepanshu-yadav/Hindi_GramVani_Finetune/blob/main/prepare_manifest.py
The code for tokenizing a different language is here: https://github.com/deepanshu-yadav/Hindi_GramVani_Finetune/blob/main/tokenize_language.py
Make sure you run the tokenization code only after preparing the manifests.
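For reference, a NeMo ASR manifest is a JSONL file with one entry per utterance. Here is a minimal sketch of what prepare_manifest.py produces; the audio paths and transcripts below are placeholders, the real script derives them from the GramVaani dataset layout:

```python
import json
import soundfile as sf  # assumed available for reading audio durations

# Hypothetical (audio_path, transcript) pairs for illustration only.
samples = [
    ("audio/utt_0001.wav", "नमस्ते दुनिया"),
    ("audio/utt_0002.wav", "यह एक उदाहरण वाक्य है"),
]

with open("train_manifest.json", "w", encoding="utf-8") as f:
    for audio_path, text in samples:
        info = sf.info(audio_path)
        entry = {
            "audio_filepath": audio_path,                # path to the wav file
            "duration": info.frames / info.samplerate,   # length in seconds
            "text": text,                                # Hindi transcript
        }
        # NeMo expects one JSON object per line (JSONL), not a JSON array.
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```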
I had trouble running the script provided in the NeMo repository, so I modified the tokenizer script to convert the data into BPE encodings.
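If you want to reproduce the tokenizer step by hand, here is a minimal sketch of training a SentencePiece BPE model directly on the manifest transcripts with the sentencepiece library; file names and the vocab size are placeholders, and tokenize_language.py is the actual script:

```python
import json
import sentencepiece as spm

# Dump the Hindi transcripts from the manifest into a plain text file,
# one sentence per line, which is what SentencePiece expects as input.
with open("train_manifest.json", encoding="utf-8") as fin, \
     open("train_text.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(json.loads(line)["text"] + "\n")

# Train a BPE tokenizer; vocab_size is a placeholder, tune it to your corpus.
spm.SentencePieceTrainer.train(
    input="train_text.txt",
    model_prefix="tokenizer_hindi",   # produces tokenizer_hindi.model / .vocab
    vocab_size=1024,
    model_type="bpe",
    character_coverage=1.0,           # keep all Devanagari characters
)
```

NeMo's BPE models expect a tokenizer directory containing the resulting .model file (typically named tokenizer.model).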
Do not run this on Google Colab; there is an issue I have filed here: #13734
Already done: freeze the encoder and train only the decoder portion of this model.
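A rough sketch of how that looks, assuming the model is loaded with NeMo's from_pretrained and the tokenizer directory is the one produced above (paths are placeholders; the notebook has the exact steps):

```python
import nemo.collections.asr as nemo_asr

# Load the pretrained English checkpoint.
asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")

# Swap in the Hindi SentencePiece tokenizer trained earlier; this rebuilds the
# output layers for the new vocabulary.
asr_model.change_vocabulary(
    new_tokenizer_dir="tokenizer_hindi_dir",  # placeholder dir containing tokenizer.model
    new_tokenizer_type="bpe",
)

# Freeze the encoder so only the decoder/joint parameters get updated.
asr_model.encoder.freeze()
# Equivalent plain-PyTorch version:
# for p in asr_model.encoder.parameters():
#     p.requires_grad = False
```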
There is much more we can do.
Disclaimer:
We need to run this on a GPU with more computing power than what is available on Kaggle.
We need to run for many more epochs.
This was just to get people started.
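For the epochs and compute point above, a rough sketch of wiring the manifests into a longer run with PyTorch Lightning, continuing from the asr_model loaded in the previous snippet (batch size, epochs, precision, and paths are placeholders; the notebook has the actual configuration):

```python
import pytorch_lightning as pl

# Point the model at the Hindi manifests; field names follow NeMo's usual
# dataset config, values here are placeholders.
asr_model.setup_training_data(train_data_config={
    "manifest_filepath": "train_manifest.json",
    "sample_rate": 16000,
    "batch_size": 8,
    "shuffle": True,
})
asr_model.setup_validation_data(val_data_config={
    "manifest_filepath": "val_manifest.json",
    "sample_rate": 16000,
    "batch_size": 8,
    "shuffle": False,
})

# Run for more epochs than a Kaggle session allows, on a bigger GPU.
trainer = pl.Trainer(accelerator="gpu", devices=1, max_epochs=100, precision="bf16-mixed")
asr_model.set_trainer(trainer)
trainer.fit(asr_model)
```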