Finetune the Parakeet TDT 0.6B v2 model for a language other than English, say Hindi #13810
deepanshu-yadav started this conversation in Show and tell
After days of going through the documentation, I finally have a way to finetune this model to a new language; I chose Hindi.
You can look at how to prepare the datasets, the SentencePiece tokenizer, the training script, and the training configuration to help you get started.
Here is the notebook you can run on Kaggle: https://github.com/deepanshu-yadav/Hindi_GramVani_Finetune/blob/main/finetuning-parakeet-on-hindi-dataset.ipynb
The code for preparing the manifests is here: https://github.com/deepanshu-yadav/Hindi_GramVani_Finetune/blob/main/prepare_manifest.py
The code for tokenizing a different language is here: https://github.com/deepanshu-yadav/Hindi_GramVani_Finetune/blob/main/tokenize_language.py
Make sure you run the tokenization code only after preparing the manifests.
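For reference, a NeMo ASR manifest is a JSONL file with one entry per utterance. Here is a minimal sketch of what prepare_manifest.py produces; the audio paths and transcripts below are placeholders, the real script derives them from the GramVaani dataset layout:

```python
import json
import soundfile as sf  # assumed available for reading audio durations

# Hypothetical (audio_path, transcript) pairs for illustration only.
samples = [
    ("audio/utt_0001.wav", "नमस्ते दुनिया"),
    ("audio/utt_0002.wav", "यह एक उदाहरण वाक्य है"),
]

with open("train_manifest.json", "w", encoding="utf-8") as f:
    for audio_path, text in samples:
        info = sf.info(audio_path)
        entry = {
            "audio_filepath": audio_path,                # path to the wav file
            "duration": info.frames / info.samplerate,   # length in seconds
            "text": text,                                # Hindi transcript
        }
        # NeMo expects one JSON object per line (JSONL), not a JSON array.
        f.write(json.dumps(entry, ensure_ascii=False) + "\n")
```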
I had trouble running the script provided in the NeMo repository, so I modified the tokenizer script to convert the data into BPE encodings.
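If you want to reproduce the tokenizer step by hand, here is a minimal sketch of training a SentencePiece BPE model directly on the manifest transcripts with the sentencepiece library; file names and the vocab size are placeholders, and tokenize_language.py is the actual script:

```python
import json
import sentencepiece as spm

# Dump the Hindi transcripts from the manifest into a plain text file,
# one sentence per line, which is what SentencePiece expects as input.
with open("train_manifest.json", encoding="utf-8") as fin, \
     open("train_text.txt", "w", encoding="utf-8") as fout:
    for line in fin:
        fout.write(json.loads(line)["text"] + "\n")

# Train a BPE tokenizer; vocab_size is a placeholder, tune it to your corpus.
spm.SentencePieceTrainer.train(
    input="train_text.txt",
    model_prefix="tokenizer_hindi",   # produces tokenizer_hindi.model / .vocab
    vocab_size=1024,
    model_type="bpe",
    character_coverage=1.0,           # keep all Devanagari characters
)
```

NeMo's BPE models expect a tokenizer directory containing the resulting .model file (typically named tokenizer.model).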
Do not run this on Google Colab; there is an issue I have filed here: #13734
Already done: freeze the encoder and train only the decoder portion of this model.
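A rough sketch of how that looks, assuming the model is loaded with NeMo's from_pretrained and the tokenizer directory is the one produced above (paths are placeholders; the notebook has the exact steps):

```python
import nemo.collections.asr as nemo_asr

# Load the pretrained English checkpoint.
asr_model = nemo_asr.models.ASRModel.from_pretrained("nvidia/parakeet-tdt-0.6b-v2")

# Swap in the Hindi SentencePiece tokenizer trained earlier; this rebuilds the
# output layers for the new vocabulary.
asr_model.change_vocabulary(
    new_tokenizer_dir="tokenizer_hindi_dir",  # placeholder dir containing tokenizer.model
    new_tokenizer_type="bpe",
)

# Freeze the encoder so only the decoder/joint parameters get updated.
asr_model.encoder.freeze()
# Equivalent plain-PyTorch version:
# for p in asr_model.encoder.parameters():
#     p.requires_grad = False
```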
There is much more we can do.
Disclaimer:
We need to run this on a GPU with more computing power than what is available on Kaggle.
We need to run for many more epochs.
This was just to get people started.
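For the epochs and compute point above, a rough sketch of wiring the manifests into a longer run with PyTorch Lightning, continuing from the asr_model loaded in the previous snippet (batch size, epochs, precision, and paths are placeholders; the notebook has the actual configuration):

```python
import pytorch_lightning as pl

# Point the model at the Hindi manifests; field names follow NeMo's usual
# dataset config, values here are placeholders.
asr_model.setup_training_data(train_data_config={
    "manifest_filepath": "train_manifest.json",
    "sample_rate": 16000,
    "batch_size": 8,
    "shuffle": True,
})
asr_model.setup_validation_data(val_data_config={
    "manifest_filepath": "val_manifest.json",
    "sample_rate": 16000,
    "batch_size": 8,
    "shuffle": False,
})

# Run for more epochs than a Kaggle session allows, on a bigger GPU.
trainer = pl.Trainer(accelerator="gpu", devices=1, max_epochs=100, precision="bf16-mixed")
asr_model.set_trainer(trainer)
trainer.fit(asr_model)
```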