Description
[Help] [Metis] Voice Conversion Irreproducible
Problem Overview
The example code in `models/tts/metis/metis_infer_vc.py` is incorrect and cannot be run as-is. Specifically:
- It loads `ft.json` via `load_config`, which is unrelated to voice conversion.
- It attempts to load `metis_vc.safetensors`, which does not exist in the HuggingFace repo. Only the following two files are available: `metis_vc_lora_16.safetensors` and `metis_vc_lora_16_adapter.safetensors`.
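To double-check which VC checkpoints are actually published, the repo's file list can be filtered for `.safetensors` files. The sketch below is only an illustration: `vc_checkpoint_files` and `list_remote_vc_checkpoints` are hypothetical helper names, the `metis_vc/` prefix is taken from the paths used in this report, and the remote lookup assumes `huggingface_hub.list_repo_files` plus network access.

```python
from typing import Iterable, List


def vc_checkpoint_files(repo_files: Iterable[str], prefix: str = "metis_vc/") -> List[str]:
    """Return the sorted .safetensors files found under `prefix`."""
    return sorted(
        f for f in repo_files
        if f.startswith(prefix) and f.endswith(".safetensors")
    )


def list_remote_vc_checkpoints(repo_id: str = "amphion/metis") -> List[str]:
    """Query the Hub for the file list (needs network and huggingface_hub)."""
    from huggingface_hub import list_repo_files  # imported lazily on purpose

    return vc_checkpoint_files(list_repo_files(repo_id, repo_type="model"))


# Offline illustration using only the file names mentioned in this report.
reported = [
    "metis_base/model.safetensors",
    "metis_vc/metis_vc_lora_16.safetensors",
    "metis_vc/metis_vc_lora_16_adapter.safetensors",
]
found = vc_checkpoint_files(reported)
print(found)
print("metis_vc/metis_vc.safetensors" in found)
```

Calling `list_remote_vc_checkpoints()` instead of using the hard-coded list would show whatever the Hub currently hosts, which may differ from what is listed here.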
Steps Taken
- Referred to the example usage for TTS here: https://github.com/open-mmlab/Amphion/tree/main/models/tts/metis#2-example-usaage
- Modified the code to a voice conversion (VC) version, as follows:

    import os

    import soundfile as sf
    from huggingface_hub import snapshot_download

    # Metis and load_config are imported as in the repo's TTS example.

    device = "cuda:0"
    metis_cfg = load_config("./models/tts/metis/config/vc.json")

    # Download the base model, the VC LoRA weights, and the VC adapter weights.
    base_ckpt_dir = snapshot_download(
        "amphion/metis",
        repo_type="model",
        local_dir="./models/tts/metis/ckpt",
        allow_patterns=["metis_base/model.safetensors"],
    )
    lora_ckpt_dir = snapshot_download(
        "amphion/metis",
        repo_type="model",
        local_dir="./models/tts/metis/ckpt",
        allow_patterns=["metis_vc/metis_vc_lora_16.safetensors"],
    )
    adapter_ckpt_dir = snapshot_download(
        "amphion/metis",
        repo_type="model",
        local_dir="./models/tts/metis/ckpt",
        allow_patterns=["metis_vc/metis_vc_lora_16_adapter.safetensors"],
    )

    base_ckpt_path = os.path.join(base_ckpt_dir, "metis_base/model.safetensors")
    lora_ckpt_path = os.path.join(lora_ckpt_dir, "metis_vc/metis_vc_lora_16.safetensors")
    adapter_ckpt_path = os.path.join(
        adapter_ckpt_dir, "metis_vc/metis_vc_lora_16_adapter.safetensors"
    )

    metis = Metis(
        base_ckpt_path=base_ckpt_path,
        lora_ckpt_path=lora_ckpt_path,
        adapter_ckpt_path=adapter_ckpt_path,
        cfg=metis_cfg,
        device=device,
        model_type="vc",
    )

    prompt_speech_path = "./models/tts/metis/wav/vc/prompt.wav"
    source_speech_path = "./models/tts/metis/wav/vc/source.wav"

    n_timesteps = 20
    cfg = 1.0

    gen_speech = metis(
        prompt_speech_path=prompt_speech_path,
        source_speech_path=source_speech_path,
        cfg=cfg,
        n_timesteps=n_timesteps,
        model_type="vc",
    )

    # Write the converted speech at 24 kHz.
    sf.write("./models/tts/metis/wav/vc/gen.wav", gen_speech, 24000)

- Used the example WAV files in `models/tts/metis/wav/vc/`.
Expected Outcome
Expected to generate intelligible and high-quality converted speech, similar to the samples on the demo page.
Actual Outcome
The generated audio is very low quality and does not contain any human voice — it's mostly noise. This makes the current VC pipeline irreproducible.
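To put a rough number on "mostly noise", one option is to compare the RMS level and zero-crossing rate (ZCR) of the output against known signals. This is only a crude heuristic sketch, not a measure used by the Metis authors: the helper names are hypothetical, and the reference signals are synthetic (a 200 Hz tone as a stand-in for voiced speech, and uniform white noise). White noise tends toward a ZCR near 0.5, while voiced speech is usually far lower.

```python
import math
import random
from typing import Sequence


def rms(samples: Sequence[float]) -> float:
    """Root-mean-square level of the signal."""
    return math.sqrt(sum(s * s for s in samples) / len(samples))


def zero_crossing_rate(samples: Sequence[float]) -> float:
    """Fraction of adjacent sample pairs whose signs differ."""
    crossings = sum(
        1 for a, b in zip(samples, samples[1:]) if (a >= 0) != (b >= 0)
    )
    return crossings / (len(samples) - 1)


# Synthetic one-second references at 24 kHz (the sample rate used for gen.wav).
sr = 24000
tone = [0.5 * math.sin(2 * math.pi * 200 * n / sr) for n in range(sr)]
rng = random.Random(0)  # fixed seed so the illustration is repeatable
noise = [rng.uniform(-0.5, 0.5) for _ in range(sr)]

print(f"tone : rms={rms(tone):.3f} zcr={zero_crossing_rate(tone):.4f}")
print(f"noise: rms={rms(noise):.3f} zcr={zero_crossing_rate(noise):.4f}")
```

The same two functions can be applied to the generated file by loading it with `soundfile.read` and passing the samples as a list; a ZCR close to the white-noise reference would be consistent with what I hear.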
Environment Information
- Operating System: Ubuntu 20.04.5 LTS
- Python Version: 3.10.16
- Driver & CUDA Version: Driver 470.103.01 & CUDA 11.4
- Error Messages and Logs: No runtime errors, but the model output is unusable.
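Since nothing fails at runtime, one possibility (an assumption on my side, not something I have confirmed) is that some LoRA or adapter weights are silently skipped during loading because their key names do not match the model. A minimal sketch of a key comparison is below; `diff_keys` is a hypothetical helper and the key names are made up for illustration. In practice the two inputs could come from `model.state_dict().keys()` and from the keys of `safetensors.torch.load_file(path)`.

```python
from typing import Dict, Iterable, List


def diff_keys(model_keys: Iterable[str], ckpt_keys: Iterable[str]) -> Dict[str, List[str]]:
    """Report parameter names present on one side but not the other."""
    model_set, ckpt_set = set(model_keys), set(ckpt_keys)
    return {
        "missing_in_ckpt": sorted(model_set - ckpt_set),
        "unexpected_in_ckpt": sorted(ckpt_set - model_set),
    }


# Made-up key names, purely to show the shape of the report.
model_keys = ["encoder.weight", "encoder.lora_A.weight", "encoder.lora_B.weight"]
ckpt_keys = ["encoder.weight", "base_model.encoder.lora_A.weight"]

report = diff_keys(model_keys, ckpt_keys)
for side, names in report.items():
    print(side, names)
```

Two empty lists would rule this explanation out; a non-empty list would point at a naming or prefix mismatch between the checkpoint and the model.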
Additional Context
Please provide the correct inference code used to generate the demo samples at https://metis-demo.github.io/#metis-vc. It would be especially helpful if you could:
- Fix the example script at `metis_infer_vc.py`
- Clearly specify which checkpoint files are required
- Share the hyperparameters (`cfg`, `n_timesteps`, etc.) and audio preprocessing steps used in your demos
Thanks for your work. I am more than excited to use Metis VC once this is resolved.