
[Help] [Metis] Voice Conversion Irreproducible #437

Open
@yileitu

Description


Problem Overview

The example code in models/tts/metis/metis_infer_vc.py is incorrect and cannot be run as-is. Specifically:

  • It loads ft.json via load_config, which is unrelated to voice conversion.
  • It attempts to load metis_vc.safetensors, which does not exist in the HuggingFace repo (see the file-listing sketch after this list). Only the following two files are available:
    • metis_vc_lora_16.safetensors
    • metis_vc_lora_16_adapter.safetensors
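
For reference, the repo contents can be confirmed directly with huggingface_hub. This is a minimal sketch (not part of the Amphion example), using the same amphion/metis repo id as the download code below:

    from huggingface_hub import list_repo_files

    # List every file in the checkpoint repository and keep only the VC-related ones.
    vc_files = [f for f in list_repo_files("amphion/metis", repo_type="model") if "metis_vc" in f]
    print(vc_files)
    # Per the observation above, only the LoRA and adapter files are listed;
    # there is no metis_vc.safetensors.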

Steps Taken

  1. Referred to the example usage for TTS here: https://github.com/open-mmlab/Amphion/tree/main/models/tts/metis#2-example-usaage

  2. Modified the code to a voice conversion (VC) version, as follows (a consolidated download variant is sketched after this list):

    # Imports follow the Metis TTS example; the load_config/Metis import paths are
    # assumed to match the Amphion repo layout.
    import os

    import soundfile as sf
    from huggingface_hub import snapshot_download

    from models.tts.metis.metis import Metis
    from utils.util import load_config

    device = "cuda:0"
    metis_cfg = load_config("./models/tts/metis/config/vc.json")

    # Download the base model and the two VC checkpoint files that do exist in the repo.
    base_ckpt_dir = snapshot_download(
        "amphion/metis",
        repo_type="model",
        local_dir="./models/tts/metis/ckpt",
        allow_patterns=["metis_base/model.safetensors"],
    )
    lora_ckpt_dir = snapshot_download(
        "amphion/metis",
        repo_type="model",
        local_dir="./models/tts/metis/ckpt",
        allow_patterns=["metis_vc/metis_vc_lora_16.safetensors"],
    )
    adapter_ckpt_dir = snapshot_download(
        "amphion/metis",
        repo_type="model",
        local_dir="./models/tts/metis/ckpt",
        allow_patterns=["metis_vc/metis_vc_lora_16_adapter.safetensors"],
    )

    base_ckpt_path = os.path.join(base_ckpt_dir, "metis_base/model.safetensors")
    lora_ckpt_path = os.path.join(lora_ckpt_dir, "metis_vc/metis_vc_lora_16.safetensors")
    adapter_ckpt_path = os.path.join(adapter_ckpt_dir, "metis_vc/metis_vc_lora_16_adapter.safetensors")

    # Build the VC model from the base weights plus the LoRA and adapter checkpoints.
    metis = Metis(
        base_ckpt_path=base_ckpt_path,
        lora_ckpt_path=lora_ckpt_path,
        adapter_ckpt_path=adapter_ckpt_path,
        cfg=metis_cfg,
        device=device,
        model_type="vc",
    )

    prompt_speech_path = "./models/tts/metis/wav/vc/prompt.wav"
    source_speech_path = "./models/tts/metis/wav/vc/source.wav"

    # Inference hyperparameters (my own guesses; the values used for the demo are unknown).
    n_timesteps = 20
    cfg = 1.0

    # Convert the source utterance to the voice of the prompt speaker.
    gen_speech = metis(
        prompt_speech_path=prompt_speech_path,
        source_speech_path=source_speech_path,
        cfg=cfg,
        n_timesteps=n_timesteps,
        model_type="vc",
    )

    sf.write("./models/tts/metis/wav/vc/gen.wav", gen_speech, 24000)
  3. Used the example WAV files in models/tts/metis/wav/vc/.
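
As an aside, the three snapshot_download calls above could likely be collapsed into one, since allow_patterns accepts several patterns and all three calls share the same local_dir. A minimal sketch, equivalent in intent to the step-by-step version (not the repository's official example):

    import os

    from huggingface_hub import snapshot_download

    # One call fetches the base model plus both VC checkpoints into the same directory.
    ckpt_dir = snapshot_download(
        "amphion/metis",
        repo_type="model",
        local_dir="./models/tts/metis/ckpt",
        allow_patterns=[
            "metis_base/model.safetensors",
            "metis_vc/metis_vc_lora_16.safetensors",
            "metis_vc/metis_vc_lora_16_adapter.safetensors",
        ],
    )

    base_ckpt_path = os.path.join(ckpt_dir, "metis_base/model.safetensors")
    lora_ckpt_path = os.path.join(ckpt_dir, "metis_vc/metis_vc_lora_16.safetensors")
    adapter_ckpt_path = os.path.join(ckpt_dir, "metis_vc/metis_vc_lora_16_adapter.safetensors")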

Expected Outcome

Expected to generate intelligible and high-quality converted speech, similar to the samples on the demo page.

Actual Outcome

The generated audio is of very low quality and contains no recognizable human voice; it is mostly noise. This makes the current VC pipeline irreproducible.
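
For reference, the judgment above is by ear; a rough way to attach numbers to it is to check the level statistics of the generated file. A minimal sketch using numpy and soundfile with the output path from the script above (it only reports basic signal statistics and is no substitute for listening):

    import numpy as np
    import soundfile as sf

    # Basic statistics of the generated audio: duration, sample rate, peak and RMS level.
    audio, sr = sf.read("./models/tts/metis/wav/vc/gen.wav")
    rms = float(np.sqrt(np.mean(np.square(audio))))
    peak = float(np.max(np.abs(audio)))
    print(f"duration={len(audio) / sr:.2f}s  sr={sr}  peak={peak:.3f}  rms={rms:.3f}")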

Environment Information

  • Operating System: Ubuntu 20.04.5 LTS
  • Python Version: 3.10.16
  • Driver & CUDA Version: Driver 470.103.01 & CUDA 11.4
  • Error Messages and Logs: No runtime errors, but the model output is unusable.

Additional Context

Please provide the correct inference code used to generate the demo samples at https://metis-demo.github.io/#metis-vc. It would be especially helpful if you could:

  • Fix the example script at metis_infer_vc.py
  • Clearly specify which checkpoint files are required
  • Share the hyperparameters (cfg, n_timesteps, etc.) and audio preprocessing steps used in your demos

Thanks for your work. I am more than excited to use Metis VC once this is resolved.
