Replicate
Replicate.com hosts a variety of machine learning models for audio generation, which can be broadly categorized into Music Generation, Text-to-Speech (TTS)/Voice Generation, and Sound Effects (SFX) & Audio Processing.
Here is a list of some of the most prominent and frequently used audio generation models available on Replicate via their API:
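All of these models are invoked through the same client pattern; only the input schema changes from model to model. Here is a minimal sketch using the official Python client (the `prompt` field shown is illustrative; each model documents its own inputs on its Replicate page):

```python
# pip install replicate
# export REPLICATE_API_TOKEN=...   # token from replicate.com/account
import replicate

# replicate.run() takes an "owner/model" reference plus a dict of
# model-specific inputs, and blocks until the prediction finishes.
output = replicate.run(
    "meta/musicgen",
    input={"prompt": "90s UK garage beat with punchy drums"},  # illustrative input
)

# Recent versions of the client return a file-like object; older
# versions return a URL string you would download yourself.
with open("track.wav", "wb") as f:
    f.write(output.read())
```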
1. Music Generation (Text-to-Music)
These models create full musical tracks, instrumentals, or loops from a text prompt or melody input.
| Model Name | Developer/Origin | Key Features |
|---|---|---|
| meta/musicgen | Meta | Generate music from a text prompt or a melody (using a reference audio input). Highly popular; offers “melody” and “large” versions, with stereo output available by selecting a stereo model version. |
| stability-ai/stable-audio-2.5 | Stability AI | Generate high-quality music and sound from text prompts, supporting full-length songs and sound design. |
| minimax/music-1.5 | Minimax | Generate full-length songs (up to 4 mins) with natural vocals and rich instrumentation. |
| minimax/music-01 | Minimax | Quickly generate up to 1 minute of music with lyrics and vocals, often using a reference track for style. |
| google/lyria-2 | Google DeepMind | Generates 48 kHz stereo audio from text prompts. |
| riffusion/riffusion | Riffusion | Generates music in real-time using Stable Diffusion, often focused on generating loops and visualizing music as a spectrogram. |
| lucataco/ace-step | ACE Studio / StepFun (packaged by lucataco) | A text-to-music generation model designed as a foundation model. |
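As a hedged sketch of melody conditioning and stereo output with meta/musicgen (the `model_version`, `input_audio`, and `duration` field names follow the model's published schema at the time of writing; verify them on the model's API page):

```python
import replicate

# Condition generation on a reference melody and request stereo output.
# Field names (model_version, input_audio, duration) are assumptions
# based on the musicgen listing; confirm on the model's API page.
output = replicate.run(
    "meta/musicgen",
    input={
        "model_version": "stereo-melody-large",  # stereo + melody variant
        "prompt": "orchestral arrangement with sweeping strings",
        "input_audio": open("reference_melody.mp3", "rb"),
        "duration": 15,  # seconds of audio to generate
    },
)
```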
2. Text-to-Speech (TTS) & Voice Generation
These models convert text into spoken audio and often include features like voice cloning and emotional expression.
| Model Name | Developer/Origin | Key Features |
|---|---|---|
| minimax/speech-02-turbo | Minimax | Low-latency Text-to-Audio (T2A) for real-time applications, offering voice synthesis and emotional expression. |
| minimax/speech-02-hd | Minimax | High-fidelity Text-to-Audio (T2A) optimized for voiceovers and audiobooks. |
| jaaari/kokoro-82m | hexgrad (packaged by jaaari) | A popular text-to-speech model based on StyleTTS2. |
| suno-ai/bark | Suno AI | A transformer-based model that can generate highly realistic, natural-sounding, and expressive speech, including music, sound effects, and non-speech sounds. |
| minimax/voice-cloning | Minimax | Specifically for cloning a voice to be used with their speech-02-hd and speech-02-turbo models. |
| zsxkib/dia | Nari Labs | Generates realistic dialogue audio from text, including non-verbal cues and voice cloning. |
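As one hedged example, a TTS call against jaaari/kokoro-82m might look like the sketch below (the `text`, `voice`, and `speed` fields and the `af_bella` voice id are assumptions drawn from the model's listing; check the model page for the current schema):

```python
import replicate

# Text-to-speech: convert a string into spoken audio.
# The input fields (text, voice, speed) and the voice id are
# assumptions; verify against the model's published schema.
audio = replicate.run(
    "jaaari/kokoro-82m",
    input={
        "text": "Welcome back to the show. Today we cover audio models.",
        "voice": "af_bella",  # assumed voice id
        "speed": 1.0,
    },
)
with open("speech.wav", "wb") as f:
    f.write(audio.read())
```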
3. Sound Effects (SFX) & Video-to-Audio
These models focus on generating short sound samples or generating audio to match video content.
| Model Name | Developer/Origin | Key Features |
|---|---|---|
| stackadoc/stable-audio-open-1.0 | Stability AI (Open) | Optimized for generating short audio samples, sound effects (SFX), and production elements from text prompts. |
| zsxkib/mmaudio | MMAudio (packaged by zsxkib) | A video-to-audio synthesis model that generates high-quality audio (SFX, ambient sound) to match the visual content of an input video. |
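A hedged sketch of a video-to-audio call with zsxkib/mmaudio, assuming a `video` file input and an optional text `prompt` to steer the sound design (both field names, and the output format, should be verified on the model page):

```python
import replicate

# Video-to-audio: pass the source clip as a file handle. The "video"
# and "prompt" field names are assumptions; depending on the model
# version, the output may be a bare audio track or the input video
# re-muxed with the generated soundtrack.
result = replicate.run(
    "zsxkib/mmaudio",
    input={
        "video": open("drone_flyover.mp4", "rb"),
        "prompt": "wind gusts, distant traffic",  # optional steering text
    },
)
with open("drone_flyover_with_audio.mp4", "wb") as f:
    f.write(result.read())
```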
4. Speech Recognition & Audio Analysis
These models process existing audio to extract information such as transcriptions, timestamps, and speaker identification.
| Model Name | Developer/Origin | Key Features |
|---|---|---|
| victor-upmeet/whisperx | WhisperX (packaged by victor-upmeet) | Automatic Speech Recognition (ASR) with word-level timestamps and speaker diarization for accurate transcription and timecode generation. |
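A hedged transcription sketch for victor-upmeet/whisperx follows; the `audio_file` and `diarization` inputs and the `segments` output structure reflect the model's listing at the time of writing and should be double-checked against its schema:

```python
import replicate

# ASR with word-level timestamps and optional speaker diarization.
# The input names (audio_file, diarization) and the output shape
# (segments with start/text/speaker) are assumptions from the
# whisperx listing; verify before relying on them.
result = replicate.run(
    "victor-upmeet/whisperx",
    input={
        "audio_file": open("interview.mp3", "rb"),
        "diarization": True,  # assumed flag; may require extra setup
    },
)
for segment in result["segments"]:
    speaker = segment.get("speaker", "?")
    print(f'{segment["start"]:7.2f}s  {speaker}: {segment["text"]}')
```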
For Text-to-Speech (TTS) and Voice Generation, the most effective models for creating a conversation between multiple people are those specifically designed for multi-speaker dialogue.
Based on industry capabilities and models available through platforms like Replicate, here are the key models/platforms that can produce this kind of output:
5. Top Models/Platforms for Multi-Voice Dialogue
| Model/Service | Key Feature for Dialogue | How it Works (General Concept) |
|---|---|---|
| Microsoft VibeVoice | Explicitly designed for long-form, multi-speaker conversational audio (e.g., podcasts). | Can synthesize long-form speech for up to 4 speakers using a conversational approach to maintain a natural “vibe” and turn-taking. |
| zsxkib/dia (Dia 1.6B) | Realistic Dialogue Generation using speaker tags. | You input your script with speaker tags (e.g., [S1], [S2]) and the model generates the conversation, including non-verbal cues like laughter or coughing. |
| lucataco/higgs-audio-v2 | Demonstrates zero-shot generation of natural multi-speaker dialogues in multiple languages. | It’s a powerful foundation model with deep language understanding, allowing it to generate realistic conversations. |
| Google Cloud Text-to-Speech | Has a Multi-Speaker TTS capability. | You use a specialized voice (en-US-Studio-MultiSpeaker) and define the turns with different speakers (e.g., Speaker R, Speaker S) in a structured markup. |
| ElevenLabs (via its Studio interface) | Offers tools to generate natural dialogue between multiple speakers using their advanced models (like v3). | The associated editor/studio allows you to assign a unique voice to each line of dialogue in a script. |
Summary of the Technique
The common technique across these services is to use speaker separation/tagging within the text input:
- Write the script: You write the conversation as you would a screenplay.
- Tag the lines: You add a specific tag or instruction to each line to tell the model which voice should speak it (e.g., "Speaker 1: Hello.", "Speaker 2: Hi there!").
- Model assigns voices: The model then generates the audio, ensuring a different, consistent, and naturally flowing voice for each speaker throughout the dialogue.
Note: The models available on Replicate are constantly changing as new research is published. You should always check the official Replicate Explore page or the specific model collection for the most up-to-date list and features.