Replicate

Replicate.com hosts a variety of machine learning models for audio generation, which can be broadly categorized into Music Generation, Text-to-Speech (TTS)/Voice Generation, and Sound Effects (SFX) & Audio Processing.

Here is a list of some of the most prominent and frequently used audio generation models available through Replicate's API:

1. Music Generation (Text-to-Music)

These models create full musical tracks, instrumentals, or loops from a text prompt or melody input.

| Model Name | Developer/Origin | Key Features |
| --- | --- | --- |
| meta/musicgen | Meta | Generates music from a text prompt or a melody (using a reference audio input). Highly popular; offers "melody" and "large" versions, with stereo output available by selecting a stereo- model version. |
| stability-ai/stable-audio-2.5 | Stability AI | Generates high-quality music and sound from text prompts, supporting full-length songs and sound design. |
| minimax/music-1.5 | Minimax | Generates full-length songs (up to 4 minutes) with natural vocals and rich instrumentation. |
| minimax/music-01 | Minimax | Quickly generates up to 1 minute of music with lyrics and vocals, often using a reference track for style. |
| google/lyria-2 | Google | Generates 48 kHz stereo audio from text-based prompts. |
| riffusion/riffusion | Riffusion | Generates music in real time using Stable Diffusion, often focused on loops; visualizes music as a spectrogram. |
| lucataco/ace-step | Luca Tacconelli | A text-to-music generation model designed as a foundation model. |
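
As a concrete sketch, here is how meta/musicgen might be called with the official `replicate` Python client (`pip install replicate`, with `REPLICATE_API_TOKEN` set in the environment). The input names `prompt`, `model_version`, and `duration` reflect the model's published schema at the time of writing; treat them as assumptions and verify them on the model page.

```python
# Minimal sketch: generating a short track with meta/musicgen on Replicate.
# Requires `pip install replicate` and the REPLICATE_API_TOKEN env var.
import replicate

output = replicate.run(
    "meta/musicgen",
    input={
        "prompt": "lo-fi hip hop beat with warm piano chords",
        "model_version": "stereo-large",  # a stereo- version enables stereo output
        "duration": 15,                   # seconds of audio (assumed input name)
    },
)
print(output)  # typically a URL to the generated audio file
```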

2. Text-to-Speech (TTS) & Voice Generation

These models convert text into spoken audio and often include features like voice cloning and emotional expression.

| Model Name | Developer/Origin | Key Features |
| --- | --- | --- |
| minimax/speech-02-turbo | Minimax | Low-latency text-to-audio (T2A) for real-time applications, offering voice synthesis and emotional expression. |
| minimax/speech-02-hd | Minimax | High-fidelity text-to-audio (T2A) optimized for voiceovers and audiobooks. |
| jaaari/kokoro-82m | jaaari | A popular text-to-speech model based on StyleTTS2. |
| suno-ai/bark | Suno AI | A transformer-based model that generates highly realistic, natural-sounding, and expressive speech, including music, sound effects, and other non-speech sounds. |
| minimax/voice-cloning | Minimax | Clones a voice for use with the speech-02-hd and speech-02-turbo models. |
| zsxkib/dia | Nari Labs | Generates realistic dialogue audio from text, including non-verbal cues and voice cloning. |
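
A similarly hedged sketch for TTS, here calling minimax/speech-02-turbo. The `text` and `voice_id` input names, and the voice name used, are assumptions based on the model's schema; check the Replicate model page for the actual fields and available voices.

```python
# Sketch: low-latency speech synthesis with minimax/speech-02-turbo.
import replicate

audio = replicate.run(
    "minimax/speech-02-turbo",
    input={
        "text": "Welcome back! Today we are exploring audio generation models.",
        "voice_id": "Friendly_Person",  # assumed voice name; see the model page
    },
)
print(audio)  # typically a URL to the synthesized speech
```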

3. Sound Effects (SFX) & Video-to-Audio

These models focus on generating short sound samples or generating audio to match video content.

| Model Name | Developer/Origin | Key Features |
| --- | --- | --- |
| stackadoc/stable-audio-open-1.0 | Stability AI (open) | Optimized for generating short audio samples, sound effects (SFX), and production elements from text prompts. |
| zsxkib/mmaudio | zsxkib | A video-to-audio synthesis model that generates high-quality audio (SFX, ambient sound) to match the visual content of an input video. |
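
For short sound effects, a call to stackadoc/stable-audio-open-1.0 might look like the sketch below. `prompt` and `seconds_total` are assumed input names; confirm them against the model's schema on Replicate.

```python
# Sketch: generating a short SFX clip with stackadoc/stable-audio-open-1.0.
import replicate

sfx = replicate.run(
    "stackadoc/stable-audio-open-1.0",
    input={
        "prompt": "glass shattering on a concrete floor",
        "seconds_total": 5,  # assumed input name; keep clips short for SFX
    },
)
print(sfx)
```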

4. Speech Recognition & Audio Analysis

These models process existing audio to extract information such as transcriptions, timestamps, and speaker identification.

| Model Name | Developer/Origin | Key Features |
| --- | --- | --- |
| victor-upmeet/whisperx | Victor Upmeet | Automatic speech recognition (ASR) with word-level timestamps and speaker diarization for accurate transcription and timecode generation. |
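
A transcription call with victor-upmeet/whisperx could look like this sketch; the `audio_file` input and `diarization` flag are assumptions drawn from the model's schema at the time of writing.

```python
# Sketch: transcription with word-level timestamps and speaker diarization.
import replicate

result = replicate.run(
    "victor-upmeet/whisperx",
    input={
        "audio_file": "https://example.com/interview.mp3",  # placeholder URL
        "diarization": True,  # assumed flag to label segments by speaker
    },
)
print(result)  # typically segments with timestamps and speaker labels
```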

For Text-to-Speech (TTS) and Voice Generation, the most effective models for creating a conversation between multiple people are those specifically designed for multi-speaker dialogue.

Based on industry capabilities and models available through platforms like Replicate, here are the key models/platforms that can produce this kind of output:

5. Top Models/Platforms for Multi-Voice Dialogue

| Model/Service | Key Feature for Dialogue | How it Works (General Concept) |
| --- | --- | --- |
| Microsoft VibeVoice | Explicitly designed for long-form, multi-speaker conversational audio (e.g., podcasts). | Synthesizes long-form speech for up to 4 speakers, using a conversational approach to maintain a natural "vibe" and turn-taking. |
| zsxkib/dia (Dia 1.6B) | Realistic dialogue generation using speaker tags. | You write the script with speaker tags (e.g., [S1], [S2]) and the model generates the conversation, including non-verbal cues like laughter or coughing. |
| lucataco/higgs-audio-v2 | Zero-shot generation of natural multi-speaker dialogues in multiple languages. | A powerful foundation model with deep language understanding, allowing it to generate realistic conversations. |
| Google Cloud Text-to-Speech | Multi-speaker TTS capability. | You use a specialized voice (en-US-Studio-MultiSpeaker) and define turns for different speakers (e.g., Speaker R, Speaker S) in structured markup. |
| ElevenLabs (via its Studio interface) | Tools for generating natural dialogue between multiple speakers using its advanced models (such as v3). | The Studio editor lets you assign a unique voice to each line of dialogue in a script. |
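
As an example of the speaker-tag approach, here is a sketch of a zsxkib/dia call using the [S1]/[S2] tags described above. The `text` input name is an assumption; check the model's schema on Replicate.

```python
# Sketch: multi-speaker dialogue with zsxkib/dia using [S1]/[S2] speaker tags.
import replicate

script = (
    "[S1] Did you hear the new episode dropped? "
    "[S2] I did! I stayed up far too late listening. (laughs) "
    "[S1] Worth it, though."
)

dialogue = replicate.run("zsxkib/dia", input={"text": script})
print(dialogue)  # typically a URL to the generated conversation audio
```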

Summary of the Technique

The common technique across these services is to use speaker separation/tagging within the text input:

  1. Write the script: You write the conversation as you would a screenplay.
  2. Tag the lines: You add a specific tag or instruction to each line to tell the model which voice should speak it (e.g., "Speaker 1: Hello.", "Speaker 2: Hi there!").
  3. Model assigns voices: The model then generates the audio, giving each speaker a distinct, consistent, and naturally flowing voice throughout the dialogue. A provider-agnostic sketch of this pattern follows.
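
Glue code for this pattern might look like the sketch below: parse a screenplay-style script into (speaker, line) turns, then map each speaker to a voice before synthesis. The voice IDs are hypothetical placeholders, not tied to any particular service.

```python
# Sketch: splitting a tagged script into per-speaker turns for any TTS API.
def parse_script(script: str) -> list[tuple[str, str]]:
    """Split 'Speaker: line' text into (speaker, line) tuples."""
    turns = []
    for raw in script.strip().splitlines():
        speaker, _, line = raw.partition(":")
        if line:  # skip lines without a 'Speaker:' prefix
            turns.append((speaker.strip(), line.strip()))
    return turns

# Hypothetical voice IDs; real names depend on the TTS service you use.
VOICES = {"Speaker 1": "voice_warm_male", "Speaker 2": "voice_bright_female"}

script = """
Speaker 1: Hello.
Speaker 2: Hi there!
"""

for speaker, line in parse_script(script):
    print(f"{VOICES[speaker]} -> {line}")  # hand each turn to your TTS call
```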

Note: The models available on Replicate are constantly changing as new research is published. You should always check the official Replicate Explore page or the specific model collection for the most up-to-date list and features.