Multi-Provider Podcast Audio Generation¶
Feature Status
Status: ✅ Production Ready
Since: October 2025
Related Issues: Custom Voice Support
Overview¶
RAG Modulo's podcast generation system supports multi-provider audio generation: custom voices (ElevenLabs) and predefined provider voices (OpenAI) can be mixed in a single podcast. The feature provides per-turn TTS provider selection, custom voice resolution, and audio stitching with pauses between turns.
Key Features¶
1. Per-Turn Provider Selection¶
Each dialogue turn can use a different TTS provider based on the voice selected:
# Example: HOST uses a custom ElevenLabs voice (UUID), EXPERT uses the OpenAI predefined voice "nova"
{
  "host_voice": "38c79b5a-204c-427c-b794-6c3a9e3db956",
  "expert_voice": "nova"
}
The system automatically:
- Detects voice ID format (a valid UUID = custom voice; any other string = predefined voice)
- Resolves custom voices from database
- Selects appropriate TTS provider per turn
- Generates audio segments
- Stitches segments together with natural pauses
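The detection step can be sketched with the standard library's `uuid` module. This is a minimal illustration rather than the project's code; `is_custom_voice_id` is a hypothetical helper name. Note that `uuid.UUID` also accepts a few other spellings (for example, hex without hyphens), so it is slightly more lenient than the hyphenated format shown in this document.

```python
import uuid


def is_custom_voice_id(voice_id: str) -> bool:
    """Return True when voice_id parses as a UUID (treated as a custom voice)."""
    try:
        uuid.UUID(voice_id)
    except ValueError:
        # Not a UUID, so it is treated as a predefined voice name such as "nova".
        return False
    return True


print(is_custom_voice_id("38c79b5a-204c-427c-b794-6c3a9e3db956"))  # True
print(is_custom_voice_id("nova"))  # False
```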
2. Custom Voice Resolution¶
UUID-Based Detection:
async def _resolve_voice_id(self, voice_id: str, user_id: UUID4) -> tuple[str, str | None]:
    """
    Resolve voice ID to provider-specific voice ID.

    UUID format: xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx
    Returns: (provider_voice_id, provider_name)
    """
Validation Steps:
- Parse voice ID as UUID
- Look up custom voice in database
- Validate ownership (user_id matches)
- Check voice status (must be "ready")
- Return provider-specific voice ID and provider name
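The validation steps above could be sketched as follows. This is an illustration under assumptions, not the actual `_resolve_voice_id`: the `CustomVoice` dataclass and the in-memory `voices` dict are hypothetical stand-ins for the database layer, a plain `ValueError` replaces whatever exception type the service raises, and treating an ownership mismatch as "not found" is a guess based on the error messages documented later.

```python
from __future__ import annotations

import uuid
from dataclasses import dataclass


@dataclass
class CustomVoice:
    """Hypothetical stand-in for a custom-voice database row."""
    user_id: uuid.UUID
    status: str
    provider_name: str
    provider_voice_id: str


def resolve_voice_id(
    voice_id: str,
    user_id: uuid.UUID,
    voices: dict[uuid.UUID, CustomVoice],
) -> tuple[str, str | None]:
    # 1. Parse as UUID; anything else is passed through as a predefined voice.
    try:
        parsed = uuid.UUID(voice_id)
    except ValueError:
        return voice_id, None
    # 2. Look up the custom voice.
    voice = voices.get(parsed)
    # 3. Validate ownership (assumption: reported the same way as a missing voice).
    if voice is None or voice.user_id != user_id:
        raise ValueError(f"Custom voice '{voice_id}' not found")
    # 4. Check status.
    if voice.status != "ready":
        raise ValueError(f"Custom voice '{voice_id}' is not ready")
    # 5. Return the provider-specific ID and the provider name.
    return voice.provider_voice_id, voice.provider_name


owner = uuid.UUID("ee76317f-3b6f-4fea-8b74-56483731f58c")
voices = {
    uuid.UUID("38c79b5a-204c-427c-b794-6c3a9e3db956"): CustomVoice(
        user_id=owner,
        status="ready",
        provider_name="elevenlabs",
        provider_voice_id="21m00Tcm4TlvDq8ikWAM",
    )
}
print(resolve_voice_id("38c79b5a-204c-427c-b794-6c3a9e3db956", owner, voices))
print(resolve_voice_id("nova", owner, voices))
```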
3. Supported Providers¶
| Provider | Voice Types | Use Cases |
|---|---|---|
| OpenAI TTS | Predefined voices (alloy, echo, fable, onyx, nova, shimmer) | Quick generation, consistent quality |
| ElevenLabs | Custom cloned voices + presets | Brand voices, personalized podcasts |
| WatsonX TTS | IBM Watson voices | Enterprise deployments |
4. Audio Stitching¶
Technical Implementation:
# Generate audio for each turn with the appropriate provider
audio_segments = []
for idx, turn in enumerate(script.turns):
    voice_id = host_voice_id if turn.speaker == Speaker.HOST else expert_voice_id
    # provider_type is the provider resolved for this turn's voice
    provider = get_provider(provider_type)
    segment = await provider._generate_turn_audio(...)
    audio_segments.append(segment)
    # Add 500ms pause between turns (not after the last one)
    if idx < len(script.turns) - 1:
        pause = AudioSegment.silent(duration=500)
        audio_segments.append(pause)

# Combine all segments
combined = AudioSegment.empty()
for segment in audio_segments:
    combined += segment
Benefits:
- Seamless transitions between providers
- Natural pauses between speakers
- Single output file (MP3, WAV, OGG, FLAC)
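To show the arithmetic behind the pause, here is a standard-library sketch that joins raw PCM segments with silence between them. It deliberately swaps pydub's `AudioSegment` for plain `bytes` and assumes 16-bit signed PCM at a shared sample rate; `stitch_pcm` is a hypothetical name, not part of the codebase.

```python
def stitch_pcm(
    segments: list[bytes],
    sample_rate: int,
    sample_width: int = 2,
    channels: int = 1,
    pause_ms: int = 500,
) -> bytes:
    """Join raw PCM segments, inserting silence between (not after) them."""
    pause_frames = sample_rate * pause_ms // 1000
    # Zero bytes are silence for signed PCM (16-bit here by assumption).
    silence = b"\x00" * (pause_frames * sample_width * channels)
    return silence.join(segments)


host = b"\x01\x00" * 100    # 100 frames of 16-bit mono audio
expert = b"\x02\x00" * 100
combined = stitch_pcm([host, expert], sample_rate=1000)
print(len(combined))  # 1400 = 200 + 1000 (500 ms of silence) + 200
```

Because the silence is only a separator, a single-turn script produces no trailing pause, which matches the `idx < len(script.turns) - 1` condition in the implementation excerpt.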
Configuration¶
Environment Variables¶
Add to your .env file:
# Default audio provider for podcasts
PODCAST_AUDIO_PROVIDER=openai # Options: openai, elevenlabs, watsonx
# OpenAI TTS Configuration
OPENAI_API_KEY=your-openai-api-key
OPENAI_TTS_MODEL=tts-1-hd
OPENAI_TTS_DEFAULT_VOICE=alloy
# ElevenLabs TTS Configuration
ELEVENLABS_API_KEY=your-elevenlabs-api-key
ELEVENLABS_API_BASE_URL=https://api.elevenlabs.io/v1
ELEVENLABS_MODEL_ID=eleven_multilingual_v2
ELEVENLABS_VOICE_SETTINGS_STABILITY=0.5
ELEVENLABS_VOICE_SETTINGS_SIMILARITY=0.75
ELEVENLABS_REQUEST_TIMEOUT_SECONDS=30
ELEVENLABS_MAX_RETRIES=3
Get your API keys:
- OpenAI: https://platform.openai.com/api-keys
- ElevenLabs: https://elevenlabs.io/app/settings/api-keys
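As a rough sketch of how these variables might be read, the snippet below loads the ElevenLabs values from the environment using only the standard library. `ElevenLabsConfig` and `load_elevenlabs_config` are hypothetical names, and using the example values above as fallbacks is an assumption; the project's own settings class may behave differently.

```python
from __future__ import annotations

import os
from collections.abc import Mapping
from dataclasses import dataclass


@dataclass(frozen=True)
class ElevenLabsConfig:
    api_key: str
    base_url: str
    model_id: str
    stability: float
    similarity: float
    timeout_seconds: int
    max_retries: int


def load_elevenlabs_config(env: Mapping[str, str] = os.environ) -> ElevenLabsConfig:
    """Read ElevenLabs settings; only the API key is required (KeyError if missing)."""
    return ElevenLabsConfig(
        api_key=env["ELEVENLABS_API_KEY"],
        # Fallbacks mirror the example .env values (assumed, not confirmed defaults).
        base_url=env.get("ELEVENLABS_API_BASE_URL", "https://api.elevenlabs.io/v1"),
        model_id=env.get("ELEVENLABS_MODEL_ID", "eleven_multilingual_v2"),
        stability=float(env.get("ELEVENLABS_VOICE_SETTINGS_STABILITY", "0.5")),
        similarity=float(env.get("ELEVENLABS_VOICE_SETTINGS_SIMILARITY", "0.75")),
        timeout_seconds=int(env.get("ELEVENLABS_REQUEST_TIMEOUT_SECONDS", "30")),
        max_retries=int(env.get("ELEVENLABS_MAX_RETRIES", "3")),
    )


config = load_elevenlabs_config({"ELEVENLABS_API_KEY": "your-elevenlabs-api-key"})
print(config.model_id, config.timeout_seconds)
```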
Provider Configuration¶
The system uses AudioProviderFactory to create provider instances:
from rag_solution.generation.audio.factory import AudioProviderFactory

# Create provider from settings
provider = AudioProviderFactory.create_provider(
    provider_type="elevenlabs",  # or "openai", "watsonx"
    settings=settings
)

# List available providers
providers = AudioProviderFactory.list_providers()
# Returns: ["openai", "elevenlabs", "watsonx", "ollama"]
Usage¶
1. Creating Custom Voices¶
Upload and Clone Voice (ElevenLabs):
POST /api/voices/upload-and-clone
Content-Type: multipart/form-data
Parameters:
- file: Audio file (MP3, WAV) - 1+ minute of clear speech
- name: Voice name (e.g., "Brand Voice")
- description: Optional voice description
Response:
{
  "voice_id": "38c79b5a-204c-427c-b794-6c3a9e3db956",
  "user_id": "ee76317f-3b6f-4fea-8b74-56483731f58c",
  "name": "Brand Voice",
  "status": "ready",
  "provider_name": "elevenlabs",
  "provider_voice_id": "21m00Tcm4TlvDq8ikWAM"
}
2. Generating Podcasts with Custom Voices¶
Mixed Provider Example (host: custom ElevenLabs voice; expert: OpenAI predefined voice):
POST /api/podcasts/script-to-audio
Content-Type: application/json
{
  "collection_id": "5eb82bd8-1fbd-454e-86d6-61199642757c",
  "title": "My Podcast",
  "duration": 5,
  "host_voice": "38c79b5a-204c-427c-b794-6c3a9e3db956",
  "expert_voice": "nova",
  "audio_format": "mp3",
  "script_text": "HOST: Welcome...\nEXPERT: Thank you..."
}
Both Custom Voices (two different custom voice UUIDs):
{
  "host_voice": "38c79b5a-204c-427c-b794-6c3a9e3db956",
  "expert_voice": "7d2e9f1a-8b3c-4d5e-9f6a-1b2c3d4e5f6a"
}
Both Predefined Voices:
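A request with two predefined voices would presumably use plain voice names in both fields. The sketch below builds such a payload in Python; it assumes the same request fields as the mixed-provider example and picks `alloy` and `onyx` from the OpenAI voice list purely for illustration.

```python
import json

# Both voices are predefined names, so no custom-voice lookup is involved.
payload = {
    "collection_id": "5eb82bd8-1fbd-454e-86d6-61199642757c",
    "title": "My Podcast",
    "duration": 5,
    "host_voice": "alloy",
    "expert_voice": "onyx",
    "audio_format": "mp3",
    "script_text": "HOST: Welcome...\nEXPERT: Thank you...",
}

body = json.dumps(payload, indent=2)
print(body)
```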
3. Script Format Flexibility¶
The system now accepts multiple dialogue formats:
HOST: Welcome to today's podcast...
EXPERT: Thank you for having me...
Host: Welcome to today's podcast...
Expert: Thank you for having me...
[HOST]: Welcome to today's podcast...
[EXPERT]: Thank you for having me...
[Host]: Welcome to today's podcast...
[Expert]: Thank you for having me...
All formats are parsed correctly and validated.
Technical Architecture¶
Component Diagram¶
┌─────────────────────────────────────────────────────────────┐
│                       Podcast Service                       │
│  ┌───────────────────────────────────────────────────────┐  │
│  │ _generate_audio() - Multi-Provider Orchestration      │  │
│  │   • Resolve voice IDs (UUID → provider mapping)       │  │
│  │   • Cache provider instances                          │  │
│  │   • Generate per-turn audio                           │  │
│  │   • Stitch segments with pauses                       │  │
│  └───────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────┘
                               ↓
┌─────────────────────────────────────────────────────────────┐
│                    AudioProviderFactory                     │
│  • create_provider(type, settings)                          │
│  • list_providers()                                         │
└─────────────────────────────────────────────────────────────┘
                               ↓
       ┌────────────────┬────────────────┬────────────────┐
       ↓                ↓                ↓                ↓
┌──────────────┐ ┌──────────────┐ ┌──────────────┐ ┌──────────────┐
│    OpenAI    │ │  ElevenLabs  │ │   WatsonX    │ │    Ollama    │
│   Provider   │ │   Provider   │ │   Provider   │ │   Provider   │
└──────────────┘ └──────────────┘ └──────────────┘ └──────────────┘
Key Classes¶
1. PodcastService¶
Location: backend/rag_solution/services/podcast_service.py
Key Methods:
async def _resolve_voice_id(self, voice_id: str, user_id: UUID4) -> tuple[str, str | None]:
    """
    Resolve voice ID to provider-specific voice ID.

    Logic:
    1. Try to parse as UUID
    2. If UUID: Look up in database, validate, return (provider_voice_id, provider_name)
    3. If not UUID: Return (voice_id, None) - it's a predefined voice

    Returns:
        Tuple of (resolved_voice_id, provider_name)
    """

async def _generate_audio(
    self,
    podcast_id: UUID4,
    podcast_script: PodcastScript,
    podcast_input: PodcastGenerationInput,
) -> bytes:
    """
    Generate audio from parsed script with multi-provider support.

    Strategy:
    1. Resolve both voices upfront to determine providers
    2. Create provider instances as needed (cached)
    3. Generate each turn with appropriate provider
    4. Stitch all segments with pauses
    5. Export to requested format
    """
2. AudioProviderFactory¶
Location: backend/rag_solution/generation/audio/factory.py
class AudioProviderFactory:
    """Factory for creating audio generation providers."""

    _providers: ClassVar[dict[str, type[AudioProviderBase]]] = {
        "openai": OpenAIAudioProvider,
        "elevenlabs": ElevenLabsAudioProvider,
        "watsonx": WatsonXAudioProvider,
        "ollama": OllamaAudioProvider,
    }

    @classmethod
    def create_provider(cls, provider_type: str, settings: Settings) -> AudioProviderBase:
        """Create audio provider instance from settings."""

    @classmethod
    def list_providers(cls) -> list[str]:
        """List all registered provider names."""
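The method bodies are not shown above. A registry-based factory of this shape is commonly implemented along the following lines; this is a self-contained sketch with stub provider classes, and the constructor signature and the error raised for an unknown provider are assumptions rather than the project's actual behaviour.

```python
from __future__ import annotations

from typing import ClassVar


class AudioProviderBase:
    """Stub base class; the real providers take a Settings object."""

    def __init__(self, settings: dict) -> None:
        self.settings = settings


class OpenAIAudioProvider(AudioProviderBase):
    """Stub standing in for the real OpenAI provider."""


class ElevenLabsAudioProvider(AudioProviderBase):
    """Stub standing in for the real ElevenLabs provider."""


class AudioProviderFactory:
    _providers: ClassVar[dict[str, type[AudioProviderBase]]] = {
        "openai": OpenAIAudioProvider,
        "elevenlabs": ElevenLabsAudioProvider,
    }

    @classmethod
    def create_provider(cls, provider_type: str, settings: dict) -> AudioProviderBase:
        # Look up the registered class and instantiate it with the settings.
        provider_cls = cls._providers.get(provider_type)
        if provider_cls is None:
            raise ValueError(
                f"Unknown audio provider '{provider_type}'. "
                f"Available: {cls.list_providers()}"
            )
        return provider_cls(settings)

    @classmethod
    def list_providers(cls) -> list[str]:
        # Dicts preserve insertion order, so this follows registration order.
        return list(cls._providers)


provider = AudioProviderFactory.create_provider("elevenlabs", settings={})
print(type(provider).__name__)  # ElevenLabsAudioProvider
```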
3. ScriptParser¶
Location: backend/rag_solution/utils/script_parser.py
Updated Patterns:
HOST_PATTERNS: ClassVar[list[str]] = [
    r"^HOST:\s*(.*)$",
    r"^Host:\s*(.*)$",
    r"^H:\s*(.*)$",
    r"^\[HOST\]:\s*(.*)$",  # [HOST]: format (with colon)
    r"^\[HOST\]\s*(.*)$",   # [HOST] format (without colon)
    r"^\[Host\]:\s*(.*)$",  # [Host]: format
]
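To illustrate how such a pattern list can drive parsing, the sketch below tries each pattern in order and keeps the first match per line. `HOST_PATTERNS` is copied from above; `EXPERT_PATTERNS` is assumed to mirror it, and `parse_script` is a hypothetical function, not the real `ScriptParser` API. Order matters: the `[HOST]:` pattern must come before the colon-less `[HOST]` pattern, otherwise the colon would end up in the captured text.

```python
from __future__ import annotations

import re

HOST_PATTERNS = [
    r"^HOST:\s*(.*)$",
    r"^Host:\s*(.*)$",
    r"^H:\s*(.*)$",
    r"^\[HOST\]:\s*(.*)$",
    r"^\[HOST\]\s*(.*)$",
    r"^\[Host\]:\s*(.*)$",
]
# Assumption: the expert patterns mirror the host patterns.
EXPERT_PATTERNS = [
    r"^EXPERT:\s*(.*)$",
    r"^Expert:\s*(.*)$",
    r"^E:\s*(.*)$",
    r"^\[EXPERT\]:\s*(.*)$",
    r"^\[EXPERT\]\s*(.*)$",
    r"^\[Expert\]:\s*(.*)$",
]


def _first_match(patterns: list[str], line: str) -> str | None:
    """Return the captured dialogue text of the first matching pattern."""
    for pattern in patterns:
        match = re.match(pattern, line)
        if match:
            return match.group(1)
    return None


def parse_script(text: str) -> list[tuple[str, str]]:
    """Parse a script into (speaker, text) turns; unmatched lines are skipped."""
    turns = []
    for raw_line in text.splitlines():
        line = raw_line.strip()
        for speaker, patterns in (("HOST", HOST_PATTERNS), ("EXPERT", EXPERT_PATTERNS)):
            spoken = _first_match(patterns, line)
            if spoken is not None:
                turns.append((speaker, spoken))
                break
    return turns


print(parse_script("[HOST]: Welcome to today's podcast...\nExpert: Thank you for having me..."))
```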
Performance & Cost¶
Benchmarks¶
| Configuration | Generation Time | Cost (5 min podcast) |
|---|---|---|
| OpenAI only | ~30-45 seconds | ~$0.05-0.10 |
| ElevenLabs only | ~45-60 seconds | ~$0.15-0.30 |
| Mixed (OpenAI + ElevenLabs) | ~40-55 seconds | ~$0.10-0.20 |
Optimization¶
Provider Caching:
# Cache provider instances to avoid recreation per turn
provider_cache: dict[str, AudioProviderBase] = {}

def get_provider(provider_type: str) -> AudioProviderBase:
    if provider_type not in provider_cache:
        provider_cache[provider_type] = AudioProviderFactory.create_provider(...)
    return provider_cache[provider_type]
Benefits:
- Reduces provider initialization overhead
- Reuses HTTP connections
- Faster per-turn generation
Error Handling¶
Common Errors¶
1. Custom Voice Not Found¶
{
  "error": "ValidationError",
  "message": "Custom voice '38c79b5a-...' not found",
  "field": "voice_id"
}
Solution: Verify voice ID exists in database and belongs to user.
2. Voice Not Ready¶
{
  "error": "ValidationError",
  "message": "Custom voice '38c79b5a-...' is not ready",
  "status": "processing"
}
Solution: Wait for voice cloning to complete (usually 30-60 seconds).
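Waiting for the voice can be automated with a small polling loop. The sketch below is generic: `get_status` is a hypothetical callable that returns the voice's current `status` string (how it is fetched is left to the caller), and the timeout and interval defaults are illustrative choices.

```python
import time
from typing import Callable


def wait_until_ready(
    get_status: Callable[[], str],
    timeout: float = 120.0,
    interval: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll until the status is "ready"; return False if the timeout elapses first."""
    deadline = time.monotonic() + timeout
    while True:
        if get_status() == "ready":
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval)


# Simulated status sequence; sleep is replaced so the example runs instantly.
statuses = iter(["processing", "processing", "ready"])
print(wait_until_ready(lambda: next(statuses), sleep=lambda seconds: None))  # True
```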
3. Provider API Error¶
{
  "error": "AudioGenerationError",
  "provider": "elevenlabs",
  "error_type": "api_error",
  "message": "HTTP 401: Invalid API key"
}
Solution: Check API key configuration in .env.
4. Script Format Validation Error¶
Solution: Ensure script has both HOST and EXPERT dialogue turns.
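A check of this kind can be run before submitting a script. The sketch below assumes the script has already been parsed into `(speaker, text)` tuples; `validate_turns` and its error message are hypothetical and do not reproduce the service's actual validation error.

```python
def validate_turns(turns: list[tuple[str, str]]) -> None:
    """Raise ValueError unless both HOST and EXPERT have a non-empty turn."""
    speakers = {speaker for speaker, text in turns if text.strip()}
    missing = {"HOST", "EXPERT"} - speakers
    if missing:
        raise ValueError(
            "Script is missing dialogue turns for: " + ", ".join(sorted(missing))
        )


validate_turns([("HOST", "Welcome..."), ("EXPERT", "Thank you...")])  # passes silently
```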
Best Practices¶
1. Voice Selection¶
Custom Voices:
- Use for brand consistency
- Requires 1+ minute of clear audio
- Better for recognizable voices
Predefined Voices:
- Faster to set up (no cloning)
- Consistent quality
- Good for generic podcasts
2. Script Quality¶
Good:
HOST: Welcome to today's podcast on machine learning.
EXPERT: Thank you for having me. Let me explain the core concepts.
Avoid:
HOST: Welcome, [EXPERT NAME]! # ❌ Placeholder names
EXPERT: [Placeholder response] # ❌ Template text
3. API Rate Limits¶
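OpenAI:
- 50 requests/minute (free tier)
- 500 requests/minute (paid tier)
ElevenLabs:
- 10,000 characters/month (free tier)
- Higher monthly character quotas on paid tiers, depending on plan
These figures vary by account tier and model and change over time; confirm current limits in each provider's documentation.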
Recommendations:
- Use provider caching
- Implement retry logic (already built-in)
- Monitor usage via provider dashboards
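The built-in retry logic is not shown in this document. For callers that wrap their own requests, the general pattern of retrying with exponential backoff looks roughly like this; `call_with_retries` is a hypothetical helper, the default of three retries simply echoes `ELEVENLABS_MAX_RETRIES=3`, and retrying only on `ConnectionError` is an illustrative choice.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_retries(
    func: Callable[[], T],
    max_retries: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call func, retrying on ConnectionError with delays of 1x, 2x, 4x base_delay."""
    for attempt in range(max_retries + 1):
        try:
            return func()
        except ConnectionError:
            if attempt == max_retries:
                raise  # retries exhausted: surface the last error
            sleep(base_delay * 2 ** attempt)
    raise ValueError("max_retries must be non-negative")


attempts = []


def flaky() -> str:
    """Fail twice, then succeed (simulated transient error)."""
    attempts.append(1)
    if len(attempts) < 3:
        raise ConnectionError("temporary failure")
    return "ok"


delays: list[float] = []
print(call_with_retries(flaky, sleep=delays.append), delays)  # ok [1.0, 2.0]
```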
Migration Guide¶
From Single-Provider to Multi-Provider¶
Before (single provider for entire podcast):
# Old approach - all turns use same provider
podcast_input = PodcastGenerationInput(
    host_voice="alloy",
    expert_voice="onyx",
    # Provider determined by PODCAST_AUDIO_PROVIDER setting
)
After (per-turn provider selection):
# New approach - each voice can use different provider
podcast_input = PodcastGenerationInput(
    host_voice="38c79b5a-...",  # Custom ElevenLabs voice
    expert_voice="nova",        # OpenAI predefined voice
    # Providers automatically resolved per turn
)
Backward Compatibility: All existing podcasts continue to work without changes. The system detects voice ID format and selects appropriate provider automatically.
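One plausible way to read the compatibility guarantee: when voice resolution returns no provider name (a predefined voice), the provider falls back to the `PODCAST_AUDIO_PROVIDER` default, so existing requests behave as before. The sketch below encodes that reading; the fallback rule itself is an assumption, and `select_provider_name` is a hypothetical helper.

```python
from __future__ import annotations


def select_provider_name(resolved_provider: str | None, default_provider: str) -> str:
    """Pick the provider for one voice.

    Assumption: a predefined voice resolves to no provider (None) and
    therefore uses the configured default provider.
    """
    return resolved_provider if resolved_provider is not None else default_provider


print(select_provider_name("elevenlabs", "openai"))  # elevenlabs (custom voice)
print(select_provider_name(None, "openai"))          # openai (predefined voice)
```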
Troubleshooting¶
Issue: Voice Cloning Fails¶
Symptoms: Custom voice stuck in "processing" status
Solutions:
- Check audio quality (clear speech, minimal background noise)
- Ensure file is 1+ minute duration
- Verify API key is valid
- Check ElevenLabs account quota
Issue: Audio Stitching Produces Clicks¶
Symptoms: Audible clicks/pops between turns
Solutions:
- Adjust pause duration (default 500ms)
- Ensure all providers use same sample rate
- Check audio format consistency
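A quick way to check the last two points for WAV output is to compare each segment's channel count, sample width, and sample rate before stitching. The sketch below uses the standard library's `wave` module on in-memory data; it only handles PCM WAV (not MP3, OGG, or FLAC), and `make_wav`, `wav_format`, and `check_consistent` are hypothetical helper names.

```python
from __future__ import annotations

import io
import wave


def make_wav(frame_rate: int, frames: int = 10) -> bytes:
    """Build a tiny silent mono 16-bit WAV in memory (sample data for this sketch)."""
    buffer = io.BytesIO()
    with wave.open(buffer, "wb") as wav:
        wav.setnchannels(1)
        wav.setsampwidth(2)
        wav.setframerate(frame_rate)
        wav.writeframes(b"\x00\x00" * frames)
    return buffer.getvalue()


def wav_format(data: bytes) -> tuple[int, int, int]:
    """Return (channels, sample width in bytes, sample rate) of a WAV file."""
    with wave.open(io.BytesIO(data), "rb") as wav:
        return wav.getnchannels(), wav.getsampwidth(), wav.getframerate()


def check_consistent(segments: list[bytes]) -> None:
    """Raise ValueError when the segments do not share one audio format."""
    formats = {wav_format(segment) for segment in segments}
    if len(formats) > 1:
        raise ValueError(f"Segments use different audio formats: {sorted(formats)}")


check_consistent([make_wav(24000), make_wav(24000)])  # passes silently
print(wav_format(make_wav(24000)))  # (1, 2, 24000)
```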
Issue: Generation Times Out¶
Symptoms: Request times out after 120 seconds
Solutions:
- Reduce podcast duration
- Use faster provider (OpenAI typically faster)
- Increase the request timeout in settings
Future Enhancements¶
Planned Features¶
- Voice Style Control
  - Emotion/tone settings per turn
  - Speaking rate variation
- Background Music
  - Auto-mix background music
  - Fade in/out support
- Multi-Language Support
  - Voice cloning for multiple languages
  - Automatic language detection
- Advanced Audio Processing
  - Noise reduction
  - Volume normalization
  - EQ adjustments
References¶
Last Updated: October 15, 2025
Contributors: Claude Code Assistant