Skip to content

Custom Voice API

Overview

The Custom Voice API allows users to upload voice samples and use them for personalized podcast generation. This feature integrates with voice cloning providers to create custom voices that can be used alongside preset TTS voices.

Implementation Strategy

Phase 1: ElevenLabs Integration (Current) ๐Ÿš€

Focus: Fast time to market with proven cloud-based voice cloning

Available Providers: - ElevenLabs: Industry-leading voice cloning (5/5 quality), managed service

Timeline: Phase 1 is currently being implemented (~12-15 hours remaining)

Phase 2: Self-Hosted Option (Future) ๐Ÿ”ง

Focus: Cost optimization and data sovereignty for power users

Planned Providers: - F5-TTS: Self-hosted voice cloning with zero-shot capabilities - 20-80% cheaper than ElevenLabs at scale (50+ podcasts/month) - Privacy-focused (voice samples stay on-premise) - Open-source (MIT license)

Timeline: Phase 2 planned for future release (~20-25 hours)

Runtime Provider Selection

Users can choose their preferred provider when processing voices:

POST /api/voices/{voice_id}/process
{
  "provider_name": "elevenlabs"  // Phase 1
  // "provider_name": "f5-tts"   // Phase 2 (future)
}

Architecture

Components

1. Voice Upload
   โ””โ”€> FileManagementService โ†’ Store voice sample files

2. Voice Processing
   โ””โ”€> TTS Provider API โ†’ Clone voice from sample

3. Voice Storage
   โ””โ”€> Voice Database โ†’ Track voice metadata and status

4. Voice Usage
   โ””โ”€> Podcast Generation โ†’ Use custom or preset voices

Database Model

Table: voices

Field Type Description
voice_id UUID Primary key
user_id UUID Foreign key to users
name VARCHAR(200) Human-readable voice name
description TEXT Optional voice description
gender VARCHAR(20) male/female/neutral
status VARCHAR(20) uploading/processing/ready/failed
provider_voice_id VARCHAR(200) Provider-specific voice ID (after cloning)
provider_name VARCHAR(50) TTS provider name (elevenlabs, playht, resemble)
sample_file_url VARCHAR(500) Path to voice sample file
sample_file_size INTEGER File size in bytes
quality_score INTEGER Voice quality (0-100)
error_message TEXT Error details if failed
times_used INTEGER Usage counter
created_at TIMESTAMP Creation time
updated_at TIMESTAMP Last update time
processed_at TIMESTAMP Processing completion time

Voice File Storage

Structure: {storage_path}/{user_id}/voices/{voice_id}/sample.{format}

Supported Formats: - mp3 - wav - m4a - flac - ogg

API Endpoints

1. Upload Voice Sample

Upload a voice sample file for custom voice creation.

Endpoint: POST /api/voices/upload

Authentication: Required (JWT token)

Content-Type: multipart/form-data

Form Fields:

name: string (required, 1-200 chars)
description: string (optional, max 1000 chars)
gender: string (required, one of: male, female, neutral)
audio_file: file (required, max 10MB)

Request Example:

curl -X POST http://localhost:8000/api/voices/upload \
  -H "Authorization: Bearer $JWT_TOKEN" \
  -F "name=Professional Narrator Voice" \
  -F "description=Clear, authoritative voice for podcasts" \
  -F "gender=male" \
  -F "audio_file=@voice_sample.mp3"

Response (201 Created):

{
  "voice_id": "123e4567-e89b-12d3-a456-426614174000",
  "user_id": "ee76317f-3b6f-4fea-8b74-56483731f58c",
  "name": "Professional Narrator Voice",
  "description": "Clear, authoritative voice for podcasts",
  "gender": "male",
  "status": "uploading",
  "provider_voice_id": null,
  "provider_name": null,
  "sample_file_url": "/api/voices/123e4567-e89b-12d3-a456-426614174000/sample",
  "sample_file_size": 2457600,
  "quality_score": null,
  "error_message": null,
  "times_used": 0,
  "created_at": "2025-10-13T10:30:00Z",
  "updated_at": "2025-10-13T10:30:00Z",
  "processed_at": null
}

Error Responses: - 400 Bad Request: Invalid input (empty name, unsupported format, file too large) - 401 Unauthorized: Missing or invalid JWT token - 413 Payload Too Large: File exceeds size limit - 415 Unsupported Media Type: Invalid audio format

2. Process Voice with TTS Provider

Process an uploaded voice sample with a TTS provider for voice cloning.

Endpoint: POST /api/voices/{voice_id}/process

Authentication: Required (JWT token)

Content-Type: application/json

Request Body:

{
  "provider_name": "elevenlabs"
}

Supported Providers (Phase 1): - elevenlabs - ElevenLabs voice cloning (available now)

Future Providers (Phase 2): - f5-tts - Self-hosted F5-TTS voice cloning (planned)

Request Example:

curl -X POST http://localhost:8000/api/voices/{voice_id}/process \
  -H "Authorization: Bearer $JWT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "provider_name": "elevenlabs"
  }'

Response (202 Accepted):

{
  "voice_id": "123e4567-e89b-12d3-a456-426614174000",
  "status": "processing",
  "provider_name": "elevenlabs",
  "message": "Voice processing started. This may take 30-120 seconds."
}

Error Responses: - 400 Bad Request: Unsupported provider, voice not in uploadable state - 401 Unauthorized: Missing or invalid JWT token - 403 Forbidden: User doesn't own this voice - 404 Not Found: Voice not found - 409 Conflict: Voice already processed or processing

3. List User's Voices

Get a list of all voices owned by the authenticated user.

Endpoint: GET /api/voices

Authentication: Required (JWT token)

Query Parameters: - limit (optional, integer, 1-100, default: 100) - Maximum number of results - offset (optional, integer, >=0, default: 0) - Pagination offset

Request Example:

curl -X GET "http://localhost:8000/api/voices?limit=10&offset=0" \
  -H "Authorization: Bearer $JWT_TOKEN"

Response (200 OK):

{
  "voices": [
    {
      "voice_id": "123e4567-e89b-12d3-a456-426614174000",
      "user_id": "ee76317f-3b6f-4fea-8b74-56483731f58c",
      "name": "Professional Narrator Voice",
      "description": "Clear, authoritative voice for podcasts",
      "gender": "male",
      "status": "ready",
      "provider_voice_id": "elvenlabs_voice_abc123",
      "provider_name": "elevenlabs",
      "sample_file_url": "/api/voices/123e4567-e89b-12d3-a456-426614174000/sample",
      "sample_file_size": 2457600,
      "quality_score": 85,
      "error_message": null,
      "times_used": 3,
      "created_at": "2025-10-13T10:30:00Z",
      "updated_at": "2025-10-13T10:32:15Z",
      "processed_at": "2025-10-13T10:32:15Z"
    }
  ],
  "total_count": 1
}

Error Responses: - 401 Unauthorized: Missing or invalid JWT token

4. Get Voice Details

Get details of a specific voice.

Endpoint: GET /api/voices/{voice_id}

Authentication: Required (JWT token)

Request Example:

curl -X GET http://localhost:8000/api/voices/{voice_id} \
  -H "Authorization: Bearer $JWT_TOKEN"

Response (200 OK):

{
  "voice_id": "123e4567-e89b-12d3-a456-426614174000",
  "user_id": "ee76317f-3b6f-4fea-8b74-56483731f58c",
  "name": "Professional Narrator Voice",
  "description": "Clear, authoritative voice for podcasts",
  "gender": "male",
  "status": "ready",
  "provider_voice_id": "elvenlabs_voice_abc123",
  "provider_name": "elevenlabs",
  "sample_file_url": "/api/voices/123e4567-e89b-12d3-a456-426614174000/sample",
  "sample_file_size": 2457600,
  "quality_score": 85,
  "error_message": null,
  "times_used": 3,
  "created_at": "2025-10-13T10:30:00Z",
  "updated_at": "2025-10-13T10:32:15Z",
  "processed_at": "2025-10-13T10:32:15Z"
}

Error Responses: - 401 Unauthorized: Missing or invalid JWT token - 403 Forbidden: User doesn't own this voice - 404 Not Found: Voice not found

5. Update Voice Metadata

Update voice name, description, or gender classification.

Endpoint: PATCH /api/voices/{voice_id}

Authentication: Required (JWT token)

Content-Type: application/json

Request Body (all fields optional):

{
  "name": "Updated Voice Name",
  "description": "Updated description",
  "gender": "female"
}

Request Example:

curl -X PATCH http://localhost:8000/api/voices/{voice_id} \
  -H "Authorization: Bearer $JWT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "name": "My Updated Voice",
    "description": "New description"
  }'

Response (200 OK):

{
  "voice_id": "123e4567-e89b-12d3-a456-426614174000",
  "name": "My Updated Voice",
  "description": "New description",
  ...
}

Error Responses: - 400 Bad Request: Invalid input (empty name, invalid gender) - 401 Unauthorized: Missing or invalid JWT token - 403 Forbidden: User doesn't own this voice - 404 Not Found: Voice not found

6. Delete Voice

Delete a voice and its associated sample file.

Endpoint: DELETE /api/voices/{voice_id}

Authentication: Required (JWT token)

Request Example:

curl -X DELETE http://localhost:8000/api/voices/{voice_id} \
  -H "Authorization: Bearer $JWT_TOKEN"

Response (204 No Content)

Error Responses: - 401 Unauthorized: Missing or invalid JWT token - 403 Forbidden: User doesn't own this voice - 404 Not Found: Voice not found - 409 Conflict: Voice is currently being used in podcast generation

7. Download Voice Sample

Download or stream the voice sample file.

Endpoint: GET /api/voices/{voice_id}/sample

Authentication: Required (JWT token)

Request Example:

curl -X GET http://localhost:8000/api/voices/{voice_id}/sample \
  -H "Authorization: Bearer $JWT_TOKEN" \
  -o voice_sample.mp3

Response (200 OK): - Content-Type: audio/mpeg (or appropriate MIME type) - Binary audio data

Supports HTTP Range Requests: Yes (for streaming/seeking)

Error Responses: - 401 Unauthorized: Missing or invalid JWT token - 403 Forbidden: User doesn't own this voice - 404 Not Found: Voice or sample file not found

Voice Status Workflow

1. UPLOADING โ†’ Upload in progress
   โ†“
2. PROCESSING โ†’ Voice cloning with TTS provider
   โ†“
3. READY โ†’ Voice is ready for use
   โ†“
4. FAILED โ†’ Processing failed (see error_message)

Using Custom Voices in Podcasts

Voice ID Format

Custom voices use UUID format:

custom:{voice_id}

Preset voices use string names:

alloy, echo, fable, onyx, nova, shimmer

Example: Generate Podcast with Custom Voice

curl -X POST http://localhost:8000/api/podcasts/generate \
  -H "Authorization: Bearer $JWT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "collection_id": "your-collection-id",
    "duration": 15,
    "host_voice": "custom:123e4567-e89b-12d3-a456-426614174000",
    "expert_voice": "nova",
    "title": "Podcast with Custom Voice"
  }'

Mixed Voice Scenarios

You can mix custom and preset voices:

Scenario 1: Custom HOST + Preset EXPERT

{
  "host_voice": "custom:voice-uuid",
  "expert_voice": "onyx"
}

Scenario 2: Preset HOST + Custom EXPERT

{
  "host_voice": "alloy",
  "expert_voice": "custom:voice-uuid"
}

Scenario 3: Both Custom

{
  "host_voice": "custom:voice-uuid-1",
  "expert_voice": "custom:voice-uuid-2"
}

Configuration

Environment Variables

Phase 1: ElevenLabs Configuration ๐Ÿš€

# Voice TTS Providers
VOICE_TTS_PROVIDERS=elevenlabs              # Available providers
VOICE_DEFAULT_PROVIDER=elevenlabs           # Default provider

# Voice Storage
VOICE_STORAGE_BACKEND=local                 # Storage backend (default: local)
VOICE_LOCAL_STORAGE_PATH=./data/voices      # Local storage path
VOICE_MAX_FILE_SIZE_MB=10                   # Max upload size (default: 10)
VOICE_MAX_PER_USER=10                       # Max voices per user (default: 10)
VOICE_ALLOWED_FORMATS=mp3,wav,m4a,flac,ogg  # Supported formats

# ElevenLabs API Configuration
ELEVENLABS_API_KEY=<your-api-key>           # Get from elevenlabs.io
ELEVENLABS_API_BASE_URL=https://api.elevenlabs.io/v1
ELEVENLABS_MODEL_ID=eleven_multilingual_v2  # Voice cloning model
ELEVENLABS_VOICE_SETTINGS_STABILITY=0.5     # Voice stability (0.0-1.0)
ELEVENLABS_VOICE_SETTINGS_SIMILARITY=0.75   # Voice similarity boost (0.0-1.0)
ELEVENLABS_REQUEST_TIMEOUT_SECONDS=30       # API timeout
ELEVENLABS_MAX_RETRIES=3                    # Retry attempts

# Voice Processing
VOICE_PROCESSING_TIMEOUT_SECONDS=30         # Timeout for voice cloning
VOICE_MIN_SAMPLE_DURATION_SECONDS=5         # Minimum sample length
VOICE_MAX_SAMPLE_DURATION_SECONDS=300       # Maximum 5 minutes

Phase 2: F5-TTS Configuration (Future) ๐Ÿ”ง

# F5-TTS Self-Hosted Provider (Phase 2)
VOICE_TTS_PROVIDERS=elevenlabs,f5-tts       # Multiple providers
F5_TTS_SERVICE_URL=http://localhost:8001    # F5-TTS microservice
F5_TTS_MODEL_PATH=/models/f5-tts            # Model storage
F5_TTS_GPU_ENABLED=true                     # Use GPU for inference
F5_TTS_LANGUAGE=en                          # Default language
F5_TTS_CACHE_DIR=/cache                     # Voice embedding cache

File Size Limits

Format Recommended Size Max Size
MP3 1-5 MB 10 MB
WAV 5-20 MB 10 MB
M4A 1-5 MB 10 MB
FLAC 10-30 MB 10 MB
OGG 1-5 MB 10 MB

Voice Sample Requirements

For best results, voice samples should: - Be 30 seconds to 2 minutes long - Have clear, high-quality audio - Be free of background noise - Contain natural, conversational speech - Be in a supported audio format

Cost Estimates

ElevenLabs Pricing

Based on ElevenLabs pricing (as of Oct 2025):

Operation Cost Notes
Voice cloning $0.30 One-time per voice
TTS generation $0.18/1K chars Per podcast generation

Example Costs

Scenario: Create 1 custom voice, generate 5 podcasts (15 min each)

Item Calculation Cost
Voice cloning (1x) 1 ร— $0.30 $0.30
Podcast TTS (5x) 5 ร— ~2,250 words ร— 5 chars ร— $0.18/1K $10.13
Total $10.43

Troubleshooting

Voice Upload Fails: "Unsupported format"

Cause: Audio file format not supported

Solution: Convert to supported format (MP3, WAV, M4A, FLAC, OGG)

# Convert using ffmpeg
ffmpeg -i voice.aac -c:a libmp3lame -q:a 2 voice.mp3

Voice Processing Stuck in "processing" Status

Cause: TTS provider API timeout or error

Solution: 1. Check provider API status 2. Verify API keys are correct 3. Check voice sample meets requirements 4. Retry processing after 5 minutes

Voice Quality Score is Low

Cause: Poor quality audio sample

Solution: - Re-record with better microphone - Remove background noise - Ensure clear, natural speech - Use lossless format (WAV, FLAC) for upload

Cannot Use Voice in Podcast: "Voice not ready"

Cause: Voice status is not "ready"

Solution: 1. Check voice status via GET /api/voices/{voice_id} 2. If status is "processing", wait for completion 3. If status is "failed", check error_message and re-upload

Security Considerations

Access Control

  • Users can only access their own voices
  • Voice sample files are access-controlled via JWT
  • Cross-user voice sharing is not supported (by design)

File Validation

  • File type validation (magic number check)
  • File size limits enforced
  • Virus scanning (recommended in production)

API Rate Limiting

Recommended rate limits: - Voice upload: 5 per hour per user - Voice processing: 10 per hour per user - Voice listing: 100 per hour per user

Testing

Manual Testing

# 1. Upload voice sample
VOICE_ID=$(curl -X POST http://localhost:8000/api/voices/upload \
  -H "Authorization: Bearer $JWT_TOKEN" \
  -F "name=Test Voice" \
  -F "gender=male" \
  -F "audio_file=@test_voice.mp3" \
  | jq -r '.voice_id')

# 2. Process voice
curl -X POST http://localhost:8000/api/voices/$VOICE_ID/process \
  -H "Authorization: Bearer $JWT_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{"provider_name": "elevenlabs"}'

# 3. Check status (wait for "ready")
curl -X GET http://localhost:8000/api/voices/$VOICE_ID \
  -H "Authorization: Bearer $JWT_TOKEN"

# 4. Use in podcast generation
curl -X POST http://localhost:8000/api/podcasts/generate \
  -H "Authorization: Bearer $JWT_TOKEN" \
  -H "Content-Type: application/json" \
  -d "{
    \"collection_id\": \"$COLLECTION_ID\",
    \"duration\": 5,
    \"host_voice\": \"custom:$VOICE_ID\",
    \"expert_voice\": \"alloy\"
  }"

Automated Testing

# Unit tests
cd backend
poetry run pytest tests/unit/test_voice_service.py -v

# Integration tests (requires provider API keys)
export ELEVENLABS_API_KEY=your-key
poetry run pytest tests/integration/test_voice_integration.py -v

Future Enhancements

  • Multi-sample voice cloning (upload multiple samples for better quality)
  • Voice preview before processing
  • Voice sharing between team members
  • Voice templates/presets
  • Batch voice processing
  • Voice analytics (usage metrics, quality trends)
  • Voice versioning (update voice samples)
  • Automatic voice enhancement (noise reduction, normalization)