Cross-Encoder Reranking¶

Overview¶

Cross-Encoder reranking is a production-ready feature that provides 250x faster and more accurate relevance scoring compared to LLM-based reranking. It uses specialized BERT-based models from the sentence-transformers library to score query-document pairs directly, achieving sub-100ms latency for typical workloads.

Key Benefits:

Performance: 20-30s → 80ms (250x speedup over LLM-based reranking)
Accuracy: Higher relevance scores using specialized cross-encoder models
Cost: No LLM API calls required for reranking
Scalability: Can rerank hundreds of documents in <1 second
Production-Ready: Comprehensive error handling, async support, extensive test coverage

Performance Comparison¶

Reranking Method	20 Documents	50 Documents	100 Documents
LLM-Based (WatsonX)	~20-30s	~60-100s	~180-300s
Cross-Encoder (MiniLM-L-6)	~80ms	~150ms	~250ms
Speedup	250x	400-600x	720-1200x

Model Options¶

Model	Size	Speed	Accuracy	Use Case
`ms-marco-TinyBERT-L-2-v2`	17MB	30-40ms	Good	Low-latency production
`ms-marco-MiniLM-L-6-v2`	80MB	80-100ms	Better	Recommended default
`ms-marco-MiniLM-L-12-v2`	120MB	150-200ms	Best	Maximum accuracy

Architecture¶

Component Overview¶

┌─────────────────────────────────────────────────────────────┐
│                    SearchService                             │
│  ┌─────────────────────────────────────────────────────┐    │
│  │  1. Vector Search (Milvus)                          │    │
│  │     └→ Returns top 100 candidates                   │    │
│  └─────────────────────────────────────────────────────┘    │
│                          ↓                                   │
│  ┌─────────────────────────────────────────────────────┐    │
│  │  2. CrossEncoderReranker                            │    │
│  │     ├─ Model: ms-marco-MiniLM-L-6-v2                │    │
│  │     ├─ Scores: 100 query-doc pairs in ~80ms         │    │
│  │     └─ Returns: Top 10 reranked results             │    │
│  └─────────────────────────────────────────────────────┘    │
│                          ↓                                   │
│  ┌─────────────────────────────────────────────────────┐    │
│  │  3. Response Assembly                                │    │
│  │     └→ Final ranked results to user                 │    │
│  └─────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────┘

Core Classes¶

CrossEncoderReranker (backend/rag_solution/retrieval/reranker.py)

class CrossEncoderReranker:
    """Cross-encoder reranker using sentence-transformers models.

    Provides both synchronous and asynchronous reranking with:
    - Automatic model loading and caching
    - Error handling and validation
    - Performance metrics logging
    - Type-safe QueryResult integration
    """

    def __init__(self, model_name: str | None = None):
        """Initialize with optional model override."""

    def rerank(
        self,
        query: str,
        results: list[QueryResult],
        top_k: int | None = None,
    ) -> list[QueryResult]:
        """Synchronous reranking (CPU-bound operation)."""

    async def rerank_async(
        self,
        query: str,
        results: list[QueryResult],
        top_k: int | None = None,
    ) -> list[QueryResult]:
        """Async reranking (runs in thread pool executor)."""

Usage¶

Basic Usage¶

from rag_solution.retrieval.reranker import CrossEncoderReranker
from vectordbs.data_types import QueryResult

# Initialize reranker (model loads once, ~7s first time, <1s cached)
reranker = CrossEncoderReranker()

# Rerank search results
query = "What is machine learning?"
reranked_results = reranker.rerank(
    query=query,
    results=search_results,  # list[QueryResult] from vector search
    top_k=10  # Return top 10 reranked results
)

# Results are sorted by cross-encoder scores
for result in reranked_results:
    print(f"Score: {result.score:.4f} - {result.chunk.text[:100]}")

Async Usage (Recommended for Web APIs)¶

import asyncio
from rag_solution.retrieval.reranker import CrossEncoderReranker

async def search_with_reranking(query: str, collection_id: str):
    reranker = CrossEncoderReranker()

    # Get initial results from vector store
    vector_results = await vector_store.search(query, collection_id, top_k=100)

    # Rerank asynchronously (non-blocking)
    reranked_results = await reranker.rerank_async(
        query=query,
        results=vector_results,
        top_k=10
    )

    return reranked_results

# Run async
results = asyncio.run(search_with_reranking("ML basics", "col_123"))

Custom Model Selection¶

# Use fastest model for low-latency production
fast_reranker = CrossEncoderReranker(
    model_name="cross-encoder/ms-marco-TinyBERT-L-2-v2"
)

# Use most accurate model for critical queries
accurate_reranker = CrossEncoderReranker(
    model_name="cross-encoder/ms-marco-MiniLM-L-12-v2"
)

Integration in SearchService¶

The cross-encoder is integrated into SearchService (backend/rag_solution/services/search_service.py):

async def search(self, search_input: SearchInput) -> SearchResponse:
    # 1. Vector search (top 100 candidates)
    vector_results = await self.vector_store.search(
        query=query,
        collection_id=collection_id,
        top_k=100  # Over-fetch for reranking
    )

    # 2. Cross-encoder reranking (if enabled)
    if pipeline_config.reranker_enabled:
        reranked_results = await self.reranker.rerank_async(
            query=query,
            results=vector_results,
            top_k=10  # Final top-k
        )

    # 3. Return final results
    return SearchResponse(results=reranked_results)

Configuration¶

Pipeline Configuration¶

Cross-encoder reranking is configured via pipeline settings:

# In PipelineService or configuration
pipeline_config = {
    "reranker_enabled": True,  # Enable/disable reranking
    "reranker_model": "cross-encoder/ms-marco-MiniLM-L-6-v2",  # Model name
    "reranker_top_k": 10,  # Final number of results
    "vector_top_k": 100,  # Initial candidates to rerank
}

Environment Variables¶

# Optional: Override default cross-encoder model
CROSS_ENCODER_MODEL=cross-encoder/ms-marco-MiniLM-L-6-v2

# Optional: Model cache directory
SENTENCE_TRANSFORMERS_HOME=/path/to/model/cache

Model Caching¶

Models are cached automatically by sentence-transformers:

First load: ~7 seconds (download + initialize)
Subsequent loads: <1 second (cached)
Cache location: ~/.cache/torch/sentence_transformers/

Production tip: Pre-download models in Docker build:

# In Dockerfile
RUN python -c "from sentence_transformers import CrossEncoder; \
    CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')"

Error Handling¶

The reranker includes comprehensive error handling:

Exception Types¶

try:
    reranked = reranker.rerank(query, results)
except ValueError as e:
    # Wrapped exception with context
    # Example: "Reranking failed for model ms-marco-MiniLM-L-6-v2: OOM error"
    logger.error(f"Reranking failed: {e}")
    # Fall back to original vector search results
    return results

Common Errors¶

Error	Cause	Solution
`ValueError: Reranking failed`	Model prediction error (OOM, CUDA, etc.)	Use smaller model or reduce batch size
`ValueError: zip() argument mismatch`	Model returned wrong number of scores	Check model compatibility
`AttributeError: 'NoneType' object has no attribute 'predict'`	Model initialization failed	Check model name and network connection

Logging¶

The reranker logs performance metrics:

INFO: Loading cross-encoder model: cross-encoder/ms-marco-MiniLM-L-6-v2
INFO: Cross-encoder loaded successfully in 0.852s
INFO: Reranked 100 results → 10 results in 0.082s (model=ms-marco-MiniLM-L-6-v2)

Enable debug logging for detailed output:

import logging
logging.basicConfig(level=logging.DEBUG)

Testing¶

Unit Tests¶

Comprehensive test coverage (35 tests) in tests/unit/retrieval/test_cross_encoder_reranker.py:

# Run all cross-encoder tests
poetry run pytest tests/unit/retrieval/test_cross_encoder_reranker.py -v

# Run specific test categories
poetry run pytest tests/unit/retrieval/test_cross_encoder_reranker.py::TestCrossEncoderRerankerBasicFunctionality -v
poetry run pytest tests/unit/retrieval/test_cross_encoder_reranker.py::TestCrossEncoderRerankerAsync -v
poetry run pytest tests/unit/retrieval/test_cross_encoder_reranker.py::TestCrossEncoderRerankerErrorHandling -v

Test Coverage¶

✅ Initialization (5 tests)
✅ Basic reranking (5 tests)
✅ Top-k filtering (5 tests)
✅ Empty input handling (3 tests)
✅ Async operations (5 tests)
✅ Score validation (4 tests)
✅ Error handling (5 tests)
✅ QueryResult integration (3 tests)

Total: 35 tests, 100% passing

Manual Testing¶

# Test with real data
from rag_solution.retrieval.reranker import CrossEncoderReranker
from vectordbs.data_types import *

# Create test data
metadata = DocumentChunkMetadata(source=Source.PDF, document_id="test_doc")
chunk = DocumentChunkWithScore(
    chunk_id="chunk_1",
    text="Machine learning is a subset of artificial intelligence.",
    metadata=metadata
)
result = QueryResult(chunk=chunk, score=0.85)

# Test reranking
reranker = CrossEncoderReranker()
reranked = reranker.rerank("What is ML?", [result])
print(f"Original score: 0.85, Reranked score: {reranked[0].score:.4f}")

Performance Optimization¶

Best Practices¶

Over-fetch for reranking

# Retrieve more candidates than needed
vector_results = vector_search(query, top_k=100)
# Rerank to final top-k
final_results = reranker.rerank(query, vector_results, top_k=10)

Use async for web APIs

# Non-blocking reranking in FastAPI
@app.post("/search")
async def search_endpoint(query: str):
    results = await vector_search(query)
    reranked = await reranker.rerank_async(query, results)
    return reranked

Batch similar queries

# Rerank multiple queries concurrently
tasks = [
    reranker.rerank_async(q, results)
    for q in queries
]
all_results = await asyncio.gather(*tasks)

Choose appropriate model
Latency-critical: TinyBERT-L-2-v2 (~30ms)
Balanced: MiniLM-L-6-v2 (~80ms) ← Default
Accuracy-critical: MiniLM-L-12-v2 (~150ms)

Memory Usage¶

Model	RAM	GPU (optional)
TinyBERT-L-2	~100MB	~200MB VRAM
MiniLM-L-6	~200MB	~400MB VRAM
MiniLM-L-12	~300MB	~600MB VRAM

Production tip: Models run on CPU by default (no GPU required).

Migration Guide¶

From LLM-Based Reranking¶

Before (LLM-based):

# Old: Slow LLM-based reranking (~20-30s)
reranked = await llm_reranker.rerank(query, results, top_k=10)

After (Cross-encoder):

# New: Fast cross-encoder reranking (~80ms)
reranker = CrossEncoderReranker()
reranked = await reranker.rerank_async(query, results, top_k=10)

Migration checklist:

✅ Replace LLMReranker with CrossEncoderReranker
✅ Update imports: from rag_solution.retrieval.reranker import CrossEncoderReranker
✅ Remove LLM provider configuration for reranking
✅ Adjust vector_top_k to over-fetch (e.g., 100) for better reranking
✅ Update tests to use new reranker
✅ Monitor latency improvements in logs

Troubleshooting¶

Issue: Model Download Fails¶

Symptom: OSError: Can't load model from 'cross-encoder/...'

Solutions:

# 1. Check network connection
curl -I https://huggingface.co

# 2. Pre-download model manually
python -c "from sentence_transformers import CrossEncoder; \
    CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')"

# 3. Use local model path
reranker = CrossEncoderReranker(model_name="/path/to/local/model")

Issue: Out of Memory (OOM)¶

Symptom: RuntimeError: CUDA out of memory or process killed

Solutions:

Use smaller model: ms-marco-TinyBERT-L-2-v2
Reduce batch size (process fewer results)
Force CPU-only: export CUDA_VISIBLE_DEVICES=""

Issue: Slow First Request¶

Symptom: First reranking takes 7-10 seconds

Cause: Model download + initialization

Solutions:

Pre-warm in startup (recommended):

# In main.py or startup event
@app.on_event("startup")
async def warmup_reranker():
    reranker = CrossEncoderReranker()
    logger.info("Cross-encoder model pre-loaded")

Pre-download in Docker:

RUN python -c "from sentence_transformers import CrossEncoder; \
    CrossEncoder('cross-encoder/ms-marco-MiniLM-L-6-v2')"

Issue: Inconsistent Scores¶

Symptom: Scores vary between runs for same query

Cause: Non-deterministic model behavior (rare with cross-encoders)

Solution: Cross-encoders are deterministic by default (unlike LLMs). If scores vary:

Cross-Encoder Reranking¶

Overview¶

Performance Comparison¶

Model Options¶

Architecture¶

Component Overview¶

Core Classes¶

Usage¶

Basic Usage¶

Async Usage (Recommended for Web APIs)¶

Custom Model Selection¶

Integration in SearchService¶

Configuration¶

Pipeline Configuration¶

Environment Variables¶

Model Caching¶

Error Handling¶

Exception Types¶

Common Errors¶

Logging¶

Testing¶

Unit Tests¶

Test Coverage¶

Manual Testing¶

Performance Optimization¶

Best Practices¶

Memory Usage¶

Migration Guide¶

From LLM-Based Reranking¶

Troubleshooting¶

Issue: Model Download Fails¶

Issue: Out of Memory (OOM)¶

Issue: Slow First Request¶

Issue: Inconsistent Scores¶

References¶

Technical Documentation¶

External Resources¶

See Also¶

Cross-Encoder Reranking¶

Overview¶

Performance Comparison¶

Model Options¶

Architecture¶

Component Overview¶

Core Classes¶

Usage¶

Basic Usage¶

Async Usage (Recommended for Web APIs)¶

Custom Model Selection¶

Integration in SearchService¶

Configuration¶

Pipeline Configuration¶

Environment Variables¶

Model Caching¶

Error Handling¶

Exception Types¶

Common Errors¶

Logging¶

Testing¶

Unit Tests¶

Test Coverage¶

Manual Testing¶

Performance Optimization¶

Best Practices¶

Memory Usage¶

Migration Guide¶

From LLM-Based Reranking¶

Troubleshooting¶

Issue: Model Download Fails¶

Issue: Out of Memory (OOM)¶

Issue: Slow First Request¶

Issue: Inconsistent Scores¶

References¶

Technical Documentation¶

External Resources¶

Related Features¶

See Also¶