Source Attribution in Chain of Thought¶

Source attribution is a critical component of the CoT system that tracks which documents and sources were used at each step of the reasoning process. This provides transparency and allows users to understand how the AI arrived at its conclusions.

Overview¶

The source attribution system addresses the key question: "How did the RAG search arrive at this answer?"

Unlike traditional RAG systems that only show sources for the final answer, CoT source attribution provides:

Step-level attribution: Which sources influenced each reasoning step
Relevance scoring: How relevant each source was to each step
Source aggregation: Combined view of all sources used across the reasoning chain
Primary source identification: Most influential sources highlighted
Usage tracking: Which steps used which sources

Architecture¶

Core Components¶

SourceAttribution: Individual source document metadata
SourceSummary: Aggregated source information across all steps
SourceAttributionService: Service for managing source tracking
Enhanced ReasoningStep: Reasoning steps with source attribution

Data Flow¶

graph TD
    A[Retrieved Documents] --> B[SourceAttributionService]
    B --> C[SourceAttribution Objects]
    C --> D[Enhanced ReasoningStep]
    D --> E[Aggregate Sources]
    E --> F[SourceSummary]
    F --> G[Final CoT Output]

Schemas¶

SourceAttribution¶

Represents attribution for a single source document used in reasoning.

class SourceAttribution(BaseModel):
    document_id: str = Field(..., description="Unique identifier for the source document")
    document_title: Optional[str] = Field(None, description="Title or name of the source document")
    relevance_score: float = Field(..., description="Relevance score for this source (0-1)")
    excerpt: Optional[str] = Field(None, description="Relevant excerpt from the source")
    chunk_index: Optional[int] = Field(None, description="Index of the chunk within the document")
    retrieval_rank: Optional[int] = Field(None, description="Rank in the retrieval results")

Field Details:

document_id: Unique identifier (e.g., "doc_12345", "pdf_section_3")
document_title: Human-readable title for display
relevance_score: 0.0-1.0 score indicating source relevance to the reasoning step
excerpt: First 200 characters of relevant content
chunk_index: For chunked documents, which chunk was used
retrieval_rank: Original ranking from vector search (1-based)

SourceSummary¶

Aggregated view of all sources used across the entire reasoning chain.

class SourceSummary(BaseModel):
    all_sources: List[SourceAttribution] = Field(default_factory=list, description="All unique sources used")
    primary_sources: List[SourceAttribution] = Field(default_factory=list, description="Most influential sources")
    source_usage_by_step: Dict[int, List[str]] = Field(default_factory=dict, description="Sources used by each step")

Aggregation Logic:

all_sources: Deduplicated list of all sources, sorted by relevance
primary_sources: Top 3 sources with relevance > 0.7, or top 3 overall
source_usage_by_step: Maps step numbers to lists of document IDs used

Enhanced ReasoningStep¶

Standard ReasoningStep enhanced with source attribution:

class ReasoningStep(BaseModel):
    # ... existing fields ...
    source_attributions: List[SourceAttribution] = Field(default_factory=list, description="Structured source attributions")
    context_used: List[str] = Field(default_factory=list, description="Context documents used (legacy)")

SourceAttributionService¶

Key Methods¶

`create_source_attribution()`¶

Creates structured source attribution objects:

def create_source_attribution(
    self,
    document_id: str,
    relevance_score: float,
    document_title: Optional[str] = None,
    excerpt: Optional[str] = None,
    chunk_index: Optional[int] = None,
    retrieval_rank: Optional[int] = None,
) -> SourceAttribution

Example Usage:

attribution = service.create_source_attribution(
    document_id="ml_handbook_ch3",
    relevance_score=0.85,
    document_title="Machine Learning Handbook - Chapter 3",
    excerpt="Machine learning algorithms can be categorized into supervised...",
    chunk_index=2,
    retrieval_rank=1
)

`extract_sources_from_context()`¶

Extracts source attributions from context documents or search results:

def extract_sources_from_context(
    self,
    context_documents: List[str],
    search_results: Optional[List[Dict[str, Any]]] = None,
) -> List[SourceAttribution]

Two Modes of Operation:

Structured Search Results (preferred):

search_results = [
    {
        "document_id": "doc_123",
        "title": "Introduction to AI",
        "score": 0.92,
        "content": "Artificial intelligence involves...",
        "chunk_index": 0
    }
]
attributions = service.extract_sources_from_context([], search_results)

Raw Context Strings (fallback):

context_docs = [
    "id:doc_123 Artificial intelligence involves creating systems...",
    "Machine learning is a subset of AI that..."
]
attributions = service.extract_sources_from_context(context_docs)

`aggregate_sources_across_steps()`¶

Combines source attributions from all reasoning steps:

def aggregate_sources_across_steps(
    self, reasoning_steps: List[ReasoningStep]
) -> SourceSummary

Aggregation Process:

Collect all sources from each reasoning step
Deduplicate sources by document_id, keeping highest relevance score
Identify primary sources (relevance > 0.7 or top 3)
Track step-by-step usage for each source

`enhance_reasoning_step_with_sources()`¶

Adds source attributions to reasoning steps:

def enhance_reasoning_step_with_sources(
    self,
    step: ReasoningStep,
    retrieved_documents: Optional[List[Dict[str, Any]]] = None,
) -> ReasoningStep

Enhancement Process:

Extract sources from retrieved documents or step context
Create SourceAttribution objects
Attach attributions to the reasoning step
Return enhanced step

Caching and Deduplication¶

The service includes built-in caching to avoid duplicate processing:

def __init__(self) -> None:
    self._source_cache: Dict[str, SourceAttribution] = {}

# Cache for deduplication
self._source_cache[document_id] = attribution

Integration with CoT Pipeline¶

During Reasoning Step Execution¶

async def execute_reasoning_step(self, step_number: int, question: str, context: list[str], previous_answers: list[str]) -> ReasoningStep:
    # ... reasoning logic ...

    # Create basic reasoning step
    step = ReasoningStep(
        step_number=step_number,
        question=question,
        context_used=full_context[:5],
        intermediate_answer=intermediate_answer,
        confidence_score=confidence_score,
        reasoning_trace=f"Step {step_number}: Analyzing {question}",
        execution_time=time.time() - start_time
    )

    # Enhance with source attributions
    enhanced_step = self.source_attribution_service.enhance_reasoning_step_with_sources(
        step=step,
        retrieved_documents=None  # TODO: Pass actual retrieved documents
    )

    return enhanced_step

During Final Synthesis¶

async def execute_chain_of_thought(self, cot_input: ChainOfThoughtInput, context_documents: list[str] | None = None) -> ChainOfThoughtOutput:
    # ... execute reasoning steps ...

    # Generate source summary
    source_summary = self.source_attribution_service.aggregate_sources_across_steps(reasoning_steps)

    return ChainOfThoughtOutput(
        original_question=cot_input.question,
        final_answer=final_answer,
        reasoning_steps=reasoning_steps,
        source_summary=source_summary,  # Include aggregated sources
        # ... other fields ...
    )

Display and User Interface¶

Formatted Output for UI¶

The service provides UI-friendly formatting:

def format_sources_for_display(
    self, source_summary: SourceSummary, include_excerpts: bool = True
) -> Dict[str, Any]:
    """Format sources for user-friendly display."""
    return {
        "total_sources": len(source_summary.all_sources),
        "primary_sources": [
            {
                "document_id": source.document_id,
                "title": source.document_title or source.document_id,
                "relevance": round(source.relevance_score, 2),
                "excerpt": source.excerpt if include_excerpts else None
            }
            for source in source_summary.primary_sources
        ],
        "step_breakdown": {
            f"step_{step_num}": {
                "step_number": step_num,
                "sources_used": len(doc_ids),
                "document_ids": doc_ids
            }
            for step_num, doc_ids in source_summary.source_usage_by_step.items()
        }
    }

Example UI Output¶

{
  "total_sources": 5,
  "primary_sources": [
    {
      "document_id": "ml_guide_ch1",
      "title": "Machine Learning Guide - Chapter 1",
      "relevance": 0.92,
      "excerpt": "Machine learning algorithms fall into three main categories..."
    },
    {
      "document_id": "dl_paper_2023",
      "title": "Deep Learning Advances 2023",
      "relevance": 0.87,
      "excerpt": "Recent advances in deep learning have shown significant improvements..."
    }
  ],
  "step_breakdown": {
    "step_1": {
      "step_number": 1,
      "sources_used": 3,
      "document_ids": ["ml_guide_ch1", "ai_intro", "algorithms_ref"]
    },
    "step_2": {
      "step_number": 2,
      "sources_used": 2,
      "document_ids": ["dl_paper_2023", "neural_networks"]
    }
  }
}

Relevance Scoring¶

Automatic Scoring¶

When structured search results aren't available, the system uses heuristic scoring:

# Decreasing relevance by retrieval rank
relevance_score = max(0.3, 1.0 - (i * 0.1))

Confidence-Based Scoring¶

Relevance can be adjusted based on reasoning step confidence:

# Higher confidence steps get higher source relevance
adjusted_relevance = base_relevance * step.confidence_score

Integration with RAG Search¶

Traditional RAG vs CoT Source Attribution¶

Traditional RAG:

# Shows final sources only
return SearchOutput(
    answer="Machine learning is...",
    sources=[doc1, doc2, doc3]  # Sources for final answer only
)

CoT with Source Attribution:

# Shows step-by-step source usage
return SearchOutput(
    answer=cot_result.final_answer,
    sources=cot_result.source_summary.primary_sources,
    reasoning_trace=cot_result.reasoning_steps,  # Each step has its own sources
    source_breakdown=cot_result.source_summary.source_usage_by_step
)

Enhanced Transparency¶

CoT source attribution provides:

Which sources influenced which reasoning steps
How relevant each source was to each step
Which sources were most influential overall
Complete traceability from question to final answer

Future Enhancements¶

Planned Improvements¶

Citation Generation: Automatic citation formatting for academic use
Source Validation: Cross-reference sources for consistency
Interactive Source Exploration: UI for drilling down into source usage
Source Quality Scoring: Evaluate source reliability and authority
Conflict Detection: Identify when sources contradict each other

Integration Opportunities¶

Document Versioning: Track which version of a document was used
Real-time Updates: Update attributions when source documents change
User Feedback: Allow users to rate source relevance
A/B Testing: Compare source attribution strategies

Testing and Validation¶

Unit Tests¶

The source attribution system includes comprehensive tests:

def test_source_extraction_from_context():
    service = SourceAttributionService()
    context_docs = ["id:doc_123 Machine learning content...", "Deep learning overview..."]

    attributions = service.extract_sources_from_context(context_docs)

    assert len(attributions) == 2
    assert attributions[0].document_id == "doc_123"
    assert attributions[0].relevance_score > 0.3

Integration Tests¶

End-to-end tests validate the complete source attribution pipeline:

async def test_cot_with_source_attribution():
    # Execute CoT with source tracking
    result = await cot_service.execute_chain_of_thought(cot_input, context_docs)

    # Verify source attribution
    assert result.source_summary is not None
    assert len(result.source_summary.all_sources) > 0
    assert len(result.source_summary.primary_sources) <= 3

    # Verify step-level attribution
    for step in result.reasoning_steps:
        assert len(step.source_attributions) > 0

This comprehensive source attribution system transforms CoT from a "black box" reasoning process into a transparent, traceable, and trustworthy system where users can understand exactly how the AI arrived at its conclusions.