CoT Schemas and Data Models

This document provides detailed information about all Pydantic schemas and data models used in the Chain of Thought system.

Overview

The CoT system uses strongly-typed Pydantic models for:

  • Input validation: Ensuring data integrity at API boundaries
  • Configuration management: Type-safe configuration handling
  • Data serialization: Consistent JSON serialization/deserialization
  • Documentation: Self-documenting schema with field descriptions
  • Testing: Reliable test fixtures and data generation

All schemas include comprehensive field validation and descriptive error messages.

Input/Output Schemas

ChainOfThoughtInput

Input schema for initiating CoT reasoning.

from typing import Any, Dict, Optional

from pydantic import UUID4, BaseModel, Field

class ChainOfThoughtInput(BaseModel):
    question: str = Field(..., min_length=1, description="User question")
    collection_id: UUID4 = Field(..., description="Collection ID")
    user_id: UUID4 = Field(..., description="User ID")
    cot_config: Optional[Dict[str, Any]] = Field(None, description="CoT configuration")
    context_metadata: Optional[Dict[str, Any]] = Field(None, description="Context metadata")

Field Details:

  • question: User's input question (required, non-empty)
  • collection_id: UUID of the document collection to search
  • user_id: UUID of the requesting user
  • cot_config: Optional runtime configuration overrides
  • context_metadata: Additional context information

Usage Example:

from uuid import UUID

cot_input = ChainOfThoughtInput(
    question="Compare supervised and unsupervised learning approaches",
    collection_id=UUID("123e4567-e89b-12d3-a456-426614174000"),
    user_id=UUID("987fcdeb-51d2-43a1-b123-456789abcdef"),
    cot_config={
        "max_reasoning_depth": 4,
        "reasoning_strategy": "comparison",
        "evaluation_threshold": 0.75
    }
)

ChainOfThoughtOutput

Comprehensive output from the CoT reasoning process.

class ChainOfThoughtOutput(BaseModel):
    original_question: str = Field(..., description="Original user question")
    final_answer: str = Field(..., description="Final synthesized answer")
    reasoning_steps: List[ReasoningStep] = Field(default_factory=list, description="Reasoning steps taken")
    source_summary: Optional[SourceSummary] = Field(None, description="Summary of source attributions")
    total_confidence: float = Field(default=0.0, description="Overall confidence score")
    token_usage: Optional[int] = Field(None, description="Total tokens used")
    total_execution_time: Optional[float] = Field(None, description="Total execution time")
    reasoning_strategy: Optional[str] = Field(None, description="Strategy used")

Validation Rules:

  • total_confidence: Must be between 0.0 and 1.0
  • token_usage: Must be positive if specified
  • total_execution_time: Must be positive if specified

Usage Example:

output = ChainOfThoughtOutput(
    original_question="Compare supervised and unsupervised learning",
    final_answer="Supervised learning uses labeled data while unsupervised learning...",
    reasoning_steps=[step1, step2, step3],
    source_summary=source_summary,
    total_confidence=0.87,
    token_usage=1250,
    total_execution_time=3.2,
    reasoning_strategy="comparison"
)

Configuration Schemas

ChainOfThoughtConfig

Configuration parameters for CoT reasoning behavior.

class ChainOfThoughtConfig(BaseModel):
    enabled: bool = Field(default=False, description="Whether CoT is enabled")
    max_reasoning_depth: int = Field(default=3, description="Maximum reasoning steps")
    reasoning_strategy: str = Field(
        default="decomposition",
        description="Strategy: decomposition, iterative, hierarchical, causal"
    )
    context_preservation: bool = Field(default=True, description="Preserve context across steps")
    token_budget_multiplier: float = Field(default=2.0, description="Token budget multiplier")
    evaluation_threshold: float = Field(default=0.6, description="Evaluation threshold")

Field Validation:

  • max_reasoning_depth: Must be > 0
  • token_budget_multiplier: Must be > 0
  • evaluation_threshold: Must be between 0.0 and 1.0
  • reasoning_strategy: Must be one of ["decomposition", "iterative", "hierarchical", "causal"]

Strategy Descriptions:

  • decomposition: Break complex questions into sub-questions
  • iterative: Build understanding through iterative refinement
  • hierarchical: Use hierarchical reasoning from general to specific
  • causal: Follow causal chains of reasoning

Configuration Examples:

# Conservative configuration
conservative_config = ChainOfThoughtConfig(
    enabled=True,
    max_reasoning_depth=2,
    reasoning_strategy="decomposition",
    evaluation_threshold=0.8
)

# Aggressive configuration
aggressive_config = ChainOfThoughtConfig(
    enabled=True,
    max_reasoning_depth=5,
    reasoning_strategy="hierarchical",
    token_budget_multiplier=3.0,
    evaluation_threshold=0.5
)

Process Schemas

QuestionClassification

Classification of questions to determine CoT applicability.

class QuestionClassification(BaseModel):
    question_type: str = Field(..., description="Type of question")
    complexity_level: str = Field(..., description="Complexity level")
    requires_cot: bool = Field(..., description="Whether CoT is needed")
    estimated_steps: Optional[int] = Field(None, description="Estimated reasoning steps")
    confidence: Optional[float] = Field(None, description="Classification confidence")
    reasoning: Optional[str] = Field(None, description="Classification reasoning")

Validation Rules:

  • question_type: Must be one of ["simple", "multi_part", "comparison", "causal", "complex_analytical"]
  • complexity_level: Must be one of ["low", "medium", "high", "very_high"]
  • confidence: Must be between 0.0 and 1.0 if specified
  • estimated_steps: Must be > 0 if specified

Classification Matrix:

Question Type      | Complexity Level | Requires CoT | Estimated Steps
-------------------|------------------|--------------|----------------
simple             | low              | False        | 1
multi_part         | medium           | True         | 2-3
comparison         | high             | True         | 3-4
causal             | high             | True         | 3-4
complex_analytical | very_high        | True         | 4-5
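
For example, a comparison question from the matrix above might be classified as follows (field values are illustrative):

classification = QuestionClassification(
    question_type="comparison",
    complexity_level="high",
    requires_cot=True,
    estimated_steps=3,
    confidence=0.9,
    reasoning="The question contrasts two learning paradigms"
)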

DecomposedQuestion

Individual sub-question in the reasoning chain.

class DecomposedQuestion(BaseModel):
    sub_question: str = Field(..., description="The sub-question")
    reasoning_step: int = Field(..., description="Step number in reasoning chain")
    dependency_indices: List[int] = Field(default_factory=list, description="Dependencies on other steps")
    question_type: Optional[str] = Field(None, description="Type of question")
    complexity_score: float = Field(default=0.5, description="Complexity score 0-1")

Validation Rules:

  • reasoning_step: Must be > 0
  • complexity_score: Must be between 0.0 and 1.0
  • question_type: Must be one of ["definition", "comparison", "causal", "procedural", "analytical"] if specified

Example Decomposition:

decomposed = [
    DecomposedQuestion(
        sub_question="What is supervised learning?",
        reasoning_step=1,
        question_type="definition",
        complexity_score=0.3
    ),
    DecomposedQuestion(
        sub_question="What is unsupervised learning?",
        reasoning_step=2,
        question_type="definition",
        complexity_score=0.3
    ),
    DecomposedQuestion(
        sub_question="How do supervised and unsupervised learning differ?",
        reasoning_step=3,
        dependency_indices=[1, 2],
        question_type="comparison",
        complexity_score=0.7
    )
]
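
The dependency_indices field makes it possible to derive an execution order for sub-questions. Below is a minimal sketch, assuming dependency_indices refer to the reasoning_step numbers of prerequisite steps (as in the example above); execution_order is a hypothetical helper, not part of the CoT system:

def execution_order(questions: List[DecomposedQuestion]) -> List[DecomposedQuestion]:
    """Order sub-questions so each step runs after its dependencies (illustrative)."""
    by_step = {q.reasoning_step: q for q in questions}
    ordered: List[DecomposedQuestion] = []
    visited = set()

    def visit(step: int) -> None:
        if step in visited:
            return
        visited.add(step)
        for dep in by_step[step].dependency_indices:
            if dep in by_step:
                visit(dep)  # visit prerequisites first
        ordered.append(by_step[step])

    for q in questions:
        visit(q.reasoning_step)
    return ordered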

ReasoningStep

Individual step in the reasoning chain with results and metadata.

class ReasoningStep(BaseModel):
    step_number: int = Field(..., description="Step number")
    question: str = Field(..., description="Question for this step")
    context_used: List[str] = Field(default_factory=list, description="Context documents used (legacy)")
    source_attributions: List[SourceAttribution] = Field(default_factory=list, description="Structured source attributions")
    intermediate_answer: Optional[str] = Field(None, description="Intermediate answer")
    confidence_score: Optional[float] = Field(default=0.0, description="Confidence score 0-1")
    reasoning_trace: Optional[str] = Field(None, description="Reasoning trace")
    execution_time: Optional[float] = Field(None, description="Execution time in seconds")

Validation Rules:

  • step_number: Must be > 0
  • confidence_score: Must be between 0.0 and 1.0 if specified
  • execution_time: Must be > 0 if specified

Example Step:

step = ReasoningStep(
    step_number=1,
    question="What is supervised learning?",
    source_attributions=[attribution1, attribution2],
    intermediate_answer="Supervised learning is a machine learning approach...",
    confidence_score=0.85,
    reasoning_trace="Step 1: Analyzing definition of supervised learning",
    execution_time=1.2
)
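
Steps like this can then be assembled into a ChainOfThoughtOutput. A minimal sketch, assuming total confidence is aggregated as the mean of step confidences (the actual aggregation strategy is an implementation detail):

steps = [step]
mean_confidence = sum(s.confidence_score or 0.0 for s in steps) / len(steps)

output = ChainOfThoughtOutput(
    original_question="What is supervised learning?",
    final_answer=steps[-1].intermediate_answer or "",
    reasoning_steps=steps,
    total_confidence=mean_confidence,
    total_execution_time=sum(s.execution_time or 0.0 for s in steps)
)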

Source Attribution Schemas

SourceAttribution

Attribution information for individual source documents.

class SourceAttribution(BaseModel):
    document_id: str = Field(..., description="Unique identifier for the source document")
    document_title: Optional[str] = Field(None, description="Title or name of the source document")
    relevance_score: float = Field(..., description="Relevance score for this source (0-1)")
    excerpt: Optional[str] = Field(None, description="Relevant excerpt from the source")
    chunk_index: Optional[int] = Field(None, description="Index of the chunk within the document")
    retrieval_rank: Optional[int] = Field(None, description="Rank in the retrieval results")

Validation Rules:

  • relevance_score: Must be between 0.0 and 1.0

Example Attribution:

attribution = SourceAttribution(
    document_id="ml_textbook_ch3",
    document_title="Machine Learning Textbook - Chapter 3: Supervised Learning",
    relevance_score=0.92,
    excerpt="Supervised learning algorithms learn from labeled training data to make predictions...",
    chunk_index=0,
    retrieval_rank=1
)

SourceSummary

Aggregated source information across the reasoning chain.

class SourceSummary(BaseModel):
    all_sources: List[SourceAttribution] = Field(default_factory=list, description="All unique sources used")
    primary_sources: List[SourceAttribution] = Field(default_factory=list, description="Most influential sources")
    source_usage_by_step: Dict[int, List[str]] = Field(default_factory=dict, description="Sources used by each step")
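
A SourceSummary might be built from completed reasoning steps by deduplicating attributions and ranking them by relevance. The following is a minimal sketch; the deduplication and primary-source selection rules are illustrative assumptions, not the system's actual logic:

def build_source_summary(steps: List[ReasoningStep]) -> SourceSummary:
    unique: Dict[str, SourceAttribution] = {}
    usage: Dict[int, List[str]] = {}
    for step in steps:
        usage[step.step_number] = [a.document_id for a in step.source_attributions]
        for attr in step.source_attributions:
            # Keep the highest-relevance attribution per document
            current = unique.get(attr.document_id)
            if current is None or attr.relevance_score > current.relevance_score:
                unique[attr.document_id] = attr
    ranked = sorted(unique.values(), key=lambda a: a.relevance_score, reverse=True)
    return SourceSummary(
        all_sources=ranked,
        primary_sources=ranked[:3],  # top three by relevance (illustrative cutoff)
        source_usage_by_step=usage
    )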

Usage Patterns:

# Access all sources
for source in summary.all_sources:
    print(f"Source: {source.document_title} (relevance: {source.relevance_score})")

# Access primary sources for display
primary_display = [
    {
        "title": source.document_title,
        "relevance": f"{source.relevance_score:.1%}",
        # excerpt is Optional, so guard against None before slicing
        "excerpt": (source.excerpt or "")[:100] + "..."
    }
    for source in summary.primary_sources
]

# Access step-by-step breakdown
for step_num, doc_ids in summary.source_usage_by_step.items():
    print(f"Step {step_num} used {len(doc_ids)} sources: {', '.join(doc_ids)}")

Serialization and Validation

JSON Serialization

All schemas support JSON serialization for API responses:

# Serialize to JSON
output_json = cot_output.model_dump_json()

# Deserialize from JSON
parsed_output = ChainOfThoughtOutput.model_validate_json(output_json)
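
Pydantic v2 also supports plain-dict round-trips, which is convenient when the payload has already been parsed:

# Serialize to a plain dict
output_dict = cot_output.model_dump()

# Deserialize from a dict
parsed_output = ChainOfThoughtOutput.model_validate(output_dict)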

Field Validation

Schemas include comprehensive field validation:

# This will raise ValidationError
try:
    invalid_config = ChainOfThoughtConfig(
        max_reasoning_depth=-1,  # Invalid: must be > 0
        evaluation_threshold=1.5,  # Invalid: must be <= 1.0
        reasoning_strategy="invalid"  # Invalid: not in allowed list
    )
except ValidationError as e:
    print(f"Validation errors: {e.errors()}")

Custom Validators

Advanced validation using Pydantic validators:

@field_validator("relevance_score")
def validate_relevance_score(cls, v):
    if v < 0 or v > 1:
        raise ValueError("relevance_score must be between 0 and 1")
    return v

@field_validator("reasoning_strategy")
def validate_strategy(cls, v):
    valid_strategies = ["decomposition", "iterative", "hierarchical", "causal"]
    if v not in valid_strategies:
        raise ValueError(f"reasoning_strategy must be one of {valid_strategies}")
    return v

Testing Support

Factory Functions

Schemas can be used with factory functions for testing:

def create_test_cot_input(**kwargs):
    defaults = {
        "question": "Test question",
        "collection_id": UUID("123e4567-e89b-12d3-a456-426614174000"),
        "user_id": UUID("987fcdeb-51d2-43a1-b123-456789abcdef")
    }
    defaults.update(kwargs)
    return ChainOfThoughtInput(**defaults)

# Usage in tests
def test_cot_execution():
    test_input = create_test_cot_input(
        question="Compare A and B",
        cot_config={"max_reasoning_depth": 2}
    )
    # ... test logic

Mock Data Generation

Schemas support mock data generation for testing:

def create_mock_reasoning_step(step_number: int = 1):
    return ReasoningStep(
        step_number=step_number,
        question=f"Mock question {step_number}",
        intermediate_answer=f"Mock answer {step_number}",
        confidence_score=0.8,
        execution_time=1.0
    )
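
Mock steps can likewise be combined into a full mock output for end-to-end tests; create_mock_cot_output below is a hypothetical helper following the same pattern:

def create_mock_cot_output(num_steps: int = 3) -> ChainOfThoughtOutput:
    steps = [create_mock_reasoning_step(i) for i in range(1, num_steps + 1)]
    return ChainOfThoughtOutput(
        original_question="Mock question",
        final_answer="Mock final answer",
        reasoning_steps=steps,
        total_confidence=0.8
    )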

Error Handling

Validation Errors

Schemas provide detailed validation error messages:

try:
    invalid_step = ReasoningStep(
        step_number=0,  # Invalid
        question=""     # Invalid
    )
except ValidationError as e:
    for error in e.errors():
        print(f"Field: {error['loc']}, Error: {error['msg']}")

Custom Error Messages

Field validators include descriptive error messages:

# Example validation error output:
{
    "loc": ["max_reasoning_depth"],
    "msg": "max_reasoning_depth must be greater than 0",
    "type": "value_error"
}

Schema Evolution

Backward Compatibility

Schemas are designed with backward compatibility in mind:

  • Optional fields have default values
  • New fields are added as optional
  • Deprecated fields are marked but not removed immediately

Migration Support

# Handle legacy data formats
def migrate_legacy_reasoning_step(legacy_data: dict) -> ReasoningStep:
    # Convert old context format to new source attribution format
    if "context_used" in legacy_data and not legacy_data.get("source_attributions"):
        # Convert context strings to basic attributions
        attributions = []
        for i, context in enumerate(legacy_data["context_used"]):
            attribution = SourceAttribution(
                document_id=f"legacy_context_{i}",
                relevance_score=0.5,
                excerpt=context[:200]
            )
            attributions.append(attribution)
        legacy_data["source_attributions"] = attributions

    return ReasoningStep(**legacy_data)

This comprehensive schema system provides type safety, validation, and documentation while maintaining flexibility for future enhancements to the CoT system.