Ingestion Adapters¶
WikiMind's ingestion layer transforms raw sources into normalized documents ready for LLM compilation.
Architecture¶
IngestService¶
The IngestService orchestrates all adapters. When a URL is submitted:
- YouTube detection -- URLs containing `youtube.com` or `youtu.be` route to `YouTubeAdapter`
- PDF detection -- URLs ending in `.pdf` route to `PDFAdapter`. If the extension is missing, the response `Content-Type` is checked; if it is `application/pdf`, the content is downloaded and passed to `PDFAdapter`.
- HTML fallback -- Everything else goes to `URLAdapter`
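A minimal sketch of this routing logic (the function name and HTTP client are illustrative, not the actual IngestService API):

```python
from urllib.parse import urlparse

import httpx  # assumed HTTP client; the real service may use a different one


def route_url(url: str) -> str:
    """Decide which adapter should handle a submitted URL (illustrative sketch)."""
    parsed = urlparse(url)
    host = parsed.netloc.lower()

    # YouTube detection: youtube.com or youtu.be hosts go to the YouTube adapter.
    if "youtube.com" in host or "youtu.be" in host:
        return "youtube"

    # PDF detection: a .pdf extension is enough on its own.
    if parsed.path.lower().endswith(".pdf"):
        return "pdf"

    # No extension: fall back to checking the response Content-Type header.
    response = httpx.head(url, follow_redirects=True)
    if response.headers.get("content-type", "").startswith("application/pdf"):
        return "pdf"

    # HTML fallback: everything else goes to the URL adapter.
    return "url"
```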
URL Adapter¶
Uses trafilatura for HTML extraction. Trafilatura strips navigation, ads, and boilerplate, returning clean article text with metadata (title, author, published date).
Input: URL string
Output: Source row + NormalizedDocument with clean text, metadata, and token estimate
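A minimal sketch of the extraction step using trafilatura's public API (the surrounding Source/NormalizedDocument wiring is omitted):

```python
import trafilatura

downloaded = trafilatura.fetch_url("https://example.com/article")

# Boilerplate-free article text (navigation, ads, and comments stripped)
clean_text = trafilatura.extract(downloaded, include_comments=False)

# Metadata: title, author, publication date
metadata = trafilatura.extract_metadata(downloaded)
print(metadata.title, metadata.author, metadata.date)
```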
PDF Adapter¶
Two extraction backends with automatic fallback:
docling-serve (primary)¶
docling-serve runs as a sidecar container and provides:
- Heading hierarchy preservation
- Table structure extraction
- OCR for scanned pages
- Image/figure extraction
The adapter sends PDF bytes to docling-serve's /v1/convert/source endpoint and receives structured markdown.
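A hedged sketch of that call; the request and response schemas vary between docling-serve versions, so the payload keys below (base64-encoded file content, a markdown output format, and `document.md_content` in the response) are assumptions:

```python
import base64

import httpx

DOCLING_URL = "http://docling-serve:5001"  # assumed sidecar address


def convert_pdf(pdf_bytes: bytes, filename: str = "input.pdf") -> str:
    """Send PDF bytes to docling-serve and return structured markdown (sketch)."""
    payload = {
        # Assumed schema: file content embedded as base64 in the request body.
        "sources": [
            {
                "kind": "file",
                "base64_string": base64.b64encode(pdf_bytes).decode(),
                "filename": filename,
            }
        ],
        "options": {"to_formats": ["md"]},
    }
    response = httpx.post(f"{DOCLING_URL}/v1/convert/source", json=payload, timeout=120)
    response.raise_for_status()
    # Assumed response shape: markdown under document.md_content.
    return response.json()["document"]["md_content"]
```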
pymupdf fallback¶
When docling-serve is unavailable, the adapter falls back to pymupdf (fitz) for basic text extraction. This extracts text but loses structural information.
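The fallback path amounts to plain text extraction; a minimal sketch with pymupdf:

```python
import fitz  # pymupdf


def extract_text_fallback(pdf_bytes: bytes) -> str:
    """Basic text extraction without heading, table, or figure structure."""
    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        return "\n\n".join(page.get_text() for page in doc)
```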
Vision-enhanced processing¶
For slide decks and image-heavy PDFs, pages with fewer characters than WIKIMIND_VISION_TEXT_THRESHOLD (default: 300) are rendered as images and described by a multimodal LLM. This captures content from diagrams, charts, and cover slides that text extraction misses.
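A sketch of the page-selection step, assuming pymupdf for rendering; the `describe_image` call stands in for the multimodal LLM and is hypothetical:

```python
import os

import fitz  # pymupdf

TEXT_THRESHOLD = int(os.getenv("WIKIMIND_VISION_TEXT_THRESHOLD", "300"))


def vision_candidates(pdf_bytes: bytes) -> list[bytes]:
    """Return PNG renderings of pages with too little extractable text."""
    images = []
    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        for page in doc:
            if len(page.get_text().strip()) < TEXT_THRESHOLD:
                # Render the sparse page at 2x resolution for the multimodal model.
                pix = page.get_pixmap(matrix=fitz.Matrix(2, 2))
                images.append(pix.tobytes("png"))
    return images


# descriptions = [describe_image(png) for png in vision_candidates(raw)]  # hypothetical LLM call
```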
Image extraction¶
Docling extracts figures and tables from PDFs as images. These are:
- Saved to `~/.wikimind/images/{source_id}/`
- Served via the `/images/` static mount
- Displayed in the frontend's FiguresPanel alongside the article
Controlled by WIKIMIND_IMAGE_EXTRACTION_ENABLED (default: true) and WIKIMIND_IMAGE_MAX_PER_PDF (default: 30).
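A sketch of how extracted figures might be persisted under those settings; the paths and environment variables follow the defaults above, but the saving logic itself is illustrative:

```python
import os
from pathlib import Path

IMAGES_ROOT = Path.home() / ".wikimind" / "images"
MAX_IMAGES = int(os.getenv("WIKIMIND_IMAGE_MAX_PER_PDF", "30"))
ENABLED = os.getenv("WIKIMIND_IMAGE_EXTRACTION_ENABLED", "true").lower() == "true"


def save_images(source_id: str, images: list[bytes]) -> list[str]:
    """Write extracted figure bytes to ~/.wikimind/images/{source_id}/ (sketch)."""
    if not ENABLED:
        return []
    out_dir = IMAGES_ROOT / source_id
    out_dir.mkdir(parents=True, exist_ok=True)
    urls = []
    for i, data in enumerate(images[:MAX_IMAGES]):
        path = out_dir / f"figure_{i:03d}.png"
        path.write_bytes(data)
        # Served back to the frontend through the /images/ static mount.
        urls.append(f"/images/{source_id}/{path.name}")
    return urls
```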
Text Adapter¶
The simplest adapter -- accepts raw text content and an optional title.
Input: Content string + optional title
Output: Source row + NormalizedDocument
YouTube Adapter¶
Uses youtube-transcript-api to extract video transcripts. Handles both youtube.com and youtu.be URL formats.
Input: YouTube URL
Output: Source row + NormalizedDocument with transcript text
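A minimal sketch, assuming the library's classic `get_transcript` classmethod (newer releases expose an instance-based `fetch` instead); the URL-parsing helper is illustrative:

```python
from urllib.parse import parse_qs, urlparse

from youtube_transcript_api import YouTubeTranscriptApi


def video_id_from_url(url: str) -> str:
    """Handle both youtube.com/watch?v=... and youtu.be/... forms."""
    parsed = urlparse(url)
    if parsed.netloc.endswith("youtu.be"):
        return parsed.path.lstrip("/")
    return parse_qs(parsed.query)["v"][0]


def fetch_transcript(url: str) -> str:
    segments = YouTubeTranscriptApi.get_transcript(video_id_from_url(url))
    return " ".join(segment["text"] for segment in segments)
```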
NormalizedDocument¶
All adapters produce a NormalizedDocument -- the common format consumed by the compiler:
| Field | Type | Description |
|---|---|---|
| raw_source_id | str | ID of the saved Source |
| clean_text | str | Extracted text content |
| title | str | Document title |
| author | str or None | Author if detected |
| published_date | str or None | Publication date if detected |
| estimated_tokens | int | Approximate token count |
| chunks | list | Pre-chunked segments for large documents |
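Expressed as a dataclass, the shape is roughly the following (a sketch mirroring the table, not necessarily the exact class definition):

```python
from dataclasses import dataclass, field


@dataclass
class NormalizedDocument:
    raw_source_id: str                 # ID of the saved Source
    clean_text: str                    # extracted text content
    title: str
    author: str | None = None
    published_date: str | None = None
    estimated_tokens: int = 0          # approximate token count
    chunks: list[str] = field(default_factory=list)  # pre-chunked segments
```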
Deduplication¶
Sources are deduplicated by content hash (SHA-256 of the extracted text). The compute_hash utility generates the hash, and find_source_by_hash checks for existing sources before creating a new one.
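The hashing side is a one-liner over the extracted text; a sketch of the idea (the actual utility may normalize whitespace or encoding first):

```python
import hashlib


def compute_hash(clean_text: str) -> str:
    """SHA-256 over the extracted text, used as the deduplication key."""
    return hashlib.sha256(clean_text.encode("utf-8")).hexdigest()


# existing = find_source_by_hash(session, compute_hash(doc.clean_text))
# if existing is None:  create a new Source row, otherwise reuse the existing one
```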
Chunking¶
Large documents are split into semantic chunks using the chunk_text utility. Chunks:
- Target approximately 500 tokens each (configurable)
- Preserve heading boundaries where possible
- Include overlap for context continuity
The chunked segments are stored in the NormalizedDocument and used by the compiler's chunked compilation path for documents exceeding 80K tokens.
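A simplified sketch of the chunking idea: it splits on paragraph breaks (which keeps headings attached to the text that follows them) and carries an overlapping tail forward. The real `chunk_text` utility likely uses a proper tokenizer rather than the rough 4-characters-per-token estimate assumed here:

```python
def chunk_text(text: str, target_tokens: int = 500, overlap_tokens: int = 50) -> list[str]:
    """Split text into roughly target_tokens-sized chunks with overlap (sketch)."""
    target_chars = target_tokens * 4     # rough heuristic: ~4 characters per token
    overlap_chars = overlap_tokens * 4

    # Split on blank lines so headings stay attached to their sections.
    blocks = [b for b in text.split("\n\n") if b.strip()]

    chunks: list[str] = []
    current = ""
    for block in blocks:
        if current and len(current) + len(block) > target_chars:
            chunks.append(current)
            # Carry a tail of the previous chunk forward for context continuity.
            current = current[-overlap_chars:] + "\n\n" + block
        else:
            current = f"{current}\n\n{block}" if current else block
    if current:
        chunks.append(current)
    return chunks
```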