Document Commands
Document commands manage files within collections, including upload, processing, and retrieval operations. These commands handle the core content management features of the RAG system.
Overview
Document management provides:
- Multi-Format Support: PDF, DOCX, TXT, MD, and other text-based formats
- Intelligent Processing: Automatic text extraction and chunking
- Metadata Management: Rich document tagging and categorization
- Version Control: Track document updates and changes
- Batch Operations: Efficient bulk document handling
Commands Reference
rag-cli documents list
List documents within a collection.
Usage
./rag-cli documents list COLLECTION_ID [OPTIONS]
Arguments
| Argument | Description | Required |
|---|---|---|
| `COLLECTION_ID` | Collection identifier | Yes |
Options
| Option | Description | Default |
|---|---|---|
| `--format FORMAT` | Output format (table, json, csv, yaml) | table |
| `--limit LIMIT` | Maximum documents to return | 50 |
| `--offset OFFSET` | Number of documents to skip | 0 |
| `--filter FILTER` | Filter by title, type, or status | None |
| `--sort FIELD` | Sort by (title, created_at, size, status) | title |
| `--order ORDER` | Sort order (asc, desc) | asc |
| `--include-stats` | Include processing statistics | false |
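`--limit` and `--offset` can be combined to page through a collection that holds more than the default 50 documents. The sketch below only computes and prints one command per page, so it runs without a server; the helper name `page_commands` and the collection size of 120 are illustrative assumptions.

```shell
#!/bin/bash
# Hypothetical helper: print one `documents list` command per page.
# Arguments: collection ID, total document count, page size (--limit).
page_commands() {
  local collection="$1" total="$2" limit="$3" offset=0
  while [ "$offset" -lt "$total" ]; do
    echo "./rag-cli documents list $collection --limit $limit --offset $offset"
    offset=$((offset + limit))
  done
}

# 120 documents at 50 per page gives offsets 0, 50, and 100.
page_commands col_123abc 120 50
```

In a real script each printed command would be executed and its output appended to a combined result.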
Examples
Basic listing:
./rag-cli documents list col_123abc
Filtered by file type:
./rag-cli documents list col_123abc --filter pdf
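With processing statistics:
./rag-cli documents list col_123abc --include-stats
Sorted by upload date:
./rag-cli documents list col_123abc --sort created_at --order desc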
Expected Output
Table format:
┌────────────┬────────────────────────┬──────┬────────┬────────────┬─────────────────────┐
│ ID         │ Title                  │ Type │ Size   │ Status     │ Uploaded            │
├────────────┼────────────────────────┼──────┼────────┼────────────┼─────────────────────┤
│ doc_abc123 │ ML Research Paper      │ PDF  │ 2.3 MB │ Processed  │ 2024-01-15 09:30:00 │
│ doc_def456 │ Technical Requirements │ DOCX │ 456 KB │ Processing │ 2024-01-15 10:15:00 │
│ doc_ghi789 │ Quick Notes            │ TXT  │ 12 KB  │ Processed  │ 2024-01-14 16:45:00 │
└────────────┴────────────────────────┴──────┴────────┴────────────┴─────────────────────┘
Total: 3 documents (2.8 MB)
JSON with statistics:
{
  "documents": [
    {
      "id": "doc_abc123",
      "title": "ML Research Paper",
      "filename": "ml-research-2024.pdf",
      "file_type": "pdf",
      "size_bytes": 2415616,
      "status": "processed",
      "uploaded_at": "2024-01-15T09:30:00Z",
      "processed_at": "2024-01-15T09:32:15Z",
      "chunk_count": 47,
      "page_count": 12,
      "processing_time_seconds": 135,
      "metadata": {
        "author": "Dr. Jane Smith",
        "subject": "Machine Learning",
        "creation_date": "2024-01-10"
      }
    }
  ],
  "total": 3,
  "total_size_bytes": 2927616
}
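The raw fields in this JSON line up with the friendlier values in the table view. As a sketch, the conversions can be reproduced in plain Bash. The helper names are hypothetical, and binary units (1 MB = 1,048,576 bytes) are an assumption that happens to match the sample, where 2,415,616 bytes is shown as 2.3 MB.

```shell
#!/bin/bash
# Hypothetical helper: bytes to megabytes with one decimal, rounded to nearest.
# Assumes 1 MB = 1,048,576 bytes.
human_mb() {
  local tenths=$(( ($1 * 10 + 524288) / 1048576 ))
  echo "$((tenths / 10)).$((tenths % 10)) MB"
}

# Hypothetical helper: seconds to "Xm Ys".
human_duration() {
  echo "$(( $1 / 60 ))m $(( $1 % 60 ))s"
}

human_mb 2415616      # size_bytes of doc_abc123: prints "2.3 MB"
human_mb 2927616      # total_size_bytes: prints "2.8 MB"
human_duration 135    # processing_time_seconds: prints "2m 15s"
```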
rag-cli documents upload
Upload one or more documents to a collection.
Usage
./rag-cli documents upload COLLECTION_ID FILE_PATH [FILE_PATH...] [OPTIONS]
Arguments
| Argument | Description | Required |
|---|---|---|
| `COLLECTION_ID` | Collection identifier | Yes |
| `FILE_PATH` | Path to file(s) to upload | Yes |
Options
| Option | Description | Default |
|---|---|---|
| `--title TITLE` | Custom document title | Filename |
| `--description DESC` | Document description | Empty |
| `--tags TAGS` | Comma-separated tags | None |
| `--metadata KEY=VALUE` | Custom metadata pairs | None |
| `--auto-title` | Extract title from document content | false |
| `--wait` | Wait for processing to complete | false |
| `--batch-size SIZE` | Number of files to upload concurrently | 5 |
| `--recursive` | Upload files recursively from directories | false |
| `--pattern PATTERN` | File pattern filter (e.g., `*.pdf`) | `*` |
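Because `--metadata` takes one KEY=VALUE pair per flag and `--tags` takes a single comma-separated value, scripts often have to assemble these arguments from lists. A minimal sketch, assuming Bash arrays; `build_upload_args` and `UPLOAD_ARGS` are hypothetical names, and the final command is printed rather than run.

```shell
#!/bin/bash
# Hypothetical helper: fill the global array UPLOAD_ARGS with one --tags flag
# (already comma-joined) and one --metadata flag per KEY=VALUE pair.
build_upload_args() {
  local tags_csv="$1"; shift
  UPLOAD_ARGS=()
  if [ -n "$tags_csv" ]; then
    UPLOAD_ARGS+=(--tags "$tags_csv")
  fi
  local pair
  for pair in "$@"; do
    UPLOAD_ARGS+=(--metadata "$pair")
  done
}

tags=(research quarterly)
tags_csv=$(IFS=,; echo "${tags[*]}")   # join with commas: research,quarterly

build_upload_args "$tags_csv" "department=engineering" "quarter=Q1-2024"

# Print the command that would be run (dry run).
echo ./rag-cli documents upload col_123abc report.pdf "${UPLOAD_ARGS[@]}"
```

Keeping the flags in an array preserves values that contain spaces when the command is eventually executed with `"${UPLOAD_ARGS[@]}"`.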
Examples
Single document upload:
./rag-cli documents upload col_123abc report.pdf
Multiple documents with custom metadata:
./rag-cli documents upload col_123abc report.pdf presentation.pptx \
--tags "research,quarterly" \
--metadata "department=engineering" \
--metadata "quarter=Q1-2024"
Bulk upload from directory:
./rag-cli documents upload col_123abc ./documents/ \
--recursive \
--pattern "*.pdf" \
--auto-title \
--wait
Upload with a custom title and description:
./rag-cli documents upload col_123abc manual.pdf \
--title "User Manual v2.1" \
--description "Updated user manual with new features" \
--tags "manual,user-guide,v2.1" \
--wait
Expected Output
Single file upload:
📤 Uploading document...
File: report.pdf
Size: 2.3 MB
Collection: Knowledge Base (col_123abc)
✅ Upload successful!
Document ID: doc_abc123def
Title: report.pdf
Status: Processing
Estimated processing time: 2-3 minutes
Monitor progress: ./rag-cli documents get col_123abc doc_abc123def
Batch upload with progress:
📤 Uploading 5 documents to Knowledge Base...
[████████████████████████████████████████] 100% (5/5)
✅ Batch upload completed!
Successfully uploaded:
- report.pdf → doc_abc123 (Processing)
- slides.pptx → doc_def456 (Processing)
- notes.txt → doc_ghi789 (Processing)
- manual.pdf → doc_jkl012 (Processing)
- readme.md → doc_mno345 (Processing)
Total: 5 documents (12.7 MB)
Processing: All documents are being processed in the background.
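Without `--wait`, a script has to poll until background processing finishes. The loop below sketches that pattern. `get_status` is a stand-in that simulates two `processing` responses followed by `processed`, so the example runs without a server; in a real script it might wrap `./rag-cli documents get ... --format json` and read a `status` field, which is an assumption about the `get` output based on the `list` JSON.

```shell
#!/bin/bash
# Stand-in for a real status lookup (hypothetical): sets STATUS to
# "processing" for the first two calls, then to "processed".
calls=0
get_status() {
  calls=$((calls + 1))
  if [ "$calls" -lt 3 ]; then STATUS=processing; else STATUS=processed; fi
}

# Poll until the document leaves the "processing" state or attempts run out.
# Arguments: maximum attempts, seconds to sleep between attempts.
wait_while_processing() {
  local max_attempts="$1" delay="$2" attempt=1
  while [ "$attempt" -le "$max_attempts" ]; do
    get_status
    echo "attempt $attempt: $STATUS"
    if [ "$STATUS" != "processing" ]; then
      return 0
    fi
    sleep "$delay"
    attempt=$((attempt + 1))
  done
  return 1   # still processing after max_attempts
}

# A delay of 0 keeps the demo instant; a real script would wait a few seconds.
wait_while_processing 10 0 && echo "final status: $STATUS"
```

The function returns success for any terminal state, so the caller should still inspect `STATUS` to distinguish a processed document from a failed one.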
Upload with wait flag:
📤 Uploading and processing document...
File: research-paper.pdf
Size: 3.1 MB
Collection: Research Papers (col_research)
[████████████████████████████████████████] Upload: 100%
[████████████████████████████████████████] Processing: 100%
✅ Document processed successfully!
Document ID: doc_research123
Title: AI in Healthcare: A Comprehensive Review
Pages: 24
Chunks: 89
Processing time: 3m 42s
Ready for search: ./rag-cli search query col_research "AI healthcare applications"
rag-cli documents get
Get detailed information about a specific document.
Usage
./rag-cli documents get COLLECTION_ID DOCUMENT_ID [OPTIONS]
Arguments
| Argument | Description | Required |
|---|---|---|
| `COLLECTION_ID` | Collection identifier | Yes |
| `DOCUMENT_ID` | Document identifier | Yes |
Options
| Option | Description | Default |
|---|---|---|
| `--format FORMAT` | Output format (table, json, yaml) | table |
| `--include-content` | Include extracted text content | false |
| `--include-chunks` | Include chunk information | false |
| `--include-metadata` | Include all metadata fields | false |
Examples
Basic document info:
./rag-cli documents get col_123abc doc_abc123
Detailed information:
./rag-cli documents get col_123abc doc_abc123 \
--include-chunks \
--include-metadata \
--format json
Content preview:
./rag-cli documents get col_123abc doc_abc123 --include-content
Expected Output
Basic information:
Document Details
ID: doc_abc123
Title: ML Research Paper
Filename: ml-research-2024.pdf
Collection: Knowledge Base (col_123abc)
File Information
Type: PDF
Size: 2.3 MB (2,415,616 bytes)
Pages: 12
Status: ✅ Processed
📅 Timeline
Uploaded: 2024-01-15 09:30:00
Processed: 2024-01-15 09:32:15
Processing time: 2m 15s
Search Data
Chunks: 47
Average chunk size: 418 tokens
Ready for search: ✓
Detailed with chunks:
Document Details
[... basic info ...]
Content Structure
Total chunks: 47
Chunk distribution:
- Introduction: 3 chunks
- Methodology: 12 chunks
- Results: 18 chunks
- Discussion: 9 chunks
- Conclusion: 3 chunks
- References: 2 chunks
🏷️ Metadata
Author: Dr. Jane Smith
Subject: Machine Learning
Keywords: artificial intelligence, neural networks, deep learning
Creation date: 2024-01-10
Document version: 1.2
Tags: research, quarterly, ml
Processing Statistics
Text extraction time: 45s
Chunking time: 23s
Embedding generation: 87s
Index update time: 20s
rag-cli documents download
Download a document from a collection.
Usage
./rag-cli documents download COLLECTION_ID DOCUMENT_ID [OPTIONS]
Arguments
| Argument | Description | Required |
|---|---|---|
| `COLLECTION_ID` | Collection identifier | Yes |
| `DOCUMENT_ID` | Document identifier | Yes |
Options
| Option | Description | Default |
|---|---|---|
| `--output PATH` | Output file path | Original filename |
| `--format FORMAT` | Download format (original, text, json) | original |
| `--include-metadata` | Include metadata in output | false |
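Examples
Download original file:
./rag-cli documents download col_123abc doc_abc123
Download to specific location:
./rag-cli documents download col_123abc doc_abc123 --output ./downloads/ml-research-2024.pdf
Download extracted text:
./rag-cli documents download col_123abc doc_abc123 --format text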
Download with metadata:
./rag-cli documents download col_123abc doc_abc123 \
--format json \
--include-metadata \
--output document-export.json
Expected Output
Successful download:
📥 Downloading document...
Document: ML Research Paper (doc_abc123)
Source: col_123abc
Format: Original PDF
✅ Download completed!
File: ml-research-2024.pdf
Size: 2.3 MB
Location: ./ml-research-2024.pdf
Text extraction download:
📥 Extracting and downloading text...
Document: ML Research Paper (doc_abc123)
Format: Plain text
Pages: 12 → Text file
✅ Download completed!
File: ml-research-2024.txt
Size: 156 KB
Location: ./ml-research-2024.txt
Chunks: 47 text segments included
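In the sample output, the original download keeps `ml-research-2024.pdf` while the text download becomes `ml-research-2024.txt`. The sketch below captures that naming rule for scripts that need to predict which file to look for when `--output` is omitted. The helper name is hypothetical, and the `.json` case is an assumption, since no JSON download output is shown.

```shell
#!/bin/bash
# Hypothetical helper: predict the default output filename for a download.
# Arguments: original filename, download format (original, text, json).
default_output_name() {
  local filename="$1" format="$2"
  local stem="${filename%.*}"          # strip the last extension
  case "$format" in
    original) echo "$filename" ;;
    text)     echo "$stem.txt" ;;
    json)     echo "$stem.json" ;;     # assumed, not shown in the sample output
    *)        echo "unknown format: $format" >&2; return 1 ;;
  esac
}

default_output_name ml-research-2024.pdf original   # prints ml-research-2024.pdf
default_output_name ml-research-2024.pdf text       # prints ml-research-2024.txt
```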
rag-cli documents delete
Delete a document from a collection.
Usage
./rag-cli documents delete COLLECTION_ID DOCUMENT_ID [OPTIONS]
Arguments
| Argument | Description | Required |
|---|---|---|
| `COLLECTION_ID` | Collection identifier | Yes |
| `DOCUMENT_ID` | Document identifier | Yes |
Options
| Option | Description | Default |
|---|---|---|
| `--force` | Skip confirmation prompt | false |
| `--backup` | Create backup before deletion | false |
| `--backup-path PATH` | Custom backup location | `./document-backup/` |
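Examples
Interactive deletion:
./rag-cli documents delete col_123abc doc_abc123
Force delete without confirmation:
./rag-cli documents delete col_123abc doc_abc123 --force
Delete with backup:
./rag-cli documents delete col_123abc doc_abc123 --backup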
Expected Output
Interactive deletion:
⚠️ Delete Document Confirmation
Document: ML Research Paper (doc_abc123)
Collection: Knowledge Base (col_123abc)
File: ml-research-2024.pdf (2.3 MB)
Chunks: 47 text segments
Uploaded: 2024-01-15 09:30:00
This action cannot be undone!
Document will be removed from search index.
Are you sure you want to delete this document? (y/N): y
✅ Document deleted successfully!
Removed:
- Original file (2.3 MB)
- 47 text chunks
- Search index entries
- Metadata records
Collection updated: 14 documents remaining
rag-cli documents reprocess
Reprocess a document with updated settings.
Usage
./rag-cli documents reprocess COLLECTION_ID DOCUMENT_ID [OPTIONS]
Arguments
| Argument | Description | Required |
|---|---|---|
| `COLLECTION_ID` | Collection identifier | Yes |
| `DOCUMENT_ID` | Document identifier | Yes |
Options
| Option | Description | Default |
|---|---|---|
| `--wait` | Wait for reprocessing to complete | false |
| `--force` | Force reprocessing even if not needed | false |
| `--chunk-size SIZE` | Override default chunk size | Collection default |
| `--chunk-overlap OVERLAP` | Override chunk overlap | Collection default |
Examples
Basic reprocessing:
./rag-cli documents reprocess col_123abc doc_abc123
Reprocess with custom settings:
./rag-cli documents reprocess col_123abc doc_abc123 \
--chunk-size 1024 \
--chunk-overlap 100 \
--wait
Force reprocessing:
./rag-cli documents reprocess col_123abc doc_abc123 --force
Expected Output
Reprocessing initiated:
🔄 Initiating document reprocessing...
Document: ML Research Paper (doc_abc123)
Collection: Knowledge Base (col_123abc)
Reason: Collection chunk size updated
Current chunks: 47 (512 tokens each)
New chunk size: 1024 tokens
Estimated new chunks: ~24
✅ Reprocessing started!
Status: Processing
Estimated time: 2-3 minutes
Monitor progress: ./rag-cli documents get col_123abc doc_abc123
Reprocessing completed (with --wait):
🔄 Reprocessing document...
[████████████████████████████████████████] 100%
✅ Reprocessing completed!
Document: ML Research Paper (doc_abc123)
Processing time: 1m 34s
Changes:
- Chunks: 47 → 24 (-49%)
- Average chunk size: 418 → 892 tokens
- Index updated: ✓
Document is ready for search with updated chunking.
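The figures in these sample outputs hang together arithmetically: scaling 47 chunks by the ratio of old to new chunk size (512 to 1024 tokens) and rounding up gives 24, and going from 47 to 24 chunks is a change of about -49%. Whether the tool computes its estimate this way is an assumption; the helper names below are hypothetical and chunk overlap is ignored.

```shell
#!/bin/bash
# Hypothetical helper: estimate the chunk count after a chunk-size change.
# Arguments: current chunk count, current chunk size, new chunk size (tokens).
estimate_chunks() {
  local chunks="$1" old_size="$2" new_size="$3"
  # Integer ceiling of chunks * old_size / new_size.
  echo $(( (chunks * old_size + new_size - 1) / new_size ))
}

# Hypothetical helper: percentage change between two counts, rounded.
percent_change() {
  awk -v old="$1" -v new="$2" 'BEGIN { printf "%.0f%%\n", (new - old) * 100 / old }'
}

estimate_chunks 47 512 1024   # prints 24
percent_change 47 24          # prints -49%
```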
Advanced Usage
Batch Document Operations
Upload entire directory structure:
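#!/bin/bash
# Upload every PDF, DOCX, and TXT file found under ./documents.
# -print0 with read -d '' keeps filenames containing spaces or newlines intact.
find ./documents -type f \( -name "*.pdf" -o -name "*.docx" -o -name "*.txt" \) -print0 |
  while IFS= read -r -d '' file; do
    echo "Uploading: $file"
    ./rag-cli documents upload col_123abc "$file" \
      --auto-title \
      --tags "batch-upload,$(date +%Y-%m)" \
      --metadata "source=directory-import"
  done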
Mass document reprocessing:
#!/bin/bash
# Reprocess all documents after collection settings change.
# Note: list returns at most 50 documents by default; raise --limit or page
# with --offset for larger collections.
./rag-cli documents list col_123abc --format json |
  jq -r '.documents[] | select(.status == "processed") | .id' |
  while IFS= read -r doc_id; do
    echo "Reprocessing: $doc_id"
    ./rag-cli documents reprocess col_123abc "$doc_id"
  done
Document health check:
#!/bin/bash
echo "Document Health Check"
echo "====================="
# Fetch the document list once so every check sees the same snapshot
docs=$(./rag-cli documents list col_123abc --format json)
# Check for failed documents
failed=$(echo "$docs" | jq '[.documents[] | select(.status == "failed")] | length')
echo "Failed documents: $failed"
# Check for stuck processing
processing=$(echo "$docs" | jq '[.documents[] | select(.status == "processing")] | length')
echo "Processing documents: $processing"
# List documents needing attention
if [ "$failed" -gt 0 ] || [ "$processing" -gt 0 ]; then
  echo ""
  echo "Documents needing attention:"
  echo "$docs" |
    jq -r '.documents[] | select(.status == "failed" or .status == "processing") | [.id, .title, .status] | @tsv' |
    while IFS=$'\t' read -r id title status; do
      echo "  - $title ($id): $status"
    done
fi
Document Metadata Management
Bulk metadata updates:
#!/bin/bash
# Add department metadata to all documents.
# Note: relies on a `documents update` command that is not covered in this reference.
./rag-cli documents list col_123abc --format json |
  jq -r '.documents[].id' |
  while IFS= read -r doc_id; do
    ./rag-cli documents update col_123abc "$doc_id" \
      --metadata "department=engineering" \
      --metadata "reviewed=2024-01"
  done
Extract and export metadata:
#!/bin/bash
echo "Document Metadata Report"
echo "========================"
# @csv accepts only scalar values, so a tags array is joined into one field first
./rag-cli documents list col_123abc --format json --include-stats |
  jq -r '.documents[] | [.title, .file_type, .size_bytes, .chunk_count, (.tags | if type == "array" then join(";") else . end)] | @csv' > document-report.csv
echo "Report exported to: document-report.csv"
Content Analysis
Document statistics dashboard:
#!/bin/bash
# Requires Bash 4+ (for ${var^^} and ${var^}) and GNU numfmt
collection_id="col_123abc"
echo "Document Statistics for Collection: $collection_id"
echo "=================================================="
# Get the document list with statistics
info=$(./rag-cli documents list "$collection_id" --format json --include-stats)
# Total documents
total=$(echo "$info" | jq '.total')
echo "Total Documents: $total"
# File type distribution
echo ""
echo "File Types:"
echo "$info" | jq -r '.documents | group_by(.file_type) | .[] | [.[0].file_type, length] | @tsv' |
  while IFS=$'\t' read -r type count; do
    echo "  - ${type^^}: $count documents"
  done
# Size statistics (guarded so an empty collection yields 0 instead of an error)
echo ""
echo "Size Statistics:"
total_size=$(echo "$info" | jq '[.documents[].size_bytes] | add // 0')
avg_size=$(echo "$info" | jq '[.documents[].size_bytes] | if length > 0 then add / length | floor else 0 end')
echo "  - Total: $(numfmt --to=iec "$total_size")"
echo "  - Average: $(numfmt --to=iec "$avg_size")"
# Processing status
echo ""
echo "Processing Status:"
echo "$info" | jq -r '.documents | group_by(.status) | .[] | [.[0].status, length] | @tsv' |
  while IFS=$'\t' read -r status count; do
    echo "  - ${status^}: $count documents"
  done
Error Handling
Common Error Scenarios
Document Not Found
$ ./rag-cli documents get col_123abc invalid-doc-id
❌ Document not found
Document 'invalid-doc-id' does not exist in collection 'col_123abc'.
List available documents:
./rag-cli documents list col_123abc
Upload Failed - Unsupported Format
$ ./rag-cli documents upload col_123abc image.jpg
❌ Upload failed
File 'image.jpg' has unsupported format: JPEG
Supported formats: PDF, DOCX, DOC, TXT, MD, RTF, ODT
Convert to supported format or use text extraction tool first.
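A batch script can avoid this error by checking file extensions before uploading. The sketch below takes its list from the "Supported formats" line in the error message; the real list may be broader, since the upload examples also include `.pptx` files. The helper name `is_supported` is hypothetical.

```shell
#!/bin/bash
# Extensions taken from the "Supported formats" error message.
# Assumption: the actual list accepted by the server may differ.
SUPPORTED="pdf docx doc txt md rtf odt"

# Hypothetical helper: succeed if the file's extension is in SUPPORTED.
is_supported() {
  local ext="${1##*.}"
  ext=$(printf '%s' "$ext" | tr '[:upper:]' '[:lower:]')
  case " $SUPPORTED " in
    *" $ext "*) return 0 ;;
    *)          return 1 ;;
  esac
}

for f in report.PDF notes.md image.jpg; do
  if is_supported "$f"; then
    echo "$f: ok"
  else
    echo "$f: skipped (unsupported format)"
  fi
done
```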
Processing Failed
$ ./rag-cli documents get col_123abc doc_failed123
Document Details
ID: doc_failed123
Status: ❌ Processing Failed
Error: Text extraction failed - corrupted PDF
Retry options:
1. Re-upload original file: ./rag-cli documents upload col_123abc original-file.pdf
2. Force reprocess: ./rag-cli documents reprocess col_123abc doc_failed123 --force
3. Delete and retry: ./rag-cli documents delete col_123abc doc_failed123
Storage Quota Exceeded
$ ./rag-cli documents upload col_123abc large-file.pdf
❌ Upload failed
Collection storage quota exceeded.
Current usage: 4.8 GB / 5.0 GB limit
File size: 245 MB
Options:
1. Delete unused documents to free space
2. Contact administrator to increase quota
3. Split large file into smaller documents
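The rejection follows from simple arithmetic: 4.8 GB already used plus a 245 MB file is more than the 5.0 GB limit. The same check can be sketched as a pre-flight test, working in whole megabytes and assuming 1 GB = 1024 MB. The helper name is hypothetical, and how a script would obtain current usage is not covered by this reference.

```shell
#!/bin/bash
# Hypothetical helper: succeed if a file fits within the remaining quota.
# Arguments (all in MB): current usage, file size, quota limit.
fits_quota() {
  local used="$1" size="$2" limit="$3"
  [ $((used + size)) -le "$limit" ]
}

used_mb=4915     # 4.8 GB, rounded down
limit_mb=5120    # 5.0 GB
file_mb=245

if fits_quota "$used_mb" "$file_mb" "$limit_mb"; then
  echo "upload fits"
else
  echo "over quota by $((used_mb + file_mb - limit_mb)) MB"
fi
```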
Debugging Document Issues
Enable debug mode:
Check processing logs:
Validate document integrity:
Integration Examples
CI/CD Document Updates
#!/bin/bash
# Automated documentation update script
collection_id="col_docs"
docs_dir="./updated-docs"
echo "Updating documentation collection..."
# Upload new/updated documents
for file in "$docs_dir"/*.md; do
  if [ -f "$file" ]; then
    title=$(basename "$file" .md)
    # Look up an existing document with this title (first match, empty if none)
    doc_id=$(./rag-cli documents list "$collection_id" --filter "$title" --format json | jq -r '.documents[0].id // empty')
    if [ -n "$doc_id" ]; then
      echo "Updating existing document: $title"
      ./rag-cli documents delete "$collection_id" "$doc_id" --force
    fi
    echo "Uploading: $title"
    ./rag-cli documents upload "$collection_id" "$file" \
      --title "$title" \
      --tags "documentation,auto-updated" \
      --metadata "version=$(git rev-parse --short HEAD)" \
      --metadata "updated=$(date -Iseconds)"
  fi
done
echo "✅ Documentation update completed"
Document Backup System
#!/bin/bash
# Complete document backup script
collection_id="${1:?Usage: $0 COLLECTION_ID}"
backup_dir="./backups/$(date +%Y%m%d_%H%M%S)"
echo "📦 Creating document backup for collection: $collection_id"
mkdir -p "$backup_dir"
# Export document metadata
./rag-cli documents list "$collection_id" --format json --include-stats > "$backup_dir/documents.json"
# Download all documents, reusing the exported list
jq -r '.documents[] | [.id, .title, .filename] | @tsv' "$backup_dir/documents.json" |
  while IFS=$'\t' read -r doc_id title filename; do
    echo "Backing up: $title"
    ./rag-cli documents download "$collection_id" "$doc_id" \
      --output "$backup_dir/$filename"
  done
echo "✅ Backup completed: $backup_dir"
tar -czf "$backup_dir.tar.gz" -C "$(dirname "$backup_dir")" "$(basename "$backup_dir")"
echo "📦 Archive created: $backup_dir.tar.gz"
Next Steps
After mastering document management:
1. Search Commands - Query your document collections effectively
2. Collection Management - Advanced collection configuration
3. Configuration - Optimize document processing settings
4. Troubleshooting - Resolve document processing issues